- `api_key`: YouTube Data API key.
- `base_dir`: Base directory for storing data (default: "infant-video-data-scraper").
- `download_dir`: Directory for downloaded videos (default: "downloaded_videos").
- `processed_dir`: Directory for YOLO-processed videos (default: "yolo_processed_videos").
- Initialize the YouTube Data API client using the provided `api_key`.
- Construct paths for `youtube_research_data`, `downloaded_videos`, and `yolo_processed_videos` within `base_dir`.
- Create these directories if they do not already exist to organize data systematically.
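The directory setup above can be sketched with `pathlib`. This is a minimal illustration rather than the project's actual code: the function name `ensure_data_dirs` and the dictionary keys are hypothetical, while the directory names and defaults come from the parameter list.

```python
from pathlib import Path


def ensure_data_dirs(base_dir="infant-video-data-scraper",
                     download_dir="downloaded_videos",
                     processed_dir="yolo_processed_videos"):
    """Create the three working directories under base_dir and return their paths."""
    base = Path(base_dir)
    dirs = {
        "data": base / "youtube_research_data",
        "downloads": base / download_dir,
        "processed": base / processed_dir,
    }
    for path in dirs.values():
        # parents=True creates base_dir when needed; exist_ok=True makes reruns safe.
        path.mkdir(parents=True, exist_ok=True)
    return dirs
```

Calling the function twice is harmless, which matches the "create if they do not already exist" requirement.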
- `failed_queries`: List to record queries that encounter errors.
- `failed_downloads`: List to record downloads that fail.
- `processed_videos`: List to store paths of processed videos.
- `downloaded_video_ids`: Set to keep track of video IDs that have been downloaded, preventing duplicates.
- Initialize the YOLOv8 model (`yolov8n.pt`, the Nano variant) for object detection tasks.
- `queries`: List of search terms (e.g., "baby laughing compilation", "Asian baby smiling videos").
- `max_results_per_query`: Maximum number of videos to fetch per query (default: 25).
- `target_total_videos`: Total number of unique videos to download (default: 500).
- Scan existing CSV files in `youtube_research_data` to populate `downloaded_video_ids`, ensuring that previously fetched videos are not downloaded again.
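One way to rebuild the ID set from earlier runs is to read every CSV in the data directory with the standard `csv` module. The sketch below assumes the CSVs store the ID in a column named `video_id`; that column name and the function name `load_downloaded_ids` are assumptions, since the outline does not specify them.

```python
import csv
from pathlib import Path


def load_downloaded_ids(data_dir, id_column="video_id"):
    """Collect video IDs from every CSV in data_dir (id_column is an assumed name)."""
    seen = set()
    for csv_path in sorted(Path(data_dir).glob("*.csv")):
        with open(csv_path, newline="", encoding="utf-8") as handle:
            for row in csv.DictReader(handle):
                # Files without the column, or rows with an empty value, are ignored.
                video_id = (row.get(id_column) or "").strip()
                if video_id:
                    seen.add(video_id)
    return seen
```

Because the result is a set, IDs that appear in several CSV files are counted once.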
For each query in queries:
- Check Target Completion:
  - If the number of accumulated videos (`all_videos`) reaches `target_total_videos`, terminate the search process.
- Execute Search:
  - Use the YouTube Data API to search for videos matching the query, retrieving up to `max_results_per_query` results.
  - Handle API rate limits by implementing a sleep interval between requests.
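A single search request might look like the sketch below. It assumes `youtube` is a client created with `googleapiclient.discovery.build("youtube", "v3", developerKey=api_key)`; the function name `search_videos` and the `pause_s` parameter are hypothetical, and a one-second pause is only a placeholder for whatever interval the scraper actually uses.

```python
import time


def search_videos(youtube, query, max_results=25, pause_s=1.0):
    """Run one search.list request, pause briefly, and return the result items."""
    response = youtube.search().list(
        part="snippet",
        q=query,
        type="video",            # restrict results to videos (no channels or playlists)
        maxResults=max_results,
    ).execute()
    time.sleep(pause_s)          # simple fixed delay between consecutive requests
    return response.get("items", [])
```

Passing the client in as an argument keeps the function easy to exercise with a stub object instead of a live API key.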
For each video item in the search response:
- Extract Video ID:
  - Retrieve `video_id` from the search result.
- Duplicate Check:
  - If `video_id` is already in `downloaded_video_ids`, skip to the next video.
- Fetch Video Details:
  - Retrieve video statistics and content details (e.g., view count, duration) using the YouTube Data API.
- Compile Video Metadata:
  - Create a `video_data` dictionary containing metadata such as title, description, publication date, channel information, and video URL.
- Accumulate Video Data:
  - Append `video_data` to `all_videos` and add `video_id` to `downloaded_video_ids`.
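The per-item steps above can be sketched as a pure function over the `items` list of a search response. The keys of `video_data` and the function name `accumulate_videos` are illustrative assumptions; the lookups (`id.videoId`, `snippet.title`, and so on) follow the layout of YouTube Data API search results. The separate call that fetches statistics and content details is left out to keep the example short.

```python
def accumulate_videos(items, all_videos, downloaded_video_ids, target_total_videos=500):
    """Append unseen videos from one search response; return True once the target is met."""
    for item in items:
        if len(all_videos) >= target_total_videos:
            return True
        video_id = item.get("id", {}).get("videoId")
        if not video_id or video_id in downloaded_video_ids:
            continue  # skip malformed items and duplicates
        snippet = item.get("snippet", {})
        video_data = {
            "video_id": video_id,
            "title": snippet.get("title", ""),
            "description": snippet.get("description", ""),
            "published_at": snippet.get("publishedAt", ""),
            "channel_id": snippet.get("channelId", ""),
            "channel_title": snippet.get("channelTitle", ""),
            "video_url": f"https://www.youtube.com/watch?v={video_id}",
        }
        all_videos.append(video_data)
        downloaded_video_ids.add(video_id)
    return len(all_videos) >= target_total_videos
```

The boolean return value lets the outer query loop stop as soon as `target_total_videos` is reached.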
- `all_videos`: List of video metadata dictionaries to be downloaded.
- `download`: Boolean flag to enable/disable downloading (default: True).
- `resolution`: Desired video resolution for download (default: "720p").
- If `download` is `True`, proceed with downloading videos.
For each video in all_videos:
- Extract Video Information:
  - Retrieve `video_url`, `title`, and `video_id` from the video.
- Download Video:
  - Utilize `pytubefix` to download the video in the specified resolution.
  - If the desired resolution isn't available, download the highest available resolution.
  - Sanitize the video title to create a valid filename.
  - Save the downloaded video in `download_dir` with the format: "sanitized_title (video_id).mp4".
- Update Tracking:
  - Add `video_id` to `downloaded_video_ids` upon successful download.
- Handle Download Failures:
  - If downloading fails, record the `video_id`, `video_url`, and error message in `failed_downloads`.
- Progress Tracking:
  - Utilize `tqdm` to display a progress bar, providing real-time feedback on the download status.
- Rate Limiting:
  - Implement a short sleep interval between downloads to respect YouTube's rate limits.
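The filename and fallback logic above can be sketched as follows. The helper names (`sanitize_title`, `build_output_name`, `download_video`), the set of replaced characters, and the 100-character limit are assumptions made for illustration. The `pytubefix` calls (`YouTube`, `streams.filter`, `get_highest_resolution`, `download`) mirror the pytube-style API as I understand it; the import is deferred so the filename helpers work without the package installed.

```python
import re


def sanitize_title(title, max_length=100):
    """Replace characters that are invalid on common filesystems and trim the result."""
    cleaned = re.sub(r'[\\/*?:"<>|]', "_", title)
    cleaned = re.sub(r"\s+", " ", cleaned)
    return cleaned[:max_length].strip(" .") or "untitled"


def build_output_name(title, video_id):
    """Return a filename in the 'sanitized_title (video_id).mp4' format."""
    return f"{sanitize_title(title)} ({video_id}).mp4"


def download_video(video, download_dir, resolution="720p", failed_downloads=None):
    """Download one video; on failure, record it and return None instead of raising."""
    from pytubefix import YouTube  # third-party; imported lazily

    try:
        yt = YouTube(video["video_url"])
        stream = yt.streams.filter(
            progressive=True, file_extension="mp4", res=resolution
        ).first()
        if stream is None:
            # Requested resolution unavailable: fall back to the best progressive stream.
            stream = yt.streams.get_highest_resolution()
        return stream.download(
            output_path=str(download_dir),
            filename=build_output_name(video["title"], video["video_id"]),
        )
    except Exception as exc:
        if failed_downloads is not None:
            failed_downloads.append({
                "video_id": video["video_id"],
                "video_url": video["video_url"],
                "error": str(exc),
            })
        return None
```

In the download loop, `all_videos` could be wrapped in `tqdm(...)` for the progress bar, with a `time.sleep(...)` call after each `download_video` for rate limiting.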
- `downloaded_path`: File path of the downloaded video.
- `video_id`: Unique identifier of the video.
- Open the downloaded video file using OpenCV's `VideoCapture`.
- If the video fails to open, log the error and skip processing.
- Obtain video dimensions (width, height) and frame rate (fps).
- Define the codec and initialize `VideoWriter` to save the processed video in `processed_dir` with the filename prefixed by "processed_".
While the video has frames:
- Read Frame:
  - Capture the current frame from the video.
- Object Detection:
  - Use the YOLOv8 model to detect objects within the frame.
  - Specifically, identify detections belonging to the person class (assumed to be class ID 0).
- Bounding Box Extraction:
  - If a person is detected:
    - Extract the bounding box coordinates (`x1`, `y1`, `x2`, `y2`).
    - Crop the frame to the bounding box, isolating the baby.
  - If no person is detected:
    - Optionally, write the original frame or skip.
- Write Processed Frame:
  - Save the cropped (or original) frame to the processed video file.
- Release both `VideoCapture` and `VideoWriter` objects.
- Log the successful saving of the processed video in `processed_videos`.
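The open, detect, crop, write, and release steps can be sketched as below. This is a rough illustration under several assumptions: `model` is an Ultralytics `YOLO("yolov8n.pt")` instance, the first returned person box is taken to be the baby, the `mp4v` codec is used, and a 30 fps fallback applies when the source reports no frame rate. The names `clamp_box` and `process_video` are hypothetical. Because a `VideoWriter` is opened with one fixed frame size, the sketch resizes each crop back to the source dimensions, which is a design choice not stated in the outline.

```python
def clamp_box(box, width, height):
    """Round an (x1, y1, x2, y2) box to ints inside the frame; return None if empty."""
    x1, y1, x2, y2 = (int(round(v)) for v in box)
    x1, x2 = max(0, min(x1, width)), max(0, min(x2, width))
    y1, y2 = max(0, min(y1, height)), max(0, min(y2, height))
    if x2 <= x1 or y2 <= y1:
        return None
    return x1, y1, x2, y2


def process_video(downloaded_path, output_path, model):
    """Crop each frame to the first detected person and write it at the source size."""
    import cv2  # third-party; imported lazily

    cap = cv2.VideoCapture(str(downloaded_path))
    if not cap.isOpened():
        return False
    width = int(cap.get(cv2.CAP_PROP_FRAME_WIDTH))
    height = int(cap.get(cv2.CAP_PROP_FRAME_HEIGHT))
    fps = cap.get(cv2.CAP_PROP_FPS) or 30.0  # assumed fallback frame rate
    fourcc = cv2.VideoWriter_fourcc(*"mp4v")
    writer = cv2.VideoWriter(str(output_path), fourcc, fps, (width, height))
    try:
        while True:
            ok, frame = cap.read()
            if not ok:
                break
            # classes=[0] keeps only the COCO "person" class.
            result = model(frame, classes=[0], verbose=False)[0]
            boxes = result.boxes.xyxy.tolist()
            box = clamp_box(boxes[0], width, height) if boxes else None
            if box is not None:
                x1, y1, x2, y2 = box
                frame = cv2.resize(frame[y1:y2, x1:x2], (width, height))
            writer.write(frame)  # original frame is written when nothing was detected
    finally:
        cap.release()
        writer.release()
    return True
```

Resizing the crop to the full frame changes its aspect ratio; padding the crop instead would preserve proportions at the cost of extra code.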
- If any errors occur during processing, log the `video_id` and error message for later review.
- Compile all video metadata into a Pandas `DataFrame`.
- Save the `DataFrame` as a CSV file in `youtube_research_data` with a timestamped filename (e.g., "youtube_videos_20231027_123456.csv").
- If there are entries in `failed_queries`, save them as a JSON file ("failed_queries.json") in `youtube_research_data`.
- Similarly, save `failed_downloads` as "failed_downloads.json" if any downloads failed.
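The saving steps can be sketched as below. The function names are hypothetical; the filename pattern follows the "youtube_videos_20231027_123456.csv" example, and the JSON file is written only when the list is non-empty. `pandas` is imported inside `save_metadata` so the other helpers run without it.

```python
import json
from datetime import datetime
from pathlib import Path


def timestamped_csv_name(now=None):
    """Build a name such as 'youtube_videos_20231027_123456.csv'."""
    now = now or datetime.now()
    return f"youtube_videos_{now:%Y%m%d_%H%M%S}.csv"


def save_failures(entries, data_dir, filename):
    """Write entries as JSON only when there is something to record."""
    if not entries:
        return None
    path = Path(data_dir) / filename
    path.write_text(json.dumps(entries, indent=2, ensure_ascii=False), encoding="utf-8")
    return path


def save_metadata(all_videos, data_dir):
    """Write the accumulated metadata dictionaries to a timestamped CSV."""
    import pandas as pd  # third-party; imported lazily

    path = Path(data_dir) / timestamped_csv_name()
    pd.DataFrame(all_videos).to_csv(path, index=False)
    return path
```

A call such as `save_failures(failed_downloads, data_dir, "failed_downloads.json")` leaves no empty file behind when every download succeeded.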
- `research_queries`: List of search queries targeting diverse baby videos across different ethnicities.
- Instantiate the `YouTubeResearchScraper` class with appropriate parameters, including setting `processed_dir` to "yolo_processed_videos".
- Invoke the `collect_research_data` method with the following parameters:
  - `queries`: `research_queries`.
  - `max_results_per_query`: 25 (to efficiently reach the target of 500 videos).
  - `download`: True (to enable downloading and processing).
  - `resolution`: "720p" (desired video quality).
  - `target_total_videos`: 500 (total unique videos to download).
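The relationship between the two numeric parameters can be checked with a short calculation. The helper `min_queries_needed` is hypothetical and gives only a lower bound: duplicates across queries, or queries that return fewer than 25 results, mean `research_queries` likely needs more entries than this.

```python
import math


def min_queries_needed(target_total_videos=500, max_results_per_query=25):
    """Smallest number of queries that could reach the target if every result were unique."""
    return math.ceil(target_total_videos / max_results_per_query)


print(min_queries_needed())  # 20
```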
- Observe the `tqdm` progress bars for download status.
- Review console logs for any processing or download errors.
- Check the `youtube_research_data` directory for CSV logs and JSON error logs.