Parallel Image Processing Pipeline

Team members

  • Valeria Paredes
  • Damaris Pech
  • Ana Paula Ramírez
  • Joaquín Murguia
  • Damaris Dzul

Introduction

Image processing is a crucial task in computer vision, AI, and data analysis. However, applying filters to a large dataset of images can be computationally expensive.

This project aims to accelerate image filtering using Python’s multiprocessing module. By leveraging parallel processing, we can improve efficiency and reduce execution time.

This document explains the code structure, parallelization approach, and benchmarking results comparing the serial and parallel versions.


Problem Description

This project consists of:

  1. Downloading images from a .tsv dataset containing image URLs.
  2. Applying image processing filters (sketched in code after this list), specifically:
    • Grayscale (converts the image to black and white).
    • Blur (reduces details using a Gaussian filter).
    • Edge detection (highlights contours using the Canny operator).
  3. Implementing both serial and parallel versions to process images.
  4. Comparing performance when running filters with 1, 2, 3, 4, and 6 processes.
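
As a concrete illustration of step 2, here is a minimal sketch of the three filters, assuming OpenCV (cv2) as the imaging library; the repository does not state which library it uses, and the function name apply_filters is illustrative:

import cv2

def apply_filters(image_path):
    """Illustrative only: apply grayscale, Gaussian blur, and Canny edge detection."""
    image = cv2.imread(image_path)                    # load the image as a BGR array
    gray = cv2.cvtColor(image, cv2.COLOR_BGR2GRAY)    # grayscale conversion
    blurred = cv2.GaussianBlur(image, (5, 5), 0)      # Gaussian blur (5x5 kernel)
    edges = cv2.Canny(gray, 100, 200)                 # Canny edge detection
    return gray, blurred, edges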

Code Explanation

The code is divided into two main functions:

1. Load the image dataset (load_photo_dataset)

This function searches for a photos.tsv file, extracts the necessary columns (photo_id and photo_image_url), and returns a Pandas DataFrame.

import glob
import os

import pandas as pd

def load_photo_dataset(path='../images'):
    """Load the photo dataset and extract relevant information."""
    photo_file = glob.glob(os.path.join(path, "photos.tsv*"))
    
    if not photo_file:
        print("Error: No photos.tsv file found!")
        return None
    
    df = pd.read_csv(photo_file[0], sep='\t', header=0)
    
    # Extract only necessary columns
    if 'photo_id' in df.columns and 'photo_image_url' in df.columns:
        df = df[['photo_id', 'photo_image_url']]
    else:
        print("Error: Required columns not found in photos dataset!")
        return None
    
    return df

2. Download the images (download_images)

This function downloads each image listed in the DataFrame and saves it locally. Its behavior:

  • Folder creation: the function ensures the image folder exists before downloading.
  • Iterating over the dataset: it loops through the photo_df DataFrame, reading each photo_id and photo_image_url.
  • Downloading images: each image is downloaded with requests.get() and saved locally using the photo_id as the filename.
  • Error handling: images with a missing URL are skipped, and a try-except block handles request errors so a single failure does not interrupt the run.
  • Download limit: the function stops downloading once the max_images limit (150) is reached.
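
The download_images code itself is not reproduced above; a minimal sketch consistent with the description (the exact implementation in the repository may differ) could look like this:

import os

import requests

def download_images(photo_df, folder, max_images=150):
    """Sketch: download up to max_images photos listed in photo_df into folder."""
    os.makedirs(folder, exist_ok=True)           # ensure the image folder exists
    image_paths = []
    for _, row in photo_df.iterrows():
        if len(image_paths) >= max_images:       # stop at the download limit
            break
        url = row['photo_image_url']
        if not isinstance(url, str) or not url:  # skip images with a missing URL
            continue
        path = os.path.join(folder, f"{row['photo_id']}.jpg")
        try:
            response = requests.get(url, timeout=10)
            response.raise_for_status()
            with open(path, 'wb') as f:
                f.write(response.content)
            image_paths.append(path)
        except requests.RequestException as e:
            print(f"Skipping {row['photo_id']}: {e}")  # don't interrupt the run
    return image_paths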

Image Download and Processing Script

Code Execution

The main section of the script runs the functions defined above:

if __name__ == "__main__":
    image_folder = "../images"  # Directory for storing images
    
    photos_df = load_photo_dataset(image_folder)
    
    if photos_df is not None:
        image_paths = download_images(photos_df, image_folder)
        print(f"Found {len(image_paths)} images for processing.")

Explanation:

  • Defines the image storage directory: image_folder is set to "../images".
  • Loads the image dataset: load_photo_dataset() reads the dataset containing information about the images.
  • If images are available, downloads them: download_images() handles the download and returns the list of saved paths.

Project Approach

Serial Implementation:

  • Images are downloaded and processed sequentially; filters are applied one by one to all images.
  • Limitation: it takes longer since it only uses a single CPU core.

Parallel Implementation:

  • Uses multiprocessing.Pool to distribute tasks across multiple processes (see the sketch below).
  • Images are processed in parallel, reducing execution time.
  • Advantage: better CPU resource utilization.
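
A minimal sketch of the parallel version, assuming a hypothetical worker named process_image that applies the filters sketched earlier with OpenCV; the repository's actual processing code may differ:

import glob
from multiprocessing import Pool

import cv2

def process_image(image_path):
    """Hypothetical worker: apply the three filters to a single image."""
    image = cv2.imread(image_path)
    gray = cv2.cvtColor(image, cv2.COLOR_BGR2GRAY)
    blurred = cv2.GaussianBlur(image, (5, 5), 0)
    edges = cv2.Canny(gray, 100, 200)
    return image_path  # the filtered results would normally be written to disk

if __name__ == "__main__":
    image_paths = glob.glob("../images/*.jpg")
    with Pool(processes=4) as pool:  # vary the count (1, 2, 3, 4, 6) to benchmark
        pool.map(process_image, image_paths)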

Benchmarking Results

Execution time was measured with 1, 2, 3, 4, and 6 processes.

Number of Processes    Execution Time (seconds)
1 process              64.00
2 processes            39.79
3 processes            33.02
4 processes            29.76
6 processes            28.70

(Execution time depends on hardware and dataset size)
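
From the table, the speedup relative to the single-process run (64.00 s divided by the time with N processes) is roughly 1.61× with 2 processes, 1.94× with 3, 2.15× with 4, and 2.23× with 6, illustrating the diminishing returns discussed in the conclusions.

A timing loop of the following shape could produce such a table (a sketch only; it reuses the hypothetical process_image worker and image_paths list from the previous sketch, and the actual benchmarking script is not reproduced here):

import time
from multiprocessing import Pool

# process_image and image_paths as defined in the previous sketch
if __name__ == "__main__":
    for n in (1, 2, 3, 4, 6):
        start = time.perf_counter()
        with Pool(processes=n) as pool:
            pool.map(process_image, image_paths)  # same workload for each pool size
        print(f"{n} processes: {time.perf_counter() - start:.2f} s")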

Conclusions

  • Multiprocessing improves performance: The parallel implementation significantly reduced processing time compared to the serial version.

  • Optimal number of processes varies: Beyond a certain number of processes (usually equal to the number of CPU cores), performance stops improving due to task management overhead.

Factors affecting performance:

  • The number of images and their size.
  • CPU workload.
  • The number of parallel processes used.

Potential improvements:

  • Optimization using concurrent.futures.ProcessPoolExecutor (sketched below) or batch-processing strategies.
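
A minimal sketch of the ProcessPoolExecutor alternative, again reusing the hypothetical process_image worker from the Project Approach section:

import glob
from concurrent.futures import ProcessPoolExecutor

# process_image is the hypothetical worker sketched in the Project Approach section
if __name__ == "__main__":
    image_paths = glob.glob("../images/*.jpg")
    with ProcessPoolExecutor(max_workers=4) as executor:
        results = list(executor.map(process_image, image_paths))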

How to Use the Repo

To use this repository and run the parallel image processing pipeline, follow the steps below:

1. Clone the repository:

git clone https://github.com/valinyourarea/Multiprocessing_Projects.git
cd Multiprocessing_Projects

2. Install dependencies:

Make sure you have python3 and pip installed. Then install the required dependencies:

pip install -r requirements.txt
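
(The exact contents of requirements.txt are not reproduced here; based on the code shown above it presumably includes pandas and requests, plus whichever imaging library the filters use, such as opencv-python.)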

3. Download the dataset:

The project requires a .tsv file containing image URLs. Place the photos.tsv file in the images/ directory (the scripts reference it as ../images relative to the code).

4. Run the script:

Once the dataset is ready, you can run the script to download and process the images. Use the following command:

python downloader.py

5. Review results:

The script will process the images and apply filters in parallel. You can check the execution times and compare the serial vs parallel results.
