
🌿 Deep Learning project classifying rare species from images using the BioCLIP dataset. Features transfer learning (ConvNeXtBase), innovative data cleaning with CLIP (zero-shot), and imbalance handling. Developed for the DL course at NOVA IMS.


Silvestre17/DL_RareSpecies_ImageClassification_MasterProject


🌿 Predicting Rare Species from Images using Deep Learning 🐦🦋

Work developed for the Deep Learning course in the Master's in Data Science and Advanced Analytics at NOVA IMS (Spring Semester 2024-2025).

📝 Description

This project applies deep learning to the classification of rare species from images. It uses the BioCLIP dataset, sourced from the Encyclopedia of Life (EOL), which contains 11,983 images across 202 animal families along with taxonomic metadata (kingdom, phylum, family). We built a pipeline that preprocesses imbalanced and noisy data, trains multiple neural network architectures, and applies a zero-shot classification step to filter out noisy images and improve model performance. The ultimate goal is a tool that aids biodiversity conservation through automated species identification.

✨ Objective

The primary objective is to develop a highly accurate image classification model by:

  • Exploring the complex BioCLIP dataset to understand its structure and inherent challenges, such as severe class imbalance.
  • Preprocessing images and implementing data augmentation strategies to create a robust training pipeline.
  • Developing and evaluating multiple deep learning models, from a baseline CNN to state-of-the-art pre-trained architectures.
  • Innovating with a zero-shot classification pre-filtering step to remove noisy data and enhance model accuracy.

🎓 Project Context

This project was developed for the Deep Learning course in the Master's in Data Science and Advanced Analytics program at NOVA IMS, during the 2nd Semester of the 2024/2025 academic year.

💾 Data Source

The dataset is derived from the BioCLIP project, with images and metadata sourced from the Encyclopedia of Life (EOL).

  • Dataset: 11,983 images of rare species.
  • Target: Classification across 202 unique family labels within the Animalia kingdom.
  • Source Links: BioCLIP Project

🏗️ Project Workflow (Adapted from the CRISP-DM methodology)

The project follows the CRISP-DM framework, adapted for deep learning, guiding the process from problem understanding to deployment.

Figure 1: Project Flowchart.

  1. Business Understanding: 💡
    • Problem: Classify rare species images into their family based on visual features.
    • Importance: Automate species identification to aid biodiversity conservation.
    • Data Source: BioCLIP dataset with family as the target variable.

Tools: Python, Pandas

  2. Data Understanding: 🔍
    • Dataset: 11,983 images, 7 metadata features, 202 families, all within Animalia.
    • Challenges: High class imbalance (Figure B2), potential non-animal outliers (Figure B3).
    • Exploration: Verified data types, checked for missing values/duplicates, and visualized family distribution.
    • Splitting: Stratified split into 80% training, 10% validation, 10% test sets.

Tools: Python, Pandas, NumPy, Matplotlib, Seaborn
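The stratified 80/10/10 split described above can be illustrated with a minimal, standard-library sketch. This is not the project's actual code (the tool list suggests scikit-learn or Pandas utilities were more likely used); the function name `stratified_split`, the seed, and the toy family labels are all assumptions made for illustration.

```python
import random
from collections import defaultdict


def stratified_split(labels, fractions=(0.8, 0.1, 0.1), seed=42):
    """Return (train, val, test) index lists, splitting each class separately.

    Hypothetical helper: each class is shuffled and cut on its own, so the
    class proportions are roughly preserved in all three subsets.
    """
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    train, val, test = [], [], []
    for indices in by_class.values():
        rng.shuffle(indices)
        n = len(indices)
        n_val = round(n * fractions[1])
        n_test = round(n * fractions[2])
        n_train = n - n_val - n_test  # remainder goes to training
        train += indices[:n_train]
        val += indices[n_train:n_train + n_val]
        test += indices[n_train + n_val:]
    return train, val, test


# Toy labels standing in for the `family` target (illustrative values only).
labels = ["Felidae"] * 50 + ["Ursidae"] * 30 + ["Bufonidae"] * 20
train_idx, val_idx, test_idx = stratified_split(labels)
print(len(train_idx), len(val_idx), len(test_idx))  # 80 10 10
```

With this rounding rule, a family with fewer than five images contributes nothing to the validation or test sets, which is one reason very rare classes need separate attention in a dataset this imbalanced.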

  3. Data Preparation: 🛠️

Tools: TensorFlow, Keras, Pillow

  4. Modeling: 🤖

Tools: scikit-learn, Keras, Keras Tuner, Transformers, visualkeras

  5. Evaluation: ✅

    • Metrics: Macro F1-Score (primary due to imbalance), Accuracy, Precision, Recall, AUROC.
    • Analysis: Learning curves (Figure F1) assessed generalization; confusion matrices (Figure F4) and qualitative examples (Figures F2 & F3) identified misclassification patterns (e.g., visually similar species, poor image quality).
    • Callbacks: Used ModelCheckpoint, CSVLogger, LearningRateScheduler, EarlyStopping.
  6. Deployment: 🚀

    • Deliverables: Code, notebooks, and a comprehensive report detailing methodology and findings.
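To see why Macro F1-Score is a sensible primary metric under class imbalance, consider the following small sketch. It computes accuracy and macro F1 by hand on toy predictions; the helper names and the toy labels are illustrative assumptions, not taken from the project's notebooks. In practice, scikit-learn's `f1_score(y_true, y_pred, average="macro")` computes the same quantity.

```python
def per_class_f1(y_true, y_pred):
    """Return a dict mapping each class to its F1 score."""
    classes = sorted(set(y_true) | set(y_pred))
    scores = {}
    for c in classes:
        tp = sum(t == c and p == c for t, p in zip(y_true, y_pred))
        fp = sum(t != c and p == c for t, p in zip(y_true, y_pred))
        fn = sum(t == c and p != c for t, p in zip(y_true, y_pred))
        denom = 2 * tp + fp + fn
        # F1 = 2TP / (2TP + FP + FN); defined as 0 when the class is never hit.
        scores[c] = 2 * tp / denom if denom else 0.0
    return scores


def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1, so every class counts equally."""
    scores = per_class_f1(y_true, y_pred)
    return sum(scores.values()) / len(scores)


def accuracy(y_true, y_pred):
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


# Toy case: nine images of a common family, one of a rare family,
# and a model that always predicts the common one.
y_true = ["common"] * 9 + ["rare"]
y_pred = ["common"] * 10

print(f"accuracy = {accuracy(y_true, y_pred):.3f}")   # accuracy = 0.900
print(f"macro F1 = {macro_f1(y_true, y_pred):.3f}")   # macro F1 = 0.474
```

Accuracy rewards the model for ignoring the rare class entirely, while macro F1 is pulled down by the rare class's score of zero.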

✨ Innovative Approach: Zero-Shot Image Classification with CLIP 🚀

Tools: Hugging Face
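As a rough illustration of how a zero-shot pre-filter of this kind can work, the sketch below scores an image against a few text prompts with CLIP and keeps it only if enough probability mass falls on the "animal" prompt. Everything specific here is an assumption: the README does not state which CLIP checkpoint, prompts, or threshold the project used, so `openai/clip-vit-base-patch32`, the prompt list, and the 0.5 cutoff are placeholders, and the helper names are hypothetical.

```python
# Hypothetical prompts; the project's actual prompts are not given in the README.
LABELS = [
    "a photo of an animal",
    "a photo of a plant",
    "a photo of a fungus",
    "a drawing or diagram",
]
ANIMAL_LABELS = {"a photo of an animal"}


def animal_probability(probs, labels=LABELS, animal_labels=ANIMAL_LABELS):
    """Sum the probability assigned to prompts that describe an animal."""
    return sum(p for p, label in zip(probs, labels) if label in animal_labels)


def keep_image(probs, threshold=0.5):
    """Decide whether an image stays in the filtered dataset (assumed cutoff)."""
    return animal_probability(probs) >= threshold


def clip_probabilities(image_path, labels=LABELS,
                       checkpoint="openai/clip-vit-base-patch32"):
    """Return one probability per prompt for a single image.

    Heavy dependencies are imported lazily so the helpers above can be
    used and tested without torch, transformers, or Pillow installed.
    """
    import torch
    from PIL import Image
    from transformers import CLIPModel, CLIPProcessor

    model = CLIPModel.from_pretrained(checkpoint)
    processor = CLIPProcessor.from_pretrained(checkpoint)
    image = Image.open(image_path).convert("RGB")
    inputs = processor(text=labels, images=image,
                       return_tensors="pt", padding=True)
    with torch.no_grad():
        outputs = model(**inputs)
    # logits_per_image has shape (1, len(labels)); softmax turns it into probabilities.
    return outputs.logits_per_image.softmax(dim=1)[0].tolist()


# Decision step on made-up probabilities (no model download needed).
print(keep_image([0.85, 0.05, 0.05, 0.05]))  # True
print(keep_image([0.10, 0.70, 0.15, 0.05]))  # False
```

For a real run, `keep_image(clip_probabilities("path/to/image.jpg"))` would combine the two steps; loading the model once outside the function would be preferable when filtering thousands of images.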

📈 Results & Conclusion

The ConvNeXtBase model, trained on the CLIP-filtered "OnlyAnimals" dataset with SMOTE-inspired augmentation, emerged as the top-performing solution. It achieved a final Accuracy of 83.1% and a Macro F1-Score of 78.7% on the hold-out test set. This project demonstrates that a combination of advanced transfer learning, innovative data cleaning with zero-shot models, and robust imbalance handling can create a powerful and scalable solution for automated species classification, directly supporting biodiversity conservation efforts.
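The README does not spell out what "SMOTE-inspired augmentation" involves. One plausible reading, offered here purely as an assumption, is that minority families are topped up with augmented copies of their images until they approach the size of the largest class. The sketch below only computes such an oversampling plan; the function name `augmentation_plan` and the toy counts are hypothetical, and the image transformations themselves are left out.

```python
from collections import Counter


def augmentation_plan(labels, target=None):
    """Map each class to the number of extra (augmented) images it would need.

    Assumed strategy: bring every class up to `target` images, which
    defaults to the size of the largest class.
    """
    counts = Counter(labels)
    if target is None:
        target = max(counts.values())
    return {cls: max(0, target - n) for cls, n in counts.items()}


# Toy training labels (illustrative values, not the real class counts).
train_labels = ["Felidae"] * 50 + ["Ursidae"] * 30 + ["Bufonidae"] * 20
plan = augmentation_plan(train_labels)
print(plan)  # {'Felidae': 0, 'Ursidae': 20, 'Bufonidae': 30}
```

Any such oversampling should be applied to the training split only, so that the validation and test sets keep the original class distribution.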

Feel free to explore the notebooks to see the implementation details of each phase!


📂 Repository Structure

  1. Data & Image Preparation

  2. Baseline Model - CNN

  3. Pre-trained Models

  4. Tuning Best Model

  5. Innovative Approach


👥 Team Members (Group 37)

  • André Silvestre, 20240502
  • Diogo Duarte, 20240525
  • Filipa Pereira, 20240509
  • Maria Cruz, 20230760
  • Umeima Adam Mahomed, 20240543
