This repository contains modular data processing pipelines that fetch, process, and analyze data from the HuggingFace Hub.
```
├── config.py                      # Configuration for models pipeline
├── config_datasets.py             # Configuration for datasets pipeline
├── config_papers.py               # Configuration for papers pipeline
├── utils.py                       # Shared utility functions and logging
├── data_fetcher.py                # Data fetching for models
├── data_fetcher_datasets.py       # Data fetching for datasets
├── data_fetcher_papers.py         # Data fetching for papers
├── tag_processor.py               # Tag processing for models
├── tag_processor_datasets.py      # Tag processing for datasets
├── data_processor.py              # Main processing logic for models
├── data_processor_datasets.py     # Main processing logic for datasets
├── data_processor_papers.py       # Semantic taxonomy mapping for papers
├── main.py                        # Models pipeline orchestrator
├── main_datasets.py               # Datasets pipeline orchestrator
├── main_papers.py                 # Papers pipeline orchestrator
├── test_pipeline.py               # Integration test for models
├── test_pipeline_datasets.py      # Integration test for datasets
├── test_pipeline_papers.py        # Integration test for papers
├── hub_download.py                # Weekly snapshot downloader
├── integrated_ml_taxonomy.json    # ML taxonomy for papers
├── requirements.txt               # Python dependencies
└── README.md                      # This documentation
```
The repository hosts three pipelines:

- **Models**: Processes HuggingFace model data with feature extraction and categorization.
- **Datasets**: Processes HuggingFace datasets data.
- **Papers**: Processes academic papers with semantic taxonomy mapping using spaCy NLP.
The papers pipeline includes advanced semantic analysis and citation tracking:
- **Data Source**: Loads papers from the `cfahlgren1/hub-stats` dataset (`daily_papers.parquet`)
- **Semantic Taxonomy**: Uses spaCy's `en_core_web_lg` model for semantic similarity
- **Hierarchical Classification**: Maps paper keywords to the ML taxonomy:
  - Categories (e.g., Computer Vision, NLP, Deep Learning)
  - Subcategories (e.g., Object Detection, Text Classification, GANs)
  - Topics (e.g., YOLO, BERT, Transformers)
- **Multi-Label Classification**: Papers can receive multiple categories when similarity scores are close (within 90% of the top score); a sketch of this selection logic follows the list
- **Citation Tracking**: Fetches citation counts using `paperscraper` (via DOI and title)
- **Rich Metadata**: Preserves all 33+ original columns (authors, GitHub repos, upvotes, etc.)
- **Reports & Analytics**: Generates detailed matching reports and statistics
- **Auto-Upload**: Uploads results to the HuggingFace repository `evijit/paperverse_daily_data`
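To make the multi-label rule concrete, here is a minimal sketch of how such a selection could work, assuming spaCy's `en_core_web_lg` vectors and the default thresholds listed in the Configuration section below. It is an illustration, not the pipeline's actual implementation; `classify` and its inputs are hypothetical:

```python
import spacy

# Hypothetical sketch of the multi-label selection rule; the constants
# mirror the config_papers.py defaults listed under Configuration.
SIMILARITY_THRESHOLD = 0.55         # minimum cosine similarity to match at all
MULTI_CLASS_SCORE_THRESHOLD = 0.90  # keep labels within 90% of the top score
MAX_CLASSIFICATIONS = 5             # cap on labels per taxonomy level

nlp = spacy.load("en_core_web_lg")

def classify(keywords: str, labels: list[str]) -> list[str]:
    """Score taxonomy labels against a paper's keywords and keep every
    label whose similarity is close enough to the best-scoring one."""
    doc = nlp(keywords)
    scored = sorted(
        ((label, doc.similarity(nlp(label))) for label in labels),
        key=lambda pair: pair[1],
        reverse=True,
    )
    if not scored or scored[0][1] < SIMILARITY_THRESHOLD:
        return []  # nothing crosses the minimum-similarity bar
    top_score = scored[0][1]
    return [
        label
        for label, score in scored[:MAX_CLASSIFICATIONS]
        if score >= top_score * MULTI_CLASS_SCORE_THRESHOLD
    ]

print(classify("object detection in images", ["Computer Vision", "NLP", "Deep Learning"]))
```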
The pipeline generates:
- `papers_with_semantic_taxonomy.parquet` - Full dataset with taxonomy
- `papers_with_semantic_taxonomy.csv` - CSV version
- `taxonomy_report.txt` - Detailed text report
- `taxonomy_distribution.json` - Statistics in JSON format
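As a quick usage example, the parquet output can be inspected with pandas; nothing here assumes a particular schema beyond the file name listed above:

```python
import pandas as pd

# Load the taxonomy-annotated output and take a first look.
df = pd.read_parquet("papers_with_semantic_taxonomy.parquet")
print(df.shape)             # rows x columns (33+ original columns plus taxonomy fields)
print(df.columns.tolist())  # inspect which taxonomy columns were added
```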
Key settings in respective config files:
Models (`config.py`):
- `MODEL_ID_TO_DEBUG`: Specific model ID for debugging
- `TAG_MAP`: Feature flags and keywords
- `MODEL_SIZE_RANGES`: Size categorization thresholds
Papers (`config_papers.py`):
- `TAXONOMY_FILE_PATH`: Path to the ML taxonomy JSON
- `SIMILARITY_THRESHOLD`: Minimum cosine similarity (default: 0.55)
- `SPACY_MODEL`: NLP model to use (default: `en_core_web_lg`)
- `HF_REPO_ID`: Target HuggingFace repository
- `ENABLE_CITATION_FETCHING`: Enable/disable citation fetching (default: True)
- `CITATION_BATCH_SIZE`: Batch size for progress updates (default: 100)
- `MULTI_CLASS_ENABLED`: Allow multiple classifications per paper (default: True)
- `MULTI_CLASS_SCORE_THRESHOLD`: Include classes within 90% of the top score (default: 0.90)
- `MAX_CLASSIFICATIONS`: Maximum classifications per level (default: 5)
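Put together, a `config_papers.py` using the defaults above might look like the following sketch. The variable names and defaults are taken from the list above; treat it as illustrative rather than a copy of the actual file:

```python
# config_papers.py - illustrative sketch assembled from the documented defaults.
TAXONOMY_FILE_PATH = "integrated_ml_taxonomy.json"  # taxonomy file shipped in this repo
SIMILARITY_THRESHOLD = 0.55
SPACY_MODEL = "en_core_web_lg"
HF_REPO_ID = "evijit/paperverse_daily_data"
ENABLE_CITATION_FETCHING = True
CITATION_BATCH_SIZE = 100
MULTI_CLASS_ENABLED = True
MULTI_CLASS_SCORE_THRESHOLD = 0.90
MAX_CLASSIFICATIONS = 5
```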
You can test each pipeline independently:
```bash
# Test models pipeline (small subset)
export TEST_DATA_LIMIT=100
python test_pipeline.py

# Test datasets pipeline (small subset)
export TEST_DATA_LIMIT=100
python test_pipeline_datasets.py

# Test papers pipeline (small subset)
export TEST_DATA_LIMIT=50
python test_pipeline_papers.py
```
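`TEST_DATA_LIMIT` caps how many records a test run processes. This README does not show how the test scripts consume it, but a typical pattern looks like this hypothetical sketch; `fetch_records` is a stand-in, and only the environment variable name comes from above:

```python
import os

def fetch_records():
    """Stand-in for the real data fetcher used by the test scripts."""
    return list(range(1000))

# Read the cap from the environment; the default keeps test runs small.
limit = int(os.environ.get("TEST_DATA_LIMIT", "100"))
records = fetch_records()[:limit]
print(f"Processing {len(records)} records")
```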
```bash
# Run full pipelines
python main.py            # Models
python main_datasets.py   # Datasets
python main_papers.py     # Papers
```

Install the Python dependencies first:

```bash
pip install -r requirements.txt
```

The papers pipeline additionally requires the spaCy language model and the citation scraper:
```bash
# Download the spaCy model (will auto-download if missing)
python -m spacy download en_core_web_lg

# Install paperscraper for citation tracking
pip install paperscraper
```

Notes:
- The `en_core_web_lg` model is ~500MB and will auto-download if not found
- `paperscraper` fetches citation counts from Semantic Scholar and Google Scholar (a rough illustration of such a lookup follows these notes)
- Citation fetching can be disabled by setting `ENABLE_CITATION_FETCHING = False` in the config
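As a rough illustration of what a citation-count fetch involves, here is a minimal sketch that queries the Semantic Scholar Graph API directly by DOI. The endpoint and `fields` parameter come from Semantic Scholar's public API, not from this repository's code; the pipeline itself delegates this work to `paperscraper`:

```python
import requests

def citation_count_by_doi(doi: str) -> int | None:
    """Look up a paper's citation count on Semantic Scholar by DOI.
    Illustrative only; the pipeline uses paperscraper for this."""
    url = f"https://api.semanticscholar.org/graph/v1/paper/DOI:{doi}"
    resp = requests.get(url, params={"fields": "citationCount"}, timeout=10)
    if resp.status_code != 200:
        return None  # unknown DOI, rate limit, or network error
    return resp.json().get("citationCount")

print(citation_count_by_doi("10.48550/arXiv.1706.03762"))  # "Attention Is All You Need"
```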
To enable automatic upload to HuggingFace:
```bash
# Set your HuggingFace token
export HF_TOKEN="your_huggingface_token_here"

# Run the papers pipeline
python main_papers.py
```

The papers pipeline will upload results to: `evijit/paperverse_daily_data`
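Under the hood, an upload like this can be done with `huggingface_hub`. This is a minimal sketch using the library's standard `HfApi.upload_file` call; the repo type and file path are assumptions, not the pipeline's exact code:

```python
import os
from huggingface_hub import HfApi

# Authenticate with the token from the environment and push one output file.
api = HfApi(token=os.environ["HF_TOKEN"])
api.upload_file(
    path_or_fileobj="papers_with_semantic_taxonomy.parquet",
    path_in_repo="papers_with_semantic_taxonomy.parquet",
    repo_id="evijit/paperverse_daily_data",
    repo_type="dataset",  # assumption: the target repo is a dataset repo
)
```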
To create a token:
- Go to https://huggingface.co/settings/tokens
- Create a new token with write permissions
- Copy the token and set it as an environment variable
For automated runs, add `HF_TOKEN` to your repository secrets:
- Go to repository Settings → Secrets and variables → Actions
- Add a new secret named `HF_TOKEN` with your token value
- The workflow will automatically upload results