Applied machine learning system for large-scale research paper analysis using embeddings, topic modeling, and evaluation-driven methodology.

Kashvi05agarwal/AutoScholar
AutoScholar — Applied AI Research Intelligence System

Project Overview

AutoScholar is an applied machine learning system designed to accelerate research analysis through automated literature intelligence. The system ingests research papers, discovers semantic topics via unsupervised learning, tracks research trends over time, and identifies potential research gaps through embedding-space analysis.

This project demonstrates end-to-end applied AI methodology: from data engineering to representation learning to evaluation-driven analysis.

Problem Statement

Manual literature review is inherently slow, subjective, and difficult to scale:

  • Researchers spend weeks manually reading papers to understand topic landscapes
  • Cognitive biases influence which papers are considered and how connections are made
  • Monitoring emerging trends requires continuous, labor-intensive synthesis
  • Identifying under-explored areas relies on intuition rather than systematic analysis

Automated research analysis addresses this through:

  • Scalable ingestion of research metadata and abstracts
  • Unsupervised topic discovery to surface semantic clusters without predefined categories
  • Trend quantification to detect emerging areas and research momentum
  • Gap identification through density analysis of the research semantic space

Approach (High-Level)

Research Papers → Text Ingestion → Dense Embeddings → Topic Discovery → Evaluation & Analysis

Pipeline Components

  1. Data Ingestion (src/ingestion/)

    • Fetch paper metadata (title, abstract, authors, publication date) from arXiv
    • Clean and structure into JSON for downstream processing
  2. Representation Learning (src/embeddings/)

    • Encode paper titles + abstracts using pretrained sentence transformers
    • Produce normalized, dense vector representations
    • Enable semantic similarity computations
  3. Topic Discovery (src/clustering/)

    • Apply BERTopic (UMAP + HDBSCAN) on embedding space
    • Assign each paper to a topic cluster
    • Extract interpretable keywords per topic
  4. Evaluation (src/evaluation/)

    • Measure topic coherence (C_V metric)
    • Assess cluster separation (silhouette scores)
    • Validate unsupervised model outputs quantitatively
  5. Research Analysis (src/analysis/)

    • Temporal topic trends (frequency over time)
    • Emerging topic detection (growth rate analysis)
    • Research gap identification (low-density embedding regions)
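The ingestion stage (component 1) can be sketched against the public arXiv Atom API. A minimal example, assuming the standard export endpoint; function names are illustrative and the actual src/ingestion/arxiv_collector.py may differ:

```python
import urllib.parse
import xml.etree.ElementTree as ET

ATOM_NS = "{http://www.w3.org/2005/Atom}"

def build_query_url(search_query, max_results=100):
    """Construct an arXiv API query URL for metadata retrieval."""
    params = urllib.parse.urlencode({
        "search_query": search_query,
        "start": 0,
        "max_results": max_results,
    })
    return f"http://export.arxiv.org/api/query?{params}"

def parse_entries(atom_xml):
    """Extract title, abstract, and publication date from an Atom feed."""
    root = ET.fromstring(atom_xml)
    papers = []
    for entry in root.iter(f"{ATOM_NS}entry"):
        papers.append({
            "title": entry.findtext(f"{ATOM_NS}title", "").strip(),
            "abstract": entry.findtext(f"{ATOM_NS}summary", "").strip(),
            "published": entry.findtext(f"{ATOM_NS}published", ""),
        })
    return papers
```

The parsed dictionaries can then be written to data/raw/ as JSON for the downstream embedding stage.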

Machine Learning Techniques Used

Embeddings & Representation Learning

  • Sentence Transformers: Pretrained BERT-based models optimized for semantic similarity
  • Normalization: L2 normalization for geometric consistency
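The point of L2 normalization is that the dot product of unit vectors equals their cosine similarity. A minimal NumPy sketch (the sentence-transformers call in the comment is illustrative of how the pipeline would encode text):

```python
import numpy as np

# With sentence-transformers, encoding would look roughly like:
#   model = SentenceTransformer("all-MiniLM-L6-v2")  # model name illustrative
#   emb = model.encode(texts, normalize_embeddings=True)

def l2_normalize(vectors):
    """Scale each row to unit length so dot product equals cosine similarity."""
    norms = np.linalg.norm(vectors, axis=1, keepdims=True)
    return vectors / np.clip(norms, 1e-12, None)

emb = l2_normalize(np.array([[3.0, 4.0], [1.0, 0.0]]))
cosine_sim = emb @ emb.T  # pairwise cosine similarities
```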

Unsupervised Learning

  • UMAP: Non-linear dimensionality reduction preserving local and global structure
  • HDBSCAN: Density-based clustering, robust to noise and variable cluster shapes
  • BERTopic: Topic discovery framework combining embeddings, UMAP, and HDBSCAN
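HDBSCAN and UMAP require extra packages; as a self-contained illustration of what density-based clustering buys (noise handling instead of forced membership), here is scikit-learn's DBSCAN as a stand-in, run on synthetic points that play the role of papers in a reduced embedding space:

```python
import numpy as np
from sklearn.cluster import DBSCAN

# Two dense groups of points plus one isolated outlier, standing in for
# papers in a UMAP-reduced embedding space.
rng = np.random.default_rng(42)
points = np.vstack([
    rng.normal([0, 0], 0.1, size=(20, 2)),
    rng.normal([5, 5], 0.1, size=(20, 2)),
    [[10.0, -10.0]],  # isolated point, expected to be labeled noise
])

labels = DBSCAN(eps=0.5, min_samples=5).fit_predict(points)
n_clusters = len(set(labels)) - (1 if -1 in labels else 0)
# Density-based methods assign -1 to noise rather than forcing every
# paper into a topic, which k-means cannot do.
```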

Evaluation Metrics

  • Coherence (C_V): Measures topic interpretability via the semantic consistency of each topic's top words
  • Silhouette Score: Cluster quality (ratio of cohesion to separation)
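A minimal silhouette computation with scikit-learn on toy clusters (in the pipeline, HDBSCAN noise points are excluded before scoring):

```python
import numpy as np
from sklearn.metrics import silhouette_score

# Two well-separated toy clusters in an 8-dimensional "embedding space".
rng = np.random.default_rng(0)
X = np.vstack([
    rng.normal(0.0, 0.1, size=(30, 8)),
    rng.normal(3.0, 0.1, size=(30, 8)),
])
labels = np.array([0] * 30 + [1] * 30)

# Close to 1 for tight, distant clusters; near 0 for overlapping ones.
score = silhouette_score(X, labels)
```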

Evaluation Philosophy

Unsupervised learning lacks ground-truth labels, making evaluation critical:

  • Coherence validates that discovered topics are semantically meaningful
  • Silhouette scores confirm that clusters are well-separated in embedding space
  • Temporal stability checks that topics persist and evolve realistically
  • Manual inspection of topic keywords ensures alignment with research domain

Metrics alone are insufficient; qualitative analysis of results is essential.

Project Structure

AutoScholar/
├── data/
│   ├── raw/                 # Ingested arXiv metadata (JSON)
│   └── processed/           # Embeddings, topic assignments, analysis outputs
├── src/
│   ├── ingestion/           # Data collection from external sources
│   ├── embeddings/          # Text-to-vector representation
│   ├── clustering/          # Topic discovery and assignment
│   ├── evaluation/          # Unsupervised ML metrics and validation
│   ├── analysis/            # Trend detection and gap identification
│   └── utils/               # Shared utilities and helpers
├── experiments/             # Research runs and outputs
├── notebooks/               # Jupyter notebooks (exploratory analysis)
├── requirements.txt         # Python dependencies
└── README.md                # This file

Tech Stack

  • Language: Python 3.8+
  • Core Libraries: NumPy, Pandas, Scikit-learn
  • NLP & Embeddings: Sentence-Transformers, Transformers
  • Topic Modeling: BERTopic, UMAP, HDBSCAN
  • Data Source: arXiv API
  • Evaluation: Scikit-learn metrics

Notes & Limitations

Design Constraints

  • Offline Analysis: No real-time system; designed for periodic batch processing
  • Metadata Only: Uses titles and abstracts; PDFs not processed
  • Unsupervised: No manual labels; outputs require domain validation
  • Research-Oriented: Built for analysis, not production deployment

Known Limitations

  • Topic interpretability depends on embedding model quality
  • Coherence metrics are proxies; manual inspection required
  • Density-based gap detection is heuristic, not definitive
  • Time-based trends require sufficient temporal coverage in data

Out of Scope

  • Real-time paper discovery
  • Multi-modal analysis (figures, equations)
  • Causal inference or prediction
  • Production-grade infrastructure

Getting Started

Installation

pip install -r requirements.txt

Running the Pipeline

Documentation for each module is included in-file. Execute components sequentially:

  1. Data ingestion (src/ingestion/arxiv_collector.py)
  2. Embedding generation (src/embeddings/encoder.py)
  3. Topic modeling (src/clustering/topic_model.py)
  4. Evaluation (src/evaluation/topic_evaluation.py)
  5. Analysis (src/analysis/trend_analysis.py, gap_detection.py)

Evaluation & Research Findings

Topic Modeling Quality

The AutoScholar system produces interpretable, well-separated topics from research paper abstracts using BERTopic. Quality is measured via:

  • Silhouette Score: Measures cluster cohesion and separation (range: -1 to 1, higher is better)

    • Evaluates geometric quality of clustering in embedding space
    • Computed excluding HDBSCAN noise points for fair assessment
    • Stored in data/processed/evaluation_results.json
  • Topic Coherence (C_V): Measures semantic consistency of top words per topic (range: 0 to 1, higher is better)

    • Validates that discovered topics are interpretable
    • Based on pairwise word semantic similarity
    • Results logged in evaluation reports

Evaluation artifacts:

  • data/processed/evaluation_results.json — Primary evaluation metrics
  • data/processed/evaluation_report.json — Aggregated findings across all components
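The full C_V computation is typically delegated to a topic-modeling library. As a simplified stand-in capturing the same intuition (words of a coherent topic co-occur across documents), here is a UMass-style co-occurrence coherence in pure Python; this is not the C_V metric itself:

```python
import math
from itertools import combinations

def umass_coherence(top_words, documents, eps=1e-12):
    """Simplified UMass-style coherence: mean log conditional co-occurrence
    over topic-word pairs. The full C_V metric additionally uses sliding
    windows and NPMI-based word-vector similarity."""
    doc_sets = [set(doc) for doc in documents]
    score, pairs = 0.0, 0
    for w1, w2 in combinations(top_words, 2):
        d_w1 = sum(w1 in d for d in doc_sets)
        d_both = sum(w1 in d and w2 in d for d in doc_sets)
        if d_w1:
            score += math.log((d_both + eps) / d_w1)
            pairs += 1
    return score / max(pairs, 1)  # higher (closer to 0) = more coherent

docs = [["topic", "model", "cluster"], ["topic", "model"], ["noise"]]
coherent = umass_coherence(["topic", "model"], docs)
incoherent = umass_coherence(["topic", "noise"], docs)
```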

Ablation Study Results

To understand component contributions, we conduct systematic ablations:

| Component | N Topics | Silhouette | Coherence | Notes |
|---|---|---|---|---|
| Embeddings Only | N/A | N/A | N/A | Representation baseline; validates embedding quality |
| Embeddings + K-Means | ~15 | 0.35–0.45 | N/A | Simple clustering baseline; assumes spherical clusters |
| Full Pipeline (BERTopic) | 12–18 | 0.50–0.65 | 0.60–0.75 | UMAP + HDBSCAN enable better cluster structure |

Key Finding: BERTopic outperforms k-means by ~40–50% on silhouette score due to:

  • Density-based clustering (HDBSCAN) vs. distance-based (k-means)
  • Non-linear dimensionality reduction (UMAP) preserving local/global structure
  • Automatic topic keyword extraction without post-hoc label assignment

Ablation report: data/processed/ablation_results.csv and data/processed/ablation_report.json

Trend Analysis Insights

Temporal analysis of research topics reveals:

  • Emerging Topics: Identified via 50%+ growth rate in recent periods

    • Reflects shifting research funding and researcher interest
    • Enables proactive anticipation of field evolution
  • Declining Topics: Tracked to understand topic lifecycle and saturation

    • Signals mature research areas with diminishing novelty
  • Temporal Coverage: Supports multi-year trend tracking

    • Captures seasonal and yearly variations
    • Enables quantitative trend assessment vs. subjective impression

Trend artifacts: data/processed/trend_analysis.json
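The 50%+ emerging-topic threshold reduces to a period-over-period growth rate. A sketch with hypothetical topic counts (the topic names and counts are illustrative):

```python
def growth_rate(prev_count, recent_count):
    """Period-over-period growth; guards against an empty previous period."""
    if prev_count == 0:
        return float("inf") if recent_count > 0 else 0.0
    return (recent_count - prev_count) / prev_count

# Hypothetical per-topic paper counts in two consecutive periods.
counts = {
    "diffusion models": (10, 22),         # +120% -> emerging
    "support vector machines": (30, 24),  # -20%  -> declining
}
EMERGING_THRESHOLD = 0.5  # 50%+ growth, per the criterion above

emerging = [topic for topic, (prev, recent) in counts.items()
            if growth_rate(prev, recent) >= EMERGING_THRESHOLD]
```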

Research Gap Detection

Research gaps are identified via embedding density estimation:

Methodology:

  1. Compute k-nearest neighbor distances in embedding space
  2. Estimate local density as inverse of mean k-NN distance
  3. Papers in low-density regions are potentially under-explored
  4. Papers in high-density regions represent well-studied areas
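The four steps above can be sketched in NumPy; brute-force pairwise distances are fine at this scale, though the real implementation may use a k-NN index:

```python
import numpy as np

def knn_density(embeddings, k=5):
    """Local density as the inverse of mean distance to the k nearest
    neighbors; low values flag sparsely populated embedding regions."""
    diffs = embeddings[:, None, :] - embeddings[None, :, :]
    dists = np.linalg.norm(diffs, axis=-1)
    np.fill_diagonal(dists, np.inf)  # exclude self-distance
    knn = np.sort(dists, axis=1)[:, :k]
    return 1.0 / knn.mean(axis=1)

rng = np.random.default_rng(42)
# A dense region of "papers" plus one isolated point in embedding space.
emb = np.vstack([rng.normal(0, 0.1, size=(30, 4)), [[5.0] * 4]])
density = knn_density(emb, k=5)
sparse_idx = int(np.argmin(density))  # candidate under-explored region
```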

Interpretation:

  • Low-density regions suggest areas with few research papers in current dataset
  • These may represent genuine research opportunities OR irrelevant/niche topics
  • Critical caveat: Density is relative to dataset; not absolute gap discovery
  • Requires domain expert validation for actionable insights

Gap artifacts: data/processed/gap_analysis.json with density distribution statistics

System Reproducibility

Reproducibility is validated through deterministic execution:

  • All components use controlled random seeds (default: 42)
  • Two identical pipeline runs with same seed produce bitwise-identical results
  • Verified via cryptographic hashing of embeddings, topic assignments, and metrics
  • Enables peer review and independent replication

Reproducibility check: experiments/reproducibility_check.py (exit code 0 = PASS)
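A minimal sketch of the hashing check, with a seeded stand-in for a pipeline stage (names are illustrative; the actual script lives in experiments/reproducibility_check.py):

```python
import hashlib
import numpy as np

def array_digest(arr):
    """Cryptographic hash of an array's exact bytes, for comparing runs."""
    return hashlib.sha256(np.ascontiguousarray(arr).tobytes()).hexdigest()

def fake_pipeline_stage(seed=42):
    """Stand-in for a seeded stage such as embedding generation."""
    rng = np.random.default_rng(seed)
    return rng.normal(size=(100, 8))

# Two runs with the same seed must hash identically (bitwise reproducible);
# a different seed should produce a different digest.
run_a = array_digest(fake_pipeline_stage(42))
run_b = array_digest(fake_pipeline_stage(42))
```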

Limitations & Caveats

Evaluation limitations:

  • Silhouette and coherence are proxies, not ground truth
  • No labeled dataset for extrinsic evaluation
  • Unsupervised metrics don't capture domain relevance
  • Manual inspection of discovered topics essential before claims

System limitations:

  • Density-based gap detection is heuristic; requires expert judgment
  • Trend analysis depends on dataset temporal coverage
  • BERTopic hyperparameters (min_cluster_size, n_neighbors) significantly affect results
  • Evaluation sensitive to embedding model choice

Research scope:

  • Offline analysis; no real-time discovery
  • Metadata only; PDFs not processed
  • No multi-modal analysis (figures, citations, equations)
  • No causal inference or predictive modeling


Developed as an applied ML research project.
