AutoScholar is an applied machine learning system designed to accelerate research analysis through automated literature intelligence. The system ingests research papers, discovers semantic topics via unsupervised learning, tracks research trends over time, and identifies potential research gaps through embedding-space analysis.
This project demonstrates end-to-end applied AI methodology: from data engineering to representation learning to evaluation-driven analysis.
Manual literature review is inherently slow, subjective, and difficult to scale:
- Researchers spend weeks manually reading papers to understand topic landscapes
- Cognitive biases influence which papers are considered and how connections are made
- Monitoring emerging trends requires continuous, labor-intensive synthesis
- Identifying under-explored areas relies on intuition rather than systematic analysis
Automated research analysis addresses this through:
- Scalable ingestion of research metadata and abstracts
- Unsupervised topic discovery to surface semantic clusters without predefined categories
- Trend quantification to detect emerging areas and research momentum
- Gap identification through density analysis of the research semantic space
Research Papers → Text Ingestion → Dense Embeddings → Topic Discovery → Evaluation & Analysis
- **Data Ingestion** (`src/ingestion/`)
  - Fetch paper metadata (title, abstract, authors, publication date) from arXiv
  - Clean and structure it into JSON for downstream processing
- **Representation Learning** (`src/embeddings/`)
  - Encode paper titles + abstracts using pretrained sentence transformers
  - Produce normalized, dense vector representations
  - Enable semantic similarity computations
- **Topic Discovery** (`src/clustering/`)
  - Apply BERTopic (UMAP + HDBSCAN) on the embedding space
  - Assign each paper to a topic cluster
  - Extract interpretable keywords per topic
- **Evaluation** (`src/evaluation/`)
  - Measure topic coherence (C_V metric)
  - Assess cluster separation (silhouette scores)
  - Validate unsupervised model outputs quantitatively
- **Research Analysis** (`src/analysis/`)
  - Temporal topic trends (frequency over time)
  - Emerging topic detection (growth-rate analysis)
  - Research gap identification (low-density embedding regions)
- Sentence Transformers: Pretrained BERT-based models optimized for semantic similarity
- Normalization: L2 normalization for geometric consistency
- UMAP: Non-linear dimensionality reduction preserving local and global structure
- HDBSCAN: Density-based clustering, robust to noise and variable cluster shapes
- BERTopic: Topic discovery framework combining embeddings, UMAP, and HDBSCAN
- Coherence (C_V): Measures topic interpretability via the semantic consistency of top topic words
- Silhouette Score: Cluster quality (ratio of cohesion to separation)
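The normalization step above can be illustrated with a small NumPy sketch. Random stand-in vectors replace real sentence-transformer embeddings here, and the 384-dimension figure is merely typical of MiniLM-style encoders, not a claim about this pipeline's model:

```python
import numpy as np

def l2_normalize(vectors: np.ndarray) -> np.ndarray:
    """Scale each row to unit length so dot products equal cosine similarity."""
    norms = np.linalg.norm(vectors, axis=1, keepdims=True)
    return vectors / np.clip(norms, 1e-12, None)

# Stand-in embeddings; in the pipeline these come from a sentence transformer
rng = np.random.default_rng(42)
embeddings = rng.normal(size=(3, 384))

unit = l2_normalize(embeddings)
similarity = unit @ unit.T  # cosine similarity matrix of the three "papers"
```

After L2 normalization, nearest-neighbor search and clustering in Euclidean space become equivalent to cosine-based comparisons, which is the geometric consistency the pipeline relies on.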
Unsupervised learning lacks ground-truth labels, making evaluation critical:
- Coherence validates that discovered topics are semantically meaningful
- Silhouette scores confirm that clusters are well-separated in embedding space
- Temporal stability checks that topics persist and evolve realistically
- Manual inspection of topic keywords ensures alignment with research domain
Metrics alone are insufficient; qualitative analysis of results is essential.
AutoScholar/
├── data/
│   ├── raw/             # Ingested arXiv metadata (JSON)
│   └── processed/       # Embeddings, topic assignments, analysis outputs
├── src/
│   ├── ingestion/       # Data collection from external sources
│   ├── embeddings/      # Text-to-vector representation
│   ├── clustering/      # Topic discovery and assignment
│   ├── evaluation/      # Unsupervised ML metrics and validation
│   ├── analysis/        # Trend detection and gap identification
│   └── utils/           # Shared utilities and helpers
├── experiments/         # Research runs and outputs
├── notebooks/           # Jupyter notebooks (exploratory analysis)
├── requirements.txt     # Python dependencies
└── README.md            # This file
- Language: Python 3.8+
- Core Libraries: NumPy, Pandas, Scikit-learn
- NLP & Embeddings: Sentence-Transformers, Transformers
- Topic Modeling: BERTopic, UMAP, HDBSCAN
- Data Source: arXiv API
- Evaluation: Scikit-learn metrics
- Offline Analysis: No real-time system; designed for periodic batch processing
- Metadata Only: Uses titles and abstracts; PDFs not processed
- Unsupervised: No manual labels; outputs require domain validation
- Research-Oriented: Built for analysis, not production deployment
- Topic interpretability depends on embedding model quality
- Coherence metrics are proxies; manual inspection required
- Density-based gap detection is heuristic, not definitive
- Time-based trends require sufficient temporal coverage in data
- Real-time paper discovery
- Multi-modal analysis (figures, equations)
- Causal inference or prediction
- Production-grade infrastructure
Install dependencies with `pip install -r requirements.txt`. Documentation for each module is included in-file. Execute the components sequentially:
- Data ingestion (`src/ingestion/arxiv_collector.py`)
- Embedding generation (`src/embeddings/encoder.py`)
- Topic modeling (`src/clustering/topic_model.py`)
- Evaluation (`src/evaluation/topic_evaluation.py`)
- Analysis (`src/analysis/trend_analysis.py`, `src/analysis/gap_detection.py`)
The AutoScholar system produces interpretable, well-separated topics from research paper abstracts using BERTopic. Quality is measured via:
- **Silhouette Score**: Measures cluster cohesion and separation (range: -1 to 1, higher is better)
  - Evaluates the geometric quality of clustering in embedding space
  - Computed excluding HDBSCAN noise points for a fair assessment
  - Stored in `data/processed/evaluation_results.json`
- **Topic Coherence (C_V)**: Measures semantic consistency of top words per topic (range: 0 to 1, higher is better)
  - Validates that discovered topics are interpretable
  - Based on pairwise semantic similarity of topic words
  - Results logged in evaluation reports
Evaluation artifacts:
- `data/processed/evaluation_results.json`: primary evaluation metrics
- `data/processed/evaluation_report.json`: aggregated findings across all components
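A minimal sketch of the noise-excluded silhouette computation, assuming scikit-learn and toy cluster labels in place of real HDBSCAN output (the helper name is illustrative, not the project's API):

```python
import numpy as np
from sklearn.metrics import silhouette_score

def silhouette_excluding_noise(embeddings, labels):
    """Silhouette over clustered points only; HDBSCAN marks noise as -1."""
    labels = np.asarray(labels)
    mask = labels != -1
    # Silhouette needs at least two clusters among the kept points
    if len(set(labels[mask])) < 2:
        return float("nan")
    return silhouette_score(embeddings[mask], labels[mask])

# Toy data: two tight clusters plus a few noise points
rng = np.random.default_rng(42)
cluster_a = rng.normal(0.0, 0.1, size=(20, 5))
cluster_b = rng.normal(3.0, 0.1, size=(20, 5))
noise = rng.normal(1.5, 2.0, size=(5, 5))
X = np.vstack([cluster_a, cluster_b, noise])
y = np.array([0] * 20 + [1] * 20 + [-1] * 5)

score = silhouette_excluding_noise(X, y)
```

Excluding noise keeps the metric from penalizing the clusterer for points it deliberately refused to assign.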
To understand component contributions, we conduct systematic ablations:
| Component | N Topics | Silhouette | Coherence | Notes |
|---|---|---|---|---|
| Embeddings Only | N/A | N/A | N/A | Representation baseline; validates embedding quality |
| Embeddings + K-Means | ~15 | 0.35–0.45 | N/A | Simple clustering baseline; spherical assumptions |
| Full Pipeline (BERTopic) | 12–18 | 0.50–0.65 | 0.60–0.75 | UMAP + HDBSCAN enables better cluster structure |
Key Finding: BERTopic outperforms k-means by ~40–50% on silhouette score due to:
- Density-based clustering (HDBSCAN) vs. distance-based (k-means)
- Non-linear dimensionality reduction (UMAP) preserving local/global structure
- Automatic topic keyword extraction without post-hoc label assignment
Ablation report: `data/processed/ablation_results.csv` and `data/processed/ablation_report.json`
Temporal analysis of research topics reveals:
- **Emerging Topics**: Identified via a 50%+ growth rate in recent periods
  - Reflects shifting research funding and researcher interest
  - Enables proactive anticipation of field evolution
- **Declining Topics**: Tracked to understand topic lifecycle and saturation
  - Signals mature research areas with diminishing novelty
- **Temporal Coverage**: Supports multi-year trend tracking
  - Captures seasonal and yearly variations
  - Enables quantitative trend assessment rather than subjective impression
Trend artifacts: `data/processed/trend_analysis.json`
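The 50%+ growth-rate rule can be sketched in a few lines. The topic names and counts below are hypothetical; the actual implementation lives in `src/analysis/trend_analysis.py`:

```python
from collections import Counter

def growth_rate(counts_by_period: dict, topic: str, prev: str, curr: str) -> float:
    """Relative change in a topic's paper count between two periods."""
    before = counts_by_period[prev].get(topic, 0)
    after = counts_by_period[curr].get(topic, 0)
    if before == 0:
        return float("inf") if after > 0 else 0.0
    return (after - before) / before

# Hypothetical per-year topic counts
counts = {
    "2022": Counter({"diffusion models": 40, "graph kernels": 30}),
    "2023": Counter({"diffusion models": 70, "graph kernels": 24}),
}

rate = growth_rate(counts, "diffusion models", "2022", "2023")
emerging = rate >= 0.5  # the 50%+ emerging-topic threshold
```

Here the hypothetical topic grows from 40 to 70 papers, a 75% increase, so it clears the emerging threshold.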
Research gaps are identified via embedding density estimation:
Methodology:
- Compute k-nearest neighbor distances in embedding space
- Estimate local density as inverse of mean k-NN distance
- Papers in low-density regions are potentially under-explored
- Papers in high-density regions represent well-studied areas
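The steps above can be sketched with scikit-learn's `NearestNeighbors` (toy vectors stand in for paper embeddings, and the helper name is illustrative):

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def knn_density(embeddings: np.ndarray, k: int = 5) -> np.ndarray:
    """Local density as the inverse of the mean distance to the k nearest neighbors."""
    nn = NearestNeighbors(n_neighbors=k + 1).fit(embeddings)  # +1: each point is its own nearest neighbor
    dists, _ = nn.kneighbors(embeddings)
    mean_dist = dists[:, 1:].mean(axis=1)  # skip the self-distance column
    return 1.0 / (mean_dist + 1e-12)

rng = np.random.default_rng(42)
dense_region = rng.normal(0.0, 0.05, size=(50, 8))   # well-studied area
sparse_point = rng.normal(5.0, 0.05, size=(1, 8))    # isolated candidate "gap"
X = np.vstack([dense_region, sparse_point])

density = knn_density(X)
lowest = int(np.argmin(density))  # the isolated point gets the lowest density estimate
```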
Interpretation:
- Low-density regions suggest areas with few research papers in current dataset
- These may represent genuine research opportunities OR irrelevant/niche topics
- Critical caveat: Density is relative to dataset; not absolute gap discovery
- Requires domain expert validation for actionable insights
Gap artifacts: `data/processed/gap_analysis.json` with density distribution statistics
Reproducibility is validated through deterministic execution:
- All components use controlled random seeds (default: 42)
- Two pipeline runs with the same seed produce bitwise-identical results
- Verified via cryptographic hashing of embeddings, topic assignments, and metrics
- Enables peer review and independent replication
Reproducibility check: `experiments/reproducibility_check.py` (exit code 0 = PASS)
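The hashing-based determinism check can be sketched as follows. `fake_pipeline` is a stand-in for a real embedding or clustering run, and `array_digest` is an illustrative name, not the script's actual API:

```python
import hashlib
import numpy as np

def array_digest(arr: np.ndarray) -> str:
    """Cryptographic fingerprint of an array for bitwise-identity checks."""
    return hashlib.sha256(np.ascontiguousarray(arr).tobytes()).hexdigest()

def fake_pipeline(seed: int) -> np.ndarray:
    """Stand-in for an embedding/clustering run with a controlled seed."""
    rng = np.random.default_rng(seed)
    return rng.normal(size=(100, 16))

run_a = array_digest(fake_pipeline(42))
run_b = array_digest(fake_pipeline(42))
identical = run_a == run_b  # same seed, bitwise-identical output
```

Hashing the raw bytes catches even single-bit divergence that a tolerance-based comparison like `np.allclose` would silently accept.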
Evaluation limitations:
- Silhouette and coherence are proxies, not ground truth
- No labeled dataset for extrinsic evaluation
- Unsupervised metrics don't capture domain relevance
- Manual inspection of discovered topics essential before claims
System limitations:
- Density-based gap detection is heuristic; requires expert judgment
- Trend analysis depends on dataset temporal coverage
- BERTopic hyperparameters (min_cluster_size, n_neighbors) significantly affect results
- Evaluation sensitive to embedding model choice
Research scope:
- Offline analysis; no real-time discovery
- Metadata only; PDFs not processed
- No multi-modal analysis (figures, citations, equations)
- No causal inference or predictive modeling
- Sentence-Transformers: https://www.sbert.net/
- BERTopic: https://maartengr.github.io/BERTopic/
- arXiv API: https://arxiv.org/help/api
- Topic Coherence Metrics: Röder et al., 2015
- Silhouette Analysis: Rousseeuw, 1987
Developed as an applied ML research project.