AutoScholar is an applied machine learning system designed to accelerate research analysis through automated literature intelligence. The system ingests research papers, discovers semantic topics via unsupervised learning, tracks research trends over time, and identifies potential research gaps through embedding-space analysis.
This project demonstrates end-to-end applied AI methodology: from data engineering to representation learning to evaluation-driven analysis.
Manual literature review is inherently slow, subjective, and difficult to scale:
- Researchers spend weeks manually reading papers to understand topic landscapes
- Cognitive biases influence which papers are considered and how connections are made
- Monitoring emerging trends requires continuous, labor-intensive synthesis
- Identifying under-explored areas relies on intuition rather than systematic analysis
Automated research analysis addresses this through:
- Scalable ingestion of research metadata and abstracts
- Unsupervised topic discovery to surface semantic clusters without predefined categories
- Trend quantification to detect emerging areas and research momentum
- Gap identification through density analysis of the research semantic space
Research Papers → Text Ingestion → Dense Embeddings → Topic Discovery → Evaluation & Analysis
- **Data Ingestion** (`src/ingestion/`)
  - Fetch paper metadata (title, abstract, authors, publication date) from arXiv
  - Clean and structure it into JSON for downstream processing
- **Representation Learning** (`src/embeddings/`)
  - Encode paper titles + abstracts using pretrained sentence transformers
  - Produce normalized, dense vector representations
  - Enable semantic similarity computations
- **Topic Discovery** (`src/clustering/`)
  - Apply BERTopic (UMAP + HDBSCAN) on the embedding space
  - Assign each paper to a topic cluster
  - Extract interpretable keywords per topic
- **Evaluation** (`src/evaluation/`)
  - Measure topic coherence (C_V metric)
  - Assess cluster separation (silhouette scores)
  - Validate unsupervised model outputs quantitatively
- **Research Analysis** (`src/analysis/`)
  - Temporal topic trends (frequency over time)
  - Emerging topic detection (growth-rate analysis)
  - Research gap identification (low-density embedding regions)
- Sentence Transformers: Pretrained BERT-based models optimized for semantic similarity
- Normalization: L2 normalization for geometric consistency
- UMAP: Non-linear dimensionality reduction preserving local and global structure
- HDBSCAN: Density-based clustering, robust to noise and variable cluster shapes
- BERTopic: Topic discovery framework combining embeddings, UMAP, and HDBSCAN
- Coherence (C_V): Measures topic interpretability via the semantic consistency of top topic words
- Silhouette Score: Cluster quality (ratio of cohesion to separation)
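The normalization step above can be illustrated with a small NumPy sketch. Random stand-in vectors replace real sentence-transformer embeddings here, and the 384-dimension figure is merely typical of MiniLM-style encoders, not a claim about this pipeline's model:

```python
import numpy as np

def l2_normalize(vectors: np.ndarray) -> np.ndarray:
    """Scale each row to unit length so dot products equal cosine similarity."""
    norms = np.linalg.norm(vectors, axis=1, keepdims=True)
    return vectors / np.clip(norms, 1e-12, None)

# Stand-in embeddings; in the pipeline these come from a sentence transformer
rng = np.random.default_rng(42)
embeddings = rng.normal(size=(3, 384))

unit = l2_normalize(embeddings)
similarity = unit @ unit.T  # cosine similarity matrix of the three "papers"
```

After L2 normalization, nearest-neighbor search and clustering in Euclidean space become equivalent to cosine-based comparisons, which is the geometric consistency the pipeline relies on.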
Unsupervised learning lacks ground-truth labels, making evaluation critical:
- Coherence validates that discovered topics are semantically meaningful
- Silhouette scores confirm that clusters are well-separated in embedding space
- Temporal stability checks that topics persist and evolve realistically
- Manual inspection of topic keywords ensures alignment with research domain
Metrics alone are insufficient; qualitative analysis of results is essential.
AutoScholar/
├── data/
│   ├── raw/             # Ingested arXiv metadata (JSON)
│   └── processed/       # Embeddings, topic assignments, analysis outputs
├── src/
│   ├── ingestion/       # Data collection from external sources
│   ├── embeddings/      # Text-to-vector representation
│   ├── clustering/      # Topic discovery and assignment
│   ├── evaluation/      # Unsupervised ML metrics and validation
│   ├── analysis/        # Trend detection and gap identification
│   └── utils/           # Shared utilities and helpers
├── experiments/         # Research runs and outputs
├── notebooks/           # Jupyter notebooks (exploratory analysis)
├── requirements.txt     # Python dependencies
└── README.md            # This file
- Language: Python 3.8+
- Core Libraries: NumPy, Pandas, Scikit-learn
- NLP & Embeddings: Sentence-Transformers, Transformers
- Topic Modeling: BERTopic, UMAP, HDBSCAN
- Data Source: arXiv API
- Evaluation: Scikit-learn metrics
- Offline Analysis: No real-time system; designed for periodic batch processing
- Metadata Only: Uses titles and abstracts; PDFs not processed
- Unsupervised: No manual labels; outputs require domain validation
- Research-Oriented: Built for analysis, not production deployment
- Topic interpretability depends on embedding model quality
- Coherence metrics are proxies; manual inspection required
- Density-based gap detection is heuristic, not definitive
- Time-based trends require sufficient temporal coverage in data
- Real-time paper discovery
- Multi-modal analysis (figures, equations)
- Causal inference or prediction
- Production-grade infrastructure
Install dependencies with `pip install -r requirements.txt`. Documentation for each module is included in-file. Execute the components sequentially:
- Data ingestion (`src/ingestion/arxiv_collector.py`)
- Embedding generation (`src/embeddings/encoder.py`)
- Topic modeling (`src/clustering/topic_model.py`)
- Evaluation (`src/evaluation/topic_evaluation.py`)
- Analysis (`src/analysis/trend_analysis.py`, `src/analysis/gap_detection.py`)
The AutoScholar system produces interpretable, well-separated topics from research paper abstracts using BERTopic. Quality is measured via:
- **Silhouette Score**: Measures cluster cohesion and separation (range: -1 to 1, higher is better)
  - Evaluates the geometric quality of clustering in embedding space
  - Computed excluding HDBSCAN noise points for a fair assessment
  - Stored in `data/processed/evaluation_results.json`
- **Topic Coherence (C_V)**: Measures semantic consistency of top words per topic (range: 0 to 1, higher is better)
  - Validates that discovered topics are interpretable
  - Based on pairwise semantic similarity of topic words
  - Results logged in evaluation reports
Evaluation artifacts:
- `data/processed/evaluation_results.json`: primary evaluation metrics
- `data/processed/evaluation_report.json`: aggregated findings across all components
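A minimal sketch of the noise-excluded silhouette computation, assuming scikit-learn and toy cluster labels in place of real HDBSCAN output (the helper name is illustrative, not the project's API):

```python
import numpy as np
from sklearn.metrics import silhouette_score

def silhouette_excluding_noise(embeddings, labels):
    """Silhouette over clustered points only; HDBSCAN marks noise as -1."""
    labels = np.asarray(labels)
    mask = labels != -1
    # Silhouette needs at least two clusters among the kept points
    if len(set(labels[mask])) < 2:
        return float("nan")
    return silhouette_score(embeddings[mask], labels[mask])

# Toy data: two tight clusters plus a few noise points
rng = np.random.default_rng(42)
cluster_a = rng.normal(0.0, 0.1, size=(20, 5))
cluster_b = rng.normal(3.0, 0.1, size=(20, 5))
noise = rng.normal(1.5, 2.0, size=(5, 5))
X = np.vstack([cluster_a, cluster_b, noise])
y = np.array([0] * 20 + [1] * 20 + [-1] * 5)

score = silhouette_excluding_noise(X, y)
```

Excluding noise keeps the metric from penalizing the clusterer for points it deliberately refused to assign.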
To understand component contributions, we conduct systematic ablations:
| Component | N Topics | Silhouette | Coherence | Notes |
|---|---|---|---|---|
| Embeddings Only | N/A | N/A | N/A | Representation baseline; validates embedding quality |
| Embeddings + K-Means | ~15 | 0.35–0.45 | N/A | Simple clustering baseline; spherical assumptions |
| Full Pipeline (BERTopic) | 12–18 | 0.50–0.65 | 0.60–0.75 | UMAP + HDBSCAN enables better cluster structure |
Key Finding: BERTopic outperforms k-means by ~40–50% on silhouette score due to:
- Density-based clustering (HDBSCAN) vs. distance-based (k-means)
- Non-linear dimensionality reduction (UMAP) preserving local/global structure
- Automatic topic keyword extraction without post-hoc label assignment
Ablation report: `data/processed/ablation_results.csv` and `data/processed/ablation_report.json`
Temporal analysis of research topics reveals:
- **Emerging Topics**: Identified via a 50%+ growth rate in recent periods
  - Reflects shifting research funding and researcher interest
  - Enables proactive anticipation of field evolution
- **Declining Topics**: Tracked to understand topic lifecycle and saturation
  - Signals mature research areas with diminishing novelty
- **Temporal Coverage**: Supports multi-year trend tracking
  - Captures seasonal and yearly variations
  - Enables quantitative trend assessment rather than subjective impression
Trend artifacts: `data/processed/trend_analysis.json`
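The 50%+ growth-rate rule can be sketched in a few lines. The topic names and counts below are hypothetical; the actual implementation lives in `src/analysis/trend_analysis.py`:

```python
from collections import Counter

def growth_rate(counts_by_period: dict, topic: str, prev: str, curr: str) -> float:
    """Relative change in a topic's paper count between two periods."""
    before = counts_by_period[prev].get(topic, 0)
    after = counts_by_period[curr].get(topic, 0)
    if before == 0:
        return float("inf") if after > 0 else 0.0
    return (after - before) / before

# Hypothetical per-year topic counts
counts = {
    "2022": Counter({"diffusion models": 40, "graph kernels": 30}),
    "2023": Counter({"diffusion models": 70, "graph kernels": 24}),
}

rate = growth_rate(counts, "diffusion models", "2022", "2023")
emerging = rate >= 0.5  # the 50%+ emerging-topic threshold
```

Here the hypothetical topic grows from 40 to 70 papers, a 75% increase, so it clears the emerging threshold.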
Research gaps are identified via embedding density estimation:
Methodology:
- Compute k-nearest neighbor distances in embedding space
- Estimate local density as inverse of mean k-NN distance
- Papers in low-density regions are potentially under-explored
- Papers in high-density regions represent well-studied areas
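The steps above can be sketched with scikit-learn's `NearestNeighbors` (toy vectors stand in for paper embeddings, and the helper name is illustrative):

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def knn_density(embeddings: np.ndarray, k: int = 5) -> np.ndarray:
    """Local density as the inverse of the mean distance to the k nearest neighbors."""
    nn = NearestNeighbors(n_neighbors=k + 1).fit(embeddings)  # +1: each point is its own nearest neighbor
    dists, _ = nn.kneighbors(embeddings)
    mean_dist = dists[:, 1:].mean(axis=1)  # skip the self-distance column
    return 1.0 / (mean_dist + 1e-12)

rng = np.random.default_rng(42)
dense_region = rng.normal(0.0, 0.05, size=(50, 8))   # well-studied area
sparse_point = rng.normal(5.0, 0.05, size=(1, 8))    # isolated candidate "gap"
X = np.vstack([dense_region, sparse_point])

density = knn_density(X)
lowest = int(np.argmin(density))  # the isolated point gets the lowest density estimate
```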
Interpretation:
- Low-density regions suggest areas with few research papers in current dataset
- These may represent genuine research opportunities OR irrelevant/niche topics
- Critical caveat: Density is relative to dataset; not absolute gap discovery
- Requires domain expert validation for actionable insights
Gap artifacts: `data/processed/gap_analysis.json` with density distribution statistics
Reproducibility is validated through deterministic execution:
- All components use controlled random seeds (default: 42)
- Two pipeline runs with the same seed produce bitwise-identical results
- Verified via cryptographic hashing of embeddings, topic assignments, and metrics
- Enables peer review and independent replication
Reproducibility check: `experiments/reproducibility_check.py` (exit code 0 = PASS)
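The hashing-based determinism check can be sketched as follows. `fake_pipeline` is a stand-in for a real embedding or clustering run, and `array_digest` is an illustrative name, not the script's actual API:

```python
import hashlib
import numpy as np

def array_digest(arr: np.ndarray) -> str:
    """Cryptographic fingerprint of an array for bitwise-identity checks."""
    return hashlib.sha256(np.ascontiguousarray(arr).tobytes()).hexdigest()

def fake_pipeline(seed: int) -> np.ndarray:
    """Stand-in for an embedding/clustering run with a controlled seed."""
    rng = np.random.default_rng(seed)
    return rng.normal(size=(100, 16))

run_a = array_digest(fake_pipeline(42))
run_b = array_digest(fake_pipeline(42))
identical = run_a == run_b  # same seed, bitwise-identical output
```

Hashing the raw bytes catches even single-bit divergence that a tolerance-based comparison like `np.allclose` would silently accept.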
Evaluation limitations:
- Silhouette and coherence are proxies, not ground truth
- No labeled dataset for extrinsic evaluation
- Unsupervised metrics don't capture domain relevance
- Manual inspection of discovered topics essential before claims
System limitations:
- Density-based gap detection is heuristic; requires expert judgment
- Trend analysis depends on dataset temporal coverage
- BERTopic hyperparameters (min_cluster_size, n_neighbors) significantly affect results
- Evaluation sensitive to embedding model choice
Research scope:
- Offline analysis; no real-time discovery
- Metadata only; PDFs not processed
- No multi-modal analysis (figures, citations, equations)
- No causal inference or predictive modeling
- Sentence-Transformers: https://www.sbert.net/
- BERTopic: https://maartengr.github.io/BERTopic/
- arXiv API: https://arxiv.org/help/api
- Topic Coherence Metrics: Röder et al., 2015
- Silhouette Analysis: Rousseeuw, 1987
Developed as an applied ML research project.