Changes from all commits

31 commits
087b37c
exclude data files
kaysiz May 27, 2025
3f8ec70
exclude vscode configs
kaysiz May 27, 2025
dd6579e
Add script to download eupmc files and doi mapping
kaysiz May 27, 2025
9eba5c7
Add repository mapping
kaysiz May 27, 2025
7690c16
add script to process eupmc files
kaysiz Jun 8, 2025
897aff6
Update script to download files to check if folder exists
kaysiz Jun 8, 2025
5d1fcbf
Update the Readme with v4 work
kaysiz Jun 8, 2025
0553378
Add query to combine 2 similar repositories
kaysiz Jul 17, 2025
c6a5b14
Update query to delete the unused repository
kaysiz Jul 17, 2025
5837166
Add new repository mappings
ashwinisukale Jul 18, 2025
22ade42
Move script to cleanup data in a folder
kaysiz Jul 18, 2025
1ab4894
Add script for v4 ror org reconciliation
kaysiz Jul 18, 2025
e1e26d0
Move files
kaysiz Jul 18, 2025
1a38d52
Add deleted assertions from excluded repositories
kaysiz Jul 21, 2025
8c4f57a
Add script to exclude assertion from non valid repositories
kaysiz Jul 22, 2025
a0e07eb
Script to find the match and create assertion affiliation
ashwinisukale Jul 23, 2025
7fbc927
Fixed the csv read attributes to parce csv correctly
ashwinisukale Jul 24, 2025
d97876d
Use the correct csv which has only one true match
ashwinisukale Jul 24, 2025
78b8cf8
process subject mapping for v4
kaysiz Jul 24, 2025
31869d2
Update version for corpus dump and exlude .DS_Store files when zippin…
kaysiz Jul 24, 2025
8d0156f
Merge branch 'ks-data-dump-v4-eumpc' of github.com:datacite/corpus-da…
kaysiz Jul 24, 2025
84b8c5f
Add deleted assertions and opharned affiliations
kaysiz Jul 27, 2025
c49febc
remove unused files
kaysiz Jul 27, 2025
d137708
Move script for affliation matching
kaysiz Jul 27, 2025
6a1e09f
move affiliation mapping script
kaysiz Jul 27, 2025
d9f275f
Update org name reconcialition ids
kaysiz Jul 27, 2025
c712b54
Update subject mapping script
kaysiz Jul 27, 2025
dc11c5f
Add script to exclude assertion from excluded repositories
kaysiz Jul 27, 2025
01211c7
Add comment to update output directory
kaysiz Jul 27, 2025
4fccde2
Add local crossref api to help with faster lookups
kaysiz Aug 11, 2025
e37eac7
Add readme to the crossref api
kaysiz Aug 11, 2025
6 changes: 5 additions & 1 deletion .gitignore
@@ -1,7 +1,11 @@
data-citation-corpus-v1.1-output/**
.env
.vscode/

# Virtual environments
accession_number_validation/**env

accession_number_validation/accession_number_validation_data

corpus-v4/data_ingestion/europepmc_raw_data
corpus-v4/data_ingestion/europepmc_processed_data
56 changes: 55 additions & 1 deletion README.md
@@ -375,4 +375,58 @@ COMMIT;
- **Subjects Removed (without related assertions)**: 0
- **Affiliations Removed (without related assertions)**: 6,199
- **Invalid Citations Removed**: 44,097
- **Final Total Records**: 5,550,041

## V4 Updates: EuropePMC Data Processing

### Overview
Version 4 introduces support for processing EuropePMC dataset citations through automated data collection, mapping, and standardization.

### Features
- Automated downloading of EuropePMC dataset citation files
- PMCID to DOI mapping using multiple sources
- Repository name standardization
- Rate-limited API interactions with both EuropePMC and DataCite
- Persistent caching to improve performance on subsequent runs

### Setup and Usage

#### Step 1: Download EuropePMC Data
Run the downloader script to fetch all necessary files:
```bash
chmod +x ./corpus-v4/eupmc_file_downloader.sh
./corpus-v4/eupmc_file_downloader.sh
```

This script will do the following (an illustrative Python sketch of these steps appears after the list):
- Create a directory for raw data (`europepmc_raw_data`)
- Download CSV files from EuropePMC TextMinedTerms
- Download XML files from EuropePMC PMCXMLData
- Download and extract the PMID-PMCID-DOI mapping file
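
The downloader itself is a shell script; for illustration only, a rough Python sketch of the same flow (directory check, per-file download, mapping-file extraction) might look like the following. The URL and file names below are placeholders, not the ones used by `eupmc_file_downloader.sh`:

```python
# Illustrative only: a rough Python equivalent of the downloader's flow.
# The real work is done by eupmc_file_downloader.sh; the URL below is a placeholder.
import gzip
import shutil
import urllib.request
from pathlib import Path

RAW_DIR = Path("corpus-v4/data_ingestion/europepmc_raw_data")
RAW_DIR.mkdir(parents=True, exist_ok=True)  # no-op if the folder already exists

def download(url: str, dest: Path) -> None:
    """Fetch a single file unless it is already present on disk."""
    if dest.exists():
        return
    with urllib.request.urlopen(url) as response, open(dest, "wb") as out:
        shutil.copyfileobj(response, out)

# Placeholder URL and file name for the PMID-PMCID-DOI mapping file,
# which the script downloads and then extracts.
mapping_gz = RAW_DIR / "PMID_PMCID_DOI.csv.gz"
download("https://example.org/PMID_PMCID_DOI.csv.gz", mapping_gz)
with gzip.open(mapping_gz, "rb") as src, open(mapping_gz.with_suffix(""), "wb") as dst:
    shutil.copyfileobj(src, dst)
```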

#### Step 2: Process the Downloaded Data
Run the Python script to process the files:
```bash
cd corpus-v4
python eupmc_reformat_csv.py
```

The script processes each CSV file to do the following (a minimal sketch of this flow appears after the list):
1. Map PMCIDs to DOIs using:
- The downloaded DOI mapping file
- A local cache of previous API responses
- The EuropePMC API (with rate limiting)
2. Standardize repository names using the provided mapping file
3. Generate formatted CSV files with columns:
- `repository`: Standardized repository name
- `dataset`: Dataset ID
- `publication`: Full DOI URL format (https://doi.org/10.XXXX/XXXXX)
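
A minimal sketch of this flow is shown below, assuming a JSON cache file and a simple per-request sleep for rate limiting. Apart from `repository_mapping.json` and the three output columns, the file, function, and field names are illustrative and are not taken from `eupmc_reformat_csv.py`:

```python
# Minimal sketch of the processing flow described above; not the actual
# eupmc_reformat_csv.py. Names other than repository_mapping.json are illustrative.
import csv
import json
import time
from pathlib import Path

import requests  # assumed dependency for the live EuropePMC lookups

CACHE_FILE = Path("pmcid_doi_cache.json")  # persistent API cache (illustrative name)
cache = json.loads(CACHE_FILE.read_text()) if CACHE_FILE.exists() else {}

with open("repository_mapping.json") as f:
    repo_mapping = json.load(f)  # repository code -> standardized name

def pmcid_to_doi(pmcid: str, doi_mapping: dict) -> str | None:
    """Resolve a PMCID to a DOI: mapping file first, then cache, then the EuropePMC API."""
    if pmcid in doi_mapping:
        return doi_mapping[pmcid]
    if pmcid in cache:
        return cache[pmcid]
    time.sleep(1)  # crude rate limiting between live API calls
    r = requests.get(
        "https://www.ebi.ac.uk/europepmc/webservices/rest/search",
        params={"query": f"PMCID:{pmcid}", "format": "json"},
        timeout=30,
    )
    r.raise_for_status()
    results = r.json().get("resultList", {}).get("result", [])
    doi = results[0].get("doi") if results else None
    cache[pmcid] = doi
    return doi

def write_output(rows, doi_mapping: dict, out_path: Path) -> None:
    """Write one formatted CSV with the three output columns."""
    with open(out_path, "w", newline="") as out:
        writer = csv.DictWriter(out, fieldnames=["repository", "dataset", "publication"])
        writer.writeheader()
        for row in rows:  # each row assumed as {"repository": code, "dataset": id, "pmcid": ...}
            doi = pmcid_to_doi(row["pmcid"], doi_mapping)
            if not doi:
                continue
            writer.writerow({
                "repository": repo_mapping.get(row["repository"], row["repository"]),
                "dataset": row["dataset"],
                "publication": f"https://doi.org/{doi}",
            })
    CACHE_FILE.write_text(json.dumps(cache))  # persist the cache for subsequent runs
```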

### File Descriptions
- `eupmc_file_downloader.sh`: Downloads necessary files from EuropePMC
- `eupmc_reformat_csv.py`: Processes downloaded CSVs and creates formatted output
- `repository_mapping.json`: Maps repository codes to standardized names

### Output
The script generates formatted CSV files in the `europepmc_processed_data` directory, with one file per repository dataset. It also maintains an API cache to improve performance on subsequent runs.
97 changes: 97 additions & 0 deletions corpus-v4/crossref_local_api/README.md
@@ -0,0 +1,97 @@
# Crossref Academic Papers API

A Python API for querying academic papers from Crossref data stored in a SQLite database on an external drive.

## Features

- **External Drive Support**: Store a 200GB+ database on an external drive
- **Fast Queries**: Indexed database for sub-second searches
- **RESTful API**: Easy-to-use web interface
- **Flexible Search**: Search by DOI, author, journal, title, year, or combinations
- **Batch Import**: Efficiently import thousands of .jsonl.gz files
- **Live Crossref Integration**: Fetch real-time data from Crossref API in XML/JSON formats
- **Data Comparison**: Compare local database entries with live Crossref data

## Setup Instructions

### 1. Install Dependencies

```bash
pip install -r requirements.txt
```

### 2. Configure External Drive Path

Edit both `crossref_api.py` and `crossref_web_api.py` to set your external drive path:

**For macOS:**
```python
EXTERNAL_DRIVE_PATH = "/Volumes/MyExternalDrive"
```

**For Windows:**
```python
EXTERNAL_DRIVE_PATH = "D:" # or whatever your external drive letter is
```

**For Linux:**
```python
EXTERNAL_DRIVE_PATH = "/mnt/external"
```
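
Whichever platform you are on, the scripts combine this path with the database file name (`crossref.db`, per the path examples below). A tiny defensive check along these lines might help catch an unmounted drive early; the exact variable handling in the scripts may differ:

```python
# Sketch of resolving the database path and checking that the drive is mounted.
import os

EXTERNAL_DRIVE_PATH = "/Volumes/MyExternalDrive"  # adjust for your platform as shown above

DB_PATH = os.path.join(EXTERNAL_DRIVE_PATH, "crossref.db")
if not os.path.isdir(EXTERNAL_DRIVE_PATH):
    raise SystemExit(f"External drive not mounted at {EXTERNAL_DRIVE_PATH}")
```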

### 3. Import Your Data

Run the import script to load your Crossref .jsonl.gz files:

```bash
python crossref_api.py
```

This will:
- Create the SQLite database on your external drive
- Import all .jsonl.gz files from your data directory
- Create indexes for fast searching
- Show import progress

**Note**: For 200GB of data, this may take several hours depending on your drive speed.
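
The import itself is handled by `crossref_api.py`; the following is only a rough illustration of the batch approach. The record fields follow the public Crossref metadata format, the table and columns mirror the schema described below, and everything else (paths, directory layout) is an assumption:

```python
# Illustrative batch-import sketch; crossref_api.py is the actual importer.
# Assumes the `papers` table already exists (see the schema section below).
import gzip
import json
import sqlite3
from pathlib import Path

DB_PATH = "/Volumes/MyExternalDrive/crossref.db"  # see the path examples below
DATA_DIR = Path("crossref_dumps")                 # directory of .jsonl.gz files (assumed)

def import_file(conn: sqlite3.Connection, path: Path) -> None:
    """Load one .jsonl.gz dump file and insert its records in a single batch."""
    rows = []
    with gzip.open(path, "rt", encoding="utf-8") as fh:
        for line in fh:
            rec = json.loads(line)
            date_parts = rec.get("issued", {}).get("date-parts") or [[]]
            year = date_parts[0][0] if date_parts[0] else None
            rows.append((
                rec.get("DOI"),
                (rec.get("title") or [None])[0],
                (rec.get("container-title") or [None])[0],
                year,
                rec.get("publisher"),
            ))
    conn.executemany(
        "INSERT OR IGNORE INTO papers (doi, title, journal, year, publisher) VALUES (?, ?, ?, ?, ?)",
        rows,
    )
    conn.commit()

conn = sqlite3.connect(DB_PATH)
for gz in sorted(DATA_DIR.glob("*.jsonl.gz")):
    import_file(conn, gz)
    print(f"Imported {gz.name}")
```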

### 4. Start the Web API

```bash
uvicorn crossref_web_api:app --reload
```
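
Once the server is running (by default on http://127.0.0.1:8000), it can be queried over HTTP. The route and parameter names below are hypothetical — check `crossref_web_api.py` for the actual endpoints:

```python
# Hypothetical usage example: the /search route and its parameters are assumptions;
# consult crossref_web_api.py for the real endpoints and response shape.
import requests

BASE_URL = "http://127.0.0.1:8000"

resp = requests.get(f"{BASE_URL}/search", params={"journal": "Nature", "year": 2020}, timeout=30)
resp.raise_for_status()
for paper in resp.json():
    print(paper.get("doi"), paper.get("title"))
```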

## Database Schema

The SQLite database contains a single `papers` table with the following structure (a sketch of the corresponding table definition follows the list):

- `id`: Primary key
- `doi`: Digital Object Identifier (unique)
- `title`: Paper title
- `journal`: Journal name
- `year`: Publication year
- `publisher`: Publisher name
- `created_at`, `indexed_at`: Timestamps
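
A sketch of how this table and its search indexes might be created is shown below; column types and index choices are assumptions, and the authoritative definition lives in `crossref_api.py`:

```python
# Assumed DDL; the authoritative schema is created by crossref_api.py.
import sqlite3

conn = sqlite3.connect("/Volumes/MyExternalDrive/crossref.db")  # adjust to your drive
conn.executescript("""
CREATE TABLE IF NOT EXISTS papers (
    id          INTEGER PRIMARY KEY,
    doi         TEXT UNIQUE,
    title       TEXT,
    journal     TEXT,
    year        INTEGER,
    publisher   TEXT,
    created_at  TEXT,
    indexed_at  TEXT
);
-- Indexes for the search fields mentioned above (assumed)
CREATE INDEX IF NOT EXISTS idx_papers_journal ON papers(journal);
CREATE INDEX IF NOT EXISTS idx_papers_year    ON papers(year);
CREATE INDEX IF NOT EXISTS idx_papers_title   ON papers(title);
""")
conn.commit()
```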

## Performance Tips

### External Drive Performance
- **USB 3.0+**: Use USB 3.0 or higher for better performance
- **SSD**: External SSDs are much faster than mechanical drives
- **Connection**: Direct connection is faster than through hubs


## External Drive Path Examples

**macOS:**
- External drive: `/Volumes/MyDrive/crossref.db`
- Network drive: `/Volumes/NetworkDrive/crossref.db`

**Windows:**
- External drive: `D:/crossref.db`
- Network drive: `\\NetworkDrive\crossref.db`

**Linux:**
- External drive: `/mnt/external/crossref.db`
- Network drive: `/mnt/network/crossref.db`
