Changes from all commits

31 commits
087b37c
exclude data files
kaysiz May 27, 2025
3f8ec70
exclude vscode configs
kaysiz May 27, 2025
dd6579e
Add script to download eupmc files and doi mapping
kaysiz May 27, 2025
9eba5c7
Add repository mapping
kaysiz May 27, 2025
7690c16
add script to process eupmc files
kaysiz Jun 8, 2025
897aff6
Update script to download files to check if folder exists
kaysiz Jun 8, 2025
5d1fcbf
Update the Readme with v4 work
kaysiz Jun 8, 2025
0553378
Add query to combine 2 similar repositories
kaysiz Jul 17, 2025
c6a5b14
Update query to delete the unused repository
kaysiz Jul 17, 2025
5837166
Add new repository mappings
ashwinisukale Jul 18, 2025
22ade42
Move script to cleanup data in a folder
kaysiz Jul 18, 2025
1ab4894
Add script for v4 ror org reconciliation
kaysiz Jul 18, 2025
e1e26d0
Move files
kaysiz Jul 18, 2025
1a38d52
Add deleted assertions from excluded repositories
kaysiz Jul 21, 2025
8c4f57a
Add script to exclude assertion from non valid repositories
kaysiz Jul 22, 2025
a0e07eb
Script to find the match and create assertion affiliation
ashwinisukale Jul 23, 2025
7fbc927
Fixed the csv read attributes to parce csv correctly
ashwinisukale Jul 24, 2025
d97876d
Use the correct csv which has only one true match
ashwinisukale Jul 24, 2025
78b8cf8
process subject mapping for v4
kaysiz Jul 24, 2025
31869d2
Update version for corpus dump and exlude .DS_Store files when zippin…
kaysiz Jul 24, 2025
8d0156f
Merge branch 'ks-data-dump-v4-eumpc' of github.com:datacite/corpus-da…
kaysiz Jul 24, 2025
84b8c5f
Add deleted assertions and opharned affiliations
kaysiz Jul 27, 2025
c49febc
remove unused files
kaysiz Jul 27, 2025
d137708
Move script for affliation matching
kaysiz Jul 27, 2025
6a1e09f
move affiliation mapping script
kaysiz Jul 27, 2025
d9f275f
Update org name reconcialition ids
kaysiz Jul 27, 2025
c712b54
Update subject mapping script
kaysiz Jul 27, 2025
dc11c5f
Add script to exclude assertion from excluded repositories
kaysiz Jul 27, 2025
01211c7
Add comment to update output directory
kaysiz Jul 27, 2025
4fccde2
Add local crossref api to help with faster lookups
kaysiz Aug 11, 2025
e37eac7
Add readme to the crossref api
kaysiz Aug 11, 2025
6 changes: 5 additions & 1 deletion .gitignore
@@ -1,7 +1,11 @@
data-citation-corpus-v1.1-output/**
.env
.vscode/

# Virtual environments
accession_number_validation/**env

accession_number_validation/accession_number_validation_data

corpus-v4/data_ingestion/europepmc_raw_data
corpus-v4/data_ingestion/europepmc_processed_data
56 changes: 55 additions & 1 deletion README.md
@@ -375,4 +375,58 @@ COMMIT;
- **Subjects Removed (without related assertions)**: 0
- **Affiliations Removed (without related assertions)**: 6,199
- **Invalid Citations Removed**: 44,097
- **Final Total Records**: 5,550,041

## V4 Updates: EuropePMC Data Processing

### Overview
Version 4 introduces support for processing EuropePMC dataset citations through automated data collection, mapping, and standardization.

### Features
- Automated downloading of EuropePMC dataset citation files
- PMCID to DOI mapping using multiple sources
- Repository name standardization
- Rate-limited API interactions with both EuropePMC and DataCite
- Persistent caching to improve performance on subsequent runs

### Setup and Usage

#### Step 1: Download EuropePMC Data
Run the downloader script to fetch all necessary files:
```bash
chmod +x ./corpus-v4/eupmc_file_downloader.sh
./corpus-v4/eupmc_file_downloader.sh
```

This script will do the following (an illustrative Python sketch of these steps appears after the list):
- Create a directory for raw data (`europepmc_raw_data`)
- Download CSV files from EuropePMC TextMinedTerms
- Download XML files from EuropePMC PMCXMLData
- Download and extract the PMID-PMCID-DOI mapping file
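
The downloader itself is a shell script; for illustration only, a rough Python sketch of the same flow (directory check, per-file download, mapping-file extraction) might look like the following. The URL and file names below are placeholders, not the ones used by `eupmc_file_downloader.sh`:

```python
# Illustrative only: a rough Python equivalent of the downloader's flow.
# The real work is done by eupmc_file_downloader.sh; the URL below is a placeholder.
import gzip
import shutil
import urllib.request
from pathlib import Path

RAW_DIR = Path("corpus-v4/data_ingestion/europepmc_raw_data")
RAW_DIR.mkdir(parents=True, exist_ok=True)  # no-op if the folder already exists

def download(url: str, dest: Path) -> None:
    """Fetch a single file unless it is already present on disk."""
    if dest.exists():
        return
    with urllib.request.urlopen(url) as response, open(dest, "wb") as out:
        shutil.copyfileobj(response, out)

# Placeholder URL and file name for the PMID-PMCID-DOI mapping file,
# which the script downloads and then extracts.
mapping_gz = RAW_DIR / "PMID_PMCID_DOI.csv.gz"
download("https://example.org/PMID_PMCID_DOI.csv.gz", mapping_gz)
with gzip.open(mapping_gz, "rb") as src, open(mapping_gz.with_suffix(""), "wb") as dst:
    shutil.copyfileobj(src, dst)
```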

#### Step 2: Process the Downloaded Data
Run the Python script to process the files:
```bash
cd corpus-v4
python eupmc_reformat_csv.py
```

The script processes each CSV file to do the following (a minimal sketch of this flow appears after the list):
1. Map PMCIDs to DOIs using:
- The downloaded DOI mapping file
- A local cache of previous API responses
- The EuropePMC API (with rate limiting)
2. Standardize repository names using the provided mapping file
3. Generate formatted CSV files with columns:
- `repository`: Standardized repository name
- `dataset`: Dataset ID
- `publication`: Full DOI URL format (https://doi.org/10.XXXX/XXXXX)
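
A minimal sketch of this flow is shown below, assuming a JSON cache file and a simple per-request sleep for rate limiting. Apart from `repository_mapping.json` and the three output columns, the file, function, and field names are illustrative and are not taken from `eupmc_reformat_csv.py`:

```python
# Minimal sketch of the processing flow described above; not the actual
# eupmc_reformat_csv.py. Names other than repository_mapping.json are illustrative.
import csv
import json
import time
from pathlib import Path

import requests  # assumed dependency for the live EuropePMC lookups

CACHE_FILE = Path("pmcid_doi_cache.json")  # persistent API cache (illustrative name)
cache = json.loads(CACHE_FILE.read_text()) if CACHE_FILE.exists() else {}

with open("repository_mapping.json") as f:
    repo_mapping = json.load(f)  # repository code -> standardized name

def pmcid_to_doi(pmcid: str, doi_mapping: dict) -> str | None:
    """Resolve a PMCID to a DOI: mapping file first, then cache, then the EuropePMC API."""
    if pmcid in doi_mapping:
        return doi_mapping[pmcid]
    if pmcid in cache:
        return cache[pmcid]
    time.sleep(1)  # crude rate limiting between live API calls
    r = requests.get(
        "https://www.ebi.ac.uk/europepmc/webservices/rest/search",
        params={"query": f"PMCID:{pmcid}", "format": "json"},
        timeout=30,
    )
    r.raise_for_status()
    results = r.json().get("resultList", {}).get("result", [])
    doi = results[0].get("doi") if results else None
    cache[pmcid] = doi
    return doi

def write_output(rows, doi_mapping: dict, out_path: Path) -> None:
    """Write one formatted CSV with the three output columns."""
    with open(out_path, "w", newline="") as out:
        writer = csv.DictWriter(out, fieldnames=["repository", "dataset", "publication"])
        writer.writeheader()
        for row in rows:  # each row assumed as {"repository": code, "dataset": id, "pmcid": ...}
            doi = pmcid_to_doi(row["pmcid"], doi_mapping)
            if not doi:
                continue
            writer.writerow({
                "repository": repo_mapping.get(row["repository"], row["repository"]),
                "dataset": row["dataset"],
                "publication": f"https://doi.org/{doi}",
            })
    CACHE_FILE.write_text(json.dumps(cache))  # persist the cache for subsequent runs
```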

### File Descriptions
- `eupmc_file_downloader.sh`: Downloads necessary files from EuropePMC
- `eupmc_reformat_csv.py`: Processes downloaded CSVs and creates formatted output
- `repository_mapping.json`: Maps repository codes to standardized names

### Output
The script generates formatted CSV files in the `europepmc_processed_data` directory, with one file per repository dataset. It also maintains an API cache to improve performance on subsequent runs.
97 changes: 97 additions & 0 deletions corpus-v4/crossref_local_api/README.md
@@ -0,0 +1,97 @@
# Crossref Academic Papers API

A Python API for querying academic papers from Crossref data stored in a SQLite database on an external drive.

## Features

- **External Drive Support**: Store a 200GB+ database on an external drive
- **Fast Queries**: Indexed database for sub-second searches
- **RESTful API**: Easy-to-use web interface
- **Flexible Search**: Search by DOI, author, journal, title, year, or combinations
- **Batch Import**: Efficiently import thousands of .jsonl.gz files
- **Live Crossref Integration**: Fetch real-time data from Crossref API in XML/JSON formats
- **Data Comparison**: Compare local database entries with live Crossref data

## Setup Instructions

### 1. Install Dependencies

```bash
pip install -r requirements.txt
```

### 2. Configure External Drive Path

Edit both `crossref_api.py` and `crossref_web_api.py` to set your external drive path:

**For macOS:**
```python
EXTERNAL_DRIVE_PATH = "/Volumes/MyExternalDrive"
```

**For Windows:**
```python
EXTERNAL_DRIVE_PATH = "D:" # or whatever your external drive letter is
```

**For Linux:**
```python
EXTERNAL_DRIVE_PATH = "/mnt/external"
```
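
Whichever platform you are on, the scripts combine this path with the database file name (`crossref.db`, per the path examples below). A tiny defensive check along these lines might help catch an unmounted drive early; the exact variable handling in the scripts may differ:

```python
# Sketch of resolving the database path and checking that the drive is mounted.
import os

EXTERNAL_DRIVE_PATH = "/Volumes/MyExternalDrive"  # adjust for your platform as shown above

DB_PATH = os.path.join(EXTERNAL_DRIVE_PATH, "crossref.db")
if not os.path.isdir(EXTERNAL_DRIVE_PATH):
    raise SystemExit(f"External drive not mounted at {EXTERNAL_DRIVE_PATH}")
```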

### 3. Import Your Data

Run the import script to load your Crossref .jsonl.gz files:

```bash
python crossref_api.py
```

This will:
- Create the SQLite database on your external drive
- Import all .jsonl.gz files from your data directory
- Create indexes for fast searching
- Show import progress

**Note**: For 200GB of data, this may take several hours depending on your drive speed.
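
The import itself is handled by `crossref_api.py`; the following is only a rough illustration of the batch approach. The record fields follow the public Crossref metadata format, the table and columns mirror the schema described below, and everything else (paths, directory layout) is an assumption:

```python
# Illustrative batch-import sketch; crossref_api.py is the actual importer.
# Assumes the `papers` table already exists (see the schema section below).
import gzip
import json
import sqlite3
from pathlib import Path

DB_PATH = "/Volumes/MyExternalDrive/crossref.db"  # see the path examples below
DATA_DIR = Path("crossref_dumps")                 # directory of .jsonl.gz files (assumed)

def import_file(conn: sqlite3.Connection, path: Path) -> None:
    """Load one .jsonl.gz dump file and insert its records in a single batch."""
    rows = []
    with gzip.open(path, "rt", encoding="utf-8") as fh:
        for line in fh:
            rec = json.loads(line)
            date_parts = rec.get("issued", {}).get("date-parts") or [[]]
            year = date_parts[0][0] if date_parts[0] else None
            rows.append((
                rec.get("DOI"),
                (rec.get("title") or [None])[0],
                (rec.get("container-title") or [None])[0],
                year,
                rec.get("publisher"),
            ))
    conn.executemany(
        "INSERT OR IGNORE INTO papers (doi, title, journal, year, publisher) VALUES (?, ?, ?, ?, ?)",
        rows,
    )
    conn.commit()

conn = sqlite3.connect(DB_PATH)
for gz in sorted(DATA_DIR.glob("*.jsonl.gz")):
    import_file(conn, gz)
    print(f"Imported {gz.name}")
```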

### 4. Start the Web API

```bash
uvicorn crossref_web_api:app --reload
```
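
Once the server is running (by default on http://127.0.0.1:8000), it can be queried over HTTP. The route and parameter names below are hypothetical — check `crossref_web_api.py` for the actual endpoints:

```python
# Hypothetical usage example: the /search route and its parameters are assumptions;
# consult crossref_web_api.py for the real endpoints and response shape.
import requests

BASE_URL = "http://127.0.0.1:8000"

resp = requests.get(f"{BASE_URL}/search", params={"journal": "Nature", "year": 2020}, timeout=30)
resp.raise_for_status()
for paper in resp.json():
    print(paper.get("doi"), paper.get("title"))
```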

## Database Schema

The SQLite database contains a single `papers` table with the following structure (a sketch of the corresponding table definition follows the list):

- `id`: Primary key
- `doi`: Digital Object Identifier (unique)
- `title`: Paper title
- `journal`: Journal name
- `year`: Publication year
- `publisher`: Publisher name
- `created_at`, `indexed_at`: Timestamps
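
A sketch of how this table and its search indexes might be created is shown below; column types and index choices are assumptions, and the authoritative definition lives in `crossref_api.py`:

```python
# Assumed DDL; the authoritative schema is created by crossref_api.py.
import sqlite3

conn = sqlite3.connect("/Volumes/MyExternalDrive/crossref.db")  # adjust to your drive
conn.executescript("""
CREATE TABLE IF NOT EXISTS papers (
    id          INTEGER PRIMARY KEY,
    doi         TEXT UNIQUE,
    title       TEXT,
    journal     TEXT,
    year        INTEGER,
    publisher   TEXT,
    created_at  TEXT,
    indexed_at  TEXT
);
-- Indexes for the search fields mentioned above (assumed)
CREATE INDEX IF NOT EXISTS idx_papers_journal ON papers(journal);
CREATE INDEX IF NOT EXISTS idx_papers_year    ON papers(year);
CREATE INDEX IF NOT EXISTS idx_papers_title   ON papers(title);
""")
conn.commit()
```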

## Performance Tips

### External Drive Performance
- **USB 3.0+**: Use USB 3.0 or higher for better performance
- **SSD**: External SSDs are much faster than mechanical drives
- **Connection**: Direct connection is faster than through hubs


## External Drive Path Examples

**macOS:**
- External drive: `/Volumes/MyDrive/crossref.db`
- Network drive: `/Volumes/NetworkDrive/crossref.db`

**Windows:**
- External drive: `D:/crossref.db`
- Network drive: `\\NetworkDrive\crossref.db`

**Linux:**
- External drive: `/mnt/external/crossref.db`
- Network drive: `/mnt/network/crossref.db`
