Europe PMC data ingestion script #9
Open: ashwinisukale wants to merge 24 commits into main from eupmc_ingestion
Commits
0d7ad49 Europe PMC data ingestion script (ashwinisukale)
f28ca74 process eroupepmc row (ashwinisukale)
ee13850 Process CSV files from local (ashwinisukale)
86a9ca9 As per CSV column name changed (ashwinisukale)
6420bd0 500 citation per activity (ashwinisukale)
168e63d Fixed issue with memory heap (ashwinisukale)
d230e0a Corrected name (ashwinisukale)
29dded9 Fetch files from s3 and process them (ashwinisukale)
1211d02 rename method and use s3 (kaysiz)
8811820 Update importer to process all umnprocessed activity logs (kaysiz)
6ca99c8 Add method to process s3 files and excluded already processed ones (kaysiz)
102017a Add model to for czifile table (kaysiz)
840a5ab Add eupmcfile to process the files from s3 (kaysiz)
33db726 fix doibaseurl reference (kaysiz)
1686f1e Update processing of eumpc activity logs in parallel and handle publi… (kaysiz)
6cfb878 Add repository handling for crossref records (kaysiz)
f2491f6 update parallel processes number and update activity log when it is d… (kaysiz)
766c690 Add logs volume and prod db for v4 (kaysiz)
6e896ca update docker image for local dev (kaysiz)
6460cd9 Update datacite import dates and local crossref api (kaysiz)
4bc2ac7 remove unused code (kaysiz)
6b0aa36 undo delete of file (kaysiz)
8c99b15 Add v4 release documentation (kaysiz)
8b8afce Update docker image (kaysiz)
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,56 @@ | ||
| version: '3' | ||
| | ||
| services: | ||
|   importer: | ||
|     platform: linux/amd64 | ||
|     build: | ||
|       context: ./packages/server | ||
|       dockerfile: ./Dockerfile-development | ||
|     entrypoint: | ||
|       [ | ||
|         'node_modules/.bin/wait-for-it', | ||
|         'corpus-v4.cpcwgoa3uzw1.eu-west-1.rds.amazonaws.com:5432', | ||
|         '--', | ||
|         'sh', | ||
|         'scripts/setupDevServer.sh', | ||
|       ] | ||
|     command: | ||
|       [ | ||
|         'node_modules/.bin/nodemon', | ||
|         '--max-old-space-size=8192', | ||
|         'epmcImport.js', | ||
|         '--watch', | ||
|         '--ext', | ||
|         'js,graphql', | ||
|       ] | ||
|     ports: | ||
|       - 3000:3000 | ||
|     environment: | ||
|       - NODE_ENV=development | ||
|       - POSTGRES_HOST=${POSTGRES_HOST} | ||
|       - POSTGRES_PORT=${POSTGRES_PORT:-5432} | ||
|       - POSTGRES_DB=${POSTGRES_DB:-dev_db} | ||
|       - POSTGRES_USER=${POSTGRES_USER:-dev_user} | ||
|       - POSTGRES_PASSWORD=${POSTGRES_PASSWORD:-dev_user_password} | ||
|       - PUBSWEET_SECRET=${PUBSWEET_SECRET:-superSecretThing} | ||
|       - SERVER_PORT=${SERVER_PORT:-3000} | ||
|       - HOSTNAME=${HOSTNAME} | ||
|       - CLIENT_URL=${CLIENT_URL:-http://0.0.0.0:4000} | ||
|       - PASSWORD_RESET_PATH=${PASSWORD_RESET_PATH:-password-reset} | ||
|       - S3_PROTOCOL=http | ||
|       - S3_HOST=filehosting | ||
|       - S3_PORT=${S3_PORT:-9000} | ||
|       - S3_ACCESS_KEY_ID=${S3_ACCESS_KEY_ID:-nonRootUser} | ||
|       - S3_SECRET_ACCESS_KEY=${S3_SECRET_ACCESS_KEY:-nonRootPassword} | ||
|       - S3_EUROPEPMC_BUCKET=${S3_EUROPEPMC_BUCKET} | ||
|       - S3_EUROPEPMC_FOLDER=${S3_EUROPEPMC_FOLDER} | ||
|     volumes: | ||
|       - ./packages/server/api:/home/node/app/api | ||
|       - ./packages/server/config:/home/node/app/config | ||
|       - ./packages/server/controllers:/home/node/app/controllers | ||
|       - ./packages/server/models:/home/node/app/models | ||
|       - ./packages/server/rest:/home/node/app/rest | ||
|       - ./packages/server/scripts:/home/node/app/scripts | ||
|       - ./packages/server/services:/home/node/app/services | ||
|       - ./packages/server/data:/home/node/app/data | ||
|       - ./logs:/home/node/app/logs | ||
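The compose file above passes the S3 settings to the container as environment variables, some with defaults (`S3_PORT`) and some without (`S3_EUROPEPMC_BUCKET`, `S3_EUROPEPMC_FOLDER`). As a rough sketch of how the importer might consume them, the helpers below build an endpoint URL and fail fast when the bucket or folder is empty. The function names `buildS3Endpoint` and `getEuropePmcLocation` are hypothetical and do not come from this PR; only the variable names are taken from the compose file.

```javascript
// Hypothetical helper (not part of the PR): builds the S3 endpoint URL from
// the S3_* variables declared in the compose file. The fallbacks mirror the
// values the compose file sets or defaults to.
const buildS3Endpoint = (env = process.env) => {
  const protocol = env.S3_PROTOCOL || 'http'
  const host = env.S3_HOST
  const port = env.S3_PORT || '9000'

  if (!host) {
    throw new Error('S3_HOST is not set')
  }

  return `${protocol}://${host}:${port}`
}

// Hypothetical helper (not part of the PR): the compose file defines no
// default for the bucket or folder, so both are empty when unset on the host.
// Failing early gives a clearer error than a failed S3 listing later on.
const getEuropePmcLocation = (env = process.env) => {
  const bucket = env.S3_EUROPEPMC_BUCKET
  const folder = env.S3_EUROPEPMC_FOLDER

  if (!bucket || !folder) {
    throw new Error('S3_EUROPEPMC_BUCKET and S3_EUROPEPMC_FOLDER must be set')
  }

  return { bucket, folder }
}

module.exports = { buildS3Endpoint, getEuropePmcLocation }
```

Passing `env` as a parameter keeps both helpers easy to exercise without touching the real process environment.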
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,48 @@ | ||
| # EuropePMC v4 Ingestion – Release Documentation | ||
| | ||
| ## 1. Purpose of the Release | ||
| - **Goal:** Ingest new citations from EuropePMC (v4). | ||
| - **Outcome:** Expanded citation coverage and updated assertions table. | ||
| | ||
| --- | ||
| | ||
| ## 2. High-Level Workflow | ||
| 1. **Download raw CSV files** from EuropePMC. | ||
| 2. **Process and reformat** files by adding additional metadata. | ||
| 3. **Store files in S3** in the correct path structure. | ||
| 4. **Run ingestion pipeline:** | ||
|    - Reads S3 files | ||
|    - Creates activity logs | ||
|    - Processes activity logs | ||
|    - Runs local APIs for Crossref and ROR to mitigate rate limits | ||
| 5. **Enrich data:** | ||
|    - DataCite (for DOIs) | ||
|    - Crossref (for accession numbers) | ||
| 6. **Insert records into `assertions` table.** | ||
| 7. **Post-processing cleanup** of malformed or duplicate records. | ||
| 8. **Generate final data dump** for release. | ||
| | ||
|  | ||
| | ||
| --- | ||
| | ||
| ## 3. Key Scripts and Commands | ||
| - `https://github.com/Make-Data-Count-Community/corpus-data-file/corpus-v4/data_ingestion/eupmc_reformat_csv.py` | ||
| - `https://github.com/Make-Data-Count-Community/corpus-data-file/corpus-v4/data_ingestion/eupmc_file_downloader.sh` | ||
| | ||
| --- | ||
| | ||
| ## 4. Gotchas / Lessons Learned | ||
| - Verify CSV schema changes before reformatting. | ||
| - Ensure correct S3 bucket and path to avoid ingestion failures. | ||
| - Monitor long-running pipeline jobs. | ||
| - Crossref enrichment may fail if Crossref API rate limits are hit. | ||
| | ||
| --- | ||
| | ||
| ## 5. Links to Artifacts | ||
| - **Raw CSV files (S3):** `s3://europepmc-files/unprocessed/` | ||
| - **Processed CSV files (S3):** `s3://europepmc-files/processed/` | ||
| - **V4 dump files (S3):** `s3://corpus-data-files/v4.1/` | ||
| | ||
| --- | ||
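Step 4 of the workflow above turns the rows read from S3 into activity logs, and one commit in this PR is titled "500 citation per activity". The following is a minimal sketch of that batching step under those assumptions. The helper names (`chunk`, `toActivityLogs`) and the shape of the returned objects (`fileName`, `batchIndex`, `citations`, `done`) are hypothetical; the actual importer may structure its activity logs differently.

```javascript
// Batch size taken from the commit message "500 citation per activity".
const CITATIONS_PER_ACTIVITY = 500

// Splits an array into consecutive slices of at most `size` items.
const chunk = (items, size) => {
  if (!Number.isInteger(size) || size <= 0) {
    throw new Error('size must be a positive integer')
  }

  const batches = []

  for (let start = 0; start < items.length; start += size) {
    batches.push(items.slice(start, start + size))
  }

  return batches
}

// Hypothetical shape for an activity log: one entry per batch of citations,
// tagged with the source file so a failed batch can be retried on its own.
const toActivityLogs = (fileName, rows) =>
  chunk(rows, CITATIONS_PER_ACTIVITY).map((citations, batchIndex) => ({
    fileName,
    batchIndex,
    citations,
    done: false,
  }))

module.exports = { chunk, toActivityLogs }
```

Keeping each batch small bounds the memory held per activity log, which fits the heap-related fix mentioned in the commit list, although the PR does not state that this was the motivation.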
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,20 @@ | ||
| /* | ||
|  * Changes to this file require rebuilding the image using 'docker-compose -f docker-compose.withexternaldb.cziimport.yml build' | ||
|  */ | ||
| // const dataCitePrefixImport = require('./services/scheduledTaskService/dataCitePrefixImport') | ||
| | ||
| const epmcImport = require('./services/scheduledTaskService/epmcImport') | ||
| | ||
| const init = async () => { | ||
|   // Uncomment this to fetch all prefixes from the DataCite API and insert them into the DB. | ||
|   // NOTE: this is not idempotent; prefixes will be duplicated if it is run multiple times. | ||
|   // await dataCitePrefixImport() | ||
| | ||
|   await epmcImport() | ||
| } | ||
| | ||
| // Log the original error (stack intact) and mark the run as failed. | ||
| init().catch(e => { | ||
|   console.error(e) | ||
|   process.exitCode = 1 | ||
| }) | ||
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,37 @@ | ||
| const { BaseModel } = require('@coko/server') | ||
| | ||
| class CziFileModel extends BaseModel { | ||
|   constructor(properties) { | ||
|     super(properties) | ||
|     this.type = 'CziFileModel' | ||
|   } | ||
| | ||
|   static get tableName() { | ||
|     return 'czi_files' | ||
|   } | ||
| | ||
|   static get schema() { | ||
|     return { | ||
|       properties: { | ||
|         file_name: { | ||
|           type: ['string'], | ||
|         }, | ||
|         type: { | ||
|           type: ['string'], | ||
|         }, | ||
|         proccessed: { | ||
|           default: false, | ||
|           type: ['boolean'], | ||
|         }, | ||
|         done: { | ||
|           default: false, | ||
|           type: ['boolean'], | ||
|         }, | ||
|       }, | ||
|       required: ['file_name', 'type'], | ||
|       type: 'object', | ||
|     } | ||
|   } | ||
| } | ||
| | ||
| module.exports = CziFileModel | ||
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,7 @@ | ||
| /* eslint-disable global-require */ | ||
| const model = require('./cziFileModel') | ||
| | ||
| module.exports = { | ||
| model, | ||
| modelName: 'CziFileModel', | ||
| } |
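The `czi_files` model above records which files the importer has seen, and one commit in this PR mentions excluding already processed files when listing S3. A minimal sketch of that filtering step follows, written as a pure function so it can run without a database. The name `selectUnprocessed` is hypothetical, and treating a file as finished only when its record has `done: true` is an assumption; the PR does not show which flag the importer actually checks.

```javascript
// Hypothetical helper (not part of the PR): given the object keys listed in
// S3 and the rows already stored in czi_files, return the keys that still
// need processing.
//
// Assumption: a file counts as finished only when a record with the same
// file_name has done === true. Files with no record, or with done === false,
// are returned so they can be picked up or resumed.
const selectUnprocessed = (s3Keys, fileRecords) => {
  const finished = new Set(
    fileRecords.filter(record => record.done).map(record => record.file_name),
  )

  return s3Keys.filter(key => !finished.has(key))
}

module.exports = { selectUnprocessed }
```

With the real model, `fileRecords` would presumably come from a query on `CziFileModel`; that call is left out here because the query code is not visible in this diff.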
We already have `aws-sdk`; let's use that instead. There is already an AWS service we can use for S3: https://github.com/Make-Data-Count-Community/corpus-app/blob/adeb183c77a6aed8b4ad3d5dfdd22012d333acf7/packages/server/services/awsS3Service/index.js