Description
Description of database
ClinVar is a public archive of reports of the relationships among human variations and phenotypes, with supporting evidence.
Access (API or download)
https://ftp.ncbi.nih.gov/pub/clinvar/vcf_GRCh38/archive_2.0/2018/clinvar_20180701.vcf.gz
https://www.ncbi.nlm.nih.gov/clinvar/docs/api_http/
Also available on GCP BigQuery: bigquery-public-data.human_variant_annotation.ncbi_clinvar_hg38_20180701
The authors of the Hugging Face ESM-1b space used the tab-delimited variant_summary file:
https://ftp.ncbi.nlm.nih.gov/pub/clinvar/tab_delimited/variant_summary.txt.gz
Data Type
Variant pathogenicity (annotation)
CLNDISDBINCL: MedGen, OMIM and Orphanet codes
CLNHGVS: HGVS mutation code
CLNVC: type of mutation (e.g. Deletion, Insertion, SNP)
CLNSIGINCL: clinical interpretation (i.e. label); options are Pathogenic, Likely pathogenic, Benign, Likely benign, Unknown, Not Accessed
CLNSIGCONF: conflicting clinical significance
CLNVI: the variant's clinical sources, reported as tag-value pairs of database and variant identifier (e.g. UNIPROT)
GENEINFO: gene(s) for the variant, reported as gene symbol:gene id. The gene symbol and id are delimited by a colon (:) and each pair is delimited by a vertical bar (|)
MC: comma-separated list of molecular consequences in the form Sequence Ontology ID|molecular_consequence (e.g. missense)
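The delimiter rules described for GENEINFO and MC can be sketched in a few lines of Python. This is a minimal illustration, not code from the ClinVar tooling: the function names parse_geneinfo and parse_mc are hypothetical, and the sample values in the usage note are only there to show the shape of the fields.

```python
def parse_geneinfo(value):
    """Split a GENEINFO value into (gene_symbol, gene_id) pairs.

    Pairs are separated by '|', and symbol and id by ':'.
    The gene id is kept as a string so that odd values do not raise.
    """
    pairs = []
    for item in value.split("|"):
        if not item:
            continue  # skip empty fragments such as a trailing '|'
        symbol, _, gene_id = item.partition(":")
        pairs.append((symbol, gene_id))
    return pairs


def parse_mc(value):
    """Split an MC value into (sequence_ontology_id, consequence) pairs.

    Entries are separated by ',', and the SO id and consequence by '|'.
    """
    pairs = []
    for item in value.split(","):
        if not item:
            continue
        so_id, _, consequence = item.partition("|")
        pairs.append((so_id, consequence))
    return pairs
```

For example, parse_geneinfo("BRCA1:672|NBR2:10230") yields two (symbol, id) pairs, and parse_mc("SO:0001583|missense_variant,SO:0001627|intron_variant") yields one pair per consequence.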
Target metric
CLNSIGINCL: label of pathogenicity
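One possible way to turn the clinical interpretation into a training target is a binary label. The sketch below is an assumption, not something specified above: it treats (likely) pathogenic as 1, (likely) benign as 0, and everything else as unlabeled. The function name to_binary_label is hypothetical, and it assumes the VCF convention of underscores in place of spaces.

```python
def to_binary_label(clnsig):
    """Map a clinical significance string to 1, 0, or None.

    Assumption: pathogenic and likely pathogenic count as positive,
    benign and likely benign as negative; any other value is unlabeled.
    """
    if clnsig is None:
        return None
    # VCF values use underscores instead of spaces, e.g. 'Likely_pathogenic'.
    text = clnsig.replace("_", " ").strip().lower()
    if text in {"pathogenic", "likely pathogenic", "pathogenic/likely pathogenic"}:
        return 1
    if text in {"benign", "likely benign", "benign/likely benign"}:
        return 0
    return None
```

Values that fall outside both sets, such as conflicting or uncertain interpretations, come back as None so they can be dropped or handled separately.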
Tried extracting some data
1) Ran the following query in BigQuery:
SELECT *
FROM `bigquery-public-data.human_variant_annotation.ncbi_clinvar_hg38_20180701`
WHERE CONTAINS_SUBSTR(MC, 'missense_variant')
AND CONTAINS_SUBSTR(CLNSIGINCL, 'athogenic')
Retrieved 510 rows
2) Went through the data in clinvar.csv from the Hugging Face ESM-1b project
The data contains file_ID, variant, clinical significance and allele ID, with shape (988571, 5), as shown in the attached screenshot
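The filter from step 1 can also be sketched locally against the INFO column of the VCF download, without BigQuery. This is a rough equivalent under stated assumptions: parse_info and is_pathogenic_missense are hypothetical helper names, the INFO string is assumed to follow the usual KEY=VALUE;KEY=VALUE layout, and matching is lowercased because CONTAINS_SUBSTR is case-insensitive. It is not guaranteed to reproduce the same row count as the BigQuery table.

```python
def parse_info(info):
    """Turn a VCF INFO string 'KEY=VALUE;KEY=VALUE;FLAG' into a dict.

    Keys without a value (flags) are stored as True.
    """
    fields = {}
    for entry in info.split(";"):
        if not entry:
            continue
        key, sep, value = entry.partition("=")
        fields[key] = value if sep else True
    return fields


def is_pathogenic_missense(info):
    """Mirror the query: MC contains 'missense_variant' and
    CLNSIGINCL contains 'athogenic', both compared case-insensitively."""
    fields = parse_info(info)
    mc = str(fields.get("MC", "")).lower()
    clnsig = str(fields.get("CLNSIGINCL", "")).lower()
    return "missense_variant" in mc and "athogenic" in clnsig
```

The substring 'athogenic' is what lets a single condition match both "Pathogenic" and "Likely pathogenic", whatever the capitalization.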
