Bridging Data and Discovery: A Survey on Knowledge Graphs in AI for Science
- Awesome Scientific Knowledge Graphs
- 📑 Table of Contents
- 🧬 Research Scopes
- 📚 Structure of Survey
- 🔗 Evolution of SciKGs
- 🏗️ Construction and Maintenance of SciKGs
- 🌐 Core Functions of SciKGs
- 🤝 SciKG–LLM Integration for Scientific Discovery
- 🧠 Discovery Flywheel
- ⚖️ Challenges and Opportunities in SciKGs
- Collection of SciKGs and their Applications
- Summary of SciKG-LLM Integration
- Databases for Constructing Scientific Knowledge Graph
- Software Tools for Knowledge Graph
- Citation
An overview of the scope of this survey, covering four fundamental scientific tasks in biology, chemistry, and materials science: (a) drug development and optimization, (b) omics interpretation and analysis, (c) chemical reaction and synthesis, and (d) materials design and discovery.
Structure of the survey. Our review is structured around the lifecycle of SciKGs: from their conceptual foundation and construction methodologies, to their applications and synergistic integration with LLMs for discovery, culminating in challenges, opportunities and future directions that envision SciKGs as engines for autonomous scientific discovery.
The co-evolution of knowledge graph technologies and their scientific practices. The technological evolution of KGs (top) has continually enabled new paradigms in SciKG applications (bottom). This progression has moved from static cataloguing and manual integration to machine learning-driven inference, culminating in the current era of bidirectional synergy between LLMs and KGs. This synergy, leveraging tools such as RAG and AI agents, transforms SciKGs from static repositories into dynamic engines for generative scientific discovery. Abbr., SQL: Structured Query Language; RDF: Resource Description Framework; OWL: Web Ontology Language; SPARQL: SPARQL Protocol and RDF Query Language; GNN: graph neural network; KGE: knowledge graph embedding; RAG: retrieval-augmented generation.
Construction and maintenance of SciKGs. (a) The foundation of SciKG construction involves integrating diverse data sources, including structured databases, unstructured text, and multimodal data. (b) Two main approaches for extracting entities and relations from the acquired data are illustrated: rule/dictionary-based extraction, which relies on predefined lexicons and rules, and LLM-based extraction, involving fine-tuning on scientific datasets and prompt engineering. (c) Ontology alignment integrates diverse representations of the same entity (e.g., aspirin), followed by graph embedding into a continuous vector space. (d) Dynamic updating through incremental learning and LLM-driven error correction ensures SciKGs remain accurate and up to date. (e-h) Sub-figures illustrate representative examples of specialized knowledge graphs for drugs, omics, chemicals, and materials, respectively.
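As a rough illustration of panels (b) and (c), the following minimal Python sketch shows rule/dictionary-based extraction with alias alignment, so that different surface forms of the same entity (e.g., aspirin) collapse to one node. The alias dictionary, the trigger words, the entity IDs, and the function names are all hypothetical and chosen only for this example; they are not taken from any tool or KG listed in this repository.

```python
import re

# Hypothetical alias dictionary: surface form -> canonical entity ID (illustrative only).
ALIASES = {
    "aspirin": "drug:aspirin",
    "acetylsalicylic acid": "drug:aspirin",
    "asa": "drug:aspirin",
    "cox-1": "protein:COX1",
    "cyclooxygenase-1": "protein:COX1",
}

# Hypothetical trigger words mapped to relation labels.
RELATION_TRIGGERS = {"inhibits": "inhibits", "binds": "binds_to"}


def find_entities(sentence):
    """Return sorted (start_offset, canonical_id) pairs, matching longest aliases first."""
    text = sentence.lower()
    taken = [False] * len(text)
    hits = []
    for alias in sorted(ALIASES, key=len, reverse=True):
        pattern = r"(?<![\w-])" + re.escape(alias) + r"(?![\w-])"
        for match in re.finditer(pattern, text):
            if not any(taken[match.start():match.end()]):
                hits.append((match.start(), ALIASES[alias]))
                for i in range(match.start(), match.end()):
                    taken[i] = True
    return sorted(hits)


def extract_triples(sentence):
    """Emit (head, relation, tail) when a trigger word sits between two adjacent entities."""
    text = sentence.lower()
    entities = find_entities(sentence)
    triples = set()
    for (start1, head), (start2, tail) in zip(entities, entities[1:]):
        between = text[start1:start2]
        for trigger, relation in RELATION_TRIGGERS.items():
            if head != tail and re.search(r"\b" + trigger + r"\b", between):
                triples.add((head, relation, tail))
    return triples


print(extract_triples("Aspirin inhibits COX-1."))
print(extract_triples("Acetylsalicylic acid inhibits cyclooxygenase-1"))
```

Both sentences yield the same canonical triple, which is the point of the alignment step. Real pipelines would replace the hand-written dictionary with curated ontologies and the trigger rules with a trained or prompted extractor.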
Summary of core functions of SciKGs in diverse scientific tasks. SciKGs serve as a foundational infrastructure that: (1) organizes heterogeneous scientific data into structured knowledge; (2) enhances representation learning via graph embedding; (3) enables causal and relational inference for hypothesis generation; and (4) improves AI model interpretability by grounding predictions in traceable, evidence-based knowledge paths.
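Function (4), grounding a prediction in a traceable knowledge path, can be sketched as a shortest-path search over triples. The example below is a minimal sketch using breadth-first search on a toy graph; the triples, entity IDs, and function names are invented for illustration and do not come from any KG in the tables of this repository.

```python
from collections import deque

# Toy triples, invented purely for illustration.
TRIPLES = [
    ("drug:A", "inhibits", "protein:P1"),
    ("protein:P1", "participates_in", "pathway:W"),
    ("pathway:W", "associated_with", "disease:D"),
    ("drug:A", "binds_to", "protein:P2"),
]


def build_index(triples):
    """Map each head entity to its outgoing (relation, tail) edges."""
    index = {}
    for head, relation, tail in triples:
        index.setdefault(head, []).append((relation, tail))
    return index


def shortest_evidence_path(triples, source, target):
    """Breadth-first search along directed edges.

    Returns the list of triples linking source to target, or None if no path exists.
    """
    index = build_index(triples)
    queue = deque([(source, [])])
    visited = {source}
    while queue:
        node, path = queue.popleft()
        if node == target:
            return path
        for relation, neighbor in index.get(node, []):
            if neighbor not in visited:
                visited.add(neighbor)
                queue.append((neighbor, path + [(node, relation, neighbor)]))
    return None


for step in shortest_evidence_path(TRIPLES, "drug:A", "disease:D"):
    print(step)
```

The returned path can be shown next to a model's drug-disease prediction as supporting evidence. Production systems typically rank many candidate paths rather than returning only the shortest one.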
Synergistic integration of SciKGs and LLMs for knowledge-driven scientific discovery. (a) SciKGs serve as the foundational knowledge infrastructure by ensuring factual grounding and verification, defining reasonable scientific boundaries, and enabling unified representation of heterogeneous data. (b) LLMs act as dynamic semantic engines through five core functions: semantic interface for knowledge access, analytical reasoner for inference, generative engine for hypothesis design, constructor for knowledge curation, and orchestrator for workflow automation. (c) The SciKG-LLM integration empowers four key scientific discovery tasks: multi-source data interpretation, complex system mechanism analysis, system performance optimization, and innovative solution design.
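The grounding and boundary-setting roles in panel (a) can be illustrated with a small check that compares candidate statements against the graph. This is a minimal sketch under stated assumptions: the toy KG, the relation schema, and the status labels are hypothetical, and the candidate claims are hard-coded stand-ins for LLM output (no model is called).

```python
# Toy KG and relation schema, invented for illustration only.
KG = {
    ("drug:A", "inhibits", "protein:P1"),
    ("drug:A", "treats", "disease:D"),
}
SCHEMA = {
    "inhibits": ("drug", "protein"),
    "treats": ("drug", "disease"),
}


def entity_type(entity_id):
    """Read the type prefix from an ID such as 'drug:A'."""
    return entity_id.split(":", 1)[0]


def check_claim(claim, kg=KG, schema=SCHEMA):
    """Classify a (head, relation, tail) claim against the KG.

    'supported'     : the exact triple is in the KG (factual grounding).
    'out_of_schema' : the relation is unknown or the entity types violate
                      the schema (outside the allowed scientific boundary).
    'unverified'    : well-formed but absent from the KG (a candidate hypothesis).
    """
    claim = tuple(claim)
    head, relation, tail = claim
    if claim in kg:
        return "supported"
    expected = schema.get(relation)
    if expected is None or (entity_type(head), entity_type(tail)) != expected:
        return "out_of_schema"
    return "unverified"


# Hard-coded stand-ins for statements an LLM might propose.
candidate_claims = [
    ("drug:A", "treats", "disease:D"),
    ("drug:B", "treats", "disease:D"),
    ("protein:P1", "treats", "drug:A"),
]
for claim in candidate_claims:
    print(claim, "->", check_claim(claim))
```

Separating "unverified" from "out_of_schema" keeps plausible new hypotheses apart from statements that are structurally invalid, which loosely mirrors the verification and boundary roles described in the caption.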

The autonomous scientific discovery flywheel driven by LLM agents and SciKGs.
Challenges and Opportunities in SciKGs. This figure illustrates the major challenges (C1-C4) facing SciKGs, including data quality and completeness, interoperability and integration, dynamic and temporal knowledge, and trustworthy and explainable reasoning. Each challenge is paired with corresponding opportunities (O1-O4) for advancement, such as building standards and benchmarks, integrating multimodal foundation models, autonomous updating via agents, and developing community-driven platforms. The green sections depict workflows (W1-W4) that enable these opportunities, highlighting a path towards more auditable, unified, dynamic, and community-governed SciKGs.

| Year | Title | KG Name | KG Type | Domain | Construction Method | Venue | Paper | Code |
|---|---|---|---|---|---|---|---|---|
| 2025 | TarIKGC: A Target Identification Tool Using Semantics-Enhanced Knowledge Graph Completion with Application to CDK2 Inhibitor Discovery | biological activity KG | public KG | DTI prediction | Semi-automated | Journal of Medicinal Chemistry | Link | Link |
| 2025 | A comprehensive large-scale biomedical knowledge graph for AI-powered data-driven biomedical research | iKraph | Multi-source KG | Drug repurposing and Hypothesis Generation | Semi-automated | Nature Machine Intelligence | Link | Link |
| 2025 | VITAGRAPH: Building a Knowledge Graph for Biologically Relevant Learning Tasks | VITAGRAPH | public KG | Drug repurposing | Semi-automated | arXiv | Link | Link |
| 2024 | A Foundation Model for Clinician-Centered Drug Repurposing | / | public KG | Drug repurposing | Semi-automated | Nature Medicine | Link | Link |
| 2024 | Accurate and Interpretable Drug-Drug Interaction Prediction Enabled by Knowledge Subgraph Learning | / | public KG | DDI prediction | Automated | Communications Medicine | Link | Link |
| 2024 | Knowledge Enhanced Representation Learning for Drug Discovery | MKG | Multi-source KG | DTI prediction and Virtual screening and drug discovery | Semi-automated | AAAI | Link | Link |
| 2024 | An experimentally validated approach to automated biological evidence generation in drug discovery using knowledge graphs | Healx KG | public KG | Drug repurposing | Semi-automated | Nature Communications | Link | Link |
| 2024 | DDI-GPT: Explainable Prediction of Drug-Drug Interactions using Large Language Models enhanced with Knowledge Graphs | iBKH | public KG | DDI prediction | Semi-automated | bioRxiv | Link | Link |
| 2024 | MKG-FENN: A Multimodal Knowledge Graph Fused End-to-End Neural Network for Accurate Drug–Drug Interaction Prediction | MKG | Multi-source KG | DDI prediction | Automated | AAAI | Link | Link |
| 2024 | TransFOL: A Logical Query Model for Complex Relational Reasoning in Drug-Drug Interaction | / | public KG | DDI prediction | Semi-automated | Journal of Biomedical and Health Informatics | Link | Link |
| 2024 | KGRLFF: Detecting Drug-Drug Interactions Based on Knowledge Graph Representation Learning and Feature Fusion | / | public KG | DDI prediction | Semi-automated | TCBB | Link | Link |
| 2024 | An effective framework for predicting drug–drug interactions based on molecular substructures and knowledge graph neural network | DKG (Drug knowledge graph) | public KG | DDI prediction | Semi-automated | Computers in Biology and Medicine | Link | Link |
| 2024 | Medical knowledge graph question answering for drug‐drug interaction prediction based on multi‐hop machine reading comprehension | / | public KG | DDI prediction | Automated | CAAI Transactions on Intelligence Technology | Link | |
| 2024 | Integrated Knowledge Graph and Drug Molecular Graph Fusion via Adversarial Networks for Drug–Drug Interaction Prediction | DrugBank | public KG | DDI prediction | Semi-automated | JCIM | Link | Link |
| 2024 | KGE-UNIT: toward the unification of molecular interactions prediction based on knowledge graph and multi-task learning on drug discovery | / | Multi-source KG | DDI prediction, DTI prediction and Hypothesis Generation | Automated | Briefings in Bioinformatics | Link | Link |
| 2023 | Biomedical Knowledge Graph Learning for Drug Repurposing by Extending Guilt-by-Association to Multiple Layers | / | public KG | Drug repurposing | Semi-automated | Nature Communications | Link | Link |
| 2023 | Evolution-strengthened knowledge graph enables predicting the targetability and druggability of genes | ESKG (Evolution-strengthened KG) | public KG | DTI prediction | Semi-automated | PNAS Nexus | Link | Link |
| 2023 | Drugomics: Knowledge Graph & AI to Construct Physicians' Brain Digital Twin to Prevent Drug Side-Effects and Patient Harm | Drugomics KG | Multi-source KG | Drug toxicity and adverse reactions | Semi-automated | Big Data Analytics | Link | |
| 2023 | Molecular-evaluated and explainable drug repurposing for COVID-19 using ensemble knowledge graph embedding | / | Multi-source KG | Drug repurposing | Semi-automated | Scientific Reports | Link | Link |
| 2023 | Toxicology knowledge graph for structural birth defects | ReproTox-KG | Multi-source KG | Drug toxicity and adverse reactions and Hypothesis Generation | Semi-automated | Communications Medicine | Link | Link |
| 2023 | NAFLDkb: A Knowledge Base and Platform for Drug Development against Nonalcoholic Fatty Liver Disease | NAFLDkb | Multi-source KG | Drug repurposing | Semi-automated | JCIM | Link | Link |
| 2023 | Molecule generation toward target protein (SARS-CoV-2) using reinforcement learning-based graph neural network via knowledge graph | / | domain-specific KG | Virtual screening and drug discovery, DTI prediction, and Hypothesis Generation | Semi-automated | Network Modeling and Analysis in Health Informatics and Bioinformatics | Link | |
| 2022 | e-TSN: an Interactive Visual Exploration Platform for Target-Disease Knowledge Mapping from Literature | e-TSN KG | literature-based KG | DTI prediction | Automated | Briefings in Bioinformatics | Link | |
| 2022 | Attention-based knowledge graph representation learning for predicting drug-drug interactions | / | public KG | DDI prediction | Semi-automated | Briefings in Bioinformatics | Link | Link |
| 2022 | Automating Predictive Toxicology Using ComptoxAI | ComptoxAI KG | public KG | Drug toxicity and adverse reactions | Semi-automated | Chemical Research in Toxicology | Link | Link |
| 2022 | KG-MTL: Knowledge Graph Enhanced Multi-Task Learning for Molecular Interaction | DRKG | public KG | DTI prediction | Automated | IEEE | Link | Link |
| 2021 | A Unified Drug-Target Interaction Prediction Framework Based on Knowledge Graph and Recommendation System | / | public KG | DTI prediction | Semi-automated | Nature Communications | Link | Link |
| 2021 | Biological Insights Knowledge Graph: an integrated knowledge graph to support drug development | BIKG | Multi-source KG | DTI prediction and Drug repurposing | Semi-automated | bioRxiv | Link | |
| 2021 | SumGNN: multi-typed drug interaction prediction via efficient knowledge graph summarization | / | public KG | DDI prediction | Semi-automated | Bioinformatics | Link | Link |
| 2021 | Predicting Potential Drug Targets Using Tensor Factorisation and Knowledge Graph Embeddings | Hetionet | public KG | DTI prediction | Semi-automated | BIOKDD | Link | |
| 2021 | Adverse Drug Reaction Discovery Using a Tumor-Biomarker Knowledge Graph | TBKG (Tumor-biomarker KG) | literature-based KG | Drug toxicity and adverse reactions | Semi-automated | Frontiers in Genetics | Link | |
| 2021 | Investigating ADR mechanisms with Explainable AI: a feasibility study with knowledge graph mining | PGxLOD | Multi-source KG | Drug toxicity and adverse reactions | Semi-automated | BMC Medical Informatics and Decision Making | Link | |
| 2020 | Discovering Protein Drug Targets Using Knowledge Graph Embeddings | a knowledge graph of biological entities related to both drugs and targets | public KG | DTI prediction | Automated | Bioinformatics | Link | |
| 2020 | KGNN: Knowledge Graph Neural Network for Drug-Drug Interaction Prediction | / | public KG | DDI prediction | Semi-automated | IJCAI | Link | Link |
| 2020 | Network-based prediction of drug–target interactions using an arbitrary-order proximity embedded deep forest | / | public KG | DTI prediction | Semi-automated | Bioinformatics | Link | Link |
| 2019 | Evaluation of knowledge graph embedding approaches for drug-drug interaction prediction in realistic settings | / | public KG | DDI prediction | Semi-automated | BMC Bioinformatics | Link | Link |
| 2019 | Drug-Drug Interaction Prediction Based on Knowledge Graph Embeddings and Convolutional-LSTM Network | / | public KG | DDI prediction | Semi-automated | arXiv | Link | Link |
| 2019 | GAMENet: Graph Augmented MEmory Networks for Recommending Medication Combination | EHR&DDI Graph | domain-specific KG | Virtual screening and drug discovery | Automated | AAAI | Link | Link |
| 2019 | Facilitating prediction of adverse drug reactions by using knowledge graphs and multi‐label learning models | Bio2RDF KG | public KG | Drug toxicity and adverse reactions | Automated | Briefings in Bioinformatics | Link | |
| 2018 | Modeling polypharmacy side effects with graph convolutional networks | / | public KG | Drug toxicity and adverse reactions | Semi-automated | Bioinformatics | Link | |
| 2018 | Neural networks for link prediction in realistic biomedical graphs: a multi-dimensional evaluation of graph embedding-based approaches | / | public KG | DTI prediction | Semi-automated | BMC Bioinformatics | Link | |
| 2017 | A Network Integration Approach for Drug-Target Interaction Prediction and Computational Drug Repositioning from Heterogeneous Information | / | public KG | DTI prediction and Drug repurposing | Semi-automated | Nature Communications | Link | Link |
| 2017 | Knowledge graph prediction of unknown adverse drug reactions and validation in electronic health records | / | public KG | Drug toxicity and adverse reactions | Semi-automated | Scientific Reports | Link | Link |
| 2017 | Large-scale structural and textual similarity-based mining of knowledge graph to predict drug–drug interactions | / | public KG | DDI prediction | Semi-automated | Journal of Web Semantics | Link | |
| 2017 | Deep mining heterogeneous networks of biomedical linked data to predict novel drug–target associations | LTN (Linked Tripartite Network) | public KG | DTI prediction | Semi-automated | Bioinformatics | Link | Link |

| Year | Title | KG Name | KG Type | Domain | Construction Method | Venue | Paper | Code |
|---|---|---|---|---|---|---|---|---|
| 2025 | A novel approach for target deconvolution from phenotype-based screening using knowledge graph | P53_HUMAN PPIKG | public KG | Proteomics research | Semi-automated | Scientific Reports | Link | Link |
| 2025 | Unified Knowledge-Guided Molecular Graph Encoder with multimodal fusion and multi-task learning | Elemental KG and Biological KG | Multi-source KG | Proteomics research | Semi-automated | Neural Networks | Link | |
| 2025 | PhenoKG: Knowledge Graph-Driven Gene Discovery and Patient Insights from Phenotypes Alone | PhenoKG | public KG | Genomics research | Semi-automated | arXiv | Link | |
| 2024 | Petagraph: A large-scale unifying knowledge graph framework for integrating biomolecular and biomedical data | Petagraph | Multi-source KG | Genomics research | Semi-automated | Scientific Data | Link | Link |
| 2024 | An ontology-based knowledge graph for representing interactions involving RNA molecules | RNA-KG | Multi-source KG | Transcriptomics research | Semi-automated | Scientific Data | Link | Link |
| 2024 | Knowledge graph construction based on granulosa cells transcriptome from polycystic ovary syndrome with normoandrogen and hyperandrogen | causal KG | Multi-source KG | Transcriptomics research | Semi-automated | Journal of Ovarian Research | Link | |
| 2024 | Multi-Modal Protein Knowledge Graph Construction and Applications (Student Abstract) | ProteinKG65 | Multi-source KG | Proteomics research | Semi-automated | AAAI | Link | Link |
| 2024 | Bridging chemical structure and conceptual knowledge enables accurate prediction of compound-protein interaction | DRKG | public KG | Proteomics research | Manual | BMC Biology | Link | Link |
| 2024 | Integration of chromosome locations and functional aspects of enhancers and topologically associating domains in knowledge graphs enables versatile queries about gene regulation | crm, crm2gene, crm2tfac, crm2phen, tad, human genes (after ampliation) graph | public KG | Genomics research | Semi-automated | Nucleic Acids Research | Link | Link |
| 2024 | Identifying compound-protein interactions with knowledge graph embedding of perturbation transcriptomics | / | public KG | Proteomics research | Semi-automated | Cell Genomics | Link | Link |
| 2023 | MMiKG: a knowledge graph-based platform for path mining of microbiota–mental diseases interactions | MMiKG | literature-based KG | Microbiome research | Manual | Briefings in Bioinformatics | Link | |
| 2023 | Transporter proteins knowledge graph construction and its application in drug development | Transporter Proteins Knowledge Graph | public KG | Proteomics research | Semi-automated | Computational and Structural Biotechnology Journal | Link | |
| 2023 | A Knowledge Graph Approach to Elucidate the Role of Organellar Pathways in Disease via Biomedical Reports | / | Multi-source KG | Proteomics research | Semi-automated | JoVE Journal of Biochemistry | Link | |
| 2022 | Knowledge-graph-based cell-cell communication inference for spatially resolved transcriptomic data with SpaTalk | LRT-KG | public KG | Transcriptomics research | Semi-automated | Nature Communications | Link | Link |
| 2022 | A knowledge graph to interpret clinical proteomics data | CKG (Clinical Knowledge graph) | Multi-source KG | Proteomics research | Semi-automated | Nature Biotechnology | Link | Link |
| 2022 | Knowledge integration and decision support for accelerated discovery of antibiotic resistance genes | E. coli knowledge graph | public KG | Genomics research | Semi-automated | Nature Communications | Link | Link |
| 2022 | Machine learning prediction and tau-based screening identifies potential Alzheimer's disease genes relevant to immunity | PKG (protein knowledge graph) | public KG | Proteomics research | Semi-automated | Communications Biology | Link | Link |
| 2022 | OntoProtein: Protein Pretraining With Gene Ontology Embedding | ProteinKG25 | public KG | Proteomics research | Automated | ICLR | Link | Link |
| 2022 | GenomicKB: a knowledge graph for the human genome | GenomicKB (Genomic Knowledgebase) | Multi-source KG | Genomics research | Semi-automated | Nucleic Acids Research | Link | |
| 2022 | Creating and Exploiting the Intrinsically Disordered Protein Knowledge Graph (IDP-KG) | IDP-KG | public KG | Proteomics research | Automated | CEUR Workshop Proceedings | Link | Link |
| 2022 | BioTAGME: A Comprehensive Platform for Biological Knowledge Network Analysis | BioTAGME KG | Multi-source KG | Multi-Omics research | Semi-automated | Frontiers in Genetics | Link | |
| 2022 | Biomedical knowledge graph embeddings for personalized medicine: Predicting disease-gene associations | a biomedical KG for predicting disease-gene association | public KG | Genomics research | Semi-automated | Expert Systems | Link | Link |
| 2022 | Identifying genes targeted by disease-associated non-coding SNPs with a protein knowledge graph | Protein KG | literature-based KG | Genomics research | Semi-automated | PLOS ONE | Link | Link |
| 2022 | KG-MTL: Knowledge Graph Enhanced Multi-Task Learning for Molecular Interaction | DRKG | public KG | Proteomics research | Automated | IEEE | Link | Link |
| 2021 | FORUM: building a Knowledge Graph from public databases and scientific literature to extract associations between chemicals and diseases | FORUM | Multi-source KG | Metabolomics research | Semi-automated | Bioinformatics | Link | Link |
| 2020 | Exploring the Microbiota-Gut-Brain Axis for Mental Disorders with Knowledge Graphs | MiKG (Microbiota knowledge graph) | Multi-source KG | Microbiome research | Semi-automated | Journal of Artificial Intelligence for Medical Sciences | Link | Link |
| 2020 | Metastatic Site Prediction in Breast Cancer using Omics Knowledge Graph and Pattern Mining with Kirchhoff's Law Traversal | Kirchhoff's KG | public KG | Multi-Omics research | Semi-automated | bioRxiv | Link | |
| 2020 | Accurate prediction of kinase-substrate networks using knowledge graphs | a phosphorylation knowledge graph | public KG | Proteomics research | Automated | PLoS Computational Biology | Link | |
| 2020 | An integrative knowledge graph for rare diseases, derived from the Genetic and Rare Diseases Information Center (GARD) | an integrative KG for rare diseases | Multi-source KG | Genomics research | Semi-automated | Journal of Biomedical Semantics | Link | |
| 2019 | Predicting gene-disease associations from the heterogeneous network using graph embedding | / | public KG | Genomics research | Semi-automated | IEEE | Link | |
| 2019 | GenomicsKG: A Knowledge Graph to Visualize Poly-Omics Data | GenomicsKG | public KG | Genomics research | Semi-automated | Journal of Advances in Health | Link | |
| 2018 | Semantic Disease Gene Embeddings (SmuDGE): phenotype-based disease gene prioritization without phenotypes | / | public KG | Genomics research | Semi-automated | Bioinformatics | Link | Link |
| 2018 | Heterogeneous network embedding for identifying symptom candidate genes | SDGNet&SDGPNet (two heterogeneous symptom-related networks) | public KG | Genomics research | Semi-automated | JAMIA | Link | |
| 2018 | Network-based integration of multi-omics data for prioritizing cancer genes | / | public KG | Genomics research | Semi-automated | Bioinformatics | Link | Link |
| 2016 | A knowledge-based approach for predicting gene–disease associations | / | public KG | Genomics research | Semi-automated | Bioinformatics | Link | |

| Year | Title | KG Name | KG Type | Domain | Construction Method | Venue | Paper | Code |
|---|---|---|---|---|---|---|---|---|
| 2025 | Bi-level Contrastive Learning for Knowledge-Enhanced Molecule Representations | MolKG | public KG | Molecular property prediction | Semi-automated | AAAI | Link | Link |
| 2025 | Automated Retrosynthesis Planning of Macromolecules Using Large Language Models and Knowledge Graphs | / | literature-based KG | Chemical synthesis pathway optimization | Automated | Macromolecular Rapid Communications | Link | Link |
| 2025 | An Automated Approach for Domain-Specific Knowledge Graph Generation─Graph Measures and Characterization | / | literature-based KG | Chemical reaction prediction and Chemical synthesis pathway optimization | Automated | JCIM | Link | |
| 2024 | Large-Scale Knowledge Integration for Enhanced Molecular Property Prediction | ElementKG-CHEBI | Multi-source KG | Molecular property prediction | Semi-automated | NeSy | Link | Link |
| 2024 | Self-Supervised Contrastive Molecular Representation Learning with a Chemical Synthesis Knowledge Graph | Chemical synthesis KG | public KG | Chemical reaction prediction | Semi-automated | JCIM | Link | Link |
| 2023 | Knowledge graph-enhanced molecular contrastive learning with functional prompt | ElementKG | domain-specific KG | Molecular property prediction | Semi-automated | Nature Machine Intelligence | Link | Link |
| 2023 | Marie and BERT─A Knowledge Graph Embedding Based Question Answering System for Chemistry | TWA KG (the World Avatar KG) and Wikidata chemistry KG | Dynamic KG | Chemical reaction prediction and Molecular property prediction | Semi-automated | ACS Omega | Link | Link |
| 2022 | MKGE: Knowledge graph embedding with molecular structure information | KCCR and DeepDDI | public KG | Molecular property prediction | Semi-automated | Computational Biology and Chemistry | Link | |
| 2022 | Prediction of Compound Synthesis Accessibility Based on Reaction Knowledge Graph | Reaction knowledge graph | public KG | Chemical reaction prediction, Molecular property prediction, and Chemical synthesis pathway optimization | Semi-automated | Molecules | Link | Link |
| 2022 | Molecular Contrastive Learning with Chemical Element Knowledge Graph | Chemical element KG | domain-specific KG | Molecular property prediction | Semi-automated | AAAI | Link | Link |
| 2022 | FAIR and Interactive Data Graphics from a Scientific Knowledge Graph | / | literature-based KG | Chemical synthesis pathway optimization, Chemical reaction prediction, and Molecular property prediction | Semi-automated | Scientific Data | Link | |
| 2022 | From Platform to Knowledge Graph: Evolution of Laboratory Automation | The World Avatar KG | Dynamic KG | Chemical synthesis pathway optimization | Semi-automated | JACS Au | Link | |
| 2021 | Intelligent generation of optimal synthetic pathways based on knowledge graph inference and retrosynthetic predictions using reaction big data | Reaction knowledge graph | public KG | Chemical synthesis pathway optimization | Semi-automated | Journal of the Taiwan Institute of Chemical Engineers | Link | |
| 2021 | Automated Calibration of a Poly(oxymethylene) Dimethyl Ether Oxidation Mechanism Using the Knowledge Graph Technology | JPS KG | Dynamic KG | Chemical reaction prediction and Chemical synthesis pathway optimization | Semi-automated | JCIM | Link | |
| 2021 | A graph-based network for predicting chemical reaction pathways in solid-state materials synthesis | / | domain-specific KG | Chemical reaction prediction | Automated | Nature Communications | Link | Link |
| 2020 | Knowledge Graph Approach to Combustion Chemistry and Interoperability | JPS KG | literature-based KG | Chemical reaction prediction | Automated | ACS Omega | Link | |
| 2020 | Multiscale Cross-Domain Thermochemical Knowledge-Graph | JPS KG | Dynamic KG | Chemical reaction prediction and Molecular property prediction | Automated | JCIM | Link | |
| 2016 | Modelling Chemical Reasoning to Predict and Invent Reactions | / | public KG | Chemical reaction prediction | Semi-automated | Chemistry Europe | Link | |

| Year | Title | KG Name | KG Type | Domain | Construction Method | Venue | Paper | Code |
|---|---|---|---|---|---|---|---|---|
| 2025 | Construction of a knowledge graph for framework material enabled by large language models and its application | KG-FM | literature-based KG | Material screening and optimization | Semi-automated | npj Computational Materials | Link | Link |
| 2025 | High throughput screening of new piezoelectric materials using graph machine learning and knowledge graph approach | a simple KG encoding structural similarity between materials | public KG | Material screening and optimization | Semi-automated | Computational Materials Science | Link | Link |
| 2024 | MatKG: An autonomously generated knowledge graph in Material Science | MatKG | literature-based KG | New material design | Semi-automated | Scientific Data | Link | Link |
| 2024 | A materials terminology knowledge graph automatically constructed from text corpus | MGED-KG | literature-based KG | Material screening and optimization | Semi-automated | Scientific Data | Link | Link |
| 2024 | Construction and Application of Materials Knowledge Graph in Multidisciplinary Materials Science via Large Language Model | MKG | literature-based KG and dynamic KG | New material design | Semi-automated | NeurIPS | Link | Link |
| 2024 | Generative Retrieval-Augmented Ontologic Graph and Multiagent Strategies for Interpretive Large Language Model-Based Materials Design | Ontological KG | literature-based KG | New material design and Material performance prediction | Semi-automated | ACS Engineering Au | Link | Link |
| 2024 | SciAgents: Automating Scientific Discovery Through Bioinspired Multi-Agent Intelligent Graph Reasoning | an ontological knowledge graph for biologically inspired materials | literature-based KG | New material design and Material performance prediction | Semi-automated | Advanced Materials | Link | |
| 2024 | An ontology-based text mining dataset for extraction of process-structure-property entities | Materials mechanics ontology | literature-based KG | Material performance prediction | Semi-automated | Scientific Data | Link | Link |
| 2024 | Material Property Prediction with Element Attribute Knowledge Graphs and Multimodal Representation Learning | Element KG | domain-specific KG | Material performance prediction | Semi-automated | arXiv | Link | |
| 2024 | Knowledge graph-guided data-driven design of ultra-high-performance concrete (UHPC) with interpretability and physicochemical reaction discovery capability | UHPC KG | literature-based KG | Material screening and optimization | Manual | Construction and Building Materials | Link | |
| 2023 | The materials experiment knowledge graph | MekG | domain-specific KG | Material performance prediction | Semi-automated | Digital Discovery | Link | Link |
| 2023 | Revisiting Electrocatalyst Design by a Knowledge Graph of Cu-Based Catalysts for CO2 Reduction | Cu-Based Catalysts Knowledge Graph for CO2 Reduction | literature-based KG | New material design and Material performance prediction | Semi-automated | ACS Catalysis | Link | Link |
| 2023 | Reinforcement learning-based knowledge graph reasoning for aluminum alloy applications | Aluminum alloy domain KG | public KG | Material performance prediction | Semi-automated | Computational Materials Science | Link | |
| 2023 | Bridging the Semantic-Numerical Gap: A Numerical Reasoning Method of Cross-modal Knowledge Graph for Material Property Prediction | Cross-modal KG | Multi-source KG | Material performance prediction | Semi-automated | arXiv | Link | Link |
| 2023 | Digital Twin-Based Fault Diagnosis Platform for Final Rolling Temperature in Hot Strip Production | / | domain-specific KG | Material screening and optimization | Semi-automated | Materials | Link | |
| 2022 | Grain Knowledge Graph Representation Learning: A New Paradigm for Microstructure-Property Prediction | Grain KG | domain-specific KG | Material performance prediction | Semi-automated | Crystals | Link | Link |
| 2022 | Automating Materials Exploration with a Semantic Knowledge Graph for Li-Ion Battery Cathodes | a semantic knowledge graph dedicated to LIB cathodes | literature-based KG | New material design | Semi-automated | AFM | Link | Link |
| 2022 | High-Throughput Computing Assisted by Knowledge Graph to Study the Correlation between Microstructure and Mechanical Properties of 6XXX Aluminum Alloy | 6XXX Aluminum Alloy KG | domain-specific KG | Material screening and optimization and Material performance prediction | Semi-automated | Materials | Link | |
| 2022 | FAIR and Interactive Data Graphics from a Scientific Knowledge Graph | / | literature-based KG | Material screening and optimization | Semi-automated | Scientific Data | Link | |
| 2022 | Compound Knowledge Graph-Enabled AI Assistant for Accelerated Materials Discovery | CKG (Compound KG) | Multi-source KG | New material design | Semi-automated | Integrating Materials and Manufacturing Innovation | Link | |
| 2021 | EBSD Grain Knowledge Graph Representation Learning for Material Structure-Property Prediction | EBSD Grain KG | domain-specific KG | Material performance prediction | Semi-automated | CCKS | Link | |
| 2020 | propnet: A Knowledge Graph for Materials Science | propnet KG | domain-specific KG | Material performance prediction | Semi-automated | Matter | Link | Link |
| 2020 | NanoMine: A Knowledge Graph for Nanocomposite Materials Science | NanoMine KG | domain-specific KG | New material design | Semi-automated | ISWC | Link | |
| 2018 | Relation extraction with weakly supervised learning based on process-structure-property-performance reciprocity | PSPP KG (Process-Structure-Property-Performance) | literature-based KG | New material design | Semi-automated | Science and Technology of Advanced Materials | Link | Link |

| Name | Year | Domains | Roles of LLMs | Roles of SciKG | Tasks | Application |
|---|---|---|---|---|---|---|
| KnowNET | 2024 | Drug | Semantic Interface (Query Generation) | Grounding (Factual Verification) | M | Guide health information seeking |
| FactFinder | 2024 | Drug | Semantic Interface (Query Generation) | Grounding (Factual Retrieval) | M | Life-science question answering |
| DDI-GPT | 2024 | Drug | Reasoner (Prediction & Explanation) | Representation (Semantic Enhancement) | C | Explainable prediction of drug-drug interactions |
| Soman et al. | 2024 | Drug, Omics | Constructor, Interface (KG Construction, Text Generation) | Grounding (Knowledge Base & Traceability) | M, C | Drug repurposing and medical QA |
| BioLORD | 2024 | Drug, Omics | Reasoner (Semantic Representation Optimization) | Grounding (Knowledge Base & Semantic Support) | M | Enhance biomedical semantic similarity |
| HeCiX | 2024 | Drug, Omics | Semantic Interface (Format Conversion) | Grounding (Knowledge Base) | M | Enhance clinical trial research |
| KRAGEN | 2024 | Drug, Omics | Orchestrator (Plan Generation & Execution) | Grounding (Knowledge Base & Visualization) | M | Visualized biomedical QA system |
| MechGPT | 2024 | Material | Constructor, Reasoner, Orchestrator (KG Construction, Explanation, Multi-agent) | Grounding, Reasoning Constraints (Knowledge & Explainability) | C, S, I | Materials analysis and design |
| SciAgents | 2024 | Material | Constructor, Reasoner, Generator (KG Construction, Analytical Reasoning, Hypothesis Generation) | Grounding (Knowledge Base) | M, I | Automated discovery in biomaterials science |
| MKG | 2024 | Material | Constructor (KG Construction & Maintenance) | Grounding (Knowledge Base) | I | Multidisciplinary materials science discovery |
| OpenTCM | 2025 | Drug | Interface, Reasoner, Constructor (Retrieval, Diagnosis, KG Construction) | Reasoning Constraints (Knowledge Retrieval Enhancement) | M | Traditional Chinese Medicine diagnosis |
| iKraph | 2025 | Drug | Constructor (KG Construction) | Grounding (Knowledge Base) | S | Biomedical research |
| KGT | 2025 | Drug, Omics | Interface, Reasoner (Query Generation & Reasoning Output) | Grounding, Reasoning Constraints (Fact Checking & Path Constraint) | S, M | Drug repositioning and pan-cancer QA |
| ESCARGOT | 2025 | Drug, Omics | Generator, Orchestrator (Strategy & Code Generation) | Grounding (Knowledge Base) | S, I | Biomedical AI agent |
| Cat-KG | 2025 | Chemistry | Constructor, Reasoner, Interface (KG Construction, Path Reasoning & Explanation) | Grounding, Reasoning Constraints (Explainability & Path Constraint) | C, M | Relay catalysis pathway recommendation |
| Ma et al. | 2025 | Chemistry | Constructor, Generator (KG Construction & Path Recommendation) | Grounding (Structured Knowledge Management) | S | Automated retrosynthesis planning of macromolecules |
| KG-FM | 2025 | Material | Constructor, Reasoner (Multi-modal Extraction, QA & Reasoning) | Grounding (Knowledge Base & Visualization) | M | Improve LLM QA in framework materials |
| SciToolAgent | 2025 | Comprehensive | Orchestrator (Multi-agent Collaboration) | Grounding (Tool Knowledge Base) | S, M, I | Scientific agent for multi-tool integration |

Task abbreviations: M: Multi-source Data Interpretation; C: Complex System Mechanism Analysis; S: System Performance Optimization; I: Innovative Solution Design.

| Domain | Database | Short Description | Statistics | Update Frequency |
|---|---|---|---|---|
| Drug Databases | BindingDB | Publicly accessible collection of measured drug-target binding affinities | 3.1M binding data for 1.3M compounds & 9.6K targets | Weekly |
| | DrugBank | Richly annotated resource combining drug data with target, pathway & pharmacogenomic info | 18K approved & investigational drugs, 23K drug-target links, 3.6K drug-transporter links, 6K drug-enzyme links | Monthly |
| | CTD | The comparative toxicogenomics database links chemicals, genes, phenotypes and diseases | 101M toxicogenomic interactions, 19K chemicals, 57K genes, 7K diseases | -- |
| | DisGeNET | Comprehensive platform integrating genes, variants, and human diseases, combining curated data and text-mined evidence | 2.0M gene–disease associations, 4.4M variant–disease associations, and 28M disease–disease associations | -- |
| | DrugCentral | Authoritative, open-access compendium of active pharmaceutical ingredients approved worldwide | 5K drugs, 152K pharmaceutical products | -- |
| | PharmGKB | Provides PGx data from literature annotations to genotype-based treatment guidelines | 209 clinical guideline annotations, 1.2K drug label annotations, 483 FDA drug label annotations | -- |
| | SIDER | Database of marketed drugs and their recorded adverse drug reactions (ADRs) | 1.4K drugs, 6K side effects, 140K drug–side effect pairs | Static |
| Omics Databases | UniProt | Comprehensive, high-quality protein sequence & functional annotation database | 573K reviewed entries, 253K unreviewed entries | 4 Weeks |
| | Ensembl | Genome browser & annotation resource for vertebrates and selected eukaryotes | 300+ species, 40K coding genes (human), 1M variants | 3 Months |
| | KEGG | Database integrating pathways, genes, compounds, drugs and diseases for system analysis | 75K pathways, 54M genes, 12K drugs, 11K diseases | Daily |
| | Reactome | Curated, peer-reviewed pathway database emphasizing human biology | 2.8K human pathways covering 11.6K proteins, 16K reactions | Monthly |
| | InterPro | Comprehensive resource integrating multiple protein signature databases | 13 member databases covering millions of protein sequences | Quarterly |
| | RNAcentral | Comprehensive ncRNA sequence collection representing all ncRNA types across diverse organisms | 44.5M non-coding RNA sequences, covering 1.1K species from 54 databases | Twice a year |
| | STRING | Database of known and predicted protein–protein interactions across multiple organisms | 59.3M proteins, 20B PPIs, 12.5K organisms | -- |
| | MONDO† | Ontology harmonizing disease concepts with standardized identifiers, mappings, and classifications for clinical use | 17 disease resources integrated into 22K unified disease concepts | Monthly |
| | UMLS | Comprehensive biomedical ontology integrating multiple vocabularies to unify concepts, names, and relationships | 17M names, 3.4M concepts, 8.7M codes, 190 vocabularies, 29 languages | Twice a year |
| Chemical Databases | ChEBI† | Chemical Entities of Biological Interest, a dictionary and ontology of small molecular entities | 62K compounds | Monthly |
| | ChEMBL | A curated database of drug-like bioactive molecules that integrates chemical, bioactivity and genomic data to support drug discovery | 2.5M compounds, 1.7M assays, 15.5K drugs, 48.8K drug indications | -- |
| | Reaxys | Elsevier-curated chemical reactions, substances, properties & literature | 283M chemical substances, 73M reactions, 500M physicochemical data points | -- |
| | PubChem | NIH repository of chemical substances, bioactivities & patents | 122M compounds, 338M substances, 297M bioactivities | Daily |
| | ZINC | Free database of commercially available compounds for virtual screening | 980M purchasable compounds | -- |
| Materials Databases | OQMD | Open-access database of DFT-calculated properties for inorganic and hybrid materials | 1.2M materials | -- |
| | Materials Project | High-throughput DFT database of materials properties & crystal structures | 144K inorganic compounds, 76K bandstructures, 64K molecules, 530K nanoporous materials, and diverse tensors and electrodes | -- |

| Category | Software Name | URL | Description | Supported Tasks | License |
|---|---|---|---|---|---|
| Automated KG Construction | DeepKE | Link | A knowledge extraction toolkit for knowledge graph construction supporting cnSchema, low-resource, document-level and multimodal scenarios for entity, relation and attribute extraction | Named Entity Recognition, Relation Extraction, Attribute Extraction | MIT License |
| | OneKE | Link | A flexible dockerized system for schema-guided knowledge extraction, capable of extracting information from the web and raw PDF books across multiple domains like science and news | Named Entity Recognition, Web News Extraction, Book Knowledge Extraction | MIT License |
| | AutoKG | Link | An LLM-powered multi-agent framework for automated KG construction and reasoning, integrating external knowledge sources for large-scale extraction | Entity/Relation Extraction, KG Construction, KG Reasoning | MIT License |
| Graph Databases and Storage | Neo4j | Link | A widely used native graph database with ACID transactions and Cypher query language, suitable for highly connected data analysis | Graph Storage, Graph Querying, Graph Algorithms | GPLv3 |
| | JanusGraph | Link | A highly scalable graph database optimized for storing and querying large graphs with billions of vertices and edges distributed across a multi-machine cluster | Graph Storage, Gremlin Query | Apache 2.0 (code), CC-BY-4.0 (docs) |
| | ArangoDB | Link | A scalable multi-model database combining native graphs, an integrated search engine, and JSON document support under a single query language | Multi-Model Storage, Graph Traversal, Path Querying | BSL 1.1 |
| | Virtuoso | Link | A hybrid relational-RDF database supporting both SPARQL and SQL, widely used for Linked Data publishing | RDF Storage, SPARQL Query, Ontology Reasoning | GPL v2 |
| | TigerGraph | Link | A commercial distributed parallel graph database optimized for real-time graph analytics, offering GSQL for querying at trillion-edge scale | Graph Storage, Parallel Graph Computation, Real-time Querying | Proprietary |
| Representation Learning & Reasoning | OpenKE | Link | A sub-project of OpenSKL, providing an Open-source Knowledge Embedding toolkit for knowledge representation learning (KRL) | KG Embedding, Link Prediction, Triple Classification | MIT License |
| | DGL-KE | Link | A high-performance, easy-to-use, and scalable package for learning large-scale knowledge graph embeddings | KG Embedding, Large-scale Link Prediction | Apache 2.0 |
| | PyKEEN | Link | A Python library for KG embeddings with modular design, automated hyperparameter tuning, and reproducibility guarantees | KG Embedding, Model Training and Evaluation, Hyperparameter Optimization | MIT License |
| | AmpliGraph | Link | A suite of neural machine learning models for relational learning, a branch of machine learning that deals with supervised learning on knowledge graphs | KG Embedding, Link Prediction, Anomaly Detection | Apache 2.0 |
| | LibKGE | Link | A PyTorch-based library for efficient training, evaluation, and hyperparameter optimization of knowledge graph embeddings (KGE) | Link Prediction, Training, Evaluation of KGE Models | MIT License |
| | Pykg2vec | Link | A library for learning representations of entities and relations in knowledge graphs | KGE Model Implementations, Hyperparameter Discovery, Learned Embedding Inspection | MIT License |
| Auxiliary Tools | Doccano | Link | An open-source text annotation tool with a web interface for humans | Annotation for Text Classification, Sequence Labeling, Sequence to Sequence tasks | MIT License |
| | Label Studio | Link | An open-source data labeling tool supporting multimodal data, such as text, images, audio, video, and time series | Multi-modal Data Annotation, Quality Assurance | Apache 2.0 |
| | Gephi | Link | An award-winning open-source platform for visualizing and manipulating large graphs | Graph Visualization, Network Analysis, Community Detection | CDDL 1.0 |
| | Cytoscape | Link | A network visualization platform originally designed for bioinformatics, now supporting general-purpose network analysis with rich plugins | Graph Visualization, Attribute Integration, Topology Analysis | LGPL |
| | GraphGPT | Link | An experimental tool using GPT models to extract entities and relations from text and generate interactive KG visualizations | Triple Extraction, KG Construction, Visualization | MIT License |
| | LlamaIndex | Link | A component for building KG indices from unstructured text, integrating subject–predicate–object triples into LLM-based retrieval pipelines | Triple Extraction, KG Indexing, KG-based QA | MIT License |

If you find this repository useful, please cite our paper:

@article{Ding_2025,
title={Bridging Data and Discovery: A Survey on Knowledge Graphs in AI for Science},
url={http://dx.doi.org/10.36227/techrxiv.176369442.22009541/v1},
DOI={10.36227/techrxiv.176369442.22009541/v1},
publisher={Institute of Electrical and Electronics Engineers (IEEE)},
author={Ding, Keyan and Zhu, Zhihui and Tang, Yuqi and Feng, Kehua and Zhuang, Xiang and Wang, Hongwei and Yang, Yi and Du, Huifang and Ni, Zhangkai and Wang, Shiqi and Fan, Xiaohui and Xing, Huabin and Bai, Lei and Liu, Qi and Wang, Haofen and Zhang, Qiang and Chen, Huajun},
year={2025},
month=nov }
If you notice any mistakes or have suggestions, please feel free to contact us at: zhihui.zhu01@outlook.com