This repo provides a pipelined tutorial that builds up a knowledge base from web-crawled text using Protege and Apache Jena.
Check this link for installing Apache Jena & Fuseki (in Korean).
- Install Python dependencies.
pip install cython
pip install -r requirements.txt
- Scrape the web. (Results in ./data/raw)
python scrap.py
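The internals of scrap.py are not shown here; as a minimal sketch of the kind of extraction step it might perform, the following standard-library-only example pulls paragraph text out of an HTML page (the sample HTML is illustrative, not from the repo):

```python
from html.parser import HTMLParser

class ParagraphExtractor(HTMLParser):
    """Collects the text content of <p> elements from an HTML document."""
    def __init__(self):
        super().__init__()
        self.in_p = False
        self.paragraphs = []
        self._buf = []

    def handle_starttag(self, tag, attrs):
        if tag == "p":
            self.in_p = True
            self._buf = []

    def handle_endtag(self, tag):
        if tag == "p" and self.in_p:
            self.in_p = False
            self.paragraphs.append("".join(self._buf).strip())

    def handle_data(self, data):
        if self.in_p:
            self._buf.append(data)

page = "<html><body><p>An apple is a fruit.</p><p>It grows on trees.</p></body></html>"
extractor = ParagraphExtractor()
extractor.feed(page)
print(extractor.paragraphs)
```

A real crawler would fetch `page` over HTTP and write each document under ./data/raw; this sketch covers only the parsing step.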
- Explore the scraped data. (Results in ./data/stat)
python stat.py
- Translate into Korean. (Optional. Results in ./data/translated)
python translate.py
- Annotate the dataset. Insert any class label of your interest as a text span wrapped in double square brackets, e.g. [[Fruit]]. Also correct the chunking of paragraphs, incorrect newline characters, etc.
mkdir ./data/anno
cp -r ./data/raw/* ./data/anno
* Annotate the files under ./data/anno
- Create a neat Excel file for annotation. (Optional)
python anno2xls.py
- Create OWL object classes & properties (T-Box) with Protege and place the file under ./data/ontology. You can use WebProtege to collaborate with your team.
- ./data/ontology/root-ontology.owl
- Populate individuals. (Results in ./data/ontology/basic.txt. Copy and paste the generated text into root-ontology.owl.)
python populate.py
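populate.py emits A-Box text that gets pasted into root-ontology.owl. A hypothetical sketch of that kind of generation step, turning (individual, class) pairs into RDF/XML snippets; the namespace and the example pairs are illustrative, not taken from the repo:

```python
# Hypothetical namespace; the real ontology IRI comes from root-ontology.owl.
NS = "http://example.org/ontology#"

def individual_axiom(name, cls):
    """Emit an OWL NamedIndividual declaration in RDF/XML syntax."""
    return (
        f'<owl:NamedIndividual rdf:about="{NS}{name}">\n'
        f'    <rdf:type rdf:resource="{NS}{cls}"/>\n'
        f'</owl:NamedIndividual>'
    )

# Illustrative (individual, class) pairs, e.g. derived from [[...]] annotations.
pairs = [("apple", "Fruit"), ("oak", "Tree")]
snippet = "\n".join(individual_axiom(n, c) for n, c in pairs)
print(snippet)
```

The generated text can then be pasted inside the ontology's root RDF element, which is what the step above describes for basic.txt.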
- Save root-ontology.owl in Turtle syntax.
- ./data/ontology/root-ontology.ttl
- Run Apache Jena Fuseki and write the SPARQL queries you need. (Results in sparql/competency_questions)
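Once Fuseki is serving the dataset (by default at http://localhost:3030), competency questions are written as SPARQL queries against its endpoint. A minimal example; the `ex:` namespace and the `Fruit` class are placeholders for whatever your ontology defines:

```sparql
PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX ex:  <http://example.org/ontology#>

# List every individual typed as ex:Fruit (class name is illustrative).
SELECT ?individual
WHERE {
  ?individual rdf:type ex:Fruit .
}
```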