A framework for constructing spatio-temporal knowledge graphs from French news articles to monitor food security in Burkina Faso.
Paper: *Spatio-Temporal Knowledge Graph from Unstructured Texts: A Multi-Scale Approach for Food Security Monitoring*

Authors: Charles Abdoulaye Ngom, Landy Rajaonarivo, Sarah Valentin, Maguelonne Teisseire

Accepted in AGILE: GIScience Series, 2026
Model DOI: 10.57967/hf/7766 | Dataset DOI: 10.57967/hf/7767
- Overview
- Installation
- Project Structure
- Data Description
- Reproducibility Guide
- Neo4j Queries and Visualizations
- Results and Interpretation
- Pre-trained Models
- Limitations
- Citation
- Acknowledgments
Food security monitoring in West Africa requires timely, fine-grained, and interpretable information from diverse sources. This project implements an integrated NLP and knowledge graph pipeline that extracts, structures, and analyzes food security-related information from French press articles published in 2009 from Burkina Faso.
Dataset scope: The CamemBERT NER model was trained on 15,000 articles. The final knowledge graph was extracted from 1,000 articles (a subset used for validation and graph construction).
The approach combines three specialized components:
- Spatial entity recognition -- Fine-tuned CamemBERT for village-level geographic extraction across the full Burkina Faso administrative hierarchy
- Temporal normalization -- HeidelTime-based extraction and normalization of temporal expressions (TIMEX3 standard)
- Knowledge graph construction -- Projection-based triplet formation and Neo4j graph with spatial, contextual, and temporal inference relationships
- Fine-grained spatial entity extraction: Village-level geographic recognition using fine-tuned CamemBERT, achieving F1=0.94--0.99 across all administrative levels
- Temporal expression normalization: French temporal extraction with HeidelTime (TIMEX3 standard)
- Triplet formation: Projection-based algorithm creating (Risk Indicator, Location, Date) triplets with 81.91% accuracy
- Knowledge graph construction: Neo4j-based graph with spatial hierarchies and temporal inference relations (IS_RECURRENT, IS_SYNCHRONOUS, PRECEDES)
```
1. Data Collection        2. Data Processing               3. KG Exploitation

┌───────────────────┐     ┌────────────────────────────┐   ┌───────────────────────┐
│ Press Articles    │     │ Spatial & Temporal Entity  │   │ Evaluation            │
│ (15,000 articles) │────>│ Recognition (CamemBERT +   │──>│ Indicators            │
│                   │     │ HeidelTime)                │   │                       │
│ Food Security     │     │                            │   │ Spatial               │
│ Lexicon (433      │────>│ Risk Indicator Detection   │   │ Visualization         │
│ terms)            │     │ (Lexicon-based)            │   │ of Risk Indicators    │
│                   │     │                            │   └───────────────────────┘
│ Open Data         │     │ Relationship Rules         │
│ (Wikidata)        │────>│ Definition & Generation    │
│                   │     │                            │
│ Administrative    │     │ Knowledge Graph            │
│ Hierarchy         │────>│ Construction & Update      │
└───────────────────┘     └────────────────────────────┘
```
| Requirement | Version | Purpose |
|---|---|---|
| Python | 3.10+ | Core runtime |
| Neo4j | 4.4+ | Knowledge graph storage and querying |
| Java | 8+ | Required by HeidelTime |
| CUDA GPU | Optional | Recommended for model training (~7 h on an RTX 3090) |
```bash
# 1. Clone the repository
git clone https://github.com/CharlemagneBrain/STKG-FS.git
cd STKG-FS

# 2. Create and activate virtual environment
python -m venv venv
source venv/bin/activate  # Linux/Mac
# or: venv\Scripts\activate  # Windows

# 3. Install Python dependencies
pip install -r requirements.txt

# 4. Verify Java installation (required for HeidelTime)
java -version
```

HeidelTime requires a TreeTagger installation. Update the path in `Temporal_Entities_Detection/config.props`:

```
treeTaggerHome = /path/to/your/TreeTagger
```

The `py-heideltime` package handles the TreeTagger integration automatically. Install it via:

```bash
pip install py-heideltime
```

To set up Neo4j:

- Download and install Neo4j Desktop or use Neo4j Aura
- Create a new database
- Import the generated Cypher file (see Step 6)
| Issue | Solution |
|---|---|
| `cykhash` / `pyrobuf` build failure | These are optional dependencies; core functionality is unaffected. |
| HeidelTime `FileNotFoundError` | Update the `treeTaggerHome` path in `config.props` |
| CUDA out of memory | Reduce `per_device_train_batch_size` to 16 or 8 in the training arguments |
| Neo4j connection error | Ensure Neo4j is running and the credentials are correct |
```
stkgfs/
├── Fine-Tuning/                      # Step 1: CamemBERT fine-tuning for NER
│   ├── CamemBERT_FT.ipynb            # Training and evaluation notebook
│   └── annotations/                  # BIO-tagged training data
│       ├── train_extended_bio_feb.json   # 59,900 training sentences
│       ├── val_extended_bio_feb.json     # 14,758 validation sentences
│       └── test_extended_bio_feb.json    # 11,594 test sentences
│
├── Spatial_Annotation_Detection/     # Step 2: Spatial entity detection
│   ├── Spatial_Pipeline.ipynb        # CamemBERT vs GLiNER evaluation
│   └── data/
│       └── df_sample.csv             # 1,000 annotated articles for evaluation
│
├── Temporal_Entities_Detection/      # Step 3: Temporal extraction
│   ├── Heideltime_Detection.ipynb    # HeidelTime processing notebook
│   ├── config.props                  # HeidelTime/TreeTagger configuration
│   └── data/
│       ├── df_extended_intersection.csv  # Segments with spatial + temporal annotations
│       └── new_heaideltime_today.csv     # HeidelTime extraction results
│
├── Triplet_Formation/                # Step 4: Triplet extraction
│   ├── Triplet_Algo.ipynb            # Projection-based triplet algorithm
│   └── data/
│       ├── lexique_enrichi_final_july.csv  # 433-term food security lexicon
│       ├── annotated_data.csv        # 1,133 expert-annotated triplets
│       └── reconstruct_df_new.csv    # Reconstructed articles
│
├── Events_to_Graph/                  # Step 5: Knowledge graph construction
│   ├── Preprocessing_for_Graph.ipynb # Data analysis and visualization
│   ├── enrich_geo_wikidata_complete.py  # Wikidata geocoding enrichment
│   ├── generate_neo4j_graph.py       # Neo4j Cypher query generation
│   └── data/
│       ├── df_preprocessed.csv       # Preprocessed triplets with labels
│       ├── processed_wd.csv          # Wikidata-enriched locations
│       ├── annotations_df.csv        # Consolidated annotations
│       ├── stkg_dtb.cypher           # Generated Neo4j import file
│       └── stkg_dtb_events.csv       # Event summary CSV
│
├── Preprocessing/                    # Step 0: Text preprocessing
│   ├── Preprocess_News_Papers.py     # Article segmentation with overlap
│   └── raw_articles.csv              # Source articles (available on request, see below)
│
├── csv_to_map.py                     # Step 7: Generate interactive Folium map from query results
│
├── images/                           # Figures for paper and README
│   ├── extract_kg_overview.png       # Knowledge graph overview (Figure 2)
│   ├── H1_Ouaga_historical_events.png  # Temporal monitoring (Figure 4)
│   └── map_env_agr_eco_risks_2009_annotated.png  # Risk map (Figure 5)
│
├── results_csv/                      # Intermediate evaluation results
│   ├── entites_df.csv
│   ├── filtered_df_nelson.csv
│   ├── segmented_csvfile.csv
│   └── spatial_entities_with_pos.csv
│
├── requirements.txt                  # Python dependencies
└── README.md                         # This file
```
- 15,000 French news articles published in 2009 (used for NER model training)
- 1,000 articles used for knowledge graph extraction and validation
- Collected from major West African news outlets (LeFaso.net, Burkina24)
- Articles segmented into overlapping windows (configurable length and overlap)
Note: The raw corpus file (`Preprocessing/raw_articles.csv`) contains copyrighted press articles and is available on request, for research purposes only. Please contact the authors (see Contact) to obtain access.
The CamemBERT NER model is trained on the complete administrative hierarchy of Burkina Faso, derived from the 2022 Statistical Yearbook of Territorial Administration:
| Administrative Level | Count | Examples |
|---|---|---|
| Regions | 13 | Centre, Hauts-Bassins, Sahel |
| Provinces | 45 | Kadiogo, Houet, Soum |
| Departments | 351 | Ouagadougou, Bobo-Dioulasso, Koudougou |
| Villages | 7,936 | Pabre, Koubri, Sya |
Training data was generated via distant supervision: sentences containing place names from the official gazetteer are annotated with their corresponding administrative type using BIO tags.
| Split | Sentences | Purpose |
|---|---|---|
| Train | 59,900 | Model training |
| Validation | 14,758 | Hyperparameter tuning |
| Test | 11,594 | Final evaluation (contains ~20% unseen entities) |
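The distant-supervision step can be pictured with a minimal sketch (the gazetteer entries, whitespace tokenization, and single-token matching here are illustrative, not the project's actual annotation code):

```python
# Illustrative gazetteer mapping place names to administrative types
gazetteer = {"Ouagadougou": "departement", "Hauts-Bassins": "region", "Sya": "village"}

def bio_annotate(sentence: str) -> list[tuple[str, str]]:
    """Tag each token: B-<type> for a gazetteer match, O otherwise.
    (Simplified: real annotation also emits I- tags for multi-token names.)"""
    tags = []
    for token in sentence.split():
        clean = token.strip(".,;:!?")
        if clean in gazetteer:
            tags.append((token, f"B-{gazetteer[clean]}"))
        else:
            tags.append((token, "O"))
    return tags

print(bio_annotate("Inondation signalee a Ouagadougou et au village de Sya."))
```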
Domain-specific lexicon defined and validated by experts:
- 433 unique terms organized into 92 concepts and 8 thematic categories
- Categories: Agriculture, Environment, Economic, Sociopolitical, Dietary, Health, General Crisis, Human Health
- Includes crisis phase annotations (pre-crisis, crisis, post-crisis) where applicable
- 8 vague terms requiring contextual filtering: production, campagne, prix, cout, crise, stock, pluie, pluviometrie
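The contextual filter for vague terms can be sketched as follows (the trigger list is an illustrative assumption; the real lexicon encodes variation triggers explicitly):

```python
# The 8 vague terms listed above, kept only with supporting context
VAGUE = {"production", "campagne", "prix", "cout", "crise", "stock", "pluie", "pluviometrie"}
# Variation triggers (augmentation/diminution); exact list is an assumption
TRIGGERS = {"augmentation", "hausse", "diminution", "baisse"}

def keep_term(term: str, sentence_terms: set[str], sentence_words: set[str]) -> bool:
    """Keep a vague lexicon term only when a non-vague lexicon term or a
    variation trigger co-occurs in the same sentence."""
    if term not in VAGUE:
        return True
    has_specific = any(t not in VAGUE for t in sentence_terms - {term})
    has_trigger = bool(TRIGGERS & sentence_words)
    return has_specific or has_trigger
```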
Important: Results in this repository may differ slightly from values reported in the associated paper. The conclusions and interpretations remain unchanged. Values will be adjusted in the final paper.
The following manual interventions were applied by domain experts:
- Spatial annotation corrections:
  - Homonymous villages: village names appearing in multiple provinces were disambiguated from article context (e.g., a village named "Bissa" exists in several provinces)
  - Orthographic variations: West African toponyms exhibit significant spelling variability (e.g., "Ouagadougou" vs. "Ouaga")
  - False positives: place names coincidentally matching common French words were removed
- Triplet validation (annotation campaign on 1,133 triplets):
  - Term relevance: each detected risk indicator was assessed for food-security relevance (46.9% deemed relevant, 53.1% non-relevant)
  - Spatial association: correct linkage of a risk term to its actual location (84.29% accuracy before correction)
  - Temporal association: correct linkage to the event date (96.56% accuracy before correction)
- Post-processing refinements:
  - Multi-location event associations: 21 additional triplets generated where annotators identified events affecting multiple locations
  - Region name normalization: internal Wikidata names mapped to official names (e.g., "Guiriko" -> "Hauts-Bassins", "Kadiogo" -> "Centre")
Follow these steps sequentially to reproduce the complete pipeline. Each step produces outputs consumed by subsequent steps.
Note on computation time: some pipeline steps are computationally intensive (CamemBERT fine-tuning: ~7 h; HeidelTime temporal extraction: ~8 h). To let users start at any step without re-running the full pipeline, pre-computed intermediate results are provided in the `data/` subdirectory of each step.
Data: requires `Preprocessing/raw_articles.csv` (available on request, see Data Description).
The Preprocess_News_Papers.py script segments raw articles into overlapping windows of 300 words with 20% overlap. It reads raw_articles.csv (columns: TXT, ANNEE) and produces a segmented CSV used by downstream steps.
```bash
cd Preprocessing
python Preprocess_News_Papers.py
```

- Input: `Preprocessing/raw_articles.csv`
- Output: `Spatial_Annotation_Detection/data/df_sample.csv` (segmented articles, 1,000 articles)
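The overlapping-window segmentation can be sketched in a few lines (a simplified stand-in for `Preprocess_News_Papers.py`, using the 300-word / 20% parameters described above):

```python
def segment(text: str, window: int = 300, overlap: float = 0.2) -> list[str]:
    """Split an article into overlapping word windows: 300 words per segment,
    20% overlap by default, i.e. a 240-word stride between segment starts."""
    words = text.split()
    step = max(1, int(window * (1 - overlap)))  # stride between windows
    segments = []
    for start in range(0, len(words), step):
        segments.append(" ".join(words[start:start + window]))
        if start + window >= len(words):
            break  # last window already covers the end of the article
    return segments
```

With a 500-word article this yields two segments whose last/first 60 words coincide, which is what lets downstream NER recover entities that straddle a window boundary.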
```bash
cd Fine-Tuning
jupyter notebook CamemBERT_FT.ipynb
```

- Training data: `Fine-Tuning/annotations/train_extended_bio_feb.json`, `val_extended_bio_feb.json`, `test_extended_bio_feb.json`
- Output: fine-tuned model checkpoint
Training configuration:
| Parameter | Value |
|---|---|
| Base model | camembert-base |
| Learning rate | 5e-5 |
| Batch size | 32 |
| Epochs | 70 |
| Weight decay | 0.01 |
| Frozen layers | Embedding layers only |
| Trainable parameters | 85,062,923 / 110,039,819 (77.3%) |
| Training time | ~7 hours on NVIDIA RTX 3090 |
What the notebook does:

- Loads BIO-tagged training data from `Fine-Tuning/annotations/`
- Tokenizes with the CamemBERT tokenizer and aligns labels
- Freezes embedding layers to preserve pre-trained French representations
- Trains for 70 epochs with evaluation every 100 steps
- Evaluates on the held-out test set with a `seqeval` classification report
Skip training: If you do not have the time or GPU resources to fine-tune the model, you can download our pre-trained model directly from HuggingFace and proceed to Step 3:
```python
from transformers import CamembertTokenizerFast, CamembertForTokenClassification

tokenizer = CamembertTokenizerFast.from_pretrained("CharlesAbdoulaye/BF_NER")
model = CamembertForTokenClassification.from_pretrained("CharlesAbdoulaye/BF_NER")
```

DOI: 10.57967/hf/7766
```bash
cd Spatial_Annotation_Detection
jupyter notebook Spatial_Pipeline.ipynb
```

- Input: `Spatial_Annotation_Detection/data/df_sample.csv` (1,000 annotated articles)
- Output: evaluation metrics comparing CamemBERT vs. GLiNER
What the notebook does:

- Loads 1,000 manually annotated articles from `Spatial_Annotation_Detection/data/df_sample.csv`
- Compares fine-tuned CamemBERT vs. GLiNER (zero-shot)
- Filters predictions against validated entity sets
- Computes character-level BIO metrics with `seqeval`
```bash
cd Temporal_Entities_Detection
jupyter notebook Heideltime_Detection.ipynb
```

Prerequisites: Java 8+ and correct `config.props` paths (see HeidelTime Configuration).

- Input: `Spatial_Annotation_Detection/data/df_sample.csv` (segmented articles from Step 1)
- Output: `Temporal_Entities_Detection/data/new_heaideltime_today.csv` (temporal annotations)
- Runtime: ~8 hours (HeidelTime processes each article segment individually via a Java subprocess)
What the notebook does:

- Extracts publication dates from article headers using regex
- Runs HeidelTime with French language rules and the `news` document type
- Extracts DATE-type TIMEX3 expressions with character spans
- Saves results to `Temporal_Entities_Detection/data/new_heaideltime_today.csv`
Long runtime: HeidelTime processing takes approximately 8 hours on the full corpus. The output file `Temporal_Entities_Detection/data/new_heaideltime_today.csv` is included in the repository, so you can skip this step and proceed directly to Step 5 if needed.
```bash
cd Triplet_Formation
jupyter notebook Triplet_Algo.ipynb
```

- Input: `Triplet_Formation/data/reconstruct_df_new.csv` (reconstructed articles), `Triplet_Formation/data/lexique_enrichi_final_july.csv` (food security lexicon)
- Output: `Events_to_Graph/data/df_preprocessed.csv` (triplets with labels)
What the notebook does:

- Reconstructs full articles from segments (with span offset correction)
- Splits text into sentences using `wtpsplit` (SaT-12l, French)
- Detects food security terms via lexicon matching with morphological variants
- Applies vague term filtering: retains ambiguous terms only when co-occurring with non-vague terms or variation triggers (augmentation/diminution)
- Spatial projection: assigns the most specific location to each term using hierarchical preference (village > department > province > region > country)
- Temporal projection: assigns closest valid date not exceeding publication date
- Enriches each triplet with context window (1 sentence before + after)
```bash
cd Events_to_Graph

# 6a. Analyze and visualize preprocessed data
jupyter notebook Preprocessing_for_Graph.ipynb

# 6b. Enrich locations with Wikidata coordinates and hierarchy
python enrich_geo_wikidata_complete.py data/df_preprocessed.csv data/processed_wd.csv

# 6c. Generate Neo4j Cypher import file
python generate_neo4j_graph.py data/processed_wd.csv data/stkg_dtb.cypher
```

- Input: `Events_to_Graph/data/df_preprocessed.csv` (triplets from Step 5)
- Output: `Events_to_Graph/data/processed_wd.csv` (Wikidata-enriched), `Events_to_Graph/data/stkg_dtb.cypher` (Neo4j import file)
Wikidata enrichment (`enrich_geo_wikidata_complete.py`):

- Queries the Wikidata SPARQL endpoint for each unique location
- Retrieves: Wikidata Q-ID, coordinates (lat/lon), administrative type
- Builds the administrative hierarchy via `P131` (located in administrative entity)
- Finds neighboring entities within a 10 km radius
- Handles name mappings (e.g., "Hauts-Bassins" <-> "Guiriko" in Wikidata)
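The enrichment queries look roughly like this (`P625` and `P131` are real Wikidata properties for coordinates and administrative parent; the exact SPARQL in `enrich_geo_wikidata_complete.py` may differ):

```python
def coordinates_query(place_name: str) -> str:
    """Build a SPARQL query resolving a French place label to its Wikidata
    item, coordinates (P625), and administrative parent (P131)."""
    return f"""
    SELECT ?item ?coord ?parent WHERE {{
      ?item rdfs:label "{place_name}"@fr ;
            wdt:P625 ?coord .                  # geographic coordinates
      OPTIONAL {{ ?item wdt:P131 ?parent . }}  # located in administrative entity
    }} LIMIT 5
    """

query = coordinates_query("Bobo-Dioulasso")
# Executing it requires network access, e.g.:
# requests.get("https://query.wikidata.org/sparql",
#              params={"query": query, "format": "json"})
```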
Graph generation (`generate_neo4j_graph.py`):

- Filters triplets: keeps only validated entries (`label=1`) with Wikidata IDs
- Aggregates triplets sharing the same (term, location, date) into unique Event nodes
- Creates 4 node types and 3 relationship categories (see Results)
- Normalizes region names from Wikidata internal names to official names
Expected runtime: Wikidata enrichment takes ~10-20 minutes (depends on API rate limits).
```
// In Neo4j Browser or cypher-shell:
:source /path/to/Events_to_Graph/data/stkg_dtb.cypher
```

Or via the command line:

```bash
cat Events_to_Graph/data/stkg_dtb.cypher | cypher-shell -u neo4j -p <password>
```

After running Query 4 and exporting the results as a CSV, use `csv_to_map.py` to generate an interactive Folium map:
```bash
python csv_to_map.py
```

- Input: CSV file with columns `place, type, latitude, longitude, Flood, Fire, ...` (exported from Neo4j Query 4)
- Output: `output_data/dataviz/carto_all_events.html` (interactive HTML map)
The script uses different marker shapes by administrative level (circle = region, star = department, marker = village) and color-coded risk indicators (blue = Flood, red = Fire, green = Agricultural Campaign, purple = Price Increase).
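The shape/color logic can be summarized as follows (a sketch of the mapping described above; the fallback values are assumptions, and the actual Folium rendering lives in `csv_to_map.py`):

```python
# Marker shape by administrative level, color by dominant risk indicator
SHAPE_BY_TYPE = {"region": "circle", "departement": "star", "village": "marker"}
COLOR_BY_RISK = {"Flood": "blue", "Fire": "red",
                 "AgriculturalCampaign": "green", "PriceIncrease": "purple"}

def marker_style(admin_type: str, dominant_risk: str) -> tuple[str, str]:
    """Return a (shape, color) pair, falling back to a plain gray marker
    for unlisted levels or risks (fallbacks are an assumption)."""
    return (SHAPE_BY_TYPE.get(admin_type, "marker"),
            COLOR_BY_RISK.get(dominant_risk, "gray"))
```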
The knowledge graph contains 4 node types and 3 relationship categories:
Node types:

| Node | Properties | Description |
|---|---|---|
| `Event` | `id, title, term, lieu, date, frequency_total, contexts` | Food security situation (risk + location + date) |
| `Risk` | `name, theme, concept, phase` | Thematic risk indicator |
| `Location` | `name, type, wikidata_id, latitude, longitude` | Geographic entity with administrative level |
| `Time` | `datetime, year, month, day` | Normalized temporal entity |
Relationship categories:

- Contextual relationships (Event <-> entities):
  - `CONCERNS`: Event -> Risk (with `article_count`)
  - `LOCATED_IN`: Event -> Location
  - `OCCURRED_ON`: Event -> Time
- Spatial relationships (Location hierarchy):
  - `IS_FROM_DEPARTEMENT`: Village -> Department
  - `IS_FROM_PROVINCE`: Department -> Province
  - `IS_FROM_REGION`: Province -> Region
- Temporal inference relationships (Event <-> Event):
  - `IS_RECURRENT`: same risk + same location, different dates (with `duration_days`)
  - `IS_SYNCHRONOUS`: same location + same date, different risks (bidirectional)
  - `PRECEDES`: same location + same year, different risks, sequential dates (with `duration_days`)
Note: For readability and clarity in the paper, some node types and relationships have been simplified or omitted from the queries and figures below. This does not affect the interpretation of the results, as the full graph schema described above remains available for comprehensive exploration.
Query 1: Extract of knowledge graph showing three types of relationships (Figure 4)
MATCH (n:Event)-[c:CONCERNS]->(r:Risk)
MATCH (n)-[li:LOCATED_IN]->(l:Location {name : "Bobo-Dioulasso"})
MATCH (n)-[on:OCCURRED_ON]->(t:Time)
MATCH (n)-[rec:PRECEDES]->(n1:Event)
MATCH (n)-[irec:IS_RECURRENT]->(n2:Event)
MATCH (n2)-[on2:OCCURRED_ON]->(t2:Time)
MATCH (n2:Event)-[rel3:IS_SYNCHRONOUS]->()
MATCH (l)-[isf:IS_FROM_PROVINCE]->(l4:Location)
MATCH (n1)-[loacf:OCCURRED_ON]->(to:Time)
MATCH (n1)-[css:CONCERNS]->(rss:Risk)
MATCH (n)-[cs:CONCERNS]->(rs:Risk)
MATCH (n2)-[cr:CONCERNS]->(rr:Risk)
MATCH (l5)-[dept:IS_FROM_DEPARTEMENT]->(l)
RETURN n, c, r, li, l, on, t, rec, n1, n2, irec, on2, t2, isf, l4, loacf, to, css, rss, cs, rs, cr, rr, l5, dept

Query 2: Knowledge graph extract: temporal monitoring of events in the Ouagadougou department (Figure 4)
// --- BLOCK 1: Key events of September 1 (main flood) ---
MATCH (e_flood:Event)-[c_flood:CONCERNS]->(r_flood:Risk {name: 'inondation'})
MATCH (e_flood)-[o_flood:OCCURRED_ON]->(t_flood:Time {datetime: '2009-09-01'})
MATCH (e_flood)-[li_flood:LOCATED_IN]->(l_ouaga:Location {name: 'Ouagadougou'})
WITH collect(e_flood)[0..3] AS flood_events,
collect({
event: e_flood,
risk: r_flood,
time: t_flood,
concerns: c_flood,
occurred: o_flood,
         located: li_flood            // <-- kept only here
})[0..3] AS flood_triplets,
l_ouaga
// --- BLOCK 2: Synchronous events of September 1 (other risks) ---
OPTIONAL MATCH (e_sync:Event)-[c_sync:CONCERNS]->(r_sync:Risk)
MATCH (e_sync)-[o_sync:OCCURRED_ON]->(t_sync:Time {datetime: '2009-09-01'})
MATCH (e_sync)-[:LOCATED_IN]->(l_ouaga)
WHERE r_sync.name <> 'inondation'
WITH flood_events, flood_triplets, l_ouaga,
collect({
event: e_sync,
risk: r_sync,
time: t_sync,
concerns: c_sync,
occurred: o_sync
})[0..4] AS sync_triplets
// --- BLOCK 3: Events BEFORE September 1 (precursors) ---
OPTIONAL MATCH (e_before:Event)-[c_before:CONCERNS]->(r_before:Risk)
MATCH (e_before)-[o_before:OCCURRED_ON]->(t_before:Time)
MATCH (e_before)-[:LOCATED_IN]->(l_ouaga)
WHERE t_before.datetime < '2009-09-01'
AND t_before.datetime >= '2009-01-01'
WITH flood_events, flood_triplets, sync_triplets, l_ouaga,
collect({
event: e_before,
risk: r_before,
time: t_before,
concerns: c_before,
occurred: o_before
})[0..10] AS before_triplets
// --- BLOCK 4: Events AFTER September 1 (consequences) ---
OPTIONAL MATCH (e_after:Event)-[c_after:CONCERNS]->(r_after:Risk)
MATCH (e_after)-[o_after:OCCURRED_ON]->(t_after:Time)
MATCH (e_after)-[:LOCATED_IN]->(l_ouaga)
WHERE t_after.datetime > '2009-09-01'
AND t_after.datetime <= '2009-09-15'
WITH flood_events, flood_triplets, sync_triplets, before_triplets, l_ouaga,
collect({
event: e_after,
risk: r_after,
time: t_after,
concerns: c_after,
occurred: o_after
})[0..3] AS after_triplets
// --- BLOCK 5: Spatial hierarchy ---
OPTIONAL MATCH (l_ouaga)-[hp:IS_FROM_PROVINCE]->(province:Location)
OPTIONAL MATCH (province)-[hr:IS_FROM_REGION]->(region:Location)
// --- BLOCK 6: IS_FROM_VILLAGE relation between Ouagadougou and its villages ---
OPTIONAL MATCH (l_village:Location)-[hv:IS_FROM_VILLAGE]->(l_ouaga)
WITH flood_events, flood_triplets, sync_triplets, before_triplets, after_triplets,
l_ouaga, province, hp, region, hr,
collect({relation: hv, village: l_village})[0..3] AS villages_ouaga
// --- BLOCK 8: Bogodogo village with its events ---
OPTIONAL MATCH (e_village:Event)-[li_village:LOCATED_IN]->(l_village:Location {name: 'Bogodogo'})
OPTIONAL MATCH (e_village)-[c_village:CONCERNS]->(r_village:Risk)
OPTIONAL MATCH (e_village)-[o_village:OCCURRED_ON]->(t_village:Time)
OPTIONAL MATCH (l_village)-[hp_village:IS_FROM_PROVINCE]->(province)
OPTIONAL MATCH (l_village)-[hv_bogodogo:IS_FROM_DEPARTEMENT]->(l_ouaga)
WITH flood_events, flood_triplets, sync_triplets, before_triplets, after_triplets,
l_ouaga, province, hp, region, hr, villages_ouaga,
collect({
event: e_village,
risk: r_village,
time: t_village,
location: l_village,
concerns: c_village,
occurred: o_village,
located: li_village,
is_from_province: hp_village,
is_from_village: hv_bogodogo
})[0..2] AS village_events
// --- BLOCK 9: Collect all events ---
WITH flood_triplets, sync_triplets, before_triplets, after_triplets,
l_ouaga, province, hp, region, hr, villages_ouaga, village_events,
[e IN flood_triplets | e.event] +
[e IN sync_triplets | e.event] +
[e IN before_triplets | e.event] +
[e IN after_triplets | e.event] AS all_events
// --- BLOCK 10: IS_RECURRENT and IS_SYNCHRONOUS relations ---
UNWIND all_events AS ev1
UNWIND all_events AS ev2
OPTIONAL MATCH (ev1)-[rec:IS_RECURRENT]-(ev2)
WHERE id(ev1) < id(ev2)
OPTIONAL MATCH (ev1)-[sync_rel:IS_SYNCHRONOUS]-(ev2)
WHERE id(ev1) < id(ev2)
WITH flood_triplets, sync_triplets, before_triplets, after_triplets,
l_ouaga, province, hp, region, hr, villages_ouaga, village_events,
all_events,
collect(DISTINCT CASE WHEN rec IS NOT NULL
THEN {rel: rec, from: ev1, to: ev2} END) AS recurrent_rels,
collect(DISTINCT CASE WHEN sync_rel IS NOT NULL
THEN {rel: sync_rel, from: ev1, to: ev2} END) AS synchronous_rels
// --- BLOCK 11: PRECEDES relations - DIRECT CHAIN ONLY ---
// Keep only PRECEDES relations that have no intermediate event
UNWIND all_events AS ev_from
UNWIND all_events AS ev_to
OPTIONAL MATCH (ev_from)-[prec:PRECEDES]->(ev_to)
WHERE ev_from IS NOT NULL AND ev_to IS NOT NULL
// CRUCIAL FILTER: exclude pairs with an intermediate event in our set
AND NOT EXISTS {
MATCH (ev_from)-[:PRECEDES]->(ev_mid)-[:PRECEDES]->(ev_to)
WHERE ev_mid IN all_events
}
WITH flood_triplets, sync_triplets, before_triplets, after_triplets,
l_ouaga, province, hp, region, hr, villages_ouaga, village_events,
recurrent_rels, synchronous_rels,
collect(DISTINCT CASE WHEN prec IS NOT NULL
THEN {rel: prec, from: ev_from, to: ev_to} END) AS precedes_rels
// --- RETURN THE SUBGRAPH ---
RETURN
flood_triplets AS Evenements_Inondation_1er_Sept,
sync_triplets AS Evenements_Synchrones_1er_Sept,
before_triplets AS Evenements_Avant_Aout,
after_triplets AS Evenements_Apres_2_15_Sept,
l_ouaga AS Location_Ouagadougou,
province AS Province_Kadiogo,
hp AS IS_FROM_PROVINCE,
region AS Region_Centre,
hr AS IS_FROM_REGION,
villages_ouaga AS Villages_IS_FROM_VILLAGE,
village_events AS Evenements_Village_Bogodogo,
[r IN recurrent_rels WHERE r IS NOT NULL] AS Relations_IS_RECURRENT,
[s IN synchronous_rels WHERE s IS NOT NULL] AS Relations_IS_SYNCHRONOUS,
[p IN precedes_rels WHERE p IS NOT NULL] AS Relations_PRECEDES_Chaine_Directe;
Query 4: Vulnerable areas across the country affected by frequent environmental and economic crises in 2009 (Figure 5)
MATCH (e:Event)-[:CONCERNS]->(r:Risk)
WHERE r.theme = 'environment' OR r.theme = 'economique'
MATCH (e)-[:OCCURRED_ON]->(t:Time)
WHERE t.year = 2009
MATCH (e)-[:LOCATED_IN]->(loc:Location)
RETURN
loc.name AS Lieu,
loc.type AS type,
loc.latitude AS latitude,
loc.longitude AS longitude,
  // ============= ENVIRONMENTAL RISKS =============
  SUM(CASE WHEN r.name = 'inondation' THEN 1 ELSE 0 END) AS Flood,
  SUM(CASE WHEN r.name = 'incendie' THEN 1 ELSE 0 END) AS Fire,
  SUM(CASE WHEN r.name = 'sécheresse' THEN 1 ELSE 0 END) AS Drought,
  SUM(CASE WHEN r.name = 'tempête' THEN 1 ELSE 0 END) AS Storm,
  SUM(CASE WHEN r.name = 'tornade' THEN 1 ELSE 0 END) AS Tornado,
  SUM(CASE WHEN r.name = 'crue' THEN 1 ELSE 0 END) AS FloodPeak,
  SUM(CASE WHEN r.name = 'orage' THEN 1 ELSE 0 END) AS Thunderstorm,
  SUM(CASE WHEN r.name = 'désertification' THEN 1 ELSE 0 END) AS Desertification,
  SUM(CASE WHEN r.name = 'déforestation' THEN 1 ELSE 0 END) AS Deforestation,
  SUM(CASE WHEN r.name = 'éboulement' THEN 1 ELSE 0 END) AS Landslide,
  SUM(CASE WHEN r.name = 'écroulement' THEN 1 ELSE 0 END) AS Collapse,
  // ============= ECONOMIC RISKS =============
  // Agricultural campaign (single term)
  SUM(CASE WHEN r.name = 'campagne agricole' THEN 1 ELSE 0 END) AS AgriculturalCampaign,
  // Price increase (aggregates all kinds of rises)
  SUM(CASE WHEN r.name IN [
    'augmentation des prix',
    'augmentation du prix',
    'hausse du prix',
    'hausse du coût',
    'hausse des prix',
    'montée du prix',
    'prix élevés',
    'élévation du coût',
    'inflation'
  ] THEN 1 ELSE 0 END) AS PriceIncrease
| Lieu | Type | Latitude | Longitude | Flood | Fire | Drought | Storm | Tornado | FloodPeak | Thunderstorm | Desertification | Deforestation | Landslide | Collapse | AgriculturalCampaign | PriceIncrease |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Burkina | country | 12.2383 | -1.5616 | 9 | 2 | 0 | 0 | 0 | 0 | 0 | 1 | 1 | 0 | 1 | 2 | 14 |
| Ouagadougou | departement | 12.35 | -1.5167 | 18 | 3 | 0 | 1 | 0 | 1 | 1 | 1 | 0 | 0 | 0 | 0 | 7 |
| Paspanga | village | N/A | N/A | 3 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| BogandΓ© | village | 12.9714 | -0.1436 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| Centre-Nord | province | 13.25 | -1.0833 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 |
| Bobo-Dioulasso | departement | 11.1785 | -4.3069 | 2 | 3 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 1 |
| Sahel | region | 14.4747 | 13.5117 | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 1 | 0 |
| Yatenga | province | 13.5833 | -2.4167 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 |
| Gounghin | village | N/A | N/A | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 |
Fine-tuned CamemBERT evaluated on a held-out test set containing ~20% unseen entities at each hierarchical level:
| Entity Type | Precision | Recall | F1-Score | Support |
|---|---|---|---|---|
| Country | 0.99 | 0.99 | 0.99 | 4,648 |
| Region | 1.00 | 0.99 | 0.99 | 1,433 |
| Province | 0.99 | 0.98 | 0.99 | 541 |
| Department | 0.99 | 0.99 | 0.99 | 6,744 |
| Village | 0.94 | 0.93 | 0.94 | 3,236 |
| Micro avg | 0.98 | 0.98 | 0.98 | 16,602 |
Comparison with baselines (on 1,000 test articles):
- CamemBERT fine-tuned on a generic Location class (no administrative levels): P=0.41, R=0.81, F1=0.55
- GLiNER (zero-shot): P=0.66, R=0.63, F1=0.65
Expert annotation campaign on 1,133 extracted triplets:
| Category | Count | Percentage |
|---|---|---|
| Location and Date correct | 928 | 81.91% |
| Location corrected only | 166 | 14.65% |
| Date corrected only | 27 | 2.38% |
| Both corrected | 12 | 1.06% |
| Total | 1,133 | 100% |
- Spatial projection accuracy: 84.29% (955/1,133 correct associations)
- Temporal projection accuracy: 96.56% (1,094/1,133 correct associations)
Main spatial error patterns:
- Over-aggregation to capital/country level when specific locations appear outside the 2-sentence projection window
- Ambiguous location references when multiple locations of similar granularity are mentioned
| Metric | Value |
|---|---|
| Unique events | 376 |
| Risk types | 77 |
| Distinct locations | 71 (415 including hierarchy) |
| Temporal entities | 147 |
| Source articles | 200 |
| IS_RECURRENT relationships | 141 |
| IS_SYNCHRONOUS relationship pairs | 263 (526 directed edges) |
| PRECEDES relationships | 9,726 |
| Spatial hierarchy relations | 91 |
Corpus distribution by theme:
- Sociopolitical: 29.9%
- Environmental: 23.8%
- Agricultural: 16.3%
- General crisis: 11.4%
- Economic: 10.9%
Top risk indicators: inondation (flood, 28.1%), catastrophe (disaster, 13.1%), insecurite (insecurity, 10.9%), corruption (10.9%), incendie (fire, 10.9%)
- Precision: Proportion of correctly identified entities among all detected entities. High precision = few false alarms.
- Recall: Proportion of actual entities successfully detected. High recall = few missed entities.
- F1-Score: Harmonic mean of precision and recall, providing a single balanced metric (0-1).
- BIO tagging: Begin-Inside-Outside scheme for sequence labeling. B-tag marks entity start, I-tag marks continuation, O marks non-entity tokens.
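For example, the region name "Hauts-Bassins" tagged at token level (the tokenization shown is illustrative):

```python
# One sentence, token-by-token BIO tags for the region "Hauts-Bassins"
tokens = ["la", "region", "des", "Hauts-Bassins", "est", "touchee"]
tags   = ["O",  "O",      "O",   "B-region",      "O",   "O"]
assert len(tokens) == len(tags)  # one tag per token
```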
Some numerical values produced by running the code may differ slightly from those reported in the paper. This is due to:
- Manual corrections applied post-extraction: Expert annotations corrected spatial and temporal associations after initial algorithmic extraction
- Relevance filtering: The 53.1% non-relevant triplets were filtered during the annotation campaign, a step not fully automated in the pipeline
- Region name normalization: Wikidata returns internal region names (e.g., "Guiriko") that are mapped to official names (e.g., "Hauts-Bassins") in post-processing
These differences do not affect the conclusions or interpretations presented in the paper.
The fine-tuned CamemBERT model and training datasets are available at:
Model: CharlesAbdoulaye/BF_NER -- DOI: 10.57967/hf/7766
Datasets: CharlesAbdoulaye/BF_NER_datasets -- DOI: 10.57967/hf/7767
| Property | Value |
|---|---|
| Model name | BF_NER (Burkina Faso Named Entity Recognition) |
| Base model | camembert-base |
| Task | Token classification (NER) |
| Language | French |
| Entity types | country, region, province, departement, village |
| Training data | 59,900 sentences (BIO-tagged, distant supervision from 15,000 articles) |
| F1-Score | 0.94--0.99 per entity type (micro avg: 0.98) |
| License | MIT |
from transformers import CamembertTokenizerFast, CamembertForTokenClassification
import torch
# Load model and tokenizer
tokenizer = CamembertTokenizerFast.from_pretrained("CharlesAbdoulaye/BF_NER")
model = CamembertForTokenClassification.from_pretrained("CharlesAbdoulaye/BF_NER")
model.eval()
# Entity labels
label_list = [
"O",
"B-country", "I-country",
"B-region", "I-region",
"B-departement", "I-departement",
"B-province", "I-province",
"B-village", "I-village"
]
id2label = {i: label for i, label in enumerate(label_list)}
# Example text covering all 5 administrative levels
text = "La crise alimentaire au Burkina Faso a frappe la region des Hauts-Bassins et la province du Houet. La ville de Bobo-Dioulasso et le village de Sya sont particulierement touches."
# Tokenize with offset mapping for span extraction
inputs = tokenizer(text, return_tensors="pt", truncation=True, return_offsets_mapping=True)
offset_mapping = inputs.pop("offset_mapping")[0].tolist()
with torch.no_grad():
outputs = model(**inputs)
preds = torch.argmax(outputs.logits, dim=2)[0].tolist()
tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0])
# Reconstruct entities with spans (handles subword tokens)
entities, current = [], None
for idx, (pred_id, (start, end)) in enumerate(zip(preds, offset_mapping)):
if start == 0 and end == 0:
if current: entities.append(current); current = None
continue
label = id2label[pred_id]
is_subword = not tokens[idx].startswith("\u2581")
if label.startswith("B-"):
if current: entities.append(current)
current = {"type": label[2:], "start": start, "end": end}
elif label.startswith("I-") or (is_subword and current):
if current: current["end"] = end
else:
if current: entities.append(current); current = None
if current: entities.append(current)
# Merge consecutive same-type entities and clean boundaries
merged = []
for ent in entities:
if merged and merged[-1]["type"] == ent["type"]:
gap = text[merged[-1]["end"]:ent["start"]]
if gap in ("", "-", " -", "- "):
merged[-1]["end"] = ent["end"]; continue
merged.append(dict(ent))
for ent in merged:
ent["text"] = text[ent["start"]:ent["end"]].rstrip(".,;:!?")
for ent in merged:
print(f'{ent["text"]:20s} | {ent["type"]:15s} | span=({ent["start"]}, {ent["end"]})')

Expected output:
Burkina Faso | country | span=(24, 36)
Hauts-Bassins | region | span=(60, 73)
Houet | province | span=(92, 97)
Bobo-Dioulasso | departement | span=(111, 125)
Sya | village | span=(143, 146)
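Downstream, recognized locations are combined with a normalized date and a lexicon-matched risk indicator into (Risk Indicator, Location, Date) triplets. The toy function below is a simplified illustration of that idea, not the paper's projection-based algorithm; as one plausible heuristic, it pairs the indicator with the finest-grained location found in the sentence.

```python
# Toy sketch of triplet formation (NOT the paper's projection algorithm):
# attach a risk indicator and a normalized date to the finest-grained
# administrative location returned by the NER step.
LEVEL_RANK = {"country": 0, "region": 1, "province": 2, "departement": 3, "village": 4}

def form_triplet(indicator, entities, date):
    """entities: list of {'text': ..., 'type': ...} dicts from the NER step."""
    locations = [e for e in entities if e["type"] in LEVEL_RANK]
    if not locations:
        return None
    finest = max(locations, key=lambda e: LEVEL_RANK[e["type"]])
    return (indicator, finest["text"], date)

entities = [
    {"text": "Burkina Faso", "type": "country"},
    {"text": "Hauts-Bassins", "type": "region"},
    {"text": "Sya", "type": "village"},
]
print(form_triplet("crise alimentaire", entities, "2009-06"))
# ('crise alimentaire', 'Sya', '2009-06')
```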
Step 1: Upload the fine-tuned model
# Install huggingface_hub
pip install huggingface_hub
# Login to HuggingFace (you'll be prompted for your token)
huggingface-cli login

from huggingface_hub import HfApi, create_repo
# Create model repository
create_repo("CharlesAbdoulaye/BF_NER", repo_type="model", private=False, exist_ok=True)
# Upload model checkpoint
api = HfApi()
api.upload_folder(
folder_path="/data/charles/agile/camembert-ner-finetuned/checkpoint-15000",
repo_id="CharlesAbdoulaye/BF_NER",
repo_type="model"
)

Step 2: Upload training datasets
from huggingface_hub import HfApi, create_repo
import os
# Create dataset repository
create_repo("CharlesAbdoulaye/BF_NER_datasets", repo_type="dataset", private=False, exist_ok=True)
# Upload each dataset file
api = HfApi()
dataset_files = [
"Fine-Tuning/annotations/train_extended_bio_feb.json",
"Fine-Tuning/annotations/val_extended_bio_feb.json",
"Fine-Tuning/annotations/test_extended_bio_feb.json"
]
for file_path in dataset_files:
api.upload_file(
path_or_fileobj=file_path,
path_in_repo=os.path.basename(file_path),
repo_id="CharlesAbdoulaye/BF_NER_datasets",
repo_type="dataset"
)

- Lexicon-based detection: May produce false positives when food security terminology appears in non-relevant contexts (policy discussions, historical references, prevention campaigns). The 53.1% non-relevance rate reflects this challenge.
- Media coverage bias: Ouagadougou (402 occurrences) and major cities receive disproportionate coverage compared to rural areas, reflecting the structure of the press corpus rather than actual risk distribution.
- Single country focus: The model is trained specifically on the Burkina Faso administrative hierarchy and may not generalize to other West African countries without retraining.
- Temporal scope: The current corpus covers only 2009 articles. Extending to multiple years would require additional data collection.
If you use this code, data, or model in your research, please cite:
@article{ngom2026stkgfs,
title={Spatio-Temporal Knowledge Graph from Unstructured Texts: A Multi-Scale Approach for Food Security Monitoring},
author={Ngom, Charles Abdoulaye and Rajaonarivo, Landy and Valentin, Sarah and Teisseire, Maguelonne},
journal={AGILE: GIScience Series},
year={2026},
}

- Administrative hierarchy data from the 2022 Statistical Yearbook of Territorial Administration, Burkina Faso Ministry of Territorial Administration
- Geographic coordinates and entity linking via Wikidata
- French NER foundation from CamemBERT (Martin et al., 2020)
- Temporal extraction via HeidelTime (Strötgen et al., 2013)
- Sentence segmentation via wtpsplit (SaT model)
For questions about this research, please contact:
- @CharlemagneBrain: charles.ngom@inrae.fr
- Landy Rajaonarivo: landy.rajaonarivo@inrae.fr
- Sarah Valentin: sarah.valentin@cirad.fr
- Maguelonne Teisseire: maguelonne.teisseire@inrae.fr


