10 changes: 10 additions & 0 deletions .env_example
@@ -0,0 +1,10 @@
# Base URL for the Named Entity Recognition (NER) service
NER_SERVICE_BASE_URL=http://localhost:8010

# Base URL for the Named Entity Normalization (NEN) and Description service (ragu-lm)
NEN_SERVICE_BASE_URL=http://localhost:8002

# Base URL for the Relation Extraction (RE) service
RE_SERVICE_BASE_URL=http://localhost:8003

# API key for the LLM used by the pipeline
LLM_API_KEY=""
31 changes: 31 additions & 0 deletions docker-compose.yml
@@ -0,0 +1,31 @@
version: '3.8'

services:
  ner_service:
    image: mrpzzios/runne_contrastive_ner_tf:fixed
    ports:
      - "8010:8010"
    # runtime: nvidia
    # environment:
    #   - NVIDIA_VISIBLE_DEVICES=all
    command: -c "python3 server.py"

  re_service:
    image: mrpzzios/bertre:1.3
    ports:
      - "8003:8000"
    runtime: nvidia
    environment:
      - NVIDIA_VISIBLE_DEVICES=all

  custom_service:
    build:
      context: ./services
    ports:
      - "8002:8000"
    runtime: nvidia
    environment:
      - NVIDIA_VISIBLE_DEVICES=all
    env_file:
      - .env
    shm_size: '32g'
152 changes: 152 additions & 0 deletions docs/en/pipeline_guide.md
@@ -0,0 +1,152 @@
# Pipeline-based Triplet Extraction Guide

This guide explains how to use the OOP-based pipeline for triplet extraction.

## Overview

The pipeline consists of several main steps, orchestrated by the `ragu.triplet.pipeline.Pipeline` class:

1. **Named Entity Recognition (NER):** Identifies entities in the text.
2. **Named Entity Normalization (NEN):** Normalizes the extracted entities.
3. **Entity Description:** Generates a description for each entity based on its context.
4. **Relation Extraction (RE):** Extracts relations between the normalized entities.
5. **Relation Description:** Generates descriptions for the extracted relations, creating the final triplets.

Each step is implemented as a `PipelineStep` that communicates with a dedicated microservice.

## Docker Compose and Service Configuration

To run the full pipeline, you need to use the `docker-compose.yml` file located in the project root. This file defines the microservices required for the different pipeline stages.

### Discrepancy with the Project's `docker-compose.yml`

The configuration below is a generic template that illustrates the architecture. The actual `docker-compose.yml` in this repository differs in two ways:

1. **Consolidated Services:** The `nen_service` and `description_service` are consolidated into a single `custom_service`. This service runs the `RaguTeam/RAGU-lm` model, which is capable of handling both Named Entity Normalization (NEN) and description generation for both entities and relations.
2. **Specific Ports:** The `ner_service` in the project's actual `docker-compose.yml` uses port `8010`, not `8001`.

This consolidation is a practical optimization that reduces the number of required services. The `custom_service` is built from the local `./services` directory.

### Generic `docker-compose.yml` Structure

```yaml
version: '3.8'

services:
  ner_service:
    image: your_ner_image:latest
    ports:
      - "8001:8000"

  nen_service:
    image: your_nen_image:latest
    ports:
      - "8002:8000"

  re_service:
    image: your_re_image:latest
    ports:
      - "8003:8000"

  description_service:
    image: your_description_image:latest
    ports:
      - "8004:8000"
```

### Environment Configuration (`.env`)

You need to create a `.env` file with the base URLs for each running service. Based on the project's actual `docker-compose.yml`, the file should look like this:

```
NER_SERVICE_BASE_URL=http://localhost:8010
NEN_SERVICE_BASE_URL=http://localhost:8002
RE_SERVICE_BASE_URL=http://localhost:8003
DESCRIPTION_SERVICE_BASE_URL=http://localhost:8002
```
*Note that `NEN_SERVICE_BASE_URL` and `DESCRIPTION_SERVICE_BASE_URL` point to the same `custom_service`.*

## Models Used

The pipeline relies on a combination of models served via Docker containers.

### RaguTeam Hugging Face Models

* **`RaguTeam/RAGU-lm`**: This is a fine-tuned model specifically for Russian language tasks. It is served by the `custom_service` and performs several key steps in the pipeline:
* Named Entity Normalization (NEN)
* Entity Description Generation
* Relation Description Generation

### Docker Hub Images

The following images are pulled from Docker Hub and are used for specialized NLP tasks:

* **`mrpzzios/runne_contrastive_ner_tf:fixed`**: Used for the **Named Entity Recognition (NER)** step. It appears to be a custom-built image and is not publicly documented.
* **`mrpzzios/bertre:1.3`**: Used for the **Relation Extraction (RE)** step. Like the NER image, it appears to be a custom-built model without public documentation.

## Example Usage

The `examples/pipeline/` directory contains scripts that demonstrate how to use the pipeline.

* **[examples/pipeline/test_pipeline.py](examples/pipeline/test_pipeline.py)**: A lightweight script that shows how to initialize all the clients and run the full pipeline on a single text chunk. This is useful for quick verification of the services.

* **[examples/pipeline/build_kg_with_pipeline.py](examples/pipeline/build_kg_with_pipeline.py)**: A more comprehensive example that demonstrates the end-to-end process of building a complete Knowledge Graph from a collection of documents. It integrates the extraction pipeline with the chunker, embedder, and graph builder components.

### Basic Python Implementation

```python
import asyncio
import os

from dotenv import load_dotenv

from ragu.triplet.pipeline import (
    Pipeline,
    NERClient,
    NENClient,
    REClient,
    DescriptionClient,
    NERStep,
    NENStep,
    REStep,
    EntityDescriptionStep,
    RelationDescriptionStep,
)
from ragu.chunker.types import Chunk

load_dotenv()


async def main():
    # Create clients for each service
    ner_client = NERClient(os.getenv("NER_SERVICE_BASE_URL"))
    nen_client = NENClient(os.getenv("NEN_SERVICE_BASE_URL"))
    re_client = REClient(os.getenv("RE_SERVICE_BASE_URL"))
    description_client = DescriptionClient(os.getenv("DESCRIPTION_SERVICE_BASE_URL"))

    # Create the pipeline steps
    steps = [
        NERStep(ner_client),
        NENStep(nen_client),
        EntityDescriptionStep(description_client),
        REStep(re_client),
        RelationDescriptionStep(description_client),
    ]

    # Create the pipeline
    pipeline = Pipeline(steps)

    # Run the pipeline on a sample chunk
    chunk = Chunk(
        content="Главным борцом с пробками назначен заместитель министра транспорта России Николай Лямов.",
        chunk_order_idx=0,
        doc_id="test_doc",
    )
    entities, relations = await pipeline.extract([chunk])

    print("--- Entities ---")
    print(entities)
    print("\n--- Relations ---")
    print(relations)


if __name__ == "__main__":
    asyncio.run(main())
```
161 changes: 161 additions & 0 deletions docs/en/pipeline_io_format.md
@@ -0,0 +1,161 @@
# NER and RE I/O Formats

This document describes the standard input and output formats for the Named Entity Recognition (NER) and Relation Extraction (RE) models used in the RAGU project.

## NER (Named Entity Recognition)

### NER Input (`NER_IN`)

The input for the NER model is a single JSON string containing the text to be processed.

**Example:**
```json
"Главным борцом с пробками назначен заместитель министра транспорта России Николай Лямов."
```

### NER Output (`NER_OUT`)

The output of the NER model is a JSON object containing the original text and the extracted entities.

- `text`: The original input string.
- `ners`: A list of extracted entities. Each entity is represented as a list with three elements:
  1. `start_char_index` (integer): The starting character offset of the entity in the text.
  2. `end_char_index` (integer): The ending character offset of the entity in the text (exclusive, so the entity's surface form is `text[start_char_index:end_char_index]`).
  3. `entity_type` (string): The type of the entity (e.g., "COUNTRY", "PERSON", "PROFESSION").

**Example:**
```json
{
  "ners": [
    [67, 73, "COUNTRY"],
    [74, 87, "PERSON"],
    [35, 73, "PROFESSION"]
  ],
  "text": "Главным борцом с пробками назначен заместитель министра транспорта России Николай Лямов."
}
```
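The offsets can be checked directly in Python: because the end index is exclusive, a plain slice recovers each entity's surface form. A minimal sketch using the example above:

```python
# NER_OUT from the example above
ner_out = {
    "ners": [
        [67, 73, "COUNTRY"],
        [74, 87, "PERSON"],
        [35, 73, "PROFESSION"],
    ],
    "text": "Главным борцом с пробками назначен заместитель министра транспорта России Николай Лямов.",
}

# The end offset is exclusive, so text[start:end] is the entity surface form
for start, end, entity_type in ner_out["ners"]:
    print(f"{entity_type}: {ner_out['text'][start:end]}")
# COUNTRY: России
# PERSON: Николай Лямов
# PROFESSION: заместитель министра транспорта России
```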

## RE (Relation Extraction)

### RE Input (`RE_IN`)

The input for the RE model is a JSON object containing text chunks and their corresponding entities.

- `chunks`: A list of text strings (e.g., sentences or paragraphs).
- `entities_list`: A list where each element is a list of entities found in the corresponding chunk in the `chunks` list. The format for each entity is the same as in the `NER_OUT`.

**Example:**
```json
{
  "chunks": [
    "Главным борцом с пробками назначен заместитель министра транспорта России Николай Лямов.",
    "Президент Башкирии Муртаза Рахимов решил поменять главу своей администрации. Он уволил Азамата Сагитова."
  ],
  "entities_list": [
    [
      [67, 73, "COUNTRY"],
      [74, 87, "PERSON"],
      [35, 73, "PROFESSION"]
    ],
    [
      [19, 34, "PERSON"],
      [0, 18, "PROFESSION"],
      [50, 75, "PROFESSION"],
      [10, 18, "STATE_OR_PROVINCE"],
      [80, 86, "EVENT"],
      [87, 103, "PERSON"]
    ]
  ]
}
```
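Since `chunks` and `entities_list` are index-aligned, an `RE_IN` payload can be assembled from a batch of `NER_OUT` objects with two comprehensions. A sketch (`ner_outputs` is a hypothetical list of responses from the NER service):

```python
# Hypothetical batch of NER_OUT objects returned by the NER service
ner_outputs = [
    {
        "text": "Главным борцом с пробками назначен заместитель министра транспорта России Николай Лямов.",
        "ners": [[67, 73, "COUNTRY"], [74, 87, "PERSON"], [35, 73, "PROFESSION"]],
    },
]

# RE_IN keeps the chunk texts and their entity lists index-aligned
re_in = {
    "chunks": [out["text"] for out in ner_outputs],
    "entities_list": [out["ners"] for out in ner_outputs],
}
```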

### RE Output (`RE_OUT`)

The output of the RE model is a JSON list of extracted relationships. Each object in the list represents a single relationship and contains the following fields:

- `source_entity` (string): The text of the source entity in the relationship.
- `target_entity` (string): The text of the target entity in the relationship.
- `relationship_type` (string): The type of the relationship (e.g., "FOUNDED_BY", "WORKPLACE").
- `relationship_description` (string or null): A natural language description of the relationship.
- `relationship_strength` (float): A confidence score for the extracted relationship, typically between 0.0 and 1.0.
- `chunk_id` (integer): The index of the chunk from the `RE_IN` `chunks` list where this relationship was found.

**Example:**
```json
[
  {
    "source_entity": "России",
    "target_entity": "Николай Лямов",
    "relationship_type": "FOUNDED_BY",
    "relationship_description": null,
    "relationship_strength": 0.04831777885556221,
    "chunk_id": 0
  },
  {
    "source_entity": "Николай Лямов",
    "target_entity": "России",
    "relationship_type": "WORKPLACE",
    "relationship_description": null,
    "relationship_strength": 0.999497652053833,
    "chunk_id": 0
  }
]
```
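Because `relationship_strength` is a confidence score, downstream consumers typically drop low-confidence relations; in the example above, the 0.048 `FOUNDED_BY` relation is almost certainly noise. A minimal filtering sketch (the 0.5 threshold is an illustrative choice, not a project default):

```python
# RE_OUT from the example above
re_out = [
    {"source_entity": "России", "target_entity": "Николай Лямов",
     "relationship_type": "FOUNDED_BY", "relationship_description": None,
     "relationship_strength": 0.04831777885556221, "chunk_id": 0},
    {"source_entity": "Николай Лямов", "target_entity": "России",
     "relationship_type": "WORKPLACE", "relationship_description": None,
     "relationship_strength": 0.999497652053833, "chunk_id": 0},
]

# Keep only relations above an illustrative confidence threshold
confident = [r for r in re_out if r["relationship_strength"] >= 0.5]
print([r["relationship_type"] for r in confident])
# ['WORKPLACE']
```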

## RAGU-lm I/O Formats

The `RAGU-lm` model uses a prompt-based format for its tasks.

### Named Entity Normalization (NEN)

**Input:**
The input is a formatted string (prompt) that includes the unnormalized entity and the source text.

- **Prompt Template:**

  ```
  Выполните нормализацию именованной сущности, встретившейся в тексте.

  Исходная (ненормализованная) именованная сущность: {source_entity}

  Текст: {source_text}

  Нормализованная именованная сущность:
  ```
- **Parameters:**
- `{source_entity}`: The unnormalized entity to be normalized.
- `{source_text}`: The original text containing the entity.

**Output:**
The output is a string containing the normalized entity.

- **Example Output:**

  ```
  пресс-секретарь
  ```
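The template placeholders map directly onto `str.format` parameters. A sketch (`NEN_PROMPT` is a hypothetical constant holding the template above, and the input values are illustrative):

```python
# Hypothetical constant holding the NEN prompt template above
NEN_PROMPT = (
    "Выполните нормализацию именованной сущности, встретившейся в тексте.\n\n"
    "Исходная (ненормализованная) именованная сущность: {source_entity}\n\n"
    "Текст: {source_text}\n\n"
    "Нормализованная именованная сущность:"
)

# Fill in the placeholders with an illustrative entity and text
prompt = NEN_PROMPT.format(
    source_entity="пресс-секретаря",
    source_text="Бывший пресс-секретарь Билла Клинтона дал интервью.",
)
print(prompt)
```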

### Description Generation (DG)

**Input:**
The input is a formatted string (prompt) that includes the normalized entity and the source text.

- **Prompt Template:**

  ```
  Напишите, что означает именованная сущность в тексте, то есть раскройте её смысл относительно текста.

  Именованная сущность: {normalized_entity}

  Текст: {source_text}

  Смысл именованной сущности:
  ```
- **Parameters:**
- `{normalized_entity}`: The normalized entity for which to generate a description.
- `{source_text}`: The original text containing the entity.

**Output:**
The output is a string containing the generated description for the entity.

- **Example Output:**

  ```
  Бывший представитель СМИ экс-президента США Билла Клинтона.
  ```
File renamed without changes.