10 changes: 10 additions & 0 deletions .env_example
@@ -0,0 +1,10 @@
# Base URL for the Named Entity Recognition (NER) service
NER_SERVICE_BASE_URL=http://localhost:8010

# Base URL for the Named Entity Normalization (NEN) and Description service (ragu-lm)
NEN_SERVICE_BASE_URL=http://localhost:8002

# Base URL for the Relation Extraction (RE) service
RE_SERVICE_BASE_URL=http://localhost:8003

# API key for the LLM used by the pipeline
LLM_API_KEY=""
31 changes: 31 additions & 0 deletions docker-compose.yml
@@ -0,0 +1,31 @@
version: '3.8'

services:
  ner_service:
    image: mrpzzios/runne_contrastive_ner_tf:fixed
    ports:
      - "8010:8010"
    # runtime: nvidia
    # environment:
    #   - NVIDIA_VISIBLE_DEVICES=all
    command: -c "python3 server.py"

  re_service:
    image: mrpzzios/bertre:1.3
    ports:
      - "8003:8000"
    runtime: nvidia
    environment:
      - NVIDIA_VISIBLE_DEVICES=all

  custom_service:
    build:
      context: ./services
    ports:
      - "8002:8000"
    runtime: nvidia
    environment:
      - NVIDIA_VISIBLE_DEVICES=all
    env_file:
      - .env
    shm_size: '32g'
152 changes: 152 additions & 0 deletions docs/en/pipeline_guide.md
@@ -0,0 +1,152 @@
# Pipeline-based Triplet Extraction Guide

This guide explains how to use the OOP-based pipeline for triplet extraction.

## Overview

The pipeline consists of several main steps, orchestrated by the `ragu.triplet.pipeline.Pipeline` class:

1. **Named Entity Recognition (NER):** Identifies entities in the text.
2. **Named Entity Normalization (NEN):** Normalizes the extracted entities.
3. **Entity Description:** Generates a description for each entity based on its context.
4. **Relation Extraction (RE):** Extracts relations between the normalized entities.
5. **Relation Description:** Generates descriptions for the extracted relations, creating the final triplets.

Each step is implemented as a `PipelineStep` that communicates with a dedicated microservice.

## Docker Compose and Service Configuration

To run the full pipeline, you need to use the `docker-compose.yml` file located in the project root. This file defines the microservices required for the different pipeline stages.

### Discrepancy with the Project's `docker-compose.yml`

The configuration below is a generic template that illustrates the architecture. The actual `docker-compose.yml` in this repository differs in two ways:

1. **Consolidated Services:** The `nen_service` and `description_service` are consolidated into a single `custom_service`. This service runs the `RaguTeam/RAGU-lm` model, which is capable of handling both Named Entity Normalization (NEN) and description generation for both entities and relations.
2. **Specific Ports:** The `ner_service` in the project's actual `docker-compose.yml` uses port `8010`, not `8001`.

This consolidation is a practical optimization that reduces the number of required services. The `custom_service` is built from the local `./services` directory.

### Generic `docker-compose.yml` Structure

```yaml
version: '3.8'

services:
  ner_service:
    image: your_ner_image:latest
    ports:
      - "8001:8000"

  nen_service:
    image: your_nen_image:latest
    ports:
      - "8002:8000"

  re_service:
    image: your_re_image:latest
    ports:
      - "8003:8000"

  description_service:
    image: your_description_image:latest
    ports:
      - "8004:8000"
```

### Environment Configuration (`.env`)

You need to create a `.env` file with the base URLs for each running service. Based on the project's actual `docker-compose.yml`, the file should look like this:

```
NER_SERVICE_BASE_URL=http://localhost:8010
NEN_SERVICE_BASE_URL=http://localhost:8002
RE_SERVICE_BASE_URL=http://localhost:8003
DESCRIPTION_SERVICE_BASE_URL=http://localhost:8002
```
*Note that `NEN_SERVICE_BASE_URL` and `DESCRIPTION_SERVICE_BASE_URL` point to the same `custom_service`.*

## Models Used

The pipeline relies on a combination of models served via Docker containers.

### RaguTeam Hugging Face Models

* **`RaguTeam/RAGU-lm`**: This is a fine-tuned model specifically for Russian language tasks. It is served by the `custom_service` and performs several key steps in the pipeline:
* Named Entity Normalization (NEN)
* Entity Description Generation
* Relation Description Generation

### Docker Hub Images

The following images are pulled from Docker Hub and are used for specialized NLP tasks:

* **`mrpzzios/runne_contrastive_ner_tf:fixed`**: Used for the **Named Entity Recognition (NER)** step. It appears to be a custom-built image and is not publicly documented.
* **`mrpzzios/bertre:1.3`**: Used for the **Relation Extraction (RE)** step. Like the NER image, it appears to be a custom-built model without public documentation.

## Example Usage

The `examples/pipeline/` directory contains scripts that demonstrate how to use the pipeline.

* **[examples/pipeline/test_pipeline.py](examples/pipeline/test_pipeline.py)**: A lightweight script that shows how to initialize all the clients and run the full pipeline on a single text chunk. This is useful for quick verification of the services.

* **[examples/pipeline/build_kg_with_pipeline.py](examples/pipeline/build_kg_with_pipeline.py)**: A more comprehensive example that demonstrates the end-to-end process of building a complete Knowledge Graph from a collection of documents. It integrates the extraction pipeline with the chunker, embedder, and graph builder components.

### Basic Python Implementation

```python
import asyncio
import os

from dotenv import load_dotenv

from ragu.triplet.pipeline import (
    Pipeline,
    NERClient,
    NENClient,
    REClient,
    DescriptionClient,
    NERStep,
    NENStep,
    REStep,
    EntityDescriptionStep,
    RelationDescriptionStep,
)
from ragu.chunker.types import Chunk

load_dotenv()


async def main():
    # Create clients for each service
    ner_client = NERClient(os.getenv("NER_SERVICE_BASE_URL"))
    nen_client = NENClient(os.getenv("NEN_SERVICE_BASE_URL"))
    re_client = REClient(os.getenv("RE_SERVICE_BASE_URL"))
    description_client = DescriptionClient(os.getenv("DESCRIPTION_SERVICE_BASE_URL"))

    # Create the pipeline steps
    steps = [
        NERStep(ner_client),
        NENStep(nen_client),
        EntityDescriptionStep(description_client),
        REStep(re_client),
        RelationDescriptionStep(description_client),
    ]

    # Create the pipeline
    pipeline = Pipeline(steps)

    # Run the pipeline on a sample chunk
    chunk = Chunk(
        content="Главным борцом с пробками назначен заместитель министра транспорта России Николай Лямов.",
        chunk_order_idx=0,
        doc_id="test_doc",
    )
    entities, relations = await pipeline.extract([chunk])

    print("--- Entities ---")
    print(entities)
    print("\n--- Relations ---")
    print(relations)


if __name__ == "__main__":
    asyncio.run(main())
```
161 changes: 161 additions & 0 deletions docs/en/pipeline_io_format.md
@@ -0,0 +1,161 @@
# NER and RE I/O Formats

This document describes the standard input and output formats for the Named Entity Recognition (NER) and Relation Extraction (RE) models used in the RAGU project.

## NER (Named Entity Recognition)

### NER Input (`NER_IN`)

The input for the NER model is a single JSON string containing the text to be processed.

**Example:**
```json
"Главным борцом с пробками назначен заместитель министра транспорта России Николай Лямов."
```

### NER Output (`NER_OUT`)

The output of the NER model is a JSON object containing the original text and the extracted entities.

- `text`: The original input string.
- `ners`: A list of extracted entities. Each entity is represented as a list with three elements:
  1. `start_char_index` (integer): The starting character offset of the entity in the text.
  2. `end_char_index` (integer): The ending character offset of the entity in the text (exclusive, so the entity's surface form is `text[start_char_index:end_char_index]`).
  3. `entity_type` (string): The type of the entity (e.g., "COUNTRY", "PERSON", "PROFESSION").

**Example:**
```json
{
  "ners": [
    [67, 73, "COUNTRY"],
    [74, 87, "PERSON"],
    [35, 73, "PROFESSION"]
  ],
  "text": "Главным борцом с пробками назначен заместитель министра транспорта России Николай Лямов."
}
```
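The offsets can be checked directly in Python: because the end index is exclusive, a plain slice recovers each entity's surface form. A minimal sketch using the example above:

```python
# NER_OUT from the example above
ner_out = {
    "ners": [
        [67, 73, "COUNTRY"],
        [74, 87, "PERSON"],
        [35, 73, "PROFESSION"],
    ],
    "text": "Главным борцом с пробками назначен заместитель министра транспорта России Николай Лямов.",
}

# The end offset is exclusive, so text[start:end] is the entity surface form
for start, end, entity_type in ner_out["ners"]:
    print(f"{entity_type}: {ner_out['text'][start:end]}")
# COUNTRY: России
# PERSON: Николай Лямов
# PROFESSION: заместитель министра транспорта России
```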

## RE (Relation Extraction)

### RE Input (`RE_IN`)

The input for the RE model is a JSON object containing text chunks and their corresponding entities.

- `chunks`: A list of text strings (e.g., sentences or paragraphs).
- `entities_list`: A list where each element is a list of entities found in the corresponding chunk in the `chunks` list. The format for each entity is the same as in the `NER_OUT`.

**Example:**
```json
{
  "chunks": [
    "Главным борцом с пробками назначен заместитель министра транспорта России Николай Лямов.",
    "Президент Башкирии Муртаза Рахимов решил поменять главу своей администрации. Он уволил Азамата Сагитова."
  ],
  "entities_list": [
    [
      [67, 73, "COUNTRY"],
      [74, 87, "PERSON"],
      [35, 73, "PROFESSION"]
    ],
    [
      [19, 34, "PERSON"],
      [0, 18, "PROFESSION"],
      [50, 75, "PROFESSION"],
      [10, 18, "STATE_OR_PROVINCE"],
      [80, 86, "EVENT"],
      [87, 103, "PERSON"]
    ]
  ]
}
```
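Since `chunks` and `entities_list` are index-aligned, an `RE_IN` payload can be assembled from a batch of `NER_OUT` objects with two comprehensions. A sketch (`ner_outputs` is a hypothetical list of responses from the NER service):

```python
# Hypothetical batch of NER_OUT objects returned by the NER service
ner_outputs = [
    {
        "text": "Главным борцом с пробками назначен заместитель министра транспорта России Николай Лямов.",
        "ners": [[67, 73, "COUNTRY"], [74, 87, "PERSON"], [35, 73, "PROFESSION"]],
    },
]

# RE_IN keeps the chunk texts and their entity lists index-aligned
re_in = {
    "chunks": [out["text"] for out in ner_outputs],
    "entities_list": [out["ners"] for out in ner_outputs],
}
```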

### RE Output (`RE_OUT`)

The output of the RE model is a JSON list of extracted relationships. Each object in the list represents a single relationship and contains the following fields:

- `source_entity` (string): The text of the source entity in the relationship.
- `target_entity` (string): The text of the target entity in the relationship.
- `relationship_type` (string): The type of the relationship (e.g., "FOUNDED_BY", "WORKPLACE").
- `relationship_description` (string or null): A natural language description of the relationship.
- `relationship_strength` (float): A confidence score for the extracted relationship, typically between 0.0 and 1.0.
- `chunk_id` (integer): The index of the chunk from the `RE_IN` `chunks` list where this relationship was found.

**Example:**
```json
[
  {
    "source_entity": "России",
    "target_entity": "Николай Лямов",
    "relationship_type": "FOUNDED_BY",
    "relationship_description": null,
    "relationship_strength": 0.04831777885556221,
    "chunk_id": 0
  },
  {
    "source_entity": "Николай Лямов",
    "target_entity": "России",
    "relationship_type": "WORKPLACE",
    "relationship_description": null,
    "relationship_strength": 0.999497652053833,
    "chunk_id": 0
  }
]
```
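Because `relationship_strength` is a confidence score, downstream consumers typically drop low-confidence relations; in the example above, the 0.048 `FOUNDED_BY` relation is almost certainly noise. A minimal filtering sketch (the 0.5 threshold is an illustrative choice, not a project default):

```python
# RE_OUT from the example above
re_out = [
    {"source_entity": "России", "target_entity": "Николай Лямов",
     "relationship_type": "FOUNDED_BY", "relationship_description": None,
     "relationship_strength": 0.04831777885556221, "chunk_id": 0},
    {"source_entity": "Николай Лямов", "target_entity": "России",
     "relationship_type": "WORKPLACE", "relationship_description": None,
     "relationship_strength": 0.999497652053833, "chunk_id": 0},
]

# Keep only relations above an illustrative confidence threshold
confident = [r for r in re_out if r["relationship_strength"] >= 0.5]
print([r["relationship_type"] for r in confident])
# ['WORKPLACE']
```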

## RAGU-lm I/O Formats

The `RAGU-lm` model uses a prompt-based format for its tasks.

### Named Entity Normalization (NEN)

**Input:**
The input is a formatted string (prompt) that includes the unnormalized entity and the source text.

- **Prompt Template:**

  ```
  Выполните нормализацию именованной сущности, встретившейся в тексте.

  Исходная (ненормализованная) именованная сущность: {source_entity}

  Текст: {source_text}

  Нормализованная именованная сущность:
  ```
- **Parameters:**
- `{source_entity}`: The unnormalized entity to be normalized.
- `{source_text}`: The original text containing the entity.

**Output:**
The output is a string containing the normalized entity.

- **Example Output:**

  ```
  пресс-секретарь
  ```
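The template placeholders map directly onto `str.format` parameters. A sketch (`NEN_PROMPT` is a hypothetical constant holding the template above, and the input values are illustrative):

```python
# Hypothetical constant holding the NEN prompt template above
NEN_PROMPT = (
    "Выполните нормализацию именованной сущности, встретившейся в тексте.\n\n"
    "Исходная (ненормализованная) именованная сущность: {source_entity}\n\n"
    "Текст: {source_text}\n\n"
    "Нормализованная именованная сущность:"
)

# Fill in the placeholders with an illustrative entity and text
prompt = NEN_PROMPT.format(
    source_entity="пресс-секретаря",
    source_text="Бывший пресс-секретарь Билла Клинтона дал интервью.",
)
print(prompt)
```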

### Description Generation (DG)

**Input:**
The input is a formatted string (prompt) that includes the normalized entity and the source text.

- **Prompt Template:**

  ```
  Напишите, что означает именованная сущность в тексте, то есть раскройте её смысл относительно текста.

  Именованная сущность: {normalized_entity}

  Текст: {source_text}

  Смысл именованной сущности:
  ```
- **Parameters:**
- `{normalized_entity}`: The normalized entity for which to generate a description.
- `{source_text}`: The original text containing the entity.

**Output:**
The output is a string containing the generated description for the entity.

- **Example Output:**

  ```
  Бывший представитель СМИ экс-президента США Билла Клинтона.
  ```
File renamed without changes.