A service for entity linking of properties
- Make sure there is an available version of python specified in `pyproject.toml`, for example installed using pyenv.
- Install uv: `curl -LsSf https://astral.sh/uv/install.sh | sh`
- Run `uv sync --all-groups` to create a local environment with the project dependencies specified in `uv.lock`
- Add a spacy language model: `uv run spacy download en_core_web_trf`
- Set up `pre-commit` hooks: `uv run pre-commit install`
- To run `pre-commit` independently from `git commit`, run `uv run pre-commit run --all-files`
- To run tests, run `pytest test`
NB.

- To run python scripts, prefix the command with `uv run`, e.g. `uv run python script.py`
- To git commit, also prefix with `uv run`, e.g. `uv run git commit -m "first commit"`, to make sure `pre-commit` hooks are run from the correct python environment.
```
run/preprocessing/extract_properties_ro
run/preprocessing/extract_properties_go
```
Uniformize and trim data incoming from different sources
```
run/preprocessing/merge_properties
```
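The merging step itself is not documented here; the following is a purely illustrative sketch of what "uniformize and trim" could mean for property records coming from different sources. The field names and logic are assumptions, not the actual `merge_properties` implementation.

```python
# Illustrative only: NOT the actual merge_properties script.
# Field names ("label", "description", "source") are assumptions.
from typing import Iterable


def uniformize(records: Iterable[dict]) -> list[dict]:
    """Trim whitespace, drop records without a label, merge duplicates by normalized label."""
    merged: dict[str, dict] = {}
    for rec in records:
        label = (rec.get("label") or "").strip()
        if not label:
            continue
        entry = merged.setdefault(
            label.lower(),
            {
                "label": label,
                "description": (rec.get("description") or "").strip(),
                "sources": set(),
            },
        )
        entry["sources"].add(rec.get("source", "unknown"))
    return list(merged.values())
```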
The ground truth dataset is stored in `data/ground_truth`; run the following to compute the model's accuracy and write the report to `./reports`:
```
python run/testing/run_pel_test.py --text-path ./data/ground_truth/sample.0.gt.json --model-type biobert-stsb --layers-spec sent --extra-context
```
"Train" a model on a corpus
```
uv run python run/save_model.py
```

Serve the model:

```
uv run python run/serve
```
- Build image: `docker buildx build -t gg/pelinker:<current_version> --ssh default=$SSH_AUTH_SOCK . 2>&1 | tee build.log`
- Run container: `docker run --name pelinker --env THR_SCORE=0.5 gg/pelinker:latest`
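Once the container is running, the service presumably exposes an HTTP API. The following is a hypothetical client sketch only: the port, route, and payload shape are assumptions, not the documented pelinker API.

```python
# Hypothetical client: the port, route, and payload shape are assumptions,
# not the documented pelinker API.
import json
import urllib.request

payload = json.dumps({"text": "the gene encodes a protein that binds ATP"}).encode()
req = urllib.request.Request(
    "http://localhost:8000/link",  # assumed host, port, and endpoint
    data=payload,
    headers={"Content-Type": "application/json"},
)
with urllib.request.urlopen(req) as resp:
    print(json.loads(resp.read()))
```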
```mermaid
flowchart TD
    A["[text]"] -->|"split_text_into_batches"| B["[[batched text]]"]
    subgraph S1["elementary tensors and word bounds"]
        direction LR
        C ~~~ D
    end
    B -->|"get_word_boundaries"| C["[[word group bounds]]"]
    B -->|"process_text"| D["token tensor"]
    subgraph S2["tensors ~ words of interest"]
        direction LR
        E["ll_tt_stacked"] ~~~ F["mapping_table"]
    end
    S1 -->|"render_elementary_tensor_table"| S2
```
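Read as code, the diagram corresponds roughly to the following call order. Only the function names come from the diagram; the module paths and exact signatures are assumptions, so treat this as a sketch rather than the documented API:

```python
# Sketch of the pipeline shown above. Only the function names come from the
# diagram; module paths and signatures are assumptions.
from pelinker.preprocessing import split_text_into_batches          # assumed location
from pelinker.embeddings import get_word_boundaries, process_text   # assumed location
from pelinker.tables import render_elementary_tensor_table          # assumed location

text = "some input text"
batches = split_text_into_batches(text)              # [text] -> [[batched text]]
bounds = [get_word_boundaries(b) for b in batches]   # word group bounds per batch
tensors = [process_text(b) for b in batches]         # token tensor per batch
# align elementary tensors with the words of interest
ll_tt_stacked, mapping_table = render_elementary_tensor_table(tensors, bounds)
```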
An essential part of the analysis is to identify patterns in text and study their embedding vectors.
To run pattern matching over different models and patterns and plot the results to the `figs` folder (the texts are taken from a CSV file with a column named `abstract`):
```
cd run
./test.pat.align.sh --pattern pat_a --pattern pat_b --plot-path figs --input-path data/test/sample.csv.gz
```
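The input is just a CSV file (here gzipped) with an `abstract` column. A minimal way to produce a compatible file, using pandas purely as an illustration (the output path below is hypothetical, not a repository file):

```python
# Build a minimal gzipped CSV with the "abstract" column the script expects.
import pandas as pd

df = pd.DataFrame(
    {
        "abstract": [
            "The protein binds to the promoter region and activates transcription.",
            "This enzyme catalyzes the hydrolysis of ATP.",
        ]
    }
)
df.to_csv("data/test/my_sample.csv.gz", index=False, compression="gzip")  # hypothetical path
```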