Skip to content

PostgreSQL库内embedding,配合pgvector完成近似搜索,搭建简易问答 #51

@Z-Xiao-M

Description

@Z-Xiao-M

前言

在逛huggingface的时候,我看到了这个,有点意思,让我们来玩一玩。

Image

一些基础的知识,本文就不再补充了,请优先阅读前面两篇文章,这对于理解本篇文章更有帮助。

在PostgreSQL中运行ONNX模型 —— pg_onnx

OnnxOCR_PG_ONNX:探讨在PostgreSQL中运行ONNX模型更多可能的验证性项目

环境准备

onnxruntime-extensions 依赖

需要安装onnxruntime-extensions,我先前安装了onnxruntime 位于/usr/local/onnxruntime所以这里我也直接指向这个目录好了

git clone --recursive https://github.com/microsoft/onnxruntime-extensions.git
cd onnxruntime-extensions

mkdir build
cd build

cmake .. \
  -DCMAKE_BUILD_TYPE=Release \
  -DCMAKE_INSTALL_PREFIX=/usr/local/onnxruntime \
  -DONNXRUNTIME_DIR=/usr/local/onnxruntime \
  -DOCOS_ENABLE_SPM_TOKENIZER=ON

nohup make &

sudo make install

下载text_to_embedding.onnx模型

点击此处下载我们这次所需要的onnx模型,它是被转换过的,下面给出原模型和论文地址。
原模型网址:https://huggingface.co/intfloat/multilingual-e5-small
论文:https://arxiv.org/pdf/2402.05672

常见问题解答

Image

数据库内准备

拓展创建

这次我们需要用到两个拓展,一个是pg_onnx,一个是pgvector

create extension pg_onnx;
create extension vector;

导入模型和创建所需函数

-- Embedding模型注册
SELECT pg_onnx_import_model(
    'e5-embedding',
    'v1',
    pg_read_binary_file('/home/postgres/model/pgonnx/text_to_embedding.onnx')::bytea,
    '{"ortextensions_path": "libortextensions.so"}'::jsonb,
    'e5 text to embedding'
);

-- 基本的Embedding生成函数
CREATE OR REPLACE FUNCTION e5_embedding(input_text text)
RETURNS vector(384)
AS $$
    SELECT array(
        SELECT jsonb_array_elements_text(
            pg_onnx_execute_session(
                'e5-embedding',
                'v1',
                jsonb_build_object('text', jsonb_build_array(input_text))
            )->'embedding'->0
        )::float
    )::vector(384);
$$
LANGUAGE sql
IMMUTABLE;

-- 用于生成passage的Embedding函数
-- E5型号使用“passage:”前缀
CREATE OR REPLACE FUNCTION e5_embedding_passage(input_text text)
RETURNS vector(384)
AS $$
    SELECT e5_embedding('passage: ' || input_text);
$$
LANGUAGE sql
IMMUTABLE;

-- 生成query的Embedding函数
-- E5模型使用“query:”前缀
CREATE OR REPLACE FUNCTION e5_embedding_query(input_text text)
RETURNS vector(384)
AS $$
    SELECT e5_embedding('query: ' || input_text);
$$
LANGUAGE sql
IMMUTABLE;

表和基础数据准备

-- 测试表创建
CREATE TABLE llm_test (
    i integer NOT NULL PRIMARY KEY,
    txt text,
    v vector(384)
);

-- HNSW索引创建
CREATE INDEX llm_test_v_idx ON llm_test USING hnsw (v vector_ip_ops);

-- 插入测试数据
INSERT INTO llm_test (i, txt) VALUES 
('1', 'Machine learning is a subfield of artificial intelligence'),
('2', 'A database is a system for managing data'),
('3', 'PostgreSQL is a powerful open-source database'),
('4', 'Vector search retrieves results by computing similarity'),
('5', 'ONNX is a standard format for machine learning models'),
('6', 'Natural language processing is a technology for handling text'),
('7', 'Embeddings convert text into vectors'),
('8', 'Cosine similarity measures similarity between vectors'),
('9', 'A tokenizer splits text into tokens'),
('10', 'Transformers are a modern neural network architecture'),
('11', 'SQL is a language for manipulating databases'),
('12', 'Indexes improve query performance'),
('13', 'pgvector is a vector extension for PostgreSQL'),
('14', 'Semantic search retrieves based on meaning'),
('15', 'Neural networks mimic the structure of the brain'),
('16', 'Deep learning uses multi-layer neural networks'),
('17', 'Batch processing handles multiple data at once'),
('18', 'Model inference performs prediction with a trained model'),
('19', 'Fine-tuning adapts an existing model to a specific task'),
('20', 'A cross-encoder evaluates the relevance between two texts');

-- 生成对应的Embedding结果
-- 也可以考虑创建触发器来完成这一部分工作
UPDATE llm_test SET v = e5_embedding_passage(txt);

简单测试

postgres=# WITH q AS (
    SELECT 'What is machine learning?' AS query
),
qv AS MATERIALIZED (
    SELECT e5_embedding_query(q.query) AS v FROM q
)
SELECT 
    i, 
    txt, 
    t.v <#> qv.v AS distance
FROM llm_test t, qv
ORDER BY distance
LIMIT 5;
 i  |                            txt                            |      distance       
----+-----------------------------------------------------------+---------------------
  1 | Machine learning is a subfield of artificial intelligence | -0.8541078567504883
  5 | ONNX is a standard format for machine learning models     | -0.8386106491088867
 16 | Deep learning uses multi-layer neural networks            | -0.8247002363204956
 10 | Transformers are a modern neural network architecture     | -0.8128323554992676
 18 | Model inference performs prediction with a trained model  |  -0.801337718963623
(5 rows)

也可以用中文(可能会对结果造成点影响,主要看模型训练)

postgres=# WITH q AS (
    SELECT '什么是机器学习?' AS query
),
qv AS MATERIALIZED (
    SELECT e5_embedding_query(q.query) AS v FROM q
)
SELECT 
    i, 
    txt, 
    t.v <#> qv.v AS distance
FROM llm_test t, qv
ORDER BY distance
LIMIT 5;
 i  |                            txt                            |      distance       
----+-----------------------------------------------------------+---------------------
  1 | Machine learning is a subfield of artificial intelligence | -0.7919192910194397
  5 | ONNX is a standard format for machine learning models     | -0.7906289100646973
 16 | Deep learning uses multi-layer neural networks            | -0.7704752087593079
 18 | Model inference performs prediction with a trained model  | -0.7630345821380615
  4 | Vector search retrieves results by computing similarity   | -0.7604728937149048
(5 rows)

不说是知识库吧,简易的问答系统感觉就这样子做完了(前提是将答案录进了数据库)。

Metadata

Metadata

Assignees

No one assigned

    Labels

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions