Support for tsvector #47

Open
Glandos wants to merge 2 commits into altaurog:master from Glandos:tsvector

Glandos commented Feb 26, 2025

This is the first and basic step.

A tsvector can be defined in Python as a list of (lexeme, positions) tuples, where positions is a list of positions, each either an int or a str (e.g. '1A').
Full example:

[
        ('aà', ['1A', '3A']),
        ('b', ['2A']),
]
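
To make the accepted position values concrete, here is a minimal sketch of how a position such as 3 or '3A' could be split into a number and a weight letter. The helper name `parse_position` is hypothetical and not taken from this PR; treating a missing weight letter as 'D' is an assumption based on PostgreSQL's default weight.

```python
import re

# Hypothetical helper, not part of the PR: digits optionally followed by
# one weight letter A-D.
_POSITION_RE = re.compile(r"([0-9]+)([ABCD]?)")


def parse_position(position):
    """Split 3 or '3A' into (number, weight letter)."""
    if isinstance(position, int):
        # Assumption: a bare integer carries the default weight 'D'.
        return position, "D"
    match = _POSITION_RE.fullmatch(position)
    if match is None:
        raise ValueError(f"invalid tsvector position: {position!r}")
    number, weight = match.groups()
    return int(number), weight or "D"


def parse_tsvector(tsvector):
    """Parse every position in a list of (lexeme, positions) tuples."""
    return [
        (lexeme, [parse_position(p) for p in positions])
        for lexeme, positions in tsvector
    ]
```

With the example above, `parse_tsvector` would turn `('b', ['2A'])` into `('b', [(2, 'A')])`.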

A lot of things are missing right now:

  • Documentation
  • Tests
  • Encoding support
  • A better API for tsvector in Python

For encoding, lexemes should be encoded using the client connection's encoding, but the automatic encoder isn't made for that. Maybe client_encoding could be passed as optional kwargs to every formatter?
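
A rough sketch of what that could look like. All function names here (`int4_formatter`, `text_formatter`, `format_row`) are made up for illustration and are not pgcopy's actual formatter API. The only thing assumed about the wire format is that each binary COPY field is a big-endian int32 length followed by the data.

```python
import struct


def int4_formatter(value, **kwargs):
    # Integers do not depend on the encoding, so extra kwargs are ignored.
    return struct.pack(">ii", 4, value)


def text_formatter(value, client_encoding="utf-8", **kwargs):
    # Strings are encoded with whatever encoding the caller passes in.
    data = value.encode(client_encoding)
    return struct.pack(">i", len(data)) + data


def format_row(formatters, row, client_encoding):
    # Every formatter receives client_encoding; each one decides whether to use it.
    return b"".join(
        formatter(value, client_encoding=client_encoding)
        for formatter, value in zip(formatters, row)
    )
```

The idea is that formatters which don't care about encoding stay unchanged apart from accepting `**kwargs`, while a tsvector formatter could pick up `client_encoding` for its lexemes.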

For the last point, keep in mind that a tsvector is sorted. I have no intention of supporting the creation of one from raw text (see to_tsvector), which is very complicated to replicate in Python.

Owner

altaurog commented Mar 5, 2025

Hello, @Glandos, and thank you for your interest in pgcopy. Unfortunately I don’t know anything about this data type and I haven’t much time these days for this project.

> Maybe client_encoding could be passed as optional kwargs to every formatter?

Can you elaborate? What would this look like?

Author

Glandos commented Mar 5, 2025

> Hello, @Glandos, and thank you for your interest in pgcopy. Unfortunately I don’t know anything about this data type and I haven’t much time these days for this project.
>
> > Maybe client_encoding could be passed as optional kwargs to every formatter?
>
> Can you elaborate? What would this look like?

I tried something in 1a352ee (#47)
A tsvector is used for full text search and is basically a sorted list of lexemes (aka words/tokens), each with its positions in the original text: a:1 basic:2 like:4 text:3 this:5
Since lexemes are strings, they must be encoded using the client encoding, but they are "buried" inside a list. The other attributes are integers, so there are no issues with conversion.
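
To illustrate where the encoding matters, here is a minimal sketch of packing such a list into bytes. It is not the code from this PR. The layout is an assumption based on my reading of PostgreSQL's binary tsvector format: an int32 lexeme count, then for each lexeme a NUL-terminated string, a uint16 position count, and one uint16 per position with the weight in the top 2 bits. The names `pack_position` and `pack_tsvector` are hypothetical.

```python
import struct

# Assumed weight encoding in the top 2 bits of each 16-bit position.
WEIGHTS = {"D": 0, "C": 1, "B": 2, "A": 3}


def pack_position(position):
    """Pack 3 or '3A' into one big-endian uint16."""
    if isinstance(position, int):
        number, weight = position, "D"
    elif position[-1:] in WEIGHTS:
        number, weight = int(position[:-1]), position[-1]
    else:
        number, weight = int(position), "D"
    if not 1 <= number <= 0x3FFF:
        raise ValueError(f"position out of range: {position!r}")
    return struct.pack(">H", (WEIGHTS[weight] << 14) | number)


def pack_tsvector(tsvector, client_encoding="utf-8"):
    """Pack a list of (lexeme, positions) tuples, assumed already sorted."""
    parts = [struct.pack(">i", len(tsvector))]
    for lexeme, positions in tsvector:
        # Only the lexeme depends on the client encoding.
        parts.append(lexeme.encode(client_encoding) + b"\x00")
        parts.append(struct.pack(">H", len(positions)))
        parts.extend(pack_position(p) for p in positions)
    return b"".join(parts)
```

Only the `lexeme.encode(...)` call needs the client encoding; the counts and positions are plain integers.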

If you think my implementation fits in your code, I can go further: add mentions to the documentation and see if I can write some tests.

Owner

altaurog commented Mar 5, 2025

Ah, I see. It’s slightly awkward, but I suppose it’s a reasonable approach. I am curious what kind of use-case there is for this, if creating the tsvector in the first place is more easily done on the db. Wouldn’t it be more straightforward to copy strings into an intermediate table and then run to_tsvector in the database?
