re-enable support for custom multi-word synonyms #457
    });
    }

    function tokenReplacementCheck(line, logprefix) {
I've also added this additional linter warning.
Technically the => syntax is actually still supported (the linter doesn't rewrite the configs) but I would like to heavily discourage its use.
The reason is that an entry such as road => rd would only enter [ "rd" ] in the index, so when a user queries for [ "road" ] (or any ngram of it), ☠️ it will not match ☠️
Advanced users may opt to write it as road => road, rd to overcome this issue but this is equivalent to road, rd.
I think we will probably completely drop support for => at some point in the future because it's very easy to make mistakes with.
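A minimal sketch of the difference between the two syntaxes, assuming a simplified model that handles single-word tokens only. The helper names parseRule and indexTokens are hypothetical and this is not how Lucene implements synonyms; it only illustrates which tokens end up in the index.

```javascript
// Simplified model of Solr-style synonym rules applied at index time.
// Assumption: single-word tokens only; not the actual Lucene implementation.
function parseRule(line) {
  const split = (s) => s.split(',').map((t) => t.trim()).filter(Boolean);
  if (line.includes('=>')) {
    const [lhs, rhs] = line.split('=>');
    // explicit mapping: each left-hand token is REPLACED by the right-hand tokens
    return { inputs: split(lhs), outputs: split(rhs) };
  }
  // equivalence list: every listed token expands to the full list
  const tokens = split(line);
  return { inputs: tokens, outputs: tokens };
}

// Returns the tokens that would be indexed for a single input token.
function indexTokens(rule, token) {
  const { inputs, outputs } = parseRule(rule);
  return inputs.includes(token) ? outputs : [token];
}

console.log(indexTokens('road => rd', 'road'));       // [ 'rd' ]
console.log(indexTokens('road => road, rd', 'road')); // [ 'road', 'rd' ]
console.log(indexTokens('road, rd', 'road'));         // [ 'road', 'rd' ]
```

With road => rd the original token never reaches the index, which is why a query for road finds nothing.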
There was a problem hiding this comment.
I was thinking it might actually be useful to use => sometimes. For example, if we have a single letter abbreviation that we want to ensure is not being applied too often.
c, carrer would replace every instance of c with carrer. However, carrer => carrer, c would not.
As you mentioned, we would probably recommend that all tokens on the left-hand side of the replacement also appear on the right, to avoid issues with autocomplete jitter.
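The recommendation above could be checked mechanically. The following is only a sketch: missingLeftHandTokens is a hypothetical helper, not the linter's actual code, and it assumes one rule per line with comma-separated tokens.

```javascript
// Hypothetical lint helper (name and return shape are assumptions).
// Returns the left-hand tokens of an explicit mapping that are absent from
// its right-hand side; an empty array means there is nothing to warn about.
function missingLeftHandTokens(line) {
  if (!line.includes('=>')) { return []; }
  const [lhs, rhs] = line.split('=>').map(
    (side) => side.split(',').map((t) => t.trim()).filter(Boolean)
  );
  return lhs.filter((token) => !rhs.includes(token));
}

console.log(missingLeftHandTokens('carrer => carrer, c')); // []
console.log(missingLeftHandTokens('road => rd'));          // [ 'road' ]
console.log(missingLeftHandTokens('c, carrer'));           // []
```

A linter could emit a warning whenever the returned array is non-empty, while still accepting the rule.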
096f383 to 19e5650
I've added 1baeadf to address an issue where the linter was using As a result I've had to remove a few hyphenated synonyms from the canonical synonym lists.
Nice. Does it make sense to add an integration test for multi-word synonyms?
As discussed in #456, the work in #453 had the unexpected consequence of dropping support for multi-word custom synonyms.
My general guidance here is that multi-word synonyms are poorly supported by lucene/elasticsearch and so should be avoided where possible. Great care should be taken to ensure they are compatible with the match_phrase queries used by Pelias.
Where possible I'd recommend using 'aliases' (i.e. doc.setNameAlias()) instead. This is a reliable method of achieving the same thing, although it's far less convenient because it's on a per-record basis.
So, having said all that, this PR re-enables support for multi-word synonyms (in custom synonyms files only) in order to avoid breaking backwards compatibility.
In order to do this I had to move the custom multi-word synonyms outside the multiplexer. Apparently multiplexers emit tokens one-by-one to each of their branches, which prevents the 'look-ahead' required by any multi-term analysis within a branch.
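A rough sketch of what "outside the multiplexer" can look like in Elasticsearch analysis settings. The filter and analyzer names, the example synonyms, and the placement of the synonym filter before (rather than after) the multiplexer are all assumptions for illustration, not the settings from this PR.

```javascript
// Sketch of Elasticsearch analysis settings.
// All filter/analyzer names and synonym entries below are hypothetical.
const analysis = {
  filter: {
    custom_multiword_synonyms: {
      type: 'synonym',
      synonyms: ['new york city, nyc']
    },
    name_multiplexer: {
      type: 'multiplexer',
      // each entry is one branch; a branch receives tokens one at a time,
      // so a filter that needs to read ahead cannot work reliably here
      filters: ['lowercase', 'lowercase, asciifolding']
    }
  },
  analyzer: {
    example_analyzer: {
      type: 'custom',
      tokenizer: 'standard',
      // the multi-word synonym filter sits in the main chain,
      // not inside a multiplexer branch (ordering is an assumption)
      filter: ['lowercase', 'custom_multiword_synonyms', 'name_multiplexer']
    }
  }
};

console.log(JSON.stringify(analysis, null, 2));
```

Keeping the synonym filter in the main chain lets it see consecutive tokens, which a multi-word rule such as new york city needs.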
resolves #456