[SQL] Review of new vector functions


General Remarks:

* [x] Are `LIST OF DOUBLE` and `ARRAY_OF_DOUBLES` in the functions handled as float?
* [x] What about vector types `ARRAY_OF_SHORTS`, `ARRAY_OF_INTEGERS`, `ARRAY_OF_LONGS`?
* [x] Is `ARRAY_OF_FLOATS` a `LIST OF FLOAT`? If so, what is `[1.0, 2.0]` by default: `LIST` or `ARRAY_OF_FLOATS`? I assume a list. I guess this hierarchy needs documenting.
* [x] Is `SparseVector` a SQL data type? Can a property have this type?
* [x] The dot product computation (sum of squares) is implemented in various functions; should there be one implementation that is reused?
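
For the last point, a shared helper could look roughly like the sketch below. The class name `VectorMath` and its method names are hypothetical and not taken from the code base; the sketch assumes vectors arrive as `float[]` and accumulates in `double`.

```java
// Hypothetical shared helper; class and method names are illustrative only.
public final class VectorMath {
  private VectorMath() {
  }

  /** Dot product of two equally sized vectors, accumulated in double precision. */
  public static double dot(final float[] a, final float[] b) {
    if (a.length != b.length)
      throw new IllegalArgumentException("Dimension mismatch: " + a.length + " vs " + b.length);
    double sum = 0.0;
    for (int i = 0; i < a.length; i++)
      sum += (double) a[i] * b[i];
    return sum;
  }

  /** Sum of squares, i.e. the dot product of a vector with itself. */
  public static double sumOfSquares(final float[] v) {
    return dot(v, v);
  }

  /** Euclidean (L2) norm built on the shared sum of squares. */
  public static double l2Norm(final float[] v) {
    return Math.sqrt(sumOfSquares(v));
  }
}
```

Functions such as the magnitude, normalization check, and cosine similarity could then call `dot` or `sumOfSquares` instead of repeating the loop.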

Specific Remarks:

* `vectorDimension`
    * [ ] Raises an error for a `NULL` argument; IMHO it should return `0`, like `.length()` or `.size()` for a `NULL` argument.
    * [ ] Wouldn't `vectorDim` be sufficient as a name, given that it is also `vectorHasInf` and not `vectorHasInfinity`?
* `vectorHasNaN`, `vectorHasInf`
    * [ ] Cannot be tested with `LIST OF FLOAT` because `NAN` values are automatically converted to `NULL`, which is not allowed in typed collections.
    * [ ] `SELECT vectorHasNaN([1.0,sqrt(-1.0),3.0])` fails with `Cannot invoke "Object.getClass()" because "elem" is null`.
    * [ ] Is a `vectorHasNull` function perhaps needed?
* `vectorIsNormalized`
    * [ ] The test for normalization is numerically as complex as normalizing itself, so instead of testing for normality one could just normalize.
    * [ ] Why is the default threshold `0.001`? Is this supposed to be approximately `sqrt(eps)` for float? Then it should be `0.0000001` for doubles.
    * [ ] Wouldn't `vectorIsNormal` be sufficient as a name?
* `vectorAdd`, `vectorSubtract`
    * [ ] How to broadcast? That is, how to add a scalar to (or subtract it from) a vector without first creating a vector, e.g. `[1.0,2.0,3.0] + 4.0`, which currently appends the element `4.0` to the vector instead of adding `4.0` to every element.
    * [ ] Wouldn't `vectorSub` be sufficient as a name?
* `vectorMagnitude`
    * [ ] Why is this function not named `vectorL2Norm`, symmetrically with `vectorL1Norm` and `vectorLInfNorm`?
* `vectorLInfNorm`
    * [ ] The loop does not need a conditional if something like `maxAbs = Math.max(maxAbs, Math.abs(value))` is used.
* `vectorSparsity`
    * [ ] Should there be a default threshold like `sqrt(eps)`?
    * [ ] Alternatively or additionally, the [L0 pseudonorm](https://math.stackexchange.com/questions/492834/geometric-mean-limit-of-ell-p-norm-of-sums/492953#492953) could be computed as a sparsity measure ([geometric mean](https://en.wikipedia.org/wiki/Geometric_mean) of absolute values).
* `vectorSum`, `vectorAvg`, `vectorMax`, `vectorMin`
    * [ ] There is an error in the `vectorSum` function: repeated calls of `SELECT vectorSum([1.0,2.0,3.0])` yield different (increasing) results, and the same happens when using a property as the argument. `vectorAvg` does not have this problem.
    * [ ] These seem to work differently from other aggregating functions, which for a single argument aggregate over that argument; here that would mean, for example, the sum of the vector elements.
    * [ ] `vectorAvg` does not produce the arithmetic average (it returns just one specific property of a record).
* `vectorStdDev`, `vectorVariance`
    * [ ] Unlike the `variance` and `stddev` SQL functions, these are not aggregating.
    * [ ] Unlike the `variance` and `stddev` SQL functions, `vectorStdDev` is not reusing the `vectorVariance` code.
* `vectorClip`
    * [ ] Since this is called `clamp` in Java terminology, should this be renamed?
* `vectorCosineSimilarity`
    * [ ] The two loops in the computation can be merged into one.
* `vectorQuantizeBinary`
    * [ ] Why is the median used to decide?
    * [ ] Why is there no `vectorDequantizeBinary`? At least for completeness.
* `vectorDequantizeInt8`
    * [ ] This does not work: `SELECT vectorDequantizeInt8(vectorQuantizeInt8([1.0, 2.0, 3.0]), 1.0, 3.0)` gives the error `Quantized vector must be an array or list, found: QuantizationResult`.
* `vectorApproxDistance`
    * [ ] What does "ranking is preserved" mean for `INT8`? Vector spaces are not ordered. Is this meant element-wise?
    * [ ] Why can't the function deduce the quantization from its arguments?
    * [ ] The query `SELECT vectorApproxDistance(vectorQuantizeInt8([1.0, 2.0, 3.0]),vectorQuantizeInt8(1.0, 3.0, 3.0),'INT8')` fails with the message `vectorQuantizeInt8(<vector>)`.
* `vectorNormalizeScores`
   * [ ] Wouldn't it be faster to create a new array with the midpoint value for the edge case of range zero instead of looping?
* `vectorMultiScore`
   * [ ] The associated Java class filename does not fit the pattern (it lacks the `Vector` prefix).
   * [ ] Why is the weighted average an extra type, and not just `'AVG'` with an extra argument? Alternatively, `AVG` could always be weighted, with a vector of ones as the default weights.
* `vectorHybridScore`
   * [ ] This is just a special case of `vectorMultiScore` for the case of two scores with a weighted average. Is this extra function needed?
   * [ ] This is not really a `vector` function as it does not handle vectors.
* `vectorRRFScore`
   * [ ] This produces wrong results for more than two scores, as the optional last argument cannot be distinguished from a score.
   * [ ] Why are the scores not grouped into a vector as for `vectorMultiScore`?
   * [ ] This is not really a `vector` function as it does not handle vectors.
* `vectorScoreTransformation`
   * [ ] This is not really a `vector` function as it does not handle vectors.
   * [ ] `LN` would be clearer than `LOG` about which type of logarithm is meant.
   * [ ] Additionally, `TANH` might be a useful alternative to `SIGMOID`.
* `vectorDenseToSparse`
   * [ ] The associated Java class filename differs from the pattern (the `Vector` prefix is missing).
   * [ ] Couldn't it be named `vectorAsSparse`?
* `vectorSparseCreate`, `vectorSparseDot`, `vectorSparseToDense`
   * [ ] The associated Java class filenames differ from the pattern (the `Vector` prefix is missing).
* `vectorToString`
   * [ ] Are these outputs meant to be copied and pasted into code or scripts, or saved to a file which is then loaded?
   * [ ] When the NumPy `fromstring` function is used, a comma-separated list is expected; the brackets are only for code, AFAIK.
   * [ ] In MATLAB the separator determines the type of vector: a space (or comma!) produces a row vector, while a semicolon results in a column vector.
   * [ ] Julia is similar to MATLAB in many regards, so the MATLAB variant should also work in Julia, but it would be more obvious if a `'JULIA'` option were also available.
   * [ ] Should this be renamed to `vectorAsString`?
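
Several of the loop-level remarks above (for `vectorLInfNorm`, `vectorCosineSimilarity`, and `vectorNormalizeScores`) can be sketched as follows. This is an illustration only: the class `VectorLoopSketches`, its method names, the `float[]` inputs, the zero-vector convention, and the `midpoint` parameter are all assumptions, not the existing implementation.

```java
import java.util.Arrays;

// Hypothetical sketches of the loop simplifications suggested in this review.
public final class VectorLoopSketches {
  private VectorLoopSketches() {
  }

  /** L-infinity norm without an explicit conditional in the loop body. */
  public static float lInfNorm(final float[] v) {
    float maxAbs = 0.0f;
    for (final float value : v)
      maxAbs = Math.max(maxAbs, Math.abs(value));
    return maxAbs;
  }

  /** Cosine similarity with dot product and both squared norms gathered in one loop. */
  public static double cosineSimilarity(final float[] a, final float[] b) {
    if (a.length != b.length)
      throw new IllegalArgumentException("Dimension mismatch: " + a.length + " vs " + b.length);
    double dot = 0.0, normA = 0.0, normB = 0.0;
    for (int i = 0; i < a.length; i++) {
      dot += (double) a[i] * b[i];
      normA += (double) a[i] * a[i];
      normB += (double) b[i] * b[i];
    }
    // Assumption: a zero vector yields similarity 0 rather than an error or NaN.
    if (normA == 0.0 || normB == 0.0)
      return 0.0;
    return dot / Math.sqrt(normA * normB);
  }

  /** Zero-range edge case: allocate the result and fill it in one call. */
  public static float[] constantScores(final int length, final float midpoint) {
    final float[] result = new float[length];
    Arrays.fill(result, midpoint);
    return result;
  }
}
```

Whether `Arrays.fill` is measurably faster than a hand-written loop would need benchmarking; the main gain here is brevity.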


[SQL] Review of new vector functions #3099
