[SQL] Review of new vector functions #3099

@gramian

General Remarks:

  • Are LIST OF DOUBLE and ARRAY_OF_DOUBLES in the functions handled as float?
  • What about vector types ARRAY_OF_SHORTS, ARRAY_OF_INTEGERS, ARRAY_OF_LONGS?
  • Is ARRAY_OF_FLOATS a LIST OF FLOAT? If so, what is [1.0, 2.0] by default: a LIST or an ARRAY_OF_FLOATS? I assume a list. I guess this hierarchy needs documenting.
  • Is SparseVector a SQL data type? Can a property have this type?
  • The dot product computation (in particular the sum of squares, i.e. the dot product of a vector with itself) is implemented separately in various functions; should there be one implementation that is reused?
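
To illustrate the last remark, here is a minimal sketch of what a shared helper could look like. The class name VectorMathSketch and all method names are hypothetical and not taken from the code base; the sketch also assumes vectors arrive as float[] and that accumulating in double is acceptable.

```java
// Hypothetical helper: names and signatures are illustrative assumptions,
// not the project's actual code.
final class VectorMathSketch {
  private VectorMathSketch() {
  }

  /** Dot product, accumulated in double precision to limit rounding error. */
  static double dot(final float[] a, final float[] b) {
    if (a.length != b.length)
      throw new IllegalArgumentException("Dimension mismatch: " + a.length + " vs " + b.length);
    double sum = 0.0;
    for (int i = 0; i < a.length; i++)
      sum += (double) a[i] * b[i];
    return sum;
  }

  /** The sum of squares is the dot product of a vector with itself. */
  static double sumOfSquares(final float[] v) {
    return dot(v, v);
  }

  /** Euclidean (L2) norm, built on the shared sum of squares. */
  static double l2Norm(final float[] v) {
    return Math.sqrt(sumOfSquares(v));
  }

  /** Normalization test, built on the shared norm. */
  static boolean isNormalized(final float[] v, final double tolerance) {
    return Math.abs(l2Norm(v) - 1.0) <= tolerance;
  }
}
```

With something like this, the magnitude, the normalization check and the cosine similarity would all go through the same loop, so a fix or optimization only has to be made once.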

Specific Remarks:

  • vectorDimension
    • Errors for a NULL argument; IMHO it should return 0, as .length() or .size() do for a NULL argument.
    • Wouldn't vectorDim be sufficient as a name, given that it is also vectorHasInf and not vectorHasInfinity?
  • vectorHasNaN, vectorHasInf
    • Cannot be tested with LIST OF FLOAT because NaN values are automatically converted to NULL, which is not allowed in typed collections.
    • SELECT vectorHasNaN([1.0,sqrt(-1.0),3.0]) errors with Cannot invoke "Object.getClass()" because "elem" is null
    • Is a function vectorHasNull perhaps needed?
  • vectorIsNormalized
    • The test for normalization is numerically as complex as normalizing itself, so instead of testing for normality one could just normalize.
    • Why is the default threshold 0.001? Is this supposed to be approximately sqrt(eps) for float? Then it should be 0.0000001 for doubles.
    • Wouldn't vectorIsNormal be sufficient as a name?
  • vectorAdd, vectorSubtract
    • How to broadcast? Meaning: how to add a scalar to (or subtract it from) a vector without creating a vector, i.e. [1.0,2.0,3.0] + 4.0 (which currently would append the element 4.0 to the vector instead of adding 4.0 to every element)?
    • Wouldn't vectorSub be sufficient as a name?
  • vectorMagnitude
    • Why is this function not named vectorL2Norm, symmetrically with vectorL1Norm and vectorLInfNorm?
  • vectorLInfNorm
    • The loop does not need a conditional if something like maxAbs = Math.max(maxAbs, Math.abs(value)) is used.
  • vectorSparsity
    • Should there be a default threshold like sqrt(eps)?
    • Alternatively or additionally, the L0 pseudonorm could be computed as a sparsity measure (geometric mean of absolute values).
  • vectorSum, vectorAvg, vectorMax, vectorMin
    • There is an error in the vectorSum function: repeated calls of SELECT vectorSum([1.0,2.0,3.0]) yield different (increasing) results, and the same happens when using a property as argument. vectorAvg does not have this problem.
    • These seem to work differently from other aggregating functions, which for a single argument aggregate over that argument; here that would mean, for example, the sum of the vector's elements.
    • vectorAvg does not produce the arithmetic average (just one specific property of a record)
  • vectorStdDev, vectorVariance
    • Unlike the variance and stddev SQL functions, these are not aggregating.
    • Unlike the variance and stddev SQL functions, vectorStdDev is not reusing the vectorVariance code.
  • vectorClip
    • Since this is called clamp in Java terminology, should this be renamed?
  • vectorCosineSimilarity
    • The two loops in the computation can be merged into one.
  • vectorQuantizeBinary
    • Why is the median used to decide?
    • Why is there no vectorDequantizeBinary? At least for completeness.
  • vectorDequantizeInt8
    • This does not work: SELECT vectorDequantizeInt8(vectorQuantizeInt8([1.0, 2.0, 3.0]), 1.0, 3.0) gives the error Quantized vector must be an array or list, found: QuantizationResult
  • vectorApproxDistance
    • What does "ranking is preserved" for INT8 mean? Vector spaces are not ordered. Is this meant element-wise?
    • Why can't the function deduce the quantization from its arguments?
    • The following query errors with vectorQuantizeInt8(<vector>): SELECT vectorApproxDistance(vectorQuantizeInt8([1.0, 2.0, 3.0]),vectorQuantizeInt8(1.0, 3.0, 3.0),'INT8')
  • vectorNormalizeScores
    • Wouldn't it be faster to create a new array with the midpoint value for the edge case of range zero instead of looping?
  • vectorMultiScore
    • The associated Java class filename does not fit the pattern (the Vector prefix is missing).
    • Why is the weighted average an extra type, and not just 'AVG' with an extra argument? Or AVG could always be weighted, by default with a vector of ones.
  • vectorHybridScore
    • This is just a special case of vectorMultiScore for the case of two scores with a weighted average. Is this extra function needed?
    • This is not really a vector function as it does not handle vectors.
  • vectorRRFScore
    • This produces wrong results for more than two scores, as the optional last argument cannot be distinguished from a score.
    • Why are the scores not grouped into a vector as for vectorMultiScore?
    • This is not really a vector function as it does not handle vectors.
  • vectorScoreTransformation
    • This is not really a vector function as it does not handle vectors.
    • LN would be clearer than LOG about the type of logarithm.
    • Additionally, TANH might be a useful alternative to SIGMOID.
  • vectorDenseToSparse
    • The associated Java class filename differs from the pattern (Vector prefix missing).
    • Couldn't it be named vectorAsSparse?
  • vectorSparseCreate, vectorSparseDot, vectorSparseToDense
    • The associated Java class filenames differ from the pattern (Vector prefix missing).
  • vectorToString
    • Are these meant to be copy-pasted into code / scripts, or saved to a file which is then loaded?
    • When the numpy fromstring method is used, a comma-separated list is expected; the brackets are only for code AFAIK.
    • In MATLAB the separator determines the type of vector: space (or comma!) produces a row vector, while semicolon results in a column vector.
    • Julia is similar to MATLAB in many regards, so the MATLAB variant should also work in Julia, but it would be more obvious if a 'JULIA' option were also available.
    • Should this be renamed to vectorAsString?
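
For concreteness, three of the loop-level remarks above (vectorLInfNorm without a conditional, vectorCosineSimilarity with merged loops, and scalar broadcasting for vectorAdd) could look roughly as follows. This is only a sketch: the class name VectorLoopSketch and the method names are hypothetical, vectors are assumed to be float[], and returning 0.0 for a zero-length vector in the cosine similarity is an assumption rather than the current behavior.

```java
// Hypothetical sketch: names and edge-case choices are assumptions,
// not the project's actual code.
final class VectorLoopSketch {
  private VectorLoopSketch() {
  }

  /** L-infinity norm without a branch; Math.max propagates NaN if one is present. */
  static double lInfNorm(final float[] v) {
    double maxAbs = 0.0;
    for (final float value : v)
      maxAbs = Math.max(maxAbs, Math.abs(value));
    return maxAbs;
  }

  /** Cosine similarity with the dot product and both squared norms in one loop. */
  static double cosineSimilarity(final float[] a, final float[] b) {
    if (a.length != b.length)
      throw new IllegalArgumentException("Dimension mismatch: " + a.length + " vs " + b.length);
    double ab = 0.0, aa = 0.0, bb = 0.0;
    for (int i = 0; i < a.length; i++) {
      ab += (double) a[i] * b[i];
      aa += (double) a[i] * a[i];
      bb += (double) b[i] * b[i];
    }
    // Assumption: a vector of norm zero yields similarity 0.0 instead of NaN.
    if (aa == 0.0 || bb == 0.0)
      return 0.0;
    return ab / Math.sqrt(aa * bb);
  }

  /** Broadcast addition: adds the scalar to every element and returns a new array. */
  static float[] addScalar(final float[] v, final float scalar) {
    final float[] result = new float[v.length];
    for (int i = 0; i < v.length; i++)
      result[i] = v[i] + scalar;
    return result;
  }
}
```

Whether broadcasting should be a separate function or an overload of vectorAdd that accepts a scalar as second argument is a design question this sketch leaves open.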

Labels: enhancement (New feature or request)