[BUG] top and other bucket-dependent aggregations don't work on nested fields #4949

@Swiddis

Description

What is the bug?
Sibling issue of opensearch-project/OpenSearch#20218: top and other bucket-dependent aggregations don't work when the grouped field has "type": "nested"; the grouped values come back as null instead of per-value buckets. If this is intended engine behavior, we'll have to patch in a workaround on our side. If it's not intended, we can see about waiting on an upstream fix.

How can one reproduce the bug?
If a field has "type": "nested", any command that compiles to a composite bucket aggregation (top, stats ... by, dedup) fails to return the expected data: the grouped value is reported as null, or no rows come back at all.

Data & Mapping:

% cat test_map.json | jq
{
  "mappings": {
    "properties": {
      "rec": {
        "type": "nested",
        "properties": {
          "inner": {
            "type": "integer"
          }
        }
      }
    }
  }
}

% cat test_data.json
{"create": {"_index": "test"}}
{"rec":{"inner": 100}}
{"create": {"_index": "test"}}
{"rec":{"inner": 101}}
{"create": {"_index": "test"}}
{"rec":{"inner": 101}}

Queries that depend on composite buckets:

opensearchsql> source=test | top rec.inner;
fetched rows / total rows = 1/1
+-----------+-------+
| rec.inner | count |
|-----------+-------|
| null      | 3     |
+-----------+-------+
opensearchsql> source=test | stats count() by rec.inner;
fetched rows / total rows = 1/1
+---------+-----------+
| count() | rec.inner |
|---------+-----------|
| 3       | null      |
+---------+-----------+
> source=test | dedup rec.inner;
fetched rows / total rows = 0/0
+-----+
| rec |
|-----|
+-----+

Explained:

opensearchsql> explain source=test | top rec.inner;
= Calcite Plan =
== Logical ==
LogicalSystemLimit(fetch=[10000], type=[QUERY_SIZE_LIMIT])
  LogicalProject(rec.inner=[$0], count=[$1])
    LogicalFilter(condition=[<=($2, 10)])
      LogicalProject(rec.inner=[$0], count=[$1], _row_number_=[ROW_NUMBER() OVER (ORDER BY $1 DESC)])
        LogicalAggregate(group=[{0}], count=[COUNT()])
          LogicalProject(rec.inner=[$1])
            CalciteLogicalIndexScan(table=[[OpenSearch, test]])

== Physical ==
EnumerableLimit(fetch=[10000])
  EnumerableCalc(expr#0..2=[{inputs}], expr#3=[10], expr#4=[<=($t2, $t3)], proj#0..1=[{exprs}], $condition=[$t4])
    EnumerableWindow(window#0=[window(order by [1 DESC] rows between UNBOUNDED PRECEDING and CURRENT ROW aggs [ROW_NUMBER()])])
      CalciteEnumerableIndexScan(table=[[OpenSearch, test]], PushDownContext=[[AGGREGATION->rel#597:LogicalAggregate.NONE.[](input=RelSubset#596,group={0},count=COUNT())], OpenSearchRequestBuilder(sourceBuilder={
        "from": 0,
        "size": 0,
        "timeout": "1m",
        "aggregations": {
          "composite_buckets": {
            "composite": {
              "size": 1000,
              "sources": [
                {
                  "rec.inner": {
                    "terms": {
                      "field": "rec.inner",
                      "missing_bucket": true,
                      "missing_order": "first",
                      "order": "asc"
                    }
                  }
                }
              ]
            }
          }
        }
      }, requestedTotalSize=2147483647, pageSize=null, startFrom=0)])
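
For what it's worth, the composite aggregation pushed down above can be replayed directly against _search (trimmed to the essentials below) to take the plugin's response handling out of the picture; this is essentially the engine-level reproduction tracked in opensearch-project/OpenSearch#20218.

% curl -s -XPOST 'localhost:9200/test/_search' \
    -H 'Content-Type: application/json' -d '
{
  "size": 0,
  "aggregations": {
    "composite_buckets": {
      "composite": {
        "sources": [
          { "rec.inner": { "terms": { "field": "rec.inner", "missing_bucket": true } } }
        ]
      }
    }
  }
}'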

opensearchsql> explain source=test | stats count() by rec.inner;
= Calcite Plan =
== Logical ==
LogicalSystemLimit(fetch=[10000], type=[QUERY_SIZE_LIMIT])
  LogicalProject(count()=[$1], rec.inner=[$0])
    LogicalAggregate(group=[{0}], count()=[COUNT()])
      LogicalProject(rec.inner=[$1])
        CalciteLogicalIndexScan(table=[[OpenSearch, test]])

== Physical ==
CalciteEnumerableIndexScan(table=[[OpenSearch, test]], PushDownContext=[[AGGREGATION->rel#1164:LogicalAggregate.NONE.[](input=RelSubset#1163,group={0},count()=COUNT()), PROJECT->[count(), rec.inner], LIMIT->10000], OpenSearchRequestBuilder(sourceBuilder={
  "from": 0,
  "size": 0,
  "timeout": "1m",
  "aggregations": {
    "composite_buckets": {
      "composite": {
        "size": 10000,
        "sources": [
          {
            "rec.inner": {
              "terms": {
                "field": "rec.inner",
                "missing_bucket": true,
                "missing_order": "first",
                "order": "asc"
              }
            }
          }
        ]
      }
    }
  }
}, requestedTotalSize=10000, pageSize=null, startFrom=0)])

opensearchsql> explain source=test | dedup rec.inner;
= Calcite Plan =
== Logical ==
LogicalSystemLimit(fetch=[10000], type=[QUERY_SIZE_LIMIT])
  LogicalProject(rec=[$0])
    LogicalFilter(condition=[<=($8, 1)])
      LogicalProject(rec=[$0], rec.inner=[$1], _id=[$2], _index=[$3], _score=[$4], _maxscore=[$5], _sort=[$6], _routing=[$7], _row_number_dedup_=[ROW_NUMBER() OVER (PARTITION BY $1 ORDER BY $1)])
        LogicalFilter(condition=[IS NOT NULL($1)])
          CalciteLogicalIndexScan(table=[[OpenSearch, test]])

== Physical ==
CalciteEnumerableIndexScan(table=[[OpenSearch, test]], PushDownContext=[[PROJECT->[rec, rec.inner], AGGREGATION->rel#953:LogicalAggregate.NONE.[](input=LogicalProject#951,group={0},agg#0=LITERAL_AGG(1)), LIMIT->10000], OpenSearchRequestBuilder(sourceBuilder={
  "from": 0,
  "size": 0,
  "timeout": "1m",
  "_source": {
    "includes": [
      "rec",
      "rec.inner"
    ],
    "excludes": []
  },
  "aggregations": {
    "composite_buckets": {
      "composite": {
        "size": 10000,
        "sources": [
          {
            "rec.inner": {
              "terms": {
                "field": "rec.inner",
                "missing_bucket": false,
                "order": "asc"
              }
            }
          }
        ]
      },
      "aggregations": {
        "$f1": {
          "top_hits": {
            "from": 0,
            "size": 1,
            "version": false,
            "seq_no_primary_term": false,
            "explain": false,
            "_source": {
              "includes": [
                "rec.inner",
                "rec"
              ],
              "excludes": []
            },
            "script_fields": {}
          }
        }
      }
    }
  }
}, requestedTotalSize=10000, pageSize=null, startFrom=0)])
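
Nested fields are indexed as separate Lucene documents, so a flat terms source on rec.inner has nothing to group at the root-document level. A hand-written aggregation that wraps the terms in a nested aggregation (standard query DSL; shown here only as a sketch of one shape a plugin-side workaround could take, not something the plugin emits today) does group on the inner values:

% curl -s -XPOST 'localhost:9200/test/_search' \
    -H 'Content-Type: application/json' -d '
{
  "size": 0,
  "aggregations": {
    "rec_nested": {
      "nested": { "path": "rec" },
      "aggregations": {
        "rec.inner": {
          "terms": { "field": "rec.inner" }
        }
      }
    }
  }
}'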

What is the expected behavior?

opensearchsql> source=test | top rec.inner;
fetched rows / total rows = 2/2
+-----------+-------+
| rec.inner | count |
|-----------+-------|
| 101       | 2     |
| 100       | 1     |
+-----------+-------+

What is your host/environment?

  • OS: Linux 5.10
  • Version: mainline, 3.3.2
  • Plugins: Yours truly

Do you have any screenshots?
N/A

Do you have any additional context?
N/A

Labels: bug (Something isn't working)
Status: Done