diff --git a/USAGE.md b/USAGE.md index aad055a92b..c2322b90fe 100644 --- a/USAGE.md +++ b/USAGE.md @@ -1,163 +1,103 @@ # ECS Tooling Usage -In addition to the published schema and artifacts, the ECS repo also contains tools to generate artifacts based on the current published and custom schemas. +In addition to the published schema and artifacts, the ECS repo contains tools to generate artifacts based on ECS schemas and your custom field definitions. -You may be asking if ECS is a specification for storing event data, where does the ECS tooling fit into the picture? As users implement ECS into their Elastic stack, common questions arise: +## Why Use ECS Tooling? -* ECS has too many fields. Users don't want to generate mappings for fields they don't plan on using soon. -* Users want to adopt ECS but also want to painlessly maintain their own custom field mappings alongside ECS. +* **Subset Generation**: ECS has ~850 fields. Generate mappings for only the fields you need. +* **Custom Fields**: Painlessly maintain your own custom field mappings alongside ECS. +* **Multiple Formats**: Generate Elasticsearch templates, Beats configs, CSV exports, and documentation. -Users can use the ECS tools to tackle both problems. What artifacts are relevant will also vary based on need. Many users will find the Elasticsearch templates most useful, but Beats -contributors will instead find the Beats-formatted YAML field definition files valuable. By maintaining only their customizations and use the tools provided by ECS, they can generate -relevant artifacts for their unique set of data sources. +**For detailed developer documentation**, see [scripts/docs/README.md](scripts/docs/README.md). **NOTE** - These tools and their functionality are considered experimental. ## Table of Contents -- [TLDR Example](#tldr-example) -- [Terminology](#terminology) +- [Quick Start Example](#quick-start-example) - [Setup and Install](#setup-and-install) - * [Prerequisites](#prerequisites) - + [Clone from GitHub](#clone-from-github) - + [Install dependencies](#install-dependencies) -- [Usage](#usage) - * [Getting Started - Generating Artifacts](#getting-started---generating-artifacts) - * [Generator Options](#generator-options) - + [Out](#out) - + [Include](#include) - + [Exclude](#exclude) - + [Subset](#subset) - + [Ref](#ref) - + [Mapping & Template Settings](#mapping--template-settings) - + [Strict Mode](#strict-mode) - + [Intermediate-Only](#intermediate-only) - + [Force-docs](#force-docs) - -## TLDR Example - -Before diving into the details, here's a complete example that: - -* takes ECS 1.6 fields -* selects only the subset of fields relevant to the project's use case -* includes custom fields relevant to the project -* outputs the resulting artifacts to a project directory -* replace the ECS project's sample template settings and - mapping settings with ones appropriate to the project +- [Basic Usage](#basic-usage) +- [Key Generator Options](#key-generator-options) + * [Include Custom Fields](#include-custom-fields) + * [Subset - Use Only Needed Fields](#subset---use-only-needed-fields) + * [Ref - Target Specific ECS Version](#ref---target-specific-ecs-version) + * [Other Options](#other-options) +- [Additional Resources](#additional-resources) + +## Quick Start Example + +Here's a complete example that generates artifacts with: +* ECS 9.1 fields as the base +* A subset of only needed fields +* Custom fields added on top +* Custom template settings ```bash -python scripts/generator.py --ref v1.6.0 \ - --semconv-version 
$(cat otel-semconv-version) \ - --subset ../my-project/fields/subset.yml \ - --include ../my-project/fields/custom/ \ - --out ../my-project/ \ - --template-settings-legacy ../my-project/fields/template-settings-legacy.json \ - --template-settings ../my-project/fields/template-settings.json \ - --mapping-settings ../my-project/fields/mapping-settings.json +python scripts/generator.py \ + --ref v9.1.0 \ + --semconv-version v1.38.0 \ + --subset ../my-project/fields/subset.yml \ + --include ../my-project/fields/custom/ \ + --out ../my-project/ ``` -The generated Elasticsearch template would be output at +This generates: +* `my-project/generated/elasticsearch/composable/` - Modern Elasticsearch templates +* `my-project/generated/elasticsearch/legacy/` - Legacy templates +* `my-project/generated/beats/` - Beats field definitions +* `my-project/generated/csv/` - CSV field reference -`my-project/generated/elasticsearch/legacy/template.json` - -If this sounds interesting, read on to learn all about each of these settings. - -## Terminology - -| Term | Definition | -| ---- | ---------- | -| ECS | Elastic Common Schema. For the purposes of this guide, ECS may refer to either the schema itself or the repo/tooling used to maintain the schema | -| artifacts | Various kinds of files or programs that can be generated based on ECS | -| field set | Groups of related fields in ECS | -| schema | Another term for a group of related fields in ECS. Used interchangeably with field set | -| schema definition | The markup used to define a schema in ECS | -| attributes | The properties of a field or field set that are used to define that field or field set in a schema definition | +**Note**: The `--semconv-version` flag is required. Use the version from the `otel-semconv-version` file or a specific version like `v1.38.0`. ## Setup and Install -### Prerequisites - -* [Python 3.8+](https://www.python.org/) -* [make](https://www.gnu.org/software/make/) -* [pip](https://pypi.org/project/pip/) -* [git](https://git-scm.com/) +**Requirements**: Python 3.8+, git -#### Clone from GitHub - -The recommended way to download the ECS repo is `git clone`: - -``` -$ git clone https://github.com/elastic/ecs -$ cd ecs -``` - -Prior to installing dependencies or running the tools, it's recommended to check out the `git` branch for the ECS version being targeted. - -**Example**: For ECS `1.5.0`: - -``` -$ git checkout v1.5.0 -``` +**Clone and setup**: -#### Install dependencies - -Install dependencies using `pip` (An active `virtualenv` is recommended): - -``` -$ pip install -r scripts/requirements.txt +```bash +git clone https://github.com/elastic/ecs +cd ecs +git checkout v9.1.0 # Optional: target specific version +pip install -r scripts/requirements.txt # virtualenv recommended ``` -## Usage - -### Getting Started - Generating Artifacts +## Basic Usage -Using the defaults, the [generator](scripts/generator.py) script generates the artifacts based on the [current](schemas) ECS schema. +Generate artifacts from the current ECS schema: +```bash +make generate +# or +python scripts/generator.py --semconv-version v1.38.0 ``` -$ python scripts/generator.py -Loading schemas from local files -Running generator. ECS version 1.5.0 -``` - -**Points to note on the defaults**: - -* Artifacts are created in the [`generated`](generated) directory and the entire schema is included -* Documentation updates will be written to the appropriate file under the `docs` directory. 
More specifics on generated doc files is covered in the [contributor's file](https://github.com/elastic/ecs/blob/main/CONTRIBUTING.md#generated-documentation-files) -* Each run of the script will rewrite the entirety of the `generated` directory -* The script will need to be executed from the top-level of the ECS repo -* The `version` displayed when running `generator.py` is based on the current value of the [version](version) file in the top-level of the repo -The generator's defaults are how the ECS team maintains the official artifacts published in the repo. For your own use cases, you may wish to add your own fields or remove others that are unused. The following section details the available options for controlling the output of those artifacts. +**Key points**: +* Artifacts are created in the `generated/` directory +* Documentation is written to `docs/reference/` +* Each run rewrites the entire `generated/` directory +* Must be run from the ECS repo root +* The `--semconv-version` flag is **required** for OTel integration validation -### Generator Options +**For complete documentation on how the generator works**, see: +* [scripts/docs/README.md](scripts/docs/README.md) - Complete developer documentation +* [scripts/docs/schema-pipeline.md](scripts/docs/schema-pipeline.md) - Pipeline details +* [scripts/generator.py](scripts/generator.py) - Comprehensive inline documentation -#### Out +## Key Generator Options -Generate the ECS artifacts in a different output directory. If the specified directory doesn't exist, it will be created: - -``` -$ python scripts/generator.py --out ../myproject/ecs/out/ -``` +### Include Custom Fields -Inside the directory passed in as the target dir to the `--out` flag, two directories, `generated` and `docs`, will be created. `docs` will contain three asciidoc files based on the contents of the provided schema. `generated` will contain the various artifacts laid out as in the published repo (`beats`, `csv`, `ecs`, `elasticsearch`). +Add custom fields to ECS schemas: -> Note: When running using either the `--subset` or `--include` options, the asciidoc files will _not_ be generated. - -#### Include - -Use the `--include` flag to generate ECS artifacts based on the current ECS schema field definitions plus provided custom fields: - -``` -$ python scripts/generator.py --include ../myproject/ecs/custom-fields/ -$ python scripts/generator.py --include ../myproject/ecs/custom-fields/ ../myproject/ecs/more-custom-fields/ -$ python scripts/generator.py --include ../myproject/ecs/custom-fields/myprefix*.yml -$ python scripts/generator.py --include ../myproject/ecs/custom-fields/[some]*[re].yml -$ python scripts/generator.py --include ../myproject/ecs/custom-fields/myfile1.yml ../myproject/ecs/custom-fields/myfile2.yml +```bash +python scripts/generator.py \ + --semconv-version v1.38.0 \ + --include ../myproject/custom-fields/ \ + --out ../myproject/out/ ``` -The `--include` flag expects one or more directories or subsets of schema YAML files using the same [file format](https://github.com/elastic/ecs/tree/master/schemas#fields-supported-in-schemasyml) as the ECS schema files. This is useful for maintaining custom field definitions that are _outside_ of the ECS schema, but allows for merging the custom fields with the official ECS fields for your deployment. 
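+
+**Note**: `--include` does not validate custom YAML files before merging. This also allows overriding attributes of an existing ECS field without redefining every mandatory attribute, for example (a sketch):
+
+```yaml
+---
+- name: event
+  fields:
+    - name: category
+      example: widget
+```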
- -For example, if we defined the following schema definition in a file named `myproject/ecs/custom-fields/widget.yml`: +**Custom field format** - Use the same YAML format as ECS schemas: ```yaml --- @@ -165,290 +105,132 @@ For example, if we defined the following schema definition in a file named `mypr title: Widgets group: 2 short: Fields describing widgets - description: > - The widget fields describe a widget and all its widget-related details. + description: Widget-related fields type: group fields: - - name: id level: extended type: keyword short: Unique identifier of the widget - description: > - Unique identifier of the widget. -``` - -Multiple directory targets can also be provided: - -``` -$ python scripts/generator.py \ - --include ../myproject/custom-fields-A/ ../myproject/custom-fields-B \ - --out ../myproject/out/ -``` - -Generate artifacts using `--include` to load our custom definitions in addition to `--out` to place them in the desired output directory: - -``` -$ python scripts/generator.py --include ../myproject/custom-fields/ --out ../myproject/out/ -Loading schemas from local files -Running generator. ECS version 1.5.0 -Loading user defined schemas: ['../myproject/custom-fields/'] -``` - -We see the artifacts were generated successfully: - -``` -$ ls -lah ../myproject/out/ -total 0 -drwxr-xr-x 2 user ecs 64B Jul 8 13:12 docs -drwxr-xr-x 6 user ecs 192B Jul 8 13:12 generated -``` - -And looking at a specific artifact, `../myprojects/out/generated/elasticsearch/legacy/template.json`, we see our custom fields are included: - -```json -... - "widgets": { - "properties": { - "id": { - "ignore_above": 1024, - "type": "keyword" - } - } - } -... + description: Unique identifier of the widget. ``` -Include can be used together with the `--ref` flag to merge custom fields into a targeted ECS version. See [`Ref`](#ref). +**Supports**: Directories, multiple paths, wildcards (`*.yml`), combining with `--ref` -> NOTE: The `--include` mechanism will not validate custom YAML files prior to merging. This allows for modifying existing ECS fields in a custom schema without having to redefine all the mandatory field attributes. +**See also**: [Schema format documentation](https://github.com/elastic/ecs/tree/main/schemas#fields-supported-in-schemasyml) -#### Exclude +### Subset - Use Only Needed Fields -Use the `--exclude` flag to generate ephemeral ECS artifacts based on the current ECS schema field definitions minus fields considered for removal, e.g. to assess impact of removing these. Warning! This is not the recommended route to remove a field permanently as it is not intended to be invoked during the build process. Definitive field removal should be implemented using a custom [Subset](#subset) or via the [RFC process](https://github.com/elastic/ecs/tree/main/rfcs/README.md). 
Example: - -``` -$ python scripts/generator.py --exclude ../myproject/ecs/custom-fields/ -$ python scripts/generator.py --exclude ../myproject/ecs/custom-fields/ ../myproject/ecs/more-custom-fields/ -$ python scripts/generator.py --exclude ../myproject/ecs/custom-fields/myprefix*.yml -$ python scripts/generator.py --exclude ../myproject/ecs/custom-fields/[some]*[re].yml -$ python scripts/generator.py --exclude ../myproject/ecs/custom-fields/myfile1.yml ../myproject/ecs/custom-fields/myfile2.yml -``` +Generate artifacts with only the fields you need (reduces mapping size): -The `--exclude` flag expects one or more directories or subsets of schema YAML files using the same [file format](https://github.com/elastic/ecs/tree/master/schemas#fields-supported-in-schemasyml) as the ECS schema files. You can also use a subset, provided that relevant `name` and `fields` fields are preserved. - -``` ---- -- name: log - fields: - - name: original -``` - -The root Field Set `name` must always be present and specified with no dots `.`. Subfields may be specified using dot notation, for example: - -``` ---- -- name: log - fields: - - name: syslog.severity.name -``` - -Generate artifacts using `--exclude` to load our custom definitions in addition to `--out` to place them in the desired output directory: - -``` -$ python scripts/generator.py --exclude ../myproject/exclude-set.yml/ --out ../myproject/out/ -Loading schemas from local files -Running generator. ECS version 1.11.0 -``` - -#### Subset - -If your indices will never populate particular ECS fields, there's no need to include those field definitions in your index mappings, with the exception of the `base` fieldset, which must exist and which must contain at least the `@timestamp` field. The `--subset` argument allows for passing a subset definition YAML file which indicates which field sets or specific fields to include in the generated artifacts. - -``` -$ python scripts/generator.py --subset ../myproject/ecs/subset-fields/ -$ python scripts/generator.py --subset ../myproject/ecs/subset-fields/ ../myproject/ecs/more-subset-fields/ -$ python scripts/generator.py --subset ../myproject/ecs/custom-fields/subset.yml -$ python scripts/generator.py --subset ../myproject/ecs/custom-fields/[some]*[re].yml -$ python scripts/generator.py --subset ../myproject/ecs/custom-fields/myfile1.yml ../myproject/ecs/custom-fields/myfile2.yml +```bash +python scripts/generator.py \ + --semconv-version v1.38.0 \ + --subset ../myproject/subset.yml ``` -Example subset file: +**Example subset file**: ```yaml --- -name: malware_event +name: web_logs fields: base: fields: "@timestamp": {} - agent: - fields: "*" - dll: - fields: "*" - ecs: - fields: "*" - process: + http: + fields: "*" # All http fields + url: + fields: "*" # All url fields + user_agent: fields: - same_as_process: - docs_only: True + original: {} # Specific fields only ``` -The subset file has a defined format, starting with the two top-level required fields: - -* `name`: The name of the subset. Also used to name the directory holding the generated subset intermediate files (e.g. `/generated/ecs/subset/`) -* `fields` Contains the subset field filters - -The `fields` object declares which fields to include: - -* The targeted field sets are declared underneath `fields` by their top-level name (e.g. `base`, `agent`, etc.) 
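+
+Like `--include`, `--subset` accepts one or more files, directories, or glob patterns, e.g.:
+
+```bash
+python scripts/generator.py --semconv-version v1.38.0 \
+  --subset ../myproject/subset-fields/ ../myproject/more-subset-fields/
+```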
-* Underneath each field set, all sub-fields can be captured using a wildcard syntax: `fields: "*"`
-* Individual leafs fields can also be targeted: `@timestamp: {}`
-* For special cases, the `docs_only: True` attribute will add a field into `./docs` but not add into any other generated
- artifact. The only current use case for this feature is to document fields only populated in reused field sets.
-
-Reviewing the above example, the generator using subset will output artifacts containing:
+**Subset format**:
+* `name`: Subset name (used to name the output directory)
+* `fields`: Declares which fieldsets/fields to include
+  * `fields: "*"` - Include all fields in the fieldset
+  * `field_name: {}` - Include a specific field
+  * `docs_only: true` - Include in docs only, not in other artifacts

-* The `@timestamp` field from the `base` field set
-* All `agent.*` fields, `dll.*`, and `ecs.*` fields
+**Tips**:
+* Combine with `--include` for custom fields (they must also be listed in the subset)
+* Always include the `base` fieldset with at least `@timestamp`

-It's also possible to combine `--include` and `--subset` together! Do note that your subset YAML filter file will need to list any custom fields being passed with `--include`. Otherwise, `--subset` will filter those fields out.
+**For detailed subset documentation with examples**, see [scripts/docs/schema-pipeline.md](scripts/docs/schema-pipeline.md#subset-filtering)

-#### Ref
+### Ref - Target Specific ECS Version

-The `--ref` argument allows for passing a specific `git` tag (e.g. `v1.5.0`) or commit hash (`1454f8b`) that will be used to build ECS artifacts.
+Generate artifacts from a specific ECS version:

+```bash
+python scripts/generator.py \
+  --semconv-version v1.38.0 \
+  --ref v9.0.0
+```
-```
-$ python scripts/generator.py --ref v1.5.0
-```
-
-The `--ref` argument loads field definitions from the specified git reference (branch, tag, etc.) from directories [`./schemas`](./schemas) and [`./experimental/schemas`](./experimental/schemas) (when specified via `--include`).

-Here's another example loading both ECS fields and [experimental](experimental/README.md) changes *from branch "1.7"*, then adds custom fields on top.
+**Combines with other options**:

+```bash
+# Generate from ECS v9.0.0 + experimental + custom fields
+python scripts/generator.py \
+  --semconv-version v1.38.0 \
+  --ref v9.0.0 \
+  --include experimental/schemas ../myproject/fields/custom
+```
-```
-$ python scripts/generator.py --ref 1.7 --include experimental/schemas ../myproject/fields/custom --out ../myproject/out
-```
-
-The command above will produce artifacts based on:
-* main ECS field definitions as of branch 1.7
-* experimental ECS changes as of branch 1.7
-* custom fields in `../myproject/fields/custom` as they are on the filesystem
+Loads schemas from git history (tags, branches, or commits). Requires git.

-> Note: `--ref` does have a dependency on `git` being installed and all expected commits/tags fetched from the ECS upstream repo. This will unlikely be an issue unless you downloaded the ECS as a zip archive from GitHub vs. cloning it.

+### Other Options

-#### Mapping & Template Settings

+**`--out <dir>`** - Output to a custom directory
+```bash
+python scripts/generator.py --semconv-version v1.38.0 --out ../myproject/
+```

-The `--template-settings-legacy` / `--template-settings` and `--mapping-settings` arguments allow overriding the default template and mapping settings, respectively, in the generated Elasticsearch template artifacts. Both artifacts expect a JSON file which contains custom settings defined.
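+
+Inside the target directory, the script creates `generated` and `docs` trees (creating the directory if it doesn't exist), laid out as in the published repo:
+
+```
+../myproject/
+├── docs/
+└── generated/
+    ├── beats/
+    ├── csv/
+    ├── ecs/
+    └── elasticsearch/
+```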
-
-```
-$ python scripts/generator.py --template-settings-legacy ../myproject/es-overrides/template.json --mapping-settings ../myproject/es-overrides/mappings.json
-```
+**`--exclude <paths>`** - Remove specific fields (for testing deprecation impact)
+```bash
+python scripts/generator.py --semconv-version v1.38.0 --exclude deprecated-fields.yml
+```

-The `--template-settings-legacy` argument defines [index level settings](https://www.elastic.co/guide/en/elasticsearch/reference/current/index-modules.html#index-modules-settings) that will be applied to the legacy index template in the generated artifacts. The `--template-settings` argument now defines those same settings, but for the composable template in the generated artifacts.
-
-This is an example `template.json` to be passed with `--template-setting-legacy`:
-
-```json
-{
-  "index_patterns": ["mylog-*"],
-  "order": 1,
-  "settings": {
-    "index": {
-      "mapping": {
-        "total_fields": {
-          "limit": 10000
-        }
-      },
-      "refresh_interval": "1s"
-    }
-  },
-  "template": {
-    "mappings": {}
-  }
-}
-```
+**`--strict`** - Enable strict validation (required for CI/CD)
+```bash
+python scripts/generator.py --semconv-version v1.38.0 --strict
+```

-`--mapping-settings` works in the same way except now with the [mapping](https://www.elastic.co/guide/en/elasticsearch/reference/current/mapping.html) settings for the index. This is an example `mapping.json` file:
-
-```json
-{
-  "_meta": {
-    "version": "1.5.0"
-  },
-  "date_detection": false,
-  "dynamic_templates": [
-    {
-      "strings_as_keyword": {
-        "mapping": {
-          "ignore_above": 1024,
-          "type": "keyword"
-        },
-        "match_mapping_type": "string"
-      }
-    }
-  ],
-  "properties": {}
-}
-```
+**`--template-settings`** / **`--mapping-settings`** - Custom Elasticsearch template settings
+```bash
+python scripts/generator.py \
+  --semconv-version v1.38.0 \
+  --template-settings ../myproject/template.json \
+  --mapping-settings ../myproject/mappings.json
+```

-For `template.json`, the `mappings` object is left empty: `{}`. Likewise the `properties` object remains empty in the `mapping.json` example. This will be filled in automatically by the script.
-
-#### Strict Mode
-
-The `--strict` argument enables "strict mode". Strict mode performs a stricter validation step against the schema's contents.
-
-Basic usage:
-
-```
-$ python scripts/generator.py --strict
-```
-
-Strict mode requires the following conditions, else the script exits on an exception:
-
-* Short descriptions must be less than or equal to 120 characters.
-* Example values containing arrays or objects must be quoted to avoid unexpected YAML interpretation when the schema files or artifacts are relied on downstream.
-* If a regex `pattern` is defined, the example values will be checked against it.
-* If `expected_values` is defined, the example value(s) will be checked against the list.
-
-The current artifacts generated and published in the ECS repo will always be created using strict mode. However, older ECS versions (pre `v1.5.0`) will cause
-an exception if attempting to generate them using `--strict`. This is due to schema validation checks introduced after that version was released.
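+
+For reference, this is the sample `template.json` that earlier revisions of this guide showed for the legacy flag (`--template-settings-legacy`); the file passed to the composable `--template-settings` follows the same idea. The empty `mappings` object is filled in automatically by the script, and the file passed to `--mapping-settings` likewise leaves its `properties` object empty:
+
+```json
+{
+  "index_patterns": ["mylog-*"],
+  "order": 1,
+  "settings": {
+    "index": {
+      "mapping": {
+        "total_fields": {
+          "limit": 10000
+        }
+      },
+      "refresh_interval": "1s"
+    }
+  },
+  "template": {
+    "mappings": {}
+  }
+}
+```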
+## Additional Resources -Example: +### Complete Documentation -``` -$ python scripts/generator.py --ref v1.4.0 --strict -Loading schemas from git ref v1.4.0 -Running generator. ECS version 1.4.0 -... -ValueError: Short descriptions must be single line, and under 120 characters (current length: 134). -Offending field or field set: number -Short description: - Unique number allocated to the autonomous system. The autonomous system number (ASN) uniquely identifies each network on the Internet. -``` - -Removing `--strict` will display a warning message, but the script will finish its run successfully: - -``` -$ python scripts/generator.py --ref v1.4.0 -Loading schemas from git ref v1.4.0 -Running generator. ECS version 1.4.0 -/Users/ericbeahan/dev/ecs/scripts/generators/ecs_helpers.py:176: UserWarning: Short descriptions must be single line, and under 120 characters (current length: 134). -Offending field or field set: number -Short description: - Unique number allocated to the autonomous system. The autonomous system number (ASN) uniquely identifies each network on the Internet. - -This will cause an exception when running in strict mode. -``` +* **[scripts/docs/README.md](scripts/docs/README.md)** - Developer documentation index +* **[scripts/docs/schema-pipeline.md](scripts/docs/schema-pipeline.md)** - Complete pipeline documentation with: + * Detailed field reuse explanation with visual examples + * Comprehensive subset filtering guide with real-world examples + * Troubleshooting section for common issues +* **[scripts/generator.py](scripts/generator.py)** - Comprehensive inline documentation -#### Intermediate-Only +### Module-Specific Guides -The `--intermediate-only` argument is used for debugging purposes. It only generates the ["intermediate files"](generated/ecs), `ecs_flat.yml` and `ecs_nested.yml`, without generating the rest of the artifacts. -More information on the different intermediate files can be found in the generated directory's [README](generated/README.md). +* [OTel Integration](scripts/docs/otel-integration.md) - OpenTelemetry mapping validation +* [Elasticsearch Templates](scripts/docs/es-template.md) - Template generation details +* [Beats Configs](scripts/docs/beats-generator.md) - Beats field definitions +* [CSV Export](scripts/docs/csv-generator.md) - CSV field reference +* [Markdown Docs](scripts/docs/markdown-generator.md) - Documentation generation -#### Force-docs +### Contributing -By default, running the generator with `--subset`, `--include`, or `--exclude` flags will not generate the ECS docs in the `docs` directory. Use `--force-docs` to force the documentation to generate -even if one of those flags is also present. +* [CONTRIBUTING.md](CONTRIBUTING.md) - Contribution guidelines +* [Schema Format](https://github.com/elastic/ecs/tree/main/schemas#fields-supported-in-schemasyml) - YAML field definition format diff --git a/docs/reference/ecs-artifacts.md b/docs/reference/ecs-artifacts.md index e5b089c51e..1aa94dfda9 100644 --- a/docs/reference/ecs-artifacts.md +++ b/docs/reference/ecs-artifacts.md @@ -8,7 +8,7 @@ applies_to: # Generated artifacts [ecs-artifacts] -ECS maintains a collection of artifacts which are generated based on the schema. Examples include Elasticsearch index templates, CSV, and Beats field mappings. The maintained artifacts can be found in the [ECS Github repo](https://github.com/elastic/ecs/blob/master/generated#artifacts-generated-from-ecs). +ECS maintains a collection of artifacts which are generated based on the schema. 
Examples include Elasticsearch index templates, CSV, and Beats field mappings. The maintained artifacts can be found in the [ECS Github repo](https://github.com/elastic/ecs/blob/main/generated#artifacts-generated-from-ecs). -Users can generate custom versions of these artifacts using the ECS project’s tooling. See the tooling [usage documentation](https://github.com/elastic/ecs/blob/master/USAGE.md) for more detail. +Users can generate custom versions of these artifacts using the ECS project’s tooling. See the tooling [usage documentation](https://github.com/elastic/ecs/blob/main/USAGE.md) for more detail. diff --git a/docs/reference/ecs-converting.md b/docs/reference/ecs-converting.md index aadb59bda6..ac10f0aa45 100644 --- a/docs/reference/ecs-converting.md +++ b/docs/reference/ecs-converting.md @@ -22,7 +22,7 @@ Before you start a conversion, be sure that you understand the basics below. Make sure you understand the distinction between Core and Extended fields, as explained in the [Guidelines and Best Practices](/reference/ecs-guidelines.md). -Core and Extended fields are documented in the [*ECS Field Reference*](/reference/ecs-field-reference.md) or, for a single page representation of all fields, please see the [generated CSV of fields](https://github.com/elastic/ecs/blob/master/generated/csv/fields.csv). +Core and Extended fields are documented in the [*ECS Field Reference*](/reference/ecs-field-reference.md) or, for a single page representation of all fields, please see the [generated CSV of fields](https://github.com/elastic/ecs/blob/main/generated/csv/fields.csv). ### An approach to mapping an existing implementation [ecs-conv] diff --git a/docs/reference/ecs-field-reference.md b/docs/reference/ecs-field-reference.md index e7422a011a..f90ed10377 100644 --- a/docs/reference/ecs-field-reference.md +++ b/docs/reference/ecs-field-reference.md @@ -16,7 +16,7 @@ ECS defines multiple groups of related fields. They are called "field sets". The All other field sets are defined as objects in Elasticsearch, under which all fields are defined. -For a single page representation of all fields, please see the [generated CSV of fields](https://github.com/elastic/ecs/blob/master/generated/csv/fields.csv). +For a single page representation of all fields, please see the [generated CSV of fields](https://github.com/elastic/ecs/blob/main/generated/csv/fields.csv). ## Field sets [ecs-fieldsets] diff --git a/docs/reference/ecs-user-usage.md b/docs/reference/ecs-user-usage.md index 7fbe044b20..047d94b6e3 100644 --- a/docs/reference/ecs-user-usage.md +++ b/docs/reference/ecs-user-usage.md @@ -345,5 +345,5 @@ Like the other fields in the [related](/reference/ecs-related.md) field set, `re ## Mapping examples [ecs-user-usage-mappings] -For examples of mapping events from various sources, you can look at [RFC 0007 in section Source Data](https://github.com/elastic/ecs/blob/master/rfcs/text/0007-multiple-users.md#source-data). +For examples of mapping events from various sources, you can look at [RFC 0007 in section Source Data](https://github.com/elastic/ecs/blob/main/rfcs/text/0007-multiple-users.md#source-data). diff --git a/docs/reference/index.md b/docs/reference/index.md index dbeb756605..448657d885 100644 --- a/docs/reference/index.md +++ b/docs/reference/index.md @@ -41,5 +41,5 @@ ECS is a permissive schema. If your events have additional data that cannot be m ECS improvements are released following [Semantic Versioning](https://semver.org/). 
Major ECS releases are planned to be aligned with major Elastic Stack releases. -Any feedback on the general structure, missing fields, or existing fields is appreciated. For contributions please read the [Contribution Guidelines](https://github.com/elastic/ecs/blob/master/CONTRIBUTING.md). +Any feedback on the general structure, missing fields, or existing fields is appreciated. For contributions please read the [Contribution Guidelines](https://github.com/elastic/ecs/blob/main/CONTRIBUTING.md). diff --git a/scripts/docs/README.md b/scripts/docs/README.md new file mode 100644 index 0000000000..196628fcbd --- /dev/null +++ b/scripts/docs/README.md @@ -0,0 +1,167 @@ +# ECS Scripts Developer Documentation + +This directory contains developer-focused documentation for the ECS generation scripts. + +## Purpose + +The ECS repository includes a comprehensive toolchain for generating various artifacts from schema definitions. These developer guides explain: + +- **How each component works** internally +- **Architecture and design decisions** +- **How to make changes** and extend functionality +- **Troubleshooting** common issues + +## Documentation Structure + +### Module-Specific Guides + +Each major generator module has its own detailed guide: + +- **[otel-integration.md](otel-integration.md)** - OpenTelemetry Semantic Conventions integration + - Validation of ECS ↔ OTel mappings + - Loading OTel definitions from GitHub + - Generating alignment summaries + +- **[markdown-generator.md](markdown-generator.md)** - Markdown documentation generation + - Rendering ECS schemas to human-readable docs + - Jinja2 template system and customization + - OTel alignment documentation + - Adding new page types + +- **[intermediate-files.md](intermediate-files.md)** - Intermediate file generation + - Flat and nested format representations + - Bridge between schema processing and artifact generation + - Top-level vs. reusable fieldsets + - Data structure reference + +- **[es-template.md](es-template.md)** - Elasticsearch template generation + - Composable vs. 
legacy template formats + - Field type mapping conversion + - Template customization and settings + - Installation and troubleshooting + +- **[ecs-helpers.md](ecs-helpers.md)** - Utility functions library + - Dictionary operations (sorting, merging, copying) + - File operations (YAML I/O, globbing, directories) + - Git operations (tree access, version loading) + - Common patterns and best practices + +- **[csv-generator.md](csv-generator.md)** - CSV field reference generation + - Spreadsheet-compatible field export + - Column structure and multi-field handling + - Analysis and integration examples + - Usage in Excel, Google Sheets, databases + +- **[beats-generator.md](beats-generator.md)** - Beats field definition generation + - YAML field definitions for Elastic Beats + - Default field selection and allowlist + - Contextual naming and field groups + - Integration with Beat modules + +*(More module guides will be added here as documentation is expanded)* + +### Quick Reference + +For high-level usage information, see: +- **[../../USAGE.md](../../USAGE.md)** - User guide for running the generators +- **[../../CONTRIBUTING.md](../../CONTRIBUTING.md)** - Contribution guidelines + +## Scripts Overview + +The `scripts/` directory contains several key components: + +### Core Modules + +| Module | Purpose | Documentation | +|--------|---------|---------------| +| `generator.py` | **Main entry point** - orchestrates complete pipeline | Comprehensive docstrings in file | +| `generators/otel.py` | OTel integration and validation | [otel-integration.md](otel-integration.md) | +| `generators/markdown_fields.py` | Markdown documentation generation | [markdown-generator.md](markdown-generator.md) | +| `generators/intermediate_files.py` | Intermediate format generation | [intermediate-files.md](intermediate-files.md) | +| `generators/es_template.py` | Elasticsearch template generation | [es-template.md](es-template.md) | +| `generators/csv_generator.py` | CSV field reference export | [csv-generator.md](csv-generator.md) | +| `generators/beats.py` | Beats field definition generation | [beats-generator.md](beats-generator.md) | +| `generators/ecs_helpers.py` | Shared utility functions | [ecs-helpers.md](ecs-helpers.md) | + +### Schema Processing + +The schema processing pipeline transforms YAML schema definitions through multiple stages. See [schema-pipeline.md](schema-pipeline.md) for complete pipeline documentation. 
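+
+As a rough sketch, the stages compose like this (function names are illustrative, not the exact API; see the module docs below for the real entry points):
+
+```python
+# Illustrative outline of the pipeline that generator.py orchestrates
+fields = loader.load_schemas(ref, include_paths)        # parse YAML into a nested structure
+cleaner.clean(fields, strict=strict_mode)               # validate, normalize, apply defaults
+finalizer.finalize(fields)                              # perform field reuse, calculate names
+fields = subset_filter.filter(fields, subset_paths)     # optional --subset filtering
+fields = exclude_filter.exclude(fields, exclude_paths)  # optional --exclude filtering
+```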
+ +| Module | Purpose | Documentation | +|--------|---------|---------------| +| **Pipeline Overview** | Complete schema processing flow | **[schema-pipeline.md](schema-pipeline.md)** | +| `schema/loader.py` | Load and parse YAML schemas → nested structure | [schema-pipeline.md#1-loaderpy---schema-loading](schema-pipeline.md#1-loaderpy---schema-loading) | +| `schema/cleaner.py` | Validate, normalize, apply defaults | [schema-pipeline.md#2-cleanerpy---validation--normalization](schema-pipeline.md#2-cleanerpy---validation--normalization) | +| `schema/finalizer.py` | Perform field reuse, calculate names | [schema-pipeline.md#3-finalizerpy---field-reuse--name-calculation](schema-pipeline.md#3-finalizerpy---field-reuse--name-calculation) | +| `schema/visitor.py` | Traverse field hierarchies (visitor pattern) | [schema-pipeline.md#visitorpy---field-traversal](schema-pipeline.md#visitorpy---field-traversal) | +| `schema/subset_filter.py` | Filter to include only specified fields | [schema-pipeline.md#4-subset_filterpy---subset-filtering-optional](schema-pipeline.md#4-subset_filterpy---subset-filtering-optional) | +| `schema/exclude_filter.py` | Explicitly remove specified fields | [schema-pipeline.md#5-exclude_filterpy---exclude-filtering-optional](schema-pipeline.md#5-exclude_filterpy---exclude-filtering-optional) | + +### Types + +| Module | Purpose | +|--------|---------| +| `ecs_types/schema_fields.py` | Core ECS type definitions | +| `ecs_types/otel_types.py` | OTel-specific types | + +## Getting Started + +If you're new to the ECS generator codebase: + +1. **Start with the main orchestrator**: Read `generator.py` docstrings to understand the pipeline +2. **Understand schema processing**: Read [schema-pipeline.md](schema-pipeline.md) +3. **Pick a generator**: Choose a specific generator that interests you +4. **Read its documentation**: Start with the module-specific guide +5. **Explore the code**: Read the source with the guide as reference +6. **Run it**: Try generating artifacts to see it in action + +### Quick Command Reference + +```bash +# Standard generation (from local schemas) +python scripts/generator.py --semconv-version v1.24.0 + +# From specific git version +python scripts/generator.py --ref v8.10.0 --semconv-version v1.24.0 + +# With custom schemas +python scripts/generator.py --include custom/schemas/ --semconv-version v1.24.0 + +# Generate subset only +python scripts/generator.py --subset schemas/subsets/minimal.yml --semconv-version v1.24.0 + +# Strict validation mode +python scripts/generator.py --strict --semconv-version v1.24.0 + +# Intermediate files only (fast iteration) +python scripts/generator.py --intermediate-only --semconv-version v1.24.0 +``` + +See `generator.py` docstrings for complete argument documentation. + +## Contributing Documentation + +When adding or modifying generator code: + +1. **Update docstrings**: Add comprehensive Python docstrings to all functions and classes +2. **Update/create guide**: Ensure a markdown guide exists explaining the component +3. **Update this README**: Add links to new documentation +4. **Include examples**: Show practical usage examples +5. 
**Document edge cases**: Explain tricky parts and gotchas + +### Documentation Standards + +- **Python docstrings**: Use Google-style docstrings with Args, Returns, Raises, Examples +- **Markdown guides**: Include Overview, Architecture, Usage Examples, Troubleshooting +- **Code examples**: Should be runnable (or clearly marked as pseudocode) +- **Diagrams**: Use ASCII/Unicode diagrams for flow visualization +- **Tables**: Use markdown tables for structured comparisons + +## Questions? + +For questions about: +- **Using the tools**: See [USAGE.md](../../USAGE.md) or ask in the [Elastic community forums](https://discuss.elastic.co/) +- **Contributing**: See [CONTRIBUTING.md](../../CONTRIBUTING.md) +- **Architecture**: Read the relevant module guide in this directory +- **Bugs**: [Open an issue](https://github.com/elastic/ecs/issues) + diff --git a/scripts/docs/beats-generator.md b/scripts/docs/beats-generator.md new file mode 100644 index 0000000000..8cab881d97 --- /dev/null +++ b/scripts/docs/beats-generator.md @@ -0,0 +1,587 @@ +# Beats Field Definition Generator + +## Overview + +The Beats Generator (`generators/beats.py`) creates field definitions for Elastic Beats in YAML format. Beats (Filebeat, Metricbeat, Packetbeat, Winlogbeat, etc.) are lightweight data shippers that need field definitions to validate data structure, configure field behavior, and provide user documentation. + +### Purpose + +Beats are Elastic's lightweight data collection agents that ship data to Elasticsearch or Logstash. They need field definitions to: + +1. **Validate Data** - Ensure collected data matches expected structure +2. **Configure Behavior** - Control indexing, doc_values, multi-fields +3. **Document Fields** - Provide field reference to users +4. **Manage Defaults** - Determine which fields are enabled by default + +The challenge: Beats can't load all ~850 ECS fields by default due to memory and performance constraints. The generator uses an allowlist to mark essential fields as `default_field: true`, while keeping others available but not loaded by default. + +## Architecture + +### High-Level Flow + +``` +┌─────────────────────────────────────────────────────────────────┐ +│ generator.py (main) │ +│ │ +│ Load → Clean → Finalize → Generate Intermediate Files │ +└────────────────────────────┬────────────────────────────────────┘ + │ + ▼ +┌─────────────────────────────────────────────────────────────────┐ +│ intermediate_files.generate() │ +│ │ +│ Returns: (nested, flat) │ +└────────────────────────────┬────────────────────────────────────┘ + │ + ▼ nested structure +┌─────────────────────────────────────────────────────────────────┐ +│ beats.generate() │ +│ │ +│ 1. Filter non-root fieldsets │ +│ 2. Process 'base' fieldset (fields at root) │ +│ 3. Process other fieldsets (as groups or root) │ +│ 4. Load default_fields allowlist │ +│ 5. Set default_field flags recursively │ +│ 6. Wrap in 'ecs' top-level group │ +└────────────────────────────┬────────────────────────────────────┘ + │ + ▼ +┌─────────────────────────────────────────────────────────────────┐ +│ Output: fields.ecs.yml │ +│ │ +│ - key: ecs │ +│ title: ECS │ +│ fields: │ +│ - name: '@timestamp' │ +│ type: date │ +│ default_field: true │ +│ - name: agent │ +│ type: group │ +│ default_field: true │ +│ fields: │ +│ - name: id │ +│ type: keyword │ +│ default_field: true │ +└─────────────────────────────────────────────────────────────────┘ +``` + +### Key Components + +#### 1. 
generate() + +**Entry Point**: `generate(ecs_nested, ecs_version, out_dir)` + +Orchestrates the entire generation process: +- Filters fieldsets (removes top_level=false) +- Processes base fieldset first +- Processes other fieldsets as groups or root fields +- Applies default_field settings +- Writes YAML output + +#### 2. fieldset_field_array() + +**Purpose**: Convert ECS fields to Beats format + +**Transformations**: +- Filter to Beats-relevant properties +- Convert field names to contextual (relative) names +- Process multi-fields +- Sort fields alphabetically + +**Example**: +``` +ECS field name: http.request.method +Beats name (in http group): request.method +``` + +#### 3. set_default_field() + +**Purpose**: Mark fields that should be loaded by default + +**Logic**: +- Reads allowlist from `beats_default_fields_allowlist.yml` +- Recursively applies default_field flags +- Groups inherit and propagate settings +- Multi-fields inherit from parent field + +#### 4. write_beats_yaml() + +**Purpose**: Save formatted YAML with warning header + +## Beats YAML Structure + +### Top-Level Structure + +```yaml +# WARNING! Do not edit this file directly... + +- key: ecs + title: ECS + description: ECS Fields. + fields: + - name: '@timestamp' + type: date + default_field: true + description: Date/time when the event originated + + - name: agent + type: group + default_field: true + description: Agent fields + fields: + - name: id + type: keyword + default_field: true + description: Unique agent identifier +``` + +### Field Groups + +Field sets become groups in Beats: + +```yaml +- name: http + type: group + default_field: false # Group not default + title: HTTP + description: Fields related to HTTP activity + fields: + - name: request.method + type: keyword + default_field: true # But some fields within are + description: HTTP request method + + - name: request.bytes + type: long + default_field: false # Others are not + description: Request size in bytes +``` + +### Contextual Naming + +Beats uses relative field names within groups: + +| ECS Full Name | Beats Group | Beats Field Name | +|---------------|-------------|------------------| +| @timestamp | (root) | @timestamp | +| agent.id | agent | id | +| http.request.method | http | request.method | +| user.email | user | email | + +### Multi-Fields + +Multi-fields follow the same structure: + +```yaml +- name: message + type: match_only_text + default_field: true + description: Log message + multi_fields: + - name: text + type: match_only_text + default_field: true +``` + +### Root vs Group Fields + +**Root Fields** (root=true in schema): +- Appear directly in top-level fields array +- No group wrapper +- Example: base fieldset fields + +**Group Fields** (root=false or not specified): +- Wrapped in group with metadata +- Nested under group's fields array +- Example: http, user, process fieldsets + +## Default Fields Concept + +### The Challenge + +Beats face a trade-off: +- **More fields** = More memory/CPU usage +- **Fewer fields** = Less data captured + +All ~850 ECS fields would consume too many resources for many use cases. 
+ +### The Solution: default_field + +Fields marked `default_field: true` are: +- Loaded by Beat on startup +- Available for immediate use +- Included in index mappings + +Fields marked `default_field: false`: +- Not loaded by default +- Can be enabled in Beat configuration +- Won't appear in index unless explicitly enabled + +### Allowlist File + +`beats_default_fields_allowlist.yml` contains ~400 essential fields: + +```yaml +!!set +# Core timestamp +'@timestamp': null + +# Essential agent fields +agent.id: null +agent.name: null +agent.type: null +agent.version: null + +# Common network fields +client.ip: null +client.port: null +server.ip: null +server.port: null + +# Essential event categorization +event.kind: null +event.category: null +event.type: null +event.outcome: null + +# Common message/log fields +message: null +log.level: null +... +``` + +### Inheritance Rules + +**Groups**: +- Top-level groups: `default_field: true` +- Nested groups: Inherit from parent + +**Fields**: +- In allowlist: `default_field: true` +- Parent is default: Children are default +- Otherwise: `default_field: false` + +**Multi-fields**: +- Always inherit from parent field + +## Usage Examples + +### Running the Generator + +Typically invoked through the main generator: + +```bash +# From repository root +make clean +make SEMCONV_VERSION=v1.24.0 + +# Beats file created at: +# generated/beats/fields.ecs.yml +``` + +### Programmatic Usage + +```python +from generators.beats import generate +from generators.intermediate_files import generate as gen_intermediate + +# Generate intermediate files +nested, flat = gen_intermediate(fields, 'generated/ecs', True) + +# Generate Beats fields +generate(nested, '8.11.0', 'generated') +# Creates generated/beats/fields.ecs.yml +``` + +### Loading in Beat Module + +```yaml +# In a Beat module (e.g., Filebeat module) +# Reference the generated file: + +--- +- name: http + type: group + description: HTTP fields from ECS + fields: + !include ../../../generated/beats/fields.ecs.yml +``` + +### Checking default_field Settings + +```python +import yaml + +with open('generated/beats/fields.ecs.yml') as f: + beats_def = yaml.safe_load(f) + +def count_default_fields(fields, count={'default': 0, 'non_default': 0}): + for field in fields: + if field.get('default_field', False): + count['default'] += 1 + else: + count['non_default'] += 1 + + if 'fields' in field: + count_default_fields(field['fields'], count) + if 'multi_fields' in field: + count_default_fields(field['multi_fields'], count) + + return count + +counts = count_default_fields(beats_def[0]['fields']) +print(f"Default fields: {counts['default']}") +print(f"Non-default fields: {counts['non_default']}") +``` + +## Making Changes + +### Adding Fields to Allowlist + +To make a field load by default in Beats: + +1. **Edit allowlist**: +```yaml +# beats_default_fields_allowlist.yml +# Add new field +new.field.name: null +``` + +2. **Regenerate**: +```bash +make clean +make SEMCONV_VERSION=v1.24.0 +``` + +3. **Verify**: +```bash +grep "name: field.name" generated/beats/fields.ecs.yml -A 1 +# Should show: default_field: true +``` + +### Removing Fields from Allowlist + +To stop a field from loading by default: + +1. Remove from `beats_default_fields_allowlist.yml` +2. Regenerate as above +3. Verify field now has `default_field: false` + +### Adding New Field Properties + +To include additional properties in Beats output: + +```python +def fieldset_field_array(...): + allowed_keys: List[str] = [ + 'name', + 'level', + # ... 
existing keys ... + 'new_property', # Add here + ] + # ... rest of function +``` + +### Changing Contextual Naming Logic + +To modify how field names are made relative: + +```python +def fieldset_field_array(...): + # Current logic + if '' == fieldset_prefix: + contextual_name = nested_field_name + else: + contextual_name = '.'.join(nested_field_name.split('.')[1:]) + + # Custom logic example: keep full names + contextual_name = nested_field_name + + # Or: different prefix handling + if fieldset_prefix: + contextual_name = nested_field_name.replace(fieldset_prefix + '.', '') +``` + +## Troubleshooting + +### Common Issues + +#### Fields missing default_field property + +**Symptom**: Some fields don't have `default_field` set + +**Check**: +```bash +# Count fields without default_field +grep -c "name:" generated/beats/fields.ecs.yml +grep -c "default_field:" generated/beats/fields.ecs.yml +# Should be equal (or close, accounting for structure) +``` + +**Solution**: Ensure `set_default_field()` is being called after field processing + +#### Allowlist changes not applying + +**Symptom**: Modified allowlist but field still has old default_field value + +**Solution**: +```bash +# Clean build directory +make clean + +# Regenerate from scratch +make SEMCONV_VERSION=v1.24.0 + +# Verify allowlist was loaded +grep "your.field.name" scripts/generators/beats_default_fields_allowlist.yml +grep "your.field.name" -A 1 generated/beats/fields.ecs.yml +``` + +#### Contextual names incorrect + +**Symptom**: Field names still show full ECS path instead of relative + +**Debug**: +```python +# In fieldset_field_array() +print(f"Field: {nested_field_name}") +print(f"Prefix: {fieldset_prefix}") +print(f"Contextual: {contextual_name}") +``` + +**Check**: +- Is fieldset_prefix being passed correctly? +- Is the split('.')[1:] logic working for your case? + +#### YAML syntax errors + +**Symptom**: Beats can't load the generated YAML + +**Validate**: +```bash +# Check YAML syntax +python -c "import yaml; yaml.safe_load(open('generated/beats/fields.ecs.yml'))" + +# Or use yamllint if available +yamllint generated/beats/fields.ecs.yml +``` + +**Common issues**: +- Unescaped special characters in descriptions +- Incorrect indentation +- Missing required properties + +### Debugging Tips + +#### View generated structure + +```python +import yaml + +with open('generated/beats/fields.ecs.yml') as f: + beats = yaml.safe_load(f) + +# Inspect top-level +print(beats[0].keys()) # ['key', 'title', 'description', 'fields'] + +# Count groups +groups = [f for f in beats[0]['fields'] if f.get('type') == 'group'] +print(f"Groups: {len(groups)}") + +# List all top-level field names +for field in beats[0]['fields'][:10]: + print(f" {field['name']}: {field.get('type', 'no-type')}") +``` + +#### Compare with previous version + +```bash +# Generate current version +make SEMCONV_VERSION=v1.24.0 +cp generated/beats/fields.ecs.yml fields_new.yml + +# Check out previous version +git checkout HEAD~1 + +# Generate old version +make clean +make SEMCONV_VERSION=v1.24.0 +cp generated/beats/fields.ecs.yml fields_old.yml + +# Compare +diff -u fields_old.yml fields_new.yml | less +``` + +#### Trace default_field assignment + +Add debugging to `set_default_field()`: + +```python +def set_default_field(fields, df_allowlist, df=False, path=''): + for fld in fields: + fld_path = fld['name'] + if path != '' and not fld.get('root', False): + fld_path = path + '.' 
+ fld_path + + expected = fld_path in df_allowlist + + # Debug output + if fld_path.startswith('http'): # Focus on http fields + print(f"{fld_path}: in_allowlist={expected}, parent_df={df}") + + # ... rest of function +``` + +## Integration with Beats + +### In Beat Modules + +Beats modules include field definitions: + +```yaml +# module/http/access/_meta/fields.yml +- name: http + type: group + description: Fields related to HTTP + fields: + !include ../../../../../../generated/beats/fields.ecs.yml +``` + +### Loading Custom Fields + +Users can enable non-default fields: + +```yaml +# filebeat.yml +filebeat.modules: + - module: httpmodule + access: + enabled: true + var.additional_fields: + - http.request.body.bytes + - http.request.referrer +``` + +### Field Conflicts + +If a Beat defines custom fields with same names as ECS: +- ECS fields take precedence +- Merge is automatic +- Custom fields should use different names or namespaces + +## Related Files + +- `scripts/generator.py` - Main entry point +- `scripts/generators/intermediate_files.py` - Produces nested structure +- `scripts/generators/ecs_helpers.py` - Utility functions +- `scripts/generators/beats_default_fields_allowlist.yml` - Default field allowlist +- `schemas/*.yml` - Source ECS schemas +- `generated/beats/fields.ecs.yml` - Output file + +## References + +- [Beats Documentation](https://www.elastic.co/guide/en/beats/libbeat/current/index.html) +- [Beats Developer Guide](https://www.elastic.co/guide/en/beats/devguide/current/index.html) +- [Beats Fields YAML Format](https://www.elastic.co/guide/en/beats/devguide/current/fields-yml.html) +- [ECS Beats Integration](https://www.elastic.co/guide/en/ecs/current/ecs-beats.html) + diff --git a/scripts/docs/csv-generator.md b/scripts/docs/csv-generator.md new file mode 100644 index 0000000000..cd370ba84d --- /dev/null +++ b/scripts/docs/csv-generator.md @@ -0,0 +1,428 @@ +# CSV Field Reference Generator + +## Overview + +The CSV Generator (`generators/csv_generator.py`) produces a spreadsheet-compatible field reference for all ECS fields. It exports field definitions to a simple CSV (Comma-Separated Values) format that can be easily imported into spreadsheet applications, databases, or custom analysis tools. + +### Purpose + +This generator creates a human-readable, machine-parseable field catalog that's useful for: + +1. **Quick Reference** - Search and filter fields in Excel/Google Sheets +2. **Data Analysis** - Analyze field usage patterns and statistics +3. **Integration** - Parse for custom tooling and automation +4. **Documentation** - Include in presentations or reports +5. **Version Comparison** - Diff CSV files to see field changes + +The CSV format is intentionally simple and widely compatible, making ECS field data accessible to anyone with a spreadsheet application. + +## Architecture + +### High-Level Flow + +``` +┌─────────────────────────────────────────────────────────────────┐ +│ generator.py (main) │ +│ │ +│ Load → Clean → Finalize → Generate Intermediate Files │ +└────────────────────────────┬────────────────────────────────────┘ + │ + ▼ +┌─────────────────────────────────────────────────────────────────┐ +│ intermediate_files.generate() │ +│ │ +│ Returns: (nested, flat) │ +└────────────────────────────┬────────────────────────────────────┘ + │ + ▼ flat dictionary +┌─────────────────────────────────────────────────────────────────┐ +│ csv_generator.generate() │ +│ 1. base_first() - Sort fields (base fields first) │ +│ 2. 
save_csv() - Write CSV with header + field rows │ +└────────────────────────────┬────────────────────────────────────┘ + │ + ▼ +┌─────────────────────────────────────────────────────────────────┐ +│ Output: fields.csv │ +│ │ +│ ECS_Version,Indexed,Field_Set,Field,Type,Level,Normalization │ +│ 8.11.0,true,base,@timestamp,date,core,,2016-05-23... │ +│ 8.11.0,true,http,http.request.method,keyword,extended,... │ +│ ... │ +└─────────────────────────────────────────────────────────────────┘ +``` + +### Key Components + +#### 1. generate() + +**Entry Point**: `generate(ecs_flat, version, out_dir)` + +Orchestrates CSV generation: +- Creates output directory +- Sorts fields appropriately +- Writes CSV file + +#### 2. base_first() + +**Purpose**: Sort fields for readable output + +**Logic**: +1. Base fields (no dots): @timestamp, message, tags, etc. +2. All other fields alphabetically: agent.*, as.*, client.*, ... + +**Rationale**: Base fields are foundational and referenced frequently, so they appear at the top for easy access. + +#### 3. save_csv() + +**Purpose**: Write field data to CSV format + +**Features**: +- Header row with column names +- One row per field (plus multi-fields) +- Multi-fields get separate rows +- Consistent quoting and line endings + +## CSV Structure + +### Columns + +| Column | Description | Example Values | +|--------|-------------|----------------| +| **ECS_Version** | Version of ECS | 8.11.0, 8.11.0+exp | +| **Indexed** | Whether field is indexed | true, false | +| **Field_Set** | Fieldset name | base, http, user, agent | +| **Field** | Full dotted field name | @timestamp, http.request.method | +| **Type** | Elasticsearch field type | keyword, long, ip, date | +| **Level** | Field level | core, extended, custom | +| **Normalization** | Normalization rules | array, to_lower, array, to_lower | +| **Example** | Example value | GET, 192.0.2.1, 2016-05-23... 
+
+### Multi-Fields
+
+Fields with multi-fields (alternate representations) get additional rows:
+
+```csv
+8.11.0,true,base,message,match_only_text,core,,Hello world,Log message
+8.11.0,true,base,message.text,match_only_text,core,,,Log message
+```
+
+Multi-field rows:
+- Share version, indexed, field_set, level, description
+- Have unique field name and type
+- Have empty normalization and example
+
+## Example Output
+
+```csv
+ECS_Version,Indexed,Field_Set,Field,Type,Level,Normalization,Example,Description
+8.11.0,true,base,@timestamp,date,core,,2016-05-23T08:05:34.853Z,Date/time when the event originated
+8.11.0,true,base,message,match_only_text,core,,Hello World,Log message optimized for viewing
+8.11.0,true,base,message.text,match_only_text,core,,,Log message optimized for viewing
+8.11.0,true,base,tags,keyword,core,array,"production, eu-west-1",List of keywords for event
+8.11.0,true,agent,agent.build.original,keyword,core,,,Extended build information
+8.11.0,true,agent,agent.ephemeral_id,keyword,extended,,8a4f500f,Ephemeral identifier
+8.11.0,true,agent,agent.id,keyword,core,,8a4f500d,Unique agent identifier
+8.11.0,true,http,http.request.body.bytes,long,extended,,1437,Request body size in bytes
+8.11.0,true,http,http.request.method,keyword,extended,array,GET,HTTP request method
+8.11.0,true,http,http.response.status_code,long,extended,,404,HTTP response status code
+```
+
+## Usage Examples
+
+### Running the Generator
+
+Typically invoked through the main generator:
+
+```bash
+# From repository root
+make clean
+make SEMCONV_VERSION=v1.24.0
+
+# CSV file created at:
+# generated/csv/fields.csv
+```
+
+### Programmatic Usage
+
+```python
+from generators.csv_generator import generate
+from generators.intermediate_files import generate as gen_intermediate
+
+# Generate intermediate files
+nested, flat = gen_intermediate(fields, 'generated/ecs', True)
+
+# Generate CSV
+generate(flat, '8.11.0', 'generated')
+# Creates generated/csv/fields.csv
+```
+
+### Analyzing Field Data
+
+**Count fields by type**:
+```python
+import csv
+from collections import Counter
+
+with open('generated/csv/fields.csv') as f:
+    reader = csv.DictReader(f)
+    types = Counter(row['Type'] for row in reader)
+
+print("Field types:")
+for field_type, count in types.most_common():
+    print(f"  {field_type}: {count}")
+```
+
+**Find all extended-level fields**:
+```python
+import csv
+
+with open('generated/csv/fields.csv') as f:
+    reader = csv.DictReader(f)
+    extended = [row for row in reader if row['Level'] == 'extended']
+
+print(f"Extended fields: {len(extended)}")
+for field in extended[:5]:
+    print(f"  {field['Field']}")
+```
+
+**Fields by fieldset**:
+```python
+import csv
+from collections import defaultdict
+
+with open('generated/csv/fields.csv') as f:
+    reader = csv.DictReader(f)
+    by_fieldset = defaultdict(list)
+    for row in reader:
+        by_fieldset[row['Field_Set']].append(row['Field'])
+
+for fieldset in sorted(by_fieldset):
+    print(f"{fieldset}: {len(by_fieldset[fieldset])} fields")
+```
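+
+**Compare field inventories across versions** (a sketch supporting the version-comparison use case from the overview; the `old/` and `new/` paths are hypothetical and stand for two separately generated trees):
+```python
+import csv
+
+def field_names(path):
+    # Collect the set of field names from a generated fields.csv
+    with open(path, newline='') as f:
+        return {row['Field'] for row in csv.DictReader(f)}
+
+old_fields = field_names('old/generated/csv/fields.csv')
+new_fields = field_names('new/generated/csv/fields.csv')
+
+print("Added:", sorted(new_fields - old_fields))
+print("Removed:", sorted(old_fields - new_fields))
+```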
+
+## Making Changes
+
+### Adding New Columns
+
+To add a new column to the CSV:
+
+1. **Update header row**:
+```python
+schema_writer.writerow([
+    "ECS_Version", "Indexed", "Field_Set", "Field",
+    "Type", "Level", "Normalization", "Example", "Description",
+    "New_Column"  # Add here
+])
+```
+
+2. **Add to data rows**:
+```python
+schema_writer.writerow([
+    version,
+    indexed,
+    field_set,
+    field['flat_name'],
+    field['type'],
+    field['level'],
+    ', '.join(field['normalize']),
+    field.get('example', ''),
+    field['short'],
+    field.get('new_property', 'default_value')  # Add here
+])
+```
+
+3. **Update multi-field rows** similarly
+
+4. **Update documentation** in this file
+
+### Changing Field Sorting
+
+To change sort order:
+
+```python
+def base_first(ecs_flat: Dict[str, Field]) -> List[Field]:
+    # Custom sorting logic
+    fields_list = list(ecs_flat.values())
+
+    # Sort by level, then name
+    return sorted(fields_list, key=lambda f: (f['level'], f['flat_name']))
+
+    # Or by fieldset, then name:
+    # return sorted(fields_list, key=lambda f: (f['flat_name'].split('.')[0], f['flat_name']))
+```
+
+### Changing CSV Format
+
+To modify CSV formatting:
+
+```python
+schema_writer = csv.writer(
+    csvfile,
+    delimiter=';',           # Use semicolon instead
+    quoting=csv.QUOTE_ALL,   # Quote all fields
+    quotechar='"',
+    lineterminator='\r\n'    # Windows line endings
+)
+```
+
+### Filtering Fields
+
+To exclude certain fields:
+
+```python
+def generate(ecs_flat: Dict[str, Field], version: str, out_dir: str) -> None:
+    ecs_helpers.make_dirs(join(out_dir, 'csv'))
+
+    # Filter out custom fields
+    filtered = {k: v for k, v in ecs_flat.items() if v['level'] != 'custom'}
+
+    sorted_fields = base_first(filtered)
+    save_csv(join(out_dir, 'csv/fields.csv'), sorted_fields, version)
+```
+
+## Troubleshooting
+
+### Common Issues
+
+#### CSV not opening correctly in Excel
+
+**Symptom**: Fields appear in wrong columns or all in one column
+
+**Solutions**:
+1. Use "Text to Columns" feature:
+   - Select data → Data tab → Text to Columns
+   - Choose "Delimited" → Next
+   - Select "Comma" → Finish
+
+2. Import instead of opening the file directly:
+   - Data → Get Data → From File → From Text/CSV
+   - When opened directly, Excel uses the system list separator (regional settings), which may not be a comma
+
+3. Save as Excel format after import:
+   - File → Save As → Excel Workbook (.xlsx)
+
+#### Unicode/special character issues
+
+**Symptom**: Strange characters in descriptions or examples
+
+**Solutions**:
+1. Ensure UTF-8 encoding when opening:
+   - In Excel: Data → Get Data → From File → From Text/CSV
+   - Select UTF-8 encoding
+
+2. Or fix in code:
+```python
+with open(file, 'w', encoding='utf-8') as csvfile:
+    # ... 
write CSV +``` + +#### Missing multi-fields + +**Symptom**: Multi-fields not appearing in CSV + +**Check**: +```python +# Verify field has multi_fields +field = flat['message'] +print('multi_fields' in field) +print(field.get('multi_fields')) + +# Check multi-field structure +if 'multi_fields' in field: + for mf in field['multi_fields']: + print(f" {mf['flat_name']}: {mf['type']}") +``` + +#### Empty normalization column + +**Symptom**: Normalization column is always empty + +**Check** field definitions have `normalize` key: +```python +field = flat['some.field'] +print(field.get('normalize', [])) # Should be a list +``` + +### Debugging Tips + +#### Verify field count + +```python +import csv + +with open('generated/csv/fields.csv') as f: + reader = csv.DictReader(f) + rows = list(reader) + print(f"Total rows: {len(rows)}") + +# Compare with flat format +from generators.intermediate_files import generate as gen_intermediate +nested, flat = gen_intermediate(fields, 'generated/ecs', True) +print(f"Flat fields: {len(flat)}") + +# Count multi-fields +multi_field_count = sum( + len(f.get('multi_fields', [])) for f in flat.values() +) +print(f"Multi-fields: {multi_field_count}") +print(f"Expected total: {len(flat) + multi_field_count}") +``` + +#### Check field sets + +```python +import csv +from collections import Counter + +with open('generated/csv/fields.csv') as f: + reader = csv.DictReader(f) + fieldsets = Counter(row['Field_Set'] for row in reader) + +print("Fields per fieldset:") +for fieldset, count in sorted(fieldsets.items()): + print(f" {fieldset}: {count}") +``` + +#### Validate CSV syntax + +```python +import csv + +try: + with open('generated/csv/fields.csv') as f: + reader = csv.DictReader(f) + for i, row in enumerate(reader, 1): + # Check required columns + required = ['Field', 'Type', 'Level'] + for col in required: + if not row[col]: + print(f"Row {i}: Missing {col}") + print("CSV validation passed") +except csv.Error as e: + print(f"CSV error: {e}") +``` + +## Related Files + +- `scripts/generator.py` - Main entry point +- `scripts/generators/intermediate_files.py` - Produces flat format +- `scripts/generators/ecs_helpers.py` - Utility functions +- `schemas/*.yml` - Source ECS schemas +- `generated/csv/fields.csv` - Output file + +## References + +- [CSV Format Specification (RFC 4180)](https://tools.ietf.org/html/rfc4180) +- [Python csv Module Documentation](https://docs.python.org/3/library/csv.html) +- [ECS Field Reference](https://www.elastic.co/guide/en/ecs/current/ecs-field-reference.html) + diff --git a/scripts/docs/ecs-helpers.md b/scripts/docs/ecs-helpers.md new file mode 100644 index 0000000000..5567571731 --- /dev/null +++ b/scripts/docs/ecs-helpers.md @@ -0,0 +1,703 @@ +# ECS Helper Utilities + +## Overview + +The ECS Helpers module (`generators/ecs_helpers.py`) provides a comprehensive collection of utility functions used across all ECS generator scripts. These helpers abstract common patterns and provide reusable building blocks for working with schemas, files, and data structures. + +### Purpose + +This module serves as the shared utility layer for the entire ECS build system, providing: + +1. **Dictionary Operations** - Copying, sorting, merging, ordering +2. **File Operations** - YAML I/O, file discovery, directory management +3. **Git Operations** - Repository introspection, version loading +4. **List Operations** - Filtering, extraction, transformation +5. **Field Utilities** - Type checking, filtering by reusability +6. 
**Warning System** - Consistent warning generation + +By centralizing these utilities, the module ensures consistency across generators and reduces code duplication. + +## Function Categories + +### Dictionary Helpers + +#### dict_copy_keys_ordered() +```python +def dict_copy_keys_ordered(dct: Field, copied_keys: List[str]) -> Field +``` + +**Purpose**: Copy specific keys in a defined order + +**Use Case**: Ensuring consistent field ordering in output files + +**Example**: +```python +field = { + 'description': 'HTTP request method', + 'name': 'method', + 'type': 'keyword', + 'level': 'extended' +} + +# Copy in specific order +ordered = dict_copy_keys_ordered(field, ['name', 'type', 'level', 'description']) +# OrderedDict([('name', 'method'), ('type', 'keyword'), ...]) +``` + +#### dict_copy_existing_keys() +```python +def dict_copy_existing_keys(source: Field, destination: Field, keys: List[str]) -> None +``` + +**Purpose**: Selectively copy keys that exist in source + +**Use Case**: Building Elasticsearch mappings with type-specific parameters + +**Example**: +```python +source = {'type': 'keyword', 'ignore_above': 1024, 'index': True} +dest = {'type': 'keyword'} + +dict_copy_existing_keys(source, dest, ['ignore_above', 'normalizer']) +# dest now: {'type': 'keyword', 'ignore_above': 1024} +# 'normalizer' not copied (not in source) +``` + +#### dict_sorted_by_keys() +```python +def dict_sorted_by_keys(dct: FieldNestedEntry, sort_keys: List[str]) -> List[FieldNestedEntry] +``` + +**Purpose**: Sort dictionary values by multiple criteria + +**Use Case**: Sorting fieldsets for consistent documentation ordering + +**Example**: +```python +fieldsets = { + 'http': {'name': 'http', 'group': 2, 'title': 'HTTP'}, + 'base': {'name': 'base', 'group': 1, 'title': 'Base'}, + 'agent': {'name': 'agent', 'group': 1, 'title': 'Agent'} +} + +sorted_fs = dict_sorted_by_keys(fieldsets, ['group', 'name']) +# Returns: [agent, base, http] (group 1, 1, 2; names alphabetical within group) +``` + +#### ordered_dict_insert() +```python +def ordered_dict_insert( + dct: Field, + new_key: str, + new_value: Union[str, bool], + before_key: Optional[str] = None, + after_key: Optional[str] = None +) -> None +``` + +**Purpose**: Insert key-value pair at specific position + +**Use Case**: Adding fields in specific locations for readability + +**Example**: +```python +from collections import OrderedDict + +d = OrderedDict([('name', 'field'), ('type', 'keyword')]) +ordered_dict_insert(d, 'level', 'extended', after_key='type') +# d now: [('name', 'field'), ('type', 'keyword'), ('level', 'extended')] +``` + +#### safe_merge_dicts() +```python +def safe_merge_dicts(a: Dict[Any, Any], b: Dict[Any, Any]) -> Dict[Any, Any] +``` + +**Purpose**: Merge dictionaries with duplicate key detection + +**Use Case**: Combining schema definitions safely + +**Example**: +```python +base_fields = {'@timestamp': {...}, 'message': {...}} +custom_fields = {'user_id': {...}} + +merged = safe_merge_dicts(base_fields, custom_fields) +# Success: All keys unique + +duplicate_fields = {'message': {...}} # Duplicate key! 
+merged = safe_merge_dicts(base_fields, duplicate_fields)
+# Raises ValueError: Duplicate key found when merging dictionaries: message
+```
+
+#### fields_subset()
+```python
+def fields_subset(subset, fields)
+```
+
+**Purpose**: Extract subset of fields based on specification
+
+**Use Case**: Generating partial schemas for specific use cases
+
+**Example**:
+```python
+subset_spec = {
+    'http': {'fields': '*'},  # All HTTP fields
+    'user': {                 # Only specific user fields
+        'fields': {
+            'name': {},
+            'email': {}
+        }
+    }
+}
+
+filtered = fields_subset(subset_spec, all_fields)
+# Returns only http.* and user.name, user.email
+```
+
+### File Helpers
+
+#### is_yaml()
+```python
+def is_yaml(path: str) -> bool
+```
+
+**Purpose**: Check if file has YAML extension
+
+**Example**:
+```python
+is_yaml('schemas/http.yml')   # True
+is_yaml('output.json')        # False
+is_yaml('file.test.yaml')     # True
+```
+
+#### glob_yaml_files()
+```python
+def glob_yaml_files(paths: List[str]) -> List[str]
+```
+
+**Purpose**: Find all YAML files from paths/wildcards/directories
+
+**Example**:
+```python
+# Direct files
+glob_yaml_files(['schemas/http.yml', 'schemas/user.yml'])
+# ['schemas/http.yml', 'schemas/user.yml']
+
+# Directory
+glob_yaml_files(['schemas/'])
+# ['schemas/agent.yml', 'schemas/base.yml', ...]
+
+# Wildcard
+glob_yaml_files(['schemas/*.yml'])
+# All YAML files in schemas/
+
+# Comma-separated string
+glob_yaml_files('schemas/http.yml,schemas/user.yml')
+# ['schemas/http.yml', 'schemas/user.yml']
+```
+
+#### make_dirs()
+```python
+def make_dirs(path: str) -> None
+```
+
+**Purpose**: Create directory and parents safely
+
+**Example**:
+```python
+make_dirs('generated/elasticsearch/composable/component')
+# Creates all parent directories if they don't exist
+# No error if already exists
+```
+
+#### yaml_dump() / yaml_load()
+```python
+def yaml_dump(filename: str, data: Dict, preamble: Optional[str] = None) -> None
+def yaml_load(filename: str) -> Dict
+```
+
+**Purpose**: Save/load YAML files with consistent formatting
+
+**Example**:
+```python
+# Save with header
+yaml_dump(
+    'output.yml',
+    {'name': 'http', 'fields': [...]},
+    preamble='# Auto-generated - do not edit\n'
+)
+
+# Load
+data = yaml_load('schemas/http.yml')
+print(data['name'])  # 'http'
+```
+
+#### ecs_files() / usage_doc_files()
+```python
+def ecs_files() -> List[str]
+def usage_doc_files() -> List[str]
+```
+
+**Purpose**: Get lists of schema or usage doc files
+
+**Example**:
+```python
+schemas = ecs_files()
+# ['schemas/agent.yml', 'schemas/base.yml', ...]
+
+usage_docs = usage_doc_files()
+# ['ecs-http-usage.md', 'ecs-user-usage.md', ...] 
+``` + +### Git Helpers + +#### get_tree_by_ref() +```python +def get_tree_by_ref(ref: str) -> git.objects.tree.Tree +``` + +**Purpose**: Access repository contents at specific git reference + +**Use Case**: Loading schemas from specific version/branch/tag + +**Example**: +```python +# Load from tag +tree = get_tree_by_ref('v8.10.0') +http_schema = tree['schemas']['http.yml'].data_stream.read() + +# Load from branch +tree = get_tree_by_ref('main') + +# Load from commit +tree = get_tree_by_ref('abc123def') +``` + +#### path_exists_in_git_tree() +```python +def path_exists_in_git_tree(tree: git.objects.tree.Tree, file_path: str) -> bool +``` + +**Purpose**: Check if path exists in git tree + +**Example**: +```python +tree = get_tree_by_ref('main') + +if path_exists_in_git_tree(tree, 'schemas/http.yml'): + # Load the file + content = tree['schemas']['http.yml'].data_stream.read() +``` + +### List Helpers + +#### list_subtract() +```python +def list_subtract(original: List[Any], subtracted: List[Any]) -> List[Any] +``` + +**Purpose**: Remove elements from list + +**Example**: +```python +all_fields = ['name', 'type', 'description', 'example', 'internal'] +public_fields = list_subtract(all_fields, ['internal']) +# ['name', 'type', 'description', 'example'] +``` + +#### list_extract_keys() +```python +def list_extract_keys(lst: List[Field], key_name: str) -> List[str] +``` + +**Purpose**: Extract specific key from list of dicts + +**Example**: +```python +fields = [ + {'name': 'method', 'type': 'keyword'}, + {'name': 'status', 'type': 'long'} +] + +names = list_extract_keys(fields, 'name') +# ['method', 'status'] + +types = list_extract_keys(fields, 'type') +# ['keyword', 'long'] +``` + +### Field Helpers + +#### is_intermediate() +```python +def is_intermediate(field: FieldEntry) -> bool +``` + +**Purpose**: Check if field is structural (not a data field) + +**Example**: +```python +# http.request is just structure +request_field = { + 'field_details': {'intermediate': True, 'name': 'request'} +} +is_intermediate(request_field) # True + +# http.request.method is actual field +method_field = { + 'field_details': {'name': 'method', 'type': 'keyword'} +} +is_intermediate(method_field) # False +``` + +#### remove_top_level_reusable_false() +```python +def remove_top_level_reusable_false(ecs_nested: Dict[str, FieldNestedEntry]) -> Dict[str, FieldNestedEntry] +``` + +**Purpose**: Filter out non-root fieldsets + +**Example**: +```python +nested = { + 'http': {'reusable': {'top_level': True}}, + 'geo': {'reusable': {'top_level': False}}, # Only for nesting + 'user': {} # No reusable = included by default +} + +filtered = remove_top_level_reusable_false(nested) +# Contains: http, user +# Excludes: geo (can only be used as client.geo, source.geo, etc.) 
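+
+# Illustrative check of the behavior described in the comments above
+# (a hypothetical assertion; assumes the function returns a dict keyed
+# by fieldset name, as the example input suggests):
+assert set(filtered) == {'http', 'user'}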
+``` + +### Warning Helper + +#### strict_warning() +```python +def strict_warning(msg: str) -> None +``` + +**Purpose**: Issue warning that becomes error in strict mode + +**Example**: +```python +if 'description' not in field: + strict_warning(f"Field '{field['name']}' is missing description") + # Normal mode: Warning + # Strict mode (--strict flag): Exception +``` + +## Common Patterns + +### Sorting Fieldsets for Output + +```python +from generators import ecs_helpers + +# Sort by group, then name +fieldsets = ecs_helpers.dict_sorted_by_keys(nested, ['group', 'name']) + +# Generate output in consistent order +for fieldset in fieldsets: + generate_documentation(fieldset) +``` + +### Loading Schemas from Git + +```python +from generators import ecs_helpers + +# Load from specific version +tree = ecs_helpers.get_tree_by_ref('v8.10.0') + +# Check if file exists before loading +if ecs_helpers.path_exists_in_git_tree(tree, 'schemas/http.yml'): + content = tree['schemas']['http.yml'].data_stream.read().decode('utf-8') + schema = yaml.safe_load(content) +``` + +### Building Type-Specific Mappings + +```python +from generators import ecs_helpers + +def build_mapping(field): + mapping = {'type': field['type']} + + if field['type'] == 'keyword': + ecs_helpers.dict_copy_existing_keys( + field, mapping, + ['ignore_above', 'normalizer'] + ) + elif field['type'] == 'text': + ecs_helpers.dict_copy_existing_keys( + field, mapping, + ['norms', 'analyzer'] + ) + + return mapping +``` + +### Safe Directory Creation + +```python +from generators import ecs_helpers +from os.path import join + +def save_output(content, out_dir): + # Ensure directory exists + full_dir = join(out_dir, 'elasticsearch', 'composable', 'component') + ecs_helpers.make_dirs(full_dir) + + # Now safe to write files + with open(join(full_dir, 'template.json'), 'w') as f: + f.write(content) +``` + +### Filtering with Subsets + +```python +from generators import ecs_helpers + +# Define what to include +subset = { + 'http': {'fields': '*'}, # All HTTP fields + 'user': {'fields': { # Selected user fields + 'name': {}, + 'email': {}, + 'roles': {'fields': '*'} # Nested subset + }}, + 'event': {'fields': { # Core event fields + 'kind': {}, + 'category': {}, + 'type': {} + }} +} + +# Apply subset filter +filtered_fields = ecs_helpers.fields_subset(subset, all_fields) +``` + +## Design Principles + +### 1. Single Responsibility + +Each function does one thing well: +- `dict_sorted_by_keys()` - only sorts +- `make_dirs()` - only creates directories +- `is_yaml()` - only checks extensions + +### 2. No Side Effects (Except I/O) + +Most functions don't modify their inputs: +```python +# Returns new dict, doesn't modify 'a' or 'b' +merged = safe_merge_dicts(a, b) + +# Exception: Functions explicitly modifying in place +dict_copy_existing_keys(source, dest, keys) # Modifies dest +``` + +### 3. Type Safety + +All functions have type hints for clarity: +```python +def dict_sorted_by_keys( + dct: FieldNestedEntry, + sort_keys: List[str] +) -> List[FieldNestedEntry]: + ... +``` + +### 4. Consistent Error Handling + +- Use exceptions for errors (ValueError, OSError) +- Return None/empty for "not found" cases +- Print informative error messages + +### 5. 
Composability + +Functions work together: +```python +# Chain operations +files = ecs_helpers.glob_yaml_files(['schemas/']) +for file in sorted(files): + data = ecs_helpers.yaml_load(file) + ecs_helpers.dict_clean_string_values(data) + process(data) +``` + +## Testing Strategies + +### Unit Testing Helpers + +```python +import pytest +from generators import ecs_helpers + +def test_list_subtract(): + result = ecs_helpers.list_subtract([1, 2, 3, 4], [2, 4]) + assert result == [1, 3] + +def test_safe_merge_dicts(): + a = {'x': 1} + b = {'y': 2} + result = ecs_helpers.safe_merge_dicts(a, b) + assert result == {'x': 1, 'y': 2} + + # Test duplicate detection + c = {'x': 99} + with pytest.raises(ValueError, match='Duplicate key'): + ecs_helpers.safe_merge_dicts(a, c) + +def test_is_yaml(): + assert ecs_helpers.is_yaml('file.yml') + assert ecs_helpers.is_yaml('file.yaml') + assert not ecs_helpers.is_yaml('file.json') +``` + +### Integration Testing + +```python +import tempfile +import os +from generators import ecs_helpers + +def test_yaml_round_trip(): + data = {'name': 'test', 'fields': ['a', 'b']} + + with tempfile.NamedTemporaryFile(mode='w', suffix='.yml', delete=False) as f: + filename = f.name + + try: + # Write + ecs_helpers.yaml_dump(filename, data) + + # Read + loaded = ecs_helpers.yaml_load(filename) + + # Verify + assert loaded == data + finally: + os.unlink(filename) +``` + +## Troubleshooting + +### Common Issues + +#### OrderedDict not maintaining order + +**Symptom**: Keys appear in wrong order after processing + +**Solution**: Ensure using `OrderedDict` explicitly: +```python +from collections import OrderedDict + +# Correct +d = OrderedDict([('name', 'x'), ('type', 'keyword')]) + +# Won't preserve order in Python < 3.7 +d = {'name': 'x', 'type': 'keyword'} +``` + +#### Duplicate key errors in safe_merge_dicts + +**Symptom**: ValueError when merging schemas + +**Solution**: Check for unintended duplicates: +```python +# Find duplicates before merging +common_keys = set(a.keys()) & set(b.keys()) +if common_keys: + print(f"Duplicate keys: {common_keys}") + # Decide: use a's values, b's values, or merge differently +``` + +#### glob_yaml_files returns empty list + +**Symptom**: No files found when expected + +**Debug**: +```python +import glob +import os + +path = 'schemas/*.yml' +print(f"Looking for: {path}") +print(f"Current dir: {os.getcwd()}") +print(f"Files found: {glob.glob(path)}") + +# Check if path is correct relative to current directory +``` + +#### YAML unicode errors + +**Symptom**: UnicodeDecodeError when loading YAML + +**Solution**: Ensure UTF-8 encoding: +```python +# In yaml_load: +with open(filename, encoding='utf-8') as f: + return yaml.safe_load(f.read()) +``` + +## Performance Tips + +### For Large Field Sets + +```python +# Bad: Creates new list each iteration +for fieldset in fields: + sorted_fields = dict_sorted_by_keys(fieldset['fields'], 'name') + process(sorted_fields) + +# Good: Sort once if order doesn't change +sorted_fieldsets = dict_sorted_by_keys(fields, ['group', 'name']) +for fieldset in sorted_fieldsets: + process(fieldset) +``` + +### For File Operations + +```python +# Bad: Loading same file multiple times +for operation in operations: + data = yaml_load('config.yml') + operation(data) + +# Good: Load once +config = yaml_load('config.yml') +for operation in operations: + operation(config) +``` + +### For Merging Many Dicts + +```python +# Bad: Nested safe_merge_dicts (deep copying repeatedly) +result = safe_merge_dicts(a, b) +result = 
safe_merge_dicts(result, c) +result = safe_merge_dicts(result, d) + +# Better: Merge all at once if possible +from functools import reduce +dicts = [a, b, c, d] +result = reduce(safe_merge_dicts, dicts) +``` + +## Related Files + +- `scripts/generator.py` - Main script using these helpers +- `scripts/generators/*.py` - All generators use these utilities +- `scripts/schema/*.py` - Schema processors use these utilities +- `scripts/ecs_types/schema_fields.py` - Type definitions used by helpers + +## References + +- [Python OrderedDict Documentation](https://docs.python.org/3/library/collections.html#collections.OrderedDict) +- [PyYAML Documentation](https://pyyaml.org/wiki/PyYAMLDocumentation) +- [GitPython Documentation](https://gitpython.readthedocs.io/) +- [Python glob Module](https://docs.python.org/3/library/glob.html) + diff --git a/scripts/docs/es-template.md b/scripts/docs/es-template.md new file mode 100644 index 0000000000..0ea6d8a94e --- /dev/null +++ b/scripts/docs/es-template.md @@ -0,0 +1,626 @@ +# Elasticsearch Template Generator + +## Overview + +The Elasticsearch Template Generator (`generators/es_template.py`) converts ECS field schemas into Elasticsearch index templates. These templates define the mapping (field types and properties) for indices that will store ECS-structured data. + +### Purpose + +This generator bridges the gap between ECS schema definitions and Elasticsearch's native mapping format, producing ready-to-install JSON templates that: + +1. **Define field mappings** - Specify types, parameters, and multi-fields +2. **Configure index settings** - Set codec, field limits, refresh intervals +3. **Support two template formats**: + - **Composable** (modern): Modular component templates + - **Legacy** (deprecated): Single monolithic template + +The generated templates can be directly installed into Elasticsearch using the `_index_template` or `_template` APIs. + +## Architecture + +### High-Level Flow + +``` +┌─────────────────────────────────────────────────────────────────┐ +│ generator.py (main) │ +│ │ +│ Load → Clean → Finalize → Generate Intermediate Files │ +└────────────────────────────┬────────────────────────────────────┘ + │ + ▼ +┌─────────────────────────────────────────────────────────────────┐ +│ es_template.generate() / generate_legacy() │ +│ │ +│ Input: nested or flat fieldsets + version + settings │ +└────────────────────────────┬────────────────────────────────────┘ + │ + ┌──────────────────┴──────────────────┐ + │ │ + ▼ ▼ +┌──────────────────────────┐ ┌──────────────────────────┐ +│ Composable Templates │ │ Legacy Template │ +│ │ │ │ +│ For each fieldset: │ │ All fields in one: │ +│ 1. Build nested props │ │ 1. Build nested props │ +│ 2. Convert fields │ │ 2. Convert fields │ +│ 3. Save component │ │ 3. Save single template │ +│ │ │ │ +│ Save main template │ └──────────────────────────┘ +└──────────────────────────┘ + │ + ▼ +┌─────────────────────────────────────────────────────────────────┐ +│ Elasticsearch JSON Templates │ +│ │ +│ Composable: │ +│ - generated/elasticsearch/composable/component/base.json │ +│ - generated/elasticsearch/composable/component/agent.json │ +│ - generated/elasticsearch/composable/component/*.json │ +│ - generated/elasticsearch/composable/template.json │ +│ │ +│ Legacy: │ +│ - generated/elasticsearch/legacy/template.json │ +└─────────────────────────────────────────────────────────────────┘ +``` + +### Key Components + +#### 1. 
Composable Template Generation + +**Entry Point**: `generate(ecs_nested, ecs_version, out_dir, ...)` + +**Process**: +1. For each fieldset: + - Convert flat field names to nested `properties` structure + - Transform ECS field defs to Elasticsearch mappings + - Save as individual component template +2. Generate main template: + - Build component name list + - Create template that composes all components + - Add index patterns, priority, settings + +**Output Files**: +- `component/base.json`, `component/agent.json`, etc. (one per fieldset) +- `template.json` (main composable template) + +#### 2. Legacy Template Generation + +**Entry Point**: `generate_legacy(ecs_flat, ecs_version, out_dir, ...)` + +**Process**: +1. Iterate all fields in sorted order +2. Convert flat field names to nested properties structure +3. Build single monolithic mappings section +4. Generate template with all mappings included + +**Output File**: +- `legacy/template.json` + +#### 3. Field Mapping Conversion + +**Function**: `entry_for(field)` + +Converts ECS field definitions to Elasticsearch mapping format: + +| ECS Type | ES Mapping | Special Parameters | +|----------|------------|-------------------| +| keyword | keyword | ignore_above, synthetic_source_keep | +| text | text | norms | +| long/integer/short/byte | long/integer/short/byte | - | +| float/double/half_float | float/double/half_float | - | +| scaled_float | scaled_float | scaling_factor | +| boolean | boolean | - | +| date | date | - | +| ip | ip | - | +| geo_point | geo_point | - | +| object | object | enabled (if false) | +| nested | nested | enabled (if false) | +| flattened | flattened | ignore_above | +| constant_keyword | constant_keyword | value | +| alias | alias | path | + +**Multi-fields**: Handled via `multi_fields` array in ECS definition + +**Custom parameters**: Merged from `parameters` dict in field definition + +## Template Formats + +### Composable Template (Modern) + +Recommended for Elasticsearch 7.8+. Provides modularity and flexibility. + +**Component Template** (one per fieldset): +```json +{ + "template": { + "mappings": { + "properties": { + "http": { + "properties": { + "request": { + "properties": { + "method": { + "type": "keyword", + "ignore_above": 1024 + } + } + } + } + } + } + } + }, + "_meta": { + "ecs_version": "8.11.0", + "documentation": "https://www.elastic.co/guide/en/ecs/current/ecs-http.html" + } +} +``` + +**Main Template**: +```json +{ + "index_patterns": ["try-ecs-*"], + "composed_of": [ + "ecs_8.11.0_base", + "ecs_8.11.0_agent", + "ecs_8.11.0_http", + "..." + ], + "priority": 1, + "template": { + "settings": { + "index": { + "codec": "best_compression", + "mapping": { + "total_fields": { + "limit": 2000 + } + } + } + }, + "mappings": { + "date_detection": false, + "dynamic_templates": [...] + } + }, + "_meta": { + "ecs_version": "8.11.0", + "description": "Sample composable template that includes all ECS fields" + } +} +``` + +**Installation**: +```bash +# Install component templates +for file in generated/elasticsearch/composable/component/*.json; do + name=$(basename "$file" .json) + curl -X PUT "localhost:9200/_component_template/ecs_8.11.0_$name" \ + -H 'Content-Type: application/json' -d @"$file" +done + +# Install main template +curl -X PUT "localhost:9200/_index_template/ecs" \ + -H 'Content-Type: application/json' \ + -d @generated/elasticsearch/composable/template.json +``` + +### Legacy Template (Deprecated) + +For Elasticsearch < 7.8 or backwards compatibility. 
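+
+If you're unsure which format your cluster supports, a quick version check settles it. A minimal sketch using only the standard Elasticsearch root endpoint and the Python standard library (host and port are illustrative):
+
+```python
+import json
+import urllib.request
+
+# Composable templates require Elasticsearch 7.8+; older clusters need legacy.
+with urllib.request.urlopen('http://localhost:9200') as resp:
+    print(json.load(resp)['version']['number'])
+```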
+ +**Structure**: +```json +{ + "index_patterns": ["try-ecs-*"], + "order": 1, + "settings": { + "index": { + "mapping": { + "total_fields": { + "limit": 10000 + } + }, + "refresh_interval": "5s" + } + }, + "mappings": { + "_meta": { + "version": "8.11.0" + }, + "date_detection": false, + "dynamic_templates": [...], + "properties": { + "agent": {...}, + "http": {...}, + "...": "all fields in one place" + } + } +} +``` + +**Installation**: +```bash +curl -X PUT "localhost:9200/_template/ecs" \ + -H 'Content-Type: application/json' \ + -d @generated/elasticsearch/legacy/template.json +``` + +## Usage Examples + +### Running the Generator + +Typically invoked through the main generator: + +```bash +# From repository root +make clean +make SEMCONV_VERSION=v1.24.0 + +# Generates both composable and legacy templates +``` + +### Programmatic Usage + +```python +from generators.es_template import generate, generate_legacy +from generators.intermediate_files import generate as gen_intermediate + +# Generate intermediate files +nested, flat = gen_intermediate(fields, 'generated/ecs', True) + +# Generate composable templates +generate( + ecs_nested=nested, + ecs_version='8.11.0', + out_dir='generated', + mapping_settings_file=None, # Use defaults + template_settings_file=None # Use defaults +) + +# Generate legacy template +generate_legacy( + ecs_flat=flat, + ecs_version='8.11.0', + out_dir='generated', + mapping_settings_file=None, + template_settings_file=None +) +``` + +### Custom Settings + +**Custom Mapping Settings** (`mapping_settings.json`): +```json +{ + "date_detection": true, + "numeric_detection": false, + "dynamic_templates": [ + { + "strings_as_text": { + "match_mapping_type": "string", + "mapping": { + "type": "text", + "fields": { + "keyword": { + "type": "keyword", + "ignore_above": 256 + } + } + } + } + } + ] +} +``` + +**Custom Template Settings** (`template_settings.json`): +```json +{ + "index_patterns": ["logs-*", "metrics-*"], + "priority": 100, + "template": { + "settings": { + "index": { + "number_of_shards": 1, + "number_of_replicas": 1, + "codec": "best_compression", + "mapping": { + "total_fields": { + "limit": 5000 + } + } + } + } + } +} +``` + +**Usage**: +```python +generate( + ecs_nested=nested, + ecs_version='8.11.0', + out_dir='generated', + mapping_settings_file='mapping_settings.json', + template_settings_file='template_settings.json' +) +``` + +## Making Changes + +### Adding Support for New Field Type + +To add a new Elasticsearch field type: + +1. **Update entry_for() function**: +```python +def entry_for(field: Field) -> Dict: + field_entry: Dict = {'type': field['type']} + try: + # ... existing type handling ... + + elif field['type'] == 'new_type': + ecs_helpers.dict_copy_existing_keys( + field, field_entry, + ['param1', 'param2'] # Type-specific parameters + ) + + # ... rest of function ... +``` + +2. **Update schema definitions** to use new type +3. **Test** with sample field +4. 
**Document** in this guide + +### Customizing Component Template Naming + +To change the naming convention for component templates: + +```python +def component_name_convention( + ecs_version: str, + ecs_nested: Dict[str, FieldNestedEntry] +) -> List[str]: + version: str = ecs_version.replace('+', '-') + names: List[str] = [] + for (fieldset_name, fieldset) in ecs_helpers.remove_top_level_reusable_false(ecs_nested).items(): + # Change naming pattern here + names.append("my_prefix_{}_{}".format(version, fieldset_name)) + return names +``` + +**Note**: If you change component names, update any deployment scripts that reference them. + +### Adding Custom Metadata + +To add custom metadata to templates: + +**For component templates**: +```python +def save_component_template(...): + # ... existing code ... + template['_meta']['custom_field'] = 'custom_value' + template['_meta']['team'] = 'security' + # ... save ... +``` + +**For main template**: +```python +def finalize_template(...): + # ... existing code ... + if not is_legacy: + template['_meta']['custom_info'] = {...} +``` + +### Modifying Default Settings + +To change default template settings: + +```python +def default_template_settings(ecs_version: str) -> Dict: + return { + "index_patterns": ["your-pattern-*"], # Change pattern + "priority": 500, # Higher priority + "template": { + "settings": { + "index": { + "number_of_shards": 1, # Add shard config + "codec": "default", # Change codec + "mapping": { + "total_fields": { + "limit": 5000 # Increase limit + } + } + } + }, + } + } +``` + +## Troubleshooting + +### Common Issues + +#### "Total fields limit exceeded" + +**Symptom**: Error when installing template or indexing documents + +``` +illegal_argument_exception: Limit of total fields [1000] has been exceeded +``` + +**Cause**: Elasticsearch default limit is 1000 fields, ECS has 800+ + +**Solutions**: +1. Increase limit in template settings: + ```json + { + "settings": { + "index": { + "mapping": { + "total_fields": { + "limit": 2000 + } + } + } + } + } + ``` + +2. Use composable templates (smaller field count per component) + +3. Use selective field sets (only fields you need) + +#### Component template not found + +**Symptom**: Error installing main composable template + +``` +resource_not_found_exception: component template [ecs_8.11.0_http] not found +``` + +**Cause**: Component templates must be installed before main template + +**Solution**: Install components first, then main template: +```bash +# Install all components +for file in generated/elasticsearch/composable/component/*.json; do + # ... install component +done + +# Then install main template +curl -X PUT "localhost:9200/_index_template/ecs" ... +``` + +#### Mapping conflicts + +**Symptom**: Cannot update mapping with different type + +``` +illegal_argument_exception: mapper [field] cannot be changed from type [keyword] to [text] +``` + +**Cause**: Trying to change existing field type + +**Solutions**: +1. Delete and recreate index: + ```bash + curl -X DELETE "localhost:9200/my-index" + # Recreate with new mapping + ``` + +2. Reindex to new index with updated mapping: + ```bash + curl -X POST "localhost:9200/_reindex" -d '{ + "source": {"index": "old-index"}, + "dest": {"index": "new-index"} + }' + ``` + +3. 
Use index aliases to transparently switch + +#### JSON formatting issues + +**Symptom**: Template JSON won't load + +**Check**: +- Valid JSON syntax (no trailing commas) +- Proper escaping of special characters +- Matching brackets and braces + +**Debug**: +```python +import json +with open('template.json') as f: + try: + template = json.load(f) + print("Valid JSON") + except json.JSONDecodeError as e: + print(f"Invalid JSON: {e}") +``` + +### Debugging Tips + +#### Validate generated templates + +```bash +# Check JSON syntax +jq . generated/elasticsearch/composable/template.json + +# Count fields in component +jq '.template.mappings.properties | .. | .type? | select(. != null)' \ + generated/elasticsearch/composable/component/http.json | wc -l +``` + +#### Compare templates + +```bash +# Compare two versions +diff -u \ + old_version/elasticsearch/composable/template.json \ + new_version/elasticsearch/composable/template.json +``` + +#### Test template installation + +```bash +# Start test Elasticsearch +docker run -p 9200:9200 -e "discovery.type=single-node" \ + docker.elastic.co/elasticsearch/elasticsearch:8.11.0 + +# Install template +curl -X PUT "localhost:9200/_index_template/ecs_test" \ + -H 'Content-Type: application/json' \ + -d @generated/elasticsearch/composable/template.json + +# Verify installation +curl "localhost:9200/_index_template/ecs_test" + +# Test with sample document +curl -X POST "localhost:9200/try-ecs-test/_doc" \ + -H 'Content-Type: application/json' -d '{ + "agent": {"name": "test"}, + "http": {"request": {"method": "GET"}} +}' + +# Check mapping applied +curl "localhost:9200/try-ecs-test/_mapping" +``` + +#### Inspect field mappings + +```python +import json + +with open('generated/elasticsearch/composable/component/http.json') as f: + template = json.load(f) + +def print_fields(props, prefix=''): + for name, field in props.items(): + if 'type' in field: + print(f"{prefix}{name}: {field['type']}") + if 'properties' in field: + print_fields(field['properties'], prefix + name + '.') + +print_fields(template['template']['mappings']['properties']) +``` + +## Related Files + +- `scripts/generator.py` - Main entry point +- `scripts/generators/intermediate_files.py` - Produces nested/flat structures +- `scripts/generators/ecs_helpers.py` - Utility functions +- `scripts/ecs_types/schema_fields.py` - Type definitions +- `schemas/*.yml` - Source ECS schemas +- `generated/elasticsearch/composable/` - Composable template output +- `generated/elasticsearch/legacy/` - Legacy template output + +## References + +- [Elasticsearch Composable Templates](https://www.elastic.co/guide/en/elasticsearch/reference/current/index-templates.html) +- [Elasticsearch Mapping Types](https://www.elastic.co/guide/en/elasticsearch/reference/current/mapping-types.html) +- [ECS Field Reference](https://www.elastic.co/guide/en/ecs/current/ecs-field-reference.html) +- [Index Template Best Practices](https://www.elastic.co/guide/en/elasticsearch/reference/current/index-templates.html#avoid-index-pattern-collisions) + diff --git a/scripts/docs/intermediate-files.md b/scripts/docs/intermediate-files.md new file mode 100644 index 0000000000..c8778b7863 --- /dev/null +++ b/scripts/docs/intermediate-files.md @@ -0,0 +1,563 @@ +# Intermediate File Generator + +## Overview + +The Intermediate File Generator (`generators/intermediate_files.py`) is a critical component in the ECS build pipeline. 
It transforms processed schemas into standardized intermediate representations that serve as the foundation for all downstream artifact generation. + +### Purpose + +This generator bridges the gap between schema processing and artifact generation by creating two normalized formats: + +1. **Flat Format** (`ecs_flat.yml`) - Single-level field dictionary +2. **Nested Format** (`ecs_nested.yml`) - Hierarchical fieldset organization + +These intermediate files provide: +- **Stable Interface**: Consistent data structure for all generators +- **Separation of Concerns**: Schema processing logic separate from artifact generation +- **Debugging Aid**: Human-readable checkpoints in the pipeline +- **Multiple Consumers**: CSV, Elasticsearch templates, Beats, documentation + +## Architecture + +### Pipeline Position + +``` +┌─────────────────────────────────────────────────────────────────┐ +│ Schema Processing │ +│ │ +│ 1. loader.py - Load YAML schemas from files │ +│ 2. cleaner.py - Normalize and validate │ +│ 3. finalizer.py - Apply transformations │ +│ 4. subset_filter.py - Optional filtering │ +│ 5. exclude_filter.py - Optional exclusions │ +└────────────────────────────┬────────────────────────────────────┘ + │ + ▼ +┌─────────────────────────────────────────────────────────────────┐ +│ intermediate_files.generate() │ +│ [THIS MODULE] │ +│ │ +│ Input: Dict[str, FieldEntry] (processed schemas) │ +│ │ +│ ┌───────────────────────┐ ┌───────────────────────┐ │ +│ │ generate_flat_fields()│ │generate_nested_fields()│ │ +│ │ │ │ │ │ +│ │ • Filter non-root │ │ • Keep all fieldsets │ │ +│ │ • Flatten hierarchy │ │ • Group by fieldset │ │ +│ │ • Index by flat_name │ │ • Preserve metadata │ │ +│ └───────────┬───────────┘ └──────────┬─────────────┘ │ +│ │ │ │ +│ ▼ ▼ │ +│ ecs_flat.yml (850 fields) ecs_nested.yml (45 fieldsets) │ +└────────────────────────────┬────────────────────────────────────┘ + │ + ▼ +┌─────────────────────────────────────────────────────────────────┐ +│ Artifact Generators │ +│ │ +│ • CSV Generator - Uses ecs_flat.yml │ +│ • Elasticsearch - Uses ecs_nested.yml │ +│ • Beats Generator - Uses ecs_nested.yml │ +│ • Markdown Generator - Uses ecs_nested.yml │ +└─────────────────────────────────────────────────────────────────┘ +``` + +### Data Flow + +``` +Input: Processed Schemas + ↓ +{ + 'http': { + 'field_details': {...}, + 'schema_details': {...}, + 'fields': { + 'request': { + 'fields': { + 'method': { + 'field_details': { + 'flat_name': 'http.request.method', + 'type': 'keyword', + ... + } + } + } + } + } + } +} + ↓ + ├─── generate_flat_fields() ───→ Flat Format + │ { + │ 'http.request.method': { + │ 'name': 'method', + │ 'type': 'keyword', + │ ... 
+ │ } + │ } + │ + └─── generate_nested_fields() ──→ Nested Format + { + 'http': { + 'name': 'http', + 'title': 'HTTP', + 'fields': { + 'http.request.method': {...} + } + } + } +``` + +## File Formats + +### Flat Format (ecs_flat.yml) + +**Purpose**: Quick lookup and iteration over all fields + +**Structure**: +```yaml +# Single-level dictionary, fields indexed by full dotted name +http.request.method: + name: method + flat_name: http.request.method + type: keyword + description: HTTP request method + example: GET + level: extended + normalize: + - array + otel: + - relation: match + stability: stable + +http.response.status_code: + name: status_code + flat_name: http.response.status_code + type: long + description: HTTP response status code + example: 404 + level: extended +``` + +**Characteristics**: +- **Keys**: Full dotted field names (e.g., `http.request.method`) +- **Values**: Complete field definitions +- **Excludes**: Non-root reusable fieldsets (top_level=false) +- **Excludes**: Intermediate structural fields +- **Count**: ~850 fields in standard ECS + +**Use Cases**: +- CSV generation (one row per field) +- Simple field lookups by name +- Validation scripts +- Field counting and statistics + +### Nested Format (ecs_nested.yml) + +**Purpose**: Preserve logical grouping and fieldset metadata + +**Structure**: +```yaml +# Top-level: fieldsets +http: + name: http + title: HTTP + group: 2 + description: Fields related to HTTP activity + type: group + reusable: + top_level: true + expected: + - client + - server + reused_here: + - full: http.request + short: request + schema_name: http.request + fields: + # Flat dictionary of all fields in this fieldset + http.request.method: + name: method + flat_name: http.request.method + type: keyword + description: HTTP request method + ... + http.response.status_code: + name: status_code + flat_name: http.response.status_code + type: long + ... + +user: + name: user + title: User + group: 2 + description: User fields + reusable: + top_level: true + expected: + - client + - destination + - server + - source + fields: + user.email: + name: email + ... +``` + +**Characteristics**: +- **Keys**: Fieldset names (e.g., `http`, `user`, `process`) +- **Values**: Fieldset metadata + fields dictionary +- **Includes**: All fieldsets (even top_level=false) +- **Fields**: Stored in nested `fields` dict (still flat, not hierarchical) +- **Count**: ~45 fieldsets in standard ECS + +**Use Cases**: +- Documentation generation (one page per fieldset) +- Elasticsearch templates (field grouping) +- Beats configuration +- Understanding field relationships + +## Key Concepts + +### Top-Level vs. Non-Root Reusable Fieldsets + +Some fieldsets are designed ONLY to be reused in specific locations: + +**Non-Root Reusable** (top_level=false): +```yaml +# geo fieldset - only valid under client.geo, source.geo, etc. 
+geo: + reusable: + top_level: false # Never appears as geo.* at root + expected: + - client.geo + - destination.geo + - source.geo +``` + +**Root Reusable** (top_level=true): +```yaml +# user fieldset - valid at root AND reused locations +user: + reusable: + top_level: true # Can appear as user.* at root + expected: + - client.user + - destination.user + - source.user +``` + +**Filtering Behavior**: +- **Flat format**: Excludes top_level=false fieldsets +- **Nested format**: Includes all fieldsets (consumers decide) + +### Intermediate Fields + +Some fields exist only for structural purposes: + +```yaml +# Intermediate field - creates hierarchy but isn't a real field +http.request: + intermediate: true # Not a field itself + fields: + method: {...} # Actual field: http.request.method + body: {...} # Actual field: http.request.body +``` + +These are excluded from intermediate files as they don't represent actual data. + +### Internal Attributes + +Attributes removed from final output: +- `node_name`: Internal tree traversal identifier +- `intermediate`: Flag for structural-only fields +- `dashed_name`: Alternative name format (not needed in output) + +## Usage Examples + +### Running the Generator + +Typically invoked through the main generator: + +```bash +# From repository root +make clean +make SEMCONV_VERSION=v1.24.0 + +# Or directly with Python +python scripts/generator.py --semconv-version v1.24.0 +``` + +### Programmatic Usage + +```python +from schema import loader, cleaner, finalizer +from generators.intermediate_files import generate + +# Process schemas +fields = loader.load_schemas() +cleaner.clean(fields) +finalizer.finalize(fields) + +# Generate intermediate files +nested, flat = generate( + fields=fields, + out_dir='generated/ecs', + default_dirs=True # Also save raw ecs.yml +) + +# Use the returned structures +print(f"Total fields: {len(flat)}") +print(f"Total fieldsets: {len(nested)}") + +# Access specific field +method_field = flat['http.request.method'] +print(f"Type: {method_field['type']}") + +# Access fieldset +http_fieldset = nested['http'] +print(f"Title: {http_fieldset['title']}") +print(f"Fields in HTTP: {len(http_fieldset['fields'])}") +``` + +### Reading Generated Files + +```python +import yaml + +# Load flat format +with open('generated/ecs/ecs_flat.yml') as f: + flat = yaml.safe_load(f) + +# Iterate all fields +for field_name, field_def in flat.items(): + print(f"{field_name}: {field_def['type']}") + +# Load nested format +with open('generated/ecs/ecs_nested.yml') as f: + nested = yaml.safe_load(f) + +# Process by fieldset +for fieldset_name, fieldset in nested.items(): + print(f"\n{fieldset['title']} ({fieldset_name})") + for field_name in fieldset['fields']: + print(f" - {field_name}") +``` + +## Making Changes + +### Adding New Field Attributes + +If you add a new attribute to field definitions: + +1. **Update schema files** (in `schemas/*.yml`) +2. **Update type definitions** (in `ecs_types/schema_fields.py`) +3. **No changes needed here** - attributes pass through automatically +4. 
**Update downstream consumers** if they need to use the new attribute + +Example: Adding a `sensitivity` attribute +```yaml +# In schema +- name: password + type: keyword + sensitivity: high # NEW attribute + +# Automatically appears in both formats: +# ecs_flat.yml +user.password: + name: password + type: keyword + sensitivity: high # Passed through + +# ecs_nested.yml +user: + fields: + user.password: + sensitivity: high # Passed through +``` + +### Filtering Additional Attributes + +To remove an attribute from intermediate files: + +```python +def remove_internal_attributes(field_details: Field) -> None: + """Remove internal-only attributes.""" + field_details.pop('node_name', None) + field_details.pop('intermediate', None) + field_details.pop('new_internal_attr', None) # Add this +``` + +### Changing Flat Format Filtering + +To change which fields appear in the flat format: + +```python +def generate_flat_fields(fields: Dict[str, FieldEntry]) -> Dict[str, Field]: + """Generate flat field representation.""" + filtered: Dict[str, FieldEntry] = remove_non_root_reusables(fields) + + # Add additional filtering + filtered = remove_deprecated_fields(filtered) # NEW + + flattened: Dict[str, Field] = {} + visitor.visit_fields_with_memo(filtered, accumulate_field, flattened) + return flattened +``` + +### Modifying Nested Format Structure + +To change fieldset-level attributes: + +```python +def generate_nested_fields(fields: Dict[str, FieldEntry]) -> Dict[str, FieldNestedEntry]: + """Generate nested fieldset representation.""" + nested: Dict[str, FieldNestedEntry] = {} + + for (name, details) in fields.items(): + fieldset_details = { + **copy.deepcopy(details['field_details']), + **copy.deepcopy(details['schema_details']) + } + + # Add custom processing + if 'beta' in fieldset_details: + fieldset_details['stability'] = 'beta' # NEW + + # ... rest of processing ... +``` + +## Troubleshooting + +### Common Issues + +#### Missing fields in flat format + +**Symptom**: Field appears in schema but not in ecs_flat.yml + +**Possible causes**: +1. Field is in a fieldset with `top_level: false` + - **Check**: Look at fieldset's `reusable.top_level` setting + - **Solution**: If field should be at root, set `top_level: true` + +2. Field is marked as `intermediate: true` + - **Check**: Look for `intermediate` attribute in schema + - **Solution**: Remove if field should be included + +3. Field is in a filtered subset + - **Check**: Are you using `--subset` or `--exclude` flags? + - **Solution**: Adjust filtering or run without filters + +#### Fieldset missing from nested format + +**Symptom**: Fieldset defined in schema but not in ecs_nested.yml + +**Unlikely**: The nested format includes all fieldsets by design. + +**Check**: +- Verify fieldset is properly defined in schema +- Check for schema validation errors earlier in pipeline +- Ensure schema file is in the loaded directory + +#### Unexpected attributes in output + +**Symptom**: Internal attributes appearing in intermediate files + +**Solution**: Add to `remove_internal_attributes()`: +```python +def remove_internal_attributes(field_details: Field) -> None: + field_details.pop('node_name', None) + field_details.pop('intermediate', None) + field_details.pop('unwanted_attr', None) # Add this +``` + +#### File size concerns + +**Symptom**: Intermediate YAML files are very large + +**Context**: This is normal - ecs_flat.yml with ~850 fields is ~150KB + +**Optimization options**: +1. Use YAML references for common values (complex) +2. 
Compress files for distribution (gzip)
+3. Consider JSON format (more compact)
+
+### Debugging Tips
+
+#### View raw processed schemas
+
+Enable debugging output:
+```python
+# In generator.py or your script
+nested, flat = generate(fields, out_dir, default_dirs=True)
+# This creates ecs.yml with raw processed schemas
+```
+
+#### Compare before/after
+
+```bash
+# Generate with current code
+python scripts/generator.py --semconv-version v1.24.0 --out /tmp/test1
+
+# Make changes
+
+# Generate again
+python scripts/generator.py --semconv-version v1.24.0 --out /tmp/test2
+
+# Compare
+diff /tmp/test1/ecs/ecs_flat.yml /tmp/test2/ecs/ecs_flat.yml
+```
+
+#### Field counts
+
+```python
+import yaml
+
+with open('generated/ecs/ecs_flat.yml') as f:
+    flat = yaml.safe_load(f)
+    print(f"Total fields: {len(flat)}")
+
+with open('generated/ecs/ecs_nested.yml') as f:
+    nested = yaml.safe_load(f)
+    total_fields = sum(len(fs['fields']) for fs in nested.values())
+    print(f"Total fields (from nested): {total_fields}")
+    print(f"Total fieldsets: {len(nested)}")
+```
+
+#### Field type distribution
+
+```python
+import yaml
+from collections import Counter
+
+with open('generated/ecs/ecs_flat.yml') as f:
+    flat = yaml.safe_load(f)
+
+types = Counter(field['type'] for field in flat.values())
+for field_type, count in types.most_common():
+    print(f"{field_type}: {count}")
+```
+
+## Related Files
+
+- `scripts/generator.py` - Main entry point, calls this generator
+- `scripts/schema/loader.py` - Loads raw schemas
+- `scripts/schema/cleaner.py` - Validates and normalizes
+- `scripts/schema/finalizer.py` - Applies transformations
+- `scripts/schema/visitor.py` - Field traversal utilities
+- `scripts/generators/csv_generator.py` - Consumes flat format
+- `scripts/generators/es_template.py` - Consumes nested format
+- `scripts/generators/markdown_fields.py` - Consumes nested format
+- `scripts/generators/beats.py` - Consumes nested format
+- `schemas/*.yml` - Source schema definitions
+- `generated/ecs/ecs_flat.yml` - Flat output
+- `generated/ecs/ecs_nested.yml` - Nested output
+
+## References
+
+- [ECS Schema Structure](../../USAGE.md)
+- [Visitor Pattern Documentation](../schema/visitor.py)
+- [ECS Type Definitions](../ecs_types/schema_fields.py)
+- [CSV Generator](csv-generator.md)
+- [Elasticsearch Template Generator](es-template.md)
+
diff --git a/scripts/docs/markdown-generator.md b/scripts/docs/markdown-generator.md
new file mode 100644
index 0000000000..920cb9bc78
--- /dev/null
+++ b/scripts/docs/markdown-generator.md
@@ -0,0 +1,513 @@
+# Markdown Documentation Generator
+
+## Overview
+
+The Markdown Generator (`generators/markdown_fields.py`) transforms ECS field schemas into human-readable documentation published on the Elastic documentation site. It's the final step in the documentation pipeline, converting structured YAML field definitions into comprehensive markdown pages.
+
+### Purpose
+
+This generator creates the official ECS reference documentation, including:
+
+1. **Field Reference Pages** - Complete catalog of all fields
+2. **Fieldset Pages** - Detailed documentation for each fieldset (e.g., HTTP, User, Process)
+3. **OTel Alignment Documentation** - Showing convergence with OpenTelemetry
+4. **Index and Navigation** - Entry points and cross-references
+
+The output is human-friendly markdown that integrates with Elastic's documentation infrastructure. 
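+
+Like the other generators, it is normally driven by `generator.py`, but it can also be invoked directly. Below is a minimal sketch using the entry point documented in the next section; the inputs (`nested`, `docs_only_nested`, `otel_generator`) come from the earlier pipeline stages, and the version strings are illustrative:
+
+```python
+from generators.markdown_fields import generate
+
+# nested / docs_only_nested / otel_generator are produced by the
+# schema-processing pipeline (see Architecture below).
+generate(
+    nested,
+    docs_only_nested,
+    '8.11.0',         # ecs_version
+    'v1.24.0',        # semconv_version
+    otel_generator,
+    'docs/reference'  # out_dir
+)
+```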
+ +## Architecture + +### High-Level Flow + +``` +┌─────────────────────────────────────────────────────────────────┐ +│ generator.py (main) │ +│ │ +│ 1. Load schemas │ +│ 2. Clean and finalize │ +│ 3. Generate intermediate files │ +└────────────────────────────┬────────────────────────────────────┘ + │ + ▼ +┌─────────────────────────────────────────────────────────────────┐ +│ markdown_fields.generate() - Entry Point │ +│ │ +│ Input: Nested fieldsets + OTel generator + version info │ +└────────────────────────────┬────────────────────────────────────┘ + │ + ▼ +┌─────────────────────────────────────────────────────────────────┐ +│ Page Generation Functions │ +│ │ +│ ├─ page_index() → index.md │ +│ ├─ page_field_reference() → ecs-field-reference.md │ +│ ├─ page_otel_alignment_overview() → ecs-otel-alignment-*.md │ +│ ├─ page_otel_alignment_details() → ecs-otel-alignment-*.md │ +│ └─ page_fieldset() [for each] → ecs-{name}.md │ +│ │ +└────────────────────────────┬────────────────────────────────────┘ + │ + ▼ +┌─────────────────────────────────────────────────────────────────┐ +│ Jinja2 Template Rendering │ +│ │ +│ Templates (scripts/templates/): │ +│ - index.j2 │ +│ - fieldset.j2 │ +│ - ecs_field_reference.j2 │ +│ - otel_alignment_overview.j2 │ +│ - otel_alignment_details.j2 │ +│ - field_values.j2 │ +│ - macros.j2 (shared template macros) │ +└────────────────────────────┬────────────────────────────────────┘ + │ + ▼ +┌─────────────────────────────────────────────────────────────────┐ +│ Markdown Files Output │ +│ │ +│ Written to: docs/reference/ │ +│ - index.md │ +│ - ecs-field-reference.md │ +│ - ecs-otel-alignment-overview.md │ +│ - ecs-otel-alignment-details.md │ +│ - ecs-http.md, ecs-user.md, ecs-process.md, ... │ +│ (one per fieldset) │ +└─────────────────────────────────────────────────────────────────┘ +``` + +### Key Components + +#### 1. Generate Function + +**Entry Point**: `generate(nested, docs_only_nested, ecs_version, semconv_version, otel_generator, out_dir)` + +Orchestrates the entire markdown generation process: +- Creates output directory +- Generates each page type +- Saves rendered markdown to files + +**Called by**: `generator.py` main script after all schema processing is complete + +#### 2. Helper Functions + +These prepare data for template consumption: + +| Function | Purpose | +|----------|---------| +| `render_fieldset_reuse_text()` | Extract expected nesting locations | +| `render_nestings_reuse_section()` | Build reuse section data | +| `extract_allowed_values_key_names()` | Get allowed value names | +| `sort_fields()` | Sort and enrich field lists | +| `check_for_usage_doc()` | Check for usage doc existence | + +#### 3. Page Generation Functions + +Each decorated with `@templated()` for automatic rendering: + +| Function | Template | Output File | Purpose | +|----------|----------|-------------|---------| +| `page_index()` | index.j2 | index.md | Main landing page | +| `page_field_reference()` | ecs_field_reference.j2 | ecs-field-reference.md | All fields catalog | +| `page_fieldset()` | fieldset.j2 | ecs-{name}.md | Individual fieldset docs | +| `page_otel_alignment_overview()` | otel_alignment_overview.j2 | ecs-otel-alignment-overview.md | Alignment statistics | +| `page_otel_alignment_details()` | otel_alignment_details.j2 | ecs-otel-alignment-details.md | Field mappings | +| `page_field_values()` | field_values.j2 | (not saved directly) | Event categorization fields | + +#### 4. 
Template System + +**Framework**: Jinja2 + +**Configuration**: +```python +template_env = jinja2.Environment( + loader=FileSystemLoader('scripts/templates/'), + keep_trailing_newline=True, # Preserve trailing newlines + trim_blocks=True, # Remove first newline after block + lstrip_blocks=False # Don't strip leading whitespace +) +``` + +**Template Location**: `scripts/templates/` + +**Shared Macros**: `macros.j2` contains reusable template components + +## Template Development + +### Adding a New Page Type + +To add a new documentation page: + +1. **Create the template** in `scripts/templates/`: + ```jinja2 + {# my_new_page.j2 #} + # {{ title }} + + Version: {{ version }} + + {% for item in items %} + ## {{ item.name }} + {{ item.description }} + {% endfor %} + ``` + +2. **Create page function** in `markdown_fields.py`: + ```python + @templated('my_new_page.j2') + def page_my_new_page(items, version): + """Generate my new documentation page. + + Args: + items: List of items to document + version: Version string + + Returns: + Rendered markdown content + """ + return dict( + title="My New Page", + items=items, + version=version + ) + ``` + +3. **Call in generate()** function: + ```python + def generate(nested, docs_only_nested, ecs_version, semconv_version, otel_generator, out_dir): + # ... existing code ... + + save_markdown( + path.join(out_dir, 'my-new-page.md'), + page_my_new_page(some_items, ecs_version) + ) + ``` + +### Template Best Practices + +1. **Use macros for repeated patterns**: + ```jinja2 + {# In macros.j2 #} + {% macro field_row(field) -%} + | {{ field.name }} | {{ field.type }} | {{ field.description }} | + {%- endmacro %} + + {# In your template #} + {% from 'macros.j2' import field_row %} + {% for field in fields %} + {{ field_row(field) }} + {% endfor %} + ``` + +2. **Handle missing data gracefully**: + ```jinja2 + {% if field.example %} + Example: `{{ field.example }}` + {% endif %} + ``` + +3. **Keep formatting consistent**: + - Use consistent heading levels + - Follow markdown best practices + - Include blank lines between sections + +4. **Comment complex logic**: + ```jinja2 + {# Sort fields by type, then name #} + {% for field in fields|sort(attribute='type,name') %} + ... + {% endfor %} + ``` + +## Data Structures + +### Nested Fieldsets Structure + +```python +{ + 'http': { + 'name': 'http', + 'title': 'HTTP', + 'group': 2, + 'description': 'HTTP request and response fields', + 'fields': { + 'http.request.method': { + 'name': 'method', + 'flat_name': 'http.request.method', + 'type': 'keyword', + 'description': 'HTTP request method', + 'example': 'GET', + 'level': 'extended', + 'otel': [{'relation': 'match', 'stability': 'stable'}], + 'allowed_values': [...] # Optional + }, + # ... more fields ... + }, + 'reusable': { # If fieldset is reusable + 'expected': [ + {'full': 'client.http', 'short': 'client.http'}, + {'full': 'server.http', 'short': 'server.http'} + ] + }, + 'reused_here': [ # Fieldsets nested here + { + 'full': 'client.geo', + 'schema_name': 'geo', + 'short': 'geo', + 'beta': '', + 'normalize': [] + } + ] + }, + # ... more fieldsets ... 
+}
+```
+
+### OTel Mapping Summary Structure
+
+```python
+{
+    'namespace': 'http',
+    'title': 'HTTP',
+    'nr_all_ecs_fields': 25,
+    'nr_plain_ecs_fields': 20,
+    'nr_otel_fields': 18,
+    'nr_matching_fields': 10,
+    'nr_equivalent_fields': 5,
+    'nr_related_fields': 3,
+    'nr_conflicting_fields': 1,
+    'nr_metric_fields': 0,
+    'nr_otlp_fields': 0,
+    'nr_not_applicable_fields': 1
+}
+```
+
+## Usage Examples
+
+### Running the Generator
+
+Typically invoked through the main generator:
+
+```bash
+# From repository root
+make clean
+make SEMCONV_VERSION=v1.24.0
+
+# Or directly with Python
+python scripts/generator.py --semconv-version v1.24.0
+```
+
+### Programmatic Usage
+
+```python
+from generators import markdown_fields
+from generators.otel import OTelGenerator
+
+# Prepare data
+nested = {...}  # From intermediate_files.generate()
+docs_only = {...}
+otel_gen = OTelGenerator('v1.24.0')
+
+# Generate all markdown docs
+markdown_fields.generate(
+    nested=nested,
+    docs_only_nested=docs_only,
+    ecs_version='8.11.0',
+    semconv_version='v1.24.0',
+    otel_generator=otel_gen,
+    out_dir='docs/reference'
+)
+```
+
+### Testing Template Changes
+
+To test template modifications without full regeneration:
+
+```python
+from generators.markdown_fields import render_template
+
+# Test a template with sample data
+context = {
+    'fieldset': {'name': 'http', 'title': 'HTTP'},
+    'sorted_fields': [...]
+}
+
+output = render_template('fieldset.j2', **context)
+print(output)
+```
+
+## Making Changes
+
+### Modifying Existing Pages
+
+To change an existing page's content:
+
+1. **Locate the template**: Find the `.j2` file in `scripts/templates/`
+2. **Edit the template**: Modify the Jinja2 markup
+3. **Update the page function** (if needed): Adjust context data in `markdown_fields.py`
+4. **Test**: Regenerate the documentation and review the output
+5. **Validate**: Check that the markdown renders correctly
+
+Example - Adding a field count to fieldset pages:
+
+```python
+# In markdown_fields.py
+@templated('fieldset.j2')
+def page_fieldset(fieldset, nested, ecs_generated_version):
+    # ... existing code ...
+    return dict(
+        fieldset=fieldset,
+        sorted_fields=sorted_fields,
+        # Add new data
+        field_count=len(sorted_fields),  # NEW
+        # ... rest of context ...
+    )
+```
+
+```jinja2
+{# In fieldset.j2 #}
+# {{ fieldset.title }}
+
+This fieldset contains {{ field_count }} fields. {# NEW #}
+
+{# ... rest of template ... #}
+```
+
+### Changing Field Display Order
+
+To modify how fields are sorted:
+
+```python
+def sort_fields(fieldset):
+    """Sort fields by custom criteria."""
+    fields_list = list(fieldset['fields'].values())
+    for field in fields_list:
+        field['allowed_value_names'] = extract_allowed_values_key_names(field)
+
+    # Change the sorting key
+    return sorted(fields_list, key=lambda f: (f.get('level'), f['name']))
+    # Now sorts by level first, then name
+```
+
+### Adding Conditional Sections
+
+To show content only for certain fieldsets:
+
+```jinja2
+{% if fieldset.name == 'event' %}
+## Special Event Categorization
+
+The event fieldset includes special categorization fields...
+{% endif %} +``` + +## Troubleshooting + +### Common Issues + +#### "Template not found: xyz.j2" + +**Cause**: Template file doesn't exist or path is wrong + +**Solution**: +- Verify template exists in `scripts/templates/` +- Check template name spelling +- Ensure `TEMPLATE_DIR` path is correct + +#### Markdown not rendering correctly + +**Cause**: Jinja2 whitespace control or markdown syntax issues + +**Solutions**: +- Check for extra/missing blank lines +- Use `{%-` and `-%}` for whitespace control +- Validate markdown with a linter +- Review `trim_blocks` and `lstrip_blocks` settings + +Example whitespace issue: +```jinja2 +{# BAD - Creates unwanted blank lines #} +{% for field in fields %} +{{ field.name }} +{% endfor %} + +{# GOOD - Cleaner output #} +{% for field in fields -%} +{{ field.name }} +{% endfor %} +``` + +#### Context variable not available in template + +**Cause**: Variable not passed in context dictionary + +**Solution**: Update the page function's return dict: +```python +@templated('my_template.j2') +def page_something(...): + return dict( + existing_var=value, + new_var=new_value # Add missing variable + ) +``` + +#### Jinja2 syntax errors + +**Cause**: Invalid template syntax + +**Common mistakes**: +- Unclosed blocks: `{% if x %}` without `{% endif %}` +- Wrong syntax: `{{ if x }}` instead of `{% if x %}` +- Missing filters: `{{ var|missing_filter }}` + +**Debugging**: +```python +try: + output = render_template('my_template.j2', **context) +except jinja2.TemplateError as e: + print(f"Template error: {e}") + print(f"Line: {e.lineno}") +``` + +### Performance Considerations + +For large schemas (100+ fieldsets): + +1. **Avoid redundant processing**: + ```python + # BAD - Sorts multiple times + for fieldset in fieldsets: + sorted_fields = sorted(fieldset['fields'].values(), ...) + + # GOOD - Sort once, reuse + for fieldset in fieldsets: + if not hasattr(fieldset, '_sorted_fields'): + fieldset['_sorted_fields'] = sorted(fieldset['fields'].values(), ...) + sorted_fields = fieldset['_sorted_fields'] + ``` + +2. **Use generators for large datasets**: + ```python + # Instead of building large lists + results = [generate_page(fs) for fs in fieldsets] + + # Use generator + results = (generate_page(fs) for fs in fieldsets) + ''.join(results) # Consume as needed + ``` + +## Related Files + +- `scripts/generator.py` - Main entry point, calls this generator +- `scripts/generators/intermediate_files.py` - Produces nested structures +- `scripts/generators/otel.py` - Provides OTel summaries +- `scripts/generators/ecs_helpers.py` - Utility functions +- `scripts/templates/*.j2` - Jinja2 templates +- `docs/fields/usage/*.md` - Usage documentation (manually written) +- `docs/reference/*.md` - Generated markdown output + +## References + +- [Jinja2 Documentation](https://jinja.palletsprojects.com/) +- [Markdown Guide](https://www.markdownguide.org/) +- [ECS Documentation](https://www.elastic.co/guide/en/ecs/current/index.html) +- [Elastic Doc Build Tools](https://github.com/elastic/docs) + diff --git a/scripts/docs/otel-integration.md b/scripts/docs/otel-integration.md new file mode 100644 index 0000000000..14f4ea2a70 --- /dev/null +++ b/scripts/docs/otel-integration.md @@ -0,0 +1,395 @@ +# OpenTelemetry Semantic Conventions Integration + +## Overview + +The OTel integration module (`generators/otel.py`) manages the alignment between Elastic Common Schema (ECS) and OpenTelemetry Semantic Conventions. This is a critical component supporting the ECS donation to OpenTelemetry initiative. 
+ +### Purpose + +As ECS and OTel Semantic Conventions converge into a single standard, this module: + +1. **Validates** that ECS field mappings reference valid OTel attributes and metrics +2. **Enriches** ECS definitions with OTel stability information +3. **Generates** alignment summaries for documentation +4. **Detects** potential unmapped fields that exist in both standards + +## Architecture + +### High-Level Flow + +``` +┌─────────────────────────────────────────────────────────────────┐ +│ generator.py │ +│ (Main Entry Point) │ +└────────────────────────────┬────────────────────────────────────┘ + │ + ▼ +┌─────────────────────────────────────────────────────────────────┐ +│ OTelGenerator.__init__() │ +│ │ +│ 1. Clone/load OTel semconv repo from GitHub │ +│ 2. Parse all YAML model files │ +│ 3. Extract attributes and metrics │ +│ 4. Build lookup indexes │ +└────────────────────────────┬────────────────────────────────────┘ + │ + ▼ +┌─────────────────────────────────────────────────────────────────┐ +│ OTelGenerator.validate_otel_mapping() │ +│ │ +│ Pass 1: Validate mapping structure │ +│ - Check relation types are valid │ +│ - Verify required/forbidden properties │ +│ - Confirm referenced attributes/metrics exist │ +│ │ +│ Pass 2: Enrich with stability information │ +│ - Add stability levels from OTel definitions │ +└────────────────────────────┬────────────────────────────────────┘ + │ + ▼ +┌─────────────────────────────────────────────────────────────────┐ +│ OTelGenerator.get_mapping_summaries() │ +│ │ +│ Generate statistics for each namespace: │ +│ - Count fields by relation type │ +│ - Calculate coverage percentages │ +│ - Used for documentation generation │ +└─────────────────────────────────────────────────────────────────┘ +``` + +### Key Components + +#### 1. Model Loading (`get_model_files`, `get_tree_by_url`) + +**Purpose**: Load OTel semantic conventions from GitHub + +- Clones the semantic-conventions repository (or uses cached version) +- Checks out a specific version tag (e.g., `v1.24.0`) +- Recursively collects all YAML model files +- Caches the repository in `./build/otel-semconv/` for performance + +**Key Files**: All `.yml`/`.yaml` files in the `model/` directory of the OTel semconv repo + +#### 2. Attribute/Metric Extraction (`get_attributes`, `get_metrics`) + +**Purpose**: Parse model files and build lookup indexes + +- Extracts non-deprecated attributes from `attribute_group` entries +- Extracts non-deprecated metrics from `metric` entries +- Applies prefixes to attribute IDs (e.g., `http.` prefix) +- Preserves display names for documentation + +**Output**: Dictionaries keyed by attribute ID / metric name + +#### 3. Validation (`OTelGenerator.validate_otel_mapping`) + +**Purpose**: Ensure mapping integrity + +- Uses visitor pattern to traverse all ECS fields +- Validates each OTel mapping configuration +- Checks existence of referenced attributes/metrics +- Enriches mappings with stability levels +- Prints warnings for potential unmapped fields + +#### 4. Summary Generation (`OTelGenerator.get_mapping_summaries`) + +**Purpose**: Generate documentation statistics + +- Counts fields by relation type for each namespace +- Identifies OTel-only namespaces (not yet in ECS) +- Produces data structure consumed by markdown generators +- Sorted alphabetically for consistent output + +## OTel Mapping Configuration + +### Relation Types + +ECS fields can have one or more OTel mappings, each with a `relation` type: + +#### `match` +Names and semantics are identical. 
+ +```yaml +- name: method + flat_name: http.request.method + otel: + - relation: match +``` + +**Requirements**: No additional properties +**Generated stability**: From OTel attribute definition + +#### `equivalent` +Semantically equivalent but different names. + +```yaml +- name: status_code + flat_name: http.response.status_code + otel: + - relation: equivalent + attribute: http.response.status_code +``` + +**Requirements**: Must specify `attribute` +**Generated stability**: From OTel attribute definition + +#### `related` +Related concepts but not semantically identical. + +```yaml +- name: original + flat_name: url.original + otel: + - relation: related + attribute: url.full + note: Similar but may have different encoding +``` + +**Requirements**: Must specify `attribute` +**Optional**: `note` explaining the relationship + +#### `conflict` +Conflicting definitions that need resolution. + +```yaml +- name: bytes + flat_name: http.request.body.bytes + otel: + - relation: conflict + attribute: http.request.body.size + note: ECS uses bytes, OTel uses size +``` + +**Requirements**: Must specify `attribute` +**Optional**: `note` explaining the conflict + +#### `metric` +Maps to an OTel metric rather than an attribute. + +```yaml +- name: duration + flat_name: http.client.request.duration + otel: + - relation: metric + metric: http.client.request.duration +``` + +**Requirements**: Must specify `metric` +**Forbidden**: `attribute`, `otlp_field` + +#### `otlp` +Maps to an OTLP protocol-specific field. + +```yaml +- name: trace_id + flat_name: trace.id + otel: + - relation: otlp + otlp_field: trace_id + stability: stable +``` + +**Requirements**: Must specify `otlp_field` and `stability` +**Forbidden**: `attribute`, `metric` + +#### `na` +Not applicable for OTel mapping. + +```yaml +- name: ecs_version + flat_name: ecs.version + otel: + - relation: na + note: ECS-specific field +``` + +**Requirements**: None +**Forbidden**: `attribute`, `metric`, `otlp_field`, `stability` + +### Validation Rules + +The validator enforces strict rules for each relation type: + +| Relation | Required Properties | Forbidden Properties | Validates Existence | +|----------|---------------------|---------------------|---------------------| +| `match` | - | attribute, metric, otlp_field, stability | Yes (attribute) | +| `equivalent` | attribute | metric, otlp_field, stability | Yes (attribute) | +| `related` | attribute | metric, otlp_field, stability | Yes (attribute) | +| `conflict` | attribute | metric, otlp_field, stability | Yes (attribute) | +| `metric` | metric | attribute, otlp_field, stability | Yes (metric) | +| `otlp` | otlp_field, stability | attribute, metric | No | +| `na` | - | attribute, metric, otlp_field, stability | No | + +## Usage Examples + +### Running the Generator + +The OTel generator is invoked as part of the main ECS generator: + +```bash +# From repository root +make clean +make SEMCONV_VERSION=v1.24.0 +``` + +This triggers `scripts/generator.py`, which: +1. Creates an `OTelGenerator` instance +2. Validates all mappings +3. 
Generates documentation with summaries + +### Programmatic Usage + +```python +from generators.otel import OTelGenerator +from schema import loader + +# Initialize generator with specific OTel version +generator = OTelGenerator('v1.24.0') + +# Load ECS schemas +fields = loader.load_schemas() + +# Validate all OTel mappings +generator.validate_otel_mapping(fields) + +# Generate summaries for documentation +from generators.intermediate_files import generate_nested_fields +nested = generate_nested_fields(fields) +summaries = generator.get_mapping_summaries(nested) + +# Use summaries in documentation +for summary in summaries: + print(f"{summary['namespace']}: {summary['nr_matching_fields']} matches") +``` + +## Making Changes + +### Adding New Relation Types + +If a new relation type is needed: + +1. **Update validation** in `OTelGenerator.__check_mapping()`: + ```python + elif otel['relation'] == 'new_type': + must_have(ecs_field_name, otel, otel['relation'], 'required_property') + must_not_have(ecs_field_name, otel, otel['relation'], 'forbidden_property') + # Add validation logic + ``` + +2. **Update summary counting** in `OTelGenerator.get_mapping_summaries()`: + ```python + elif otel['relation'] == "new_type": + summary['nr_new_type_fields'] += 1 + ``` + +3. **Update type definition** in `ecs_types/otel_types.py`: + ```python + class OTelMappingSummary(TypedDict, total=False): + # ... existing fields ... + nr_new_type_fields: int + ``` + +4. **Update documentation templates** in `templates/otel_alignment_*.j2` + +5. **Update this documentation** with the new relation type + +### Updating OTel Semconv Version + +To use a newer version of OTel semantic conventions: + +1. **Check available versions**: + Visit https://github.com/open-telemetry/semantic-conventions/tags + +2. **Update version file**: + ```bash + echo "v1.25.0" > otel-semconv-version + ``` + +3. **Regenerate**: + ```bash + make clean + make SEMCONV_VERSION=v1.25.0 + ``` + +4. **Handle validation errors**: + - If attributes were renamed: Update ECS mappings in `schemas/*.yml` + - If attributes were deprecated: Update or remove mappings + - If validation rules changed: Update `otel.py` validator + +### Testing Changes + +After modifying the OTel generator: + +1. **Run validation**: + ```bash + python scripts/generator.py --semconv-version v1.24.0 + ``` + +2. **Check generated files**: + - `docs/reference/otel-*.md` - Alignment documentation + - Verify summary statistics are correct + +3. 
**Run tests** (if applicable):
+   ```bash
+   python -m pytest scripts/tests/
+   ```
+
+## Troubleshooting
+
+### Common Issues
+
+#### "Attribute 'X' does not exist in Semantic Conventions version Y"
+
+**Cause**: The ECS field references an OTel attribute that doesn't exist in the specified version
+
+**Solutions**:
+- Check whether the attribute was renamed in OTel
+- Update the `attribute` value in the ECS schema
+- Verify the semconv version is correct
+- Check whether the attribute was deprecated or removed
+
+#### "OTel mapping must specify the property 'attribute'"
+
+**Cause**: The mapping's relation type requires the `attribute` property, but it's missing
+
+**Solution**: Add the required property to the mapping:
+```yaml
+otel:
+  - relation: equivalent
+    attribute: otel.attribute.name  # Add this
+```
+
+#### "Clone is too slow / Network timeout"
+
+**Cause**: The semantic-conventions repo is large, so the first-time clone can be slow
+
+**Solutions**:
+- Be patient on the first run (the repo is cached afterwards)
+- Check network connectivity
+- Manually clone: `git clone https://github.com/open-telemetry/semantic-conventions.git ./build/otel-semconv/`
+
+#### "WARNING: Field 'X' exists in OTel but is not mapped"
+
+**Cause**: The field name matches an OTel attribute but has no mapping defined
+
+**Action**: Consider whether this field should be mapped:
+- If yes: Add the appropriate OTel mapping to the schema
+- If no: Add `otel: [{relation: na}]` to suppress the warning
+
+## Related Files
+
+- `scripts/generator.py` - Main entry point, orchestrates generation
+- `scripts/generators/markdown_fields.py` - Uses summaries for docs
+- `scripts/ecs_types/otel_types.py` - Type definitions
+- `scripts/schema/visitor.py` - Field traversal mechanism
+- `templates/otel_alignment_*.j2` - Jinja2 templates for docs
+- `schemas/*.yml` - ECS field definitions with OTel mappings
+- `otel-semconv-version` - File containing target OTel version
+
+## References
+
+- [ECS Documentation](https://www.elastic.co/guide/en/ecs/current/index.html)
+- [OTel Semantic Conventions](https://opentelemetry.io/docs/specs/semconv/)
+- [ECS-OTel Convergence Announcement](https://opentelemetry.io/blog/2023/ecs-otel-semconv-convergence/)
+- [Semantic Conventions Repository](https://github.com/open-telemetry/semantic-conventions)
+
diff --git a/scripts/docs/schema-pipeline.md b/scripts/docs/schema-pipeline.md
new file mode 100644
index 0000000000..7ef8976dda
--- /dev/null
+++ b/scripts/docs/schema-pipeline.md
@@ -0,0 +1,1653 @@
+# ECS Schema Processing Pipeline
+
+## Overview
+
+The ECS schema processing pipeline transforms YAML schema definitions into various output formats (Elasticsearch templates, Beats configs, markdown docs, etc.). It's a multi-stage pipeline where each stage has a specific responsibility.
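+
+In code, a complete run is a straight composition of the stage modules (each call is covered in detail below; a sketch of the happy path only):
+
+```python
+from schema import loader, cleaner, finalizer
+from generators import intermediate_files
+
+fields = loader.load_schemas()    # 1. load & nest raw YAML
+cleaner.clean(fields)             # 2. validate, normalize, apply defaults
+finalizer.finalize(fields)        # 3. perform reuse, calculate final names
+nested, flat = intermediate_files.generate(
+    fields, 'generated/ecs', default_dirs=True
+)                                 # 4. emit ecs_nested.yml / ecs_flat.yml
+```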
+ +**Pipeline Stages:** +``` +┌─────────────┐ +│ YAML Schema │ Raw schema files in schemas/*.yml +│ Files │ +└──────┬──────┘ + │ + v +┌─────────────┐ +│ loader.py │ Load & nest: YAML → deeply nested dict +└──────┬──────┘ + │ + v +┌─────────────┐ +│ cleaner.py │ Validate, normalize, apply defaults +└──────┬──────┘ + │ + v +┌─────────────┐ +│finalizer.py │ Perform field reuse, calculate names +└──────┬──────┘ + │ + v (Optional filters) +┌─────────────┐ ┌────────────────┐ +│subset_filter│─>│exclude_filter │ +│ .py │ │ .py │ +└──────┬──────┘ └────────┬───────┘ + │ │ + v v +┌─────────────────────────────┐ +│ intermediate_files.py │ Generate flat & nested YAML +└──────────────┬──────────────┘ + │ + v + ┌────────────────────┐ + │ Generators │ + ├────────────────────┤ + │ • es_template.py │ Elasticsearch templates + │ • beats.py │ Beats field definitions + │ • csv_generator.py │ CSV field export + │ • markdown_fields │ Markdown documentation + └────────────────────┘ +``` + +## Quick Reference + +### Field Reuse Cheat Sheet + +| Concept | What | When to Use | Example | +|---------|------|-------------|---------| +| **Foreign Reuse** | Copy fieldset to different location | Same fields needed elsewhere | `user` → `destination.user` | +| **Transitive** | Reuse carries nested reuses | Automatic composition | If `group` in `user`, `destination.user` gets `group` too | +| **Self-Nesting** | Copy fieldset into itself | Parent/child relationships | `process` → `process.parent` | +| **Non-Transitive** | Self-nesting stays local | Avoid unwanted propagation | `process.parent` NOT at `source.process.parent` | +| **order: 1** | High priority reuse | Has dependencies | `group` reused before `user` | +| **order: 2** | Default priority | Most fieldsets | Standard reuse timing | + +**Quick Syntax:** +```yaml +# Foreign reuse (goes to other fieldsets) +fieldset: + reusable: + expected: + - destination # Simple: reuse as same name + - at: process # Complex: reuse with different name + as: parent + +# Self-nesting (stays in same fieldset) +process: + reusable: + expected: + - at: process # ← Same name as fieldset = self-nesting + as: parent +``` + +### Subset Definition Cheat Sheet + +| Syntax | Meaning | Result | +|--------|---------|--------| +| `fields: '*'` | Include all fields | Every field in fieldset | +| `fields: { field: {} }` | Include specific field | Just that one field | +| `fields: { parent: { fields: '*' }}` | Include all nested | All fields under parent | +| `index: false` | Don't index field | Field exists but not searchable | +| `docs_only: true` | Documentation only | In docs, not in artifacts | + +**Quick Syntax:** +```yaml +name: my_subset +fields: + base: + fields: '*' # All base fields + + http: + fields: + request: + fields: + method: {} # Just this field + response: + fields: '*' # All response fields + + destination: + fields: + user: # Reused fieldset + fields: + name: {} # Specific user fields +``` + +### Common Patterns + +#### Pattern 1: Network Endpoint Fields (Foreign Reuse) + +**Problem:** Need same fields for source, destination, client, server + +**Solution:** Create reusable fieldset, reuse at all locations +```yaml +# In geo schema +geo: + reusable: + top_level: false # Only via reuse + expected: + - client + - destination + - host + - observer + - server + - source + fields: + - name: city_name + - name: country_name + - name: location # latitude/longitude +``` + +**Result:** `source.geo.city_name`, `destination.geo.city_name`, etc. 
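+
+To confirm which reuse locations actually made it into the generated artifacts, a quick check against `ecs_flat.yml` (assuming you have already run the generator) might look like:
+
+```python
+import yaml
+
+with open('generated/ecs/ecs_flat.yml') as f:
+    flat = yaml.safe_load(f)
+
+# Every parent that geo fields were copied into, e.g. source.geo, destination.geo
+geo_locations = sorted({name.rsplit('.geo.', 1)[0] + '.geo'
+                        for name in flat if '.geo.' in name})
+print(geo_locations)
+```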
+ +#### Pattern 2: Parent-Child Hierarchy (Self-Nesting) + +**Problem:** Need to represent parent process, effective user, session leader + +**Solution:** Self-nesting +```yaml +process: + reusable: + expected: + - at: process + as: parent + - at: process + as: session_leader + fields: + - name: pid + - name: name +``` + +**Result:** `process.pid`, `process.parent.pid`, `process.session_leader.pid` + +#### Pattern 3: Minimal Web Subset + +**Problem:** Only need basic HTTP fields for web logs + +**Solution:** +```yaml +name: web_minimal +fields: + base: { fields: '*' } + http: + fields: + request: { fields: { method: {}, bytes: {} }} + response: { fields: { status_code: {}, bytes: {} }} + url: { fields: { domain: {}, path: {} }} +``` + +**Result:** ~10-15 fields instead of 850 + +#### Pattern 4: Security Monitoring Subset + +**Problem:** Need security-relevant fields only + +**Solution:** +```yaml +name: security +fields: + base: { fields: '*' } + event: { fields: { action: {}, category: {}, type: {}, outcome: {} }} + source: { fields: { ip: {}, port: {}, user: { fields: { name: {} }}}} + destination: { fields: { ip: {}, port: {} }} + process: + fields: + name: {} + pid: {} + parent: { fields: { name: {}, pid: {} }} + file: + fields: + path: {} + hash: { fields: { sha256: {} }} +``` + +**Result:** Security-focused field set + +--- + +## Core Concepts + +### Deeply Nested Structure + +All pipeline stages work with a deeply nested dictionary structure: + +```python +{ + 'fieldset_name': { + 'schema_details': { # Fieldset-level metadata + 'root': bool, + 'group': int, + 'reusable': {...}, + 'title': str + }, + 'field_details': { # Properties of the fieldset itself + 'name': str, + 'description': str, + 'type': 'group' + }, + 'fields': { # Nested fields + 'field_name': { + 'field_details': {...}, + 'fields': {...} # Recursive + } + } + } +} +``` + +### Intermediate Fields + +Auto-created parent fields for nesting structure: +- Created automatically for dotted names: `request.method` → creates `request` intermediate +- Marked with `intermediate: true` +- Type: `object` +- Skipped by some validation/processing steps + +### Field Reuse + +**Why Field Reuse Exists:** + +Without reuse, we'd need to duplicate the same fields everywhere: +```yaml +# Without reuse - lots of duplication! ❌ +source: + - name: ip + - name: port + - name: address +destination: + - name: ip # Duplicated! + - name: port # Duplicated! + - name: address # Duplicated! +client: + - name: ip # Duplicated again! + - name: port # Duplicated again! + # ... and so on +``` + +With reuse, we define fields once and reuse them: +```yaml +# With reuse - define once, reuse everywhere! ✅ +user: + reusable: + top_level: false # Not at root + expected: + - destination # Reuse at destination.user + - source # Reuse at source.user + - client # Reuse at client.user + fields: + - name: name + - name: email + - name: id +``` + +**Two Types of Reuse:** + +#### 1. Foreign Reuse (Transitive) - Copy Across Fieldsets + +**What it does:** Copies a fieldset into a completely different fieldset + +**Example:** `user` fields appear at `destination.user.*`, `source.user.*` + +**Why "transitive":** If A is reused in B, and B is reused in C, then C automatically gets A too. 
+ +**Visual Example:** +``` +Before Reuse: +┌──────────┐ ┌─────────────┐ +│ user │ │ destination │ +├──────────┤ ├─────────────┤ +│ • name │ │ • ip │ +│ • email │ │ • port │ +│ • id │ └─────────────┘ +└──────────┘ + +After Reuse (user → destination.user): +┌─────────────────────────────┐ +│ destination │ +├─────────────────────────────┤ +│ • ip │ +│ • port │ +│ • user ← (reused!) │ +│ ├─ name │ +│ ├─ email │ +│ └─ id │ +└─────────────────────────────┘ + +Result: destination.user.name, destination.user.email, destination.user.id +``` + +**Transitivity in Action:** +``` +Step 1: group → user.group +┌──────────┐ ┌──────────────────┐ +│ group │ ───> │ user │ +│ • id │ │ • name │ +│ • name │ │ • email │ +└──────────┘ │ • group (copied) │ + │ ├─ id │ + │ └─ name │ + └──────────────────┘ + +Step 2: user (with group!) → destination.user +┌──────────────────┐ ┌────────────────────────────────┐ +│ user │ ───> │ destination │ +│ • name │ │ • ip │ +│ • email │ │ • port │ +│ • group │ │ • user (copied with group!) │ +│ ├─ id │ │ ├─ name │ +│ └─ name │ │ ├─ email │ +└──────────────────┘ │ └─ group ← (transitive!) │ + │ ├─ id │ + │ └─ name │ + └────────────────────────────────┘ + +Result: destination.user.group.id exists because transitivity! +``` + +#### 2. Self-Nesting (Non-Transitive) - Copy Within Same Fieldset + +**What it does:** Copies a fieldset into itself with a different name + +**Example:** `process` fields appear at `process.parent.*` + +**Why "non-transitive":** This nesting is local only. When the fieldset is reused elsewhere, the self-nesting doesn't come along. + +**Visual Example:** +``` +Before Self-Nesting: +┌──────────┐ +│ process │ +├──────────┤ +│ • pid │ +│ • name │ +│ • args │ +└──────────┘ + +After Self-Nesting (process → process.parent): +┌───────────────────────────┐ +│ process │ +├───────────────────────────┤ +│ • pid │ +│ • name │ +│ • args │ +│ • parent ← (self-nested!) │ +│ ├─ pid │ +│ ├─ name │ +│ └─ args │ +└───────────────────────────┘ + +Result: process.pid, process.name, process.parent.pid, process.parent.name +``` + +**Non-Transitivity in Action:** +``` +Scenario: process has self-nesting, then process is reused at source + +Step 1: process → process.parent (self-nesting) +┌───────────────────────┐ +│ process │ +│ • pid │ +│ • name │ +│ • parent (self-nest) │ +│ ├─ pid │ +│ └─ name │ +└───────────────────────┘ + +Step 2: process → source.process (foreign reuse) +┌─────────────────────────┐ +│ source │ +│ • ip │ +│ • port │ +│ • process │ +│ ├─ pid │ +│ └─ name │ +│ └─ parent? ← NO! ❌ │ +└─────────────────────────┘ + +Result: source.process.parent does NOT exist! +Why? Self-nesting is NOT transitive - it stays local to original fieldset. +``` + +**When to Use Each Type:** + +| Use Case | Type | Example | +|----------|------|---------| +| Same fields needed in multiple places | Foreign Reuse | user at destination, source, client | +| Capture hierarchical relationship | Self-Nesting | process.parent, process.session_leader | +| Build complex nested structures | Foreign Reuse | geo at client.geo, server.geo | +| Represent parent/child relationships | Self-Nesting | user.target, user.effective | + +**Reuse Order:** + +Some fieldsets depend on others being reused first: +```yaml +group: + reusable: + order: 1 # ← Reused FIRST (high priority) + expected: + - user # group goes into user + +user: + reusable: + order: 2 # ← Reused SECOND (default priority) + expected: + - destination # user (now with group) goes into destination +``` + +**Processing Order:** +1. 
Order 1 fieldsets → Foreign reuse → Self-nesting +2. Order 2 fieldsets → Foreign reuse → Self-nesting + +**Result:** `destination.user.group.*` exists because group was reused into user before user was reused into destination. + +## Pipeline Stages + +### 1. loader.py - Schema Loading + +**Purpose:** Load YAML schema files and create initial nested structure + +**Input:** +- YAML schema files (`schemas/*.yml`) +- Optional: git ref for specific version +- Optional: custom/experimental schemas + +**Processing:** +1. Load schemas from filesystem or git +2. Nest dotted field names into hierarchical structure +3. Merge multiple sources (ECS + experimental + custom) +4. Create intermediate fields for parents + +**Output:** Deeply nested field dictionary with minimal defaults + +**Key Functions:** +- `load_schemas()`: Main entry point +- `deep_nesting_representation()`: Convert flat to nested +- `nest_fields()`: Build nested hierarchy +- `merge_fields()`: Merge multiple sources + +**Example:** +```python +from schema import loader +fields = loader.load_schemas() +# Or from specific version: +fields = loader.load_schemas(ref='v8.10.0') +``` + +### 2. cleaner.py - Validation & Normalization + +**Purpose:** Validate schemas and apply sensible defaults + +**Input:** Nested fields from loader + +**Processing:** +1. Validate mandatory attributes present +2. Strip whitespace from all strings +3. Apply type-specific defaults (e.g., `ignore_above=1024` for keywords) +4. Expand shorthand notations (reuse locations) +5. Validate constraints (description length, examples, patterns) + +**Output:** Validated and enriched fields + +**Defaults Applied:** +- `group: 2` (fieldset priority) +- `root: false` (not a root fieldset) +- `ignore_above: 1024` (for keyword fields) +- `norms: false` (for text fields) +- `short: description` (if not specified) + +**Validation:** +- Mandatory attributes: name, title, description, type, level +- Short descriptions < 120 characters (strict mode) +- Valid regex patterns +- Example values match patterns/expected_values +- Field levels: core/extended/custom + +**Key Functions:** +- `clean()`: Main entry point +- `schema_cleanup()`: Process fieldsets +- `field_cleanup()`: Process fields +- `normalize_reuse_notation()`: Expand reuse shorthand + +**Example:** +```python +from schema import loader, cleaner +fields = loader.load_schemas() +cleaner.clean(fields, strict=False) # Warnings +cleaner.clean(fields, strict=True) # Exceptions +``` + +### 3. finalizer.py - Field Reuse & Name Calculation + +**Purpose:** Perform field reuse and calculate final field names + +**Input:** Cleaned fields + +**Processing:** + +**Phase 1: Field Reuse** +1. Organize reuses by order and type (foreign vs self) +2. For each order level: + a. Foreign reuses: Copy fieldset to different location (transitive) + b. Self-nestings: Copy fieldset into itself (non-transitive) +3. Mark reused fields with `original_fieldset` +4. Record reuse metadata in `reused_here` + +**Phase 2: Name Calculation** +1. Traverse all fields with path tracking +2. Calculate `flat_name`: full dotted name +3. Calculate `dashed_name`: kebab-case version +4. Calculate multi-field `flat_names` +5. Apply OTel reuse mappings + +**Output:** Complete field structure with all reuses and final names + +**Reuse Example:** +``` +Order 1: +- group → user.group (foreign reuse) + +Order 2: +- user (now with group) → destination.user (foreign reuse) + Result: destination.user.group exists! 
(transitive) +- process → process.parent (self-nesting) + Result: source.process.parent does NOT exist (non-transitive) +``` + +**Key Functions:** +- `finalize()`: Main entry point +- `perform_reuse()`: Execute reuse operations +- `calculate_final_values()`: Compute final names +- `field_finalizer()`: Calculate individual field names + +**Example:** +```python +from schema import loader, cleaner, finalizer +fields = loader.load_schemas() +cleaner.clean(fields) +finalizer.finalize(fields) +# Fields now have flat_name, dashed_name calculated +``` + +### 4. subset_filter.py - Subset Filtering (Optional) + +**Purpose:** Filter to include only specified fields + +Subset filtering is like a **whitelist** - you specify exactly which fields to include, and everything else is excluded. + +**Why Use Subsets:** + +- **Reduce field count:** Full ECS has ~850 fields. Subsets let you use only 50-100 fields for specific use cases +- **Performance:** Fewer fields = smaller mappings = better Elasticsearch performance +- **Simplicity:** Only the fields you actually need +- **Domain-specific:** Create subsets for web, security, infrastructure, etc. + +**Input:** Finalized fields (after reuse) + +**Processing:** +1. Load subset definition files +2. Extract matching fields recursively +3. Handle `docs_only` fields separately +4. Merge multiple subsets (union) + +**Output:** +- Filtered fields (main subset) +- Docs-only fields (separate) + +--- + +## Understanding Subset Definitions + +A subset definition is a YAML file that mirrors the field structure, but only includes what you want: + +### Basic Subset Structure + +```yaml +name: minimal # Subset name (used for output directory) +fields: # Top-level: list fieldsets to include + base: # Fieldset name + fields: '*' # '*' = include ALL fields in this fieldset + + http: # Another fieldset + fields: # Specify which fields to include + request: # Nested field + fields: # Go deeper + method: {} # Include this field + bytes: {} # Include this field + response: + fields: '*' # Include ALL response fields +``` + +### Visual Representation + +**Before Subset (Full ECS):** +``` +base +├─ @timestamp +├─ message +├─ tags +└─ labels + +http +├─ request +│ ├─ method +│ ├─ bytes +│ ├─ referrer +│ └─ body +└─ response + ├─ status_code + ├─ bytes + └─ body + +user +├─ name +├─ email +└─ id +``` + +**Subset Definition:** +```yaml +name: minimal +fields: + base: + fields: '*' # All base fields + http: + fields: + request: + fields: + method: {} # Just method + bytes: {} # Just bytes +``` + +**After Subset:** +``` +base ✓ (all fields kept) +├─ @timestamp +├─ message +├─ tags +└─ labels + +http ✓ (partially kept) +├─ request +│ ├─ method ✓ (explicitly included) +│ ├─ bytes ✓ (explicitly included) +│ ├─ referrer ✗ (not in subset) +│ └─ body ✗ (not in subset) +└─ response ✗ (entire section excluded) + +user ✗ (not in subset at all) +``` + +--- + +## Subset Syntax Guide + +### 1. Include All Fields: `fields: '*'` + +```yaml +http: + fields: '*' # Include every field in http fieldset +``` + +**Result:** `http.request.method`, `http.request.bytes`, `http.response.status_code`, etc. + +### 2. Include Specific Fields: Nested Structure + +```yaml +http: + fields: + request: # Include request section + fields: + method: {} # Include just this field +``` + +**Result:** Only `http.request.method` + +### 3. 
Mix Wildcard and Specific + +```yaml +http: + fields: + request: + fields: + method: {} # Specific field + bytes: {} # Specific field + response: + fields: '*' # All response fields +``` + +**Result:** +- `http.request.method` ✓ +- `http.request.bytes` ✓ +- `http.request.referrer` ✗ (not specified) +- `http.response.status_code` ✓ (wildcard) +- `http.response.bytes` ✓ (wildcard) + +### 4. Deep Nesting + +```yaml +destination: + fields: + user: # Reused fieldset under destination + fields: + name: {} # Just the name field +``` + +**Result:** Only `destination.user.name` + +--- + +## Complete Subset Examples + +### Example 1: Minimal Subset (Web Logs) + +**Use Case:** Just enough fields for basic web log analysis + +```yaml +name: web_minimal +fields: + base: + fields: '*' # @timestamp, message, tags, labels + + http: + fields: + request: + fields: + method: {} # GET, POST, etc. + bytes: {} # Request size + response: + fields: + status_code: {} # 200, 404, etc. + bytes: {} # Response size + + url: + fields: + domain: {} + path: {} + query: {} + + user_agent: + fields: + original: {} # Raw user agent string +``` + +**Result:** ~15 fields instead of 850 + +### Example 2: Security Subset + +**Use Case:** Security monitoring and threat detection + +```yaml +name: security +fields: + base: + fields: '*' + + event: + fields: + action: {} + category: {} + type: {} + outcome: {} + + source: + fields: + ip: {} + port: {} + address: {} + user: + fields: + name: {} + id: {} + + destination: + fields: + ip: {} + port: {} + address: {} + + process: + fields: + name: {} + pid: {} + executable: {} + command_line: {} + parent: # Self-nested field + fields: + name: {} + pid: {} + + file: + fields: + path: {} + name: {} + hash: + fields: + sha256: {} +``` + +**Result:** ~30-40 security-focused fields + +### Example 3: Infrastructure Subset + +**Use Case:** Server and infrastructure monitoring + +```yaml +name: infrastructure +fields: + base: + fields: '*' + + host: + fields: + name: {} + hostname: {} + ip: {} + os: + fields: + platform: {} + version: {} + + system: + fields: + cpu: + fields: '*' # All CPU metrics + memory: + fields: '*' # All memory metrics + process: + fields: + state: {} + cpu: + fields: '*' + + container: + fields: + id: {} + name: {} + image: + fields: + name: {} + tag: {} +``` + +--- + +## Field Options in Subsets + +Beyond just including fields, you can set options: + +### Disable Indexing + +```yaml +http: + fields: + request: + fields: + body: + index: false # Don't index this field + enabled: false # Don't process at all +``` + +**Result:** `http.request.body` exists but isn't indexed (saves space, still in _source) + +### docs_only Fields + +```yaml +http: + fields: + request: + fields: + referrer: + docs_only: true # In documentation but not artifacts +``` + +**Result:** Field appears in markdown docs but NOT in Elasticsearch templates, Beats configs, etc. 
+ +**Use Case:** Deprecated fields you still want documented for legacy data + +--- + +## Multiple Subsets (Union) + +You can specify multiple subset files - they're merged together: + +```bash +python generator.py \ + --subset subsets/base.yml subsets/web.yml \ + --semconv-version v1.24.0 +``` + +**Merging Logic:** +- Field in ANY subset → Included in result +- `enabled: false` in subset A, `enabled: true` in subset B → Result: `enabled: true` +- Union operation: More permissive wins + +**Example:** + +`subsets/base.yml`: +```yaml +fields: + base: + fields: '*' + http: + fields: + request: + fields: + method: {} +``` + +`subsets/security.yml`: +```yaml +fields: + http: + fields: + request: + fields: + bytes: {} # Different field + source: + fields: + ip: {} +``` + +**Merged Result:** +``` +base.* (from base.yml) +http.request.method (from base.yml) +http.request.bytes (from security.yml) +source.ip (from security.yml) +``` + +--- + +## Common Subset Pitfalls + +### ❌ Mistake 1: Forgetting Intermediate Fields + +**Wrong:** +```yaml +http: + fields: + method: {} # ❌ Wrong! method is under request +``` + +**Right:** +```yaml +http: + fields: + request: # ✓ Need intermediate field + fields: + method: {} +``` + +### ❌ Mistake 2: Including Fieldset Without Fields Key + +**Wrong:** +```yaml +base: {} # ❌ Missing fields key +``` + +**Right:** +```yaml +base: + fields: '*' # ✓ Must have fields +``` + +### ❌ Mistake 3: Using Wildcards at Wrong Level + +**Wrong:** +```yaml +fields: '*' # ❌ Can't wildcard top level +``` + +**Right:** +```yaml +fields: + base: + fields: '*' # ✓ Wildcard inside fieldset + http: + fields: '*' +``` + +--- + +## Testing Your Subset + +### Check Field Count + +```bash +# Generate subset +python generator.py --subset subsets/minimal.yml --semconv-version v1.24.0 + +# Count fields in generated CSV +wc -l generated/csv/fields.csv +# Should be much less than full ECS (~850 fields) +``` + +### Verify Specific Fields + +```bash +# Check if specific field exists +grep "http.request.method" generated/csv/fields.csv +``` + +### View Generated Schema + +```yaml +# Look at generated nested structure +cat generated/ecs/ecs_nested.yml + +# Check flat structure +cat generated/ecs/ecs_flat.yml +``` + +--- + +## Subset Best Practices + +1. **Start with base:** Almost always include `base: {fields: '*'}` +2. **Be specific:** Only include fields you actually use +3. **Test thoroughly:** Generate and verify the output +4. **Document why:** Add comments explaining the subset purpose +5. **Version control:** Keep subset definitions in git +6. **Iterate:** Start small, add fields as needed + +--- + +**Key Functions:** +- `filter()`: Main entry point +- `extract_matching_fields()`: Recursive filtering +- `combine_all_subsets()`: Merge multiple subsets + +**Example:** +```python +from schema import subset_filter +fields, docs = subset_filter.filter( + fields, + ['subsets/minimal.yml'], + 'generated' +) +``` + +### 5. exclude_filter.py - Exclude Filtering (Optional) + +**Purpose:** Explicitly remove specified fields + +**Input:** Fields (optionally after subset filter) + +**Processing:** +1. Load exclude definition files +2. Remove specified fields +3. 
Auto-remove empty parents (except base) + +**Output:** Fields with exclusions removed + +**Exclude Definition:** +```yaml +- name: http + fields: + - name: request.referrer # Remove this field + - name: response.body +``` + +**Key Functions:** +- `exclude()`: Main entry point +- `exclude_fields()`: Remove matching fields +- `pop_field()`: Recursive removal + +**Example:** +```python +from schema import exclude_filter +fields = exclude_filter.exclude( + fields, + ['excludes/deprecated.yml'] +) +``` + +### 6. intermediate_files.py - Generate Intermediate Formats + +**Purpose:** Generate standardized intermediate YAML representations + +**Input:** Final processed fields + +**Processing:** +1. Generate flat format: `{flat_name: field_def}` +2. Generate nested format: `{fieldset: {fields: {...}}}` +3. Remove internal attributes (node_name, intermediate) +4. Filter non-root reusables (flat format only) + +**Output:** +- `ecs_flat.yml`: Flat dictionary +- `ecs_nested.yml`: Nested by fieldset +- `ecs.yml`: Raw debug format (optional) + +**Key Functions:** +- `generate()`: Main entry point +- `generate_flat_fields()`: Create flat representation +- `generate_nested_fields()`: Create nested representation + +**Example:** +```python +from generators import intermediate_files +nested, flat = intermediate_files.generate( + fields, + 'generated/ecs', + default_dirs=True +) +``` + +## Helper Modules + +### visitor.py - Field Traversal + +**Purpose:** Traverse deeply nested structures using visitor pattern + +**Functions:** +- `visit_fields()`: Call different functions for fieldsets vs fields +- `visit_fields_with_path()`: Pass path array to callback +- `visit_fields_with_memo()`: Pass accumulator object + +**Example:** +```python +from schema import visitor + +# Count all fields +count = {'total': 0} +def counter(details, memo): + memo['total'] += 1 +visitor.visit_fields_with_memo(fields, counter, count) +``` + +## Common Patterns + +### Running the Full Pipeline + +```python +from schema import loader, cleaner, finalizer +from generators import intermediate_files + +# Load schemas +fields = loader.load_schemas() + +# Clean and validate +cleaner.clean(fields, strict=False) + +# Perform reuse and calculate names +finalizer.finalize(fields) + +# Generate intermediate files +nested, flat = intermediate_files.generate( + fields, + 'generated/ecs', + default_dirs=True +) + +# Now ready for generators (es_template, beats, etc.) +``` + +### With Subset Filtering + +```python +from schema import subset_filter + +# ... run pipeline through finalizer ... + +# Apply subset filter +fields, docs = subset_filter.filter( + fields, + ['subsets/minimal.yml'], + 'generated' +) + +# Continue with generators +``` + +### With Exclude Filtering + +```python +from schema import exclude_filter + +# ... run pipeline through finalizer ... 
+ +# Apply exclude filter +fields = exclude_filter.exclude( + fields, + ['excludes/deprecated.yml'] +) + +# Continue with generators +``` + +## Debugging Tips + +### View Intermediate Structure + +```python +import yaml + +# After loader +with open('debug_loaded.yml', 'w') as f: + yaml.dump(fields, f, default_flow_style=False) + +# After cleaner +with open('debug_cleaned.yml', 'w') as f: + yaml.dump(fields, f, default_flow_style=False) + +# After finalizer +with open('debug_finalized.yml', 'w') as f: + yaml.dump(fields, f, default_flow_style=False) +``` + +### Check Specific Field + +```python +# Find a specific field +def find_field(details): + if 'flat_name' in details['field_details']: + if details['field_details']['flat_name'] == 'http.request.method': + print(details['field_details']) + +from schema import visitor +visitor.visit_fields(fields, field_func=find_field) +``` + +### Validate Reuse + +```python +# Check what was reused where +for name, schema in fields.items(): + if 'reused_here' in schema['schema_details']: + print(f"{name} contains:") + for reuse in schema['schema_details']['reused_here']: + print(f" - {reuse['full']}") +``` + +## Extending the Pipeline + +### Adding New Validation + +Add to `cleaner.py`: + +```python +def my_custom_validation(field): + if 'my_custom_attr' in field['field_details']: + # Validate it + pass + +# In field_cleanup(): +def field_cleanup(field): + # ... existing code ... + my_custom_validation(field) +``` + +### Adding New Calculated Fields + +Add to `finalizer.py`: + +```python +def field_finalizer(details, path): + # ... existing calculations ... + + # Add new calculated field + details['field_details']['my_calculated'] = calculate_something(path) +``` + +### Adding New Filter Type + +Create new module like `custom_filter.py`: + +```python +def filter(fields, config): + # Your custom filtering logic + return filtered_fields +``` + +## Testing + +### Unit Tests + +Located in `scripts/tests/unit/`: +- `test_loader.py`: Schema loading +- `test_cleaner.py`: Validation +- `test_finalizer.py`: Reuse logic + +### Integration Tests + +Run full pipeline: +```bash +cd scripts +python3 generator.py --strict +``` + +## Related Documentation + +- [otel-integration.md](otel-integration.md) - OTel integration +- [markdown-generator.md](markdown-generator.md) - Markdown docs +- [intermediate-files.md](intermediate-files.md) - Intermediate formats +- [es-template.md](es-template.md) - Elasticsearch templates +- [ecs-helpers.md](ecs-helpers.md) - Utility functions +- [csv-generator.md](csv-generator.md) - CSV export +- [beats-generator.md](beats-generator.md) - Beats configs + +## Troubleshooting + +### Common Errors + +**ValueError: Missing mandatory attribute** +- Fix: Add required attribute to schema YAML +- Required: name, title, description, type, level + +**ValueError: Schema has root=true and cannot be reused** +- Fix: Don't try to reuse base or other root fieldsets +- Root fieldsets appear at document root, can't be nested + +**KeyError during reuse** +- Fix: Check reuse order; dependencies must be reused first +- Use `order: 1` for fieldsets that others depend on + +**Duplicate field names** +- Fix: Check for conflicting custom schemas +- Use `safe_merge_dicts` which raises on conflicts + +--- + +### Field Reuse Troubleshooting + +#### Problem: Field not appearing where expected + +**Symptom:** Expected `destination.user.group.id` but it doesn't exist + +**Cause:** Reuse order is wrong - `group` not reused into `user` before `user` reused into 
`destination` + +**Solution:** +```yaml +# Ensure correct order +group: + reusable: + order: 1 # ← FIRST + expected: + - user + +user: + reusable: + order: 2 # ← SECOND + expected: + - destination +``` + +**How to verify:** +```python +# Check what's in destination.user +from schema import visitor + +def show_fields(details): + if 'flat_name' in details['field_details']: + name = details['field_details']['flat_name'] + if name.startswith('destination.user'): + print(name) + +visitor.visit_fields(fields, field_func=show_fields) +``` + +#### Problem: Self-nesting appearing in reused locations + +**Symptom:** Expected `source.process.parent` NOT to exist, but it does + +**Cause:** Something went wrong with non-transitive logic, or it's actually foreign reuse + +**Solution:** +1. Check if `process.parent` is foreign reuse (wrong) or self-nesting (correct): +```yaml +process: + reusable: + expected: + - at: process # ← Self-nesting (correct) + as: parent + - source # ← Foreign reuse +``` + +2. If it's self-nesting, it should NOT appear at `source.process.parent` +3. If you WANT it everywhere, change to foreign reuse: +```yaml +# Create separate parent_process fieldset +parent_process: + reusable: + order: 1 + expected: + - at: process + as: parent +``` + +#### Problem: Reused fields have wrong OTel mappings + +**Symptom:** `destination.user.name` has different OTel mapping than `user.name` + +**Cause:** Need to use `otel_reuse` for location-specific mappings + +**Solution:** +```yaml +# In user schema +- name: name + otel_reuse: + - ecs: destination.user.name # ← Specific location + mapping: + relation: equivalent + attribute: destination.user.name + - ecs: source.user.name + mapping: + relation: equivalent + attribute: source.user.name +``` + +#### Problem: Can't reuse fieldset + +**Symptom:** `ValueError: Schema X has attribute root=true and cannot be reused` + +**Cause:** Trying to reuse a root fieldset (`base`, etc.) + +**Why:** Root fieldsets have fields at document root level. Can't nest them. + +**Solution:** Don't reuse root fieldsets. If you need similar functionality, create a new non-root fieldset. 
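+
+To list which fieldsets are marked `root: true` in your loaded schemas (and therefore can never be reuse targets), a short check against the nested structure described earlier works; `root` defaults to `false` after cleaning:
+
+```python
+from schema import loader, cleaner
+
+fields = loader.load_schemas()
+cleaner.clean(fields)
+
+for name, schema in fields.items():
+    if schema['schema_details'].get('root'):
+        print(f"{name} is a root fieldset - cannot appear under 'expected'")
+```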
+
+---
+
+### Subset Filtering Troubleshooting
+
+#### Problem: Subset includes too many fields
+
+**Symptom:** Wanted 50 fields, got 200
+
+**Cause:** Used the `fields: '*'` wildcard on the wrong fieldsets
+
+**Solution:** Be more specific:
+```yaml
+# Too broad
+http:
+  fields: '*'  # ← Gets ALL http fields
+
+# More specific
+http:
+  fields:
+    request:
+      fields:
+        method: {}
+        bytes: {}
+```
+
+**How to verify field count:**
+```bash
+# Count lines in the CSV (subtract 1 for the header row)
+wc -l generated/csv/fields.csv
+# Or
+grep -c "^" generated/csv/fields.csv
+```
+
+#### Problem: Subset excludes fields I need
+
+**Symptom:** Missing `http.request.method` in generated artifacts
+
+**Cause 1:** Forgot to include it in the subset definition
+
+**Solution:**
+```yaml
+http:
+  fields:
+    request:
+      fields:
+        method: {}  # ← Must explicitly include
+```
+
+**Cause 2:** Forgot intermediate fields in the path
+
+**Solution:**
+```yaml
+# Wrong - missing 'request' intermediate
+http:
+  fields:
+    method: {}  # ❌
+
+# Right - include full path
+http:
+  fields:
+    request:  # ✓
+      fields:
+        method: {}
+```
+
+**How to debug:**
+```bash
+# Check what's in the flat YAML
+grep "http.request.method" generated/ecs/ecs_flat.yml
+
+# If nothing is found, the field wasn't included in the subset
+```
+
+#### Problem: ValueError: 'fields' key expected, not found
+
+**Symptom:** `ValueError: 'fields' key expected, not found in subset for http`
+
+**Cause:** The subset entry for a fieldset is missing its `fields` key
+
+**Solution:**
+```yaml
+# Wrong
+http: {}  # ❌ Missing fields key
+
+# Right
+http:
+  fields: '*'  # ✓ Or specific fields
+```
+
+#### Problem: ValueError: 'fields' key not expected
+
+**Symptom:** `ValueError: 'fields' key not expected, found in subset for @timestamp`
+
+**Cause:** Trying to add nested fields to a leaf field (one that doesn't have children)
+
+**Solution:** (note that `@timestamp` must be quoted: a YAML key cannot start with an unquoted `@`)
+```yaml
+# Wrong - @timestamp is a leaf field, can't have nested fields
+base:
+  fields:
+    '@timestamp':
+      fields:  # ❌ @timestamp doesn't have nested fields
+        value: {}
+
+# Right - @timestamp is included as-is
+base:
+  fields:
+    '@timestamp': {}  # ✓ Just include it
+```
+
+#### Problem: Subset doesn't include reused fields
+
+**Symptom:** Subset has `destination` but not `destination.user.*`
+
+**Cause:** Subset filtering happens AFTER reuse, so reused fields must be listed explicitly under `destination`
+
+**Solution:**
+```yaml
+# Include both the parent and nested fields
+destination:
+  fields:
+    ip: {}
+    port: {}
+    user:  # ← Include reused fieldset
+      fields:
+        name: {}
+        email: {}
+```
+
+**Remember:** The subset sees the FINAL structure after reuse. If `user` is reused at `destination.user`, your subset must explicitly include the `destination.user` fields. 
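+
+**Programmatic check:** You can preview what a subset definition retains without running the whole generator by applying it directly with the `fields_subset` helper from `ecs_helpers`. This is a minimal sketch: it assumes you run it from the `scripts/` directory, and that the subset YAML nests the fieldset spec under a top-level `fields` key (it falls back to the whole document for a bare fieldset mapping).
+
+```python
+import yaml
+
+from generators import ecs_helpers
+from schema import cleaner, finalizer, loader
+
+# Build the finalized structure the subset filter operates on
+fields = loader.load_schemas()
+cleaner.clean(fields)
+finalizer.finalize(fields)
+
+with open('subsets/subset_a.yml') as f:
+    subset = yaml.safe_load(f)
+
+# Use subset['fields'] if the file wraps the spec, else the whole document
+spec = subset.get('fields', subset)
+retained = ecs_helpers.fields_subset(spec, fields)
+print(sorted(retained))  # fieldsets that survived the filter
+```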
+ +#### Problem: Multiple subsets not merging as expected + +**Symptom:** Field in subset A but not in final output + +**Cause:** Typo in subset definition or field path + +**Solution:** +```bash +# Check each subset file independently +python generator.py --subset subsets/subset_a.yml --semconv-version v1.24.0 +# Verify field exists in generated CSV + +python generator.py --subset subsets/subset_b.yml --semconv-version v1.24.0 +# Check second subset + +# Then try both together +python generator.py \ + --subset subsets/subset_a.yml subsets/subset_b.yml \ + --semconv-version v1.24.0 +``` + +**How to debug:** +```python +# Load and inspect subset definitions +import yaml + +with open('subsets/subset_a.yml') as f: + subset_a = yaml.safe_load(f) + print(subset_a) # Check structure matches schema +``` + +--- + +### Quick Debugging Commands + +**Check what fields exist after pipeline:** +```python +from schema import loader, cleaner, finalizer +fields = loader.load_schemas() +cleaner.clean(fields) +finalizer.finalize(fields) + +# List all fields +from schema import visitor +def print_field(details): + if 'flat_name' in details['field_details']: + print(details['field_details']['flat_name']) +visitor.visit_fields(fields, field_func=print_field) +``` + +**Check reuse metadata:** +```python +# See what was reused where +for name, schema in fields.items(): + if 'reused_here' in schema['schema_details']: + print(f"\n{name} contains reused fieldsets:") + for reuse in schema['schema_details']['reused_here']: + print(f" - {reuse['full']}") +``` + +**Verify subset syntax before running:** +```bash +# Validate YAML syntax +python -c "import yaml; yaml.safe_load(open('subsets/test.yml'))" + +# If no output, YAML is valid +# If error, fix YAML syntax first +``` + +### Strict Mode Issues + +If `--strict` fails with warnings: +- Review the warning messages +- Fix schema YAMLs to meet requirements +- Or run without `--strict` (warnings only) diff --git a/scripts/generator.py b/scripts/generator.py index fafa5abde7..451bac7258 100644 --- a/scripts/generator.py +++ b/scripts/generator.py @@ -15,6 +15,113 @@ # specific language governing permissions and limitations # under the License. +"""ECS Generator - Main Entry Point. + +This is the main orchestrator for the ECS artifact generation process. It coordinates +the entire pipeline from loading YAML schemas to generating all output artifacts. + +Pipeline Overview: + + 1. **Schema Processing** (schema/ modules): + - Load schemas from YAML files or git ref + - Clean and validate field definitions + - Perform field reuse across fieldsets + - Apply optional subset/exclude filters + + 2. **OTel Validation** (generators/otel.py): + - Validate OTel semantic conventions mappings + - Enrich fields with OTel stability information + + 3. **Intermediate Files** (generators/intermediate_files.py): + - Generate ecs_flat.yml (flat field dictionary) + - Generate ecs_nested.yml (nested by fieldset) + + 4. 
**Artifact Generation** (generators/ modules): + - CSV field reference (csv_generator.py) + - Elasticsearch templates (es_template.py) + - Beats field definitions (beats.py) + - Markdown documentation (markdown_fields.py) + +Command-Line Usage: + + Basic generation: + ```bash + python scripts/generator.py --semconv-version v1.24.0 + ``` + + From specific git version: + ```bash + python scripts/generator.py --ref v8.10.0 --semconv-version v1.24.0 + ``` + + With custom schemas: + ```bash + python scripts/generator.py \\ + --include custom/schemas/ \\ + --semconv-version v1.24.0 + ``` + + Generate subset only: + ```bash + python scripts/generator.py \\ + --subset schemas/subsets/minimal.yml \\ + --semconv-version v1.24.0 + ``` + + Strict validation: + ```bash + python scripts/generator.py \\ + --strict \\ + --semconv-version v1.24.0 + ``` + +Key Features: + - **Git Reference Support**: Generate from any ECS version tag + - **Custom Schema Merging**: Add custom fields to ECS + - **Subset Filtering**: Generate artifacts for specific field subsets + - **Exclude Filtering**: Remove deprecated fields for testing + - **Strict Mode**: Enforce stricter validation rules + - **Intermediate-Only Mode**: Generate only intermediate files (fast iteration) + - **Experimental Support**: Include experimental schemas with +exp version tag + +Output Structure: + + generated/ + ├── ecs/ + │ ├── ecs_flat.yml # Flat field dictionary + │ ├── ecs_nested.yml # Nested by fieldset + │ └── subset/ # Per-subset intermediate files + ├── elasticsearch/ + │ ├── composable/ # Modern composable templates + │ │ ├── template.json + │ │ └── component/*.json + │ └── legacy/ # Legacy single template + │ └── template.json + ├── beats/ + │ └── fields.ecs.yml # Beats field definitions + └── csv/ + └── fields.csv # CSV field reference + + docs/reference/ + ├── fields/ # Field documentation pages + └── otel/ # OTel alignment docs + +Environment Requirements: + - Python 3.7+ + - Git repository (for --ref support) + - OTel semantic conventions version specified + +Exit Codes: + - 0: Success + - Non-zero: Error during generation + +See Also: + - scripts/docs/schema-pipeline.md: Complete pipeline documentation + - scripts/docs/README.md: All module documentation index + - USAGE.md: User guide for running generators + - CONTRIBUTING.md: Development guidelines +""" + import argparse import os from typing import ( @@ -40,6 +147,76 @@ def main() -> None: + """Main entry point for ECS artifact generation. + + Orchestrates the complete pipeline: + 1. Parse command-line arguments + 2. Determine ECS version (from git ref or local) + 3. Setup output directories + 4. Run schema processing pipeline: + - Load schemas (from git ref or filesystem) + - Clean and validate + - Finalize (perform reuse) + - Apply filters (subset/exclude) + - Validate OTel mappings + 5. Generate intermediate files (flat & nested YAML) + 6. 
Generate all artifacts: + - CSV field reference + - Elasticsearch templates (composable & legacy) + - Beats field definitions + - Markdown documentation (optional) + + Pipeline Stages: + + Schema Processing: + - loader.load_schemas(): Load from YAML or git + - cleaner.clean(): Validate and normalize + - finalizer.finalize(): Perform field reuse + - subset_filter.filter(): Apply subset (if specified) + - exclude_filter.exclude(): Remove excluded fields (if specified) + + Validation: + - otel.OTelGenerator(): Validate OTel mappings + + Intermediate: + - intermediate_files.generate(): Create ecs_flat.yml & ecs_nested.yml + + Artifacts: + - csv_generator.generate(): CSV field reference + - es_template.generate(): Composable Elasticsearch template + - es_template.generate_legacy(): Legacy Elasticsearch template + - beats.generate(): Beats field definitions + - markdown_fields.generate(): Documentation (optional) + + Early Exit Conditions: + - --intermediate-only: Stop after generating intermediate files + - Custom schemas/subsets without --force-docs: Skip markdown generation + + Version Handling: + - Reads from 'version' file or git ref + - Appends '+exp' if experimental schemas included + - Used to tag all generated artifacts + + Output Directories: + - Default: generated/ and docs/reference/ + - Custom: {args.out}/generated/ and {args.out}/docs/reference/ + + Raises: + KeyError: If --semconv-version not provided + Various exceptions from pipeline stages on validation errors + + Example Execution: + >>> # From command line: + >>> # python scripts/generator.py --semconv-version v1.24.0 + Running generator. ECS version 8.11.0 + Loading schemas from local files + # ... pipeline output ... + + Note: + Markdown docs are skipped when using custom schemas/subsets unless + --force-docs is specified, since custom fields may not have proper + documentation structure. + """ args = argument_parser() if not args.semconv_version: @@ -98,6 +275,89 @@ def main() -> None: def argument_parser() -> argparse.Namespace: + """Parse and return command-line arguments for the generator. + + Configures argument parser with all supported options for controlling + the ECS generation pipeline. 
+ + Returns: + Parsed arguments namespace with all configuration options + + Arguments: + + Schema Loading: + - --ref: Git reference (tag/branch/commit) to load schemas from + Example: --ref v8.10.0 + Note: Also applies to experimental schemas if included + + - --include: Additional schema directories/files to include + Can specify multiple times or space-separated + Examples: + --include custom/schemas/ + --include experimental/schemas + --include custom/field1.yml custom/field2.yml + + Filtering: + - --subset: Subset definition files to filter included fields + Example: --subset schemas/subsets/minimal.yml + + - --exclude: Exclude definition files to remove fields + Example: --exclude excludes/deprecated.yml + Useful for testing deprecation impact + + Output: + - --out: Custom output directory (default: current directory) + Generated files go to {out}/generated/ + Docs go to {out}/docs/reference/ + + Elasticsearch Templates: + - --template-settings: JSON file with composable template settings + Overrides index_patterns, priority, settings + + - --template-settings-legacy: JSON file with legacy template settings + + - --mapping-settings: JSON file with mapping settings + Overrides date_detection, dynamic_templates + + Validation & Control: + - --strict: Enable strict validation mode + Warnings become errors + Enforces description length limits + Required for CI/CD + + - --intermediate-only: Generate only intermediate files + Skip artifact generation + Useful for debugging pipeline + + - --force-docs: Generate markdown docs even with custom schemas + By default, docs skipped with --include/--subset/--exclude + Use this to override and generate anyway + + - --semconv-version: OTel Semantic Conventions version (REQUIRED) + Example: --semconv-version v1.24.0 + Used for OTel mapping validation + + Special Handling: + - Empty --include from Makefile is cleaned up (converted to empty list) + - This allows Makefile to pass --include ${VAR} even when VAR is empty + + Example Usage: + >>> # Standard generation + >>> args = argument_parser() + >>> # python generator.py --semconv-version v1.24.0 + + >>> # With all options + >>> # python generator.py \\ + >>> # --ref v8.10.0 \\ + >>> # --include custom/schemas/ \\ + >>> # --subset subsets/minimal.yml \\ + >>> # --strict \\ + >>> # --semconv-version v1.24.0 + + Note: + --semconv-version is required. The main() function will raise KeyError + if not provided. + """ parser = argparse.ArgumentParser() parser.add_argument('--ref', action='store', help='Loads fields definitions from `./schemas` subdirectory from specified git reference. \ Note that "--include experimental/schemas" will also respect this git ref.') @@ -130,6 +390,52 @@ def argument_parser() -> argparse.Namespace: def read_version(ref: Optional[str] = None) -> str: + """Read ECS version string from file or git reference. + + Determines the ECS version to use for generated artifacts. 
Version can + come from either: + - Local 'version' file (default) + - Git ref's 'version' file (when --ref specified) + + Args: + ref: Optional git reference (tag/branch/commit) to load version from + + Returns: + ECS version string (e.g., '8.11.0') + + Side Effects: + Prints message indicating version source (local files vs git ref) + + Version Sources: + - ref=None: Reads from './version' file in current directory + - ref='v8.10.0': Reads from 'version' file in that git ref + + Processing: + - Reads version file content + - Strips trailing whitespace/newlines + - Returns clean version string + + Example: + >>> # Reading from local file + >>> version = read_version() + Loading schemas from local files + >>> print(version) + 8.11.0 + + >>> # Reading from git tag + >>> version = read_version('v8.10.0') + Loading schemas from git ref v8.10.0 + >>> print(version) + 8.10.0 + + Note: + Main() appends '+exp' to the version if experimental schemas are + included via --include experimental/schemas. + + Used By: + - main(): To determine version for all generated artifacts + - All generators: Version appears in metadata, headers, filenames + """ if ref: print('Loading schemas from git ref ' + ref) tree = ecs_helpers.get_tree_by_ref(ref) diff --git a/scripts/generators/beats.py b/scripts/generators/beats.py index 42c2ec09b8..3ba553bb26 100644 --- a/scripts/generators/beats.py +++ b/scripts/generators/beats.py @@ -15,6 +15,49 @@ # specific language governing permissions and limitations # under the License. +"""Beats Field Definition Generator. + +This module generates field definitions for Elastic Beats in YAML format. Beats +(Filebeat, Metricbeat, Packetbeat, etc.) are lightweight data shippers that need +field definitions to: +- Validate collected data structure +- Configure field behavior (indexing, doc_values, etc.) +- Provide field documentation to users +- Determine which fields are included by default + +The generator transforms ECS schemas into the Beats-specific YAML structure, +handling: +- Field hierarchies and grouping +- Multi-field configurations +- Default field selection (fields enabled by default) +- Contextual naming (relative to parent group) +- Type-specific parameters + +Output Structure: + The generated YAML follows Beats field definition format: + - Top-level 'ecs' group containing all fields + - Nested field groups for each fieldset + - Fields with Beats-specific properties + - default_field flags for selective field loading + +Default Fields: + Beats can't load all ~850 ECS fields by default (performance/memory concerns). + The generator uses an allowlist (beats_default_fields_allowlist.yml) to mark + which fields should be enabled by default. Users can enable additional fields + as needed. + +Output: + generated/beats/fields.ecs.yml - Beats field definitions + +Use Cases: + - Integration into Beats module configurations + - Field validation in data collection pipelines + - Documentation generation for Beats users + - Custom Beat development + +See also: scripts/docs/beats-generator.md for detailed documentation +""" + from os.path import join from collections import OrderedDict from typing import ( @@ -35,6 +78,40 @@ def generate( ecs_version: str, out_dir: str ) -> None: + """Generate Beats field definitions from ECS schemas. + + Main entry point for Beats field generation. Creates a Beats-compatible YAML + file with all ECS fields, properly structured with field groups and default_field + settings. 
+ + Args: + ecs_nested: Nested fieldset structure from intermediate_files.generate() + ecs_version: ECS version string (e.g., '8.11.0') + out_dir: Output directory (typically 'generated') + + Generates: + generated/beats/fields.ecs.yml - Beats field definitions + + Process: + 1. Filter out non-root reusable fieldsets (top_level=false) + 2. Process 'base' fieldset first (adds fields directly to root) + 3. Process other fieldsets in sorted order: + - If root=true: Add fields directly to root + - Otherwise: Create field group for fieldset + 4. Load default_fields allowlist + 5. Apply default_field flags based on allowlist + 6. Write formatted YAML with warning header + + Field Structure: + - Base fields appear at root level + - Other fieldsets appear as groups with nested fields + - Each field has Beats-specific properties only + - default_field flags control which fields load by default + + Example: + >>> generate(nested, '8.11.0', 'generated') + # Creates generated/beats/fields.ecs.yml + """ # base first ecs_nested = ecs_helpers.remove_top_level_reusable_false(ecs_nested) beats_fields: List[OrderedDict] = fieldset_field_array(ecs_nested['base']['fields'], ecs_nested['base']['prefix']) @@ -70,6 +147,41 @@ def generate( def set_default_field(fields, df_allowlist, df=False, path=''): + """Recursively set default_field flags based on allowlist. + + Beats can't load all ECS fields by default due to performance/memory constraints. + This function marks fields that should be loaded by default using an allowlist, + and propagates the setting through field hierarchies and multi-fields. + + Args: + fields: List of field definitions to process + df_allowlist: Set of field paths that should be default fields + df: Parent's default_field value (inherited by children) + path: Current field path for building full field names + + Behavior: + - Checks if field path is in allowlist + - Groups are default if top-level (path equals name) + - Children inherit parent's default_field setting + - Recursively processes group fields and multi-fields + - Inserts default_field key before 'fields' key for readability + + Default Field Logic: + 1. Field is in allowlist → default_field: true + 2. Top-level group → default_field: true + 3. Parent is default → children are default + 4. Otherwise → default_field: false + + Note: + Modifies fields list in place by adding/updating default_field property. + The allowlist is loaded from beats_default_fields_allowlist.yml. + + Example: + >>> fields = [{'name': 'method', 'type': 'keyword'}] + >>> allowlist = {'http.request.method'} + >>> set_default_field(fields, allowlist, path='http.request') + # fields[0] now has default_field: true + """ for fld in fields: fld_df = fld.get('default_field', df) fld_path = fld['name'] @@ -89,6 +201,47 @@ def fieldset_field_array( source_fields: Dict[str, Field], fieldset_prefix: str ) -> List[OrderedDict]: + """Convert ECS fields to Beats field array format. 
+ + Transforms ECS field definitions into Beats-compatible field structures by: + - Filtering to Beats-relevant properties only + - Converting field names to contextual names (relative to parent) + - Processing multi-fields appropriately + - Sorting fields alphabetically + + Args: + source_fields: Dictionary of ECS field definitions (keyed by flat_name) + fieldset_prefix: Prefix of parent fieldset (empty string for base) + + Returns: + Sorted list of Beats field definitions + + Field Properties Included: + Main fields: name, level, required, type, object_type, ignore_above, + multi_fields, format, input/output_format, output_precision, + description, example, enabled, index, doc_values, path, + scaling_factor, pattern + + Multi-fields: name, type, norms, default_field, normalizer, ignore_above + + Contextual Naming: + Beats uses relative field names within groups: + - ECS: 'http.request.method' + - Beats (in http group): 'request.method' + - Beats (in base): '@timestamp' + + Example: + >>> fields = { + ... 'http.request.method': { + ... 'name': 'method', + ... 'flat_name': 'http.request.method', + ... 'type': 'keyword', + ... 'description': 'HTTP method' + ... } + ... } + >>> fieldset_field_array(fields, 'http') + [OrderedDict([('name', 'request.method'), ('type', 'keyword'), ...])] + """ allowed_keys: List[str] = [ 'name', 'level', @@ -150,6 +303,28 @@ def write_beats_yaml( ecs_version: str, out_dir: str ) -> None: + """Write Beats field definitions to YAML file. + + Creates the final Beats YAML file with a warning header indicating it's + auto-generated and should not be edited directly. + + Args: + beats_file: Complete Beats field structure with all fields and metadata + ecs_version: ECS version string for the header + out_dir: Output directory + + Generates: + generated/beats/fields.ecs.yml + + File Structure: + - Warning header (DO NOT EDIT) + - Single YAML document wrapped in array + - Top-level 'ecs' key with title, description, fields + + Note: + The file is wrapped in an array ([beats_file]) because Beats expects + a YAML array of field definition documents. + """ ecs_helpers.make_dirs(join(out_dir, 'beats')) warning: str = file_header().format(version=ecs_version) ecs_helpers.yaml_dump(join(out_dir, 'beats/fields.ecs.yml'), [beats_file], preamble=warning) @@ -159,6 +334,18 @@ def write_beats_yaml( def file_header() -> str: + """Generate warning header for generated Beats YAML file. + + Returns header text warning users not to edit the file directly, as it's + auto-generated from ECS schemas. + + Returns: + Formatted header string with placeholder for version + + Usage: + >>> header = file_header().format(version='8.11.0') + # Inserts version into the warning message + """ return """ # WARNING! Do not edit this file directly, it was generated by the ECS project, # based on ECS version {version}. diff --git a/scripts/generators/csv_generator.py b/scripts/generators/csv_generator.py index 719b8914ce..1ca27e44fc 100644 --- a/scripts/generators/csv_generator.py +++ b/scripts/generators/csv_generator.py @@ -15,6 +15,47 @@ # specific language governing permissions and limitations # under the License. +"""CSV Field Reference Generator. + +This module generates a CSV (Comma-Separated Values) export of all ECS fields, +providing a simple, spreadsheet-compatible format for field reference and analysis. 
+ +The CSV format is useful for: +- **Quick field lookup** - Search and filter in spreadsheet applications +- **Data analysis** - Import into analytics tools for field statistics +- **Integration** - Easy parsing for custom tooling +- **Documentation** - Lightweight reference for presentations/reports +- **Diff analysis** - Compare field changes between versions + +CSV Structure: + Each row represents one field with columns: + - ECS_Version: Version of ECS (e.g., '8.11.0') + - Indexed: Whether field is indexed in Elasticsearch (true/false) + - Field_Set: Fieldset name (e.g., 'http', 'user', 'base') + - Field: Full dotted field name (e.g., 'http.request.method') + - Type: Field data type (e.g., 'keyword', 'long', 'ip') + - Level: Field level (core/extended/custom) + - Normalization: Normalization rules (array/to_lower/etc.) + - Example: Example value for the field + - Description: Short field description + +Multi-fields: + Fields with multi-fields (alternate representations) get additional rows, + one per multi-field variant (e.g., 'message.text' for 'message' field). + +Output: + generated/csv/fields.csv - Single CSV file with all fields + +Use Cases: + - Load into Excel/Google Sheets for analysis + - Import into database for field registry + - Parse for custom validation tools + - Compare versions with diff tools + - Generate reports on field usage + +See also: scripts/docs/csv-generator.md for detailed documentation +""" + import _csv import csv import sys @@ -31,12 +72,66 @@ def generate(ecs_flat: Dict[str, Field], version: str, out_dir: str) -> None: + """Generate CSV field reference from flat ECS field definitions. + + Main entry point for CSV generation. Creates a single CSV file containing + all ECS fields with their metadata, sorted with base fields first. + + Args: + ecs_flat: Flat field dictionary from intermediate_files.generate() + version: ECS version string (e.g., '8.11.0') + out_dir: Output directory (typically 'generated') + + Generates: + generated/csv/fields.csv - Complete field reference + + Process: + 1. Create output directory (generated/csv/) + 2. Sort fields with base fields first, then alphabetically + 3. Write CSV with header and field rows + + Example: + >>> from generators.intermediate_files import generate as gen_intermediate + >>> nested, flat = gen_intermediate(fields, 'generated/ecs', True) + >>> generate(flat, '8.11.0', 'generated') + # Creates generated/csv/fields.csv + """ ecs_helpers.make_dirs(join(out_dir, 'csv')) sorted_fields = base_first(ecs_flat) save_csv(join(out_dir, 'csv/fields.csv'), sorted_fields, version) def base_first(ecs_flat: Dict[str, Field]) -> List[Field]: + """Sort fields with base fields first, then remaining fields alphabetically. + + Base fields are top-level fields without dots in their names (e.g., '@timestamp', + 'message', 'tags'). These are placed first, followed by all other fields in + alphabetical order by field name. + + Args: + ecs_flat: Flat field dictionary mapping field names to definitions + + Returns: + List of field definitions in desired sort order + + Sorting logic: + 1. Base fields (no dots): @timestamp, ecs, labels, message, tags + 2. All other fields alphabetically: agent.*, as.*, client.*, ... + + Example: + >>> fields = { + ... 'http.request.method': {...}, + ... 'message': {...}, + ... 'agent.name': {...}, + ... '@timestamp': {...} + ... 
} + >>> sorted_fields = base_first(fields) + >>> [f['flat_name'] for f in sorted_fields] + ['@timestamp', 'message', 'agent.name', 'http.request.method'] + + Note: + Base fields appear at the top of the CSV for easy reference. + """ base_list: List[Field] = [] sorted_list: List[Field] = [] for field_name in sorted(ecs_flat): @@ -48,6 +143,54 @@ def base_first(ecs_flat: Dict[str, Field]) -> List[Field]: def save_csv(file: str, sorted_fields: List[Field], version: str) -> None: + """Write field definitions to CSV file. + + Creates a CSV file with one row per field (plus header row), including + all field metadata. Multi-fields (alternate representations) get their + own rows. + + Args: + file: Output file path + sorted_fields: List of field definitions in desired order + version: ECS version string + + CSV Format: + Columns: ECS_Version, Indexed, Field_Set, Field, Type, Level, + Normalization, Example, Description + + Example row: + 8.11.0,true,http,http.request.method,keyword,extended,array,GET,"HTTP method" + + Field Set Logic: + - Base fields (no dots): field_set = 'base' + - Other fields: field_set = first part before dot (e.g., 'http.x' -> 'http') + + Multi-fields: + If field has multi_fields, each gets its own row with: + - Same version, indexed, field_set, level, example, description + - Different field name and type (e.g., 'message.text', type='match_only_text') + - Empty normalization + + Indexed Column: + - 'true' if field is indexed (default) + - 'false' if field has index=false + - Lowercase for consistency + + Normalization Column: + - Comma-separated list of normalizations (e.g., 'array, to_lower') + - Empty if no normalizations + + Note: + - Uses QUOTE_MINIMAL to only quote fields containing special characters + - Unix line endings (\n) for consistency + - Python 2/3 compatible file opening + + Example output: + ECS_Version,Indexed,Field_Set,Field,Type,Level,Normalization,Example,Description + 8.11.0,true,base,@timestamp,date,core,,2016-05-23T08:05:34.853Z,Date/time + 8.11.0,true,http,http.request.method,keyword,extended,array,GET,HTTP method + 8.11.0,true,http,http.request.method.text,match_only_text,extended,,,HTTP method + """ open_mode: str = "wb" if sys.version_info >= (3, 0): open_mode: str = "w" diff --git a/scripts/generators/ecs_helpers.py b/scripts/generators/ecs_helpers.py index ab688f8aea..8ee576d411 100644 --- a/scripts/generators/ecs_helpers.py +++ b/scripts/generators/ecs_helpers.py @@ -15,6 +15,38 @@ # specific language governing permissions and limitations # under the License. +"""ECS Generator Helper Utilities. + +This module provides a collection of utility functions used across all ECS +generator scripts. These helpers handle common operations for: + +- **Dictionary Operations**: Copying, sorting, merging, ordering +- **File Operations**: YAML loading/saving, file globbing, directory creation +- **Git Operations**: Tree access, path checking +- **List Operations**: Subtraction, key extraction +- **Field Introspection**: Intermediate field detection, reusability filtering +- **Warnings**: Strict mode warning generation + +These utilities abstract common patterns and provide a consistent interface +for operations performed by multiple generators. They're designed to be +simple, reusable building blocks that compose together. 
+ +Key Design Principles: + - Single responsibility per function + - Type-safe with type hints + - Consistent error handling + - No side effects (except I/O operations) + +Common Use Cases: + - Sorting fieldsets by multiple criteria + - Loading schema files from git or filesystem + - Creating output directories safely + - Merging field definitions without conflicts + - Filtering fieldsets by reusability settings + +See also: scripts/docs/ecs-helpers.md for detailed documentation +""" + import glob import os import yaml @@ -43,6 +75,27 @@ def dict_copy_keys_ordered(dct: Field, copied_keys: List[str]) -> Field: + """Copy specified keys from dictionary in a specific order. + + Creates an OrderedDict containing only the specified keys in the order given. + Useful for ensuring consistent field ordering in output files. + + Args: + dct: Source dictionary + copied_keys: List of keys to copy, in desired order + + Returns: + OrderedDict with specified keys in given order + + Note: + Keys not present in source dictionary are silently skipped. + + Example: + >>> field = {'name': 'x', 'type': 'keyword', 'description': '...'} + >>> dict_copy_keys_ordered(field, ['name', 'type', 'level']) + OrderedDict([('name', 'x'), ('type', 'keyword')]) + # 'level' not in source, so skipped + """ ordered_dict = OrderedDict() for key in copied_keys: if key in dct: @@ -51,12 +104,65 @@ def dict_copy_keys_ordered(dct: Field, copied_keys: List[str]) -> Field: def dict_copy_existing_keys(source: Field, destination: Field, keys: List[str]) -> None: + """Copy specified keys from source to destination dictionary if they exist. + + Copies only keys that are present in the source dictionary, modifying + the destination dictionary in place. Commonly used to selectively copy + field properties based on field type. + + Args: + source: Dictionary to copy from + destination: Dictionary to copy to (modified in place) + keys: List of keys to attempt to copy + + Note: + - Destination is modified in place + - Keys not in source are silently skipped + - Existing keys in destination are overwritten + + Example: + >>> source = {'type': 'keyword', 'ignore_above': 1024, 'index': True} + >>> dest = {'type': 'keyword'} + >>> dict_copy_existing_keys(source, dest, ['ignore_above', 'norms']) + >>> dest + {'type': 'keyword', 'ignore_above': 1024} + # 'norms' not in source, so not copied + """ for key in keys: if key in source: destination[key] = source[key] def dict_sorted_by_keys(dct: FieldNestedEntry, sort_keys: List[str]) -> List[FieldNestedEntry]: + """Sort dictionary values by multiple sort criteria. + + Sorts the values of a dictionary by one or more keys within those values, + returning a list of sorted values. Commonly used to sort fieldsets by + group and name for consistent output ordering. + + Args: + dct: Dictionary of nested entries (e.g., fieldsets) + sort_keys: Key(s) to sort by (string or list of strings) + + Returns: + List of dictionary values sorted by specified criteria + + Behavior: + - If sort_keys is a string, converts to single-element list + - Sorts by first key, then second key (if provided), etc. + - Uses Python's natural sorting (numbers < strings) + + Example: + >>> fieldsets = { + ... 'http': {'name': 'http', 'group': 2, 'title': 'HTTP'}, + ... 'base': {'name': 'base', 'group': 1, 'title': 'Base'}, + ... 'agent': {'name': 'agent', 'group': 1, 'title': 'Agent'} + ... 
} + >>> sorted_fs = dict_sorted_by_keys(fieldsets, ['group', 'name']) + >>> [f['name'] for f in sorted_fs] + ['agent', 'base', 'http'] + # Sorted by group (1, 1, 2), then by name (agent, base, http) + """ if not isinstance(sort_keys, list): sort_keys = [sort_keys] @@ -80,6 +186,32 @@ def ordered_dict_insert( before_key: Optional[str] = None, after_key: Optional[str] = None ) -> None: + """Insert a key-value pair at a specific position in an ordered dictionary. + + Inserts a new key-value pair before or after a specified key, maintaining + the dictionary's order. If neither before_key nor after_key is found, the + new pair is appended to the end. + + Args: + dct: OrderedDict to modify (modified in place) + new_key: Key to insert + new_value: Value to associate with new_key + before_key: Insert before this key (takes precedence over after_key) + after_key: Insert after this key (used if before_key not specified) + + Note: + - Modifies dictionary in place + - If both before_key and after_key specified, before_key takes precedence + - If neither key is found, new pair appended to end + - If key already exists, it will be duplicated (use with caution) + + Example: + >>> from collections import OrderedDict + >>> d = OrderedDict([('a', 1), ('c', 3)]) + >>> ordered_dict_insert(d, 'b', 2, after_key='a') + >>> list(d.items()) + [('a', 1), ('b', 2), ('c', 3)] + """ output = OrderedDict() inserted: bool = False for key, value in dct.items(): @@ -98,7 +230,35 @@ def ordered_dict_insert( def safe_merge_dicts(a: Dict[Any, Any], b: Dict[Any, Any]) -> Dict[Any, Any]: - """Merges two dictionaries into one. If duplicate keys are detected a ValueError is raised.""" + """Safely merge two dictionaries, raising error on duplicate keys. + + Merges dictionary b into a deep copy of dictionary a. Raises ValueError + if any keys conflict, preventing accidental data loss or overwrites. + + Args: + a: First dictionary (will be deep copied) + b: Second dictionary to merge in + + Returns: + New dictionary with all keys from both dictionaries + + Raises: + ValueError: If any key exists in both dictionaries + + Note: + Dictionary a is deep copied, so original is not modified. + This ensures merge operation has no side effects. + + Example: + >>> a = {'x': 1, 'y': 2} + >>> b = {'z': 3} + >>> safe_merge_dicts(a, b) + {'x': 1, 'y': 2, 'z': 3} + + >>> c = {'y': 99} # Duplicate key + >>> safe_merge_dicts(a, c) + ValueError: Duplicate key found when merging dictionaries: y + """ c = deepcopy(a) for key in b: if key not in c: @@ -109,6 +269,45 @@ def safe_merge_dicts(a: Dict[Any, Any], b: Dict[Any, Any]) -> Dict[Any, Any]: def fields_subset(subset, fields): + """Extract a subset of fields based on subset specification. + + Recursively filters fields based on a subset specification, retaining + only the fieldsets and fields specified in the subset definition. + Used to generate partial ECS schemas (e.g., for specific use cases). 
+ + Args: + subset: Dictionary specifying which fieldsets/fields to include + fields: Complete fields dictionary to filter + + Returns: + Filtered fields dictionary containing only specified fields + + Raises: + ValueError: If unsupported options found in subset specification + + Subset specification format: + { + 'fieldset_name': { + 'fields': '*' | {'field1': {...}, 'field2': {...}} + } + } + + Behavior: + - Missing 'fields' key = include all fields in fieldset + - 'fields': '*' = include all fields in fieldset + - 'fields': {...} = recursively apply subset to nested fields + + Example: + >>> subset = { + ... 'http': {'fields': '*'}, # All HTTP fields + ... 'user': {'fields': { # Only specific user fields + ... 'name': {}, + ... 'email': {} + ... }} + ... } + >>> filtered = fields_subset(subset, all_fields) + # Returns only http.* and user.name, user.email + """ retained_fields = {} allowed_options = ['fields'] for key, val in subset.items(): @@ -126,6 +325,23 @@ def fields_subset(subset, fields): def yaml_ordereddict(dumper, data): + """YAML representer for OrderedDict that preserves key order. + + Custom YAML dumper function that serializes OrderedDict while maintaining + the order of keys. Registered with PyYAML to automatically handle OrderedDict + instances during yaml.dump(). + + Args: + dumper: YAML dumper instance + data: OrderedDict to represent + + Returns: + YAML MappingNode with keys in original order + + Note: + Primarily for Python 2 compatibility. Python 3.7+ dicts maintain + insertion order by default, making this less critical. + """ # YAML representation of an OrderedDict will be like a dictionary, but # respecting the order of the dictionary. # Almost sure it's unndecessary with Python 3. @@ -137,11 +353,30 @@ def yaml_ordereddict(dumper, data): return yaml.nodes.MappingNode(u'tag:yaml.org,2002:map', value) +# Register the representer globally yaml.add_representer(OrderedDict, yaml_ordereddict) def dict_clean_string_values(dict: Dict[Any, Any]) -> None: - """Remove superfluous spacing in all field values of a dict""" + """Remove leading/trailing whitespace from all string values in dictionary. + + Cleans up string values by stripping whitespace, useful for normalizing + field definitions loaded from YAML where formatting might vary. + + Args: + dict: Dictionary to clean (modified in place) + + Note: + - Only string values are modified + - Non-string values (numbers, bools, nested dicts) are left unchanged + - Modifies dictionary in place + + Example: + >>> data = {'name': ' field ', 'type': 'keyword', 'level': ' core '} + >>> dict_clean_string_values(data) + >>> data + {'name': 'field', 'type': 'keyword', 'level': 'core'} + """ for key in dict: value = dict[key] if isinstance(value, str): @@ -155,12 +390,47 @@ def dict_clean_string_values(dict: Dict[Any, Any]) -> None: def is_yaml(path: str) -> bool: - """Returns True if path matches an element of the yaml extensions set""" + """Check if a file path has a YAML extension. + + Determines if a file path ends with .yml or .yaml extension. + + Args: + path: File path to check + + Returns: + True if path has YAML extension, False otherwise + + Example: + >>> is_yaml('schemas/http.yml') + True + >>> is_yaml('output.json') + False + >>> is_yaml('file.test.yaml') + True + """ return set(path.split('.')[1:]).intersection(YAML_EXT) != set() def safe_list(o: Union[str, List[str]]) -> List[str]: - """converts o to a list if it isn't already a list""" + """Convert string or list to list, splitting on comma if needed. 
+ + Normalizes input to a list format, useful for handling flexible + function arguments that can be either strings or lists. + + Args: + o: String (comma-separated) or list of strings + + Returns: + List of strings + + Example: + >>> safe_list(['a', 'b', 'c']) + ['a', 'b', 'c'] + >>> safe_list('a,b,c') + ['a', 'b', 'c'] + >>> safe_list('single') + ['single'] + """ if isinstance(o, list): return o else: @@ -168,7 +438,30 @@ def safe_list(o: Union[str, List[str]]) -> List[str]: def glob_yaml_files(paths: List[str]) -> List[str]: - """Accepts string, or list representing a path, wildcard or folder. Returns list of matched yaml files""" + """Find all YAML files matching given paths, wildcards, or directories. + + Flexible file finder that handles: + - Direct file paths (schemas/http.yml) + - Wildcards (schemas/*.yml) + - Directories (schemas/ -> all YAML files in dir) + - Comma-separated strings ('path1,path2') + + Args: + paths: String or list of paths/wildcards/directories + + Returns: + Sorted list of matching YAML file paths + + Example: + >>> glob_yaml_files(['schemas/http.yml', 'schemas/user.yml']) + ['schemas/http.yml', 'schemas/user.yml'] + + >>> glob_yaml_files(['schemas/']) + ['schemas/agent.yml', 'schemas/base.yml', ...] + + >>> glob_yaml_files('schemas/*.yml') + ['schemas/agent.yml', 'schemas/base.yml', ...] + """ all_files: List[str] = [] for path in safe_list(paths): if is_yaml(path): @@ -180,12 +473,46 @@ def glob_yaml_files(paths: List[str]) -> List[str]: def get_tree_by_ref(ref: str) -> git.objects.tree.Tree: + """Get git tree object for a specific reference (branch, tag, commit). + + Retrieves the file tree from the current repository at a specific git + reference, allowing generators to load schemas from any point in history. + + Args: + ref: Git reference (branch name, tag, commit SHA) + + Returns: + Git tree object representing repository contents at that reference + + Example: + >>> tree = get_tree_by_ref('v8.10.0') + >>> tree['schemas']['http.yml'] # Access file from that version + """ repo: git.repo.base.Repo = git.Repo(os.getcwd()) commit: git.objects.commit.Commit = repo.commit(ref) return commit.tree def path_exists_in_git_tree(tree: git.objects.tree.Tree, file_path: str) -> bool: + """Check if a path exists in a git tree object. + + Tests whether a file or directory exists in a git tree without raising + an exception. + + Args: + tree: Git tree object to check + file_path: Path relative to tree root + + Returns: + True if path exists in tree, False otherwise + + Example: + >>> tree = get_tree_by_ref('main') + >>> path_exists_in_git_tree(tree, 'schemas/http.yml') + True + >>> path_exists_in_git_tree(tree, 'nonexistent.yml') + False + """ try: _ = tree[file_path] except KeyError: @@ -194,6 +521,18 @@ def path_exists_in_git_tree(tree: git.objects.tree.Tree, file_path: str) -> bool def usage_doc_files() -> List[str]: + """Get list of usage documentation files for fieldsets. + + Scans the docs/reference directory for usage documentation files + following the pattern ecs-{fieldset}-usage.md. + + Returns: + List of usage doc filenames (e.g., ['ecs-http-usage.md']) + + Note: + Returns empty list if docs/reference directory doesn't exist. + Used by markdown generator to link to usage docs when available. 
+ """ usage_docs_dir: str = os.path.join(os.path.dirname(__file__), '../../docs/reference') usage_docs_path: pathlib.PosixPath = pathlib.Path(usage_docs_dir) if usage_docs_path.is_dir(): @@ -202,12 +541,41 @@ def usage_doc_files() -> List[str]: def ecs_files() -> List[str]: - """Return the schema file list to load""" + """Get list of ECS schema files to load. + + Returns sorted list of all YAML files in the schemas directory. + This is the primary source of ECS field definitions. + + Returns: + Sorted list of schema file paths + + Example: + >>> ecs_files() + ['schemas/agent.yml', 'schemas/base.yml', 'schemas/http.yml', ...] + """ schema_glob: str = os.path.join(os.path.dirname(__file__), '../../schemas/*.yml') return sorted(glob.glob(schema_glob)) def make_dirs(path: str) -> None: + """Create directory and all parent directories if they don't exist. + + Safe wrapper around os.makedirs that handles existing directories + gracefully and provides clear error messages on failure. + + Args: + path: Directory path to create + + Raises: + OSError: If directory creation fails (with descriptive message) + + Note: + Uses exist_ok=True, so won't fail if directory already exists. + + Example: + >>> make_dirs('generated/elasticsearch/composable/component') + # Creates all missing parent directories + """ try: os.makedirs(path, exist_ok=True) except OSError as e: @@ -220,6 +588,25 @@ def yaml_dump( data: Dict[str, FieldNestedEntry], preamble: Optional[str] = None ) -> None: + """Write data to a YAML file with optional preamble text. + + Serializes dictionary to YAML format with human-friendly formatting. + Optionally prepends text (e.g., copyright header, comments). + + Args: + filename: Path to output file + data: Dictionary to serialize + preamble: Optional text to write before YAML content + + Note: + - Uses default_flow_style=False for readable multi-line format + - Supports Unicode characters + - Overwrites existing file + + Example: + >>> yaml_dump('output.yml', {'name': 'test'}, '# Auto-generated\\n') + # Creates file with comment header followed by YAML + """ with open(filename, 'w') as outfile: if preamble: outfile.write(preamble) @@ -227,6 +614,25 @@ def yaml_dump( def yaml_load(filename: str) -> Set[str]: + """Load and parse a YAML file. + + Reads a YAML file and parses it into Python data structures using + safe_load (prevents arbitrary code execution). + + Args: + filename: Path to YAML file + + Returns: + Parsed YAML content (typically dict or list) + + Note: + Uses yaml.safe_load for security (no arbitrary code execution). + + Example: + >>> data = yaml_load('schemas/http.yml') + >>> data['name'] + 'http' + """ with open(filename) as f: return yaml.safe_load(f.read()) @@ -234,12 +640,49 @@ def yaml_load(filename: str) -> Set[str]: def list_subtract(original: List[Any], subtracted: List[Any]) -> List[Any]: - """Subtract two lists. original = subtracted""" + """Remove all elements of one list from another. + + Returns a new list containing elements from original that are not + in subtracted. Useful for filtering lists. 
+ + Args: + original: List to subtract from + subtracted: Elements to remove + + Returns: + New list with subtracted elements removed + + Example: + >>> list_subtract([1, 2, 3, 4, 5], [2, 4]) + [1, 3, 5] + >>> list_subtract(['a', 'b', 'c'], ['b']) + ['a', 'c'] + """ return [item for item in original if item not in subtracted] def list_extract_keys(lst: List[Field], key_name: str) -> List[str]: - """Returns an array of values for 'key_name', from a list of dictionaries""" + """Extract values for a specific key from a list of dictionaries. + + Builds a list of values by extracting the same key from each dictionary + in the input list. Useful for converting list of objects to list of + specific attribute values. + + Args: + lst: List of dictionaries + key_name: Key to extract from each dictionary + + Returns: + List of values for the specified key + + Example: + >>> fields = [ + ... {'name': 'http', 'group': 2}, + ... {'name': 'user', 'group': 1} + ... ] + >>> list_extract_keys(fields, 'name') + ['http', 'user'] + """ acc = [] for d in lst: acc.append(d[key_name]) @@ -250,12 +693,57 @@ def list_extract_keys(lst: List[Field], key_name: str) -> List[str]: def is_intermediate(field: FieldEntry) -> bool: - """Encapsulates the check to see if a field is an intermediate field or a "real" field.""" + """Check if a field is an intermediate structural field (not a real data field). + + Intermediate fields exist only to provide hierarchical structure in schemas. + They don't represent actual data fields and should be excluded from most + output formats. + + Args: + field: Field entry to check + + Returns: + True if field is intermediate, False otherwise + + Example: + >>> field = {'field_details': {'intermediate': True, 'name': 'request'}} + >>> is_intermediate(field) + True + # 'http.request' is just structure, not a field + + >>> field = {'field_details': {'name': 'method', 'type': 'keyword'}} + >>> is_intermediate(field) + False + # 'http.request.method' is an actual field + """ return ('intermediate' in field['field_details'] and field['field_details']['intermediate']) def remove_top_level_reusable_false(ecs_nested: Dict[str, FieldNestedEntry]) -> Dict[str, FieldNestedEntry]: - """Returns same structure as ecs_nested, but skips all field sets with reusable.top_level: False""" + """Filter out fieldsets that should not appear at the root level. + + Returns a copy of ecs_nested excluding fieldsets with reusable.top_level=false. + These fieldsets are meant to be used only in specific reuse locations, not + at the event root. + + Args: + ecs_nested: Dictionary of nested fieldsets + + Returns: + Filtered dictionary excluding non-root fieldsets + + Example: + >>> nested = { + ... 'http': {'reusable': {'top_level': True}}, + ... 'geo': {'reusable': {'top_level': False}}, # Only for reuse + ... 'user': {} # No reusable setting = included + ... } + >>> filtered = remove_top_level_reusable_false(nested) + >>> 'geo' in filtered + False + >>> 'http' in filtered and 'user' in filtered + True + """ components: Dict[str, FieldNestedEntry] = {} for (fieldset_name, fieldset) in ecs_nested.items(): if fieldset.get('reusable', None): @@ -269,11 +757,27 @@ def remove_top_level_reusable_false(ecs_nested: Dict[str, FieldNestedEntry]) -> def strict_warning(msg: str) -> None: - """Call warnings.warn(msg) for operations that would throw an Exception - if operating in `--strict` mode. Allows a custom message to be passed. + """Issue a warning that would be an error in strict mode. 
+ + Generates a warning for issues that are tolerated in normal mode but would + cause an exception when the generator is run with --strict flag. This allows + schema developers to gradually fix issues without blocking the build. + + Args: + msg: Custom warning message describing the issue + + Note: + - Uses stacklevel=3 to show warning at caller's call site + - Automatically adds boilerplate about strict mode + - Warning will be converted to exception with --strict flag + + Example: + >>> strict_warning("Field 'user.name' is missing description") + UserWarning: Field 'user.name' is missing description - :param msg: custom text which will be displayed with wrapped boilerplate - for strict warning messages. + This will cause an exception when running in strict mode. + Warning check: + ... """ warn_message: str = f"{msg}\n\nThis will cause an exception when running in strict mode.\nWarning check:" warnings.warn(warn_message, stacklevel=3) diff --git a/scripts/generators/es_template.py b/scripts/generators/es_template.py index 94b9012f0d..c38beb1695 100644 --- a/scripts/generators/es_template.py +++ b/scripts/generators/es_template.py @@ -15,6 +15,52 @@ # specific language governing permissions and limitations # under the License. +"""Elasticsearch Template Generator. + +This module generates Elasticsearch index templates from ECS schemas. It supports +both modern composable templates and legacy single-file templates, producing +JSON files that can be directly installed into Elasticsearch. + +Composable Templates (Modern): + - One component template per ECS fieldset + - Main template that composes all components together + - Allows selective field adoption + - Recommended for Elasticsearch 7.8+ + +Legacy Templates (Deprecated): + - Single monolithic template with all fields + - Backwards compatibility for older Elasticsearch versions + - Simpler but less flexible + +The generator transforms ECS field definitions into Elasticsearch mapping syntax, +handling: +- Field type mappings (keyword, text, IP, etc.) +- Multi-field configurations +- Custom parameters (ignore_above, norms, etc.) +- Nested object structures +- Index and doc_values settings + +Key Components: + - generate(): Composable template generation entry point + - generate_legacy(): Legacy template generation entry point + - entry_for(): Converts ECS field to Elasticsearch mapping + - dict_add_nested(): Builds nested property structures + - Template settings: Configurable index patterns, priorities, settings + +Output Files: + Composable: + - generated/elasticsearch/composable/template.json + - generated/elasticsearch/composable/component/*.json (one per fieldset) + Legacy: + - generated/elasticsearch/legacy/template.json + +These templates are ready to be installed into Elasticsearch using: + PUT _index_template/ecs (composable) + PUT _template/ecs (legacy) + +See also: scripts/docs/es-template.md for detailed documentation +""" + import json import sys from typing import ( @@ -42,13 +88,74 @@ def generate( mapping_settings_file: str, template_settings_file: str ) -> None: - """This generates all artifacts for the composable template approach""" + """Generate all composable template artifacts for Elasticsearch. + + Creates the modern composable template approach where each ECS fieldset becomes + a separate component template. This allows users to selectively include only + the fieldsets they need, reducing mapping overhead. 
+ + Args: + ecs_nested: Nested fieldset structure from intermediate_files + ecs_version: ECS version string (e.g., '8.11.0') + out_dir: Output directory (typically 'generated') + mapping_settings_file: Path to JSON file with custom mapping settings, or None + template_settings_file: Path to JSON file with custom template settings, or None + + Generates: + - generated/elasticsearch/composable/component/{fieldset}.json (one per fieldset) + - generated/elasticsearch/composable/template.json (main template) + + Process: + 1. Generate individual component template for each fieldset + 2. Build list of component names following naming convention + 3. Generate main composable template that references all components + + Note: + Composable templates require Elasticsearch 7.8 or later. + Each component can be updated independently in production. + + Example: + >>> generate(nested, '8.11.0', 'generated', None, None) + # Creates: + # - generated/elasticsearch/composable/component/base.json + # - generated/elasticsearch/composable/component/agent.json + # - ... + # - generated/elasticsearch/composable/template.json + """ all_component_templates(ecs_nested, ecs_version, out_dir) component_names = component_name_convention(ecs_version, ecs_nested) save_composable_template(ecs_version, component_names, out_dir, mapping_settings_file, template_settings_file) def save_composable_template(ecs_version, component_names, out_dir, mapping_settings_file, template_settings_file): + """Save the main composable index template that references component templates. + + Creates the composable template JSON file that ties together all component + templates. This template defines index patterns, priority, settings, and + lists all component templates to compose. + + Args: + ecs_version: ECS version string + component_names: List of component template names to include + out_dir: Output directory + mapping_settings_file: Path to custom mapping settings JSON, or None + template_settings_file: Path to custom template settings JSON, or None + + Generates: + generated/elasticsearch/composable/template.json + + Template structure: + { + "index_patterns": ["try-ecs-*"], + "composed_of": ["ecs_8.11.0_base", "ecs_8.11.0_agent", ...], + "priority": 1, + "template": { + "mappings": {...}, + "settings": {...} + }, + "_meta": {...} + } + """ mappings_section = mapping_settings(mapping_settings_file) template = template_settings(ecs_version, mappings_section, template_settings_file, component_names=component_names) @@ -61,7 +168,35 @@ def all_component_templates( ecs_version: str, out_dir: str ) -> None: - """Generate one component template per field set""" + """Generate individual component templates for each ECS fieldset. + + Creates one JSON file per fieldset containing the Elasticsearch mapping + for all fields in that fieldset. Each component template is self-contained + and can be installed/updated independently. + + Args: + ecs_nested: Nested fieldset structure from intermediate_files + ecs_version: ECS version string + out_dir: Output directory + + Generates: + One file per fieldset in generated/elasticsearch/composable/component/: + - base.json + - agent.json + - http.json + - etc. + + Process: + 1. Filter out non-root reusable fieldsets (top_level=false) + 2. For each remaining fieldset: + a. Build nested property structure from flat field names + b. Convert each field to Elasticsearch mapping format + c. 
Save as component template JSON + + Note: + Uses dict_add_nested() to convert flat names like 'http.request.method' + into nested structure: {http: {properties: {request: {properties: {method: {...}}}}}} + """ component_dir: str = join(out_dir, 'elasticsearch/composable/component') ecs_helpers.make_dirs(component_dir) @@ -81,6 +216,44 @@ def save_component_template( out_dir: str, field_mappings: Dict ) -> None: + """Save a single component template JSON file. + + Creates a component template with the mappings for one fieldset, including + metadata about the ECS version and documentation link. + + Args: + template_name: Name of the fieldset (e.g., 'http', 'user') + field_level: Field level (core/extended/custom) - used to determine if doc link needed + ecs_version: ECS version string + out_dir: Component template directory + field_mappings: Nested dictionary of field mappings + + Generates: + {out_dir}/{template_name}.json + + Template structure: + { + "template": { + "mappings": { + "properties": { + "http": { + "properties": { + "request": {...}, + "response": {...} + } + } + } + } + }, + "_meta": { + "ecs_version": "8.11.0", + "documentation": "https://www.elastic.co/guide/en/ecs/current/ecs-http.html" + } + } + + Note: + Documentation URL is only added for ECS fields (not custom fields). + """ filename: str = join(out_dir, template_name) + ".json" reference_url: str = "https://www.elastic.co/guide/en/ecs/current/ecs-{}.html".format(template_name) @@ -102,6 +275,35 @@ def component_name_convention( ecs_version: str, ecs_nested: Dict[str, FieldNestedEntry] ) -> List[str]: + """Build list of component template names following ECS naming convention. + + Generates the standardized names for component templates that will be + referenced in the main composable template. Names follow the pattern: + ecs_{version}_{fieldset} + + Args: + ecs_version: ECS version string (e.g., '8.11.0' or '8.11.0+exp') + ecs_nested: Nested fieldset structure + + Returns: + Sorted list of component template names + + Name format: + - Version: Replace '+' with '-' (e.g., '8.11.0+exp' -> '8.11.0-exp') + - Fieldset: Lowercase name + - Pattern: 'ecs_{version}_{fieldset}' + + Examples: + >>> component_name_convention('8.11.0', nested) + ['ecs_8.11.0_agent', 'ecs_8.11.0_base', 'ecs_8.11.0_client', ...] + + >>> component_name_convention('8.11.0+exp', nested) + ['ecs_8.11.0-exp_agent', 'ecs_8.11.0-exp_base', ...] + + Note: + Only includes fieldsets with top_level=true (or no reusable setting). + Names are used in the 'composed_of' array in the main template. + """ version: str = ecs_version.replace('+', '-') names: List[str] = [] for (fieldset_name, fieldset) in ecs_helpers.remove_top_level_reusable_false(ecs_nested).items(): @@ -119,7 +321,49 @@ def generate_legacy( mapping_settings_file: str, template_settings_file: str ) -> None: - """Generate the legacy index template""" + """Generate legacy single-file index template for backwards compatibility. + + Creates the older-style index template where all field mappings are defined + in a single monolithic JSON file. This format is supported by all versions + of Elasticsearch but is less flexible than composable templates. + + Args: + ecs_flat: Flat field dictionary from intermediate_files + ecs_version: ECS version string + out_dir: Output directory + mapping_settings_file: Path to custom mapping settings JSON, or None + template_settings_file: Path to custom template settings JSON, or None + + Generates: + generated/elasticsearch/legacy/template.json + + Process: + 1. 
Iterate through all fields in sorted order + 2. Convert flat names to nested property structure + 3. Build complete mappings section + 4. Generate template with settings and mappings + + Template structure: + { + "index_patterns": ["try-ecs-*"], + "order": 1, + "settings": {...}, + "mappings": { + "_meta": {...}, + "date_detection": false, + "dynamic_templates": [...], + "properties": { + "agent": {...}, + "http": {...}, + ... + } + } + } + + Note: + Legacy templates are deprecated in Elasticsearch 7.8+ in favor of + composable templates. Use this only for backwards compatibility. + """ field_mappings = {} for flat_name in sorted(ecs_flat): field = ecs_flat[flat_name] @@ -138,6 +382,22 @@ def generate_legacy_template_version( out_dir: str, template_settings_file: str ) -> None: + """Create and save the legacy template JSON file. + + Builds the complete legacy template structure and writes it to disk. + + Args: + ecs_version: ECS version string + mappings_section: Complete mappings dictionary including properties + out_dir: Output directory + template_settings_file: Path to custom template settings JSON, or None + + Generates: + generated/elasticsearch/legacy/template.json + + Note: + Creates the legacy directory if it doesn't exist. + """ ecs_helpers.make_dirs(join(out_dir, 'elasticsearch', "legacy")) template: Dict = template_settings(ecs_version, mappings_section, template_settings_file, is_legacy=True) @@ -153,6 +413,53 @@ def dict_add_nested( name_parts: List[str], value: Dict ) -> None: + """Recursively build nested Elasticsearch properties structure from flat field name. + + Converts a flat field name like 'http.request.method' into nested Elasticsearch + mapping structure: + + { + "http": { + "properties": { + "request": { + "properties": { + "method": {} + } + } + } + } + } + + Args: + dct: Dictionary being built (modified in place) + name_parts: List of name components (e.g., ['http', 'request', 'method']) + value: Field mapping to place at the leaf (e.g., {'type': 'keyword'}) + + Behavior: + - Recursively creates nested 'properties' dictionaries + - If current level already exists as object type, skips (avoids overwriting) + - Sets value at the deepest nesting level + + Example: + >>> mapping = {} + >>> dict_add_nested(mapping, ['http', 'request', 'method'], {'type': 'keyword'}) + >>> mapping + { + 'http': { + 'properties': { + 'request': { + 'properties': { + 'method': {'type': 'keyword'} + } + } + } + } + } + + Note: + Modifies the dictionary in place. Used to build the complete mapping + structure from individual fields. + """ current_nesting: str = name_parts[0] rest_name_parts: List[str] = name_parts[1:] if len(rest_name_parts) > 0: @@ -171,6 +478,58 @@ def dict_add_nested( def entry_for(field: Field) -> Dict: + """Convert an ECS field definition to Elasticsearch mapping format. + + Transforms ECS field metadata into the appropriate Elasticsearch mapping + configuration, handling type-specific parameters and multi-fields. + + Args: + field: ECS field definition with type, parameters, multi_fields, etc. 
+ + Returns: + Dictionary containing Elasticsearch mapping for this field + + Mapping rules by type: + - keyword/flattened: Copy ignore_above, synthetic_source_keep + - constant_keyword: Copy value parameter + - text: Copy norms setting + - alias: Copy path (field being aliased) + - scaled_float: Copy scaling_factor + - object/nested: Copy enabled flag if false + - All others: Copy index, doc_values if index=false + + Multi-fields: + If field has multi_fields, creates 'fields' section with alternate + representations (e.g., keyword field with text.match_phrase multi-field) + + Custom parameters: + If field has 'parameters' dict, merges directly into mapping + + Example: + >>> field = { + ... 'type': 'keyword', + ... 'ignore_above': 1024, + ... 'multi_fields': [{ + ... 'name': 'text', + ... 'type': 'match_only_text' + ... }] + ... } + >>> entry_for(field) + { + 'type': 'keyword', + 'ignore_above': 1024, + 'fields': { + 'text': {'type': 'match_only_text'} + } + } + + Raises: + KeyError: If required field properties are missing + + Note: + This function handles the core logic of translating ECS semantics + into Elasticsearch mapping syntax. + """ field_entry: Dict = {'type': field['type']} try: if field['type'] == 'object' or field['type'] == 'nested': @@ -214,6 +573,30 @@ def entry_for(field: Field) -> Dict: def mapping_settings(mapping_settings_file: str) -> Dict: + """Load mapping settings from file or use defaults. + + Mapping settings control how Elasticsearch handles unmapped fields and + other mapping behaviors like date detection and dynamic templates. + + Args: + mapping_settings_file: Path to JSON file with custom settings, or None/empty + + Returns: + Dictionary with mapping settings including: + - date_detection: Whether to auto-detect date fields + - dynamic_templates: Rules for mapping unmapped fields + + Example custom settings file: + { + "date_detection": false, + "dynamic_templates": [{ + "strings_as_keyword": { + "match_mapping_type": "string", + "mapping": {"type": "keyword", "ignore_above": 1024} + } + }] + } + """ if mapping_settings_file: with open(mapping_settings_file) as f: mappings = json.load(f) @@ -229,6 +612,47 @@ def template_settings( is_legacy: Optional[bool] = False, component_names: Optional[List[str]] = None ) -> Dict: + """Load and finalize template settings (index patterns, priority, settings). + + Template settings define which indices the template applies to, its priority, + index settings, and other metadata. Can be customized via JSON file or use defaults. + + Args: + ecs_version: ECS version string + mappings_section: Mapping settings and field properties + template_settings_file: Path to custom template settings JSON, or None + is_legacy: Whether generating legacy template format + component_names: List of component names (for composable templates only) + + Returns: + Complete template dictionary ready to save + + Structure for composable: + { + "index_patterns": ["try-ecs-*"], + "composed_of": ["ecs_8.11.0_base", ...], + "priority": 1, + "template": { + "settings": {...}, + "mappings": {...} + }, + "_meta": {"ecs_version": "8.11.0", ...} + } + + Structure for legacy: + { + "index_patterns": ["try-ecs-*"], + "order": 1, + "settings": {...}, + "mappings": { + "_meta": {...}, + "properties": {...} + } + } + + Note: + Calls finalize_template() to merge mappings and component names. 
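+
+    Example (illustrative; with default settings, i.e. no settings files):
+        >>> mappings = mapping_settings(None)
+        >>> template = template_settings('8.11.0', mappings, None,
+        ...                              component_names=['ecs_8.11.0_base'])
+        >>> template['index_patterns']
+        ['try-ecs-*']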
+ """ if template_settings_file: with open(template_settings_file) as f: template = json.load(f) @@ -250,6 +674,31 @@ def finalize_template( mappings_section: Dict, component_names: List[str] ) -> None: + """Finalize template by merging mappings and metadata. + + Completes the template structure by adding mappings, component references + (for composable), and metadata. Handles structural differences between + legacy and composable template formats. + + Args: + template: Base template dictionary (modified in place) + ecs_version: ECS version string + is_legacy: Whether this is a legacy template + mappings_section: Complete mappings with properties + component_names: List of component template names (composable only) + + Legacy modifications: + - Mappings placed directly under 'mappings' key + - Moves _meta from root into mappings section + + Composable modifications: + - Mappings placed under template.mappings + - Adds composed_of array with component names + - Adds _meta at root level + + Note: + Modifies the template dictionary in place. + """ if is_legacy: if mappings_section: template['mappings'] = mappings_section @@ -269,6 +718,23 @@ def finalize_template( def save_json(file: str, data: Dict) -> None: + """Save dictionary as formatted JSON file. + + Writes JSON with consistent formatting: 2-space indentation, sorted keys, + and trailing newline. + + Args: + file: Path to output file + data: Dictionary to serialize + + Format: + - 2-space indentation for readability + - Sorted keys for consistent diffs + - Trailing newline (Unix convention) + + Note: + Python 2/3 compatible (uses binary mode for Python 2). + """ open_mode = "wb" if sys.version_info >= (3, 0): open_mode = "w" @@ -278,6 +744,27 @@ def save_json(file: str, data: Dict) -> None: def default_template_settings(ecs_version: str) -> Dict: + """Generate default settings for composable template. + + Provides sensible defaults for a composable template including index + patterns, priority, and index settings. + + Args: + ecs_version: ECS version string + + Returns: + Template settings dictionary with: + - index_patterns: Matches 'try-ecs-*' indices (safe for testing) + - priority: 1 (very low, won't override production templates) + - codec: best_compression (saves disk space) + - total_fields.limit: 2000 (enough for all ECS fields) + + Note: + These are sample settings. Production use should customize: + - index_patterns to match actual indices + - priority based on template precedence needs + - settings based on cluster capacity and use case + """ return { "index_patterns": ["try-ecs-*"], "_meta": { @@ -301,6 +788,25 @@ def default_template_settings(ecs_version: str) -> Dict: def default_legacy_template_settings(ecs_version: str) -> Dict: + """Generate default settings for legacy template. + + Provides sensible defaults for a legacy template with higher total_fields + limit and refresh_interval setting. + + Args: + ecs_version: ECS version string + + Returns: + Legacy template settings with: + - index_patterns: Matches 'try-ecs-*' indices + - order: 1 (low priority) + - total_fields.limit: 10000 (legacy templates need higher limits) + - refresh_interval: 5s (balance between real-time and performance) + + Note: + Legacy templates require higher total_fields limits because they + include all mappings in one template. Adjust for production use. 
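+
+    Example (illustrative):
+        >>> settings = default_legacy_template_settings('8.11.0')
+        >>> settings['index_patterns']
+        ['try-ecs-*']
+        >>> settings['_meta']
+        {'version': '8.11.0'}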
+ """ return { "index_patterns": ["try-ecs-*"], "_meta": {"version": ecs_version}, @@ -319,6 +825,27 @@ def default_legacy_template_settings(ecs_version: str) -> Dict: def default_mapping_settings() -> Dict: + """Generate default mapping settings for dynamic field handling. + + Provides sensible defaults for how Elasticsearch handles unmapped fields. + These settings prevent common issues like: + - Automatic date detection causing mapping conflicts + - String fields being mapped as text (memory intensive) + + Returns: + Mapping settings with: + - date_detection: false (prevents auto-detection of date strings) + - dynamic_templates: Maps unmapped strings as keyword with ignore_above + + Dynamic template behavior: + All unmapped string fields become: + - type: keyword (not text) + - ignore_above: 1024 (truncates very long strings) + + Note: + These settings apply to fields not explicitly defined in ECS. + Customize based on your data characteristics. + """ return { "date_detection": False, "dynamic_templates": [ diff --git a/scripts/generators/intermediate_files.py b/scripts/generators/intermediate_files.py index 94787c4156..23138ede7b 100644 --- a/scripts/generators/intermediate_files.py +++ b/scripts/generators/intermediate_files.py @@ -15,6 +15,44 @@ # specific language governing permissions and limitations # under the License. +"""Intermediate File Generator. + +This module generates standardized intermediate representations of ECS schemas +that serve as the foundation for all other output formats. It produces two +key representations: + +1. **Flat Format** (ecs_flat.yml): Single-level dictionary of all fields + - Keys: Full dotted field names (e.g., 'http.request.method') + - Values: Complete field definitions with metadata + - Used by: CSV generator, some template generators + - Excludes: Non-root reusable fieldsets (top_level=false) + +2. **Nested Format** (ecs_nested.yml): Hierarchical grouping by fieldset + - Keys: Fieldset names (e.g., 'http', 'user', 'process') + - Values: Fieldset metadata plus nested 'fields' dictionary + - Used by: Markdown generator, Elasticsearch templates, Beats + - Includes: All fieldsets regardless of top_level setting + +These intermediate formats provide a stable, normalized interface between +schema processing and artifact generation. This separation allows: +- Multiple generators to consume the same standardized data +- Schema evolution without breaking downstream generators +- Easy debugging of the transformation pipeline + +Key Components: + - generate(): Main entry point producing both formats + - generate_flat_fields(): Creates flat field dictionary + - generate_nested_fields(): Creates nested fieldset hierarchy + - Visitor pattern: Used for efficient field traversal + +Output Files: + - generated/ecs/ecs.yml: Raw processed schemas (debugging only) + - generated/ecs/ecs_flat.yml: Flat field representation + - generated/ecs/ecs_nested.yml: Nested fieldset representation + +See also: scripts/docs/intermediate-files.md for detailed documentation +""" + import copy from os.path import join from typing import ( @@ -36,6 +74,44 @@ def generate( out_dir: str, default_dirs: bool ) -> Tuple[Dict[str, FieldNestedEntry], Dict[str, Field]]: + """Generate all intermediate file representations from processed schemas. + + This is the main entry point for intermediate file generation. It orchestrates + the creation of both flat and nested representations and saves them to YAML + files. These files serve as the normalized interface for all downstream + generators. 
+ + Args: + fields: Processed field entries from schema loader/cleaner/finalizer + out_dir: Output directory path (typically 'generated/ecs') + default_dirs: If True, also save raw ecs.yml for debugging + + Returns: + Tuple of (nested, flat) dictionaries: + - nested: Fieldsets organized hierarchically with metadata + - flat: All fields in single-level dictionary by dotted name + + Generates files: + - {out_dir}/ecs_flat.yml: Flat field representation + - {out_dir}/ecs_nested.yml: Nested fieldset representation + - {out_dir}/ecs.yml: Raw fields (only if default_dirs=True) + + Note: + Creates output directory if it doesn't exist. + The returned dictionaries are also used directly by some generators + without reading back from the YAML files. + + Example: + >>> from schema import loader, cleaner, finalizer + >>> fields = loader.load_schemas() + >>> cleaner.clean(fields) + >>> finalizer.finalize(fields) + >>> nested, flat = generate(fields, 'generated/ecs', True) + >>> len(flat) # Number of fields in flat representation + 850 + >>> len(nested) # Number of fieldsets + 45 + """ ecs_helpers.make_dirs(join(out_dir)) # Should only be used for debugging ECS development @@ -50,7 +126,48 @@ def generate( def generate_flat_fields(fields: Dict[str, FieldEntry]) -> Dict[str, Field]: - """Generate ecs_flat.yml""" + """Generate flat field representation mapping dotted names to field definitions. + + Creates a single-level dictionary where every field (including nested ones) + is represented by its full dotted name as the key. This format is useful for: + - Quick field lookups by name + - CSV generation + - Simple iteration over all fields + + Args: + fields: Processed field entries from schema pipeline + + Returns: + Dictionary mapping field flat_names to field definitions: + { + 'http.request.method': { + 'name': 'method', + 'flat_name': 'http.request.method', + 'type': 'keyword', + 'description': '...', + ... + }, + ... + } + + Processing steps: + 1. Filter out non-root reusable fieldsets (top_level=false) + 2. Use visitor pattern to traverse all fields + 3. Accumulate fields in flat dictionary + 4. Remove internal-only attributes + + Note: + - Excludes intermediate fields (used only for nesting) + - Excludes fieldsets marked with top_level=false + - Each field appears only once by its canonical flat_name + + Example: + >>> flat = generate_flat_fields(fields) + >>> flat['http.request.method']['type'] + 'keyword' + >>> list(flat.keys())[:3] + ['@timestamp', 'agent.build.original', 'agent.ephemeral_id'] + """ filtered: Dict[str, FieldEntry] = remove_non_root_reusables(fields) flattened: Dict[str, Field] = {} visitor.visit_fields_with_memo(filtered, accumulate_field, flattened) @@ -58,7 +175,44 @@ def generate_flat_fields(fields: Dict[str, FieldEntry]) -> Dict[str, Field]: def accumulate_field(details: FieldEntry, memo: Field) -> None: - """Visitor function that accumulates all field details in the memo dict""" + """Visitor callback that accumulates field definitions in a flat dictionary. + + This function is called by the visitor pattern for each field encountered + during traversal. It extracts the field definition, cleans it, and adds + it to the memo dictionary using the flat_name as the key. 
+ + Args: + details: Field entry containing field_details and possibly schema_details + memo: Dictionary being accumulated with field definitions (modified in place) + + Behavior: + - Skips schema-level entries (fieldset definitions) + - Skips intermediate fields (used only for structure, not actual fields) + - Deep copies field details to avoid mutation + - Removes internal attributes not needed in output + - Adds field to memo dictionary by flat_name + + Note: + This is a callback function used with visitor.visit_fields_with_memo(). + It modifies the memo dictionary in place rather than returning a value. + + Example: + >>> memo = {} + >>> field_entry = { + ... 'field_details': { + ... 'flat_name': 'http.request.method', + ... 'name': 'method', + ... 'type': 'keyword', + ... 'node_name': 'method', # Will be removed + ... 'intermediate': False # Will be removed + ... } + ... } + >>> accumulate_field(field_entry, memo) + >>> 'http.request.method' in memo + True + >>> 'node_name' in memo['http.request.method'] + False + """ if 'schema_details' in details or ecs_helpers.is_intermediate(details): return field_details: Field = copy.deepcopy(details['field_details']) @@ -69,7 +223,60 @@ def accumulate_field(details: FieldEntry, memo: Field) -> None: def generate_nested_fields(fields: Dict[str, FieldEntry]) -> Dict[str, FieldNestedEntry]: - """Generate ecs_nested.yml""" + """Generate nested fieldset representation with hierarchical structure. + + Creates a dictionary where each fieldset is a top-level entry containing: + - Fieldset metadata (name, title, description, group, etc.) + - Schema details (reusability, nesting information) + - Nested 'fields' dictionary with all fields in that fieldset + + This format preserves the logical grouping of fields by fieldset and + includes metadata about how fieldsets relate to each other (reuse, + nesting, etc.). Used by most generators including markdown docs. + + Args: + fields: Processed field entries from schema pipeline + + Returns: + Dictionary mapping fieldset names to fieldset definitions: + { + 'http': { + 'name': 'http', + 'title': 'HTTP', + 'group': 2, + 'description': '...', + 'reusable': {...}, + 'fields': { + 'http.request.method': {...}, + 'http.response.status_code': {...}, + ... + } + }, + ... + } + + Processing steps: + 1. For each fieldset, merge field_details and schema_details + 2. Remove internal attributes (node_name, dashed_name, etc.) + 3. Use visitor to accumulate all fields in the fieldset + 4. Store fields in nested 'fields' dictionary + 5. Clean up conditional attributes (root=false removed) + + Note: + - Includes ALL fieldsets, even those with top_level=false + - Consumers of this format should check top_level flag themselves + - Each fieldset's fields are in a flat dict (not hierarchical) + - The "nesting" refers to grouping by fieldset, not field hierarchy + + Example: + >>> nested = generate_nested_fields(fields) + >>> nested['http']['title'] + 'HTTP' + >>> len(nested['http']['fields']) + 25 + >>> list(nested.keys())[:3] + ['agent', 'as', 'base'] + """ nested: Dict[str, FieldNestedEntry] = {} # Flatten each field set, but keep all resulting fields nested under their # parent/host field set. 
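+    # e.g. nested['http'] ends up holding the fieldset metadata, while
+    # nested['http']['fields']['http.request.method'] holds that field's definition.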
@@ -101,22 +308,86 @@ def generate_nested_fields(fields: Dict[str, FieldEntry]) -> Dict[str, FieldNest def remove_internal_attributes(field_details: Field) -> None: - """Remove attributes only relevant to the deeply nested structure, but not to ecs_flat/nested.yml.""" + """Remove internal-only attributes from field definitions before output. + + Certain attributes are used during schema processing but aren't relevant + in the final intermediate file outputs. This function removes them to + keep the output files clean and focused on user-facing information. + + Args: + field_details: Field definition dictionary (modified in place) + + Attributes removed: + - node_name: Internal identifier used during tree traversal + - intermediate: Flag for structural fields (not actual data fields) + + Note: + Modifies the field_details dictionary in place. + Uses pop() with None default to safely handle missing keys. + + Example: + >>> field = { + ... 'name': 'method', + ... 'flat_name': 'http.request.method', + ... 'type': 'keyword', + ... 'node_name': 'method', # Internal + ... 'intermediate': False # Internal + ... } + >>> remove_internal_attributes(field) + >>> 'node_name' in field + False + >>> 'name' in field + True + """ field_details.pop('node_name', None) field_details.pop('intermediate', None) def remove_non_root_reusables(fields_nested: Dict[str, FieldEntry]) -> Dict[str, FieldEntry]: - """ - Remove field sets that have top_level=false from the root of the field definitions. + """Filter out fieldsets marked as non-root reusable (top_level=false). + + Some fieldsets are designed only to be reused in specific locations (via + the reuse mechanism) and should never appear at the root level of events. + For example, 'geo' might only be valid under 'client.geo', 'server.geo', + etc., but not as standalone 'geo.*' fields at the event root. + + This filtering is ONLY applied to the flat representation, where having + non-root fields at the top level would be confusing and incorrect. The + nested representation keeps all fieldsets so consumers have complete + information about each fieldset definition. + + Args: + fields_nested: Complete dictionary of field entries + + Returns: + Filtered dictionary containing only: + - Fieldsets without 'reusable' metadata (always included) + - Fieldsets with reusable.top_level=true + + Excludes: + - Fieldsets with reusable.top_level=false - This attribute means they're only meant to be in the "reusable/expected" locations - and not at the root of user's events. + Note: + This implements an allow-list approach: fieldsets are included by + default unless explicitly marked as non-root. - This is only relevant for the 'flat' field representation. The nested one - still needs to keep all field sets at the root of the YAML file, as it - the official information about each field set. It's the responsibility of - users consuming ecs_nested.yml to skip the field sets with top_level=false. + Example: + >>> fields = { + ... 'http': {'schema_details': {}}, # No reusable - included + ... 'geo': {'schema_details': { + ... 'reusable': {'top_level': False} # Excluded + ... }}, + ... 'user': {'schema_details': { + ... 'reusable': {'top_level': True} # Included + ... }} + ... 
} + >>> filtered = remove_non_root_reusables(fields) + >>> 'http' in filtered + True + >>> 'geo' in filtered + False + >>> 'user' in filtered + True """ fields: Dict[str, FieldEntry] = {} for (name, field) in fields_nested.items(): diff --git a/scripts/generators/markdown_fields.py b/scripts/generators/markdown_fields.py index 87be2acb8c..e8433130ff 100644 --- a/scripts/generators/markdown_fields.py +++ b/scripts/generators/markdown_fields.py @@ -15,6 +15,34 @@ # specific language governing permissions and limitations # under the License. +"""Markdown Documentation Generator. + +This module generates comprehensive markdown documentation from ECS field schemas. +It produces human-readable reference documentation including: +- Field reference pages for each fieldset +- OTel alignment overview and detailed mappings +- Index pages and cross-references +- Usage documentation integration + +The generator uses Jinja2 templates to render structured data into markdown format, +creating the official ECS documentation published on elastic.co. + +Key Components: + - generate(): Main entry point for documentation generation + - page_*(): Template rendering functions for different page types + - Helper functions: Field sorting, reuse tracking, allowed values extraction + - Jinja2 integration: Template loading and rendering + +Templates used (from scripts/templates/): + - index.j2: Main index page + - fieldset.j2: Individual fieldset documentation + - ecs_field_reference.j2: Complete field reference + - otel_alignment_details.j2: Detailed OTel mappings + - otel_alignment_overview.j2: OTel alignment summary + +See also: scripts/docs/markdown-generator.md for detailed documentation +""" + from functools import wraps import os.path as path import os @@ -26,6 +54,40 @@ def generate(nested, docs_only_nested, ecs_generated_version, semconv_version, otel_generator, out_dir): + """Generate all markdown documentation files from ECS schemas. + + This is the main entry point for markdown generation. It orchestrates the + creation of all documentation pages including: + - Main index page + - OTel alignment overview and details + - Complete field reference + - Individual fieldset pages (one per ECS fieldset) + + Args: + nested: Dictionary of nested fieldsets with field hierarchies + docs_only_nested: Additional fields used only in documentation + ecs_generated_version: ECS version string (e.g., '8.11.0' or '8.11.0+exp') + semconv_version: OTel semantic conventions version (e.g., 'v1.24.0') + otel_generator: OTelGenerator instance for alignment summaries + out_dir: Output directory path for generated markdown files + + Generates files: + - index.md: Main documentation index + - ecs-otel-alignment-details.md: Detailed field-by-field mappings + - ecs-otel-alignment-overview.md: Summary statistics + - ecs-field-reference.md: Complete field reference + - ecs-{fieldset}.md: One file per fieldset (e.g., ecs-http.md) + + Note: + - Creates output directory if it doesn't exist + - Strips leading 'v' from semconv_version for display + - Uses Jinja2 templates from scripts/templates/ + + Example: + >>> from generators.otel import OTelGenerator + >>> otel_gen = OTelGenerator('v1.24.0') + >>> generate(nested, docs_only, '8.11.0', 'v1.24.0', otel_gen, 'docs/reference') + """ ecs_helpers.make_dirs(out_dir) @@ -48,10 +110,25 @@ def generate(nested, docs_only_nested, ecs_generated_version, semconv_version, o def render_fieldset_reuse_text(fieldset): - """Renders the expected nesting locations - if the the `reusable` object is present. 
+ """Extract and sort expected nesting locations for reusable fieldsets. + + When a fieldset is marked as reusable, this function extracts the list of + locations where it's expected to be nested and sorts them alphabetically. + Used in documentation to show where users can expect to find these fields. + + Args: + fieldset: Fieldset dictionary potentially containing 'reusable' metadata + + Returns: + Generator of sorted full field paths (e.g., ['client.as', 'destination.as']), + or None if the fieldset is not reusable - :param fieldset: The fieldset to evaluate + Example: + >>> fieldset = {'reusable': {'expected': [ + ... {'full': 'destination.geo'}, {'full': 'client.geo'} + ... ]}} + >>> list(render_fieldset_reuse_text(fieldset)) + ['client.geo', 'destination.geo'] """ if not fieldset.get('reusable'): return None @@ -61,9 +138,31 @@ def render_fieldset_reuse_text(fieldset): def render_nestings_reuse_section(fieldset): - """Renders the reuse section entries. - - :param fieldset: The target fieldset + """Build reuse section data showing which fieldsets are nested here. + + Creates a list of metadata about other fieldsets that are reused (nested) + within this fieldset. Each entry includes the nesting path, schema name, + description, and any special properties like beta status or normalization. + + Args: + fieldset: Target fieldset dictionary potentially containing 'reused_here' list + + Returns: + List of dictionaries sorted by nesting path, each containing: + - flat_nesting: The nesting path with wildcard (e.g., 'client.geo.*') + - name: Schema name of the reused fieldset + - short: Short description + - beta: Beta status marker (if applicable) + - normalize: Normalization rules (if applicable) + + Returns None if no fieldsets are reused here. + + Example: + >>> fieldset = {'reused_here': [ + ... {'full': 'client.geo', 'schema_name': 'geo', 'short': 'Location'} + ... ]} + >>> render_nestings_reuse_section(fieldset) + [{'flat_nesting': 'client.geo.*', 'name': 'geo', 'short': 'Location', ...}] """ if not fieldset.get('reused_here'): return None @@ -81,11 +180,24 @@ def render_nestings_reuse_section(fieldset): def extract_allowed_values_key_names(field): - """Extracts the `name` keys from the field's - allowed_values if present in the field - object. + """Extract names of all allowed values for a field. + + For fields with constrained value sets (like enumerations), this extracts + the name of each allowed value for display in documentation. + + Args: + field: Field dictionary potentially containing 'allowed_values' list - :param field: The target field + Returns: + List of allowed value names, or empty list if no allowed values defined + + Example: + >>> field = {'allowed_values': [ + ... {'name': 'success', 'description': 'Success'}, + ... {'name': 'failure', 'description': 'Failure'} + ... ]} + >>> extract_allowed_values_key_names(field) + ['success', 'failure'] """ if not field.get('allowed_values'): return [] @@ -93,13 +205,31 @@ def extract_allowed_values_key_names(field): def sort_fields(fieldset): - """Prepares a fieldset's fields for being - passed into the j2 template for rendering. This - includes sorting them into a list of objects and - adding a field for the names of any allowed values - for the field, if present. - - :param fieldset: The target fieldset + """Prepare and sort fieldset fields for template rendering. + + Converts the fieldset's fields dictionary into a sorted list and enriches + each field with extracted allowed value names. 
This prepares the data + structure for consumption by Jinja2 templates. + + Args: + fieldset: Fieldset dictionary containing 'fields' dictionary + + Returns: + List of field dictionaries sorted alphabetically by name, each enriched + with 'allowed_value_names' property + + Note: + Modifies field objects in place by adding 'allowed_value_names' key. + Fields are sorted by their 'name' property for consistent output. + + Example: + >>> fieldset = {'fields': { + ... 'status': {'name': 'status', 'allowed_values': [...]}, + ... 'method': {'name': 'method'} + ... }} + >>> sorted_fields = sort_fields(fieldset) + >>> [f['name'] for f in sorted_fields] + ['method', 'status'] """ fields_list = list(fieldset['fields'].values()) for field in fields_list: @@ -108,18 +238,54 @@ def sort_fields(fieldset): def check_for_usage_doc(fieldset_name, usage_file_list=ecs_helpers.usage_doc_files()): - """Checks if a usage doc exists for the specified - fieldset. + """Check if a usage documentation file exists for a fieldset. - :param fieldset_name: The name of the target fieldset + Usage docs provide additional guidance and examples for using specific + fieldsets. This function checks if such a document has been created. + + Args: + fieldset_name: Name of the fieldset (e.g., 'http', 'user') + usage_file_list: List of available usage doc filenames (defaults to + scanning docs/fields/usage/ directory) + + Returns: + True if a usage doc exists, False otherwise + + Note: + Usage docs follow the naming pattern: ecs-{fieldset_name}-usage.md + They are typically stored in docs/fields/usage/ + + Example: + >>> check_for_usage_doc('http') + True # If docs/fields/usage/ecs-http-usage.md exists """ return f"ecs-{fieldset_name}-usage.md" in usage_file_list def templated(template_name): - """Decorator function to simplify rendering a template. + """Decorator to automatically render a function's return value through a Jinja2 template. + + This decorator simplifies page generation by allowing functions to return + context dictionaries that are automatically rendered through specified templates. - :param template_name: the name of the template to be rendered + Args: + template_name: Name of the Jinja2 template file (e.g., 'fieldset.j2') + + Returns: + Decorator function that wraps the target function + + Behavior: + - If decorated function returns a dict: Passes it as template context + - If function returns None: Uses empty dict as context + - If function returns non-dict: Returns value unchanged (bypass rendering) + + Example: + >>> @templated('index.j2') + ... def page_index(version): + ... return {'version': version} + >>> + >>> # Calling page_index('8.11.0') automatically renders index.j2 + >>> # with {'version': '8.11.0'} as context """ def decorator(func): @wraps(func) @@ -135,18 +301,50 @@ def decorated_function(*args, **kwargs): def render_template(template_name, **context): - """Renders a template from the template folder with the given - context. + """Render a Jinja2 template with the provided context. + + Loads a template from the configured template directory and renders it + with the given variables. - :param template_name: the name of the template to be rendered - :param context: the variables that should be available in the - context of the template. 
+ Args: + template_name: Name of the template file (e.g., 'fieldset.j2') + **context: Keyword arguments passed as template variables + + Returns: + Rendered template as a string + + Raises: + jinja2.TemplateNotFound: If the template file doesn't exist + + Note: + Templates are loaded from scripts/templates/ directory. + The template_env is configured with keep_trailing_newline=True + and trim_blocks=True for consistent formatting. + + Example: + >>> render_template('index.j2', version='8.11.0', title='ECS Reference') + '# ECS Reference\\n\\nVersion: 8.11.0\\n...' """ template = template_env.get_template(template_name) return template.render(**context) def save_markdown(f, text): + """Save rendered markdown text to a file. + + Creates parent directories if needed and writes the markdown content. + + Args: + f: Full file path where markdown should be saved + text: Rendered markdown content to write + + Note: + Creates any missing parent directories automatically. + Overwrites existing files without warning. + + Example: + >>> save_markdown('docs/reference/ecs-http.md', markdown_content) + """ os.makedirs(path.dirname(f), exist_ok=True) with open(f, "w") as outfile: outfile.write(text) @@ -169,6 +367,19 @@ def save_markdown(f, text): @templated('index.j2') def page_index(ecs_generated_version): + """Generate the main documentation index page. + + Creates the landing page for ECS documentation with version information + and links to other documentation pages. + + Args: + ecs_generated_version: ECS version string (e.g., '8.11.0') + + Returns: + Rendered markdown content for index.md + + Template: index.j2 + """ return dict(ecs_generated_version=ecs_generated_version) @@ -177,6 +388,29 @@ def page_index(ecs_generated_version): @templated('fieldset.j2') def page_fieldset(fieldset, nested, ecs_generated_version): + """Generate documentation page for a single fieldset. + + Creates comprehensive documentation for one ECS fieldset including: + - Fieldset description and metadata + - List of all fields with types and descriptions + - Reuse information (where this fieldset is used) + - Nesting information (what fieldsets are nested here) + - Link to usage documentation if available + + Args: + fieldset: Fieldset dictionary containing name, fields, and metadata + nested: Complete nested fieldsets structure (for context) + ecs_generated_version: ECS version string + + Returns: + Rendered markdown content for ecs-{fieldset_name}.md + + Template: fieldset.j2 + + Example: + >>> page_fieldset(nested['http'], nested, '8.11.0') + # Returns markdown for ecs-http.md + """ sorted_reuse_fields = render_fieldset_reuse_text(fieldset) render_nestings_reuse_fields = render_nestings_reuse_section(fieldset) sorted_fields = sort_fields(fieldset) @@ -192,6 +426,24 @@ def page_fieldset(fieldset, nested, ecs_generated_version): @templated('ecs_field_reference.j2') def page_field_reference(ecs_generated_version, es, fieldsets): + """Generate the complete ECS field reference page. + + Creates a comprehensive reference listing all ECS fieldsets and their fields + in a single document. This serves as a complete field catalog. + + Args: + ecs_generated_version: ECS version string + es: Elasticsearch product name (typically "Elasticsearch") + fieldsets: List of all fieldsets sorted by group and name + + Returns: + Rendered markdown content for ecs-field-reference.md + + Template: ecs_field_reference.j2 + + Note: + This page can be quite large as it includes all fields from all fieldsets. 
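+
+    Example (illustrative):
+        >>> fieldsets = ecs_helpers.dict_sorted_by_keys(nested, ['group', 'name'])
+        >>> page_field_reference('8.11.0', 'Elasticsearch', fieldsets)
+        # Returns rendered markdown for ecs-field-reference.md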
+ """ return dict(ecs_generated_version=ecs_generated_version, es=es, fieldsets=fieldsets) @@ -201,6 +453,22 @@ def page_field_reference(ecs_generated_version, es, fieldsets): def page_field_details(nested, docs_only_nested): + """Generate combined field details for all fieldsets. + + Creates a consolidated view of all fieldsets with detailed information. + Merges documentation-only fields into the main nested structure. + + Args: + nested: Dictionary of nested fieldsets + docs_only_nested: Additional fields only used in documentation + + Returns: + Concatenated markdown content for all fieldsets + + Note: + This function modifies the nested dictionary in place by merging + docs_only_nested fields. Currently not used in main generation flow. + """ if docs_only_nested: for fieldset_name, fieldset in docs_only_nested.items(): nested[fieldset_name]['fields'].update(fieldset['fields']) @@ -211,6 +479,19 @@ def page_field_details(nested, docs_only_nested): @templated('field_details.j2') def generate_field_details_page(fieldset): + """Generate detailed documentation for a single fieldset. + + Helper function for page_field_details that renders one fieldset + with complete information including reuse and nesting details. + + Args: + fieldset: Fieldset dictionary to document + + Returns: + Rendered markdown content for this fieldset's details + + Template: field_details.j2 + """ # render field reuse text section sorted_reuse_fields = render_fieldset_reuse_text(fieldset) render_nestings_reuse_fields = render_nestings_reuse_section(fieldset) @@ -227,6 +508,33 @@ def generate_field_details_page(fieldset): @templated('otel_alignment_details.j2') def page_otel_alignment_details(nested, ecs_generated_version, semconv_version): + """Generate detailed OTel alignment documentation showing field-by-field mappings. + + Creates comprehensive documentation showing how each ECS field maps to + OpenTelemetry Semantic Conventions. Only includes fieldsets that have + at least one OTel mapping defined. + + Args: + nested: Dictionary of nested fieldsets + ecs_generated_version: ECS version string + semconv_version: OTel semantic conventions version (without 'v' prefix) + + Returns: + Rendered markdown content for ecs-otel-alignment-details.md + + Template: otel_alignment_details.j2 + + Note: + - Filters out fieldsets with no OTel mappings + - Deep copies fieldsets to avoid modifying original data + - Converts fields dict to sorted list for template iteration + + Example output includes: + - Relation types (match, equivalent, related, etc.) + - Attribute/metric names + - Stability levels + - Explanatory notes + """ fieldsets = [deepcopy(fieldset) for fieldset in ecs_helpers.dict_sorted_by_keys( nested, ['group', 'name']) if is_eligable_for_otel_mapping(fieldset)] for fieldset in fieldsets: @@ -239,6 +547,25 @@ def page_otel_alignment_details(nested, ecs_generated_version, semconv_version): def is_eligable_for_otel_mapping(fieldset): + """Check if a fieldset has any OTel mappings defined. + + Determines whether a fieldset should be included in OTel alignment + documentation by checking if any of its fields have OTel mappings. + + Args: + fieldset: Fieldset dictionary containing 'fields' + + Returns: + True if at least one field has an 'otel' mapping, False otherwise + + Example: + >>> fieldset = {'fields': { + ... 'method': {'otel': [{'relation': 'match'}]}, + ... 'version': {} + ... 
}}
+        >>> is_eligable_for_otel_mapping(fieldset)
+        True
+    """
    for field in fieldset['fields'].values():
        if 'otel' in field:
            return True
@@ -249,6 +576,35 @@


@templated('otel_alignment_overview.j2')
def page_otel_alignment_overview(otel_generator, nested, ecs_generated_version, semconv_version):
+    """Generate OTel alignment overview with summary statistics.
+
+    Creates high-level documentation showing alignment statistics between
+    ECS and OpenTelemetry Semantic Conventions, including:
+    - Total fields in each namespace
+    - Number of matching, equivalent, and related fields
+    - Conflicting and not-applicable fields
+    - Coverage percentages
+
+    Args:
+        otel_generator: OTelGenerator instance for computing summaries
+        nested: Dictionary of nested fieldsets
+        ecs_generated_version: ECS version string
+        semconv_version: OTel semantic conventions version (without 'v' prefix)
+
+    Returns:
+        Rendered markdown content for ecs-otel-alignment-overview.md
+
+    Template: otel_alignment_overview.j2
+
+    Note:
+        Uses the OTelGenerator to compute alignment statistics.
+        Includes both ECS namespaces and OTel-only namespaces.
+
+    Example output includes tables with:
+    - Namespace names
+    - Field counts by relation type
+    - Coverage percentages
+    """
    fieldsets = ecs_helpers.dict_sorted_by_keys(nested, ['group', 'name'])
    summaries = otel_generator.get_mapping_summaries(fieldsets)
    return dict(summaries=summaries,
@@ -260,6 +616,31 @@


@templated('field_values.j2')
def page_field_values(nested, template_name='field_values_template.j2'):
+    """Generate documentation for categorization fields with allowed values.
+
+    Creates specialized documentation for key event categorization fields
+    that have constrained value sets. Focuses on the core event taxonomy
+    fields used for event classification.
+
+    Args:
+        nested: Dictionary of nested fieldsets
+        template_name: Template to use (default: 'field_values_template.j2').
+            Currently ignored; the @templated decorator hardcodes 'field_values.j2'.
+
+    Returns:
+        Rendered markdown content showing allowed values for categorization fields
+
+    Template: field_values.j2
+
+    Fields documented:
+    - event.kind: High-level event category
+    - event.category: Event category used for filtering
+    - event.type: Sub-category within the event category
+    - event.outcome: Event outcome (success, failure, unknown)
+
+    Note:
+        Currently only processes fields from the 'event' fieldset.
+    """
    category_fields = ['event.kind', 'event.category', 'event.type', 'event.outcome']
    nested_fields = []
    for cat_field in category_fields:
diff --git a/scripts/generators/otel.py b/scripts/generators/otel.py
index c6c0469535..42b30d04be 100644
--- a/scripts/generators/otel.py
+++ b/scripts/generators/otel.py
@@ -1,3 +1,39 @@
+# Licensed to Elasticsearch B.V. under one or more contributor
+# license agreements. See the NOTICE file distributed with
+# this work for additional information regarding copyright
+# ownership. Elasticsearch B.V. licenses this file to you under
+# the Apache License, Version 2.0 (the "License"); you may
+# not use this file except in compliance with the License.
+# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, +# software distributed under the License is distributed on an +# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY +# KIND, either express or implied. See the License for the +# specific language governing permissions and limitations +# under the License. + +"""OpenTelemetry Semantic Conventions Integration Module. + +This module handles the integration between ECS (Elastic Common Schema) and +OpenTelemetry Semantic Conventions. It provides functionality to: +- Load OTel semantic conventions from GitHub +- Validate ECS field mappings against OTel attributes and metrics +- Generate alignment summaries for documentation + +The module supports the ECS donation to OpenTelemetry initiative by maintaining +mappings between the two standards. + +Key Components: + - OTelGenerator: Main class for validation and summary generation + - Model loading functions: Fetch OTel definitions from git + - Validation functions: Ensure mapping integrity + +See also: scripts/docs/otel-integration.md for detailed documentation +""" + import git import os import shutil @@ -27,7 +63,26 @@ def get_model_files( git_repo: str, semconv_version: str, ) -> List[OTelModelFile]: - """Loads OpenTelemetry Semantic Conventions model from GitHub""" + """Load OpenTelemetry Semantic Conventions model files from a GitHub repository. + + This function clones or uses a cached version of the OTel semantic conventions + repository and extracts all model files (YAML) from the 'model' directory. + + Args: + git_repo: URL of the git repository containing semantic conventions + semconv_version: Git tag or branch name to checkout (e.g., 'v1.24.0') + + Returns: + List of OTel model files, each containing groups of attributes/metrics + + Raises: + KeyError: If the 'model' directory doesn't exist in the repository + + Example: + >>> files = get_model_files(OTEL_SEMCONV_GIT, 'v1.24.0') + >>> len(files) # Number of YAML files found + 150 + """ target_dir = "model" tree: git.objects.tree.Tree = get_tree_by_url(git_repo, semconv_version) if ecs_helpers.path_exists_in_git_tree(tree, target_dir): @@ -39,7 +94,24 @@ def get_model_files( def get_attributes( model_files: List[OTelModelFile] ) -> Dict[str, OTelAttribute]: - """Retrieves (non-deprecated) OTel attributes from the model files""" + """Extract all non-deprecated OTel attributes from model files. + + Iterates through all model files and extracts attributes from attribute_groups, + filtering out deprecated entries. Prefixes are applied to attribute IDs when + specified in the group definition. + + Args: + model_files: List of OTel model files loaded from the repository + + Returns: + Dictionary mapping attribute IDs (e.g., 'http.request.method') to their + full attribute definitions including stability, type, and metadata + + Note: + - Only processes groups with type='attribute_group' + - Skips deprecated groups and attributes + - Preserves group display names for documentation purposes + """ attributes: Dict[str, OTelAttribute] = {} for model_file in model_files: @@ -58,7 +130,23 @@ def get_attributes( def get_metrics( model_files: List[OTelModelFile] ) -> Dict[str, OTelAttribute]: - """Retrieves (non-deprecated) OTel metrics from the model files""" + """Extract all non-deprecated OTel metrics from model files. + + Iterates through all model files and extracts metric definitions, + filtering out deprecated entries. 
+ + Args: + model_files: List of OTel model files loaded from the repository + + Returns: + Dictionary mapping metric names (e.g., 'http.server.request.duration') + to their full metric group definitions including stability and metadata + + Note: + - Only processes groups with type='metric' + - Skips deprecated metrics + - Metric names are used as dictionary keys + """ metrics: Dict[str, OTelGroup] = {} for model_file in model_files: @@ -72,6 +160,23 @@ def collectOTelModelFiles( tree: git.objects.tree.Tree, level=0 ) -> List[OTelModelFile]: + """Recursively collect all YAML model files from a git tree. + + Traverses the directory tree structure and parses all YAML files found, + returning them as OTel model file objects. + + Args: + tree: Git tree object representing a directory + level: Current recursion depth (used for tracking) + + Returns: + List of parsed OTel model files from all YAML files in the tree + + Note: + - Recursively processes subdirectories + - Only processes files with .yml or .yaml extensions + - Files are parsed using yaml.safe_load for security + """ otel_model_files: List[OTelModelFile] = [] for entry in tree: if entry.type == "tree": @@ -87,6 +192,28 @@ def get_tree_by_url( url: str, git_ref: str, ) -> git.objects.tree.Tree: + """Clone or update a git repository and return the tree for a specific reference. + + This function manages a local cache of the OTel semantic conventions repository. + If the repository is already cloned and contains the requested ref, it reuses + the cached version. Otherwise, it clones fresh from the remote. + + Args: + url: Git repository URL to clone from + git_ref: Git reference (tag or branch) to checkout (e.g., 'v1.24.0') + + Returns: + Git tree object representing the repository contents at the specified ref + + Note: + - Caches the repository in LOCAL_TARGET_DIR_OTEL_SEMCONV (./build/otel-semconv/) + - If cached repo doesn't have the requested ref, re-clones from remote + - Prints status message when downloading from remote + + Example: + >>> tree = get_tree_by_url(OTEL_SEMCONV_GIT, 'v1.24.0') + Loading OpenTelemetry Semantic Conventions version "v1.24.0" + """ repo: git.repo.base.Repo clone_from_remote = False if os.path.exists(LOCAL_TARGET_DIR_OTEL_SEMCONV): @@ -110,6 +237,29 @@ def get_otel_attribute_name( field: Field, otel: OTelMapping ) -> str: + """Extract the OTel attribute name from a mapping. + + Determines the appropriate OTel attribute name based on the mapping relation type: + - 'match': Use the ECS field's flat_name (names are identical) + - Other relations: Use the explicitly specified 'attribute' property + + Args: + field: ECS field definition containing flat_name + otel: OTel mapping configuration with relation type + + Returns: + The OTel attribute name to use for lookups + + Raises: + KeyError: If mapping is for a metric (not attribute) or if the relation + type doesn't support attribute name extraction + + Example: + >>> field = {'flat_name': 'http.request.method'} + >>> otel = {'relation': 'match'} + >>> get_otel_attribute_name(field, otel) + 'http.request.method' + """ if otel['relation'] == 'match': return field['flat_name'] elif 'attribute' in otel: @@ -122,20 +272,77 @@ def get_otel_attribute_name( def must_have(ecs_field_name, otel, relation_type, property): + """Validate that a required property exists in an OTel mapping. 
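+
+    For example, a mapping with relation type 'metric' must carry a 'metric'
+    property; if it is missing, a ValueError naming the offending field is raised.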
+ + Args: + ecs_field_name: Name of the ECS field being validated + otel: OTel mapping configuration dictionary + relation_type: The relation type requiring this property + property: Name of the required property + + Raises: + ValueError: If the required property is missing + """ if property not in otel: raise ValueError( f"On field '{ecs_field_name}': An OTel mapping with relation type '{relation_type}' must specify the property '{property}'!") def must_not_have(ecs_field_name, otel, relation_type, property): + """Validate that a forbidden property does not exist in an OTel mapping. + + Args: + ecs_field_name: Name of the ECS field being validated + otel: OTel mapping configuration dictionary + relation_type: The relation type forbidding this property + property: Name of the forbidden property + + Raises: + ValueError: If the forbidden property is present + """ if property in otel: raise ValueError( f"On field '{ecs_field_name}': An OTel mapping with relation type '{relation_type}' must not have the property '{property}'!") class OTelGenerator: + """Main class for OTel Semantic Conventions integration with ECS. + + This class handles the complete workflow of: + 1. Loading OTel semantic conventions from GitHub + 2. Validating ECS field mappings against OTel definitions + 3. Generating alignment summaries for documentation + + The generator is initialized with a specific OTel semantic conventions version + and maintains in-memory caches of all attributes and metrics for validation. + + Attributes: + attributes: Dictionary of all OTel attributes (keyed by attribute ID) + otel_attribute_names: List of all attribute IDs for quick lookup + metrics: Dictionary of all OTel metrics (keyed by metric name) + otel_metric_names: List of all metric names for quick lookup + semconv_version: Version of OTel semantic conventions being used + + Example: + >>> generator = OTelGenerator('v1.24.0') + >>> generator.validate_otel_mapping(ecs_fields) + >>> summaries = generator.get_mapping_summaries(fieldsets) + """ def __init__(self, semconv_version: str): + """Initialize the OTel generator with a specific semantic conventions version. + + Loads all model files from the OTel semantic conventions repository and + extracts attributes and metrics for validation and reference. + + Args: + semconv_version: Git tag or branch of semantic conventions to use + (e.g., 'v1.24.0') + + Note: + This operation may take time on first run as it clones the repository. + Subsequent runs with the same version use a cached clone. + """ model_files = get_model_files(OTEL_SEMCONV_GIT, semconv_version) self.attributes: Dict[str, OTelAttribute] = get_attributes(model_files) @@ -147,6 +354,22 @@ def __init__(self, semconv_version: str): self.semconv_version = semconv_version def __set_stability(self, details): + """Set stability level on OTel mappings from their corresponding OTel definitions. + + Called by the visitor pattern during field traversal. Enriches each mapping + with the stability level (experimental, stable, deprecated) from the OTel + semantic conventions. 
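+
+        For example, a mapping {'relation': 'match'} on 'http.request.method'
+        gains a 'stability' entry copied from that attribute's OTel definition.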
+ + Args: + details: Field details dictionary containing 'field_details' with + optional 'otel' mappings + + Note: + - For metrics: Uses the metric group's stability + - For attributes: Uses the attribute's stability + - Modifies the otel mapping in place + - Private method used internally during validation + """ field_details = details['field_details'] if 'flat_name' in field_details and 'otel' in field_details: for otel in field_details['otel']: @@ -156,17 +379,60 @@ def __set_stability(self, details): otel['stability'] = self.attributes[get_otel_attribute_name(field_details, otel)]['stability'] def __check_metric_name(self, field_name, metric_name): + """Validate that a referenced metric exists in OTel semantic conventions. + + Args: + field_name: Name of the ECS field being validated + metric_name: OTel metric name to verify + + Raises: + ValueError: If the metric doesn't exist in the loaded conventions + """ if not metric_name in self.otel_metric_names: raise ValueError( f"On field '{field_name}': Metric '{metric_name}' does not exist in Semantic Conventions version {self.semconv_version}!") def __check_attribute_name(self, field_details, otel): + """Validate that a referenced attribute exists in OTel semantic conventions. + + Args: + field_details: ECS field definition + otel: OTel mapping configuration + + Raises: + ValueError: If the attribute doesn't exist in the loaded conventions + """ otel_attr_name = get_otel_attribute_name(field_details, otel) if not otel_attr_name in self.otel_attribute_names: raise ValueError( f"On field '{field_details['flat_name']}': Attribute '{otel_attr_name}' does not exist in Semantic Conventions version {self.semconv_version}!") def __check_mapping(self, details): + """Validate an ECS field's OTel mapping configuration. + + Performs comprehensive validation of OTel mappings including: + - Required and forbidden properties for each relation type + - Existence of referenced attributes/metrics + - Proper structure and consistency + + Called by the visitor pattern during field traversal. + + Args: + details: Field details dictionary containing 'field_details' + + Raises: + ValueError: If mapping configuration is invalid + + Note: + Relation types and their requirements: + - 'match': Names are identical, no extra properties + - 'equivalent': Requires 'attribute', semantically equivalent + - 'related': Requires 'attribute', related but different + - 'conflict': Requires 'attribute', conflicting definitions + - 'metric': Requires 'metric', maps to OTel metric + - 'otlp': Requires 'otlp_field' and 'stability', protocol-specific + - 'na': Not applicable, no extra properties + """ field_details = details['field_details'] if 'flat_name' in field_details and (not 'intermediate' in field_details or not field_details['intermediate']): ecs_field_name = field_details['flat_name'] @@ -215,6 +481,29 @@ def validate_otel_mapping( self, field_entries: Dict[str, FieldEntry] ) -> None: + """Validate all OTel mappings in ECS field definitions. + + This is the main validation entry point. It performs two passes over + all fields: + 1. Validate mapping structure and referenced attributes/metrics exist + 2. Enrich mappings with stability information from OTel definitions + + Args: + field_entries: Dictionary of all ECS field entries to validate + + Raises: + ValueError: If any mapping is invalid or references non-existent + OTel attributes/metrics + + Note: + Uses the visitor pattern to traverse nested field structures. 
+ Prints warnings for unmapped fields that match OTel attribute names. + + Example: + >>> generator = OTelGenerator('v1.24.0') + >>> fields = loader.load_schemas() + >>> generator.validate_otel_mapping(fields) + """ visitor.visit_fields(field_entries, None, self.__check_mapping) visitor.visit_fields(field_entries, None, self.__set_stability) @@ -222,6 +511,40 @@ def get_mapping_summaries( self, fieldsets: List[FieldNestedEntry], ) -> List[OTelMappingSummary]: + """Generate alignment summaries between ECS fieldsets and OTel namespaces. + + Creates summary statistics for each ECS fieldset and each OTel namespace, + showing the degree of alignment between the two standards. This is used + for generating documentation. + + Args: + fieldsets: List of ECS fieldsets (nested field groups) + + Returns: + List of summary objects containing: + - namespace: The fieldset/namespace name + - title: Display title + - nr_all_ecs_fields: Total ECS fields in this namespace + - nr_plain_ecs_fields: ECS-only fields (not reused from other sets) + - nr_otel_fields: Total OTel attributes in this namespace + - nr_matching_fields: Fields with 'match' relation + - nr_equivalent_fields: Fields with 'equivalent' relation + - nr_related_fields: Fields with 'related' relation + - nr_conflicting_fields: Fields with 'conflict' relation + - nr_metric_fields: Fields mapped to metrics + - nr_otlp_fields: Fields mapped to OTLP protocol fields + - nr_not_applicable_fields: Fields marked as not applicable + + Note: + - Summaries are sorted alphabetically by namespace + - Includes summaries for OTel namespaces that have no ECS equivalent + - Used by markdown_fields.py to generate documentation + + Example: + >>> summaries = generator.get_mapping_summaries(nested_fieldsets) + >>> for s in summaries: + ... print(f"{s['namespace']}: {s['nr_matching_fields']} matches") + """ summaries: List[OTelMappingSummary] = [] otel_namespaces = set([attr.split('.')[0] for attr in self.attributes.keys()]) diff --git a/scripts/schema/cleaner.py b/scripts/schema/cleaner.py index 84c6abef15..f6ce573cc2 100644 --- a/scripts/schema/cleaner.py +++ b/scripts/schema/cleaner.py @@ -15,6 +15,70 @@ # specific language governing permissions and limitations # under the License. +"""Schema Cleaner Module. + +This module performs validation, normalization, and enrichment of schema +definitions after loading. It ensures schemas are well-formed and fills in +sensible defaults to simplify downstream processing. + +The cleaner operates on the deeply nested structure produced by loader.py and +makes in-place modifications. It's the second stage of the schema processing +pipeline: + + loader.py → cleaner.py → finalizer.py → intermediate_files.py + +Responsibilities: + 1. **Validation**: Check mandatory attributes are present + 2. **Normalization**: Strip whitespace, standardize values + 3. **Defaults**: Fill in sensible defaults for optional attributes + 4. **Enrichment**: Pre-calculate helpful derived fields + 5. **Shorthand Expansion**: Convert shorthand notation to full form + 6. 
**Quality Checks**: Validate descriptions, examples, patterns + +What the Cleaner Does: + - Validates mandatory attributes (name, title, description, type, level) + - Strips leading/trailing whitespace from string values + - Sets defaults for missing optional attributes: + * group=2 (standard priority) + * root=false (not a root fieldset) + * type='group' (for fieldsets) + * ignore_above=1024 (for keyword fields) + * norms=false (for text fields) + - Calculates schema prefix for field names + - Expands reuse location shorthand notation + - Validates field levels (core/extended/custom) + - Checks description lengths + - Validates regex patterns + - Validates example values against patterns/expected_values + +What the Cleaner Does NOT Do: + - Perform field reuse (handled by finalizer.py) + - Calculate final field names (handled by finalizer.py) + - Generate output artifacts (handled by generators) + - Modify schema structure (only enriches existing structure) + +Strict Mode: + When run with --strict flag, warnings become exceptions. This enforces: + - Short descriptions under 120 characters + - Valid example values + - Proper regex patterns + - No YAML interpretation issues + +Key Concepts: + - **Mandatory Attributes**: Must be present or cleaner raises ValueError + - **Defaults**: Optional attributes get sensible defaults if missing + - **Intermediate Fields**: Auto-created parents (type=object, intermediate=true) + - **Reuse Notation**: Shorthand 'destination' expands to {'at': 'destination', 'as': 'user'} + +Example: + >>> from schema import loader, cleaner + >>> fields = loader.load_schemas() + >>> cleaner.clean(fields, strict=False) + # Fields now have defaults filled in and are validated + +See also: scripts/docs/schema-pipeline.md for complete pipeline documentation +""" + import re from typing import ( Dict, @@ -32,25 +96,44 @@ MultiField, ) -# This script performs a few cleanup functions in place, within the deeply nested -# 'fields' structure passed to `clean(fields)`. -# -# What happens here: -# -# - check that mandatory attributes are present, without which we can't do much. -# - cleans things up, like stripping spaces, sorting arrays -# - makes lots of defaults explicit -# - pre-calculate a few additional helpful fields -# - converts shorthands into full representation (e.g. reuse locations) -# -# This script only deals with field sets themselves and the fields defined -# inside them. It doesn't perform field reuse, and therefore doesn't -# deal with final field names either. - strict_mode: Optional[bool] # work-around from https://github.com/python/mypy/issues/5732 def clean(fields: Dict[str, Field], strict: Optional[bool] = False) -> None: + """Clean, validate, and enrich schema definitions in place. + + This is the main entry point for the cleaner module. It uses the visitor + pattern to traverse all fieldsets and fields, applying validation, + normalization, and defaults to each. + + Args: + fields: Deeply nested field dictionary from loader.py + strict: If True, warnings become exceptions (enforces stricter validation) + + Side Effects: + Modifies fields dictionary in place: + - Adds default values for optional attributes + - Strips whitespace from strings + - Expands shorthand notation + - Calculates derived fields + + Raises: + ValueError: If mandatory attributes are missing or invalid + + Processing Order: + 1. Visit each fieldset, call schema_cleanup() + 2. Visit each field, call field_cleanup() + 3. 
Both use depth-first traversal (parents before children)
+
+    Example:
+        >>> fields = loader.load_schemas()
+        >>> clean(fields, strict=False)  # Warnings for issues
+        >>> clean(fields, strict=True)   # Exceptions for issues
+
+    Note:
+        Sets the global strict_mode variable that controls warning behavior.
+        This is a workaround for passing state to visitor callbacks.
+    """
     global strict_mode
     strict_mode = strict
     visitor.visit_fields(fields, fieldset_func=schema_cleanup, field_func=field_cleanup)
@@ -60,6 +143,38 @@ def clean(fields: Dict[str, Field], strict: Optional[bool] = False) -> None:
 def schema_cleanup(schema: FieldEntry) -> None:
+    """Clean, validate, and enrich a single fieldset definition.
+
+    Performs all cleanup operations for a fieldset (schema-level node):
+    - Validates mandatory attributes
+    - Strips whitespace
+    - Fills in defaults
+    - Calculates prefix
+    - Expands reuse notation
+    - Validates constraints
+
+    Args:
+        schema: Fieldset entry with 'schema_details', 'field_details', 'fields'
+
+    Side Effects:
+        Modifies schema dictionary in place
+
+    Raises:
+        ValueError: If mandatory attributes missing or invalid
+
+    Defaults Applied:
+        - group: 2 (standard priority)
+        - root: False (not a root fieldset)
+        - type: 'group' (fieldset type)
+        - short: Copy of description
+        - reusable.order: 2 (default reuse priority)
+
+    Calculated Fields:
+        - prefix: '' if root=true, else 'name.' (e.g., 'http.')
+
+    Note:
+        Called by visitor for each fieldset during traversal.
+    """
     # Sanity check first
     schema_mandatory_attributes(schema)
     # trailing space cleanup
@@ -87,7 +202,35 @@ def schema_cleanup(schema: FieldEntry) -> None:
 def schema_mandatory_attributes(schema: FieldEntry) -> None:
-    """Ensures for the presence of the mandatory schema attributes and raises if any are missing"""
+    """Validate that all mandatory fieldset attributes are present.
+
+    Checks for required attributes at both field_details and schema_details level.
+    For reusable fieldsets, also validates reusable-specific attributes.
+
+    Args:
+        schema: Fieldset entry to validate
+
+    Raises:
+        ValueError: If any mandatory attributes are missing
+
+    Mandatory Attributes:
+        All fieldsets:
+        - name: Fieldset identifier
+        - title: Display title
+        - description: Fieldset description
+
+        Reusable fieldsets (if reusable key present):
+        - expected: Array of reuse locations
+        - top_level: Whether fieldset can appear at root
+
+    Example:
+        >>> schema = {
+        ...     'field_details': {'name': 'http', 'description': '...'},
+        ...     'schema_details': {}
+        ... }
+        >>> schema_mandatory_attributes(schema)
+        Traceback (most recent call last):
+            ...
+        ValueError: Schema http is missing the following mandatory attributes: title
+    """
     current_schema_attributes: List[str] = sorted(list(schema['field_details'].keys()) +
                                                   list(schema['schema_details'].keys()))
     missing_attributes: List[str] = ecs_helpers.list_subtract(SCHEMA_MANDATORY_ATTRIBUTES, current_schema_attributes)
@@ -105,7 +248,27 @@ def schema_mandatory_attributes(schema: FieldEntry) -> None:
 def schema_assertions_and_warnings(schema: FieldEntry) -> None:
-    """Additional checks on a fleshed out schema"""
+    """Perform additional validation checks on an enriched fieldset.
+
+    Called after defaults are filled in and normalization is complete.
+    Validates quality constraints like description length and format.
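+    For example, a 'short' description that spans multiple lines triggers a
+    warning, or an exception when running with --strict.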
+ + Args: + schema: Fieldset entry to validate + + Side Effects: + May print warnings or raise exceptions depending on strict_mode + + Checks Performed: + - Short description is single line and under 120 characters + - Beta description (if present) is single line + - Reuse short_override descriptions (if present) are single line + + Note: + Behavior depends on global strict_mode variable: + - strict=False: Prints warnings + - strict=True: Raises exceptions + """ single_line_short_description(schema, strict=strict_mode) if 'beta' in schema['field_details']: single_line_beta_description(schema, strict=strict_mode) @@ -114,19 +277,59 @@ def schema_assertions_and_warnings(schema: FieldEntry) -> None: def normalize_reuse_notation(schema: FieldEntry) -> None: - """ - Replace single word reuse shorthands from the schema YAMLs with the explicit {at: , as:} notation. - - When marking "user" as reusable under "destination" with the shorthand entry - `- destination`, this is expanded to the complete entry - `- { "at": "destination", "as": "user" }`. - The field set is thus nested at `destination.user.*`, with fields such as `destination.user.name`. - - The dictionary notation enables nesting a field set as a different name. - An example is nesting "process" fields to capture parent process details - at `process.parent.*`. - The dictionary notation `- { "at": "process", "as": "parent" }` will yield - fields such as `process.parent.pid`. + """Expand shorthand reuse notation to explicit {at:, as:} dictionary format. + + Schema YAMLs allow two formats for specifying where a fieldset should be reused: + + 1. Shorthand string: 'destination' + Expands to: {'at': 'destination', 'as': 'user'} + Results in: destination.user.* fields + + 2. Explicit dict: {'at': 'process', 'as': 'parent'} + Already explicit, just validated + Results in: process.parent.* fields + + This function normalizes both formats to the explicit dictionary form and + calculates the 'full' path for convenience. + + Args: + schema: Fieldset entry (only processed if reusable) + + Side Effects: + Modifies schema['schema_details']['reusable']['expected'] in place, + converting all entries to explicit dictionary format with 'full' key + + Raises: + ValueError: If dictionary notation is incomplete (missing 'at' or 'as') + + Reuse Examples: + Shorthand: + ```yaml + reusable: + expected: + - destination # Shorthand + ``` + Becomes: + {'at': 'destination', 'as': 'user', 'full': 'destination.user'} + + Explicit: + ```yaml + reusable: + expected: + - at: process + as: parent + ``` + Becomes: + {'at': 'process', 'as': 'parent', 'full': 'process.parent'} + + Use Cases: + - 'user' reused at 'destination', 'source', 'client', 'server' + - 'process' reused as 'process.parent' (self-nesting) + - 'geo' reused under 'client.geo', 'server.geo' (not top-level) + + Note: + The 'full' path is used by downstream stages to quickly identify + where fields will appear after reuse is performed. """ if 'reusable' not in schema['schema_details']: return @@ -151,6 +354,37 @@ def normalize_reuse_notation(schema: FieldEntry) -> None: def field_cleanup(field: FieldDetails) -> None: + """Clean, validate, and enrich a single field definition. 
+ + Performs all cleanup operations for a field: + - Validates mandatory attributes + - Strips whitespace (unless intermediate field) + - Fills in defaults + - Validates constraints + + Args: + field: Field entry with 'field_details' and optionally 'fields' + + Side Effects: + Modifies field dictionary in place + + Raises: + ValueError: If mandatory attributes missing or invalid + + Processing Steps: + 1. Validate mandatory attributes present + 2. Skip further processing if intermediate field + 3. Clean string values (strip whitespace) + 4. Clean allowed_values if present + 5. Apply datatype-specific defaults + 6. Validate constraints (examples, patterns, etc.) + + Note: + Intermediate fields are skipped because they're auto-generated + structural fields, not real data fields. + + Called by visitor for each field during traversal. + """ field_mandatory_attributes(field) if ecs_helpers.is_intermediate(field): return @@ -163,6 +397,33 @@ def field_cleanup(field: FieldDetails) -> None: def field_defaults(field: FieldDetails) -> None: + """Apply default values for optional field attributes. + + Sets sensible defaults based on field type, reducing boilerplate in + schema YAML files. Also processes multi-fields. + + Args: + field: Field entry to enrich with defaults + + Side Effects: + Modifies field dictionary in place + + Defaults Applied: + General: + - short: Copy of description (if not specified) + - normalize: [] (empty array if not specified) + + Type-specific (see field_or_multi_field_datatype_defaults): + - keyword: ignore_above=1024 + - text: norms=false + - fields with index=false: doc_values=false + + Multi-fields: + - name: type name if not specified (e.g., 'text', 'keyword') + + Note: + Multi-fields get their own defaults applied recursively. + """ field['field_details'].setdefault('short', field['field_details']['description']) field['field_details'].setdefault('normalize', []) field_or_multi_field_datatype_defaults(field['field_details']) @@ -174,7 +435,32 @@ def field_defaults(field: FieldDetails) -> None: def field_or_multi_field_datatype_defaults(field_details: Union[Field, MultiField]) -> None: - """Sets datatype-related defaults on a canonical field or multi-field entries.""" + """Apply datatype-specific defaults to field or multi-field. + + Different Elasticsearch field types have different sensible defaults. + This function applies appropriate defaults based on the 'type' attribute. + + Args: + field_details: Field or multi-field definition dict + + Side Effects: + Modifies field_details dictionary in place + + Defaults by Type: + - keyword: ignore_above=1024 (truncate very long values) + - text: norms=false (save space, usually not needed for search) + - wildcard: Remove 'index' param (not applicable) + - index=false: doc_values=false, remove ignore_above + + Rationale: + - ignore_above prevents errors from very long strings + - norms=false is common for log data (saves significant space) + - doc_values=false with index=false is an optimization + - wildcard fields don't support some parameters + + Note: + Works for both regular fields and multi-fields (same logic applies). 
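+
+    Example (illustrative sketch of the defaults above):
+        >>> details = {'type': 'keyword'}
+        >>> field_or_multi_field_datatype_defaults(details)
+        >>> details['ignore_above']
+        1024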
+ """ if field_details['type'] == 'keyword': field_details.setdefault('ignore_above', 1024) if field_details['type'] == 'text': @@ -192,7 +478,43 @@ def field_or_multi_field_datatype_defaults(field_details: Union[Field, MultiFiel def field_mandatory_attributes(field: FieldDetails) -> None: - """Ensures for the presence of the mandatory field attributes and raises if any are missing""" + """Validate that all mandatory field attributes are present. + + Checks for required attributes with special handling for type-specific + requirements (alias, scaled_float). + + Args: + field: Field entry to validate + + Raises: + ValueError: If any mandatory attributes are missing + + Mandatory Attributes: + All fields: + - name: Field identifier + - description: Field description + - type: Elasticsearch field type + - level: Field level (core/extended/custom) + + Type-specific: + - alias fields: Also require 'path' (target field) + - scaled_float fields: Also require 'scaling_factor' + + Note: + Intermediate fields (auto-created parents) are skipped as they + don't need full validation. + + Example: + >>> field = { + ... 'field_details': { + ... 'name': 'method', + ... 'description': '...' + ... # Missing 'type' and 'level' + ... } + ... } + >>> field_mandatory_attributes(field) + ValueError: Field is missing the following mandatory attributes: type, level + """ if ecs_helpers.is_intermediate(field): return current_field_attributes: List[str] = sorted(field['field_details'].keys()) @@ -212,7 +534,33 @@ def field_mandatory_attributes(field: FieldDetails) -> None: def field_assertions_and_warnings(field: FieldDetails) -> None: - """Additional checks on a fleshed out field""" + """Perform additional validation checks on enriched field. + + Called after defaults are filled in and normalization is complete. + Validates quality constraints and semantic correctness. + + Args: + field: Field entry to validate + + Side Effects: + May print warnings or raise exceptions depending on strict_mode + + Checks Performed: + - Short description is single line and under 120 characters + - Beta description (if present) is single line + - Pattern (if present) is valid regex + - Example value matches pattern/expected_values + - Level is one of: core, extended, custom + + Raises: + ValueError: Always for invalid level (regardless of strict mode) + + Note: + Behavior depends on global strict_mode variable: + - strict=False: Prints warnings for most issues + - strict=True: Raises exceptions for all issues + - Invalid level always raises (can't continue with invalid level) + """ if not ecs_helpers.is_intermediate(field): # check short description length if in strict mode single_line_short_description(field, strict=strict_mode) @@ -227,13 +575,30 @@ def field_assertions_and_warnings(field: FieldDetails) -> None: ACCEPTABLE_FIELD_LEVELS) raise ValueError(msg) -# Common +# Common Validation Helpers SHORT_LIMIT = 120 def single_line_short_check(short_to_check: str, short_name: str) -> Union[str, None]: + """Check if a short description meets formatting requirements. + + Validates that a short description is: + - Single line (no newline characters) + - Under 120 characters long + + Args: + short_to_check: Short description string to validate + short_name: Name of field/fieldset (for error messages) + + Returns: + Error message string if validation fails, None if valid + + Note: + Does not raise or warn directly; returns error message for caller + to handle based on strict mode. 
+ """ short_length: int = len(short_to_check) if "\n" in short_to_check or short_length > SHORT_LIMIT: msg: str = "Short descriptions must be single line, and under {} characters (current length: {}).\n".format( @@ -246,7 +611,22 @@ def single_line_short_check(short_to_check: str, short_name: str) -> Union[str, def strict_warning_handler(message, strict): - """Handles warnings based on --strict mode""" + """Handle validation messages based on strict mode. + + Args: + message: Validation error/warning message + strict: Whether to treat as error (True) or warning (False) + + Raises: + ValueError: If strict=True + + Side Effects: + Prints warning if strict=False + + Note: + This centralized handler allows consistent behavior across all + validation checks. + """ if strict: raise ValueError(message) else: @@ -254,6 +634,18 @@ def strict_warning_handler(message, strict): def single_line_short_description(schema_or_field: FieldEntry, strict: Optional[bool] = True): + """Validate that short description is single line and under limit. + + Args: + schema_or_field: Field or fieldset entry to validate + strict: Whether to raise exception (True) or print warning (False) + + Raises: + ValueError: If validation fails and strict=True + + Side Effects: + Prints warning if validation fails and strict=False + """ error: Union[str, None] = single_line_short_check( schema_or_field['field_details']['short'], schema_or_field['field_details']['name']) if error: @@ -261,6 +653,24 @@ def single_line_short_description(schema_or_field: FieldEntry, strict: Optional[ def single_line_short_override_description(schema_or_field: FieldEntry, strict: Optional[bool] = True): + """Validate that reuse short_override descriptions are single line. + + When a fieldset is reused, it can have custom short descriptions for + each reuse location. This validates all such overrides. + + Args: + schema_or_field: Fieldset entry with reusable expected locations + strict: Whether to raise exception (True) or print warning (False) + + Raises: + ValueError: If validation fails and strict=True + + Side Effects: + Prints warning if validation fails and strict=False + + Note: + Only validates short_override if present; it's optional. + """ for field in schema_or_field['schema_details']['reusable']['expected']: if not 'short_override' in field: continue @@ -270,9 +680,40 @@ def single_line_short_override_description(schema_or_field: FieldEntry, strict: def check_example_value(field: Union[List, FieldEntry], strict: Optional[bool] = True) -> None: - """ - Checks if value of the example field is of type list or dict. - Fails or warns (depending on strict mode) if so. + """Validate example value meets field constraints. + + Performs several validation checks on the example value: + 1. Not a YAML-interpreted object/array (should be quoted string) + 2. Matches pattern regex (if pattern specified) + 3. 
In expected_values list (if expected_values specified) + + Args: + field: Field entry with field_details + strict: Whether to raise exception (True) or print warning (False) + + Raises: + ValueError: If validation fails and strict=True + + Side Effects: + Prints warning if validation fails and strict=False + + Example Value Formats: + - Simple: "GET" + - Array: '["GET", "POST"]' (must be quoted to avoid YAML parsing) + - With pattern: Must match the regex in 'pattern' attribute + + Special Handling: + - Array fields (normalize contains 'array'): Parses and validates each value + - Missing example: Skipped (example is optional) + + Common Issues: + - Unquoted array: [GET, POST] becomes Python list → Error + Fix: Quote it: "[GET, POST]" + - Pattern mismatch: Example doesn't match validation regex + - Invalid enum: Example not in expected_values list + + Note: + This prevents documentation from containing invalid or misleading examples. """ example_value: str = field['field_details'].get('example', '') pattern: str = field['field_details'].get('pattern', '') @@ -309,6 +750,21 @@ def check_example_value(field: Union[List, FieldEntry], strict: Optional[bool] = def single_line_beta_description(schema_or_field: FieldEntry, strict: Optional[bool] = True) -> None: + """Validate that beta description is single line. + + Beta fields/fieldsets have a 'beta' attribute explaining why they're + in beta. This must be a single line for consistency. + + Args: + schema_or_field: Field or fieldset entry with beta attribute + strict: Whether to raise exception (True) or print warning (False) + + Raises: + ValueError: If validation fails and strict=True + + Side Effects: + Prints warning if validation fails and strict=False + """ if "\n" in schema_or_field['field_details']['beta']: msg: str = "Beta descriptions must be single line.\n" msg += f"Offending field or field set: {schema_or_field['field_details']['name']}" @@ -316,8 +772,24 @@ def single_line_beta_description(schema_or_field: FieldEntry, strict: Optional[b def validate_pattern_regex(field, strict=True): - """ - Validates if field['pattern'] contains a valid regular expression. + """Validate that pattern attribute is a valid regular expression. + + Some fields have a 'pattern' attribute specifying a validation regex. + This ensures the pattern itself is syntactically valid. + + Args: + field: Field definition dict with 'pattern' attribute + strict: Whether to raise exception (True) or print warning (False) + + Raises: + ValueError: If validation fails and strict=True + + Side Effects: + Prints warning if validation fails and strict=False + + Note: + Uses Python's re.compile() to test validity. + Invalid patterns would cause runtime errors if not caught here. """ try: re.compile(field['pattern']) diff --git a/scripts/schema/exclude_filter.py b/scripts/schema/exclude_filter.py index 324a16807f..80304e48d0 100644 --- a/scripts/schema/exclude_filter.py +++ b/scripts/schema/exclude_filter.py @@ -15,6 +15,65 @@ # specific language governing permissions and limitations # under the License. +"""Schema Exclude Filter Module. + +This module explicitly removes specified fieldsets and fields from schemas. +It's the inverse of subset filtering - while subsets specify what to INCLUDE, +excludes specify what to REMOVE. 
+ +Exclude filters run after subset filters in the pipeline and are used for: + - **Deprecation testing**: Remove fields to test impact before actual removal + - **Impact analysis**: See what breaks when fields are removed + - **Custom deployments**: Remove unwanted fieldsets entirely + - **Security**: Exclude fields with sensitive data + - **Performance testing**: Remove expensive fields to measure impact + +Exclude Definition Format: + Excludes are defined as YAML arrays of fieldsets/fields to remove: + + ```yaml + - name: http + fields: + - name: request.referrer # Remove specific field + - name: response.body # Remove another field + + - name: geo + fields: + - name: location # Remove nested field + ``` + +Removal Behavior: + - Specified fields: Removed from schema + - Parent fields: Removed if all children removed (except 'base') + - Nested removal: Can remove deeply nested fields + - 'base' protection: Never auto-remove base fieldset + +Exclude vs Subset: + - Subset: Whitelist approach (specify what to keep) + - Exclude: Blacklist approach (specify what to remove) + - Can use both: Subset first (include only X), then exclude (remove Y from X) + +Example: + >>> from schema import loader, cleaner, finalizer, exclude_filter + >>> fields = loader.load_schemas() + >>> cleaner.clean(fields) + >>> finalizer.finalize(fields) + >>> filtered = exclude_filter.exclude( + ... fields, + ... ['excludes/deprecated.yml'] + ... ) + # specified fields removed from schema + +Common Use Case - Testing Deprecation: + Before removing a field from ECS, create an exclude file and test: + 1. Generate schemas with field excluded + 2. Run test suite to find breakages + 3. Update affected code + 4. Finally remove field from actual schemas + +See also: scripts/docs/schema-pipeline.md for pipeline documentation +""" + from typing import ( Dict, List, @@ -27,12 +86,32 @@ FieldNestedEntry, ) -# This script should be run downstream of the subset filters - it takes -# all ECS and custom fields already loaded by the latter and explicitly -# removes a subset, for example, to simulate impact of future removals - def exclude(fields: Dict[str, FieldEntry], exclude_file_globs: List[str]) -> Dict[str, FieldEntry]: + """Remove specified fields from schema. + + Main entry point for exclude filtering. Loads exclude definitions and + removes matching fields from the field dictionary. + + Args: + fields: Complete field dictionary (typically after subset filtering) + exclude_file_globs: List of paths/globs to exclude definition YAML files + + Returns: + Modified field dictionary with excluded fields removed + + Side Effects: + Modifies fields dictionary in place (also returns it) + + Processing: + 1. Load exclude definition files + 2. For each exclude list, traverse and remove specified fields + 3. Auto-remove parent fields if all children removed (except 'base') + + Example: + >>> fields = exclude(all_fields, ['excludes/deprecated.yml']) + # Fields specified in deprecated.yml are removed + """ excludes: List[FieldNestedEntry] = load_exclude_definitions(exclude_file_globs) if excludes: @@ -42,6 +121,18 @@ def exclude(fields: Dict[str, FieldEntry], exclude_file_globs: List[str]) -> Dic def long_path(path_as_list: List[str]) -> str: + """Convert path array to dotted string. 
+ + Args: + path_as_list: Array of path components + + Returns: + Dot-joined path string + + Example: + >>> long_path(['http', 'request', 'method']) + 'http.request.method' + """ return '.'.join([e for e in path_as_list]) @@ -51,7 +142,32 @@ def pop_field( path: List[str], removed: List[str] ) -> str: - """pops a field from yaml derived dict using path derived from ordered list of nodes""" + """Recursively remove a field at specified path. + + Traverses nested field structure and removes the field at the end of the + path. Auto-removes parent fields if they become empty (except 'base'). + + Args: + fields: Field dictionary to modify + node_path: Remaining path components to traverse + path: Complete original path (for error messages) + removed: List of already removed paths (to avoid duplicate errors) + + Returns: + Flat name of removed field + + Raises: + ValueError: If path not found and not already removed + + Behavior: + - Leaf field: Remove it + - Parent field: Recurse to child, then remove parent if empty + - 'base' exception: Never auto-remove base fieldset even if empty + + Note: + Modifies fields dict in place. Tracks removed paths to handle + parent removal gracefully. + """ if node_path[0] in fields: if len(node_path) == 1: flat_name: str = long_path(path) @@ -81,7 +197,24 @@ def exclude_trace_path( path: List[str], removed: List[str] ) -> None: - """traverses paths to one or more nodes in a yaml derived dict""" + """Traverse and remove fields specified in exclude list. + + Processes an array of field specifications from an exclude definition, + removing each one and tracking what was removed. + + Args: + fields: Field dictionary to modify + item: List of field specifications to remove + path: Current path prefix + removed: List tracking removed field paths + + Raises: + ValueError: If exclude item has 'fields' (nested excludes not supported) + + Note: + Exclude definitions specify fields to remove, not nested structures. + Each item should be a leaf field path, not a container with sub-fields. + """ for list_item in item: node_path: List[str] = path.copy() # cater for name.with.dots @@ -98,7 +231,23 @@ def exclude_trace_path( def exclude_fields(fields: Dict[str, FieldEntry], excludes: List[FieldNestedEntry]) -> Dict[str, FieldEntry]: - """Traverses fields and eliminates any field which matches the excludes""" + """Apply all exclude definitions to field dictionary. + + Iterates through exclude definitions and removes matching fields. + + Args: + fields: Field dictionary to modify + excludes: List of exclude definition documents + + Returns: + Modified field dictionary (also modified in place) + + Processing: + For each exclude document: + - For each fieldset item in document: + - Remove specified fields from that fieldset + - Clean up empty parents + """ if excludes: for ex_list in excludes: for item in ex_list: @@ -107,6 +256,20 @@ def exclude_fields(fields: Dict[str, FieldEntry], excludes: List[FieldNestedEntr def load_exclude_definitions(file_globs: List[str]) -> List[FieldNestedEntry]: + """Load exclude definition files from filesystem. + + Args: + file_globs: List of file paths or glob patterns + + Returns: + List of parsed exclude definition documents + + Raises: + ValueError: If file_globs specified but no files found + + Note: + Returns empty list if file_globs is empty/None (no exclusions). 
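+
+    Example (illustrative):
+        >>> load_exclude_definitions([])
+        []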
+ """ if not file_globs: return [] excludes: List[FieldNestedEntry] = loader.load_definitions(file_globs) diff --git a/scripts/schema/finalizer.py b/scripts/schema/finalizer.py index 43ede81a19..ab973bd254 100644 --- a/scripts/schema/finalizer.py +++ b/scripts/schema/finalizer.py @@ -15,34 +15,154 @@ # specific language governing permissions and limitations # under the License. +"""Schema Finalizer Module. + +This module performs field reuse and calculates final field names. It's the third +stage of the schema processing pipeline, after loader and cleaner: + + loader.py → cleaner.py → finalizer.py → intermediate_files.py + +The finalizer performs two critical operations: +1. **Field Reuse**: Copy fieldsets to multiple locations (composition) +2. **Name Calculation**: Compute flat_name, dashed_name for all fields + +Field Reuse Mechanism: + ECS uses composition to avoid repetition. Common fieldsets like 'user', 'geo', + and 'process' can be reused at multiple locations. For example: + + - user → destination.user, source.user, client.user, server.user + - geo → client.geo, destination.geo, host.geo (but NOT at root level) + - process → process.parent (self-nesting for parent process) + +Two Phases of Reuse: + + **Phase 1: Foreign Reuse (Across Fieldsets)** + Copy a fieldset into a different fieldset. Example: + - 'user' → 'destination.user.*' + - Fields: destination.user.name, destination.user.email, etc. + - Transitive: If 'group' is reused in 'user', then 'destination.user' + automatically contains 'destination.user.group.*' + + **Phase 2: Self-Nesting (Within Same Fieldset)** + Copy a fieldset into itself with a different name. Example: + - 'process' → 'process.parent.*' + - Fields: process.parent.pid, process.parent.name, etc. + - NOT transitive: 'source.process' does NOT get 'source.process.parent' + + Key Difference: Phase 1 reuse is transitive (carried along when destination + is also reused). Phase 2 reuse is local only (not propagated). + +Reuse Order: + Some fieldsets depend on others being reused first. The 'order' attribute + controls reuse sequence: + - order=1: Reused first (e.g., 'group' must be in 'user' before 'user' is reused) + - order=2: Default priority (most fieldsets) + + Within each order level, Phase 1 (foreign) happens before Phase 2 (self-nesting). + +Tracking Reuse: + - original_fieldset: Set on all reused fields to track their source + - reused_here: List on receiving fieldset showing what was reused into it + - nestings: Legacy list of nested fieldset names (maintained for compatibility) + +Final Field Names: + After reuse is complete, calculates: + - flat_name: Full dotted name (e.g., 'destination.user.name') + - dashed_name: Kebab-case version (e.g., 'destination-user-name') + - Multi-field flat_names: e.g., 'user.name.text' + +Example: + >>> from schema import loader, cleaner, finalizer + >>> fields = loader.load_schemas() + >>> cleaner.clean(fields) + >>> finalizer.finalize(fields) + # Now fields contain all reused copies and have final names calculated + +See also: scripts/docs/schema-pipeline.md for complete pipeline documentation +""" + import copy import re from schema import visitor -# This script takes the fleshed out deeply nested fields dictionary as emitted by -# cleaner.py, and performs field reuse in two phases, repeated for each reuse order, from highest -# priority to lowest. -# -# Phase 1 performs field reuse across field sets. E.g. `group` fields should also be under `user`. 
-# This type of reuse is then carried around if the receiving field set is also reused. -# In other words, user.group.* will be in other places where user is nested: -# source.user.* will contain source.user.group.* - -# Phase 2 performs field reuse where field sets are reused within themselves, with a different name. -# Examples are nesting `process` within itself, as `process.parent.*`, -# or nesting `user` within itself at `user.target.*`. -# This second kind of nesting is not carried around everywhere else the receiving field set is reused. -# So `user.target.*` is *not* carried over to `source.user.target*` when we reuse `user` under `source`. - def finalize(fields): - """Intended entrypoint of the finalizer.""" + """Finalize schemas by performing reuse and calculating final field names. + + This is the main entry point for the finalizer module. It orchestrates + the two-phase reuse process and then calculates all final field properties. + + Args: + fields: Deeply nested field dictionary from cleaner.py + + Side Effects: + Modifies fields dictionary in place: + - Adds reused fieldset copies at specified locations + - Sets original_fieldset on all reused fields + - Calculates flat_name, dashed_name for all fields + - Sets multi-field flat_names + - Adds reused_here metadata to receiving fieldsets + + Processing Steps: + 1. Perform field reuse (two-phase, respecting order) + 2. Calculate final values (names and derived properties) + + Example: + >>> fields = loader.load_schemas() + >>> cleaner.clean(fields) + >>> finalize(fields) + # Fields now contain all reused copies with calculated names + """ perform_reuse(fields) calculate_final_values(fields) def order_reuses(fields): + """Organize reuse operations by priority order and phase type. + + Examines all reusable fieldsets and categorizes their reuse locations into: + - Foreign reuses (Phase 1): Reuse into different fieldset + - Self-nestings (Phase 2): Reuse into same fieldset + + Both are grouped by 'order' priority for sequential processing. + + Args: + fields: Deeply nested field dictionary + + Returns: + Tuple of (foreign_reuses, self_nestings): + - foreign_reuses: {order: {schema_name: [reuse_entries]}} + - self_nestings: {order: {schema_name: [reuse_entries]}} + + Structure Example: + foreign_reuses = { + 1: { # Order 1 (high priority) + 'group': [ + {'at': 'user', 'as': 'group', 'full': 'user.group'} + ] + }, + 2: { # Order 2 (default priority) + 'user': [ + {'at': 'destination', 'as': 'user', 'full': 'destination.user'}, + {'at': 'source', 'as': 'user', 'full': 'source.user'} + ] + } + } + + self_nestings = { + 2: { + 'process': [ + {'at': 'process', 'as': 'parent', 'full': 'process.parent'} + ] + } + } + + Note: + - Foreign vs self determined by comparing source and destination fieldset names + - Order values typically 1 (high priority) or 2 (default) + - Used by perform_reuse() to execute reuses in correct sequence + """ foreign_reuses = {} self_nestings = {} for schema_name, schema in fields.items(): @@ -65,7 +185,55 @@ def order_reuses(fields): def perform_reuse(fields): - """Performs field reuse respecting order for both foreign reuses and self-nestings""" + """Execute all field reuse operations in correct order and phases. + + Orchestrates the two-phase reuse process, respecting priority order. + For each order level (1, 2, etc.): + 1. Phase 1: Foreign reuses (transitive) + 2. 
Phase 2: Self-nestings (non-transitive) + + Args: + fields: Deeply nested field dictionary + + Side Effects: + Modifies fields dictionary in place by: + - Adding reused field copies at destination locations + - Marking all copied fields with original_fieldset + - Setting intermediate=True on reused fieldset wrappers + - Recording reused_here metadata on destination fieldsets + + Reuse Process: + 1. Organize reuses by order and type (foreign vs self) + 2. For each order (sorted, low to high): + a. Phase 1: Process all foreign reuses at this order + - Deep copy source fieldset's fields + - Mark all with original_fieldset + - Place at destination location + b. Phase 2: Process all self-nestings at this order + - Make pristine copy before any self-nesting + - Deep copy for each self-nesting location + - Place under same fieldset + + Example: + Order 1: + - Foreign: group → user.group + Order 2: + - Foreign: user (now containing group) → destination.user (includes group!) + - Self: process → process.parent + + Transitive Behavior: + Foreign reuses are transitive: + - If A is reused in B, and B is reused in C, then C gets A too + - Example: group in user, user in destination → destination.user.group + + Self-nestings are NOT transitive: + - If A self-nests as A.parent, reusing A elsewhere doesn't include A.parent + - Example: process.parent exists, but source.process.parent does NOT + + Note: + Uses deep copy to avoid sharing references between locations. + Each reused location gets independent field copies. + """ foreign_reuses, self_nestings = order_reuses(fields) # Process foreign reuses and self-nestings together, respecting order @@ -123,10 +291,32 @@ def perform_reuse(fields): def ensure_valid_reuse(reused_schema, destination_schema=None): - """ - Raise if either the reused schema or destination schema have root=true. + """Validate that schemas can participate in reuse operation. + + Root fieldsets (root=true) cannot be reused or have fields reused into them + because their fields appear at document root level without a prefix. - Second param is optional, if testing for a self-nesting (where source=destination). + Args: + reused_schema: Schema being reused (source) + destination_schema: Schema receiving the reuse (destination), optional + for self-nesting validation + + Raises: + ValueError: If reused_schema has root=true, or if destination_schema + (when provided) has root=true + + Rationale: + Root fieldsets like 'base' have fields at document root (@timestamp, + labels, tags). These can't be meaningfully reused elsewhere, and other + fieldsets can't be reused into them without breaking the root contract. + + Example: + >>> # Valid: user is not root, can be reused + >>> ensure_valid_reuse(fields['user'], fields['destination']) + + >>> # Invalid: base is root + >>> ensure_valid_reuse(fields['base'], fields['destination']) + ValueError: Schema base has attribute root=true and cannot be reused """ if reused_schema['schema_details']['root']: msg = "Schema {} has attribute root=true and therefore cannot be reused.".format( @@ -139,7 +329,41 @@ def ensure_valid_reuse(reused_schema, destination_schema=None): def append_reused_here(reused_schema, reuse_entry, destination_schema): - """Captures two ways of denoting what field sets are reused under a given field set""" + """Record metadata about what was reused into a destination schema. + + Maintains two tracking mechanisms: + 1. nestings: Legacy list of full paths (e.g., ['destination.user']) + 2. 
reused_here: Detailed array with descriptions and metadata + + Args: + reused_schema: Source schema being reused + reuse_entry: Reuse configuration dict with 'at', 'as', 'full', etc. + destination_schema: Destination schema receiving the reuse + + Side Effects: + Modifies destination_schema['schema_details'] in place: + - Appends to 'nestings' array + - Appends to 'reused_here' array + + reused_here Entry Structure: + { + 'schema_name': 'user', # Source schema + 'full': 'destination.user', # Full path + 'short': '...', # Description (from short_override or source) + 'normalize': [...], # Optional normalization rules + 'beta': '...' # Optional beta notice + } + + Use Cases: + - Documentation: Show what fieldsets are nested where + - Validation: Verify expected reuse locations + - Metadata: Track normalization and beta status at reuse location + + Note: + Supports short_override for contextual descriptions at reuse location. + Example: 'user' might have different short description when reused + at 'destination.user' vs 'source.user'. + """ # Legacy, too limited destination_schema['schema_details'].setdefault('nestings', []) destination_schema['schema_details']['nestings'] = sorted( @@ -163,7 +387,36 @@ def append_reused_here(reused_schema, reuse_entry, destination_schema): def set_original_fieldset(fields, original_fieldset): - """Recursively set the 'original_fieldset' attribute for all fields in a group of fields""" + """Recursively mark all fields with their source fieldset name. + + When fields are reused, they need to remember where they came from. + This function stamps all fields in a field group with the original_fieldset + attribute, using the visitor pattern. + + Args: + fields: Field group dictionary (can be nested) + original_fieldset: Name of source fieldset (e.g., 'user', 'process') + + Side Effects: + Modifies all fields in place, adding 'original_fieldset' attribute + + Behavior: + - Uses setdefault, so doesn't override if already set + - Preserves nested original_fieldset (e.g., group fields in user) + - Applied recursively to all nested fields + + Example: + >>> # Copying 'user' fields to 'destination.user' + >>> reused_fields = copy.deepcopy(fields['user']['fields']) + >>> set_original_fieldset(reused_fields, 'user') + # All fields now have original_fieldset='user' + # destination.user.name shows it came from user + + Use Cases: + - Documentation: Show which fields are reused vs native + - OTel mapping: Reused fields may have different OTel mappings + - Debugging: Track field provenance through reuse chain + """ def func(details): # Don't override if already set (e.g. 'group' for user.group.* fields) details['field_details'].setdefault('original_fieldset', original_fieldset) @@ -171,7 +424,45 @@ def func(details): def field_group_at_path(dotted_path, fields): - """Returns the ['fields'] hash at the dotted_path.""" + """Navigate to and return the fields dictionary at a dotted path. + + Traverses the nested field structure following a dotted path and returns + the 'fields' dictionary at that location, creating it if necessary for + object/group/nested types. 
+ + Args: + dotted_path: Dot-separated path string (e.g., 'destination.user') + fields: Root fields dictionary to navigate from + + Returns: + The 'fields' dictionary at the specified path + + Raises: + ValueError: If path doesn't exist or non-nestable field is in the way + + Behavior: + - Follows path components left to right + - Creates 'fields' dict if needed for object/group/nested types + - Fails if path goes through non-nestable field (keyword, long, etc.) + + Example: + >>> # Get user fields under destination + >>> user_fields = field_group_at_path('destination', fields) + # Returns fields['destination']['fields'] + + >>> # Navigate deeper + >>> group_fields = field_group_at_path('destination.user', fields) + # Returns fields['destination']['fields']['user']['fields'] + + Use Cases: + - Placing reused fieldsets at specific locations + - Adding fields to nested structures during reuse + - Validating paths exist before modification + + Note: + Auto-creates 'fields' dict for object/group/nested types if missing. + This supports incremental field addition during reuse. + """ path = dotted_path.split('.') nesting = fields for next_field in path: @@ -190,17 +481,77 @@ def field_group_at_path(dotted_path, fields): def calculate_final_values(fields): - """ - This function navigates all fields recursively. + """Calculate final field names and properties after reuse is complete. + + Traverses all fields and computes path-based values that couldn't be + calculated until after reuse was performed: + - flat_name: Full dotted field name + - dashed_name: Kebab-case version for use in URLs, filenames + - multi-field flat_names: Names for alternate representations + - OTel mappings for reused fields (from otel_reuse) - It populates a few more values for the fields, especially path-based values - like flat_name. + Args: + fields: Deeply nested field dictionary (after reuse) + + Side Effects: + Modifies all field definitions in place, adding calculated properties + + Processing: + Uses visitor pattern with path tracking to build full field names + from the root down through all nesting levels. + + Example: + Before: field name='name', path=['destination', 'user'] + After: flat_name='destination.user.name', + dashed_name='destination-user-name' + + Note: + Must be called AFTER perform_reuse() so all fields are in final + locations before calculating names. """ visitor.visit_fields_with_path(fields, field_finalizer) def field_finalizer(details, path): - """This is the function called by the visitor to perform the work of calculate_final_values""" + """Calculate and set final field properties based on path. + + Callback function used by visitor during calculate_final_values traversal. + Computes all path-dependent field properties. + + Args: + details: Field entry dict with 'field_details' + path: Array of field names from root to parent (e.g., ['destination', 'user']) + + Side Effects: + Modifies details['field_details'] in place, adding: + - flat_name: Full dotted name + - dashed_name: Kebab-case name + - multi_fields[].flat_name: Names for each multi-field + - otel: Mappings (for reused fields with otel_reuse) + + Calculated Values: + - flat_name: path + node_name joined by dots + Example: ['destination', 'user'] + 'name' → 'destination.user.name' + + - dashed_name: flat_name with dots/underscores → dashes, @ removed + Example: 'destination.user.name' → 'destination-user-name' + '@timestamp' → 'timestamp' + + - multi-field flat_names: parent flat_name + '.' 
+ multi-field name + Example: 'message.text' for text multi-field on message + + OTel Reuse Handling: + - For reused fields: Checks otel_reuse for location-specific mappings + - Removes base otel mapping (from original location) + - Applies otel_reuse mapping if it matches the new flat_name + - Cleans up otel_reuse attribute after processing + + Example: + >>> # Field at path ['destination', 'user'] with node_name 'name' + >>> field_finalizer(details, ['destination', 'user']) + # details['field_details']['flat_name'] = 'destination.user.name' + # details['field_details']['dashed_name'] = 'destination-user-name' + """ name_array = path + [details['field_details']['node_name']] flat_name = '.'.join(name_array) diff --git a/scripts/schema/loader.py b/scripts/schema/loader.py index 3ac9e8ad20..8f933c2c3a 100644 --- a/scripts/schema/loader.py +++ b/scripts/schema/loader.py @@ -15,6 +15,65 @@ # specific language governing permissions and limitations # under the License. +"""Schema Loader Module. + +This module is the entry point for the ECS schema processing pipeline. It loads +schema definitions from YAML files (either from filesystem or git) and transforms +them into a deeply nested structure that can be processed by downstream stages. + +Loading Sources: + 1. **ECS Core Schemas**: From schemas/*.yml directory + 2. **Experimental Schemas**: From experimental/schemas/ (optional) + 3. **Custom Schemas**: User-provided schema files (optional) + 4. **Git References**: Load schemas from specific git tags/branches + +The loading process: + - Reads raw YAML schema files (arrays of fieldset definitions) + - Transforms flat dotted field names into nested structures + - Merges multiple schema sources together safely + - Creates intermediate parent fields automatically (e.g., 'http.request') + - Preserves minimal structure for downstream processing + +Output Structure: + The deeply nested structure returned looks like: + + { + 'schema_name': { + 'schema_details': { # Fieldset-level metadata + 'reusable': {...}, + 'root': bool, + 'group': int, + 'title': str + }, + 'field_details': { # Field properties for the fieldset itself + 'name': str, + 'description': str, + 'type': 'group' + }, + 'fields': { # Nested fields within this fieldset + 'field_name': { + 'field_details': {...}, + 'fields': {...} # Recursive nesting + } + } + } + } + +Key Concepts: + - **Deeply Nested**: Dotted names like 'http.request.method' become nested dicts + - **Intermediate Fields**: Auto-created parent fields (e.g., 'http.request') + - **Schema Merging**: Custom schemas can extend or override ECS schemas + - **Minimal Defaults**: Only sets bare minimum; cleaner.py fills in rest + +This module does NOT: + - Validate field definitions (handled by cleaner.py) + - Perform field reuse (handled by finalizer.py) + - Calculate final field names (handled by finalizer.py) + - Apply defaults beyond structure (handled by cleaner.py) + +See also: scripts/docs/schema-pipeline.md for complete pipeline documentation +""" + import copy import git import glob @@ -35,42 +94,6 @@ SchemaDetails, ) -# Loads main ECS schemas and optional additional schemas. -# They are deeply nested, then merged together. -# This script doesn't fill in defaults other than the bare minimum for a predictable -# deeply nested structure. It doesn't concern itself with what "should be allowed" -# in being a good ECS citizen. It just loads things and merges them together. - -# The deeply nested structured returned by this script looks like this. 
-# -# [schema name]: { -# 'schema_details': { -# 'reusable': ... -# }, -# 'field_details': { -# 'type': ... -# }, -# 'fields': { -# [field name]: { -# 'field_details': { ... } -# 'fields': { -# -# (dotted key names replaced by deep nesting) -# [field name]: { -# 'field_details': { ... } -# 'fields': { -# } -# } -# } -# } -# } - -# Schemas at the top level always have all 3 keys populated. -# Leaf fields only have 'field_details' populated. -# Any intermediate field with other fields nested within them have 'fields' populated. -# Note that intermediate fields rarely have 'field_details' populated, but it's supported. -# Examples of this are 'dns.answers', 'observer.egress'. - EXPERIMENTAL_SCHEMA_DIR = 'experimental/schemas' @@ -79,7 +102,46 @@ def load_schemas( ref: Optional[str] = None, included_files: Optional[List[str]] = [] ) -> Dict[str, FieldEntry]: - """Loads ECS and custom schemas. They are returned deeply nested and merged.""" + """Load ECS schemas from filesystem or git, optionally including custom schemas. + + This is the main entry point for schema loading. It orchestrates loading from + multiple sources and merges them into a unified deeply nested structure. + + Args: + ref: Optional git reference (tag/branch/commit) to load schemas from. + If None, loads from current filesystem. + included_files: Optional list of additional schema files or directories + to include (e.g., custom schemas, experimental schemas) + + Returns: + Dictionary mapping schema names to their deeply nested field structures. + Each schema has 'schema_details', 'field_details', and 'fields' keys. + + Loading Order: + 1. Load ECS core schemas (from git ref or filesystem) + 2. If ref specified + experimental requested: Load experimental from git + 3. Load any remaining custom schema files from filesystem + 4. Merge all sources together (custom can override ECS) + + Raises: + ValueError: If schema file has missing 'name' attribute + KeyError: If git ref doesn't contain expected schema directory + + Note: + - Experimental schemas are only loaded from git if --ref is specified + - Custom schemas are always loaded from filesystem (not git) + - Merging allows custom schemas to extend/override ECS definitions + + Example: + >>> # Load current ECS schemas + >>> fields = load_schemas() + >>> len(fields) # Number of fieldsets + 45 + + >>> # Load from specific version with custom schemas + >>> fields = load_schemas(ref='v8.10.0', + ... included_files=['custom/myfields.yml']) + """ # ECS fields (from git ref or not) schema_files_raw: Dict[str, FieldNestedEntry] = load_schemas_from_git( ref) if ref else load_schema_files(ecs_helpers.ecs_files()) @@ -103,6 +165,20 @@ def load_schemas( def load_schema_files(files: List[str]) -> Dict[str, FieldNestedEntry]: + """Load multiple schema YAML files from filesystem and merge them. + + Args: + files: List of file paths to YAML schema files + + Returns: + Dictionary mapping schema names to their raw (not yet nested) definitions + + Raises: + ValueError: If duplicate schema names are found across files + + Note: + Uses safe_merge_dicts to prevent accidental overwrites. + """ fields_nested: Dict[str, FieldNestedEntry] = {} for f in files: new_fields: Dict[str, FieldNestedEntry] = read_schema_file(f) @@ -114,6 +190,29 @@ def load_schemas_from_git( ref: str, target_dir: Optional[str] = 'schemas' ) -> Dict[str, FieldNestedEntry]: + """Load schema files from a specific git reference. 
+
+    Resolves the specified git reference and reads all YAML files from the
+    target directory, without checking any files out to the filesystem.
+
+    Args:
+        ref: Git reference (tag, branch, or commit SHA) to load from
+        target_dir: Directory path within the git tree (default: 'schemas')
+
+    Returns:
+        Dictionary mapping schema names to their raw (not yet nested) definitions
+
+    Raises:
+        KeyError: If target directory doesn't exist in the git ref
+        ValueError: If duplicate schema names are found
+
+    Note:
+        Reads files directly from git objects without filesystem checkout.
+
+    Example:
+        >>> schemas = load_schemas_from_git('v8.10.0')
+        >>> schemas = load_schemas_from_git('main', target_dir='experimental/schemas')
+    """
     tree: git.objects.tree.Tree = ecs_helpers.get_tree_by_ref(ref)
     fields_nested: Dict[str, FieldNestedEntry] = {}
@@ -129,7 +228,23 @@ def load_schemas_from_git(
 def read_schema_file(file_name: str) -> Dict[str, FieldNestedEntry]:
-    """Read a raw schema yml file into a dict."""
+    """Read and parse a single YAML schema file from the filesystem.
+
+    Args:
+        file_name: Path to YAML schema file
+
+    Returns:
+        Dictionary with schema name as key, schema definition as value
+
+    Raises:
+        ValueError: If schema is missing 'name' attribute
+        yaml.YAMLError: If file contains invalid YAML
+
+    Example:
+        >>> schemas = read_schema_file('schemas/http.yml')
+        >>> 'http' in schemas
+        True
+    """
     with open(file_name) as f:
         raw: List[FieldNestedEntry] = yaml.safe_load(f.read())
     return nest_schema(raw, file_name)
@@ -139,7 +254,22 @@ def read_schema_blob(
     blob: git.objects.blob.Blob,
     ref: str
 ) -> Dict[str, FieldNestedEntry]:
-    """Read a raw schema yml git blob into a dict."""
+    """Read and parse a YAML schema from a git blob object.
+
+    Args:
+        blob: Git blob object containing YAML schema content
+        ref: Git reference being loaded (for error messages)
+
+    Returns:
+        Dictionary with schema name as key, schema definition as value
+
+    Raises:
+        ValueError: If schema is missing 'name' attribute
+        yaml.YAMLError: If blob contains invalid YAML
+
+    Note:
+        Constructs a friendly file name for error messages: "http.yml (git ref v8.10.0)"
+    """
     content: str = blob.data_stream.read().decode('utf-8')
     raw: List[FieldNestedEntry] = yaml.safe_load(content)
     file_name: str = "{} (git ref {})".format(blob.name, ref)
@@ -147,11 +277,32 @@ def read_schema_blob(
 def nest_schema(raw: List[FieldNestedEntry], file_name: str) -> Dict[str, FieldNestedEntry]:
-    """
-    Raw schema files are an array of schema details: [{'name': 'base', ...}]
+    """Transform raw schema array into dictionary keyed by schema name.
+
+    Schema YAML files contain an array (list) of schema definitions. This
+    function converts that array into a dictionary for easier access, using
+    each schema's 'name' attribute as the key.
+
+    Args:
+        raw: List of schema definitions from YAML file
+        file_name: Name of source file (for error messages)
+
+    Returns:
+        Dictionary mapping schema names to their definitions:
+        {'http': {...}, 'user': {...}}
+
+    Raises:
+        ValueError: If any schema is missing the mandatory 'name' attribute

-    This function loops over the array (usually 1 schema per file) and turns it into
-    a dict with the schema name as the key: { 'base': { 'name': 'base', ...}}
+
+    Note:
+        Most schema files contain exactly one schema, but multiple schemas
+        per file are supported.
+ + Example: + >>> raw = [{'name': 'http', 'title': 'HTTP', 'fields': [...]}] + >>> nested = nest_schema(raw, 'http.yml') + >>> nested + {'http': {'name': 'http', 'title': 'HTTP', 'fields': [...]}} """ fields: Dict[str, FieldNestedEntry] = {} for schema in raw: @@ -162,6 +313,59 @@ def nest_schema(raw: List[FieldNestedEntry], file_name: str) -> Dict[str, FieldN def deep_nesting_representation(fields: Dict[str, FieldNestedEntry]) -> Dict[str, FieldEntry]: + """Transform flat schema definitions into deeply nested field structures. + + Takes schemas with flat field arrays and converts them into the deeply nested + structure used by the rest of the pipeline. This involves: + - Separating schema-level metadata from field-level metadata + - Converting dotted field names into nested dictionaries + - Creating intermediate parent fields automatically + + Args: + fields: Dictionary of raw schema definitions with flat field arrays + + Returns: + Dictionary mapping schema names to deeply nested structures with: + - schema_details: Fieldset-level metadata (root, group, reusable, title) + - field_details: Field properties for the fieldset itself + - fields: Recursively nested field definitions + + Structure Transformation: + Input (flat): + { + 'http': { + 'name': 'http', + 'title': 'HTTP', + 'fields': [ + {'name': 'request.method', 'type': 'keyword'}, + {'name': 'response.status_code', 'type': 'long'} + ] + } + } + + Output (deeply nested): + { + 'http': { + 'schema_details': {'title': 'HTTP', ...}, + 'field_details': {'name': 'http', ...}, + 'fields': { + 'request': { + 'field_details': {'intermediate': True, ...}, + 'fields': { + 'method': { + 'field_details': {'type': 'keyword', ...} + } + } + } + } + } + } + + Note: + - Schema-level keys (root, group, reusable, title) go to schema_details + - Everything else becomes field_details + - Intermediate fields are auto-created for nesting paths + """ deeply_nested: Dict[str, FieldEntry] = {} for (name, flat_schema) in fields.items(): @@ -188,6 +392,58 @@ def deep_nesting_representation(fields: Dict[str, FieldNestedEntry]) -> Dict[str def nest_fields(field_array: List[Field]) -> Dict[str, Dict[str, FieldEntry]]: + """Convert flat array of fields with dotted names into nested structure. + + Takes a flat array of field definitions (where 'name' can contain dots like + 'request.method') and builds a nested dictionary structure. Automatically + creates intermediate parent fields as needed. + + Args: + field_array: List of field definitions with potentially dotted names + + Returns: + Dictionary with 'fields' key containing nested field structure + + Field Nesting Logic: + 1. Split dotted names: 'request.method' -> ['request', 'method'] + 2. Create intermediate fields for parents: 'request' becomes type='object' + 3. Mark intermediate fields so they can be identified later + 4. Preserve explicitly defined object/nested fields (not intermediate) + 5. 
Place leaf field at deepest nesting level + + Example: + Input: + [ + {'name': 'method', 'type': 'keyword'}, + {'name': 'request.method', 'type': 'keyword'}, + {'name': 'request.bytes', 'type': 'long'} + ] + + Output: + { + 'fields': { + 'method': { + 'field_details': {'name': 'method', 'type': 'keyword'} + }, + 'request': { + 'field_details': { + 'name': 'request', + 'type': 'object', + 'intermediate': True # Auto-created + }, + 'fields': { + 'method': {...}, + 'bytes': {...} + } + } + } + } + + Note: + - Intermediate fields get type='object' and intermediate=True + - Explicitly defined object/nested fields keep intermediate=False + - node_name is set for all fields (used internally for tracking) + """ schema_root: Dict[str, Dict[str, FieldEntry]] = {'fields': {}} for field in field_array: nested_levels: List[str] = field['name'].split('.') @@ -226,6 +482,17 @@ def nest_fields(field_array: List[Field]) -> Dict[str, Dict[str, FieldEntry]]: def array_of_maps_to_map(array_vals: List[MultiField]) -> Dict[str, MultiField]: + """Convert array of multi-field definitions to dictionary keyed by name. + + Args: + array_vals: List of multi-field definitions, each with a 'name' key + + Returns: + Dictionary mapping multi-field names to their definitions + + Note: + If duplicate names exist, the last one wins (useful for overrides). + """ ret_map: Dict[str, MultiField] = {} for map_val in array_vals: name: str = map_val['name'] @@ -235,6 +502,14 @@ def array_of_maps_to_map(array_vals: List[MultiField]) -> Dict[str, MultiField]: def map_of_maps_to_array(map_vals: Dict[str, MultiField]) -> List[MultiField]: + """Convert dictionary of multi-fields back to sorted array. + + Args: + map_vals: Dictionary of multi-field definitions + + Returns: + Sorted list of multi-field definitions (sorted by name) + """ ret_list: List[MultiField] = [] for key in map_vals: ret_list.append(map_vals[key]) @@ -242,13 +517,82 @@ def map_of_maps_to_array(map_vals: Dict[str, MultiField]) -> List[MultiField]: def dedup_and_merge_lists(list_a: List[MultiField], list_b: List[MultiField]) -> List[MultiField]: + """Merge two multi-field lists, removing duplicates and preferring list_b. + + When the same multi-field name appears in both lists, the definition from + list_b takes precedence. This allows custom schemas to override ECS defaults. + + Args: + list_a: First list of multi-field definitions (lower priority) + list_b: Second list of multi-field definitions (higher priority) + + Returns: + Merged and sorted list of unique multi-field definitions + + Example: + >>> list_a = [{'name': 'text', 'type': 'text'}] + >>> list_b = [{'name': 'keyword', 'type': 'keyword'}] + >>> dedup_and_merge_lists(list_a, list_b) + [{'name': 'keyword', ...}, {'name': 'text', ...}] # Sorted by name + """ list_a_map: Dict[str, MultiField] = array_of_maps_to_map(list_a) list_a_map.update(array_of_maps_to_map(list_b)) return map_of_maps_to_array(list_a_map) def merge_fields(a: Dict[str, FieldEntry], b: Dict[str, FieldEntry]) -> Dict[str, FieldEntry]: - """Merge ECS field sets with custom field sets.""" + """Recursively merge two field dictionaries, with b taking precedence. + + Performs deep merging of field structures, allowing custom schemas to extend + or override ECS definitions. Handles special cases for arrays (normalize, + multi_fields) and nested structures. 
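+
+    As a minimal sketch (hypothetical inputs) of the array handling described
+    under "Special Cases" below, 'normalize' arrays are concatenated rather
+    than replaced:
+
+    >>> a = {'f': {'field_details': {'normalize': ['array']}}}      # hypothetical entry
+    >>> b = {'f': {'field_details': {'normalize': ['lowercase']}}}  # hypothetical entry
+    >>> merge_fields(a, b)['f']['field_details']['normalize']
+    ['array', 'lowercase']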
+ + Args: + a: Base field dictionary (typically ECS fields) + b: Override field dictionary (typically custom fields) + + Returns: + New deeply nested field dictionary with merged content + + Merge Behavior: + - New fieldsets in b: Added to result + - Existing fieldsets: Merged recursively + - field_details: b values override a values + - normalize arrays: Concatenated (a + b) + - multi_fields arrays: Merged with deduplication (b takes precedence) + - schema_details: Merged with special handling for reusable settings + - Nested fields: Merged recursively + + Special Cases: + 1. normalize: Arrays are concatenated, allowing additions + 2. multi_fields: Deduplicated merge (custom can override ECS multi-fields) + 3. reusable.expected: Arrays concatenated (adds new reuse locations) + 4. reusable.top_level: Last value wins (can change reusability) + + Example: + >>> ecs = { + ... 'user': { + ... 'field_details': {'name': 'user', 'type': 'group'}, + ... 'fields': { + ... 'name': {'field_details': {'type': 'keyword'}} + ... } + ... } + ... } + >>> custom = { + ... 'user': { + ... 'fields': { + ... 'email': {'field_details': {'type': 'keyword'}} + ... } + ... } + ... } + >>> merged = merge_fields(ecs, custom) + # Result has both user.name (from ECS) and user.email (from custom) + + Note: + - Deep copies inputs to avoid mutation + - Safe for merging experimental, custom, and ECS schemas + - Used by load_schemas() to combine multiple schema sources + """ a = copy.deepcopy(a) b = copy.deepcopy(b) for key in b: @@ -293,17 +637,43 @@ def merge_fields(a: Dict[str, FieldEntry], b: Dict[str, FieldEntry]) -> Dict[str def load_yaml_file(file_name): + """Load and parse a YAML file. + + Args: + file_name: Path to YAML file + + Returns: + Parsed YAML content (typically dict or list) + """ with open(file_name) as f: return yaml.safe_load(f.read()) -# You know, for silent tests def warn(message: str) -> None: + """Print a warning message (overridable for testing). + + Args: + message: Warning message to display + + Note: + This function exists to enable silent tests (can be mocked). + """ print(message) def eval_globs(globs): - """Accepts an array of glob patterns or file names, returns the array of actual files""" + """Expand glob patterns to actual file paths. + + Args: + globs: Array of glob patterns or file names + + Returns: + Array of actual file paths matching the patterns + + Note: + Directories ending with '/' are converted to 'dir/*' + Warns if a pattern matches no files. + """ all_files = [] for g in globs: if g.endswith('/'): @@ -317,6 +687,17 @@ def eval_globs(globs): def load_definitions(file_globs): + """Load subset or exclude definition files. + + Args: + file_globs: List of file paths or glob patterns + + Returns: + List of loaded YAML definition objects + + Note: + Used by subset_filter.py and exclude_filter.py to load their configs. + """ sets = [] for f in ecs_helpers.glob_yaml_files(file_globs): raw = load_yaml_file(f) diff --git a/scripts/schema/subset_filter.py b/scripts/schema/subset_filter.py index 8df16e4ba2..524af65e67 100644 --- a/scripts/schema/subset_filter.py +++ b/scripts/schema/subset_filter.py @@ -15,6 +15,81 @@ # specific language governing permissions and limitations # under the License. +"""Schema Subset Filter Module. + +This module filters the complete ECS schema to include only specified fieldsets +and fields. It enables generating reduced schema subsets for specific use cases, +reducing field count and simplifying artifacts for targeted deployments. 
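+
+As a usage sketch, subsets are normally applied through the generator rather
+than by importing this module directly; with a hypothetical subset file:
+
+    python scripts/generator.py --subset path/to/subset.yml  # illustrative path
+
+See USAGE.md for the full set of generator options.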
+ +Subsetting Use Cases: + - **Minimal deployments**: Include only essential fields + - **Domain-specific**: Only fields relevant to specific data types + - **Performance**: Reduce mapping overhead in Elasticsearch + - **Compliance**: Remove fields containing sensitive data types + - **Testing**: Create small subsets for testing pipelines + +Subset Definition Format: + Subsets are defined in YAML files with hierarchical structure: + + ```yaml + name: minimal + fields: + base: + fields: '*' # All base fields + http: + fields: + request: + fields: + method: {} # Just this field + bytes: {} + response: + fields: '*' # All response fields + user: + fields: + name: {} + email: {} + ``` + +Filtering Logic: + - Fieldsets not in subset: Completely removed + - Fieldsets with fields='*': All fields included + - Fieldsets with fields={...}: Only specified fields included + - Hierarchical filtering: Applies recursively to nested fields + +Special Features: + 1. **docs_only**: Fields marked for documentation but not artifacts + - Appear in markdown docs but not Elasticsearch templates + - Useful for deprecated/transitional fields + + 2. **Multiple subsets**: Can specify multiple subset files + - Merged together (union of all fields) + - Later subsets can extend earlier ones + + 3. **Field options**: Per-field configuration in subsets + - enabled: false (disable field in artifacts) + - index: false (don't index in Elasticsearch) + +Output: + - Filtered field dictionary (main subset) + - Separate docs_only subset (if docs_only fields present) + - Intermediate files for each named subset + +Example: + >>> from schema import loader, cleaner, finalizer, subset_filter + >>> fields = loader.load_schemas() + >>> cleaner.clean(fields) + >>> finalizer.finalize(fields) + >>> main, docs = subset_filter.filter( + ... fields, + ... ['subsets/minimal.yml'], + ... 'generated' + ... ) + # main contains only specified fields + # docs contains docs_only fields + +See also: scripts/docs/schema-pipeline.md for pipeline documentation +""" + import copy import os from typing import ( @@ -34,15 +109,62 @@ FieldEntry ) -# This script takes all ECS and custom fields already loaded, and lets users -# filter out the ones they don't need. - def filter( fields: Dict[str, FieldEntry], subset_file_globs: List[str], out_dir: str ) -> Tuple[Dict[str, FieldEntry], Dict[str, FieldEntry]]: + """Filter fields to include only those specified in subset definitions. + + Main entry point for subset filtering. Loads subset definitions, applies + them to filter fields, and handles special docs_only fields separately. + + Args: + fields: Complete field dictionary from finalizer + subset_file_globs: List of paths/globs to subset YAML files + out_dir: Output directory for intermediate files + + Returns: + Tuple of (filtered_fields, docs_only_fields): + - filtered_fields: Main subset for artifact generation + - docs_only_fields: Fields for documentation only (not in artifacts) + + Processing Steps: + 1. Load all subset definition files + 2. Generate intermediate files for each named subset + 3. Merge all subsets together (union) + 4. Extract fields matching merged subset + 5. Handle docs_only fields separately + 6. 
Remove docs_only fields from main subset
+
+    Subset Merging:
+        Multiple subset files are combined with union semantics:
+        - Field in any subset: Included in result
+        - enabled/index: True if true in ANY subset
+        - Fields can be added but not removed by later subsets
+
+    docs_only Feature:
+        Fields marked with docs_only: true:
+        - Appear in markdown documentation
+        - Excluded from Elasticsearch templates, Beats configs, etc.
+        - Useful for deprecated fields still needing docs
+
+    Example:
+        >>> fields, docs = filter(
+        ...     all_fields,
+        ...     ['subsets/minimal.yml', 'subsets/security.yml'],
+        ...     'generated'
+        ... )
+        >>> len(fields)  # Number of fieldsets retained in the main subset
+        12
+        >>> len(docs)  # Number of fieldsets containing docs_only fields
+        2
+
+    Side Effects:
+        - Writes intermediate files for each subset to out_dir/ecs/subset/
+        - Cleans up intermediate fields in filtered results
+    """
    subsets: List[Dict[str, Any]] = load_subset_definitions(subset_file_globs)
    for subset in subsets:
        subfields: Dict[str, FieldEntry] = extract_matching_fields(fields, subset['fields'])
@@ -126,7 +248,38 @@ def remove_docs_only_entries(paths: List[str], fields: Dict[str, FieldEntry]) ->


 def combine_all_subsets(subsets: Dict[str, Any]) -> Dict[str, Any]:
-    """Merges N subsets into one. Strips top level 'name' and 'fields' keys as well as non-ECS field options since we can't know how to merge those."""
+    """Merge multiple subset definitions into one using union semantics.
+
+    Combines N subset definitions where a field is included if it appears
+    in ANY subset. Options like enabled/index are OR'd (true if true anywhere).
+
+    Args:
+        subsets: List of subset definition dictionaries (each with top-level
+            'name' and 'fields' keys)
+
+    Returns:
+        Single merged subset definition containing the union of all fields
+
+    Processing:
+        1. Strip non-ECS options from each subset (can't merge unknown options)
+        2. Merge subsets together using union logic
+        3. Return combined result
+
+    Merging Logic:
+        - Field in subset A or B: Included in result
+        - enabled: True if true in A OR B
+        - index: True if true in A OR B
+        - fields: Merged recursively
+
+    Example:
+        >>> subset1 = {'name': 'a', 'fields': {'http': {'fields': {'request': {'fields': '*'}}}}}
+        >>> subset2 = {'name': 'b', 'fields': {'http': {'fields': {'response': {'fields': '*'}}}}}
+        >>> combined = combine_all_subsets([subset1, subset2])
+        # Result includes both http.request and http.response
+
+    Note:
+        Strips the top-level 'name' and 'fields' keys as well as non-ECS
+        (custom) options, because their merging semantics are unknown. Only
+        ECS-known options (fields, enabled, index, docs_only) are preserved.
+    """
    merged_subset = {}
    for subset in subsets:
        strip_non_ecs_options(subset['fields'])
@@ -135,6 +288,20 @@ def combine_all_subsets(subsets: Dict[str, Any]) -> Dict[str, Any]:


 def load_subset_definitions(file_globs: List[str]) -> List[Dict[str, Any]]:
+    """Load subset definition files from the filesystem.
+
+    Args:
+        file_globs: List of file paths or glob patterns
+
+    Returns:
+        List of parsed subset definition dictionaries
+
+    Raises:
+        ValueError: If file_globs is specified but no files are found
+
+    Note:
+        Returns an empty list if file_globs is empty/None (no filtering).
+    """
    if not file_globs:
        return []
    subsets: List[Dict[str, Any]] = loader.load_definitions(file_globs)
@@ -178,7 +345,51 @@ def extract_matching_fields(
     fields: Dict[str, FieldEntry],
     subset_definitions: Dict[str, Any]
 ) -> Dict[str, FieldEntry]:
-    """Removes fields that are not in the subset definition. Returns a copy without modifying the input fields dict."""
+    """Extract only the fields specified in the subset definition.
+ + Recursively filters fields to include only those in the subset definition. + Returns a copy of the filtered fields without modifying the input. + + Args: + fields: Complete field dictionary to filter + subset_definitions: Subset specification defining which fields to keep + + Returns: + New field dictionary containing only specified fields + + Raises: + ValueError: If subset structure doesn't match field structure + + Filtering Rules: + - Field not in subset: Excluded + - Field with fields='*': All nested fields included + - Field with fields={...}: Recursively filter nested fields + - Field options (enabled, index): Applied to field_details + + Special Handling: + - Intermediate fields: If options specified, converts to real field + (sets intermediate=False, adds description, level='custom') + - Options: Subset can specify field-level options like enabled=false + + Structure Validation: + - If schema field has nested 'fields', subset MUST have 'fields' key + - If schema field has no nested 'fields', subset MUST NOT have 'fields' + - This prevents mistakes in subset definitions + + Example: + >>> subset_def = { + ... 'http': { + ... 'fields': { + ... 'request': {'fields': {'method': {}}} + ... } + ... } + ... } + >>> filtered = extract_matching_fields(all_fields, subset_def) + # Returns only http.request.method + + Note: + Creates deep copies to avoid modifying original field structures. + """ retained_fields: Dict[str, FieldEntry] = {x: fields[x].copy() for x in subset_definitions} for key, val in subset_definitions.items(): retained_fields[key]['field_details'] = fields[key]['field_details'].copy() diff --git a/scripts/schema/visitor.py b/scripts/schema/visitor.py index 1e1ca4441c..b90423c2ba 100644 --- a/scripts/schema/visitor.py +++ b/scripts/schema/visitor.py @@ -15,6 +15,55 @@ # specific language governing permissions and limitations # under the License. +"""Field Visitor Module. + +This module provides utilities for traversing deeply nested field structures using +the Visitor pattern. It enables performing operations on all fields/fieldsets in a +schema tree without needing to write recursive traversal code repeatedly. + +The Visitor Pattern: + The visitor pattern separates algorithms (visitor functions) from the data + structure they operate on. This allows: + - Multiple operations without modifying the field structure + - Consistent traversal order across different operations + - Clean separation of concerns + - Reusable traversal logic + +Common Use Cases: + - Validation: Check all fields meet requirements (cleaner.py) + - Transformation: Modify field properties (finalizer.py) + - Accumulation: Collect fields into flat structures (intermediate_files.py) + - Enrichment: Add calculated properties to fields (finalizer.py) + - Analysis: Generate statistics about schema structure + +Visitor Functions: + 1. visit_fields(): Call different functions for fieldsets vs fields + 2. visit_fields_with_path(): Track nesting path during traversal + 3. visit_fields_with_memo(): Pass accumulator through traversal + +Structure Assumptions: + All visitor functions expect the deeply nested structure created by loader.py: + - Fieldsets have 'schema_details', 'field_details', and 'fields' keys + - Regular fields have 'field_details' and optionally 'fields' keys + - Intermediate fields have 'field_details' with intermediate=True + +Example Usage: + >>> # Count all fields + >>> count = {'total': 0} + >>> def counter(details, memo): + ... 
memo['total'] += 1 + >>> visit_fields_with_memo(fields, counter, count) + >>> print(count['total']) + + >>> # Validate all fields + >>> def validator(details): + ... if 'type' not in details['field_details']: + ... raise ValueError('Missing type') + >>> visit_fields(fields, field_func=validator) + +See also: scripts/docs/schema-pipeline.md for pipeline documentation +""" + from typing import ( Callable, Dict, @@ -34,18 +83,49 @@ def visit_fields( fieldset_func: Optional[Callable[[FieldEntry], None]] = None, field_func: Optional[Callable[[FieldDetails], None]] = None ) -> None: - """ - This function navigates the deeply nested tree structure and runs provided - functions on each fieldset or field encountered (both optional). + """Recursively visit all fieldsets and fields, calling appropriate functions. + + Traverses the deeply nested field structure and invokes different callback + functions for fieldsets (which have schema_details) vs regular fields. + This allows different processing logic for different node types. + + Args: + fields: Deeply nested field dictionary to traverse + fieldset_func: Optional function to call for each fieldset. + Receives dict with 'schema_details', 'field_details', 'fields' + field_func: Optional function to call for each field. + Receives dict with 'field_details' and optionally 'fields' + + Traversal Order: + - Depth-first traversal (process parent before children) + - Processes current node first, then recursively processes children + - For each node: call appropriate function, then recurse into 'fields' + + Node Identification: + - Fieldset: Has 'schema_details' key (top-level schemas only) + - Field: Has 'field_details' key but no 'schema_details' - The argument 'fields' should be at the named field grouping level: - {'name': {'schema_details': {}, 'field_details': {}, 'fields': {}} + Example: + >>> def validate_fieldset(details): + ... if 'title' not in details['schema_details']: + ... raise ValueError('Missing title') + >>> + >>> def validate_field(details): + ... if 'type' not in details['field_details']: + ... raise ValueError('Missing type') + >>> + >>> visit_fields(fields, + ... fieldset_func=validate_fieldset, + ... field_func=validate_field) - The 'fieldset_func(details)' provided will be called for each field set, - with the dictionary containing their details ({'schema_details': {}, 'field_details': {}, 'fields': {}). + Use Cases: + - cleaner.py: Validates and normalizes fieldsets and fields separately + - finalizer.py: Sets original_fieldset on reused fields + - Any operation needing different logic for fieldsets vs fields - The 'field_func(details)' provided will be called for each field, with the dictionary - containing the field's details ({'field_details': {}, 'fields': {}). + Note: + Both callback functions are optional. You can provide just one if you + only need to process fieldsets or fields. """ for (_, details) in fields.items(): if fieldset_func and 'schema_details' in details: @@ -63,13 +143,48 @@ def visit_fields_with_path( func: Callable[[FieldDetails], None], path: Optional[List[str]] = [] ) -> None: - """ - This function navigates the deeply nested tree structure and runs the provided - function on all fields and field sets. + """Recursively visit all fields, passing the nesting path to the callback. + + Traverses the deeply nested structure and calls the provided function for + each field and fieldset, passing both the details and the path array showing + where in the hierarchy the field is located. 
+ + Args: + fields: Deeply nested field dictionary to traverse + func: Callback function receiving (details, path) + - details: Dict with 'field_details' and optionally 'fields' + - path: List of field names from root to current location + path: Current path (used internally during recursion, start with []) + + Path Building: + - Root fieldsets (root=true): Don't add to path + - Other fieldsets/fields: Add their name to path + - Path represents dotted field name: ['http', 'request'] = 'http.request' + + Traversal Order: + - Depth-first traversal + - Process current node first, then recurse into children + + Example: + >>> def show_path(details, path): + ... field_name = details['field_details'].get('name', 'unknown') + ... dotted_path = '.'.join(path + [field_name]) + ... print(f"Field: {dotted_path}") + >>> + >>> visit_fields_with_path(fields, show_path) + Field: http + Field: http.request + Field: http.request.method + Field: http.request.bytes + + Use Cases: + - finalizer.py: Calculate flat_name using path + - Any operation needing to know field's full path + - Building dotted field names during transformation - The 'func' provided will be called for each field, - with the dictionary containing their details ({'field_details': {}, 'fields': {}) - as well as the path array leading to the location of the field in question. + Note: + Root fieldsets don't add to the path because their fields appear at + the root level of events (e.g., '@timestamp', not 'base.@timestamp'). """ for (name, details) in fields.items(): if 'field_details' in details: @@ -87,13 +202,52 @@ def visit_fields_with_memo( func: Callable[[FieldEntry, Field], None], memo: Optional[Dict[str, Field]] = None ) -> None: - """ - This function navigates the deeply nested tree structure and runs the provided - function on all fields and field sets. + """Recursively visit all fields, passing an accumulator (memo) to the callback. + + Traverses the deeply nested structure and calls the provided function for + each field and fieldset, passing both the details and a memo object that + can be used to accumulate results or state during traversal. + + Args: + fields: Deeply nested field dictionary to traverse + func: Callback function receiving (details, memo) + - details: Dict with 'field_details' and optionally 'fields' + - memo: Accumulator object passed through all calls + memo: Accumulator object (can be dict, list, or any mutable object) + + Memo Pattern: + The memo object is passed to every callback and can be modified in place + to accumulate results. Common memo types: + - Dict: Accumulate fields by name + - List: Collect fields meeting criteria + - Counter dict: Track statistics + + Traversal Order: + - Depth-first traversal + - Process current node first, then recurse into children + - Same memo object passed to all callbacks + + Example: + >>> # Accumulate all keyword fields + >>> keyword_fields = {} + >>> def collect_keywords(details, memo): + ... field = details['field_details'] + ... if field.get('type') == 'keyword': + ... 
memo[field['flat_name']] = field + >>> + >>> visit_fields_with_memo(fields, collect_keywords, keyword_fields) + >>> len(keyword_fields) + 450 # Number of keyword fields found + + Use Cases: + - intermediate_files.py: Accumulate fields into flat dictionary + - Collecting fields for analysis or statistics + - Building indexes or lookup tables during traversal + - Any operation needing to build results while traversing - The 'func' provided will be called for each field, - with the dictionary containing their details ({'field_details': {}, 'fields': {}) - as well as the 'memo' you pass in. + Note: + The memo is mutable and shared across all callbacks. Be careful not + to accidentally mutate it in unexpected ways. """ for (name, details) in fields.items(): if 'field_details' in details: diff --git a/scripts/templates/ecs_field_reference.j2 b/scripts/templates/ecs_field_reference.j2 index 3a1d485ab8..4e0d8e366f 100644 --- a/scripts/templates/ecs_field_reference.j2 +++ b/scripts/templates/ecs_field_reference.j2 @@ -16,7 +16,7 @@ ECS defines multiple groups of related fields. They are called "field sets". The All other field sets are defined as objects in {{ es }}, under which all fields are defined. -For a single page representation of all fields, please see the [generated CSV of fields](https://github.com/elastic/ecs/blob/master/generated/csv/fields.csv). +For a single page representation of all fields, please see the [generated CSV of fields](https://github.com/elastic/ecs/blob/main/generated/csv/fields.csv). ## Field sets [ecs-fieldsets] diff --git a/scripts/templates/index.j2 b/scripts/templates/index.j2 index 02c154a4f5..c56a8807c0 100644 --- a/scripts/templates/index.j2 +++ b/scripts/templates/index.j2 @@ -41,5 +41,5 @@ ECS is a permissive schema. If your events have additional data that cannot be m ECS improvements are released following [Semantic Versioning](https://semver.org/). Major ECS releases are planned to be aligned with major Elastic Stack releases. -Any feedback on the general structure, missing fields, or existing fields is appreciated. For contributions please read the [Contribution Guidelines](https://github.com/elastic/ecs/blob/master/CONTRIBUTING.md). +Any feedback on the general structure, missing fields, or existing fields is appreciated. For contributions please read the [Contribution Guidelines](https://github.com/elastic/ecs/blob/main/CONTRIBUTING.md).