elgen is a sample data generator and indexer for Elasticsearch on Elastic Cloud.
❯ python elgen.py --index my-index --clear-index --limit 200 --size 10000
💣 Clearing index my-index
📒 Indexing documents to index my-index on Elastic Cloud my-cloud:dXMtd(...)
🪄 Generating 50 documents
🪄 Generating 50 documents
🪄 Generating 50 documents
🪄 Generating 50 documents
✅ DONE - Processed 200 documents of ~10000 bytes each
Total duration: 4.018 seconds
Average throughput: 49.772 docs/sec
It uses Faker to create random documents that contain realistic data. The resulting dataset can be used to measure the throughput of Elastic bulk indexing and Machine Learning inference process.
A sample document looks like this:
{
"id": "5cd32827-2721-4d68-83b8-d1d0157c7501",
"title": "There record happen charge experience available suggest",
"author": "Steven Shannon",
"summary": "Base reduce have affect able southern.\nQuestion sister stuff yet million. Especially few student before.",
"text": "Stock PM way same green. (...) Every example end again live remember way."
}- Python 3.6+
- Elastic Cloud deployment (if you also want to index the data)
Generate 10 documents and print them to the console in JSON format:
python elgen.pyGenerate 100 documents of approximately 50 KB each:
python elgen.py --limit 100 --size 50000Generate documents and index them in my-index in the specified Elastic Cloud deployment:
python elgen.py --index my-index --elastic-cloud-id my-cloud:... --elastic-username john --elastic-password doe123Same as above, also run the documents through the my-pipeline ingest pipeline before indexing (see Ingest pipeline):
python elgen.py --index my-index --pipeline my-pipeline --elastic-cloud-id my-cloud:... --elastic-username john --elastic-password doe123Generate documents and save them to data.ndjson for bulk indexing in Elasticsearch:
python elgen.py --out-file data.ndjsonSee further options in Configuration.
By default elgen bulk indexes documents into the specified index. If an ingest pipeline is attached to the index, it can be applied during the indexing process by specifying it with the --pipeline option. It will also set the _run_ml_inference flag, which will run any Machine Learning inference pipelines associated with the index.
To learn more about ingest and inference pipelines please refer to the Enterprise Search guide.
Optional arguments:
| Argument | Effect | Notes |
|---|---|---|
-o, --out-file |
Output file | NDJSON file containing bulk index actions |
-c, --cloud-id |
Elastic Cloud ID | See also Environment variables |
-u, --elastic-username |
Elastic username | Default: elastic See also Environment variables |
-p, --elastic-password |
Elastic password | See also Environment variables |
-i, --index |
Target Elasticsearch index | |
-x, --clear-index |
Clear index before indexing documents | |
-l, --limit |
Number of documents to generate | Default: 10 |
-q, --pipeline |
Ingest pipeline name | See Ingest pipeline |
-s, --size |
Approximate size of the documents in bytes | Default: 1000 |
-b, --batch-size |
Batch size for bulk generation and indexing | Default: 50 |
-d, --debug |
Enable debug logging | |
-h, --help |
Show help message and exit |
ELASTIC_CLOUD_ID- Elastic Cloud ID. If set, it will be automatically passed in--elastic-cloud-id.ELASTIC_USERNAME- Elastic username. If set, it will be automatically passed in--elastic-username.ELASTIC_PASSWORD- Elastic password. If set, it will be automatically passed in--elastic-password.