Add streaming/generator support for large batch operations #5

@jjroelofs

Proposal

For very large batch operations, the current implementation loads all results into memory before outputting. This can cause memory exhaustion on sites with many entities.

Current Behavior

public function batch(...): array {
  $results = [];
  foreach ($entities as $entity) {
    $results[] = $this->collector->collectIntel($entity, [], $plugins);  // Accumulates in memory
  }
  return $results;  // Full array returned
}

With --limit=1000 on entities with rich field data, the $results array holds all 1000 collected records at once, which could consume significant memory.

Proposed Enhancement

Add a ci:stream command or --stream option that outputs entities one at a time using JSON Lines format:

#[CLI\Command(name: 'ci:stream', aliases: ['cist'])]
public function stream(string $entity_type, array $options = [...]): void {
  // Process one entity at a time
  foreach ($this->getEntityIterator($entity_type, $options) as $entity) {
    $intel = $this->collector->collectIntel($entity, [], $plugins);
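    // Output immediately as one JSON line; throw instead of printing an empty line if encoding fails
    $this->output()->writeln(json_encode($intel, JSON_THROW_ON_ERROR));
    // $intel is overwritten on the next iteration, so memory stays flat as long as the iterator does not retain entities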
  }
}
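
The command above depends on getEntityIterator() never holding the whole result set. Below is a minimal sketch of one way such an iterator could work, assuming entities are loaded in small chunks. The function name iterateInChunks, the $loadMultiple callable, and the chunk size are hypothetical and not part of the module; the loader here is a stand-in so the example runs on its own.

```php
<?php

declare(strict_types=1);

/**
 * Hypothetical chunked iterator: loads items in small groups, yields them one by one.
 *
 * @param int[] $ids All IDs to process (for example, the result of an entity query).
 * @param callable $loadMultiple Receives one chunk of IDs, returns the items keyed by ID.
 * @param int $chunkSize Maximum number of items loaded per call.
 */
function iterateInChunks(array $ids, callable $loadMultiple, int $chunkSize = 50): \Generator {
  foreach (array_chunk($ids, $chunkSize) as $chunk) {
    // Only the current chunk is referenced here; earlier chunks can be garbage collected.
    foreach ($loadMultiple($chunk) as $id => $item) {
      yield $id => $item;
    }
  }
}

// Stand-in loader (assumption): records how many IDs each call received.
$calls = [];
$loader = function (array $chunk) use (&$calls): array {
  $calls[] = count($chunk);
  return array_combine($chunk, array_map(fn (int $id): string => "item-$id", $chunk));
};

$seen = [];
foreach (iterateInChunks(range(1, 5), $loader, 2) as $id => $item) {
  $seen[$id] = $item;
}
```

In a Drupal context the loader would presumably wrap the entity storage's loadMultiple(), and the storage's static cache would likely need to be cleared between chunks (for example with resetCache()), otherwise loaded entities stay in memory even though the generator has moved on.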

Benefits

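  • Memory efficiency: Roughly constant memory usage regardless of batch size, provided loaded entities are not retained between iterations (for example in the entity storage's static cache)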
  • Streaming output: Results appear as they're processed
  • Pipeline-friendly: JSON Lines format works with jq, head, tail, etc.
  • Resilience: Partial results available even if process is interrupted

Use Cases

  1. Exporting all content for AI training
  2. Generating sitemaps or content inventories
  3. Migration/sync pipelines
  4. Large-scale content analysis

Output Format

JSON Lines (one JSON object per line):

{"entity":{"entity_type":"node","id":"1",...},"fields":{...},"intel":{...}}
{"entity":{"entity_type":"node","id":"2",...},"fields":{...},"intel":{...}}

Alternative: Generator in Service

public function collectIntelBatch(string $entity_type, array $options): \Generator {
  foreach ($this->getEntityIterator(...) as $entity) {
    yield $this->collectIntel($entity);
  }
}

This keeps memory-efficient iteration in the service layer.
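
To show what the generator approach buys a caller, here is a small self-contained sketch. The collectBatch function and the $collect callable are hypothetical stand-ins for collectIntelBatch() and collectIntel(); the point is only that work happens on demand, so a caller that stops early never pays for the remaining items.

```php
<?php

declare(strict_types=1);

/**
 * Hypothetical stand-in for a service-level batch generator.
 *
 * @param iterable $entities Anything iterable, including another generator.
 * @param callable $collect Produces the record for a single entity.
 */
function collectBatch(iterable $entities, callable $collect): \Generator {
  foreach ($entities as $entity) {
    // Each record is produced only when the caller asks for the next value.
    yield $collect($entity);
  }
}

// Count how many times the (stand-in) collector actually runs.
$collected = 0;
$collect = function (int $id) use (&$collected): array {
  $collected++;
  return ['id' => $id];
};

$first = [];
foreach (collectBatch(range(1, 1000), $collect) as $intel) {
  $first[] = $intel;
  if (count($first) === 3) {
    // Stop early; the remaining 997 items are never collected.
    break;
  }
}
```

The same generator could back both the existing batch command (via iterator_to_array()) and a streaming command, which is one argument for putting it in the service rather than in the CLI layer.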
