@gregfurman
Collaborator

Motivation

Includes a new python processor that runs within a sandboxed WASM environment.

Changes

  • Adds a new python processor that runs within a sandboxed WASM environment.
  • Adds a WasmModulePool in internal/impl/wasm/wasmpool/pool.go, backed by a Go sync.Pool, so that multiple guests can execute concurrently. The pool creates new instances as demand increases, and the Go GC handles the lifecycle of idle guests.
  • Adds an entrypoint.py that compiles and executes the user's Python script. Data is passed between the guest and host over stdin and stdout.
  • If the x_wasm tag is used, the WASM executable and corresponding entrypoint.py are embedded at compile time. Using the processor without this tag panics on init.
  • Adds a script at scripts/install.sh that downloads the python-3.12.0.wasm and checks it against the SHA in runtime/python-3.12.0.wasm.sha256sum.

Installation/Usage

Run the following commands to create a config, install the python runtime (and check it against the SHA), and execute a pipeline.

cat > python.yaml <<'EOF'
pipeline:
  processors:
    - python:
        script: |
          root = sum(this)
EOF

bash internal/impl/python/scripts/install.sh

echo "[11, 6, 20, 5]" | go run -tags "x_wasm" cmd/bento/main.go -c python.yaml --set logger.level=warn

The output of the pipeline should be 42.

TODO

  • Extend the CI (or Makefile) to run the downloader script and build with the x_wasm tag -- allowing the .wasm to be embedded at compile time and used in the processor.

@gregfurman self-assigned this Dec 18, 2025
@aronchick

this is going to be really cool! lmk if you need help

@gregfurman marked this pull request as ready for review December 20, 2025 21:11
@gregfurman
Collaborator Author

Thanks @aronchick! If you have time (and are feeling extra generous), I'd appreciate if you could run through the Installation/Usage steps locally and try out different configurations. I'm also looking for any feedback on usability since we're planning to expand our WASM offering, so any insights would be super helpful!

I haven't dug too deep into the limitations of the sandboxed runtime or how it responds to attempts to access resources it isn't permitted to use (e.g., the filesystem or OS).

In addition, we'd probably need to soak test the processor before making this GA in Bento since I'm unsure of performance/behavior over longer periods and under differing load profiles.

Collaborator Author

@gregfurman left a comment

@jem-davies I've added a quick self-review. Let me know if you think we should be adding more documentation with what's been mentioned here.

Comment on lines +59 to +60
namespace = {"this": input_data, "root": None, "__builtins__": __builtins__, **injected_libs}
exec(compiled_code, namespace)
Collaborator Author

This namespace and exec combination lets us dynamically run the compiled script while constraining what is accessible at runtime and ensuring no global state is ever mutated.

Also, see how we pass this and root -- leveraging the same pattern we use with bloblang scripts/mappings.
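
To make the pattern concrete, here is a minimal standalone sketch of the same idea. It is not the actual entrypoint.py; the helper name run_script is hypothetical, and only the namespace keys come from the lines quoted above.

```python
def run_script(script, input_data, injected_libs=None):
    """Run a user script with `this` bound to the input and return `root`."""
    compiled_code = compile(script, "<script>", "exec")
    # A fresh dict per call, so names assigned by one script never leak into the next.
    namespace = {
        "this": input_data,
        "root": None,
        "__builtins__": __builtins__,
        **(injected_libs or {}),
    }
    exec(compiled_code, namespace)
    return namespace["root"]


print(run_script("root = sum(this)", [11, 6, 20, 5]))  # 42
```

Because the script only ever sees the dict it was handed, whatever it assigns to root is read back from that dict rather than from module globals.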

name = lib_name.strip()
if name:
    try:
        injected[name] = importlib.import_module(name)
Collaborator Author

Some Python magic that lets us dynamically import the libraries we're making permissible to the sandboxed execution.
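
For anyone who wants to try this outside the PR, the excerpt can be fleshed out roughly as follows. The function name load_injected_libs and the error handling are assumptions; only the strip/import lines appear in the diff.

```python
import importlib


def load_injected_libs(lib_names):
    """Import each requested module by name and return {name: module}."""
    injected = {}
    for lib_name in lib_names:
        name = lib_name.strip()
        if name:
            try:
                injected[name] = importlib.import_module(name)
            except ImportError as exc:
                # Assumption: fail loudly; the real script may handle this differently.
                raise ValueError(f"cannot import {name!r}: {exc}") from exc
    return injected


libs = load_injected_libs(["math", " json ", ""])
print(sorted(libs))  # ['json', 'math']
```

The returned dict can be spread straight into the exec namespace, which is what makes the modules visible to the user script under their own names.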


if command -v sha256sum >/dev/null; then
(cd "$WASM_RUNTIME_DIR/runtime" && sha256sum --check --status python-3.12.0.wasm.sha256sum)
elif command -v shasum >/dev/null; then
Collaborator Author

Fall back to shasum, since sha256sum is not available on Darwin.

wget -q "$WASM_BINARY_URL" -O "$WASM_PATH"

if command -v sha256sum >/dev/null; then
(cd "$WASM_RUNTIME_DIR/runtime" && sha256sum --check --status python-3.12.0.wasm.sha256sum)
Collaborator Author

Since we're downloading and embedding a WASM executable into bento, we need to verify the integrity of the executable we're intending to use. So we download and commit the SHA256 checksum, then verify the downloaded file's checksum.
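
The install script does this check in shell. As an illustration of the same verification step, here is a Python sketch that assumes the committed file uses the usual sha256sum layout (hex digest first, then the file name); the helper names are hypothetical.

```python
import hashlib


def sha256_of(path, chunk_size=1 << 20):
    """Hash a file in chunks so a large binary is never fully loaded into memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_checksum(path, sumfile):
    """Compare a file's digest with the first field of a sha256sum-style file."""
    with open(sumfile, encoding="utf-8") as f:
        expected = f.read().split()[0].lower()
    return sha256_of(path) == expected
```

A caller would refuse to embed the binary whenever verify_checksum returns False.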

instancePool sync.Pool
}

func NewWasmModulePool[T api.Module](ctx context.Context, ctor constructor[T]) (*WasmModulePool[T], error) {
Collaborator Author

Unfortunately, WASM modules (at least with wazero) are not safe for concurrent execution. Instead, we create a module pool (backed by a sync.Pool) that spins up new instances on demand, depending on load, and lets the GC reclaim idle ones.
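
The borrow-and-return idea is language-neutral, so here is a rough Python illustration rather than the Go code in this PR. InstancePool and borrow are hypothetical names, and unlike sync.Pool this sketch never drops idle instances.

```python
import queue
from contextlib import contextmanager


class InstancePool:
    """Hands out one instance per caller and creates a new one when none is idle."""

    def __init__(self, ctor):
        self._ctor = ctor
        self._idle = queue.SimpleQueue()

    @contextmanager
    def borrow(self):
        try:
            instance = self._idle.get_nowait()
        except queue.Empty:
            instance = self._ctor()  # grow on demand
        try:
            yield instance
        finally:
            self._idle.put(instance)  # make it reusable by the next caller
```

Each caller holds its instance exclusively for the duration of the with block, so an instance that is not safe for concurrent use is never shared between two callers at once.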

}

func (p *WasmModulePool[T]) Put(instance T) {
_ = runtime.AddCleanup(&instance, func(inst T) {
Collaborator Author

If the GC cleans up the instance, which is possible given sync.Pool, we need to ensure its resources are actually freed. Hence this gross runtime.AddCleanup hook, which triggers when the module is GC'd.

p.instancePool.Put(instance)
}

func (p *WasmModulePool[T]) Close(ctx context.Context) error {
Collaborator Author

While it's probably easier to just close the entire runtime to ensure all associated modules (and resources) are cleaned up, this Close will retrieve all objects from the pool and close them.

While thinking of this as a Reset or Cleanup is conceptually more correct, Close feels more idiomatic 🤷

}()

handshakeErr := make(chan error, 1)
go func() {
Collaborator Author

We do a quick-and-dirty handshake to ensure the python env is up-and-running. The python module should write a "READY" signal to the output socket which we need to check is received before proceeding.
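
The host side of such a handshake might look like the sketch below. It is written in Python for illustration only (the real host is Go), and it assumes the signal arrives as a newline-terminated line; wait_for_ready and its timeout value are hypothetical.

```python
import io
import threading


def wait_for_ready(stream, timeout=5.0, signal=b"READY"):
    """Block until the guest's first line equals `signal`, or raise."""
    result = {}

    def read_first_line():
        try:
            result["line"] = stream.readline().strip()
        except Exception as exc:  # hand read failures back to the caller
            result["error"] = exc

    reader = threading.Thread(target=read_first_line, daemon=True)
    reader.start()
    reader.join(timeout)
    if reader.is_alive():
        raise TimeoutError("guest did not signal readiness in time")
    if "error" in result:
        raise result["error"]
    if result["line"] != signal:
        raise RuntimeError(f"unexpected handshake: {result['line']!r}")


wait_for_ready(io.BytesIO(b"READY\n"))  # returns without raising
```

Reading on a separate thread is what lets the caller give up after the timeout instead of blocking forever on a guest that never starts.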

return nil, pi.stderrBuf.Bytes(), err
}

header := make([]byte, 5)
Collaborator Author

The first byte signals whether the response was a success or an exception. The last four bytes give the length of the response.
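
A sketch of that framing, with both ends in Python for illustration. The status values and the big-endian byte order are assumptions, since the comment above only specifies one status byte followed by a four-byte length.

```python
import io
import struct

# Assumed values: the source only says byte 0 distinguishes success from exception.
STATUS_OK = 0
STATUS_EXCEPTION = 1
HEADER = struct.Struct(">BI")  # 1 status byte + 4-byte length; big-endian is an assumption


def write_frame(stream, status, payload):
    """Write the 5-byte header followed by the payload."""
    stream.write(HEADER.pack(status, len(payload)) + payload)


def read_frame(stream):
    """Read one frame and return (status, payload)."""
    header = stream.read(HEADER.size)
    if len(header) != HEADER.size:
        raise EOFError("truncated header")
    status, length = HEADER.unpack(header)
    payload = stream.read(length)
    if len(payload) != length:
        raise EOFError("truncated payload")
    return status, payload


buf = io.BytesIO()
write_frame(buf, STATUS_OK, b"42")
buf.seek(0)
print(read_frame(buf))  # (0, b'42')
```

Length-prefixing means the reader knows exactly how many bytes to consume, so the payload itself can safely contain newlines or any other delimiter.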

@aronchick

Testing Instructions

I've tested this locally and it works great! Here's how others can try it:

Quick Start

# 1. Checkout the PR
gh pr checkout 621

# 2. Download the Python WASM runtime (~50MB) and test configs
bash internal/impl/python/scripts/install.sh
gh gist clone https://gist.github.com/aronchick/5d9391b03f9266b994c7a36043d743a4 python-examples

# 3. Run an example (first build takes ~30-60s)
go run -tags "x_wasm" ./cmd/bento/main.go -c python-examples/hello.yaml

Expected output: "Hello, World!"

Example Configs

Full test configs available in the gist: <GIST_URL>

File         Description
hello.yaml   Simple greeting
test1.yaml   JSON transformation
test2.yaml   Array processing (sum, count, double)
test3.yaml   Message filtering
test4.yaml   Stdlib imports (math, json, re)
test5.yaml   E-commerce order transformation
test6.yaml   Error handling demo

Quick Snippets

Array processing:

- python:
    script: |
      root = {"sum": sum(this), "doubled": [x * 2 for x in this]}

With imports:

- python:
    imports: [math, json, re]
    script: |
      root = {"ceil": math.ceil(this["value"])}

Filtering (set root=None to drop):

- python:
    script: |
      root = this if this.get("status") == "active" else None

Run Unit Tests

go test -tags "x_wasm" -v ./internal/impl/python/...

All 18 tests pass (~18 seconds).

Key Points

  • Build tag required: Always use -tags "x_wasm"
  • Cold start: ~1 second per Python instance
  • Familiar syntax: Uses this and root like Bloblang
  • Sandboxed: Runs in WASM, no filesystem/network access
