Explore chart types for posterior evolution visualization #17

@jjroelofs

Summary

Exploratory issue to evaluate different chart types for visualizing how posterior beliefs evolve over time during experiments. The goal is to prototype multiple approaches so stakeholders can judge their usefulness firsthand.

Context

With the event log feature (#16), we'll have historical snapshot data available:

  • turns and rewards per arm at various trial points
  • total_experiment_turns for x-axis positioning
  • created timestamp for optional date-based display

From this data we can compute:

  • Posterior mean (estimated conversion rate)
  • Credible intervals (uncertainty bounds)
  • Full Beta distribution shape
  • Probability each arm is "best"
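A minimal sketch of the first two derived values, assuming a uniform Beta(1, 1) prior (the issue does not state which prior the module uses). The function name `posterior_summary` and its parameters are hypothetical. The credible interval is estimated from sorted posterior draws so the example only needs the standard library.

```python
import random


def posterior_summary(turns, rewards, level=0.95, draws=20000, seed=0):
    """Summarize the Beta posterior for one arm at one snapshot.

    Assumes a uniform Beta(1, 1) prior, which is an assumption of this
    sketch and not something stated in the issue.
    """
    alpha = 1 + rewards            # prior pseudo-success + observed rewards
    beta = 1 + turns - rewards     # prior pseudo-failure + turns without reward
    mean = alpha / (alpha + beta)  # closed-form posterior mean

    # Equal-tailed credible interval, estimated from sorted posterior draws.
    rng = random.Random(seed)
    samples = sorted(rng.betavariate(alpha, beta) for _ in range(draws))
    tail = (1 - level) / 2
    lower = samples[int(tail * draws)]
    upper = samples[int((1 - tail) * draws) - 1]
    return {"mean": mean, "lower": lower, "upper": upper}


summary = posterior_summary(turns=100, rewards=10)
print(summary)
```

A library with an inverse Beta CDF would give exact interval bounds; sampling is used here only to keep the sketch dependency-free.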

Experiment Types to Consider

The module serves two distinct use cases with different visualization needs:

Type             Arms      Example                 Visualization challenge
A/B tests        2-5       Landing page variants   Show detail per arm
Recommendations  100-500+  Blog post rankings      Summarize many arms

Each chart should either work for both types, or we need a different default chart per type.


Chart Types to Prototype

1. Line Chart with Credible Interval Bands

Best for: A/B tests (2-5 arms)

Rate
 │    ╭──────────────────── Arm A (shaded CI)
0.15─│───╱─────────────────────────────
     │  ╱   ╭─────────────── Arm B  
0.10─│─╱───╱────────────────────────────
     │╱   ╱
0.05─│───╱──────────────────────────────
     └────────────────────────────────→ trials
  • X-axis: total experiment turns (or datetime)
  • Y-axis: conversion rate estimate
  • Lines: one per arm with distinct colors
  • Bands: 95% credible interval shaded around each line

Pros: Intuitive, shows estimate + uncertainty, familiar format
Cons: Gets cluttered with many arms, overlapping bands hard to read

Prototype tasks:

  • Basic line chart with Chart.js or similar
  • Add shaded credible interval bands
  • Test with 2, 5, and 10 arms
  • Toggle between trials and datetime x-axis

2. Probability of Winning (Stacked Area)

Best for: A/B tests, decision-focused view

P(best)
  1.0─│████████████████▓▓▓▓▓▓▓▓▓▓░░░░░░
     │████████████████▓▓▓▓▓▓▓▓▓▓░░░░░░
  0.5─│████ Arm A █████▓▓ Arm B ▓░░░░░░
     │████████████████▓▓▓▓▓▓▓▓▓▓░ C ░░
  0.0─│████████████████▓▓▓▓▓▓▓▓▓▓░░░░░░
     └────────────────────────────────→ trials
  • X-axis: trials or datetime
  • Y-axis: probability (0-1, stacked to 100%)
  • Each arm is a colored band

Pros: Answers "which should I pick?", always sums to 100%, intuitive competition view
Cons: Doesn't show actual conversion rates, P(best) is typically estimated by Monte Carlo sampling (extra computation per snapshot)

Prototype tasks:

  • Stacked area chart implementation
  • P(best) calculation from Beta distributions
  • Test with 2, 5, and 10 arms
  • Consider animation showing bands shifting
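The P(best) calculation could be sketched as follows, again assuming a uniform Beta(1, 1) prior. The function name `probability_best` and the `{arm: (turns, rewards)}` input shape are hypothetical. Each draw samples one plausible rate per arm and credits the arm with the highest sample, which is why the results always sum to 1.

```python
import random


def probability_best(arms, draws=10000, seed=0):
    """Estimate P(arm is best) for every arm by Monte Carlo sampling.

    `arms` maps arm id -> (turns, rewards). Assumes a uniform Beta(1, 1)
    prior, which is an assumption of this sketch.
    """
    rng = random.Random(seed)
    wins = {arm: 0 for arm in arms}
    for _ in range(draws):
        # Draw one plausible conversion rate per arm and record the winner.
        sampled = {
            arm: rng.betavariate(1 + rewards, 1 + turns - rewards)
            for arm, (turns, rewards) in arms.items()
        }
        wins[max(sampled, key=sampled.get)] += 1
    return {arm: count / draws for arm, count in wins.items()}


p_best = probability_best({"A": (1000, 150), "B": (1000, 100), "C": (1000, 50)})
print(p_best)
```

Running this once per snapshot yields the band heights for the stacked area chart. The cost grows with arms × draws, so large recommendation experiments may need fewer draws or a top-N cutoff.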

3. Heatmap (Arms × Time)

Best for: Large recommendation experiments (100+ arms)

         Trials →
        10   100  1000  10000
Arm 1   ░░   ▒▒   ▓▓    ██
Arm 2   ░░   ▒▒   ▓▓    ██
Arm 3   ░░   ░░   ▒▒    ▓▓
...
Arm 99  ░░   ░░   ░░    ▒▒

Color = conversion rate (darker = higher)
  • X-axis: trial progression
  • Y-axis: arms (sortable by current performance)
  • Color intensity: conversion rate or P(best)

Pros: Scales to hundreds of arms, shows patterns across whole experiment
Cons: Less precise than line charts, requires good color scale design

Prototype tasks:

  • Heatmap grid implementation
  • Sortable rows (by name, current rate, total trials)
  • Color scale selection (sequential vs diverging)
  • Test with 50, 100, 500 arms
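One possible way to shape the data for the heatmap grid and its sortable rows is sketched below. The `history` structure, the function name `heatmap_grid`, and the sort keys are hypothetical. Cells hold the posterior mean under an assumed Beta(1, 1) prior, and every arm is assumed to have at least one snapshot.

```python
def heatmap_grid(history, sort_by="rate"):
    """Build heatmap rows from per-arm snapshot history.

    `history` maps arm id -> list of (total_experiment_turns, turns, rewards)
    tuples ordered by time (a hypothetical shape). `sort_by` is one of
    "name", "rate" or "trials".
    """
    # Columns: every checkpoint at which any arm has a snapshot.
    checkpoints = sorted({x for snaps in history.values() for x, _, _ in snaps})
    rows = []
    for arm, snaps in history.items():
        # Posterior mean per checkpoint, assuming a Beta(1, 1) prior.
        means = {x: (1 + rewards) / (2 + turns) for x, turns, rewards in snaps}
        rows.append({
            "arm": arm,
            "cells": [means.get(x) for x in checkpoints],  # None = no snapshot
            "rate": means[snaps[-1][0]],                   # current estimate
            "trials": snaps[-1][1],                        # current turns
        })
    if sort_by == "name":
        rows.sort(key=lambda row: row["arm"])
    else:
        rows.sort(key=lambda row: row[sort_by], reverse=True)
    return checkpoints, rows


history = {
    "arm1": [(10, 5, 1), (100, 50, 10)],
    "arm2": [(10, 5, 0), (100, 50, 2)],
    "arm3": [(100, 10, 5)],
}
checkpoints, rows = heatmap_grid(history, sort_by="rate")
print(checkpoints, [row["arm"] for row in rows])
```

Missing cells are kept as `None` so the front end can render them as empty rather than as a zero rate.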

4. Ranking Chart (Bump Chart)

Best for: Recommendations, seeing position changes

Rank
  1─│    ╲    ╱───────── post1
  2─│─────╲╱─────╲────── post2
  3─│───────────╱─╲───── post3
  4─│──────────────╲──── post4
    └─────────────────────→ trials
  • X-axis: trials or datetime
  • Y-axis: rank position
  • Lines: one per arm showing rank over time

Pros: Clear view of competition, works for many arms (show top N)
Cons: Doesn't show magnitude of differences, can get tangled

Prototype tasks:

  • Bump chart implementation
  • Show top N arms (configurable, default 10-20)
  • Highlight lines on hover
  • Click to see arm details
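The rank lines for a bump chart could be derived from per-checkpoint rate estimates roughly like this. The input shape and the name `rank_series` are hypothetical, and keeping the arms that are in the top N at the final checkpoint is just one possible reading of "show top N".

```python
def rank_series(rates_by_checkpoint, top_n=10):
    """Convert per-checkpoint rate estimates into rank-over-time lines.

    `rates_by_checkpoint` is a list of (x, {arm: rate}) pairs ordered by x
    (a hypothetical shape). Rank 1 is the highest rate. Only arms ranked in
    the top `top_n` at the final checkpoint are returned.
    """
    ranked = []
    for x, rates in rates_by_checkpoint:
        # Sort by rate descending; break ties by arm id so lines are stable.
        order = sorted(rates, key=lambda arm: (-rates[arm], arm))
        ranked.append((x, {arm: i + 1 for i, arm in enumerate(order)}))
    if not ranked:
        return {}

    final_ranks = ranked[-1][1]
    keep = sorted(
        (arm for arm, rank in final_ranks.items() if rank <= top_n),
        key=final_ranks.get,
    )
    # One line per kept arm; checkpoints where the arm is absent are skipped.
    return {
        arm: [(x, ranks[arm]) for x, ranks in ranked if arm in ranks]
        for arm in keep
    }


data = [
    (100, {"post1": 0.10, "post2": 0.12, "post3": 0.05}),
    (200, {"post1": 0.13, "post2": 0.11, "post3": 0.06}),
]
print(rank_series(data, top_n=2))
```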

5. Distribution Evolution (Ridge/Joy Plot)

Best for: Educational view, single arm deep-dive

Trial 1000 ───────╱╲───────────────
Trial 500  ────╱──╲────────────────
Trial 100  ──╱────╲────────────────
Trial 10   ╱───────╲───────────────
           0.0    0.1    0.2    Rate
  • X-axis: conversion rate
  • Y-axis: stacked by trial number
  • Shape: actual Beta distribution PDF

Pros: Beautiful, shows full uncertainty evolution, educational
Cons: Only practical for 1-3 arms, complex to read

Prototype tasks:

  • Ridge plot implementation
  • Beta PDF calculation and rendering
  • Animation option (morphing distribution)
  • Use as detail view when clicking an arm
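For the Beta PDF calculation, a small standard-library sketch is shown below. It works in log space with `math.lgamma` so large counts do not overflow, and samples the curve at grid midpoints to avoid the endpoints 0 and 1. The names `beta_pdf` and `ridge_curve` are hypothetical, and the Beta(1, 1) prior is an assumption.

```python
import math


def beta_pdf(x, alpha, beta):
    """Beta(alpha, beta) density at x, for 0 < x < 1, computed in log space."""
    log_norm = math.lgamma(alpha + beta) - math.lgamma(alpha) - math.lgamma(beta)
    log_density = (
        log_norm + (alpha - 1) * math.log(x) + (beta - 1) * math.log(1 - x)
    )
    return math.exp(log_density)


def ridge_curve(turns, rewards, points=200):
    """Sample one ridge (posterior PDF) at the midpoints of an even grid.

    Assumes a uniform Beta(1, 1) prior, which is an assumption of this sketch.
    """
    alpha, beta = 1 + rewards, 1 + turns - rewards
    xs = [(i + 0.5) / points for i in range(points)]
    return xs, [beta_pdf(x, alpha, beta) for x in xs]


# One ridge per snapshot, e.g. the same arm at 10, 100 and 1000 turns.
for turns, rewards in [(10, 1), (100, 10), (1000, 100)]:
    xs, ys = ridge_curve(turns, rewards)
    print(turns, round(max(ys), 2))
```

The peak height grows as the distribution narrows, so a ridge plot may need per-row normalization to keep early, flat snapshots visible.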

6. Convergence Indicator

Best for: Quick status check, dashboard widget

CI Width
     │╲
 0.3─│ ╲
     │  ╲____
 0.1─│       ╲________
     └────────────────→ trials
     
     [███████████░░░░] 78% confident
  • Simple line showing uncertainty shrinking over time
  • Or: progress bar showing "confidence level"

Pros: At-a-glance experiment maturity, answers "can we decide yet?"
Cons: Supplementary only, doesn't show which arm is winning

Prototype tasks:

  • CI width over time line chart
  • Confidence progress bar widget
  • Threshold indicator (e.g., "95% confident A beats B")
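The CI-width line could be prototyped cheaply with a normal approximation to the Beta posterior, as sketched here. The names `ci_width` and `convergence_series`, the Beta(1, 1) prior, and the `target_width` threshold are all assumptions for illustration. The approximation is rough when an arm has few turns or a rate near 0 or 1.

```python
import math


def ci_width(turns, rewards, z=1.96):
    """Approximate width of the 95% credible interval for one arm.

    Uses the exact Beta variance with a normal approximation for the
    interval (width = 2 * z * sd), under an assumed Beta(1, 1) prior.
    """
    alpha, beta = 1 + rewards, 1 + turns - rewards
    total = alpha + beta
    variance = alpha * beta / (total ** 2 * (total + 1))
    return 2 * z * math.sqrt(variance)


def convergence_series(snapshots, target_width=0.05):
    """Return (x, width, converged) for each snapshot of one arm.

    `snapshots` is a list of (total_experiment_turns, turns, rewards);
    `target_width` is an illustrative threshold, not a value from the issue.
    """
    series = []
    for x, turns, rewards in snapshots:
        width = ci_width(turns, rewards)
        series.append((x, width, width <= target_width))
    return series


snapshots = [(10, 10, 1), (100, 100, 10), (1000, 1000, 100), (10000, 10000, 1000)]
for x, width, converged in convergence_series(snapshots):
    print(x, round(width, 3), converged)
```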

Recommended Approach

Phase 1: Core Charts

  1. Line chart with CI bands (primary for A/B tests)
  2. Heatmap (primary for large experiments)

Phase 2: Decision Support

  1. P(best) stacked area
  2. Convergence indicator

Phase 3: Advanced

  1. Ranking chart
  2. Ridge plot for deep-dive

Technical Considerations

  • Library: Chart.js, D3.js, or Apache ECharts
  • Rendering: Client-side JavaScript, data via JSON endpoint
  • Responsive: Charts should work on mobile
  • Accessibility: Color-blind friendly palettes, screen reader support
  • Performance: Lazy load charts, paginate large experiments

Deliverables

  • Prototype each chart type with sample data
  • Screenshot/demo of each for stakeholder review
  • Recommendation for default chart per experiment type
  • Performance testing with large datasets

Dependencies
