## Problem

When experiments have no data (cold start), the Thompson sampling algorithm returns identical scores for all arms: with 0 turns and 0 rewards, every arm's beta distribution has alpha=1, beta=1, and sampling produces the same value for every arm within a single run. The result is a deterministic ordering instead of random exploration.
## Impact
- During cold start, content is always displayed in the same order
- Prevents proper exploration of all content options
- Reduces effectiveness of the reinforcement learning algorithm
- Creates bias toward content that appears first in the deterministic order
## Root Cause

The beta distribution sampling in `ThompsonCalculator::calculateThompsonScores()` generates scores like:

```php
$alpha = $data->rewards + 1;
$beta = $data->turns - $data->rewards + 1;
$scores[$id] = $this->randBeta($alpha, $beta);
```

When all arms have 0 turns and 0 rewards:
- All arms get alpha=1, beta=1
- The `randBeta()` function returns similar/identical values
- No differentiation between arms during initial exploration
## Solution Implemented

Add a micro-randomization tie-breaker to ensure unique scores:

```php
$base_score = $this->randBeta($alpha, $beta);
$tie_breaker = mt_rand(1, 999) / 1000000;
$scores[$id] = $base_score + $tie_breaker;
```

This ensures:
- Each arm gets a unique score even with identical statistics
- Proper randomization during cold start
- The statistical properties of Thompson sampling are maintained
- The tie-breaker is small enough (0.000001-0.000999) to not affect learned preferences
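As a self-contained illustration of the fix (a sketch, not the project's actual code: the arm data, function signature, and the `$sharedDraw` stand-in are assumptions), the scoring loop with the tie-breaker could look like this. The stand-in deliberately reproduces the cold-start failure mode by giving every arm the same base draw:

```php
<?php
// Hypothetical sketch of the fixed scoring loop. The real randBeta() is
// replaced by a single shared draw, mimicking the degenerate cold-start
// behaviour described above (identical base scores for alpha=1, beta=1).
function calculateThompsonScores(array $arms): array
{
    $scores = [];
    $sharedDraw = mt_rand() / mt_getrandmax(); // stand-in Beta(1,1) draw

    foreach ($arms as $id => $data) {
        $alpha = $data['rewards'] + 1;
        $beta  = $data['turns'] - $data['rewards'] + 1;

        // Every cold-start arm shares this base value.
        $base = $sharedDraw;

        // Micro-randomization tie-breaker: 0.000001 .. 0.000999.
        $tieBreaker = mt_rand(1, 999) / 1000000;

        $scores[$id] = $base + $tieBreaker;
    }

    return $scores;
}

// Cold start: no turns, no rewards for any arm.
$arms = [
    'a' => ['turns' => 0, 'rewards' => 0],
    'b' => ['turns' => 0, 'rewards' => 0],
    'c' => ['turns' => 0, 'rewards' => 0],
];
$scores = calculateThompsonScores($arms);
arsort($scores); // ordering now varies between runs instead of being fixed
```

Without the tie-breaker the `arsort()` call would leave the arms in their original insertion order every time, which is exactly the deterministic-ordering bug.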
## Testing
- Verified unique scores generated during cold start
- Confirmed random exploration with empty experiment data
- Validated tie-breaker doesn't interfere with learned preferences
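A hedged sketch of what such checks could look like (a standalone script, not the repository's real test suite; the base score of 0.5 and the learned scores 0.60/0.55 are illustrative constants):

```php
<?php
// 1) Cold start: identical base scores plus the tie-breaker should let
//    every arm win the top slot at least once across repeated runs.
$winners = [];
for ($run = 0; $run < 300; $run++) {
    $scores = [];
    foreach (['a', 'b', 'c'] as $id) {
        $coldStartBase = 0.5; // stand-in for the shared Beta(1,1) draw
        $scores[$id] = $coldStartBase + mt_rand(1, 999) / 1000000;
    }
    arsort($scores);
    $winners[array_key_first($scores)] = true;
}

// 2) Learned preferences: the tie-breaker's ceiling (0.000999) cannot
//    flip a meaningful gap between learned scores, even in the worst case
//    where the stronger arm draws the smallest tie-breaker and the weaker
//    arm draws the largest.
$strongArm = 0.60 + 1 / 1000000;
$weakArm   = 0.55 + 999 / 1000000;
$gapPreserved = $strongArm > $weakArm;
```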
Fixed in PR #8
🤖 Generated with Claude Code