Financial Prediction Extraction System
A multi-stage system for automatically extracting and tracking financial predictions from podcast transcripts. It uses an LLM pipeline to identify, extract, and analyze price predictions together with supporting metadata.
The system processes podcast audio files to:
- Transcribe audio using GPU-accelerated Whisper models
- Extract financial predictions using a three-stage LLM pipeline
- Track predictions with timestamps, confidence levels, and timeframes
- Generate structured data for analysis and backtesting
- Stage 1: Fast scanning to locate potential predictions in transcripts
- Stage 2: Focused extraction with concurrent processing for speed
- Stage 3: Advanced timeframe parsing and validation (the three stages are sketched in code after this feature list)
- GPU-accelerated transcription via Vast.ai integration
- Concurrent API processing (3x faster than sequential)
- Smart batching to optimize token usage
- Automatic rate limit handling
- Asset identification (stocks, cryptocurrencies, ETFs)
- Price targets and percentage changes
- Timeframe parsing (specific dates, relative timeframes)
- Speaker attribution and confidence levels
- Exact timestamps with YouTube links
- Modular LLM client system (GPT-4, Claude, etc.)
- Configurable models per stage
- Cost optimization with model selection
- Dry-run mode for testing
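The three pipeline stages and the per-stage model selection described above could be wired together roughly as follows. This is a minimal sketch: `extract_predictions` and the `llm_call` callable are illustrative stand-ins, not the repository's actual API.

```python
# Sketch of the three-stage flow; function names and prompt wording are illustrative.
import os

STAGE_MODELS = {
    "stage1": os.getenv("STAGE1_MODEL", "gpt-4o"),       # fast scanning
    "stage2": os.getenv("STAGE2_MODEL", "gpt-4o"),       # focused extraction
    "stage3": os.getenv("STAGE3_MODEL", "gpt-4-turbo"),  # timeframe refinement
}

def extract_predictions(transcript: str, llm_call):
    """Chain the three stages; llm_call(model, prompt) is any text-in/text-out client."""
    # Stage 1: cheap scan that flags transcript passages likely to contain predictions
    candidates = llm_call(STAGE_MODELS["stage1"],
                          f"List passages that contain price predictions:\n{transcript}")
    # Stage 2: pull structured fields (asset, value, confidence) out of each candidate
    extracted = llm_call(STAGE_MODELS["stage2"],
                         f"Extract asset, price target and confidence as JSON:\n{candidates}")
    # Stage 3: normalize timeframes (e.g. "end of next year" -> "end of 2025") and validate
    refined = llm_call(STAGE_MODELS["stage3"],
                       f"Normalize timeframes and drop malformed entries:\n{extracted}")
    return refined
```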
```bash
# Clone the repository
git clone https://github.com/yourusername/xtotext.git
cd xtotext

# Create virtual environment
python -m venv venv
source venv/bin/activate  # On Windows: venv\Scripts\activate

# Install dependencies
pip install -r requirements.txt
```

- Copy the example configuration:

```bash
cp config/config.example.py config/config.py
```

- Set your API keys as environment variables:
```bash
export OPENAI_API_KEY="your-openai-key"
export DIGITAL_OCEAN_API_KEY="your-do-key"
export VAST_API_KEY="your-vast-key"
```

- Configure model selection (optional):

```bash
export STAGE1_MODEL="gpt-4o"       # Fast scanning
export STAGE2_MODEL="gpt-4o"       # Extraction
export STAGE3_MODEL="gpt-4-turbo"  # Refinement
```

```bash
# Process a podcast channel
python main.py
```

The system will:
- Download new episodes from the configured channel
- Transcribe audio using GPU instances
- Extract predictions using the LLM pipeline
- Save results to data/episodes/[channel]/prediction_data/ (see the loading sketch below)
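Once a run finishes, the saved prediction files can be read back for analysis (the JSON schema is shown below). A minimal sketch, assuming one record per file and using a placeholder channel directory:

```python
# Sketch: load saved predictions for one channel (directory name is a placeholder;
# assumes one prediction record per JSON file).
import json
from pathlib import Path

prediction_dir = Path("data/episodes/my_channel/prediction_data")
predictions = [json.loads(p.read_text()) for p in sorted(prediction_dir.glob("*.json"))]

# Example: list high-confidence BTC targets
btc = [p for p in predictions if p.get("asset") == "BTC" and p.get("confidence") == "high"]
print(f"{len(btc)} high-confidence BTC predictions")
```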
Predictions are saved as JSON with the following structure:
```json
{
  "text": "BTC to $150,000",
  "asset": "BTC",
  "value": 150000.0,
  "confidence": "high",
  "timestamp": "1:23:45",
  "timeframe": "end of 2025",
  "context": "I think Bitcoin will reach $150k by end of next year...",
  "episode": "Episode Title",
  "episode_date": "2024-01-15",
  "reasoning": "Based on halving cycle and institutional adoption"
}
```

Audio Files → Transcription (Vast.ai GPU) → Stage 1 (Scanning) → Stage 2 (Extraction) → Stage 3 (Refinement) → Structured Data
- Downloaders: YouTube channel monitoring and downloading
- Infrastructure: Vast.ai GPU management, Digital Ocean processing
- LLM Clients: Modular system supporting multiple models
- Prediction Tracker: Core extraction and processing logic
- Storage: JSON-based prediction database
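The "timestamp" field in the JSON schema above pairs with the YouTube links mentioned in the feature list. A small sketch of the conversion; the video ID is a placeholder:

```python
# Sketch: turn a prediction's h:mm:ss timestamp into a YouTube deep link.
def timestamp_to_seconds(ts: str) -> int:
    seconds = 0
    for part in ts.split(":"):      # works for m:ss and h:mm:ss alike
        seconds = seconds * 60 + int(part)
    return seconds

video_id = "VIDEO_ID"  # placeholder for the episode's actual video ID
t = timestamp_to_seconds("1:23:45")
print(f"https://www.youtube.com/watch?v={video_id}&t={t}s")  # ...&t=5025s
```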
Enable faster extraction with parallel API calls:
```bash
export ENABLE_CONCURRENT_PROCESSING=true
export CONCURRENT_WORKERS=3
```
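A rough sketch of what the concurrent path could look like, using a thread pool sized from CONCURRENT_WORKERS; `extract_chunk` stands in for whatever function makes the per-chunk API call:

```python
# Sketch: fan per-chunk extraction calls out across a small worker pool.
import os
from concurrent.futures import ThreadPoolExecutor

def extract_all(chunks, extract_chunk):
    workers = int(os.getenv("CONCURRENT_WORKERS", "3"))
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # pool.map preserves input order, so results line up with transcript chunks
        return list(pool.map(extract_chunk, chunks))
```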
Test Stage 3 without making API calls:

```bash
export STAGE3_DRY_RUN=true
```

Configure extraction behavior in llm_extractor_two_stage.py (illustrative values are sketched after this list):
- Asset name mappings
- Confidence thresholds
- Context window sizes
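For illustration only, settings of this kind might look something like the following; the actual variable names in the module may differ:

```python
# Illustrative extraction settings (names are hypothetical, not the module's real ones)
ASSET_ALIASES = {"bitcoin": "BTC", "ether": "ETH", "s&p 500": "SPX"}
MIN_CONFIDENCE = "medium"      # discard predictions below this confidence level
CONTEXT_WINDOW_CHARS = 1500    # transcript context kept around each prediction
```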
Typical processing times:
- Transcription: 2-3 minutes per hour of audio
- Extraction: 3-4 minutes per episode
- Total: ~45 minutes for 12 episodes with concurrent processing
Approximate costs per episode:
- Stage 1: $0.05-0.10
- Stage 2: $0.10-0.20
- Stage 3: $0.01-0.02
- Total: $0.16-0.32 per episode
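For a 12-episode batch like the one above, that works out to roughly $2-4 in extraction costs.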
MIT License - see LICENSE file for details
This tool is for research and analysis purposes only. Always verify predictions and perform your own due diligence before making financial decisions.