Financial Prediction Extraction System
A multi-stage system for automatically extracting and tracking financial predictions from podcast transcripts. It uses an LLM pipeline to identify, extract, and analyze price predictions together with supporting metadata.
The system processes podcast audio files to:
- Transcribe audio using GPU-accelerated Whisper models
- Extract financial predictions using a three-stage LLM pipeline
- Track predictions with timestamps, confidence levels, and timeframes
- Generate structured data for analysis and backtesting
- Stage 1: Fast scanning to locate potential predictions in transcripts
- Stage 2: Focused extraction with concurrent processing for speed
- Stage 3: Advanced timeframe parsing and validation (the three stages are sketched in code after this feature list)
- GPU-accelerated transcription via Vast.ai integration
- Concurrent API processing (3x faster than sequential)
- Smart batching to optimize token usage
- Automatic rate limit handling
- Asset identification (stocks, cryptocurrencies, ETFs)
- Price targets and percentage changes
- Timeframe parsing (specific dates, relative timeframes)
- Speaker attribution and confidence levels
- Exact timestamps with YouTube links
- Modular LLM client system (GPT-4, Claude, etc.)
- Configurable models per stage
- Cost optimization with model selection
- Dry-run mode for testing
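The three pipeline stages and the per-stage model selection described above could be wired together roughly as follows. This is a minimal sketch: `extract_predictions` and the `llm_call` callable are illustrative stand-ins, not the repository's actual API.

```python
# Sketch of the three-stage flow; function names and prompt wording are illustrative.
import os

STAGE_MODELS = {
    "stage1": os.getenv("STAGE1_MODEL", "gpt-4o"),       # fast scanning
    "stage2": os.getenv("STAGE2_MODEL", "gpt-4o"),       # focused extraction
    "stage3": os.getenv("STAGE3_MODEL", "gpt-4-turbo"),  # timeframe refinement
}

def extract_predictions(transcript: str, llm_call):
    """Chain the three stages; llm_call(model, prompt) is any text-in/text-out client."""
    # Stage 1: cheap scan that flags transcript passages likely to contain predictions
    candidates = llm_call(STAGE_MODELS["stage1"],
                          f"List passages that contain price predictions:\n{transcript}")
    # Stage 2: pull structured fields (asset, value, confidence) out of each candidate
    extracted = llm_call(STAGE_MODELS["stage2"],
                         f"Extract asset, price target and confidence as JSON:\n{candidates}")
    # Stage 3: normalize timeframes (e.g. "end of next year" -> "end of 2025") and validate
    refined = llm_call(STAGE_MODELS["stage3"],
                       f"Normalize timeframes and drop malformed entries:\n{extracted}")
    return refined
```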
```bash
# Clone the repository
git clone https://github.com/yourusername/xtotext.git
cd xtotext

# Create virtual environment
python -m venv venv
source venv/bin/activate  # On Windows: venv\Scripts\activate

# Install dependencies
pip install -r requirements.txt
```

- Copy the example configuration:

```bash
cp config/config.example.py config/config.py
```

- Set your API keys as environment variables:
```bash
export OPENAI_API_KEY="your-openai-key"
export DIGITAL_OCEAN_API_KEY="your-do-key"
export VAST_API_KEY="your-vast-key"
```

- Configure model selection (optional):

```bash
export STAGE1_MODEL="gpt-4o"       # Fast scanning
export STAGE2_MODEL="gpt-4o"       # Extraction
export STAGE3_MODEL="gpt-4-turbo"  # Refinement
```

```bash
# Process a podcast channel
python main.py
```

The system will:
- Download new episodes from the configured channel
- Transcribe audio using GPU instances
- Extract predictions using the LLM pipeline
- Save results to data/episodes/[channel]/prediction_data/ (see the loading sketch below)
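Once a run finishes, the saved prediction files can be read back for analysis (the JSON schema is shown below). A minimal sketch, assuming one record per file and using a placeholder channel directory:

```python
# Sketch: load saved predictions for one channel (directory name is a placeholder;
# assumes one prediction record per JSON file).
import json
from pathlib import Path

prediction_dir = Path("data/episodes/my_channel/prediction_data")
predictions = [json.loads(p.read_text()) for p in sorted(prediction_dir.glob("*.json"))]

# Example: list high-confidence BTC targets
btc = [p for p in predictions if p.get("asset") == "BTC" and p.get("confidence") == "high"]
print(f"{len(btc)} high-confidence BTC predictions")
```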
Predictions are saved as JSON with the following structure:
```json
{
  "text": "BTC to $150,000",
  "asset": "BTC",
  "value": 150000.0,
  "confidence": "high",
  "timestamp": "1:23:45",
  "timeframe": "end of 2025",
  "context": "I think Bitcoin will reach $150k by end of next year...",
  "episode": "Episode Title",
  "episode_date": "2024-01-15",
  "reasoning": "Based on halving cycle and institutional adoption"
}
```

Audio Files → Transcription (Vast.ai GPU) → Stage 1 (Scanning) → Stage 2 (Extraction) → Stage 3 (Refinement) → Structured Data
- Downloaders: YouTube channel monitoring and downloading
- Infrastructure: Vast.ai GPU management, Digital Ocean processing
- LLM Clients: Modular system supporting multiple models
- Prediction Tracker: Core extraction and processing logic
- Storage: JSON-based prediction database
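The "timestamp" field in the JSON schema above pairs with the YouTube links mentioned in the feature list. A small sketch of the conversion; the video ID is a placeholder:

```python
# Sketch: turn a prediction's h:mm:ss timestamp into a YouTube deep link.
def timestamp_to_seconds(ts: str) -> int:
    seconds = 0
    for part in ts.split(":"):      # works for m:ss and h:mm:ss alike
        seconds = seconds * 60 + int(part)
    return seconds

video_id = "VIDEO_ID"  # placeholder for the episode's actual video ID
t = timestamp_to_seconds("1:23:45")
print(f"https://www.youtube.com/watch?v={video_id}&t={t}s")  # ...&t=5025s
```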
Enable faster extraction with parallel API calls:
```bash
export ENABLE_CONCURRENT_PROCESSING=true
export CONCURRENT_WORKERS=3
```
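A rough sketch of what the concurrent path could look like, using a thread pool sized from CONCURRENT_WORKERS; `extract_chunk` stands in for whatever function makes the per-chunk API call:

```python
# Sketch: fan per-chunk extraction calls out across a small worker pool.
import os
from concurrent.futures import ThreadPoolExecutor

def extract_all(chunks, extract_chunk):
    workers = int(os.getenv("CONCURRENT_WORKERS", "3"))
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # pool.map preserves input order, so results line up with transcript chunks
        return list(pool.map(extract_chunk, chunks))
```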
Test Stage 3 without making API calls:

```bash
export STAGE3_DRY_RUN=true
```

Configure extraction behavior in llm_extractor_two_stage.py (illustrative values are sketched after this list):
- Asset name mappings
- Confidence thresholds
- Context window sizes
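For illustration only, settings of this kind might look something like the following; the actual variable names in the module may differ:

```python
# Illustrative extraction settings (names are hypothetical, not the module's real ones)
ASSET_ALIASES = {"bitcoin": "BTC", "ether": "ETH", "s&p 500": "SPX"}
MIN_CONFIDENCE = "medium"      # discard predictions below this confidence level
CONTEXT_WINDOW_CHARS = 1500    # transcript context kept around each prediction
```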
Typical processing times:
- Transcription: 2-3 minutes per hour of audio
- Extraction: 3-4 minutes per episode
- Total: ~45 minutes for 12 episodes with concurrent processing
Approximate costs per episode:
- Stage 1: $0.05-0.10
- Stage 2: $0.10-0.20
- Stage 3: $0.01-0.02
- Total: $0.16-0.32 per episode
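For a 12-episode batch like the one above, that works out to roughly $2-4 in extraction costs.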
MIT License - see LICENSE file for details
This tool is for research and analysis purposes only. Always verify predictions and perform your own due diligence before making financial decisions.