LingoFlix: Master English Through Movies

LingoFlix is an innovative language learning application designed to help users master English phrasal verbs and slang in a fun, immersive way. Leveraging iconic movie scenes and dialogue, LingoFlix transforms popular films into interactive language lessons. Whether you're a beginner or an advanced learner, you'll improve your vocabulary, comprehension, and fluency by exploring authentic English as it's spoken in real-life situations.

Key Features

  • Movie-Based Learning: Learn English through iconic scenes and dialogues from a wide range of movies, making language learning both entertaining and effective.
  • Phrasal Verbs and Slang Mastery: Focus specifically on understanding and using phrasal verbs and slang, two of the most challenging aspects of English.
  • Interactive Exercises: Complete fill-in-the-blank, multiple-choice, and context-based exercises that help you practice and retain new expressions.
  • Diverse Content Library: Access a growing library of movies and genres, from classic films to modern blockbusters, ensuring a variety of language styles and accents.
  • Automatic Content Extraction: Includes Python scripts to automatically select and extract content from popular movies for creating interactive English learning exercises.

LingoFlix makes language learning a cinematic adventure, turning your favorite movies into your best language teacher.

Technical Decisions

Frontend (Next.js Application)

  • Frontend Framework: Next.js 14 with the App Router
  • UI Styling: Tailwind CSS or CSS Modules
  • State Management: React Context API (no Redux, MobX, Zustand, etc.)
  • Data Loading (MVP): Fetching JSON from public/data/exercises
  • Data Loading (Future): Fetching from a cloud bucket (e.g., Cloudflare R2) using a configuration flag (NEXT_PUBLIC_DATA_BASE_URL)
  • Responsiveness: Fully responsive UI for mobile, tablet, and desktop.
  • Navigation: Client-side navigation without full reloads.
  • JSON Contract: UI must parse and render JSON according to the schema defined by Content Development.

Backend (Python Content Extraction Scripts)

  • Primary Language: Python for Content Extraction, Data Processing, and Analytics.
  • Key Modules: imdb_fetcher.py, json_generator.py, nlp_processor.py, llm_exercise_generator.py, script_scraper.py
  • Automatic Movie Selection: Selects popular movies based on IMDb popularity and rating from the past 5 years.
  • Script Scraping: Scrapes movie scripts directly from IMSDb.
  • Language Feature Extraction: Extracts phrasal verbs and slang using LLM-based NLP accessed via the OpenAI and OpenRouter APIs.
  • LLM-based Exercise Generation: Creates contextually appropriate and challenging exercise options using LLMs.
  • Intelligent Fallback: Falls back to traditional methods when LLM is not available.
  • Exercise Types: Creates fill-in-the-blank and multiple-choice exercises.
  • Local Storage: Stores generated exercises as JSON files locally.
  • Optional Cloud Storage: Can upload exercises to a cloud bucket if configured.
  • Difficulty Categorization: Categorizes exercises by difficulty level (currently Intermediate).
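The "intelligent fallback" above can be as simple as turning a detected phrasal verb into a gap-fill item without calling an LLM. A minimal sketch, assuming hypothetical function and field names (the real modules may differ):

```python
import random

def make_fill_in_the_blank(line: str, phrase: str, distractors: list[str], seed: int = 0) -> dict:
    """Turn a script line containing a phrasal verb into a fill-in-the-blank exercise.

    `line` is a dialogue line, `phrase` the detected phrasal verb, and
    `distractors` plausible wrong answers (e.g. drawn from a static word
    list when no LLM is available).
    """
    if phrase not in line:
        raise ValueError("phrase must occur in the line")
    question = line.replace(phrase, "_____", 1)
    options = distractors + [phrase]
    random.Random(seed).shuffle(options)  # deterministic shuffle for reproducibility
    return {
        "type": "fill_in_the_blank",
        "question": question,
        "options": options,
        "answer": phrase,
        "difficulty": "Intermediate",
    }

ex = make_fill_in_the_blank(
    "We need to figure out what happened.",
    "figure out",
    ["give up", "run into", "look after"],
)
print(ex["question"])  # We need to _____ what happened.
```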

Project Structure

lingo-flix/
├── scripts/                      # Python extraction utilities
│   ├── main.py                   # Main script to run the content extraction process
│   ├── ...                       # Other script files
├── public/
│   ├── data/
│   │   └── exercises/            # JSON exercise files (MVP data source)
│   │       └── ...               # Exercise JSON files
├── src/                          # NextJS application code
│   ├── app/                      # App Router pages and layout
│   │   ├── ...                   # Page and layout files
│   ├── components/               # React components
│   │   ├── ...                   # Component files
│   ├── context/                  # React Context API for state management
│   │   └── AppContext.tsx
│   └── types/                    # TypeScript type definitions
│       └── index.ts
├── .env.example                  # Example environment variables file
├── package.json                  # Frontend dependencies and scripts
├── requirements.txt              # Python dependencies
├── next.config.ts                # Next.js configuration
├── tsconfig.json                 # TypeScript configuration
├── postcss.config.mjs            # PostCSS configuration (for Tailwind CSS)
├── eslint.config.mjs             # ESLint configuration
└── README.md                     # Project README

Python Scripts

The scripts directory contains the Python utilities for content extraction, processing, and management.

Top-Level Scripts

  • main.py: The main script to orchestrate the entire content extraction process, from fetching movie data to generating exercises.
  • group_generator.py: Script responsible for generating and organizing exercises into thematic or genre-based groups.
  • legal_lint.py: Performs checks to ensure compliance with legal and licensing requirements for using movie content. This script is typically run periodically or as part of a continuous integration/continuous deployment (CI/CD) pipeline to maintain compliance.
  • purge_raw.py: Script to clean up and remove raw data files after processing to manage storage space. This script is typically executed after the content extraction process is complete to remove temporary raw script files.
  • run_with_monitoring.py: Executes other scripts with monitoring capabilities to track performance and resource usage.
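The runtime-limit half of run_with_monitoring.py can be approximated with a subprocess timeout; this is an illustrative sketch (cf. MAX_RUNTIME_HOURS in the Configuration section), not the script's actual implementation:

```python
import subprocess
import sys
import time

def run_with_limit(cmd: list[str], max_runtime_s: float) -> int:
    """Run a command, killing it if it exceeds max_runtime_s."""
    start = time.monotonic()
    try:
        proc = subprocess.run(cmd, timeout=max_runtime_s)
    except subprocess.TimeoutExpired:
        print(f"killed after {max_runtime_s}s", file=sys.stderr)
        return 1
    print(f"finished in {time.monotonic() - start:.1f}s with code {proc.returncode}")
    return proc.returncode

# Example: a quick no-op run well under the limit
exit_code = run_with_limit([sys.executable, "-c", "print('ok')"], max_runtime_s=30)
```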

Utility Scripts (scripts/utils)

  • imdb_fetcher.py: Fetches movie information, such as popularity and ratings, from IMDb.
  • json_generator.py: Handles the generation of exercise data in JSON format and optionally uploads them to cloud storage.
  • nlp_processor.py: Utilizes NLP models to identify and extract phrasal verbs and slang from movie scripts.
  • llm_exercise_generator.py: Generates diverse and contextually relevant options for exercises using Large Language Models.
  • script_scraper.py: Scrapes movie scripts from online sources like IMSDb.
  • cache_manager.py: Manages caching of fetched data and processing results to improve performance and reduce external API calls.
  • circuit_breaker.py: Implements a circuit breaker pattern to handle failures and prevent cascading errors in external service calls.
  • config_manager.py: Manages the application's configuration settings, including loading from environment variables and configuration files.
  • licence_registry.py: Tracks and manages licensing information for the content used in the application.
  • monitor.py: Provides monitoring functionalities for tracking script execution, resource usage, and performance metrics.
  • progress_tracker.py: Tracks the progress of long-running tasks and provides reporting on completion status.
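The circuit breaker pattern referenced by circuit_breaker.py works by refusing external calls after repeated failures, then allowing a trial call once a cooldown elapses. A minimal sketch under assumed parameters (the module's real API may differ):

```python
import time

class CircuitBreaker:
    """Open the circuit after `max_failures` consecutive errors; retry after `reset_s`."""

    def __init__(self, max_failures: int = 3, reset_s: float = 60.0):
        self.max_failures = max_failures
        self.reset_s = reset_s
        self.failures = 0
        self.opened_at = None

    def call(self, fn, *args, **kwargs):
        if self.opened_at is not None:
            if time.monotonic() - self.opened_at < self.reset_s:
                raise RuntimeError("circuit open: skipping external call")
            self.opened_at = None  # half-open: allow one trial call
        try:
            result = fn(*args, **kwargs)
        except Exception:
            self.failures += 1
            if self.failures >= self.max_failures:
                self.opened_at = time.monotonic()  # trip the breaker
            raise
        self.failures = 0  # success resets the failure count
        return result

# After two consecutive failures the breaker rejects calls outright,
# shielding the external API from further traffic.
breaker = CircuitBreaker(max_failures=2, reset_s=60)
def flaky():
    raise IOError("API down")
for _ in range(2):
    try:
        breaker.call(flaky)
    except IOError:
        pass
try:
    breaker.call(flaky)
except RuntimeError as e:
    print(e)  # circuit open: skipping external call
```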

Content Extraction Process

The content extraction process is handled by the Python scripts in the scripts directory. The main script (main.py) orchestrates the following steps to generate interactive English learning exercises from movies:

  1. Fetch Popular Movies: The process begins by identifying popular movies based on criteria such as IMDb rating and popularity using the imdb_fetcher.py script.
  2. Scrape Movie Scripts: Once movies are selected, the script_scraper.py script is used to scrape the raw movie scripts from online databases like IMSDb.
  3. Process Scripts (NLP): The raw scripts are then processed by the nlp_processor.py script, which utilizes Natural Language Processing models (like OpenAI or OpenRouter) to identify and extract relevant language features, specifically phrasal verbs and slang.
  4. Generate Exercises (LLM): The extracted language features are passed to the llm_exercise_generator.py script. This script leverages Large Language Models to create contextually appropriate and challenging exercise options (fill-in-the-blank and multiple-choice). An intelligent fallback mechanism is in place if the LLM is unavailable.
  5. Generate and Store JSON: Finally, the json_generator.py script takes the generated exercises and structures them into JSON files. These files are stored locally in the public/data/exercises/ directory and can optionally be uploaded to a configured cloud storage bucket.
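The five steps above map onto a straightforward pipeline in main.py. The following skeleton shows the data flow with stand-in callables for each stage; it is a sketch of the orchestration, not the actual main.py code:

```python
def run_pipeline(fetch_movies, scrape_script, extract_features, generate_exercises, store_json):
    """Each argument is a callable standing in for one pipeline stage."""
    results = []
    for movie in fetch_movies():                      # 1. imdb_fetcher
        script = scrape_script(movie)                 # 2. script_scraper
        if script is None:
            continue                                  # skip movies without a script
        features = extract_features(script)           # 3. nlp_processor
        exercises = generate_exercises(features)      # 4. llm_exercise_generator
        results.append(store_json(movie, exercises))  # 5. json_generator
    return results

# Stub stages to illustrate the flow end to end
paths = run_pipeline(
    fetch_movies=lambda: ["Inception"],
    scrape_script=lambda m: "We need to figure out the dream levels.",
    extract_features=lambda s: ["figure out"],
    generate_exercises=lambda feats: [{"answer": p} for p in feats],
    store_json=lambda m, ex: f"public/data/exercises/{m.lower()}.json",
)
print(paths)  # ['public/data/exercises/inception.json']
```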

Here is a sequence diagram illustrating the content extraction process:

```mermaid
sequenceDiagram
    participant User
    participant main.py
    participant imdb_fetcher.py
    participant script_scraper.py
    participant nlp_processor.py
    participant llm_exercise_generator.py
    participant json_generator.py
    participant ExternalAPIs
    participant LocalStorage
    participant CloudStorage

    User->>main.py: Start Extraction Process
    main.py->>imdb_fetcher.py: Request Popular Movies
    imdb_fetcher.py->>ExternalAPIs: Fetch Movie Data (IMDb)
    ExternalAPIs-->>imdb_fetcher.py: Movie Data
    imdb_fetcher.py-->>main.py: List of Popular Movies
    main.py->>script_scraper.py: Request Movie Scripts
    script_scraper.py->>ExternalAPIs: Scrape Scripts (IMSDb)
    ExternalAPIs-->>script_scraper.py: Raw Scripts
    script_scraper.py-->>main.py: Raw Scripts
    main.py->>nlp_processor.py: Process Scripts
    nlp_processor.py->>ExternalAPIs: Analyze Text (NLP Models)
    ExternalAPIs-->>nlp_processor.py: Language Features
    nlp_processor.py-->>main.py: Extracted Language Features
    main.py->>llm_exercise_generator.py: Generate Exercises
    llm_exercise_generator.py->>ExternalAPIs: Generate Options (LLM)
    ExternalAPIs-->>llm_exercise_generator.py: Exercise Options
    llm_exercise_generator.py-->>main.py: Generated Exercises
    main.py->>json_generator.py: Generate and Store JSON
    json_generator.py->>LocalStorage: Save JSON Files
    LocalStorage-->>json_generator.py: Confirmation
    json_generator.py->>CloudStorage: Upload JSON Files (Optional)
    CloudStorage-->>json_generator.py: Confirmation
    json_generator.py-->>main.py: Completion Status
    main.py-->>User: Extraction Complete
```

Getting Started

Prerequisites

  • Node.js (v18 or later recommended)
  • Python (v3.7 or later recommended)

Setup

  1. Clone the repository:

    git clone https://github.com/yourusername/lingo-flix.git
    cd lingo-flix
  2. Frontend Setup: Install frontend dependencies:

    npm install
    # or yarn install
    # or pnpm install
    # or bun install
  3. Backend Setup: Create a Python virtual environment and install dependencies:

    python -m venv venv
    source venv/bin/activate  # On Windows: venv\Scripts\activate
    pip install -r requirements.txt
  4. Environment Variables: Create a .env.local file in the project root for frontend environment variables and a .env file (also in the root) for backend environment variables. Copy the contents from .env.example and fill in your API keys and configuration.

    Refer to the "Configuration" section for a detailed list of environment variables.

Running the Application

  1. Start the Frontend Development Server:

    npm run dev
    # or yarn dev
    # or pnpm dev
    # or bun dev

    Open http://localhost:3000 with your browser to see the result.

  2. Run the Backend Content Extraction: Activate the Python virtual environment (if not already active):

    source venv/bin/activate  # On Windows: venv\Scripts\activate

    Run the main script:

    python scripts/main.py

    This will generate exercise JSON files in public/data/exercises/.

Configuration

A comprehensive list of environment variables can be found below. As described in Setup, frontend variables go in .env.local and backend variables in .env, both in the project root; copy the contents from .env.example and fill in your API keys and configuration.

# Environment variables for LingoFlix

# General Configuration
NODE_ENV=development # Node.js environment (e.g., development, production)
PORT=3000 # Port for the Next.js application
ENVIRONMENT=development # e.g., development, production

# API Configuration
OPENROUTER_API_KEY=XXX # Get one from https://openrouter.ai/
OMDB_API_KEY=XXXX # Get one from https://www.omdbapi.com/apikey.aspx

# Frontend Data URL
NEXT_PUBLIC_DATA_BASE_URL=/data/exercises # Base URL for fetching exercise data in the frontend (e.g., /data/exercises or a cloud storage URL)

# Contact Email
CONTACT_EMAIL=contact.lingoflix@gmail.com # Email address for contact and legal inquiries

# NLP Processing Configuration
NLP_MODEL=google/gemini-2.0-flash-001 # LLM model to use for NLP tasks
NLP_MAX_TOKENS=4000 # Max tokens for NLP model responses
NLP_TEMPERATURE=0.1 # Temperature for NLP model responses
API_TIMEOUT=30 # Timeout for API calls in seconds
DIALOGUE_ONLY=true # Process only dialogue from scripts (true/false)

# Processing Limits
DIFFICULTY_LEVEL=Intermediate # Difficulty level for exercises (e.g., Beginner, Intermediate, Advanced)
MAX_EXERCISES_TOTAL=20 # Maximum total exercises per movie
MAX_EXERCISES_PER_CATEGORY=10 # Maximum exercises per category (phrasal_verb, slang)
MAX_EXERCISES_PER_TYPE=5 # Maximum exercises per type (fill-in-the-blank, multiple_choice)
MIN_SCRIPT_WORD_COUNT=1000 # Minimum word count for a script to be processed
CHUNK_SIZE=15000 # Size of text chunks for NLP processing
CHUNK_OVERLAP=200 # Overlap between text chunks for NLP processing
SAMPLE_RATIO=0.3 # Ratio of exercises to sample if exceeding limits

# Cache Configuration
ENABLE_CACHING=true # Enable caching (true/false)
CACHE_ENABLED=true # Duplicate of ENABLE_CACHING, kept because parts of the code read this name
CACHE_DIR=cache # Directory for cache files
CACHE_EXPIRY_DAYS=30 # Default cache expiry in days
OMDB_CACHE_HOURS=168 # OMDb cache expiry in hours (1 week)
SCRIPT_CACHE_HOURS=168 # Script cache expiry in hours (1 week)
NLP_CACHE_HOURS=720 # NLP cache expiry in hours (30 days)
EXERCISE_CACHE_HOURS=24 # Exercise cache expiry in hours (1 day)

# Script Execution Monitoring
MAX_RUNTIME_HOURS=2 # Maximum script runtime in hours
MEMORY_LIMIT_GB=4 # Maximum memory usage in GB

# Raw Data Purging
PURGE_WINDOW_HOURS=24 # Age threshold in hours for raw script files to be purged
REQUIRE_BACKUP_SUCCESS=true # Require successful backup before purging (true/false)

# Directory Paths (usually relative to project root)
RAW_DATA_DIR=src/data/raw
PROCESSED_DATA_DIR=src/data/processed
SCRIPTS_DIR=src/data/raw/scripts # Directory for raw scripts
EXERCISES_DIR=src/data/processed/exercises # Directory for processed exercises
LANGUAGE_FEATURES_DIR=src/data/processed/language_features # Directory for language features
PUBLIC_EXERCISES_DIR=public/data/exercises # Directory for public exercises
LOG_DIR=logs # Directory for log files

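Reading these variables with defaults is roughly what config_manager.py does; here is a hedged sketch for the processing limits (key names match the listing above, but the function and return shape are illustrative):

```python
import os

def load_limits(env=os.environ) -> dict:
    """Read processing limits from the environment, falling back to the documented defaults."""
    return {
        "difficulty_level": env.get("DIFFICULTY_LEVEL", "Intermediate"),
        "max_exercises_total": int(env.get("MAX_EXERCISES_TOTAL", "20")),
        "chunk_size": int(env.get("CHUNK_SIZE", "15000")),
        "chunk_overlap": int(env.get("CHUNK_OVERLAP", "200")),
        "sample_ratio": float(env.get("SAMPLE_RATIO", "0.3")),
    }

limits = load_limits({"MAX_EXERCISES_TOTAL": "10"})  # overrides one value, defaults for the rest
print(limits["max_exercises_total"], limits["chunk_size"])  # 10 15000
```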

Output (Backend)

The backend scripts generate JSON files with exercises locally in the public/data/exercises/ directory. The files follow a specific JSON contract.
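The exact schema is owned by Content Development; purely as an illustration, a generated file might resemble the structure below. All field names here are hypothetical, not the actual contract:

```python
import json

# Hypothetical exercise file contents; the real JSON contract is defined elsewhere.
example = {
    "movie": "Inception",  # illustrative title
    "difficulty": "Intermediate",
    "exercises": [
        {
            "type": "multiple_choice",
            "category": "phrasal_verb",
            "question": "We need to _____ the dream levels.",
            "options": ["figure out", "give up", "run into", "look after"],
            "answer": "figure out",
        }
    ],
}

# Round-trip through JSON to confirm the shape survives serialization
serialized = json.dumps(example, indent=2)
assert json.loads(serialized)["exercises"][0]["answer"] == "figure out"
```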


Deployment

The easiest way to deploy the Next.js app is to use the Vercel Platform.

Check out the Next.js deployment documentation for more details.

License

MIT License
