This project uses Dagster to create a simple ETL (Extract, Transform, Load) pipeline that fetches new submissions from a specified subreddit using the Reddit API and stores them in a local SQLite database.
The pipeline consists of two main assets:
- `reddit_submissions`: Extracts data from a subreddit, transforms it, and loads new submissions into a SQLite `submissions` table.
- `preview_top_submissions`: A downstream asset that runs after `reddit_submissions` to display a preview of the 10 most recent posts.
The project is designed to be configurable, allowing you to easily change the target subreddit and the number of posts to fetch via a config.ini file.
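As a rough illustration of how such settings might be read, here is a minimal sketch using Python's standard `configparser`. The `[reddit]` section and the `subreddit` and `post_limit` keys are assumptions made for this example; check the project's actual `config.ini` for the real names.

```python
import configparser

# Hypothetical config.ini contents; the section and key names are assumptions.
SAMPLE_CONFIG = """
[reddit]
subreddit = dataengineering
post_limit = 25
"""


def load_settings(text: str) -> tuple[str, int]:
    """Parse INI text and return (subreddit, post_limit)."""
    parser = configparser.ConfigParser()
    parser.read_string(text)
    section = parser["reddit"]
    # getint converts the stored string to an integer.
    return section["subreddit"], section.getint("post_limit")


subreddit, post_limit = load_settings(SAMPLE_CONFIG)
print(subreddit, post_limit)
```

To read a file on disk instead of a string, `parser.read("config.ini")` could replace `parser.read_string(text)`.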
Here is a preview of the asset graph in the Dagster UI, showing the dependency between the two assets.
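The relationship between the two assets can be sketched in plain Python without Dagster: one function loads only unseen submissions, and a second one reads the most recent rows back. This is an illustrative sketch, not the project's code. The column names (`id`, `title`, `created_utc`) and the use of `INSERT OR IGNORE` for deduplication are assumptions.

```python
import sqlite3


def load_new_submissions(conn, rows):
    """Insert rows whose id is not stored yet; return how many were added."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS submissions ("
        "id TEXT PRIMARY KEY, title TEXT, created_utc REAL)"
    )
    before = conn.total_changes
    # The primary key on id makes SQLite skip rows that already exist.
    conn.executemany(
        "INSERT OR IGNORE INTO submissions (id, title, created_utc) "
        "VALUES (?, ?, ?)",
        rows,
    )
    conn.commit()
    return conn.total_changes - before


def preview_latest(conn, limit=10):
    """Return (id, title) for the most recent submissions, newest first."""
    return conn.execute(
        "SELECT id, title FROM submissions ORDER BY created_utc DESC LIMIT ?",
        (limit,),
    ).fetchall()


conn = sqlite3.connect(":memory:")
first = load_new_submissions(
    conn, [("a1", "First post", 1.0), ("b2", "Second post", 2.0)]
)
second = load_new_submissions(
    conn, [("b2", "Second post", 2.0), ("c3", "Third post", 3.0)]
)
print(first, second)
print(preview_latest(conn))
```

The second call repeats one submission on purpose, so only the unseen row is added.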
- Python >=3.9, <3.13
- A Reddit account with API credentials.
- uv (a fast Python package installer and resolver).
- Clone the Repository

  Start by cloning the project repository to your local machine.

      git clone https://github.com/rohanvh7/Reddit-Analysis.git
      cd Reddit-Analysis

- Create a Virtual Environment

  It's highly recommended to use a virtual environment. `uv` can create one for you.

      uv venv
      source .venv/bin/activate  # On Windows, use `.venv\Scripts\activate`

- Install Dependencies with `uv`

  With your virtual environment activated, use `uv sync` to install all required dependencies, including `dagster`, `praw`, and `pandas`, as defined in `pyproject.toml`. To include development dependencies (like `pytest`), use the `--all-extras` flag.

      uv sync --all-extras

- Configure Environment Variables

  This project uses a `.env` file to securely manage your Reddit API credentials. Create a file named `.env` in the root directory of the project.

  Copy the following format into your `.env` file and replace the placeholder values with your actual Reddit credentials.

      # .env file
      REDDIT_CLIENT_ID=YOUR_CLIENT_ID_HERE
      REDDIT_CLIENT_SECRET=YOUR_CLIENT_SECRET_HERE
      REDDIT_USERNAME=YOUR_USERNAME_HERE
      REDDIT_PASSWORD=YOUR_PASSWORD_HERE
      REDDIT_USER_AGENT=MyDagsterApp/0.1 by u/YourUsername

  Important: The `.gitignore` file is already configured to ignore `.env`, ensuring your secrets are not committed to version control. Make sure you don't have double quotes or `<>` around your credentials in the `.env` file.
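Before launching the pipeline, the `.env` contents can be sanity-checked for the mistakes mentioned above (missing keys, stray quotes, or angle brackets). The following is a minimal standalone sketch using only the standard library. The helper names `parse_env` and `find_problems` are hypothetical and are not part of the project, which may load its variables differently.

```python
REQUIRED_KEYS = (
    "REDDIT_CLIENT_ID",
    "REDDIT_CLIENT_SECRET",
    "REDDIT_USERNAME",
    "REDDIT_PASSWORD",
    "REDDIT_USER_AGENT",
)


def parse_env(text):
    """Parse simple KEY=VALUE lines, skipping blanks and # comments."""
    values = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        values[key.strip()] = value.strip()
    return values


def find_problems(values):
    """Return a list of human-readable issues with the parsed values."""
    problems = []
    for key in REQUIRED_KEYS:
        value = values.get(key, "")
        if not value:
            problems.append(f"{key} is missing")
        elif any(ch in value for ch in '"<>'):
            problems.append(f"{key} contains quotes or angle brackets")
    return problems


# Deliberately flawed sample to show what the check reports.
sample = """# .env file
REDDIT_CLIENT_ID=abc123
REDDIT_CLIENT_SECRET="s3cret"
REDDIT_USERNAME=<your_name>
REDDIT_PASSWORD=hunter2
"""

for problem in find_problems(parse_env(sample)):
    print(problem)
```

To check a real file, the text could be read with `open(".env").read()` and passed to `parse_env`.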
With your virtual environment activated and your .env file configured, you can launch the Dagster UI.
- Start the Dagster UI

  From your project's root directory, run the `dagster dev` command. Dagster will automatically find your code location based on the `[tool.dagster]` section of your `pyproject.toml`.

      dagster dev

- Access the UI

  Open your web browser and navigate to http://localhost:3000.

- Materialize the Assets

  In the Dagster UI, you will see the asset graph. To run the full pipeline:

  - Select the `preview_top_submissions` asset.
  - Click the "Materialize" button. Dagster will automatically run the upstream `reddit_submissions` asset first.

  Upon successful completion, a `submissions.db` file will be created in your project directory, and the run logs for the preview asset will display a table of the latest posts.
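To confirm outside the Dagster UI that the run produced data, the database file can be inspected directly. This is a small sketch using the standard `sqlite3` module. It assumes the file is named `submissions.db` and the table is named `submissions`, as described above, and assumes nothing about the columns.

```python
import sqlite3
from pathlib import Path


def summarize_db(db_path, table="submissions"):
    """Return (table_exists, row_count) for the given SQLite file."""
    # Check first: sqlite3.connect would create an empty file otherwise.
    if not Path(db_path).exists():
        return False, 0
    conn = sqlite3.connect(db_path)
    try:
        found = conn.execute(
            "SELECT 1 FROM sqlite_master WHERE type = 'table' AND name = ?",
            (table,),
        ).fetchone()
        if found is None:
            return False, 0
        count = conn.execute(f'SELECT COUNT(*) FROM "{table}"').fetchone()[0]
        return True, count
    finally:
        conn.close()


print(summarize_db("submissions.db"))
```

Run it from the project directory after a successful materialization. A result of `(False, 0)` means the file or table was not found.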
- AI Assistance: Gemini 2.5 Pro
- Reddit API Wrapper: PRAW
