Text-Summarizer-Project is a Natural Language Processing (NLP) application that automatically generates concise summaries from large blocks of text. As the volume of information online and in documents keeps growing, it offers an efficient way to extract key insights from text quickly and accurately.
The system can summarize articles, research papers, news reports, and other lengthy documents. It combines extractive and abstractive summarization methods to produce concise and meaningful summaries.
Key features include:
- Text Preprocessing: Cleans input text by removing noise, punctuation, and stopwords.
- Sentence Extraction: Identifies key sentences representing main ideas.
- Semantic Understanding: Leverages semantic analysis to comprehend meaning and relevance.
- Summarization Techniques: Supports both extractive and abstractive summarization.
- Length Control: Users can adjust summary length (short or comprehensive).
- User Interface: Simple interface for text input and summary output.
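The preprocessing and sentence-extraction features listed above can be illustrated with a minimal frequency-based sketch. This is not the project's implementation: the stopword list, the function names, and the scoring rule (average word frequency per sentence) are all assumptions chosen to keep the example short.

```python
import re
from collections import Counter

# Tiny illustrative stopword list (assumption); a real pipeline would use a fuller one.
STOPWORDS = {"a", "an", "the", "is", "are", "was", "of", "to", "and",
             "in", "it", "on", "for", "that", "this", "with"}


def split_sentences(text):
    """Split text after sentence-ending punctuation followed by whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s.strip()]


def tokenize(sentence):
    """Lowercase the sentence, drop punctuation, and remove stopwords."""
    return [w for w in re.findall(r"[a-z0-9']+", sentence.lower()) if w not in STOPWORDS]


def extractive_summary(text, max_sentences=2):
    """Keep the highest-scoring sentences, returned in their original order."""
    sentences = split_sentences(text)
    freqs = Counter(w for s in sentences for w in tokenize(s))

    def score(sentence):
        words = tokenize(sentence)
        return sum(freqs[w] for w in words) / len(words) if words else 0.0

    ranked = sorted(range(len(sentences)), key=lambda i: score(sentences[i]), reverse=True)
    keep = sorted(ranked[:max_sentences])
    return " ".join(sentences[i] for i in keep)
```

Raising or lowering `max_sentences` is one simple way to expose length control to the user.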
Benefits:
- Time-saving by quickly condensing long texts.
- Helps researchers, students, and professionals extract key insights.
- Useful for journalists, content creators, and language learners.
- Can be integrated into search engines or knowledge management systems.
SAMSum Dataset (Hugging Face Link)
- 16k messenger-like conversations with human-written summaries.
- Covers dialogues between 2+ speakers, varying in style (informal, semi-formal, formal) with slang, emoticons, and typos.
- Training/Validation/Test split:
  - Train: 14,732
  - Validation: 818
  - Test: 819
Example Instance:

    {
      "id": "13818513",
      "summary": "Amanda baked cookies and will bring Jerry some tomorrow.",
      "dialogue": "Amanda: I baked cookies. Do you want some?\r\nJerry: Sure!\r\nAmanda: I'll bring you tomorrow :-)"
    }

- dialogue: Text of conversation
- summary: Human-written concise summary
- id: Unique identifier
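As a rough illustration of how these fields might be consumed, the sketch below parses the example instance and splits the `dialogue` field into speaker turns. The helper name `parse_turns` is hypothetical, and the assumption that every line has the form `Speaker: utterance` is taken from this single example only.

```python
import json

# The example instance from above, kept as raw JSON so the \r\n escapes are decoded by json.loads.
RAW_INSTANCE = r'''
{
  "id": "13818513",
  "summary": "Amanda baked cookies and will bring Jerry some tomorrow.",
  "dialogue": "Amanda: I baked cookies. Do you want some?\r\nJerry: Sure!\r\nAmanda: I'll bring you tomorrow :-)"
}
'''


def parse_turns(dialogue):
    """Split a dialogue string into (speaker, utterance) pairs, one per line."""
    turns = []
    for line in dialogue.splitlines():  # splitlines() handles the \r\n separators
        if not line.strip():
            continue
        speaker, _, utterance = line.partition(": ")
        turns.append((speaker, utterance))
    return turns


instance = json.loads(RAW_INSTANCE)
turns = parse_turns(instance["dialogue"])
speakers = sorted({speaker for speaker, _ in turns})
```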
PEGASUS (Google AI) – A state-of-the-art transformer-based model for abstractive summarization.
- Transformer-based neural network
- Pre-trained on large text corpora (C4 and HugeNews) with a gap-sentence generation objective
- Generates fluent and informative summaries
- Reported state-of-the-art ROUGE scores on 12 summarization benchmarks when it was published
- Initial fine-tuning ran for only 1 epoch because of limited computing power
- The resulting accuracy was low; further training iterations are planned to improve performance
- Preprocessing, cleaning, and noise removal
- Extractive and abstractive summarization techniques
- Semantic sentence ranking and selection
- Adjustable summary length
- Robust MLOps framework using MLflow and DVC
- Deployment-ready FastAPI service with Docker and AWS integration
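To make the abstractive path and the adjustable summary length concrete, here is a hedged sketch using the Hugging Face Transformers API. Several things are assumptions: `google/pegasus-cnn_dailymail` is a public checkpoint used as a stand-in because the location of this project's fine-tuned model is not stated here, and the token limits in `LENGTH_PRESETS` are illustrative values, not the project's settings.

```python
# Hypothetical length presets (assumption): generated-token limits per summary style.
LENGTH_PRESETS = {
    "short": {"max_length": 32, "min_length": 8},
    "comprehensive": {"max_length": 128, "min_length": 32},
}


def generation_kwargs(length="short", num_beams=4):
    """Translate a user-facing length choice into keyword arguments for generate()."""
    if length not in LENGTH_PRESETS:
        raise ValueError(f"unknown length {length!r}; expected one of {sorted(LENGTH_PRESETS)}")
    return {**LENGTH_PRESETS[length], "num_beams": num_beams, "length_penalty": 0.8}


def summarize(text, length="short", checkpoint="google/pegasus-cnn_dailymail"):
    """Summarize text with a PEGASUS checkpoint (downloads the model on first use)."""
    # Imported lazily so the helper above works without transformers installed.
    from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(checkpoint)
    model = AutoModelForSeq2SeqLM.from_pretrained(checkpoint)
    inputs = tokenizer(text, truncation=True, max_length=1024, return_tensors="pt")
    output_ids = model.generate(**inputs, **generation_kwargs(length))
    return tokenizer.decode(output_ids[0], skip_special_tokens=True)
```

Calling `summarize(dialogue, length="comprehensive")` requires `transformers`, `torch`, and `sentencepiece` to be installed.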
| Metric | Score |
|---|---|
| ROUGE-L | 44.1 |
| ROUGE-2 | 24.5 |
| Baseline Δ | +2 |
Outperforms standard baselines and demonstrates the effectiveness of hybrid PEGASUS-based summarization.
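For readers unfamiliar with the metrics in the table, the following is a simplified sketch of how ROUGE-2 and ROUGE-L F1 scores can be computed for one reference/candidate pair. It assumes whitespace tokenization with no stemming, so its numbers will not exactly match those of a standard ROUGE package, and it is not the evaluation code behind the scores reported above. Values are in the range 0 to 1; reported scores are typically multiplied by 100.

```python
from collections import Counter


def _tokens(text):
    """Lowercase whitespace tokenization (a simplification)."""
    return text.lower().split()


def _f1(overlap, cand_total, ref_total):
    """Harmonic mean of precision and recall; 0.0 when there is no overlap."""
    if overlap == 0 or cand_total == 0 or ref_total == 0:
        return 0.0
    precision = overlap / cand_total
    recall = overlap / ref_total
    return 2 * precision * recall / (precision + recall)


def rouge_n(reference, candidate, n=2):
    """F1 over clipped n-gram overlap between reference and candidate."""
    def ngrams(tokens):
        return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

    ref, cand = ngrams(_tokens(reference)), ngrams(_tokens(candidate))
    overlap = sum((ref & cand).values())
    return _f1(overlap, sum(cand.values()), sum(ref.values()))


def lcs_length(a, b):
    """Length of the longest common subsequence, via dynamic programming."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            curr.append(prev[j - 1] + 1 if x == y else max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l(reference, candidate):
    """F1 based on the longest common subsequence of tokens."""
    ref, cand = _tokens(reference), _tokens(candidate)
    return _f1(lcs_length(ref, cand), len(cand), len(ref))
```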
- ML/DL: Hugging Face Transformers, PEGASUS
- MLOps: MLflow, DVC, Docker
- Backend/Deployment: FastAPI, AWS EC2, S3, ECR
- CI/CD: GitHub Actions
- Clone the repository:

      git clone https://github.com/praj2408/Text-Summarizer-Project.git
      cd Text-Summarizer-Project

- Create a conda environment:

      conda create -n summary python==3.8 -y
      conda activate summary

- Install dependencies:

      pip install -r requirements.txt

- Run the FastAPI app locally:

      python app.py

- Open your browser at http://localhost:8000 (or the specified port) to interact with the service.
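If the browser cannot reach the service, a quick check of whether anything is listening on the port can help narrow things down. The sketch below uses only the standard library; the function name `is_service_up` is hypothetical, and port 8000 is the default mentioned above, which may differ in your setup.

```python
import socket


def is_service_up(host="localhost", port=8000, timeout=2.0):
    """Return True if a TCP connection to host:port succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers connection refused, timeouts, and name-resolution failures.
        return False


# Example usage after starting the app with `python app.py`:
# print(is_service_up())
```

A successful connection only shows that the port is open; it does not confirm that the summarization endpoint itself is healthy.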
- Create an IAM user with:
  - EC2 access
  - ECR (Elastic Container Registry) access
- Assign policies:
  - AmazonEC2FullAccess
  - AmazonEC2ContainerRegistryFullAccess
- Create an ECR repository.
- Build Docker image of the app:

      docker build -t text-summarizer .

- Push Docker image to ECR.
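The push step usually involves logging in to the registry, tagging the local image with the repository URI, and pushing it. The sketch below only assembles those shell commands as strings so they can be reviewed before running. The function name, the `latest` tag, and the example registry URI and region in the usage comment are assumptions; substitute the values for your own AWS account.

```python
def ecr_push_commands(registry_uri, repository, region,
                      local_image="text-summarizer", tag="latest"):
    """Return the shell commands that log in to ECR, tag the local image, and push it."""
    remote = f"{registry_uri}/{repository}:{tag}"
    return [
        # Authenticate the Docker CLI against the ECR registry.
        f"aws ecr get-login-password --region {region} | "
        f"docker login --username AWS --password-stdin {registry_uri}",
        # Point the locally built image at the remote repository.
        f"docker tag {local_image}:latest {remote}",
        f"docker push {remote}",
    ]


# Example with placeholder values (not a real account):
# for cmd in ecr_push_commands("123456789012.dkr.ecr.us-east-1.amazonaws.com",
#                              "text-summarizer", "us-east-1"):
#     print(cmd)
```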
- Launch an EC2 instance.
- Install Docker on EC2:

      curl -fsSL https://get.docker.com -o get-docker.sh
      sudo sh get-docker.sh
      sudo usermod -aG docker ubuntu
      newgrp docker

- Pull the Docker image.
- Launch the Docker container on EC2.
- Configure secrets:
  - AWS_ACCESS_KEY_ID
  - AWS_SECRET_ACCESS_KEY
  - AWS_REGION
  - AWS_ECR_LOGIN_URI
  - ECR_REPOSITORY_NAME
- Automate deployment with CI/CD workflow.
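A missing secret is a common cause of failed deployment runs, so a small pre-flight check can be useful when testing the pipeline locally. This is a sketch under the assumption that the secrets are exposed to the process as environment variables with the same names; the helper name `missing_secrets` is hypothetical and not part of the repository.

```python
import os

# Secret names as listed in the deployment steps above.
REQUIRED_SECRETS = (
    "AWS_ACCESS_KEY_ID",
    "AWS_SECRET_ACCESS_KEY",
    "AWS_REGION",
    "AWS_ECR_LOGIN_URI",
    "ECR_REPOSITORY_NAME",
)


def missing_secrets(env=None):
    """Return the names of required secrets that are unset or empty in env."""
    env = os.environ if env is None else env
    return [name for name in REQUIRED_SECRETS if not env.get(name)]


# Example usage:
# problems = missing_secrets()
# if problems:
#     raise SystemExit(f"Missing secrets: {', '.join(problems)}")
```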
Contributions are welcome! Feel free to fork the repository, raise issues, and submit pull requests.
This project is licensed under the MIT License.
