A production-grade, serverless Telegram bot for fast and accurate speech-to-text transcription. Built on AWS Lambda and powered by the Soniox API, this bot listens for voice, video, or video note messages and returns a transcription, making it a powerful tool for converting spoken audio to text on the fly.
- Architecture
- Key Features
- Deployment Guide
- Local Transcription Utility
- Project Structure
- Logging and Debugging
- Contributing
- License
The bot operates on a serverless, event-driven architecture. A user's message to the Telegram bot triggers a webhook to AWS API Gateway, which in turn invokes the core AWS Lambda function. The function handles all business logic: fetching the media file from Telegram, orchestrating the transcription with the Soniox API, and sending the result back to the user.
graph TD
subgraph user_device [User's Device]
User
end
subgraph telegram [Telegram]
TelegramBot[Telegram Bot API]
end
subgraph aws [AWS]
APIGW[API Gateway]
Lambda[AWS Lambda Function]
CloudWatch[CloudWatch Logs]
end
subgraph soniox [Soniox]
SonioxAPI[Soniox Speech-to-Text API]
end
User -- "1 Sends voice/video message" --> TelegramBot;
TelegramBot -- "2 Webhook POST request" --> APIGW;
APIGW -- "3 Triggers function" --> Lambda;
Lambda -- "4 Fetches media file" --> TelegramBot;
Lambda -- "5 Uploads media for transcription" --> SonioxAPI;
SonioxAPI -- "6 Processes audio" --> SonioxAPI;
Lambda -- "7 Polls for results" --> SonioxAPI;
SonioxAPI -- "8 Returns transcript" --> Lambda;
Lambda -- "9 Sends transcript back" --> TelegramBot;
TelegramBot -- "10 Delivers message" --> User;
Lambda -- "Logs execution" --> CloudWatch;
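Steps 5 through 8 of the diagram (upload, process, poll, return) can be sketched as a small polling loop. This is a minimal illustration, not the code in lambda_function.py: the function name `poll_until_done`, the status strings `"completed"` and `"error"`, and the three callables are all hypothetical stand-ins for whatever the Soniox client code actually exposes.

```python
import time
from typing import Callable


def poll_until_done(
    get_status: Callable[[], str],
    fetch_transcript: Callable[[], str],
    cleanup: Callable[[], None],
    interval_s: float = 1.0,
    timeout_s: float = 25.0,
    sleep: Callable[[float], None] = time.sleep,
) -> str:
    """Poll a transcription job until it finishes, then always clean up.

    The status values checked here are assumptions made for this sketch.
    """
    deadline = time.monotonic() + timeout_s
    try:
        while True:
            status = get_status()
            if status == "completed":
                return fetch_transcript()
            if status == "error":
                raise RuntimeError("transcription failed")
            if time.monotonic() >= deadline:
                raise TimeoutError("transcription did not finish in time")
            sleep(interval_s)
    finally:
        # Runs on success, failure, and timeout, so remote files and jobs
        # are not left behind.
        cleanup()
```

Putting the cleanup in a `finally` block is one way to get the "always delete temporary resources" behavior, and keeping `timeout_s` below the Lambda timeout leaves room to notify the user before the function is stopped.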
- High-Accuracy Transcription: Leverages the Soniox API for state-of-the-art, low-latency speech recognition.
- Multi-Format Support: Transcribes Telegram voice messages (audio/ogg), video messages (video/mp4), and video notes (video/mp4).
- Multi-Language Hints: Improves accuracy by providing language hints to the transcription model (defaults: ru,uk,es,en).
- Serverless & Scalable: Built on AWS Lambda for cost-efficiency (pay-per-use) and automatic scaling to handle any workload.
- Secure & Private: Implements a user ALLOW_LIST to ensure only authorized Telegram usernames can interact with the bot.
- Asynchronous & Resilient: Handles transcription requests asynchronously, polling for results without blocking. It includes robust error handling and notifies the user of failures.
- Automated Resource Cleanup: Ensures all temporary media files and transcription jobs are deleted from the Soniox service after processing, minimizing storage footprint and cost.
- Local Transcription Utility: Comes with transcribe_local.py, a helper script to run the same transcription logic on local files.
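The allow-list and multi-format features above might look roughly like the following. This is a sketch under stated assumptions: the helper names (`parse_allow_list`, `is_authorized`, `pick_media`) are hypothetical, and the case-insensitive username comparison is a choice made here, not something confirmed by the repository. The `from.username`, `voice`, `video`, `video_note`, and `file_id` fields are those of a Telegram Bot API message.

```python
from typing import Optional, Set, Tuple

# Message fields that carry transcribable media, in the order checked.
MEDIA_KEYS = ("voice", "video", "video_note")


def parse_allow_list(raw: str) -> Set[str]:
    """Turn 'user1,user2' into a set of normalized usernames."""
    return {
        name.strip().lstrip("@").lower()
        for name in raw.split(",")
        if name.strip()
    }


def is_authorized(message: dict, allowed: Set[str]) -> bool:
    """Return True if the sender's username is in the allow list."""
    username = (message.get("from") or {}).get("username") or ""
    return username.lower() in allowed


def pick_media(message: dict) -> Optional[Tuple[str, str]]:
    """Return (kind, file_id) for the first supported media field, if any."""
    for key in MEDIA_KEYS:
        media = message.get(key)
        if media and "file_id" in media:
            return key, media["file_id"]
    return None
```

A sender without a username is rejected by this sketch, because the empty string is never placed in the allowed set.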
Follow these steps to deploy your own instance of the transcription bot.
- An AWS Account with permissions to create Lambda, API Gateway, and IAM resources.
- Python 3.8 or newer installed locally.
- A Telegram Account.
- Open Telegram and start a chat with @BotFather.
- Send the /newbot command and follow the prompts to choose a name and username for your bot.
- BotFather will provide you with a unique token. Save this token; it is your TELEGRAM_TOKEN.
- Sign up for an account at soniox.com.
- Navigate to your account settings or API section to generate an API key.
- Save this key; it is your SONIOX_TOKEN.
The Lambda function is configured via environment variables.
| Variable | Description | Required |
|---|---|---|
| TELEGRAM_TOKEN | The token for your Telegram bot from Step 1. | ✓ |
| SONIOX_TOKEN | The API key for the Soniox service from Step 2. | ✓ |
| ALLOW_LIST | A comma-separated list of Telegram usernames authorized to use the bot (e.g., user1,user2,user3). | ✓ |
Security Best Practice: For a production environment, it is highly recommended to store secrets like API tokens in AWS Secrets Manager or Parameter Store and grant the Lambda function's IAM role permission to access them.
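As an illustration of how the three variables might be read at startup, here is a minimal sketch that fails fast when one is missing. The `BotConfig` type and `load_config` function are hypothetical names introduced for this example; the actual lambda_function.py may read its configuration differently.

```python
import os
from typing import FrozenSet, Mapping, NamedTuple

REQUIRED_VARS = ("TELEGRAM_TOKEN", "SONIOX_TOKEN", "ALLOW_LIST")


class BotConfig(NamedTuple):
    telegram_token: str
    soniox_token: str
    allow_list: FrozenSet[str]


def load_config(env: Mapping[str, str] = os.environ) -> BotConfig:
    """Read the required settings, raising one clear error if any are missing."""
    missing = [name for name in REQUIRED_VARS if not env.get(name, "").strip()]
    if missing:
        raise RuntimeError(
            "Missing required environment variables: " + ", ".join(missing)
        )
    usernames = frozenset(
        part.strip() for part in env["ALLOW_LIST"].split(",") if part.strip()
    )
    return BotConfig(
        telegram_token=env["TELEGRAM_TOKEN"].strip(),
        soniox_token=env["SONIOX_TOKEN"].strip(),
        allow_list=usernames,
    )
```

Accepting the environment as a parameter keeps the function easy to test, and the same shape would still work if the values were later fetched from Secrets Manager or Parameter Store and passed in as a dictionary.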
- Create Lambda Function:
  - Open the Lambda service in the AWS console and click Create function.
  - Select Author from scratch.
  - Function name: telegram-transcription-bot
  - Runtime: Python 3.9 (or newer)
  - Architecture: arm64 for lower cost
  - Unfold Additional configurations and enable Function URL.
  - Auth type: None
  - Click Create function.
- Upload Code:
  - In the Code source section, click lambda_function.py and paste in the full content of the lambda_function.py file from this repository.
  - Click the Deploy button.
- Configure Settings:
  - Go to the Configuration tab.
  - In Environment variables, add the three variables from Step 3.
  - In General configuration, you may want to increase the Timeout to 30 seconds to accommodate longer transcriptions.
  - In the Function URL section, copy the URL and save it for the next step.
Tell Telegram where to send message events. Replace <YOUR_TOKEN> with your TELEGRAM_TOKEN and <YOUR_API_GATEWAY_URL> with the Function URL you copied in the previous step. Open the resulting URL in your browser, or request it from a terminal (for example, with curl):
https://api.telegram.org/bot<YOUR_TOKEN>/setWebhook?url=<YOUR_API_GATEWAY_URL>
You should see a {"ok":true,"result":true,"description":"Webhook was set"} response. Your bot is now live!
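If the endpoint contains characters that need escaping, it can be safer to build the setWebhook request programmatically. The following is a small sketch using only the Python standard library; the function names are hypothetical, while `setWebhook` and its `url` parameter are part of the Telegram Bot API.

```python
import json
import urllib.parse
import urllib.request


def build_set_webhook_url(token: str, endpoint: str) -> str:
    """Build the setWebhook request URL with the endpoint percent-encoded."""
    query = urllib.parse.urlencode({"url": endpoint})
    return f"https://api.telegram.org/bot{token}/setWebhook?{query}"


def set_webhook(token: str, endpoint: str) -> dict:
    """Call setWebhook and return Telegram's decoded JSON response."""
    request_url = build_set_webhook_url(token, endpoint)
    with urllib.request.urlopen(request_url, timeout=10) as response:
        return json.load(response)


# Example (performs a real network call, so it is left commented out):
# print(set_webhook("<YOUR_TOKEN>", "<YOUR_API_GATEWAY_URL>"))
```

The Bot API's getWebhookInfo method can be called the same way to confirm which URL is currently registered.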
- Send a text message to your bot. You should receive a "You sent text." response.
- Send a voice, video, or video note message to your bot. You should see the transcription in the chat.
The repository includes transcribe_local.py, a command-line script for transcribing local audio or video files using the same core transcribe() function.
The script requires the SONIOX_TOKEN environment variable to be set.
# Set the token
export SONIOX_TOKEN="<YOUR_SONIOX_TOKEN>"
# Run transcription
python3 transcribe_local.py path/to/your/file.mp3 -o output/transcript.txt
- The input file is the first argument.
- The -o or --output flag is optional and specifies where to save the transcript. If omitted, it saves to <input_file>.txt.
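The command-line interface described above could be parsed along these lines. This is a hedged sketch, not the contents of transcribe_local.py: the `parse_args` helper is hypothetical, and it assumes that "<input_file>.txt" means ".txt" appended to the full input path (so file.mp3 becomes file.mp3.txt), which the original script may interpret differently.

```python
import argparse
from pathlib import Path
from typing import List, Optional


def parse_args(argv: Optional[List[str]] = None) -> argparse.Namespace:
    """Parse the input path and the optional -o/--output path."""
    parser = argparse.ArgumentParser(
        description="Transcribe a local audio or video file."
    )
    parser.add_argument("input_file", type=Path, help="audio or video file to transcribe")
    parser.add_argument(
        "-o",
        "--output",
        type=Path,
        default=None,
        help="where to save the transcript (default: <input_file>.txt)",
    )
    args = parser.parse_args(argv)
    if args.output is None:
        # Assumption: the default output is the input path with ".txt" appended.
        args.output = Path(str(args.input_file) + ".txt")
    return args
```

For example, `parse_args(["talk.mp3"])` yields an output path of `talk.mp3.txt` under that assumption, while an explicit `-o` value is used unchanged.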
All Lambda function logs are sent to AWS CloudWatch. You can monitor executions, trace errors, and view print/logging statements in the log group associated with your Lambda function. Key log events include:
- Receiving an event.
- User authorization status.
- Transcription progress and cleanup actions.
- Errors during any stage of the process.
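One way to make these events easy to search in CloudWatch is to emit them as consistent key=value lines. The sketch below uses Python's standard logging module; the logger name, the `log_stage` helper, and the stage names are hypothetical and are not taken from lambda_function.py.

```python
import logging

# In the Lambda Python runtime the root logger already has a handler
# attached; this logger only needs its level lowered to INFO.
logger = logging.getLogger("transcription_bot")
logger.setLevel(logging.INFO)


def log_stage(stage: str, **fields) -> str:
    """Log one pipeline stage as a single key=value line and return it."""
    details = " ".join(f"{key}={value}" for key, value in sorted(fields.items()))
    line = f"stage={stage} {details}".strip()
    logger.info(line)
    return line


# Example calls mirroring the key log events listed above:
# log_stage("event_received")
# log_stage("authorization", user="user1", allowed=True)
# log_stage("cleanup", deleted_files=1)
```

Sorting the fields keeps the output stable between runs, which helps when filtering log streams or writing CloudWatch Logs Insights queries against them.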