Speak TTS Skill - User Feedback & Experience Report
Date: December 29, 2025
User: Testing with Claude Code AI Agent
Platform: macOS Sonoma (Apple Silicon)
Executive Summary
The Speak TTS skill is an impressive local text-to-speech solution with powerful features like voice cloning, emotion tags, and streaming playback. While the core functionality works excellently, the installation process revealed some areas that could benefit from better documentation and error handling.
Overall Rating: ⭐⭐⭐⭐ (4/5)
What Works Exceptionally Well
1. Voice Cloning ⭐⭐⭐⭐⭐
- Outstanding feature: The voice cloning capability is remarkably effective
- Successfully cloned voice from a YouTube Shorts video
- Quality of cloned voice is impressive with the Chatterbox TTS model
- Simple to use: just provide a WAV file path with the `--voice` flag
- Works seamlessly with streaming mode
2. Streaming Mode ⭐⭐⭐⭐⭐
- Adaptive buffering works flawlessly
- Audio starts playing while generation continues
- Perfect for long-form content
- The `--stream --play` combination works beautifully
- Real-time feedback shows buffer status clearly
3. Emotion Tags ⭐⭐⭐⭐⭐
- Adds natural expressiveness to speech
- Tags like `[laugh]`, `[chuckle]`, `[sigh]`, `[clear throat]` work perfectly
- Makes the TTS sound more human and engaging
- Easy to incorporate into text
4. Daemon Mode ⭐⭐⭐⭐⭐
- Significant performance improvement for repeated use
- Model stays loaded in memory
- Subsequent generations are nearly instant
- Perfect for AI agent use cases
5. On-Device Processing ⭐⭐⭐⭐⭐
- Complete privacy - no cloud APIs
- Fast generation on Apple Silicon
- No API costs or rate limits
- Offline capability
Installation & Setup Experience
Challenges Encountered
1. Python Version Requirements ⚠️
Issue: Documentation states "Python 3.10+" is required, but the setup script defaults to system Python (3.9.6 in our case)
What Happened:
- Initial setup ran successfully with Python 3.9.6
- All dependencies installed without errors
- However, audio generation would hang indefinitely with "Generating audio..." message
- After 60+ seconds, it would fail with "Error: No audio generated"
Solution Required:
- Manually installed Python 3.11 via Homebrew
- Created virtual environment with: `/opt/homebrew/bin/python3.11 -m venv ~/.chatter/env`
- Reinstalled all dependencies
- After this, everything worked perfectly
Recommendation:
- Setup script should check Python version and fail fast with clear error if < 3.10
- Or better: automatically use Homebrew Python if available and meets requirements
- Add troubleshooting section specifically for Python version issues
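The fail-fast check recommended above could look roughly like this. It is a minimal sketch that assumes the setup script can run a small Python snippet before installing dependencies; `check_python_version` and the message wording are illustrative and not taken from Speak's actual setup script.

```python
import sys

MIN_PYTHON = (3, 10)  # minimum version stated in the documentation


def check_python_version(version=None, minimum=MIN_PYTHON):
    """Return None if the interpreter is new enough, else an error message.

    `version` defaults to the running interpreter; it is a parameter only
    so the check can be exercised against other versions.
    """
    if version is None:
        version = sys.version_info[:3]
    # Compare (major, minor) tuples so that 3.9 sorts below 3.10.
    if tuple(version[:2]) >= tuple(minimum):
        return None
    found = ".".join(str(part) for part in version)
    needed = ".".join(str(part) for part in minimum)
    return f"Python {needed}+ is required, but found {found}."
```

A setup script could call `sys.exit(message)` whenever a message is returned, so the failure surfaces before any dependencies are installed rather than as a hang during generation.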
2. Bun Runtime Installation
Issue: Bun is required but not mentioned prominently enough in prerequisites
What Happened:
- Realized Bun was missing only after trying to run the skill
- Had to install it separately via: `curl -fsSL https://bun.sh/install | bash`
Recommendation:
- Make Bun requirement more prominent in documentation
- Consider adding auto-detection and installation prompt
- Or provide a setup script that handles Bun installation
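Auto-detection of missing prerequisites might be sketched as follows. The helper names and the `INSTALL_HINTS` table are hypothetical; the only install command included is the Bun one quoted in this report.

```python
import shutil

# Install commands to show the user. Only the Bun entry comes from this
# report; a real table would be maintained by the project.
INSTALL_HINTS = {"bun": "curl -fsSL https://bun.sh/install | bash"}


def find_missing_tools(tools, which=shutil.which):
    """Return the names in `tools` that cannot be found on PATH.

    `which` is injectable so the lookup can be replaced in tests.
    """
    return [tool for tool in tools if which(tool) is None]


def missing_tools_message(missing, hints=INSTALL_HINTS):
    """Build a short, actionable message listing missing prerequisites."""
    if not missing:
        return "All prerequisites found."
    lines = ["Missing prerequisites:"]
    for tool in missing:
        hint = hints.get(tool)
        lines.append(f"  {tool}: install with `{hint}`" if hint else f"  {tool}")
    return "\n".join(lines)
```

Running `find_missing_tools(["bun"])` at the start of setup would surface the missing runtime before the first attempt to run the skill.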
3. Global Command Setup
Issue: The speak command wasn't immediately available after installation
What Happened:
- Had to manually add shell alias to `~/.zshrc`
- Shell needed to be reloaded for alias to work
- For AI agent use, needed to use full path: `~/.bun/bin/bun run ~/.claude/skills/speak/src/index.ts`
Recommendation:
- Setup script could offer to add the alias automatically
- Provide a post-install message with exact commands to run
- Consider creating a symlink automatically with user permission
Feature-Specific Feedback
Audio Playback (--play flag)
Initial Issue: The --play flag didn't work initially during our testing session. Audio would generate successfully but wouldn't play automatically.
What Fixed It: After killing all processes and restarting the daemon, --play started working correctly.
Recommendation:
- Investigate potential race conditions or socket issues that might prevent playback
- Add better error messages if playback fails
- Consider fallback to `afplay` if primary playback method fails
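The fallback idea could be sketched like this. The report does not say how Speak plays audio internally, so `primary-player` in the usage note is a placeholder rather than a real command; `afplay` is the standard macOS command-line player.

```python
import subprocess


def play_with_fallback(path, players, run=subprocess.run):
    """Try each player command in order; return the name of the first success.

    `players` is a list of argument lists, e.g. [["afplay"]]. Returns None
    if every player fails or is not installed.
    """
    for player in players:
        try:
            result = run(player + [path], check=False)
        except OSError:
            # Covers a missing executable (FileNotFoundError) and similar.
            continue
        if result.returncode == 0:
            return player[0]
    return None
```

Calling `play_with_fallback("out.wav", [["primary-player"], ["afplay"]])` would only reach `afplay` when the first command is missing or exits with a non-zero status, and a `None` result gives the caller a clear point at which to print an error.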
Model Download Experience
Positive:
- First-time model download is transparent
- Progress indicators work well
- Cached models for subsequent use
Could Be Better:
- Initial download size isn't shown upfront
- No estimate of download time
- Consider pre-downloading default model during setup
Performance
Excellent:
- Generation speed is impressive (real-time factor, RTF: typically 0.30-0.35x)
- Streaming reduces perceived latency significantly
- Voice cloning doesn't noticeably slow generation
- Daemon mode makes subsequent calls nearly instant
Documentation Quality
Strengths ⭐⭐⭐⭐⭐
- SKILL.md is comprehensive and well-organized
- Pattern-based examples are extremely helpful
- Clear decision guidelines for agents
- Good coverage of all features
Could Be Improved
- Installation troubleshooting section needs expansion
- Python version requirements should be more prominent
- More emphasis on prerequisite installation order
- Common error messages and their solutions
AI Agent Integration Experience
What Works Great
- Skill integrates seamlessly with Claude Code
- Agent defaults are sensible (fp16, temp 0.5, speed 1.0)
- Easy to invoke from agent context
- Streaming + daemon mode perfect for conversational use
Suggestions
- Add example of complete agent integration flow
- Document best practices for long conversations
- Include sample hook configurations
- Add guidance on when to use daemon mode vs regular mode
Feature Requests & Ideas
High Priority
- Setup Health Check Enhancement
  - Add `--fix` flag to automatically resolve common issues
  - Detect Python version problems and suggest solutions
  - Check audio playback capability
- Error Messages
  - More descriptive error messages
  - Include suggested fixes in error output
  - Add troubleshooting URLs to errors
- Progress Indicators
  - Show model download progress with size/ETA
  - Better indication when model is loading vs generating
  - Playback status feedback
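One way to include suggested fixes in error output is a small lookup table keyed by known error text. This is a minimal sketch; `SUGGESTED_FIXES` and `with_suggestion` are hypothetical names, and the single entry reflects only the failure described in this report.

```python
# Hints keyed by a substring of the error text. The entry below comes from
# this report; a real table would be maintained by the project.
SUGGESTED_FIXES = {
    "No audio generated": (
        "Check that the virtual environment was created with Python 3.10 or newer."
    ),
}


def with_suggestion(error_text, fixes=SUGGESTED_FIXES):
    """Append a suggested fix to `error_text` when a known pattern matches."""
    for needle, hint in fixes.items():
        if needle in error_text:
            return f"{error_text}\nSuggested fix: {hint}"
    # Unknown errors are passed through unchanged.
    return error_text
```

Keeping the hints in one table makes it easy to grow the list as new failure modes are reported.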
Medium Priority
- Voice Library Management
  - Built-in voice preset manager
  - Ability to name and catalog custom voices
  - Preview voices before using them
- Batch Processing UI
  - Better progress tracking for multiple files
  - Estimate total time for batch jobs
  - Resume capability for interrupted batches
Nice to Have
- Configuration Presets
  - Named configuration profiles (e.g., "podcast", "audiobook", "quick-test")
  - Easy switching between presets
  - Share presets between users
- Audio Post-Processing
  - Optional normalization
  - Noise reduction for cloned voices
  - Export in multiple formats (MP3, OGG, etc.)
Specific Use Cases Tested
✅ Successfully Tested
- Basic text-to-speech with default voice
- Emotion tags integration
- Streaming long content
- Voice cloning from YouTube video
- Daemon mode for rapid iterations
- Markdown processing (stripping syntax)
- Multiple language support (tested with a Malayalam actor's voice)
❌ Not Tested
- Voice cloning with custom .wav files
- Code block handling options
- Clipboard input
- Preview mode
- Configuration file usage
- Different model quantizations (4bit, 5bit, 6bit)
Performance Metrics (Our Setup)
Hardware: MacBook with Apple Silicon
Python: 3.11.14
Model Used: chatterbox-turbo-8bit (default)
| Metric | Value | Notes |
|---|---|---|
| Average RTF | 0.30-0.35x | Excellent performance |
| First generation | ~3-5s | Including model load |
| With daemon | ~0.5s | Subsequent calls |
| Voice cloning RTF | 0.32x | Minimal overhead |
| Streaming latency | ~5s buffer | Well-tuned |
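For readers unfamiliar with the RTF figures in the table: real-time factor is conventionally the time spent generating divided by the duration of the audio produced, so values below 1.0 mean faster than real time. The sketch below assumes the report follows that convention; the function names are illustrative.

```python
def real_time_factor(generation_seconds, audio_seconds):
    """Return generation time divided by the duration of the generated audio."""
    if audio_seconds <= 0:
        raise ValueError("audio_seconds must be positive")
    return generation_seconds / audio_seconds


def estimated_generation_seconds(audio_seconds, rtf):
    """Estimate how long generating `audio_seconds` of speech takes at `rtf`."""
    return audio_seconds * rtf
```

Under that definition, an RTF of 0.30 means roughly 3 seconds of compute for 10 seconds of speech, and a one-minute clip at 0.35 would take about 21 seconds.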
Comparison with Alternatives
vs. Cloud TTS (Google, AWS, Azure)
Pros:
- ✅ Complete privacy - no data sent to cloud
- ✅ No API costs
- ✅ Offline capability
- ✅ Voice cloning included
- ✅ No rate limits
Cons:
- ⚠️ Requires Apple Silicon
- ⚠️ Initial setup complexity
- ⚠️ Larger disk space requirement
vs. Other Local TTS
Pros:
- ✅ Better voice quality than most local options
- ✅ Emotion tags are unique feature
- ✅ Excellent MLX optimization
- ✅ Active development
Security & Privacy Considerations
Excellent:
- No data leaves the machine
- Voice samples stored locally
- No telemetry or tracking
- Full control over generated audio
Note: When cloning voices from public figures (like we did), users should be aware of ethical considerations and potential legal implications for public use.
Recommendations for Creator
Critical (Fix Soon)
- Add Python version validation to setup script
- Improve error messages when audio generation fails
- Document Python 3.10+ requirement more prominently
- Add automatic retry logic for playback failures
Important (Next Release)
- Create one-line install script that handles all prerequisites
- Add health check with auto-fix capabilities
- Improve progress indicators for downloads
- Add voice sample quality checker
Nice to Have (Future)
- Web UI for voice management
- Built-in voice library with legal samples
- Export/import configuration profiles
- Integration with popular note-taking apps
Conclusion
The Speak TTS skill is a powerful and innovative tool that brings high-quality local text-to-speech to Apple Silicon Macs. The voice cloning feature is particularly impressive and sets it apart from many alternatives.
While the installation process has some rough edges (especially around Python version requirements), once properly set up, the tool works flawlessly. The streaming mode, daemon mode, and emotion tags make it perfect for AI agent integration and long-form content.
Would I recommend it? Absolutely yes, especially for:
- Privacy-conscious users
- AI agent developers
- Content creators needing voice variety
- Anyone wanting offline TTS capability
Main improvement needed: Smoother installation experience with better error handling and automatic prerequisite detection.
Testing Environment
- macOS Sonoma (Darwin 23.1.0)
- Apple Silicon
- Python 3.11.14
- Bun 1.3.5
- yt-dlp 2025.12.8
Thank you for creating this excellent tool! 🎉