User Experience Feedback & Improvement Suggestions #12

@ruzin

Speak TTS Skill - User Feedback & Experience Report

Date: December 29, 2025
User: Testing with Claude Code AI Agent
Platform: macOS Sonoma (Apple Silicon)


Executive Summary

The Speak TTS skill is an impressive local text-to-speech solution with powerful features like voice cloning, emotion tags, and streaming playback. While the core functionality works excellently, the installation process revealed some areas that could benefit from better documentation and error handling.

Overall Rating: ⭐⭐⭐⭐ (4/5)


What Works Exceptionally Well

1. Voice Cloning ⭐⭐⭐⭐⭐

  • Outstanding feature: The voice cloning capability is remarkably effective
  • Successfully cloned voice from a YouTube Shorts video
  • Quality of cloned voice is impressive with the Chatterbox TTS model
  • Simple to use: just provide a WAV file path with the --voice flag
  • Works seamlessly with streaming mode

2. Streaming Mode ⭐⭐⭐⭐⭐

  • Adaptive buffering works flawlessly
  • Audio starts playing while generation continues
  • Perfect for long-form content
  • The --stream --play combination works beautifully
  • Real-time feedback shows buffer status clearly

3. Emotion Tags ⭐⭐⭐⭐⭐

  • Adds natural expressiveness to speech
  • Tags like [laugh], [chuckle], [sigh], [clear throat] work perfectly
  • Makes the TTS sound more human and engaging
  • Easy to incorporate into text

4. Daemon Mode ⭐⭐⭐⭐⭐

  • Significant performance improvement for repeated use
  • Model stays loaded in memory
  • Subsequent generations are nearly instant
  • Perfect for AI agent use cases

5. On-Device Processing ⭐⭐⭐⭐⭐

  • Complete privacy - no cloud APIs
  • Fast generation on Apple Silicon
  • No API costs or rate limits
  • Offline capability

Installation & Setup Experience

Challenges Encountered

1. Python Version Requirements ⚠️

Issue: Documentation states that "Python 3.10+" is required, but the setup script defaults to the system Python (3.9.6 in our case).

What Happened:

  • Initial setup ran successfully with Python 3.9.6
  • All dependencies installed without errors
  • However, audio generation would hang at the "Generating audio..." message
  • After 60+ seconds, it would fail with "Error: No audio generated"

Solution Required:

  • Manually installed Python 3.11 via Homebrew
  • Created virtual environment with: /opt/homebrew/bin/python3.11 -m venv ~/.chatter/env
  • Reinstalled all dependencies
  • After this, everything worked perfectly

Recommendation:

  • Setup script should check the Python version and fail fast with a clear error if it is below 3.10
  • Or better: automatically use Homebrew Python if it is available and meets the requirement
  • Add troubleshooting section specifically for Python version issues
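
A minimal sketch of the fail-fast check recommended above, assuming the setup script can run a small Python step before installing dependencies. The function names are hypothetical and not part of the skill; only the 3.10 minimum comes from the documentation. The code deliberately avoids newer syntax so that it still runs on the old interpreter it is meant to reject.

```python
import sys

MIN_PYTHON = (3, 10)  # from the documented "Python 3.10+" requirement


def python_version_error(version_info=None, minimum=MIN_PYTHON):
    """Return an error message if the interpreter is too old, else None."""
    if version_info is None:
        version_info = sys.version_info
    found = tuple(version_info[:2])
    if found >= minimum:
        return None
    return (
        "Python {}.{}+ is required, but this interpreter is {}.{}. "
        "Install a newer Python (for example via Homebrew) and recreate "
        "the virtual environment.".format(minimum[0], minimum[1], found[0], found[1])
    )


def require_supported_python():
    """Exit with a clear message before any dependencies are installed."""
    message = python_version_error()
    if message is not None:
        sys.exit("Error: " + message)
```

Calling require_supported_python() as the first step of setup would have stopped our 3.9.6 run immediately instead of letting it fail later during audio generation.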

2. Bun Runtime Installation

Issue: Bun is required but not mentioned prominently enough in prerequisites

What Happened:

  • Realized Bun was missing only after trying to run the skill
  • Had to install it separately via: curl -fsSL https://bun.sh/install | bash

Recommendation:

  • Make Bun requirement more prominent in documentation
  • Consider adding auto-detection and installation prompt
  • Or provide a setup script that handles Bun installation
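
One possible shape for the auto-detection suggested above, sketched with the standard library only. The helper names and the hint table are hypothetical; the Bun install command is the one we used. The lookup function is a parameter so the logic can be exercised without touching the real PATH.

```python
import shutil

# Install hints keyed by executable name. The Bun command is the one used in
# this report; other prerequisites would be added to the mapping as needed.
INSTALL_HINTS = {
    "bun": "curl -fsSL https://bun.sh/install | bash",
}


def missing_prerequisites(commands, which=shutil.which):
    """Return the commands from `commands` that are not found on PATH."""
    return [name for name in commands if which(name) is None]


def prerequisite_report(commands, which=shutil.which):
    """Build one human-readable line per missing prerequisite."""
    lines = []
    for name in missing_prerequisites(commands, which):
        hint = INSTALL_HINTS.get(name, "see the project documentation")
        lines.append("Missing prerequisite '{}'. Install with: {}".format(name, hint))
    return lines
```

Printing the result of prerequisite_report(["bun"]) at the start of setup would surface the missing runtime before the first attempt to run the skill.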

3. Global Command Setup

Issue: The speak command wasn't immediately available after installation

What Happened:

  • Had to manually add shell alias to ~/.zshrc
  • Shell needed to be reloaded for alias to work
  • For AI agent use, needed to use full path: ~/.bun/bin/bun run ~/.claude/skills/speak/src/index.ts

Recommendation:

  • Setup script could offer to add the alias automatically
  • Provide a post-install message with exact commands to run
  • Consider creating a symlink automatically with user permission
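
A sketch of how a setup step might add the alias without duplicating it on repeated runs. The alias text is an assumption built from the full command quoted above, and the function names are hypothetical; a real setup script should ask for permission before editing the rc file.

```python
from pathlib import Path

# Assumed alias, built from the full command quoted in this report.
SPEAK_ALIAS = "alias speak='~/.bun/bin/bun run ~/.claude/skills/speak/src/index.ts'"


def ensure_line(rc_path, line=SPEAK_ALIAS):
    """Append `line` to the shell rc file unless it is already present.

    Returns True if the file was modified, False if nothing changed.
    """
    rc_path = Path(rc_path).expanduser()
    existing = rc_path.read_text() if rc_path.exists() else ""
    if line in existing.splitlines():
        return False
    # Start on a fresh line if the file does not end with a newline.
    separator = "" if existing == "" or existing.endswith("\n") else "\n"
    with rc_path.open("a") as handle:
        handle.write(separator + line + "\n")
    return True


def post_install_message(rc_path="~/.zshrc"):
    """Tell the user exactly what to run after the alias is added."""
    return "Alias added. Run 'source {}' or open a new terminal to use 'speak'.".format(rc_path)
```

For example, ensure_line("~/.zshrc") followed by print(post_install_message()) covers both the alias and the shell reload step that we had to work out by hand.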

Feature-Specific Feedback

Audio Playback (--play flag)

Initial Issue: The --play flag did not work at first during our testing session. Audio was generated successfully but did not play automatically.

What Fixed It: After killing all processes and restarting the daemon, --play started working correctly.

Recommendation:

  • Investigate potential race conditions or socket issues that might prevent playback
  • Add better error messages if playback fails
  • Consider fallback to afplay if primary playback method fails
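
The fallback idea could look roughly like the sketch below. We do not know which playback method the skill uses first, so the list of players is a parameter; afplay is the macOS command-line player named above. The function name is hypothetical, and the runner is injectable so the logic can be checked without producing sound.

```python
import subprocess


def play_with_fallback(audio_path, players, run=subprocess.run):
    """Try each player command in order; return the name of the one that worked.

    `players` is a list of argument lists, e.g. [["afplay"]]. The audio path is
    appended to each. Raises RuntimeError listing every failure if none succeed.
    """
    failures = []
    for command in players:
        try:
            result = run(command + [audio_path], check=False)
        except OSError as error:  # player binary missing or not executable
            failures.append("{}: {}".format(command[0], error))
            continue
        if result.returncode == 0:
            return command[0]
        failures.append("{}: exit code {}".format(command[0], result.returncode))
    raise RuntimeError(
        "Playback failed for {} ({})".format(audio_path, "; ".join(failures))
    )
```

Because the raised error lists what each player reported, a silent failure like the one we saw would at least leave a message to act on.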

Model Download Experience

Positive:

  • First-time model download is transparent
  • Progress indicators work well
  • Models are cached for subsequent use

Could Be Better:

  • Initial download size isn't shown upfront
  • No estimate of download time
  • Consider pre-downloading default model during setup
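
To illustrate the size and time estimate we were missing, here is a small sketch that turns byte counts and elapsed time into a status line. It assumes the downloader already knows the total size; the function names are hypothetical and the numbers in the usage note are made up, not the actual model size.

```python
def format_size(num_bytes):
    """Render a byte count using binary units, e.g. 1536 -> '1.5 KiB'."""
    size = float(num_bytes)
    for unit in ("B", "KiB", "MiB", "GiB"):
        if size < 1024 or unit == "GiB":
            return "{:.1f} {}".format(size, unit)
        size /= 1024


def download_status(done_bytes, total_bytes, elapsed_seconds):
    """Return a one-line progress string with percentage and estimated time left."""
    percent = 100.0 * done_bytes / total_bytes
    if done_bytes == 0 or elapsed_seconds <= 0:
        eta = "estimating"
    else:
        rate = done_bytes / elapsed_seconds  # average bytes per second so far
        eta = "{:.0f}s left".format((total_bytes - done_bytes) / rate)
    return "{} of {} ({:.0f}%), {}".format(
        format_size(done_bytes), format_size(total_bytes), percent, eta
    )
```

With a hypothetical 1 GiB download that is a quarter done after 10 seconds, download_status(256 * 1024**2, 1024**3, 10.0) yields "256.0 MiB of 1.0 GiB (25%), 30s left".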

Performance

Excellent:

  • Generation speed is impressive (real-time factor, RTF, typically 0.30-0.35x; lower is faster)
  • Streaming reduces perceived latency significantly
  • Voice cloning doesn't noticeably slow generation
  • Daemon mode makes subsequent calls nearly instant

Documentation Quality

Strengths ⭐⭐⭐⭐⭐

  • SKILL.md is comprehensive and well-organized
  • Pattern-based examples are extremely helpful
  • Clear decision guidelines for agents
  • Good coverage of all features

Could Be Improved

  • Installation troubleshooting section needs expansion
  • Python version requirements should be more prominent
  • The order in which prerequisites must be installed needs more emphasis
  • Common error messages and their solutions should be documented

AI Agent Integration Experience

What Works Great

  • Skill integrates seamlessly with Claude Code
  • Agent defaults are sensible (fp16, temp 0.5, speed 1.0)
  • Easy to invoke from agent context
  • Streaming combined with daemon mode is well suited to conversational use

Suggestions

  • Add example of complete agent integration flow
  • Document best practices for long conversations
  • Include sample hook configurations
  • Add guidance on when to use daemon mode vs regular mode

Feature Requests & Ideas

High Priority

  1. Setup Health Check Enhancement

    • Add --fix flag to automatically resolve common issues
    • Detect Python version problems and suggest solutions
    • Check audio playback capability
  2. Error Messages

    • More descriptive error messages
    • Include suggested fixes in error output
    • Add troubleshooting URLs to errors
  3. Progress Indicators

    • Show model download progress with size/ETA
    • Better indication when model is loading vs generating
    • Playback status feedback
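
As an illustration of errors that carry their own suggested fix, the sketch below attaches a hint to a known message. The table and function name are hypothetical; the single entry reflects the failure we hit under Python 3.9.6, and the skill's real error handling may be structured differently.

```python
# Hypothetical mapping from known error text to a suggested fix. The first
# entry reflects the failure described in this report; others would be added
# as they are discovered.
ERROR_HINTS = [
    (
        "No audio generated",
        "Check that the virtual environment uses Python 3.10 or newer, "
        "then reinstall the dependencies.",
    ),
]


def with_hint(error_message, hints=ERROR_HINTS):
    """Append a suggested fix to an error message when one is known."""
    for fragment, suggestion in hints:
        if fragment in error_message:
            return "{}\nSuggested fix: {}".format(error_message, suggestion)
    return error_message
```

Unknown messages pass through unchanged, so the wrapper can be applied to every error before it is printed.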

Medium Priority

  1. Voice Library Management

    • Built-in voice preset manager
    • Ability to name and catalog custom voices
    • Preview voices before using them
  2. Batch Processing UI

    • Better progress tracking for multiple files
    • Estimate total time for batch jobs
    • Resume capability for interrupted batches

Nice to Have

  1. Configuration Presets

    • Named configuration profiles (e.g., "podcast", "audiobook", "quick-test")
    • Easy switching between presets
    • Share presets between users
  2. Audio Post-Processing

    • Optional normalization
    • Noise reduction for cloned voices
    • Export in multiple formats (MP3, OGG, etc.)

Specific Use Cases Tested

✅ Successfully Tested

  • Basic text-to-speech with default voice
  • Emotion tags integration
  • Streaming long content
  • Voice cloning from YouTube video
  • Daemon mode for rapid iterations
  • Markdown processing (stripping syntax)
  • Multiple language support (tested with a Malayalam actor's voice)

❌ Not Tested

  • Voice cloning with custom .wav files
  • Code block handling options
  • Clipboard input
  • Preview mode
  • Configuration file usage
  • Different model quantizations (4bit, 5bit, 6bit)

Performance Metrics (Our Setup)

Hardware: MacBook with Apple Silicon
Python: 3.11.14
Model Used: chatterbox-turbo-8bit (default)

Metric              Value         Notes
Average RTF         0.30-0.35x    Excellent performance
First generation    ~3-5s         Including model load
With daemon         ~0.5s         Subsequent calls
Voice cloning RTF   0.32x         Minimal overhead
Streaming latency   ~5s buffer    Well-tuned
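
For readers unfamiliar with RTF, the sketch below shows how such a figure can be computed. It assumes the common definition (generation time divided by audio duration), which is the reading consistent with values below 1 being good; the timings in the example are made up, not measurements from our runs.

```python
def real_time_factor(generation_seconds, audio_seconds):
    """RTF = time spent generating / duration of the audio produced."""
    if audio_seconds <= 0:
        raise ValueError("audio_seconds must be positive")
    return generation_seconds / audio_seconds


# Hypothetical example: 10 s of audio generated in 3 s.
rtf = real_time_factor(3.0, 10.0)
print("RTF {:.2f}x, about {:.1f}x faster than real time".format(rtf, 1 / rtf))
```

Under this definition, an RTF of 0.30x means the audio is produced a little over three times faster than it takes to play back.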

Comparison with Alternatives

vs. Cloud TTS (Google, AWS, Azure)

Pros:

  • ✅ Complete privacy - no data sent to cloud
  • ✅ No API costs
  • ✅ Offline capability
  • ✅ Voice cloning included
  • ✅ No rate limits

Cons:

  • ⚠️ Requires Apple Silicon
  • ⚠️ Initial setup complexity
  • ⚠️ Larger disk space requirement

vs. Other Local TTS

Pros:

  • ✅ Better voice quality than most local options
  • ✅ Emotion tags are a distinctive feature
  • ✅ Excellent MLX optimization
  • ✅ Active development

Security & Privacy Considerations

Excellent:

  • No data leaves the machine
  • Voice samples stored locally
  • No telemetry or tracking
  • Full control over generated audio

Note: When cloning the voices of public figures (as we did), users should be aware of the ethical considerations and the potential legal implications of public use.


Recommendations for Creator

Critical (Fix Soon)

  1. Add Python version validation to setup script
  2. Improve error messages when audio generation fails
  3. Document Python 3.10+ requirement more prominently
  4. Add automatic retry logic for playback failures

Important (Next Release)

  1. Create one-line install script that handles all prerequisites
  2. Add health check with auto-fix capabilities
  3. Improve progress indicators for downloads
  4. Add voice sample quality checker

Nice to Have (Future)

  1. Web UI for voice management
  2. Built-in voice library with legal samples
  3. Export/import configuration profiles
  4. Integration with popular note-taking apps

Conclusion

The Speak TTS skill is a powerful and innovative tool that brings high-quality local text-to-speech to Apple Silicon Macs. The voice cloning feature is particularly impressive and sets it apart from many alternatives.

While the installation process has some rough edges (especially around Python version requirements), once properly set up, the tool works flawlessly. The streaming mode, daemon mode, and emotion tags make it perfect for AI agent integration and long-form content.

Would I recommend it? Absolutely yes, especially for:

  • Privacy-conscious users
  • AI agent developers
  • Content creators needing voice variety
  • Anyone wanting offline TTS capability

Main improvement needed: Smoother installation experience with better error handling and automatic prerequisite detection.


Testing Environment

  • macOS Sonoma (Darwin 23.1.0)
  • Apple Silicon
  • Python 3.11.14
  • Bun 1.3.5
  • yt-dlp 2025.12.8

Thank you for creating this excellent tool! 🎉
