User Experience Feedback & Improvement Suggestions

# Speak TTS Skill - User Feedback & Experience Report

**Date**: December 29, 2025
**User**: Testing with Claude Code AI Agent
**Platform**: macOS Sonoma (Apple Silicon)

---

## Executive Summary

The Speak TTS skill is an impressive local text-to-speech solution with powerful features like voice cloning, emotion tags, and streaming playback. The core functionality works very well, but the installation process exposed gaps in documentation and error handling.

**Overall Rating**: ⭐⭐⭐⭐ (4/5)

---

## What Works Exceptionally Well

### 1. Voice Cloning ⭐⭐⭐⭐⭐
- **Outstanding feature**: The voice cloning capability is remarkably effective
- Successfully cloned a voice from a YouTube Shorts video
- The quality of the cloned voice is impressive with the Chatterbox TTS model
- Simple to use: just provide a WAV file path with the `--voice` flag
- Works seamlessly with streaming mode

### 2. Streaming Mode ⭐⭐⭐⭐⭐
- Adaptive buffering works flawlessly
- Audio starts playing while generation continues
- Perfect for long-form content
- The `--stream --play` combination works beautifully
- Real-time feedback shows buffer status clearly

### 3. Emotion Tags ⭐⭐⭐⭐⭐
- Adds natural expressiveness to speech
- Tags like `[laugh]`, `[chuckle]`, `[sigh]`, `[clear throat]` work perfectly
- Makes the TTS sound more human and engaging
- Easy to incorporate into text
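
To illustrate how these tags sit inside ordinary text, the sketch below extracts bracketed tags from a string and separates the four tags confirmed in this test from anything else. `KNOWN_TAGS` and `find_emotion_tags` are hypothetical names used for illustration only; the skill may support more tags than the ones listed here, and this is not how the skill itself parses input.

```python
import re

# Tags confirmed to work in this test session; the skill may accept others.
KNOWN_TAGS = {"laugh", "chuckle", "sigh", "clear throat"}

# Matches any text enclosed in a single pair of square brackets.
TAG_PATTERN = re.compile(r"\[([^\[\]]+)\]")


def find_emotion_tags(text: str) -> tuple[list[str], list[str]]:
    """Return (known, unknown) bracketed tags in the order they appear."""
    known, unknown = [], []
    for match in TAG_PATTERN.finditer(text):
        tag = match.group(1).strip().lower()
        (known if tag in KNOWN_TAGS else unknown).append(tag)
    return known, unknown
```

A check like this could catch misspelled tags before a long generation run. Note that it would also flag Markdown link text such as `[docs](url)`, so it is only suitable for plain-text input.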

### 4. Daemon Mode ⭐⭐⭐⭐⭐
- Significant performance improvement for repeated use
- Model stays loaded in memory
- Subsequent generations are nearly instant
- Perfect for AI agent use cases

### 5. On-Device Processing ⭐⭐⭐⭐⭐
- Complete privacy - no cloud APIs
- Fast generation on Apple Silicon
- No API costs or rate limits
- Offline capability

---

## Installation & Setup Experience

### Challenges Encountered

#### 1. Python Version Requirements ⚠️
**Issue**: Documentation states "Python 3.10+" is required, but the setup script defaults to system Python (3.9.6 in our case)

**What Happened**:
- Initial setup ran successfully with Python 3.9.6
- All dependencies installed without errors
- However, audio generation then hung at the "Generating audio..." message
- After 60+ seconds, it failed with "Error: No audio generated"

**Solution Required**:
- Manually installed Python 3.11 via Homebrew
- Created virtual environment with: `/opt/homebrew/bin/python3.11 -m venv ~/.chatter/env`
- Reinstalled all dependencies
- After this, everything worked perfectly

**Recommendation**:
- Setup script should check Python version and fail fast with clear error if < 3.10
- Or better: automatically use Homebrew Python if available and meets requirements
- Add troubleshooting section specifically for Python version issues
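
The fail-fast check suggested above could look roughly like the following. This is a minimal sketch, not the skill's actual setup code: `python_version_error` and `MIN_PYTHON` are hypothetical names, and the only fact taken from the documentation is the 3.10 minimum.

```python
import sys

MIN_PYTHON = (3, 10)  # minimum version stated in the skill's documentation


def python_version_error(version=None, minimum=MIN_PYTHON):
    """Return an error message if `version` is too old, otherwise None.

    `version` is a (major, minor, micro) tuple; defaults to the running interpreter.
    """
    if version is None:
        version = sys.version_info
    if tuple(version[:2]) >= minimum:
        return None
    return (
        f"Python {minimum[0]}.{minimum[1]}+ is required, "
        f"but found {version[0]}.{version[1]}.{version[2]}. "
        "Install a newer interpreter (for example via Homebrew) and re-run setup."
    )
```

A setup script could call `python_version_error()` before installing anything and pass a returned message to `sys.exit()`, which prints it to stderr and exits with a non-zero status. With the system Python 3.9.6 from this test, that would have stopped setup immediately instead of producing a silent hang later.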

#### 2. Bun Runtime Installation
**Issue**: Bun is required but not mentioned prominently enough in prerequisites

**What Happened**:
- Realized Bun was missing only after trying to run the skill
- Had to install it separately via: `curl -fsSL https://bun.sh/install | bash`

**Recommendation**:
- Make Bun requirement more prominent in documentation
- Consider adding auto-detection and installation prompt
- Or provide a setup script that handles Bun installation
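
Auto-detection could be as simple as the sketch below, which looks for `bun` on `PATH` and then in `~/.bun/bin`, the location used in this test session. The function names are hypothetical, and the install hint is the same command quoted above.

```python
import shutil
from pathlib import Path
from typing import Optional

BUN_INSTALL_HINT = "curl -fsSL https://bun.sh/install | bash"


def find_bun(home: Optional[Path] = None, which=shutil.which) -> Optional[str]:
    """Return the path to a Bun executable, or None if none is found."""
    on_path = which("bun")
    if on_path:
        return on_path
    # Fall back to the per-user location observed in this test session.
    candidate = (home or Path.home()) / ".bun" / "bin" / "bun"
    return str(candidate) if candidate.is_file() else None


def bun_status_message(bun_path: Optional[str]) -> str:
    """Build a message suitable for a prerequisite check."""
    if bun_path:
        return f"Found Bun at {bun_path}"
    return f"Bun is required but was not found. Install it with: {BUN_INSTALL_HINT}"
```

The `which` and `home` parameters exist only so the lookup can be exercised without touching the real environment.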

#### 3. Global Command Setup
**Issue**: The `speak` command wasn't immediately available after installation

**What Happened**:
- Had to manually add shell alias to `~/.zshrc`
- Shell needed to be reloaded for alias to work
- For AI agent use, needed to use full path: `~/.bun/bin/bun run ~/.claude/skills/speak/src/index.ts`

**Recommendation**:
- Setup script could offer to add the alias automatically
- Provide a post-install message with exact commands to run
- Consider creating a symlink automatically with user permission
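
One way a setup script might add the alias without duplicating it on repeated runs is sketched below. The alias text is assembled from the full command quoted above; `ensure_alias` and `ALIAS_LINE` are hypothetical names, and a real script should ask for permission before editing a shell configuration file.

```python
from pathlib import Path

# Alias built from the full command used during this test session.
ALIAS_LINE = "alias speak='~/.bun/bin/bun run ~/.claude/skills/speak/src/index.ts'"


def ensure_alias(rc_path: Path, alias_line: str = ALIAS_LINE) -> bool:
    """Append `alias_line` to `rc_path` unless it is already present.

    Returns True if the file was modified, False if nothing changed.
    """
    existing = rc_path.read_text() if rc_path.exists() else ""
    if alias_line in existing.splitlines():
        return False
    # Start on a fresh line if the file does not end with a newline.
    separator = "" if existing == "" or existing.endswith("\n") else "\n"
    with rc_path.open("a") as rc_file:
        rc_file.write(f"{separator}{alias_line}\n")
    return True
```

Calling `ensure_alias(Path.home() / ".zshrc")` twice changes the file only once; the shell still needs to be reloaded before the alias takes effect.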

---

## Feature-Specific Feedback

### Audio Playback (`--play` flag)

**Initial Issue**: Early in our testing session, the `--play` flag had no effect. Audio generated successfully but did not play automatically.

**What Fixed It**: After we killed all processes and restarted the daemon, `--play` started working correctly.

**Recommendation**:
- Investigate potential race conditions or socket issues that might prevent playback
- Add better error messages if playback fails
- Consider fallback to `afplay` if primary playback method fails
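
The fallback idea can be sketched as a loop over candidate player commands. This is an illustration under assumptions, not the skill's playback code: `play_with_fallback` is a hypothetical helper, and which player the skill uses as its primary method is not known from this test. `afplay` is the command-line audio player that ships with macOS.

```python
import subprocess
from typing import Optional, Sequence


def play_with_fallback(audio_path: str, players: Sequence[Sequence[str]]) -> Optional[str]:
    """Try each player command in order; return the name of the one that worked.

    Returns None if every player failed or could not be started.
    """
    for command in players:
        try:
            result = subprocess.run([*command, audio_path], capture_output=True)
        except OSError:
            continue  # player is missing or not executable; try the next one
        if result.returncode == 0:
            return command[0]
    return None
```

On macOS, `play_with_fallback("output.wav", [["afplay"]])` would block until playback finishes. Returning `None` gives the caller a clear point at which to print a descriptive error instead of failing silently.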

### Model Download Experience

**Positive**:
- First-time model download is transparent
- Progress indicators work well
- Models are cached for subsequent use

**Could Be Better**:
- Initial download size isn't shown upfront
- No estimate of download time
- Consider pre-downloading default model during setup
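
Showing the size upfront and a rough time estimate needs very little code. The helpers below are a sketch with hypothetical names; the actual model size was not recorded in this test, so no real figures are used.

```python
def format_size(num_bytes: int) -> str:
    """Format a byte count using binary units, capped at GiB."""
    size = float(num_bytes)
    for unit in ("B", "KiB", "MiB", "GiB"):
        if size < 1024 or unit == "GiB":
            return f"{size:.1f} {unit}"
        size /= 1024


def estimate_eta_seconds(bytes_remaining: int, bytes_per_second: float):
    """Return the estimated seconds left, or None if the speed is unknown."""
    if bytes_per_second <= 0:
        return None
    return bytes_remaining / bytes_per_second
```

The total size would typically come from the download response's content length, and the speed from a running average over the last few seconds of transfer.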

### Performance

**Excellent**:
- Generation speed is impressive (RTF typically 0.30-0.35x)
- Streaming reduces perceived latency significantly
- Voice cloning doesn't noticeably slow generation
- Daemon mode makes subsequent calls nearly instant
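
For readers unfamiliar with RTF (real-time factor), the sketch below shows the calculation under the assumption that RTF is generation time divided by audio duration, which matches this report's reading that lower is better. The tool may compute its figure differently, and the numbers in the example are illustrative, not measurements.

```python
def real_time_factor(generation_seconds: float, audio_seconds: float) -> float:
    """RTF under the assumed definition: processing time / audio duration."""
    if audio_seconds <= 0:
        raise ValueError("audio_seconds must be positive")
    return generation_seconds / audio_seconds


# Illustrative numbers only: 10 s of audio generated in 3 s.
rtf = real_time_factor(3.0, 10.0)
speedup = 1 / rtf  # how many times faster than real time
```

Under this definition, an RTF of 0.30 means audio is produced a little over three times faster than it plays back, which is why streaming can start playback early without stalling.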

---

## Documentation Quality

### Strengths ⭐⭐⭐⭐⭐
- SKILL.md is comprehensive and well-organized
- Pattern-based examples are extremely helpful
- Clear decision guidelines for agents
- Good coverage of all features

### Could Be Improved
- Expand the installation troubleshooting section
- Make the Python version requirement more prominent
- Put more emphasis on the order in which prerequisites must be installed
- List common error messages and their solutions

---

## AI Agent Integration Experience

### What Works Great
- Skill integrates seamlessly with Claude Code
- Agent defaults are sensible (fp16, temp 0.5, speed 1.0)
- Easy to invoke from agent context
- Streaming + daemon mode perfect for conversational use

### Suggestions
- Add example of complete agent integration flow
- Document best practices for long conversations
- Include sample hook configurations
- Add guidance on when to use daemon mode vs regular mode

---

## Feature Requests & Ideas

### High Priority
1. **Setup Health Check Enhancement**
   - Add `--fix` flag to automatically resolve common issues
   - Detect Python version problems and suggest solutions
   - Check audio playback capability

2. **Error Messages**
   - More descriptive error messages
   - Include suggested fixes in error output
   - Add troubleshooting URLs to errors

3. **Progress Indicators**
   - Show model download progress with size/ETA
   - Better indication when model is loading vs generating
   - Playback status feedback

### Medium Priority
4. **Voice Library Management**
   - Built-in voice preset manager
   - Ability to name and catalog custom voices
   - Preview voices before using them

5. **Batch Processing UI**
   - Better progress tracking for multiple files
   - Estimate total time for batch jobs
   - Resume capability for interrupted batches

### Nice to Have
6. **Configuration Presets**
   - Named configuration profiles (e.g., "podcast", "audiobook", "quick-test")
   - Easy switching between presets
   - Share presets between users

7. **Audio Post-Processing**
   - Optional normalization
   - Noise reduction for cloned voices
   - Export in multiple formats (MP3, OGG, etc.)

---

## Specific Use Cases Tested

### ✅ Successfully Tested
- [x] Basic text-to-speech with default voice
- [x] Emotion tags integration
- [x] Streaming long content
- [x] Voice cloning from YouTube video
- [x] Daemon mode for rapid iterations
- [x] Markdown processing (stripping syntax)
- [x] Multiple language support (tested Malayalam actor's voice)

### ❌ Not Tested
- [ ] Voice cloning with custom .wav files
- [ ] Code block handling options
- [ ] Clipboard input
- [ ] Preview mode
- [ ] Configuration file usage
- [ ] Different model quantizations (4bit, 5bit, 6bit)

---

## Performance Metrics (Our Setup)

**Hardware**: MacBook with Apple Silicon
**Python**: 3.11.14
**Model Used**: chatterbox-turbo-8bit (default)

| Metric | Value | Notes |
|--------|-------|-------|
| Average RTF | 0.30-0.35x | Excellent performance |
| First generation | ~3-5s | Including model load |
| With daemon | ~0.5s | Subsequent calls |
| Voice cloning RTF | 0.32x | Minimal overhead |
| Streaming latency | ~5s buffer | Well-tuned |

---

## Comparison with Alternatives

### vs. Cloud TTS (Google, AWS, Azure)
**Pros**:
- ✅ Complete privacy - no data sent to cloud
- ✅ No API costs
- ✅ Offline capability
- ✅ Voice cloning included
- ✅ No rate limits

**Cons**:
- ⚠️ Requires Apple Silicon
- ⚠️ Initial setup complexity
- ⚠️ Larger disk space requirement

### vs. Other Local TTS
**Pros**:
- ✅ Better voice quality than most local options
- ✅ Emotion tags are a unique feature
- ✅ Excellent MLX optimization
- ✅ Active development

---

## Security & Privacy Considerations

**Excellent**:
- No data leaves the machine
- Voice samples stored locally
- No telemetry or tracking
- Full control over generated audio

**Note**: When cloning the voices of public figures (as we did), users should be aware of the ethical considerations and the potential legal implications of public use.

---

## Recommendations for Creator

### Critical (Fix Soon)
1. Add Python version validation to setup script
2. Improve error messages when audio generation fails
3. Document Python 3.10+ requirement more prominently
4. Add automatic retry logic for playback failures

### Important (Next Release)
5. Create one-line install script that handles all prerequisites
6. Add health check with auto-fix capabilities
7. Improve progress indicators for downloads
8. Add voice sample quality checker
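
A basic voice sample checker could start from the properties that Python's standard `wave` module exposes for PCM WAV files. The sketch below is an assumption-laden illustration: the function names are hypothetical, and the duration and sample-rate thresholds are placeholders, since the model's actual requirements for reference audio were not established in this test.

```python
import wave


def inspect_wav(path: str) -> dict:
    """Read basic properties of a PCM WAV file using the standard library."""
    with wave.open(path, "rb") as wav:
        frames = wav.getnframes()
        rate = wav.getframerate()
        return {
            "channels": wav.getnchannels(),
            "sample_rate": rate,
            "bit_depth": wav.getsampwidth() * 8,
            "duration_seconds": frames / rate,
        }


def sample_warnings(info: dict, min_seconds: float = 5.0, min_rate: int = 16000) -> list:
    """Return human-readable warnings; the default thresholds are placeholders."""
    warnings = []
    if info["duration_seconds"] < min_seconds:
        warnings.append(f"Sample is shorter than {min_seconds:g} seconds")
    if info["sample_rate"] < min_rate:
        warnings.append(f"Sample rate is below {min_rate} Hz")
    return warnings
```

Checks for background noise or clipping would need actual signal analysis and are outside the scope of this sketch.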

### Nice to Have (Future)
9. Web UI for voice management
10. Built-in voice library with legal samples
11. Export/import configuration profiles
12. Integration with popular note-taking apps

---

## Conclusion

The Speak TTS skill is a powerful and innovative tool that brings high-quality local text-to-speech to Apple Silicon Macs. The voice cloning feature is particularly impressive and sets it apart from many alternatives.

While the installation process has some rough edges (especially around Python version requirements), once properly set up, the tool works flawlessly. The streaming mode, daemon mode, and emotion tags make it perfect for AI agent integration and long-form content.

**Would I recommend it?** Absolutely yes, especially for:
- Privacy-conscious users
- AI agent developers
- Content creators needing voice variety
- Anyone wanting offline TTS capability

**Main improvement needed**: Smoother installation experience with better error handling and automatic prerequisite detection.

---

## Testing Environment

- macOS Sonoma (Darwin 23.1.0)
- Apple Silicon
- Python 3.11.14
- Bun 1.3.5
- yt-dlp 2025.12.8

Thank you for creating this excellent tool! 🎉

