Skip to content

LLM Scribe is a toolkit for creating handwritten datasets quickly and easily for LLM fine-tuning. Automatically outputs into multiple common finetuning formats such as chatml, alpaca, and more.

License

Notifications You must be signed in to change notification settings

ella0333/LLM-Scribe

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

1 Commit
 
 
 
 

Repository files navigation

LLM Scribe

🔍 What is LLM Scribe?

LLM Scribe is your professional toolkit for creating high-quality conversational datasets for Large Language Model fine-tuning. Whether you're a creative writer crafting character personalities or a developer preparing training data, LLM Scribe eliminates the technical barriers and formatting headaches.

No more struggling with JSON syntax or format specifications - LLM Scribe handles all the technical details while you focus on creating valuable content.

📷 Video Demo

✨ Key Features

Streamlined Dataset Creation

  • Intuitive Interface - Focus on writing, not formatting
  • Auto-save Functionality - Never lose your work with automatic saving on every interaction
  • Progress Tracking - Set goals and monitor your dataset completion
  • Tab Navigation - Rapidly cycle between fields for efficient data entry
  • Light mode and Dark mode themes - Swap in settings

Professional Export Options

  • Multiple Export Formats - Supports all major LLM training formats including ChatML, Alpaca, ShareGPT/Vicuna
  • Format-Specific Customizations - Tailor your datasets with format-specific options

Advanced Capabilities

  • Real-time Token Tracking - Monitor token usage with popular tokenizers (OpenAI, HuggingFace, Mistral)
  • Customizable Fields - Enable/disable optional fields based on your specific needs
  • System Message Support - Add system prompts for ChatGPT/ChatML formats
  • Custom IDs - Assign unique identifiers for ShareGPT/Vicuna formats

Workflow Optimization

  • Easy Dataset Reloading - Seamlessly continue work on existing projects
  • Multi-turn Conversation Support - Create contextually aware training data
  • In-app Guidance - Helpful tooltips and explanations throughout the interface

📋 Supported Export Formats

Pair Data Exports

  • chatgpt_chatml.jsonl
  • chatml.json
  • alpaca.jsonl
  • alpaca.json
  • sharegpt_vicuna.jsonl
  • sharegpt_vicuna.json
  • generic.jsonl

Multi-turn Data Exports

  • chatgpt_chatml.jsonl
  • chatml.json
  • sharegpt_vicuna.jsonl
  • sharegpt_vicuna.json
  • Plus all pair formats (automatically generated)

🎓 Perfect for Both Beginners & Experts

For Beginners

  • Start with default settings to get all formats you need
  • Choose between simple pair data or more advanced multi-turn conversations
  • No technical knowledge required - just write and export

For Experienced Developers

  • Fine-tune your datasets with format-specific customizations
  • Track token usage for cost and performance optimization
  • Leverage advanced features for professional dataset creation

📥 Installation & Getting Started

⚠️ Security Notice - No Code Signing Certificate - You may receive a security warning when installing. This is normal as the application is not code-signed. To install, click "More info" and then "Run anyway" when prompted.

Please click the open book icon to get started once you open the app! It will give you all the info you need.

💻 System Requirements

  • Windows Only Application - Not compatible with macOS or Linux

🔒 License

This software includes a commercial license that grants you full commercial rights to all datasets and outputs you create. The underlying system and methodology are patent pending. For licensing inquiries regarding technology integration, please contact us.

☕ About the Developer

Created with ❤️ by Gabriella Baris - Check out my portfolio for more projects and tools!

If LLM Scribe has been helpful for your projects, consider buying me a coffee! Your support helps keep this project alive and enables continued development of new features.

Ko-fi

📚 Support & Contact

If you have any issues, find bugs, or need assistance, please message Gabriella@Kryptive.com for:

  • Technical support
  • Bug reports
  • Additional format requests
  • Tokenizer library additions

Patent & Technology Licensing

Interested in integrating this technology into your own products? Contact us for licensing the underlying system and methodology.


Version 1.0 | Patent Pending

Note on Tokenizer Libraries: LLM Scribe utilizes open-source libraries (tiktoken, Hugging Face transformers, Mistral AI Tokenizers) for token counting functionalities, each governed by their respective licenses.

About

LLM Scribe is a toolkit for creating handwritten datasets quickly and easily for LLM fine-tuning. Automatically outputs into multiple common finetuning formats such as chatml, alpaca, and more.

Topics

Resources

License

Stars

Watchers

Forks

Packages

No packages published