This backend system consists of several components that enable real-time conversational characters using Pixel Streaming, STT, TTS, LLM integration, and lipsync. Communication between Unreal Engine (UE) and the backend is handled via WebSockets.
- When a player speaks to a UE character, the audio is captured as a WAV file.
- The backend Node.js controller (communicator) receives a notification from UE and forwards the audio to the speech-to-text (STT) service.
- The STT service converts the stereo signal to mono, resamples to 16 kHz, and runs it through a speech recognition model.
- The recognized text is sent to the LLM service, which generates a response.
- Once the LLM response is ready, the Node.js controller notifies the text-to-speech (TTS) and lipsync services via HTTP:
  - http://127.0.0.1:5001/speak → TTS service (generates audio).
  - http://127.0.0.1:5002/process_wav → Lipsync service (processes audio).
- The TTS service outputs a WAV file, which is then passed to the lipsync service.
- The lipsync service generates blendshapes, streams them to UE via Live Link (WebSockets), and plays the audio.
- The UE character consumes the blendshapes and applies them in real time for synchronized facial animation.
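To make the flow concrete, here is a minimal sketch (TypeScript, Node.js 18+) of how the controller might chain these calls. Only the TTS (`/speak`) and lipsync (`/process_wav`) endpoints come from the list above; the STT and LLM URLs, the payload shapes, and the assumption that UE sends the captured WAV bytes directly over the WebSocket are illustrative placeholders, not the project's actual code.

```typescript
// controller-sketch.ts — hypothetical outline of the Node.js communicator.
// Only the TTS and lipsync endpoints are from the README; everything else is assumed.
import { WebSocketServer, type RawData } from "ws";

const TTS_URL = "http://127.0.0.1:5001/speak";            // from the README
const LIPSYNC_URL = "http://127.0.0.1:5002/process_wav";  // from the README
const STT_URL = "http://127.0.0.1:5003/transcribe";       // assumed
const LLM_URL = "http://127.0.0.1:5004/generate";         // assumed

const wss = new WebSocketServer({ port: 8080 });           // assumed UE-facing port

wss.on("connection", (ue) => {
  ue.on("message", async (data: RawData) => {
    try {
      // ws can deliver Buffer, ArrayBuffer, or Buffer[]; normalize to a single Buffer.
      const audio = Buffer.isBuffer(data)
        ? data
        : Array.isArray(data)
          ? Buffer.concat(data)
          : Buffer.from(data);

      // 1. Forward the captured WAV to the STT service.
      const sttRes = await fetch(STT_URL, {
        method: "POST",
        headers: { "Content-Type": "audio/wav" },
        body: audio,
      });
      const { text } = (await sttRes.json()) as { text: string };

      // 2. Send the recognized text to the LLM service.
      const llmRes = await fetch(LLM_URL, {
        method: "POST",
        headers: { "Content-Type": "application/json" },
        body: JSON.stringify({ prompt: text }),
      });
      const { reply } = (await llmRes.json()) as { reply: string };

      // 3. Ask the TTS service to synthesize the reply (assumed to return WAV bytes).
      const ttsRes = await fetch(TTS_URL, {
        method: "POST",
        headers: { "Content-Type": "application/json" },
        body: JSON.stringify({ text: reply }),
      });
      const wav = Buffer.from(await ttsRes.arrayBuffer());

      // 4. Hand the WAV to the lipsync service, which generates blendshapes,
      //    streams them to UE via Live Link, and plays the audio.
      await fetch(LIPSYNC_URL, {
        method: "POST",
        headers: { "Content-Type": "audio/wav" },
        body: wav,
      });
    } catch (err) {
      console.error("pipeline failed:", err);
    }
  });
});
```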
- Instead of writing and reading WAV files between services, use binary streaming where possible (WebSockets or gRPC) for lower latency.
- Each LLM-dependent service should run as a standalone server connected to the Node.js controller via WebSockets.
- Fault tolerance: the Node.js controller should implement fallback and reconnection logic to ensure resilience (e.g., if one service crashes, reconnect with retries and fail over to a backup); see the sketch after this list.
- Currently, audio playback happens inside the lipsync service, which only works locally. In production, audio should be sent back to UE for playback.
- For smoother synchronization, blendshapes should only be streamed when UE has confirmed it received the audio and is ready to play it.
- Low latency must remain the top priority when introducing any improvements.
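As an illustration of the fault-tolerance point above, here is a minimal reconnection helper with capped exponential backoff. The service URL, backoff values, and message handling are placeholders, not the project's actual code; failover to a backup endpoint would sit on top of the same pattern.

```typescript
// resilient-socket.ts — illustrative sketch of the reconnection behaviour described above.
// The URL, retry limits, and backoff values are placeholders.
import WebSocket, { type RawData } from "ws";

function connectWithRetry(
  url: string,
  onMessage: (data: RawData) => void,
  attempt = 0,
): void {
  const ws = new WebSocket(url);

  ws.on("open", () => {
    console.log(`connected to ${url}`);
    attempt = 0; // connection is healthy again, reset the backoff
  });

  ws.on("message", onMessage);

  ws.on("error", (err) => {
    // "close" always follows an error, so the retry below will be scheduled there.
    console.error(`socket error on ${url}: ${err.message}`);
  });

  ws.on("close", () => {
    // Exponential backoff, capped at 10 s.
    const delayMs = Math.min(1000 * 2 ** attempt, 10_000);
    console.warn(`connection to ${url} lost, retrying in ${delayMs} ms`);
    setTimeout(() => connectWithRetry(url, onMessage, attempt + 1), delayMs);
  });
}

// Example: keep the link to a hypothetical STT service alive.
connectWithRetry("ws://127.0.0.1:5003", (data) => {
  console.log("STT message:", data.toString());
});
```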
Not all dependencies may be listed here, but here are the ones you definitely need to run the system locally:
- Run `npm install`.
- Install SoX to convert stereo to mono in real time (see the sketch after these steps): https://sourceforge.net/projects/sox/
- Install the required dependencies for the Python services in `\services\neuro-sync\Local_API` and `\services\neuro-sync\Player`. The required models are downloaded the first time you start the project.
- Edit the absolute paths in `start.bat`. This is the entry point that opens all required services in a Windows Terminal grid.
If all dependencies are installed and the paths are correct, you can run `npm run dev` to start the project.
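For the SoX step above, here is one way it could be driven from Node as a child process to produce the mono, 16 kHz audio the STT service expects. This is a sketch under assumptions: the file names are placeholders, and the real pipeline may call SoX from the Python STT service or stream audio instead of using files on disk.

```typescript
// stereo-to-mono.ts — illustrative only: one way to call SoX for the
// stereo → mono, 16 kHz conversion described above. File names are placeholders.
import { spawn } from "node:child_process";

function convertToMono16k(input: string, output: string): Promise<void> {
  return new Promise((resolve, reject) => {
    // sox <in> -r 16000 -c 1 <out>  → resample to 16 kHz and downmix to one channel
    const sox = spawn("sox", [input, "-r", "16000", "-c", "1", output]);
    sox.on("error", reject); // e.g. SoX is not installed or not on PATH
    sox.on("close", (code) =>
      code === 0 ? resolve() : reject(new Error(`sox exited with code ${code}`)),
    );
  });
}

// Usage (hypothetical file names):
convertToMono16k("mic_capture.wav", "mic_capture_16k_mono.wav")
  .then(() => console.log("conversion done"))
  .catch((err) => console.error(err));
```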
neuro-sync is a third-party library I used to handle lipsync blendshape generation and TTS. I built a `server.py` wrapper around it to expose its functionality as a service:
- https://github.com/AnimaVR/NeuroSync_Player
- https://github.com/AnimaVR/NeuroSync_Local_API
This project is released under a dual-license model:
- Free / MIT License
  - Free for individuals and organizations earning under $1M/year.
  - Covers all original code authored in this project.
  - See the full MIT license in `LICENSE`.
- Commercial License
  - Required for organizations with annual revenue of $1,000,000 or more.
  - Provides extended rights, priority support, and permission to integrate into proprietary systems.
  - Contact divakov.gleb@gmail.com to obtain a commercial license.
This project uses third-party components licensed under their own terms.
- NeuroSync (NeuroSync Local_API): used for lipsync blendshape generation and TTS.
  - Free for individuals and small organizations under $1M/year.
  - Commercial license required for larger organizations.
  - Full license text: `Local_API/LICENCE`
- NeuroSync Player: used for handling player-side functionality and integration with NeuroSync services.
  - Free for individuals and small organizations under $1M/year.
  - Commercial license required for larger organizations.
  - Full license text: `Player/LICENCE`