A voice assistant based on Qwen3-Omni that supports voice cloning (OpenVoice) to customize the assistant's voice. No installation required: just a single ~18MB executable file. All chat audio is saved locally.
- 💪 Single executable file (~18MB), no installation needed
- 🎙 Supports voice cloning for customizing the assistant's voice
- 🔐 Privacy-focused: all chat audio is stored locally
- 👄 Supports 49 different voices from Qwen3-Omni
- 🎨 Voice cloning supports multiple formats including mp4, mp3, wav, etc.
Directory structure
some dir
├─ gouzi # single executable file
├─ config.txt # config file
├─ chat-log # directory for saving chat log and audio
└─ checkpoints_v2 # OpenVoice model (optional; required for voice cloning)
   └─ converter # use -M to specify this dir, default: ./checkpoints_v2/converter
├─ config.json
└─ checkpoint.pth
1. download a pre-built binary
2. download OpenVoice model (optional, required for voice cloning)
3. prepare config.txt
Add your models, API key, endpoint, prompts, etc.; see config_template.txt for details.
4. start server
./gouzi
After starting, you can interact with the assistant via microphone. Note: In noisy environments, background noise might be mistakenly sent to the model.
1. Use a specific voice from the model (without voice cloning)
Use voice number 4 (Chelsie):
gouzi -v 4
2. List all voices supported by the model
Display all available voices:
gouzi -V
3. Clone a custom voice (supports mp3, wav, mp4, etc.)
The assistant will speak using the voice from target_voice.wav:
gouzi -T target_voice.wav
4. Play your own spoken input before playing the assistant's response
Like listening to your own voice message after sending it on WeChat:
gouzi -P
6. Resume a previous conversation using its UUID
Each conversation has a unique UUID:
gouzi -u 19c5f1e2-0da6-4e7b-a982-d2d5e73e9fa3
6. Use a custom prompt defined in config.txt
Use the first prompt you defined in config.txt:
gouzi -p 1
7. Set silence duration to determine when speech ends
The default is 2000ms (2 seconds); the value is given in milliseconds:
gouzi -s 5000
8. Adjust voice sensitivity (minimum valid speech length)
Speech shorter than this threshold is treated as noise (e.g., keyboard clicks). Default: 100ms:
gouzi -t 200
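The two thresholds above (`-s` and `-t`) work together to segment the microphone stream: a turn ends once enough silence accumulates, and segments shorter than the minimum are discarded as noise. The following is a minimal sketch of that logic, not the actual gouzi implementation; the frame-based representation and function name are assumptions for illustration:

```rust
/// Illustrative sketch of silence-based segmentation (not the project's code).
/// `-s` (silence_threshold) ends a segment after enough silent frames;
/// `-t` (min_speech_time) drops segments too short to be real speech.
fn segment_speech(
    frames: &[bool],        // true = speech detected in this frame
    frame_ms: u32,          // duration of one frame in milliseconds
    silence_threshold: u32, // -s: silence (ms) that ends a segment
    min_speech_time: u32,   // -t: minimum valid segment length (ms)
) -> Vec<(usize, usize)> {  // (start_frame, end_frame) of kept segments
    let mut segments = Vec::new();
    let mut start: Option<usize> = None;
    let mut speech_frames = 0u32;
    let mut silent_frames = 0u32;
    for (i, &is_speech) in frames.iter().enumerate() {
        if is_speech {
            if start.is_none() {
                start = Some(i);
                speech_frames = 0;
            }
            speech_frames += 1;
            silent_frames = 0;
        } else if let Some(s) = start {
            silent_frames += 1;
            if silent_frames * frame_ms >= silence_threshold {
                // segment ended; keep it only if it was long enough
                if speech_frames * frame_ms >= min_speech_time {
                    segments.push((s, i - silent_frames as usize));
                }
                start = None;
            }
        }
    }
    // flush a segment still open at end of stream
    if let Some(s) = start {
        if speech_frames * frame_ms >= min_speech_time {
            segments.push((s, frames.len() - 1));
        }
    }
    segments
}

fn main() {
    // 20 ms frames: a single-frame click, then 200 ms of real speech
    let mut frames = vec![true];    // 20 ms click -> treated as noise
    frames.extend(vec![false; 10]); // 200 ms silence ends it
    frames.extend(vec![true; 10]);  // 200 ms of speech
    frames.extend(vec![false; 10]); // trailing silence
    let segs = segment_speech(&frames, 20, 100, 100);
    println!("{:?}", segs); // only the 200 ms speech segment survives
}
```

This is why a keyboard click passes the silence check but fails the minimum-length check, while real speech passes both.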
Each conversation's audio files are saved under a folder named after its UUID. When the service stops, the program automatically merges all audio files into two final files:
- One combines user and original assistant audio (with 1-second gaps).
- The other combines user audio with cloned voice assistant audio.
Conversation UUID Folder
├─ 2026-01-18_17-55-19.log # Chat log
├─ user-1.wav # User's first audio
├─ assistant-1.wav # Assistant's original response
├─ assistant_voice_clone-1.wav # Assistant's voice-cloned response
├─ user-2.wav # User's second audio
├─ assistant-2.wav # Assistant's original response
├─ assistant_voice_clone-2.wav # Assistant's voice-cloned response
├─ all_in_one.wav # Merged audio: user + original assistant
└─ all_in_one_voice_clone.wav # Merged audio: user + cloned assistant voice
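The merge step described above (clips joined with 1-second gaps) can be sketched as follows. This is an assumption-laden illustration, not the project's code: it assumes the per-turn WAV files have already been decoded to raw 16-bit mono PCM at a known sample rate:

```rust
/// Illustrative sketch of the merge step (not the project's code):
/// concatenate user/assistant clips, inserting 1 second of silence
/// between consecutive clips. Assumes 16-bit mono PCM samples.
fn merge_with_gaps(clips: &[Vec<i16>], sample_rate: usize) -> Vec<i16> {
    let gap = vec![0i16; sample_rate]; // 1 second of silence
    let mut out = Vec::new();
    for (i, clip) in clips.iter().enumerate() {
        if i > 0 {
            out.extend_from_slice(&gap); // gap between clips, not before the first
        }
        out.extend_from_slice(clip);
    }
    out
}

fn main() {
    // two fake 0.5 s "clips" at 8 kHz
    let user = vec![1i16; 4000];
    let assistant = vec![2i16; 4000];
    let merged = merge_with_gaps(&[user, assistant], 8000);
    // 4000 user + 8000 gap + 4000 assistant samples
    assert_eq!(merged.len(), 16000);
}
```

For `all_in_one.wav` the clip list would alternate `user-N.wav` and `assistant-N.wav`; for `all_in_one_voice_clone.wav` it would use the `assistant_voice_clone-N.wav` files instead.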
- Default build (without voice cloning)
git clone https://github.com/jingangdidi/voice_clone.git
cd voice_clone
cargo build --release
- CPU-based voice cloning
cargo build --release --features voice_clone_cpu
- CUDA-based voice cloning
cargo build --release --features voice_clone_cuda
- Metal-based voice cloning (Apple Silicon)
cargo build --release --features voice_clone_metal
Usage: gouzi [-c <config>] [-u <uuid>] [-p <prompt>] [-T <tone>] [-M <model>] [-v <voice>] [-V] [-s <silence-threshold>] [-t <min-speech-time>] [-m <maxage>] [-P] [-o <outpath>]
server for audio assistant
Options:
-c, --config config file, contains api_key, endpoint, model name
-u, --uuid previous uuid
-p, --prompt prompt index
-T, --tone voice clone tone color file
-M, --model openvoice model path, default: ./checkpoints_v2/converter
-v, --voice voice, support: 1-49
-V, --voice-show show all voices
-s, --silence-threshold the duration (in milliseconds) of silence after which speech is considered to have ended. Default: 2000ms
-t, --min-speech-time the minimum duration (in milliseconds) of a speech segment to be considered valid. Speech shorter than this threshold will be ignored. Default: 100ms
-m, --maxage chat max age since last query, default: 1DAY, support: SECOND, MINUTE, HOUR, DAY, WEEK
-P, --play-user-audio play user input audio
-o, --outpath output path, default: ./chat-log
-h, --help display usage information
(
voice: "Cherry", // Assistant voice
silence_threshold: 2000, // Silence duration (ms) to consider speech ended; default: 2000
min_speech_time: 100, // Minimum valid speech duration (ms); shorter = noise; default: 100
play_user_audio: true, // Whether to play user's audio before assistant speaks
outpath: "./chat-log", // Path to save chat logs; default: ./chat-log
model_config: [
ModelGroup(
provider: "Qwen", // Must be unique
api_key: "sk-xxx", // Required
endpoint: "https://dashscope.aliyuncs.com/compatible-mode/v1",
models: [
Model(
name: "qwen3-omni-flash-2025-12-01", // Model name; only qwen3-omni series supported
discription: "Qwen-Omni accepts multimodal inputs (text, image, audio, video) and generates text or speech responses. Offers multiple lifelike voices, supports multilingual and dialect outputs. Suitable for text creation, visual recognition, voice assistants, etc. Context: 65,536 tokens; max input: 49,152 tokens; max output: 16,384 tokens",
group: "Qwen",
is_default: true,
is_cot: false,
)
]
)
],
prompts: [
Prompt(
name: "audio assistant",
content: "You are a friendly, professional, and responsive voice assistant. Your task is to understand natural language commands, provide accurate, concise, and context-appropriate responses, and proactively guide users when necessary.",
)
]
)
- [2026.01.?] release v0.1.0