A voice assistant based on Qwen3-Omni that supports voice cloning (OpenVoice) to customize the assistant's voice. No installation required: just a single ~18MB executable file. All chat audio is saved locally.
- 💪 Single executable file (~18MB), no installation needed
- 🎙 Supports voice cloning for customizing the assistant's voice
- 🔐 Privacy-focused: all chat audio is stored locally
- 👄 Supports 49 different voices from Qwen3-Omni
- 🎨 Voice cloning supports multiple formats including mp4, mp3, wav, etc.
Directory structure
some dir
├─ gouzi # single executable file
├─ config.txt # config file
├─ chat-log # directory for saving chat log and audio
└─ checkpoints_v2 # OpenVoice model (optional; required for voice cloning)
   └─ converter # use -M to specify this dir, default: ./checkpoints_v2/converter
├─ config.json
└─ checkpoint.pth
1. download a pre-built binary
2. download OpenVoice model (optional, required for voice cloning)
3. prepare config.txt
Add your models, API key, endpoint, prompts, etc.; see config_template.txt for details.
4. start server
./gouzi
After starting, you can interact with the assistant via microphone. Note: In noisy environments, background noise might be mistakenly sent to the model.
1. Use a specific voice from the model (without voice cloning)
Use voice number 4 (Chelsie):
gouzi -v 4
2. List all voices supported by the model
Display all available voices:
gouzi -V
3. Clone a custom voice (supports mp3, wav, mp4, etc.)
The assistant will speak using the voice from target_voice.wav:
gouzi -T target_voice.wav
4. Play your own spoken input before playing the assistant's response
Like listening to your own voice message after sending it on WeChat:
gouzi -P
6. Resume a previous conversation using its UUID
Each conversation has a unique UUID:
gouzi -u 19c5f1e2-0da6-4e7b-a982-d2d5e73e9fa3
6. Use a custom prompt defined in config.txt
Use the first prompt you defined in config.txt:
gouzi -p 1
7. Set silence duration to determine when speech ends
The default is 2000ms (2 seconds); the value is given in milliseconds:
gouzi -s 5000
8. Adjust voice sensitivity (minimum valid speech length)
Speech shorter than this threshold is treated as noise (e.g., keyboard clicks). Default: 100ms:
gouzi -t 200
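The two thresholds above (`-s` and `-t`) work together to segment the microphone stream: a turn ends once enough silence accumulates, and segments shorter than the minimum are discarded as noise. The following is a minimal sketch of that logic, not the actual gouzi implementation; the frame-based representation and function name are assumptions for illustration:

```rust
/// Illustrative sketch of silence-based segmentation (not the project's code).
/// `-s` (silence_threshold) ends a segment after enough silent frames;
/// `-t` (min_speech_time) drops segments too short to be real speech.
fn segment_speech(
    frames: &[bool],        // true = speech detected in this frame
    frame_ms: u32,          // duration of one frame in milliseconds
    silence_threshold: u32, // -s: silence (ms) that ends a segment
    min_speech_time: u32,   // -t: minimum valid segment length (ms)
) -> Vec<(usize, usize)> {  // (start_frame, end_frame) of kept segments
    let mut segments = Vec::new();
    let mut start: Option<usize> = None;
    let mut speech_frames = 0u32;
    let mut silent_frames = 0u32;
    for (i, &is_speech) in frames.iter().enumerate() {
        if is_speech {
            if start.is_none() {
                start = Some(i);
                speech_frames = 0;
            }
            speech_frames += 1;
            silent_frames = 0;
        } else if let Some(s) = start {
            silent_frames += 1;
            if silent_frames * frame_ms >= silence_threshold {
                // segment ended; keep it only if it was long enough
                if speech_frames * frame_ms >= min_speech_time {
                    segments.push((s, i - silent_frames as usize));
                }
                start = None;
            }
        }
    }
    // flush a segment still open at end of stream
    if let Some(s) = start {
        if speech_frames * frame_ms >= min_speech_time {
            segments.push((s, frames.len() - 1));
        }
    }
    segments
}

fn main() {
    // 20 ms frames: a single-frame click, then 200 ms of real speech
    let mut frames = vec![true];    // 20 ms click -> treated as noise
    frames.extend(vec![false; 10]); // 200 ms silence ends it
    frames.extend(vec![true; 10]);  // 200 ms of speech
    frames.extend(vec![false; 10]); // trailing silence
    let segs = segment_speech(&frames, 20, 100, 100);
    println!("{:?}", segs); // only the 200 ms speech segment survives
}
```

This is why a keyboard click passes the silence check but fails the minimum-length check, while real speech passes both.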
Each conversation's audio files are saved under a folder named after its UUID. When the service stops, the program automatically merges all audio files into two final files:
- One combines user and original assistant audio (with 1-second gaps).
- The other combines user audio with cloned voice assistant audio.
Conversation UUID Folder
├─ 2026-01-18_17-55-19.log # Chat log
├─ user-1.wav # User's first audio
├─ assistant-1.wav # Assistant's original response
├─ assistant_voice_clone-1.wav # Assistant's voice-cloned response
├─ user-2.wav # User's second audio
├─ assistant-2.wav # Assistant's original response
├─ assistant_voice_clone-2.wav # Assistant's voice-cloned response
├─ all_in_one.wav # Merged audio: user + original assistant
└─ all_in_one_voice_clone.wav # Merged audio: user + cloned assistant voice
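The merge step described above (clips joined with 1-second gaps) can be sketched as follows. This is an assumption-laden illustration, not the project's code: it assumes the per-turn WAV files have already been decoded to raw 16-bit mono PCM at a known sample rate:

```rust
/// Illustrative sketch of the merge step (not the project's code):
/// concatenate user/assistant clips, inserting 1 second of silence
/// between consecutive clips. Assumes 16-bit mono PCM samples.
fn merge_with_gaps(clips: &[Vec<i16>], sample_rate: usize) -> Vec<i16> {
    let gap = vec![0i16; sample_rate]; // 1 second of silence
    let mut out = Vec::new();
    for (i, clip) in clips.iter().enumerate() {
        if i > 0 {
            out.extend_from_slice(&gap); // gap between clips, not before the first
        }
        out.extend_from_slice(clip);
    }
    out
}

fn main() {
    // two fake 0.5 s "clips" at 8 kHz
    let user = vec![1i16; 4000];
    let assistant = vec![2i16; 4000];
    let merged = merge_with_gaps(&[user, assistant], 8000);
    // 4000 user + 8000 gap + 4000 assistant samples
    assert_eq!(merged.len(), 16000);
}
```

For `all_in_one.wav` the clip list would alternate `user-N.wav` and `assistant-N.wav`; for `all_in_one_voice_clone.wav` it would use the `assistant_voice_clone-N.wav` files instead.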
- Default build (without voice cloning)
git clone https://github.com/jingangdidi/voice_clone.git
cd voice_clone
cargo build --release
- CPU-based voice cloning
cargo build --release --features voice_clone_cpu
- CUDA-based voice cloning
cargo build --release --features voice_clone_cuda
- Metal-based voice cloning (Apple Silicon)
cargo build --release --features voice_clone_metal
Usage: gouzi [-c <config>] [-u <uuid>] [-p <prompt>] [-T <tone>] [-M <model>] [-v <voice>] [-V] [-s <silence-threshold>] [-t <min-speech-time>] [-m <maxage>] [-P] [-o <outpath>]
server for audio assistant
Options:
-c, --config config file, contains api_key, endpoint, model name
-u, --uuid previous uuid
-p, --prompt prompt index
-T, --tone voice clone tone color file
-M, --model openvoice model path, default: ./checkpoints_v2/converter
-v, --voice voice, support: 1-49
-V, --voice-show show all voices
-s, --silence-threshold the duration (in milliseconds) of silence after which speech is considered to have ended. Default: 2000ms
-t, --min-speech-time the minimum duration (in milliseconds) of a speech segment to be considered valid. Speech shorter than this threshold will be ignored. Default: 100ms
-m, --maxage chat max age since last query, default: 1DAY, support: SECOND, MINUTE, HOUR, DAY, WEEK
-P, --play-user-audio play user input audio
-o, --outpath output path, default: ./chat-log
-h, --help display usage information
(
voice: "Cherry", // Assistant voice
silence_threshold: 2000, // Silence duration (ms) to consider speech ended; default: 2000
min_speech_time: 100, // Minimum valid speech duration (ms); shorter = noise; default: 100
play_user_audio: true, // Whether to play user's audio before assistant speaks
outpath: "./chat-log", // Path to save chat logs; default: ./chat-log
model_config: [
ModelGroup(
provider: "Qwen", // Must be unique
api_key: "sk-xxx", // Required
endpoint: "https://dashscope.aliyuncs.com/compatible-mode/v1",
models: [
Model(
name: "qwen3-omni-flash-2025-12-01", // Model name; only qwen3-omni series supported
discription: "Qwen-Omni accepts multimodal inputs (text, image, audio, video) and generates text or speech responses. Offers multiple lifelike voices, supports multilingual and dialect outputs. Suitable for text creation, visual recognition, voice assistants, etc. Context: 65,536 tokens; max input: 49,152 tokens; max output: 16,384 tokens",
group: "Qwen",
is_default: true,
is_cot: false,
)
]
)
],
prompts: [
Prompt(
name: "audio assistant",
content: "You are a friendly, professional, and responsive voice assistant. Your task is to understand natural language commands, provide accurate, concise, and context-appropriate responses, and proactively guide users when necessary.",
)
]
)
- [2026.01.?] release v0.1.0