The Qwen3-TTS-12Hz-1.7B-VoiceDesign model is a fascinating new entry in the text-to-speech landscape. Unlike traditional TTS systems that rely on fixed speaker IDs, this model allows for Voice Design—you can describe the voice you want using natural language (e.g., "A warm, clear, and professional female voice"), and the model generates it.
It supports 10 major languages (including English, Chinese, and Japanese) and offers fine-grained control over emotion and prosody. For more details, check out the official documentation.
In this guide, we'll walk through how to run this model on Google Colab to generate long-form audio narrations.
Quick Start
1. Environment Setup
Run the following commands in a code cell on Google Colab to set up the environment. We recommend an A100 GPU runtime for the best performance. Note that `conda create` / `conda activate` do not work inside Colab cells (each `!` command runs in its own shell, and conda is not preinstalled), so install directly into the notebook environment with pip:
!pip install -U qwen-tts
!pip install -U flash-attn --no-build-isolation
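Before loading the model, it is worth confirming that the runtime actually has a GPU attached. This is a plain PyTorch check, nothing specific to qwen-tts:

import torch

# flash-attn and bfloat16 inference assume a modern CUDA GPU (e.g., an A100).
assert torch.cuda.is_available(), "No GPU detected: switch the Colab runtime type to GPU."
print(torch.cuda.get_device_name(0))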
2. Smart Chunking & Generation Script
Paste the following script into a new code cell. This script handles long-form text by splitting it into manageable chunks while preserving natural sentence structures.
Create a file named input.txt (using the Colab file explorer on the left) with your text content before running this script.
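If you prefer not to use the file explorer, you can also write input.txt from a code cell. The sample text below is just placeholder content; replace it with your own narration text:

# Placeholder sample content, replace with your own text.
sample_text = (
    "Welcome to this narrated article.\n\n"
    "Each paragraph will be chunked and synthesized separately, "
    "then stitched into a single audio file."
)
with open("input.txt", "w", encoding="utf-8") as f:
    f.write(sample_text)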
import torch
import soundfile as sf
import numpy as np
import re
from qwen_tts import Qwen3TTSModel

def split_text_smart(text, max_length=100):
    """
    Splits text into chunks of at most max_length characters,
    preserving sentence structure where possible.
    """
    # 1. Split by paragraphs (double newlines)
    paragraphs = [p.strip() for p in text.split('\n\n') if p.strip()]
    chunks = []
    for para in paragraphs:
        if len(para) <= max_length:
            chunks.append(para)
            continue
        # 2. The paragraph is too long: split it into sentences.
        # Matches end-of-sentence punctuation (optionally followed by a
        # closing quote), then whitespace, then a capital letter.
        sentences = re.split(r'(?<=[.!?])\s+(?=[A-Z])|(?<=[.!?]["”])\s+(?=[A-Z])', para)
        current_chunk = ""
        for sentence in sentences:
            if len(current_chunk) + len(sentence) + 1 <= max_length:
                current_chunk += (" " if current_chunk else "") + sentence
            else:
                if current_chunk:
                    chunks.append(current_chunk)
                current_chunk = sentence
            # 3. If a single sentence is still too long (rare but possible),
            # split it by commas/semicolons. Any clause that still exceeds
            # max_length on its own is kept as-is rather than hard-split.
            if len(current_chunk) > max_length:
                sub_parts = re.split(r'(?<=[,;])\s+', current_chunk)
                current_chunk = ""
                for part in sub_parts:
                    if len(current_chunk) + len(part) + 1 <= max_length:
                        current_chunk += (" " if current_chunk else "") + part
                    else:
                        if current_chunk:
                            chunks.append(current_chunk)
                        current_chunk = part
        # Flush whatever remains of this paragraph.
        if current_chunk:
            chunks.append(current_chunk)
    return chunks

# Initialize the model.
# We use bfloat16 and flash_attention_2 for fast inference on the A100.
model = Qwen3TTSModel.from_pretrained(
    "Qwen/Qwen3-TTS-12Hz-1.7B-VoiceDesign",
    device_map="cuda:0",
    dtype=torch.bfloat16,
    attn_implementation="flash_attention_2",
)

# Read text from file.
try:
    with open("input.txt", "r", encoding="utf-8") as f:
        text_content = f.read()
except FileNotFoundError:
    print("Please create an 'input.txt' file with the text you want to read.")
    text_content = "This is a default test sentence because the input file was not found."

# Chunk the text.
chunks = split_text_smart(text_content, max_length=100)
print(f"Split text into {len(chunks)} chunks.")

all_wavs = []
sr = None
for i, chunk in enumerate(chunks):
    print(f"Processing chunk {i+1}/{len(chunks)}...")
    # The 'instruct' parameter defines the voice persona.
    wavs, current_sr = model.generate_voice_design(
        text=chunk,
        language="English",
        instruct="A warm, clear, and professional female voice suitable for blog reading and storytelling. The tone is engaging and narrative.",
    )
    if sr is None:
        sr = current_sr
    # generate_voice_design returns a list of wavs (one per input text);
    # we passed a single text, so take the first element.
    all_wavs.append(wavs[0])

# Concatenate all audio segments.
if all_wavs:
    final_wav = np.concatenate(all_wavs)
    sf.write("output_narrative.wav", final_wav, sr)
    print("Audio generation complete: output_narrative.wav")
How It Works
- Voice Design: The magic happens in the `instruct` parameter. By changing the string "A warm, clear, and professional female voice...", you can generate entirely different personas without retraining the model (see the sketch after this list).
- Smart Splitter: The `split_text_smart` function ensures that we don't cut words in half. It prioritizes paragraph breaks, then sentence breaks, and finally clause breaks (commas/semicolons) to fit within the `max_length` limit.
- Flash Attention: Using `flash_attention_2` significantly speeds up generation on compatible GPUs like the A100.
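As a quick illustration of the Voice Design point, here is a minimal sketch that reuses the script's generate_voice_design call with a different instruct string. The persona description is just an example, and it assumes `model` and `sf` are already loaded from the script above:

# Same API call as in the main script; only the persona description changes.
wavs, sr = model.generate_voice_design(
    text="Welcome back. Today we have a very special guest.",
    language="English",
    instruct="A deep, gravelly male voice with a slow, dramatic, "
             "movie-trailer style delivery.",
)
sf.write("persona_test.wav", wavs[0], sr)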
