Try Qwen3 TTS Voice Design on Google Colab

2026-01-24
4 min
Author: Z.Shinchven

The Qwen3-TTS-12Hz-1.7B-VoiceDesign model is a fascinating new entry in the text-to-speech landscape. Unlike traditional TTS systems that rely on fixed speaker IDs, this model allows for Voice Design—you can describe the voice you want using natural language (e.g., "A warm, clear, and professional female voice"), and the model generates it.

It supports 10 major languages (including English, Chinese, and Japanese) and offers fine-grained control over emotion and prosody. For more details, check out the official documentation.

In this guide, we'll walk through how to run this model on Google Colab to generate long-form audio narrations.

Quick Start

1. Environment Setup

Copy and run the following commands in a code cell on Google Colab to set up the environment. We recommend using an A100 GPU runtime for the best performance.

# Note: `conda activate` has no effect on Colab (each `!` command runs in
# its own shell, and conda isn't installed by default), so install the
# packages directly into the default runtime with pip.
!pip install -U qwen-tts
!pip install -U flash-attn --no-build-isolation
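
Optionally, verify that a GPU is actually attached to the runtime before loading the model (PyTorch comes preinstalled on Colab):

import torch
assert torch.cuda.is_available(), "No GPU attached; switch the runtime type."
print(torch.cuda.get_device_name(0))  # e.g. "NVIDIA A100-SXM4-40GB"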

2. Smart Chunking & Generation Script

Paste the following script into a new code cell. This script handles long-form text by splitting it into manageable chunks while preserving natural sentence structures.

Create a file named input.txt (using the Colab file explorer on the left) with your text content before running this script.
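
You can also create the file straight from a notebook cell with IPython's %%writefile magic (replace the placeholder with your own text):

%%writefile input.txt
Replace this placeholder with the text you want narrated.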

import torch
import soundfile as sf
import numpy as np
import re
from qwen_tts import Qwen3TTSModel

def split_text_smart(text, max_length=100):
    """
    Splits text into chunks of at most max_length characters,
    preserving sentence structure where possible.
    """
    # 1. Split by paragraphs (double newlines)
    paragraphs = [p.strip() for p in text.split('\n\n') if p.strip()]

    chunks = []

    for para in paragraphs:
        if len(para) <= max_length:
            chunks.append(para)
            continue

        # 2. If the paragraph is too long, split it into sentences: break after
        # ., !, or ? (optionally followed by a closing quote) whenever the next
        # sentence starts with a capital letter.
        sentences = re.split(r'(?<=[.!?])\s+(?=[A-Z])|(?<=[.!?]["”])\s+(?=[A-Z])', para)

        current_chunk = ""
        for sentence in sentences:
            if len(current_chunk) + len(sentence) + 1 <= max_length:
                current_chunk += (" " if current_chunk else "") + sentence
            else:
                if current_chunk:
                    chunks.append(current_chunk)
                current_chunk = sentence

                # 3. If the new sentence itself is still too long, split it
                # further at commas/semicolons. Any leftover oversized clause
                # is kept as-is, so a chunk may occasionally exceed max_length.
                if len(current_chunk) > max_length:
                    sub_parts = re.split(r'(?<=[,;])\s+', current_chunk)
                    current_chunk = ""
                    for part in sub_parts:
                        if len(current_chunk) + len(part) + 1 <= max_length:
                            current_chunk += (" " if current_chunk else "") + part
                        else:
                            if current_chunk:
                                chunks.append(current_chunk)
                            current_chunk = part

        if current_chunk:
            chunks.append(current_chunk)

    return chunks

# Initialize the model
# We use bfloat16 and flash_attention_2 for optimization on A100
model = Qwen3TTSModel.from_pretrained(
    "Qwen/Qwen3-TTS-12Hz-1.7B-VoiceDesign",
    device_map="cuda:0",
    dtype=torch.bfloat16,
    attn_implementation="flash_attention_2",
)

# Read text from file
try:
    with open("input.txt", "r", encoding="utf-8") as f:
        text_content = f.read()
except FileNotFoundError:
    print("Please create an 'input.txt' file with the text you want to read.")
    text_content = "This is a default test sentence because the input file was not found."

# Chunk the text
chunks = split_text_smart(text_content, max_length=100)
print(f"Split text into {len(chunks)} chunks.")

all_wavs = []
sr = None

for i, chunk in enumerate(chunks):
    print(f"Processing chunk {i+1}/{len(chunks)}...")
    # The 'instruct' parameter defines the voice persona
    wavs, current_sr = model.generate_voice_design(
        text=chunk,
        language="English",
        instruct="A warm, clear, and professional female voice suitable for blog reading and storytelling. The tone is engaging and narrative.",
    )
    if sr is None:
        sr = current_sr

    # generate_voice_design returns a batch (list) of waveforms; we passed
    # a single text, so take the first element.
    all_wavs.append(wavs[0])

# Concatenate all audio segments
if all_wavs:
    final_wav = np.concatenate(all_wavs)
    sf.write("output_narrative.wav", final_wav, sr)
    print("Audio generation complete: output_narrative.wav")

How It Works

  1. Voice Design: The magic happens in the instruct parameter. By changing the string "A warm, clear, and professional female voice...", you can generate entirely different personas without retraining the model (see the example after this list).
  2. Smart Splitter: The split_text_smart function ensures that we don't cut words in half. It prioritizes paragraph breaks, then sentence breaks, and finally clause breaks (commas/semicolons) to fit within the max_length limit.
  3. Flash Attention: Using flash_attention_2 significantly speeds up generation on compatible GPUs like the A100.
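
For example, here is a quick sketch of swapping in a different persona, reusing the model and imports from the script above (the voice description is just an illustration):

# Same generate_voice_design call as in the main script; only the
# text and the instruct persona description change.
wavs, sr = model.generate_voice_design(
    text="And that, dear listener, is where our story truly begins.",
    language="English",
    instruct="A deep, calm male voice with a slow, theatrical storytelling cadence.",
)
sf.write("output_storyteller.wav", wavs[0], sr)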