Phonemize

Phonemize is a modern, multilingual grapheme-to-phoneme (G2P) conversion library powered by state-of-the-art Transformer models. This library delivers production-grade accuracy, lightning-fast inference, and seamless integration into text-to-speech (TTS) pipelines, speech synthesis engines, and other voice-related applications. Whether you're building real-time TTS systems or offline speech processing pipelines, Phonemize provides the precision and performance you need.

Quick Access

Installation – Get started in seconds
Quickstart – Basic usage example
Training Guide – Train custom models
Advanced Inference – Detailed predictions & confidence scores
Configuration – Tune model parameters
Troubleshooting – Common issues & solutions
API Reference – Complete API documentation
Pre-trained Models – Download trained models
Performance Tips – Speed optimization

Key Features

Easy-to-use API: Intuitive Python interface for both training and inference tasks. Get started with just a few lines of code.
Multilingual Support: Train and deploy a single unified model across multiple languages without architectural changes.
High Performance: Achieve millisecond-level inference latency (roughly 1-2 ms per word on GPU for the Forward Transformer) with Transformer-based predictions.
Custom Training: Train your own specialized models with minimal setup using simple YAML configuration files.
Optimized for TTS: Purpose-built for real-time streaming, offline batch processing, and production text-to-speech systems.
TorchScript Export: Export trained models to TorchScript for hardware-accelerated inference and deployment.

Installation

Inference Only

For using pre-trained models without training capabilities:

pip install phonemize

Full Installation (with Training Support)

To enable model training and all dependencies:

pip install 'phonemize[train]'

Requirements: Python 3.8 or later, PyTorch 1.9+

Quickstart

Basic Usage: Inference with Pre-trained Model

Load a pre-trained model and convert text to phonemes in just a few lines:

import torch
from phonemize import Phonemizer

# Import custom classes for safe model loading
from phonemize.preprocessing.text import Preprocessor, LanguageTokenizer, SequenceTokenizer

# Register classes for secure deserialization.
# add_safe_globals only exists in recent PyTorch releases, so guard the call.
if hasattr(torch.serialization, "add_safe_globals"):
    torch.serialization.add_safe_globals([Preprocessor, LanguageTokenizer, SequenceTokenizer])

# Load pre-trained checkpoint
phonemizer = Phonemizer.from_checkpoint("phonemize_m1.pt")

# Convert text to phonemes (lang must be a language the model was trained on)
result = phonemizer("Arcosoph's quixotic insights empower enthusiasts.", lang="en_us")

# Display results
print(result)

Output:

The model converts graphemes (written text) to phonemes (speech sounds) using the ARPAbet notation:

[AA][R][K][AH][S][AO][F][S] [K][W][IH][K][S][AA][T][IH][K] [IH][N][S][AY][T][S] [IH][M][P][AW][ER] [EH][N][TH][UW][Z][IY][AE][S][T][S].

Each [XX] represents a single phonetic unit. This output can be directly fed into TTS vocoders or speech synthesis systems.
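
Downstream code often needs the individual tokens rather than the bracketed string. The following is a minimal sketch, assuming the output keeps the [XX] token format and space-separated words shown above. split_phoneme_string is a hypothetical helper written for illustration and is not part of the Phonemize API.

```python
import re
from typing import List

# Matches one bracketed token such as [AA] and captures the symbol inside.
TOKEN_PATTERN = re.compile(r"\[([^\[\]]+)\]")


def split_phoneme_string(phonemes: str) -> List[List[str]]:
    """Split a bracketed phoneme string into one token list per word.

    Hypothetical helper: assumes words are separated by whitespace and
    every phoneme is wrapped in square brackets.
    """
    words = []
    for chunk in phonemes.split():
        tokens = TOKEN_PATTERN.findall(chunk)
        if tokens:  # skip chunks that contain only punctuation
            words.append(tokens)
    return words


print(split_phoneme_string("[IH][N][S][AY][T][S] [IH][M][P][AW][ER]."))
```

Trailing punctuation is dropped by this sketch; keep it separately if the downstream synthesizer uses it for prosody.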

Model Architectures

Phonemize supports two distinct Transformer-based architectures, each with different strengths:

Architecture	Type	Speed	Quality	Use Case
Forward Transformer	Encoder-Decoder (seq2seq)	Fast ⚡	Excellent 95%+	Real-time TTS, low-latency inference
Autoregressive Transformer	Decoder-only (causal)	Slower	Slightly Higher 96%+	Research, domain-specific fine-tuning

Quick Decision Guide:

Use Forward for production TTS systems
Use Autoregressive for research or when highest accuracy is needed

Training Your Own Model

Create custom G2P models tailored to your specific language or domain. The training pipeline handles preprocessing, model creation, and distributed training.

Training Pipeline

Prepare Data: Collect grapheme-phoneme pairs in (language, grapheme, phoneme) format
Configure: Define model architecture, hyperparameters, and preprocessing in a YAML config file
Preprocess: Tokenize and prepare datasets with vocabulary and data splits
Train: Launch training with optional multi-GPU support and checkpoint resumption
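
The data preparation step can be sketched as follows. This assumes the pairs are stored as tab-separated lines with language, grapheme, and phoneme columns; both that file layout and the load_pairs helper are illustrative assumptions, not something Phonemize prescribes.

```python
import csv
import io
from typing import Iterable, List, Tuple


def load_pairs(lines: Iterable[str]) -> List[Tuple[str, str, str]]:
    """Read (language, grapheme, phoneme) tuples from tab-separated lines.

    Hypothetical helper: rejects rows that do not have exactly three
    columns so malformed entries surface before preprocessing.
    """
    reader = csv.reader(lines, delimiter="\t", quoting=csv.QUOTE_NONE)
    pairs = []
    for row_number, row in enumerate(reader, start=1):
        if not row or all(not field.strip() for field in row):
            continue  # ignore blank lines
        if len(row) != 3:
            raise ValueError(f"line {row_number}: expected 3 columns, got {len(row)}")
        lang, grapheme, phoneme = (field.strip() for field in row)
        pairs.append((lang, grapheme, phoneme))
    return pairs


# In-memory sample; for a real file, pass open(path, encoding="utf-8", newline="").
sample = io.StringIO("en_us\tyoung\tjʌŋ\nde\tbenützten\tbənʏt͡stn̩\n")
train_data = load_pairs(sample)
print(train_data)
```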

Example: Training a Multilingual Model

from phonemize.preprocess import preprocess
from phonemize.train import train

# Prepare multilingual training data (language_code, grapheme, phoneme)
# Use IPA (International Phonetic Alphabet) for phonemes
train_data = [
    ("en_us", "young", "jʌŋ"),
    ("de", "benützten", "bənʏt͡stn̩"),
    ("fr", "bonjour", "bɔ̃ʒuʁ")
] * 1000

# Define validation set for monitoring model performance
val_data = [
    ("en_us", "young", "jʌŋ"),
    ("de", "benützten", "bənʏt͡stn̩"),
    ("fr", "bonjour", "bɔ̃ʒuʁ")
] * 100

# Load configuration (defines model architecture, training parameters, etc.)
config_file = "configs/forward.yaml"

# Step 1: Preprocess and tokenize data
preprocess(
    config_file=config_file,
    train_data=train_data,
    val_data=val_data,
    deduplicate_train_data=False  # Keep duplicates for balanced multilingual training
)

# Step 2: Train model with distributed support (rank=0 for single GPU)
train(rank=0, num_gpus=1, config_file=config_file)

Resume Training from Checkpoint

If training is interrupted, resume from the last checkpoint:

from phonemize.train import train

# Resume training from checkpoint
train(
    rank=0,
    num_gpus=1,
    config_file="configs/forward.yaml",
    checkpoint_file="checkpoints/latest_model.pt"  # Path to checkpoint
)

Multi-GPU Training (Distributed Data Parallel)

For faster training on multiple GPUs, use DDP configuration:

from phonemize.train import train

# Train on 4 GPUs with automatic distribution
train(rank=0, num_gpus=4, config_file="configs/forward.yaml")

Update your config file with DDP settings:

training:
  ddp_backend: 'nccl'     # Use 'gloo' on CPU-only systems
  ddp_host: 'localhost'   # Hostname for multi-node training
  ddp_port: 12355         # Port for DDP communication

Output: Model checkpoints are saved to the checkpoint_dir specified in the config. Monitor training with TensorBoard.

Advanced Inference

Basic Inference with Customization

Perform phonemization with optional parameters for punctuation handling and acronym expansion:

from phonemize import Phonemizer

# Load custom trained model
phonemizer = Phonemizer.from_checkpoint("checkpoints/best_model.pt")

# Single text inference with default settings
phonemes = phonemizer("Phonemizing text is simple!", lang="en_us")
print(phonemes)  # Output: [F][O][N][EH][M][IH][Z][IH][NG] ...

# Batch inference for better performance
texts = ["Hello world", "How are you?", "Testing Phonemize"]
results = phonemizer(texts, lang="en_us", batch_size=32)

# Custom punctuation handling
custom_punct = ".,!?;"
result = phonemizer(
    "Hello, world!",
    lang="en_us",
    punctuation=custom_punct,
    expand_acronyms=True,  # Expands "DIY" -> "D-I-Y"
    batch_size=8
)

Inference Parameters:

text (str | List[str]): Input text or list of texts to phonemize
lang (str): Target language code (e.g., 'en_us', 'de', 'fr') - must match training languages
punctuation (str): Characters to split on [default: '().,:?!/–']
expand_acronyms (bool): Auto-expand acronyms like "U.S.A" [default: True]
batch_size (int): Inference batch size for GPU efficiency [default: 8]

Detailed Results with Confidence Scores

Access comprehensive predictions including confidence metrics and token probabilities:

# Get detailed predictions
result = phonemizer.phonemise_list(
    ["Phonemizing text is simple!"],
    lang="en_us"
)

# Access detailed per-word predictions
for word, pred in result.predictions.items():
    print(f"Word: '{word}'")
    print(f"  Phonemes:  {pred.phonemes}")
    print(f"  Tokens:    {pred.phoneme_tokens}")
    print(f"  Confidence: {pred.confidence:.3f}")
    print(f"  Token Probs: {[f'{p:.2f}' for p in pred.token_probs]}")

# Access full result structure
print(f"Original texts: {result.text}")
print(f"Split text: {result.split_text}")     # Words separated by punctuation
print(f"Phoneme output: {result.phonemes}")   # Concatenated phoneme strings
print(f"Split phonemes: {result.split_phonemes}")  # Phonemes per word

Result Object Properties:

text: Original input texts (List[str])
phonemes: Final phoneme strings (List[str])
split_text: Texts split by punctuation (List[List[str]])
split_phonemes: Corresponding phonemes per split (List[List[str]])
predictions: Dict mapping words → Prediction objects with confidence scores

Prediction Object Properties:

word: Original word
phonemes: Predicted phoneme string
phoneme_tokens: Individual phoneme tokens (with markers)
confidence: Overall confidence score (0.0-1.0)
token_probs: Per-token probabilities

Use Cases: Confidence scores identify uncertain predictions for manual review, retraining, or fallback to dictionary lookup.
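
The fallback idea can be sketched without loading a model. The example below uses a stand-in dataclass that mirrors only the word, phonemes, and confidence fields listed above. The 0.8 threshold, the lexicon dictionary, and the route_by_confidence helper are illustrative assumptions rather than library features.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass
class WordPrediction:
    """Stand-in for a prediction object (word, phonemes, confidence only)."""
    word: str
    phonemes: str
    confidence: float


def route_by_confidence(
    predictions: Dict[str, WordPrediction],
    lexicon: Dict[str, str],
    threshold: float = 0.8,  # assumed cut-off; tune it on held-out data
) -> Tuple[Dict[str, str], List[str]]:
    """Accept confident predictions, fall back to a lexicon, flag the rest."""
    accepted: Dict[str, str] = {}
    needs_review: List[str] = []
    for word, pred in predictions.items():
        if pred.confidence >= threshold:
            accepted[word] = pred.phonemes
        elif word.lower() in lexicon:
            accepted[word] = lexicon[word.lower()]  # dictionary fallback
        else:
            needs_review.append(word)  # queue for manual review or retraining
    return accepted, needs_review


predictions = {
    "hello": WordPrediction("hello", "[HH][AH][L][OW]", 0.97),
    "quixotic": WordPrediction("quixotic", "[K][W][IH][K][S]", 0.41),
    "Arcosoph": WordPrediction("Arcosoph", "[AA][R][K]", 0.35),
}
lexicon = {"quixotic": "[K][W][IH][K][S][AA][T][IH][K]"}
accepted, needs_review = route_by_confidence(predictions, lexicon)
print(accepted)
print(needs_review)
```

With a real result object, the same function could be applied to result.predictions as long as those objects expose the same three attributes.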

Configuration Guide

All training parameters are defined in YAML configuration files. Phonemize includes example configs for different model types.

Forward Transformer Config (Recommended for Production)

Use configs/forward_config.yaml for fast, high-quality models:

model:
  type: 'transformer'  # Fast seq2seq model
  d_model: 512         # Model dimension
  d_fft: 1024          # Hidden dimension of the feed-forward layer
  layers: 6            # Number of transformer layers
  dropout: 0.1
  heads: 4             # Attention heads

preprocessing:
  languages: ['en_us', 'de', 'fr']  # Add your languages here!
  text_symbols: 'abcdefghijklmnopqrstuvwxyz'  # Supported graphemes
  phoneme_symbols: ['a', 'e', 'i', 'ə', 'ŋ', ...]  # Supported phonemes
  char_repeats: 3      # Max phonemes per grapheme
  lowercase: true
  n_val: 5000          # Validation split size

training:
  learning_rate: 0.0001
  warmup_steps: 10000
  batch_size: 32
  epochs: 500
  generate_steps: 10000      # Eval interval
  validate_steps: 10000
  checkpoint_steps: 100000   # Save interval

Autoregressive Transformer Config (Research/Custom)

Use configs/autoreg_config.yaml for slightly higher accuracy at the cost of speed:

model:
  type: 'autoreg_transformer'  # Slower but potentially higher accuracy
  d_model: 512
  layers: 4            # Usually fewer layers for autoreg
  heads: 4

preprocessing:
  char_repeats: 1      # MUST be 1 for autoregressive models
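
A configuration mistake such as a wrong char_repeats value is cheaper to catch before preprocessing starts. The sketch below checks a config that has already been parsed into a dictionary (for example with a YAML parser). check_config is a hypothetical helper, and the d_model and heads divisibility check assumes standard multi-head attention, where the model dimension is split evenly across heads.

```python
from typing import Any, Dict, List


def check_config(config: Dict[str, Any]) -> List[str]:
    """Return human-readable problems found in a parsed config dict.

    Hypothetical pre-flight check; it covers only the constraints
    mentioned in this guide and is not part of Phonemize.
    """
    problems: List[str] = []
    model = config.get("model", {})
    preprocessing = config.get("preprocessing", {})

    # Autoregressive models require char_repeats to be exactly 1.
    char_repeats = preprocessing.get("char_repeats")
    if model.get("type") == "autoreg_transformer" and char_repeats not in (None, 1):
        problems.append(
            f"char_repeats is {char_repeats}, but must be 1 for autoreg_transformer"
        )

    # Standard multi-head attention splits d_model evenly across heads.
    d_model, heads = model.get("d_model"), model.get("heads")
    if d_model and heads and d_model % heads != 0:
        problems.append(f"d_model ({d_model}) is not divisible by heads ({heads})")

    return problems


config = {
    "model": {"type": "autoreg_transformer", "d_model": 512, "layers": 4, "heads": 4},
    "preprocessing": {"char_repeats": 3},
}
for problem in check_config(config):
    print("Config problem:", problem)
```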

Configuration Best Practices

For Your Languages:

preprocessing:
  languages: ['en_us', 'de', 'fr', 'es']  # Add all target languages
  text_symbols: 'abcdefghijklmnopqrstuvwxyzäöüàâé...'  # Include all chars in data

For More Accuracy (slower):

model:
  d_model: 768         # Increase model size
  layers: 8            # More transformer layers
  dropout: 0.2

training:
  learning_rate: 0.00005  # Lower LR for stability
  warmup_steps: 20000     # Longer warmup

For Faster Training:

model:
  d_model: 256         # Smaller model
  layers: 3            # Fewer layers
  heads: 2

training:
  batch_size: 64         # Larger batches
  generate_steps: 20000  # Less frequent evaluation (example config uses 10000)

Common Issues & Troubleshooting

Language Not Supported Error

# ❌ Error: Language not supported
phonemizer("Hello", lang="pt")  # Portuguese not in training languages

# ✅ Solution: Use a language from model's training set
# Check available languages in the config file
phonemizer("Hello", lang="en_us")  # Use supported language

Model Not Improving During Training

# Increase model capacity
model:
  d_model: 768
  layers: 8

# Improve learning rate schedule
training:
  learning_rate: 0.0001
  warmup_steps: 20000
  scheduler_plateau_patience: 15

Out of Memory (OOM) Error

# ✅ For training: reduce batch_size in the config first
# training:
#   batch_size: 16  # Reduced from 32

# Then restart training with the updated config
train(rank=0, num_gpus=1, config_file="configs/forward.yaml")

# ✅ For inference: reduce batch_size
result = phonemizer("text", lang="en_us", batch_size=4)  # Was 8
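
For inference, the batch size can also be lowered automatically. The sketch below retries a batch function with half the batch size whenever it raises an out-of-memory RuntimeError, which is how PyTorch reports CUDA memory exhaustion. run_with_backoff and the fake_inference stand-in are hypothetical names used for illustration only.

```python
from typing import Callable, List, Sequence, Tuple


def run_with_backoff(
    run_batch: Callable[[Sequence[str], int], List[str]],
    texts: Sequence[str],
    batch_size: int = 32,
    min_batch_size: int = 1,
) -> Tuple[List[str], int]:
    """Call run_batch, halving batch_size after each out-of-memory error.

    Returns the results together with the batch size that succeeded.
    Errors that are not memory related are re-raised unchanged.
    """
    while True:
        try:
            return run_batch(texts, batch_size), batch_size
        except RuntimeError as err:
            out_of_memory = "out of memory" in str(err).lower()
            if not out_of_memory or batch_size <= min_batch_size:
                raise
            batch_size = max(min_batch_size, batch_size // 2)


# Stand-in for a model call that only fits batches of 8 or fewer.
def fake_inference(texts: Sequence[str], batch_size: int) -> List[str]:
    if batch_size > 8:
        raise RuntimeError("CUDA out of memory (simulated)")
    return [text.upper() for text in texts]


results, used_batch_size = run_with_backoff(fake_inference, ["hello", "world"])
print(results, used_batch_size)
```

With a loaded model, run_batch could be a small wrapper such as lambda texts, bs: phonemizer(texts, lang="en_us", batch_size=bs).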

Low Accuracy on Custom Language

preprocessing:
  # Include ALL characters from your training data
  text_symbols: 'abcdefghijklmnopqrstuvwxyzäöü...'  # Add special chars!

  # Include ALL phonemes in your dataset
  phoneme_symbols: ['a', 'b', 'c', ..., 'ə', 'ŋ']
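
Missing symbols are easy to find programmatically. The sketch below compares the characters used in the training tuples against the configured symbol sets. missing_symbols is a hypothetical helper, and it treats every Unicode code point as one symbol, so combining marks such as the IPA tie bar are reported separately from their base character.

```python
from typing import Iterable, List, Sequence, Tuple


def missing_symbols(
    pairs: Iterable[Tuple[str, str, str]],
    text_symbols: Sequence[str],
    phoneme_symbols: Sequence[str],
    lowercase: bool = True,
) -> Tuple[List[str], List[str]]:
    """Return graphemes and phonemes that occur in the data but not in the config.

    Assumes one code point per symbol; adapt the splitting if your
    phoneme inventory uses multi-character symbols.
    """
    known_text = set(text_symbols)
    known_phonemes = set(phoneme_symbols)
    seen_text, seen_phonemes = set(), set()
    for _lang, grapheme, phoneme in pairs:
        seen_text.update(grapheme.lower() if lowercase else grapheme)
        seen_phonemes.update(phoneme)
    return sorted(seen_text - known_text), sorted(seen_phonemes - known_phonemes)


data = [("de", "Benützten", "bənʏtstn"), ("en_us", "young", "jʌŋ")]
text_gaps, phoneme_gaps = missing_symbols(
    data,
    text_symbols="abcdefghijklmnopqrstuvwxyz",
    phoneme_symbols=["b", "ə", "n", "t", "s", "j", "ŋ"],
)
print("Add to text_symbols:", text_gaps)
print("Add to phoneme_symbols:", phoneme_gaps)
```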

TorchScript Export (Production Deployment)

import torch
from phonemize import Phonemizer

# Load trained checkpoint
phonemizer = Phonemizer.from_checkpoint("checkpoints/best_model.pt")

# Convert to TorchScript (JIT compilation)
scripted_model = torch.jit.script(phonemizer.predictor.model)
phonemizer.predictor.model = scripted_model

# Save for production deployment
torch.jit.save(scripted_model, "phonemizer_scripted.pt")

# Run optimized inference (lang is required, as in the other examples)
result = phonemizer("Running the optimized TorchScript model!", lang="en_us")
print(result)

Benefits:

✅ The scripted model can be loaded without the Python model code (for example from C++ via LibTorch)
✅ Lower latency (30-50% speedup)
✅ Portable across platforms
✅ Memory efficient

Performance & Optimization

Inference Performance Tips

import torch
from phonemize import Phonemizer

phonemizer = Phonemizer.from_checkpoint("phonemize_m1.pt")

# ✅ Batch Processing (5-10x faster than single items)
texts = ["word1", "word2", "word3"] * 100
results = phonemizer(texts, lang="en_us", batch_size=32)

# ❌ Avoid: Processing one word at a time
# for word in texts:
#     result = phonemizer(word, lang="en_us")  # Slow!

# ✅ Adjust batch size based on GPU memory
# Small GPU: batch_size=4-8
# Medium GPU (6GB): batch_size=16-32
# Large GPU (24GB+): batch_size=64-128

# ✅ Use TorchScript for maximum speed
scripted = torch.jit.script(phonemizer.predictor.model)
phonemizer.predictor.model = scripted

Training Performance Tips

# ✅ For faster training iterations
training:
  batch_size: 64           # Larger batches = faster epoch
  generate_steps: 20000    # Less frequent evaluation (example config uses 10000)
  validate_steps: 10000

# ✅ Use Multi-GPU training for large datasets
# In training code: train(rank=0, num_gpus=4, config_file=...)

# ✅ For production models: increase compute
model:
  d_model: 768             # Larger model = better accuracy
  layers: 8
  heads: 8

Benchmark Results

Model	Inference Speed	Throughput	Accuracy
Forward Transformer (GPU)	~1-2 ms/word	500-1000 words/sec	98.5%+
Autoregressive (GPU)	~5-10 ms/word	100-200 words/sec	99%+
TorchScript (GPU)	~0.5-1 ms/word	1000-2000 words/sec	Same as original
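
The two speed columns are reciprocal: 1-2 ms per word corresponds to 500-1000 words per second. To reproduce such numbers on your own hardware, a simple timing harness is enough. The sketch below times any callable that takes a list of words. measure_throughput is a hypothetical helper, and the uppercase lambda only stands in for a real model call.

```python
import time
from typing import Callable, Dict, Sequence


def measure_throughput(
    fn: Callable[[Sequence[str]], object],
    words: Sequence[str],
    repeats: int = 3,
) -> Dict[str, float]:
    """Time fn(words) several times and report the best run.

    Returns milliseconds per word and words per second, which are two
    views of the same measurement (ms_per_word * words_per_sec == 1000).
    """
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        fn(words)
        best = min(best, time.perf_counter() - start)
    best = max(best, 1e-9)  # guard against a zero reading on coarse clocks
    return {
        "ms_per_word": 1000.0 * best / len(words),
        "words_per_sec": len(words) / best,
    }


# Stand-in workload; replace the lambda with a real batch inference call.
stats = measure_throughput(lambda ws: [w.upper() for w in ws], ["word"] * 1000)
print(f"{stats['ms_per_word']:.4f} ms/word, {stats['words_per_sec']:.0f} words/sec")
```

When timing GPU inference, run one warm-up call first so that model loading and kernel compilation do not distort the first measurement.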

API Reference

Phonemizer Class

from phonemize import Phonemizer

# Load from checkpoint
phonemizer = Phonemizer.from_checkpoint(checkpoint_path)

# Phonemize single text
result = phonemizer(
    text="Hello world",
    lang="en_us",
    punctuation="().,:?!/–",  # Custom punctuation
    expand_acronyms=True,       # Expand acronyms
    batch_size=8                # Batch size for inference
)  # Returns: str

# Phonemize batch of texts
results = phonemizer(
    text=["Text 1", "Text 2"],
    lang="en_us"
)  # Returns: List[str]

# Get detailed predictions
result_obj = phonemizer.phonemise_list(
    texts=["Hello"],
    lang="en_us",
    punctuation="().,:?!/–",
    expand_acronyms=True,
    batch_size=8
)  # Returns: PhonemizerResult

Preprocess Function

from phonemize.preprocess import preprocess

preprocess(
    config_file="configs/forward.yaml",          # Config path
    train_data=[("en_us", "word", "wɜrd"), ...], # Training data
    val_data=[("en_us", "word", "wɜrd"), ...],   # Validation data (optional)
    deduplicate_train_data=False                 # Keep duplicates
)

Train Function

from phonemize.train import train

train(
    rank=0,                              # GPU rank (0 for single GPU)
    num_gpus=1,                          # Number of GPUs
    config_file="configs/forward.yaml",  # Config path
    checkpoint_file=None                 # Resume from checkpoint (optional)
)

Pre-trained Models

High-quality pre-trained models ready for immediate use:

Model	Language	Dataset	Accuracy	Arch	Version	Size
phonemize_m1	🇺🇸 English (US)	CMUDict	98.5%+	Forward	0.1.0	~60MB

Tested on: CMU Dictionary, diverse English vocabulary including technical terms, names, and phonetic anomalies.

Models are optimized for the phonemize library and saved with PyTorch 1.9+.

Roadmap

Support for additional languages (Mandarin, Japanese, Arabic)
Real-time streaming inference API
ONNX model export for maximum portability
REST API server for easy deployment
Fine-tuning APIs for domain-specific models

Acknowledgments

Phonemize is inspired by DeepPhonemizer and has been completely refactored for modern Python, better performance, and improved usability. Special thanks to the open-source speech processing community.

License & Compatibility

License: MIT License - Free for commercial and personal use
Python Support: 3.8, 3.9, 3.10, 3.11, 3.12, 3.13
Repository: github.com/arcosoph/phonemize

Questions or Issues? Join our Discord community or open an issue on GitHub.