NanowakeWord Configuration Guide

Complete documentation of all configurable parameters in the Nanowakeword package, including descriptions, default values, meanings, and usage examples.

Project & Data Paths
Model Architecture
Training & Optimization
Feature Manifest
Batch Composition
Data Generation
Augmentation Settings
Feature Generation Manifest
Advanced Settings
Pipeline Control
Intelligent Auto-Configuration
Inference Parameters

Project & Data Paths

Configuration parameters for project organization and data source locations.

`model_name`

Type: string
Default: Auto-generated based on model type (e.g., XXX_dnn_v1)
Description: Name of the trained model. Used for creating directories and organizing outputs.
Example:
```
model_name: "my_wakeword_A_v1"
```

`output_dir`

Type: string
Default: "./trained_models"
Description: Base directory where all trained models and artifacts will be stored.

Example:

output_dir: "./trained_models"
# With model_name "my_wakeword_v1", creates ./trained_models/my_wakeword_v1/model/ and ./trained_models/my_wakeword_v1/features/

`positive_data_path`

Type: string (file path)
Mandatory: Yes
Default: None
Description: Directory containing positive audio samples (actual wake word utterances).
Requirements:
- Must contain .wav files at 16 kHz sample rate
- Mono or stereo audio (will be converted to mono)
- The directory may be empty when training only on generated synthetic samples (the path itself must still be set)
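
The requirements above can be checked before training. The following is a minimal sketch using only the Python standard library; `check_wav_dir` is a hypothetical helper, not part of Nanowakeword, and the `wave` module reads PCM WAV files only.

```python
import wave
from pathlib import Path


def check_wav_dir(directory, expected_rate=16000):
    """Return (path, reason) pairs for .wav files that miss the stated requirements."""
    problems = []
    for path in sorted(Path(directory).glob("*.wav")):
        try:
            with wave.open(str(path), "rb") as wav:
                rate = wav.getframerate()
                channels = wav.getnchannels()
        except (wave.Error, EOFError) as exc:
            # Non-PCM or truncated files end up here
            problems.append((path, f"unreadable: {exc}"))
            continue
        if rate != expected_rate:
            problems.append((path, f"sample rate is {rate} Hz, expected {expected_rate} Hz"))
        if channels not in (1, 2):
            problems.append((path, f"{channels} channels (mono or stereo expected)"))
    return problems
```

An empty result means every readable .wav file in the directory is 16 kHz mono or stereo.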

`negative_data_path`

Type: string (file path)
Mandatory: Yes
Default: None
Description: Directory containing negative audio samples (non-wake-word utterances).

Example:

negative_data_path: "./data/common_words"

`background_paths`

Type: list of strings
Mandatory: No
Description: Directories containing background noise audio files for augmentation. Multiple paths supported.

Example:

background_paths: # One or more paths
  - "./data/office_noise"
  - "./data/street_noise"
  - "./data/home_noise"

`rir_paths`

Type: list of strings
Mandatory: No
Description: Directories containing Room Impulse Response (RIR) files for acoustic augmentation.
Note: At least one RIR path is required for intelligent configuration.

Model Architecture

Parameters controlling the neural network structure and behavior.

`model_type`

Type: string
Default: "dnn"
Valid Options: "dnn", "lstm", "gru", "rnn", "cnn", "transformer", "crnn", "tcn", "quartznet", "conformer", "e_branchformer", "custom"
Description: The neural network architecture to use for wake word detection.
Complexity Levels (from simplest to most complex):
- dnn - Dense feedforward network (lightweight, fast)
- cnn - Convolutional Neural Network (good for spectrograms)
- lstm, gru, rnn - Recurrent networks (excellent for sequences)
- crnn - Hybrid CNN-RNN (combines both strengths)
- transformer, conformer, e_branchformer - Advanced attention-based (most powerful, most complex)

Examples by use case:

# Embedded/Edge device (minimal resources)
model_type: "dnn"

# Edge device with more resources
model_type: "lstm"

# Desktop/cloud with ample resources
model_type: "conformer"

`layer_size` (DNN/RNN-based architectures)

Type: integer
Default: 128
Valid Range: 64 to 512
Description: Number of neurons in each hidden layer for feedforward and recurrent layers.
Relationship to model capacity: Larger values = more parameters = longer training, better performance (up to a point)

Example:

layer_size: 256  # Larger model, slower but potentially better

`n_blocks`

Type: integer
Default: 3
Valid Range: 1 to 10
Description: Number of stacked blocks/layers in the model.
- For dnn: Number of fully connected layers
- For lstm/gru: Number of recurrent layers
- For transformer: Number of encoder layers
- For crnn: Number of RNN layers (CNN part is fixed)
Example:
```
n_blocks: 5  # Deeper network
```

`dropout_prob`

Type: float
Default: 0.5 (intelligently adjusted)
Valid Range: 0.0 to 0.8
Description: Dropout probability per layer to prevent overfitting.
- Higher values = more regularization = potential underfitting
- Lower values = less regularization = potential overfitting
- Typically 0.2-0.5 for most models
Example:
```
dropout_prob: 0.3
```

`activation_function` (Advanced)

Type: string
Default: "relu"
Valid Options: "relu", "gelu", "silu"
Description: Activation function used in hidden layers.
- relu - Traditional, fast, widely supported
- gelu - Smooth, often better convergence
- silu - Modern alternative (Swish activation)
Example:
```
activation_function: "gelu"
```

`embedding_dim` (Advanced)

Type: integer
Default: 64
Valid Range: 32 to 256
Description: Dimensionality of the final embedding before classification.

Architecture-Specific Parameters

Transformer Architecture

model_type: "transformer"
transformer_d_model: 128        # Model dimension, default: 128
transformer_n_head: 4           # Number of attention heads, default: 4

CRNN Architecture

model_type: "crnn"
crnn_cnn_channels: [16, 32, 32]  # CNN channel progression, default: [16, 32, 32]
crnn_rnn_type: "lstm"             # "lstm" or "gru", default: "lstm"

TCN Architecture

model_type: "tcn"
tcn_channels: [64, 64, 128]      # Channel progression, default: [64, 64, 128]
tcn_kernel_size: 3                # Convolution kernel size, default: 3

Conformer Architecture

model_type: "conformer"
conformer_d_model: 144            # Model dimension, default: 144
conformer_n_head: 4               # Attention heads, default: 4

E-Branchformer Architecture

model_type: "e_branchformer"
branchformer_d_model: 144         # Model dimension, default: 144
branchformer_n_head: 4            # Attention heads, default: 4

QuartzNet Architecture

model_type: "quartznet"
quartznet_config:                 # Channel, kernel, repeat config
  - [256, 33, 1]
  - [256, 33, 1]
  - [512, 39, 1]

Custom Architecture

Type: string
Value: "custom"
Description: Load a user-defined torch.nn.Module class from a Python file or installed module.
Required Settings:
- custom_model_config.module_path
- custom_model_config.class_name
Optional Settings:
- custom_model_config.params
Custom model requirements:
- The class must inherit from torch.nn.Module
- It should return an embedding tensor shaped [batch_size, embedding_dim]
- It may accept the following standard constructor arguments:
  - input_shape
  - embedding_dim
  - dropout_prob
  - activation_fn
  - config
- Additional custom parameters may be provided via params
Example:

Create a Python file that defines the model class:

import torch
from torch import nn


class MyCustomModel(nn.Module):

    def __init__(
        self,
        input_shape,
        embedding_dim=64,
        dropout_prob=0.5,
        activation_fn=None,
        config=None,
        hidden_channels=32,
    ):
        super().__init__()
        self.input_shape = input_shape
        self.embedding_dim = embedding_dim
        self.activation_fn = activation_fn if activation_fn is not None else nn.ReLU()
        # Build CNN feature extractor (no flatten/linear until we know conv output size)
        self.feature_extractor = nn.Sequential(
            nn.Conv2d(1, hidden_channels, kernel_size=3, padding=1),
            self.activation_fn,
            nn.MaxPool2d(kernel_size=2, stride=2),
            nn.Conv2d(hidden_channels, hidden_channels * 2, kernel_size=3, padding=1),
            self.activation_fn,
            nn.AdaptiveAvgPool2d((1, 1)),
        )
        # Determine flattened feature size by running a dummy tensor through the convs
        with torch.no_grad():
            dummy = torch.zeros(1, 1, *input_shape)
            conv_out = self.feature_extractor(dummy)
            flattened_size = int(conv_out.numel() // conv_out.shape[0])
        self.embedding_head = nn.Sequential(
            nn.Flatten(),
            nn.Linear(flattened_size, 128),
            self.activation_fn,
            nn.Dropout(dropout_prob),
            nn.Linear(128, embedding_dim),
        )

    def forward(self, x):
        # Expect input shaped [batch, time, features] or [batch, 1, time, features]
        if x.dim() == 3:
            x = x.unsqueeze(1)
        x = self.feature_extractor(x)
        x = self.embedding_head(x)
        return x

In your config:

model_type: "custom"
custom_model_config:
  module_path: "path/to/your/custom_model_architectures.py"
  class_name: "MyCustomModel"
  params:
    hidden_channels: 32

Important: module_path may be either a relative path to a Python file or an importable Python module name.
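
To illustrate how a file path or module name plus a class name can be resolved into a class, here is a minimal sketch based on `importlib`. The helper names `load_custom_class` and `build_custom_model` are hypothetical, and the package's actual loader may behave differently (for example in how it detects file paths or merges `params`).

```python
import importlib
import importlib.util
from pathlib import Path


def load_custom_class(module_path, class_name):
    """Resolve class_name from a .py file path or an importable module name."""
    if module_path.endswith(".py"):
        path = Path(module_path)
        spec = importlib.util.spec_from_file_location(path.stem, path)
        if spec is None or spec.loader is None:
            raise ImportError(f"Cannot load a module from {path}")
        module = importlib.util.module_from_spec(spec)
        spec.loader.exec_module(module)
    else:
        module = importlib.import_module(module_path)
    return getattr(module, class_name)


def build_custom_model(custom_model_config, **standard_args):
    """Instantiate the class; entries in params override the standard arguments."""
    cls = load_custom_class(
        custom_model_config["module_path"], custom_model_config["class_name"]
    )
    params = custom_model_config.get("params") or {}
    return cls(**{**standard_args, **params})
```

In this sketch a constructor that does not accept one of the passed arguments raises `TypeError`, which surfaces a mismatch between `params` and the class signature early.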

Training & Optimization

Parameters governing the training loop, optimization, and learning rate scheduling.

`steps`

Type: integer
Default: 20000 (intelligently adjusted based on data volume)
Valid Range: 1000 to 100000
Description: Total number of training iterations/steps.
Calculation Logic:
- base_steps = effective_data_volume (in hours) × 1000, i.e. about 1,000 steps per hour of data
- Adjusted based on data quality and model complexity
- Typically 10,000-40,000 for most scenarios
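
The base calculation can be sketched as follows. This is an illustration under stated assumptions: the function name is hypothetical, the result is clamped to the documented valid range, and the adjustments for data quality and model complexity are left out because their formulas are not given here.

```python
def estimate_steps(effective_hours, steps_per_hour=1000, lo=1_000, hi=100_000):
    """Base step count: about 1,000 steps per hour of data, clamped to the valid range."""
    base_steps = int(effective_hours * steps_per_hour)
    return max(lo, min(hi, base_steps))
```

Under these assumptions, 20 hours of effective data yields 20,000 steps, which matches the documented default.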

Example:

steps: 50000  # For very large/complex datasets

`batch_size`

Type: integer
Default: 128
Valid Range:
- Minimum: 1 (at least 1 sample per batch required)
- Maximum: Limited by GPU/CPU memory
- CPU training → 16–128+ typical
- single GPU → 32–256+ typical
- multi-GPU → 512+ possible
Description: Number of training samples per batch.
- Larger batches = faster training, more stable gradients, more memory
- Smaller batches = slower training, noisier gradients, less memory
Example:
```
batch_size: 128
```

`optimizer_type`

Type: string
Default: "adamw"
Valid Options: "adamw", "adam", "sgd"
Description: Optimization algorithm.
- adamw - Adaptive Moment Estimation with Weight decay (recommended)
- adam - Original adaptive optimizer
- sgd - Stochastic Gradient Descent (simple, slower convergence)
Example:
```
optimizer_type: "adamw"
```

`learning_rate_max`

Type: float
Default: Auto-calculated
Description: Maximum learning rate during training (used with cycle schedulers).
Intelligently Adjusted Based On:
- Dataset size (larger datasets → higher LR)
- Data noise levels (cleaner data → higher LR)
- Model complexity
Example:
```
learning_rate_max: 0.001
```

`learning_rate_base`

Type: float
Default: learning_rate_max / 10
Description: Minimum/base learning rate during cyclical scheduling.
Note: Automatically calculated if not specified.

`lr_scheduler_type`

Type: string
Default: "onecycle"
Valid Options: "onecycle", "cyclic", "cosine"
Description: Learning rate schedule strategy.
- onecycle - One cycle from base to max LR and back (good for fast convergence)
- cyclic - Multiple triangular cycles (good for exploration)
- cosine - Cosine annealing (smooth, gradual decrease)
Example:
```
lr_scheduler_type: "onecycle"
```

`clr_step_size_up` (Cyclic LR)

Type: integer
Default: Auto-calculated based on total steps
Description: Number of steps to increase LR in each cycle.

`clr_step_size_down` (Cyclic LR)

Type: integer
Default: Auto-calculated based on total steps
Description: Number of steps to decrease LR in each cycle.
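
One way to picture how clr_step_size_up and clr_step_size_down shape a cyclic schedule is a triangular wave between learning_rate_base and learning_rate_max. The sketch below only illustrates that shape; `triangular_lr` is a hypothetical name and this is not the package's scheduler.

```python
def triangular_lr(step, lr_base, lr_max, step_size_up, step_size_down):
    """Learning rate at a given step of a repeating triangular cycle."""
    cycle_len = step_size_up + step_size_down
    pos = step % cycle_len
    if pos < step_size_up:
        # Rising phase: lr_base -> lr_max
        frac = pos / step_size_up
    else:
        # Falling phase: lr_max -> lr_base
        frac = 1.0 - (pos - step_size_up) / step_size_down
    return lr_base + (lr_max - lr_base) * frac
```

With equal up and down sizes the cycle is symmetric; a longer `step_size_down` spends more of each cycle at decreasing learning rates.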

`weight_decay`

Type: float
Default: 0.01
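Description: Weight decay coefficient used to prevent overfitting. With adamw the decay is decoupled from the gradient update; with plain SGD it is equivalent to L2 regularization.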

`momentum` (SGD optimizer)

Type: float
Default: 0.9
Valid Range: 0.0 to 1.0
Description: Momentum factor for SGD optimizer.

`num_workers`

Type: integer
Default: 2
Valid Range: 0 to CPU_count
Description: Number of parallel workers for data loading.
- 0 = load data in the main process (slower, no multiprocessing)
- 2-4 = typical for most systems
- Increase for large datasets and fast GPUs

Feature Manifest

Defines paths to pre-computed audio feature files (.npy format) used for training.

Structure

feature_manifest: # Multiple sources can be listed per category
  targets:           # Positive samples (wake word)
    key1: "path/to/features.npy"
    # others.. 
  negatives:         # Negative samples (non-wake-words)
    key1: "path/to/negatives.npy"
    key2: "path/to/noise.npy" # Background noise samples
    # others..
  # Optional: Validation data (if _val key suffix used)
  targets_val:
    key1: "path/to/val_positive.npy"
  negatives_val:
    key1: "path/to/val_negatives.npy"
    key2: "path/to/val_noise.npy"

Key Naming Convention (keys are referenced by `batch_composition`)

- Keys within each category can be arbitrary unique identifiers
- Short keys are preferred for readability (e.g., t, n, b)
- Multiple feature sources can be specified with different keys (e.g., real_pos, bg2, hard_neg)

Example with Multiple Sources

feature_manifest:
  targets:
    t: "./trained_models/model_v1/features/positive.npy"
    my_voice: "./voice/muhammad_abid/muhammad_abid_data.npy"
    
  negatives:
    common_words: "./features/common_words.npy"
    hard_negatives: "./features/similar_words.npy"
    external_dataset: "./external/negatives_1m.npy"
    office: "./features/office_noise.npy"
    home: "./features/home_noise.npy"

Batch Composition

batch_composition defines how many feature samples are taken per training batch from the datasets specified in feature_manifest.

Each entry in batch_composition corresponds to a dataset or dataset group defined in feature_manifest.

batch_composition:
  target: 10
  n: 68
  hn: 10
  b: 40
  # others..

This means that each training batch will contain:

- 10 samples from the targets datasets (all datasets inside the targets group)
- 68 samples from the negatives.n dataset
- 10 samples from the negatives.hn dataset
- 40 samples from the negatives.b dataset
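
As a quick arithmetic check on the composition above, the per-batch counts can be summed and expressed as shares. This is only a worked example; whether the total must equal batch_size is not stated in this guide, although here it happens to match the default of 128.

```python
# Counts copied from the batch_composition example above
batch_composition = {"target": 10, "n": 68, "hn": 10, "b": 40}

# Total samples drawn per training batch
samples_per_batch = sum(batch_composition.values())

# Fraction of each batch that comes from each entry
shares = {key: count / samples_per_batch for key, count in batch_composition.items()}
```

Here `samples_per_batch` is 128 and positives make up 10/128 (about 7.8%) of each batch.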

Relationship with `feature_manifest`

batch_composition always uses the datasets defined in feature_manifest.

For example:

feature_manifest:
  targets:
    t: positive_features.npy

  negatives:
    n: negative_features.npy
    hn: hard_negative_features.npy
    b: noise_features.npy

The keys used in batch_composition must match the dataset keys or dataset groups defined in feature_manifest.

How Samples Are Selected

When a group name is used:

batch_composition:
  target: 10

the samples are randomly selected from all datasets inside the targets group.

For example:

targets:
  t1: dataset1.npy
  t2: dataset2.npy
  t3: dataset3.npy

Then:

target: 10

means:

- A total of 10 samples will be taken from the targets group
- Samples are selected randomly across all target datasets
- This does not mean exactly 10 from each dataset

Example distribution:

- t1 → 3 samples
- t2 → 4 samples
- t3 → 3 samples
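
Group-level selection can be pictured with the sketch below. It assumes that each draw picks a dataset with probability proportional to its size and then a random row from it, with replacement; the guide only says selection is random across the group, so the weighting is an assumption and `sample_from_group` is a hypothetical name.

```python
import random
from collections import Counter


def sample_from_group(group_sizes, n_samples, rng=None):
    """Draw (dataset_key, row_index) pairs across all datasets of one group."""
    rng = rng or random.Random()
    keys = list(group_sizes)
    weights = [group_sizes[key] for key in keys]
    # Pick a dataset for every sample, weighted by the number of rows it holds
    chosen = rng.choices(keys, weights=weights, k=n_samples)
    # Then pick a row inside the chosen dataset
    return [(key, rng.randrange(group_sizes[key])) for key in chosen]


# Row counts are made up for illustration
picks = sample_from_group({"t1": 300, "t2": 400, "t3": 300}, 10, random.Random(0))
per_dataset = Counter(key for key, _ in picks)
```

The counts in `per_dataset` always add up to 10, but the split between t1, t2 and t3 changes from batch to batch.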

Selecting From a Specific Dataset

To select samples from a specific dataset, use its dataset key:

batch_composition:
  t: 10

This means:

10 samples will be taken only from targets.t

because:

targets:
  t: positive_features.npy

Summary

- feature_manifest defines where the datasets are located
- batch_composition defines how many samples are taken from those datasets per batch
- Keys in batch_composition must match keys or groups in feature_manifest

Data Generation

Parameters for synthetic audio generation using Text-to-Speech (TTS).

The data generation stage is the central orchestrator for creating synthetic audio clips. It operates on a list of "generation tasks" defined in the main configuration file under the data_generation_tasks key. This task-based approach gives fine-grained control over the whole process and allows multiple, diverse datasets (e.g., positive, negative, validation) to be created in a single run.

Each task is an independent job that specifies what text to synthesize, how many samples to create, where to save them, and what Text-To-Speech (TTS) settings to use. This modularity empowers users to build complex and robust datasets tailored to their specific needs.

The primary workflow is as follows:

1. Load the list of tasks from the configuration.
2. Pre-load any globally required models (like the phonemizer) for efficiency.
3. Iterate through each enabled task.
4. For each task, determine the text source and generate the list of phrases to be synthesized.
5. Call the generate_samples utility to create the audio files.
6. Clear the GPU cache after heavy tasks to maintain performance.

Configuration Schema (data_generation_tasks): The data_generation_tasks key in your config file should be a list of dictionaries, where each dictionary represents a single task.

Task Keys:
- name (str): A descriptive name for the task (e.g., "Positive Wake Words").
- enabled (bool): If `False`, this task will be skipped. Defaults to `True`.
- output_dir (str): The path to the directory where audio clips will be saved.
- num_samples (int): The total number of audio clips to generate for this task.
- file_prefix (str): A prefix for the generated audio filenames (e.g., "pos_").
- tts_settings (dict, optional): Task-specific TTS settings that override the global `tts_settings`.
- text_source (dict): A dictionary defining the source of the text to be synthesized. This is the core of the task's logic.

The text_source Dictionary: This dictionary must contain a type key, which determines how the text is generated. Supported types are:

type: "fixed_phrase"
Generates audio for a single, repeated phrase. Ideal for positive wake word samples.
- phrase (str, optional): The exact phrase to use. If not provided, it falls back to the global target_phrase.

type: "from_list"
Generates audio from a user-provided list of phrases. Well suited to curated lists of negative samples.
- phrases (list[str]): A list of custom text phrases.
- repeat_each (int, optional): How many times to repeat each phrase in the list. Defaults to 1.

type: "auto_adversarial"
Generates phonetically similar but common English words/phrases. Useful for building a robust set of negative samples that challenge the model with real-world, confusable words.
- base_phrase (str, optional): The phrase to generate variations from. Falls back to the global target_phrase.
- Supports other keys like include_partial_phrase, max_multi_word_len, etc.

type: "phoneme_adversarial"
Generates nonsensical but phonetically very similar text by manipulating the phonemes of a base phrase. This creates very challenging negative samples that help reduce false activations.
- base_phrase (str, optional): The phrase to generate variations from. Falls back to the global target_phrase.
- min_distance (float, optional): Controls how different the generated phoneme strings are from the original. Defaults to 0.35.
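
The task filtering and the two simplest text sources can be sketched as below. Both function names are hypothetical and this is not the package's implementation. In particular, how num_samples interacts with a from_list source is not specified in this guide, so the sketch simply expands the list by repeat_each, and the adversarial types are left out because they depend on phoneme tooling.

```python
def enabled_tasks(tasks):
    """Keep tasks whose enabled flag is true; the flag defaults to True."""
    return [task for task in tasks if task.get("enabled", True)]


def phrases_for_task(task, target_phrase):
    """Expand a task's text_source into the list of phrases to synthesize."""
    source = task["text_source"]
    kind = source["type"]
    if kind == "fixed_phrase":
        # Fall back to the global target_phrase when no phrase is given
        phrase = source.get("phrase") or target_phrase
        return [phrase] * task["num_samples"]
    if kind == "from_list":
        repeat_each = source.get("repeat_each", 1)
        return [p for p in source["phrases"] for _ in range(repeat_each)]
    raise NotImplementedError(f"text_source type {kind!r} is not covered by this sketch")
```

For the "Positive Wake Words" task in the example that follows, this would yield the phrase "hey nano" 1000 times.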

Example Usage (in a .yaml config file):

target_phrase: "hey nano"

data_generation_tasks:
  - name: "Positive Wake Words"
    enabled: true
    output_dir: "dataset/positive"
    num_samples: 1000
    text_source:
      type: "fixed_phrase"
      # Uses the global "hey nano" target_phrase

  - name: "Phoneme-Based Hard Negatives"
    enabled: true
    output_dir: "dataset/negative"
    num_samples: 1500
    file_prefix: "neg_phoneme"
    text_source:
      type: "phoneme_adversarial"
      min_distance: 0.4

Augmentation Settings

Audio augmentation parameters for training robustness.

Structure

augmentation_settings:
  gain_prob: 1.0               # Probability of gain adjustment
  min_gain_in_db: -2.0         # Minimum gain in dB
  max_gain_in_db: 2.0          # Maximum gain in dB
  pitch_prob: 0.3              # Probability of pitch shift
  max_pitch_semitones: 1.0     # Maximum pitch shift
  min_pitch_semitones: -1.0    # Minimum pitch shift
  max_snr_in_db: 35.0          # Maximum signal-to-noise ratio
  min_snr_in_db: 15.0          # Minimum signal-to-noise ratio
  rir_prob: 0.0                # Probability of applying RIR

Parameter Descriptions

`min_snr_in_db` / `max_snr_in_db`

Type: float
Range: Typically -10 to +40 dB
Description: Signal-to-Noise ratio range when mixing audio with background noise.
- Lower SNR = harder augmentation (more noise, harder training)
- Higher SNR = easier augmentation (less noise, cleaner audio)
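
To make the SNR relationship concrete: mixing at a target SNR means scaling the noise so that the ratio of signal power to noise power equals 10^(SNR/10). The sketch below uses plain Python lists; the function names are hypothetical and the package's own mixing code may differ.

```python
import math


def mean_power(samples):
    """Average squared amplitude of a sample sequence."""
    return sum(s * s for s in samples) / len(samples)


def noise_scale_for_snr(signal, noise, snr_db):
    """Factor to multiply the noise by so the mix has the requested SNR in dB."""
    target_noise_power = mean_power(signal) / (10 ** (snr_db / 10))
    return math.sqrt(target_noise_power / mean_power(noise))


def mix_at_snr(signal, noise, snr_db):
    """Add scaled noise to the signal, sample by sample."""
    scale = noise_scale_for_snr(signal, noise, snr_db)
    return [s + scale * n for s, n in zip(signal, noise)]
```

A lower snr_db gives a larger scale factor and therefore a noisier mix, which is why low values make training harder.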

`rir_prob`

Type: float (0.0-1.0)
Default: 0.2
Description: Probability of applying room impulse response convolution.
Effect: Simulates acoustic room effects for robustness.

`pitch_prob` / `min_pitch_semitones` / `max_pitch_semitones`

Type: float
Pitch Range: Typically ±2 to ±5 semitones
Description: Pitch shifting for voice variation without changing content.

`gain_prob` / `min_gain_in_db` / `max_gain_in_db`

Type: float
Gain Range: Typically -6 to +6 dB
Description: Volume adjustment for robustness to different microphone levels.
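
The units used by the gain and pitch parameters map to multipliers through two standard formulas: a gain in dB corresponds to an amplitude factor of 10^(dB/20), and a shift in semitones corresponds to a frequency ratio of 2^(semitones/12). The helper names below are hypothetical.

```python
def db_to_amplitude(gain_db):
    """Amplitude multiplier for a gain given in decibels."""
    return 10 ** (gain_db / 20)


def semitones_to_ratio(semitones):
    """Frequency ratio for a pitch shift given in semitones."""
    return 2 ** (semitones / 12)
```

With the values from the Structure example, a gain range of -2.0 to 2.0 dB scales amplitude by roughly 0.79 to 1.26, and a pitch range of -1.0 to 1.0 semitones changes frequency by roughly 6% in either direction.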

`ColoredNoise`

Type: float (0.0-1.0)
Default: 0.30
Description: Probability of adding colored noise (pink/brown noise).

Example: Aggressive Augmentation

augmentation_settings:
  min_snr_in_db: -5.0          # Very noisy (challenging)
  max_snr_in_db: 20.0
  rir_prob: 0.5                # Frequent RIR
  pitch_prob: 0.6              # Frequent pitch shift
  min_pitch_semitones: -4.0    # Wider pitch range
  max_pitch_semitones: 4.0
  gain_prob: 1.0
  min_gain_in_db: -12.0        # Wider gain range
  max_gain_in_db: 12.0

Feature Generation Manifest

Defines how to generate and process feature files from raw audio.

Structure

feature_generation_manifest:
  feature_key_name1:
    input_audio_dirs: ["path/to/audio"]  # Source audio directories
    output_filename: "output_features.npy" # Output file name
    use_background_noise: true            # Mix with background noise
    use_rir: true                         # Apply RIR augmentation
    augmentation_rounds: 10               # Number of augmentation iterations
    augmentation_settings:                # Optional: override global settings
      min_snr_in_db: 5.0
      pitch_prob: 0.5

Parameters

`input_audio_dirs`

Type: list of strings
Description: Directories containing raw audio files to process.

`output_filename`

Type: string
Description: Name of the output .npy feature file. The examples in this guide include the .npy extension in the name.

`use_background_noise`

Type: boolean
Default: true
Description: Mix samples with background noise from background_paths.

`use_rir`

Type: boolean
Default: true
Description: Apply room impulse response convolution.

`augmentation_rounds`

Type: integer
Default: 10
Valid Range: 1 to 50
Description: How many times to augment each audio sample.
- Higher rounds = more training data, slower generation
- Examples: 1-3 rounds for large datasets, 10-20 for small datasets

`augmentation_settings`

Type: dict (optional)
Description: Feature-specific augmentation overrides (if not using global settings).
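
One plausible reading of the override behaviour is a shallow merge: per-entry values replace the global ones key by key, and a value of false turns augmentation off for that entry. The sketch below assumes exactly that; `effective_augmentation` is a hypothetical helper and the package's merge rules may differ.

```python
def effective_augmentation(global_settings, entry):
    """Combine global augmentation settings with one manifest entry's overrides."""
    override = entry.get("augmentation_settings")
    if override is False:
        # Explicitly disabled for this entry
        return None
    merged = dict(global_settings)
    merged.update(override or {})
    return merged
```

Keys that the entry does not mention keep their global values, so an override only needs to list what changes.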

Example: Multiple Feature Generations

feature_generation_manifest:
  positive_features:
    input_audio_dirs: ["./data/positive"]
    output_filename: "positive_features.npy"
    use_background_noise: true
    use_rir: true
    augmentation_rounds: 15
    
  hard_negative_features:
    input_audio_dirs: ["./data/negative"]
    output_filename: "hard_negative_features.npy"
    use_background_noise: true
    use_rir: true
    augmentation_rounds: 20
    
  pure_noise_features:
    input_audio_dirs: ["./data/background_noise"]
    output_filename: "noise_features.npy"
    use_background_noise: false
    use_rir: false
    augmentation_rounds: 5
    
    augmentation_settings: false  # Disables augmentation for this entry.

  others_features:
    # your parameters...

Advanced Settings

Fine-tuning parameters for specialized scenarios.

`augmentation_batch_size`

Type: integer
Default: Auto-calculated (16-128 based on system resources)
Description: Batch size for audio augmentation (separate from training batch size).
Note: Intelligently calculated based on available RAM and CPU cores.

`feature_gen_cpu_ratio`

Type: float
Default: 1.0
Valid Range: 0.0 to 1.0
Description: Share of feature generation work assigned to the CPU (0.0 = GPU only, 1.0 = CPU only).

Checkpointing & Early Stopping

`checkpoint_averaging_top_k`

Type: integer
Default: 5
Description: Number of best checkpoints to average for final model.
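
Checkpoint averaging can be pictured as an element-wise mean over the weights of the k best checkpoints. The sketch below uses plain Python lists instead of tensors and ranks checkpoints by lowest validation loss; the ranking metric is an assumption, since this guide does not say how "best" is measured, and `average_top_k` is a hypothetical name.

```python
def average_top_k(checkpoints, k):
    """Average the weights of the k checkpoints with the lowest validation loss.

    checkpoints: list of (validation_loss, weights), where weights maps a
    parameter name to a list of floats.
    """
    best = sorted(checkpoints, key=lambda item: item[0])[:k]
    names = best[0][1].keys()
    return {
        name: [sum(values) / len(best) for values in zip(*(w[name] for _, w in best))]
        for name in names
    }
```

If fewer than k checkpoints exist, the sketch averages however many are available.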

`checkpointing.enabled`

Type: boolean
Default: true
Description: Enable periodic model checkpointing during training.

`checkpointing.interval_steps`

Type: integer
Default: 1000
Description: Save checkpoint every N training steps.

`checkpointing.limit`

Type: integer
Default: 2
Description: Maximum checkpoint files to keep (oldest are deleted).

`early_stopping_patience`

Type: integer
Default: 0
Valid Range: 0 to 100
Description: Stop training if there is no improvement for N validation checks. Set to 0 to disable early stopping.

`main_delta`

Type: float
Default: 0.0001
Description: Minimum improvement threshold for early stopping.
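
How early_stopping_patience and main_delta can work together is sketched below. It assumes the monitored quantity is a validation loss where lower is better and that a check only counts as an improvement when it beats the best value by more than main_delta; the class is hypothetical and not the package's implementation.

```python
class EarlyStopper:
    """Signal a stop after `patience` validation checks without improvement."""

    def __init__(self, patience, main_delta=0.0001):
        self.patience = patience
        self.main_delta = main_delta
        self.best = float("inf")
        self.bad_checks = 0

    def should_stop(self, val_loss):
        if self.patience == 0:
            # 0 disables early stopping
            return False
        if val_loss < self.best - self.main_delta:
            self.best = val_loss
            self.bad_checks = 0
        else:
            self.bad_checks += 1
        return self.bad_checks >= self.patience
```

A larger main_delta makes the stopper stricter, because small fluctuations no longer reset the counter.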

Loss & Training Dynamics

`stabilization_steps`

Type: integer
Default: 1500
Description: Number of gradual warmup steps at training start.
Effect: Prevents instability in initial iterations.

`ema_alpha`

Type: float
Default: 0.01
Valid Range: 0.0 to 1.0
Description: Exponential moving average smoothing factor for loss tracking.
Higher values: Faster response to recent changes
Lower values: Smoother, more stable trend
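
The effect of ema_alpha follows from the usual exponential moving average update, in which alpha weights the newest value. The sketch assumes that convention, which is consistent with higher values reacting faster; `ema_update` is a hypothetical name.

```python
def ema_update(previous, value, alpha=0.01):
    """One exponential-moving-average step; the first value seeds the average."""
    if previous is None:
        return value
    return alpha * value + (1 - alpha) * previous
```

With the default of 0.01, each new loss value moves the tracked average by only 1% of the gap, which smooths out batch-to-batch noise.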

Validation Settings

`validation_batch_size`

Type: integer
Default: 256
Description: Batch size for validation pass.

Export Settings

`onnx_opset_version`

Type: integer
Default: 17
Valid Range: 11 to 20
Description: ONNX opset version for model export compatibility.
Note: Lower versions = broader compatibility, higher versions = latest features.

Custom Export Model

Nanowakeword supports user-provided export hooks so you can run any custom export code (for example, CoreML, TFLite, or a private converter) automatically after training and after distillation.

How it works:

Place a Python script anywhere on disk that exposes a callable (default name export_model) which accepts the following arguments (either by keyword or positional):
- model - the in-memory PyTorch model (or a student model during distillation)
- input_shape - the detected input shape tuple
- config - the final merged training configuration (a ConfigProxy-backed dict)
- model_name - the name chosen for the model (string)
- output_dir - directory where built-in exporters have written artifacts

Alternatively, specify a shell command which will be executed; the command supports Python-style str.format() placeholders: {model_path}, {model_name}, {output_dir}.
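
Placeholder substitution with str.format() can be illustrated as follows. The sketch quotes each value with `shlex.quote` (POSIX shells) so that paths containing spaces survive; whether Nanowakeword quotes values itself is not stated here, and `build_export_command` is a hypothetical helper.

```python
import shlex


def build_export_command(template, model_path, model_name, output_dir):
    """Fill the {model_path}, {model_name} and {output_dir} placeholders."""
    return template.format(
        model_path=shlex.quote(model_path),
        model_name=shlex.quote(model_name),
        output_dir=shlex.quote(output_dir),
    )
```

Placeholders that a template does not use are simply ignored by str.format(), so a command may reference any subset of the three.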

Configuration (example YAML):

export_model:
  # Option A: Python script
  script: /absolute/path/to/my_coreml_export.py
  function: export_model   # optional, defaults to export_model

  # Option B: shell command (alternative)
  # command: "python /scripts/convert_to_coreml.py --onnx {model_path} --out {output_dir}"

Example Python export script (my_coreml_export.py):

import os


def export_model(model, input_shape, config, model_name, output_dir):
    """Example: export a model to CoreML.

    Notes:
    - This example assumes you have `coremltools` installed and available.
    - Many users prefer to export the ONNX produced by the built-in exporter
      and run a converter on that file instead of converting a live PyTorch model.
    """
    # Option 1: convert the in-memory PyTorch model directly
    try:
        import coremltools as ct
        import torch

        # WARNING: conversion requirements depend on your model; this is illustrative.
        model.eval()
        # Create a dummy input matching the expected shape and the model's device;
        # adapt dtype as needed
        device = next(model.parameters()).device
        example_input = torch.randn(1, *input_shape, device=device)
        traced = torch.jit.trace(model, example_input)
        # coremltools needs the input description for TorchScript models.
        # convert_to="neuralnetwork" produces the .mlmodel format; the newer
        # "mlprogram" format has to be saved as .mlpackage instead.
        mlmodel = ct.convert(
            traced,
            inputs=[ct.TensorType(shape=example_input.shape)],
            convert_to="neuralnetwork",
        )
        out_path = os.path.join(output_dir, model_name + ".mlmodel")
        mlmodel.save(out_path)
        print(f"Saved CoreML model to {out_path}")
        return
    except Exception as e:
        # Fallback: convert the already-produced ONNX file with an external tool
        print(f"In-memory CoreML conversion failed: {e}. Trying ONNX fallback.")

    # Option 2: operate on ONNX produced by built-in exporter
    onnx_path = os.path.join(output_dir, model_name + ".onnx")
    if os.path.exists(onnx_path):
        # Call your converter here (a library call or a CLI of your choice)
        print(f"Found ONNX at {onnx_path}. Run your converter here.")
    else:
        raise FileNotFoundError(f"Could not find ONNX at {onnx_path}")

Shell command example:

custom_export:
  command: "python /scripts/onnx_to_coreml.py --onnx {model_path} --out {output_dir}"

This feature is intentionally flexible: your script can use the in-memory torch model, the ONNX file written by the trainer, or call any external tooling your workflow requires.

Pipeline Control

Master switches to enable/disable major processing stages.

`generate_clips`

Type: boolean
Default: false
Description: Enable/disable the clip generation stage (TTS synthesis).
Example:
```
generate_clips: true
```

`transform_clips`

Type: boolean
Default: false
Description: Enable/disable feature extraction and augmentation stage.
⚠️ Important: Set to false when not actively generating features to avoid infinite loops.

`train_model`

Type: boolean
Default: false
Description: Enable/disable the training stage.

`overwrite`

Type: boolean
Default: false
Description: Force regeneration of feature files, overwriting existing files.
⚠️ Warning: Use with caution as it will delete existing computed features.

`force_verify`

Type: boolean
Default: false
Description: Force re-verification of all data directories, ignoring cache.

`show_training_summary`

Type: boolean
Default: true
Description: Display effective training configuration in tabular format.

`debug_mode`

Type: boolean
Default: false
Description: Enable debug logging and visualization outputs.

`enable_journaling`

Type: boolean
Default: true
Description: Log training metrics and model information to journal.

Command-Line Arguments

Running training with configuration overrides:

# Basic training
nanowakeword -c your_config_path.yaml 

# Generate + Transform + Train
nanowakeword -c config.yaml -G -t -T

# Force regeneration of features
nanowakeword -c config.yaml --overwrite

# Resume from previous training
nanowakeword -c config.yaml --resume ./trained_models/my_model_v1

# Only transform (no generation, no training)
nanowakeword -c config.yaml -t

# Distill
nanowakeword -c copy_X_config.yaml --distill

Arguments Explanation

-c, --config_path - Path to YAML config file (required)
-G, --generate_clips - Enable synthetic data generation stage
-t, --transform_clips - Enable feature generation and augmentation
-T, --train_model - Enable model training
-f, --force-verify - Ignore cache and re-verify all data
--overwrite - Regenerate all feature files (destructive)
--resume - Resume training from a specific model directory
--distill - Run distillation, as in the last example above
