ISBL Formula — Importance Sampling Based on Loss

The ISBL (Importance Sampling Based on Loss) formula introduces a dynamic sampling strategy for AI alignment datasets, using historical loss values to prioritize harder training samples in reinforcement learning pipelines. Explore Nanowakeword for wake word training, read the configuration guide, or check out Phonemize for phoneme conversion.

The core logic of the sampler is governed by the following dynamic probability distribution:

$$\mathcal{P}(x_i \mid x_i \in C_k, t) = \frac{ \left(\mathcal{L}_i^{(t-1)}\right)^\alpha + \epsilon }{ \sum_{x_j \in C_k} \left[ \left(\mathcal{L}_j^{(t-1)}\right)^\alpha + \epsilon \right] }$$
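
To make the formula concrete, here is a minimal sketch in plain Python that evaluates it for a single class pool. The function name `isbl_probabilities` and the example loss values are illustrative assumptions, not taken from any library; the defaults for `alpha` and `eps` are the values quoted in the parameter breakdown on this page.

```python
def isbl_probabilities(losses, alpha=0.75, eps=1e-6):
    """Return the sampling probability of every sample in one class pool.

    losses: the most recent per-sample loss values L_i^(t-1).
            Assumed non-negative (a negative float raised to a
            fractional power would not give a real number in Python).
    """
    # Numerator of the formula for each sample: L^alpha + eps
    scores = [loss ** alpha + eps for loss in losses]
    # Denominator: the same quantity summed over the whole pool
    total = sum(scores)
    return [score / total for score in scores]


# Made-up losses for a three-sample pool (illustrative values only)
pool_losses = [0.1, 0.5, 2.0]
probs = isbl_probabilities(pool_losses)

# With identical losses every sample is equally likely
uniform_probs = isbl_probabilities([1.0, 1.0, 1.0, 1.0])
```

In this sketch the sample with the largest recent loss receives the largest probability, and the probabilities always sum to 1 within the pool.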

🔍 Nomenclature & Parameter Breakdown

$\mathcal{P}(x_i \mid x_i \in C_k, t)$: The conditional probability of selecting data sample $x_i$ from class pool $C_k$ at training step $t$.
$x_i, x_j$: Individual data samples (e.g., specific audio clips) within the dataset. Here, $x_i$ is the sample being evaluated, while $x_j$ ranges over every sample in the same pool (including $x_i$ itself) during the summation.
$C_k$ (Class/Category Pool): A distinct subset or category of data (e.g., targets or negatives), isolated via the dataset's index pools.
$t$ (Training Step / Time): The current iteration or time step of the training loop, defining the temporal state of the sampling probabilities.
$\mathcal{L}_i^{(t-1)}$ (Loss/Hardness Score): The individual loss value computed for sample $x_i$ during its most recent forward pass at step $t-1$. Higher loss signifies higher "hardness". (Note: At $t=0$, before any training occurs, all scores are uniformly initialized to $\mathcal{L}_i^{(0)} = 1.0$, so the first draws are uniform within each pool.)
$\alpha$ (Smoothing Factor): A hyperparameter set to 0.75. It acts as a contrast control: because $\alpha < 1$, raising the loss to this power compresses large values more than small ones, which dampens extreme loss values. This helps keep unlearnable, corrupted, or heavily noisy audio clips from dominating the batch gradients and causing model collapse.
$\epsilon$ (Epsilon / Stability Constant): A tiny positive constant set to 1e-6 serving a dual purpose:
1. Mathematical Safety: Prevents division-by-zero errors or absolute zero probabilities when a sample is perfectly learned.
2. Catastrophic Forgetting Prevention: As the model converges and all individual losses drop near zero ($\mathcal{L} \approx 0$), the $\epsilon$ term dominates every score and the equation naturally transitions into a uniform random sampler ($\mathcal{P} \approx \frac{1}{N}$, where $N$ is the number of samples in $C_k$), ensuring balanced baseline revision in later training stages.
$\sum_{x_j \in C_k}$ (Summation Over Class): The summation operator (Sigma) that aggregates the computed scores of all individual samples $x_j$ belonging to class $C_k$. Dividing the single sample's score by this total normalizes the output into a valid probability distribution: each value lies between $0$ and $1$, and the values sum to $1$ across the pool.
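
The roles of the parameters above can be illustrated with a small stateful sampler. This is only a sketch under stated assumptions: the class name `LossBasedSampler`, its methods, the pool names, and the sample ids are all hypothetical and are not the Nanowakeword API. It uses Python's standard `random.choices` for the weighted draw, and it samples with replacement, which is also an assumption.

```python
import random


class LossBasedSampler:
    """Hypothetical per-pool sampler following the ISBL formula."""

    def __init__(self, pools, alpha=0.75, eps=1e-6, seed=None):
        # pools maps a pool name (e.g. "targets") to a list of sample ids.
        self.pools = pools
        self.alpha = alpha
        self.eps = eps
        self.rng = random.Random(seed)
        # Before any training, every hardness score is initialized to 1.0.
        self.loss = {sid: 1.0 for ids in pools.values() for sid in ids}

    def probabilities(self, pool):
        # (L^alpha + eps) for each sample, normalized over the pool only.
        scores = [self.loss[sid] ** self.alpha + self.eps
                  for sid in self.pools[pool]]
        total = sum(scores)
        return [score / total for score in scores]

    def sample(self, pool, k):
        # Weighted draw of k sample ids, with replacement (an assumption).
        return self.rng.choices(self.pools[pool],
                                weights=self.probabilities(pool), k=k)

    def update(self, sample_ids, losses):
        # Store the latest loss of each sample; clamp at 0 so the
        # fractional power stays real-valued.
        for sid, loss in zip(sample_ids, losses):
            self.loss[sid] = max(float(loss), 0.0)


pools = {"targets": ["t0", "t1", "t2"], "negatives": ["n0", "n1"]}
sampler = LossBasedSampler(pools, seed=0)

# All scores start at 1.0, so the first distribution is uniform.
initial = sampler.probabilities("targets")

# One sample with a loss 100 times larger than the others.
sampler.update(["t0", "t1", "t2"], [100.0, 1.0, 1.0])
skewed = sampler.probabilities("targets")
# With alpha = 0.75 the odds are roughly 31.6 to 1 rather than 100 to 1.
ratio = skewed[0] / skewed[1]

# Once every loss reaches zero, eps dominates and sampling is uniform again.
sampler.update(["t0", "t1", "t2"], [0.0, 0.0, 0.0])
converged = sampler.probabilities("targets")

# Pools are sampled independently of one another.
batch = sampler.sample("negatives", k=4)
```

Keeping one probability table per pool mirrors the conditioning on $C_k$ in the formula: a very hard negative never changes how likely a target sample is to be drawn.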

