Open Datasets for Machine Intelligence

We believe in open science. Access clean, structured audio, speech, and text corpora for training production-grade generative models.

Noise & Speech Audio 3.38 GB

SonicWeave-v2

A 29.33-hour negative dataset.

🎧 5 sec per sample
⚖️ Apache 2.0
🕒 29.33 hrs
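
As a rough sanity check on the figures above, the number of clips can be estimated from the total duration and the per-sample length. This is only a sketch: it assumes every sample is exactly 5 seconds long and that 29.33 hours is a rounded total, so the real count may differ slightly. The helper name `clip_count` is hypothetical.

```python
def clip_count(total_hours: float, clip_seconds: float) -> int:
    """Estimate how many fixed-length clips fit in the stated duration."""
    total_seconds = total_hours * 3600
    # Floor division drops any trailing partial clip.
    return int(total_seconds // clip_seconds)


# Figures taken from the SonicWeave-v2 card: 29.33 hours, 5 seconds per sample.
estimated_clips = clip_count(29.33, 5)
print(estimated_clips)  # 21117
```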

Negative Feature 2.96 GB

RACON_11h_v1

RACON is a comprehensive feature set derived from approximately 11 hours of diverse, real-world audio. It is designed to serve as a high-quality negative dataset for training and evaluating robust wake-word models, particularly within systems like Nanowakeword.

🗣️ 16kHz Mono WAV
⚖️ Apache 2.0
🕒 ~11 hours

Negative Feature 130 MB

AE29H_float32

AE29H_float32 contains precomputed audio embeddings covering roughly 29 hours of audio, designed for the Nanowakeword framework. The embeddings are intended as general-purpose negative training data: the underlying audio does not contain the target wake word or phrase.

📄 NumPy Array
⚖️ Apache 2.0
🕒 ~29 hrs
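
To illustrate what using precomputed embeddings as negative training data might look like, here is a minimal sketch that pairs each embedding with a label of 0 (no wake word). The array shape, the file name in the comment, and the helper `label_as_negatives` are assumptions made for illustration; they are not taken from the dataset or from the Nanowakeword API, so check the dataset card for the actual layout.

```python
import numpy as np


def label_as_negatives(embeddings: np.ndarray) -> tuple[np.ndarray, np.ndarray]:
    """Pair each embedding with label 0, meaning "no wake word present"."""
    features = embeddings.astype(np.float32, copy=False)
    labels = np.zeros(len(features), dtype=np.int64)
    return features, labels


# Synthetic stand-in for the real data. With the actual file, something like
# np.load("AE29H_float32.npy", mmap_mode="r") could be used instead; that file
# name and the (8, 16, 96) shape below are hypothetical.
rng = np.random.default_rng(0)
fake_embeddings = rng.standard_normal((8, 16, 96)).astype(np.float32)

features, labels = label_as_negatives(fake_embeddings)
print(features.shape, labels.shape)
```

In a training set, these negatives would typically be combined with positive examples labeled 1 before shuffling.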

Speech & TTS 64.6 MB

FC-CoT-Top10k

High-quality function-calling demonstrations paired with clear, well-structured chain-of-thought reasoning.

📊 10,000 Samples
⚖️ Apache 2.0
🎯 Function Calling