How often is the dataset updated?

Mozilla typically releases major dataset updates 2-3 times per year.

Mozilla Common Voice


Overview

Mozilla Common Voice is a cornerstone of the 2026 decentralized AI ecosystem: a massive, multilingual corpus of transcribed speech. Built on crowdsourced contribution and peer validation, it addresses the 'data poverty' that often hampers smaller organizations and researchers working on Speech-to-Text (STT) and Automatic Speech Recognition (ASR). Unlike the proprietary datasets held by Big Tech, Common Voice releases its data under CC0 (a public domain dedication), allowing unrestricted commercial and academic use.

By 2026, the project has expanded significantly into spontaneous speech collection and multi-dialect metadata tagging, supporting the development of more nuanced and inclusive Large Language Models (LLMs) and Small Language Models (SLMs). The workflow involves sentence collection, voice recording through web and mobile interfaces, and a three-stage validation pipeline intended to ensure that each recording matches its transcript. The dataset is especially important for fine-tuning models such as OpenAI's Whisper or Meta's MMS on under-represented languages, where commercial datasets are scarce or non-existent.

Common tasks

Training ASR models
Fine-tuning STT algorithms
Linguistic research and dialect analysis
Building voice-controlled accessibility tools
Collecting speech data
Validating speech data
Creating custom voice datasets
Analyzing speech patterns

FAQ


Can I use Common Voice data for commercial products?

Yes. The dataset is released under CC0, a public domain dedication, so it can be used for any purpose, including commercial products, without attribution.

How do I add a new language to Common Voice?

You can start by mobilizing your community to contribute sentences to the Sentence Collector. Once enough sentences are validated, Mozilla will enable recording for that language.

What is the quality of the recordings?

Recording quality varies because the clips are crowdsourced and recorded on contributors' own devices, but validation by multiple human listeners helps ensure that each transcription matches its audio.
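Because quality varies, it can help to filter clips by their validation votes before training. The following is a minimal Python sketch, not an official Common Voice tool. It assumes a tab-separated metadata file with path, sentence, up_votes and down_votes columns, which is the layout commonly seen in dataset releases; the function name filter_clips, the thresholds, and the sample rows are hypothetical and used only for illustration.

```python
import csv
import io


def filter_clips(tsv_file, min_up_votes=2, max_down_votes=0):
    """Yield (path, sentence) for rows that meet the vote thresholds.

    Assumes tab-separated rows with 'path', 'sentence', 'up_votes' and
    'down_votes' columns (an assumption about the metadata layout).
    """
    # QUOTE_NONE keeps quotation marks inside sentences as literal text.
    reader = csv.DictReader(tsv_file, delimiter="\t", quoting=csv.QUOTE_NONE)
    for row in reader:
        up = int(row["up_votes"] or 0)
        down = int(row["down_votes"] or 0)
        if up >= min_up_votes and down <= max_down_votes:
            yield row["path"], row["sentence"]


# Made-up sample rows standing in for a real metadata file.
sample = io.StringIO(
    "client_id\tpath\tsentence\tup_votes\tdown_votes\n"
    "a1\tclip_001.mp3\tHello world.\t2\t0\n"
    "b2\tclip_002.mp3\tGood morning.\t2\t1\n"
    "c3\tclip_003.mp3\tSee you soon.\t3\t0\n"
)

kept = list(filter_clips(sample))
print(kept)
```

With a real release, the same function could be given an open file handle in place of the in-memory sample. Stricter thresholds trade dataset size for cleaner labels, so the right values depend on how much data the target language has.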

Is the data GDPR compliant?

Yes, Mozilla takes significant steps to ensure privacy, including allowing users to delete their recordings and providing anonymized demographic data.
