A feature processing and data loading framework for child-centered long-form audio recordings. Runs a SLURM pipeline that extracts speech activity, speaker types, signal quality, and environmental sound classification (ESC) — then packages everything into WebDataset shards with rich per-clip metadata for model training.
- Installation
- Quick Start
- Pipeline
- Project Structure
- Dataloader
- Citation
- Component Models
- Acknowledgements
Requirements: Linux or macOS, Python ≥ 3.13, uv, ffmpeg, git-lfs.
# Check system dependencies:
./check_sys_dependencies.sh
# Clone (includes model weights via git-lfs):
git lfs install
git clone --recurse-submodules https://github.com/LAAC-LSCP/DLplusplus.git
cd DLplusplus
# Install Python dependencies:
uv sync
# Download the Brouhaha SNR model checkpoint (~47 MB, one-time):
uv run python scripts/download_brouhaha.py
Alternative: pip install
python -m venv .venv
source .venv/bin/activate
pip install -r requirements.txt
If your audio files live in a directory (no pre-existing metadata file), generate a ready-to-use manifest with:
python scripts/make_manifest.py /path/to/audio/ -name my_dataset
This recursively scans for all common audio formats (wav, flac, mp3, ogg,
opus, m4a, aac, aiff, wma) and writes manifests/my_dataset.csv with
columns path (absolute), uid (filename stem), and ext (format).
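The scan described above can be approximated with the Python standard library. This is a minimal sketch and not the code of scripts/make_manifest.py: the function names scan_audio and write_manifest are hypothetical, and case-insensitive extension matching is an assumption.

```python
import csv
from pathlib import Path

# Extensions listed in the README; matching is case-insensitive (assumption).
AUDIO_EXTS = {"wav", "flac", "mp3", "ogg", "opus", "m4a", "aac", "aiff", "wma"}

def scan_audio(root):
    """Return one manifest row (path, uid, ext) per audio file under root."""
    rows = []
    for p in sorted(Path(root).rglob("*")):
        ext = p.suffix.lower().lstrip(".")
        if p.is_file() and ext in AUDIO_EXTS:
            rows.append({"path": str(p.resolve()), "uid": p.stem, "ext": ext})
    return rows

def write_manifest(rows, out_csv):
    """Write the rows to a CSV file with the columns path, uid, ext."""
    out_csv = Path(out_csv)
    out_csv.parent.mkdir(parents=True, exist_ok=True)
    with out_csv.open("w", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=["path", "uid", "ext"])
        writer.writeheader()
        writer.writerows(rows)
```

Because uid is the filename stem, two files with the same stem in different subdirectories would share a uid in this sketch; how the real script handles that case is not stated here.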
Run an individual pipeline step on a folder of audio files:
# Speaker diarization (VTC)
uv run python -m src.pipeline.vtc my_data \
--manifest manifests/my_dataset.csv
# Voice activity detection (VAD)
uv run python -m src.pipeline.vad my_data \
--manifest manifests/my_dataset.csv
Process an entire dataset end-to-end:
# First run with a custom manifest:
bash slurm/pipeline.sh my_data \
--manifest manifests/my_dataset.parquet \
--path-col audio_path \
--audio-root /store/audio/
# Subsequent runs (manifest already normalized):
bash slurm/pipeline.sh my_data
This submits five SLURM jobs: four feature extraction steps in parallel, then a packaging step that depends on all four. See the Pipeline section for full documentation.
The pipeline orchestrator (slurm/pipeline.sh) runs a preflight check, then submits five SLURM jobs:
              ┌─── VAD (CPU) ───┐
              ├─── VTC (GPU) ───┤
Raw Audio ──► ├─── SNR (GPU) ───┼──► Package (CPU)
              └─── ESC (GPU) ───┘
Steps 1–4 run in parallel as independent jobs. Step 5 (Package) depends on all four completing successfully.
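One common way to make a SLURM job wait for several others to succeed is sbatch's afterok dependency. The sketch below only builds the command lines and does not submit anything. It is an illustration, not the contents of slurm/pipeline.sh: the helper name sbatch_command, the job IDs, the script name slurm/package.slurm, and passing the dataset name as a positional argument are all assumptions.

```python
def sbatch_command(script, dataset, depends_on=()):
    """Build an sbatch argv list; depends_on holds job IDs that must succeed first."""
    cmd = ["sbatch", "--parsable"]  # --parsable makes sbatch print just the job ID
    if depends_on:
        ids = ":".join(str(job_id) for job_id in depends_on)
        cmd.append(f"--dependency=afterok:{ids}")
    return cmd + [script, dataset]

# Four independent feature jobs, then one packaging job gated on all of them.
feature_scripts = ["slurm/vad.slurm", "slurm/vtc.slurm", "slurm/snr.slurm", "slurm/esc.slurm"]
feature_cmds = [sbatch_command(s, "my_data") for s in feature_scripts]
fake_job_ids = [101, 102, 103, 104]  # placeholders for the IDs sbatch would return
package_cmd = sbatch_command("slurm/package.slurm", "my_data", depends_on=fake_job_ids)
```

With afterok, the packaging job starts only if every listed job exits successfully, which matches the dependency described above.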
| Step | Module | Resource | Description |
|---|---|---|---|
| 1. VAD | src.pipeline.vad | CPU | TenVAD speech activity detection |
| 2. VTC | src.pipeline.vtc | GPU | BabyHuBERT/segma speaker diarization (KCHI, OCH, MAL, FEM) |
| 3. SNR | src.pipeline.snr | GPU | Brouhaha per-frame SNR & C50 extraction |
| 4. ESC | src.pipeline.esc | GPU | PANNs CNN14 environmental sound classification |
| 5. Package | src.pipeline.package | CPU | Clip tiling + WebDataset shards + dashboards |
Resume support: VAD and VTC save checkpoints. Interrupted jobs can be resubmitted and will skip already-completed files.
Step 1 (VAD) runs TenVAD with CPU multiprocessing (default: all cores).
Output:
- output/{dataset}/vad_raw/segments.parquet: per-frame VAD segments
- output/{dataset}/vad_merged/segments.parquet: merged overlapping segments
- output/{dataset}/vad_meta/metadata.parquet: per-file summary metadata
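Merging overlapping segments, as in the vad_merged output, is a standard interval-merge. The sketch below is an illustration under assumptions and not the code in src/core/intervals.py: the function name merge_segments and the max_gap parameter are hypothetical, and segments are taken to be (onset, offset) pairs in seconds.

```python
def merge_segments(segments, max_gap=0.0):
    """Merge (onset, offset) pairs that overlap or lie within max_gap seconds."""
    merged = []
    for onset, offset in sorted(segments):
        if merged and onset <= merged[-1][1] + max_gap:
            # Overlaps (or nearly touches) the previous segment: extend it.
            merged[-1][1] = max(merged[-1][1], offset)
        else:
            merged.append([onset, offset])
    return [tuple(seg) for seg in merged]

raw = [(0.0, 1.0), (0.5, 2.0), (3.0, 4.0)]
merged = merge_segments(raw)  # the first two segments overlap and are fused
```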
Step 2 (VTC) runs the BabyHuBERT model via segma on GPU (SLURM array, default 3 shards).
Output:
- output/{dataset}/vtc_raw/: raw VTC segments (per-shard parquets)
- output/{dataset}/vtc_merged/: merged/deduplicated segments across shards
- output/{dataset}/vtc_meta/: per-file summary metadata
Segment columns: uid, onset, offset, duration, label (FEM / MAL / KCHI / OCH).
Step 3 (SNR) runs Brouhaha on GPU (SLURM array, default 2 shards). Produces per-file time-series arrays and speech-masked summary statistics.
Output:
- output/{dataset}/snr/{uid}.npz: per-file compressed arrays:
  - snr (float16, shape n_frames): per-frame SNR in dB
  - c50 (float16, shape n_frames): per-frame C50 clarity in dB
  - vad (float16, shape n_frames): per-frame Brouhaha VAD probability
  - step_s: frame step in seconds (~16 ms)
  - vad_threshold: threshold used (0.5)
- output/{dataset}/snr_meta/shard_{id}.parquet: per-file metadata:
  - uid, snr_status, duration, n_raw_frames, n_speech_frames, speech_fraction
  - snr_mean, snr_std, snr_min, snr_max: computed only on speech frames (VAD > 0.5)
  - c50_mean, c50_std, c50_min, c50_max: computed only on speech frames
Downstream steps (e.g. packaging) index into the per-frame arrays by onset/offset using step_s to compute exact segment-level statistics.
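The indexing described above amounts to converting onset and offset into frame indices with step_s and averaging the slice. The following is a minimal sketch, not the code of src.pipeline.segment_snr: the function names are hypothetical, and the rounding convention (floor for the start, ceil for the end) is an assumption.

```python
import math

def frame_slice(onset, offset, step_s, n_frames):
    """Map a time span in seconds to a clamped [start, stop) frame range."""
    start = max(0, math.floor(onset / step_s))
    stop = min(n_frames, math.ceil(offset / step_s))
    return start, max(start, stop)

def segment_mean(values, onset, offset, step_s):
    """Mean of a per-frame series over one segment; NaN if the span is empty."""
    start, stop = frame_slice(onset, offset, step_s, len(values))
    window = values[start:stop]
    return sum(window) / len(window) if len(window) else float("nan")

# Toy series with a 0.5 s frame step; frames 2 and 3 cover 1.0 s to 2.0 s.
snr = [0.0, 10.0, 20.0, 30.0, 40.0, 50.0]
mean_snr = segment_mean(snr, onset=1.0, offset=2.0, step_s=0.5)
```

values can be any sliceable sequence, for example the snr or c50 array loaded from snr/{uid}.npz.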
Step 4 (ESC) runs PANNs CNN14 on GPU (SLURM array, default 2 shards). Classifies audio into 13 coarse categories and 527 AudioSet classes.
Output:
- output/{dataset}/esc/{uid}.npz: per-file compressed arrays:
  - categories (float16, shape n_bins × 13): coarse category probabilities
  - category_names: the 13 category labels
  - audioset_probs (float16, shape n_bins × 527): full AudioSet probabilities
  - audioset_names: 527 AudioSet display labels
  - pool_step_s, inference_step_s: time resolutions
- output/{dataset}/esc_meta/shard_{id}.parquet: per-file metadata:
  - uid, esc_status, duration, n_inference_windows, n_pooled_bins
  - dominant_category, dominant_prob
  - prob_{category}: mean probability for each of 13 categories
Categories: alarm_signal, animal, crying, environment, human_activity, impact, laughter, machinery, music, nature, other, silence, singing, tv_radio, vehicle.
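A per-file summary like dominant_category and prob_{category} can be derived from the pooled categories matrix by averaging each column over time. The sketch below assumes that the dominant category is the one with the highest mean probability, which is not confirmed by this document; the function name dominant_category is hypothetical and the three-category input is toy data.

```python
def dominant_category(categories, names):
    """Average each category over time bins and return (name, prob, profile)."""
    n_bins = len(categories)
    means = [sum(row[j] for row in categories) / n_bins for j in range(len(names))]
    best = max(range(len(names)), key=means.__getitem__)
    return names[best], means[best], dict(zip(names, means))

# Two pooled time bins, three categories (toy data, not real model output).
names = ["music", "crying", "silence"]
categories = [
    [0.2, 0.7, 0.1],
    [0.4, 0.5, 0.1],
]
name, prob, profile = dominant_category(categories, names)
```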
Step 5 (Package) tiles full audio files into clips of roughly equal length, cutting at silence gaps whenever possible and inside active audio only as a last resort. Cut-point selection uses a 6-tier fallback chain:
| Tier | Strategy | Severity |
|---|---|---|
| 1 | Long silence gap (≥10 s) in VAD∪VTC union | Clean |
| 2 | Any silence gap in VAD∪VTC union | Clean |
| 3 | Gap in VAD-only mask (VTC still active) | Info |
| 4 | Gap in VTC-only mask (VAD still active) | Info |
| 5 | VTC speaker-change boundary (inside active audio) | Warning |
| 6 | Hard cut — no gaps or boundaries | Warning |
Within each tier, the midpoint closest to the ideal evenly-distributed position is chosen. The pipeline output includes a tier breakdown showing how many cuts used each strategy.
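The fallback logic can be sketched as: walk the tiers in order, and in the first tier that has any candidates, pick the midpoint nearest the ideal cut position. This is an illustration under assumptions, not the code in src/packaging/clips.py: the name choose_cut is hypothetical, candidates are given as (start, end) intervals per tier, and a point boundary (tier 5) is represented as an interval of zero length.

```python
def choose_cut(candidates_by_tier, ideal):
    """Return (tier, cut_time) using the first tier that offers any candidate.

    candidates_by_tier: list of lists of (start, end) intervals, tier 1 first.
    If every tier is empty, fall back to a hard cut at the ideal position.
    """
    for tier, intervals in enumerate(candidates_by_tier, start=1):
        if intervals:
            midpoints = [(start + end) / 2 for start, end in intervals]
            return tier, min(midpoints, key=lambda m: abs(m - ideal))
    return len(candidates_by_tier) + 1, ideal

# No long gaps (tier 1), two short gaps (tier 2); ideal cut at 50 s.
tiers = [[], [(10.0, 20.0), (50.0, 54.0)], [], [], []]
tier, cut = choose_cut(tiers, ideal=50.0)
```

With five candidate lists, the hard-cut fallback is reported as tier 6, matching the table above.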
Output:
- output/{dataset}/shards/: WebDataset .tar shards (WAV/FLAC + JSON metadata)
- output/{dataset}/shards/manifest.csv: per-clip metadata
- output/{dataset}/shards/samples/: random sample clips for manual validation
- output/{dataset}/stats/: Parquet DataFrames at multiple granularities (clip, segment, turn, conversation, file)
- figures/{dataset}/dashboard/: 6 PNG diagnostic dashboards (see src/plotting/README.md)
Each clip in a shard is stored as two files sharing the key {uid}_{clip_idx:04d}:
| File | Format | Contents |
|---|---|---|
| {clip_id}.wav / .flac | WAV / FLAC | Mono audio, 16 kHz |
| {clip_id}.json | JSON (UTF-8) | All scalar + structured metadata (see below) |
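Since a shard is a plain tar archive, the audio/JSON pairing can be inspected with the standard library alone. The sketch below is illustrative and does not use the webdataset package or the project's own loader; the names clip_key and read_clips are hypothetical, and splitting each member name at its last dot is an assumption.

```python
import json
import tarfile

def clip_key(uid, clip_idx):
    """Shared key for a clip's files, following {uid}_{clip_idx:04d}."""
    return f"{uid}_{clip_idx:04d}"

def read_clips(fileobj):
    """Group tar members by key: {key: {"audio": bytes, "meta": dict}}."""
    clips = {}
    with tarfile.open(fileobj=fileobj, mode="r") as tar:
        for member in tar:
            if not member.isfile():
                continue
            key, _, ext = member.name.rpartition(".")
            data = tar.extractfile(member).read()
            entry = clips.setdefault(key, {})
            if ext == "json":
                entry["meta"] = json.loads(data.decode("utf-8"))
            elif ext in ("wav", "flac"):
                entry["audio"] = data  # raw encoded bytes, not decoded samples
    return clips
```

For a shard on disk, open it in binary mode and pass the file object, for example `with open(shard_path, "rb") as f: clips = read_clips(f)`, where shard_path is whichever .tar file you want to inspect.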
The .json metadata contains:
Source — uid, clip_idx, clip_id, abs_onset, abs_offset, duration, source_path, audio_fmt, sample_rate.
VTC speech — vtc_speech_duration, vtc_speech_density, n_vtc_segments, mean_vtc_seg_duration, mean_vtc_gap, n_turns, n_labels, labels_present, has_adult, dominant_label, label_durations, vad_coverage_by_label (fraction of each VTC label also covered by VAD).
Demographics — child_speech_duration, adult_speech_duration, child_fraction (share of VTC speech that is child).
VAD speech — vad_speech_duration, vad_speech_density, n_vad_segments.
VAD–VTC agreement — vad_vtc_iou: frame-level Intersection over Union between the two systems' masks.
SNR & C50 — Per-VTC-segment SNR and C50 averages are computed by the segment_snr post-hoc step and stored in output/{dataset}/segment_snr/ parquets (columns: uid, onset, offset, label, snr_mean, c50_mean). During packaging, these are aggregated into per-clip summary statistics in the manifest CSV: snr_mean, snr_std, snr_min, snr_max, c50_mean, c50_std, c50_min, c50_max (dB). Higher C50 = less reverberation. The full per-frame time-series arrays remain available in snr/{uid}.npz for downstream analysis.
ESC environment — dominant_esc (category name), esc_profile (dict of mean probability per category).
Segment detail — vad_segments and vtc_segments: lists of {onset, offset, duration} objects with timestamps relative to the clip start. vtc_segments additionally carry a label field (FEM / MAL / KCHI / OCH).
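Two of the fields above are simple to reproduce from the segment lists: vad_vtc_iou and child_fraction. The sketch below is a hedged illustration, not the packaging code. The 10 ms frame step, the value 0.0 for an empty union, and counting both KCHI and OCH as child speech are assumptions; the function names are hypothetical.

```python
def frame_mask(segments, duration, step_s=0.01):
    """Rasterize (onset, offset) segments into a boolean per-frame mask."""
    n_frames = round(duration / step_s)
    mask = [False] * n_frames
    for onset, offset in segments:
        first = max(0, round(onset / step_s))
        last = min(n_frames, round(offset / step_s))
        for i in range(first, last):
            mask[i] = True
    return mask

def mask_iou(a, b):
    """Frame-level Intersection over Union of two equal-length masks."""
    inter = sum(x and y for x, y in zip(a, b))
    union = sum(x or y for x, y in zip(a, b))
    return inter / union if union else 0.0

def child_fraction(label_durations, child_labels=("KCHI", "OCH")):
    """Share of labelled speech time attributed to child labels."""
    total = sum(label_durations.values())
    child = sum(label_durations.get(label, 0.0) for label in child_labels)
    return child / total if total else 0.0

vad = frame_mask([(0.0, 2.0)], duration=4.0, step_s=0.5)
vtc = frame_mask([(1.0, 3.0)], duration=4.0, step_s=0.5)
iou = mask_iou(vad, vtc)  # 2 shared frames out of 6 active frames
```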
| Module | Purpose |
|---|---|
| src.plotting.compare | VAD vs VTC comparison (IoU, precision, recall, diagnostics) |
| src.pipeline.normalize | Standardize external manifests into manifests/{dataset}.csv |
| src.pipeline.preflight | Estimate dataset size, GPU needs, and wall-clock time |
| src.pipeline.segment_snr | Post-hoc per-VTC-segment SNR/C50 averaging |
DLplusplus/
├── src/
│ ├── utils.py # Shared utilities (manifest I/O, paths, logging)
│ ├── compat.py # Compatibility shims (torchaudio patches)
│ ├── pipeline/ # CLI entry points (one per pipeline step)
│ │ ├── vad.py # Step 1: TenVAD voice activity detection
│ │ ├── vtc.py # Step 2: BabyHuBERT speaker diarization
│ │ ├── snr.py # Step 3: Brouhaha SNR/C50 extraction
│ │ ├── esc.py # Step 4: PANNs CNN14 ESC
│ │ ├── package.py # Step 5: Audio clipping + WebDataset shards
│ │ ├── segment_snr.py # Post-hoc per-segment SNR/C50 averaging
│ │ ├── compare.py # VAD vs VTC comparison helpers
│ │ ├── normalize.py # Manifest normalization
│ │ └── preflight.py # Pre-pipeline dataset scan
│ ├── packaging/ # Clip building, shard writing, listener
│ │ ├── clips.py # Clip tiling algorithm (6-tier fallback)
│ │ ├── stats.py # Per-clip/file/conversation statistics
│ │ ├── writer.py # WebDataset tar shard writer
│ │ └── listener.py # Sample extraction for validation
│ ├── core/ # Reusable, tested modules
│ │ ├── intervals.py # Interval arithmetic (merge, IoU)
│ │ ├── conversations.py # Turn/conversation extraction
│ │ ├── vad_processing.py# Per-file VAD (worker code)
│ │ ├── parallel.py # Process pool driver with progress queue
│ │ ├── checkpoint.py # Checkpoint save / resume
│ │ └── metadata.py # VTC metadata constructors
│ └── plotting/ # Dashboard figure generation
│ ├── figures.py # Orchestrator (calls sub-modules)
│ ├── snr_noise.py # SNR quality + noise environment
│ ├── speech_turns.py # Conversational structure + turns
│ ├── overview.py # Dataset overview + correlation + text summary
│ └── packaging.py # Per-clip/label summary grids
├── dataloader/ # Dataloader++ package (see Section 5)
│ ├── types.py # Shared type aliases and enums
│ ├── config.py # PipelineConfig + FilterConfig
│ ├── build.py # build_manifest() — Big Join + filters
│ ├── processor/ # Feature Processor ABCs (offline extraction)
│ │ ├── base.py # FeatureProcessor ABC
│ │ └── registry.py # Processor discovery & registration
│ ├── adapters/ # Pipeline output adapters
│ │ ├── vad.py # VADAdapter (reads vad_meta, vad_raw, vad_merged)
│ │ ├── vtc.py # VTCAdapter (reads vtc_meta, vtc_raw, vtc_merged)
│ │ ├── snr.py # SNRAdapter (reads snr_meta, snr/*.npz)
│ │ └── esc.py # ESCAdapter (reads esc_meta, esc/*.npz)
│ ├── loader/ # Feature Loader ABCs (waveform + metadata I/O)
│ │ ├── base.py # FeatureLoader ABC
│ │ ├── waveform.py # WaveformLoader
│ │ └── metadata.py # MetadataLoader (JSON/Parquet/NPZ)
│ ├── manifest/ # Manifest management
│ │ ├── schema.py # MetadataManifest schema
│ │ ├── joiner.py # ManifestJoiner (Big Join)
│ │ └── store.py # MetadataStore (unified I/O)
│ ├── transform/ # Runtime data transforms
│ │ ├── base.py # DataProcessor ABC + Compose
│ │ ├── audio.py # Resample, segment, normalize
│ │ └── label.py # Label encoding, mask generation
│ ├── batch/ # Batching and collation
│ │ ├── base.py # Collator ABC
│ │ ├── data_batch.py # DataBatch container
│ │ └── speech.py # SpeechCollator implementation
│ └── dataset/ # PyTorch Dataset implementations
│ ├── base.py # SpeechDataset ABC
│ └── webdataset.py # WebDataset-backed loader
├── slurm/
│ ├── pipeline.sh # One-command pipeline orchestrator
│ ├── vad.slurm # SLURM: VAD (CPU, 48 workers)
│ ├── vtc.slurm # SLURM: VTC (GPU array, 3 shards)
│ ├── snr.slurm # SLURM: Brouhaha SNR (GPU array, 2 shards)
│ ├── esc.slurm # SLURM: PANNs ESC (GPU array, 2 shards)
│ ├── segment_snr.slurm # SLURM: Per-segment SNR (GPU array)
│ ├── vtc_clips.slurm # SLURM: VTC on packaged clips
│ ├── snr_diagnostic.slurm # SLURM: SNR masking diagnostics
│ ├── package_test.sh # Quick end-to-end packaging test
│ ├── repackage_test.sh # Re-package + clip alignment test
│ └── test.slurm # SLURM: pytest on compute node
├── tests/ # pytest suite covering all core modules
│ ├── conftest.py # Audio fixtures + skip markers
│ ├── fixtures/ # Short WAV files (committed)
│ ├── test_intervals.py
│ ├── test_checkpoint.py
│ ├── test_metadata.py
│ ├── test_parallel.py
│ ├── test_clips.py # Clip tiling + tier fallback chain
│ ├── test_snr.py # Brouhaha SNR extraction
│ ├── test_esc.py # PANNs ESC
│ ├── test_vad_processing.py
│ ├── test_reproducibility.py
│ └── test_stitched_audio.py
├── docs/
│ └── DATALOADER_DESIGN.md # Dataloader++ specification
├── scripts/
│ ├── download_brouhaha.py # Auto-download Brouhaha checkpoint
│ └── make_manifest.py # Generate manifest from audio directory
├── models/ # Brouhaha checkpoint (gitignored, auto-downloaded)
│ └── best/checkpoints/
│ └── best.ckpt # ~47 MB, from ylacombe/brouhaha-best
├── VTC-2.0/ # BabyHuBERT model weights & config
│ └── model/
│ ├── best.ckpt # Trained checkpoint (~1 GB, git-lfs)
│ └── config.yml # segma training config
├── manifests/ # Dataset manifests (one CSV per dataset)
├── output/ # Pipeline outputs (per-dataset subdirs)
├── figures/ # Diagnostic plots (per-dataset subdirs)
├── logs/ # SLURM logs + benchmark records
├── pyproject.toml # Python project config (uv / pip)
├── requirements.txt # Pinned dependency lockfile
└── check_sys_dependencies.sh
All paths are derived from the dataset name:
manifests/{dataset}.csv → output/{dataset}/ (metadata, segments, metrics)
                        → figures/{dataset}/ (plots)
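The naming convention above can be written as a small helper. This is a sketch for illustration only; the function name dataset_paths and the root parameter are hypothetical and are not claimed to match src/utils.py.

```python
from pathlib import Path

def dataset_paths(dataset, root="."):
    """Derive the manifest, output, and figure locations from a dataset name."""
    root = Path(root)
    return {
        "manifest": root / "manifests" / f"{dataset}.csv",
        "output": root / "output" / dataset,
        "figures": root / "figures" / dataset,
    }

paths = dataset_paths("my_data")
```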
# Login node (TenVAD tests auto-skip):
uv run python3 -m pytest tests/
# Compute node (full suite):
sbatch slurm/test.slurm
The dataloader/ package implements the Dataloader++ specification for Meta's speech training infrastructure. It bridges the offline feature processing pipeline (above) with online model training.
See docs/DATALOADER_DESIGN.md for the full design document.
| Component | Location | Purpose |
|---|---|---|
| Feature Processor | dataloader/processor/ | ABC wrapping offline extraction stages (VAD, VTC, SNR, ESC) |
| Feature Loader | dataloader/loader/ | Load waveforms + metadata from WebDataset shards or raw files |
| Manifest Joiner | dataloader/manifest/ | Join heterogeneous metadata manifests by wav_id (the "Big Join") |
| Data Processor | dataloader/transform/ | Composable runtime transforms (segment, resample, encode, mask) |
| Collator / DataBatch | dataloader/batch/ | Pad variable-length samples into typed DataBatch tensors |
| Dataset | dataloader/dataset/ | PyTorch Dataset implementations (WebDataset-backed) |
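Two of the components in the table, the Big Join and the collator, rest on simple ideas that can be shown without any framework. The sketch below uses plain dicts and lists and is not the API of ManifestJoiner or SpeechCollator. The function names big_join and pad_batch are hypothetical, an inner join is assumed, and on column clashes the later table wins.

```python
def big_join(tables, key="wav_id"):
    """Inner-join several lists of row dicts on key (later tables win clashes)."""
    indexed = [{row[key]: row for row in table} for table in tables]
    shared = set(indexed[0]).intersection(*indexed[1:]) if indexed else set()
    joined = []
    for k in sorted(shared):
        merged = {}
        for index in indexed:
            merged.update(index[k])
        joined.append(merged)
    return joined

def pad_batch(samples, pad_value=0.0):
    """Right-pad variable-length sequences; return (padded, original lengths)."""
    lengths = [len(s) for s in samples]
    width = max(lengths, default=0)
    padded = [list(s) + [pad_value] * (width - len(s)) for s in samples]
    return padded, lengths

vad_rows = [{"wav_id": "a", "vad_speech_duration": 1.0},
            {"wav_id": "b", "vad_speech_duration": 2.0}]
snr_rows = [{"wav_id": "a", "snr_mean": 12.0}]
rows = big_join([vad_rows, snr_rows])      # only "a" appears in both tables
batch, lengths = pad_batch([[1, 2, 3], [4]])
```

Keeping the original lengths alongside the padded batch lets a model mask out the padding later.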
@software{dlplusplus,
title = {{DL++}: Feature Processing and Data Loading for Child-Centered Long-Form Audio},
author = {Dager, Daniel and Kunze, Tarek and Charlot, Théo and Cristia, Alejandrina and Dupoux, Emmanuel and Lavechin, Marvin},
year = {2026},
url = {https://github.com/LAAC-LSCP/DLplusplus},
}
DL++ integrates the following models as feature processing stages:
Tencent/TenVAD — lightweight speech activity detector used in Step 1 (CPU).
BabyHuBERT (VTC 2.0): speaker diarization into four types (KCHI, OCH, MAL, FEM), trained on child-centered long-form recordings. Used in Step 2 (GPU).
Training code: LAAC-LSCP/BabyHuBERT
@misc{charlot2025babyhubertmultilingualselfsupervisedlearning,
title={BabyHuBERT: Multilingual Self-Supervised Learning for Segmenting Speakers in Child-Centered Long-Form Recordings},
author={Théo Charlot and Tarek Kunze and Maxime Poli and Alejandrina Cristia and Emmanuel Dupoux and Marvin Lavechin},
year={2025},
eprint={2509.15001},
archivePrefix={arXiv},
primaryClass={eess.AS},
url={https://arxiv.org/abs/2509.15001},
}
Earlier VTC versions
VTC 1.5 (Whisper-VTC), GitHub: LAAC-LSCP/VTC-IS-25
@inproceedings{kunze25_interspeech,
title = {{Challenges in Automated Processing of Speech from Child Wearables: The Case of Voice Type Classifier}},
author = {Tarek Kunze and Marianne Métais and Hadrien Titeux and Lucas Elbert and Joseph Coffey and Emmanuel Dupoux and Alejandrina Cristia and Marvin Lavechin},
year = {2025},
booktitle = {{Interspeech 2025}},
pages = {2845--2849},
doi = {10.21437/Interspeech.2025-1962},
}
VTC 1.0 (PyanNet-VTC), GitHub: MarvinLvn/voice-type-classifier
@inproceedings{lavechin20_interspeech,
title = {An Open-Source Voice Type Classifier for Child-Centered Daylong Recordings},
author = {Marvin Lavechin and Ruben Bousbib and Hervé Bredin and Emmanuel Dupoux and Alejandrina Cristia},
year = {2020},
booktitle = {Interspeech 2020},
pages = {3072--3076},
doi = {10.21437/Interspeech.2020-1690},
}
marianne-m/brouhaha-vad: per-frame signal-to-noise ratio and clarity (C50) extraction. Used in Step 3 (GPU).
@inproceedings{lavechin2023brouhaha,
title = {Brouhaha: Multi-task Training for Voice Activity Detection, Speech-to-Noise Ratio, and Speech Reverberation Estimation},
author = {Marvin Lavechin and Marianne Métais and Hadrien Titeux and Alodie Boissonnet and Johan Music and Hervé Bredin and Emmanouil Benetos and Alejandrina Cristia},
year = {2023},
booktitle = {2023 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU)},
doi = {10.1109/ASRU57964.2023.10389642},
}
qiuqiangkong/panns_inference: AudioSet-based sound event detection (527 classes, grouped into 13 coarse categories). Used in Step 4 (GPU).
@inproceedings{kong2020panns,
title = {PANNs: Large-Scale Pretrained Audio Neural Networks for Audio Pattern Recognition},
author = {Qiuqiang Kong and Yin Cao and Turab Iqbal and Yuxuan Wang and Wenwu Wang and Mark D. Plumbley},
year = {2020},
journal = {IEEE/ACM Transactions on Audio, Speech, and Language Processing},
volume = {28},
pages = {2880--2894},
doi = {10.1109/TASLP.2020.3030497},
}
This work uses the segma library, inspired by pyannote.audio.
This work was performed using HPC resources from GENCI-IDRIS (Grant 2024-AD011015450 and 2025-AD011016414) and was developed as part of the ExELang project funded by the European Union (ERC, ExELang, Grant No 101001095).