DL++ (Dataloader++)

A feature processing and data loading framework for child-centered long-form audio recordings. Runs a SLURM pipeline that extracts speech activity, speaker types, signal quality, and environmental sound classification (ESC) — then packages everything into WebDataset shards with rich per-clip metadata for model training.

1. Installation

Requirements: Linux or macOS, Python ≥ 3.13, uv, ffmpeg, git-lfs.

# Check system dependencies:
./check_sys_dependencies.sh

# Clone (includes model weights via git-lfs):
git lfs install
git clone --recurse-submodules https://github.com/LAAC-LSCP/DLplusplus.git
cd DLplusplus

# Install Python dependencies:
uv sync

# Download the Brouhaha SNR model checkpoint (~47 MB, one-time):
uv run python scripts/download_brouhaha.py

Alternative: pip install

python -m venv .venv
source .venv/bin/activate
pip install -r requirements.txt

2. Quick Start

Generate a manifest from a directory

If your audio files live in a directory (no pre-existing metadata file), generate a ready-to-use manifest with:

python scripts/make_manifest.py /path/to/audio/ -name my_dataset

This recursively scans for all common audio formats (wav, flac, mp3, ogg, opus, m4a, aac, aiff, wma) and writes manifests/my_dataset.csv with columns path (absolute), uid (filename stem), and ext (format).
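As an illustration of what such a scan involves, here is a minimal standard-library sketch. It is not the code of scripts/make_manifest.py; the function names `scan_audio` and `write_manifest` are hypothetical, and matching extensions case-insensitively is an assumption.

```python
import csv
from pathlib import Path

# Extensions listed above; case-insensitive matching is an assumption.
AUDIO_EXTS = {"wav", "flac", "mp3", "ogg", "opus", "m4a", "aac", "aiff", "wma"}


def scan_audio(root):
    """Return one manifest row per audio file found under root (recursive)."""
    rows = []
    for p in sorted(Path(root).rglob("*")):
        ext = p.suffix.lower().lstrip(".")
        if p.is_file() and ext in AUDIO_EXTS:
            rows.append({"path": str(p.resolve()), "uid": p.stem, "ext": ext})
    return rows


def write_manifest(rows, out_csv):
    """Write rows to a CSV with the columns path, uid, ext."""
    out_csv = Path(out_csv)
    out_csv.parent.mkdir(parents=True, exist_ok=True)
    with out_csv.open("w", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=["path", "uid", "ext"])
        writer.writeheader()
        writer.writerows(rows)
```

Because uid is the filename stem, two files with the same name in different subdirectories would share a uid; this sketch does not guard against that.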

Single-step inference

Run an individual pipeline step on the audio files listed in a manifest:

# Speaker diarization (VTC)
uv run python -m src.pipeline.vtc my_data \
    --manifest manifests/my_dataset.csv

# Voice activity detection (VAD)
uv run python -m src.pipeline.vad my_data \
    --manifest manifests/my_dataset.csv

Full pipeline on a SLURM cluster

Process an entire dataset end-to-end:

# First run with a custom manifest:
bash slurm/pipeline.sh my_data \
    --manifest manifests/my_dataset.parquet \
    --path-col audio_path \
    --audio-root /store/audio/

# Subsequent runs (manifest already normalized):
bash slurm/pipeline.sh my_data

This submits five SLURM jobs — four feature extraction steps in parallel, then a packaging step that depends on all four. See the Pipeline section for full documentation.

3. Pipeline

The pipeline orchestrator (slurm/pipeline.sh) runs a preflight check, then submits five SLURM jobs:

                ┌─── VAD  (CPU)  ───┐
                ├─── VTC  (GPU)  ───┤
Raw Audio ──►   ├─── SNR  (GPU)  ───┼──► Package (CPU)
                └─── ESC  (GPU)  ───┘

Steps 1–4 run in parallel as independent jobs. Step 5 (Package) depends on all four completing successfully.
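The "depends on all four" relationship maps naturally onto Slurm's `--dependency=afterok:<id>:<id>...` option. The sketch below only builds the command line and does not submit anything. The helper names and the script path `slurm/package.slurm` are hypothetical, and slurm/pipeline.sh may wire the jobs differently.

```python
def afterok_flag(job_ids):
    """Build the sbatch flag that waits for every listed job to succeed."""
    if not job_ids:
        raise ValueError("need at least one upstream job id")
    return "--dependency=afterok:" + ":".join(str(j) for j in job_ids)


def package_command(job_ids, script="slurm/package.slurm", dataset="my_data"):
    """Return the argv for the packaging job (script path is hypothetical)."""
    # --parsable makes sbatch print only the job id, which is easy to capture.
    return ["sbatch", "--parsable", afterok_flag(job_ids), script, dataset]
```

For example, `package_command([101, 102, 103, 104])` yields an argv whose third element is `--dependency=afterok:101:102:103:104`.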

Step	Module	Resource	Description
1. VAD	`src.pipeline.vad`	CPU	TenVAD speech activity detection
2. VTC	`src.pipeline.vtc`	GPU	BabyHuBERT/segma speaker diarization (KCHI, OCH, MAL, FEM)
3. SNR	`src.pipeline.snr`	GPU	Brouhaha per-frame SNR & C50 extraction
4. ESC	`src.pipeline.esc`	GPU	PANNs CNN14 environmental sound classification
5. Package	`src.pipeline.package`	CPU	Clip tiling + WebDataset shards + dashboards

Resume support: VAD and VTC save checkpoints. Interrupted jobs can be resubmitted and will skip already-completed files.
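A resume mechanism of this kind can be as small as a file listing the uids already processed. The following is a sketch under that assumption, not the contents of src/core/checkpoint.py; the JSON format and all function names are hypothetical.

```python
import json
from pathlib import Path


def load_done(checkpoint_path):
    """Return the set of uids recorded as finished (empty if no checkpoint)."""
    path = Path(checkpoint_path)
    if not path.exists():
        return set()
    return set(json.loads(path.read_text()))


def mark_done(checkpoint_path, done, uid):
    """Record uid as finished and rewrite the checkpoint file."""
    done.add(uid)
    path = Path(checkpoint_path)
    tmp = path.with_suffix(path.suffix + ".tmp")
    tmp.write_text(json.dumps(sorted(done)))
    # Rename over the old file so a crash cannot leave a half-written checkpoint.
    tmp.replace(path)


def pending(uids, done):
    """Filter the work list down to files that still need processing."""
    return [u for u in uids if u not in done]
```

On resubmission, a job would call `load_done`, process only `pending(all_uids, done)`, and call `mark_done` after each file.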

Step 1 — VAD (Voice Activity Detection)

Runs TenVAD with CPU multiprocessing (default: all cores).

Output:

output/{dataset}/vad_raw/segments.parquet — per-frame VAD segments
output/{dataset}/vad_merged/segments.parquet — merged overlapping segments
output/{dataset}/vad_meta/metadata.parquet — per-file summary metadata
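The step from vad_raw to vad_merged is a standard interval merge. A minimal sketch follows, assuming segments are (onset, offset) pairs in seconds and that touching segments are merged too; the pipeline's actual rule (for example, a gap tolerance) is not documented here.

```python
def merge_segments(segments):
    """Merge overlapping or touching (onset, offset) pairs."""
    merged = []
    for onset, offset in sorted(segments):
        if merged and onset <= merged[-1][1]:
            # Overlaps the previous segment: extend it instead of adding a new one.
            merged[-1] = (merged[-1][0], max(merged[-1][1], offset))
        else:
            merged.append((onset, offset))
    return merged
```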

Step 2 — VTC (Voice Type Classification)

Runs the BabyHuBERT model via segma on GPU (SLURM array, default 3 shards).

Output:

output/{dataset}/vtc_raw/ — raw VTC segments (per-shard parquets)
output/{dataset}/vtc_merged/ — merged/deduplicated segments across shards
output/{dataset}/vtc_meta/ — per-file summary metadata

Segment columns: uid, onset, offset, duration, label (FEM / MAL / KCHI / OCH).

Step 3 — SNR (Signal-to-Noise Ratio & Clarity)

Runs Brouhaha on GPU (SLURM array, default 2 shards). Produces per-file time-series arrays and speech-masked summary statistics.

Output:

output/{dataset}/snr/{uid}.npz — per-file compressed arrays:
- snr (float16, shape n_frames) — per-frame SNR in dB
- c50 (float16, shape n_frames) — per-frame C50 clarity in dB
- vad (float16, shape n_frames) — per-frame Brouhaha VAD probability
- step_s — frame step in seconds (~16 ms)
- vad_threshold — threshold used (0.5)
output/{dataset}/snr_meta/shard_{id}.parquet — per-file metadata:
- uid, snr_status, duration, n_raw_frames, n_speech_frames, speech_fraction
- snr_mean, snr_std, snr_min, snr_max — computed only on speech frames (VAD > 0.5)
- c50_mean, c50_std, c50_min, c50_max — computed only on speech frames

Downstream steps (e.g. packaging) index into the per-frame arrays by onset/offset using step_s to compute exact segment-level statistics.
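The indexing described above could look like the sketch below. It assumes frame i covers the interval starting at i × step_s, and it rounds the window outward; the packaging code may use a different boundary rule. `segment_mean` is a hypothetical helper name.

```python
import numpy as np


def segment_mean(values, step_s, onset, offset):
    """Mean of a per-frame array over [onset, offset) seconds.

    Returns NaN when no frame falls inside the window.
    """
    start = int(np.floor(onset / step_s))
    stop = int(np.ceil(offset / step_s))
    # Upcast from float16 before averaging to limit rounding error.
    window = np.asarray(values[start:stop], dtype=np.float32)
    return float(window.mean()) if window.size else float("nan")


# Usage with a real file (path shown for illustration only):
# data = np.load("output/my_data/snr/some_uid.npz")
# segment_mean(data["snr"], float(data["step_s"]), onset=12.3, offset=14.1)
```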

Step 4 — ESC (Environmental Sound Classification)

Runs PANNs CNN14 on GPU (SLURM array, default 2 shards). Classifies audio into 13 coarse categories and 527 AudioSet classes.

Output:

output/{dataset}/esc/{uid}.npz — per-file compressed arrays:
- categories (float16, shape n_bins × 13) — coarse category probabilities
- category_names — the 13 category labels
- audioset_probs (float16, shape n_bins × 527) — full AudioSet probabilities
- audioset_names — 527 AudioSet display labels
- pool_step_s, inference_step_s — time resolutions
output/{dataset}/esc_meta/shard_{id}.parquet — per-file metadata:
- uid, esc_status, duration, n_inference_windows, n_pooled_bins
- dominant_category, dominant_prob
- prob_{category} — mean probability for each of 13 categories

Categories: alarm_signal, animal, crying, environment, human_activity, impact, laughter, machinery, music, nature, other, silence, singing, tv_radio, vehicle.
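One plausible way to derive dominant_category and dominant_prob from the categories array is to average over time bins and take the argmax. This is an assumption consistent with prob_{category} being a mean probability, not a description of src/pipeline/esc.py.

```python
import numpy as np


def dominant_category(categories, category_names):
    """Return (name, mean probability) of the category with the highest mean."""
    # categories has shape (n_bins, n_categories); average over the time axis.
    probs = np.asarray(categories, dtype=np.float32).mean(axis=0)
    best = int(probs.argmax())
    return str(category_names[best]), float(probs[best])
```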

Step 5 — Package (Clip Tiling + WebDataset Shards)

Tiles full audio files into clips of roughly equal length, cutting only at silence gaps (never mid-speech). Cut-point selection uses a 6-tier fallback chain:

Tier	Strategy	Severity
1	Long silence gap (≥10 s) in VAD∪VTC union	Clean
2	Any silence gap in VAD∪VTC union	Clean
3	Gap in VAD-only mask (VTC still active)	Info
4	Gap in VTC-only mask (VAD still active)	Info
5	VTC speaker-change boundary (inside active audio)	Warning
6	Hard cut — no gaps or boundaries	Warning

Within each tier, the midpoint closest to the ideal evenly-distributed position is chosen. The pipeline output includes a tier breakdown showing how many cuts used each strategy.
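The selection rule can be sketched as follows. This is a simplified illustration, not the algorithm in src/packaging/clips.py: the caller is assumed to supply one list of candidate cut times per tier, and all function names are hypothetical.

```python
def gap_midpoints(speech, start, end, min_gap=0.0):
    """Midpoints of silence gaps between sorted speech segments in [start, end]."""
    mids = []
    cursor = start
    for onset, offset in speech:
        if onset > cursor and onset - cursor >= min_gap:
            mids.append((cursor + onset) / 2)
        cursor = max(cursor, offset)
    if end > cursor and end - cursor >= min_gap:
        mids.append((cursor + end) / 2)
    return mids


def pick_cut(candidates, ideal):
    """Choose the candidate closest to the ideal evenly spaced position."""
    return min(candidates, key=lambda t: abs(t - ideal))


def choose_cut(tiers, ideal):
    """Use the first tier that offers a candidate; otherwise hard-cut at ideal."""
    for tier, candidates in enumerate(tiers, start=1):
        if candidates:
            return tier, pick_cut(candidates, ideal)
    return len(tiers) + 1, ideal
```

With five candidate lists (tiers 1 to 5), the fallback return value corresponds to the tier-6 hard cut.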

Output:

output/{dataset}/shards/ — WebDataset .tar shards (WAV/FLAC + JSON metadata)
output/{dataset}/shards/manifest.csv — per-clip metadata
output/{dataset}/shards/samples/ — random sample clips for manual validation
output/{dataset}/stats/ — Parquet DataFrames at multiple granularities (clip, segment, turn, conversation, file)
figures/{dataset}/dashboard/ — 6 PNG diagnostic dashboards (see src/plotting/README.md)

Clip metadata

Each clip in a shard is stored as two files sharing the key {uid}_{clip_idx:04d}:

File	Format	Contents
`{clip_id}.wav` / `.flac`	WAV / FLAC	Mono audio, 16 kHz
`{clip_id}.json`	JSON (UTF-8)	All scalar + structured metadata (see below)
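Since a shard is a plain tar archive, the audio/JSON pairs can be inspected with the standard library alone. The sketch below is for inspection and is not the project's loader (dataloader/dataset/webdataset.py); `iter_clips` is a hypothetical name.

```python
import json
import tarfile


def iter_clips(path=None, fileobj=None):
    """Yield (clip_id, audio_bytes, metadata) for each clip in a shard."""
    with tarfile.open(name=path, fileobj=fileobj) as tar:
        pending = {}  # clip_id -> {extension: raw bytes}
        for member in tar:
            if not member.isfile():
                continue
            clip_id, _, ext = member.name.rpartition(".")
            parts = pending.setdefault(clip_id, {})
            parts[ext] = tar.extractfile(member).read()
            audio_ext = next((e for e in ("wav", "flac") if e in parts), None)
            if "json" in parts and audio_ext:
                # Both files of the pair have been seen: emit the clip.
                yield clip_id, parts[audio_ext], json.loads(parts["json"])
                del pending[clip_id]
```

A clip key such as `rec1_0000` is what `"{}_{:04d}".format("rec1", 0)` produces for uid `rec1` and clip_idx 0.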

The .json metadata contains:

Source — uid, clip_idx, clip_id, abs_onset, abs_offset, duration, source_path, audio_fmt, sample_rate.

VTC speech — vtc_speech_duration, vtc_speech_density, n_vtc_segments, mean_vtc_seg_duration, mean_vtc_gap, n_turns, n_labels, labels_present, has_adult, dominant_label, label_durations, vad_coverage_by_label (fraction of each VTC label also covered by VAD).

Demographics — child_speech_duration, adult_speech_duration, child_fraction (share of VTC speech that is child).

VAD speech — vad_speech_duration, vad_speech_density, n_vad_segments.

VAD–VTC agreement — vad_vtc_iou: frame-level Intersection over Union between the two systems' masks.

SNR & C50 — Per-VTC-segment SNR and C50 averages are computed by the segment_snr post-hoc step and stored in output/{dataset}/segment_snr/ parquets (columns: uid, onset, offset, label, snr_mean, c50_mean). During packaging, these are aggregated into per-clip summary statistics in the manifest CSV: snr_mean, snr_std, snr_min, snr_max, c50_mean, c50_std, c50_min, c50_max (dB). Higher C50 = less reverberation. The full per-frame time-series arrays remain available in snr/{uid}.npz for downstream analysis.

ESC environment — dominant_esc (category name), esc_profile (dict of mean probability per category).

Segment detail — vad_segments and vtc_segments: lists of {onset, offset, duration} objects with timestamps relative to the clip start. vtc_segments additionally carry a label field (FEM / MAL / KCHI / OCH).
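To make the vad_vtc_iou field concrete, here is a sketch that recomputes a frame-level IoU from the vad_segments and vtc_segments lists. The 10 ms frame step is an assumption; the step used by the pipeline is not stated in this README.

```python
def frame_set(segments, step_s=0.01):
    """Indices of the frames covered by (onset, offset) segments."""
    frames = set()
    for onset, offset in segments:
        frames.update(range(round(onset / step_s), round(offset / step_s)))
    return frames


def frame_iou(vad_segments, vtc_segments, step_s=0.01):
    """Frame-level Intersection over Union (0.0 if both lists are empty)."""
    a = frame_set(vad_segments, step_s)
    b = frame_set(vtc_segments, step_s)
    union = a | b
    return len(a & b) / len(union) if union else 0.0
```

For segments stored as JSON objects, pass `[(s["onset"], s["offset"]) for s in meta["vad_segments"]]` and the equivalent for `vtc_segments`.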

Additional tools

Module	Purpose
`src.pipeline.compare`	VAD vs VTC comparison (IoU, precision, recall, diagnostics)
`src.pipeline.normalize`	Standardize external manifests into `manifests/{dataset}.csv`
`src.pipeline.preflight`	Estimate dataset size, GPU needs, and wall-clock time
`src.pipeline.segment_snr`	Post-hoc per-VTC-segment SNR/C50 averaging

4. Project Structure

DLplusplus/
├── src/
│   ├── utils.py             # Shared utilities (manifest I/O, paths, logging)
│   ├── compat.py            # Compatibility shims (torchaudio patches)
│   ├── pipeline/            # CLI entry points (one per pipeline step)
│   │   ├── vad.py           #   Step 1: TenVAD voice activity detection
│   │   ├── vtc.py           #   Step 2: BabyHuBERT speaker diarization
│   │   ├── snr.py           #   Step 3: Brouhaha SNR/C50 extraction
│   │   ├── esc.py           #   Step 4: PANNs CNN14 ESC
│   │   ├── package.py       #   Step 5: Audio clipping + WebDataset shards
│   │   ├── segment_snr.py   #   Post-hoc per-segment SNR/C50 averaging
│   │   ├── compare.py       #   VAD vs VTC comparison helpers
│   │   ├── normalize.py     #   Manifest normalization
│   │   └── preflight.py     #   Pre-pipeline dataset scan
│   ├── packaging/           # Clip building, shard writing, listener
│   │   ├── clips.py         #   Clip tiling algorithm (6-tier fallback)
│   │   ├── stats.py         #   Per-clip/file/conversation statistics
│   │   ├── writer.py        #   WebDataset tar shard writer
│   │   └── listener.py      #   Sample extraction for validation
│   ├── core/                # Reusable, tested modules
│   │   ├── intervals.py     #   Interval arithmetic (merge, IoU)
│   │   ├── conversations.py #   Turn/conversation extraction
│   │   ├── vad_processing.py#   Per-file VAD (worker code)
│   │   ├── parallel.py      #   Process pool driver with progress queue
│   │   ├── checkpoint.py    #   Checkpoint save / resume
│   │   └── metadata.py      #   VTC metadata constructors
│   └── plotting/            # Dashboard figure generation
│       ├── figures.py       #   Orchestrator (calls sub-modules)
│       ├── snr_noise.py     #   SNR quality + noise environment
│       ├── speech_turns.py  #   Conversational structure + turns
│       ├── overview.py      #   Dataset overview + correlation + text summary
│       └── packaging.py     #   Per-clip/label summary grids
├── dataloader/              # Dataloader++ package (see Section 5)
│   ├── types.py             #   Shared type aliases and enums
│   ├── config.py            #   PipelineConfig + FilterConfig
│   ├── build.py             #   build_manifest() — Big Join + filters
│   ├── processor/           #   Feature Processor ABCs (offline extraction)
│   │   ├── base.py          #     FeatureProcessor ABC
│   │   └── registry.py      #     Processor discovery & registration
│   ├── adapters/            #   Pipeline output adapters
│   │   ├── vad.py           #     VADAdapter (reads vad_meta, vad_raw, vad_merged)
│   │   ├── vtc.py           #     VTCAdapter (reads vtc_meta, vtc_raw, vtc_merged)
│   │   ├── snr.py           #     SNRAdapter (reads snr_meta, snr/*.npz)
│   │   └── esc.py           #     ESCAdapter (reads esc_meta, esc/*.npz)
│   ├── loader/              #   Feature Loader ABCs (waveform + metadata I/O)
│   │   ├── base.py          #     FeatureLoader ABC
│   │   ├── waveform.py      #     WaveformLoader
│   │   └── metadata.py      #     MetadataLoader (JSON/Parquet/NPZ)
│   ├── manifest/            #   Manifest management
│   │   ├── schema.py        #     MetadataManifest schema
│   │   ├── joiner.py        #     ManifestJoiner (Big Join)
│   │   └── store.py         #     MetadataStore (unified I/O)
│   ├── transform/           #   Runtime data transforms
│   │   ├── base.py          #     DataProcessor ABC + Compose
│   │   ├── audio.py         #     Resample, segment, normalize
│   │   └── label.py         #     Label encoding, mask generation
│   ├── batch/               #   Batching and collation
│   │   ├── base.py          #     Collator ABC
│   │   ├── data_batch.py    #     DataBatch container
│   │   └── speech.py        #     SpeechCollator implementation
│   └── dataset/             #   PyTorch Dataset implementations
│       ├── base.py          #     SpeechDataset ABC
│       └── webdataset.py    #     WebDataset-backed loader
├── slurm/
│   ├── pipeline.sh          # One-command pipeline orchestrator
│   ├── vad.slurm            # SLURM: VAD (CPU, 48 workers)
│   ├── vtc.slurm            # SLURM: VTC (GPU array, 3 shards)
│   ├── snr.slurm            # SLURM: Brouhaha SNR (GPU array, 2 shards)
│   ├── esc.slurm            # SLURM: PANNs ESC (GPU array, 2 shards)
│   ├── segment_snr.slurm    # SLURM: Per-segment SNR (GPU array)
│   ├── vtc_clips.slurm      # SLURM: VTC on packaged clips
│   ├── snr_diagnostic.slurm # SLURM: SNR masking diagnostics
│   ├── package_test.sh      # Quick end-to-end packaging test
│   ├── repackage_test.sh    # Re-package + clip alignment test
│   └── test.slurm           # SLURM: pytest on compute node
├── tests/                   # pytest suite covering all core modules
│   ├── conftest.py          #   Audio fixtures + skip markers
│   ├── fixtures/            #   Short WAV files (committed)
│   ├── test_intervals.py
│   ├── test_checkpoint.py
│   ├── test_metadata.py
│   ├── test_parallel.py
│   ├── test_clips.py        #   Clip tiling + tier fallback chain
│   ├── test_snr.py          #   Brouhaha SNR extraction
│   ├── test_esc.py          #   PANNs ESC
│   ├── test_vad_processing.py
│   ├── test_reproducibility.py
│   └── test_stitched_audio.py
├── docs/
│   └── DATALOADER_DESIGN.md # Dataloader++ specification
├── scripts/
│   ├── download_brouhaha.py # Auto-download Brouhaha checkpoint
│   └── make_manifest.py     # Generate manifest from audio directory
├── models/                  # Brouhaha checkpoint (gitignored, auto-downloaded)
│   └── best/checkpoints/
│       └── best.ckpt        #   ~47 MB, from ylacombe/brouhaha-best
├── VTC-2.0/                 # BabyHuBERT model weights & config
│   └── model/
│       ├── best.ckpt        #   Trained checkpoint (~1 GB, git-lfs)
│       └── config.yml       #   segma training config
├── manifests/               # Dataset manifests (one CSV per dataset)
├── output/                  # Pipeline outputs (per-dataset subdirs)
├── figures/                 # Diagnostic plots (per-dataset subdirs)
├── logs/                    # SLURM logs + benchmark records
├── pyproject.toml           # Python project config (uv / pip)
├── requirements.txt         # Pinned dependency lockfile
└── check_sys_dependencies.sh

Data flow

All paths are derived from the dataset name:

manifests/{dataset}.csv  →  output/{dataset}/   (metadata, segments, metrics)
                         →  figures/{dataset}/  (plots)

Running tests

# Login node (TenVAD tests auto-skip):
uv run python3 -m pytest tests/

# Compute node (full suite):
sbatch slurm/test.slurm

5. Dataloader

The dataloader/ package implements the Dataloader++ specification for Meta's speech training infrastructure. It bridges the offline feature processing pipeline (above) with online model training.

See docs/DATALOADER_DESIGN.md for the full design document.

Component	Location	Purpose
Feature Processor	`dataloader/processor/`	ABC wrapping offline extraction stages (VAD, VTC, SNR, ESC)
Feature Loader	`dataloader/loader/`	Load waveforms + metadata from WebDataset shards or raw files
Manifest Joiner	`dataloader/manifest/`	Join heterogeneous metadata manifests by `wav_id` (the "Big Join")
Data Processor	`dataloader/transform/`	Composable runtime transforms (segment, resample, encode, mask)
Collator / DataBatch	`dataloader/batch/`	Pad variable-length samples into typed `DataBatch` tensors
Dataset	`dataloader/dataset/`	PyTorch `Dataset` implementations (WebDataset-backed)
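The "Big Join" idea can be illustrated without any dependencies. The sketch below assumes an inner join with one row per `wav_id` in each table; the join type and conflict handling used by dataloader/manifest/joiner.py are not documented here, and `big_join` is a hypothetical name.

```python
def big_join(*tables, key="wav_id"):
    """Inner-join lists of row dicts on key, keeping the first table's order."""
    if not tables:
        return []
    indexed = [{row[key]: row for row in table} for table in tables]
    shared = set(indexed[0]).intersection(*indexed[1:])
    joined = []
    for row in tables[0]:
        if row[key] in shared:
            merged = {}
            for index in indexed:
                # Later tables add columns and overwrite duplicates from earlier ones.
                merged.update(index[row[key]])
            joined.append(merged)
    return joined
```

Rows whose key is missing from any table are dropped, which is the usual behaviour when every feature is required for training.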

6. Citation

@software{dlplusplus,
    title  = {{DL++}: Feature Processing and Data Loading for Child-Centered Long-Form Audio},
    author = {Dager, Daniel and Kunze, Tarek and Charlot, Théo and Cristia, Alejandrina and Dupoux, Emmanuel and Lavechin, Marvin},
    year   = {2026},
    url    = {https://github.com/LAAC-LSCP/DLplusplus},
}

7. Component Models

DL++ integrates the following models as feature processing stages:

TenVAD — Voice Activity Detection

Tencent/TenVAD — lightweight speech activity detector used in Step 1 (CPU).

BabyHuBERT — Voice Type Classification (VTC 2.0)

Speaker diarization into four types (KCHI, OCH, MAL, FEM), trained on child-centered long-form recordings. Used in Step 2 (GPU).

Training code: LAAC-LSCP/BabyHuBERT

@misc{charlot2025babyhubertmultilingualselfsupervisedlearning,
    title={BabyHuBERT: Multilingual Self-Supervised Learning for Segmenting Speakers in Child-Centered Long-Form Recordings},
    author={Théo Charlot and Tarek Kunze and Maxime Poli and Alejandrina Cristia and Emmanuel Dupoux and Marvin Lavechin},
    year={2025},
    eprint={2509.15001},
    archivePrefix={arXiv},
    primaryClass={eess.AS},
    url={https://arxiv.org/abs/2509.15001},
}

Earlier VTC versions

VTC 1.5 (Whisper-VTC) — GitHub: LAAC-LSCP/VTC-IS-25

@inproceedings{kunze25_interspeech,
    title     = {{Challenges in Automated Processing of Speech from Child Wearables: The Case of Voice Type Classifier}},
    author    = {Tarek Kunze and Marianne Métais and Hadrien Titeux and Lucas Elbert and Joseph Coffey and Emmanuel Dupoux and Alejandrina Cristia and Marvin Lavechin},
    year      = {2025},
    booktitle = {{Interspeech 2025}},
    pages     = {2845--2849},
    doi       = {10.21437/Interspeech.2025-1962},
}

VTC 1.0 (PyanNet-VTC) — GitHub: MarvinLvn/voice-type-classifier

@inproceedings{lavechin20_interspeech,
    title     = {An Open-Source Voice Type Classifier for Child-Centered Daylong Recordings},
    author    = {Marvin Lavechin and Ruben Bousbib and Hervé Bredin and Emmanuel Dupoux and Alejandrina Cristia},
    year      = {2020},
    booktitle = {Interspeech 2020},
    pages     = {3072--3076},
    doi       = {10.21437/Interspeech.2020-1690},
}

Brouhaha — SNR & C50 Estimation

marianne-m/brouhaha-vad — per-frame signal-to-noise ratio and clarity (C50) extraction. Used in Step 3 (GPU).

@inproceedings{lavechin2023brouhaha,
    title     = {Brouhaha: Multi-task Training for Voice Activity Detection, Speech-to-Noise Ratio, and Speech Reverberation Estimation},
    author    = {Marvin Lavechin and Marianne Métais and Hadrien Titeux and Alodie Boissonnet and Johan Music and Hervé Bredin and Emmanouil Benetos and Alejandrina Cristia},
    year      = {2023},
    booktitle = {2023 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU)},
    doi       = {10.1109/ASRU57964.2023.10389642},
}

PANNs CNN14 — Environmental Sound Classification (ESC)

qiuqiangkong/panns_inference — AudioSet-based sound event detection (527 classes, grouped into 13 coarse categories). Used in Step 4 (GPU).

@inproceedings{kong2020panns,
    title     = {PANNs: Large-Scale Pretrained Audio Neural Networks for Audio Pattern Recognition},
    author    = {Qiuqiang Kong and Yin Cao and Turab Iqbal and Yuxuan Wang and Wenwu Wang and Mark D. Plumbley},
    year      = {2020},
    journal   = {IEEE/ACM Transactions on Audio, Speech, and Language Processing},
    volume    = {28},
    pages     = {2880--2894},
    doi       = {10.1109/TASLP.2020.3030497},
}

8. Acknowledgements

This work uses the segma library, inspired by pyannote.audio.

This work was performed using HPC resources from GENCI-IDRIS (Grant 2024-AD011015450 and 2025-AD011016414) and was developed as part of the ExELang project funded by the European Union (ERC, ExELang, Grant No 101001095).
