The TartanAviation ATC Collection: Audio, ADS-B, and 531k Labels

TartanAviation¹ ships 398 hours of air traffic control radio and exactly zero transcripts. I set up a pipeline to transcribe all 531,050 utterances. The final result? A three-dataset collection on Hugging Face; check out the collection here.

A Rough Idea

AirLab keeps most of its data on a Swift object storage instance. The first job was just getting it mirrored onto Hugging Face, so here comes squawk. It downloads the audio and ADS-B² and sets up the initial mirror. Next, squawk runs a VAD³ to slice long clips into individual utterances, and the pipeline in readback labels every one. The labels land in a third dataset, text only, that snaps 1:1 onto utterances.

dataset	contents	size
clips	paired ATC audio + ADS-B	~41.8k clips
utterances	VAD-split segments + ADS-B	531,050 · ~398h
labels	transcript + confidence	531,050 · 25MB

The three-stage TartanAviation ATC collection on Hugging Face, built by squawk mirroring CMU TartanAviation. Raw clips (~41.8k paired ATC audio + ADS-B) are VAD-split into 531,050 utterances (~398h), then run through ensemble ASR, ROVER, ADS-B snap, and review to produce the labels dataset (531,050 rows, 25MB), the subject of this post.
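
As a rough illustration of the slicing step, here is a minimal sketch of how minimum-duration settings typically shape VAD output: silences shorter than a minimum gap are bridged, then speech spans shorter than a minimum length are dropped. The function name and the `(start, end)` input format are assumptions for illustration, not squawk's actual code; the 0.15 s and 0.1 s defaults are the values from the VAD footnote.

```python
def clean_spans(spans, min_speech=0.15, min_silence=0.1):
    """Bridge short silences, then drop short speech spans.

    spans: iterable of (start, end) times in seconds (hypothetical input format).
    """
    merged = []
    for start, end in sorted(spans):
        if merged and start - merged[-1][1] < min_silence:
            # Gap is shorter than the minimum silence: extend the previous span.
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    # Keep only spans long enough to count as speech.
    return [(s, e) for s, e in merged if e - s >= min_speech]


# Made-up spans: a 0.05 s gap gets bridged, a 0.05 s blip gets dropped.
raw = [(0.00, 0.40), (0.45, 1.20), (2.00, 2.05), (3.00, 4.50)]
utterances = clean_spans(raw)
print(utterances)  # [(0.0, 1.2), (3.0, 4.5)]
```

Each surviving span would then become one utterance row.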

The Working Set

At this point, all we have is raw audio clips and some nicely formatted ADS-B data, and no idea what anyone actually said. Two jobs, then: get words out of the audio, and pin the right callsign onto each clip. For the second, maybe we just see who gets mentioned, since most of the time the speaker names themselves, then snap that to a real ADS-B tail.
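
The snap idea can be sketched with nothing but the standard library. This is a minimal sketch, assuming the spoken callsign has already been normalized to a compact string; the function name, the 0.75 similarity floor, and the example callsigns are all made up for illustration and are not taken from the pipeline.

```python
from difflib import SequenceMatcher


def snap_callsign(heard, adsb_callsigns, min_ratio=0.75):
    """Return the ADS-B callsign closest to what was heard, or None if nothing is close."""
    best, best_ratio = None, 0.0
    for candidate in adsb_callsigns:
        ratio = SequenceMatcher(None, heard.upper(), candidate.upper()).ratio()
        if ratio > best_ratio:
            best, best_ratio = candidate, ratio
    # Refuse to snap when even the best candidate is a poor match.
    return best if best_ratio >= min_ratio else None


present = ["N512PA", "N7346G", "UAL1482"]  # hypothetical ADS-B list for one clip
print(snap_callsign("N512BA", present))  # N512PA
print(snap_callsign("N98765", present))  # None
```

Returning `None` instead of the nearest candidate matters: an aircraft can be on the radio without being in the ADS-B list, and a forced snap would pin the wrong tail onto the clip.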

Ensemble, Vote, Snap

Three ASR models⁴ transcribe each clip, weighted ROVER⁵ fuses them, and a fuzzy match snaps the spoken callsign to the right ADS-B tail. A review studio cleans a sample. Deliberately simple.
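
To make the vote concrete, here is a simplified sketch of the per-slot weighted majority. It assumes the hypotheses are already aligned into equal-length slot lists, with an empty string marking "no word here"; real ROVER builds that alignment itself, which this sketch skips. The model labels and example words are made up, and the 2x weight on the ATC-tuned Whisper follows the footnote on the ensemble.

```python
from collections import defaultdict


def weighted_vote(aligned, weights):
    """Pick the weighted-majority word in each slot of pre-aligned hypotheses."""
    total = sum(weights.values())
    n_slots = len(next(iter(aligned.values())))
    fused, confidences = [], []
    for i in range(n_slots):
        tally = defaultdict(float)
        for model, words in aligned.items():
            tally[words[i]] += weights[model]
        word, score = max(tally.items(), key=lambda kv: kv[1])
        if word:  # an empty string winning means the slot stays empty
            fused.append(word)
        confidences.append(score / total)  # share of the vote weight behind the winner
    return " ".join(fused), confidences


aligned = {
    "parakeet": ["cessna", "five", "one", "two", "cleared", "to", "land"],
    "canary": ["cessna", "nine", "one", "two", "cleared", "", "land"],
    "whisper_atc": ["cessna", "five", "one", "two", "cleared", "to", "land"],
}
weights = {"parakeet": 1.0, "canary": 1.0, "whisper_atc": 2.0}
text, conf = weighted_vote(aligned, weights)
print(text)     # cessna five one two cleared to land
print(conf[1])  # 0.75
```

The per-slot winning share is one plausible way to derive a vote confidence; whether the pipeline computes its ROVER confidence this way is not stated in the post.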

models in agreement	utterances	%
all three agree	60,327	11%
two of three	116,969	22%
only one	353,754	67%

How many of the three voting models fully agree with the fused transcript, across all 531,050 utterances. All three agree on only 11%; on 67% a single model carries the slot. The models genuinely disagree on most of this corpus, which is what makes the vote do real work and where I went looking for gains.

On 67% of clips only one of the three models agrees with the fused output. That divergence is where I went hunting for gains.
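
Counting agreement is simple enough to sketch. The version below assumes "fully agree" means an exact match after lowercasing and collapsing whitespace; that normalization is a guess, and the function name and example strings are hypothetical.

```python
def agreement_count(fused, hypotheses):
    """Count how many model hypotheses exactly match the fused transcript."""
    def norm(text):
        # Assumed normalization: case-insensitive, whitespace-insensitive.
        return " ".join(text.lower().split())

    target = norm(fused)
    return sum(norm(h) == target for h in hypotheses)


hyps = ["runway two eight", "Runway  two eight", "runway to eight"]
print(agreement_count("runway two eight", hyps))  # 2
```

Bucketing every utterance by this count (3, 2, or 1) gives a table shaped like the one above.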

Go Around

Three ideas. The vote beat all three.

Decode-time ADS-B biasing.⁶ Bias the decoder toward the present callsigns. It backfires: ADS-B lists who is airborne, not who is transmitting, and on 63% of callsign-list clips the spoken callsign isn’t in the list at all. Biasing just shoves in callsigns nobody said, WER climbs the harder you push, and fuse-plus-snap already recovers the real ones.

LLM correction. Hand an LLM the hypotheses and it cheerfully over-corrects the labels that were already right, with no signal⁷ for which ones to leave alone. Audio omni models⁸ ramble past 300% WER. Real gains do exist on the clips a human flagged, but you just can’t find them automatically.

Can an LLM out-vote ROVER? It cannot even reproduce the vote.⁹ I tried re-fusing from scratch, picking the single best, span-locking the agreed words, and a per-clip reasoning agent with the surrounding conversation.

Method	Overall WER	Category
oracle*	0.123	oracle
ROVER	0.135	baseline
per-clip	0.239	tried
select-best	0.243	tried
span-lock	0.275	tried
re-fuse	0.333	tried

Overall word error rate by fusion method on the 126-clip human-reviewed dev set, lower is better. Each stem runs from the dashed vote line to the method's dot, so its length is how far that method lands from ROVER. Every text-only method sits to the right — worse than the vote; only an unattainable perfect router (oracle*) edges left of it, because the disputed slots need the audio, which text methods do not have.

Every text-only method is worse overall. Accepted clips use the fused transcript as their reference¹⁰, so a rewrite only adds error; the one bar under ROVER needs a perfect oracle router the real one cannot match. The disputed slots are ambiguous in the audio, and text alone cannot resolve them better than the vote.
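
For readers who want the metric spelled out, here is a minimal sketch of word error rate as word-level edit distance divided by reference length. It is a plain textbook implementation, not the scoring code behind the table, and the example sentences are made up.

```python
def wer(reference, hypothesis):
    """Word error rate: word-level edit distance over the reference word count."""
    ref, hyp = reference.split(), hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] holds the edit distance between the reference prefix so far and hyp[:j].
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1] / len(ref)


accepted = "cleared to land runway two eight"
print(wer(accepted, accepted))                        # 0.0
print(wer(accepted, "cleared to land on runway 28"))  # 0.5
print(wer("wind calm", "uh the wind is currently calm at the field"))  # 3.5
```

The first two calls show the accepted-clip effect: when the reference is the fused text, the vote scores zero and any rewrite, even a reasonable paraphrase, can only add error. The third shows how WER passes 100%: insertions count against a short reference, so a rambling hypothesis can score several times the reference length.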

Scoring

The pipeline already computed how far an out-of-vote advisory model¹¹ diverged from the fused text, then threw the number away. Folding it back in, a plain average of three signals (agreement, the ROVER confidence, and how little the advisory model disagrees)¹² ranks correctness at AUC 0.80, beating every single signal and a learned blend that overfits. One number per label, pick your own cutoff.
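
A minimal sketch of the score and of how a ranking AUC can be checked, assuming all three signals are already scaled to 0..1. The function names and the toy review outcomes are hypothetical; only the plain mean over agreement, ROVER confidence, and one minus advisory disagreement comes from the post.

```python
from statistics import mean


def confidence(agreement, rover_conf, advisory_disagreement):
    """Plain average of three signals, each assumed to lie in [0, 1]."""
    return mean([agreement, rover_conf, 1.0 - advisory_disagreement])


def auc(scores, accepted):
    """Probability that a random accepted label outscores a random rejected one."""
    pos = [s for s, ok in zip(scores, accepted) if ok]
    neg = [s for s, ok in zip(scores, accepted) if not ok]
    # Ties count as half a win.
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


score = confidence(agreement=1.0, rover_conf=0.9, advisory_disagreement=0.2)

# Toy review outcomes: 1 = a human accepted the label, 0 = a human edited it.
toy_scores = [0.9, 0.7, 0.6, 0.8, 0.3]
toy_accepted = [1, 0, 1, 1, 0]
toy_auc = auc(toy_scores, toy_accepted)  # 5 of 6 accepted/rejected pairs ranked correctly
```

AUC only measures ordering, which is why a score that ranks well can still be uncalibrated.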

confidence >=	rows	% kept
0.15	531050	100.0
0.20	477255	89.9
0.25	446345	84.0
0.30	413805	77.9
0.35	391502	73.7
0.40	367110	69.1
0.45	342730	64.5
0.50	316848	59.7
0.55	286081	53.9
0.60	253801	47.8
0.65	218349	41.1
0.70	176672	33.3
0.75	136250	25.7
0.80	94727	17.8
0.85	57079	10.7
0.90	28884	5.4
0.95	13054	2.5

Distribution of the published confidence score across all 531,050 labeled rows, 20 bins over 0..1 (bins below 0.15 are empty). Rows at or above the cutoff are kept, the rest are dropped. The score ranks; it is not calibrated. On the hardest reviewed slice the accept rate is ~91% at >=0.8 and ~97% at >=0.9, so the true accept rate across the full set is higher. The cutoff is a volume-vs-precision dial, not a probability.
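
Applying a cutoff is a one-line filter. The sketch below uses plain dictionaries and assumes the score lives in a column named `confidence`; that column name, the helper name, and the toy rows are assumptions for illustration. The last line just rechecks one row of the table above.

```python
def keep_at_or_above(rows, cutoff, key="confidence"):
    """Return the rows at or above the cutoff and the percentage kept."""
    kept = [row for row in rows if row[key] >= cutoff]
    return kept, 100.0 * len(kept) / len(rows)


# Toy rows with made-up transcripts and scores.
rows = [
    {"text": "cleared to land runway two eight", "confidence": 0.92},
    {"text": "wind calm", "confidence": 0.81},
    {"text": "traffic in sight", "confidence": 0.55},
    {"text": "say again", "confidence": 0.30},
]
kept, pct = keep_at_or_above(rows, cutoff=0.80)
print(len(kept), pct)  # 2 50.0

# Sanity check on the published table: 94,727 of 531,050 rows at >= 0.80.
published_pct = round(100 * 94727 / 531050, 1)
print(published_pct)  # 17.8
```

Raising the cutoff trades volume for precision; the table gives the volume side of that trade at each step.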

Usage

twangodev/tartanaviation-atc-labels: 531,050 rows, one transcript and one confidence each, a 1:1 add-on to the utterances dataset, 25 MB, no audio, CC-BY-4.0. Same 184 shards, same row order, so the join is one line:

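from datasets import load_dataset, concatenate_datasets

# Both datasets share the same shards and row order, so rows line up by position.
src = load_dataset("twangodev/tartanaviation-atc-adsb-utterances", split="train")
lab = load_dataset("twangodev/tartanaviation-atc-labels", split="train")
# axis=1 concatenates column-wise; both datasets must have the same number of rows.
joined = concatenate_datasets([src, lab], axis=1)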

The labels train the next rasr checkpoint.

Footnotes

¹ CMU AirLab’s terminal-airspace corpus: paired ATC radio and ADS-B from two Pittsburgh airports, KAGC and KBTP. CC-BY-4.0.
² Automatic Dependent Surveillance-Broadcast: aircraft continuously broadcast their GPS position, velocity, and identity (including callsign) on 1090 MHz, logged by ground receivers. The callsign field is what pairs onto each clip.
³ pyannote.audio’s VoiceActivityDetection pipeline over pyannote/segmentation-3.0, 16 kHz, with a 0.15s minimum speech span and a 0.1s minimum silence gap.
⁴ nvidia/parakeet-tdt-0.6b-v2, nvidia/canary-qwen-2.5b, and jlvdoorn/whisper-large-v3-atco2-asr. The ATC-tuned Whisper carries a 2x vote weight; it is the strongest member.
⁵ ROVER, Recognizer Output Voting Error Reduction (Fiscus, 1997): align the hypotheses word by word, take the weighted majority at each slot. The baseline to beat.
⁶ NVIDIA’s GPU-PB / TurboBias boosting tree in NeMo: shallow-fusion contextual biasing at decode time, no retraining.
⁷ Not inter-model agreement, not the model’s own confidence. The model is confidently wrong.
⁸ Qwen3-Omni (bf16) and NVIDIA Nemotron Nano 3 Omni (NVFP4).
⁹ Told to take a per-word majority, it matches ROVER’s exact output 20% of the time. It paraphrases and normalizes; it cannot help but interpret.
¹⁰ A human accepted the fused text, so ROVER scores zero error there by construction, and any rewrite only adds error. The honest test is the slice a human had to edit.
¹¹ twangodev/rasr-parakeet-v1, the in-house model whose successor (rasr-parakeet-v2) these labels train. Held out of the vote so it can flag its own blind spots.
¹² $c = \operatorname{mean}(a,\ r,\ 1 - d)$ over agreement $a$, ROVER confidence $r$, and advisory disagreement $d$.