The TartanAviation ATC Collection: Audio, ADS-B, and 531k Labels

TartanAviation[1] ships 398 hours of air traffic control radio and exactly zero transcripts. I set up a pipeline to transcribe all 531,050 utterances. The final result? A three-dataset collection on Hugging Face.

A Rough Idea

AirLab keeps most of its data on a Swift object storage instance. The first job was just getting it mirrored onto Hugging Face, so here comes squawk. It downloads the audio and ADS-B[2] and sets up the initial mirror. Next, squawk runs a VAD[3] to slice long clips into individual utterances, and the pipeline in readback labels every one. The labels land in a third dataset, text only, that snaps 1:1 onto utterances.

clips: paired ATC audio + ADS-B, ~41.8k clips
utterances: VAD-split segments + ADS-B, 531,050 · ~398h
labels: transcript + confidence, 531,050 · 25 MB

The three-stage TartanAviation ATC collection on Hugging Face, built by squawk mirroring CMU TartanAviation. Raw clips (~41.8k paired ATC audio + ADS-B) are VAD-split into 531,050 utterances (~398h), then run through ensemble ASR, ROVER, ADS-B snap, and review to produce the labels dataset (531,050 rows, 25 MB), the subject of this post.
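Footnote 3 gives the two thresholds the VAD runs with. Here is a rough sketch of what those thresholds do to raw speech spans, assuming spans arrive as (start, end) pairs in seconds. It is a pure-Python stand-in for illustration, not pyannote's implementation, and `clean_spans` is a hypothetical name.

```python
def clean_spans(spans, min_speech=0.15, min_silence=0.1):
    """Bridge silences shorter than min_silence, then drop spans shorter than min_speech."""
    merged = []
    for start, end in sorted(spans):
        if merged and start - merged[-1][1] < min_silence:
            # Gap is too short to count as silence: extend the previous span.
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    # Anything still shorter than min_speech is treated as a click, not speech.
    return [(s, e) for s, e in merged if e - s >= min_speech]


# Made-up spans: a 50 ms gap gets bridged, a 50 ms blip gets dropped.
raw = [(0.00, 1.20), (1.25, 2.00), (3.00, 3.05), (4.00, 5.50)]
utterances = clean_spans(raw)
print(utterances)  # [(0.0, 2.0), (4.0, 5.5)]
```

Each surviving span becomes one utterance row, which is how ~41.8k clips turn into 531,050 utterances.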

The Working Set

At this point, all we have is raw audio clips and some nicely formatted ADS-B data, and no idea what anyone actually said. Two jobs, then: get words out of the audio, and pin the right callsign onto each clip. For the second, maybe we just see who gets mentioned, since most of the time the speaker names themselves, then snap that to a real ADS-B tail.
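A minimal sketch of the snap idea, assuming the spoken callsign has already been normalized into a compact string and the clip comes with a list of ADS-B tails. The function name, the 0.7 cutoff, and the tail numbers are all made up for illustration. The real pipeline's matcher may work differently.

```python
from difflib import get_close_matches


def snap_callsign(spoken, adsb_tails, cutoff=0.7):
    """Return the ADS-B tail closest to the spoken callsign, or None if nothing is close."""
    candidates = [t.upper() for t in adsb_tails]
    matches = get_close_matches(spoken.upper(), candidates, n=1, cutoff=cutoff)
    return matches[0] if matches else None


tails = ["N512SP", "N738KD", "N9921C"]  # hypothetical tails present in the ADS-B log
print(snap_callsign("N5125P", tails))   # ASR heard "5" for "S": snaps to N512SP
print(snap_callsign("UAL442", tails))   # nothing similar in the list: None
```

Returning None when nothing clears the cutoff matters here: a forced snap would pin a callsign on a clip whose speaker never shows up in ADS-B.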

Ensemble, Vote, Snap

Three ASR models[4] transcribe each clip, weighted ROVER[5] fuses them, and a fuzzy match snaps the spoken callsign to the right ADS-B tail. A review studio cleans a sample. Deliberately simple.
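The voting half of ROVER is small enough to sketch. This assumes the three hypotheses are already aligned word by word, with an empty string marking a gap; real ROVER builds that alignment with dynamic programming, which is skipped here. The 2x weight on the ATC-tuned Whisper comes from footnote 4; the other weights, the function name, and the example words are assumptions.

```python
from collections import defaultdict


def vote(aligned, weights):
    """Weighted majority per slot over pre-aligned hypotheses.

    aligned: equal-length word lists, "" marking a gap in that hypothesis.
    Ties go to the word seen first in this sketch.
    """
    fused = []
    for slot in zip(*aligned):
        tally = defaultdict(float)
        for word, weight in zip(slot, weights):
            tally[word] += weight
        best = max(tally, key=tally.get)
        if best:  # a winning gap emits no word
            fused.append(best)
    return fused


hyps = [
    ["cessna", "five", "one", "two", "runaway", "two", "eight"],  # parakeet, weight 1
    ["cessna", "five", "one", "too", "railway", "", "eight"],     # canary, weight 1
    ["cessna", "five", "one", "two", "runway", "two", "eight"],   # ATC Whisper, weight 2
]
print(" ".join(vote(hyps, weights=[1.0, 1.0, 2.0])))
# cessna five one two runway two eight
```

In the fifth slot all three models disagree, and the double weight is what lets a single model carry it.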

Models in agreement   Utterances   %
all three agree       60,327       11%
two of three          116,969      22%
only one              353,754      67%
How many of the three voting models fully agree with the fused transcript, across all 531,050 utterances. All three agree on only 11%; on 67% a single model carries the slot. The models genuinely disagree on most of this corpus, which is what makes the vote do real work and where I went looking for gains.

On 67% of clips only one of the three models agrees with the fused output. That divergence is where I went hunting for gains.

Go Around

Three ideas. The vote beat all three.

Decode-time ADS-B biasing.[6] Bias the decoder toward the present callsigns. It backfires: ADS-B lists who is airborne, not who is transmitting, and on 63% of callsign-list clips the spoken callsign isn’t in the list at all. Biasing just shoves in callsigns nobody said, WER climbs the harder you push, and fuse-plus-snap already recovers the real ones.

LLM correction. Hand an LLM the hypotheses and it cheerfully over-corrects the labels that were already right, with no signal[7] for which ones to leave alone. Audio omni models[8] ramble past 300% WER. Real gains do exist on the clips a human flagged, but you just can’t find them automatically.

Can an LLM out-vote ROVER? It cannot even reproduce the vote.[9] I tried re-fusing from scratch, picking the single best, span-locking the agreed words, and a per-clip reasoning agent with the surrounding conversation.

Method        Overall WER   Category
oracle*       0.123         oracle
ROVER         0.135         baseline
per-clip      0.239         tried
select-best   0.243         tried
span-lock     0.275         tried
re-fuse       0.333         tried
Overall word error rate by fusion method on the 126-clip human-reviewed dev set, lower is better. Each stem runs from the dashed vote line to the method's dot, so its length is how far that method lands from ROVER. Every text-only method sits to the right — worse than the vote; only an unattainable perfect router (oracle*) edges left of it, because the disputed slots need the audio, which text methods do not have.

Every text-only method is worse overall. Accepted clips use the fused transcript as their reference[10], so a rewrite only adds error; the one bar under ROVER needs a perfect oracle router the real one cannot match. The disputed slots are ambiguous in the audio, and text alone cannot resolve them better than the vote.
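To make the scoring concrete, here is a plain word-level WER, the standard edit-distance definition rather than whatever scorer the pipeline actually uses. The transcripts are invented. The point is footnote 10: on an accepted clip the fused text is the reference, so the vote scores zero and any normalizing rewrite pays for every word it touches.

```python
def wer(reference, hypothesis):
    """Word error rate: word-level edit distance divided by reference length.

    Assumes a non-empty reference.
    """
    ref, hyp = reference.split(), hypothesis.split()
    prev = list(range(len(hyp) + 1))  # distances from an empty reference prefix
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            curr.append(min(
                prev[j] + 1,             # deletion
                curr[j - 1] + 1,         # insertion
                prev[j - 1] + (r != h),  # substitution, or free match
            ))
        prev = curr
    return prev[-1] / len(ref)


fused = "cessna five one two cleared to land runway two eight"
rewrite = "cessna 512 cleared to land runway 28"  # an LLM tidying up the digits

print(wer(fused, fused))    # 0.0
print(wer(fused, rewrite))  # 0.5
```

The rewrite is arguably the same clearance, and it still costs five edits against a ten-word reference.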

Scoring

The pipeline already computed how far an out-of-vote advisory model[11] diverged from the fused text, then threw the number away. Folding it back in, a plain average of three signals (agreement, the ROVER confidence, and how little the advisory model disagrees)[12] ranks correctness at AUC 0.80, beating every single signal and a learned blend that overfits. One number per label, pick your own cutoff.
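A sketch of the score from footnote 12 plus a pairwise AUC, assuming all three signals are already scaled to [0, 1]. How agreement and disagreement are scaled in the real pipeline is not stated, and the numbers below are toy values, not results.

```python
from statistics import mean


def confidence(agreement, rover_conf, advisory_disagreement):
    """c = mean(a, r, 1 - d), with every input assumed to lie in [0, 1]."""
    return mean([agreement, rover_conf, 1.0 - advisory_disagreement])


def auc(scores, labels):
    """Chance that a random correct label outscores a random incorrect one."""
    pos = [s for s, y in zip(scores, labels) if y]
    neg = [s for s, y in zip(scores, labels) if not y]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


strong = confidence(1.0, 0.9, 0.05)  # everyone agrees, advisory model barely differs
weak = confidence(1 / 3, 0.6, 0.40)  # one model carries it, advisory model objects
print(round(strong, 3), round(weak, 3))

# Toy check: 1 = a reviewer accepted the label, 0 = they had to edit it.
toy_auc = auc([strong, weak, 0.7, 0.3], [1, 0, 1, 1])
print(round(toy_auc, 3))
```

AUC only asks whether correct labels sort above incorrect ones, which is why an uncalibrated average can still score well on it.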

Confidence >=   Rows      % kept
0.15            531,050   100.0
0.20            477,255   89.9
0.25            446,345   84.0
0.30            413,805   77.9
0.35            391,502   73.7
0.40            367,110   69.1
0.45            342,730   64.5
0.50            316,848   59.7
0.55            286,081   53.9
0.60            253,801   47.8
0.65            218,349   41.1
0.70            176,672   33.3
0.75            136,250   25.7
0.80            94,727    17.8
0.85            57,079    10.7
0.90            28,884    5.4
0.95            13,054    2.5
Distribution of the published confidence score across all 531,050 labeled rows, 20 bins over 0..1 (bins below 0.15 are empty). Rows at or above the cutoff are kept, the rest are dropped. The score ranks, it is not calibrated: on the hardest reviewed slice the accept rate is ~91% at >=0.8 and ~97% at >=0.9, so the true accept rate across the full set is higher. The cutoff is a volume-vs-precision dial, not a probability.
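Picking a cutoff is just a sweep over the score column. A small sketch on made-up scores, assuming you have pulled the confidence values into a plain list; `sweep` is a hypothetical helper, not part of squawk or readback.

```python
def sweep(scores, cutoffs):
    """Rows kept and percent kept at each confidence cutoff."""
    total = len(scores)
    table = []
    for cutoff in cutoffs:
        kept = sum(s >= cutoff for s in scores)
        table.append((cutoff, kept, round(100.0 * kept / total, 1)))
    return table


toy_scores = [0.22, 0.41, 0.58, 0.63, 0.79, 0.80, 0.86, 0.91]  # not real data
toy_table = sweep(toy_scores, [0.5, 0.8, 0.9])
for cutoff, kept, pct in toy_table:
    print(f"{cutoff:.2f}  {kept}  {pct}")
```

Raising the cutoff trades rows for precision, the same dial the caption describes.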

Usage

twangodev/tartanaviation-atc-labels: 531,050 rows, one transcript and one confidence each, a 1:1 add-on to the utterances dataset, 25 MB, no audio, CC-BY-4.0. Same 184 shards, same row order, so the join is one line:

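from datasets import load_dataset, concatenate_datasets

# Both datasets share shard layout and row order, so rows line up by position.
src = load_dataset("twangodev/tartanaviation-atc-adsb-utterances", split="train")
lab = load_dataset("twangodev/tartanaviation-atc-labels", split="train")
# axis=1 pastes the label columns onto the utterance rows; there is no key-based join.
joined = concatenate_datasets([src, lab], axis=1)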

The labels train the next rasr checkpoint.

Footnotes

  1. CMU AirLab’s terminal-airspace corpus: paired ATC radio and ADS-B from two Pittsburgh airports, KAGC and KBTP. CC-BY-4.0.
  2. Automatic Dependent Surveillance-Broadcast: aircraft continuously broadcast their GPS position, velocity, and identity (including callsign) on 1090 MHz, logged by ground receivers. The callsign field is what pairs onto each clip.
  3. pyannote.audio’s VoiceActivityDetection pipeline over pyannote/segmentation-3.0, 16 kHz, with a 0.15s minimum speech span and a 0.1s minimum silence gap.
  4. nvidia/parakeet-tdt-0.6b-v2, nvidia/canary-qwen-2.5b, and jlvdoorn/whisper-large-v3-atco2-asr. The ATC-tuned Whisper carries a 2x vote weight; it is the strongest member.
  5. ROVER, Recognizer Output Voting Error Reduction (Fiscus, 1997): align the hypotheses word by word, take the weighted majority at each slot. The baseline to beat.
  6. NVIDIA’s GPU-PB / TurboBias boosting tree in NeMo: shallow-fusion contextual biasing at decode time, no retraining.
  7. Not inter-model agreement, not the model’s own confidence. The model is confidently wrong.
  8. Qwen3-Omni (bf16) and NVIDIA Nemotron Nano 3 Omni (NVFP4).
  9. Told to take a per-word majority, it matches ROVER’s exact output 20% of the time. It paraphrases and normalizes; it cannot help but interpret.
  10. A human accepted the fused text, so ROVER scores zero error there by construction, and any rewrite only adds error. The honest test is the slice a human had to edit.
  11. twangodev/rasr-parakeet-v1, the in-house model whose successor (rasr-parakeet-v2) these labels train. Held out of the vote so it can flag its own blind spots.
  12. $c = \operatorname{mean}(a,\ r,\ 1 - d)$ over agreement $a$, ROVER confidence $r$, and advisory disagreement $d$.