# The TartanAviation ATC Collection: Audio, ADS-B, and 531k Labels

[TartanAviation](https://arxiv.org/abs/2403.03372)[^1] ships 398 hours of air traffic control radio and exactly zero transcripts. I set up a pipeline to transcribe all 531,050 utterances. The final result? A three-dataset [collection](https://huggingface.co/collections/twangodev/tartanaviation-atc-adsb) on Hugging Face.

## A Rough Idea

AirLab keeps most of its data on a Swift object storage instance. The first job was just getting it mirrored onto Hugging Face, which is where [squawk](https://github.com/twangodev/squawk) comes in. It downloads the audio and ADS-B[^2] and sets up the initial mirror. Next, squawk runs a VAD[^3] to slice long clips into individual utterances, and the pipeline in [readback](https://github.com/twangodev/readback) labels every one. The labels land in a third dataset, text only, that snaps 1:1 onto utterances.

The TartanAviation ATC collection on Hugging Face, mirroring CMU TartanAviation:

- **[tartanaviation-atc-adsb](https://huggingface.co/datasets/twangodev/tartanaviation-atc-adsb)**: clips, paired ATC audio + ADS-B (~41.8k clips). Built by [squawk](https://github.com/twangodev/squawk).
- **[tartanaviation-atc-adsb-utterances](https://huggingface.co/datasets/twangodev/tartanaviation-atc-adsb-utterances)**: utterances, VAD-split segments + ADS-B (531,050 utterances, ~398 h). Built by [squawk](https://github.com/twangodev/squawk).
- **[tartanaviation-atc-labels](https://huggingface.co/datasets/twangodev/tartanaviation-atc-labels)**: labels, transcript + confidence (531,050 rows, 25 MB). Built by [readback](https://github.com/twangodev/readback).

## The Working Set

At this point, all we have is raw audio clips and some nicely formatted ADS-B data, and no idea what anyone actually said. Two jobs, then: get words out of the audio, and pin the right callsign onto each clip. For the second, maybe we just see who gets mentioned, since most of the time the speaker names themselves, then snap that to a real ADS-B tail.
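To make the "snap" idea concrete, here is a minimal sketch of fuzzy-matching a heard callsign against the ADS-B callsigns present for a clip. It is not readback's actual matcher: the function name `snap_callsign`, the 0.75 cutoff, and the assumption that the spoken callsign has already been normalized to a compact form like `N123AB` are all mine, and the matching uses the standard library's `difflib` rather than whatever the pipeline uses.

```python
from difflib import get_close_matches


def snap_callsign(spoken, adsb_callsigns, cutoff=0.75):
    """Return the ADS-B callsign closest to `spoken`, or None if none is close enough.

    Hypothetical helper: `spoken` is assumed to be already normalized
    (e.g. "N123AB"), and `cutoff` is an illustrative similarity threshold.
    """
    # Normalize candidates the same way and drop empty entries.
    candidates = [c.strip().upper() for c in adsb_callsigns if c and c.strip()]
    matches = get_close_matches(spoken.strip().upper(), candidates, n=1, cutoff=cutoff)
    return matches[0] if matches else None


# Made-up callsigns for illustration only.
present = ["N123AB", "N45XY", "UAL2210"]
print(snap_callsign("N123AD", present))  # one character off from a present tail
print(snap_callsign("DAL88", present))   # nothing similar is present
```

Returning `None` when nothing clears the cutoff matters here: forcing a match would pin a callsign on clips where the speaker is simply not in the ADS-B list.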

## Ensemble, Vote, Snap

Three ASR models[^4] transcribe each clip, weighted ROVER[^5] fuses them, and a fuzzy match snaps the spoken callsign to the right ADS-B tail. A review studio cleans a sample. Deliberately simple.
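The voting half of ROVER can be sketched in a few lines. This is a simplification, not the pipeline's code: it assumes the hypotheses are already aligned slot by slot (real ROVER builds that alignment first), uses `None` for a gap, and the weights `[1, 1, 2]` only mirror the 2x weight on the ATC-tuned Whisper, with the other two weights assumed to be 1. The function name and the sample hypotheses are hypothetical.

```python
from collections import defaultdict


def weighted_slot_vote(aligned, weights):
    """Fuse pre-aligned hypotheses by weighted majority at each slot.

    aligned: equal-length token lists, one per model; None marks a gap.
    weights: one vote weight per model (illustrative values).
    """
    fused = []
    for slot in zip(*aligned):
        tally = defaultdict(float)
        for token, weight in zip(slot, weights):
            tally[token] += weight
        # Highest total weight wins; ties go to the first token seen in the slot.
        winner = max(tally, key=tally.get)
        if winner is not None:  # a winning gap means "emit nothing here"
            fused.append(winner)
    return " ".join(fused)


# Made-up, already-aligned hypotheses; the third model carries double weight.
hyps = [
    ["runway", "to", "eight", None, "cleared"],
    ["runway", "too", "eight", "left", "cleared"],
    ["runway", "two", "eight", "left", "cleared"],
]
print(weighted_slot_vote(hyps, weights=[1, 1, 2]))
```

In the second slot all three models disagree, and the doubled weight is what lets one hypothesis carry the slot instead of leaving a three-way tie.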

Inter-model agreement with the fused output across all 531,050 utterances:

| models in agreement | utterances | % |
| --- | --- | --- |
| all three agree | 60,327 | 11% |
| two of three | 116,969 | 22% |
| only one | 353,754 | 67% |

On **67% of clips only one of the three models** agrees with the fused output. That divergence is where I went hunting for gains.

## Go Around

Three ideas. The vote beat all three.

**Decode-time ADS-B biasing.**[^6] Bias the decoder toward the present callsigns. It backfires: ADS-B lists who is airborne, not who is transmitting, and on **63% of callsign-list clips the spoken callsign isn't in the list at all**. Biasing just shoves in callsigns nobody said, word error rate (WER) climbs the harder you push, and fuse-plus-snap already recovers the real ones.

**LLM correction.** Hand an LLM the hypotheses and it cheerfully over-corrects the labels that were already right, with no signal[^7] for which ones to leave alone. Audio omni models[^8] ramble past **300% WER**. Real gains do exist on the clips a human flagged, but you just can't find them automatically.

**Can an LLM out-vote ROVER?** It cannot even reproduce the vote.[^9] I tried re-fusing from scratch, picking the single best, span-locking the agreed words, and a per-clip reasoning agent with the surrounding conversation.

Overall word error rate by fusion method on the 126-clip human-reviewed dev set, lower is better, sorted ascending. ROVER is the vote to beat. Every text-only LLM method (per-clip 0.239, select-best 0.243, span-lock 0.275, re-fuse 0.333) scores worse than the vote. Only an unattainable per-clip LLM routed by a perfect oracle (oracle* 0.123) edges ahead of it. The disputed slots need the audio, which the text methods do not have.

| Method | Overall WER | Category |
| --- | --- | --- |
| oracle* | 0.123 | oracle |
| ROVER | 0.135 | baseline |
| per-clip | 0.239 | tried |
| select-best | 0.243 | tried |
| span-lock | 0.275 | tried |
| re-fuse | 0.333 | tried |

Every text-only method is worse overall. Accepted clips use the fused transcript as their reference[^10], so a rewrite only adds error; the one entry under ROVER needs a perfect oracle router the real one cannot match. The disputed slots are ambiguous in the audio, and **text alone cannot resolve them better than the vote**.
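For reference, WER is the word-level edit distance between hypothesis and reference, divided by the number of reference words. The sketch below is the textbook per-utterance definition, not the scoring script behind the table; how the "overall" figure pools clips (summed errors versus averaged rates) is not something this sketch assumes, and the example sentences are made up.

```python
def wer(reference, hypothesis):
    """Word error rate: (substitutions + deletions + insertions) / reference words."""
    ref, hyp = reference.split(), hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] = edit distance between the ref words seen so far and hyp[:j]
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        cur = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cur.append(min(
                prev[j] + 1,                           # deletion
                cur[j - 1] + 1,                        # insertion
                prev[j - 1] + (ref_word != hyp_word),  # substitution or match
            ))
        prev = cur
    return prev[-1] / len(ref)


# One dropped word plus one extra word over six reference words.
print(wer("cleared to land runway two eight", "cleared land runway two eight left"))
# Insertions are unbounded, which is how a rambling model can pass 100%.
print(wer("roger", "roger that sounds good"))
```

Because the denominator is the reference length, a model that keeps talking past a short transmission can rack up several errors per reference word.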

## Scoring

The pipeline already computed how far an out-of-vote advisory model[^11] diverged from the fused text, then threw the number away. Folding it back in, a plain average of three signals (agreement, the ROVER confidence, and how little the advisory model disagrees)[^12] ranks correctness at **AUC 0.80**, beating every single signal and a learned blend that overfits. One number per label, pick your own cutoff.
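A minimal sketch of both pieces, the plain average and the AUC used to judge it. It assumes all three signals are already scaled to 0..1 (the post does not say how agreement is scaled), and the function names and toy numbers are illustrative rather than taken from readback.

```python
def confidence(agreement, rover_conf, advisory_disagreement):
    """Plain mean of agreement, ROVER confidence, and (1 - advisory disagreement)."""
    return (agreement + rover_conf + (1.0 - advisory_disagreement)) / 3.0


def auc(scores, correct):
    """Probability that a correct label outscores an incorrect one (ties count half)."""
    pos = [s for s, ok in zip(scores, correct) if ok]
    neg = [s for s, ok in zip(scores, correct) if not ok]
    if not pos or not neg:
        raise ValueError("need at least one correct and one incorrect label")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Toy values, not real rows from the dataset.
print(confidence(agreement=1.0, rover_conf=0.9, advisory_disagreement=0.2))
print(auc([0.9, 0.8, 0.4, 0.3], [True, False, True, False]))
```

AUC only looks at ordering, which fits a score meant to rank labels rather than report a calibrated probability.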

Distribution of the published confidence score across all 531,050 labeled rows, in steps of 0.05 (no row scores below 0.15). The table sweeps the cutoff: each row is a threshold and the number of rows kept at or above it. Confidence ranks rather than calibrates: on the hardest reviewed slice the accept rate is ~91% at >=0.8 and ~97% at >=0.9, so the true accept rate across the full set is higher. Pick the cutoff that trades volume against precision the way you need.

| confidence >= | rows | % kept |
| --- | --- | --- |
| 0.15 | 531050 | 100.0 |
| 0.20 | 477255 | 89.9 |
| 0.25 | 446345 | 84.0 |
| 0.30 | 413805 | 77.9 |
| 0.35 | 391502 | 73.7 |
| 0.40 | 367110 | 69.1 |
| 0.45 | 342730 | 64.5 |
| 0.50 | 316848 | 59.7 |
| 0.55 | 286081 | 53.9 |
| 0.60 | 253801 | 47.8 |
| 0.65 | 218349 | 41.1 |
| 0.70 | 176672 | 33.3 |
| 0.75 | 136250 | 25.7 |
| 0.80 | 94727 | 17.8 |
| 0.85 | 57079 | 10.7 |
| 0.90 | 28884 | 5.4 |
| 0.95 | 13054 | 2.5 |
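The sweep itself is a count of rows at or above each cutoff. A minimal sketch on a toy list of scores, with a hypothetical function name; the real numbers above come from the published dataset, not from this snippet.

```python
def sweep(confidences, cutoffs):
    """For each cutoff, return (cutoff, rows kept at or above it, percent kept)."""
    total = len(confidences)
    table = []
    for cutoff in cutoffs:
        kept = sum(1 for score in confidences if score >= cutoff)
        table.append((cutoff, kept, round(100.0 * kept / total, 1)))
    return table


# Toy scores for illustration.
scores = [0.2, 0.5, 0.5, 0.9]
for cutoff, kept, pct in sweep(scores, [0.15, 0.5, 0.9]):
    print(f"{cutoff:.2f} | {kept} | {pct}")
```

On the real labels the same cutoff could be applied with `Dataset.filter`, assuming the score column is named `confidence`, which I have not verified against the dataset schema here.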

## Usage

[twangodev/tartanaviation-atc-labels](https://huggingface.co/datasets/twangodev/tartanaviation-atc-labels): 531,050 rows, one transcript and one confidence each, a 1:1 add-on to the [utterances dataset](https://huggingface.co/datasets/twangodev/tartanaviation-atc-adsb-utterances), 25 MB, no audio, CC-BY-4.0. Same 184 shards, same row order, so the join is one line:

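```python
from datasets import load_dataset, concatenate_datasets

# Utterances (audio + ADS-B) and labels (transcript + confidence), same row order.
src = load_dataset("twangodev/tartanaviation-atc-adsb-utterances", split="train")
lab = load_dataset("twangodev/tartanaviation-atc-labels", split="train")

# axis=1 joins column-wise; row counts must match and column names must not collide.
joined = concatenate_datasets([src, lab], axis=1)
```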

The labels train the next [rasr](https://github.com/twangodev/rasr) checkpoint.

## Notes

[^1]: CMU [AirLab](https://theairlab.org)'s terminal-airspace corpus: paired ATC radio and ADS-B from two Pittsburgh airports, KAGC and KBTP. [CC-BY-4.0](https://creativecommons.org/licenses/by/4.0/).

[^2]: Automatic Dependent Surveillance-Broadcast: aircraft continuously broadcast their GPS position, velocity, and identity (including callsign) on 1090 MHz, logged by ground receivers. The callsign field is what pairs onto each clip.

[^3]: [pyannote.audio](https://github.com/pyannote/pyannote-audio)'s VoiceActivityDetection pipeline over [pyannote/segmentation-3.0](https://huggingface.co/pyannote/segmentation-3.0), 16 kHz, with a 0.15s minimum speech span and a 0.1s minimum silence gap.

[^4]: [nvidia/parakeet-tdt-0.6b-v2](https://huggingface.co/nvidia/parakeet-tdt-0.6b-v2), [nvidia/canary-qwen-2.5b](https://huggingface.co/nvidia/canary-qwen-2.5b), and [jlvdoorn/whisper-large-v3-atco2-asr](https://huggingface.co/jlvdoorn/whisper-large-v3-atco2-asr). The ATC-tuned Whisper carries a 2x vote weight; it is the strongest member.

[^5]: ROVER, Recognizer Output Voting Error Reduction ([Fiscus, 1997](https://doi.org/10.1109/ASRU.1997.659110)): align the hypotheses word by word, take the weighted majority at each slot. The baseline to beat.

[^6]: NVIDIA's GPU-PB / TurboBias boosting tree in [NeMo](https://github.com/NVIDIA/NeMo): shallow-fusion contextual biasing at decode time, no retraining.

[^7]: Not inter-model agreement, not the model's own confidence. The model is confidently wrong.

[^8]: [Qwen3-Omni](https://huggingface.co/Qwen/Qwen3-Omni-30B-A3B-Instruct) (bf16) and [NVIDIA Nemotron Nano 3 Omni](https://huggingface.co/nvidia/Nemotron-3-Nano-Omni-30B-A3B-Reasoning-NVFP4) (NVFP4).

[^9]: Told to take a per-word majority, it matches ROVER's exact output 20% of the time. It paraphrases and normalizes; it cannot help but interpret.

[^10]: A human accepted the fused text, so ROVER scores zero error there by construction, and any rewrite only adds error. The honest test is the slice a human had to edit.

[^11]: [twangodev/rasr-parakeet-v1](https://huggingface.co/twangodev/rasr-parakeet-v1), the in-house model whose successor ([rasr-parakeet-v2](https://huggingface.co/twangodev/rasr-parakeet-v2)) these labels train. Held out of the vote so it can flag its own blind spots.

[^12]: $c = \operatorname{mean}(a,\ r,\ 1 - d)$ over agreement $a$, ROVER confidence $r$, and advisory disagreement $d$.
