How Does AI Transcribe Piano from YouTube? (NotaGen Explained)
A plain-English walkthrough of how modern transformer-based models turn audio waveforms into sheet music, and why solo piano works best.
The problem nobody could solve for 50 years
Automatic music transcription has been a research goal since the 1970s. The problem statement is simple: input is an audio recording, output is sheet music or MIDI. Every step in between is hard.
Early approaches used signal processing — Fourier transforms to find frequencies, peak detection to find note onsets. These worked on pure sine waves and barely worked on real music. A piano note isn't a single frequency; it's a fundamental plus a stack of harmonics that overlap with the harmonics of every other note being played.
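To see why overlapping harmonics are such a problem, here is a minimal sketch that lists which partials of two notes nearly coincide. It assumes an idealized string whose harmonics are exact integer multiples of the fundamental, and equal-temperament tuning with A4 = 440 Hz; real piano strings are slightly inharmonic. The function names and the 20-cent tolerance are illustrative choices, not part of any transcription library.

```python
import math

def midi_to_hz(pitch: int) -> float:
    """Equal-temperament frequency of a MIDI pitch (69 = A4 = 440 Hz)."""
    return 440.0 * 2 ** ((pitch - 69) / 12)

def harmonics(pitch: int, count: int = 8) -> list[float]:
    """First `count` partials of an idealized (perfectly harmonic) string."""
    f0 = midi_to_hz(pitch)
    return [f0 * k for k in range(1, count + 1)]

def shared_partials(pitch_a: int, pitch_b: int, count: int = 8,
                    tolerance_cents: float = 20.0) -> list[tuple[int, int]]:
    """Pairs (k_a, k_b) where partial k_a of note A lands within
    `tolerance_cents` of partial k_b of note B."""
    pairs = []
    for k_a, f_a in enumerate(harmonics(pitch_a, count), start=1):
        for k_b, f_b in enumerate(harmonics(pitch_b, count), start=1):
            # Distance between the two partials in cents (1200 per octave).
            if abs(1200 * math.log2(f_a / f_b)) <= tolerance_cents:
                pairs.append((k_a, k_b))
    return pairs

# Middle C (60) and the G above it (67), a perfect fifth apart.
print(shared_partials(60, 67))
```

For C and G a fifth apart, the 3rd partial of C and the 2nd partial of G sit about 2 cents apart, so a simple peak detector sees one spectral peak that belongs to both notes.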
The 2010s brought convolutional neural networks that learned to recognize note-onset patterns directly from spectrograms. Big jump in accuracy. Still couldn't handle dense polyphony — pianists routinely play 4–8 notes simultaneously, and CNN onset detectors couldn't always tell them apart.
The 2020s brought transformer-based sequence models trained on tens of thousands of hours of paired audio + MIDI. Suddenly, "good enough to use" became real.
What "transformer" means here
The same architecture that powers ChatGPT powers modern music transcription. Different inputs (audio frames vs. text tokens), same fundamental idea: take a sequence in, predict the most likely sequence out.
For transcription, the input is a sequence of audio frames (the song chopped into ~10ms slices, each represented as a spectrogram patch). The output is a sequence of MIDI events ("note on, pitch 60, time 1.2s", "note off, pitch 60, time 1.5s", etc.).
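A minimal sketch of the two sequences described above, assuming a 10 ms hop between frames and a simple (time, kind, pitch) event tuple. The Note class, the function names, and the event layout are illustrative only; real models define their own token vocabularies and compute spectrograms with a signal-processing library.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Note:
    pitch: int    # MIDI pitch number, 60 = middle C
    start: float  # onset time in seconds
    end: float    # release time in seconds

def frame_count(duration_ms: int, hop_ms: int = 10) -> int:
    """Number of hop-sized frames needed to cover a clip (ceiling division)."""
    return -(-duration_ms // hop_ms)

def notes_to_events(notes: list[Note]) -> list[tuple[float, str, int]]:
    """Flatten notes into a time-ordered list of note_on / note_off events."""
    events = []
    for n in notes:
        events.append((n.start, "note_on", n.pitch))
        events.append((n.end, "note_off", n.pitch))
    # Sort by time; at equal times emit note_off first so a repeated
    # pitch is released before it is struck again.
    events.sort(key=lambda e: (e[0], 0 if e[1] == "note_off" else 1, e[2]))
    return events

# Input side: a 3-minute song becomes 18,000 frames at a 10 ms hop.
print(frame_count(180_000))

# Output side: the middle-C example from the text.
print(notes_to_events([Note(pitch=60, start=1.2, end=1.5)]))
```

The point of the sketch is the mismatch in length: thousands of input frames map to a much shorter list of output events, which is the kind of sequence-to-sequence problem transformers are built for.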
The model learns by being shown millions of (audio, MIDI) pairs during training. Eventually it learns the mapping well enough that it can transcribe audio it has never heard.
Why solo piano works best
Three reasons:
Multi-instrument transcription (piano + drums + guitar + vocals) is an active research area. Tools like Spleeter can separate a mix into stems, and each stem can then be transcribed separately. Quality is improving but still rough. For now, "AI sheet music" effectively means "solo piano AI sheet music".
What we use and how it fits in
Our pipeline runs in AWS Lambda, triggered when you submit a YouTube URL:
End-to-end takes 60–90 seconds for a 3-minute song. Most of that is model inference time on the Lambda runtime.
What "90% accurate" really means
When we say "90% note accuracy", that's measured against a ground-truth MIDI file from the same performance. It does not mean 90% of measures are correct: wrong notes are spread throughout the piece, so in any reasonably dense passage most measures have at least one error.
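A rough back-of-the-envelope check of the gap between note accuracy and measure accuracy. It assumes every note is right or wrong independently with the same probability, which is a simplification (real errors tend to cluster in dense or fast passages). The function names are illustrative.

```python
def clean_measure_probability(note_accuracy: float, notes_per_measure: int) -> float:
    """Chance that every note in a measure is correct, assuming
    independent errors with the same per-note accuracy."""
    return note_accuracy ** notes_per_measure

def notes_until_likely_error(note_accuracy: float) -> int:
    """Smallest measure size at which an error-free measure becomes
    less likely than not (probability below 0.5)."""
    n = 1
    while clean_measure_probability(note_accuracy, n) >= 0.5:
        n += 1
    return n

for n in (4, 8, 12, 16):
    p = clean_measure_probability(0.90, n)
    print(f"{n:2d} notes per measure: {p:.1%} chance the measure is error-free")

print(notes_until_likely_error(0.90))
```

Under that assumption, a measure needs only 7 notes before it is more likely than not to contain an error at 90% note accuracy, and a 12-note measure comes out clean a little over a quarter of the time.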
Where errors come from:
For practice purposes, 90% is usable. For published sheet music, you'd hand-edit afterward in MuseScore (which is exactly what professional engravers do — start from a rough computer pass and clean up).
Where the field is going
Three frontiers:
In 2 years, expect "AI sheet music" to handle full-band transcriptions at the quality solo piano gets today.
Try it for yourself
The fastest way to understand what AI transcription is and isn't good at is to feed it a song you know intimately and read the output. Convert a YouTube cover here and compare the AI output to your mental model of the piece. The errors will teach you more than any blog post.