How Does AI Transcribe Piano from YouTube? (NotaGen Explained)

The problem nobody could solve for 50 years

Automatic music transcription has been a research goal since the 1970s. The problem statement is simple: input is an audio recording, output is sheet music or MIDI. Every step in between is hard.

Early approaches relied on signal processing: Fourier transforms to find frequencies and peak detection to find note onsets. These worked on pure sine waves and barely worked on real music. A piano note isn't a single frequency; it's a fundamental plus a stack of harmonics that overlap with the harmonics of every other note being played.
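To make the overlap concrete, here is a minimal sketch that lists the first few harmonics of two notes and reports where they nearly collide. It assumes idealized harmonics at exact integer multiples of the fundamental (real piano strings are slightly inharmonic) and equal temperament with A4 = 440 Hz. The function names and the 5 Hz tolerance are illustrative choices, not part of any real transcription system.

```python
def midi_to_hz(midi_note: int) -> float:
    """Equal-temperament frequency, assuming A4 (MIDI 69) = 440 Hz."""
    return 440.0 * 2 ** ((midi_note - 69) / 12)


def harmonics(midi_note: int, count: int = 6) -> list[float]:
    """Idealized harmonic series: integer multiples of the fundamental."""
    f0 = midi_to_hz(midi_note)
    return [f0 * k for k in range(1, count + 1)]


def overlapping_harmonics(note_a: int, note_b: int, tolerance_hz: float = 5.0):
    """Return (harmonic # of a, harmonic # of b, freq a, freq b) for near-collisions."""
    hits = []
    for i, fa in enumerate(harmonics(note_a), start=1):
        for j, fb in enumerate(harmonics(note_b), start=1):
            if abs(fa - fb) <= tolerance_hz:
                hits.append((i, j, round(fa, 1), round(fb, 1)))
    return hits


# C4 (MIDI 60) and G4 (MIDI 67) played together, a plain fifth
print(overlapping_harmonics(60, 67))
```

For this pair, the 3rd harmonic of C4 and the 2nd harmonic of G4 both land near 784 Hz, less than 1 Hz apart. A peak detector looking at that part of the spectrum sees one bump, not two.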

The 2010s brought convolutional neural networks that learned to recognize note-onset patterns directly from spectrograms. That was a big jump in accuracy, but these models still struggled with dense polyphony: pianists routinely play 4–8 notes simultaneously, and CNN onset detectors couldn't always tell them apart.

The 2020s brought transformer-based sequence models trained on tens of thousands of hours of paired audio + MIDI. Suddenly, "good enough to use" became real.

What "transformer" means here

The same architecture that powers ChatGPT powers modern music transcription. Different inputs (audio frames vs. text tokens), same fundamental idea: take a sequence in, predict the most likely sequence out.

For transcription, the input is a sequence of audio frames (the song chopped into ~10ms slices, each represented as a spectrogram patch). The output is a sequence of MIDI events ("note on, pitch 60, time 1.2s", "note off, pitch 60, time 1.5s", etc.).
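As a rough illustration of both ends of that mapping, the sketch below counts how many ~10 ms frames a clip produces and flattens a list of notes into the kind of time-ordered note-on/note-off sequence described above. The `Note` class, the tuple layout, and the function names are assumptions made for this example; real models use their own tokenization, and frames usually overlap rather than tile the audio exactly.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Note:
    pitch: int    # MIDI pitch number, 60 = middle C
    start: float  # onset time in seconds
    end: float    # release time in seconds


def frame_count(duration_ms: int, hop_ms: int = 10) -> int:
    """Frames needed to cover the audio when a new frame starts every hop_ms."""
    return -(-duration_ms // hop_ms)  # ceiling division on integers


def to_events(notes: list[Note]) -> list[tuple[float, str, int]]:
    """Flatten notes into time-ordered (time, kind, pitch) events."""
    events = []
    for n in notes:
        events.append((n.start, "note_on", n.pitch))
        events.append((n.end, "note_off", n.pitch))
    # Sort by time; at equal times, note_off comes first so a repeated
    # pitch is released before it is struck again.
    return sorted(events, key=lambda e: (e[0], e[1] != "note_off"))


print(frame_count(180_000))  # a 3-minute clip, in milliseconds
for event in to_events([Note(60, 1.2, 1.5), Note(64, 1.2, 2.0)]):
    print(event)
```

A 3-minute song comes out to 18,000 input frames, while the output for a typical piano piece is a few thousand events. The model's job is to learn the mapping between those two sequences.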

The model learns by being shown millions of (audio, MIDI) pairs during training. Eventually it learns the mapping well enough that it can transcribe audio it has never heard.

Why solo piano works best

Three reasons:

Big training datasets exist. The MAESTRO dataset alone has roughly 200 hours of competition-quality piano performances with MIDI aligned to the audio to within a few milliseconds. There's nothing equivalent for solo violin or solo flute.

Piano is harmonically clean. A struck piano string vibrates predictably. In effect, the model learns each note's harmonic profile, which helps it pick individual notes out of a dense spectrogram.

Notes don't bend. A pianist can't slide between notes the way a string player can. Discrete onsets are easier to detect than continuous pitch changes.

Multi-instrument transcription (piano + drums + guitar + vocals) is an active research area. Tools like Spleeter can separate a mix into stems, and each stem can then be transcribed separately. Quality is improving but still rough. For now, "AI sheet music" effectively means "solo piano AI sheet music".

What we use and how it fits in

Our pipeline runs in AWS Lambda, triggered when you submit a YouTube URL:

Audio extraction: yt-dlp pulls the audio track from the YouTube URL.

Pre-processing: audio is normalized, resampled to a fixed sample rate, and chunked into 30-second windows.

Transcription model: a transformer reads the audio and emits a sequence of MIDI events. We use NotaGen, a recent model that performs well on solo piano.

Hand-splitting: NotaGen returns one MIDI track. A separate model decides which notes belong to the right hand vs. left hand. We split at MIDI 60 (middle C) by default, then refine using a learned classifier for ambiguous cases.

Engraving: the MIDI is converted to MusicXML using our engraving rules (voicing, beaming, stem direction). PDF is rendered from the MusicXML.

Delivery: all three files (MIDI, MusicXML, PDF) plus the original audio land in your library.
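Two of the steps above are simple enough to sketch directly: cutting the audio into 30-second windows, and the default hand split at middle C. This is an illustration under stated assumptions, not our production code. The 16 kHz sample rate is a placeholder (the actual rate isn't given here), sending middle C itself to the right hand is an arbitrary choice, the function names are made up for the example, and the learned classifier that refines ambiguous notes is not shown.

```python
def chunk_bounds(total_samples: int, sample_rate: int,
                 window_s: int = 30) -> list[tuple[int, int]]:
    """Half-open (start, end) sample ranges covering the audio in fixed windows."""
    window = window_s * sample_rate
    return [(start, min(start + window, total_samples))
            for start in range(0, total_samples, window)]


def split_hands(pitches: list[int],
                split_point: int = 60) -> tuple[list[int], list[int]]:
    """Default rule only: pitches at or above split_point go to the right hand."""
    right = [p for p in pitches if p >= split_point]
    left = [p for p in pitches if p < split_point]
    return right, left


SAMPLE_RATE = 16_000  # placeholder value for illustration

# A 70-second clip becomes two full windows plus a 10-second remainder.
print(chunk_bounds(70 * SAMPLE_RATE, SAMPLE_RATE))

right_hand, left_hand = split_hands([48, 59, 60, 72])
print(right_hand, left_hand)
```

The fixed split is only a starting point: a left-hand run that climbs above middle C, or a right-hand melody that dips below it, is exactly the kind of ambiguous case that needs the classifier.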

End-to-end takes 60–90 seconds for a 3-minute song. Most of that is model inference time on the Lambda runtime.

What "90% accurate" really means

When we say "90% note accuracy", that's measured against a ground-truth MIDI file from the same performance. It does not mean 90% of measures are correct: wrong notes are spread out, so most measures have at least one error.
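A back-of-the-envelope calculation shows why. The sketch below assumes each note is right or wrong independently of the others, which is a simplification (real errors cluster in hard passages), and the function name is just for illustration.

```python
def error_free_measure_probability(note_accuracy: float,
                                   notes_per_measure: int) -> float:
    """Chance that every note in a measure is correct, assuming independent errors."""
    return note_accuracy ** notes_per_measure


for n in (4, 8, 16):
    p = error_free_measure_probability(0.90, n)
    print(f"{n:>2} notes per measure: {p:.0%} chance of an error-free measure")
```

Under that assumption, a measure with 8 notes comes out fully correct about 43% of the time, and a measure with 16 notes only about 19% of the time.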

Where errors come from:

Octave errors: a note is correctly identified but placed in the wrong octave. Easy to fix in MuseScore.

Rhythm rounding: the AI snaps to the nearest 16th note, which is wrong for swung or rubato passages.

Missed soft notes: pianissimo notes can fall below the model's confidence threshold.

Phantom notes: strong reverb tails sometimes get interpreted as new note onsets.
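The rhythm-rounding case is easy to reproduce. The sketch below snaps an onset time to the nearest 16th-note grid line, assuming a constant tempo and a grid that starts at time zero. Both assumptions are simplifications, and the function name is illustrative rather than taken from our pipeline.

```python
def snap_to_sixteenth(onset_s: float, bpm: float) -> float:
    """Snap an onset time (seconds) to the nearest 16th-note grid line."""
    sixteenth_s = 60.0 / bpm / 4  # a quarter note lasts 60/bpm seconds
    return round(onset_s / sixteenth_s) * sixteenth_s


BPM = 120.0              # one beat = 0.5 s, one 16th note = 0.125 s
swung_offbeat = 0.5 * 2 / 3  # swung eighth: two thirds of the way through the beat

snapped = snap_to_sixteenth(swung_offbeat, BPM)
print(f"played at {swung_offbeat:.3f} s, notated at {snapped:.3f} s")
```

At 120 bpm the swung off-beat is played around 0.333 s but lands on the grid at 0.375 s, so triplet swing gets written as a dotted-eighth plus sixteenth figure. The notes are right and the page still looks wrong.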

For practice purposes, 90% is usable. For published sheet music, you'd hand-edit afterward in MuseScore, which is exactly what professional engravers do: start from a rough computer pass and clean up.

Where the field is going

Three frontiers:

Multi-instrument: separating piano from strings from drums from vocals, then transcribing each. Demucs and similar source separators keep getting better.

Expression preservation: capturing dynamics, articulation, and pedal usage. Current models drop most of this.

Style-aware engraving: knowing that a jazz tune should be written in lead-sheet form while a classical piece should be fully voiced.

In 2 years, expect "AI sheet music" to handle full-band transcriptions at the quality solo piano gets today.

Try it for yourself

The fastest way to understand what AI transcription is and isn't good at is to feed it a song you know intimately and read the output. Convert a YouTube cover here and compare the AI output to your mental model of the piece. The errors will teach you more than any blog post.
