Scholars' Association News
Issue 30
May 2014



Automatic Music Transcription
By Emmanouil Benetos

Automatic music transcription is the process of converting a music recording into some form of musical notation, such as a music score or a machine-readable symbolic file. An immediate application of transcription systems is allowing performers to store and reproduce a recorded music performance. Lately, the problem of automatic music transcription has attracted considerable research interest in the field of music technology due to its large number of applications, including systems for music search and retrieval (e.g. searching for melodies or motifs in music collections), interactive music systems (e.g. automatic accompaniment in live performances), music education (e.g. systems for automatic musical instrument tutoring), and systematic and computational musicology (e.g. the automated analysis of music recordings or collections).

The problem of automatic music transcription is interdisciplinary by nature, combining elements of computer science, applied mathematics, musicology, acoustics, and psychology. It can be divided into several subtasks, which include pitch detection (along with detecting each note's start and end times), musical instrument recognition, the extraction of rhythmic information (e.g. tempo, meter), and the estimation of expressive characteristics (e.g. dynamics, articulation). The core problem of automatic music transcription is the detection of multiple concurrent notes, also called multi-pitch detection.
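These subtasks can be made concrete as a data structure. The Python sketch below is purely illustrative (none of its names come from an actual transcription system): each field of the hypothetical Note record corresponds to one of the subtasks listed above.

from dataclasses import dataclass

@dataclass
class Note:
    pitch: int         # MIDI note number, from multi-pitch detection
    onset: float       # note start time in seconds
    offset: float      # note end time in seconds
    instrument: str    # from musical instrument recognition
    velocity: int      # loudness, one of the expressive characteristics

# A transcription system maps an audio recording to a list of such notes,
# together with global rhythmic information such as tempo and meter.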

Transcription is a challenging and time-consuming task, even for expert musicians and musicologists. While the problem of automatically transcribing monophonic music is considered to be solved, the creation of an automated system capable of transcribing music without restrictions regarding music genre, instrumentation, or polyphony levels is still an open topic, especially for recordings with high polyphony levels and multiple instruments (such as orchestral music).

A future challenge lies in the creation of systems that can operate in real time, since current transcription systems are computationally expensive and require significant processing time. Another open problem is the utilization of musicological information (e.g. key, chords) to improve the final transcription.

At the Music Informatics Research Group (MIRG) of City University London, current research in automatic transcription focuses on the creation of efficient systems that can transcribe large collections of music recordings in a short amount of time, as well as on the use of musicological models that can improve transcription performance. Research is also carried out towards the creation of systems that support tuning changes and several types of articulations, such as vibrato or tremolo. At the same time, in collaboration with Boğaziçi University, work is being carried out in the field of ethnomusicology, towards the automatic transcription of Turkish classical music.

The operation of a typical music transcription system can be seen in the following example, using the first two bars of J.S. Bach’s Menuet in G major, BWV Anh. 114 (see also the relevant sound example). Figure 1 shows the waveform of the recording, which is used as input to a transcription system. Given that each note has a corresponding audio frequency (called its fundamental frequency), the one-dimensional waveform needs to be converted into a two-dimensional representation of time and frequency (a spectrogram). Figure 2 displays the spectrogram of the same recording, where it can be seen that each produced note creates several horizontal lines across time, denoting the presence of energy at a specific fundamental frequency as well as at integer multiples of that frequency (called harmonics). Using automatic transcription techniques, the spectrogram can be converted into a two-dimensional representation of pitch versus time. Figure 3 displays such a representation for this audio example, where energy for specific notes can be observed (in the figure, a MIDI value of 60 corresponds to middle C on a piano keyboard). By processing this time-pitch representation, we can create a symbolic music representation, such as the score shown in Figure 4.
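The pipeline described above can be sketched in a few lines of Python. The following is a minimal illustration, not the actual system discussed in this article: it assumes the numpy, scipy, and librosa libraries, uses a hypothetical file name, and replaces real multi-pitch detection with a deliberately simplified decomposition of each spectrogram frame against idealised harmonic templates.

import numpy as np
import librosa
from scipy.optimize import nnls

# Step 1: load the recording as a one-dimensional waveform (Figure 1).
# The file name is hypothetical.
y, sr = librosa.load('menuet_bwv_anh_114.wav', sr=22050)

# Step 2: convert the waveform into a time-frequency representation
# (the spectrogram of Figure 2) via the short-time Fourier transform.
n_fft, hop = 4096, 512
S = np.abs(librosa.stft(y, n_fft=n_fft, hop_length=hop))

# Step 3: multi-pitch detection, yielding the time-pitch representation
# of Figure 3. Each candidate pitch (MIDI 36-96; 60 is middle C) gets an
# idealised template with energy at its fundamental frequency and at the
# first few integer multiples (harmonics); each spectrogram frame is then
# explained as a non-negative mixture of these templates.
midi_range = np.arange(36, 97)
templates = np.zeros((S.shape[0], len(midi_range)))
for i, m in enumerate(midi_range):
    f0 = librosa.midi_to_hz(m)
    for h in range(1, 6):                      # fundamental + 4 harmonics
        k = int(round(h * f0 * n_fft / sr))
        if k < S.shape[0]:
            templates[k, i] = 1.0 / h          # decaying harmonic weights
piano_roll = np.column_stack(
    [nnls(templates, S[:, t])[0] for t in range(S.shape[1])])

# Step 4: binarise the pitch activations into note events. A real system
# would refine these (onset/offset tracking, musicological models) before
# rendering a score such as the one in Figure 4.
notes = piano_roll > 0.1 * piano_roll.max()

In practice, such simple templates would often confuse a note with its octave or upper harmonics; state-of-the-art systems instead learn their spectral templates from data and apply temporal smoothing, which is one reason multi-pitch detection remains the core difficulty.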

(Dr Emmanouil Benetos is a research fellow at the Music Informatics Research Group of City University London.)
