Music Information Retrieval (MIR), also called Music Information Research, aims at making computers understand music. This is currently not the case! Music played on a computer does not really differ from music played by a gramophone: it is read from a medium and sent directly to the loudspeakers. A computer's understanding of music can be compared to us looking at the grooves of a vinyl record.

To illustrate this, we take the well-known piece Eine Kleine Nachtmusik, composed by Wolfgang Amadeus Mozart, as an example:



Similar to a vinyl record, the audio played by this embedded player is represented in a coded form. The coding of the record is haptic. Depths and intervals of the grooves describe how the needle should transform the imprinted description into sound. Digital music is described in a similar way.

The audio signal is converted into a sequence of discrete numbers that are evenly spaced in time. This process of reducing a continuous signal to a discrete signal is called sampling. Sampling audio is similar to monitoring the temperature of an office by measuring it in degrees Celsius once every minute. To digitize audio, the voltage generated by the sound pressure at the microphone is measured. For digitizing audio, especially music in CD quality, a sampling rate of 44100 Hertz at a bit depth of 16 is typically used. This means that each second of audio data is represented by 44100 16-bit values per channel.
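To make the sampling step concrete, here is a minimal sketch in plain Python. It does not read the Mozart recording; it assumes a synthetic 440 Hz sine tone as the "continuous" signal, and the function name `sample_sine` is made up for this illustration.

```python
import math

SAMPLE_RATE = 44100  # samples per second (CD quality)
BIT_DEPTH = 16       # bits per sample

def sample_sine(frequency_hz, duration_s, sample_rate=SAMPLE_RATE, bit_depth=BIT_DEPTH):
    """Sample a sine tone at evenly spaced times and quantize it to signed integers."""
    max_amplitude = 2 ** (bit_depth - 1) - 1  # 32767 for 16 bit
    n_samples = round(duration_s * sample_rate)
    return [
        round(max_amplitude * math.sin(2 * math.pi * frequency_hz * n / sample_rate))
        for n in range(n_samples)
    ]

# Assumption: a 440 Hz test tone stands in for real music.
samples = sample_sine(frequency_hz=440.0, duration_s=0.002)
print(len(samples))  # 88 values for 2 milliseconds at 44100 Hz
print(samples[:8])   # the first few quantized amplitudes
```

Two milliseconds of audio already amount to 88 numbers per channel, which hints at how much raw data a full song contains.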

This is how a computer sees 2 milliseconds of music:


It becomes more comprehensible when visualized as a bar plot. Now it can be observed that there are positive and negative values and that they seem to oscillate up and down.

[Figure: bar plot of the sample values for 2 milliseconds of audio]

Zooming out and visualizing the entire song provides a familiar visualization of audio files called a Waveform. The following waveform visualizes the sequence of the measured amplitudes. What can be seen in a Waveform? Only how intense the measured sound pressure was at a given moment. This is why the sampled audio data is also referred to as Time Domain data.

[Figure: waveform of the entire track]

A waveform is a very limited representation of music information. It only plots the sequence of amplitude values measured by the microphone. In the previous example this might give an impression of the song structure, but this does not hold in general, as shown by the next example, a Punk-Rock song:

[Figure: waveform of the Punk-Rock song]

Goals of Music Information Retrieval

These two different examples of music are a good starting point to discuss the objectives of MIR. The songs clearly sound different. The question is how to model this musical difference mathematically.

To answer this we have to take a look at the information we have so far:

  • filename
  • wave-data
  • samplerate
  • number of audio-channels (e.g. mono, stereo)

It is obvious that the songs can be discriminated by their filenames, since common filesystems do not allow identical filenames within the same directory. The number of audio-channels would distinguish mono from stereo encoded songs, but nowadays the majority of tracks are stereo encoded. This leaves us with the pure audio data.

The problem with the wave data is that it contains too much information to be processed in its raw form. It has to be transformed and reduced. This process is called Feature Extraction and attempts to aggregate the audio signal data into a much smaller set of descriptive numbers.

The most naive audio feature is calculated by dividing the number of audio samples of an audio file by the sample-rate. The result is the length of the track in seconds.
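As a small sketch of this feature, the snippet below uses the `wave` module from Python's standard library. Since the original audio files are not available here, it assumes a silent three-second mono WAV built in memory as stand-in data; `track_length_seconds` is a name chosen for this illustration.

```python
import io
import wave

def track_length_seconds(wav_file):
    """Length of a WAV file in seconds: samples per channel divided by the sample rate."""
    with wave.open(wav_file, "rb") as wav:
        return wav.getnframes() / wav.getframerate()

# Assumption: three seconds of silence stand in for a real track.
buffer = io.BytesIO()
with wave.open(buffer, "wb") as wav:
    wav.setnchannels(1)      # mono
    wav.setsampwidth(2)      # 2 bytes = 16 bit
    wav.setframerate(44100)  # CD-quality sample rate
    wav.writeframes(b"\x00\x00" * 44100 * 3)
buffer.seek(0)

length = track_length_seconds(buffer)
print(length)  # 3.0
```

Note that the sample count is taken per channel, so a stereo file of the same duration yields the same length.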


We now can computationally confirm that these two tracks are different. The more interesting question is, ‘how similar are they’?

Calculating Similarities between Songs

The raw and extracted information assembled up to this point is far from descriptive enough to draw conclusions about the similarity of tracks. Songs of identical length cannot be expected to sound identical.

More complex features are required. A standard audio descriptor is the Zero Crossing Rate (ZCR), which literally counts how often the signal crosses the zero-value-line. For the following examples we use another famous track:

The following chart plots the Zero Crossing Rate against the waveform. It shows that there seems to be no correlation to the waveform. That is because the ZCR does not depend on the measured loudness but on the frequency of the analyzed audio signal. The higher the frequency of the signal, the more often it crosses the zero-line and the higher the ZCR value.

[Figure: Zero Crossing Rate plotted against the waveform]
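The behaviour of the ZCR described above can be checked with a few lines of plain Python. This is a minimal sketch on synthetic sine tones rather than on the track itself; the helper names and the chosen frequencies are assumptions made for the illustration.

```python
import math

def zero_crossing_rate(samples):
    """Fraction of adjacent sample pairs whose signs differ."""
    crossings = sum(
        1
        for prev, cur in zip(samples, samples[1:])
        if (prev >= 0) != (cur >= 0)
    )
    return crossings / (len(samples) - 1)

def sine(frequency_hz, duration_s=0.1, sample_rate=44100):
    """Synthetic test tone with amplitude 1."""
    n = round(duration_s * sample_rate)
    return [math.sin(2 * math.pi * frequency_hz * i / sample_rate) for i in range(n)]

low = zero_crossing_rate(sine(100.0))                     # low-pitched tone
high = zero_crossing_rate(sine(1000.0))                   # high-pitched tone
loud = zero_crossing_rate([10 * s for s in sine(100.0)])  # same tone, ten times louder

print(low, high, loud)
```

The higher tone yields the higher ZCR, while scaling the amplitude leaves the value unchanged, which is why the ZCR curve does not follow the waveform.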

The relationship between ZCR and the frequency of the audio signal can be better observed using another representation, the Spectrogram. The spectrogram visualizes the frequency distribution of the audio signal over time. It is calculated by applying a Fourier Transform to short, consecutive segments of the signal. The vertical axis describes the frequency range. The region near the x-axis represents low frequencies, as produced by a bass guitar. The human voice resides around 1000 Hz. The horizontal axis represents time.

[Figure: spectrogram of the track]
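How such a spectrogram comes about can be sketched with a deliberately naive short-time Fourier transform in plain Python. The frame size, hop size, Hann window and the 8000 Hz test tone are assumptions chosen to keep the example small; a real implementation would use an optimized FFT.

```python
import cmath
import math

def magnitude_spectrum(frame):
    """Magnitudes of the DFT bins 0 .. N/2 of one Hann-windowed frame."""
    n = len(frame)
    windowed = [
        x * (0.5 - 0.5 * math.cos(2 * math.pi * i / (n - 1)))
        for i, x in enumerate(frame)
    ]
    return [
        abs(sum(x * cmath.exp(-2j * math.pi * k * i / n) for i, x in enumerate(windowed)))
        for k in range(n // 2 + 1)
    ]

def spectrogram(samples, frame_size=256, hop=128):
    """One magnitude spectrum per overlapping frame (time runs along the outer list)."""
    return [
        magnitude_spectrum(samples[start:start + frame_size])
        for start in range(0, len(samples) - frame_size + 1, hop)
    ]

# Assumption: a 1000 Hz tone sampled at 8000 Hz serves as test signal.
SAMPLE_RATE = 8000
FRAME_SIZE = 256
tone = [math.sin(2 * math.pi * 1000.0 * i / SAMPLE_RATE) for i in range(2048)]

spec = spectrogram(tone, frame_size=FRAME_SIZE)
peak_bin = max(range(len(spec[0])), key=lambda k: spec[0][k])
peak_hz = peak_bin * SAMPLE_RATE / FRAME_SIZE
print(len(spec), peak_bin, peak_hz)  # 15 frames, strongest bin 32 = 1000.0 Hz
```

Each inner list corresponds to one column of the plotted spectrogram, and the bin index maps linearly to frequency.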

Calculations based on spectrograms are also referred to as Frequency Domain calculations. Features calculated in this domain are far more descriptive, and all state-of-the-art music features perform sophisticated calculations on the spectral information of the audio signal. The ZCR can be calculated very fast because it does not require the expensive Fourier Transform operation, but it only roughly describes how noisy a track is. This might be sufficient to distinguish many Heavy Metal songs from classical music, but the general performance is not acceptable.

Advanced Music Descriptors

Advanced music features use methods from digital signal processing and consider psycho-acoustic models in order to extract suitable semantic information from music. The most popular are the Mel-frequency cepstral coefficients (MFCC), which are a good descriptor of music timbre. The Chroma feature-set maps the audible spectrum to the 12 semitones of the musical octave. For Western music this reflects the tonality and the usage of chords.
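The idea behind the Chroma mapping can be sketched as follows. This is a simplified illustration, not a full chroma extractor: it assumes equal temperament with A4 tuned to 440 Hz, takes an already computed list of (frequency, magnitude) pairs as input, and all names are made up for the example.

```python
import math

PITCH_CLASSES = ["C", "C#", "D", "D#", "E", "F", "F#", "G", "G#", "A", "A#", "B"]

def pitch_class(frequency_hz, a4_hz=440.0):
    """Index (0 = C ... 11 = B) of the semitone closest to a frequency."""
    midi_note = round(69 + 12 * math.log2(frequency_hz / a4_hz))  # MIDI note 69 is A4
    return midi_note % 12

def chroma_vector(spectrum):
    """Sum spectral magnitudes per pitch class, folding all octaves together."""
    chroma = [0.0] * 12
    for frequency_hz, magnitude in spectrum:
        if frequency_hz > 0:
            chroma[pitch_class(frequency_hz)] += magnitude
    return chroma

# Assumption: four idealized spectral peaks of a C major chord (C4, E4, G4, C5).
peaks = [(261.63, 1.0), (329.63, 1.0), (392.00, 1.0), (523.25, 0.5)]
chroma = chroma_vector(peaks)
for name, energy in zip(PITCH_CLASSES, chroma):
    if energy > 0:
        print(name, energy)
```

Because octaves are folded together, the two C notes end up in the same bin, and the chord shows up as energy in C, E and G only.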

The Music Information Retrieval Group of the TU-Wien developed various feature sets, which are appropriate for different tasks.

  • Rhythm Patterns: Rhythm Patterns (RP) describe modulation amplitudes for a range of modulation frequencies on “critical bands” of the human auditory range. They provide an abstract description of rhythmic frequencies in a song.
  • Rhythm Histogram: Rhythm Histogram (RH) features capture rhythmical characteristics of an audio track by aggregating the modulation values of the critical bands computed in a Rhythm Pattern.
  • Statistical Spectrum Descriptor: In the first part of the algorithm for computing a Statistical Spectrum Descriptor (SSD), the specific loudness sensation is computed on 24 Bark-scale bands, in the same way as for a Rhythm Pattern. Subsequently, the mean, median, variance, skewness, kurtosis, min- and max-value are calculated for each individual critical band. SSDs are able to capture additional timbral information compared to Rhythm Patterns, yet at a much lower dimension of the feature space.
  • Temporal Statistical Spectrum Descriptor: These features describe variations over time by including a temporal dimension to incorporate time series aspects. Statistical Spectrum Descriptors are extracted from segments of a musical track at different time positions. Thus, TSSDs are able to reflect rhythmical, instrumental and other timbral changes by capturing variations and changes of the audio spectrum over time.
  • Temporal Rhythm Histograms: These features capture change and variation of rhythmic aspects in time. Similar to the Temporal Statistical Spectrum Descriptor, statistical measures of the Rhythm Histograms of individual 6-second segments in a musical track are computed.
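The statistical part of the Statistical Spectrum Descriptor can be sketched in plain Python. This is not the reference implementation of the feature extractor: the psycho-acoustic loudness computation is skipped and replaced by made-up values, and the use of population variance and non-excess kurtosis is an assumption.

```python
import statistics

def describe_band(values):
    """Mean, median, variance, skewness, kurtosis, min and max of one band over time."""
    mean = statistics.mean(values)
    variance = statistics.pvariance(values, mu=mean)  # assumption: population variance
    std = variance ** 0.5
    if std == 0:
        skewness = kurtosis = 0.0  # constant band: higher moments are undefined
    else:
        n = len(values)
        skewness = sum(((v - mean) / std) ** 3 for v in values) / n
        kurtosis = sum(((v - mean) / std) ** 4 for v in values) / n
    return [mean, statistics.median(values), variance, skewness, kurtosis,
            min(values), max(values)]

def statistical_spectrum_descriptor(bands):
    """Concatenate the seven statistics of every band into one feature vector."""
    descriptor = []
    for band in bands:
        descriptor.extend(describe_band(band))
    return descriptor

# Assumption: toy loudness values for 24 bands and 20 time frames.
bands = [[(b + 1) * (t % 5) for t in range(20)] for b in range(24)]
descriptor = statistical_spectrum_descriptor(bands)
print(len(descriptor))  # 24 bands x 7 statistics = 168 values
```

Whatever the length of the track, the result has a fixed size, which makes songs directly comparable.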

Demonstrating the Performance of Music Features

This section uses the Soundcloud Demo Dataset, which is a collection of commonly known mainstream radio hits. The complete collection is available as a playlist on Soundcloud.

The following charts plot the features of the Rhythm Patterns family. Three songs from different genres are chosen to demonstrate how they are reflected in the feature values.

Song 1: The first song is a typical dance track. Thus, the Rhythm Histograms show clear peaks at about 2Hz, which corresponds to a tempo of 120 beats per minute (bpm). The Rhythm Patterns provide a more detailed description of where this rhythmic energy comes from. In the region around 2Hz there are highlighted regions in the lower sub-bands. This corresponds to the bass drum that is typically played on each quarter note in Dance music. The other regions around 4Hz and 8Hz are partials and may correspond to percussion or other instruments that are played double time.

Song 2: The second song is a typical Heavy Metal song. Rhythm Histograms and Rhythm Patterns show the typical pattern for this genre. The peak in the middle of the Rhythm Pattern is generated by the predominant Hi-Hat, which is played half open on the eighth notes at about 120bpm (= about 4Hz). This is also reflected by the high maximum values shown in the Statistical Spectrum Descriptors.

Song 3: The third song already points out some problems of music feature extraction. The song starts as a rock ballad played on the piano, has a classic opera interlude and ends in a Heavy Metal style. The complexity of this song is also reflected in its feature values. The energy in Rhythm Histograms and Rhythm Patterns is evenly distributed across the sub-bands and the various modulation frequencies. The spectral energy follows this pattern as well.

Mr Saxobeat
Metallica – Enter Sandman
Queen – Bohemian Rhapsody
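The relation between the modulation frequencies and the tempo values mentioned in the song descriptions is simple arithmetic. The helper below is only an illustration; its name and the `events_per_beat` parameter are made up for this sketch.

```python
def modulation_hz_to_bpm(frequency_hz, events_per_beat=1):
    """Convert a modulation frequency (events per second) to a tempo in beats per minute."""
    return frequency_hz * 60 / events_per_beat

# Bass drum on every quarter note: one event per beat.
print(modulation_hz_to_bpm(2.0))                     # 120.0
# Hi-hat on the eighth notes: two events per beat.
print(modulation_hz_to_bpm(4.0, events_per_beat=2))  # 120.0
```

Both the 2Hz peak of the dance track and the 4Hz peak of the Hi-Hat therefore point to the same tempo of 120bpm.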

‘More like this’ – Finding Similar Songs

Finding similar songs is one of the most difficult tasks in Music Information Retrieval. It starts with the lack of a clear definition of music similarity and ends with the difficulty of describing it through mathematical or algorithmic models.

This problem is also known as the Semantic Gap and refers to the inability to describe high level semantic concepts such as music similarity or genre through low level features. The semantic gap is known to all information retrieval domains.

The following paragraphs will use the track below to query for similar songs:

The table below lists the most similar songs calculated on different feature sets. The results seem satisfactory: all tracks are from the same genre and are typical dance tracks. This is because dance music has a strong focus on rhythm. The bass drum is over-accentuated and the instrumental accompaniment has a strong rhythmical texture. This creates distinctive patterns in the features that discriminate these tracks from other genres.

Rhythm Patterns
Rhythm Histograms
Statistical Spectrum Descriptors
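A similarity query of this kind boils down to a nearest-neighbour search in the feature space. The sketch below assumes Euclidean distance, which the article does not specify, and uses invented three-dimensional feature vectors and song names instead of real Rhythm Pattern or SSD values.

```python
import math

def most_similar(query_name, features, top_n=3):
    """Names of the songs whose feature vectors are closest to the query song."""
    query = features[query_name]
    distances = [
        (math.dist(query, vector), name)  # assumption: Euclidean distance
        for name, vector in features.items()
        if name != query_name
    ]
    return [name for _, name in sorted(distances)[:top_n]]

# Assumption: made-up feature vectors for four hypothetical songs.
features = {
    "dance_a": [0.9, 0.1, 0.2],
    "dance_b": [0.8, 0.2, 0.1],
    "metal_a": [0.2, 0.9, 0.7],
    "ballad_a": [0.1, 0.2, 0.9],
}

result = most_similar("dance_a", features, top_n=2)
print(result)  # ['dance_b', 'ballad_a']
```

Which songs count as "similar" depends entirely on what the feature vectors encode, which is what the following comparison of feature sets explores.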

The following examples will provide a short evaluation of the presented music features and outline their advantages and shortcomings. To demonstrate this, the following three songs from different genres will be used to search for similar songs using the same feature-set.

Gloria Gaynor – I Will Survive
Metallica – Enter Sandman
Queen – Bohemian Rhapsody

Rhythm Histograms

Rhythm Histograms provide nothing but a rough description of the rhythmic energy of a song. This rhythm, however, can be generated by anything, ranging from percussive instruments to the amplitude modulation of a synthesizer.

Rhythm per se is an aspect of music similarity, but comparing songs solely by rhythm leads to unsatisfactory results. This can be observed in the following overview, where Heavy Metal songs appear in the results of a Disco query song. Again, Bohemian Rhapsody by Queen shows how problematic songs with high variance are. The results are all rock songs, which reflects the final part of the song. The melancholic first part is neglected. This can be explained by the fact that these features are aggregated using statistical moments of feature values calculated for the whole song. Such statistical measures are usually sensitive to outliers, and the intense values at the end pull the histogram values towards the value range of Rock music.

Gloria Gaynor – I Will Survive
Metallica – Enter Sandman
Queen – Bohemian Rhapsody

Rhythm Patterns

Rhythm Patterns expand the rhythmical descriptions to the different spectral sub-bands. Thus, they provide much more information but also require more computational resources to process them. This does not mean that these features capture timbral properties. They only distinguish rhythmic energy in different spectral ranges.

Because the Rhythm Histograms are an aggregated representation of the Rhythm Patterns, the results are similar, but the additional information provides more accurate results. Take Enter Sandman by Metallica as an example. It might seem funny that Anita Ward’s Disco hit Ring my Bell is among the most similar songs in the collection. Listening to this song reveals that this is not far-fetched. The drum pattern played in the Disco hit is very similar to that played by Metallica’s drummer Lars Ulrich. They only differ in timbre. The same holds for the songs by Guns N’ Roses and Van Halen.

Gloria Gaynor – I Will Survive
Metallica – Enter Sandman
Queen – Bohemian Rhapsody

Statistical Spectrum Descriptors

SSDs are very good descriptors of music timbre, but they completely ignore rhythm. This leads to interesting results, especially on this small dataset. Results based on SSDs sound similar in terms of timbral properties and how they change within the song. For example, songs with acoustic guitars in the verse and distorted ones in the chorus would be matched based on their high variance values in the corresponding spectral bands. Punk Rock songs would be matched due to their simple composition and noisy drum style, which results in low variance and high mean and max values. SSDs are also good for matching songs according to their mood.

Gloria Gaynor – I Will Survive
Metallica – Enter Sandman
Queen – Bohemian Rhapsody

Combining different Music Descriptors

To overcome the shortcomings of the individual music features, it is good practice to combine different feature-sets to get a better computational description of music. The following example combines Statistical Spectrum Descriptors with Rhythm Histograms, providing a combined description of rhythmic and timbral properties. The results are still not perfect, but the output can be optimized by adjusting the weights of the different feature-sets.

Gloria Gaynor – I Will Survive
Metallica – Enter Sandman
Queen – Bohemian Rhapsody
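One simple way to combine feature-sets is a weighted sum of the per-feature-set distances. The sketch below illustrates the idea and is not necessarily the method used for the results above: it assumes Euclidean distances, feature values that have already been normalized to comparable ranges, and invented vectors and weights.

```python
import math

def combined_distance(song_a, song_b, weights):
    """Weighted sum of the distances computed separately on each feature-set."""
    total = 0.0
    for feature_set, weight in weights.items():
        total += weight * math.dist(song_a[feature_set], song_b[feature_set])
    return total

# Assumption: tiny made-up vectors; "ssd" and "rh" are hypothetical keys
# for Statistical Spectrum Descriptors and Rhythm Histograms.
song_a = {"ssd": [0.0, 0.0], "rh": [0.0, 0.0]}
song_b = {"ssd": [3.0, 4.0], "rh": [1.0, 0.0]}

balanced = combined_distance(song_a, song_b, {"ssd": 0.5, "rh": 0.5})
rhythm_heavy = combined_distance(song_a, song_b, {"ssd": 0.2, "rh": 0.8})
print(balanced, rhythm_heavy)
```

Shifting weight towards the Rhythm Histograms makes the two songs appear closer here, because they differ mainly in timbre. Tuning these weights is how the ranking can be adjusted.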

Still a long way to go

This article was intended to provide a short glimpse into Music Information Retrieval. The presented examples should demonstrate some state-of-the-art approaches and the problems that have to be solved. Although MIR is a young research field, it has already accomplished a lot, and there is a lot more to come. It has a strong community of enthusiastic researchers, which meets annually at the International Society for Music Information Retrieval Conference (ISMIR).