The Europeana Sounds project aggregates metadata records corresponding to hundreds of thousands of audio-related objects and makes them available via the Europeana portal, Channels and API. The described content ranges from music to interviews, animal or ambient sounds, broadcasts, news, etc. The huge variety of audio content makes this collection both fascinating and challenging.
Although the records are rich in descriptive metadata, these textual descriptions are often not sufficient to support sophisticated search scenarios or (musicological) research. For example, finding recordings by a specific artist or works from a certain year is no problem. A search for contemporary music that was inspired by a classical composer or music style, however, would certainly be problematic. It would require that references to influencing artists and styles are recorded somewhere in the metadata. Yet manually annotating records in digital library collections is cumbersome and error-prone. Audio, and music recordings in particular, is even more complex to describe. This is because, despite some objective properties (such as composer, performer, instrumentation, structure, meter, etc.), music is highly subjective. The perceptions of properties like genre, timbre, mood, etc. are both personal and ambivalent. Even if such information has been recorded in the metadata, the algorithms of conventional search engines require a certain amount of consistency in descriptive terminology. Thus it appears that the state-of-the-art text-based approach faces many obstacles concerning the effectiveness of music or audio search.
The information required for implementing an efficient audio search is encoded in the audio files themselves, but it still needs to be unlocked. Human listeners are able to identify performing artists, instrumentation, melodies, songs, moods, genres, and even subtle similarities between songs within fractions of a second. Given that this music information exists in the signal, the aim is to extract it and to store it in a machine-processable format. This process is called feature extraction and transforms music information into a series of descriptive numbers. Depending on the applied feature extraction method, these numbers express or relate to musical properties such as timbre, rhythm or harmony. The following images provide an example of rhythmic features extracted from two different music tracks. Image a) represents the rhythmic energy of a classical track, and image b) that of a heavy metal track. At a glance, it can be observed that the two tracks differ significantly in their peak values (red colored pixels refer to high intensity values).
Music features calculated from the audio signal, such as the examples provided above, are further used to make songs comparable on a numerical basis. In that sense, songs exhibiting similar patterns will be considered as musically similar – at least in terms of rhythmic expression. This relationship between numerical and music similarity facilitates new approaches to music search. Besides the traditional text based approach, where music information has to be textually encoded and provided, a content based approach adds a wide range of advantages and possibilities to search engines.
Content Based Search
Content based search refers to approaches using information available within the audio content and not within the descriptive metadata. This usually requires a pre-processing step where this information is extracted and quantized to make it comparable on a numerical level. This is also referred to as vector representation. Usually a content based feature – independently of the actual multimedia domain (e.g. music, image, video, etc.) – is not a single value but a set of values. The Chroma features for example describe the distribution of the spectral energy over the 12 semitones of a musical octave. Thus, the feature consists of 12 values which are represented as one vector with a length of 12. Extracting these features from a collection of audio files creates a so-called vector space. Referring to this, search approaches based on content based features are also often referred to as Vector Space Approaches or Vector Space Models.
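The idea of a 12-value Chroma vector can be illustrated with a minimal sketch. It assumes that spectral peaks have already been extracted as (frequency, energy) pairs. The function names and the example peaks are hypothetical and chosen for illustration only; real Chroma extractors work on the full spectrum of short audio frames.

```python
import math

def pitch_class(freq_hz, a4_hz=440.0):
    """Map a frequency to one of 12 pitch classes (0 = C, ..., 9 = A)."""
    # MIDI note numbering: A4 (440 Hz) is note 69, one unit per semitone.
    midi = round(69 + 12 * math.log2(freq_hz / a4_hz))
    return midi % 12

def chroma_vector(peaks):
    """Fold (frequency, energy) pairs into a normalized vector of length 12."""
    chroma = [0.0] * 12
    for freq_hz, energy in peaks:
        chroma[pitch_class(freq_hz)] += energy
    total = sum(chroma)
    return [value / total for value in chroma] if total > 0 else chroma

# Hypothetical spectral peaks of a C major chord (C4, E4, G4, C5).
peaks = [(261.63, 1.0), (329.63, 0.8), (392.00, 0.6), (523.25, 0.4)]
vector = chroma_vector(peaks)
```

Both C notes fall into the same bin regardless of octave, which is what makes the resulting vector a description of tonal content rather than of absolute pitch.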
Audio Similarity Calculations
Searches in vector space function on different principles than traditional text-based approaches. In contrast to query languages with strictly defined rule sets, such as SQL, vector space models are based on similarity estimations, which are generally derived from numerical differences between feature vectors. There are various methods to calculate vector similarities by considering aspects of different probability distributions, but all of them provide an estimate of how similar the values within the vectors are. In the context of a content based search engine, these estimates are used to facilitate sophisticated search scenarios like the popular use case of Query by Example. In this scenario, the information need is expressed by an example song and the search engine is expected to retrieve similar songs. As a first step, the feature vector of the example song is used to calculate the pairwise similarities to every other feature vector in the song collection. The result is a list of similarity estimates. After sorting this list in descending order, the top entries with high similarity values are considered to refer to highly similar songs. This interpretation of music similarity depends strongly on the models used to calculate the feature vectors. Although this is a considerable weakness, such approaches are state-of-the-art in multimedia research in general. Research specifically on music feature extraction is progressing rapidly and remarkable results are already available as open source software.
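The Query by Example steps described above can be sketched in a few lines. This is an illustration under stated assumptions, not the pilot's implementation: the similarity measure (an inverted Euclidean distance), the function names and the tiny three-dimensional feature vectors are all hypothetical.

```python
import math

def euclidean_distance(a, b):
    """Numerical difference between two feature vectors of equal length."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def similarity(a, b):
    """Turn a distance in [0, inf) into a similarity in (0, 1]."""
    return 1.0 / (1.0 + euclidean_distance(a, b))

def query_by_example(query_id, collection, top_n=3):
    """Rank all other items by their similarity to the query item."""
    query_vec = collection[query_id]
    scored = [
        (other_id, similarity(query_vec, vec))
        for other_id, vec in collection.items()
        if other_id != query_id
    ]
    # Descending order: the most similar items come first.
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_n]

# Hypothetical feature vectors; real ones have many more dimensions.
collection = {
    "track_a": [0.9, 0.1, 0.0],
    "track_b": [0.8, 0.2, 0.0],
    "track_c": [0.1, 0.1, 0.8],
    "track_d": [0.0, 0.5, 0.5],
}
results = query_by_example("track_a", collection)
```

Here `track_b` ranks first because its vector differs least from that of `track_a`. Other measures, such as cosine similarity, fit the same ranking scheme.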
Query by Example on the Europeana Sounds Collection
To demonstrate how Music Information Retrieval (MIR) technologies can be applied to the Europeana Platform, a pilot was developed that showcases a query by example search approach. The Europeana Sounds collection is a large dataset, consisting of a wide range of music and non-music content. The following main categories had to be considered for the implementation of a query by example search approach:
- Music: Music content is by far the biggest category of the collection varying in genre, style, instrumentation and recording quality.
- Spoken Word: Spoken word in form of interviews, radio news broadcast, public speeches, etc.
- Animal Sounds: Recordings of animal sounds mostly contain noise and a small percentage of actual sound.
- Radio Broadcasts: Some sub-collections consist of live radio recordings. These audio items are long mixed-content files. They consist of spoken content, music and radio commercials.
This variety of content categories made the implementation of a query by example search approach equally interesting and challenging. Reports of mixed-content evaluations, which are not restricted to music but also include other audio content such as spoken word, animal sounds or radio commercials, are scarce. Thus, adequate combinations of audio and music features to capture the acoustic properties of the Europeana Sounds collection's music and non-music content had to be evaluated carefully.
Music similarity is an attractive but also challenging research field, due to its inherent ambiguity and subjectivity. Music is part of our culture. It is a prevalent part of our daily life. It sounds from radios, TVs, shops and coffee shops on our way to work. Parents sing songs to make us fall asleep, and in school we are taught hymns, traditional songs and an overview of the rich cultural heritage of classical music. Teenagers use music to express their personality, and at that age music starts to play a major role in our lives.
Because of the fuzziness present in the perception of music properties, the objective estimation of music similarity becomes a highly challenging task. Everyone has their own interpretation of similarity. In terms of music similarity calculations, objectivity is generally a requirement for generalization. Subjectivity and personalization concerns add further complexity and require the presence of user-related (personal) data.
Due to the obstacles mentioned above, a common solution is to use acoustic similarity as an estimate for music similarity. This is still controversial because the employed methods derive from digital signal processing, which differs from how humans perceive music. Humans are able to separate and concentrate on distinct properties of music, such as instruments and the melodies played by these instruments. In digital signal processing, this kind of source separation is still a subject of research, and the available methods are not developed enough to process polyphonic digital music. As a consequence, music is often interpreted by its distribution of spectral energy; melodic or harmonic attributes need to be carefully reconstructed from the sampled audio. Different models have been developed attempting to capture specific properties of music.
Estimating Music Similarity
The acoustic music similarity is calculated by weighting a collection of state-of-the-art audio features. Distinguishing different music styles requires a consideration of various music properties, such as timbre, rhythm and harmony, as well as their progression and variety over the complete performance. Different feature sets are known to work better on certain music genres, but to be inferior when applied to other genres. A further obstacle is the presence of old historic recordings. Scratches and noise resulting from decaying media distort the feature values. Human ears are able to filter such noise and still recognize the underlying music content. Music features robust against such noise were considered and evaluated. The following audio features were identified to capture the relevant audio content:
- Timbre: Timbre is a fundamental property of music and generally reflects the instrumentation used during the performance. Timbre is often a good discriminator for music genres as well as for the moods expressed by a song. Several timbre features have been presented, each of them capturing different aspects of genre. The MIR pilot used the Mel Frequency Cepstral Coefficients (MFCC), which are currently the most widely used features in music research, the Statistical Spectrum Descriptors (SSD), which are based on psychoacoustic transformations, and the Spectral Centroid.
- Rhythm: Rhythm defines music as much as timbre does. Unlike traditional or musicological descriptions of rhythm, where rhythms are assigned names, computational rhythm features provide a statistical description of rhythmic energy. This facilitates better comparability, but abstracts from our human understanding of rhythm.
- Harmony: Harmony describes the tonality of a composition. From an analytical perspective, it captures how the spectral energy is distributed across a certain (usually Western) scale. Chroma and Tonnetz features were used to describe harmony.
- Loudness: Although loudness is not relevant for music similarity as such, it has to be considered in certain ways. The similarity should not differ for two identical songs played at different amplitudes. However, on a more global level loudness has discriminative value. It has been reported that contemporary music tends to steadily increase in loudness, a trend that has become known as the “loudness war”. It refers to a common habit of modern music production to apply several levels of compression in the post-processing phase. This reduces the dynamic range by raising the level of quiet parts. This results in a sound that is more attractive (subjectively) and demands the attention of the listener (objectively).
- Noisiness: Noise behavior analysis refers to the different recording qualities of the items in the Europeana Sounds collection. Adding these features to the stack would favor performance over composition, and thus group the records by their quality.
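One simple descriptor related to noisiness is the zero crossing rate, i.e. how often the waveform changes its sign. The following minimal sketch assumes a synthetic 220 Hz tone and additive uniform noise as stand-ins for a clean and a degraded recording; the sample rate, noise level and function name are illustrative assumptions.

```python
import math
import random

def zero_crossing_rate(samples):
    """Fraction of adjacent sample pairs whose signs differ."""
    if len(samples) < 2:
        return 0.0
    crossings = sum(
        1
        for prev, cur in zip(samples, samples[1:])
        if (prev >= 0) != (cur >= 0)
    )
    return crossings / (len(samples) - 1)

SAMPLE_RATE = 8000  # assumed sample rate in Hz, one second of audio
rng = random.Random(0)  # fixed seed so the sketch is reproducible

# Clean stand-in: a pure 220 Hz sine tone.
tone = [math.sin(2 * math.pi * 220 * n / SAMPLE_RATE) for n in range(SAMPLE_RATE)]
# Degraded stand-in: the same tone with additive uniform noise.
noisy = [sample + rng.uniform(-0.5, 0.5) for sample in tone]

zcr_clean = zero_crossing_rate(tone)
zcr_noisy = zero_crossing_rate(noisy)
```

The noisy signal crosses zero more often than the clean tone, so the value can serve as a rough indicator of recording degradation.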
Spoken word shows completely different spectral properties and thus requires different audio features to distinguish it from music content. This introduces additional problems, because some features can mask the properties of others. To avoid this, weights can be applied to reduce the influence of distinct feature sets on the final similarity estimation. The collection of features chosen for the implementation of the MIR pilot provides a good example. The Zero Crossing Rate feature was added to capture effects of audio degradation within old records. This audio property has to be considered in the similarity estimation, but it is not as important as rhythmic, timbral or harmonic similarity. Feature weights were estimated and optimized empirically and provided the following distribution of feature importance:
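How weights limit the influence of a single feature set can be illustrated with a short sketch. The weights below are placeholder values chosen for illustration, not the empirically optimized weights of the MIR pilot. The feature names, vectors and the weighted-average combination are likewise assumptions.

```python
import math

def euclidean_distance(a, b):
    """Numerical difference between two feature vectors of equal length."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def weighted_distance(features_a, features_b, weights):
    """Weighted average of the per-feature-set distances of two songs."""
    total_weight = sum(weights.values())
    combined = sum(
        weight * euclidean_distance(features_a[name], features_b[name])
        for name, weight in weights.items()
    )
    return combined / total_weight

# Placeholder weights (assumption): degradation counts less than musical content.
weights = {"timbre": 0.4, "rhythm": 0.3, "harmony": 0.2, "zero_crossing_rate": 0.1}

# Two hypothetical songs that differ only in their zero crossing rate.
song_a = {"timbre": [0.2, 0.4], "rhythm": [0.5, 0.5],
          "harmony": [0.1, 0.9], "zero_crossing_rate": [0.05]}
song_b = {"timbre": [0.2, 0.4], "rhythm": [0.5, 0.5],
          "harmony": [0.1, 0.9], "zero_crossing_rate": [0.45]}

distance = weighted_distance(song_a, song_b, weights)
```

Because the zero crossing rate carries a small weight, a large difference in recording quality moves the combined distance only slightly. In practice, the individual feature sets would also need to be normalized to comparable value ranges before they are combined.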
The user-interface of the MIR pilot is aligned to the current design of the Europeana search interface. The MIR pilot supports the following use cases:
- Term-based queries: The term-based query option accepts text-based input and uses it to query the “title” attribute of the metadata. This attribute seems to be overloaded and commonly also contains information about composer and performer.
- Query by Example: By supplying an example song, the system searches for similar ones based on their acoustic properties.
- Query by SoundCloud example: This option also accepts example content that is not contained in the Europeana Sounds collection.
Term-based queries were added to facilitate elementary means to explore the Europeana Sounds collection, or to search for content based on certain terms such as “blues”, “love” or “piano”. To simplify experimentation and evaluation of the system, a Web-based audio player has been added directly into the result page. The query by example functionality can be triggered by clicking on the green button with the magnifying glass symbol for the desired query song. The system then calculates the similarities to all records of the collection, ranks them in descending order by their similarity and returns the top 24 entries of the result-list.
In order to get an end-user perspective on the results of the MIR pilot, a user evaluation was carried out by the Netherlands Institute for Sound and Vision. In sessions of 90 minutes, 13 participants provided feedback on their experience with the MIR pilot, based on the similarities and differences they experienced between songs from several genres. Snippets of 30 seconds were played from different genres, always starting with one reference track. After the reference track, three other tracks from a particular genre (the three that ranked highest in the content based search) were played. After each of these three tracks, the participants had to indicate whether they experienced similarities and/or differences with the initial reference track. The user evaluation showed a roughly linear correspondence between the calculated similarity estimation and the experienced similarity.
by Alexander Schindler, AIT Austrian Institute of Technology