Can’t trust the feeling? How open data reveals unexpected behavior of high-level music descriptors

Oct 3, 2020 20:01 · 613 words · 3 minute read

A virtual ‘hello’ from Delft! My name is Cynthia Liem, and it’s my delight to present to you the paper “Can’t trust the feeling?”, a work co-authored with my MSc student Chris Mostert that looks into unexpected behavior of high-level music descriptors. Music descriptors, also known as feature extractors, are very important in content-based music information retrieval research. In many cases, we don’t have access to raw audio, so if we want to do any larger-scale analyses, we have to rely on features that are pre-computed and released through APIs or open datasets. Work based on such pre-computed features has led to grander statements on the nature of music, for example on the emotional development of pop music.

00:44 - However, those feature values are numbers, and the question is to what extent we can trust those numbers. Often, they’re rooted in descriptors with associated published papers that report good performance, but we don’t quite know how that extrapolates to in-the-wild situations, and whether the patterns that were picked up are actually the patterns that were supposed to be picked up. We do, however, now have a large-scale dataset that allows us to do a more comprehensive analysis: AcousticBrainz, which is large-scale, community-contributed and open, and rooted in the Essentia open-source feature extractor. The community can run this extractor on their local audio and then submit the results to the online database, where they are associated with MusicBrainz recording IDs; this also means there can be multiple submissions for the same recording.
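To give a feel for what working with this data looks like, here is a minimal sketch of pulling all high-level submissions for one recording, assuming the publicly documented AcousticBrainz v1 REST endpoints (the function name and the example recording ID are mine and purely illustrative):

```python
# Sketch: fetch every high-level submission for one MusicBrainz recording ID
# from the AcousticBrainz API (endpoint paths assumed from the public v1 API).
import requests

API_ROOT = "https://acousticbrainz.org/api/v1"

def fetch_highlevel_submissions(mbid):
    """Return the list of high-level documents submitted for one recording."""
    count = requests.get(f"{API_ROOT}/{mbid}/count").json()["count"]
    submissions = []
    for n in range(count):
        # 'n' selects the n-th submission made for this same recording ID.
        doc = requests.get(f"{API_ROOT}/{mbid}/high-level", params={"n": n}).json()
        submissions.append(doc)
    return submissions

# Hypothetical usage:
# docs = fetch_highlevel_submissions("<some-musicbrainz-recording-id>")
# print(len(docs), "submissions for this recording")
```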

01:34 - There are both lower-level and higher-level descriptors; the higher-level ones are machine learning-based and also include classifier confidences. There are now millions of submissions, so this seems a nice way to do a large-scale, more meta-scientific analysis of descriptor outputs. There is a challenge, though: anyone can submit anything to AcousticBrainz, meaning that we don’t quite know what output should be expected at all. At the same time, neighboring fields, most notably psychology and software engineering, both conduct testing beyond ‘known truths’. They can do that by extrapolating from known relationships.

02:15 - Inspired by that, we’ve done something similar. First of all, we considered resubmissions of the same recording: if a submission maps to the same MusicBrainz recording ID, it is supposed to stem from a semantically equivalent representation of that recording. So, classifiers run on those semantically equivalent representations should give semantically equivalent output. Numerically, however, that’s not quite the case, and our paper presents a more comprehensive analysis, both of the (in)stability and of potential bias in classifier outputs.
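In code, this consistency check boils down to something like the sketch below: for a given classifier, measure how much the confidence for one class varies across resubmissions of the same recording. The field names follow the typical layout of AcousticBrainz high-level documents, and the classifier/label names (`genre_rosamerica`, `roc`) are just one assumed example:

```python
# Sketch of the resubmission-consistency check: spread of one classifier
# confidence across submissions that map to the same recording ID.
from statistics import pstdev

def confidence_spread(submissions, classifier="genre_rosamerica", label="roc"):
    """Standard deviation of one class confidence across resubmissions."""
    confidences = [
        doc["highlevel"][classifier]["all"][label]
        for doc in submissions
        if classifier in doc.get("highlevel", {})
    ]
    return pstdev(confidences) if len(confidences) > 1 else 0.0

# Semantically equivalent inputs should ideally give a spread near zero;
# the paper reports that in practice this is often not the case.
```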

02:47 - Then there are some concepts that are known to relate to one another. For example, we have multiple genre classifiers that can all classify rock. If I give the same input to different rock classifiers, you’d expect correlated outputs to come out of that. But that was not always the case; at worst, we even found a correlation of -0.07 between two rock classifiers. When we dug more deeply, it turned out that the confidences and labels of some classes were quite strangely distributed.
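A relatedness check of this kind can be sketched as a simple correlation between the ‘rock’ confidences of two different genre models over many recordings. The classifier and label names below (`genre_dortmund`/`rock`, `genre_rosamerica`/`roc`) are assumed from typical AcousticBrainz high-level documents, and the helper function is mine:

```python
# Sketch: Pearson correlation between the 'rock' confidences of two
# different genre classifiers, computed over a collection of documents.
import numpy as np

def rock_correlation(docs):
    """Correlate 'rock' confidences of two genre models over a set of docs."""
    a, b = [], []
    for doc in docs:
        hl = doc.get("highlevel", {})
        try:
            a.append(hl["genre_dortmund"]["all"]["rock"])
            b.append(hl["genre_rosamerica"]["all"]["roc"])
        except KeyError:
            continue  # skip documents missing either classifier
    return np.corrcoef(a, b)[0, 1]

# A near-zero or negative value (the paper reports -0.07 at worst) signals
# that the two notions of 'rock' do not behave as one would expect.
```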

03:19 - And when looking at differences between anomalous peaks and non-anomalous parts of those distributions, we found that bit rate, codec, and low-level extractor software version seem to contribute to large distributional differences in outputs. These are aspects we hardly consider in high-level descriptor research, but that we may need to be much more conscious of in our experiments. For more information, we refer to both our paper and Chris’s thesis. We hope that our work can inspire more comprehensive, holistic, out-of-ground-truth quality assurance mechanisms for high-level music descriptors. Thank you very much.
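As a rough illustration of this last step, one can split a classifier’s confidences by such a property (codec, in the sketch below) and compare the per-group distributions, for instance with a two-sample Kolmogorov–Smirnov test. The metadata field paths and classifier/label names are assumed from typical AcousticBrainz documents, and the helper functions are purely illustrative:

```python
# Sketch: split one classifier confidence by codec and compare the
# resulting distributions pairwise with a two-sample KS test.
from collections import defaultdict
from itertools import combinations
from scipy.stats import ks_2samp

def split_by_codec(docs, classifier="genre_rosamerica", label="roc"):
    """Group one class confidence by the codec reported in the metadata."""
    groups = defaultdict(list)
    for doc in docs:
        codec = doc.get("metadata", {}).get("audio_properties", {}).get("codec")
        conf = doc.get("highlevel", {}).get(classifier, {}).get("all", {}).get(label)
        if codec is not None and conf is not None:
            groups[codec].append(conf)
    return groups

def pairwise_ks(groups, min_samples=30):
    """KS statistic for every pair of codec groups with enough samples."""
    return {
        (c1, c2): ks_2samp(groups[c1], groups[c2]).statistic
        for c1, c2 in combinations(groups, 2)
        if len(groups[c1]) >= min_samples and len(groups[c2]) >= min_samples
    }
```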