ICASSP 2023: The Good, the Bad, and the Ugly
The beautiful island of Rhodes welcomed the most prestigious signal processing conference in the world this past week. As per usual, ICASSP –the IEEE International Conference on Acoustics, Speech, and Signal Processing– has been both great and disappointing. I’ll start with the good: hot trends, my 10 favorite papers, some social activities, and the island of Rhodes. Then I’ll move to the bad and the ugly, which mostly revolve around the logistics of the conference and its topics, the terrible oral presentations I witnessed –and endured–, and the sheer lack of diversity. Let’s begin.
It was fantastic to see how the field is moving towards a broader, multi-modal world. Audio tagging is essentially replaced by audio-text models, which makes me happy in terms of the exciting applicability of retrieving –and ultimately generating– sounds via free-form text and/or image frames. It makes sense that models that do not only understand audio, but also text and/or images, should be capable of having a richer understanding of audio signals in general. While we are still “chasing” SOTA in very narrow datasets in a vast set of papers, it is good to see an intentional effort to move beyond that.
Generation with diffusion is here to stay (for at least a couple more years?), and it has become ubiquitous across several fronts of signal processing: from symbolic music generation to sound effects generation. Not only that, but 44.1kHz+ audio generation is finally a relatively common practice that yields much more realistic and usable results. These generation models are maturing significantly, and it feels like high quality sound effects, speech, and music generation will be solved problems within the next few years (up to a certain glass or artistic ceiling?).
Finally, contrastive learning is everywhere. From multimodal models, to big tasks such as source separation, passing through specific applications such as the identification of the most important musical features of a given singer. This way of training architectures is fascinating, since it is both easy to interpret and to adapt from broader to very specific needs.
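For readers who have not run into it yet, here is a minimal sketch of the kind of symmetric contrastive (InfoNCE-style) loss that many audio-text models build on. It is a toy under stated assumptions, not the implementation of any paper on this list: the function names, the temperature of 0.07, and the tiny hand-written embeddings are all illustrative, and real systems compute this on batches of learned embeddings with a deep learning framework.

```python
import math

def _normalize(v):
    # Scale a vector to unit length so dot products become cosine similarities.
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def _diagonal_cross_entropy(logits):
    # Mean cross-entropy where the correct "class" for row i is column i.
    total = 0.0
    for i, row in enumerate(logits):
        m = max(row)  # subtract the max for numerical stability
        log_sum_exp = m + math.log(sum(math.exp(x - m) for x in row))
        total += log_sum_exp - row[i]
    return total / len(logits)

def contrastive_loss(audio_emb, text_emb, temperature=0.07):
    # audio_emb[i] and text_emb[i] are assumed to describe the same item;
    # every other pairing in the batch acts as a negative example.
    a = [_normalize(v) for v in audio_emb]
    t = [_normalize(v) for v in text_emb]
    logits = [[sum(x * y for x, y in zip(ai, tj)) / temperature for tj in t]
              for ai in a]
    logits_t = [list(col) for col in zip(*logits)]  # text-to-audio direction
    return 0.5 * (_diagonal_cross_entropy(logits)
                  + _diagonal_cross_entropy(logits_t))

# Toy batch: two items whose audio and text embeddings line up perfectly.
audio = [[1.0, 0.0], [0.0, 1.0]]
matched_text = [[1.0, 0.0], [0.0, 1.0]]
shuffled_text = [[0.0, 1.0], [1.0, 0.0]]

print(contrastive_loss(audio, matched_text))   # close to zero
print(contrastive_loss(audio, shuffled_text))  # much larger
```

The loss is small when each audio embedding sits closest to its own text embedding and large when the pairs are mismatched, which is the whole intuition behind why this objective is easy to interpret.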
Top 10 Papers
Selecting 10 papers from 2,765 is obviously unfair and likely only useful as click bait. Yet here I am. The following list is exceedingly biased towards my favorite research topics, which include audio understanding, generation, and everything to do with music. Sorted alphabetically by author.
This impressive work, done by PhD student Jonah Casebeer in collaboration with Adobe Research, presents a framework to treat adaptive filters as a meta-learning problem: basically a lightweight, online form of self-supervision with applications as broad as biomedical sensing, telecommunications, music signals, etc. The paper focuses on 5 specific tasks (including echo cancellation, blind equalization, and multi-channel dereverberation), and shows how this new approach achieves SOTA results in all of them, without the tedious task of hand-tuning the algorithm, since all of the parameters are meta-learned. Published at TASLP. Code and other resources here.
The titans from Spotify present a method to achieve alignment between audio and lyrics using a contrastive learning approach. They show how this relatively simple method achieves SOTA results and generalizes to other languages beyond English, even when only English was used as training data. This is interesting, since it means the system is capable of learning the mapping between phonetics and text in a broader sense. Great work that won the best industry paper award!
Despite its unfortunate name, these researchers from Microsoft present an interesting CLIP version for audio instead of images. That is, mapping both audio and text into the same embedding space. They use RoBERTa as their text encoder and PANNs as their audio encoder (similar to our own work published at ICASSP this year, which I think is pretty good, but it would look bad if I included it on this list myself :D). Again, contrastive learning is here to save the day, and CLAP achieves SOTA in zero-shot scenarios. Code and other resources here.
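To make the zero-shot idea concrete, here is a small sketch of how classification works once audio and text live in the same space: embed each candidate label as text, embed the clip, and pick the label with the highest cosine similarity. This is not CLAP's code. The function names are hypothetical and the three-dimensional embeddings are made up by hand; in practice they would come from the trained audio and text encoders.

```python
import math

def cosine_similarity(u, v):
    # Cosine of the angle between two vectors of equal length.
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(y * y for y in v))
    return dot / (norm_u * norm_v)

def zero_shot_classify(audio_embedding, label_embeddings):
    # label_embeddings maps a free-form text label to its (assumed) text embedding.
    scores = {label: cosine_similarity(audio_embedding, emb)
              for label, emb in label_embeddings.items()}
    best_label = max(scores, key=scores.get)
    return best_label, scores

# Hand-written toy embeddings, standing in for real encoder outputs.
label_embeddings = {
    "dog barking": [0.9, 0.1, 0.0],
    "rain": [0.0, 0.2, 0.9],
}
clip_embedding = [0.8, 0.3, 0.1]

best_label, scores = zero_shot_classify(clip_embedding, label_embeddings)
print(best_label, scores)
```

The appealing part is that the label set is just a dictionary of strings: swapping in new classes needs no retraining, only new text embeddings.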
Yuan Gong (et al.!) doing it again. They present a multimodal framework for audio and images in which both modalities share the same branch. Inference consists of passing first the image and then the audio (or vice versa) through the same weights, yielding a more tightly coupled joint space and achieving SOTA results on VGGSound. Published at SPL. Code here.
The Potential of Neural Speech Synthesis-based Data Augmentation for Personalized Speech Enhancement - Kuznetsova et al.
One of the most interesting trends in this age of AI is to leverage large models to do data augmentation to train more specific models. It’s basically inception in the AI world. This would allow bypassing the arduous task of asking humans to annotate data, and just rely on robust models to generate such data. The titans from Indiana University leverage models by Google such as AudioLM to generate audio data to be later used in their personalized speech enhancement training framework. And it works relatively well, basically for free!
Fantastic work that presents a general-purpose vocoder to generate audio from a CQT representation using the hottest approach in AI these days: diffusion. They achieve SOTA results on the task of bandwidth expansion, and the paper is the well-deserved recipient of the best student paper award. Congrats! Code here.
The titans at Dolby doing titanesque things as per usual! They present DAG, a Diffusion Audio Generator that is capable of producing 48kHz audio using text as prompts. This work brings us closer to high-quality sound generation on demand, which might, in the not so distant future, make the manually curated and recorded sound effects libraries obsolete. Check out their fantastic results and other resources here.
This is truly outstanding work. Darius Petermann from Indiana University, in collaboration with researchers at MERL, presents a novel method to project audio embeddings into a hyperbolic space (a Poincaré ball). Such a non-Euclidean space becomes much better at disentangling sources from an audio recording. Check out their awesome video demo here. Big fan! This paper obtained the well-deserved best student paper award. Congrats!
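If the Poincaré ball sounds exotic, the key ingredient is just a different distance function. The sketch below implements the standard textbook distance between two points inside the unit ball; it is only meant to give a feel for the geometry and says nothing about how the paper trains or uses its embeddings. The function name and the example points are my own.

```python
import math

def poincare_distance(u, v):
    # Standard hyperbolic distance between two points strictly inside the unit ball:
    # d(u, v) = arcosh(1 + 2 * ||u - v||^2 / ((1 - ||u||^2) * (1 - ||v||^2)))
    sq_norm_u = sum(x * x for x in u)
    sq_norm_v = sum(y * y for y in v)
    if sq_norm_u >= 1.0 or sq_norm_v >= 1.0:
        raise ValueError("points must lie strictly inside the unit ball")
    sq_diff = sum((x - y) ** 2 for x, y in zip(u, v))
    return math.acosh(1.0 + 2.0 * sq_diff / ((1.0 - sq_norm_u) * (1.0 - sq_norm_v)))

# Two pairs of points that are equally far apart in Euclidean terms (0.05).
near_origin = poincare_distance([0.0, 0.0], [0.05, 0.0])
near_boundary = poincare_distance([0.9, 0.0], [0.95, 0.0])

print(near_origin, near_boundary)
```

The same Euclidean gap counts for far more near the boundary than near the origin, so the ball has a lot of room towards its edge. That property is what makes hyperbolic spaces attractive for tree-like or hierarchical structure.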
In collaboration with Google, Amir Shirian et al. present a method to learn audio understanding models via graphs with self-supervised learning under limited data scenarios. Given how hard it is to obtain high-quality human labels, this interesting work presents an alternative that exploits graph structures, treating each audio sample as a node in the graph. Such a model obtains on-par or better results than fully supervised models, which is quite impressive. Presented at the IEEE Journal of Selected Topics in Signal Processing.
Last but not least, I really enjoyed this work by the extraordinary humans from AIST. They make use of a clever contrastive learning framework to identify singing voices, where they select the negative examples by changing the timbre of the voice (via naive pitch modulation). This allows the model to learn the specificities of the singing voice, yielding SOTA results on singer identification. Really interesting way of exploiting contrastive learning to gain insights on the acoustics of the singing voice! Published at TASLP. Code here.
The island of Rhodes is beautiful. Access to its crystalline and turquoise beaches was dangerously easy (across the street from the conference venue), and the old city of Rhodes is the most amazing medieval town I’ve ever been to, which reminded me of old Catalan and French towns, but bigger! The history of the island is rich and complex, which you can witness in every little corner and town of the island. The food was utterly delicious, and the people were kind and attentive. What a fantastic location for a conference!
Moreover, it was such a delight to re-connect with the wonderful MIR community. We had an impromptu meet-up on Tuesday that ended up being attended by well over 50 researchers! Thanks to everyone who came, and to Prof. Magdalena Fuentes from NYU for being my partner in crime in making it happen. Once again, the MIR community proving to be the best research community ever :)
As a bonus: what a joy to go back to a large-scale in-person conference! Virtual conferences do not do it for me and I think most people would agree that they generally suck. The energy, serendipity, and full immersion are still impossible to recreate via Zoom (we’ll see what happens in a few years when –and if– VR/AR becomes a thing –that doesn’t cost $3500). Of course, it is terrible from an environmental point of view: thousands of researchers traveling thousands of miles by plane is absolutely ridiculous. But until we find a solution, virtual conferences are no rival to their in-person counterparts.
Given the sheer size of the conference (4k+ attendees, according to the welcome reception), it was definitely challenging not to feel disconnected at times. In several of the sessions and events I attended, I was surrounded by researchers from fields I am not too familiar with, and trying to meet people was rather futile given the relatively anti-social tendencies of most attendees. So at times it felt quite lonely, despite (or due to) the thousands of attendees.
Moreover, it was quite disappointing to have some of the few music-oriented sessions overlapping, while there were other massively broad sessions that were too high-level to be of sufficient technical interest (e.g., “Deep Neural Networks” –wtf?). I understand it is a challenge to organize ~2.8k papers into 325 sessions, but some extra effort should be made so that relatively small topics like music are not discussed in more than one session at a time.
Finally, some sessions felt like leftovers. By that I mean that the works published there seemed like previous rejects from top tier conferences from other research fields like Computer Vision or NLP. Such works were definitely subpar compared to those from other sessions that focused on the main topics of the conference.
Worst of all, around 60 to 70% of the oral technical presentations were absolutely terrible. Slides completely cluttered with text, speakers reading from their phones, pre-recorded presentations without Q&A, zero effort at trying to engage with the audience, etc. We, as part of the signal processing community, should really make an effort to improve our communication and presentation skills. I heard CVPR is light-years ahead of us in terms of quality of presentations, and we could and should definitely learn from them. I’m not sure what a good way to incentivize these important skills would be. One idea we talked about during the conference is to have peer-reviewed presentations before the conference, but that might be too logistically involved. If you have other ideas, I’m all ears to try to apply them to ISMIR 2024!
Additionally, and to conclude, the ratio of men to women and non-binary people at the conference is extremely sad (exact numbers were not shared during the welcome reception, which is a missed opportunity to make everyone aware of the sheer lack of diversity in our field). I understand that this is a problem that is not easy to address in the short term, but seriously, it is quite unsettling to realize that the whole field is mostly driven by men talking with other men, answering questions from other men, and reviewing the works of other men. Also, I can’t believe there were no pronouns on the official badges! C’mon, it’s 2023, organizers! It’s the least we can do to make the few non-binary and trans attendees feel slightly more included. Maybe this oversight comes from the fact that the core team of this year’s ICASSP organization (including general chair and co-chairs, workshop coordinator, and technical program chairs) only included a single woman out of 9 members (!!). Or that out of the 44 new Fellows of the IEEE Signal Processing Society only 3 are women (i.e., less than 7%). We need to do everything possible to be more welcoming to women and other underrepresented humans. Until we reach gender parity, this field will continue to appear unappealing and unexciting to many.