Ever watched a music video and struggled to hear your favorite instrument's notes over the loud drums? MIT researchers have developed PixelPlayer, artificial intelligence (AI) based software that can single out each instrument playing in a video and make it louder or softer.

PixelPlayer is a new AI project from the MIT Computer Science and Artificial Intelligence Laboratory (CSAIL). A team of MIT researchers led by Hang Zhao, a Ph.D. student at CSAIL, has devised a deep-learning system that can individually analyze the sound of every instrument playing in a video. It can then play back a specific instrument on its own with a single click.

“We were surprised that we could actually spatially locate the instruments at the pixel level. Being able to do that opens up a lot of possibilities, like being able to edit the audio of individual instruments by a single click on the video,” said Zhao.

The project, titled “The Sound of Pixels,” will be presented at the European Conference on Computer Vision (ECCV), to be held in Munich, Germany, this September.

Zhao co-wrote the paper under the guidance of MIT professors Antonio Torralba, of the Department of Electrical Engineering and Computer Science, and Josh McDermott, of the Department of Brain and Cognitive Sciences. The other team members are research associate Chuang Gan, undergraduate student Andrew Rouditchenko, and Ph.D. graduate Carl Vondrick.

Self-supervised Sound Software

This is not the first effort to isolate the sounds of individual instruments. Earlier projects, however, focused primarily on the audio alone and often required extensive manual tagging.

PixelPlayer, by contrast, is a self-supervised system built on deep learning: it needs no human intervention to identify the type or sound of an instrument being played. In the paper, the MIT research team demonstrates that the new sound software can precisely identify more than 20 commonly seen musical instruments. However, the system still cannot distinguish sub-classes of instruments.
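
The paper's self-supervised recipe is a "mix-and-separate" scheme: the audio tracks of two solo videos are artificially combined, and the system learns to pull the mixture apart again, with the original solos serving as free ground truth. Here is a rough sketch of that idea in PyTorch; the `model(video, mixture)` interface and function names are hypothetical stand-ins, not the paper's actual code:

```python
# A rough sketch of self-supervised "mix-and-separate" training: two solo
# clips are summed into a synthetic duet, and the solos themselves become
# the training targets, so no human labeling is needed.
import torch.nn.functional as F

def training_step(model, video_a, audio_a, video_b, audio_b):
    mixture = audio_a + audio_b        # synthetic two-instrument mix
    est_a = model(video_a, mixture)    # recover instrument A's sound
    est_b = model(video_b, mixture)    # recover instrument B's sound
    # the original solo recordings act as free ground truth
    return F.mse_loss(est_a, audio_a) + F.mse_loss(est_b, audio_b)
```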

“Trained on over 60 hours of videos, the ‘PixelPlayer’ system can view a never-before-seen musical performance, identify specific instruments at a pixel level, and extract the sounds that are associated with those instruments,” states MIT’s official announcement.

On the deep-learning side, PixelPlayer discovers patterns in musical notes using neural networks trained on existing footage. Given an input video, one neural network analyzes the visuals while a second focuses specifically on the audio. A third network, called the ‘synthesizer,’ links specific pixels to specific sound waves to single out the different instruments.
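
A minimal sketch may make that three-network division of labor concrete. The layer sizes, feature dimensions, and pixel-level masking scheme below are illustrative assumptions, not PixelPlayer's actual architecture:

```python
# Illustrative three-network layout: a visual network, an audio network,
# and a "synthesizer" that ties a chosen pixel to a slice of the sound.
import torch
import torch.nn as nn

class VisualNet(nn.Module):
    """Maps video frames to a per-pixel feature map."""
    def __init__(self, feat_dim=16):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(3, 32, 3, padding=1), nn.ReLU(),
            nn.Conv2d(32, feat_dim, 3, padding=1),
        )
    def forward(self, frames):               # (B, 3, H, W)
        return self.conv(frames)             # (B, K, H, W)

class AudioNet(nn.Module):
    """Splits a mixture spectrogram into K feature channels."""
    def __init__(self, feat_dim=16):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(1, 32, 3, padding=1), nn.ReLU(),
            nn.Conv2d(32, feat_dim, 3, padding=1),
        )
    def forward(self, spec):                 # (B, 1, F, T)
        return self.conv(spec)               # (B, K, F, T)

class Synthesizer(nn.Module):
    """Links one pixel's visual feature to the audio channels by
    predicting a spectrogram mask for the sound coming from that pixel."""
    def forward(self, pixel_feat, audio_feat):
        # pixel_feat: (B, K); audio_feat: (B, K, F, T)
        mask = torch.einsum('bk,bkft->bft', pixel_feat, audio_feat)
        return torch.sigmoid(mask)           # (B, F, T), values in [0, 1]

# Forward pass: click one pixel, recover the sound associated with it.
vnet, anet, synth = VisualNet(), AudioNet(), Synthesizer()
frames = torch.randn(1, 3, 224, 224)         # a video frame
mix_spec = torch.randn(1, 1, 256, 64)        # mixture spectrogram
vfeat = vnet(frames)
pixel = vfeat[:, :, 100, 100]                # visual feature at pixel (100, 100)
mask = synth(pixel, anet(mix_spec))
separated = mix_spec[:, 0] * mask            # that pixel's share of the audio
```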

Applications of PixelPlayer

Beyond separation, ‘The Sound of Pixels’ project offers more capabilities for music composers: the AI-based sound software can also adjust the volume of each singled-out instrument.

The system could help engineers enhance the audio quality of old concert recordings. Music-makers could even pick out a particular instrument from the mix of instruments played in a video, swap in a new instrument's sound, and preview how it would sound when played together with the others.
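
As a rough illustration of that kind of remixing, suppose the separation step has already produced one audio track per instrument; per-instrument volume changes then reduce to scaling each track before summing. The `remix` helper below is hypothetical, not part of PixelPlayer:

```python
# Hypothetical post-processing once separation has produced one track per
# instrument: scale each stem by its own gain, then sum back into a mix.
import numpy as np

def remix(stems: dict[str, np.ndarray], gains: dict[str, float]) -> np.ndarray:
    """Sum separated instrument tracks, scaling each by its gain (default 1.0)."""
    return sum(gains.get(name, 1.0) * audio for name, audio in stems.items())

# e.g. boost the violin, mute the drums, leave the piano untouched
stems = {name: np.zeros(44100) for name in ("violin", "drums", "piano")}
mix = remix(stems, {"violin": 1.5, "drums": 0.0})
```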

Zhao said, “A system like PixelPlayer could even be used on robots to better understand the environmental sounds that other objects make, such as animals or vehicles.”