Google recently launched Hum to Search, a new machine-learned system within Google Search that helps people find a song simply by humming it. One of the most significant challenges in recognizing a hummed melody is that a hummed tune often contains relatively little information, as illustrated by this hummed example of Bella Ciao. The difference between the hummed version and the original studio recording can be visualized using spectrograms, as shown below:

Visualization of a hummed clip and a matching studio recording.

Given the spectrogram on the left, the model needs to locate the audio corresponding to the right-hand image. To do this, it must learn to focus on the dominant melody and ignore background vocals, instruments, voice timbre, and other noise. To find by eye the dominant melody that might be used to match these two spectrograms, one can look for similarities in the lines towards the bottom of the images.
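To make the comparison concrete, here is a minimal sketch of how one might render such spectrograms with the open-source librosa library. The file names (hummed.wav, studio.wav) and the parameters are placeholders for illustration, not the pipeline used by Hum to Search.

```python
# Minimal sketch: visualize a hummed clip next to a studio recording
# as log-scaled mel spectrograms. File names are placeholders.
import librosa
import librosa.display
import matplotlib.pyplot as plt

def mel_db(path, sr=16000, n_mels=64):
    """Load audio and return its log-scaled mel spectrogram."""
    y, sr = librosa.load(path, sr=sr)
    mel = librosa.feature.melspectrogram(y=y, sr=sr, n_mels=n_mels)
    return librosa.power_to_db(mel, ref=mel.max()), sr

fig, axes = plt.subplots(1, 2, figsize=(10, 3), sharey=True)
for ax, (title, path) in zip(
    axes, [("Hummed clip", "hummed.wav"), ("Studio recording", "studio.wav")]
):
    spec, sr = mel_db(path)
    librosa.display.specshow(spec, sr=sr, x_axis="time", y_axis="mel", ax=ax)
    ax.set_title(title)
plt.tight_layout()
plt.show()
```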
The first step in developing Hum to Search was modifying the music-recognition models used in Now Playing and Sound Search to work with hummed recordings. A neural network is trained with pairs of input (here, pairs of hummed or sung audio and recorded audio) and produces an embedding for each input, to be used later for matching. To recognize humming, the network must produce embeddings for which pairs of audio containing the same melody are close to each other, even when they have different instrumental accompaniment and singing voices.
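The sketch below shows one common way to express that objective, a triplet loss in PyTorch: a hummed anchor is pulled toward a recording of the same melody and pushed away from a recording of a different one. The tiny convolutional encoder, the margin, and the batch shapes are illustrative assumptions, not Google's production model.

```python
# Minimal sketch of pairwise embedding training, assuming PyTorch.
# The small encoder and the triplet objective are illustrative choices
# standing in for the production melody-embedding model.
import torch
import torch.nn as nn
import torch.nn.functional as F

class MelodyEncoder(nn.Module):
    """Maps a (batch, 1, mels, frames) spectrogram to a unit-norm embedding."""
    def __init__(self, dim=128):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(1, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),
        )
        self.fc = nn.Linear(64, dim)

    def forward(self, x):
        z = self.fc(self.conv(x).flatten(1))
        return F.normalize(z, dim=1)  # unit vectors, cosine-friendly

encoder = MelodyEncoder()
loss_fn = nn.TripletMarginLoss(margin=0.3)
opt = torch.optim.Adam(encoder.parameters(), lr=1e-4)

# One training step on a toy batch: hummed anchor, matching recording
# (positive), and a recording of a different melody (negative).
hummed, same_song, other_song = (torch.randn(8, 1, 64, 128) for _ in range(3))
loss = loss_fn(encoder(hummed), encoder(same_song), encoder(other_song))
opt.zero_grad()
loss.backward()
opt.step()
```

Unit-normalizing the embeddings makes the Euclidean distances in the loss behave like cosine distances, which simplifies the retrieval step described next.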
The resulting model can then generate an embedding for a hummed tune that is similar to the embedding of the song's reference recording. Because this approach produces the embedding of a melody directly from a song's spectrogram, without creating an intermediate representation, the model can match a hummed tune to the original polyphonic recordings without a MIDI (Musical Instrument Digital Interface) version of each track or any other complex hand-engineered logic to extract the melody.
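Once the model can place a hummed tune near its reference recording in embedding space, recognition reduces to a nearest-neighbor lookup over precomputed embeddings of reference recordings. The brute-force numpy sketch below shows the idea with cosine similarity; the database size and dimensionality are made up, and a real deployment would use an approximate nearest-neighbor index rather than a full scan.

```python
# Minimal sketch of the matching step: brute-force cosine similarity
# between a query embedding and a database of reference embeddings.
# The random vectors stand in for real model outputs.
import numpy as np

rng = np.random.default_rng(0)
db = rng.standard_normal((50_000, 128))          # stand-in reference embeddings
db /= np.linalg.norm(db, axis=1, keepdims=True)  # unit-normalize each row

query = rng.standard_normal(128)                 # stand-in hummed-query embedding
query /= np.linalg.norm(query)

scores = db @ query                              # cosine similarity per segment
top = np.argsort(scores)[::-1][:5]               # indices of the 5 best matches
print(list(zip(top, scores[top])))
```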