I would assume they divide each song into relatively small 5-10 second chunks and train classification software with these as inputs. When you record a few seconds of audio with Shazam, it is uploaded to their server and fed to the classifier, which returns its best guess. Their classification software may be based on something like the k-nearest-neighbors algorithm.
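To make the idea concrete, here is a toy sketch of nearest-neighbor matching. Everything in it is hypothetical: the feature vectors stand in for some summary of an audio chunk (say, average energy in a few frequency bands, which a real system would compute with an FFT), and the database, song names, and `classify` helper are invented for illustration.

```python
import math

# Hypothetical database: each known song chunk reduced to a small
# feature vector (e.g., average energy per frequency band).
database = {
    "song_a": [0.9, 0.1, 0.3, 0.2],
    "song_b": [0.2, 0.8, 0.5, 0.1],
    "song_c": [0.4, 0.4, 0.9, 0.7],
}

def distance(u, v):
    # Euclidean distance between two feature vectors.
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def classify(query):
    # 1-nearest-neighbor: the best guess is the song whose stored
    # chunk is closest to the query recording's features.
    return min(database, key=lambda song: distance(database[song], query))

print(classify([0.85, 0.15, 0.25, 0.3]))  # prints "song_a"
```

A real deployment would hold millions of chunks, so a linear scan like this would be replaced by an index that finds near neighbors quickly, but the matching principle is the same.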
With a big database of audio, you could run experiments to optimize the length of the sound bite, which frequencies to analyze, and other factors that affect recognition performance. Speech recognition, and pattern recognition more generally, have been major research areas in academia and industry for decades.
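Such an experiment could be sketched as a simple grid search over the tunable parameters. This is only an illustration: `recognition_accuracy` here is a made-up deterministic stand-in, whereas a real experiment would run the classifier over a labeled test set of noisy recordings and report the fraction identified correctly.

```python
import itertools

def recognition_accuracy(chunk_seconds, num_bands):
    # Hypothetical stand-in for a real evaluation run. In practice this
    # would train and test the recognizer with the given chunk length
    # and number of analyzed frequency bands, returning its accuracy.
    return 1.0 - abs(chunk_seconds - 7) * 0.05 - abs(num_bands - 16) * 0.01

# Try every combination of candidate chunk lengths and band counts,
# keeping the setting with the highest measured accuracy.
candidates = itertools.product([5, 7, 10], [8, 16, 32])
best = max(candidates, key=lambda params: recognition_accuracy(*params))
print("best (chunk length, bands):", best)  # prints (7, 16)
```

With a real evaluation function in place, the same loop would let you trade off recording length against accuracy empirically rather than guessing.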