After my first draft (viewtopic.php?t=3261), I played around a bit to get better timing for the transitions between the images.
It's not 100% perfect yet, but considering that I didn't have to make any manual adjustments, since it simply follows the audio input, I'm pretty happy with the result.
Turn up the volume and enjoy: https://www.youtube.com/watch?v=4ErNVZ3bcq0