In applying machine learning to musical audio signals, the general goal is to learn a mapping from an input (e.g. a song) to an output (e.g. an annotation such as genre). Since human perception and annotation of music are highly subjective, with low inter-rater agreement, the validity of such machine learning experiments is unclear. Because a computational model cannot meaningfully exceed the level of human agreement, these agreement levels present a natural upper bound for any algorithmic approach. We illustrate this fundamental evaluation problem with results from modeling music similarity between pieces of music, as used in automatic music recommendation.
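To make the notion of inter-rater agreement concrete, the following sketch computes Cohen's kappa, a standard chance-corrected agreement measure, for two hypothetical annotators labeling the genre of ten songs. The annotators, labels, and data are illustrative assumptions, not taken from the study itself; a kappa well below 1 on such annotations is what bounds the performance any genre classifier can meaningfully claim.

```python
# Illustrative sketch (hypothetical data): Cohen's kappa as a measure of
# inter-rater agreement between two genre annotators.
from collections import Counter

def cohens_kappa(rater_a, rater_b):
    """Chance-corrected agreement between two equal-length label sequences."""
    assert len(rater_a) == len(rater_b)
    n = len(rater_a)
    # Observed agreement: fraction of items given identical labels.
    p_o = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Expected chance agreement from each rater's marginal label frequencies.
    freq_a, freq_b = Counter(rater_a), Counter(rater_b)
    labels = set(rater_a) | set(rater_b)
    p_e = sum(freq_a[l] * freq_b[l] for l in labels) / (n * n)
    return (p_o - p_e) / (1 - p_e)

# Hypothetical genre labels for ten songs from two annotators.
a = ["rock", "rock", "jazz", "pop", "rock", "jazz", "pop", "pop", "rock", "jazz"]
b = ["rock", "pop",  "jazz", "pop", "rock", "rock", "pop", "pop", "rock", "jazz"]
kappa = cohens_kappa(a, b)  # raw agreement is 0.8; kappa is lower (~0.70)
```

Even with 80% raw agreement, the chance-corrected kappa is noticeably lower, which is why agreement, not raw accuracy against a single annotator, is the more honest ceiling for evaluation.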