Leave-one-out cross-validation is an option when few annotated examples are available. However, in text IR, 20-fold CV is said to give estimates that tend to be close to the truth.
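
To make the difference between the two schemes concrete, here is a minimal sketch in plain Python. The function names `k_fold_splits` and `leave_one_out_splits` and the collection size of 45 items are illustrative assumptions, not part of any proposal here. Leave-one-out is treated as the special case where the number of folds equals the number of annotated items.

```python
def k_fold_splits(n_items, k):
    """Yield (train_indices, test_indices) pairs for k contiguous folds.

    Hypothetical helper for illustration: items are indexed 0..n_items-1
    and are assumed to be shuffled beforehand if ordering matters.
    """
    if not 1 < k <= n_items:
        raise ValueError("k must satisfy 1 < k <= n_items")
    base, extra = divmod(n_items, k)
    start = 0
    for fold in range(k):
        # The first `extra` folds take one additional item so sizes differ by at most 1.
        size = base + (1 if fold < extra else 0)
        test = list(range(start, start + size))
        train = list(range(0, start)) + list(range(start + size, n_items))
        yield train, test
        start += size


def leave_one_out_splits(n_items):
    """Leave-one-out is k-fold CV with k equal to the number of items."""
    return k_fold_splits(n_items, n_items)


if __name__ == "__main__":
    n = 45  # assumed number of annotated examples, for illustration only
    for name, splits in (("20-fold", k_fold_splits(n, 20)),
                         ("leave-one-out", leave_one_out_splits(n))):
        splits = list(splits)
        sizes = sorted({len(test) for _, test in splits})
        print(name, "folds:", len(splits), "test-set sizes:", sizes)
```

With very few annotated items the two schemes nearly coincide. The practical difference is how many times the system has to be trained and run.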

I think the opinion in review 2 is worth considering: "If the main applications are tempo induction and subgenre classification, why not evaluate the performance for these applications directly? This would be more relevant for MIR and annotation would be far less time-consuming. I think this issue has to be seriously considered by the participants in case they do not own already a sufficient amount of annotated data."