Funding: Supported by the National Social Science Foundation Art Project (No. 20BC040) and the China Scholarship Council (CSC) Grant (No. 202306320525).
Abstract: Dance-driven music generation aims to generate musical pieces conditioned on dance videos. Previous works focus on monophonic or raw audio generation, while the multi-instrument scenario is under-explored. The challenges associated with dance-driven multi-instrument music (MIDI) generation are twofold: (i) the lack of a publicly available paired dataset of multi-instrument MIDI and video, and (ii) the weak correlation between music and video. To tackle these challenges, we have built the first paired multi-instrument MIDI and dance dataset (D2MIDI). Based on this dataset, we introduce a multi-instrument MIDI generation framework (Dance2MIDI) conditioned on dance video. Firstly, to capture the relationship between dance and music, we employ a graph convolutional network to encode the dance motion, which allows us to extract features related to dance movement and dance style. Secondly, to generate a harmonious rhythm, we use a transformer model with a cross-attention mechanism to decode the drum track sequence. Thirdly, we model the generation of the remaining tracks from the drum track as a sequence understanding and completion task, employing a BERT-like model that learns the context of the entire music piece through self-supervised learning. We evaluate the music generated by our framework trained on the D2MIDI dataset and demonstrate that our method achieves state-of-the-art performance.
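The third step, completing the remaining tracks from the drum track, can be illustrated with a small data-preparation sketch. It shows one plausible way to frame the task as masked-token infilling, which is how BERT-like models are commonly trained. The abstract does not give the tokenization or masking scheme, so the `[MASK]` symbol, the `<track>` boundary markers, the token names, and the function `build_completion_example` are all hypothetical and are not taken from Dance2MIDI.

```python
from typing import Dict, List, Tuple

MASK = "[MASK]"  # hypothetical mask symbol; the actual vocabulary is not given in the abstract


def build_completion_example(
    tracks: Dict[str, List[str]], known_track: str = "drums"
) -> Tuple[List[str], List[str]]:
    """Flatten tracks into one sequence, hiding every track except `known_track`.

    Returns (model_input, targets). Both lists have the same length: targets
    hold the original tokens, and model_input holds MASK at every position
    the model would be asked to fill in.
    """
    if known_track not in tracks:
        raise KeyError(f"missing conditioning track: {known_track}")

    model_input: List[str] = []
    targets: List[str] = []
    # Known track first, then the others in sorted order for determinism.
    order = [known_track] + sorted(name for name in tracks if name != known_track)
    for name in order:
        header = f"<{name}>"  # hypothetical track-boundary marker, never masked
        model_input.append(header)
        targets.append(header)
        for token in tracks[name]:
            targets.append(token)
            model_input.append(token if name == known_track else MASK)
    return model_input, targets


# Toy piece with made-up tokens; real MIDI events would carry pitch, onset, and duration.
example = {
    "drums": ["kick", "snare", "kick", "snare"],
    "bass": ["C2", "G2"],
    "piano": ["C4", "E4", "G4"],
}
model_input, targets = build_completion_example(example)
```

In this framing the drum tokens stay visible as context, and a bidirectional model would be trained to recover `targets` at the masked positions. Whether the original framework masks whole tracks or random spans is not stated in the abstract.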