This was one of my most detailed tinkers to date. TL;DR: once I got deep enough to understand how to do it, I decided I didn’t have enough time or depth to pursue it. I can still share some interesting insights, though.
The Idea
I got this idea in my head: The internet is full of MIDI. Could I download a bunch of MIDI and:
- Render the MIDI to audio with soft synthesizers (rendering sketch below)
- Use Shazam-like detection algorithms on the rendered audio to recover additional metadata (fingerprinting sketch below)
- Use Ollama and SearXNG to gather descriptive data from the song metadata (enrichment sketch below)
- Train a model to take a natural language prompt like “An upbeat rock song in the style of Foo Fighters, in A minor.” and generate MIDI
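Here is roughly what I had in mind for the rendering step: shell out to FluidSynth with a General MIDI SoundFont. The SoundFont path and file names are placeholders; this is a sketch of the approach, not the exact script I ran.

```python
import subprocess
from pathlib import Path

SOUNDFONT = "FluidR3_GM.sf2"  # placeholder: any General MIDI SoundFont you have locally

def render_midi_to_wav(midi_path: Path, wav_path: Path, sample_rate: int = 44100) -> None:
    """Render a MIDI file to WAV by shelling out to the FluidSynth CLI."""
    subprocess.run(
        [
            "fluidsynth",
            "-ni",                 # no MIDI input driver, no interactive shell
            SOUNDFONT,
            str(midi_path),
            "-F", str(wav_path),   # fast-render to this file instead of the audio device
            "-r", str(sample_rate),
        ],
        check=True,
    )

if __name__ == "__main__":
    render_midi_to_wav(Path("song.mid"), Path("song.wav"))
```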
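The Shazam-style step boils down to finding prominent spectrogram peaks (“constellation points”) and hashing pairs of nearby peaks for lookup against a fingerprint database. A toy sketch of just the peak-picking stage, using scipy; the real pipeline still needs the hash pairing and the database.

```python
import numpy as np
from scipy.io import wavfile
from scipy.ndimage import maximum_filter
from scipy.signal import spectrogram

def constellation_peaks(wav_path: str, neighborhood: int = 20):
    """Return (time, frequency) spectrogram peaks, Shazam-style constellation points."""
    rate, samples = wavfile.read(wav_path)
    if samples.ndim > 1:                       # mix stereo down to mono
        samples = samples.mean(axis=1)
    freqs, times, spec = spectrogram(samples, fs=rate, nperseg=4096, noverlap=2048)
    spec = 10 * np.log10(spec + 1e-10)         # log-magnitude
    # A bin counts as a peak if it equals the local maximum in its neighborhood
    # and sits above a crude loudness threshold.
    local_max = maximum_filter(spec, size=neighborhood) == spec
    peak_f, peak_t = np.where(local_max & (spec > spec.mean()))
    return list(zip(times[peak_t], freqs[peak_f]))
```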
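And the enrichment step as I imagined it: ask a local SearXNG instance about the title and artist, then have a local model behind Ollama boil the search snippets down into a short description. The URLs and model name here are assumptions about a default local setup, and SearXNG has to have JSON output enabled for this to work.

```python
import requests

SEARXNG_URL = "http://localhost:8080/search"        # assumed local SearXNG instance
OLLAMA_URL = "http://localhost:11434/api/generate"  # default Ollama REST endpoint
MODEL = "llama3"                                    # any model you have pulled locally

def describe_song(title: str, artist: str) -> str:
    # 1. Search the web for context on the track.
    search = requests.get(
        SEARXNG_URL,
        params={"q": f"{artist} {title} song genre mood key", "format": "json"},
        timeout=30,
    ).json()
    snippets = " ".join(r.get("content", "") for r in search.get("results", [])[:5])

    # 2. Ask the local model to turn the snippets into a one-sentence description.
    prompt = (
        f"Describe the song '{title}' by {artist} in one sentence "
        f"(genre, mood, tempo, key) based on: {snippets}"
    )
    resp = requests.post(
        OLLAMA_URL,
        json={"model": MODEL, "prompt": prompt, "stream": False},
        timeout=120,
    ).json()
    return resp.get("response", "").strip()
```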
I found several hundred thousand MIDI files on the internet and got to work figuring out how to pre-process them for training. I’ll admit I started with a very naive understanding of how existing MIDI note data could be used to generate new MIDI note data. I ran a lot of experiments before I stumbled across the paper “Music Transformer: Generating Music with Long-Term Structure”, and it opened my eyes.
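The thing that paper made click for me is the pre-processing itself: instead of a piano-roll grid, a performance gets flattened into a sequence of events (note-on, note-off, time-shift, velocity changes) that a sequence model can be trained on. Here is a rough sketch of that kind of tokenization with pretty_midi; the quantization choices are my own simplifications, not the paper’s exact vocabulary.

```python
import pretty_midi

def midi_to_events(path: str):
    """Flatten a MIDI file into NOTE_ON / NOTE_OFF / TIME_SHIFT / SET_VELOCITY events,
    loosely following the performance encoding used by Music Transformer."""
    pm = pretty_midi.PrettyMIDI(path)
    # Collect (time, kind, pitch, velocity) tuples from every non-drum instrument
    raw = []
    for inst in pm.instruments:
        if inst.is_drum:
            continue
        for note in inst.notes:
            raw.append((note.start, "NOTE_ON", note.pitch, note.velocity))
            raw.append((note.end, "NOTE_OFF", note.pitch, 0))
    raw.sort(key=lambda e: e[0])

    events, prev_time, prev_vel = [], 0.0, None
    for time, kind, pitch, vel in raw:
        # Encode elapsed time as TIME_SHIFT events (10 ms resolution, max 1 s each)
        gap = int(round((time - prev_time) * 100))
        while gap > 0:
            step = min(gap, 100)
            events.append(f"TIME_SHIFT_{step}")
            gap -= step
        if kind == "NOTE_ON":
            vel_bin = vel // 4              # quantize 0-127 velocity into 32 bins
            if vel_bin != prev_vel:
                events.append(f"SET_VELOCITY_{vel_bin}")
                prev_vel = vel_bin
        events.append(f"{kind}_{pitch}")
        prev_time = time
    return events
```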
The Conclusion
Yeah, it can be done, but the difficulty lies in generating music with long-term structure, because music depends on relationships that span long stretches of time: a motif returning minutes later, a chorus echoing an earlier verse. Models that only see local context struggle to store and reference those distant elements, and transformers run into their own wall, because the memory a vanilla attention layer needs grows quadratically with sequence length.
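To make the memory problem concrete: vanilla attention materializes an L × L weight matrix per head per layer, so doubling the sequence length quadruples the memory, and a few minutes of performance MIDI easily turns into thousands of events. (The paper’s answer, as I understood it, is a memory-efficient relative attention that drops the relative-position term from O(L²D) to O(LD).) A back-of-the-envelope sketch with made-up but typical model sizes:

```python
def attention_matrix_mb(seq_len: int, heads: int = 8, layers: int = 6, bytes_per_val: int = 4) -> float:
    """Memory (MB) for the L x L attention weights alone, across all heads and layers."""
    return seq_len * seq_len * heads * layers * bytes_per_val / 1e6

for L in (512, 2048, 8192):
    print(f"L={L:>5}: ~{attention_matrix_mb(L):,.0f} MB of attention weights")
# L=  512: ~50 MB
# L= 2048: ~805 MB
# L= 8192: ~12,885 MB
```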
In the end, I understood it just well enough to realize the Cornell team is so far beyond my understanding of the technology that I should sit back and watch.