April 18, 2022

How to Generate Music with AI

Mariano Schmidt

The ability to generate and synthesize music in real time is a key topic in generative art. Generative art refers to a piece of art, or a certain part of that art, that is generated by an autonomous system: a system that can produce the work without direct human control.

Just as AI can now generate images, deep learning techniques can also be used to generate music. This area of machine learning has been gathering pace for a number of decades.

The purpose of this article is to look at state-of-the-art music generation methods and the deep neural network models behind them. Music composers and producers can use these machine learning methods to generate music and as an assistance tool for their work.

We will walk through some methods that combine different neural network techniques, and we will introduce and break down some of these concepts and how they operate.

What is a deep neural network?

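A deep neural network is a neural network composed of several layers. The difference between a neural network and a deep neural network is the number of hidden layers between the input and the output: the exact threshold is a convention, and here we call a network deep when it has more than two hidden layers. As a result, its architecture is more complex.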
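
To make the idea of stacked layers concrete, here is a minimal sketch of a forward pass through a small network with three hidden layers. The layer sizes, the random weights, and the helper names (dense, relu, init_layer, forward) are our own illustrative choices, and the network is untrained, so the output values carry no meaning.

```python
import random

def dense(inputs, weights, biases):
    """One fully connected layer: `weights` has one row per output unit."""
    return [sum(w * x for w, x in zip(row, inputs)) + b
            for row, b in zip(weights, biases)]

def relu(values):
    """Non-linearity applied between layers."""
    return [max(0.0, v) for v in values]

def init_layer(n_in, n_out, rng):
    """Random weights and zero biases (illustrative, not a trained model)."""
    weights = [[rng.uniform(-1, 1) for _ in range(n_in)] for _ in range(n_out)]
    biases = [0.0] * n_out
    return weights, biases

def forward(x, layers):
    """Apply every layer in order, with ReLU after each hidden layer."""
    for i, (w, b) in enumerate(layers):
        x = dense(x, w, b)
        if i < len(layers) - 1:   # no activation on the output layer
            x = relu(x)
    return x

rng = random.Random(0)
sizes = [4, 8, 8, 8, 2]   # 4 inputs, three hidden layers of 8 units, 2 outputs
layers = [init_layer(a, b, rng) for a, b in zip(sizes, sizes[1:])]
output = forward([0.5, -0.2, 0.1, 0.9], layers)
```

Adding depth here only means adding entries to sizes; real models differ mainly in the kind of layers they stack and in how the weights are learned.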

Machine learning and music representation

Machine learning models take their inputs in the form of numeric vectors that represent the data in a way the model can process. This leads to our first problem: how can we represent music as numbers?

To achieve this, we must think of a melody as a sequence of numeric tokens, where each token (a vector) holds information about the note, rhythm, and timbre, as well as any other characteristics we want to represent.

We can use MIDI files (a format used to save, transport, and open music sequences) to train the models we will explore. MIDI files are structured files that provide ordered information about notes, changes in rhythm, tempo in BPM (beats per minute), and so on. We then treat that representation like a natural language to train our models.
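
As a rough illustration of turning notes into tokens, the sketch below encodes a melody given as (MIDI pitch, duration) pairs into a flat list of integer ids. The vocabulary layout (ids 0 to 127 for pitches, ids 128 to 143 for durations in sixteenth-note steps) is a hypothetical scheme we made up for this example; it is not the encoding used by any of the models discussed here.

```python
NUM_PITCHES = 128   # MIDI pitch numbers run from 0 to 127
MAX_STEPS = 16      # longest duration we encode, in sixteenth-note steps

def encode(melody):
    """Turn (pitch, duration_in_steps) pairs into a flat list of token ids."""
    tokens = []
    for pitch, steps in melody:
        if not 0 <= pitch < NUM_PITCHES:
            raise ValueError(f"pitch out of MIDI range: {pitch}")
        if not 1 <= steps <= MAX_STEPS:
            raise ValueError(f"unsupported duration: {steps}")
        tokens.append(pitch)                     # ids 0..127: which note
        tokens.append(NUM_PITCHES + steps - 1)   # ids 128..143: how long
    return tokens

def decode(tokens):
    """Inverse of encode: rebuild the (pitch, duration) pairs."""
    return [(tokens[i], tokens[i + 1] - NUM_PITCHES + 1)
            for i in range(0, len(tokens), 2)]

# C4, D4, E4 as quarter notes (4 steps each), then G4 as a half note (8 steps).
melody = [(60, 4), (62, 4), (64, 4), (67, 8)]
tokens = encode(melody)
```

Because decode reverses encode, a model that outputs token ids in this vocabulary can be turned back into notes and written to a MIDI file with a library of your choice.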

With raw audio, most algorithms use the raw representation of the audio at each time step instead. In both cases the input sequences are fed to the model as vectors, and most of the time the models are trained on a task borrowed from natural language processing (NLP): predicting the next token of a sequence at each time step.
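
The next-token task can be sketched as follows: every prefix of a token sequence becomes a training input, and the token that follows it becomes the target. The function name and the context_size parameter are our own; real training pipelines batch and pad these examples, which we leave out.

```python
def next_token_pairs(tokens, context_size):
    """Build (context, target) training examples from one token sequence."""
    pairs = []
    for end in range(1, len(tokens)):
        start = max(0, end - context_size)   # keep at most context_size tokens
        pairs.append((tokens[start:end], tokens[end]))
    return pairs

# MIDI pitches for C4, D4, E4, F4, G4, used directly as tokens.
tokens = [60, 62, 64, 65, 67]
pairs = next_token_pairs(tokens, context_size=3)
for context, target in pairs:
    print(context, "->", target)
```

Generation then runs the same step in a loop: predict a token, append it to the context, and predict again.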

Transformers and neural networks

A transformer is an architecture used for neural networks that contains a number of specific layers called attention layers.

For every token in an input melody, the attention layer generates an attention vector that infers the relationship between this token and the other input tokens.

This approach makes sense in the world of music generation as each token in a melody sounds good or not so good depending on the past, current, and following sounds.
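
A minimal sketch of what an attention layer computes is shown below, using scaled dot-product self-attention. As a simplifying assumption, the token embeddings are used directly as queries, keys, and values; a real transformer first passes them through learned projection matrices and uses several attention heads.

```python
import math

def softmax(scores):
    """Turn raw scores into positive weights that sum to 1."""
    peak = max(scores)                      # subtract max for stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def self_attention(queries, keys, values):
    """Return one mixed vector per token, plus the attention weights."""
    scale = math.sqrt(len(keys[0]))
    outputs, all_weights = [], []
    for q in queries:
        # How strongly this token relates to every token in the input.
        weights = softmax([dot(q, k) / scale for k in keys])
        # Weighted average of the value vectors.
        mixed = [sum(w * v[d] for w, v in zip(weights, values))
                 for d in range(len(values[0]))]
        outputs.append(mixed)
        all_weights.append(weights)
    return outputs, all_weights

# Three toy token embeddings (made-up numbers).
embeddings = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
outputs, weights = self_attention(embeddings, embeddings, embeddings)
```

Each row of weights is the attention vector for one token: it says how much every other token contributes to that token's new representation.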

Music generation models

Let's take a look at the machine learning models we are going to train to get our music generation results.

MelodyRNN

Our first approach is MelodyRNN, a recurrent neural network model based on LSTM layers. It ships with several configurations that change, for example, the pitch range read from a MIDI file, or that enable training techniques such as the attention technique described above.

The tool is developed by Magenta and provides a set of commands that build a dataset from MIDI files by extracting, from each track, the melodies that are useful for training the model. The code is completely open source, and we trained three models from scratch using jazz melodies, Bach pieces, and children’s songs.

In all our training runs, we used the “attention_rnn” configuration so that the model can better learn how a note depends on the earlier notes in the melody.

Here are the melodies generated by the three models:

Jazz Melody

Bach Melody

Children's Songs Melody

Music Transformer

Music Transformer is a model also developed by Magenta that uses transformers to perform music generation. It can generate MIDI performances of around 60 seconds that are more coherent than the output of the LSTM-based models.

In the standard transformer, the attention vectors infer the relationship between tokens from their absolute positions. The attention layers in this algorithm instead use relative attention, which models the relationship between tokens through the relative distance between them.

This lets the model better capture the periodicity, frequency, and other short-term characteristics of the melodies in the training examples.
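
The sketch below shows one way to add a relative-distance term to attention scores. It is written in the spirit of relative attention, not as a reproduction of Music Transformer's implementation: we assume one embedding vector per distance (the hypothetical rel_embeddings list), compute everything naively without any memory optimization, and only let a token attend to itself and to earlier tokens.

```python
import math

def softmax(scores):
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def relative_attention_weights(queries, keys, rel_embeddings):
    """Attention weights with a content term and a relative-distance term.

    rel_embeddings[d] is the vector for "the key is d steps before the query".
    """
    scale = math.sqrt(len(keys[0]))
    rows = []
    for i, q in enumerate(queries):
        logits = []
        for j in range(i + 1):                        # only positions j <= i
            content = dot(q, keys[j])                 # what the tokens are
            position = dot(q, rel_embeddings[i - j])  # how far apart they are
            logits.append((content + position) / scale)
        rows.append(softmax(logits))
    return rows

# Four identical tokens, so the content term cannot tell them apart.
tokens = [[1.0, 0.0]] * 4
# Made-up embeddings that reward a distance of exactly two steps.
rel_embeddings = [[0.0, 0.0], [0.0, 0.0], [3.0, 0.0], [0.0, 0.0]]
rows = relative_attention_weights(tokens, tokens, rel_embeddings)
```

With these toy values, every token puts most of its weight on the token two steps earlier, whatever its absolute position. That is the kind of repeating, distance-based pattern relative attention is meant to capture.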

Music Transformer can generate music in three ways:

Generating a performance from scratch.
Generating a melody continuation.
Generating an accompaniment for a melody.

MuseNet

MuseNet is a tool from OpenAI that also uses transformers to generate MIDI files. Its melodies can be generated from scratch, as a continuation of a primer melody, or as an accompaniment for a given melody.

One of the main differences is that MuseNet uses full attention instead of relative attention. This allows it to generate pieces of music whose melodies stay coherent for up to 4 minutes, possibly at the cost of some short-term coherence.

The code used for training has not been officially published, but you can see some examples here.

MusicVAE

MusicVAE is a hierarchical recurrent variational autoencoder (a deep learning technique used to learn latent representations) that generates musical scores. We are going to explain the components of this architecture and provide examples. First off, it’s important to understand what an autoencoder is.

What is an autoencoder?

An autoencoder is a pair of neural networks: an encoder model and a decoder model.

The encoder learns to compress the input (in this case the original melodies) into a compact representation, a point in the latent space, in such a way that the version reconstructed by the decoder (the output) is as close as possible to the input melodies.
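
The toy example below illustrates the encode, compress, and reconstruct cycle on four-note inputs with a two-number latent code. To keep it short and deterministic, the encoder and decoder weights are hand-picked by us; a real autoencoder learns them by minimizing the reconstruction error over the training melodies.

```python
def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(m * v for m, v in zip(row, vector)) for row in matrix]

# Hand-picked weights (a trained model would learn these from data).
ENCODER = [[0.5, 0.0, 0.5, 0.0],   # latent[0] = average of notes 0 and 2
           [0.0, 0.5, 0.0, 0.5]]   # latent[1] = average of notes 1 and 3
DECODER = [[1.0, 0.0],
           [0.0, 1.0],
           [1.0, 0.0],
           [0.0, 1.0]]             # play the two latent notes twice

def encode(x):
    return matvec(ENCODER, x)      # 4 numbers -> 2 numbers

def decode(z):
    return matvec(DECODER, z)      # 2 numbers -> 4 numbers

def reconstruction_error(x):
    """Squared distance between the input and its reconstruction."""
    x_hat = decode(encode(x))
    return sum((a - b) ** 2 for a, b in zip(x, x_hat))

repeated = [60.0, 64.0, 60.0, 64.0]    # a two-note pattern played twice
irregular = [60.0, 64.0, 67.0, 72.0]   # no repetition to exploit

print(reconstruction_error(repeated))   # 0.0
print(reconstruction_error(irregular))  # 56.5
```

The repeated pattern survives the two-number bottleneck perfectly, while the irregular one does not. Training an autoencoder amounts to finding weights for which the melodies in the dataset look like the first case.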

MusicVAE uses a variational autoencoder, a subtype of autoencoder that ensures that the latent space (the space of representations produced by the encoder) has some good properties.

AI processing music notes. Source: MusicVAE blog

These good properties allow us to only use the decoder part, which takes one random point in the latent space and generates an output that makes sense given the inputs used to train the autoencoder.

The properties are:

Expression: Any real example can be mapped to some point in the latent space and reconstructed from it.
Realism: Any point in this space represents some realistic example, including ones not in the training set.
Smoothness: Examples from nearby points in latent space have similar qualities to one another.
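
The pieces that usually give a variational autoencoder these properties can be sketched in a few lines. The encoder outputs a mean and a log-variance per latent dimension, a latent point is sampled from that distribution, and a KL penalty keeps the distribution close to a standard normal. The function names are our own, and the encoder and decoder networks themselves are omitted.

```python
import math
import random

def reparameterize(mean, log_var, rng):
    """Sample z = mean + sigma * eps, with eps drawn from N(0, 1)."""
    return [m + math.exp(0.5 * lv) * rng.gauss(0.0, 1.0)
            for m, lv in zip(mean, log_var)]

def kl_to_standard_normal(mean, log_var):
    """KL divergence between N(mean, sigma^2) and N(0, 1), summed over dims."""
    return 0.5 * sum(math.exp(lv) + m * m - 1.0 - lv
                     for m, lv in zip(mean, log_var))

def interpolate(z_start, z_end, steps):
    """Evenly spaced latent points on the line between two encodings."""
    return [[a + (b - a) * t / (steps - 1) for a, b in zip(z_start, z_end)]
            for t in range(steps)]

rng = random.Random(0)
z = reparameterize(mean=[0.3, -1.2], log_var=[0.0, -2.0], rng=rng)
penalty = kl_to_standard_normal([0.3, -1.2], [0.0, -2.0])
path = interpolate([0.0, 0.0], [1.0, 2.0], steps=3)
# path == [[0.0, 0.0], [0.5, 1.0], [1.0, 2.0]]
```

Decoding each point on such a path is how one melody can be morphed gradually into another, which relies on the smoothness property. Sampling z from a standard normal and decoding it is how new melodies are generated from scratch, which relies on realism.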

Model training

To train our own version of MusicVAE, we followed the official tutorial in Magenta’s GitHub repository. We used a Linux EC2 instance and set up a conda environment with the magenta library installed.

The library provides a set of network configurations to choose from, depending on the type of input and output we want from the model. There are configurations for single 2-bar or 16-bar melodies, for trios (melody, bass, and drums), and more.

We trained a model from scratch on the jazz melodies and got the following results:

Jazz Melody 1

Jazz Melody 2

Jazz Melody 3

Jukebox

Jukebox is one of the most recently developed tools for training models capable of generating music. The main difference from the other models we looked at is that it generates music as raw audio instead of MIDI.

It also relies on autoencoders: a set of vector-quantized variational autoencoders (VQ-VAEs) compresses the raw audio into discrete codes, and the model is trained to predict sequences of those codes. Both training and sampling are slow: it takes up to 9 hours to generate one minute of music with lyrics and instruments.

Conclusion

As we have seen, there are plenty of methods to generate music from scratch and/or to continue existing pieces using AI technologies. Most of these methods are used experimentally and not in production, as they require a lot of compute and time to generate samples of music.

However, there are some methods that can be trained in a reasonable time. For instance, in the case of MelodyRNN, training is complete within a few hours. With MelodyRNN, we can generate simple melodies, like the ones we've shared here, that achieve good results with some coherence and harmony.

With MusicVAE, training was more complex given the size of the network, but for short pieces of music (2-bar melodies) we saw good results after training for 20,000 iterations, which took more than a day on an AWS GPU instance.

We do expect to see considerable developments in this area, and if you are interested in learning more about generating music with AI, we recommend checking out the readings listed below. Also, feel free to let us know your thoughts in the comments section.
