

🎓 113/167
This post is a part of the Audio analysis educational series from my free course. Please keep in mind that the correct sequence of posts is outlined on the course page, while their order in Research can be arbitrary.
I'm also happy to announce that I've started working on standalone paid courses, so you can support my work and get affordable educational material. These courses will be of a completely different quality, with more theoretical depth and a niche focus, and will feature challenging projects, quizzes, exercises, video lectures, and supplementary materials. Stay tuned!
The phenomenon of algorithmically generated music — that is, music composed or produced in part (or entirely) by computer algorithms — has captured the imagination of engineers, composers, and researchers for many decades. While composers in the late 1950s and 1960s already toyed with computer-assisted generative techniques, these early methods typically relied on rule-based or stochastic processes rather than modern machine learning systems. Over the past 30 years, the field of music generation has rapidly evolved, not just in terms of the complexity of the models involved, but also in the ambition of the goals pursued: from simple monophonic melodies to sophisticated polyphonic orchestrations and full-fledged audio tracks that strive for creativity and authenticity.
Today, if you scroll through social media or software development forums, you might find demos of generative systems that can produce anything from short musical motifs to multi-minute orchestral pieces. These generative systems often employ advanced neural network architectures, robust training pipelines, large-scale musical datasets, and specialized approaches to ensure musical coherence. Researchers and enthusiasts alike continue exploring ways to capture the elusive nature of creativity and aesthetic preference in purely algorithmic forms. This interest has led to a rich tapestry of innovations, from pioneering neural network-based systems in the 1990s to the deep architectures of the 2020s, culminating in models that can generate realistic audio waveforms for entire tracks.
In this article, I will dive deeply into the topic of music generation, focusing on the major historical milestones, the evolution from early algorithmic composition to modern deep learning paradigms, and the practical applications of these techniques in both symbolic and audio-based contexts. By the end, you will have a sense of how the field has progressed from simple random note generation toward cutting-edge architectures capable of producing content that can (at times) challenge what a human composer might create. I'll also discuss key research groups, labs, and authors, as well as the core datasets used to train and benchmark these models. Ultimately, the driving motivation for this overview is not only to highlight the incredible technical strides made, but also to provide perspectives on how these methods are shaping the future of music creation, distribution, and consumption.
Algorithmic composition
Algorithmic composition refers to the process of using formalizable procedures —mathematical, heuristic, or otherwise— to generate musical ideas. This domain was initially shaped by composers fascinated with chance operations (like John Cage) or serial procedures (like Pierre Boulez), but the method eventually found fertile ground in computer science, resulting in the synergy we now observe between systematic composition and computational algorithms. Before the advent of advanced machine learning, algorithmic composition was often driven by rules, heuristics, or random processes that the composer embedded within software. Over time, these approaches evolved to incorporate data-driven techniques, culminating in our modern context where deep neural networks stand at the forefront.
Early works and milestone projects
Early attempts to harness the power of computers for composition date back to the mid-20th century. One of the earliest documented pieces is "Illiac Suite" (1957) by Lejaren Hiller and Leonard Isaacson, which used statistical procedures on a mainframe computer to generate string quartet music. While the technique was rudimentary, it demonstrated that computers could be used for tasks previously thought to be in the realm of human-only creativity.
By the 1970s and 1980s, multiple universities had established small sub-communities dedicated to computer music, exploring not only compositional algorithms but also sound synthesis and digital signal processing. These laid the groundwork for a future in which generative music could be explored from many angles —from formal grammar-based approaches to more data-intensive, machine-learning-based systems.
1990s and earlier: HarmoNet (1992), Mozer's neural network approach
In the 1990s, we saw a confluence of ideas in artificial neural networks and algorithmic composition. One of the noteworthy developments was HarmoNet (1992). HarmoNet was an early neural system focusing on generating chordal accompaniments to existing melodies. It used a neural approach to harmonize Bach chorales in a style reminiscent of traditional four-part writing. The performance was promising for its time, though hampered by the relatively limited computational resources of that era.
Also in the early 1990s, researchers like Michael Mozer sought to apply recurrent neural networks to melodic generation. Mozer's approach used a simple recurrent network to generate single lines of music by predicting the next note given a sequence of previous notes. While the architecture was limited compared to the LSTM- and transformer-based approaches of today, it was a key stepping stone in demonstrating that data-driven neural models could capture musically relevant patterns from sequences.
Books and foundational texts (Cope, Todd, Nierhaus, etc.)
The field of algorithmic composition has been enriched by authors who meticulously documented or theorized about computational approaches to music. Some classic texts include:
- David Cope's works, such as "Experiments in Musical Intelligence" (1996), in which he describes his "Recombinant Music" approach. Cope's systems compiled a database of musical fragments by a given composer and recombined them to produce new pieces in that composer's style.
- Peter Todd and Gareth Loy's edited volume "Music and Connectionism" (1991), which compiled a range of studies on neural networks and music.
- Gerhard Nierhaus's book "Algorithmic Composition: Paradigms of Automated Music Generation" (2009), which provides a thorough survey of historical and contemporary algorithmic methods.
These texts remain valuable references for understanding the foundations of computational music generation, offering a window into both the philosophical motivations and the technical details behind early attempts in the field.
Influential techniques in algorithmic composition
Algorithmic composition broadly includes many techniques:
- Rule-based systems: Relying on formal grammar rules or explicitly coded heuristics for chord progressions, melody shaping, voice leading, etc.
- Stochastic approaches: Generating music by sampling from statistical distributions, Markov chains, or Monte Carlo processes. These methods sometimes prove useful for style imitation (a minimal Markov-chain sketch appears just after this list).
- Genetic algorithms: Evolving musical ideas by repeatedly mutating and selecting populations of melodic or harmonic sequences based on defined fitness functions.
- Constraint satisfaction: Encapsulating composition as a problem in which you define constraints (e.g., no parallel fifths, certain chord progressions) and use search algorithms to find solutions.
Each technique contributed insights for future research, shaping how music is represented (e.g., symbolic formats) and manipulated.
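To make the stochastic approach above concrete, here is a minimal sketch of a first-order Markov chain over MIDI pitches. The transition table and seed pitch are made up for demonstration; a real system would estimate the transition probabilities by counting note-to-note transitions in a corpus.

import random

# Hypothetical first-order transition table over a few MIDI pitches (C major fragment).
# In practice these probabilities would be estimated from a training corpus.
transitions = {
    60: {62: 0.5, 64: 0.3, 67: 0.2},   # from C4
    62: {60: 0.3, 64: 0.5, 65: 0.2},   # from D4
    64: {62: 0.4, 65: 0.4, 67: 0.2},   # from E4
    65: {64: 0.6, 67: 0.4},            # from F4
    67: {60: 0.5, 65: 0.5},            # from G4
}

def sample_melody(start_pitch=60, length=16, seed=None):
    """Walk the Markov chain, sampling each next pitch from the row of the current one."""
    rng = random.Random(seed)
    melody = [start_pitch]
    for _ in range(length - 1):
        row = transitions[melody[-1]]
        pitches, probs = zip(*row.items())
        melody.append(rng.choices(pitches, weights=probs, k=1)[0])
    return melody

print(sample_melody(seed=42))

Even this tiny model captures local stylistic tendencies, which is why Markov chains were a popular early tool for style imitation despite their lack of long-range structure.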
Transition to machine learning-driven methods
While rule-based and heuristic approaches are still relevant in certain contexts (especially for specialized or extremely controlled compositional tasks), the modern era of music generation is dominated by machine learning (ML). ML-driven methods rely on large musical corpora and powerful statistical models (or neural networks) to learn generative rules implicitly, rather than relying solely on explicit coding of domain expertise. This paradigm shift was spurred by advances in computing hardware (e.g., GPUs), the availability of large datasets (like digitized MIDI libraries), and methodological breakthroughs in deep learning. Taken together, these factors have enabled the exploration of architectures far more capable of capturing the nuanced dependencies and long-range structures inherent in music.
Neural network architectures for music generation
Neural networks have proven instrumental in learning the relationship between consecutive tokens (notes, chords, or frames of audio), capturing local and global structures in ways that earlier algorithmic methods could not easily match. While the earliest attempts (like Mozer's RNN) demonstrated the viability of neural approaches, it was the subsequent emergence of more advanced architectures that allowed the field to truly blossom.
Common deep learning models used in music generation
Different architectures have emerged to address different challenges:
- LSTM (Long Short-Term Memory) networks: A specialized variant of RNN that tackles the vanishing/exploding gradient problem by introducing a gating mechanism. LSTMs, proposed by Hochreiter and Schmidhuber (1997), excel in capturing longer-range dependencies and therefore became standard in many generative music projects.
- GRU (Gated Recurrent Unit) networks: Similar to LSTMs but with a slightly simpler gating structure, GRUs have also been widely adopted in symbolic music generation tasks.
- CNN (Convolutional Neural Network) architectures: While often associated with image processing, CNNs have found applications in music generation, especially when representing sequences in a 2D context (e.g., piano-roll representations) or capturing local features in waveforms (for audio-based tasks).
- VAE (Variational Autoencoder): A generative framework that learns latent representations of data and can decode from latent points to produce new samples. VAEs can be used for creative transformations and style blending in music.
- GAN (Generative Adversarial Network): Introduced by Goodfellow et al. (2014), GANs have found success in generating a variety of data modalities. For music, this typically translates to generating waveforms or symbolic sequences with a generator-discriminator setup. Variants like MuseGAN extend standard GANs to multi-track music.
- Transformer models: Based on the self-attention mechanism, Transformers excel at capturing long-range relationships in sequential data. They have become the state-of-the-art in many generative tasks (text and music) because they can better parallelize the computation across time steps. The most notable additions to this category for music generation are the Music Transformer variants.
- Diffusion models: More recently, diffusion-based approaches have been explored for audio generation (e.g., AudioLDM). These models gradually transform noise into coherent audio signals, matching or exceeding the fidelity of earlier autoregressive systems in some domains.
Symbolic vs. audio-based music generation: a high-level distinction
A crucial distinction is whether the model generates music at the symbolic level (e.g., notes, chords, MIDI events, or some other discrete representation) or directly at the audio level (samples in the time domain, or spectrogram frames). Symbolic approaches offer simpler data structures and smaller dimensionalities; they typically produce MIDI files or other notations that can be played or further arranged using standard music software. Audio-based systems, on the other hand, attempt to synthesize waveforms directly —a more difficult task due to the high temporal resolution required. Modern neural models, including various large-scale transformer or diffusion-based systems, have begun to achieve impressive audio fidelity, though they often require more data and more computational resources than their symbolic counterparts.
Evolution of architectures for handling long sequences
One of the defining challenges in music generation is the presence of long-range dependencies. A piece of music may span thousands of time steps (whether symbolic tokens or audio samples), and relevant thematic transformations might occur over extended durations. Traditional RNN-based models, while an improvement over simpler methods, still face difficulties scaling to very long sequences. Transformers introduced the concept of attention, allowing a model to directly relate any part of a sequence to any other part, enabling easier modeling of extended context. However, vanilla transformers have a memory complexity that grows quadratically with sequence length, leading to ongoing research in more efficient architectures (e.g., Performer, Linformer, Reformer) that reduce the computational overhead. These ideas have been adopted in certain advanced generative music systems aiming to handle full-length compositions in one forward pass.
Deep learning for symbolic music generation
Symbolic music generation typically focuses on generating notes or symbolic events (like velocity, instrumentation, articulation). This approach has the advantage of interpretability: the output can be visualized in a score or manipulated in a digital audio workstation (DAW). Symbolic generation can also be more data-efficient, because note-level data is much more compact than raw audio waveforms.
Key reference works and chronological advancements
Since around 2015, the field has seen an explosion of interest, in part due to the success of deep learning in language and image processing. Many language-modeling techniques can be directly adapted to music modeling by treating sequences of notes or music tokens similarly to sequences of words or subword tokens. Let's look at some representative systems:
- MuseGAN (2017): Explored a GAN-based approach for generating multi-track symbolic music, effectively capturing simultaneous parts like drums, bass, chords, and melody.
- Music Transformer (Huang et al., 2018): Leveraged self-attention to capture long-term structure in polyphonic piano music. It introduced relative positional embeddings to better handle local invariances in music sequences.
- Theme Transformer (2021): Focused on generating pieces that revolve around given thematic material. This approach helps maintain cohesion over extended sequences.
- RL-Chord (2021): Demonstrated that reinforcement learning can refine chord progressions or melodic lines by optimizing reward signals representing musical coherence or stylistic constraints.
Examples from 2015 to 2024 (MuseTransformer, Bar Transformer, FIGARO, etc.)
In addition to the systems mentioned, there have been numerous smaller or specialized projects. Some projects specifically aim to generate jazz solos, others produce string quartets in the style of classical composers, and still others attempt cross-genre fusion. A few notable examples:
- MuseTransformer: A variation combining the concept of multi-track music generation (as in MuseGAN) with attention-based architecture, focusing on creating coherent multi-instrument music segments.
- Bar Transformer: A specialized approach that processes sequences in bar-level chunks (common in Western music) and uses local attention to ensure each bar is musically consistent, while employing hierarchical structures to keep global coherence.
- FIGARO: A system that aims to incorporate functional harmony and chord transitions more explicitly, sometimes by integrating a rule-based chord grammar with a neural generative model.
These systems have increasingly sought to address constraints like chord progression fidelity, voice-leading rules, or specific melodic motifs.
Highlights of transformer-based approaches (e.g., Music Transformer, Theme Transformer)
The rise of transformer-based approaches has undeniably been a game-changer. The key highlight is the ability to more naturally capture relationships between distant events in a sequence. In a piece of music, a theme introduced in measure 4 may reappear in measure 20, varied in measure 36, and concluded in measure 48. Transformers let the model attend to earlier parts of the sequence without compressing states through time as RNNs do.
Another important piece is the use of relative positional embeddings, which often helps the network generalize across transpositions, rhythmic shifts, and other transformations that do not depend on absolute positions in the sequence. By focusing on pairwise distances rather than absolute sequence positions, the model can more readily learn that a chord change from G major to D major is contextually similar to going from C major to G major.
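To make the idea of relative positions concrete, below is a minimal sketch of single-head self-attention with a learned scalar bias per relative distance. This is simpler than the full relative-attention formulation and skewing trick used in Music Transformer, but it conveys the same intuition: attention scores depend on how far apart two tokens are, not where they sit absolutely. All names and dimensions are illustrative.

import torch
import torch.nn as nn
import torch.nn.functional as F

class RelativeSelfAttention(nn.Module):
    """Single-head self-attention with a learned bias per relative offset."""
    def __init__(self, embed_dim, max_len):
        super().__init__()
        self.qkv = nn.Linear(embed_dim, 3 * embed_dim)
        # One learnable scalar bias for each relative offset in [-(max_len-1), max_len-1].
        self.rel_bias = nn.Parameter(torch.zeros(2 * max_len - 1))
        self.max_len = max_len
        self.scale = embed_dim ** -0.5

    def forward(self, x):                                   # x: (batch, seq_len, embed_dim)
        batch, seq_len, _ = x.shape
        q, k, v = self.qkv(x).chunk(3, dim=-1)
        scores = torch.einsum('bqd,bkd->bqk', q, k) * self.scale
        # Look up the bias for the relative distance (query_pos - key_pos) of every pair.
        pos = torch.arange(seq_len, device=x.device)
        rel = pos[:, None] - pos[None, :] + self.max_len - 1  # shift into [0, 2*max_len-2]
        scores = scores + self.rel_bias[rel]                  # broadcast over the batch
        attn = F.softmax(scores, dim=-1)
        return torch.einsum('bqk,bkd->bqd', attn, v)

# Usage: attend over a batch of 4 sequences of 32 token embeddings of size 64.
layer = RelativeSelfAttention(embed_dim=64, max_len=128)
out = layer(torch.randn(4, 32, 64))
print(out.shape)   # torch.Size([4, 32, 64])

Because the bias is indexed by distance alone, a pattern learned between tokens 3 and 7 transfers automatically to tokens 103 and 107, which is exactly the kind of transposition- and shift-friendly behavior discussed above.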
Reinforcement learning enhancements (e.g., RL-Chord, fine-tuning rnn models)
Some teams have begun exploring the inclusion of a reward function to guide the generation process. For instance, one might design a reward that penalizes repeated notes beyond a certain threshold or encourages harmonic movement that remains within a certain chord progression. An RL approach can either be used from scratch —letting the network optimize for a given style or objective— or as a fine-tuning mechanism on top of a pretrained generative model. This approach offers a way to incorporate more explicit control or structure into generative systems.
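As a sketch of how such a reward signal might be wired up, the snippet below defines a toy reward that penalizes immediate note repetitions and shows the REINFORCE-style loss one would apply to the per-token log-probabilities of a sampled sequence. The numbers are made up for illustration; in a real setup the log-probabilities would come from sampling a pretrained generator (such as the LSTM shown later in this post), and systems like RL-Chord use considerably more elaborate rewards and training schemes.

import torch

def repetition_penalty_reward(tokens):
    """Toy reward: +1 per step, minus 1 for every immediate repetition of the previous token."""
    tokens = torch.as_tensor(tokens)
    repeats = (tokens[1:] == tokens[:-1]).float().sum()
    return float(len(tokens)) - float(repeats)

def reinforce_loss(log_probs, reward, baseline=0.0):
    """REINFORCE-style loss for one sampled sequence.

    log_probs holds log p(token_t | history) for each generated token (kept in the graph),
    reward is the scalar score of the whole sequence, baseline reduces gradient variance.
    """
    return -(reward - baseline) * log_probs.sum()

# Illustrative usage with made-up probabilities and a short made-up sequence:
log_probs = torch.log(torch.tensor([0.2, 0.5, 0.1, 0.4], requires_grad=True))
sequence = [60, 60, 62, 64]
loss = reinforce_loss(log_probs, repetition_penalty_reward(sequence))
loss.backward()
print(loss.item())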
GAN-based symbolic music (C-RNN-GAN, MidiNet, MuseGAN, etc.)
GAN-based systems for symbolic music often treat sequences or piano-roll representations as images (2D arrays with time on one axis and pitch on the other). The generator tries to create plausible "piano-roll images," while the discriminator attempts to distinguish real from generated examples. Projects like C-RNN-GAN incorporate recurrent layers within the generator and discriminator to maintain temporal coherence, whereas MuseGAN separated different instruments into different channels, akin to the multi-channel images in computer vision.
However, training GANs to generate music can be tricky —mode collapse, training instability, and the complexity of capturing global structure remain challenges. Nonetheless, these systems have provided valuable insights on how adversarial training might yield more "creative" or "surprising" musical outputs than purely likelihood-based approaches.
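A hedged sketch of the piano-roll idea mentioned above: the helper below rasterizes a list of (pitch, start, end) note events into a binary time-pitch matrix, the kind of 2D array that GAN generators and discriminators treat like an image. The time resolution and note format are illustrative choices, not a standard.

import numpy as np

def notes_to_pianoroll(notes, steps_per_beat=4, num_pitches=128):
    """Rasterize (pitch, start_beat, end_beat) events into a binary (time, pitch) matrix."""
    total_steps = int(max(end for _, _, end in notes) * steps_per_beat)
    roll = np.zeros((total_steps, num_pitches), dtype=np.float32)
    for pitch, start, end in notes:
        roll[int(start * steps_per_beat):int(end * steps_per_beat), pitch] = 1.0
    return roll

# A tiny C major arpeggio: (MIDI pitch, start beat, end beat)
notes = [(60, 0, 1), (64, 1, 2), (67, 2, 3), (72, 3, 4)]
roll = notes_to_pianoroll(notes)
print(roll.shape)      # (16, 128) with steps_per_beat=4
print(roll[:4, 60])    # the C4 row is active during the first beat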
Deep learning for audio-based music generation
Generating raw audio (or high-level time-frequency representations) is more demanding computationally and algorithmically than symbolic generation. Music audio has extremely high temporal resolution: a typical sampling rate might be 44,100 Hz, meaning each second of audio requires tens of thousands of data points. Additionally, music is not just random sound but a carefully structured artifact with multi-level periodicities, timbral textures, harmonic progressions, and more. It's no surprise that early attempts at audio-based generation were fairly limited in scope, producing only short snippets of noisy or barely coherent output.
Autoregressive approaches and their challenges
One line of research used autoregressive strategies that predict the next audio sample given all previous samples. A canonical example here is WaveNet (2016) by DeepMind, originally developed for speech generation but later explored for music. WaveNet set new standards in audio fidelity for text-to-speech, and also showed some capability for music generation. However, such sample-level autoregressive models face major drawbacks:
- Computational load: Generating each sample is a sequential process, which is extremely slow for extended durations.
- Long-term structure: While local continuity can be captured effectively, maintaining large-scale musical coherence is difficult without additional conditioning signals or memory mechanisms.
Despite these challenges, autoregressive audio models paved the way for more advanced techniques.
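To illustrate the sample-level autoregressive idea without reproducing WaveNet itself, here is a minimal stack of dilated, causal 1-D convolutions that outputs a distribution over the next quantized sample at every position. It omits WaveNet's gated activations, residual/skip connections, and conditioning; treat it purely as a sketch of the causal-convolution principle.

import torch
import torch.nn as nn
import torch.nn.functional as F

class TinyCausalConvNet(nn.Module):
    """A toy WaveNet-flavoured stack of dilated causal convolutions over quantized audio."""
    def __init__(self, num_levels=256, channels=64, num_layers=6):
        super().__init__()
        self.embed = nn.Embedding(num_levels, channels)   # e.g. 8-bit mu-law sample levels
        self.convs = nn.ModuleList([
            nn.Conv1d(channels, channels, kernel_size=2, dilation=2 ** i)
            for i in range(num_layers)
        ])
        self.out = nn.Conv1d(channels, num_levels, kernel_size=1)

    def forward(self, x):                         # x: (batch, time) of integer sample levels
        h = self.embed(x).transpose(1, 2)         # -> (batch, channels, time)
        for i, conv in enumerate(self.convs):
            # Pad on the left only, so each output depends solely on past samples (causality).
            h = F.relu(conv(F.pad(h, (2 ** i, 0))))
        return self.out(h)                        # (batch, num_levels, time) logits per step

model = TinyCausalConvNet()
logits = model(torch.randint(0, 256, (2, 1024)))  # two clips of 1024 quantized samples
print(logits.shape)                               # torch.Size([2, 256, 1024])

Generation with such a model proceeds one sample at a time, which is precisely why the computational-load problem listed above becomes severe at 44,100 samples per second.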
Diffusion-based generation (AudioLDM, Noise2Music, etc.)
More recent work leverages diffusion models. The idea is to start from noise and iteratively denoise the signal step-by-step, guided by a learned score function or noise prediction model. Projects like AudioLDM (Liu et al., 2023) and Noise2Music use diffusion or latent diffusion processes to produce waveforms or spectrogram representations. One of the key advantages is that these approaches have displayed an ability to generate high-quality samples without the same level of mode collapse that sometimes plagues GAN-based models. Also, some diffusion-based frameworks can be conditioned on text or symbolic prompts, bridging the gap between raw audio generation and higher-level control.
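The core training step of a denoising diffusion model is compact enough to sketch: pick a random timestep, corrupt the clean signal with the corresponding amount of Gaussian noise, and train a network to predict that noise. The TinyNoisePredictor below is a stand-in (real audio diffusion systems use far larger U-Net or transformer backbones over spectrograms or latents), and the linear noise schedule is a simplification.

import torch
import torch.nn as nn

T = 1000                                         # number of diffusion steps
betas = torch.linspace(1e-4, 0.02, T)            # simple linear noise schedule
alphas_cumprod = torch.cumprod(1.0 - betas, dim=0)

class TinyNoisePredictor(nn.Module):
    """Stand-in noise-prediction network: maps (noisy signal, timestep) -> predicted noise."""
    def __init__(self, length):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(length + 1, 256), nn.ReLU(), nn.Linear(256, length))
    def forward(self, x_t, t):
        t_feat = (t.float() / T).unsqueeze(-1)               # crude timestep conditioning
        return self.net(torch.cat([x_t, t_feat], dim=-1))

def diffusion_training_step(model, x0):
    """One DDPM-style step: add noise at a random timestep, regress the injected noise."""
    t = torch.randint(0, T, (x0.shape[0],))
    a_bar = alphas_cumprod[t].unsqueeze(-1)                  # (batch, 1)
    noise = torch.randn_like(x0)
    x_t = a_bar.sqrt() * x0 + (1.0 - a_bar).sqrt() * noise   # forward (noising) process
    return nn.functional.mse_loss(model(x_t, t), noise)

model = TinyNoisePredictor(length=128)
loss = diffusion_training_step(model, torch.randn(8, 128))   # batch of toy 128-sample clips
loss.backward()
print(loss.item())

At generation time, the learned noise predictor is applied repeatedly, starting from pure noise and removing a little noise per step, which is the "iterative denoising" described above.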
Examples of notable audio-based systems
Jukebox and MuseNet (OpenAI)
OpenAI's Jukebox (2020) is a large-scale VQ-VAE (Vector Quantized Variational Autoencoder) combined with autoregressive priors that generate raw audio in a compressed token space. Jukebox can replicate a variety of musical styles, produce singing voices (with rudimentary lyric intelligibility), and blend genres in novel ways. However, it comes with large computational demands, and the audio quality, while novel and occasionally impressive, may feature artifacts.
MuseNet (2019), also from OpenAI, is more akin to a transformer-based approach that handles MIDI or a low-level symbolic representation rather than raw audio. It can condition on different instruments and styles, generating complex multi-instrument arrangements. While not purely audio-based, it's an important stepping stone and is commonly cited alongside Jukebox in discussions about AI-generated music from OpenAI.
AudioLM (Google), MusicLM
AudioLM (2022) from Google introduced a framework for generating coherent and high-fidelity audio by using multiple transformer-based stages. At a high level, AudioLM encodes audio into discrete token sequences using learned quantizers, then models the distribution of these tokens. It can generate fairly realistic continuations of piano music or speech signals without explicit textual or symbolic conditioning.
MusicLM (2023) is a more advanced system that can handle textual conditioning (e.g., 'a calming violin piece with a gentle piano accompaniment') and produce multi-second or minute-long coherent musical passages. It uses a hierarchical sequence-to-sequence modeling approach and can capture nuances in texture, timbre, and style.
MusicGen (Meta), Stable Audio (Stability AI)
MusicGen (2023) from Meta focuses on text-to-music generation, leveraging large-scale training on licensed musical data. It aims for controllability by letting users specify textual prompts, and it can generate multi-track outputs with varying styles (rock, hip-hop, classical, etc.).
Stable Audio, released by Stability AI, applies techniques reminiscent of diffusion-based image generation (like Stable Diffusion) adapted for the audio domain. It offers a text-conditioned pipeline for generating short audio clips. While still an emerging area, it underscores the growing interest in diffusion-based audio approaches at a consumer-accessible scale.
Datasets and benchmarking
Datasets form the foundation of data-driven approaches to music generation. The choice of dataset often dictates the genre, style, or representation that a model can effectively learn. Below are some key datasets:
Symbolic music datasets (JSB Chorales, Lakh MIDI, Maestro, etc.)
- JSB Chorales: A set of over 350 Bach chorales in symbolic format. Widely used for evaluating harmonic modeling capabilities.
- Lakh MIDI Dataset: More than 170,000 MIDI files collected from the web, spanning many genres and styles. Created by Colin Raffel and designed to be matched against the Million Song Dataset.
- MAESTRO: Curated by Google Magenta, it contains roughly 200 hours of virtuosic piano performances with precisely aligned MIDI. Useful for training high-fidelity piano performance models.
Audio music datasets (Audio libraries, licensed databases)
For raw audio, large-scale datasets that are properly licensed are trickier to assemble due to copyright restrictions. Some notable ones:
- OpenMIC-2018: A dataset of 20,000 ten-second audio excerpts annotated for the presence of 20 instrument classes, primarily used for instrument recognition.
- FMA (Free Music Archive): A corpus of audio released under Creative Commons licenses, useful for research, although the quality and style of its tracks vary widely.
- Internal/Proprietary sets: Many advanced systems are trained on massive internal datasets that are not publicly available, collected and licensed by large tech companies (e.g., Google, Meta, etc.).
Evaluation metrics and challenges in benchmarking
Evaluating generated music remains difficult. Unlike tasks such as classification (where accuracy is a clear metric), music quality is partly subjective. Some metrics in the literature include:
- Statistical similarity: How closely does the distribution of generated notes or chords match that of the training set? (A small pitch-class-histogram sketch appears at the end of this subsection.)
- Chord/harmony analysis: Evaluating the presence or frequency of dissonant intervals, chord transitions, and voice leading rules.
- User studies: Conducting listening tests in which participants rate pieces for musicality, coherence, or creativity.
- Predictive likelihood: For models that estimate probabilities for the next note or time step, cross-entropy or negative log-likelihood can be measured on a hold-out set.
A major challenge is that a piece of music can be "perfectly valid" in one style but less valid in another, so standardizing evaluation across genres is non-trivial. Nonetheless, the field continues exploring more robust, domain-specific approaches —for example, using musicological rule-checkers or harmonic analyzers to assess generated outputs.
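As one concrete, deliberately simple instance of the statistical-similarity idea, the snippet below compares the pitch-class histograms of a generated piece and a reference fragment using total variation distance. The note lists are made up for demonstration; real evaluations would use full corpora and usually several complementary metrics.

import numpy as np

def pitch_class_histogram(midi_pitches):
    """Normalized count of the 12 pitch classes (C, C#, ..., B) in a list of MIDI pitches."""
    hist = np.bincount(np.asarray(midi_pitches) % 12, minlength=12).astype(float)
    return hist / hist.sum()

def total_variation(p, q):
    """Total variation distance between two discrete distributions (0 = identical, 1 = disjoint)."""
    return 0.5 * np.abs(p - q).sum()

# Made-up example data: a reference corpus fragment vs. a generated melody.
reference = [60, 62, 64, 65, 67, 69, 71, 72, 64, 67, 60, 65]   # diatonic C major material
generated = [60, 61, 63, 66, 68, 70, 60, 61, 63, 66, 68, 70]   # more chromatic output

print(total_variation(pitch_class_histogram(reference), pitch_class_histogram(generated)))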
Applications and tools
Music generation systems have begun to appear in practical settings, not just in academic demos. Tools exist for hobbyists, professional composers, and educators, reflecting the widespread interest in bridging technology and creativity.
Commercial and open-source software for ai-driven music
- AIVA: An AI music composition platform that has been used for soundtracks in commercials and video games. It offers user control over the desired mood or style and can export results as audio or MIDI.
- Amper Music: Provides generative music for media content creation. Users can select genre, mood, and instrumentation, with the system producing a short piece to match.
- Project Music GenAI Control (Adobe): Although details are still evolving, there's interest in integrating generative music into broader creative suites, similar to how generative image tools are integrated into design workflows.
Creative and educational applications
In education, generative music systems can be used to:
- Demonstrate harmonic or melodic principles in an interactive manner.
- Offer a collaborative composition environment, where students co-compose pieces with an AI, exploring new styles or chord progressions they might not think of on their own.
- Provide real-time accompaniments for practicing musicians.
They also spur creativity, allowing composers to quickly prototype ideas, or discover novel chord progressions they might not have considered otherwise. These systems often spark debates about the role of the human composer in the creative loop, which itself is an important educational conversation.
Integration into music production workflows
Producers and composers may use generative tools in several ways:
- Idea generation: For brainstorming quick melodic or harmonic sketches.
- Layer or track creation: Generating backing tracks or transitional sections that are later refined by a human musician.
- Sound design: Using neural network transformations to shape or morph audio, akin to advanced effects or creative sampling.
While fully automated composition is still a novelty in many commercial sectors, partial integration is increasingly common, especially as these tools become more reliable and user-friendly.
Research groups, labs, and authors
The study of AI-driven music generation spans multiple academic disciplines and industry labs. Some standouts include:
Active research labs
- Google Magenta: A research project within Google that explores machine learning as a tool in the creative process. Magenta has released multiple open-source tools for music and art generation.
- Metacreation Lab (Simon Fraser University): Focuses on generative art and music, investigating how AI can augment creative activity.
- Audiolabs Erlangen: Conducts research on audio signal processing and music information retrieval, often overlapping with generative modeling.
- OpenAI: Responsible for systems like MuseNet and Jukebox. Although they span multiple AI domains, their music generation works are among the most recognized.
Influential researchers (Douglas Eck, François Pachet, Jürgen Schmidhuber, etc.)
- Douglas Eck (Google): A driving force behind Magenta. Eck's work has significantly contributed to the popularity of neural network-based music generation.
- François Pachet: Known for his work in music style modeling and lead sheet generation. Formerly led the music research team at Sony Computer Science Laboratories (CSL) in Paris and later directed Spotify's Creator Technology Research Lab.
- Jürgen Schmidhuber: Co-inventor of LSTM, which has underpinned countless sequence-to-sequence tasks, including music. While his lab's focus is broader deep learning research, the LSTM invention indirectly propelled generative music forward.
Notable conferences and journals (ISMIR, ICASSP, AAAI, etc.)
- ISMIR (International Society for Music Information Retrieval): A prime venue where novel music AI research, including generation, is presented.
- ICASSP (IEEE International Conference on Acoustics, Speech and Signal Processing): Hosts tracks on audio processing and generation, including some aspects of music generation.
- AAAI (Association for the Advancement of Artificial Intelligence) and NeurIPS (Neural Information Processing Systems): Although broader AI conferences, they often feature cutting-edge papers on generative models relevant to music.
- JMLR (Journal of Machine Learning Research): A well-known journal for machine learning methodological advances, occasionally featuring music generation innovations if they significantly push the boundaries of generative modeling.
Below, I want to provide a bit more technical detail in code and math for those interested in how one might implement or experiment with certain neural approaches. These examples are necessarily simplified but can serve as starting points for deeper exploration.
A simplified LSTM-based symbolic music generator in Python
Imagine you have a collection of MIDI files. You preprocess them into sequences of tokens where each token represents a note-on, note-off, or time-shift event. Then you train a small LSTM to predict the next token. A skeleton code might look like this:
import torch
import torch.nn as nn
import torch.optim as optim

# Let's say we have a dataset that provides sequences of integers (tokens).
# We'll define a simple LSTM-based model for demonstration.
class MusicLSTM(nn.Module):
    def __init__(self, vocab_size, embed_dim, hidden_dim, num_layers=2):
        super(MusicLSTM, self).__init__()
        self.embedding = nn.Embedding(vocab_size, embed_dim)
        self.lstm = nn.LSTM(embed_dim, hidden_dim, num_layers=num_layers, batch_first=True)
        self.fc = nn.Linear(hidden_dim, vocab_size)

    def forward(self, x, hidden=None):
        x = self.embedding(x)
        out, hidden = self.lstm(x, hidden)
        out = self.fc(out)
        return out, hidden

# Suppose we have a function get_next_batch() that returns (input_batch, target_batch).
# Each is shaped [batch_size, sequence_length], containing token indices.
vocab_size = 128  # for demonstration, let's assume we have 128 unique tokens
model = MusicLSTM(vocab_size, embed_dim=64, hidden_dim=128)
criterion = nn.CrossEntropyLoss()
optimizer = optim.Adam(model.parameters(), lr=0.001)

num_epochs = 10
for epoch in range(num_epochs):
    model.train()
    hidden_state = None
    input_batch, target_batch = get_next_batch()  # user-defined function
    optimizer.zero_grad()
    logits, hidden_state = model(input_batch, hidden_state)
    # Reshape logits to [batch_size * sequence_length, vocab_size]
    # and target_batch to [batch_size * sequence_length]
    loss = criterion(logits.view(-1, vocab_size), target_batch.view(-1))
    loss.backward()
    optimizer.step()
    print(f'Epoch {epoch+1} - Loss: {loss.item()}')
This simple example omits data handling details and advanced features like teacher forcing or scheduled sampling, but it highlights the fundamental loop of training an LSTM to predict the next token in a symbolic representation of music.
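Once such a model is trained, generating music is a matter of sampling tokens one at a time. Below is a hedged sketch of temperature-controlled sampling from the MusicLSTM defined above; the seed tokens and the decoding of tokens back into MIDI events are left out, since they depend on the tokenization you chose.

import torch
import torch.nn.functional as F

def generate(model, seed_tokens, num_steps=200, temperature=1.0):
    """Autoregressively extend a seed sequence by sampling from the model's softmax output."""
    model.eval()
    tokens = list(seed_tokens)
    hidden = None
    inp = torch.tensor([tokens], dtype=torch.long)          # shape (1, seed_len)
    with torch.no_grad():
        for _ in range(num_steps):
            logits, hidden = model(inp, hidden)
            # Distribution over the next token; temperature < 1 sharpens it,
            # temperature > 1 flattens it and yields more surprising continuations.
            probs = F.softmax(logits[0, -1] / temperature, dim=-1)
            next_token = torch.multinomial(probs, num_samples=1).item()
            tokens.append(next_token)
            inp = torch.tensor([[next_token]], dtype=torch.long)  # feed only the new token
    return tokens

# e.g. generated = generate(model, seed_tokens=[60, 62, 64], temperature=0.9)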
Example of cross-entropy loss in math
When training a model to predict the next token in a sequence, a common objective is cross-entropy loss. If $p_\theta(x_t \mid x_{<t})$ is the model's predicted probability of token $x_t$ at time $t$ given the previous tokens $x_{<t}$ and parameters $\theta$, and $x_t$ is the true token, then the loss for one time step can be written as:

$$\mathcal{L}_t = -\log p_\theta(x_t \mid x_{<t})$$

The total loss is typically the sum (or mean) of all time-step losses across the entire sequence:

$$\mathcal{L} = -\sum_{t=1}^{T} \log p_\theta(x_t \mid x_{<t})$$

In this context:
- $x_t$ is the ground truth token at time $t$.
- $p_\theta(x_t \mid x_{<t})$ is modeled by the neural network (e.g., LSTM) outputs, typically fed through a softmax layer.
- Minimizing cross-entropy aligns the model's distribution with the empirical distribution of the training data.
Example use of a simple attention mechanism for music tokens
While the above code uses LSTMs, a more modern approach might involve a self-attention layer. A minimal snippet could look like this:
import torch
import torch.nn as nn

class SimpleTransformer(nn.Module):
    def __init__(self, vocab_size, embed_dim, num_heads, hidden_dim, num_layers):
        super(SimpleTransformer, self).__init__()
        self.embedding = nn.Embedding(vocab_size, embed_dim)
        self.transformer_blocks = nn.ModuleList([
            nn.TransformerEncoderLayer(d_model=embed_dim, nhead=num_heads, dim_feedforward=hidden_dim)
            for _ in range(num_layers)
        ])
        self.fc = nn.Linear(embed_dim, vocab_size)

    def forward(self, x):
        x = self.embedding(x)  # (batch_size, seq_len, embed_dim)
        # By default, nn.TransformerEncoderLayer expects [seq_len, batch_size, embed_dim]
        x = x.transpose(0, 1)
        for block in self.transformer_blocks:
            x = block(x)
        x = x.transpose(0, 1)
        logits = self.fc(x)
        return logits
Such an architecture can handle symbolic music data by letting each token attend to all other tokens in the sequence, capturing longer-range dependencies more effectively than a basic RNN.
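One practical caveat: as written, SimpleTransformer lets every token attend to future tokens as well, which is fine for analysis-style tasks but not for autoregressive generation. A minimal sketch of how one might add a causal (look-ahead) mask using the same PyTorch layers; the helper reuses the attributes defined in the class above:

import torch

def causal_forward(model, x):
    """Run SimpleTransformer with a causal mask so position t only attends to positions <= t."""
    seq_len = x.size(1)
    # Upper-triangular -inf mask: entry (i, j) = -inf when j > i, 0 otherwise.
    mask = torch.triu(torch.full((seq_len, seq_len), float('-inf')), diagonal=1)
    h = model.embedding(x).transpose(0, 1)          # (seq_len, batch, embed_dim)
    for block in model.transformer_blocks:
        h = block(h, src_mask=mask)                 # TransformerEncoderLayer accepts src_mask
    return model.fc(h.transpose(0, 1))              # back to (batch, seq_len, vocab_size)

# e.g. logits = causal_forward(model, torch.randint(0, vocab_size, (2, 32)))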
Suno and Udio: what we have now
(work in progress...)
Conclusion
Music generation has come a long way from early stochastic or rule-based systems to sophisticated neural architectures that can learn style, structure, and timbre. Symbolic generation offers clarity and computational efficiency, while audio-based generation aims for end-to-end fidelity, albeit at higher computational costs. With each new methodological leap —be it an innovation in self-attention or the adoption of diffusion-based approaches— we inch closer to bridging the gap between "synthetic" and "human-like" in musical creation.
Researchers continue to push the boundaries, exploring ways to integrate user control, domain knowledge, musicological rules, and creative constraints. While there's still plenty of room for improvement in areas like timbre modeling, long-term thematic development, and interpretive nuance (e.g., expressive timing, dynamics, articulations), the current trajectory suggests that generative music systems will only become more compelling and integrated into our daily creative workflows.
Below is a short list of images that might appear in an extended version of these materials:
- Symbolic representation of a piano roll: a typical piano roll representation showing time on one axis and pitch on the other, used for training symbolic models.
- Diagram of a simple LSTM network unrolled over time: an unrolled view of an LSTM, showing hidden states being passed forward as the model processes a sequence of tokens.
- Transformer architecture diagram with multi-head self-attention: a generic visualization of transformer layers highlighting the self-attention mechanism.
- High-level diagram of a GAN for music generation: generator and discriminator in a GAN setup for symbolic or audio music generation.
- Bar-based generation strategy: illustration of how bar-level chunks can be generated and stitched together in certain transformer approaches.