Recently I’ve been playing around with neural networks as a tool for realtime audio processing, opting for a sort of “black box” approach rather than the conventional ways we handle distortion. In this post, I’d like to get you acquainted with some of the rough basics behind the use of neural networks in audio, talking through some of my own experiences and tests with practical examples along the way. I’ll warn you now that this shouldn’t be viewed as a tutorial or a particularly deep dive – I’m mainly writing this as a way to articulate some of my own thoughts on the matter in a form I can reflect on later. Without further ado, let’s jump in.
Distortion: The Easiest Way to Start
While it’s not necessarily the flashiest effect to model, I decided on distortion as the starting point for my experiments. This is for a few reasons, chief among them that it offers us the most flexibility in the type of neural network we use on the back end. It has another distinct advantage, too: there’s already extensive research into modeling distortion through mathematical functions. The typical digital distortion effect takes an incoming signal and intentionally degrades it through techniques such as soft clipping, wave shaping, or nonlinear transfer functions. In practice, this means running the input signal through a mathematical function that introduces harmonic content, essentially reshaping the waveform in a controlled (or sometimes chaotic) way. That function can be a polynomial, a lookup table tuned to a desired saturation characteristic, or just straight-up math applied to each sample.
An Extremely Simple C++ Distortion
#include <cmath>

float processSample(float inputSample)
{
    // Apply gain to drive the signal into distortion
    float drive = 10.0f;
    float x = inputSample * drive;

    // Soft clipping via the tanh function
    float output = std::tanh(x);
    return output;
}
This basic C++ distortion effect uses the tanh function to simulate soft clipping, but it falls short of capturing the complexities of analog circuits. Real-world distortion from vacuum tubes, transistors, and op-amps introduces dynamic, frequency-dependent changes that a simple function can’t replicate. Analog circuits also have “memory,” where past signals influence the present, adding to their unique character—something this basic implementation doesn’t account for.
Why Use a Neural Network
While traditional math approaches excel at capturing the general characteristics of distortion, I find that they often lack the nuance of analog hardware. Traditional DSP techniques such as the tanh function rely on mathematical models that approximate the ideal behavior of circuits, but they struggle with more complex, dynamic, and non-linear systems—aka the ones that sound more realistic to our ears.
Neural networks offer a unique solution: instead of manually designing a function to approximate distortion, you can train a model to learn the behavior of a piece of real-world gear. This concept is often referred to as black box modeling, and is based on the idea that you don’t need to understand the inner workings of something to approximate its function. We provide the input and desired output of our distortion effect, then pass the processing part off to a neural network, which iterates on itself until it’s figured out how to accurately predict the output for a given input.
The key advantage? Once trained, a neural network model can run in close to real-time and generate predictions based on the data we provided, making it (at least in theory) more faithful to our source material. The trade-off, of course, is that training a good model takes time and, more importantly, data – and the resulting model requires more computational resources than simple DSP methods (though some of the more advanced conventional distortion models are already quite computationally taxing).
Choosing a Neural Network Type
Not all neural networks are well-suited for audio processing—particularly when it comes to real-time processing (i.e. in a plugin). Processing audio as a sequence of individual samples (without considering context) leads to unnatural results. Instead, we need networks that can capture the temporal aspects of audio effects and learn how signals evolve over time. Generally, audio is processed using one of the following network types:
- Recurrent Neural Networks (RNNs) – These networks have built-in memory, meaning they can analyze sequences of data instead of isolated points. Basic RNNs work for modeling time-series data, but they can struggle with long-term dependencies – the key part of an RNN is that it feeds its results back into itself (a gross simplification) to understand how that data evolves over time.
- Long Short-Term Memory (LSTM) & Gated Recurrent Units (GRUs) – These are “improved” versions of RNNs that can remember patterns over longer sequences, making them better for tasks like modeling analog warmth over time, though I’m not fully convinced that they offer any real material benefit over a standard RNN – more testing here is necessary.
- WaveNet (Dilated Causal Convolutions) – This deep learning model, originally developed by DeepMind for speech synthesis, is particularly well-suited for high-quality audio modeling – it uses convolutional layers instead of recurrent units, which in my experience leads to fewer glitches and better overall performance relative to the sound quality (there’s a small sketch of what a dilated causal convolution looks like just after this list).
- State-Space Models & State Transition Networks – These are some lesser-known approaches – I was introduced to them from Jatin Chowdhury’s work, and I can’t claim I have anything new to add to the conversation here. I’ll be trying these in the future, so check back down the line when I’ve had more time to experiment.
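Since “dilated causal convolutions” is a bit of a mouthful, here’s a rough PyTorch sketch of what that idea looks like in code – this isn’t taken from any of the models later in the post, and the kernel size, channel count, and depth are arbitrary choices purely for illustration:

import torch
import torch.nn as nn
import torch.nn.functional as F

class DilatedCausalBlock(nn.Module):
    def __init__(self, channels, dilation):
        super().__init__()
        # Pad only on the left so the convolution never sees future samples (that's the "causal" part)
        self.left_pad = (3 - 1) * dilation
        self.conv = nn.Conv1d(channels, channels, kernel_size=3, dilation=dilation)

    def forward(self, x):
        x = F.pad(x, (self.left_pad, 0))
        return torch.tanh(self.conv(x))

# Doubling the dilation each layer (1, 2, 4, 8) grows the receptive field exponentially
# while keeping the layer count small - the core WaveNet trick
stack = nn.Sequential(*[DilatedCausalBlock(16, 2 ** i) for i in range(4)])

A full WaveNet-style model adds gated activations and skip connections on top of this, but the left-only padding plus growing dilation is the part that gives the network its long “memory” without recurrence.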
Let’s look at the results from a few of these types of neural networks and see how they stack up.
Training Data
The success of a neural network depends entirely on the quality of the data it learns from. In our case, we need two things:
- Input data: A set of clean, undistorted audio samples.
- Target data: The same audio samples passed through a real distortion effect, whether from a hardware pedal, an amplifier, or a high-quality DSP model.
A simple way to generate training data is to take an audio clip (e.g., a sine wave sweep or a guitar riff), pass it through a known distortion effect, and record both the clean and distorted versions – this is particularly easy in a DAW like Pro Tools or Logic, or you can use an open source library on the internet.
import soundfile as sf
import torch
import torchaudio

def load_audio(file_path, target_sr=48000):
    audio, sr = sf.read(file_path)
    audio = torch.tensor(audio).float()
    # Resample if the file isn't already at the target sample rate
    if sr != target_sr:
        audio = torchaudio.transforms.Resample(sr, target_sr)(audio)
    return audio

# Load input (DI) and target (AMP) signals
di_signal = load_audio("DI.wav")
amp_signal = load_audio("AMP.wav")

# Normalize signals to [-1, 1]
di_signal /= torch.max(torch.abs(di_signal))
amp_signal /= torch.max(torch.abs(amp_signal))

# Ensure signals are the same length
min_length = min(len(di_signal), len(amp_signal))
di_signal, amp_signal = di_signal[:min_length], amp_signal[:min_length]
After training, the network should be able to process incoming audio samples and generate a distorted output that closely mimics the training data. Of course, this is just a toy example—real-world training would involve much more data, including different gain settings and dynamic behavior.
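One thing the snippet above glosses over is how these two long signals actually get handed to a network. The feed-forward model in the next section expects fixed-length windows (the seq_length it references), so the data needs to be sliced up first – here’s a plausible way to do that, assuming mono files and a window size I picked arbitrarily:

# Hypothetical segmentation step: slice the aligned signals into fixed-length windows
seq_length = 1024  # arbitrary window size - the model just needs it to be consistent
num_windows = min_length // seq_length

# Trim to a whole number of windows, then reshape to (num_windows, seq_length)
di_segments = di_signal[:num_windows * seq_length].reshape(num_windows, seq_length)
amp_segments = amp_signal[:num_windows * seq_length].reshape(num_windows, seq_length)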
Setting Up a Basic Neural Network (Feed Forward)
For this part of the article, I’ll be working in the PyTorch framework, which I’ve found to be the easiest way to get started in the machine learning space. Let’s take a look at how we can set up a basic neural network.
import torch
import torch.nn as nn

class AmpModel(nn.Module):
    def __init__(self):
        super(AmpModel, self).__init__()
        self.fc1 = nn.Linear(seq_length, 512)
        self.fc2 = nn.Linear(512, 256)
        self.fc3 = nn.Linear(256, 128)
        self.fc4 = nn.Linear(128, seq_length)  # Output same shape as input

    def forward(self, x):
        x = x.view(x.shape[0], -1)  # Flatten input
        x = torch.relu(self.fc1(x))
        x = torch.relu(self.fc2(x))
        x = torch.relu(self.fc3(x))
        x = self.fc4(x)  # No activation to preserve waveform
        return x
This version of AmpModel is a simple, fully connected neural network designed to process audio by transforming waveforms through dense layers. The input is first flattened into a single vector, treating the entire sequence as a whole. It then passes through four linear layers, starting at 512 dimensions and gradually narrowing to 128 before the final layer maps back to the original sequence length, forcing the network to extract key features along the way. ReLU activation is applied after each hidden layer to introduce non-linearity, allowing for more complex transformations – you could use a tanh function here instead, but I find ReLU works best.
Because this model lacks convolutional or recurrent layers, it doesn’t analyze local patterns or track temporal dependencies directly. Instead, it learns a more global transformation of the waveform, making it useful for tasks like distortion modeling or waveshaping.
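For completeness, here’s a minimal sketch of what the training step looks like – this isn’t my exact loop, and the optimizer, learning rate, and epoch count are placeholder choices, but it shows how the input/target windows from the hypothetical slicing step earlier drive the whole “black box” process:

import torch.optim as optim

model = AmpModel()
criterion = nn.MSELoss()                        # compare the predicted waveform to the amp recording
optimizer = optim.Adam(model.parameters(), lr=1e-3)

for epoch in range(200):
    optimizer.zero_grad()
    prediction = model(di_segments)             # clean DI windows in
    loss = criterion(prediction, amp_segments)  # how far off are we from the real amp?
    loss.backward()                             # backpropagate the error
    optimizer.step()                            # nudge the weights and go again

In practice you’d shuffle mini-batches out of those windows rather than pushing everything through at once, but the shape of the loop stays the same.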
Unfortunately, this network didn’t yield particularly “good” sounding results – the problems are obvious to the ear, and they’re also fairly apparent if we compare the output waveform against the original.

The main issue here is that the network has a bad tendency to “round off” transient information, and the accuracy of the steps is just all over the place. It certainly captures most of the harmonics, but sonically it’s missing all of the sharpness and bite that characterize guitar distortion.
Using a Gated Recurrent Unit Neural Network (GRU)
The GRU network is one that I’m particularly fond of, though I’ll admit it has some challenges when it comes to processing overhead. The setup for this network is very similar to our feed forward network, with a few minor differences.
class AmpModel(nn.Module):
    def __init__(self):
        super(AmpModel, self).__init__()
        self.conv1 = nn.Conv1d(1, 16, kernel_size=5, padding=2)
        self.conv2 = nn.Conv1d(16, 32, kernel_size=5, padding=2)
        self.gru = nn.GRU(32, 64, batch_first=True)
        self.fc = nn.Linear(64, 1)

    def forward(self, x):
        x = torch.relu(self.conv1(x))
        x = torch.relu(self.conv2(x))
        x = x.permute(0, 2, 1)  # Swap dimensions for GRU
        x, _ = self.gru(x)
        x = self.fc(x)
        return x.squeeze(-1)  # Remove last dimension
This network processes audio using a combination of convolutional layers and a GRU. It starts by passing the input through two 1D convolutional layers (the approach touched on in the WaveNet bullet earlier), which essentially function as a set of learned filters, each looking at five samples of the signal at a time (the kernel size), with non-linearity introduced via ReLU activation. The output is then reshaped and fed into a GRU, which captures temporal dependencies in the signal. Finally, a fully connected layer maps the processed features to a single-channel output, shaping the final sound.
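To make those shapes concrete, here’s roughly how a clip passes through this model – the one-second length is arbitrary, and the extra dimensions are there because Conv1d expects input shaped as (batch, channels, samples):

# Assumes `model` holds trained weights; this is just a shape walkthrough
with torch.no_grad():
    clip = di_signal[:48000].view(1, 1, -1)  # one second of DI at 48 kHz -> (1, 1, 48000)
    wet = model(clip)                        # comes back as (1, 48000) after the final squeeze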
Accuracy of the GRU
Annnnnnnnnnd success! While the GRU network isn’t perfect, it’s certainly workable as an audio effect (keep in mind my training data and processing were fairly minimal here) – I normalized the model’s output against the original audio waveform, and this is what we got:

Overall, this was sonically very close to the original, with the main differences coming in the peaks and valleys – the GRU network has a tendency to overshoot on the waveform, which is especially noticeable around the ~100-sample and ~600-sample marks, and compared to the feed-forward network this is almost the exact opposite behavior.
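For anyone curious how that comparison was put together, this is roughly the shape of it – not my exact plotting code, and the 800-sample window is just wide enough to show the overshoot regions mentioned above:

import matplotlib.pyplot as plt

with torch.no_grad():
    predicted = model(di_signal.view(1, 1, -1)).squeeze(0)

# Peak-normalize both signals so the comparison is purely about waveform shape
predicted = predicted / torch.max(torch.abs(predicted))
reference = amp_signal / torch.max(torch.abs(amp_signal))

plt.plot(reference[:800].numpy(), label="original amp")
plt.plot(predicted[:800].numpy(), label="GRU model")
plt.legend()
plt.show()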