Pan, automation and levels
One important topic is how we turn recorded audio into a finished product. If we record a song, how do we bring together and polish the various instrument and vocal sounds? The foundational elements of this process are mixing and effect processing.
Two of the most important parts of mixing multitrack audio are the levels and panning of the tracks. Levels refer to the volume of each track relative to the others. Panning refers to where each track is placed in the speakers or headphones: left, right or anywhere in between. Automation lets us create changes over time to the volume, panning and other aspects of the mix.
There is no shortcut or pattern for setting the levels in a mix. There is no right number for where a fader should be except where it sounds best. Everything is relative and different sounds cut through differently. It all depends on context and there is no substitute for using our ears. When we set the faders we should remember good gain staging practice: it is usually better to turn down sounds that are loud rather than turning up sounds that are quiet, because sounds that are too loud can clip. What matters is not where the fader sits but what level the sound ends up at. It is also important to remember that when we mix sounds together, the mix will be louder than any individual sound by itself. If individual tracks are fairly loud, mixing them together might exceed the 0 dBFS limit and cause distortion unless the faders are turned down. When mixing, it is a good habit to think of what needs to be quieter instead of what needs to be louder. That way we will not run out of dynamic range and we will avoid clipping distortion. We will also make better use of the faders' range of movement, because most faders only boost up to about 6 dB but can reduce as much as we need. This habit of asking "what is getting in the way of something more important?" also makes us prioritise sounds more musically.
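As a sketch of why summed tracks need headroom, the following example (illustrative signals and gain values, not from the text) builds three tracks that each peak safely below full scale, yet clip when summed at unity gain:

```python
import numpy as np

# Three mono "tracks" as sine waves, each peaking at 0.5 on a
# full-scale range of +/-1.0 (i.e. about -6 dBFS individually).
sr = 48000
t = np.arange(sr) / sr
tracks = [0.5 * np.sin(2 * np.pi * f * t) for f in (110.0, 220.0, 330.0)]

mix = np.sum(tracks, axis=0)            # summing at unity gain
print(np.max(np.abs(mix)) > 1.0)        # True: the sum exceeds full scale

# Turning each fader down (gain staging) keeps the sum below clipping.
gain = 10 ** (-6 / 20)                  # roughly 0.5, i.e. -6 dB per track
safe_mix = np.sum([gain * tr for tr in tracks], axis=0)
print(np.max(np.abs(safe_mix)) < 1.0)   # True: headroom preserved
```

The point is not the exact numbers but the habit: reducing each track leaves room for the sum.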
Setting faders is not a one-time task. In almost every project there are moments where various tracks become too quiet or too loud, or where the balance needs to shift. That is where automation comes in. Automation means that the computer moves faders and other controls for us, over the timeline of the project, according to a plan that we create. Automation is what lets us control the mix at every moment of the song.
If the individual tracks of a project are well recorded and performed, and the parts fit well together, then getting the level balance just right is the most important aspect of the mix.
Not as fundamental, but still very important, is pan. Pan is short for panorama. Just as a camera pan changes the view and moves images across the screen, audio panning lets us move sounds between the left and right sides. Stereo audio always has two channels, one for left and one for right, so hearing panning requires listening in stereo. If we send a sound equally to both sides, it sounds as if it is coming from the centre. If we turn down the level of the signal on the left side, the sound seems to come more from the right, until eventually it comes only from the right. The same works in reverse. Most of the time, each track of a project is recorded with just one microphone, and panning means moving that single source around in the stereo field. It is also possible to record sounds in stereo using two mics; if we pan one mic to each side, the result is a stereo sound with its stereo field captured acoustically. For the most part, though, individual tracks are recorded in mono and then panned so that the mix as a whole has a stereo image. Besides creativity, there are three factors to consider when panning:
- One-ear compatibility
- Mono compatibility
- Bass management
Sometimes people listen with just one earbud, or only have good hearing in one ear. The only way to be sure that everyone can always hear the most important elements is to pan those elements at, or near, the centre so they appear in both ears. That is why lead vocals are almost always right in the middle. Sometimes people listen in mono, where the left and right channels are mixed together into one speaker. Some effects and techniques can sound very different in mono and stereo. A mix that sounds good in both situations is said to have mono compatibility. Good bass management involves panning sounds with a lot of low end to the centre, because both speakers sharing the low end is more efficient.
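Panning a mono track between the speakers can be sketched as a pair of gains. This hypothetical helper (not from the text) uses the common equal-power pan law, one of several possible laws:

```python
import numpy as np

def pan_mono(signal, pan):
    """Equal-power pan law. pan in [-1, 1]: -1 = hard left, +1 = hard right.
    At the centre both channels get about 0.707 (-3 dB) of the signal."""
    angle = (pan + 1) * np.pi / 4          # map [-1, 1] -> [0, pi/2]
    left = np.cos(angle) * signal
    right = np.sin(angle) * signal
    return left, right

mono = np.ones(4)                 # a trivial test signal
l, r = pan_mono(mono, 0.0)        # centre: equal level in both channels
print(np.allclose(l, r))          # True
l, r = pan_mono(mono, 1.0)        # hard right: left channel is silent
print(np.allclose(l, 0.0))        # True
```

The equal-power law keeps the perceived loudness roughly constant as a sound moves across the stereo field.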
Keeping track of the levels while mixing
When mixing audio, we need to be aware of how the mix comes across at different listening levels, that is, volumes. The listening volume, or more precisely the monitoring level, makes a big difference in how we hear while mixing and therefore in which mix choices we make. Part of that is how the acoustic character of a room, good or bad, becomes stronger at higher levels. Mainly, though, our ears hear sound differently at different volumes.
At low volumes our ears are less sensitive to bass and treble frequencies. When we turn up the volume, the bass and treble seem to get louder in comparison to the mid range. When we turn the level down, the mid range becomes relatively more dominant. The sound does not change its EQ curve; it is our ears that cause this. It is therefore helpful to constantly check our mix at different levels. There is a standard level, at least for film and video, that helps provide consistency between mixes. To comply with the standard, the mixing engineer sets the listening level so that a test signal played at -20 dBFS measures 83 dB SPL acoustically. If we set up our monitors to this standard, the 83 dB SPL calibration assumes the material has an average level of around -20 dBFS. For music there is no standard average level, and at least as of 2016 most commercial music mixes are victims of the "loudness war": they are dynamically compressed, peak limited and clipped so that their average level is -8 dBFS or even higher. To match those levels to the 83 dB SPL standard, lower the level of such compressed tracks at least until the average is at about -20 dBFS.
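The average level of a digital signal can be measured as RMS relative to full scale. A minimal sketch (the test tone and its amplitude are illustrative):

```python
import numpy as np

def rms_dbfs(signal):
    """Average (RMS) level of a signal relative to digital full scale (1.0)."""
    rms = np.sqrt(np.mean(np.square(signal)))
    return 20 * np.log10(rms)

sr = 48000
t = np.arange(sr) / sr
sine = 0.1 * np.sin(2 * np.pi * 1000 * t)   # a quiet 1 kHz test tone
level = rms_dbfs(sine)
print(round(level, 1))   # -23.0 (a sine's RMS is its peak / sqrt(2))
```

A full-scale sine would read about -3 dBFS RMS; professional loudness standards use more elaborate weighted measurements, but the idea is the same.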
83 dB SPL is fairly loud. Listening for eight hours a day at 83 dB SPL will cause hearing damage over time. Because of that, many engineers do not mix at that level the whole time.
If we want something to seem powerful and loud at any volume, put it next to something small and quiet. If there is no quiet, there can be no loud. What matters is context.
EQ and harmonics
Frequencies interact in very interesting ways when we combine different sounds. A tool that graphs amplitude versus frequency is called a spectrum analyzer. It tells us which parts of the frequency spectrum are at what amplitude. It is quite different from a waveform graph or an oscilloscope view.
Every sound, no matter how complex, is made up of sine wave ingredients at different amplitudes, frequencies and timings, or phases, constantly coming and going. A 1000 Hz sine wave, for example, is just one frequency. If we observe it on an oscilloscope, it is a smooth wave, exactly like a graph of the mathematical sine function. On the spectrum analyzer it is a single spike at 1000 Hz and nothing anywhere else, showing the one frequency that is playing. On the other hand, if we analyse a note from a piano, it is still fairly simple, but it is made up of several related frequencies. On the oscilloscope the wave shows a mostly consistent repeating pattern, and our ears hear it as melodic and simple. On the spectrum analyzer there is a pattern of equally spaced spikes: one at, let us say, 100 Hz, one at 200 Hz, one at 300 Hz, one at 400 Hz and so on. The mathematical relationship between those frequencies is what makes the combined sound wave consistent and melodic.
Technically, each ingredient frequency within a sound is called a partial. We call the lowest partial the fundamental frequency, and all the higher partials harmonics or overtones. The fundamental frequency determines the pitch of the sound, while the tone of the waveform is influenced by the other partials, that is, the harmonics and overtones. Partials at exact multiples of the fundamental sound like they belong together; our ears do the math automatically. Those mathematically related partials are called harmonic partials. Some sounds have non-harmonic partials, which are not mathematically related to the fundamental frequency and do not line up the way harmonic partials do. The recipe of partials that make up a sound determines its waveform shape, and vice versa: change the partials and we change the waveform shape; change the waveform shape and we change the partials. Distorting the waveform shape changes the recipe of partials.
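The idea that a tone is a recipe of harmonic partials can be sketched by building one and inspecting it with an FFT, a software spectrum analyzer. The partial amplitudes below are arbitrary illustration values:

```python
import numpy as np

# Build a 100 Hz tone from four harmonic partials, then confirm that the
# spectrum shows energy only at multiples of the fundamental.
sr = 8000
t = np.arange(sr) / sr                       # one second of audio
partials = {100: 1.0, 200: 0.5, 300: 0.25, 400: 0.125}
tone = sum(a * np.sin(2 * np.pi * f * t) for f, a in partials.items())

spectrum = np.abs(np.fft.rfft(tone)) / (sr / 2)   # normalise to amplitudes
peaks = np.flatnonzero(spectrum > 0.05)           # bins with real energy
print(peaks.tolist())   # [100, 200, 300, 400] -> the harmonic series
```

With exactly one second of audio, each FFT bin is 1 Hz wide, so the bin indices read directly as frequencies.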
It is easy to see these changes in simple sounds, but with more complex sounds, like a full mix of a song, so many partials are coming and going all the time that our visual tools are less useful for identifying exact partials and more useful for watching overall trends. Most instruments have their own frequency area, but their various partials can still overlap. This means that some of the harmonic frequencies of one instrument can be in the same range as another instrument.
Some software can try to guess which frequencies belong to which source, but there is no perfect way to know which frequencies belong to which instrument, and no easy way to unmix sounds. Even though we cannot unmix a song or alter the partials to change a recording from one instrument into another, we do have a powerful tool to adjust the volume balance of the partials that are already there. That tool is called EQ, or equalization.
The variety of equalizers
Equalization alters the relative amplitude of different frequency ranges. For example, a parametric equalizer gives us a lot of control over the sound with several different EQ tools: high-pass and low-pass filters, high-shelf and low-shelf filters and peak-notch filters. Digital parametric EQs are great because they usually show how the amplitude of certain frequencies is being adjusted within a sound. When the curve is flat across the 0 mark, all frequencies pass through at their original volume. Wherever the curve goes up or down from zero, partials in the sound at those frequencies are turned up or down by that much in amplitude.
Most of the tools in the EQ toolbox create specific shapes in the curve, like the low-pass and high-pass filters. The low-pass lets low frequencies pass through and progressively cuts the highs. The high-pass lets the high frequencies pass through and cuts the lows. Both commonly have at least two parameters: the cutoff frequency, which is where the rolloff begins, and the slope, which is the steepness measured in decibels of attenuation per octave. To attenuate means to turn down, and an octave is an exact doubling or halving of a frequency in Hertz.
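The gentlest possible low-pass filter, a one-pole design with roughly a 6 dB per octave slope, can be sketched in a few lines. This is an illustration of the idea, not a production EQ; the cutoff and test frequencies are arbitrary:

```python
import numpy as np

def one_pole_lowpass(x, cutoff_hz, sr):
    """Simplest low-pass filter: each output sample moves a fixed fraction
    of the way toward the input, so fast changes (highs) are smoothed out."""
    a = 1 - np.exp(-2 * np.pi * cutoff_hz / sr)
    y = np.zeros_like(x)
    state = 0.0
    for n, sample in enumerate(x):
        state += a * (sample - state)
        y[n] = state
    return y

sr = 8000
t = np.arange(sr) / sr
low = np.sin(2 * np.pi * 50 * t)      # below the 200 Hz cutoff
high = np.sin(2 * np.pi * 2000 * t)   # well above the cutoff
out = one_pole_lowpass(low + high, 200, sr)

# The 50 Hz component survives; the 2 kHz component is strongly attenuated.
spec = np.abs(np.fft.rfft(out)) / (sr / 2)
print(spec[50] > 0.9, spec[2000] < 0.2)   # True True
```

A high-pass with the same slope is simply the input minus this low-pass output.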
A frequency of 0 Hz is also called DC offset: a constant that pulls the sound wave off centre vertically from neutral. DC stands for direct current, a term borrowed from the analog domain. DC offset causes clipping to happen at lower volumes than it otherwise would and harms the efficiency of speakers during playback. A high-pass filter will remove DC offset and recentre the waveform. A low-pass filter is simply a mirror image of the high-pass. It has the same two parameters, slope and cutoff frequency, but on a low-pass filter the level rolls off as the frequency doubles instead of as it halves. Both of these filters affect everything beyond the cutoff frequency.
The shelf filter does something similar to the last two, but more subtly and gently. It comes in high and low versions: the high-shelf can turn the high end up or down, and the low-shelf can do the same for the low end. Shelf EQs do not eliminate the extreme lows and highs; instead, they cut or boost them by a constant amount. Shelf filters have different parameters than the low-pass and high-pass filters. One is the frequency, the centre of the transition to the shelf. The other is the amount of boost or cut, also known as gain. The treble and bass controls on car stereos are usually high-shelf and low-shelf filters.
Instead of affecting everything above or below a certain point, a peak-dip (or peak-notch) filter targets some frequency range in the middle and leaves both the lows and the highs alone. The term parametric refers to the flexibility of the peak-dip parameters: the centre frequency, the gain and the width. Gain controls the amount of boost or cut, and the centre frequency controls where that boost or cut happens. The width, also called bandwidth or Q, controls how wide or narrow the shape of the filter is. Q stands for quality, not quality as in how good it is, but as a technical term meaning how focused the shape is.
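A peak-dip filter with exactly these three parameters can be built as a biquad. This sketch follows the widely used Audio EQ Cookbook formulas by Robert Bristow-Johnson; the specific frequency, gain and Q values are illustrative:

```python
import numpy as np

def peaking_eq(f0, gain_db, q, sr):
    """Biquad peak/dip coefficients per the Audio EQ Cookbook
    (Robert Bristow-Johnson). Returns normalised (b, a) arrays."""
    a_lin = 10 ** (gain_db / 40)              # note: 40, not 20, by design
    w0 = 2 * np.pi * f0 / sr
    alpha = np.sin(w0) / (2 * q)
    b = np.array([1 + alpha * a_lin, -2 * np.cos(w0), 1 - alpha * a_lin])
    a = np.array([1 + alpha / a_lin, -2 * np.cos(w0), 1 - alpha / a_lin])
    return b / a[0], a / a[0]

def mag_db(b, a, f, sr):
    """Magnitude response of the biquad in dB at frequency f."""
    z = np.exp(2j * np.pi * f / sr)
    h = (b[0] + b[1] / z + b[2] / z**2) / (a[0] + a[1] / z + a[2] / z**2)
    return 20 * np.log10(abs(h))

sr = 48000
b, a = peaking_eq(f0=1000, gain_db=6.0, q=1.0, sr=sr)
print(round(mag_db(b, a, 1000, sr), 1))     # 6.0 -> full boost at the centre
print(abs(mag_db(b, a, 10000, sr)) < 1.0)   # True -> far away, barely touched
```

Raising Q narrows the bell around the centre frequency; a negative gain turns the same shape into a dip.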
Although there are other EQ types, the ones above are the most common: low-pass, high-pass, low-shelf, high-shelf and peak-dip. Combinations of those filters can accomplish almost any shape we might need. Some of the reasons we might use EQ are:
Cutting out frequencies like low-end rumble, bringing out desired sounds at certain frequencies, or helping the various sounds in a mix sit well with each other by giving each track its own sonic space. EQ cannot create frequencies that are not already there, and not everything needs equalization. There is no right way to set an EQ except using our ears and paying attention to context. Most EQs come with presets, but we must remember that presets do not know how our recording sounds.
Dynamic processors
The word compression applied to audio has two possible meanings that can easily be confused with each other.
One meaning is data compression: making a file take less computer space. MP3 is one example; it trades sound quality for smaller file size. Dynamic range compression, on the other hand, is something different. The tool that performs dynamic range compression is called a compressor. In simple terms, it turns down the volume whenever the signal gets loud and turns it back up when the signal gets quieter. The goal is to reduce the sound's dynamic range, that is, to compress that range. The dynamic range of a sound refers to how different its quietest and loudest moments are from each other. A sound with very loud and very quiet moments has a wide dynamic range; a sound that stays fairly near the same loudness has a narrow dynamic range. The compressor is probably the best-known unit in a family of tools called dynamic processors, which are used to manipulate the dynamic range of sounds. The four main types of dynamic processors are:
- Compressor
- Limiter
- Expander
- Gate/Noise Gate
All of these tools monitor the signal and adjust its volume automatically, based on how we set their parameters. The expander and gate both increase the dynamic range of the signal, while the compressor and limiter both decrease it. The compressor and expander generally affect the signal in a more subtle way, while the limiter and gate have a more drastic effect.
On all four of these units there is a parameter called the threshold. It refers to a specific volume level. The difference between all four processors is based on how they behave regarding the threshold. In simplified terms:
The Limiter – whenever the signal tries to get louder than the threshold, it turns it down as much as it takes to keep the signal at or below the threshold.
The Compressor – whenever the signal gets louder than the threshold, it turns it down somewhat (depending on the ratio), so it only gets a little louder.
The Gate – whenever the signal drops below the threshold, it turns it down all the way, muting it completely.
The Expander – whenever the signal drops below the threshold, it turns it down somewhat.
This is the big picture about the dynamics processors family.
Parameters of the dynamic processors
We can define each dynamic processor by two parameters: threshold and ratio. The threshold is the volume level the processor is always watching, to see if the signal's amplitude crosses it. The ratio is a mathematical ratio, like 2:1, 3:2, 10:1 and so on.
For example, a compressor kicks in whenever the signal goes above the threshold, and the ratio determines how far the compressor turns it down in response. A signal going into the compressor constantly fluctuates in level, as signals do. As long as it stays quieter than the threshold, the compressor leaves it alone. Let us say that at one point the signal goes 2 dB over the threshold. At that moment the compressor sees that the signal has exceeded the threshold and reacts by turning it down. With a ratio of 2:1, for every 2 dB the signal tries to go over the threshold, it is only allowed to get 1 dB louder. With a 3:1 ratio, the signal has to try to be 3 dB louder to get 1 dB louder. A 10:1 ratio means 10 dB in for 1 dB out, and so on. As long as the signal is above the threshold, the compressor reduces its loudness according to the ratio. When the signal is below the threshold, the compressor returns to neutral and leaves the signal alone.
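The ratio arithmetic above can be captured in a tiny static curve function. The threshold value used here is an arbitrary example:

```python
def compressed_level(input_db, threshold_db, ratio):
    """Static compression curve, all levels in dB: above the threshold,
    the overshoot is divided by the ratio; below it, nothing happens."""
    if input_db <= threshold_db:
        return input_db                      # below threshold: left alone
    overshoot = input_db - threshold_db
    return threshold_db + overshoot / ratio

# Worked examples with an assumed threshold of -20 dB:
print(compressed_level(-18, -20, 2))    # 2 dB over -> only 1 dB over: -19.0
print(compressed_level(-25, -20, 2))    # below the threshold: unchanged: -25
print(compressed_level(-10, -20, 10))   # 10:1 -> 10 dB over becomes 1: -19.0
```

Real compressors apply this curve gradually over time, which is where attack and release (covered below) come in.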
To help visualise all of this, most compressors have a gain reduction meter. It moves from right to left to show how far the processor is turning down the signal. Many processors also have a graph, called a transfer function, to visualise what the compressor is doing with the signal. The input level is on the X axis and the output level on the Y axis. If the transfer function is a 45 degree diagonal line, then what comes in is what comes out. The transfer function makes it easy to see the compressor's behaviour at a glance.
We can use the transfer function to explain how the other dynamic processors work. The limiter is a compressor with a ratio of infinity to 1, which means the signal can never exceed the threshold. This is the strictest definition, known as a brick wall limiter. The term limiter can also mean a compressor with a very high ratio, because a compressor with a ratio of around 10:1 or higher acts essentially as a limiter. The usual kind of expander, the downward expander, watches for when the signal is quieter than the threshold. If the signal is above the threshold, the expander leaves it alone. If the signal drops below the threshold, the expander uses the ratio to determine how much further to turn it down. For example, with a ratio of 1:2, if the signal drops 1 dB below the threshold, the expander turns it down so it sits 2 dB below the threshold. There is also an upward expander, which turns the level up when the signal is above the threshold. Just as the limiter is the extreme of the compressor, a gate is an extreme expander: with an expansion ratio of 1:infinity, whenever the input drops below the threshold it is muted completely.
Now we have a basic definition of all four dynamic processor types, grouped into logical categories:
- Decrease Dynamic Range – Compressor (gentle), Limiter (strict).
- Increase Dynamic Range – Expander (gentle), Gate (strict).
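The four behaviours can be summarised as simplified static curves in one function. This sketch ignores attack, release and knee, which real units add on top; the threshold and test levels are arbitrary:

```python
def static_curve(input_db, threshold_db, kind, ratio=2.0):
    """Simplified static (instantaneous) curves for the four dynamic
    processors, all levels in dB. Illustration only."""
    over = input_db - threshold_db
    if kind == "compressor":                  # above threshold: divide overshoot
        return input_db if over <= 0 else threshold_db + over / ratio
    if kind == "limiter":                     # compressor with ratio -> infinity
        return min(input_db, threshold_db)
    if kind == "expander":                    # downward expander, e.g. 1:2
        return input_db if over >= 0 else threshold_db + over * ratio
    if kind == "gate":                        # expander with ratio -> infinity
        return input_db if over >= 0 else float("-inf")   # fully muted
    raise ValueError(kind)

th = -30                                      # assumed threshold for the demo
print(static_curve(-20, th, "limiter"))       # clamped to -30
print(static_curve(-31, th, "expander", 2))   # 1 dB under -> 2 dB under: -32.0
print(static_curve(-40, th, "gate"))          # below threshold: muted (-inf)
```

Note the symmetry the text describes: the limiter and gate are just the infinite-ratio extremes of the compressor and expander.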
Attack and release
An important aspect of dynamic processing is time. Dynamic processors do not operate instantly; if they did, they would distort the waveform's shape instead of responding to volume changes. To sound more natural, dynamic processors have parameters to adjust their reaction times: the attack and release parameters. We will define these in terms of a compressor and later show how they work on the other dynamic processors.
On a compressor, the attack time sets how quickly the sound is turned down once it exceeds the threshold. The release time sets how quickly the compressor lets go and brings the volume back to normal once the input signal falls below the threshold. Attack times are often very quick, because the beginning of a sound is usually sudden, so they are usually measured in milliseconds or even microseconds. Release times tend to be longer, because sounds often fade away more slowly, so they are usually measured in milliseconds and sometimes seconds. A compressor can sound very different depending on the attack and release times.
Attack and release work exactly the same on a limiter as they do on a compressor since a limiter is basically a very high ratio compressor. On an expander or a gate, the release time sets how quickly the signal is turned down below the threshold and the attack time sets how quickly sounds above the threshold return to normal. In other words, no matter which dynamic processor you are using attack always affects sounds louder than the threshold and release always affects sounds quieter than the threshold.
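Attack and release can be sketched as two different smoothing speeds applied to the desired gain reduction. The time constants and test signal here are illustrative assumptions, not values from the text:

```python
import numpy as np

def smooth_gain(target_gain_db, attack_ms, release_ms, sr):
    """Smooths a gain-reduction curve with separate attack and release
    time constants, roughly as a compressor's control path would."""
    atk = np.exp(-1.0 / (sr * attack_ms / 1000))
    rel = np.exp(-1.0 / (sr * release_ms / 1000))
    out = np.zeros_like(target_gain_db)
    g = 0.0
    for n, target in enumerate(target_gain_db):
        coeff = atk if target < g else rel   # more reduction = attack phase
        g = coeff * g + (1 - coeff) * target
        out[n] = g
    return out

sr = 48000
# 50 ms of no reduction, 50 ms wanting -12 dB, then 50 ms back to none.
target = np.concatenate([np.zeros(2400), np.full(2400, -12.0), np.zeros(2400)])
g = smooth_gain(target, attack_ms=5, release_ms=50, sr=sr)
# Fast attack reaches nearly all the reduction; slow release lags behind.
print(g[4799] < -11.0, g[7199] < -2.0)   # True True
```

Because the attack coefficient is much faster than the release, the gain dives quickly and recovers slowly, which is the behaviour the text describes.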
Many dynamic processors, especially compressors, let us adjust a parameter called knee, ranging from hard to soft. A soft knee means that instead of the threshold being a strict line, there is a gradual transition. The shape of the transfer function graph is where the knee parameter gets its name. Many engineers use a soft knee setting when they want to make the compression more subtle and ensure that the level at which the compressor kicks in is not audible. A hard knee can more easily be used as an effect and gives tighter control over exact volume levels.
Most dynamic processors let us set up what is called a sidechain. The processor watches the sidechain signal and compares it to the threshold, but then controls the volume of the main signal. There are many uses for the sidechain. One easy-to-hear use is to create a pumping effect. Another is in spoken word recordings, to do what is called ducking. For example, we put a compressor on a background music track, then set the sidechain to listen to the announcer, so that the background music gets quieter, that is, it ducks out of the way, whenever someone is talking.
These are the very basics of dynamic processors and their parameters. They are amazingly helpful tools in all kinds of situations.
Reverberation and delay effects
Reverberation, echo and delay are related effects because all of them involve copies of a sound repeated over time, which gives the sound a sense of space.
Reverberation is a phenomenon that happens physically in the acoustic domain. Reverb can also mean a simulation of acoustic reverberation, such as a plugin in a DAW. We hear reverb all the time, so much that we usually do not even notice it. Reverb begins when a sound hits a surface and a copy, an echo, bounces back. This by itself is not reverb yet; it is just one echo, a bit quieter and later than the original. The original sound continues on its way, hitting other surfaces and creating other echoes. The echoes from those surfaces then bounce off still more surfaces, splitting and multiplying. The result is an uncountable number of echoes of the original sound, all blurring together into reverb. Different surfaces absorb or reflect different frequencies in different amounts, so each echo is effectively EQ-ed in the acoustic domain. The shape, size and materials of the surfaces in the environment determine the sonic character of the reverb.
Whenever we record a sound, we also record the reverb of the space it is in. It is advisable to record in a room whose sound we like. Rooms with a lot of reverberation are referred to as live sounding. Although we cannot remove the reverberation from a recording, we can add more. One option, for flexibility, is to record in a space with very little reverb, usually with surfaces that are highly absorbent and do not reflect much sound. This kind of room is referred to as dead sounding or dry sounding. Recordings made in such a space can then have our choice of artificial reverb added later.
Traditional artificial reverb was created using a speaker and a microphone in a very live sounding room called a reverb chamber. To add reverb to a sound, engineers would play it through the speaker then record the sound of the room with the mic. Reverb chambers are still used today.
Another artificial reverb-like effect uses a large metal plate or a long spring. The principle is similar to the reverb chamber: a speaker-like transducer vibrates the metal and a microphone picks up the reverberations. Reverbs can also be created digitally. Early digital reverbs used math to determine when and how loud to play echoes. Those types are still used, but nowadays it is also common to use a recorded impulse response (IR), which captures when the reflected copies of a sound happen, how loud each one is and the EQ curve of each one, usually taken from a real acoustic space.
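Impulse response reverb is mathematically a convolution: every sample of the IR becomes one echo with its own delay and level. In this sketch the IR is a toy decaying noise burst rather than one recorded in a real space:

```python
import numpy as np

# A fake half-second "impulse response": noise that decays exponentially,
# standing in for the dense wash of echoes a real room would produce.
rng = np.random.default_rng(0)
sr = 8000
ir = rng.standard_normal(sr // 2) * np.exp(-np.linspace(0, 6, sr // 2))

dry = np.zeros(sr)
dry[0] = 1.0                       # a single click as the dry signal
wet = np.convolve(dry, ir)         # the click smeared into a reverb tail

# The output rings on after the input ends, by the length of the IR.
print(len(wet) == len(dry) + len(ir) - 1)   # True
```

Real convolution reverbs do the same thing with a measured IR and a fast FFT-based convolution, but the principle is identical.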
To refer to a single delayed copy of a sound, usually created by bouncing off a surface, we use the term echo. To refer to the overall wash of a space, or a simulation of it, we use the term reverb. To refer to an audio effect of individual, countable repetitions of a sound, we usually say delay. Historically, delay effects used a spare tape recorder. With the tape moving, sounds would be recorded at one fixed point and simultaneously played back from another point an inch or two further down the line. Since moving tape takes time to travel from one point to the next, this creates a time delay, which can be adjusted by changing the tape speed. Engineers would set up a mixing console to send one track, or a mix of tracks, into the spare tape recorder. The delayed output of the tape machine would come back into the mix on its own track, creating a single echo effect called a slapback delay. Then, by sending just a bit of the echo back into the tape recorder again, engineers could create a controlled feedback loop that makes repeated echoes. Nowadays it is still possible to use a tape recorder to create delay, but there are also digital simulations of tape delay and many other creative and subtle variations of delay effects. Some plugins let us specify the timing, volume, EQ and other aspects of a couple dozen or more delays. This lets us create everything from simple slapback to complex echoes that are almost like reverb, or even turn a sound into an entire moving beat. Echo and delay are also the foundation of modulation effects like chorus and flanger.
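The tape-style feedback loop can be sketched digitally as a circular buffer with a feedback gain below 1, so the echoes die away. All parameter values here are illustrative:

```python
import numpy as np

def feedback_delay(x, delay_samples, feedback, mix):
    """Digital echo: a circular buffer plays back delayed samples and feeds
    a fraction of them back in, like the tape loop described above."""
    out = np.copy(x).astype(float)
    buf = np.zeros(delay_samples)
    for n in range(len(x)):
        delayed = buf[n % delay_samples]            # read the old sample
        buf[n % delay_samples] = x[n] + feedback * delayed   # write with feedback
        out[n] = x[n] + mix * delayed               # blend echo into output
    return out

x = np.zeros(16)
x[0] = 1.0                               # a single impulse (a "click")
y = feedback_delay(x, delay_samples=4, feedback=0.5, mix=1.0)
print(y[0], y[4], y[8], y[12])           # 1.0 1.0 0.5 0.25 -> decaying echoes
```

With feedback at 0.5 each repeat is half as loud as the last; feedback at or above 1.0 would make the echoes build up endlessly instead of fading.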
How it works in the analog and the digital realms
Most of the time, when people defend their opinions about whether the analog or the digital domain is better, both attitudes miss the point and are often based on preconceived notions about how audio works that may or may not be factually correct. It is a trap to think too simplistically or generally, or to assume that one is necessarily better than the other. They simply have different strengths and weaknesses in different situations. Furthermore, it is possible to combine both domains in the same project if our gear allows it. Many audio interfaces allow multiple separate audio streams to flow into and out of the computer, converting them from analog to digital or back as appropriate. We can set up custom signal paths in our DAW to take advantage of this, sending some tracks to analog processors and others to digital plugins in the DAW.
In general, the sound we get depends less on whether the gear is digital or analog and more on how we use it. Of course, both of the domains have very specific characteristics.
Analog processing can introduce analog problems like unwanted distortion, noise and hum. Digital processing can introduce digital problems like aliasing, clipping or quantization distortion. Used with care and attention, either approach can sound great.
Analog recording media like tape and vinyl alter the sound with subtle distortion that adds a character digital recording lacks. Sometimes this is called euphonic distortion, meaning good-sounding distortion. On the other hand, high quality digital recording is essentially neutral and does not significantly change the sound. Because of this, we can even combine the two to get the best of both worlds. Analog mixers, even the best professional ones, are imperfect and have small amounts of distortion and crosstalk, which is leakage between channels. Many engineers consider that euphonic and prefer its character, while others do not care for it. Typical digital mixers, like the ones built into DAWs, are completely neutral and transparent. Some engineers prefer this cleaner sound; others describe it as sterile. All that said, common myths about digital audio necessarily sounding less "warm" or less faithful to the original recording than analog are mostly based on misunderstanding. Digital recording captures sound over a wider frequency and dynamic range than analog recording, and does so without significantly colouring the sound. Even standard CD quality digital audio captures frequencies that reach the bounds of human hearing.
Since practically all modern audio recordings involve the digital domain in some way it is essential to understand and play to the strengths of digital audio.