Audio Handling Basics: Process Audio Files In Command-Line or Python

Handling audio data is an essential task for machine learning engineers working in the fields of speech analytics, music information retrieval and multimodal data analysis, but also for developers who simply want to edit, record and transcode sounds. This article shows the basics of handling audio data using command-line tools, and also provides a not-so-deep dive into handling sounds in Python.

So what is sound and which are its basic attributes?

According to physics, sound is a travelling vibration, i.e. a wave that moves through a medium such as the air. The sound wave is transferring energy from particle to particle until it is finally “received” by our ears and perceived by our brains. The two basic attributes of sound are amplitude (what we also call loudness) and frequency (a measure of the wave’s vibrations per time unit).

Similarly to images and videos, sound is an analog signal that has to be transformed to a digital signal, in order to be stored in computers and analyzed by software. This analog to digital conversion includes two processes: sampling and quantization.

Sampling is used to convert the time-varying continuous signal x(t) to a discrete sequence of real numbers x(n). The interval between two successive discrete samples is the sampling period (Ts). We use the sampling frequency (fs = 1/Ts) as the attribute that describes the sampling process.

Typical sampling frequencies are 8 kHz, 16 kHz and 44.1 kHz. 1 Hz means one sample per second, so higher sampling frequencies mean more samples per second and therefore better signal quality.

(More precisely, a higher sampling frequency means that the discrete signal can capture a wider range of frequencies, namely from 0 to fs/2 Hz according to the Nyquist rule.)
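To make the Nyquist rule concrete, here is a minimal numpy sketch (the helper names sample_sine and dominant_frequency are made up for this illustration). It samples two sine waves at fs = 8 kHz: a 1 kHz tone, which lies below fs/2, and a 5 kHz tone, which lies above it and therefore shows up at the wrong (aliased) frequency.

```python
import numpy as np

fs = 8000            # sampling frequency (Hz)
Ts = 1.0 / fs        # sampling period (seconds)
n = np.arange(fs)    # sample indices for one second of signal


def sample_sine(freq_hz):
    """Sample x(t) = sin(2*pi*f*t) at the time instants t = n * Ts."""
    return np.sin(2 * np.pi * freq_hz * n * Ts)


def dominant_frequency(x):
    """Return the frequency (Hz) of the strongest FFT bin of x."""
    spectrum = np.abs(np.fft.rfft(x))
    freqs = np.fft.rfftfreq(len(x), d=Ts)
    return freqs[np.argmax(spectrum)]


f_ok = dominant_frequency(sample_sine(1000))     # below fs/2: preserved
f_alias = dominant_frequency(sample_sine(5000))  # above fs/2: aliased
print("1 kHz tone is seen at {:.0f} Hz".format(f_ok))
print("5 kHz tone is seen at {:.0f} Hz".format(f_alias))
```

The 5 kHz tone is indistinguishable from a tone at fs - 5000 = 3000 Hz once it has been sampled, which is why frequencies above fs/2 cannot be represented.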

Quantization is the process of replacing each real number, x(n), of the sequence of samples with an approximation from a finite set of discrete values. In other words, quantization is the process of reducing the infinite precision of an audio sample to a finite precision defined by a particular number of bits.

In the majority of cases, 16 bits per sample are used to represent each quantized sample, which means that there are 2¹⁶ levels for the quantized signal. For that reason, raw audio values usually vary from -2¹⁵ to 2¹⁵ - 1 (1 bit is used for the sign). However, as we will see later, this is usually normalized to the (-1, 1) range for the sake of simplicity.

We usually call this bit resolution property of the quantization procedure “sample resolution” and it is measured in bits per sample.
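The quantization step can be sketched in a few lines of numpy. The quantize function below is a hypothetical helper written for this illustration (it is not part of any library): it maps real-valued samples in the (-1, 1) range to signed integers and back, and measures the error introduced by the rounding.

```python
import numpy as np


def quantize(x, bits=16):
    """Quantize samples in the (-1, 1) range to signed `bits`-bit integers."""
    levels = 2 ** (bits - 1)              # 2^15 for 16-bit audio
    q = np.round(x * levels)
    # keep the values inside the representable range [-2^15, 2^15 - 1]
    return np.clip(q, -levels, levels - 1).astype(np.int64)


t = np.arange(0, 0.01, 1 / 8000)
x = 0.5 * np.sin(2 * np.pi * 440 * t)     # "real-valued" samples

q16 = quantize(x, bits=16)                # integer sample values
x_hat = q16 / 2 ** 15                     # back to the (-1, 1) range
max_error = np.max(np.abs(x - x_hat))     # at most half a quantization step
print(q16[:5], max_error)
```

With 16 bits the quantization step is 1 / 2¹⁵, so the rounding error of each sample is at most half of that.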

Tools and libraries used in this article

I’ve selected the following command-line tools, programs and libraries to use for basic handling of audio data:

ffmpeg/libav. FFmpeg (https://ffmpeg.org) is a free, open-source project for handling multimedia files and streams. Some think that ffmpeg and libav are the same, but actually libav is a fork project from ffmpeg
sox (http://sox.sourceforge.net) aka “the Swiss Army knife of sound processing programs” is a free cross-platform command line utility for basic audio processing. Despite the fact that it has not been updated since 2015, it is still a good solution. In this article we mostly demonstrate ffmpeg and a couple of examples in sox
audacity (https://www.audacityteam.org) is a free, open-source and cross-platform program for editing sounds

programming: we will use pydub (https://github.com/jiaaro/pydub) and scipy (https://scipy-cookbook.readthedocs.io) for reading audio data, and librosa (https://librosa.github.io/librosa/) for tempo estimation.

We could also use pyAudioAnalysis (https://github.com/tyiannak/pyAudioAnalysis) for IO or for more advanced feature extraction and signal analysis.

Finally, we will also use plotly (https://plotly.com) for basic signal visualization.

This article is divided into two parts:

1st part: how to use ffmpeg and sox to handle audio files
2nd part: how to programmatically handle audio files and perform basic processing

Part I: Handling audio data — the command-line way

Below are some examples for the most basic audio handling such as conversion between formats, temporal trimming, merging and segmentation, using mostly ffmpeg and sox.

To convert video (mkv) to audio (mp3)

ffmpeg -i video.mkv audio.mp3

To downsample to 16 kHz, convert stereo (2 channels) to mono (1 channel) and convert MP3 to WAV (uncompressed audio samples), one needs to use the -ar (audio rate) and -ac (audio channels) options:

ffmpeg -i audio.mp3 -ar 16000 -ac 1 audio_16K_mono.wav

Note that, in this case, stereo to mono conversion means that the two channels are averaged into one. Downsampling and stereo to mono conversion can also be achieved using sox in the following manner: sox <source_file> -r <new_sampling_rate> -c 1 <output_file>

Now let’s see the new file’s attributes using ffmpeg:

ffmpeg -i audio_16K_mono.wav

will return:

Input #0, wav, from ‘audio_16K_mono.wav’:
Metadata:
encoder : Lavf57.71.100
Duration: 00:03:10.29, bitrate: 256 kb/s
Stream #0:0: Audio: pcm_s16le ([1][0][0][0] / 0x0001), 16000 Hz,
mono, s16, 256 kb/s
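The numbers in this output are consistent with the sampling and quantization attributes described earlier. As a quick sanity check (a sketch that only redoes the arithmetic, using the values printed above), the bitrate of uncompressed PCM audio is the product of the sampling frequency, the sample resolution and the number of channels:

```python
fs = 16000            # sampling frequency reported by ffmpeg (Hz)
bits_per_sample = 16  # pcm_s16le / s16: signed 16-bit samples
channels = 1          # mono

# bitrate of uncompressed PCM audio, in kbit per second
bitrate_kbps = fs * bits_per_sample * channels / 1000

# approximate size of the raw samples for the reported duration
duration_s = 3 * 60 + 10.29   # Duration: 00:03:10.29
payload_mb = bitrate_kbps * 1000 * duration_s / 8 / 1e6

print("{:.0f} kb/s, about {:.2f} MB of samples".format(bitrate_kbps, payload_mb))
```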

To trim an audio file, e.g. from the 60th to the 80th second (20 seconds new duration):

ffmpeg -i audio.wav -ss 60 -t 20 audio_small.wav

(This can also be achieved with the -to argument, which defines the end of the trimmed segment; in the example above that would be 80.)

To concatenate two or more audio files one can use the “ffmpeg -f concat” command. Suppose you want to concatenate the files file1.wav, file2.wav and file3.wav into a single large file called output.wav. What you need to do is create a text file of the following format (say, named ‘list_of_files_to_concat’):

file 'file1.wav'
file 'file2.wav'
file 'file3.wav'

and then run

ffmpeg -f concat -i list_of_files_to_concat -c copy output.wav
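When there are many files, the list file can be generated instead of typed by hand. The snippet below is a small sketch using only the Python standard library; write_concat_list is a hypothetical helper name, the list is written to the system's temporary directory just for the sake of the example, and file names that themselves contain single quotes are not handled.

```python
from pathlib import Path
import tempfile


def write_concat_list(wav_paths, list_path):
    """Write one "file '<path>'" line per input file and return the lines."""
    lines = ["file '{}'".format(p) for p in wav_paths]
    Path(list_path).write_text("\n".join(lines) + "\n")
    return lines


list_path = Path(tempfile.gettempdir()) / "list_of_files_to_concat"
lines = write_concat_list(["file1.wav", "file2.wav", "file3.wav"], list_path)
print(list_path.read_text())
```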

On the other hand, breaking an audio file into successive chunks (segments) of the same specified duration can be done with the “ffmpeg -f segment” option. For example, the following command will break output.wav into 1-second, non-overlapping segments named out00000.wav, out00001.wav, etc.:

ffmpeg -i output.wav -f segment -segment_time 1 -c copy out%05d.wav

With regards to channel handling, apart from simple mono to stereo conversion (or stereo to mono) through the -ac property, one may want to switch stereo channels (right to left). The way to achieve this is through the ffmpeg map_channel property:

ffmpeg -i stereo.wav -map_channel 0.0.1 -map_channel 0.0.0 stereo_inverted.wav

To create a stereo file from two mono files, say left.wav and right.wav:

ffmpeg -i left.wav -i right.wav -filter_complex "[0:a][1:a]join=inputs=2:channel_layout=stereo[a]" -map "[a]" mix_channels.wav

In the opposite direction, to split a stereo file into two mono files (one for each channel):

ffmpeg -i stereo.wav -map_channel 0.0.0 left.wav -map_channel 0.0.1 right.wav

The map_channel property can also be used to mute a channel of a stereo signal, e.g. (below the left channel is muted):

ffmpeg -i stereo.wav -map_channel -1 -map_channel 0.0.1 muted.wav

Volume adaptation can also be achieved through ffmpeg, e.g.

ffmpeg -i data/music_44100.wav -filter:a "volume=0.5" data/music_44100_volume_50.wav
ffmpeg -i data/music_44100.wav -filter:a "volume=2.0" data/music_44100_volume_200.wav

The figure below presents a screenshot from viewing (with Audacity) the original, the 50% volume adaptation and the x2 (200%) volume adaptation signals. The x2 volume boosted signal is clearly clipped (i.e. some samples cannot be represented and they are assigned the maximum allowed value, which is 2¹⁵ - 1 for 16-bit signals):

Volume change can be achieved with sox as well in the following way:

sox -v 0.5 data/music_44100.wav data/music_44100_volume_50_sox.wav
sox -v 2.0 data/music_44100.wav data/music_44100_volume_200_sox.wav
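The clipping effect described above can be reproduced numerically. The following numpy sketch uses a synthetic 440 Hz tone instead of a real recording, and change_volume is a hypothetical helper that mimics a gain change on 16-bit samples; it is not a description of how ffmpeg or sox implement it internally.

```python
import numpy as np


def change_volume(samples, gain):
    """Scale int16 samples by `gain`, clipping to the 16-bit range."""
    info = np.iinfo(np.int16)
    scaled = samples.astype(np.float64) * gain
    return np.clip(np.round(scaled), info.min, info.max).astype(np.int16)


fs = 44100
t = np.arange(fs) / fs
# synthetic tone that uses 80% of the available 16-bit range
original = (0.8 * (2 ** 15 - 1) * np.sin(2 * np.pi * 440 * t)).astype(np.int16)

half = change_volume(original, 0.5)     # 50% volume: no clipping
double = change_volume(original, 2.0)   # 200% volume: peaks are clipped

# fraction of samples that ended up at the limits of the 16-bit range
clipped_fraction = np.mean(np.abs(double.astype(np.int32)) >= 2 ** 15 - 1)
print(half.max(), double.max(), clipped_fraction)
```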

Part II: Handling audio data — the programming way

Load WAV and MP3 files to array

Let us first load our sampled audio data to a numpy array (we use numpy arrays as they are considered the most widely adopted way to process numerical sequences/vectors). The most common way to load WAV data to numpy arrays is scipy.io.wavfile, while for MP3 data one can use pydub (https://github.com/jiaaro/pydub), which uses ffmpeg for encoding / decoding audio data.

In the following example, the same signal stored in WAV and MP3 files is loaded to numpy arrays.

# Read WAV and MP3 files to array
from pydub import AudioSegment
import numpy as np
from scipy.io import wavfile
from plotly.offline import init_notebook_mode
import plotly.graph_objs as go
import plotly

# read WAV file using scipy.io.wavfile
fs_wav, data_wav = wavfile.read("data/music_8k.wav")

# read MP3 file using pydub
audiofile = AudioSegment.from_file("data/music_8k.mp3")
data_mp3 = np.array(audiofile.get_array_of_samples())
fs_mp3 = audiofile.frame_rate

print('Sq Error Between mp3 and wav data = {}'.
      format(((data_mp3 - data_wav)**2).sum()))
print('Signal Duration = {} seconds'.
      format(data_wav.shape[0] / fs_wav))

result:

Sq Error Between mp3 and wav data = 0
Signal Duration = 5.256 seconds

Note: the overall duration of the loaded signal (in seconds) is computed by dividing the number of samples by the sampling frequency (Hz = samples per second). Also, in the example above we compute the sum of squared errors to check that the two loaded signals are identical, even though one was read from the MP3 file and the other from the WAV file.

Stereo signals

Stereo signals are handled through 2D arrays. In the following example, the data_wav array has two columns, one for each channel. By convention, the first column is the left channel and the second is the right channel.

# Handling stereo signals
fs_wav, data_wav = wavfile.read("data/stereo_example_small_8k.wav")
time_wav = np.arange(0, len(data_wav)) / fs_wav
plotly.offline.iplot({ "data": [go.Scatter(x=time_wav, 
                                           y=data_wav[:, 0], 
                                           name='left channel'), 
                                go.Scatter(x=time_wav, 
                                           y=data_wav[:, 1], 
                                           name='right channel')]})
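Once a stereo signal is stored as a two-column array, the channel operations shown with ffmpeg in the first part become simple array manipulations. The sketch below uses a synthetic stereo signal (two tones of different frequency) rather than a file, so the variable names are illustrative only:

```python
import numpy as np

fs = 8000
t = np.arange(fs) / fs
left = (10000 * np.sin(2 * np.pi * 220 * t)).astype(np.int16)
right = (10000 * np.sin(2 * np.pi * 440 * t)).astype(np.int16)

# same layout as the array returned for a stereo WAV file:
# one row per sample, one column per channel
stereo = np.stack((left, right), axis=1)

# switch the two channels by reversing the column order
swapped = stereo[:, ::-1]

# stereo to mono: average the two channels in floating point
mono = stereo.astype(np.float64).mean(axis=1).astype(np.int16)

print(stereo.shape, swapped.shape, mono.shape)
```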

Normalization

Normalization is necessary for performing computations on the audio signal values, as it makes the signal values independent of the sample resolution (i.e. signals with 24 bits per sample have a much higher range of values than signals with 16 bits per sample). The following example demonstrates how to normalize an audio signal to the (-1, 1) range, by simply dividing by 2¹⁵.

This is because we know that the sample resolution is 16 bits per sample. In the rare case of 24 bits per sample, the normalization factor should change accordingly.

# Normalization
fs_wav, data_wav = wavfile.read("data/lost_highway_small.wav")
data_wav_norm = data_wav / (2**15)
time_wav = np.arange(0, len(data_wav)) / fs_wav
plotly.offline.iplot({ "data": [go.Scatter(x=time_wav, 
                                           y=data_wav_norm, 
                                           name='normalized audio signal')]})
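If the sample resolution is not known in advance, the normalization factor can be derived from the dtype of the loaded array instead of being hard-coded. The normalize function below is a hypothetical helper sketched for this article; it assumes signed integer PCM data whose dtype range matches the sample range, and it leaves any other dtype unscaled.

```python
import numpy as np


def normalize(samples):
    """Scale signed integer PCM samples to the (-1, 1) range."""
    if np.issubdtype(samples.dtype, np.signedinteger):
        # e.g. 2^15 for int16 and 2^31 for int32
        full_scale = -float(np.iinfo(samples.dtype).min)
        return samples / full_scale
    # other dtypes are only converted to float64, not rescaled
    return samples.astype(np.float64)


x16 = normalize(np.array([-32768, 0, 16384], dtype=np.int16))
x32 = normalize(np.array([2 ** 30], dtype=np.int32))
print(x16, x32)
```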

Trim / Segment

The following example shows how to get seconds 2 to 4 from the previously loaded and normalized signal. This is done by simply referring to the respective indices in the numpy array. The indices must be expressed in audio samples, so seconds need to be multiplied by the sampling frequency.

# Trim (segment) audio signal (keep seconds 2 to 4)
data_wav_norm_crop = data_wav_norm[2 * fs_wav: 4 * fs_wav]
time_wav_crop = np.arange(0, len(data_wav_norm_crop)) / fs_wav
plotly.offline.iplot({ "data": [go.Scatter(x=time_wav_crop, 
                                           y=data_wav_norm_crop, 
                                           name='cropped audio signal')]})
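The seconds-to-samples conversion can be wrapped in a small reusable function. This is a minimal sketch on a synthetic signal; trim is a hypothetical helper name, and the returned time axis is expressed in the time of the original recording (so it starts at 2 seconds in the example).

```python
import numpy as np


def trim(signal, fs, start_s, end_s):
    """Return the samples between start_s and end_s (seconds) and their time axis."""
    start = int(round(start_s * fs))
    end = int(round(end_s * fs))
    segment = signal[start:end]
    time_axis = start_s + np.arange(len(segment)) / fs
    return segment, time_axis


fs = 8000
signal = np.random.default_rng(0).uniform(-1, 1, 5 * fs)  # 5 seconds of noise
segment, time_axis = trim(signal, fs, 2, 4)
print(len(segment) / fs, time_axis[0], time_axis[-1])
```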

Fixed-size segmentation

In the first part we showed how we can segment a long recording into non-overlapping segments using ffmpeg. The following code sample shows how to do the same with Python. Line 8 does the actual segmentation in a single-line command. Overall, the following script loads and normalizes an audio signal, then breaks it into 1-second segments and writes each one of them to a file.

(Note that the samples should be cast back to 16 bit before saving to file, because the normalization has converted them to floating point values.)

# Fix-sized segmentation (breaks a signal into non-overlapping segments)
fs, signal = wavfile.read("data/obama.wav")
signal = signal / (2**15)
signal_len = len(signal)
segment_size_t = 1 # segment size in seconds
segment_size = segment_size_t * fs  # segment size in samples
# Break signal into list of segments (dtype=object: the last one may be shorter)
segments = np.array([signal[x:x + segment_size] for x in
                     np.arange(0, signal_len, segment_size)], dtype=object)
# Save each segment in a separate file (cast back to 16 bit first, since
# the normalization above has converted the samples to floats)
for iS, s in enumerate(segments):
    wavfile.write("data/obama_segment_{0:d}_{1:d}.wav".format(segment_size_t * iS,
                                                              segment_size_t * (iS + 1)), fs, (s * 2**15).astype(np.int16))

A simple algorithm to remove silent segments from a recording

The previous script has broken a recording into a list of 1-second segments. The code below implements a very simple silence removal method. To this end, it computes the energy of each segment as the mean of its squared samples, then calculates a threshold as 50% of the median energy value, and finally keeps the segments whose energy is above that threshold:

import IPython
# Remove pauses using an energy threshold = 50% of the median energy:
energies = np.array([(s**2).sum() / len(s) for s in segments])
# (attention: integer overflow would occur without normalization here!)
thres = 0.5 * np.median(energies)
index_of_segments_to_keep = (np.where(energies > thres)[0])
# get segments that have energies higher than the threshold:
segments2 = segments[index_of_segments_to_keep]
# concatenate segments to signal:
new_signal = np.concatenate(segments2)
# and write to file:
wavfile.write("data/obama_processed.wav", fs, new_signal)
plotly.offline.iplot({ "data": [go.Scatter(y=energies, name="energy"),
                                go.Scatter(y=np.ones(len(energies)) * thres, 
                                           name="thres")]})
# play the initial and the generated files in notebook:
IPython.display.display(IPython.display.Audio("data/obama.wav"))
IPython.display.display(IPython.display.Audio("data/obama_processed.wav"))
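To see the thresholding logic in isolation, without any audio file, the same idea can be tried on a synthetic signal. In the sketch below (an illustration only, with made-up amplitudes) loud noise plays the role of speech and very quiet noise plays the role of the pauses:

```python
import numpy as np

fs = 8000
rng = np.random.default_rng(0)

# synthetic recording: 1 s of loud noise ("speech") alternating with
# 1 s of almost silent noise ("pause")
loud = [0.5 * rng.standard_normal(fs) for _ in range(3)]
quiet = [0.001 * rng.standard_normal(fs) for _ in range(2)]
signal = np.concatenate([loud[0], quiet[0], loud[1], quiet[1], loud[2]])

# 1-second, non-overlapping segments
segment_size = fs
segments = [signal[i:i + segment_size]
            for i in range(0, len(signal), segment_size)]

# energy per segment and threshold at 50% of the median energy
energies = np.array([np.mean(s ** 2) for s in segments])
thres = 0.5 * np.median(energies)
keep = energies > thres

# keep only the segments above the threshold
new_signal = np.concatenate([s for s, k in zip(segments, keep) if k])
print(keep, len(new_signal) / fs)
```

The median is a convenient reference here because it is not affected much by a few very loud or very quiet segments, but the 50% factor remains a heuristic that may need tuning for other recordings.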

The energy / threshold plot is shown in the figure below (all segments whose energies are below the red line are removed from the processed recording). Also, note the last two lines of code (using the IPython.display.display() function) that are used to add a clickable audio clip directly in the notebook for both the initial and the processed audio files, as the following screenshot shows:

You can listen to the original and processed (after silence removal) recordings below:

Music analysis: a toy example on bpm (beats per minute) estimation

Music analysis is an application domain of signal processing and machine learning that focuses on analyzing musical signals, mostly for content-based retrieval and recommendation. One of the major tasks in music analysis is to extract high-level attributes that describe a song, such as its musical genre and the underlying mood.

Tempo is one of the most important attributes of a song. Tempo tracking is the task of automatically estimating a song's tempo (in bpm) directly from the signal. One of the basic implementations of tempo tracking is included in the librosa library.

The following toy example takes as input a mono audio file where a song is stored and produces a stereo file where on the left channel is the initial song, while on the right channel is an artificially generated periodic “beep” sound that “follows” the main tempo of the song:

import numpy as np
import scipy.io.wavfile as wavfile
import librosa
import IPython
# load file and extract tempo and beats:
[Fs, s] = wavfile.read('data/music_44100.wav')
tempo, beats = librosa.beat.beat_track(y=s.astype('float'), sr=Fs, units="time")
beats -= 0.05
# add small 220Hz sounds on the 2nd channel of the song ON EACH BEAT
s = s.reshape(-1, 1)
s = np.array(np.concatenate((s, np.zeros(s.shape)), axis=1))
for ib, b in enumerate(beats):
    t = np.arange(0, 0.2, 1.0 / Fs)
    amp_mod = 0.2 / (np.sqrt(t)+0.2) - 0.2
    amp_mod[amp_mod < 0] = 0
    x = s.max() * np.cos(2 * np.pi * t * 220) * amp_mod
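    # clip the beep to the signal limits (the first beat may start before 0
    # and the last beat may end after the end of the recording):
    start = max(int(Fs * b), 0)
    end = min(start + x.shape[0], s.shape[0])
    s[start:end, 1] = x[:end - start].astype('int16')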
# write a wav file where the 2nd channel has the estimated tempo:
wavfile.write("data/music_44100_with_tempo.wav", Fs, np.int16(s))    
# play the generated file in notebook:
IPython.display.display(IPython.display.Audio("data/music_44100_with_tempo.wav"))
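The relation between beat positions and tempo is easy to verify by hand: the tempo in bpm is 60 divided by the time between successive beats. The sketch below applies this to a synthetic list of beat times. Note that bpm_from_beats is a hypothetical helper written for this illustration and is not how librosa estimates tempo internally.

```python
import numpy as np


def bpm_from_beats(beat_times):
    """Estimate tempo (bpm) from beat positions given in seconds."""
    intervals = np.diff(beat_times)        # seconds between successive beats
    return 60.0 / np.median(intervals)     # beats per minute


beat_times = np.arange(0, 10, 0.5)         # one beat every 0.5 seconds
bpm = bpm_from_beats(beat_times)
print("Estimated tempo: {:.1f} bpm".format(bpm))
```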

The result of the script above is a WAV file where the left channel is the initial song and the right channel is the sequence of beep sounds on the estimated tempo onsets. Below are two examples of generated sounds for two different initial songs:

Real-time recording and frequency analysis

All of the code samples presented above have mainly focused on reading audio data from files and performing some very basic processing on the audio data, such as trimming or segmentation to fixed-size windows, and then either plotting or saving the processed sounds into files.

The following code goes one step further in two ways: (a) by showing how sound can be captured by a microphone in a way that allows real-time and online processing, and (b) by introducing the frequency domain representation of a sound. Our goal here is to create a simple Python script that captures sound on a segment basis and, for each segment, plots the segment’s frequency distribution in the terminal.

Real-time audio capturing is achieved through the pyaudio library. Audio samples are captured in small segments (say, 200 milliseconds long). Then, for each segment, the code presented below computes a basic frequency representation by running the following steps:

1. Compute the magnitude X of the Fast Fourier Transform (FFT) of the recorded segment, and keep the corresponding frequency values (in Hz) in a separate array, say freqs. To put it simply, following the DFT definition, X(i) measures how much of the signal's energy is concentrated around frequency freqs(i) Hz.
2. Downsample X and freqs, so that far fewer frequency coefficients are kept for visualization.
3. Calculate the segment’s total energy (not just the energy at particular frequency bins as described in step 1). This is only used to normalize the maximum width of the frequency visualization.
4. Plot the downsampled frequency energies X against the (also downsampled) frequencies using a simple bar plot.
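Before moving to the real-time version, the four steps can be tried offline on a synthetic segment, without a microphone. The sketch below is only an illustration under simplifying assumptions: it uses a pure 1 kHz tone as the "recorded" segment, numpy's FFT routines, and a plain text bar plot made of "#" characters.

```python
import numpy as np

fs = 8000                 # sampling frequency (Hz)
buff_size = 0.2           # segment length in seconds
wanted_num_of_bins = 40   # number of frequency bins to display

# synthetic "recorded" segment: a 1 kHz tone
t = np.arange(int(fs * buff_size)) / fs
x = 0.5 * np.sin(2 * np.pi * 1000 * t)

# step 1: FFT magnitude and the respective frequencies (Hz);
# the last (Nyquist) bin is dropped so that 800 bins remain
X = np.abs(np.fft.rfft(x))[:-1]
freqs = np.fft.rfftfreq(len(x), d=1 / fs)[:-1]

# step 2: downsample by averaging groups of neighbouring bins
step = len(X) // wanted_num_of_bins
X2 = X[:step * wanted_num_of_bins].reshape(-1, step).mean(axis=1)
freqs2 = freqs[:step * wanted_num_of_bins:step]

# step 3: total energy of the segment
energy = np.mean(x ** 2)

# step 4: horizontal bar plot in plain text
for f, v in zip(freqs2, X2):
    print("{:5d} Hz ".format(int(f)) + "#" * int(40 * v / X2.max()))
print("segment energy: {:.3f}".format(energy))
```

Almost all of the bar length ends up in the bin that starts at 1000 Hz, which is what one would expect for a pure tone.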

These four steps are implemented in the following script. The code is also available as part of the paura library. See the inline comments for a more detailed explanation:

# paura_lite:
# An ultra-simple command-line audio recorder with real-time
# spectrogram  visualization

import numpy as np
import pyaudio
import struct
import scipy.fftpack as scp
import termplotlib as tpl
import os

# get window's dimensions
rows, columns = os.popen('stty size', 'r').read().split()

buff_size = 0.2          # window size in seconds
wanted_num_of_bins = 40  # number of frequency bins to display

# initialize soundcard for recording:
fs = 8000
pa = pyaudio.PyAudio()
stream = pa.open(format=pyaudio.paInt16, channels=1, rate=fs,
                 input=True, frames_per_buffer=int(fs * buff_size))

while 1:  # for each recorded window (until ctrl+c is pressed)
    # get current block and convert to list of short ints,
    block = stream.read(int(fs * buff_size))
    format = "%dh" % (len(block) // 2)
    shorts = struct.unpack(format, block)

    # then normalize and convert to numpy array:
    x = np.double(list(shorts)) / (2**15)
    seg_len = len(x)

    # get total energy of the current window and compute a normalization
    # factor (to be used for visualizing the maximum spectrogram value)
    energy = np.mean(x ** 2)
    max_energy = 0.02  # energy for which the bars are set to max
    max_width_from_energy = int((energy / max_energy) * int(columns)) + 1
    if max_width_from_energy > int(columns) - 10:
        max_width_from_energy = int(columns) - 10

    # get the magnitude of the FFT and the corresponding frequencies
    X = np.abs(scp.fft(x))[0:int(seg_len/2)]
    freqs = (np.arange(0, 1 + 1.0/len(X), 1.0 / len(X)) * fs / 2)

    # ... and resample to a fix number of frequency bins (to visualize)
    wanted_step = (int(freqs.shape[0] / wanted_num_of_bins))
    freqs2 = freqs[0::wanted_step].astype('int')
    X2 = np.mean(X.reshape(-1, wanted_step), axis=1)

    # plot (freqs, fft) as horizontal histogram:
    fig = tpl.figure()
    fig.barh(X2, labels=[str(int(f)) + " Hz" for f in freqs2[0:-1]],
             show_vals=False, max_width=max_width_from_energy)
    fig.show()
    # add exactly as many new lines as needed to
    # clear the screen in the next iteration:
    print("\n" * (int(rows) - freqs2.shape[0] - 1))

And this is an execution example of the script:

All code examples presented in Part II are available in this github repo: https://github.com/tyiannak/basic_audio_handling as a jupyter notebook.

The last example (the real-time command-line spectrum analyzer) is available at https://github.com/tyiannak/paura/blob/master/paura_lite.py

About the author (tyiannak.github.io)

Thodoris is currently the Director of ML at behavioralsignals.com, where his work focuses on building algorithms that recognise emotions and behaviors based on audio information. He also teaches multimodal information processing in a Data Science and AI master program in Athens, Greece.