Intro to Audio Analysis: Recognizing Sounds Using Machine Learning

Like my articles? Feel free to vote for me as ML Writer of the year here . Sound analysis is a challenging task, associated to various modern applications, such as speech analytics, music information retrieval, speaker recognition, behavioral analytics and auditory scene analysis for security, health and environmental monitoring. This article provides a brief introduction to basic concepts of extraction, sound and , with demo examples in applications such as musical genre classification, speaker clustering, audio event classification and voice activity detection. audio feature classification segmentation examples are provided in all cases, mostly through the library. All examples are also provided in github repo. Python pyAudioAnalysis this With regards to the involved ML methodologies, this article focuses on hand-crafted audio features and traditional statistical classifiers such as SVMs. Deep audio methods are to follow in a future article, as the present article is more about learning to extract audio features that make sense to your classifiers even when you have some tens of training samples. Prerequisites Before proceeding deeper to audio recognition, the reader needs to know the basics of audio handling and signal representation: sound definition, sampling, quantization, sampling frequency, sample resolution and the basics of frequency representation. Τhese topics are covered in article. this Audio Feature Extraction: short-term and segment-based So you should already know that an audio signal is represented by a of at a given "sample resolution" (usually 16bits=2 bytes per sample) and with a particular (e.g. 16KHz = 16000 samples per second). sequence samples sampling frequency We can now proceed to the next step: use these samples to the corresponding sounds. By "analyze" we can mean anything from: recognize between different types of sounds, segment an audio signal to homogeneous parts (e.g split voiced from unvoiced segments in a speech signal) or group sound files based on their content similarity. analyze In all cases, we first need to find a way to go from the low-level and voluminous audio data samples to a higher-level representation of the audio content. This is the purpose of , the most common and important task in all machine learning and pattern recognition applications. feature extraction (FE) FE is about extracting a set of features that are informative with respect to the desired properties of the original data. In our case, we are interested to extract audio features that are capable of discriminating between different audio classes, i.e. different speakers, events, emotions or musical genres, depending on the application subdomain. The most important concept of audio feature extraction is (or ): this simply means that the audio signal is split into short-term windows (or ). The frames can be optionally overlapping. short-term windowing framing frames The length of the frames usually ranges from 10 to 100msecs depending on the application and types of signals. For the non-overlapping case, the of the windowing procedure is equal to the window's (also called "size"). step length If, on the other hand, step < size, then the frames are overlapping: e.g., a 10msec step for a 40msec window size means a 75% overlap. Usually, a window function (such as hamming) is also applied to each frame. For each frame (let N be the total number of frames), we extract a set of (short-term) audio features. When the features are directly extracted from the audio sample values, they are called time-domain. If the features are calculated on the FFT values, they are called frequency-domain features. Finally, cepstral features (such as ) are features that are based on the cepstrum. MFCCs As an example, let's assume we only extract the signal's energy (mean of squares of the audio samples) and spectral centroid (the centroid of the FFT's magnitude). This means, that (or two equally-lengthed feature sequences, if you like). during this framing procedure, the signal is represented by a sequence of 2-D short-term feature vectors So how can we use these arbitrary-size sequences to analyze the respective signal? Imagine you want to build a classifier to discriminate between two audio classes, say speech and silence. Your initial training data are audio files and corresponding class labels (one class label per audio file). If these files have the same duration, the corresponding short-term feature vector sequences will have the same length. What happens, though, in the general case of arbitrary durations? whole How do we represent arbitrary-sized audio segments? One solution would be to zero pad the feature sequences up to the maximum duration of the dataset and then concatenate the different short-term feature sequences to a single feature vector. But that would (a) lead to very high dimensionality (and therefore the need for more data samples to achieve training) and (b) be very dependent on the temporal positions of the feature values (as each feature would correspond to a different timestamp). A more common approach followed in traditional audio analysis is to extract a set of per . feature statistics fix-sized segment The segment-level statistics extracted over the short-term feature sequences are the representations for each fix-sized segment. The final signal representation can be the long-term average of the segment statistics. ... segment feature statistics is the simplest way to go As an example, consider an audio signal of 2.5 seconds. We select a short-term window of 50 msecs and a 1-sec segment. According to the above, the energy and spectral centroid sequences will be extracted for each 1-sec segment. The length of the sequences N will be equal to 1 / 0.050 = 20. Then, the μ and σ of each sequence are extracted for each 1-sec segment, as the segment feature statistics. These are finally long-term averaged, resulting in the final signal representation. (Note that the last segment is 0.5 long, so the statistics are extracted on a shorter segment) Notes: Short-term usually range from 10 to 100 msecs. Longer frames mean better frequency representations (more samples to compute FFT and therefore each FFT bin corresponds to fewer Hz). But longer frames also mean losing in time resolution, as audio signals are nonstationary (consider a frame so long, that it encapsulates two different audio events: the frequency resolution would be very high, but what would the respective features represent?) frame sizes The long-term averaging step of the segment feature statistics of a signal (described above) is optional, usually adopted during training / testing an audio classifier. It is used to map the whole signal to a single feature vector. This could not be desired in another setup, e.g. when we are interested in segmenting the initial signal. Audio Feature Extraction: code examples uses to read a WAV audio file and extract short-term feature sequences and plots the energy sequence (just one of the features). Please see inline comments for an explanation, along with these two notes: Example1 pyAudioAnalysis returns the sampling rate (Fs) of the audio file and a NumPy array of the raw audio samples. To get the duration in seconds, one simply needs to divide the number of samples by Fs read_audio_file() function returns (a) a 68 x 20 short-term feature matrix, where 68 is the number of short-term features implemented in the library and 20 is the number of frames that fit into the 1-sec segments (1-sec is used as mid-term window in the example) (b) a 68-length list of strings that contain the names of each feature implemented in the library. ShortTermFeatures.feature_extraction() pyAudioAnalysis ShortTermFeatures aF pyAudioAnalysis audioBasicIO aIO numpy np plotly.graph_objs go plotly IPython fs, s = aIO.read_audio_file( ) IPython.display.display(IPython.display.Audio( )) duration = len(s) / float(fs) print( ) win, step = , [f, fn] = aF.feature_extraction(s, fs, int(fs * win), int(fs * step)) print( ) print( ) i, nam enumerate(fn): print( ) time = np.arange( , duration - step, win) energy = f[fn.index( ), :] mylayout = go.Layout(yaxis=dict(title= ), xaxis=dict(title= )) plotly.offline.iplot(go.Figure(data=[go.Scatter(x=time, y=energy)], layout=mylayout)) # Example 1: short-term feature extraction from import as from import as import as import as import import # read audio data from file # (returns sampling freq and signal as a numpy array) "data/object.wav" # play the initial and the generated files in notebook: "data/object.wav" # print duration in seconds: f'duration = seconds' {duration} # extract short-term features using a 50msec non-overlapping windows 0.050 0.050 f' frames, short-term features' {f.shape[ ]} 1 {f.shape[ ]} 0 'Feature names:' for in f' : ' {i} {nam} # plot short-term energy # create time axis in seconds 0 # get the feature whose name is 'energy' 'energy' "frame energy value" "time (sec)" results in (cropped due to length): duration = seconds frames, short-term features Feature names: :zcr :energy :energy_entropy :spectral_centroid :spectral_spread :spectral_entropy :spectral_flux :spectral_rolloff :mfcc_1 ... :chroma_11 :chroma_12 :chroma_std :delta zcr :delta energy ... :delta chroma_12 :delta chroma_std 1.03 20 68 0 1 2 3 4 5 6 7 8 31 32 33 34 35 66 67 and the following graph: demonstrates the spectral centroid short-term feature. Spectral centroid is simply the centroid of the FFT magnitude, normalized in the [0, Fs/2] frequency range (e.g, if Spectral Centroid = 0.5 this is equal to Fs/4 measured in Hz). Example2 pyAudioAnalysis ShortTermFeatures aF pyAudioAnalysis audioBasicIO aIO numpy np plotly.graph_objs go plotly IPython fs, s = aIO.read_audio_file( ) IPython.display.display(IPython.display.Audio( )) duration = len(s) / float(fs) print( ) win, step = , [f, fn] = aF.feature_extraction(s, fs, int(fs * win), int(fs * step)) print( ) time = np.arange( , duration - step, win) energy = f[fn.index( ), :] mylayout = go.Layout(yaxis=dict(title= ), xaxis=dict(title= )) plotly.offline.iplot(go.Figure(data=[go.Scatter(x=time, y=energy)], layout=mylayout)) # Example 2: short-term feature extraction: # spectral centroid of two speakers from import as from import as import as import as import import # read audio data from file # (returns sampling freq and signal as a numpy array) "data/trump_bugs.wav" # play the initial and the generated files in notebook: "data/trump_bugs.wav" # print duration in seconds: f'duration = seconds' {duration} # extract short-term features using a 50msec non-overlapping windows 0.050 0.050 f' frames, short-term features' {f.shape[ ]} 1 {f.shape[ ]} 0 # plot short-term energy # create time axis in seconds 0 # get the feature whose name is 'energy' 'spectral_centroid' "spectral_centroid value" "time (sec)" In this example, the Spectral Centroid sequence is calculated for a recording that contains a 2-sec Donald Trump speech sample, followed by the Bugs Bunny's "What's up doc?" phrase. It is obvious that the spectral centroid sequence can be used to discriminate these two speakers in this particular example (higher spectral centroid values correspond to higher frequencies and therefore "brighter" sounds). In total, 34 short-term features are extracted in , for each frame, and the ShortTermFeatures.feature_extraction() function also (optionally) extracts the respective delta features. In that, case the total number of features extracted for each short-term frame is 68. The complete list and description of the short-term features can be found in the library's and . pyAudioAnalysis wiki this publication The two first examples used function to extract 68 features per short-term frame. As described in the previous section, in many cases, such as segment-level classification, we also extract segment-level statistics. This is achieved through the function, as shown in : ShortTermFeatures.feature_extraction() MidTermFeatures.mid_feature_extraction() Example3 pyAudioAnalysis MidTermFeatures aF pyAudioAnalysis audioBasicIO aIO fs, s = aIO.read_audio_file( ) mt, st, mt_n = aF.mid_feature_extraction(s, fs, * fs, * fs, * fs, * fs) print( ) print( ) print( ) print( ) i, mi enumerate(mt_n): print( ) # Example 3: segment-level feature extraction from import as from import as # read audio data from file # (returns sampling freq and signal as a numpy array) "data/trump_bugs.wav" # get mid-term (segment) feature statistics # and respective short-term features: 1 1 0.05 0.05 f'signal duration seconds' {len(s)/fs} f' -D short-term feature vectors extracted' {st.shape[ ]} 1 {st.shape[ ]} 0 f' -D segment feature statistic vectors extracted' {mt.shape[ ]} 1 {mt.shape[ ]} 0 'mid-term feature names' for in f' : ' {i} {mi} results in: signal duration seconds -D short-term feature vectors extracted -D segment feature statistic vectors extracted mid-term feature names :zcr_mean :energy_mean :energy_entropy_mean :spectral_centroid_mean :spectral_spread_mean :spectral_entropy_mean :spectral_flux_mean :spectral_rolloff_mean :mfcc_1_mean ... :delta chroma_9_std :delta chroma_10_std :delta chroma_11_std :delta chroma_12_std :delta chroma_std_std 3.812625 76 68 4 136 0 1 2 3 4 5 6 7 8 131 132 133 134 135 extracts 2 statistics, namely the and of each short-term feature sequence, using the provided "mid-term" (segment) window size of 1 sec for the example above. Since the duration of the signal is 3.8 sec, and the mid-term window step and size is 1 sec, we expect that mid-term segments will be created and for each one of them a feature statistics vector will be calculated. Also, these segment statistics are computed on the short-term feature sequences of 3.8 / 0.05 = short-term frames. Also, note that the mid-term feature names also contain the segment statistic, e.g. zcr_mean is the mean of the zero-crossing-rate short-term feature. MidTermFeatures.mid_feature_extraction() mean std 4 76 The first 3 examples showed how we can extract short-term features and mid-term (segment) feature statistics. Function extracts audio features for all files in the provided folder, so that these data can be used for training a classifier etc. MidTermFeatures.directory_feature_extraction() So it actually calls for each WAV file and it performs long-term averaging to go from segment feature statistic vectors to a single feature vector. Also, this function is capable of extracting two music beat-related features that are appended in the averaged segment statistics. As an example, let's suppose we want to analyze a song of 120 seconds, with a short-term window (and step) of 50 msecs and a mid-term (segment) window and step of 1 second. MidTermFeatures.mid_feature_extraction() The following steps will occur during the call: MidTermFeatures.directory_feature_extraction() 120 / 0.05 = 2400 68-D short-term feature vectors are extracted 120 136-D feature statistics (mean and std of the 68-D vector sequences) are computed the 120 136-D are long-term averaged for the whole song and (optionally) two beat-features are appended (beat has to be computed in a file-level as it needs long-term information), leading to a final feature vector of 138 values. demonstrates the usage of to extract file-level features (averages of segment feature statistics) for 20 2-sec music samples (separate WAV files) from two musical genre categories, namely classical and heavy metal. For each of the 2-sec song segment extracts the 138-D feature vector, as described above. Then we select to plot 2 from these features, using different colors for the two audio classes (classical and metal): Example4 MidTermFeatures.directory_feature_extraction() MidTermFeatures.directory_feature_extraction() pyAudioAnalysis MidTermFeatures aF os numpy np plotly.graph_objs go plotly dirs = [ , ] class_names = [os.path.basename(d) d dirs] m_win, m_step, s_win, s_step = , , , features = [] d dirs: f, files, fn = aF.directory_feature_extraction(d, m_win, m_step, s_win, s_step) features.append(f) print(features[ ].shape, features[ ].shape) f1 = np.array([features[ ][:, fn.index( )], features[ ][:, fn.index( )]]) f2 = np.array([features[ ][:, fn.index( )], features[ ][:, fn.index( )]]) plots = [go.Scatter(x=f1[ , :], y=f1[ , :], name=class_names[ ], mode= ), go.Scatter(x=f2[ , :], y=f2[ , :], name=class_names[ ], mode= )] mylayout = go.Layout(xaxis=dict(title= ), yaxis=dict(title= )) plotly.offline.iplot(go.Figure(data=plots, layout=mylayout)) # Example4: plot 2 features for 10 2-second samples # from classical and 10 from metal music from import as import import as import as import "data/music/classical" "data/music/metal" for in 1 1 0.1 0.05 # segment-level feature extraction: for in # get feature matrix for each directory (class) # (each element of the features list contains a # (samples x segment features) = (10 x 138) feature matrix) 0 1 # select 2 features and create feature matrices for the two classes: 0 'spectral_centroid_mean' 0 'energy_entropy_mean' 1 'spectral_centroid_mean' 1 'energy_entropy_mean' # plot 2D features 0 1 0 'markers' 0 1 1 'markers' "spectral_centroid_mean" "energy_entropy_mean" This example plots the size of the feature matrices for the two classes (10 x 138 as explained above) and returns the following plot of the distributions for the two selected features (mean of spectral centroid and mean of energy entropy): It can be seen that the two features can discriminate between the two classes with very high accuracy (only one classical song sample will be misclassified by a simple linear classifier). In particular, the mean of spectral centroid values has higher values for the metal samples, while the mean of energy entropy higher values for the energy entropy samples. Of course, this is just a small demo on a very simple task and with few samples. As we will see in the next Section, classification based on audio features is not always easy and requires more than two features... Audio classification: train the audio classifier Having seen how to extract audio feature vectors per short-term frame, segment and for whole recordings, we can now proceed to building supervised models for particular classification tasks. All we need to have is a set of audio files and respective class labels. assumes that audio files are organized in folders and each folder represents a different audio class. pyAudioAnalysis In the example of the previous Section, we've seen how two features differentiated for two musical genre classes, from respective WAV files organized in two folders. shows how the same features can be used to train a simple SVM classifier: each point of a grid in the 2-D feature space is then classified to either of the two classes. This is a way of visualizing the of the classifier. Example5 decision surface pyAudioAnalysis MidTermFeatures aF os numpy np sklearn.svm SVC plotly.graph_objs go plotly dirs = [ , ] class_names = [os.path.basename(d) d dirs] m_win, m_step, s_win, s_step = , , , features = [] d dirs: f, files, fn = aF.directory_feature_extraction(d, m_win, m_step, s_win, s_step) features.append(f) f1 = np.array([features[ ][:, fn.index( )], features[ ][:, fn.index( )]]) f2 = np.array([features[ ][:, fn.index( )], features[ ][:, fn.index( )]]) p1 = go.Scatter(x=f1[ , :], y=f1[ , :], name=class_names[ ], marker=dict(size= ,color= ), mode= ) p2 = go.Scatter(x=f2[ , :], y=f2[ , :], name=class_names[ ], marker=dict(size= ,color= ), mode= ) mylayout = go.Layout(xaxis=dict(title= ), yaxis=dict(title= )) y = np.concatenate((np.zeros(f1.shape[ ]), np.ones(f2.shape[ ]))) f = np.concatenate((f1.T, f2.T), axis = ) cl = SVC(kernel= , C= ) cl.fit(f, y) x_ = np.arange(f[:, ].min(), f[:, ].max(), ) y_ = np.arange(f[:, ].min(), f[:, ].max(), ) xx, yy = np.meshgrid(x_, y_) Z = cl.predict(np.c_[xx.ravel(), yy.ravel()]).reshape(xx.shape) / cs = go.Heatmap(x=x_, y=y_, z=Z, showscale= , colorscale= [[ , ], [ , ]]) mylayout = go.Layout(xaxis=dict(title= ), yaxis=dict(title= )) plotly.offline.iplot(go.Figure(data=[p1, p2, cs], layout=mylayout)) # Example5: plot 2 features for 10 2-second samples # from classical and 10 from metal music. # also train an SVM classifier and draw the respective # decision surfaces from import as import import as from import import as import "data/music/classical" "data/music/metal" for in 1 1 0.1 0.05 # segment-level feature extraction: for in # get feature matrix for each directory (class) # select 2 features and create feature matrices for the two classes: 0 'spectral_centroid_mean' 0 'energy_entropy_mean' 1 'spectral_centroid_mean' 1 'energy_entropy_mean' # plot 2D features 0 1 0 10 'rgba(255, 182, 193, .9)' 'markers' 0 1 1 10 'rgba(100, 100, 220, .9)' 'markers' "spectral_centroid_mean" "energy_entropy_mean" 1 1 0 # train the svm classifier 'rbf' 20 # apply the trained model on the points of a grid 0 0 0.002 1 1 0.002 2 # and visualize the grid on the same plot (decision surfaces) False 0 'rgba(255, 182, 193, .3)' 1 'rgba(100, 100, 220, .3)' "spectral_centroid_mean" "energy_entropy_mean" results in: In the two examples above (4 and 5) the f1 and f2 matrices can be used to for our classification task, as in any other case when we are using and most ML similar libraries: a matrix X of feature vectors is required (this can be generated by merging f1 and f2 as shown in Example5), along with an array y of the same number of rows with X, corresponding the target values (the class labels). scikit-learn Then, as with any other classification task, X can be split into train and test (using either random subsampling, fold cross-validation or leave-one-out) and for each train/test split an evaluation metric (such as F1, Recall, Precision, Accuracy or even the whole confusion matrix) is computed. Finally, we report the overall evaluation metric, as the average among all train/test splits (depending on the method). This procedure can be followed as described in many tutorials (IMO it is best in the scikit-learn's webpage), as soon as we have the audio features, as described in the previous examples. shown Alternatively, provides a wrapped functionality that includes both feature extraction and classifier training. This is done in function, which pyAudioAnalysis audioTrainTest.extract_features_and_train() (a) first calls function which calls , to extract feature matrices for all given folders of audio files (assuming each folder corresponds to a class) MidTermFeatures.multi_directory_feature_extraction() , MidTermFeatures.directory_feature_extraction() (b) generates X and y matrices, aka the feature matrix for the classification task and the respective class labels. (c) evaluates classifier for different parameters (e.g. C if SVM classifiers are selected) (d) returns printed evaluation results and saves the best model to a binary file (to be used by another function for testing as shown later) The following example trains an SVM classifier for the classical/metal music classification task: pyAudioAnalysis.audioTrainTest extract_features_and_train mt, st = , dirs = [ , ] extract_features_and_train(dirs, mt, mt, st, st, , ) # Example6: use pyAudioAnalysis wrapper # to extract feature and train SVM classifier # for 20 music (10 classical/10 metal) song samples from import 1.0 0.05 "data/music/classical" "data/music/metal" "svm_rbf" "svm_classical_metal" final results are here: classical metal OVERALL C PRE REC f1 PRE REC f1 ACC f1 best f1 best Acc Confusion Matrix: cla met cla met Selected params: 0.001 79.4 81.0 80.2 80.6 79.0 79.8 80.0 80.0 0.010 77.2 78.0 77.6 77.8 77.0 77.4 77.5 77.5 0.500 76.5 75.0 75.8 75.5 77.0 76.2 76.0 76.0 1.000 88.2 75.0 81.1 78.3 90.0 83.7 82.5 82.4 5.000 100.0 83.0 90.7 85.5 100.0 92.2 91.5 91.4 10.000 100.0 78.0 87.6 82.0 100.0 90.1 89.0 88.9 20.000 100.0 75.0 85.7 80.0 100.0 88.9 87.5 87.3 41.50 8.50 0.00 50.00 5.00000 The overall confusion matrix fo the best C param (in that case C=5), indicates that there is (on average for all subsampling experiments) an almost 9% probability that a classical segment will be classified as metal (which, by the way, makes sense if we remember the feature distribution plots we have seen above). is a wrapper that (a) reads all audio files organized in a list of folders and extracts long-term averaged feature statistics (b) then a classifier assuming that folder names represent audio classes. audioTrainTest.extract_features_and_train() from the pyAudioAnalysis lib, trains The trained model will be saved in (last argument of the function), along with the feature extraction parameters (short-term and segment window sizes and steps) . Note that another file is also created called , that stores the normalization parameters, i.e. the mean and std used to normalize the audio features before training and testing. Finally, apart from SVMs the wrapper supports most scikit-learn classifiers such as decision trees and gradient boosting. svm_classical_metal svm_classical_metalMEANS Audio classification: apply the audio classifier So we have trained an audio classifier to distinguish between two audio classes (classical and metal) based on averages of feature statistics as described before. Now let's see how we can use the trained model to the class of an audio file. Towards this end, we are going to use pyAudioAnalysis' as shown in : predict unknown audioTrainTest.file_classification() Example7 pyAudioAnalysis audioTrainTest aT files_to_test = [ , , ] f files_to_test: print( ) c, p, p_nam = aT.file_classification(f, , ) print( ) print( ) print() # Example7: use trained model from Example6 # to classify an unknown sample (song) from import as "data/music/test/classical.00095.au.wav" "data/music/test/metal.00004.au.wav" "data/music/test/rock.00037.au.wav" for in f' :' {f} "svm_classical_metal" "svm_rbf" f'P( = )' {p_nam[ ]} 0 {p[ ]} 0 f'P( = )' {p_nam[ ]} 1 {p[ ]} 1 results in: data/music/test/classical .au.wav: P(classical= ) P(metal= ) data/music/test/metal .au.wav: P(classical= ) P(metal= ) data/music/test/rock .au.wav: P(classical= ) P(metal= ) .00095 0.6365171558566964 0.3634828441433037 .00004 0.1743387880715576 0.8256612119284426 .00037 0.2757302369241449 0.7242697630758552 We can see that we have asked the classifier to predict for three files. The first was a classical music segment (not used in the training dataset obviously) and indeed the estimated posterior of classical was higher than that of metal. Similarly, we tested against a metal segment and it was classified as metal with a posterior of 86%. Finally, the 3rd test file does not belong to the two classes (it is a rock song segment), however, the result makes some sense as it is classified to the closer class. audioTrainTest.file_classification() gets the trained model, and the path of an unknown audio file and does feature extraction and classifier prediction for the unknown file, returning the (predicted) winner class, the classes posteriors and the respective classes names. The previous example showed how we can apply the trained audio classifier to an unknown audio file to predict its audio label. In addition to that, pyAudioAnalysis provides function , which accepts a list of folders, assuming their basenames are class names (as we do during training), and repetitively applies a pre-trained classifier on the audio files of each folder. A the end it outputs performance metrics such as confusion matric and ROC curves: audioTrainTest.evaluate_model_for_folders() pyAudioAnalysis audioTrainTest aT aT.evaluate_model_for_folders([ , ], , , ) # Example8: use trained model from Example6 # to classify audio files organized in folders # and evaluate the predictions, assuming that # foldernames = classes names as during training from import as "data/music/test/classical" "data/music/test/metal" "svm_classical_metal" "svm_rbf" "classical" results in the following figures of the confusion matrix, precision/recall/f1 per class, and Precision/Recall curve and ROC curve for a "class of interest" (here we have provided classical). Note that the 3rd and 4th subplots evaluate the classifier as a of the class "classical" (last argument). For example, the last plot shows the true positive rate vs the false positive rate, and this is achieved by simulating thresholding of the posterior of the class of interest (classical): as the probability threshold rises, both true positive and false negative rates rise, the question is: how "steep" is the true positive rate's increase? More info about the ROC curve can be found . The precision/recall curve is equivalent to ROC, but it shows both metrics on the same graph on the y-axis for different thresholds of the posterior which is shown on the x-axis (more info ). detector here here In our example, we can see that for a probability threshold of, say, 0.6 we can have a 100% Precision with around 80% Recall for classical: this means that all files detected will be indeed classical, while we will be "losing" almost 1 out of 5 "classical" song as metal. Another important note here is that . there is no "best" operation point of the classifier, that depends on the individual application Audio regression: predicting continuous target values Regression is the task of training a mapping function from a feature space to a continuous target variable (instead of a discrete class). Some applications of regression of audio signals include: speech and/or music emotion recognition using non-discrete classes (emotion and arousal) and music soft attribute estimation (e.g. to detect Spotify's "danceability"). In this article, we demonstrate how regression can be used to detect a choral singing segment's pitch, without using any signal processing approach (e.g. the autocorrelation method). Towards this end, we have used part of the , which is a set of recordings with respective annotations. Here, we have selected to use this dataset to produce segment-level pitch annotations: we split the singing recordings to small (0.5 sec) segments and for each segment, we calculate the mean and standard deviation of the pitch (which provided by the dataset). These two metrics are f0_mean and f0_std respectively and are the two target regression values demonstrated in the following code. Below you can listen to a 0.5-sec sample with a low f0 and low f0 deviation: Choral Signing Dataset acapella pitch and a 0.5-sec sample with a high f0 and high f0 deviation: Getting 0.5-sec segments from the leads to thousands of samples, but for demonstration purposes of this article, we have used around 120 training and 120 testing samples, available under and of the github repo. Again, to demonstrate training and testing of the regression model we are using which treats audio segmentation in the following way: Choral Signing Dataset data/regression/f0/segments_train data/regression/f0/segments_train folders pyAudioAnalysis given a path that contains audio files and a set of .csv files of the format , trains one regression model for each .csv file pyAudioAnalysis In our example, contains 120 WAV files and two 120-line CSV files named f0.csv and f0_std.csv. Each CSV corresponds to a separate regression task and each line of the CSV corresponds to the ground truth of the respective audio file. To train, evaluate, and save the two regression models for our example, the following code is used: data/regression/f0/segments_train pyAudioAnalysis audioTrainTest aT aT.feature_extraction_train_regression( , , , , , , , ) # Example9: # Train two linear SVM regression models # that map song segments to pitch and pitch deviation # The following function searches for .csv files in the # input folder. For each csv of the format , # a separate regresion model is trained from import as "data/regression/f0/segments_train" 0.5 0.5 0.05 0.05 "svm" "singing" False Since contains two CSVs, namely and , the above code results in two models: and ( prefix is provided as a 7th argument in the function above and is used for all trained models). data/regression/f0/segments_train f0.csv f0_std.csv singing_f0 singing_f0_std singing Also, this is the result of the evaluation process executed internally by the function: audioTrainTest.feature_extraction_train_regression() Analyzing file : data/regression/f0/segments_train/CSD_ER_alto_1.wav_segments_263 .wav Analyzing file : data/regression/f0/segments_train/CSD_ER_alto_1.wav_segments_264 .wav Analyzing file : data/regression/f0/segments_train/CSD_ER_alto_1.wav_segments_301 .wav Analyzing file : data/regression/f0/segments_train/CSD_ER_alto_1.wav_segments_328 .wav Analyzing file : data/regression/f0/segments_train/CSD_ER_alto_1.wav_segments_331 .wav ... ... ... Analyzing file : data/regression/f0/segments_train/CSD_ND_alto_4.wav_segments_383 .wav Analyzing file : data/regression/f0/segments_train/CSD_ND_alto_4.wav_segments_394 .wav Feature extraction complexity ratio: x realtime Regression task f0_std Param MSE T-MSE R-MSE best Selected params: Regression task f0 Param MSE T-MSE R-MSE best Selected params: 1 of 120 .1595 2 of 120 .957 3 of 120 .632 4 of 120 .748 5 of 120 .2835 119 of 120 .483 120 of 120 .315 44.7 0.0010 736.98 10.46 661.43 0.0050 585.38 9.64 573.52 0.0100 522.73 9.17 539.87 0.0500 529.10 7.41 657.36 0.1000 379.13 6.73 541.03 0.2500 361.75 5.09 585.60 0.5000 323.20 3.88 522.12 1.0000 386.30 2.58 590.08 5.0000 782.14 0.99 548.65 10.0000 1140.95 0.47 529.20 0.50000 0.0010 3103.83 44.65 3121.97 0.0050 2772.07 41.38 3098.40 0.0100 2293.79 37.57 2935.42 0.0500 1206.49 19.69 2999.49 0.1000 1012.29 13.94 3115.49 0.2500 839.82 8.64 3147.30 0.5000 758.04 5.62 2917.62 1.0000 689.12 3.53 3087.71 5.0000 892.52 1.07 3061.10 10.0000 1158.60 0.47 2889.27 1.00000 The 1st column on the results above represents the classifier's parameter evaluated during the experiment. The 2nd column is the Mean Square Error (MSE) of the estimated parameter (f0_std and f0 for the two models), measured on the internal test (validation) data of each experiment. The 3rd column shows the training MSE and the 4th column shows an estimate of the random model (to be used as a baseline metric). We can see that the f0_std target is much harder to estimate through a regression model: the best MSE achieved (320) is just slightly better than the average random MSE (around 550). On the other hand, for the f0 target, the trained regression model achieves a 680 MSE with a baseline error of around 3000. In other words, the model achieves a x5 performance-boosting related to the random model, while for the f0_std this boosting is just around x1.5. Now, once the two regression models are trained, evaluated and saved, we can use them to map any audio segment to either f0 or f0_std. demonstrates how to do this using : Example10 audioTrainTest.file_regression() glob csv os numpy np plotly.graph_objs go plotly pyAudioAnalysis audioTrainTest aT wav_files_to_test = glob.glob( ) ground_truths = {} open( , ) file: reader = csv.reader(file, delimiter = ) row reader: ground_truths[row[ ]] = float(row[ ]) estimated_val, gt_val = [], [] w wav_files_to_test: values, tasks = aT.file_regression(w, , ) os.path.basename(w) ground_truths: estimated_val.append(values[tasks.index( )]) gt_val.append(ground_truths[os.path.basename(w)]) mse = ((np.array(estimated_val) - np.array(gt_val))** ).mean() print( ) p = go.Scatter(x=gt_val, y=estimated_val, mode= ) mylayout = go.Layout(xaxis=dict(title= ), yaxis=dict(title= ), showlegend= ) plotly.offline.iplot(go.Figure(data=[p, go.Scatter(x=[min(gt_val+ estimated_val), max(gt_val+ estimated_val)], y=[min(gt_val+ estimated_val), max(gt_val+ estimated_val)])], layout=mylayout)) # Example10 # load trained regression model for f0 and apply it to a folder # of WAV files and evaluate (use csv file with ground truths) import import import import as import as import from import as # read all files in testing folder: "data/regression/f0/segments_test/*.wav" with 'data/regression/f0/segments_test/f0.csv' 'r' as ',' for in 0 1 for in # for each audio file # get the estimates for all regression models starting with "singing" "singing" "svm" # check if there is ground truth available for the current file if in # ... and append ground truth and estimated values # for the f0 task 'f0' # compute mean square error: 2 f'Testing MSE= ' {mse} # plot real vs predicted results 'markers' "f0 real" "f0 predicted" False In this example, we demonstrate how can be used for a set of files from a test dataset. Note that this function returns decisions and task names for all available regression tasks that start with the provided prefix (in our case "singing"). The results are shown below: we can see that the real and predicted values are pretty close for the f0 task. audioTrainTest.file_regression() MSE=492.7141430351564 Testing audioTrainTest.feature_extraction_train_regression() reads a folder of WAV files and assumes that each CSV of ( , ) format, is a regression ground-truth file. It then extracts audio features, trains and saves the respective number of models using a prefix (provided also as argument). audioTrainTest.file_regression() reads the saved models and returns predicted regression outputs for all tasks. About Audio Segmentation Until now we have seen how to train supervised models that map segment-level audio feature statistics to either class labels (audio classification) or real-valued targets (audio regression). Also, we have seen how to use these models to predict the label of an unknown audio file, e.g. a speech utterance or a whole song or a song's segment. In all these cases, the assumption followed was that the unknown audio signals . For example, a song belongs to a particular genre, a singing segment has a particular pitch value and a speech utterance has a particular emotion. However, in real-world applications, there are many cases in which audio signals are not segments of homogeneous content, but complex audio streams that contain many successive segments of different content labels. A recording of a real-world dialog, for instance, is a sequence of labels of speaker identities or emotions. belonged to a single label Real-world recordings are not segments of homogeneous content but sequences of segments of different labels For that reason, audio is an important step of audio analysis and it is about segmenting a long audio recording to a sequence of segments that are of homogeneous content. The definition of homogeneity is relative to the application domain: if, for example, we are interested in speaker recognition, a segment is considered homogeneous if it belongs to the same speaker. segmentation Audio Segmentation: using pretrained models (supervised) hTAudio segmentation algorithms can be divided into two categories: (a) supervised and (b) unsupervised or semisupervised. Supervised segmentation is based upon a pretrained segment model that is able to classify homogeneous segments. In this section we will show how to achieve segmentation using a simple fix-size segment and a pre-trained model. Suppose you have trained a segment model to distinguish between classes S and M. The segmentation method presented in this Section is as simple as that: split the audio recording to non-overlapping fix-sized segments with the same length as the one used to train the model. Then classify each fix-sized segment of the audio stream using the trained model and finally merge successive segments that contain the same class label. This process is illustrated in the following diagram. In Example6 we had trained a model that classifies unknown music segments to "metal" and "classical" (model was saved in file ). Let's use this model to segment a 30-sec recording that contains both metal and classical (non-overlapping) parts. This recording is stored in of the article's code. Also, contains the respective annotation file of the format \t \t . This is the ground truth file: svm_classical_metal data/music/metal_classical_mix.wav data/music/metal_classical_mix.segment ground-truth classical metal classical metal 0 7.5 7.5 15 15 19 19 29 The fix-window supervised segmentation functionality is implemented in function , as shown in : audioSegmentation.mid_term_file_classification() Example11 pyAudioAnalysis.audioSegmentation mid_term_file_classification, labels_to_segments pyAudioAnalysis.audioTrainTest load_model labels, class_names, _, _ = mid_term_file_classification( , , , , ) print( ) il, l enumerate(labels): print( ) cl, m, s, m_classes, mt_win, mt_step, s_win, s_step, c_beat = load_model( ) print( ) segs, c = labels_to_segments(labels, mt_step) iS, seg enumerate(segs): print( ) # Example 11 # Supervised audio segmentation example: # - Apply model "svm_classical_metal" to achieve fix-sized, supervised audio segmentation # on file data/music/metal_classical_mix.wav # - Function audioSegmentation.mid_term_file_classification() uses pretrained model and applies # the mid-term step that has been used when training the model (1 sec in our case as shown in Example6) # - data/music/metal_classical_mix.segments contains the ground truth of the audio file from import from import "data/music/metal_classical_mix.wav" "svm_classical_metal" "svm_rbf" True "data/music/metal_classical_mix.segments" "\nFix-sized segments:" for in f'fix-sized segment : ' {il} {class_names[int(l)]} # load the parameters of the model (actually we just want the mt_step here): "svm_classical_metal" # print "merged" segments (use labels_to_segments()) "\nSegments:" for in f'segment sec - sec: ' {iS} {seg[ ]} 0 {seg[ ]} 1 {class_names[int(c[iS])]} returns a list of label ids (one for each fix-sized segment window), a list of class names and the accuracy and confusion matrix (if ground truth is also provided, as in the example above). The list corresponds to fix-sized segments of length equal to the segment step used during training of the model (1 second in the above example, according to Example6). That's why we use , to load the segment window directly from the model file. Also, we use to generate the list of final segments, based on the simple merging route (i.e concatenate successive 1-sec segments that have the same label). audioSegmentation.mid_term_file_classification() labels audioTrainTest.load_model() audioSegmentation.labels_to_segments() The output of the above code is the following (red corresponds to ground truth and blue to predicted segment labels): Overall Accuracy: Fix-sized segments: fix-sized segment : classical fix-sized segment : classical fix-sized segment : classical fix-sized segment : classical fix-sized segment : classical fix-sized segment : classical fix-sized segment : classical fix-sized segment : metal fix-sized segment : metal fix-sized segment : metal fix-sized segment : metal fix-sized segment : metal fix-sized segment : classical fix-sized segment : metal fix-sized segment : metal fix-sized segment : classical fix-sized segment : classical fix-sized segment : classical fix-sized segment : metal fix-sized segment : metal fix-sized segment : classical fix-sized segment : metal fix-sized segment : classical fix-sized segment : classical fix-sized segment : metal fix-sized segment : metal fix-sized segment : metal fix-sized segment : metal fix-sized segment : metal fix-sized segment : metal Segments: segment sec - sec: classical segment sec - sec: metal segment sec - sec: classical segment sec - sec: metal segment sec - sec: classical segment sec - sec: metal segment sec - sec: classical segment sec - sec: metal segment sec - sec: classical segment sec - sec: metal 0.79 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 0 0.0 7.0 1 7.0 12.0 2 12.0 13.0 3 13.0 15.0 4 15.0 18.0 5 18.0 20.0 6 20.0 21.0 7 21.0 22.0 8 22.0 24.0 9 24.0 29.0 returns the more "compact" and useful information for the end "user". Also, note that the segmentation are either: audioSegmentation.labels_to_segments() errors due to of the (e.g. segments 22 and 23 are misclassified as classical when their true label is metal or (a) misclassifications segment classifier due to issues: e.g. according to ground truth, the 1st segment classical music ends at 7.5 sec, while our model is applied every 1 second, so the best this fix-window methodology will achieve is to either recognize classical until 7 or 8 sec. Obviously, this can be handled through a smaller step in the segment window (i.e. by introducing a segment overlap), however this will be with significant increase in computational demands (more segment-level predictions will take place). (b) time resolution Audio Segmentation: unsupervised So we've seen how, given a pretrained audio segment model, we can split an audio recording into segments of homogeneous content. In many cases though, we are not aware of the exact classification problem, or we do not have the data to train such a classifier. Such cases require unsupervised or semi-supervised solutions as shown in the use-cases below: Music segmentation Extracting structural parts from a music track is a typical use-case where unsupervised audio analysis can be used. Since it is obviously rather difficult to have a classifier that distinguishes between song parts, we can answer the question: In the following example, M. Jacksons "Billie Jean" is used as input to the previously described segment-level feature extraction process and a simple k-means clustering is applied on the resulting feature vector sequences. Then, the segments of each cluster are concatenated into an artificial recording and saved to audio files. Each artificial "cluster recording" shows how song parts can be grouped and if this grouping makes some sense in terms of music structure. The code is shown in : can you group song segments so that segments of the same group sound like they belong to the same song part? Example12 os, sklearn.cluster pyAudioAnalysis.MidTermFeatures mid_feature_extraction mT pyAudioAnalysis.audioBasicIO read_audio_file, stereo_to_mono pyAudioAnalysis.audioSegmentation labels_to_segments pyAudioAnalysis.audioTrainTest normalize_features numpy np scipy.io.wavfile wavfile IPython input_file = fs, x = read_audio_file(input_file) mt_size, mt_step, st_win = , , [mt_feats, st_feats, _] = mT(x, fs, mt_size * fs, mt_step * fs, round(fs * st_win), round(fs * st_win * )) (mt_feats_norm, MEAN, STD) = normalize_features([mt_feats.T]) mt_feats_norm = mt_feats_norm[ ].T n_clusters = x_clusters = [np.zeros((fs, )) i range(n_clusters)] k_means = sklearn.cluster.KMeans(n_clusters=n_clusters) k_means.fit(mt_feats_norm.T) cls = k_means.labels_ segs, c = labels_to_segments(cls, mt_step) sp range(n_clusters): count_cl = i range(len(c)): c[i] == sp segs[i, ]-segs[i, ] > : count_cl += cur_x = x[int(segs[i, ] * fs): int(segs[i, ] * fs)] x_clusters[sp] = np.append(x_clusters[sp], cur_x) x_clusters[sp] = np.append(x_clusters[sp], np.zeros((fs,))) print( ) wavfile.write( , fs, np.int16(x_clusters[sp])) IPython.display.display(IPython.display.Audio( )) # Example 12: Unsupervised Music Segmentation # # This example groups of song segments to clusters of similar content import from import as from import from import from import import as import as import # read signal and get normalized segment feature statistics: "data/music/billie_jean.wav" 5 0.5 0.1 0.5 0 # perform clustering 5 for in # save clusters to concatenated wav files # convert flags to segment limits for in 0 for in # for each segment in each cluster (>2 secs long) if and 1 0 2 1 # get the signal and append it to the cluster's signal (followed by some silence) 0 1 # write cluster's signal into a WAV file f'cluster : segments sec total dur' {sp} {count_cl} {len(x_clusters[sp])/float(fs)} f'cluster_ .wav' {sp} f'cluster_ .wav' {sp} The above code saves artificial cluster sounds to WAV files and also displays them in a mini player in the notebook itself, but I've also uploaded the cluster sounds to YouTube (couldn't think of an obvious way to embed them in the article). So, let's listen to the resulting clusters and see if they correspond to homogeneous song parts: This is clearly the of the song, repeated twice (second time is much longer though as it includes more successive repetitions and a small solo) chorus Cluster has a single segment that corresponds to the song's . 2 intro Cluster is the of the song 3 pre-chorus The cluster contains segments from the of the song (if you exclude the small segment in the beginning). The 5th cluster is not shown as it just included a very short almost-silent segment at the beginning of the song. In all cases, clusters represented (with some errors of course) structural song components, even using this very simple approach, and without making use of any "external" supervised knowledge, other than similar features may mean similar music content. 4th verses Clusters of song segments may correspond to stuctural song elements if appropriate audio features are used Finally, note that, executing the code above may result in the same clustering but with different ordering of cluster IDs (and therefore order in the resulting audio files). This is probably due to the k-means random seed. Speaker diarization This is the task that, given an unknown speech recording, answers the question: "who speaks when?". For the sake of simplicity let's assume that we already know the number of speakers in the recording. What is the most straightforward way to solve this task? Obviously, first extract segment-level audio features and then perform some type of clustering, hoping that the resulting clusters will correspond to speaker IDs. In the following example (13), we use the exact same pipeline as the one followed in Example12, where we clustered a song to its structural parts. We have only changed the segment window size to 2 sec with a step of 0.1 sec and a smaller short-term window (50msec), since speech signals are, in general, characterized with faster changes in their main attributes, due to the existence of very different phonemes, some of which last just a few seconds (on the other hand musical note last several msecs, even in the fastest types of music). So Example13, uses the same rationalle of clustering of audio feature vectors. This time the input signal is a speech signal with 4 speakers (this is known beforehand), so we set our kmeans cluster size to 4: os, sklearn.cluster pyAudioAnalysis.MidTermFeatures mid_feature_extraction mT pyAudioAnalysis.audioBasicIO read_audio_file, stereo_to_mono pyAudioAnalysis.audioSegmentation labels_to_segments pyAudioAnalysis.audioTrainTest normalize_features numpy np scipy.io.wavfile wavfile IPython input_file = fs, x = read_audio_file(input_file) mt_size, mt_step, st_win = , , [mt_feats, st_feats, _] = mT(x, fs, mt_size * fs, mt_step * fs, round(fs * st_win), round(fs * st_win * )) (mt_feats_norm, MEAN, STD) = normalize_features([mt_feats.T]) mt_feats_norm = mt_feats_norm[ ].T n_clusters = x_clusters = [np.zeros((fs, )) i range(n_clusters)] k_means = sklearn.cluster.KMeans(n_clusters=n_clusters) k_means.fit(mt_feats_norm.T) cls = k_means.labels_ segs, c = labels_to_segments(cls, mt_step) sp range(n_clusters): count_cl = i range(len(c)): c[i] == sp segs[i, ]-segs[i, ] > : count_cl += cur_x = x[int(segs[i, ] * fs): int(segs[i, ] * fs)] x_clusters[sp] = np.append(x_clusters[sp], cur_x) x_clusters[sp] = np.append(x_clusters[sp], np.zeros((fs,))) print( ) wavfile.write( , fs, np.int16(x_clusters[sp])) IPython.display.display(IPython.display.Audio( )) import from import as from import from import from import import as import as import # read signal and get normalized segment feature statistics: "data/diarization_example.wav" 2 0.1 0.05 0.5 0 # perform clustering 4 for in # save clusters to concatenated wav files # convert flags to segment limits for in 0 for in # for each segment in each cluster (>2 secs long) if and 1 0 2 1 # get the signal and append it to the cluster's signal (followed by some silence) 0 1 # write cluster's signal into a WAV file f'speaker : segments sec total dur' {sp} {count_cl} {len(x_clusters[sp])/float(fs)} f'diarization_cluster_ .wav' {sp} f'diarization_cluster_ .wav' {sp} This is the initial recording And these are the 4 resulting clusters (results are also written in inline audio clips in jupiter notebook again): In the above example speaker clustering (or speaker diarization as we usually call it) was quite successfull with a few errors at the begining of the segments, mainly due to time resolution limitations (2-sec window has been used). Of course this is not always the case: speaker diarization is a hard task, especially if (a) a lot of background noise is present (b) the number of speakers is unknown beforehand (c) the speakers are not balanced (e.g. a speaker speaks 60% of the time and another speaker just 0.5% of the time). The code of this article is provided as a jupiter notebook . in this GitHub repo