Human society functions through communication between individuals. Language, in both its written and spoken forms, underpins all aspects of human interaction. Spoken language is the most fundamental form, since it lets individuals communicate with one another using only the human vocal apparatus. Because speech is one of the easiest signals to acquire (all you need is a microphone), is used in a variety of transactional applications (e.g., telephone banking), and has potential for security through surveillance, it comes as no surprise that speaker recognition is one of the key research areas in signal processing and pattern recognition.
Deep Belief Networks (DBNs) have become a popular research area in machine learning and pattern recognition. In recent years, deep learning techniques have been successfully applied to the modeling of speech signals, in tasks such as speech recognition, acoustic modeling, speaker and language recognition, spectrogram coding, voice activity detection, and acoustic-articulatory inversion mapping, as well as to other domains such as 3D object recognition, intelligent video surveillance, and image recognition. DBNs use a powerful strategy of unsupervised training and multiple layers that provides parameter-efficient and accurate acoustic modeling. DBNs have been used successfully in speech recognition to model the posterior probability of a state given a feature vector. Feature vectors are typically standard frame-based acoustic representations (e.g., MFCCs), usually stacked across multiple frames.
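Stacking feature vectors across frames can be illustrated with a minimal NumPy sketch. The function name `stack_frames`, the context sizes, and the edge-padding choice are assumptions made for illustration, not details taken from any particular system.

```python
import numpy as np

def stack_frames(features, left=5, right=5):
    """Stack each frame with its neighbours to form a context window.

    features: array of shape (num_frames, feat_dim), e.g. MFCCs.
    Returns an array of shape (num_frames, (left + 1 + right) * feat_dim).
    Edges are padded by repeating the first/last frame (an assumption).
    """
    features = np.asarray(features, dtype=float)
    num_frames = features.shape[0]
    padded = np.pad(features, ((left, right), (0, 0)), mode="edge")
    # Slice i holds the frame at offset (i - left) relative to each centre frame.
    windows = [padded[i:i + num_frames] for i in range(left + 1 + right)]
    return np.concatenate(windows, axis=1)

mfcc = np.arange(12, dtype=float).reshape(4, 3)   # 4 frames, 3 dummy coefficients
stacked = stack_frames(mfcc, left=1, right=1)
print(stacked.shape)  # (4, 9)
```

Each row of `stacked` is the kind of wide-context input vector that would be presented to the network's visible layer.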
The DBN performs a nonlinear transformation of the input features and produces the probability that an output unit is active, given a wide context of input frames. Pre-training a DBN is based on stacking Restricted Boltzmann Machines (RBMs). An RBM is an undirected graphical model with visible and hidden units in which the only connections are between visible and hidden units (there are no visible-visible or hidden-hidden connections). For real-valued data, we use a Gaussian-Bernoulli RBM (GRBM).
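To make the GRBM concrete, here is a minimal sketch of one trained with a single step of contrastive divergence (CD-1). It assumes unit-variance visible units, a mean-field reconstruction, and arbitrary layer sizes and learning rate; the class and method names are hypothetical, and the training procedure is a common textbook choice rather than one stated in the text above.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

class GaussianBernoulliRBM:
    """Gaussian visible units (unit variance assumed), Bernoulli hidden units."""

    def __init__(self, n_visible, n_hidden, seed=0):
        self.rng = np.random.default_rng(seed)
        self.W = 0.01 * self.rng.standard_normal((n_visible, n_hidden))
        self.a = np.zeros(n_visible)   # visible biases
        self.b = np.zeros(n_hidden)    # hidden biases

    def hidden_probs(self, v):
        # p(h_j = 1 | v) = sigmoid(b_j + sum_i v_i W_ij)
        return sigmoid(v @ self.W + self.b)

    def visible_mean(self, h):
        # Mean of p(v | h), a unit-variance Gaussian centred at a + W h
        return h @ self.W.T + self.a

    def cd1_step(self, v0, lr=0.01):
        """One CD-1 update on a batch; returns the mean squared reconstruction error."""
        h0 = self.hidden_probs(v0)
        h0_sample = (self.rng.random(h0.shape) < h0).astype(float)
        v1 = self.visible_mean(h0_sample)   # mean-field reconstruction, no noise added
        h1 = self.hidden_probs(v1)
        n = v0.shape[0]
        self.W += lr * (v0.T @ h0 - v1.T @ h1) / n
        self.a += lr * (v0 - v1).mean(axis=0)
        self.b += lr * (h0 - h1).mean(axis=0)
        return np.mean((v0 - v1) ** 2)

rng = np.random.default_rng(1)
frames = rng.standard_normal((200, 6))      # stand-in for normalised acoustic features
rbm = GaussianBernoulliRBM(n_visible=6, n_hidden=4)
errors = [rbm.cd1_step(frames) for _ in range(20)]
hidden = rbm.hidden_probs(frames)           # would feed the next RBM in the stack
print(hidden.shape)  # (200, 4)
```

In a stacked setup, the hidden probabilities of one trained RBM serve as the visible data for the next, which would typically be a Bernoulli-Bernoulli RBM because its inputs lie in [0, 1].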
Speaker recognition research differs fundamentally from automatic speech recognition, since the focus in speaker recognition has been on vector-based approaches such as the high-dimensional space of Gaussian mixture model (GMM) supervectors and the more recent i-vectors, extracted by means of factor analysis. In these approaches, a GMM universal background model (GMM-UBM) is used to derive a vector representation of an utterance, which is then used for speaker modeling. Much of this work has focused on strategies for vector classification: simple inner product methods, probabilistic linear discriminant analysis (PLDA), support vector machines, and advanced Bayesian methods. Methods for compensating and modeling speaker session and channel variation, such as within-class covariance normalization (WCCN), nuisance attribute projection (NAP), and PLDA, have also been key topics. We combine DBNs with vector-based methods.
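As a rough illustration of the vector-based pipeline, the sketch below applies WCCN to fixed-length utterance vectors and then scores a trial with a normalised inner product (cosine score). The data are synthetic, and the function names, the ridge term, and the dimensions are assumptions for illustration only; real systems would use i-vectors or supervectors in place of the random vectors.

```python
import numpy as np

def cosine_score(x, y):
    """Normalised inner product between two utterance vectors."""
    return float(x @ y / (np.linalg.norm(x) * np.linalg.norm(y)))

def within_class_cov(vectors, labels):
    """Average, over speakers, of each speaker's covariance around its own mean."""
    vectors = np.asarray(vectors, dtype=float)
    labels = np.asarray(labels)
    classes = np.unique(labels)
    dim = vectors.shape[1]
    W = np.zeros((dim, dim))
    for c in classes:
        centered = vectors[labels == c] - vectors[labels == c].mean(axis=0)
        W += centered.T @ centered / centered.shape[0]
    return W / len(classes)

def wccn_projection(vectors, labels, ridge=1e-6):
    """Return B with B @ B.T = inv(W); row vectors are projected as x @ B."""
    W = within_class_cov(vectors, labels)
    W += ridge * np.eye(W.shape[0])      # small ridge for numerical stability (assumption)
    return np.linalg.cholesky(np.linalg.inv(W))

rng = np.random.default_rng(0)
labels = np.repeat(np.arange(5), 20)                 # 5 hypothetical speakers, 20 utterances each
speaker_means = rng.standard_normal((5, 4))
vectors = speaker_means[labels] + 0.5 * rng.standard_normal((100, 4))
B = wccn_projection(vectors, labels)
projected = vectors @ B
score = cosine_score(projected[0], projected[1])     # a same-speaker trial
```

After projection, the within-class covariance of the training vectors is approximately the identity, so directions dominated by session variability no longer dominate the cosine score.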
Some useful resources:
- B. Schuller et al., “A survey on perceived speaker traits: personality, likability, pathology, and the first challenge,” Computer Speech & Language, vol. 29, no. 1, pp. 32, 2015.