+98 21 8609-3065 h.veisi@ut.ac.ir
Persian Speech Recognition using Deep Learning

Automatic speech recognition (ASR) is the translation of spoken words into text. Speech recognition has many applications such as virtual speech assistants (e.g., Apple’s Siri, Google Now, and Microsoft’s Cortana), speech-to-speech translation, voice dictation and etc.

As shown in Fig. 1, an ASR system has five main components: signal processing and feature extraction, acoustic model (AM), language model (LM), lexicon and hypothesis search. The signal processing and feature extraction component takes the audio signal as the input, enhances the speech by removing noises and extracts feature vectors. The acoustic model integrates knowledge about acoustics and phonetics, takes the features as its input, and recognize phonemes. Language model contains information about structure of the language. Lexicon includes all words that audio signal can be mapped to them. The hypothesis search component combines AM and LM scores and outputs the word sequence with the highest score as the recognition result.

Fig 1- Speech Recognition Components

There are several methods for creating acoustic model, such as hidden markov model (HMM) [5] and artificial neural network (ANN) [1]. Audio signals as a sequential data, current input depends on previous inputs. Recurrent neural networks (RNNs) [1] benefits for their ability to learn sequential data. But for standard RNN architectures, the range of context that can be in practice accessed is quite limited. This problem is often referred to in the literature as the vanishing gradient problem. In 1997, after introducing long short term memory (LSTM) [1-3] neural network, the problem of limitation in processing sequential data resolved. An LSTM network is the same as a standard RNN, except that units in the hidden layer are replaced by memory blocks, as illustrated in Fig 2.

Fig 2- An LSTM network [1]
As shown in Fig 3, each block contains one or more self-connected memory cells and three multiplicative units named the input, output and forget gates. The multiplicative gates allow LSTM memory cells to store and access information over long periods of time, thereby mitigating the vanishing gradient problem.

 

 

Fig 3- An LSTM memory block [1]

One shortcoming of conventional RNNs is that they are only able to make use of previous context. In speech recognition, there is we can also benefits from future context as well. Bidirectional RNNs (BRNNs) [6] do this by processing the data in both directions with two separate hidden layers, which are then fed forwards to the same output layer. The structure of BRNN is shown in Fig 4.

 

Fig 4- A Bidirectional RNN [4]

A deep neural network (DNN) [5] is an ANN with multiple hidden layers of units between the input and output layers that provide a complex and nonlinear modeling. In this project, we use deep bidirectional LSTM neural (DBLSTM) [4] network for implementing a Persian ASR system. The structure of deep bidirectional LSTM is shown in Fig 5.

 

Fig 5- Deep Bidirectional LSTM [4]

 

Some related references are:

[1]      Graves, A., Supervised sequence labelling with recurrent neural networks. Vol. 385. 2012: Springer.[2]      Gers, F., Long short-term memory in recurrent neural networks. Unpublished PhD dissertation, École Polytechnique Fédérale de Lausanne, Lausanne, Switzerland, 2001.[3]      Hochreiter, S. and J. Schmidhuber, Long short-term memory. Neural computation, 1997. 9(8): p. 1735-1780.[4]      Graves, A., N. Jaitly, and A.-r. Mohamed. Hybrid speech recognition with deep bidirectional LSTM. in Automatic Speech Recognition and Understanding (ASRU), 2013 IEEE Workshop on. 2013. IEEE[5]      Yu, D. and L. Deng, Automatic Speech Recognition. Signals and Communication Technology. 2015: Springer London[6]      Schuster, M. and K.K. Paliwal, Bidirectional recurrent neural networks. Signal Processing, IEEE Transactions on, 1997. 45(11): p. 2673-2681.

Other Projects

Speaker Recognition

Speaker Recognition

As a biometric, voice can be used to identify or verify a speaker. Speaker recognition denotes techniques for identification/verification of the...

read more
Data Analysis

Data Analysis

Data analysis (affinity analysis or data mining) includes computational techniques for discovering relationships of data that come from activities...

read more
Speech Recognition

Speech Recognition

Automatic Speech Recognition (ASR) denotes to techniques that convert spoken speech into text. Our ASR in DSP Lab concentrates on Persian speech...

read more

Notice: ob_end_flush(): Failed to send buffer of zlib output compression (0) in /home/smj97ir/public_html/wp-includes/functions.php on line 5464