hsinfu's Blog

[Summarization] Deep neural networks for acoustic modeling in speech recognition: The shared views of four research groups

Hinton, Geoffrey, et al. "Deep neural networks for acoustic modeling in speech recognition: The shared views of four research groups." Signal Processing Magazine, IEEE 29.6 (2012): 82-97

This paper describes how DNN + HMM can replace GMM + HMM, the previous state-of-the-art method in speech recognition. Replacing the GMM with a DNN improves performance by a large margin, so most speech recognition work now adopts the DNN + HMM architecture, shown in the figure below. The acoustic observations used as input are mostly MFCC or PLP features.

*HMM = Hidden Markov Models
*DNN = Deep Neural Networks
*GMM = Gaussian Mixture Models
*MFCC = Mel Frequency Cepstral Coefficients
*PLP = Perceptual Linear Predictive coefficients
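
To give a rough idea of how the DNN slots into the HMM in this hybrid setup: the network outputs a posterior over HMM states for each acoustic frame, while the HMM decoder expects a likelihood, so the posterior is divided by the state prior to get a scaled likelihood. The sketch below is only an illustration of that conversion. The function name and the toy numbers are hypothetical and are not taken from the paper.

```python
import numpy as np

def scaled_log_likelihoods(log_posteriors, state_priors):
    """Convert DNN state posteriors into scaled log-likelihoods for an HMM.

    By Bayes' rule p(o_t | s) = p(s | o_t) * p(o_t) / p(s). The term p(o_t)
    is the same for every state, so it can be dropped during decoding:
    log p(o_t | s) + const = log p(s | o_t) - log p(s).
    """
    return log_posteriors - np.log(state_priors)

# Hypothetical toy numbers: 2 frames, 3 HMM states.
posteriors = np.array([[0.7, 0.2, 0.1],
                       [0.1, 0.3, 0.6]])   # DNN softmax outputs per frame
priors = np.array([0.5, 0.25, 0.25])       # state frequencies in the training alignment

scores = scaled_log_likelihoods(np.log(posteriors), priors)
```

In a real system the priors would be estimated by counting how often each state appears in the forced alignment of the training data, and `scores` would replace the GMM log-likelihoods in decoding.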

The paper proposes splitting training into two steps:

  1. Layers of feature detectors are initialized, one layer at a time, by fitting a stack of generative models, each of which has one layer of latent variables. These generative models are trained without using any information about the HMM states that the acoustic model will need to discriminate.

  2. Each generative model in the stack is used to initialize one layer of hidden units in a DNN and the whole network is then discriminatively fine-tuned to predict the target HMM states.
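
The two steps above can be sketched in NumPy as follows. This is a minimal toy version under several assumptions of mine: every layer is a Bernoulli RBM trained with one-step contrastive divergence (CD-1), updates are full-batch, the "frames" are random binary vectors, and the "HMM state" labels are synthetic. All function names, layer sizes, and hyperparameters are hypothetical and are not the settings used in the paper.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def train_rbm(data, n_hidden, rng, epochs=10, lr=0.1):
    """Fit one Bernoulli RBM with CD-1; returns (weights, hidden biases)."""
    n, n_visible = data.shape
    W = 0.01 * rng.standard_normal((n_visible, n_hidden))
    b_vis, b_hid = np.zeros(n_visible), np.zeros(n_hidden)
    for _ in range(epochs):
        # Positive phase: hidden probabilities given the data.
        h_prob = sigmoid(data @ W + b_hid)
        h_sample = (rng.random(h_prob.shape) < h_prob).astype(float)
        # Negative phase: one reconstruction step.
        v_recon = sigmoid(h_sample @ W.T + b_vis)
        h_recon = sigmoid(v_recon @ W + b_hid)
        W += lr * (data.T @ h_prob - v_recon.T @ h_recon) / n
        b_vis += lr * (data - v_recon).mean(axis=0)
        b_hid += lr * (h_prob - h_recon).mean(axis=0)
    return W, b_hid

def pretrain_stack(data, layer_sizes, rng):
    """Step 1: greedy layer-wise generative pre-training; no labels used."""
    params, x = [], data
    for n_hidden in layer_sizes:
        W, b = train_rbm(x, n_hidden, rng)
        params.append((W, b))
        x = sigmoid(x @ W + b)  # hidden activities feed the next RBM
    return params

def forward(x, params, W_out, b_out):
    """Return all layer activations and the softmax over HMM states."""
    acts = [x]
    for W, b in params:
        acts.append(sigmoid(acts[-1] @ W + b))
    logits = acts[-1] @ W_out + b_out
    logits -= logits.max(axis=1, keepdims=True)  # numerical stability
    probs = np.exp(logits)
    probs /= probs.sum(axis=1, keepdims=True)
    return acts, probs

def fine_tune(data, labels, params, n_states, rng, epochs=200, lr=0.5):
    """Step 2: add a softmax layer and backpropagate through the whole net."""
    params = [(W.copy(), b.copy()) for W, b in params]
    W_out = 0.01 * rng.standard_normal((params[-1][0].shape[1], n_states))
    b_out = np.zeros(n_states)
    onehot = np.eye(n_states)[labels]
    for _ in range(epochs):
        acts, probs = forward(data, params, W_out, b_out)
        delta = (probs - onehot) / len(data)  # grad of mean cross-entropy w.r.t. logits
        grad_W_out, grad_b_out = acts[-1].T @ delta, delta.sum(axis=0)
        delta = (delta @ W_out.T) * acts[-1] * (1 - acts[-1])
        W_out -= lr * grad_W_out
        b_out -= lr * grad_b_out
        for i in reversed(range(len(params))):
            W, b = params[i]
            grad_W, grad_b = acts[i].T @ delta, delta.sum(axis=0)
            if i > 0:  # propagate the error to the layer below
                delta = (delta @ W.T) * acts[i] * (1 - acts[i])
            params[i] = (W - lr * grad_W, b - lr * grad_b)
    return params, W_out, b_out

# Synthetic stand-in data: 300 binary "frames" of 20 features, 3 "HMM states".
rng = np.random.default_rng(0)
labels = rng.integers(0, 3, size=300)
prototypes = rng.random((3, 20)) < 0.5
frames = (prototypes[labels] ^ (rng.random((300, 20)) < 0.1)).astype(float)

stack = pretrain_stack(frames, [16, 16], rng)                    # step 1
params, W_out, b_out = fine_tune(frames, labels, stack, 3, rng)  # step 2
_, probs = forward(frames, params, W_out, b_out)
```

Note that `pretrain_stack` never sees `labels`, which mirrors the point in step 1 that the generative models are trained without any HMM state information. For real-valued inputs such as MFCCs, the first layer would need a model suited to continuous data instead of the Bernoulli RBM assumed here.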

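The paper first runs experiments on TIMIT, a relatively small database, which show that splitting training into these two steps improves results significantly. The approach is then applied to other datasets. Their experimental results are shown below.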