Series navigation
"iOS PocketSphinx: Installing PocketSphinx"
"iOS PocketSphinx: Building the SDK for iOS"
This article: "iOS PocketSphinx: Adapting the Default Acoustic Model"
System environment
macOS 10.15.6
Preparation before training
- If sox is not installed yet, install it:
$ brew install sox
Creating an adaptation corpus
Required files
The actual set of sentences you use is somewhat arbitrary, but ideally it should have good coverage of the most frequently used words or phonemes in the set of sentences or the type of text you want to recognize. For example, if you want to recognize isolated commands, you need to record them. If you want to recognize dictation, you need to record full sentences. For simple voice adaptation we have had good results simply by using sentences from the CMU ARCTIC text-to-speech databases. To that effect, here are the first 20 sentences from ARCTIC, a .fileids file, and a transcription file:
Download them into the working directory, then change arctic20.transcription to a set of Chinese sentences as follows (it does not have to be 20 sentences; here I use the 11 phrases from training my language model):
<s> 天猫精灵 </s> (arctic_0001)
<s> 你好天猫 </s> (arctic_0002)
<s> 来一首歌 </s> (arctic_0003)
<s> 来点音乐 </s> (arctic_0004)
<s> 音量大 </s> (arctic_0005)
<s> 音量小 </s> (arctic_0006)
<s> 声音大一点 </s> (arctic_0007)
<s> 声音小一点 </s> (arctic_0008)
<s> 下一首 </s> (arctic_0009)
<s> 上一首 </s> (arctic_0010)
<s> 切歌 </s> (arctic_0011)
In addition, you need to make sure that you record at a sampling rate of 16 kHz (or 8 kHz if you adapt a telephone model) in mono with a single channel.
The simplest way would be to start a sound recorder like Audacity or Wavesurfer and read all sentences in one big audio file. Then you can cut the audio files on sentences in a text editor and make sure every sentence is saved in the corresponding file. The file structure should look like this:
arctic_0001.wav
arctic_0002.wav
.....
arctic_0011.wav
arctic20.fileids (list of the audio file names, without extensions)
arctic20.transcription (transcript file mapping each Chinese sentence to its audio file name, without extension)
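The list file above can be generated instead of typed by hand. A minimal sketch, assuming the 11 recordings are named arctic_0001.wav through arctic_0011.wav as in this tutorial:

```shell
# Generate arctic20.fileids: one base name per line, no extension.
for i in $(seq -w 1 11); do
  echo "arctic_00${i}"
done > arctic20.fileids
```

The transcription file must then list the same base names, in the same order, in parentheses after each sentence.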
To verify that each recording is OK, play them all with:
$ for i in *.wav; do play $i; done
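Besides listening to the files, the 16 kHz / mono requirement can also be checked automatically with soxi, which is installed together with sox. A sketch under that assumption (check_wavs is a made-up helper name):

```shell
# List every recording that is not 16 kHz mono, so it can be fixed
# (e.g. with: sox in.wav -r 16000 -c 1 out.wav) before extracting features.
check_wavs() {
  for f in *.wav; do
    [ -e "$f" ] || continue                       # no .wav files at all
    rate=$(soxi -r "$f" 2>/dev/null || echo unknown)
    chans=$(soxi -c "$f" 2>/dev/null || echo unknown)
    if [ "$rate" != "16000" ] || [ "$chans" != "1" ]; then
      echo "$f: ${rate} Hz, ${chans} channel(s)"
    fi
  done
}
check_wavs
```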
The default Chinese acoustic model
Copy the official default acoustic model zh-cn downloaded in "iOS PocketSphinx: Building the Language Model" into the working directory (I use the older acoustic model, since the newer one does not seem to work as well). Its file structure is as follows:
feat.params – feature extraction parameters; the list of options used for feature extraction
mdef – definition of the mapping from triphone contexts to GMM IDs (senones)
means – Gaussian codebook means
mixture_weights – Gaussian mixture weights (may be missing if sendump is present)
noisedict – filler word dictionary
transition_matrices – HMM transition matrices
variances – Gaussian codebook variances
Possibly also:
sendump – compressed and quantized mixture weights (can replace mixture_weights)
feature_transform – feature transformation matrix
Adapting the acoustic model also requires a dictionary. Since the phrases we trained do not exist in the official default Chinese dictionary, we use the dictionary tianmao.dict from building the language model. Copy it into the working directory:
$ cp -a /User/../tianmao.dict .
Copy the trained language model into the working directory as well:
$ cp -a /User/../tianmao.lm.bin .
Generating acoustic feature files
We need to generate a set of acoustic feature files from these WAV recordings. The features must be extracted with the same acoustic parameters that were used to train the standard acoustic model; these parameters are stored in the feat.params file in the acoustic model directory.
Extract the features with the sphinx_fe command:
$ sphinx_fe -argfile zh-cn/feat.params -samprate 16000 -c arctic20.fileids -di . -do . -ei wav -eo mfc -mswav yes
This generates, in the current folder, a feature file with the *.mfc extension for every audio file:
arctic_0001.mfc
arctic_0001.wav
arctic_0002.mfc
arctic_0002.wav
arctic_0003.mfc
arctic_0003.wav
.....
arctic_0011.mfc
arctic_0011.wav
arctic20.fileids
arctic20.transcription
zh-cn
tianmao.dict
tianmao.lm.bin
Using the SphinxTrain tools
Adapting the acoustic model is done with programs from the SphinxTrain toolkit. Find sphinxtrain in its installation directory /usr/local/libexec and copy the whole sphinxtrain directory into the working directory (only some of its programs are needed).
Converting the sendump and mdef files
First, make sure you are using the full model with the complete mixture_weights file.
If the mdef in the acoustic model is a binary file, it must first be converted to plain text with the pocketsphinx_mdef_convert program:
$ pocketsphinx_mdef_convert -text zh-cn/mdef zh-cn/mdef.txt
Accumulating observation counts
Now, to collect statistics, run (if you use the newer acoustic model, the parameters of this command must be adjusted according to feat.params):
$ ./sphinxtrain/bw \
-hmmdir zh-cn \
-moddeffn zh-cn/mdef.txt \
-ts2cbfn .ptm. \
-feat s2_4x \
-cmn current \
-agc none \
-dictfn tianmao.dict \
-ctlfn arctic20.fileids \
-lsnfn arctic20.transcription \
-accumdir .
Make sure the arguments in the bw command match the parameters in the feat.params file inside the acoustic model folder. Please note that not all the parameters from feat.params are supported by bw; for example, bw doesn't support upperf or other feature extraction parameters. You only need to use the parameters which are accepted; other parameters from feat.params should be skipped.
(Be sure to compare the parameters one by one; I did not pay attention to this at first.)
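For the one-by-one comparison, it helps to list just the option names that feat.params declares and check each against the bw invocation above. A small sketch (feat_param_names is a made-up helper; it assumes the usual feat.params layout of one "-option value" pair per line):

```shell
# Print the option names (first word of each line) from the model's
# feat.params, to compare against the bw arguments one by one.
feat_param_names() {
  awk '{print $1}' zh-cn/feat.params 2>/dev/null || true
}
feat_param_names
```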
For example, for a continuous model you don’t need to include the svspec option. Instead, you need to use just -ts2cbfn .cont. For semi-continuous models use -ts2cbfn .semi. If the model has a feature_transform file like the en-us continuous model, you need to add the -lda feature_transform argument to bw, otherwise it will not work properly.
If you are missing the noisedict file, you also need an extra step: copy the fillerdict file into the directory that you chose in the hmmdir parameter, renaming it to noisedict.
(In my case the file was not missing, so this can be ignored.)
Creating a transform with MLLR (Maximum Likelihood Linear Regression)
MLLR is a lightweight adaptation method suitable when the amount of data is limited, and it is a good choice for online adaptation. MLLR works best for continuous models; its effect on semi-continuous models is very limited, since they rely mainly on mixture weights. If you want the best accuracy, you can combine MLLR adaptation with the MAP adaptation below.
Find mllr_solve in the SphinxTrain installation directory /usr/local/libexec/sphinxtrain and copy it into the working directory.
Next, we generate the MLLR transform, which is passed to the decoder to adapt the acoustic model at run time. This is done with the mllr_solve program:
$ ./sphinxtrain/mllr_solve \
-meanfn zh-cn/means \
-varfn zh-cn/variances \
-outmllrfn mllr_matrix -accumdir .
This command creates an adaptation data file named mllr_matrix. Now, if you want to decode with the adapted model, simply add -mllr mllr_matrix to the pocketsphinx command line.
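For example, a hypothetical decoding run with the transform applied (the file names follow this tutorial's working directory; the command is wrapped in a function so it only runs when pocketsphinx is actually installed):

```shell
# Decode one of the adaptation recordings, applying the MLLR transform
# at run time via -mllr.
mllr_decode() {
  pocketsphinx_continuous \
    -hmm zh-cn \
    -lm tianmao.lm.bin \
    -dict tianmao.dict \
    -mllr mllr_matrix \
    -infile arctic_0001.wav
}
command -v pocketsphinx_continuous >/dev/null && mllr_decode || true
```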
Updating the acoustic model files with MAP
MAP is a different adaptation method. In this case, unlike MLLR, we do not create a generic transform; instead we update each parameter in the model. We now copy the acoustic model directory and overwrite the copy with the adapted model files:
$ cp -a zh-cn zh-cn-adapt
To apply the adaptation, use the map_adapt program:
$ ./sphinxtrain/map_adapt \
-moddeffn zh-cn/mdef.txt \
-ts2cbfn .ptm. \
-meanfn zh-cn/means \
-varfn zh-cn/variances \
-mixwfn zh-cn/mixture_weights \
-tmatfn zh-cn/transition_matrices \
-accumdir . \
-mapmeanfn zh-cn-adapt/means \
-mapvarfn zh-cn-adapt/variances \
-mapmixwfn zh-cn-adapt/mixture_weights \
-maptmatfn zh-cn-adapt/transition_matrices
Recreating the adapted sendump file
To recreate the sendump file from the updated mixture_weights file, run:
$ ./sphinxtrain/mk_s2sendump \
-pocketsphinx yes \
-moddeffn zh-cn-adapt/mdef.txt \
-mixwfn zh-cn-adapt/mixture_weights \
-sendumpfn zh-cn-adapt/sendump
Congratulations! You now have an adapted acoustic model.
Using the model
After adaptation, the acoustic model is located in the zh-cn-adapt folder. It should contain the following files:
mdef
feat.params
mixture_weights
means
noisedict
transition_matrices
variances
To use the model in PocketSphinx, simply put the model files into your application's resources, then point to it with the -hmm option:
$ pocketsphinx_continuous -hmm <your_new_model_folder> -lm <your_lm> -dict <your_dict> -infile test.wav
Recognizing a file
$ pocketsphinx_continuous -hmm zh-cn-adapt -lm tianmao.lm.bin -dict tianmao.dict -infile arctic_0001.wav
Real-time recognition
$ pocketsphinx_continuous -inmic yes -hmm zh-cn-adapt -lm tianmao.lm.bin -dict tianmao.dict
Thoughts and open questions
My recognition rate still has not improved much; I wonder whether one of the steps went wrong?
What I want is a local voice wake-up (one that works without a network), and PocketSphinx happens to offer offline speech recognition. The goal is the same as the wake-up feature in the Baidu Maps app: its wake word is "小度小度", and the recognition rate and sensitivity are very high. Saying close-sounding words such as "小豆小豆" or "小度度度" may also wake it up, but everyday unrelated words never trigger a false wake-up, which I think is very good.
First, about the official Chinese model cmusphinx-zh-cn-5.2.tar.gz: the recognition rate is really low. Spoken words are often recognized as a jumble of wrong words, and noise gets picked up too; knocking twice on the desk, for example, may also be recognized as a few words. Actually the official English model's recognition rate is not high either, though somewhat better than the Chinese one. Since I do not need English, I only train and adapt the Chinese model.
Next, about training the Chinese model: the official documentation is written around the English model, so there are discrepancies, and online resources are scarce and mostly direct translations of the official documentation. All I can do is dig into the details again and again; the occasional insight is a small joy, yet difficulties remain everywhere. The road ahead is long, and I will keep searching high and low...
At present I only need one wake word, for example "天猫精灵". I kept only the phoneme entries for "天猫精灵" in the language model's dictionary, and trained the acoustic model only with recordings of "天猫精灵". With the resulting model, saying "天猫精灵" is recognized very accurately, but there is a problem: all kinds of noise, or other unrelated words, are also recognized as "天猫精灵", which makes it unusable. Is there no way to filter out noise or clearly different words? I hope we can discuss this together; your pointers might bring me great inspiration and help. Thank you!
The confusion above has been partly resolved: I had been training with the newer official Chinese model, which indeed performs poorly. Now I train with the older official Chinese model, and the result is barely acceptable. Of course, I still know very little and am still exploring.
References:
Adapting the default acoustic model (official tutorial): https://cmusphinx.github.io/wiki/tutorialadapt/
Training the language model and improving the acoustic model of the PocketSphinx speech recognition system: https://www.cnblogs.com/bhlsheji/p/4514475.html
Installing sox and common commands: https://www.jianshu.com/p/7dbfd5799889