The code for this example is available on GitHub:
https://github.com/lzhenboy/Speech2Text.git
1. Extract audio from video
Goal: extract the audio track from a video file
Tool: ffmpeg
Install:
brew install ffmpeg
Example:
ffmpeg -i video-demo.mp4 -f wav -ar 16000 audio-demo.wav
Parameters:
-i video-demo.mp4 # input file path
-f wav # output audio format (WAV)
-ar 16000 # audio sample rate of 16000 Hz
audio-demo.wav # output file path
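The extraction step can also be driven from Python via subprocess, which is convenient when converting many videos. A minimal sketch (the helper name `build_ffmpeg_cmd` is illustrative, and ffmpeg must be on the PATH):

```python
import subprocess

def build_ffmpeg_cmd(video_path, audio_path, sample_rate=16000):
    """Build the ffmpeg command that extracts a 16 kHz WAV track from a video."""
    return [
        "ffmpeg", "-y",           # -y: overwrite the output file if it exists
        "-i", video_path,         # input video file
        "-f", "wav",              # output container format
        "-ar", str(sample_rate),  # audio sample rate
        audio_path,
    ]

def extract_audio(video_path, audio_path):
    # Raises CalledProcessError if ffmpeg exits with a non-zero status
    subprocess.run(build_ffmpeg_cmd(video_path, audio_path), check=True)

if __name__ == "__main__":
    print(build_ffmpeg_cmd("video-demo.mp4", "audio-demo.wav"))
```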
2. Speech recognition
Goal: transcribe the audio to text
Open-source implementation: SpeechRecognition
Install (the example below also uses pydub for splitting the audio):
pip install SpeechRecognition pydub
Example:
# -*- coding: utf-8 -*-
# @Time : 2020/7/30 3:43 PM
# @Author : lzhenboy
import os
import time
import json
import wave
import datetime
import speech_recognition as sr
from pydub import AudioSegment
class Speech2Text(object):

    def __init__(self):
        # Interval (in seconds) at which the audio file is split; default 30s
        self.time_interval = 30
        self.recognizer = sr.Recognizer()

    def split_speech(self, speech_file, split_speech_dir):
        # Get the audio duration: frame count / frame rate
        with wave.open(speech_file, 'rb') as fin:
            time_length = int(fin.getparams()[3] / fin.getparams()[2])
        speech_base_name = os.path.splitext(os.path.basename(speech_file))[0]
        print('File `{}.wav` speech duration: {}'.format(speech_base_name, time_length))
        # Split the audio and write out the chunks
        os.makedirs(split_speech_dir, exist_ok=True)
        audio = AudioSegment.from_wav(speech_file)
        kn = int(time_length / self.time_interval) + 1
        for i in range(kn):
            # Each chunk is extended by 2s so words at a boundary are not cut off
            audio[i * self.time_interval * 1000:((i + 1) * self.time_interval + 2) * 1000].export(
                os.path.join(split_speech_dir, '{}-{}.wav'.format(speech_base_name, i)),
                format='wav')
    def convert_speech_to_text(self, split_speech_dir, speech_text_dir):
        os.makedirs(speech_text_dir, exist_ok=True)
        # Walk the audio chunks in the split directory, in a stable order
        start_time = datetime.datetime.now()
        speech_text = {}
        for idx, name in enumerate(sorted(os.listdir(split_speech_dir))):
            print('%d %s: starting conversion' % (idx, name))
            # Recognize the audio chunk by chunk
            try:
                with sr.AudioFile(os.path.join(split_speech_dir, name)) as source:
                    audio = self.recognizer.record(source)
                # Google's API gives the best results
                text = self.recognizer.recognize_google(audio, language='zh-CN')
                time.sleep(5)
                temp_time = datetime.datetime.now().strftime('%Y-%m-%d %H:%M:%S')
                print('%s %d %s: done' % (temp_time, idx, name))
                speech_text[idx] = text
            except Exception as e:
                print(e)
                temp_time = datetime.datetime.now().strftime('%Y-%m-%d %H:%M:%S')
                print('%s %d %s: failed' % (temp_time, idx, name))
                speech_text[idx] = ''
                continue
        end_time = datetime.datetime.now()
        print('Total time elapsed: %s' % (end_time - start_time))
        with open(os.path.join(speech_text_dir, 'speech_text.json'), 'w', encoding='utf-8') as fout:
            json.dump(speech_text, fout, indent=4, ensure_ascii=False)
if __name__ == '__main__':
    speech_demo_file = './data/audio-demo.wav'
    split_speech_demo_dir = './data/split_speech_file'
    speech_text_demo_dir = './data'
    st = Speech2Text()
    # Step 1: split the audio into chunks
    st.split_speech(speech_demo_file, split_speech_demo_dir)
    # Step 2: recognize each chunk and dump the text to JSON
    st.convert_speech_to_text(split_speech_demo_dir, speech_text_demo_dir)
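The slicing arithmetic used by `split_speech` above can be isolated for clarity: fixed-length chunks, each extended by the overlap so that words at a chunk boundary appear in both neighbouring chunks. A small sketch (the helper name `chunk_ranges` is illustrative):

```python
def chunk_ranges(duration_s, interval_s=30, overlap_s=2):
    """Return (start_ms, end_ms) slices matching split_speech's chunking:
    interval_s-long chunks, each extended by overlap_s seconds."""
    n_chunks = int(duration_s / interval_s) + 1
    return [(i * interval_s * 1000, ((i + 1) * interval_s + overlap_s) * 1000)
            for i in range(n_chunks)]

# A 65-second file yields three chunks: 0-32s, 30-62s, 60-92s
# (pydub simply truncates a slice that runs past the end of the audio)
print(chunk_ranges(65))
```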
Evaluation:
For audio with continuous speech, Google's speech-recognition API works well. For audio containing long pauses, however, recognition quality is mediocre: the portion after the pause is often dropped.
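One way to mitigate the long-pause problem is to split on detected silence rather than at fixed 30-second intervals, so each chunk handed to the recognizer contains continuous speech (pydub ships `pydub.silence.split_on_silence` for exactly this). The core idea can be sketched with plain per-frame energies; the function name, threshold, and frame counts below are illustrative:

```python
def find_speech_segments(energies, silence_thresh=0.01, min_silence_frames=3):
    """Return (start, end) frame-index pairs of contiguous non-silent audio.

    energies: per-frame energy values. A run of at least min_silence_frames
    frames below silence_thresh is treated as a pause boundary; shorter dips
    (e.g. gaps between words) are kept inside the current segment.
    """
    segments = []
    start = None       # index where the current speech segment began
    silent_run = 0     # length of the current run of silent frames
    for i, e in enumerate(energies):
        if e >= silence_thresh:
            if start is None:
                start = i
            silent_run = 0
        elif start is not None:
            silent_run += 1
            if silent_run >= min_silence_frames:
                # Close the segment just before the silence began
                segments.append((start, i - silent_run + 1))
                start = None
                silent_run = 0
    if start is not None:
        segments.append((start, len(energies)))
    return segments
```

Splitting each detected segment into its own WAV file, then recognizing segment by segment, avoids feeding the API a chunk whose second half sits behind a long pause.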