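The previous post covered Chinese speech recognition with PaddleSpeech. This one builds the complete meeting-notes tool. The goal: feed in a meeting recording and get back meeting minutes with speaker labels and timestamps.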
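Overall workflow
Meeting recording (mp3/wav)
|
Audio preprocessing (format conversion / noise reduction)
|
VAD, voice activity detection (drop silent stretches)
|
Speaker diarization (who is speaking)
|
Speech recognition (what was said)
|
Timestamp alignment
|
Text post-processing (punctuation / error correction)
|
Meeting minutes generated by GPT
|
Output (Markdown)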
Audio preprocessing
Convert everything to 16 kHz mono WAV with ffmpeg:
import subprocess
from pathlib import Path

def preprocess_audio(input_path: str, output_path: str = None) -> str:
    """Audio preprocessing: convert to 16 kHz mono WAV"""
    if output_path is None:
        # meeting.mp3 -> meeting.processed.wav
        output_path = str(Path(input_path).with_suffix('.processed.wav'))
    cmd = [
        'ffmpeg', '-i', input_path,
        '-ar', '16000',          # 16 kHz sample rate
        '-ac', '1',              # mono
        '-acodec', 'pcm_s16le',  # 16-bit PCM
        '-y', output_path        # overwrite the output file if it exists
    ]
    # check=True raises CalledProcessError if ffmpeg exits with an error
    subprocess.run(cmd, capture_output=True, check=True)
    return output_path
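Since every later stage assumes 16 kHz mono 16-bit input, it can be worth checking the converted file before moving on. The following is a minimal sketch using only the standard-library `wave` module; `check_wav_format` is a hypothetical helper name, not part of the original tool.

```python
import wave

def check_wav_format(path: str, rate: int = 16000,
                     channels: int = 1, sample_width: int = 2) -> bool:
    """Hypothetical helper: True if the WAV header matches the expected format.

    sample_width is in bytes, so 2 means 16-bit PCM.
    """
    with wave.open(path, "rb") as wf:
        return (
            wf.getframerate() == rate
            and wf.getnchannels() == channels
            and wf.getsampwidth() == sample_width
        )
```

Calling `check_wav_format(preprocess_audio("meeting.mp3"))` right after conversion would catch a misconfigured ffmpeg command early, before the heavier models are loaded.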
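VAD (voice activity detection)
Long meeting recordings contain a lot of silence, so cut it out with VAD first. silero-vad is lightweight and accurate: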
import torch

def detect_speech_segments(audio_path: str, threshold: float = 0.5):
    """Detect the segments that contain speech"""
    model, utils = torch.hub.load(
        'snakers4/silero-vad', 'silero_vad',
        force_reload=False, onnx=True
    )
    # utils is a tuple of helpers; only the first and third are needed here
    get_speech_ts, _, read_audio, *_ = utils
    wav = read_audio(audio_path, sampling_rate=16000)
    speech_timestamps = get_speech_ts(
        wav, model,
        threshold=threshold,
        min_speech_duration_ms=250,
        min_silence_duration_ms=300,
        sampling_rate=16000,
    )
    # Return format: [{'start': 1000, 'end': 5000}, ...], in samples (not seconds)
    return speech_timestamps
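Because the VAD result is expressed in samples while the rest of the pipeline works in seconds, a small conversion step helps avoid unit mix-ups. This is a sketch; `samples_to_seconds` and `total_speech_seconds` are hypothetical helper names, and the 16000 default mirrors the sampling rate used above.

```python
def samples_to_seconds(speech_timestamps, sampling_rate: int = 16000):
    """Hypothetical helper: convert sample-index timestamps to seconds."""
    return [
        {"start": ts["start"] / sampling_rate, "end": ts["end"] / sampling_rate}
        for ts in speech_timestamps
    ]

def total_speech_seconds(speech_timestamps, sampling_rate: int = 16000) -> float:
    """Hypothetical helper: total duration of detected speech, in seconds."""
    return sum(ts["end"] - ts["start"] for ts in speech_timestamps) / sampling_rate
```

Comparing `total_speech_seconds(...)` with the recording length gives a quick sense of how much silence VAD removes.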
Speaker diarization
Speaker diarization uses pyannote.audio, one of the strongest open-source options at the time of writing:
from pyannote.audio import Pipeline

def diarize_speakers(audio_path: str, hf_token: str):
    """Speaker diarization"""
    pipeline = Pipeline.from_pretrained(
        "pyannote/speaker-diarization-3.0",
        use_auth_token=hf_token  # requires a HuggingFace token
    )
    diarization = pipeline(audio_path)
    segments = []
    for turn, _, speaker in diarization.itertracks(yield_label=True):
        segments.append({
            'start': turn.start,  # seconds
            'end': turn.end,      # seconds
            'speaker': speaker,   # SPEAKER_00, SPEAKER_01, ...
        })
    return segments

# Example result:
# [
#     {'start': 0.5, 'end': 3.2, 'speaker': 'SPEAKER_00'},
#     {'start': 3.5, 'end': 8.1, 'speaker': 'SPEAKER_01'},
#     {'start': 8.3, 'end': 12.0, 'speaker': 'SPEAKER_00'},
# ]
pyannote requires you to request access to the model on HuggingFace (it's free), and the first run downloads about 1.5 GB of model files.
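The segment list returned by `diarize_speakers` is plain data, so simple statistics are easy to derive from it. As a sketch, the hypothetical helper `speaking_time_by_speaker` below sums how long each speaker talked; it only assumes the `start`/`end`/`speaker` keys shown in the example result.

```python
from collections import defaultdict

def speaking_time_by_speaker(segments):
    """Hypothetical helper: total speaking time in seconds per speaker label.

    Overlapping speech is counted once for each speaker involved.
    """
    totals = defaultdict(float)
    for seg in segments:
        totals[seg["speaker"]] += seg["end"] - seg["start"]
    return dict(totals)
```

A speaker with only a second or two of total speaking time is often a diarization artifact and may be worth reviewing by hand.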
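Speech recognition + timestamp alignment
Whisper can emit word-level timestamps (word_timestamps=True). The alignment here works at segment level: each Whisper segment is assigned the speaker who is active at the segment's midpoint.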
import whisper

def transcribe_with_timestamps(audio_path: str, model_size: str = "large-v2"):
    """Speech recognition with timestamps"""
    model = whisper.load_model(model_size)
    result = model.transcribe(
        audio_path,
        language="zh",
        task="transcribe",
        word_timestamps=True,  # enable word-level timestamps
        condition_on_previous_text=True,
        verbose=False,
    )
    return result

def align_speakers_and_text(diarization_segments, whisper_result):
    """Align the diarization result with the recognized text"""
    aligned = []
    for segment in whisper_result['segments']:
        seg_start = segment['start']
        seg_end = segment['end']
        seg_mid = (seg_start + seg_end) / 2
        # Find the speaker who is active at the segment's midpoint
        speaker = "UNKNOWN"
        for d_seg in diarization_segments:
            if d_seg['start'] <= seg_mid <= d_seg['end']:
                speaker = d_seg['speaker']
                break
        aligned.append({
            'start': seg_start,
            'end': seg_end,
            'speaker': speaker,
            'text': segment['text'].strip(),
        })
    # Merge consecutive segments from the same speaker
    merged = []
    for item in aligned:
        if merged and merged[-1]['speaker'] == item['speaker']:
            merged[-1]['end'] = item['end']
            merged[-1]['text'] += item['text']
        else:
            merged.append(dict(item))
    return merged
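Midpoint matching returns UNKNOWN whenever the midpoint lands in a gap between two diarization turns. One possible alternative, sketched here and not part of the original tool, is to pick the speaker with the largest total time overlap with the Whisper segment. `pick_speaker_by_overlap` is a hypothetical name.

```python
from collections import defaultdict

def pick_speaker_by_overlap(seg_start, seg_end, diarization_segments,
                            default="UNKNOWN"):
    """Hypothetical alternative to midpoint matching.

    Returns the speaker whose diarization turns overlap the interval
    [seg_start, seg_end] the most, or `default` if nothing overlaps.
    """
    overlap_by_speaker = defaultdict(float)
    for d in diarization_segments:
        # Length of the intersection of the two intervals (negative if disjoint)
        overlap = min(seg_end, d["end"]) - max(seg_start, d["start"])
        if overlap > 0:
            overlap_by_speaker[d["speaker"]] += overlap
    if not overlap_by_speaker:
        return default
    return max(overlap_by_speaker, key=overlap_by_speaker.get)
```

For a segment spanning 2.9 s to 4.0 s with turns at 0.0-3.0 (speaker A) and 3.5-8.0 (speaker B), the midpoint 3.45 falls in the gap, while the overlap rule assigns the segment to B.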
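Chunking long audio
In my experience Whisper's output quality degrades on recordings longer than about 30 minutes, and resource usage grows as well. So process long audio in chunks: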
def split_audio(audio_path: str, max_duration: int = 600):
    """Split long audio at VAD boundaries into chunks of about max_duration seconds"""
    speech_segments = detect_speech_segments(audio_path)
    # VAD timestamps are sample indices at the 16 kHz rate used by read_audio,
    # so convert with that rate rather than the file's native sample rate
    sr = 16000
    chunks = []
    current_chunk_start = 0
    current_chunk_end = 0
    for seg in speech_segments:
        seg_start_sec = seg['start'] / sr
        seg_end_sec = seg['end'] / sr
        if seg_end_sec - current_chunk_start > max_duration:
            # The current chunk is long enough: save it and start a new one
            if current_chunk_end > current_chunk_start:
                chunks.append((current_chunk_start, current_chunk_end))
            current_chunk_start = seg_start_sec
        current_chunk_end = seg_end_sec
    # Final chunk
    if current_chunk_end > current_chunk_start:
        chunks.append((current_chunk_start, current_chunk_end))
    # List of (start, end) tuples in seconds; a single speech segment longer
    # than max_duration still ends up as one oversized chunk
    return chunks
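If each chunk is transcribed separately, the timestamps Whisper returns are relative to the start of that chunk, so they need to be shifted back onto the timeline of the full recording before speaker alignment. A minimal sketch of that step follows; `offset_segments` and `merge_chunk_results` are hypothetical helpers, and the input is assumed to be a list of `(chunk_start_seconds, segments)` pairs.

```python
def offset_segments(segments, chunk_start: float):
    """Hypothetical helper: shift chunk-relative timestamps by chunk_start."""
    return [
        {**seg, "start": seg["start"] + chunk_start, "end": seg["end"] + chunk_start}
        for seg in segments
    ]

def merge_chunk_results(chunk_results):
    """Hypothetical helper: combine per-chunk segments into one global list.

    chunk_results: list of (chunk_start_seconds, segments) pairs.
    """
    merged = []
    for chunk_start, segments in chunk_results:
        merged.extend(offset_segments(segments, chunk_start))
    merged.sort(key=lambda seg: seg["start"])
    return merged
```

The merged list has the same shape as `whisper_result['segments']`, so it could be wrapped as `{'segments': merged}` and passed to `align_speakers_and_text`.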
Generating the minutes with GPT
Reading the raw transcript is tiring, so use GPT-3.5 (to save money) to generate structured meeting minutes:
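import openai

def generate_meeting_summary(transcript: list[dict]) -> str:
    """Generate meeting minutes with GPT"""
    # Format the transcript as "[time] speaker: text" lines
    formatted = []
    for item in transcript:
        time_str = format_time(item['start'])
        formatted.append(f"[{time_str}] {item['speaker']}: {item['text']}")
    transcript_text = "\n".join(formatted)
    prompt = f"""Below is the transcript of a meeting. Please generate structured meeting minutes.
Requirements:
1. Summarize the meeting topic and the participants
2. List the key issues discussed and the conclusions
3. Extract action items (TODO) and their owners
4. Output in Markdown format
Transcript:
{transcript_text}
"""
    # Legacy interface from openai<1.0; ChatCompletion was removed in openai 1.0+
    response = openai.ChatCompletion.create(
        model="gpt-3.5-turbo-16k",  # 16k variant to handle long text
        messages=[
            {"role": "system", "content": "You are a professional assistant for writing meeting minutes."},
            {"role": "user", "content": prompt}
        ],
        temperature=0.3,
    )
    return response.choices[0].message.content

def format_time(seconds: float) -> str:
    """Format seconds as MM:SS, or HH:MM:SS once past one hour"""
    h = int(seconds // 3600)
    m = int((seconds % 3600) // 60)
    s = int(seconds % 60)
    if h > 0:
        return f"{h:02d}:{m:02d}:{s:02d}"
    return f"{m:02d}:{s:02d}"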
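A long meeting can exceed even a 16k-token context window. One way to cope, sketched here as an assumption and not something the original tool does, is to split the formatted transcript lines into batches, summarize each batch, then summarize the summaries. `batch_lines` is a hypothetical helper, and the character budget is only a rough stand-in for a token count.

```python
def batch_lines(lines, max_chars: int = 6000):
    """Hypothetical helper: group lines into newline-joined batches.

    Each batch stays within max_chars where possible; a single line longer
    than max_chars becomes its own oversized batch and is never split.
    """
    batches, current, current_len = [], [], 0
    for line in lines:
        line_len = len(line) + 1  # +1 for the joining newline
        if current and current_len + line_len > max_chars:
            batches.append("\n".join(current))
            current, current_len = [], 0
        current.append(line)
        current_len += line_len
    if current:
        batches.append("\n".join(current))
    return batches
```

Splitting on whole lines keeps each speaker turn intact, so no batch starts mid-sentence.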
Full pipeline
Wire the modules above together:
def process_meeting(audio_path: str, hf_token: str, output_path: str = None):
    """End-to-end meeting processing pipeline"""
    print(f"Processing audio: {audio_path}")
    # 1. Preprocessing
    processed = preprocess_audio(audio_path)
    print("Audio preprocessing done")
    # 2. Speaker diarization
    diarization = diarize_speakers(processed, hf_token)
    print(f"Detected {len(set(s['speaker'] for s in diarization))} speakers")
    # 3. Speech recognition
    whisper_result = transcribe_with_timestamps(processed)
    print("Speech recognition done")
    # 4. Alignment
    transcript = align_speakers_and_text(diarization, whisper_result)
    print(f"{len(transcript)} speaker turns in total")
    # 5. Generate the minutes
    summary = generate_meeting_summary(transcript)
    # 6. Output
    if output_path is None:
        output_path = str(Path(audio_path).with_suffix('.md'))
    with open(output_path, 'w', encoding='utf-8') as f:
        f.write("# Meeting Minutes\n\n")
        f.write(summary)
        f.write("\n\n---\n\n# Full Transcript\n\n")
        for item in transcript:
            time_str = format_time(item['start'])
            f.write(f"**[{time_str}] {item['speaker']}**: {item['text']}\n\n")
    print(f"Meeting minutes saved to: {output_path}")
    return output_path
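Results for reference
Indicative performance on a typical meeting recording:
- Speaker diarization: pyannote handles Chinese reasonably well; accuracy depends on audio quality
- Whisper large-v2 recognizes Chinese well, with a character error rate in the 10-15% range (depending on accent and background noise)
- Total processing time depends on the hardware; GPU mode is significantly faster than CPU mode
- GPT usually generates the minutes within a few seconds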
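The main bottleneck is GPU inference for speaker diarization and speech recognition. If you don't need speaker diarization, processing is roughly twice as fast.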
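To see where the time goes on your own hardware, each stage can be wrapped in a small timer. The following is a sketch using only the standard library; `timed` is a hypothetical helper and is not part of the pipeline above.

```python
import time
from contextlib import contextmanager

@contextmanager
def timed(stage: str, timings: dict):
    """Hypothetical helper: store the wall-clock duration of a block in timings[stage]."""
    start = time.perf_counter()
    try:
        yield
    finally:
        # Recorded even if the wrapped block raises
        timings[stage] = time.perf_counter() - start

# Usage sketch inside process_meeting (stage names are arbitrary labels):
#   timings = {}
#   with timed("diarization", timings):
#       diarization = diarize_speakers(processed, hf_token)
#   with timed("asr", timings):
#       whisper_result = transcribe_with_timestamps(processed)
#   print(timings)
```

Note that both `diarize_speakers` and `transcribe_with_timestamps` load their models on every call, so the measured times include model loading as well as inference.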
Next up is a web interface, using FastAPI + WebSocket to show progress in real time.