How to Build an Advanced Voice AI Pipeline with WhisperX for Transcription, Alignment, Analysis, and Export?

In this tutorial, we walk through an advanced implementation of WhisperX, where we explore transcription, alignment, and word-level timestamps in detail. We set up the environment, load and preprocess the audio, and then run the full pipeline, from transcription to alignment and analysis, while ensuring memory efficiency and supporting batch processing. Along the way, we also visualize results, export them in multiple formats, and even extract keywords to gain deeper insights from the audio content. Check out the FULL CODES here.
!pip install -q git+https://github.com/m-bain/whisperX.git
!pip install -q pandas matplotlib seaborn
import whisperx
import torch
import gc
import os
import json
import pandas as pd
from pathlib import Path
from IPython.display import Audio, display, HTML
import warnings
warnings.filterwarnings('ignore')
CONFIG = {
    "device": "cuda" if torch.cuda.is_available() else "cpu",
    "compute_type": "float16" if torch.cuda.is_available() else "int8",
    "batch_size": 16,
    "model_size": "base",
    "language": None,
}
print(f"Running on: {CONFIG['device']}")
print(f"Compute type: {CONFIG['compute_type']}")
print(f"Model: {CONFIG['model_size']}")
We begin by installing WhisperX along with the essential libraries and then configure our setup. We detect whether CUDA is available, select the compute type, and set parameters such as batch size, model size, and language to prepare for transcription. Check out the FULL CODES here.
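If we want higher accuracy or a fixed language, we can adjust these values before running anything else. Below is a minimal sketch with illustrative override values (the "large-v2" choice mirrors Example 5 later in the tutorial; "en" forces English instead of auto-detection):

# Optional CONFIG overrides before running the pipeline (illustrative values).
CONFIG["model_size"] = "large-v2"   # larger model: slower, but more accurate
CONFIG["language"] = "en"           # force English and skip language detection
CONFIG["batch_size"] = 8            # lower this if the GPU runs out of memory
print(CONFIG)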
def download_sample_audio():
    """Download a sample audio file for testing"""
    !wget -q -O sample.mp3 https://github.com/mozilla-extensions/speaktome/raw/master/content/cv-valid-dev/sample-000000.mp3
    print("Sample audio downloaded")
    return "sample.mp3"
def load_and_analyze_audio(audio_path):
    """Load audio and display basic information"""
    audio = whisperx.load_audio(audio_path)
    duration = len(audio) / 16000
    print(f"Audio: {Path(audio_path).name}")
    print(f"Duration: {duration:.2f} seconds")
    print(f"Sample rate: 16000 Hz")
    display(Audio(audio_path))
    return audio, duration
def transcribe_audio(audio, model_size=CONFIG["model_size"], language=None):
    """Transcribe audio using WhisperX (batched inference)"""
    print("\nSTEP 1: Transcribing audio...")
    model = whisperx.load_model(
        model_size,
        CONFIG["device"],
        compute_type=CONFIG["compute_type"]
    )
    transcribe_kwargs = {
        "batch_size": CONFIG["batch_size"]
    }
    if language:
        transcribe_kwargs["language"] = language
    result = model.transcribe(audio, **transcribe_kwargs)
    total_segments = len(result["segments"])
    total_words = sum(len(seg.get("words", [])) for seg in result["segments"])
    del model
    gc.collect()
    if CONFIG["device"] == "cuda":
        torch.cuda.empty_cache()
    print(f"\nTranscription complete!")
    print(f"  Language: {result['language']}")
    print(f"  Segments: {total_segments}")
    print(f"  Total text length: {sum(len(seg['text']) for seg in result['segments'])} characters")
    return result
We download a sample audio file, load it for analysis, and then transcribe it using WhisperX. We set up batched inference with our chosen model size and configuration, and we output key details such as the detected language, number of segments, and total text length. Check out the FULL CODES here.
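To sanity-check the transcription before alignment, we can peek at the structure WhisperX returns. A quick sketch, assuming `result` comes from `transcribe_audio` above:

# Inspect the raw transcription output (segment-level only at this stage).
print(result["language"])        # detected language code
first = result["segments"][0]    # each segment carries start, end, and text
print(f"{first['start']:.2f}s - {first['end']:.2f}s: {first['text'].strip()}")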
def align_transcription(segments, audio, language_code):
    """Align transcription for accurate word-level timestamps"""
    print("\nSTEP 2: Aligning for word-level timestamps...")
    try:
        model_a, metadata = whisperx.load_align_model(
            language_code=language_code,
            device=CONFIG["device"]
        )
        result = whisperx.align(
            segments,
            model_a,
            metadata,
            audio,
            CONFIG["device"],
            return_char_alignments=False
        )
        total_words = sum(len(seg.get("words", [])) for seg in result["segments"])
        del model_a
        gc.collect()
        if CONFIG["device"] == "cuda":
            torch.cuda.empty_cache()
        print(f"\nAlignment complete!")
        print(f"  Aligned words: {total_words}")
        return result
    except Exception as e:
        print(f"\nAlignment failed: {str(e)}")
        print("  Continuing with segment-level timestamps only...")
        return {"segments": segments, "word_segments": []}
We align the transcription to generate precise word-level timestamps. By loading the alignment model and applying it to the audio, we refine timing accuracy, and then report the total number of aligned words while ensuring memory is cleared for efficient processing. Check out the FULL CODES here.
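Once alignment succeeds, each segment carries a "words" list with per-word timings. Here is a small sketch of how we might walk through it, assuming `aligned_result` comes from `align_transcription`:

# Print the first few word-level timestamps produced by alignment.
for seg in aligned_result["segments"][:1]:
    for word in seg.get("words", [])[:10]:
        # Guard against words that could not be aligned and so lack timings.
        if "start" in word and "end" in word:
            print(f"{word['start']:.2f}s-{word['end']:.2f}s  {word['word']}  (score={word.get('score', 0):.2f})")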
def analyze_transcription(result):
    """Generate statistics about the transcription"""
    print("\nTRANSCRIPTION STATISTICS")
    print("="*70)
    segments = result["segments"]
    total_duration = max(seg["end"] for seg in segments) if segments else 0
    total_words = sum(len(seg.get("words", [])) for seg in segments)
    total_chars = sum(len(seg["text"].strip()) for seg in segments)
    print(f"Total duration: {total_duration:.2f} seconds")
    print(f"Total segments: {len(segments)}")
    print(f"Total words: {total_words}")
    print(f"Total characters: {total_chars}")
    if total_duration > 0:
        print(f"Words per minute: {(total_words / total_duration * 60):.1f}")
    pauses = []
    for i in range(len(segments) - 1):
        pause = segments[i+1]["start"] - segments[i]["end"]
        if pause > 0:
            pauses.append(pause)
    if pauses:
        print(f"Average pause between segments: {sum(pauses)/len(pauses):.2f}s")
        print(f"Longest pause: {max(pauses):.2f}s")
    word_durations = []
    for seg in segments:
        if "words" in seg:
            for word in seg["words"]:
                duration = word["end"] - word["start"]
                word_durations.append(duration)
    if word_durations:
        print(f"Average word duration: {sum(word_durations)/len(word_durations):.3f}s")
    print("="*70)
We analyze the transcription by generating detailed statistics such as total duration, segment count, word count, and character count. We also calculate words per minute, pauses between segments, and average word duration to better understand the pacing and flow of the audio. Check out the FULL CODES here.
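Since matplotlib is installed at the top of the tutorial but not used in the helpers shown, here is one optional, minimal sketch for visualizing pacing, assuming `aligned_result` holds the aligned output:

import matplotlib.pyplot as plt

# Plot each segment's duration along the timeline to see pacing at a glance.
segs = aligned_result["segments"]
starts = [seg["start"] for seg in segs]
durations = [seg["end"] - seg["start"] for seg in segs]
plt.figure(figsize=(10, 3))
plt.bar(starts, durations, width=1.0)
plt.xlabel("Segment start time (s)")
plt.ylabel("Segment duration (s)")
plt.title("Segment durations over time")
plt.tight_layout()
plt.show()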
def display_results(result, show_words=False, max_rows=50):
    """Display transcription results in a formatted table"""
    data = []
    for seg in result["segments"]:
        text = seg["text"].strip()
        start = f"{seg['start']:.2f}s"
        end = f"{seg['end']:.2f}s"
        duration = f"{seg['end'] - seg['start']:.2f}s"
        if show_words and "words" in seg:
            for word in seg["words"]:
                data.append({
                    "Start": f"{word['start']:.2f}s",
                    "End": f"{word['end']:.2f}s",
                    "Duration": f"{word['end'] - word['start']:.3f}s",
                    "Text": word["word"],
                    "Score": f"{word.get('score', 0):.2f}"
                })
        else:
            data.append({
                "Start": start,
                "End": end,
                "Duration": duration,
                "Text": text
            })
    df = pd.DataFrame(data)
    if len(df) > max_rows:
        print(f"Showing first {max_rows} rows of {len(df)} total...")
        display(HTML(df.head(max_rows).to_html(index=False)))
    else:
        display(HTML(df.to_html(index=False)))
    return df
def export_results(result, output_dir="output", filename="transcript"):
    """Export results in multiple formats"""
    os.makedirs(output_dir, exist_ok=True)
    json_path = f"{output_dir}/{filename}.json"
    with open(json_path, "w", encoding="utf-8") as f:
        json.dump(result, f, indent=2, ensure_ascii=False)
    srt_path = f"{output_dir}/{filename}.srt"
    with open(srt_path, "w", encoding="utf-8") as f:
        for i, seg in enumerate(result["segments"], 1):
            start = format_timestamp(seg["start"])
            end = format_timestamp(seg["end"])
            f.write(f"{i}\n{start} --> {end}\n{seg['text'].strip()}\n\n")
    vtt_path = f"{output_dir}/{filename}.vtt"
    with open(vtt_path, "w", encoding="utf-8") as f:
        f.write("WEBVTT\n\n")
        for i, seg in enumerate(result["segments"], 1):
            start = format_timestamp_vtt(seg["start"])
            end = format_timestamp_vtt(seg["end"])
            f.write(f"{start} --> {end}\n{seg['text'].strip()}\n\n")
    txt_path = f"{output_dir}/{filename}.txt"
    with open(txt_path, "w", encoding="utf-8") as f:
        for seg in result["segments"]:
            f.write(f"{seg['text'].strip()}\n")
    csv_path = f"{output_dir}/{filename}.csv"
    df_data = []
    for seg in result["segments"]:
        df_data.append({
            "start": seg["start"],
            "end": seg["end"],
            "text": seg["text"].strip()
        })
    pd.DataFrame(df_data).to_csv(csv_path, index=False)
    print(f"\nResults exported to '{output_dir}/' directory:")
    print(f"  ✓ {filename}.json (full structured data)")
    print(f"  ✓ {filename}.srt (subtitles)")
    print(f"  ✓ {filename}.vtt (web video subtitles)")
    print(f"  ✓ {filename}.txt (plain text)")
    print(f"  ✓ {filename}.csv (timestamps + text)")
def format_timestamp(seconds):
    """Convert seconds to SRT timestamp format"""
    hours = int(seconds // 3600)
    minutes = int((seconds % 3600) // 60)
    secs = int(seconds % 60)
    millis = int((seconds % 1) * 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{millis:03d}"

def format_timestamp_vtt(seconds):
    """Convert seconds to VTT timestamp format"""
    hours = int(seconds // 3600)
    minutes = int((seconds % 3600) // 60)
    secs = int(seconds % 60)
    millis = int((seconds % 1) * 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d}.{millis:03d}"
def batch_process_files(audio_files, output_dir="batch_output"):
    """Process multiple audio files in batch"""
    print(f"\nBatch processing {len(audio_files)} files...")
    results = {}
    for i, audio_path in enumerate(audio_files, 1):
        print(f"\n[{i}/{len(audio_files)}] Processing: {Path(audio_path).name}")
        try:
            result, _ = process_audio_file(audio_path, show_output=False)
            results[audio_path] = result
            filename = Path(audio_path).stem
            export_results(result, output_dir, filename)
        except Exception as e:
            print(f"Error processing {audio_path}: {str(e)}")
            results[audio_path] = None
    print(f"\nBatch processing complete! Processed {len(results)} files.")
    return results
def extract_keywords(result, top_n=10):
    """Extract the most common words from the transcription"""
    from collections import Counter
    import re
    text = " ".join(seg["text"] for seg in result["segments"])
    words = re.findall(r'\b\w+\b', text.lower())
    stop_words = {'the', 'a', 'an', 'and', 'or', 'but', 'in', 'on', 'at', 'to', 'for',
                  'of', 'with', 'is', 'was', 'are', 'were', 'be', 'been', 'being',
                  'have', 'has', 'had', 'do', 'does', 'did', 'will', 'would', 'could',
                  'should', 'may', 'might', 'must', 'can', 'this', 'that', 'these', 'those'}
    filtered_words = [w for w in words if w not in stop_words and len(w) > 2]
    word_counts = Counter(filtered_words).most_common(top_n)
    print(f"\nTop {top_n} Keywords:")
    for word, count in word_counts:
        print(f"  {word}: {count}")
    return word_counts
We format results into clean tables, export transcripts to JSON/SRT/VTT/TXT/CSV formats, and maintain precise timestamps with helper formatters. We also batch-process multiple audio files end-to-end and extract top keywords, enabling us to quickly turn raw transcriptions into analysis-ready artifacts. Check out the FULL CODES here.
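As a quick sanity check of the timestamp helpers, we can confirm that the SRT form uses a comma before the milliseconds while the VTT form uses a dot; a small example:

# 83.5 seconds should map to 00:01:23,500 (SRT) and 00:01:23.500 (VTT).
print(format_timestamp(83.5))       # -> 00:01:23,500
print(format_timestamp_vtt(83.5))   # -> 00:01:23.500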
def process_audio_file(audio_path, show_output=True, analyze=True):
    """Complete WhisperX pipeline"""
    if show_output:
        print("="*70)
        print("WhisperX Advanced Tutorial")
        print("="*70)
    audio, duration = load_and_analyze_audio(audio_path)
    result = transcribe_audio(audio, CONFIG["model_size"], CONFIG["language"])
    aligned_result = align_transcription(
        result["segments"],
        audio,
        result["language"]
    )
    if analyze and show_output:
        analyze_transcription(aligned_result)
        extract_keywords(aligned_result)
    if show_output:
        print("\n" + "="*70)
        print("TRANSCRIPTION RESULTS")
        print("="*70)
        df = display_results(aligned_result, show_words=False)
        export_results(aligned_result)
    else:
        df = None
    return aligned_result, df
# Example 1: Process sample audio
# audio_path = download_sample_audio()
# result, df = process_audio_file(audio_path)

# Example 2: Show word-level details
# result, df = process_audio_file(audio_path)
# word_df = display_results(result, show_words=True)

# Example 3: Process your own audio
# audio_path = "your_audio.wav"  # or .mp3, .m4a, etc.
# result, df = process_audio_file(audio_path)

# Example 4: Batch process multiple files
# audio_files = ["audio1.mp3", "audio2.wav", "audio3.m4a"]
# results = batch_process_files(audio_files)

# Example 5: Use a larger model for better accuracy
# CONFIG["model_size"] = "large-v2"
# result, df = process_audio_file("audio.mp3")

print("\nSetup complete! Uncomment examples above to run.")
We run the full WhisperX pipeline end-to-end, loading the audio, transcribing it, and aligning it for word-level timestamps. When enabled, we analyze statistics, extract keywords, render a clean results table, and export everything to multiple formats, ready for real use.
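Once Example 1 has been run, the exports land under `output/`. Here is a short sketch (assuming that example has been executed, so `output/transcript.json` exists) that loads the JSON back and confirms the structure:

# Reload the exported JSON and confirm what was written (run after Example 1).
with open("output/transcript.json", "r", encoding="utf-8") as f:
    saved = json.load(f)
print(f"Segments saved: {len(saved['segments'])}")
print(saved["segments"][0]["text"].strip())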
In conclusion, we built a complete WhisperX pipeline that not only transcribes audio but also aligns it with precise word-level timestamps. We export the results in multiple formats, process files in batches, and analyze patterns to make the output more meaningful. With this, we now have a flexible, ready-to-use workflow for transcription and audio analysis on Colab, and we are ready to extend it further into real-world projects.