How to Build an Advanced Voice AI Pipeline with WhisperX for Transcription, Alignment, Analysis, and Export?

In this tutorial, we walk through an advanced implementation of WhisperX, where we explore transcription, alignment, and word-level timestamps in detail. We set up the environment, load and preprocess the audio, and then run the full pipeline, from transcription to alignment and analysis, while ensuring memory efficiency and supporting batch processing. Along the way, we also visualize results, export them in multiple formats, and even extract keywords to gain deeper insights from the audio content. Check out the FULL CODES here.
!pip install -q git+https://github.com/m-bain/whisperX.git
!pip install -q pandas matplotlib seaborn
import whisperx
import torch
import gc
import os
import json
import pandas as pd
from pathlib import Path
from IPython.display import Audio, display, HTML
import warnings
warnings.filterwarnings('ignore')
CONFIG = {
    "device": "cuda" if torch.cuda.is_available() else "cpu",
    "compute_type": "float16" if torch.cuda.is_available() else "int8",
    "batch_size": 16,
    "model_size": "base",
    "language": None,
}
print(f"Running on: {CONFIG['device']}")
print(f"Compute type: {CONFIG['compute_type']}")
print(f"Model: {CONFIG['model_size']}")
We begin by installing WhisperX along with the essential libraries and then configure our setup. We detect whether CUDA is available, select the compute type, and set parameters such as batch size, model size, and language to prepare for transcription. Check out the FULL CODES here.
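If we want higher accuracy or a fixed language, we can adjust these values before running anything else. Below is a minimal sketch with illustrative override values (the "large-v2" choice mirrors Example 5 later in the tutorial; "en" forces English instead of auto-detection):

# Optional CONFIG overrides before running the pipeline (illustrative values).
CONFIG["model_size"] = "large-v2"   # larger model: slower, but more accurate
CONFIG["language"] = "en"           # force English and skip language detection
CONFIG["batch_size"] = 8            # lower this if the GPU runs out of memory
print(CONFIG)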
def download_sample_audio():
    """Download a sample audio file for testing"""
    !wget -q -O sample.mp3 https://github.com/mozilla-extensions/speaktome/raw/master/content/cv-valid-dev/sample-000000.mp3
    print("Sample audio downloaded")
    return "sample.mp3"
def load_and_analyze_audio(audio_path):
    """Load audio and display basic information"""
    audio = whisperx.load_audio(audio_path)
    duration = len(audio) / 16000
    print(f"Audio: {Path(audio_path).name}")
    print(f"Duration: {duration:.2f} seconds")
    print(f"Sample rate: 16000 Hz")
    display(Audio(audio_path))
    return audio, duration
def transcribe_audio(audio, model_size=CONFIG["model_size"], language=None):
    """Transcribe audio using WhisperX (batched inference)"""
    print("\nSTEP 1: Transcribing audio...")
    model = whisperx.load_model(
        model_size,
        CONFIG["device"],
        compute_type=CONFIG["compute_type"]
    )
    transcribe_kwargs = {
        "batch_size": CONFIG["batch_size"]
    }
    if language:
        transcribe_kwargs["language"] = language
    result = model.transcribe(audio, **transcribe_kwargs)
    total_segments = len(result["segments"])
    total_words = sum(len(seg.get("words", [])) for seg in result["segments"])
    del model
    gc.collect()
    if CONFIG["device"] == "cuda":
        torch.cuda.empty_cache()
    print(f"\nTranscription complete!")
    print(f"  Language: {result['language']}")
    print(f"  Segments: {total_segments}")
    print(f"  Total text length: {sum(len(seg['text']) for seg in result['segments'])} characters")
    return result
We download a sample audio file, load it for analysis, and then transcribe it using WhisperX. We set up batched inference with our chosen model size and configuration, and we output key details such as the detected language, number of segments, and total text length. Check out the FULL CODES here.
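To sanity-check the transcription before alignment, we can peek at the structure WhisperX returns. A quick sketch, assuming `result` comes from `transcribe_audio` above:

# Inspect the raw transcription output (segment-level only at this stage).
print(result["language"])        # detected language code
first = result["segments"][0]    # each segment carries start, end, and text
print(f"{first['start']:.2f}s - {first['end']:.2f}s: {first['text'].strip()}")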
def align_transcription(segments, audio, language_code):
    """Align transcription for accurate word-level timestamps"""
    print("\nSTEP 2: Aligning for word-level timestamps...")
    try:
        model_a, metadata = whisperx.load_align_model(
            language_code=language_code,
            device=CONFIG["device"]
        )
        result = whisperx.align(
            segments,
            model_a,
            metadata,
            audio,
            CONFIG["device"],
            return_char_alignments=False
        )
        total_words = sum(len(seg.get("words", [])) for seg in result["segments"])
        del model_a
        gc.collect()
        if CONFIG["device"] == "cuda":
            torch.cuda.empty_cache()
        print(f"\nAlignment complete!")
        print(f"  Aligned words: {total_words}")
        return result
    except Exception as e:
        print(f"\nAlignment failed: {str(e)}")
        print("  Continuing with segment-level timestamps only...")
        return {"segments": segments, "word_segments": []}
We align the transcription to generate precise word-level timestamps. By loading the alignment model and applying it to the audio, we refine timing accuracy, and then report the total number of aligned words while ensuring memory is cleared for efficient processing. Check out the FULL CODES here.
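Once alignment succeeds, each segment carries a "words" list with per-word timings. Here is a small sketch of how we might walk through it, assuming `aligned_result` comes from `align_transcription`:

# Print the first few word-level timestamps produced by alignment.
for seg in aligned_result["segments"][:1]:
    for word in seg.get("words", [])[:10]:
        # Guard against words that could not be aligned and so lack timings.
        if "start" in word and "end" in word:
            print(f"{word['start']:.2f}s-{word['end']:.2f}s  {word['word']}  (score={word.get('score', 0):.2f})")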
def analyze_transcription(result):
    """Generate statistics about the transcription"""
    print("\nTRANSCRIPTION STATISTICS")
    print("="*70)
    segments = result["segments"]
    total_duration = max(seg["end"] for seg in segments) if segments else 0
    total_words = sum(len(seg.get("words", [])) for seg in segments)
    total_chars = sum(len(seg["text"].strip()) for seg in segments)
    print(f"Total duration: {total_duration:.2f} seconds")
    print(f"Total segments: {len(segments)}")
    print(f"Total words: {total_words}")
    print(f"Total characters: {total_chars}")
    if total_duration > 0:
        print(f"Words per minute: {(total_words / total_duration * 60):.1f}")
    pauses = []
    for i in range(len(segments) - 1):
        pause = segments[i+1]["start"] - segments[i]["end"]
        if pause > 0:
            pauses.append(pause)
    if pauses:
        print(f"Average pause between segments: {sum(pauses)/len(pauses):.2f}s")
        print(f"Longest pause: {max(pauses):.2f}s")
    word_durations = []
    for seg in segments:
        if "words" in seg:
            for word in seg["words"]:
                duration = word["end"] - word["start"]
                word_durations.append(duration)
    if word_durations:
        print(f"Average word duration: {sum(word_durations)/len(word_durations):.3f}s")
    print("="*70)
We analyze the transcription by generating detailed statistics such as total duration, segment count, word count, and character count. We also calculate words per minute, pauses between segments, and average word duration to better understand the pacing and flow of the audio. Check out the FULL CODES here.
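Since matplotlib is installed at the top of the tutorial but not used in the helpers shown, here is one optional, minimal sketch for visualizing pacing, assuming `aligned_result` holds the aligned output:

import matplotlib.pyplot as plt

# Plot each segment's duration along the timeline to see pacing at a glance.
segs = aligned_result["segments"]
starts = [seg["start"] for seg in segs]
durations = [seg["end"] - seg["start"] for seg in segs]
plt.figure(figsize=(10, 3))
plt.bar(starts, durations, width=1.0)
plt.xlabel("Segment start time (s)")
plt.ylabel("Segment duration (s)")
plt.title("Segment durations over time")
plt.tight_layout()
plt.show()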
def display_results(result, show_words=False, max_rows=50):
    """Display transcription results in a formatted table"""
    data = []
    for seg in result["segments"]:
        text = seg["text"].strip()
        start = f"{seg['start']:.2f}s"
        end = f"{seg['end']:.2f}s"
        duration = f"{seg['end'] - seg['start']:.2f}s"
        if show_words and "words" in seg:
            for word in seg["words"]:
                data.append({
                    "Start": f"{word['start']:.2f}s",
                    "End": f"{word['end']:.2f}s",
                    "Duration": f"{word['end'] - word['start']:.3f}s",
                    "Text": word["word"],
                    "Score": f"{word.get('score', 0):.2f}"
                })
        else:
            data.append({
                "Start": start,
                "End": end,
                "Duration": duration,
                "Text": text
            })
    df = pd.DataFrame(data)
    if len(df) > max_rows:
        print(f"Showing first {max_rows} rows of {len(df)} total...")
        display(HTML(df.head(max_rows).to_html(index=False)))
    else:
        display(HTML(df.to_html(index=False)))
    return df
def export_results(result, output_dir="output", filename="transcript"):
    """Export results in multiple formats"""
    os.makedirs(output_dir, exist_ok=True)
    json_path = f"{output_dir}/{filename}.json"
    with open(json_path, "w", encoding="utf-8") as f:
        json.dump(result, f, indent=2, ensure_ascii=False)
    srt_path = f"{output_dir}/{filename}.srt"
    with open(srt_path, "w", encoding="utf-8") as f:
        for i, seg in enumerate(result["segments"], 1):
            start = format_timestamp(seg["start"])
            end = format_timestamp(seg["end"])
            f.write(f"{i}\n{start} --> {end}\n{seg['text'].strip()}\n\n")
    vtt_path = f"{output_dir}/{filename}.vtt"
    with open(vtt_path, "w", encoding="utf-8") as f:
        f.write("WEBVTT\n\n")
        for i, seg in enumerate(result["segments"], 1):
            start = format_timestamp_vtt(seg["start"])
            end = format_timestamp_vtt(seg["end"])
            f.write(f"{start} --> {end}\n{seg['text'].strip()}\n\n")
    txt_path = f"{output_dir}/{filename}.txt"
    with open(txt_path, "w", encoding="utf-8") as f:
        for seg in result["segments"]:
            f.write(f"{seg['text'].strip()}\n")
    csv_path = f"{output_dir}/{filename}.csv"
    df_data = []
    for seg in result["segments"]:
        df_data.append({
            "start": seg["start"],
            "end": seg["end"],
            "text": seg["text"].strip()
        })
    pd.DataFrame(df_data).to_csv(csv_path, index=False)
    print(f"\nResults exported to '{output_dir}/' directory:")
    print(f"  ✓ {filename}.json (full structured data)")
    print(f"  ✓ {filename}.srt (subtitles)")
    print(f"  ✓ {filename}.vtt (web video subtitles)")
    print(f"  ✓ {filename}.txt (plain text)")
    print(f"  ✓ {filename}.csv (timestamps + text)")
def format_timestamp(seconds):
    """Convert seconds to SRT timestamp format"""
    hours = int(seconds // 3600)
    minutes = int((seconds % 3600) // 60)
    secs = int(seconds % 60)
    millis = int((seconds % 1) * 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{millis:03d}"

def format_timestamp_vtt(seconds):
    """Convert seconds to VTT timestamp format"""
    hours = int(seconds // 3600)
    minutes = int((seconds % 3600) // 60)
    secs = int(seconds % 60)
    millis = int((seconds % 1) * 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d}.{millis:03d}"
def batch_process_files(audio_files, output_dir="batch_output"):
    """Process multiple audio files in batch"""
    print(f"\nBatch processing {len(audio_files)} files...")
    results = {}
    for i, audio_path in enumerate(audio_files, 1):
        print(f"\n[{i}/{len(audio_files)}] Processing: {Path(audio_path).name}")
        try:
            result, _ = process_audio_file(audio_path, show_output=False)
            results[audio_path] = result
            filename = Path(audio_path).stem
            export_results(result, output_dir, filename)
        except Exception as e:
            print(f"Error processing {audio_path}: {str(e)}")
            results[audio_path] = None
    print(f"\nBatch processing complete! Processed {len(results)} files.")
    return results
def extract_keywords(result, top_n=10):
    """Extract the most common words from the transcription"""
    from collections import Counter
    import re
    text = " ".join(seg["text"] for seg in result["segments"])
    words = re.findall(r'\b\w+\b', text.lower())
    stop_words = {'the', 'a', 'an', 'and', 'or', 'but', 'in', 'on', 'at', 'to', 'for',
                  'of', 'with', 'is', 'was', 'are', 'were', 'be', 'been', 'being',
                  'have', 'has', 'had', 'do', 'does', 'did', 'will', 'would', 'could',
                  'should', 'may', 'might', 'must', 'can', 'this', 'that', 'these', 'those'}
    filtered_words = [w for w in words if w not in stop_words and len(w) > 2]
    word_counts = Counter(filtered_words).most_common(top_n)
    print(f"\nTop {top_n} Keywords:")
    for word, count in word_counts:
        print(f"  {word}: {count}")
    return word_counts
We format results into clean tables, export transcripts to JSON/SRT/VTT/TXT/CSV formats, and maintain precise timestamps with helper formatters. We also batch-process multiple audio files end-to-end and extract top keywords, enabling us to quickly turn raw transcriptions into analysis-ready artifacts. Check out the FULL CODES here.
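As a quick sanity check of the timestamp helpers, we can confirm that the SRT form uses a comma before the milliseconds while the VTT form uses a dot; a small example:

# 83.5 seconds should map to 00:01:23,500 (SRT) and 00:01:23.500 (VTT).
print(format_timestamp(83.5))       # -> 00:01:23,500
print(format_timestamp_vtt(83.5))   # -> 00:01:23.500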
def process_audio_file(audio_path, show_output=True, analyze=True):
    """Complete WhisperX pipeline"""
    if show_output:
        print("="*70)
        print("WhisperX Advanced Tutorial")
        print("="*70)
    audio, duration = load_and_analyze_audio(audio_path)
    result = transcribe_audio(audio, CONFIG["model_size"], CONFIG["language"])
    aligned_result = align_transcription(
        result["segments"],
        audio,
        result["language"]
    )
    if analyze and show_output:
        analyze_transcription(aligned_result)
        extract_keywords(aligned_result)
    if show_output:
        print("\n" + "="*70)
        print("TRANSCRIPTION RESULTS")
        print("="*70)
        df = display_results(aligned_result, show_words=False)
        export_results(aligned_result)
    else:
        df = None
    return aligned_result, df
# Example 1: Process sample audio
# audio_path = download_sample_audio()
# result, df = process_audio_file(audio_path)

# Example 2: Show word-level details
# result, df = process_audio_file(audio_path)
# word_df = display_results(result, show_words=True)

# Example 3: Process your own audio
# audio_path = "your_audio.wav"  # or .mp3, .m4a, etc.
# result, df = process_audio_file(audio_path)

# Example 4: Batch process multiple files
# audio_files = ["audio1.mp3", "audio2.wav", "audio3.m4a"]
# results = batch_process_files(audio_files)

# Example 5: Use a larger model for better accuracy
# CONFIG["model_size"] = "large-v2"
# result, df = process_audio_file("audio.mp3")

print("\nSetup complete! Uncomment examples above to run.")
We run the full WhisperX pipeline end-to-end, loading the audio, transcribing it, and aligning it for word-level timestamps. When enabled, we analyze statistics, extract keywords, render a clean results table, and export everything to multiple formats, ready for real use.
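Once Example 1 has been run, the exports land under `output/`. Here is a short sketch (assuming that example has been executed, so `output/transcript.json` exists) that loads the JSON back and confirms the structure:

# Reload the exported JSON and confirm what was written (run after Example 1).
with open("output/transcript.json", "r", encoding="utf-8") as f:
    saved = json.load(f)
print(f"Segments saved: {len(saved['segments'])}")
print(saved["segments"][0]["text"].strip())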
In conclusion, we built a complete WhisperX pipeline that not only transcribes audio but also aligns it with precise word-level timestamps. We export the results in multiple formats, process files in batches, and analyze patterns to make the output more meaningful. With this, we now have a flexible, ready-to-use workflow for transcription and audio analysis on Colab, and we are ready to extend it further into real-world projects.