
How to Use NVIDIA Canary-1B-v2 for ASR, Translation, and Automatic SRT Subtitle Export in Python


In this tutorial, we build a speech recognition and translation workflow using NVIDIA Canary-1B-v2. We start by installing the required audio, NeMo, NumPy, and SciPy dependencies, then load the Canary model on a GPU-enabled runtime for efficient inference. From there, we prepare audio in a clean 16 kHz mono format, perform English ASR, translate speech into several languages, generate word and segment timestamps, export translated subtitles as an SRT file, test long-form transcription, run batch processing, and benchmark inference speed. At the end, we have a complete multilingual ASR and speech translation pipeline that we can adapt for real audio files, subtitle generation, and large-scale transcription experiments.

Installing NeMo, Audio Libraries, NumPy, and SciPy Dependencies

import os, subprocess, sys
# Marker file that records a finished install, so the setup phase runs only once.
SENTINEL = "/content/.canary_setup_done"
if not os.path.exists(SENTINEL):
   def sh(c):
       # Echo the command, then run it; check=False continues on a non-zero exit code.
       print("$", c); subprocess.run(c, shell=True, check=False)
   print(">>> PHASE 1: installing dependencies (one-time)...\n")
   sh("apt-get -qq update")
   sh("apt-get -qq install -y libsndfile1 ffmpeg > /dev/null")
   sh('pip install -q "nemo_toolkit[asr]"')
   sh("pip install -q librosa soundfile pydub")
   sh('pip install -q --force-reinstall --no-cache-dir "numpy>=2.2,<2.4" "scipy>=1.15"')
   # Close the file explicitly so the marker is on disk before the process is killed.
   with open(SENTINEL, "w") as f:
       f.write("done")
   print("\n✅ Setup complete. Restarting the runtime now.")
   print("   When it reconnects, RUN THIS CELL AGAIN to begin the tutorial.")
   # Kill the kernel process so the runtime restarts with the new packages loaded.
   os.kill(os.getpid(), 9)

We set up the environment for the NVIDIA Canary-1B-v2 tutorial. We install the required system packages, the NeMo ASR toolkit, the audio libraries, and compatible NumPy and SciPy versions. We then create a setup marker file and restart the runtime so that the updated dependencies load cleanly before running the main tutorial.
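After the runtime restarts, it can be useful to confirm that the pinned versions are the ones actually installed. The following is a minimal sketch using only the standard library; the helper names `installed_version`, `version_tuple`, and `satisfies_pin` are hypothetical and not part of the tutorial, and the comparison looks only at the major and minor version numbers.

```python
import re
from importlib import metadata


def installed_version(package):
    """Return the installed version string, or None if the package is missing."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


def version_tuple(version, parts=2):
    """Turn '2.2.6' or '2.2rc1' into (2, 2) using the leading digits of each piece."""
    numbers = []
    for piece in version.split(".")[:parts]:
        match = re.match(r"\d+", piece)
        numbers.append(int(match.group()) if match else 0)
    return tuple(numbers)


def satisfies_pin(version, minimum, below=None):
    """Check minimum <= version (< below), comparing major.minor only."""
    current = version_tuple(version)
    if current < version_tuple(minimum):
        return False
    if below is not None and current >= version_tuple(below):
        return False
    return True


# Report on the two pins used in the setup cell ("numpy>=2.2,<2.4", "scipy>=1.15").
for package, minimum, below in [("numpy", "2.2", "2.4"), ("scipy", "1.15", None)]:
    found = installed_version(package)
    if found is None:
        print(f"{package}: not installed")
    else:
        status = "ok" if satisfies_pin(found, minimum, below) else "outside pin"
        print(f"{package}: {found} ({status})")
```

A check like this is only a convenience: it reports a mismatch but does not reinstall anything.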

Loading NVIDIA Canary-1B-v2 and Checking GPU Availability

import time, json, gc, math, urllib.request
import torch, numpy as np, soundfile as sf, librosa
print(">>> PHASE 2: running tutorial\n")
print("NumPy:", np.__version__, "| PyTorch:", torch.__version__)
print("CUDA available:", torch.cuda.is_available())
if torch.cuda.is_available():
   print("GPU:", torch.cuda.get_device_name(0),
         f"| VRAM: {torch.cuda.get_device_properties(0).total_memory/1e9:.1f} GB")
else:
   print("⚠  No GPU: will run on CPU (very slow). "
         "Set Runtime > Change runtime type > GPU.")
DEVICE = "cuda" if torch.cuda.is_available() else "cpu"
# Language codes accepted as source_lang / target_lang, mapped to display names.
LANGS = {
   "bg":"Bulgarian","hr":"Croatian","cs":"Czech","da":"Danish","nl":"Dutch",
   "en":"English","et":"Estonian","fi":"Finnish","fr":"French","de":"German",
   "el":"Greek","hu":"Hungarian","it":"Italian","lv":"Latvian","lt":"Lithuanian",
   "mt":"Maltese","pl":"Polish","pt":"Portuguese","ro":"Romanian","sk":"Slovak",
   "sl":"Slovenian","es":"Spanish","sv":"Swedish","ru":"Russian","uk":"Ukrainian",
}
print(f"\nSupported languages ({len(LANGS)}):", ", ".join(LANGS.keys()))
from nemo.collections.asr.models import ASRModel
print("\nLoading nvidia/canary-1b-v2 ...")
t0 = time.time()
# Download the checkpoint, move it to the selected device, and switch to eval mode.
asr_model = ASRModel.from_pretrained(model_name="nvidia/canary-1b-v2").to(DEVICE).eval()
print(f"Model loaded in {time.time()-t0:.1f}s")

We import the main libraries and check whether CUDA is available for GPU acceleration. We define a dictionary of supported language codes and names, which we use when running multilingual ASR and translation tasks. We then load the NVIDIA Canary-1B-v2 model from NeMo and move it to the available device for inference.
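Because the language codes are passed to the model as plain strings, a small guard can catch typos before any inference runs. The sketch below is illustrative only: `classify_task` is a hypothetical helper, the code set is copied from the dictionary above, and the rule that translation must have English on one side is an assumption about the model's supported directions that should be checked against the model card.

```python
# Codes copied from the tutorial's language dictionary.
SUPPORTED_CODES = {
    "bg", "hr", "cs", "da", "nl", "en", "et", "fi", "fr", "de",
    "el", "hu", "it", "lv", "lt", "mt", "pl", "pt", "ro", "sk",
    "sl", "es", "sv", "ru", "uk",
}


def classify_task(source_lang, target_lang):
    """Return 'asr' or 'translate', raising ValueError for unusable pairs."""
    for code in (source_lang, target_lang):
        if code not in SUPPORTED_CODES:
            raise ValueError(f"Unknown language code: {code!r}")
    if source_lang == target_lang:
        return "asr"  # same language in and out means plain transcription
    # ASSUMPTION: translation is offered only to or from English.
    if "en" not in (source_lang, target_lang):
        raise ValueError(
            f"{source_lang} -> {target_lang}: expected English on one side"
        )
    return "translate"


print(classify_task("en", "en"))
print(classify_task("en", "fr"))
```

Calling such a guard before each transcription request turns a silent bad result into an immediate, readable error.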

Preparing 16 kHz Audio and Running English ASR with Translation

TARGET_SR = 16000
def prepare_audio(path_or_url, out_path=None):
   # Download remote files first, dropping any query string from the file name.
   if str(path_or_url).startswith(("http://", "https://")):
       local = "/content/_dl_" + os.path.basename(path_or_url.split("?")[0])
       urllib.request.urlretrieve(path_or_url, local)
       path_or_url = local
   # librosa resamples to TARGET_SR and downmixes to mono while loading.
   audio, _ = librosa.load(path_or_url, sr=TARGET_SR, mono=True)
   if out_path is None:
       base = os.path.splitext(os.path.basename(path_or_url))[0]
       out_path = f"/content/{base}_16k_mono.wav"
   sf.write(out_path, audio, TARGET_SR, subtype="PCM_16")
   dur = len(audio) / TARGET_SR
   print(f"Prepared: {out_path}  ({dur:.1f}s, 16kHz, mono)")
   return out_path, dur
SAMPLE_URL = "https://dldata-public.s3.us-east-2.amazonaws.com/2086-149220-0033.wav"
sample_wav, sample_dur = prepare_audio(SAMPLE_URL)
def transcribe(files, source_lang="en", target_lang="en", timestamps=False, batch_size=1):
   # Accept a single path or a list of paths.
   if isinstance(files, str):
       files = [files]
   # Equal source/target codes give ASR; different codes give speech translation.
   return asr_model.transcribe(files, source_lang=source_lang, target_lang=target_lang,
                               timestamps=timestamps, batch_size=batch_size)
print("\n=== 1) BASIC ASR (English) ===")
res = transcribe(sample_wav, source_lang="en", target_lang="en")
print("Transcript:", res[0].text)
print("\n=== 2) TRANSLATION (EN audio -> X) ===")
for tgt in ["fr", "de", "es", "it"]:
   out = transcribe(sample_wav, source_lang="en", target_lang=tgt)
   print(f"  EN -> {LANGS[tgt]:<10} ({tgt}): {out[0].text}")

We create a reusable audio preparation function that downloads audio when needed and converts it into 16 kHz mono WAV format. We load the sample audio file and define a helper function for transcription and translation. We then run basic English ASR and translate the same English speech into French, German, Spanish, and Italian.
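Before handing a file to the model, it can help to confirm that it really is 16 kHz, mono, 16-bit PCM. The following sketch uses only the standard-library `wave` module, which reads uncompressed PCM WAV files; `write_tone`, `describe_wav`, and `is_model_ready` are hypothetical helpers introduced here for illustration, and the tone is synthetic so that the example runs without any downloaded audio.

```python
import math
import os
import struct
import tempfile
import wave


def write_tone(path, sample_rate=16000, channels=1, seconds=0.5, freq=440.0):
    """Write a short sine tone as 16-bit PCM so there is a file to inspect."""
    n_frames = int(sample_rate * seconds)
    frames = bytearray()
    for i in range(n_frames):
        sample = int(0.3 * 32767 * math.sin(2 * math.pi * freq * i / sample_rate))
        frames += struct.pack("<h", sample) * channels  # same sample on every channel
    with wave.open(path, "wb") as wf:
        wf.setnchannels(channels)
        wf.setsampwidth(2)  # 2 bytes per sample = 16-bit PCM
        wf.setframerate(sample_rate)
        wf.writeframes(bytes(frames))


def describe_wav(path):
    """Read the header of a PCM WAV file and return its basic properties."""
    with wave.open(path, "rb") as wf:
        rate = wf.getframerate()
        return {
            "sample_rate": rate,
            "channels": wf.getnchannels(),
            "sample_width_bytes": wf.getsampwidth(),
            "duration_s": wf.getnframes() / rate,
        }


def is_model_ready(info, target_sr=16000):
    """True when the file is mono, 16-bit, and at the target sample rate."""
    return (info["sample_rate"] == target_sr
            and info["channels"] == 1
            and info["sample_width_bytes"] == 2)


demo_path = os.path.join(tempfile.mkdtemp(), "tone.wav")
write_tone(demo_path)
demo_info = describe_wav(demo_path)
print(demo_info, "ready:", is_model_ready(demo_info))
```

In the tutorial's flow, a check like `is_model_ready(describe_wav(sample_wav))` could run right after the audio preparation step.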

Generating Word and Segment Timestamps and Exporting SRT Subtitles

print("\n=== 3) TIMESTAMPS (ASR) ===")
ts_out = transcribe(sample_wav, source_lang="en", target_lang="en", timestamps=True)
word_ts = ts_out[0].timestamp.get("word", [])
seg_ts  = ts_out[0].timestamp.get("segment", [])
print("Segments:")
for s in seg_ts:
   print(f"  [{s['start']:6.2f}s - {s['end']:6.2f}s]  {s['segment']}")
print("First 10 words:")
for w in word_ts[:10]:
   print(f"  [{w['start']:6.2f}s - {w['end']:6.2f}s]  {w['word']}")
def _srt_time(t):
   # Work in whole milliseconds so rounding can never yield a 1000 ms field.
   total_ms = int(round(t*1000))
   h, rem = divmod(total_ms, 3600000); m, rem = divmod(rem, 60000); s, ms = divmod(rem, 1000)
   return f"{h:02d}:{m:02d}:{s:02d},{ms:03d}"
def segments_to_srt(segments, out_path="/content/output.srt"):
   lines=[]
   # Each SRT cue is: index, time range, text, blank line.
   for i, seg in enumerate(segments, 1):
       lines += [str(i), f"{_srt_time(seg['start'])} --> {_srt_time(seg['end'])}",
                 seg["segment"].strip(), ""]
   with open(out_path, "w", encoding="utf-8") as f:
       f.write("\n".join(lines))
   print(f"Saved SRT: {out_path}")
   return out_path
print("\n=== 4) SRT EXPORT (translated French subtitles) ===")
fr_ts = transcribe(sample_wav, source_lang="en", target_lang="fr", timestamps=True)
segments_to_srt(fr_ts[0].timestamp["segment"], "/content/subtitles_fr.srt")
print(open("/content/subtitles_fr.srt", encoding="utf-8").read())

We enable timestamped transcription to extract both segment-level and word-level timing information. We print the transcript segments and the first few word timestamps to inspect how the model aligns text with audio. We also convert the translated French segments into an SRT subtitle file and display the generated subtitles.
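Model segments can be longer than what fits comfortably on screen, so word-level timestamps are sometimes regrouped into shorter cues. The sketch below shows one simple way to do that in plain Python; `group_words_into_cues` is a hypothetical helper, the limits of 42 characters and 5 seconds are illustrative defaults rather than values from the tutorial, and the input is assumed to be a list of dictionaries with `word`, `start`, and `end` keys like the ones printed above.

```python
def _make_cue(words):
    """Merge consecutive word entries into one cue with a 'segment' text field."""
    return {
        "start": words[0]["start"],
        "end": words[-1]["end"],
        "segment": " ".join(w["word"] for w in words),
    }


def group_words_into_cues(words, max_chars=42, max_duration=5.0):
    """Greedily pack words into cues, starting a new cue when a limit is exceeded."""
    cues, current = [], []
    for w in words:
        if current:
            text_len = len(" ".join(x["word"] for x in current)) + 1 + len(w["word"])
            duration = w["end"] - current[0]["start"]
            if text_len > max_chars or duration > max_duration:
                cues.append(_make_cue(current))
                current = []
        current.append(w)
    if current:
        cues.append(_make_cue(current))
    return cues


# Hand-written sample timings, not model output.
sample_words = [
    {"word": "hello", "start": 0.0, "end": 0.4},
    {"word": "there", "start": 0.5, "end": 0.9},
    {"word": "friend", "start": 1.0, "end": 1.5},
]
for cue in group_words_into_cues(sample_words, max_chars=11):
    print(cue)
```

Each cue keeps the `start`, `end`, and `segment` keys, so the result has the same shape as the segment list that the SRT export step consumes.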

Running Long-Form Transcription, Batch Processing, and Speed Benchmark

print("\n=== 5) LONG-FORM (sample tiled x6) ===")
long_audio, _ = librosa.load(sample_wav, sr=TARGET_SR, mono=True)
# Repeat the sample six times to simulate a longer recording.
long_audio = np.tile(long_audio, 6)
sf.write("/content/long.wav", long_audio, TARGET_SR, subtype="PCM_16")
print(f"Long clip duration: {len(long_audio)/TARGET_SR:.1f}s")
long_out = transcribe("/content/long.wav", source_lang="en", target_lang="en", batch_size=1)
print("Long transcript (first 300 chars):", long_out[0].text[:300], "...")
print("\n=== 6) BATCH ===")
# Write two copies of the sample so there is more than one file to batch.
for name in ["clip_a", "clip_b"]:
   sf.write(f"/content/{name}.wav",
            librosa.load(sample_wav, sr=TARGET_SR, mono=True)[0], TARGET_SR, subtype="PCM_16")
batch = transcribe(["/content/clip_a.wav", "/content/clip_b.wav"],
                  source_lang="en", target_lang="en", batch_size=2)
for i, b in enumerate(batch):
   print(f"  file {i}: {b.text}")
print("\n=== 7) BENCHMARK ===")
t0 = time.time(); _ = transcribe(sample_wav, source_lang="en", target_lang="en")
elapsed = time.time()-t0
# RTFx = audio seconds / compute seconds; values above 1 are faster than real time.
print(f"Audio: {sample_dur:.2f}s | Compute: {elapsed:.2f}s | RTFx ≈ {sample_dur/elapsed:.1f}x")
print("\n✅ Done. Change source_lang/target_lang from the LANGS dict to try other languages.")

We test long-form transcription by repeating the sample audio six times and passing the longer clip through the model. We also create two duplicate audio clips to demonstrate batch transcription with a batch size of two. Finally, we benchmark the model by dividing audio duration by compute time and report the result as the inverse real-time factor (RTFx), where higher values mean faster inference.
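A single timed run can be skewed by one-time costs such as caching, so a benchmark is often repeated after a warm-up call. The following is a minimal sketch of that idea; `measure_rtfx` is a hypothetical helper, the warm-up and repeat counts are arbitrary, and the workload here is a `time.sleep` stand-in rather than a real model call.

```python
import statistics
import time


def measure_rtfx(run_once, audio_seconds, warmup=1, repeats=3):
    """Time run_once() several times and report the median and the RTFx."""
    for _ in range(warmup):
        run_once()  # untimed call to absorb one-time setup costs
    timings = []
    for _ in range(repeats):
        start = time.perf_counter()
        run_once()
        timings.append(time.perf_counter() - start)
    median = statistics.median(timings)
    return {
        "runs": timings,
        "median_s": median,
        "rtfx": audio_seconds / median,  # audio seconds processed per compute second
    }


# Stand-in workload: pretend that a 2-second clip takes about 20 ms to process.
result = measure_rtfx(lambda: time.sleep(0.02), audio_seconds=2.0)
print(f"median {result['median_s']*1000:.1f} ms | RTFx ≈ {result['rtfx']:.1f}x")
```

With the tutorial's objects, the stand-in could be replaced by a callable that runs the transcription helper on the sample file, with `audio_seconds` set to the clip duration.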

Conclusion

In conclusion, we completed a practical end-to-end workflow for using NVIDIA Canary-1B-v2 as a multilingual ASR and speech translation system. We processed raw audio, generated transcripts, translated speech into different target languages, extracted timestamps, created subtitle files, handled longer audio clips, and measured runtime performance with a simple benchmark. We now have a reusable Colab-ready pipeline that we can extend further with custom uploads, more languages, larger batches, and production-style audio processing.



The post How to Use NVIDIA Canary-1B-v2 for ASR, Translation, and Automatic SRT Subtitle Export in Python appeared first on MarkTechPost.
