How to Use NVIDIA Canary-1B-v2 for ASR, Translation, and Automatic SRT Subtitle Export in Python
In this tutorial, we build a speech recognition and translation workflow using NVIDIA Canary-1B-v2. We start by installing the required audio, NeMo, NumPy, and SciPy dependencies, then load the Canary model on a GPU-enabled runtime for efficient inference. From there, we prepare audio in a clean 16 kHz mono format, perform English ASR, translate speech into multiple languages, generate word and segment timestamps, export translated subtitles as an SRT file, test long-form transcription, run batch processing, and benchmark inference speed. At the end, we have a complete multilingual ASR and speech translation pipeline that we can adapt for real audio files, subtitle generation, and large-scale transcription experiments.
Installing NeMo, Audio Libraries, NumPy, and SciPy Dependencies
import os, subprocess, sys

# Marker file: its presence means the one-time setup has already run.
SENTINEL = "/content/.canary_setup_done"

if not os.path.exists(SENTINEL):
    def sh(c):
        # Echo the command, then run it without raising on a non-zero exit code.
        print("$", c); subprocess.run(c, shell=True, check=False)

    print(">>> PHASE 1: installing dependencies (one-time)...\n")
    sh("apt-get -qq update")
    sh("apt-get -qq install -y libsndfile1 ffmpeg > /dev/null")
    sh('pip install -q "nemo_toolkit[asr]"')
    sh("pip install -q librosa soundfile pydub")
    sh('pip install -q --force-reinstall --no-cache-dir "numpy>=2.2,<2.4" "scipy>=1.15"')
    open(SENTINEL, "w").write("done")
    print("\nSetup complete. Restarting the runtime now.")
    print(" When it reconnects, RUN THIS CELL AGAIN to begin the tutorial.")
    # Kill the kernel process so the reinstalled NumPy/SciPy are loaded fresh.
    os.kill(os.getpid(), 9)
We set up the environment for the NVIDIA Canary-1B-v2 tutorial. We install the required system packages, the NeMo ASR toolkit, audio libraries, and compatible NumPy and SciPy versions. We then create a setup marker file and restart the runtime so that the updated dependencies load cleanly before running the main tutorial.
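Because the setup cell pins NumPy and SciPy to specific ranges, it can be useful to confirm after the restart that the versions actually loaded fall inside those ranges. The following is a minimal sketch and not part of the original notebook: `parse_version` and `in_range` are hypothetical helper names, and the parser only reads the leading numeric components of a version string, so pre-release suffixes are ignored.

```python
import re

def parse_version(text, width=4):
    """Return the leading numeric components of a version string, zero-padded."""
    match = re.match(r"\d+(?:\.\d+)*", text)
    if match is None:
        raise ValueError(f"not a version string: {text!r}")
    parts = tuple(int(p) for p in match.group(0).split("."))
    # Pad so that "2.2" and "2.2.0" compare as equal.
    return parts + (0,) * (width - len(parts))

def in_range(version, minimum, maximum=None):
    """True if minimum <= version < maximum (the upper bound is optional)."""
    v = parse_version(version)
    if v < parse_version(minimum):
        return False
    return maximum is None or v < parse_version(maximum)

# In the notebook these strings would come from np.__version__ and scipy.__version__.
print("numpy ok:", in_range("2.3.1", "2.2", "2.4"))
print("scipy ok:", in_range("1.15.3", "1.15"))
```

A failed check after the restart would suggest that another package pulled in a different NumPy or SciPy build, in which case rerunning the pinned install line is the first thing to try.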
Loading NVIDIA Canary-1B-v2 and Checking GPU Availability
import time, json, gc, math, urllib.request
import torch, numpy as np, soundfile as sf, librosa

print(">>> PHASE 2: running tutorial\n")
print("NumPy:", np.__version__, "| PyTorch:", torch.__version__)
print("CUDA available:", torch.cuda.is_available())
if torch.cuda.is_available():
    print("GPU:", torch.cuda.get_device_name(0),
          f"| VRAM: {torch.cuda.get_device_properties(0).total_memory/1e9:.1f} GB")
else:
    print("No GPU: will run on CPU (very slow). "
          "Set Runtime > Change runtime type > GPU.")
DEVICE = "cuda" if torch.cuda.is_available() else "cpu"

# Language codes accepted as source_lang / target_lang in this tutorial.
LANGS = {
    "bg":"Bulgarian","hr":"Croatian","cs":"Czech","da":"Danish","nl":"Dutch",
    "en":"English","et":"Estonian","fi":"Finnish","fr":"French","de":"German",
    "el":"Greek","hu":"Hungarian","it":"Italian","lv":"Latvian","lt":"Lithuanian",
    "mt":"Maltese","pl":"Polish","pt":"Portuguese","ro":"Romanian","sk":"Slovak",
    "sl":"Slovenian","es":"Spanish","sv":"Swedish","ru":"Russian","uk":"Ukrainian",
}
print(f"\nSupported languages ({len(LANGS)}):", ", ".join(LANGS.keys()))

from nemo.collections.asr.models import ASRModel

print("\nLoading nvidia/canary-1b-v2 ...")
t0 = time.time()
asr_model = ASRModel.from_pretrained(model_name="nvidia/canary-1b-v2").to(DEVICE).eval()
print(f"Model loaded in {time.time()-t0:.1f}s")
We import the main libraries and check whether CUDA is available for GPU acceleration. We define the dictionary of supported languages that Canary handles for multilingual ASR and translation tasks. We then load the NVIDIA Canary-1B-v2 model from NeMo and move it to the available device for inference.
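Since the language dictionary drives every later call, a small guard can catch a mistyped code before any audio reaches the model. The sketch below is an illustration only: `describe_task` is a hypothetical helper, the `LANGS` table is a trimmed copy of the one above, and the function checks only that both codes are listed. It does not verify that the model supports a particular source and target combination.

```python
# Trimmed copy of the tutorial's LANGS table (the notebook lists 25 codes).
LANGS = {"en": "English", "fr": "French", "de": "German", "es": "Spanish", "it": "Italian"}

def describe_task(source_lang, target_lang, langs=LANGS):
    """Validate both codes and label the request as ASR or translation."""
    for role, code in (("source_lang", source_lang), ("target_lang", target_lang)):
        if code not in langs:
            known = ", ".join(sorted(langs))
            raise ValueError(f"unknown {role} {code!r}; expected one of: {known}")
    # Same code on both sides means plain transcription; otherwise translation.
    if source_lang == target_lang:
        return f"ASR ({langs[source_lang]})"
    return f"translation ({langs[source_lang]} -> {langs[target_lang]})"

print(describe_task("en", "en"))
print(describe_task("en", "fr"))
```

This mirrors how the tutorial uses the two arguments: identical codes for transcription and different codes for translation.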
Preparing 16 kHz Audio and Running English ASR with Translation
TARGET_SR = 16000

def prepare_audio(path_or_url, out_path=None):
    # Download remote files first so librosa always reads from local disk.
    if str(path_or_url).startswith(("http://", "https://")):
        local = "/content/_dl_" + os.path.basename(path_or_url.split("?")[0])
        urllib.request.urlretrieve(path_or_url, local)
        path_or_url = local
    # Resample to 16 kHz and downmix to mono in one step.
    audio, _ = librosa.load(path_or_url, sr=TARGET_SR, mono=True)
    if out_path is None:
        base = os.path.splitext(os.path.basename(path_or_url))[0]
        out_path = f"/content/{base}_16k_mono.wav"
    sf.write(out_path, audio, TARGET_SR, subtype="PCM_16")
    dur = len(audio) / TARGET_SR
    print(f"Prepared: {out_path} ({dur:.1f}s, 16kHz, mono)")
    return out_path, dur

SAMPLE_URL = "https://dldata-public.s3.us-east-2.amazonaws.com/2086-149220-0033.wav"
sample_wav, sample_dur = prepare_audio(SAMPLE_URL)

def transcribe(files, source_lang="en", target_lang="en", timestamps=False, batch_size=1):
    # Accept a single path as well as a list of paths.
    if isinstance(files, str):
        files = [files]
    return asr_model.transcribe(files, source_lang=source_lang, target_lang=target_lang,
                                timestamps=timestamps, batch_size=batch_size)

print("\n=== 1) BASIC ASR (English) ===")
res = transcribe(sample_wav, source_lang="en", target_lang="en")
print("Transcript:", res[0].text)

print("\n=== 2) TRANSLATION (EN audio -> X) ===")
for tgt in ["fr", "de", "es", "it"]:
    out = transcribe(sample_wav, source_lang="en", target_lang=tgt)
    print(f" EN -> {LANGS[tgt]:<10} ({tgt}): {out[0].text}")
We create a reusable audio preparation function that downloads audio when needed and converts it into 16 kHz mono WAV format. We load the sample audio file and define a helper function for transcription and translation. We then run basic English ASR and translate the same English speech into French, German, Spanish, and Italian.
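To double-check that a prepared file really has the 16 kHz, mono, 16-bit layout described above, the header can be inspected with Python's standard `wave` module. This is a sketch under stated assumptions rather than part of the notebook: `wav_info`, `is_16k_mono_pcm16`, and `write_test_tone` are hypothetical helpers, and the generated sine tone only stands in for a real prepared clip.

```python
import math
import os
import struct
import tempfile
import wave

def wav_info(path):
    """Return (sample_rate, channels, bytes_per_sample, duration_seconds) of a PCM WAV."""
    with wave.open(path, "rb") as wf:
        rate = wf.getframerate()
        return rate, wf.getnchannels(), wf.getsampwidth(), wf.getnframes() / rate

def is_16k_mono_pcm16(path):
    """True if the file matches the 16 kHz, mono, 16-bit layout the tutorial writes."""
    rate, channels, width, _ = wav_info(path)
    return (rate, channels, width) == (16000, 1, 2)

def write_test_tone(path, seconds=0.5, rate=16000, freq=440.0):
    """Write a short sine tone as 16-bit mono PCM (a stand-in for a prepared clip)."""
    n = int(seconds * rate)
    samples = (int(0.3 * 32767 * math.sin(2 * math.pi * freq * i / rate)) for i in range(n))
    with wave.open(path, "wb") as wf:
        wf.setnchannels(1)
        wf.setsampwidth(2)      # 2 bytes per sample = 16-bit PCM
        wf.setframerate(rate)
        wf.writeframes(b"".join(struct.pack("<h", s) for s in samples))

demo_path = os.path.join(tempfile.gettempdir(), "tone_16k_mono.wav")
write_test_tone(demo_path)
print(wav_info(demo_path), is_16k_mono_pcm16(demo_path))
```

In the notebook, the same check could be pointed at the path returned by `prepare_audio` instead of the synthetic tone.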
Generating Word and Segment Timestamps and Exporting SRT Subtitles
print("\n=== 3) TIMESTAMPS (ASR) ===")
ts_out = transcribe(sample_wav, source_lang="en", target_lang="en", timestamps=True)
word_ts = ts_out[0].timestamp.get("word", [])
seg_ts = ts_out[0].timestamp.get("segment", [])
print("Segments:")
for s in seg_ts:
    print(f" [{s['start']:6.2f}s - {s['end']:6.2f}s] {s['segment']}")
print("First 10 words:")
for w in word_ts[:10]:
    print(f" [{w['start']:6.2f}s - {w['end']:6.2f}s] {w['word']}")

def _srt_time(t):
    # Work in whole milliseconds so rounding can never yield a 1000 ms field.
    total_ms = int(round(t * 1000))
    h, rem = divmod(total_ms, 3_600_000)
    m, rem = divmod(rem, 60_000)
    s, ms = divmod(rem, 1000)
    return f"{h:02d}:{m:02d}:{s:02d},{ms:03d}"

def segments_to_srt(segments, out_path="/content/output.srt"):
    lines = []
    # Each SRT cue: index, time range, text, blank separator line.
    for i, seg in enumerate(segments, 1):
        lines += [str(i), f"{_srt_time(seg['start'])} --> {_srt_time(seg['end'])}",
                  seg["segment"].strip(), ""]
    with open(out_path, "w", encoding="utf-8") as f:
        f.write("\n".join(lines))
    print(f"Saved SRT: {out_path}")
    return out_path

print("\n=== 4) SRT EXPORT (translated French subtitles) ===")
fr_ts = transcribe(sample_wav, source_lang="en", target_lang="fr", timestamps=True)
segments_to_srt(fr_ts[0].timestamp["segment"], "/content/subtitles_fr.srt")
print(open("/content/subtitles_fr.srt", encoding="utf-8").read())
We enable timestamped transcription to extract both segment-level and word-level timing information. We print the transcript segments and the first few word timestamps to inspect how the model aligns text with audio. We also convert translated French segments into an SRT subtitle file and display the generated subtitles.
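Model segments can be longer than a comfortable subtitle line, so one option is to rebuild shorter cues from the word-level timestamps. The sketch below is a hypothetical extension, not something the notebook does: `words_to_cues` and its `max_chars` and `max_duration` limits are assumptions, and the sample words are made up. It only assumes the `start`, `end`, and `word` keys printed above, and it emits dicts with the `segment` key that `segments_to_srt` reads.

```python
def words_to_cues(words, max_chars=42, max_duration=5.0):
    """Greedily pack word timestamps into cues bounded by text length and duration."""
    cues, current = [], []
    for w in words:
        if current:
            text_len = len(" ".join(x["word"] for x in current)) + 1 + len(w["word"])
            duration = w["end"] - current[0]["start"]
            # Close the running cue if adding this word would break either limit.
            if text_len > max_chars or duration > max_duration:
                cues.append(current)
                current = []
        current.append(w)
    if current:
        cues.append(current)
    return [{"start": c[0]["start"], "end": c[-1]["end"],
             "segment": " ".join(x["word"] for x in c)} for c in cues]

# Made-up word timings, shaped like the entries in the "word" timestamp list.
words = [
    {"word": "well", "start": 0.0, "end": 0.4},
    {"word": "I", "start": 0.5, "end": 0.6},
    {"word": "do", "start": 0.6, "end": 0.8},
    {"word": "not", "start": 0.8, "end": 1.0},
    {"word": "know", "start": 1.0, "end": 1.3},
]
for cue in words_to_cues(words, max_chars=10):
    print(cue)
```

The greedy split ignores punctuation and sentence boundaries; a production subtitle pipeline would likely prefer to break at those points first.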
Running Long-Form Transcription, Batch Processing, and Speed Benchmark
print("\n=== 5) LONG-FORM (sample tiled x6) ===")
long_audio, _ = librosa.load(sample_wav, sr=TARGET_SR, mono=True)
# Repeat the sample six times to simulate a longer recording.
long_audio = np.tile(long_audio, 6)
sf.write("/content/long.wav", long_audio, TARGET_SR, subtype="PCM_16")
print(f"Long clip duration: {len(long_audio)/TARGET_SR:.1f}s")
long_out = transcribe("/content/long.wav", source_lang="en", target_lang="en", batch_size=1)
print("Long transcript (first 300 chars):", long_out[0].text[:300], "...")

print("\n=== 6) BATCH ===")
# Write two copies of the sample so there is more than one file to batch.
for name in ["clip_a", "clip_b"]:
    sf.write(f"/content/{name}.wav",
             librosa.load(sample_wav, sr=TARGET_SR, mono=True)[0], TARGET_SR, subtype="PCM_16")
batch = transcribe(["/content/clip_a.wav", "/content/clip_b.wav"],
                   source_lang="en", target_lang="en", batch_size=2)
for i, b in enumerate(batch):
    print(f" file {i}: {b.text}")

print("\n=== 7) BENCHMARK ===")
t0 = time.time(); _ = transcribe(sample_wav, source_lang="en", target_lang="en")
elapsed = time.time()-t0
# RTFx = audio seconds processed per second of compute.
print(f"Audio: {sample_dur:.2f}s | Compute: {elapsed:.2f}s | RTFx ≈ {sample_dur/elapsed:.1f}x")
print("\nDone. Change source_lang/target_lang from the LANGS dict to try other languages.")
We test long-form transcription by repeating the sample audio several times and passing the longer clip through the model. We also create two duplicate audio clips to demonstrate batch transcription with a batch size of two. Finally, we benchmark the model by comparing audio duration with compute time and report the inverse real-time factor (RTFx), where values above 1 mean the model runs faster than real time.
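A single timed call can be skewed by one-off costs such as the first GPU kernel launch, so a slightly more robust variant runs a warm-up call and takes the median of several runs. This is a minimal sketch, not the notebook's method: `measure_rtfx`, its parameters, and the `fake_transcribe` stand-in are all hypothetical, and in the notebook the callable would wrap the real `transcribe` call.

```python
import statistics
import time

def measure_rtfx(fn, audio_seconds, runs=3, warmup=1, clock=time.perf_counter):
    """Return (median_seconds, rtfx) for calling fn() after optional warm-up calls."""
    for _ in range(warmup):
        fn()                      # untimed call to absorb one-off startup costs
    timings = []
    for _ in range(runs):
        start = clock()
        fn()
        timings.append(clock() - start)
    median = statistics.median(timings)
    # RTFx: seconds of audio processed per second of compute.
    return median, audio_seconds / median

def fake_transcribe():
    """Stand-in workload; replace with a lambda around the real transcribe call."""
    time.sleep(0.01)

median, rtfx = measure_rtfx(fake_transcribe, audio_seconds=10.0)
print(f"median {median:.3f}s | RTFx ~ {rtfx:.0f}x")
```

The `clock` parameter exists only so the timing source can be swapped out, for example with a deterministic counter when checking the arithmetic.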
Conclusion
In conclusion, we completed a practical end-to-end workflow for using NVIDIA Canary-1B-v2 as a multilingual ASR and speech translation system. We processed raw audio, generated transcripts, translated speech into different target languages, extracted timestamps, created subtitle files, handled longer audio clips, and measured runtime performance with a simple benchmark. We now have a reusable Colab-ready pipeline that we can extend further with custom uploads, more languages, larger batches, and production-style audio processing.
