
How to Build an Advanced End-to-End Voice AI Agent Using Hugging Face Pipelines?


In this tutorial, we build an advanced voice AI agent using Hugging Face's freely available models, and we keep the entire pipeline simple enough to run smoothly on Google Colab. We combine Whisper for speech recognition, FLAN-T5 for natural language reasoning, and Bark for speech synthesis, all connected through transformers pipelines. By doing this, we avoid heavy dependencies, API keys, and complicated setups, and we focus on showing how we can turn voice input into meaningful conversation and get back natural-sounding voice responses in real time.

!pip -q install "transformers>=4.42.0" accelerate torchaudio sentencepiece gradio soundfile


import os, torch, tempfile, numpy as np
import gradio as gr
from transformers import pipeline, AutoTokenizer, AutoModelForSeq2SeqLM


DEVICE = 0 if torch.cuda.is_available() else -1


asr = pipeline(
    "automatic-speech-recognition",
    model="openai/whisper-small.en",
    device=DEVICE,
    chunk_length_s=30,
    return_timestamps=False
)


LLM_MODEL = "google/flan-t5-base"
tok = AutoTokenizer.from_pretrained(LLM_MODEL)
llm = AutoModelForSeq2SeqLM.from_pretrained(LLM_MODEL, device_map="auto")


tts = pipeline("text-to-speech", model="suno/bark-small")

We install the necessary libraries and load three Hugging Face pipelines: Whisper for speech-to-text, FLAN-T5 for generating responses, and Bark for text-to-speech. We set the device automatically so that we can use a GPU if one is available.
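As an optional sanity check (a minimal sketch; the prompt text here is arbitrary), we can run one short generation to confirm the FLAN-T5 weights loaded; no audio file is needed for this:

# Optional smoke test: one short generation to confirm FLAN-T5 is ready.
smoke = tok("User: Say hello in five words.\nAssistant:", return_tensors="pt").to(llm.device)
print(tok.decode(llm.generate(**smoke, max_new_tokens=20)[0], skip_special_tokens=True))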

SYSTEM_PROMPT = (
    "You are a helpful, concise voice assistant. "
    "Prefer direct, structured answers. "
    "If the user asks for steps or code, use short bullet points."
)


def format_dialog(history, user_text):
    turns = []
    for u, a in history:
        if u: turns.append(f"User: {u}")
        if a: turns.append(f"Assistant: {a}")
    turns.append(f"User: {user_text}")
    prompt = (
        "Instruction:\n"
        f"{SYSTEM_PROMPT}\n\n"
        "Dialog so far:\n" + "\n".join(turns) + "\n\n"
        "Assistant:"
    )
    return prompt

We define a system prompt that guides our agent to stay concise and structured, and we implement a format_dialog function that takes the previous dialog history together with the user input and builds a single prompt string for the model to generate the assistant's reply.
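To see exactly what the model receives, we can print the prompt for a short made-up history (the Q&A pair below is purely illustrative):

# Illustrative only: inspect the prompt built from one prior turn.
example_history = [("What is FLAN-T5?", "An instruction-tuned T5 model.")]
print(format_dialog(example_history, "Can it run on a free Colab GPU?"))
# Prints the system prompt, both prior turns, the new user line, and a
# trailing "Assistant:" cue for the model to complete.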

def transcribe(filepath):
    out = asr(filepath)
    text = out["text"].strip()
    return text


def generate_reply(history, user_text, max_new_tokens=256):
    prompt = format_dialog(history, user_text)
    inputs = tok(prompt, return_tensors="pt", truncation=True).to(llm.device)
    with torch.no_grad():
        ids = llm.generate(
            **inputs,
            max_new_tokens=max_new_tokens,
            temperature=0.7,
            do_sample=True,
            top_p=0.9,
            repetition_penalty=1.05,
        )
    reply = tok.decode(ids[0], skip_special_tokens=True).strip()
    return reply


def synthesize_speech(text):
    out = tts(text)
    audio = out["audio"]
    sr = out["sampling_rate"]
    # Bark may return a (1, n) batch axis; squeeze to 1-D for Gradio.
    audio = np.asarray(audio, dtype=np.float32).squeeze()
    return (sr, audio)

We create three core functions for our voice agent: transcribe converts recorded audio into text using Whisper, generate_reply builds a context-aware response with FLAN-T5, and synthesize_speech turns that response back into spoken audio with Bark.
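We can exercise the whole chain without a microphone; a small sketch, where demo_reply.wav is an arbitrary output filename:

# Text-only round trip: generate a reply, then save the synthesized
# speech to a WAV file with soundfile (installed above).
import soundfile as sf

reply = generate_reply([], "Give me two tips for learning Python.")
sr, wav = synthesize_speech(reply)
sf.write("demo_reply.wav", wav, sr)
print(reply)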

def clear_history():
   return [], []


def voice_to_voice(mic_file, history):
    history = history or []
    if not mic_file:
        return history, None, "Please record something!"
    try:
        user_text = transcribe(mic_file)
    except Exception as e:
        return history, None, f"ASR error: {e}"

    if not user_text:
        return history, None, "Didn't catch that. Try again?"

    try:
        reply = generate_reply(history, user_text)
    except Exception as e:
        return history, None, f"LLM error: {e}"

    try:
        sr, wav = synthesize_speech(reply)
    except Exception as e:
        return history + [(user_text, reply)], None, f"TTS error: {e}"

    return history + [(user_text, reply)], (sr, wav), f"User: {user_text}\nAssistant: {reply}"


def text_to_voice(user_text, history):
    history = history or []
    user_text = (user_text or "").strip()
    if not user_text:
        return history, None, "Type a message first."
    try:
        reply = generate_reply(history, user_text)
        sr, wav = synthesize_speech(reply)
    except Exception as e:
        return history, None, f"Error: {e}"
    return history + [(user_text, reply)], (sr, wav), f"User: {user_text}\nAssistant: {reply}"


def export_chat(history):
    lines = []
    for u, a in history or []:
        lines += [f"User: {u}", f"Assistant: {a}", ""]
    text = "\n".join(lines).strip() or "No dialog yet."
    with tempfile.NamedTemporaryFile(delete=False, suffix=".txt", mode="w") as f:
        f.write(text)
        path = f.name
    return path

We add the interactive functions for our agent: clear_history resets the conversation, voice_to_voice handles speech input and returns a spoken reply, text_to_voice processes typed input and speaks back, and export_chat saves the complete dialog into a downloadable text file.
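Because these callbacks are plain functions, we can also drive the agent programmatically, without any UI; a minimal sketch:

# Call the same callback the UI will use; audio is None if a stage failed.
history, audio, transcript = text_to_voice("What is speech synthesis?", [])
print(transcript)
if audio is not None:
    sr, wav = audio
    print(f"{len(wav) / sr:.1f}s of audio at {sr} Hz")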

with gr.Blocks(title="Advanced Voice AI Agent (HF Pipelines)") as demo:
    gr.Markdown(
        "## 🎙 Advanced Voice AI Agent (Hugging Face Pipelines Only)\n"
        "- **ASR**: openai/whisper-small.en\n"
        "- **LLM**: google/flan-t5-base\n"
        "- **TTS**: suno/bark-small\n"
        "Speak or type; the agent replies with voice + text."
    )


   with gr.Row():
       with gr.Column(scale=1):
            mic = gr.Audio(sources=["microphone"], type="filepath", label="Record")
            say_btn = gr.Button("🎤 Speak")
            text_in = gr.Textbox(label="Or type instead", placeholder="Ask me anything…")
            text_btn = gr.Button("💬 Send")
            export_btn = gr.Button("⬇ Export Chat (.txt)")
           reset_btn = gr.Button("♻ Reset")
       with gr.Column(scale=1):
           audio_out = gr.Audio(label="Assistant Voice", autoplay=True)
            transcript = gr.Textbox(label="Transcript", lines=6)
            chat = gr.Chatbot(height=360)
   state = gr.State([])


    def update_chat(history):
        return [(u, a) for u, a in (history or [])]


    say_btn.click(voice_to_voice, [mic, state], [state, audio_out, transcript]).then(
        update_chat, inputs=state, outputs=chat
    )
    text_btn.click(text_to_voice, [text_in, state], [state, audio_out, transcript]).then(
        update_chat, inputs=state, outputs=chat
    )
    reset_btn.click(clear_history, None, [chat, state])
    export_btn.click(export_chat, state, gr.File(label="Download chat.txt"))


demo.launch(debug=False)

We build a clean Gradio UI that lets us speak or type and then hear the agent's response. We wire the buttons to our callbacks, maintain chat state, and stream results into a chatbot, a transcript box, and an audio player, all launched in a single Colab app.
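Two optional launch tweaks can help in Colab (both are standard Gradio options, shown here as an alternative to the plain launch above, not as part of the original app): queue() serializes GPU-bound requests so overlapping clicks don't contend for the models, and share=True prints a public link.

# Optional alternative to the launch call above.
demo.queue().launch(share=True, debug=False)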

In conclusion, we see how seamlessly Hugging Face pipelines let us create a voice-driven conversational agent that listens, thinks, and responds. We now have a working demo that captures audio, transcribes it, generates intelligent responses, and returns speech output, all within Colab. With this foundation, we can experiment with larger models, add multilingual support, or extend the system with custom logic. Still, the core idea remains the same: we can bring ASR, LLM, and TTS together into one smooth workflow for an interactive voice AI experience.
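For example, swapping in larger checkpoints is a one-line change per pipeline (an untested sketch; whisper-medium is multilingual, and bigger checkpoints need more VRAM than the smaller ones used above):

# Drop-in upgrades: multilingual ASR and a larger instruction-tuned LLM.
asr = pipeline("automatic-speech-recognition", model="openai/whisper-medium",
               device=DEVICE, chunk_length_s=30)
tok = AutoTokenizer.from_pretrained("google/flan-t5-large")
llm = AutoModelForSeq2SeqLM.from_pretrained("google/flan-t5-large", device_map="auto")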



