Step by Step Guide to Build a Complete PII Detection and Redaction Pipeline with OpenAI Privacy Filter
In this tutorial, we build a full, production-style pipeline for detecting and redacting personally identifiable information (PII) using the OpenAI Privacy Filter. We begin by setting up the environment and loading a token-classification model that identifies several categories of sensitive data, including names, emails, phone numbers, addresses, and secrets. We then design helper functions to normalize labels, extract structured spans, and transform raw model outputs into usable formats. From there, we implement a configurable redaction system that lets us substitute sensitive entities with meaningful placeholders, preserving privacy while keeping contextual clarity. Along the way, we test the pipeline on curated examples, convert outputs into structured dataframes, and prepare the system for batch processing and real-world usage.
!pip install -q -U transformers accelerate torch pandas matplotlib huggingface_hub
import os, re, json, time, textwrap, warnings
from pathlib import Path
from collections import Counter
import pandas as pd
import matplotlib.pyplot as plt
import torch
from transformers import AutoTokenizer, AutoModelForTokenClassification, pipeline
warnings.filterwarnings("ignore")
MODEL_ID = "openai/privacy-filter"
OUT_DIR = Path("/content/privacy_filter_outputs")
OUT_DIR.mkdir(parents=True, exist_ok=True)
device = 0 if torch.cuda.is_available() else -1
torch_dtype = torch.bfloat16 if torch.cuda.is_available() else torch.float32
print("Device:", "GPU" if torch.cuda.is_available() else "CPU")
print("Torch dtype:", torch_dtype)
print("Model:", MODEL_ID)
We install all required libraries and set up the pipeline's runtime environment. We configure device selection and initialize paths for storing outputs. We also print device details to confirm that everything is ready before loading the model.
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModelForTokenClassification.from_pretrained(
    MODEL_ID,
    torch_dtype=torch_dtype,
    device_map="auto" if torch.cuda.is_available() else None
)
classifier = pipeline(
    task="token-classification",
    model=model,
    tokenizer=tokenizer,
    aggregation_strategy="simple",
    device=device if not torch.cuda.is_available() else None
)
LABEL_MASKS = {
    "account_number": "[ACCOUNT_NUMBER]",
    "private_address": "[PRIVATE_ADDRESS]",
    "private_email": "[PRIVATE_EMAIL]",
    "private_person": "[PRIVATE_PERSON]",
    "private_phone": "[PRIVATE_PHONE]",
    "private_url": "[PRIVATE_URL]",
    "private_date": "[PRIVATE_DATE]",
    "secret": "[SECRET]"
}
We load the tokenizer and model, wrap them in a token-classification pipeline with simple span aggregation, and define a mapping from each PII label to a typed placeholder that the redaction step will substitute for detected entities.
def normalize_label(label):
    label = label.replace("B-", "").replace("I-", "").replace("E-", "").replace("S-", "")
    return label.strip()

def detect_pii(text):
    raw = classifier(text)
    spans = []
    for item in raw:
        label = normalize_label(item.get("entity_group", item.get("entity", "")))
        if label == "O" or not label:
            continue
        spans.append({
            "label": label,
            "score": float(item["score"]),
            "text": item["word"],
            "start": int(item["start"]),
            "end": int(item["end"])
        })
    spans = sorted(spans, key=lambda x: (x["start"], x["end"]))
    return spans

def redact_text(text, spans, min_score=0.50, mode="typed"):
    filtered = [s for s in spans if s["score"] >= min_score]
    filtered = sorted(filtered, key=lambda x: x["start"], reverse=True)
    redacted = text
    for span in filtered:
        replacement = LABEL_MASKS.get(span["label"], "[PII]") if mode == "typed" else "[REDACTED]"
        redacted = redacted[:span["start"]] + replacement + redacted[span["end"]:]
    return redacted

def privacy_report(text, min_score=0.50):
    spans = detect_pii(text)
    redacted = redact_text(text, spans, min_score=min_score)
    return {
        "original_text": text,
        "redacted_text": redacted,
        "span_count": len([s for s in spans if s["score"] >= min_score]),
        "spans": [s for s in spans if s["score"] >= min_score]
    }
We define helper functions to normalize labels and extract PII spans from model predictions. We implement a redaction function that replaces sensitive segments based on confidence thresholds. We then combine everything into a single reporting function that returns structured outputs.
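Because the redaction function mutates the string from right to left, earlier span offsets stay valid after each substitution. Below is a self-contained sanity check of that replacement logic; it mirrors the function above under the local names MASKS and redact, and the spans, offsets, and scores are hand-made for illustration, so no model call is needed:

```python
# Mirrors the redaction logic above: replacing spans right-to-left keeps
# the character offsets of earlier spans valid after each substitution.
MASKS = {"private_person": "[PRIVATE_PERSON]", "private_email": "[PRIVATE_EMAIL]"}

def redact(text, spans, min_score=0.50, mode="typed"):
    filtered = [s for s in spans if s["score"] >= min_score]
    for span in sorted(filtered, key=lambda x: x["start"], reverse=True):
        mask = MASKS.get(span["label"], "[PII]") if mode == "typed" else "[REDACTED]"
        text = text[:span["start"]] + mask + text[span["end"]:]
    return text

sample = "Alice Smith <[email protected]>"
spans = [
    {"label": "private_person", "score": 0.98, "start": 0, "end": 11},
    {"label": "private_email", "score": 0.95, "start": 13, "end": 30},
]
print(redact(sample, spans))                   # -> [PRIVATE_PERSON] <[PRIVATE_EMAIL]>
print(redact(sample, spans, mode="generic"))   # -> [REDACTED] <[REDACTED]>
```

Note that filtering happens before sorting, so a min_score above every span's confidence simply returns the text unchanged.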
sample_texts = [
    "My name is Alice Smith and my email is [email protected]. Call me at +1 415 555 0189.",
    "Patient Rohan Mehta visited on 2025-04-11 and lives at 221B Baker Street, London.",
    "Use API key sk-test-51HxYzDemoSecret987 and send the invoice to [email protected].",
    "The public website is https://example.com, but Jane Doe's private portal is https://jane-private.example.net.",
    "Account number 123456789012 was linked to Ahmed Khan on 12 March 2024.",
    "This sentence has no private information and should mostly remain unchanged."
]
reports = []
for i, text in enumerate(sample_texts, 1):
    report = privacy_report(text, min_score=0.50)
    report["example_id"] = i
    reports.append(report)
for r in reports:
    print("\n" + "=" * 100)
    print("Example:", r["example_id"])
    print("Original:", r["original_text"])
    print("Redacted:", r["redacted_text"])
    print("Detected spans:")
    print(json.dumps(r["spans"], indent=2, ensure_ascii=False))
rows = []
for r in reports:
    for s in r["spans"]:
        rows.append({
            "example_id": r["example_id"],
            "label": s["label"],
            "score": s["score"],
            "detected_text": s["text"],
            "start": s["start"],
            "end": s["end"],
            "original_text": r["original_text"],
            "redacted_text": r["redacted_text"]
        })
df = pd.DataFrame(rows)
display(df)
We create sample inputs and run them through the pipeline to test detection and redaction. We collect structured results and print both the original and redacted text for comparison. We also convert the outputs into a dataframe for easier analysis.
json_path = OUT_DIR / "privacy_filter_reports.json"
csv_path = OUT_DIR / "privacy_filter_spans.csv"
with open(json_path, "w", encoding="utf-8") as f:
    json.dump(reports, f, indent=2, ensure_ascii=False)
df.to_csv(csv_path, index=False)
print("\nSaved JSON:", json_path)
print("Saved CSV:", csv_path)
if len(df):
    label_counts = df["label"].value_counts()
    plt.figure(figsize=(10, 5))
    label_counts.plot(kind="bar")
    plt.title("Detected PII Categories")
    plt.xlabel("PII Category")
    plt.ylabel("Detected Span Count")
    plt.xticks(rotation=35, ha="right")
    plt.tight_layout()
    plt.show()
    plt.figure(figsize=(10, 5))
    df["score"].plot(kind="hist", bins=10)
    plt.title("Detection Confidence Distribution")
    plt.xlabel("Confidence Score")
    plt.ylabel("Frequency")
    plt.tight_layout()
    plt.show()
def compare_thresholds(text, thresholds=(0.30, 0.50, 0.70, 0.90)):
    spans = detect_pii(text)
    results = []
    for threshold in thresholds:
        kept = [s for s in spans if s["score"] >= threshold]
        results.append({
            "threshold": threshold,
            "span_count": len(kept),
            "redacted_text": redact_text(text, spans, min_score=threshold)
        })
    return pd.DataFrame(results)
threshold_demo = compare_thresholds(sample_texts[0])
display(threshold_demo)
We save the processed outputs to JSON and CSV for persistence and reuse. We visualize the detected PII categories and confidence distributions with plots. We also analyze how changing the threshold affects detection and redaction behavior.
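To build intuition for what compare_thresholds reports without invoking the model, the same filtering rule can be run over mock detections; the labels and scores below are invented purely for illustration:

```python
# Raising the threshold monotonically shrinks the set of kept spans:
# higher precision (fewer false positives), lower recall (missed PII).
mock_spans = [
    {"label": "private_person", "score": 0.95},
    {"label": "private_email", "score": 0.88},
    {"label": "private_phone", "score": 0.62},
    {"label": "private_date", "score": 0.35},
]
counts = {t: sum(1 for s in mock_spans if s["score"] >= t)
          for t in (0.30, 0.50, 0.70, 0.90)}
print(counts)  # -> {0.3: 4, 0.5: 3, 0.7: 2, 0.9: 1}
```

For redaction, erring toward a lower threshold is usually the safer default, since a missed entity leaks data while an over-redacted span only costs readability.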
long_document = """
Customer Support Transcript:
Agent: Hello, may I confirm your name?
Customer: My name is PSP.
Agent: Thanks. Could you confirm your email?
Customer: [email protected].
Agent: And your phone number?
Customer: +91 xxxxx xxxxx.
Agent: Your service address is 45 MG Road, Bengaluru, Karnataka.
Customer: Yes. Also, my backup email is [email protected].
Agent: Please don't share passwords or OTPs.
Customer: The temporary token I received is ghp_demoSecretToken123456.
"""
long_report = privacy_report(long_document, min_score=0.50)
print("\nLONG DOCUMENT REDACTION")
print("=" * 100)
print(long_report["redacted_text"])
print("\nStructured spans:")
print(json.dumps(long_report["spans"], indent=2, ensure_ascii=False))
def pii_audit_table(texts, min_score=0.50):
    audit_rows = []
    for idx, text in enumerate(texts, 1):
        result = privacy_report(text, min_score=min_score)
        labels = Counter([s["label"] for s in result["spans"]])
        audit_rows.append({
            "id": idx,
            "original_chars": len(text),
            "redacted_chars": len(result["redacted_text"]),
            "span_count": result["span_count"],
            "labels_found": dict(labels),
            "redacted_text": result["redacted_text"]
        })
    return pd.DataFrame(audit_rows)
audit_df = pii_audit_table(sample_texts + [long_document], min_score=0.50)
display(audit_df)
audit_path = OUT_DIR / "privacy_filter_audit.csv"
audit_df.to_csv(audit_path, index=False)
print("Saved audit CSV:", audit_path)
custom_text = input("\nEnter your own text for PII redaction, or press Enter to skip:\n")
if custom_text.strip():
    custom_report = privacy_report(custom_text, min_score=0.50)
    print("\nOriginal:")
    print(custom_report["original_text"])
    print("\nRedacted:")
    print(custom_report["redacted_text"])
    print("\nSpans:")
    print(json.dumps(custom_report["spans"], indent=2, ensure_ascii=False))
else:
    print("Skipped custom input.")
print("\nTutorial complete.")
We test the pipeline on a longer, realistic document to evaluate robustness. We generate an audit-style summary showing the counts and categories of detected PII. We also allow custom user input so we can run the privacy filter interactively.
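One practical gap worth flagging before batch deployment: token-classification models truncate inputs beyond their maximum sequence length, so very long documents may silently lose detections. A common workaround is to run detection over overlapping character windows and shift span offsets back to the full document. The sketch below is our own illustrative addition, not part of the Privacy Filter API; detect_windowed and the window/overlap sizes are hypothetical, and a stub detector stands in for detect_pii so the example runs standalone:

```python
import re

# Sketch: window a long document and remap span offsets to the full text.
# The "seen" set deduplicates spans found in both halves of an overlap.
def detect_windowed(text, detect, window=1000, overlap=100):
    spans, seen = [], set()
    step = window - overlap
    for offset in range(0, max(len(text), 1), step):
        chunk = text[offset:offset + window]
        for s in detect(chunk):
            key = (offset + s["start"], offset + s["end"], s["label"])
            if key in seen:
                continue
            seen.add(key)
            spans.append({**s, "start": offset + s["start"], "end": offset + s["end"]})
    return sorted(spans, key=lambda s: (s["start"], s["end"]))

# Stub detector that flags the literal word "SECRET" for demonstration.
def stub_detect(chunk):
    return [{"label": "secret", "score": 0.99, "start": m.start(), "end": m.end()}
            for m in re.finditer(r"SECRET", chunk)]

doc = ("x" * 1500) + "SECRET" + ("y" * 400)
print(detect_windowed(doc, stub_detect, window=1000, overlap=100))
```

In real use you would pass detect_pii as the detect callable and size the window so entities rarely straddle a boundary; the overlap catches most of those that do, though an entity longer than the overlap can still be split.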
In conclusion, we developed a robust and extensible privacy-filtering workflow that goes beyond simple detection. We systematically evaluated model predictions, applied confidence thresholds, and compared different redaction strategies to understand their impact. We also generated structured reports, visualized detection patterns, and exported results in JSON and CSV formats for auditing and downstream integration. This approach lets us build reliable privacy safeguards into data pipelines, ensuring that sensitive information is consistently identified and handled responsibly while maintaining the usability of the underlying data.
The post Step by Step Guide to Build a Complete PII Detection and Redaction Pipeline with OpenAI Privacy Filter appeared first on MarkTechPost.
