Three Practical Ways to Detect Sensitive Data
Preface
Agents don’t just think — they move data between systems.
They fetch from APIs, read your docs, write summaries, file tickets, and send emails.
That means sensitive data will inevitably cross their context window. If you don’t detect and mask it, you risk leaking data to the wrong party.
Every production agent needs sensitive data detection as a first-class step in the loop.
This post shows three practical, runnable ways to do it: no benchmarks, just recipes and trade-offs.
Install
# Python 3.10+ recommended
pip install presidio-analyzer presidio-anonymizer gliner
# spaCy as Presidio's NLP engine (fast, stable)
pip install spacy && python -m spacy download en_core_web_sm
# If you’ll use the built-in GLiNER recognizer inside Presidio:
pip install "presidio-analyzer[gliner]"
Presidio (rules/regex + validators)
What it is
Presidio runs a registry of recognizers (regex/rules and optional NER). With spaCy as the NLP engine, it’s fast and production-friendly. For certain entities (e.g., credit cards), Presidio uses format checks + Luhn, which dramatically cuts false positives.
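For intuition, here is the Luhn check in miniature, a sketch of the standard checksum rather than Presidio's internal code. A random digit string passes it only about 10% of the time, which is where the false-positive reduction comes from:

def luhn_valid(number: str) -> bool:
    """Standard Luhn checksum: double every second digit from the right."""
    digits = [int(ch) for ch in number if ch.isdigit()]
    total = 0
    for i, d in enumerate(reversed(digits)):
        if i % 2 == 1:       # every second digit from the right
            d *= 2
            if d > 9:
                d -= 9       # equivalent to summing the two digits
        total += d
    return total % 10 == 0

print(luhn_valid("4111 1111 1111 1111"))  # True: a well-known test card number
print(luhn_valid("1234 5678 9012 3456"))  # False: random-looking 16 digits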
from presidio_analyzer import AnalyzerEngine
from presidio_analyzer.nlp_engine import NlpEngineProvider
text = (
    "Email roey@example.com, phone +1 (415) 555-0123, "
    "SSN 123-45-6789, card 4111 1111 1111 1111, "
    "IP 2001:0db8:85a3::8a2e:370:7334."
)

# Configure spaCy as the NLP engine; nlp_configuration must be passed by keyword
nlp_engine = NlpEngineProvider(nlp_configuration={
    "nlp_engine_name": "spacy",
    "models": [{"lang_code": "en", "model_name": "en_core_web_sm"}],
}).create_engine()

analyzer = AnalyzerEngine(nlp_engine=nlp_engine)
results = analyzer.analyze(text=text, language="en")
for r in results:
    print(r.entity_type, r.start, r.end, r.score)
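Detection is only half of "detect and mask". The presidio-anonymizer package installed above handles the masking step; a minimal sketch using a blanket replace operator:

from presidio_anonymizer import AnonymizerEngine
from presidio_anonymizer.entities import OperatorConfig

anonymizer = AnonymizerEngine()
masked = anonymizer.anonymize(
    text=text,
    analyzer_results=results,
    # Replace every detected span with one placeholder; per-entity
    # operators (hash, redact, mask, encrypt) can be configured the same way
    operators={"DEFAULT": OperatorConfig("replace", {"new_value": "<REDACTED>"})},
)
print(masked.text)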
Pros:
- Fast with spaCy's pipeline.
- Custom, domain-aware checks (e.g., credit cards with Luhn, structured SSN/phone patterns) — much more than “just regex.”
- Deterministic behavior on well-structured, high-risk IDs.
Cons:
- Long, fuzzy, or archaic spans can slip through or be awkward to cover with rules, e.g. "united states of america" (misspelling/variant boundaries) or "septmber 13th of 1782 the year of our lord" (non-standard/archaic date phrasing).
- New PII types still mean writing recognizers; a minimal sketch follows.
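For scale on that last point, a recognizer for a new ID type is short but still hand-written. A sketch for a hypothetical internal employee ID (the EMP-\d{6} format is made up for illustration):

from presidio_analyzer import Pattern, PatternRecognizer

# Hypothetical format: "EMP-" followed by six digits, e.g. EMP-004821
emp_pattern = Pattern(name="employee_id", regex=r"\bEMP-\d{6}\b", score=0.6)
emp_recognizer = PatternRecognizer(
    supported_entity="EMPLOYEE_ID",
    patterns=[emp_pattern],
)
analyzer.registry.add_recognizer(emp_recognizer)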
GLiNER (flexible NER)
What it is
A lightweight transformer NER that extracts arbitrary entity types via label strings (zero-shot style). Great when formats are messy or vary across locales.
from gliner import GLiNER
text = (
    "Email roey@example.com, phone +44 20 7946 0958, "
    "passport A1234567, address 221B Baker Street, London."
)

model = GLiNER.from_pretrained("urchade/gliner_multi_pii-v1")

# Labels are free-form strings; GLiNER matches them zero-shot
labels = [
    "person", "email address", "phone number",
    "ip address", "credit card number", "social security number",
    "passport number", "address",
]

ents = model.predict_entities(text, labels, threshold=0.45)
for e in ents:
    print(e["label"], e["start"], e["end"], e["score"])
Pros:
- Zero-shot add-a-label experience: new entity types without new regex.
- Excellent on messy/long-tail spans and cross-locale variations (addresses, name forms, fuzzy dates).
Cons:
- No structural validation: e.g., it may call 16 random digits a "credit card" (no Luhn check); a post-validation sketch follows this list.
- Heavier than rules; higher latency than Presidio-only.
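One practical mitigation for the first con: post-validate GLiNER's structured hits with the same checks a rule engine applies. A sketch reusing the luhn_valid helper from the Presidio section:

def post_validate(text, ents):
    """Drop 'credit card number' hits that fail the Luhn checksum."""
    kept = []
    for e in ents:
        span = text[e["start"]:e["end"]]
        if e["label"] == "credit card number" and not luhn_valid(span):
            continue  # structurally invalid: likely a false positive
        kept.append(e)
    return kept

ents = post_validate(text, ents)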
Hybrid (Presidio + GLiNER)
What it is
Add GLiNER as a Presidio recognizer. You keep Presidio’s rules/validators (e.g., Luhn for cards) and get GLiNER’s broader coverage for messy cases. This is the most practical default for real apps.
from presidio_analyzer import AnalyzerEngine
from presidio_analyzer.nlp_engine import NlpEngineProvider
from presidio_analyzer.predefined_recognizers import GLiNERRecognizer
text = (
    "Email roey@example.com, phone +44 20 7946 0958, "
    "221B Baker Street, London NW1 6XE."
)

nlp_engine = NlpEngineProvider(nlp_configuration={
    "nlp_engine_name": "spacy",
    "models": [{"lang_code": "en", "model_name": "en_core_web_sm"}],
}).create_engine()
engine = AnalyzerEngine(nlp_engine=nlp_engine)

# Map GLiNER's free-form labels onto Presidio entity types
entity_mapping = {
    "person": "PERSON",
    "name": "PERSON",
    "email address": "EMAIL_ADDRESS",
    "phone number": "PHONE_NUMBER",
    "ip address": "IP_ADDRESS",
    "credit card number": "CREDIT_CARD",
    "social security number": "US_SSN",
    "passport number": "PASSPORT",
    "address": "LOCATION",
}

gliner_rec = GLiNERRecognizer(
    model_name="urchade/gliner_multi_pii-v1",
    entity_mapping=entity_mapping,
    flat_ner=False,
    multi_label=True,
    map_location="cpu",  # "cuda" if you have a GPU
)
engine.registry.add_recognizer(gliner_rec)

# Recommended: remove spaCy's default NER recognizer to avoid double-NER
try:
    engine.registry.remove_recognizer("SpacyRecognizer")
except Exception:
    pass

results = engine.analyze(text=text, language="en")
for r in results:
    print(r.entity_type, r.start, r.end, r.score)
Pros:
- Keep speed (spaCy engine) + keep validators (e.g., Luhn for cards).
- Add coverage for messy/long-tail cases via GLiNER labels.
- Centralized thresholds, allow/deny lists, and logging through Presidio (example after this list).
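For example, a deny list plugs in as a recognizer and an allow list as an analyze-time argument. A sketch against the hybrid engine above (the codename entity and values are hypothetical; the allow_list argument assumes a recent presidio-analyzer version):

from presidio_analyzer import PatternRecognizer

# Deny list: always flag these exact strings (hypothetical codenames)
codename_rec = PatternRecognizer(
    supported_entity="PROJECT_CODENAME",
    deny_list=["Project Nimbus", "Project Atlas"],
)
engine.registry.add_recognizer(codename_rec)

# Allow list: never flag these spans, e.g. a public support address
results = engine.analyze(
    text=text, language="en", allow_list=["support@example.com"]
)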
Cons:
- Slower than rules-only.
- You’ll want a simple overlap policy when other recognizers and GLiNER both hit (e.g., prefer the higher score or the tighter span); a sketch follows.
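A simple greedy version of that policy: sort hits by score (higher first), then span length (tighter first), and keep each hit only if it doesn't overlap one already kept:

def resolve_overlaps(results):
    """Keep the best hit per region: higher score wins, then the tighter span."""
    kept = []
    for r in sorted(results, key=lambda r: (-r.score, r.end - r.start)):
        if all(r.end <= k.start or r.start >= k.end for k in kept):
            kept.append(r)
    return kept

results = resolve_overlaps(results)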
Trade-offs
| Need / Constraint | Presidio (spaCy) | GLiNER | Hybrid |
|---|---|---|---|
| Credit cards, SSNs, structured IDs (validators) | ✅ | ❌ | ✅ |
| Messy/archaic/long spans (“crazy” spans) | ⚠️ | ✅ | ✅ |
| Fastest throughput | ✅ | ❌ | ❌ |
| New PII types quickly | ⚠️ (write rules) | ✅ | ✅ |
| Centralized policy (lists/logging/thresholds) | ✅ | ❌ | ✅ |