Local anonymization
Detect then substitute, locally
Anonymization happens in two steps, entirely on your server, before anything goes to a language model:
- Detection. First, deterministic rules: an IBAN, an amount, or an email address follows a format that a regular expression can recognize. Then, for whatever remains ambiguous, a named-entity recognition model that runs offline (detection sends no outgoing network packets).
- Substitution. Each detected value is replaced with a typed and consistent token: [PERSON_1], [SALARY_1], [IBAN_1]. The same real name always receives the same token within a document. This lets the model keep the relational structure, which keeps the response useful.
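The two steps above can be sketched as follows. This is a minimal illustration, not the product's implementation: the regular expressions, the `Anonymizer` class, and the `entities` parameter (a stand-in for the output of the offline named-entity model) are all assumptions, and the email address in the sample is a made-up placeholder.

```python
import re

# Hypothetical deterministic rules; the patterns are simplified for illustration.
RULES = [
    ("EMAIL", re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+")),
    ("IBAN", re.compile(r"\b[A-Z]{2}\d{2}(?: ?[A-Z0-9]{4}){2,7}(?: ?[A-Z0-9]{1,3})?\b")),
    ("AMOUNT", re.compile(r"\b\d{1,3}(?:,\d{3})*(?:\.\d+)? ?(?:EUR|USD)\b")),
]


class Anonymizer:
    """Replaces detected values with typed, consistent tokens (sketch)."""

    def __init__(self) -> None:
        self.token_for: dict[tuple[str, str], str] = {}  # (type, real value) -> token
        self.value_for: dict[str, str] = {}              # token -> real value, kept local
        self.counters: dict[str, int] = {}               # per-type numbering

    def _token(self, kind: str, value: str) -> str:
        # The same real value always receives the same token.
        key = (kind, value)
        if key not in self.token_for:
            n = self.counters.get(kind, 0) + 1
            self.counters[kind] = n
            token = f"[{kind}_{n}]"
            self.token_for[key] = token
            self.value_for[token] = value
        return self.token_for[key]

    def anonymize(self, text: str, entities: list[tuple[str, str]] = ()) -> str:
        # Step 1: deterministic rules, applied in order.
        for kind, pattern in RULES:
            text = pattern.sub(lambda m, k=kind: self._token(k, m.group(0)), text)
        # Step 2: entities found by the offline NER model (passed in here as literals).
        for kind, value in entities:
            if value in text:
                text = text.replace(value, self._token(kind, value))
        return text


sample = (
    "Note from Jean Dupont (payroll manager): transfer the 4,500 EUR bonus "
    "to IBAN FR76 3000 6000 0112 3456 7890 189, contact payroll@example.com. "
    "Jean Dupont will confirm."
)
anonymizer = Anonymizer()
masked = anonymizer.anonymize(sample, entities=[("PERSON", "Jean Dupont")])
print(masked)
```

Both occurrences of the name map to the same token, and `anonymizer.value_for` holds the token-to-value table that stays on the server.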
A rendering example: before / after
Original text (stays on your server):
Note from Jean Dupont (payroll manager): transfer the 4,500 EUR bonus to IBAN FR76 3000 6000 0112 3456 7890 189, contact [email protected].
Text sent to the model (anonymized locally):
Note from [PERSON_1] (payroll manager): transfer the [AMOUNT_1] bonus to IBAN [IBAN_1], contact [EMAIL_1].
The tokens replace the sensitive values; the non-sensitive context (“payroll manager”, “transfer the bonus”) stays in clear. The table mapping each token to its real value never leaves your server:
| Token | Type | Real value (local, never sent) |
|---|---|---|
| [PERSON_1] | Person | Jean Dupont |
| [AMOUNT_1] | Amount | 4,500 EUR |
| [IBAN_1] | IBAN | FR76 3000 6000 0112 3456 7890 189 |
| [EMAIL_1] | Email | [email protected] |
When the model’s response comes back, these tokens are replaced with the real values on your premises, before being displayed. The AI provider never saw a single real identifier.
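The return path can be sketched the same way. The function name `reidentify`, the token pattern, and the sample response are illustrative assumptions; the only idea taken from the text is that tokens are swapped back for real values locally, using the mapping table that never left the server.

```python
import re

# Matches tokens of the form [TYPE_N], e.g. [PERSON_1].
TOKEN = re.compile(r"\[[A-Z]+_\d+\]")


def reidentify(response: str, mapping: dict[str, str]) -> str:
    """Replace each known token with its real value in a single pass.

    Unknown tokens are left untouched so that they remain visible for review.
    """
    return TOKEN.sub(lambda m: mapping.get(m.group(0), m.group(0)), response)


# Local mapping table (never sent to the model).
mapping = {
    "[PERSON_1]": "Jean Dupont",
    "[AMOUNT_1]": "4,500 EUR",
    "[IBAN_1]": "FR76 3000 6000 0112 3456 7890 189",
}

# Hypothetical model response, still expressed in tokens.
response = "Confirmed: [PERSON_1] will transfer [AMOUNT_1] to [IBAN_1]. [PERSON_2] is unknown."
restored = reidentify(response, mapping)
print(restored)
```

A single regular-expression pass means a real value that happens to look like a token is never substituted a second time.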
Why typed tokens, not [XXX]
Masking with an opaque marker ([XXX]) destroys meaning. A typed and consistent token
([PERSON_1]) lets the model understand “who is talking to whom” without knowing the real
identity. This is what makes anonymization useful and not just protective.
What the model is still able to do (and what it cannot)
| What the model still does well on tokens | What the model cannot or should not do |
|---|---|
| File summarization | Numeric computation on masked values |
| Drafting a response | Verifying a real IBAN |
| Classification | Reasoning over external knowledge tied to the real identity |
| Understanding relationships | Deduplicating on the real name |
Sensitive numeric computation (raising a salary by 5%, summing values) is not done on tokens: it is done in local code on the real value, which gives an exact and deterministic result.
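As an illustration of local computation, the sketch below raises a salary by 5% using Python's `decimal` module, which avoids binary floating-point rounding. The helper `raise_by_percent`, the `local_values` table, and the salary figure are hypothetical; only the principle (compute on the real value, locally) comes from the text.

```python
from decimal import Decimal, ROUND_HALF_UP


def raise_by_percent(amount: Decimal, percent: Decimal) -> Decimal:
    """Return amount increased by percent, rounded to two decimal places."""
    factor = Decimal(1) + percent / Decimal(100)
    return (amount * factor).quantize(Decimal("0.01"), rounding=ROUND_HALF_UP)


# Hypothetical local mapping: the real value behind the token stays on the server.
local_values = {"[SALARY_1]": Decimal("4500.00")}

new_salary = raise_by_percent(local_values["[SALARY_1]"], Decimal("5"))
print(new_salary)  # 4725.00
```

The model only ever sees `[SALARY_1]`; the arithmetic runs in local code and gives the same result every time.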
The pitfall of over-masking
Masking too much has a real cost: an unreadable response, or a burden of human re-reading. Our doctrine assumes an asymmetry: a false positive (masking something that did not need it) costs little; a false negative (letting a leak through) costs dearly. To stay useful while masking, we rely on consistent and typed tokens, deterministic rules before model-based detection, and non-sensitive context left in clear.
Honesty: on very dense free text, over-masking degrades fluency. This cost is not zero, and it should be measured.
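One simple way to start measuring that cost is the share of words that have been replaced by tokens. The metric and the function `masking_ratio` are assumptions made for illustration, not a measure defined by this documentation; a high ratio only hints that a text may have become hard to read.

```python
import re

# Matches tokens of the form [TYPE_N], e.g. [IBAN_1].
TOKEN = re.compile(r"\[[A-Z]+_\d+\]")


def masking_ratio(text: str) -> float:
    """Fraction of whitespace-separated words that contain a token."""
    words = text.split()
    if not words:
        return 0.0
    masked = sum(1 for word in words if TOKEN.search(word))
    return masked / len(words)


anonymized = (
    "Note from [PERSON_1] (payroll manager): transfer the [AMOUNT_1] bonus "
    "to IBAN [IBAN_1], contact [EMAIL_1]."
)
ratio = masking_ratio(anonymized)
print(f"{ratio:.2f}")
```

Tracking this ratio per document type would show where dense free text is being masked heavily enough to warrant a closer look.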
Going further
- Sovereignty - the principle and what really goes out.
- Egress log - the trace of what went out.