knowledge base

Everything about DeIdentify, in plain language

Written for anyone — whether you've never touched SQL or you run nightly ETL. Use the sidebar to jump around, or search for a specific term.

Getting Started

What is DeIdentify?

DeIdentify is a tool that takes a database export (SQL or CSV) and replaces the personal information inside it with realistic-looking fake data. It runs entirely in your browser — nothing is uploaded anywhere.

You get an output file with the same structure (same tables, same columns, same row counts, same relationships between tables) but with the sensitive values replaced. That file is safe to share with developers, analysts, or vendors who shouldn't see the real data.

infoNever worked with SQL before? A .sql file is just a text file that describes a database. Open it in Notepad and you'll see human-readable commands. DeIdentify handles all the parsing for you.

When should I use it?

You need to give a production data snapshot to a developer or analyst who isn't cleared for real PII.
You're setting up a staging or QA environment and want realistic data without the compliance risk.
You're sending a bug report to a vendor and want to strip identifiers from a data sample.
You're preparing a dataset for internal training or a demo.
You need de-identified data for research under HIPAA Safe Harbor or Expert Determination.

5-step quick start

Drag a .sql or .csv file anywhere onto the page (or click 'Choose file').
Review the auto-detected columns. Sensitive fields are highlighted; the tool guesses what each one contains (email, phone, name, etc.).
Optional: click the Preset menu and pick 'HIPAA Safe Harbor' if this is health data — this configures every column at once.
Click the green 'De-identify' button. You'll see a summary of what changed and a Diff view.
Click 'Download' (or 'Copy') to save the rewritten file. Optionally download the Audit report for your records.

tipNo file to try? Click 'Load sample' on the home page — it loads a small fake healthcare dataset so you can see the whole flow.

Core Concepts

SQL & CSV in one minute

A SQL dump is a text file with two kinds of statements you care about:

CREATE TABLE patients (
  id INT PRIMARY KEY,
  first_name VARCHAR(100),
  email VARCHAR(255)
);

INSERT INTO patients (id, first_name, email) VALUES
  (1, 'Alice', 'alice@example.com'),
  (2, 'Bob',   'bob@example.com');

CREATE TABLE defines a table (name + columns). INSERT INTO ... VALUES adds rows. DeIdentify reads the CREATE TABLE to learn the shape, then rewrites the values inside INSERT lines.

A CSV file is even simpler — the first row is column names, the rest are values separated by commas. Same idea, one table per file.

How columns are detected

DeIdentify looks at each column two ways:

By name — 'email' → email, 'first_name' → firstName, 'ssn' → ssn, 'mrn' → medical record number, and dozens more.
By value — if the first 30 non-null values match a known pattern (like nnn-nn-nnnn for SSN, or an email format), it's flagged even if the column has a weird name.
By your custom rules — you can add regex patterns that override or extend the built-in list.

infoThe detected type is a guess. Review the column plan before running — you can change any column's strategy in one click.

Referential integrity — the important bit

If patients.id = 42 belongs to Alice, and encounters.patient_id = 42 is Alice's visit, then after de-identification both need to become the same new value (say, 7891). Otherwise the visit is orphaned.

DeIdentify handles this automatically when: (a) the primary key column is in your CREATE TABLE, (b) INSERT statements include the primary key value explicitly, and (c) foreign-key columns follow the '<table_singular>_id' convention or you set the link manually.

noteIf you export your dump with auto-increment IDs replaced by DEFAULT (common in pg_dump without --inserts), primary keys aren't in the file — the preflight check will flag this so you can re-export.

Strategies (per column)

Keep

Leave the value exactly as-is. The output column is byte-for-byte identical to the input.

When to use: columns you have personally confirmed contain no direct identifiers, quasi-identifiers, or free-text that could leak PII — e.g. product SKUs, status flags, boolean columns, category enums, non-personal system timestamps like created_at on a lookup table.

Risk: Keep is the only strategy that can leak data if you're wrong about the column. When in doubt, choose Fake or Redact instead.

Fake (realistic)

Replace each real value with a synthetic value of the same shape and type. A name becomes a plausible name, an email becomes a syntactically valid email, a phone number stays formatted like a phone number.

Deterministic within a run: the same input plus the same salt always produces the same fake value, so joins across tables continue to work (customer 'Alice Smith' becomes 'Marta Hoffman' in every table she appears in). Change the salt to get a completely new mapping.

When to use: this is the default and best choice for most identifying columns (names, emails, phones, addresses, MRNs). It preserves data shape so applications, forms, and reports keep working against the de-identified copy.

Not reversible without the original mapping table (which DeIdentify never stores).

Hash (pseudonym)

Replace the value with a short deterministic token derived from HMAC(salt, value) — for example, a1f9c3d2. Same input plus same salt always produces the same token.

When to use: internal identifiers where you need uniqueness and referential integrity but don't need the output to look human — foreign keys, session IDs, external system IDs, join keys between tables.

Security: not reversible without brute-forcing the salt against a known value space. Keep the salt private if you don't want a knowledgeable insider to re-link tokens to originals.

Shift date ±

Add or subtract a random offset (in days) from every date/time value. Intervals between events are preserved — if admission was 4 days before discharge in the source, it is still 4 days apart in the output.

Per-entity (recommended for clinical data): every date belonging to one patient/subject is shifted by the same offset. Different patients get different offsets. Preserves each individual's timeline while making the absolute calendar meaningless. Required style for HIPAA Expert Determination workflows.
Per-table: every date in the table is shifted by the same offset. Simpler and faster, but weaker protection when a table is dominated by one subject.

When to use: encounter dates, appointment times, transaction timestamps, event logs — any temporal data where you want to preserve durations and sequence but hide the actual calendar.

Generalize

Replace a precise value with a broader bucket. Reduces uniqueness (and therefore re-identification risk) while keeping the column useful for analytics.

Age 78 → '75-84' age band
Date 2024-03-18 → 2024 (year only) or 2024-03 (month)
ZIP 10024 → 100XX (ZIP3, the HIPAA Safe Harbor rule for geography)
Salary 87,412 → '80k–90k' band

When to use: quasi-identifiers where the exact value isn't needed downstream. This is the HIPAA Safe Harbor approach for dates and geographic detail and the primary tool for reducing k-anonymity risk.

Redact / Null

Remove the value entirely. The output cell is written as NULL (SQL) or empty (CSV). No trace of the original remains in the file.

When to use: columns with no analytic value to the recipient, or free-text fields where you can't guarantee what's inside — clinical notes, internal comments, password hashes, security answers, uploaded document blobs, columns that consistently contain names or SSNs mixed with other text.

Strongest guarantee of the six strategies — you cannot leak what isn't there. Use liberally when in doubt.

Presets

HIPAA Safe Harbor

Applies the Safe Harbor rule: remove or generalize all 18 identifier categories that the HIPAA Privacy Rule requires. Result is de-identified under 45 CFR 164.514(b)(2).

Names, geographic subdivisions smaller than a state (except first 3 ZIP digits, and only if the geographic unit contains > 20,000 people)
All dates directly related to an individual (except year), and all ages > 89
Phone, fax, email, SSN, MRN, health plan number, account number
Certificate/license numbers, vehicle IDs, device serial numbers, URLs, IPs
Biometric identifiers, full-face photos, any other unique code

noteSafe Harbor is a rule, not a magic bullet. You are still responsible for reviewing that no free-text field or combination of columns re-identifies someone.

Expert Determination (workflow)

The tool doesn't certify — a qualified statistician does — but it gives you the levers: per-entity date shifts keep clinical intervals intact, fake generators preserve distributions, and hash keeps join keys working. Export the Audit report to hand to your reviewer.

42 CFR Part 2 (behavioral health)

Extra strict rules apply to substance use disorder records. DeIdentify recognizes provider_name, facility, diagnosis_code, and substance_use fields as sensitive by default and redacts or generalizes them.

Custom Rules

Adding a custom rule

Custom rules let you extend detection with regex patterns for column names or values. Click '+ show custom rules' in the toolbar.

Label: a short name for your rule (visible in the summary).
Scope: 'name' matches the column name, 'value' matches the actual values in the column.
Pattern: a regular expression. Wildcards are just regex (.* matches anything).
Target: what type this column should be treated as (email, mrn, custom identifier, etc.). Determines the default strategy and fake generator.

Examples:
  ^emp_id$              scope: name    target: accountNumber
  patient.*number       scope: name    target: mrn
  ^EMP\d{6}$            scope: value   target: username

Export / Import a profile

A profile is a JSON file containing your column plan, custom rules, salt, and settings. Export once, share it with a teammate, and everyone gets identical de-identification. Ideal for a team standard or a repeatable pipeline.

Preflight & Validation

What preflight checks

Before rewriting, DeIdentify scans your SQL for problems that would break referential mapping. Errors show line numbers.

COPY_UNSUPPORTED: pg_dump used COPY … FROM STDIN instead of INSERT. Re-export with pg_dump --inserts --column-inserts.
UNSUPPORTED_STATEMENT: forms like LOAD DATA INFILE, BULK INSERT, MERGE, INSERT … SELECT can't be rewritten row by row.
PK_NOT_IN_INSERT: an INSERT omits the primary key column. Referential mapping needs explicit PK values.
PK_NULL_VALUE / PK_DEFAULT_VALUE: primary keys were exported as NULL or DEFAULT (auto-increment). Re-export with explicit values.
NO_PRIMARY_KEY: table has no PRIMARY KEY declared — foreign key linking to it won't be consistent.
NO_SCHEMA: an INSERT references a table with no CREATE TABLE in the file — column-name detection still works but PK checks can't run.

Outputs & Reports

Download vs Copy

After a run, 'Download' saves the rewritten file with the filename you set in the 'save as' box. 'Copy' puts the same content on your clipboard — handy for pasting into a psql console or a scratch buffer.

Reading the Diff

The Diff tab in the preview pane shows original vs de-identified line by line. Red = removed (original), green = added (rewritten). Toggle 'Only changes' to collapse unchanged runs.

Run summary panel

After each run, the green summary panel shows how many columns and values were touched, which strategies ran, which detected types were involved, and which of your custom rules matched a column.

Audit report (.json / .csv)

The audit is a detailed log of what happened — per column, how many values were transformed, redacted, or kept, plus hashed samples so you can spot-check without exposing real PII. Keep this with the output file to prove your process.

Downloading & Running Offline

One-click offline app

The fastest way to use DeIdentify air-gapped is the single-file build. Click the 'Download offline app' button in the green banner at the top of the page (or use the direct link below) — you get one HTML file with every asset inlined.

Save SqlDeidentify.html anywhere on your machine.
Double-click it. Any modern browser opens it as a local page (file://…).
Use it exactly like the web version. Drag-and-drop, presets, custom rules, all offline.
The bundle physically blocks fetch/XHR/WebSocket at runtime, so even if your machine is online, the tool can't transmit data.

infoSome browsers restrict features like IndexedDB when loading from file://. If your workspace (custom rules, salt) doesn't persist between sessions, serve the file with a local static server instead — see the next section.

Updating your offline copy (replacing the old file)

Every time the app is published, a new build is generated with a fresh version stamp (shown as v YYYY.MM.DD-HHMM in the top green banner and on the homepage offline card). To upgrade the copy on your machine:

Note the version currently shown in the green banner at the top of this site — that is the latest published build.
Open your existing local SqlDeidentify.html — the same version string appears in its own top banner. If it already matches, you are up to date.
If it is older, click 'Download offline app' in the banner (or the button on the homepage) to grab the new SqlDeidentify.html.
Save the download into the SAME folder as your old copy and let your browser overwrite the existing file (Chrome/Edge: 'Keep' → replaces; Firefox: check 'Replace'; Safari: it auto-renames — delete the old file, then rename the new one back to SqlDeidentify.html).
Close any browser tabs that still have the OLD file open, then double-click the new SqlDeidentify.html.
Your saved presets, custom rules, and salt are preserved automatically — they live in the browser's IndexedDB scoped to the file's origin, not inside the HTML itself.

infoKeeping the filename identical (SqlDeidentify.html) is what preserves your saved workspace. If you rename the new file or move it to a different folder, the browser treats it as a fresh origin and your custom presets will not appear (they are not lost — just scoped to the old path). Move the old file back or rename the new one to match.

Tip: bookmark this Knowledge Base page. The version badge in the top banner always reflects the newest published build, so you can check at a glance whether your local copy is behind.

Serve locally for persistent workspace

For a stable browser origin so IndexedDB persists your saved rules and plans across sessions:

# put SqlDeidentify.html in a folder, cd into it, then:
python3 -m http.server 8000
# or:  npx serve .
# or:  bunx serve .

# then open http://localhost:8000/SqlDeidentify.html

Build from source (advanced)

If you want to modify the tool or run the multi-file build:

git clone <your-project>
bun install
bun run build            # multi-file build in ./dist
bunx serve dist          # any static file server works

# Or rebuild the single-file offline bundle:
./scripts/build-standalone.sh
# → dist-standalone/SqlDeidentify-offline/SqlDeidentify.html

Verifying it's actually offline

Open your browser DevTools → Network tab, then load DeIdentify. You should see only local requests (or none, for the single-file build).
Turn off your Wi-Fi and load a real file. Everything should work identically.
Inspect the HTML: it declares Content-Security-Policy connect-src 'none', and installs a runtime kill-switch on fetch, XMLHttpRequest, and WebSocket.

Troubleshooting

A column was detected as the wrong type

Open the column table for that file, find the row, and change the Strategy or Type dropdown. Or add a custom rule so the detection is correct next time and for anyone who imports your profile.

Foreign keys aren't linking after de-identify

Check 'auto-link foreign keys' is on in the toolbar.
Confirm your INSERTs include the primary key column explicitly (preflight will flag if not).
Manually set the FK link on the column plan: set Strategy to 'Hash' and 'Link to' the parent table's PK column.

Preflight is showing red errors

Errors don't prevent you from running — but they mean referential mapping may be inconsistent. Read the line number and message, then either re-export your dump with the recommended flags or accept the risk if that table isn't linked to anything.

My file is very large

Everything runs in browser memory. Files up to a few hundred MB are usually fine on modern machines. If the tab hangs or crashes, split the dump (one table per file), de-identify each, and concatenate. For truly huge dumps, use the offline app on a workstation with more RAM.

Privacy & Compliance

Where does my data go?

Nowhere. Files you drop into DeIdentify are read with the browser's File API and processed in the same tab. There is no upload endpoint.

Your data lives only in the current browser tab's memory (in-memory JavaScript state). It is never written to IndexedDB, localStorage, sessionStorage, or any disk cache.

The de-identification engine runs entirely offline inside your browser. No network requests are made during parsing, analysis, or output generation. Even the offline build is configured to block any outgoing network calls while you work.

No service worker is registered, so there is no background cache of uploads, outputs, or audit reports. When you close or refresh the tab, everything in memory is released and gone.

tipThe chat assistant only sends your text question to the AI gateway. It never transmits file contents, analysis results, or output data.

This tool is not legal advice

noteDeIdentify helps you apply the mechanical steps of Safe Harbor and produce audit records, but compliance with HIPAA, GDPR, or any other regime is your responsibility. When in doubt, consult a privacy officer or qualified expert before sharing data.

Liability

By using this platform, you agree that 100% of liability for how you use it remains with you, not the platform. You are solely responsible for reviewing outputs, ensuring they meet your compliance obligations, and deciding whether a dataset is safe to share.

The platform provides automation and guidance, but it does not certify, guarantee, or warrant that any output satisfies legal or regulatory requirements.

Glossary

Glossary of terms

Plain-language definitions for every technical term used in the app and this knowledge base.

SQL: Structured Query Language. The standard language databases speak. A .sql file is just text containing commands like CREATE TABLE and INSERT INTO.
SQL dump: A text file exported from a database that recreates it exactly — the tables (CREATE TABLE) and every row of data (INSERT INTO ...).
CSV: Comma-Separated Values. A plain-text spreadsheet where each line is a row and columns are separated by commas.
Row: One record in a table — for example, one patient.
Column: One field on every row — for example, first_name or email.
Primary key (PK): The column whose value uniquely identifies each row (usually 'id'). DeIdentify uses PK values to keep foreign-key links consistent after renaming.
Foreign key (FK): A column that points to a row in another table. Example: encounters.patient_id points to patients.id. DeIdentify keeps these links intact.
PII: Personally Identifiable Information — any data that can identify a real person: names, emails, phone numbers, addresses, SSNs, etc.
PHI: Protected Health Information — PII combined with anything about a person's health, care, or payment for care. Governed by HIPAA in the US.
De-identification: Replacing PII/PHI with realistic-looking but fake values so the data is safe to share for testing, analytics, or research.
Pseudonymization: A form of de-identification where each real value is mapped to a stable fake one (Alice → Marta every time). Reversible only if the mapping is kept.
Anonymization: Stronger than pseudonymization — no mapping is kept and re-identification is meant to be infeasible.
HIPAA Safe Harbor: One of two HIPAA methods for de-identifying PHI. Removes 18 specific identifier categories (names, dates, geographic detail, IDs, etc.). The built-in preset applies all 18.
Expert Determination: The other HIPAA method — a statistician certifies re-identification risk is very small. DeIdentify supports this workflow with per-entity date shifts and generalization.
Salt: A short secret string mixed into hashes and fake generators so the same input maps to the same output within one run. Changing the salt gives you a totally new, still-consistent mapping.
Hash: A one-way function that turns a value into a fixed-length code. Deterministic (same input → same output) but not reversible without brute force.
Preflight: The pre-check DeIdentify runs before rewriting. Flags missing primary keys, INSERTs without column lists, and statement types the tool can't rewrite.
Preset: A saved bundle of column strategies (e.g. 'HIPAA Safe Harbor'). One click and every recognized column gets a sensible default.
Strategy: What to do with a column's values: Keep, Fake, Hash, Shift, Generalize, or Redact.
Date shift: Add or subtract a random-but-consistent number of days from every date. Preserves intervals (admission → discharge stays 4 days) while hiding the real calendar.
Per-entity shift: The same shift is applied to every date belonging to one patient. Different patients get different shifts. Recommended for HIPAA Expert Determination.
Per-table shift: The same shift is applied to every date in one table. Simpler; leaks less about individuals when a table has many people.
Generalize: Replace an exact value with a broader bucket. Example: '78 years old' → '75-84', '2024-03-18' → '2024', '10024' → '100XX'.
Redact / Null: Drop the value entirely — set it to NULL or an empty string. Use when the field is not needed downstream.
Referential integrity: The guarantee that foreign keys still line up after rewriting. When patients.id becomes '42→7891', encounters.patient_id also becomes '42→7891'.