What is a redaction engine?

The redaction engine is a part of the data ingestion pipeline that removes Personally Identifiable Information (PII), Payment Card Information (PCI), and other sensitive information from text.

The redaction engine may be hosted with Stratifyd or it may be an on-premises installation.

What data is redacted?

The following entities are detected and can be removed using the redaction engine.

Entity

Description

Example

CARDINAL

Numeral without a unit (e.g. 1, 2)

  • What is a cardinal number? 30?
  • What is a cardinal number? [CARDINAL REDACTED]?

CREDIT_CARD

Credit card number

  • I like that all movies use 4510-6459-8301-6543 as their CC number
  • I like that all movies use [CREDIT_CARD REDACTED] as their CC number

DATE

Date or date range

  • My birthday is on March 31.
  • My birthday is on [DATE REDACTED].

EMAIL

Email address

EVENT

Named event (e.g. hurricane, battle, war, sporting event)

  • I've never been so scared as when Hurricane Katrina was looming on the horizon.
  • I've never been so scared as when [EVENT REDACTED] was looming on the horizon.

FAC

Facility or place (e.g. building, airport, highway, bridge)

  • I sang at Carnegie Hall when I was in middle school choir
  • I sang at [FAC REDACTED] when I was in middle school choir

GPE

Geopolitical entity (e.g. country, city, state)

  • Cause I was, born in the USA! Yeah I was born in the USA!
  • Cause I was, born in the [GPE REDACTED]! Yeah I was born in the [GPE REDACTED]!

LANGUAGE

Any named language

  • I wish I knew more Cantonese because I live in Hong Kong
  • I wish I knew more [LANGUAGE REDACTED] because I live in Hong Kong

LAW

Named document made into law

  • The Bill of Rights only had 26k votes!
  • [LAW REDACTED] only had 26k votes!

LOC

Location, non-GPE (e.g. mountain range, body of water)

  • Skiing in the Andes mountains is my dream!
  • Skiing in the [LOC REDACTED] mountains is my dream!

MONEY

Monetary value (including unit)

  • I only have 5 dollars to my name.
  • I only have [MONEY REDACTED] to my name.

NORP

Nationality or religious/political group

  • My grandma speaks French to me all the time!
  • My grandma speaks [NORP REDACTED] to me all the time!

ORDINAL

Position in a series (e.g. first, second)

  • I got my first question right but my last question totally wrong.
  • I got my [ORDINAL REDACTED] question right but my last question totally wrong.

ORG

Organization (e.g. company, agency, institution)

  • I am scared of Boeing because of their planes failing.
  • I am scared of [ORG REDACTED] because of their planes failing.

PERCENT

Percentage (including %)

  • I need 50% of this and 50 percent of that.
  • I need [PERCENT REDACTED] of this and [PERCENT REDACTED] of that.

PERSON

Person's name

  • Hello Sam, this is Jack and I don't like green eggs.
  • Hello [PERSON REDACTED], this is [PERSON REDACTED] and I don't like green eggs.

PHONE

Telephone number

  • Just call me at 555-555-4372 - it's definitely not fake.
  • Just call me at [PHONE REDACTED] - it's definitely not fake.

PRODUCT

Product, not a service (e.g. object, vehicle, food)

  • I love my Subaru WRX because it's so fast!
  • I love my [PRODUCT REDACTED] because it's so fast!

QUANTITY

Measurement (including unit, e.g. weight, distance)

  • I would walk 500 miles and I would walk 500 more!
  • I would walk [QUANTITY REDACTED] and I would walk 500 more!

TIME

Time or period of time less than a day

  • I need to be at the airport at 3pm sharp!
  • I need to be at the airport at [TIME REDACTED] sharp!

URL

Uniform resource locator (e.g. google.com)

  • Just go to google.com of course!
  • Just go to [URL REDACTED] of course!

WORK_OF_ART

Title of artwork (e.g. book, song, painting)

  • The Mona Lisa was my least favorite at the Louvre.
  • [WORK_OF_ART REDACTED] was my least favorite at the Louvre.
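
The examples above suggest that each redacted span is replaced by a placeholder of the form [ENTITY REDACTED]. Assuming that format holds for every entity type, a minimal sketch for tallying how many spans of each type were removed might look like this. The function name count_redactions is hypothetical and is not part of the Stratifyd product.

```python
import re
from collections import Counter

# Assumption: placeholders always look like "[ENTITY REDACTED]", where ENTITY
# is an upper-case name such as PERSON or WORK_OF_ART (as in the examples above).
PLACEHOLDER = re.compile(r"\[([A-Z_]+) REDACTED\]")


def count_redactions(redacted_text):
    """Return a Counter mapping each entity type to its placeholder count."""
    return Counter(PLACEHOLDER.findall(redacted_text))


if __name__ == "__main__":
    sample = "Hello [PERSON REDACTED], this is [PERSON REDACTED] and I owe [MONEY REDACTED]."
    print(count_redactions(sample))
```

A tally like this can help when comparing entity selections, for example to see which entity types fire most often on a sample of your data.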

Why use the redaction engine?

If you analyze only public data, you do not need redaction. Redaction becomes necessary when you analyze call center data or similar data in which users share PII, PCI, and other sensitive information such as account numbers.

To set up and run the redaction engine

1. In your terminal, use the commands provided by your Stratifyd account representative to pull the Docker image from the cloud registry and run it as a web server.

NOTE: This process requires AWS account access to the Docker container registry. More than 800 MB of files must be downloaded and extracted, so the first run can take some time.

docker pull 818160864477.dkr.ecr.us-west-2.amazonaws.com/stratifyd/production/redactionstandaloneserver:latest
docker run -p 8888:8000 818160864477.dkr.ecr.us-west-2.amazonaws.com/stratifyd/production/redactionstandaloneserver

2. In your browser, navigate to: http://localhost:8888/api/doc

3. To explore the functionality or test a specific redaction, follow these steps.

Under Entity, expand the GET method and click Try it out to get a list of Entity types that you can redact.

Under Redaction, expand the POST method and click Try it out to enter your own text to redact.

In the body section, under Edit Value, enter text that you want to redact in the text string.

Here you can also change the default PERSON entity to one of the other entity types.

Click Execute. The redacted text appears in the response body below.

4. To send text programmatically for redaction, use code like the following.

import requests

BASE_URL = "http://localhost:8888/"

def redact_text(text, entities=None):
    """Send text to the redaction engine and return the redacted string."""
    body = {'text': text}
    if entities:
        body['entities'] = entities
    r = requests.post(BASE_URL + "api/redact", json=body)
    if r.ok:
        return r.json().get('redacted_text', '')
    # Show the server's error message instead of failing silently
    print(r.text)
    return None

if __name__ == "__main__":
    print(redact_text("My name is Sam and I am anonymous.", entities=["PERSON"]))
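
Because the container from step 1 can take a while to start, it may help to confirm that the server is answering before sending text as in step 4. The following is a minimal sketch, not a Stratifyd utility: it polls the documentation URL from step 2 using only the Python standard library. The names http_probe and wait_until_ready, and the default retry values, are assumptions made for illustration.

```python
import time
import urllib.request

DOC_URL = "http://localhost:8888/api/doc"  # URL from step 2


def http_probe(url, timeout=5):
    """Return True if the URL answers with a 2xx status, False otherwise."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return 200 <= resp.status < 300
    except OSError:
        # URLError and HTTPError are subclasses of OSError, as are
        # connection-refused and connection-reset errors.
        return False


def wait_until_ready(url=DOC_URL, probe=http_probe, attempts=30, delay=2.0):
    """Call probe(url) up to `attempts` times, sleeping `delay` seconds between tries."""
    for _ in range(attempts):
        if probe(url):
            return True
        time.sleep(delay)
    return False
```

Calling wait_until_ready() with no arguments polls the local server for about a minute under the assumed defaults. The probe is passed in as a parameter so the retry logic can be exercised without a running container.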

Swagger

GET /entities (served under /api, e.g. http://localhost:8888/api/entities)
headers = {
    "Content-Type" : "application/json"
}

returns: {
    "entities" : [
        {
            "name" : "PERSON",
            "description" : "Removes people's names (capitalization is important)"
        }
    ]
}


POST /redact (served under /api, e.g. http://localhost:8888/api/redact)
headers = {
    "Content-Type" : "application/json"
}
body = {
    "entities" : ["PERSON"],
    "text" : "My name is Sam and I want to redact my name."
}

returns: {
    "redacted_text" : "My name is [PERSON REDACTED] and I want to redact my name."
}
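
To make the request and response shapes above concrete, here is a small sketch of helpers that build the POST body and read the two responses. It assumes the shapes are exactly as summarized above, with redacted_text returned as a plain string. The helper names are hypothetical, and no network call is made.

```python
def build_redact_body(text, entities=None):
    """Build the JSON body for POST /redact; 'entities' is sent only when given."""
    body = {"text": text}
    if entities:
        body["entities"] = list(entities)
    return body


def parse_entity_names(payload):
    """Extract entity names from a GET /entities response payload."""
    # Assumed shape: {"entities": [{"name": ..., "description": ...}, ...]}
    return [item["name"] for item in payload.get("entities", [])]


def parse_redacted_text(payload):
    """Extract the redacted string from a POST /redact response payload."""
    # Assumed shape: {"redacted_text": "..."}; falls back to an empty string.
    return payload.get("redacted_text", "")


if __name__ == "__main__":
    print(build_redact_body("My name is Sam.", ["PERSON"]))
    print(parse_entity_names({"entities": [{"name": "PERSON", "description": "..."}]}))
```

Keeping the payload handling separate from the HTTP call makes it easier to adjust if your deployed version returns a slightly different shape.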

Local Script

To redact a local CSV file using the engine running on your machine, you can use the following script after adjusting the file paths and output file name. It requires the Python requests library; csv is part of the Python standard library.

redact_test

import csv

import requests

BASE_URL = "http://localhost:8888/"

def redact_text(text, entities=None):
    """Send text to the redaction engine and return the redacted string."""
    body = {'text': text}
    if entities:
        body['entities'] = entities
    r = requests.post(BASE_URL + "api/redact", json=body)
    if r.ok:
        return r.json().get('redacted_text', '')
    # Show the server's error message and leave the redacted column empty
    print(r.text)
    return ''

# Open original file (read out into memory)
# Open new file to put redacted text
# Write

def get_raw_data(file):
    """Read the whole input CSV into memory as a list of dicts."""
    data = []
    with open(file, 'r', newline='') as f:
        reader = csv.DictReader(f)
        for line in reader:
            data.append(line)
    return data

def get_entities():
    """Return the entity names the engine supports, or an empty list on error."""
    r = requests.get(BASE_URL + "api/entities")
    if r.ok:
        payload = r.json()
        # The Swagger summary shows {'entities': [...]}; also accept a bare list
        items = payload.get('entities', []) if isinstance(payload, dict) else payload
        return [entity['name'] for entity in items]
    print(r.text)
    return []

if __name__ == "__main__":
    input_filename = "redaction_texts.csv"
    text_field = "original_text"
    # Replace this placeholder with the name of the file to write
    output_filename = "[ENTER OUTPUT FILE NAME]"
    entities = get_entities()
    if not entities:
        raise SystemExit("Could not fetch the entity list from the redaction engine.")
    # Entity types to leave unredacted
    remove_list = [
        "URL",
        "FAC",
        "EVENT",
        "WORK_OF_ART",
        "LAW",
        "LANGUAGE",
        "TIME",
        "QUANTITY",
        "ORDINAL"
    ]
    entities = list(set(entities) - set(remove_list))
    data = get_raw_data(input_filename)
    with open(output_filename, 'w', newline='') as f:
        fieldnames = list(data[0].keys())
        fieldnames.append('redacted_text')
        writer = csv.DictWriter(f, fieldnames)
        writer.writeheader()
        for line in data:
            line['redacted_text'] = redact_text(line[text_field], entities)
            writer.writerow(line)
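
The script above reads the entire input file into memory before writing anything. For larger files, one possible variation is to process the CSV row by row. The sketch below is an illustration under that assumption, not part of the provided scripts: redact_csv_stream is a hypothetical name, and the redaction call is passed in as redact_fn so that the file handling can be tried without a running engine.

```python
import csv
import io


def redact_csv_stream(src, dst, text_field, redact_fn):
    """Copy CSV rows from src to dst, adding a 'redacted_text' column.

    src and dst are open text file objects; redact_fn maps a string to its
    redacted form (for example, a function that posts to the redaction engine).
    Returns the number of data rows written.
    """
    reader = csv.DictReader(src)
    fieldnames = list(reader.fieldnames or []) + ["redacted_text"]
    writer = csv.DictWriter(dst, fieldnames=fieldnames)
    writer.writeheader()
    count = 0
    for row in reader:
        row["redacted_text"] = redact_fn(row[text_field])
        writer.writerow(row)
        count += 1
    return count


if __name__ == "__main__":
    # Stand-in redaction function used only to demonstrate the file handling.
    fake_redact = lambda text: text.replace("Sam", "[PERSON REDACTED]")
    source = io.StringIO("id,original_text\n1,Hello Sam\n")
    target = io.StringIO()
    redact_csv_stream(source, target, "original_text", fake_redact)
    print(target.getvalue())
```

With real files, open both with newline='' as the csv module documentation recommends, and pass a function that calls the engine in place of the stand-in.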

For Standalone Redaction On-Prem Deployment

What is provided:

  • Docker Images for access to the Redaction Engine
  • Local Python scripts to perform redaction on flat files (requires local files and allows for entity selection).

What isn't provided:

  • Automated process: the scripts we provide would need to be productionized by the client's engineers. The workflow would be: client gets data → client engineer passes data through redaction → engineer pushes data to Stratifyd
  • Customized models: the engine is currently built to handle general PII data. We do not currently support a custom model with a new entity for a provided list of words. For example, a client cannot add an entity that redacts the words “red”, “white”, and “blue”.
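
Since custom word lists are not supported by the engine, a client who needs one would have to apply it in their own code, for example as an extra pass before or after calling the engine. The following is a rough sketch of such a client-side pass, not a Stratifyd feature. The function name redact_custom_words and the CUSTOM label are assumptions chosen to mirror the placeholder style shown earlier, and the word-boundary matching assumes the words are plain alphanumeric terms.

```python
import re


def redact_custom_words(text, words, label="CUSTOM"):
    """Replace whole-word, case-insensitive matches of `words` with a placeholder."""
    if not words:
        return text
    # Longest words first so a longer term wins over a shorter one it contains.
    ordered = sorted(words, key=len, reverse=True)
    pattern = re.compile(
        r"\b(?:" + "|".join(re.escape(w) for w in ordered) + r")\b",
        re.IGNORECASE,
    )
    placeholder = f"[{label} REDACTED]"
    # A function replacement avoids any escape processing of the placeholder.
    return pattern.sub(lambda match: placeholder, text)


if __name__ == "__main__":
    print(redact_custom_words("The red car and the Blue bike.", ["red", "white", "blue"]))
```

Whole-word matching keeps terms such as "reddish" intact when only "red" is listed. Whether that is the desired behavior depends on the client's requirements.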