Masking Algorithms & Masking Techniques
Overview
The goal of data masking is to obscure the true value of sensitive data to prevent misuse. For instance, masking a credit card number prevents a malicious actor from making charges on the credit card. There are many different algorithms that can be used to implement data masking. The majority of data masking algorithms are based on a few common masking techniques.
Data Masking Techniques
There are a variety of masking techniques which can be used depending on the type of data and the masking requirements. Common techniques used in data masking include: substitution, anonymization, redaction, masking out, and encryption.
Substitution
Substitution algorithms replace a data value with a similar data value that is completely unrelated to the original value, generally selected from a specific set of values. The advantage of substitution is that realistic data values can be generated. Thus, a last name MacDonald might be masked to Murphy.
Anonymization
Anonymization relies on changing the data value being masked. A common method when dealing with numbers and dates is to vary the value. The variance can be achieved by generating a new value using a deterministic variable algorithm. It is variable, much like a random number, because it can generate any value in the defined range of possible values. It is deterministic because it will always generate the same value for a given data value being masked. Often, only a portion of the data value being masked is changed. This can maintain certain statistical properties of the data. For instance, changing a date of birth from 1/15/1985 to 1/31/1985 would make it harder to identify a person by their date of birth in a data set but still maintain the statistical age distribution of a population. The advantage of variance is that it produces realistic data values that are valid (i.e. a valid date type) and remain statistically valid.
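The date example above can be sketched in Python. This is a minimal illustration, not a description of any particular product: the function name vary_day, the key SECRET_KEY, and the choice of HMAC-SHA-256 to derive the new day are all assumptions made for the example. Only the day of the month is changed, so the month and year (and therefore the age distribution) are preserved.

```python
import calendar
import hashlib
import hmac
from datetime import date

# Hypothetical secret; a real deployment would load this from secure storage.
SECRET_KEY = b"example-key-not-for-production"

def vary_day(d: date, key: bytes = SECRET_KEY) -> date:
    """Deterministically replace the day of the month, keeping month and year."""
    digest = hmac.new(key, d.isoformat().encode("utf-8"), hashlib.sha256).digest()
    days_in_month = calendar.monthrange(d.year, d.month)[1]
    # Map the digest onto a valid day for this month (1..days_in_month).
    new_day = int.from_bytes(digest[:8], "big") % days_in_month + 1
    return d.replace(day=new_day)

masked = vary_day(date(1985, 1, 15))
print(masked)  # some day in January 1985, the same one on every run
```

Because the new day is drawn from the whole month, it can occasionally equal the original day; a stricter variant could exclude that case.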
Redaction
Redaction is a technique of completely removing or hiding a data value. In printed material it is commonly achieved by blacking out words or phrases. Electronically, words or phrases can be redacted by replacing characters with a specific character like X or *. While redaction ensures protection of the data, it has several drawbacks. It breaks referential integrity, the masked data is not realistic, and the masked data may not be valid depending on the data type.
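A minimal Python sketch of electronic redaction follows. The function name redact and its parameters are hypothetical. The last line illustrates the referential integrity drawback: two different values of the same length become indistinguishable once redacted.

```python
def redact(value: str, mask_char: str = "X") -> str:
    """Replace every character of the value with the mask character."""
    return mask_char * len(value)

print(redact("MacDonald"))          # XXXXXXXXX
print(redact("123-45-6789", "*"))   # ***********

# Distinct originals collapse to the same masked value.
print(redact("Smith") == redact("Brown"))  # True
```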
Masking Out
Masking Out is similar to redaction except that the entire data value is typically not hidden. Two common examples that people might be aware of are masking all but the final 4 digits of a social security number or a credit card number. The technique of masking out data shares all the drawbacks of redaction.
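As a sketch of the "all but the final 4 digits" case, the following Python function masks leading digits and leaves separators in place. The name mask_out and its parameters are assumptions for illustration only.

```python
def mask_out(value: str, visible: int = 4, mask_char: str = "*") -> str:
    """Mask every digit except the last `visible` digits; keep separators."""
    total_digits = sum(ch.isdigit() for ch in value)
    to_mask = max(total_digits - visible, 0)
    out = []
    for ch in value:
        if ch.isdigit() and to_mask > 0:
            out.append(mask_char)
            to_mask -= 1
        else:
            out.append(ch)
    return "".join(out)

print(mask_out("123-45-6789"))          # ***-**-6789
print(mask_out("4111 1111 1111 1111"))  # **** **** **** 1111
```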
Encryption
Most masking techniques are one way. Given a data value, a masked value can be consistently generated, but it is not possible to determine the original data value given a masked value. Encryption is the exception to this rule. Using encryption, a masked data value can be decrypted to recover the original data value. Decryption requires the decryption key, however, which is not generally available.
Data Masking Core Algorithms
Both the Substitution and Anonymization data masking techniques rely on the same core secure variable deterministic selection algorithm. The basic variable requirement is to select an item from a substitution set with all values in the set having an equal chance of being selected. A basic random function available in most programming languages can do this.
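The variable requirement on its own can be sketched with Python's standard random module. The substitution set LAST_NAMES and the function name random_substitute are hypothetical. Note that the function never looks at the value being masked, so the selection is uniform but not tied to the input.

```python
import random
from collections import Counter

# Hypothetical substitution set for last names.
LAST_NAMES = ["Brown", "Murphy", "Garcia", "Nguyen", "Patel"]

def random_substitute(substitution_set):
    """Pick any item from the set with equal probability."""
    return random.choice(substitution_set)

# Over many draws, each name should appear roughly 2,000 times.
counts = Counter(random_substitute(LAST_NAMES) for _ in range(10_000))
print(counts)
```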
The next requirement is that this algorithm is deterministic. This means it should always select the same item from the substitution set for a given data value (e.g. when masking the last_name "Smith" the substitution item will always be "Brown"). This is important to maintain referential integrity in a database. This implies a plain random algorithm will not work, because it is not guaranteed to return the same value each time it is executed for the same input.
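One common way to make a selection deterministic is to derive the index from a hash of the data value instead of from a random number generator. The sketch below uses an unkeyed SHA-256 hash purely to illustrate determinism; the names are hypothetical, and an unkeyed hash does not by itself protect against someone recomputing the same mapping.

```python
import hashlib

# Hypothetical substitution set for last names.
LAST_NAMES = ["Brown", "Murphy", "Garcia", "Nguyen", "Patel"]

def deterministic_substitute(value: str, substitution_set=LAST_NAMES) -> str:
    """Select an item based only on the input value, so repeats always agree."""
    digest = hashlib.sha256(value.encode("utf-8")).digest()
    index = int.from_bytes(digest[:8], "big") % len(substitution_set)
    return substitution_set[index]

first = deterministic_substitute("Smith")
second = deterministic_substitute("Smith")
print(first == second)  # True: the same input always yields the same selection
```

Because every occurrence of "Smith" maps to the same substitute, joins between tables that share the masked column continue to line up.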
The final requirement is that this algorithm is secure. Security means that it is not possible to determine the original data value from the masked data value. Besides the security of not being able to reverse the masking operation, it is also important that other actors cannot brute force the reverse operation. A brute force approach would be masking possible original values to see if they match the masked value. While a single match does not indicate anything, if a set of values matches, then the confidence would increase.
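The brute force approach described above can be sketched as follows, assuming (for illustration only) that the masking function is an unkeyed hash-based selection that the attacker can also run. All names here are hypothetical.

```python
import hashlib

# Hypothetical substitution set for last names.
LAST_NAMES = ["Brown", "Murphy", "Garcia", "Nguyen", "Patel"]

def unkeyed_mask(value: str, substitution_set=LAST_NAMES) -> str:
    """A deterministic but publicly reproducible masking function."""
    digest = hashlib.sha256(value.encode("utf-8")).digest()
    index = int.from_bytes(digest[:8], "big") % len(substitution_set)
    return substitution_set[index]

def consistent_candidates(candidates, observed_masked):
    """Return the candidate originals whose mask equals the observed value."""
    return [c for c in candidates if unkeyed_mask(c) == observed_masked]

observed = unkeyed_mask("Smith")  # what the attacker sees in the masked data
guesses = ["Smith", "Jones", "MacDonald", "Lee", "Khan", "Olsen"]
print(consistent_candidates(guesses, observed))  # the list always includes "Smith"
```

With a small substitution set, several guesses may match a single masked value, which is why one match proves little. Repeating the check across several masked fields of the same record narrows the candidates quickly.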
Obfusware Selection Algorithm
Obfusware handles these requirements with its Obfusware Selection Algorithm. The algorithm meets the secure, variable, and deterministic requirements. In particular, the algorithm gives different selection results for each organization it is licensed to, so outside organizations cannot use a brute force attack. The core of the algorithm is based on the well-known and validated SHA-256 cryptographic hash algorithm.
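The internals of the Obfusware Selection Algorithm are not described here, so the following is only a sketch of one way a per-organization secret could be combined with SHA-256, using the standard HMAC construction. The function keyed_select, the keys, and the substitution set are all hypothetical and are not taken from the product.

```python
import hashlib
import hmac

# Hypothetical substitution set for last names.
LAST_NAMES = ["Brown", "Murphy", "Garcia", "Nguyen", "Patel"]

def keyed_select(value: str, org_key: bytes, substitution_set=LAST_NAMES) -> str:
    """Deterministic selection that depends on both the value and a secret key."""
    digest = hmac.new(org_key, value.encode("utf-8"), hashlib.sha256).digest()
    index = int.from_bytes(digest[:8], "big") % len(substitution_set)
    return substitution_set[index]

# Hypothetical per-organization secrets.
ORG_A_KEY = b"hypothetical-secret-for-organization-a"
ORG_B_KEY = b"hypothetical-secret-for-organization-b"

values = [f"name-{i}" for i in range(100)]
mapping_a = [keyed_select(v, ORG_A_KEY) for v in values]
mapping_b = [keyed_select(v, ORG_B_KEY) for v in values]

print(mapping_a == [keyed_select(v, ORG_A_KEY) for v in values])  # True
print(mapping_a == mapping_b)  # False, except with negligible probability
```

Under this design, someone without an organization's key cannot reproduce its mapping, so masking candidate values and comparing results is no longer informative.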