Data Masking a Billion Rows in 5 Minutes
Executive Summary:
Data masking is a proven technology for protecting the privacy of sensitive data such as PII, payment card data covered by PCI DSS, and health information covered by HIPAA.
One impediment has always been the time and effort required to mask large datasets.
The explosive growth of big data, data lakes, and AI has only exacerbated the problem, moving it from difficult to formidable.
Legacy data masking tools were designed when RDBMS ruled the data landscape and struggle with the current landscape of big data.
The solution to this dilemma is data masking tools designed specifically for the big data challenge. While traditional tools struggle to mask tens or hundreds of millions of rows, a data masking tool built on proven big data processing technology can mask a billion rows in minutes. Obfusware is a next-generation data masking tool designed from the ground up to leverage big data technology and outperform legacy tools by orders of magnitude.
Obfusware AG Big Data Masking
Obfusware is built on Apache Spark, a fast, scalable engine for large-scale data processing. Amazon Web Services (AWS) Glue is a serverless data integration service that also uses Apache Spark for data processing, which makes it easy to integrate Obfusware as an AWS Glue transform. The primary benefit of Spark, and thus Glue, is the ability to scale horizontally across multiple servers, whether that be 5, 10, 50, or hundreds of servers.
Designing a 1 Billion Row Obfusware Masking Job
A simple AWS Glue visual job can be created using the AWS Glue Studio.
Image 1: AWS Glue Obfusware Masking Job Diagram
AWS Glue can handle large amounts of data by using an AWS S3 bucket to store the data to be masked in an optimized big data format composed of multiple Parquet files. The masked results can then be written to a new location, again as Parquet files. A single Obfusware transform is used to mask the selected columns in the data set. The transform uses four Obfusware data maskers.
- USLastNameMasker - Replaces original string data with a realistic, culturally USA, last name
- USVariableDateMasker - Replaces the original data item with a new date with the same month and year but a different day value
- US555TelephoneMasker - Generates a new telephone number, replacing the exchange with `555` and the final digits with a new sequence of digits
- EmailMasker - Replaces an email address with an email address for the domain `@example.com`.
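The behavior of the four maskers listed above can be approximated in a few lines of plain Python. This is only an illustrative sketch: the function names, the small surname list, and the hash-based surname lookup are assumptions made for the example and are not Obfusware's implementation or API. Dates are assumed to use the M/D/YYYY format and phone numbers the NNN-NNN-NNNN format seen in the sample data.

```python
import calendar
import hashlib
import random

# Hypothetical stand-ins for illustration only; not Obfusware's implementation.
SAMPLE_LAST_NAMES = ["Ketchersid", "Bittman", "Lahmers", "Aceves", "Hozempa"]


def mask_last_name(name: str) -> str:
    """Pick a replacement surname from a list, keyed by a hash of the input."""
    digest = hashlib.sha256(name.encode("utf-8")).digest()
    index = int.from_bytes(digest[:4], "big") % len(SAMPLE_LAST_NAMES)
    return SAMPLE_LAST_NAMES[index]


def mask_variable_date(value: str, rng: random.Random) -> str:
    """Keep the month and year of an M/D/YYYY date, choose a different day."""
    month, day, year = (int(part) for part in value.split("/"))
    days_in_month = calendar.monthrange(year, month)[1]
    candidates = [d for d in range(1, days_in_month + 1) if d != day]
    return f"{month}/{rng.choice(candidates)}/{year}"


def mask_us555_phone(value: str, rng: random.Random) -> str:
    """Keep the area code, set the exchange to 555, randomize the last four digits."""
    area_code = value.split("-")[0]
    return f"{area_code}-555-{rng.randrange(10000):04d}"


def mask_email(rng: random.Random) -> str:
    """Generate a replacement address in the example.com domain."""
    return f"user{rng.randrange(10000):04d}@example.com"


rng = random.Random(42)
row = {"last_name": "Butt", "dob": "4/1/1997",
       "phone1": "504-621-8927", "email": "jbutt@gmail.com"}
masked = {
    "last_name": mask_last_name(row["last_name"]),
    "dob": mask_variable_date(row["dob"], rng),
    "phone1": mask_us555_phone(row["phone1"], rng),
    "email": mask_email(rng),
}
print(masked)
```

Each function touches only one column value, which is the property that lets this kind of masking be spread across many workers.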
Input Data to be Masked
The data to be masked is a table of 13 columns of standard contact data. Four representative columns are selected for masking: `last_name`, `dob`, `phone1`, and `email`.
The billion rows amount to approximately 176 GB of uncompressed data. The Parquet files use the Snappy compression algorithm, which reduces the overall data size to approximately 94 GB.
first_name | last_name | dob | company_name | address | city | county | state | zip | phone1 | phone2 | email | web |
---|---|---|---|---|---|---|---|---|---|---|---|---|
James | Butt | 4/1/1997 | Benton, John B Jr | 6649 N Blue Gum St | New Orleans | Orleans | LA | 70116 | 504-621-8927 | 504-845-1427 | jbutt@gmail.com | http://www.bentonjohnbjr.com |
Josephine | Darakjy | 1/22/1970 | Chanay, Jeffrey A Esq | 4 B Blue Ridge Blvd | Brighton | Livingston | MI | 48116 | 810-292-9388 | 810-374-9840 | josephine_darakjy@darakjy.org | http://www.chanayjeffreyaesq.com |
Art | Venere | 3/24/1993 | Chemel, James L Cpa | 8 W Cerritos Ave #54 | Bridgeport | Gloucester | NJ | 8014 | 856-636-8749 | 856-264-4130 | art@venere.org | http://www.chemeljameslcpa.com |
Lenna | Paprocki | 6/22/1965 | Feltz Printing Service | 639 Main St | Anchorage | Anchorage | AK | 99501 | 907-385-4412 | 907-921-2010 | lpaprocki@hotmail.com | http://www.feltzprintingservice.com |
Donette | Foller | 7/3/1971 | Printing Dimensions | 34 Center St | Hamilton | Butler | OH | 45011 | 513-570-1893 | 513-549-4561 | donette.foller@cox.net | http://www.printingdimensions.com |
Simona | Morasca | 4/30/1999 | Chapman, Ross E Esq | 3 Mcauley Dr | Ashland | Ashland | OH | 44805 | 419-503-2484 | 419-800-6759 | simona@morasca.com | http://www.chapmanrosseesq.com |
Mitsue | Tollner | 6/28/1973 | Morlong Associates | 7 Eads St | Chicago | Cook | IL | 60632 | 773-573-6914 | 773-924-8565 | mitsue_tollner@yahoo.com | http://www.morlongassociates.com |
Leota | Dilliard | 7/13/1977 | Commercial Press | 7 W Jackson Blvd | San Jose | Santa Clara | CA | 95111 | 408-752-3500 | 408-813-1105 | leota@hotmail.com | http://www.commercialpress.com |
Sage | Wieser | 4/3/1968 | Truhlar And Truhlar Attys | 5 Boston Ave #88 | Sioux Falls | Minnehaha | SD | 57105 | 605-414-2147 | 605-794-4895 | sage_wieser@cox.net | http://www.truhlarandtruhlarattys.com |
Kris | Marrier | 7/19/1989 | King, Christopher A Esq | 228 Runamuck Pl #2808 | Baltimore | Baltimore City | MD | 21224 | 410-655-8723 | 410-804-4694 | kris@gmail.com | http://www.kingchristopheraesq.com |
Table 1: Sample input data
Masked Output Data
A sample of the masking results is shown below. By comparing the masked columns to the input data columns, the effects of the 4 Obfusware maskers are clear.
first_name | last_name | dob | company_name | address | city | county | state | zip | phone1 | phone2 | email | web |
---|---|---|---|---|---|---|---|---|---|---|---|---|
James | Ketchersid | 4/13/1997 | Benton, John B Jr | 6649 N Blue Gum St | New Orleans | Orleans | LA | 70116 | 504-555-7562 | 504-845-1427 | freddy6791@example.com | http://www.bentonjohnbjr.com |
Josephine | Bittman | 1/30/1970 | Chanay, Jeffrey A Esq | 4 B Blue Ridge Blvd | Brighton | Livingston | MI | 48116 | 810-555-5377 | 810-374-9840 | kiara.coria@example.com | http://www.chanayjeffreyaesq.com |
Art | Lahmers | 3/31/1993 | Chemel, James L Cpa | 8 W Cerritos Ave #54 | Bridgeport | Gloucester | NJ | 8014 | 856-555-7295 | 856-264-4130 | marin7087@example.com | http://www.chemeljameslcpa.com |
Lenna | Aceves | 6/17/1965 | Feltz Printing Service | 639 Main St | Anchorage | Anchorage | AK | 99501 | 907-555-9906 | 907-921-2010 | janell5891@example.com | http://www.feltzprintingservice.com |
Donette | Hozempa | 7/29/1971 | Printing Dimensions | 34 Center St | Hamilton | Butler | OH | 45011 | 513-555-0753 | 513-549-4561 | therford@example.com | http://www.printingdimensions.com |
Simona | Atallah | 4/9/1999 | Chapman, Ross E Esq | 3 Mcauley Dr | Ashland | Ashland | OH | 44805 | 419-555-6979 | 419-800-6759 | numbers3031@example.com | http://www.chapmanrosseesq.com |
Mitsue | Grabinski | 6/20/1973 | Morlong Associates | 7 Eads St | Chicago | Cook | IL | 60632 | 773-555-1509 | 773-924-8565 | wlenorud462@example.com | http://www.morlongassociates.com |
Leota | Loflen | 7/6/1977 | Commercial Press | 7 W Jackson Blvd | San Jose | Santa Clara | CA | 95111 | 408-555-3488 | 408-813-1105 | twarhol@example.com | http://www.commercialpress.com |
Sage | Baize | 4/11/1968 | Truhlar And Truhlar Attys | 5 Boston Ave #88 | Sioux Falls | Minnehaha | SD | 57105 | 605-555-6092 | 605-794-4895 | elsie.glen@example.com | http://www.truhlarandtruhlarattys.com |
Kris | Ponder | 7/31/1989 | King, Christopher A Esq | 228 Runamuck Pl #2808 | Baltimore | Baltimore City | MD | 21224 | 410-555-7263 | 410-804-4694 | mruffolo@example.com | http://www.kingchristopheraesq.com |
Table 2: Sample masked output data
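The comparison between Table 1 and Table 2 can also be done programmatically. The sketch below takes three original and masked value pairs copied from the tables and checks the rules each masker is described as following. The helper names are hypothetical, and the check that the area code is preserved reflects what the sample rows show rather than a documented guarantee.

```python
# (original, masked) pairs copied from the first three rows of Tables 1 and 2.
samples = [
    {"last_name": ("Butt", "Ketchersid"),
     "dob": ("4/1/1997", "4/13/1997"),
     "phone1": ("504-621-8927", "504-555-7562"),
     "email": ("jbutt@gmail.com", "freddy6791@example.com")},
    {"last_name": ("Darakjy", "Bittman"),
     "dob": ("1/22/1970", "1/30/1970"),
     "phone1": ("810-292-9388", "810-555-5377"),
     "email": ("josephine_darakjy@darakjy.org", "kiara.coria@example.com")},
    {"last_name": ("Venere", "Lahmers"),
     "dob": ("3/24/1993", "3/31/1993"),
     "phone1": ("856-636-8749", "856-555-7295"),
     "email": ("art@venere.org", "marin7087@example.com")},
]


def same_month_year_new_day(original: str, masked: str) -> bool:
    """True when only the day part of an M/D/YYYY date has changed."""
    o_month, o_day, o_year = original.split("/")
    m_month, m_day, m_year = masked.split("/")
    return (o_month, o_year) == (m_month, m_year) and o_day != m_day


def is_555_phone(original: str, masked: str) -> bool:
    """True when the area code is kept and the exchange is 555."""
    o_area = original.split("-")[0]
    m_area, m_exchange, m_line = masked.split("-")
    return (o_area == m_area and m_exchange == "555"
            and len(m_line) == 4 and m_line.isdigit())


def row_is_masked(row: dict) -> bool:
    """Apply all four column checks to one sample row."""
    return (row["last_name"][0] != row["last_name"][1]
            and same_month_year_new_day(*row["dob"])
            and is_555_phone(*row["phone1"])
            and row["email"][1].endswith("@example.com")
            and row["email"][0] != row["email"][1])


results = [row_is_masked(row) for row in samples]
print(results)
```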
Obfusware Job Statistics Report
At the conclusion of a job, Obfusware produces a job report with basic statistics and timings for masking operations.
INFO 2025-08-20T03:37:26,497 402306 com.obfusware.datamasking.spark.SparkMaskingSchema [spark-listener-group-shared] 666 MASKING STATISTICS REPORT
Masking Context: 4 maskers
1) EmailMasker(Email) - Replaces an email address with an email address for the domain '@example.com'.
2) US555TelephoneMasker(USTelephone) - Generates a new telephone number, replacing the exchange with '555' and the final digits with a new sequence of digits
3) USLastNameMasker(HashList) - Replaces original string data with a realistic, culturally USA, last name
4) USVariableDateMasker(MultiDate) - Replaces the original data item with a new date with the same month and year but a different day value
Aggreggate Statistics (From: 2025-08-20T03:31:53.931760990Z, To: 2025-08-20T03:37:25.258870068Z, Duration: 00:05:31.327, ops/sec: 12,956,643)
Total masker invocations: 4,292,887,120 (02:28:51.757 hh:mm:ss.millis)
Total successful invocations: 4,292,887,120 (02:28:51.757 hh:mm:ss.millis)
Total failed invocations: 0 (00:00:00.000 hh:mm:ss.millis)
DataMasker Statistics
EmailMasker (3,239,308 ops/sec)
Total invocations: 1,073,221,780 (00:42:22.193 hh:mm:ss.millis)
Successful invocations: 1,073,221,780 (00:42:22.193 hh:mm:ss.millis)
Failed invocations: 0 (00:00:00.000 hh:mm:ss.millis)
US555TelephoneMasker (3,239,196 ops/sec)
Total invocations: 1,073,221,780 (00:49:37.813 hh:mm:ss.millis)
Successful invocations: 1,073,221,780 (00:49:37.813 hh:mm:ss.millis)
Failed invocations: 0 (00:00:00.000 hh:mm:ss.millis)
USLastNameMasker (3,239,160 ops/sec)
Total invocations: 1,073,221,780 (00:21:26.088 hh:mm:ss.millis)
Successful invocations: 1,073,221,780 (00:21:26.088 hh:mm:ss.millis)
Failed invocations: 0 (00:00:00.000 hh:mm:ss.millis)
USVariableDateMasker (3,239,187 ops/sec)
Total invocations: 1,073,221,780 (00:35:25.663 hh:mm:ss.millis)
Successful invocations: 1,073,221,780 (00:35:25.663 hh:mm:ss.millis)
Failed invocations: 0 (00:00:00.000 hh:mm:ss.millis)
The most interesting information is in the aggregate statistics line of the report.
Duration: 00:05:31.327, ops/sec: 12,956,643
The job took 5 minutes and 31 seconds to run with almost 13 million masking operations per second.
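The headline figures can be cross-checked against the other numbers in the report. The sketch below uses only values copied from the report; the helper `to_seconds` is a hypothetical name. It assumes that the time shown next to the invocation counts is cumulative masker time summed over all parallel tasks, which would explain why it is far larger than the wall-clock duration.

```python
def to_seconds(hms: str) -> float:
    """Convert an hh:mm:ss.millis string to seconds."""
    hours, minutes, seconds = hms.split(":")
    return int(hours) * 3600 + int(minutes) * 60 + float(seconds)


MASKED_COLUMNS = 4
total_invocations = 4_292_887_120
wall_seconds = to_seconds("00:05:31.327")
cumulative_seconds = to_seconds("02:28:51.757")

# Per-masker times as reported: Email, US555Telephone, USLastName, USVariableDate.
per_masker_seconds = sum(
    to_seconds(t)
    for t in ("00:42:22.193", "00:49:37.813", "00:21:26.088", "00:35:25.663")
)

rows_masked = total_invocations // MASKED_COLUMNS
ops_per_sec = total_invocations / wall_seconds
# Assumption: cumulative time divided by wall time approximates how many
# masker calls were running concurrently on average.
average_concurrency = cumulative_seconds / wall_seconds

print(f"rows masked:         {rows_masked:,}")
print(f"ops/sec:             {ops_per_sec:,.0f}")
print(f"per-masker time sum: {per_masker_seconds:.3f} s")
print(f"average concurrency: {average_concurrency:.1f}")
```

The recomputed rate differs from the reported 12,956,643 by only a few operations per second because the report divides by the duration at nanosecond precision. The four per-masker times also add up exactly to the reported total of 02:28:51.757.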
Horizontally Scaling an Obfusware Masking Job
The performance of an AWS Glue Obfusware masking job can be easily managed using horizontal scaling. Adding Data Processing Units (DPUs) to a masking job reduces its run time. Masking a billion rows in approximately 5 minutes was an ambitious objective and was not achieved on the first try.
First, a test run with 10 DPUs established a baseline run time of approximately 24 minutes. Next, 20 DPUs were used, doubling the available processing power. This resulted in a run time of approximately 11 minutes, cutting the run time by about 53%. This demonstrates the efficient horizontal scaling that Apache Spark enables. To get close to the 5-minute run time target, the number of DPUs was doubled again to 40. The resulting run time of 5 minutes and 31 seconds was close enough to the 5-minute target to declare victory.
DPU | ops/sec | ops/sec/DPU | ops/hr | rows/hr | Duration |
---|---|---|---|---|---|
10 | 3,000,771 | 333,419 | 10,802,775,600 | 2,700,693,900 | 00:23:50.594 |
20 | 6,437,642 | 338,823 | 23,175,511,200 | 5,793,877,800 | 00:11:06.841 |
40 | 12,956,643 | 332,222 | 46,643,914,800 | 11,660,978,700 | 00:05:31.327 |
Table 3: Obfusware horizontal scaling results
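The derived columns in Table 3 can be reproduced from the measured ops/sec and duration values. In the sketch below, ops/hr is ops/sec times 3,600 and rows/hr is ops/hr divided by the four masked columns. The ops/sec/DPU figures match ops/sec divided by the DPU count minus one; this is inferred from the arithmetic, and a plausible reading (an assumption, not something the table states) is that one DPU runs the Spark driver and does no masking.

```python
def to_seconds(hms: str) -> float:
    """Convert an hh:mm:ss.millis string to seconds."""
    hours, minutes, seconds = hms.split(":")
    return int(hours) * 3600 + int(minutes) * 60 + float(seconds)


MASKED_COLUMNS = 4

# (DPUs, measured ops/sec, duration) copied from Table 3.
runs = [
    (10, 3_000_771, "00:23:50.594"),
    (20, 6_437_642, "00:11:06.841"),
    (40, 12_956_643, "00:05:31.327"),
]

baseline_seconds = to_seconds(runs[0][2])
derived = []
for dpus, ops_per_sec, duration in runs:
    ops_per_hr = ops_per_sec * 3600
    derived.append({
        "dpus": dpus,
        "ops_per_hr": ops_per_hr,
        "rows_per_hr": ops_per_hr // MASKED_COLUMNS,
        # Assumption: one DPU is excluded from the divisor (inferred from the table).
        "ops_per_sec_per_worker": round(ops_per_sec / (dpus - 1)),
        # Speedup in run time relative to the 10-DPU baseline.
        "speedup": baseline_seconds / to_seconds(duration),
    })

for row in derived:
    print(row)
```

Under that reading, going from 10 to 20 DPUs raises the number of masking workers from 9 to 19, which is why the measured speedup can be slightly better than 2x while throughput per worker stays roughly constant.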
Conclusion
The combination of AWS Glue and Obfusware creates a data masking solution capable of meeting the challenge of masking big data. Another important factor is the ease of implementing this solution. AWS Glue Studio enabled the creation of the Obfusware masking job using an easy-to-understand visual interface. Obfusware's tight integration with AWS Glue made specifying the required masking transforms simple and intuitive, even for a novice AWS Glue user. AWS support for big data is also an important factor in the success of the solution. AWS S3 storage is simple to use and cost-effective. AWS Glue provides key services such as a Data Catalog and integration with key big data technologies such as Apache Parquet data files and Apache Spark. All of these factors enabled the creation and running of a billion-row masking job in about half a day of work, with an AWS service bill less than the cost of lunch.
If this solution is interesting, and you would like to learn how Obfusware can solve your data masking challenges, check out Obfusware AG at the Obfusware website.