Data Masking a Billion Rows in 5 Minutes

· 11 min read
Mark Smith
Founder - Touisset Services LLC

Executive Summary:
Data masking is a proven technology for maintaining the privacy of sensitive data such as PII, PCI, and HIPAA data. One impediment has always been the time and effort required to mask large datasets. The explosive growth of big data, data lakes, and AI has only exacerbated the problem, moving it from difficult to formidable. Legacy data masking tools were designed when the RDBMS ruled the data landscape, and they struggle with today's big data workloads.

The solution to this dilemma is data masking tools designed specifically for the big data challenge. While traditional tools struggle to mask tens or hundreds of millions of rows, a data masking tool built on proven big data processing technology can mask a billion rows in minutes. Obfusware is a next-generation data masking tool designed from the ground up to leverage big data technology and outperform legacy tools by orders of magnitude.

Obfusware AG Big Data Masking

Obfusware is built on Apache Spark, a fast, scalable engine for large-scale data processing. AWS Glue, a serverless data integration service from Amazon Web Services (AWS), also uses Apache Spark for data processing, which makes it easy to integrate Obfusware as an AWS Glue transform. The primary benefit of Spark, and thus Glue, is the ability to scale horizontally across multiple servers, whether that be 5, 10, 50, or hundreds.

Designing a 1 Billion Row Obfusware Masking Job

A simple AWS Glue visual job can be created using the AWS Glue Studio.

Image 1: AWS Glue Obfusware Masking Job Diagram

AWS Glue can handle large amounts of data by reading from an Amazon S3 bucket that stores the data to be masked in an optimized big data format composed of multiple Parquet files. The masked results are then written to a new location, again as Parquet files. A single Obfusware transform masks the selected columns in the dataset. The transform uses four Obfusware data maskers:

  1. USLastNameMasker - Replaces original string data with a realistic, culturally USA, last name
  2. USVariableDateMasker - Replaces the original data item with a new date with the same month and year but a different day value
  3. US555TelephoneMasker - Generates a new telephone number, replacing the exchange with 555 and the final digits with a new sequence of digits
  4. EmailMasker - Replaces an email address with an email address for the domain @example.com.
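To make the four behaviors concrete, here is a minimal pure-Python sketch of what each masker does to a single value. This is not Obfusware's implementation or API: the function names, the tiny `LAST_NAMES` list, and the choice to seed the random generator from the input (so the same input always masks to the same output) are all assumptions made for illustration.

```python
import calendar
import hashlib
import random
from datetime import date

# Hypothetical stand-in list; a production masker would use a far larger one.
LAST_NAMES = ["Aceves", "Baize", "Bittman", "Grabinski", "Ponder"]


def seeded_rng(value):
    """Derive a repeatable random generator from the input value.

    Assumption: the same input should always produce the same masked output.
    """
    digest = hashlib.sha256(value.encode("utf-8")).digest()
    return random.Random(int.from_bytes(digest[:8], "big"))


def mask_last_name(name):
    """Replace a last name with one picked from the name list."""
    return seeded_rng(name).choice(LAST_NAMES)


def mask_date(original):
    """Keep month and year; pick a different day that is valid for that month."""
    days_in_month = calendar.monthrange(original.year, original.month)[1]
    other_days = [d for d in range(1, days_in_month + 1) if d != original.day]
    return original.replace(day=seeded_rng(original.isoformat()).choice(other_days))


def mask_phone(phone):
    """Keep the area code, set the exchange to 555, regenerate the last four digits."""
    area_code = phone.split("-")[0]
    return f"{area_code}-555-{seeded_rng(phone).randrange(10000):04d}"


def mask_email(email):
    """Replace the whole address with a generated user at example.com."""
    return f"user{seeded_rng(email).randrange(10**6):06d}@example.com"


if __name__ == "__main__":
    print(mask_last_name("Butt"))
    print(mask_date(date(1997, 4, 1)))
    print(mask_phone("504-621-8927"))
    print(mask_email("jbutt@gmail.com"))
```

In the actual job these per-value operations run inside the Obfusware Glue transform, spread across the Spark executors; the sketch only shows the shape of each transformation.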

Input Data to be Masked

The data to be masked is a table of 13 columns of standard contact data. Four representative columns were selected for masking: last_name, dob, phone1, and email. The billion rows amount to approximately 176 GB of uncompressed data. The Parquet files use the Snappy compression algorithm, which reduced the overall data size to approximately 94 GB.

| first_name | last_name | dob | company_name | address | city | county | state | zip | phone1 | phone2 | email | web |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| James | Butt | 4/1/1997 | Benton, John B Jr | 6649 N Blue Gum St | New Orleans | Orleans | LA | 70116 | 504-621-8927 | 504-845-1427 | jbutt@gmail.com | http://www.bentonjohnbjr.com |
| Josephine | Darakjy | 1/22/1970 | Chanay, Jeffrey A Esq | 4 B Blue Ridge Blvd | Brighton | Livingston | MI | 48116 | 810-292-9388 | 810-374-9840 | josephine_darakjy@darakjy.org | http://www.chanayjeffreyaesq.com |
| Art | Venere | 3/24/1993 | Chemel, James L Cpa | 8 W Cerritos Ave #54 | Bridgeport | Gloucester | NJ | 8014 | 856-636-8749 | 856-264-4130 | art@venere.org | http://www.chemeljameslcpa.com |
| Lenna | Paprocki | 6/22/1965 | Feltz Printing Service | 639 Main St | Anchorage | Anchorage | AK | 99501 | 907-385-4412 | 907-921-2010 | lpaprocki@hotmail.com | http://www.feltzprintingservice.com |
| Donette | Foller | 7/3/1971 | Printing Dimensions | 34 Center St | Hamilton | Butler | OH | 45011 | 513-570-1893 | 513-549-4561 | donette.foller@cox.net | http://www.printingdimensions.com |
| Simona | Morasca | 4/30/1999 | Chapman, Ross E Esq | 3 Mcauley Dr | Ashland | Ashland | OH | 44805 | 419-503-2484 | 419-800-6759 | simona@morasca.com | http://www.chapmanrosseesq.com |
| Mitsue | Tollner | 6/28/1973 | Morlong Associates | 7 Eads St | Chicago | Cook | IL | 60632 | 773-573-6914 | 773-924-8565 | mitsue_tollner@yahoo.com | http://www.morlongassociates.com |
| Leota | Dilliard | 7/13/1977 | Commercial Press | 7 W Jackson Blvd | San Jose | Santa Clara | CA | 95111 | 408-752-3500 | 408-813-1105 | leota@hotmail.com | http://www.commercialpress.com |
| Sage | Wieser | 4/3/1968 | Truhlar And Truhlar Attys | 5 Boston Ave #88 | Sioux Falls | Minnehaha | SD | 57105 | 605-414-2147 | 605-794-4895 | sage_wieser@cox.net | http://www.truhlarandtruhlarattys.com |
| Kris | Marrier | 7/19/1989 | King, Christopher A Esq | 228 Runamuck Pl #2808 | Baltimore | Baltimore City | MD | 21224 | 410-655-8723 | 410-804-4694 | kris@gmail.com | http://www.kingchristopheraesq.com |

Table 1: Sample input data
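The size figures quoted above imply an average row size and a compression ratio, which the short calculation below works out. It is a rough sketch under two assumptions: that "GB" means 10**9 bytes, and that the row count equals the per-masker invocation count shown in the job report later in this article (1,073,221,780).

```python
ROWS = 1_073_221_780      # per-masker invocation count from the job report
UNCOMPRESSED_GB = 176     # approximate figure from the text
COMPRESSED_GB = 94        # approximate figure for Snappy-compressed Parquet

# Assumption: "GB" means 10**9 bytes.
bytes_per_row = UNCOMPRESSED_GB * 10**9 / ROWS
size_reduction = 1 - COMPRESSED_GB / UNCOMPRESSED_GB

print(f"average uncompressed row size: {bytes_per_row:.0f} bytes")
print(f"size reduction from compression: {size_reduction:.1%}")
```

With these inputs the average row comes to roughly 164 bytes, and compression removes a little under half of the data volume.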

Masked Output Data

A sample of the masking results is shown below. Comparing the masked columns with the input data makes the effects of the four Obfusware maskers clear: last_name, dob, phone1, and email change, while every other column is left untouched.

| first_name | last_name | dob | company_name | address | city | county | state | zip | phone1 | phone2 | email | web |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| James | Ketchersid | 4/13/1997 | Benton, John B Jr | 6649 N Blue Gum St | New Orleans | Orleans | LA | 70116 | 504-555-7562 | 504-845-1427 | freddy6791@example.com | http://www.bentonjohnbjr.com |
| Josephine | Bittman | 1/30/1970 | Chanay, Jeffrey A Esq | 4 B Blue Ridge Blvd | Brighton | Livingston | MI | 48116 | 810-555-5377 | 810-374-9840 | kiara.coria@example.com | http://www.chanayjeffreyaesq.com |
| Art | Lahmers | 3/31/1993 | Chemel, James L Cpa | 8 W Cerritos Ave #54 | Bridgeport | Gloucester | NJ | 8014 | 856-555-7295 | 856-264-4130 | marin7087@example.com | http://www.chemeljameslcpa.com |
| Lenna | Aceves | 6/17/1965 | Feltz Printing Service | 639 Main St | Anchorage | Anchorage | AK | 99501 | 907-555-9906 | 907-921-2010 | janell5891@example.com | http://www.feltzprintingservice.com |
| Donette | Hozempa | 7/29/1971 | Printing Dimensions | 34 Center St | Hamilton | Butler | OH | 45011 | 513-555-0753 | 513-549-4561 | therford@example.com | http://www.printingdimensions.com |
| Simona | Atallah | 4/9/1999 | Chapman, Ross E Esq | 3 Mcauley Dr | Ashland | Ashland | OH | 44805 | 419-555-6979 | 419-800-6759 | numbers3031@example.com | http://www.chapmanrosseesq.com |
| Mitsue | Grabinski | 6/20/1973 | Morlong Associates | 7 Eads St | Chicago | Cook | IL | 60632 | 773-555-1509 | 773-924-8565 | wlenorud462@example.com | http://www.morlongassociates.com |
| Leota | Loflen | 7/6/1977 | Commercial Press | 7 W Jackson Blvd | San Jose | Santa Clara | CA | 95111 | 408-555-3488 | 408-813-1105 | twarhol@example.com | http://www.commercialpress.com |
| Sage | Baize | 4/11/1968 | Truhlar And Truhlar Attys | 5 Boston Ave #88 | Sioux Falls | Minnehaha | SD | 57105 | 605-555-6092 | 605-794-4895 | elsie.glen@example.com | http://www.truhlarandtruhlarattys.com |
| Kris | Ponder | 7/31/1989 | King, Christopher A Esq | 228 Runamuck Pl #2808 | Baltimore | Baltimore City | MD | 21224 | 410-555-7263 | 410-804-4694 | mruffolo@example.com | http://www.kingchristopheraesq.com |

Table 2: Sample masked output data
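The comparison between the two tables can also be automated. The sketch below checks one input row against its masked counterpart, using the first row of each table (a subset of the 13 columns, for brevity). The `check_row` helper and its rules are a hypothetical spot check derived from the masker descriptions above, not part of Obfusware.

```python
from datetime import datetime

MASKED_COLUMNS = {"last_name", "dob", "phone1", "email"}

# First row of Table 1 and Table 2 (subset of columns).
original = {
    "first_name": "James", "last_name": "Butt", "dob": "4/1/1997",
    "phone1": "504-621-8927", "phone2": "504-845-1427",
    "email": "jbutt@gmail.com",
}
masked = {
    "first_name": "James", "last_name": "Ketchersid", "dob": "4/13/1997",
    "phone1": "504-555-7562", "phone2": "504-845-1427",
    "email": "freddy6791@example.com",
}


def parse_date(text):
    """Parse the m/d/yyyy dates used in the sample data."""
    return datetime.strptime(text, "%m/%d/%Y").date()


def check_row(original, masked):
    """Return a list of problems; an empty list means the row looks correct."""
    problems = []
    for column, value in original.items():
        if column in MASKED_COLUMNS:
            if masked[column] == value:
                problems.append(f"{column} was not changed")
        elif masked[column] != value:
            problems.append(f"{column} changed but should not be masked")

    before, after = parse_date(original["dob"]), parse_date(masked["dob"])
    if (before.year, before.month) != (after.year, after.month):
        problems.append("dob month or year changed")

    area_code = original["phone1"].split("-")[0]
    if masked["phone1"].split("-")[:2] != [area_code, "555"]:
        problems.append("phone1 lost its area code or is missing the 555 exchange")

    if not masked["email"].endswith("@example.com"):
        problems.append("email is not in the example.com domain")
    return problems


print(check_row(original, masked))  # prints []
```

A check like this is cheap to run on a sample of the output and catches configuration mistakes, such as a masker attached to the wrong column.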

Obfusware Job Statistics Report

At the conclusion of a job, Obfusware produces a job report with basic statistics and timings for the masking operations.

INFO 2025-08-20T03:37:26,497 402306 com.obfusware.datamasking.spark.SparkMaskingSchema [spark-listener-group-shared] 666 MASKING STATISTICS REPORT
Masking Context: 4 maskers
1) EmailMasker(Email) - Replaces an email address with an email address for the domain '@example.com'.
2) US555TelephoneMasker(USTelephone) - Generates a new telephone number, replacing the exchange with '555' and the final digits with a new sequence of digits
3) USLastNameMasker(HashList) - Replaces original string data with a realistic, culturally USA, last name
4) USVariableDateMasker(MultiDate) - Replaces the original data item with a new date with the same month and year but a different day value

Aggreggate Statistics (From: 2025-08-20T03:31:53.931760990Z, To: 2025-08-20T03:37:25.258870068Z, Duration: 00:05:31.327, ops/sec: 12,956,643)
Total masker invocations: 4,292,887,120 (02:28:51.757 hh:mm:ss.millis)
Total successful invocations: 4,292,887,120 (02:28:51.757 hh:mm:ss.millis)
Total failed invocations: 0 (00:00:00.000 hh:mm:ss.millis)

DataMasker Statistics
EmailMasker (3,239,308 ops/sec)
Total invocations: 1,073,221,780 (00:42:22.193 hh:mm:ss.millis)
Successful invocations: 1,073,221,780 (00:42:22.193 hh:mm:ss.millis)
Failed invocations: 0 (00:00:00.000 hh:mm:ss.millis)

US555TelephoneMasker (3,239,196 ops/sec)
Total invocations: 1,073,221,780 (00:49:37.813 hh:mm:ss.millis)
Successful invocations: 1,073,221,780 (00:49:37.813 hh:mm:ss.millis)
Failed invocations: 0 (00:00:00.000 hh:mm:ss.millis)

USLastNameMasker (3,239,160 ops/sec)
Total invocations: 1,073,221,780 (00:21:26.088 hh:mm:ss.millis)
Successful invocations: 1,073,221,780 (00:21:26.088 hh:mm:ss.millis)
Failed invocations: 0 (00:00:00.000 hh:mm:ss.millis)

USVariableDateMasker (3,239,187 ops/sec)
Total invocations: 1,073,221,780 (00:35:25.663 hh:mm:ss.millis)
Successful invocations: 1,073,221,780 (00:35:25.663 hh:mm:ss.millis)
Failed invocations: 0 (00:00:00.000 hh:mm:ss.millis)
Report 1: Obfusware Job Report

The key figures appear in line 8 of the report, the aggregate statistics line:
        Duration: 00:05:31.327, ops/sec: 12,956,643
The job took 5 minutes and 31 seconds to run with almost 13 million masking operations per second.
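The headline numbers can be cross-checked from the other figures in the report. The sketch below recomputes ops/sec from the total invocation count and the wall-clock duration, then sums the per-masker times. Reading the ratio of summed masker time to wall-clock time as a rough measure of parallelism is an interpretation, not something the report states.

```python
def to_seconds(hms):
    """Convert an 'hh:mm:ss.millis' string to seconds."""
    hours, minutes, seconds = hms.split(":")
    return int(hours) * 3600 + int(minutes) * 60 + float(seconds)


wall_clock = to_seconds("00:05:31.327")
total_invocations = 4_292_887_120

# Per-masker cumulative times, copied from the report.
masker_times = {
    "EmailMasker": "00:42:22.193",
    "US555TelephoneMasker": "00:49:37.813",
    "USLastNameMasker": "00:21:26.088",
    "USVariableDateMasker": "00:35:25.663",
}

ops_per_sec = total_invocations / wall_clock
masker_seconds = sum(to_seconds(t) for t in masker_times.values())

print(f"ops/sec: {ops_per_sec:,.0f}")
print(f"summed masker time: {masker_seconds:,.3f} s")
print(f"masker time / wall clock: {masker_seconds / wall_clock:.1f}x")
```

The recomputed rate lands within a few operations per second of the reported 12,956,643 (the report uses a nanosecond-precision duration), and the four masker times add up to the 02:28:51.757 shown for total invocations. That is about 27 times the wall-clock duration, which is consistent with the maskers running in parallel across the cluster.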

Horizontal Scaling an Obfusware Masking Job

The performance of an AWS Glue Obfusware masking job can be easily managed using horizontal scaling: adding Data Processing Units (DPUs) to a masking job shortens its run time. Masking a billion rows in approximately 5 minutes was an ambitious objective and was not achieved on the first try.

First, a test run with 10 DPUs established a performance baseline, with a run time of approximately 24 minutes. Next, 20 DPUs were used, doubling the available processing power. This resulted in a run time of approximately 11 minutes, cutting the run time by about 53%. This demonstrates the highly efficient horizontal scaling that Apache Spark enables. To get close to the 5 minute target, the number of DPUs was doubled again to 40. The resulting run time of 5 minutes and 31 seconds was close enough to the 5 minute target to declare victory.

| DPU | ops/sec | ops/sec/DPU | ops/hr | rows/hr | Duration |
|---|---|---|---|---|---|
| 10 | 3,000,771 | 333,419 | 10,802,775,600 | 2,700,693,900 | 00:23:50.594 |
| 20 | 6,437,642 | 338,823 | 23,175,511,200 | 5,793,877,800 | 00:11:06.841 |
| 40 | 12,956,643 | 332,222 | 46,643,914,800 | 11,660,978,700 | 00:05:31.327 |

Table 3: Obfusware horizontal scaling results
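The scaling behavior in Table 3 can be quantified with a few lines of arithmetic. The sketch below computes the speedup between runs and the DPU-hours each run consumed. It also shows that the reported ops/sec/DPU values match ops/sec divided by the DPU count minus one. One plausible reading is that one DPU hosts the Spark driver rather than doing masking work, but the article does not state this, so treat it as an assumption.

```python
RUNS = [
    # (DPUs, ops/sec, reported ops/sec/DPU, duration in seconds)
    (10, 3_000_771, 333_419, 23 * 60 + 50.594),
    (20, 6_437_642, 338_823, 11 * 60 + 6.841),
    (40, 12_956_643, 332_222, 5 * 60 + 31.327),
]

for dpus, ops, reported_per_dpu, seconds in RUNS:
    # Observation: the reported figure equals ops/sec divided by (DPUs - 1).
    per_dpu = round(ops / (dpus - 1))
    dpu_hours = dpus * seconds / 3600
    print(f"{dpus} DPUs: {per_dpu:,} ops/sec/DPU "
          f"(reported {reported_per_dpu:,}), {dpu_hours:.2f} DPU-hours")

# Compare each run with the next larger one.
for (d1, _, _, t1), (d2, _, _, t2) in zip(RUNS, RUNS[1:]):
    print(f"{d1} -> {d2} DPUs: speedup {t1 / t2:.2f}x, "
          f"run time reduced by {1 - t2 / t1:.1%}")
```

Each doubling of DPUs roughly halves the run time, and the DPU-hours consumed stay nearly constant (slightly under 4 per run), so the faster runs do not consume more total compute than the slower ones.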

Conclusion

The combination of AWS Glue and Obfusware creates a data masking solution capable of meeting the challenge of masking big data. Another important factor is the ease of implementing this solution. AWS Glue Studio enabled the creation of the Obfusware masking job through an easy-to-understand visual interface. Obfusware's tight integration with AWS Glue made specifying the required masking transforms simple and intuitive, even for a novice AWS Glue user. Amazon Web Services (AWS) support for big data is also an important factor in the success of the solution. Amazon S3 storage is simple to use and cost-effective. AWS Glue provides key services such as a Data Catalog and integration with key big data technologies such as Apache Parquet data files and Apache Spark. All of these factors enabled the creation and running of a billion-row masking job in about half a day of work, with an AWS service bill less than the cost of lunch.

If this solution is interesting, and you would like to learn how Obfusware can solve your data masking challenges, check out Obfusware AG at the Obfusware website.

Obfusware AG