Data Privacy Techniques in Data Engineering

Written by lexaneon | Published 2024/02/06
Tech Story Tags: data-engineering | data-privacy | data-protection | pseudonymization | anonymization | hashing | tokenization | data-security

TL;DR: This article delves into essential techniques for maintaining data privacy in data engineering pipelines, covering pseudonymization with hashing and tokenization, as well as anonymization methods like suppression and generalization. Gain insights into navigating compliance with GDPR, CCPA, and HIPAA while securing sensitive data effectively.

As data engineers, we develop pipelines to process sensitive data. We must be aware of data privacy regulations and ensure our pipelines comply with them. This article discusses various techniques to ensure data privacy in data engineering.

Most requirements come from the GDPR (General Data Protection Regulation), the CCPA (California Consumer Privacy Act), HIPAA (Health Insurance Portability and Accountability Act), and other data privacy laws. This article will not delve into the details of these regulations but will focus on the technical aspects of ensuring data privacy in data engineering.

The following will be covered:

  • Pseudonymization: Hashing and Tokenization
  • Anonymization: Suppression and Generalization

Pseudonymization

Pseudonymization is a data management and de-identification procedure in which personally identifiable information fields within a data record are replaced by one or more artificial identifiers, or pseudonyms. Using a single pseudonym for each replaced field or collection of replaced fields makes the data record less identifiable while keeping it suitable for data analysis and data processing. Pseudonymization is a reversible process: the original data can be restored by whoever holds the mapping between pseudonyms and original values.

  • Re-identification remains possible
  • Pseudonymized data is still considered PII under the GDPR
  • The two main approaches are hashing and tokenization

Example:

Source Data

+------------+----------+
|name        |dob       |
+------------+----------+
|Alex Smith  |1980-01-11|
|John Doe    |1985-01-21|
|Mike Johnson|1990-01-31|
+------------+----------+

Pseudonymized Data

+------------+----------+
|name        |dob       |
+------------+----------+
|NAME-001    |1980-01-11|
|NAME-002    |1985-01-21|
|NAME-003    |1990-01-31|
+------------+----------+

Hashing

Hashing is a one-way function that converts an input into a fixed-size string of characters, typically rendered as a hexadecimal string. The output is deterministic: the same input always produces the same output. Hashing is commonly used to store passwords securely, but it can also be used to pseudonymize data. The most common hashing algorithms are MD5, SHA-1, and SHA-256 (MD5 and SHA-1 are no longer considered cryptographically strong, so SHA-256 is generally preferred for new systems). To make the hashing process more secure, you can add a salt to the input: a random string appended to the input before hashing, which makes it much harder for attackers to reverse the hash with precomputed (rainbow) tables.

Trade-offs of hashing:

  • Increased data size
  • Some operations on the hashed column (e.g., sorting or range filters by the original value) become less efficient or impossible

Example in PySpark, hashing the name column with a salted MD5:

from pyspark.sql import SparkSession
from pyspark.sql import functions as F


def get_spark_session():
    spark = SparkSession.builder.appName("test").getOrCreate()
    return spark


def get_salt():
    return "some salty string"  # should be stored & read from a secure location


def test_spark():
    spark = get_spark_session()
    columns = ["name", "dob"]
    data = [("Alex Smith", "1980-01-01"), ("John Doe", "1985-01-01"), ("Mike Johnson", "1990-01-01")]
    df = spark.createDataFrame(data, columns)
    df.show(10, False)

    # Replace the name with a salted MD5 hash; the same name always yields the same hash
    hashed_df = df.withColumn("name", F.md5(F.concat(F.col("name"), F.lit(get_salt()))))
    hashed_df.show(10, False)

Results:

# Source Data
+------------+----------+
|name        |dob       |
+------------+----------+
|Alex Smith  |1980-01-01|
|John Doe    |1985-01-01|
|Mike Johnson|1990-01-01|
+------------+----------+

# Hashed Data
+--------------------------------+----------+
|name                            |dob       |
+--------------------------------+----------+
|6c863d4f759030513519e9187f2cdbe6|1980-01-01|
|be8c2b1bed7ccaf2f9f8e597f390500d|1985-01-01|
|b2a1813854b9829f6d1aa302fae4493d|1990-01-01|
+--------------------------------+----------+

Tokenization

Tokenization is the process of replacing sensitive data with unique identification symbols (tokens) that retain all the essential information about the data without compromising its security.

Tokenization is often used to protect sensitive data stored in a database, and it is also used to protect data in transit over the internet. Tokenization is a reversible process: the original data can be restored by looking the token up in a securely stored mapping table.

  • Replaces each PII value with a token
  • Tokenized data is still considered PII under the GDPR, because the mapping back to the original values exists
  • The original data is stored in a secure location (a look-up table)

Example in PySpark, building a look-up table and tokenizing the name column:

from pyspark.sql import SparkSession
from pyspark.sql import functions as F


def get_spark_session():
    spark = SparkSession.builder.appName("test").getOrCreate()
    return spark

def test_tokenization():
    spark = get_spark_session()
    columns = ["name", "dob"]
    data = [("Alex Smith", "1980-01-01"), ("John Doe", "1985-01-01"), ("Mike Johnson", "1990-01-01")]
    src_df = spark.createDataFrame(data, columns)
    src_df.show(10, False)

    # Build the look-up table: one token per distinct name.
    # monotonically_increasing_id() yields unique (not necessarily consecutive) IDs;
    # the look-up table must be persisted in a secure location to allow re-identification.
    lookup_df = src_df.select("name").distinct().withColumn("id", F.monotonically_increasing_id())
    lookup_df.show(10, False)

    # Replace each name with its token from the look-up table
    tokenized_df = src_df.join(lookup_df, "name", "left").withColumn("name", F.col("id")).drop("id")
    tokenized_df.show(10, False)

Results:

# Source data
+------------+----------+
|name        |dob       |
+------------+----------+
|Alex Smith  |1980-01-01|
|John Doe    |1985-01-01|
|Mike Johnson|1990-01-01|
+------------+----------+

# Lookup table, should be stored in a secure location
+------------+---+
|name        |id |
+------------+---+
|Alex Smith  |0  |
|John Doe    |1  |
|Mike Johnson|2  |
+------------+---+

# Tokenized data
+----+----------+
|name|dob       |
+----+----------+
|0   |1980-01-01|
|1   |1985-01-01|
|2   |1990-01-01|
+----+----------+

Anonymization

Anonymization is the process of removing or modifying personal data in such a way that it can no longer be linked back to an individual. Anonymization is a one-way process: the original data cannot be restored. It is often used for data intended for research or statistical purposes. Anonymized data is not considered PII under the GDPR.

There are two main techniques for anonymization:

  • Suppression
  • Generalization

Suppression

Suppression is the process of removing sensitive data from a record. For example, you can remove the name and address from a record, but keep the date of birth.

Possible implementations include:

  • Excluding the column from the dataset
  • Replacing the value with a null or a default value
  • Removing rows where a group becomes too small to remain anonymous, e.g., a village, street, or area where only one person lives (a PySpark sketch follows the example below)

# Source data
+------------+----------+
|name        |dob       |
+------------+----------+
|Alex Smith  |1980-01-01|
|John Doe    |1985-01-01|
|Mike Johnson|1990-01-01|
+------------+----------+

# Suppressed data
+------------+----------+
|name        |dob       |
+------------+----------+
|null        |1980-01-01|
|null        |1985-01-01|
|null        |1990-01-01|
+------------+----------+
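
A minimal PySpark sketch of the first two variants, assuming a DataFrame df with the name and dob columns shown above:

from pyspark.sql import functions as F

# Variant 1: exclude the sensitive column from the dataset entirely
suppressed_df = df.drop("name")

# Variant 2: keep the column but replace every value with null
suppressed_df = df.withColumn("name", F.lit(None).cast("string"))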

Generalization

Generalization is the process of replacing sensitive data with a more generalized value. For instance, one could replace the date of birth with the year of birth. Generalization is often employed to safeguard data used for research or statistical purposes.

Possible implementations include:

  • Replace the value with a more general value, e.g., the year of birth instead of the full date of birth (a PySpark sketch follows the example below)

# Source data
+------------+----------+
|name        |dob       |
+------------+----------+
|Alex Smith  |1980-01-01|
|John Doe    |1985-01-01|
|Mike Johnson|1990-01-01|
+------------+----------+

# Generalized data
+--------+--------------+
|name    |year_of_birth |
+--------+--------------+
|1       |1980          |
|2       |1985          |
|3       |1990          |
+--------+--------------+
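
A minimal PySpark sketch of this generalization, assuming a DataFrame df with a dob column stored as a yyyy-MM-dd string:

from pyspark.sql import functions as F

# Replace the exact date of birth with only the year of birth
generalized_df = (
    df.withColumn("year_of_birth", F.year(F.to_date(F.col("dob"), "yyyy-MM-dd")))
      .drop("dob")
)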

  • Replace the value with a range of values (binning), e.g., an age range instead of an exact age (a sketch follows the example below)

# Source data
+------------+----------+
|name        |dob       |
+------------+----------+
|Alex Smith  |1980-01-01|
|John Doe    |1985-01-01|
|Mike Johnson|1990-01-01|
+------------+----------+

# Binned data
+--------+-----------+
|name    |age_range  |
+--------+-----------+
|1       |40-50      |
|2       |30-40      |
|3       |30-40      |
+--------+-----------+
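
A minimal sketch of binning in PySpark, assuming the same df and an arbitrary fixed reference date (a hypothetical choice made here for reproducibility) for computing the age:

from pyspark.sql import functions as F

# Compute an approximate age against a fixed reference date, then map it
# to a 10-year bucket such as "30-40" or "40-50"
reference_date = F.to_date(F.lit("2024-02-06"))
binned_df = (
    df.withColumn("age", F.floor(F.months_between(reference_date, F.to_date(F.col("dob"))) / 12))
      .withColumn("lower_bound", F.floor(F.col("age") / 10) * 10)
      .withColumn(
          "age_range",
          F.concat(F.col("lower_bound").cast("string"), F.lit("-"), (F.col("lower_bound") + 10).cast("string")),
      )
      .drop("dob", "age", "lower_bound")
)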

  • Replace the value with a broader category, e.g., the country instead of the full address, or a network prefix instead of a full IP address (a sketch follows the examples below)

# Source data
+------------+---------------------------------------------------+
|name        |address                                            |
+------------+---------------------------------------------------+
|Alex Smith  |123 Fake Street, Building 1, Amsterdam, Netherlands|
|John Doe    |456 Unreal Road, Building 2, Paris, France         |
|Mike Johnson|789 Nonexistent Avenue, Building 3, New York, USA  |
+------------+---------------------------------------------------+

# Categorical generalized data
+------------+---------------+
|name        |country        |
+------------+---------------+
|Alex Smith  |Netherlands    |
|John Doe    |France         |
|Mike Johnson|USA            |
+------------+---------------+

# Or, for IP addresses, the network prefix (CIDR notation) can be used
+------------+---------------+
|name        |ip_address     |
+------------+---------------+
|Alex Smith  |192.168.203.157|
|John Doe    |127.0.99.185   |
|Mike Johnson|192.168.193.61 |
+------------+---------------+

# Converted to /24 CIDR notation
+------------+------------------+
|name        |ip_address        |
+------------+------------------+
|Alex Smith  |192.168.203.0/24  |
|John Doe    |127.0.99.0/24     |
|Mike Johnson|192.168.193.0/24  |
+------------+------------------+
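
A minimal PySpark sketch of masking IPv4 addresses to their /24 network, assuming a DataFrame df with an ip_address string column:

from pyspark.sql import functions as F

# Keep everything up to the third dot (the /24 network) and replace the host part
masked_df = df.withColumn(
    "ip_address",
    F.concat(F.substring_index(F.col("ip_address"), ".", 3), F.lit(".0/24")),
)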


Written by lexaneon | Staff Data Engineer and conference speaker with 15+ years of experience
Published by HackerNoon on 2024/02/06