Maskala is a privacy engineering toolkit that works with Apache Spark. It aims to provide a number of features for analysing, masking, generalising and filtering data to help ensure the identities of individuals in a dataset are protected from re-identification.
Warning
Disclaimer: Anonymisation is hard. The data privacy and security techniques used in this project, such as K-Anonymity and data redaction, are intended to assess and mitigate the risk of re-identification and may provide you with a means to reduce the risk inherent in working with private data. However, they will not provide complete anonymisation and should not be seen as foolproof solutions. For some use cases you should seek more accepted means of anonymisation such as differential privacy, or through the best technique of all: Not collecting personal data to begin with.
The Anonymiser class is part of the Scala-based data anonymisation toolkit designed to apply various anonymisation strategies to data stored in Apache Spark DataFrames. This toolkit allows for the configuration-driven anonymisation of specific columns in a DataFrame, supporting strategies like masking, encryption, range generalisation, and more.
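To make two of the strategies named above concrete, here is a minimal plain-Scala sketch of masking and range generalisation. It is not Maskala's implementation: the `StrategySketch` object and its function names are hypothetical, and the bucket format (inclusive bounds such as `30-39`) is an assumption chosen for illustration.

```scala
object StrategySketch {
  // Masking: replace every value with a fixed mask, regardless of its content.
  def mask(value: String, maskWith: String = "XXXX"): String = maskWith

  // Range generalisation: map a number to the bucket it falls in,
  // e.g. 37 with a width of 10 becomes "30-39" (inclusive bounds are an assumption).
  def toRange(value: Int, rangeWidth: Int = 10, separator: String = "-"): String = {
    val lower = Math.floorDiv(value, rangeWidth) * rangeWidth
    val upper = lower + rangeWidth - 1
    s"$lower$separator$upper"
  }
}
```

Both functions trade precision for privacy: the mask discards the value entirely, while the range keeps only its approximate magnitude.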
Let's step through an example. Imagine we have the following data (Note: This is an example Netflix dataset from the )
user_id,rating,date,movie,location
1815755,5,2004-07-20,Dinosaur Planet,Dominica
1426604,4,2005-09-01,Dinosaur Planet,Svalbard & Jan Mayen Islands
1535440,4,2005-08-18,Dinosaur Planet,Monaco

import org.apache.spark.sql.SparkSession
import org.mitchelllisle.Anonymiser
val spark = SparkSession.builder.appName("AnonymisationApp").getOrCreate()
val df = spark.read.load("your-data-source")
val anonymiser = new Anonymiser("path/to/your/config.yaml")
anonymiser(df)
catalog: 'your_catalog'
schema: 'your_schema'
table: 'your_table'
anonymise:
  - column: 'columnName1'
    strategy: 'MaskingStrategy'
    parameters:
      mask: 'XXXX'
  - column: 'columnName2'
    strategy: 'RangeStrategy'
    parameters:
      rangeWidth: 10
      separator: '-'
  - column: 'columnName3'
    strategy: 'EncryptionStrategy'
    parameters:
      secret: 'your-secret-key'
analyse:
  - type: 'AnalysisType1'
    parameters:
      param1: 'value1'
      param2: 'value2'

These methods are tools to aid in understanding and reducing re-identification risks and should be used as part of a broader data protection strategy. Remember, no single method can ensure total data privacy and security.
K-Anonymity is a concept in data privacy that aims to ensure an individual's information cannot be distinguished from that of at least k-1 others in a dataset. This is achieved by generalizing, suppressing, or altering specific identifiers (like names, addresses, or other personal details) until each person cannot be uniquely identified within a group of at least k individuals. K-Anonymity can help mitigate the risk of re-identification in published data, which makes it useful for protecting personal information in datasets. However, while K-Anonymity can reduce the risk of identity disclosure, it has been shown to be susceptible to re-identification attacks.
You can assess if your dataset satisfies KAnonymity by using the isKAnonymous method:
import org.apache.spark.sql.SparkSession
val spark = SparkSession.builder().getOrCreate()
import spark.implicits._
// In the below DataFrame there are only two rows that match, meaning the other two don't satisfy K(2) Anonymity
val data = Seq(
("30", "Male"),
("30", "Male"),
("18", "Female"),
("45", "Female")
).toDF("Age", "Gender")
val kAnon = new KAnonymity(2)
val evaluated = kAnon.isKAnonymous(data) // returns false

If you want a dataset that only contains the rows that meet KAnonymity, you can use the removeLessThanKRows method:
import org.apache.spark.sql.SparkSession
val spark = SparkSession.builder().getOrCreate()
import spark.implicits._
val data = Seq(
("30", "Male"),
("30", "Male"),
("18", "Female"),
("45", "Female")
).toDF("Age", "Gender")
val kAnon = new KAnonymity(2)
val result = kAnon.removeLessThanKRows(data)
/* result would only contain the first two rows above:
("30", "Male"),
("30", "Male"),
* */

Note: Now if you run isKAnonymous(result) it will return true since we've removed the rows that don't satisfy K(2).
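Conceptually, both checks come down to counting how often each combination of values occurs. The following is a minimal sketch of that idea in plain Scala, without Spark. It is not how Maskala implements these methods; the `KAnonymitySketch` object is hypothetical and only borrows the method names for readability.

```scala
object KAnonymitySketch {
  type Row = Seq[String]

  // Count how many rows share each combination of quasi-identifier values.
  def groupSizes(rows: Seq[Row]): Map[Row, Int] =
    rows.groupBy(identity).map { case (key, group) => key -> group.size }

  // A dataset is k-anonymous when every combination occurs at least k times.
  def isKAnonymous(rows: Seq[Row], k: Int): Boolean =
    groupSizes(rows).values.forall(_ >= k)

  // Keep only the rows whose combination occurs at least k times.
  def removeLessThanKRows(rows: Seq[Row], k: Int): Seq[Row] = {
    val sizes = groupSizes(rows)
    rows.filter(row => sizes(row) >= k)
  }
}
```

Applied to the four example rows with k = 2, the check fails because the "18"/"Female" and "45"/"Female" combinations each occur once, and filtering leaves only the two "30"/"Male" rows.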
ℓ-Diversity is an extension of the K-Anonymity principle in data privacy, designed to enhance the protection against certain types of attacks that K-Anonymity is susceptible to. While K-Anonymity ensures that each individual is indistinguishable from at least k-1 others in the dataset, ℓ-Diversity goes further by requiring that each group of indistinguishable individuals has at least ℓ distinct values for sensitive attributes. This addresses the limitation of K-Anonymity in scenarios where the sensitive attributes within a group are homogeneous, which still poses a risk of attribute disclosure. ℓ-Diversity ensures diversity in sensitive information, reducing the likelihood that an individual's sensitive attributes can be accurately inferred from an anonymized dataset. It is particularly useful against homogeneity and background knowledge attacks. However, like K-Anonymity, ℓ-Diversity is not a comprehensive solution. For more information (including the limitations of ℓ-Diversity) I recommend reading ℓ-Diversity: Privacy Beyond k-Anonymity.
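The requirement described above (often written l-diversity) can be sketched in plain Scala by counting distinct sensitive values per group. This is an illustration of the "distinct" variant of the definition only, not Maskala's implementation; the `LDiversitySketch` object and `Record` case class are hypothetical.

```scala
object LDiversitySketch {
  // Each record pairs a quasi-identifier group with one sensitive value.
  final case class Record(quasiIdentifier: String, sensitive: String)

  // Number of distinct sensitive values seen in each quasi-identifier group.
  def distinctSensitiveValues(records: Seq[Record]): Map[String, Int] =
    records.groupBy(_.quasiIdentifier).map { case (qi, group) =>
      qi -> group.map(_.sensitive).distinct.size
    }

  // A dataset is l-diverse when every group has at least l distinct sensitive values.
  def isLDiverse(records: Seq[Record], l: Int): Boolean =
    distinctSensitiveValues(records).values.forall(_ >= l)
}
```

A group whose members all share one sensitive value fails the check for any l greater than 1, even when the group is large enough to satisfy K-Anonymity.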
You can assess if your dataset satisfies ℓ-Diversity by using the isLDiverse method:
import org.mitchelllisle.ldiversity.LDiversity
import org.apache.spark.sql.SparkSession
val spark = SparkSession.builder().getOrCreate()
import spark.implicits._
val data = Seq(
("A", "Male"),
("A", "Male"),
("B", "Female"),
("B", "Other")
).toDF("QuasiIdentifier", "SensitiveAttribute")
val lDiv = new LDiversity(2)
val evaluated = lDiv.isLDiverse(data) // returns false

If you want a dataset that only contains the rows that meet ℓ-Diversity, you can use the removeLessThanLRows method:
import org.mitchelllisle.ldiversity.LDiversity
import org.apache.spark.sql.SparkSession
val spark = SparkSession.builder().getOrCreate()
import spark.implicits._
val data = Seq(
("A", "Male"),
("A", "Male"),
("B", "Female"),
("B", "Other")
).toDF("QuasiIdentifier", "SensitiveAttribute")
val lDiv = new LDiversity(2)
val result = lDiv.removeLessThanLRows(data)
/* result would only contain the last two rows above, since group "B" is the
   only one with at least two distinct sensitive values:
("B", "Female"),
("B", "Other")
* */

The UniquenessAnalyser class in the org.mitchelllisle.reidentifiability package provides methods to analyze the
uniqueness of values within a DataFrame using Spark. Uniqueness is a proxy for re-identifiability, an important privacy
engineering concept. This class helps evaluate re-identifiability risk metrics using data uniqueness as an indicator.
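As a rough illustration of why uniqueness works as a proxy, the sketch below computes the share of rows whose combination of values appears exactly once. This is one simple uniqueness measure chosen for illustration, not necessarily what UniquenessAnalyser reports; the `UniquenessSketch` object is hypothetical.

```scala
object UniquenessSketch {
  type Row = Seq[String]

  // Share of rows whose combination of values appears exactly once.
  // Such rows are the easiest to single out, so a higher share suggests higher risk.
  def uniqueShare(rows: Seq[Row]): Double =
    if (rows.isEmpty) 0.0
    else {
      val counts = rows.groupBy(identity).map { case (key, group) => key -> group.size }
      rows.count(row => counts(row) == 1).toDouble / rows.size
    }
}
```

For four rows where two share the same values and two are unique, the share is 0.5.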
import org.apache.spark.sql.SparkSession
import org.mitchelllisle.reidentifiability.UniquenessAnalyser
val spark = SparkSession.builder.getOrCreate()
val data = spark.read.option("header", "true").csv("src/test/resources/netflix-sample.csv")
val result = UniquenessAnalyser(data)

The Redactor class, along with its redaction strategies, allows for flexible redaction of data in a DataFrame. You can apply multiple redaction strategies, including masking and hashing.
import org.apache.spark.sql.SparkSession
val spark = SparkSession.builder.getOrCreate()
val redactor = new Redactor(Seq(
MaskingStrategy("movie", "*****"),
HashingStrategy("user_id"),
))
val data = spark.read.option("header", "true").csv("src/test/resources/netflix-sample.csv")
val redactedData = redactor(data)
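For intuition about what hashing a column does to each value, here is a minimal plain-Scala sketch using SHA-256 from the JDK. The choice of SHA-256 is an assumption for illustration; the source does not say which algorithm HashingStrategy uses, and the `HashingSketch` object is hypothetical.

```scala
import java.nio.charset.StandardCharsets
import java.security.MessageDigest

object HashingSketch {
  // Hex-encoded SHA-256 digest of a value: the same input always gives the same output.
  def sha256Hex(value: String): String =
    MessageDigest.getInstance("SHA-256")
      .digest(value.getBytes(StandardCharsets.UTF_8))
      .map(b => f"${b & 0xff}%02x")
      .mkString
}
```

Because the output is deterministic, hashed identifiers can still be joined across tables. The same property means that an unsalted hash of a small identifier space (such as sequential user IDs) can be reversed by hashing every candidate value, so hashing alone should not be treated as anonymisation.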
The MerkleTree provides cryptographic proof capabilities for data retention verification and tamper-evident audit trails. This is essential for regulatory compliance (GDPR, CCPA) and proving data deletion without revealing sensitive information.
Generate cryptographic fingerprints of your datasets that change if any data is modified:
import org.apache.spark.sql.SparkSession
import org.mitchelllisle.analysers.MerkleTree
val spark = SparkSession.builder().getOrCreate()
val userData = spark.read.option("header", "true").csv("user-data.csv")
// Create tamper-evident proof of current data state
val dataProof = MerkleTree.apply(
data = userData,
columns = Seq("email", "age", "location"), // Columns to include in proof
idColumn = "user_id"
)
println(s"Dataset fingerprint: ${dataProof.rootHash}")
println(s"Record count: ${dataProof.recordCount}")

Prove that specific records were actually deleted (not just hidden) with cryptographic evidence:
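import spark.implicits._ // enables the $"column" syntax used below
// Columns covered by the proof
val columns = Seq("email", "age", "location")
// Before deletion
val beforeData = userData
val beforeProof = MerkleTree.apply(beforeData, columns, "user_id")
// After user requests deletion
val afterData = userData.filter($"user_id" =!= "user123")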
// Generate deletion proof
val deletionProof = MerkleTree.verifyDeletion(
beforeData = beforeData,
afterData = afterData,
deletedIds = Seq("user123"),
columns = columns,
idColumn = "user_id"
)
// Validate the proof
val isValidDeletion = MerkleTree.validateDeletionProof(
deletionProof,
expectedDeletions = 1
)
if (isValidDeletion) {
println("Deletion cryptographically verified")
// Proof is now validated - no JSON export needed
}

Perform comprehensive analysis combining uniqueness assessment with retention proofs:
val (uniquenessAnalysis, retentionProof) = MerkleTree.combinedPrivacyAnalysis(
dataFrame = userData,
groupByColumns = Seq("age", "location"), // For uniqueness analysis
userIdColumn = "user_id",
retentionColumns = Seq("email", "age", "location"), // For retention proof
k = 2048
)
// Analyze re-identification risks
uniquenessAnalysis.show()
// Cryptographic proof is available in retentionProof object

- GDPR Right to be Forgotten: Provide mathematical proof of data deletion to users and regulators
- Data Integrity Monitoring: Detect unauthorized changes to sensitive datasets
- Audit Compliance: Create verifiable logs of data operations for regulatory requirements
- Zero-Knowledge Verification: Prove compliance without revealing actual data contents
Note
Merkle trees provide tamper-evident proofs but should be combined with access controls and encryption for complete data protection. The root hash acts as a "digital fingerprint" - any change to the data completely changes the hash, making unauthorized modifications immediately detectable.
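The "digital fingerprint" behaviour can be seen in a small plain-Scala sketch of a Merkle root. This is a generic construction for illustration, not Maskala's MerkleTree: the `MerkleSketch` object is hypothetical, and SHA-256 plus duplicating an unpaired node are assumptions made here.

```scala
import java.nio.charset.StandardCharsets
import java.security.MessageDigest

object MerkleSketch {
  private def sha256Hex(value: String): String =
    MessageDigest.getInstance("SHA-256")
      .digest(value.getBytes(StandardCharsets.UTF_8))
      .map(b => f"${b & 0xff}%02x")
      .mkString

  // Hash each record to a leaf, then repeatedly hash adjacent pairs
  // until a single root remains. An unpaired node is paired with itself.
  def rootHash(records: Seq[String]): String = {
    @scala.annotation.tailrec
    def reduce(level: List[String]): String =
      if (level.size == 1) level.head
      else reduce(level.grouped(2).map {
        case List(left, right) => sha256Hex(left + right)
        case single            => sha256Hex(single.head + single.head)
      }.toList)

    // The root of an empty dataset is defined here as the hash of the empty string.
    if (records.isEmpty) sha256Hex("") else reduce(records.map(sha256Hex).toList)
  }
}
```

Changing, adding, or removing any single record changes at least one leaf, and that change propagates up through every parent hash to the root, so comparing two root hashes is enough to detect that the datasets differ.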
- Apache Spark 3.x
Pull requests are welcome. For major changes, please open an issue first to discuss what you would like to change.