Anonymization
YData Synthesizers offers a way to anonymize sensitive information such that the original values are not present in the synthetic data but replaced by fake values.
Does the model retain the original values?
No! The anonymization is performed before model training, so the model never sees the original values.
The anonymization is performed by specifying which columns need to be anonymized and how to perform the anonymization. The anonymization rules are defined as a dictionary with the following format:
{column_name: anonymization_rule}
While there are some predefined anonymization rules such as name, email, and company, it is also possible to create a rule using a regular expression.
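To make the regular-expression rule concrete, here is a minimal sketch in plain Python of what it means for a replacement value to conform to a pattern such as [A-Z]{2}-[A-Z]{4}. It does not use the YData SDK and is not how the SDK generates values internally; the helper fake_ticket is hypothetical and hand-written for this one pattern only.

```python
import random
import re
import string

# Pattern used for the Ticket column in the example on this page.
TICKET_PATTERN = "[A-Z]{2}-[A-Z]{4}"


def fake_ticket(rng: random.Random) -> str:
    """Hypothetical helper: build one value shaped like AA-AAAA.

    Hand-written for this single pattern; it is not a general
    regex-to-string generator.
    """
    prefix = "".join(rng.choices(string.ascii_uppercase, k=2))
    suffix = "".join(rng.choices(string.ascii_uppercase, k=4))
    return f"{prefix}-{suffix}"


rng = random.Random(0)  # seeded so repeated runs give the same values
tickets = [fake_ticket(rng) for _ in range(5)]

for ticket in tickets:
    # fullmatch checks that the whole value conforms to the pattern,
    # not just a substring of it.
    assert re.fullmatch(TICKET_PATTERN, ticket) is not None

print(tickets)
```

The same re.fullmatch check can be run on an anonymized column after sampling to confirm that every value follows the configured pattern.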
The anonymization rules have to be passed to a synthesizer in its fit method using the parameter anonymize.
What is the difference between anonymization and privacy?
Anonymization makes sure sensitive information is hidden from the data. Privacy makes sure it is not possible to infer the original data points from the synthetic data points via statistical attacks.
Therefore, for data sharing, anonymization and privacy controls are complementary.
The example below demonstrates how to anonymize the column Name with fake names and the column Ticket with values generated from a regular expression:
import os
from ydata.sdk.dataset import get_dataset
from ydata.sdk.synthesizers import RegularSynthesizer
# Do not forget to set your token as an environment variable
os.environ["YDATA_TOKEN"] = '<TOKEN>' # Remove if already defined
def main():
    """In this example, we demonstrate how to train a synthesizer from a pandas
    DataFrame while anonymizing selected columns.

    After training a Regular Synthesizer, we request a sample.
    """
    X = get_dataset('titanic')

    # We initialize a regular synthesizer
    # As long as the synthesizer does not call `fit`, it exists only locally
    synth = RegularSynthesizer(name="Titanic")

    # We define anonymization rules, which is a dictionary with format:
    # {column_name: anonymization_rule, ...}
    # While there are some predefined anonymization rules like name, email, company,
    # it is also possible to create a rule using a regular expression
    rules = {
        "Name": "name",
        "Ticket": "[A-Z]{2}-[A-Z]{4}"
    }

    # Alternatively, the same rules can be written in a more explicit form
    # (this assignment replaces the dictionary above)
    rules = {
        'Name': {'type': 'name'},
        'Ticket': {'type': 'regex',
                   'regex': '[A-Z]{2}-[A-Z]{4}'}
    }

    # We train the synthesizer on our dataset
    synth.fit(
        X,
        anonymize=rules
    )

    # We request a synthetic dataset with 50 rows
    sample = synth.sample(n_samples=50)
    print(sample[["Name", "Ticket"]].head(3))


if __name__ == "__main__":
    main()
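The example defines the rules twice, once in a short form and once in a more explicit form. The sketch below shows how the two forms appear to correspond, using a hypothetical helper to_explicit_rules that is not part of the YData SDK. It assumes that a string naming a predefined rule maps to {'type': <name>} and that any other string is treated as a regular expression. The set of predefined names is limited to the three mentioned on this page and may be incomplete.

```python
# Predefined rule names mentioned on this page; the real SDK may support more.
PREDEFINED_TYPES = {"name", "email", "company"}


def to_explicit_rules(rules: dict) -> dict:
    """Hypothetical helper: rewrite short-form rules in the explicit form.

    Assumption: a string that is not a known predefined rule name is
    interpreted as a regular expression.
    """
    explicit = {}
    for column, rule in rules.items():
        if isinstance(rule, dict):
            # Already explicit: copy it unchanged.
            explicit[column] = dict(rule)
        elif rule in PREDEFINED_TYPES:
            explicit[column] = {"type": rule}
        else:
            explicit[column] = {"type": "regex", "regex": rule}
    return explicit


short_rules = {"Name": "name", "Ticket": "[A-Z]{2}-[A-Z]{4}"}
explicit_rules = to_explicit_rules(short_rules)
print(explicit_rules)
```

Under these assumptions, converting the short-form dictionary from the example yields the explicit dictionary shown there. The short form is quicker to write, while the explicit form states which kind of rule is intended.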