Anonymization is the process of protecting private or sensitive information by erasing or encrypting identifiers that connect an ‘entity’ to stored data. If the anonymization is fool-proof, the process of Deanonymization should not reveal personal identifiable information. Typically, anonymization could be just pattern anonymization, or just the value anonymization, or both. In this work, we want to do just the value anonymization, so as to preserve the predictive/detective power. Just like any other data, even in Telco data, we will have to deal with both the categorical Variables and the numerical variables. There are various approaches under anonymization:
Typically, Synthetic data generation is considered as the most fool-proof among all the other approaches.
In this work, we want to answer the following questions:
In this phase, we want to (a) agree on what constitute the ‘sensitive’ (PII) data. (b) agree on the exact dataset to use. (c) find the 'gaps' in datasets and try to fill it somehow.
What constitutes the PII information:
PII Type | Dataset (links) |
Names (Systems, Domain, Individuals, Organizations, Places, etc.) | |
Address (IP and MAC) | Internet Traffic Dataset: EX1, EX2 |
Telco Fields - IMSI, IMEI, MSIN, MSISDN, MCC+MNC | Adult Dataset enhanced with Telco-Fields Adult Dataset: Generate random IMEI/IMSI* fields and add it to this dataset |
Location Data (GPS, Cell-ID, Count, etc.) |
In this phase we would want to:
we will be evaluating the following techniques