Data masking is a technique used to protect sensitive data, usually any data that could be deemed personally identifiable information (PII), over and above an organisation’s standard information security protocols such as access control. Data masking, also known as data obfuscation, hides the actual data behind modified content such as substitute characters or numbers. Its main objective is to create an alternate version of the data that cannot easily be identified or reverse engineered, protecting data classified as sensitive. Importantly, the masked data remains consistent across multiple databases, and its usability is unchanged.
It is a method of creating a structurally similar but inauthentic version of an organization’s data that can be used for purposes such as software testing and user training. The purpose is to protect the actual data while having a functional substitute for occasions when the real data is not required. Data masking is often mentioned in legal, statutory and regulatory guidelines and laws governing the storage and access of employee, customer, user and vendor information. It generally applies to non-production environments, such as software development, testing and user training, which do not need actual data.
Although most organizations have stringent security controls in place to protect production data in storage and in business use, the same data is sometimes used in operations that are less secure. The issue is often compounded if these operations are outsourced and the organization has less control over the environment. In the wake of compliance legislation, most organizations are no longer comfortable exposing real data unnecessarily. Data masking substitutes original values in a data set with randomized data using various data shuffling and manipulation techniques. The obfuscated data maintains the unique characteristics of the original data so that it yields the same results as the original data set.
Data masking is a complex technical process that involves altering sensitive information and preventing users from identifying data subjects through a variety of measures. Whilst this is itself an administrative task, the nature of data masking is directly related to an organisation’s ability to remain compliant with laws, regulations and statutory guidelines concerning the storage, access and processing of data. As such, ownership should reside with the Chief Information Security Officer, or organisational equivalent.
Data masking should be used in accordance with the organization’s topic-specific policy on access control and other related topic-specific policies, and business requirements, taking applicable legislation into consideration.
To limit the exposure of sensitive data including PII, and to comply with legal, statutory, regulatory and contractual requirements.
ISO 27002 Implementation Guidance
Where the protection of sensitive data (e.g. PII) is a concern, the organization should consider hiding such data by using techniques such as data masking, pseudonymization or anonymization. Pseudonymization or anonymization techniques can hide PII, disguise the true identity of PII principals or other sensitive information, and disconnect the link between PII and the identity of the PII principal or the link between other sensitive information. When using pseudonymization or anonymization techniques, it should be verified that data has been adequately pseudonymized or anonymized. Data anonymization should consider all the elements of the sensitive information to be effective. As an example, if not considered properly, a person can be identified even if the data that can directly identify that person is anonymised, by the presence of further data which allows the person to be identified indirectly. Additional techniques for data masking include:
- encryption (requiring authorized users to have a key);
- nulling or deleting characters (preventing unauthorized users from seeing full messages);
- varying numbers and dates;
- substitution (changing one value for another to hide sensitive data);
- replacing values with their hash.
The following should be considered when implementing data masking techniques:
a) not granting all users access to all data, therefore designing queries and masks in order to show only the minimum required data to the user;
b) there are cases where some data should not be visible to the user for some records out of a set of data; in this case, designing and implementing a mechanism for obfuscation of data (e.g. if a patient does not want hospital staff to be able to see all of their records, even in case of emergency, then the hospital staff are presented with partially obfuscated data and data can only be accessed by staff with specific roles if it contains useful information for appropriate treatment);
c) when data are obfuscated, giving the PII principal the possibility to require that users cannot see if the data are obfuscated (obfuscation of the obfuscation; this is used in health facilities, for example if the patient does not want personnel to see that sensitive information such as pregnancies or results of blood exams has been obfuscated);
d) any legal or regulatory requirements (e.g. requiring the masking of payment cards’ information during processing or storage).
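Points a) and b) above can be illustrated with a minimal sketch that shows each role only the fields it needs and obfuscates the rest. The role names, field names and the `VISIBLE_FIELDS` policy table are assumptions made for illustration; they are not part of ISO 27002, and a real policy would come from the organization’s access control rules.

```python
from typing import Dict, Set

# Hypothetical role-to-field policy (an assumption for this sketch).
VISIBLE_FIELDS: Dict[str, Set[str]] = {
    "receptionist": {"name", "appointment"},
    "clinician": {"name", "appointment", "diagnosis"},
}

MASK = "****"


def masked_view(record: Dict[str, str], role: str) -> Dict[str, str]:
    """Return a copy of the record with every non-permitted field masked."""
    allowed = VISIBLE_FIELDS.get(role, set())  # unknown roles see nothing
    return {
        field: (value if field in allowed else MASK)
        for field, value in record.items()
    }


patient = {
    "name": "A. Patient",
    "appointment": "2024-05-01",
    "diagnosis": "asthma",
}
print(masked_view(patient, "receptionist"))
print(masked_view(patient, "clinician"))
```

Defaulting unknown roles to an empty set means a missing policy entry hides data rather than exposing it, which is usually the safer failure mode.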
The following should be considered when using data masking, pseudonymization or anonymization:
a) level of strength of data masking, pseudonymization or anonymization according to the usage of the processed data;
b) access controls to the processed data;
c) agreements or restrictions on usage of the processed data;
d) prohibiting collating the processed data with other information in order to identify the PII principal;
e) keeping track of providing and receiving the processed data.
Anonymization irreversibly alters PII in such a way that the PII principal can no longer be identified directly or indirectly. Pseudonymization replaces the identifying information with an alias. Knowledge of the algorithm (sometimes referred to as the “additional information”) used to perform the pseudonymization allows for at least some form of identification of the PII principal. Such “additional information” should therefore be kept separate and protected. While pseudonymization is therefore weaker than anonymization, pseudonymized datasets can be more useful in statistical research.
Data masking is a set of techniques to conceal, substitute or obfuscate sensitive data items. Data masking can be static (when data items are masked in the original database), dynamic (using automation and rules to secure data in real-time) or on-the-fly (with data masked in an application’s memory).
Hash functions can be used in order to anonymize PII. In order to prevent enumeration attacks, they should always be combined with a salt function. PII in resource identifiers and their attributes [e.g. file names, uniform resource locators (URLs)] should be either avoided or appropriately anonymized.
Additional controls concerning the protection of PII in public clouds are given in ISO/IEC 27018. Additional information on de-identification techniques is available in ISO/IEC 20889.
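As a rough illustration of combining a hash with a salt, the sketch below uses a keyed hash (HMAC-SHA256) from Python’s standard library. Treating the salt as a secret key is one possible reading of the guidance, not the only one. The names `SECRET_SALT` and `hash_identifier` are hypothetical, and in practice the salt would be stored separately from the masked data.

```python
import hashlib
import hmac
import secrets

# Assumption: the salt is generated once and kept apart from the data set.
SECRET_SALT = secrets.token_bytes(32)


def hash_identifier(value: str, salt: bytes = SECRET_SALT) -> str:
    """Replace an identifier with a salted (keyed) SHA-256 digest."""
    normalized = value.strip().lower().encode("utf-8")
    return hmac.new(salt, normalized, hashlib.sha256).hexdigest()


token = hash_identifier("ali@example.com")
print(token)
```

Without the salt, an attacker could hash a list of likely email addresses and compare the results. With a secret salt, that enumeration is not possible unless the salt is also compromised.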
Data masking, which is also called data sanitization, keeps sensitive information private by making it unrecognizable but still usable. This lets developers, researchers and analysts use a data set without exposing the data to any risk. Data masking is different from encryption. Encrypted data can be decrypted and returned to its original state with the correct encryption key. With masked data, there is no algorithm to recover the original values. Masking generates a characteristically accurate but fictitious version of a data set that has zero value to hackers. It also cannot be reverse engineered, and statistical outputs cannot be used to identify individuals. As with data encryption, not every data field needs to be masked, although some fields must be completely hidden.
Organisations can approach data masking through two main techniques: pseudonymisation and/or anonymisation. Anonymisation is the process of removing personal identifiers, both direct and indirect, that may lead to an individual being identified. An individual may be directly identified from their name, address, postcode, telephone number, photograph or image, or some other unique personal characteristic. An individual may be indirectly identifiable when certain information is linked together with other sources of information, including their place of work, job title, salary, postcode or even the fact that they have a particular diagnosis or condition. Once data is truly anonymised, individuals are no longer identifiable.
Pseudonymisation is the processing of personal data in such a way that the data can no longer be attributed to a specific data subject without the use of additional information, as long as such additional information is kept separately and subject to technical and organizational measures to ensure non-attribution to an identified or identifiable individual. Both of these methods are designed to disguise the true identity of the data subject through disassociation, i.e. hiding the link between the raw data and the subject (usually a person). Organisations should take great care to ensure that no single piece of data compromises the subject’s identity. When using either of these techniques, organisations should consider:
- The level of pseudonymisation and/or anonymisation required, relative to the nature of the data.
- How the masked data is being accessed.
- Any binding agreements that restrict use of the data to be masked.
- Keeping the masked data separate from any other data types, in order to prevent the data subject being easily identified.
- Logging when the data was received, and how it has been provided to any internal or external sources.
Pseudonymisation and anonymisation aren’t the only methods available to organisations looking to mask PII or sensitive data. Other methods that can be used to bolster data security include:
- Key-based encryption.
- Voiding or deleting characters within the dataset.
- Varying numbers and dates.
- Replacing values across the data.
- Hash-based value masking.
Data masking is an important part of an organisation’s policy towards protecting PII and safeguarding the identity of the individuals whom it holds data on. As well as the above techniques, organisations should consider the below suggestions when strategising their approach to data masking:
- Implementing masking techniques that reveal only the minimum amount of data to anyone who uses it.
- ‘Obfuscating’ (hiding) certain pieces of data at the request of the subject, and only allowing certain members of staff to access the sections that are relevant to them.
- Building their data masking operation around specific legal and regulatory guidelines.
- Where pseudonymisation is implemented, keeping the algorithm that is used to ‘de-mask’ the data safe and secure.
Types of data masking
There are several types of data masking you can use, depending on your use case. Of the many, static and on-the-fly data masking are the most common.
1) Static data masking (SDM): Static data masking generally works on a copy of a production database. SDM alters the data so that it still looks realistic enough for development, testing and training, without revealing the actual data. The process goes like this:
- Take a backup or a golden copy of the production database to a different environment.
- Remove any unnecessary data, and mask the remainder while it is at rest.
- Save the masked copy to the desired location.
2) Dynamic data masking (DDM): DDM happens dynamically at run time and streams data directly from a production system so that masked data will not need to be saved in another database. It is primarily used for processing role-based security for applications, such as processing customer inquiries and handling medical records. Thus, DDM applies to read-only scenarios to prevent writing the masked data back to the production system.
3) Deterministic data masking: Deterministic data masking replaces a given value with the same masked value every time it occurs. For example, a first name may appear in a column in many tables across your databases. If you mask ‘Ali’ to ‘Helen’, it should appear as ‘Helen’ not only in the masked table but also in all associated tables. Whenever you run the masking, it gives the same result.
4) On-the-fly data masking: On-the-fly data masking occurs when data transfers from production environments to another environment, like test or development. On-the-fly data masking is ideal for organizations that:
- Deploy software continuously
- Have heavy integrations
Because it is challenging to keep a backup copy of masked data continuously, this process will send only a subset of masked data when needed.
5) Statistical data obfuscation: Production data can hold various statistical information, which statistical data obfuscation techniques can disguise. Differential privacy is one such technique: it lets you share information about patterns in a data set without revealing information about the actual individuals in the data set.
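Of the types above, deterministic masking is perhaps the easiest to sketch in a few lines. The example below derives an index into a lookup list from a keyed hash of the original value, so the same input always maps to the same replacement in every table. The key, the replacement list and the function name are assumptions made for illustration, not a reference implementation.

```python
import hashlib
import hmac

# Hypothetical secret and lookup list; both are assumptions for this sketch.
SECRET_KEY = b"replace-with-a-managed-secret"
REPLACEMENT_NAMES = ["Helen", "Omar", "Priya", "Lucas", "Mei", "Tariq"]


def mask_first_name(name: str, key: bytes = SECRET_KEY) -> str:
    """Map a name to the same replacement every time it is masked."""
    digest = hmac.new(key, name.lower().encode("utf-8"), hashlib.sha256).digest()
    index = int.from_bytes(digest[:8], "big") % len(REPLACEMENT_NAMES)
    return REPLACEMENT_NAMES[index]


customers = [{"id": 1, "first_name": "Ali"}]
orders = [{"order_id": 10, "first_name": "Ali"}]

masked_customers = [
    {**row, "first_name": mask_first_name(row["first_name"])} for row in customers
]
masked_orders = [
    {**row, "first_name": mask_first_name(row["first_name"])} for row in orders
]
print(masked_customers, masked_orders)
```

With such a short lookup list, different names will often collide on the same replacement. A larger list reduces collisions, but uniqueness is not guaranteed by this approach.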
Data masking techniques
A variety of data management techniques can be used to mask or anonymize PII and other private and sensitive data depending on the data type. These masking methods include the following:
- Scrambling: Scrambling randomly reorders alphanumeric characters to obscure the original content. For example, a customer complaint ticket number of 3429871 in a production environment could appear as 8293174 in a test environment after being scrambled. Although scrambling is easy to implement, it only works on certain types of data, and data obfuscated this way is less secure than data masked with other techniques.
- Substitution: This technique replaces the original data with another value from a supply of credible values. Lookup tables are often used to provide alternative values to the original, sensitive data. The values must pass rule constraints and preserve the original characteristics of the data. It is harder to apply substitution than scrambling, but it can be applied to several data types and provides good security. For example, credit card numbers can be substituted with numbers that pass card provider validation rules.
- Shuffling: Values within a column, such as user surnames, are shuffled to randomly reorder them. For example, if customer surnames are shuffled, the results look accurate but won’t reveal any personal information. However, it is essential that the shuffling masking algorithm is kept secure so it cannot be used to reverse-engineer the data masking process.
- Date aging: This method increases or decreases a date field by a specific date range. Again, the range value used must be kept secure.
- Variance: A variance is applied to a number or date field. This approach is often used for masking financial and transaction value and date information. The variance algorithm modifies each number or date in a column by a random percentage of its real value. For instance, a column of employees’ salaries could have a variance of plus or minus 5% applied to it. This would provide a reasonable disguise for the data while maintaining the range and distribution of salaries within existing limits.
- Masking out: Masking out only scrambles part of a value and is commonly applied to credit card numbers where only the last four digits remain visible.
- Nullifying: Nullifying replaces the real values in a data column with a null value, completely removing the data from view. Although this sort of deletion is simple to implement, the nullified column cannot be used in queries or analysis. As a result, it can degrade the integrity and quality of the data set for development and testing environments.
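Several of the techniques above can be sketched with Python’s standard library alone. The function names and sample values below are hypothetical. The seeded `random.Random` is used only to make the demonstration repeatable and would not be suitable for protecting real data. Note that `mask_out` here replaces the hidden digits with a fixed character rather than scrambling them, which is one common variant.

```python
import random
from datetime import date, timedelta
from typing import List, Optional


def scramble(value: str, rng: random.Random) -> str:
    """Scrambling: randomly reorder the characters of a value."""
    chars = list(value)
    rng.shuffle(chars)
    return "".join(chars)


def shuffle_column(values: List[str], rng: random.Random) -> List[str]:
    """Shuffling: reorder the values within a column."""
    shuffled = values[:]
    rng.shuffle(shuffled)
    return shuffled


def age_date(original: date, offset_days: int) -> date:
    """Date aging: shift a date by a secret number of days."""
    return original + timedelta(days=offset_days)


def apply_variance(amount: float, rng: random.Random, pct: float = 0.05) -> float:
    """Variance: change a number by a random percentage within +/- pct."""
    return round(amount * (1 + rng.uniform(-pct, pct)), 2)


def mask_out(card_number: str, visible: int = 4, mask_char: str = "X") -> str:
    """Masking out: hide all but the last `visible` digits."""
    digits = card_number.replace(" ", "").replace("-", "")
    return mask_char * (len(digits) - visible) + digits[-visible:]


def nullify(values: List[str]) -> List[Optional[str]]:
    """Nullifying: replace every value in a column with None."""
    return [None for _ in values]


rng = random.Random(42)  # fixed seed for a repeatable demo only
print(scramble("3429871", rng))
print(shuffle_column(["Khan", "Smith", "Okafor"], rng))
print(age_date(date(2020, 1, 1), 30))
print(apply_variance(1000.0, rng))
print(mask_out("4111 1111 1111 1234"))
print(nullify(["a", "b"]))
```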
Data masking best practices
1. Identify the sensitive data: Before masking any data, identify and catalog the:
- Sensitive data location(s)
- Authorized person(s) who can view them
- Their usage
Not every data element in a company needs masking. Instead, thoroughly identify the existing sensitive data in both production and non-production environments. Depending on the complexity of data and the organizational structure, this may require a significant amount of time.
2. Define your stack of data masking techniques: It is not practical for large organizations to use only a single masking tool across the entire enterprise since data varies greatly. Plus, the technique you choose may need to comply with specific internal security policies or meet budgetary requirements. In some cases, you may have to develop your own masking technique. So, consider all these factors to choose the right set of techniques. Keep them in sync to ensure the same type of data uses the same technique to preserve referential integrity.
3. Secure your data masking techniques: Masking techniques and associated data are as critical as the sensitive data itself. For example, the substitution technique can use a lookup file for substitution. If this lookup file falls into the wrong hands, it can be used to reveal the original data set. Organizations should establish the required guidelines to allow only authorized persons to access the masking algorithms.
4. Make masking repeatable: Over time, changes to an organization or a particular project or product can result in changes to the data. Avoid starting from square one each time. Instead, make masking a process that is repeatable, quick, and automatic, so you can rerun it whenever the sensitive data changes.
5. Define an end-to-end data masking process: Organizations must have an end-to-end process that includes:
- Identifying sensitive information
- Applying the appropriate data masking technique
- Continuously auditing to ensure data masking is working as expected
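The three steps above can be sketched as a small pipeline. This is a simplified illustration under stated assumptions: sensitive data is detected only by an email-like regular expression, the replacement values are placeholders, and all function names are hypothetical. A real process would cover many more data types and run the audit on a schedule.

```python
import re
from typing import Dict, List

Row = Dict[str, object]

# Assumption: emails are the only sensitive type this sketch looks for.
EMAIL_PATTERN = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")


def find_sensitive_columns(rows: List[Row]) -> List[str]:
    """Step 1: flag columns where any value looks like an email address."""
    flagged = set()
    for row in rows:
        for column, value in row.items():
            if EMAIL_PATTERN.search(str(value)):
                flagged.add(column)
    return sorted(flagged)


def mask_rows(rows: List[Row], columns: List[str]) -> List[Row]:
    """Step 2: replace values in flagged columns with placeholders."""
    return [
        {
            column: (f"user{i}@example.invalid" if column in columns else value)
            for column, value in row.items()
        }
        for i, row in enumerate(rows, start=1)
    ]


def audit(original: List[Row], masked: List[Row], columns: List[str]) -> bool:
    """Step 3: confirm no original value survives in a flagged column."""
    originals = {str(row[c]) for row in original for c in columns}
    return all(str(row[c]) not in originals for row in masked for c in columns)


rows = [
    {"id": 1, "contact": "ali@example.com"},
    {"id": 2, "contact": "helen@example.org"},
]
sensitive = find_sensitive_columns(rows)
masked = mask_rows(rows, sensitive)
print(sensitive, masked, audit(rows, masked, sensitive))
```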