Designing for security and compliance
A secure and legally compliant solution yields a high-quality product and inspires confidence in your stakeholders and users.
Topics include:
- Identity and Access Management (e.g., Cloud IAM and organization policies)
- Data security (encryption and key management)
- Privacy (e.g., personally identifiable information, and Cloud Data Loss Prevention API)
- Regional considerations (data sovereignty) for data access and storage
- Legal and regulatory compliance
GCP Professional Data Engineer Certification Preparation Guide (Nov 2023)
Identity and Access Management (e.g., Cloud IAM and organization policies)
Identity and Access Management (IAM) is GCP's central authorization service. It is used to grant access to GCP APIs and gives you a high degree of control over your resources while operating in the cloud.
Identity and Access Management (IAM)
Identity and Access Management (IAM) is the primary method of API authorization in Google Cloud Platform. It provides a central place to set permissions for users and service accounts across all GCP APIs and services. It is designed to be easy to configure, highly efficient, and uniform across services. IAM is often tied to other identity services, such as Google Cloud Identity, which can verify the identity of the user. This also allows integration of other components, such as Firebase, without additional layers of authentication.
How IAM Works
An individual user or a service account is known as a Principal, usually identified by an email address. Each principal can be granted one or more Roles. A Policy is a collection of role bindings attached to a resource; each binding associates a role with one or more principals. When a principal attempts to access a resource, GCP evaluates the applicable policies to verify that the principal holds a role which allows the requested activity. For example, an engineer may require read and write access to BigQuery, while an analyst may only require read permissions; this calls for two different role bindings, one for the engineer and one for the analyst.
IAM policies can be set at four levels, known as the resource hierarchy. Policies are inherited down the hierarchy and are additive: a role granted at a higher level cannot be revoked by a lower-level policy, so you should grant roles at the lowest level that satisfies the need. For example, an accounting principal who requires read access to BigQuery accounts receivable datasets, but not to other datasets, should be granted the BigQuery Data Viewer role on just those datasets at the resource level rather than a broad Viewer role at the organization level.
- Organizations are the root nodes in the hierarchy. For example, the company.
- Folders are children of the organization, or of another folder. This could be a department or other unit.
- Projects are children of the organization, or of a folder. There could be multiple projects per organization and unit.
- Resources for each service are descendants of projects. These are the individual APIs which provide access to components such as Cloud Storage, Compute Engine, or BigQuery.
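To make the inheritance model concrete, here is a toy sketch of how a principal's effective access resolves as the union of grants down the hierarchy. This is purely illustrative: the role strings follow GCP naming conventions, but the data structures are hypothetical stand-ins, not the real IAM API.

```python
# Toy model of additive IAM policy inheritance (illustrative only).

# The four levels of the resource hierarchy, from root to leaf.
HIERARCHY = ["organization", "folder", "project", "resource"]

# Roles granted to one hypothetical principal at each level.
grants = {
    "project": {"roles/bigquery.dataViewer"},    # inherited by all resources below
    "resource": {"roles/bigquery.dataEditor"},   # granted directly on one dataset
}

def effective_roles(grants):
    """Effective access is the UNION of grants down the hierarchy --
    a lower-level policy adds to, and cannot revoke, a higher-level grant."""
    roles = set()
    for level in HIERARCHY:
        roles |= grants.get(level, set())
    return roles

print(sorted(effective_roles(grants)))
# ['roles/bigquery.dataEditor', 'roles/bigquery.dataViewer']
```

Because grants only accumulate, the safest practice is to attach roles as far down the hierarchy as possible.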
IAM Predefined Roles
IAM provides a few predefined basic roles which apply to most services. For some organizations and projects these may be sufficient; for others, more fine-grained policies might be required. These roles can be granted project-wide, or attached only to individual services. Note that, at the organizational level, these roles control access to potentially thousands of GCP resources. You should always follow the principle of least privilege when designing and assigning policies.
- Viewer: Read only permissions to the given resource. (Read data from a BigQuery table)
- Editor: Viewer role + the ability to change an object state within a resource. (Read and modify data from a BigQuery table and create a new dataset)
- Owner: Viewer + Editor Roles + the ability to assign access to resources and, at the project level, set up and manage billing. (Read and modify data from a BigQuery table, create and delete datasets, and grant access to resources)
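As a sketch of what these grants look like in practice, an IAM policy's JSON bindings pair each role with a list of members. The principals and project name below are hypothetical placeholders:

```json
{
  "bindings": [
    {
      "role": "roles/viewer",
      "members": ["user:analyst@example.com"]
    },
    {
      "role": "roles/editor",
      "members": [
        "user:engineer@example.com",
        "serviceAccount:etl-job@my-project.iam.gserviceaccount.com"
      ]
    }
  ]
}
```

Policies of this shape are what you see and edit with `gcloud projects get-iam-policy` and `set-iam-policy`.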
Role Recommender
Role Recommender is a feature of IAM which can help you better manage principal authorization. Recommender uses machine learning to create Policy Insights for principals, which can identify excess permissions and help enforce the principle of least privilege.
Data security (encryption and key management)
GCP offers a robust data security architecture with all the tools you need to manage security. The built-in GCP security is sufficient for most use cases, but GCP also offers many encryption options and isolated services for workloads requiring a higher degree of security.
Data Security
Data security is a primary concern for any development inside GCP. GCP encrypts your data at rest and in transit by default, and offers several encryption configuration options which allow for highly modular and efficient security schemes.
Data Encryption
Data encryption in the cloud consists of three paradigms: encryption at rest, encryption in transit (aka in flight), and encryption in use. Each requires different hardware and algorithmic methods to properly ensure a secure environment. All data are automatically encrypted within GCP's infrastructure at rest and in flight; encryption in use requires Confidential Computing options such as Confidential VMs. The defaults are good enough for almost every use case except the most sensitive, such as government, health, or financial data.
Default GCP Encryption
While at rest, your data are automatically protected using the Advanced Encryption Standard (AES-256). This includes all data in GCS and on Compute Engine Persistent Disks used by VMs. It is not possible to store unencrypted data in GCP.
GCP automatically encrypts data in flight using the Transport Layer Security (TLS) standard for web traffic sent via APIs, and S/MIME encryption is used for email.
Customer Supplied Encryption Keys (CSEK)
Customer Supplied Encryption Keys (CSEK) are user-generated keys which are supplied to GCP APIs at the time of data transmission. The keys are held in memory only while in use, ensuring that these ephemeral keys do not persist past the transaction's execution. This provides maximum security, but it also carries a substantial risk: if the keys are lost, GCP has no way of recovering your data.
This method is typically used with Cloud Storage or with Compute Engine Persistent Disks (PDs), which therefore also covers Dataproc running on a Compute Engine cluster.
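As an illustration, gsutil can supply a CSEK for Cloud Storage operations through its `.boto` configuration file. The value shown is a placeholder, not a real key:

```ini
; .boto configuration used by gsutil; the value below is a placeholder --
; generate and supply your own base64-encoded 256-bit AES key.
[GSUtil]
encryption_key = <base64-encoded-AES-256-key>
```

With this set, gsutil sends the key with each request and GCS uses it transiently to encrypt or decrypt the object, without persisting the key.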
Key Management Service (Cloud KMS)
Cloud Key Management Service (Cloud KMS) is a secure method of managing your keys in GCP. KMS ensures complete audit logging and strict access control over your cloud-hosted keys. You can generate, use, rotate, and destroy AES-256, RSA 2048, RSA 3072, RSA 4096, EC P256, and EC P384 cryptographic keys.
KMS is called via an API and can be used to encrypt almost any data in GCP using a given key. This is great for encrypting data across services while using audit logging to determine exactly which services are accessing which data. For example, you can use the same key when encrypting incoming user credit card data in GCS, when processing that data in Dataproc, when storing it in BigQuery, and finally when querying it later. This effectively lets you control data access by limiting who can access the key used to decrypt the data. Using KMS audit logging, you can determine who (which principal) accessed the data and at what stage (which resource) the data were accessed, or where access was attempted but denied by policy. This is more convenient and more secure than trying to track user access across services. It is also possible to require that users supply Key Access Justifications when accessing keys.
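The pattern KMS enables here is often called envelope encryption: a per-object data encryption key (DEK) encrypts the payload locally, and the DEK itself is wrapped by a key-encryption key (KEK) that never leaves KMS. Below is a minimal stdlib-only sketch of the pattern; since Python's standard library has no AES, a toy SHA-256 keystream cipher stands in for it. Do not use this cipher in production; real systems call the Cloud KMS encrypt/decrypt API for the wrapping step.

```python
import hashlib
import secrets

def keystream_xor(key: bytes, data: bytes) -> bytes:
    """Toy stream cipher (SHA-256 counter keystream) standing in for AES.
    XOR-based, so applying it twice with the same key round-trips the data."""
    out = bytearray()
    counter = 0
    while len(out) < len(data):
        out += hashlib.sha256(key + counter.to_bytes(8, "big")).digest()
        counter += 1
    return bytes(b ^ k for b, k in zip(data, out))

# Envelope encryption: a fresh DEK encrypts the payload; the DEK itself is
# "wrapped" with a KEK. In GCP the KEK never leaves Cloud KMS -- wrapping
# and unwrapping happen via the KMS API, under IAM control and audit logging.
kek = secrets.token_bytes(32)                 # held by KMS in a real system
dek = secrets.token_bytes(32)                 # generated per object

plaintext = b"card=4111-xxxx-xxxx-1111"
ciphertext = keystream_xor(dek, plaintext)    # encrypt payload locally with DEK
wrapped_dek = keystream_xor(kek, dek)         # wrap DEK with KEK (KMS's job)

# To decrypt: unwrap the DEK with the KEK, then decrypt the payload.
recovered = keystream_xor(keystream_xor(kek, wrapped_dek), ciphertext)
assert recovered == plaintext
```

Only the small wrapped DEK is stored alongside the ciphertext; whoever controls access to the KEK controls access to the data.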
Hardware security modules (HSM)
Cloud HSM is a separate, physical cryptographic key cluster management suite managed through KMS. HSMs host FIPS 140-2 Level 3-validated cryptographic modules, which ensure the physical and mathematical security of your cryptographic key rings. HSMs are usually used only when highly specialized security requirements are present, such as with financial transactions or classified government workloads. Compromising an HSM would require physical access to the cluster, which adds an extra layer of data protection.
Cloud External Key Manager (Cloud EKM)
Cloud External Key Manager (Cloud EKM) allows you to use an external key manager, if desired. It will use the standard EKM API to call the external service in order to access the keys. EKM ensures security of your key access by communicating with the external service through a VPC or securely over the internet, and additionally the API calls allow for justifications to be supplied, if required.
Privacy (e.g., personally identifiable information, and Cloud Data Loss Prevention API)
Data privacy has become a hot-button issue around the world, and with the amount of personal information circulating on the web it is a primary concern for any architect and engineer. Use GCP's Data Loss Prevention API to quickly and efficiently prevent accidental data leakage.
Ensuring Privacy
In addition to security, data privacy has become an essential consideration for every organization. GCP offers Sensitive Data Protection which can be configured to identify and handle sensitive data in GCP. Sensitive Data Protection can be used to automatically scan through BigQuery datasets to identify, classify, mask, or report on sensitive data in GCP. SDP takes advantage of the Data Loss Prevention API to programmatically manage your datasets within GCP. This can greatly ease regulatory compliance and ensure efficient reporting on loss prevention strategies and maintenance. SDP can be used with all your current BigQuery datasets and can be configured to automatically scan incoming datasets and tables as well.
Data Loss Prevention
Data Loss Prevention, or Cloud DLP, is a tool used to discover and classify sensitive data across your organization's resources. DLP works by utilizing so-called infoTypes, detectors which identify sensitive data points. DLP scans through your data and compares data points against the infoTypes. Once identified, the data can be automatically masked or de-identified while still preserving an effective key for joining or aggregating data. You can also provide custom infoTypes if you have more specialized needs.
DLP can also be used to prevent re-identification, which is the practice of identifying a particular user by combining alternative identifiers, such as an email address, phone number, or street address. DLP can provide a risk report showing the probability of re-identification of users in your data. Use this information to formulate a plan to further de-identify your data.
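Conceptually, an infoType is just a detector for one class of sensitive data. The sketch below imitates the idea with simplified regular expressions; the patterns and function names are hypothetical stand-ins, not the DLP API, and real built-in detectors such as EMAIL_ADDRESS are far more robust.

```python
import re

# Toy "infoType" detectors: regex stand-ins for Cloud DLP's built-in
# detectors (patterns deliberately simplified for illustration).
INFO_TYPES = {
    "EMAIL_ADDRESS": re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+"),
    "US_PHONE_NUMBER": re.compile(r"\b\d{3}-\d{3}-\d{4}\b"),
}

def inspect(text: str):
    """Return (infoType, match) findings, in the spirit of a DLP inspect job."""
    findings = []
    for name, pattern in INFO_TYPES.items():
        for m in pattern.finditer(text):
            findings.append((name, m.group()))
    return findings

def deidentify(text: str) -> str:
    """Replace each finding with its infoType name, similar to DLP's
    replace-with-infoType de-identification transform."""
    for name, pattern in INFO_TYPES.items():
        text = pattern.sub(f"[{name}]", text)
    return text

record = "Contact jane@example.com or 555-123-4567"
print(inspect(record))
print(deidentify(record))   # Contact [EMAIL_ADDRESS] or [US_PHONE_NUMBER]
```

The two-step shape (inspect to find, then transform to de-identify) mirrors how DLP jobs are typically run against BigQuery or Cloud Storage data.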
BigQuery Dynamic Data Masking
BigQuery can perform dynamic data masking to automatically protect and hide sensitive data in the cloud based upon a set of predefined rules organized into taxonomies. Taxonomies consist of policy tags which are applied at the table column level and tell BigQuery how to mask the data. This is useful if you want to mask the data differently for different types of users. For example, if both analysts and administrators work with the same data, you could create a policy tag which masks the data for analysts but reveals it for administrators. This provides an efficient and streamlined solution for protecting data in BigQuery while maintaining source data integrity. BigQuery automatically logs access to tables protected by a policy tag in Cloud Logging, so you can set up alerts if certain fields are accessed.
Data masking operates by applying rules to the data. If multiple rules are applied to a table they are applied according to the BigQuery rule hierarchy.
| Rule | Hierarchy level | Effect |
|---|---|---|
| Custom masking routine | 1 | Returns the column's value after applying a user-defined function (UDF) to the column. |
| Hash (SHA-256) | 2 | Returns the column's value after it has been run through the SHA-256 hash function. Use this if you need to perform joins on the column. |
| Email mask | 3 | Returns the column's value after replacing the username of a valid email with XXXXX. |
| Last four characters | 4 | Returns the last 4 characters of the column's value, replacing the rest of the string with XXXXX. |
| First four characters | 5 | Returns the first 4 characters of the column's value, replacing the rest of the string with XXXXX. |
| Date year mask | 6 | Returns the column's value after truncating the value to its year, setting all non-year parts of the value to the beginning of the year. |
| Default masking value | 7 | Returns a default masking value for the column based on the column's data type. Use this when you want to hide the value of the column but reveal the data type. |
| Nullify | 8 | Returns NULL instead of the column value. |
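To illustrate what a few of these rules do to a value, here is a plain-Python approximation. The real masking is applied server-side by BigQuery via policy tags; these helper functions are hypothetical illustrations only.

```python
import hashlib

def email_mask(value: str) -> str:
    """Email mask: replace the username of a valid email with XXXXX."""
    user, _, domain = value.partition("@")
    return f"XXXXX@{domain}" if domain else value

def last_four(value: str) -> str:
    """Last four characters: keep the last 4, mask the rest with XXXXX."""
    return "XXXXX" + value[-4:] if len(value) > 4 else value

def sha256_mask(value: str) -> str:
    """Hash (SHA-256): deterministic, so masked columns can still be joined."""
    return hashlib.sha256(value.encode()).hexdigest()

def nullify(value):
    """Nullify: return NULL (None) instead of the value."""
    return None

print(email_mask("jane.doe@example.com"))   # XXXXX@example.com
print(last_four("4111111111111111"))        # XXXXX1111
assert sha256_mask("a") == sha256_mask("a") # stable across rows -> joinable
```

Note the trade-off the hierarchy encodes: SHA-256 preserves joinability but leaks equality, while Nullify reveals nothing at all.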
Regional considerations (data sovereignty) for data access and storage
The cloud stores data across regions and zones which can sometimes cause regulatory conflicts and must be accounted for when developing a solution. As an organization grows and expands overseas or even across regions, such as US States or EU member countries, data sovereignty becomes a higher priority.
Data Sovereignty
Data sovereignty is a legal framework which determines the laws and regulations that apply to data when it is processed, viewed, collected, or stored within a certain geographical boundary. Some definitions limit the concept to data collected or produced within a jurisdiction's boundaries, but in practice data must comply with local regulations whenever it enters or leaves state or national boundaries.
In this light, data can be treated as a commodity, similar to raw commodities such as petroleum, corn, or soybeans: its storage, handling, processing, and transport are subject to a complex, interdependent system of laws and regulations. Although far more abstract than physical goods, data represents real and tangible information about a consumer or organization and should be treated with the utmost care.
As a Data Engineer, you should be relatively familiar with the most common laws which govern your organization's ecosystem and data. Cloud technologies are built from the ground up to be distributed across various zones or regions, and using GCP native technologies, you must understand how to limit the geographical distribution of data to comply with laws while still taking advantage of the durability and availability of distributed storage and processing technologies.
Encryption and privacy laws and standards can differ depending on which US state or country you operate in. For example, BigQuery datasets can have regional considerations, such as compliance with the GDPR in Europe or California's CPRA. One jurisdiction might prohibit the collection or storage of a certain data point, such as a user's health care data, while another is more lax; even if the data are collected where it is allowed, storing, accessing, or processing them within California's geographical boundaries could still violate California law. This means you may have to set BigQuery's regional dataset storage and processing locations individually for certain types of data.
Legal and regulatory compliance
The legal and regulatory framework governing cloud architecture, development, and operation is becoming quite complicated. There are often specialized sets of requirements for the different data stored and used within your organization. Be aware of which laws and regulations apply to your application or data and which GCP tools can be used to ensure compliance.
The regulations outlined below represent the most common ones you might encounter on the exam; this is not an exhaustive list. Be sure to consult legal counsel for up-to-date requirements for individual use cases and scenarios.
Legal Compliance
Since the cloud has become ubiquitous and data collection has begun growing exponentially, many jurisdictions have introduced various levels of legal requirements for data security and privacy. GCP offers the Compliance Resource Center, which gives organizations detailed information on GCP's licenses, certifications, documentation, and audit information to verify GCP's compliance with legal standards.
When developing your own solutions, it is essential to pay attention to the local jurisdictions in which you operate as well as the applicable laws. Failure to do so can result in fees, fines, or operational shutdowns until compliance is ensured. DLP can be used to identify PII, and custom infoTypes can be used to ensure compliance for your particular organization and/or project. Some of the major compliance requirements are detailed below. While GCP can provide some guidelines for complying with the law, it is essential that your legal team have an effective plan in place for compliance with applicable laws and regulations. Most of these regulations fall under what's known as "shared responsibility", where both Google and the customer have responsibilities under the law to ensure compliance.
For the purposes of the exam you don't need to be a lawyer and know the regulations front to back, but be sure to be able to recognize when some of these common laws may apply. In the real world organizations are investing billions of dollars to update their data protection standards to comply with the new laws.
This can become a complicated process if your organization operates across multiple jurisdictions which might have divergent requirements. It is up to you to ensure that your operations are within regulations when operating in any given region.
Health Insurance Portability and Accountability Act (HIPAA) and Health Information Technology for Economic and Clinical Health Act (HITECH)
HIPAA and HITECH compliance is meant for businesses which work with health insurance or health care data, known as Protected Health Information (PHI). These acts are designed to protect consumers' data from unwanted manipulation, resale, or distribution, and to help limit damage if an organization's data is breached. They apply to US companies only, and you must determine whether you are a Covered Entity and would therefore require a Business Associate Agreement with Google to ensure compliance.
Children's Online Privacy Protection Act (COPPA)
COPPA is a very important legal requirement for basically any website which interacts with the public, and it is especially vital for companies which interact with social media or chat. The Children’s Online Privacy Protection Act of 1998 (COPPA) is a U.S. regulation applicable to the collection of personal information from children under the age of 13. COPPA imposes certain requirements on operators of websites or online services directed to children under 13 years of age, and on operators of other websites or online services that have actual knowledge that they are collecting personal information online from a child under 13 years of age.
FedRAMP
FedRAMP is a standardized methodology for ensuring compliance among cloud operators when building solutions for the Federal Government and its associated agencies. FedRAMP is essentially a rating system used by Federal Agencies to determine the risk associated with a given cloud service and cloud provider. If your business produces a service for a Federal Agency, or plans to, then you should use FedRAMP as a guide for ensuring a smooth onboarding of services.
General Data Protection Regulation (GDPR)
GDPR was created to protect the consumer data of EU citizens. GDPR violations can become quite costly, so be sure to understand the laws where you are operating. Some data governance standards in the US may not apply in the EU, or vice versa. Google Cloud will provide extensive assistance to your organization to ensure compliance with GDPR and other regulatory standards. You should work with Google experts who can evaluate your organization and make compliance recommendations as required.
CCPA and CPRA
California's CCPA and CPRA are some of the most restrictive data privacy laws in the United States. They outline strict guidelines for data collected, stored, accessed, or processed within California's geographical boundaries. Most companies choose to operate across the entire US rather than attempting to exclude all of California's consumers (which might be technically or legally impossible to do), and therefore use CCPA and CPRA as a legal baseline when building their policies for operating within the US as a whole.