
Designing for flexibility and portability

Account for an ever-evolving product and data schema in your solution. Embrace multi-cloud and hybrid-cloud architectures, and tie everything together with an efficient data catalog.


Topic Contents

Mapping current and future business requirements to the architecture
Designing for data and application portability (e.g., multi-cloud and data residency requirements)
Data staging, cataloging, and discovery (data governance)


Mapping current and future business requirements to the architecture

To be an effective data engineer you must know how to translate business requirements into technical solutions. This means not only correctly identifying and building for current requirements, but also ensuring that future requirements can be satisfied without difficulty.


Mapping To Current And Future Business Requirements

No one can predict the future, of course. However, you can prepare for a wide range of possible outcomes by giving your data room to evolve along with the components of your architecture. You can set yourself and your organization up for success by applying best practices and solid engineering fundamentals. This also means choosing the right tool for the job. Understanding the nuanced advantages and disadvantages of the various components within GCP, and how best to implement them, will ensure that your solution is robust and efficient.

Business requirements reflect the goals of the organization, development schedules, and budget. You should balance the speed, cost, and robustness of your solution. A certain architecture may be faster or more robust, yet still fail to meet the immediate cost constraints of the organization. Likewise, some web applications may be fine with a document model, while others require the strict transaction handling of an RDBMS. If your expected workload is processing 20 or 30 MB per transaction, it would be overkill to use something like Dataproc when a simple Cloud Function could process the data far more efficiently. When looking at a security solution, you should know when a situation requires a customer-managed encryption key (CMEK) and when Google-managed encryption is sufficient.

For the purposes of the exam you will often be given clear requirements but ambiguous or closely related answers. This requires a process of elimination and a balancing act to determine why a given technology would or would not be the right choice. While there are often two or more close answers that might satisfy the requirement, there is only one answer that optimizes your solution for the given set of requirements.
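To make the sizing tradeoff concrete, here is a minimal sketch of a 2nd-gen Cloud Function that processes a small object dropped into Cloud Storage. The bucket layout, the "order_id" field, and the trigger wiring are assumptions for illustration, not a prescribed implementation.

    # Minimal sketch: a 2nd-gen Cloud Function triggered by a Cloud Storage
    # "object finalized" event (wired up via an Eventarc trigger at deploy time).
    # The bucket contents and the "order_id" field are hypothetical.
    import json

    import functions_framework
    from google.cloud import storage


    @functions_framework.cloud_event
    def process_upload(cloud_event):
        data = cloud_event.data                      # GCS event payload
        bucket_name, blob_name = data["bucket"], data["name"]

        blob = storage.Client().bucket(bucket_name).blob(blob_name)
        records = json.loads(blob.download_as_text())

        # A 20-30 MB payload is handled comfortably in a single invocation;
        # additional uploads simply fan out to more function instances.
        cleaned = [r for r in records if r.get("order_id")]
        print(f"processed {len(cleaned)} records from {blob_name}")

A Dataproc cluster solving the same small per-object problem would spend most of its cost on idle infrastructure.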

Resizing And Autoscaling Resources

A key advantage of the cloud is the easy, near-infinite scalability of its infrastructure and resources. Properly right-sizing cloud infrastructure can save you time and money as well as improve the experience for users. Nearly all resources in GCP are scalable and resizable in some way. As always, the more you rely on managed resources in GCP, the less infrastructure management you will have to contend with and the less overhead is required.

Building and Operationalizing Processing Infrastructure

Data processing infrastructure in GCP is built upon the common GCP execution infrastructure, deployment tools, and management and logging tools. These include Compute Engine VMs, Cloud (formerly Stackdriver) Logging, Cloud Build, Google Kubernetes Engine, and Dataproc. These tools offer varying degrees of control over the configuration of each service.

Compute Engine

Virtual Machines

GCP virtual machines power many of GCP's services, and users often take advantage of the flexibility and high configurability of Compute Engine to build custom data processing applications. The serverless, scalable managed services are usually great solutions for data processing in GCP, but there are scenarios where using Compute Engine is advisable. This is usually when a user requires a high degree of security and data shielding, such as handling PII for compliance or healthcare analytics where privacy is absolutely paramount.

There is, of course, a tradeoff between configurability and maintenance. The more highly configured your machine and node setup becomes, the more time you will have to spend maintaining and monitoring it. In general, and especially for the exam, it is best to use fully managed services unless there is a good reason why you can't.

Common Configurations

GCP has some common machine types suited to certain workloads. Compute Engine is highly configurable, and the different machine families are better at different workloads.

  • General purpose workloads (C3, E2, N2, N2D, N1)
  • Ultra-high memory (M2, M1)
  • Compute-intensive workloads (C2, C2D)
  • Most demanding applications and workloads (A2 with NVIDIA GPUs)

Preemptible VMs are much cheaper than standard VMs, but they carry the risk of being terminated or restarted at any time. This is acceptable if the workload is fault-tolerant and can adapt to an interrupted work stream. Batch processing applications can usually be set up to take advantage of preemptible VMs. Preemptible VMs can also be configured with GPUs to process batch machine learning workloads.
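As a sketch of how a fault-tolerant batch worker might be provisioned programmatically, the google-cloud-compute client can mark an instance preemptible through its scheduling options. The project, zone, machine type, and boot image below are placeholder assumptions.

    # Sketch: create a preemptible Compute Engine VM for batch work.
    # PROJECT, ZONE, names, and the boot image are placeholder assumptions.
    from google.cloud import compute_v1

    PROJECT, ZONE = "my-project", "us-central1-a"

    instance = compute_v1.Instance(
        name="batch-worker-1",
        machine_type=f"zones/{ZONE}/machineTypes/e2-standard-4",
        # Preemptible capacity is far cheaper but can be reclaimed at any time,
        # so it only suits workloads that tolerate interruption.
        scheduling=compute_v1.Scheduling(preemptible=True, automatic_restart=False),
        disks=[
            compute_v1.AttachedDisk(
                boot=True,
                auto_delete=True,
                initialize_params=compute_v1.AttachedDiskInitializeParams(
                    source_image="projects/debian-cloud/global/images/family/debian-12"
                ),
            )
        ],
        network_interfaces=[compute_v1.NetworkInterface(network="global/networks/default")],
    )

    operation = compute_v1.InstancesClient().insert(
        project=PROJECT, zone=ZONE, instance_resource=instance
    )
    operation.result()  # block until creation finishes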

Compute Engine was built from the start to be highly flexible, redundant, and scalable. The underlying technology enabling and governing GCP auto-scaling is agnostic of the specific workload running on the hardware. Whether you are running a Dataproc cluster, a GKE cluster, a web application, or a custom solution, Compute Engine instances auto-scale the same way: with a load balancer and a managed instance group. Auto-scaling of compute resources is typically tied to a tangible metric captured via Cloud Monitoring, such as CPU utilization, network traffic (e.g., HTTP load-balancing serving capacity), disk utilization, or, as is often the case with Spark on Dataproc, memory pressure. When a threshold is hit, the instance group scales out according to the parameters you set when configuring the autoscaling policy. It is also common practice to implement scale-down (scale-in) policies to release resources when they are no longer heavily utilized. One benefit of using an auto-scaling group is that you can effectively resize your VMs without any service disruptions.
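Below is a hedged sketch of such a policy using the google-cloud-compute client, attaching an autoscaler to a hypothetical managed instance group named batch-mig and targeting 60% average CPU utilization. The names and thresholds are illustrative assumptions.

    # Sketch: attach a CPU-based autoscaler to an existing managed instance group.
    # PROJECT, ZONE, group name, and thresholds are placeholder assumptions.
    from google.cloud import compute_v1

    PROJECT, ZONE = "my-project", "us-central1-a"

    autoscaler = compute_v1.Autoscaler(
        name="batch-mig-autoscaler",
        target=f"projects/{PROJECT}/zones/{ZONE}/instanceGroupManagers/batch-mig",
        autoscaling_policy=compute_v1.AutoscalingPolicy(
            min_num_replicas=2,          # floor for scale-in
            max_num_replicas=10,         # ceiling for scale-out
            cool_down_period_sec=90,     # let new VMs warm up before re-evaluating
            cpu_utilization=compute_v1.AutoscalingPolicyCpuUtilization(
                utilization_target=0.6   # scale out when average CPU exceeds 60%
            ),
        ),
    )

    compute_v1.AutoscalersClient().insert(
        project=PROJECT, zone=ZONE, autoscaler_resource=autoscaler
    ).result()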

In addition to the number of machines in your MIG (managed instance group), you may want to adjust only individual components. This could be the case if you are operating a database on a custom instance or running a single-node Spark cluster. It is possible to adjust the size of a persistent disk while the machine is running without causing much issue. Be sure to create a snapshot of the disk volume first as a backup; if you need to resize or rebuild an instance, you can use the snapshot to create a new VM.
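Here is a sketch of that snapshot-then-resize pattern with the google-cloud-compute client. The disk name, sizes, and project details are assumptions, and note that persistent disks can only grow, never shrink.

    # Sketch: back up a persistent disk with a snapshot, then grow it in place.
    # PROJECT, ZONE, disk name, and target size are placeholder assumptions.
    from google.cloud import compute_v1

    PROJECT, ZONE, DISK = "my-project", "us-central1-a", "data-disk"
    disks = compute_v1.DisksClient()

    # 1. Snapshot first, as a backup and as a template for rebuilding the VM.
    disks.create_snapshot(
        project=PROJECT, zone=ZONE, disk=DISK,
        snapshot_resource=compute_v1.Snapshot(name=f"{DISK}-pre-resize"),
    ).result()

    # 2. Grow the disk while the instance keeps running (then extend the
    #    filesystem inside the guest OS).
    disks.resize(
        project=PROJECT, zone=ZONE, disk=DISK,
        disks_resize_request_resource=compute_v1.DisksResizeRequest(size_gb=500),
    ).result()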

Sole Tenancy and Confidential Computing

Compute Engine can be made highly secure in a few ways. Sole tenancy means that your Compute Engine instances run on a physical server hosting only your instances. Not only is this a highly secure setup, it can also improve processing consistency and performance by colocating all needed computing resources. This is by far the most secure compute configuration you can have within GCP. You can view the individual VM-to-host mappings, giving you full control over the entire compute environment, and you can even isolate individual workloads within individual projects if desired.

One use case is to encrypt your data on premises, move it over an SSH tunnel to a sole-tenant Compute Engine node, perform the computations you need, then retrieve the encrypted results from the VM and decrypt them on-prem again.

Confidential Computing is built upon the Trusted Execution Environment (TEE) promoted by the Confidential Computing Consortium, which is dedicated to ensuring data integrity, privacy, and security in computing environments. This is another great option when you want a higher degree of security while computing in GCP but do not need a policy as strict as sole tenancy. It is sufficient for all but the most demanding security requirements.

Shielded VMs are another security feature: they provide verifiable integrity for your Compute Engine VMs (via Secure Boot, a virtual TPM, and integrity monitoring), helping ensure that no malware or rootkits have been installed on your VM.

GKE - Google Kubernetes Engine

GKE, or Google Kubernetes Engine, is GCP's fully managed Kubernetes cluster management suite. GKE covers the whole range of options, from the Standard mode, where Google manages the control plane and you configure the node pools, to the fully managed Autopilot mode, which also handles node pool configuration for you. GKE is highly flexible and can be used as a general-purpose solution for almost any workload, from hosting web applications to hosting Cloud Composer instances. GKE nodes are highly configurable and can be tuned to favor compute- or memory-intensive operations.

GKE can be used as a highly scalable data processor and is well suited to preemptible batch workloads using Spot Pods. GKE can also be configured with GPUs for AI/ML workloads.

GKE offers an impressive array of security features designed to maximize the security of your cluster and pods. As a fully managed service, Google will manage all updates to the underlying hardware and ensure OS security.

If you want to operate Kubernetes on on-prem racks while still being connected to GCP, you can use Google Distributed Cloud Virtual (GKE Enterprise on-premises), which allows you to run Google-managed Kubernetes clusters on VMware or bare-metal servers. This is useful for organizations with strict privacy or security requirements, or those that need to perform large calculations closer to their operations center. Once deployed, the cluster automatically connects to GCP with the Connect Agent.

GKE has cluster autoscaling built in natively, and it is almost a requirement for any GKE implementation. Note that GKE handles cluster autoscaling differently than Compute Engine does: managed instance group autoscaling should be disabled for GKE node pools in favor of the GKE cluster autoscaler, although the core operations are similar. GKE effectively monitors node utilization and adds or removes nodes as required. GKE can also resize your node pools if required.

Cloud Dataproc

Dataproc Node Configurations

Cloud Dataproc is GCP's fully managed Hadoop and Spark cluster management service. You can configure the hardware at the node level, the number of nodes, and an alternative configuration for the master node if desired. Clusters can be built on either Compute Engine or GKE, though the Compute Engine-based clusters tend to be the most stable and the most interoperable with other services. As a fully managed service, Google manages the underlying infrastructure, OS, and any installed packages.

You can specify a scheduled deletion time, the Cloud Storage bucket to use, and whether to make the cluster sole-tenant for added security. Clusters can also be stopped on demand, so you can save costs on deactivated infrastructure. This is useful if you only run daily batch processing applications on Dataproc.

In addition to the compute hardware, you will also have to provide a Cloud Storage bucket for Dataproc to use for essential metadata and as an output bucket for Spark data.

Dataproc has built-in autoscaling, which must be defined as an autoscaling policy before it can be attached to a cluster. The policy monitors Hadoop YARN metrics, primarily pending versus available YARN memory, to decide when to add or remove workers, and the resulting scaling activity is visible in Cloud Monitoring. In other words, Dataproc autoscaling is driven by memory pressure rather than by worker core utilization. If you use Dataproc Serverless for Spark, autoscaling is handled for you with just a few simple configurations at setup time.
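Below is a hedged sketch of creating such a policy with the google-cloud-dataproc client; the project, region, worker bounds, and scaling factors are assumptions chosen for illustration.

    # Sketch: define a Dataproc autoscaling policy driven by YARN memory metrics.
    # PROJECT, REGION, and the numeric bounds are placeholder assumptions.
    from google.cloud import dataproc_v1
    from google.protobuf import duration_pb2

    PROJECT, REGION = "my-project", "us-central1"

    client = dataproc_v1.AutoscalingPolicyServiceClient(
        client_options={"api_endpoint": f"{REGION}-dataproc.googleapis.com:443"}
    )

    policy = dataproc_v1.AutoscalingPolicy(
        id="spark-batch-policy",
        basic_algorithm=dataproc_v1.BasicAutoscalingAlgorithm(
            cooldown_period=duration_pb2.Duration(seconds=120),
            yarn_config=dataproc_v1.BasicYarnAutoscalingConfig(
                graceful_decommission_timeout=duration_pb2.Duration(seconds=600),
                scale_up_factor=0.5,    # add a fraction of the workers needed to clear pending memory
                scale_down_factor=0.5,  # remove workers gradually when memory frees up
            ),
        ),
        worker_config=dataproc_v1.InstanceGroupAutoscalingPolicyConfig(
            min_instances=2, max_instances=20
        ),
    )

    client.create_autoscaling_policy(
        parent=f"projects/{PROJECT}/regions/{REGION}", policy=policy
    )

The policy can then be referenced from the cluster's autoscaling configuration when the cluster is created or updated.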

Initialization Actions

Dataproc allows initialization actions to be run upon cluster startup; this is often used to install native JAR support for Spark packages or other dependencies on every node.
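Here is a sketch of how an initialization action might be declared when creating a cluster through the google-cloud-dataproc client; the script path, machine types, and cluster name are assumptions.

    # Sketch: create a Dataproc cluster that runs a startup script on every node.
    # The gs:// script path, names, and machine types are placeholder assumptions.
    from google.cloud import dataproc_v1

    PROJECT, REGION, ZONE = "my-project", "us-central1", "us-central1-a"

    cluster = dataproc_v1.Cluster(
        cluster_name="etl-cluster",
        config=dataproc_v1.ClusterConfig(
            gce_cluster_config=dataproc_v1.GceClusterConfig(zone_uri=ZONE),
            master_config=dataproc_v1.InstanceGroupConfig(
                num_instances=1, machine_type_uri="n2-standard-4"
            ),
            worker_config=dataproc_v1.InstanceGroupConfig(
                num_instances=2, machine_type_uri="n2-standard-4"
            ),
            initialization_actions=[
                dataproc_v1.NodeInitializationAction(
                    executable_file="gs://my-config-bucket/scripts/install-spark-deps.sh"
                )
            ],
        ),
    )

    dataproc_v1.ClusterControllerClient(
        client_options={"api_endpoint": f"{REGION}-dataproc.googleapis.com:443"}
    ).create_cluster(project_id=PROJECT, region=REGION, cluster=cluster)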

Dataproc Add-ons and the Component Gateway

Dataproc offers a number of optional components to make the cluster more usable, such as Jupyter Notebooks, HBase, Flink, Docker, and others, with their web interfaces exposed through the Component Gateway. Jupyter Notebooks are useful because they give you an interactive PySpark session against the cluster without having to submit an application.

Cloud Functions

Cloud Functions are designed to be fully serverless and highly scalable. Autoscaling is assumed for Cloud Functions and is generally not configurable, but you can cap the maximum number of instances available for auto-scaling (up to 1,000 for 2nd-gen functions). It is important to balance your requirements here to prevent runaway scaling.



Designing for data and application portability (e.g., multi-cloud and data residency requirements)

Data and application portability refers to an organization's ability to operate across various geographical and legal boundaries, which can often become blurred in the cloud. A good plan for interacting with different cloud vendors, as well as geographically dispersed data, is essential.


Designing For Data And Application Portability

The cloud is constantly evolving across different services and architectures. Occasionally, you might find that the data you need to access lives on another provider's service, such as Amazon S3. Being able to develop a plan to work with this data is essential. That includes not only processing the data in S3 with a tool like BigQuery (via BigQuery Omni), but also ensuring that your data meets strict requirements for residency and internet communications. This is especially true if you're attempting to transfer data overseas or between sensitive systems.

The Google Cloud APIs allow you to interact with different architectures and components via standardized, managed methods. Every component in Google Cloud can be operated on programmatically via one of Google Cloud's robust and efficient APIs. You can use the APIs and their methods to securely operate on GCP components including BigQuery, Dataflow, Cloud Storage, and others. These APIs can be called from many different languages, such as Python, using easily created and highly secure service accounts. Use APIs to build interactivity between services within GCP, as well as to build efficient connections to GCP from outside the platform.
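For example, here is a short sketch of calling the BigQuery API from Python with a service account key; the key file path, project, dataset, and table names are assumptions.

    # Sketch: query BigQuery from Python using service account credentials.
    # The key file path, project, dataset, and table are placeholder assumptions.
    from google.cloud import bigquery
    from google.oauth2 import service_account

    creds = service_account.Credentials.from_service_account_file("sa-key.json")
    client = bigquery.Client(project="my-project", credentials=creds)

    query = """
        SELECT order_id, total
        FROM `my-project.sales.orders`
        WHERE order_date = CURRENT_DATE()
    """
    for row in client.query(query).result():
        print(row["order_id"], row["total"])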

Multicloud and Hybrid Cloud Architectures

Multicloud and hybrid-cloud architectures are becoming common. These include systems that interact with external sources running on AWS, Azure, or an on-prem environment. There are a few networking concepts you must negotiate when operating in a multicloud environment. GCP's Virtual Private Cloud, or VPC, is a logically isolated network that encapsulates the network connections within your project. New projects start with a default VPC, which makes it easy to build interconnected cloud resources. You don't need to know every detail of VPCs for the exam, but understanding these rules gives you insight into how GCP operates at a fundamental level.

Cloud Interconnect

When you connect to GCP from an external source (such as a hybrid or multicloud environment), you must reach the components by passing through the VPC. For the most part this is an invisible process for your applications: you simply pass credentials and connection details along with the API client. Cloud Interconnect offers a much more reliable and secure connection between your hybrid-cloud sources and GCP. It gives your components internal IP addresses, which means GCP treats your on-prem components as if they were part of your VPC. One main benefit of Cloud Interconnect is that your data does not have to traverse the public internet, making your connection more secure, reliable, and performant. Cloud Interconnect may be necessary to satisfy privacy and security requirements for sensitive workloads.

There are three types of Cloud Interconnect:

  • Dedicated Interconnect - Provides a direct physical connection between your on-premises network and the Google network.
  • Partner Interconnect - Provides connectivity between your on-premises and VPC networks through a supported service provider.
  • Cross-Cloud Interconnect - Provides a direct physical connection between your network in another cloud and the Google network.

HA VPN

Cloud Interconnect resources can be combined with HA VPN, which encrypts traffic sent over Cloud Interconnect and provides 99.99% availability for your connection. A dedicated connection with HA VPN is as close to operating within GCP as you can get in a hybrid environment.

Working With Mobile and IOT Applications

Mobile and IoT data have the power to transform how we interact with our products and with the world. Mobile applications help people buy food and other goods, book services, and search the web. The emergence of tools such as ChatGPT has made mobile applications nearly as powerful as, if not more powerful than, desktop applications. These applications produce millions of transactions a day from millions of people around the country and around the world. Those transactions produce a huge amount of data every day, and users expect lightning-fast response times, constant connectivity, and reliable service.

GCP offers powerful backend services that can power your mobile and desktop applications, including App Engine, Firebase, Cloud SQL, and Memorystore for Redis. In addition to the services themselves, GCP also offers the power of its petabit-scale, globally connected network. You can provide customers with an excellent experience by using cutting-edge technologies at the edge to interact with them and by taking advantage of GCP's powerful data tools to process transactions.

Tools like Firestore are mobile-first but can certainly power desktop applications as well, and many application developers now produce so-called adaptive sites that use one endpoint to power multiple devices. This can centralize and standardize your processing and connections, making transactions more secure and more performant. Firestore can handle 500 requests per second, which translates into millions of requests per day, and can power your websites and applications through real-time updates and event listeners.
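Here is a sketch of a real-time listener with the Python Firestore client; the "orders" collection and the callback behavior are assumptions for illustration.

    # Sketch: react to document changes in a hypothetical "orders" collection
    # using Firestore's real-time snapshot listeners.
    from google.cloud import firestore

    db = firestore.Client()

    def on_orders_changed(col_snapshot, changes, read_time):
        # Invoked whenever documents in the collection are added, modified, or removed.
        for change in changes:
            print(f"{change.type.name}: {change.document.id}")

    # Keep a reference to the watch so it is not garbage collected;
    # call watch.unsubscribe() to stop listening.
    watch = db.collection("orders").on_snapshot(on_orders_changed)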

Memorystore for Redis can give you high-performance, in-memory data access (such as caching and session handling) closer to your customers while still giving you the flexibility of operating your data processing applications in a given GCP region.
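Memorystore instances are reached over the VPC with a standard Redis client. Below is a minimal cache-aside sketch; the instance IP, key naming, and the Cloud SQL lookup helper are assumptions.

    # Sketch: cache-aside pattern against a Memorystore for Redis instance.
    # The host IP, key schema, and load_product_from_cloud_sql() are hypothetical.
    import redis

    cache = redis.Redis(host="10.0.0.3", port=6379)  # private IP inside the VPC

    def get_product(product_id: str) -> bytes:
        key = f"product:{product_id}"
        cached = cache.get(key)
        if cached is not None:
            return cached                       # served from memory, no DB round trip
        value = load_product_from_cloud_sql(product_id)  # hypothetical helper
        cache.set(key, value, ex=300)           # keep it warm for five minutes
        return value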

Understanding the right tool for the right job is the cornerstone of being a data engineer. You need to balance the needs of the customer with the requirements of the business, and provide an architecture that can evolve over time. A fundamental understanding of the tools and APIs offered on GCP is a great start, but you must also understand how all of these tools interact with each other to produce a whole greater than the sum of its parts.

Anthos And Multi-Cloud Architectures

Anthos is GCP's solution for managing complex multi-cloud environments, especially multi-cloud GKE cluster deployments. It manages this through namespace unification and what GCP calls "sameness" across GKE clusters in different clouds. Anthos is not very common on the exam, as it is used more for application deployments than for data processing, but it is useful to understand because you will occasionally see it in practice.

Distributed Cloud Edge

GCP's Distributed Cloud Edge is an on-premises GKE deployment option that is built and maintained by Google Cloud. It is a special-case solution for when a cloud-deployed GKE cluster is not sufficient, for example data processing tasks that are geographically constrained by privacy laws or security requirements. If data must stay on-prem, Distributed Cloud Edge gives you connectivity to Google Cloud without having to deploy to the cloud. If data latency is an issue, Distributed Cloud Edge can help alleviate it.

Some potential issues with Distributed Cloud Edge include limited processing capacity (because you're no longer operating in the cloud), certain workload restrictions, and the lack of common Anthos features such as Anthos Service Mesh.

Overall, this is a rare setup for highly specialized workloads that must operate as close to the business as possible. You might see it where data must be encrypted with customer-managed keys before moving to the cloud for further processing or archiving.



Data staging, cataloging, and discovery (data governance)

You should plan how to catalog your data and ensure it is easily accessible to users. This is called data cataloging, and it becomes increasingly important as your datasets grow in size and complexity.


Data Staging, Cataloging, And Discovery

Data staging, cataloging, and discovery form the essential foundation of your data architecture. It is great to have a powerful data ingestion engine, processor, and machine learning solution, but if no one knows how to find your data or how to use it, it isn't going to accomplish much as a data solution. Being able to identify your data sources, as well as the content within the data, is just as important as the data itself. In order to build a proper cataloging system it is essential to have a well-organized importing methodology for your data lake. You must be systematic with your data ingestion or you risk building a disorganized data swamp, which diminishes the value of your data and leads to inefficiency and technical debt.

Data Staging

Different organizations take different approaches to building their data lakes and staging areas, but a few common technologies are used for data staging, including Cloud Storage, BigQuery, and Bigtable. Additionally, the data development process commonly produces and consumes data from the same technologies. Tools such as Spark running on Dataproc will use Cloud Storage as both the source and the destination for data processing tasks. This means that proper cataloging and a highly organized storage methodology are vital to maintaining an efficient operation.

Data staging is the step after the ingest, wrangling, and preparation stages, and the last step before data processing begins. Staged data is cleaned, typed, highly organized, and cataloged, which facilitates data processing and further modeling. The staging layer is usually where data are confirmed to be ready for modeling through data accounting, quality controls, and validations. Any privacy or regulatory concerns are addressed at the staging layer as well, through data masking, encryption, or column- and row-level access controls.

Pushing logic to the left is a vital principle in data engineering: if a given operation can be performed earlier in a pipeline, it should be. Try to move as much processing of the data as close to the source as possible. For example, if you have data that needs to be parsed, cleaned, and typed before being processed and modeled, it would be inefficient to perform all of those actions in the modeling layer. These activities should be performed in the wrangling stage, against the broadest set of data for which the operation is relevant. This clear delineation of duties is not only smart from a computational-resources perspective, it is also smart from a security and legal standpoint. Data should be legally compliant before it reaches the modeling stage. This is doubly true if you are pushing data to the edge.
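As a sketch of pushing that logic left in a Spark wrangling job, the parsing, typing, and PII masking all happen before anything lands in the staging zone. The bucket paths, column names, and masking choice are assumptions for illustration.

    # Sketch: clean, type, and mask data during wrangling, before it is staged.
    # Bucket paths and column names are placeholder assumptions.
    from pyspark.sql import SparkSession, functions as F

    spark = SparkSession.builder.appName("staging-prep").getOrCreate()

    raw = spark.read.json("gs://my-raw-zone/orders/2023-11-01/*.json")

    staged = (
        raw.where(F.col("order_id").isNotNull())                   # drop malformed rows early
           .withColumn("order_ts", F.to_timestamp("order_ts"))     # enforce types up front
           .withColumn("order_date", F.to_date("order_ts"))
           .withColumn("email", F.sha2(F.col("email"), 256))       # mask PII before staging
    )

    # The modeling layer now reads data that is already typed and compliant.
    staged.write.mode("overwrite").partitionBy("order_date").parquet(
        "gs://my-staging-zone/orders/"
    )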

Data Cataloging

Data cataloging is a systematic method for organizing your data, making it easy to understand and access. A proper data catalog is the backbone of almost all data processing and feature engineering tasks you will perform on GCP. You can use it not only to look up what data exists within your system, but also to ensure your data complies with regulations, security requirements, and access rights. A data catalog can also help bridge the gap between teams by providing a central repository where all members of an organization can quickly and easily find the data they need.

A good data catalog begins with a well-organized data lake. The better organized and more consistently structured your data lake is, the easier your data catalog will be to build and the more powerful it will become. Use properly delimited storage paths to clearly demarcate your data assets within your storage hierarchy; this enables data cataloging tools to make sense of your data easily and get you up and running quickly. Emerging AI tools will further help the cataloging process by automatically indexing incoming data against known sources to provide a richer and more impactful search service.

Dataplex

Dataplex is a data fabric that unifies distributed data and automates data management and governance for that data. You can use Dataplex to automate a wide variety of data cataloging techniques and jump-start your efforts toward building a domain-driven data mesh, an architectural approach that helps business users make sense of the data products produced by members of your organization.

Dataplex is a powerful architecture that acts as an overseer of your data plane. It automates search and colocates data automatically according to user demand and likely data-relevancy statistics by leveraging Google's own AI products. These tools help to better organize and collate your data products so that you can spend more time developing advanced data products (such as data science) and less time managing the data products you already have.

GCP Data Catalog

GCP Data Catalog is a fully managed data cataloging solution designed to make sense of organizational data. Data Catalog can help you organize your data and make it searchable and discoverable by your enterprise users. It uses AI to automatically tag incoming datasets and apply given rules to data assets. Additionally, Data Catalog can apply column-level tags for column-level security in BigQuery, and it can help with privacy by automatically flagging sensitive or high-value data.

Data Catalog works by applying tags and tag templates to data assets, such as BigQuery datasets or Pub/Sub topics, which are then translated into metadata elements. These tags can capture either business or technical metadata, making it easy for your users to find the data they need or enabling further processing based on the tag values. For example, you could tag a few tables as "sales"; whenever a user needs sales data, they can search for "sales" in Data Catalog and see the relevant datasets. Similarly, you could safely store data such as customer SSNs by applying a PII tag to the column and using that tag to automatically limit access to the PII column to necessary parties only.
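Here is a sketch of that tag workflow with the Python Data Catalog client, creating a hypothetical "business_metadata" template and attaching a "sales" tag to a BigQuery table. The project, location, dataset, table, and field names are assumptions.

    # Sketch: create a tag template and tag a BigQuery table's catalog entry.
    # Project, location, dataset, table, and field names are placeholder assumptions.
    from google.cloud import datacatalog_v1

    PROJECT = "my-project"
    client = datacatalog_v1.DataCatalogClient()

    # 1. Define a reusable tag template with a single string field.
    template = datacatalog_v1.TagTemplate()
    template.display_name = "Business metadata"
    template.fields["domain"] = datacatalog_v1.TagTemplateField()
    template.fields["domain"].display_name = "Business domain"
    template.fields["domain"].type_.primitive_type = (
        datacatalog_v1.FieldType.PrimitiveType.STRING
    )
    template = client.create_tag_template(
        parent=f"projects/{PROJECT}/locations/us-central1",
        tag_template_id="business_metadata",
        tag_template=template,
    )

    # 2. Look up the table's catalog entry and attach a "sales" tag to it.
    entry = client.lookup_entry(
        request={
            "linked_resource": (
                f"//bigquery.googleapis.com/projects/{PROJECT}"
                "/datasets/sales/tables/orders"
            )
        }
    )
    tag = datacatalog_v1.Tag(template=template.name)
    tag.fields["domain"] = datacatalog_v1.TagField(string_value="sales")
    client.create_tag(parent=entry.name, tag=tag)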

Data Catalog is a part of Dataplex and therefore takes advantage of the powerful technology behind it. Data Catalog can work with the following assets:

  • Analytics Hub linked datasets
  • BigQuery datasets, tables, models, routines, and connections
  • Dataplex lakes, zones, tables, and filesets
  • Pub/Sub topics
  • (Preview): Cloud Bigtable instances and tables
  • (Preview): Cloud Spanner instances, databases, tables, and views
  • (Preview): Dataproc Metastore services, databases, and tables
  • (Preview): Vertex AI Models and Datasets