
Designing for reliability and fidelity

A high degree of reliability and consistency will inspire confidence in your users and stakeholders. Ensure your product is ready to handle any errors in data processing or failures in your architecture.

Topic Contents

Preparing and cleaning data (e.g., Dataprep, Dataflow, and Cloud Data Fusion)
Monitoring and orchestration of data pipelines
Disaster recovery and fault tolerance
Making decisions related to ACID (atomicity, consistency, isolation, and durability) compliance and availability
Data validation


Preparing and cleaning data (e.g., Dataprep, Dataflow, and Cloud Data Fusion)

The first major task when developing data pipelines is to prepare and clean your data. GCP offers a set of key services that will aid this work.


Performing Data Preparation And Quality Control

Data preparation is one of the most important steps you perform as a data engineer. It is essential to effectively understanding, working with, and ultimately extracting value from your data. Data preparation can mean different things depending upon who you ask, but it usually involves collecting, cleaning, typing, and deduplicating your data. It can also mean validating your incoming data and enforcing strict data quality rules. These tasks are performed with various tools including Dataprep, Dataflow, Cloud Functions, Cloud Data Fusion, and Pub/Sub. Data preparation can also be done with Dataproc or in BigQuery using SQL statements. Because most of these technologies are covered in other sections, this section focuses on Dataprep and Cloud Data Fusion.

Dataprep

Dataprep by Trifacta is an intelligent data service for visually exploring, cleaning, and preparing structured and unstructured data for analysis, reporting, and machine learning. Its primary purpose is to build programmatic solutions, or "recipes," for cleaning, organizing, and profiling petabytes of data. Dataprep is a user-friendly, no-code, serverless data preparation solution. It is easy to use and lets you quickly develop solutions for data cleansing and preparation, share them, and operationalize them for use with almost any type of data. And because Dataprep utilizes built-in GCP services (such as Dataflow and BigQuery), all operations are performed within your own project and region, which means your security and egress policies are not compromised. As always, the better developed and organized your data lake is, the easier any downstream activity (such as data preparation, which is a form of data processing) will be. Dataprep's primary focus is to prepare data for future analyses; it is not a business intelligence tool.

Dataprep can perform cursory tasks such as anomaly detection, sampling, data profiling, common mathematical operations and statistical analyses, column renaming, error handling, and anomalous data handling and alerting. It can also perform transformations such as data enrichment and table joins. Dataprep pipelines are built as a series of steps known as a "recipe," which is essentially a DAG; a recipe step is an atomic operation performed upon your data. After you develop a recipe, it can even be exported as a true Dataflow job, giving you the ability to operationalize it within your own architecture.

Cloud Data Fusion

Cloud Data Fusion is a fully managed, cloud-native, enterprise data integration service for quickly building and managing data pipelines. Similar to Dataprep, it is a no-code, UI-based service designed with simplicity and scalability in mind. Cloud Data Fusion connects to a wide variety of data sources and provides transformation and pipelining capabilities for nearly all GCP data services, such as BigQuery, GCS, and Cloud Spanner. It includes a huge number of built-in transformers, connectors, and operators which can perform a wide variety of data operations; you can connect to almost any source and export data to almost any destination.

You can perform data preparation (or data wrangling) with Cloud Data Fusion by building essential operators into your data pipelines as cursory or transformation steps before beginning more complex enrichments or transformations. Following preparation, use Data Fusion to perform more complex ETL operations, including data enrichment and data lineage tracking, or develop complex operations using one of the 1,000+ available transformations. Cloud Data Fusion can also run Dataproc jobs natively if you want a no-code way to develop Spark pipelines, which greatly expands its capabilities.



Monitoring and orchestration of data pipelines

Orchestrating and monitoring your pipelines can ensure maintainability, quality, and consistency.


Pipeline Monitoring

Monitoring your solution in flight is essential to ensuring a high-quality solution. This includes checking common GCP sources such as Cloud Logging and Cloud Monitoring, as well as more advanced sources such as the Ops Agent, which collects metrics and logs from Compute Engine instances. Much like setting up an effective unit test suite, setting up an effective monitoring policy can ensure a high-quality solution and enable quick identification of, and reaction to, bugs in your processing infrastructure.

Monitoring your solution consists of monitoring both the health of the infrastructure and the success or failure of the tasks or functions you are attempting to execute. Setting up alerts which trigger based upon pre-defined metrics or events can make this task easier and take some of the load off your DevOps teams. It is also possible to export metrics in real time to a third-party reporting service, such as Grafana.

Use Cloud Monitoring to measure your pipeline performance across a number of metrics, including job information, bytes processed and written, billable operations such as shuffling, storage capacity, GPU and CPU usage, late-arriving messages, replayed messages, Pub/Sub messages processed, and others.

You can use Cloud Monitoring metrics to measure messages and bytes processed, watermarks, and distributions.
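
For example, here is a minimal sketch of reading a Dataflow metric with the Cloud Monitoring Python client. The project ID is a hypothetical placeholder, and the example assumes a streaming Dataflow job is exporting the system lag metric to Cloud Monitoring:

```python
# Read the last hour of a Dataflow job metric from Cloud Monitoring.
# The project ID below is a hypothetical placeholder.
import time

from google.cloud import monitoring_v3

client = monitoring_v3.MetricServiceClient()
project_name = "projects/my-project"

now = int(time.time())
interval = monitoring_v3.TimeInterval(
    {
        "start_time": {"seconds": now - 3600},
        "end_time": {"seconds": now},
    }
)

results = client.list_time_series(
    request={
        "name": project_name,
        # System lag is a useful proxy for a streaming pipeline falling behind.
        "filter": 'metric.type = "dataflow.googleapis.com/job/system_lag"',
        "interval": interval,
        "view": monitoring_v3.ListTimeSeriesRequest.TimeSeriesView.FULL,
    }
)

for series in results:
    for point in series.points:
        print(series.resource.labels.get("job_name"), point.value.double_value)
```

The same pattern works for other pipeline metrics; only the filter string changes.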



Disaster recovery and fault tolerance

No matter how well you build and engineer your solution, disasters, errors, and failures inevitably occur. You can mitigate the impact of these issues with careful planning and proper selection of the required components.


Building And Running Test Suites

Effective testing of developed code ensures that your solution is free of bugs, satisfies the requirements of the build, and delivers a great experience for consumers. This is just as important in the data space as it is in application development. As the cloud has evolved, it has become standard practice to ensure an effective DevOps strategy is present in your development pipeline.

Performing testing as part of your development strategy will reap benefits as you build a production-ready product. Learn to fully test new functions as you build them, ensuring that they can handle edge cases and faulty inputs. Writing effective unit tests enables rapid development while ensuring a high-quality product. A standard DevOps practice is to write tests before you write your code; this gives you guidance while developing and enables immediate feedback during testing phases.

GCP offers a number of tools for developing CI/CD solutions as part of your commit pipeline and overall DevOps strategy. As a data engineer, you can make your DevOps engineers' lives easier by performing your own local testing and branch management as part of your development lifecycle. By writing your own unit tests and acceptance tests, you can get ahead of the curve and ensure an effective solution is developed. This is also known as "shifting left," which means testing as close to the actual development as possible.
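
As a lightweight illustration, here is a minimal shift-left unit test sketch written with pytest. The `clean_record` helper is hypothetical, standing in for any data-preparation function a pipeline might call:

```python
# test_clean_record.py -- a minimal shift-left unit test sketch (pytest).
# `clean_record` is a hypothetical data-cleaning helper, not a GCP API.
import pytest


def clean_record(record: dict) -> dict:
    """Normalize a raw event record before it enters the pipeline."""
    if not record.get("user_id"):
        raise ValueError("user_id is required")
    return {
        "user_id": str(record["user_id"]).strip(),
        "amount": round(float(record.get("amount", 0.0)), 2),
    }


def test_clean_record_normalizes_fields():
    cleaned = clean_record({"user_id": " 42 ", "amount": "19.999"})
    assert cleaned == {"user_id": "42", "amount": 20.0}


def test_clean_record_rejects_missing_user_id():
    # Faulty input (an edge case) should fail loudly, not slip through silently.
    with pytest.raises(ValueError):
        clean_record({"amount": "10"})
```

Tests like these can then be executed automatically by your build pipeline on every commit.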

Cloud Build

Cloud Build is GCP's fully managed, cloud-native build solution, which can facilitate development for a wide variety of applications. As a serverless solution, there is no infrastructure to maintain and the service is scalable. Cloud Build can be integrated with most common source control services, such as GitHub. A common DevOps practice is to submit test suites to Cloud Build as part of your commit process: as the deployment pipeline runs, the tests are executed; if they pass, the code is deployed; otherwise, the code is rejected without causing any issues in the target environment.

Cloud Build executes build operations as a series of steps, each of which runs in a Docker container. This ensures a safe and secure environment during the build.

You can use Cloud Build to deploy to a number of GCP services:

  • Cloud Run
  • GKE
  • Cloud Functions
  • App Engine
  • Firebase
  • Compute Engine

It is also possible to use Cloud Build to deploy to a private on-prem Kubernetes cluster in a controlled environment. This can help satisfy data residency requirements or other security concerns.

Cloud Functions

Cloud Functions behave differently than other software products in the cloud. The infrastructure is generally ephemeral and lasts only for the life of the execution. Additionally, Cloud Functions require a web request or event to trigger, which presents unique challenges during testing. GCP has built a few solutions to aid local development for Cloud Functions. These enable you to test without actually going through a deployment process in GCP, which can be tedious or possibly disallowed depending upon your DevOps requirements.

The Functions Framework is essentially a Cloud Functions emulator which allows you to pass web requests directly to your local workstation. This is a good solution if you're comfortable with the overall environment parameters of Cloud Functions, and it should satisfy the vast majority of use cases.
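
As a sketch, here is a minimal HTTP-triggered function tested locally with the Python Functions Framework; the function name and payload fields are hypothetical:

```python
# main.py -- minimal HTTP function for local testing with the Functions Framework.
# Run locally (no GCP deployment needed), then send requests to http://localhost:8080:
#   pip install functions-framework
#   functions-framework --target=validate_event --port=8080
import functions_framework


@functions_framework.http
def validate_event(request):
    """Hypothetical handler: rejects payloads missing an event_id field."""
    payload = request.get_json(silent=True) or {}
    if "event_id" not in payload:
        return ("missing event_id", 400)
    return ("ok", 200)
```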

The Functions Emulator spins up a Docker container to execute your code in a truly simulated Cloud Functions environment. This is a useful tool if you need to go deeper into the actual Functions architecture and how it would execute your code in real time. It is also useful if you want to isolate the code from your own machine or other infrastructure.

Planning, Executing, And Stress Testing Data Recovery

Although it is not the ideal outcome, errors and job failures do occur. Their effects can be mitigated with a competent and thorough data recovery strategy coupled with GCP best practices for data recovery. Disaster recovery procedures usually attempt to fulfill two key objectives, which occasionally require trade-offs: RPO (Recovery Point Objective) and RTO (Recovery Time Objective). GCP offers many different methods for backing up your data, and the best choice depends upon your particular data stack, the infrastructure you utilize, and your requirements for data recovery. The requirements for disaster recovery are outlined in the Service Level Objective (SLO) and are application dependent.

RPO - Recovery Point Objective

RPO, or Recovery Point Objective, is how far back you want your data to be recoverable. For example, do you want your database backed up every day, every week, every hour? There is a trade-off, of course, between how much backup you need and how much data you want to maintain. Taking a snapshot of your Cloud SQL instance every day probably wouldn't be disruptive, but keeping the daily snapshots for weeks or months would become cost prohibitive and is unnecessary insurance against an extremely unlikely catastrophic loss. The actual method of meeting your RPO differs depending upon your current architecture and configuration, but most GCP services include some sort of built-in backup mechanism. This is an important metric because you need to balance the desire to recover as much healthy data as possible against the possibility of re-capturing the failed state or unhealthy/corrupted data.

RTO - Recovery Time Objective

RTO, or Recovery Time Objective, is how quickly you should be able to recover from a failure. This is usually measured in "wall clock time," or how much real-world time passes before recovery is complete. It is measured in terms of t, where t0 is the moment of failure; t + 1, t + 2, ..., t + n refer to a measure of time, such as the number of seconds, minutes, or hours it takes to recover the data. For example, if a Cloud SQL instance fails, how quickly can a new instance be brought up to take over its duties?

GCS

GCS is a very durable service, and data loss from a failure of infrastructure is effectively impossible. Outages of individual regions or zones are possible, however, so GCS gives you the option to create regional, dual-region, or multi-region buckets to provide the maximum degree of availability and durability. Cloud Storage is designed for 99.999999999% (11 nines) durability.
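
As a quick sketch (bucket names are hypothetical), a multi-region or predefined dual-region bucket can be created with the Cloud Storage Python client simply by specifying the location:

```python
# Create buckets in multi-region and predefined dual-region locations
# for higher availability. Bucket names here are hypothetical.
from google.cloud import storage

client = storage.Client()

# Multi-region bucket (data replicated across regions within the US).
client.create_bucket("my-backup-bucket-multi", location="US")

# Predefined dual-region bucket (Iowa + South Carolina).
client.create_bucket("my-backup-bucket-dual", location="NAM4")
```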

BigQuery

BigQuery is a fully managed, serverless, durable, and highly available solution. Likely the only event that would cause a true loss-of-service scenario with BigQuery would be an extreme and catastrophic disaster that physically destroys compute hardware across an entire GCP region, such as an extreme weather event, an earthquake, or a destructive wartime or terrorist event (such as a missile strike). Soft events, such as a regional power failure, would bring down the service but would not destroy your data.

BigQuery retains seven days of table history by default, and these prior table versions can be accessed with special query syntax (time travel). Table snapshots are the default method of data backup and recovery for BigQuery; a snapshot is an efficient delta backup, meaning that only data which has changed relative to the base table is stored. Making effective use of partitions in your tables can minimize any table rebuilds BigQuery must perform as a result of CRUD operations on your datasets.
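
For instance, here is a minimal sketch of querying a table as it looked an hour ago using the BigQuery Python client; the project, dataset, and table names are hypothetical:

```python
# Query a table's state one hour ago via BigQuery time travel.
# `my_project.my_dataset.orders` is a hypothetical table.
from google.cloud import bigquery

client = bigquery.Client()
sql = """
    SELECT *
    FROM `my_project.my_dataset.orders`
      FOR SYSTEM_TIME AS OF TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 1 HOUR)
"""
rows = client.query(sql).result()  # waits for the query to finish
print(f"Recovered {rows.total_rows} rows from one hour ago")
```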

By default, BigQuery makes two redundant copies of your data in two different zones within whichever region you build your dataset in. If your disaster planning requires even greater levels of durability and availability, or you cannot sacrifice any downtime in service, you can implement cross-region dataset copies, which the BigQuery Data Transfer Service can automate to another region. At that point, the only events where a total loss would be possible would be a comet striking the earth, the next ice age, or global nuclear war.

Another option for backup is to use GCS as a sink for BigQuery data. It is possible to do daily extracts of your tables as Avro or JSON files to a multi-region GCS bucket. This enables a relatively pain-free method of data recovery for BigQuery using external tables. Additionally, GCS can take advantage of Autoclass in order to minimize the cost of backed-up data. This level of redundancy is almost pointless, however, because BigQuery essentially does the same thing by default. You can be very confident that your data will never be lost from a BigQuery table.
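
A minimal sketch of such a daily extract with the BigQuery Python client (the table and bucket names are hypothetical):

```python
# Export a BigQuery table to GCS as Avro for archival/backup.
# Table and bucket names are hypothetical.
from google.cloud import bigquery

client = bigquery.Client()
job_config = bigquery.ExtractJobConfig(
    destination_format=bigquery.DestinationFormat.AVRO
)
extract_job = client.extract_table(
    "my_project.my_dataset.orders",
    "gs://my-backup-bucket-multi/orders/orders-*.avro",  # wildcard allows sharded output
    job_config=job_config,
)
extract_job.result()  # block until the export completes
```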

GCP Compute Engine - Persistent Disk

Compute Engine instances use persistent disks, essentially hard disks, to store your data. Depending upon your individual architecture, the disaster recovery method will differ. The most common method of data backup for Compute Engine is the disk snapshot, a point-in-time snapshot of the entire disk. In the event of failure, a disk snapshot can be loaded onto a new machine to perform recovery. This process is straightforward, offered natively by GCP, and quick to implement.

Managed Instance Groups, or MIGs, are used to automatically recover from failure scenarios and ensure that failures in a given machine do not result in the destruction of the application or any downtime in service. MIGs can be deployed to either one zone or multiple zones for greater levels of redundancy and availability. Using GCS as a data store or archive can also provide another source of redundancy.

GKE

Google Kubernetes Engine, or GKE, can be backed up using the Backup for GKE service. Backup for GKE is a separate add-on service which is attached to a GKE cluster and provides backup of the cluster's configuration and volumes. Backup for GKE uses backup plans to create backups and restore a cluster in the event of a failure.

DataProc

Dataproc stores its data and configuration in GCS, so a true backup for Dataproc is unnecessary. In the event of failure, you would simply rebuild the cluster using the same objects; Dataproc can do this natively.

Firestore

Firestore is highly available and durable by default, but it can also be backed up by configuring automatic backups to GCS. You can schedule automatic backups and restore Firestore data from them.

Recovering From Job Failures

The hardware and infrastructure recovery capabilities of GCP are very good, but job failures are also possible due to bugs, bad input data, unforeseen edge cases, or many other causes. Therefore, you should have a plan to address job failures as well. There are a few ways to accomplish this, but most involve best practices for data architecture, such as effective data partitioning, validation, and verification procedures.

Pub/Sub

Pub/Sub is a globally available and highly redundant message streaming and queuing service. Messages are produced by publishers and consumed by subscribers. When errors occur in a Pub/Sub topic, the cause is usually an error in message creation (such as a bad schema), an error in the network, or some material change in the subscriber (such as a failure to adapt to a schema change).

Pub/Sub has a few built-in features designed to handle and recover from failures; a configuration sketch follows the list below.

  • Subscription retry policy - If Pub/Sub attempts to deliver a message but the subscriber can't acknowledge it, Pub/Sub automatically tries to resend the message. There are two retry methods available in Pub/Sub.
    • Immediate redelivery - Pub/Sub tries to resend the message immediately if it is unacknowledged. If the subscriber continues to fail to acknowledge messages, Pub/Sub will keep resending them.
    • Exponential backoff - Exponential backoff adds progressively longer delays between retry attempts, increasing the wait after each failure. This helps throttle requests against the subscriber in order to prevent a large backlog of unacknowledged messages.
  • Dead-letter topic - A dead-letter topic handles erroneous messages emanating from a topic. Bad messages are pushed to a separate topic once a subscriber has rejected them, and from there the errors are handled by an alternative processor. Dead-letter topics improve solution quality by preventing erroneous messages from blocking processing, by giving visibility into the sources of bad messages, and by allowing error handling of the bad messages. A common source of erroneous messages is a bad message schema.
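
Here is a minimal sketch of creating a subscription with both an exponential-backoff retry policy and a dead-letter topic using the Pub/Sub Python client. The project, topic, and subscription IDs are hypothetical placeholders, and the dead-letter topic is assumed to already exist:

```python
# Create a subscription with exponential backoff retries and a dead-letter topic.
# All IDs below are hypothetical placeholders.
from google.cloud import pubsub_v1
from google.cloud.pubsub_v1.types import DeadLetterPolicy, RetryPolicy
from google.protobuf import duration_pb2

project_id = "my-project"
publisher = pubsub_v1.PublisherClient()
subscriber = pubsub_v1.SubscriberClient()

topic_path = publisher.topic_path(project_id, "events")
dead_letter_topic_path = publisher.topic_path(project_id, "events-dead-letter")
subscription_path = subscriber.subscription_path(project_id, "events-sub")

with subscriber:
    subscriber.create_subscription(
        request={
            "name": subscription_path,
            "topic": topic_path,
            # Retry unacknowledged messages with backoff between 10s and 600s.
            "retry_policy": RetryPolicy(
                minimum_backoff=duration_pb2.Duration(seconds=10),
                maximum_backoff=duration_pb2.Duration(seconds=600),
            ),
            # After 5 failed delivery attempts, route the message to the dead-letter topic.
            "dead_letter_policy": DeadLetterPolicy(
                dead_letter_topic=dead_letter_topic_path,
                max_delivery_attempts=5,
            ),
        }
    )
```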


Making decisions related to ACID (atomicity, consistency, isolation, and durability) compliance and availability

The cloud is highly performant and scalable. It achieves this through complex data replication schemes which can distribute data across multiple nodes, zones, and regions. This does, however, present certain technical challenges when trying to effectively manage object state, due to the CAP theorem. In the data world, we measure state consistency by whether data transactions are ACID compliant.


Choosing Between ACID, Idempotent, Eventually Consistent Requirements

Effective cloud data engineering often requires trade-offs when choosing between different storage and computing methods. A priority problem for any data engineering solution is determining how to balance the competing needs of speed versus data integrity. The more reliable and stateful you need your data to be, the longer it will take to ensure its integrity. This is especially true for distributed systems such as GCS, Bigtable, or Spanner. There is always some latency between the time a change is made on a primary or source system and the time it takes for that change to be reflected in a secondary system. This is known as eventual consistency and is a product of so-called tail latency.

These discrepancies are often very small and usually not an issue, but depending upon the nature of user interaction with your systems, they could be felt. Generally, creating new objects is easier on a distributed system than updating or deleting them. New objects are available instantly after creation, while updates or deletes can take milliseconds to seconds to finish propagating across multiple regions. This can be an issue if your system is trying to handle thousands of requests per minute. This is a well-known problem in the cloud computing industry, and each provider attempts to solve it in its own way.

There is no true concept of a primary key in cloud data warehouses, because a primary and foreign key relationship in the context of an RDBMS is ultimately enforced against data stored and indexed on standard block storage (a hard drive). Cloud data warehouses use distributed object storage, so the concept of a primary or foreign key is not enforced. Some cloud data warehouses allow you to "declare" a primary key or foreign key on a table, which can help with query planning, but we consider this a misnomer since it carries neither referential integrity constraints nor a distinct-key requirement; it is closer to a query hint than to a true primary key, and should be taken to mean the qualities of the data which positively identify a unique row within a table (such as an event timestamp) rather than a true physical primary key. Updates in distributed object storage are essentially object versioning: a change does not mutate an individual object in place, it rebuilds the metadata reference to point at the correct version of the object.

The CAP theorem tells us that it is impossible to mathematically guarantee both high read performance and guaranteed state management. Really, it's a metaphysical appropriation of Heisenberg's Uncertainty Principle, which states that one cannot assign exact simultaneous values to the position and momentum of a physical system. In other words, it is impossible to simultaneously issue a change to the state of an object at one location and propagate it instantaneously to an alternative copy of that object in another system. This is due not only to the processing capabilities of GCP, but also to physical laws, such as the speed of light.

The more extensive the transaction you are processing, the more time it will take to complete, but the more secure the transaction will be. For some systems this is absolutely essential, such as financial transaction systems, or health care data which has strict data privacy requirements and cannot tolerate many different versions of the data in existence. If your requirements demand strict object state management, then an RDBMS such as Cloud SQL would be required.

Distributed systems, such as Bigtable, can process millions of requests per second, but occasionally there can be issues with dirty object reads. This becomes a higher risk when using multi-region transactions. With that said, Google has come a very long way in the past few years with their terabit network. They now have dedicated cables which run across continents and oceans, which means that data does not have to travel over the public internet in order to propagate to secondary regions or systems. Not only is this faster, it is also more secure. Bigtable is a key-value data store which must update the entire object with each operation. Google Cloud is very fast at this, but it does create the opportunity for certain inefficiencies, such as hot-spotting.

BigQuery is tremendously good at large-scale (PB+) data reads and analytics, but it is not designed for transaction handling; it simply doesn't possess the architecture to handle highly dynamic table updates. It is possible to perform these transactions, but it will not be as efficient as an RDBMS for these types of operations. Again, there is no concept of a primary key in cloud data warehouses due to the distributed nature of the data. Logically, Google can ensure the success of a transaction, but it can't be guaranteed in the same way that it could be with an RDBMS system; this is simply a product of the nature of stateful transactions in the cloud. This is why BigQuery is referred to as an analytical data warehouse and not a relational database.

This trade-off between transactional speed and analytical speed is present throughout any data architecture. The better a given system is at handling the transactional state of individual objects, the less well-equipped it is to scale for analytics. For example, a Redis database operating at the edge provides very reliable, near-instantaneous reaction to events, but aggregation becomes impossible at scale. If you have an in-memory Redis database running at the edge of a multiplayer game system, you will get reliable and extremely quick responses to a huge number of in-game events, but what happens when you want to aggregate and examine trends across a dozen, a hundred, or even 100,000 games? Redis would not be a good choice for that and would be an extremely expensive and inefficient solution for the problem. BigQuery, however, would be a great solution for that type of problem and can quickly and cheaply examine data for millions of games and players.



Data validation

Beyond the technical challenges of a solution lies the challenge of ensuring high-quality data. This is commonly known as data validation and is a core competency of any data engineer.


Data Validation

Data processing infrastructure and applications should effectively report on the success or failure of the jobs they execute. Common ways to report information to users or applications are Cloud Logging and alerts, which can be set up to perform some action if a given message is detected in the log. Other no-code tools such as Cloud Data Fusion or Dataprep can send out alerts when a pipeline succeeds or fails, or if data fails to meet some condition. Built-in unit tests can be used to detect anomalies or errors in the data and submit reports to users or to Cloud Logging.

A great tool for monitoring and validating pipeline execution is Cloud Composer, GCP's DAG orchestration service. Cloud Composer can execute and monitor jobs from most GCP services, and can issue alerts and halt downstream tasks if an error occurs. Use Cloud Composer to test for the presence of received data as well as its quality.
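
A minimal sketch of such a validation task in a Cloud Composer (Airflow) DAG, using the BigQuery check operator from the Google provider package; the DAG ID, project, dataset, and table names are hypothetical:

```python
# A Cloud Composer (Airflow) DAG with a simple data-presence check.
# DAG ID and table names are hypothetical placeholders.
from datetime import datetime

from airflow import DAG
from airflow.providers.google.cloud.operators.bigquery import BigQueryCheckOperator

with DAG(
    dag_id="daily_data_validation",
    start_date=datetime(2023, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    # Fail the task (and halt downstream tasks) if no rows arrived today.
    check_rows_arrived = BigQueryCheckOperator(
        task_id="check_rows_arrived",
        sql="""
            SELECT COUNT(*) > 0
            FROM `my_project.my_dataset.events`
            WHERE DATE(event_timestamp) = CURRENT_DATE()
        """,
        use_legacy_sql=False,
    )
```

Downstream load or transform tasks would simply be declared after `check_rows_arrived`, so a failed check stops the rest of the pipeline.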

Most GCP services have native monitoring dashboards which should be utilized to monitor for any errors or failures in your processes. Setting up DAG dependencies effectively can ensure that a failure of a given component in a pipeline will not cause painful and hard to debug downstream effects.

The Data Validation Tool can help ensure successful data migrations for EDW migration projects. Use it when validating data warehouse migrations from one platform to another, e.g., SQL Server to BigQuery.

Verification And Monitoring

Data verification means ensuring that the data received is not corrupted, is secure, and is in the correct format. Data verification should be performed as close to the input as possible. For example, if your application receives input from a user, be sure to check this data against your database before completing the transaction; this ensures that any data received from the user represents a positive truth-state. Bad actors have attempted to bombard websites with false information in order to create arbitrary advantages to exploit. If you are running an MLOps pipeline, this is doubly important, because a bad actor, a faulty sensor, or bad communication equipment can create data drift away from the truth and have disastrous effects on your customers.

Cryptographic data verification is a process where you use cryptographic key-signing of data to ensure data integrity point-to-point. This is an especially useful practice when dealing with third-party application sources. All data in GCP is automatically protected with at-rest and in-transit encryption as standard; however, occasionally your requirements may call for something more. You can use Cloud KMS to sign and verify your data (for example, with keyed MACs) to ensure data integrity against both corruption and malicious activity. Cryptographic data verification is a high priority for financial transactions and health care clinical data.
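
As a rough sketch, MAC signing and verification with Cloud KMS might look like the following. The key path is a hypothetical placeholder, and this assumes an HMAC (MAC-purpose) key version already exists in Cloud KMS:

```python
# Sign data with a Cloud KMS HMAC key, then verify the MAC on receipt.
# The key path below is a hypothetical placeholder; an HMAC (MAC-purpose)
# key version must already exist in Cloud KMS.
from google.cloud import kms

client = kms.KeyManagementServiceClient()
key_version = client.crypto_key_version_path(
    "my-project", "us-central1", "my-key-ring", "my-hmac-key", "1"
)

data = b"clinical-record-12345"  # hypothetical payload

# Producer side: generate a MAC for the payload.
sign_response = client.mac_sign(request={"name": key_version, "data": data})

# Consumer side: verify the payload arrived unmodified.
verify_response = client.mac_verify(
    request={"name": key_version, "data": data, "mac": sign_response.mac}
)
print("MAC verified:", verify_response.success)
```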