

Designing data migrations

A key skill for any professional data engineer is migrating data from any source into GCP. Take advantage of native GCP services to ensure a consistent and efficient migration.


Topic Contents

  • Analyzing current stakeholder needs, users, processes, and technologies and creating a plan to get to desired state
  • Planning migration to Google Cloud (e.g., BigQuery Data Transfer Service, Database Migration Service, Transfer Appliance, Google Cloud networking, Datastream)
  • Designing the migration validation strategy
  • Designing the project, dataset, and table architecture to ensure proper data governance


Analyzing current stakeholder needs, users, processes, and technologies and creating a plan to get to desired state

A plan for building your working solution is a core competency for a data engineer. Depending on how deep you need to go, this can be an exhaustive process in which you work through the entire current-state architecture and map it to the desired future state.


Introduction

Migrating from an on-prem data warehouse solution to BigQuery is a common and essential skill for any professional data engineer. The process can be very complex and require a large and diverse array of tools and techniques in order to successfully deliver a solution. Luckily, GCP has a few native processes and suggested strategies that can make the job easier.

Most customers have well defined reasons for migrating which can affect the migration strategy. Here are some common reasons that a customer may choose to migrate to BigQuery:

  • The data has become too large to query effectively or manage properly in the host RDBMS.
  • As part of an overall strategy to move to GCP across the enterprise/division.
  • Easier access to other GCP tools, such as Dataproc or Looker.
  • Security concerns from operating legacy hardware.
  • Scalability issues with on-prem hardware.
  • As part of a hybrid cloud solution across alternate cloud providers.
  • As part of an AI facilitation strategy to easily integrate with Vertex AI, SparkML, or BigQueryML.

You will often notice that a customer will not attempt a migration to BigQuery until their on-prem DW is already experiencing performance issues. This, of course, can make the process much more difficult. Therefore, it is vital to have well established tools in place that can handle potential faults and rebuilds in the data pipelines.

Operating in the cloud requires a different way of thinking about technical problems than operating on-prem. The execution methodology, massive parallelization, security hierarchies, and so on are fundamentally different in GCP than in an on-prem system. It is very rare for any resource requirement to map 1:1 from on-prem to GCP. Be prepared to rebuild every component from scratch, or even rethink the architecture altogether, in order to take advantage of GCP's execution paradigm. You may find that the use cases your customer desires can be rebuilt much more efficiently in the cloud than in an on-prem solution.

In addition to the technical challenges of a project like this, you will often find yourself engaged in relationship management with the customer as well. Be sure to have a well-developed migration strategy that is agreed upon by all stakeholders who utilize the EDW. Key relationships should be defined and maintained throughout the migration process.

Mapping To Current And Future Business Requirements

No one can predict the future, of course. However, you can prepare for a wide range of possible outcomes by giving your data room to evolve along with the components of your architecture. You can set yourself and your organization up for success by applying best practices and solid engineering fundamentals. This also means choosing the right tool for the right job: understanding the nuanced advantages and disadvantages of the various components within GCP, and how best to implement them, will ensure that your solution is robust and efficient.

Business requirements reflect the goals of the organization, development schedules, and budget. You should balance the speed, cost, and robustness of your solution. While a certain architecture may be faster or more robust, it may not meet the immediate cost needs of the organization. Some web applications may be fine with a document model, while others may require the strict transaction handling of an RDBMS. If your expected workload is processing 20 or 30 MB per transaction, it would be overkill to use something like Dataproc when a simple Cloud Function could process the data much more efficiently. When you are looking at a security solution, you should know when a situation requires customer-managed encryption keys (CMEK) and when Google-managed encryption would be sufficient. For the purposes of the test you will often be given clear requirements but ambiguous or close answers. This requires a process of elimination and a balancing act to determine why a given technology would or would not be the right answer. While there are often two or more close answers that might satisfy the requirement, only one answer optimizes your solution for the given set of requirements.



Planning migration to Google Cloud (e.g., BigQuery Data Transfer Service, Database Migration Service, Transfer Appliance, Google Cloud networking, Datastream)

Learn how to leverage GCP technologies to effectively migrate data and applications from a current-state architecture to GCP's core services.


Technical Requirements Gathering

The first step in developing a migration strategy is to gather a complete list of all technical tasks required, in order to get a big-picture understanding of the current architecture. This begins with taking an inventory of the source system, including all database objects and scripts. Then perform a deep-dive analysis on all components to determine what must be migrated and what can be discarded. The better you know the data and host architecture you're working with, the more effective your strategy will be.

Data Processing Translation

In practice, script rewrites usually take the most time because all stored procedures must be rebuilt in standard SQL or be completely refactored in an alternate execution paradigm (such as a DAG in Cloud Composer or a model in dbt). Although there are many "SQL Translation Services" out there, in practice these rarely work 100% and it's often best to consider completely rebuilding the scripts to take advantage of BigQuery's execution paradigm and native DAG architecture. Many customers will be coming from an RDBMS data warehouse environment where the SQL is not designed to take advantage of columnar storage architectures.
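
As a rough illustration, a stored procedure that builds a daily aggregate might be refactored into a single Cloud Composer (Airflow) task that runs a BigQuery query job. This is a minimal sketch only; the DAG name, schedule, and dataset and table names are hypothetical.

```python
# A minimal Cloud Composer (Airflow) sketch: one stored-procedure step rebuilt
# as a scheduled BigQuery query job. Dataset/table names are hypothetical.
from datetime import datetime

from airflow import DAG
from airflow.providers.google.cloud.operators.bigquery import (
    BigQueryInsertJobOperator,
)

with DAG(
    dag_id="daily_sales_rollup",
    schedule_interval="@daily",
    start_date=datetime(2023, 11, 1),
    catchup=False,
) as dag:
    # Replaces an on-prem stored procedure that aggregated daily sales.
    build_daily_sales = BigQueryInsertJobOperator(
        task_id="build_daily_sales",
        configuration={
            "query": {
                "query": """
                    CREATE OR REPLACE TABLE analytics.daily_sales AS
                    SELECT order_date, SUM(amount) AS total_amount
                    FROM staging.orders
                    GROUP BY order_date
                """,
                "useLegacySql": False,
            }
        },
    )
```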

BigQuery does not enforce primary and foreign keys; cloud data warehouses generally have no enforced key constraints due to the distributed nature of the data. This means that strict referential integrity constraints are not possible in the cloud. Keys can still exist logically, meaning that you can maintain logical keys which link parent and child tables and perform checks on them, but that is not the same as enforced referential integrity constraints in an RDBMS. This usually isn't a breaking change and is fine for most warehouses, but it is something to be aware of if those relationships are enforced in the on-prem system.
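
For example, a logical parent/child relationship can still be checked with a query that looks for orphaned child rows. The sketch below uses the BigQuery Python client with hypothetical project, dataset, and column names.

```python
# A hedged example of a logical referential-integrity check in BigQuery:
# count child rows whose foreign key has no matching parent row.
from google.cloud import bigquery

client = bigquery.Client()

orphan_check_sql = """
    SELECT COUNT(*) AS orphan_rows
    FROM `my_project.sales.order_items` AS child
    LEFT JOIN `my_project.sales.orders` AS parent
      ON child.order_id = parent.order_id
    WHERE parent.order_id IS NULL
"""

orphans = next(iter(client.query(orphan_check_sql).result())).orphan_rows
if orphans:
    print(f"Referential check failed: {orphans} orphaned order_items rows")
```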

Watch out for "SELECT *" statements when translating scripts. These select every column in a table or result set, which is bad practice with columnar storage; if you see this in any scripts you are rebuilding, you will want to optimize them. BigQuery's engine (Dremel) has gotten much better at handling these statements and tries to determine which columns are actually required to produce the result set. That said, it is still preferable to list the columns you need explicitly unless you know you will need every column in the table.
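
One way to make the cost of a SELECT * concrete is a dry run, which reports the bytes that would be scanned without actually running the query. This sketch assumes a hypothetical events table and compares the two forms.

```python
# A small sketch using BigQuery dry runs to compare bytes scanned by a
# SELECT * against an explicit column list (table name is hypothetical).
from google.cloud import bigquery

client = bigquery.Client()
dry_run = bigquery.QueryJobConfig(dry_run=True, use_query_cache=False)

select_star = "SELECT * FROM `my_project.analytics.events`"
select_cols = "SELECT event_id, event_ts FROM `my_project.analytics.events`"

for label, sql in [("SELECT *", select_star), ("explicit columns", select_cols)]:
    job = client.query(sql, job_config=dry_run)
    print(f"{label}: {job.total_bytes_processed} bytes would be scanned")
```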

If the original on-prem environment was using proprietary or third-party software to build and manage its data pipelines, then you will need to migrate that system to GCP as well. Check whether the third-party software exists in the Google Cloud Marketplace, or consider deploying the software in Kubernetes. If you find that migrating the pipeline software is untenable, then you may have to rebuild the scripts using a tool such as Cloud Composer. It may be difficult, if not impossible, to easily track data movements between source and destination datasets, so validation of the data and testing of the processes become the keys to proving viability.

Dependency analysis can become a challenge if there are a large number of complex dependencies. It is essential to properly map out the dataflows and dependencies within the data warehouse when developing your strategy. It is sometimes useful to "work backwards" when assessing dependencies: rather than starting with the base tables and trying to identify every reference to them, begin with the end-state tables and work backwards through the dependency layers until you reach the base tables. This exercise is useful not only for identifying dependencies in scripts, but also for identifying potential areas of logical consolidation in the scripts or even in the data itself.
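
A minimal sketch of working backwards, assuming the table-to-upstream-table mapping has already been extracted from the scripts (the table names here are hypothetical):

```python
# Walk backwards from an end-state table to every table it depends on.
# The upstream mapping is assumed to come from parsing the warehouse scripts.
upstream = {
    "reporting.revenue_summary": ["marts.orders_enriched"],
    "marts.orders_enriched": ["staging.orders", "staging.customers"],
    "staging.orders": [],
    "staging.customers": [],
}

def tables_needed_for(end_state_table: str) -> set[str]:
    """Collect the end-state table plus every upstream table it requires."""
    needed, stack = set(), [end_state_table]
    while stack:
        table = stack.pop()
        if table not in needed:
            needed.add(table)
            stack.extend(upstream.get(table, []))
    return needed

print(tables_needed_for("reporting.revenue_summary"))
```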

Dataset Translation

Dataset translation consists of the movement of any required data between the on-prem system and BigQuery. This includes data tables, files, objects, and schemas. GCP offers the BigQuery Data Transfer Service and Datastream to facilitate data movement from a host EDW into BigQuery. If your data warehouse is constructed from source data such as CDC transaction flows from host OLTP systems, then Datastream is appropriate. Another option is to stream transaction files into GCS, where BigQuery can use the Data Transfer Service or an external table to automate replication. This is also a useful approach if you want to use tools like Dataflow, Dataproc, or Vertex AI, which work well with GCS files. Consider taking advantage of the Storage Transfer Service for moving data from an alternative cloud provider or your on-prem system.
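
For example, files staged in GCS can be exposed to BigQuery as an external table before, or instead of, loading them. This is a hedged sketch using the BigQuery Python client; the bucket, dataset, and file format are assumptions.

```python
# Expose staged GCS files to BigQuery as an external table.
# Bucket, dataset, and source format below are hypothetical.
from google.cloud import bigquery

client = bigquery.Client()

external_config = bigquery.ExternalConfig("PARQUET")
external_config.source_uris = ["gs://my-migration-bucket/orders/*.parquet"]

table = bigquery.Table("my_project.staging.orders_external")
table.external_data_configuration = external_config
client.create_table(table, exists_ok=True)
```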

A primary concern for many customers is security during the migration process. GCP's managed services, such as Datastream, ensure point-to-point encryption while the data are in transit over the internet, and data are always encrypted at rest within GCP. If that is still deemed insufficient, if your data is very large (1 PB or greater), or if your connection to Google Cloud is slow, then you should consider using Transfer Appliance. Google ships a storage device to your on-prem location, you load the data onto the physical drives and ship them back to Google, and Google then loads the data directly into Cloud Storage.
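
A quick back-of-the-envelope estimate of online transfer time can help decide whether Transfer Appliance is warranted. The dataset size and sustained bandwidth below are assumptions; substitute your own measurements.

```python
# Rough transfer-time estimate to compare online transfer vs. Transfer Appliance.
# Both inputs are assumptions for illustration only.
DATA_SIZE_TB = 1024          # roughly 1 PB
USABLE_BANDWIDTH_GBPS = 1.0  # sustained throughput after overhead

size_bits = DATA_SIZE_TB * 8 * 10**12
seconds = size_bits / (USABLE_BANDWIDTH_GBPS * 10**9)
print(f"~{seconds / 86400:.0f} days at {USABLE_BANDWIDTH_GBPS} Gbps")
```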

Many tables in a data warehouse are procedurally generated from source data or other upstream tables. By correctly performing your dependency analysis before beginning your migration, you can identify the base tables which are used to build all downstream tables. This can mean less data movement is required to complete the migration. A tool like dbt can help map out these dependencies in your GCP environment.



Designing the migration validation strategy

In addition to properly formulating and executing your plan, it is essential to validate the results of your solution and ensure a successful migration. Developing a data validation strategy ahead of time can make it easier for you to accomplish this task.


Data Validation

Data validation is an essential process when migrating between systems, but it can be difficult to directly compare results in a host system to results in the cloud. GCP recommends using the Data Validation Tool (DVT) when debugging your data pipelines and data. Work iteratively: take each step of the migration one at a time and validate each step along the way. Use tools such as Cloud Composer to create and manage DAG executions. Use SQL aggregations to perform ad hoc spot checks on the data through row counts or cardinality analyses.
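
A spot check might compare a row count and a column cardinality on the migrated table against figures captured from the source system. This is only a sketch; the table, column, and expected values are hypothetical.

```python
# Ad hoc validation spot check: compare a row count and a column cardinality
# on the migrated BigQuery table against figures captured from the source.
from google.cloud import bigquery

client = bigquery.Client()

check_sql = """
    SELECT
      COUNT(*) AS row_count,
      COUNT(DISTINCT customer_id) AS customer_cardinality
    FROM `my_project.sales.orders`
"""

row = next(iter(client.query(check_sql).result()))
expected = {"row_count": 1_204_551, "customer_cardinality": 48_310}  # from source

for metric, expected_value in expected.items():
    actual = getattr(row, metric)
    status = "OK" if actual == expected_value else "MISMATCH"
    print(f"{metric}: source={expected_value} bigquery={actual} [{status}]")
```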



Designing the project, dataset, and table architecture to ensure proper data governance

Data must be effectively schematized, categorized, collated, and governed to achieve its full potential and provide the highest possible value to stakeholders and users.


Administrative Requirements Gathering

It is essential to properly gather all administrative requirements. Part of a solid migration strategy to BigQuery is giving the customer a good picture of all components required to complete the migration and build the final warehouse. It is almost impossible to predict exactly how the end state will look, but it should certainly be possible to give a solid estimate.

Work to effectively translate user access requirements from the on-prem system to BigQuery's IAM and permissions model. This will likely look very different from the permission model in place in the host system. Work with the customer to clearly define which roles need to accomplish which tasks and map user access requirements accordingly. Be prepared to define authorized datasets or authorized views to control access to sensitive data for downstream consumers and systems. Consider using Cloud Data Loss Prevention to automatically detect PII in datasets and mask, tokenize, or redact private or protected data. Use service accounts for data processing tasks and follow the principle of least privilege in the cloud.
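
As one example, the BigQuery Python client can authorize a view against a source dataset so downstream consumers query the view without needing access to the underlying tables. The project, dataset, and view names below are hypothetical.

```python
# Authorize a view on a source dataset so consumers can query the view
# without direct access to the raw tables. Names are hypothetical.
from google.cloud import bigquery

client = bigquery.Client()

source_dataset = client.get_dataset("my_project.sales_raw")
view = client.get_table("my_project.sales_reporting.orders_masked_v")

entries = list(source_dataset.access_entries)
entries.append(
    bigquery.AccessEntry(None, "view", view.reference.to_api_repr())
)
source_dataset.access_entries = entries
client.update_dataset(source_dataset, ["access_entries"])
```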

Do a best-effort analysis when developing a likely budget and cost estimate for the cloud migration. This includes budgeting for capacity requirements, including data storage for all components and compute costs. Examine commonly executed analytical queries in the host system and provide a useful estimate of the likely cost of performing the same queries in BigQuery. Be sure to identify partitioning and clustering strategies for the BigQuery datasets to store and access the data most efficiently. Be aware of BigQuery's editions pricing model, introduced in 2023.
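
A partitioning and clustering choice can be declared when the migrated table is created. This is a minimal sketch with a hypothetical schema; the right partition column and clustering fields depend on the query patterns you identified in the host system.

```python
# Create a migrated table with date partitioning and clustering declared
# up front. Table name, schema, and field choices are hypothetical.
from google.cloud import bigquery

client = bigquery.Client()

table = bigquery.Table(
    "my_project.sales.orders",
    schema=[
        bigquery.SchemaField("order_id", "STRING"),
        bigquery.SchemaField("customer_id", "STRING"),
        bigquery.SchemaField("order_date", "DATE"),
        bigquery.SchemaField("amount", "NUMERIC"),
    ],
)
table.time_partitioning = bigquery.TimePartitioning(
    type_=bigquery.TimePartitioningType.DAY, field="order_date"
)
table.clustering_fields = ["customer_id"]
client.create_table(table, exists_ok=True)
```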

Set up any logging required during the migration. Cloud Monitoring is used to collect and report on metrics generated by GCP services, including BigQuery slot usage and compute requirements for Dataflow or Dataproc. Cloud Logging is used to capture events generated by GCP services, including errors and other informational events. It is even possible to send your own events to Cloud Logging. Use Cloud Audit Logs to trace user access and activities in Google Cloud.
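
Sending your own structured migration events to Cloud Logging might look like the following sketch; the log name and payload fields are hypothetical.

```python
# Emit a custom structured migration event to Cloud Logging.
# Log name and payload fields are hypothetical examples.
from google.cloud import logging as cloud_logging

client = cloud_logging.Client()
logger = client.logger("dw-migration-events")

logger.log_struct(
    {
        "phase": "dataset_translation",
        "table": "sales.orders",
        "rows_loaded": 1204551,
        "status": "complete",
    },
    severity="INFO",
)
```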