

Designing for a data mesh

Data mesh is a fairly new concept which encourages domain-driven data product development. GCP has a number of native tools which enable you to quickly and efficiently build a data mesh, segment data, and build a federated governance model.


Topic Contents

Building a data mesh based on requirements by using Google Cloud tools (e.g., Dataplex, Data Catalog, BigQuery, Cloud Storage)
Segmenting data for distributed team usage
Building a federated governance model for distributed data systems


Building a data mesh based on requirements by using Google Cloud tools (e.g., Dataplex, Data Catalog, BigQuery, Cloud Storage)

GCP's native data tools are very flexible in their approaches to solving data problems and can be used with a wide variety of data architectures. Use GCP's data tools to develop your data mesh.


Building a data mesh based on requirements by using Google Cloud tools (e.g., Dataplex, Data Catalog, BigQuery, Cloud Storage)

A data mesh is a newer approach to data engineering built around domain-driven design. The purpose of domain-driven design is to facilitate the production of data assets, or products, and to shift much of the responsibility for data development away from core engineering teams and toward domain experts. These products are built and maintained by the domain experts and become searchable assets to be utilized by many members of an organization.

The organization's central repository of data products is searchable and fully aligned with company data policies and standards. The data are considered high quality and well groomed because they go through a rigorous QA process before being onboarded to the data mesh. Generally, this whole process assumes that the organization's teams are already quite skilled as data engineers. Their processes and production assets are not easily produced or replicated, and heavy reliance on institutional knowledge can lead to a siloing of information or skill sets, having rather the opposite of the intended effect of knowledge sharing. The data mesh is most successful within organizations which see value in maintaining and producing sophisticated data assets and which actively encourage knowledge sharing and collaboration. Although the data model is distributed, governance and management are federated, which means that there are standards, controls, initiatives, and directives which apply to all domain assets. The federated system performs a balancing act, gradually giving or taking responsibility as certain domains require it.

As data mesh has become more popular, GCP has developed tools and techniques to support organizations which adopt this design philosophy. Google Cloud sees this process as a "democratization of data" which facilitates "data-driven decision making".

Dataplex

Dataplex is GCP's main tool for data mesh development. It is used to identify and standardize data assets within an organization, automatically apply rules and quality checks to the data, and facilitate its discoverability by other teams. Dataplex is a unified data management and governance suite which can provide the necessary pathways for domain-driven design and development while simultaneously ensuring a high-quality and performant system. The data itself is segregated by domain, or team, but the standards for data development, quality assurance, data mesh onboarding, discovery, and sharing should be managed at the organizational level. Dataplex provides this operational framework.

Data mesh is an extension and refinement of the data lake development process. Data lakes are developed as per the standard, but the data and methods are more organized and better developed, with each data domain taking on responsibility for its own data flows, pipelines, data lakes, code repositories, and BigQuery/data warehouse assets. Some assets can be shared, but the responsibility for producing or managing those assets still belongs to just one team or domain.

Data mesh provides several key features which bring your disparate data assets and teams together under a unified approach to data discovery, sharing, and management.

Data Lake

Dataplex lakes are built directly on top of GCS buckets, BigQuery datasets, or databases and are broken into what are known as zones. There should be at least two zones: raw and curated (the curated layer is sometimes referred to as conformed, groomed, etc.). Zones can be created via the console or the API. Zones should be self-descriptive and reflect the intent of the data within. For example, if you are in the operations domain and you need to analyze inventory levels, one zone could be called inventory and another planning. All data initially flows into the raw zone, and pipeline processes then transform and move this data into the proper zone.
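As an illustration of the API route, the following is a minimal sketch of creating a lake with a raw and a curated zone using the google-cloud-dataplex Python client. The project, region, and lake/zone names are placeholders, and the exact client calls should be confirmed against the current library documentation.

    from google.cloud import dataplex_v1

    client = dataplex_v1.DataplexServiceClient()
    parent = "projects/my-project/locations/us-central1"  # hypothetical project/region

    # Create the lake; create_* calls return long-running operations.
    lake = client.create_lake(
        parent=parent,
        lake_id="operations-lake",
        lake=dataplex_v1.Lake(display_name="Operations domain lake"),
    ).result()

    # Create one raw and one curated zone inside the lake.
    for zone_id, zone_type in [
        ("raw", dataplex_v1.Zone.Type.RAW),
        ("curated", dataplex_v1.Zone.Type.CURATED),
    ]:
        zone = client.create_zone(
            parent=lake.name,
            zone_id=zone_id,
            zone=dataplex_v1.Zone(
                type_=zone_type,
                resource_spec=dataplex_v1.Zone.ResourceSpec(
                    location_type=dataplex_v1.Zone.ResourceSpec.LocationType.SINGLE_REGION
                ),
                discovery_spec=dataplex_v1.Zone.DiscoverySpec(enabled=True),
            ),
        ).result()
        print("Created zone:", zone.name)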

Steps Needed to Add Data Assets

There are a few steps needed before you can add your assets to Dataplex. First, ensure that Dataplex has the proper permissions to read the required GCS buckets and BigQuery datasets. It is also recommended to enable BigQuery data lineage to track where data comes from, where it is passed to, and what transformations are applied to it. To onboard data assets, manually add either the Cloud Storage bucket or the BigQuery dataset to the proper zone. After you add the dataset or bucket, Dataplex will scan it, add your assets to the proper zone, map data lineage, build Data Catalog information, and surface the data for discovery. GCS assets are exposed as BigQuery external tables, which can then be manually upgraded to BigLake tables.
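A hedged sketch of the onboarding step described above: attaching an existing Cloud Storage bucket to a zone as a Dataplex asset, assuming the lake and zones from the previous example already exist and Dataplex has read access to the bucket. All resource names are hypothetical.

    from google.cloud import dataplex_v1

    client = dataplex_v1.DataplexServiceClient()
    zone_name = (
        "projects/my-project/locations/us-central1/lakes/operations-lake/zones/raw"
    )

    # Attach the bucket; Dataplex discovery will then scan it and surface
    # its contents as BigQuery external tables.
    asset = client.create_asset(
        parent=zone_name,
        asset_id="inventory-raw-bucket",
        asset=dataplex_v1.Asset(
            resource_spec=dataplex_v1.Asset.ResourceSpec(
                type_=dataplex_v1.Asset.ResourceSpec.Type.STORAGE_BUCKET,
                name="projects/my-project/buckets/inventory-raw",
            ),
            discovery_spec=dataplex_v1.Asset.DiscoverySpec(enabled=True),
        ),
    ).result()
    print("Created asset:", asset.name)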

The Raw Zone

The data lake is developed with GCS at its core. All data should flow into clearly delineated buckets, with datasets segregated by table and date and hive partitioning enabled. The raw zone is essentially an exact 1:1 copy of the data as it is gathered from the source, with no transformations, filters, or maps applied. This helps the debugging process if something goes wrong and you need to communicate with source teams.
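As an example of what a raw-zone layout might look like, the sketch below lands a raw extract in a hive-partitioned path (table name plus year/month/day keys) using the Cloud Storage Python client. The bucket name, table name, and payload are made up for illustration.

    import datetime

    from google.cloud import storage

    bucket = storage.Client().bucket("inventory-raw")  # hypothetical raw-zone bucket
    today = datetime.date.today()

    # Produces a path such as inventory_levels/year=2024/month=01/day=15/extract.json,
    # which discovery and BigQuery can treat as hive partitions.
    blob_path = (
        f"inventory_levels/year={today.year}/month={today.month:02d}/"
        f"day={today.day:02d}/extract.json"
    )
    bucket.blob(blob_path).upload_from_string('{"sku": "A-100", "qty": 42}')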

The Curated Zone

As data assets are moved from raw to curated zones dataplex automatically produces and manages the metadata attached to them. The data should be high-quality, groomed, well-organized, properly prefixed, in a hive partitioned format, and formatted as Avro, CSV, Iceberg, JSON, ORC or Parquet data to allow the creation of BigLake Tables. Dataplex will now automatically transform this zone into a BigQuery dataset with data assets represented as BigQuery tables. Dataplex will generate all appropriate metadata and propagate the metadata to Data Catalog where it will be searchable by anyone in the organization who has the proper permissions.
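One way curated Parquet files can also be exposed explicitly as a BigLake table is the CREATE EXTERNAL TABLE ... WITH CONNECTION DDL, shown here through the BigQuery Python client as a rough sketch. The project, dataset, connection, and bucket names are placeholders.

    from google.cloud import bigquery

    client = bigquery.Client(project="my-project")
    client.query(
        """
        CREATE OR REPLACE EXTERNAL TABLE `my-project.operations_curated.inventory_levels`
        WITH PARTITION COLUMNS
        WITH CONNECTION `my-project.us-central1.biglake-conn`
        OPTIONS (
          format = 'PARQUET',
          uris = ['gs://inventory-curated/inventory_levels/*'],
          hive_partition_uri_prefix = 'gs://inventory-curated/inventory_levels'
        )
        """
    ).result()  # wait for the DDL job to finish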



Segmenting data for distributed team usage

One hallmark of a data mesh is domain-driven data product development, which eases data segmentation for the various teams. Teams can share or isolate given data streams for their individual use cases.


Segmenting data for distributed team usage

Data segmentation is essential to enable domain-driven development. In a data mesh, data are segmented by domain and area of responsibility. Each data asset should be assigned to only one domain, and data contracts and sharing agreements drive collaboration and enable a highly nimble architecture while ensuring that high-quality data products are the norm.

The actual logic behind a given segmentation or assignment is up to the organization and should be governed by a standardized set of rules. The segmentation is usually very clear, as a given team generally knows its data sources well and is readily willing and able to take ownership of the data. This approach to data segmentation assumes that a competent data engineer is aligned to a given team, or that a strong and nimble dedicated data team is available to help the various teams achieve their data objectives.
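As a small illustration of segmenting a dataset for a domain team, the sketch below gives the owning team write access to its BigQuery dataset and another consuming team read-only access, via the BigQuery Python client. The group addresses and dataset ID are hypothetical.

    from google.cloud import bigquery

    client = bigquery.Client(project="my-project")
    dataset = client.get_dataset("my-project.operations_curated")

    # The owning domain writes; other domains only read.
    entries = list(dataset.access_entries)
    entries += [
        bigquery.AccessEntry("WRITER", "groupByEmail", "operations-domain@example.com"),
        bigquery.AccessEntry("READER", "groupByEmail", "analytics-consumers@example.com"),
    ]
    dataset.access_entries = entries
    client.update_dataset(dataset, ["access_entries"])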



Building a federated governance model for distributed data systems

Utilize GCP data tools to achieve a flexible and federated governance standard. Centralize data mesh rules and policies for important organizational requirements such as privacy, security, and data quality, while giving lower-level teams the resources they need to produce their products most effectively.


Building a federated governance model for distributed data systems

Data mesh is decentralized by default, which can present a competing set of requirements for data governance across an organization. However, this should be thought of as a feature, not a bug. Data-driven organizations that rely upon data contracts to determine relationships between teams should be imagined as partnerships among equals, rather than as an overarching bureaucracy governing teams and controlling their assets.

The idea of the data mesh is still nascent, and different organizations have different standards that apply in different scenarios. This is usual practice in the industry and is perfectly reasonable; each organization and each domain is different. The data product owners or domain experts decide how to produce and curate the data, and the federated governance determines how to qualify and share that data. The first step when building a data mesh is to establish a "data governance council" made up of the domain leaders, who coordinate with senior organizational officers and managers to develop data contract standards and align on principles and methods for data product development.

IAM roles can be assigned and removed as needed across different organizations and domain assignments. In GCP, this can mean developing domain-specific custom roles to coordinate the policies and standards that ensure a high-quality, secure product can be brought to the data market. The federated model dictates which policies apply to the whole data mesh, or data market, and which policies are left to individual teams or domain/product owners. Data Catalog can be used to organize advanced data masking techniques and determine which roles should be required to access a given data product.
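For example, here is a sketch of attaching a Data Catalog policy tag to a sensitive column so that column-level access control and data masking rules defined by the governance council apply to it. The taxonomy path, table, and column name are hypothetical.

    from google.cloud import bigquery

    client = bigquery.Client(project="my-project")
    table = client.get_table("my-project.operations_curated.inventory_levels")

    policy_tag = "projects/my-project/locations/us-central1/taxonomies/123/policyTags/456"

    new_schema = []
    for field in table.schema:
        if field.name == "supplier_contact_email":  # hypothetical sensitive column
            field = bigquery.SchemaField(
                field.name,
                field.field_type,
                mode=field.mode,
                policy_tags=bigquery.PolicyTagList(names=[policy_tag]),
            )
        new_schema.append(field)

    table.schema = new_schema
    client.update_table(table, ["schema"])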

Take advantage of tools like Dataplex to automate metadata generation and management across your organization, as well as Analytics Hub to facilitate data sharing.