Planning for using a data warehouse

A data warehouse is an effective tool for structuring and leveraging your data for analysis and reporting. Tools such as BigQuery let you build a high-performance, serverless data warehouse. Organize your data effectively by mapping current and future architecture to current and future business requirements, and choose an architecture that supports your users' data access patterns.

Topic Contents

Designing the data model
Deciding the degree of data normalization
Mapping business requirements
Defining architecture to support data access patterns


Designing the data model

Your data model is how you present your data to end users and stakeholders. It is almost always a business-driven requirement, and data engineers should work hand in glove with business stakeholders to determine the requirements for both the business and the data model. This close collaboration can make or break a project. Beyond schematization, pay special attention to materialization and performance requirements.


Designing The Data Model

The first step in building a data warehouse or data mart is to map out the data model that will form the backbone of all analyses performed against your data. At this point, the ingestion engines are created, the sources are connected, the data pipelines are built, and the raw data is either in GCS, BigQuery, or an RDBMS such as Cloud SQL or Cloud Spanner. Using your data processing layer you will now build the data model according to the business needs.

Data models can differ in shape depending upon which technology you choose to host them on. For example, BigQuery supports nested and repeated fields as well as tabular data, which makes it ideal for working with hierarchical or event data. However, BigQuery does not enforce primary or foreign keys, which means it cannot strongly enforce referential integrity, a requirement for some organizations. It is still possible to define "primary keys" and "foreign keys" on tables, but these unenforced constraints are more accurately described as hints to the query optimizer than as true keys. For the purposes of BigQuery, think of a primary key as any combination of columns that forms a unique identifier for your dataset.
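
For illustration, here is a minimal sketch of declaring such unenforced constraints via DDL from the Python client. The shop dataset and the table and column names are assumptions for this example, not part of the guide.

```python
from google.cloud import bigquery  # pip install google-cloud-bigquery

client = bigquery.Client()  # assumes default credentials and project

# Constraints in BigQuery must be declared NOT ENFORCED; they describe
# relationships to the query optimizer but are never validated on write.
ddl = """
CREATE TABLE IF NOT EXISTS shop.users (
  user_id STRING NOT NULL,
  PRIMARY KEY (user_id) NOT ENFORCED
);
CREATE TABLE IF NOT EXISTS shop.orders (
  order_id   STRING NOT NULL,
  user_id    STRING NOT NULL,
  order_date DATE,
  PRIMARY KEY (order_id) NOT ENFORCED,
  FOREIGN KEY (user_id) REFERENCES shop.users (user_id) NOT ENFORCED
);
"""
client.query(ddl).result()  # runs both statements as one multi-statement job
```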

BigQuery is the de facto data warehousing and analytical database standard in GCP. You should do all data warehousing in BigQuery unless there is a good reason you cannot.

BigQuery Data Modelling

Your data model should reflect the user's intention for the data. A good place to start is by understanding which questions your stakeholders are trying to answer. (It's called a query language for a reason!). When you have accurately defined the questions that are being asked of the data you can readily understand which datasets are required and which table relationships should be defined to access and manage the data.

With BigQuery, and data warehouses in general, joins are usually the most expensive operations. If you are working with hierarchical data, such as JSON event data, you can take advantage of nested and repeated fields to denormalize the data and make query operations much more efficient. For example, if you want to know which books were written by which author, you could create a table with one row per author and a nested, repeated field holding that author's books. This is much faster than joining an "authors" table to a "books" table. These optimizations become almost essential when dealing with massive datasets consisting of billions of rows of hierarchical data, such as Google Analytics data. Unlike vertically scaled warehouses built on an RDBMS, BigQuery does not slow down as the amount of data queried grows; it simply allocates more resources to the query. BigQuery works best with nested fields and denormalized tables, but it is certainly capable of handling complex joins across many tables.
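
The sketch below shows this authors-and-books pattern with the BigQuery Python client. The project, dataset, and field names are illustrative assumptions.

```python
from google.cloud import bigquery

client = bigquery.Client()

# One row per author; the author's books live in a REPEATED RECORD column,
# so no join against a separate books table is needed at query time.
schema = [
    bigquery.SchemaField("author_id", "STRING", mode="REQUIRED"),
    bigquery.SchemaField("author_name", "STRING"),
    bigquery.SchemaField(
        "books", "RECORD", mode="REPEATED",
        fields=[
            bigquery.SchemaField("title", "STRING"),
            bigquery.SchemaField("published_year", "INT64"),
        ],
    ),
]
table = bigquery.Table("my-project.library.authors", schema=schema)
client.create_table(table, exists_ok=True)

# UNNEST flattens the repeated field when you do need book-level rows.
sql = """
SELECT author_name, book.title
FROM `my-project.library.authors`, UNNEST(books) AS book
WHERE book.published_year >= 2020
"""
for row in client.query(sql).result():
    print(row.author_name, row.title)
```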

Another reason to take advantage of BigQuery is its broad support for different ingestion methods. The BigQuery Data Transfer Service handles ongoing data movement from many Google services, such as Google Analytics, as well as from sources in other clouds, such as AWS S3 or Azure Blob Storage.

Streaming data is easily handled by BigQuery thanks to the Storage Write API, which lets you stream data into BigQuery tables either directly via an API call or by using Pub/Sub as a buffer. It is also possible to define external tables in BigQuery that read directly from Cloud Storage.
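
As a rough sketch of both ideas: the snippet below uses the older insert_rows_json streaming method for brevity (the Storage Write API itself has a separate, more involved client in the google-cloud-bigquery-storage package) and then defines an external table over Cloud Storage. Table IDs, field names, and bucket paths are placeholders.

```python
from google.cloud import bigquery

client = bigquery.Client()

# Stream rows into an existing table. (Classic streaming insert shown for
# brevity; the Storage Write API is the recommended, higher-throughput path.)
errors = client.insert_rows_json(
    "my-project.events.page_views",
    [{"user_id": "u123", "url": "/home", "event_ts": "2023-11-01T12:00:00Z"}],
)
if errors:
    print("Rows failed to insert:", errors)

# Define an external table that reads newline-delimited JSON straight from GCS.
external_config = bigquery.ExternalConfig("NEWLINE_DELIMITED_JSON")
external_config.source_uris = ["gs://my-bucket/raw/events/*.json"]
external_config.autodetect = True

table = bigquery.Table("my-project.events.raw_events_external")
table.external_data_configuration = external_config
client.create_table(table, exists_ok=True)
```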

For many sources, you can virtually eliminate all other ingestion and pre-processing steps by taking advantage of the Data Transfer Service and the Storage Write API.

BigQuery Public Datasets

Another advantage BigQuery has over most other data warehouses on the market is access to a large catalog of valuable, publicly available datasets. These datasets are accessed via Analytics Hub and can be brought into your BigQuery data warehouse with just a few clicks.
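
For example, public datasets can be queried in place under the bigquery-public-data project; you are billed only for the bytes your own query scans. A quick sketch (the usa_names table is one commonly used public sample):

```python
from google.cloud import bigquery

client = bigquery.Client()

# Query a public dataset directly; the data is hosted and maintained by Google.
sql = """
SELECT name, SUM(number) AS total
FROM `bigquery-public-data.usa_names.usa_1910_2013`
WHERE state = 'MI'
GROUP BY name
ORDER BY total DESC
LIMIT 5
"""
for row in client.query(sql).result():
    print(row.name, row.total)
```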



Deciding the degree of data normalization

Deciding the degree of data normalization means choosing the granularity and redundancy appropriate for your data model, which depends upon how you want to present the data. Work with stakeholders to identify the questions being asked and the data sources needed. From there, identify functional dependencies among the data and find ways to collapse table joins by denormalizing the data in your data warehouse. The result will be quicker and more reliable queries from your data warehouse.


Deciding The Degree Of Normalization

To be precise, normalization describes the degree to which functionally dependent, ancillary data columns are separated out of a table. Denormalization is the act of producing fields in a result set that carry some degree of built-in assumption: you assume the outcome of a function by materializing the result of a previous state of the data and storing it in a field. The more normalized your data are, the closer your result set is to the current-state "truth", but the more processing power is required to process, analyze, and report on that data. Transactional databases are highly normalized, while analytical databases are denormalized.

Google is working to bridge the gap between transactional and analytical databases with tools such as AlloyDB for PostgreSQL, which uses smarter query planning and resource optimization to speed up joins and analytical queries. However, it does not scale to the level of BigQuery's petabyte-scale warehouse and is significantly more complicated to operate. AlloyDB is a PostgreSQL-compatible, OLTP-first service, so it is still subject to the limits of a relational design. AlloyDB can be useful if you have a large transactional Postgres database and expect a huge volume of CRUD operations against it; it is roughly comparable to Amazon Aurora on AWS.

Transactional Databases and 3rd Normal Form

Transactional databases are almost always fully normalized to third normal form, with no functional dependencies among non-key fields. This means database objects represent entities only, with a single key (or identifier) representing a single instance of an object (entity). For example, a users table might contain primitive information about a user: first name, last name, SSN, gender, ethnicity, date of birth, and date of site registration. Other information that is more fluid, such as a user's home address or last login date, is better segregated into separate tables. This serves a transactional database well by reducing the amount of data that actually needs to be queried and changed: you only need to change one row in the "last login" table instead of querying the entire user record, changing one field, and writing the whole object back. The result is a cheaper operation, both monetarily and computationally. Transactional workloads are write-heavy, and keeping fields highly normalized across tables makes for quick updates with minimal data operations.

Another table could contain orders for an e-commerce product, another the products purchased, another the product definitions, another the sizes, another the styles, and so on. At the transaction level this is essential for maintaining a high-performing database and strictly enforcing referential integrity. This high degree of normalization, known as third normal form, is the standard for most transactional databases. The transactional database holds the current state of the data only, and is used to process transactions, not to analyze them.

In the example above, the transactional database could perform thousands of transactions per day across dozens or even hundreds of tables. If you tried to run aggregated queries on top of this data it would be almost impossible to query the data and keep the application performant at the same time. Read replicas can help, but in a highly normalized state any analytical query is still difficult to run effectively. For example, retrieving all products ordered by a certain user segment, such as women aged 18-24 from the Great Lakes region, would require a very complicated SQL query with many joins. Additionally, businesses tend to operate on a different decision-making cadence and do not need every update made to a table in order to operate effectively; attempting to capture and relay every operation would be counterproductive and inefficient.
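
To make the normalized layout concrete, here is an illustrative sketch of such a schema and the kind of multi-join analytical query it forces. SQLite is used only so the sketch runs without any infrastructure; in practice this DDL would target Cloud SQL for PostgreSQL or Spanner, and every table and column name here is an assumption.

```python
import sqlite3

# Stand-in for a Cloud SQL / Spanner transactional schema (illustrative only).
conn = sqlite3.connect(":memory:")

conn.executescript("""
-- Each table holds facts about exactly one entity (third normal form):
CREATE TABLE users       (user_id INTEGER PRIMARY KEY, first_name TEXT, last_name TEXT, birth_date TEXT);
CREATE TABLE addresses   (address_id INTEGER PRIMARY KEY, user_id INTEGER REFERENCES users(user_id), street TEXT, region TEXT);
CREATE TABLE products    (product_id INTEGER PRIMARY KEY, description TEXT, price REAL);
CREATE TABLE orders      (order_id INTEGER PRIMARY KEY, user_id INTEGER REFERENCES users(user_id), order_date TEXT);
CREATE TABLE order_items (order_id INTEGER REFERENCES orders(order_id), product_id INTEGER REFERENCES products(product_id), quantity INTEGER);
""")

# Answering an analytical question means walking every relationship:
analytical_query = """
SELECT p.description, SUM(oi.quantity) AS units
FROM users u
JOIN addresses a    ON a.user_id = u.user_id
JOIN orders o       ON o.user_id = u.user_id
JOIN order_items oi ON oi.order_id = o.order_id
JOIN products p     ON p.product_id = oi.product_id
WHERE a.region = 'Great Lakes'
GROUP BY p.description;
"""
print(conn.execute(analytical_query).fetchall())
```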

Analytical Databases, Data Warehouses and Denormalization

Analytical databases, such as BigQuery, are purpose-built to analyze huge amounts of data very quickly at relatively low cost. To achieve this, BigQuery recommends a highly denormalized data model. Looking at the model above, we could aggregate the rows for orders and users, effectively denormalizing the data for the previous day. Addresses could be nested inside the users data, and products nested within orders. After aggregating the data we would use an ETL process to move it either into GCS to be read as an external table, or write it directly into BigQuery via the Storage Write API.

Now, instead of having to query many tables, we only need to query and join two, users and orders, to learn which products are favored by women aged 18-24 from the Great Lakes region. This greatly speeds up our queries and, just as importantly, organizes the data around user_id and order_id, which keeps querying highly scalable. Additionally, as a columnar database, BigQuery does not need to read every column in the table to process a query, which greatly reduces computational cost. With a proper data model in place, BigQuery can process billions of rows and petabytes of data in minutes, a task that is out of reach for a traditional RDBMS.
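
A hedged sketch of that two-table query follows. It assumes a warehouse dataset where users carries a nested address struct plus gender and birth_date columns, and orders carries a repeated cart field; all names are illustrative assumptions.

```python
from google.cloud import bigquery

client = bigquery.Client()

# With products nested inside orders and addresses nested inside users,
# the question needs only one join plus an UNNEST of the cart array.
sql = """
SELECT item.product_id, SUM(item.quantity) AS units
FROM `my-project.warehouse.orders` AS o
CROSS JOIN UNNEST(o.cart) AS item
JOIN `my-project.warehouse.users` AS u
  ON u.user_id = o.user_id
WHERE u.gender = 'F'
  AND DATE_DIFF(CURRENT_DATE(), u.birth_date, YEAR) BETWEEN 18 AND 24
  AND u.address.region = 'Great Lakes'
GROUP BY item.product_id
ORDER BY units DESC
"""
for row in client.query(sql).result():
    print(row.product_id, row.units)
```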



Mapping business requirements

Business stakeholders should be consulted closely throughout the data warehouse development process to ensure that requirements are met. Data engineers are responsible for building the warehouse, but they are not the ones who ultimately extract value from the data within.


Mapping Business Requirements

Designing and building a data warehouse cannot be performed in a vacuum or by a single party. It requires close coordination between technical and business leadership and the engineering teams to be successful. In essence, building a data warehouse is a joint venture between the two parties, and the end product should reflect that effort. To that end, a professional data engineer should act as a steward of both the business's data and the business's requirements for the scope, qualifications, and outcomes of the data warehouse initiative. You are responsible not only for the data itself, but also for what the data are trying to say.

The data warehouse should be thought of as the abstract reflection of the business operations spoken through the language of data. This translates well to analytical requirements and data visualizations, where data visualizations are often considered as telling a story about the business or a particular operation.

Planning in data systems is often done backwards. You don't start with the data and then decide what questions you need answered or how to answer them. You start with the questions and then gather and model the data needed to answer those questions. An example requirement could look like this:

As a product manager, I would like a detailed look at our inventory levels and product velocity in order to better determine how much inventory I should order each month.

Now that we have our objective we can begin mapping out the technical requirements to fulfill this request from the business. There appear to be two variables we will need to analyze in order to answer the question. We should first establish the business definitions of our working data, and then we can start mapping out how best to answer these questions. To simplify this exercise, we will ignore other factors such as seasonality or macroeconomic conditions.

  • Velocity - how fast is a given product selling (units sold/day)?
  • Inventory levels - how much inventory for a given product do we have remaining at the end of each day?
  • Inventory needed - derived from the current inventory level by day and the number of units sold per day, aggregated over the month.

Using the above definitions we can begin mapping the requirements to the data and building the tables needed to answer the question. Three tables are needed (name, description, and fields):

  • Products - products currently for sale - [product_id, date, product_description]
  • Product-Inventory - inventory by product - [product_id, date, inventory]
  • Orders - orders by user, with the products purchased as a nested array - [order_id, date, user_id, cart: [{product_id, price, quantity}]]
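
Below is a minimal sketch of creating these three tables with the BigQuery Python client and computing daily velocity alongside end-of-day inventory. The dataset name, the underscored product_inventory table ID, and the exact field types are assumptions for illustration.

```python
from google.cloud import bigquery

client = bigquery.Client()
dataset = "my-project.inventory_mart"  # assumed dataset

tables = {
    "products": [
        bigquery.SchemaField("product_id", "STRING", mode="REQUIRED"),
        bigquery.SchemaField("date", "DATE"),
        bigquery.SchemaField("product_description", "STRING"),
    ],
    "product_inventory": [
        bigquery.SchemaField("product_id", "STRING", mode="REQUIRED"),
        bigquery.SchemaField("date", "DATE"),
        bigquery.SchemaField("inventory", "INT64"),
    ],
    "orders": [
        bigquery.SchemaField("order_id", "STRING", mode="REQUIRED"),
        bigquery.SchemaField("date", "DATE"),
        bigquery.SchemaField("user_id", "STRING"),
        bigquery.SchemaField(
            "cart", "RECORD", mode="REPEATED",
            fields=[
                bigquery.SchemaField("product_id", "STRING"),
                bigquery.SchemaField("price", "NUMERIC"),
                bigquery.SchemaField("quantity", "INT64"),
            ],
        ),
    ],
}

for name, schema in tables.items():
    client.create_table(bigquery.Table(f"{dataset}.{name}", schema=schema), exists_ok=True)

# Daily product velocity (units sold per day) next to end-of-day inventory.
sql = f"""
SELECT
  inv.product_id,
  inv.date,
  inv.inventory                 AS end_of_day_inventory,
  IFNULL(SUM(item.quantity), 0) AS units_sold
FROM `{dataset}.product_inventory` AS inv
LEFT JOIN `{dataset}.orders` AS o
  ON o.date = inv.date
LEFT JOIN UNNEST(o.cart) AS item
  ON item.product_id = inv.product_id
GROUP BY inv.product_id, inv.date, inv.inventory
"""
velocity = client.query(sql).result()
```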

This is a very simple example, but it highlights the key parts of the process:

  • Who is requesting what data, and how do they need to see it?
  • What data is needed to answer the question?
  • How should the warehouse model be designed to effectively answer the question being asked?

Cloud data warehouses have driven a paradigm shift away from monolithic, vertically scaled data warehouses hosted on an RDBMS on on-prem hardware. With tools like BigQuery, which are essentially infinitely scalable and serverless, it becomes possible to be highly nimble in answering very specific questions. It is no longer necessary to maintain a huge, monolithic data warehouse to answer business questions effectively and accurately. In fact, the example above could easily be a very simple data model, rebuilt each morning using LookML and visualized in Looker. The analyst would then open the dashboard as needed and access a model that can be rebuilt on the fly.

Given BigQuery's flexibility, a business can focus less on maintaining and developing huge data models or warehouses and more on making tactical, strategic, data-driven decisions more frequently and with greater impact. Using GCP's current technology it is possible to stream gigabytes of data directly into BigQuery daily, build views to parse and type the data, build additional views on top of those to model the data, and then materialize the results as tables with scheduled queries, ready for analysis. This could all be done with GCP's native data processing services within a single day and from a single computer.
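
As a hedged illustration of that views-plus-scheduled-queries flow: the sketch below creates a parsing view with the BigQuery client and registers a daily scheduled query through the Data Transfer Service client (which is how scheduled queries are managed). The dataset, table names, view SQL, and parameter values are placeholders, and the separate google-cloud-bigquery-datatransfer package is assumed to be installed; check the parameter keys against the current documentation.

```python
from google.cloud import bigquery, bigquery_datatransfer  # two separate packages

client = bigquery.Client()
project_id = client.project

# A view that parses/types the raw streamed data (query body is a placeholder).
view = bigquery.Table(f"{project_id}.analytics.parsed_events_v")
view.view_query = f"""
SELECT
  JSON_VALUE(payload, '$.user_id') AS user_id,
  TIMESTAMP(JSON_VALUE(payload, '$.event_ts')) AS event_ts
FROM `{project_id}.analytics.raw_events`
"""
client.create_table(view, exists_ok=True)

# Materialize the modeled data every morning with a scheduled query.
transfer_client = bigquery_datatransfer.DataTransferServiceClient()
config = bigquery_datatransfer.TransferConfig(
    destination_dataset_id="analytics",
    display_name="daily_event_model",
    data_source_id="scheduled_query",
    params={
        "query": f"SELECT * FROM `{project_id}.analytics.parsed_events_v`",
        "destination_table_name_template": "event_model_{run_date}",
        "write_disposition": "WRITE_TRUNCATE",
    },
    schedule="every 24 hours",
)
transfer_client.create_transfer_config(
    parent=transfer_client.common_project_path(project_id),
    transfer_config=config,
)
```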



Defining architecture to support data access patterns

Data architecture should be chosen to match data access patterns. Know the different technologies, their trade-offs, and their most common use cases to provide a high-quality, performant, and cost-optimized solution.


Defining Architecture To Support Data Access Patterns

Data architecture should support how your users need to access the data. Understanding user requirements is key to building an effective data solution in GCP. For example, if your data are accessed around the clock by thousands of users around the globe, it wouldn't make sense to use a Cloud SQL instance when BigQuery provides data replication and high availability out of the box. However, if you are running a highly classified government workload, then Cloud SQL hosted on a sole-tenant node with a private IP address behind a firewall with CMEK encryption would be a better choice. There can be a multitude of possible use cases and data access patterns, so knowing the strengths and trade-offs of each technology will help you determine the best fit. Instead of only asking what would work for a solution, try a process of elimination: focus on why a particular option would not work, which helps you rule out bad solutions.

Your goal is to find a balance between competing interests while fulfilling the listed requirements. Security, storage space, performance, and cost all play a part.

For most workloads in the cloud, BigQuery is the go-to solution for data warehousing, analysis, and analytics. It is a Swiss Army knife for many different use cases and should usually be considered first before other options. Most data-driven companies rely on a robust data platform that powers business intelligence tools such as Looker or Tableau. By developing a complete and robust data platform you can enable advanced use cases such as advanced analytics, machine learning, and data-driven decision making.

When developing your solution, keep an eye on cost as well and avoid overkill. For example, if your entire data team lives in a single region of the United States, you don't need multi-region hosting or Premium Tier networking. If your users are SQL-savvy, consider giving them deeper access to the data warehouse itself rather than only a Looker data model.