Preparing data for visualization
GCP offers a broad set of effective data visualization tools and best practices that can help you deliver a high-quality product while respecting the security and privacy requirements of your stakeholders and users.
Topics Include:
- Connecting to tools
- Pre-calculating fields
- BigQuery materialized views (view logic)
- Determining granularity of time data
- Troubleshooting poor performing queries
- Identity and Access Management (IAM) and Cloud Data Loss Prevention (Cloud DLP)
Connecting to tools
GCP offers the Looker suite for business intelligence. Looker provides built-in, SQL-native connectors for a wide variety of database types and sources, and its best-in-class API connection suite gives it a distinct advantage as a data visualization tool on GCP.
Connecting to tools
GCP, as with most cloud platforms, makes connecting your data warehouse or data mesh to visualization tools relatively easy. With third-party tools such as Tableau, all you need to do is set up authentication and authorization and you can start building dashboards and reports. With Looker it is even easier. Keep in mind, however, that any business intelligence and reporting suite is only as good as the data engineering behind it. If your data warehouse or data mesh is not producing high-quality, performant datasets, then all downstream reporting will suffer in turn.
When you're using a tool like Looker, you can authenticate with Google IAM to allow users to connect to or administer Looker. However, Looker requires more granular permissions to be handled within the instance itself using Looker's own authentication and authorization. You can import your Google Workspace OAuth accounts into Looker and automatically apply certain roles or other options during the import.
It is recommended to create a dedicated service account for whichever tool you use to query your data. This helps with audit logging and usage metrics and properly segregates duties across your various components. Looker can also be configured with a private IP address, which ensures the instance can only be accessed internally and is not exposed to the internet.
BigQuery and Looker
BigQuery is the preferred tool for powering your analytics and reporting in Looker. Looker was acquired by Google, and since then Google Cloud has worked to integrate Looker natively with BigQuery, which makes getting up and running with Looker and BigQuery very easy. The Looker platform itself takes a federated approach to permissions, and you must assign internal roles and permissions within the platform. Looker allows a robust set of roles and permissions to be applied to any given dataset, model, or dashboard.
Looker Studio is available out of the box with BigQuery and can be used to perform quick analyses of BigQuery result sets, tables, or views. You can get decent ad hoc reporting and analytics with Looker Studio, although it isn't as robust as Looker proper. If all of your data is hosted in BigQuery and you don't need some of Looker's advanced features, such as LookML, then Looker Studio can get you quite far on its own. It is possible to pull all data into BigQuery with creative data engineering, but if you need to connect to external providers, such as AWS Athena, or to a data warehouse hosted on a private IP, then Looker is the better choice because it can integrate data from multiple sources inside and outside Google Cloud.
LookML, short for Looker Modeling Language, is Looker's proprietary data modeling language. It integrates seamlessly with BigQuery and can competently express highly complex GoogleSQL operations out of the box. It can even be used to execute BigQuery ML (BQML) models on the fly and report on the results. A deep dive into LookML is beyond the scope of the exam, but know that it is highly capable and can perform the business intelligence tasks you are likely to need.
Looker can take advantage of BigQuery's unique architecture and processes. If you have dashboards or Explores that are updated infrequently, Looker will use BigQuery's cached query results to build the dashboard. This way you can check for new data without having to rerun expensive queries; BigQuery caches query results for 24 hours by default. Alternatively, if your data is updated frequently, you can use BigQuery materialized views to persist results for a time. This can improve performance, but you might be reading stale data.
Pre-calculating fields
Visualization performance can be greatly improved by pre-calculating fields and caching results.
Pre-calculating fields
Queries take time to complete, and even with a tool like BigQuery this can add up and lead to diminished performance. The more joins, rollups, partitions, and other advanced SQL you use, the more complex and time consuming your query will be to run. If you notice that a query is becoming very complex and performing poorly, a few techniques can speed up response times in BI software: careful planning, data segmentation, and use of advanced tooling can greatly improve your query speeds.
Use BigQuery BI Engine to create highly optimized, in-memory cached query results for frequently used data. Combine BI Engine with preferred tables to deliver very fast query results to your BI tools. BigQuery also caches query results automatically, which improves performance, but this is less robust than BI Engine and gives you less control over how it works.
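As a concrete illustration, pre-calculated fields are often persisted in a summary table that a scheduled query refreshes. The sketch below uses hypothetical dataset, table, and column names (reporting.daily_sales_summary, sales.orders, order_total, order_cost):

```sql
-- A sketch of a pre-calculated reporting table, refreshed on a schedule
-- (for example via a BigQuery scheduled query). All names are hypothetical.
CREATE OR REPLACE TABLE reporting.daily_sales_summary
PARTITION BY sale_date
AS
SELECT
  DATE(order_ts) AS sale_date,
  store_id,
  COUNT(*) AS order_count,                       -- pre-calculated measure
  SUM(order_total) AS gross_revenue,             -- pre-calculated measure
  SUM(order_total - order_cost) AS gross_margin  -- derived field computed once
FROM sales.orders
GROUP BY sale_date, store_id;
```

Dashboards can then read the compact summary table instead of re-aggregating the raw orders on every refresh, and BI Engine can hold a table like this in memory for very fast reads.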
BigQuery materialized views (view logic)
Take advantage of BigQuery materialized views in addition to pre-calculating fields to ensure high-performance visualizations.
BigQuery materialized views (view logic)
Use materialized views to encapsulate complex business logic and push query optimization further down the stack. BigQuery can produce some results faster and perform some joins more efficiently when complex logic is encapsulated in views rather than generated as complex SQL on the fly. This also makes for a cleaner implementation overall and enables better access controls and visibility into business logic. You could also instantiate this logic as tables (refreshed daily, for example), which will greatly speed up queries and minimize the amount of data processed each day by frequently executed views. Materialized views are great for caching daily result sets of partitioned base tables.
Materialized views act as a sort of data cache and can help keep costs low when a query doesn't have to fall back to the underlying table.
BigQuery executes logic behind the scenes to automatically determine whether changes to the base tables would change the materialized view. If so, BigQuery refreshes the materialized view; otherwise it routes the request to the cached view result. Materialized views are also compatible with BI Engine and can be a great hybrid approach for querying fresh data without relying on daily table rebuilds, which can increase costs.
Be aware that materialized views do NOT support all SQL syntax, and you must follow strict rules when constructing the defining query. You cannot create a materialized view from a query that is non-deterministic, or that BigQuery cannot maintain incrementally, such as aggregations of aggregations (for example, window functions), which would require multiple temporarily cached intermediate results and force a rebuild of the view. Currently the following SQL features are not supported:
- Left/right/full outer joins.
- Self-joins (joins using the same table more than once).
- Window functions.
- ARRAY subqueries.
- Non-deterministic functions such as RAND(), CURRENT_DATE(), SESSION_USER(), or CURRENT_TIME().
- User-defined functions (UDFs).
- TABLESAMPLE.
- FOR SYSTEM_TIME AS OF.
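As an illustration, a materialized view that stays within the supported feature set is a straightforward aggregation over a single base table. The names below are hypothetical:

```sql
-- A sketch of a supported materialized view: a simple GROUP BY aggregation
-- over one base table, with no outer joins, window functions, UDFs, or
-- non-deterministic functions. All names are hypothetical.
CREATE MATERIALIZED VIEW reporting.mv_daily_revenue AS
SELECT
  DATE(order_ts) AS sale_date,
  store_id,
  COUNT(*) AS order_count,
  SUM(order_total) AS gross_revenue
FROM sales.orders
GROUP BY sale_date, store_id;
```

Queries that aggregate sales.orders by day and store can then often be rewritten automatically against the materialized view, and BigQuery maintains it incrementally as new data arrives.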
The table below compares the different BigQuery techniques and how they can facilitate your analytical querying and improve performance.
| Component | Caching | Scheduled queries | Logical views | Materialized views |
|---|---|---|---|---|
| Optimize compute | Yes | No | No | Yes |
| Query support | All | All | All | Limited |
| Partitioning and clustering | No | Yes | N/A | Yes |
| Incremental refresh | No | No | No | Yes |
| Additional storage | No | Yes | No | Yes |
| Query rewrite | No | No | No | Yes |
| Maintenance costs | No | Yes | N/A | Yes |
| Data staleness | Never | Yes | Never | Optional |
Determining granularity of time data
Companies, their data, and their teams can all operate on different frequencies. Stakeholders should work with data engineers to determine the data granularity that best serves the cadence of decision making.
Determining granularity of time data
Organizations tend to operate and make decisions on certain frequencies depending on factors such as seasonality, peak hours, and retail opening hours. As businesses move to global, 24/7 operations, coordinating business operations and analytics becomes more difficult.
The first step for any business is to determine how to normalize its system and operating datetimes. This becomes increasingly important as operations move farther from headquarters and time zones become harder to reconcile, affecting not only operational reporting but also month-end accounting, quarterly financial statements, and other required reporting. The best method for timestamp normalization is to normalize all event timestamps to UTC. This is the default for GCP, and it allows easy translation of timestamps into any time zone.
After timestamp normalization, determine how data is aggregated along the timestamp. Once you aggregate on a timestamp it becomes locked into the time zone you aggregated on, which is why your raw base data should always be normalized to UTC to ensure continuity between reports. Usually, businesses are most interested in daily aggregations in their headquarters' time zone. You can use views to produce these reports on the fly or use them to generate daily tables that speed up querying and reporting.
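For example, if raw events are stored as UTC TIMESTAMP values, a daily rollup in the headquarters' time zone is a simple conversion at aggregation time. The table name and time zone below are assumptions for illustration:

```sql
-- A sketch: aggregate UTC event timestamps into business days in the
-- headquarters' time zone. events.page_views and America/New_York are
-- hypothetical placeholders.
SELECT
  DATE(event_ts, 'America/New_York') AS business_date,
  COUNT(*) AS event_count
FROM events.page_views
GROUP BY business_date
ORDER BY business_date;
```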
Translating data from a legacy data warehouse
When you are integrating pre-aggregated data from a legacy warehouse, be sure to note which time zone the data comes from. In many legacy, on-premises systems the event timestamp is not adequately captured and cannot be relied upon to translate or integrate correctly. This requires a judgment call by business stakeholders about how to handle the data. If the data was aggregated by date, the timestamps could be set to midnight and translated to UTC, though this can introduce incorrect data. A better option may be to keep the legacy data separate and combine it in the reporting layer once the timestamps have been correctly normalized and adjusted for time zone.
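A minimal sketch of the first option, assuming the legacy system recorded only a calendar date in its local time zone (the table name and time zone are hypothetical):

```sql
-- Interpret each legacy date as midnight in the legacy system's local time
-- zone, producing a UTC TIMESTAMP. legacy_dw.daily_activity and
-- America/Chicago are hypothetical placeholders.
SELECT
  activity_id,
  TIMESTAMP(activity_date, 'America/Chicago') AS event_ts_utc
FROM legacy_dw.daily_activity;
```

Because every legacy event collapses to midnight, intraday reporting on these rows will be misleading, which is one reason keeping the legacy data in a separate reporting layer is often the safer choice.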
Troubleshooting poor performing queries
High-performance data visualizations should be monitored for consistent query behavior and, if necessary, any poorly performing or stale queries should be corrected.
Troubleshooting poor performing queries
Identifying Poor Performing Queries
Most poor performing queries can be corrected by first identifying when a query is performing poorly and then deducing possible reasons for the poor performance. To help identify these queries, BigQuery offers a number of tools that provide metrics, operating statistics, and alerts, which can flag jobs that are behaving abnormally by processing an unusually large amount of data or taking an unusually long time. Additionally, you can observe anomalies in the resulting tables or reports.
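One practical way to surface candidates is to query the INFORMATION_SCHEMA jobs views for recent queries that scanned the most data or consumed the most slot time. The region qualifier and time window below are assumptions for illustration:

```sql
-- A sketch: list the most expensive query jobs from the last 7 days.
-- Adjust `region-us` to match your project's location.
SELECT
  job_id,
  user_email,
  total_bytes_processed,
  total_slot_ms,
  TIMESTAMP_DIFF(end_time, start_time, SECOND) AS runtime_seconds,
  cache_hit
FROM `region-us`.INFORMATION_SCHEMA.JOBS_BY_PROJECT
WHERE creation_time > TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 7 DAY)
  AND job_type = 'QUERY'
ORDER BY total_slot_ms DESC
LIMIT 20;
```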
Troubleshooting Poor Performing Queries
Once you have identified a poor performing query, you can begin the debugging process. Almost always, if a query or report was performing well and its performance suddenly degrades, this indicates a material change or flaw in the data. This can mean something like a massive spike in activity resulting in more data, or a flaw such as duplicated rows that are expanding previously efficient joins.
If you're in a development environment and building a query, there are a few methods you can use to keep queries optimal in BigQuery. The first is to properly isolate and normalize the data you are working with:
- Don't query every column in a dataset with SELECT *; this is very inefficient in columnar databases such as BigQuery.
- Use CTEs to effectively segment your datasets and reduce the data required for assembling your result set. This is a good practice when developing analytical queries because it gives you clear visibility into the data you are actually working with, and BigQuery's architecture handles this pattern well.
- If you're performing a join, try to list the larger, higher-cardinality table first; BigQuery's optimizer will distribute this data more readily and do a better job of looking up matches in the smaller table (a broadcast join).
- Avoid performing many, if any, Data Manipulation Language (DML) statements, such as writes or deletes, in BigQuery. BigQuery is not a transactional database, and these operations are not optimized.
- Ensure proper partitioning and clustering on the table. This reduces the amount of data BigQuery needs to read, which lowers costs and speeds up performance; see the sketch after this list.
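The sketch below ties several of these tips together using hypothetical tables: it creates a partitioned, clustered copy of the data, then runs a selective query that prunes partitions, avoids SELECT *, and lists the larger table first in the join:

```sql
-- A sketch with hypothetical table names (sales.orders, sales.store_dim).
-- 1) Partition and cluster the table so queries read less data.
CREATE TABLE IF NOT EXISTS sales.orders_optimized
PARTITION BY DATE(order_ts)
CLUSTER BY store_id, customer_id
AS SELECT * FROM sales.orders;

-- 2) Query only the columns needed, filter on the partitioning column,
--    and list the larger fact table before the smaller dimension table.
SELECT
  o.store_id,
  s.store_name,
  SUM(o.order_total) AS revenue
FROM sales.orders_optimized AS o               -- larger table first
JOIN sales.store_dim AS s USING (store_id)     -- smaller dimension table
WHERE DATE(o.order_ts) BETWEEN '2023-11-01' AND '2023-11-30'
GROUP BY o.store_id, s.store_name;
```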
Identity and Access Management (IAM) and Cloud Data Loss Prevention (Cloud DLP)
Data visualizations may be shared with and viewed by many users. Leverage tools such as Identity and Access Management (IAM) and Cloud Data Loss Prevention (Cloud DLP) to prevent unauthorized access to protected data.
Identity and Access Management (IAM) and Cloud Data Loss Prevention (Cloud DLP)
IAM Considerations For Data Visualization
IAM is the backbone of GCP's security architecture throughout the stack. It is used to assign roles to principals and control access to the various APIs. There are a few access points for different roles depending on which components you choose. When using data visualization software, you should grant access to a dedicated service account that interacts with your data warehouse, such as BigQuery. The most basic role is BigQuery Data Viewer, but this is very broad and likely gives the account too much access. Instead, it can be wise to use authorized datasets or views that contain only the data needed for the reporting layer. Authorized datasets are very useful for isolating reporting data, especially within a data mesh architecture, where your "gold layer" can be the authorized dataset shared with users.
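A minimal sketch of this pattern, using hypothetical project, dataset, and service account names: publish a curated view into a reporting dataset and grant the BI tool's service account read access to that dataset only. Authorizing the view against the source dataset is a separate step performed in the console, API, or bq tool.

```sql
-- Hypothetical names throughout (my-project, reporting, warehouse, looker-sa).
-- Expose only the curated columns needed by the reporting layer.
CREATE OR REPLACE VIEW `my-project.reporting.v_daily_revenue` AS
SELECT sale_date, store_id, gross_revenue
FROM `my-project.warehouse.daily_sales_summary`;

-- Grant the BI tool's dedicated service account read access to the
-- reporting dataset only, not the underlying warehouse dataset.
GRANT `roles/bigquery.dataViewer`
ON SCHEMA `my-project.reporting`
TO "serviceAccount:looker-sa@my-project.iam.gserviceaccount.com";
```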
If you're working with a data mesh, you should also assign the service account access to Data Catalog and Dataplex so that your data assets integrate seamlessly with your visualization software and can take advantage of automated metadata generation and federated governance. By using GCP best practices to manage, develop, and engineer your data services, you can create a best-in-class data engineering solution that drives data-driven decision making and improves profitability.
Although using a service account to access BigQuery is ideal, it is possible to use pass-through OAuth authentication in conjunction with IAM and Google Workspace to authenticate users and access data. This enables advanced features such as dynamic masking and more sophisticated audit logging to track user access and activity. It is a more sophisticated, and more complex, approach that requires users to have OAuth access to GCP and BigQuery. These data access controls are available within both GCP and Looker, so it is largely a question of who you want controlling access and where in the stack.
Cloud DLP Considerations For Data Visualization Software
Cloud DLP, especially when used alongside Data Catalog and Dataplex, is a powerful tool for ensuring the integrity of sensitive data within your organization. Cloud DLP can automatically scan BigQuery datasets, identify sensitive data, and apply masking or column-level security to prevent access. If you are using OAuth via Google Workspace and Looker to access BigQuery, then BigQuery can apply dynamic masking to prevent unauthorized access to certain fields. This can streamline your operation by making data sharing more secure, efficient, and clean. Cloud DLP can also automatically de-identify certain fields throughout the data stack and measure re-identification risk.
It is possible to control access to sensitive fields within Looker as well. Looker has its own set of roles and can control access to datasets with its own authentication service. Additionally, you can create multiple views with varying levels of security and allow certain users access to certain views based upon their Looker role.