Using a data lake
A data lake can be a great solution for data storage and architecture. Although there are many ways to develop a data lake, there are a few best practices for managing one in GCP.
Topics include:
- Managing the lake (configuring data discovery, access, and cost controls)
- Processing data
- Monitoring the data lake
Managing the lake (configuring data discovery, access, and cost controls)
A data lake is where you store raw and curated data before it is loaded into your data warehouse for analysis. Effective use of data lifecycle management and data cataloging will ensure that your data can be easily discovered and utilized.
Managing the lake (configuring data discovery, access, and cost controls)
Data lakes, like the data engineering discipline itself, have evolved over the years along with cloud technology. Understanding, managing, and organizing a data lake is an essential skill for any data engineer operating within GCP. Data lakes store the raw data which feeds downstream data warehouses, data marts, and, eventually, analytical tools such as Looker for reporting or feature sets for machine learning/AI. Data lakes can contain gigabytes, terabytes, or even petabytes of data from potentially dozens or even hundreds of sources. Without good governance and controls over how data is propagated to, stored in, and accessed within your data lake, your data can quickly devolve from an asset into a liability.
Within GCS, combine effective lifecycle management with smart querying to control costs in your storage layer.
Data lakes are sometimes organized into domain groups which group data by subject area, such as Sales, Analytics, or Data Science. This can be facilitated through well-organized prefixes, bucket structures, and cataloging.
GCP for Data Lakes
Data lakes can be built with various technologies, but the most common choice is GCS since it is highly flexible for a wide range of data structures and access patterns. Lifecycle management is also very easy in GCS, with great tooling and optionality available for building a high-performance and cost-optimized data lake. Data within GCS is managed through effective use of prefixes to delineate data sources and enable easy downstream identification of, and access to, datasets. This method is well suited to data engineering solutions such as external tables in BigQuery, or as a data source for Dataproc or Dataflow.
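For illustration, the following is a minimal sketch of applying lifecycle rules to a data lake bucket with the google-cloud-storage Python client; the bucket name and age thresholds are hypothetical and would depend on your own retention policies.

```python
# A minimal sketch: apply lifecycle rules to a (hypothetical) data lake bucket
# so aging data automatically moves to cheaper storage classes.
from google.cloud import storage

client = storage.Client()
bucket = client.get_bucket("my-data-lake-raw")  # hypothetical bucket name

# Move objects to Nearline after 30 days, then delete them after 365 days.
bucket.add_lifecycle_set_storage_class_rule("NEARLINE", age=30)
bucket.add_lifecycle_delete_rule(age=365)
bucket.patch()  # persist the updated lifecycle configuration on the bucket
```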
BigLake Tables
BigLake is a newer GCP technology which extends BigQuery over cloud provider object stores (GCS/S3/Azure Blob). Think of BigLake as an evolution of a BigQuery external table which enables advanced data governance features such as fine-grained access controls and metadata caching. BigLake tables are great for facilitating a hybrid or multicloud architecture. Additionally, BigLake tables integrate with Analytics Hub for easy cataloging, discovery, and access.
Metadata caching can greatly speed up querying of BigLake tables. If enabled, BigLake caches metadata from your external source, which allows it to locate a given data source with a high degree of precision. Combine this with hive partitioning to create a high-performing data lake which rivals the performance of pulling data directly into BigQuery. Effective use of prefix filtering against partitioned data can make querying external tables remarkably cheap. This, combined with data formats such as Parquet, can enable a highly functional, efficient, and flexible solution to a wide range of data processing tasks.
BigLake tables can be used to read data directly from other public cloud providers via BigQuery Omni. This enables easy multi-cloud analytics without requiring the data to be transferred into GCS first.
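As an example, here is a minimal sketch of creating a BigLake table over Parquet files with metadata caching enabled, issued as GoogleSQL DDL through the BigQuery Python client. The project, dataset, bucket, and connection names are hypothetical, and it assumes a Cloud resource connection has already been created.

```python
# A minimal sketch: create a BigLake table over (hypothetical) Parquet files
# in GCS with metadata caching enabled.
from google.cloud import bigquery

client = bigquery.Client(project="my-project")

ddl = """
CREATE OR REPLACE EXTERNAL TABLE `my-project.lake.sales_events`
WITH CONNECTION `my-project.us.biglake-conn`
OPTIONS (
  format = 'PARQUET',
  uris = ['gs://my-data-lake/conformed/sales/*'],
  max_staleness = INTERVAL 4 HOUR,      -- how stale cached metadata may be
  metadata_cache_mode = 'AUTOMATIC'     -- let BigQuery refresh the cache
)
"""
client.query(ddl).result()  # wait for the DDL job to complete
```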
GCP Data Catalog
Data Catalog is a powerful tool for data indexing, tagging, and discovery. Data Catalog works by applying "tags" to data fields, sources, or tables as a form of metadata. These tags can be used when searching for data within your data lake to facilitate data discovery, as well as to control access to certain fields or datasets. Use Data Catalog to manage data warehouses and data lakes of any size or scope. As your data grows, you will find that effective data cataloging is not only convenient, it is essential to ensuring easy, standardized, and regular access to your data.
Data Catalog can work alongside Sensitive Data Protection to apply data masking or other data sensitivity controls within your data lake based upon assigned tags.
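As a quick illustration, here is a minimal sketch of data discovery with the Data Catalog Python client; the project id and the "sales" search term are hypothetical.

```python
# A minimal sketch: search Data Catalog for data lake assets whose name
# contains "sales" (hypothetical project id and search term).
from google.cloud import datacatalog_v1

client = datacatalog_v1.DataCatalogClient()

scope = datacatalog_v1.SearchCatalogRequest.Scope(
    include_project_ids=["my-project"]
)

results = client.search_catalog(scope=scope, query="name:sales")
for result in results:
    # Each result points back to the underlying entry (table, fileset, etc.).
    print(result.relative_resource_name)
```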
Processing data
Data processing is handled by a compute layer of some kind which interacts with your data lake and provides the transformations required to achieve the desired end shape of your data. Use data processing to access, transform, and deposit your data in the data lake in preparation for downstream analysis. Using clearly delineated areas in your data lake will ensure a high-quality and high-performance data processing infrastructure.
Processing Data
Processing data contained within a data lake typically operates on data at rest: data is read into a processor such as Dataproc, Dataflow, BigQuery, or another data warehouse, transformed, and then written back to the data lake as a processed dataset. This often takes the shape of a raw-to-conformed layer or a conformed-to-modeled layer. This high degree of organization and clearly delineated data spaces help to ensure that data is not stored incorrectly or in a manner which violates compliance standards. It is also vital to separate your storage and compute layers, not only for performance reasons, but also for security and access control. The end result is a smooth and efficient data processing architecture which will drive decision making at your organization.
BigQuery External Tables
BigQuery external tables are perfect for acting upon data stored within your data lake. External tables allow BigQuery to read directly from your data lake without having to pass the data through any other medium, such as Dataflow or Pub/Sub. Use hive partitioning to pull only the data you need for processing (such as a daily batch for a given dataset) and ensure a highly efficient and optimized query. Data exposed through a BigQuery external table can be operated on using GoogleSQL, the standard SQL dialect in BigQuery. This makes data processing highly approachable for most users who are familiar with SQL, such as analysts and data scientists. It also provides easy access to advanced tools such as BigQuery ML, Data Catalog, Analytics Hub, and Looker Studio.
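The following minimal sketch defines a hive-partitioned external table over the data lake with the BigQuery Python client and queries a single partition; the table id, bucket layout, column names, and the dt partition key are all hypothetical.

```python
# A minimal sketch: define a hive-partitioned external table over the
# (hypothetical) data lake and query only one partition of it.
from google.cloud import bigquery

client = bigquery.Client()

external_config = bigquery.ExternalConfig("PARQUET")
external_config.source_uris = ["gs://my-data-lake/raw/orders/*"]

hive_opts = bigquery.HivePartitioningOptions()
hive_opts.mode = "AUTO"  # infer partition keys such as dt=YYYY-MM-DD
hive_opts.source_uri_prefix = "gs://my-data-lake/raw/orders/"
external_config.hive_partitioning = hive_opts

table = bigquery.Table("my-project.lake.orders_external")
table.external_data_configuration = external_config
client.create_table(table, exists_ok=True)

# Filtering on the partition column keeps the scan small and cheap.
query = """
SELECT order_id, amount
FROM `my-project.lake.orders_external`
WHERE dt = '2024-01-01'
"""
rows = client.query(query).result()
```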
Dataproc
Dataproc is another great tool for data processing which is optimized for data lakes and data stored within GCS. Use Dataproc Serverless to effortlessly run Spark jobs upon massive datasets consisting of billions of rows via a simple Jupyter notebook. Combine Dataproc with GPUs to build and run complex and sophisticated data science workloads, including deep learning, generative AI, and LLMs.
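For instance, here is a minimal sketch of submitting a Dataproc Serverless batch (PySpark) job against files in the data lake; the project, region, and GCS paths are hypothetical.

```python
# A minimal sketch: submit a Dataproc Serverless PySpark batch that processes
# (hypothetical) data lake files in GCS.
from google.cloud import dataproc_v1

client = dataproc_v1.BatchControllerClient(
    client_options={"api_endpoint": "us-central1-dataproc.googleapis.com:443"}
)

batch = dataproc_v1.Batch()
batch.pyspark_batch.main_python_file_uri = "gs://my-data-lake/jobs/transform.py"
batch.pyspark_batch.args = ["--input", "gs://my-data-lake/raw/orders/"]

operation = client.create_batch(
    parent="projects/my-project/locations/us-central1",
    batch=batch,
)
print(operation.result().state)  # block until the batch finishes
```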
Dataflow
Cloud Dataflow can be used to pull data from a GCS-hosted data lake and perform processing upon the data. Use Dataflow to easily merge batch data from your data lake with streaming data from sources such as Pub/Sub to enrich your data and provide a valuable and lightweight approach to data processing.
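Below is a minimal sketch of a Dataflow (Apache Beam) batch pipeline reading from the data lake; the bucket paths and filter logic are hypothetical, and a streaming source such as Pub/Sub could be added with beam.io.ReadFromPubSub in streaming mode.

```python
# A minimal sketch: a Beam pipeline that reads raw JSON lines from the
# (hypothetical) data lake, filters them, and writes a conformed output.
import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions

options = PipelineOptions(
    runner="DataflowRunner",  # or "DirectRunner" for local testing
    project="my-project",
    region="us-central1",
    temp_location="gs://my-data-lake/tmp/",
)

with beam.Pipeline(options=options) as pipeline:
    (
        pipeline
        | "ReadRaw" >> beam.io.ReadFromText("gs://my-data-lake/raw/events/*.json")
        | "KeepErrors" >> beam.Filter(lambda line: '"status": "error"' in line)
        | "WriteConformed" >> beam.io.WriteToText("gs://my-data-lake/conformed/errors/part")
    )
```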
Cloud Data Fusion
Cloud Data Fusion makes building advanced data processing workloads in GCP easy with its serverless infrastructure, simplified but capable UI, and low-code architecture. Cloud Data Fusion is a great choice for analysts or other users who aren't experts in SQL or Python.
Dataprep
Dataprep is another tool, similar to Data Fusion, which provides an approachable way to build advanced data processing workflows in GCP. Use Dataprep to perform data profiling, processing, and cleansing without having to code. You can then take your solution and convert it into a Dataflow job straight from the UI.
Monitoring the data lake
Monitoring your data lake means understanding who is accessing which data, how often, and for which purpose. Take advantage of GCP's logging and monitoring services to observe your data processing pipelines. Additionally, you should monitor your data lake for cost efficiency and be aware of stale data which could be pushed to a cheaper storage class.
Monitoring the Data Lake
GCP has a number of tools to effectively monitor your data lake for cost, security, and compliance. Additionally, GCP provides a robust data processing infrastructure for managing SLIs via a unified and centralized API.
Monitoring Storage Costs
Use Cloud Billing to monitor your storage costs and identify services where you can potentially lower your bill. By taking advantage of Data Catalog you can assign tags to your data assets as they move through storage classes, which can help ensure that your data is classified correctly to achieve an optimal cost outcome. You can use the Pricing Calculator to estimate your cloud costs for any configuration, including GCS.
Audit Logging for GCS Access
You can maintain positive control over who has access to your data within GCS with Audit Logging. Audit Logs for GCS can show you who (which principal) performed an operation on your managed data storage container or any actions performed on a given object. This is useful not only for security, but also for spotting a process or system performing unoptimized queries against your datasets. For example, if a BigQuery process should be using hive-partitioned data to pull only the last day's worth of data, but is instead querying the entire bucket, this could be blocking features such as Autoclass from moving data to a cheaper storage class.
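As an example, here is a minimal sketch of pulling recent object-read entries from the GCS Data Access audit logs with the Cloud Logging Python client, assuming Data Access audit logging is enabled for Cloud Storage; the project and bucket names are hypothetical.

```python
# A minimal sketch: list recent object reads on a (hypothetical) data lake
# bucket from the GCS Data Access audit logs.
from google.cloud import logging as cloud_logging

client = cloud_logging.Client(project="my-project")

log_filter = (
    'resource.type="gcs_bucket" '
    'AND resource.labels.bucket_name="my-data-lake" '
    'AND protoPayload.methodName="storage.objects.get"'
)

for entry in client.list_entries(filter_=log_filter, page_size=20):
    # The audit payload includes the principal and the object that was read.
    print(entry.timestamp, entry.payload)
```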
Monitoring Data Processing
Data processing tasks should be properly observed, logged, and maintained for a number of reasons, but primarily cost and efficiency. Use Cloud Logging and Cloud Monitoring with alerting to discover, isolate, and report on any processes which might be inefficient or detrimental, such as a Spark job performing complex and inefficient transformations against large datasets (for example, a Cartesian join). You can also monitor memory usage, CPU usage, and network traffic for almost every service in GCP. This monitoring is essential when you are trying to ensure peak performance and cost optimization.
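For example, here is a minimal sketch of reading a CPU utilization time series from Cloud Monitoring; the project id and the one-hour lookback window are hypothetical.

```python
# A minimal sketch: pull the last hour of CPU utilization from Cloud Monitoring
# for a (hypothetical) project, e.g. to feed a dashboard or an ad hoc check.
import time
from google.cloud import monitoring_v3

client = monitoring_v3.MetricServiceClient()
project_name = "projects/my-project"

now = int(time.time())
interval = monitoring_v3.TimeInterval(
    {"end_time": {"seconds": now}, "start_time": {"seconds": now - 3600}}
)

results = client.list_time_series(
    request={
        "name": project_name,
        "filter": 'metric.type = "compute.googleapis.com/instance/cpu/utilization"',
        "interval": interval,
        "view": monitoring_v3.ListTimeSeriesRequest.TimeSeriesView.FULL,
    }
)
for series in results:
    for point in series.points:
        print(point.interval.end_time, point.value.double_value)
```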
Dataproc
Cloud Monitoring can be enabled to effectively and efficiently monitor Dataproc for performance, health, uptime, and resource usage for most processing tasks. Monitoring Dataproc is a good practice because Dataproc can autoscale to a massive degree, at which point costs can become difficult to control.
Cloud Logging can help you observe data flowing through your systems and help you identify sources of errors or bugs. Use Cloud Logging to centralize and aggregate your logs to produce performance metrics for your Dataproc jobs. You can also set up your own custom error logging via the Error Reporting API. This means that you can report errors in the data itself even when no errors occur in the processing infrastructure.
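As an illustration, here is a minimal sketch of reporting a custom data-quality error through the Error Reporting Python client; the zero-row check and message are hypothetical.

```python
# A minimal sketch: surface a data-quality problem in Error Reporting even
# though the processing job itself completed without exceptions.
from google.cloud import error_reporting

client = error_reporting.Client(project="my-project")

rows_loaded = 0  # hypothetical result of a daily load step
if rows_loaded == 0:
    client.report("Daily orders load produced zero rows for dt=2024-01-01")
```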
Dataflow
As with Dataproc, you can use Cloud Monitoring and Cloud Logging to effectively manage most issues with Dataflow. One key difference is that Dataflow is a serverless solution, which removes the need to monitor or manage resource usage for data processing tasks or workflows (though this can still be enabled manually if desired). You can use Cloud Monitoring to collate metrics for your jobs and monitor job status. This can be useful for anomaly detection and capture, or to identify potential sources of failure in your jobs or pipelines.
Use Cloud Logging to identify and correct errors in Dataflow, or to emit log messages to Cloud Logging from within a Dataflow task. This can be useful for producing checkpoints or custom metrics for your Dataflow pipelines. These logs can then be easily accessed either through the Dataflow monitoring UI or via the Cloud Logging UI.
Processing SLIs and Data Quality Checks
GCP can automatically monitor, detect, capture, and aggregate Service Level Indicators for both Dataflow and Dataproc. This provides a more targeted and convenient method for observation than trying to use Cloud Logging for SLI capture and compliance in support of SRE objectives.
There are two SLIs reported by Dataproc and Dataflow:
- Correctness - a measure of how many processing errors the pipeline incurs.
- Freshness - a measure of how quickly data is processed.
With a simple API call you can report the number of correct data points and how quickly data passes through the system from either Dataflow or Dataproc. Measure things like job duration, data points analyzed, data quality, data freshness, and other metrics quickly and easily. Then integrate these metrics into your larger SLA monitoring service to ensure a high-quality and high-performing data engineering architecture and service.
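As one possible approach, here is a minimal sketch of publishing a correctness count as a custom Cloud Monitoring metric after a pipeline run; the metric name and value are hypothetical.

```python
# A minimal sketch: publish a correctness SLI as a custom Cloud Monitoring
# metric once a pipeline run completes (hypothetical metric name and value).
import time
from google.cloud import monitoring_v3

client = monitoring_v3.MetricServiceClient()
project_name = "projects/my-project"

series = monitoring_v3.TimeSeries()
series.metric.type = "custom.googleapis.com/pipeline/correct_records"
series.resource.type = "global"

point = monitoring_v3.Point(
    {
        "interval": {"end_time": {"seconds": int(time.time())}},
        "value": {"int64_value": 98231},  # records that passed quality checks
    }
)
series.points = [point]

client.create_time_series(name=project_name, time_series=[series])
```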