
Optimizing resources

Analyze and optimize compute requirements for the business, accounting for business-critical data processes, development, and general data processing.


Topic Contents

Minimizing costs per required business need for data
Ensuring that enough resources are available for business-critical data processes
Deciding between persistent or job-based data clusters (e.g., Dataproc)


Minimizing costs per required business need for data

Data processing and data storage costs can make or break a budget for some organizations. At larger corporations you can easily see spend in the millions of dollars per month for data storage and processing. Be sure to minimize the costs of your ongoing development and operations to provide the business with scalable data technology.


Minimizing costs per required business need for data

GCP offers a flexible development and operating environment that is well suited to a diverse array of programming paradigms and architectures. Scalability and good planning can help developers map out, develop, test, deploy, and operationalize workloads using fewer resources. GCP also documents a few best practices to facilitate efficient data engineering operations.

Cloud Billing API

Before you can improve your operational efficiency, you must have a full 360-degree view of your current spend by component. The Cloud Billing API is a global, cloud-level resource that gives you a rich and informative system to observe, analyze, and control costs in GCP. Costs in GCP are broken down by account and SKU (resource). Cloud Billing lets you observe costs for any API in GCP, whether that is BigQuery, Dataproc, Cloud Data Fusion, or any other service. Use Cloud Billing's UI to monitor monthly costs and ensure there is no anomalous spend for a given SKU. A deep dive into Cloud Billing is beyond the scope of the exam, but there are some cost-saving measures you can take to ensure that your costs remain reasonable and efficient.
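For deeper analysis, Cloud Billing can also export detailed cost data to BigQuery. The sketch below is a minimal example using the BigQuery Python client; the project, dataset, and table names are placeholders, and the columns assume the standard billing export schema.

# Minimal sketch: summarize the last 30 days of spend by service and SKU
# from a Cloud Billing export table in BigQuery. Replace the placeholder
# table name with your own billing export table.
from google.cloud import bigquery

client = bigquery.Client()

query = """
    SELECT
      service.description AS service,
      sku.description AS sku,
      SUM(cost) AS total_cost
    FROM `my-project.billing_export.gcp_billing_export_v1_XXXXXX`  -- placeholder
    WHERE usage_start_time >= TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 30 DAY)
    GROUP BY service, sku
    ORDER BY total_cost DESC
    LIMIT 20
"""

for row in client.query(query).result():
    print(f"{row.service} / {row.sku}: ${row.total_cost:,.2f}")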

BigQuery

BigQuery Standard

GCP has made strides in making BigQuery very efficient, both computationally and economically. The BigQuery cost components break down as follows:

  • Storage - Ongoing storage costs (comparable to GCS), broken down by class. Includes:
    • Data Ingress - Batch loading is always free; streaming inserts and the Storage Write API are charged at a fixed rate based on the volume of data ingested.
    • Data Egress - Free up to 50 TiB (tebibytes) per day for batch exports, then a flat fee per GiB exported from BigQuery beyond the free tier. The price adjusts upwards if the data is egressed across regions or internationally.
    • Active Storage - Any table or partition modified within the last 90 days. (Querying a table does not reset the storage timer.)
    • Long-Term Storage - Any table or partition not modified for 90 consecutive days. The storage price then drops by 50%.
    • External Storage - External table storage is billed by the hosting service (such as GCS), not by BigQuery.
  • Compute - Costs to process a given set of data through SQL statements or routines such as UDFs.
    • On-Demand Pricing - a flat fee per TiB of data processed; the first TiB processed each month is free.
    • Capacity Pricing - charged per slot-hour to run queries, measured in slots (virtual CPUs) over time. The actual rate depends on which BigQuery Edition you are using.
    • Slot Pricing - a fixed reservation of dedicated query slots (a commitment) that you purchase in advance. This can help lower costs if you have a predictable workload each month. Standard-tier pricing applies after the reserved slots are used up.

BigQuery ML

BigQuery ML is substantially more expensive byte for byte than standard querying. This is no doubt due to the heavy data processing required to train models and run predictions against them. The first 10 GiB of data processed by CREATE MODEL statements each month is free.

BI Engine

BI Engine is charged based on the amount of in-memory capacity you reserve for the engine.

Best Practices for Cost Optimization

BigQuery is a very powerful tool, but costs can quickly spiral out of control unless you make a conscientious attempt to manage costs. Use the following best practices to ensure that you are efficiently using BigQuery:

  • Only query the data that you need. BigQuery charges by byte processed against the columns of data selected. Ensure that you are only selecting the columns you need when querying data. You should almost never use SELECT * unless you know for sure that you will need all columns.
  • Use the query dry-run feature to get an accurate estimate of the bytes a query will process before actually executing it. This estimate is shown automatically in the BigQuery console, and can also be requested through the client libraries (see the sketch after this list).
  • Make efficient and smart use of partitioning and clustering for your tables. Partitioning breaks a table into queryable chunks, which limits the data processed by BigQuery. Clustering sorts the data within each partition by the cluster keys so that BigQuery can skip blocks outside your filter range. This is great for time-series analysis where the table is partitioned by day and clustered by timestamp. If possible, consider using partition expiration to automatically trim stale or unnecessary data from a table.
  • Know that LIMIT does NOT stop BigQuery from scanning an entire table or partition. LIMIT merely prunes the result set delivered to the user after the query has run.
  • It can be smart to materialize your result sets into intermediate tables to limit repeated data processing.
  • Quotas and limits cap how much of a given resource your project can use. This gives a hard stop on querying capacity for any user.
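As a minimal sketch of the dry-run approach (using the BigQuery Python client; the table, partition filter, and on-demand rate below are placeholder assumptions), you can estimate the bytes a query would scan and convert that into an approximate on-demand cost:

# Minimal sketch: dry-run a query to estimate bytes scanned, then convert
# that to an approximate on-demand cost. The $/TiB rate is a placeholder --
# check current BigQuery on-demand pricing for your region.
from google.cloud import bigquery

client = bigquery.Client()

job_config = bigquery.QueryJobConfig(dry_run=True, use_query_cache=False)
query = """
    SELECT user_id, event_ts
    FROM `my-project.analytics.events`          -- placeholder table
    WHERE event_date = '2024-01-15'             -- partition filter limits bytes scanned
"""
job = client.query(query, job_config=job_config)  # dry run: nothing is executed or billed

tib_processed = job.total_bytes_processed / 2**40
assumed_price_per_tib = 6.25  # placeholder on-demand rate, USD per TiB
print(f"Estimated scan: {tib_processed:.4f} TiB "
      f"(~${tib_processed * assumed_price_per_tib:.2f} at the assumed rate)")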

Compute Engine

Compute Engine pricing is highly variable depending upon which machine types, CPUs, persistent disks, GPUs, and memory you choose. Digging deep into the various options for GCE is not necessary for the exam; knowing the overarching services is more important. Still, there are some guidelines you should follow.

  • Take advantage of autoscaling with managed instance groups to handle large changes in workload, but set limits on autoscaling so you do not exceed your cost budget.
  • Use the right tool for the right job. For Dataproc, use high-memory instance types instead of general-purpose or compute-optimized ones. For Cloud SQL, aim for a good balance between compute and memory based on the projected number of requests.
  • If running Dataproc on GCE, take advantage of preemptible VM instances for batch workloads. Preemptible VM instances use Spot pricing, which is cheaper than on-demand pricing. You can even add GPUs to preemptible VM instances.
  • Pay attention to Google's Machine Type Recommendations. This can help you right-size an instance for your usual workloads.

Dataflow

Dataflow is a serverless data processing service. Dataflow will execute your workload and charge you based on the following parameters.

  • Worker CPU and memory - billed as the underlying GCE resources
  • Dataflow Shuffle data processed for batch workloads - charged by the volume of data shuffled
  • Streaming Engine data processed for streaming workloads - charged by the volume, complexity, and number of stages required

FlexRS can lower your processing costs by 40% if you allow your workloads to execute within a 6-hour window; the actual execution time is chosen by Google. FlexRS is the cheapest option, standard batch is second, and streaming is the most expensive.

Dataflow Prime is a newer offering that translates your compute costs into Data Compute Units (DCUs), which act more as a meter and can make costs easier to observe and manage. 1 DCU is roughly equivalent to running 1 vCPU and 4 GB of memory for 1 hour.
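As a back-of-the-envelope heuristic based only on that definition (this is not Dataflow Prime's actual metering formula, and the worker shape, runtime, and DCU rate below are placeholder assumptions), you can roughly gauge a job's DCU consumption:

# Rough DCU estimate for a Dataflow Prime job, using the definition above:
# 1 DCU ~ 1 vCPU plus 4 GB of memory running for 1 hour. This is a rough
# heuristic only; all values below are placeholder assumptions.
vcpus_per_worker = 4
memory_gb_per_worker = 16
num_workers = 10
runtime_hours = 2.5

# For a worker shaped close to the 1 vCPU : 4 GB ratio, DCUs per hour is
# roughly the larger of its vCPU count and its memory in 4 GB units.
dcus_per_worker_hour = max(vcpus_per_worker, memory_gb_per_worker / 4)
total_dcus = dcus_per_worker_hour * num_workers * runtime_hours

assumed_usd_per_dcu = 0.06  # placeholder rate; check current Dataflow Prime pricing
print(f"Estimated usage: {total_dcus:.0f} DCUs "
      f"(~${total_dcus * assumed_usd_per_dcu:.2f} at the assumed rate)")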

Dataproc

Dataproc Standard

Dataproc is GCP's powerful Hadoop and Spark data processing engine. Unlike BigQuery, pricing is highly dependent upon your specific architecture and configuration. You have a bit more control over the price components, but it is more complex and time-consuming to manage and maintain.

The Dataproc pricing formula for GCE is: (Dataproc rate per vCPU-hour) x (number of vCPUs) x (hourly duration). This applies to both the standard (GCE) and GKE implementations of Dataproc. With GKE there will often be more vCPUs due to autoscaling node pools, which would likely mean a higher cost per wall-clock hour.

These prices do not account for the costs of the underlying resources themselves (VMs, disks, networking); they cover only the Dataproc service fee for the processing capacity.
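A short worked example of that formula follows; the rate, machine shapes, and runtime are placeholder assumptions, so check current Dataproc pricing for real numbers.

# Worked example of the Dataproc service fee formula above:
#   fee = rate per vCPU-hour x total vCPUs x hours the cluster runs
# The rate, cluster shape, and runtime are placeholder assumptions, and the
# result excludes the underlying Compute Engine VM, disk, and network charges.
assumed_rate_per_vcpu_hour = 0.010            # placeholder USD rate

master_vcpus = 1 * 4                          # e.g., one 4-vCPU master (assumed)
worker_vcpus = 4 * 4                          # e.g., four 4-vCPU workers (assumed)
total_vcpus = master_vcpus + worker_vcpus     # 20 vCPUs

runtime_hours = 3
dataproc_fee = assumed_rate_per_vcpu_hour * total_vcpus * runtime_hours
print(f"Dataproc service fee: ${dataproc_fee:.2f} "
      f"(Compute Engine resources are billed separately)")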

Dataproc Serverless

Dataproc Serverless is more complicated: you are charged for the processing you actually perform rather than a flat fee per compute unit. Dataproc Serverless for Spark pricing is based on the number of Data Compute Units (DCUs), the number of accelerators used, and the amount of shuffle storage used. DCUs, accelerators, and shuffle storage are billed per second, with a 1-minute minimum charge for DCUs and shuffle storage, and a 5-minute minimum charge for accelerators.

Best Practices for Cost Optimization

Dataproc pricing depends heavily on the amount, and the behavior, of whichever compute resources you use to perform your processing tasks. Dataproc Serverless or even BigQuery can often be a better solution than standard Dataproc for many workloads, especially analytical workloads or reporting. Dataproc can be a cheaper and more performant solution for machine learning workloads, if you are willing to do all the coding yourself. If you are running pure Spark jobs, consider using Dataproc Serverless with data hosted in BigQuery or GCS.

Consider the below best practices:

  • Right-sizing a Dataproc cluster takes a keen eye and experience to know how much compute is required to run a given job. Since Dataproc, especially Spark, is very memory-hungry relative to compute, use memory-optimized VMs.
  • If you are running batch jobs that can be parallelized, consider using preemptible GCP instances. This allows GCP to efficiently allocate compute resources to your job and can save significant dollars.
  • Choose between Standard, Balanced, or SSD persistent disks depending upon how dynamic your data reads are. The more standardized and batch-oriented your workloads are, the cheaper they will be. If you are running consistent jobs with a guaranteed workload, then using Standard disks can save you a lot of money byte for byte.
  • If you can tolerate faults or preemption, consider using secondary workers instead of primary workers (see the sketch after this list).
  • If you are not running heavy machine learning algorithms, then you do not need GPUs. On the other hand, if you are running heavy machine learning algorithms, then you really should use GPUs. Although they cost more than CPUs alone, they will significantly speed up your operations, which can lower the average processing cost on a per-job basis.
  • Autoscaling can help you handle spikes in workload, but can significantly increase costs. Consider using ephemeral clusters to tightly control clusters and keep resources up only as long as you need them.
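Below is a minimal sketch of an ephemeral cluster with preemptible secondary workers, assuming the google-cloud-dataproc Python client; the project, region, machine shapes, and idle TTL are placeholder assumptions.

# Minimal sketch (assumes the google-cloud-dataproc client library): create a
# cluster with preemptible secondary workers and an idle-delete TTL so it tears
# itself down when unused. Project, region, and machine shapes are placeholders.
from google.cloud import dataproc_v1
from google.protobuf import duration_pb2

project_id, region = "my-project", "us-central1"   # placeholders

client = dataproc_v1.ClusterControllerClient(
    client_options={"api_endpoint": f"{region}-dataproc.googleapis.com:443"}
)

cluster = {
    "project_id": project_id,
    "cluster_name": "ephemeral-batch-cluster",
    "config": {
        "master_config": {"num_instances": 1, "machine_type_uri": "n2-highmem-4"},
        "worker_config": {"num_instances": 2, "machine_type_uri": "n2-highmem-4"},
        # Cheaper, fault-tolerant capacity for parallelizable batch stages.
        "secondary_worker_config": {"num_instances": 4, "is_preemptible": True},
        # Delete the cluster automatically after 30 minutes of inactivity.
        "lifecycle_config": {"idle_delete_ttl": duration_pb2.Duration(seconds=1800)},
    },
}

operation = client.create_cluster(
    request={"project_id": project_id, "region": region, "cluster": cluster}
)
print(operation.result().cluster_name)

Pairing preemptible secondary workers with an idle-delete TTL keeps cheap capacity available while the job runs and ensures the cluster tears itself down once it sits unused.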

Cloud SQL

Cloud SQL, like Dataproc, is highly dependent upon the underlying architecture. Essentially you pay for the VM that runs the database separately from the Cloud SQL service itself. Cloud SQL is separated into two editions:

  • Cloud SQL Enterprise:
    • 1-96 vCPUs
    • 1:6.5 vCPU-to-memory ratio
    • 99.95% SLA
    • < 60 seconds of planned downtime
    • Up to 624 GB of memory
  • Cloud SQL Enterprise Plus - best for high-performance workloads. It offers:
    • Up to 128 vCPUs
    • 1:8 vCPU-to-memory ratio
    • 99.99% SLA
    • < 10 seconds of planned downtime
    • 2x read and write performance
    • Up to 824 GB of memory

Cloud SQL is always on and the cost is not workload dependent. Therefore, it can't be optimized aside from right-sizing the underlying infrastructure. Use Cloud Monitoring to see if you can potentially migrate to a lower cost instance size.
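As a minimal sketch of that monitoring check (assuming the google-cloud-monitoring Python client; the metric name, look-back window, and threshold are assumptions to adapt to your environment), you could pull recent CPU utilization to judge whether an instance is oversized:

# Minimal sketch (assumes the google-cloud-monitoring client): pull a week of
# Cloud SQL CPU utilization to judge whether the instance is oversized.
import time
from google.cloud import monitoring_v3

project_id = "my-project"  # placeholder
client = monitoring_v3.MetricServiceClient()

now = int(time.time())
interval = monitoring_v3.TimeInterval(
    {"end_time": {"seconds": now}, "start_time": {"seconds": now - 7 * 24 * 3600}}
)

series = client.list_time_series(
    request={
        "name": f"projects/{project_id}",
        "filter": 'metric.type = "cloudsql.googleapis.com/database/cpu/utilization"',
        "interval": interval,
        "view": monitoring_v3.ListTimeSeriesRequest.TimeSeriesView.FULL,
    }
)

for ts in series:
    values = [point.value.double_value for point in ts.points]
    if values and max(values) < 0.40:  # arbitrary threshold for illustration
        print(f"{ts.resource.labels.get('database_id', 'unknown')}: "
              f"peak CPU {max(values):.0%} -- candidate for a smaller tier")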

Other Services

Dataplex and Data Catalog use a pay-as-you-go model based upon DCU usage, similar to Dataproc Serverless. You also pay for metadata storage. These costs are typically quite small.

Cloud Functions are charged by invocation and compute time. The first 2 million invocations per month are free, and $0.40 per million invocations after that. Compute time depends on the hardware chosen to process the request and scales accordingly.

Cloud Data Fusion is priced based upon pipeline development and execution. There are three editions, priced per instance per hour. Since Data Fusion is a managed service, you don't have much control over the pricing beyond the edition you select and the amount of time the instance runs:

  • Developer
  • Basic
  • Enterprise



Ensuring that enough resources are available for business-critical data processes

Sufficient resources should be allocated to data engineering operations. Use Cloud Monitoring and Cloud Logging to set up alerts and create action plans in case resources become constrained.


Resizing And Autoscaling Resources

A key advantage of the cloud is the easy, near-infinite scalability of its infrastructure and resources. Proper right-sizing of cloud infrastructure can save you time and money as well as improve the experience for users. Nearly all resources in GCP are scalable and resizable in some way. As always, the more you use managed services in GCP, the less infrastructure management you will have to contend with and the less overhead will be required.

Compute Engine

Compute Engine was built from the start to be highly flexible, redundant, and scalable. The underlying technology enabling and governing autoscaling in GCP is agnostic of the particular workload running on the hardware. Whether you are running a Dataproc cluster, a GKE cluster, a web application, or a custom solution, Compute Engine instances autoscale the same way: with a load balancer and a managed instance group.

Autoscaling of compute resources is usually tied to a tangible metric captured via Cloud Monitoring, such as CPU utilization, network traffic (e.g., HTTP load balancing serving capacity), disk utilization, or, as is often the case with Spark on Dataproc, memory pressure. When a configured threshold is hit, the instance group scales out according to the parameters you set when configuring the autoscaling policy. It is also common practice to implement scale-down/scale-in policies to shed resources when they are no longer heavily utilized. One benefit of using an autoscaling group is that you can effectively resize your VMs without any service disruption.

In addition to the number of machines in your MIG (managed instance group), you may want to adjust individual components. This could be true if you are operating a database on a custom instance or running a single-node Spark cluster. It is possible to increase the size of a persistent disk while the machine is running without causing much issue. Be sure to create a snapshot of the disk volume first as a backup. If you need to resize or rebuild an instance, you can use the snapshot to create a new VM.
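A minimal sketch of that snapshot-then-resize flow, assuming the google-cloud-compute Python client (the project, zone, disk name, and target size are placeholders, and persistent disks can only be grown, never shrunk):

# Minimal sketch (assumes the google-cloud-compute client): snapshot a disk as
# a backup, then grow it in place. Project, zone, and disk names are placeholders.
from google.cloud import compute_v1

project, zone, disk_name = "my-project", "us-central1-a", "data-disk"  # placeholders

disks = compute_v1.DisksClient()

# 1. Snapshot the disk first as a backup.
snapshot = compute_v1.Snapshot(name=f"{disk_name}-pre-resize")
disks.create_snapshot(
    project=project, zone=zone, disk=disk_name, snapshot_resource=snapshot
).result()

# 2. Grow the disk (persistent disks can only be increased, never shrunk).
resize_request = compute_v1.DisksResizeRequest(size_gb=500)
disks.resize(
    project=project, zone=zone, disk=disk_name,
    disks_resize_request_resource=resize_request,
).result()
print("Disk resized; extend the filesystem inside the VM to use the new space.")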

DataProc

Dataproc has built-in autoscaling, which must be defined in an autoscaling policy before it can be attached to a cluster. The autoscaling policy for Dataproc monitors Hadoop YARN metrics to determine when scaling thresholds are hit, surfacing them via Cloud Monitoring. By default, Dataproc autoscaling responds to memory pressure, but you can consider worker core utilization as well. If you use Dataproc Serverless for Spark, then Dataproc handles the autoscaling for you with just a few simple configuration settings at setup time.

Google Kubernetes Engine

GKE has cluster autoscaling built in natively and it is almost a requirement for any GKE implementation. Note that GKE handles cluster autoscaling differently than Compute Engine, although the core operations are similar. GKE will effectively monitor node utilization and will either add or remove nodes as required. GKE can also resize your node pool.

Cloud Functions

Cloud Functions are designed to be fully serverless and highly scalable. Autoscaling is built in and is generally not configurable, but you can set the maximum number of instances available for autoscaling, up to 1,000 for 2nd gen functions. It is important to balance your requirements here to prevent runaway scaling.



Deciding between persistent or job-based data clusters (e.g., Dataproc)

GCP offers a wide range of execution paradigms to handle different workloads including batch and real time streaming applications. Be sure to choose the right execution paradigm for any given workload.


Deciding between persistent or job-based data clusters (e.g., Dataproc)

Some GCP services can be run either as persistent or as job-based/on-demand clusters. The choice between the two really depends upon the nature of your workloads. If you are running batch workloads once daily in the morning, then an ephemeral cluster is the right option. If you are running consistent and essential sliding-window analyses throughout the day, such as an SRE workload with guaranteed SLOs, then an always-on, or persistent, architecture is essential to success.

Batch workloads run during off-peak hours using preemptible or Spot VMs are almost always cheaper than running 24/7 streaming workloads. Additionally, pay attention to regional and high-availability requirements. If your data only needs to operate in one region, then you can take advantage of Standard Tier networking.

Always try to store your data in GCS if possible. GCS is much cheaper than data stored on persistent disks and is more horizontally scalable as well.

Serverless services, such as BigQuery or Dataproc Serverless for Spark, are highly optimized and scalable. If you can run your data processing job in a serverless service, then you probably should. It is much easier than setting up your own scheduling or waiting for servers to spin up, both of which cost time and therefore money.
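For example, here is a minimal sketch of submitting a job-based (serverless) PySpark workload, assuming the google-cloud-dataproc Python client; the project, region, GCS paths, and batch ID are placeholders.

# Minimal sketch (assumes the google-cloud-dataproc client): submit a job-based
# PySpark workload to Dataproc Serverless instead of keeping a cluster running.
from google.cloud import dataproc_v1

project_id, region = "my-project", "us-central1"  # placeholders

client = dataproc_v1.BatchControllerClient(
    client_options={"api_endpoint": f"{region}-dataproc.googleapis.com:443"}
)

batch = dataproc_v1.Batch(
    pyspark_batch=dataproc_v1.PySparkBatch(
        main_python_file_uri="gs://my-bucket/jobs/daily_aggregation.py",  # placeholder
        args=["--run-date", "2024-01-15"],
    )
)

operation = client.create_batch(
    parent=f"projects/{project_id}/locations/{region}",
    batch=batch,
    batch_id="daily-aggregation-20240115",  # placeholder
)
print(operation.result().state)  # resources are released when the batch finishes

Because the batch releases its resources as soon as it finishes, there is nothing persistent to right-size or tear down afterwards.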