Monitoring and troubleshooting processes

No matter how well your processes or pipelines are developed, errors in both applications and data will occur. Leverage GCP's logging layer to quickly and efficiently spot and fix errors, and use BigQuery's admin panel to manage workloads and reservations.

Topic Contents

Observability of data processes (e.g., Cloud Monitoring, Cloud Logging, BigQuery admin panel)
Monitoring planned usage
Troubleshooting error messages, billing issues, and quotas
Manage workloads, such as jobs, queries, and compute capacity (reservations)


Observability of data processes (e.g., Cloud Monitoring, Cloud Logging, BigQuery admin panel)

GCP offers a complete, searchable, and actionable logging, monitoring, and reporting toolset for ensuring the health and performance of your workloads.


Observability of Data Processes (e.g., Cloud Monitoring, Cloud Logging, BigQuery admin panel)

Cloud Monitoring

Cloud Monitoring collects metrics from nearly all GCP managed services and infrastructure. These include common metrics such as CPU utilization, disk space, and many others. Cloud Monitoring often works in tandem with the service API automatically, gathering and reporting the relevant metrics for the service you are using. Making effective use of Cloud Monitoring helps ensure that your solutions and products operate at peak efficiency. Most GCP products come with a pre-defined Cloud Monitoring dashboard, which is a great way to get started and is often all you need for an effective monitoring solution.

To monitor VMs, set up an agent policy that automatically installs the Ops Agent on all newly created VMs. You can then aggregate and report on metrics for those VMs (such as the ones created by Dataproc) automatically and view them in Cloud Monitoring. Cloud Monitoring also surfaces Dataflow metrics, which help you keep the execution environment efficient.
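
As a minimal sketch of pulling these metrics programmatically (the same data Metrics Explorer charts), the snippet below uses the Cloud Monitoring Python client to list recent CPU utilization for Compute Engine VMs; the project ID is a placeholder.

    from time import time

    from google.cloud import monitoring_v3

    # Placeholder project; replace with your own project ID.
    project_name = "projects/my-data-project"

    client = monitoring_v3.MetricServiceClient()

    # Look at the last hour of data.
    now = int(time())
    interval = monitoring_v3.TimeInterval(
        {"end_time": {"seconds": now}, "start_time": {"seconds": now - 3600}}
    )

    # CPU utilization for all Compute Engine VMs (e.g., Dataproc workers).
    results = client.list_time_series(
        request={
            "name": project_name,
            "filter": 'metric.type = "compute.googleapis.com/instance/cpu/utilization"',
            "interval": interval,
            "view": monitoring_v3.ListTimeSeriesRequest.TimeSeriesView.FULL,
        }
    )

    for series in results:
        instance = series.resource.labels.get("instance_id", "unknown")
        latest = series.points[0].value.double_value if series.points else None
        print(instance, latest)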

Metrics Explorer

Metrics Explorer is a way to view collected metrics in aggregate across your infrastructure. For example, you could examine CPU utilization for all of your VMs, then apply a sort and limit to show only the VMs above a certain threshold. There is a huge number of possible metrics, and the most common and essential ones are offered free of charge. It is also possible to build custom metrics and dashboards if desired. Custom metrics should be written as time series data for easy reporting and integration with Metrics Explorer, as in the sketch below.
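
Here is a minimal sketch of writing a custom metric as a time series with the Cloud Monitoring Python client; the metric name, labels, and value are hypothetical.

    import time

    from google.cloud import monitoring_v3

    client = monitoring_v3.MetricServiceClient()
    project_name = "projects/my-data-project"  # placeholder project

    # Hypothetical custom metric counting records processed by a pipeline stage.
    series = monitoring_v3.TimeSeries()
    series.metric.type = "custom.googleapis.com/pipeline/records_processed"
    series.metric.labels["stage"] = "transform"
    series.resource.type = "global"

    now = time.time()
    seconds = int(now)
    nanos = int((now - seconds) * 10**9)
    interval = monitoring_v3.TimeInterval(
        {"end_time": {"seconds": seconds, "nanos": nanos}}
    )
    point = monitoring_v3.Point({"interval": interval, "value": {"int64_value": 1250}})
    series.points = [point]

    # Write one data point for this time series.
    client.create_time_series(name=project_name, time_series=[series])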

Alerting Policies

Once your project spans dozens of services or hundreds of components, it becomes burdensome to monitor your infrastructure constantly. Cloud Monitoring lets you set up alerting policies that notify you automatically (see the sketch after this list). You can build the following alert types:

  • Metric-Threshold Alert - A metric-based alerting policy that sends notifications and generates an alert, or equivalently an incident, when values of a metric are more than, or less than, the threshold for a specific duration window.
  • Metric-Absence Alert - An alerting policy that sends notifications and creates an alert, or equivalently an incident, when a monitored time series has no data for a specific duration window.
  • Forecasted Metric-Value Alert - An alerting policy that sends notifications and generates an alert when the policy predicts that the threshold will be violated within the upcoming forecast window.
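
For example, a metric-threshold policy can be created with the Cloud Monitoring Python client. The sketch below alerts when VM CPU utilization stays above 80% for five minutes; the display names, filter, and threshold are illustrative, and notification channels are omitted.

    from google.cloud import monitoring_v3
    from google.protobuf import duration_pb2

    client = monitoring_v3.AlertPolicyServiceClient()
    project_name = "projects/my-data-project"  # placeholder project

    policy = monitoring_v3.AlertPolicy(
        display_name="High VM CPU utilization",
        combiner=monitoring_v3.AlertPolicy.ConditionCombinerType.AND,
        conditions=[
            monitoring_v3.AlertPolicy.Condition(
                display_name="CPU above 80% for 5 minutes",
                condition_threshold=monitoring_v3.AlertPolicy.Condition.MetricThreshold(
                    filter=(
                        'metric.type = "compute.googleapis.com/instance/cpu/utilization" '
                        'AND resource.type = "gce_instance"'
                    ),
                    comparison=monitoring_v3.ComparisonType.COMPARISON_GT,
                    threshold_value=0.8,
                    duration=duration_pb2.Duration(seconds=300),
                ),
            )
        ],
    )

    # Create the policy in the project; add notification channels in practice.
    created = client.create_alert_policy(name=project_name, alert_policy=policy)
    print(created.name)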

Cloud Logging

Cloud Logging is a fully managed service that collects event data from GCP (and even AWS) infrastructure, including debug and error messages from applications. Entries are submitted to Cloud Logging via an API call and can carry verbose, structured JSON payloads.
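
For instance, a pipeline step might submit a structured entry with the Cloud Logging Python client. This is a minimal sketch; the log name and payload fields are hypothetical.

    import google.cloud.logging

    client = google.cloud.logging.Client()
    logger = client.logger("pipeline-events")  # hypothetical log name

    # Write a structured (JSON) entry describing an application-level event.
    logger.log_struct(
        {
            "event": "row_validation_failed",
            "table": "orders_staging",
            "bad_rows": 17,
        },
        severity="ERROR",
    )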

Logs Explorer

The Logs Explorer lets you quickly query logs and filter for specific messages, particular applications, or a given infrastructure component. Log Analytics aggregates metrics for specific messages and can report these aggregations to Cloud Monitoring as custom metrics. Log Analytics is useful when you want to see how many times a specific event occurs in your application or pipeline; for example, you can measure how many messages arrive from a specific node in an IoT solution.
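
The same filter syntax you use in the Logs Explorer can also be applied programmatically. Here is a minimal sketch with the Python client; the filter values are illustrative.

    from google.cloud import logging as cloud_logging

    client = cloud_logging.Client()

    # Only ERROR-and-above entries from Dataflow steps since a given date (illustrative).
    log_filter = (
        'resource.type="dataflow_step" '
        'AND severity>=ERROR '
        'AND timestamp>="2024-01-01T00:00:00Z"'
    )

    for entry in client.list_entries(filter_=log_filter, order_by=cloud_logging.DESCENDING):
        print(entry.timestamp, entry.severity, entry.payload)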

Log Based Alerting

As with Cloud Monitoring, you can set up alerts in Cloud Logging based on message content. For example, if an error occurs in your data processing pipeline, you can be alerted and react quickly or halt downstream processing. A common practice is to create alerts on audit logs, which can report when a given resource is accessed by a particular IAM entity (such as a service account). This is useful if you are experiencing anomalous execution metrics and need to trace the application flow back to the particular function or execution causing the anomaly.
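
For example, a log-based alert (or an ad hoc Logs Explorer search) might use an audit-log filter like the one sketched below to watch for BigQuery data access by a specific service account; the project, resource type, and service account are placeholders.

    # Illustrative Logs Explorer / log-based alert filter (placeholders throughout):
    # data-access audit log entries for BigQuery made by one service account.
    AUDIT_FILTER = (
        'logName="projects/my-data-project/logs/'
        'cloudaudit.googleapis.com%2Fdata_access" '
        'AND resource.type="bigquery_resource" '
        'AND protoPayload.authenticationInfo.principalEmail='
        '"etl-runner@my-data-project.iam.gserviceaccount.com"'
    )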

GCP has a number of services for observing your data processing applications and services. These can be used to monitor resource usage, planned capacity, event logging, and error tracking. Use Cloud Monitoring, including Metrics Explorer, to observe resource usage by service. You don't need to know every possible metric to pass the exam, but you should know how GCP collects and reports on metrics, and how best to observe them, recognize patterns, and act on them. The bigger data-centric services, such as BigQuery, Dataproc, and Dataflow, are much more likely to appear on the exam than more general services such as Compute Engine or GKE.

Dataproc

Use Cloud Monitoring to see how Dataproc is spinning up and using compute resources within a managed instance group or GKE node pool. This can include Dataproc hardware resource usage, node and worker counts, storage information, job status information, and YARN resource utilization.

Dataflow

Dataflow posts valuable metrics to Cloud Monitoring, which are useful for monitoring data pipelines, data processing status, and worker health.

BigQuery

You can view and monitor your organization's BigQuery metrics in Cloud Monitoring. These include the number of jobs in flight, the number of queries in flight, bytes billed by statement/job, slot allocation and usage, bytes inserted, and storage information.

Cloud Logging

Cloud Logging is used to monitor both system-produced events and custom events emitted by your applications.

BigQuery

BigQuery is a serverless service, so you should never have to debug the service itself. For monitoring purposes, the most valuable function of Cloud Logging for BigQuery is audit logging, which traces the requests made against your project. Audit logging is vital for ensuring compliance with regulations and company policy, as well as for identifying the sources of large resource usage. Audit logs capture information about the requester, the query text executed, and job metadata such as bytes processed.

Audit logs are very useful for BigQuery admins and can be queried from the BigQuery console, which means that admins do not need access to the Cloud Logging service or UI to monitor their projects.

BigQuery Admin Panel

The BigQuery admin panel, which is built on Cloud Monitoring data, gives you flexibility in how monitoring information is displayed. You can pick and choose from a large number of options, from resource utilization charts to audit-log activity, making it a good single place to monitor activity in BigQuery.



Monitoring planned usage

Businesses expect sustainability and consistency in workload usage and resource consumption. Compare actual resource consumption against a budget to ensure optimal, efficient operation.


Monitoring Planned Usage

GCP offers discounts for planned usage over extended terms. You can take advantage of these programs to save money over time and provide better service to your organization and users. Use these fixed-rate terms to plan a budget, and implement governance policies to ensure that users and processes don't exceed it.

Setting a Budget and Capping API Usage

Setting a budget in Cloud Billing is a great way to help ensure that your overall GCP spend stays within what you planned for the month. You can set a budget to track spending across the entire billing account, individual services, projects or organizations, or specifically labeled resources. Budgets can be applied to monthly, quarterly, or yearly time periods.

Use budgets to track spending and get alerted when your usage is projected to exceed, or does exceed, the budgeted amount. Note that Cloud Billing will NOT programmatically prevent continued usage of a service; it only sends notification emails alerting interested parties to a probable or actual overrun.
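
As a sketch, budgets can also be created programmatically with the Cloud Billing Budgets client library (module path and field names per the google-cloud-billing-budgets package, which may differ by version); the billing account, project, and amounts are placeholders.

    from google.cloud.billing import budgets_v1

    client = budgets_v1.BudgetServiceClient()

    budget = budgets_v1.Budget(
        display_name="data-platform-monthly",
        # Scope the budget to a single project (placeholder).
        budget_filter=budgets_v1.Filter(projects=["projects/my-data-project"]),
        amount=budgets_v1.BudgetAmount(
            specified_amount={"currency_code": "USD", "units": 5000}
        ),
        threshold_rules=[
            # Notify at 80% of actual spend...
            budgets_v1.ThresholdRule(threshold_percent=0.8),
            # ...and again when forecasted spend reaches 100%.
            budgets_v1.ThresholdRule(
                threshold_percent=1.0,
                spend_basis=budgets_v1.ThresholdRule.Basis.FORECASTED_SPEND,
            ),
        ],
    )

    client.create_budget(
        parent="billingAccounts/000000-AAAAAA-111111",  # placeholder billing account
        budget=budget,
    )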

Capping API usage puts a hard stop on the number of API requests made per day, per minute, or per user. Use this in combination with budgeting to help ensure smooth, efficient operation and to preserve access to vital resources for mission-critical applications.

BigQuery

BigQuery is a powerful tool for large-scale, dynamic data processing and analytics. If you have large workloads with many users, investing in a planned commitment can save costs over time. To make sure you stay within these reservations, you can monitor resource utilization and jobs directly from the BigQuery console, which pulls information from Cloud Monitoring to build its charts and graphs. Use the admin panel to track slot usage, bytes processed, and job performance over time.

You can set usage quotas, or custom cost controls, by project and/or by user. Set a project-level cost control as a stopgap against the overall reservation budget, and set individual user controls to prevent inefficient query patterns.
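
A related programmatic guard (distinct from the quota settings themselves) is capping bytes billed per query with the BigQuery Python client; a minimal sketch with a placeholder table and query follows.

    from google.cloud import bigquery

    client = bigquery.Client()

    # Fail any query under this config that would bill more than ~10 GiB.
    job_config = bigquery.QueryJobConfig(maximum_bytes_billed=10 * 1024**3)

    # Placeholder table and query.
    sql = """
    SELECT order_id, SUM(amount) AS total
    FROM `my-data-project.sales.orders`
    WHERE order_date = CURRENT_DATE()
    GROUP BY order_id
    """

    for row in client.query(sql, job_config=job_config):
        print(row.order_id, row.total)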



Troubleshooting error messages, billing issues, and quotas

Leverage GCP's powerful logging and monitoring services to discover and troubleshoot errors within your system. Learn how to fix billing issues and manage work quotas.


Using Monitoring and Logging to Identify Errors and Improvements

By using effective monitoring and logging techniques you can identify bugs, errors, and inefficiencies in your services. This gives you insight into ways to improve your data processing infrastructure and functions. Occasionally, simply relocating a process from one piece of hardware to another can produce dramatic performance improvements. For example, if you notice that your machine learning pipeline has grown and is running slowly, adding GPUs to the VMs can greatly increase the performance of matrix multiplication. Another example is using Cloud Monitoring to search for hot spots in your Bigtable instances and correct them.

Likewise, by examining logs you can identify bugs or errors that may be slowing down or impairing your data processing pipelines. Use Cloud Logging and the Logs Explorer to search for bugs and warnings occurring in your architecture so you can correct them and avoid technical debt. Use Metrics Explorer and alerts to discover anomalies, spikes, or stresses in your infrastructure and react accordingly. For example, if data processing on your VMs is slowing down but CPU utilization is low, the problem may be that you need more memory, not more CPU.

Following best practices and keeping up with the fast evolution of cloud technology can help you identify ways to improve your infrastructure and pipelines. Taking advantage of Google's hybrid-cloud and multi-cloud strategies means you are not limited to operating within GCP: you can work with on-prem systems or even other cloud providers and move data between these environments.

Cloud Logging

Cloud Logging is the primary tool for identifying and troubleshooting errors in GCP. It captures the errors and warnings that services emit while operating. Use the Logs Explorer to quickly scan through logs and isolate error, warning, or info entries by service. Additionally, you can set up custom alerts and log-based metrics to track custom data points.

Cloud Billing

Use the Cloud Billing UI to manage billing issues for your account. This can be done at the account, project, or organization/folder level.



Manage workloads, such as jobs, queries, and compute capacity (reservations)

GCP has tools that can be leveraged to monitor and manage workloads, jobs, queries, and reservations.


Manage Workloads, Such As Jobs, Queries, And Compute Capacity (Reservations)

Use BigQuery's admin panel to observe and monitor data and slot usage, jobs, and workloads. These tools are great for isolating job performance by query and finding optimization strategies that perform the required work more efficiently. For example, you can drill down into high-cost queries and ensure that proper query patterns, such as effective use of partitioned data, are being applied. You can filter the returned data by reservation, folder, project, and user.

You can also monitor job concurrency to ensure that slot allocations are not overburdened, and use that information to spread jobs out over time so that slots remain available during peak hours.

You can also query INFORMATION_SCHEMA to observe executing jobs by project or organization. This gives administrators a way to observe resource usage from the BigQuery console itself without digging into the admin panel, as in the sketch below.
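
A minimal sketch of that kind of query, run here through the BigQuery Python client (the region qualifier and time window are illustrative; the same SQL can be pasted into the console):

    from google.cloud import bigquery

    client = bigquery.Client()

    # Jobs from the last day, ordered by slot consumption.
    sql = """
    SELECT
      user_email,
      job_id,
      state,
      total_bytes_processed,
      total_slot_ms
    FROM `region-us`.INFORMATION_SCHEMA.JOBS_BY_PROJECT
    WHERE creation_time > TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 1 DAY)
    ORDER BY total_slot_ms DESC
    LIMIT 20
    """

    for row in client.query(sql):
        print(row.user_email, row.job_id, row.state, row.total_slot_ms)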

Managing Workloads Best Practices

GCP has a number of recommended best practices for setting up your BigQuery project space to ensure flexibility, efficiency, and high performance. In general, the more established, predictable, and large your processes become, the more investing in a reservation makes sense. As always, the best way to ensure cost-efficiency in BigQuery is to build an efficient table architecture that takes advantage of partitioning and clustering.

GCP recommends creating an administrative project to manage usage across different sub-projects, such as data science, reporting, and data processing. These sub-projects can share slots to control costs while still offering the benefits of segregating data and duties. This is also a good way to manage costs by project/workload or by user type, such as data analysts versus data scientists.

A key best practice is effective project partitioning. Create isolated projects to segregate workloads by type, such as data ingestion, processing, and analysis. You could have a dedicated reservation to manage the daily build of your data warehouse or machine learning workloads, and then use on-demand pricing for analytical querying or predictions.
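
As a sketch of that setup with the BigQuery Reservation API Python client (the project names, location, slot count, and IDs are placeholders, and an existing capacity commitment or autoscaling budget is assumed):

    from google.cloud import bigquery_reservation_v1

    client = bigquery_reservation_v1.ReservationServiceClient()

    # Reservations live under an admin project and a location (placeholders).
    parent = "projects/admin-project/locations/US"

    # Dedicated slots for the nightly warehouse build.
    reservation = client.create_reservation(
        parent=parent,
        reservation_id="nightly-elt",
        reservation=bigquery_reservation_v1.Reservation(slot_capacity=100),
    )

    # Point the ETL project's query jobs at that reservation.
    assignment = client.create_assignment(
        parent=reservation.name,
        assignment=bigquery_reservation_v1.Assignment(
            assignee="projects/etl-project",
            job_type=bigquery_reservation_v1.Assignment.JobType.QUERY,
        ),
    )

    print(reservation.name, assignment.name)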