
GCP Professional Data Engineer Certification Preparation Guide (Nov 2023)/Maintaining and automating data workloads/Designing automation and repeatability

Designing automation and repeatability

Leverage GCP's native Airflow implementation, Cloud Composer, to develop task orchestration.


Topic Contents

Creating directed acyclic graphs (DAGs) for Cloud Composer
Scheduling jobs in a repeatable way


Creating directed acyclic graphs (DAGs) for Cloud Composer

GCP's fully managed Airflow service, known as Cloud Composer, is a powerful, Kubernetes-native Airflow implementation. Use Cloud Composer to orchestrate directed acyclic graphs (DAGs), which can power and manage complex data pipelines and services.
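A DAG is defined in a Python file that you upload to the environment's bucket. The sketch below is a minimal, illustrative example; the DAG ID, task names, and schedule are placeholders, not from the source.

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator

# A minimal sketch of a DAG definition; names and schedule are illustrative.
with DAG(
    dag_id="example_pipeline",
    start_date=datetime(2023, 11, 1),
    schedule_interval="@daily",   # run once per day
    catchup=False,                # don't backfill missed intervals
) as dag:
    extract = BashOperator(task_id="extract", bash_command="echo 'extracting data'")
    load = BashOperator(task_id="load", bash_command="echo 'loading data'")

    # The >> operator declares the dependency edge: extract runs before load.
    extract >> load
```

Cloud Composer automatically picks up DAG files placed in the `dags/` folder of the environment's bucket.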


Cloud Composer

Cloud Composer is GCP's fully managed Apache Airflow service. It is hosted and configured in GCP and runs natively on Google Kubernetes Engine, giving Cloud Composer much greater flexibility compared to alternative offerings from other cloud providers.

Cloud Composer is packaged and run in environments: collections of the services Cloud Composer requires, including a web server, a database, a Cloud Storage bucket, a Redis deployment, and a Kubernetes Engine cluster deployed in Autopilot configuration. Although it is technically possible to alter the cluster configuration of Cloud Composer's node pools, doing so is not recommended and might break the environment.

Cloud Composer environments can be created via the Cloud Console, the gcloud CLI, the API, or Terraform. There is a huge array of configuration options available: you can choose a specific version of Airflow, worker configurations, high resiliency, networking settings, web server access (if you want a private deployment), environment variables, and data encryption standards.
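As a sketch of the gcloud CLI route, an environment creation command might look like the following; the environment name, region, image version, and variables are illustrative placeholders, not values from the source.

```shell
# Create a Composer 2 environment (illustrative values).
gcloud composer environments create example-environment \
    --location=us-central1 \
    --image-version=composer-2.4.4-airflow-2.5.3 \
    --environment-size=small \
    --env-variables=ENVIRONMENT=dev
```

The `--image-version` flag pins both the Composer and Airflow versions, and `--environment-size` selects a preset worker configuration.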

The diagram below shows the intricacies of the Composer service. It isn't necessary to memorize this for the exam, but it is useful to know in general practice. Cloud Composer/Apache Airflow is a very popular service for data engineering in general and understanding how it works will be beneficial.



Scheduling jobs in a repeatable way

Cloud Composer allows you to schedule DAGs to be executed according to cron scheduling. Additionally, GCP offers other methods of scheduling common tasks or even complex workloads.


Cloud Composer

If you're using Cloud Composer/Airflow to schedule jobs, the standard Airflow scheduler manages DAG executions using simple cron scheduling. Cloud Composer becomes extremely valuable as your workflows grow more complex and require more functionality than Cloud Scheduler can provide. Because Cloud Composer runs on Kubernetes by default, resource constraints are rarely an issue.
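Airflow accepts either a standard five-field cron expression or one of its named presets as a DAG's schedule. The preset-to-cron mapping below reflects Airflow's documented presets; the helper function is just an illustrative sketch, not part of Airflow's API.

```python
# Airflow's named schedule presets and their cron equivalents
# (five fields: minute, hour, day-of-month, month, day-of-week).
CRON_PRESETS = {
    "@hourly": "0 * * * *",    # top of every hour
    "@daily": "0 0 * * *",     # midnight every day
    "@weekly": "0 0 * * 0",    # midnight every Sunday
    "@monthly": "0 0 1 * *",   # midnight on the 1st of the month
    "@yearly": "0 0 1 1 *",    # midnight on January 1st
}

def resolve_schedule(schedule: str) -> str:
    """Expand an Airflow preset to its cron expression; pass cron strings through."""
    return CRON_PRESETS.get(schedule, schedule)
```

For example, `resolve_schedule("@daily")` returns `"0 0 * * *"`, while an explicit expression such as `"30 6 * * 1"` (06:30 every Monday) is passed through unchanged.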

Cloud Scheduler

GCP's Cloud Scheduler is a great choice if you aren't using Cloud Composer to automate workloads, which is common if you are running BigQuery Data Transfer Service jobs. Cloud Scheduler is fully integrated with most GCP APIs and services and can be used to schedule many different types of workloads, such as Dataproc jobs or Cloud Build triggers.
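At its core, a Cloud Scheduler job is just a cron schedule plus a target. The sketch below builds the JSON body for the REST API's `projects.locations.jobs.create` call against an HTTP target; the project, job name, and URL are hypothetical placeholders.

```python
import json

def scheduler_http_job(project: str, location: str, name: str,
                       schedule: str, uri: str, time_zone: str = "Etc/UTC") -> dict:
    """Build the request body for a Cloud Scheduler HTTP-target job.

    Field names follow the Cloud Scheduler REST API; the specific
    job values used below are illustrative, not from the source.
    """
    return {
        "name": f"projects/{project}/locations/{location}/jobs/{name}",
        "schedule": schedule,          # standard unix-cron format
        "timeZone": time_zone,
        "httpTarget": {
            "uri": uri,
            "httpMethod": "POST",
        },
    }

# Example: trigger a (hypothetical) endpoint every day at 06:00 UTC.
job = scheduler_http_job(
    project="my-project", location="us-central1", name="daily-refresh",
    schedule="0 6 * * *",
    uri="https://example.com/refresh",
)
print(json.dumps(job, indent=2))
```

The same job could equally target Pub/Sub (`pubsubTarget`) or App Engine instead of HTTP.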

Cloud Scheduler is fully integrated with services such as Cloud Monitoring and Cloud Logging and will automatically report job status. Alerts can be set up to track job success or failure.