
Building the pipelines

After planning your pipeline you can begin developing your solution. Identify the services your pipelines require and build on GCP's core offerings to facilitate your efforts.


Topic Contents

Data cleansing
Identifying the services (e.g., Dataflow, Apache Beam, Dataproc, Cloud Data Fusion, BigQuery, Pub/Sub, Apache Spark, Hadoop ecosystem, and Apache Kafka)
Transformations
Data acquisition and import


Data cleansing

Before data moves into your storage layer it should be cleansed and tested for quality. Be sure to identify any erroneous or flawed data which should be discarded or sent to an error log.


Data Preparation

Data preparation is the next step after data ingestion when the incoming data are not properly formatted, schematized, or structured. This is also the step where certain precursory data quality, security, and privacy checks can occur. Data preparation can be used to alter file formats, schemas, or compression algorithms before the data are read by a tool like BigQuery or streamed into Dataflow. For example, an ingestion engine could read a JSON object from an API, strip the headers from the message, schematize the incoming JSON as Avro or Parquet, use Cloud Data Loss Prevention to check for any PII, compress the file, and insert the data into a bucket or BigQuery for further analysis.
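
To make this concrete, below is a minimal sketch of that kind of preparation step. The ingest_records() helper, the field names, and the my-staging-bucket bucket are all hypothetical placeholders; the sketch converts parsed JSON records to compressed Parquet with pyarrow and stages the file in GCS for BigQuery or Dataproc to pick up.

import pyarrow as pa
import pyarrow.parquet as pq
from google.cloud import storage

def ingest_records():
    # Hypothetical helper: in practice this would call the source API,
    # strip transport headers, and return a list of plain dicts.
    return [{"order_id": "1001", "amount": 25.40}, {"order_id": "1002", "amount": 9.99}]

records = ingest_records()

# Schematize the JSON records as a compressed, columnar Parquet file.
table = pa.Table.from_pylist(records)
pq.write_table(table, "/tmp/orders.parquet", compression="snappy")

# Stage the file in GCS for BigQuery or Dataproc to pick up.
bucket = storage.Client().bucket("my-staging-bucket")          # assumed bucket name
bucket.blob("raw/orders.parquet").upload_from_filename("/tmp/orders.parquet")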

Dataprep by Trifacta

Data preparation is an essential step in any data pipeline. Dataprep by Trifacta is a useful tool for easily creating complex data transformation pipelines in GCP. Pipelines consist of a number of tasks that, when fully assembled, are known as "recipes." Recipes are the set of instructions that Dataprep follows when performing pipeline runs. Dataprep is an abstraction of a Dataflow pipeline and makes it easier to build complex DAG architectures using only a simple UI. Recipes can also be extracted and executed as pure Dataflow jobs, and some jobs can be run inside of BigQuery as pure SQL scripts.

Dataprep is an effective tool for performing many common data transformation tasks. It has a built-in data profiler to check for anomalous data and account for it. It also has hundreds of built-in transformations to handle most use cases.

Cleansing and Schematization

Before data can be effectively processed they must go through an initial pass to qualify, cleanse, and schematize the data. This process can take different shapes, but in practice it is often handled by Python executing in a Cloud Function, within a Cloud Data Fusion package, or with a tool like Dataprep. Data quality checks can take many forms, but they usually involve a mixture of data type checks, value range checks, value lookups, and other forms of data validation.

Data schematization is the process whereby you clean and properly type ingested data, and it is an essential step in any data pipeline. Tools like Dataprep, Python executing in a Cloud Function, or Dataflow are effective for performing cleansing and schematization. Think of the schematization process as the "bridge" through which your data must pass to reach its final form. The messier, more error-prone, or more complex your ingested data are, the more difficult this process can be. In practice it is often best to solve data issues as close to the source as possible, but this is not always feasible, which is why you should always have some preparation procedures in place.
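
As a rough illustration of the kinds of checks described above, the sketch below applies type, range, and lookup validations in plain Python, the sort of logic you might run in a Cloud Function or a Beam ParDo. The field names, valid range, and country lookup table are assumptions for illustration.

from datetime import datetime
from typing import Optional

VALID_COUNTRIES = {"US", "CA", "GB"}        # illustrative lookup table

def cleanse(record: dict) -> Optional[dict]:
    """Type, range, and lookup checks; returns a typed record or None for the error log."""
    try:
        cleaned = {
            "sensor_id": str(record["sensor_id"]),
            "reading": float(record["reading"]),
            "country": str(record["country"]).upper(),
            "event_time": datetime.fromisoformat(record["event_time"]),
        }
    except (KeyError, ValueError):
        return None                          # data type check failed: route to an error sink
    if not -50.0 <= cleaned["reading"] <= 150.0:
        return None                          # value range check failed
    if cleaned["country"] not in VALID_COUNTRIES:
        return None                          # value lookup check failed
    return cleaned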

Eventually the schematization process should produce the final data product with a particular landing technology in mind, such as BigQuery, Dataproc, or Dataflow.

PII and Data Integrity Checks

Another common step in the data preparation process is to check for PII or perform other data integrity checks, such as scanning for sensitive information. For this you can use Cloud Data Loss Prevention along with custom infoTypes to detect and either delete or tokenize sensitive data. In practice this is performed at either the data preparation or the data processing stage, depending on the business rules of your organization.
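
A minimal sketch of a de-identification call using the google-cloud-dlp client is shown below, assuming a project named my-project and two built-in infoTypes; custom infoTypes would be added to the inspect configuration in the same way, and tokenization would swap the replace transformation for a crypto-based one.

from google.cloud import dlp_v2

dlp = dlp_v2.DlpServiceClient()
parent = "projects/my-project"               # assumed project ID

inspect_config = {"info_types": [{"name": "EMAIL_ADDRESS"}, {"name": "PHONE_NUMBER"}]}
deidentify_config = {
    "info_type_transformations": {
        "transformations": [
            # Replace each finding with its infoType name, e.g. [EMAIL_ADDRESS].
            {"primitive_transformation": {"replace_with_info_type_config": {}}}
        ]
    }
}

response = dlp.deidentify_content(
    request={
        "parent": parent,
        "inspect_config": inspect_config,
        "deidentify_config": deidentify_config,
        "item": {"value": "Contact jane@example.com or 555-867-5309"},
    }
)
print(response.item.value)                   # PII replaced before the data lands in storage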



Identifying the services (e.g., Dataflow, Apache Beam, Dataproc, Cloud Data Fusion, BigQuery, Pub/Sub, Apache Spark, Hadoop ecosystem, and Apache Kafka)

GCP has a number of core data services used to accomplish data engineering tasks. It is important to identify all of these services, their trade-offs, and the tasks for which each is best suited.


It is an art as well as a science

GCP data engineering is as much an art as it is a science. It often involves possessing a deep understanding of different technologies and different techniques for addressing diverse problems in the cloud. There is rarely ever a true “cookie cutter” approach to any given problem, and there are often hidden nuances which can alter the best choice of infrastructure for your solution. In this section we will discuss the various trade-offs involved when choosing a technology and some example “logic flows” and decision matrices you could utilize when approaching a data engineering problem in GCP.

Data Processing Example Pipeline

To begin you should build a complex architectural flow of the data from a bird’s eye view. This is somewhat akin to developing the skeleton of your data architecture, which will form the shape of your data flow inside of GCP. Build from a top-down perspective with increasing levels of complexity depending upon the various nuances of your data. This begins with a thorough understanding of the nature of your source data as well as the final end state of your consumer objects. Using this knowledge you can then fill in the transformations necessary to properly, effectively, and efficiently develop the data into the final consumable product.

Below are two example data processing methodologies, one for batch and one for streaming data. They are useful guides, but certainly not cookie cutters for a given scenario. Each scenario requires a thorough understanding of the underlying technologies, the nature of the data source, and the final end state of the data. The examples show a general outline for data processing in GCP and provide some potential options when developing a solution. Note that it is possible to solve a given problem in multiple ways, but on the exam there is usually only one "best" way within a given scenario, which is usually the most efficient, secure, and cost-effective. Understanding the different ways the technologies can speak to each other is vital to success both on the exam and in practice.

Data Ingestion and Schematization

Data processing solutions begin with ingesting data from the source. The first step is to determine whether the data are batch or streaming, the incoming file types, and, if possible, the data schema. If the schema is not readily available it is wise to use a processor such as Cloud Functions to project a schema onto the data. It is then recommended to reformat the data into a format readable by the big data processor you have chosen. For example, BigQuery accepts Avro and Parquet readily and works well with both, but Spark is better when working with Parquet because it can take advantage of pruning and predicate pushdown. If you have schemaless JSON data with a highly variable schema (such as data from a document database), it is probably wise to export it as is to GCS and then parse it using Dataproc or BigQuery. The purpose of the main ingest stage is to ready the files for further processing. Once the data are effectively schematized they can be pushed to a Pub/Sub topic for the next steps in a streaming workload, or inserted directly into BigQuery from Cloud Functions or a Pub/Sub subscription.
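
For the batch side of this pattern, the sketch below loads schematized Parquet files that have been staged in GCS into a native BigQuery table using the google-cloud-bigquery client. The bucket, dataset, and table names are assumptions for illustration.

from google.cloud import bigquery

client = bigquery.Client()

# Load schematized Parquet files staged in GCS into a native BigQuery table.
job_config = bigquery.LoadJobConfig(
    source_format=bigquery.SourceFormat.PARQUET,
    write_disposition=bigquery.WriteDisposition.WRITE_APPEND,
)
load_job = client.load_table_from_uri(
    "gs://my-staging-bucket/raw/*.parquet",      # assumed staging path
    "my-project.raw_dataset.orders",             # assumed destination table
    job_config=job_config,
)
load_job.result()                                # block until the load completes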

Data Processing Solutions

There are three main tools you can use to process big data in GCP. Although much of the data manipulation and transformation capability overlaps among them, there are certain scenarios where one choice is better than the others.

Cloud Dataflow is a serverless data processing solution which can read from a Pub/Sub stream and process the returned data. It is highly efficient, and its serverless architecture makes spinning up a Dataflow job very easy and cost-effective. It is easily integrated with other GCP services and has a huge array of connectors, as well as built-in support for windowing and triggering that makes stream processing easy. If your data are well schematized and organized then Dataflow is an easy choice. If you are performing stream analytics on regularly schematized data then Dataflow is the best choice. Dataflow also produces better native GCP metrics than Dataproc and provides a more complete, real-time source of information on running pipelines.

Dataproc is GCP's managed Spark and Hadoop implementation. It is relatively low cost and easy to configure. Additionally, it can offer heightened security compared to Dataflow or BigQuery if the cluster is configured on sole-tenant hardware. GCP's Dataproc Serverless Spark offering is a newer option, which makes comparing the advantages of Dataflow and Dataproc more difficult; at that point the choice is whether to use Spark or Beam for processing data. If the source data is difficult to schematize then Spark is a good solution, because Spark has better tools for working with schemaless data than Dataflow. Dataproc is also a good choice if you are looking to run SparkML jobs. Spark Structured Streaming behaves differently than Dataflow, so much depends upon the nature of your source data, your security requirements, and how you will eventually consume the data.

BigQuery is an interesting choice for data processing because it is a serverless solution which requires almost no setup or configuration from the developer to begin building and exercising pipelines. If your data are schematized and properly configured, and you don't need any other tools to process the data (such as SparkML), then you could use BigQuery to perform every data processing task you require through simple SQL. You can stream data directly into BigQuery via the BigQuery Write API or a Pub/Sub subscription and perform any kind of processing you need over the data in near real time. This isn't quite the same as real-time data processing in Dataflow or Structured Streaming in Spark, but it is nonetheless very effective in most circumstances. BigQuery also has BigQuery ML, which allows you to build machine learning solutions right inside of BigQuery for batch predictions. You can then export the model to Vertex AI to perform online predictions in just a matter of minutes. BigQuery ML generally outperforms Dataproc, though it is more expensive byte for byte unless you reserve slots beforehand.
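
As a rough illustration of "processing in place" with SQL, the sketch below trains and runs a simple BigQuery ML model through the Python client. The project, dataset, table, and column names are all hypothetical.

from google.cloud import bigquery

client = bigquery.Client()

# Train a simple logistic regression model entirely inside BigQuery.
client.query(
    """
    CREATE OR REPLACE MODEL `my-project.analytics.churn_model`
    OPTIONS (model_type = 'logistic_reg', input_label_cols = ['churned']) AS
    SELECT tenure_months, monthly_spend, support_tickets, churned
    FROM `my-project.analytics.customer_features`
    """
).result()

# Batch prediction with ML.PREDICT, also plain SQL.
rows = client.query(
    """
    SELECT customer_id, predicted_churned
    FROM ML.PREDICT(MODEL `my-project.analytics.churn_model`,
                    TABLE `my-project.analytics.customer_features`)
    """
).result()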

Vertex AI is GCP's fully managed AI solution. Formerly known as AI Platform, Vertex AI can be used to quickly build ML solutions in GCP using either BigQuery or Cloud Storage. You can also run Dataproc Serverless Spark pipelines from Vertex AI. Vertex AI can perform batch or online processing of AI requests from either the Cloud Console or applications. A more in-depth examination of Vertex AI is covered in a later module.

Cloud Composer is GCP's fully managed Apache Airflow service. It is used within GCP to manage your DAG workflows and pipelines. It is an industry-standard application which can manage GCP-native and hybrid workflows across clouds, with great API support and a huge array of native connectors. Cloud Composer can be used to manage complex batch processes for big data processing, data transformations, and machine learning solutions. Its native integration with Kubernetes allows for a more flexible operating environment inside the cloud than competing implementations from other providers. Cloud Composer is a go-to tool for any data engineer.

Stream Processing In GCP

Stream processing and event driven architectures in GCP consist of many components including serverless functions, message queues, and message brokers. Stream processing is a technique used to deliver real time actionable insights and executable information to consumers with low latency and high consistency.

An event message is a snapshot of an application activity or transaction that is sent to a Pub/Sub topic to be used in a downstream, decoupled process in GCP. It consists of some metadata as well as the transaction data that describes the event occurrence.

Message queues are a way to decouple event sources from their destinations. Message sources (publishers) send asynchronous event messages to topics and then continue on with their next task. Publishers do not worry about what happens to a message once it is published, other than the understanding that the message will be delivered. Message queues are designed to ensure that a faulty component in a microservices architecture doesn't interfere with other processes in the architecture. Message queues can also be used to fan out event messages to many different subscribers, each of which is tasked with some alternative workflow. For example, an order transaction might push a message to both a shipping and an inventory service workflow for processing and fulfillment.

Pub/Sub

Pub/Sub is GCP's globally available message broker service. It is tasked with creating and maintaining message queues. Pub/Sub is a very common tool used in GCP to perform a wide variety of tasks. Within a data processing architecture it is used to stream data to tools such as Cloud Functions for schematization or data cleansing, Dataflow for stream analytics, Dataproc for Structured Streaming or SparkML, or even directly into the BigQuery Write API for big data analytical processing.

Pub/Sub Architecture

Pub/Sub consists of six components:

  • Publisher (also called a producer): creates messages and sends (publishes) them to the messaging service on a specified topic.
  • Message: the data that moves through the service.
  • Topic: a named entity that represents a feed of messages.
  • Schema: a named entity that governs the data format of a Pub/Sub message.
  • Subscription: a named entity that represents an interest in receiving messages on a particular topic.
  • Subscriber (also called a consumer): receives messages on a specified subscription.

Pub/Sub is similar to other message broker systems. Producers and Consumers (Publishers and Subscribers) post and read messages from topics. Publishers produce messages containing event data associated with application transactions. The messages are then read by downstream subscribers for further processing or analysis.

Pub/Sub messages can be either pushed or pulled depending upon the subscription configuration. Pushed messages are good for event driven architectures or stream analytics. Pulled messages are good for batch processing and tend to be cheaper in practice, easier to implement, and offer more control over the processing methodology.

Messages are ordered by arrival time by default, but you can use an ordering key if desired.

Messages look like the example below. The data field contains the actual payload, messageId is the unique message ID, publishTime is the timestamp when the message was published to Pub/Sub, and orderingKey is how Pub/Sub orders the message. Attributes are key-value pairs that can be sent instead of, or in addition to, the data field. The data field is free-form and is often a JSON object.

{
  "data": string,
  "attributes": {
    string: string,
    ...
  },
  "messageId": string,
  "publishTime": string,
  "orderingKey": string
}
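
For reference, publishing a message with the Python client looks roughly like the sketch below; the project and topic names are assumptions, and the extra keyword argument becomes a message attribute.

from google.cloud import pubsub_v1

publisher = pubsub_v1.PublisherClient()
topic_path = publisher.topic_path("my-project", "orders-topic")   # assumed names

# data is free-form bytes; extra keyword arguments become message attributes.
future = publisher.publish(
    topic_path,
    data=b'{"order_id": "1001", "amount": 25.40}',
    source="checkout-service",
)
print(future.result())                     # the server-assigned messageId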

Pub/Sub has recently added some useful features, including retaining messages for up to seven days. This enables potential replays for subscribers that may need to rerun the data streams. In a system such as Kafka this is handled by offsets against topics; within Pub/Sub it is handled either by timestamps or by ordering key.

If required, you can configure message storage policies on a Pub/Sub topic to prevent data from being stored or processed outside a given region. These are subordinate to Resource Location Restrictions, meaning that if you have a Resource Location Restriction in place it takes priority over any message storage policy attached to the project. Message storage policies are best when you want to restrict one or more Pub/Sub topics without affecting other services.

Schematization

Pub/Sub messages may optionally be schematized in either Avro or Protocol Buffer (protobuf) format. Schematization is useful if you want greater control over the incoming data and need efficient data quality checks within the messages. If you're migrating from Kafka or the messages are being published externally, it might be easier to use Avro, which is the default schema type for Kafka. If you're starting from scratch inside GCP and the messages will remain inside GCP, protobuf offers more efficient transmission. Schematized messages can be streamed as is into the BigQuery Write API via a Pub/Sub subscription, which is a very efficient way to stream event data into BigQuery. Use change data capture on the data to update your tables automatically as data in the topics mutate.

Common Use Cases

Pub/Sub has many use cases for data engineering and application event handling. It can be used to decouple services or as an event handler for downstream applications. For data engineering it can be used to transmit data for streaming analytics with Dataflow, trigger Vertex AI Pipelines, or batch analytics via direct BigQuery subscription. As a message broker it works well and is an effective substitute for tools like Kafka within GCP.

Pub/Sub is often used as a one-to-many data parallelization mechanism where one producer creates a message in a topic which is then ingested by multiple subscribers. For example, an event producer sends events to a Pub/Sub topic. One subscriber receives pushed messages in real time, which are processed by Dataflow, while a second BigQuery subscriber polls the topic every hour to feed a batch analytics or BQML machine learning pipeline. The two datasets are then compared to account for late-arriving data or to reconcile competing data.

Pub/Sub is highly configurable and can handle a few different event processing methods which can satisfy a wide array of use cases.

At least once processing is the default processing method for Pub/Sub: each message is guaranteed to be delivered at least once to each subscriber. However, duplicates or out-of-order processing may occur. This is a cheap and efficient method, so if your system can tolerate duplicates or out-of-order processing then it is fine.

In Order Processing ensures that the messages are processed in the order received by Pub/Sub. Messages must have the same ordering key and be in the same region to enable ordered message processing. Message ordering ensures that messages with the same ordering key will be processed oldest to latest.

Exactly once processing is a method where Pub/Sub ensures that each message is processed only one time per subscriber. Technically, this means that Pub/Sub will not attempt redelivery of a message that has been successfully acknowledged or is waiting to be acknowledged. Producers can still produce duplicate messages with different message IDs, however. This method is useful if your processing application is sensitive to duplication and you can accept a higher processing time. Exactly-once delivery is available only for pull subscriptions, which puts it in a grey area within an event-driven architecture.

Pull subscriptions are where a topic subscriber pulls messages from the queue via an HTTP API request. Pull methods tend to operate more as a batch structure which offers higher degrees of control and cheaper processing, at the cost of higher latency. Pull subscriptions operate as a transaction with both subscribers and Pub/Sub required to acknowledge receipt of the messages.
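
A minimal streaming-pull subscriber sketch using the Python client is shown below; the project and subscription names are assumptions, and the callback simply prints and acknowledges each message.

from concurrent.futures import TimeoutError

from google.cloud import pubsub_v1

subscriber = pubsub_v1.SubscriberClient()
subscription_path = subscriber.subscription_path("my-project", "orders-sub")   # assumed names

def callback(message):
    print(message.data, dict(message.attributes))
    message.ack()                          # acknowledge so Pub/Sub does not redeliver

streaming_pull_future = subscriber.subscribe(subscription_path, callback=callback)
with subscriber:
    try:
        streaming_pull_future.result(timeout=60)   # listen for one minute, then stop
    except TimeoutError:
        streaming_pull_future.cancel()
        streaming_pull_future.result()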

Push subscriptions are where Pub/Sub automatically forwards received messages to a given subscriber. This method is very fast with almost zero latency. It is often used with stream analytics or IoT where real-time information is required to meet application SLAs and there is some leniency for duplicated or malformed messages. Push subscriptions are not transaction oriented because Pub/Sub doesn't wait for a subscriber to acknowledge a message before continuing with the next one. If messages are not acknowledged by the acknowledgement deadline, Pub/Sub will retry sending them. You must therefore ensure that the subscriber application successfully acknowledges each message, or Pub/Sub will continuously attempt to resend the same message, which can result in large numbers of duplications or message backlogs that bog down the queue.

As usual there are tradeoffs between the various methods so you must understand each scenario and each method thoroughly to ensure success on the exam.

Service Oriented Architecture

A Service Oriented Architecture (SOA) is an architectural philosophy which seeks to decouple data processing layers and isolate application subprocesses into independent components known as microservices. This, in theory, provides a much more manageable processing pipeline for the various subprocesses. Communication between services is handled by message brokers, which manage interactions between services. In GCP you will often see microservices run on either servers or serverless platforms: VMs or GKE clusters for server-based deployments, and Cloud Functions for serverless implementations. Parallelization in the cloud can facilitate potentially millions of requests per day for millions of users across thousands of individual microservices.

To ensure a smooth, efficient, and complete solution, customers leverage a service bus to manage message queues and event-processing communications between services. SOAs are generally transaction oriented, which means exactly-once pull processing is used to ensure successful message management and transaction integrity (ACID compliance) throughout the architecture. SOA activities therefore sometimes evolve into tasks or workflows for handling activities in the cloud. This effectively decouples the application from certain workload processing, which ensures faster, higher-quality application performance and a better user experience.

Site architectures can consist of numerous tasks, each fulfilling different workflow requirements. Often this includes the same service operating in a few different workflows. Each of these workflows produces events which reflect the work performed. Following event capture, these events can flow to a few different endpoints, including a logger such as Cloud Logging for debugging or user-experience tracing, and Pub/Sub for both a real-time stream analytics workflow and for BigQuery for big data analytics and machine learning. Eventually, these data could be read by Looker for business intelligence.

The diagram below shows the workflow used to build a user's landing page after logging in to your site. The user first authenticates, and if the user has a valid Google Identity the site checks whether the user exists in Firestore. If it is a new user, the user is redirected to the new-user registration workflow. If the user is authenticated, the user is moved to the landing page build workflow.

This workflow consists of a number of event-driven steps which use Workflows to track progress and handle any errors. The process is kicked off by the site publishing a message to the landing page builder Pub/Sub topic. The topic then executes a push to a Cloud Function which updates the user's last login information, which we can use later to track site usage analytics. The function then pushes the updated landing page login to the recommendation service, which reaches out to the Vertex AI Recommendations endpoint for an online prediction used to suggest podcasts to the user. This object is then pushed to the orders service, which gathers the user's recent listened-to data. Finally, the function pushes the data to the user's landing topic. That topic then executes a push subscription to App Engine for building the landing page, a push subscription to Cloud Dataflow for real-time user analytics, and a pull subscription to push the event data to the BigQuery raw ground truth dataset.

Workflows processes the ground truth data using BQML for batch predictions, to detect drift or potentially rebuild the model if needed. Additionally, Vertex AI hosts the online endpoint for the BQML model (this is the endpoint referenced above). BQML outputs the results of the batch prediction to a BigQuery recommendations dataset.

Workflows tracks the progress in Dataflow as Dataflow produces aggregated window- and time-based user analytics. The aggregated user data is pushed to a real-time analytics dataset in BigQuery, which can be instantly read by Looker for real-time user login and tracking analytics.

Additionally, the two datasets could be combined into a view to better track how users respond to recommendations and potentially generate marketing analytics for campaign building.



Transformations

Between your source and sink lie your data transformations. These flatten, type, and shape your essential data structures. There are many transformation paradigms, each with its own trade-offs and unique use cases. Your data transformations form the shape of your sink and, eventually, your data lake.


Dataflow Components

Apache Beam Engine

Apache Beam is a unified engine for processing both batch and streaming data. It seeks to simplify the development process by using the same code for both a batch and a streaming solution. An Apache Beam pipeline is fairly simple in its implementation, but that's what makes it so powerful. Apache Beam pipelines can be used across a number of different architectures and processing systems.

Apache Beam Pipelines are the applications which encapsulate all the instructions necessary to build your pipeline. Pipelines require a source, some processing instructions, and a sink in order to function correctly.

Sources and sinks are constructed through common interfaces via Pipeline I/Os. Dataflow and Apache Beam come with standard libraries which include a large number of Pipeline I/Os, such as BigQuery and GCS connectors.

Apache Beam is a unified batch and streaming model which creates common components for both methods. Batch data is also known as bounded data, and streaming data is also known as unbounded data. Beam uses a single class, known as a PCollection, to represent both. A PCollection contains the necessary instructions that Beam must use to house and interpret the incoming data in order to perform higher-level operations on it.

Data can be operated on after they are instantiated inside a PCollection. Apache Beam uses logical operators known as PTransforms to operate on the data. These can perform almost all common data transformations, including aggregations, grouping, and combining. The primary operating method in Apache Beam is called a ParDo and contains the instructions needed to perform parallel processing on distributed data encapsulated within a PCollection.
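
As a concrete illustration, the sketch below is a minimal batch Beam pipeline that reads CSV lines from GCS, applies a ParDo to parse and type each record, and writes the results to BigQuery. The bucket, table, and field names are assumptions, and the pipeline options would need a runner, project, and region to execute on Dataflow.

import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions

class ParseCsvLine(beam.DoFn):
    """A ParDo that types each CSV line; malformed lines are dropped here for brevity."""
    def process(self, line):
        try:
            sensor_id, reading = line.split(",")
            yield {"sensor_id": sensor_id, "reading": float(reading)}
        except ValueError:
            return                         # in practice, route to a dead-letter output

options = PipelineOptions()                # add --runner=DataflowRunner, --project, etc.
with beam.Pipeline(options=options) as p:
    (
        p
        | "Read" >> beam.io.ReadFromText("gs://my-staging-bucket/raw/*.csv", skip_header_lines=1)
        | "Parse" >> beam.ParDo(ParseCsvLine())
        | "Write" >> beam.io.WriteToBigQuery(
            "my-project:raw_dataset.sensor_readings",
            write_disposition=beam.io.BigQueryDisposition.WRITE_APPEND,
            create_disposition=beam.io.BigQueryDisposition.CREATE_NEVER,
        )
    )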

Batch Pipelines

Batch pipelines are the standard method of moving data within Dataflow. Batch processing can be used to work with whole datasets (such as gs://bucket_id/prefix/*.csv) or with data coming from any other common connector. This method is useful when you want to keep data in GCS for archive purposes and also move the data into another system (such as BigQuery) via Dataflow. It is also possible to use this method as a common source for Dataflow, which can then parallelize output to many different sinks using just one pipeline. This method is also useful when you don't want to use external tables in BigQuery or when you want to perform some preprocessing before loading external data (such as data masking/tokenization via Cloud DLP).

One common method is to use incremental batch pipelines, which are designed as an append-only table inside of GCS. Incremental batch pipelines can be thought of as a sort of hybrid between a batch and streaming architecture, where you get the flexibility of streaming at a cheaper cost by not having to operate the pipeline at all times. If you are collecting IoT data from a large number of sensors you can deposit the data into GCS with a sensor-delimited prefix and then use incremental batch processing to aggregate the data across sensors, check for data quality issues, anonymize or mask the data, and then load it either back into GCS for use with Dataproc or insert it directly into BigQuery via the BigQuery Write API.

Dataflow Stream Processing

Dataflow is GCP's fully managed Apache Beam implementation. Stream processing is built using the same methods as batch but with alternative configurations. Stream processing is also referred to as unbounded data processing, in reference to a continuous stream of data as opposed to data with a start and end boundary. It is common to use stream processing to provide real-time analytics of streams and to also push the data to an archive such as GCS for batch processing. This can help reconcile broken data streams and account for late-arriving data for auditing purposes.

Some common terms

  • Event Time - The time an event occurred.
  • Processing Time - The time an event was captured by the data processing application.
  • Late Arriving Data - Data is considered late arriving if the processing time is significantly later than the event time. This can signal some sort of interference in the event generation or transmission pipeline; for example, if you are capturing data from multiple weather sensors, a power outage could prevent successful transmission of the event data for an hour. Late-arriving data is determined by the watermark. Dataflow can handle late-arriving data, but it must be enabled via the Java SDK, and Dataflow SQL does not support late-arriving data.
  • Watermark - A time-bound constraint which determines when to stop looking for late-arriving data. For streaming solutions this is generally less than a minute.
  • Trigger - The method Dataflow uses to emit data. This is usually when the watermark has been passed, but it can also be configured to use the processing time, the event time, or a count of the data elements in the stream.

Windowing

Since streaming data are unbounded, it is necessary to form windows over the incoming data in order to ensure continuous processing within hardware limits. Data must be contained within a particular microbatch so that each operation is time bound. Data stream windowing refers to the process of defining time-bound processing standards for given data streams. Streaming data are processed using one of three windowing methods (a minimal code sketch follows the list below).

  • Tumbling Windowing (Fixed Windows) - This is the usual method of windowing and is similar in practice to the incremental batch method, but it operates continuously. It executes against a common time-bound window: Dataflow pulls all data received against the timestamp offset and processes it. An example is the stream executing every 30 seconds and pulling the previous 30 seconds' worth of data.
  • Hopping Windowing (Sliding Windows) - Hopping windows are common when working with late-arriving data because they provide a useful buffer within the window. It is common for Dataflow to perform aggregations against the data here, such as averaging or summing incoming data, in order to reduce the sink hardware constraints or reduce data outflows from Dataflow. For example, Dataflow could aggregate a 60-second window's worth of data every 30 seconds to produce a running average. This can be used in an IoT solution to check for faulty sensors or equipment.
  • Session Windowing - Session windows track gap durations between events. A session window measures incoming data by key. When a data element arrives, the session window timer starts and the stream begins tracking arriving data. If no event occurs for a certain number of seconds, the window closes and the session is marked. Once data begins arriving again, the session window restarts and the process repeats. This method is commonly used to track website sessions, such as mouse clicks, shopping sessions, music playback, etc.
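
The sketch referenced above shows a minimal streaming pipeline that applies 30-second fixed (tumbling) windows to a Pub/Sub stream and computes a per-sensor mean; swapping in window.SlidingWindows(60, 30) would give the hopping-window example described above. The subscription, field names, and sink are assumptions.

import json

import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions, StandardOptions
from apache_beam.transforms import window

options = PipelineOptions()
options.view_as(StandardOptions).streaming = True

with beam.Pipeline(options=options) as p:
    (
        p
        | "Read" >> beam.io.ReadFromPubSub(
            subscription="projects/my-project/subscriptions/sensor-sub")
        | "Parse" >> beam.Map(lambda b: json.loads(b.decode("utf-8")))
        | "KeyBySensor" >> beam.Map(lambda r: (r["sensor_id"], r["reading"]))
        | "Window" >> beam.WindowInto(window.FixedWindows(30))      # 30-second tumbling windows
        | "MeanPerSensor" >> beam.combiners.Mean.PerKey()
        | "Print" >> beam.Map(print)       # in practice, write to BigQuery or another sink
    )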

Dataproc - Hadoop and Spark Engine

Dataproc can be used as an alternative to, or in addition to, Dataflow as part of your data processing solution. As a fully managed solution you don't have to worry about cluster management, and optional components can be installed simply by selecting them when spinning up the cluster. Dataproc is often used when migrating legacy Spark or Hadoop applications or when you need a highly customized workflow beyond the capabilities of Dataflow and BigQuery. Dataproc comes preloaded with all the tools you need to cover the vast majority of batch and streaming processing jobs using Spark and Hadoop.

Some common use cases are migrating legacy jobs to GCP or running SparkML jobs when BQML is not an option. Dataproc can be a substitute for Dataflow when working with complex batch or streaming workflows which have very strict processing requirements, or when strict privacy or security guarantees are only possible on single-tenant hardware. Dataproc can also be cheaper than Dataflow for some workflows.

Dataproc can be run on various hardware. The default configuration is a cluster of VMs, but Dataproc can also run on a GKE cluster, although this alters the cluster configuration away from the standard master-worker layout and can interfere with some job executions in practice. Dataproc Serverless is a newer fully managed batch Spark service. This can make it a more direct substitute for Dataflow, which also operates on a serverless platform, but you lose some of the security guarantees offered by dedicated hardware. It is possible to use custom containers if your task needs more specific libraries.

Batch Data Processing

GCS is an efficient substitute for HDFS and is the most common storage option for Dataproc, but you can load data onto Dataproc cluster storage if you want to work with an HDFS interface directly. Dataproc also has built-in connectors for querying BigQuery data or HBase data hosted in Bigtable.

You can run a Dataproc job by submitting it directly to the cluster via the Cloud Console or by using the gcloud SDK from a command-line interface. Dataproc jobs can also be scheduled to execute using a job orchestration tool, such as Cloud Composer, which can provide detailed error handling and process flows via a DAG architecture. It is also possible to use Cloud Scheduler or Cloud Functions to schedule application execution.

Dataproc clusters can be started or stopped using gcloud commands. Stopping a cluster stops all charges for the VMs, but you still pay for persistent disks and other cluster resources. This is a useful practice if you want to save money by only keeping your instances running while jobs are executing. Cloud Composer could be used to spin up a Dataproc cluster, execute a series of jobs or tasks, and then spin down the resources once the tasks are completed.
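
A minimal Cloud Composer DAG following that ephemeral-cluster pattern might look like the sketch below; the project, region, cluster sizing, and the gs:// path of the PySpark job are all assumptions, and operator details can vary between Airflow provider versions.

from datetime import datetime

from airflow import DAG
from airflow.providers.google.cloud.operators.dataproc import (
    DataprocCreateClusterOperator,
    DataprocDeleteClusterOperator,
    DataprocSubmitJobOperator,
)
from airflow.utils.trigger_rule import TriggerRule

PROJECT = "my-project"                     # assumed project, region, and cluster name
REGION = "us-central1"
CLUSTER = "ephemeral-batch-cluster"

with DAG("ephemeral_dataproc_batch", start_date=datetime(2023, 1, 1),
         schedule_interval="@daily", catchup=False) as dag:
    create = DataprocCreateClusterOperator(
        task_id="create_cluster",
        project_id=PROJECT, region=REGION, cluster_name=CLUSTER,
        cluster_config={
            "master_config": {"num_instances": 1, "machine_type_uri": "n1-standard-4"},
            "worker_config": {"num_instances": 2, "machine_type_uri": "n1-standard-4"},
        },
    )
    run_job = DataprocSubmitJobOperator(
        task_id="run_pyspark_job",
        project_id=PROJECT, region=REGION,
        job={
            "reference": {"project_id": PROJECT},
            "placement": {"cluster_name": CLUSTER},
            "pyspark_job": {"main_python_file_uri": "gs://my-staging-bucket/jobs/transform.py"},
        },
    )
    delete = DataprocDeleteClusterOperator(
        task_id="delete_cluster",
        project_id=PROJECT, region=REGION, cluster_name=CLUSTER,
        trigger_rule=TriggerRule.ALL_DONE,  # tear the cluster down even if the job fails
    )
    create >> run_job >> delete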

Stream Data Processing

Stream data processing in Dataproc is handled by Spark Structured Streaming (or the older DStream-based Spark Streaming API). Like batch processing, you would choose this when you need an alternative to Dataflow, such as when you are migrating an on-prem job to Dataproc or you need a strictly closed-loop system on single-tenant hardware.

In some instances it might be more efficient to work with certain types of data via Spark rather than BigQuery. For example, Spark works natively with Parquet data while BigQuery must convert it to Capacitor format before using it.



Data acquisition and import

Access and development of data sources is the first step in any data pipeline. Depending upon your data architecture this could mean importing data into a staging area or using a tool like Cloud Functions to access an external API.


Data Ingestion

Data ingestion means retrieving data from external sources. The data can be pulled from a source such as an HTTP endpoint, RDBMS, SFTP server, Kafka stream, Pub/Sub topic, or GCS, or pushed from another source. As with almost the entire data pipeline, the choice of infrastructure and architecture is almost entirely malleable and depends upon the requirements of the engineering team and your data governance policies and requirements. The primary goal of this stage is simply to get the data into a position where it can be operated on further in the processing stage.

API Extract

Reading data from an API endpoint is a very common task in data engineering and is often seen in microservices architectures and when interacting with third-party SaaS applications. Use a tool like Python running in Cloud Functions to pull the data from the API, perform some primitive and cursory activities such as initial cleansing and formatting, and then push the data to a Pub/Sub topic, GCS, or directly into BigQuery. APIs are usually accessed through an SDK (software development kit) or an HTTP endpoint which retrieves the data from the host system based upon a parameterized request string passed to the endpoint.
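
A minimal sketch of such a Cloud Function (1st gen HTTP signature) is shown below; the API URL, project, and topic names are hypothetical, and it performs only light cleansing before handing records off to Pub/Sub.

import json

import requests
from google.cloud import pubsub_v1

publisher = pubsub_v1.PublisherClient()
TOPIC = publisher.topic_path("my-project", "raw-orders")     # assumed project and topic

def pull_orders(request):
    """HTTP-triggered Cloud Function entry point."""
    resp = requests.get(
        "https://api.example.com/v1/orders",                 # hypothetical endpoint
        params={"since": request.args.get("since", "2023-01-01")},
        timeout=30,
    )
    resp.raise_for_status()
    for order in resp.json():
        order.pop("debug_headers", None)                     # light, cursory cleansing
        publisher.publish(TOPIC, data=json.dumps(order).encode("utf-8"))
    return "ok"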

RDBMS Endpoint

Data can be pulled from an RDBMS as a source. This can mean either passing SQL as a direct query or using an API pass-through endpoint to facilitate the query. BigQuery can also query a Cloud SQL table directly through a federated query. Additionally, Datastream can be used to perform CDC operations and synchronize data easily into BigQuery.

SFTP

SFTP (SSH File Transfer Protocol) is an older technology, and you won't see it often in new projects. SFTP data can be ingested as a bitstream using Python and moved directly into GCS, or a managed SFTP service can provide a common interface for handling most common SFTP operations.

Kafka Stream

Kafka is a popular message broker service designed for streaming high-throughput, real-time data to consumers and consumer groups. The data can be schematized before retrieval, and streams can be combined and joined with other streams to form enriched and aggregated data. Kafka data can be read with Spark on Dataproc, moved directly into BigQuery via Dataflow, or read with Python in a Cloud Function. If you are using a managed service such as Confluent Cloud you can set up a BigQuery or GCS connector to push topic messages directly into the storage system.

Kafka data can be read either in batches with an offset pattern or as a stream with a tool like Dataflow or Spark on Dataproc.
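
A rough batch-style read using the kafka-python client is sketched below; the broker address, topic, and bucket are assumptions, and a real pipeline would track offsets and handle retries rather than draining the topic in one pass.

from google.cloud import storage
from kafka import KafkaConsumer            # kafka-python client

consumer = KafkaConsumer(
    "orders",                              # assumed topic
    bootstrap_servers=["broker.example.com:9092"],
    auto_offset_reset="earliest",
    enable_auto_commit=False,
    consumer_timeout_ms=10000,             # stop iterating once the topic is drained
)

lines = [msg.value.decode("utf-8") for msg in consumer]
consumer.close()

bucket = storage.Client().bucket("my-staging-bucket")
bucket.blob("kafka/orders.jsonl").upload_from_string("\n".join(lines))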

Pub/Sub

Pub/Sub is GCP's native message broker service and is conceptually similar to Kafka. Pub/Sub data can be read by Dataflow or Dataproc, or streamed directly into BigQuery via a Pub/Sub subscription. It is often used as a buffer to house data coming from an operational system, such as Firestore, in preparation for ingestion into a downstream system, such as BigQuery.

GCS

You will often encounter scenarios where the data have already been deposited into GCS via a third party application or as the result of another upstream process. In this case you will need to read the data from GCS as a source. GCS data can be read directly from almost any processor including Python, Dataproc, Dataflow, or BigQuery as an external table.

Pipeline Paradigms

Pipelines are commonly thought to come in two flavors: ETL (Extract, Transform, and Load) and ELT (Extract, Load, and Transform). In the first case, transformation, cleansing, and preparation tasks are performed before the data are loaded into the data warehouse. In the second case, those tasks are performed after the data are loaded into the data warehouse.

ETL vs ELT

ETL, or Extract, Transform, and Load, is the classical method of data processing and pipeline architecture. It is still used as a general solution for building high-performance pipelines in many architectures, including the cloud. As the name implies, the data are first extracted using an ingestion engine, then processed (transformed) using data processing software or a suite of tools, and finally the prepared data are loaded into the data warehouse for further analysis.

ELT, or Extract, Load, and Transform, is a newer method of data pipelining where data are first loaded into a data warehouse, such as BigQuery, and then transformed using a series of SQL operations, usually architected as a chained transformation or as a DAG orchestrated by Cloud Composer or dbt.
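
A minimal ELT-style transformation step run through the BigQuery Python client might look like the sketch below; the source table, destination table, and column names are assumptions, and in practice the chain of SQL steps would usually be orchestrated by Cloud Composer, dbt, or scheduled queries.

from google.cloud import bigquery

client = bigquery.Client()

# The raw data were already loaded into my-project.raw_dataset.orders (the "L" in ELT);
# this step transforms them in place with SQL and writes a curated table.
job_config = bigquery.QueryJobConfig(
    destination=bigquery.TableReference.from_string("my-project.curated.daily_order_totals"),
    write_disposition=bigquery.WriteDisposition.WRITE_TRUNCATE,
)
client.query(
    """
    SELECT DATE(order_timestamp) AS order_date,
           SUM(amount)           AS total_amount,
           COUNT(*)              AS order_count
    FROM `my-project.raw_dataset.orders`
    GROUP BY order_date
    """,
    job_config=job_config,
).result()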

Ingesting Data

An ingestion engine is the first step in any data pipeline. This is the point where data are either pulled into GCP from an external source or pushed into GCP by an external source. The nature of the source data determines the choice and scope of the technologies required to actually perform this task.

Preparing Data

Data preparation is the process whereby data are cleaned, munged, schematized, and organized in preparation for processing or loading into a data warehouse. In machine learning this is also the step where data labeling is performed.

Processing Data

Following the preparation stage data are then sent to a processing layer to be further analyzed and aggregated before loading into a data warehouse, ML model pipeline, or data mart. At this stage the data have been through several stages of schematization, qualification, and quality checks and are now ready for insertion and analysis.

Loading Data

Following the processing stage, data are inserted or pulled into a data warehouse, data mart, or ML model. This is the point where engineering meets analysis. The data can be semantically evolved to better reflect business logic, or further transformed to build models or views used by BI tools such as Tableau or Looker.