Planning the data pipelines
Data pipelines are software applications. They require planning and standardization to operate at peak efficiency. Understand and plan for the various requirements of data pipelining in GCP.
Topics include:
- Defining data sources and sinks
- Defining data transformation logic
- Networking fundamentals
- Data encryption
Defining data sources and sinks
Data sources and sinks, sometimes referred to as targets, are the endpoints of your data pipelines. The source determines how you ingest the data; the sink determines where the data is deposited, shaping your storage architecture and the eventual end product.
Understanding The Source Data
When you are working towards developing a data processing solution, you first need to start by understanding the source data. This could be SaaS data from a third-party application, web app event data, IoT data coming in from remote sensors, or any number of other possibilities. Understanding the nature of the data is key, as it determines the steps needed to process the data effectively and transform it into a state that is consumable by end users. Keep in mind that there is often more than one possible way to do something in GCP, but there is usually only one preferred or most efficient solution, which is what you should be looking for when developing data processing solutions in GCP.
Some key questions to ask of your data:
- How are the data being produced?
- Is this data coming from an API, SFTP server, CDC transaction logs, database, Pub/Sub topic, or a Kafka stream?
- What is the preferred method of querying or gathering the data?
- Can I query this data using a modified timestamp so I only have to gather the newest data, or do I have to poll the entire dataset each time I execute the pipeline? (See the incremental ingestion sketch after this list.)
- Is this data queryable from a source system via an API?
- Is a REST HTTP request required to query the data?
- What is the query pattern that I will need to use to effectively and completely query the data?
- If the data are dirty or require transformations, what is the nature of the transformations required?
- Is this data stateful or event driven? Is there a high likelihood of late-arriving, duplicated, or malformed data?
- What data governance standards are in place to determine how the system should handle bad data?
- What are the security requirements for this data?
- Is it a simple SAML API login with just a username and password?
- How do I store access credentials securely and have easy access to them when I need them?
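As a concrete illustration of the incremental-query question above, here is a minimal Python sketch of watermark-based ingestion. The endpoint, the modified_since parameter, and the local state file are all hypothetical; a production pipeline would typically keep the watermark in a durable store and pull credentials from Secret Manager.

```python
import datetime
import json

import requests  # generic HTTP client; the endpoint and parameters below are hypothetical

STATE_FILE = "last_watermark.json"
API_URL = "https://api.example.com/v1/orders"  # placeholder source system


def load_watermark() -> str:
    """Read the last successful 'modified since' timestamp, or default to a full backfill."""
    try:
        with open(STATE_FILE) as f:
            return json.load(f)["watermark"]
    except FileNotFoundError:
        return "1970-01-01T00:00:00Z"


def save_watermark(value: str) -> None:
    with open(STATE_FILE, "w") as f:
        json.dump({"watermark": value}, f)


def fetch_incremental():
    """Pull only records modified since the last run instead of re-polling the full dataset."""
    watermark = load_watermark()
    response = requests.get(
        API_URL,
        params={"modified_since": watermark},          # hypothetical query parameter
        headers={"Authorization": "Bearer <token>"},   # credentials should come from Secret Manager
        timeout=60,
    )
    response.raise_for_status()
    records = response.json()

    # Advance the watermark only after the batch has safely landed in the sink.
    save_watermark(datetime.datetime.now(datetime.timezone.utc).isoformat())
    return records
```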
These are just a few of the myriad variables present when you are developing data processing solutions in GCP. When you are taking the exam, one of the keys to success is being able to completely understand the nature of the question as well as the components required to execute the pipeline. Use the process of deduction to determine the most correct answer by thinking through how data would move within GCP, as well as any hang-ups that might require a more specialized solution.
Many of the questions you will see will be about some nuance in a given component or subsystem within GCP so having intimate knowledge of all the different GCP data services is essential. Other scenario questions will ask you to determine a proper solution for a given problem which is where the above questions will come in handy. It's really about understanding how to approach the problem in order to arrive at the proper solution.
Understanding The End State or Sink
After you have a thorough understanding of the nature of the source data the next step is to understand the end state of the data. This is a very important step because if you don't have a clear picture of how the data will be consumed then it will be difficult to develop the most efficient data processing system architecture.
The end state of the data can be anything from a data warehouse, to a data lake, to a secondary processing system, to a BI tool such as Tableau. Each of these scenarios requires a different set of techniques and tools to properly develop.
Some key questions to ask when determining the end state:
- Who (which audience) is going to be the primary consumer of your data?
- Will this be for end customers? Data scientists? Executive management? A third party SaaS application?
- Each of these end states means that a different processing architecture must be developed, with different components and different methodologies.
- What state does the end data need to be in? Does the consumer application require a specific schema?
- Will there be any aggregations involved in the data? Does the end state need every possible datum or can the data be aggregated to save processing and storage resources?
- Is the end state another system?
- Is the destination a data lake hosted within GCS?
- A data warehouse in BigQuery? Will the data be used to build a BigQuery ML solution?
- Will the data be used by another data processing system such as Spark in Dataproc?
- Will the data feed a machine learning pipeline in Vertex AI?
In order to efficiently architect and develop your pipeline you have to thoroughly understand your data and the final shape the data will take. Once you know the source data and the end state you can begin to architect the actual pipeline and map out the technologies and techniques needed to achieve the end state required. It isn't possible to think of every possible scenario, but hopefully you can develop the instincts needed to successfully intuit how a pipeline should be built based upon a thorough understanding of the data and the various components involved.
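To make the source-to-sink idea concrete, here is a hedged sketch of one common pattern: loading files that an upstream process has landed in a GCS bucket (the source) into a BigQuery table (the sink) using the google-cloud-bigquery client. The project, dataset, table, and bucket names are placeholders.

```python
from google.cloud import bigquery  # pip install google-cloud-bigquery

# Hypothetical project, dataset, table, and bucket names.
client = bigquery.Client(project="my-project")
table_id = "my-project.analytics.daily_events"

job_config = bigquery.LoadJobConfig(
    source_format=bigquery.SourceFormat.NEWLINE_DELIMITED_JSON,
    autodetect=True,  # in production you would normally pin an explicit schema
    write_disposition=bigquery.WriteDisposition.WRITE_APPEND,
)

# Load newline-delimited JSON files from GCS (the source) into a BigQuery table (the sink).
load_job = client.load_table_from_uri(
    "gs://my-landing-bucket/events/2024-01-01/*.json",
    table_id,
    job_config=job_config,
)
load_job.result()  # block until the load job completes

print(f"Loaded {client.get_table(table_id).num_rows} rows into {table_id}")
```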
Defining data transformation logic
To get the data into the shape and location that you want you have to perform data transformations. This includes manipulations, shaping, typing, flattening, and other processes which must be performed to correctly form your storage architecture.
Data Processing
Data processing follows the data preparation stage and is also known as the transformation layer in certain systems. In this stage, data are refined, assembled, joined, and otherwise manufactured into the data products your business users can make sense of, report on, form opinions from, and eventually act upon.
Data Transformations
Data transformations are also referred to as "munging", "wrangling", "processing", and so on. This can involve data enrichments, aggregations, further quality checks, statistical checks (such as data anomaly detection), key validations, filtering, mapping, and more. These can range in complexity and are used to produce the eventual data models that will be used in data reporting, visualizations, or machine learning pipelines.
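As a rough illustration of a transformation layer, the sketch below uses Apache Beam (the SDK behind Dataflow) to parse raw JSON events, drop malformed rows, de-duplicate, and aggregate per user. The field names and GCS paths are hypothetical.

```python
import json

import apache_beam as beam  # pip install apache-beam[gcp]


def parse_event(line: str):
    """Type and shape a raw JSON line; malformed rows are silently dropped here."""
    try:
        record = json.loads(line)
        return [(record["user_id"], float(record["purchase_amount"]))]
    except (ValueError, KeyError):
        return []  # a real pipeline would route these to a dead-letter sink


# A minimal transformation layer: parse, filter bad rows, de-duplicate, and aggregate.
with beam.Pipeline() as pipeline:
    (
        pipeline
        | "Read raw events" >> beam.io.ReadFromText("gs://my-landing-bucket/events/*.json")
        | "Parse and type" >> beam.FlatMap(parse_event)
        | "De-duplicate" >> beam.Distinct()
        | "Sum per user" >> beam.CombinePerKey(sum)
        | "Format output" >> beam.MapTuple(lambda user, total: f"{user},{total:.2f}")
        | "Write results" >> beam.io.WriteToText("gs://my-curated-bucket/user_totals")
    )
```

The same pipeline code can run locally with the DirectRunner for testing or on Dataflow by changing the runner options.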
Networking fundamentals
Networking is the backbone of the cloud. Without a strong, scalable networking architecture processing trillions of requests each day, the cloud could not exist. As a data engineer, you should be familiar with the basic GCP network architecture and its influence on your processes.
Networking Fundamentals
GCP is a massive network of high-performing servers, infrastructure, fiber optic cables, and data centers spread across the world. This distributed architecture gives customers a high degree of parallelism, redundancy, and availability. The flexibility of the network serves a wide range of use cases and allows companies to move completely off of on-premises data centers if desired. It also ensures massive economies of scale, which allows cloud vendors to provide almost infinitely scalable resources to users while saving them money, byte for byte, compared to on-premises infrastructure.
GCP also has a clear advantage in its processing infrastructure. Google is constantly working to provide best-in-class data transmission lines, fiber optic cables, and data centers. Its cables stretch across the sea and provide dedicated, secure connections between continents, countries, and peoples.
Google has worked tirelessly to connect the world through massive undersea cables and data centers. All of this technology exists so that you can get instant access to your favorite YouTuber's newest content.
Google's Network Overview
You don't need to understand the highly advanced network topology or the laws of thermodynamics to pass the exam. You do, however, need to understand how networks work in GCP, how they interact with the data, and how they provide the essential infrastructure to power your data engineering solutions.
Resource Location Restrictions
Resource Location Restrictions are an organization policy which limits where new resources can be created within GCP. So, for instance, if you have regulatory constraints which preclude operating in certain regions of the world, you can set this policy to prevent any services from operating there. Some GCP services are only capable of operating in certain regions, so setting this policy might negatively affect certain services and occasionally prevent them from working altogether.
Networking Definitions
The Google Cloud physical network is a globally distributed hierarchy consisting of regions and zones.
- Region - A specific geographical area where resources are colocated. A region is not a single physical facility, but a logical collection of zonal assets.
- Zone - A deployment area within a region. These are the actual physical locations of the infrastructure and assets. Zones have dedicated high-speed connections with other zones in a region. From a user's perspective it is almost impossible to tell the difference between transmissions from one zone and another within the same region.
- Virtual Private Cloud (VPC) - A VPC is a virtual network which logically manages your project's data assets and movements within GCP.
- Subnet - Organizes your network's IP address space and delineates prefixes to effectively and efficiently access your cloud assets.
- Cloud Interconnect - A highly available global resource which connects your network to GCP. This allows GCP to view and operate on your on-prem assets as if they were part of the GCP VPC itself. This also enables you to apply an internal IP address to on-prem assets.
- Cloud Router - Cloud Router uses Border Gateway Protocol (BGP) to dynamically exchange routes between your VPC network and your external network.
- VLAN attachment - A virtual network device which allows connection between your host network and GCP. A default attachment has a 10 Gbps capacity, which can be upgraded to 50 Gbps (or downgraded). Each interconnect can have multiple VLAN attachments for connecting to multiple VPCs.
- Dedicated Interconnect - A dedicated connection between your on-prem network and GCP. Each connection can support multiple VPCs. This is a good option if a highly secure connection between on-prem processing systems and GCP is necessary and you don't want to send data over the internet.
- Partner Interconnect - This uses a dedicated partner service provider to act as a medium between you and GCP so that you can get the benefits of interconnect without having to manage the assets yourself.
- Cross-Cloud Interconnect - A Google-created and Google-managed connection between GCP and another cloud service provider, such as AWS. This is useful if you have to trade large amounts of data securely between providers because you are running different services with different providers.
- Cloud Load Balancing - A suite of application and network load balancers which manage connections and requests to backend services such as Compute Engine or Google Kubernetes Engine (GKE). A single user-facing endpoint can fan requests out to potentially millions of backend connections. Cloud Load Balancing also provides TLS/SSL encryption from point to point without traffic ever having to exit dedicated GCP infrastructure.
- Cloud Content Delivery Network (Cloud CDN) - Delivers content at the edge for low-latency processing and content delivery. Also includes Media CDN.
- Google Cloud Armor - Protects your projects and assets from external threats such as distributed denial-of-service (DDoS) attacks and application attacks like cross-site scripting (XSS) and SQL injection (SQLi).
- Cloud Network Address Translation (NAT) - A subnet service which allows infrastructure without a public IP address to communicate with the internet. This is often used to download critical updates in a highly controlled manner without exposing the endpoint to inbound traffic. It is important to note that Cloud NAT is not the same as the NAT devices found in other providers, such as AWS. It performs the same service, but it is a virtual service, not a physical device; as such, IP addresses are assigned at the asset level, not the subnet level.
Standard vs Premium Tier
GCP offers two different tiers of network service. Which you should choose is dependent upon your specific requirements for security and performance.
Standard Tier routes traffic over the public internet as an intermediary between GCP services and the end requestor. It is good up to a point and is suitable for POC/MVP development. It also offers a free tier for packet transfers up to 200 GB each month. Standard Tier is fine if you don't require more advanced networking services such as Cloud CDN or HTTPS proxying, if your assets are ensured to operate only within a single region, or if you are running batch processing that is not latency sensitive. The service levels are comparable to most other cloud providers. Standard Tier is cheaper because the data spends less time and distance on GCP's dedicated network, but it is also less reliable and less performant.
Premium Tier routes traffic over GCP's dedicated network, entering and leaving at the closest possible edge location (known as a point of presence). Users connect to the network at the nearest location and the data travels through GCP's network over dedicated hardware. Premium Tier offers globally available external IP addresses. It is usually employed by advanced application users with global deployments and by users who require advanced networking or easy access to Cloud Storage.
Cloud VPC
Cloud VPC is the backbone of GCP project deployments. It connects all of your project's global assets and enables scalability, high availability, and massive parallelism, all through a single point of connection for user devices. All new projects come with a default VPC. VPC routing determines which requests end up where and how.
A subnet manages and assigns IP addresses to individual assets in a zone. Subnets can be assigned an IPv4 address range, an IPv6 address range, or both. This range is referred to as a Classless Inter-Domain Routing (CIDR) range. Subnets can be internal or external facing. External IPv6 addresses are only available in the Premium Tier. VPCs are global assets, while subnets are regional assets.
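If CIDR notation is unfamiliar, Python's standard ipaddress module makes it easy to see what a range such as 10.0.1.0/24 actually covers:

```python
import ipaddress

# A /24 block, the kind of range a custom-mode VPC subnet might use.
subnet = ipaddress.ip_network("10.0.1.0/24")

print(subnet.num_addresses)       # 256 addresses in the block
print(subnet.network_address)     # 10.0.1.0
print(subnet.broadcast_address)   # 10.0.1.255
print(ipaddress.ip_address("10.0.1.42") in subnet)  # True: this host falls inside the range

# Note: GCP reserves a few addresses in each subnet (such as the network and
# broadcast addresses), so not every address in the block is usable by assets.
```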
Subnets can be created either automatically by GCP or manually using custom creation. Generally, it is best to use auto subnet creation unless there is a good reason why you can't. Custom subnet and CIDR block management can become very complicated and create bugs which are difficult to diagnose.
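For reference, a custom-mode VPC and subnet can also be created programmatically. The sketch below assumes the google-cloud-compute client library and uses placeholder project, region, and resource names; gcloud or Terraform are equally common choices.

```python
from google.cloud import compute_v1  # pip install google-cloud-compute

PROJECT = "my-project"   # hypothetical project and resource names
REGION = "us-central1"

# Create a custom-mode VPC (no auto-created subnets).
network = compute_v1.Network(name="pipeline-vpc", auto_create_subnetworks=False)
network_op = compute_v1.NetworksClient().insert(project=PROJECT, network_resource=network)
network_op.result()  # wait for the global operation to finish

# Add a regional subnet with an explicitly chosen CIDR block.
subnet = compute_v1.Subnetwork(
    name="pipeline-subnet",
    ip_cidr_range="10.0.1.0/24",
    network=f"projects/{PROJECT}/global/networks/pipeline-vpc",
)
subnet_op = compute_v1.SubnetworksClient().insert(
    project=PROJECT, region=REGION, subnetwork_resource=subnet
)
subnet_op.result()
```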
VPC peering enables two VPCs to communicate through a dedicated connection. If you are planning on doing VPC peering, you will have to use custom subnets, since auto mode subnets will create CIDR conflicts. VPC peering does not allow so-called transitive peering, or bridging across VPCs. For example, if VPC1 is peered with VPC2 and VPC2 is peered with VPC3, that does not mean VPC1 is therefore peered with VPC3; a separate peering must be created between VPC1 and VPC3. VPC peering is useful if you have internal assets with private IP addresses and do not want to route traffic to them using external IP addresses.
Data encryption
GCP offers a high degree of freedom when choosing how to encrypt data on the platform. Native data encryption is standard within GCP, but there may be times when you want something more robust, which might require a more advanced architecture.
Data Encryption
Encrypting data in your pipelines is an essential practice that produces a high-quality product and inspires customer confidence. GCP's native encryption standards are well suited to managing data encryption throughout the stack. However, there are some highly specialized use cases which might require additional layers of encryption.
When you send your data to the cloud, it is encrypted by default via server-side encryption at the point of entry into GCP. Additionally, the APIs are encrypted using TLS/SSL by default. If you require additional security, or you want to be absolutely sure that Google cannot read your files, you can use client-side encryption with customer-managed keys. This means the data cannot be decrypted while in GCP without supplying the key to GCP (for example by storing it in Secret Manager or using Cloud KMS) or to the individual service you are working with. A potential use case for this is storing archived data in GCS: you could encrypt the files on-prem and upload them to GCS, and they could only be decrypted by pulling the files back on-prem and decrypting them with the key.
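A minimal sketch of that archive scenario, assuming the cryptography and google-cloud-storage libraries and a placeholder bucket name: the file is encrypted client-side before upload, so it can only be read by whoever holds the key.

```python
from cryptography.fernet import Fernet  # pip install cryptography
from google.cloud import storage        # pip install google-cloud-storage

# The key stays on-premises (or in Secret Manager); Google never sees the plaintext.
key = Fernet.generate_key()
fernet = Fernet(key)

with open("archive_2023.csv", "rb") as f:
    ciphertext = fernet.encrypt(f.read())

# Upload the already-encrypted bytes; GCS still applies its own server-side encryption on top.
bucket = storage.Client().bucket("my-archive-bucket")  # hypothetical bucket name
bucket.blob("archives/archive_2023.csv.enc").upload_from_string(ciphertext)

# To read the archive later, pull the object back and decrypt it with the locally held key.
encrypted = bucket.blob("archives/archive_2023.csv.enc").download_as_bytes()
plaintext = fernet.decrypt(encrypted)
```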
Government customers often have very specific cryptographic requirements that differ from, or are in addition to, native GCP encryption, such as FIPS 140-2.