Storing the data
Ingested data must be stored, and how you architect your storage layer will set the tone for your entire project. GCP offers a number of fully managed storage options. The right choice depends on trade-offs between availability, durability, cost, and performance.
Topics Include:
- Selecting storage systems by weighing performance and cost trade-offs and lifecycle management
- Planning for using a data warehouse including modeling data, mapping business requirements, and data access patterns.
- Using a data lake including management, data processing, and monitoring.
- Building a data mesh from GCP native technologies, segmenting data across teams, and building a federated governance model.
GCP Professional Data Engineer Certification Preparation Guide (Nov 2023)
Module Topics
Selecting storage systems
Planning for using a data warehouse
Using a data lake
Designing for a data mesh
Selecting storage systems
At the core of your entire solution is the storage layer. Build it correctly and your solution will run smoothly; build it incorrectly and the resulting problems will be extremely difficult to fix and will plague everything built on top of it. Leverage GCP native services to ensure a high-quality and cost-effective storage solution.
Topics Include:
- Analyzing data access patterns
- Choosing managed services (e.g., Bigtable, Cloud Spanner, Cloud SQL, Cloud Storage, Firestore, Memorystore)
- Planning for storage costs and performance
- Lifecycle management of data
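As an illustration of lifecycle management, the sketch below builds a Cloud Storage lifecycle configuration that moves objects to colder storage classes as they age. The age thresholds, the `TRANSITIONS` list, the `build_lifecycle_config` helper, and the seven-year deletion age are assumptions chosen for the example, not recommendations from this guide; the thresholds simply mirror each class's minimum storage duration.

```python
import json

# Assumed age thresholds (in days) for moving objects to a colder class.
# They mirror the minimum storage duration of each class; tune them to
# your own access patterns.
TRANSITIONS = [
    (30, "NEARLINE"),
    (90, "COLDLINE"),
    (365, "ARCHIVE"),
]


def build_lifecycle_config(transitions, delete_after_days=None):
    """Return a dict shaped like a Cloud Storage lifecycle configuration."""
    rules = [
        {
            "action": {"type": "SetStorageClass", "storageClass": storage_class},
            "condition": {"age": age_days},
        }
        for age_days, storage_class in transitions
    ]
    if delete_after_days is not None:
        # Optional final rule: delete objects once they pass this age.
        rules.append(
            {"action": {"type": "Delete"}, "condition": {"age": delete_after_days}}
        )
    return {"rule": rules}


# Hypothetical retention policy: delete after roughly seven years.
config = build_lifecycle_config(TRANSITIONS, delete_after_days=2555)
print(json.dumps(config, indent=2))
```

The printed JSON could be saved to a file and applied to a bucket with `gsutil lifecycle set`, where the file and bucket names are your own.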
Planning for using a data warehouse
A data warehouse is an effective tool for structuring and leveraging your data for analysis and reporting. Use tools such as BigQuery to create a high-performance serverless data warehouse solution. Organize your data effectively by mapping current and future architecture to current and future business requirements. Leverage developed tools to support data access patterns.
Topics Include:
- Designing the data model
- Deciding the degree of data normalization
- Mapping business requirements
- Defining architecture to support data access patterns
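To make the normalization trade-off concrete, here is a minimal sketch in plain Python that turns two normalized tables into one denormalized table with a nested, repeated field, which is the shape BigQuery supports through its nested and repeated columns. The `orders` and `order_items` rows and the `denormalize` helper are hypothetical and exist only for illustration.

```python
from collections import defaultdict

# Hypothetical normalized rows: an "orders" table and an "order_items" table.
orders = [
    {"order_id": 1, "customer": "acme"},
    {"order_id": 2, "customer": "globex"},
]
order_items = [
    {"order_id": 1, "sku": "A-100", "qty": 2},
    {"order_id": 1, "sku": "B-200", "qty": 1},
    {"order_id": 2, "sku": "A-100", "qty": 5},
]


def denormalize(orders, order_items):
    """Embed each order's items as a repeated nested field on the order row."""
    items_by_order = defaultdict(list)
    for item in order_items:
        items_by_order[item["order_id"]].append(
            {"sku": item["sku"], "qty": item["qty"]}
        )
    return [{**order, "items": items_by_order[order["order_id"]]} for order in orders]


nested_orders = denormalize(orders, order_items)

# With nested rows, a per-order total needs no join at query time.
total_qty = {
    row["order_id"]: sum(item["qty"] for item in row["items"])
    for row in nested_orders
}
print(total_qty)  # {1: 3, 2: 5}
```

The nested form trades some redundancy and update complexity for join-free reads, so it tends to suit access patterns that fetch an order together with its items.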
Using a data lake
A data lake can be a great solution for data storage and architecture. Although there are many ways to develop a data lake, there are a few best practices for managing a data lake in GCP.
Topics Include:
- Managing the lake (configuring data discovery, access, and cost controls)
- Processing data
- Monitoring the data lake
Designing for a data mesh
Data mesh is a fairly new concept that encourages domain-driven data product development. GCP has a number of native tools that enable you to quickly and efficiently build a data mesh, segment data, and build a federated governance model.
Topics Include:
- Building a data mesh based on requirements by using Google Cloud tools (e.g., Dataplex, Data Catalog, BigQuery, Cloud Storage)
- Segmenting data for distributed team usage
- Building a federated governance model for distributed data systems