Pretest Information
These tests are designed to check your knowledge, prepare you for exam questions, and clarify topics in greater depth. The questions should resemble those you will see on the actual exam. There is no time limit (though feel free to time yourself). Instead, it is more important to get the questions and thought patterns correct, which in turn will speed up your test taking. Aim for 90% correct to be prepared for the exam.
GCP Professional Data Engineer Certification Preparation Guide (Nov 2023)
Pre-Tests: Test 1
You work at a large financial institution and have been asked by management to architect a solution for enabling a data mesh in GCP. What is the most efficient solution?
Option A is incorrect: Buying a plethora of tools would be inefficient in this instance as Dataplex can perform all of these duties natively with minimal effort.
Option B is incorrect: This effort is overly complicated and complex and would require a lot of development effort across teams. Additionally, it doesn't lay a proper foundation for a data mesh because it is not controlling for domain knowledge and differentiation. There are no clear delimiters between teams and federated duties.
Option C is correct: Before beginning construction of the data mesh itself you should establish a governing body, separate from the teams/domains themselves, which can establish rules for sharing and data quality regulations which apply to domains and zones. Dataplex is a unified Data Management and Governance suite which can provide the necessary pathways for domain driven design/development while simultaneously ensuring a high-quality and performant system. Dataplex will automatically capture, analyze, categorize, and apply governance functions to new data from BigQuery and GCS.
Option D is incorrect: Firstly, working directly with the teams to develop unique rules violates data mesh's federated governance standards. Using Cloud Build and Cloud Functions to perform the necessary checks would work, but it is inefficient compared to Dataplex, which performs these functions natively. Building a data service on a VM or App Engine could be useful for data discovery and sharing, but all the custom rules, logic, and development would have to be done in house, which is again inefficient compared to Dataplex. Capturing access logs to ensure that users are not accessing data improperly is a stopgap measure and defeats the purpose of the data mesh, where those kinds of checks are performed by Dataplex natively as part of the domain-driven development standard.
You are working to build out a data lake using Dataplex. After setting up the necessary zones you need to incorporate your BigQuery tables, views, and BQML models into Dataplex and Data Catalog. What actions do you need to take to do this?
A is incorrect. After datasets are added to a zone within Dataplex all tables are automatically scanned and tagged by Dataplex. Dataplex will then build out the metadata automatically.
B is incorrect. You must first add the datasets to the proper zones, and then Dataplex will onboard them.
C is incorrect. This is almost correct, but you should first verify that Dataplex has the necessary permissions and that BigQuery data lineage is enabled.
D is correct. There are a few steps needed before you can add your assets to Dataplex. First, ensure that Dataplex has the proper permissions to read the needed GCS buckets and BigQuery datasets. It is also recommended to enable BigQuery Data Lineage to track where data comes from, where it is passed to, and what transformations are applied to it. Then, to onboard data assets, manually add either the Cloud Storage bucket or the BigQuery dataset to the proper zone. After you add the dataset or bucket, Dataplex will scan through and add your assets to the proper zone, map data lineage, build Data Catalog information, and surface the data for discovery. GCS assets are made into BigQuery external tables, which can then be manually upgraded to BigLake tables.
You are working at a consulting firm. One of your clients is inquiring about GCP's disaster recovery protocols and SLOs. What are the two objectives you would look at to rate a disaster recovery scenario?
Select All That Apply
A is correct. RPO, or Recovery Point Objective, is how far back you want your data to be recoverable. For example, do you want your database backed up every day, every week, every hour, etc.?
B is correct. RTO, or Recovery Time Objective, is how quickly you should be able to recover from a failure. This is usually measured in "Wall Clock Time", or how much real world time passes before recovery is complete.
C is incorrect. Durability is a good metric to ensure the survivability of data. It isn't a disaster recovery objective, but it can be used to help define one.
D is incorrect. Availability can help protect your business against failures and outages. It isn't a disaster recovery objective, but it can be used to help define one.
Your analytics team is interested in capturing web traffic so that they can build a recommender system for offered products. They have set up Google Analytics on the website and are capturing web traffic effectively. Now they want to move that data into their data mesh so that they can enrich the data with other sources, perform feature engineering, and build the recommender system. What would be the most efficient architecture to accomplish this?
Option A is incorrect. It is almost correct, but you don't need to use Dataproc as BigQuery ML can build a recommender system.
Option B is incorrect. Although this is technically feasible it would be operationally inefficient compared to using BigQuery Transfer Service to move the data.
Option C is correct. You can use the BigQuery Data Transfer Service and Google Analytics to easily set up a daily transfer of data from GA to BQ with minimal configuration. You can use the BigQuery ML TRANSFORM clause to perform complex feature engineering. Finally, you can use BigQuery ML to build a Matrix Factorization model to produce the recommendations. This method is also ideal because Dataplex will automatically capture newly created tables and models, perform the necessary data quality checks and permissions setup, and make the data available for use by the analytics team.
Option D is incorrect: There is no such thing as the Google Analytics Transfer Service, and there is no native way to do this from Google Analytics itself.
What is usually meant by Data Sovereignty?
Option A is incorrect. Data sovereignty governs the data itself, not the owners.
Option B is incorrect: Data Sovereignty does not refer to national rights to data.
Option C is correct: Data Sovereignty is a legal concept which determines the laws and regulations which apply to data when it is processed, viewed, collected, or stored within a certain geographical boundary.
Option D is incorrect: Data sovereignty refers to any laws and regulations for data within a geographical context. Although some laws may apply to privacy, this is not the sole purpose of data sovereignty.
The analytics team is building a dashboard in Looker which queries the same BigQuery table every few hours throughout the day. The data mart is rebuilt once per day in the mornings. What is the most efficient solution you should implement in order to minimize the query response time and the amount of data processed per day?
A: Option A is not correct. A Materialized View could work for the performance and processing requirement, but it is not the most optimal solution and would be redundant in this case.
B: Option B is not correct because Looker automatically caches queries, but only for 1 hour. Changing it to 24 hours could work, but it isn't the most optimal solution in this case.
C: Option C is not correct for a few reasons:
- This is a complex solution and would be difficult and time consuming to implement.
- BigQuery doesn't have a user editable resource manager for the underlying query engine, it is a serverless service.
- You can allocate slots at the project level, not the table level.
- Additional slots improve availability, but not performance.
D: Option D is correct. BigQuery will automatically cache query results in a temporary table for 24 hours. Whenever Looker runs the same query it will take results from the cache and won't query the underlying table again until the underlying data changes, which in this case is the following morning.
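A minimal sketch (using the BigQuery Python client) of how you might confirm that a repeated query was served from this 24-hour result cache; the project, dataset, and table names are placeholder assumptions.

```python
from google.cloud import bigquery

client = bigquery.Client()

# use_query_cache defaults to True; it is set explicitly here for clarity.
job_config = bigquery.QueryJobConfig(use_query_cache=True)

query = (
    "SELECT product_id, SUM(revenue) AS revenue "
    "FROM `my_project.mart.daily_sales` GROUP BY product_id"
)

job = client.query(query, job_config=job_config)
rows = list(job.result())

# cache_hit is True when results came from the cached temporary table,
# in which case the query bills zero bytes.
print("served from cache:", job.cache_hit)
print("bytes billed:", job.total_bytes_billed)
```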
You are working at a start-up company and the CIO wants to move to a new BigQuery pricing edition. Their current requirements are:
- There are a large number of scheduled jobs which run each morning that generally use the same amount of data each day.
- The analytics team is about to introduce many new high performance queries which they are seeking to optimize.
- The data security and regulatory teams want to introduce data masking to comply with new privacy regulations.
- The CIO wants to be assured that if a failure occurs in a single zone that the service will still be up and the data will be assured.
A is incorrect. On-Demand would be expensive to run each day for the standard workloads, wouldn't provide BI acceleration, and doesn't support data masking.
B is incorrect. Standard supports slot reservation for the standard jobs and provides zonal high availability, but it wouldn't support BI acceleration or data masking.
C is correct. Enterprise is the best fit for slot reservation, BI Acceleration, zonal high-availability, and data masking.
D is incorrect. Enterprise Plus offers additional features that are not required at this level and would therefore be economically inefficient compared to Enterprise.
As a member of the data engineering team you run dozens of Dataproc jobs daily. Lately you have noticed that the cron scripts that you use to run the Dataproc jobs are becoming difficult to manage, understand, and visualize. You are looking for a better method to organize your jobs while enabling advanced job management tools such as graceful handling of job failures, automated scheduling, and job dependency graphs. What tool could you use to solve this?
Option A is correct: Cloud Composer is Google's fully managed Apache Airflow service. Use it to visualize and manage your Dataproc jobs and task dependencies.
Option B is incorrect: Although Cloud Scheduler can be used to execute jobs at a given time and can track job status, it is not possible to visualize or manage complex DAG dependencies. Additionally, Cloud Scheduler doesn't have an efficient task dependency tree.
Option C is incorrect: Cloud Workflows is used to automate event-driven API workflows and isn't the best tool for running scheduled batch Dataproc jobs.
Option D is incorrect: This is a huge undertaking which would require a significant effort to build. Cloud Composer does all of this by default with minimal user input or maintenance required.
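A hedged sketch of what the cron-to-Composer move could look like: a small Airflow DAG where each Dataproc job is a task with explicit dependencies. Project ID, cluster name, region, and the jar URI are placeholder assumptions.

```python
from datetime import datetime

from airflow import DAG
from airflow.providers.google.cloud.operators.dataproc import DataprocSubmitJobOperator

SPARK_JOB = {
    "reference": {"project_id": "my-project"},
    "placement": {"cluster_name": "analytics-cluster"},
    "spark_job": {
        "main_class": "com.example.DailyAggregates",
        "jar_file_uris": ["gs://my-bucket/jobs/daily-aggregates.jar"],
    },
}

with DAG(
    dag_id="daily_dataproc_jobs",
    schedule_interval="0 5 * * *",   # replaces the cron entry
    start_date=datetime(2023, 11, 1),
    catchup=False,
) as dag:
    extract = DataprocSubmitJobOperator(
        task_id="extract", job=SPARK_JOB, region="us-central1", project_id="my-project"
    )
    aggregate = DataprocSubmitJobOperator(
        task_id="aggregate", job=SPARK_JOB, region="us-central1", project_id="my-project"
    )

    # The dependency graph is rendered automatically in the Airflow UI,
    # and failed tasks can be retried individually.
    extract >> aggregate
```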
You need to develop a Cloud Function to perform some data processing. You want to work locally and only deploy the Cloud Function when you are sure it is ready. What is the optimal method for development?
A is incorrect. This is smart for developing the python script, but it lacks the robust testing and deployment best practices.
B is incorrect. This is close, but Cloud Build should be used to deploy the function when it is ready.
C is correct. Use Jupyter notebooks to develop, use the Functions Framework to debug and test, and use Cloud Build to deploy the function, all from your local machine.
D is incorrect. This would likely result in a difficult development process with many failed attempts to deploy and test the function.
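A minimal sketch of the local development loop with the Functions Framework; the function name and payload fields are illustrative assumptions. Run locally with `functions-framework --target=process_event --debug`, POST test payloads to http://localhost:8080, and only then deploy with Cloud Build.

```python
import functions_framework


@functions_framework.http
def process_event(request):
    """HTTP Cloud Function developed and tested locally before deployment."""
    payload = request.get_json(silent=True) or {}
    # ...data processing logic, unit tested on the local machine...
    record_count = len(payload.get("records", []))
    return {"status": "ok", "records_processed": record_count}, 200
```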
You are using Dataplex and BigQuery to host a data mesh. You want to build freshness checks against the data so that users can know if data has been properly run each morning. What is the most optimal method to accomplish this?
A is incorrect. This is technically feasible, but would be difficult to scale and would require significant maintenance efforts to ensure viability. Dataplex does this automatically with data profiling and auto data quality checks.
B is incorrect. This is technically feasible, but a better destination would be Cloud Logging with alerts set up if jobs fail. Dataplex can apply data profiling and quality checks automatically and is the better choice here.
C is correct. Dataplex can integrate BigQuery data profiling with Data Catalog to automatically map and publish detailed data profiles for any BigQuery table. If you have built a data mesh with Dataplex, you can use its integrated monitoring, which relies on automatically captured metadata, to report on data quality measures such as freshness and availability.
D is incorrect. This is an inefficient solution. It would be better to use automatic Cloud Logging as a product of the service or communicate directly with the API for custom alerting. Also, Dataplex can perform these checks automatically as part of auto data quality checks.
You are working with a financial services firm. They are working with data which, by law, requires strict ACID-compliant transaction handling. They are operating in a single region only. Which database service would be the best fit for their requirements?
A is incorrect. BigQuery is not ACID compliant.
B is incorrect. Although Cloud Spanner is ACID compliant, it is a global service, and would not best fit the requirements.
C is correct. Cloud SQL is ACID compliant and is a regional service.
D is incorrect. BigTable is not ACID compliant.
A data science team has 3PBs of data stored in GCS standard dual-region storage and accumulates an additional GB of data every week. They are interested in ways to lower their monthly storage bills and costs. What are some methods that you should recommend to lower the costs?
Select all that apply
Option A is correct: Autoclass will automatically reclassify data into a cheaper storage class without the user having to perform any configuration themselves.
Option B is correct: You can compress files to reduce the amount of data stored.
Option C is correct: GCS buckets can be stored as multi-region, dual-region, or single-region, and there are slight price differences between these options.
Option D is incorrect: For deleting aged files you should use a lifecycle policy. It is wise to avoid deleting files unless specifically asked to do so.
You are running a sports booking website which processes tens of thousands of ticket purchases per day across hundreds of events and dozens of teams. You decide to use BigTable in order to properly manage the complex and high-velocity event stream. How should you build your BigTable key to ensure efficient operation?
Option A is incorrect. Leading a BigTable key with a timestamp is an anti-pattern which will likely lead to hot-spotting.
Option B is incorrect for the same reasons as A.
Option C is correct because the key is not monotonically increasing and therefore minimizes the chance for hot-spotting.
Option D is incorrect because there is still a strong chance of hot-spotting.
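A hedged sketch of one common non-monotonic row key design for this workload, written with the Bigtable Python client; the instance, table, column family, and field names are assumptions, not the exam's exact option wording.

```python
import datetime

from google.cloud import bigtable


def build_row_key(event_id: str, team_id: str, purchase_ts: datetime.datetime) -> bytes:
    # Lead with high-cardinality, non-sequential fields (event, team) and put the
    # timestamp last so writes spread across tablets instead of hot-spotting.
    return f"{event_id}#{team_id}#{purchase_ts.isoformat()}".encode()


client = bigtable.Client(project="my-project")
table = client.instance("ticketing-instance").table("purchases")

row_key = build_row_key("evt123", "teamA", datetime.datetime.utcnow())
row = table.direct_row(row_key)
row.set_cell("purchase", "amount", b"59.99")
row.commit()
```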
A manufacturer of IOT devices collects sensor readings from the robots which construct the IOT devices. The robots send the data to an API endpoint for ingestion. However, you have noticed that occasionally the ingest fails and the data are lost. What is a better method to ingest the data?
Option A is incorrect: This is incorrect because it is not clear how the data are formatted or how the data manufacturer reads the data.
Option B is incorrect: This would alter the way the data are ingested and would require unnecessary reworks.
Option C is incorrect: Cloud Spanner is a RDBMS which ensures ACID transactions. It is not required for this instance.
Option D is correct: Using Pub/Sub provides redundancy against lost messages and provides a scalable, guaranteed message delivery service.
You currently have a pub/sub topic set up which is capturing application event data in JSON format. What is the most optimal method to stream the data into BigQuery without any transformations?
Option A is incorrect. This method is inefficient for data which does not require transformations or processing.
Option B is incorrect. This method is technically feasible, but it is not the optimal solution.
Option C is correct. Use a BigQuery Subscription to Pub/Sub to automatically stream data into BigQuery via the Storage Write API. This is the best method to stream event data into BigQuery which doesn't require any kind of preprocessing or transformations.
Option D is incorrect. This is a good method if you want to isolate the data and perform some other processing on the data before inserting it. However, it is inefficient in this case.
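A hedged sketch of creating a BigQuery subscription so Pub/Sub streams the JSON events straight into a table via the Storage Write API; the project, topic, subscription, and table names are placeholders.

```python
from google.cloud import pubsub_v1

subscriber = pubsub_v1.SubscriberClient()

topic_path = "projects/my-project/topics/app-events"
subscription_path = "projects/my-project/subscriptions/app-events-to-bq"

# Route messages directly into the BigQuery table, no Dataflow job required.
bigquery_config = pubsub_v1.types.BigQueryConfig(
    table="my-project.analytics.app_events",
    write_metadata=True,   # include message metadata columns alongside the payload
)

subscriber.create_subscription(
    request={
        "name": subscription_path,
        "topic": topic_path,
        "bigquery_config": bigquery_config,
    }
)
```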
You are managing a BigTable instance and have recently built a new table. You have noticed that the data response times have dropped and queries are behaving poorly. What is the first thing you should do to diagnose what is wrong?
Option A is incorrect. This is a good check, but it shouldn't be the first thing you check.
Option B is incorrect. This is unnecessary as Bigtable can handle millions of requests per second.
Option C is incorrect. This is another good check, but it shouldn't be the first thing you check.
Option D is correct. Checking for hot-spotting is the first thing you should check if Bigtable performance is degrading. Use the Key Visualizer to highlight row-keys which are prone to hot-spotting.
You are managing a PostgreSQL Cloud SQL cluster and notice that queries are starting to fail. Upon investigation you notice that your data is now over your persistent disk size of 25 TB and Cloud SQL can no longer effectively process the data. What is the most efficient solution you could recommend to resolve the issue?
Option A is correct. Cloud SQL has a hard storage limit of 25 TB. Cloud Spanner is often used by customers when they reach this limit or when latency issues across regions become too great to use Cloud SQL alone.
Option B is incorrect. Cloud SQL has a hard storage limit of 25TB.
Option C is incorrect. AlloyDB is used to speed up analytical queries, but is still subject to the limits of Cloud SQL, including the hard storage limit of 25 TB.
Option D is incorrect. This is technically feasible, but unless this is a development server this would not be a wise move because it could violate privacy or other legal regulations. Additionally, this is not a sustainable solution as it is likely that you will hit this cap again in the future.
You are operating in a multi-cloud architecture and need to integrate data from a few sources stored in S3 and Azure Blob Storage into BigQuery, where you are already hosting a number of GCS external tables. New files are added to the buckets each day. Additionally, the data contains some PII and requires column-level security to comply with certain legal requirements. How could you accomplish this most efficiently?
A is incorrect: This is technically feasible, but BigQuery Omni would be a better choice since it wouldn't require a data transfer.
B is incorrect: This might be technically feasible, but it would be difficult to manage effectively due to variance in file size. It also would not be fault tolerant as a serverless solution. Additionally, this is very inefficient compared to using BigQuery Omni.
C is incorrect: This solution would be difficult to manage and would require a complex network of permissions and processors to be developed. BigQuery Omni is a much more efficient solution.
D is correct: BigLake tables can be used to read data directly from other public cloud providers via BigQuery Omni. This enables easy multi-cloud analytics without requiring the data to be transferred into GCS first.
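A hedged sketch of defining a BigLake table over S3 data through BigQuery Omni so the files are queried in place. The connection name, AWS bucket path, and dataset are placeholder assumptions; the connection must already exist and the dataset must live in the matching AWS location.

```python
from google.cloud import bigquery

client = bigquery.Client()

ddl = """
CREATE EXTERNAL TABLE `my_project.omni_dataset.s3_orders`
WITH CONNECTION `aws-us-east-1.s3-orders-connection`
OPTIONS (
  format = 'PARQUET',
  uris = ['s3://my-company-orders/daily/*']
)
"""
client.query(ddl).result()

# Column-level security is then applied with policy tags on the table schema,
# just as it would be for a native BigQuery table.
```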
You are running a Cloud Spanner instance and notice that your compliance team frequently looks to link user data and regulatory data by geographical data in order to ensure regulatory compliance. They occasionally notice slow query performance and asked you to optimize the query. User data is pre-processed with Cloud DLP before it is written to Cloud Spanner in order to comply with each country's specific regulations. What is a method you could use in Cloud Spanner to improve the performance of the query while also ensuring that user data cannot be written outside of the region where it is created?
Option A is correct. In this instance the best option is to create an interleaved table in Spanner which will allow the users to query the geographical data, the regulatory data, and the user data in a single table. This provides a much more efficient query than querying the tables separately and attempting to join on non-primary keys.
Option B is not correct. There isn't a good natural relationship immediately available between geographical data, regulatory data and user data. Additionally, even with effective foreign-key relationships it would still be more efficient to query a single interleaved table.
Option C is not correct. Although BigQuery could perform this query more efficiently it would require a complex and inefficient solution to implement compared to better options. BigQuery datasets are limited to a single region by default which might bring up other geolocation issues compared to Spanner leading to increased query latency. There could be privacy and regulatory concerns as well. Additionally, the data requires strict referential integrity which is not supported by BigQuery.
Option D is not correct. Spanner is an RDBMS and while working with STRUCT data in Spanner is possible it is not a best practice and is not as efficient as creating an interleaved or foreign-key table.
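A hedged sketch of the interleaved-table approach in Cloud Spanner. Table and column names are illustrative assumptions; the key point is INTERLEAVE IN PARENT, which physically co-locates child rows with their parent geography row.

```python
from google.cloud import spanner

client = spanner.Client(project="my-project")
database = client.instance("compliance-instance").database("compliance-db")

operation = database.update_ddl([
    """
    CREATE TABLE Geographies (
        GeoId    STRING(36) NOT NULL,
        Region   STRING(64),
    ) PRIMARY KEY (GeoId)
    """,
    """
    CREATE TABLE UserRecords (
        GeoId    STRING(36) NOT NULL,
        UserId   STRING(36) NOT NULL,
        Payload  JSON,
    ) PRIMARY KEY (GeoId, UserId),
      INTERLEAVE IN PARENT Geographies ON DELETE CASCADE
    """,
])
operation.result()   # wait for the schema change to complete
```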
You are working for an online sports betting company. The CFO wants to know the best database technology to host a new highly dynamic sports betting application which has to collect and process data, ensure strong consistency, and have atomic transaction handling, all with minimal latency. The application is expected to have thousands of transactions running per second. What database technology would you recommend?
Option A is incorrect. Cloud SQL can be used to power website data, but at the scale and complexity suggested Cloud SQL would likely not be able to keep up with the transactions.
Option B is correct. Memorystore for Redis is GCP's managed Redis offering: an in-memory data store used to host highly dynamic applications which require thousands or even millions of operations per second. Redis also supports atomic transactions and, against a single primary, strongly consistent reads and writes.
Option C is incorrect. This works for many different websites, but in this instance it would likely be too slow to handle the workload correctly. Firestore does offer transactions, but its latency and throughput characteristics at this transaction rate are not as strong as Memorystore or Cloud SQL.
Option D is incorrect. Cloud Spanner is strongly consistent and ACID compliant, but it would not be quick enough to support the website's requirements.
You are working with a data science team to help migrate some machine learning workflows over to GCP. The data have been migrated to BigQuery already, and now they need to move the models. Currently they are running some batch linear regression models on a local on-prem cluster operating a PyTorch framework. The data science team is interested in the most efficient processing regardless of cost. How would you advise them to migrate the models over to operate most efficiently in GCP?
A is incorrect. This is technically feasible, but it wouldn't provide any benefits over BigQuery ML and would be more complicated to develop and deploy.
B is correct. Since the data are already in BigQuery, and the models are batch linear regressions, it would be easier to rebuild the models using BigQuery ML. BigQuery ML is also a more performant platform for tabular ML than Dataproc.
C is incorrect. This would be a much more complicated development process compared to just using BigQuery ML.
D is incorrect. This could work, and be relatively easy to develop, but it still wouldn't be as efficient or performant as just using BigQuery ML.
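A minimal sketch of rebuilding the batch linear regression in BigQuery ML; dataset, table, and column names are placeholder assumptions.

```python
from google.cloud import bigquery

client = bigquery.Client()

create_model = """
CREATE OR REPLACE MODEL `my_project.analytics.price_model`
OPTIONS (model_type = 'linear_reg', input_label_cols = ['label']) AS
SELECT feature_1, feature_2, feature_3, target AS label
FROM `my_project.analytics.training_data`
"""
client.query(create_model).result()

# Batch predictions run as plain SQL, with no cluster to manage.
predict = """
SELECT *
FROM ML.PREDICT(MODEL `my_project.analytics.price_model`,
                (SELECT feature_1, feature_2, feature_3
                 FROM `my_project.analytics.scoring_data`))
"""
for row in client.query(predict).result():
    pass  # write or inspect predictions here
```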
You have been using Cloud Dataflow to run batch data processing pipelines, but lately you have noticed the costs going up. What is the most efficient way to help reduce costs if you don't need to worry about jobs completing at a specific time and are fault tolerant?
Option A is correct: Use Dataflow FlexRS to lower the costs of your batch pipelines.
Option B is incorrect: Dataflow is a fully managed service, therefore this is not something that you can control.
Option C is incorrect: This would likely result in more expensive pipelines, not cheaper ones.
Option D is incorrect: This could work, but it would require a rework of the data model, the Dataflow job, and likely all downstream applications as well.
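A hedged sketch of enabling Flexible Resource Scheduling (FlexRS) on a batch Beam pipeline submitted to Dataflow; the project, bucket, and transforms are placeholders. FlexRS trades a guaranteed start time for lower-cost, preemptible-backed workers.

```python
import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions

options = PipelineOptions(
    runner="DataflowRunner",
    project="my-project",
    region="us-central1",
    temp_location="gs://my-bucket/tmp",
    flexrs_goal="COST_OPTIMIZED",   # the FlexRS flag for delayed, cheaper scheduling
)

with beam.Pipeline(options=options) as p:
    (p
     | "Read" >> beam.io.ReadFromText("gs://my-bucket/input/*.csv")
     | "Count" >> beam.combiners.Count.Globally()
     | "Write" >> beam.io.WriteToText("gs://my-bucket/output/counts"))
```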
You are working for a health insurance firm which processes and pays claims containing highly sensitive and protected consumer data. Legal and company compliance policies dictate the absolute highest level of cryptographic security for data moving between services within GCP. Which security architecture should you use?
A is incorrect. This is insufficient by itself. You should also use Cloud HSM to physically separate the cryptographic process from other GCP services.
B is incorrect. This is true, but it would be insufficient according to the requirements.
C is incorrect. This is insufficient by itself. You should also use CSEK in this instance to control the keys and make it impossible for anyone, including Google, to access the data.
D is correct. Use Cloud HSM to physically separate the cryptographic process from the other GCP services and use CSEK to ensure that only someone who possesses the key can decrypt the data.
Which regulation(s) are the most well-known data privacy laws specific to the state of California?
Choose all that apply
Options B and C are the correct answers. California's CCPA and CPRA laws are some of the most restrictive in the United States. They outline strict guidelines for data collected, stored, accessed, or processed within California's geographical boundaries.
GDPR applies to EU member states.
FedRAMP is used by the federal government when developing contracts with contracting companies.
COPPA is an important privacy law, but it isn't specific to California.
HIPAA is important for health care data, but it isn't specific to California. HITECH is important for health care data, but it isn't specific to California.
You are working with a data science team who needs to migrate an on-prem Hadoop cluster to GCP. The cluster includes 50 TBs of data hosted on HDFS as well as the associated JARs for execution. What would the most efficient future state architecture look like in GCP?
A is incorrect. You could do this, but using GCS to host the data is the recommended best practice and is in fact higher performing and more scalable than HDFS.
B is incorrect. This is essentially attempting to build Dataproc manually. It would be a difficult and time-consuming task which would provide no additional benefits compared to just using Dataproc Standard.
C is correct. Use Dataproc to run the Hadoop jobs and use GCS to host the data instead of HDFS.
D is incorrect. Dataproc Serverless can only run Spark jobs.
A health care analytics team is operating in BigQuery and often has to work with very sensitive health care data. What method or tool should they use to check for and prevent any leakage of personally identifiable information or other sensitive data?
Option A is incorrect. This would be very expensive and inefficient to run against all columns in your dataset. Also, it wouldn't be a scalable or sustainable solution as it would have to be retrained with each new piece of data discovered.
Option B is correct. Sensitive Data Protection (the Cloud DLP API) can be used to automatically detect PII or other sensitive data in each of your BigQuery datasets. It has built-in infoTypes for most PII, and you can build custom infoTypes to satisfy any regulation.
Option C is incorrect. The purpose of sensitive data protection is to avoid risk. In this instance the team should have to show why they require access to the PII, even if they have permissions to view it. A leak of PII data and an accompanying regulatory violation can be very costly. Also, BigQuery has dynamic masking policies which can automatically reveal data based upon who is looking at it.
Option D is incorrect. This could work, but Cloud DLP does this natively by default. Additionally, by using a third party tool you could potentially open up another surface area of attack.
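A hedged sketch of calling Sensitive Data Protection (Cloud DLP) to scan text for built-in infoTypes. In practice the team would run scheduled inspection jobs against the BigQuery datasets; the project, infoTypes, and sample text here are placeholders.

```python
from google.cloud import dlp_v2

client = dlp_v2.DlpServiceClient()
parent = "projects/my-project/locations/global"

inspect_config = {
    "info_types": [{"name": "PERSON_NAME"}, {"name": "US_SOCIAL_SECURITY_NUMBER"}],
    "min_likelihood": dlp_v2.Likelihood.POSSIBLE,
    "include_quote": True,
}
item = {"value": "Patient Jane Doe, SSN 123-45-6789, admitted on 2023-11-02."}

response = client.inspect_content(
    request={"parent": parent, "inspect_config": inspect_config, "item": item}
)

# Each finding reports the detected infoType, its likelihood, and the quoted text.
for finding in response.result.findings:
    print(finding.info_type.name, finding.likelihood, finding.quote)
```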
You are working for a globally operating financial services firm and the CIO wants to be sure that BigQuery is durable and highly-available in the rare, but possible, occurrence of a regional disaster. Which actions would you recommend to ensure this?
A is incorrect. BigQuery is dual-zone by default, but this doesn't protect against a regional outage.
B is incorrect. This could work, but it is not an optimal solution and would require a huge rework of the BigQuery instance and data.
C is incorrect. While true, this is not guaranteed, and the CIO is looking for a guaranteed solution.
D is correct. Use Cross-Region Datasets to ensure that BigQuery will be available and durable in the case of a regional disaster.
The data platform team is looking to enable a service oriented architecture in GCP to power advanced analytics with Spark ML as well as perform operational queries with Looker. They have an application creating and processing events. What architecture would you recommend to most efficiently enable the transmission, collation, and aggregation of event data in GCP?
A is correct. Use Pub/Sub to manage the messages as they come in. This will ensure that no data is lost from the time of production to consumption of the message. It enables a strong acknowledgment chain to prevent the loss of data. Set up a BigQuery subscription to easily consume the messages from Pub/Sub as they are created without any additional processing. Use BigQuery to build the operational views of the data. Use Dataproc Serverless to read from the data hosted in BigQuery and execute the Spark ML models.
B is incorrect. This is almost correct. This would probably work the vast majority of the time, but it wouldn't be as robust without Pub/Sub to handle the messages. Any failures in transmission would result in lost data.
C is incorrect. There are a few inefficiencies here. You should use Pub/Sub to handle the messages. Using GCS and external tables is fine, but it lacks some of the features of importing the data directly into BigQuery. Dataproc Standard hosted on Kubernetes would work, but it wouldn't be as efficient as Dataproc Serverless for Spark.
D is incorrect. This is almost correct. It would work except for using BigQuery ML. It isn't clear what kind of model is being produced here, and BigQuery ML doesn't support all possible models. In this case, the requirements specifically call for Spark ML which means Dataproc, and almost certainly Dataproc Serverless.
You are working for a Social Media strategy consultancy. They need to ensure that their data cannot be used to identify a person. They have already used Cloud DLP to scrub their data of PII, but they want to make sure that some other fields, such as date of birth and city, cannot be used to re-identify a person. How can they check for this and ensure a minimal possibility of re-identification while still enabling good data to be aggregated and analyzed?
A is incorrect. This will ensure a zero re-identification risk, but will also scrub all useful fields of information, making it impossible to provide useful analytics.
B is incorrect. This could work, but it is an unnecessary and inefficient step since Cloud DLP can do this natively.
C is incorrect. This will ensure a zero re-identification risk, but will also scrub all useful fields of information, making it impossible to provide useful analytics. Additionally, the data would need to be decrypted at some point, and the risk could remain.
D is correct. DLP can also be used to assess re-identification risk: re-identification means identifying a particular user by observing alternative identifiers, such as an email address, phone number, or address. DLP can provide a risk report showing the probability of re-identifying users in your data. Use this information to formulate a plan to de-identify your data.
What is the durability rating of Google Cloud Storage (GCS)?
A is incorrect.
B is incorrect.
C is incorrect.
D is correct. Cloud Storage is designed for 99.999999999% (11 9's) durability.
You are developing a data mesh with Dataplex for use by your analytics team. All of your data are hosted in GCP and you want to build BigLake tables on the data. Which files types are supported by BigLake?
Select All That Apply
You are working with a financial services firm. They are operating an on-prem Kubernetes cluster which processes highly sensitive financial data. Because of this they want to encrypt the processed data on-prem and move the data into GCP without exiting the GCP network. Additionally, they want to use Google managed services only to ensure a high-quality product with minimal maintenance required. What is the architecture they should use?
Option A is incorrect: The customer wants to use fully managed services only. Additionally, data sent over SSL/TLS means that it is sent over the internet and therefore must exit the Google network.
Option B is correct: Use Google Distributed Cloud Virtual (GKE Enterprise on-premises) to run a managed Kubernetes cluster on-prem. Set up Cloud Interconnect with Dedicated Interconnect to ensure that the data never leaves the Google network. Finally, use Customer-Supplied Encryption Keys (CSEK) to encrypt the data before sending it to GCP.
Option C is incorrect. This is a secure solution with managed services, but the customer wants to perform the processing on-prem.
Option D is incorrect. This is insecure because the data leaves the GCP network. You should use Cloud Interconnect with a Dedicated Connection.
You are working with a data science team gathering data for climate science at various national parks throughout the country. These data are gathered through barometric and other atmospheric measuring instruments. These sensor groups can consist of thousands of nodes which stream data to Pub/Sub. Due to the difficult internet connections available at some of the sites the data sometimes do not come in immediately, leading to late data. How can you prepare your dataflow data-pipelines to handle late arriving data?
Option A is incorrect. Dataflow does not natively handle late-arriving data; this must be configured manually.
Option B is incorrect. Dataflow SQL does not allow late-arriving data.
Option C is incorrect. Batch pipelines don't really have a concept of late arriving data.
Option D is correct. Dataflow can automatically pick up and handle late arriving data if enabled via the Java SDK.
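A hedged sketch of configuring allowed lateness and a late-firing trigger for the sensor stream; the Python SDK is shown here purely for illustration (the answer above refers to the Java SDK, which exposes the same configuration). Topic name, window size, and lateness values are assumptions.

```python
import apache_beam as beam
from apache_beam import window
from apache_beam.options.pipeline_options import PipelineOptions
from apache_beam.transforms.trigger import (
    AccumulationMode,
    AfterProcessingTime,
    AfterWatermark,
)

with beam.Pipeline(options=PipelineOptions(streaming=True)) as p:
    (p
     | "Read" >> beam.io.ReadFromPubSub(topic="projects/my-project/topics/sensor-readings")
     | "Window" >> beam.WindowInto(
           window.FixedWindows(60),                        # 1-minute windows
           trigger=AfterWatermark(late=AfterProcessingTime(30)),
           allowed_lateness=600,                           # accept data up to 10 minutes late
           accumulation_mode=AccumulationMode.ACCUMULATING,
       )
     | "Count" >> beam.combiners.Count.Globally().without_defaults())
```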
Select The Correct Pub/Sub Components
select all that apply
Pub/Sub consists of six components:
- Publisher (also called a producer): creates messages and sends (publishes) them to the messaging service on a specified topic.
- Message: the data that moves through the service.
- Topic: a named entity that represents a feed of messages.
- Schema: a named entity that governs the data format of a Pub/Sub message.
- Subscription: a named entity that represents an interest in receiving messages on a particular topic.
- Subscriber (also called a consumer): receives messages on a specified subscription.
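A minimal sketch tying the components together with the Pub/Sub Python client: a publisher sends a message to a topic, and a subscriber pulls it from a subscription. The project, topic, and subscription names are placeholders, and the topic and subscription are assumed to already exist.

```python
import concurrent.futures

from google.cloud import pubsub_v1

project_id = "my-project"

# Publisher -> Message -> Topic
publisher = pubsub_v1.PublisherClient()
topic_path = publisher.topic_path(project_id, "orders")
future = publisher.publish(topic_path, b'{"order_id": 42}', source="web")
print("published message id:", future.result())

# Subscription -> Subscriber
subscriber = pubsub_v1.SubscriberClient()
subscription_path = subscriber.subscription_path(project_id, "orders-sub")


def callback(message: pubsub_v1.subscriber.message.Message) -> None:
    print("received:", message.data)
    message.ack()


streaming_pull = subscriber.subscribe(subscription_path, callback=callback)
try:
    streaming_pull.result(timeout=30)   # listen for 30 seconds
except concurrent.futures.TimeoutError:
    streaming_pull.cancel()
```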
You are managing a BigQuery instance and have noticed that your costs are starting to rise. What are some things you could do to try and keep costs low?
Select all that apply
A is correct. Materialized views act as a sort of data cache and can help keep costs low if the query doesn't force a scan of the underlying table.
B is correct. Use partitioning and clustering to lower the amount of data queried. Partitioning will break a table into queryable chunks which limit the data processed by BigQuery. Clustering sorts the data in your table by the cluster keys, allowing BigQuery to skip blocks that don't match your filters.
C is incorrect. BigQuery storage costs are the same as GCS, and both will automatically reclassify underutilized data into lower cost storage.
D is correct. Use quotas and limits to cap how much of a given resource your project can utilize. This gives a hard stop to querying capacity for any user.
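A minimal sketch of two of the controls mentioned above: a partitioned and clustered table so queries scan less data, and a hard cap on bytes billed per query. Dataset and column names are assumptions.

```python
from google.cloud import bigquery

client = bigquery.Client()

# Partition by day and cluster by user_id so date-filtered queries prune data.
client.query("""
CREATE TABLE IF NOT EXISTS `my_project.analytics.events`
(
  user_id  STRING,
  event_ts TIMESTAMP,
  payload  JSON
)
PARTITION BY DATE(event_ts)
CLUSTER BY user_id
""").result()

# Refuse to run any query that would bill more than ~1 GB.
capped = bigquery.QueryJobConfig(maximum_bytes_billed=10**9)
client.query(
    "SELECT user_id, event_ts FROM `my_project.analytics.events` "
    "WHERE DATE(event_ts) = '2023-11-01'",
    job_config=capped,
).result()
```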
You are working for a device supplier which captures IOT data from many devices scattered across a geographical region. You have set up a Pub/Sub service, but since you have started sending data you have noticed that some of the messages are corrupted or malformed, which is causing downstream applications to crash. How would you most efficiently solve this?
Option A is incorrect. This could, in theory, work to prevent forwarding bad data, but it is a much less efficient solution than using Pub/Sub's built-in schema checking. Additionally, Cloud Functions lacks some of the guarantees that Pub/Sub offers, such as an at-least-once delivery guarantee.
Option B is incorrect. Creating a schema is the proper approach, but schemas are built on topics, not on subscriptions.
Option C is incorrect. This would be a very inefficient and costly solution to implement.
Option D is correct. Using a Pub/Sub schema will ensure that any message received complies with the proper message structure; messages that fail the schema check are rejected at publish time.
You are running a series of BigQuery queries which are dependent upon other queries executing correctly and in sequence. Currently you are running this process in cloud scheduler, but you notice that this is difficult to manage and, if one query fails, you have to rerun the entire job. What is a better tool you could use to execute the queries?
A is incorrect. This would only really shift the logic to another service without improving manageability.
B is incorrect. This might work in some instances, but it wouldn't allow the jobs to pick up where they failed, and it wouldn't improve manageability of the jobs.
C is correct. Use Cloud Composer to transform the queries into separate tasks. Cloud Composer gives great visibility into the individual tasks needed to complete the DAG and allows you to pick up where a task failed if it needs to be rerun.
D is incorrect. Cloud Composer is the better option here.
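A hedged sketch of the dependent-query pattern as a Composer (Airflow) DAG: each query becomes its own task, so a failure can be retried from that task rather than rerunning the whole chain. The DAG name, schedule, and SQL statements are placeholder assumptions.

```python
from datetime import datetime

from airflow import DAG
from airflow.providers.google.cloud.operators.bigquery import BigQueryInsertJobOperator


def bq_task(task_id: str, sql: str) -> BigQueryInsertJobOperator:
    # Each BigQuery query becomes an individually retryable Airflow task.
    return BigQueryInsertJobOperator(
        task_id=task_id,
        configuration={"query": {"query": sql, "useLegacySql": False}},
        retries=2,
    )


with DAG(
    dag_id="dependent_bq_queries",
    schedule_interval="@daily",
    start_date=datetime(2023, 11, 1),
    catchup=False,
) as dag:
    stage = bq_task("stage", "CREATE OR REPLACE TABLE mart.stage AS SELECT ...")
    enrich = bq_task("enrich", "CREATE OR REPLACE TABLE mart.enriched AS SELECT ...")
    publish = bq_task("publish", "CREATE OR REPLACE TABLE mart.final AS SELECT ...")

    stage >> enrich >> publish   # queries run in sequence, resumable at any task
```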
You are building a Pub/Sub pipeline to connect a new data source to BigQuery, but the schema is constantly changing. Therefore, you need to perform some data transformations before loading the data into BigQuery. What is the optimal data processing architecture to use?
A is incorrect. This is technically possible, but the data should be processed before loading into BigQuery.
B is incorrect. This is technically feasible, and is probably more flexible than Dataflow, but for performance and optimization reasons Dataflow is the better solution.
C is incorrect. This is technically feasible, but it is inefficient and redundant compared to an end-to-end Dataflow pipeline.
D is correct. Use Dataflow for performing data processing and schematization before loading the data into the data warehouse.
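A hedged sketch of the end-to-end Dataflow pattern: read from Pub/Sub, normalize the evolving source schema, then write to BigQuery. The topic, table, field names, and mapping logic are placeholder assumptions.

```python
import json

import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions


def normalize(raw: bytes) -> dict:
    """Reshape the changing source schema into the fixed warehouse schema."""
    record = json.loads(raw.decode("utf-8"))
    return {
        "event_id": record.get("id") or record.get("event_id"),
        "event_ts": record.get("timestamp"),
        "payload": json.dumps(record),
    }


with beam.Pipeline(options=PipelineOptions(streaming=True)) as p:
    (p
     | "Read" >> beam.io.ReadFromPubSub(topic="projects/my-project/topics/raw-events")
     | "Normalize" >> beam.Map(normalize)
     | "Write" >> beam.io.WriteToBigQuery(
           "my-project:analytics.events",
           schema="event_id:STRING,event_ts:TIMESTAMP,payload:STRING",
           write_disposition=beam.io.BigQueryDisposition.WRITE_APPEND,
       ))
```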
Your analytics team is using Looker to power an analytics dashboard used by multiple teams each day. Their data source is a static BigQuery table which updates once per day. What is the best method to minimize query costs and maximize performance for this dashboard?
A is incorrect. A materialized view would not help in this case as BigQuery will automatically cache the static data for 24 hours.
B is incorrect. BI Engine is unnecessary in this case because BigQuery will automatically cache the query result for 24 hours which will result in immediate retrieval of the data.
C is correct. BigQuery automatically caches static query results for 24 hours, and this cache is referenced each time the query is run.
D is incorrect. This could help save on query costs if a user is building ad hoc queries against the view, but it wouldn't increase performance. Looker automatically does this with its LookML builder in the dashboard logic so this is redundant in this case.
You have a large number of datasets stored in GCS totaling several terabytes of data and want to make them sharable with BigQuery users in your organization. What is the most efficient way to accomplish this?
A is incorrect. This is technically feasible, but very inefficient compared to using external tables. Dataplex can perform all the necessary steps automatically.
B is correct. When you add your GCS bucket to Dataplex it will automatically build BigQuery external tables for each prefix, apply data quality and permission rules, and surface the data for discovery within a data mesh.
C is incorrect. This is technically feasible, but Dataplex can perform all the necessary steps automatically.
D is incorrect. This is an inefficient design. BigQuery can read GCS data directly with external tables, and Dataplex can automatically handle all Data Catalog tasks.
You are working with a data science team for a major pharmaceuticals firm. They are interested in setting up a Dataplex data lake for their data stored in Cloud Storage. They have about 3 PB of data spread across dozens of GCS tables in their raw bucket. They are currently building the pipelines needed to produce the curated bucket. What formatting requirements are needed for GCS data when they decide to build the curated bucket?
Select All That Apply
All are correct. The data should be high-quality, groomed, well-organized, properly prefixed, in a Hive-partitioned layout, and ideally formatted as CSV, newline-delimited JSON, or Parquet.
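A small illustrative sketch of the curated-zone layout described above: Hive-style partition prefixes under a single table prefix, written as Parquet. The bucket, table prefix, and fields are assumptions, and writing to gs:// paths with pandas requires the gcsfs and pyarrow packages.

```python
import pandas as pd

df = pd.DataFrame({"trial_id": [1, 2], "result": [0.91, 0.87]})

# Hive-style layout: gs://<curated-bucket>/<table>/year=YYYY/month=MM/day=DD/<file>.parquet
# Dataplex maps each key=value prefix to a partition when it discovers the table.
path = "gs://curated-bucket/trial_results/year=2023/month=11/day=02/part-0000.parquet"
df.to_parquet(path)
```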
What is the first step when creating a data mesh?
A is incorrect. This is a necessary step, but not the first one you should take.
B is incorrect. This is a necessary step, but not the first one you should take.
C is incorrect. This is a necessary step, but not the first one you should take.
D is correct. The first step to building a data mesh is to establish the data governance council which will map out the zones, rules and operating procedures for building and operating the data mesh.
An e-commerce firm is looking to migrate an existing on-prem Postgres DB to GCP and is looking for the optimal database technology to host the data. They are currently operating in the US, but are expanding their business into Europe and they want to have low latency and easy access from anywhere on Earth. Additionally, they also want minimal changes to their current queries and interface. What is the optimal database technology to meet their current and future requirements?
A is incorrect. Cloud SQL would not provide low-latency access from anywhere on Earth.
B is incorrect. This Cloud SQL option likewise would not provide low-latency access from anywhere on Earth.
C is correct. Cloud Spanner would fulfill all the requirements listed. It is easily accessed from anywhere on earth with low latency and its interface and SQL syntax are designed to be almost identical to Postgres.
D is incorrect. Memorystore for Redis is an ephemeral, in-memory data store used for powering low-latency, high-intensity web applications and is not an effective substitute for Cloud SQL.
You are working with a data team at a sensitive government agency who has strict data governance requirements. What is the certification your organization needs in order to contract effectively with federal government agencies and organizations?
Option A is correct: FedRAMP is a standardized methodology for ensuring compliance among cloud operators when building solutions for the federal government and its associated agencies.
Option B is incorrect: HIPAA may be required for certain health care service providers, but FedRAMP covers HIPAA and other certifications required for a particular service.
Option C is incorrect: CCPA applies to California only.
Option D is incorrect: Export Administration Regulations (EAR) don't apply to data engineering on Google Cloud unless you intend to export your software overseas.
You are working with a smaller organization who is getting started with GCP. They are looking to develop some simple data processing pipelines using a low code or no code solution. Which technology would you recommend?
A is incorrect. Dataprep is designed to clean and format data for further processing. Although it might work here it likely wouldn't be robust enough for these requirements.
B is incorrect. Dataflow requires advanced coding techniques and probably wouldn't work for all requirements here.
C is correct. Data Fusion would be the best option here. Cloud Data Fusion is a fully managed, cloud-native, no-code enterprise data integration service for quickly building and managing data pipelines.
D is incorrect. Cloud Functions might work for some processing tasks, but it would require advanced coding skills and likely wouldn't fit all scenarios required.
An e-commerce company is using Memorystore for Redis to host user actions on its site. However, it has become a challenge to maintain the data model due to frequent schema changes in products and users. Which storage technology should the company utilize instead of Memorystore to solve this issue?
Option A is incorrect. Cloud Spanner is an RDBMS. While you could use Python to update the schema accordingly, it would be very slow to implement and would greatly degrade the performance of the site.
Option B is incorrect. This could work, but a Firestore document model would be better and allow easier access to the data. Additionally, the Firestore data would be easier to observe and maintain.
Option C is incorrect. BigQuery is an analytical warehouse and is not designed for the kind of dynamic data access a site would require. It would also not solve the problem of schema evolution
Option D is correct. Migrate the site data to Firestore's document database to enable highly dynamic data access while allowing for a mutated schema.
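A minimal sketch of the document model in Firestore: each user action is a document whose fields can evolve without schema migrations. Collection and field names are assumptions.

```python
from google.cloud import firestore

db = firestore.Client(project="my-project")

# New attributes (e.g. a coupon_code field) can appear on later documents
# without altering any existing data or running a schema migration.
db.collection("user_actions").document("action-0001").set({
    "user_id": "u-42",
    "action": "add_to_cart",
    "product": {"sku": "SKU-123", "price": 19.99},
    "ts": firestore.SERVER_TIMESTAMP,
})

# Query by field regardless of which optional attributes each document carries.
for doc in db.collection("user_actions").where("user_id", "==", "u-42").stream():
    print(doc.id, doc.to_dict())
```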
You are running a globally available online simulator game with GCP as a backend while users play the game on their personal devices. What is the best method to send user event data to GCP which maximizes operating efficiency and minimizes development effort? You want to ensure that no data are lost, but duplicates are allowed.
Option A is not correct. This would be bad design because the data could come from anywhere in the world which would greatly increase latency, the service would be prone to failures and likely couldn't handle the amount of traffic, and storing the data in GCS would likely be insufficient for downstream processors.
Option B is not correct. Setting up and maintaining a Kafka cluster, while it could be used to solve this problem, is not the most efficient solution in this case. Also, a push method could bog down downstream services, so a pull method would be best in this case.
Option C is correct. Pub/Sub is GCP's globally available message broker service which is used to collect and redistribute event data to a large number of potential subscribers. Pub/Sub comes with at least once processing by default, which ensures that any messages are guaranteed to be captured, but there could be duplicates. A pull operation is best here because it ensures that downstream processors can safely and efficiently handle the incoming message volume.
Option D is not correct. This is a very complex method which would likely fail under load, couldn't guarantee message capture by default, and would be very expensive and time-consuming to develop and maintain. Additionally, there is no guarantee that downstream processors could handle the event, so if a processor fails it would be impossible to know whether data was lost without significant investigation.
You are working at a sports ticketing company processing thousands of transactions per day. The data science team has developed a Spark machine learning recommendation model which predicts events likely to interest people who have already purchased tickets to an event. They now need to operationalize the model and have asked the data engineering team to build the host architecture. What is the best tool to use to minimize development, configuration, and maintenance efforts while also providing a scalable and high-performing infrastructure?
A is correct. Dataproc Serverless for Spark is a low maintenance and low configuration environment for running Spark jobs, including Spark ML jobs.
B is incorrect. This would work, technically, but Dataproc Serverless is a better choice if you're trying to minimize development, configuration, and maintenance efforts.
C is incorrect. This could actually be a more performant and sustainable choice for running the model in the long run, but it would have to be rewritten which would not be the best choice in this case.
D is incorrect. Like the MIG answer, this could work to host the model, but it would be inefficient in this case.
You are working with a research firm specializing in social media analytics. They have a number of pub/sub topics which process event driven messages which come from their client mobile application. You have noticed that the messages can sometimes be processed by a pub/sub topic operating in Europe, but your data standards do not include compliance with GDPR. What is the most efficient method to ensure that the non-compliant pub/sub messages will not process in EU regions, but that compliant topics and services can still operate effectively?
A is incorrect. A resource location restriction would affect all services, not just the Pub/Sub topics. This could potentially break your application or negatively affect the user experience.
B is incorrect. Cloud HA VPN is a hybrid connectivity product designed to connect on-premises networks to your Cloud VPC.
C is correct. Set a message storage policy on the topic to force specific pub/sub topics to operate in a given region. This will funnel the messages to a compliant region without affecting other services or topics operating in the EU.
D is incorrect. This would be inefficient and likely wouldn't solve the problem in a verifiable and sustainable way. It would require significant rebuilds and you would lose all the benefits of Pub/Sub.
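A hedged sketch of creating a topic with a message storage policy so messages for the non-GDPR-compliant workload are only persisted in allowed US regions. The project, topic, and region list are placeholder assumptions.

```python
from google.cloud import pubsub_v1

publisher = pubsub_v1.PublisherClient()
topic_path = publisher.topic_path("my-project", "us-only-events")

# Messages published to this topic can only be persisted in the listed regions,
# while other topics and services keep operating wherever they already run.
publisher.create_topic(
    request={
        "name": topic_path,
        "message_storage_policy": {
            "allowed_persistence_regions": ["us-central1", "us-east1"],
        },
    }
)
```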
You are working for a data science team who is running a number of Hadoop jobs on dataproc. They have noticed that their costs are increasing and are looking for ways to better optimize their costs. What are some things you could recommend?
Select All That Apply
A, B, and D are correct. Dataproc pricing is heavily dependent upon the amount of, and behavior of, whichever compute resources you are using to perform your processing tasks.
C, E, F are incorrect. Autoscaling is best for adapting to increased workloads but will not help control costs. Dataproc Serverless is used for Spark jobs only. Memory-optimized VMs are best for Spark jobs, and are more expensive than general compute clusters.
You are on the data platform team working with an e-commerce corporation. They are currently operating in an on-prem environment, but they want to move to GCP. They use Apache Kafka to handle messaging and Apache Flink to process the data. The CIO now wants to utilize managed services in GCP. Which GCP services can substitute for these?
Select all that apply
A is correct. Pub/Sub is a distributed and fully managed messaging system which can be an effective substitute for Apache Kafka.
B is correct. Cloud Dataflow is a serverless data processing solution which can read from a pub/sub stream and process returned data. This is an effective substitute for Apache Flink.
C is incorrect. Dataproc is a tool used for running Spark or Hadoop jobs and doesn't apply here.
D is incorrect. Data Fusion is a no-code service for building data pipelines in GCP.
E is incorrect. Cloud Dataprep is a no code solution used for cleaning, preparing, and profiling data in GCP. Dataprep can be used to build dataflow pipelines via recipe export, but in this instance it wouldn't count as a direct substitute.
You are managing a BigQuery instance and begin to notice a spike in query costs. How would you identify which queries and jobs are causing this issue?
Option A is incorrect: This is an inefficient and disruptive solution which would require a large development effort to properly implement. BigQuery Audit Logs capture all of this by default.
Option B is incorrect: This is another disruptive, invasive, and unnecessary solution which can be solved with Audit Logs.
Option C is incorrect: This solution is inefficient. You could pass the query to a Cloud Function and capture the metadata associated with the request, the job ID, the query syntax, and the requester identity. The Cloud Function could then execute the job and the user could read the output from the console. But this is more efficiently accomplished by just using Cloud Logging with BigQuery Audit Logs.
Option D is correct: BigQuery Audit Logs are enabled by default and capture the needed metadata. You can then set up an alert in Cloud Logging to notify you when an expensive query is run.
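As a complementary way to surface expensive jobs alongside the audit-log alerting described above, a hedged sketch of querying the INFORMATION_SCHEMA.JOBS view and ranking recent jobs by bytes billed; the region qualifier and project are assumptions.

```python
from google.cloud import bigquery

client = bigquery.Client(project="my-project")

sql = """
SELECT user_email, job_id, query,
       total_bytes_billed / POW(1024, 4) AS tib_billed
FROM `region-us`.INFORMATION_SCHEMA.JOBS
WHERE creation_time > TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 7 DAY)
  AND job_type = 'QUERY'
ORDER BY total_bytes_billed DESC
LIMIT 20
"""

# Print the most expensive queries of the last week and who ran them.
for row in client.query(sql).result():
    print(row.user_email, row.job_id, round(row.tib_billed, 3))
```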
You are working for a warehouse operations center. You are tasked with developing a dataflow pipeline which gathers analytical data about operations such as packages processed per hour, number of losses per hour, and others. Management needs the data metrics to propagate every 5 minutes. Which windowing function can you use to best model this behavior?
A is incorrect. Concurrent windows are not a concept currently supported by Dataflow
B is correct. Use sliding windows to perform the calculations every 5 minutes. Sliding windows overlap, which is ideal for recomputing the rolling metrics on a fixed cadence.
C is incorrect. Session windows are best for tracking events occurring in a website, such as measuring number of clicks or products purchased per user.
D is incorrect. This calculates data within a time constraint, but doesn't overlap, so it wouldn't be the best option in this case.
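A hedged sketch of the sliding-window aggregation: a one-hour window of packages processed, recomputed every 5 minutes. The topic and the one-count-per-message mapping are placeholder assumptions.

```python
import apache_beam as beam
from apache_beam import window
from apache_beam.options.pipeline_options import PipelineOptions

with beam.Pipeline(options=PipelineOptions(streaming=True)) as p:
    (p
     | "Read" >> beam.io.ReadFromPubSub(topic="projects/my-project/topics/package-scans")
     | "One per scan" >> beam.Map(lambda msg: 1)
     | "Sliding window" >> beam.WindowInto(
           window.SlidingWindows(size=3600, period=300))   # 60-min windows, every 5 min
     | "Packages per hour" >> beam.CombineGlobally(sum).without_defaults())
```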
You are running a BigQuery instance. Your analytics users are noticing slow query response times for dynamic and complex dashboards which are using sophisticated analytical functions to build the dashboards. What is a tool you can use to speed up the analytical queries?
Option A is incorrect: Although this is true, BigQuery only caches query results for repeated queries. The dashboard builds novel, complex SQL statements each time it runs, which negates the cache.
Option B is correct: Use BI Engine to store data in memory and greatly speed up analytical queries.
Option C is incorrect: This is a very complex solution which might speed up your queries, but would be very difficult and costly to set up and maintain on its own. BI Engine will accomplish this for a fraction of the cost and complexity.
Option D is incorrect: A materialized view could help to cache data, but it doesn't support complex analytical queries, such as window functions.
Your data engineering team is tasked with migrating on-prem Hadoop workloads to Google Cloud. They currently have 25 TB of data stored in AVRO format on hard disks in an on-prem cluster rack. What is the preferred architecture to move workloads to Google Cloud while minimizing rework?
Option A is not correct. Dataproc Serverless only works with Spark.
Option B is correct. Use Dataproc Standard to run the workloads and move the files to Cloud Storage.
Option C is incorrect. This could work, and in the end may be more sustainable, but in this instance the goal is to minimize rework and ensure an efficient migration.
Option D is incorrect. Dataproc gives you the option to create highly customized cluster configurations and will automatically handle all infrastructure needed to build the cluster.
A financial services firm is processing some sensitive data from a third-party vendor and now wants to bring that data into the BigQuery data warehouse to perform analytics on it. The vendor has exposed an API endpoint which will be used to gather the data. They want to ensure that the data are properly desensitized when analysts or other users query the data while still ensuring that management can view the original data if required. What is the most efficient solution they could use to accomplish this?
A is incorrect. This will permanently alter the data and prevent administrators from viewing the data.
B is incorrect. This could work, but it is inefficient and would raise other security concerns such as view editing.
C is incorrect. This could work, but it is inefficient.
D is correct. Use a Cloud Function to query the API, retrieve the data, and move it into BigQuery via the Storage Write API and a service account. Use Cloud Scheduler to activate the function on a cron schedule. This ensures positive control over the ephemeral data from the time it is queried to the time it is written into BigQuery, and it is an efficient use of resources because it only requires one copy of the data. Use Cloud DLP to check the table data for sensitive data automatically. Use the results of the Cloud DLP scan to create a masking taxonomy along with appropriate policy tags for each role. This ensures that each user sees the appropriate level of masking when querying the data.
You are working for an e-commerce company which has a legacy business operating on-prem and wants to migrate a total of 2 PB of data to GCS. They are looking for a high-security transfer. What is the best method for migrating this data to GCS?
A is correct. If your data is very large (1 PB or greater), or your connection to Google Cloud is slow, then you should consider using Transfer Appliance. Google will ship a storage device to your on-prem location where you load the data into physical drives, ship them back to Google, and then Google will load the data directly into Cloud Storage. Transfer Appliance is secured end-to-end by Google.
B is incorrect. The data is too large for Storage Transfer Service, also this is mostly used to transfer data from other cloud providers to GCS.
C is incorrect. This would be a secure solution, but the data are so large that it would take an extremely long time to complete.
D is incorrect. The data are so large that this would likely never be able to complete. Additionally this is not the most secure option.
An analytics team has asked for your help in designing an efficient data model for their data mart in BigQuery. Which of the below options would you recommend?
Choose all answers that apply
Option A is incorrect: Flattening and denormalizing data is correct, but you should aggregate the data into a nested field.
Option B is incorrect: This would apply to a materialized view, not a standard view.
Option C is incorrect: This is unnecessary. BigQuery is a columnar database which means that the data can be in the table but doesn't need to be queried if not selected.
Option D is correct: To be efficient and secure, teams should only query the data which is necessary to build the data mart. This keeps storage costs low and minimizes any privacy or security concerns.
Option E is correct: BigQuery can create highly efficient data structures known as nested and repeated fields. These fields can encapsulate complex objects and make queries much more efficient and ensure referential integrity.
Option F is correct: Use Dataplex to discover the tables that you need quickly and efficiently. Dataplex works by automating Data Catalog to tag BigQuery tables.
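A minimal sketch of the nested/repeated pattern from Option E: order line items are stored as an ARRAY of STRUCTs inside the orders table, so the mart avoids a join while keeping the parent-child relationship intact. Table and column names are assumptions.

```python
from google.cloud import bigquery

client = bigquery.Client()

# Build the mart table with a nested, repeated line_items field.
client.query("""
CREATE OR REPLACE TABLE `my_project.mart.orders` AS
SELECT
  o.order_id,
  o.customer_id,
  ARRAY_AGG(STRUCT(i.sku, i.quantity, i.unit_price)) AS line_items
FROM `my_project.raw.orders` o
JOIN `my_project.raw.order_items` i USING (order_id)
GROUP BY o.order_id, o.customer_id
""").result()

# Querying the nested field only scans the columns actually referenced.
rows = client.query("""
SELECT order_id, item.sku, item.quantity
FROM `my_project.mart.orders`, UNNEST(line_items) AS item
""").result()
```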
You need to connect to a remote Cloud SQL instance over the internet. What is the best method to use to quickly and easily connect to Cloud SQL while using native IAM roles?
A is incorrect. You can use TLS/SSL, but it would have to be configured and it is not the best option here.
B is incorrect. An SSH connection could work on its own, but it would not be the most efficient solution and does not natively integrate with Cloud IAM.
C is correct. You should use Cloud SQL Auth Proxy if you are connecting over the internet. This provides all the benefits of a dedicated SSH VPN connection with native integration with IAM roles and minimal set up.
D is incorrect. Dedicated Interconnect would not connect over the internet. You can use dedicated interconnect to connect to Cloud SQL using a private IP.
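A hedged sketch of the same idea in application code: the Cloud SQL Python Connector wraps the Auth Proxy's encrypted tunnel and IAM-based authorization. The instance connection name, IAM database user, and database are placeholder assumptions.

```python
from google.cloud.sql.connector import Connector

connector = Connector()

conn = connector.connect(
    "my-project:us-central1:orders-db",   # instance connection name
    "pg8000",                             # PostgreSQL driver
    user="analyst@my-project.iam",        # IAM database user
    db="orders",
    enable_iam_auth=True,                 # IAM credentials instead of a password
)

cur = conn.cursor()
cur.execute("SELECT 1")
print(cur.fetchone())

conn.close()
connector.close()
```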
You are working in a financial services firm conducting research into bond price prediction. The data science team has built a Spark ML bayesian prediction model to perform the predictions and now wishes to operationalize it. Due to the highly sensitive nature of the data and processing task the data science team wants to have the most secure architecture possible. The data are currently flowing into a secured GCS bucket. What would you recommend?
A is correct: This is a robust architecture which is almost impenetrable unless someone has direct access to the cluster. Sole tenancy means that you set up a Compute Engine instance on a server rack hosting only your instance, which is the most secure setup you can have on GCP. You should also use Shielded VMs to ensure the security of your operating system. Using a dedicated service account means that you can isolate which principals can interact with the data and Dataproc. You can use audit logs to monitor who accesses the cluster or the bucket data, and you can set up custom Cloud Logging events to alert you of any unauthorized access.
B is incorrect: Dataproc Serverless is not the best choice here because it is not the most secure version of Dataproc and would be insufficient.
C is incorrect: Dataproc with standard VMs only would not be secure enough to satisfy the requirements.
D is incorrect: Dataproc on GKE only would not be secure enough to satisfy the requirements.