
Pre-Tests Foxtrot Communications

Pretest Information

These tests are used to check your knowledge, prepare you for exam questions, and provide additional clarification on topics. These questions should resemble those you will see on the actual exam. There are no time limits for these questions (though feel free to time yourself!). Instead, it's more important to get the questions and thought patterns correct, which in turn will speed up test taking. Aim for 90% correct to be prepared for the exam.


GCP Professional Data Engineer Certification Preparation Guide (Nov 2023): Pre-Tests, Test 1




You work at a large financial institution and have been asked by management to architect a solution for enabling a data mesh in GCP. What is the most efficient solution?

A: Buy a third-party data cataloguing tool to scan through all your BigQuery datasets and identify table names, column names, data types, and all other metadata. Buy another third-party tool to automatically categorize the captured data based upon a set of tags that you developed. Buy another third-party tool which surfaces this data to users and integrates with IAM, BigQuery, and GCS to control and facilitate federated governance and inherited permissions. Develop all the pipelines, table metadata, permissions management functionality, and access logs yourself.

B: Use Data Catalog to capture metadata from sources based upon pre-developed tag templates. Use one tag template to tag BigQuery assets and another to tag external GCS tables, and develop custom sharing rules based upon user tags and organizational structure. Store all associated metadata in Firestore to manage any state changes to the permissions structure.

C: Establish a data governance council to develop federated governance rules covering access rights and data quality checks within each zone and domain. Use Dataplex to automatically capture relevant metadata from BigQuery datasets, GCS tables, and other open source tools such as Spark. Use Dataplex to automatically associate permissions and perform data quality checks on a zone by zone basis. Use Dataplex to automatically surface data and make it available to interested users.

D: Work with individual teams to identify which data belongs to which teams, and develop a system of tools using Cloud Build which operates a series of Cloud Functions to perform data quality checks on new BigQuery datasets as the teams build them. Build a data service platform on a local VM which contains a web app that processes and surfaces user access and data assets to users who log in to the site. Capture access logs and queries run by users to ensure that data are not being accessed improperly.




You are working to build out a data lake using Dataplex. After setting up the necessary zones you need to incorporate your BigQuery tables, views, and BQML models into Dataplex and Data Catalog. What actions do you need to take to do this?

A: Use a python script as part of a build process which creates and attaches tag templates to the assets according to manually generated schema files. Dataplex will then create metadata based on the tag templates.

B: Do nothing. Dataplex will automatically onboard all BigQuery datasets and tables and GCS buckets.

C: Go into the Dataplex console and manually add the datasets to each zone as required.

D: Ensure that Dataplex has the proper permissions to read your data assets, enable data lineage in BigQuery, and then add each BigQuery dataset and any GCS bucket to the proper zone. Following this, Dataplex will automatically add any discovered tables and generate the proper metadata and Data Catalog assets.




You are working at a consulting firm. One of your clients is inquiring about GCP's disaster recovery protocols and SLOs. What are the two objectives you would look at to rate a disaster recovery scenario?

Select All That Apply
A: RPO - Recovery Point Objective

B: RTO - Recovery Time Objective

C: Durability

D: Availability




Your analytics team is interested in capturing web traffic so that they can build a recommender system for offered products. They have set up Google Analytics on the website and are capturing web traffic effectively. Now they want to move that data into their data mesh so that they can enrich the data with other sources, perform feature engineering, and build the recommender system. What would be the most efficient architecture to accomplish this?

A: Use the BigQuery Transfer Service to set up an automatic load of the data from Google Analytics into BigQuery. From here, build a dataflow pipeline that exports this data to GCS. This data can then be used and enriched with Dataproc to build the recommender system.

B: Use Dataflow to access the Google Analytics API and execute a batch pipeline to move the data into BigQuery. From here, use Vertex AI to create a feature dataset with the BigQuery data and then use Tensorflow to construct the recommender system.

C: Use the BigQuery Transfer Service to set up an automatic load of the data from Google Analytics into BigQuery. This data can then be used and enriched within BigQuery with the GoogleSQL Transform operator to form features. Use BigQuery ML to build the recommender system using Matrix Factorization.

D: Use the Google Analytics Transfer Service to move data into GCS where it can be accessed by Dataproc. Use Dataproc to enrich the data, perform feature engineering, and build the recommender.
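
For reference on option C, a rough sketch of training a matrix factorization model in BigQuery ML from Python; the project, dataset, and column names are hypothetical placeholders:

from google.cloud import bigquery  # pip install google-cloud-bigquery

client = bigquery.Client(project="my-project")  # hypothetical project ID

# Train a matrix factorization recommender directly in BigQuery ML.
# Table and column names below are placeholders, not from the question.
sql = """
CREATE OR REPLACE MODEL `my-project.analytics.product_recommender`
OPTIONS (
  model_type = 'matrix_factorization',
  user_col   = 'user_pseudo_id',
  item_col   = 'product_id',
  rating_col = 'engagement_score'
) AS
SELECT user_pseudo_id, product_id, engagement_score
FROM `my-project.analytics.ga_feature_table`
"""
client.query(sql).result()  # blocks until the training job finishes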




What is usually meant by Data Sovereignty?

A: Data Sovereignty refers to the rights of corporations to exercise positive control over their global data.

B: Data Sovereignty refers to which country the data originates from and which country has rights to that data.

C: Data Sovereignty is a legal concept which determines the laws and regulations which apply to data when it is processed, viewed, collected, or stored within a certain geographical boundary.

D: Data Sovereignty is the legal framework for determining how a private person can control their online data.




The analytics team is building a dashboard in Looker which queries the same BigQuery table every few hours throughout the day. The data mart is rebuilt once per day in the mornings. What is the most efficient solution you should implement in order to minimize the query response time and the amount of data processed per day?

A: Create a materialized view for Looker to query instead.

B: Work with the analytics project lead to alter the Looker cached persistence time to 24 hours.

C: Create a Data Catalog Tag Template which tags the query as "high priority". Then use BigQuery's resource manager to allocate more slots to this query based on the tag value.

D: Do nothing. BigQuery automatically caches query results for 24 hours.
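
For reference on option A, a minimal sketch of creating a materialized view through the Python client; project, dataset, and column names are hypothetical:

from google.cloud import bigquery

client = bigquery.Client(project="my-project")  # hypothetical project

# Materialized views are refreshed incrementally by BigQuery, so Looker
# reads precomputed results instead of rescanning the base table.
sql = """
CREATE MATERIALIZED VIEW `my-project.mart.daily_sales_mv` AS
SELECT region, DATE(order_ts) AS order_date, SUM(amount) AS total_amount
FROM `my-project.mart.daily_sales`
GROUP BY region, order_date
"""
client.query(sql).result()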




You are working at a start-up company and the CIO wants to move to a new BigQuery edition. Their current requirements are:

  • There are a large number of scheduled jobs which run each morning that generally use the same amount of data each day.
  • The analytics team is about to introduce many new high performance queries which they are seeking to optimize.
  • The data security and regulatory teams want to introduce data masking to comply with new privacy regulations.
  • The CIO wants to be assured that if a failure occurs in a single zone, the service will still be available and the data will be intact.
Which BigQuery edition would you recommend to optimize performance and cost against the list of requirements?

A: On-Demand

B: Standard

C: Enterprise

D: Enterprise Plus




As a member of the data engineering team you run dozens of Dataproc jobs daily. Lately you have noticed that the cron scripts you use to run the Dataproc jobs are becoming difficult to manage, understand, and visualize. You are looking for a better method to organize your jobs while enabling advanced job management tools such as graceful handling of job failures, automated scheduling, and job dependency graphs. What tool could you use to solve this?

A: Use Cloud Composer to reorganize and manage your data workloads. Rewrite your Dataproc jobs as DAGs and use Cloud Composer to automatically manage task and DAG dependencies.

B: Use Cloud Scheduler to better organize the cron jobs and alert you if a job fails.

C: Use Cloud Workflows to set up complex job dependencies on a job by job basis and manage failures.

D: Create a VM and set up a custom Airflow instance on the VM. Set up error handling, Cloud Logging, and install Ops Agent on the VM to monitor hardware and jobs. Set up a MIG to ensure high availability and scalability. Develop a custom IAM role to apply to users who need to interact with Airflow. Set up custom Airflow connectors to interact with your Dataproc instance. Finally, rewrite your Dataproc jobs as DAGs and use Airflow to automatically manage task and DAG dependencies.
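
For reference on option A, a sketch of what two dependent Dataproc jobs could look like as a Cloud Composer (Airflow) DAG; project, cluster, bucket, and schedule values are hypothetical placeholders:

from datetime import datetime
from airflow import DAG
from airflow.providers.google.cloud.operators.dataproc import DataprocSubmitJobOperator

# Placeholder identifiers -- replace with your own project, region, and cluster.
PROJECT_ID, REGION, CLUSTER = "my-project", "us-central1", "etl-cluster"

with DAG(
    dag_id="daily_dataproc_jobs",
    schedule_interval="0 6 * * *",   # replaces the cron script
    start_date=datetime(2023, 11, 1),
    catchup=False,
) as dag:
    extract = DataprocSubmitJobOperator(
        task_id="extract_job",
        project_id=PROJECT_ID,
        region=REGION,
        job={
            "reference": {"project_id": PROJECT_ID},
            "placement": {"cluster_name": CLUSTER},
            "pyspark_job": {"main_python_file_uri": "gs://my-bucket/jobs/extract.py"},
        },
    )
    transform = DataprocSubmitJobOperator(
        task_id="transform_job",
        project_id=PROJECT_ID,
        region=REGION,
        job={
            "reference": {"project_id": PROJECT_ID},
            "placement": {"cluster_name": CLUSTER},
            "pyspark_job": {"main_python_file_uri": "gs://my-bucket/jobs/transform.py"},
        },
    )
    extract >> transform   # dependency graph with built-in retry and failure handling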




You need to develop a Cloud Function to perform some data processing. You want to work locally and only deploy the Cloud Function when you are sure it is ready. What is the optimal method for development?

A: Use Jupyter notebooks to develop your function. When it is ready, export it as a Python script and put it in a Cloud Function via the Cloud Console.

B: Use Jupyter notebooks to develop your function. Use the Functions Framework to debug and test your code. Once it is ready, insert the script into a Cloud Function via the Cloud Console.

C: Use Jupyter notebooks to develop your function. Use the Functions Framework to debug and test your code. Use Cloud Build to automate the testing and deployment of your code and build the function.

D: Develop directly in Cloud Functions since it would be quicker to test and debug this way.
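
For reference on options B and C, a minimal sketch of local development with the Functions Framework; the function name, logic, and deploy command shown are hypothetical placeholders:

# main.py -- developed and tested locally with the Functions Framework
# (pip install functions-framework), then deployed unchanged.
import functions_framework


@functions_framework.http
def process_data(request):
    """Hypothetical HTTP-triggered data processing entry point."""
    payload = request.get_json(silent=True) or {}
    rows = payload.get("rows", [])
    # ... perform the actual data processing here ...
    return {"processed": len(rows)}, 200

# Run locally:  functions-framework --target=process_data --debug
# One way to deploy once ready:
#   gcloud functions deploy process_data --runtime=python311 --trigger-http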




You are using Dataplex and BigQuery to host a data mesh. You want to build freshness checks against the data so that users know whether the data have been properly refreshed each morning. What is the optimal method to accomplish this?

A: Build a separate BigQuery dataset to house relevant metadata. Create views which can be used to query each table's latest inserted rows. Create a new view each time a new table is created. Surface these views to users automatically with Dataplex.

B: Create a destination metadata table in BigQuery. Each time the pipeline job runs, send the results of the job to this destination metadata table. Surface this table to users automatically with Dataplex.

C: Use Dataplex monitoring with auto data quality checks to set up rules to monitor for freshness and automatically run and verify the tables each morning.

D: Set up a Pub/Sub topic to capture data quality results from your pipeline jobs each time they run. Create a Cloud Function which will read from this Pub/Sub topic and create a Cloud Logging alert which will notify you if the pipeline hasn't run.




You are working with a financial services firm. They are working with data which, by law, requires strict ACID-compliant transaction handling. They are operating in a single region only. Which database service would be the best fit for their requirements?

A: BigQuery

B: Cloud Spanner

C: Cloud SQL

D: BigTable




A data science team has 3 PB of data stored in GCS standard dual-region storage and accumulates an additional GB of data every week. They are interested in ways to lower their monthly storage costs. What methods should you recommend to lower the costs?

Select all that apply
A: Enable Autoclass for GCS.

B: Compress the files using GZIP.

C: Change the storage type to single-region.

D: Build a Cloud Function which scans the file metadata of all objects in the bucket. If a file is older than 365 days, then delete the data.
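
As a side note on option D, the same 365-day cleanup can be expressed as a managed lifecycle rule instead of custom code; a minimal sketch with a hypothetical project and bucket name:

from google.cloud import storage

client = storage.Client(project="my-project")          # hypothetical project
bucket = client.get_bucket("my-datalake-bucket")        # hypothetical bucket

# GCS evaluates lifecycle rules itself, so no Cloud Function is needed to
# delete objects once they are older than 365 days.
bucket.add_lifecycle_delete_rule(age=365)
bucket.patch()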




You are running a sports booking website which processes tens of thousands of ticket purchases per day across hundreds of events and dozens of teams. You decide to use Bigtable in order to properly manage the complex, high-velocity event stream. How should you design your Bigtable row key to ensure efficient operation?

A: timestamp#user_id#event#team

B: timestamp#event_id#user_id#team

C: user_id#event#team#timestamp

D: user_id#team#timestamp#event
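
For background, a small sketch of writing a Bigtable row whose key leads with a high-cardinality field to avoid hotspotting; all identifiers and values are hypothetical:

from google.cloud import bigtable

client = bigtable.Client(project="my-project")                 # hypothetical IDs
table = client.instance("ticketing").table("purchases")

# Leading with a high-cardinality field such as user_id spreads writes across
# tablets; a timestamp prefix would push every new purchase to the same node.
user_id, event, team, ts = "u-829301", "finals-2023", "foxes", "20231118T203455Z"
row_key = f"{user_id}#{event}#{team}#{ts}".encode("utf-8")

row = table.direct_row(row_key)
row.set_cell("purchase", "ticket_count", b"2")
row.commit()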




A manufacturer of IoT devices collects sensor readings from the robots which construct the IoT devices. The robots send the data to an API endpoint for ingestion. However, you have noticed that occasionally the ingest fails and the data are lost. What is a better method to ingest the data?

A: Write the data directly to Bigtable via the Bigtable API.

B: Write the data to GCS.

C: Write the data to Cloud Spanner.

D: Write the data to Pub/Sub.
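
For reference on option D, a minimal publisher sketch; the topic name and payload fields are hypothetical:

import json

from google.cloud import pubsub_v1

publisher = pubsub_v1.PublisherClient()
topic_path = publisher.topic_path("my-project", "robot-sensor-readings")  # hypothetical

def ingest(reading: dict) -> None:
    """Publish one sensor reading; Pub/Sub buffers it durably until
    downstream consumers acknowledge it, so a transient failure does not lose data."""
    data = json.dumps(reading).encode("utf-8")
    future = publisher.publish(topic_path, data, robot_id=str(reading.get("robot_id", "")))
    future.result(timeout=30)  # raises if the publish could not be persisted

ingest({"robot_id": 17, "temperature_c": 41.2, "ts": "2023-11-18T20:34:55Z"})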




You currently have a Pub/Sub topic set up which is capturing application event data in JSON format. What is the optimal method to stream the data into BigQuery without any transformations?

A: Use Dataflow to stream the data from pub/sub to BigQuery

B: Use a Cloud Function to read the data from the Pub/Sub topic with a pull subscription. Then insert the data into BigQuery via the BigQuery Write API.

C: Set up a BigQuery subscription to the Pub/Sub topic to stream the data directly into the table via the BigQuery Write API.

D: Set up a Pub/Sub subscription to send the event data to GCS. Build an external table in BigQuery to read the data from GCS.
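
For reference on option C, a sketch of creating a BigQuery subscription with the Python client, assuming the destination table already exists with a compatible schema and Pub/Sub has permission to write to it; all names are hypothetical:

from google.cloud import pubsub_v1

project_id = "my-project"                                  # hypothetical IDs
topic_path = f"projects/{project_id}/topics/app-events"

subscriber = pubsub_v1.SubscriberClient()
subscription_path = subscriber.subscription_path(project_id, "app-events-to-bq")

# A BigQuery subscription streams each message into the table through the
# Storage Write API, with no Dataflow job or Cloud Function in the middle.
with subscriber:
    subscriber.create_subscription(
        request={
            "name": subscription_path,
            "topic": topic_path,
            "bigquery_config": {"table": "my-project.app_events.raw_events"},
        }
    )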




You are managing a Bigtable instance and have recently built a new table. You have noticed that response times have degraded and queries are performing poorly. What is the first thing you should do to diagnose what is wrong?

Option A: Check node performance in your regional cluster to see if you should add another node.

Option B: Limit the number of requests per minute via the Bigtable API to let the instance catch up.

Option C: Check audit logs to see if there are inefficient queries being run.

Option D: Use the Bigtable Key Visualizer to check for hotspotting.




You are managing a PostgreSQL Cloud SQL cluster and notice that queries are starting to fail. Upon investigation you notice that your data now exceeds your 25 TB persistent disk and Cloud SQL can no longer effectively process the data. What is the most efficient solution you could recommend to resolve the issue?

A: Migrate the database and operations to Cloud Spanner.

B: Increase the size of the persistent disk on the Cloud SQL instance.

C: Enable AlloyDB for PostgreSQL.

D: Delete some older data to ease the storage pressure.




You are operating in a multi-cloud architecture and need to integrate data from a few sources stored in S3 and Azure Blob Storage into BigQuery, where you are already hosting a number of GCS external tables. Files are added to the buckets each day. Additionally, the data contains some PII and requires column level security to comply with certain legal requirements. How could you accomplish this most efficiently?

A: Build a custom Compute Engine VM which can access the required data. Use the VM to read each file in the bucket and then transfer that file to GCS. Set up a Cloud Scheduler job which will trigger the daily transfer. Build external tables for each GCS source as the data flows in. Use Cloud DLP and Data Catalog to create tag templates for each table and column which can be used to automate column level security.

B: Build a Cloud Function which reads all new files daily and deposits them into GCS based upon a cloud scheduler job. Build external tables for each GCS source as the data flows in. Use Cloud DLP and Data Catalog to create tag templates for each table and column which can be used to automate column level security.

C: Build a Lambda function in AWS and an Azure Function in Azure which pulls the data from each source and then deposits it into GCS. Send a message to a pub/sub topic which will signal that the files have been loaded. Use a Cloud function with eventarc which will activate when the pub/sub topic receives a new message. Use this function to pull the file from GCS, process it, and load it into a BigQuery table. Use Cloud DLP and Data Catalog to create tag templates for each table and column which can be used to automate column level security.

D: Use BigQuery Omni to build BigLake tables which can read the S3 and Azure Blob Storage Files and make the tables available in the BigQuery console. Use Cloud DLP and Data Catalog to create tag templates for each table and column which can be used to automate column level security.




You are running a Cloud Spanner instance and notice that your compliance team frequently looks to link user data and regulatory data by geographical data in order to ensure regulatory compliance. They occasionally notice slow query performance and have asked you to optimize the query. User data is pre-processed with Cloud DLP before it is written to Cloud Spanner in order to comply with each country's specific regulations. What is a method you could use in Cloud Spanner to improve the performance of the query while also ensuring that user data cannot be written outside of the region where it is created?

A: Create a hierarchical interleaved table relationship with the geographical data as the root, regulatory data as the first child, and the Users data as a child of the regulatory data.

B: Create a foreign key relationship between the geographical data, regulatory data, and user data.

C: Move the data to BigQuery since it is designed to perform analytical queries more efficiently than Spanner.

D: Create a secondary table with a single row for each geography-user-regulatory relationship.
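
For reference on option A, a sketch of the interleaved-table DDL applied through the Python client; instance, database, table, and column names are hypothetical:

from google.cloud import spanner

client = spanner.Client(project="my-project")               # hypothetical IDs
database = client.instance("compliance").database("regdata")

# Interleaving physically co-locates each geography's regulatory and user rows,
# so join-by-geography queries avoid scattered reads across the instance.
ddl = [
    """CREATE TABLE Geographies (
         GeoId STRING(36) NOT NULL,
         CountryCode STRING(2)
       ) PRIMARY KEY (GeoId)""",
    """CREATE TABLE Regulations (
         GeoId STRING(36) NOT NULL,
         RegId STRING(36) NOT NULL,
         RuleText STRING(MAX)
       ) PRIMARY KEY (GeoId, RegId),
         INTERLEAVE IN PARENT Geographies ON DELETE CASCADE""",
    """CREATE TABLE Users (
         GeoId STRING(36) NOT NULL,
         RegId STRING(36) NOT NULL,
         UserId STRING(36) NOT NULL,
         Attributes JSON
       ) PRIMARY KEY (GeoId, RegId, UserId),
         INTERLEAVE IN PARENT Regulations ON DELETE CASCADE""",
]
operation = database.update_ddl(ddl)
operation.result(300)  # wait for the schema change to complete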




You are working for an online sports betting company. The CFO wants to know the best database technology to host a new highly dynamic sports betting application which has to collect and process data, ensure strong consistency, and have atomic transaction handling, all with minimal latency. The application is expected to have thousands of transactions running per second. What database technology would you recommend?

A: Cloud SQL

B: Memorystore for Redis

C: Firestore

D: Cloud Spanner




You are working with a data science team to help migrate some machine learning workflows over to GCP. The data have been migrated to BigQuery already, and now they need to move the models. Currently they are running some batch linear regression models on a local on-prem cluster using a PyTorch framework. The data science team is interested in the most efficient processing regardless of cost. How would you advise them to migrate the models so that they operate most efficiently in GCP?

A: Migrate the models to Vertex AI, set up BigQuery tables as datasets, and operate the models in batch mode.

B: Rebuild the models using BigQuery ML.

C: Build a MIG consisting of memory-optimized VM instances with NVIDIA CUDA GPUs attached. Port the PyTorch application onto the MIG.

D: Build a Dataproc Serverless cluster and rebuild the model in Spark ML.




You have been using Cloud Dataflow to run batch data processing pipelines, but lately you have noticed the costs going up. What is the most efficient way to reduce costs if your jobs do not need to complete at a specific time and are fault tolerant?

A: Use Dataflow FlexRS.

B: Change the partitioning on Dataflow to allow fewer partitions.

C: Change your pipelines to streaming.

D: Alter your data model to reduce shuffling transactions.
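
For reference on option A, a sketch of the pipeline options that enable FlexRS on a Beam batch job; project, bucket, and region values are hypothetical:

import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions

# FlexRS schedules the job onto discounted, preemptible capacity; Dataflow may
# delay execution, which is acceptable here since there is no deadline.
options = PipelineOptions(
    runner="DataflowRunner",
    project="my-project",
    region="us-central1",
    temp_location="gs://my-bucket/tmp",
    flexrs_goal="COST_OPTIMIZED",
)

with beam.Pipeline(options=options) as p:
    (
        p
        | "Read" >> beam.io.ReadFromText("gs://my-bucket/input/*.csv")
        | "CountLines" >> beam.combiners.Count.Globally()
        | "Write" >> beam.io.WriteToText("gs://my-bucket/output/line_count")
    )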




You are working for a health insurance firm which processes and pays claims containing highly sensitive and protected consumer data. Legal and company compliance policies dictate the absolute highest level of cryptographic security for data moving between services within GCP. Which security architecture should you use?

A: Use Customer Supplied Encryption Keys (CSEK) to encrypt the data, ensuring that only someone who possesses the key can unlock the data.

B: Data are automatically protected at rest and in transit by default by Google with the Cloud Key Management Service API, using Cryptographic Data Verification on the data packets as they move through GCP.

C: Use Cloud HSM to process the cryptographic operations in a FIPS 140-2 Level 3 compliant cluster.

D: Use Cloud HSM and Customer Supplied Encryption Keys to process the cryptographic operations in a FIPS 140-2 Level 3 compliant cluster and ensure that only someone who possesses the key can unlock the data.




Which regulation(s) are the most well-known data privacy laws specific to the state of California?

Choose all that apply

GDPR

CCPA

CPRA

FedRAMP

COPPA

HIPAA

HITECH




You are working with a data science team who needs to migrate an on-prem Hadoop cluster to GCP. The cluster includes 50 TB of data hosted on HDFS as well as the associated JARs for execution. What would the most efficient future-state architecture look like in GCP?

A: Migrate the Hadoop jobs to Dataproc standard and host the HDFS data on persistent disks.

B: Build a MIG of memory-optimized VMs. Deploy Hadoop and configure it for the VMs. Host the HDFS data on HDD persistent disks.

C: Run the Hadoop jobs with Dataproc standard and use GCS to host the data.

D: Use Dataproc Serverless to host the Hadoop jobs and migrate the data to GCS.




A health care analytics team is operating in BigQuery and often has to work with very sensitive health care data. What method or tool should they use to check for and prevent any leakage of personally identifiable information or other sensitive data?

Option A: Build a machine learning model in BQML which scans each column and checks for commonly known sources of PII.

Option B: Use the Sensitive Data API with Cloud Data Loss Prevention to automatically identify columns which contain PII and create masking or nullification rules to prevent leaks.

Option C: Do nothing, the team already has access and doesn't need to protect the data.

Option D: Buy a third-party tool custom-built for health care data which can scan through your datasets and detect potential sources of PII or other sensitive data.
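
For reference on option B, a minimal Cloud DLP inspection sketch; the project, the infoTypes chosen, and the sample value are hypothetical:

from google.cloud import dlp_v2

dlp = dlp_v2.DlpServiceClient()
parent = "projects/my-project/locations/global"     # hypothetical project

# Scan a sample value for common PII infoTypes before it lands in BigQuery.
response = dlp.inspect_content(
    request={
        "parent": parent,
        "inspect_config": {
            "info_types": [{"name": "PERSON_NAME"}, {"name": "US_SOCIAL_SECURITY_NUMBER"}],
            "min_likelihood": dlp_v2.Likelihood.POSSIBLE,
        },
        "item": {"value": "Patient Jane Doe, SSN 123-45-6789"},
    }
)
for finding in response.result.findings:
    print(finding.info_type.name, finding.likelihood)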




You are working for a globally operating financial services firm and the CIO wants to be sure that BigQuery is durable and highly-available in the rare, but possible, occurrence of a regional disaster. Which actions would you recommend to ensure this?

A: The standard configuration of BigQuery is enough to ensure availability in the case of a regional disaster

B: Store your data in GCS instead and use Dual-region bucket locations to ensure durability and availability of the data. Create a secondary BigQuery instance with external tables in case one region goes down.

C: Assure the CIO that regional disasters are almost impossible and that no further action needs to be taken.

D: Use Cross-Region Datasets to ensure that BigQuery will be available and durable in the case of a regional disaster.




The data platform team is looking to enable a service oriented architecture in GCP to power advanced analytics with Spark ML as well as perform operational queries with Looker. They have an application creating and processing events. What architecture would you recommend to most efficiently enable the transmission, collation, and aggregation of event data in GCP?

A: Send the data to a Pub/Sub topic. Set up a BigQuery subscription to the topic to stream events directly into BigQuery via the Write API. Create analytical views to process and consume the data. Use Dataproc Serverless to read from BigQuery and build the Spark ML machine learning models.

B: Send the data directly to the BigQuery Write API. Create analytical views to process and consume the data. Use Dataproc Serverless to read from BigQuery and build the Spark ML machine learning models.

C: Send the events to a Cloud Function. Use the cloud function to process the data and deposit the data as a flat file into GCS. Use BigQuery external tables to create the operational views. Use Dataproc Standard hosted on a Kubernetes cluster to perform machine learning.

D: Send the data to a Pub/Sub topic. Set up a Cloud Storage Subscription to the topic to stream events directly into GCS. Read the files as external tables in BigQuery. Use BigQuery to build the operational views and BigQuery ML to perform the machine learning models.




You are working for a Social Media strategy consultancy. They need to ensure that their data cannot be used to identify a person. They have already used Cloud DLP to scrub their data of PII, but they want to make sure that some other fields, such as date of birth and city, cannot be used to re-identify a person. How can they check for this and ensure a minimal possibility of re-identification while still enabling good data to be aggregated and analyzed?

A: Add all other potentially identifiable fields as infotypes to Cloud DLP.

B: Hire a third-party risk-analysis firm to analyze the re-identification risk.

C: Encrypt all fields which can potentially be used to identify a person.

D: Run a re-identification risk analysis using Cloud DLP




What is the durability rating of Google Cloud Storage (GCS)?

A: 99.99%

B: 99.999%

C: 99.99999%

D: 99.999999999%




You are developing a data mesh with Dataplex for use by your analytics team. All of your data are hosted in GCP and you want to build BigLake tables on the data. Which file types are supported by BigLake?

Select All That Apply
A: CSV

B: New Line Delimited JSON

C: ORC

D: Parquet

E: Avro

F: Iceberg




You are working with a financial services firm. They are operating an on-prem Kubernetes cluster which processes highly sensitive financial data. Because of this they want to encrypt the processed data on-prem and move the data into GCP without it traversing the public internet. Additionally, they want to use Google-managed services only to ensure a high-quality product with minimal maintenance required. What is the architecture they should use?

Option A: Build an on-prem rack system with a custom-built Kubernetes installation running on Docker containers. Use SSL/TLS encryption when sending the data to GCP over the internet.

Option B: Deploy Google Distributed Cloud Virtual on your on-prem VMWare servers. Set up Cloud Interconnect service to build a dedicated connection to GCP and operate on a Private IP. Before uploading the data, encrypt the data with CSEK.

Option C: Install Kubernetes on Google Kubernetes Engine via a Managed Instance Group. Encrypt the source data on-prem and load it to Cloud Storage using Customer Supplied Encryption Keys over Cloud Interconnect. Then use Google Kubernetes Engine to process the data.

Option D: Deploy Google Distributed Cloud Virtual on your on-prem VMWare servers. Before uploading the data over the internet, encrypt the data with CSEK.




You are working with a data science team gathering data for climate science at various national parks throughout the country. These data are gathered through barometric and other atmospheric measuring instruments. These sensor groups can consist of thousands of nodes which stream data to Pub/Sub. Due to the unreliable internet connections at some of the sites, the data sometimes do not arrive immediately, leading to late data. How can you prepare your Dataflow pipelines to handle late-arriving data?

A: Do nothing; Dataflow handles late-arriving data automatically.

B: Use Dataflow SQL instead of the Java SDK

C: Switch to batch pipelines instead

D: Use the Dataflow SDK to enable late-arriving data handling.
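
For reference on option D (in current terms, the Apache Beam SDK that Dataflow runs), a sketch of a streaming pipeline that tolerates late data through allowed lateness and a late trigger; the topic names and parsing step are hypothetical placeholders:

import json

import apache_beam as beam
from apache_beam import window
from apache_beam.options.pipeline_options import PipelineOptions
from apache_beam.transforms import trigger

options = PipelineOptions(streaming=True)   # plus the usual Dataflow options

with beam.Pipeline(options=options) as p:
    (
        p
        | "Read" >> beam.io.ReadFromPubSub(
            topic="projects/my-project/topics/sensor-readings")
        | "KeyByStation" >> beam.Map(lambda msg: ("station", 1))  # placeholder parse
        | "Window" >> beam.WindowInto(
            window.FixedWindows(60),                        # 1-minute windows
            trigger=trigger.AfterWatermark(late=trigger.AfterCount(1)),
            accumulation_mode=trigger.AccumulationMode.ACCUMULATING,
            allowed_lateness=6 * 60 * 60,                   # accept data up to 6 hours late
        )
        | "Sum" >> beam.CombinePerKey(sum)
        | "Encode" >> beam.Map(lambda kv: json.dumps(
            {"station": kv[0], "count": kv[1]}).encode("utf-8"))
        | "Publish" >> beam.io.WriteToPubSub(
            topic="projects/my-project/topics/sensor-aggregates")
    )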




Select the correct Pub/Sub components.

select all that apply

Broker

Cluster

Publisher

Subscriber

Queue

Topic

Schema

Message

Node

Subscription
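
As a quick refresher, a minimal pull-subscriber sketch showing several Pub/Sub components working together; project and subscription names are hypothetical:

from google.cloud import pubsub_v1

# A publisher sends messages to a topic; a subscription attached to that
# topic delivers them to a subscriber, which acknowledges what it processed.
project = "my-project"                                       # hypothetical IDs
subscriber = pubsub_v1.SubscriberClient()
subscription_path = subscriber.subscription_path(project, "orders-sub")

with subscriber:
    response = subscriber.pull(
        request={"subscription": subscription_path, "max_messages": 10}
    )
    for received in response.received_messages:
        print(received.message.data)                         # the message payload
    if response.received_messages:
        subscriber.acknowledge(
            request={
                "subscription": subscription_path,
                "ack_ids": [m.ack_id for m in response.received_messages],
            }
        )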




You are managing a BigQuery instance and have noticed that your costs are starting to rise. What are some things you could do to try and keep costs low?

Select all that apply
Use Materialized Views to cache queries and minimize processing

Use partitioning and clustering on the tables to make queries cheaper to run

Move the data to GCS and use external tables instead to save on storage costs

Set quotas to manage user spend each month
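
For reference, the partitioning-and-clustering recommendation can be expressed as a DDL statement run through the Python client; table and column names are hypothetical:

from google.cloud import bigquery

client = bigquery.Client(project="my-project")   # hypothetical project

# Partitioning prunes whole days of data from a scan; clustering sorts rows
# within each partition so filters on customer_id read even fewer bytes.
sql = """
CREATE TABLE `my-project.sales.orders_partitioned`
PARTITION BY DATE(order_ts)
CLUSTER BY customer_id
AS
SELECT * FROM `my-project.sales.orders`
"""
client.query(sql).result()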




You are working for a device supplier which captures IoT data from many devices scattered across a geographical region. You have set up a Pub/Sub service, but since you started sending data you have noticed that some of the messages are corrupted or malformed, which causes downstream applications to crash. How would you most efficiently solve this?

Option A: Send the data to a Cloud Function endpoint instead. Use the Cloud Function to check if every field is present, in the correct place, and has a correct data type. If not, then deposit the corrupted message in GCS for further review.

Option B: Build a Pub/Sub schema on the subscription to check that the data are properly structured; if not, send the message to a dead letter queue.

Option C: Rebuild all your downstream services to be able to handle potentially corrupted data.

Option D: Build a Pub/Sub schema on the topic to check that the data are properly structured; if not, send the message to a dead letter queue.




You are running a series of BigQuery queries which are dependent upon other queries executing correctly and in sequence. Currently you are running this process with Cloud Scheduler, but you notice that this is difficult to manage and, if one query fails, you have to rerun the entire job. What is a better tool you could use to execute the queries?

A: Use Dataflow SQL to run the jobs instead.

B: Transform the queries into CTEs and run them in sequence as a single view.

C: Use Cloud Composer to build a DAG which turns each query into a separate task.

D: Do Nothing. Cloud Scheduler is the best option for this.
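
For reference on option C, a sketch of how the query chain could look as a Cloud Composer (Airflow) DAG; the stored-procedure calls and schedule are hypothetical placeholders:

from datetime import datetime
from airflow import DAG
from airflow.providers.google.cloud.operators.bigquery import BigQueryInsertJobOperator

def bq_task(task_id: str, sql: str) -> BigQueryInsertJobOperator:
    # Each query becomes its own task, so a failure only reruns that task.
    return BigQueryInsertJobOperator(
        task_id=task_id,
        configuration={"query": {"query": sql, "useLegacySql": False}},
    )

with DAG(
    dag_id="sequential_bigquery_queries",
    schedule_interval="0 5 * * *",
    start_date=datetime(2023, 11, 1),
    catchup=False,
) as dag:
    stage = bq_task("stage", "CALL `my-project.etl.stage_orders`()")      # placeholder SQL
    enrich = bq_task("enrich", "CALL `my-project.etl.enrich_orders`()")
    publish = bq_task("publish", "CALL `my-project.etl.publish_mart`()")

    stage >> enrich >> publish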




You are building a Pub/Sub pipeline to connect a new data source to BigQuery, but the schema is constantly changing. Therefore, you need to perform some data transformations before loading the data into BigQuery. What is the optimal data processing architecture to use?

A: Load the data directly into BigQuery with the BigQuery Write API and BigQuery subscription. Perform the data processing in BigQuery.

B: Load the data directly into Cloud Storage with a Cloud Storage Subscription. Use an external table to read the data from Cloud Storage, perform the data processing, and load the data into BigQuery.

C: Use Dataflow to load the raw data into Bigtable first, then use Dataflow again to extract and process the data from Bigtable and load it into BigQuery.

D: Use Cloud Dataflow to process and schematize the data and load it into BigQuery




Your analytics team is using Looker to power an analytics dashboard used by multiple teams each day. Their data source is a static BigQuery table which updates once per day. What is the best method to minimize query costs and maximize performance for this dashboard?

A: Create a Materialized View to query the data once per day and use that to power the dashboard.

B: Use BI Engine to pre-process the data each morning and host the data in memory.

C: Do nothing. BigQuery automatically caches data for 24 hours as long as the underlying data doesn't change.

D: Build a view on top of the table which only allows the user to select certain columns from the table as needed.




You have a large number of datasets stored in GCS totaling several terabytes of data and want to make them sharable with BigQuery users in your organization. What is the most efficient way to accomplish this?

A: Use dataflow to build a pipeline for each table which reads the data from the files and loads the data into BigQuery. Create tag templates in Data Catalog for each table. Using the tag templates, apply permissions and data quality rules for each table and each column.

B: Use Dataplex to automatically build external tables for each dataset prefix in GCS, apply metadata, data quality rules, permissions, and make the data discoverable.

C: Build BigQuery external tables for each dataset prefix. Create a view for each table to surface the data to users. Create tag templates in Data Catalog for each table. Using the tag templates, apply permissions and data quality rules for each table and each column. Give users access to Data Catalog so that they can discover and query the data effectively.

D: Use pub/sub to reach each table and send the data to BigQuery Write API. Build a materialized view for each dataset for users to query. Create tag templates in Data Catalog for each view. Using the tag templates, apply permissions and data quality rules for each table and each column.
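
For background, defining a single BigQuery external table over a GCS prefix looks roughly like this with the Python client; bucket, dataset, and file format are hypothetical:

from google.cloud import bigquery

client = bigquery.Client(project="my-project")        # hypothetical project

external_config = bigquery.ExternalConfig("PARQUET")   # format of the GCS files
external_config.source_uris = ["gs://my-datalake/sales/*.parquet"]

table = bigquery.Table("my-project.lake.sales_external")
table.external_data_configuration = external_config
client.create_table(table, exists_ok=True)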




You are working with a data science team for a major pharmaceuticals firm. They are interested in setting up a Dataplex data lake for their data stored in Cloud Storage. They have about 3 PB of data spread across dozens of GCS tables in their raw bucket. They are currently building the pipelines needed to produce the curated bucket. What formatting requirements are needed for GCS data when they decide to build the curated bucket?

Select All That Apply
A: The data should be hive partitioned.

B: The data should be cleaned and typed.

C: The data should be scanned of any PII or other sensitive data.

D: The data should be in CSV, newline-delimited JSON, or Parquet format.




What is the first step when creating a data mesh?

A: Set up your user permissions.

B: Develop your data sources.

C: Build your data models

D: Establish the data governance council




An e-commerce firm is looking to migrate an existing on-prem Postgres DB to GCP and is looking for the optimal database technology to host the data. They are currently operating in the US, but are expanding their business into Europe and they want to have low latency and easy access from anywhere on Earth. Additionally, they also want minimal changes to their current queries and interface. What is the optimal database technology to meet their current and future requirements?

A: Cloud SQL with AlloyDB

B: Cloud SQL

C: Cloud Spanner

D: Memorystore for Redis




You are working with a data team at a sensitive government agency that has strict data governance requirements. What certification does your organization need in order to contract effectively with federal government agencies and organizations?

FedRAMP

HIPAA

CCPA

EAR




You are working with a smaller organization that is getting started with GCP. They are looking to develop some simple data processing pipelines using a low-code or no-code solution. Which technology would you recommend?

A: Dataprep

B: Dataflow

C: Data Fusion

D: Cloud Functions




An e-commerce company is using Memorystore for Redis to host user actions on its site. However, it has become a challenge to maintain the data model due to frequent schema changes in products and users. Which storage technology should the company utilize instead of Memorystore to solve this issue?

Option A: Use Cloud Spanner instead. Write a Python application which checks the table schemas periodically and updates them if required.

Option B: Use Bigtable to host the data as NoSQL tablets for products and users

Option C: Use BigQuery to host the data instead. Use materialized views to host the products and users data for quick and easy access.

Option D: Use Firestore's document model to handle the schema changes while providing quick operations.
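
For reference on option D, a minimal sketch of writing a schema-free Firestore document; project, collection, and field names are hypothetical:

from google.cloud import firestore

db = firestore.Client(project="my-project")            # hypothetical project

# New product or user attributes can be added per document without a
# schema migration, which is the property that was hard to maintain here.
db.collection("users").document("u-829301").set(
    {
        "name": "Ada",
        "preferences": {"theme": "dark"},
        "cart": [{"sku": "A-1001", "qty": 2}],
    },
    merge=True,
)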




You are running a globally available online simulator game with GCP as a backend while users play the game on their personal devices. What is the best method to send user event data to GCP which maximizes operating efficiency and minimizes development effort? You want to ensure that no data are lost, but duplicates are allowed.

A: Send the data to a regional compute-optimized VM instance via an HTTP PUT request. Use the VM to process the data and store it in Cloud Storage.

B: Send the data to a Kafka cluster and use Kafka to distribute the data to the necessary APIs with a push operation.

C: Send the data to Pub/Sub and use Pub/Sub to distribute the data to the necessary APIs with a pull operation.

D: Build a Cloud Function to send the data to the following additional Cloud Function HTTP endpoints:
  • One endpoint to process and reformat the data and then send it to BigQuery for storage and querying.
  • One endpoint to process and translate any textual data via the Natural Language API.
  • One endpoint to push the data to another cloud function which handles billing




You are working at a sports ticketing company processing thousands of transactions per day. The data science team has developed a Spark machine learning recommendation model which predicts events likely to interest people who have already purchased tickets to an event. They now need to operationalize the model and have asked the data engineering team to build the hosting architecture. What is the best tool to use to minimize development, configuration, and maintenance effort while also providing scalable and high-performing infrastructure?

A: Dataproc Serverless for Spark.

B: Use a standard Dataproc cluster hosted on a Compute Engine MIG.

C: Rewrite the machine learning model in BigQuery ML.

D: Use a standard Dataproc cluster hosted on GKE.




You are working with a research firm specializing in social media analytics. They have a number of Pub/Sub topics which process event-driven messages from their client mobile application. You have noticed that messages can sometimes be processed by a Pub/Sub topic operating in Europe, but your data standards do not include compliance with GDPR. What is the most efficient method to ensure that the non-compliant Pub/Sub messages will not be processed in EU regions, while compliant topics and services can still operate effectively?

A: Set a resource location restriction at the organization level to prevent GCP from creating new services in the region.

B: Have your users install Cloud HA VPN, which tunnels data back to a region where the regulation doesn't apply.

C: Use Message Storage Policies on the relevant topics to ensure that the data is only processed within an allowed GCP region.

D: Use a different regional service for those applications which produce messages which might violate GDPR standards.




You are working with a data science team who is running a number of Hadoop jobs on Dataproc. They have noticed that their costs are increasing and are looking for ways to better optimize their costs. What are some things you could recommend?

Select All That Apply

A: Use ephemeral clusters

B: Use Preemptible Instances

C: Use Autoscaling clusters to adapt to increased workloads

D: Right size your instances

E: Use Dataproc Serverless

F: Use memory optimized VMs




You are on the data platform team at an e-commerce corporation. They are currently operating in an on-prem environment, but they want to move to GCP. Today they use Apache Kafka to handle messaging and Apache Flink to process the data. The CIO now wants to utilize managed services in GCP. Which GCP services can substitute for these?

Select all that apply
A: Cloud Pub/Sub

B: Cloud Dataflow

C: Cloud Dataproc

D: Cloud Data Fusion

E: Cloud Dataprep




You are managing a BigQuery instance and begin to notice a spike in query costs. How would you identify which queries and jobs are causing this issue?

A: Set up a query tag indicator as part of each SQL query run, and have every user be sure to include this tag each time they run a query.

B: Monitor the BigQuery instance throughout the day. Wait until the spike occurs again and then ask around about which queries each user is running.

C: Funnel query requests through a Cloud Function. Have the Cloud Function capture metadata about each request, such as the user_id and query syntax.

D: Use Cloud Logging to capture BigQuery Audit Logs. Set up an alert with Cloud Logging to send you an alert when the spike occurs again. The alert should contain the user_id, query syntax, and associated job information.
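
For reference on option D, a sketch of pulling BigQuery audit log entries with the Cloud Logging client; the filter shown targets the older AuditData job-completed events and the project ID is hypothetical:

from google.cloud import logging as cloud_logging

client = cloud_logging.Client(project="my-project")     # hypothetical project

# Completed-job audit entries include the calling user and the query text,
# which is what you need to attribute a cost spike.
log_filter = (
    'resource.type="bigquery_resource" '
    'AND protoPayload.methodName="jobservice.jobcompleted"'
)
for entry in client.list_entries(filter_=log_filter, max_results=20):
    print(entry.timestamp, entry.payload)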




You are working for a warehouse operations center. You are tasked with developing a Dataflow pipeline which gathers analytical data about operations such as packages processed per hour, number of losses per hour, and other metrics. Management needs the metrics to propagate every 5 minutes. Which windowing function can you use to best model this behavior?

A: Concurrent Window.

B: Sliding/Hopping Window.

C: Session Window.

D: Fixed/Tumbling Window.
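
For reference on option D, a small Beam sketch of applying non-overlapping 5-minute windows; it assumes metrics is an existing keyed PCollection of (metric_name, value) pairs built earlier in the pipeline:

import apache_beam as beam
from apache_beam import window

def window_and_count(metrics):
    # Fixed (tumbling) windows emit one result per key every 5 minutes,
    # matching the propagation interval management asked for.
    return (
        metrics
        | "FixedWindows5Min" >> beam.WindowInto(window.FixedWindows(5 * 60))
        | "CountPerMetric" >> beam.combiners.Count.PerKey()
    )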




You are running a BigQuery instance. Your analytics users are noticing slow query response times for dynamic and complex dashboards which are using sophisticated analytical functions to build the dashboards. What is a tool you can use to speed up the analytical queries?

Option A: Do nothing, BigQuery automatically caches query results for 24 hours.

Option B: Enable BI Engine which can store data in memory and speed up analytical queries

Option C: Set up a Memorystore for Redis instance, build data pipelines to automatically pull commonly used data from BigQuery, and then use Redis to serve the queries.

Option D: Use a materialized view to cache commonly accessed data to speed up the queries. Have the dashboard query these tables




Your data engineering team is tasked with migrating on-prem Hadoop workloads to Google Cloud. They currently have 25 TB of data stored in Avro format on hard disks in an on-prem cluster rack. What is the preferred architecture to move the workloads to Google Cloud while minimizing rework?

A: Use Dataproc Serverless to run the workloads and move the data to Cloud Storage

B: Use Dataproc Standard to run the workloads and move the data to Cloud Storage

C: Rewrite the jobs in GoogleSQL to run in BigQuery and move the data to Cloud Storage

D: Set up a custom MIG with memory-optimized VMs and 25 TB SSD persistent disks. Install Hadoop on the cluster, set up configurations, set up master and worker machine relationships, and rebuild the applications to work efficiently with GCP's custom architecture.




A financial services firm is processing some sensitive data from a third-party vendor and now wants to bring that data into the BigQuery data warehouse to perform analytics. The vendor has exposed an API endpoint which will be used to gather the data. They want to ensure that the data are properly desensitized when analysts or other users query the data, while still ensuring that management can view the original data if required. What is the most efficient solution they could use to accomplish this?

A: Use a Cloud Function to query the data and perform desensitization tasks with a python function. Load the desensitized data into BigQuery via the Storage Write API.

B: Use a Cloud Function to query the data. Write the data to a GCS bucket that only BigQuery and administrators can access. Use Cloud DLP to identify any sensitive data points in the files. Build one view in a dataset meant for analysts which queries the data and performs a DML transformation to mask the data. Build another view in a separate dataset for the administrators which queries the data in its raw state.

C: Create two Cloud Storage buckets, one labeled "quarantine" and another labeled "desensitized". Use one Cloud Function to query the API and move the data into the quarantine bucket. Use another Cloud Function to query the quarantined data, perform the desensitization tasks, and move the clean data into the desensitized bucket. Use two different BigQuery external tables to query either the quarantined data or the desensitized data depending upon the user role.

D: Use a Cloud Function and Cloud Scheduler to query the data and load the desensitized data into BigQuery via the Storage Write API. From here use Cloud DLP to automatically check for sensitive data on the table. Use the output of Cloud DLP to create a BigQuery taxonomy with policy tags to perform dynamic data masking and desensitization based upon a user's data policy role.




You are working for an e-commerce company which has a legacy business operating on-prem and wants to migrate a total of 2 PB of data to GCS. They are looking for a high-security transfer. What is the best method for migrating this data to GCS?

A: Use a Transfer Appliance.

B: Use Storage Transfer Service.

C: Set up a dedicated interconnect. Use this to upload the files directly to GCS via the JSON API one by one.

D: Send the data over the internet with the GCS JSON API.




An analytics team has asked for your help in designing an efficient data model for their data mart in BigQuery. Which of the below options would you recommend?

Choose all answers that apply

A: Denormalize the schema by flattening some hierarchical data. Leave the data as rows in the table.

B: Use a standard view to grab the necessary data from the table without having to query the underlying tables again.

C: Normalize some side input tables which are rarely used by breaking them into a different table so the data mart table is smaller.

D: Only take the data that is needed to build the data mart.

E: Denormalize the schema by flattening some hierarchical data. Leave the data as rows in the table. Use nested and repeated fields to assemble the data efficiently in the parent table nodes.

F: Use Dataplex to quickly discover and identify the tables needed for the project.




You need to connect to a remote Cloud SQL instance over the internet. What is the best method to use to quickly and easily connect to Cloud SQL while using native IAM roles?

A: TLS/SSL connection

B: SSH connection

C: Cloud SQL Auth Proxy

D: Dedicated Interconnect




You are working at a financial services firm conducting research into bond price prediction. The data science team has built a Spark ML Bayesian prediction model to perform the predictions and now wishes to operationalize it. Due to the highly sensitive nature of the data and processing task, the data science team wants to have the most secure architecture possible. The data are currently flowing into a secured GCS bucket. What would you recommend?

A: Set up an IAM service account to interact with GCS. Apply read access to the source bucket and read and write access to the destination bucket for the service account. Use the service account to read the data from GCS. Build a sole-tenant Dataproc cluster with Shielded VMs. Interact with the Dataproc cluster via a private IP address only. Set up Cloud Logging alerts to notify you of any unauthorized access to the cluster.

B: Use Dataproc Serverless Spark to read from the bucket, process the data, and deposit the data into the destination bucket.

C: Use a standard Dataproc cluster with Compute Engine to read from the bucket, perform the data processing, and load the data into the destination bucket.

D: Use a standard Dataproc cluster with Google Kubernetes Engine to read from the bucket, perform the data processing, and load the data into the destination bucket.