Pretest Information
These tests are designed to check your knowledge, prepare you for exam questions, and clarify topics in greater depth. The questions should resemble those you will see on the actual exam. There is no time limit (though feel free to time yourself). Instead, it is more important to get the questions and thought patterns correct, which in turn will speed up your test taking. Aim for 90% correct to be prepared for the exam.
GCP Professional Data Engineer Certification Preparation Guide (Nov 2023)
Pre-Tests: Test 1
You work at a large financial institution and have been asked by management to architect a solution for enabling a data mesh in GCP. What is the most efficient solution?
Option A is incorrect: Buying a plethora of tools would be inefficient in this instance as Dataplex can perform all of these duties natively with minimal effort.
Option B is incorrect: This effort is overly complicated and complex and would require a lot of development effort across teams. Additionally, it doesn't lay a proper foundation for a data mesh because it is not controlling for domain knowledge and differentiation. There are no clear delimiters between teams and federated duties.
Option C is correct: Before beginning construction of the data mesh itself you should establish a governing body, separate from the teams/domains themselves, which can establish rules for sharing and data quality regulations which apply to domains and zones. Dataplex is a unified Data Management and Governance suite which can provide the necessary pathways for domain driven design/development while simultaneously ensuring a high-quality and performant system. Dataplex will automatically capture, analyze, categorize, and apply governance functions to new data from BigQuery and GCS.
Option D is incorrect: Firstly, working directly with the teams to develop unique rules violates data mesh's federated governance standards. Using Cloud Build and Cloud Functions to perform the necessary checks would work, but it is inefficient compared to Dataplex, which performs these functions natively. Building a data service on a VM or App Engine could be useful for data discovery and sharing, but all the custom rules, logic, and development would have to be done in house, which is again inefficient compared to Dataplex. Capturing access logs to ensure that users are not accessing data improperly is a stopgap measure and defeats the purpose of the data mesh, where those kinds of checks are performed by Dataplex natively as part of the domain-driven development standard.
You are working to build out a data lake using Dataplex. After setting up the necessary zones you need to incorporate your BigQuery tables, views, and BQML models into Dataplex and Data Catalog. What actions do you need to take to do this?
A is incorrect. After datasets are added to a zone within Dataplex all tables are automatically scanned and tagged by Dataplex. Dataplex will then build out the metadata automatically.
B is incorrect. You must first add the datasets to the proper zones, and then Dataplex will onboard them.
C is incorrect. This is almost correct, but you should first verify that Dataplex has the necessary permissions and that BigQuery data lineage is enabled.
D is correct. There are a few steps needed before you can add your assets to Dataplex. First, ensure that Dataplex has the proper permissions to read the needed GCS buckets and BigQuery datasets. It is also recommended to enable BigQuery Data Lineage to track where data comes from, where it is passed to, and what transformations are applied to it. Then, to onboard data assets, manually add either the Cloud Storage bucket or the BigQuery dataset to the proper zone. After you add the dataset or bucket, Dataplex will scan through and add your assets to the proper zone, map data lineage, build Data Catalog information, and surface the data for discovery. GCS assets are made into BigQuery external tables, which can then be manually upgraded to BigLake tables.
You are working at a consulting firm. One of your clients is inquiring about GCP's disaster recovery protocols and SLOs. What are the two objectives you would look at to rate a disaster recovery scenario?
Select All That Apply
A is correct. RPO, or Recovery Point Objective, is how far back you want your data to be recoverable. For example, do you want your database backed up every day, every week, every hour, etc.?
B is correct. RTO, or Recovery Time Objective, is how quickly you should be able to recover from a failure. This is usually measured in "Wall Clock Time", or how much real world time passes before recovery is complete.
C is incorrect. Durability is a good metric to ensure the survivability of data. It isn't a disaster recovery objective, but it can be used to help define one.
D is incorrect. Availability can help protect your business against failures and outages. It isn't a disaster recovery objective, but it can be used to help define one.
Your analytics team is interested in capturing web traffic so that they can build a recommender system for offered products. They have set up Google Analytics on the website and are capturing web traffic effectively. Now they want to move that data into their data mesh so that they can enrich the data with other sources, perform feature engineering, and build the recommender system. What would be the most efficient architecture to accomplish this?
Option A is incorrect. It is almost correct, but you don't need to use Dataproc as BigQuery ML can build a recommender system.
Option B is incorrect. Although this is technically feasible it would be operationally inefficient compared to using BigQuery Transfer Service to move the data.
Option C is correct. You can use the BigQuery Data Transfer Service and Google Analytics to easily set up a daily transfer of data from GA to BQ with minimal configuration. You can use the BigQuery ML TRANSFORM clause to perform complex feature engineering. Finally, you can use BigQuery ML to build a Matrix Factorization model to produce the recommendations. This method is also ideal because Dataplex will automatically capture newly created tables and models, perform the necessary data quality checks and permissions setup, and make the data available for use by the analytics team.
Option D is incorrect: There is no such thing as the Google Analytics Transfer Service, and there is no native way to do this from Google Analytics itself.
What is usually meant by Data Sovereignty?
Option A is incorrect. Data sovereignty governs the data itself, not the owners.
Option B is incorrect: Data Sovereignty does not refer to national rights to data.
Option C is correct: Data Sovereignty is a legal concept which determines the laws and regulations which apply to data when it is processed, viewed, collected, or stored within a certain geographical boundary.
Option D is incorrect: Data sovereignty refers to any laws and regulations for data within a geographical context. Although some laws may apply to privacy, this is not the sole purpose of data sovereignty.
The analytics team is building a dashboard in Looker which queries the same BigQuery table every few hours throughout the day. The data mart is rebuilt once per day in the mornings. What is the most efficient solution you should implement in order to minimize the query response time and the amount of data processed per day?
A: Option A is not correct. A Materialized View could work for the performance and processing requirement, but it is not the most optimal solution and would be redundant in this case.
B: Option B is not correct because Looker automatically caches queries, but only for 1 hour. Changing it to 24 hours could work, but it isn't the most optimal solution in this case.
C: Option C is not correct for a few reasons:
- This is a complex solution and would be difficult and time consuming to implement.
- BigQuery doesn't have a user editable resource manager for the underlying query engine, it is a serverless service.
- You can allocate slots at the project level, not the table level.
- Additional slots improve availability, but not performance.
D: Option D is correct. BigQuery will automatically cache query results in a temporary table for 24 hours. Whenever Looker runs the same query it will take results from the cache and won't query the underlying table again until the underlying data changes, which in this case is the following morning.
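A minimal sketch (using the BigQuery Python client) of how you might confirm that a repeated query was served from this 24-hour result cache; the project, dataset, and table names are placeholder assumptions.

```python
from google.cloud import bigquery

client = bigquery.Client()

# use_query_cache defaults to True; it is set explicitly here for clarity.
job_config = bigquery.QueryJobConfig(use_query_cache=True)

query = (
    "SELECT product_id, SUM(revenue) AS revenue "
    "FROM `my_project.mart.daily_sales` GROUP BY product_id"
)

job = client.query(query, job_config=job_config)
rows = list(job.result())

# cache_hit is True when results came from the cached temporary table,
# in which case the query bills zero bytes.
print("served from cache:", job.cache_hit)
print("bytes billed:", job.total_bytes_billed)
```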
You are working at a start-up company and the CIO wants to move to a new BigQuery pricing edition. Their current requirements are:
- There are a large number of scheduled jobs which run each morning that generally use the same amount of data each day.
- The analytics team is about to introduce many new high performance queries which they are seeking to optimize.
- The data security and regulatory teams want to introduce data masking to comply with new privacy regulations.
- The CIO wants to be assured that if a failure occurs in a single zone that the service will still be up and the data will be assured.
A is incorrect. On-Demand would be expensive to run each day for the standard workloads, wouldn't provide BI acceleration, and doesn't support data masking.
B is incorrect. Standard supports slot reservation for the standard jobs and provides zonal high availability, but it wouldn't support BI acceleration or data masking.
C is correct. Enterprise is the best fit for slot reservation, BI Acceleration, zonal high-availability, and data masking.
D is incorrect. Enterprise Plus offers additional features that are not required at this level and would therefore be economically inefficient compared to Enterprise.
As a member of the data engineering team you run dozens of Dataproc jobs daily. Lately you have noticed that the cron scripts that you use to run the Dataproc jobs are becoming difficult to manage, understand, and visualize. You are looking for a better method to organize your jobs while enabling advanced job management tools such as graceful handling of job failures, automated scheduling, and job dependency graphs. What tool could you use to solve this?
Option A is correct: Cloud Composer is Google's fully managed Apache Airflow service. Use it to visualize and manage your Dataproc jobs and task dependencies.
Option B is incorrect: Although Cloud Scheduler can be used to execute jobs at a given time and can track job status, it is not possible to visualize or manage complex DAG dependencies. Additionally, Cloud Scheduler doesn't have an efficient task dependency tree.
Option C is incorrect: Cloud Workflows is used to automate event-driven API workflows and isn't the best tool for running scheduled batch Dataproc jobs.
Option D is incorrect: This is a huge undertaking which would require a significant effort to build. Cloud Composer does all of this by default with minimal user input or maintenance required.
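A hedged sketch of what the cron-to-Composer move could look like: a small Airflow DAG where each Dataproc job is a task with explicit dependencies. Project ID, cluster name, region, and the jar URI are placeholder assumptions.

```python
from datetime import datetime

from airflow import DAG
from airflow.providers.google.cloud.operators.dataproc import DataprocSubmitJobOperator

SPARK_JOB = {
    "reference": {"project_id": "my-project"},
    "placement": {"cluster_name": "analytics-cluster"},
    "spark_job": {
        "main_class": "com.example.DailyAggregates",
        "jar_file_uris": ["gs://my-bucket/jobs/daily-aggregates.jar"],
    },
}

with DAG(
    dag_id="daily_dataproc_jobs",
    schedule_interval="0 5 * * *",   # replaces the cron entry
    start_date=datetime(2023, 11, 1),
    catchup=False,
) as dag:
    extract = DataprocSubmitJobOperator(
        task_id="extract", job=SPARK_JOB, region="us-central1", project_id="my-project"
    )
    aggregate = DataprocSubmitJobOperator(
        task_id="aggregate", job=SPARK_JOB, region="us-central1", project_id="my-project"
    )

    # The dependency graph is rendered automatically in the Airflow UI,
    # and failed tasks can be retried individually.
    extract >> aggregate
```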
You need to develop a Cloud Function to perform some data processing. You want to work locally and only deploy the Cloud Function when you are sure it is ready. What is the optimal method for development?
A is incorrect. This is smart for developing the python script, but it lacks the robust testing and deployment best practices.
B is incorrect. This is close, but Cloud Build should be used to deploy the function when it is ready.
C is correct. Use Jupyter notebooks to develop, use the Functions Framework to debug and test, and use Cloud Build to deploy the function, all from your local machine.
D is incorrect. This would likely result in a difficult development process with many failed attempts to deploy and test the function.
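A minimal sketch of the local development loop with the Functions Framework; the function name and payload fields are illustrative assumptions. Run locally with `functions-framework --target=process_event --debug`, POST test payloads to http://localhost:8080, and only then deploy with Cloud Build.

```python
import functions_framework


@functions_framework.http
def process_event(request):
    """HTTP Cloud Function developed and tested locally before deployment."""
    payload = request.get_json(silent=True) or {}
    # ...data processing logic, unit tested on the local machine...
    record_count = len(payload.get("records", []))
    return {"status": "ok", "records_processed": record_count}, 200
```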
You are using Dataplex and BigQuery to host a data mesh. You want to build freshness checks against the data so that users can know if data has been properly run each morning. What is the most optimal method to accomplish this?
A is incorrect. This is technically feasible, but would be difficult to scale and would require significant maintenance efforts to ensure viability. Dataplex does this automatically with data profiling and auto data quality checks.
B is incorrect. This is technically feasible, but a better destination would be Cloud Logging with alerts set up if jobs fail. Dataplex can apply data profiling and quality checks automatically and is the better choice here.
C is correct. Dataplex can integrate BigQuery data profiling with Data Catalog to automatically map and publish detailed data profiles for any BigQuery table. If you have built a data mesh with Dataplex, you can use its integrated monitoring, which relies on automatically captured metadata, to report on data quality measures such as freshness and availability.
D is incorrect. This is an inefficient solution. It would be better to use automatic Cloud Logging as a product of the service or communicate directly with the API for custom alerting. Also, Dataplex can perform these checks automatically as part of auto data quality checks.
You are working with a financial services firm. They are working with data which, by law, requires strict ACID-compliant transaction handling. They are operating in a single region only. Which database service would be the best fit for their requirements?
A is incorrect. BigQuery is not ACID compliant.
B is incorrect. Although Cloud Spanner is ACID compliant, it is a global service, and would not best fit the requirements.
C is correct. Cloud SQL is ACID compliant and is a regional service.
D is incorrect. BigTable is not ACID compliant.
A data science team has 3PBs of data stored in GCS standard dual-region storage and accumulates an additional GB of data every week. They are interested in ways to lower their monthly storage bills and costs. What are some methods that you should recommend to lower the costs?
Select all that apply
Option A is correct: Autoclass will automatically reclassify data into a cheaper storage class without the user having to perform any configuration themselves.
Option B is correct: You can compress files to reduce the amount of data stored.
Option C is correct: GCS buckets can be stored as multi-region, dual-region, or single-region, and there are slight price differences between these options.
Option D is incorrect: For deleting aged files you should use a lifecycle policy. It is wise to avoid deleting files unless specifically asked to do so.
You are running a sports booking website which processes tens of thousands of ticket purchases per day across hundreds of events and dozens of teams. You decide to use BigTable in order to properly manage the complex and high-velocity event stream. How should you build your BigTable key to ensure efficient operation?
Option A is incorrect. Leading a BigTable key with a timestamp is an anti-pattern which will likely lead to hot-spotting.
Option B is incorrect for the same reasons as A.
Option C is correct because the key is not monotonically increasing and therefore minimizes the chance for hot-spotting.
Option D is incorrect because there is still a strong chance of hot-spotting.
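A hedged sketch of one common non-monotonic row key design for this workload, written with the Bigtable Python client; the instance, table, column family, and field names are assumptions, not the exam's exact option wording.

```python
import datetime

from google.cloud import bigtable


def build_row_key(event_id: str, team_id: str, purchase_ts: datetime.datetime) -> bytes:
    # Lead with high-cardinality, non-sequential fields (event, team) and put the
    # timestamp last so writes spread across tablets instead of hot-spotting.
    return f"{event_id}#{team_id}#{purchase_ts.isoformat()}".encode()


client = bigtable.Client(project="my-project")
table = client.instance("ticketing-instance").table("purchases")

row_key = build_row_key("evt123", "teamA", datetime.datetime.utcnow())
row = table.direct_row(row_key)
row.set_cell("purchase", "amount", b"59.99")
row.commit()
```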
A manufacturer of IOT devices collects sensor readings from the robots which construct the IOT devices. The robots send the data to an API endpoint for ingestion. However, you have noticed that occasionally the ingest fails and the data are lost. What is a better method to ingest the data?
Option A is incorrect: This is incorrect because it is not clear how the data are formatted or how the data manufacturer reads the data.
Option B is incorrect: This would alter the way the data are ingested and would require unnecessary reworks.
Option C is incorrect: Cloud Spanner is a RDBMS which ensures ACID transactions. It is not required for this instance.
Option D is correct: Using Pub/Sub provides redundancy against lost messages and provides a scalable, guaranteed message delivery service.
You currently have a pub/sub topic set up which is capturing application event data in JSON format. What is the most optimal method to stream the data into BigQuery without any transformations?
Option A is incorrect. This method is inefficient for data which does not require transformations or processing.
Option B is incorrect. This method is technically feasible, but it is not the optimal solution.
Option C is correct. Use a BigQuery Subscription to Pub/Sub to automatically stream data into BigQuery via the Storage Write API. This is the best method to stream event data into BigQuery which doesn't require any kind of preprocessing or transformations.
Option D is incorrect. This is a good method if you want to isolate the data and perform some other processing on the data before inserting it. However, it is inefficient in this case.
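A hedged sketch of creating a BigQuery subscription so Pub/Sub streams the JSON events straight into a table via the Storage Write API; the project, topic, subscription, and table names are placeholders.

```python
from google.cloud import pubsub_v1

subscriber = pubsub_v1.SubscriberClient()

topic_path = "projects/my-project/topics/app-events"
subscription_path = "projects/my-project/subscriptions/app-events-to-bq"

# Route messages directly into the BigQuery table, no Dataflow job required.
bigquery_config = pubsub_v1.types.BigQueryConfig(
    table="my-project.analytics.app_events",
    write_metadata=True,   # include message metadata columns alongside the payload
)

subscriber.create_subscription(
    request={
        "name": subscription_path,
        "topic": topic_path,
        "bigquery_config": bigquery_config,
    }
)
```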
You are managing a BigTable instance and have recently built a new table. You have noticed that the data response times have dropped and queries are behaving poorly. What is the first thing you should do to diagnose what is wrong?
Option A is incorrect. This is a good check, but it shouldn't be the first thing you check.
Option B is incorrect. This is unnecessary as Bigtable can handle millions of requests per second.
Option C is incorrect. This is another good check, but it shouldn't be the first thing you check.
Option D is correct. Checking for hot-spotting is the first thing you should check if Bigtable performance is degrading. Use the Key Visualizer to highlight row-keys which are prone to hot-spotting.
You are managing a PostgreSQL Cloud SQL cluster and notice that queries are starting to fail. Upon investigation you notice that your data is now over your persistent disk size of 25 TB and Cloud SQL can no longer effectively process the data. What is the most efficient solution you could recommend to resolve the issue?
Option A is correct. Cloud SQL has a hard storage limit of 25 TB. Cloud Spanner is often used by customers when they reach this limit or when latency issues across regions become too great to use Cloud SQL alone.
Option B is incorrect. Cloud SQL has a hard storage limit of 25TB.
Option C is incorrect. AlloyDB is used to speed up analytical queries, but is still subject to the limits of Cloud SQL, including the hard storage limit of 25 TB.
Option D is incorrect. This is technically feasible, but unless this is a development server this would not be a wise move because it could violate privacy or other legal regulations. Additionally, this is not a sustainable solution as it is likely that you will hit this cap again in the future.
You are operating in a multi-cloud architecture and need to integrate data from a few sources stored in S3 and Azure Blob Storage into BigQuery, where you are already hosting a number of GCS external tables. New files are added to the buckets each day. Additionally, the data contains some PII and requires column-level security to comply with certain legal requirements. How could you accomplish this most efficiently?
A is incorrect: This is technically feasible, but BigQuery Omni would be a better choice since it wouldn't require a data transfer.
B is incorrect: This might be technically feasible, but it would be difficult to manage effectively due to variance in file size. It also would not be fault tolerant as a serverless solution. Additionally, this is very inefficient compared to using BigQuery Omni.
C is incorrect: This solution would be difficult to manage and would require a complex network of permissions and processors to be developed. BigQuery Omni is a much more efficient solution.
D is correct: BigLake tables can be used to read data directly from other public cloud providers via BigQuery Omni. This enables easy multi-cloud analytics without requiring the data to be transferred into GCS first.
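A hedged sketch of defining a BigLake table over S3 data through BigQuery Omni so the files are queried in place. The connection name, AWS bucket path, and dataset are placeholder assumptions; the connection must already exist and the dataset must live in the matching AWS location.

```python
from google.cloud import bigquery

client = bigquery.Client()

ddl = """
CREATE EXTERNAL TABLE `my_project.omni_dataset.s3_orders`
WITH CONNECTION `aws-us-east-1.s3-orders-connection`
OPTIONS (
  format = 'PARQUET',
  uris = ['s3://my-company-orders/daily/*']
)
"""
client.query(ddl).result()

# Column-level security is then applied with policy tags on the table schema,
# just as it would be for a native BigQuery table.
```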
You are running a Cloud Spanner instance and notice that your compliance team frequently looks to link user data and regulatory data by geographical data in order to ensure regulatory compliance. They occasionally notice slow query performance and asked you to optimize the query. User data is pre-processed with Cloud DLP before it is written to Cloud Spanner in order to comply with each country's specific regulations. What is a method you could use in Cloud Spanner to improve the performance of the query while also ensuring that user data cannot be written outside of the region where it is created?
Option A is correct. In this instance the best option is to create an interleaved table in Spanner which will allow the users to query the geographical data, the regulatory data, and the user data in a single table. This provides a much more efficient query than querying the tables separately and attempting to join on non-primary keys.
Option B is not correct. There isn't a good natural relationship immediately available between geographical data, regulatory data and user data. Additionally, even with effective foreign-key relationships it would still be more efficient to query a single interleaved table.
Option C is not correct. Although BigQuery could perform this query more efficiently it would require a complex and inefficient solution to implement compared to better options. BigQuery datasets are limited to a single region by default which might bring up other geolocation issues compared to Spanner leading to increased query latency. There could be privacy and regulatory concerns as well. Additionally, the data requires strict referential integrity which is not supported by BigQuery.
Option D is not correct. Spanner is an RDBMS and while working with STRUCT data in Spanner is possible it is not a best practice and is not as efficient as creating an interleaved or foreign-key table.
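A hedged sketch of the interleaved-table approach in Cloud Spanner. Table and column names are illustrative assumptions; the key point is INTERLEAVE IN PARENT, which physically co-locates child rows with their parent geography row.

```python
from google.cloud import spanner

client = spanner.Client(project="my-project")
database = client.instance("compliance-instance").database("compliance-db")

operation = database.update_ddl([
    """
    CREATE TABLE Geographies (
        GeoId    STRING(36) NOT NULL,
        Region   STRING(64),
    ) PRIMARY KEY (GeoId)
    """,
    """
    CREATE TABLE UserRecords (
        GeoId    STRING(36) NOT NULL,
        UserId   STRING(36) NOT NULL,
        Payload  JSON,
    ) PRIMARY KEY (GeoId, UserId),
      INTERLEAVE IN PARENT Geographies ON DELETE CASCADE
    """,
])
operation.result()   # wait for the schema change to complete
```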
You are working for an online sports betting company. The CFO wants to know the best database technology to host a new highly dynamic sports betting application which has to collect and process data, ensure strong consistency, and have atomic transaction handling, all with minimal latency. The application is expected to have thousands of transactions running per second. What database technology would you recommend?
Option A is incorrect. Cloud SQL can be used to power website data, but at the scale and complexity suggested Cloud SQL would likely not be able to keep up with the transactions.
Option B is correct. Memorystore for Redis is GCP's managed Redis offering: an in-memory data store used to host highly dynamic applications which require thousands or even millions of operations per second. Redis also supports atomic transactions and, against a single primary, strongly consistent reads and writes.
Option C is incorrect. This works for many different websites, but in this instance it would likely be too slow to handle the workload correctly. Firestore does offer transactions, but its latency and throughput characteristics at this transaction rate are not as strong as Memorystore or Cloud SQL.
Option D is incorrect. Cloud Spanner is strongly consistent and ACID compliant, but it would not be quick enough to support the website's requirements.
You are working with a data science team to help migrate some machine learning workflows over to GCP. The data have been migrated to BigQuery already, and now they need to move the models. Currently they are running some batch linear regression models on a local on-prem cluster operating a PyTorch framework. The data science team is interested in the most efficient processing regardless of cost. How would you advise them to migrate the models over to operate most efficiently in GCP?
A is incorrect. This is technically feasible, but it wouldn't provide any benefits over BigQuery ML and would be more complicated to develop and deploy.
B is correct. Since the data are already in BigQuery, and the models are batch linear regressions, it would be easier to rebuild the models using BigQuery ML. BigQuery ML is also a more performant platform for tabular ML than Dataproc.
C is incorrect. This would be a much more complicated development process compared to just using BigQuery ML.
D is incorrect. This could work, and be relatively easy to develop, but it still wouldn't be as efficient or performant as just using BigQuery ML.
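A minimal sketch of rebuilding the batch linear regression in BigQuery ML; dataset, table, and column names are placeholder assumptions.

```python
from google.cloud import bigquery

client = bigquery.Client()

create_model = """
CREATE OR REPLACE MODEL `my_project.analytics.price_model`
OPTIONS (model_type = 'linear_reg', input_label_cols = ['label']) AS
SELECT feature_1, feature_2, feature_3, target AS label
FROM `my_project.analytics.training_data`
"""
client.query(create_model).result()

# Batch predictions run as plain SQL, with no cluster to manage.
predict = """
SELECT *
FROM ML.PREDICT(MODEL `my_project.analytics.price_model`,
                (SELECT feature_1, feature_2, feature_3
                 FROM `my_project.analytics.scoring_data`))
"""
for row in client.query(predict).result():
    pass  # write or inspect predictions here
```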
You have been using Cloud Dataflow to run batch data processing pipelines, but lately you have noticed the costs going up. What is the most efficient way to help reduce costs if you don't need to worry about jobs completing at a specific time and are fault tolerant?
Option A is correct: Use Dataflow FlexRS to lower the costs of your batch pipelines.
Option B is incorrect: Dataflow is a fully managed service, therefore this is not something that you can control.
Option C is incorrect: This would likely result in more expensive pipelines, not cheaper ones.
Option D is incorrect: This could work, but it would require a rework of the data model, the Dataflow job, and likely all downstream applications as well.
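A hedged sketch of enabling Flexible Resource Scheduling (FlexRS) on a batch Beam pipeline submitted to Dataflow; the project, bucket, and transforms are placeholders. FlexRS trades a guaranteed start time for lower-cost, preemptible-backed workers.

```python
import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions

options = PipelineOptions(
    runner="DataflowRunner",
    project="my-project",
    region="us-central1",
    temp_location="gs://my-bucket/tmp",
    flexrs_goal="COST_OPTIMIZED",   # the FlexRS flag for delayed, cheaper scheduling
)

with beam.Pipeline(options=options) as p:
    (p
     | "Read" >> beam.io.ReadFromText("gs://my-bucket/input/*.csv")
     | "Count" >> beam.combiners.Count.Globally()
     | "Write" >> beam.io.WriteToText("gs://my-bucket/output/counts"))
```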
You are working for a health insurance firm which processes and pays claims containing highly sensitive and protected consumer data. Legal and company compliance policies dictate the absolute highest level of cryptographic security for data moving between services within GCP. Which security architecture should you use?
A is incorrect. This is insufficient by itself. You should also use Cloud HSM to physically separate the cryptographic process from other GCP services.
B is incorrect. This is true, but it would be insufficient according to the requirements.
C is incorrect. This is insufficient by itself. You should also use CSEK in this instance to control the keys and make it impossible for anyone, including Google, to access the data.
D is correct. Use Cloud HSM to physically separate the cryptographic process from the other GCP services and use CSEK to ensure that only someone who possesses the key can decrypt the data.
Which regulation(s) are the most well-known data privacy laws specific to the state of California?
Choose all that apply
Options B and C are the correct answers. California's CCPA and CPRA laws are some of the most restrictive in the United States. They outline strict guidelines for data collected, stored, accessed, or processed within California's geographical boundaries.
GDPR applies to EU member states.
FedRAMP is used by the federal government when developing contracts with contracting companies.
COPPA is an important privacy law, but it isn't specific to California.
HIPAA is important for health care data, but it isn't specific to California. HITECH is important for health care data, but it isn't specific to California.
You are working with a data science team who needs to migrate an on-prem Hadoop cluster to GCP. The cluster includes 50 TBs of data hosted on HDFS as well as the associated JARs for execution. What would the most efficient future state architecture look like in GCP?
A is incorrect. You could do this, but using GCS to host the data is the recommended best practice and is in fact higher performing and more scalable than HDFS.
B is incorrect. This is essentially attempting to build Dataproc manually. It would be a difficult and time-consuming task which would provide no additional benefits compared to just using Dataproc Standard.
C is correct. Use Dataproc to run the Hadoop jobs and use GCS to host the data instead of HDFS.
D is incorrect. Dataproc Serverless can only run Spark jobs.
A health care analytics team is operating in BigQuery and often has to work with very sensitive health care data. What method or tool should they use to check for and prevent any leakage of personally identifiable information or other sensitive data?
Option A is incorrect. This would be very expensive and inefficient to run against all columns in your dataset. Also, it wouldn't be a scalable or sustainable solution as it would have to be retrained with each new piece of data discovered.
Option B is correct. Sensitive Data Protection (the Cloud DLP API) can be used to automatically detect PII or other sensitive data in each of your BigQuery datasets. It has built-in infoTypes for most PII, and you can build custom infoTypes to satisfy any regulation.
Option C is incorrect. The purpose of sensitive data protection is to avoid risk. In this instance the team should have to show why they require access to the PII, even if they have permissions to view it. A leak of PII data and an accompanying regulatory violation can be very costly. Also, BigQuery has dynamic masking policies which can automatically reveal data based upon who is looking at it.
Option D is incorrect. This could work, but Cloud DLP does this natively by default. Additionally, by using a third party tool you could potentially open up another surface area of attack.
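A hedged sketch of calling Sensitive Data Protection (Cloud DLP) to scan text for built-in infoTypes. In practice the team would run scheduled inspection jobs against the BigQuery datasets; the project, infoTypes, and sample text here are placeholders.

```python
from google.cloud import dlp_v2

client = dlp_v2.DlpServiceClient()
parent = "projects/my-project/locations/global"

inspect_config = {
    "info_types": [{"name": "PERSON_NAME"}, {"name": "US_SOCIAL_SECURITY_NUMBER"}],
    "min_likelihood": dlp_v2.Likelihood.POSSIBLE,
    "include_quote": True,
}
item = {"value": "Patient Jane Doe, SSN 123-45-6789, admitted on 2023-11-02."}

response = client.inspect_content(
    request={"parent": parent, "inspect_config": inspect_config, "item": item}
)

# Each finding reports the detected infoType, its likelihood, and the quoted text.
for finding in response.result.findings:
    print(finding.info_type.name, finding.likelihood, finding.quote)
```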
You are working for a globally operating financial services firm and the CIO wants to be sure that BigQuery is durable and highly-available in the rare, but possible, occurrence of a regional disaster. Which actions would you recommend to ensure this?
A is incorrect. BigQuery is dual-zone by default, but this doesn't protect against a regional outage.
B is incorrect. This could work, but it is not an optimal solution and would require a huge rework of the BigQuery instance and data.
C is incorrect. While true, this is not guaranteed, and the CIO is looking for a guaranteed solution.
D is correct. Use Cross-Region Datasets to ensure that BigQuery will be available and durable in the case of a regional disaster.
The data platform team is looking to enable a service oriented architecture in GCP to power advanced analytics with Spark ML as well as perform operational queries with Looker. They have an application creating and processing events. What architecture would you recommend to most efficiently enable the transmission, collation, and aggregation of event data in GCP?
A is correct. Use Pub/Sub to manage the messages as they come in. This will ensure that no data is lost from the time of production to consumption of the message. It enables a strong acknowledgment chain to prevent the loss of data. Set up a BigQuery subscription to easily consume the messages from Pub/Sub as they are created without any additional processing. Use BigQuery to build the operational views of the data. Use Dataproc Serverless to read from the data hosted in BigQuery and execute the Spark ML models.
B is incorrect. This is almost correct. This would probably work the vast majority of the time, but it wouldn't be as robust without Pub/Sub to handle the messages. Any failures in transmission would result in lost data.
C is incorrect. There are a few inefficiencies here. You should use Pub/Sub to handle the messages. Using GCS and external tables is fine, but it lacks some of the features of importing the data directly into BigQuery. Dataproc Standard hosted on Kubernetes would work, but it wouldn't be as efficient as Dataproc Serverless for Spark.
D is incorrect. This is almost correct. It would work except for using BigQuery ML. It isn't clear what kind of model is being produced here, and BigQuery ML doesn't support all possible models. In this case, the requirements specifically call for Spark ML which means Dataproc, and almost certainly Dataproc Serverless.
You are working for a Social Media strategy consultancy. They need to ensure that their data cannot be used to identify a person. They have already used Cloud DLP to scrub their data of PII, but they want to make sure that some other fields, such as date of birth and city, cannot be used to re-identify a person. How can they check for this and ensure a minimal possibility of re-identification while still enabling good data to be aggregated and analyzed?
A is incorrect. This will ensure a zero re-identification risk, but will also scrub all useful fields of information, making it impossible to provide useful analytics.
B is incorrect. This could work, but it is an unnecessary and inefficient step since Cloud DLP can do this natively.
C is incorrect. This will ensure a zero re-identification risk, but will also scrub all useful fields of information, making it impossible to provide useful analytics. Additionally, the data would need to be decrypted at some point, and the risk could remain.
D is correct. DLP can also be used to assess re-identification risk: re-identification means identifying a particular user by observing alternative identifiers, such as an email address, phone number, or address. DLP can provide a risk report showing the probability of re-identifying users in your data. Use this information to formulate a plan to de-identify your data.
What is the durability rating of Google Cloud Storage (GCS)?
A is incorrect.
B is incorrect.
C is incorrect.
D is correct. Cloud Storage is designed for 99.999999999% (11 9's) durability.
You are developing a data mesh with Dataplex for use by your analytics team. All of your data are hosted in GCP and you want to build BigLake tables on the data. Which files types are supported by BigLake?
Select All That Apply
You are working with a financial services firm. They are operating an on-prem Kubernetes cluster which processes highly sensitive financial data. Because of this they want to encrypt the processed data on-prem and move the data into GCP without exiting the GCP network. Additionally, they want to use Google managed services only to ensure a high-quality product with minimal maintenance required. What is the architecture they should use?
Option A is incorrect: The customer wants to use fully managed services only. Additionally, data sent over SSL/TLS means that it is sent over the internet and therefore must exit the Google network.
Option B is correct: Use Google Distributed Cloud Virtual (GKE Enterprise on-premises) to run a managed Kubernetes cluster on-prem. Set up Cloud Interconnect with Dedicated Interconnect to ensure that the data never leaves the Google network. Finally, use Customer-Supplied Encryption Keys (CSEK) to encrypt the data before sending it to GCP.
Option C is incorrect. This is a secure solution with managed services, but the customer wants to perform the processing on-prem.
Option D is incorrect. This is insecure because the data leaves the GCP network. You should use Cloud Interconnect with a Dedicated Connection.
You are working with a data science team gathering data for climate science at various national parks throughout the country. These data are gathered through barometric and other atmospheric measuring instruments. These sensor groups can consist of thousands of nodes which stream data to Pub/Sub. Due to the difficult internet connections available at some of the sites the data sometimes do not come in immediately, leading to late data. How can you prepare your dataflow data-pipelines to handle late arriving data?
Option A is incorrect. Dataflow does not natively handle late-arriving data; this must be configured manually.
Option B is incorrect. Dataflow SQL does not allow late-arriving data.
Option C is incorrect. Batch pipelines don't really have a concept of late arriving data.
Option D is correct. Dataflow can automatically pick up and handle late arriving data if enabled via the Java SDK.
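A hedged sketch of configuring allowed lateness and a late-firing trigger for the sensor stream; the Python SDK is shown here purely for illustration (the answer above refers to the Java SDK, which exposes the same configuration). Topic name, window size, and lateness values are assumptions.

```python
import apache_beam as beam
from apache_beam import window
from apache_beam.options.pipeline_options import PipelineOptions
from apache_beam.transforms.trigger import (
    AccumulationMode,
    AfterProcessingTime,
    AfterWatermark,
)

with beam.Pipeline(options=PipelineOptions(streaming=True)) as p:
    (p
     | "Read" >> beam.io.ReadFromPubSub(topic="projects/my-project/topics/sensor-readings")
     | "Window" >> beam.WindowInto(
           window.FixedWindows(60),                        # 1-minute windows
           trigger=AfterWatermark(late=AfterProcessingTime(30)),
           allowed_lateness=600,                           # accept data up to 10 minutes late
           accumulation_mode=AccumulationMode.ACCUMULATING,
       )
     | "Count" >> beam.combiners.Count.Globally().without_defaults())
```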
Select The Correct Pub/Sub Components
select all that apply
Pub/Sub consists of six components:
- Publisher (also called a producer): creates messages and sends (publishes) them to the messaging service on a specified topic.
- Message: the data that moves through the service.
- Topic: a named entity that represents a feed of messages.
- Schema: a named entity that governs the data format of a Pub/Sub message.
- Subscription: a named entity that represents an interest in receiving messages on a particular topic.
- Subscriber (also called a consumer): receives messages on a specified subscription.
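A minimal sketch tying the components together with the Pub/Sub Python client: a publisher sends a message to a topic, and a subscriber pulls it from a subscription. The project, topic, and subscription names are placeholders, and the topic and subscription are assumed to already exist.

```python
import concurrent.futures

from google.cloud import pubsub_v1

project_id = "my-project"

# Publisher -> Message -> Topic
publisher = pubsub_v1.PublisherClient()
topic_path = publisher.topic_path(project_id, "orders")
future = publisher.publish(topic_path, b'{"order_id": 42}', source="web")
print("published message id:", future.result())

# Subscription -> Subscriber
subscriber = pubsub_v1.SubscriberClient()
subscription_path = subscriber.subscription_path(project_id, "orders-sub")


def callback(message: pubsub_v1.subscriber.message.Message) -> None:
    print("received:", message.data)
    message.ack()


streaming_pull = subscriber.subscribe(subscription_path, callback=callback)
try:
    streaming_pull.result(timeout=30)   # listen for 30 seconds
except concurrent.futures.TimeoutError:
    streaming_pull.cancel()
```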
You are managing a BigQuery instance and have noticed that your costs are starting to rise. What are some things you could do to try and keep costs low?
Select all that apply
A is correct. Materialized views act as a sort of data cache and can help keep costs low if the query doesn't force a scan of the underlying table.
B is correct. Use partitioning and clustering to lower the amount of data queried. Partitioning will break a table into queryable chunks which limit the data processed by BigQuery. Clustering sorts the data in your table by the cluster keys, allowing BigQuery to skip blocks that don't match your filters.
C is incorrect. BigQuery storage costs are the same as GCS, and both will automatically reclassify underutilized data into lower cost storage.
D is correct. Use quotas and limits to cap how much of a given resource your project can utilize. This gives a hard stop to querying capacity for any user.
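A minimal sketch of two of the controls mentioned above: a partitioned and clustered table so queries scan less data, and a hard cap on bytes billed per query. Dataset and column names are assumptions.

```python
from google.cloud import bigquery

client = bigquery.Client()

# Partition by day and cluster by user_id so date-filtered queries prune data.
client.query("""
CREATE TABLE IF NOT EXISTS `my_project.analytics.events`
(
  user_id  STRING,
  event_ts TIMESTAMP,
  payload  JSON
)
PARTITION BY DATE(event_ts)
CLUSTER BY user_id
""").result()

# Refuse to run any query that would bill more than ~1 GB.
capped = bigquery.QueryJobConfig(maximum_bytes_billed=10**9)
client.query(
    "SELECT user_id, event_ts FROM `my_project.analytics.events` "
    "WHERE DATE(event_ts) = '2023-11-01'",
    job_config=capped,
).result()
```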
You are working for a device supplier which captures IOT data from many devices scattered across a geographical region. You have set up a Pub/Sub service, but since you have started sending data you have noticed that some of the messages are corrupted or malformed, which is causing downstream applications to crash. How would you most efficiently solve this?
Option A is incorrect. This could, in theory, work to prevent forwarding bad data, but it is a much less efficient solution than using Pub/Sub's built-in schema checking. Additionally, Cloud Functions lacks some of the guarantees that Pub/Sub offers, such as an at-least-once delivery guarantee.
Option B is incorrect. Creating a schema is the proper approach, but schemas are built on topics, not on subscriptions.
Option C is incorrect. This would be a very inefficient and costly solution to implement.
Option D is correct. Using a Pub/Sub schema will ensure that any message received complies with the proper message structure; messages that fail the schema check are rejected at publish time.
You are running a series of BigQuery queries which are dependent upon other queries executing correctly and in sequence. Currently you are running this process in cloud scheduler, but you notice that this is difficult to manage and, if one query fails, you have to rerun the entire job. What is a better tool you could use to execute the queries?
A is incorrect. This would only really shift the logic to another service without improving manageability.
B is incorrect. This might work in some instances, but it wouldn't allow the jobs to pick up where they failed, and it wouldn't improve manageability of the jobs.
C is correct. Use Cloud Composer to transform the queries into separate tasks. Cloud Composer gives great visibility into the individual tasks needed to complete the DAG and allows you to pick up where a task failed if it needs to be rerun.
D is incorrect. Cloud Composer is the better option here.
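A hedged sketch of the dependent-query pattern as a Composer (Airflow) DAG: each query becomes its own task, so a failure can be retried from that task rather than rerunning the whole chain. The DAG name, schedule, and SQL statements are placeholder assumptions.

```python
from datetime import datetime

from airflow import DAG
from airflow.providers.google.cloud.operators.bigquery import BigQueryInsertJobOperator


def bq_task(task_id: str, sql: str) -> BigQueryInsertJobOperator:
    # Each BigQuery query becomes an individually retryable Airflow task.
    return BigQueryInsertJobOperator(
        task_id=task_id,
        configuration={"query": {"query": sql, "useLegacySql": False}},
        retries=2,
    )


with DAG(
    dag_id="dependent_bq_queries",
    schedule_interval="@daily",
    start_date=datetime(2023, 11, 1),
    catchup=False,
) as dag:
    stage = bq_task("stage", "CREATE OR REPLACE TABLE mart.stage AS SELECT ...")
    enrich = bq_task("enrich", "CREATE OR REPLACE TABLE mart.enriched AS SELECT ...")
    publish = bq_task("publish", "CREATE OR REPLACE TABLE mart.final AS SELECT ...")

    stage >> enrich >> publish   # queries run in sequence, resumable at any task
```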
You are building a Pub/Sub pipeline to connect a new data source to BigQuery, but the schema is constantly changing. Therefore, you need to perform some data transformations before loading the data into BigQuery. What is the optimal data processing architecture to use?
A is incorrect. This is technically possible, but the data should be processed before loading into BigQuery.
B is incorrect. This is technically feasible, and is probably more flexible than Dataflow, but for performance and optimization reasons Dataflow is the better solution.
C is incorrect. This is technically feasible, but it is inefficient and redundant compared to an end-to-end Dataflow pipeline.
D is correct. Use Dataflow for performing data processing and schematization before loading the data into the data warehouse.
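A hedged sketch of the end-to-end Dataflow pattern: read from Pub/Sub, normalize the evolving source schema, then write to BigQuery. The topic, table, field names, and mapping logic are placeholder assumptions.

```python
import json

import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions


def normalize(raw: bytes) -> dict:
    """Reshape the changing source schema into the fixed warehouse schema."""
    record = json.loads(raw.decode("utf-8"))
    return {
        "event_id": record.get("id") or record.get("event_id"),
        "event_ts": record.get("timestamp"),
        "payload": json.dumps(record),
    }


with beam.Pipeline(options=PipelineOptions(streaming=True)) as p:
    (p
     | "Read" >> beam.io.ReadFromPubSub(topic="projects/my-project/topics/raw-events")
     | "Normalize" >> beam.Map(normalize)
     | "Write" >> beam.io.WriteToBigQuery(
           "my-project:analytics.events",
           schema="event_id:STRING,event_ts:TIMESTAMP,payload:STRING",
           write_disposition=beam.io.BigQueryDisposition.WRITE_APPEND,
       ))
```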
Your analytics team is using Looker to power an analytics dashboard used by multiple teams each day. Their data source is a static BigQuery table which updates once per day. What is the best method to minimize query costs and maximize performance for this dashboard?
A is incorrect. A materialized view would not help in this case as BigQuery will automatically cache the static data for 24 hours.
B is incorrect. BI Engine is unnecessary in this case because BigQuery will automatically cache the query result for 24 hours which will result in immediate retrieval of the data.
C is correct. BigQuery automatically caches static query results for 24 hours, and this cache is referenced each time the query is run.
D is incorrect. This could help save on query costs if a user is building ad hoc queries against the view, but it wouldn't increase performance. Looker automatically does this with its LookML builder in the dashboard logic so this is redundant in this case.
You have a large number of datasets stored in GCS totaling several terabytes of data and want to make them sharable with BigQuery users in your organization. What is the most efficient way to accomplish this?
A is incorrect. This is technically feasible, but very inefficient compared to using external tables. Dataplex can perform all the necessary steps automatically.
B is correct. When you add your GCS bucket to Dataplex it will automatically build BigQuery external tables for each prefix, apply data quality and permission rules, and surface the data for discovery within a data mesh.
C is incorrect. This is technically feasible, but Dataplex can perform all the necessary steps automatically.
D is incorrect. This is an inefficient design. BigQuery can read GCS data directly with external tables, and Dataplex can automatically handle all Data Catalog tasks.
You are working with a data science team for a major pharmaceuticals firm. They are interested in setting up a Dataplex data lake for their data stored in Cloud Storage. They have about 3 PB of data spread across dozens of GCS tables in their raw bucket. They are currently building the pipelines needed to produce the curated bucket. What formatting requirements are needed for GCS data when they decide to build the curated bucket?
Select All That Apply
All are correct. The data should be high-quality, groomed, well-organized, properly prefixed, in a Hive-partitioned layout, and ideally formatted as CSV, newline-delimited JSON, or Parquet.
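A small illustrative sketch of the curated-zone layout described above: Hive-style partition prefixes under a single table prefix, written as Parquet. The bucket, table prefix, and fields are assumptions, and writing to gs:// paths with pandas requires the gcsfs and pyarrow packages.

```python
import pandas as pd

df = pd.DataFrame({"trial_id": [1, 2], "result": [0.91, 0.87]})

# Hive-style layout: gs://<curated-bucket>/<table>/year=YYYY/month=MM/day=DD/<file>.parquet
# Dataplex maps each key=value prefix to a partition when it discovers the table.
path = "gs://curated-bucket/trial_results/year=2023/month=11/day=02/part-0000.parquet"
df.to_parquet(path)
```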
What is the first step when creating a data mesh?
A is incorrect. This is a necessary step, but not the first one you should take.
B is incorrect. This is a necessary step, but not the first one you should take.
C is incorrect. This is a necessary step, but not the first one you should take.
D is correct. The first step to building a data mesh is to establish the data governance council which will map out the zones, rules and operating procedures for building and operating the data mesh.
An e-commerce firm is looking to migrate an existing on-prem Postgres DB to GCP and is looking for the optimal database technology to host the data. They are currently operating in the US, but are expanding their business into Europe and they want to have low latency and easy access from anywhere on Earth. Additionally, they also want minimal changes to their current queries and interface. What is the optimal database technology to meet their current and future requirements?
A is incorrect. Cloud SQL would not provide low-latency access from anywhere on Earth.
B is incorrect. This Cloud SQL option likewise would not provide low-latency access from anywhere on Earth.
C is correct. Cloud Spanner would fulfill all the requirements listed. It is easily accessed from anywhere on earth with low latency and its interface and SQL syntax are designed to be almost identical to Postgres.
D is incorrect. Memorystore for Redis is an ephemeral, in-memory data store used for powering low-latency, high-intensity web applications and is not an effective substitute for Cloud SQL.
You are working with a data team at a sensitive government agency who has strict data governance requirements. What is the certification your organization needs in order to contract effectively with federal government agencies and organizations?
Option A is correct: FedRAMP is a standardized methodology for ensuring compliance among cloud operators when building solutions for the federal government and its associated agencies.
Option B is incorrect: HIPAA may be required for certain health care service providers, but FedRAMP covers HIPAA and other certifications required for a particular service.
Option C is incorrect: CCPA applies to California only.
Option D is incorrect: Export Administration Regulations (EAR) don't apply to data engineering on Google Cloud unless you intend to export your software overseas.
You are working with a smaller organization who is getting started with GCP. They are looking to develop some simple data processing pipelines using a low code or no code solution. Which technology would you recommend?
A is incorrect. Dataprep is designed to clean and format data for further processing. Although it might work here it likely wouldn't be robust enough for these requirements.
B is incorrect. Dataflow requires advanced coding techniques and probably wouldn't work for all requirements here.
C is correct. Data Fusion would be the best option here. Cloud Data Fusion is a fully managed, cloud-native, no-code enterprise data integration service for quickly building and managing data pipelines.
D is incorrect. Cloud Functions might work for some processing tasks, but it would require advanced coding skills and likely wouldn't fit all scenarios required.
An e-commerce company is using Memorystore for Redis to host user actions on its site. However, it has become a challenge to maintain the data model due to frequent schema changes in products and users. Which storage technology should the company utilize instead of Memorystore to solve this issue?
Option A is incorrect. Cloud Spanner is an RDBMS. While you could use Python to update the schema accordingly, it would be very slow to implement and would greatly degrade the performance of the site.
Option B is incorrect. This could work, but a Firestore document model would be better and allow easier access to the data. Additionally, the Firestore data would be easier to observe and maintain.
Option C is incorrect. BigQuery is an analytical warehouse and is not designed for the kind of dynamic data access a site would require. It would also not solve the problem of schema evolution
Option D is correct. Migrate the site data to Firestore's document database to enable highly dynamic data access while allowing for a mutated schema.
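A minimal sketch of the document model in Firestore: each user action is a document whose fields can evolve without schema migrations. Collection and field names are assumptions.

```python
from google.cloud import firestore

db = firestore.Client(project="my-project")

# New attributes (e.g. a coupon_code field) can appear on later documents
# without altering any existing data or running a schema migration.
db.collection("user_actions").document("action-0001").set({
    "user_id": "u-42",
    "action": "add_to_cart",
    "product": {"sku": "SKU-123", "price": 19.99},
    "ts": firestore.SERVER_TIMESTAMP,
})

# Query by field regardless of which optional attributes each document carries.
for doc in db.collection("user_actions").where("user_id", "==", "u-42").stream():
    print(doc.id, doc.to_dict())
```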
You are running a globally available online simulator game with GCP as a backend while users play the game on their personal devices. What is the best method to send user event data to GCP which maximizes operating efficiency and minimizes development effort? You want to ensure that no data are lost, but duplicates are allowed.
Option A is not correct. This would be bad design because the data could come from anywhere in the world which would greatly increase latency, the service would be prone to failures and likely couldn't handle the amount of traffic, and storing the data in GCS would likely be insufficient for downstream processors.
Option B is not correct. Setting up and maintaining a Kafka cluster, while it could be used to solve this problem, is not the most efficient solution in this case. Also, a push method could bog down downstream services, so a pull method would be best in this case.
Option C is correct. Pub/Sub is GCP's globally available message broker service which is used to collect and redistribute event data to a large number of potential subscribers. Pub/Sub comes with at least once processing by default, which ensures that any messages are guaranteed to be captured, but there could be duplicates. A pull operation is best here because it ensures that downstream processors can safely and efficiently handle the incoming message volume.
Option D is not correct. This is a very complex method which would likely fail under load, couldn't guarantee message capture by default, and would be very expensive and time-consuming to develop and maintain. Additionally, there is no guarantee that downstream processors could handle the event, so if a processor fails it would be impossible to know whether data was lost without significant investigation.
You are working at a sports ticketing company processing thousands of transactions per day. The data science team has developed a Spark machine learning recommendation model which predicts events likely to interest people who have already purchased tickets to an event. They now need to operationalize the model and have asked the data engineering team to build the host architecture. What is the best tool to use to minimize development, configuration, and maintenance efforts while also providing a scalable and high-performing infrastructure?
A is correct. Dataproc Serverless for Spark is a low maintenance and low configuration environment for running Spark jobs, including Spark ML jobs.
B is incorrect. This would work, technically, but Dataproc Serverless is a better choice if you're trying to minimize development, configuration, and maintenance efforts.
C is incorrect. This could actually be a more performant and sustainable choice for running the model in the long run, but it would have to be rewritten which would not be the best choice in this case.
D is incorrect. Like the MIG answer, this could work to host the model, but it would be inefficient in this case.
You are working with a research firm specializing in social media analytics. They have a number of pub/sub topics which process event driven messages which come from their client mobile application. You have noticed that the messages can sometimes be processed by a pub/sub topic operating in Europe, but your data standards do not include compliance with GDPR. What is the most efficient method to ensure that the non-compliant pub/sub messages will not process in EU regions, but that compliant topics and services can still operate effectively?
A is incorrect. A resource location restriction would affect all services, not just the Pub/Sub topics. This could potentially break your application or negatively affect the user experience.
B is incorrect. Cloud HA VPN is a hybrid connectivity product designed to connect on-premises networks to your Cloud VPC.
C is correct. Set a message storage policy on the topic to force specific pub/sub topics to operate in a given region. This will funnel the messages to a compliant region without affecting other services or topics operating in the EU.
D is incorrect. This would be inefficient and likely wouldn't solve the problem in a verifiable and sustainable way. It would require significant rebuilds and you would lose all the benefits of Pub/Sub.
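A hedged sketch of creating a topic with a message storage policy so messages for the non-GDPR-compliant workload are only persisted in allowed US regions. The project, topic, and region list are placeholder assumptions.

```python
from google.cloud import pubsub_v1

publisher = pubsub_v1.PublisherClient()
topic_path = publisher.topic_path("my-project", "us-only-events")

# Messages published to this topic can only be persisted in the listed regions,
# while other topics and services keep operating wherever they already run.
publisher.create_topic(
    request={
        "name": topic_path,
        "message_storage_policy": {
            "allowed_persistence_regions": ["us-central1", "us-east1"],
        },
    }
)
```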
You are working for a data science team who is running a number of Hadoop jobs on dataproc. They have noticed that their costs are increasing and are looking for ways to better optimize their costs. What are some things you could recommend?
Select All That Apply
A, B, and D are correct. Dataproc pricing is heavily dependent upon the amount of, and behavior of, whichever compute resources you are using to perform your processing tasks.
C, E, F are incorrect. Autoscaling is best for adapting to increased workloads but will not help control costs. Dataproc Serverless is used for Spark jobs only. Memory-optimized VMs are best for Spark jobs, and are more expensive than general compute clusters.
You are on the data platform team working with an e-commerce corporation. They are currently operating in an on-prem environment, but they want to move to GCP. They use Apache Kafka to handle messaging and Apache Flink to process the data. The CIO now wants to utilize managed services in GCP. Which GCP services can substitute for these?
Select all that apply
A is correct. Pub/Sub is a distributed and fully managed messaging system which can be an effective substitute for Apache Kafka.
B is correct. Cloud Dataflow is a serverless data processing solution which can read from a pub/sub stream and process returned data. This is an effective substitute for Apache Flink.
C is incorrect. Dataproc is a tool used for running Spark or Hadoop jobs and doesn't apply here.
D is incorrect. Data Fusion is a no-code service for building data pipelines in GCP.
E is incorrect. Cloud Dataprep is a no code solution used for cleaning, preparing, and profiling data in GCP. Dataprep can be used to build dataflow pipelines via recipe export, but in this instance it wouldn't count as a direct substitute.
You are managing a BigQuery instance and begin to notice a spike in query costs. How would you identify which queries and jobs are causing this issue?
Option A is incorrect: This is an inefficient and disruptive solution which would require a large development effort to properly implement. BigQuery Audit Logs capture all of this by default.
Option B is incorrect: This is another disruptive, invasive, and unnecessary solution which can be solved with Audit Logs.
Option C is incorrect: This solution is inefficient. You could pass the query to a Cloud Function and capture the metadata associated with the request, the job ID, the query syntax, and the requester identity. The Cloud Function could then execute the job and the user could read the output from the console. But this is more efficiently accomplished by just using Cloud Logging with BigQuery Audit Logs.
Option D is correct: BigQuery Audit Logs are enabled by default and capture the needed metadata. You can then set up an alert in Cloud Logging to notify you when an expensive query is run.
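As a complementary way to surface expensive jobs alongside the audit-log alerting described above, a hedged sketch of querying the INFORMATION_SCHEMA.JOBS view and ranking recent jobs by bytes billed; the region qualifier and project are assumptions.

```python
from google.cloud import bigquery

client = bigquery.Client(project="my-project")

sql = """
SELECT user_email, job_id, query,
       total_bytes_billed / POW(1024, 4) AS tib_billed
FROM `region-us`.INFORMATION_SCHEMA.JOBS
WHERE creation_time > TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 7 DAY)
  AND job_type = 'QUERY'
ORDER BY total_bytes_billed DESC
LIMIT 20
"""

# Print the most expensive queries of the last week and who ran them.
for row in client.query(sql).result():
    print(row.user_email, row.job_id, round(row.tib_billed, 3))
```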
You are working for a warehouse operations center. You are tasked with developing a dataflow pipeline which gathers analytical data about operations such as packages processed per hour, number of losses per hour, and others. Management needs the data metrics to propagate every 5 minutes. Which windowing function can you use to best model this behavior?
A is incorrect. Concurrent windows are not a concept currently supported by Dataflow
B is correct. Use sliding windows to perform the calculations every 5 minutes. Sliding windows overlap, which is ideal for recomputing the rolling metrics on a fixed cadence.
C is incorrect. Session windows are best for tracking events occurring in a website, such as measuring number of clicks or products purchased per user.
D is incorrect. This calculates data within a time constraint, but doesn't overlap, so it wouldn't be the best option in this case.
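A hedged sketch of the sliding-window aggregation: a one-hour window of packages processed, recomputed every 5 minutes. The topic and the one-count-per-message mapping are placeholder assumptions.

```python
import apache_beam as beam
from apache_beam import window
from apache_beam.options.pipeline_options import PipelineOptions

with beam.Pipeline(options=PipelineOptions(streaming=True)) as p:
    (p
     | "Read" >> beam.io.ReadFromPubSub(topic="projects/my-project/topics/package-scans")
     | "One per scan" >> beam.Map(lambda msg: 1)
     | "Sliding window" >> beam.WindowInto(
           window.SlidingWindows(size=3600, period=300))   # 60-min windows, every 5 min
     | "Packages per hour" >> beam.CombineGlobally(sum).without_defaults())
```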
You are running a BigQuery instance. Your analytics users are noticing slow query response times for dynamic and complex dashboards which are using sophisticated analytical functions to build the dashboards. What is a tool you can use to speed up the analytical queries?
Option A is incorrect: Although this is true, BigQuery only caches query results for repeated queries. The dashboard builds novel, complex SQL statements each time it runs, which negates the cache.
Option B is correct: Use BI Engine to store data in memory and greatly speed up analytical queries.
Option C is incorrect: This is a very complex solution which might speed up your queries, but would be very difficult and costly to set up and maintain on its own. BI Engine will accomplish this for a fraction of the cost and complexity.
Option D is incorrect: A materialized view could help to cache data, but it doesn't support complex analytical queries, such as window functions.
Your data engineering team is tasked with migrating on-prem Hadoop workloads to Google Cloud. They currently have 25 TB of data stored in AVRO format on hard disks in an on-prem cluster rack. What is the preferred architecture to move workloads to Google Cloud while minimizing rework?
Option A is not correct. Dataproc Serverless only works with Spark.
Option B is correct. Use Dataproc Standard to run the workloads and move the files to Cloud Storage.
Option C is incorrect. This could work, and in the end may be more sustainable, but in this instance the goal is to minimize rework and ensure an efficient migration.
Option D is incorrect. Dataproc gives you the option to create highly customized cluster configurations and will automatically handle all infrastructure needed to build the cluster.
A financial services firm is processing some sensitive data from a third-party vendor and now wants to bring that data into the BigQuery data warehouse to perform analytics on it. The vendor has exposed an API endpoint which will be used to gather the data. They want to ensure that the data are properly desensitized when analysts or other users query the data while still ensuring that management can view the original data if required. What is the most efficient solution they could use to accomplish this?
A is incorrect. This will permanently alter the data and prevent administrators from viewing the data.
B is incorrect. This could work, but it is inefficient and would raise other security concerns such as view editing.
C is incorrect. This could work, but it is inefficient.
D is correct. Use a Cloud Function to query the API, retrieve the data, and move it into BigQuery via the Storage Write API and a service account. Use Cloud Scheduler to activate the function on a cron schedule. This ensures positive control over the ephemeral data from the time it is queried to the time it is written into BigQuery, and it is an efficient use of resources because it only requires one copy of the data. Use Cloud DLP to check the table data for sensitive data automatically. Use the results of the Cloud DLP scan to create a masking taxonomy along with appropriate policy tags for each role. This ensures that each user sees the appropriate level of masking when querying the data.
You are working for an e-commerce company which has a legacy business operating on-prem and wants to migrate a total of 2 PB of data to GCS. They are looking for a high-security transfer. What is the best method for migrating this data to GCS?
A is correct. If your data is very large (1 PB or greater), or your connection to Google Cloud is slow, then you should consider using Transfer Appliance. Google will ship a storage device to your on-prem location where you load the data into physical drives, ship them back to Google, and then Google will load the data directly into Cloud Storage. Transfer Appliance is secured end-to-end by Google.
B is incorrect. The data is too large for Storage Transfer Service, also this is mostly used to transfer data from other cloud providers to GCS.
C is incorrect. This would be a secure solution, but the data are so large that it would take an extremely long time to complete.
D is incorrect. The data are so large that this would likely never be able to complete. Additionally this is not the most secure option.
An analytics team has asked for your help in designing an efficient data model for their data mart in BigQuery. Which of the below options would you recommend?
Choose all answers that apply
Option A is incorrect: Flattening and denormalizing data is correct, but you should aggregate the data into a nested field.
Option B is incorrect: This would apply to a materialized view, not a standard view.
Option C is incorrect: This is unnecessary. BigQuery is a columnar database which means that the data can be in the table but doesn't need to be queried if not selected.
Option D is correct: To be efficient and secure, teams should only query the data which is necessary to build the data mart. This keeps storage costs low and minimizes any privacy or security concerns.
Option E is correct: BigQuery can create highly efficient data structures known as nested and repeated fields. These fields can encapsulate complex objects and make queries much more efficient and ensure referential integrity.
Option F is correct: Use Dataplex to discover the tables that you need quickly and efficiently. Dataplex works by automating Data Catalog to tag BigQuery tables.
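A minimal sketch of the nested/repeated pattern from Option E: order line items are stored as an ARRAY of STRUCTs inside the orders table, so the mart avoids a join while keeping the parent-child relationship intact. Table and column names are assumptions.

```python
from google.cloud import bigquery

client = bigquery.Client()

# Build the mart table with a nested, repeated line_items field.
client.query("""
CREATE OR REPLACE TABLE `my_project.mart.orders` AS
SELECT
  o.order_id,
  o.customer_id,
  ARRAY_AGG(STRUCT(i.sku, i.quantity, i.unit_price)) AS line_items
FROM `my_project.raw.orders` o
JOIN `my_project.raw.order_items` i USING (order_id)
GROUP BY o.order_id, o.customer_id
""").result()

# Querying the nested field only scans the columns actually referenced.
rows = client.query("""
SELECT order_id, item.sku, item.quantity
FROM `my_project.mart.orders`, UNNEST(line_items) AS item
""").result()
```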
You need to connect to a remote Cloud SQL instance over the internet. What is the best method to use to quickly and easily connect to Cloud SQL while using native IAM roles?
A is incorrect. You can use TLS/SSL, but it would have to be configured and it is not the best option here.
B is incorrect. An SSH connection could work on its own, but it would not be the most efficient solution and does not natively integrate with Cloud IAM.
C is correct. You should use Cloud SQL Auth Proxy if you are connecting over the internet. This provides all the benefits of a dedicated SSH VPN connection with native integration with IAM roles and minimal set up.
D is incorrect. Dedicated Interconnect would not connect over the internet. You can use dedicated interconnect to connect to Cloud SQL using a private IP.
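A hedged sketch of the same idea in application code: the Cloud SQL Python Connector wraps the Auth Proxy's encrypted tunnel and IAM-based authorization. The instance connection name, IAM database user, and database are placeholder assumptions.

```python
from google.cloud.sql.connector import Connector

connector = Connector()

conn = connector.connect(
    "my-project:us-central1:orders-db",   # instance connection name
    "pg8000",                             # PostgreSQL driver
    user="analyst@my-project.iam",        # IAM database user
    db="orders",
    enable_iam_auth=True,                 # IAM credentials instead of a password
)

cur = conn.cursor()
cur.execute("SELECT 1")
print(cur.fetchone())

conn.close()
connector.close()
```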
You are working in a financial services firm conducting research into bond price prediction. The data science team has built a Spark ML bayesian prediction model to perform the predictions and now wishes to operationalize it. Due to the highly sensitive nature of the data and processing task the data science team wants to have the most secure architecture possible. The data are currently flowing into a secured GCS bucket. What would you recommend?
A is correct: This is a robust architecture which is almost impenetrable unless someone has direct access to the cluster. Sole tenancy means that you set up a Compute Engine instance on a server rack hosting only your instance, which is the most secure setup you can have on GCP. You should also use Shielded VMs to ensure the security of your operating system. Using a dedicated service account means that you can isolate which principals can interact with the data and Dataproc. You can use audit logs to monitor who accesses the cluster or the bucket data, and you can set up custom Cloud Logging events to alert you of any unauthorized access.
B is incorrect: Dataproc Serverless is not the best choice here because it is not the most secure version of Dataproc and would be insufficient.
C is incorrect: Dataproc with standard VMs only would not be secure enough to satisfy the requirements.
D is incorrect: Dataproc on GKE only would not be secure enough to satisfy the requirements.