Databricks-Certified-Professional-Data-Engineer Practice Test



A Databricks job has been configured with 3 tasks, each of which is a Databricks notebook. Task A does not depend on other tasks. Tasks B and C run in parallel, with each having a serial dependency on task A. If tasks A and B complete successfully but task C fails during a scheduled run, which statement describes the resulting state?


A. All logic expressed in the notebook associated with tasks A and B will have been successfully completed; some operations in task C may have completed successfully.


B. All logic expressed in the notebook associated with tasks A and B will have been successfully completed; any changes made in task C will be rolled back due to task failure.


C. All logic expressed in the notebook associated with task A will have been successfully completed; tasks B and C will not commit any changes because of stage failure.


D. Because all tasks are managed as a dependency graph, no changes will be committed to the Lakehouse until all tasks have successfully completed.


E. Unless all tasks complete successfully, no changes will be committed to the Lakehouse; because task C failed, all commits will be rolled back automatically.





A.
  All logic expressed in the notebook associated with tasks A and B will have been successfully completed; some operations in task C may have completed successfully.

Explanation:

Databricks Jobs provide no transactional guarantees or automatic rollback across tasks. Each task runs its notebook independently, and any work a task commits (for example, completed Delta Lake write transactions) remains committed regardless of what happens in other tasks. Because tasks B and C each depend only on task A, both are launched once A succeeds, and B's successful completion is unaffected by C's failure. Within task C, any operations that finished before the failing statement are not undone, so some of its operations may have completed successfully. This rules out the options that describe rollbacks, or commits being withheld until the whole job succeeds.
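As an illustration only, a hypothetical Jobs API 2.1 payload for the dependency graph described in the question might look like the sketch below (cluster settings omitted, notebook paths invented):

# Hypothetical Jobs API 2.1 payload sketching the dependency graph:
# B and C each depend on A and run in parallel once A succeeds.
job_definition = {
    "name": "abc_pipeline",
    "tasks": [
        {"task_key": "task_a",
         "notebook_task": {"notebook_path": "/Jobs/task_a"}},
        {"task_key": "task_b",
         "depends_on": [{"task_key": "task_a"}],
         "notebook_task": {"notebook_path": "/Jobs/task_b"}},
        {"task_key": "task_c",
         "depends_on": [{"task_key": "task_a"}],
         "notebook_task": {"notebook_path": "/Jobs/task_c"}},
    ],
}
# If task_c raises an exception partway through its notebook, nothing
# already committed by task_a, task_b, or the earlier cells of task_c
# is rolled back; the run is simply marked as failed.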

Which statement describes the default execution mode for Databricks Auto Loader?


A. New files are identified by listing the input directory; new files are incrementally and idempotently loaded into the target Delta Lake table.


B. Cloud vendor-specific queue storage and notification services are configured to track newly arriving files; new files are incrementally and idempotently loaded into the target Delta Lake table.


C. A webhook triggers a Databricks job to run anytime new data arrives in a source directory; new data is automatically merged into target tables using rules inferred from the data.


D. New files are identified by listing the input directory; the target table is materialized by directly querying all valid files in the source directory.





A.
  New files are identified by listing the input directory; new files are incrementally and idempotently loaded into the target Delta Lake table.

Explanation:

Databricks Auto Loader simplifies and automates the process of loading data into Delta Lake. Its default execution mode identifies new files by listing the input directory, and it incrementally and idempotently loads those new files into the target Delta Lake table. This approach ensures that files are not missed and are processed exactly once, avoiding data duplication. The other options describe mechanisms or integrations that are not part of Auto Loader's default behavior.

References:

Databricks Auto Loader Documentation: Auto Loader Guide

Delta Lake and Auto Loader: Delta Lake Integration
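For illustration only, a minimal Auto Loader stream in its default directory-listing mode might look like the following sketch; the paths, file format, and target table name are placeholders rather than values taken from the question.

# Incrementally discover new files by listing the input directory
# (Auto Loader's default mode) and load them exactly once.
df = (spark.readStream
      .format("cloudFiles")
      .option("cloudFiles.format", "json")                            # placeholder format
      .option("cloudFiles.schemaLocation", "/mnt/landing/_schemas/orders")
      .load("/mnt/landing/orders"))                                    # placeholder path

(df.writeStream
   .option("checkpointLocation", "/mnt/landing/_checkpoints/orders")
   .trigger(availableNow=True)
   .table("bronze.orders"))                                            # placeholder target table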

Which statement characterizes the general programming model used by Spark Structured Streaming?


A. Structured Streaming leverages the parallel processing of GPUs to achieve highly parallel data throughput.


B. Structured Streaming is implemented as a messaging bus and is derived from Apache Kafka.


C. Structured Streaming uses specialized hardware and I/O streams to achieve subsecond latency for data transfer.


D. Structured Streaming models new data arriving in a data stream as new rows appended to an unbounded table.


E. Structured Streaming relies on a distributed network of nodes that hold incremental state values for cached stages.





D.
  Structured Streaming models new data arriving in a data stream as new rows appended to an unbounded table.

Explanation:

This is the correct answer because it characterizes the general programming model used by Spark Structured Streaming, which is to treat a live data stream as a table that is being continuously appended. This leads to a new stream processing model that is very similar to a batch processing model, where users can express their streaming computation using the same Dataset/DataFrame API as they would use for static data. The Spark SQL engine will take care of running the streaming query incrementally and continuously and updating the final result as streaming data continues to arrive.

Verified References: [Databricks Certified Data Engineer Professional], under “Structured Streaming” section; Databricks Documentation, under “Overview” section.
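A minimal sketch of this model, using the built-in rate source purely for illustration: the same DataFrame API used for batch queries defines the streaming computation, and Spark runs it incrementally as new rows are appended to the conceptually unbounded input table.

from pyspark.sql.functions import col

# The rate source continuously appends (timestamp, value) rows to the input "table".
events = (spark.readStream
          .format("rate")
          .option("rowsPerSecond", 10)
          .load())

# Batch-style aggregation; Spark maintains the result incrementally as data arrives.
counts = events.groupBy((col("value") % 10).alias("bucket")).count()

query = (counts.writeStream
         .outputMode("complete")
         .format("memory")            # in-memory sink, for demonstration only
         .queryName("bucket_counts")
         .start())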

A Delta Lake table was created with the below query. Realizing that the original query had a typographical error, the below code was executed:

ALTER TABLE prod.sales_by_stor RENAME TO prod.sales_by_store

Which result will occur after running the second command?


A. The table reference in the metastore is updated and no data is changed.


B. The table name change is recorded in the Delta transaction log.


C. All related files and metadata are dropped and recreated in a single ACID transaction.


D. The table reference in the metastore is updated and all data files are moved.


E. A new Delta transaction log is created for the renamed table.





A.
  The table reference in the metastore is updated and no data is changed.

Explanation:

The original query uses the CREATE TABLE ... USING DELTA syntax with the LOCATION keyword, pointing at /mnt/finance_eda_bucket/tx_sales.parquet in DBFS. Supplying LOCATION creates an external table: a table whose data files are stored outside the default warehouse directory and are not managed by Databricks, while its metadata (schema, location, properties, partitions) is still registered in the metastore. An external table can be created over an existing directory in a cloud storage system, such as DBFS or S3, that contains data files in a supported format such as Parquet or CSV.

The result of running the second command is that the table reference in the metastore is updated and no data is changed. The metastore stores table metadata so that users can query tables with SQL commands or Spark APIs without knowing their physical location or format. When an external table is renamed with ALTER TABLE ... RENAME TO, only its entry in the metastore receives the new name; no data files or directories are moved or changed in the storage system, and the table continues to point to the same location and use the same format as before. Renaming a managed table, whose metadata and data are both managed by Databricks, would additionally move and rename the data files in the default warehouse directory.

Verified References: [Databricks Certified Data Engineer Professional], under “Delta Lake” section; Databricks Documentation, under “ALTER TABLE RENAME TO” section; Databricks Documentation, under “Metastore” section; Databricks Documentation, under “Managed and external tables” section.
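As a sketch of the scenario (the original CREATE statement is not shown in this extract, so the path below simply mirrors the one mentioned in the explanation and should be treated as an assumption), the rename touches only the metastore entry:

# Assumed re-creation of the mistyped external table; path is illustrative.
spark.sql("""
    CREATE TABLE prod.sales_by_stor
    USING DELTA
    LOCATION '/mnt/finance_eda_bucket/tx_sales.parquet'
""")

# Updates only the metastore reference; no files are moved or rewritten.
spark.sql("ALTER TABLE prod.sales_by_stor RENAME TO prod.sales_by_store")

# The renamed table still points at the same storage location.
spark.sql("DESCRIBE EXTENDED prod.sales_by_store").show(truncate=False)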

Which statement describes the correct use of pyspark.sql.functions.broadcast?


A. It marks a column as having low enough cardinality to properly map distinct values to available partitions, allowing a broadcast join.


B. It marks a column as small enough to store in memory on all executors, allowing a broadcast join.


C. It caches a copy of the indicated table on attached storage volumes for all active clusters within a Databricks workspace.


D. It marks a DataFrame as small enough to store in memory on all executors, allowing a broadcast join.


E. It caches a copy of the indicated table on all nodes in the cluster for use in all future queries during the cluster lifetime.





D.
   It marks a DataFrame as small enough to store in memory on all executors, allowing a broadcast join.

Explanation:

https://spark.apache.org/docs/3.1.3/api/python/reference/api/pyspark.sql.functions.broadcast.html

The broadcast function in PySpark is used in the context of joins. When you mark a DataFrame with broadcast, Spark tries to send this DataFrame to all worker nodes so that it can be joined with another DataFrame without shuffling the larger DataFrame across the nodes. This is particularly beneficial when the DataFrame is small enough to fit into the memory of each node. It helps to optimize the join process by reducing the amount of data that needs to be shuffled across the cluster, which can be a very expensive operation in terms of computation and time.

The pyspark.sql.functions.broadcast function in PySpark is used to hint to Spark that a DataFrame is small enough to be broadcast to all worker nodes in the cluster. When this hint is applied, Spark can perform a broadcast join, where the smaller DataFrame is sent to each executor only once and joined with the larger DataFrame on each executor. This can significantly reduce the amount of data shuffled across the network and can improve the performance of the join operation.

In a broadcast join, the entire smaller DataFrame is sent to each executor, not just a specific column or a cached version on attached storage. This function is particularly useful when one of the DataFrames in a join operation is much smaller than the other, and can fit comfortably in the memory of each executor node.
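A minimal usage sketch (table and column names are invented for illustration):

from pyspark.sql.functions import broadcast

orders = spark.table("prod.orders")        # large fact table (illustrative name)
stores = spark.table("prod.dim_stores")    # small dimension table (illustrative name)

# Hint that `stores` fits in executor memory; Spark ships a copy to every
# executor and avoids shuffling the much larger `orders` DataFrame.
joined = orders.join(broadcast(stores), on="store_id", how="left")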

References:

Databricks Documentation on Broadcast Joins: Databricks Broadcast Join Guide

PySpark API Reference: pyspark.sql.functions.broadcast

