Databricks-Certified-Professional-Data-Engineer Practice Test



A Databricks job has been configured with 3 tasks, each of which is a Databricks notebook. Task A does not depend on other tasks. Tasks B and C run in parallel, with each having a serial dependency on Task A. If task A fails during a scheduled run, which statement describes the results of this run?


A. Because all tasks are managed as a dependency graph, no changes will be committed to the Lakehouse until all tasks have successfully been completed.


B. Tasks B and C will attempt to run as configured; any changes made in task A will be rolled back due to task failure.


C. Unless all tasks complete successfully, no changes will be committed to the Lakehouse; because task A failed, all commits will be rolled back automatically.


D. Tasks B and C will be skipped; some logic expressed in task A may have been committed before task failure.


E. Tasks B and C will be skipped; task A will not commit any changes because of stage failure.





D.
  Tasks B and C will be skipped; some logic expressed in task A may have been committed before task failure.

Explanation:

When a Databricks job runs multiple tasks with dependencies, the tasks are executed as a dependency graph. If a task fails, the downstream tasks that depend on it are skipped and marked as Upstream failed. However, the failed task may already have committed some changes to the Lakehouse before the failure occurred, and those changes are not rolled back automatically, so the job run may leave the Lakehouse partially updated. To avoid this, you can use the transactional writes feature of Delta Lake to ensure that changes are only committed when the entire job run succeeds. Alternatively, you can use a Run if condition to configure tasks to run even when some or all of their dependencies have failed, allowing your job to recover from failures and continue running.

References:

Transactional writes: https://docs.databricks.com/delta/deltaintro.html#transactional-writes

Run if: https://docs.databricks.com/en/workflows/jobs/conditional-tasks.html
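As a rough illustration of the dependency graph described above, the task section of a Jobs API 2.1 payload for this job might look like the following sketch. The task keys and notebook paths are hypothetical, and the commented run_if setting is only one possible condition.

# Hypothetical Jobs API 2.1 task definitions for the job described above.
# Task keys and notebook paths are illustrative, not taken from the exam item.
job_tasks = {
    "tasks": [
        {"task_key": "task_a",
         "notebook_task": {"notebook_path": "/Jobs/task_a"}},
        {"task_key": "task_b",
         "notebook_task": {"notebook_path": "/Jobs/task_b"},
         # B depends on A, so a failure in A marks B as "Upstream failed".
         "depends_on": [{"task_key": "task_a"}]},
        {"task_key": "task_c",
         "notebook_task": {"notebook_path": "/Jobs/task_c"},
         "depends_on": [{"task_key": "task_a"}],
         # A Run if condition such as "ALL_DONE" would let C run even if A fails;
         # the default requires all dependencies to succeed.
         # "run_if": "ALL_DONE",
        },
    ]
}

With the default settings, a failure in task_a causes task_b and task_c to be skipped, while any writes task_a completed before failing remain committed.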

Review the following error traceback:

Which statement describes the error being raised?


A. The code executed was PySpark but was executed in a Scala notebook.


B. There is no column in the table named heartrateheartrateheartrate


C. There is a type error because a column object cannot be multiplied.


D. There is a type error because a DataFrame object cannot be multiplied.


E. There is a syntax error because the heartrate column is not correctly identified as a column.





E.
  There is a syntax error because the heartrate column is not correctly identified as a column.

Explanation:

The error being raised is an AnalysisException, which is a type of exception that occurs when Spark SQL cannot analyze or execute a query due to some logical or semantic error. In this case, the error message indicates that the query cannot resolve the column name ‘heartrateheartrateheartrate’ given the input columns ‘heartrate’ and ‘age’. A name that is the string ‘heartrate’ repeated three times is what results from multiplying the column name as a plain Python string rather than referencing it as a column object, so the heartrate column is not correctly identified as a column in the query. To fix this error, the query should reference the column as a column expression, for example col("heartrate"), so that the multiplication applies to the column’s values rather than to its name.

References:

AnalysisException
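The traceback itself is not reproduced here, but under that assumption a sketch of how such an unresolved-column error can arise, and how to correct it, looks like this; the sample data is made up.

from pyspark.sql import SparkSession
from pyspark.sql.functions import col

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([(60, 34), (72, 41)], ["heartrate", "age"])

# Multiplying a plain Python string repeats it, so Spark is asked to resolve a
# column literally named "heartrateheartrateheartrate" and raises AnalysisException:
# df.select(3 * "heartrate")

# Referencing the column as a column object multiplies its values instead:
df.select((3 * col("heartrate")).alias("heartrate_x3")).show()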

In order to facilitate near real-time workloads, a data engineer is creating a helper function to leverage the schema detection and evolution functionality of Databricks Auto Loader. The desired function will automatically detect the schema of the source directory, incrementally process JSON files as they arrive in a source directory, and automatically evolve the schema of the table when new fields are detected. The function is displayed below with a blank: Which response correctly fills in the blank to meet the specified requirements?


A. Option A


B. Option B


C. Option C


D. Option D


E. Option E





B.
  Option B

Explanation:

Option B correctly fills in the blank to meet the specified requirements. Option B uses the “cloudFiles.schemaLocation” option, which is required for the schema detection and evolution functionality of Databricks Auto Loader. Additionally, option B uses the “mergeSchema” option, which is required for the schema evolution functionality of Databricks Auto Loader. Finally, option B uses the “writeStream” method, which is required for the incremental processing of JSON files as they arrive in a source directory. The other options are incorrect because they either omit the required options, use the wrong method, or use the wrong format.

References:

Configure schema inference and evolution in Auto Loader:

https://docs.databricks.com/en/ingestion/auto-loader/schema.html

Write streaming data: https://docs.databricks.com/spark/latest/structuredstreaming/writing-streaming-data.html
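Since the options themselves are not reproduced here, the following is a minimal sketch of the kind of helper function the explanation describes, assuming the ambient spark session of a Databricks notebook, a JSON source directory, and a Delta target table; the function name and parameters are illustrative.

# Minimal Auto Loader helper sketch (names and paths are illustrative).
def ingest_json_autoloader(source_dir, checkpoint_path, target_table):
    stream = (spark.readStream
        .format("cloudFiles")
        .option("cloudFiles.format", "json")
        # Required for Auto Loader schema inference and evolution.
        .option("cloudFiles.schemaLocation", checkpoint_path)
        .load(source_dir))
    return (stream.writeStream
        .option("checkpointLocation", checkpoint_path)
        # Allows the target table's schema to evolve when new fields appear.
        .option("mergeSchema", "true")
        .toTable(target_table))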

A junior data engineer has been asked to develop a streaming data pipeline with a grouped aggregation using DataFrame df. The pipeline needs to calculate the average humidity and average temperature for each non-overlapping five-minute interval. Events are recorded once per minute per device.

Streaming DataFrame df has the following schema: "device_id INT, event_time TIMESTAMP, temp FLOAT, humidity FLOAT" Code block:

Choose the response that correctly fills in the blank within the code block to complete this task


A. to_interval("event_time", "5 minutes").alias("time")


B. window("event_time", "5 minutes").alias("time")


C. "event_time"


D. window("event_time", "10 minutes").alias("time")


E. lag("event_time", "10 minutes").alias("time")





B.
  window("event_time", "5 minutes").alias("time")

Explanation:

This is the correct answer because the window function is used to group streaming data by time intervals. The window function takes a time column and a window duration (plus an optional slide duration and start time); the window duration specifies how long each window is. In this case, the window duration is “5 minutes”, so each window covers a non-overlapping five-minute interval. The window function returns a struct column with two fields, start and end, which represent the start and end time of each window, and the alias function renames that struct column as “time”.

Verified References: [Databricks Certified Data Engineer Professional], under “Structured Streaming” section; Databricks Documentation, under “WINDOW” section.

https://www.databricks.com/blog/2017/05/08/event-time-aggregation-watermarkingapache-sparks-structured-streaming.html
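The full code block is not reproduced here, but a minimal sketch of the completed aggregation, assuming the result is grouped per device as well as per five-minute window, might look like this.

from pyspark.sql.functions import avg, window

# df is the streaming DataFrame described in the question.
agg_df = (df
    .groupBy("device_id", window("event_time", "5 minutes").alias("time"))
    .agg(avg("humidity").alias("avg_humidity"),
         avg("temp").alias("avg_temp")))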

A data ingestion task requires a one-TB JSON dataset to be written out to Parquet with a target part-file size of 512 MB. Because Parquet is being used instead of Delta Lake, built-in file-sizing features such as Auto-Optimize & Auto-Compaction cannot be used. Which strategy will yield the best performance without shuffling data?


A. Set spark.sql.files.maxPartitionBytes to 512 MB, ingest the data, execute the narrow transformations, and then write to parquet.


B. Set spark.sql.shuffle.partitions to 2,048 partitions (1TB*1024*1024/512), ingest the data, execute the narrow transformations, optimize the data by sorting it (which automatically repartitions the data), and then write to parquet.


C. Set spark.sql.adaptive.advisoryPartitionSizeInBytes to 512 MB bytes, ingest the data, execute the narrow transformations, coalesce to 2,048 partitions (1TB*1024*1024/512), and then write to parquet.


D. Ingest the data, execute the narrow transformations, repartition to 2,048 partitions (1TB* 1024*1024/512), and then write to parquet.


E. Set spark.sql.shuffle.partitions to 512, ingest the data, execute the narrow transformations, and then write to parquet.





A.
  Set spark.sql.files.maxPartitionBytes to 512 MB, ingest the data, execute the narrow transformations, and then write to parquet.

Explanation:

The key to efficiently converting a large JSON dataset to Parquet files of a specific size without shuffling data lies in controlling the size of the output files directly. Setting spark.sql.files.maxPartitionBytes to 512 MB configures Spark to process data in chunks of 512 MB. This setting directly influences the size of the part-files in the output, aligning with the target file size.

Narrow transformations (which do not involve shuffling data across partitions) can then be applied to this data.

Writing the data out to Parquet will result in files that are approximately the size specified by spark.sql.files.maxPartitionBytes, in this case, 512 MB.

The other options involve unnecessary shuffles or repartitions (B, C, D) or an incorrect setting for this specific requirement (E).

References:

Apache Spark Documentation: Configuration - spark.sql.files.maxPartitionBytes

Databricks Documentation on Data Sources: Databricks Data Sources Guide
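A rough sketch of this approach, assuming the ambient spark session of a Databricks notebook, is shown below; the paths are placeholders.

# Cap the size of each input partition so output part-files land near 512 MB
# without any shuffle; paths are placeholders.
spark.conf.set("spark.sql.files.maxPartitionBytes", str(512 * 1024 * 1024))

df = spark.read.json("/mnt/raw/events/")   # ~1 TB of JSON
cleaned = df.select("*")                   # narrow transformations would go here
cleaned.write.mode("overwrite").parquet("/mnt/curated/events_parquet/")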

