A Databricks job has been configured with 3 tasks, each of which is a Databricks notebook. Task A does not depend on other tasks. Tasks B and C run in parallel, with each having a serial dependency on Task A. If task A fails during a scheduled run, which statement describes the results of this run?
A. Because all tasks are managed as a dependency graph, no changes will be committed to the Lakehouse until all tasks have successfully been completed.
B. Tasks B and C will attempt to run as configured; any changes made in task A will be rolled back due to task failure.
C. Unless all tasks complete successfully, no changes will be committed to the Lakehouse; because task A failed, all commits will be rolled back automatically.
D. Tasks B and C will be skipped; some logic expressed in task A may have been committed before task failure.
E. Tasks B and C will be skipped; task A will not commit any changes because of stage failure.
Explanation:
The correct answer is D. When a Databricks job runs multiple tasks with dependencies, the tasks are
executed as a dependency graph. If a task fails, the downstream tasks that depend on it are
skipped and marked as Upstream failed. However, the failed task may have already
committed some changes to the Lakehouse before the failure occurred, and those changes
are not rolled back automatically. Therefore, the job run may result in a partial update of the
Lakehouse. The transactional writes feature of Delta Lake makes each individual write
atomic, but a transaction does not span multiple writes, tasks, or a whole job run, so it
cannot by itself undo work that task A already committed.
Separately, you can use the Run if condition to configure tasks to run even when some or
all of their dependencies have failed, allowing your job to recover from failures and
continue running. References:
transactional writes: https://docs.databricks.com/delta/deltaintro.html#transactional-writes
Run if: https://docs.databricks.com/en/workflows/jobs/conditional-tasks.html
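The behavior described above can be sketched with a toy model in plain Python. This is not the Databricks Jobs API: run_job, the lakehouse list, and the status strings are hypothetical stand-ins chosen for illustration. It shows only the two points made in the explanation: tasks downstream of a failure are skipped, and whatever the failed task wrote before failing stays in place.

```python
# Toy model of a job run with task dependencies (NOT the Databricks Jobs API).
# run_job, lakehouse, and the status strings are illustrative assumptions.

def run_job(tasks, depends_on):
    """Run tasks in the given order; dependencies must be listed first.

    tasks:      dict of task name -> zero-argument callable
    depends_on: dict of task name -> list of upstream task names
    """
    status = {}
    for name, body in tasks.items():
        upstream = depends_on.get(name, [])
        # A task is skipped as soon as any upstream task did not succeed.
        if any(status[u] != "SUCCESS" for u in upstream):
            status[name] = "UPSTREAM_FAILED"
            continue
        try:
            body()
            status[name] = "SUCCESS"
        except Exception:
            status[name] = "FAILED"
    return status


lakehouse = []  # stands in for writes that have already been committed


def task_a():
    lakehouse.append("A: first write committed")
    # The failure happens after the first write; nothing undoes that write.
    raise RuntimeError("task A fails after its first write")


def task_b():
    lakehouse.append("B: write")


def task_c():
    lakehouse.append("C: write")


status = run_job(
    {"A": task_a, "B": task_b, "C": task_c},
    {"B": ["A"], "C": ["A"]},
)
print(status)
print(lakehouse)
```

In this model B and C never run, while the first write from A remains in the list, which mirrors answer D.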
Review the following error traceback:
Which statement describes the error being raised?
A. The code executed was PySpark but was executed in a Scala notebook.
B. There is no column in the table named heartrateheartrateheartrate
C. There is a type error because a column object cannot be multiplied.
D. There is a type error because a DataFrame object cannot be multiplied.
E. There is a syntax error because the heartrate column is not correctly identified as a column.
Explanation:
The error being raised is an AnalysisException, which is a type of exception
that occurs when Spark SQL cannot analyze or execute a query due to some logical or
semantic error. In this case, the error message indicates that the query cannot resolve the
column name ‘heartrateheartrateheartrate’ given the input columns ‘heartrate’ and ‘age’.
This means that there is no column in the table named ‘heartrateheartrateheartrate’, and
the query is invalid. A possible cause of this error is a typo or a copy-paste mistake in the
query. To fix this error, the query should use a valid column name that exists in the table,
such as ‘heartrate’. References: AnalysisException
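The traceback itself is not reproduced here, so the exact cause is an assumption. One plausible way a name like heartrateheartrateheartrate can arise in Python is multiplying a column name held as a plain string, since Python repeats a string when it is multiplied by an integer. A minimal sketch in plain Python (no Spark needed):

```python
# Assumption: the column name was passed as a plain Python string.
column_name = "heartrate"

# Multiplying a str by an int repeats the string; it does no arithmetic.
selected = 3 * column_name
print(selected)

# In PySpark, arithmetic on the column values would instead go through a
# Column object, for example col("heartrate") * 3 with col imported from
# pyspark.sql.functions.
```

If the repeated string is then used as a column name, Spark looks for a column with that literal name and cannot resolve it.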
In order to facilitate near real-time workloads, a data engineer is creating a helper function to leverage the schema detection and evolution functionality of Databricks Auto Loader. The desired function will automatically detect the schema of the source directly, incrementally process JSON files as they arrive in a source directory, and automatically evolve the schema of the table when new fields are detected. The function is displayed below with a blank: Which response correctly fills in the blank to meet the specified requirements?
A. Option A
B. Option B
C. Option C
D. Option D
E. Option E
Explanation:
Option B correctly fills in the blank to meet the specified requirements. Option B uses the
“cloudFiles.schemaLocation” option, which is required for the schema detection and
evolution functionality of Databricks Auto Loader. Additionally, option B sets the
“mergeSchema” option on the write, which allows the schema of the target table to evolve
when new fields are detected. Finally, option B uses the “writeStream” method, which is required
for the incremental processing of JSON files as they arrive in a source directory. The other
options are incorrect because they either omit the required options, use the wrong method,
or use the wrong format.
References:
Configure schema inference and evolution in Auto Loader:
https://docs.databricks.com/en/ingestion/auto-loader/schema.html
Write streaming data: https://docs.databricks.com/spark/latest/structuredstreaming/writing-streaming-data.html
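The answer options are not reproduced here, so the following is only a sketch of a helper that is consistent with the explanation above, not the exact text of option B. The function name, its parameters, and the choice to pass the SparkSession in explicitly are assumptions made for illustration; the option keys are the ones named in the explanation.

```python
def auto_load_json(spark, source_path, checkpoint_path, target_table):
    """Incrementally ingest JSON files with Auto Loader (illustrative sketch).

    spark is an existing SparkSession; all parameter names are hypothetical.
    """
    return (
        spark.readStream
        .format("cloudFiles")                       # Auto Loader source
        .option("cloudFiles.format", "json")        # files arrive as JSON
        # Where Auto Loader stores the inferred schema and tracks changes.
        .option("cloudFiles.schemaLocation", checkpoint_path)
        .load(source_path)
        .writeStream
        .option("checkpointLocation", checkpoint_path)
        # Let the target table accept newly detected columns.
        .option("mergeSchema", "true")
        .toTable(target_table)
    )
```

A call might look like auto_load_json(spark, "/path/to/source", "/path/to/checkpoint", "target_table"), with the three paths and names supplied by the caller.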
A junior data engineer has been asked to develop a streaming data pipeline with a grouped
aggregation using DataFrame df. The pipeline needs to calculate the average humidity and
average temperature for each non-overlapping five-minute interval. Events are recorded
once per minute per device.
Streaming DataFrame df has the following schema:
"device_id INT, event_time TIMESTAMP, temp FLOAT, humidity FLOAT"
Code block:
Choose the response that correctly fills in the blank within the code block to complete this
task
A. to_interval("event_time", "5 minutes").alias("time")
B. window("event_time", "5 minutes").alias("time")
C. "event_time"
D. window("event_time", "10 minutes").alias("time")
E. lag("event_time", "10 minutes").alias("time")
Explanation:
Option B is the correct answer because the window function groups
streaming data into fixed time intervals. Called with only a time column and a window
duration, it produces tumbling (non-overlapping) windows; an optional slide duration
would make the windows overlap. In this case, the window duration is “5 minutes”, which means
each window covers one non-overlapping five-minute interval. The window function
returns a struct column with two fields, start and end, which hold the start and end
time of each window. The alias function is used to rename the struct column as “time”.
Verified References: [Databricks Certified Data Engineer Professional], under “Structured
Streaming” section; Databricks Documentation, under “WINDOW” section.
https://www.databricks.com/blog/2017/05/08/event-time-aggregation-watermarkingapache-sparks-structured-streaming.html
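To make the tumbling-window grouping concrete, here is a plain-Python sketch of the same bucketing logic. It is not Spark code: window_start and tumbling_avg are hypothetical helpers, and the sample events are invented. It assumes windows are aligned to the Unix epoch, which is how Spark aligns them when no start offset is given.

```python
from collections import defaultdict
from datetime import datetime, timedelta, timezone

EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)
WIDTH = timedelta(minutes=5)


def window_start(event_time, width=WIDTH):
    """Floor event_time to the start of its epoch-aligned window."""
    offset = (event_time - EPOCH) % width
    return event_time - offset


def tumbling_avg(events, width=WIDTH):
    """Average the values of (event_time, value) pairs per window.

    Returns a dict keyed by (start, end), mirroring the start/end struct.
    """
    buckets = defaultdict(list)
    for event_time, value in events:
        start = window_start(event_time, width)
        buckets[(start, start + width)].append(value)
    return {key: sum(vals) / len(vals) for key, vals in buckets.items()}


def at(hour, minute):
    """Build a UTC timestamp on an arbitrary sample day."""
    return datetime(2024, 1, 1, hour, minute, tzinfo=timezone.utc)


# Invented sample readings: three fall in 12:00-12:05, one in 12:05-12:10.
events = [(at(12, 0), 20.0), (at(12, 1), 22.0), (at(12, 4), 24.0), (at(12, 5), 30.0)]
averages = tumbling_avg(events)
for (start, end), value in sorted(averages.items()):
    print(start.time(), end.time(), value)
```

Each event lands in exactly one window because the window start is inclusive and the end is exclusive, so the reading at 12:05 belongs to the second window.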
A data ingestion task requires a one-TB JSON dataset to be written out to Parquet with a target part-file size of 512 MB. Because Parquet is being used instead of Delta Lake, built-in file-sizing features such as Auto-Optimize & Auto-Compaction cannot be used. Which strategy will yield the best performance without shuffling data?
A. Set spark.sql.files.maxPartitionBytes to 512 MB, ingest the data, execute the narrow transformations, and then write to parquet.
B. Set spark.sql.shuffle.partitions to 2,048 partitions (1TB*1024*1024/512), ingest the data, execute the narrow transformations, optimize the data by sorting it (which automatically repartitions the data), and then write to parquet.
C. Set spark.sql.adaptive.advisoryPartitionSizeInBytes to 512 MB, ingest the data, execute the narrow transformations, coalesce to 2,048 partitions (1TB*1024*1024/512), and then write to parquet.
D. Ingest the data, execute the narrow transformations, repartition to 2,048 partitions (1TB* 1024*1024/512), and then write to parquet.
E. Set spark.sql.shuffle.partitions to 512, ingest the data, execute the narrow transformations, and then write to parquet.
Explanation:
Option A is correct. The key to efficiently converting a large JSON dataset to Parquet files of a
specific size without shuffling data lies in controlling the size of the input partitions,
because each partition is written out as one part-file when no shuffle intervenes.
Setting spark.sql.files.maxPartitionBytes to 512 MB configures Spark to read the
source in chunks of at most 512 MB.
Narrow transformations (which do not move data across partitions) preserve that
partitioning, so writing the data out to Parquet yields roughly one part-file per
512 MB input partition. The match is approximate, since Parquet encoding and
compression usually make the output smaller than the JSON that was read.
Options B and D force a shuffle (a sort or a repartition). Option C relies on an
adaptive-execution setting that applies only to shuffle partitions, and its coalesce
merges partitions after the read rather than sizing them at the read. Option E sets
the shuffle partition count, which has no effect when no shuffle occurs.
References:
Apache Spark Documentation: Configuration - spark.sql.files.maxPartitionBytes
Databricks Documentation on Data Sources: Databricks Data Sources Guide
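The partition count quoted in the options (1TB*1024*1024/512) can be checked with a few lines of arithmetic. This is a sketch under the assumption that TB and MB mean binary units (TiB and MiB); expected_partitions is a hypothetical helper, and the real number of input partitions also depends on file boundaries and on whether the source files are splittable.

```python
TIB = 1024 ** 4  # bytes in one tebibyte (assumed meaning of "1 TB")
MIB = 1024 ** 2  # bytes in one mebibyte (assumed meaning of "MB")


def expected_partitions(total_bytes, max_partition_bytes):
    """Ceiling division: how many chunks of at most max_partition_bytes."""
    return -(-total_bytes // max_partition_bytes)


max_partition_bytes = 512 * MIB
n = expected_partitions(1 * TIB, max_partition_bytes)
print(n)

# With a SparkSession named spark, the read-side chunk size from option A
# could be set before ingesting, for example:
#   spark.conf.set("spark.sql.files.maxPartitionBytes", str(max_partition_bytes))
```

The result matches the 2,048 figure in options B, C, and D; option A reaches the same count at read time without a repartition.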