A Delta Lake table in the Lakehouse named customer_churn_params is used in churn prediction
by the machine learning team. The table contains information about customers derived
from a number of upstream sources. Currently, the data engineering team populates this
table nightly by overwriting the table with the current valid values derived from upstream
data sources.
Immediately after each update succeeds, the data engineering team would like to determine
the difference between the new version and the previous version of the table.
Given the current implementation, which method can be used?
A. Parse the Delta Lake transaction log to identify all newly written data files.
B. Execute DESCRIBE HISTORY customer_churn_params to obtain the full operation metrics for the update, including a log of all records that have been added or modified.
C. Execute a query to calculate the difference between the new version and the previous version using Delta Lake’s built-in versioning and time travel functionality.
D. Parse the Spark event logs to identify those rows that were updated, inserted, or deleted.
Explanation:
Delta Lake provides built-in versioning and time travel, allowing users to query
previous snapshots of a table. In this scenario, where the table is overwritten nightly,
a query can compare the latest version of the table (the current state) with its previous
version and return the differences between them (new, updated, or deleted records), which
is option C. The other options do not compare versions directly: the transaction log lists
data files rather than records, and after a full overwrite every file is newly written;
DESCRIBE HISTORY reports operation metrics such as row and file counts, not a log of
individual records; and Spark event logs do not record row-level changes.
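As a rough sketch of what such a comparison could look like, the Python helper below only builds the SQL text, so it runs without a cluster. The helper name version_diff_sql and the version numbers 12 and 11 are hypothetical; the real version numbers would come from DESCRIBE HISTORY, and the resulting string would be passed to spark.sql(...) in a Spark session with Delta Lake available.

```python
def version_diff_sql(table: str, left_version: int, right_version: int) -> str:
    """Build a query for rows present in left_version but absent from right_version."""
    return (
        f"SELECT * FROM {table} VERSION AS OF {left_version}\n"
        f"EXCEPT\n"
        f"SELECT * FROM {table} VERSION AS OF {right_version}"
    )


# Hypothetical version numbers; look up the real ones with DESCRIBE HISTORY.
added_or_changed = version_diff_sql("customer_churn_params", 12, 11)
removed_or_changed = version_diff_sql("customer_churn_params", 11, 12)
print(added_or_changed)
```

EXCEPT keeps rows from the first snapshot that do not appear in the second, so the two calls together cover rows that were added, removed, or changed between the versions.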
References:
Delta Lake Documentation on Time Travel: Delta Time Travel
Delta Lake Versioning: Delta Lake Versioning Guide
The data architect has mandated that all tables in the Lakehouse should be configured as external Delta Lake tables. Which approach will ensure that this requirement is met?
A. Whenever a database is being created, make sure that the location keyword is used.
B. When configuring an external data warehouse for all table storage, leverage Databricks for all ELT.
C. Whenever a table is being created, make sure that the location keyword is used.
D. When tables are created, make sure that the external keyword is used in the create table statement.
E. When the workspace is being configured, make sure that external cloud object storage has been mounted.
Explanation:
Option C is correct. The requirement is that all tables in the Lakehouse be configured as
external Delta Lake tables. An external (unmanaged) table stores its data files at a path
you specify, outside the default warehouse directory; the metastore registers the table's
metadata, but Databricks does not manage the underlying files. An external table is created
by using the LOCATION keyword in the CREATE TABLE statement to point to a directory in
storage, such as a DBFS path or an S3 bucket. Because dropping an external table removes
only its metadata, the data files survive a DROP TABLE, and existing data can be registered
as a table without moving or copying it. Verified References: [Databricks Certified Data
Engineer Professional], under “Delta Lake” section; Databricks Documentation, under “Create
an external table” section.
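To make the role of the location keyword concrete, here is a minimal sketch that assembles a CREATE TABLE statement with a LOCATION clause. The helper name external_table_ddl, the table name, the columns, and the S3 path are all made up for illustration; the generated statement would be run with spark.sql(...) or in a SQL cell.

```python
from typing import Mapping


def external_table_ddl(table: str, columns: Mapping[str, str], location: str) -> str:
    """Build a CREATE TABLE statement whose LOCATION clause makes the table external."""
    col_defs = ", ".join(f"{name} {dtype}" for name, dtype in columns.items())
    return (
        f"CREATE TABLE {table} ({col_defs})\n"
        f"USING DELTA\n"
        f"LOCATION '{location}'"
    )


ddl = external_table_ddl(
    "sales_bronze",                              # hypothetical table name
    {"id": "BIGINT", "amount": "DOUBLE"},        # hypothetical schema
    "s3://example-bucket/tables/sales_bronze",   # hypothetical storage path
)
print(ddl)
```

Leaving out the LOCATION line would produce a managed table in the default warehouse directory, which is what the architect's mandate rules out.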
The data engineering team is configuring environments for development, testing, and production before beginning migration of a new data pipeline. The team requires extensive testing of both the code and the data resulting from code execution, and wants to develop and test against data that is as similar to production data as possible. A junior data engineer suggests that production data can be mounted to the development and testing environments, allowing pre-production code to execute against production data. Because all users have Admin privileges in the development environment, the junior data engineer has offered to configure permissions and mount this data for the team. Which statement captures best practices for this situation?
A. Because access to production data will always be verified using passthrough credentials, it is safe to mount data to any Databricks development environment.
B. All development, testing, and production code and data should exist in a single unified workspace; creating separate environments for testing and development further reduces risks.
C. In environments where interactive code will be executed, production data should only be accessible with read permissions; creating isolated databases for each environment further reduces risks.
D. Because Delta Lake versions all data and supports time travel, it is not possible for user error or malicious actors to permanently delete production data; as such, it is generally safe to mount production data anywhere.
Explanation:
The best practice in such scenarios is to handle production data securely and
with proper access controls. Granting only read access to production data in
development and testing environments mitigates the risk of unintended data
modification. Maintaining isolated databases for each environment additionally
helps avoid accidental impacts on production data and systems.
References:
Databricks best practices for securing data:
https://docs.databricks.com/security/index.html
Which of the following technologies can be used to identify key areas of text when parsing Spark Driver log4j output?
A. Regex
B. Julia
C. pyspark.ml.feature
D. Scala Datasets
E. C++
Explanation:
Regex, or regular expressions, are a powerful way of matching patterns in
text. They can be used to identify key areas of text when parsing Spark Driver log4j output,
such as the log level, the timestamp, the thread name, the class name, the method name,
and the message. Regex can be applied in various languages and frameworks, such as
Scala, Python, Java, Spark SQL, and Databricks notebooks.
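A small sketch of this idea in Python follows. It assumes a line layout of timestamp, level, logger name, and message, similar to a common Spark console pattern; real driver logs may use a different log4j layout, and the sample lines below are invented for illustration.

```python
import re

# Named groups pick out the key areas of each log line.
LOG_LINE = re.compile(
    r"^(?P<timestamp>\d{2}/\d{2}/\d{2} \d{2}:\d{2}:\d{2}) "
    r"(?P<level>[A-Z]+) "
    r"(?P<logger>[\w.$]+): "
    r"(?P<message>.*)$"
)

# Invented sample lines; the layout is an assumption about the log4j pattern.
sample_lines = [
    "23/05/01 12:34:56 INFO SparkContext: Running Spark version 3.4.0",
    "23/05/01 12:35:02 WARN TaskSetManager: Lost task 0.0 in stage 1.0",
    "    at org.example.Foo.bar(Foo.scala:42)",  # stack-trace line, no match
]


def parse_line(line):
    """Return a dict of the named groups, or None if the line does not match."""
    match = LOG_LINE.match(line)
    return match.groupdict() if match else None


parsed = [parse_line(line) for line in sample_lines]
warnings = [p for p in parsed if p and p["level"] == "WARN"]
print(warnings)
```

Lines that do not fit the pattern, such as stack-trace continuations, come back as None and can be attached to the preceding entry or skipped.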
References:
https://docs.databricks.com/notebooks/notebooks-use.html#use-regularexpressions
https://docs.databricks.com/spark/latest/spark-sql/udf-scala.html#using-regularexpressions-in-udfs
https://docs.databricks.com/spark/latest/sparkr/functions/regexp_extract.html
https://docs.databricks.com/spark/latest/sparkr/functions/regexp_replace.html
A data engineer is configuring a pipeline that will potentially see late-arriving, duplicate records. In addition to de-duplicating records within the batch, which of the following approaches allows the data engineer to deduplicate data against previously processed records as it is inserted into a Delta table?
A. Set the configuration delta.deduplicate = true.
B. VACUUM the Delta table after each batch completes.
C. Perform an insert-only merge with a matching condition on a unique key.
D. Perform a full outer join on a unique key and overwrite existing data.
E. Rely on Delta Lake schema enforcement to prevent duplicate records.
Explanation:
To deduplicate data against previously processed records as it is inserted
into a Delta table, you can use the merge operation with an insert-only clause. This allows
you to insert new records that do not match any existing records based on a unique key,
while ignoring duplicate records that match existing records. For example, you can use the
following syntax:
MERGE INTO target_table
USING source_table
ON target_table.unique_key = source_table.unique_key
WHEN NOT MATCHED THEN INSERT *
This will insert only the records from the source table that have a unique key that is not
present in the target table, and skip the records that have a matching key. This way, you
can avoid inserting duplicate records into the Delta table.
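The semantics can be modelled in plain Python, without Spark, to show why the batch is de-duplicated first. This is only an illustrative model, not the Delta Lake implementation: the function names and sample records are hypothetical, and rows are represented as dictionaries.

```python
def dedupe_batch(batch, key):
    """Keep the first record seen for each key within the batch."""
    seen = {}
    for record in batch:
        seen.setdefault(record[key], record)
    return list(seen.values())


def insert_only_merge(target, batch, key):
    """Model of WHEN NOT MATCHED THEN INSERT *: matched rows are left untouched."""
    existing_keys = {row[key] for row in target}  # keys in the target before the merge
    new_rows = [row for row in batch if row[key] not in existing_keys]
    return target + new_rows


target = [{"unique_key": 1, "value": "a"}, {"unique_key": 2, "value": "b"}]
batch = [
    {"unique_key": 2, "value": "b-late"},  # already in target: skipped
    {"unique_key": 3, "value": "c"},
    {"unique_key": 3, "value": "c-dup"},   # duplicate within the batch
]

result = insert_only_merge(target, dedupe_batch(batch, "unique_key"), "unique_key")
without_batch_dedupe = insert_only_merge(target, batch, "unique_key")
print(len(result), len(without_batch_dedupe))
```

The merge condition is evaluated against the target as it was before the merge, so two source rows sharing a new key are both unmatched and both inserted. That is why the in-batch de-duplication mentioned in the question is still needed alongside the insert-only merge.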
References:
https://docs.databricks.com/delta/delta-update.html#upsert-into-a-table-usingmerge
https://docs.databricks.com/delta/delta-update.html#insert-only-merge