Databricks-Certified-Professional-Data-Engineer Practice Test


A Delta Lake table in the Lakehouse named customer_churn_params is used in churn prediction by the machine learning team. The table contains information about customers derived from a number of upstream sources. Currently, the data engineering team populates this table nightly by overwriting it with the current valid values derived from upstream data sources.

Immediately after each update succeeds, the data engineering team would like to determine the difference between the new version and the previous version of the table. Given the current implementation, which method can be used?


A. Parse the Delta Lake transaction log to identify all newly written data files.


B. Execute DESCRIBE HISTORY customer_churn_params to obtain the full operation metrics for the update, including a log of all records that have been added or modified.


C. Execute a query to calculate the difference between the new version and the previous version using Delta Lake’s built-in versioning and time travel functionality.


D. Parse the Spark event logs to identify those rows that were updated, inserted, or deleted.





C.
  Execute a query to calculate the difference between the new version and the previous version using Delta Lake’s built-in versioning and time travel functionality.

Explanation:

Delta Lake provides built-in versioning and time travel capabilities, allowing users to query previous snapshots of a table. This feature is particularly useful for understanding changes between different versions of the table. In this scenario, where the table is overwritten nightly, you can use Delta Lake's time travel feature to execute a query comparing the latest version of the table (the current state) with its previous version. This approach identifies the differences (such as new, updated, or deleted records) between the two versions. The other options do not provide a straightforward or efficient way to directly compare different versions of a Delta Lake table.

References:

Delta Lake documentation on time travel

Delta Lake documentation on versioning
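For example, a minimal Spark SQL sketch of such a comparison, assuming (hypothetically) that the latest version of customer_churn_params is 10 and the previous version is 9; the actual version numbers can be read from DESCRIBE HISTORY customer_churn_params:

-- Records present in the new version but not in the previous one (inserted or changed)
SELECT * FROM customer_churn_params VERSION AS OF 10
EXCEPT
SELECT * FROM customer_churn_params VERSION AS OF 9;

-- Records present in the previous version but not in the new one (deleted or changed)
SELECT * FROM customer_churn_params VERSION AS OF 9
EXCEPT
SELECT * FROM customer_churn_params VERSION AS OF 10;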

The data architect has mandated that all tables in the Lakehouse should be configured as external Delta Lake tables. Which approach will ensure that this requirement is met?


A. Whenever a database is being created, make sure that the location keyword is used


B. When configuring an external data warehouse for all table storage, leverage Databricks for all ELT.


C. Whenever a table is being created, make sure that the location keyword is used.


D. When tables are created, make sure that the external keyword is used in the create table statement.


E. When the workspace is being configured, make sure that external cloud object storage has been mounted.





C.
  Whenever a table is being created, make sure that the location keyword is used.

Explanation:

This is the correct answer because it ensures the requirement is met: all tables in the Lakehouse should be configured as external Delta Lake tables. An external table is a table whose data is stored outside the default warehouse directory, so its data files are not managed by Databricks. An external table can be created by using the LOCATION keyword to specify the path to a directory in a cloud storage system, such as DBFS or S3. By creating external tables, the data engineering team avoids losing the underlying data if a table is dropped, and can leverage existing data without moving or copying it.

References:

Databricks Certified Data Engineer Professional exam guide, "Delta Lake" section

Databricks documentation, "Create an external table" section
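For example, a minimal sketch of creating an external Delta table with the LOCATION keyword; the table name, columns, and storage path below are illustrative placeholders, not part of the question:

-- External table: registered in the metastore, but the data lives at the supplied
-- path rather than in the default warehouse directory.
CREATE TABLE sales_records (
  record_id BIGINT,
  amount DOUBLE
)
USING DELTA
LOCATION 'dbfs:/mnt/lakehouse/sales_records';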

The data engineering team is configuring environments for development, testing, and production before beginning migration of a new data pipeline. The team requires extensive testing of both the code and the data resulting from code execution, and wants to develop and test against data as similar to production data as possible. A junior data engineer suggests that production data can be mounted to the development and testing environments, allowing pre-production code to execute against production data. Because all users have admin privileges in the development environment, the junior data engineer has offered to configure permissions and mount this data for the team. Which statement captures best practices for this situation?


A. Because access to production data will always be verified using passthrough credentials, it is safe to mount data to any Databricks development environment.


B. All development, testing, and production code and data should exist in a single unified workspace; creating separate environments for testing and development further reduces risks.


C. In environments where interactive code will be executed, production data should only be accessible with read permissions; creating isolated databases for each environment further reduces risks.


D. Because Delta Lake versions all data and supports time travel, it is not possible for user error or malicious actors to permanently delete production data; as such, it is generally safe to mount production data anywhere.





C.
  In environments where interactive code will be executed, production data should only be accessible with read permissions; creating isolated databases for each environment further reduces risks.

Explanation:

The best practice in such scenarios is to ensure that production data is handled securely and with proper access controls. By granting only read access to production data in development and testing environments, it mitigates the risk of unintended data modification. Additionally, maintaining isolated databases for different environments helps to avoid accidental impacts on production data and systems.
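For example, a minimal sketch of granting read-only access to a production database; the database and group names are illustrative, and the exact privilege syntax depends on whether legacy table access control or Unity Catalog is in use:

-- Developers can read production data but cannot modify it
GRANT USAGE ON DATABASE prod_db TO `developers`;
GRANT SELECT ON DATABASE prod_db TO `developers`;

-- Development work happens in an isolated database
CREATE DATABASE IF NOT EXISTS dev_db;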

References:

Databricks best practices for securing data: https://docs.databricks.com/security/index.html

Which of the following technologies can be used to identify key areas of text when parsing Spark Driver log4j output?


A. Regex


B. Julia


C. pyspark.ml.feature


D. Scala Datasets


E. C++





A.
  Regex

Explanation:

Regular expressions (regex) are a powerful way of matching patterns in text. They can be used to identify key areas of text when parsing Spark Driver log4j output, such as the log level, the timestamp, the thread name, the class name, the method name, and the message. Regex can be applied in various languages and frameworks, such as Scala, Python, Java, Spark SQL, and Databricks notebooks.
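For example, a minimal Spark SQL sketch that extracts the log level and logger name from raw driver log lines with regexp_extract; the log file path and the assumed log4j layout are illustrative:

-- Query the raw log file directly (one line per row, in a column named value)
-- and pull out key areas of each line with regular expressions.
SELECT
  regexp_extract(value, '(INFO|WARN|ERROR)', 1) AS log_level,
  regexp_extract(value, '(?:INFO|WARN|ERROR)\\s+([\\w\\.\\$]+):', 1) AS logger_name,
  value AS raw_line
FROM text.`dbfs:/cluster-logs/driver/log4j-active.log`
WHERE value RLIKE 'WARN|ERROR';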

References:

https://docs.databricks.com/notebooks/notebooks-use.html#use-regularexpressions

https://docs.databricks.com/spark/latest/spark-sql/udf-scala.html#using-regularexpressions-in-udfs

https://docs.databricks.com/spark/latest/sparkr/functions/regexp_extract.html

https://docs.databricks.com/spark/latest/sparkr/functions/regexp_replace.html

A data engineer is configuring a pipeline that will potentially see late-arriving, duplicate records. In addition to de-duplicating records within the batch, which of the following approaches allows the data engineer to deduplicate data against previously processed records as it is inserted into a Delta table?


A. Set the configuration delta.deduplicate = true.


B. VACUUM the Delta table after each batch completes.


C. Perform an insert-only merge with a matching condition on a unique key


D. Perform a full outer join on a unique key and overwrite existing data.


E. Rely on Delta Lake schema enforcement to prevent duplicate records.





C.
   Perform an insert-only merge with a matching condition on a unique key

Explanation:

To deduplicate data against previously processed records as it is inserted into a Delta table, you can use the merge operation with an insert-only clause. This allows you to insert new records that do not match any existing records based on a unique key, while ignoring duplicate records that match existing records. For example, you can use the following syntax:

MERGE INTO target_table
USING source_table
ON target_table.unique_key = source_table.unique_key
WHEN NOT MATCHED THEN INSERT *

This will insert only the records from the source table that have a unique key that is not present in the target table, and skip the records that have a matching key. This way, you can avoid inserting duplicate records into the Delta table.
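As a hedged variant that also de-duplicates records within the incoming batch itself (the other part of the question), the source can be reduced to distinct rows in a subquery before the insert-only merge; the table and column names are illustrative:

MERGE INTO target_table
USING (SELECT DISTINCT * FROM source_table) AS deduped_source
ON target_table.unique_key = deduped_source.unique_key
WHEN NOT MATCHED THEN INSERT *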

References:


https://docs.databricks.com/delta/delta-update.html#upsert-into-a-table-usingmerge

https://docs.databricks.com/delta/delta-update.html#insert-only-merge

