Databricks-Certified-Professional-Data-Engineer Practice Test


Page 7 out of 22 Pages

Incorporating unit tests into a PySpark application requires upfront attention to the design of your jobs, or a potentially significant refactoring of existing code. Which statement describes a main benefit that offsets this additional effort?


A. Improves the quality of your data


B. Validates a complete use case of your application


C. Troubleshooting is easier since all steps are isolated and tested individually


D. Yields faster deployment and execution times


E. Ensures that all steps interact correctly to achieve the desired end result





C.
  Troubleshooting is easier since all steps are isolated and tested individually
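
A minimal sketch of what designing a job for unit testing can look like. Plain Python lists of dicts stand in for DataFrames so the example runs without a Spark session, and the function names `drop_null_ids` and `add_total` are hypothetical. In a PySpark job, each function would instead take and return a DataFrame.

```python
# Each transformation step is a small, pure function, so it can be tested alone.
# Lists of dicts stand in for DataFrames here (an assumption for illustration).

def drop_null_ids(rows):
    """Keep only rows whose 'id' is present."""
    return [r for r in rows if r.get("id") is not None]

def add_total(rows):
    """Add a 'total' field computed as price * quantity."""
    return [{**r, "total": r["price"] * r["quantity"]} for r in rows]

def test_drop_null_ids():
    rows = [{"id": 1}, {"id": None}]
    assert drop_null_ids(rows) == [{"id": 1}]

def test_add_total():
    rows = [{"id": 1, "price": 2.5, "quantity": 4}]
    assert add_total(rows)[0]["total"] == 10.0

test_drop_null_ids()
test_add_total()
```

When one of these tests fails, the failure points at a single step rather than at the pipeline as a whole.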

A data engineer is testing a collection of mathematical functions, one of which calculates the area under a curve as described by another function. Which kind of test does this exemplify?


A. Integration


B. Unit


C. Manual


D. Functional





B.
  Unit

Explanation:

A unit test is designed to verify the correctness of a small, isolated piece of code, typically a single function. Testing a mathematical function that calculates the area under a curve is an example of a unit test because it is testing a specific, individual function to ensure it operates as expected.

References: Software Testing Fundamentals: Unit Testing
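
As a rough illustration of such a test, the sketch below checks one function in isolation. The `integrate` function and its trapezoidal-rule implementation are assumptions made for this example, not part of the original question.

```python
import math

def integrate(f, a, b, steps=10_000):
    """Approximate the area under f between a and b with the trapezoidal rule."""
    width = (b - a) / steps
    total = 0.5 * (f(a) + f(b))
    for i in range(1, steps):
        total += f(a + i * width)
    return total * width

def test_integrate_square():
    # The exact area under x^2 from 0 to 3 is 9.
    assert math.isclose(integrate(lambda x: x * x, 0, 3), 9.0, rel_tol=1e-6)

test_integrate_square()
```

The test exercises a single function with a known input and expected result, without involving any other part of the system.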

A junior data engineer is migrating a workload from a relational database system to the Databricks Lakehouse. The source system uses a star schema, leveraging foreign key constraints and multi-table inserts to validate records on write. Which consideration will impact the decisions made by the engineer while migrating this workload?


A. All Delta Lake transactions are ACID compliant against a single table, and Databricks does not enforce foreign key constraints.


B. Databricks only allows foreign key constraints on hashed identifiers, which avoid collisions in highly-parallel writes.


C. Foreign keys must reference a primary key field; multi-table inserts must leverage Delta Lake's upsert functionality.


D. Committing to multiple tables simultaneously requires taking out multiple table locks and can lead to a state of deadlock.





A.
  All Delta Lake transactions are ACID compliant against a single table, and Databricks does not enforce foreign key constraints.

Explanation:

In Databricks and Delta Lake, transactions are ACID compliant, but only within a single table. Delta Lake also does not enforce foreign key constraints, which relational database systems rely on to maintain referential integrity between tables. When migrating a workload from a relational database system to the Databricks Lakehouse, the engineer therefore has to reconsider how to preserve the integrity and relationships that the source system validated on write. Consistency across tables must be managed at the application level or through careful design of the ETL processes.

References:

Databricks Documentation on Delta Lake: Delta Lake Guide

Databricks Documentation on ACID Transactions in Delta Lake: ACID Transactions in Delta Lake
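
One way to manage referential integrity at the application level is to check for orphaned keys before or after a write. The sketch below uses plain Python lists of dicts in place of tables; the `find_orphans` helper and the sample `customers` and `orders` data are hypothetical.

```python
def find_orphans(fact_rows, dim_rows, fk, pk):
    """Return fact rows whose foreign key has no matching primary key in the dimension."""
    known_keys = {row[pk] for row in dim_rows}
    return [row for row in fact_rows if row[fk] not in known_keys]

# Hypothetical sample data standing in for a dimension and a fact table.
customers = [{"customer_id": 1}, {"customer_id": 2}]
orders = [
    {"order_id": 10, "customer_id": 1},
    {"order_id": 11, "customer_id": 3},  # references a customer that does not exist
]

orphans = find_orphans(orders, customers, fk="customer_id", pk="customer_id")
print(orphans)
```

In PySpark, the same check is commonly written as a left anti join from the fact table to the dimension table.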

In order to prevent accidental commits to production data, a senior data engineer has instituted a policy that all development work will reference clones of Delta Lake tables. After testing both deep and shallow clone, development tables are created using shallow clone. A few weeks after initial table creation, the cloned versions of several tables implemented as Type 1 Slowly Changing Dimension (SCD) stop working. The transaction logs for the source tables show that vacuum was run the day before. Why are the cloned tables no longer working?


A. The data files compacted by vacuum are not tracked by the cloned metadata; running refresh on the cloned table will pull in recent changes.


B. Because Type 1 changes overwrite existing records, Delta Lake cannot guarantee data consistency for cloned tables.


C. The metadata created by the clone operation is referencing data files that were purged as invalid by the vacuum command.


D. Running vacuum automatically invalidates any shallow clones of a table; deep clone should always be used when a cloned table will be repeatedly queried.





C.
  The metadata created by the clone operation is referencing data files that were purged as invalid by the vacuum command.

Explanation:

In Delta Lake, a shallow clone creates a new table by copying the metadata of the source table without duplicating the data files. Because Type 1 SCD updates overwrite existing records, they rewrite data files, and the files they replace are no longer referenced by the current version of the source table. When the vacuum command is run on the source table, it deletes such unreferenced files once they are older than the retention threshold, including files that the shallow clone's metadata still points to. The cloned tables then reference data files that no longer exist, which causes them to stop working. This highlights the dependency of shallow clones on the source table's data files and the impact of data management operations like vacuum on these clones.

References: Databricks documentation on Delta Lake, particularly the sections on cloning tables (shallow and deep cloning) and data retention with the vacuum command (https://docs.databricks.com/delta/index.html).
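
The failure mode can be sketched with a toy model in which sets of file names stand in for cloud storage and table metadata. This is not Delta Lake code; the file names and the simplified `vacuum` function (which ignores the retention threshold) are assumptions for illustration.

```python
# Toy model only: sets of file names stand in for storage and table metadata.
storage = {"part-0.parquet", "part-1.parquet"}

source_files = {"part-0.parquet", "part-1.parquet"}  # files the source table references
clone_files = set(source_files)  # shallow clone: copies the references, not the data

# A Type 1 update rewrites part-1, so the source now points at a new file.
storage.add("part-2.parquet")
source_files = {"part-0.parquet", "part-2.parquet"}

def vacuum(stored, referenced):
    """Keep only the stored files that the given table version still references.
    (The retention threshold is ignored in this toy model.)"""
    return stored & referenced

storage = vacuum(storage, source_files)

# Files the clone still points to that are no longer in storage.
missing = clone_files - storage
print(missing)
```

The source table remains fully readable after the vacuum, while the clone is left pointing at a file that has been deleted.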

The Databricks CLI is used to trigger a run of an existing job by passing the job_id parameter. The response indicating that the job run request has been submitted successfully includes a field run_id. Which statement describes what the number alongside this field represents?


A. The job_id is returned in this field.


B. The job_id and number of times the job has been run are concatenated and returned.


C. The number of times the job definition has been run in the workspace.


D. The globally unique ID of the newly triggered run.





D.
  The globally unique ID of the newly triggered run.

Explanation:

When triggering a job run using the Databricks CLI, the run_id field in the response represents a globally unique identifier for that particular run of the job. This run_id is distinct from the job_id. While the job_id identifies the job definition and is constant across all runs of that job, the run_id is unique to each execution and is used to track and query the status of that specific job run within the Databricks environment. This distinction allows users to manage and reference individual executions of a job directly.
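
A small sketch of how a caller might read the run_id from such a response and keep it separate from the job_id. The response body and both numeric IDs below are made up for illustration, and only the `run_id` field mentioned in the question is assumed to be present.

```python
import json

# Hypothetical response body; the numeric value is made up for illustration.
response_text = '{"run_id": 455644833}'

job_id = 123  # hypothetical ID of the job definition passed in the request
run_id = json.loads(response_text)["run_id"]

# One job_id can map to many run_ids, one per triggered run.
runs_by_job = {job_id: [run_id]}
print(runs_by_job)
```

Each later trigger of the same job would append a new run_id to the list, while the job_id key stays the same.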

