Databricks-Machine-Learning-Associate Practice Test



Which of the following machine learning algorithms typically uses bagging?


A. Gradient boosted trees


B. K-means


C. Random forest


D. Decision tree





C. Random forest

Explanation:

Random Forest is a machine learning algorithm that typically uses bagging (bootstrap aggregating). Bagging trains multiple base models (such as decision trees) on different subsets of the data, each created by randomly sampling with replacement from the original dataset, and then combines their predictions to improve overall model performance. The Random Forest algorithm builds many decision trees in this way and merges their predictions to produce a more accurate and stable result.
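
As a brief illustration, here is a minimal Spark ML sketch (the toy DataFrame and column names are invented for this example) that trains a Random Forest; the bagging happens internally, with each tree fit on a bootstrap sample of the rows:

from pyspark.sql import SparkSession
from pyspark.ml.classification import RandomForestClassifier
from pyspark.ml.feature import VectorAssembler

spark = SparkSession.builder.getOrCreate()

# Tiny illustrative data set; a real training set would be far larger
raw_df = spark.createDataFrame(
    [(1.0, 2.0, 0.0), (2.0, 1.0, 1.0), (3.0, 4.0, 0.0), (4.0, 3.0, 1.0)],
    ["f1", "f2", "label"],
)
train_df = VectorAssembler(inputCols=["f1", "f2"], outputCol="features").transform(raw_df)

# Each tree is trained on a bootstrap sample of the rows (bagging) and
# considers a random subset of features at each split
rf = RandomForestClassifier(featuresCol="features", labelCol="label", numTrees=100)
model = rf.fit(train_df)
model.transform(train_df).select("label", "prediction").show()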

References:

Databricks documentation on Random Forest: Random Forest in Spark ML

A data scientist has developed a machine learning pipeline with a static input data set using Spark ML, but the pipeline is taking too long to process. They increase the number of workers in the cluster to get the pipeline to run more efficiently. They notice that the number of rows in the training set after reconfiguring the cluster is different from the number of rows in the training set prior to reconfiguring the cluster. Which of the following approaches will guarantee a reproducible training and test set for each model?


A. Manually configure the cluster


B. Write out the split data sets to persistent storage


C. Set a seed in the data splitting operation


D. Manually partition the input data





B. Write out the split data sets to persistent storage

Explanation:

To ensure reproducible training and test sets, writing the split data sets to persistent storage is a reliable approach. Setting a seed alone is not sufficient here, because the output of randomSplit also depends on how the data is partitioned, which can change when the cluster is reconfigured. Persisting the split once lets you load exactly the same training and test data for every model run, regardless of cluster reconfiguration or other changes in the environment.

Correct approach:

Split the data.

Write the split data to persistent storage (e.g., HDFS, S3).

Load the data from storage for each model training session.

train_df, test_df = spark_df.randomSplit([0.8, 0.2], seed=42)

train_df.write.parquet("path/to/train_df.parquet")
test_df.write.parquet("path/to/test_df.parquet")

# Later, load the data
train_df = spark.read.parquet("path/to/train_df.parquet")
test_df = spark.read.parquet("path/to/test_df.parquet")

References:

Spark DataFrameWriter Documentation

A data scientist wants to explore the Spark DataFrame spark_df and wants visual histograms displaying the distribution of the numeric features to be included in the exploration. Which of the following lines of code can the data scientist run to accomplish the task?


A. spark_df.describe()


B. dbutils.data(spark_df).summarize()


C. This task cannot be accomplished in a single line of code.


D. spark_df.summary()


E. dbutils.data.summarize(spark_df)





E. dbutils.data.summarize(spark_df)

Explanation:

To display visual histograms and summaries of the numeric features in a Spark DataFrame, the Databricks utility function dbutils.data.summarize can be used. This function provides a comprehensive summary, including visual histograms.

Correct code:

dbutils.data.summarize(spark_df)

Other options like spark_df.describe() and spark_df.summary() provide textual statistical summaries but do not include visual histograms.

References:

Databricks Utilities Documentation

A data scientist is using Spark SQL to import their data into a machine learning pipeline. Once the data is imported, the data scientist performs machine learning tasks using Spark ML. Which of the following compute tools is best suited for this use case?


A. Single Node cluster


B. Standard cluster


C. SQL Warehouse


D. None of these compute tools support this task





B. Standard cluster

Explanation:

For a data scientist using Spark SQL to import data and then performing machine learning tasks with Spark ML, the best-suited compute tool is a Standard cluster. A Standard (multi-node) cluster in Databricks provides a driver and multiple workers, giving the resources and scalability needed to handle large datasets and run distributed Spark SQL and Spark ML workloads efficiently. A Single Node cluster has no distributed workers, and a SQL Warehouse is designed for SQL analytics rather than Spark ML training.
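
As a rough sketch of what a Standard (multi-node) cluster looks like, the following definition is the kind of specification that could be submitted to the Databricks Clusters API; the field values shown are placeholders, not recommendations:

# Illustrative only: a minimal multi-node ("Standard") cluster definition;
# spark_version and node_type_id are placeholder values
cluster_spec = {
    "cluster_name": "ml-pipeline-cluster",
    "spark_version": "13.3.x-cpu-ml-scala2.12",
    "node_type_id": "i3.xlarge",
    "num_workers": 4,  # a driver plus several workers enables distributed Spark SQL and Spark ML
}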

References:

Databricks documentation on clusters: Clusters in Databricks

A machine learning engineer has created a Feature Table new_table using Feature Store Client fs. When creating the table, they specified a metadata description with key information about the Feature Table. They now want to retrieve that metadata programmatically. Which of the following lines of code will return the metadata description?


A. There is no way to return the metadata description programmatically.


B. fs.create_training_set("new_table")


C. fs.get_table("new_table").description


D. fs.get_table("new_table").load_df()


E. fs.get_table("new_table")





C. fs.get_table("new_table").description

Explanation:

To retrieve the metadata description of a feature table created using the Feature Store Client (referred to here as fs), call get_table on the fs client with the table name as an argument and then access the description attribute of the returned object. The code snippet fs.get_table("new_table").description does exactly this: it fetches the table object for "new_table" and reads its description attribute, where the metadata is stored. The other options do not retrieve the metadata description.
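
A minimal sketch, assuming the client comes from the databricks.feature_store package as in the standard Feature Store workflow:

from databricks.feature_store import FeatureStoreClient

fs = FeatureStoreClient()

# get_table returns a FeatureTable object; its description attribute holds the
# metadata description supplied when the table was created
print(fs.get_table("new_table").description)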

References:

Databricks Feature Store documentation (Accessing Feature Table Metadata).

