Cloudera CCA175 Test Dumps

Total Questions Answers: 96
Last Updated: 17-Feb-2025

Question # 1

Problem Scenario 62 : You have been given the code snippet below.
val a = sc.parallelize(List("dog", "tiger", "lion", "cat", "panther", "eagle"), 2)
val b = a.map(x => (x.length, x))
operation1
Write a correct code snippet for operation1 which will produce the desired output, shown below.
Array[(Int, String)] = Array((3,xdogx), (5,xtigerx), (4,xlionx), (3,xcatx), (7,xpantherx), (5,xeaglex))

Answer: See the explanation for Step by Step Solution and configuration.
Solution :
b.mapValues("x" + _ + "x").collect
mapValues [Pair] : Takes the values of an RDD that consists of two-component tuples and
applies the provided function to transform each value. Then it forms new two-component
tuples using the key and the transformed value and stores them in a new RDD.
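
To make the behaviour of mapValues concrete, the following is a minimal plain-Python sketch (it does not use Spark). The list of tuples stands in for the pair RDD b, and map_values is a hypothetical helper written only for this illustration.

```python
# Plain-Python stand-in for an RDD of (key, value) pairs; this is not Spark code.
words = ["dog", "tiger", "lion", "cat", "panther", "eagle"]
pairs = [(len(w), w) for w in words]  # same shape as b in the snippet


def map_values(kv_pairs, fn):
    """Apply fn to each value and leave every key untouched (hypothetical helper)."""
    return [(key, fn(value)) for key, value in kv_pairs]


wrapped = map_values(pairs, lambda v: "x" + v + "x")
print(wrapped)
```

Only the second element of each tuple changes; the keys (the word lengths) pass through as they are.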

Question # 2

Problem Scenario 59 : You have been given below code snippet.
val x = sc.parallelize(1 to 20)
val y = sc.parallelize(10 to 30)
operation1
Write a correct code snippet for operation1 which will produce the desired output, shown below.
Array[Int] = Array(16, 12, 20, 13, 17, 14, 18, 10, 19, 15, 11)

Answer: See the explanation for Step by Step Solution and configuration.
Solution :
val z = x.intersection(y)
z.collect
intersection : Returns the elements that are present in both RDDs.
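
As a rough plain-Python sketch of the same idea (no Spark involved), the two ranges can be intersected as sets. Spark's intersection likewise returns each common element once, and the order of the collected array is not guaranteed, which is why the expected output above is unsorted.

```python
# Plain-Python stand-ins for the two RDDs; Scala's "1 to 20" includes 20.
x = range(1, 21)   # mirrors sc.parallelize(1 to 20)
y = range(10, 31)  # mirrors sc.parallelize(10 to 30)

# Elements present in both inputs, sorted here only to make the result readable.
common = sorted(set(x) & set(y))
print(common)
```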

Question # 3

Problem Scenario 78 : You have been given MySQL DB with following details
jdbc URL = jdbc:mysql://quickstart:3306/retail_db
Columns of orders table : (order_id, order_date, order_customer_id, order_status)
Columns of order_items table : (order_item_id, order_item_order_id,
Please accomplish following activities.
1. Copy "retail_db.orders" and "retail_db.order_items" table to hdfs in respective directory
p92_orders and p92_order_items .
2. Join these data using order_id in Spark and Python
3. Calculate total revenue per day and per customer
4. Calculate maximum revenue customer

Answer: See the explanation for Step by Step Solution and configuration.
Solution :
Step 1 : Import each table individually.
sqoop import --connect jdbc:mysql://quickstart:3306/retail_db --username=retail_dba --password=cloudera --table=orders --target-dir=p92_orders -m 1
sqoop import --connect jdbc:mysql://quickstart:3306/retail_db --username=retail_dba --password=cloudera --table=order_items --target-dir=p92_order_items -m 1
Note : Make sure there is no space before or after the '=' sign. Sqoop uses the
MapReduce framework to copy data from the RDBMS to hdfs.
Step 2 : Read the data from one of the partitions created by the above commands.
hadoop fs -cat p92_orders/part-m-00000
hadoop fs -cat p92_order_items/part-m-00000
Step 3 : Load the above two directories as RDDs using Spark and Python (open a pyspark
terminal and do the following).
orders = sc.textFile("p92_orders")
orderItems = sc.textFile("p92_order_items")
Step 4 : Convert each RDD into key-value pairs (order_id as the key and the whole line as the value)
#First value is order_id
ordersKeyValue = orders.map(lambda line: (int(line.split(",")[0]), line))
#Second value is order_id
orderItemsKeyValue = orderItems.map(lambda line: (int(line.split(",")[1]), line))
Step 5 : Join both RDDs using order_id
joinedData = orderItemsKeyValue.join(ordersKeyValue)
#print the joined data
for line in joinedData.collect():
    print(line)
#Format of joinedData is as below.
#(order_id, ('All columns from orderItemsKeyValue', 'All columns from ordersKeyValue'))
#Key each record by (order_date, order_customer_id) with order_item_subtotal as the value
ordersPerDatePerCustomer = joinedData.map(lambda line: ((line[1][1].split(",")[1], line[1][1].split(",")[2]), float(line[1][0].split(",")[4])))
amountCollectedPerDayPerCustomer = ordersPerDatePerCustomer.reduceByKey(lambda runningSum, amount: runningSum + amount)
#Output record format will be ((date, customer_id), totalAmount)
for line in amountCollectedPerDayPerCustomer.collect():
    print(line)
#now change the format of record as (date, (customer_id, total_amount))
revenuePerDatePerCustomerRDD = amountCollectedPerDayPerCustomer.map(lambda threeElementTuple: (threeElementTuple[0][0], (threeElementTuple[0][1], threeElementTuple[1])))
for line in revenuePerDatePerCustomerRDD.collect():
    print(line)
#Calculate maximum amount collected by a customer for each day
perDateMaxAmountCollectedByCustomer = revenuePerDatePerCustomerRDD.reduceByKey(lambda runningAmountTuple, newAmountTuple: (runningAmountTuple if runningAmountTuple[1] >= newAmountTuple[1] else newAmountTuple))
for line in perDateMaxAmountCollectedByCustomer.sortByKey().collect():
    print(line)
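
The join and the two reductions can be traced on a handful of rows without a cluster. The following plain-Python sketch assumes the comma-separated layout described in the problem; the sample rows are hypothetical and the dictionaries stand in for the pair RDDs.

```python
from collections import defaultdict

# Hypothetical sample rows in the same comma-separated layout as the imported files.
orders = [
    "1,2013-07-25,100,CLOSED",
    "2,2013-07-25,200,COMPLETE",
    "3,2013-07-26,100,COMPLETE",
]
order_items = [
    "1,1,957,1,299.98,299.98",
    "2,2,1073,1,199.99,199.99",
    "3,2,502,5,250.0,50.0",
    "4,3,403,1,129.99,129.99",
]

# Key each order by order_id (column 0), like ordersKeyValue.
orders_by_id = {int(line.split(",")[0]): line for line in orders}

# Join on order_item_order_id (column 1) and sum subtotals per (date, customer).
# Every sample item has a matching order, so a plain lookup is enough here.
revenue = defaultdict(float)
for item in order_items:
    fields = item.split(",")
    order = orders_by_id[int(fields[1])].split(",")
    revenue[(order[1], order[2])] += float(fields[4])

# Keep the (customer, amount) pair with the largest amount for each date.
max_per_date = {}
for (date, customer), amount in revenue.items():
    if date not in max_per_date or amount > max_per_date[date][1]:
        max_per_date[date] = (customer, amount)

for date in sorted(max_per_date):
    print(date, max_per_date[date])
```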

Question # 4

Problem Scenario 36 : You have been given a file named spark8/data.csv (type,name).
1. Load this file from hdfs and save it back as (id, (all names of same type)) in results
directory. However, make sure that the output is written to a single file.

Answer: See the explanation for Step by Step Solution and configuration.
Solution :
Step 1 : Create file in hdfs (We will do using Hue). However, you can first create in local
filesystem and then upload it to hdfs.
Step 2 : Load data.csv file from hdfs and create PairRDDs
val name = sc.textFile("spark8/data.csv")
val namePairRDD = name.map(x => (x.split(",")(0), x.split(",")(1)))
Step 3 : Now swap the key and value of each pair in namePairRDD.
val swapped = namePairRDD.map(item => item.swap)
Step 4 : Now combine the rdd by key.
val combinedOutput = namePairRDD.combineByKey(List(_), (x:List[String], y:String) => y ::
x, (x:List[String], y:List[String]) => x ::: y)
Step 5 : Save the output as a text file; the output must be written to a single file (for
example, by calling repartition(1) or coalesce(1) on the RDD before saveAsTextFile).
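
The first two arguments passed to combineByKey in Step 4 can be followed in a small plain-Python sketch. It is an illustration only: combine_by_key is a hypothetical helper, the rows are made-up sample data, and the third argument (merging lists from different partitions) is left out because everything here lives in one list.

```python
def combine_by_key(pairs, create_combiner, merge_value):
    """Group values per key the way combineByKey does inside a single partition."""
    combined = {}
    for key, value in pairs:
        if key in combined:
            combined[key] = merge_value(combined[key], value)
        else:
            combined[key] = create_combiner(value)
    return combined


# Hypothetical contents of spark8/data.csv as (type, name) rows.
rows = ["1,Lokesh", "2,Bhupesh", "2,Amit", "1,Ratan"]
name_pairs = [tuple(row.split(",")) for row in rows]

grouped = combine_by_key(
    name_pairs,
    create_combiner=lambda name: [name],             # List(_)
    merge_value=lambda names, name: [name] + names,  # y :: x (prepend)
)
print(grouped)
```

Because each new name is prepended, the names for a key come out in reverse order of arrival.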

Question # 5

Problem Scenario 26 : You need to implement a near-real-time solution for collecting
information that is submitted as files. You have been given the directory location
/tmp/nrtcontent (if it is not available, then create it). Assume your department's
upstream service is continuously committing data to this directory as new files (not as a
stream of data, because it is a near-real-time solution). As soon as a file is committed to
this directory, it needs to be available in hdfs in the /tmp/flume location.
echo "I am preparing for CCA175 from" > /tmp/nrtcontent/.he1.txt
mv /tmp/nrtcontent/.he1.txt /tmp/nrtcontent/he1.txt
After a few minutes
echo "I am preparing for CCA175 from" > /tmp/nrtcontent/.qt1.txt
mv /tmp/nrtcontent/.qt1.txt /tmp/nrtcontent/qt1.txt
Write a flume configuration file named flume6.conf and use it to load data in hdfs with
the following additional properties.
1. Spool /tmp/nrtcontent
2. File prefix in hdfs should be events
3. File suffix should be .log
4. If the file is not committed and is in use, then it should have _ as prefix.
5. Data should be written as text to hdfs

Answer: See the explanation for Step by Step Solution and configuration.
Solution :
Step 1 : Create directory
mkdir /tmp/nrtcontent
Step 2 : Create flume configuration file, with below configuration for source, sink and
channel and save it in flume6.conf.
agent1.sources = source1
agent1.sinks = sink1
agent1.channels = channel1
agent1.sources.source1.channels = channel1
agent1.sinks.sink1.channel = channel1
agent1.sources.source1.type = spooldir
agent1.sources.source1.spoolDir = /tmp/nrtcontent
agent1.sinks.sink1.type = hdfs
agent1.sinks.sink1.hdfs.path = /tmp/flume
agent1.sinks.sink1.hdfs.filePrefix = events
agent1.sinks.sink1.hdfs.fileSuffix = .log
agent1.sinks.sink1.hdfs.inUsePrefix = _
agent1.sinks.sink1.hdfs.fileType = DataStream
# Assumption: the channel type is not given above, but Flume requires one; a file channel is assumed here.
agent1.channels.channel1.type = file
Step 4 : Run the below command, which will use this configuration file and append data in hdfs.
Start flume service:
flume-ng agent --conf /home/cloudera/flumeconf --conf-file /home/cloudera/flumeconf/flume6.conf --name agent1
Step 5 : Open another terminal and create a file in /tmp/nrtcontent
echo "I am preparing for CCA175 from" > /tmp/nrtcontent/.he1.txt
mv /tmp/nrtcontent/.he1.txt /tmp/nrtcontent/he1.txt
After a few minutes
echo "I am preparing for CCA175 from" > /tmp/nrtcontent/.qt1.txt
mv /tmp/nrtcontent/.qt1.txt /tmp/nrtcontent/qt1.txt
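
The echo-then-mv pair above writes each file under a hidden dot-name and only renames it once the content is complete, presumably so that the spooling source never picks up a half-written file. A minimal Python sketch of the same commit pattern follows; commit_file is a hypothetical helper and a temporary directory stands in for /tmp/nrtcontent.

```python
import os
import tempfile


def commit_file(directory, name, text):
    """Write to a hidden dot-file first, then rename it into place."""
    hidden = os.path.join(directory, "." + name)
    final = os.path.join(directory, name)
    with open(hidden, "w") as handle:
        handle.write(text)
    os.rename(hidden, final)  # a rename within one directory is atomic on POSIX
    return final


spool_dir = tempfile.mkdtemp()  # stand-in for /tmp/nrtcontent
path = commit_file(spool_dir, "he1.txt", "I am preparing for CCA175 from\n")
print(sorted(os.listdir(spool_dir)))
```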

Question # 6

Problem Scenario 82 : You have been given table in Hive with following structure (Which
you have created in previous exercise).
(productid int, code string, name string, quantity int, price float)
Using SparkSQL accomplish following activities.
1. Select all the products name and quantity having quantity <= 2000
2. Select name and price of the product having code as 'PEN'
3. Select all the products, which name starts with PENCIL
4. Select all products which "name" begins with 'P', followed by any two characters,
followed by space, followed by zero or more characters

Answer: See the explanation for Step by Step Solution and configuration.
Solution :
Step 1 : Copy the following file (mandatory step in Cloudera QuickVM) if you have not done it already.
sudo su root
cp /usr/lib/hive/conf/hive-site.xml /usr/lib/spark/conf/
Step 2 : Now start spark-shell
Step 3 : Select all the products name and quantity having quantity <= 2000
val results = sqlContext.sql("""SELECT name, quantity FROM products WHERE quantity <= 2000""")
results.show()
Step 4 : Select name and price of the product having code as 'PEN'
val results = sqlContext.sql("""SELECT name, price FROM products WHERE code = 'PEN'""")
results.show()
Step 5 : Select all the products whose name starts with PENCIL
val results = sqlContext.sql("""SELECT name, price FROM products WHERE upper(name) LIKE 'PENCIL%'""")
results.show()
Step 6 : Select all products whose "name" begins with 'P', followed by any two characters,
followed by a space, followed by zero or more characters
val results = sqlContext.sql("""SELECT name, price FROM products WHERE name LIKE 'P__ %'""")
results.show()
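
To see what a LIKE pattern with two underscores, a space and a percent sign accepts, the pattern can be translated into a regular expression. The sketch below is plain Python and covers only the two wildcards ('_' for exactly one character, '%' for zero or more); like_to_regex and the product names are hypothetical, and escape characters are not handled.

```python
import re


def like_to_regex(pattern):
    """Translate a SQL LIKE pattern: '_' is one character, '%' is any run."""
    parts = []
    for ch in pattern:
        if ch == "_":
            parts.append(".")
        elif ch == "%":
            parts.append(".*")
        else:
            parts.append(re.escape(ch))
    return re.compile("".join(parts), re.DOTALL)


matcher = like_to_regex("P__ %")
names = ["Pen Red", "Pen Black", "Pencil 2B", "Pe n"]  # hypothetical product names
matched = [n for n in names if matcher.fullmatch(n)]
print(matched)
```

"Pencil 2B" is rejected because its fourth character is not a space, which is exactly the condition the two underscores plus the space enforce.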

Question # 7

Problem Scenario 75 : You have been given MySQL DB with following details.
jdbc URL = jdbc:mysql://quickstart:3306/retail_db
Please accomplish following activities.
1. Copy "retail_db.order_items" table to hdfs in respective directory p90_order_items .
2. Do the summation of entire revenue in this table using pyspark.
3. Find the maximum and minimum revenue as well.
4. Calculate average revenue
Columns of order_items table : (order_item_id, order_item_order_id,
order_item_product_id, order_item_quantity, order_item_subtotal, order_

Answer: See the explanation for Step by Step Solution and configuration.
Solution :
Step 1 : Import a single table.
sqoop import --connect jdbc:mysql://quickstart:3306/retail_db --username=retail_dba --password=cloudera --table=order_items --target-dir=p90_order_items -m 1
Note : Make sure there is no space before or after the '=' sign. Sqoop uses the
MapReduce framework to copy data from the RDBMS to hdfs.
Step 2 : Read the data from one of the partitions created by the above command.
hadoop fs -cat p90_order_items/part-m-00000
Step 3 : In pyspark, get the total revenue across all days and orders.
entireTableRDD = sc.textFile("p90_order_items")
#Cast string to float
extractedRevenueColumn = entireTableRDD.map(lambda line: float(line.split(",")[4]))
Step 4 : Verify extracted data
for revenue in extractedRevenueColumn.collect():
    print(revenue)
#use the reduce function to sum a single column value
totalRevenue = extractedRevenueColumn.reduce(lambda a, b: a + b)
Step 5 : Calculate the maximum revenue
maximumRevenue = extractedRevenueColumn.reduce(lambda a, b: (a if a>=b else b))
Step 6 : Calculate the minimum revenue
minimumRevenue = extractedRevenueColumn.reduce(lambda a, b: (a if a<=b else b))
Step 7 : Calculate average revenue
count = extractedRevenueColumn.count()
averageRevenue = totalRevenue / count
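
The same reductions can be tried locally with functools.reduce, which folds a list with a two-argument function just as RDD.reduce does across partitions. This is a plain-Python sketch on hypothetical rows, not output from the real table; the average is simply the total divided by the number of rows.

```python
from functools import reduce

# Hypothetical order_items rows; column 4 is order_item_subtotal.
rows = [
    "1,1,957,1,299.98,299.98",
    "2,2,1073,1,199.99,199.99",
    "3,2,502,5,250.0,50.0",
]
revenues = [float(row.split(",")[4]) for row in rows]

total = reduce(lambda a, b: a + b, revenues)
maximum = reduce(lambda a, b: a if a >= b else b, revenues)
minimum = reduce(lambda a, b: a if a <= b else b, revenues)
average = total / len(revenues)
print(total, maximum, minimum, average)
```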

Question # 8

Problem Scenario 29 : Please accomplish the following exercises using HDFS command
line options.
1. Create a directory in hdfs named hdfs_commands.
2. Create a file in hdfs named data.txt in hdfs_commands.
3. Now copy this data.txt file on local filesystem, however while copying file please make
sure file properties are not changed e.g. file permissions.
4. Now create a file in local directory named data_local.txt and move this file to hdfs in
hdfs_commands directory.
5. Create a file data_hdfs.txt in hdfs_commands directory and copy it to local file system.
6. Create a file in local filesystem named file1.txt and put it to hdfs

Answer: See the explanation for Step by Step Solution and configuration.
Solution :
Step 1 : Create directory
hdfs dfs -mkdir hdfs_commands
Step 2 : Create a file in hdfs named data.txt in hdfs_commands.
hdfs dfs -touchz hdfs_commands/data.txt
Step 3 : Now copy this data.txt file on local filesystem, however while copying file please
make sure file properties are not changed e.g. file permissions.
hdfs dfs -copyToLocal -p hdfs_commands/data.txt /home/cloudera/Desktop/HadoopExam
Step 4 : Now create a file in local directory named data_local.txt and move this file to hdfs
in hdfs_commands directory.
touch data_local.txt
hdfs dfs -moveFromLocal /home/cloudera/Desktop/HadoopExam/data_local.txt hdfs_commands/
Step 5 : Create a file data_hdfs.txt in hdfs_commands directory and copy it to local file system.
hdfs dfs -touchz hdfs_commands/data_hdfs.txt
hdfs dfs -get hdfs_commands/data_hdfs.txt /home/cloudera/Desktop/HadoopExam/
Step 6 : Create a file in local filesystem named file1.txt and put it to hdfs
touch file1.txt
hdfs dfs -put /home/cloudera/Desktop/HadoopExam/file1.txt hdfs_commands/

