CCA175 Practice Test



Problem Scenario 65: You have been given the below code snippet.
val a = sc.parallelize(List("dog", "cat", "owl", "gnu", "ant"), 2)
val b = sc.parallelize(1 to a.count.toInt, 2)
val c = a.zip(b)
operation1
Write a correct code snippet for operation1 that produces the desired output, shown below.
Array[(String, Int)] = Array((owl,3), (gnu,4), (dog,1), (cat,2), (ant,5))

Answer: See the explanation for Step by Step Solution and configuration.
Explanation:
Solution: c.sortByKey(false).collect
sortByKey [Ordered]: This function sorts the input RDD's data and stores it in a new RDD.
The output RDD is a shuffled RDD because it stores data that has been output by a reducer and shuffled. The implementation of this function is actually very clever: first, it uses a range partitioner to partition the data into ranges within the shuffled RDD; then it sorts these ranges individually with mapPartitions, using standard sort mechanisms.
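For reference, here is a minimal sketch you can paste into the spark-shell to compare descending and ascending key order (the variable names follow the scenario above):

val a = sc.parallelize(List("dog", "cat", "owl", "gnu", "ant"), 2)
val b = sc.parallelize(1 to a.count.toInt, 2)
val c = a.zip(b)
// Descending sort on the String key, as required by the scenario
c.sortByKey(false).collect
// res: Array[(String, Int)] = Array((owl,3), (gnu,4), (dog,1), (cat,2), (ant,5))
// Ascending sort for comparison
c.sortByKey(true).collect
// res: Array[(String, Int)] = Array((ant,5), (cat,2), (dog,1), (gnu,4), (owl,3))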

Problem Scenario 1: You have been given a MySQL DB with the following details.
user=retail_dba
password=cloudera
database=retail_db
table=retail_db.categories
jdbc URL = jdbc:mysql://quickstart:3306/retail_db
Please accomplish the following activities.
1. Connect to the MySQL DB and check the content of the tables.
2. Copy the "retail_db.categories" table to HDFS, without specifying a directory name.
3. Copy the "retail_db.categories" table to HDFS, into a directory named "categories_target".
4. Copy the "retail_db.categories" table to HDFS, into a warehouse directory named "categories_warehouse".


Answer: See the explanation for Step by Step Solution and configuration.
Explanation:
Solution:
Step 1: Connect to the existing MySQL database.
mysql --user=retail_dba --password=cloudera retail_db
Step 2: Show all the available tables.
show tables;
Step 3: View/count data from a table in MySQL.
select count(1) from categories;
Step 4: Check the data currently available in the HDFS home directory.
hdfs dfs -ls
Step 5: Import a single table (without specifying a directory).
sqoop import --connect jdbc:mysql://quickstart:3306/retail_db --username=retail_dba --password=cloudera --table=categories
Note: Please check that you do not have spaces before or after the '=' signs. Sqoop uses the MapReduce framework to copy data from the RDBMS to HDFS.
Step 6: Read the data from one of the partition files created by the above command.
hdfs dfs -cat categories/part-m-00000
Step 7: Specify a target directory in the import command (we are using number of mappers = 1; you can change it accordingly).
sqoop import --connect jdbc:mysql://quickstart:3306/retail_db --username=retail_dba --password=cloudera --table=categories --target-dir=categories_target -m 1
Step 8: Check the content of one of the partition files.
hdfs dfs -cat categories_target/part-m-00000
Step 9: Specify a warehouse (parent) directory so that you can copy more than one table into a specified target directory.
sqoop import --connect jdbc:mysql://quickstart:3306/retail_db --username=retail_dba --password=cloudera --table=categories --warehouse-dir=categories_warehouse -m 1
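Once the imports finish, a quick way to verify the results from the spark-shell (a minimal sketch; it assumes the comma-delimited part files produced by the commands above):

// Read the part files written by Sqoop and print a few rows
sc.textFile("categories_target").take(5).foreach(println)
// The warehouse import nests the part files under the table name
sc.textFile("categories_warehouse/categories").take(5).foreach(println)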

Problem Scenario 63: You have been given the below code snippet.
val a = sc.parallelize(List("dog", "tiger", "lion", "cat", "panther", "eagle"), 2)
val b = a.map(x => (x.length, x))
operation1
Write a correct code snippet for operation1 that produces the desired output, shown below.
Array[(Int, String)] = Array((4,lion), (3,dogcat), (7,panther), (5,tigereagle))

Answer: See the explanation for Step by Step Solution and configuration.
Explanation:
Solution:
b.reduceByKey(_ + _).collect
reduceByKey [Pair]: This function provides the well-known reduce functionality in Spark. Please note that any function f you provide should be commutative and associative in order to generate reproducible results.
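A minimal spark-shell sketch of the full scenario, showing how concatenation per key produces the grouped output above:

val a = sc.parallelize(List("dog", "tiger", "lion", "cat", "panther", "eagle"), 2)
val b = a.map(x => (x.length, x))
// reduceByKey first combines values within each partition, then merges
// the partial results across partitions
b.reduceByKey(_ + _).collect
// res: Array[(Int, String)] = Array((4,lion), (3,dogcat), (7,panther), (5,tigereagle))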

Problem Scenario 89: You have been given the below patient data in CSV format.
patientID,name,dateOfBirth,lastVisitDate
1001,Ah Teck,1991-12-31,2012-01-20
1002,Kumar,2011-10-29,2012-09-20
1003,Ali,2011-01-30,2012-10-21
Accomplish the following activities.
1. Find all the patients whose lastVisitDate is between the current time and '2012-09-15'.
2. Find all the patients who were born in 2011.
3. Find the age of all the patients.
4. List the patients who last visited more than 60 days ago.
5. Select patients 18 years old or younger.

Answer: See the explanation for Step by Step Solution and configuration.
Explanation:
Solution:
Step 1: Put the data file into HDFS.
hdfs dfs -mkdir sparksql3
hdfs dfs -put patients.csv sparksql3/
Step 2: Now in the spark shell:
// SQLContext entry point for working with structured data
val sqlContext = new org.apache.spark.sql.SQLContext(sc)
// this is used to implicitly convert an RDD to a DataFrame
import sqlContext.implicits._
// Import Spark SQL data types and Row
import org.apache.spark.sql._
// load the data into a new RDD
val patients = sc.textFile("sparksql3/patients.csv")
// Return the first element in this RDD
patients.first()
// define the schema using a case class
case class Patient(patientid: Int, name: String, dateOfBirth: String, lastVisitDate: String)
// create an RDD of Patient objects
val patRDD = patients.map(_.split(",")).map(p => Patient(p(0).toInt, p(1), p(2), p(3)))
patRDD.first()
patRDD.count()
// change the RDD of Patient objects to a DataFrame
val patDF = patRDD.toDF()
// register the DataFrame as a temp table
patDF.registerTempTable("patients")
// Select data from the table
val results = sqlContext.sql("""SELECT * FROM patients""")
// display the DataFrame in a tabular format
results.show()
// Find all the patients whose lastVisitDate is between the current time and '2012-09-15'
val results = sqlContext.sql("""SELECT * FROM patients WHERE TO_DATE(CAST(UNIX_TIMESTAMP(lastVisitDate, 'yyyy-MM-dd') AS TIMESTAMP)) BETWEEN '2012-09-15' AND current_timestamp() ORDER BY lastVisitDate""")
results.show()
// Find all the patients who were born in 2011
val results = sqlContext.sql("""SELECT * FROM patients WHERE YEAR(TO_DATE(CAST(UNIX_TIMESTAMP(dateOfBirth, 'yyyy-MM-dd') AS TIMESTAMP))) = 2011""")
results.show()
// Find the age of all the patients
val results = sqlContext.sql("""SELECT name, dateOfBirth, datediff(current_date(), TO_DATE(CAST(UNIX_TIMESTAMP(dateOfBirth, 'yyyy-MM-dd') AS TIMESTAMP))) / 365 AS age FROM patients""")
results.show()
// List the patients who last visited more than 60 days ago
val results = sqlContext.sql("""SELECT name, lastVisitDate FROM patients WHERE datediff(current_date(), TO_DATE(CAST(UNIX_TIMESTAMP(lastVisitDate, 'yyyy-MM-dd') AS TIMESTAMP))) > 60""")
results.show()
// Select patients 18 years old or younger. DATE_SUB with an INTERVAL
// (e.g. INTERVAL 18 YEAR) is MySQL syntax; here 18*365 days is used
// as an approximation instead.
val results = sqlContext.sql("""SELECT * FROM patients WHERE TO_DATE(CAST(UNIX_TIMESTAMP(dateOfBirth, 'yyyy-MM-dd') AS TIMESTAMP)) > DATE_SUB(current_date(), 18*365)""")
results.show()
val results = sqlContext.sql("""SELECT DATE_SUB(current_date(), 18*365) FROM patients""")
results.show()
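For comparison, the age calculation can also be written with the DataFrame API instead of SQL (a sketch; it assumes the patDF DataFrame defined above):

import org.apache.spark.sql.functions._
// Approximate age in years from the days between today and dateOfBirth
patDF.select(col("name"), (datediff(current_date(), to_date(col("dateOfBirth"))) / 365).alias("age")).show()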

Problem Scenario 48: You have been given the below Python code snippet, with intermediate output.
We want to take a list of records about people and then sum up their ages and count them.
So for this example the type in the RDD will be a dictionary in the format of {'name': NAME, 'age': AGE, 'gender': GENDER}.
The result type will be a tuple that looks like this: (Sum of Ages, Count).
people = []
people.append({'name':'Amit', 'age':45,'gender':'M'})
people.append({'name':'Ganga', 'age':43,'gender':'F'})
people.append({'name':'John', 'age':28,'gender':'M'})
people.append({'name':'Lolita', 'age':33,'gender':'F'})
people.append({'name':'Dont Know', 'age':18,'gender':'T'})
peopleRdd = sc.parallelize(people)  # Create an RDD
peopleRdd.aggregate((0, 0), seqOp, combOp)  # Output of the above line: (167, 5)
Now define the two operations seqOp and combOp such that:
seqOp: sums the ages of all people and counts them, within each partition.
combOp: combines the results from all partitions.

Answer: See the explanation for Step by Step Solution and configuration.
Explanation:
Solution:
seqOp = (lambda x, y: (x[0] + y['age'], x[1] + 1))
combOp = (lambda x, y: (x[0] + y[0], x[1] + y[1]))
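For readers following the Scala scenarios above, the same (sum, count) aggregation in the spark-shell looks like this (a sketch using the same five records):

// Each record as (name, age, gender); aggregate into (sum of ages, count)
val people = sc.parallelize(Seq(("Amit", 45, "M"), ("Ganga", 43, "F"), ("John", 28, "M"), ("Lolita", 33, "F"), ("Dont Know", 18, "T")))
// seqOp folds one record into a partition's running (sum, count)
val seqOp = (acc: (Int, Int), p: (String, Int, String)) => (acc._1 + p._2, acc._2 + 1)
// combOp merges the (sum, count) pairs from different partitions
val combOp = (a: (Int, Int), b: (Int, Int)) => (a._1 + b._1, a._2 + b._2)
people.aggregate((0, 0))(seqOp, combOp)
// res: (Int, Int) = (167,5)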

