Speaker: Carol McDonald



Spark Workshop


Load and Inspect Data in Apache Spark

·         Define Apache Spark components

·         Describe different ways of getting data into Spark

·         Create and use Resilient Distributed Datasets (RDDs)

·         Apply transformations to RDDs

·         Use actions on RDDs

o   Lab: Load and inspect data in RDD

·         Cache intermediate RDDs

·         Use Spark DataFrames for simple queries

o   Lab: Load and inspect data in DataFrames
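The transformation/action distinction above can be sketched in plain Python, with no Spark installation required: transformations only compose a lazy pipeline, and nothing runs until an action is called. The `MiniRDD` class below is a hypothetical stand-in for illustration only, not the Spark API.

```python
# Minimal, pure-Python sketch of the lazy-transformation / eager-action
# model used by Spark RDDs. MiniRDD is a hypothetical stand-in, NOT the
# Spark API; it only illustrates the evaluation semantics.

class MiniRDD:
    def __init__(self, compute):
        self._compute = compute  # zero-arg function producing an iterator

    # --- transformations: lazy, they only compose functions ---
    def map(self, f):
        return MiniRDD(lambda: (f(x) for x in self._compute()))

    def filter(self, pred):
        return MiniRDD(lambda: (x for x in self._compute() if pred(x)))

    # --- actions: eager, they actually run the pipeline ---
    def collect(self):
        return list(self._compute())

    def count(self):
        return sum(1 for _ in self._compute())

# "Load" a tiny dataset and build a pipeline; nothing executes yet.
rdd = MiniRDD(lambda: iter([1, 2, 3, 4, 5]))
squares_of_evens = rdd.filter(lambda x: x % 2 == 0).map(lambda x: x * x)

# Only the actions below trigger computation.
print(squares_of_evens.collect())  # [4, 16]
print(squares_of_evens.count())    # 2
```

Caching an intermediate RDD, covered next, matters precisely because of this laziness: without it, each action re-runs the whole pipeline from the source.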

Build a Simple Apache Spark Application

·         Define the lifecycle of a Spark program

o   Lab: Create the application

·         Define different ways to run a Spark application

·         Run your Spark application

o   Lab: Launch the application

·         Supplemental Lab

·         Describe Pair RDDs and why they are used

o   Where Pair RDDs are used

o   Differentiate between the MapReduce and Spark execution models, and examine in depth how execution actually works

·         Create Pair RDDs

o   Lab: Create Pair RDD

·         Apply transformations and actions to Pair RDDs

o   Use sortByKey and joins

o   Differentiate between groupByKey and reduceByKey

o   Lab: Apply transformations and actions to Pair RDD

·         Control partitioning across nodes

o   Why partition; Types of partitioning
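The groupByKey/reduceByKey distinction above can be illustrated in pure Python (an illustration of the semantics, not the Spark API): both produce the same aggregated result, but reduceByKey combines values within each partition before the shuffle, so far fewer records cross the network.

```python
# Pure-Python sketch of why reduceByKey is usually preferred over
# groupByKey for aggregations: reduceByKey performs a map-side combine
# within each partition BEFORE the shuffle. Illustrative only.
from collections import defaultdict

partitions = [  # word-count style (key, 1) pairs split across 2 partitions
    [("a", 1), ("b", 1), ("a", 1), ("a", 1)],
    [("b", 1), ("a", 1), ("b", 1)],
]

def shuffle(parts):
    """Group records by key across partitions (the expensive network step)."""
    grouped = defaultdict(list)
    for part in parts:
        for k, v in part:
            grouped[k].append(v)
    return grouped

# groupByKey: every (key, value) pair is shuffled, then reduced.
group_by_key_counts = {k: sum(vs) for k, vs in shuffle(partitions).items()}
records_shuffled_group = sum(len(p) for p in partitions)  # 7 records shuffled

# reduceByKey: combine locally first, then shuffle the partial sums.
combined = []
for part in partitions:
    local = defaultdict(int)
    for k, v in part:
        local[k] += v
    combined.append(list(local.items()))
reduce_by_key_counts = {k: sum(vs) for k, vs in shuffle(combined).items()}
records_shuffled_reduce = sum(len(p) for p in combined)   # 4 records shuffled

print(group_by_key_counts, reduce_by_key_counts)       # same result
print(records_shuffled_group, records_shuffled_reduce) # 7 vs 4
```

The same picture motivates partitioning control: co-partitioning two Pair RDDs by key lets joins avoid a shuffle entirely.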

Build an Application Using Spark DataFrames

·         Create DataFrames

o   Ways to create DataFrames

o   Lab: Create DataFrame

·         Explore data in DataFrames

o   Use DataFrame functions and operations

o   Use SQL

o   Lab: Explore data in DataFrames

·         Create User-Defined Functions (UDFs)

o   Lab: Create and use User-Defined Functions

·         Repartition DataFrames
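The DataFrame ideas in this section can be sketched in plain Python: a DataFrame modeled as a list of rows, a filter-and-select query, and a user-defined function applied to a column. The row data and the `upper_udf` function are hypothetical examples; this is an illustration of the concepts, not the Spark API.

```python
# Pure-Python sketch of DataFrame querying and UDFs. Illustrative only;
# in Spark the equivalent would be df.filter(...).select(...) with a
# registered UDF, runnable from both the DataFrame API and SQL.

rows = [  # a tiny "DataFrame": one dict per row
    {"name": "alice", "age": 34},
    {"name": "bob",   "age": 19},
    {"name": "carol", "age": 45},
]

# A "UDF" is just an ordinary function applied to a column value.
def upper_udf(name: str) -> str:
    return name.upper()

# Conceptually: SELECT name, upper_udf(name) FROM rows WHERE age > 21
adults = [
    {"name": r["name"], "name_upper": upper_udf(r["name"])}
    for r in rows
    if r["age"] > 21
]

print(adults)
# [{'name': 'alice', 'name_upper': 'ALICE'},
#  {'name': 'carol', 'name_upper': 'CAROL'}]
```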

Monitor a Spark Application

·         Describe the components of the Spark execution model

·         Use the Spark UI to monitor a Spark application

o   Lab: Monitor a Spark Application using the Spark UI

·         Debug a Spark application

Spark Streaming

·         What is Spark Streaming?

·         Why Spark Streaming?

·         How to use Spark Streaming

o   Lab: Write a Spark Streaming application with HBase

§  Initialize the StreamingContext

§  Apply transformations and output operations to DStreams

§  Write to HBase
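The streaming model above can be sketched in plain Python: the input stream is cut into micro-batches, each batch goes through the same transformations, and an output operation pushes results to an external store. Here a plain dict stands in for the HBase table used in the lab; the batch data is hypothetical. This illustrates the semantics only, not the Spark Streaming API.

```python
# Pure-Python sketch of Spark Streaming's micro-batch model. Each batch
# is word-counted (flatMap -> (word, 1) -> reduceByKey, conceptually) and
# the output operation merges results into an external store (in the
# lab, an HBase table; here, a Counter). Illustrative only.
from collections import Counter

# Three micro-batches of incoming lines (e.g. one batch per interval).
batches = [
    ["spark streaming", "spark"],
    ["streaming with hbase"],
    ["spark"],
]

store = Counter()  # stand-in for the HBase table the lab writes to

for batch in batches:
    # Per-batch transformation: split lines into words and count them.
    words = Counter(w for line in batch for w in line.split())
    # Output operation: merge this batch's counts into the store.
    store.update(words)

print(dict(store))
# {'spark': 3, 'streaming': 2, 'with': 1, 'hbase': 1}
```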

Spark Machine Learning

·         Collaborative filtering for recommendations with Spark, using Spark MLlib’s Alternating Least Squares (ALS) algorithm to make recommendations

·         Lab: Movie recommendations. Load the sample data.
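The idea behind ALS can be sketched in pure Python on a toy rating matrix: factor the user-by-movie matrix into user factors and item factors, alternately solving a least-squares problem for one side while the other is held fixed. This rank-1, fully observed toy (hypothetical ratings) only illustrates the alternation; Spark MLlib's ALS is rank-k, handles missing ratings, and runs distributed.

```python
# Tiny, pure-Python sketch of Alternating Least Squares (ALS). Rank 1
# for readability: each user i has one factor u[i], each movie j one
# factor v[j], and the predicted rating is u[i] * v[j]. Illustrative
# only; not Spark MLlib's implementation.

R = [  # hypothetical ratings: 3 users x 3 movies
    [5.0, 4.0, 1.0],
    [4.0, 5.0, 1.0],
    [1.0, 1.0, 5.0],
]

u = [1.0, 1.0, 1.0]  # user factors
v = [1.0, 1.0, 1.0]  # movie factors

def error(R, u, v):
    """Sum of squared errors of the rank-1 reconstruction."""
    return sum((R[i][j] - u[i] * v[j]) ** 2
               for i in range(len(u)) for j in range(len(v)))

initial_error = error(R, u, v)

for _ in range(20):
    # Fix v, solve least squares for each u[i]:
    #   u[i] = sum_j R[i][j] * v[j] / sum_j v[j]^2
    u = [sum(R[i][j] * v[j] for j in range(3)) / sum(x * x for x in v)
         for i in range(3)]
    # Fix u, solve least squares for each v[j] the same way.
    v = [sum(R[i][j] * u[i] for i in range(3)) / sum(x * x for x in u)
         for j in range(3)]

# The alternation monotonically reduces the squared error; a predicted
# rating for user 0 on movie 2 is simply u[0] * v[2].
print(initial_error, round(error(R, u, v), 2))
```

Each half-step has a closed-form solution because, with one side fixed, the objective is an ordinary least-squares problem; that is what makes the alternation cheap and parallelizable.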


Carol McDonald: Carol is an HBase and Hadoop instructor at MapR. She has extensive experience as a software developer and architect, building complex mission-critical applications in the banking, health insurance, and telecom industries. Carol has over 15 years of experience working with Java and Java Enterprise technologies in many roles of the software development life cycle, including design, development, and technology evangelism. As a Java Technology Evangelist at Sun Microsystems, Carol traveled worldwide, speaking at Sun Tech Days, JUGs, companies, and conferences. Previously in her career, Carol was a software developer for Shaw Systems, Hoffman La Roche, and Digital Equipment Corporation. Carol holds a BS in Geology from Vanderbilt University, and an MS in Computer Science from the University of Tennessee-Knoxville.