Monday, October 1 • 10:00am - 3:00pm
All Day Training Class on "Hadoop Ecosystem" (Separate Registration is Required)

One Day Intensive Hands-on Hadoop Training Class. Topics Covered Are:

Hadoop BI Developer Hadoop – Architecture, HDFS, Ecosystem & MapReduce

Time: 10am – noon, 1pm – 4:30pm


Audience: Engineers, Programmers, Networking specialists, Managers, Executives


Software covered: HDFS, MapReduce, Pig, Hive, HBase


Labs: 5 labs, roughly 20 minutes each



  • Introduce students to the core concepts of Hadoop

  • Deep dive into the critical architecture paths of HDFS, MapReduce and HBase

  • Teach the basics of how to effectively write Pig and Hive scripts

  • Explain how to choose the correct use cases for Hadoop

  • Give each student access to an individual 1-node Hadoop cluster in Rackspace to run through some hands-on labs for the 5 software components: HDFS, MapReduce, Pig, Hive, HBase

  • Provide links to the best books, blog posts and videos for students to learn more about Hadoop on their own


Summary: This is a fast-paced, vendor-agnostic technical overview of the Hadoop landscape. No prior knowledge of databases or programming is assumed. This survey course is aimed at both technical and non-technical people who want to understand the emerging world of Big Data, with a specific focus on Hadoop. For each sub-topic, the instructor will provide links and resource recommendations (for example, YouTube videos, books and blog posts) for students who want to explore that area further. Students will be given a ~100-page PDF slide deck that can be used as reference material after the course. PDFs will also be given out for the 5 short labs in the course.

Course structure:

10am – 10:30am: Introduction to Big Data and Hadoop

10:30am – 11:15am: HDFS Lecture

11:15am – 11:40am: HDFS Lab

11:40am – noon: MapReduce Introduction Lecture

Noon – 1pm: Lunch

1pm – 1:20pm: MapReduce Advanced Lecture

1:20pm – 1:40pm: MapReduce Lab

1:40pm – 2pm: Pig Lecture

2pm – 2:20pm: Pig Lab

2:20pm – 2:40pm: Hive Lecture

2:40pm – 3pm: Hive Lab

3pm – 3:40pm: HBase Lecture

3:40pm – 4pm: HBase Lab

4pm – 4:30pm: Next-gen Hadoop (2.0) Lecture

Session 1: Intro to Hadoop (10am to 10:30am)

  • Parallel Computing vs. Distributed Computing

  • Brief history of Hadoop

  • Scaling with Hadoop

  • Hadoop clusters at Yahoo! and Facebook

  • RDBMS/SQL vs. Hadoop

  • Hadoop Daemons introduction: NameNode, DataNode, JobTracker, TaskTracker

  • Intro to the Hadoop ecosystem: HDFS, MapReduce, Pig, Hive, HBase, ZooKeeper

  • Vendor Comparison (Cloudera vs. Hortonworks vs. Amazon EMR)

  • Hardware + Software recommendations for Hadoop


Session 2: HDFS (10:30am – 11:40am)

  • Linux File system options

  • Sample HDFS commands

  • HDFS sample architecture at Yahoo!

  • Data Locality

  • Rack Awareness

  • Write Pipeline

  • Read Pipeline

  • NameNode architecture (EditLog, FsImage, location of replicas, safe mode)

  • Secondary NameNode architecture

  • DataNode architecture

  • Heartbeats

  • Block Scanner

  • Fsck Health Check + file breakdown

  • Balancer

  • LAB #1: Exploring the HDFS cmd line
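
As a taste of the "file breakdown" topic above, here is a minimal Python sketch of how a file's size translates into HDFS blocks and replicas. It is not course material: the function name hdfs_footprint is hypothetical, and the defaults assume the 64 MB block size of Hadoop 1.x and a replication factor of 3, both of which are configurable per cluster.

```python
def hdfs_footprint(file_size_bytes, block_size_bytes=64 * 1024**2, replication=3):
    """Estimate how HDFS would break up and replicate one file.

    Assumed defaults: 64 MB blocks (Hadoop 1.x) and 3 replicas.
    Returns (blocks, block_replicas, raw_bytes_stored).
    """
    if file_size_bytes < 0 or block_size_bytes <= 0 or replication <= 0:
        raise ValueError("sizes must be non-negative and settings positive")

    # Ceiling division: the last block may be only partially filled.
    blocks = -(-file_size_bytes // block_size_bytes)

    # Each block is stored `replication` times across DataNodes.
    block_replicas = blocks * replication

    # A partial final block occupies only its actual length on disk,
    # so raw usage is the file size times the replication factor.
    raw_bytes_stored = file_size_bytes * replication

    return blocks, block_replicas, raw_bytes_stored


if __name__ == "__main__":
    mb = 1024**2
    blocks, replicas, raw = hdfs_footprint(200 * mb)
    print(f"200 MB file: {blocks} blocks, {replicas} replicas, {raw // mb} MB raw")
```

Under these assumptions a 200 MB file occupies four blocks (three full blocks plus one 8 MB block), twelve block replicas and 600 MB of raw disk space.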


Session 3: MapReduce (11:40am to 1:40pm, minus a one-hour lunch)

  • MapReduce Architecture

  • JobTracker/TaskTracker

  • Combiner

  • Partitioner (shuffle)

  • Thinking in the MapReduce way (examples of Mappers & Reducers)

  • Counters

  • Hadoop Streaming (with Python)

  • Hadoop Java example

  • Input/output formats

  • Speculative Execution

  • Distributed Cache

  • Job Scheduling (FIFO, Fair Scheduler, Capacity Scheduler)

  • LAB #2: Running MapReduce wordcount in Python & Java
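
For readers who want a preview of "thinking in the MapReduce way", the following is a minimal Python sketch of word count. It is not the lab code: it runs entirely in one local process, and the function names (mapper, reducer, word_count) are illustrative assumptions. The sorted() call stands in for the shuffle/sort step that Hadoop performs between the map and reduce phases.

```python
from itertools import groupby
from operator import itemgetter


def mapper(lines):
    """Map phase: emit a (word, 1) pair for every word in the input."""
    for line in lines:
        for word in line.lower().split():
            yield word, 1


def reducer(sorted_pairs):
    """Reduce phase: sum the counts for each word.

    Expects pairs sorted by key, which is what the shuffle/sort
    step guarantees for each reducer on a real cluster.
    """
    for word, group in groupby(sorted_pairs, key=itemgetter(0)):
        yield word, sum(count for _, count in group)


def word_count(lines):
    """Run map, a local stand-in for shuffle/sort, then reduce."""
    shuffled = sorted(mapper(lines), key=itemgetter(0))
    return dict(reducer(shuffled))


if __name__ == "__main__":
    sample = ["the quick brown fox", "the lazy dog", "The fox"]
    print(word_count(sample))
```

With Hadoop Streaming, the mapper and reducer would typically be separate scripts that read lines from standard input and write tab-separated key/value pairs to standard output; the single-process version here only illustrates the data flow.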


Session 4: Pigs Eat Anything (1:40pm to 2:20pm)

  • Pig philosophy and architecture

  • Pig Latin and the Grunt shell

  • Loading data

  • Data types and schemas

  • Pig Latin details: structure, functions, expressions, relational operators

  • Intro to User Defined Functions and Scripts

  • LAB #3: Exploring Pig Latin commands


Session 5: Hive for Structured Data (2:20pm to 3pm)

  • Hive philosophy and architecture

  • Hive vs. RDBMS

  • HiveQL and Hive Shell

  • Managing tables

  • Data types and schemas

  • Querying data

  • LAB #4: Analyzing movie reviews with Hive


Session 6: Real-time I/O with HBase (3pm – 4pm)

  • HBase versions and origins

  • HBase architecture

  • HBase core concepts

  • HBase vs. RDBMS

  • HBase Master and Region Servers

  • Data Modeling

  • Column Families and Regions

  • HBase Internals: Bloom Filters and Block Indexes

  • Write Pipeline / Read Pipeline

  • Compactions

  • LAB #5: Intro to the HBase command line


Session 7: Next-gen Hadoop (4pm – 4:30pm)

  • HDFS improvements: HDFS Federation, NameNode HA, Snapshots

  • MapReduce improvements: YARN, Performance


Monday October 1, 2012 10:00am - 3:00pm
Redwood Room