Friday, November 16, 2012

Training Day: Hadoop for Laymen

I've spent this past week in a company-provided training course, which has had a few decent nuggets of information (and free lunches) to balance out the otherwise slow pace and the need to drive to Tysons Corner during rush hour. There are also people in the class from some of our "sister companies" who remind me of the CS majors from my undergrad: eager to showcase their knowledge, catch the instructor in a mistake, or drive the discussion off a tangential cliff of irrelevant details. I learn best on my own, so I would usually read ahead in the slides, do the exercises, and then tune out for the rest of the day on Reddit or my day job.

The material we learned is based on the concept of MapReduce, which is how Google scales its infrastructure to query and analyze gigantic data sets. So that you can avoid a three-thousand-dollar training course, here's how it works:

MapReduce is a way of solving problems in a distributed manner. Rather than buying a few incredibly expensive supercomputers to handle petabytes of data, you buy tons of crappy consumer computers, split the data across them all, and break the big job into smaller jobs that each crappy computer can handle. You assume that some of the crappy computers will break down or take too long, and you cover your ass by assigning the same work to multiple computers and using whichever result comes back first (Hadoop calls this "speculative execution"). This is kind of how outsourcing to foreign countries works as well.
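To make that concrete, here's a toy, single-machine sketch of the pattern in plain Python. None of this is real Hadoop code (run_mapreduce and the sample data are my own inventions); it just shows the shape of the thing: mappers emit key/value pairs, a shuffle step groups the values by key, and reducers collapse each group into an answer.

    from collections import defaultdict

    def run_mapreduce(chunks, mapper, reducer):
        # In real Hadoop, each chunk would live on a different cheap machine;
        # here we just loop over them on one box.
        grouped = defaultdict(list)
        for chunk in chunks:
            for key, value in mapper(chunk):    # "map": emit (key, value) pairs
                grouped[key].append(value)      # "shuffle": group values by key
        # "reduce": collapse each key's values into a single result
        return {key: reducer(values) for key, values in grouped.items()}

    def max_temp_mapper(chunk):
        # Toy job: find the hottest recorded temperature per city.
        for line in chunk:
            city, temp = line.split(",")
            yield city, int(temp)

    chunks = [["dc,31", "nyc,25"], ["dc,28", "nyc,33"]]
    print(run_mapreduce(chunks, max_temp_mapper, max))  # {'dc': 31, 'nyc': 33}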

Hadoop is the part that takes care of all the grunt work: mapping a job onto the computers, reducing their results into one master result, and handling failures, job scheduling, and data replication. With all of that out of the way, you can focus on coding the job itself, but it requires a slightly different mindset to think in MapReduce terms. You don't really see the benefits of this until your data sets are larger than you could possibly imagine (e.g. analyzing the number of clicks your stupid high school friends have made in Farmville, or searching through your ridiculous collection of porn). A common example is getting a word count of the complete works of Shakespeare. You could do this by maintaining a master tally and walking through his works, word by word. Or, you could hand a chunk of text to each computer and consolidate their results at the end.
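For the curious, here's roughly what that word count looks like as a Hadoop Streaming job, where the mapper and reducer are ordinary scripts reading stdin (the file names here are my own placeholders). The trick is that Hadoop sorts the mapper's output by key before handing it to the reducer, so the reducer can tally each word in a single pass.

    #!/usr/bin/env python
    # mapper.py: emit "word<TAB>1" for every word it sees
    import sys
    for line in sys.stdin:
        for word in line.strip().lower().split():
            print("%s\t1" % word)

    #!/usr/bin/env python
    # reducer.py: input arrives sorted by word, so tally each run of identical words
    import sys
    current, count = None, 0
    for line in sys.stdin:
        word, n = line.rstrip("\n").split("\t")
        if word != current:
            if current is not None:
                print("%s\t%d" % (current, count))
            current, count = word, 0
        count += int(n)
    if current is not None:
        print("%s\t%d" % (current, count))

You can test the whole pipeline on one machine without a cluster, since sort plays the role of the shuffle: cat hamlet.txt | python mapper.py | sort | python reducer.py. Running it for real means pointing the streaming jar at your data, with something roughly like this (the jar's exact path depends on your install):

    hadoop jar hadoop-streaming.jar -input shakespeare -output counts \
        -mapper mapper.py -reducer reducer.py -file mapper.py -file reducer.py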

Hadoop is supported by an "ecosystem" of related tools, including Hive, Pig, Sqoop, Flume, and Oozie, because it takes an embarrassing name to get any money in Silicon Valley. And of course, the company offering the training just happens to sell support for its own version of Hadoop, as well as a certification exam that is probably worthless in the long run but will allow me to show "career growth" on my next performance review.
