Using hadoop, divide and conquer -- edulix

   1 [18:56] <kim0> Next up is Edulix .. Presenting "Hadoop" The ultimate hammer to bang on big data :)
   2 [18:56] <ClassBot> There are 5 minutes remaining in the current session.
   3 [19:00] <Edulix> hello people, thanks for attending. this is the session titled "Using hadoop, divide and conquer"
   4 [19:01] <Edulix> kim0 told me about these ubuntu cloud sessions, and kindly asked me to do a talk on hadoop, so here I am  =)
   5 === ChanServ changed the topic of #ubuntu-classroom to: Welcome to the Ubuntu Classroom - || Support in #ubuntu || Upcoming Schedule: || Questions in #ubuntu-classroom-chat || Event: Ubuntu Cloud Days - Current Session: Using hadoop, divide and conquer - Instructors: edulix
   6 [19:01] <Edulix> first I must say that I am in no way a hadoop expert, as I have been working with hadoop just for a bit over a month
   7 [19:01] <ClassBot> Logs for this session will be available at following the conclusion of the session.
   8 [19:02] <Edulix> but I hope that I can help to show you a bit of hadoop and ease the learning curve for those who want to use it
   9 [19:03] <Edulix> I'm going to base this talk on the hadoop tutorial available in as it helped me a lot, but it's a bit dense, so I'll do a condensed version
  10 [19:03] <Edulix> So what's hadoop anyway?
  11 [19:03] <Edulix> it's a large-scale distributed batch processing infrastructure, designed to efficiently distribute large amounts of work across a set of machines
  12 [19:03] <Edulix> here large amounts of work means really really large
  13 [19:04] <Edulix> Hundreds of gigabytes of data is low end for hadoop!
  14 [19:04] <Edulix> hadoop supports handling hundreds of petabytes... Normally the input data is not that big, but the intermediate data is or can be
  15 [19:04] <Edulix> of course, all this does not fit on a single hard drive, much less in memory
  16 [19:05] <Edulix> so hadoop comes with support for its own distributed file system: HDFS
  17 [19:05] <Edulix> which breaks up input data and sends fractions  (blocks) of the original data to some machines in your cluster
  18 [19:06] <Edulix> everyone that has tried will know that performing large-scale computation is difficult
  19 [19:06] <Edulix> whenever multiple machines are used in cooperation with one another, the probability of failures rises: partial failures are expected and common
  20 [19:07] <Edulix> Network failures, computers overheating, disks crashing, data corruption, maliciously modified data..
  21 [19:07] <Edulix> shit happens, all the time (TM)
  22 [19:07] <Edulix> In all these cases, the rest of the distributed system should be able to recover and continue to make progress. the show must go on
  23 [19:08] <Edulix> Hadoop provides no security, and no defense to man in the middle attacks for example
  24 [19:08] <Edulix> it assumes you control your computers so they are secure
  25 [19:08] <Edulix> on the other hand, it is designed to handle hardware failure and data congestion issues very robustly
  26 [19:09] <Edulix> to be successful, a large-scale distributed system must be able to manage resources efficiently
  27 [19:09] <Edulix> CPU, RAM, Harddisk space, network bandwidth
  28 [19:10] <Edulix> This includes allocating some of these resources toward maintaining the system as a whole
  29 [19:10] <Edulix> ..... while devoting as much time as possible to the actual core computation
  30 [19:10] <Edulix> So let's talk about the hadoop approach to things
  31 [19:11] <Edulix> btw if you have any questions, just ask in #ubuntu-classroom-chat with QUESTION: your question
  32 [19:11] <Edulix> Hadoop uses a  simplified programming model which allows the user to quickly write and test distributed systems
  33 [19:12] <Edulix> and also to test its efficient & automatic distribution of data and work across machines
  34 [19:13] <Edulix> and also allows using the underlying parallelism of the CPU cores
  35 [19:13] <Edulix> In a hadoop cluster, data is distributed to all the nodes of the cluster as it is being loaded in
  36 [19:13] <Edulix> HDFS will split large data files into blocks which are managed by different nodes in the cluster
  37 [19:13] <Edulix> Also replicating data in different nodes, just in case
  38 [19:14] <ClassBot> kim0 asked: Does hadoop require certain "problems" that fits its model ? can I throw random computations to it
  39 [19:15] <Edulix> I'm going to answer that now =)
  40 [19:16] <Edulix> basically, hadoop uses the mapreduce programming paradigm
  41 [19:16] <Edulix> In hadoop, Data is conceptually record-oriented. Input files are split into input splits referring to a set of records.
  42 [19:17] <Edulix> The strategy of the scheduler is moving the computation to the data, i.e. which data will be processed by a node is chosen based on its locality to the node, which results in high performance.
  43 [19:17] <Edulix> Hadoop programs need to follow a particular programming model (MapReduce), which limits the amount of communication, as each individual record is processed by a task in isolation from one another
  44 [19:18] <Edulix> In MapReduce, records are processed in isolation by tasks called Mappers
  45 [19:18] <Edulix> The output from the Mappers is then brought together into a second set of tasks called Reducers
  46 [19:18] <Edulix> where results from different mappers can be merged together
  47 [19:18] <Edulix> Note that if, for example, you don't need the Reduce step, you can implement Map-only processing.
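
The Mapper/Reducer flow just described can be sketched in plain Python. This is a conceptual sketch of the paradigm only, not Hadoop's actual Java API; all names here are illustrative:

```python
# A plain-Python sketch of MapReduce: mappers process records in
# isolation, a "shuffle" groups their output by key, and reducers
# merge all values for each key. (Not Hadoop code -- just the idea.)
from collections import defaultdict

def mapper(record):
    # Each record is processed in isolation; emit (key, value) pairs.
    for word in record.split():
        yield word.lower(), 1

def reducer(key, values):
    # All values emitted for one key are merged together here.
    return key, sum(values)

def run_mapreduce(records):
    # Shuffle step: group mapper output by key before reducing.
    grouped = defaultdict(list)
    for record in records:
        for key, value in mapper(record):
            grouped[key].append(value)
    return dict(reducer(k, v) for k, v in grouped.items())

counts = run_mapreduce(["the show must go on", "the data is big"])
print(counts["the"])  # -> 2
```

A Map-only job, as mentioned above, would simply skip the shuffle and reduce steps and write the mapper output directly.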
  48 [19:19] <Edulix> This simplification makes the Hadoop framework much more reliable, because if a node is slow or crashes, another node can simply replace it, taking the same InputSplit and processing it again
  49 [19:19] <ClassBot> chadadavis asked: Is there any facility for automatically determining how to partition the data, i.e. based on how long one chunk of processing takes?
  50 [19:21] <Edulix> to be able to partition the data,
  51 [19:21] <Edulix> you need to have first a structure for that data. for example,
  52 [19:22] <Edulix> if you have a png image that you need to process, then the input data is the image file. you might partition your image in chunks that start at a given position (x,y) and have a height and a width
  53 [19:22] <Edulix> but the partitioning is usually done by you, the hadoop program developer
  54 [19:23] <Edulix> though hadoop is in charge of selecting where to send that partition, depending on data locality
  55 [19:24] <Edulix> when you partition the input data, you don't send the data (input split) to the node that will process it: ideally it will already have that data!
  56 [19:24] <Edulix> how is this possible?
  57 [19:25] <Edulix> because when you do the partition, the InputSplit only defines this partition (so it might be in the image example 4 numbers: x,y, height, width) and depending on which nodes the file blocks of the input data reside, hadoop will send that split to that node
  58 [19:26] <Edulix> and then the node will open the file in HDFS for reading starting (fseek) in there
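
The point above — a split only *describes* a portion of the data, and the processing node then seeks to it and reads sequentially — can be illustrated with a toy sketch. This uses plain local files and made-up helper names, purely for illustration:

```python
# Toy sketch of the InputSplit idea: a split is just (offset, length)
# metadata, no data is copied around. The node assigned a split opens
# the file, does one initial seek, and then reads sequentially.
import os, tempfile

def make_splits(path, split_size):
    # Describe the work: one (offset, length) pair per chunk.
    size = os.path.getsize(path)
    return [(off, min(split_size, size - off))
            for off in range(0, size, split_size)]

def process_split(path, split):
    offset, length = split
    with open(path, "rb") as f:
        f.seek(offset)           # the one initial seek
        return f.read(length)    # then a purely sequential read

# Demo on a 40-byte temporary file.
fd, path = tempfile.mkstemp()
os.close(fd)
with open(path, "wb") as f:
    f.write(b"0123456789" * 4)
splits = make_splits(path, 16)
print(len(splits))                         # -> 3 splits: 16 + 16 + 8 bytes
print(process_split(path, splits[1])[:4])  # -> b'6789'
```

In real Hadoop the split for the image example would carry (x, y, height, width) instead of a byte range, and the read would go through HDFS rather than the local filesystem.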
  59 [19:26] <Edulix> ok, I continue =)
  60 [19:26] <Edulix> separate nodes in a Hadoop cluster still communicate with one another, implicitly
  61 [19:27] <Edulix> pieces of data can be tagged with key names which inform Hadoop how to send related bits of information to a common destination node
  62 [19:27] <Edulix> Hadoop internally manages all of the data transfer and cluster topology issues
  63 [19:27] <Edulix> One of the major benefits of using Hadoop in contrast to other distributed systems is its flat scalability curve
  64 [19:28] <Edulix> Using other distributed programming paradigms, you might get better results for 2, 5, perhaps a dozen machines. But when you need to go really large scale, this is where hadoop excels
  65 [19:29] <Edulix> After your program is written and functioning on perhaps ten nodes (to test that it can be used in multiple nodes with replication etc and not only in standalone mode),
  66 [19:29] <Edulix> then  very little --if any-- work is required for that same program to run on a much larger amount of hardware efficiently
  67 [19:29] <Edulix> == distributed file system ==
  68 [19:29] <Edulix> a distributed file system is designed to hold a large amount of data and provide access to this data to many clients distributed across a network
  69 [19:30] <Edulix> HDFS is designed to store a very large amount of information, across multiple machines, and also supports very large files
  70 [19:30] <Edulix> some of its requirements are:
  71 [19:30] <Edulix> it should store data reliably even if some machines fail
  72 [19:30] <Edulix> it should provide fast, scalable access to this information
  73 [19:31] <Edulix> And finally it should integrate well with Hadoop MapReduce, allowing data to be read and computed upon locally when possible
  74 [19:31] <Edulix> This last point is crucial. HDFS is optimized for MapReduce, and thus has made some decisions/tradeoffs:
  75 [19:31] <Edulix> Applications that use HDFS are assumed to perform long sequential streaming reads from file because of MapReduce
  76 [19:31] <Edulix> so HDFS is optimized to provide streaming read performance
  78 [19:32] <Edulix> this comes at the expense of random seek times to arbitrary positions in files
  79 [19:32] <Edulix> i.e. when a node reads, it might start reading in the middle of a file, but then it will read byte after byte, not jumping here and there
  80 [19:32] <Edulix> Data will be written to the HDFS once and then read several times; AFAIK there is no file update support
  81 [19:33] <Edulix> due to the large size of files, and the sequential nature of reads, the system does not provide a mechanism for local caching of data
  82 [19:33] <Edulix> data replication strategies combat machines or even whole racks failing
  83 [19:34] <Edulix> hadoop comes configured to have each file block stored on three nodes by default: two in the same rack, and the third on a machine in a different rack
  84 [19:35] <Edulix> if the first rack fails, speed might degrade relatively but information wouldn't be lost
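
The replica-placement policy just described can be sketched as a toy function. Real HDFS placement logic is considerably more involved (it accounts for the writer's node, load, and so on); this hypothetical helper only shows the rack-awareness idea:

```python
# Toy sketch of rack-aware replica placement: two copies in one rack,
# a third in a different rack, so a whole-rack failure loses no data.
# (Illustrative only -- not the actual HDFS placement algorithm.)
def place_replicas(racks, replication=3):
    # racks: {rack_name: [node, node, ...]}
    names = sorted(racks)
    first_rack, other_rack = names[0], names[1]
    placement = racks[first_rack][:2]       # two replicas, same rack
    placement.append(racks[other_rack][0])  # third replica, another rack
    return placement[:replication]

racks = {"rack1": ["node-a", "node-b"], "rack2": ["node-c", "node-d"]}
print(place_replicas(racks))  # -> ['node-a', 'node-b', 'node-c']
```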
  85 [19:35] <Edulix> BTW HDFS design is based on google file system (GFS)
  86 [19:36] <Edulix> and as you have probably guessed, in HDFS files are split into blocks of equal size, stored on DataNodes (machines in the cluster)
  87 [19:36] <ClassBot> gaberlunzie asked: would this sequential access mean hadoop can work with tape?
  88 [19:37] <Edulix> I haven't heard anyone doing such a thing,
  89 [19:37] <Edulix> and I don't think it's a good idea
  90 [19:38] <Edulix> why? because the reads are sequential, but you need to do the first seek to start reading at the point your inputsplit indicates
  91 [19:38] <Edulix> doing this first seek might be too slow for a tape, but I might be completely wrong  here
  92 [19:39] <Edulix> note that the data stored in HDFS is supposed to be temporary, mostly, just for working
  93 [19:39] <Edulix> so you copy the data there, do your thing, then copy the output result back
  94 [19:39] <Edulix> in contrast, tapes are mostly used for long-term storage
  95 [19:40] <Edulix> (continuing) default block size in HDFS is very large (64MB)
  96 [19:40] <Edulix> This decreases metadata overhead and allows for fast streaming reads of data
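
A quick back-of-the-envelope calculation shows why the large block size keeps metadata overhead low — the NameNode must track every block of every file:

```python
# Why 64 MB blocks keep NameNode metadata small: compare the number of
# blocks needed for a 1 TB file at 64 MB vs a typical 4 KB page size.
def block_count(file_size, block_size):
    return -(-file_size // block_size)  # ceiling division

TB = 1024 ** 4
print(block_count(TB, 64 * 1024 ** 2))  # -> 16384 block entries to track
print(block_count(TB, 4 * 1024))        # -> 268435456: far too much metadata
```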
  97 [19:41] <Edulix> Because HDFS stores files as a set of large blocks across several machines, these files are not part of the ordinary file system
  98 [19:41] <Edulix> For each DataNode machine, the blocks it stores reside in a particular directory managed by the DataNode service, and these blocks are stored as files whose filenames are their blockid
  99 [19:41] <Edulix> HDFS comes with its own utilities for file management equivalent to ls, cp, mv, rm, etc
 101 [19:42] <Edulix> The metadata of the files (names of files and dirs, and where the blocks are stored) can be modified by multiple clients concurrently. To orchestrate this, metadata is stored and handled by the NameNode, which usually keeps it in memory (it's not much data), so that it's fast (because this data *will* be accessed randomly).
 102 [19:43] <ClassBot> chadadavis asked: If I first have to copy the data (e.g. from a DB) to HDFS before splitting, couldn't the mappers just pull/query the data directly from the DB as well?
 103 [19:43] <Edulix> yes you can =)
 104 [19:43] <Edulix> and if the data is in a DB, you should
 105 [19:45] <Edulix> input data is read from an InputFormat
 106 [19:45] <Edulix> and there are different input formats provided by hadoop: FileInputFormat for example to read from a single file
 107 [19:45] <Edulix> but there's also DBInputFormat, for example
 108 [19:46] <Edulix> in my experience, you will probably create your own =)
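
Conceptually, an InputFormat does two jobs: produce the splits, and provide a record reader for each split. The real interface is the Java class org.apache.hadoop.mapreduce.InputFormat; this plain-Python sketch with made-up names only shows the shape of it:

```python
# Conceptual sketch of an InputFormat: get_splits() describes the work,
# record_reader() yields the records covered by one split.
# (Illustrative only -- the real API is Java's InputFormat class.)
class LineInputFormat:
    def get_splits(self, lines, lines_per_split):
        # Return descriptions of work, not the data itself.
        return [(i, min(lines_per_split, len(lines) - i))
                for i in range(0, len(lines), lines_per_split)]

    def record_reader(self, lines, split):
        # Yield the records (here: lines) belonging to one split.
        start, count = split
        yield from lines[start:start + count]

fmt = LineInputFormat()
lines = ["a", "b", "c", "d", "e"]
splits = fmt.get_splits(lines, 2)
print(len(splits))                               # -> 3
print(list(fmt.record_reader(lines, splits[-1])))  # -> ['e']
```

Writing your own InputFormat mostly means deciding what a "record" is for your data and how to describe a split of it cheaply.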
 109 [19:46] <Edulix> Deliberately I haven't explained any code, but I recommend that if you're interested you start playing with hadoop locally on your own machine
 110 [19:47] <Edulix> just download hadoop from and follow the quickstart
 111 [19:47] <Edulix> for quickstart and for development, you typically use hadoop as standalone in your own machine
 112 [19:47] <Edulix> in this case HDFS will simply refer to your own file system
 113 [19:47] <Edulix> You just need to download hadoop, configure Java (because hadoop is written in java), and execute the example as mentioned in the quickstart page
 114 [19:47] <Edulix> as mentioned earlier, with hadoop you usually operate as follows, because of its batching nature: you copy input data to HDFS, then request to launch the hadoop task with an output dir, and when it's done, the output dir will have the task results
 115 [19:48] <Edulix> For starting to develop a hadoop app, I used this tutorial because it explains pretty much everything I needed
 116 [19:48] <Edulix> but note that it's a bit old
 117 [19:48] <Edulix> and one of the things that I found most frustrating in hadoop while developing was that there are duplicated classes, e.g. org.apache.hadoop.mapreduce.Job and org.apache.hadoop.mapred.jobcontrol.Job
 118 [19:49] <Edulix> In that case, always use org.apache.hadoop.mapreduce, because it is the new, improved API
 119 [19:49] <Edulix> be warned, the examples in the tutorial use the old mapred api :P
 120 [19:50] <Edulix> and hey, now I'm open to even more questions !
 121 [19:51] <Edulix> and if you have questions later on, you can always join us in #hadoop, and hopefully someone will help you there =)
 122 [19:51] <ClassBot> There are 10 minutes remaining in the current session.
 123 [19:52] <ClassBot> gaberlunzie asked: does hadoop have formats to read video (e.g., EDLs and AAFs)?
 124 [19:52] <Edulix> most probably.. not, but maybe someone has done that before
 125 [19:52] <Edulix> anyway, creating a new input format is really easy
 126 [19:53] <ClassBot> chadadavis asked: Mappers can presumably also be written in something other than Java? Are there APIs for other languages (e.g. Python?) Or is managed primarily at the shell level?
 127 [19:54] <Edulix> good question!
 128 [19:54] <Edulix> yes, there are examples in python and in C++
 129 [19:55] <Edulix> I haven't used them though
 130 [19:55] <ClassBot> kim0 asked: Can I use hadoop to crunch lots of data running on Amazon EC2 cloud ?
 131 [19:55] <Edulix> heh I forgot to mention it =)
 132 [19:56] <Edulix> answer is yes!
 133 [19:56] <Edulix> more details in
 134 [19:56] <ClassBot> There are 5 minutes remaining in the current session.
 135 [19:56] <Edulix> that's one of the nice things of using hadoop: many big players in the industry use it. yahoo, for example, and amazon has support for it too
 136 [19:57] <Edulix> so you don't need to really have lots of machines for doing large computation
 137 [19:57] <Edulix> just use amazon =)
 138 [19:57] <ClassBot> gaberlunzie asked: is there a hadoop format repository?
 139 [19:58] <Edulix> I don't know huh
 140 [19:58] <Edulix> :P
 141 [19:58] <Edulix> I didn't investigate much about this because I needed to have my own
 142 [19:58] <Edulix> but probably in contrib there is
 143 [20:00] <Edulix> ok so that's it!
 144 [20:00] <Edulix> Thanks for attending the talk, and thanks to the organizers
 145 === ChanServ changed the topic of #ubuntu-classroom to: Welcome to the Ubuntu Classroom - || Support in #ubuntu || Upcoming Schedule: || Questions in #ubuntu-classroom-chat || Event: Ubuntu Cloud  Days - Current Session: UEC/Eucalyptus Private Cloud - Instructors: obino
 146 [20:01]  * kim0 claps .. Thanks Edulix 

UbuntuCloudDays/23032011/UsingHadoopDivideAndConquer (last edited 2011-03-26 17:58:14 by nigelbabu)