HadoopPigSpec

Launchpad Entry: server-maverick-hadoop-pig
Created: May 19th, 2010
Contributors: ThierryCarrez
Packages affected: hadoop, hbase

Summary

Apache Hadoop stack improvements for the Maverick cycle.

Release Note

Ubuntu 10.10 comes with support for the Hadoop family of projects: hadoop, hbase and pig are now available through the Ubuntu archives.

Rationale

The Apache Hadoop project family provides reliable, scalable, distributed computing that is suitable for cloud workloads. As the distribution of choice for cloud environments, Ubuntu Server Edition needs to support this stack.

User stories

As a systems developer, I want to deploy a complete Hadoop infrastructure. I use the packaging available in 10.10 and everything can be installed easily.

As a Hadoop user, I want to produce sequences of Map-Reduce programs. I install the pig package and am able to compile such programs.

Assumptions

None.

Design

Scope

The main contenders are:

Hadoop Core / HDFS / Mapreduce: the core of the hadoop system (See RFH in Debian)
Pig: A high-level data-flow language and execution framework for parallel computation
Zookeeper: A high-performance coordination service for distributed applications (See RFH in Debian)

Also part of this spec, though it is now another top-level Apache project:

HBase: Scalable, distributed database that supports structured data storage for large tables (thkoch: HBase comes with a patched Hadoop. These patches are not yet included in Hadoop/Debian. See "Hadoop support for hbase" (2010/05/07) in general@hadoop.apache.org. Not sure yet whether they should be included at all.)

Other Hadoop subprojects are maturing and should be considered for future releases:

Chukwa: A data collection system for managing large distributed systems (thkoch: My colleague played with Chukwa for two days and does not consider it usable. The documentation is not current. It also doesn't seem to offer anything that Ganglia couldn't do.)

Current situation

Which upstream?

The Apache Hadoop stable distribution (0.20) has a few noticeable shortcomings and doesn't play nicely with HBase. The 0.21 codebase, which is supposed to fix those issues, is under development and highly unstable right now. That is why Yahoo (a large Hadoop user) and Cloudera (a Hadoop solutions provider) each maintain their own distribution of a patched 0.20 codebase.

In coordination with Debian, we need to evaluate all potential codebases and pick one that is both maintainable and usable. From early contacts, it appears that most of the Cloudera patchset consists of backports of Apache JIRA fixes, which might make it a sustainable alternative over the long run.

Hadoop

Packaged in Debian (Apache codebase, 0.20.2), current, provides hadoop core, hdfs and mapreduce.
Rewrote the build.xml to exclude org.apache.hadoop.fs.{kfs,s3native,s3} to avoid packaging jets3t and kfs
Moving to main: 8 dependencies in universe
- commons-el
- xmlenc
- lucene2
  - commons-digester
  - icu4j
  - commons-compress
  - jtidy
  - db-je
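
The build.xml exclusion noted above is written in shell-style brace shorthand. As a rough illustration only, the sketch below expands that shorthand into the path-style patterns an Ant exclude would typically use. The helper names expand_braces and ant_excludes are hypothetical, and the actual Debian build.xml may express the exclusion differently.

```python
import re


def expand_braces(pattern):
    """Expand one {a,b,c} group: 'x.{a,b}' -> ['x.a', 'x.b']."""
    match = re.search(r"\{([^{}]*)\}", pattern)
    if match is None:
        return [pattern]
    head, tail = pattern[:match.start()], pattern[match.end():]
    return [head + option + tail for option in match.group(1).split(",")]


def ant_excludes(package_pattern):
    """Turn dotted Java package names into Ant-style path patterns.

    Assumption: sources follow the usual package-to-directory layout,
    so excluding 'a.b.c' means excluding everything under 'a/b/c/'.
    """
    return [pkg.replace(".", "/") + "/**" for pkg in expand_braces(package_pattern)]


if __name__ == "__main__":
    for path in ant_excludes("org.apache.hadoop.fs.{kfs,s3native,s3}"):
        print(path)
```

Each resulting pattern names one directory tree, so excluding the s3 package does not accidentally also cover s3native; both are listed explicitly.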

HBase

Packaged in Debian (0.20.4), current.
Build patch disables thrift (to avoid packaging thrift)
The old REST API is deprecated and was excluded due to license issues with the json.org dependency. The license issue has been resolved, but the old REST API will soon be removed from trunk.
The new REST API (stargate) requires jersey and jaxb, which have many dependencies. See "jersey (stargate dependency) is insane!" in hbase-dev@hadoop.apache.org.
Shell feature depends on JRuby (multiverse)

Zookeeper

Packaged in Debian and Ubuntu (3.3.0; 3.3.1 is on mentors). Upstream very recently released 3.3.1.

Pig

Not packaged in Debian
Depends on hbase
Option 1: binary distribution to multiverse
Option 2: proper packaging to universe
- Ivy build, needs to be reimplemented as build.xml
- Disable build of contrib/piggybank to avoid packaging Jackson (thkoch: libjackson-json is on mentors)

Proposed objectives

Push hadoop and hbase in current form to Ubuntu: DONE as synced from Debian.
Use alternatives to manage hadoop jar files.
- Given that most patches are applied to build jar files and that all jars are shipped in libhadoop-java (and its dependencies), the proposal is to use libhadoop-java as the integration point for different patch sets. The alternatives system is used for each jar, and each patchset can provide its own jars:
  - libhadoop-java
  - libhadoop-cloudera-java
  - libhadoop-yahoo-java
  The rest of the packages are the ones from Debian as they provide integration with the system (init scripts, etc...).
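
As a rough sketch of the alternatives proposal above, the snippet below builds the update-alternatives --install invocations a patchset package might run from its postinst, one per jar. Everything specific here is an assumption made for illustration: the jar name hadoop-core.jar, the /usr/share/java link directory, the /usr/lib/<package> install location and the priority values are not taken from the actual packages.

```python
# Hypothetical priorities: the plain Apache build wins by default.
PATCHSET_PRIORITIES = {
    "libhadoop-java": 50,
    "libhadoop-cloudera-java": 40,
    "libhadoop-yahoo-java": 30,
}


def alternative_commands(package, jars, priority,
                         link_dir="/usr/share/java", lib_root="/usr/lib"):
    """Build one 'update-alternatives --install' command per jar.

    Argument order follows update-alternatives:
    --install <link> <name> <path> <priority>
    """
    commands = []
    for jar in jars:
        name = jar[:-len(".jar")] if jar.endswith(".jar") else jar
        commands.append([
            "update-alternatives", "--install",
            f"{link_dir}/{jar}",            # generic link shared by all patchsets
            name,                           # alternative group name
            f"{lib_root}/{package}/{jar}",  # jar shipped by this patchset (assumed path)
            str(priority),
        ])
    return commands


if __name__ == "__main__":
    for package, priority in PATCHSET_PRIORITIES.items():
        for command in alternative_commands(package, ["hadoop-core.jar"], priority):
            print(" ".join(command))
```

With this layout an administrator could switch patchsets per jar through the usual update-alternatives --config mechanism, while the Debian packages providing init scripts keep pointing at the generic link.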
Rewrite pig build system to build without ivy
Properly package pig for universe
Evaluate potential improvements to Hadoop/HBase/Zookeeper packages, work with Debian (thkoch: I'll be on the Debian-Mini-Conf and berlinbuzzwords.de in Berlin)
Consider moving hadoop to main: DEFERRED for Maverick.

Implementation

See work items on server-maverick-hadoop-pig whiteboard.

Test/Demo Plan

tbd

Unresolved issues

Codebase selection, which impacts the quantity of work to be done on that stack.

BoF agenda and discussion

UDS discussion notes

Worst-case scenario: have everything available in multiverse.

Hadoop Core
- In Debian. Will be pulled into maverick.
- Discussion of potential usability improvements:
  - provide a working default configuration
  - how to distribute the configuration files to the systems that are part of the hadoop cluster:
    - distribute ssh public keys
  - 3 roles: same configuration files.
    - add a debconf question.
    - create 3 binary packages that install the configuration files for each type of role.
  - tasks during installer: maybe.
- Consider moving to main:
  - 8 build dependencies
- Cloudera patchset status (supposed to be upstream) (thkoch: I've had a look over the cloudera patches and decided not to use any in the end. Most of them add new functionality/subprojects)
Hbase
- Not in Debian yet, but soon: http://git.debian.org/?p=pkg-java/hbase.git
  - uploaded to Debian. Should be pulled into maverick.
- Unpackaged dependencies: AgileJSON, Thrift
  - http://bugs.debian.org/cgi-bin/bugreport.cgi?bug=524135
  - http://git.debian.org/?p=users/eevans/thrift.git
- Dependencies in multiverse: JRuby (about 10+ missing build-deps)
  - which ones? AFAICT jruby has 4 build-deps all in main. --mathiaz
  - the jruby orig tarball ships many prebuilt jars -- twerner
Pig
- Not in Debian
- Depends on hadoop, hbase, zookeeper-hbase
- Target for multiverse in 10.10 with a midterm plan to move to universe/main?
Zookeeper:
- In Ubuntu universe
- Usability improvements? Move to main?
Hive: Defer.
HDFS:
- see https://blueprints.launchpad.net/ubuntu/+spec/server-maverick-cloud-gluster/
- part of Hadoop Core - is it enabled/built in the Debian packages?
MapReduce:
- available in the source package.
- is it enabled/built in the Debian packages?

Test whether hadoop works with OpenJDK.

CategorySpec

HadoopPigSpec (last edited 2010-06-09 22:42:04 by dsl-173-206-3-81)