HadoopPigSpec
Launchpad Entry: server-maverick-hadoop-pig
Created: May 19th, 2010
Contributors: ThierryCarrez
Packages affected: hadoop, hbase
Summary
Apache Hadoop stack improvements for the Maverick cycle.
Release Note
Ubuntu 10.10 ships with support for the Hadoop family of projects: hadoop, hbase and pig are now available from the Ubuntu archive.
Rationale
The Apache Hadoop project family provides reliable, scalable, distributed computing that is well suited to cloud workloads. As the distribution of choice for cloud environments, Ubuntu Server Edition needs to support this stack.
User stories
As a systems developer, I want to deploy a complete Hadoop infrastructure. I use the packages available in 10.10 and everything installs easily.
As a Hadoop user, I want to produce sequences of MapReduce programs. I install the pig package and am able to compile such programs.
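To make "sequences of MapReduce programs" concrete, here is a minimal in-memory sketch in Python of two chained map-reduce stages, the kind of pipeline a Pig script would be compiled into. It does not use Hadoop or Pig; the helper `run_mapreduce` and the sample data are hypothetical and only illustrate the data flow.

```python
from collections import defaultdict


def run_mapreduce(records, mapper, reducer):
    """Run one map-reduce stage in memory: map, group by key, then reduce."""
    groups = defaultdict(list)
    for record in records:
        for key, value in mapper(record):
            groups[key].append(value)
    # Sort by key so the result is deterministic.
    return [reducer(key, values) for key, values in sorted(groups.items())]


# Stage 1: count occurrences of each word.
def split_words(line):
    return [(word.lower(), 1) for word in line.split()]


def sum_counts(word, ones):
    return (word, sum(ones))


# Stage 2: group words by how often they occur.
def key_by_count(pair):
    word, count = pair
    return [(count, word)]


def collect_words(count, words):
    return (count, sorted(words))


# Hypothetical input; the output of stage 1 is the input of stage 2.
lines = ["pig hadoop pig", "hbase hadoop pig"]
word_counts = run_mapreduce(lines, split_words, sum_counts)
by_frequency = run_mapreduce(word_counts, key_by_count, collect_words)
print(by_frequency)
```

Writing and wiring such stages by hand is the work the second user story expects the pig package to take over.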
Assumptions
None.
Design
Scope
The main contenders are:
- Hadoop Core / HDFS / MapReduce: the core of the Hadoop system
- Pig: A high-level data-flow language and execution framework for parallel computation
- ZooKeeper: A high-performance coordination service for distributed applications
Also part of this spec, though it is now another top-level Apache project:
- HBase: Scalable, distributed database that supports structured data storage for large tables
Other Hadoop subprojects are maturing and should be considered for future releases:
- Chukwa: A data collection system for managing large distributed systems
Current situation
Hadoop
- Packaged in Debian (0.20.2, current); provides Hadoop Core, HDFS and MapReduce.
- The package rewrites build.xml to exclude org.apache.hadoop.fs.{kfs,s3native,s3}, which avoids having to package jets3t and kfs
- Moving to main: 8 dependencies are in universe
  - commons-el
  - xmlenc
  - lucene2
  - commons-digester
  - icu4j
  - commons-compress
  - jtidy
  - db-je
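The constraint behind that dependency list can be sketched in a few lines of Python. This is a minimal illustration, assuming the usual archive rule that a package in main may only depend on packages that are also in main. The component values come from the list above; the function name `main_inclusion_blockers` is hypothetical and nothing here queries the real archive.

```python
# Components of Hadoop's dependencies, as recorded in the list above.
DEPENDENCY_COMPONENTS = {
    "commons-el": "universe",
    "xmlenc": "universe",
    "lucene2": "universe",
    "commons-digester": "universe",
    "icu4j": "universe",
    "commons-compress": "universe",
    "jtidy": "universe",
    "db-je": "universe",
}


def main_inclusion_blockers(components):
    """Return the sorted names of dependencies that are not yet in main."""
    return sorted(
        name for name, component in components.items() if component != "main"
    )


blockers = main_inclusion_blockers(DEPENDENCY_COMPONENTS)
print(f"{len(blockers)} dependencies to promote: {', '.join(blockers)}")
```

Each name returned would need its own main inclusion review before hadoop itself could move.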
HBase
- Packaged in Debian (0.20.4, current).
- A build patch disables Thrift (to avoid packaging Thrift) and REST (reason not yet determined)
- The shell feature depends on JRuby (multiverse)
Zookeeper
- Packaged in Debian and Ubuntu (3.3.0); upstream very recently released 3.3.1
Pig
- Not packaged in Debian
- Depends on hbase
- Option 1: binary distribution to multiverse
- Option 2: proper packaging to universe
  - Ivy build, needs to be reimplemented as build.xml
  - Disable build of contrib/piggybank to avoid packaging Jackson
Proposed objectives
- Push hadoop and hbase in current form to Ubuntu
- Rewrite pig build system to build without ivy
- Properly package pig for universe
- Evaluate potential improvements to Hadoop/HBase/Zookeeper packages, work with Debian
- Consider moving hadoop to main
Implementation
See work items on server-maverick-hadoop-pig whiteboard.
Test/Demo Plan
tbd
Unresolved issues
tbd
BoF agenda and discussion
UDS discussion notes
Worst-case scenario: have everything available in multiverse.
- Hadoop Core
  - In Debian. Will be pulled into Maverick.
  - Discussion of potential usability improvements:
    - provide a working default configuration
    - how to distribute the configuration files to the systems that are part of the Hadoop cluster:
      - distribute ssh public keys
      - 3 roles: same configuration files.
        - add a debconf question.
        - create 3 binary packages that install the configuration files for each type of role.
    - tasks during installer: maybe.
  - Consider moving to main:
    - 8 build dependencies
    - Cloudera patchset status (supposed to be upstream)
- HBase
  - Not in Debian yet, but soon: http://git.debian.org/?p=pkg-java/hbase.git
  - Uploaded to Debian. Should be pulled into Maverick.
  - Unpackaged dependencies: AgileJSON, Thrift
    - http://bugs.debian.org/cgi-bin/bugreport.cgi?bug=524135
    - http://git.debian.org/?p=users/eevans/thrift.git
  - Dependencies in multiverse: JRuby (about 10+ missing build-deps)
    - which ones? AFAICT jruby has 4 build-deps all in main. --mathiaz
    - the jruby orig tarball ships many prebuilt jars -- twerner
- Pig
  - Not in Debian
  - Depends on hadoop, hbase, zookeeper-hbase
  - Target multiverse in 10.10, with a mid-term plan to move to universe/main?
- ZooKeeper:
  - In Ubuntu universe
  - Usability improvements? Move to main?
- Hive: Defer.
- HDFS:
  - see https://blueprints.launchpad.net/ubuntu/+spec/server-maverick-cloud-gluster/
  - part of Hadoop Core: is it enabled/built in the Debian packages?
- MapReduce:
  - available in the source package.
  - is it enabled/built in the Debian packages?
Test whether Hadoop works with OpenJDK.
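One way to read the "working default configuration" and "3 roles: same configuration files" notes is that a single set of site files, generated once, gets installed on every node whatever its role. The Python sketch below renders such files. It is only an illustration under stated assumptions: the property names `fs.default.name`, `mapred.job.tracker` and `dfs.replication` are the Hadoop 0.20 ones, while the host name, the ports and the helper names `render_site_xml` and `cluster_config` are hypothetical and not taken from the packages.

```python
import xml.etree.ElementTree as ET


def render_site_xml(properties):
    """Render a dict of properties as a Hadoop <configuration> document."""
    configuration = ET.Element("configuration")
    for name, value in properties.items():
        prop = ET.SubElement(configuration, "property")
        ET.SubElement(prop, "name").text = name
        ET.SubElement(prop, "value").text = str(value)
    return ET.tostring(configuration, encoding="unicode")


def cluster_config(master_host):
    """Build the site files that would be shared by every node.

    Ports 8020 and 8021 and the replication factor are assumptions
    chosen for the example, not values from the Debian packages.
    """
    return {
        "core-site.xml": render_site_xml(
            {"fs.default.name": f"hdfs://{master_host}:8020"}
        ),
        "mapred-site.xml": render_site_xml(
            {"mapred.job.tracker": f"{master_host}:8021"}
        ),
        "hdfs-site.xml": render_site_xml({"dfs.replication": 1}),
    }


# Hypothetical master host; in a package this could come from a debconf answer.
files = cluster_config("hadoop-master.example")
for filename, content in files.items():
    print(filename)
    print(content)
```

Because every role receives identical files, the debconf question would only need to ask for the master host; per-role binary packages would then differ in which daemons they start rather than in configuration content.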
HadoopPigSpec (last edited 2010-06-09 22:42:04 by dsl-173-206-3-81)