New Project: Hadoop9000 
For the benefit of folks who will be taking our Hadoop training, or who are interested in having a copy of the latest stable Hadoop (one they can debug / tweak / re-build on their own), we have just created the "Hadoop 9000" Project.

Weighing in at 3.5GB, the VirtualBox Virtual Machine (OVA Appliance) will be uploaded sometime this week.



Setting-up Hadoop 2 on Linux in 3 Easy Steps 
When the time came to pick a distro for our Hadoop students, I decided on Ubuntu: the Hortonworks VM already uses CentOS (love it), so Ubuntu rounds out the student experience a bit.

Hadoop 2.2.0 is still about Java. If you are thinking production, be sure to use Oracle Java 1.6.

STEP 01: Source Install (OPTIONAL)

When you want to set up the Hadoop source code for spelunking on Linux (first-timers will want to skip this step!), track to an LTS Ubuntu release and do the following:
sudo bash
apt-get update
apt-get upgrade
apt-get install maven git
mkdir /hadoop9000
cd /hadoop9000
git clone git://
cd hadoop-common
mvn install -DskipTests
mvn eclipse:eclipse -DdownloadSources=true -DdownloadJavadocs=true
chown -R guest /hadoop9000 (or whichever user)

The above will build the latest Hadoop on your machine. To keep your production work moving along on the Hadoop 2 lowest common denominator, also consider using the official binary install.

DEVELOPER NOTE: If you are looking to use your install for software development, the above step is not optional. Why? Because the native-mode libraries (as well as the rest of the lot) need to be generated for your R&D platform. Depending upon your version of Linux, you may also need to install projects like Protocol Buffers (etc.) to compile Hadoop's C/C++ JNI underpinnings. Once created, just copy the lot of the 50-or-so libraries into /hadoop9000/lib. Why? Because you will want to use those JARs in your IDE (Eclipse, NetBeans, etc.) from a single standard location.
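That "copy the lot into one place" step can be sketched in a few lines of shell. The /hadoop9000 paths below just follow the layout used above; adjust to taste:

```shell
# Hypothetical sketch: gather the freshly built Hadoop JARs into one lib dir.
# SRC/DEST defaults are assumptions based on the /hadoop9000 layout above.
SRC="${SRC:-/hadoop9000/hadoop-common}"
DEST="${DEST:-/hadoop9000/lib}"
mkdir -p "$DEST" 2>/dev/null || true
# every JAR the maven build dropped under a target/ directory, copied flat
find "$SRC" -path '*/target/*' -name '*.jar' -exec cp {} "$DEST" \; 2>/dev/null || true
ls "$DEST" 2>/dev/null | wc -l    # roughly 50 on a full build
```

Pointing your IDE's build path at that single directory then beats hunting JARs across dozens of maven target/ folders.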

STEP 02: Official Binary Install (REQUIRED)

If rebuilding from source is not what you want to do, then you can simply download and untar the Hadoop tarball under /hadoop9000.
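The binary install boils down to unpacking the tarball. A minimal sketch, assuming you have already fetched hadoop-2.2.0.tar.gz from an Apache mirror (file name and destination are assumptions based on the layout above):

```shell
# Sketch: unpack the (already downloaded) Hadoop 2.2.0 binary tarball.
# TARBALL name and /hadoop9000 destination are assumptions.
TARBALL="hadoop-2.2.0.tar.gz"
DEST="${DEST:-/hadoop9000}"
if [ -f "$TARBALL" ]; then
    tar -xzf "$TARBALL" -C "$DEST"   # unpacks into $DEST/hadoop-2.2.0
    ls "$DEST/hadoop-2.2.0/bin"      # hadoop, hdfs, yarn, ...
else
    echo "fetch $TARBALL from an Apache mirror first"
fi
```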

Note that if gzip compression is on the radar, then we will also need to be sure to provide the proper 32- / 64-bit rendition of Hadoop's native libraries. (Step 01 can build those native libraries for us, too. Use: mvn compile -Pnative )
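One quick way to check which rendition you have is to ask `file` about the native library. The path below is an assumption based on the /hadoop9000 layout:

```shell
# Check the word size of the native Hadoop library (path is an assumption).
NATIVE="${NATIVE:-/hadoop9000/lib/native/libhadoop.so}"
uname -m                      # x86_64 => you want the 64-bit rendition
if [ -e "$NATIVE" ]; then
    file "$NATIVE"            # reports "ELF 32-bit ..." or "ELF 64-bit ..."
else
    echo "no native library at $NATIVE (build one via: mvn compile -Pnative)"
fi
```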

STEP 03: Hadoop Environment Variables (REQUIRED)

Next, those Hadoop environment variables need to be wired into your .bashrc (JAVA_HOME and HADOOP_HOME are required; the rest are optional):
export JAVA_HOME="/usr/lib/jvm/java-6-oracle"
export HADOOP_HOME="/hadoop9000"

export HIVE_INSTALL="$HADOOP_HOME/hive/hive-0.12.0-bin"
# Can also place into
#export HADOOP_OPTS="$HADOOP_OPTS -Djava.library.path=$HADOOP_HOME/lib/"
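After sourcing the new .bashrc, a quick sanity check confirms the wiring. This is a hedged sketch (the paths assume the layout above, and it tolerates a box where Hadoop is not unpacked yet):

```shell
# Sanity-check the .bashrc wiring (safe on a box without Hadoop installed).
. ~/.bashrc 2>/dev/null || true
echo "JAVA_HOME   = ${JAVA_HOME:-<unset>}"
echo "HADOOP_HOME = ${HADOOP_HOME:-<unset>}"
if [ -x "$HADOOP_HOME/bin/hadoop" ]; then
    "$HADOOP_HOME/bin/hadoop" version   # should report Hadoop 2.2.0
else
    echo "hadoop not found under HADOOP_HOME yet"
fi
```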
Yes, Oracle Java (/usr/lib/jvm/java-6-oracle) is officially endorsed as the best real-world way to go on Hadoop.

Chasing the Moths

When debugging, do not forget to edit the

This should get you started. If you need to learn more, then consider signing-up for our next week of virtual training on Hadoop.



Linux Disk Dump (35MB/Sec) 
Last week I noticed that my main drive was showing some classic boot-up warning signs. Rather than risk another 4-hour re-install session (followed by a 6-hour content restoration), we decided to get a 2TB Toshiba backup drive.

It arrived skewed. After checking around, lots of my friends at the local computer-repair places were saying that the Toshiba drives have even worse failure rates than Western Digital. (So much for saving $20!)

So it was back to the Internet to get yet another 2TB Seagate Barracuda: tried and true, the 7200 RPM drive is well worth an extra $20.00 (lol). Better still, by matching the existing drive, our confidence in the anticipated manual swap-out was absolute.


The copy finished in well under 20 hours (I was out Christmas shopping when the `dd` completed), so the note-to-self is to use the POSIX `time` command next time around.
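For next time, wrapping `dd` in `time` is a one-liner. The bs=1M below is an assumption (dd defaults to 512-byte blocks, which can throttle throughput on big copies); the demo reads /dev/zero so it is safe to run anywhere:

```shell
# Safe demo: time a 1GB dd between pseudo-devices.
# bs=1M is an assumption -- tune the block size for your drives.
time dd if=/dev/zero of=/dev/null bs=1M count=1024
# The real drive-to-drive swap-out would look like:
#   time sudo dd if=/dev/sda of=/dev/sdb bs=1M conv=noerror,sync
```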

More officially, however, here are the stats for the disk-dumping of 2TB's worth of data between two "identical" hard drives:

dd if=/dev/sda of=/dev/sdb
3907029168+0 records in
3907029168+0 records out
2000398934016 bytes (2.0 TB) copied, 57876.7 s, 34.6 MB/s

(Yea, sounds a bit off to me, too. -But we were running both live & hot.)

Regardless, the next time I dee-dee this locus, we might update this entry with the actual `time` (man time) results. Yet for now there is enough info above (round to 35MB/sec; we were not that hot) for others to do the math for their own `dd` ops.
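The math itself is just bytes-copied over seconds-elapsed, straight from the dd output above:

```shell
# 2,000,398,934,016 bytes in 57,876.7 seconds, expressed in MB/s
awk 'BEGIN { printf "%.1f MB/s\n", 2000398934016 / 57876.7 / 1000000 }'
# -> 34.6 MB/s, matching dd's own report
```

Swap in your own byte count and elapsed seconds to size up any `dd` run the same way.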


