The National Institute for Computational Sciences

Software

Title

Category: APPLICATIONS > MACHINE LEARNING / BIG DATA

Description

The Apache Hadoop software library is a framework that allows for the distributed processing of large data sets across clusters of computers using simple programming models. It is designed to scale up from single servers to thousands of machines, each offering local computation and storage. Rather than rely on hardware to deliver high availability, the library itself is designed to detect and handle failures at the application layer, thereby delivering a highly available service on top of a cluster of computers, each of which may be prone to failure.

Use

Hadoop is available as a module. The module file sets the necessary environment variables for Hadoop and provides two commands, cluster_start and cluster_stop, to start and stop a Hadoop cluster with a minimum of 3 nodes.

The module file also sets up the environment for Mahout, a scalable machine learning library, and Hive, a data warehouse built on Hadoop.
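
Because the cluster needs at least 3 nodes, a job script can check its allocation before starting Hadoop. The following is a minimal sketch, assuming the scheduler provides the standard PBS node file in $PBS_NODEFILE; the function names count_unique_nodes and check_min_nodes are hypothetical helpers, not part of the hadoop module.

```shell
# Count distinct hosts in a PBS node file (hosts may be listed more than once).
count_unique_nodes() {
    sort -u "$1" | wc -l | tr -d ' '
}

# Return non-zero and print a message if fewer than $2 distinct nodes are listed in $1.
check_min_nodes() {
    nodes_found=$(count_unique_nodes "$1")
    if [ "$nodes_found" -lt "$2" ]; then
        echo "Need at least $2 nodes, found $nodes_found" >&2
        return 1
    fi
}

# Example use inside a job script, before starting the cluster:
# check_min_nodes "$PBS_NODEFILE" 3 || exit 1
```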

IMPORTANT: By default, HDFS is set up on local SSD storage, and all data stored there is purged when the job finishes.
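
Since HDFS contents do not survive the job, any results worth keeping have to be copied out of HDFS before the cluster is stopped. The following is a minimal sketch using the standard hdfs dfs -get command; the function name save_hdfs_output and both example paths are hypothetical placeholders to adapt to your own job.

```shell
# Copy a directory out of HDFS to a persistent file system.
#   $1: source directory in HDFS
#   $2: destination directory outside HDFS (created if missing)
save_hdfs_output() {
    mkdir -p "$2" || return 1
    hdfs dfs -get "$1" "$2"
}

# Example call, placed in the job script before the cluster is stopped
# (paths are assumptions for illustration only):
# save_hdfs_output "/user/$USER/output" "$PBS_O_WORKDIR/results"
```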

An example PBS script, Hadoop_example.sh, that starts a Hadoop cluster and runs a simple Hive example:

#PBS -A your-account-number
#PBS -j oe
#PBS -l nodes=6 
#PBS -l walltime=1:00:00

module load hadoop/2.5.0

# Start the Hadoop cluster with one name node,
# one secondary name node plus resource manager and job history manager,
# and four data nodes plus node managers
cluster-start

#hive example
wget http://files.grouplens.org/datasets/movielens/ml-100k.zip
unzip ml-100k.zip
cd ml-100k
cat << _EOF_ > hive-script.sql
CREATE TABLE u_data (
  userid INT,
  movieid INT,
  rating INT,
  unixtime STRING)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\t'
STORED AS TEXTFILE;

LOAD DATA LOCAL INPATH './u.data'
OVERWRITE INTO TABLE u_data;

SELECT COUNT(*) FROM u_data;
_EOF_

hive -f hive-script.sql

#stop hadoop cluster
cluster-stop

To use Hadoop configuration files tailored to your application, redirect the configuration directory by setting

export HADOOP_CONF_DIR=/path/to/your/configuration/files

after loading the hadoop module. User-specific configuration files should follow the same format as the template files provided in $HADOOP_HOME/etc/hadoop/template.
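
One way to start a custom configuration is to copy the provided templates into a directory you own, edit the copies, and point HADOOP_CONF_DIR at that directory. The following is a minimal sketch under that assumption; the function name make_conf_dir and the destination path in the example call are hypothetical.

```shell
# Copy the template configuration files into a private directory
# and make Hadoop use that directory.
#   $1: template directory
#   $2: destination directory for your own configuration files
make_conf_dir() {
    mkdir -p "$2" || return 1
    cp -r "$1"/. "$2"/ || return 1
    export HADOOP_CONF_DIR="$2"
}

# Example call after loading the hadoop module
# (the destination path is an assumption for illustration only):
# make_conf_dir "$HADOOP_HOME/etc/hadoop/template" "$HOME/hadoop-conf"
```

Edit the copied files before starting the cluster so that the changes are in place when Hadoop reads its configuration.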

Support

This package has the following support level: Supported

Available Versions

Version Available Builds
Other
2.7.3
?