Hadoop can store any kind of data.
This is why Hadoop data is often called unstructured.
Hadoop has replication built into it.
By default it keeps three copies of each block of a file.
Plus each storage node is a computer, so it can participate in querying and writing the data.
Hadoop works in a master-slave mode.
The Hadoop namenode keeps track of what data is located on which slave (data) node.
Hadoop does not give record-level access to lines or rows in a file, the way a database does.
Instead it is a storage system designed to store whole files.
The alternative to Hadoop storage for big data is object storage, like Amazon S3 or Cleversafe.
With object storage, each file has a unique object ID and no file path.
Streaming video is one example of data that is stored in this format (Netflix uses Amazon S3).
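The object-ID idea can be sketched with ordinary shell tools. This is my own toy illustration, not how S3 or Cleversafe actually work internally: here the object ID is simply a hash of the content, and objects live in a flat store with no directory paths.

```shell
# Toy object store: each item is addressed by a unique object ID,
# not a file path. Real object stores assign keys/IDs their own way;
# a content hash is just one simple way to get a unique ID.
store=/tmp/objstore
mkdir -p "$store"
id=$(printf 'some video bytes' | sha256sum | awk '{print $1}')
printf 'some video bytes' > "$store/$id"
echo "stored object $id"
```

Retrieval then happens by ID rather than by navigating a folder hierarchy, which is exactly the difference from a file system path.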
The other difference is metadata.
The Hadoop file metadata is fixed: the file system tracks attributes such as owner, permissions, size, and timestamps, but you cannot attach your own metadata, while object storage lets you store custom metadata with each object.
File operations in Hadoop are also limited: you can create, read, append to, and delete a file.
Notice there is no UPDATE; HDFS is a write-once file system, so you cannot change data in the middle of an existing file.
You query Hadoop data files by using the MapReduce programming technique.
This requires knowledge of Java, so there are also higher-level tools that administrators and other users can use instead.
MapReduce creates a new set of data from existing data.
So you’re free to say that MapReduce processes queries.
The map part reads the data and emits key-value records; the reduce part merges all records that share the same key into one result, producing the new set.
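The map-shuffle-reduce flow can be mimicked with ordinary shell tools. This is only an illustration of the idea, not Hadoop itself (real MapReduce runs the map and reduce steps as distributed tasks across the data nodes):

```shell
# Toy word count in the MapReduce shape:
#   map:     emit one (word) record per line
#   shuffle: sort groups identical keys together
#   reduce:  aggregate each group into a count
printf 'big data\nbig files\n' |
  tr ' ' '\n' |   # map
  sort |          # shuffle
  uniq -c         # reduce (prints each distinct word with its count)
```

The sort step is doing the same job as Hadoop's shuffle: it guarantees that all records with the same key arrive together, so the reduce step can merge them in one pass.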
Hadoop is still new enough that you should expect new versions to differ from previous ones.
For example, MapReduce version 2 is called YARN.
The name changed because the design changed completely.
Why is Hadoop popular?
One reason is that Hadoop does not force a schema on your data up front.
Instead you can create your own schema around the data later, using tools like Apache Hive.
Install and start Hadoop
Here is a very simple installation of Hadoop version 1.2.0.
In this example I create a single node HDFS (Hadoop Distributed File System).
I will set it up to run as root.
First, download Hadoop from the Apache Hadoop website and untar it to any folder.
In my case I put it here: /root/hadoop-1.2.0.
You must create an RSA key on the machine where you install Hadoop; alternatively, you can use an existing SSH key.
I use a key with no passphrase, which saves me from typing a passphrase when I want to execute Hadoop functions.
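The key setup looks like the sketch below. This is a generic OpenSSH recipe, not anything Hadoop-specific; it assumes you want a new key with an empty passphrase and skips key generation if a key already exists (reuse your existing key in that case).

```shell
# Create ~/.ssh if it does not exist yet, with the permissions sshd expects.
mkdir -p ~/.ssh && chmod 700 ~/.ssh
# Generate an RSA key with an empty passphrase (-P ""), so Hadoop's
# scripts can ssh to this machine without prompting. Skip if a key exists.
[ -f ~/.ssh/id_rsa ] || ssh-keygen -t rsa -P "" -f ~/.ssh/id_rsa
# Authorize the public key for logins to this same machine.
cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys
chmod 600 ~/.ssh/authorized_keys
```

Afterwards, `ssh localhost` should log you in without a password prompt, which is what the Hadoop start scripts rely on.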
Next, format the HDFS file system with bin/hadoop namenode -format.
(This command only works if you enter capital Y when prompted.)
Then you start it up:
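Assuming the Hadoop 1.x layout from above (install folder /root/hadoop-1.2.0), the bundled start script launches all of the daemons at once:

```shell
# Start all HDFS and MapReduce daemons (namenode, datanode,
# jobtracker, tasktracker) using the script Hadoop 1.x ships with.
cd /root/hadoop-1.2.0
bin/start-all.sh
# List the running Java processes to confirm the daemons came up
# (jps ships with the JDK).
jps
```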
Fix any error messages you see.
(I did not see any.)
Hadoop config files
The main configuration file is conf/core-site.xml.
The core-site.xml from this simple installation is shown below.