Lab 02 Map Reduce Design Patterns

Objective:

This is the second lab in the Hadoop Map Reduce training session series. In this lab, we will cover the following design patterns. To complete this lab, you need the VM, which can be obtained by emailing shujamughal@gmail.com

Summarization Pattern

Numerical Summarization
Inverted Index Summarization

Filtering Pattern

Let’s start with the Summarization pattern.

Summarization patterns are used to produce a top-level, summarized view of your data. We will start the lab with a Numerical Summarization example in which we need to find the maximum temperature for each year.

Numerical Summarization:

Consider the following sample input: (Format = year; temperature)

1990;21

1990;20

1990;24

1991;23

1991;24

1991;26

1991;27

…

We need to design a Map Reduce application that finds the maximum temperature for each year, so the output for the sample data must be the following.

1990 24

1991 27
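Before writing any Hadoop code, the expected result can be checked with a small plain-Java simulation. This is only a sketch: the class name MaxTemperatureSimulation and the method maxByYear are made up for illustration and are not part of the lab project, and a TreeMap stands in for the grouping and sorting of keys that Hadoop performs between map and reduce.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Plain-Java simulation of the job; no Hadoop classes are involved.
// Class and method names are hypothetical.
final class MaxTemperatureSimulation {

    // "Map" each line to (year, temperature), then "reduce" by keeping the maximum per year.
    public static Map<String, Integer> maxByYear(List<String> lines) {
        // TreeMap keeps the years sorted, similar to how reducer keys arrive in sorted order.
        Map<String, Integer> maxima = new TreeMap<>();
        for (String line : lines) {
            String[] parts = line.split(";");
            String year = parts[0];
            int temperature = Integer.parseInt(parts[1]);
            // merge() stores the value for a new year, or the larger of the old and new values.
            maxima.merge(year, temperature, Integer::max);
        }
        return maxima;
    }

    public static void main(String[] args) {
        List<String> sample = Arrays.asList(
                "1990;21", "1990;20", "1990;24",
                "1991;23", "1991;24", "1991;26", "1991;27");
        for (Map.Entry<String, Integer> entry : maxByYear(sample).entrySet()) {
            System.out.println(entry.getKey() + " " + entry.getValue());
        }
    }
}
```

Running main on the sample lines prints 1990 24 and 1991 27, which matches the expected output above.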

Let’s start with the Map part. The map input is a line from the file, and the output is the year as key and the temperature as value.

Map:

Input	Output
1990;30	key=1990, value=30

To start, we need to import the “hadoop-max-example” project, which is given at the start of the lab.

Import the project by following the steps explained in the previous lab.

After importing the project, complete the following Map program according to the above scenario. You need to open the “MaxTemperatureMapper.java” file.

Map

protected void map(LongWritable key, Text value, Context context)

throws java.io.IOException, InterruptedException {

Text yearText = new Text();

IntWritable tempWritable = new IntWritable(0);

//write here

context.write(yearText,tempWritable);

}

The solution of the Map program is given below.

Numerical Summarization (Finding Maximum Temperature)

Map Solution

protected void map(LongWritable key, Text value, Context context)

throws java.io.IOException, InterruptedException {

Text yearText = new Text();

IntWritable tempWritable = new IntWritable(0);

String[] line = value.toString().split(";");

String year = line[0];

yearText.set(year);

int temp = Integer.parseInt(line[1]);

tempWritable.set(temp);

context.write(yearText,tempWritable);

}
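The solution above assumes that every line is well formed. A blank line or a non-numeric temperature would throw an exception inside map and fail the task. A minimal sketch of a more defensive parser is shown here in plain Java; the class TemperatureLineParser and its parse method are hypothetical helpers, not part of the lab project. In the mapper, one could call such a helper and simply skip the record when nothing is returned.

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.Map;
import java.util.Optional;

// Hypothetical helper: parses one "year;temperature" line without throwing.
final class TemperatureLineParser {

    // Returns (year, temperature), or an empty Optional when the line does not fit the format.
    public static Optional<Map.Entry<String, Integer>> parse(String line) {
        String[] parts = line.trim().split(";");
        if (parts.length != 2 || parts[0].trim().isEmpty()) {
            return Optional.empty();
        }
        try {
            int temperature = Integer.parseInt(parts[1].trim());
            Map.Entry<String, Integer> record = new SimpleEntry<>(parts[0].trim(), temperature);
            return Optional.of(record);
        } catch (NumberFormatException e) {
            // The temperature field is not a valid integer.
            return Optional.empty();
        }
    }

    public static void main(String[] args) {
        System.out.println(parse("1990;21").isPresent());   // prints true
        System.out.println(parse("1990;abc").isPresent());  // prints false
        System.out.println(parse("").isPresent());          // prints false
    }
}
```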

At this point, we have completed the Map part. Now let’s focus on the reducer. The reducer processes the input values associated with a specific year and finds the maximum temperature.

Reducer:

Input	Output
1990, {20,35,34,21}	key=1990, value=35

Complete the following reducer program. You need to open the “MaxTemperatureReducer.java” file for it.

public class MaxTemperatureReducer extends

Reducer<Text, IntWritable, Text, IntWritable> {

protected void reduce(Text key, Iterable<IntWritable> values, Context context)

throws java.io.IOException, InterruptedException {

int maxValue = Integer.MIN_VALUE;

//Write here.

context.write(key, new IntWritable(maxValue));

}

}

If you are not able to write it, please see the solution below.

Numerical Summarization (Finding Maximum Temperature)

Reduce Program Solution

protected void reduce(Text key, Iterable<IntWritable> values, Context context)

throws java.io.IOException, InterruptedException {

ArrayList<Integer> temperatureList = new ArrayList<Integer>();

for (IntWritable value : values) {

temperatureList.add(value.get());

}

Collections.sort(temperatureList);

int size = temperatureList.size();

int maxValue = temperatureList.get(size -1);

context.write(key, new IntWritable(maxValue));

}
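The solution above collects every temperature in a list and sorts it, which works but needs memory for all values of a year and O(n log n) time. A running maximum gives the same result in a single pass. The sketch below shows the idea in plain Java with a hypothetical class StreamingMax; inside the real reducer the loop would run over the IntWritable values and call value.get() on each one.

```java
import java.util.Arrays;

// Hypothetical helper showing the single-pass maximum; not part of the lab project.
final class StreamingMax {

    // Keeps only the largest value seen so far; nothing is buffered or sorted.
    public static int max(Iterable<Integer> values) {
        int maxValue = Integer.MIN_VALUE;
        for (int value : values) {
            maxValue = Math.max(maxValue, value);
        }
        return maxValue;
    }

    public static void main(String[] args) {
        System.out.println(max(Arrays.asList(20, 35, 34, 21))); // prints 35
    }
}
```

Starting from Integer.MIN_VALUE means negative temperatures are handled correctly as well.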

We have now completed the map and reduce portions, and the output will be as shown below.

The next step is to test the map and reduce parts. Open the file MaxTemperatureTest. The code of the map test is given below.

@Test

public void testMapper() throws IOException {

mapDriver.withInput(new LongWritable(), new Text(

"1990;25"));

mapDriver.withOutput(new Text("1990"), new IntWritable(25));

mapDriver.runTest();

}

The input for the map code is 1990;25 and the output is key=1990 and value=25.

Below is the code for reducer test.

@Test

public void testReducer() throws IOException {

List<IntWritable> values = new ArrayList<IntWritable>();

values.add(new IntWritable(21));

values.add(new IntWritable(22));

values.add(new IntWritable(23));

reduceDriver.withInput(new Text("1990"), values);

reduceDriver.withOutput(new Text("1990"), new IntWritable(23));

reduceDriver.runTest();

}

The input of the reducer is key=1990 and values={21,22,23}, and the output is key=1990 and value=23.

Below are the steps to test the functionality.

Right-click the MaxTemperatureTest.java file and run it as a JUnit test. The output is shown below.

We have successfully tested the map and reduce parts. Now it’s time to run the job locally. Add some data to the “sample-max.txt” file, which is placed in the input folder as shown below.

Now select the “DriverTest.java” file and run it as a JUnit test as shown below.

After the test runs successfully, the final output is written to a file in the output folder created by this test, as shown below.

Great! You have run the job locally and confirmed the output. Let’s run it on the cluster. To run it on the cluster, we need the program’s jar file. To generate the jar file, right-click the project and select Run As -> Maven Install.

This process creates a jar file in the target folder of your project directory, as shown below.

The next step is to open a terminal window and create a file that will be used as input for this application.

Create a file “sample-max.txt” using the vim.tiny command as explained in the first lab of the Map Reduce session.

Once you have created the file, you can view its contents by entering the following command.

cat sample-max.txt

The file that we need to process is now ready. To process it, we need to copy it to the Hadoop file system (HDFS). Let’s first create a directory on HDFS to hold the file.

To create the directory, use the following command.

hadoop fs -mkdir /user/root/lab02_Problem01

To view the newly created directory, use the following command.

hadoop fs -ls | grep lab02_Problem01

The screen shot is shown below

Now copy the sample-max.txt file to this location using the following command.

hadoop fs -put sample-max.txt /user/root/lab02_Problem01/sample-max.txt

To confirm the operation, use the following command to check that the file exists on HDFS.

hadoop fs -ls /user/root/lab02_Problem01/

The screen shot is shown below.

To view the contents of the file that we just placed on HDFS, use the following command.

hadoop fs -cat /user/root/lab02_Problem01/sample-max.txt

The screen shot is given below.

Now we have everything ready. It’s time to launch the job. To launch it, use the following command.

hadoop jar hadoop-max-example-1.0.jar com.platalytics.maxtemperature.MaxTemperature /user/root/lab02_Problem01/sample-max.txt /user/root/lab02_Problem01/output

Congratulations! You are done with the Map Reduce cluster job. The last step is to verify the output.

The following commands list the output files generated by this process and print the result.

hadoop fs -ls /user/root/lab02_Problem01/output

hadoop fs -cat /user/root/lab02_Problem01/output/part-r-00000

Inverted Index Summarization:

Generate an index from a data set to allow for faster searches or data enrichment capabilities. It is often convenient to index large data sets on keywords, so that searches can trace terms back to records that contain specific values. While building an inverted index does require extra processing up front, taking the time to do so can greatly reduce the amount of time it takes to find something.

Suppose we have the following sample input for which we need to build the inverted index.

T[0]=This is the sample app for Inverted Index Problem

T[1]=We will solve it using a Simple Map Reduce Program

T[2]=This program is written in Java

The output for this input must be the following.

Index T[0]

Inverted T[0]

Java T[2]

Map T[1]

Problem T[0]

Program T[1]

Reduce T[1]

Simple T[1]

This T[0] T[2]

We T[1]

a T[1]

app T[0]

for T[0]

in T[2]

is T[2] T[0]

it T[1]

program T[2]

sample T[0]

solve T[1]

the T[0]

using T[1]

will T[1]

written T[2]
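The listing above can be reproduced in memory with a few lines of plain Java. This is a sketch under stated assumptions: the class InvertedIndexSimulation and the method build are hypothetical names, a TreeMap plays the role of Hadoop's key sorting (capitalized words sort before lowercase ones, as in the listing), and each line is assumed to have the form documentId=text.

```java
import java.util.Arrays;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;

// Plain-Java simulation of the inverted index job; names are hypothetical.
final class InvertedIndexSimulation {

    // Builds word -> set of document ids from lines of the form "documentId=text".
    public static Map<String, Set<String>> build(List<String> lines) {
        Map<String, Set<String>> index = new TreeMap<>();
        for (String line : lines) {
            // Limit of 2 so that an '=' inside the text does not split it further.
            String[] parts = line.split("=", 2);
            String documentId = parts[0];
            for (String word : parts[1].split(" ")) {
                // LinkedHashSet keeps first-seen order and ignores repeated ids.
                index.computeIfAbsent(word, k -> new LinkedHashSet<>()).add(documentId);
            }
        }
        return index;
    }

    public static void main(String[] args) {
        List<String> sample = Arrays.asList(
                "T[0]=This is the sample app for Inverted Index Problem",
                "T[1]=We will solve it using a Simple Map Reduce Program",
                "T[2]=This program is written in Java");
        for (Map.Entry<String, Set<String>> entry : build(sample).entrySet()) {
            System.out.println(entry.getKey() + " " + String.join(" ", entry.getValue()));
        }
    }
}
```

In this simulation the document ids for a word always appear in input order. On a real cluster the order of values that reach a reducer is not guaranteed, which is why the listing shows T[2] T[0] for the word "is".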

Import the “inverted-index-example” project, which is given at the start of the lab. Use the same import steps that we discussed before.

After the import, let’s start with the map part. Complete the following Map part according to the above scenario. You need to open the “InvertedIndexMapper.java” file to complete the task.

protected void map(LongWritable key, Text value, Context context)

throws java.io.IOException, InterruptedException {

Text wordText = new Text();

Text document = new Text();

//write code here

context.write(wordText,document);

}

Please observe the solution below if you are not able to solve it.

protected void map(LongWritable key, Text value, Context context)

throws java.io.IOException, InterruptedException {

Text wordText = new Text();

Text document = new Text();

String[] line = value.toString().split("=");

String documentName = line[0];

document.set(documentName);

String textStr = line[1];

String[] wordArray = textStr.split(" ");

for(int i = 0; i < wordArray.length; i++) {

wordText.set(wordArray[i]);

context.write(wordText,document);

}

}

Now we have completed the map part. Let’s write the reduce part. Complete the following reducer code according to the above scenario. You need to open the “InvertedIndexReducer.java” file to complete the task.

protected void reduce(Text key, Iterable<Text> values, Context context)

throws java.io.IOException, InterruptedException {

Text documentList = new Text();

//write code here

context.write(key, documentList);

}

Please observe the solution below if you are not able to solve it.

protected void reduce(Text key, Iterable<Text> values, Context context)

throws java.io.IOException, InterruptedException {

Text documentList = new Text();

StringBuffer buffer = new StringBuffer();

for (Text value : values) {

if(buffer.length() != 0) {

buffer.append(" ");

}

buffer.append(value.toString());

}

documentList.set(buffer.toString());

context.write(key, documentList);

}
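One thing to watch in the reducer above: if a word occurs twice in the same document, the mapper emits the same (word, document) pair twice and the document name is appended twice. A possible refinement, sketched here in plain Java with a hypothetical class DocumentListJoiner, is to collect the names in a LinkedHashSet before joining them. In the real reducer each Text value would be added as value.toString(), because Hadoop reuses the same Text object across iterations.

```java
import java.util.Arrays;
import java.util.LinkedHashSet;
import java.util.Set;

// Hypothetical helper: joins document ids with spaces, dropping repeats.
final class DocumentListJoiner {

    public static String joinDistinct(Iterable<String> documentIds) {
        // LinkedHashSet removes duplicates while keeping first-seen order.
        Set<String> distinct = new LinkedHashSet<>();
        for (String id : documentIds) {
            distinct.add(id);
        }
        return String.join(" ", distinct);
    }

    public static void main(String[] args) {
        System.out.println(joinDistinct(Arrays.asList("T[0]", "T[2]", "T[0]"))); // prints T[0] T[2]
    }
}
```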

Now we have completed the map and reduce parts. Write the code in the provided template as shown below.

The next step is to test the map and reduce parts. Following is the test code for the mapper.

@Test

public void testMapper() throws IOException {

mapDriver.withInput(new LongWritable(), new Text(

"T[0]=hi there"));

mapDriver.addOutput(new Text("hi"), new Text("T[0]"));

mapDriver.addOutput(new Text("there"), new Text("T[0]"));

mapDriver.runTest();

}

Following is the code for the reducer test. It assumes the test class declares a reduceDriver in the same way as MaxTemperatureTest.

@Test

public void testReducer() throws IOException {

List<Text> values = new ArrayList<Text>();

values.add(new Text("T[0]"));

values.add(new Text("T[2]"));

reduceDriver.withInput(new Text("This"), values);

reduceDriver.withOutput(new Text("This"), new Text("T[0] T[2]"));

reduceDriver.runTest();

}

Now let’s run the test by selecting the file InvertedIndexTest.java and running it as a JUnit test as shown below.

Great! Now it’s time to run the job locally. The input file is placed in the folder named input. Select the file “InvertedIndexDriverTest.java” and run it as a JUnit test as shown below.

Excellent! Everything works fine. It’s time to run on the cluster now. Follow the same steps as before to run the example on the cluster.

Filtering Pattern:

Filtering patterns don’t change the actual records. These patterns all find a subset of the data, whether it is small, like a top-ten listing, or large, like the results of a deduplication. The primary use is to filter out records that are not of interest and keep the ones that are.

Suppose we have the following input data to which we need to apply the filtering pattern.

10/11/2012,r..@yahoo.com,3,I was really satisfied with the customer service at the store

09/12/2012,x..@yahoo.com,3,Basic needs met at the store

10/11/2012,r.@gmail.com,4,Excellent

09/12/2012,x..@hotmail.com,5,Great Collection

01/02/2013,dan...@xyz.com,1,Horrible service at the store

09/12/2012,zz..@hotmail.com,5,Great Collection

01/02/2013,d...@gmail.com,4,Good collection and service at the store

The input file contains the information in the following format:

date,email,rating,comment

We need to keep only the customer comments where the rating is 1, so the output of this job must be the following.

Horrible service at the store

In this pattern, we are only filtering the data, so no reducer is required.

Import the “filter-mapper-example” project into Eclipse using the same steps as discussed previously.

After importing, let’s start with the map part. Complete the following code to achieve the above task. You need to open the “FilterMapper.java” file for it.

public class FilterMapper extends Mapper<LongWritable, Text, Text, NullWritable> {

protected void map(LongWritable key, Text value, Context context)

throws java.io.IOException, InterruptedException {

String[] tokens = value.toString().split(",");

//write code here

context.write(new Text(tokens[3]),NullWritable.get() );

}

}

Kindly see the solution below if you are not able to solve it.

public class FilterMapper extends Mapper<LongWritable, Text, Text, NullWritable> {

protected void map(LongWritable key, Text value, Context context)

throws java.io.IOException, InterruptedException {

String[] tokens = value.toString().split(",");

int rating = Integer.parseInt(tokens[2]);

if(rating == 1){

System.out.println(tokens[3]);

context.write(new Text(tokens[3]),NullWritable.get() );

}

}

}
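The filter logic can also be tried outside Hadoop. The sketch below is plain Java with hypothetical names (RatingFilterSimulation, commentsWithRating). It differs from the mapper above in two ways that are assumptions of this sketch, not part of the lab solution: it passes a limit of 4 to split so that commas inside a comment are kept, and it skips lines that do not have four fields or a numeric rating.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Plain-Java simulation of the filtering mapper; names are hypothetical.
final class RatingFilterSimulation {

    // Returns the comments of all well-formed records whose rating equals wantedRating.
    public static List<String> commentsWithRating(List<String> lines, int wantedRating) {
        List<String> comments = new ArrayList<>();
        for (String line : lines) {
            // Limit of 4: date, email, rating, and the rest of the line as the comment.
            String[] tokens = line.split(",", 4);
            if (tokens.length < 4) {
                continue; // not a complete record
            }
            int rating;
            try {
                rating = Integer.parseInt(tokens[2].trim());
            } catch (NumberFormatException e) {
                continue; // rating is not a number
            }
            if (rating == wantedRating) {
                comments.add(tokens[3]);
            }
        }
        return comments;
    }

    public static void main(String[] args) {
        List<String> sample = Arrays.asList(
                "10/11/2012,r.@gmail.com,4,Excellent",
                "01/02/2013,dan...@xyz.com,1,Horrible service at the store");
        System.out.println(commentsWithRating(sample, 1)); // prints [Horrible service at the store]
    }
}
```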

The screen shot is given below.

Let’s test the map as we did for the previous tasks.

@Test

public void testMapper() throws IOException {

mapDriver.withInput(new LongWritable(), new Text(

"01/02/2013,dan...@xyz.com,1,Horrible service at the store"));

mapDriver.withOutput(new Text("Horrible service at the store"),

NullWritable.get());

mapDriver.runTest();

}

Select the file “FilterTest.java” and run it as a JUnit test as shown below.

Now run the program locally by selecting DriverTest and running it as a JUnit test. Before running, make sure that the input file is placed in the folder named input. The screen shot is given below.

Great work! It’s time to run on the cluster. Follow the same steps as before to run it on the cluster.

This is the end of the lab.

Tuesday, July 1, 2014