Tuesday, July 1, 2014

Lab 02 Map Reduce Design Patterns







Objective:

This is the second lab in the Hadoop Map Reduce training session series. To complete this lab, you need the VM, which can be obtained by emailing shujamughal@gmail.com. In this lab, we will cover the following design patterns:
  • Summarization Pattern
    • Numerical Summarization
    • Inverted Index Summarization
  • Filtering Pattern
Let’s start with the Summarization pattern.
Summarization patterns are used to produce a top-level, summarized view of your data. We will start the lab with a Numerical Summarization example in which we need to find the maximum temperature for each year.

Numerical Summarization:

Consider the following sample input:  (Format = year; temperature)
1990;21
1990;20
1990;24
1991;23
1991;24
1991;26
1991;27

We need to design a Map Reduce application that finds the maximum temperature of each year, so the output for the sample data must be the following.
1990 24
1991 27

Let’s start with the Map part. The map input is a line from the file, and the output is the year as the key and the temperature as the value.
Map:
Input       Output
1990;30     key=1990, value=30
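
Before touching the Hadoop types, it can help to see the map step as a plain function from one line to one (key, value) pair. The following is a minimal sketch only: the class name MapStepSketch, the method parseRecord, and the choice to return null for malformed lines are all assumptions made for illustration and are not part of the lab project.

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.Map;

// Hypothetical helper class, not part of the lab project.
class MapStepSketch {

    // Parses one "year;temperature" line into a (year, temperature) pair,
    // which is the pair the mapper is expected to emit.
    // Returning null for a malformed line is an assumption of this sketch.
    static Map.Entry<String, Integer> parseRecord(String line) {
        String[] parts = line.trim().split(";");
        if (parts.length != 2) {
            return null; // wrong number of fields
        }
        try {
            int temperature = Integer.parseInt(parts[1].trim());
            return new SimpleEntry<>(parts[0].trim(), temperature);
        } catch (NumberFormatException e) {
            return null; // temperature is not a number
        }
    }

    public static void main(String[] args) {
        Map.Entry<String, Integer> pair = parseRecord("1990;30");
        System.out.println("key=" + pair.getKey() + ", value=" + pair.getValue());
    }
}
```

In the real mapper the same two fields go into a Text and an IntWritable and are passed to context.write.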

To start, we need to import the “hadoop-max-example” project, which is given at the start of the lab.
Import the project by following the steps explained in the previous lab.
After importing the project, open the “MaxTemperatureMapper.java” file and complete the following Map program according to the scenario above.
Map
protected void map(LongWritable key, Text value, Context context)
     throws java.io.IOException, InterruptedException {
 Text yearText = new Text();
 IntWritable tempWritable = new IntWritable(0);
//write here

   context.write(yearText,tempWritable);
 }  

The solution of the Map program is given below.
Numerical Summarization (Finding Maximum Temperature)
Map Solution
protected void map(LongWritable key, Text value, Context context)
     throws java.io.IOException, InterruptedException {
 Text yearText = new Text();
 IntWritable tempWritable = new IntWritable(0);
   String[] line = value.toString().split(";");
   String year = line[0];
   yearText.set(year);
   int temp = Integer.parseInt(line[1]);
   tempWritable.set(temp);
   context.write(yearText,tempWritable);
 }  

At this point, we have completed the Map part. Now let’s focus on the reducer. The reducer processes the input values associated with a specific year and finds the maximum temperature.
Reducer:
Input                   Output
1990, {20,35,34,21}     key=1990, value=35

Open the “MaxTemperatureReducer.java” file and complete the following reducer program.
public class MaxTemperatureReducer extends
        Reducer<Text, IntWritable, Text, IntWritable> {
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws java.io.IOException, InterruptedException {
        // Initialized so that the template compiles before you fill it in.
        int maxValue = Integer.MIN_VALUE;

        //Write here.

        context.write(key, new IntWritable(maxValue));
    }
}


If you are not able to write it, please see the solution below.

Numerical Summarization (Finding Maximum Temperature)
Reduce Program Solution
protected void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws java.io.IOException, InterruptedException {
    // Needs java.util.ArrayList and java.util.Collections imported.
    // Collect every temperature reported for this year.
    ArrayList<Integer> temperatureList = new ArrayList<Integer>();
    for (IntWritable value : values) {
        temperatureList.add(value.get());
    }
    // After an ascending sort, the last element is the maximum.
    Collections.sort(temperatureList);
    int size = temperatureList.size();
    int maxValue = temperatureList.get(size - 1);
    context.write(key, new IntWritable(maxValue));
}
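
The solution above sorts the whole list just to read its last element. As an alternative sketch (not the lab’s official solution), the maximum can be tracked in a single pass without storing the values. The class name RunningMaxSketch and the method maxOf are hypothetical, and plain Integer values stand in for IntWritable so that the example runs without Hadoop.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical helper class, not part of the lab project.
class RunningMaxSketch {

    // Returns the largest value seen while iterating once over the input.
    // For an empty input it returns Integer.MIN_VALUE (an assumption of
    // this sketch; a reducer is always called with at least one value).
    static int maxOf(Iterable<Integer> values) {
        int max = Integer.MIN_VALUE;
        for (int value : values) {
            if (value > max) {
                max = value;
            }
        }
        return max;
    }

    public static void main(String[] args) {
        List<Integer> temperatures = Arrays.asList(20, 35, 34, 21);
        System.out.println("key=1990, value=" + maxOf(temperatures));
    }
}
```

Inside the real reducer, the loop body would compare value.get() against the running maximum in the same way. This keeps memory use constant no matter how many readings a year has.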

Now we have completed the map and reduce portions. The output will be as shown below.


The next step is to test the map and reduce parts. Open the file MaxTemperatureTest. The code of the map test is given below.
@Test
public void testMapper() throws IOException {
mapDriver.withInput(new LongWritable(), new Text(
"1990;25"));
mapDriver.withOutput(new Text("1990"), new IntWritable(25));
mapDriver.runTest();
}

The input for the map code is 1990;25 and the output is key=1990, value=25.
Below is the code for the reducer test.
@Test
public void testReducer() throws IOException {
List<IntWritable> values = new ArrayList<IntWritable>();
values.add(new IntWritable(21));
values.add(new IntWritable(22));
values.add(new IntWritable(23));
reduceDriver.withInput(new Text("1990"), values);
reduceDriver.withOutput(new Text("1990"), new IntWritable(23));
reduceDriver.runTest();
}

The input of the reducer is key=1990 with values={21,22,23}, and the output is key=1990, value=23.

Below are the steps to test the functionality.
  1. Right-click the MaxTemperatureTest.java file and run it as a JUnit Test. The output is shown below.


We have successfully tested the map and reduce parts. Now it’s time to run the job locally. Add some data to the “sample-max.txt” file, which is placed in the input folder as shown below.


Now select the “DriverTest.java” file and run it as a JUnit Test as shown below.


After the test runs successfully, the final output is written to a file in the output folder created by this test, as shown below.


Great, you have run the job locally and confirmed the output. Let’s run it on the cluster. To run it on the cluster, we need the program’s jar file. To generate the jar file, right-click on the project and select Run As -> Maven install.
This process creates a jar file in the target folder of your project directory, as shown below.


The next step is to open a terminal window and create a file that will be used as input for this application.
Create a file “sample-max.txt” using the vim.tiny command, as explained in the first lab of the Map Reduce session.
Once you have created the file, you can view its contents by entering the following command.
cat sample-max.txt


Now the file we need to process is ready, but first we need to copy it to the Hadoop file system (HDFS). To do this, let’s first create a directory on HDFS to hold the file.
To create the directory, use the following command.
hadoop fs -mkdir /user/root/lab02_Problem01

To view the newly created directory, use the following command.
hadoop fs -ls | grep lab02_Problem01

The screen shot is shown below


Now copy the sample-max.txt file to this location using the following command.

hadoop fs -put sample-max.txt  /user/root/lab02_Problem01/sample-max.txt

To confirm the operation, use the following command to check that the file exists on HDFS.
hadoop fs -ls /user/root/lab02_Problem01/
The screen shot is shown below.


To view the contents of the file we just placed on HDFS, use the following command.
hadoop fs -cat /user/root/lab02_Problem01/sample-max.txt

The screen shot is given below.


Now we have everything ready. It’s time to launch the job. To launch it, use the following command.
hadoop jar hadoop-max-example-1.0.jar com.platalytics.maxtemperature.MaxTemperature /user/root/lab02_Problem01/sample-max.txt  /user/root/lab02_Problem01/output  




Congratulations! You are done with the Map Reduce cluster job. The last step is to verify the output.
The first command below lists the output files generated by this process, and the second prints the result.
hadoop fs -ls /user/root/lab02_Problem01/output  

hadoop fs -cat /user/root/lab02_Problem01/output/part-r-00000


Inverted Index Summarization:

Generate an index from a data set to allow for faster searches or data enrichment capabilities. It is often convenient to index large data sets on keywords, so that searches can trace terms back to records that contain specific values. While building an inverted index does require extra processing up front, taking the time to do so can greatly reduce the amount of time it takes to find something.
Suppose we have the following sample input, for which we need to build the inverted index.
T[0]=This is the sample app for Inverted Index Problem
T[1]=We will solve it using a Simple Map Reduce Program
T[2]=This program is written in Java

The output for this input must be the following.
Index T[0]
Inverted T[0]
Java T[2]
Map T[1]
Problem T[0]
Program T[1]
Reduce T[1]
Simple T[1]
This T[0] T[2]
We T[1]
a T[1]
app T[0]
for T[0]
in T[2]
is T[2] T[0]
it T[1]
program T[2]
sample T[0]
solve T[1]
the T[0]
using T[1]
will T[1]
written T[2]
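
To see where this listing comes from, the whole job can be imitated in memory with ordinary collections. This is only a sketch under stated assumptions: the class InvertedIndexSketch and the method buildIndex are hypothetical names, and a TreeMap stands in for the sorted keys that the reducers receive. The order of document names for one word (for example “is”) follows the input order here, whereas in a real job that order is not guaranteed.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical helper class, not part of the lab project.
class InvertedIndexSketch {

    // Builds word -> list of document names from lines of the form "name=text".
    static Map<String, List<String>> buildIndex(List<String> lines) {
        // TreeMap keeps words sorted, uppercase letters before lowercase ones.
        Map<String, List<String>> index = new TreeMap<>();
        for (String line : lines) {
            String[] parts = line.split("=", 2);
            if (parts.length < 2) {
                continue; // skip lines without a "name=text" shape
            }
            String documentName = parts[0];
            for (String word : parts[1].split(" ")) {
                if (word.isEmpty()) {
                    continue; // ignore repeated spaces
                }
                // "Map" step: emit (word, documentName); "reduce" step: group by word.
                index.computeIfAbsent(word, k -> new ArrayList<>()).add(documentName);
            }
        }
        return index;
    }

    public static void main(String[] args) {
        List<String> lines = Arrays.asList(
                "T[0]=This is the sample app for Inverted Index Problem",
                "T[1]=We will solve it using a Simple Map Reduce Program",
                "T[2]=This program is written in Java");
        for (Map.Entry<String, List<String>> entry : buildIndex(lines).entrySet()) {
            System.out.println(entry.getKey() + " " + String.join(" ", entry.getValue()));
        }
    }
}
```
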
Import the “inverted-index-example” project, which is given at the start of the lab. Use the same import steps that we discussed before.
After the import, let’s start with the map part. Open the “InvertedIndexMapper.java” file and complete the following Map part according to the scenario above.
protected void map(LongWritable key, Text value, Context context)
throws java.io.IOException, InterruptedException {
Text wordText = new Text();
Text document = new Text();
           
            //write code here





context.write(wordText,document);
}
}

Please see the solution below if you are not able to solve it.
protected void map(LongWritable key, Text value, Context context)
        throws java.io.IOException, InterruptedException {
    Text wordText = new Text();
    Text document = new Text();

    // Each line looks like "T[0]=some text": name before "=", text after it.
    String[] line = value.toString().split("=");

    String documentName = line[0];
    document.set(documentName);
    String textStr = line[1];
    // Emit one (word, document name) pair per word in the text.
    String[] wordArray = textStr.split(" ");
    for (int i = 0; i < wordArray.length; i++) {
        wordText.set(wordArray[i]);
        context.write(wordText, document);
    }
}

Now we have completed the map part. Let’s write the reduce part. Open the “InvertedIndexReducer.java” file and complete the following reducer code according to the scenario above.
protected void reduce(Text key, Iterable<Text> values, Context context)
throws java.io.IOException, InterruptedException {
       Text documentList = new Text();
            //write code here  
context.write(key, documentList);
}

Please see the solution below if you are not able to solve it.

protected void reduce(Text key, Iterable<Text> values, Context context)
        throws java.io.IOException, InterruptedException {
    Text documentList = new Text();
    StringBuffer buffer = new StringBuffer();
    for (Text value : values) {
        if (buffer.length() != 0) {
            buffer.append(" ");
        }
        buffer.append(value.toString());
    }
    documentList.set(buffer.toString());
    context.write(key, documentList);
}
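
One thing to notice about this reducer: if a word appears twice in the same document, that document name is written twice. The sample input has no such case, so the following is an optional refinement rather than part of the lab. It is a sketch with hypothetical names (DedupDocListSketch, joinDistinct) that uses plain String values so it runs without Hadoop; in a real reducer each Text would first be copied with toString(), because Hadoop reuses the value object between iterations.

```java
import java.util.Arrays;
import java.util.LinkedHashSet;
import java.util.Set;

// Hypothetical helper class, not part of the lab project.
class DedupDocListSketch {

    // Joins document names with single spaces, keeping only the first
    // occurrence of each name and preserving the order of arrival.
    static String joinDistinct(Iterable<String> documentNames) {
        Set<String> distinct = new LinkedHashSet<>();
        for (String name : documentNames) {
            distinct.add(name);
        }
        return String.join(" ", distinct);
    }

    public static void main(String[] args) {
        System.out.println(joinDistinct(Arrays.asList("T[0]", "T[0]", "T[2]")));
    }
}
```
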

Now we have completed the map and reduce parts. Write the code in the provided template, as shown below.



The next step is to test the map and reduce parts. Following is the test code for the mapper.
@Test
public void testMapper() throws IOException {
mapDriver.withInput(new LongWritable(), new Text(
"T[0]=hi there"));
mapDriver.addOutput(new Text("hi"), new Text("T[0]"));
mapDriver.addOutput(new Text("there"), new Text("T[0]"));
mapDriver.runTest();
}

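Following is the code for the reducer test. It follows the shape of the earlier reducer test and assumes a reduceDriver field declared for Text keys and Text values.
@Test
public void testReducer() throws IOException {
List<Text> values = new ArrayList<Text>();
values.add(new Text("T[0]"));
values.add(new Text("T[2]"));
reduceDriver.withInput(new Text("This"), values);
reduceDriver.withOutput(new Text("This"), new Text("T[0] T[2]"));
reduceDriver.runTest();
}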


Now let’s run the test by selecting the file InvertedIndexTest.java and running it as a JUnit Test as shown below.



Great! Now it’s time to run the job locally. The input file is placed in the folder named input. Select the file “InvertedIndexDriverTest.java” and run it as a JUnit Test as shown below.


Excellent! Everything works fine. It’s time to run on the cluster now. Follow the same steps as in the first example to run it on the cluster.

Filtering Pattern:

Filtering patterns do not change the actual records. They find a subset of the data, whether small, like a top-ten listing, or large, like the results of a deduplication. Their primary use is to filter out records that are not of interest and keep the ones that are.

Suppose we have the following input data, to which we need to apply the filtering pattern.

10/11/2012,r..@yahoo.com,3,I was really satisfied with the customer service at the store
09/12/2012,x..@yahoo.com,3,Basic needs met at the store
10/11/2012,r.@gmail.com,4,Excellent
09/12/2012,x..@hotmail.com,5,Great Collection
01/02/2013,dan...@xyz.com,1,Horrible service at the store
09/12/2012,zz..@hotmail.com,5,Great Collection
01/02/2013,d...@gmail.com,4,Good collection and service at the store

The input file contains the information in the following comma-separated format:
date,email,rating,comment



We need to keep only the customer comments where the rating is 1, so the output of this job must be the following.

Horrible service at the store

In this pattern, we are only filtering the data, so no reducer is required.
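
Since the filter is a map-only step, its logic fits in one small function. Here is a minimal sketch of it over an in-memory list; the names FilterSketch and commentsWithRating are hypothetical, and two choices are assumptions added for this sketch: splitting with a limit of 4 so that commas inside a comment stay in the comment, and silently skipping malformed lines.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper class, not part of the lab project.
class FilterSketch {

    // Returns the comment field of every line whose rating equals `wanted`.
    // Lines are expected in the form date,email,rating,comment.
    static List<String> commentsWithRating(List<String> lines, int wanted) {
        List<String> comments = new ArrayList<>();
        for (String line : lines) {
            // Limit of 4: everything after the third comma is the comment.
            String[] tokens = line.split(",", 4);
            if (tokens.length < 4) {
                continue; // not enough fields
            }
            int rating;
            try {
                rating = Integer.parseInt(tokens[2].trim());
            } catch (NumberFormatException e) {
                continue; // rating is not a number
            }
            if (rating == wanted) {
                comments.add(tokens[3]);
            }
        }
        return comments;
    }

    public static void main(String[] args) {
        List<String> lines = Arrays.asList(
                "10/11/2012,r.@gmail.com,4,Excellent",
                "01/02/2013,dan...@xyz.com,1,Horrible service at the store",
                "09/12/2012,zz..@hotmail.com,5,Great Collection");
        for (String comment : commentsWithRating(lines, 1)) {
            System.out.println(comment);
        }
    }
}
```
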

Import the “filter-mapper-example” project into Eclipse using the same steps as discussed previously.

After importing, let’s start with the map part. Open the “FilterMapper.java” file and complete the following code to achieve the above task.

public class FilterMapper extends Mapper<LongWritable, Text, Text, NullWritable> {

 protected void map(LongWritable key, Text value, Context context)
     throws java.io.IOException, InterruptedException {
   String[] tokens = value.toString().split(",");
   //write code here



context.write(new Text(tokens[3]),NullWritable.get() );
   }
 }  


Kindly see the solution below if you are not able to solve it.

public class FilterMapper extends Mapper<LongWritable, Text, Text, NullWritable> {

 protected void map(LongWritable key, Text value, Context context)
     throws java.io.IOException, InterruptedException {
   // Fields are date, email, rating, comment.
   String[] tokens = value.toString().split(",");
   int rating = Integer.parseInt(tokens[2]);
   // Emit only the comments whose rating is 1.
   if (rating == 1) {
     System.out.println(tokens[3]); // debug output
     context.write(new Text(tokens[3]), NullWritable.get());
   }
 }
}
The screen shot is given below.



Let’s test the map as we did for the previous tasks.

@Test
public void testMapper() throws IOException {
mapDriver.withInput(new LongWritable(), new Text(
"01/02/2013,dan...@xyz.com,1,Horrible service at the store"));
mapDriver.withOutput(new Text("Horrible service at the store"),
            NullWritable.get());
mapDriver.runTest();
}


Select the file “FilterTest.java” and run it as a JUnit Test as shown below.





Now run the program locally by selecting DriverTest and running it as a JUnit Test. Before running, make sure that the input file is placed in the folder named input. The screenshot is given below.





Great work! It’s time to run on the cluster. Follow the same steps as in the first example to run it on the cluster.

This is the end of the lab.
