Hadoop Use Case: Stack Exchange (Gaming)

Abstract:

gaming.stackexchange.com is a technical community dedicated to gaming, with a group of users who contribute answers and help others with their queries.

The goal of this paper is to find the top 20 users in this community and their average scores.

Dataset:

Download the gaming.stackexchange.com dataset from the Stack Exchange data dump archive:

https://archive.org/download/stackexchange

After you extract the archive, the files below are displayed.

The tables above are correlated with each other.

As part of our use case, we need to find the top 20 users and the average of their scores.

Score: the score is a point of appreciation given by the users who got help regarding that particular issue.

Badge: a recognition awarded to a user for an achievement on the site.

So here we can correlate the two tables, Posts and Badges.

We use Hadoop to convert the XML files to delimited text using a MapReduce job written in Java.

Badges.xml (sample text)
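An illustrative Badges.xml row, following the Stack Exchange dump layout (the attribute values here are hypothetical):

```xml
<row Id="82946" UserId="3718" Name="Teacher" Date="2010-07-07T18:09:01.020" />
```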

Develop a Driver class and a Mapper class for processing the XML files (Badges, Posts).

Driver Class

public class XMLDriver extends Configured implements Tool
{
    @Override
    public int run(String[] args) throws Exception
    {
        if (args.length != 2)
        {
            System.err.printf("Usage: %s [generic options] <input> <output>\n", getClass().getSimpleName());
            ToolRunner.printGenericCommandUsage(System.err);
            return -1;
        }

        Job job = Job.getInstance(getConf());
        job.setJobName("Tech Talks - Stackoverflow Forum Posts - Data Wrangler");

        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));

        job.setInputFormatClass(TextInputFormat.class);
        job.setOutputFormatClass(TextOutputFormat.class);

        job.setMapOutputKeyClass(Text.class);
        job.setMapOutputValueClass(Text.class);

        job.setJarByClass(XMLDriver.class); // required for MR1
        job.setMapperClass(StackoverflowDataWranglerMapper.class);
        job.setNumReduceTasks(0); // map-only job: no reducer needed

        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(Text.class);

        return job.waitForCompletion(true) ? 0 : 1;
    }

    public static void main(String[] args) throws Exception
    {
        int exitCode = ToolRunner.run(new Configuration(), new XMLDriver(), args);
        System.exit(exitCode);
    }
}

Mapper Class

public class StackoverflowDataWranglerMapper extends Mapper<LongWritable, Text, Text, Text>
{
    static enum BadRecordCounters
    {
        NO_CREATION_DATE, UNKNOWN_USER_ID, UNPARSEABLE_RECORD, UNTAGGED_POSTS
    }

    private final Text outputKey = new Text();
    private final Text outputValue = new Text();

    private final DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
    private DocumentBuilder builder;

    private static final Joiner TAG_JOINER = Joiner.on(",").skipNulls();

    // 2008-07-31T21:42:52.667
    private static final DateFormat DATE_PARSER = new SimpleDateFormat("yyyy-MM-dd'T'HH:mm:ss.SSS");
    private static final SimpleDateFormat DATE_BUILDER = new SimpleDateFormat("yyyy-MM-dd");

    @Override
    protected void setup(Context context) throws IOException, InterruptedException
    {
        try
        {
            builder = factory.newDocumentBuilder();
        }
        catch (ParserConfigurationException e)
        {
            throw new IOException(e); // fail the task if no XML parser is available
        }
    }

    @Override
    protected void map(LongWritable inputKey, Text inputValue, Context context)
            throws IOException, InterruptedException
    {
        try
        {
            String entry = inputValue.toString();
            if (entry.contains("<row "))
            {
                Document doc = builder.parse(new InputSource(new StringReader(entry)));
                Element rootElem = doc.getDocumentElement();

                String id = rootElem.getAttribute("Id");
                String owneruserid = rootElem.getAttribute("OwnerUserId");
                String score = rootElem.getAttribute("Score");

                outputKey.clear();
                outputKey.set(id);

                // pipe-separated value, matching the Sqoop field terminator used later
                StringBuilder sb = new StringBuilder(score).append("|").append(owneruserid);
                outputValue.set(sb.toString());
                context.write(outputKey, outputValue);
            }
        }
        catch (SAXException e)
        {
            context.getCounter("Bad Record Counters", "Unparsable records").increment(1);
        }
        finally
        {
            builder.reset();
        }
    }
}
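To see the extraction logic in isolation, here is a minimal, self-contained sketch (the class name and sample row are hypothetical) that parses one `<row>` element the same way the mapper does:

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Element;
import org.xml.sax.InputSource;

public class XmlRowParser {
    // Parse a single <row .../> line and return "score|owneruserid",
    // mirroring what the mapper emits for each post.
    public static String extractScoreAndOwner(String line) throws Exception {
        DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
        Element row = builder.parse(new InputSource(new StringReader(line))).getDocumentElement();
        return row.getAttribute("Score") + "|" + row.getAttribute("OwnerUserId");
    }

    public static void main(String[] args) throws Exception {
        String sample = "<row Id=\"7\" Score=\"5\" OwnerUserId=\"23\" />";
        System.out.println(extractScoreAndOwner(sample)); // prints 5|23
    }
}
```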

Reducer Class

Not required. This is a map-only job (job.setNumReduceTasks(0)), so the mapper output is written directly to the output files.

Run the Mapper job on Badges.xml and Posts.xml, extracting:

Score and OwnerUserId (from Posts.xml)

Console output for Posts.xml

17/03/15 23:58:03 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable

17/03/15 23:58:03 WARN mapred.JobClient: No job jar file set. User classes may not be found. See JobConf(Class) or JobConf#setJar(String).

17/03/15 23:58:03 INFO input.FileInputFormat: Total input paths to process : 1

17/03/15 23:58:03 WARN snappy.LoadSnappy: Snappy native library not loaded

17/03/15 23:58:03 INFO mapred.JobClient: Running job: job_local1160727004_0001

17/03/15 23:58:03 INFO mapred.LocalJobRunner: Waiting for map tasks

17/03/15 23:58:03 INFO mapred.LocalJobRunner: Starting task: attempt_local1160727004_0001_m_000000_0

17/03/15 23:58:03 INFO util.ProcessTree: setsid exited with exit code 0

17/03/15 23:58:03 INFO mapred.Task: Using ResourceCalculatorPlugin : [email protected]

17/03/15 23:58:04 INFO mapred.MapTask: Processing split: file:/home/gopal/Posts.xml:167772160+36491736

17/03/15 23:58:04 INFO mapred.JobClient: map 0% reduce 0%

17/03/15 23:58:06 INFO mapred.Task: Task:attempt_local1160727004_0001_m_000000_0 is done. And is in the process of commiting

17/03/15 23:58:06 INFO mapred.LocalJobRunner:

17/03/15 23:58:06 INFO mapred.Task: Task attempt_local1160727004_0001_m_000000_0 is allowed to commit now

17/03/15 23:58:06 INFO output.FileOutputCommitter: Saved output of task 'attempt_local1160727004_0001_m_000000_0' to /home/gopal/Posts_output1

17/03/15 23:58:06 INFO mapred.LocalJobRunner:

17/03/15 23:58:06 INFO mapred.Task: Task 'attempt_local1160727004_0001_m_000000_0' done.

17/03/15 23:58:06 INFO mapred.LocalJobRunner: Finishing task: attempt_local1160727004_0001_m_000000_0

17/03/15 23:58:06 INFO mapred.LocalJobRunner: Starting task: attempt_local1160727004_0001_m_000001_0

17/03/15 23:58:06 INFO mapred.Task: Using ResourceCalculatorPlugin : [email protected]

17/03/15 23:58:06 INFO mapred.MapTask: Processing split: file:/home/gopal/Posts.xml:0+33554432

17/03/15 23:58:07 INFO mapred.JobClient: map 16% reduce 0%

17/03/15 23:58:07 INFO mapred.Task: Task:attempt_local1160727004_0001_m_000001_0 is done. And is in the process of commiting

17/03/15 23:58:07 INFO mapred.LocalJobRunner:

17/03/15 23:58:07 INFO mapred.Task: Task attempt_local1160727004_0001_m_000001_0 is allowed to commit now

17/03/15 23:58:07 INFO output.FileOutputCommitter: Saved output of task 'attempt_local1160727004_0001_m_000001_0' to /home/gopal/Posts_output1

17/03/15 23:58:07 INFO mapred.LocalJobRunner:

17/03/15 23:58:07 INFO mapred.Task: Task 'attempt_local1160727004_0001_m_000001_0' done.

17/03/15 23:58:07 INFO mapred.LocalJobRunner: Finishing task: attempt_local1160727004_0001_m_000001_0

17/03/15 23:58:07 INFO mapred.LocalJobRunner: Starting task: attempt_local1160727004_0001_m_000002_0

17/03/15 23:58:07 INFO mapred.Task: Using ResourceCalculatorPlugin : [email protected]

17/03/15 23:58:07 INFO mapred.MapTask: Processing split: file:/home/gopal/Posts.xml:33554432+33554432

17/03/15 23:58:08 INFO mapred.JobClient: map 33% reduce 0%

17/03/15 23:58:09 INFO mapred.Task: Task:attempt_local1160727004_0001_m_000002_0 is done. And is in the process of commiting

17/03/15 23:58:09 INFO mapred.LocalJobRunner:

17/03/15 23:58:09 INFO mapred.Task: Task attempt_local1160727004_0001_m_000002_0 is allowed to commit now

17/03/15 23:58:09 INFO output.FileOutputCommitter: Saved output of task 'attempt_local1160727004_0001_m_000002_0' to /home/gopal/Posts_output1

17/03/15 23:58:09 INFO mapred.LocalJobRunner:

17/03/15 23:58:09 INFO mapred.Task: Task 'attempt_local1160727004_0001_m_000002_0' done.

17/03/15 23:58:09 INFO mapred.LocalJobRunner: Finishing task: attempt_local1160727004_0001_m_000002_0

17/03/15 23:58:09 INFO mapred.LocalJobRunner: Starting task: attempt_local1160727004_0001_m_000003_0

17/03/15 23:58:09 INFO mapred.Task: Using ResourceCalculatorPlugin : [email protected]

17/03/15 23:58:09 INFO mapred.MapTask: Processing split: file:/home/gopal/Posts.xml:67108864+33554432

17/03/15 23:58:09 INFO mapred.JobClient: map 50% reduce 0%

17/03/15 23:58:10 INFO mapred.Task: Task:attempt_local1160727004_0001_m_000003_0 is done. And is in the process of commiting

17/03/15 23:58:10 INFO mapred.LocalJobRunner:

17/03/15 23:58:10 INFO mapred.Task: Task attempt_local1160727004_0001_m_000003_0 is allowed to commit now

17/03/15 23:58:10 INFO output.FileOutputCommitter: Saved output of task 'attempt_local1160727004_0001_m_000003_0' to /home/gopal/Posts_output1

17/03/15 23:58:10 INFO mapred.LocalJobRunner:

17/03/15 23:58:10 INFO mapred.Task: Task 'attempt_local1160727004_0001_m_000003_0' done.

17/03/15 23:58:10 INFO mapred.LocalJobRunner: Finishing task: attempt_local1160727004_0001_m_000003_0

17/03/15 23:58:10 INFO mapred.LocalJobRunner: Starting task: attempt_local1160727004_0001_m_000004_0

17/03/15 23:58:10 INFO mapred.Task: Using ResourceCalculatorPlugin : [email protected]

17/03/15 23:58:10 INFO mapred.MapTask: Processing split: file:/home/gopal/Posts.xml:100663296+33554432

17/03/15 23:58:10 INFO mapred.JobClient: map 66% reduce 0%

17/03/15 23:58:11 INFO mapred.Task: Task:attempt_local1160727004_0001_m_000004_0 is done. And is in the process of commiting

17/03/15 23:58:11 INFO mapred.LocalJobRunner:

17/03/15 23:58:11 INFO mapred.Task: Task attempt_local1160727004_0001_m_000004_0 is allowed to commit now

17/03/15 23:58:11 INFO output.FileOutputCommitter: Saved output of task 'attempt_local1160727004_0001_m_000004_0' to /home/gopal/Posts_output1

17/03/15 23:58:11 INFO mapred.LocalJobRunner:

17/03/15 23:58:11 INFO mapred.Task: Task 'attempt_local1160727004_0001_m_000004_0' done.

17/03/15 23:58:11 INFO mapred.LocalJobRunner: Finishing task: attempt_local1160727004_0001_m_000004_0

17/03/15 23:58:11 INFO mapred.LocalJobRunner: Starting task: attempt_local1160727004_0001_m_000005_0

17/03/15 23:58:11 INFO mapred.Task: Using ResourceCalculatorPlugin : [email protected]

17/03/15 23:58:11 INFO mapred.MapTask: Processing split: file:/home/gopal/Posts.xml:134217728+33554432

17/03/15 23:58:11 INFO mapred.JobClient: map 83% reduce 0%

17/03/15 23:58:11 INFO mapred.Task: Task:attempt_local1160727004_0001_m_000005_0 is done. And is in the process of commiting

17/03/15 23:58:11 INFO mapred.LocalJobRunner:

17/03/15 23:58:11 INFO mapred.Task: Task attempt_local1160727004_0001_m_000005_0 is allowed to commit now

17/03/15 23:58:11 INFO output.FileOutputCommitter: Saved output of task 'attempt_local1160727004_0001_m_000005_0' to /home/gopal/Posts_output1

17/03/15 23:58:11 INFO mapred.LocalJobRunner:

17/03/15 23:58:11 INFO mapred.Task: Task 'attempt_local1160727004_0001_m_000005_0' done.

17/03/15 23:58:11 INFO mapred.LocalJobRunner: Finishing task: attempt_local1160727004_0001_m_000005_0

17/03/15 23:58:11 INFO mapred.LocalJobRunner: Map task executor complete.

17/03/15 23:58:12 INFO mapred.JobClient: map 100% reduce 0%

17/03/15 23:58:12 INFO mapred.JobClient: Job complete: job_local1160727004_0001

17/03/15 23:58:12 INFO mapred.JobClient: Counters: 12

17/03/15 23:58:12 INFO mapred.JobClient: File Output Format Counters

17/03/15 23:58:12 INFO mapred.JobClient: Bytes Written=1808359

17/03/15 23:58:12 INFO mapred.JobClient: FileSystemCounters

17/03/15 23:58:12 INFO mapred.JobClient: FILE_BYTES_READ=722341071

17/03/15 23:58:12 INFO mapred.JobClient: FILE_BYTES_WRITTEN=6697546

17/03/15 23:58:12 INFO mapred.JobClient: File Input Format Counters

17/03/15 23:58:12 INFO mapred.JobClient: Bytes Read=204284376

17/03/15 23:58:12 INFO mapred.JobClient: Map-Reduce Framework

17/03/15 23:58:12 INFO mapred.JobClient: Map input records=201808

17/03/15 23:58:12 INFO mapred.JobClient: Physical memory (bytes) snapshot=0

17/03/15 23:58:12 INFO mapred.JobClient: Spilled Records=0

17/03/15 23:58:12 INFO mapred.JobClient: CPU time spent (ms)=0

17/03/15 23:58:12 INFO mapred.JobClient: Total committed heap usage (bytes)=491257856

17/03/15 23:58:12 INFO mapred.JobClient: Virtual memory (bytes) snapshot=0

17/03/15 23:58:12 INFO mapred.JobClient: Map output records=201805

17/03/15 23:58:12 INFO mapred.JobClient: SPLIT_RAW_BYTES=546

UserId and Name (from Badges.xml)

Console output for Badges.xml

17/03/16 02:30:05 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable

17/03/16 02:30:05 WARN mapred.JobClient: No job jar file set. User classes may not be found. See JobConf(Class) or JobConf#setJar(String).

17/03/16 02:30:05 INFO input.FileInputFormat: Total input paths to process : 1

17/03/16 02:30:05 WARN snappy.LoadSnappy: Snappy native library not loaded

17/03/16 02:30:05 INFO mapred.JobClient: Running job: job_local1847365573_0001

17/03/16 02:30:05 INFO mapred.LocalJobRunner: Waiting for map tasks

17/03/16 02:30:05 INFO mapred.LocalJobRunner: Starting task: attempt_local1847365573_0001_m_000000_0

17/03/16 02:30:05 INFO util.ProcessTree: setsid exited with exit code 0

17/03/16 02:30:05 INFO mapred.Task: Using ResourceCalculatorPlugin : [email protected]

17/03/16 02:30:05 INFO mapred.MapTask: Processing split: file:/home/gopal/Badges.xml:0+31624409

17/03/16 02:30:06 INFO mapred.JobClient: map 0% reduce 0%

17/03/16 02:30:11 INFO mapred.Task: Task:attempt_local1847365573_0001_m_000000_0 is done. And is in the process of commiting

17/03/16 02:30:11 INFO mapred.LocalJobRunner:

17/03/16 02:30:11 INFO mapred.Task: Task attempt_local1847365573_0001_m_000000_0 is allowed to commit now

17/03/16 02:30:11 INFO output.FileOutputCommitter: Saved output of task 'attempt_local1847365573_0001_m_000000_0' to /home/gopal/Badges_output9

17/03/16 02:30:11 INFO mapred.LocalJobRunner:

17/03/16 02:30:11 INFO mapred.Task: Task 'attempt_local1847365573_0001_m_000000_0' done.

17/03/16 02:30:11 INFO mapred.LocalJobRunner: Finishing task: attempt_local1847365573_0001_m_000000_0

17/03/16 02:30:11 INFO mapred.LocalJobRunner: Map task executor complete.

17/03/16 02:30:11 INFO mapred.JobClient: map 100% reduce 0%

17/03/16 02:30:11 INFO mapred.JobClient: Job complete: job_local1847365573_0001

17/03/16 02:30:11 INFO mapred.JobClient: Counters: 12

17/03/16 02:30:11 INFO mapred.JobClient: File Output Format Counters

17/03/16 02:30:11 INFO mapred.JobClient: Bytes Written=5254604

17/03/16 02:30:11 INFO mapred.JobClient: File Input Format Counters

17/03/16 02:30:11 INFO mapred.JobClient: Bytes Read=31624409

17/03/16 02:30:11 INFO mapred.JobClient: FileSystemCounters

17/03/16 02:30:11 INFO mapred.JobClient: FILE_BYTES_READ=31624559

17/03/16 02:30:11 INFO mapred.JobClient: FILE_BYTES_WRITTEN=5306381

17/03/16 02:30:11 INFO mapred.JobClient: Map-Reduce Framework

17/03/16 02:30:11 INFO mapred.JobClient: Map input records=276166

17/03/16 02:30:11 INFO mapred.JobClient: Physical memory (bytes) snapshot=0

17/03/16 02:30:11 INFO mapred.JobClient: Spilled Records=0

17/03/16 02:30:11 INFO mapred.JobClient: Total committed heap usage (bytes)=89915392

17/03/16 02:30:11 INFO mapred.JobClient: CPU time spent (ms)=0

17/03/16 02:30:11 INFO mapred.JobClient: Virtual memory (bytes) snapshot=0

17/03/16 02:30:11 INFO mapred.JobClient: SPLIT_RAW_BYTES=92

17/03/16 02:30:11 INFO mapred.JobClient: Map output records=276163

Posts.txt

Badges.txt

MySQL

  1. Create a database in MySQL (games).
  2. Create the posts and badges tables.

See the image below.
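The exact schema is not shown in the original, so the DDL below is a plausible sketch; the column names and types are assumptions, chosen to match the mapper output and the join query used later:

```sql
CREATE DATABASE IF NOT EXISTS games;
USE games;

-- posts: one row per extracted post (id, score, owner)
CREATE TABLE posts (
    id          INT,
    score       INT,
    owneruserid INT
);

-- badges: one row per badge award (userid, badge name)
CREATE TABLE badges (
    userid INT,
    Name   VARCHAR(64)
);
```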

SQOOP

Now export the pipe-separated data into MySQL using Sqoop:

sqoop export --verbose --connect jdbc:mysql://localhost/games --username root --password root --table badges --input-fields-terminated-by "|" --export-dir /badges_output/part-m-*

sqoop export --verbose --connect jdbc:mysql://localhost/games --username root --password root --table posts --input-fields-terminated-by "|" --export-dir /posts_output/part-m-*

Execute a Join on both the tables

SELECT AVG(posts.score) AS Average_score, posts.owneruserid, badges.userid, badges.Name FROM badges, posts WHERE posts.owneruserid = badges.userid GROUP BY badges.userid INTO OUTFILE '/tmp/final_report.csv' FIELDS TERMINATED BY ',' ENCLOSED BY '"' LINES TERMINATED BY '\n';
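For clarity, the aggregation this query performs (average score per user, then the top N) can be sketched in plain Java; the class name and in-memory data below are hypothetical:

```java
import java.util.*;
import java.util.stream.*;

public class TopUsers {
    // rows: (userId, score) pairs, as extracted from the posts dump.
    // Returns the n users with the highest average score, best first.
    public static List<Map.Entry<String, Double>> top(List<String[]> rows, int n) {
        Map<String, Double> avgByUser = rows.stream()
            .collect(Collectors.groupingBy(r -> r[0],
                     Collectors.averagingInt(r -> Integer.parseInt(r[1]))));
        return avgByUser.entrySet().stream()
            .sorted(Map.Entry.<String, Double>comparingByValue().reversed())
            .limit(n)
            .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        List<String[]> rows = Arrays.asList(
            new String[]{"23", "5"}, new String[]{"23", "7"},
            new String[]{"42", "10"});
        System.out.println(top(rows, 2)); // user 42 (avg 10.0) ranks above user 23 (avg 6.0)
    }
}
```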

The final output of the users and their average scores is shown below.

final_report.csv
