CTBD 05b two step join

Moderator: Concepts and Technologies for DS and BDP

bongo
Neuling
Neuling
Beiträge: 9
Registriert: 23. Apr 2017 11:34

CTBD 05b two step join

Beitrag von bongo »

Hi,

I'm wondering why in the example of "average income in one year per town" with mapreduce, the Reducer 1 does not simply emit (city, 2007 income) pairs directly, instead of keeping the SSN as a key until the next mapper throws it away anyway.

This would mean we could work with less mappers (and thus partitions) in the next stage and have a mapper 2 doing nothing essentially. Wouldn't this be a good thing?

I assume Reducer 1 could also work like follows:

Code: Alles auswählen

// this reducer receives exactly two values for one SSN (1x town, 1x income2007)
iterator it = values.iterator();

income = -1
town = ""
while(it.hasNext()){
	curval = it.next()
	if(curval.matches("\\d+"){
		income = parseInt(curval)
	} else {
		town = curval
	}
	if(town && income >=0){
		EmitResult(town, income)
	}
}

meichholz
Moderator
Moderator
Beiträge: 167
Registriert: 30. Mär 2016 08:28

Re: CTBD 05b two step join

Beitrag von meichholz »

Hi,

you're right, you could directly emit the (city, income) pairs.

Working directly on the (city, income) pairs also has some implications on your Hadoop implementation. By default, Hadoop uses the TextInputFormat input format, which requires the type of the input key to be a LongWritable. If you want to adapt the implementation, you have to use an appropriate input format, which accepts keys of type string (i.e. Text), for example, KeyValueTextInputFormat.

Best,
Matthias

bongo
Neuling
Neuling
Beiträge: 9
Registriert: 23. Apr 2017 11:34

Re: CTBD 05b two step join

Beitrag von bongo »

I don't understand why my choice of reducer design should have an implication on the InputFormat Type that can be used.

In the example from the slides, there are two mappers that each take <LongWritable, Text> and output <Text,Text>, so the reducer sees only strings anyway.

meichholz
Moderator
Moderator
Beiträge: 167
Registriert: 30. Mär 2016 08:28

Re: CTBD 05b two step join

Beitrag von meichholz »

Hi,

The output of the reducer that you sketched serves as input for the second mapper. The mapper for the second phase needs to be able to deal with the types emitted by the first reducer. If you emit a pair of strings and use TextInputFormat for the second mapper, it tries to handle the keys as long values. I would suggest that you take the existing implementation and try to adapt the first reducer to produce (city, income) pairs. You can then debug the second mapper and have a look at the input keys and values.

Best,
Matthias

Antworten

Zurück zu „Archiv“