datafu.hourglass.jobs
Class TimePartitioner

java.lang.Object
  extended by org.apache.hadoop.mapreduce.Partitioner<org.apache.avro.mapred.AvroKey<org.apache.avro.generic.GenericRecord>,org.apache.avro.mapred.AvroValue<org.apache.avro.generic.GenericRecord>>
      extended by datafu.hourglass.jobs.TimePartitioner
All Implemented Interfaces:
org.apache.hadoop.conf.Configurable

public class TimePartitioner
extends org.apache.hadoop.mapreduce.Partitioner<org.apache.avro.mapred.AvroKey<org.apache.avro.generic.GenericRecord>,org.apache.avro.mapred.AvroValue<org.apache.avro.generic.GenericRecord>>
implements org.apache.hadoop.conf.Configurable

A partitioner used by AbstractPartitionPreservingIncrementalJob to limit the number of named outputs used by each reducer.

The purpose of this partitioner is to prevent a proliferation of small files created by AbstractPartitionPreservingIncrementalJob. This job writes multiple outputs. Each output corresponds to a day of input data. By default records will be distributed across all the reducers. This means that if many input days are consumed, then each reducer will write many outputs. These outputs will typically be small. The problem gets worse as more input data is consumed, as this will cause more reducers to be required.

This partitioner solves the problem by limiting how many days of input data will be mapped to each reducer. At the extreme each day of input data could be mapped to only one reducer. This is controlled through the configuration setting incremental.reducers.per.input, which should be set in the Hadoop configuration. Input days are assigned to reducers in a round-robin fashion.

Author:
"Matthew Hayes"

Field Summary
static java.lang.String INPUT_TIMES
           
static java.lang.String REDUCERS_PER_INPUT
           
 
Constructor Summary
TimePartitioner()
           
 
Method Summary
 org.apache.hadoop.conf.Configuration getConf()
           
 int getPartition(org.apache.avro.mapred.AvroKey<org.apache.avro.generic.GenericRecord> key, org.apache.avro.mapred.AvroValue<org.apache.avro.generic.GenericRecord> value, int numReduceTasks)
           
 void setConf(org.apache.hadoop.conf.Configuration conf)
           
 
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
 

Field Detail

INPUT_TIMES

public static java.lang.String INPUT_TIMES

REDUCERS_PER_INPUT

public static java.lang.String REDUCERS_PER_INPUT
Constructor Detail

TimePartitioner

public TimePartitioner()
Method Detail

getConf

public org.apache.hadoop.conf.Configuration getConf()
Specified by:
getConf in interface org.apache.hadoop.conf.Configurable

setConf

public void setConf(org.apache.hadoop.conf.Configuration conf)
Specified by:
setConf in interface org.apache.hadoop.conf.Configurable

getPartition

public int getPartition(org.apache.avro.mapred.AvroKey<org.apache.avro.generic.GenericRecord> key,
                        org.apache.avro.mapred.AvroValue<org.apache.avro.generic.GenericRecord> value,
                        int numReduceTasks)
Specified by:
getPartition in class org.apache.hadoop.mapreduce.Partitioner<org.apache.avro.mapred.AvroKey<org.apache.avro.generic.GenericRecord>,org.apache.avro.mapred.AvroValue<org.apache.avro.generic.GenericRecord>>


Matthew Hayes