TimePartitioner (DataFu Hourglass)

Overview

Package

Class

Use

Tree

Deprecated

Index

Help

PREV CLASS NEXT CLASS

FRAMES NO FRAMES

SUMMARY: NESTED | FIELD | CONSTR | METHOD

DETAIL: FIELD | CONSTR | METHOD

datafu.hourglass.jobs
Class TimePartitioner

java.lang.Object
  org.apache.hadoop.mapreduce.Partitioner<org.apache.avro.mapred.AvroKey<org.apache.avro.generic.GenericRecord>,org.apache.avro.mapred.AvroValue<org.apache.avro.generic.GenericRecord>>
      datafu.hourglass.jobs.TimePartitioner

All Implemented Interfaces:: org.apache.hadoop.conf.Configurable

public class TimePartitioner
extends org.apache.hadoop.mapreduce.Partitioner<org.apache.avro.mapred.AvroKey<org.apache.avro.generic.GenericRecord>,org.apache.avro.mapred.AvroValue<org.apache.avro.generic.GenericRecord>>
implements org.apache.hadoop.conf.Configurable
extends org.apache.hadoop.mapreduce.Partitioner<org.apache.avro.mapred.AvroKey<org.apache.avro.generic.GenericRecord>,org.apache.avro.mapred.AvroValue<org.apache.avro.generic.GenericRecord>>
implements org.apache.hadoop.conf.Configurable

A partitioner used by AbstractPartitionPreservingIncrementalJob to limit the number of named outputs used by each reducer.

The purpose of this partitioner is to prevent a proliferation of small files created by AbstractPartitionPreservingIncrementalJob. This job writes multiple outputs. Each output corresponds to a day of input data. By default records will be distributed across all the reducers. This means that if many input days are consumed, then each reducer will write many outputs. These outputs will typically be small. The problem gets worse as more input data is consumed, as this will cause more reducers to be required.

This partitioner solves the problem by limiting how many days of input data will be mapped to each reducer. At the extreme each day of input data could be mapped to only one reducer. This is controlled through the configuration setting incremental.reducers.per.input, which should be set in the Hadoop configuration. Input days are assigned to reducers in a round-robin fashion.

Author:: "Matthew Hayes"

Field Summary
`static java.lang.String`	`INPUT_TIMES`
`static java.lang.String`	`REDUCERS_PER_INPUT`

Constructor Summary
`TimePartitioner()`

Method Summary
`org.apache.hadoop.conf.Configuration`	`getConf()`
`int`	`getPartition(org.apache.avro.mapred.AvroKey<org.apache.avro.generic.GenericRecord> key, org.apache.avro.mapred.AvroValue<org.apache.avro.generic.GenericRecord> value, int numReduceTasks)`
`void`	`setConf(org.apache.hadoop.conf.Configuration conf)`

Methods inherited from class java.lang.Object
`clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait`

Field Detail

INPUT_TIMES

public static java.lang.String INPUT_TIMES

REDUCERS_PER_INPUT

public static java.lang.String REDUCERS_PER_INPUT

Constructor Detail