datafu.hourglass.jobs
Class AbstractNonIncrementalJob

java.lang.Object
  extended by org.apache.hadoop.conf.Configured
      extended by datafu.hourglass.jobs.AbstractJob
          extended by datafu.hourglass.jobs.TimeBasedJob
              extended by datafu.hourglass.jobs.AbstractNonIncrementalJob
All Implemented Interfaces:
org.apache.hadoop.conf.Configurable

public abstract class AbstractNonIncrementalJob
extends TimeBasedJob

Base class for Hadoop jobs that consume time-partitioned data in a non-incremental way. Typically this is only used for comparing incremental jobs against a non-incremental baseline. It is essentially the same as AbstractPartitionCollapsingIncrementalJob without all the incremental features.

Jobs extending this class consume input data partitioned according to yyyy/MM/dd. Only a single input path is supported. The output will be written to a directory in the output path with name format yyyyMMdd derived from the end of the time window that is consumed.

This class has the same configuration and methods as TimeBasedJob. In addition it also recognizes the following properties:

When combine.inputs is true, then CombinedAvroKeyInputFormat is used instead of AvroKeyInputFormat. This enables a single map task to consume more than one file.

The num.reducers.bytes.per.reducer property controls the number of reducers to use based on the input size. The total size of the input files is divided by this number and then rounded up.

Author:
"Matthew Hayes"

Nested Class Summary
static class AbstractNonIncrementalJob.BaseCombiner
          Combiner base class for AbstractNonIncrementalJob.
static class AbstractNonIncrementalJob.BaseMapper
          Mapper base class for AbstractNonIncrementalJob.
static class AbstractNonIncrementalJob.BaseReducer
          Reducer base class for AbstractNonIncrementalJob.
static class AbstractNonIncrementalJob.Report
          Reports files created and processed for an iteration of the job.
 
Constructor Summary
AbstractNonIncrementalJob(java.lang.String name, java.util.Properties props)
          Initializes the job.
 
Method Summary
 boolean getCombineInputs()
          Gets whether inputs should be combined.
 java.lang.Class<? extends AbstractNonIncrementalJob.BaseCombiner> getCombinerClass()
          Gets the combiner class.
protected abstract  org.apache.avro.Schema getMapOutputKeySchema()
          Gets the key schema for the map output.
protected abstract  org.apache.avro.Schema getMapOutputValueSchema()
          Gets the value schema for the map output.
abstract  java.lang.Class<? extends AbstractNonIncrementalJob.BaseMapper> getMapperClass()
          Gets the mapper class.
protected abstract  org.apache.avro.Schema getReduceOutputSchema()
          Gets the reduce output schema.
abstract  java.lang.Class<? extends AbstractNonIncrementalJob.BaseReducer> getReducerClass()
          Gets the reducer class.
 AbstractNonIncrementalJob.Report getReport()
          Gets a report summarizing the run.
 void run()
          Runs the job.
 void setCombineInputs(boolean combineInputs)
          Sets whether inputs should be combined.
 
Methods inherited from class datafu.hourglass.jobs.TimeBasedJob
getDaysAgo, getEndDate, getNumDays, getStartDate, setDaysAgo, setEndDate, setNumDays, setProperties, setStartDate, validate
 
Methods inherited from class datafu.hourglass.jobs.AbstractJob
config, createRandomTempPath, ensurePath, getCountersParentPath, getFileSystem, getInputPaths, getName, getNumReducers, getOutputPath, getProperties, getRetentionCount, getTempPath, initialize, isUseCombiner, randomTempPath, setCountersParentPath, setInputPaths, setName, setNumReducers, setOutputPath, setRetentionCount, setTempPath, setUseCombiner
 
Methods inherited from class org.apache.hadoop.conf.Configured
getConf, setConf
 
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
 

Constructor Detail

AbstractNonIncrementalJob

public AbstractNonIncrementalJob(java.lang.String name,
                                 java.util.Properties props)
                          throws java.io.IOException
Initializes the job.

Parameters:
name - job name
props - configuration properties
Throws:
java.io.IOException
Method Detail

getCombineInputs

public boolean getCombineInputs()
Gets whether inputs should be combined.

Returns:
true if inputs are to be combined

setCombineInputs

public void setCombineInputs(boolean combineInputs)
Sets whether inputs should be combined.

Parameters:
combineInputs - true to combine inputs

getReport

public AbstractNonIncrementalJob.Report getReport()
Gets a report summarizing the run.

Returns:
report

run

public void run()
         throws java.io.IOException,
                java.lang.InterruptedException,
                java.lang.ClassNotFoundException
Runs the job.

Specified by:
run in class AbstractJob
Throws:
java.io.IOException
java.lang.InterruptedException
java.lang.ClassNotFoundException

getMapOutputKeySchema

protected abstract org.apache.avro.Schema getMapOutputKeySchema()
Gets the key schema for the map output.

Returns:
map output key schema

getMapOutputValueSchema

protected abstract org.apache.avro.Schema getMapOutputValueSchema()
Gets the value schema for the map output.

Returns:
map output value schema

getReduceOutputSchema

protected abstract org.apache.avro.Schema getReduceOutputSchema()
Gets the reduce output schema.

Returns:
reduce output schema

getMapperClass

public abstract java.lang.Class<? extends AbstractNonIncrementalJob.BaseMapper> getMapperClass()
Gets the mapper class.

Returns:
the mapper

getReducerClass

public abstract java.lang.Class<? extends AbstractNonIncrementalJob.BaseReducer> getReducerClass()
Gets the reducer class.

Returns:
the reducer

getCombinerClass

public java.lang.Class<? extends AbstractNonIncrementalJob.BaseCombiner> getCombinerClass()
Gets the combiner class.

Returns:
the combiner


Matthew Hayes