Package datafu.hourglass.jobs

Incremental Hadoop jobs and some supporting classes.


Interface Summary
DateRangeConfigurable - An interface for an object with a configurable output date range.
Setup - Used as a callback by PartitionCollapsingIncrementalJob and PartitionPreservingIncrementalJob to provide configuration settings for the Hadoop job.

Class Summary
AbstractJob - Base class for Hadoop jobs.
AbstractNonIncrementalJob - Base class for Hadoop jobs that consume time-partitioned data in a non-incremental way.
AbstractNonIncrementalJob.BaseCombiner - Combiner base class for AbstractNonIncrementalJob.
AbstractNonIncrementalJob.BaseMapper - Mapper base class for AbstractNonIncrementalJob.
AbstractNonIncrementalJob.BaseReducer - Reducer base class for AbstractNonIncrementalJob.
AbstractNonIncrementalJob.Report - Reports files created and processed for an iteration of the job.
AbstractPartitionCollapsingIncrementalJob - An IncrementalJob that consumes partitioned input data and collapses the partitions to produce a single output.
AbstractPartitionCollapsingIncrementalJob.Report - Reports files created and processed for an iteration of the job.
AbstractPartitionPreservingIncrementalJob - An IncrementalJob that consumes partitioned input data and produces output data having the same partitions.
AbstractPartitionPreservingIncrementalJob.Report - Reports files created and processed for an iteration of the job.
DateRangePlanner - Determines the date range of inputs which should be processed.
ExecutionPlanner - Base class for execution planners.
FileCleaner - Used to remove files from the file system when they are no longer needed.
IncrementalJob - Base class for incremental jobs.
MaxInputDataExceededException
PartitionCollapsingExecutionPlanner - Execution planner used by AbstractPartitionCollapsingIncrementalJob and its derived classes.
PartitionCollapsingIncrementalJob - A concrete version of AbstractPartitionCollapsingIncrementalJob.
PartitionPreservingExecutionPlanner - Execution planner used by AbstractPartitionPreservingIncrementalJob and its derived classes.
PartitionPreservingIncrementalJob - A concrete version of AbstractPartitionPreservingIncrementalJob.
ReduceEstimator - Estimates the number of reducers needed based on input size.
StagedOutputJob - A derivation of Job that stages its output in another location and only moves it to the final destination if the job completes successfully.
TimeBasedJob - Base class for Hadoop jobs that consume time-partitioned data.
TimePartitioner - A partitioner used by AbstractPartitionPreservingIncrementalJob to limit the number of named outputs used by each reducer.

Package datafu.hourglass.jobs Description

Incremental Hadoop jobs and some supporting classes.

Jobs within this package form the core of the incremental framework implementation. There are two types of incremental jobs: partition-preserving and partition-collapsing.

A partition-preserving job consumes input data partitioned by day and produces output data partitioned by day. This is equivalent to running a MapReduce job for each individual day of input data, but much more efficient. It compares the input data against the existing output data and processes only the input data for which no corresponding output exists. For example, if output already exists for every day except the most recent, only that day's input is consumed.

A partition-collapsing job consumes input data partitioned by day and produces a single output. What distinguishes this job from a standard MapReduce job is that it can reuse the previous output, which enables it to process data much more efficiently. Rather than consuming all input data on each run, it can consume only the new data arriving since the previous run and merge it with the previous output.

Partition-preserving and partition-collapsing jobs can be created by extending AbstractPartitionPreservingIncrementalJob and AbstractPartitionCollapsingIncrementalJob, respectively, and implementing the necessary methods. Alternatively, there are concrete versions of these classes, PartitionPreservingIncrementalJob and PartitionCollapsingIncrementalJob, which can be used instead. With these classes, the implementations are provided through setters.
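To illustrate the setter-based approach, here is a minimal sketch of configuring a PartitionCollapsingIncrementalJob. The class name Example and the input and output paths are hypothetical, and keySchema, valueSchema, mapper, and reducer refer to the sketches shown after the next two paragraphs.

  // Example.class is a placeholder for the class defining the job
  PartitionCollapsingIncrementalJob job = new PartitionCollapsingIncrementalJob(Example.class);
  job.setKeySchema(keySchema);                   // Avro schema for the key
  job.setIntermediateValueSchema(valueSchema);   // schema of mapper/combiner output values
  job.setOutputValueSchema(valueSchema);         // schema of reducer output values
  job.setInputPaths(Arrays.asList(new Path("/data/event")));
  job.setOutputPath(new Path("/output"));
  job.setReusePreviousOutput(true);              // merge new data with the previous output
  job.setMapper(mapper);                         // Mapper implementation (sketched below)
  job.setReducerAccumulator(reducer);            // Accumulator used for reduce
  job.setCombinerAccumulator(job.getReducerAccumulator());
  job.setUseCombiner(true);
  job.run();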

Incremental jobs use Avro for input, intermediate, and output data. To implement an incremental job, the schemas for these must be defined. A key schema and an intermediate value schema specify the output of the mapper and combiner, which emit key-value pairs. The key schema and an output value schema specify the output of the reducer, which emits a record having key and value fields.
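As a minimal sketch (the record names, namespace, and fields are hypothetical), schemas for a job that counts events per member could be built with the Avro API as follows. Here the same value schema serves as both the intermediate and the output value schema.

  // org.apache.avro.Schema
  // Key schema: one output record per member_id
  Schema keySchema = new Schema.Parser().parse(
      "{\"type\": \"record\", \"name\": \"Key\", \"namespace\": \"com.example\","
    + " \"fields\": [{\"name\": \"member_id\", \"type\": \"long\"}]}");

  // Value schema: a count, produced by both the mapper and the reducer
  Schema valueSchema = new Schema.Parser().parse(
      "{\"type\": \"record\", \"name\": \"Value\", \"namespace\": \"com.example\","
    + " \"fields\": [{\"name\": \"count\", \"type\": \"int\"}]}");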

An incremental job also requires implementations of map and reduce, and optionally combine. The map implementation must implement a Mapper interface, which is very similar to the standard map interface in Hadoop. The combine and reduce operations are implemented through an Accumulator interface. This is similar to the standard reduce in Hadoop; however, values are provided one at a time rather than as an enumerable list. An accumulator returns either a single value or no value at all, the latter by returning null. That is, the accumulator may not return an arbitrary number of output values. This restriction precludes the implementation of certain operations, such as flatten, which do not fit well within the incremental programming model.
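For illustration, here is a rough sketch of a mapper and an accumulator that count events per member, using the hypothetical member_id and count fields from the schemas above and anonymous implementations of the framework's Mapper, KeyValueCollector, and Accumulator interfaces (GenericRecord and GenericData come from Avro). In a real job the schemas would typically be re-parsed from strings inside these callbacks, since the objects are serialized along with the job configuration.

  Mapper<GenericRecord,GenericRecord,GenericRecord> mapper =
    new Mapper<GenericRecord,GenericRecord,GenericRecord>() {
      @Override
      public void map(GenericRecord input,
                      KeyValueCollector<GenericRecord,GenericRecord> collector)
          throws IOException, InterruptedException {
        // emit (member_id, 1) for each input event
        GenericRecord key = new GenericData.Record(keySchema);
        key.put("member_id", input.get("member_id"));
        GenericRecord value = new GenericData.Record(valueSchema);
        value.put("count", 1);
        collector.collect(key, value);
      }
    };

  Accumulator<GenericRecord,GenericRecord> reducer =
    new Accumulator<GenericRecord,GenericRecord>() {
      private transient int count;

      @Override
      public void accumulate(GenericRecord value) {
        // values arrive one at a time rather than as a list
        count += (Integer)value.get("count");
      }

      @Override
      public GenericRecord getFinal() {
        // return a single record, or null to produce no output for this key
        GenericRecord output = new GenericData.Record(valueSchema);
        output.put("count", count);
        return output;
      }

      @Override
      public void cleanup() {
        count = 0;  // reset state between keys
      }
    };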



Author: Matthew Hayes