| Interface Summary | |
|---|---|
| DateRangeConfigurable | An interface for an object with a configurable output date range. |
| Setup | Used as a callback by PartitionCollapsingIncrementalJob and PartitionPreservingIncrementalJob to provide configuration settings for the Hadoop job. |
| Class Summary | |
|---|---|
| AbstractJob | Base class for Hadoop jobs. |
| AbstractNonIncrementalJob | Base class for Hadoop jobs that consume time-partitioned data in a non-incremental way. |
| AbstractNonIncrementalJob.BaseCombiner | Combiner base class for AbstractNonIncrementalJob. |
| AbstractNonIncrementalJob.BaseMapper | Mapper base class for AbstractNonIncrementalJob. |
| AbstractNonIncrementalJob.BaseReducer | Reducer base class for AbstractNonIncrementalJob. |
| AbstractNonIncrementalJob.Report | Reports files created and processed for an iteration of the job. |
| AbstractPartitionCollapsingIncrementalJob | An IncrementalJob that consumes partitioned input data and collapses the partitions to produce a single output. |
| AbstractPartitionCollapsingIncrementalJob.Report | Reports files created and processed for an iteration of the job. |
| AbstractPartitionPreservingIncrementalJob | An IncrementalJob that consumes partitioned input data and produces output data having the same partitions. |
| AbstractPartitionPreservingIncrementalJob.Report | Reports files created and processed for an iteration of the job. |
| DateRangePlanner | Determines the date range of inputs which should be processed. |
| ExecutionPlanner | Base class for execution planners. |
| FileCleaner | Used to remove files from the file system when they are no longer needed. |
| IncrementalJob | Base class for incremental jobs. |
| MaxInputDataExceededException | |
| PartitionCollapsingExecutionPlanner | Execution planner used by AbstractPartitionCollapsingIncrementalJob and its derived classes. |
| PartitionCollapsingIncrementalJob | A concrete version of AbstractPartitionCollapsingIncrementalJob. |
| PartitionPreservingExecutionPlanner | Execution planner used by AbstractPartitionPreservingIncrementalJob and its derived classes. |
| PartitionPreservingIncrementalJob | A concrete version of AbstractPartitionPreservingIncrementalJob. |
| ReduceEstimator | Estimates the number of reducers needed based on input size. |
| StagedOutputJob | A derivation of Job that stages its output in another location and only moves it to the final destination if the job completes successfully. |
| TimeBasedJob | Base class for Hadoop jobs that consume time-partitioned data. |
| TimePartitioner | A partitioner used by AbstractPartitionPreservingIncrementalJob to limit the number of named outputs used by each reducer. |
Incremental Hadoop jobs and some supporting classes.
Jobs within this package form the core of the incremental framework implementation. There are two types of incremental jobs: partition-preserving and partition-collapsing.
A partition-preserving job consumes input data partitioned by day and produces output data partitioned by day. This is equivalent to running a MapReduce job for each individual day of input data, but much more efficient. It compares the input data against the existing output data and only processes input data with no corresponding output.
A partition-collapsing job consumes input data partitioned by day and produces a single output. What distinguishes this job from a standard MapReduce job is that it can reuse the previous output, which enables it to process data much more efficiently. Rather than consuming all input data on each run, it can consume only the new data since the previous run and merge it with the previous output.
Partition-preserving and partition-collapsing jobs can be created by extending AbstractPartitionPreservingIncrementalJob and AbstractPartitionCollapsingIncrementalJob, respectively, and implementing the necessary methods. Alternatively, there are concrete versions of these classes, PartitionPreservingIncrementalJob and PartitionCollapsingIncrementalJob, which can be used instead; with these classes, the implementations are provided through setters.
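For example, a partition-collapsing job could be instantiated and pointed at its input and output locations as in the following sketch. The paths and class name are hypothetical, and setters such as setInputPaths, setOutputPath, and setReusePreviousOutput are assumed to be available on the concrete job:

```java
import java.io.IOException;
import java.util.Arrays;

import org.apache.hadoop.fs.Path;

import datafu.hourglass.jobs.PartitionCollapsingIncrementalJob;

public class EventCountJobSetup
{
  public static void main(String[] args) throws IOException
  {
    // Create the concrete job; the class passed in is used to configure the job.
    PartitionCollapsingIncrementalJob job =
        new PartitionCollapsingIncrementalJob(EventCountJobSetup.class);

    // Hypothetical locations: day-partitioned input data and a single collapsed output.
    job.setInputPaths(Arrays.asList(new Path("/data/event")));
    job.setOutputPath(new Path("/output/event_count"));

    // Reuse the previous output so only new partitions need to be consumed.
    job.setReusePreviousOutput(true);

    // Schemas, the mapper, and the accumulators are also provided through setters
    // (see the sketches below) before the job is run.
  }
}
```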
Incremental jobs use Avro for input, intermediate, and output data. To implement an incremental job, the schemas for these must be defined. A key schema and an intermediate value schema specify the output of the mapper and combiner, which emit key-value pairs. The key schema and an output value schema specify the output of the reducer, which emits a record having key and value fields.
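For example, the key and value schemas for a simple per-member event count might be declared with Avro's SchemaBuilder as in this sketch (the record names, namespace, and fields are illustrative):

```java
import org.apache.avro.Schema;
import org.apache.avro.SchemaBuilder;

public class EventCountSchemas
{
  // Key emitted by the mapper and combiner: the member being counted.
  public static final Schema KEY_SCHEMA = SchemaBuilder
      .record("Key").namespace("com.example")
      .fields().requiredLong("member_id").endRecord();

  // Value schema, used both as the intermediate value (mapper/combiner output)
  // and as the "value" field of the record emitted by the reducer.
  public static final Schema VALUE_SCHEMA = SchemaBuilder
      .record("Value").namespace("com.example")
      .fields().requiredInt("count").endRecord();
}
```

These schemas would then be registered on the job, for instance through setters such as setKeySchema, setIntermediateValueSchema, and setOutputValueSchema on the concrete job classes.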
An incremental job also requires implementations of map and reduce, and optionally combine. The map implementation must implement the Mapper interface, which is very similar to the standard map interface in Hadoop.
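A minimal sketch of such a mapper follows, assuming the Mapper and KeyValueCollector interfaces from the datafu.hourglass.model package and the illustrative schemas above; it emits a count of one per member for each input record:

```java
import java.io.IOException;

import org.apache.avro.Schema;
import org.apache.avro.generic.GenericData;
import org.apache.avro.generic.GenericRecord;

import datafu.hourglass.model.KeyValueCollector;
import datafu.hourglass.model.Mapper;

// Emits (member_id, count=1) for each input record.  The schemas are carried as
// JSON strings and parsed lazily because the mapper is serialized out to the tasks.
public class EventCountMapper implements Mapper<GenericRecord, GenericRecord, GenericRecord>
{
  private final String keySchemaJson;
  private final String valueSchemaJson;
  private transient Schema keySchema;
  private transient Schema valueSchema;

  public EventCountMapper(String keySchemaJson, String valueSchemaJson)
  {
    this.keySchemaJson = keySchemaJson;
    this.valueSchemaJson = valueSchemaJson;
  }

  @Override
  public void map(GenericRecord input, KeyValueCollector<GenericRecord, GenericRecord> collector)
      throws IOException, InterruptedException
  {
    if (keySchema == null) keySchema = new Schema.Parser().parse(keySchemaJson);
    if (valueSchema == null) valueSchema = new Schema.Parser().parse(valueSchemaJson);

    GenericRecord key = new GenericData.Record(keySchema);
    key.put("member_id", input.get("member_id"));

    GenericRecord value = new GenericData.Record(valueSchema);
    value.put("count", 1);

    collector.collect(key, value);
  }
}
```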
The combine and reduce operations are implemented through the Accumulator interface. This is similar to the standard reduce in Hadoop; however, values are provided one at a time rather than as an enumerable list. An accumulator returns either a single value or no value at all (by returning null); that is, it may not return an arbitrary number of values for the output. This restriction precludes the implementation of certain operations, such as flatten, which do not fit well within the incremental programming model.
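A sketch of an accumulator for the same illustrative count, again assuming the Accumulator interface from datafu.hourglass.model:

```java
import org.apache.avro.Schema;
import org.apache.avro.generic.GenericData;
import org.apache.avro.generic.GenericRecord;

import datafu.hourglass.model.Accumulator;

// Sums counts for a key.  Values arrive one at a time via accumulate();
// getFinal() returns the single output value (or null for no output);
// cleanup() resets state so the instance can be reused for the next key.
public class CountAccumulator implements Accumulator<GenericRecord, GenericRecord>
{
  private final String valueSchemaJson;
  private transient Schema valueSchema;
  private transient long count;

  public CountAccumulator(String valueSchemaJson)
  {
    this.valueSchemaJson = valueSchemaJson;
  }

  @Override
  public void accumulate(GenericRecord value)
  {
    count += ((Number) value.get("count")).longValue();
  }

  @Override
  public GenericRecord getFinal()
  {
    if (valueSchema == null) valueSchema = new Schema.Parser().parse(valueSchemaJson);
    GenericRecord output = new GenericData.Record(valueSchema);
    output.put("count", (int) count);
    return output;
  }

  @Override
  public void cleanup()
  {
    count = 0;
  }
}
```

With the concrete job classes, the mapper and accumulators would then be attached through setters such as setMapper, setCombinerAccumulator, and setReducerAccumulator before running the job.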