datafu.hourglass.jobs
Class PartitionPreservingExecutionPlanner

java.lang.Object
  extended by datafu.hourglass.jobs.ExecutionPlanner
      extended by datafu.hourglass.jobs.PartitionPreservingExecutionPlanner

public class PartitionPreservingExecutionPlanner
extends ExecutionPlanner

Execution planner used by AbstractPartitionPreservingIncrementalJob and its derived classes. This creates a plan to process partitioned input data and produce partitioned output data.

To use this class, first specify the input and output paths with the inherited setInputPaths and setOutputPath methods. Optionally restrict the input date range with the inherited setStartDate, setEndDate, setNumDays, and setDaysAgo methods. Then call createPlan() to create the execution plan. Once the plan exists, the inputs to process are available from getInputsToProcess(), the number of reducers to use from getNumReducers(), and the input schemas from getInputSchemas().

Configuration properties are used to configure a ReduceEstimator instance, which calculates how many reducers should be used. The number of reducers is based on the input data size and the num.reducers.bytes.per.reducer property. See ReduceEstimator for details on how the properties are used.
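
As a rough illustration of size-based reducer estimation, the sketch below divides the total input size by a bytes-per-reducer setting and rounds up. This is an assumption about the general approach, not ReduceEstimator's actual logic, which may differ. The class ReducerEstimateSketch, the method estimateReducers, and the 256 MB fallback are hypothetical; only the property name num.reducers.bytes.per.reducer comes from the description above.

```java
import java.util.Properties;

// Hypothetical stand-in for illustration; not part of datafu.hourglass.
final class ReducerEstimateSketch {

    // Assumed fallback when the property is absent (illustrative value only).
    static final long DEFAULT_BYTES_PER_REDUCER = 256L * 1024 * 1024;

    // Estimates a reducer count as ceil(totalInputBytes / bytesPerReducer),
    // with a floor of one reducer.
    static int estimateReducers(long totalInputBytes, Properties props) {
        if (totalInputBytes < 0) {
            throw new IllegalArgumentException("input size must be non-negative");
        }
        long bytesPerReducer = Long.parseLong(
            props.getProperty("num.reducers.bytes.per.reducer",
                              Long.toString(DEFAULT_BYTES_PER_REDUCER)));
        if (bytesPerReducer <= 0) {
            throw new IllegalArgumentException("bytes per reducer must be positive");
        }
        // Ceiling division written to avoid overflow on very large inputs.
        long reducers = totalInputBytes / bytesPerReducer
            + (totalInputBytes % bytesPerReducer == 0 ? 0 : 1);
        return (int) Math.max(1L, Math.min((long) Integer.MAX_VALUE, reducers));
    }

    public static void main(String[] args) {
        Properties props = new Properties();
        props.setProperty("num.reducers.bytes.per.reducer", "1000");
        // 2500 bytes at 1000 bytes per reducer rounds up to 3.
        System.out.println(estimateReducers(2500, props)); // prints 3
    }
}
```

With the real planner, the corresponding value would simply be read from getNumReducers() after createPlan() has been called.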

Author:
"Matthew Hayes"

Constructor Summary
PartitionPreservingExecutionPlanner(org.apache.hadoop.fs.FileSystem fs, java.util.Properties props)
          Initializes the execution planner.
 
Method Summary
 void createPlan()
          Create the execution plan.
 java.util.List<java.util.Date> getDatesToProcess()
          Gets the input dates which are to be processed.
 java.util.List<org.apache.avro.Schema> getInputSchemas()
          Gets the input schemas.
 java.util.Map<java.lang.String,org.apache.avro.Schema> getInputSchemasByPath()
          Gets a map from input path to schema.
 java.util.List<DatePath> getInputsToProcess()
          Gets the inputs which are to be processed.
 boolean getNeedsAnotherPass()
          Gets whether another pass will be required.
 int getNumReducers()
          Get the number of reducers to use based on the input data size.
 
Methods inherited from class datafu.hourglass.jobs.ExecutionPlanner
determineAvailableInputDates, determineDateRange, getAvailableInputsByDate, getDailyData, getDatedData, getDateRange, getDaysAgo, getEndDate, getFileSystem, getInputPaths, getMaxToProcess, getNumDays, getOutputPath, getProps, getStartDate, isFailOnMissing, loadInputData, setDaysAgo, setEndDate, setFailOnMissing, setInputPaths, setMaxToProcess, setNumDays, setOutputPath, setStartDate
 
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
 

Constructor Detail

PartitionPreservingExecutionPlanner

public PartitionPreservingExecutionPlanner(org.apache.hadoop.fs.FileSystem fs,
                                           java.util.Properties props)
Initializes the execution planner.

Parameters:
fs - file system
props - configuration properties

Method Detail

createPlan

public void createPlan()
                throws java.io.IOException
Create the execution plan.

Throws:
java.io.IOException

getNumReducers

public int getNumReducers()
Get the number of reducers to use based on the input data size. Must call createPlan() first.

Returns:
number of reducers to use

getInputSchemas

public java.util.List<org.apache.avro.Schema> getInputSchemas()
Gets the input schemas. Because multiple inputs are allowed, there may be multiple schemas. Must call createPlan() first.

Returns:
input schemas

getInputSchemasByPath

public java.util.Map<java.lang.String,org.apache.avro.Schema> getInputSchemasByPath()
Gets a map from input path to schema. Because multiple inputs are allowed, there may be multiple schemas. Must call createPlan() first.

Returns:
map from path to input schema

getNeedsAnotherPass

public boolean getNeedsAnotherPass()
Gets whether another pass will be required. Because there may be a limit on the number of inputs processed in a single run, multiple runs may be required to process all data in the desired date range. Must call createPlan() first.

Returns:
true if another pass is required
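
The multi-pass behavior described above suggests a driver loop that re-plans and re-runs until no further pass is needed. The sketch below shows that loop against a hypothetical in-memory stand-in, FakePlanner, which only mimics the method names createPlan(), getDatesToProcess(), and getNeedsAnotherPass(). How the real planner determines what remains to be processed is not stated here, so the stand-in simply tracks pending dates in a list; the per-run cap and the markProcessed() helper are likewise assumptions for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical illustration; none of these types belong to datafu.hourglass.
final class MultiPassSketch {

    static final class FakePlanner {
        private final List<String> pending;   // dates not yet processed
        private final int maxToProcess;       // assumed cap on inputs per run
        private List<String> datesToProcess = new ArrayList<>();
        private boolean needsAnotherPass;

        FakePlanner(List<String> pendingDates, int maxToProcess) {
            if (maxToProcess < 1) {
                throw new IllegalArgumentException("maxToProcess must be at least 1");
            }
            this.pending = new ArrayList<>(pendingDates);
            this.maxToProcess = maxToProcess;
        }

        // Plans up to maxToProcess pending dates for the next run.
        void createPlan() {
            int n = Math.min(maxToProcess, pending.size());
            datesToProcess = new ArrayList<>(pending.subList(0, n));
            needsAnotherPass = pending.size() > n;
        }

        List<String> getDatesToProcess() { return datesToProcess; }

        boolean getNeedsAnotherPass() { return needsAnotherPass; }

        // Simulates a completed run: the planned dates are no longer pending.
        void markProcessed() { pending.removeAll(datesToProcess); }
    }

    // Plans and "runs" repeatedly; returns the dates handled in each pass.
    static List<List<String>> runAllPasses(FakePlanner planner) {
        List<List<String>> passes = new ArrayList<>();
        boolean again;
        do {
            planner.createPlan();
            passes.add(planner.getDatesToProcess());
            planner.markProcessed();
            again = planner.getNeedsAnotherPass();
        } while (again);
        return passes;
    }

    public static void main(String[] args) {
        FakePlanner planner = new FakePlanner(
            Arrays.asList("2013-03-01", "2013-03-02", "2013-03-03",
                          "2013-03-04", "2013-03-05"), 2);
        // Five pending dates with a cap of two per run take three passes.
        for (List<String> pass : runAllPasses(planner)) {
            System.out.println(pass);
        }
    }
}
```

In a real job, the body of the loop would run the Hadoop job for the planned inputs before checking whether another pass is required.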

getInputsToProcess

public java.util.List<DatePath> getInputsToProcess()
Gets the inputs which are to be processed. Must call createPlan() first.

Returns:
inputs to process

getDatesToProcess

public java.util.List<java.util.Date> getDatesToProcess()
Gets the input dates which are to be processed. Must call createPlan() first.

Returns:
dates to process

