datafu.hourglass.jobs
Class PartitionCollapsingExecutionPlanner

java.lang.Object
  extended by datafu.hourglass.jobs.ExecutionPlanner
      extended by datafu.hourglass.jobs.PartitionCollapsingExecutionPlanner

public class PartitionCollapsingExecutionPlanner
extends ExecutionPlanner

Execution planner used by AbstractPartitionCollapsingIncrementalJob and its derived classes. This creates a plan to process partitioned input data and collapse the partitions into a single output.

To use this class, the input and output paths must be specified. In addition, the desired input date range can be specified through several methods. Once these are set, createPlan() can be called to create the execution plan. The inputs to process will then be available from getInputsToProcess(), the number of reducers to use from getNumReducers(), and the input schemas from getInputSchemas().
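
A minimal planning sketch under assumed conditions is shown below. Only the constructor, createPlan(), and the getters listed on this page are documented here; the example paths, the 30-day range, the parameter types of the inherited ExecutionPlanner setters, and the package of DatePath are assumptions.

    import java.io.IOException;
    import java.util.Arrays;
    import java.util.List;
    import java.util.Properties;
    import org.apache.avro.Schema;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import datafu.hourglass.fs.DatePath;   // package assumed
    import datafu.hourglass.jobs.PartitionCollapsingExecutionPlanner;

    public class PlanExample {
      public static void main(String[] args) throws IOException {
        FileSystem fs = FileSystem.get(new Configuration());
        Properties props = new Properties();

        PartitionCollapsingExecutionPlanner planner =
            new PartitionCollapsingExecutionPlanner(fs, props);
        planner.setInputPaths(Arrays.asList(new Path("/data/event")));  // assumed List<Path> parameter
        planner.setOutputPath(new Path("/output/event"));               // assumed Path parameter
        planner.setNumDays(30);                                         // desired date range: last 30 days
        planner.createPlan();

        List<DatePath> inputs = planner.getInputsToProcess();  // all inputs, old and new
        int numReducers = planner.getNumReducers();            // sized by the ReduceEstimator
        List<Schema> schemas = planner.getInputSchemas();      // one schema per input
      }
    }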

Previous output may be reused by calling setReusePreviousOutput(boolean). If previous output exists and is to be reused, it will be available from getPreviousOutputToProcess(). New input data that falls after the previous output's time range is available from getNewInputsToProcess(). Old input data that falls before the previous output's time range, and which should be subtracted from the previous output, is available from getOldInputsToProcess().
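
The following sketch illustrates the reuse flow, continuing the example above; the branching logic is illustrative, not prescribed by this class.

    // Continuing the sketch above: enable reuse before planning.
    planner.setReusePreviousOutput(true);
    planner.createPlan();

    DatePath previous = planner.getPreviousOutputToProcess();
    if (previous != null) {
      // Previous output will be reused: combine it with the new inputs and
      // subtract the old inputs that have fallen before its time range.
      List<DatePath> newInputs = planner.getNewInputsToProcess();  // after the previous output's time range
      List<DatePath> oldInputs = planner.getOldInputsToProcess();  // before it; subtracted from previous output
    } else {
      // No previous output to reuse: process everything.
      List<DatePath> allInputs = planner.getInputsToProcess();
    }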

Configuration properties are used to configure a ReduceEstimator instance, which calculates how many reducers should be used. The number of reducers is based on the input data size and the num.reducers.bytes.per.reducer property. This can be controlled at a finer granularity through num.reducers.input.bytes.per.reducer and num.reducers.previous.bytes.per.reducer. See ReduceEstimator for more details on how these properties are used.
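
For example, the reducer estimate can be tuned through the properties passed to the constructor. Only the property names come from this description; the byte values below are arbitrary examples, and ReduceEstimator remains the authoritative reference for their semantics.

    // Sketch: tune the reducer estimate through the constructor properties.
    Properties props = new Properties();
    props.setProperty("num.reducers.bytes.per.reducer", "268435456");           // overall granularity (~256 MB per reducer)
    props.setProperty("num.reducers.input.bytes.per.reducer", "268435456");     // applied to input data
    props.setProperty("num.reducers.previous.bytes.per.reducer", "536870912");  // applied to reused previous output

    PartitionCollapsingExecutionPlanner planner =
        new PartitionCollapsingExecutionPlanner(fs, props);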

Author:
Matthew Hayes

Constructor Summary
PartitionCollapsingExecutionPlanner(org.apache.hadoop.fs.FileSystem fs, java.util.Properties props)
          Initializes the execution planner.
 
Method Summary
 void createPlan()
          Create the execution plan.
 DateRange getCurrentDateRange()
           
 java.util.List<org.apache.avro.Schema> getInputSchemas()
          Gets the input schemas.
 java.util.Map<java.lang.String,org.apache.avro.Schema> getInputSchemasByPath()
          Gets a map from input path to schema.
 java.util.List<DatePath> getInputsToProcess()
          Gets all inputs that will be processed.
 boolean getNeedsAnotherPass()
          Gets whether another pass will be required.
 java.util.List<DatePath> getNewInputsToProcess()
          Gets only the new data that will be processed.
 int getNumReducers()
          Get the number of reducers to use based on the input and previous output data size.
 java.util.List<DatePath> getOldInputsToProcess()
          Gets only the old data that will be processed.
 DatePath getPreviousOutputToProcess()
          Gets the previous output to reuse, or null if no output is being reused.
 boolean getReusePreviousOutput()
          Gets whether previous output should be reused, if it exists.
 void setReusePreviousOutput(boolean reuse)
          Sets whether previous output should be reused, if it exists.
 
Methods inherited from class datafu.hourglass.jobs.ExecutionPlanner
determineAvailableInputDates, determineDateRange, getAvailableInputsByDate, getDailyData, getDatedData, getDateRange, getDaysAgo, getEndDate, getFileSystem, getInputPaths, getMaxToProcess, getNumDays, getOutputPath, getProps, getStartDate, isFailOnMissing, loadInputData, setDaysAgo, setEndDate, setFailOnMissing, setInputPaths, setMaxToProcess, setNumDays, setOutputPath, setStartDate
 
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
 

Constructor Detail

PartitionCollapsingExecutionPlanner

public PartitionCollapsingExecutionPlanner(org.apache.hadoop.fs.FileSystem fs,
                                           java.util.Properties props)
Initializes the execution planner.

Parameters:
fs - file system
props - configuration properties
Method Detail

createPlan

public void createPlan()
                throws java.io.IOException
Create the execution plan.

Throws:
java.io.IOException

getReusePreviousOutput

public boolean getReusePreviousOutput()
Gets whether previous output should be reused, if it exists.

Returns:
true if previous output should be reused

setReusePreviousOutput

public void setReusePreviousOutput(boolean reuse)
Sets whether previous output should be reused, if it exists.

Parameters:
reuse - true if previous output should be reused

getNumReducers

public int getNumReducers()
Get the number of reducers to use based on the input and previous output data size. Must call createPlan() first.

Returns:
number of reducers to use
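
A typical use, assuming a standard Hadoop MapReduce job is being configured (how the derived jobs wire this in is not documented on this page), would be:

    planner.createPlan();
    job.setNumReduceTasks(planner.getNumReducers());  // job is an org.apache.hadoop.mapreduce.Job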

getCurrentDateRange

public DateRange getCurrentDateRange()

getPreviousOutputToProcess

public DatePath getPreviousOutputToProcess()
Gets the previous output to reuse, or null if no output is being reused. Must call createPlan() first.

Returns:
previous output to reuse, or null

getInputsToProcess

public java.util.List<DatePath> getInputsToProcess()
Gets all inputs that will be processed. This includes both old and new data. Must call createPlan() first.

Returns:
inputs to process

getNewInputsToProcess

public java.util.List<DatePath> getNewInputsToProcess()
Gets only the new data that will be processed. New data is data that falls within the desired date range. Must call createPlan() first.

Returns:
new inputs to process

getOldInputsToProcess

public java.util.List<DatePath> getOldInputsToProcess()
Gets only the old data that will be processed. Old data is data that falls before the desired date range. It will be subtracted out from the previous output. Must call createPlan() first.

Returns:
old inputs to process

getNeedsAnotherPass

public boolean getNeedsAnotherPass()
Gets whether another pass will be required. Because there may be a limit on the number of inputs processed in a single run, multiple runs may be required to process all data in the desired date range. Must call createPlan() first.

Returns:
true if another pass is required
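
A sketch of the resulting multi-pass loop; runOnePass is a hypothetical placeholder for executing the planned job, not a method of this class.

    boolean needsAnotherPass = true;
    while (needsAnotherPass) {
      PartitionCollapsingExecutionPlanner planner =
          new PartitionCollapsingExecutionPlanner(fs, props);
      // ... set input/output paths and date range as in the class description ...
      planner.createPlan();
      runOnePass(planner);  // hypothetical: run one job over planner.getInputsToProcess()
      needsAnotherPass = planner.getNeedsAnotherPass();
    }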

getInputSchemas

public java.util.List<org.apache.avro.Schema> getInputSchemas()
Gets the input schemas. Because multiple inputs are allowed, there may be multiple schemas. Must call createPlan() first.

This does not include the output schema, even though previous output may be fed back as input. The reason is that the output schema is determined based on the input schema.

Returns:
input schemas

getInputSchemasByPath

public java.util.Map<java.lang.String,org.apache.avro.Schema> getInputSchemasByPath()
Gets a map from input path to schema. Because multiple inputs are allowed, there may be multiple schemas. Must call createPlan() first.

Returns:
map from path to input schema

