|
||||||||||
PREV CLASS NEXT CLASS | FRAMES NO FRAMES | |||||||||
SUMMARY: NESTED | FIELD | CONSTR | METHOD | DETAIL: FIELD | CONSTR | METHOD |
java.lang.Object datafu.hourglass.jobs.ExecutionPlanner datafu.hourglass.jobs.PartitionCollapsingExecutionPlanner
public class PartitionCollapsingExecutionPlanner
Execution planner used by AbstractPartitionCollapsingIncrementalJob
and its derived classes.
This creates a plan to process partitioned input data and collapse the partitions into a single output.
To use this class, the input and output paths must be specified. In addition the desired input date
range can be specified through several methods. Then createPlan()
can be called and the
execution plan will be created. The inputs to process will be available from getInputsToProcess()
,
the number of reducers to use will be available from getNumReducers()
, and the input schemas
will be available from getInputSchemas()
.
Previous output may be reused by using setReusePreviousOutput(boolean)
. If previous output exists
and it is to be reused then it will be available from getPreviousOutputToProcess()
. New input data
to process that is after the previous output time range is available from getNewInputsToProcess()
.
Old input data to process that is before the previous output time range and should be subtracted from the
previous output is available from getOldInputsToProcess()
.
Configuration properties are used to configure a ReduceEstimator
instance. This is used to
calculate how many reducers should be used.
The number of reducers to use is based on the input data size and the
num.reducers.bytes.per.reducer property. This setting can be controlled more granularly
through num.reducers.input.bytes.per.reducer and num.reducers.previous.bytes.per.reducer.
Check ReduceEstimator
for more details on how the properties are used.
Constructor Summary | |
---|---|
PartitionCollapsingExecutionPlanner(org.apache.hadoop.fs.FileSystem fs,
java.util.Properties props)
Initializes the execution planner. |
Method Summary | |
---|---|
void |
createPlan()
Create the execution plan. |
DateRange |
getCurrentDateRange()
|
java.util.List<org.apache.avro.Schema> |
getInputSchemas()
Gets the input schemas. |
java.util.Map<java.lang.String,org.apache.avro.Schema> |
getInputSchemasByPath()
Gets a map from input path to schema. |
java.util.List<DatePath> |
getInputsToProcess()
Gets all inputs that will be processed. |
boolean |
getNeedsAnotherPass()
Gets whether another pass will be required. |
java.util.List<DatePath> |
getNewInputsToProcess()
Gets only the new data that will be processed. |
int |
getNumReducers()
Get the number of reducers to use based on the input and previous output data size. |
java.util.List<DatePath> |
getOldInputsToProcess()
Gets only the old data that will be processed. |
DatePath |
getPreviousOutputToProcess()
Gets the previous output to reuse, or null if no output is being reused. |
boolean |
getReusePreviousOutput()
Gets whether previous output should be reused, if it exists. |
void |
setReusePreviousOutput(boolean reuse)
Sets whether previous output should be reused, if it exists. |
Methods inherited from class datafu.hourglass.jobs.ExecutionPlanner |
---|
determineAvailableInputDates, determineDateRange, getAvailableInputsByDate, getDailyData, getDatedData, getDateRange, getDaysAgo, getEndDate, getFileSystem, getInputPaths, getMaxToProcess, getNumDays, getOutputPath, getProps, getStartDate, isFailOnMissing, loadInputData, setDaysAgo, setEndDate, setFailOnMissing, setInputPaths, setMaxToProcess, setNumDays, setOutputPath, setStartDate |
Methods inherited from class java.lang.Object |
---|
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait |
Constructor Detail |
---|
public PartitionCollapsingExecutionPlanner(org.apache.hadoop.fs.FileSystem fs, java.util.Properties props)
fs
- file systemprops
- configuration propertiesMethod Detail |
---|
public void createPlan() throws java.io.IOException
java.io.IOException
public boolean getReusePreviousOutput()
public void setReusePreviousOutput(boolean reuse)
reuse
- true if previous output should be reusedpublic int getNumReducers()
createPlan()
first.
public DateRange getCurrentDateRange()
public DatePath getPreviousOutputToProcess()
createPlan()
first.
public java.util.List<DatePath> getInputsToProcess()
createPlan()
first.
public java.util.List<DatePath> getNewInputsToProcess()
createPlan()
first.
public java.util.List<DatePath> getOldInputsToProcess()
createPlan()
first.
public boolean getNeedsAnotherPass()
createPlan()
first.
public java.util.List<org.apache.avro.Schema> getInputSchemas()
createPlan()
first.
This does not include the output schema, even though previous output may be fed back as input. The reason is that the ouput schema it determined based on the input schema.
public java.util.Map<java.lang.String,org.apache.avro.Schema> getInputSchemasByPath()
createPlan()
first.
|
||||||||||
PREV CLASS NEXT CLASS | FRAMES NO FRAMES | |||||||||
SUMMARY: NESTED | FIELD | CONSTR | METHOD | DETAIL: FIELD | CONSTR | METHOD |