datafu.hourglass.jobs
Class PartitionCollapsingExecutionPlanner

java.lang.Object
  extended by datafu.hourglass.jobs.ExecutionPlanner
      extended by datafu.hourglass.jobs.PartitionCollapsingExecutionPlanner

public class PartitionCollapsingExecutionPlanner
extends ExecutionPlanner

Execution planner used by AbstractPartitionCollapsingIncrementalJob and its derived classes. This creates a plan to process partitioned input data and collapse the partitions into a single output.

To use this class, the input and output paths must be specified. In addition, the desired input date range can be specified through several methods. Once these are set, createPlan() can be called to create the execution plan. The inputs to process will then be available from getInputsToProcess(), the number of reducers to use from getNumReducers(), and the input schemas from getInputSchemas().
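
A minimal planning sketch under assumed conditions is shown below. Only the constructor, createPlan(), and the getters listed on this page are documented here; the example paths, the 30-day range, the parameter types of the inherited ExecutionPlanner setters, and the package of DatePath are assumptions.

    import java.io.IOException;
    import java.util.Arrays;
    import java.util.List;
    import java.util.Properties;
    import org.apache.avro.Schema;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import datafu.hourglass.fs.DatePath;   // package assumed
    import datafu.hourglass.jobs.PartitionCollapsingExecutionPlanner;

    public class PlanExample {
      public static void main(String[] args) throws IOException {
        FileSystem fs = FileSystem.get(new Configuration());
        Properties props = new Properties();

        PartitionCollapsingExecutionPlanner planner =
            new PartitionCollapsingExecutionPlanner(fs, props);
        planner.setInputPaths(Arrays.asList(new Path("/data/event")));  // assumed List<Path> parameter
        planner.setOutputPath(new Path("/output/event"));               // assumed Path parameter
        planner.setNumDays(30);                                         // desired date range: last 30 days
        planner.createPlan();

        List<DatePath> inputs = planner.getInputsToProcess();  // all inputs, old and new
        int numReducers = planner.getNumReducers();            // sized by the ReduceEstimator
        List<Schema> schemas = planner.getInputSchemas();      // one schema per input
      }
    }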

Previous output may be reused by calling setReusePreviousOutput(boolean). If previous output exists and is to be reused, it will be available from getPreviousOutputToProcess(). New input data that falls after the previous output's time range is available from getNewInputsToProcess(). Old input data that falls before the previous output's time range, and which should be subtracted from the previous output, is available from getOldInputsToProcess().
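
The following sketch illustrates the reuse flow, continuing the example above; the branching logic is illustrative, not prescribed by this class.

    // Continuing the sketch above: enable reuse before planning.
    planner.setReusePreviousOutput(true);
    planner.createPlan();

    DatePath previous = planner.getPreviousOutputToProcess();
    if (previous != null) {
      // Previous output will be reused: combine it with the new inputs and
      // subtract the old inputs that have fallen before its time range.
      List<DatePath> newInputs = planner.getNewInputsToProcess();  // after the previous output's time range
      List<DatePath> oldInputs = planner.getOldInputsToProcess();  // before it; subtracted from previous output
    } else {
      // No previous output to reuse: process everything.
      List<DatePath> allInputs = planner.getInputsToProcess();
    }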

Configuration properties are used to configure a ReduceEstimator instance, which calculates how many reducers should be used. The number of reducers is based on the input data size and the num.reducers.bytes.per.reducer property. This can be controlled at a finer granularity through num.reducers.input.bytes.per.reducer and num.reducers.previous.bytes.per.reducer. See ReduceEstimator for more details on how these properties are used.
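
For example, the reducer estimate can be tuned through the properties passed to the constructor. Only the property names come from this description; the byte values below are arbitrary examples, and ReduceEstimator remains the authoritative reference for their semantics.

    // Sketch: tune the reducer estimate through the constructor properties.
    Properties props = new Properties();
    props.setProperty("num.reducers.bytes.per.reducer", "268435456");           // overall granularity (~256 MB per reducer)
    props.setProperty("num.reducers.input.bytes.per.reducer", "268435456");     // applied to input data
    props.setProperty("num.reducers.previous.bytes.per.reducer", "536870912");  // applied to reused previous output

    PartitionCollapsingExecutionPlanner planner =
        new PartitionCollapsingExecutionPlanner(fs, props);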

Author:
Matthew Hayes

Constructor Summary
PartitionCollapsingExecutionPlanner(org.apache.hadoop.fs.FileSystem fs, java.util.Properties props)
          Initializes the execution planner.
 
Method Summary
 void createPlan()
          Create the execution plan.
 DateRange getCurrentDateRange()
           
 java.util.List<org.apache.avro.Schema> getInputSchemas()
          Gets the input schemas.
 java.util.Map<java.lang.String,org.apache.avro.Schema> getInputSchemasByPath()
          Gets a map from input path to schema.
 java.util.List<DatePath> getInputsToProcess()
          Gets all inputs that will be processed.
 boolean getNeedsAnotherPass()
          Gets whether another pass will be required.
 java.util.List<DatePath> getNewInputsToProcess()
          Gets only the new data that will be processed.
 int getNumReducers()
          Get the number of reducers to use based on the input and previous output data size.
 java.util.List<DatePath> getOldInputsToProcess()
          Gets only the old data that will be processed.
 DatePath getPreviousOutputToProcess()
          Gets the previous output to reuse, or null if no output is being reused.
 boolean getReusePreviousOutput()
          Gets whether previous output should be reused, if it exists.
 void setReusePreviousOutput(boolean reuse)
          Sets whether previous output should be reused, if it exists.
 
Methods inherited from class datafu.hourglass.jobs.ExecutionPlanner
determineAvailableInputDates, determineDateRange, getAvailableInputsByDate, getDailyData, getDatedData, getDateRange, getDaysAgo, getEndDate, getFileSystem, getInputPaths, getMaxToProcess, getNumDays, getOutputPath, getProps, getStartDate, isFailOnMissing, loadInputData, setDaysAgo, setEndDate, setFailOnMissing, setInputPaths, setMaxToProcess, setNumDays, setOutputPath, setStartDate
 
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
 

Constructor Detail

PartitionCollapsingExecutionPlanner

public PartitionCollapsingExecutionPlanner(org.apache.hadoop.fs.FileSystem fs,
                                           java.util.Properties props)
Initializes the execution planner.

Parameters:
fs - file system
props - configuration properties
Method Detail

createPlan

public void createPlan()
                throws java.io.IOException
Create the execution plan.

Throws:
java.io.IOException

getReusePreviousOutput

public boolean getReusePreviousOutput()
Gets whether previous output should be reused, if it exists.

Returns:
true if previous output should be reused

setReusePreviousOutput

public void setReusePreviousOutput(boolean reuse)
Sets whether previous output should be reused, if it exists.

Parameters:
reuse - true if previous output should be reused

getNumReducers

public int getNumReducers()
Get the number of reducers to use based on the input and previous output data size. Must call createPlan() first.

Returns:
number of reducers to use
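
A typical use, assuming a standard Hadoop MapReduce job is being configured (how the derived jobs wire this in is not documented on this page), would be:

    planner.createPlan();
    job.setNumReduceTasks(planner.getNumReducers());  // job is an org.apache.hadoop.mapreduce.Job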

getCurrentDateRange

public DateRange getCurrentDateRange()

getPreviousOutputToProcess

public DatePath getPreviousOutputToProcess()
Gets the previous output to reuse, or null if no output is being reused. Must call createPlan() first.

Returns:
previous output to reuse, or null

getInputsToProcess

public java.util.List<DatePath> getInputsToProcess()
Gets all inputs that will be processed. This includes both old and new data. Must call createPlan() first.

Returns:
inputs to process

getNewInputsToProcess

public java.util.List<DatePath> getNewInputsToProcess()
Gets only the new data that will be processed. New data is data that falls within the desired date range. Must call createPlan() first.

Returns:
new inputs to process

getOldInputsToProcess

public java.util.List<DatePath> getOldInputsToProcess()
Gets only the old data that will be processed. Old data is data that falls before the desired date range. It will be subtracted out from the previous output. Must call createPlan() first.

Returns:
old inputs to process

getNeedsAnotherPass

public boolean getNeedsAnotherPass()
Gets whether another pass will be required. Because there may be a limit on the number of inputs processed in a single run, multiple runs may be required to process all data in the desired date range. Must call createPlan() first.

Returns:
true if another pass is required
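
A sketch of the resulting multi-pass loop; runOnePass is a hypothetical placeholder for executing the planned job, not a method of this class.

    boolean needsAnotherPass = true;
    while (needsAnotherPass) {
      PartitionCollapsingExecutionPlanner planner =
          new PartitionCollapsingExecutionPlanner(fs, props);
      // ... set input/output paths and date range as in the class description ...
      planner.createPlan();
      runOnePass(planner);  // hypothetical: run one job over planner.getInputsToProcess()
      needsAnotherPass = planner.getNeedsAnotherPass();
    }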

getInputSchemas

public java.util.List<org.apache.avro.Schema> getInputSchemas()
Gets the input schemas. Because multiple inputs are allowed, there may be multiple schemas. Must call createPlan() first.

This does not include the output schema, even though previous output may be fed back as input. The reason is that the output schema is determined based on the input schema.

Returns:
input schemas

getInputSchemasByPath

public java.util.Map<java.lang.String,org.apache.avro.Schema> getInputSchemasByPath()
Gets a map from input path to schema. Because multiple inputs are allowed, there may be multiple schemas. Must call createPlan() first.

Returns:
map from path to input schema

