Class Quantile

  extended by org.apache.pig.EvalFunc<T>
      extended by datafu.pig.util.SimpleEvalFunc<>
          extended by datafu.pig.stats.Quantile
Direct Known Subclasses:

public class Quantile
extends SimpleEvalFunc<>

Computes quantiles for a sorted input bag, using type R-2 estimation.
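
For readers unfamiliar with type R-2 estimation, the following is a minimal Java sketch, assuming the usual Hyndman-Fan type 2 definition (the inverse of the empirical distribution function, with averaging at discontinuities). The names R2QuantileSketch and quantileR2 are hypothetical, and this is not DataFu's actual implementation.

```java
import java.util.Arrays;

// Hypothetical illustration of type R-2 quantile estimation; not DataFu code.
final class R2QuantileSketch {

    // Returns the R-2 estimate of quantile p (0.0 to 1.0) for an ascending, non-empty array.
    static double quantileR2(double[] sorted, double p) {
        int n = sorted.length;
        if (n == 0) {
            throw new IllegalArgumentException("input must not be empty");
        }
        if (p < 0.0 || p > 1.0) {
            throw new IllegalArgumentException("quantile must be in [0, 1]");
        }
        double np = n * p;
        // 1-based ranks: with h = n*p + 1/2, average x[ceil(h - 1/2)] and x[floor(h + 1/2)].
        int lo = (int) Math.ceil(np);
        int hi = (int) Math.floor(np + 1.0);
        // Clamp to the valid rank range so p = 0 gives the minimum and p = 1 the maximum.
        lo = Math.max(1, Math.min(n, lo));
        hi = Math.max(1, Math.min(n, hi));
        return (sorted[lo - 1] + sorted[hi - 1]) / 2.0;
    }

    public static void main(String[] args) {
        double[] values = {9, 10, 2, 3, 5, 8, 1, 4, 6, 7};
        Arrays.sort(values); // the estimator assumes ascending order
        for (double p : new double[] {0.0, 0.5, 1.0}) {
            System.out.println(p + " -> " + quantileR2(values, p));
        }
    }
}
```

For ten values 1 through 10, the sketch averages the fifth and sixth values at p = 0.5, giving 5.5, because n*p lands exactly on a rank. When n*p is not an integer, both ranks coincide and a single observation is returned.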

N.B., all the data is pushed to a single reducer per key, so make sure some partitioning is done (e.g., group by 'day') if the data is too large. In other words, this is not a distributed quantile computation.

Note that unlike DataFu's StreamingQuantile algorithm, this implementation gives exact quantiles. However, it requires the input bag to be sorted. Quantile must spill to disk when the input data is too large to fit in memory, which contributes to longer runtimes. Because StreamingQuantile implements accumulate, it can be much more efficient than Quantile for large input bags that do not fit well in memory.

The constructor takes a single integer argument that specifies the number of evenly-spaced quantiles to compute.

Alternatively, the constructor can take an explicit list of quantiles to compute, e.g., Quantile('0.0','0.5','1.0') yields the minimum, the median, and the maximum.

The list of quantiles need not span the entire range from 0.0 to 1.0, nor does it need to be evenly spaced.
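
The two constructor argument forms described above can be sketched in Java as follows. This is an illustration under stated assumptions, not DataFu's code: the names QuantileArgsSketch and quantilesFromArgs are hypothetical, a lone integer string is assumed to be a count, and the evenly-spaced quantiles are assumed to include both endpoints 0.0 and 1.0.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration of how the constructor arguments might be interpreted.
final class QuantileArgsSketch {

    static List<Double> quantilesFromArgs(String... k) {
        if (k.length == 0) {
            throw new IllegalArgumentException("at least one argument is required");
        }
        List<Double> quantiles = new ArrayList<>();
        if (k.length == 1 && k[0].matches("\\d+")) {
            // Assumption: a single integer is the number of evenly-spaced quantiles,
            // spread over [0.0, 1.0] with both endpoints included.
            int count = Integer.parseInt(k[0]);
            if (count < 2) {
                throw new IllegalArgumentException("need at least two evenly-spaced quantiles");
            }
            for (int i = 0; i < count; i++) {
                quantiles.add((double) i / (count - 1));
            }
        } else {
            // Otherwise each argument is an explicit quantile between 0.0 and 1.0.
            for (String s : k) {
                double q = Double.parseDouble(s);
                if (q < 0.0 || q > 1.0) {
                    throw new IllegalArgumentException("quantile out of range: " + s);
                }
                quantiles.add(q);
            }
        }
        return quantiles;
    }

    public static void main(String[] args) {
        System.out.println(quantilesFromArgs("3"));                          // [0.0, 0.5, 1.0]
        System.out.println(quantilesFromArgs("0.5", "0.90", "0.95", "0.99"));
    }
}
```

Under these assumptions, an argument of '3' is equivalent to the explicit list '0.0','0.5','1.0', while the second call shows a list that neither spans the full range nor is evenly spaced.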


 define Quantile datafu.pig.stats.Quantile('0.0','0.5','1.0');

 -- input: 9,10,2,3,5,8,1,4,6,7
 input = LOAD 'input' AS (val:int);

 grouped = GROUP input ALL;

 -- produces: (1,5.5,10)
 quantiles = FOREACH grouped {
   sorted = ORDER input BY val;
   GENERATE Quantile(sorted);
 }

See Also:
Median, StreamingQuantile

Field Summary
Fields inherited from class org.apache.pig.EvalFunc
log, pigLogger, reporter, returnType
Constructor Summary
Quantile(java.lang.String... k)
Method Summary
 call( bag)
 org.apache.pig.impl.logicalLayer.schema.Schema outputSchema(org.apache.pig.impl.logicalLayer.schema.Schema input)
          Override outputSchema so we can verify the input schema at pig compile time, instead of runtime
Methods inherited from class datafu.pig.util.SimpleEvalFunc
exec, getReturnType
Methods inherited from class org.apache.pig.EvalFunc
finish, getArgToFuncMapping, getCacheFiles, getInputSchema, getLogger, getPigLogger, getReporter, getSchemaName, isAsynchronous, progress, setInputSchema, setPigLogger, setReporter, setUDFContextSignature, warn
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait

Constructor Detail


public Quantile(java.lang.String... k)
Method Detail


public call( bag)


public org.apache.pig.impl.logicalLayer.schema.Schema outputSchema(org.apache.pig.impl.logicalLayer.schema.Schema input)
Description copied from class: SimpleEvalFunc
Override outputSchema so we can verify the input schema at pig compile time, instead of runtime

Overrides:
outputSchema in class SimpleEvalFunc<>
Parameters:
input - input schema
Returns:
call to super.outputSchema in case schema was defined elsewhere

Author:
Matthew Hayes, Sam Shah