Class SimpleRandomSample

  extended by org.apache.pig.EvalFunc<T>
      extended by org.apache.pig.AccumulatorEvalFunc<T>
          extended by org.apache.pig.AlgebraicEvalFunc<>
              extended by datafu.pig.sampling.SimpleRandomSample
All Implemented Interfaces:
org.apache.pig.Accumulator<>, org.apache.pig.Algebraic

public class SimpleRandomSample
extends org.apache.pig.AlgebraicEvalFunc<>

Scalable simple random sampling.

This UDF implements a scalable simple random sampling algorithm described in

 X. Meng, Scalable Simple Random Sampling and Stratified Sampling, ICML 2013.
It takes a sampling probability p as input and, with probability at least 99.99%, outputs a simple random sample of size exactly ceil(p*n), where n is the size of the population. This makes the UDF well suited to stratified sampling: group the data by stratum and apply the UDF to each group. For example,
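 -- Sample each group with probability p = 0.01.
 DEFINE SRS datafu.pig.sampling.SimpleRandomSample('0.01');
 examples = LOAD ...
 -- Each distinct label forms one stratum.
 grouped = GROUP examples BY label;
 -- Sample within each stratum, then flatten the per-stratum samples.
 sampled = FOREACH grouped GENERATE FLATTEN(SRS(examples));
 STORE sampled ...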
In a Java Hadoop job, pre-selected records could be written out directly using MultipleOutputs. That feature is not available in a Pig UDF, so pre-selected records still go through the sort phase. As long as the sample size is not huge, this should not be a big problem.
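To make the size guarantee concrete, the following is a minimal in-memory Java sketch of what a simple random sample of size ceil(p*n) looks like. It is not the scalable algorithm from the paper and not this UDF's implementation: it simply shuffles a copy of the population and keeps a prefix. The class name NaiveSrsSketch and its methods are hypothetical names chosen for illustration.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Random;

// Hypothetical illustration only; not part of DataFu.
public class NaiveSrsSketch {
    // Target sample size from the description above: ceil(p * n).
    // Note: p * n is computed in floating point, so rounding can matter for some p.
    public static int sampleSize(double p, long n) {
        return (int) Math.ceil(p * n);
    }

    // Shuffle a copy of the population uniformly at random and keep the
    // first ceil(p * n) items. Every subset of that size is equally likely.
    public static <T> List<T> sample(List<T> population, double p, Random rng) {
        List<T> copy = new ArrayList<>(population);
        Collections.shuffle(copy, rng);
        return new ArrayList<>(copy.subList(0, sampleSize(p, copy.size())));
    }

    public static void main(String[] args) {
        List<Integer> population = new ArrayList<>();
        for (int i = 0; i < 10; i++) {
            population.add(i);
        }
        List<Integer> s = sample(population, 0.25, new Random(42));
        System.out.println(s.size()); // ceil(0.25 * 10) = 3
    }
}
```

This naive version holds the whole population in memory and shuffles all of it, which is the cost a scalable approach is meant to avoid on large inputs.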


Nested Class Summary
static class SimpleRandomSample.Final
static class SimpleRandomSample.Initial
static class SimpleRandomSample.Intermediate
Field Summary
Fields inherited from class org.apache.pig.EvalFunc
log, pigLogger, reporter, returnType
Constructor Summary
SimpleRandomSample(java.lang.String samplingProbability)
Method Summary
 java.lang.String getFinal()
 java.lang.String getInitial()
 java.lang.String getIntermed()
 org.apache.pig.impl.logicalLayer.schema.Schema outputSchema(org.apache.pig.impl.logicalLayer.schema.Schema input)
Methods inherited from class org.apache.pig.AlgebraicEvalFunc
accumulate, cleanup, getValue
Methods inherited from class org.apache.pig.AccumulatorEvalFunc
Methods inherited from class org.apache.pig.EvalFunc
finish, getArgToFuncMapping, getCacheFiles, getInputSchema, getLogger, getPigLogger, getReporter, getReturnType, getSchemaName, isAsynchronous, progress, setInputSchema, setPigLogger, setReporter, setUDFContextSignature, warn
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait

Constructor Detail


public SimpleRandomSample()


public SimpleRandomSample(java.lang.String samplingProbability)
Method Detail


public java.lang.String getInitial()
Specified by:
getInitial in interface org.apache.pig.Algebraic
Specified by:
getInitial in class org.apache.pig.AlgebraicEvalFunc<>


public java.lang.String getIntermed()
Specified by:
getIntermed in interface org.apache.pig.Algebraic
Specified by:
getIntermed in class org.apache.pig.AlgebraicEvalFunc<>


public java.lang.String getFinal()
Specified by:
getFinal in interface org.apache.pig.Algebraic
Specified by:
getFinal in class org.apache.pig.AlgebraicEvalFunc<>


public org.apache.pig.impl.logicalLayer.schema.Schema outputSchema(org.apache.pig.impl.logicalLayer.schema.Schema input)
Overrides:
outputSchema in class org.apache.pig.EvalFunc<>

Matthew Hayes, Sam Shah