Class SimpleRandomSample

  extended by org.apache.pig.EvalFunc<T>
      extended by org.apache.pig.AccumulatorEvalFunc<T>
          extended by org.apache.pig.AlgebraicEvalFunc<>
              extended by datafu.pig.sampling.SimpleRandomSample
All Implemented Interfaces:
org.apache.pig.Accumulator<>, org.apache.pig.Algebraic

public class SimpleRandomSample
extends org.apache.pig.AlgebraicEvalFunc<>

Scalable simple random sampling.

This UDF implements a scalable simple random sampling algorithm described in

 X. Meng, Scalable Simple Random Sampling and Stratified Sampling, ICML 2013.
It takes a sampling probability p as input and, with probability at least 99.99%, outputs a simple random sample of size exactly ceil(p*n), where n is the size of the population. Because it can be applied to each group independently, this UDF is well suited to stratified sampling. For example,
 DEFINE SRS datafu.pig.sampling.SimpleRandomSample('0.01');
 examples = LOAD ...
 grouped = GROUP examples BY label;
 sampled = FOREACH grouped GENERATE FLATTEN(SRS(examples));
 STORE sampled ...
In a Java Hadoop job, pre-selected records could be written directly to the output using MultipleOutputs. That feature is not available in a Pig UDF, so pre-selected records still pass through the sort phase. As long as the sample size is not huge, the extra cost should be small.
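The idea of pre-selecting records and sorting only the rest can be illustrated with a small single-machine sketch. It is not the DataFu implementation. The class name KeyedSampleSketch and the parameters acceptBelow and rejectAbove are hypothetical; the paper derives its thresholds from p, n, and the allowed failure probability, and that derivation is not reproduced here. Each record gets a uniform random key. Records with very small keys are accepted immediately, records with very large keys are dropped, and only the remaining "waitlisted" records are sorted to fill the sample up to ceil(p*n).

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.Random;

// Illustrative sketch only; names and thresholds are assumptions, not DataFu code.
final class KeyedSampleSketch {

    /** Pairs a record with the random key drawn for it. */
    private static final class Keyed<T> {
        final double key;
        final T item;

        Keyed(double key, T item) {
            this.key = key;
            this.item = item;
        }
    }

    /**
     * Draws a sample of target size ceil(p * n) from the population.
     *
     * acceptBelow and rejectAbove are hypothetical thresholds supplied by the
     * caller. With acceptBelow = 0.0 and rejectAbove = 1.0 nothing is
     * pre-selected or dropped, and the method reduces to sorting all keys and
     * keeping the smallest ceil(p * n), which always yields the exact size.
     */
    static <T> List<T> sample(List<T> population, double p,
                              double acceptBelow, double rejectAbove, Random rng) {
        int target = (int) Math.ceil(p * population.size());
        List<T> selected = new ArrayList<>();
        List<Keyed<T>> waitlist = new ArrayList<>();

        for (T item : population) {
            double key = rng.nextDouble(); // uniform in [0, 1)
            if (key < acceptBelow) {
                selected.add(item);                    // pre-selected, never sorted
            } else if (key <= rejectAbove) {
                waitlist.add(new Keyed<>(key, item));  // decided after sorting
            }
            // keys above rejectAbove are dropped immediately
        }

        // Only the waitlist is sorted; fill the sample with its smallest keys.
        waitlist.sort(Comparator.comparingDouble((Keyed<T> k) -> k.key));
        for (Keyed<T> k : waitlist) {
            if (selected.size() >= target) {
                break;
            }
            selected.add(k.item);
        }
        return selected;
    }
}
```

With badly chosen thresholds the result can miss the target size: too many pre-selected records overshoot it, and too few waitlisted records undershoot it. This is why a guarantee of this kind holds with high probability rather than always.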


Nested Class Summary
static class SimpleRandomSample.Final
static class SimpleRandomSample.Initial
static class SimpleRandomSample.Intermediate
Constructor Summary
SimpleRandomSample(java.lang.String samplingProbability)
Method Summary
 java.lang.String getFinal()
 java.lang.String getInitial()
 java.lang.String getIntermed()
 org.apache.pig.impl.logicalLayer.schema.Schema outputSchema(org.apache.pig.impl.logicalLayer.schema.Schema input)
Matthew Hayes, Sam Shah