datafu.pig.sampling
Class ReservoirSample

java.lang.Object
  extended by org.apache.pig.EvalFunc<T>
      extended by org.apache.pig.AccumulatorEvalFunc<org.apache.pig.data.DataBag>
          extended by datafu.pig.sampling.ReservoirSample
All Implemented Interfaces:
org.apache.pig.Accumulator<org.apache.pig.data.DataBag>, org.apache.pig.Algebraic

@Nondeterministic
public class ReservoirSample
extends org.apache.pig.AccumulatorEvalFunc<org.apache.pig.data.DataBag>
implements org.apache.pig.Algebraic

Performs a simple random sample using an in-memory reservoir to produce a uniformly random sample of a given size.

This is similar to SimpleRandomSample, however it is guaranteed to produce a sample of the given size. This comes at the cost of scalability. SimpleRandomSample produces a sample of the desired size with likelihood of 99.99%, while using less internal storage. ReservoirSample on the other hand uses internal storage with size equaling the desired sample to guarantee the exact sample size.

This algebraic implementation is backed by a heap and maintains the original roll in order to compensate for skew.

Author:
wvaughan

Nested Class Summary
static class ReservoirSample.Final
           
static class ReservoirSample.Initial
           
static class ReservoirSample.Intermediate
           
 
Field Summary
 
Fields inherited from class org.apache.pig.EvalFunc
log, pigLogger, reporter, returnType
 
Constructor Summary
ReservoirSample(java.lang.String numSamples)
           
 
Method Summary
 void accumulate(org.apache.pig.data.Tuple input)
           
 void cleanup()
           
 org.apache.pig.data.DataBag exec(org.apache.pig.data.Tuple input)
           
 java.lang.String getFinal()
           
 java.lang.String getInitial()
           
 java.lang.String getIntermed()
           
 org.apache.pig.data.DataBag getValue()
           
 org.apache.pig.impl.logicalLayer.schema.Schema outputSchema(org.apache.pig.impl.logicalLayer.schema.Schema input)
           
 
Methods inherited from class org.apache.pig.EvalFunc
finish, getArgToFuncMapping, getCacheFiles, getInputSchema, getLogger, getPigLogger, getReporter, getReturnType, getSchemaName, isAsynchronous, progress, setInputSchema, setPigLogger, setReporter, setUDFContextSignature, warn
 
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
 

Constructor Detail

ReservoirSample

public ReservoirSample(java.lang.String numSamples)
Method Detail

accumulate

public void accumulate(org.apache.pig.data.Tuple input)
                throws java.io.IOException
Specified by:
accumulate in interface org.apache.pig.Accumulator<org.apache.pig.data.DataBag>
Specified by:
accumulate in class org.apache.pig.AccumulatorEvalFunc<org.apache.pig.data.DataBag>
Throws:
java.io.IOException

cleanup

public void cleanup()
Specified by:
cleanup in interface org.apache.pig.Accumulator<org.apache.pig.data.DataBag>
Specified by:
cleanup in class org.apache.pig.AccumulatorEvalFunc<org.apache.pig.data.DataBag>

getValue

public org.apache.pig.data.DataBag getValue()
Specified by:
getValue in interface org.apache.pig.Accumulator<org.apache.pig.data.DataBag>
Specified by:
getValue in class org.apache.pig.AccumulatorEvalFunc<org.apache.pig.data.DataBag>

exec

public org.apache.pig.data.DataBag exec(org.apache.pig.data.Tuple input)
                                 throws java.io.IOException
Overrides:
exec in class org.apache.pig.AccumulatorEvalFunc<org.apache.pig.data.DataBag>
Throws:
java.io.IOException

outputSchema

public org.apache.pig.impl.logicalLayer.schema.Schema outputSchema(org.apache.pig.impl.logicalLayer.schema.Schema input)
Overrides:
outputSchema in class org.apache.pig.EvalFunc<org.apache.pig.data.DataBag>

getInitial

public java.lang.String getInitial()
Specified by:
getInitial in interface org.apache.pig.Algebraic

getIntermed

public java.lang.String getIntermed()
Specified by:
getIntermed in interface org.apache.pig.Algebraic

getFinal

public java.lang.String getFinal()
Specified by:
getFinal in interface org.apache.pig.Algebraic


Matthew Hayes, Sam Shah