datafu.pig.sampling
Class ReservoirSample
java.lang.Object
org.apache.pig.EvalFunc<T>
org.apache.pig.AccumulatorEvalFunc<org.apache.pig.data.DataBag>
datafu.pig.sampling.ReservoirSample
- All Implemented Interfaces:
- org.apache.pig.Accumulator<org.apache.pig.data.DataBag>, org.apache.pig.Algebraic
@Nondeterministic
public class ReservoirSample
- extends org.apache.pig.AccumulatorEvalFunc<org.apache.pig.data.DataBag>
- implements org.apache.pig.Algebraic
Performs a simple random sample using an in-memory reservoir to produce
a uniformly random sample of a given size.
This is similar to SimpleRandomSample
, however it is guaranteed to produce
a sample of the given size. This comes at the cost of scalability.
SimpleRandomSample
produces a sample of the desired size with likelihood of 99.99%,
while using less internal storage. ReservoirSample on the other hand uses internal storage
with size equaling the desired sample to guarantee the exact sample size.
This algebraic implementation is backed by a heap and maintains the original roll in order
to compensate for skew.
- Author:
- wvaughan
Fields inherited from class org.apache.pig.EvalFunc |
log, pigLogger, reporter, returnType |
Method Summary |
void |
accumulate(org.apache.pig.data.Tuple input)
|
void |
cleanup()
|
org.apache.pig.data.DataBag |
exec(org.apache.pig.data.Tuple input)
|
java.lang.String |
getFinal()
|
java.lang.String |
getInitial()
|
java.lang.String |
getIntermed()
|
org.apache.pig.data.DataBag |
getValue()
|
org.apache.pig.impl.logicalLayer.schema.Schema |
outputSchema(org.apache.pig.impl.logicalLayer.schema.Schema input)
|
Methods inherited from class org.apache.pig.EvalFunc |
finish, getArgToFuncMapping, getCacheFiles, getInputSchema, getLogger, getPigLogger, getReporter, getReturnType, getSchemaName, isAsynchronous, progress, setInputSchema, setPigLogger, setReporter, setUDFContextSignature, warn |
Methods inherited from class java.lang.Object |
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait |
ReservoirSample
public ReservoirSample(java.lang.String numSamples)
accumulate
public void accumulate(org.apache.pig.data.Tuple input)
throws java.io.IOException
- Specified by:
accumulate
in interface org.apache.pig.Accumulator<org.apache.pig.data.DataBag>
- Specified by:
accumulate
in class org.apache.pig.AccumulatorEvalFunc<org.apache.pig.data.DataBag>
- Throws:
java.io.IOException
cleanup
public void cleanup()
- Specified by:
cleanup
in interface org.apache.pig.Accumulator<org.apache.pig.data.DataBag>
- Specified by:
cleanup
in class org.apache.pig.AccumulatorEvalFunc<org.apache.pig.data.DataBag>
getValue
public org.apache.pig.data.DataBag getValue()
- Specified by:
getValue
in interface org.apache.pig.Accumulator<org.apache.pig.data.DataBag>
- Specified by:
getValue
in class org.apache.pig.AccumulatorEvalFunc<org.apache.pig.data.DataBag>
exec
public org.apache.pig.data.DataBag exec(org.apache.pig.data.Tuple input)
throws java.io.IOException
- Overrides:
exec
in class org.apache.pig.AccumulatorEvalFunc<org.apache.pig.data.DataBag>
- Throws:
java.io.IOException
outputSchema
public org.apache.pig.impl.logicalLayer.schema.Schema outputSchema(org.apache.pig.impl.logicalLayer.schema.Schema input)
- Overrides:
outputSchema
in class org.apache.pig.EvalFunc<org.apache.pig.data.DataBag>
getInitial
public java.lang.String getInitial()
- Specified by:
getInitial
in interface org.apache.pig.Algebraic
getIntermed
public java.lang.String getIntermed()
- Specified by:
getIntermed
in interface org.apache.pig.Algebraic
getFinal
public java.lang.String getFinal()
- Specified by:
getFinal
in interface org.apache.pig.Algebraic
Matthew Hayes, Sam Shah