datafu.pig.sampling
Class ReservoirSample
java.lang.Object
org.apache.pig.EvalFunc<T>
org.apache.pig.AccumulatorEvalFunc<org.apache.pig.data.DataBag>
datafu.pig.sampling.ReservoirSample
- All Implemented Interfaces:
- org.apache.pig.Accumulator<org.apache.pig.data.DataBag>, org.apache.pig.Algebraic
@Nondeterministic
public class ReservoirSample
- extends org.apache.pig.AccumulatorEvalFunc<org.apache.pig.data.DataBag>
- implements org.apache.pig.Algebraic
Maintains an in-memory reservoir to produce a uniformly random sample of a given size.
This algebraic implementation is backed by a heap and maintains the original roll in order
to compensate for skew.
- Author:
- wvaughan
Fields inherited from class org.apache.pig.EvalFunc |
log, pigLogger, reporter, returnType |
Method Summary |
void |
accumulate(org.apache.pig.data.Tuple input)
|
void |
cleanup()
|
org.apache.pig.data.DataBag |
exec(org.apache.pig.data.Tuple input)
|
java.lang.String |
getFinal()
|
java.lang.String |
getInitial()
|
java.lang.String |
getIntermed()
|
org.apache.pig.data.DataBag |
getValue()
|
org.apache.pig.impl.logicalLayer.schema.Schema |
outputSchema(org.apache.pig.impl.logicalLayer.schema.Schema input)
|
Methods inherited from class org.apache.pig.EvalFunc |
finish, getArgToFuncMapping, getCacheFiles, getInputSchema, getLogger, getPigLogger, getReporter, getReturnType, getSchemaName, isAsynchronous, progress, setInputSchema, setPigLogger, setReporter, setUDFContextSignature, warn |
Methods inherited from class java.lang.Object |
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait |
ReservoirSample
public ReservoirSample(java.lang.String numSamples)
accumulate
public void accumulate(org.apache.pig.data.Tuple input)
throws java.io.IOException
- Specified by:
accumulate
in interface org.apache.pig.Accumulator<org.apache.pig.data.DataBag>
- Specified by:
accumulate
in class org.apache.pig.AccumulatorEvalFunc<org.apache.pig.data.DataBag>
- Throws:
java.io.IOException
cleanup
public void cleanup()
- Specified by:
cleanup
in interface org.apache.pig.Accumulator<org.apache.pig.data.DataBag>
- Specified by:
cleanup
in class org.apache.pig.AccumulatorEvalFunc<org.apache.pig.data.DataBag>
getValue
public org.apache.pig.data.DataBag getValue()
- Specified by:
getValue
in interface org.apache.pig.Accumulator<org.apache.pig.data.DataBag>
- Specified by:
getValue
in class org.apache.pig.AccumulatorEvalFunc<org.apache.pig.data.DataBag>
exec
public org.apache.pig.data.DataBag exec(org.apache.pig.data.Tuple input)
throws java.io.IOException
- Overrides:
exec
in class org.apache.pig.AccumulatorEvalFunc<org.apache.pig.data.DataBag>
- Throws:
java.io.IOException
outputSchema
public org.apache.pig.impl.logicalLayer.schema.Schema outputSchema(org.apache.pig.impl.logicalLayer.schema.Schema input)
- Overrides:
outputSchema
in class org.apache.pig.EvalFunc<org.apache.pig.data.DataBag>
getInitial
public java.lang.String getInitial()
- Specified by:
getInitial
in interface org.apache.pig.Algebraic
getIntermed
public java.lang.String getIntermed()
- Specified by:
getIntermed
in interface org.apache.pig.Algebraic
getFinal
public java.lang.String getFinal()
- Specified by:
getFinal
in interface org.apache.pig.Algebraic
Matthew Hayes, Sam Shah