datafu.pig.sampling
Class SimpleRandomSample
java.lang.Object
org.apache.pig.EvalFunc<T>
org.apache.pig.AccumulatorEvalFunc<T>
org.apache.pig.AlgebraicEvalFunc<org.apache.pig.data.DataBag>
datafu.pig.sampling.SimpleRandomSample
- All Implemented Interfaces:
- org.apache.pig.Accumulator<org.apache.pig.data.DataBag>, org.apache.pig.Algebraic
public class SimpleRandomSample
- extends org.apache.pig.AlgebraicEvalFunc<org.apache.pig.data.DataBag>
Scalable simple random sampling.
This UDF implements a scalable simple random sampling algorithm described in
X. Meng, Scalable Simple Random Sampling and Stratified Sampling, ICML 2013.
It takes a sampling probability p as input and outputs a simple random sample of size
exactly ceil(p*n) with probability at least 99.99%, where $n$ is the size of the
population. This UDF is very useful for stratified sampling. For example,
DEFINE SRS datafu.pig.sampling.SimpleRandomSample('0.01');
examples = LOAD ...
grouped = GROUP examples BY label;
sampled = FOREACH grouped GENERATE FLATTEN(SRS(examples));
STORE sampled ...
We note that, in a Java Hadoop job, we can output pre-selected records directly using
MultipleOutputs. However, this feature is not available in a Pig UDF. So we still let
pre-selected records go through the sort phase. However, as long as the sample size is
not huge, this should not be a big problem.
- Author:
- ximeng
Fields inherited from class org.apache.pig.EvalFunc |
log, pigLogger, reporter, returnType |
Method Summary |
java.lang.String |
getFinal()
|
java.lang.String |
getInitial()
|
java.lang.String |
getIntermed()
|
org.apache.pig.impl.logicalLayer.schema.Schema |
outputSchema(org.apache.pig.impl.logicalLayer.schema.Schema input)
|
Methods inherited from class org.apache.pig.AlgebraicEvalFunc |
accumulate, cleanup, getValue |
Methods inherited from class org.apache.pig.AccumulatorEvalFunc |
exec |
Methods inherited from class org.apache.pig.EvalFunc |
finish, getArgToFuncMapping, getCacheFiles, getInputSchema, getLogger, getPigLogger, getReporter, getReturnType, getSchemaName, isAsynchronous, progress, setInputSchema, setPigLogger, setReporter, setUDFContextSignature, warn |
Methods inherited from class java.lang.Object |
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait |
SimpleRandomSample
public SimpleRandomSample()
SimpleRandomSample
public SimpleRandomSample(java.lang.String samplingProbability)
getInitial
public java.lang.String getInitial()
- Specified by:
getInitial
in interface org.apache.pig.Algebraic
- Specified by:
getInitial
in class org.apache.pig.AlgebraicEvalFunc<org.apache.pig.data.DataBag>
getIntermed
public java.lang.String getIntermed()
- Specified by:
getIntermed
in interface org.apache.pig.Algebraic
- Specified by:
getIntermed
in class org.apache.pig.AlgebraicEvalFunc<org.apache.pig.data.DataBag>
getFinal
public java.lang.String getFinal()
- Specified by:
getFinal
in interface org.apache.pig.Algebraic
- Specified by:
getFinal
in class org.apache.pig.AlgebraicEvalFunc<org.apache.pig.data.DataBag>
outputSchema
public org.apache.pig.impl.logicalLayer.schema.Schema outputSchema(org.apache.pig.impl.logicalLayer.schema.Schema input)
- Overrides:
outputSchema
in class org.apache.pig.EvalFunc<org.apache.pig.data.DataBag>
Matthew Hayes, Sam Shah