datafu.pig.sampling
Class SampleByKey
java.lang.Object
org.apache.pig.EvalFunc<java.lang.Boolean>
org.apache.pig.FilterFunc
datafu.pig.sampling.SampleByKey
public class SampleByKey
extends org.apache.pig.FilterFunc
Samples tuples by key: the sampling decision depends only on the values of the
chosen fields, so all tuples that share a key are kept or dropped together.
This is essentially equivalent to grouping on the fields, applying SAMPLE,
and then flattening, but it is much more efficient because it does not require
a reduce step.
The method of sampling is to convert the key to a hash, derive a double value
from this, and then test this against a supplied probability. The double value
derived from a key is uniformly distributed between 0 and 1.
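The hash-then-threshold idea described above can be sketched in plain Java. This is a minimal illustration, not the DataFu source: the class name `KeyHashSampler`, the choice of SHA-256, the bit-packing scheme, and the strict `<` comparison are all assumptions made for the example.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

// Hypothetical illustration of key-based sampling; not the DataFu implementation.
final class KeyHashSampler {

    // Map a key to a double in [0, 1) by hashing it (SHA-256 is an assumption).
    static double keyToUnitInterval(String key) {
        try {
            MessageDigest sha = MessageDigest.getInstance("SHA-256");
            byte[] digest = sha.digest(key.getBytes(StandardCharsets.UTF_8));
            // Pack the first 8 digest bytes into a long.
            long bits = 0L;
            for (int i = 0; i < 8; i++) {
                bits = (bits << 8) | (digest[i] & 0xFFL);
            }
            // Keep the top 53 bits (the precision of a double) and scale by 2^-53.
            return (bits >>> 11) * 0x1.0p-53;
        } catch (NoSuchAlgorithmException e) {
            // SHA-256 is a required algorithm on every Java platform.
            throw new IllegalStateException(e);
        }
    }

    // Keep the key when its derived value falls below the sampling probability.
    static boolean accept(String key, double probability) {
        return keyToUnitInterval(key) < probability;
    }

    public static void main(String[] args) {
        for (String key : new String[] {"A", "B", "C"}) {
            System.out.println(key + " -> " + keyToUnitInterval(key)
                    + " keep=" + accept(key, 0.5));
        }
    }
}
```

Because the derived value depends only on the key, every tuple carrying that key receives the same decision, no matter which mapper processes it.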
The only required parameter is the sampling probability. This may be followed
by an optional seed value (the salt constructor argument) to control the random
number generation. Both are passed as strings.
SampleByKey is deterministic as long as the same seed is provided.
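One way to picture the role of the seed is to mix it into the hash together with the key. The sketch below is an assumption about how such mixing could work, not a description of the actual UDF: `SaltedKeySampler` is a hypothetical class, and only the shape of its constructor (probability and salt as strings) is taken from this page.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

// Hypothetical stand-in for the UDF; the hashing scheme is assumed.
final class SaltedKeySampler {
    private final double probability;
    private final String salt;

    SaltedKeySampler(String probability, String salt) {
        this.probability = Double.parseDouble(probability);
        this.salt = salt;
    }

    // Hash salt and key together so a different salt reshuffles which keys pass.
    boolean accept(String key) {
        try {
            MessageDigest sha = MessageDigest.getInstance("SHA-256");
            sha.update(salt.getBytes(StandardCharsets.UTF_8));
            sha.update((byte) 0); // separator between the salt bytes and the key bytes
            byte[] digest = sha.digest(key.getBytes(StandardCharsets.UTF_8));
            long bits = ByteBuffer.wrap(digest).getLong(); // first 8 digest bytes
            double u = (bits >>> 11) * 0x1.0p-53;          // value in [0, 1)
            return u < probability;
        } catch (NoSuchAlgorithmException e) {
            // SHA-256 is a required algorithm on every Java platform.
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        SaltedKeySampler first = new SaltedKeySampler("0.5", "run-1");
        SaltedKeySampler again = new SaltedKeySampler("0.5", "run-1");
        for (String key : new String[] {"A", "B", "C", "D"}) {
            // Same salt and same key: the two decisions always agree.
            System.out.println(key + ": " + first.accept(key) + " / " + again.accept(key));
        }
    }
}
```

With this scheme, rerunning a job with the same salt selects the same keys, while changing the salt draws a different, independent-looking sample.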
Example:
DEFINE SampleByKey datafu.pig.sampling.SampleByKey('0.5');
-- input: (A,1), (A,2), (A,3), (B,1), (B,3)
data = LOAD 'input' AS (A_id:chararray, B_id:int);
sampled = FILTER data BY SampleByKey(A_id);
-- output: (B,1), (B,3)
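The all-or-nothing behaviour per key in the example can be mimicked outside Pig. The sketch below is illustrative only: `KeyFilterDemo`, `Row`, and `filterByKey` are hypothetical names, and the predicate that keeps key B is a hand-written stand-in for the hash-based decision. Which keys actually survive depends on the hash and the seed.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.function.Predicate;

// Hypothetical demo: a per-key decision keeps or drops whole groups of rows.
final class KeyFilterDemo {

    // A two-field row such as (A,1).
    static final class Row {
        final String key;
        final int value;

        Row(String key, int value) {
            this.key = key;
            this.value = value;
        }

        @Override
        public String toString() {
            return "(" + key + "," + value + ")";
        }
    }

    // Keep a row only if its key passes the predicate; the value is never inspected.
    static List<Row> filterByKey(List<Row> rows, Predicate<String> keepKey) {
        List<Row> kept = new ArrayList<>();
        for (Row row : rows) {
            if (keepKey.test(row.key)) {
                kept.add(row);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        List<Row> input = Arrays.asList(
                new Row("A", 1), new Row("A", 2), new Row("A", 3),
                new Row("B", 1), new Row("B", 3));
        // Stand-in decision: pretend the hash of "B" passed the test and "A" did not.
        Predicate<String> keepKey = key -> key.equals("B");
        System.out.println(filterByKey(input, keepKey)); // prints [(B,1), (B,3)]
    }
}
```

Since the filter looks only at the key, no group is ever partially sampled, and no grouping (and hence no reduce step) is needed to get that guarantee.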
Author: evion
Fields inherited from class org.apache.pig.EvalFunc
log, pigLogger, reporter, returnType
Constructor Summary
SampleByKey(java.lang.String probability)
SampleByKey(java.lang.String probability, java.lang.String salt)
Methods inherited from class org.apache.pig.FilterFunc
finish
Methods inherited from class org.apache.pig.EvalFunc
getArgToFuncMapping, getCacheFiles, getInputSchema, getLogger, getPigLogger, getReporter, getReturnType, getSchemaName, isAsynchronous, outputSchema, progress, setInputSchema, setPigLogger, setReporter, warn
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
SampleByKey
public SampleByKey(java.lang.String probability)
SampleByKey
public SampleByKey(java.lang.String probability,
java.lang.String salt)
setUDFContextSignature
public void setUDFContextSignature(java.lang.String signature)
Overrides: setUDFContextSignature in class org.apache.pig.EvalFunc<java.lang.Boolean>
exec
public java.lang.Boolean exec(org.apache.pig.data.Tuple input)
throws java.io.IOException
Specified by: exec in class org.apache.pig.EvalFunc<java.lang.Boolean>
Throws: java.io.IOException
Matthew Hayes, Sam Shah