datafu.pig.sampling
Class SampleByKey

java.lang.Object
  extended by org.apache.pig.EvalFunc<java.lang.Boolean>
      extended by org.apache.pig.FilterFunc
          extended by datafu.pig.sampling.SampleByKey

public class SampleByKey
extends org.apache.pig.FilterFunc

Samples tuples by key: tuples are kept or dropped based on the values of the chosen key fields, so all tuples that share a key receive the same decision. This is essentially equivalent to grouping on the fields, applying SAMPLE, and then flattening. It is much more efficient, though, because it does not require a reduce step.

The sampler converts the key to a hash, derives a double value from the hash, and tests that value against the supplied probability. The double value derived from a key is uniformly distributed between 0 and 1.

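The only required parameter is the sampling probability, passed as a string (for example '0.5'). It may be followed by an optional seed value (the salt argument of the two-argument constructor) that controls the random number generation.

SampleByKey is deterministic: as long as the same seed is provided, the same keys are selected.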
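The hash-then-compare scheme described above can be sketched in plain Java. This is a minimal illustration under stated assumptions, not DataFu's implementation: the class name KeySamplerSketch and its methods are hypothetical, SHA-256 is used only as a stand-in hash (the hash function SampleByKey actually uses is not stated here), and treating "less than the probability" as the keep condition is also an assumption.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

// Hypothetical sketch of key-based sampling; not the DataFu source.
final class KeySamplerSketch {

    // Map (seed, key) to a double in [0, 1).
    // Assumption: SHA-256 stands in for whatever hash the real UDF uses.
    static double keyToUnitInterval(String key, String seed) {
        try {
            MessageDigest md = MessageDigest.getInstance("SHA-256");
            md.update(seed.getBytes(StandardCharsets.UTF_8));
            md.update((byte) 0); // separator between seed and key bytes
            md.update(key.getBytes(StandardCharsets.UTF_8));
            // Read the first 8 bytes of the digest as a long.
            long bits = ByteBuffer.wrap(md.digest()).getLong();
            // Keep the top 53 bits so the result is an exact double in [0, 1).
            return (bits >>> 11) * 0x1.0p-53;
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException("SHA-256 unavailable", e);
        }
    }

    // Assumed keep rule: retain the tuple when the derived value is below
    // the sampling probability. No shuffle or reduce step is involved.
    static boolean keep(String key, double probability, String seed) {
        return keyToUnitInterval(key, seed) < probability;
    }

    public static void main(String[] args) {
        // Same keys as the Pig example; tuples sharing a key share a decision.
        String[] keys = {"A", "A", "A", "B", "B"};
        for (String k : keys) {
            System.out.println(k + " -> " + keep(k, 0.5, "seed-1"));
        }
    }
}
```

Because the decision depends only on the key and the seed, every tuple with the same key gets the same result, and rerunning with the same seed reproduces the same sample. Changing the seed generally selects a different set of keys.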

Example:

 DEFINE SampleByKey datafu.pig.sampling.SampleByKey('0.5');

 -- input: (A,1), (A,2), (A,3), (B,1), (B,3)

 data = LOAD 'input' AS (A_id:chararray, C:int);
 sampled = FILTER data BY SampleByKey(A_id);

 -- output: (B,1), (B,3)

Author:
evion

Field Summary
 
Fields inherited from class org.apache.pig.EvalFunc
log, pigLogger, reporter, returnType
 
Constructor Summary
SampleByKey(java.lang.String probability)
           
SampleByKey(java.lang.String probability, java.lang.String salt)
           
 
Method Summary
 java.lang.Boolean exec(org.apache.pig.data.Tuple input)
           
 void setUDFContextSignature(java.lang.String signature)
           
 
Methods inherited from class org.apache.pig.FilterFunc
finish
 
Methods inherited from class org.apache.pig.EvalFunc
getArgToFuncMapping, getCacheFiles, getInputSchema, getLogger, getPigLogger, getReporter, getReturnType, getSchemaName, isAsynchronous, outputSchema, progress, setInputSchema, setPigLogger, setReporter, warn
 
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
 

Constructor Detail

SampleByKey

public SampleByKey(java.lang.String probability)

SampleByKey

public SampleByKey(java.lang.String probability,
                   java.lang.String salt)

Method Detail

setUDFContextSignature

public void setUDFContextSignature(java.lang.String signature)
Overrides:
setUDFContextSignature in class org.apache.pig.EvalFunc<java.lang.Boolean>

exec

public java.lang.Boolean exec(org.apache.pig.data.Tuple input)
                       throws java.io.IOException
Specified by:
exec in class org.apache.pig.EvalFunc<java.lang.Boolean>
Throws:
java.io.IOException


Matthew Hayes, Sam Shah