datafu.pig.sampling
Class WeightedSample

java.lang.Object
  extended by org.apache.pig.EvalFunc<org.apache.pig.data.DataBag>
      extended by datafu.pig.sampling.WeightedSample

@Nondeterministic
public class WeightedSample
extends org.apache.pig.EvalFunc<org.apache.pig.data.DataBag>

Performs weighted Bernoulli sampling on a bag.

Creates a new bag by performing weighted sampling without replacement from the input bag. Sampling is biased according to a weight stored in one field of each inner tuple: tuples with relatively high weights are more likely to be chosen than tuples with low weights. The first argument is the bag and the second is the index of the weight field. An optional third argument limits the number of items returned.

Example:

 define WeightedSample datafu.pig.sampling.WeightedSample();
 
 -- input:
 -- ({(a, 100),(b, 1),(c, 5),(d, 2)})
 input = LOAD 'input' AS (A: bag{T: tuple(name:chararray,score:int)});
 
 output1 = FOREACH input GENERATE WeightedSample(A,1);
 -- output1 (the field at index 1 is used as the weight; the order is random,
 -- so this is only one possible result):
 -- ({(a,100),(c,5),(b,1),(d,2)})
 
 -- sample using the second column (index 1) as the weight and return at most 3 items
 output2 = FOREACH input GENERATE WeightedSample(A,1,3);
 -- output2:
 -- ({(a,100),(c,5),(b,1)})
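
The behavior described above (weighted sampling without replacement, with an optional limit) can be illustrated outside of Pig. The following is a minimal sketch under stated assumptions, not the code of this UDF: the class WeightedSampleSketch and its sample method are hypothetical names, and drawing one item at a time in proportion to the remaining weights is one common way to get this behavior, which may differ from what WeightedSample does internally.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Random;

// Hypothetical illustration only; not part of DataFu.
class WeightedSampleSketch {

    // Draws up to `limit` names without replacement. At each step an item is
    // picked with probability proportional to its weight among the items
    // that have not been picked yet.
    static List<String> sample(List<String> names, List<Double> weights,
                               int limit, Random rng) {
        List<String> remainingNames = new ArrayList<>(names);
        List<Double> remainingWeights = new ArrayList<>(weights);
        List<String> result = new ArrayList<>();

        while (!remainingNames.isEmpty() && result.size() < limit) {
            double total = 0.0;
            for (double w : remainingWeights) {
                total += w;
            }
            // Uniform point in [0, total); walk the running sum to find its owner.
            double target = rng.nextDouble() * total;
            int chosen = remainingNames.size() - 1; // fallback guards against rounding
            double running = 0.0;
            for (int i = 0; i < remainingWeights.size(); i++) {
                running += remainingWeights.get(i);
                if (target < running) {
                    chosen = i;
                    break;
                }
            }
            result.add(remainingNames.remove(chosen));
            remainingWeights.remove(chosen); // int argument: removes by index
        }
        return result;
    }

    public static void main(String[] args) {
        // Same data as the Pig example: (a,100), (b,1), (c,5), (d,2).
        List<String> names = Arrays.asList("a", "b", "c", "d");
        List<Double> weights = Arrays.asList(100.0, 1.0, 5.0, 2.0);
        // A fixed seed makes the run repeatable.
        System.out.println(sample(names, weights, 3, new Random(1L)));
    }
}
```

With these weights, "a" is very likely to be drawn first, but every ordering remains possible, which is why the outputs shown in the Pig example are only examples.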
 
 


Field Summary
 
Fields inherited from class org.apache.pig.EvalFunc
log, pigLogger, reporter, returnType
 
Constructor Summary
WeightedSample()
           
WeightedSample(java.lang.String seed)
           
 
Method Summary
 org.apache.pig.data.DataBag exec(org.apache.pig.data.Tuple input)
           
 int find_cumsum_interval(double[] scores, double val, int begin, int end)
           
 org.apache.pig.impl.logicalLayer.schema.Schema outputSchema(org.apache.pig.impl.logicalLayer.schema.Schema input)
           
 
Methods inherited from class org.apache.pig.EvalFunc
finish, getArgToFuncMapping, getCacheFiles, getInputSchema, getLogger, getPigLogger, getReporter, getReturnType, getSchemaName, isAsynchronous, progress, setInputSchema, setPigLogger, setReporter, setUDFContextSignature, warn
 
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
 

Constructor Detail

WeightedSample

public WeightedSample()

WeightedSample

public WeightedSample(java.lang.String seed)

Method Detail

exec

public org.apache.pig.data.DataBag exec(org.apache.pig.data.Tuple input)
                                 throws java.io.IOException
Specified by:
exec in class org.apache.pig.EvalFunc<org.apache.pig.data.DataBag>
Throws:
java.io.IOException

find_cumsum_interval

public int find_cumsum_interval(double[] scores,
                                double val,
                                int begin,
                                int end)
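
This page gives no contract for find_cumsum_interval. Its name and parameters suggest a search over an array of cumulative sums for the interval that contains val, but that reading is an assumption. The sketch below shows one such search as a binary search. CumsumIntervalSketch and findInterval are hypothetical names, and the precondition and return convention in the comments are choices made for this illustration rather than documented behavior of the method.

```java
// Hypothetical illustration only; not the DataFu implementation.
class CumsumIntervalSketch {

    // Assumes cumsum is non-decreasing and cumsum[begin] <= val < cumsum[end].
    // Returns the index i with cumsum[i] <= val < cumsum[i + 1].
    static int findInterval(double[] cumsum, double val, int begin, int end) {
        while (end - begin > 1) {
            int mid = begin + (end - begin) / 2;
            if (val < cumsum[mid]) {
                end = mid;   // val lies to the left of mid
            } else {
                begin = mid; // val lies at or to the right of mid
            }
        }
        return begin;
    }

    public static void main(String[] args) {
        // Cumulative sums of the weights 100, 1, 5, 2, with a leading 0.
        double[] cumsum = {0.0, 100.0, 101.0, 106.0, 108.0};
        System.out.println(findInterval(cumsum, 50.0, 0, 4));  // prints 0
        System.out.println(findInterval(cumsum, 100.5, 0, 4)); // prints 1
        System.out.println(findInterval(cumsum, 107.0, 0, 4)); // prints 3
    }
}
```

Under these assumptions, a uniform random value in [0, 108) falls into interval i with probability proportional to the i-th weight, which is how a cumulative-sum lookup can drive a weighted draw.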

outputSchema

public org.apache.pig.impl.logicalLayer.schema.Schema outputSchema(org.apache.pig.impl.logicalLayer.schema.Schema input)
Overrides:
outputSchema in class org.apache.pig.EvalFunc<org.apache.pig.data.DataBag>


Matthew Hayes, Sam Shah