datafu.pig.sampling
Class WeightedSample
java.lang.Object
org.apache.pig.EvalFunc<org.apache.pig.data.DataBag>
datafu.pig.sampling.WeightedSample
@Nondeterministic
public class WeightedSample
- extends org.apache.pig.EvalFunc<org.apache.pig.data.DataBag>
Performs weighted Bernoulli sampling on a bag.
Creates a new bag by sampling the input bag without replacement. Sampling
is biased by a weight stored in one field of the bag's inner tuples, so
tuples with relatively high weights are more likely to be chosen than tuples
with low weights. The index of the weight field is passed as the second
argument. An optional third argument limits the number of items returned.
Example:
define WeightedSample datafu.pig.sampling.WeightedSample();
-- input:
-- ({(a, 100),(b, 1),(c, 5),(d, 2)})
input = LOAD 'input' AS (A: bag{T: tuple(name:chararray,score:int)});
output1 = FOREACH input GENERATE WeightedSample(A,1);
-- output1 (uses the field at index 1 as the score):
-- ({(a,100),(c,5),(b,1),(d,2)}) -- one possible random result
-- sample using the second column (index 1) and return at most 3 items
output2 = FOREACH input GENERATE WeightedSample(A,1,3);
-- output2:
-- ({(a,100),(c,5),(b,1)})
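
The sampling described above can be illustrated outside Pig with a minimal Java sketch. It is not the DataFu implementation: the class name `WeightedSampleSketch`, the method `sample`, and the choice to skip items with non-positive weights are all assumptions made for illustration. At each step one remaining item is drawn with probability proportional to its weight and then removed, so no item is returned twice.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Random;

// Hypothetical illustration only; not part of DataFu.
class WeightedSampleSketch {

    // Returns up to `limit` distinct indices into `weights`, in the order drawn.
    // Each draw picks a remaining index with probability weight / remaining total.
    static List<Integer> sample(double[] weights, int limit, Random rng) {
        List<Integer> remaining = new ArrayList<>();
        double total = 0.0;
        for (int i = 0; i < weights.length; i++) {
            if (weights[i] > 0) {          // assumption: non-positive weights are skipped
                remaining.add(i);
                total += weights[i];
            }
        }
        List<Integer> chosen = new ArrayList<>();
        while (!remaining.isEmpty() && chosen.size() < limit) {
            double target = rng.nextDouble() * total;
            double running = 0.0;
            int pick = remaining.size() - 1;   // fallback guards against rounding error
            for (int j = 0; j < remaining.size(); j++) {
                running += weights[remaining.get(j)];
                if (target < running) {
                    pick = j;
                    break;
                }
            }
            int idx = remaining.remove(pick);  // remove by position: no replacement
            total -= weights[idx];
            chosen.add(idx);
        }
        return chosen;
    }

    public static void main(String[] args) {
        String[] names = {"a", "b", "c", "d"};
        double[] weights = {100, 1, 5, 2};     // the scores from the example above
        for (int i : sample(weights, 3, new Random())) {
            System.out.println(names[i] + "," + (int) weights[i]);
        }
    }
}
```

With these weights, `a` is drawn first in most runs, but any ordering is possible. Passing `Integer.MAX_VALUE` as the limit returns every item with a positive weight.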
Fields inherited from class org.apache.pig.EvalFunc
log, pigLogger, reporter, returnType
Method Summary
org.apache.pig.data.DataBag exec(org.apache.pig.data.Tuple input)
int find_cumsum_interval(double[] scores, double val, int begin, int end)
org.apache.pig.impl.logicalLayer.schema.Schema outputSchema(org.apache.pig.impl.logicalLayer.schema.Schema input)
Methods inherited from class org.apache.pig.EvalFunc
finish, getArgToFuncMapping, getCacheFiles, getInputSchema, getLogger, getPigLogger, getReporter, getReturnType, getSchemaName, isAsynchronous, progress, setInputSchema, setPigLogger, setReporter, setUDFContextSignature, warn
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
WeightedSample
public WeightedSample()
WeightedSample
public WeightedSample(java.lang.String seed)
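
This page does not say how the `seed` string is interpreted. A reasonable guess is that it fixes the random number generator so that repeated runs return the same sample. The sketch below shows that pattern under the assumption that the string is parsed as a `long` and passed to `java.util.Random`. The class `SeededSamplerSketch` and its `nextDraw` method are hypothetical and are not part of DataFu.

```java
import java.util.Random;

// Hypothetical illustration of a string-seeded generator; not part of DataFu.
class SeededSamplerSketch {
    private final Random rng;

    // No seed: draws differ from run to run.
    SeededSamplerSketch() {
        this.rng = new Random();
    }

    // Assumption: the seed string holds a decimal long value.
    SeededSamplerSketch(String seed) {
        this.rng = new Random(Long.parseLong(seed));
    }

    double nextDraw() {
        return rng.nextDouble();
    }

    public static void main(String[] args) {
        SeededSamplerSketch first = new SeededSamplerSketch("42");
        SeededSamplerSketch second = new SeededSamplerSketch("42");
        // Two generators built from the same seed produce the same sequence.
        System.out.println(first.nextDraw() == second.nextDraw()); // prints true
    }
}
```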
exec
public org.apache.pig.data.DataBag exec(org.apache.pig.data.Tuple input)
throws java.io.IOException
- Specified by: exec in class org.apache.pig.EvalFunc<org.apache.pig.data.DataBag>
- Throws: java.io.IOException
find_cumsum_interval
public int find_cumsum_interval(double[] scores,
double val,
int begin,
int end)
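
This method is undocumented here. Judging only by its name and parameters, it appears to locate the interval of a cumulative-sum array that contains `val`. The sketch below is one way to do that with a binary search. It assumes `cumsum` is non-decreasing and that the result should be the smallest index in `[begin, end]` whose cumulative sum exceeds `val`, falling back to `end` when none does. The class and method names are hypothetical, and the actual DataFu behavior may differ.

```java
// Hypothetical illustration of a cumulative-sum interval search; not part of DataFu.
class CumsumIntervalSketch {

    // Returns the smallest index i in [begin, end] with val < cumsum[i],
    // or end when val is at least cumsum[end]. Assumes cumsum is non-decreasing.
    static int findInterval(double[] cumsum, double val, int begin, int end) {
        int lo = begin;
        int hi = end;
        while (lo < hi) {
            int mid = lo + (hi - lo) / 2;
            if (val < cumsum[mid]) {
                hi = mid;          // the answer is mid or to its left
            } else {
                lo = mid + 1;      // the answer is strictly to the right of mid
            }
        }
        return lo;
    }

    public static void main(String[] args) {
        // Cumulative sums of the weights 100, 1, 5, 2 used in the class example.
        double[] cumsum = {100, 101, 106, 108};
        System.out.println(findInterval(cumsum, 50.0, 0, 3));  // prints 0
        System.out.println(findInterval(cumsum, 105.0, 0, 3)); // prints 2
    }
}
```

Drawing `val` uniformly from `[0, 108)` and looking up its interval selects each item with probability proportional to its weight, in time logarithmic in the number of items.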
outputSchema
public org.apache.pig.impl.logicalLayer.schema.Schema outputSchema(org.apache.pig.impl.logicalLayer.schema.Schema input)
- Overrides: outputSchema in class org.apache.pig.EvalFunc<org.apache.pig.data.DataBag>
Matthew Hayes, Sam Shah