datafu.pig.bags
Class DistinctBy

java.lang.Object
  extended by org.apache.pig.EvalFunc<T>
      extended by org.apache.pig.AccumulatorEvalFunc<org.apache.pig.data.DataBag>
          extended by datafu.pig.bags.DistinctBy
All Implemented Interfaces:
org.apache.pig.Accumulator<org.apache.pig.data.DataBag>

public class DistinctBy
extends org.apache.pig.AccumulatorEvalFunc<org.apache.pig.data.DataBag>

Gets the distinct elements of a bag, where two tuples count as duplicates if they agree on a given set of field positions. The positions are passed to the constructor as strings; in the example below, '0' selects the first field. The input and output schemas are identical. For each distinct combination of values in the selected fields, the first tuple encountered is kept. The operation preserves order: if both A and B appear in the output and A appears before B in the input, then A appears before B in the output. Example:

 define DistinctBy datafu.pig.bags.DistinctBy('0');
 
 -- input:
 -- ({(a,1),(a,1),(b,2),(b,22),(c,3),(d,4)})
 -- (the aliases avoid "input" and "output", which are reserved keywords in Pig Latin)
 data = LOAD 'input' AS (B: bag {T: tuple(alpha:CHARARRAY, numeric:INT)});
 
 result = FOREACH data GENERATE DistinctBy(B);
 
 -- output:
 -- ({(a,1),(b,2),(c,3),(d,4)})
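
The keep-first, order-preserving rule described above can be modeled outside Pig. The following is a minimal sketch in plain Java, not DataFu's implementation: the class name DistinctBySketch, the method distinctBy, and the use of List<Object> in place of Pig tuples are all assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Illustrative model of the documented behavior; this is NOT DataFu's code.
final class DistinctBySketch {

    // Keeps the first tuple seen for each distinct combination of the values
    // at the given zero-based positions, preserving the input order.
    static List<List<Object>> distinctBy(List<List<Object>> bag, int... positions) {
        Set<List<Object>> seenKeys = new HashSet<>();
        List<List<Object>> result = new ArrayList<>();
        for (List<Object> tuple : bag) {
            // Build the key from the selected fields only.
            List<Object> key = new ArrayList<>(positions.length);
            for (int position : positions) {
                key.add(tuple.get(position));
            }
            // Set.add returns false when the key was already present.
            if (seenKeys.add(key)) {
                result.add(tuple);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        List<List<Object>> bag = Arrays.asList(
            Arrays.<Object>asList("a", 1),
            Arrays.<Object>asList("a", 1),
            Arrays.<Object>asList("b", 2),
            Arrays.<Object>asList("b", 22),
            Arrays.<Object>asList("c", 3),
            Arrays.<Object>asList("d", 4));
        // Distinct by position 0 only, as in DistinctBy('0').
        System.out.println(distinctBy(bag, 0)); // prints [[a, 1], [b, 2], [c, 3], [d, 4]]
    }
}
```

Because only position 0 forms the key, (b,22) is dropped in favor of the earlier (b,2). Passing both positions to this sketch, as in distinctBy(bag, 0, 1), would keep (b,22) and remove only the repeated (a,1).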
  
 


Field Summary
 
Fields inherited from class org.apache.pig.EvalFunc
log, pigLogger, reporter, returnType
 
Constructor Summary
DistinctBy(java.lang.String... fields)
           
 
Method Summary
 void accumulate(org.apache.pig.data.Tuple input)
           
 void cleanup()
           
 org.apache.pig.data.DataBag getValue()
           
 org.apache.pig.impl.logicalLayer.schema.Schema outputSchema(org.apache.pig.impl.logicalLayer.schema.Schema input)
           
 
Methods inherited from class org.apache.pig.AccumulatorEvalFunc
exec
 
Methods inherited from class org.apache.pig.EvalFunc
finish, getArgToFuncMapping, getCacheFiles, getInputSchema, getLogger, getPigLogger, getReporter, getReturnType, getSchemaName, isAsynchronous, progress, setInputSchema, setPigLogger, setReporter, setUDFContextSignature, warn
 
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
 

Constructor Detail

DistinctBy

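public DistinctBy(java.lang.String... fields)
Parameters:
fields - the positions of the fields that determine distinctness, given as strings (for example '0')

Method Detail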

accumulate

public void accumulate(org.apache.pig.data.Tuple input)
                throws java.io.IOException
Specified by:
accumulate in interface org.apache.pig.Accumulator<org.apache.pig.data.DataBag>
Specified by:
accumulate in class org.apache.pig.AccumulatorEvalFunc<org.apache.pig.data.DataBag>
Throws:
java.io.IOException

cleanup

public void cleanup()
Specified by:
cleanup in interface org.apache.pig.Accumulator<org.apache.pig.data.DataBag>
Specified by:
cleanup in class org.apache.pig.AccumulatorEvalFunc<org.apache.pig.data.DataBag>

getValue

public org.apache.pig.data.DataBag getValue()
Specified by:
getValue in interface org.apache.pig.Accumulator<org.apache.pig.data.DataBag>
Specified by:
getValue in class org.apache.pig.AccumulatorEvalFunc<org.apache.pig.data.DataBag>
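
Pig's Accumulator contract passes the tuples for a key to accumulate in one or more batches, calls getValue once all of them have been delivered, and then calls cleanup before the next key. The sketch below illustrates that lifecycle in plain Java under stated assumptions: DistinctByAccumulatorSketch is a hypothetical stand-in that uses no Pig classes, models tuples as List<Object>, and supports a single key position for brevity. It is not the DataFu implementation.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical stand-in that mirrors the accumulate/getValue/cleanup lifecycle.
final class DistinctByAccumulatorSketch {
    private final int keyPosition;
    private final Set<Object> seenKeys = new HashSet<>();
    private List<List<Object>> kept = new ArrayList<>();

    DistinctByAccumulatorSketch(int keyPosition) {
        this.keyPosition = keyPosition;
    }

    // May be called several times per key; state carries over between batches.
    void accumulate(List<List<Object>> batch) {
        for (List<Object> tuple : batch) {
            if (seenKeys.add(tuple.get(keyPosition))) {
                kept.add(tuple);
            }
        }
    }

    // Called after the last batch for the current key.
    List<List<Object>> getValue() {
        return kept;
    }

    // Resets state so the next key starts empty; a fresh list is allocated
    // so a previously returned result is not modified.
    void cleanup() {
        seenKeys.clear();
        kept = new ArrayList<>();
    }

    public static void main(String[] args) {
        DistinctByAccumulatorSketch acc = new DistinctByAccumulatorSketch(0);
        acc.accumulate(Arrays.asList(
            Arrays.<Object>asList("a", 1),
            Arrays.<Object>asList("a", 1),
            Arrays.<Object>asList("b", 2)));
        acc.accumulate(Arrays.asList(
            Arrays.<Object>asList("b", 22),
            Arrays.<Object>asList("c", 3)));
        System.out.println(acc.getValue()); // prints [[a, 1], [b, 2], [c, 3]]
        acc.cleanup();
        acc.accumulate(Arrays.asList(Arrays.<Object>asList("b", 22)));
        System.out.println(acc.getValue()); // prints [[b, 22]]
    }
}
```

The second batch shows why the set of seen keys must persist across accumulate calls: (b,22) is rejected because b was already seen in the first batch. After cleanup the same tuple is accepted, since it belongs to a new key's bag.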

outputSchema

public org.apache.pig.impl.logicalLayer.schema.Schema outputSchema(org.apache.pig.impl.logicalLayer.schema.Schema input)
Overrides:
outputSchema in class org.apache.pig.EvalFunc<org.apache.pig.data.DataBag>


Matthew Hayes, Sam Shah