datafu.pig.bags
Class BagGroup

java.lang.Object
  extended by org.apache.pig.EvalFunc<T>
      extended by datafu.pig.util.ContextualEvalFunc<T>
          extended by datafu.pig.util.AliasableEvalFunc<org.apache.pig.data.DataBag>
              extended by datafu.pig.bags.BagGroup

public class BagGroup
extends AliasableEvalFunc<org.apache.pig.data.DataBag>

Performs an in-memory group operation on a bag. The first argument is the bag. The second argument is a projection of that bag to the keys to group by.

The following example groups input_bag by k. The output is a bag having tuples consisting of the group key k and a bag with the corresponding (k,v) tuples from input_bag for that key.

 define BagGroup datafu.pig.bags.BagGroup();

 data = LOAD 'input' AS (input_bag: bag {T: tuple(k: int, v: chararray)});
 -- ({(1,A),(1,B),(2,A),(2,B),(2,C),(3,A)})

 -- Group input_bag by k
 data2 = FOREACH data GENERATE BagGroup(input_bag, input_bag.(k)) as grouped;
 -- data2: {grouped: {(group: int,input_bag: {T: (k: int,v: chararray)})}}
 -- ({(1,{(1,A),(1,B)}),(2,{(2,A),(2,B),(2,C)}),(3,{(3,A)})})
 
 

If the key k is not needed within the input_bag for the output, it can be projected out like so:

 data3 = FOREACH data2 {
   -- project only the value
   projected = FOREACH grouped GENERATE group, input_bag.(v);
   GENERATE projected as grouped;
 }

 -- data3: {grouped: {(group: int,input_bag: {T: (k: int,v: chararray)})}}
 -- ({(1,{(A),(B)}),(2,{(A),(B),(C)}),(3,{(A)})})
 
 


Field Summary
 
Fields inherited from class org.apache.pig.EvalFunc
log, pigLogger, reporter, returnType
 
Constructor Summary
BagGroup()
           
 
Method Summary
 org.apache.pig.data.DataBag exec(org.apache.pig.data.Tuple input)
           
 org.apache.pig.impl.logicalLayer.schema.Schema getOutputSchema(org.apache.pig.impl.logicalLayer.schema.Schema input)
          Specify the output schema as in {link EvalFunc#outputSchema(Schema)}.
 
Methods inherited from class datafu.pig.util.AliasableEvalFunc
getBag, getBoolean, getDouble, getDouble, getFieldAliases, getFloat, getFloat, getInteger, getInteger, getLong, getLong, getObject, getPosition, getPosition, getPrefixedAliasName, getString, getString, outputSchema
 
Methods inherited from class datafu.pig.util.ContextualEvalFunc
getContextProperties, getInstanceName, getInstanceProperties, setUDFContextSignature
 
Methods inherited from class org.apache.pig.EvalFunc
finish, getArgToFuncMapping, getCacheFiles, getInputSchema, getLogger, getPigLogger, getReporter, getReturnType, getSchemaName, isAsynchronous, progress, setInputSchema, setPigLogger, setReporter, warn
 
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
 

Constructor Detail

BagGroup

public BagGroup()
Method Detail

getOutputSchema

public org.apache.pig.impl.logicalLayer.schema.Schema getOutputSchema(org.apache.pig.impl.logicalLayer.schema.Schema input)
Description copied from class: AliasableEvalFunc
Specify the output schema as in {link EvalFunc#outputSchema(Schema)}.

Specified by:
getOutputSchema in class AliasableEvalFunc<org.apache.pig.data.DataBag>
Returns:
outputSchema

exec

public org.apache.pig.data.DataBag exec(org.apache.pig.data.Tuple input)
                                 throws java.io.IOException
Specified by:
exec in class org.apache.pig.EvalFunc<org.apache.pig.data.DataBag>
Throws:
java.io.IOException