datafu.pig.bags
Class BagGroup
java.lang.Object
org.apache.pig.EvalFunc<T>
datafu.pig.util.ContextualEvalFunc<T>
datafu.pig.util.AliasableEvalFunc<org.apache.pig.data.DataBag>
datafu.pig.bags.BagGroup
public class BagGroup
extends AliasableEvalFunc<org.apache.pig.data.DataBag>
Performs an in-memory group operation on a bag. The first argument is the bag.
The second argument is a projection of that bag onto the key fields to group by.
The following example groups input_bag by k. The output is a bag of tuples,
each consisting of a group key k and a bag holding the (k,v) tuples from input_bag
that have that key.
define BagGroup datafu.pig.bags.BagGroup();
data = LOAD 'input' AS (input_bag: bag {T: tuple(k: int, v: chararray)});
-- ({(1,A),(1,B),(2,A),(2,B),(2,C),(3,A)})
-- Group input_bag by k
data2 = FOREACH data GENERATE BagGroup(input_bag, input_bag.(k)) as grouped;
-- data2: {grouped: {(group: int,input_bag: {T: (k: int,v: chararray)})}}
-- ({(1,{(1,A),(1,B)}),(2,{(2,A),(2,B),(2,C)}),(3,{(3,A)})})
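The grouping semantics shown above can be sketched in plain Java without any Pig dependencies. This is only an illustration of the idea, not the BagGroup implementation: the class BagGroupSketch, the KV type and the groupByKey method are hypothetical names, and the use of a LinkedHashMap (so that groups appear in first-seen order) is a choice of this sketch; the documentation above does not state any ordering guarantee for the output bag.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class BagGroupSketch {
    // Hypothetical stand-in for one (k, v) tuple of input_bag.
    static final class KV {
        final int k;
        final String v;

        KV(int k, String v) {
            this.k = k;
            this.v = v;
        }

        @Override
        public String toString() {
            return "(" + k + "," + v + ")";
        }
    }

    // Groups the tuples by k entirely in memory. Each group keeps the
    // full (k, v) tuples, mirroring the shape of data2 in the Pig example.
    // Assumption of this sketch: groups are kept in first-seen key order.
    static Map<Integer, List<KV>> groupByKey(List<KV> inputBag) {
        Map<Integer, List<KV>> groups = new LinkedHashMap<>();
        for (KV t : inputBag) {
            groups.computeIfAbsent(t.k, key -> new ArrayList<>()).add(t);
        }
        return groups;
    }

    public static void main(String[] args) {
        List<KV> inputBag = Arrays.asList(
            new KV(1, "A"), new KV(1, "B"),
            new KV(2, "A"), new KV(2, "B"), new KV(2, "C"),
            new KV(3, "A"));
        // prints {1=[(1,A), (1,B)], 2=[(2,A), (2,B), (2,C)], 3=[(3,A)]}
        System.out.println(groupByKey(inputBag));
    }
}
```

Because every group is held in a map at once, this approach only suits bags that fit in memory, which matches the "in-memory" wording of the description.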
If the key k is not needed within the input_bag for the output, it can be projected
out like so:
data3 = FOREACH data2 {
-- project only the value
projected = FOREACH grouped GENERATE group, input_bag.(v);
GENERATE projected as grouped;
}
-- data3: {grouped: {(group: int,input_bag: {(v: chararray)})}}
-- ({(1,{(A),(B)}),(2,{(A),(B),(C)}),(3,{(A)})})
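The projection step can be pictured the same way in plain Java: each group keeps its key, and the tuples inside it are reduced to their value field. This is a sketch under assumptions, not DataFu code. The class BagGroupProjectionSketch and the helpers entry and projectValues are hypothetical, and (k, v) tuples are modeled as Map.Entry<Integer, String> purely for illustration.

```java
import java.util.AbstractMap;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class BagGroupProjectionSketch {
    // Hypothetical helper: builds one (k, v) tuple.
    static Map.Entry<Integer, String> entry(int k, String v) {
        return new AbstractMap.SimpleEntry<>(k, v);
    }

    // Drops k from the tuples of every group and keeps only v,
    // analogous to "GENERATE group, input_bag.(v)" in the Pig example.
    static Map<Integer, List<String>> projectValues(
            Map<Integer, List<Map.Entry<Integer, String>>> grouped) {
        Map<Integer, List<String>> projected = new LinkedHashMap<>();
        for (Map.Entry<Integer, List<Map.Entry<Integer, String>>> g : grouped.entrySet()) {
            List<String> values = new ArrayList<>();
            for (Map.Entry<Integer, String> t : g.getValue()) {
                values.add(t.getValue());
            }
            projected.put(g.getKey(), values);
        }
        return projected;
    }

    public static void main(String[] args) {
        // Grouped data shaped like data2 in the Pig example.
        Map<Integer, List<Map.Entry<Integer, String>>> grouped = new LinkedHashMap<>();
        grouped.put(1, Arrays.asList(entry(1, "A"), entry(1, "B")));
        grouped.put(2, Arrays.asList(entry(2, "A"), entry(2, "B"), entry(2, "C")));
        grouped.put(3, Arrays.asList(entry(3, "A")));
        // prints {1=[A, B], 2=[A, B, C], 3=[A]}
        System.out.println(projectValues(grouped));
    }
}
```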
Fields inherited from class org.apache.pig.EvalFunc
log, pigLogger, reporter, returnType
Method Summary
org.apache.pig.data.DataBag
exec(org.apache.pig.data.Tuple input)
org.apache.pig.impl.logicalLayer.schema.Schema
getOutputSchema(org.apache.pig.impl.logicalLayer.schema.Schema input)
Specify the output schema as in EvalFunc.outputSchema(Schema).
Methods inherited from class datafu.pig.util.AliasableEvalFunc
getBag, getBoolean, getDouble, getDouble, getFieldAliases, getFloat, getFloat, getInteger, getInteger, getLong, getLong, getObject, getPosition, getPosition, getPrefixedAliasName, getString, getString, outputSchema
Methods inherited from class org.apache.pig.EvalFunc
finish, getArgToFuncMapping, getCacheFiles, getInputSchema, getLogger, getPigLogger, getReporter, getReturnType, getSchemaName, isAsynchronous, progress, setInputSchema, setPigLogger, setReporter, warn
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
BagGroup
public BagGroup()
getOutputSchema
public org.apache.pig.impl.logicalLayer.schema.Schema getOutputSchema(org.apache.pig.impl.logicalLayer.schema.Schema input)
- Description copied from class: AliasableEvalFunc
- Specify the output schema as in EvalFunc.outputSchema(Schema).
- Specified by: getOutputSchema in class AliasableEvalFunc<org.apache.pig.data.DataBag>
- Returns: outputSchema
exec
public org.apache.pig.data.DataBag exec(org.apache.pig.data.Tuple input)
throws java.io.IOException
- Specified by: exec in class org.apache.pig.EvalFunc<org.apache.pig.data.DataBag>
- Throws: java.io.IOException