Apache DataFu™

Getting Started

DataFu Pig

Apache DataFu Pig is a collection of user-defined functions for working with large-scale data in Apache Pig. It provides functions in the following areas:

Statistics

Compute quantiles, median, variance, Wilson binary confidence, etc.

Set Operations

Perform set intersection, union, or difference of bags.

Bags

Convenience functions for working with bags, such as enumerating items, append, prepend, concat, group, distinct, etc.

Sessions

Sessionize events from a stream of data.

Estimation

Streaming implementations that can estimate quantiles and median.

Sampling

Simple random sampling with or without replacement, weighted sampling.

Link Analysis

Run PageRank on a graph represented by a bag of nodes and edges.

More

Other useful methods like Assert and Coalesce.

If you'd like to read more about these functions, check out the Guide. Otherwise, if you are ready to start using DataFu Pig, keep reading.

The rest of this page assumes you already have a built JAR available. If this is not the case, please see the Download page.

Basic Example: Computing Median

Let's use DataFu Pig to perform a very basic task: computing the median of some data. Suppose we have a file named input in Hadoop with the following content:

1
2
3
2
2
2
3
2
2
1

We can clearly see that the median is 2 for this data set. First we'll start up Pig's grunt shell by running pig and then register the DataFu JAR:

register datafu-pig-1.6.1.jar

To compute the median we'll use DataFu's StreamingMedian, which computes an estimate of the median but has the benefit of not requiring the data to be sorted:

DEFINE Median datafu.pig.stats.StreamingMedian();

Next we can load the data and pass it into the function to compute the median:

data = LOAD 'input' USING PigStorage() AS (val:int);
data = FOREACH (GROUP data ALL) GENERATE Median(data);
DUMP data;

This produces the expected output:

((2.0))
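
For comparison, the exact computation might look like the sketch below. It assumes DataFu's datafu.pig.stats.Median UDF, which, unlike StreamingMedian, expects its input bag to be sorted, so the bag is ordered inside a nested FOREACH before it is passed in. The aliases (ExactMedian, grouped, sorted, exact) are illustrative names, not part of the original example.

```pig
-- Assumption: datafu.pig.stats.Median is the exact counterpart of
-- StreamingMedian and requires a sorted input bag.
DEFINE ExactMedian datafu.pig.stats.Median();

data = LOAD 'input' USING PigStorage() AS (val:int);
grouped = GROUP data ALL;

-- Sort the bag inside the nested FOREACH, then hand the sorted values to the UDF.
exact = FOREACH grouped {
  sorted = ORDER data BY val;
  GENERATE ExactMedian(sorted.val);
};
DUMP exact;
```

The extra ORDER step is the cost that StreamingMedian avoids by returning an estimate, which is why the streaming version tends to be the more convenient choice on large inputs.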

Next Steps

Check out the Guide for more information on what you can do with DataFu Pig.