# DataFu Pig

Apache DataFu Pig is a collection of user-defined functions for working with large-scale data in Apache Pig. It has a number of useful functions available:

#### Statistics

Compute quantiles, median, variance, Wilson binary confidence intervals, etc.

#### Set Operations

Perform set intersection, union, or difference of bags.

#### Bags

Convenience functions for working with bags, such as enumerating items, append, prepend, concat, group, distinct, etc.

#### Sessions

Sessionize events from a stream of data.

#### Estimation

Streaming implementations that can estimate quantiles and median.

#### Sampling

Simple random sampling with or without replacement, and weighted sampling.

#### Link Analysis

Run PageRank on a graph represented by a bag of nodes and edges.

#### More

Other useful methods like Assert and Coalesce.

If you'd like to read more details about these functions, check out the Guide. Otherwise, if you are ready to start using DataFu Pig, keep reading.

## Basic Example: Computing Median

Let's use DataFu Pig to perform a very basic task: computing the median of some data. Suppose we have a file `input` in Hadoop with the following content:

``````
1
2
3
2
2
2
3
2
2
1
``````

We can clearly see that the median is 2 for this data set. First we'll start up Pig's grunt shell by running `pig` and then register the DataFu JAR:

``````
register datafu-pig-1.6.1.jar
``````

To compute the median we'll use DataFu's `StreamingMedian`, which computes an estimate of the median but has the benefit of not requiring the data to be sorted:

``````
DEFINE Median datafu.pig.stats.StreamingMedian();
``````

Next we can load the data and pass it into the function to compute the median:

``````
data = LOAD 'input' USING PigStorage() AS (val:int);
data = FOREACH (GROUP data ALL) GENERATE Median(data);
DUMP data;
``````

This produces the expected output:

``````
((2.0))
``````
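
`StreamingMedian` gives up exactness in exchange for not needing sorted input. Where an exact result matters, DataFu also provides `datafu.pig.stats.Median`, which expects the bag it receives to be sorted already. The following is a minimal sketch of that variant, assuming the same `input` file and registered JAR as in the example; the alias names `ExactMedian`, `grouped`, `sorted`, and `medians` are chosen here purely for illustration:

```pig
-- Exact median; unlike StreamingMedian, this UDF requires a sorted input bag.
DEFINE ExactMedian datafu.pig.stats.Median();

data = LOAD 'input' USING PigStorage() AS (val:int);

-- Collect every row into a single group so one median is computed overall.
grouped = GROUP data ALL;

medians = FOREACH grouped {
  -- Sort the bag inside the nested block before passing it to the UDF.
  sorted = ORDER data BY val;
  GENERATE ExactMedian(sorted.val);
}

DUMP medians;
```

The sort makes this approach more expensive on large groups, which is the cost that the streaming estimate is meant to avoid.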

## Next Steps

Check out the Guide for more information on what you can do with DataFu Pig.