
# Getting Started

Apache DataFu Pig is a collection of user-defined functions for working with large-scale data in Apache Pig. It has a number of useful functions available:

#### Statistics

Compute quantiles, median, variance, Wilson binary confidence intervals, etc.

#### Set Operations

Perform set intersection, union, or difference of bags.

#### Bags

Convenient functions for working with bags, such as enumerate items,
append, prepend, concat, group, and distinct.

#### Sessions

Sessionize events from a stream of data.

#### Estimation

Streaming implementations that can estimate
quantiles, median, and cardinality.

#### Sampling

Simple random sampling with or without replacement, and
weighted sampling.

#### Link Analysis

Run PageRank on a graph represented by a bag of
nodes and edges.

#### More

Other useful methods like Assert and Coalesce.

If you'd like to read more details about these functions, check out the Guide. Otherwise, if you are
ready to get started using DataFu Pig, keep reading.

The rest of this page assumes you already have a built JAR available. If this is not the case, please see Quick Start.

Let's use DataFu Pig to perform a very basic task: computing the median of some data.
Suppose we have a file `input` in Hadoop with the following content:

```
1
2
3
2
2
2
3
2
2
1
```

We can clearly see that the median is 2 for this data set. First we'll start up Pig's grunt shell by running `pig` and
then register the DataFu JAR:

```
register datafu-pig-incubating-1.3.1.jar
```

To compute the median we'll use DataFu's `StreamingMedian`, which computes an estimate of the median but has the benefit
of not requiring the data to be sorted:

```
DEFINE Median datafu.pig.stats.StreamingMedian();
```

Next we can load the data and pass it into the function to compute the median:

```
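-- Load one integer per line from the 'input' file.
data = LOAD 'input' using PigStorage() as (val:int);
-- Group all rows into a single bag and pass that bag to Median.
data = FOREACH (GROUP data ALL) GENERATE Median(data);
DUMP data;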
```

This produces the expected output:

```
((2.0))
```
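For comparison, here is a minimal sketch of the exact, non-streaming approach. It assumes the same grunt session with the DataFu JAR already registered, and it uses `datafu.pig.stats.Median`, which expects its input bag to be sorted. The alias names `ExactMedian`, `raw`, `sorted`, and `exact` are illustrative choices, not names from the walkthrough above.

```pig
-- Exact median: unlike StreamingMedian, the input bag must be sorted first.
DEFINE ExactMedian datafu.pig.stats.Median();

-- Load the same 'input' file under a separate alias.
raw = LOAD 'input' using PigStorage() as (val:int);

-- Sort the values inside the single ALL group, then compute the median.
exact = FOREACH (GROUP raw ALL) {
  sorted = ORDER raw BY val;
  GENERATE ExactMedian(sorted);
};
DUMP exact;
```

The extra `ORDER` step is the cost that `StreamingMedian` avoids, which is why the streaming version tends to be the more practical choice when the data is large.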

## Next Steps

Check out the Guide for more information on what you can do with DataFu Pig.