#### Apache DataFu Pig - Guide

## Statistics

Apache DataFu has two UDFs for computing the median of a bag.
`Median` computes the median exactly, but requires the input bag to be sorted.
`StreamingMedian` does not require a sorted bag, which makes it more efficient,
but it computes only an estimate of the median.

Let's take a look at computing the median using `StreamingMedian`:

```
define Median datafu.pig.stats.StreamingMedian();
-- input: 3,5,4,1,2
data = LOAD 'input' AS (val:int);
-- produces: 3
medians = FOREACH (GROUP data ALL) GENERATE Median(data.val);
```
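For comparison, the exact `Median` UDF needs its bag sorted first, which in Pig is typically done inside a nested `FOREACH`. The following is a minimal sketch, assuming the UDF class is `datafu.pig.stats.Median` and the same single-column input; the aliases `data`, `grouped`, and `sorted` are illustrative names.

```
define Median datafu.pig.stats.Median();
-- input: 3,5,4,1,2
data = LOAD 'input' AS (val:int);
grouped = GROUP data ALL;
medians = FOREACH grouped {
  -- sort the bag first; the exact UDF assumes sorted input
  sorted = ORDER data BY val;
  GENERATE Median(sorted.val);
}
```

The sort is what makes the exact version more expensive than the streaming estimate on large bags.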

### Quantiles

Quantiles are points at regular intervals within an ordered data set. Essentially,
we divide an ordered data set into segments, and the quantiles are the values at the boundaries between the segments.
The most familiar quantiles are probably the median and the percentiles.

As with the median, DataFu has two UDFs that compute quantiles. The median UDFs are in fact just wrappers around the quantile UDFs.
`Quantile` computes the quantiles of a sorted bag exactly,
and `StreamingQuantile` computes an estimate of
the quantiles of a bag that does not need to be sorted.

Let's take a look at computing the minimum, median, and maximum using `StreamingQuantile`:

```
define Quantile datafu.pig.stats.StreamingQuantile('0.0','0.5','1.0');
-- input: 9,10,2,3,5,8,1,4,6,7
data = LOAD 'input' AS (val:int);
-- produces: (1,5.5,10)
quantiles = FOREACH (GROUP data ALL) GENERATE Quantile(data.val);
```
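The exact `Quantile` UDF can be used the same way once the bag is sorted. This is a sketch under the assumption that the class is `datafu.pig.stats.Quantile` and that it accepts the same quantile arguments as the streaming version; the aliases are illustrative. A five-element input is used so that the minimum, median, and maximum are unambiguous.

```
define Quantile datafu.pig.stats.Quantile('0.0','0.5','1.0');
-- input: 3,5,4,1,2 (minimum 1, median 3, maximum 5)
data = LOAD 'input' AS (val:int);
grouped = GROUP data ALL;
quantiles = FOREACH grouped {
  -- the exact UDF assumes the bag is already ordered
  sorted = ORDER data BY val;
  GENERATE Quantile(sorted.val);
}
```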

### Variance

Variance can be computed using the `VAR` UDF:

```
define VAR datafu.pig.stats.VAR();
-- input: 1,2,3,4,5,6,7,8,9
data = LOAD 'input' AS (val:int);
-- produces: 6.666666666666668
variance = FOREACH (GROUP data ALL) GENERATE VAR(data.val);
```
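The result above is consistent with the population variance: the mean of 1 through 9 is 5, the squared deviations sum to 60, and 60 / 9 ≈ 6.67. In practice variance is often wanted per group rather than over the whole relation. The sketch below assumes a hypothetical two-column input with a grouping field named `category`; the field and alias names are illustrative and not taken from DataFu.

```
define VAR datafu.pig.stats.VAR();
-- hypothetical input rows: (a,1), (a,2), (a,3), (b,10), (b,20)
data = LOAD 'input' AS (category:chararray, val:int);
-- one variance per distinct category value
variances = FOREACH (GROUP data BY category)
            GENERATE group AS category, VAR(data.val) AS variance;
```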