Apache DataFu Pig - Guide

Statistics

Median

Apache DataFu provides two UDFs for computing the median of a bag. Median computes the median exactly but requires the input bag to be sorted. StreamingMedian does not require a sorted bag, which makes it more efficient, but it returns only an estimate of the median.

Let's take a look at computing the median using StreamingMedian:

define Median datafu.pig.stats.StreamingMedian();

-- input: 3,5,4,1,2
data = LOAD 'input' AS (val:int);

-- produces: 3
medians = FOREACH (GROUP data ALL) GENERATE Median(data.val);
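For comparison, the exact Median UDF needs its bag sorted before it is called. The following is a minimal sketch, assuming the common Pig idiom of ordering the bag inside a nested FOREACH; the names ExactMedian, data, grouped, and sorted are illustrative and do not come from this guide.

```pig
define ExactMedian datafu.pig.stats.Median();

-- input: 3,5,4,1,2
data = LOAD 'input' AS (val:int);
grouped = GROUP data ALL;

-- sort the bag first, because the exact UDF expects sorted input
medians = FOREACH grouped {
  sorted = ORDER data BY val;
  GENERATE ExactMedian(sorted);
};
```

The extra ORDER step is the cost that StreamingMedian avoids, in exchange for returning an estimate.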

Quantiles

Quantiles are points at regular intervals within an ordered data set. Essentially, we divide the ordered data into segments of equal size, and the quantiles are the values at the boundaries between segments. The most familiar examples are the median, which splits the data into two halves, and the percentiles, which split it into 100 parts.

As with the median, DataFu has two UDFs for computing quantiles; the median UDFs are in fact just wrappers around them. Quantile computes the quantiles of a sorted bag exactly, while StreamingQuantile computes an estimate of the quantiles of a bag that does not need to be sorted.

Let's take a look at computing the minimum, median, and maximum (the 0.0, 0.5, and 1.0 quantiles) using StreamingQuantile:

define Quantile datafu.pig.stats.StreamingQuantile('0.0','0.5','1.0');

-- input: 9,10,2,3,5,8,1,4,6,7
data = LOAD 'input' AS (val:int);

-- produces: (1,5.5,10)
quantiles = FOREACH (GROUP data ALL) GENERATE Quantile(data.val);
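The exact Quantile UDF can be used in much the same way, provided the bag is sorted before the call. This is a sketch under the same assumption as before (ordering inside a nested FOREACH); the names ExactQuantile, data, grouped, and sorted are hypothetical.

```pig
define ExactQuantile datafu.pig.stats.Quantile('0.0','0.5','1.0');

-- input: 9,10,2,3,5,8,1,4,6,7
data = LOAD 'input' AS (val:int);
grouped = GROUP data ALL;

-- sort the bag first, because the exact UDF expects sorted input
quantiles = FOREACH grouped {
  sorted = ORDER data BY val;
  GENERATE ExactQuantile(sorted);
};
```

Each constructor argument names one quantile to report, so a call with '0.25','0.5','0.75' would request the quartiles instead.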

Variance

Variance can be computed using the VAR UDF:

define VAR datafu.pig.stats.VAR();

-- input: 1,2,3,4,5,6,7,8,9
data = LOAD 'input' AS (val:int);

-- produces: 6.666666666666668
variance = FOREACH (GROUP data ALL) GENERATE VAR(data.val);
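The result above can be checked by hand. The arithmetic below, worked for the nine input values, suggests that VAR returns the population variance (dividing by n); the sample variance (dividing by n - 1) would give 7.5 instead. The symbols \mu, \sigma^2, and s^2 are notation introduced here for illustration.

```latex
\begin{align*}
\mu      &= \frac{1}{9}\sum_{i=1}^{9} i = \frac{45}{9} = 5 \\
\sigma^2 &= \frac{1}{9}\sum_{i=1}^{9} (i - \mu)^2
          = \frac{16 + 9 + 4 + 1 + 0 + 1 + 4 + 9 + 16}{9}
          = \frac{60}{9} \approx 6.667 \\
s^2      &= \frac{1}{8}\sum_{i=1}^{9} (i - \mu)^2 = \frac{60}{8} = 7.5
\end{align*}
```

The trailing 8 in 6.666666666666668 is floating-point rounding; the exact value is 60/9.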