Apache DataFu™ (incubating)

Apache DataFu (incubating) 1.3.1 Released

Matthew Hayes

I'd like to announce the release of Apache DataFu (incubating) 1.3.1.

Additions:

  • New UDF CountDistinctUpTo that counts distinct tuples within a bag, up to a preset limit (DATAFU-117)

Improvements:

  • TupleFromBag and FirstTupleFromBag now implement Accumulator...

DataFu's Hourglass, Incremental Data Processing in Hadoop

Matthew Hayes

For a large-scale site such as LinkedIn, tracking metrics accurately and efficiently is an important task. For example, imagine we need a dashboard that shows the number of visitors to every page on the site over the last thirty days. To keep this...


DataFu 1.0

William Vaughan

DataFu is an open-source collection of user-defined functions for working with large-scale data in Hadoop and Pig.

About two years ago, we recognized a need for a stable, well-tested library of Pig UDFs that could assist in common data mining and...


DataFu, The WD-40 of Big Data

Matthew Hayes, Sam Shah

If Pig is the “duct tape for big data”, then DataFu is the WD-40. Or something.

No, seriously, DataFu is a collection of Pig UDFs for data analysis on Hadoop. DataFu includes routines for common statistics tasks (e.g., median, variance), PageRank...
