Apache DataFu™ (incubating)

Apache DataFu (incubating) 1.3.3 Released

Matthew Hayes

I'd like to announce the release of Apache DataFu (incubating) 1.3.3.

Additions:

  • UDF for hash functions such as murmur3. (DATAFU-47)
  • UDF for diffing tuples. (DATAFU-119)
  • Support for macros in DataFu. Macros count_all_non_distinct and...

Apache DataFu (incubating) 1.3.2 Released

Matthew Hayes

I'd like to announce the release of Apache DataFu (incubating) 1.3.2.

Improvements:

  • LICENSE, NOTICE, and DISCLAIMER now included in META-INF of JARs.
  • Test files now generated to build/test-files within projects.
  • AliasableEvalFunc now uses getInputSchema...

Apache DataFu (incubating) 1.3.1 Released

Matthew Hayes

I'd like to announce the release of Apache DataFu (incubating) 1.3.1.

Additions:

  • New UDF CountDistinctUpTo that counts distinct tuples within a bag up to a preset limit (DATAFU-117)

Improvements:

  • TupleFromBag and FirstTupleFromBag now implement Accumulator...

DataFu's Hourglass, Incremental Data Processing in Hadoop

Matthew Hayes

For a large-scale site such as LinkedIn, tracking metrics accurately and efficiently is an important task. For example, imagine we need a dashboard that shows the number of visitors to every page on the site over the last thirty days. To keep this...

DataFu 1.0

William Vaughan

DataFu is an open-source collection of user-defined functions for working with large-scale data in Hadoop and Pig.

About two years ago, we recognized a need for a stable, well-tested library of Pig UDFs that could assist in common data mining and...

DataFu, The WD-40 of Big Data

Matthew Hayes, Sam Shah

If Pig is the “duct tape for big data”, then DataFu is the WD-40. Or something.

No, seriously, DataFu is a collection of Pig UDFs for data analysis on Hadoop. DataFu includes routines for common statistics tasks (e.g., median, variance), PageRank...