Apache DataFu™

Apache DataFu-Spark 2.0.0 Released

Eyal Allweil

I'd like to announce the release of Apache DataFu-Spark 2.0.0.

This version is the first to support Spark 3.x. In this release, Spark versions 3.0.0 to 3.1.3 are supported.

The four classes in SparkUDAFs - MultiSet, MultiArraySet, MapMerge and CountDistinctUpTo...

Read more...

Apache DataFu-Spark 1.8.0 Released

Eyal Allweil

I'd like to announce the release of Apache DataFu-Spark 1.8.0.

Many thanks to Arpit Bhardwaj and Shaked Aharon, who worked on this version.


Improvements

  • dedupWithCombiner method now supports a list of columns in the order / group by params...

Read more...

Apache DataFu-Spark 1.7.0 Released

Eyal Allweil

I'd like to announce the release of Apache DataFu-Spark 1.7.0.

Many thanks to new contributors Arpit Bhardwaj, Ben Rahamim and Shaked Aharon!


Additions

  • Add collectLimitedList and dedupRandomN methods (DATAFU-165)
  • Improve broadcastJoinSkewed...

Read more...

Apache DataFu 1.6.1 Released

Eyal Allweil

I'd like to announce the release of Apache DataFu 1.6.1.

Additions

  • Explode Array method (DATAFU-154)

Improvements

  • Add support for newer versions of Gradle (DATAFU-157)
  • Document Explode Array usage recommendation (DATAFU-158)

Fixes

  • Gradle...

Read more...

Apache DataFu 1.6.0 Released

Matthew Hayes

I'd like to announce the release of Apache DataFu 1.6.0.

Additions:

  • datafu-spark library (DATAFU-148).

Improvements:

  • Remove log suppression in unit tests (DATAFU-82).

Fixes:

  • Failure to assemble due to jcenter HTTP usage (DATAFU-152).

Read more...

Apache DataFu 1.5.0 Released

Matthew Hayes

I'd like to announce the release of Apache DataFu 1.5.0.

Additions:

  • dedup macro (DATAFU-129)
  • samplebykeys macro (DATAFU-127)

Improvements:

  • Update Ruby gem for site generation (DATAFU-147)
  • Make DataFu compile with Java 8 (DATAFU-132)

Changes...

Read more...

Apache DataFu 1.4.0 Released

Matthew Hayes

I'd like to announce the release of Apache DataFu 1.4.0. This is the first release since graduating from the Apache Incubator. Note that the artifact names now begin with apache-datafu instead of apache-datafu-incubating.

Changes:

  • Removed MD5 hash for...

Read more...

Apache DataFu (incubating) 1.3.3 Released

Matthew Hayes

I'd like to announce the release of Apache DataFu (incubating) 1.3.3.

Additions:

  • UDF for hash functions such as murmur3. (DATAFU-47)
  • UDF for diffing tuples. (DATAFU-119)
  • Support for macros in DataFu. Macros count_all_non_distinct and...

Read more...

Apache DataFu (incubating) 1.3.2 Released

Matthew Hayes

I'd like to announce the release of Apache DataFu (incubating) 1.3.2.

Improvements:

  • LICENSE, NOTICE, and DISCLAIMER now included in META-INF of JARs.
  • Test files now generated to build/test-files within projects.
  • AliasableEvalFunc now uses getInputSchema...

Read more...

Apache DataFu (incubating) 1.3.1 Released

Matthew Hayes

I'd like to announce the release of Apache DataFu (incubating) 1.3.1.

Additions:

  • New UDF CountDistinctUpTo that counts distinct tuples within a bag up to a preset limit (DATAFU-117)

Improvements:

  • TupleFromBag and FirstTupleFromBag now implement Accumulator...

Read more...

DataFu's Hourglass, Incremental Data Processing in Hadoop

Matthew Hayes

For a large-scale site such as LinkedIn, tracking metrics accurately and efficiently is an important task. For example, imagine we need a dashboard that shows the number of visitors to every page on the site over the last thirty days. To keep this...

Read more...

DataFu 1.0

William Vaughan

DataFu is an open-source collection of user-defined functions for working with large-scale data in Hadoop and Pig.

About two years ago, we recognized a need for a stable, well-tested library of Pig UDFs that could assist in common data mining and...

Read more...

DataFu, The WD-40 of Big Data

Matthew Hayes, Sam Shah

If Pig is the “duct tape for big data”, then DataFu is the WD-40. Or something.

No, seriously, DataFu is a collection of Pig UDFs for data analysis on Hadoop. DataFu includes routines for common statistics tasks (e.g., median, variance), PageRank...

Read more...