Apache DataFu Spark is a collection of utilities and user-defined functions for working with large-scale data in Apache Spark.
The matrix below lists the versions of Spark that DataFu has been compiled and tested on. Many methods work on unsupported versions as well, but only the versions listed here have been tested.
| DataFu | Spark |
| --- | --- |
| 1.7.0 | 2.2.0 to 2.2.2, 2.3.0 to 2.3.2, and 2.4.0 to 2.4.3 |
| 1.8.0 | 2.2.3, 2.3.3, and 2.4.4 to 2.4.5 |
| 2.0.0 | 3.0.x to 3.1.x |
| 2.1.0 (unreleased) | 3.2.x and up |
Here are some of the things you can do with DataFu Spark:

- "Dedup" a table, i.e., remove duplicates based on a key and ordering (typically a date-updated field, to keep only the most recently updated record).
- Do a skewed join between tables, where the small table is still too big to fit in memory (see the sketch after this list).
- Count distinct up to: an efficient implementation for when you just want to verify that a certain minimum number of distinct rows appear in a table.
- Call Python code from Spark Scala, or Scala code from PySpark.
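For instance, the skewed join is exposed as a joinSkewed method on DataFrames once the DataFrameOps implicits are imported. The sketch below is illustrative only: the table names are made up, and the parameter list (join expression, number of shards, join type) is an assumption here, so check the Guide for the exact signature.

import datafu.spark.DataFrameOps._
import org.apache.spark.sql.functions.expr

// Illustrative sketch: join a large, skewed table with a smaller table that
// is still too big to broadcast. Table names and the parameter list are
// assumptions; see the Guide for the exact API.
val views = spark.table("page_views").as("views")
val users = spark.table("users").as("users")
val joined = views.joinSkewed(users, expr("views.user_id = users.user_id"), 10, "inner")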
If you'd like to read more details about these functions, check out the Guide. Otherwise, if you are ready to get started using DataFu Spark, keep reading.
The rest of this page assumes you already have a built JAR available. If this is not the case, please see the Download page.
This JAR should be added to Spark's classpath. You can verify that you've done this correctly by trying to import one of our DataFu classes, for example, DataFrameOps.
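For example, in the spark-shell, the import below should succeed without errors if the JAR is on the classpath:

// If this import fails, the DataFu JAR is not on the classpath.
import datafu.spark.DataFrameOps._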
A common scenario in data written to HDFS (the Hadoop Distributed File System) is multiple rows representing updates for the same logical data. For example, in a table representing accounts, a record might be written every time customer data is updated, with each update receiving a newer timestamp. Let's consider the following simplified example.
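For illustration, suppose customers.csv contains rows like the following (the specific names and dates here are made up; the point is that some customers appear more than once):

| id | date_updated |
| --- | --- |
| alice | 2019-06-01 |
| bob | 2019-06-02 |
| craig | 2019-06-10 |
| david | 2019-06-12 |
| julia | 2019-06-03 |
| julia | 2019-07-01 |
| quentin | 2019-06-04 |
| quentin | 2019-06-28 |
| quentin | 2019-07-15 |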
We can see that while most of the customers appear only once, julia and quentin have 2 and 3 rows, respectively. How can we get just the most recent record for each customer? We can use DataFu's dedupWithOrder method.
import datafu.spark.DataFrameOps._
import spark.implicits._ // for the $ column syntax; already imported in spark-shell

val customers = spark.read.format("csv").option("header", "true").load("customers.csv")
customers.dedupWithOrder($"id", $"date_updated".desc).show
Our result is as expected; each customer appears only once, as you can see below:
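With the illustrative input above, the deduplicated result keeps only the newest row per customer:

| id | date_updated |
| --- | --- |
| alice | 2019-06-01 |
| bob | 2019-06-02 |
| craig | 2019-06-10 |
| david | 2019-06-12 |
| julia | 2019-07-01 |
| quentin | 2019-07-15 |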
There are two additional variants of dedupWithOrder in datafu-spark. The dedupWithCombiner method has similar functionality to dedupWithOrder, but uses a UDAF to take advantage of map-side aggregation. The dedupTopN method allows retaining more than one record for each key.
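As a rough sketch of how these variants are called (the parameter lists shown here are assumptions; see the Guide or the DataFrameOps API docs for the exact signatures):

// dedupWithCombiner: same end result as dedupWithOrder, but aggregates on the
// map side. The argument order here (group column, order-by column) is assumed.
val latest = customers.dedupWithCombiner($"id", $"date_updated")

// dedupTopN: keep the N "best" rows per key instead of just one; here, the
// two most recent rows per customer (signature assumed as well).
val latestTwo = customers.dedupTopN(2, $"id", $"date_updated".desc)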
Check out the Guide for more information on what you can do with DataFu Spark.