Apache DataFu Pig - Guide

Hashing

MD5

The MD5 hash of a string can be computed with the MD5 UDF.

For example:

define MD5 datafu.pig.hash.MD5();

-- input: "hello, world!"
data_in = LOAD 'input' as (val:chararray);
data_out = FOREACH data_in GENERATE MD5(val) as val;

-- produces: (fc3ff98e8c6a0d3087d515c0473f8677)
DUMP data_out;

To output base64 instead, pass 'base64' to the constructor. The default is 'hex', which produces hexadecimal output.

define MD5 datafu.pig.hash.MD5('base64');
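
As a rough sketch of how the two encodings might be used together, the script below defines one alias per encoding and applies both to the same column. The alias names MD5Hex and MD5Base64 and the output field names are illustrative choices, not part of DataFu, and the script assumes the DataFu jar has already been registered.

```pig
-- Assumes the DataFu jar is already registered in this script or session.
-- MD5Hex and MD5Base64 are illustrative alias names, not DataFu identifiers.
define MD5Hex    datafu.pig.hash.MD5();          -- default 'hex' encoding
define MD5Base64 datafu.pig.hash.MD5('base64');  -- base64 encoding

data_in = LOAD 'input' as (val:chararray);

-- Emit the original value alongside both encodings of its MD5 hash.
data_out = FOREACH data_in GENERATE
    val,
    MD5Hex(val)    as md5_hex,
    MD5Base64(val) as md5_base64;

DUMP data_out;
```

Both columns encode the same 128-bit digest, so the hexadecimal form is 32 characters long and the base64 form is shorter.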

SHA

A SHA hash can be computed with the SHA UDF. The output is in hexadecimal.

define SHA datafu.pig.hash.SHA();

-- input: "hello, world!"
data_in = LOAD 'input' as (val:chararray);
data_out = FOREACH data_in GENERATE SHA(val) as val;

-- produces: (7509e5bda0c762d2bac7f90d758b5b2263fa01ccbc542ab5e3df163be08e6ca9)
DUMP data_out;

By default, the UDF uses SHA-256. The constructor also takes an optional parameter that selects the SHA algorithm. To use SHA-512 instead:

define SHA512 datafu.pig.hash.SHA('512');
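
A minimal sketch of applying the SHA-512 alias, following the same pattern as the SHA-256 example above. The SHA256 alias and the output field names are illustrative choices, and the script assumes the DataFu jar has already been registered.

```pig
-- Assumes the DataFu jar is already registered in this script or session.
-- SHA256 is an illustrative alias name for the default constructor.
define SHA256 datafu.pig.hash.SHA();       -- default algorithm (SHA-256)
define SHA512 datafu.pig.hash.SHA('512');  -- SHA-512

data_in = LOAD 'input' as (val:chararray);

-- Compute both digests for each input value.
data_out = FOREACH data_in GENERATE
    SHA256(val) as sha256,
    SHA512(val) as sha512;

DUMP data_out;
```

A SHA-256 digest is 64 hexadecimal characters and a SHA-512 digest is 128, so the two columns are easy to tell apart in the output.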