Apache DataFu™

Apache DataFu Pig - Guide

PageRank

Run PageRank on a large number of independent graphs through the PageRank UDF:

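-- Define the UDF with dangling-node handling turned on.
DEFINE PageRank datafu.pig.linkanalysis.PageRank('dangling_nodes','true');

-- One row per weighted edge; the topic identifies which graph the edge belongs to.
topic_edges = LOAD 'input_edges' AS (topic:INT,source:INT,dest:INT,weight:DOUBLE);

-- Collect the outgoing edges of each source node within each topic.
topic_edges_grouped = GROUP topic_edges BY (topic, source);
topic_edges_grouped = FOREACH topic_edges_grouped GENERATE
  group.topic AS topic,
  group.source AS source,
  topic_edges.(dest,weight) AS edges;

-- Gather all nodes of a topic so that each group holds one complete graph.
topic_edges_grouped_by_topic = GROUP topic_edges_grouped BY topic;

-- Run PageRank once per topic and flatten the result to one row per node.
topic_ranks = FOREACH topic_edges_grouped_by_topic GENERATE
  group AS topic,
  FLATTEN(PageRank(topic_edges_grouped.(source,edges))) AS (source,pr);

topic_ranks = FOREACH topic_ranks GENERATE
  topic, source, pr;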

This implementation stores the nodes and edges (mostly) in memory. It is therefore best suited to computing PageRank on many reasonably sized graphs in parallel, rather than on a single very large graph.
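To make the "one in-memory graph per group" idea concrete, here is a minimal Python sketch of weighted PageRank run independently for each topic. It is an illustration only, not the DataFu implementation: the function names `pagerank` and `pagerank_by_topic`, the damping factor of 0.85, and the fixed iteration count are all assumptions chosen for the example. The dangling-node handling shown here (spreading the rank of nodes with no outgoing edges evenly over all nodes) is one common approach and is assumed to approximate what the `dangling_nodes` option enables.

```python
from collections import defaultdict


def pagerank(edges, damping=0.85, iterations=50, dangling_nodes=True):
    """Weighted PageRank for ONE graph held entirely in memory.

    edges: iterable of (source, dest, weight) tuples; weights are assumed
    to be positive. damping and iterations are illustrative defaults.
    """
    out_edges = defaultdict(list)
    nodes = set()
    for source, dest, weight in edges:
        out_edges[source].append((dest, weight))
        nodes.update((source, dest))

    n = len(nodes)
    if n == 0:
        return {}

    # Start from a uniform distribution over the nodes.
    rank = {node: 1.0 / n for node in nodes}

    for _ in range(iterations):
        # Rank currently held by nodes that have no outgoing edges.
        dangling = 0.0
        if dangling_nodes:
            dangling = sum(rank[v] for v in nodes if v not in out_edges)

        # Every node receives the teleport share plus an equal share
        # of the dangling rank.
        next_rank = {
            v: (1.0 - damping) / n + damping * dangling / n for v in nodes
        }

        # Each source passes its rank along its edges, in proportion
        # to the edge weights.
        for source, targets in out_edges.items():
            total_weight = sum(w for _, w in targets)
            for dest, w in targets:
                next_rank[dest] += damping * rank[source] * w / total_weight

        rank = next_rank

    return rank


def pagerank_by_topic(rows, **kwargs):
    """Group (topic, source, dest, weight) rows and rank each topic's graph."""
    edges_by_topic = defaultdict(list)
    for topic, source, dest, weight in rows:
        edges_by_topic[topic].append((source, dest, weight))
    return {
        topic: pagerank(edges, **kwargs)
        for topic, edges in edges_by_topic.items()
    }
```

Each call to `pagerank` only sees the edges of a single topic, which mirrors why the approach scales across many graphs but needs each individual graph to be of manageable size.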