Run PageRank on a large number of independent graphs through the PageRank UDF:
define PageRank datafu.pig.linkanalysis.PageRank('dangling_nodes','true'); topic_edges = LOAD 'input_edges' as (topic:INT,source:INT,dest:INT,weight:DOUBLE); topic_edges_grouped = GROUP topic_edges by (topic, source) ; topic_edges_grouped = FOREACH topic_edges_grouped GENERATE group.topic as topic, group.source as source, topic_edges.(dest,weight) as edges; topic_edges_grouped_by_topic = GROUP topic_edges_grouped BY topic; topic_ranks = FOREACH topic_edges_grouped_by_topic GENERATE group as topic, FLATTEN(PageRank(topic_edges_grouped.(source,edges))) as (source,pr); skill_ranks = FOREACH skill_ranks GENERATE topic, source, pr;
This implementation stores the nodes and edges (mostly) in memory. It is therefore best suited when one needs to compute PageRank on many reasonably sized graphs in parallel.