public class CosineDistanceHash extends LSHFunc
Locality-sensitive hashing (LSH) is a method of performing probabilistic dimension reduction of high-dimensional data. The basic idea is to hash the input items so that similar items are mapped to the same buckets with high probability (the number of buckets being much smaller than the universe of possible input items).In particular, this implementation implements a locality sensitive hashing scheme which maps high-dimensional vectors which are close together (with high probability) according to Cosine Similarity into the same buckets. Each LSH maps a vector onto one side or the other of a random hyperplane, thereby producing a single bit as the hash value. Multiple, independent, hashes can be run on the same input and aggregated together to form a more broad domain than a single bit. For more information, see Charikar, Moses S.. (2002). "Similarity Estimation Techniques from Rounding Algorithms". Proceedings of the 34th Annual ACM Symposium on Theory of Computing 2002.
Constructor and Description |
---|
CosineDistanceHash(java.lang.String sDim,
java.lang.String sRepeat,
java.lang.String sNumHashes) |
CosineDistanceHash(java.lang.String sDim,
java.lang.String sRepeat,
java.lang.String sNumHashes,
java.lang.String sSeed)
Locality sensitive hash that maps vectors onto 0,1 in such a way that colliding
vectors are "near" one another according to cosine similarity with high probability.
|
Modifier and Type | Method and Description |
---|---|
protected LSHCreator |
createLSHCreator() |
protected int |
getDimension() |
exec, getSeed, outputSchema
allowCompileTimeCalculation, finish, getArgToFuncMapping, getCacheFiles, getInputSchema, getLogger, getPigLogger, getReporter, getReturnType, getSchemaName, getSchemaType, getShipFiles, isAsynchronous, progress, setInputSchema, setPigLogger, setReporter, setUDFContextSignature, warn
public CosineDistanceHash(java.lang.String sDim, java.lang.String sRepeat, java.lang.String sNumHashes, java.lang.String sSeed)
Generally, multiple LSH are combined via repetition to increase the range of the hash function to the full set of longs. The number of functions which you want to internally repeat is specified by the sRepeat parameter. The size of the hash family corresponds to the number of independent hashes you want to apply to the data. In a k-near neighbors style of searching, this corresponds to the number of neighbors you want to find (i.e. the number of vectors within a distance according to cosine similarity).
Consider the following example where we input some 3-dimensional points and a set of 3-dimensional queries and find the nearest neighbors of the query points:
-- Create a CosineDistanceHash of -- 3 dimensional data -- 1500 internal hashes (being combined into one hash) -- family of 5 hashes -- with a seed of 0 -- This creates a bag of tuples: -- lsh_id:Integer the family ID (in this case, 0-4) -- hash:Long the hash define LSH datafu.pig.hash.lsh.CosineDistanceHash('3', '1500', '5', '0'); define METRIC datafu.pig.hash.lsh.metric.L2(); PTS = LOAD 'input' AS (dim1:double, dim2:double, dim3:double); --hash the input points PTS_HASHED = foreach PTS generate TOTUPLE(dim1, dim2, dim3) as pt , FLATTEN(LSH(TOTUPLE(dim1, dim2, dim3))); -- the hash family ID and the hash should group the input points into partitions PARTITIONS = group PTS_HASHED by (lsh_id, hash); -- take in the query points and hash them QUERIES = LOAD 'queries' as (dim1:double, dim2:double, dim3:double); QUERIES_HASHED = foreach QUERIES generate TOTUPLE(dim1, dim2, dim3) as query_pt , FLATTEN(LSH(TOTUPLE(dim1, dim2, dim3))) ; -- join the hashed query points with the (presumably larger) list of input data split by partitions QUERIES_W_PARTS = join QUERIES_HASHED by (lsh_id, hash), PARTITIONS by (group.$0, group.$1); -- Now, use the appropriate METRIC UDF (in this case Cosine distance) to find the first point within -- a parameterized threshold (in this case, .001). It takes: -- query_pt:Tuple the query point -- threshold:Double the threshold, so that if the distance between the query point and a point -- in the partition is less than this threshold, it returns the point (and stops searching) -- partition:Bag The bag of tuples in the partition. tuples from NEAR_NEIGHBORS = foreach QUERIES_W_PARTS generate query_pt as query_pt , METRIC(query_pt, .001, PTS_HASHED) as neighbor ; describe NEAR_NEIGHBORS; -- {query_pt: (dim1: double,dim2: double,dim3: double) -- ,neighbor: (pt: (dim1: double,dim2: double,dim3: double) -- ,lsh::lsh_id: int -- ,lsh::hash: long -- ) -- } -- project out the query and the matching point NEIGHBORS_PROJ = foreach NEAR_NEIGHBORS { generate query_pt as query_pt, neighbor.pt as matching_pts; }; -- Filter out the hashes which resulted in no matches NOT_NULL = filter NEIGHBORS_PROJ by SIZE(matching_pts) > 0; -- group by the query NEIGHBORS_GRP = group NOT_NULL by query_pt; describe NEIGHBORS_GRP; -- Generate the query, the number of matches and the bag of matching points NEIGHBOR_CNT = foreach NEIGHBORS_GRP{ MATCHING_PTS = foreach NOT_NULL generate FLATTEN(matching_pts); DIST_MATCHING_PTS = DISTINCT MATCHING_PTS; generate group as query_pt, COUNT(NOT_NULL), DIST_MATCHING_PTS; }; describe NEIGHBOR_CNT; -- NEIGHBOR_CNT: {query_pt: (dim1: double,dim2: double,dim3: double) -- ,long -- ,DIST_MATCHING_PTS: { (matching_pts::dim1: double,matching_pts::dim2: double,matching_pts::dim3: double) -- } -- } STORE NEIGHBOR_CNT INTO 'neighbors';
sDim
- Dimension of the vectorssRepeat
- Number of internal repetitionssNumHashes
- Size of the hash family (if you're looking for k near neighbors, this is the k)sSeed
- Seed to use when constructing LSH familypublic CosineDistanceHash(java.lang.String sDim, java.lang.String sRepeat, java.lang.String sNumHashes)
protected LSHCreator createLSHCreator()
createLSHCreator
in class LSHFunc
protected int getDimension()
getDimension
in class LSHFunc