W-shingling
Basic definition
A w-shingling is a set of unique "shingles"—contiguous subsequences of tokens in a document—that can be used to gauge the similarity of two documents. The w denotes the number of tokens in each shingle in the set.
The document "a rose is a rose is a rose" can be tokenized as follows:
- (a,rose,is,a,rose,is,a,rose)
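As a rough illustration of the definition above, the following Python sketch builds the w-shingling of that token sequence. The function name `shingles`, the whitespace tokenization, and the choice of w = 4 are assumptions made for this example only; they do not come from the cited technical note.

```python
def shingles(tokens, w):
    """Return the set of unique shingles (w contiguous tokens each)."""
    # Slide a window of width w over the tokens; the set drops duplicates.
    return {tuple(tokens[i:i + w]) for i in range(len(tokens) - w + 1)}


# Assumption: tokens are obtained by splitting on whitespace.
tokens = "a rose is a rose is a rose".split()

four_shingling = shingles(tokens, 4)
for shingle in sorted(four_shingling):
    print(shingle)
```

The eight tokens yield five windows of four tokens, but two of them repeat, so the 4-shingling contains only three unique shingles: (a,rose,is,a), (rose,is,a,rose), and (is,a,rose,is).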
Resemblance
For a given shingle size, the degree to which two documents A and B resemble each other can be expressed as the ratio of the magnitudes of their shinglings' intersection and union, or
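- $r(A,B) = \frac{|S(A) \cap S(B)|}{|S(A) \cup S(B)|}$
where S(A) and S(B) denote the w-shinglings of A and B, and |X| is the number of elements in the set X.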
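The ratio described above can be sketched in Python using set intersection and union. The names `shingles` and `resemblance`, the second sample document, and the convention of returning 0.0 when both documents are shorter than w tokens are assumptions made for illustration only.

```python
def shingles(tokens, w):
    """Return the set of unique shingles (w contiguous tokens each)."""
    return {tuple(tokens[i:i + w]) for i in range(len(tokens) - w + 1)}


def resemblance(tokens_a, tokens_b, w):
    """Size of the shinglings' intersection divided by size of their union."""
    s_a, s_b = shingles(tokens_a, w), shingles(tokens_b, w)
    union = s_a | s_b
    if not union:
        # Assumption: define the resemblance as 0.0 when neither document
        # has any shingle of size w (the ratio would otherwise be 0/0).
        return 0.0
    return len(s_a & s_b) / len(union)


doc_a = "a rose is a rose is a rose".split()
doc_b = "a rose is a flower which is a rose".split()  # hypothetical document

print(resemblance(doc_a, doc_b, 2))  # 0.5
```

With w = 2, the first document has three unique shingles and the second has six, three of which are shared, so the resemblance is 3/6 = 0.5. Comparing a document with itself always gives 1.0.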
References
- Broder, Glassman, Manasse, and Zweig (1997). Syntactic Clustering of the Web. SRC Technical Note #1997-015.
From Wikipedia, the Free Encyclopedia. All text is available under the terms of the GNU Free Documentation License.
