W-shingling


The correct title of this article is w-shingling. The initial letter is capitalized due to technical restrictions.

Basic definition

A w-shingling is a set of unique "shingles"—contiguous subsequences of tokens in a document—that can be used to gauge the similarity of two documents. The w denotes the number of tokens in each shingle in the set.

The document "a rose is a rose is a rose" can be tokenized as follows:

(a,rose,is,a,rose,is,a,rose)
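The set of all contiguous sequences of 4 tokens is

{ (a,rose,is,a), (rose,is,a,rose), (is,a,rose,is), (a,rose,is,a), (rose,is,a,rose) }

By removing duplicate elements from this set, a 4-shingling is obtained:

{ (a,rose,is,a), (rose,is,a,rose), (is,a,rose,is) }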
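
The construction can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation: the function names `windows` and `shingling` are hypothetical, and splitting on whitespace is an assumed tokenization that the article does not prescribe.

```python
def windows(text, w):
    """List every contiguous run of w tokens, duplicates included."""
    tokens = text.split()  # assumption: tokens are whitespace-separated words
    return [tuple(tokens[i:i + w]) for i in range(len(tokens) - w + 1)]


def shingling(text, w):
    """Return the w-shingling: the set of unique w-token windows."""
    return set(windows(text, w))


doc = "a rose is a rose is a rose"
print(len(windows(doc, 4)))       # 5 windows, two of them repeated
print(sorted(shingling(doc, 4)))  # 3 unique shingles
```

A document with fewer than w tokens yields an empty shingling under this sketch.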

Resemblance

For a given shingle size, the degree to which two documents A and B resemble each other can be expressed as the ratio of the magnitudes of their shinglings' intersection and union, or

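\[ r(A,B) = \frac{|S(A) \cap S(B)|}{|S(A) \cup S(B)|} \]
where S(A) denotes the shingling of document A and |X| is the size of set X. The resemblance is a number in the range [0,1], where 1 indicates that the two documents have identical shinglings.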
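
As a rough sketch, the resemblance can be computed directly from two shinglings with Python set operations. The names `shingling` and `resemblance` are hypothetical, whitespace tokenization is an assumption, and returning 0.0 when both shinglings are empty is a convention chosen here rather than something the article specifies.

```python
def shingling(text, w):
    """Set of unique w-token windows (assumes whitespace tokenization)."""
    tokens = text.split()
    return {tuple(tokens[i:i + w]) for i in range(len(tokens) - w + 1)}


def resemblance(a, b, w):
    """Size of the shinglings' intersection divided by size of their union."""
    sa, sb = shingling(a, w), shingling(b, w)
    union = sa | sb
    if not union:
        return 0.0  # assumption: both documents shorter than w tokens
    return len(sa & sb) / len(union)


a = "a rose is a rose is a rose"
b = "a rose is a flower which is a rose"
print(resemblance(a, b, 4))  # 0.125: one shared shingle out of eight
```

Because only the set of shingles is compared, two different documents can still score 1: "a rose is a rose is a rose is a rose" has the same 4-shingling as `a` above.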
