posted on 2022-07-25, 00:34 authored by G K Gupta, A Singla
Similarity detection is an interesting and challenging problem. It is challenging because the aim is often to compare a very large number of documents efficiently and without excessive storage overhead. Very large numbers of documents need to be compared because it is believed that as many as one-third of Web documents are identical or similar to other pages on the Web. Similarity detection has other applications as well; an obvious one is plagiarism detection. Many similarity detection algorithms are based on fingerprinting, in which a fingerprint of each document is built from a set of substrings of the document. These fingerprints can then be compared to detect similarity between documents. We present a new fingerprinting algorithm, called Sequence of Selected Words Fingerprint (SSWF), that uses not only a set of selected words from the document but also the sequence in which they appear. This approach is shown to work well for the 1-to-n similarity detection problem, in which a document is given and we wish to find similar documents in a given collection.
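The general idea of a fingerprint built from selected words kept in order of appearance can be illustrated with a minimal Python sketch. This is not the SSWF algorithm itself: the abstract does not say how words are selected or how fingerprints are compared, so the selection rule (words of at least `min_len` letters), the comparison (`difflib.SequenceMatcher`), the `threshold`, and all function names below are assumptions chosen only for illustration.

```python
import re
from difflib import SequenceMatcher


def fingerprint(text, min_len=6):
    """Return the selected words of `text` in their order of appearance.

    Assumed selection rule (not from the source): keep words with at
    least `min_len` letters, which tends to drop short function words.
    """
    words = re.findall(r"[a-z]+", text.lower())
    return [w for w in words if len(w) >= min_len]


def similarity(fp_a, fp_b):
    """Order-sensitive similarity in [0, 1] between two fingerprints."""
    if not fp_a or not fp_b:
        return 0.0  # an empty fingerprint carries no evidence of similarity
    # SequenceMatcher compares the two lists as sequences, so the same
    # words in a different order score lower than an identical sequence.
    return SequenceMatcher(None, fp_a, fp_b, autojunk=False).ratio()


def find_similar(query, collection, threshold=0.5):
    """1-to-n search: score every document in `collection` against `query`.

    `collection` maps a document name to its text. Returns (name, score)
    pairs with score >= threshold, best match first.
    """
    query_fp = fingerprint(query)
    hits = []
    for name, text in collection.items():
        score = similarity(query_fp, fingerprint(text))
        if score >= threshold:
            hits.append((name, score))
    return sorted(hits, key=lambda hit: hit[1], reverse=True)


if __name__ == "__main__":
    docs = {
        "a": "Fingerprinting algorithms compare documents efficiently",
        "b": "The weather today is sunny and warm",
    }
    query = "Fingerprinting algorithms compare documents efficiently and quickly"
    for name, score in find_similar(query, docs):
        print(name, round(score, 3))
```

In this sketch the fingerprint is recomputed for every comparison; a practical 1-to-n system would presumably store the fingerprints of the collection once and compare only those, which is where the storage concern mentioned in the abstract comes in.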