Monash University

Using fingerprints based on the sequence of a set of selected words in a document for 1-to-n similarity analysis

report
posted on 2022-07-25, 00:34 authored by G K Gupta, A Singla
Similarity detection is an interesting and challenging problem. It is challenging because the aim is often to compare a very large number of documents efficiently and without excessive storage overhead. Very large numbers of documents need to be compared because about one-third of Web documents are believed to be identical or similar to other pages on the Web. Similarity detection has other applications as well; an obvious one is plagiarism detection. Many similarity detection algorithms are based on fingerprinting, in which a fingerprint of each document is built from a set of substrings of the document. These fingerprints can then be compared to detect similarity between documents. We present a new fingerprinting algorithm, called Sequence of Selected Words Fingerprint (SSWF), that uses not only a set of selected words from the document but also the sequence in which they appear. This approach is shown to work well for the 1-to-n similarity detection problem, in which a document is given and we wish to find similar documents in a given collection.
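
As a rough illustration of the general idea only, the following Python sketch builds a fingerprint from the selected words of a document in their order of appearance and uses it for a 1-to-n lookup. The abstract does not say how SSWF selects words or compares fingerprints, so everything here is an assumption: the fixed `selected` vocabulary, the adjacent-pair Jaccard score, the `threshold` value, and all function names are hypothetical and should not be read as the authors' method.

```python
import re


def fingerprint(text, selected):
    """Return the selected words of `text` in their order of appearance.

    `selected` is an assumed, fixed vocabulary; the report's actual
    word-selection rule is not given in the abstract.
    """
    words = re.findall(r"[a-z0-9]+", text.lower())
    return [w for w in words if w in selected]


def ordered_pairs(fp):
    """Adjacent word pairs, so that word order affects the comparison."""
    return set(zip(fp, fp[1:]))


def similarity(fp_a, fp_b):
    """Jaccard overlap of adjacent-pair sets (an assumed measure)."""
    a, b = ordered_pairs(fp_a), ordered_pairs(fp_b)
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def most_similar(query, collection, selected, threshold=0.5):
    """1-to-n search: score one query document against a collection.

    `collection` maps a document name to its text. Returns (score, name)
    tuples at or above `threshold`, highest score first.
    """
    q = fingerprint(query, selected)
    scored = [
        (similarity(q, fingerprint(doc, selected)), name)
        for name, doc in collection.items()
    ]
    return sorted([(s, n) for s, n in scored if s >= threshold], reverse=True)


selected = {"fingerprint", "document", "similarity", "web"}
query = "A fingerprint of each document supports similarity detection on the web"
collection = {
    "copy": "The fingerprint of a document allows similarity checks across the web",
    "shuffled": "Web similarity: the document has a fingerprint",
}
matches = most_similar(query, collection, selected)
```

In this toy collection both documents contain the same four selected words, but only `copy` preserves their sequence, so a comparison based on the word set alone could not tell them apart while the order-aware score can. In practice the collection's fingerprints would be computed once and stored rather than rebuilt for every query.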

History

Technical report number

2006/199

Year of publication

2006

Monash Information Technology Technical Reports