Create a list of files:
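A minimal sketch of this step, using base R's `list.files`. The folder name `doc` is an assumption for illustration, so adjust it to wherever your documents are stored:

```r
# Assumption: the drafts live in a folder named "doc" (hypothetical path).
doc_dir <- "doc"

# Collect .doc and .docx files; full.names = TRUE keeps the folder in each
# path so the files can be found from the current working directory.
my_files <- list.files(path = doc_dir, pattern = "\\.docx?$", full.names = TRUE)
my_files
```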
The first 3 documents are different drafts of the same paper, so we would expect them to be similar to each other. The last document is a draft of a different paper, so it should be dissimilar to the first 3. All files are about 45K words long.
Now we can use cheatR to find duplicates.
The only function, catch_em, takes the following input arguments:
flist- a list of documents (
time_lim- max time in seconds for each comparison (we found that some corrupt files run forever and crash R, so a time limit might be needed).
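A call might look like the sketch below. It uses only the two arguments described above. `my_files` is a hypothetical vector of file paths, the `doc` folder is an assumed location, and the one-second limit is an arbitrary value chosen for illustration. The guard simply skips the call when the package or the files are not available.

```r
# Hypothetical input: paths to the documents to compare (assumed folder "doc").
my_files <- list.files(path = "doc", pattern = "\\.docx?$", full.names = TRUE)

# Run the comparison only if cheatR is installed and there is something to compare.
if (requireNamespace("cheatR", quietly = TRUE) && length(my_files) > 1) {
  # Stop any single comparison after 1 second (arbitrary limit for illustration).
  results <- cheatR::catch_em(flist = my_files, time_lim = 1)
}
```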
The resulting list contains a matrix with the similarity values between each pair of documents:
results
#>                   paper1_copy1.docx paper1_copy2.docx paper1_copy3.docx
#> paper1_copy1.docx 100%
#> paper1_copy2.docx 87%               100%
#> paper1_copy3.docx 90%               88%               100%
#> paper2_copy1.docx 0%                0%                0%
#>                   paper2_copy1.docx
#> paper1_copy1.docx
#> paper1_copy2.docx
#> paper1_copy3.docx
#> paper2_copy1.docx 100%
#>
#> All files read successfully.
#> All files compared successfully.
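To list only the suspicious pairs, one option is to filter a matrix like this by a cutoff. The sketch below does not touch the object returned by catch_em. Instead it rebuilds the values printed above as a plain numeric matrix (`sim`, a hypothetical stand-in) and applies an arbitrary 80% threshold:

```r
files <- c("paper1_copy1.docx", "paper1_copy2.docx",
           "paper1_copy3.docx", "paper2_copy1.docx")

# Hypothetical stand-in for the similarity matrix. Only the lower triangle is
# filled (column by column) with the values printed above, as proportions.
# The diagonal is left as NA because every file matches itself.
sim <- matrix(NA_real_, nrow = 4, ncol = 4, dimnames = list(files, files))
sim[lower.tri(sim)] <- c(0.87, 0.90, 0, 0.88, 0, 0)

# Arbitrary cutoff chosen for illustration.
threshold <- 0.8

# Row/column positions of the pairs above the cutoff; which() skips NA cells.
idx <- which(sim > threshold, arr.ind = TRUE)
flagged <- data.frame(
  doc_a = rownames(sim)[idx[, "row"]],
  doc_b = colnames(sim)[idx[, "col"]],
  similarity = sim[idx]
)
flagged
```

With these values, only the three pairs among the paper1 drafts pass the cutoff.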
You can also plot the relational graph if you'd like a clearer picture of who copied from whom.