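The similarity of two files can be compared in Python with difflib.SequenceMatcher, ssdeep, python_mmdt, or tlsh. When many files have to be compared and the files are large, efficiency becomes the main concern, and fuzzy hashing (for example ssdeep or python_mmdt) is worth considering.

Findings from testing:

- difflib: reads both files and then outputs a match ratio.
- ssdeep / mmdt / tlsh: the fuzzy hash can be computed in advance, so at verification time a single read is enough to complete the comparison. This reduces comparison time as well as memory and CPU usage.
- In the tlsh test, a smaller value means higher similarity. tlsh is not well suited to comparing small files.
- For small files, the three methods perform about the same.
- For large files (81 MB in this example), difflib is unacceptably slow.
- In a real environment, mmdt is the recommended choice: ssdeep results for binary files differ so much that they lose their reference value. Which other file types are affected by this still needs to be considered.

Test environment:

- OS: ubuntu 20.04
- python: 3.8.10
- py-tlsh==4.7.2
- python-mmdt==0.3.1
- ssdeep==3.4

```python
# -*- coding: utf-8 -*-
import time
from difflib import SequenceMatcher

import ssdeep
from python_mmdt.mmdt.mmdt import MMDT


def difflib_test(file1, file2):
    start_time = time.time()
    # difflib needs both files fully loaded into memory
    with open(file1, 'rb') as f:
        s1 = f.read()
    with open(file2, 'rb') as f:
        s2 = f.read()
    match_obj = SequenceMatcher(None, s1, s2)
    print("difflib match:", match_obj.ratio())
    end_time = time.time()
    print('difflib_test cost:', end_time - start_time)


def mmdt_test(file1, file2):
    start_time = time.time()
    mmdt = MMDT()
    r1 = mmdt.mmdt_hash(file1)
    print(r1)
    r2 = mmdt.mmdt_hash_streaming(file2)
    print(r2)
    # Alternative: compare the two files directly
    # sim1 = mmdt.mmdt_compare(file1, file2)
    # print("mmdt match:", sim1)
    sim2 = mmdt.mmdt_compare_hash(r1, r2)
    print("mmdt match:", sim2)
    end_time = time.time()
    print('mmdt_test cost:', end_time - start_time)


def ssdeep_test(file1, file2):
    start_time = time.time()
    sig1 = ssdeep.hash_from_file(file1)
    sig2 = ssdeep.hash_from_file(file2)
    print(sig1)
    print(sig2)
    print("ssdeep match:", ssdeep.compare(sig1, sig2))
    end_time = time.time()
    print('ssdeep_test cost:', end_time - start_time)


if __name__ == '__main__':
    start_time = time.time()
    # Small-file pair
    file1 = '/root/test/fstab'
    file2 = '/root/test/fstab2'
    # Large-file pair
    # file1 = '/root/test/initrd.img-5.4.0-125-generic'
    # file2 = '/root/test/initrd.img-5.4.0-135-generic'
    mmdt_test(file1, file2)
    ssdeep_test(file1, file2)
    difflib_test(file1, file2)
    end_time = time.time()
    print('Total execution time:', end_time - start_time)
```

Comparison results for the small files and the large files:

Testing tlsh:

```python
import time

import tlsh


def tlsh_test(file1, file2):
    start_time = time.time()
    with open(file1, 'rb') as f:
        s1 = tlsh.hash(f.read())
    with open(file2, 'rb') as f:
        s2 = tlsh.hash(f.read())
    # tlsh.diff returns a distance: the smaller the value, the more similar
    match_obj = tlsh.diff(s1, s2)
    print("tlsh match:", match_obj)
    end_time = time.time()
    print('tlsh_test cost:', end_time - start_time)


if __name__ == '__main__':
    start_time = time.time()
    # Small-file pair
    # file1 = '/root/test/fstab'
    # file2 = '/root/test/fstab2'
    # Large-file pair
    file1 = '/root/test/initrd.img-5.4.0-125-generic'
    file2 = '/root/test/initrd.img-5.4.0-135-generic'
    tlsh_test(file1, file2)
    end_time = time.time()
    print('Total execution time:', end_time - start_time)
```

Comparison of the small files and the large files: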
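The point about computing fuzzy hashes in advance can be illustrated with a small index. The following is only a minimal sketch, not part of the original tests: the helper names `build_index`, `save_index`, `load_index` and `rank_matches` are hypothetical, and the hashing and comparison functions are passed in as parameters so the sketch does not depend on any particular library. The `higher_is_better` flag is there because ssdeep and mmdt report a similarity, whereas tlsh reports a distance where smaller means more similar.

```python
import json
from pathlib import Path
from typing import Callable, Dict, List, Tuple


def build_index(paths: List[str], hash_file: Callable[[str], str]) -> Dict[str, str]:
    """Hash every file once and return a {path: digest} mapping."""
    return {path: hash_file(path) for path in paths}


def save_index(index: Dict[str, str], index_path: str) -> None:
    """Persist the digests so the files need not be read again later."""
    Path(index_path).write_text(json.dumps(index, indent=2), encoding="utf-8")


def load_index(index_path: str) -> Dict[str, str]:
    """Load a previously saved {path: digest} mapping."""
    return json.loads(Path(index_path).read_text(encoding="utf-8"))


def rank_matches(
    index: Dict[str, str],
    candidate_digest: str,
    compare: Callable[[str, str], float],
    higher_is_better: bool = True,
) -> List[Tuple[str, float]]:
    """Score the candidate digest against every stored digest, best match first.

    Set higher_is_better=False for distance-style scores, where a smaller
    value means the two files are more similar.
    """
    scored = [
        (path, compare(digest, candidate_digest))
        for path, digest in index.items()
    ]
    return sorted(scored, key=lambda item: item[1], reverse=higher_is_better)
```

With the libraries used in this post, one would presumably pass `ssdeep.hash_from_file` as `hash_file` and `ssdeep.compare` as `compare`. For tlsh, `hash_file` would be a small wrapper that reads the file and calls `tlsh.hash`, with `compare=tlsh.diff` and `higher_is_better=False`. At verification time only the new file is read and hashed; all other digests come from the saved index.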
