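Aspose.words + docx: merging docx files and removing the Aspose mark

Background: at work I needed to merge several Word documents and then convert the merged document to HTML for display on the client, keeping the original styling as far as possible.

Problems this article addresses:

- Merging multiple Word documents (mainly append-style merging).
- Converting the merged document to an HTML file. This involves the English and Japanese fonts shown in Word, and converting the images in the merged document to base64.
- Aspose is a commercial product. To use it free of charge, the marks it leaves in the conversion output are removed from the generated HTML afterwards, not by cracking the library.

Install

Main tools:

- aspose.words.python@6.22
- python-docx
- docxcompose
- bs4

Main code

import

    #!/usr/bin/env python3
    # -*- coding:utf-8 -*-
    # DESC: 1. Merge multiple docx files with python-docx / docxcompose
    #       2. Convert docx to html with aspose
    #       3. Add, delete, modify and query html elements and content with bs4
    import os
    import re

    import aspose.words as aw
    import aspose.words.saving as saving
    from bs4 import BeautifulSoup
    from docx import Document
    from docxcompose.composer import Composer

merge word documents

    def merge_docx(docx_list: list, docx_list_src: str, docx_merge_tar: str) -> str:
        """Merge Word documents.

        For now the documents are simply appended; no page breaks are inserted.
        """
        if len(docx_list) == 0:
            raise Exception("input is empty.")
        if len(docx_list) == 1:
            return os.path.join(docx_list_src, docx_list[0])
        # Use the first document as the base
        base_docx = Document(os.path.join(docx_list_src, docx_list[0]))
        base_docx_composer = Composer(base_docx)
        # Append each remaining document to the base with Composer.append
        for next_docx in docx_list[1:]:
            next_docx_path = os.path.join(docx_list_src, next_docx)
            base_docx_composer.append(Document(next_docx_path))
        base_docx_composer.save(docx_merge_tar)
        print("merge docx list ok.")
        return docx_merge_tar

word to html

    def aspose_convert_docx_html(docx_file_path: str, html_file_path: str) -> str:
        """Convert a Word document to HTML with aspose.words for Python."""
        docx = aw.Document(docx_file_path)
        # Set the conversion options
        save_options = saving.HtmlSaveOptions(aw.SaveFormat.HTML)
        # Embed images in the HTML as base64
        save_options.export_images_as_base64 = True
        docx.save(html_file_path, save_options)
        return html_file_path

remove aspose mark

    def del_aspose_elemet(html_tar_file: str, to_tar_file: str) -> None:
        """Remove the Aspose marks from the converted HTML."""
        # The "lxml" parser requires the lxml package to be installed
        with open(html_tar_file, "r", encoding="utf-8") as html_content:
            soup = BeautifulSoup(html_content, features="lxml")
        # Remove every tag whose style carries the Aspose header/footer marker
        for tag in soup.find_all(style=re.compile("-aw-headerfooter-type:")):
            tag.extract()
        # Remove the evaluation notice paragraph, if there is one
        word_key_tag = soup.find("p", string=re.compile(r"Evaluation\s*Only"))
        if word_key_tag is not None:
            word_key_tag.extract()
        with open(to_tar_file, "w", encoding="utf-8") as f:
            f.write(soup.prettify())

Test

    if __name__ == '__main__':
        docx_file_path = r"D:\merge_tar\demo.docx"
        html_file_path = r"D:\merge_tar\demo.html"
        aspose_convert_docx_html(docx_file_path, html_file_path)
        process_file_path = r"D:\merge_tar\demo_d.html"
        del_aspose_elemet(html_file_path, process_file_path)

Test result

Aspose offers many option settings for the Word-to-HTML conversion; for details, see the demos in the aspose.words GitHub repository.

bs4 is very capable when it comes to processing HTML.

This article is mainly a record of practical results from document processing at work. If it is useful to you, so much the better.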
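
The conversion step sets export_images_as_base64 so that images travel inside the HTML file instead of as separate files. The sketch below only illustrates the general mechanism behind that setting, a base64 data URI, using the Python standard library. The helper names to_data_uri and from_data_uri are hypothetical, and the exact markup Aspose emits is not shown here and may differ.

```python
import base64


def to_data_uri(image_bytes: bytes, mime_type: str = "image/png") -> str:
    """Encode raw image bytes as a data URI usable in an <img src="..."> attribute."""
    encoded = base64.b64encode(image_bytes).decode("ascii")
    return f"data:{mime_type};base64,{encoded}"


def from_data_uri(data_uri: str) -> bytes:
    """Recover the raw bytes from a base64 data URI."""
    header, _, payload = data_uri.partition(",")
    if not header.startswith("data:") or not header.endswith(";base64"):
        raise ValueError("not a base64 data URI")
    return base64.b64decode(payload)


# The 8-byte PNG signature stands in for a real image file here.
png_signature = b"\x89PNG\r\n\x1a\n"
uri = to_data_uri(png_signature)
img_tag = f'<img src="{uri}" />'
print(img_tag)
```

Because the image data is part of the src attribute, the HTML file can be moved or served on its own. The trade-off is size: base64 text is roughly a third larger than the binary it encodes.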
