Online preview of Word documents by converting them to HTML: trying PHPOffice, pydocx, and Java POI, and finally solving it with unoconv

Date: 2023-03-29 23:17:00 PHP

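Recently a client asked for an online preview feature for Word and Excel files. This post walks through how I got there.

Our project is written in PHP, so my first thought was PHPWord from PHPOffice. The key code is below. Only the final save() call survives from the original snippet; the two lines before it are the usual PHPWord load/createWriter pattern, and the input filename is assumed.

    // Assumed input file; load it and write it back out as HTML
    $phpWord = \PhpOffice\PhpWord\IOFactory::load('test.docx');
    $writer = \PhpOffice\PhpWord\IOFactory::createWriter($phpWord, 'HTML');
    $writer->save('test.html');

This does convert the file, but the resulting HTML was missing a lot of text compared with the original. I could live with the styling being different, but losing content is not acceptable. It also could not handle the DOC format, so I gave up on this approach.

Next I tried Python. There is a library for Word documents called pydocx, so I installed it:

    pip install pydocx

It is simple to use. The main code:

    from pydocx import PyDocX

    # PyDocX only reads .docx input
    html = PyDocX.to_html("test2.docx")
    with open("test.html", "w", encoding="utf-8") as f:
        f.write(html)

The result was decent. Table styling differed a little from the original, but no content was lost. The problem is that this library converts docx only, not doc, and our clients had uploaded many doc files, so I had to look elsewhere.

Some research turned up the POI library for Java. Apache POI is an open-source library from the Apache Software Foundation that gives Java programs an API for reading and writing Microsoft Office file formats. I decided to give it a try, and only started writing after a long time spent reading. First, the Maven dependencies:

    <dependency>
        <groupId>org.apache.poi</groupId>
        <artifactId>poi</artifactId>
        <version>4.1.2</version>
    </dependency>
    <dependency>
        <groupId>org.apache.poi</groupId>
        <artifactId>poi-ooxml</artifactId>
        <version>4.1.2</version>
    </dependency>
    <dependency>
        <groupId>org.apache.poi</groupId>
        <artifactId>poi-scratchpad</artifactId>
        <version>4.1.2</version>
    </dependency>
    <dependency>
        <groupId>fr.opensagres.xdocreport</groupId>
        <artifactId>fr.opensagres.poi.xwpf.converter.xhtml</artifactId>
        <version>2.0.2</version>
    </dependency>
    <dependency>
        <groupId>cn.hutool</groupId>
        <artifactId>hutool-all</artifactId>
        <version>5.4.3</version>
    </dependency>

The following is working code borrowed from someone else:

    import cn.hutool.core.img.ImgUtil;
    import fr.opensagres.poi.xwpf.converter.xhtml.Base64EmbedImgManager;
    import fr.opensagres.poi.xwpf.converter.xhtml.XHTMLConverter;
    import fr.opensagres.poi.xwpf.converter.xhtml.XHTMLOptions;
    import org.apache.poi.hwpf.HWPFDocument;
    import org.apache.poi.hwpf.converter.WordToHtmlConverter;
    import org.apache.poi.openxml4j.util.ZipSecureFile;
    import org.apache.poi.xwpf.usermodel.XWPFDocument;
    import org.w3c.dom.Document;

    import javax.xml.parsers.DocumentBuilderFactory;
    import javax.xml.parsers.ParserConfigurationException;
    import javax.xml.transform.OutputKeys;
    import javax.xml.transform.Transformer;
    import javax.xml.transform.TransformerException;
    import javax.xml.transform.TransformerFactory;
    import javax.xml.transform.dom.DOMSource;
    import javax.xml.transform.stream.StreamResult;
    import java.awt.image.BufferedImage;
    import java.io.*;

    /**
     * Office conversion utility (test)
     */
    public class OfficeConvertUtil {

        /**
         * Convert a Word 2003 (.doc) file to an HTML file. 2017-2-27
         *
         * @param wordPath directory containing the Word file
         * @param wordName Word file name without the extension
         * @param suffix   Word file extension
         * @throws IOException
         * @throws TransformerException
         * @throws ParserConfigurationException
         */
        public static String Word2003ToHtml(String wordPath, String wordName, String suffix)
                throws IOException, TransformerException, ParserConfigurationException {
            String htmlPath = wordPath + File.separator + "html" + File.separator;
            String htmlName = wordName + ".html";
            final String imagePath = htmlPath + "image" + File.separator;

            // The existence check is disabled, so the HTML file is regenerated on every call
            File htmlFile = new File(htmlPath + htmlName);
            // if (htmlFile.exists()) {
            //     return htmlFile.getAbsolutePath();
            // }

            // Source Word document
            final String file = wordPath + File.separator + wordName + suffix;
            InputStream input = new FileInputStream(new File(file));
            HWPFDocument wordDocument = new HWPFDocument(input);
            WordToHtmlConverter wordToHtmlConverter = new WordToHtmlConverter(
                    DocumentBuilderFactory.newInstance().newDocumentBuilder().newDocument());

            // For documents with pictures, embed each picture in the page as a base64 data URI
            wordToHtmlConverter.setPicturesManager(
                    (content, pictureType, suggestedName, widthInches, heightInches) -> {
                        BufferedImage bufferedImage = ImgUtil.toImage(content);
                        String base64Img = ImgUtil.toBase64(bufferedImage, pictureType.getExtension());
                        StringBuilder sb = new StringBuilder(base64Img.length() + "data:;base64,".length())
                                .append("data:;base64,")
                                .append(base64Img);
                        return sb.toString();
                    });

            // Parse the Word document
            wordToHtmlConverter.processDocument(wordDocument);
            Document htmlDocument = wordToHtmlConverter.getDocument();

            // Create the parent folder for the HTML file
            File folder = new File(htmlPath);
            if (!folder.exists()) {
                folder.mkdirs();
            }

            // Write the HTML file
            OutputStream outStream = new FileOutputStream(htmlFile);
            DOMSource domSource = new DOMSource(htmlDocument);
            StreamResult streamResult = new StreamResult(outStream);
            TransformerFactory factory = TransformerFactory.newInstance();
            Transformer serializer = factory.newTransformer();
            serializer.setOutputProperty(OutputKeys.ENCODING, "utf-8");
            serializer.setOutputProperty(OutputKeys.INDENT, "yes");
            serializer.setOutputProperty(OutputKeys.METHOD, "html");
            serializer.transform(domSource, streamResult);
            outStream.close();
            return htmlFile.getAbsolutePath();
        }

        /**
         * Convert a Word 2007 (.docx) file to HTML. 2017-2-27
         *
         * @param wordPath directory containing the Word file
         * @param wordName Word file name without the extension
         * @param suffix   Word file extension
         * @return absolute path of the generated HTML file
         * @throws IOException
         */
        public static String Word2007ToHtml(String wordPath, String wordName, String suffix)
                throws IOException {
            ZipSecureFile.setMinInflateRatio(-1.0d);
            String htmlPath = wordPath + File.separator + "html" + File.separator;
            String htmlName = wordName + ".html";
            String imagePath = htmlPath + "image" + File.separator;

            // The existence check is disabled here as well
            File htmlFile = new File(htmlPath + htmlName);
            // if (htmlFile.exists()) {
            //     return htmlFile.getAbsolutePath();
            // }

            // Word file
            File wordFile = new File(wordPath + File.separator + wordName + suffix);

            // 1) Load the Word document into an XWPFDocument
            InputStream in = new FileInputStream(wordFile);
            XWPFDocument document = new XWPFDocument(in);

            // 2) Configure the XHTML options (this is where an IURIResolver or image directory would be set)
            File imgFolder = new File(imagePath);
            // For documents with pictures, embed them in the page as base64
            XHTMLOptions options = XHTMLOptions.create().indent(4)
                    .setImageManager(new Base64EmbedImgManager());

            // 3) Convert the XWPFDocument to XHTML
            // Create the parent folder for the HTML file
            File folder = new File(htmlPath);
            if (!folder.exists()) {
                folder.mkdirs();
            }
            OutputStream out = new FileOutputStream(htmlFile);
            XHTMLConverter.getInstance().convert(document, out, options);
            return htmlFile.getAbsolutePath();
        }

        public static void main(String[] args) throws Exception {
            System.out.println(Word2003ToHtml("D:\\tmp", "test", ".doc"));
            System.out.println(Word2007ToHtml("D:\\tmp", "test2", ".docx"));
        }
    }

Java did a good job on the doc format, but when converting docx the styling came out completely scrambled. I spent a long time in the POI documentation and could not find anyone online who had solved the styling problem. I considered using Python for docx and Java for doc, but maintaining two toolchains was too much trouble.

After more research, this is the solution I settled on. I went back to PHP, but instead of PHPOffice I use unoconv to do the conversion. Install LibreOffice first:

    yum install libreoffice

Then unoconv:

    yum install unoconv

Convert a file:

    unoconv -f html -o test.html test.doc

-f is the output format, -o is the output file, and the last argument is the input file. See the unoconv documentation for the full usage.

In PHP I run the external command to generate the converted file and then redirect to it. Converting Excel to HTML raised an error, so Excel files are converted to PDF instead.

    if (file_exists($source)) {
        $ext = pathinfo($source)['extension'];
        // Excel files are previewed as PDF, everything else as HTML
        if (!in_array($ext, ['xls', 'xlsx'])) {
            $filetype = 'html';
        } else {
            $filetype = 'pdf';
        }
        // Base name up to the first dot, plus the new extension
        $filename = strstr(basename($source), '.', true) . '.' . $filetype;
        $file = $filename;
        if (!file_exists('data/' . $file)) {
            // escapeshellarg() keeps unusual file names from breaking the command
            $cmd = "sudo /usr/bin/unoconv -f {$filetype} -o "
                . escapeshellarg('/data/web/public/data/' . $file) . ' '
                . escapeshellarg('/data/web/data_manage/public/' . $source);
            // echo $cmd; exit;   // uncomment to inspect the command
            $res = shell_exec($cmd);
            if (!file_exists('data/' . $file)) {
                dump($res);
                exit('Error generating the preview file');
            }
        }
        header("Location:" . '/data/' . $file);
        exit();
    } else {
        exit('File does not exist');
    }

In the end, doc, docx, Excel, and WPS files can all be previewed. The styling still changes a little, but no content is lost, and the client found that acceptable. That is my experience solving this problem; I hope it helps.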
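The format-selection and command-building logic of the final unoconv approach can also be sketched in Python, one of the languages tried above. This is only an illustration under stated assumptions, not the code used in the project: the function names (`preview_format`, `preview_path`, `build_command`, `convert`) and the output directory are hypothetical, and the extension check is made case-insensitive here, which the PHP version is not. The only unoconv options used are `-f` and `-o`, as described in the post.

```python
import subprocess
from pathlib import Path
from typing import List

# Extensions previewed as PDF instead of HTML (the post reports that
# Excel-to-HTML conversion failed, so spreadsheets go to PDF).
PDF_EXTENSIONS = {".xls", ".xlsx"}


def preview_format(source: str) -> str:
    """Pick the unoconv output format from the source file extension."""
    return "pdf" if Path(source).suffix.lower() in PDF_EXTENSIONS else "html"


def preview_path(source: str, out_dir: str) -> Path:
    """Preview file path: base name up to the first dot, plus the new extension."""
    base = Path(source).name.split(".", 1)[0]
    return Path(out_dir) / f"{base}.{preview_format(source)}"


def build_command(source: str, out_dir: str, unoconv: str = "unoconv") -> List[str]:
    """Build the argument list: unoconv -f <format> -o <output> <input>."""
    return [unoconv, "-f", preview_format(source),
            "-o", str(preview_path(source, out_dir)), source]


def convert(source: str, out_dir: str) -> Path:
    """Run unoconv only if the preview does not exist yet; return its path."""
    target = preview_path(source, out_dir)
    if not target.exists():
        # Passing a list (no shell) avoids quoting problems with odd file names.
        subprocess.run(build_command(source, out_dir), check=True)
    return target
```

Calling `convert("test.doc", "data")` would run the same kind of command as the shell example in the post and return the path of the generated preview, which a web handler could then redirect to. Passing the arguments as a list rather than a single shell string is a design choice that sidesteps shell quoting altogether.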
