[Crawler] Getting the current node's HTML with lxml and displaying Chinese correctly

Date: 2023-03-26 11:41:09 Python

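Getting the current node: etree.tostring

Displaying Chinese correctly

Method 1: use the unescape function from the html library (html.unescape)

    from lxml import etree
    import html

    # Read the saved page as UTF-8 text
    with open('list.html', 'r', encoding='utf-8') as f:
        text = f.read()

    tree = etree.HTML(text)
    # By default tostring() returns ASCII bytes in which Chinese characters
    # appear as numeric character references (&#...;); html.unescape
    # converts them back into the actual characters
    r = html.unescape(etree.tostring(tree.xpath('//*[@id="scroll_marquee"]')[0]).decode('utf-8'))
    print(r)
    print(type(r))

Reference link: Fixing garbled Chinese (&#digits;) when calling tostring() while crawling web pages

Method 2: use lxml's etree.tostring with an explicit encoding

    from lxml import etree
    import requests

    response = requests.get('https://www.baidu.com/').text
    tree = etree.HTML(response)
    strs = tree.xpath("//body")
    strs = strs[0]
    # Serialize the node as UTF-8 bytes, then decode the bytes to str
    strs = str(etree.tostring(strs, encoding="utf-8"), encoding='utf-8')
    print(strs)

Reference link: Extracting HTML tag content with lxml: a fix for tostring() not displaying Chinese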
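
The two approaches above can be compared side by side without a file or a network request. The following is a minimal sketch, assuming lxml is installed. The inline SAMPLE markup and the helper name node_html_variants are hypothetical and used only for illustration. The id scroll_marquee is reused from the example above.

```python
import html
from lxml import etree

# Hypothetical inline markup standing in for list.html or a fetched page
SAMPLE = '<html><body><div id="scroll_marquee"><p>中文内容</p></div></body></html>'


def node_html_variants(markup, xpath):
    """Serialize the first node matching xpath in three ways."""
    tree = etree.HTML(markup)
    node = tree.xpath(xpath)[0]
    # Default serialization: ASCII bytes, Chinese written as &#NNNNN;
    raw = etree.tostring(node).decode("ascii")
    # Method 1: turn the numeric character references back into characters
    unescaped = html.unescape(raw)
    # Method 2: ask lxml for UTF-8 bytes and decode them
    utf8 = etree.tostring(node, encoding="utf-8").decode("utf-8")
    return raw, unescaped, utf8


raw, unescaped, utf8 = node_html_variants(SAMPLE, '//*[@id="scroll_marquee"]')
print(raw)
print(unescaped)
print(utf8)
```

One difference worth keeping in mind: html.unescape also decodes references such as &lt; and &amp; that appear in the node's text, so the result of method 1 may no longer be valid markup. Passing an encoding to etree.tostring leaves those references untouched.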