
Three Methods of Data Scraping in Python

Posted: 2023-03-25 20:37:48 · Python

This article covers three data-scraping methods:

* regular expressions (the re module)
* BeautifulSoup (bs4)
* lxml

First, use the page-download helper built earlier to fetch the HTML of the target page. We use https://guojiadiqu.bmcx.com/AFG__guojiayudiqu/ as the example:

```python
from get_html import download

url = 'https://guojiadiqu.bmcx.com/AFG__guojiayudiqu/'
page_content = download(url)
```

Suppose we want to scrape the country name and the overview text from this page. We extract that data with each of the three methods in turn.

1. Regular expressions

```python
from get_html import download
import re

url = 'https://guojiadiqu.bmcx.com/AFG__guojiayudiqu/'
page_content = download(url)
# note: findall returns a list
country = re.findall('class="h2dabiaoti">(.*?)</h2>', page_content)
# capture the body of the <div id="wzneirong"> block; re.S lets . match newlines
survey_data = re.findall('<div id="wzneirong">(.*?)</div>', page_content, re.S)
# pull the text out of each <p> paragraph and join it together
survey_info_list = re.findall('<p>(.*?)</p>', survey_data[0], re.S)
survey_info = ''.join(survey_info_list)
print(country[0], survey_info)
```

2. BeautifulSoup (bs4)

```python
from get_html import download
from bs4 import BeautifulSoup

url = 'https://guojiadiqu.bmcx.com/AFG__guojiayudiqu/'
html = download(url)
# create a BeautifulSoup object
soup = BeautifulSoup(html, "html.parser")
# search by class / id and take the text
country = soup.find(attrs={'class': 'h2dabiaoti'}).text
survey_info = soup.find(attrs={'id': 'wzneirong'}).text
print(country, survey_info)
```

3. lxml

```python
from get_html import download
from lxml import etree

url = 'https://guojiadiqu.bmcx.com/AFG__guojiayudiqu/'
page_content = download(url)
# parse the page into an element tree
selector = etree.HTML(page_content)
# the XPath can be copied from the browser's element inspector;
# each matched <p> element exposes its text via .text
survey_select = selector.xpath('//*[@id="wzneirong"]/p')
for survey_content in survey_select:
    print(survey_content.text, end='')
```

Output: (shown as a screenshot in the original post)

Finally, the performance comparison of the three approaches, taken from the book Web Scraping with Python, appears as a chart in the original post. For reference only.
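The examples above import a `download` helper from `get_html` that the article carries over from an earlier post without showing it. A minimal sketch of what such a helper might look like, using only the standard library (the retry count, user agent, and error handling are assumptions, not the author's code):

```python
# Hypothetical stand-in for the get_html.download helper the article
# imports; the real implementation is not shown in the post.
import urllib.request
import urllib.error

def download(url, num_retries=2, user_agent='wswp'):
    """Fetch a URL and return its HTML as text, retrying on 5xx errors."""
    print('Downloading:', url)
    request = urllib.request.Request(url, headers={'User-Agent': user_agent})
    try:
        with urllib.request.urlopen(request) as response:
            charset = response.headers.get_content_charset() or 'utf-8'
            html = response.read().decode(charset, errors='replace')
    except urllib.error.URLError as e:
        print('Download error:', e.reason)
        html = None
        # retry only for 5xx server-side errors
        if num_retries > 0 and hasattr(e, 'code') and 500 <= e.code < 600:
            html = download(url, num_retries - 1, user_agent)
    return html
```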
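The regex approach can be verified offline against a small snippet that mimics the page's `h2dabiaoti`/`wzneirong` markup (the snippet and its text are invented for illustration, not the real page content):

```python
import re

# Invented snippet mirroring the structure the patterns target:
# an <h2 class="h2dabiaoti"> title and a <div id="wzneirong"> body of <p> tags.
page_content = ('<h2 class="h2dabiaoti">Afghanistan</h2>'
                '<div id="wzneirong"><p>Landlocked country </p>'
                '<p>in Central Asia.</p></div>')

country = re.findall('class="h2dabiaoti">(.*?)</h2>', page_content)
survey_data = re.findall('<div id="wzneirong">(.*?)</div>', page_content, re.S)
survey_info_list = re.findall('<p>(.*?)</p>', survey_data[0], re.S)
survey_info = ''.join(survey_info_list)
print(country[0], survey_info)
```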
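The BeautifulSoup and lxml lookups can likewise be exercised offline on an invented snippet that mirrors the page's markup; this sketch assumes the third-party bs4 and lxml packages are installed:

```python
# Offline check of the BeautifulSoup and lxml extraction on an
# invented snippet; requires the third-party bs4 and lxml packages.
from bs4 import BeautifulSoup
from lxml import etree

html = ('<h2 class="h2dabiaoti">Afghanistan</h2>'
        '<div id="wzneirong"><p>Landlocked country </p>'
        '<p>in Central Asia.</p></div>')

# BeautifulSoup: find by class / id, then take the concatenated text
soup = BeautifulSoup(html, 'html.parser')
country = soup.find(attrs={'class': 'h2dabiaoti'}).text
survey_info = soup.find(attrs={'id': 'wzneirong'}).text

# lxml: XPath to the <p> children of the wzneirong div
selector = etree.HTML(html)
paragraphs = selector.xpath('//*[@id="wzneirong"]/p')
xpath_text = ''.join(p.text for p in paragraphs)

print(country, survey_info)
print(xpath_text)
```

Both libraries recover the same text; BeautifulSoup is generally the slowest of the three while lxml's compiled XPath engine is close to regex speed, which matches the comparison the article cites.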