Python Web Scraping in Practice: Comparing Single-Threaded, Multithreaded, and Coroutine Performance

Date: 2023-03-14 17:07:50  科技观察

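1. Preface

Today I will show how to scrape commodity price data from 中农网 (zhongnongwang.com) with an ordinary single-threaded crawler, a multithreaded crawler, and a coroutine-based crawler, and compare the performance of the three approaches.

Target URL: https://www.zhongnongwang.com/quote/product-htm-page-1.html

For each product we scrape the product name, latest quote, unit, quote number, and quote time, and save the data to a local Excel file.

2. Scraping test

Turn the pages and watch how the URL changes:

https://www.zhongnongwang.com/quote/product-htm-page-1.html
https://www.zhongnongwang.com/quote/product-htm-page-2.html
https://www.zhongnongwang.com/quote/product-htm-page-3.html
https://www.zhongnongwang.com/quote/product-htm-page-4.html
https://www.zhongnongwang.com/quote/product-htm-page-5.html
https://www.zhongnongwang.com/quote/product-htm-page-6.html

Only the page number in the URL changes. The page structure is simple, so the data is easy to parse and extract.

Approach: each product quote sits in a tr row of the quotation table. Select all of those tr rows, then loop over them and extract the product name, latest quote, unit, quote number, and quote time from each one.

    # -*- coding: UTF-8 -*-
    """
    @File   : demo.py
    @Author : 叶庭云
    @CSDN   : https://yetingyun.blog.csdn.net/
    """
    import requests
    import logging
    from fake_useragent import UserAgent
    from lxml import etree

    # Basic logging configuration
    logging.basicConfig(level=logging.INFO, format='%(asctime)s - %(levelname)s: %(message)s')
    # Generate a random User-Agent (arguments follow the older fake_useragent API)
    ua = UserAgent(verify_ssl=False, path='fake_useragent.json')
    url = 'https://www.zhongnongwang.com/quote/product-htm-page-1.html'
    # Disguise the request headers
    headers = {
        "Accept-Encoding": "gzip",  # gzip-compressed transfer makes access faster
        "User-Agent": ua.random
    }
    # Send the request and get the response
    rep = requests.get(url, headers=headers)
    print(rep.status_code)  # 200
    # Locate and extract the data with XPath
    html = etree.HTML(rep.text)
    items = html.xpath('/html/body/div[10]/table/tr[@align="center"]')
    logging.info(f'Number of records on this page: {len(items)}')  # 20 records per page
    # Loop over the rows and extract the fields
    for item in items:
        name = ''.join(item.xpath('.//td[1]/a/text()'))   # product name
        price = ''.join(item.xpath('.//td[3]/text()'))    # latest quote
        unit = ''.join(item.xpath('.//td[4]/text()'))     # unit
        nums = ''.join(item.xpath('.//td[5]/text()'))     # quote number
        time_ = ''.join(item.xpath('.//td[6]/text()'))    # quote time
        logging.info([name, price, unit, nums, time_])

The script runs and scrapes the data successfully. Next, a single-threaded crawler, a multithreaded crawler, and a coroutine crawler each scrape 50 pages and save the data to Excel.

3. Single-threaded crawler

    # -*- coding: UTF-8 -*-
    """
    @File   : single-threaded.py
    @Author : 叶庭云
    @CSDN   : https://yetingyun.blog.csdn.net/
    """
    import requests
    import logging
    from fake_useragent import UserAgent
    from lxml import etree
    import openpyxl
    from datetime import datetime

    # Basic logging configuration
    logging.basicConfig(level=logging.INFO, format='%(asctime)s - %(levelname)s: %(message)s')
    # Generate a random User-Agent
    ua = UserAgent(verify_ssl=False, path='fake_useragent.json')
    wb = openpyxl.Workbook()
    sheet = wb.active
    sheet.append(['产品名称', '最新报价', '单位', '报价单号', '报价时间'])
    start = datetime.now()

    for page in range(1, 51):
        # Construct the URL
        url = f'https://www.zhongnongwang.com/quote/product-htm-page-{page}.html'
        # Disguise the request headers
        headers = {
            "Accept-Encoding": "gzip",  # gzip-compressed transfer makes access faster
            "User-Agent": ua.random
        }
        # Send the request and get the response
        rep = requests.get(url, headers=headers)
        # print(rep.status_code)
        # Locate and extract the data with XPath
        html = etree.HTML(rep.text)
        items = html.xpath('/html/body/div[10]/table/tr[@align="center"]')
        logging.info(f'Number of records on this page: {len(items)}')  # 20 records per page
        # Loop over the rows and extract the fields
        for item in items:
            name = ''.join(item.xpath('.//td[1]/a/text()'))   # product name
            price = ''.join(item.xpath('.//td[3]/text()'))    # latest quote
            unit = ''.join(item.xpath('.//td[4]/text()'))     # unit
            nums = ''.join(item.xpath('.//td[5]/text()'))     # quote number
            time_ = ''.join(item.xpath('.//td[6]/text()'))    # quote time
            sheet.append([name, price, unit, nums, time_])
            logging.info([name, price, unit, nums, time_])

    wb.save(filename='data1.xlsx')
    delta = (datetime.now() - start).total_seconds()
    logging.info(f'Time spent: {delta}s')

The single-threaded crawler has to finish one page before it can move on to the next. The run took 48.528703 s to scrape the data, which is fairly slow; the figure may also have been affected by network conditions at the time.

4. Multithreaded crawler

    # -*- coding: UTF-8 -*-
    """
    @File   : multithreading.py
    @Author : 叶庭云
    @CSDN   : https://yetingyun.blog.csdn.net/
    """
    import requests
    import logging
    from fake_useragent import UserAgent
    from lxml import etree
    import openpyxl
    from concurrent.futures import ThreadPoolExecutor, wait, ALL_COMPLETED
    from datetime import datetime

    # Basic logging configuration
    logging.basicConfig(level=logging.INFO, format='%(asctime)s - %(levelname)s: %(message)s')
    # Generate a random User-Agent
    ua = UserAgent(verify_ssl=False, path='fake_useragent.json')
    wb = openpyxl.Workbook()
    sheet = wb.active
    sheet.append(['产品名称', '最新报价', '单位', '报价单号', '报价时间'])
    start = datetime.now()


    def get_data(page):
        # Construct the URL
        url = f'https://www.zhongnongwang.com/quote/product-htm-page-{page}.html'
        # Disguise the request headers
        headers = {
            "Accept-Encoding": "gzip",  # gzip-compressed transfer makes access faster
            "User-Agent": ua.random
        }
        # Send the request and get the response
        rep = requests.get(url, headers=headers)
        # print(rep.status_code)
        # Locate and extract the data with XPath
        html = etree.HTML(rep.text)
        items = html.xpath('/html/body/div[10]/table/tr[@align="center"]')
        logging.info(f'Number of records on this page: {len(items)}')  # 20 records per page
        # Loop over the rows and extract the fields
        for item in items:
            name = ''.join(item.xpath('.//td[1]/a/text()'))   # product name
            price = ''.join(item.xpath('.//td[3]/text()'))    # latest quote
            unit = ''.join(item.xpath('.//td[4]/text()'))     # unit
            nums = ''.join(item.xpath('.//td[5]/text()'))     # quote number
            time_ = ''.join(item.xpath('.//td[6]/text()'))    # quote time
            sheet.append([name, price, unit, nums, time_])
            logging.info([name, price, unit, nums, time_])


    def run():
        # Scrape pages 1-50 with a pool of 6 worker threads
        with ThreadPoolExecutor(max_workers=6) as executor:
            future_tasks = [executor.submit(get_data, i) for i in range(1, 51)]
            wait(future_tasks, return_when=ALL_COMPLETED)

        wb.save(filename='data2.xlsx')
        delta = (datetime.now() - start).total_seconds()
        print(f'Time spent: {delta}s')


    run()

The multithreaded crawler is far more efficient: the run took 2.648128 s, which is very fast.

5. Asynchronous coroutine crawler

    # -*- coding: UTF-8 -*-
    """
    @File   : demo1.py
    @Author : 叶庭云
    @CSDN   : https://yetingyun.blog.csdn.net/
    """
    import aiohttp
    import asyncio
    import logging
    from fake_useragent import UserAgent
    from lxml import etree
    import openpyxl
    from datetime import datetime

    # Basic logging configuration
    logging.basicConfig(level=logging.INFO, format='%(asctime)s - %(levelname)s: %(message)s')
    # Generate a random User-Agent
    ua = UserAgent(verify_ssl=False, path='fake_useragent.json')
    wb = openpyxl.Workbook()
    sheet = wb.active
    sheet.append(['产品名称', '最新报价', '单位', '报价单号', '报价时间'])
    start = datetime.now()


    class Spider(object):
        def __init__(self):
            # self.semaphore = asyncio.Semaphore(6)  # semaphore: sometimes the number of coroutines must be capped so the crawl is not too fast
            self.header = {
                "Accept-Encoding": "gzip",  # gzip-compressed transfer makes access faster
                "User-Agent": ua.random
            }

        async def scrape(self, url):
            # async with self.semaphore:  # cap the concurrency so the crawl is not fast enough to trigger anti-scraping measures
            session = aiohttp.ClientSession(headers=self.header, connector=aiohttp.TCPConnector(ssl=False))
            response = await session.get(url)
            result = await response.text()
            await session.close()
            return result

        async def scrape_index(self, page):
            url = f'https://www.zhongnongwang.com/quote/product-htm-page-{page}.html'
            text = await self.scrape(url)
            await self.parse(text)

        async def parse(self, text):
            # Locate and extract the data with XPath
            html = etree.HTML(text)
            items = html.xpath('/html/body/div[10]/table/tr[@align="center"]')
            logging.info(f'Number of records on this page: {len(items)}')  # 20 records per page
            # Loop over the rows and extract the fields
            for item in items:
                name = ''.join(item.xpath('.//td[1]/a/text()'))   # product name
                price = ''.join(item.xpath('.//td[3]/text()'))    # latest quote
                unit = ''.join(item.xpath('.//td[4]/text()'))     # unit
                nums = ''.join(item.xpath('.//td[5]/text()'))     # quote number
                time_ = ''.join(item.xpath('.//td[6]/text()'))    # quote time
                sheet.append([name, price, unit, nums, time_])
                logging.info([name, price, unit, nums, time_])

        async def main(self):
            # Schedule all 50 pages and wait for every task to finish
            await asyncio.gather(*(self.scrape_index(page) for page in range(1, 51)))


    if __name__ == '__main__':
        spider = Spider()
        # asyncio.run creates the event loop, runs main() to completion, and closes the loop
        asyncio.run(spider.main())
        wb.save('data3.xlsx')
        delta = (datetime.now() - start).total_seconds()
        print("Time spent: {:.3f}s".format(delta))

With the asynchronous coroutine crawler the speed goes up another level: scraping 50 pages took only 0.930 s. That is how powerful aiohttp + asyncio can be. As long as the server can withstand high concurrency, raising the concurrency of an asynchronous crawler improves efficiency considerably, and it is faster than multithreading.

All three crawlers scraped 50 pages of data and saved them locally (data1.xlsx, data2.xlsx, and data3.xlsx).

6. Summary and review

Today we demonstrated a simple single-threaded crawler, a multithreaded crawler, and an asynchronous coroutine crawler. In general the asynchronous crawler is the fastest, the multithreaded crawler is somewhat slower, and the single-threaded crawler is the slowest, because it has to finish the previous page before it can continue with the next.

Asynchronous coroutine crawlers are, however, harder to write. The blocking requests library cannot be used for fetching inside coroutines, so an asynchronous HTTP client such as aiohttp is needed. When a large amount of data is scraped, the asynchronous crawler also needs a semaphore with a maximum value to limit the number of coroutines; otherwise the crawl is so fast that it triggers anti-scraping measures.

So when writing Python crawlers in practice, we usually use multithreading to speed things up. Keep in mind that websites limit the request frequency per IP, and crawling too fast can get the IP banned, so multithreading is usually combined with proxy IPs for concurrent scraping.

Multithreading: the technique of executing multiple threads concurrently, implemented in software or hardware. A computer with multithreading capability has hardware support that lets it execute more than one thread at the same time, which improves overall processing performance. Systems with this capability include symmetric multiprocessors, multi-core processors, and chip-level multiprocessing or simultaneous multithreading processors. Within a program, these independently running fragments are called "threads", and programming with them is called "multithreading".

Asynchronous: different program units can complete a task without communicating or coordinating with each other along the way. Unrelated program units can be asynchronous. Take a crawler downloading web pages: once the scheduler has called the downloader, it can go on to schedule other tasks without staying in contact with the download task to coordinate behavior. Downloading and saving different pages are unrelated operations that do not need to notify or coordinate with each other, and the moment at which each of them completes is not determined in advance. In short, asynchronous means unordered.

Coroutine: also known as a micro-thread or fiber, a coroutine is a lightweight thread in user space. A coroutine has its own register context and stack. When the scheduler switches away from a coroutine, the register context and stack are saved elsewhere; when it switches back, the saved register context and stack are restored. A coroutine can therefore keep the state of its last call (a particular combination of all its local state), and each re-entry resumes from the state of that last call.

Coroutines essentially run inside a single process. Compared with multiprocessing or multithreading, they avoid the overhead of thread context switches and the overhead of locking and synchronizing atomic operations, and the programming model is very simple. Coroutines can be used to implement asynchronous operations. In a web crawler, for example, after a request is sent there is a wait before the response arrives, and during that wait the program can do plenty of other work, switching back to continue processing once the response is ready. This makes full use of the CPU and other resources, and that is the advantage of coroutines.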
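The advice above about capping the number of coroutines with a semaphore can be sketched as follows. This is a minimal illustration and not the author's code: `fake_fetch`, `crawl`, `run_demo`, and `MAX_CONCURRENCY` are hypothetical names, `asyncio.sleep` stands in for a real aiohttp request so the example runs without network access, and the limit of 6 is borrowed from the commented-out `asyncio.Semaphore(6)` in the coroutine crawler.

```python
import asyncio

MAX_CONCURRENCY = 6  # hypothetical cap, mirroring the commented-out Semaphore(6)


async def fake_fetch(page, semaphore, state):
    """Stand-in for an HTTP request: sleeps instead of touching the network."""
    async with semaphore:  # at most MAX_CONCURRENCY coroutines are inside this block
        state["active"] += 1
        state["peak"] = max(state["peak"], state["active"])
        await asyncio.sleep(0.01)  # simulated network latency
        state["active"] -= 1
        return page


async def crawl(pages):
    # Create the semaphore inside the running event loop
    semaphore = asyncio.Semaphore(MAX_CONCURRENCY)
    state = {"active": 0, "peak": 0}
    # gather() returns the results in the same order as the submitted coroutines
    results = await asyncio.gather(*(fake_fetch(p, semaphore, state) for p in pages))
    return list(results), state["peak"]


def run_demo():
    """Run the simulated 50-page crawl and return (results, peak concurrency)."""
    return asyncio.run(crawl(range(1, 51)))


if __name__ == "__main__":
    results, peak = run_demo()
    print(f"fetched {len(results)} pages, peak concurrency {peak}")
```

All 50 coroutines are created at once, but the semaphore lets only a handful proceed at a time; the others wait at `async with` until a slot frees up. In a real crawler the `asyncio.sleep` call would be replaced by the aiohttp request, and the cap would be tuned to what the target site tolerates.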
