Python Web Crawlers: Synchronous vs. Asynchronous

Date: 2023-03-11 20:19:02 科技观察

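1. Synchronous and asynchronous

    # Synchronous programming (only one thing at a time; finish it, then start the next)
    <-a_url-><-b_url-><-c_url->

    # Asynchronous programming (roughly: several things in progress at once, but started in order)
    <-a_url->
      <-b_url->
        <-c_url->
          <-d_url->
            <-e_url->
              <-f_url->
                <-g_url->
                  <-h_url->
                    <--i_url-->
                      <--j_url-->

Template

    import asyncio

    # Function name: while doing the current task, don't wait; go on with other tasks.
    async def donow_meantime_dontwait(url):
        # Schematic only: requests.get is not awaitable (see the tips below)
        response = await requests.get(url)

    # Function name: do the tasks quickly and efficiently
    async def fast_do_your_thing():
        # asyncio.wait() no longer accepts bare coroutines (Python 3.11+), so use gather()
        await asyncio.gather(*(donow_meantime_dontwait(url) for url in urls))

    # Boilerplate for starting the event loop; just remember it
    asyncio.run(fast_do_your_thing())

Tips:

- The object in an await expression must be awaitable.
- requests does not support non-blocking I/O.
- aiohttp is a library for asynchronous requests.

Code

    import asyncio
    import time

    import aiohttp

    urls = ['https://book.douban.com/tag/fiction',
            'https://book.douban.com/tag/sciencefiction',
            'https://book.douban.com/tag/comics',
            'https://book.douban.com/tag/fantasy',
            'https://book.douban.com/tag/history',
            'https://book.douban.com/tag/economics']

    async def requests_meantime_dont_wait(url):
        print(url)
        async with aiohttp.ClientSession() as session:
            async with session.get(url) as resp:
                print(resp.status)
                print("{url} got a response".format(url=url))

    async def fast_requsts(urls):
        start = time.time()
        # Run all requests concurrently and wait until every one has finished
        await asyncio.gather(*(requests_meantime_dont_wait(url) for url in urls))
        end = time.time()
        print("Complete in {} seconds".format(end - start))

    asyncio.run(fast_requsts(urls))

Introduction to gevent

gevent is a Python concurrency library that provides a clean API for a variety of concurrency and network-related tasks. The primary pattern used in gevent is the greenlet, a lightweight coroutine provided to Python as a C extension module. Greenlets all run inside the main operating-system process, but they are scheduled cooperatively.

Monkey patching

The requests library is blocking. To turn requests from synchronous into asynchronous, its blocking calls have to become non-blocking; only then are asynchronous operations possible. The monkey patch in gevent does this: it modifies most of the blocking system calls in the standard library, so that the blocking calls used by an application become cooperative (asynchronous) without changing the original code.

Code

    from gevent import monkey
    monkey.patch_all()  # patch as early as possible, before importing requests

    import time

    import gevent
    import requests

    def req(url):
        print(url)
        resp = requests.get(url)
        print(resp.status_code, url)

    def synchronous_times(urls):
        """Run time of synchronous requests"""
        start = time.time()
        for url in urls:
            req(url)
        end = time.time()
        print('Synchronous run time: {} s'.format(end - start))

    def asynchronous_times(urls):
        """Run time of asynchronous requests"""
        start = time.time()
        gevent.joinall([gevent.spawn(req, url) for url in urls])
        end = time.time()
        print('Asynchronous run time: {} s'.format(end - start))

    urls = ['https://book.douban.com/tag/fiction',
            'https://book.douban.com/tag/sciencefiction',
            'https://book.douban.com/tag/comics',
            'https://book.douban.com/tag/fantasy',
            'https://book.douban.com/tag/history',
            'https://book.douban.com/tag/economics']

    synchronous_times(urls)
    asynchronous_times(urls)

gevent: asynchronous theory and practice

The core of the gevent library is the greenlet, provided by a lightweight Python module written in C. At any moment only one greenlet is running. When a greenlet hits an I/O operation, such as a network access, gevent automatically switches to another greenlet and switches back at a suitable time once the I/O has completed. I/O is very slow, so a program spends much of its time waiting. Because gevent switches coroutines for us automatically, there is always a greenlet doing work instead of waiting on I/O.

Serial and asynchronous

The core of high concurrency is splitting one large task into a batch of subtasks that the system schedules efficiently, either synchronously or asynchronously. Switching between two subtasks is usually called a context switch.

Synchronous execution runs the subtasks serially. Asynchronous execution is a little like shadow clones: at any point in time there is still only one real body. The subtasks are not truly parallel. Instead they use the fragments of idle time so that the program does not waste time waiting. That is where the efficiency of asynchronous code comes from.

Context switching in gevent is done by yielding. In the following example there are two subtasks, and each uses the other's waiting time to do its own work. gevent.sleep(0) means "pause here for 0 seconds", which hands control to the other greenlet.

    import gevent

    def foo():
        print('Running in foo')
        gevent.sleep(0)
        print('Explicit context switch to foo again')

    def bar():
        print('Explicit context to bar')
        gevent.sleep(0)
        print('Implicit context switch back to bar')

    gevent.joinall([gevent.spawn(foo), gevent.spawn(bar)])

Running sequence:

    Running in foo
    Explicit context to bar
    Explicit context switch to foo again
    Implicit context switch back to bar

Ordering in synchronous and asynchronous code

Synchronous operations run serially (1, 2, 3, 4, 5, 6, ...), while the order in which asynchronous subtasks finish is arbitrary and depends on how long each subtask takes.

Code

    import random

    import gevent

    def task(pid):
        """Some non-deterministic task"""
        gevent.sleep(random.randint(0, 2) * 0.001)
        print('Task %s done' % pid)

    # Synchronous (tasks finish in serial order)
    def synchronous():
        for i in range(1, 10):
            task(i)

    # Asynchronous (tasks finish in a more random order)
    def asynchronous():
        threads = [gevent.spawn(task, i) for i in range(10)]
        gevent.joinall(threads)

    print('Synchronous:')
    synchronous()
    print('Asynchronous:')
    asynchronous()

Output

    Synchronous:
    Task 1 done
    Task 2 done
    Task 3 done
    Task 4 done
    Task 5 done
    Task 6 done
    Task 7 done
    Task 8 done
    Task 9 done
    Asynchronous:
    Task 1 done
    Task 5 done
    Task 6 done
    ...
    Task 9 done
    Task 0 done
    Task 3 done

The asynchronous order varies from run to run. In the synchronous case all tasks run one after another, so the main program is blocked while each task runs (blocking pauses the execution of the main program). gevent.spawn schedules the function passed to it as a greenlet, and gevent.joinall blocks the current program until all the given greenlets have finished.

Practice

This section shows how to use gevent in practice and how to extract the data obtained through asynchronous access. Type "你好" (hello) into the Youdao dictionary search box and press Enter. Watch the requests that are sent and note how the Youdao URL is constructed.

URL rule

    # The URL only needs the word to be filled in
    url = "http://dict.youdao.com/w/eng/{}/".format(word)

Parsing the page data

    import requests
    from pyquery import PyQuery as pq

    # Assumed placeholder: the original does not show how headers is defined
    headers = {'User-Agent': 'Mozilla/5.0'}

    def fetch_word_info(word):
        url = "http://dict.youdao.com/w/eng/{}/".format(word)
        resp = requests.get(url, headers=headers)
        doc = pq(resp.text)
        # Collect the phonetic symbols
        pros = ''
        for pro in doc.items('.baav .pronounce'):
            pros += pro.text()
        # Collect the definitions
        description = ''
        for li in doc.items('#phrsListTab .trans-container ul li'):
            description += li.text()
        return {'word': word, 'phonetic': pros, 'comment': description}

With requests, the next request can only start after the current one has finished, and the library cannot be made asynchronous through normal means, so the monkey patch is used here.

Synchronous code

    import time

    words = ['good', 'bad', 'cool', 'hot', 'nice', 'better',
             'head', 'up', 'down', 'right', 'left', 'east']

    def synchronous():
        start = time.time()
        print('synchronous started')
        for word in words:
            print(fetch_word_info(word))
        end = time.time()
        print("Synchronous running time: %s seconds" % str(end - start))

    # Run synchronously
    synchronous()

Asynchronous code

    import gevent.monkey
    gevent.monkey.patch_all()  # patch before importing requests

    import time

    import gevent
    import requests
    from pyquery import PyQuery as pq

    words = ['good', 'bad', 'cool', 'hot', 'nice', 'better',
             'head', 'up', 'down', 'right', 'left', 'east']

    def asynchronous():
        start = time.time()
        print('asynchronous started')
        events = [gevent.spawn(fetch_word_info, word) for word in words]
        wordinfos = gevent.joinall(events)
        for wordinfo in wordinfos:
            # get() returns the value returned by the greenlet's function
            print(wordinfo.get())
        end = time.time()
        print("Asynchronous running time: %s seconds" % str(end - start))

    # Run asynchronously
    asynchronous()

With asynchronous access the requests to the target site are sent almost at the same moment, and the crawl becomes much faster. Here we fetch information for 12 words, which means hitting the site 12 times in an instant, and that is not a problem. If we crawled 10,000+ words with gevent, all the requests would reach the site within a few seconds, and sending that many at once might get the crawler blocked.

The solution is to split the list into several sublists and crawl them in batches. For example, a list of the numbers 0-19 can be split evenly into 4 parts, so that each sublist holds 5 numbers. Below are the list-chunking solutions I found on Stack Overflow.

Method 1

    seq = list(range(20))
    size = 5  # sublist length
    output = [seq[i:i + size] for i in range(0, len(seq), size)]
    print(output)

Method 2

    chunks = lambda seq, size: [seq[i:i + size] for i in range(0, len(seq), size)]
    print(chunks(seq, 5))

Method 3

    def chunks(seq, size):
        for i in range(0, len(seq), size):
            yield seq[i:i + size]

    print(chunks(seq, 5))  # prints a generator object, not the sublists
    for x in chunks(seq, 5):
        print(x)

When the amount of data is small, any of the three methods will do. When it is very large, method 3 is recommended, because the generator produces one sublist at a time instead of building them all in memory.

Hands-on implementation

    import gevent.monkey
    gevent.monkey.patch_all()  # patch before importing requests

    import time

    import gevent
    import requests
    from pyquery import PyQuery as pq

    words = ['good', 'bad', 'cool', 'hot', 'nice', 'better',
             'head', 'up', 'down', 'right', 'left', 'east']

    # Assumed placeholder: the original does not show how headers is defined
    headers = {'User-Agent': 'Mozilla/5.0'}

    def fetch_word_info(word):
        url = "http://dict.youdao.com/w/eng/{}/".format(word)
        resp = requests.get(url, headers=headers)
        doc = pq(resp.text)
        pros = ''
        for pro in doc.items('.baav .pronounce'):
            pros += pro.text()
        description = ''
        for li in doc.items('#phrsListTab .trans-container ul li'):
            description += li.text()
        return {'word': word, 'phonetic': pros, 'comment': description}

    def asynchronous(words):
        start = time.time()
        print('asynchronous started')
        chunks = lambda seq, size: [seq[i:i + size] for i in range(0, len(seq), size)]
        # Crawl three words at a time, pausing one second between batches
        for subwords in chunks(words, 3):
            events = [gevent.spawn(fetch_word_info, word) for word in subwords]
            wordinfos = gevent.joinall(events)
            for wordinfo in wordinfos:
                # get() returns the value returned by the greenlet's function
                print(wordinfo.get())
            time.sleep(1)
        end = time.time()
        print("Asynchronous running time: %s seconds" % str(end - start))

    asynchronous(words)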
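The same batching idea can also be applied to the asyncio approach from the first section. The following is only a minimal sketch and not the article's own code: `fake_fetch` is a hypothetical stand-in that simulates network latency with `asyncio.sleep` instead of sending real requests, and the names `crawl_in_batches`, `batch_size` and `pause` are assumptions introduced for illustration.

```python
import asyncio
import time


def chunks(seq, size):
    """Yield successive sublists of at most `size` items."""
    for i in range(0, len(seq), size):
        yield seq[i:i + size]


async def fake_fetch(word, delay=0.05):
    """Hypothetical stand-in for a real request: wait, then return a dict."""
    await asyncio.sleep(delay)
    return {'word': word}


async def crawl_in_batches(words, batch_size=3, pause=0.1):
    """Run each batch concurrently and pause between batches."""
    results = []
    batches = list(chunks(words, batch_size))
    for index, batch in enumerate(batches):
        # gather() runs the whole batch concurrently and keeps input order
        results.extend(await asyncio.gather(*(fake_fetch(w) for w in batch)))
        if index < len(batches) - 1:
            # Give the site a short break before the next batch
            await asyncio.sleep(pause)
    return results


if __name__ == '__main__':
    words = ['good', 'bad', 'cool', 'hot', 'nice', 'better', 'head']
    start = time.time()
    infos = asyncio.run(crawl_in_batches(words))
    print(infos)
    print('Finished in %.2f seconds' % (time.time() - start))
```

In a real crawler `fake_fetch` would be replaced by an aiohttp request like the one shown earlier. The batch size and pause would need tuning for the target site.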
