Asyncpy Coroutine Crawler Framework

Time: 2023-03-26 14:37:35 Python

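Asyncpy is a lightweight, efficient crawler framework that I built on asyncio and aiohttp. It follows Scrapy's design pattern and borrows processing logic from several open-source frameworks on GitHub.

github: https://github.com/lixi5338619/asyncpy
pypi: https://pypi.org/project/asyncpy/

Installing asyncpy

Environment: Python version >= 3.6
Dependencies: ['lxml', 'parsel', 'docopt', 'aiohttp']

Install command:

    pip install asyncpy

If the installation fails with:

    ERROR: Could not find a version that satisfies the requirement asyncpy (from versions: none)
    ERROR: No matching distribution found for asyncpy

check your current Python version; it must be 3.6 or later. If the download still fails, go to https://pypi.org/project/asyncpy/ and download the latest whl file from the "Download files" page, then install it from cmd:

    pip install asyncpy-version-py3-none-any.whl

Creating a spider file

Run asyncpy --version on the command line to check that the installation succeeded. To create a demo file, run:

    asyncpy genspider demo

Global settings

The settings file supports the following options:

    CONCURRENT_REQUESTS    number of concurrent requests
    RETRIES                retry count
    DOWNLOAD_DELAY         download delay
    RETRY_DELAY            retry delay
    DOWNLOAD_TIMEOUT       timeout limit
    USER_AGENT             global user agent
    LOG_FILE               log file path
    LOG_LEVEL              log level
    PIPELINES              pipelines
    MIDDLEWARE             middleware

To enable the global settings file, pass it to the spider through settings_attr in the spider file.

custom_settings can be defined in the spider file, as in Scrapy. It does not conflict with settings_attr.

    class DemoSpider2(Spider):
        name = 'demo2'
        start_urls = []
        concurrency = 30  # number of concurrent requests
        custom_settings = {
            "RETRIES": 1,            # retry count
            "DOWNLOAD_DELAY": 0,     # download delay
            "RETRY_DELAY": 0,        # retry delay
            "DOWNLOAD_TIMEOUT": 10,  # timeout
            "LOG_FILE": "demo2.log"  # log file
        }

Generating log files

Add the following to the settings file:

    LOG_FILE = './asyncpy.log'
    LOG_LEVEL = 'DEBUG'

To write a separate log file for each of several spiders, remove the log configuration from settings and configure it in each spider's custom_settings instead.

Custom middleware

Add new functionality to the generated demo_middleware file. You can act selectively based on request.meta and the spider's attributes.

    from asyncpy.middleware import Middleware

    middleware = Middleware()

    @middleware.request
    async def UserAgentMiddleware(spider, request):
        if request.meta.get('valid'):
            print("Current spider name: %s" % spider.name)
            ua = "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/60.0.3100.0 Safari/537.36"
            request.headers.update({"User-Agent": ua})

    @middleware.request
    async def ProxyMiddleware(spider, request):
        if spider.name == 'demo':
            request.aiohttp_kwargs.update({"proxy": "http://123.45.67.89:0000"})

Method 1: enable the middleware in the settings file. (This is pending a version update; please use method 2 for now.)

    MIDDLEWARE = [
        'demo_middleware.middleware',
    ]

Method 2: pass the middleware to start():

    from middlewares import middleware
    DemoSpider.start(middleware=middleware)

Custom pipelines

If you define an item (only dict items are supported for now) and enable the pipeline in settings, you can write code in pipelines to connect to a database and insert the data.

In the spider file:

    item = {}
    item['response'] = response.text
    item['datetime'] = '2020-05-21 13:14:00'
    yield item

In pipelines.py:

    class SpiderPipeline():
        def __init__(self):
            pass

        def process_item(self, item, spider_name):
            pass

Method 1: enable the pipeline in settings. (This is pending a version update; please use method 2 for now.)

    PIPELINES = [
        'pipelines.SpiderPipeline',
    ]

Method 2: pass the pipeline to start():

    from pipelines import SpiderPipeline
    DemoSpider.start(pipelines=SpiderPipeline)

POST requests: overriding start_requests

To send a POST request directly, remove the entries from start_urls and override the start_requests method.

Parsing the response

Responses are parsed with parsel, the parsing library used by Scrapy. Parsing works the same way as in Scrapy and supports xpath, css selectors, and re. A simple example:

    xpath("//div[@id='demo']/text()").get()     # get the first match
    xpath("//div[@id='demo']/text()").getall()  # get all matches as a list

Starting a spider

Start a spider by calling .start() on its class in the spider file. For example, if the spider class is DemoSpider:

    DemoSpider.start()

Starting multiple spiders

This part is not fully developed yet. For now you can try running the spiders in separate processes:

    from Demo.demo import DemoSpider
    from Demo.demo2 import DemoSpider2
    import multiprocessing

    def open_DemoSpider2():
        DemoSpider2.start()

    def open_DemoSpider():
        DemoSpider.start()

    if __name__ == "__main__":
        p1 = multiprocessing.Process(target=open_DemoSpider)
        p2 = multiprocessing.Process(target=open_DemoSpider2)
        p1.start()
        p2.start()

Special thanks: Scrapy, Ruia, Looter, asyncio, aiohttp

For more details, see the demo. Link:
If you are interested in the Asyncpy documentation, please give it a star on GitHub. Thank you all!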
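
To make the meaning of CONCURRENT_REQUESTS, RETRIES, RETRY_DELAY, DOWNLOAD_DELAY and DOWNLOAD_TIMEOUT more concrete, here is a minimal sketch in plain asyncio of how such settings are commonly applied. It is not Asyncpy's actual implementation: the names fetch_with_retry and crawl, the SETTINGS values, and the fetch callable (any coroutine function that takes a URL) are assumptions made for illustration.

```python
import asyncio

# Hypothetical values; only the key names come from the settings list above.
SETTINGS = {
    "CONCURRENT_REQUESTS": 2,
    "RETRIES": 2,
    "RETRY_DELAY": 0,
    "DOWNLOAD_DELAY": 0,
    "DOWNLOAD_TIMEOUT": 1,
}


async def fetch_with_retry(fetch, url, settings, semaphore):
    """Call fetch(url) under a concurrency cap, retrying on failure."""
    attempts = settings["RETRIES"] + 1  # first try plus RETRIES retries
    for attempt in range(attempts):
        async with semaphore:  # at most CONCURRENT_REQUESTS run at once
            await asyncio.sleep(settings["DOWNLOAD_DELAY"])
            try:
                return await asyncio.wait_for(
                    fetch(url), timeout=settings["DOWNLOAD_TIMEOUT"]
                )
            except Exception:
                if attempt == attempts - 1:
                    raise  # retries exhausted, surface the last error
        # The semaphore is released here, so waiting does not block others.
        await asyncio.sleep(settings["RETRY_DELAY"])


async def crawl(fetch, urls, settings=SETTINGS):
    """Fetch all URLs and return the results in input order."""
    semaphore = asyncio.Semaphore(settings["CONCURRENT_REQUESTS"])
    tasks = [fetch_with_retry(fetch, u, settings, semaphore) for u in urls]
    return await asyncio.gather(*tasks)
```

With an aiohttp-based fetch coroutine this could be run as asyncio.run(crawl(fetch, urls)). The semaphore is released before the retry delay so that a failing URL does not hold a concurrency slot while it waits.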

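The pipeline section says that process_item can connect to a database and insert the item. As a rough sketch of that idea, the class below stores dict items in SQLite using only the standard library. The class name SQLitePipeline, the pages table and its columns, and the db_path parameter are hypothetical; only the process_item(self, item, spider_name) signature and the item keys come from the article. How Asyncpy instantiates and calls the pipeline is not shown in the article, so the constructor is given a default argument as a precaution.

```python
import sqlite3


class SQLitePipeline:
    """Hypothetical pipeline that stores dict items in a SQLite table."""

    def __init__(self, db_path=":memory:"):
        # ":memory:" keeps the sketch self-contained; use a file path in practice.
        self.conn = sqlite3.connect(db_path)
        self.conn.execute(
            "CREATE TABLE IF NOT EXISTS pages "
            "(spider TEXT, response TEXT, datetime TEXT)"
        )

    def process_item(self, item, spider_name):
        # Parameterized query: values are bound, not formatted into the SQL.
        self.conn.execute(
            "INSERT INTO pages (spider, response, datetime) VALUES (?, ?, ?)",
            (spider_name, item.get("response"), item.get("datetime")),
        )
        self.conn.commit()
        return item
```

Called directly, SQLitePipeline().process_item({"response": "...", "datetime": "2020-05-21 13:14:00"}, "demo") inserts one row into pages. Committing after every item is simple but slow for large crawls; batching the commits would be a natural refinement.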