
Scrapy Tips

Date: 2023-03-26 15:56:21 | Python

Overview

Scrapy is a web-crawling framework written in Python for crawling websites and extracting structured data from their pages. It has a wide range of uses: data mining, monitoring, and automated testing. Scrapy 1.1 introduced Python 3 support (first half of 2016); Scrapy 1.5 dropped support for Python 3.3 (second half of 2017).

Scrapy official site: https://scrapy.org/
Scrapy GitHub: https://github.com/scrapy/scrapy
Scrapy PyPI: https://pypi.org/project/Scrapy/
Scrapy official documentation: https://docs.scrapy.org/en/la...
Scrapy 1.5 Chinese documentation: http://www.scrapyd.cn/doc/

Hardcore knowledge points

Basic request and response classes

```python
# request
scrapy.http.request.Request

# responses: HtmlResponse inherits from TextResponse, which inherits from Response
scrapy.http.response.html.HtmlResponse
scrapy.http.response.text.TextResponse
scrapy.http.response.Response
```

Printing the spider's configuration (settings) inside a spider

```python
for k in self.settings:
    print(k, self.settings.get(k))
    # nested settings are themselves BaseSettings instances
    if isinstance(self.settings.get(k), scrapy.settings.BaseSettings):
        for kk in self.settings.get(k):
            print('\t', kk, self.settings.get(k).get(kk))
```

Number of queued requests (How to get the number of requests in queue in scrapy?)

```python
# see scrapy.core.scheduler.Scheduler
# in a spider
len(self.crawler.engine.slot.scheduler)
# in a pipeline
len(spider.crawler.engine.slot.scheduler)
```

Number of requests currently being downloaded

```python
# scrapy.core.engine.Slot.inprogress is a set
# in a spider
len(self.crawler.engine.slot.inprogress)
# in a pipeline
len(spider.crawler.engine.slot.inprogress)
```

Getting the pipeline object inside a spider (How to get the pipeline object in Scrapy spider)

```python
import pymongo  # pymongo < 3 API: Connection was later renamed MongoClient

# pipeline; `settings` is assumed imported, e.g. the legacy
# `from scrapy.conf import settings`
class MongoDBPipeline(object):
    def __init__(self, mongodb_db=None, mongodb_collection=None):
        self.connection = pymongo.Connection(settings['MONGODB_SERVER'],
                                             settings['MONGODB_PORT'])

    def get_date(self):
        pass

    def open_spider(self, spider):
        # hand the pipeline instance over to the spider
        spider.myPipeline = self

    def process_item(self, item, spider):
        pass


# spider
class MySpider(Spider):
    def __init__(self):
        self.myPipeline = None

    def start_requests(self):
        # items can now be stored directly through the pipeline
        self.myPipeline.process_item(item, self)

    def parse(self, response):
        self.myPipeline.get_date()
```

Multiple cookie sessions per spider

```python
# Scrapy supports keeping multiple cookie sessions per spider through the
# cookiejar Request meta key. By default it uses a single cookiejar
# (session), but you can pass an identifier to use several of them.
for i, url in enumerate(urls):
    yield scrapy.Request("http://www.example.com", meta={'cookiejar': i},
                         callback=self.parse_page)

# Note that the cookiejar meta key is not "sticky": you need to pass it
# along in subsequent requests.
def parse_page(self, response):
    # do some processing
    return scrapy.Request("http://www.example.com/otherpage",
                          meta={'cookiejar': response.meta['cookiejar']},
                          callback=self.parse_other_page)
```

Conditions for "Closing spider (finished)"

```python
# scrapy.core.engine.ExecutionEngine
def spider_is_idle(self, spider):
    if not self.scraper.slot.is_idle():
        # scraper is not idle
        return False
    if self.downloader.active:
        # downloader has pending requests
        return False
    if self.slot.start_requests is not None:
        # not all start requests have been processed
        return False
    if self.slot.scheduler.has_pending_requests():
        # scheduler has pending requests
        return False
    return True
```

```python
# printing those conditions inside a spider
self.logger.debug('engine.scraper.slot.is_idle: %s' % repr(self.crawler.engine.scraper.slot.is_idle()))
self.logger.debug('\tengine.scraper.slot.active: %s' % repr(self.crawler.engine.scraper.slot.active))
self.logger.debug('\tengine.scraper.slot.queue: %s' % repr(self.crawler.engine.scraper.slot.queue))
self.logger.debug('engine.downloader.active: %s' % repr(self.crawler.engine.downloader.active))
self.logger.debug('engine.slot.start_requests: %s' % repr(self.crawler.engine.slot.start_requests))
self.logger.debug('engine.slot.scheduler.has_pending_requests: %s' % repr(self.crawler.engine.slot.scheduler.has_pending_requests()))
```

Reacting to the spider_idle signal by adding requests (Scrapy: How to manually insert a request from a spider_idle event callback?)

```python
class FooSpider(BaseSpider):
    yet = False

    @classmethod
    def from_crawler(cls, crawler, *args, **kwargs):
        from_crawler = super(FooSpider, cls).from_crawler
        spider = from_crawler(crawler, *args, **kwargs)
        crawler.signals.connect(spider.idle, signal=scrapy.signals.spider_idle)
        return spider

    def idle(self):
        if not self.yet:
            # create_request() is a user-defined helper that builds the Request
            self.crawler.engine.crawl(self.create_request(), self)
            self.yet = True
```
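Tying the engine-introspection snippets and the spider_idle hook together, here is a minimal runnable sketch (not from the original article): the spider name and URLs are placeholders, and it relies on the same private Scrapy 1.x internals (crawler.engine.slot) referenced above, which changed in later Scrapy releases.

```python
import scrapy
from scrapy import signals
from scrapy.exceptions import DontCloseSpider


class IntrospectSpider(scrapy.Spider):
    """Sketch only: assumes Scrapy 1.x private internals (crawler.engine.slot);
    the spider name and URLs are placeholders."""
    name = 'introspect'
    start_urls = ['http://www.example.com']
    extra_done = False  # inject the extra request only once

    @classmethod
    def from_crawler(cls, crawler, *args, **kwargs):
        spider = super(IntrospectSpider, cls).from_crawler(crawler, *args, **kwargs)
        crawler.signals.connect(spider.on_idle, signal=signals.spider_idle)
        return spider

    def parse(self, response):
        slot = self.crawler.engine.slot
        # requests still waiting in the scheduler queue
        self.logger.info('queued: %d', len(slot.scheduler))
        # requests currently being downloaded
        self.logger.info('in flight: %d', len(slot.inprogress))

    def on_idle(self):
        if not self.extra_done:
            self.extra_done = True
            self.crawler.engine.crawl(
                scrapy.Request('http://www.example.com/extra',
                               callback=self.parse),
                self)
            # keep the spider open while the new request is scheduled
            raise DontCloseSpider
```

Raising DontCloseSpider from the idle handler is the documented way to stop the engine from closing the spider while the freshly injected request is still making its way into the scheduler.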
Notes on selected settings

HTTPERROR_ALLOW_ALL (default: False) decides where non-200 responses end up (a runnable sketch of this behavior is appended at the end of this article):

| Situation | HTTPERROR_ALLOW_ALL = True | HTTPERROR_ALLOW_ALL = False |
| --- | --- | --- |
| non-200 response | callback | errback |
| timeout | errback | errback |

Architecture diagrams

[Figure: Scrapy 1.1 architecture diagram]

[Figure: the latest Scrapy architecture diagram]

As walker sees it, the new diagram is merely a refinement of the old one; there is no essential difference.

This article comes from walker snapshot.
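Referring back to the HTTPERROR_ALLOW_ALL table above, the sketch below (spider name and test URL are placeholders) shows where responses land. With the default of False, HttpErrorMiddleware filters non-200 responses and the request's errback receives an HttpError; with True, every response reaches the callback. Timeouts go to the errback either way.

```python
import scrapy
from scrapy.spidermiddlewares.httperror import HttpError
from twisted.internet.error import TimeoutError


class StatusSpider(scrapy.Spider):
    """Sketch of HTTPERROR_ALLOW_ALL behavior; name and URL are placeholders."""
    name = 'status'
    # With the default HTTPERROR_ALLOW_ALL = False, non-200 responses are
    # filtered by HttpErrorMiddleware and routed to the errback.
    # Uncomment to deliver every response to the callback instead:
    # custom_settings = {'HTTPERROR_ALLOW_ALL': True}

    def start_requests(self):
        yield scrapy.Request('http://httpbin.org/status/404',
                             callback=self.parse,
                             errback=self.on_error)

    def parse(self, response):
        # reached for 200s, and for non-200s only when they are allowed through
        self.logger.info('callback got status %d', response.status)

    def on_error(self, failure):
        if failure.check(HttpError):
            # non-200 response with HTTPERROR_ALLOW_ALL = False
            self.logger.info('errback: HTTP %d', failure.value.response.status)
        elif failure.check(TimeoutError):
            # timeouts always land here, regardless of the setting
            self.logger.info('errback: timeout')
```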