Introducing feapder, a crawler framework that can replace Scrapy

Date: 2023-03-13 08:51:56  科技观察 (Tech Observation)

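1. Preface

Hello everyone, I'm An Guo (安国)! As is well known, Scrapy is the most popular crawler framework in Python and is mainly used to crawl structured data from websites. Today I'd like to recommend a simpler, lightweight, yet powerful crawler framework: feapder.

Project address: https://github.com/Boris-code/feapder

2. Introduction and installation

feapder is similar to Scrapy. It supports lightweight crawlers, distributed crawlers, batch crawlers, a crawler alerting mechanism, and more. The built-in crawlers are:

- AirSpider: a lightweight crawler, suited to simple scenarios with small amounts of data.
- Spider: a distributed crawler based on Redis, suited to large volumes of data; it supports resuming an interrupted crawl, automatic data storage, and other features.
- BatchSpider: a distributed batch crawler, mainly for crawls that have to be repeated periodically.

Before getting hands-on, install the dependency in a virtual environment:

    # install the dependency
    pip3 install feapder

3. Hands-on

Below we use the simplest crawler, AirSpider, to crawl some simple data.

Target website: aHR0cHM6Ly90b3BodWIudG9kYXkvIA==

The detailed implementation takes 5 steps.

3-1 Create the crawler project

First, use the "feapder create -p" command to create a crawler project:

    # create the crawler project
    feapder create -p tophub_demo

3-2 Create the AirSpider crawler

From the command line, enter the spiders folder and use the "feapder create -s" command to create a crawler:

    cd spiders

    # create a lightweight crawler
    feapder create -s tophub_spider 1

The trailing number selects the crawler type:

- 1 (the default) creates the lightweight crawler AirSpider
- 2 creates the distributed crawler Spider
- 3 creates the distributed batch crawler BatchSpider

3-3 Configure the database, create the data table, create the mapping Item

Taking MySQL as an example, first create a data table in the database:

    # create the data table
    create table topic
    (
        id         int auto_increment primary key,
        title      varchar(100)  null comment 'article title',
        auth       varchar(20)   null comment 'author',
        like_count int default 0 null comment 'like count',
        collection int default 0 null comment 'favorite count',
        comment    int default 0 null comment 'comment count'
    );

Then open settings.py in the project root directory and configure the database connection:

    # settings.py
    MYSQL_IP = "localhost"
    MYSQL_PORT = 3306
    MYSQL_DB = "xag"
    MYSQL_USER_NAME = "root"
    MYSQL_USER_PASS = "root"

Finally, create a mapping Item (optional): enter the items folder and use the "feapder create -i" command to create a file that maps to the database table.

PS: AirSpider does not support automatic data storage, so this step is not required.

3-4 Write the crawler and parse the data

Step 1: initialize the database with "MysqlDB".

    import re  # used in step 4 to extract the counts

    import feapder
    from feapder.db.mysqldb import MysqlDB

    class TophubSpider(feapder.AirSpider):

        def __init__(self, *args, **kwargs):
            super().__init__(*args, **kwargs)
            self.db = MysqlDB()

Step 2: in the start_requests method, specify the main URL to crawl, and use the "download_midware" keyword to configure a random User-Agent.

    from fake_useragent import UserAgent

    # methods of TophubSpider
    def start_requests(self):
        yield feapder.Request("https://tophub.today/", download_midware=self.download_midware)

    def download_midware(self, request):
        # random UA
        # Dependency: pip3 install fake_useragent
        ua = UserAgent().random
        request.headers = {'User-Agent': ua}
        return request

Step 3: crawl the titles and links on the home page, parsing the data with feapder's built-in xpath method.

    def parse(self, request, response):
        # print(response.text)
        card_elements = response.xpath('//div[@class="cc-cd"]')

        # Pick out the card element for "什么值得买"
        buy_good_element = [card_element for card_element in card_elements if
                            card_element.xpath('.//div[@class="cc-cd-is"]//span/text()').extract_first() == '什么值得买'][0]

        # Get the article titles and links inside the card
        a_elements = buy_good_element.xpath('.//div[@class="cc-cd-cb nano"]//a')

        for a_element in a_elements:
            # title and link
            title = a_element.xpath('.//span[@class="t"]/text()').extract_first()
            href = a_element.xpath('.//@href').extract_first()

            # Dispatch a new task, carrying the article title along
            yield feapder.Request(href, download_midware=self.download_midware,
                                  callback=self.parser_detail_page, title=title)

Step 4: crawl the detail page data. The previous step dispatched a new task and named the callback through the "callback" keyword, so the detail page is parsed in parser_detail_page.

    def parser_detail_page(self, request, response):
        """
        Parse the article detail data
        :param request:
        :param response:
        :return:
        """
        title = request.title
        url = request.url

        # Parse the detail page: like count, favorite count, comment count, and author name
        author = response.xpath('//a[@class="author-title"]/text()').extract_first().strip()

        print("author:", author, 'title:', title, "url:", url)

        desc_elements = response.xpath('//span[@class="xilie"]/span')

        print("desc number:", len(desc_elements))

        # like count (raw string avoids an invalid-escape warning for \d)
        like_count = int(re.findall(r'\d+', desc_elements[1].xpath('./text()').extract_first())[0])
        # favorite count
        collection_count = int(re.findall(r'\d+', desc_elements[2].xpath('./text()').extract_first())[0])
        # comment count
        comment_count = int(re.findall(r'\d+', desc_elements[3].xpath('./text()').extract_first())[0])

        print("like:", like_count, "favorite:", collection_count, "comment:", comment_count)

3-5 Data storage

Use the database object instantiated above to execute SQL and insert the data.

    # Insert into the database
    # (the text of the SQL statement is not preserved in this copy; only its arguments survive)
    sql = "…" % (title, author, like_count, collection_count, comment_count)

    # Execute
    self.db.execute(sql)

4. Conclusion

This article used a simple example to walk through AirSpider, the simplest crawler in feapder.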
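
For the storage step in 3-5, the following is a minimal sketch of a parameterized insert, not the article's own code. It assumes the topic table columns from 3-3 and swaps in Python's standard-library sqlite3 for feapder's MysqlDB so that it runs on its own. The helper names first_int, save_topic, and make_demo_db are hypothetical. Passing the values as query parameters, instead of formatting them into the SQL string with %, keeps a title containing a quote character from breaking the statement.

```python
import re
import sqlite3


def first_int(text, default=0):
    """Hypothetical helper: return the first integer found in text, else default."""
    match = re.search(r"\d+", text or "")
    return int(match.group()) if match else default


def make_demo_db():
    """Create an in-memory SQLite table mirroring the topic table from 3-3 (assumed columns)."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE topic ("
        "id INTEGER PRIMARY KEY AUTOINCREMENT, "
        "title TEXT, "
        "auth TEXT, "
        "like_count INTEGER DEFAULT 0, "
        "collection INTEGER DEFAULT 0, "
        "comment INTEGER DEFAULT 0)"
    )
    return conn


def save_topic(conn, title, author, like_count, collection_count, comment_count):
    """Hypothetical helper: insert one article row using query parameters."""
    conn.execute(
        "INSERT INTO topic (title, auth, like_count, collection, comment) "
        "VALUES (?, ?, ?, ?, ?)",
        (title, author, like_count, collection_count, comment_count),
    )
    conn.commit()


if __name__ == "__main__":
    conn = make_demo_db()
    # The count strings stand in for text scraped from a detail page.
    save_topic(conn, "It's a demo", "xag",
               first_int("like 12"), first_int("favorite 3"), first_int("comment"))
    print(conn.execute(
        "SELECT title, auth, like_count, collection, comment FROM topic").fetchall())
```

Unlike the indexing with [0] in step 4, first_int falls back to a default when a page element contains no digits, so one malformed detail page does not stop the crawl with an IndexError.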
