Introducing feapder, a crawler framework that can replace Scrapy

Posted: 2023-03-26 12:38:12, category: Python

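1. Preface

Hello everyone, I'm 安国!

As most of you know, Scrapy is the most popular crawler framework for Python. It is mainly used to scrape structured data from websites.

Today I'd like to recommend a simpler, lightweight, yet powerful crawler framework: feapder

Project address: https://github.com/Boris-code...

2. Introduction and installation

feapder is similar to Scrapy. It supports lightweight crawlers, distributed crawlers, batch crawlers, and a crawler alerting mechanism.

It ships with three built-in crawlers:

AirSpider: a lightweight crawler, suited to simple scenarios and small data volumes
Spider: a distributed crawler based on Redis, suited to very large data volumes, with support for resuming interrupted crawls and automatic data storage
BatchSpider: a distributed batch crawler, mainly used for periodic collection

Before the hands-on part, install the dependency in a virtual environment.

    # Install the dependency
    pip3 install feapder

3. Hands-on

We will use the simplest crawler, AirSpider, to scrape some simple data.

Target site (Base64-encoded): aHR0cHM6Ly90b3BodWIudG9kYXkvIA==

The implementation takes 5 steps.

3-1 Create the crawler project

First, use the "feapder create -p" command to create a crawler project.

    # Create the crawler project
    feapder create -p tophub_demo

3-2 Create the AirSpider crawler

From the command line, enter the spiders folder and use the "feapder create -s" command to create a crawler.

    cd spiders

    # Create a lightweight crawler
    feapder create -s tophub_spider 1

The trailing number selects the crawler type:

1 is the default and creates a lightweight crawler, AirSpider
2 creates a distributed crawler, Spider
3 creates a distributed batch crawler, BatchSpider

3-3 Configure the database, create the table, create the mapping Item

Taking MySQL as an example, first create a table in the database.

    # Create the table
    create table topic
    (
        id         int auto_increment primary key,
        title      varchar(100)  null comment 'article title',
        auth       varchar(20)   null comment 'author',
        like_count int default 0 null comment 'likes',
        collection int default 0 null comment 'favorites',
        comment    int default 0 null comment 'comments'
    );

Then open settings.py in the project root and fill in the database connection details.

    # settings.py
    MYSQL_IP = "localhost"
    MYSQL_PORT = 3306
    MYSQL_DB = "xag"
    MYSQL_USER_NAME = "root"
    MYSQL_USER_PASS = "root"

Finally, create the mapping Item (optional). Enter the items folder and use the "feapder create -i" command to create a file that maps to the database table.

PS: AirSpider does not support automatic data storage, so this step is not required here.

3-4 Write the crawler and parse the data

Step 1: initialize the database with "MysqlDB".

    import feapder
    from feapder.db.mysqldb import MysqlDB

    class TophubSpider(feapder.AirSpider):

        def __init__(self, *args, **kwargs):
            super().__init__(*args, **kwargs)
            self.db = MysqlDB()

Step 2: in the start_requests method, specify the main URL to crawl, and use the "download_midware" keyword to configure a random User-Agent.

    import feapder
    from fake_useragent import UserAgent

    def start_requests(self):
        yield feapder.Request("https://tophub.today/", download_midware=self.download_midware)

    def download_midware(self, request):
        # Random User-Agent
        # Dependency: pip3 install fake_useragent
        ua = UserAgent().random
        request.headers = {'User-Agent': ua}
        return request

Step 3: scrape the titles and links on the home page, parsing the data with feapder's built-in xpath method.

    def parse(self, request, response):
        # print(response.text)
        card_elements = response.xpath('//div[@class="cc-cd"]')

        # Pick out the card whose name is 什么值得买 (What's Worth Buying)
        buy_good_element = [card_element for card_element in card_elements if
                            card_element.xpath('.//div[@class="cc-cd-is"]//span/text()').extract_first() == '什么值得买'][0]

        # Get the article titles and links inside the card
        a_elements = buy_good_element.xpath('.//div[@class="cc-cd-cb nano"]//a')

        for a_element in a_elements:
            # Title and link
            title = a_element.xpath('.//span[@class="t"]/text()').extract_first()
            href = a_element.xpath('.//@href').extract_first()

            # Issue a new task, carrying the article title along
            yield feapder.Request(href, download_midware=self.download_midware, callback=self.parser_detail_page,
                                  title=title)

Step 4: scrape the detail page data.

The previous step issued a new task and used the "callback" keyword to specify the callback function. The detail page is then parsed in parser_detail_page.

    import re

    def parser_detail_page(self, request, response):
        """
        Parse the article detail page
        :param request:
        :param response:
        :return:
        """
        title = request.title
        url = request.url

        # Parse the detail page for the author name and the like, favorite, and comment counts
        author = response.xpath('//a[@class="author-title"]/text()').extract_first().strip()

        print("Author:", author, 'Title:', title, "URL:", url)

        desc_elements = response.xpath('//span[@class="xilie"]/span')

        print("desc number:", len(desc_elements))

        # Likes
        like_count = int(re.findall(r'\d+', desc_elements[1].xpath('./text()').extract_first())[0])
        # Favorites
        collection_count = int(re.findall(r'\d+', desc_elements[2].xpath('./text()').extract_first())[0])
        # Comments
        comment_count = int(re.findall(r'\d+', desc_elements[3].xpath('./text()').extract_first())[0])

        print("Likes:", like_count, "Favorites:", collection_count, "Comments:", comment_count)

3-5 Store the data

Use the database object instantiated above to execute SQL and insert the data into the database.

    # Insert into the database
    # Note: the values are interpolated directly, so a title containing a single quote will break the statement
    sql = "INSERT INTO topic(title,auth,like_count,collection,comment) values('%s','%s','%d','%d','%d')" % (
        title, author, like_count, collection_count, comment_count)

    # Execute
    self.db.execute(sql)

4. Conclusion

Using a simple example, this article covered AirSpider, the simplest crawler in feapder.

The advanced features of feapder will be explained in detail through a series of examples in later articles.

I have uploaded all the code from this article to the official account backend. Reply with the keyword "airspider" there to get the full source.

If you found this article useful, please like, share, and leave a comment. That is my biggest motivation to keep producing quality articles!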
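A supplementary note on step 4 of section 3-4: the count extraction there indexes re.findall(...)[0] directly, which raises an error when a span is missing or contains no digits. The following is only a minimal sketch of a more defensive helper, assuming the count text looks like a label followed by a number. The name extract_count and its default parameter are hypothetical and are not part of feapder.

```python
import re
from typing import Optional


def extract_count(text: Optional[str], default: int = 0) -> int:
    """Return the first run of digits in text as an int, or default if there is none."""
    if not text:
        # extract_first() returns None when the xpath matches nothing
        return default
    match = re.search(r"\d+", text)
    return int(match.group()) if match else default
```

Inside parser_detail_page it could be used as like_count = extract_count(desc_elements[1].xpath('./text()').extract_first()), so a page with an unexpected layout yields 0 instead of stopping the crawl.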
