The Pyspider Framework: A Hands-On Python Crawler for V2EX Posts

Posted: 2023-03-13 08:39:50, 科技观察 (Tech Watch)

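Background:

PySpider is a powerful web crawler system written by a Chinese developer. It is written in Python, has a distributed architecture, and supports several database backends. Its WebUI includes a script editor, task monitor, project manager, and result viewer.

Online demo: http://demo.pyspider.org/
Official documentation: http://docs.pyspider.org/en/l...
GitHub: https://github.com/binux/pysp...
Crawler code for this article on GitHub: https://github.com/zhisheng17...

More articles are available on the WeChat public account 猿博; feel free to follow it. Enough introduction, on to the main text.

Prerequisites:

You have already installed Pyspider and MySQL-python (used to save the data). If you have not, please read my earlier articles first so you can avoid some detours: "Some pitfalls I hit while learning the Pyspider framework" and "HTTP 599: SSL certificate problem: unable to get local issuer certificate".

The goal is to crawl the content of the V2EX site and save the crawled data locally.

Most posts on V2EX can be read without logging in, although some do require a login. (My crawl kept reporting errors, and only after checking the cause did I realize those posts were login-only.) So I decided cookies were not necessary. If you do need to log in, the simplest approach is to add the cookie you receive after logging in.

Browsing https://www.v2ex.com/ shows that no single list contains every post. The fallback is to walk through all the tab list pages and then the node list pages under each tab: first https://www.v2ex.com/?tab=tech, then https://www.v2ex.com/go/progr... Finally, each post has its own detail address, for example: https://www.v2ex.com/t/314683...

Create the project:

Click the "Create" button in the lower-right corner of the pyspider dashboard, then replace the URL passed to self.crawl in the on_start function:

    @every(minutes=24 * 60)
    def on_start(self):
        self.crawl('https://www.v2ex.com/', callback=self.index_page, validate_cert=False)

self.crawl tells pyspider to fetch the given page and parse the result with the callback function.
The @every decorator means on_start runs once a day, so the newest posts get picked up.
validate_cert=False is required here; without it the request fails with "HTTP 599: SSL certificate problem: unable to get local issuer certificate".

Home page:

Click the green run button. A red 1 appears on follows; switch to the follows panel and click the green play button. The problem shown in the second screenshot appeared when I first started. The fix is in my earlier article, and after applying it the problem did not come back.

Tab list page:

On the tab list page we need to extract the URLs of all topic list pages. You may have noticed that the sample handler extracts far too many URLs.

Code:

    @config(age=10 * 24 * 60 * 60)
    def index_page(self, response):
        for each in response.doc('a[href^="https://www.v2ex.com/?tab="]').items():
            self.crawl(each.attr.href, callback=self.tab_page, validate_cert=False)

Because the post list page and the tab list page are structured differently, a new callback, self.tab_page, is used here.
@config(age=10 * 24 * 60 * 60) means we treat a page as valid for 10 days, so it is not re-crawled within that period.

Go list page:

Code:

    @config(age=10 * 24 * 60 * 60)
    def tab_page(self, response):
        for each in response.doc('a[href^="https://www.v2ex.com/go/"]').items():
            self.crawl(each.attr.href, callback=self.board_page, validate_cert=False)

Post detail links (/t/):

The results include some reply links (URLs ending in a #reply anchor). We do not need those, so we strip the anchor. We also want the crawler to follow pagination automatically.

Code:

    @config(age=10 * 24 * 60 * 60)
    def board_page(self, response):
        for each in response.doc('a[href^="https://www.v2ex.com/t/"]').items():
            url = each.attr.href
            # drop the "#reply..." anchor so every post maps to one URL
            if url.find('#reply') > 0:
                url = url[0:url.find('#')]
            self.crawl(url, callback=self.detail_page, validate_cert=False)
        # follow the pagination links to turn pages automatically
        for each in response.doc('a.page_normal').items():
            self.crawl(each.attr.href, callback=self.board_page, validate_cert=False)

[Screenshot: run after removing the reply links]
[Screenshot: run after enabling automatic paging]

At this point we can match the URL of every post. Clicking the button next to each post shows its details.

Code:

    @config(priority=2)
    def detail_page(self, response):
        title = response.doc('h1').text()
        # escape double quotes because the SQL below wraps values in double quotes
        content = response.doc('div.topic_content').html().replace('"', '\\"')
        self.add_question(title, content)  # insert into the database
        return {
            "url": response.url,
            "title": title,
            "content": content,
        }

To insert into the database, we first need to define an add_question function.

    # connect to the database
    def __init__(self):
        self.db = MySQLdb.connect('localhost', 'root', 'root', 'wenda', charset='utf8')

    def add_question(self, title, content):
        try:
            cursor = self.db.cursor()
            # SQL statement that inserts one row
            sql = ('insert into question(title, content, user_id, created_date, comment_count) '
                   'values ("%s", "%s", %d, %s, 0)') % (title, content, random.randint(1, 10), 'now()')
            print(sql)
            cursor.execute(sql)
            print(cursor.lastrowid)
            self.db.commit()
        except Exception as e:
            print(e)
            self.db.rollback()

Check the crawler's results:

Debug first, then switch the project to running. Setting the crawl rate in pyspider has a bug on Windows. In any case, do not crawl too fast, or the site will easily spot the crawler and ban your IP.

Look at the active tasks and the crawled results, then query the table in a local database GUI tool: the data has been saved locally. You can export or import it as needed.

I linked the crawler code at the beginning. If you look through the project in detail, you will also find the crawled data I uploaded. (For learning only, not for commercial use!) The repository contains other crawlers as well. If you find it useful, give it a Star, or fork the project and learn along with me. The project will be updated over the long term.

Finally, the complete code:

    #!/usr/bin/env python
    # -*- encoding: utf-8 -*-
    # created by 10412
    # Created on 2016-10-20 20:43:00
    # Project: V2EX

    from pyspider.libs.base_handler import *

    import re
    import random
    import MySQLdb


    class Handler(BaseHandler):
        crawl_config = {
        }

        def __init__(self):
            self.db = MySQLdb.connect('localhost', 'root', 'root', 'wenda', charset='utf8')

        def add_question(self, title, content):
            try:
                cursor = self.db.cursor()
                sql = ('insert into question(title, content, user_id, created_date, comment_count) '
                       'values ("%s", "%s", %d, %s, 0)') % (title, content, random.randint(1, 10), 'now()')
                print(sql)
                cursor.execute(sql)
                print(cursor.lastrowid)
                self.db.commit()
            except Exception as e:
                print(e)
                self.db.rollback()

        @every(minutes=24 * 60)
        def on_start(self):
            self.crawl('https://www.v2ex.com/', callback=self.index_page, validate_cert=False)

        @config(age=10 * 24 * 60 * 60)
        def index_page(self, response):
            for each in response.doc('a[href^="https://www.v2ex.com/?tab="]').items():
                self.crawl(each.attr.href, callback=self.tab_page, validate_cert=False)

        @config(age=10 * 24 * 60 * 60)
        def tab_page(self, response):
            for each in response.doc('a[href^="https://www.v2ex.com/go/"]').items():
                self.crawl(each.attr.href, callback=self.board_page, validate_cert=False)

        @config(age=10 * 24 * 60 * 60)
        def board_page(self, response):
            for each in response.doc('a[href^="https://www.v2ex.com/t/"]').items():
                url = each.attr.href
                if url.find('#reply') > 0:
                    url = url[0:url.find('#')]
                self.crawl(url, callback=self.detail_page, validate_cert=False)
            for each in response.doc('a.page_normal').items():
                self.crawl(each.attr.href, callback=self.board_page, validate_cert=False)

        @config(priority=2)
        def detail_page(self, response):
            title = response.doc('h1').text()
            content = response.doc('div.topic_content').html().replace('"', '\\"')
            self.add_question(title, content)  # insert into the database
            return {
                "url": response.url,
                "title": title,
                "content": content,
            }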
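Two steps in the handler above are done by hand: cutting the "#reply" anchor off post URLs, and escaping double quotes before formatting values into the SQL string. A minimal sketch of an alternative follows, assuming Python 3. It is not the code used in this article. The helper name strip_fragment, the example URL, and the SQLite in-memory table are all hypothetical stand-ins so the snippet runs without a MySQL server. Note that urldefrag removes any fragment, not only "#reply" ones.

```python
import sqlite3
from urllib.parse import urldefrag


def strip_fragment(url):
    """Return the URL without its '#...' fragment (for example '#reply2')."""
    return urldefrag(url)[0]


def add_question(conn, title, content, user_id):
    """Insert one post using placeholders instead of string formatting."""
    cursor = conn.cursor()
    # The driver binds the values itself, so quotes in the title or content
    # need no manual escaping. sqlite3 uses '?' placeholders; MySQLdb uses
    # '%s' in the same positions.
    cursor.execute(
        "INSERT INTO question (title, content, user_id, created_date, comment_count) "
        "VALUES (?, ?, ?, datetime('now'), 0)",
        (title, content, user_id),
    )
    conn.commit()
    return cursor.lastrowid


# Stand-in database: an in-memory SQLite table shaped like the `question` table.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE question ("
    "id INTEGER PRIMARY KEY, title TEXT, content TEXT, "
    "user_id INTEGER, created_date TEXT, comment_count INTEGER)"
)

# Hypothetical post URL with a reply anchor.
clean_url = strip_fragment("https://www.v2ex.com/t/1#reply2")
row_id = add_question(conn, 'A "quoted" title', "<p>it's fine</p>", 3)
stored = conn.execute(
    "SELECT title, content FROM question WHERE id = ?", (row_id,)
).fetchone()
print(clean_url)
print(stored)
```

With placeholders, the .replace('"', '\\"') call in detail_page would no longer be needed, because the stored content comes back exactly as it was passed in.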
