A Qualified Data Analyst Shares a Few Things About Python Web Crawlers (Scrapy Automatic Crawler)

Time: 2023-03-12 12:32:53 Tech Watch

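Continued from the previous article, "A Qualified Data Analyst Shares a Few Things About Python Web Crawlers (Comprehensive Practical Cases)".

5. Comprehensive practical cases

III. Crawling with the Scrapy framework

(1) Getting to know Scrapy

Scrapy uses the Twisted asynchronous networking library to handle network communication. The overall architecture is roughly as follows (note: the image is from the internet):

[Figure: Scrapy architecture diagram]

For how to use Scrapy, refer to its official documentation.

(2) Scrapy automatic crawler in practice

Earlier we crawled data by constructing URLs in a loop. There is another way to do it: set an initial URL, collect the new links found on the current page, and keep crawling along those links until the crawled pages contain no new links.

(a) Requirement

Use an automatic crawler to crawl article links and content from Qiushibaike, and store the beginning of each article's content together with its link in a MySQL database.

(b) Analysis

A. How do we extract article links from the home page?

Open the home page, view the page source, and search for the text of any article. You will find a link such as "/article/118123230". Following it shows that this is the article page we want, so the automatic crawler's link rule should match links that contain "article".

B. How do we extract the article content and link from the detail page?

* Article content: inspect the article content in the page source. The expression is:

    "//div[@class='content']/text()"

* Link: open any detail page, copy its URL, view the page source, and search for that URL. It appears in the page's canonical link element, so the following XPath expression extracts the article link:

    "//link[@rel='canonical']/@href"

(3) Project source code

A. Create the crawler project

Open CMD, change to the directory that will hold the crawler project, and enter:

    scrapy startproject qsbkauto

B. Project structure

    spiders/qsbkspd.py: the spider file
    items.py: the project's item definitions, i.e. containers for the content to extract (for example a product title or number of reviews)
    pipelines.py: the project pipelines, used mainly for post-processing the data, such as writing it to Excel or a database
    settings.py: the project settings, for example pipelines are disabled by default and the robots protocol is obeyed
    scrapy.cfg: the project configuration

C. Create a spider

Go into the project you just created and enter the following (the last argument is the domain):

    scrapy genspider -t crawl qsbkspd qiushibaike.com

D. Define the items (items.py)

    import scrapy


    class QsbkautoItem(scrapy.Item):
        # Field names must match the keys the spider assigns to.
        link = scrapy.Field()     # article link
        content = scrapy.Field()  # article content

E. Write the spider (spiders/qsbkspd.py)

    # -*- coding: utf-8 -*-
    import scrapy
    from scrapy.linkextractors import LinkExtractor
    from scrapy.spiders import CrawlSpider, Rule
    from scrapy.http import Request

    from qsbkauto.items import QsbkautoItem


    class QsbkspdSpider(CrawlSpider):
        name = 'qsbkspd'
        allowed_domains = ['qiushibaike.com']
        # start_urls = ['http://qiushibaike.com/']

        def start_requests(self):
            # Send a browser User-Agent with the first request.
            i_headers = {
                "User-Agent": "Mozilla/5.0 (Windows NT 6.1; WOW64) "
                              "AppleWebKit/537.36 (KHTML, like Gecko) "
                              "Chrome/49.0.2623.22 Safari/537.36 SE 2.X MetaSr 1.0"
            }
            yield Request('http://www.qiushibaike.com/', headers=i_headers)

        # Follow every link whose URL contains "article/" and parse it.
        rules = (
            Rule(LinkExtractor(allow=r'article/'), callback='parse_item', follow=True),
        )

        def parse_item(self, response):
            # i = {}
            # i['domain_id'] = response.xpath('//input[@id="sid"]/@value').extract()
            # i['name'] = response.xpath('//div[@id="name"]').extract()
            # i['description'] = response.xpath('//div[@id="description"]').extract()
            i = QsbkautoItem()
            i["content"] = response.xpath("//div[@class='content']/text()").extract()
            i["link"] = response.xpath("//link[@rel='canonical']/@href").extract()
            return i

pipelines.py

    import time

    import MySQLdb


    class QsbkautoPipeline(object):

        def exeSQL(self, sql):
            '''
            Purpose: connect to the MySQL database and execute an SQL statement.
            @sql: the SQL statement to execute
            '''
            con = MySQLdb.connect(
                host='localhost',  # host
                user='root',       # user name
                passwd='xxxx',     # password
                db='spdRet',       # database name
                charset='utf8',
                local_infile=1
            )
            con.query(sql)
            con.commit()
            con.close()

        def process_item(self, item, spider):
            link_url = item['link'][0]
            # Keep only the first 10 characters of the content.
            content_header = item['content'][0][0:10]
            curr_date = time.strftime('%Y-%m-%d', time.localtime(time.time()))
            content_header = curr_date + '__' + content_header
            if len(link_url) and len(content_header):  # skip empty values
                try:
                    # The statement is built by string concatenation, so it
                    # fails if the text contains a single quote.
                    sql = ("insert into qiushi(content,link) values('"
                           + content_header + "','" + link_url + "')")
                    self.exeSQL(sql)
                except Exception as er:
                    print("Insert failed with the following error:")
                    print(er)
            else:
                pass
            return item

settings.py

    Turn off ROBOTSTXT_OBEY
    Set USER_AGENT
    Enable ITEM_PIPELINES

F. Run the crawler (the argument is the spider's name attribute)

    scrapy crawl qsbkspd --nolog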
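The pipeline's process_item builds its INSERT statement by concatenating strings, which breaks as soon as the scraped text contains a quote character. A minimal sketch of the parameterized alternative follows. It is not the author's code: it uses the standard-library sqlite3 module as a stand-in so it can run without a MySQL server, and the function name save_item, the column types, and the in-memory database are assumptions made for illustration. With MySQLdb the same cursor.execute(sql, params) pattern should apply, using %s placeholders instead of ?.

```python
import sqlite3
import time


def save_item(con, item):
    """Insert one scraped item with a parameterized query.

    `con` is an open database connection; `item` mimics the spider's
    item, whose values are the lists returned by .extract().
    """
    if not item.get("link") or not item.get("content"):
        return False  # nothing usable was extracted
    link_url = item["link"][0]
    curr_date = time.strftime("%Y-%m-%d")
    # Same header format as the pipeline: date + "__" + first 10 characters.
    content_header = curr_date + "__" + item["content"][0][0:10]
    cur = con.cursor()
    # Values are passed separately, so the driver handles quoting.
    # sqlite3 uses "?" placeholders; MySQLdb uses "%s".
    cur.execute(
        "INSERT INTO qiushi (content, link) VALUES (?, ?)",
        (content_header, link_url),
    )
    con.commit()
    return True


# Stand-in database (assumption): in-memory SQLite instead of MySQL.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE qiushi (content TEXT, link TEXT)")

sample = {
    "content": ["It's a 'quoted' story about crawling"],
    "link": ["http://www.qiushibaike.com/article/118123230"],
}
saved = save_item(con, sample)
rows = con.execute("SELECT content, link FROM qiushi").fetchall()
print(saved, rows)
```

The sample content deliberately contains single quotes. A concatenated statement would be malformed for this input, while the parameterized version stores it unchanged.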
