## Install Scrapy

```
pip install Scrapy
```

## Create a project

```
scrapy startproject tutorial
```

## Create a spider

Create a file named quotes_spider.py in the tutorial/spiders directory with the following code:

```python
import scrapy


class QuotesSpider(scrapy.Spider):
    name = "quotes"

    def start_requests(self):
        urls = [
            'https://segmentfault.com/blog/sown',
        ]
        for url in urls:
            yield scrapy.Request(url=url, callback=self.parse)

    def parse(self, response):
        for quote in response.css('section.stream-list__item'):
            print(quote.css('h2.title a::text').extract_first())
            print(quote.css('h2.title a::attr(href)').extract_first())
```

Before running, add the following to settings.py:

```python
USER_AGENT = 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/75.0.3770.142 Safari/537.36'
ROBOTSTXT_OBEY = False
```

## Run the project

```
scrapy crawl quotes
```

The console prints DEBUG and INFO messages along with the titles and links of the crawled articles. For the simplest beginner spider, the basic workflow is now complete.

## Extract second-level pages

quotes_spider.py:

```python
import urllib

import scrapy


def parse_article(response):
    article = response.css('article.article').extract_first()
    print(article)


class QuotesSpider(scrapy.Spider):
    name = "quotes"

    def start_requests(self):
        urls = [
            'https://segmentfault.com/blog/sown',
        ]
        for url in urls:
            yield scrapy.Request(url=url, callback=self.parse)

    def parse(self, response):
        for quote in response.css('section.stream-list__item'):
            print(quote.css('h2.title a::text').extract_first())
            article = urllib.parse.urljoin(
                response.url,
                quote.css('h2.title a::attr(href)').extract_first())
            yield scrapy.Request(url=article, callback=parse_article)
```

## Save the data to MySQL

items.py:

```python
# -*- coding: utf-8 -*-
import scrapy


class ArticleItem(scrapy.Item):
    title = scrapy.Field()
    content = scrapy.Field()
```

pipelines.py:

```python
# -*- coding: utf-8 -*-
import pymysql
from pymysql.cursors import DictCursor


class TutorialPipeline(object):
    def process_item(self, item, spider):
        return item


class MySQLPipeline(object):
    def __init__(self):
        self.connect = pymysql.connect(
            host='127.0.0.1',
            port=3306,
            db='spider',
            user='root',
            passwd='root',
            charset='utf8',
            use_unicode=True)
        self.cursor = self.connect.cursor(DictCursor)

    def process_item(self, item, spider):
        self.cursor.execute(
            """INSERT INTO article (title, content) VALUES (%s, %s)""",
            (item['title'], item['content']))
        self.connect.commit()
        return item
```

quotes_spider.py:

```python
import urllib

import scrapy

from ..items import ArticleItem


class QuotesSpider(scrapy.Spider):
    name = "quotes"

    def start_requests(self):
        urls = [
            'https://segmentfault.com/blog/sown',
        ]
        for url in urls:
            yield scrapy.Request(url=url, callback=self.parse)

    def parse(self, response):
        for quote in response.css('section.stream-list__item'):
            title = quote.css('h2.title a::text').extract_first()
            article = urllib.parse.urljoin(
                response.url,
                quote.css('h2.title a::attr(href)').extract_first())
            yield scrapy.Request(url=article, callback=self.parse_article,
                                 meta={'title': title})

    def parse_article(self, response):
        title = response.meta['title']
        content = response.css('article.article').extract_first()
        item = ArticleItem()
        item['title'] = title
        item['content'] = content
        yield item
```

settings.py:

```python
ITEM_PIPELINES = {
    'tutorial.pipelines.MySQLPipeline': 300
}
```
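The MySQLPipeline above assumes that the spider database and its article table already exist. As a minimal sketch (the schema below is my assumption, not part of the original tutorial), they could be created once with pymysql like this:

```python
# One-off setup script -- assumed schema, adjust to your own needs.
import pymysql

connect = pymysql.connect(host='127.0.0.1', port=3306,
                          user='root', passwd='root', charset='utf8')
cursor = connect.cursor()
cursor.execute("CREATE DATABASE IF NOT EXISTS spider DEFAULT CHARACTER SET utf8")
cursor.execute(
    "CREATE TABLE IF NOT EXISTS spider.article ("
    " id INT AUTO_INCREMENT PRIMARY KEY,"
    " title VARCHAR(255),"
    " content TEXT)")
connect.commit()
connect.close()
```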
## Pass a start_url parameter through the launch command

quotes_spider.py:

```python
import urllib

import scrapy

from ..items import ArticleItem


class QuotesSpider(scrapy.Spider):
    name = "quotes"

    def __init__(self, start_url=None, *args, **kwargs):
        super(QuotesSpider, self).__init__(*args, **kwargs)
        self.start_url = start_url

    def start_requests(self):
        urls = [
            'https://segmentfault.com',
        ]
        for url in urls:
            yield scrapy.Request(url=url, callback=self.parse)

    def parse(self, response):
        yield scrapy.Request(self.start_url, callback=self.parse_list, meta={})

    def parse_list(self, response):
        for quote in response.css('section.stream-list__item'):
            title = quote.css('h2.title a::text').extract_first()
            article = urllib.parse.urljoin(
                response.url,
                quote.css('h2.title a::attr(href)').extract_first())
            yield scrapy.Request(url=article, callback=self.parse_article,
                                 meta={'title': title})

    def parse_article(self, response):
        title = response.meta['title']
        content = response.css('article.article').extract_first()
        item = ArticleItem()
        item['title'] = title
        item['content'] = content
        yield item
```

Run it with:

```
scrapy crawl quotes -a start_url=https://segmentfault.com/blog/sown
```
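Each `-a key=value` pair is passed to the spider's `__init__` as a keyword argument, which is why start_url shows up as a constructor parameter above. If you prefer starting the crawl from a Python script rather than the command line, Scrapy's CrawlerProcess forwards keyword arguments the same way; a minimal sketch (not part of the original tutorial):

```python
# run_spider.py -- sketch of launching the spider programmatically.
from scrapy.crawler import CrawlerProcess
from scrapy.utils.project import get_project_settings

from tutorial.spiders.quotes_spider import QuotesSpider

process = CrawlerProcess(get_project_settings())
# Keyword arguments are forwarded to QuotesSpider.__init__, just like -a on the CLI.
process.crawl(QuotesSpider, start_url='https://segmentfault.com/blog/sown')
process.start()  # blocks until the crawl finishes
```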
## Problems you may encounter

The crawled content may contain `<br/>` tags, which causes what should be a single string to come back as a list. Normally `text = response.css('[id=content]::text').extract()` would return all of the text content, but because the content contains `<br/>` tags, it returns a list of fragments split on them. In that case, merge the list, or change the matching rule to fit the actual page.
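One simple way to merge the list, as a sketch (reusing the `[id=content]` selector from above):

```python
# Join the text fragments returned by ::text back into one string.
fragments = response.css('[id=content]::text').extract()
text = '\n'.join(fragments)
```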
When calling Scrapy from Flask, you may get "ValueError: signal only works in main thread". Launch the spider in a subprocess instead:

```python
import subprocess

subprocess.run(['scrapy', 'crawl', 'nmzsks',
                '-a', 'year=' + year,
                '-a', 'start_url=' + start_url],
               shell=True)
```

If importing ArticleItem fails with "No module named items", add `..` to make the import relative: `from ..items import ArticleItem`.

When the page contains Chinese, the crawled content may come back garbled because the page is not UTF-8 encoded and Scrapy does not adapt to it on its own. Convert it with `content.encode('latin1').decode('gbk')`.
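To see why that round trip works, here is a tiny self-contained demonstration (the sample string is just an illustration): bytes that are really GBK but were decoded as Latin-1 show up as mojibake, and re-encoding as Latin-1 recovers the original bytes so they can be decoded as GBK.

```python
# Simulate a page that is GBK-encoded but was decoded as Latin-1.
raw = 'Python 爬虫'.encode('gbk').decode('latin1')   # garbled text
fixed = raw.encode('latin1').decode('gbk')           # round-trip back to readable text
print(fixed)  # -> Python 爬虫
```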
