1. zhihuSpider.py spider code:

#!/usr/bin/env python
# -*- coding: utf-8 -*-
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.selector import Selector
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from scrapy.http import Request, FormRequest
from zhihu.items import ZhihuItem

class ZhihuSpider(CrawlSpider):
    name = "zhihu"
    allowed_domains = ["www.zhihu.com"]
    start_urls = ["http://www.zhihu.com"]
    rules = (
        Rule(SgmlLinkExtractor(allow=(r'/question/\d+#.*?',)), callback='parse_page', follow=True),
        Rule(SgmlLinkExtractor(allow=(r'/question/\d+',)), callback='parse_page', follow=True),
    )
    headers = {
        "Accept": "*/*",
        "Accept-Encoding": "gzip,deflate",
        "Accept-Language": "en-US,en;q=0.8,zh-TW;q=0.6,zh;q=0.4",
        "Connection": "keep-alive",
        "Content-Type": "application/x-www-form-urlencoded; charset=UTF-8",
        "User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/38.0.2125.111 Safari/537.36",
        "Referer": "http://www.zhihu.com/"
    }

    # Override the spider's start method to issue a custom request;
    # once it succeeds, Scrapy invokes the callback.
    def start_requests(self):
        return [Request("https://www.zhihu.com/login",
                        meta={'cookiejar': 1},
                        callback=self.post_login)]

    # A plain FormRequest was problematic here; FormRequest.from_response is used below.
    def post_login(self, response):
        print 'Preparing login'
        # Pull the _xsrf token out of the fetched login page; the form
        # submission only succeeds if this token is posted back.
        xsrf = Selector(response).xpath('//input[@name="_xsrf"]/@value').extract()[0]
        print xsrf
        # FormRequest.from_response is a helper provided by Scrapy for posting forms.
        # After a successful login, the after_login callback is invoked.
        return [FormRequest.from_response(response,  # "http://www.zhihu.com/login",
                                          meta={'cookiejar': response.meta['cookiejar']},
                                          headers=self.headers,  # note the custom headers here
                                          formdata={
                                              '_xsrf': xsrf,
                                              'email': '1095511864@qq.com',
                                              'password': '123456'
                                          },
                                          callback=self.after_login,
                                          dont_filter=True)]

    def after_login(self, response):
        for url in self.start_urls:
            yield self.make_requests_from_url(url)

    def parse_page(self, response):
        problem = Selector(response)
        item = ZhihuItem()
        item['url'] = response.url
        item['name'] = problem.xpath('//span[@class="name"]/text()').extract()
        print item['name']
        item['title'] = problem.xpath('//h2[@class="zm-item-title zm-editable-content"]/text()').extract()
        item['description'] = problem.xpath('//div[@class="zm-editable-content"]/text()').extract()
        item['answer'] = problem.xpath('//div[@class="zm-editable-content clearfix"]/text()').extract()
        return item

2. Item class settings:

from scrapy.item import Item, Field

class ZhihuItem(Item):
    # define the fields for your item here, like:
    # name = scrapy.Field()
    url = Field()          # URL of the crawled question
    title = Field()        # question title
    description = Field()  # question description
    answer = Field()       # answers to the question
    name = Field()         # user name

3. settings.py: set the crawl interval:

BOT_NAME = 'zhihu'

SPIDER_MODULES = ['zhihu.spiders']
NEWSPIDER_MODULE = 'zhihu.spiders'

DOWNLOAD_DELAY = 0.25  # set the download delay to 250 ms

4. How cookies work

HTTP is a stateless, connection-oriented protocol. To preserve state between requests, the cookie mechanism was introduced. A cookie is an attribute carried in the HTTP message headers and consists of:

- the cookie name (Name)
- the cookie value (Value)
- the expiration time (Expires / Max-Age)
- the path the cookie applies to (Path)
- the domain it belongs to (Domain)
- a Secure flag marking cookies that must only be sent over secure connections

The first two attributes are required for a cookie to be usable at all. Browsers also track a cookie's size (Size); different browsers impose different limits on the number and size of cookies.
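To make these attributes concrete, here is a minimal sketch using Python's standard-library http.cookiejar (Python 3 syntax; httpbin.org is used purely as a demo endpoint and is not part of the crawler project):

import urllib.request
from http.cookiejar import CookieJar

jar = CookieJar()  # in-memory cookie store, analogous to the spider's 'cookiejar' meta
opener = urllib.request.build_opener(urllib.request.HTTPCookieProcessor(jar))

# The server's Set-Cookie response header is parsed into the jar automatically.
opener.open("https://httpbin.org/cookies/set?session=abc123")

for cookie in jar:
    # Name and Value are mandatory; the remaining attributes may be empty.
    print(cookie.name, cookie.value, cookie.domain, cookie.path,
          cookie.expires, cookie.secure)

Running this prints one line per stored cookie. For a session cookie like the one above, expires is None, which is why such cookies are discarded when the browser closes.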

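One caveat on the spider code in section 1: it targets a pre-1.0 Scrapy, where scrapy.contrib and SgmlLinkExtractor still existed, and it uses Python 2 print statements. On a modern Scrapy release the equivalent skeleton looks roughly like the sketch below (an assumption about your Scrapy version, not part of the original project; the XPath selectors and login flow carry over unchanged):

# Modern-Scrapy skeleton equivalent to the spider above.
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor

class ZhihuSpider(CrawlSpider):
    name = "zhihu"
    allowed_domains = ["www.zhihu.com"]
    start_urls = ["http://www.zhihu.com"]
    rules = (
        # LinkExtractor replaces the removed SgmlLinkExtractor.
        Rule(LinkExtractor(allow=(r'/question/\d+',)),
             callback='parse_page', follow=True),
    )

    def parse_page(self, response):
        # the same extraction logic as the original parse_page goes here
        pass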