1. items.py

```python
import scrapy


class DouyuspiderItem(scrapy.Item):
    name = scrapy.Field()        # name used to store the photo (the anchor's nickname)
    imagesUrls = scrapy.Field()  # URL of the photo
    imagesPath = scrapy.Field()  # local path where the photo is stored
```

2. spiders/douyu.py

```python
import scrapy
import json
from douyuSpider.items import DouyuspiderItem


class DouyuSpider(scrapy.Spider):
    name = "douyu"
    allowed_domains = ["capi.douyucdn.cn"]
    offset = 0
    url = "http://capi.douyucdn.cn/api/v1/getVerticalRoom?limit=20&offset="
    start_urls = [url + str(offset)]

    def parse(self, response):
        # "data" holds the list of room records returned in the JSON response
        data = json.loads(response.text)["data"]
        # if data is empty, stop paginating and exit the function
        # if not data:
        #     return
        for each in data:
            item = DouyuspiderItem()
            item["name"] = each["nickname"]
            item["imagesUrls"] = each["vertical_src"]
            yield item

        # request the next page
        self.offset += 20
        yield scrapy.Request(self.url + str(self.offset), callback=self.parse)
```

3. settings.py

```python
ITEM_PIPELINES = {'douyuSpider.pipelines.ImagesPipeline': 1}

# where downloaded images are stored; referenced later in pipelines.py
IMAGES_STORE = "/Users/Power/lesson_python/douyuSpider/Images"

# user-agent
USER_AGENT = 'DYZB/2.290 (iPhone; iOS 9.3.4; Scale/2.00)'
```

4. pipelines.py

```python
import os

import scrapy
from scrapy.pipelines.images import ImagesPipeline
from scrapy.utils.project import get_project_settings


class ImagesPipeline(ImagesPipeline):
    IMAGES_STORE = get_project_settings().get("IMAGES_STORE")

    def get_media_requests(self, item, info):
        image_url = item["imagesUrls"]
        yield scrapy.Request(image_url)

    def item_completed(self, results, item, info):
        # standard pattern: collect the paths of the successfully downloaded images;
        # see the ImagesPipeline source for how results is built
        image_path = [x["path"] for ok, x in results if ok]
        # rename the downloaded file to <nickname>.jpg
        os.rename(self.IMAGES_STORE + "/" + image_path[0],
                  self.IMAGES_STORE + "/" + item["name"] + ".jpg")
        item["imagesPath"] = self.IMAGES_STORE + "/" + item["name"]
        return item
```

get_media_requests generates one Request object per image link; its output becomes the results argument passed to item_completed. results is a list of tuples, each of the form (success, image_info_or_failure). When success is True, image_info_or_failure is a dict with three keys: url, path and checksum.

Create a main.py file in the project root for debugging:

```python
from scrapy import cmdline

cmdline.execute('scrapy crawl douyu'.split())
```

Run the program:

```
python main.py
```
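For reference, here is a minimal illustrative sketch of what results might look like when item_completed is called. The url, path and checksum values below are made up and only show the shape of the data:

```python
# Illustrative only -- the actual values depend on the crawl.
results = [
    (True, {
        "url": "http://rpic.douyucdn.cn/example.jpg",    # hypothetical image URL
        "path": "full/0a1b2c3d4e5f67890.jpg",            # path relative to IMAGES_STORE
        "checksum": "b9628c4ab9b595f72f280b90c4fd093d",  # checksum of the downloaded file
    }),
]

# the same list comprehension used in item_completed above:
image_path = [x["path"] for ok, x in results if ok]
print(image_path)  # ['full/0a1b2c3d4e5f67890.jpg']
```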

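As a side note, the os.rename step in item_completed can be avoided by overriding file_path, the method Scrapy's ImagesPipeline calls to decide where each downloaded image is saved. The sketch below is not part of the original tutorial; the class name NamedImagesPipeline is made up, and it assumes Scrapy 2.4 or later, where file_path receives the item as a keyword argument:

```python
import scrapy
from scrapy.pipelines.images import ImagesPipeline


class NamedImagesPipeline(ImagesPipeline):
    """Sketch: save each image directly as <nickname>.jpg under IMAGES_STORE,
    so no rename step is needed afterwards (assumes Scrapy >= 2.4)."""

    def get_media_requests(self, item, info):
        yield scrapy.Request(item["imagesUrls"])

    def file_path(self, request, response=None, info=None, *, item=None):
        # return the storage path relative to IMAGES_STORE
        return item["name"] + ".jpg"

    def item_completed(self, results, item, info):
        paths = [x["path"] for ok, x in results if ok]
        if paths:
            item["imagesPath"] = paths[0]
        return item
```

If you use this variant, point ITEM_PIPELINES at 'douyuSpider.pipelines.NamedImagesPipeline' instead of the class shown in step 3.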