Let's start with a screenshot of the results, because pictures or it didn't happen! I previously wrote an article about scraping profile data that mainly used selenium to simulate browser actions, waited for the dynamically loaded content, and then pulled the page data out with xpath, but that approach is not very efficient. So today I'm adding a more efficient way to get the data. Since nothing is simulated, everything can be controlled by hand and the data can be fetched without ever opening the page! But first we need to analyze the site.

Open http://www.lovewzly.com/jiaoyou.html, press F12 and switch to the Network tab. After filtering the requests you can see that only the page parameter changes, increasing page by page, and if you open that request URL directly in the browser you get back a batch of JSON. So we can work with the JSON data directly and then store it.

Code structure diagram: (screenshot)
Run output: (screenshot)

The headers must carry the Referer (the site checks for hotlinking) and a browser User-Agent. Set them up like this from the start and you avoid trouble later. Assemble the query conditions, remember to parse the response into JSON, pull the fields out of the JSON, and then write the extracted data to files or store it elsewhere.

Main techniques covered: requests/urllib usage, Excel file handling, strings, exception handling, and so on.

Basic data request:

def craw_data(self):
    '''Crawl the data'''
    headers = {
        'Referer': 'http://www.lovewzly.com/jiaoyou.html',
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/53.0.2785.104 Safari/537.36 Core/1.53.4620.400 QQBrowser/9.7.13014.400'
    }
    page = 1
    while True:
        query_data = {
            'page': page,
            'gender': self.gender,
            'starage': self.stargage,
            'endage': self.endgage,
            'stratheight': self.startheight,
            'endheight': self.endheight,
            'marry': self.marry,
            'salary': self.salary,
        }
        url = 'http://www.lovewzly.com/api/user/pc/list/search?' + urllib.urlencode(query_data)
        print url
        req = urllib2.Request(url, headers=headers)
        response = urllib2.urlopen(req).read()
        # print response
        self.parse_data(response)
        page += 1

Field extraction:

def parse_data(self, response):
    '''Parse the data'''
    persons = json.loads(response).get('data').get('list')
    if persons is None:
        print 'all data has been requested'
        return
    for person in persons:
        nick = person.get('username')
        gender = person.get('gender')
        age = 2018 - int(person.get('birthdayyear'))
        address = person.get('city')
        heart = person.get('monolog')
        height = person.get('height')
        img_url = person.get('avatar')
        education = person.get('education')
        print nick, age, height, address, heart, education
        self.store_info(nick, age, height, address, heart, education, img_url)
        self.store_info_execl(nick, age, height, address, heart, education, img_url)

File storage:

def store_info(self, nick, age, height, address, heart, education, img_url):
    '''Save the avatar photo and the inner monologue'''
    if age < 22:
        tag = u'under 22'
    elif 22 <= age < 28:
        tag = u'22-28'
    elif 28 <= age < 32:
        tag = u'28-32'
    elif 32 <= age:
        tag = u'over 32'
    filename = u'{}age_height{}_education{}_{}_{}.jpg'.format(age, height, education, address, nick)
    try:
        # build the full image directory path
        image_path = u'E:/store/pic/{}'.format(tag)
        # create the folder if it does not exist yet
        if not os.path.exists(image_path):
            os.makedirs(image_path)
            print image_path + ' created successfully'
        # note: the image has to be written in binary mode
        with open(image_path + '/' + filename, 'wb') as f:
            f.write(urllib.urlopen(img_url).read())

        txt_path = u'E:/store/txt'
        txt_name = u'内心独白.txt'  # "inner monologue"
        # create the txt folder if it does not exist yet
        if not os.path.exists(txt_path):
            os.makedirs(txt_path)
            print txt_path + ' created successfully'
        # append the monologue text (encode so non-ASCII text can be written in Python 2)
        with open(txt_path + '/' + txt_name, 'a') as f:
            f.write(heart.encode('utf-8'))
    except Exception as e:
        print e

Excel operation:

def store_info_execl(self, nick, age, height, address, heart, education, img_url):
    person = []
    person.append(self.count)  # running row number, doubles as an ID column
    person.append(nick)
    person.append(u'Female' if self.gender == 2 else u'Male')
    person.append(age)
    person.append(height)
    person.append(address)
    person.append(education)
    person.append(heart)
    person.append(img_url)
    for j in range(len(person)):
        self.sheetInfo.write(self.count, j, person[j])
    self.f.save(u'我主良缘.xlsx')  # "我主良缘" is the site's Chinese name
    self.count += 1
    print 'inserted {} items of data'.format(self.count)

The results display nicely! Source code: https://github.com/pythonchannel/python27/blob/master/test/meizhi.py
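The methods above reference attributes that are initialized elsewhere in the linked source: the search filters (self.gender, self.stargage, self.endgage, self.startheight, self.endheight, self.marry, self.salary), the Excel workbook (self.f, self.sheetInfo) and the row counter (self.count). For readers who want to run the snippets without pulling the full file, here is a minimal setup sketch; it assumes Python 2 and xlwt (which actually writes the legacy .xls format even when the file is named .xlsx), and the class name and filter values are placeholders rather than the ones in the original source.

# -*- coding: utf-8 -*-
# Minimal sketch of the setup the snippets above assume (hypothetical names/values).
import json
import os
import urllib
import urllib2
import xlwt


class Meizhi(object):                      # hypothetical class name
    def __init__(self):
        # search filters sent to the API; the values below are placeholders
        self.gender = 2                    # assumed: 2 = female, 1 = male
        self.stargage = 21                 # start of the age range
        self.endgage = 30                  # end of the age range
        self.startheight = 155             # minimum height (cm)
        self.endheight = 175               # maximum height (cm)
        self.marry = 1                     # marital-status filter
        self.salary = 2                    # salary-bracket filter
        # xlwt workbook and sheet used by store_info_execl
        self.f = xlwt.Workbook()
        self.sheetInfo = self.f.add_sheet(u'info', cell_overwrite_ok=True)
        self.count = 0                     # running row counter

    # craw_data / parse_data / store_info / store_info_execl from above go here


if __name__ == '__main__':
    Meizhi().craw_data()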
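The article lists requests among the techniques covered, but the fetch loop above uses urllib2. Purely as an illustration (this is not code from the original source, and craw_data_with_requests is a hypothetical name), the same loop could be written with the requests library, which builds the query string itself so urlencode is no longer needed:

import requests

def craw_data_with_requests(self):
    '''Same fetch loop as craw_data, sketched with the requests library.'''
    headers = {
        'Referer': 'http://www.lovewzly.com/jiaoyou.html',
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/53.0.2785.104 Safari/537.36',
    }
    page = 1
    while True:
        params = {
            'page': page,
            'gender': self.gender,
            'starage': self.stargage,
            'endage': self.endgage,
            'stratheight': self.startheight,
            'endheight': self.endheight,
            'marry': self.marry,
            'salary': self.salary,
        }
        # requests encodes params into the query string and sends the headers
        resp = requests.get('http://www.lovewzly.com/api/user/pc/list/search',
                            params=params, headers=headers)
        self.parse_data(resp.text)  # parse_data calls json.loads on the raw text
        page += 1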
