
What? Ziroom Rental Prices Are Just an Image [Python Web Scraping]

Posted: 2023-03-26 16:16:18 · Python

A few days ago a friend wanted to scrape Ziroom rental prices around Wangjing, ran into some problems, and asked me to help analyze them.

1 Analysis

I figured I had done this kind of thing before, so how hard could it be? Just open a listing page and... huh, the prices have been swapped out for images. Previously the price was returned by a separate Ajax request. From the page source you can see: ① although the price is made up of four separate tags, they all share the same background image; ② the CSS gives each digit cell a fixed width of 20px and height of 30px, and only the background offset changes. Fine by me. The plan: request the page, grab the price sprite image and the background offsets, then crop the image and run recognition to recover the digits. I happen to have been studying CNN-based image recognition lately, and digits this regular should reach 100% accuracy with a little training.

2 Hands-on

So let's do exactly that: find an entry point, then pull down a batch of pages.

2.1 Fetching the listing pages

Search by subway: Line 15, Wangjing East. Collect the room list on each page, then follow the pagination. Sample code:

# -*- coding: UTF-8 -*-
import os
import time
import random

import requests
from lxml.etree import HTML

__author__ = 'lpe234'

index_url = 'https://www.ziroom.com/z/s100006-t201081/?isOpen=0'

visited_index_urls = set()


def get_pages(start_url: str):
    """
    Line 15, around Wangjing East: collect all detail-page URLs from the listing pages.
    :param start_url:
    :return:
    """
    # skip already-visited listing pages
    if start_url in visited_index_urls:
        return
    visited_index_urls.add(start_url)
    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/90.0.4430.212 Safari/537.36'
    }
    resp = requests.get(start_url, headers=headers)
    resp_content = resp.content.decode('utf-8')
    root = HTML(resp_content)
    # parse the room list on the current page
    hrefs = root.xpath('//div[@class="Z_list-box"]/div/div[@class="pic-box"]/a/@href')
    for href in hrefs:
        if not href.startswith('http'):
            href = 'http:' + href.strip()
        print(href)
        parse_detail(href)
    # follow the pagination links
    pages = root.xpath('//div[@class="Z_pages"]/a/@href')
    for page in pages:
        if not page.startswith('http'):
            page = 'http:' + page
        get_pages(page)


def parse_detail(detail_url: str):
    """
    Fetch a detail page and save it to disk.
    :param detail_url:
    :return:
    """
    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/90.0.4430.212 Safari/537.36'
    }
    filename = 'pages/' + detail_url.split('/')[-1]
    if os.path.exists(filename):
        return
    # random pause of 1-5 seconds
    time.sleep(random.randint(1, 5))
    resp = requests.get(detail_url, headers=headers)
    resp_content = resp.content.decode('utf-8')
    with open(filename, 'wb+') as page:
        page.write(resp_content.encode())


if __name__ == '__main__':
    get_pages(start_url=index_url)

A quick run collects the nearby listings, about 600 in total.

2.2 Extracting the price images

Walk through all the pages saved above, parse out the price images and offsets, and store them. Sample code:

# -*- coding: UTF-8 -*-
import os
import re
from urllib.request import urlretrieve

from lxml.etree import HTML

__author__ = 'lpe234'

poss = list()


def walk_pages():
    """
    Walk all saved pages.
    :return:
    """
    for dirpath, dirnames, filenames in os.walk('pages'):
        for page in filenames:
            page = os.path.join('pages', page)
            print(page)
            parse_page(page)


def parse_page(page_path: str):
    """
    Parse one saved page.
    :param page_path:
    :return:
    """
    with open(page_path, 'rb') as page:
        page_content = ''.join([_.decode('utf-8') for _ in page.readlines()])
    root = HTML(page_content)
    styles = root.xpath('//div[@class="Z_price"]/i/@style')
    pos_re = re.compile(r'background-position:(.*?)px;')
    img_re = re.compile(r'url\((.*?)\);')
    for style in styles:
        style = style.strip()
        print(style)
        pos = pos_re.findall(style)[0]
        img = img_re.findall(style)[0]
        if img.endswith('red.png'):
            continue
        if not img.startswith('http'):
            img = 'http:' + img
        print(f'pos: {pos}, img: {img}')
        save_img(img)
        poss.append(pos)


def save_img(img_url: str):
    img_name = img_url.split('/')[-1]
    img_path = os.path.join('imgs', img_name)
    if os.path.exists(img_path):
        return
    urlretrieve(img_url, img_path)


if __name__ == '__main__':
    walk_pages()
    print(sorted([float(_) for _ in poss]))
    print(sorted(set([float(_) for _ in poss])))

In the end there are 21 price-related images: 20 orange ones used for normal prices and 1 red one reserved for special offers. It seems image recognition is not needed at all; a direct mapping from image name plus offset to digit is enough.
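Before that mapping can be hardcoded, the digit order inside each sprite still has to be read off once by eye. The helper below is a small sketch of my own, not part of the original scripts: it assumes Pillow is installed and that every sprite lays out its ten digits in a single row of equal-width cells, and it crops one downloaded sprite into individual digit images for manual inspection.

# -*- coding: UTF-8 -*-
# Sketch (not from the original post): split one downloaded price sprite into
# its ten digit cells so the digit order can be labeled by hand and turned
# into the lookup table used in the next section. Assumes Pillow is installed
# and that each sprite is a single row of ten equally wide digits.
import os

from PIL import Image


def split_sprite(img_path: str, out_dir: str = 'cells'):
    os.makedirs(out_dir, exist_ok=True)
    img = Image.open(img_path)
    width, height = img.size
    cell_width = width / 10  # ten digits side by side (assumption)
    name = os.path.splitext(os.path.basename(img_path))[0]
    for i in range(10):
        # crop the i-th digit cell and save it for manual inspection
        cell = img.crop((round(i * cell_width), 0, round((i + 1) * cell_width), height))
        cell.save(os.path.join(out_dir, f'{name}_{i}.png'))


if __name__ == '__main__':
    # example: inspect one of the sprites saved into imgs/ by the script above
    split_sprite(os.path.join('imgs', '1b68fa980af5e85b0f545fccfe2f8af1.png'))

Labeling the cells of each of the 21 sprites this way is one possible route to the per-image digit order that the next section hardcodes.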
2.3 Price parsing

I was going to call this step "recognition", but that hardly feels right. What kind of recognition is this? It's just a lookup from image name plus offset. Sample code:

# -*- coding: UTF-8 -*-
import re

import requests
from lxml.etree import HTML

__author__ = 'lpe234'

PRICE_IMG = {
    '1b68fa980af5e85b0f545fccfe2f8af1.png': [8, 9, 1, 6, 7, 0, 2, 4, 5, 3],
    '4eb5ebda7cc7c3214aebde816b10d204.png': [9, 5, 7, 0, 8, 6, 3, 1, 2, 4],
    '5c6750e29a7aae17288dcadadb5e33b1.png': [4, 5, 9, 3, 1, 6, 2, 8, 7, 0],
    '6f8787069ac0a69b36c8cf13aacb016b.png': [6, 1, 9, 7, 4, 5, 0, 8, 3, 2],
    '7ce54f64c5c0a425872683e3d1df36f4.png': [5, 1, 3, 7, 6, 8, 9, 4, 0, 2],
    '8e7a6d05db4a1eb58ff3c26619f40041.png': [3, 8, 7, 1, 2, 9, 0, 6, 4, 5],
    '73ac03bb4d5857539790bde4d9301946.png': [7, 1, 9, 0, 8, 6, 4, 5, 2, 3],
    '234a22e00c646d0a2c20eccde1bbb779.png': [1, 2, 0, 5, 8, 3, 7, 6, 4, 9],
    '486ff52ed774dbecf6f24855851e3704.png': [4, 7, 8, 0, 1, 6, 9, 2, 5, 3],
    '19003aac664523e53cc502b54a50d2b6.png': [4, 9, 2, 8, 7, 3, 0, 6, 5, 1],
    '93959ce492a74b6617ba8d4e5e195a1d.png': [5, 4, 3, 0, 8, 7, 9, 6, 2, 1],
    '7995074a73302d345088229b960929e9.png': [0, 7, 4, 2, 1, 3, 8, 6, 5, 9],
    '939205287b8e01882b89273e789a77c5.png': [8, 0, 1, 5, 7, 3, 9, 6, 2, 4],
    '477571844175c1058ece4cee45f5c4b3.png': [2, 1, 5, 8, 0, 9, 7, 4, 3, 6],
    'a822d494f1e8421a2fb2ec5e6450a650.png': [3, 1, 6, 5, 8, 4, 9, 7, 2, 0],
    'a68621a4bca79938c464d8d728644642.png': [7, 0, 3, 4, 6, 1, 5, 9, 8, 2],
    'b2451cc91e265db2a572ae750e8c15bd.png': [9, 1, 6, 2, 8, 5, 3, 4, 7, 0],
    'bdf89da0338b19fbf594c599b177721c.png': [3, 1, 6, 4, 7, 9, 5, 2, 8, 0],
    'de345d4e39fa7325898a8fd858addbb8.png': [7, 2, 6, 3, 8, 4, 0, 1, 9, 5],
    'eb0d3275f3c698d1ac304af838d8bbf0.png': [3, 6, 5, 0, 4, 8, 9, 2, 1, 7],
    'img_pricenumber_detail_red.png': [6, 1, 9, 7, 4, 5, 0, 8, 3, 2],
}

POS_IDX = [-0.0, -31.24, -62.48, -93.72, -124.96, -156.2, -187.44, -218.68, -249.92, -281.16]


def parse_price(img: str, pos_list: list):
    price_list = PRICE_IMG.get(img)
    if not price_list:
        raise Exception('img not found. %s', img)
    step = 1
    price = 0
    _pos_list = reversed(pos_list)
    for pos in _pos_list:
        price += price_list[POS_IDX.index(float(pos))] * step
        step *= 10
    return price


def parse_page(content: str):
    root = HTML(content)
    styles = root.xpath('//div[@class="Z_price"]/i/@style')
    pos_re = re.compile(r'background-position:(.*?)px;')
    pos_img = re.findall('price/(.*?)\\);', styles[0])[0]
    poss = list()
    for style in styles:
        style = style.strip()
        pos = pos_re.findall(style)[0]
        poss.append(pos)
    print(pos_img)
    print(poss)
    return parse_price(pos_img, poss)


def request_page(url: str):
    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/90.0.4430.212 Safari/537.36'
    }
    resp = requests.get(url, headers=headers)
    resp_content = resp.content.decode('utf-8')
    return resp_content

For convenience, this has also been wrapped as a web service. Test endpoint ==> https://lemon.lpe234.xyz/common/ziru/

3 Summary

I had hoped to show off my newly learned skills to my friend, but there turned out to be nothing to show. Thinking about it some more: since Ziroom went to the trouble of building this, why not use more sprite images and make it genuinely hard? Still, so as not to embarrass myself in front of my friend, I will have to bring in the CNN later after all; otherwise all that studying will have been wasted.
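For reference, here is a minimal usage sketch of the section 2.3 functions, appended to that script; the detail URL is just a placeholder, not a listing from this post.

# Minimal usage sketch for the section 2.3 functions (append to that script).
# The URL is a placeholder for any Ziroom detail page, not one from the post.
if __name__ == '__main__':
    detail_url = 'https://www.ziroom.com/x/000000.html'  # hypothetical listing URL
    html_content = request_page(detail_url)  # download the detail page
    print(parse_page(html_content))          # map sprite name + offsets to the price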