How I Scrape Second-Hand Housing Price Data

Time: 2023-03-18 16:25:23 科技观察 (Tech Watch)

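Last time I showed how to scrape new-home prices with Python. Many readers have since asked how to capture the latest prices for second-hand homes. So today I will walk through scraping second-hand housing prices.

Module installation

The setup is the same as for the new-home scraper. Install the following modules (skip any you already have):

    # Install the required modules
    pip3 install bs4
    pip3 install requests
    pip3 install lxml
    pip3 install numpy
    pip3 install pandas

Once they are installed, we can start writing code. The code that builds the request headers and configures proxy IP addresses was covered in the previous article on new homes, so I will not repeat it here and will go straight to the scraping code.

Second-hand housing price data object

Here we create an object that holds the price information for one second-hand home. Saving each scraped record as an object makes later processing much easier. The SecHouse class is as follows:

    # Second-hand house information object
    class SecHouse(object):
        def __init__(self, district, area, name, price, desc, pic):
            self.district = district
            self.area = area
            self.price = price
            self.name = name
            self.desc = desc
            self.pic = pic

        def text(self):
            # Join all fields into one comma-separated line
            return self.district + "," + \
                   self.area + "," + \
                   self.name + "," + \
                   self.price + "," + \
                   self.desc + "," + \
                   self.pic

Fetching and saving second-hand housing prices

Next, taking Beike (ke.com) as the example again, we scrape Beijing's second-hand housing data in bulk and save it locally. The point here is to show the scraping process, so the data is saved in the simplest format, plain txt. If you want to store it in a database, you can modify the code yourself.

Getting district information

When scraping second-hand listings we want to know which district each home is in, so here is a function that fetches all of Beijing's districts and keeps them in a list for use later in the program:

    import math
    import re

    import requests
    from bs4 import BeautifulSoup
    from lxml import etree

    # Lookup tables from the English (URL) names to the Chinese names
    chinese_city_district_dict = dict()
    chinese_area_dict = dict()

    # Get district information
    # create_headers() is the request-header helper from the previous article
    def get_districts():
        # Request URL
        url = 'https://bj.ke.com/xiaoqu/'
        headers = create_headers()
        # Send the request and read the response
        response = requests.get(url, timeout=10, headers=headers)
        html = response.content
        root = etree.HTML(html)
        # Select the district links
        elements = root.xpath('//div[3]/div[1]/dl[2]/dd/div/div/a')
        en_names = list()
        ch_names = list()
        # Loop over the link elements
        for element in elements:
            link = element.attrib['href']
            en_names.append(link.split('/')[-2])
            ch_names.append(element.text)
        # Map each English district name to its Chinese name
        for index, name in enumerate(en_names):
            chinese_city_district_dict[name] = ch_names[index]
        return en_names

Getting areas within a district

Besides the districts above, we should also collect the areas (neighborhood sections), which are smaller than a district. Within the same district, second-hand prices differ from one area to another, so the area is an important reference point. The code for fetching the areas is as follows:

    # Get all areas under a given district
    def get_areas(district):
        # Request URL
        page = "http://bj.ke.com/xiaoqu/{0}".format(district)
        # List of areas to return
        areas = list()
        try:
            headers = create_headers()
            response = requests.get(page, timeout=10, headers=headers)
            html = response.content
            root = etree.HTML(html)
            # Select the area links
            links = root.xpath('//div[3]/div[1]/dl[2]/dd/div/div[2]/a')
            # Process each link in the list
            for link in links:
                relative_link = link.attrib['href']
                # Remove the trailing "/"
                relative_link = relative_link[:-1]
                # Take the last path segment
                area = relative_link.split("/")[-1]
                # Skip the district's own name to avoid duplicates
                if area != district:
                    chinese_area = link.text
                    chinese_area_dict[area] = chinese_area
                    # Add to the list of areas
                    areas.append(area)
        except Exception as e:
            print(e)
        # Return whatever was collected, even if the request failed
        return areas

Getting second-hand listings and saving them

    # Create the output file
    with open("sechouse.txt", "w", encoding='utf-8') as f:
        # Initialize the result list
        sec_house_list = list()
        # Get all districts
        districts = get_districts()
        # Loop over the districts
        for district in districts:
            # Get all areas under this district
            arealist = get_areas(district)
            # Loop over the second-hand listings of every area
            for area in arealist:
                # Chinese district name
                chinese_district = chinese_city_district_dict.get(district, "")
                # Chinese area name
                chinese_area = chinese_area_dict.get(area, "")
                # Request URL
                page = 'http://bj.ke.com/ershoufang/{0}/'.format(area)
                headers = create_headers()
                response = requests.get(page, timeout=10, headers=headers)
                html = response.content
                # Parse the HTML
                soup = BeautifulSoup(html, "lxml")
                # Get the total page count (falls back to 1 if it cannot be parsed)
                total_page = 1
                try:
                    page_box = soup.find_all('div', class_='page-box')[0]
                    matches = re.search(r'.*data-total-count="(\d+)".*', str(page_box))
                    # Listing count divided by 10 per page, rounded up
                    total_page = int(math.ceil(int(matches.group(1)) / 10))
                except Exception as e:
                    print(e)
                print(total_page)
                # Set the request headers
                headers = create_headers()
                # Walk from the first page to the last
                for i in range(1, total_page + 1):
                    # Request URL
                    page = 'http://bj.ke.com/ershoufang/{0}/pg{1}'.format(area, i)
                    print(page)
                    # Fetch and parse the page
                    response = requests.get(page, timeout=10, headers=headers)
                    html = response.content
                    soup = BeautifulSoup(html, "lxml")
                    # Get the list of second-hand listings
                    house_elements = soup.find_all('li', class_="clear")
                    # Loop over each listing
                    for house_elem in house_elements:
                        # Price
                        price = house_elem.find('div', class_="totalPrice")
                        # Title
                        name = house_elem.find('div', class_='title')
                        # Description
                        desc = house_elem.find('div', class_="houseInfo")
                        # Image URL
                        pic = house_elem.find('a', class_="img").find('img', class_="lj-lazy")
                        # Clean the data
                        price = price.text.strip()
                        name = name.text.replace("\n", "")
                        desc = desc.text.replace("\n", "").strip()
                        pic = pic.get('data-original').strip()
                        # Save as a SecHouse object
                        sec_house = SecHouse(chinese_district, chinese_area, name, price, desc, pic)
                        print(sec_house.text())
                        sec_house_list.append(sec_house)
        # Write every record to the txt file
        for sec_house in sec_house_list:
            f.write(sec_house.text() + "\n")

That is all the code. We can now run it with the command python sechouse.py to scrape the data, then open sechouse.txt in the current directory to view the results. The results are shown in the figure below:

Summary

This article showed how to scrape second-hand housing data in bulk from a real-estate website with Python. After some time has passed, we can compare the scraped results to see whether second-hand prices have recently risen or fallen.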
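One way to make that comparison is sketched below. It is only a rough illustration, not part of the original scraper: the function names (parse_line, average_by_district, compare_snapshots) are hypothetical, and it assumes each saved line has the six comma-separated fields written by SecHouse.text(), that the title and description contain no ASCII commas, and that the price field starts with a number such as "520万". Averaging listed total prices per district is a crude measure, since the mix of homes on sale changes between runs.

```python
import re
from collections import defaultdict
from statistics import mean

# Assumption: the price field contains a number, e.g. "520万"
PRICE_RE = re.compile(r"(\d+(?:\.\d+)?)")


def parse_line(line):
    """Return (district, price) from one saved line, or None if it cannot be parsed."""
    parts = line.rstrip("\n").split(",")
    # Expected layout: district, area, name, price, desc, pic
    if len(parts) < 6:
        return None
    match = PRICE_RE.search(parts[3])
    if match is None:
        return None
    return parts[0], float(match.group(1))


def average_by_district(lines):
    """Average the listed price of all parsable lines, grouped by district."""
    prices = defaultdict(list)
    for line in lines:
        parsed = parse_line(line)
        if parsed is not None:
            district, price = parsed
            prices[district].append(price)
    return {district: mean(values) for district, values in prices.items()}


def compare_snapshots(old_lines, new_lines):
    """Return new average minus old average for districts present in both snapshots."""
    old_avg = average_by_district(old_lines)
    new_avg = average_by_district(new_lines)
    common = old_avg.keys() & new_avg.keys()
    return {district: new_avg[district] - old_avg[district] for district in common}
```

To use it, keep the sechouse.txt from an earlier run under a different name, read both files with open(..., encoding="utf-8"), and pass the two file objects (or lists of lines) to compare_snapshots. A negative value means the average listed price in that district fell between the two runs.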
