Big Data Acquisition Case Study: A Python Web Crawler Example

Date: 2023-03-25 21:38:56 Python

Web crawler: a web crawler (also called a web spider or web robot, and in the FOAF community more often a web chaser) is a program or script that automatically fetches information from the World Wide Web according to certain rules. Less common names include ant, autoindex, emulator, or worm.

That is Baidu's definition of a web crawler. What follows is an introduction to using Python to crawl a web page for data, in this case real-time data about COVID-19. The tool used is PyCharm: create a new Python file and name it get_data. The code uses requests, the module most commonly used for crawling.

Part 1: Get the web page.

    import requests

    url = "https://voice.baidu.com/act/newpneumonia/newpneumonia"
    response = requests.get(url)

Part 2: Inspecting the page shows that the data is contained in a script tag, so XPath is used to get it. Import etree from lxml, build an HTML object, and run the XPath query; the result is a list, and its first item holds all of the content. Next, get the content of component. The json module converts the string into a dictionary (a Python data structure). To get the domestic data, look up caseList inside component. Add this code:

    from lxml import etree
    import json

    # Build the HTML object
    html = etree.HTML(response.text)
    result = html.xpath('//script[@type="application/json"]/text()')
    result = result[0]
    # json.loads() converts a string into Python data types
    result = json.loads(result)
    result_in = result['component'][0]['caseList']

Part 3: Store the domestic data in an Excel sheet. This uses the openpyxl module. First create a workbook and a worksheet under it, then name the worksheet and give it a header row. The code is as follows:

    import openpyxl

    # Create the workbook
    wb = openpyxl.Workbook()
    # Create the worksheet
    ws = wb.active
    ws.title = "Domestic outbreak"
    ws.append(['Province', 'Cumulative confirmed', 'Deaths', 'Cured', 'Currently confirmed',
               'Cumulative confirmed increase', 'Death increase', 'Cured increase',
               'Currently confirmed increase'])
    '''
    area --> mostly provinces
    city --> city
    confirmed --> cumulative confirmed
    crued --> cured
    relativeTime -->
    confirmedRelative --> increase in cumulative confirmed
    curedRelative --> increase in cured
    curConfirm --> currently confirmed
    curConfirmRelative --> increase in currently confirmed
    '''
    for each in result_in:
        temp_list = [each['area'], each['confirmed'], each['died'], each['crued'],
                     each['curConfirm'], each['confirmedRelative'], each['diedRelative'],
                     each['curedRelative'], each['curConfirmRelative']]
        # Replace empty values with '0'
        for i in range(len(temp_list)):
            if temp_list[i] == '':
                temp_list[i] = '0'
        ws.append(temp_list)
    wb.save('./data.xlsx')

Part 4: Store the foreign data in Excel. The foreign data is in the component's globalList. Create one sheet in the workbook for each entry, so that each sheet represents a different continent. The code is as follows:

    data_out = result['component'][0]['globalList']
    for each in data_out:
        sheet_title = each['area']
        # Create a new worksheet
        ws_out = wb.create_sheet(sheet_title)
        ws_out.append(['Country', 'Cumulative confirmed', 'Deaths', 'Cured',
                       'Currently confirmed', 'Cumulative confirmed increase'])
        for country in each['subList']:
            list_temp = [country['country'], country['confirmed'], country['died'],
                         country['crued'], country['curConfirm'], country['confirmedRelative']]
            # Replace empty values with '0'
            for i in range(len(list_temp)):
                if list_temp[i] == '':
                    list_temp[i] = '0'
            ws_out.append(list_temp)
    wb.save('./data.xlsx')

The complete code is as follows:

    import json

    import openpyxl
    import requests
    from lxml import etree

    url = "https://voice.baidu.com/act/newpneumonia/newpneumonia"
    response = requests.get(url)
    # print(response.text)
    # Build the HTML object
    html = etree.HTML(response.text)
    result = html.xpath('//script[@type="application/json"]/text()')
    result = result[0]
    # json.loads() converts a string into Python data types
    result = json.loads(result)
    # Create the workbook
    wb = openpyxl.Workbook()
    # Create the worksheet
    ws = wb.active
    ws.title = "Domestic outbreak"
    ws.append(['Province', 'Cumulative confirmed', 'Deaths', 'Cured', 'Currently confirmed',
               'Cumulative confirmed increase', 'Death increase', 'Cured increase',
               'Currently confirmed increase'])
    result_in = result['component'][0]['caseList']
    data_out = result['component'][0]['globalList']
    '''
    area --> mostly provinces
    city --> city
    confirmed --> cumulative confirmed
    crued --> cured
    relativeTime -->
    confirmedRelative --> increase in cumulative confirmed
    curedRelative --> increase in cured
    curConfirm --> currently confirmed
    curConfirmRelative --> increase in currently confirmed
    '''
    for each in result_in:
        temp_list = [each['area'], each['confirmed'], each['died'], each['crued'],
                     each['curConfirm'], each['confirmedRelative'], each['diedRelative'],
                     each['curedRelative'], each['curConfirmRelative']]
        for i in range(len(temp_list)):
            if temp_list[i] == '':
                temp_list[i] = '0'
        ws.append(temp_list)
    # Get the outbreak data for each foreign region
    for each in data_out:
        sheet_title = each['area']
        # Create a new worksheet
        ws_out = wb.create_sheet(sheet_title)
        ws_out.append(['Country', 'Cumulative confirmed', 'Deaths', 'Cured',
                       'Currently confirmed', 'Cumulative confirmed increase'])
        for country in each['subList']:
            list_temp = [country['country'], country['confirmed'], country['died'],
                         country['crued'], country['curConfirm'], country['confirmedRelative']]
            for i in range(len(list_temp)):
                if list_temp[i] == '':
                    list_temp[i] = '0'
            ws_out.append(list_temp)
    wb.save('./data.xlsx')

Result: data.xlsx contains one sheet with the domestic data and one sheet for each foreign continent.
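
Both loops above build a row by indexing each field directly and then replacing empty strings with '0'. If a record lacks one of the keys, direct indexing raises KeyError. The following is a minimal sketch of one way to fold both steps into a single helper. The name build_row and the sample record are hypothetical and used only for illustration; the field names are the ones from the domestic loop.

```python
def build_row(record, keys, default="0"):
    """Return the values of `keys` from `record`, in order.

    Missing keys, empty strings, and None are replaced with `default`.
    """
    row = []
    for key in keys:
        value = record.get(key, "")
        row.append(default if value in ("", None) else value)
    return row


# Field names as used in the article's domestic loop.
DOMESTIC_KEYS = ["area", "confirmed", "died", "crued", "curConfirm",
                 "confirmedRelative", "diedRelative", "curedRelative",
                 "curConfirmRelative"]

# Made-up record for illustration only; not real data from the page.
sample = {"area": "ExampleProvince", "confirmed": "12", "died": "", "crued": "10"}

row = build_row(sample, DOMESTIC_KEYS)
print(row)
```

With such a helper, each loop body could be reduced to ws.append(build_row(each, DOMESTIC_KEYS)), at the cost of silently writing '0' for fields the page no longer provides.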
