
I Crawled 10,000 NASA Mars Exploration Images and Discovered a Secret

Time: 2023-03-26 12:03:45 Python

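Preface

I recently used a crawler to scrape NASA (yes, the NASA you keep seeing in the movies) and collected 10,000 images related to Mars exploration. No big deal, really. I was a little excited once it was done, hence this article. It covers: why I crawled NASA's images; how I crawled them (in great detail); what I got (large high-resolution images); and what secret I discovered.

Why did I crawl NASA's images?

I keep wondering what I would do if I lost my job one day, so I have been thinking about starting a self-media channel and chatting with everyone every day, in plain language, about historical mysteries, mysteries of the universe and so on. That is why I turned my attention to NASA. NASA runs all kinds of space exploration missions, and the related articles, interviews, images and videos are all public. It is a rare treasure trove.

How I crawled NASA's images (in great detail)

NASA's website is public, at https://www.nasa.gov/. The home page shows all sorts of content, and there is a search box at the top right. Enter Mars, wait a moment, and all kinds of Mars-related content appears. One of the results is Mars Exploration. Clicking it takes you to a new page; from there, go to Images and you arrive at the target page, https://www.nasa.gov/mission_pages/mars/images/index.html.

Scroll down and you will see a large button labeled "More Images". Click it and you will find that the content is not loaded with the page itself but rendered asynchronously after an API request. Press F12 to open the browser's developer tools, repeat the steps above and watch the requests. You will see a request like the following, and its URL looks important:

https://www.nasa.gov/api/2/ubernode/_search?size=24&from=24&sort=promo-date-time%3Adesc&q=((ubernode-type%3Aimage)%20AND%20(topics%3A3152))&_source_include=promo-date-time%2Cmaster-image%2Cnid%2Ctitle%2Ctopics%2Cmissions%2Ccollections%2Cother-tags%2Cubernode-type%2Cprimary-tag%2Csecondary-tag%2Ccardfeed-title%2Ctype%2Ccollection-asset-link%2Clink-or-attachment%2Cpr-leader-sentence%2Cimage-feature-caption%2Cattachments%2Curi

Note the parameters size=24&from=24. Clearly size is the number of images returned per request. Testing shows that from is the starting offset of the query, so we can change it to fetch other pages.

Now look at the response:

    {
      "took": 3,
      "timed_out": false,
      "_shards": {"total": 5, "successful": 5, "skipped": 0, "failed": 0},
      "hits": {
        "total": 659,
        "max_score": null,
        "hits": [
          {
            "_index": "nasa-public",
            "_type": "ubernode",
            "_id": "450040",
            "_score": null,
            "_source": {
              "image-feature-caption": "The Mars 2020 rover underwent an eye exam after several cameras were installed on the rover.",
              "topics": ["3140", "3152"],
              "nid": "450040",
              "title": "NASA 'Optometrists' Verify Mars 2020 Rover's 20/20 Vision",
              "type": "ubernode",
              "uri": "/image-feature/jpl/nasa-optometrists-verify-mars-2020-rovers-2020-vision",
              "collections": ["4525", "5246"],
              "link-or-attachment": "link",
              "missions": ["6336"],
              "primary-tag": "6336",
              "cardfeed-title": "NASA 'Optometrists' Verify Mars 2020 Rover's 20/20 Vision",
              "promo-date-time": "2019-08-05T17:49:00-04:00",
              "secondary-tag": "3140",
              "master-image": {
                "fid": "603128",
                "alt": "Engineers test cameras on the top of the Mars 2020 rover's mast and front chassis.",
                "width": "1600",
                "id": "603128",
                "title": "Engineers test cameras on the top of the Mars 2020 rover's mast and front chassis.",
                "uri": "public://thumbnails/image/pia23314-16.jpg",
                "height": "900"
              },
              "ubernode-type": "image"
            },
            "sort": [1565041740000]
          },
          {
            "_index": "nasa-public",
            "_type": "ubernode",
            "_id": "433172",
            "_score": null,
            "_source": {
              "image-feature-caption": "NASA still hasn't heard from the Opportunity rover, but at least we can see it again.",
              "topics": ["3152"],
              "nid": "433172",
              "title": "Opportunity Emerges in a Dusty Picture",
              "type": "ubernode",
              "uri": "/image-feature/opportunity-emerges-in-a-dusty-picture",
              "collections": ["7628"],
              "link-or-attachment": "link",
              "missions": ["3639"],
              "primary-tag": "3152",
              "cardfeed-title": "Opportunity Emerges in a Dusty Picture",
              "promo-date-time": "2018-09-26T12:39:00-04:00",
              "secondary-tag": "7628",
              "master-image": {
                "fid": "584263",
                "alt": "NASA's Opportunity rover appears as a blip in the center of this square",
                "width": "1400",
                "id": "584263",
                "title": "NASA's Opportunity rover appears as a blip in the center of this square",
                "uri": "public://thumbnails/image/pia22549-16.jpg",
                "height": "788"
              },
              "ubernode-type": "image"
            },
            "sort": [1537979940000]
          }
        ]
      }
    }

The JSON above was too long, so I removed the repeated entries. The hits array actually holds 24 items, the same as the number of images shown on the page, so we can be fairly sure the page is built from this array. A closer comparison shows that the master-image field holds what we need: the image address, the image dimensions and the image title.

Here is the code. It has three steps: assemble the request URL, fetch the content, and download the images. I used Dart.

    import 'dart:convert';

    import 'package:dio/dio.dart';

    Future<void> main() async {
      // The page size is fixed at 24; only the starting offset changes.
      for (int from = 0; from < 24 * 100; from = from + 24) {
        await getPage(from);
      }
    }

    // Fetch one page of search results and download every image on it.
    Future<void> getPage(int from) async {
      String url =
          'https://www.nasa.gov/api/2/ubernode/_search?size=24&from=$from'
          '&sort=promo-date-time%3Adesc'
          '&q=((ubernode-type%3Aimage)%20AND%20(topics%3A3152))'
          '&_source_include=promo-date-time%2Cmaster-image%2Cnid%2Ctitle%2Ctopics%2Cmissions%2Ccollections%2Cother-tags%2Cubernode-type%2Cprimary-tag%2Csecondary-tag%2Ccardfeed-title%2Ctype%2Ccollection-asset-link%2Clink-or-attachment%2Cpr-leader-sentence%2Cimage-feature-caption%2Cattachments%2Curi';

      // Fetch the content.
      var res = await Dio().get(url);
      var map = jsonDecode(res.toString());

      // A plain for loop is used so that each download is awaited;
      // List.forEach does not wait for an async callback.
      for (var element in map['hits']['hits'] as List) {
        Uri fileUri = Uri.parse(getUri(element));
        String savePath = getSavePath(element);
        await Dio().downloadUri(fileUri, savePath);
        print('Downloaded: ' + savePath);
      }
    }

    // Build the image download URL.
    String getUri(dynamic element) {
      String uri = element['_source']['master-image']['uri'].toString();
      uri = uri.replaceAll('public://',
          'https://www.nasa.gov/sites/default/files/styles/full_width_feature/public/');
      return uri;
    }

    // Process the metadata and return the path the image is saved to.
    String getSavePath(dynamic element) {
      String id = element['_id'];
      String fid = element['_source']['master-image']['fid'].toString();
      String title = element['_source']['master-image']['title'].toString();
      String uri = element['_source']['master-image']['uri'].toString();
      String savePath =
          id + '_' + fid + '_' + title.trim() + '.' + uri.split('.').last;
      // Strip characters that would break the file name.
      savePath = savePath.replaceAll('/', '');
      savePath = savePath.replaceAll('\\', '');
      savePath = savePath.replaceAll('"', '');
      savePath = 'images/' + savePath;
      return savePath;
    }

The code above is quite simple; experienced readers should understand it at a glance. Run it:

    Downloaded: images/470436_643588_This is the third color image taken by NASA's Ingenuity helicopter.jpg
    Downloaded: images/470435_643587_This is the second color image by NASA's Ingenuity helicopter.jpg
    Downloaded: images/468546_639327_This is first high-resolution, color images to be sent back by the Hazard Cameras (Hazcams).jpg
    Downloaded: images/452007_605784_Danielson Crater on Mars.jpg
    Downloaded: images/458478_615132_Gullies on Mars.jpg
    Downloaded: images/469416_641582_This field of sand dunes occupies a [...]-kilometer-diameter crater in the high latitudes of the northern plains of Mars..jpeg
    Downloaded: images/458075_614251_Mars 2020 With Sample Tubes (Artist's Concept).jpg
    Downloaded: images/470381_643473_CME.jpg
    Downloaded: images/458896_Mars
    Downloaded: images/467026_635309_Mars 2020 With Sample Tubes (Artist's Concept).jpg
    Downloaded: images/470438_643591_This black-and-white image was taken by NASA's Ingenuity helicopter during its third flight on April 25, 2021.jpg
    Downloaded: images/465488_631398_Cliffs in Ancient Ice on Mars.jpg
    Downloaded: images/463659_626874_Avalanche on Mars.jpg
    Downloaded: images/470251_643164_This image from NASA's Perseverance rover shows the agency's Ingenuity Mars Helicopter after it successfully completed a high-speed spin-up test..jpeg
    Downloaded: images/468636_639726_Mars' Jezero Crater.jpg

What did I get?

These images, each with its title in the file name. Enough to browse for a month.

What secret did I discover?

This photo is my favorite. One side is so clear and the other so murky. Why? A Martian rift valley generator?

All right, here is the real secret: NASA's website has no anti-scraping protection. If you don't believe me, try it yourself.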
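The size/from paging and the public:// rewrite described above can be checked without touching the network. The sketch below is only an illustration, not part of the original crawler. The helper names pageOffsets, totalHits and toDownloadUrl are hypothetical. It also assumes that hits.total is a plain integer, as in the sample response shown in this article, and that reading it is a reasonable way to decide how many pages to request instead of a fixed 100.

```dart
import 'dart:convert';

// Assumption: same base path that the article's getUri() substitutes
// for the "public://" prefix.
const filesBase =
    'https://www.nasa.gov/sites/default/files/styles/full_width_feature/public/';

/// Hypothetical helper: the `from` offsets needed to cover `total`
/// results in pages of `size` (24 in the article).
List<int> pageOffsets(int total, {int size = 24}) {
  return [for (int from = 0; from < total; from += size) from];
}

/// Hypothetical helper: reads `hits.total` from a decoded response.
/// Assumes the value is a plain integer, as in the article's sample.
int totalHits(Map<String, dynamic> response) {
  return response['hits']['total'] as int;
}

/// Hypothetical helper: maps a `public://` URI to a download URL.
String toDownloadUrl(String publicUri) {
  return publicUri.replaceFirst('public://', filesBase);
}

void main() {
  // A trimmed-down sample in the shape of the response shown above.
  const sample = '{"hits":{"total":659,"hits":[{"_id":"450040",'
      '"_source":{"master-image":{"uri":'
      '"public://thumbnails/image/pia23314-16.jpg"}}}]}}';
  final response = jsonDecode(sample) as Map<String, dynamic>;

  final offsets = pageOffsets(totalHits(response));
  print(offsets.length); // 28 pages cover 659 results
  print(offsets.last); // 648

  for (final hit in response['hits']['hits'] as List) {
    print(toDownloadUrl(hit['_source']['master-image']['uri'] as String));
  }
}
```

With the total of 659 from the sample response, 28 requests (from=0 up to from=648) would cover every result, so a loop driven by hits.total could stop well before the fixed upper bound of 24*100 used in the crawler.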