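Preface

Recently I used a web crawler to scrape NASA (yes, the NASA you keep seeing in movies) and collected 10,000 images related to Mars exploration. No big deal, really. I was a little excited once it was done, hence this article. It covers the following:

- Why I scraped NASA's images
- How I scraped NASA's images (in great detail)
- What I got (high-resolution images)
- What secret I discovered (a big one)

Why I scraped NASA's images

I keep wondering what I would do if I lost my job one day, so I have been thinking about starting a self-media channel and chatting with everyone every day, in plain language, about things like the mysteries of history and the mysteries of the universe. That is why I turned my attention to NASA. NASA runs all kinds of space exploration missions, and the related articles, interviews, images, and videos are all public. It is a rare treasure trove.

How I scraped NASA's images (in great detail)

NASA's website is public, at https://www.nasa.gov/

The home page shows all kinds of content, and there is a search box in the top-right corner. Type in Mars, wait a moment, and a range of Mars-related content appears. One of the results is Mars exploration. Clicking it takes you to a new page, and choosing Images there takes you to the target page:

https://www.nasa.gov/mission_pages/mars/images/index.html

Scroll down and you will see a large button labeled "More Images". Click it and you will find that the new content is not loaded with the page itself but rendered asynchronously after an API request. Press F12 to open the browser's developer tools, repeat the steps above, and watch the network requests. One request stands out, and its URL looks important.

Let's look at the request URL first:

https://www.nasa.gov/api/2/ubernode/_search?size=24&from=24&sort=promo-date-time%3Adesc&q=((ubernode-type%3Aimage)%20AND%20(topics%3A3152))&_source_include=promo-date-time%2Cmaster-image%2Cnid%2Ctitle%2Ctopics%2Cmissions%2Ccollections%2Cother-tags%2Cubernode-type%2Cprimary-tag%2Csecondary-tag%2Ccardfeed-title%2Ctype%2Ccollection-asset-link%2Clink-or-attachment%2Cpr-leader-sentence%2Cimage-feature-caption%2Cattachments%2Curi

Note the parameters size=24&from=24. Clearly size is the number of images returned per request. Testing shows that from is the starting offset of the query, so changing it returns other results.

Now let's look at the response:

    {
      "took": 3,
      "timed_out": false,
      "_shards": {"total": 5, "successful": 5, "skipped": 0, "failed": 0},
      "hits": {
        "total": 659,
        "max_score": null,
        "hits": [
          {
            "_index": "nasa-public",
            "_type": "ubernode",
            "_id": "450040",
            "_score": null,
            "_source": {
              "image-feature-caption": "The Mars 2020 rover underwent an eye exam after several cameras were installed on the rover.",
              "topics": ["3140", "3152"],
              "nid": "450040",
              "title": "NASA 'Optometrists' Verify Mars 2020 Rover's 20/20 Vision",
              "type": "ubernode",
              "uri": "/image-feature/jpl/nasa-optometrists-verify-mars-2020-rovers-2020-vision",
              "collections": ["4525", "5246"],
              "link-or-attachment": "link",
              "missions": ["6336"],
              "primary-tag": "6336",
              "cardfeed-title": "NASA 'Optometrist' Verify Mars 2020 Rover's 20/20 Vision",
              "promo-date-time": "2019-08-05T17:49:00-04:00",
              "secondary-tag": "3140",
              "master-image": {
                "fid": "603128",
                "alt": "Engineers test cameras on the top of the Mars 2020 rover's mast and front chassis.",
                "width": "1600",
                "id": "603128",
                "title": "Engineers test cameras on the top of the Mars 2020 rover's mast and front chassis.",
                "uri": "public://thumbnails/image/pia23314-16.jpg",
                "height": "900"
              },
              "ubernode-type": "image"
            },
            "sort": [1565041740000]
          },
          {
            "_index": "nasa-public",
            "_type": "ubernode",
            "_id": "433172",
            "_score": null,
            "_source": {
              "image-feature-caption": "NASA still hasn't heard from the Opportunity rover, but at least we can see it again.",
              "topics": ["3152"],
              "nid": "433172",
              "title": "Opportunity Emerges in a Dusty Picture",
              "type": "ubernode",
              "uri": "/image-feature/opportunity-emerges-in-a-dusty-picture",
              "collections": ["7628"],
              "link-or-attachment": "link",
              "missions": ["3639"],
              "primary-tag": "3152",
              "cardfeed-title": "Opportunity Emerges in a Dusty Picture",
              "promo-date-time": "2018-09-26T12:39:00-04:00",
              "secondary-tag": "7628",
              "master-image": {
                "fid": "584263",
                "alt": "NASA's Opportunity rover appears as a blip in the center of this square",
                "width": "1400",
                "id": "584263",
                "title": "NASA's Opportunity rover appears as a blip in the center of this square",
                "uri": "public://thumbnails/image/pia22549-16.jpg",
                "height": "788"
              },
              "ubernode-type": "image"
            },
            "sort": [1537979940000]
          }
        ]
      }
    }

The JSON above was too long, so I removed the repeated entries. The hits array actually holds 24 items, the same as the number of images shown on the page, so it is fairly safe to conclude that the page is populated from this array. A closer comparison shows that the master-image field holds what we need: the image address, the image dimensions, and the image title.

Here is the code. It has three steps: assemble the request URL, fetch the content, and download the images. I use Dart.

    import 'dart:convert';
    import 'package:dio/dio.dart';

    main() async {
      // The page size is fixed at 24; only the starting offset needs to change
      for (int from = 0; from < 24 * 100; from = from + 24) {
        await getPage(from);
      }
    }

    // Fetch the information for each page and download the images
    Future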
