Preface

Python is a handy tool for collecting data, so this time I want to use it to scrape some questions and answers for practice.

1. Import modules

    import re
    from bs4 import BeautifulSoup
    import requests
    import time
    import json
    import pandas as pd
    import numpy as np

2. Status code

    r = requests.get('https://github.com/explore')
    r.status_code  # 200 means the request succeeded

3. Request headers, cookies, and start URL

    headers = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_14_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/80.0.3987.87 Safari/537.36'}
    # Cookie copied from a logged-in browser session; substitute your own.
    cookies = {'cookie': '_zap=3d979dbb-f25b-4014-8770-89045dec48f6;dvNZdAPTML2FU48f6;dvNZdAPTML2FU48f6;-eileT3E=|1561292196";tst=r;_ga=GA1.2.910277933.1582789012;q_c1=9a429b07b08a4ae1afe0a99386626304|1584073146000|1561373910000;_xsrf=bf1c5edf-75bd-4512-8319-02c650b7ad2c;_gid=GA1.2.1983259099.1586575835;l_n_c=1;l_cap_id="NDIxM2M4OWY4N2YwNDRjM2E3ODAxMDdmYmY2NGFiMTQ=|1586663749|ceda775ba80ff485b63943e0baf9968684237435";r_cap_id="OWY3OGQ1MDJhMjFjNDBiYzk0MDMxMmVlZDIwNzU0NzU=|1586663749|0948d23c731a8fa985614d3ed58edb6405303e99";cap_id="M2I5NmJkMzRjMjc3NGZjNDhiNzBmNDMyNDQ3NDlmNmE=|1586663749|dacf440ab7ad64214a939974e539f9b86ddb9eac";n_c=1;Hm_lvt_98beee57fd2ef70ccdd5ca52b9740c49=1586585625,1586587735,1586667228,1586667292;Hm_lpvt_98beee57fd2ef70ccdd5ca52b9740c49=1586667292;SESSIONID=GWBltmMTwz5oFeBTjRm4Akv8pFF6p8Y6qWkgUP4tjp6;JOID=UVkSBEJI6EKgHAipMkwAEWAkvEomDbkAwmJn4mY1kHHPVGfpYMxO3voUDK88UO62JqgwW5Up4hC2kX_KGO9xoKI=;osd=UlEXAU5L4EelEAuhN0kMEmghuUYlBbwFzmFv52M5k3nKUWvqaMlL0vkcCaowU-azI6QzU5As7hO-lHrGG-d0pa4=;capsion_ticket="2|1:0|10:1586667673|14:capsion_ticket|44:YTJkYmIyN2Q4YWI4NDI0Mzk0NjQ1YmIwYmUxZGYyNzY=|b49eb8176314b73e0ade9f19dae4b463fb970c8cbd1e6a07a6a0e535c0ab8ac3";z_c0="2|1:0|10:1586667694|4:z_c0|92:Mi4xOGc1X0dnQUFBQUFBOE84d3ZpU2hEeVlBQUFCZ0FsVk5ydTVfWHdDazlHMVM1eFU5QjlqamJxWVhvZ2xuWlhTaVJ3|bcd3601ae34951fe72fd3ffa359bcb4acd60462715edcd1e6c4e99776f9543b3";unlock_ticket="AMCRYboJGhEmAAAAYAJVTbankl4i-Y7Pzkta0e4momKdPG3NRc6GUQ==";KLBRSID=fb3eda1aa35a9ed9f88f346a7a3ebe83|1586667697|1586660346'}
    start_url = 'https://www.zhihu.com/api/v3/feed/topstory/recommend?session_token=c03069ed8f250472b687fd1ee704dd5b&desktop=true&page_number=5&limit=6&action=pull&ad_interval=-1&before_id=23'

4. Parsing with BeautifulSoup

    s = requests.Session()
    url = 'https://www.zhihu.com/'
    html = s.get(url=start_url, headers=headers, cookies=cookies, timeout=5)
    # Name the parser explicitly so BeautifulSoup does not have to guess one.
    soup = BeautifulSoup(html.content, 'html.parser')

    question = []  ## name
    question_address = []  ## url

    # Each recommended item on the feed is one card.
    temp1 = soup.find_all('div', class_='Card TopstoryItem TopstoryItem-isRecommend')
    for item in temp1:
        temp2 = item.find_all('div', itemprop="zhihu:question")
        # print(temp2)
        if temp2 != []:  #### some items are columns and the like; skip them for now
            question_address.append(temp2[0].find('meta', itemprop='url').get('content'))
            question.append(temp2[0].find('meta', itemprop='name').get('content'))

5. Storing the information

    question_focus_number = []  # followers
    question_answer_number = []  # answers

    for url in question_address:
        test = s.get(url=url, headers=headers, cookies=cookies, timeout=5)
        soup = BeautifulSoup(test.content, 'html.parser')
        info = soup.find_all('div', class_='QuestionPage')[0]
        # print(info)
        focus_number = info.find('meta', itemprop="zhihu:followerCount").get('content')
        answer_number = info.find('meta', itemprop="answerCount").get('content')
        question_focus_number.append(focus_number)
        question_answer_number.append(answer_number)

6. Organizing and outputting the information

    question_info = pd.DataFrame(
        list(zip(question, question_focus_number, question_answer_number)),
        columns=['questionName', 'NumberofFollowers', 'NumberofAnswers'],
    )
    # The counts were scraped as strings; convert them to integers before sorting.
    for item in ['NumberofFollowers', 'NumberofAnswers']:
        question_info[item] = np.array(question_info[item], dtype='int')
    # Sort by follower count, highest first.
    question_info.sort_values(by='NumberofFollowers', ascending=False)

7. Summary

Simple scraping is not hard, but once accounts and passwords are involved you need to be careful about what you scrape. Try not to put a burden on other people's servers (for example, use a longer sleep interval between requests). Do not use scraped data for commercial purposes. However good the technique, do not touch users' private data. Use crawlers reasonably, legally, and with restraint; otherwise you may bring unnecessary trouble on yourself.
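
The advice above about a longer sleep interval can be sketched as a small helper. This is only an illustration, not the code used in this post: `polite_fetch`, its `fetch` callback, and the 2-second default for `delay_seconds` are hypothetical names and values picked for the example.

```python
import time
from typing import Callable, Iterable, List, TypeVar

T = TypeVar("T")


def polite_fetch(
    urls: Iterable[str],
    fetch: Callable[[str], T],
    delay_seconds: float = 2.0,
    sleep: Callable[[float], None] = time.sleep,
) -> List[T]:
    """Call fetch(url) for each URL, pausing between consecutive requests.

    delay_seconds is an assumed value; pick one that suits the target site.
    sleep is injectable so the helper can be exercised without actually waiting.
    """
    results: List[T] = []
    for index, url in enumerate(urls):
        if index > 0:
            # Pause before every request except the first one.
            sleep(delay_seconds)
        results.append(fetch(url))
    return results
```

With the session from step 4, a call might look like `polite_fetch(question_address, lambda u: s.get(url=u, headers=headers, cookies=cookies, timeout=5))`, which would replace the bare `s.get` inside the loop of step 5.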
