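Avengers: Infinity War (Avengers 3) opened in mainland China on May 11, 2018, and by May 16 its cumulative box office had reached 1.525 billion yuan, already surpassing the box-office record for a single film in the Marvel series. It has to be said that Marvel films have become a cultural trend.

[Poster image]

Avengers 3 is the grand finale that Marvel spent ten years building toward, and Marvel has shown that it put a great deal of effort into giving us a superb film. I used the weekend to see it at the cinema. Afterwards, my personal impression was that both the battle effects and the plot were a pleasure to watch. The film also keeps the humorous style of its predecessors and often has the audience laughing out loud. If you have not seen it yet, it is well worth a trip to the cinema.

In this article I use Python to build a web crawler that scrapes Douban film reviews, analyze them, and then make a word cloud from them.

1 Analysis

First, decide what to scrape from the review page. I want the user name, whether the user has watched the film, the five-star rating, the comment time, the number of "useful" votes, and the comment text.

Then work out the URL structure of each page of comments. Comparing the URLs of the second and third pages shows a pattern: apart from the first page, the only thing that changes in the URL of each later page is the value of start=, which increases page by page; everything else stays the same.

2 Crawling the data

To crawl the data I mainly use the requests library and the XPath support in the lxml library. Douban is fairly friendly to crawlers, but it still has anti-crawler mechanisms: if you set no delay and fire off a large number of requests at once, your IP will be blocked. In addition, without logging in to Douban you can only see the first 10 pages of comments, so the HTTP requests must carry the cookie of your own account. Getting the cookie is not hard: log in to Douban in a browser and copy it from the developer tools.

I start crawling from the first page of reviews. The entry point is https://movie.douban.com/subject/24773958/comments?status=P. From each page I extract the URL of the next page and the content to be scraped, and then go on to visit the next page.

    import codecs
    import csv
    import random
    import time

    import jieba
    import pandas as pd
    import requests
    from lxml import etree


    def start_spider():
        base_url = 'https://movie.douban.com/subject/24773958/comments'
        start_url = base_url + '?start=0'
        number = 1
        html = request_get(start_url)
        while html.status_code == 200:
            selector = etree.HTML(html.text)
            # Get the comments on this page
            comments = selector.xpath("//div[@class='comment']")
            marvelthree = []
            for each in comments:
                marvelthree.append(get_comments(each))
            data = pd.DataFrame(marvelthree)
            # Write to the CSV file; 'a+' is append mode
            try:
                if number == 1:
                    # Only the first page writes the header row
                    csv_headers = ['user', 'watched', 'rating', 'comment_time', 'votes', 'content']
                    data.to_csv('./Marvel3_yingpping.csv', header=csv_headers, index=False, mode='a+', encoding='utf-8')
                else:
                    data.to_csv('./Marvel3_yingpping.csv', header=False, index=False, mode='a+', encoding='utf-8')
            except UnicodeEncodeError:
                print("Encoding error: the data could not be written to the file, skipping it")
            number += 1
            # Get the URL of the next page; stop when there is no "next" link
            nextpage = selector.xpath("//div[@id='paginator']/a[@class='next']/@href")
            if not nextpage:
                break
            html = request_get(base_url + nextpage[0])

I add a randomly chosen User-agent and a cookie to each request, and wait a random length of time before sending it, so that the IP is not blocked for making too many requests.

    def request_get(url):
        '''
        Use a Session to keep certain parameters across requests.
        It also keeps cookies between all requests made from the same Session instance.
        '''
        timeout = 3
        UserAgent_List = [
            "Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/41.0.2228.0 Safari/537.36",
            "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/41.0.2227.1 Safari/537.36",
            "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/41.0.2227.0 Safari/537.36",
            "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/41.0.2227.0 Safari/537.36",
            "Mozilla/5.0 (Windows NT 6.3; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/41.0.2226.0 Safari/537.36",
            "Mozilla/5.0 (Windows NT 6.4; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/41.0.2225.0 Safari/537.36",
            "Mozilla/5.0 (Windows NT 6.3; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/41.0.2225.0 Safari/537.36",
            "Mozilla/5.0 (Windows NT 5.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/41.0.2224.3 Safari/537.36",
            "Mozilla/5.0 (Windows NT 10.0) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/40.0.2214.93 Safari/537.36",
            "Mozilla/5.0 (Windows NT 6.3; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/37.0.2049.0 Safari/537.36",
            "Mozilla/5.0 (Windows NT 4.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/37.0.2049.0 Safari/537.36",
            "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/36.0.1985.67 Safari/537.36",
            "Mozilla/5.0 (Windows NT 5.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/36.0.1985.67 Safari/537.36",
            "Mozilla/5.0 (Windows NT 5.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/35.0.3319.102 Safari/537.36",
            "Mozilla/5.0 (Windows NT 5.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/35.0.2309.372 Safari/537.36",
            "Mozilla/5.0 (Windows NT 5.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/35.0.2117.157 Safari/537.36",
            "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_3) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/35.0.1916.47 Safari/537.36",
            "Mozilla/5.0 (Windows NT 5.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/34.0.1866.237 Safari/537.36",
        ]
        header = {
            'User-agent': random.choice(UserAgent_List),
            'Host': 'movie.douban.com',
            'Referer': 'https://movie.douban.com/subject/24773958/?from=showing',
        }
        session = requests.Session()
        cookie = {
            'cookie': "your cookie value",
        }
        # Wait a random 5 to 15 seconds before every request
        time.sleep(random.randint(5, 15))
        response = session.get(url, headers=header, cookies=cookie, timeout=timeout)
        if response.status_code != 200:
            print(response.status_code)
        return response

The last step is extracting the data:

    def get_comments(eachComment):
        commentlist = []
        user = eachComment.xpath("./h3/span[@class='comment-info']/a/text()")[0]  # user name
        watched = eachComment.xpath("./h3/span[@class='comment-info']/span[1]/text()")[0]  # watched or not
        rating = eachComment.xpath("./h3/span[@class='comment-info']/span[2]/@title")  # five-star rating
        if len(rating) > 0:
            rating = rating[0]
        comment_time = eachComment.xpath("./h3/span[@class='comment-info']/span[3]/@title")  # comment time
        if len(comment_time) > 0:
            comment_time = comment_time[0]
        else:
            # Some comments have no star rating, so span[2] holds the time;
            # move it to comment_time and leave the rating empty
            comment_time = rating
            rating = ''
        votes = eachComment.xpath("./h3/span[@class='comment-vote']/span/text()")[0]  # number of "useful" votes
        content = eachComment.xpath("./p/text()")[0]  # comment text
        commentlist.append(user)
        commentlist.append(watched)
        commentlist.append(rating)
        commentlist.append(comment_time)
        commentlist.append(votes)
        commentlist.append(content.strip())
        return commentlist

3 Making the word cloud

The scraped comments are one long run of text, so each sentence has to be split into words before the words can be counted. I use the jieba library to segment the comments, and hand the segmented output to the WordItOut website, which draws the cloud.

    def split_word():
        with codecs.open('Marvel3_yingpping.csv', 'r', 'utf-8') as csvfile:
            reader = csv.reader(csvfile)
            content_list = []
            for row in reader:
                try:
                    content_list.append(row[5])  # column 5 is the comment text
                except IndexError:
                    pass
            content = ''.join(content_list)
            seg_list = jieba.cut(content, cut_all=False)
            result = '\n'.join(seg_list)
            print(result)

The resulting word cloud looks like this:

[Word cloud image]

The word "Thanos" (灭霸) appears most often, which is no surprise: the plot of Avengers 3 is essentially Thanos travelling from planet to planet to collect the six Infinity Stones while the superheroes join forces to stop him and keep him from destroying the universe.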
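
The counting step mentioned in section 3 is handed off to WordItOut in this article, but it can also be done locally. The following is only a minimal sketch, not the method used above: `top_words`, its `min_len` filter, and the sample tokens are all hypothetical names and data introduced for illustration. It takes any iterable of already segmented words and returns the most frequent ones.

```python
from collections import Counter


def top_words(tokens, n=10, min_len=2):
    """Return the n most common tokens that are at least min_len characters long.

    Dropping very short tokens is an assumed, crude way to skip whitespace
    and single-character particles; it is not a real stop-word list.
    """
    cleaned = (t.strip() for t in tokens)
    counts = Counter(t for t in cleaned if len(t) >= min_len)
    return counts.most_common(n)


if __name__ == "__main__":
    # Made-up tokens standing in for the output of a word segmenter
    sample = ["灭霸", "的", "特效", "灭霸", "钢铁侠", "灭霸", "特效", " "]
    for word, freq in top_words(sample, n=3):
        print(word, freq)
```

In the pipeline described above, `tokens` would be the generator returned by `jieba.cut(content, cut_all=False)`.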

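
The pagination rule from section 1 (only the value of start= changes between pages) can be sketched as a small URL builder. This is an illustration under stated assumptions: `page_url` is a hypothetical helper, the step of 20 comments per page is an assumption the article does not state, and the real page URLs may carry further query parameters that are not shown here.

```python
from urllib.parse import urlencode

BASE_URL = "https://movie.douban.com/subject/24773958/comments"


def page_url(page, page_size=20, status="P"):
    """Build the URL of a 0-based comment page.

    page_size=20 is an assumed step for the start= parameter; adjust it
    to whatever step the site actually uses.
    """
    if page < 0:
        raise ValueError("page must be >= 0")
    query = urlencode({"start": page * page_size, "status": status})
    return f"{BASE_URL}?{query}"


if __name__ == "__main__":
    for page in range(3):
        print(page_url(page))
```

The crawler above follows the "next" link on each page instead, which avoids having to know the step in advance.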