The Wang Leehom gossip is huge! I scraped the comment section of the gossip post with Python and found it even juicier

Date: 2023-03-26 17:23:15 Python

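I opened Weibo this morning and, wow, the first thing it pushed to me was a gossip post, so I traced it back to its source. In short, Leehom's ex-wife could no longer stand by and published a post tearing into him.

At first I was a bit confused. Just two days earlier, Leehom had confirmed the divorce in a post of his own, and that post conveyed an amicable, relaxed atmosphere of quiet, peaceful days. The wording struck me as a little off, but I did not dwell on it.

I don't follow celebrities and have basically no feelings about any of them, but I first came to know Wang Leehom many years ago from Wahaha mineral water bottles. I can't remember exactly when it was, but Wahaha replaced Leehom as its spokesperson, and there was a lot of criticism online at the time. Looking at it now...

So, with the curiosity of a melon-eating onlooker, I read Li Jinglei's Weibo post, and wow, I really owe Leehom an Oscar. With a post this juicy, how could I skip the comment section? So I decided to scrape the comment data with Python. The main code is as follows:

    import re
    import requests

    # Fetch the content of one page of comments
    def get_one_page(url):
        headers = {
            'User-agent': 'Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/78.0.3880.4 Safari/537.36',
            'Host': 'weibo.cn',
            'Accept': 'application/json,text/plain,*/*',
            'Accept-Language': 'zh-CN,zh;q=0.9',
            'Accept-Encoding': 'gzip,deflate,br',
            'Cookie': 'OwnCookie',  # replace with your own cookie
            'DNT': '1',
            'Connection': 'keep-alive'
        }
        # Get the page HTML
        response = requests.get(url, headers=headers, verify=False)
        # Request succeeded
        if response.status_code == 200:
            # Return the HTML document, to be passed to the parsing function
            return response.text
        return None

    # Parse and save the comment information
    def save_one_page(html):
        # The pattern is kept as given in the source; its tag delimiters are missing there
        comments = re.findall('(.*?)', html)
        with open('comments.txt', 'a+', encoding='utf-8') as fp:
            # Skip the first match and process the rest
            for comment in comments[1:]:
                # Strip any remaining HTML tags
                result = re.sub('<.*?>', '', comment)
                # Keep only comments that are not replies to other users
                if 'reply@' not in result:
                    fp.write(result)

I won't go into the crawling and parsing process here. If anything is unclear, see: Weibo comment section scraping. Now that we have the data, let's use Python to look at the TOP 10 words. The main code is as follows:

    import jieba
    from pyecharts import options as opts
    from pyecharts.charts import Bar
    from pyecharts.globals import ThemeType

    # Load the stop words, one per line
    stop_words = []
    with open('stop_words.txt', 'r', encoding='utf-8') as f:
        lines = f.readlines()
        for line in lines:
            stop_words.append(line.strip())
    with open('comments.txt', 'r', encoding='utf-8') as f:
        content = f.read()
    # Segment the text with jieba
    word_list = jieba.cut(content)
    words = []
    for word in word_list:
        if word not in stop_words:
            words.append(word)
    # Count how often each word occurs
    wordcount = {}
    for word in words:
        if word != '':
            wordcount[word] = wordcount.get(word, 0) + 1
    # Sort by count, descending, and keep the top 10
    wordtop = sorted(wordcount.items(), key=lambda x: x[1], reverse=True)[:10]
    wx = []
    wy = []
    for w in wordtop:
        wx.append(w[0])
        wy.append(w[1])
    (
        Bar(init_opts=opts.InitOpts(theme=ThemeType.MACARONS))
        .add_xaxis(wx)
        .add_yaxis('数量', wy)
        .reversal_axis()
        .set_global_opts(
            title_opts=opts.TitleOpts(title='评论词TOP10'),
            yaxis_opts=opts.AxisOpts(name='word'),
            xaxis_opts=opts.AxisOpts(name='quantity'),
        )
        .set_series_opts(label_opts=opts.LabelOpts(position='right'))
    ).render_notebook()

Let's look at the result. I won't comment on it here; next, we generate a word cloud to view the comment section. The main code is as follows:

    import jieba
    import numpy as np
    from PIL import Image
    from wordcloud import WordCloud

    def jieba_():
        # Load the stop words, one per line
        stop_words = []
        with open('stop_words.txt', 'r', encoding='utf-8') as f:
            lines = f.readlines()
            for line in lines:
                stop_words.append(line.strip())
        with open('comments.txt', 'r', encoding='utf-8') as f:
            content = f.read()
        # Segment the text with jieba
        word_list = jieba.cut(content)
        words = []
        for word in word_list:
            if word not in stop_words:
                words.append(word)
        global word_cloud
        # Join the words with commas
        word_cloud = ','.join(words)

    def cloud():
        # Open the background image for the word cloud
        cloud_mask = np.array(Image.open('bg.png'))
        # Define the word cloud's properties
        wc = WordCloud(
            # White background color
            background_color='white',
            # Background mask
            mask=cloud_mask,
            # Maximum number of words to display
            max_words=200,
            # Font that can display Chinese
            font_path='./fonts/simhei.ttf',
            # Maximum font size
            max_font_size=100
        )
        global word_cloud
        # Generate the word cloud from the text
        x = wc.generate(word_cloud)
        # Render the word cloud as an image
        image = x.to_image()
        # Show the word cloud image
        image.show()
        # Save the word cloud image
        wc.to_file('melon.png')

Let's look at the result. The source code has been organized; if you need it, reply wlh to the official account to get it.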
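The parsing and counting steps can also be tried without a network connection or third-party packages. The following is only a minimal sketch under stated assumptions, not the article's exact code: the `<span class="ctt">` markup, the helper names `extract_comments` and `top_words`, and the sample HTML are all hypothetical, and whitespace splitting stands in for jieba segmentation. Unlike the article's code, it does not skip the first match.

```python
import re
from collections import Counter

# Assumption: each comment body sits inside <span class="ctt">...</span>.
# The real page markup may differ; adjust the pattern to what you see.
COMMENT_PATTERN = re.compile(r'<span class="ctt">(.*?)</span>', re.S)
TAG_PATTERN = re.compile(r'<.*?>', re.S)


def extract_comments(html, reply_marker='reply@'):
    """Return tag-free comment texts, dropping replies to other users."""
    comments = []
    for raw in COMMENT_PATTERN.findall(html):
        # Remove any tags nested inside the comment body
        text = TAG_PATTERN.sub('', raw).strip()
        if text and reply_marker not in text:
            comments.append(text)
    return comments


def top_words(tokens, stop_words=(), n=10):
    """Count tokens, ignoring blanks and stop words, and return the top n."""
    stop = set(stop_words)
    counts = Counter(t for t in tokens if t.strip() and t not in stop)
    return counts.most_common(n)


# Made-up sample page for illustration only
sample_html = (
    '<div><span class="ctt">great <a href="#">post</a></span></div>'
    '<div><span class="ctt">reply@someone: agreed</span></div>'
    '<div><span class="ctt">great melon</span></div>'
)
comments = extract_comments(sample_html)
# Whitespace splitting stands in for jieba.cut in this sketch
tokens = ' '.join(comments).split()
print(comments)
print(top_words(tokens, stop_words=['post']))
```

`Counter.most_common` replaces the manual dictionary plus `sorted` call, and converting the stop words to a set keeps each membership check cheap on a large comment file.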
