A Solution to Font-Based Anti-Scraping: Getting Past Douyin's Anti-Scraping Mechanism

Date: 2023-03-26 01:53:44 Python

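Font anti-scraping: a case study

When scraping some websites you occasionally run into this situation: the page renders normally in the browser, but the text fetched with Python is garbled, and the page source viewed through the developer tools (F12) is garbled as well. This usually means the site uses a font-based anti-scraping technique.

1. Preparation

URL: https://www.iesdouyin.com/sha...

2. Fetch the data and analyze the font encryption

Task: scrape data such as the follower count and the like count from a user's profile page. The page looks as shown in the figure below.

[Figure: profile page content]

Before writing any code, we need to locate the elements that hold the target data. While doing so, we find some strange symbols in the HTML: the important figures on the page are unreadable characters, and the places that should show digits contain placeholder glyphs instead.

Note that what the Elements panel of Chrome's developer tools displays is not necessarily the raw text. To find out what those symbols really are, check the page source. There the digits are written as character references of the form &#xe602;, which is what the replacement step in the full code relies on.

Douyin maps these digits onto glyphs and serves its own custom font for them. The font in use can be found in the Network panel of the developer tools; font files usually end in woff or ttf. After refreshing several times, the font file requested is always the same:

https://s3.pstatp.com/ies/res..._falcon/static/font/iconfont_9eb9a50.woff

Download this file first and open it in FontCreator to see how the glyphs relate to the digits.

[Figure: the font's glyphs as shown in FontCreator]

Next, install fontTools (pip install fonttools), open the font file with it, and convert it to XML. The downloaded font is saved locally as douyin.ttf in this example:

    from fontTools.ttLib import TTFont

    font_1 = TTFont('douyin.ttf')
    font_1.saveXML('font_1.xml')

The XML contains the mapping we are looking for: its cmap table links each code point to a glyph name. Combining that with the glyph-to-digit relationship seen in FontCreator breaks the obfuscation.

3. Implementing the font mapping in code

Mapping table:

    regex_list = [
        {'name': ['0xe602', '0xe60e', '0xe618'], 'value': '1'},
        {'name': ['0xe603', '0xe60d', '0xe616'], 'value': '0'},
        {'name': ['0xe604', '0xe611', '0xe61a'], 'value': '3'},
        {'name': ['0xe605', '0xe610', '0xe617'], 'value': '2'},
        {'name': ['0xe606', '0xe60c', '0xe619'], 'value': '4'},
        {'name': ['0xe607', '0xe60f', '0xe61b'], 'value': '5'},
        {'name': ['0xe608', '0xe612', '0xe61f'], 'value': '6'},
        {'name': ['0xe609', '0xe615', '0xe61e'], 'value': '9'},
        {'name': ['0xe60a', '0xe613', '0xe61c'], 'value': '7'},
        {'name': ['0xe60b', '0xe614', '0xe61d'], 'value': '8'},
    ]

4. Full code

    #!/usr/bin/env python
    # -*- coding: utf-8 -*-
    import re

    import requests
    from lxml import etree

    start_url = 'https://www.iesdouyin.com/share/user/88445518961'


    def get_real_num(content):
        # Turn references such as &#xe602; into the plain text 0xe602 so that
        # the table below can match them. Note this strips every ';' in the page.
        content = content.replace('&#', '0').replace(';', '')
        regex_list = [
            {'name': ['0xe602', '0xe60e', '0xe618'], 'value': '1'},
            {'name': ['0xe603', '0xe60d', '0xe616'], 'value': '0'},
            {'name': ['0xe604', '0xe611', '0xe61a'], 'value': '3'},
            {'name': ['0xe605', '0xe610', '0xe617'], 'value': '2'},
            {'name': ['0xe606', '0xe60c', '0xe619'], 'value': '4'},
            {'name': ['0xe607', '0xe60f', '0xe61b'], 'value': '5'},
            {'name': ['0xe608', '0xe612', '0xe61f'], 'value': '6'},
            {'name': ['0xe609', '0xe615', '0xe61e'], 'value': '9'},
            {'name': ['0xe60a', '0xe613', '0xe61c'], 'value': '7'},
            {'name': ['0xe60b', '0xe614', '0xe61d'], 'value': '8'},
        ]
        # Replace every obfuscated code with the digit it stands for.
        for i1 in regex_list:
            for font_code in i1['name']:
                content = re.sub(font_code, str(i1['value']), content)

        html = etree.HTML(content)
        douyin_info = {}

        # Douyin ID
        douyin_id = ''.join(html.xpath(
            "//div[@class='personal-card']/div[@class='info1']/p[@class='shortid']/text()"))
        douyin_id = douyin_id.replace('抖音ID:', '').replace(' ', '')
        i_id = ''.join(html.xpath(
            "//div[@class='personal-card']/div[@class='info1']/p[@class='shortid']/i/text()"))
        douyin_info['douyin_id'] = str(douyin_id) + str(i_id)

        # Following count
        douyin_info['follow_count'] = ''.join(html.xpath(
            "//div[@class='personal-card']/div[@class='info2']/p[@class='follow-info']"
            "//span[@class='focus block']//i/text()"))

        # Followers
        fans_value = ''.join(html.xpath(
            "//div[@class='personal-card']/div[@class='info2']/p[@class='follow-info']"
            "//span[@class='follower block']//i[@class='icon iconfont follow-num']/text()"))
        unit = html.xpath(
            "//div[@class='personal-card']/div[@class='info2']/p[@class='follow-info']"
            "//span[@class='follower block']/span[@class='num']/text()")
        # A trailing 'w' means the value is in units of 10,000 with one decimal place.
        if unit and unit[-1].strip() == 'w':
            douyin_info['fans'] = str(float(fans_value) / 10) + 'w'
            fans_count = douyin_info['fans'][:-1]
            fans_count = float(fans_count)
            fans_count = fans_count * 10000
            douyin_info['fans_count'] = fans_count
        else:
            douyin_info['fans'] = fans_value
            douyin_info['fans_count'] = fans_value

        # Likes
        like = ''.join(html.xpath(
            "//div[@class='personal-card']/div[@class='info2']/p[@class='follow-info']"
            "//span[@class='liked-num block']//i[@class='icon iconfont follow-num']/text()"))
        unit = html.xpath(
            "//div[@class='personal-card']/div[@class='info2']/p[@class='follow-info']"
            "//span[@class='liked-num block']/span[@class='num']/text()")
        if unit and unit[-1].strip() == 'w':
            douyin_info['like'] = str(float(like) / 10) + 'w'
            like_count = douyin_info['like'][:-1]
            like_count = float(like_count)
            like_count = like_count * 10000
            douyin_info['like_count'] = like_count
        else:
            douyin_info['like'] = like
            douyin_info['like_count'] = like

        # Number of posted videos
        worko_count = ''.join(html.xpath("//div[@class='video-tab']/div/div[1]//i/text()"))
        douyin_info['work_count'] = worko_count
        return douyin_info


    def get_html():
        header = {
            'user-agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 '
                          '(KHTML, like Gecko) Chrome/68.0.3440.75 Safari/537.36'
        }
        response = requests.get(url=start_url, headers=header, verify=False)
        return response.text


    def run():
        content = get_html()
        info = get_real_num(content)
        print(info)


    if __name__ == '__main__':
        run()

5. Results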
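
As a supplement to the mapping step, here is a minimal sketch of another way to apply the table from step 3. It is not the article's own method: instead of stripping every "&#" and ";" from the page, it rewrites only the hexadecimal character references that appear in the table and leaves everything else untouched. The names FONT_DIGITS, CODE_TO_DIGIT and decode_font_digits are hypothetical, and the sketch assumes the digits appear in the page source as references of the form &#xe602;, as the replacement logic in the full code implies.

```python
import re

# Digit -> private-use code points, copied from the mapping table in step 3.
FONT_DIGITS = {
    "1": (0xE602, 0xE60E, 0xE618),
    "0": (0xE603, 0xE60D, 0xE616),
    "3": (0xE604, 0xE611, 0xE61A),
    "2": (0xE605, 0xE610, 0xE617),
    "4": (0xE606, 0xE60C, 0xE619),
    "5": (0xE607, 0xE60F, 0xE61B),
    "6": (0xE608, 0xE612, 0xE61F),
    "9": (0xE609, 0xE615, 0xE61E),
    "7": (0xE60A, 0xE613, 0xE61C),
    "8": (0xE60B, 0xE614, 0xE61D),
}

# Inverted lookup: code point -> digit (hypothetical helper table).
CODE_TO_DIGIT = {
    code: digit for digit, codes in FONT_DIGITS.items() for code in codes
}

# Matches hexadecimal character references such as &#xe602;
ENTITY_RE = re.compile(r"&#x([0-9a-fA-F]+);")


def decode_font_digits(html_text):
    """Replace mapped character references with digits; keep all others."""

    def replace(match):
        code = int(match.group(1), 16)
        # Unknown references are returned unchanged.
        return CODE_TO_DIGIT.get(code, match.group(0))

    return ENTITY_RE.sub(replace, html_text)
```

For example, decode_font_digits("<i>&#xe602;&#xe603;&#xe61d;</i>") returns "<i>108</i>", while a reference that is not in the table, such as &#x4e2d;, passes through unchanged. If the site ships a different font file, the table would need to be rebuilt from that font.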