Python Web Scraping for Beginners: How Do You Fetch Web Pages with Python, and What Is the Basic Flow?

Date: 2023-03-26 15:41:29 | Python

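The basic flow of crawling web pages with Python is as follows. First, pick a subset of carefully chosen seed URLs and put them into the to-crawl URL queue. Then take a URL from the to-crawl queue, resolve its DNS to obtain the host IP, download the page the URL points to, and store it in the downloaded page store. The URL is also added to the crawled URL queue. Next, analyze the downloaded page data to extract other URLs, compare them against the URLs already crawled to remove duplicates, and finally put the deduplicated URLs into the to-crawl queue, which starts the next cycle.

1. Implementing HTTP requests with urllib2/urllib

urllib2 and urllib are two built-in modules in Python 2. HTTP functionality is implemented mainly with urllib2, with urllib as a helper. urllib2 provides a basic function, urlopen, which fetches data by sending a request to the specified URL. The simplest form is:

    import urllib2

    response = urllib2.urlopen('http://www.zhihu.com')
    html = response.read()
    print html

The call above to http://www.zhihu.com can actually be split into two steps, the request and the response, in the following form:

    import urllib2

    # request
    request = urllib2.Request('http://www.zhihu.com')
    # response
    response = urllib2.urlopen(request)
    html = response.read()
    print html

A POST request is implemented like this:

    import urllib
    import urllib2

    url = 'http://www.xxxxxx.com/login'
    postdata = {'username': 'qiye', 'password': 'qiye_pass'}
    # The form data must be encoded into a format urllib2 understands; urllib does this
    data = urllib.urlencode(postdata)
    req = urllib2.Request(url, data)
    response = urllib2.urlopen(req)
    html = response.read()

2. Request header handling

Rewrite the example above to add request headers, setting the User-Agent and Referer fields:

    import urllib
    import urllib2

    url = 'http://www.xxxxxx.com/login'
    user_agent = 'Mozilla/4.0 (compatible; MSIE 5.5; Windows NT)'
    referer = 'http://www.xxxxxx.com/'
    postdata = {'username': 'qiye', 'password': 'qiye_pass'}
    # Write user_agent and referer into the header information
    headers = {'User-Agent': user_agent, 'Referer': referer}
    data = urllib.urlencode(postdata)
    req = urllib2.Request(url, data, headers)
    response = urllib2.urlopen(req)
    html = response.read()

3. Cookie handling

urllib2 can also handle cookies automatically: an HTTPCookieProcessor is attached to the opener, and a cookielib.CookieJar object manages the cookies. To read the value of a cookie item, do this:

    import urllib2
    import cookielib

    cookie = cookielib.CookieJar()
    opener = urllib2.build_opener(urllib2.HTTPCookieProcessor(cookie))
    response = opener.open('http://www.zhihu.com')
    for item in cookie:
        print item.name + ':' + item.value

Sometimes, however, we do not want urllib2 to handle cookies automatically and would rather add the cookie content ourselves. This can be done by setting a request header:

    import urllib2

    opener = urllib2.build_opener()
    opener.addheaders.append(('Cookie', 'email=' + "xxxxxxx@163.com"))
    req = urllib2.Request("http://www.zhihu.com/")
    response = opener.open(req)
    print response.headers
    retdata = response.read()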
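The crawl loop described at the start of this article (seed URLs, a to-crawl queue, a crawled list, link extraction, deduplication) can be sketched roughly as below. This is a minimal illustration, not the author's implementation. It assumes Python 3, where the role of urllib2 is taken over by urllib.request. The names LinkParser, extract_links, fetch, crawl, and max_pages, and the User-Agent string, are hypothetical and chosen only for this example. An in-memory deque and set stand in for the URL queues, and a dict stands in for the downloaded page store.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin, urldefrag
from urllib.request import Request, urlopen


class LinkParser(HTMLParser):
    """Collects the href value of every <a> tag in a page."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def extract_links(base_url, html):
    """Return absolute http(s) URLs found in html, with #fragments removed."""
    parser = LinkParser()
    parser.feed(html)
    found = []
    for href in parser.links:
        absolute, _fragment = urldefrag(urljoin(base_url, href))
        if absolute.startswith(("http://", "https://")):
            found.append(absolute)
    return found


def fetch(url, timeout=10):
    """Download one page; DNS resolution happens inside urlopen."""
    # The User-Agent value is a placeholder chosen for this sketch.
    request = Request(url, headers={"User-Agent": "Mozilla/5.0 (sketch)"})
    with urlopen(request, timeout=timeout) as response:
        charset = response.headers.get_content_charset() or "utf-8"
        return response.read().decode(charset, errors="replace")


def crawl(seeds, fetch_page=fetch, max_pages=10):
    """Breadth-first crawl starting from seeds; returns (crawled, pages)."""
    to_crawl = deque()  # URLs waiting to be crawled
    queued = set()      # every URL ever queued, used for deduplication
    for seed in seeds:
        if seed not in queued:
            queued.add(seed)
            to_crawl.append(seed)

    crawled = []        # URLs already downloaded, in download order
    pages = {}          # downloaded page store: url -> html
    while to_crawl and len(crawled) < max_pages:
        url = to_crawl.popleft()
        try:
            html = fetch_page(url)
        except (OSError, ValueError):
            continue    # skip pages that fail to download
        pages[url] = html
        crawled.append(url)
        # Extract new URLs, drop the ones already seen, queue the rest.
        for link in extract_links(url, html):
            if link not in queued:
                queued.add(link)
                to_crawl.append(link)
    return crawled, pages
```

Calling crawl(['http://www.zhihu.com'], max_pages=5) would run the loop against the live site. The fetch_page parameter exists so that a stub function can replace the network call when trying the loop offline. A real crawler would also need to consider things this sketch leaves out, such as robots.txt, rate limiting, and persistent storage.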
