
Parsing URLs with Python's urllib.parse Library

Published: 2023-03-14 21:12:19, 科技观察

The urllib.parse module in Python provides a number of functions for parsing and building URLs.

Parsing

The urlparse() function parses a URL into a ParseResult object. The object contains six elements: scheme, netloc (network location), path, params (path parameters), query, and fragment.

    from urllib.parse import urlparse

    url = 'http://user:pwd@domain:80/path;params?query=queryarg#fragment'
    parsed_result = urlparse(url)
    print('parsed_result contains', len(parsed_result), 'elements')
    print(parsed_result)

The result is:

    parsed_result contains 6 elements
    ParseResult(scheme='http', netloc='user:pwd@domain:80', path='/path', params='params', query='query=queryarg', fragment='fragment')

ParseResult inherits from namedtuple, so each part of the URL can be read either by index or by attribute name. For convenience, ParseResult also provides username, password, hostname, and port attributes, which break netloc down further.

    print('scheme:', parsed_result.scheme)
    print('netloc:', parsed_result.netloc)
    print('path:', parsed_result.path)
    print('params:', parsed_result.params)
    print('query:', parsed_result.query)
    print('fragment:', parsed_result.fragment)
    print('username:', parsed_result.username)
    print('password:', parsed_result.password)
    print('hostname:', parsed_result.hostname)
    print('port:', parsed_result.port)

Result:

    scheme: http
    netloc: user:pwd@domain:80
    path: /path
    params: params
    query: query=queryarg
    fragment: fragment
    username: user
    password: pwd
    hostname: domain
    port: 80

Besides urlparse(), the similar urlsplit() function also splits a URL. The difference is that urlsplit() does not separate the path parameters (params) from the path. When several path segments carry parameters, urlparse() gives a misleading result, because it only splits off the parameters of the last segment:

    url = 'http://user:pwd@domain:80/path1;params1/path2;params2?query=queryarg#fragment'
    parsed_result = urlparse(url)
    print(parsed_result)
    print('parsed.path:', parsed_result.path)
    print('parsed.params:', parsed_result.params)

The result is:

    ParseResult(scheme='http', netloc='user:pwd@domain:80', path='/path1;params1/path2', params='params2', query='query=queryarg', fragment='fragment')
    parsed.path: /path1;params1/path2
    parsed.params: params2

In this case, use urlsplit(), which leaves the parameters inside the path:

    from urllib.parse import urlsplit

    split_result = urlsplit(url)
    print(split_result)
    print('split.path:', split_result.path)
    # SplitResult has no params attribute

The result is:

    SplitResult(scheme='http', netloc='user:pwd@domain:80', path='/path1;params1/path2;params2', query='query=queryarg', fragment='fragment')
    split.path: /path1;params1/path2;params2

If you only want to split off the fragment identifier at the end of the URL, use the urldefrag() function:

    from urllib.parse import urldefrag

    url = 'http://user:pwd@domain:80/path1;params1/path2;params2?query=queryarg#fragment'
    d = urldefrag(url)
    print(d)
    print('url:', d.url)
    print('fragment:', d.fragment)

The result is:

    DefragResult(url='http://user:pwd@domain:80/path1;params1/path2;params2?query=queryarg', fragment='fragment')
    url: http://user:pwd@domain:80/path1;params1/path2;params2?query=queryarg
    fragment: fragment

Building URLs

Both ParseResult and SplitResult objects have a geturl() method that returns the complete URL string:

    print(parsed_result.geturl())
    print(split_result.geturl())

The result is:

    http://user:pwd@domain:80/path1;params1/path2;params2?query=queryarg#fragment
    http://user:pwd@domain:80/path1;params1/path2;params2?query=queryarg#fragment

However, geturl() is only available on ParseResult and SplitResult objects. To assemble a URL from an ordinary tuple, use the urlunparse() function:

    from urllib.parse import urlunparse

    url_compos = ('http', 'user:pwd@domain:80', '/path1;params1/path2', 'params2', 'query=queryarg', 'fragment')
    print(urlunparse(url_compos))

The result is:

    http://user:pwd@domain:80/path1;params1/path2;params2?query=queryarg#fragment

Converting relative paths to absolute URLs

urllib.parse also provides the urljoin() function, which resolves a relative path against a base URL and returns an absolute URL.

    from urllib.parse import urljoin

    print(urljoin('http://www.example.com/path/file.html', 'anotherfile.html'))
    print(urljoin('http://www.example.com/path/', 'anotherfile.html'))
    print(urljoin('http://www.example.com/path/file.html', '../anotherfile.html'))
    print(urljoin('http://www.example.com/path/file.html', '/anotherfile.html'))

Result:

    http://www.example.com/path/anotherfile.html
    http://www.example.com/path/anotherfile.html
    http://www.example.com/anotherfile.html
    http://www.example.com/anotherfile.html

Building and parsing query parameters

Use the urlencode() function to convert a dict into a valid query string:

    from urllib.parse import urlencode

    query_args = {'name': 'dark sun', 'country': '中国'}
    query_args = urlencode(query_args)
    print(query_args)

The result is:

    name=dark+sun&country=%E4%B8%AD%E5%9B%BD

The space and the non-ASCII characters have been escaped correctly. In the other direction, parse_qs() parses a query string into a dict whose values are lists:

    from urllib.parse import parse_qs

    print(parse_qs(query_args))

The result is:

    {'name': ['dark sun'], 'country': ['中国']}

If you only want to escape special characters, use the quote or quote_plus function. quote_plus is the more aggressive of the two: it also escapes /, which quote leaves alone by default, and it encodes spaces as +.

    from urllib.parse import quote, quote_plus, urlencode

    url = 'http://localhost:1080/~hello!/'
    print('urlencode:', urlencode({'url': url}))
    print('quote:', quote(url))
    print('quote_plus:', quote_plus(url))

The result is:

    urlencode: url=http%3A%2F%2Flocalhost%3A1080%2F%7Ehello%21%2F
    quote: http%3A//localhost%3A1080/%7Ehello%21/
    quote_plus: http%3A%2F%2Flocalhost%3A1080%2F%7Ehello%21%2F

The output shows that urlencode uses quote_plus for escaping. (This output comes from a Python version earlier than 3.7; from 3.7 on, ~ is treated as an unreserved character and is left unescaped.)

For the reverse operation, use the unquote or unquote_plus function:

    from urllib.parse import unquote, unquote_plus

    encoded_url = 'http%3A%2F%2Flocalhost%3A1080%2F%7Ehello%21%2F'
    print(unquote(encoded_url))
    print(unquote_plus(encoded_url))

The result is:

    http://localhost:1080/~hello!/
    http://localhost:1080/~hello!/

Here unquote correctly reverses the output of quote_plus as well. The two functions differ only in how they treat +: unquote_plus decodes it to a space, while unquote leaves it unchanged.
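
As a rough illustration of how these functions combine, the sketch below changes one query parameter of a URL by chaining urlsplit(), parse_qs(), urlencode(), and urlunsplit(), and then shows how unquote and unquote_plus differ on a +. The helper name set_query_param and the example.com URL are assumptions made for this example and do not come from the article; this is one possible approach, not the only one.

```python
from urllib.parse import (
    parse_qs, quote_plus, unquote, unquote_plus,
    urlencode, urlsplit, urlunsplit,
)


def set_query_param(url, key, value):
    """Return url with query parameter key set to value (hypothetical helper)."""
    parts = urlsplit(url)
    # parse_qs maps every key to a list of decoded values
    query = parse_qs(parts.query, keep_blank_values=True)
    query[key] = [value]
    # doseq=True expands each list back into repeated key=value pairs
    new_query = urlencode(query, doseq=True)
    # SplitResult is a namedtuple, so _replace returns a modified copy
    return urlunsplit(parts._replace(query=new_query))


updated = set_query_param(
    'http://www.example.com/search?q=url+parsing&page=1', 'page', '2')
print(updated)  # http://www.example.com/search?q=url+parsing&page=2

encoded = quote_plus('dark sun')
print(encoded)                # dark+sun
print(unquote(encoded))       # dark+sun (the + is left unchanged)
print(unquote_plus(encoded))  # dark sun
```

Going through parse_qs() and urlencode() rather than editing the query string by hand means the values are decoded and re-escaped consistently, at the cost of normalizing how the original query was encoded.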