Parsing URLs with Python's urllib.parse library

Date: 2023-03-14 21:12:19 | 科技观察 (Technology Watch)

Python's urllib.parse module provides a set of functions for parsing and building URLs.

Parsing

The urlparse() function parses a URL into a ParseResult object. The object contains six elements:

- scheme: the protocol
- netloc: the network location (user info, host, and port)
- path: the path
- params: the path parameters
- query: the query string
- fragment: the fragment identifier

    from urllib.parse import urlparse

    url = 'http://user:pwd@domain:80/path;params?query=queryarg#fragment'
    parsed_result = urlparse(url)
    print('parsed_result contains', len(parsed_result), 'elements')
    print(parsed_result)

Result:

    parsed_result contains 6 elements
    ParseResult(scheme='http', netloc='user:pwd@domain:80', path='/path', params='params', query='query=queryarg', fragment='fragment')

ParseResult is a namedtuple subclass, so each part of the URL can be read either by index or by named attribute. For convenience, ParseResult also provides username, password, hostname, and port, which break netloc down further.

    # The six tuple fields
    print('scheme:', parsed_result.scheme)
    print('netloc:', parsed_result.netloc)
    print('path:', parsed_result.path)
    print('params:', parsed_result.params)
    print('query:', parsed_result.query)
    print('fragment:', parsed_result.fragment)
    # Extra attributes derived from netloc
    print('username:', parsed_result.username)
    print('password:', parsed_result.password)
    print('hostname:', parsed_result.hostname)
    print('port:', parsed_result.port)

Result:

    scheme: http
    netloc: user:pwd@domain:80
    path: /path
    params: params
    query: query=queryarg
    fragment: fragment
    username: user
    password: pwd
    hostname: domain
    port: 80

Besides urlparse(), there is a similar function, urlsplit(), that also splits a URL. The difference is that urlsplit() does not separate the path parameters (params) from the path. This matters when more than one path segment carries parameters, because urlparse() only splits off the parameters of the last segment:

    url = 'http://user:pwd@domain:80/path1;params1/path2;params2?query=queryarg#fragment'
    parsed_result = urlparse(url)
    print(parsed_result)
    print('parsed.path:', parsed_result.path)
    print('parsed.params:', parsed_result.params)

Result:

    ParseResult(scheme='http', netloc='user:pwd@domain:80', path='/path1;params1/path2', params='params2', query='query=queryarg', fragment='fragment')
    parsed.path: /path1;params1/path2
    parsed.params: params2

In this case you can use urlsplit(), which leaves the parameters of every segment inside path:

    from urllib.parse import urlsplit

    split_result = urlsplit(url)
    print(split_result)
    print('split.path:', split_result.path)
    # SplitResult has no params attribute

Result:

    SplitResult(scheme='http', netloc='user:pwd@domain:80', path='/path1;params1/path2;params2', query='query=queryarg', fragment='fragment')
    split.path: /path1;params1/path2;params2

If you only want to split the fragment identifier off the end of a URL, use the urldefrag() function:

    from urllib.parse import urldefrag

    url = 'http://user:pwd@domain:80/path1;params1/path2;params2?query=queryarg#fragment'
    d = urldefrag(url)
    print(d)
    print('url:', d.url)
    print('fragment:', d.fragment)

Result:

    DefragResult(url='http://user:pwd@domain:80/path1;params1/path2;params2?query=queryarg', fragment='fragment')
    url: http://user:pwd@domain:80/path1;params1/path2;params2?query=queryarg
    fragment: fragment

Building URLs

ParseResult and SplitResult objects both have a geturl() method that returns the complete URL as a string.

    print(parsed_result.geturl())
    print(split_result.geturl())

Result:

    http://user:pwd@domain:80/path1;params1/path2;params2?query=queryarg#fragment
    http://user:pwd@domain:80/path1;params1/path2;params2?query=queryarg#fragment

However, geturl() is only available on result objects such as ParseResult and SplitResult. To assemble a URL from a plain tuple, use the urlunparse() function:

    from urllib.parse import urlunparse

    # (scheme, netloc, path, params, query, fragment)
    url_compos = ('http', 'user:pwd@domain:80', '/path1;params1/path2', 'params2', 'query=queryarg', 'fragment')
    print(urlunparse(url_compos))

Result:

    http://user:pwd@domain:80/path1;params1/path2;params2?query=queryarg#fragment

Converting relative paths to absolute URLs

urllib.parse also provides a urljoin() function that resolves a relative path against a base URL and returns an absolute URL.

    from urllib.parse import urljoin

    print(urljoin('http://www.example.com/path/file.html', 'anotherfile.html'))
    print(urljoin('http://www.example.com/path/', 'anotherfile.html'))
    print(urljoin('http://www.example.com/path/file.html', '../anotherfile.html'))
    print(urljoin('http://www.example.com/path/file.html', '/anotherfile.html'))

Result:

    http://www.example.com/path/anotherfile.html
    http://www.example.com/path/anotherfile.html
    http://www.example.com/anotherfile.html
    http://www.example.com/anotherfile.html

Building and parsing query parameters

Use the urlencode() function to convert a dict into a valid query string:

    from urllib.parse import urlencode

    query_args = {'name': 'dark sun', 'country': '中国'}
    query_args = urlencode(query_args)
    print(query_args)

Result:

    name=dark+sun&country=%E4%B8%AD%E5%9B%BD

The space and the non-ASCII characters have been escaped correctly. In the other direction, parse_qs() parses a query string back into a dict whose values are lists.

    from urllib.parse import parse_qs

    print(parse_qs(query_args))

Result:

    {'name': ['dark sun'], 'country': ['中国']}

If you only want to escape special characters, use the quote or quote_plus function. quote_plus is more aggressive than quote: it also escapes / and encodes a space as +.

    from urllib.parse import quote, quote_plus, urlencode

    url = 'http://localhost:1080/~hello!/'
    print('urlencode:', urlencode({'url': url}))
    print('quote:', quote(url))
    print('quote_plus:', quote_plus(url))

Result (Python 3.7 or later; earlier versions also escape ~ as %7E):

    urlencode: url=http%3A%2F%2Flocalhost%3A1080%2F~hello%21%2F
    quote: http%3A//localhost%3A1080/~hello%21/
    quote_plus: http%3A%2F%2Flocalhost%3A1080%2F~hello%21%2F

The output shows that urlencode escapes values with quote_plus. For the reverse operation, use the unquote or unquote_plus function:

    from urllib.parse import unquote, unquote_plus

    encoded_url = 'http%3A%2F%2Flocalhost%3A1080%2F%7Ehello%21%2F'
    print(unquote(encoded_url))
    print(unquote_plus(encoded_url))

Result:

    http://localhost:1080/~hello!/
    http://localhost:1080/~hello!/

In this example unquote correctly reverses the output of quote_plus as well. The two functions differ only in that unquote_plus also turns + back into a space, so they give different results when the encoded string contains +.
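The parsing, query, and building functions covered in this article can be combined to change a single query parameter in an existing URL. The sketch below is only one way to do it. The helper name set_query_param and the example URL are made up for illustration and are not part of urllib.parse. It relies on SplitResult being a namedtuple, so the standard namedtuple method _replace() returns a modified copy.

```python
from urllib.parse import urlsplit, parse_qs, urlencode


def set_query_param(url, key, value):
    """Return url with query parameter `key` set to `value`.

    Hypothetical helper for illustration, not part of urllib.parse.
    """
    parts = urlsplit(url)
    # parse_qs maps each name to a list of values;
    # keep_blank_values=True keeps parameters such as "a=".
    query = parse_qs(parts.query, keep_blank_values=True)
    query[key] = [value]
    # doseq=True expands each list back into repeated key=value pairs.
    new_query = urlencode(query, doseq=True)
    # _replace returns a copy of the namedtuple with the query swapped out.
    return parts._replace(query=new_query).geturl()


# Example URL is an assumption chosen for the demonstration.
updated = set_query_param(
    'http://www.example.com/search?q=dark+sun&page=1#top', 'page', '2')
print(updated)
```

Because the query string is decoded by parse_qs() and re-encoded by urlencode(), the other parameters are normalized to urlencode's escaping rules rather than copied byte for byte.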
