Regex Memo (qbit)

Time: 2023-03-26 17:55:47 Python

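The examples in this memo use Python 3 by default, with either the standard re module or the third-party regex library.

qbit's guess: for Python 3 str (Unicode) patterns, \s in the re module matches \f \n \r \t \v plus the half-width and full-width spaces, 7 characters in total. This undercounts. According to the re documentation, \s matches Unicode whitespace, which includes [ \t\n\r\f\v] and many other characters, such as the no-break space (\xa0).

Regular expression documentation
Regular Expression 30-Minute Introductory Tutorial
Another good introductory tutorial that demystifies regular expressions. qbit finds this article's explanation of Multiline mode very good.

Extract the double quotes together with the content between them

With re.findall:

    import re
    text = '''abc"def"ghi'''
    re.findall(r'"[^"]+"', text)
    # result: ['"def"']

With re.search:

    >>> text = '''abc"def"ghi'''
    >>> re.search(r'"([^"]+)"', text).group(0)
    '"def"'

Extract only the content between the double quotes

With re.findall:

    text = '''abc"def"ghi'''
    re.findall(r'"([^"]+)"', text)
    # result: ['def']

With re.search:

    >>> text = '''abc"def"ghi'''
    >>> re.search(r'"([^"]+)"', text).group(1)
    'def'

With lookaround, (?<=pattern) and (?=pattern):

    text = '''abc"def"ghi'''
    re.findall(r'(?<=")[^"]+(?=")', text)
    # result: ['def']

Find lines that start with certain strings

    # For example, find the lines that start with +++, --- or index.
    # Method 1: match line by line (lst is a list of lines)
    for i in lst:
        if re.match(r"(---|\+\+\+|index).*", i):
            print(i)
    # Method 2: match everything in one pass (content is the whole text)
    re.findall(r'^(?:\+\+\+|---|index).*$', content, re.M)
    # Method 2, compact version (note: [-\+]{3} also accepts mixed runs such as +-+)
    re.findall(r'^(?:[-\+]{3}|index).*$', content, re.M)

Contains / does not contain (reference: Excluding a specific string with regular expressions)

Text content:

    >>> print(text)
    www.sina.com.cn
    www.educ.org
    www.hao.cc
    www.baidu.com
    www.123.com

    sina.com.cn
    educ.org
    hao.cc
    baidu.com
    123.com

Match lines that start with www:

    >>> re.findall(r'^www.*$', text, re.M)
    ['www.sina.com.cn', 'www.educ.org', 'www.hao.cc', 'www.baidu.com', 'www.123.com']

Match lines that do not start with www:

    >>> re.findall(r'^(?!www).*$', text, re.M)
    ['', 'sina.com.cn', 'educ.org', 'hao.cc', 'baidu.com', '123.com']

Match lines that end with cn:

    >>> re.findall(r'^.*?cn$', text, re.M)
    ['www.sina.com.cn', 'sina.com.cn']

Match lines that do not end with com. The original pattern is truncated after "(?"; the negative lookbehind below is a reconstruction, and the output is what that reconstructed pattern returns on the text above:

    >>> re.findall(r'^.*?(?<!com)$', text, re.M)
    ['www.sina.com.cn', 'www.educ.org', 'www.hao.cc', '', 'sina.com.cn', 'educ.org', 'hao.cc']

Match lines that contain com:

    >>> re.findall(r'^.*?com.*?$', text, re.M)
    ['www.sina.com.cn', 'www.baidu.com', 'www.123.com', 'sina.com.cn', 'baidu.com', '123.com']

Match lines that do not contain com:

    >>> re.findall(r'^(?!.*com).*$', text, re.M)
    ['www.educ.org', 'www.hao.cc', '', 'educ.org', 'hao.cc']
    >>> re.findall(r'^(?:(?!com).)*?$', text, re.M)
    ['www.educ.org', 'www.hao.cc', '', 'educ.org', 'hao.cc']

Match everything, drop part of it

Use a group to get the top-level URL, that is, strip the lower path levels.

    # Method 1
    >>> strr = 'http://www.baidu.com/abc/d.html'
    >>> re.findall(r'(http://.+?)/.*', strr)
    ['http://www.baidu.com']
    # Method 2
    >>> re.sub(r'(http://.+?)/.*', r'\1', strr)
    'http://www.baidu.com'

Two examples that help in understanding regex groups

    # Example 1
    >>> strr = 'A/B/C'
    >>> re.sub(r'(.)/(.)/(.)', r'xx', strr)
    'xx'
    >>> re.sub(r'(.)/(.)/(.)', r'\1xx', strr)
    'Axx'
    >>> re.sub(r'(.)/(.)/(.)', r'\2xx', strr)
    'Bxx'
    >>> re.sub(r'(.)/(.)/(.)', r'\3xx', strr)
    'Cxx'
    # Example 2
    >>> text = 'AA,BB:222'
    >>> re.search(r'(.+),(.+):(\d+)', text).group(0)
    'AA,BB:222'
    >>> re.search(r'(.+),(.+):(\d+)', text).group(1)
    'AA'
    >>> re.search(r'(.+),(.+):(\d+)', text).group(2)
    'BB'
    >>> re.search(r'(.+),(.+):(\d+)', text).group(3)
    '222'

Extract the div that contains the string hello

The HTML literals in this example (the value of content, both patterns, and the printed matches) were stripped along with the page markup; the second pattern survives only as the fragment ]+hello.+?> . What remains is the call sequence, run once for each pattern p:

    >>> re.search(p, content).group()    # the first match
    >>> re.findall(p, content)           # all matches (or group contents, if p has a capturing group)
    >>> for m in re.finditer(p, content): print(m.group())    # one full match per line

Atomic grouping and possessive quantifiers

If the tool you are using supports positive lookahead, and capturing parentheses can be used inside the lookahead, you can emulate atomic grouping and possessive quantifiers. The usual idiom is (?=(pattern))\1: once the lookahead has matched it is never backtracked into, and \1 then consumes exactly what was captured. Python 3.11 and later also support (?>...) and possessive quantifiers directly in re.

Thousands separators

Python:

    >>> format(23456789, ',')
    '23,456,789'
    # using lookbehind and lookahead
    >>> re.sub(r'(?<=\d)(?=(?:\d{3})+$)', ',', '2345678')
    '2,345,678'

JavaScript:

    // using lookahead only (JavaScript engines had no lookbehind before ES2018)
    // the result is "23,456,789"
    "23456789".replace(/(\d)(?=(?:\d{3})+$)/g, "$1,")

Single-level nested parentheses (balanced groups)

    >>> import re
    >>> line = r'coverLayer(gasoline)TarimBasin(Subject:Caprock(OilandGas)Subject:Evaluation)塔里木盆地'
    >>> re.findall(r'\([^()]*(\([^()]*\)[^()]*)*\)', line)
    ['', '(OilandGas)Subject:Evaluation']
    >>> re.findall(r'\([^()]*(?:\([^()]*\)[^()]*)*\)', line)
    ['(gasoline)', '(Subject:Caprock(OilandGas)Subject:Evaluation)']

The first call returns the contents of the capturing group rather than the whole match, which is why the non-capturing (?:...) form is used in the second call.

Match Chinese characters

    >>> import regex
    >>> regex.findall(r'\p{Han}', '孔子/现代价值/“知”论')
    ['孔', '子', '现', '代', '价', '值', '知', '论']

A fun combination of regex and lambda

    dic = {'user': 'walker', 'domain': '163.com'}
    rule = r'%user%@%domain%'
    # Replace each %name% placeholder with the matching value from dic.
    email = re.sub('%[^%]*%', lambda matchobj: dic[matchobj.group(0).strip('%')], rule)
    print('email:%s' % email)
    # email:walker@163.com

A group reference followed by digits

URL parameter replacement (replace 123 with 321):

    # \1321 is not read as group 1 followed by 321: \132 is parsed as an
    # octal escape ('Z'), followed by a literal 1
    >>> re.sub(r'(.*a=)123', r'\1321', 'http://qbit.cn?a=123')
    'Z1'
    # write the group number unambiguously with \g<1>
    >>> re.sub(r'(.*a=)123', r'\g<1>321', 'http://qbit.cn?a=123')
    'http://qbit.cn?a=321'

Related reading

A comparison of three regular expression libraries in C++ (C regex, C++ regex, boost regex)
The differences and connections between re.search and re.findall in Python
Regex testers: regex101 (highly recommended), regexr
Python's regex module, a more powerful regular expression engine
Desktop tool: RegexBuddy (demo download: SetupRegexBuddyDemo.exe; registry change: RegexBuddy 4.7.0 x64 evaluation trial expired, unlimited trial method)
Unicode regex: https://www.regular-expressio...

This article is from qbitsnap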
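
As a footnote to the \s guess at the top of this memo: the set of characters that \s matches can be enumerated directly instead of guessed. The sketch below does that by brute force over every code point. The names GUESSED and whitespace_matched_by_s are made up for this illustration, and the exact count it prints depends on the Unicode tables of the Python build, so no expected output is given.

```python
import re
import sys

# The seven characters from qbit's guess: \f \n \r \t \v,
# the half-width space (U+0020) and the full-width space (U+3000).
GUESSED = set("\f\n\r\t\v \u3000")


def whitespace_matched_by_s():
    """Return every single character that the str pattern \\s matches."""
    ws = re.compile(r"\s")
    # Brute force: try each code point as a one-character string.
    return {chr(cp) for cp in range(sys.maxunicode + 1) if ws.fullmatch(chr(cp))}


matched = whitespace_matched_by_s()
extra = sorted(matched - GUESSED)

print("guess covered:", GUESSED <= matched)
print("matched in total:", len(matched))
print("not in the guess:", [f"U+{ord(c):04X}" for c in extra])
```

Passing re.ASCII to re.compile would restrict \s to the ASCII set [ \t\n\r\f\v], in which case the full-width space is no longer matched.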
