
Unicode Regular Expressions (qbit)

Date: 2023-03-26 19:09:36 | Category: Python

Preface

This article is compiled from "Mastering Regular Expressions" and "Unicode Regular Expressions".
The examples use Python 3 by default, with either the re module or the regex library. Note that the \p{...} syntax shown below requires the third-party regex library; the standard re module does not support it.

Basic Unicode Property Classes

\p{L} | \p{Letter} | Letters.
\p{M} | \p{Mark} | Characters that cannot appear on their own and must be combined with a base character (accents, enclosing boxes, etc.).
\p{Z} | \p{Separator} | Characters that separate things but have no visible form (the various whitespace characters).
\p{S} | \p{Symbol} | Various graphical symbols (Dingbats) and letter-like symbols.
\p{N} | \p{Number} | Any numeric character.
\p{P} | \p{Punctuation} | Punctuation characters.
\p{C} | \p{Other} | Matches everything else (rarely used for ordinary characters).

Basic Unicode Subproperties

Letter

\p{Ll} | \p{Lowercase_Letter} | Lowercase letters.
\p{Lu} | \p{Uppercase_Letter} | Uppercase letters.
\p{Lt} | \p{Titlecase_Letter} | Letters that appear at the start of a word (titlecase forms such as Dž).
\p{L&} | Shorthand for \p{Ll}, \p{Lu}, and \p{Lt}.
\p{Lm} | \p{Modifier_Letter} | A small set of letter-like characters with special uses.
\p{Lo} | \p{Other_Letter} | Letters that have no case and are not modifiers, including letters from Hebrew, Arabic, Bengali, Thai, and Japanese.

Mark

\p{Mn} | \p{Non_Spacing_Mark} | "Characters" that modify other characters, such as accents, umlauts, certain vowel signs, and tone marks.
\p{Mc} | \p{Spacing_Combining_Mark} | Modifying characters that take up space of their own (mostly vowel signs in the languages that have them, including Bengali, Gujarati, Tamil, Telugu, Kannada, Malayalam, Sinhala, Myanmar, and Khmer).
\p{Me} | \p{Enclosing_Mark} | Marks that can enclose other characters, such as circles, squares, and diamonds.

Separator

\p{Zs} | \p{Space_Separator} | Various kinds of spacing characters, such as the normal space, the no-break space, and spaces of specific fixed widths.
\p{Zl} | \p{Line_Separator} | The LINE SEPARATOR character (U+2028).
\p{Zp} | \p{Paragraph_Separator} | The PARAGRAPH SEPARATOR character (U+2029).

Symbol

\p{Sc} | \p{Currency_Symbol} | Currency symbols such as $, ¥, € ...
\p{Sk} | \p{Modifier_Symbol} | Mostly standalone versions of combining characters that are full-fledged characters in their own right.
\p{So} | \p{Other_Symbol} | Various Dingbats, box-drawing symbols, Braille patterns, non-letter Chinese characters, etc.

Number

\p{Nd} | \p{Decimal_Digit_Number} | The digits zero through nine in various scripts (not including Chinese, Japanese, and Korean).
\p{Nl} | \p{Letter_Number} | Mostly Roman numerals.
\p{No} | \p{Other_Number} | Numbers used as superscripts or symbols, and characters that represent numbers but are not digits (not including Chinese, Japanese, and Korean).

Punctuation

\p{Pd} | \p{Dash_Punctuation} | Hyphens and dashes of all kinds.
\p{Ps} | \p{Open_Punctuation} | Characters such as ( and 《.
\p{Pe} | \p{Close_Punctuation} | Characters such as ) and 》.
\p{Pi} | \p{Initial_Punctuation} | Characters such as “, «, and ‹.
\p{Pf} | \p{Final_Punctuation} | Characters such as ”, », and ›.
\p{Pc} | \p{Connector_Punctuation} | A few punctuation characters with special linguistic meaning, such as the underscore.
\p{Po} | \p{Other_Punctuation} | Catch-all for all other punctuation: !, &, ., :, etc.

Other

\p{Cc} | \p{Control} | The control characters of ASCII and Latin-1 (TAB, LF, CR, etc.).
\p{Cf} | \p{Format} | Invisible characters used to indicate formatting.
\p{Co} | \p{Private_Use} | Code points allocated for private use (company logos, for example).
\p{Cs} | \p{Surrogate} | One half of a UTF-16 surrogate pair.
\p{Cn} | \p{Unassigned} | Code points that are currently unassigned.

Unicode Scripts

Mainly used to match a specific language or writing system. For example, matching Han characters:

>>> regex.findall(r'\p{Han}', '孔子/现代价值/“知”论')
['孔', '子', '现', '代', '价', '值', '知', '论']

Script list:

\p{Common}, \p{Arabic}, \p{Armenian}, \p{Bengali}, \p{Bopomofo}, \p{Braille}, \p{Buhid}, \p{Canadian_Aboriginal}, \p{Cherokee},
\p{Cyrillic}, \p{Devanagari}, \p{Ethiopic}, \p{Georgian}, \p{Greek}, \p{Gujarati}, \p{Gurmukhi}, \p{Han}, \p{Hangul},
\p{Hanunoo}, \p{Hebrew}, \p{Hiragana}, \p{Inherited}, \p{Kannada}, \p{Katakana}, \p{Khmer}, \p{Lao}, \p{Latin},
\p{Limbu}, \p{Malayalam}, \p{Mongolian}, \p{Myanmar}, \p{Ogham}, \p{Oriya}, \p{Runic}, \p{Sinhala}, \p{Syriac},
\p{Tagalog}, \p{Tagbanwa}, \p{TaiLe}, \p{Tamil}, \p{Telugu}, \p{Thaana}, \p{Thai}, \p{Tibetan}, \p{Yi}

Unicode Blocks

Each block is a contiguous range of code points predefined by the Unicode standard:

\p{InBasic_Latin}: U+0000–U+007F
\p{InLatin-1_Supplement}: U+0080–U+00FF
\p{InLatin_Extended-A}: U+0100–U+017F
\p{InLatin_Extended-B}: U+0180–U+024F
\p{InIPA_Extensions}: U+0250–U+02AF
\p{InSpacing_Modifier_Letters}: U+02B0–U+02FF
\p{InCombining_Diacritical_Marks}: U+0300–U+036F
\p{InGreek_and_Coptic}: U+0370–U+03FF
\p{InCyrillic}: U+0400–U+04FF
\p{InCyrillic_Supplementary}: U+0500–U+052F
\p{InArmenian}: U+0530–U+058F
\p{InHebrew}: U+0590–U+05FF
\p{InArabic}: U+0600–U+06FF
\p{InSyriac}: U+0700–U+074F
\p{InThaana}: U+0780–U+07BF
\p{InDevanagari}: U+0900–U+097F
\p{InBengali}: U+0980–U+09FF
\p{InGurmukhi}: U+0A00–U+0A7F
\p{InGujarati}: U+0A80–U+0AFF
\p{InOriya}: U+0B00–U+0B7F
\p{InTamil}: U+0B80–U+0BFF
\p{InTelugu}: U+0C00–U+0C7F
\p{InKannada}: U+0C80–U+0CFF
\p{InMalayalam}: U+0D00–U+0D7F
\p{InSinhala}: U+0D80–U+0DFF
\p{InThai}: U+0E00–U+0E7F
\p{InLao}: U+0E80–U+0EFF
\p{InTibetan}: U+0F00–U+0FFF
\p{InMyanmar}: U+1000–U+109F
\p{InGeorgian}: U+10A0–U+10FF
\p{InHangul_Jamo}: U+1100–U+11FF
\p{InEthiopic}: U+1200–U+137F
\p{InCherokee}: U+13A0–U+13FF
\p{InUnified_Canadian_Aboriginal_Syllabics}: U+1400–U+167F
\p{InOgham}: U+1680–U+169F
\p{InRunic}: U+16A0–U+16FF
\p{InTagalog}: U+1700–U+171F
\p{InHanunoo}: U+1720–U+173F
\p{InBuhid}: U+1740–U+175F
\p{InTagbanwa}: U+1760–U+177F
\p{InKhmer}: U+1780–U+17FF
\p{InMongolian}: U+1800–U+18AF
\p{InLimbu}: U+1900–U+194F
\p{InTai_Le}: U+1950–U+197F
\p{InKhmer_Symbols}: U+19E0–U+19FF
\p{InPhonetic_Extensions}: U+1D00–U+1D7F
\p{InLatin_Extended_Additional}: U+1E00–U+1EFF
\p{InGreek_Extended}: U+1F00–U+1FFF
\p{InGeneral_Punctuation}: U+2000–U+206F
\p{InSuperscripts_and_Subscripts}: U+2070–U+209F
\p{InCurrency_Symbols}: U+20A0–U+20CF
\p{InCombining_Diacritical_Marks_for_Symbols}: U+20D0–U+20FF
\p{InLetterlike_Symbols}: U+2100–U+214F
\p{InNumber_Forms}: U+2150–U+218F
\p{InArrows}: U+2190–U+21FF
\p{InMathematical_Operators}: U+2200–U+22FF
\p{InMiscellaneous_Technical}: U+2300–U+23FF
\p{InControl_Pictures}: U+2400–U+243F
\p{InOptical_Character_Recognition}: U+2440–U+245F
\p{InEnclosed_Alphanumerics}: U+2460–U+24FF
\p{InBox_Drawing}: U+2500–U+257F
\p{InBlock_Elements}: U+2580–U+259F
\p{InGeometric_Shapes}: U+25A0–U+25FF
\p{InMiscellaneous_Symbols}: U+2600–U+26FF
\p{InDingbats}: U+2700–U+27BF
\p{InMiscellaneous_Mathematical_Symbols-A}: U+27C0–U+27EF
\p{InSupplemental_Arrows-A}: U+27F0–U+27FF
\p{InBraille_Patterns}: U+2800–U+28FF
\p{InSupplemental_Arrows-B}: U+2900–U+297F
\p{InMiscellaneous_Mathematical_Symbols-B}: U+2980–U+29FF
\p{InSupplemental_Mathematical_Operators}: U+2A00–U+2AFF
\p{InMiscellaneous_Symbols_and_Arrows}: U+2B00–U+2BFF
\p{InCJK_Radicals_Supplement}: U+2E80–U+2EFF
\p{InKangxi_Radicals}: U+2F00–U+2FDF
\p{InIdeographic_Description_Characters}: U+2FF0–U+2FFF
\p{InCJK_Symbols_and_Punctuation}: U+3000–U+303F
\p{InHiragana}: U+3040–U+309F
\p{InKatakana}: U+30A0–U+30FF
\p{InBopomofo}: U+3100–U+312F
\p{InHangul_Compatibility_Jamo}: U+3130–U+318F
\p{InKanbun}: U+3190–U+319F
\p{InBopomofo_Extended}: U+31A0–U+31BF
\p{InKatakana_Phonetic_Extensions}: U+31F0–U+31FF
\p{InEnclosed_CJK_Letters_and_Months}: U+3200–U+32FF
\p{InCJK_Compatibility}: U+3300–U+33FF
\p{InCJK_Unified_Ideographs_Extension_A}: U+3400–U+4DBF
\p{InYijing_Hexagram_Symbols}: U+4DC0–U+4DFF
\p{InCJK_Unified_Ideographs}: U+4E00–U+9FFF
\p{InYi_Syllables}: U+A000–U+A48F
\p{InYi_Radicals}: U+A490–U+A4CF
\p{InHangul_Syllables}: U+AC00–U+D7AF
\p{InHigh_Surrogates}: U+D800–U+DB7F
\p{InHigh_Private_Use_Surrogates}: U+DB80–U+DBFF
\p{InLow_Surrogates}: U+DC00–U+DFFF
\p{InPrivate_Use_Area}: U+E000–U+F8FF
\p{InCJK_Compatibility_Ideographs}: U+F900–U+FAFF
\p{InAlphabetic_Presentation_Forms}: U+FB00–U+FB4F
\p{InArabic_Presentation_Forms-A}: U+FB50–U+FDFF
\p{InVariation_Selectors}: U+FE00–U+FE0F
\p{InCombining_Half_Marks}: U+FE20–U+FE2F
\p{InCJK_Compatibility_Forms}: U+FE30–U+FE4F
\p{InSmall_Form_Variants}: U+FE50–U+FE6F
\p{InArabic_Presentation_Forms-B}: U+FE70–U+FEFF
\p{InHalfwidth_and_Fullwidth_Forms}: U+FF00–U+FFEF
\p{InSpecials}: U+FFF0–U+FFFF

See also: Unicode code charts; Unicode character reference.

Examples

Text filtering: remove everything that is not a letter (punctuation, symbols, digits, and other special characters).

>>> regex.sub(r'[^\p{L}]', '', '1Confucius/现代价值/“知”论')
'Confucius现代价值知论'

This article is from qbit snap.
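
As a cross-check on the general category classes listed above, the two-letter category codes can also be inspected with Python's standard unicodedata module, without installing the regex library. The sketch below is not part of the original article: the helper names general_category and keep_letters are hypothetical, and keep_letters only approximates the regex-based filter [^\p{L}] by keeping characters whose category code starts with "L". Results may differ slightly between the two approaches if they ship different Unicode database versions.

```python
import unicodedata


def general_category(ch: str) -> str:
    # Hypothetical helper: return the two-letter general category
    # of a single character, e.g. "Lu" for an uppercase letter.
    return unicodedata.category(ch)


def keep_letters(text: str) -> str:
    # Hypothetical helper: keep only characters whose general category
    # starts with "L", roughly what removing [^\p{L}] does with regex.
    return "".join(ch for ch in text if unicodedata.category(ch).startswith("L"))


if __name__ == "__main__":
    # Print the category code for a few sample characters.
    for ch in ["A", "a", "孔", "1", "/", "“", "_", "$", " "]:
        print(repr(ch), general_category(ch))

    # Same input as the text filtering example in the article.
    print(keep_letters("1Confucius/现代价值/“知”论"))
```

Looking up a character's category this way can help when deciding which \p{...} class to put in a pattern, for instance confirming that “ is Pi (Initial_Punctuation) while / is Po (Other_Punctuation).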
