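When processing text, you often run into inconsistent use of full-width and half-width characters, so a program needs to be able to convert between the two quickly. Because the two forms map onto each other directly, this is not complicated. The rules are as follows. Full-width characters occupy Unicode code points 65281 to 65374 (0xFF01 to 0xFF5E), and half-width characters occupy 33 to 126 (0x21 to 0x7E). The space is the exception: the full-width space is 12288 (0x3000) and the half-width space is 32 (0x20). Apart from the space, full-width and half-width characters correspond one to one in code point order (half-width + 65248 = full-width), so non-space characters can be converted by adding or subtracting 65248 (0xFEE0), and the space is handled separately.

The code below uses a few Python 2 built-ins. chr() takes an integer in range(256) (that is, 0 to 255) and returns the corresponding character. unichr() does the same, except that it returns a Unicode character. ord() is the counterpart of chr() and unichr(): it takes a character (a string of length 1) and returns its ASCII or Unicode value.

First, print the mapping:

    for i in xrange(33, 127):
        print i, chr(i), i + 65248, unichr(i + 65248)

Output:

    33 ! 65281 !
    34 " 65282 "
    35 # 65283 #
    36 $ 65284 $
    37 % 65285 %
    38 & 65286 &
    39 ' 65287 '
    40 ( 65288 (
    41 ) 65289 )
    42 * 65290 *
    43 + 65291 +
    44 , 65292 ,
    45 - 65293 -
    46 . 65294 .
    47 / 65295 /
    48 0 65296 0
    49 1 65297 1
    50 2 65298 2
    51 3 65299 3
    52 4 65300 4
    53 5 65301 5
    54 6 65302 6
    55 7 65303 7
    56 8 65304 8
    57 9 65305 9
    58 : 65306 :
    59 ; 65307 ;
    60 < 65308 <
    61 = 65309 =
    62 > 65310 >
    63 ? 65311 ?
    64 @ 65312 @
    65 A 65313 A
    66 B 65314 B
    67 C 65315 C
    68 D 65316 D
    69 E 65317 E
    70 F 65318 F
    71 G 65319 G
    72 H 65320 H
    73 I 65321 I
    74 J 65322 J
    75 K 65323 K
    76 L 65324 L
    77 M 65325 M
    78 N 65326 N
    79 O 65327 O
    80 P 65328 P
    81 Q 65329 Q
    82 R 65330 R
    83 S 65331 S
    84 T 65332 T
    85 U 65333 U
    86 V 65334 V
    87 W 65335 W
    88 X 65336 X
    89 Y 65337 Y
    90 Z 65338 Z
    91 [ 65339 [
    92 \ 65340 \
    93 ] 65341 ]
    94 ^ 65342 ^
    95 _ 65343 _
    96 ` 65344 `
    97 a 65345 a
    98 b 65346 b
    99 c 65347 c
    100 d 65348 d
    101 e 65349 e
    102 f 65350 f
    103 g 65351 g
    104 h 65352 h
    105 i 65353 i
    106 j 65354 j
    107 k 65355 k
    108 l 65356 l
    109 m 65357 m
    110 n 65358 n
    111 o 65359 o
    112 p 65360 p
    113 q 65361 q
    114 r 65362 r
    115 s 65363 s
    116 t 65364 t
    117 u 65365 u
    118 v 65366 v
    119 w 65367 w
    120 x 65368 x
    121 y 65369 y
    122 z 65370 z
    123 { 65371 {
    124 | 65372 |
    125 } 65373 }
    126 ~ 65374 ~

Full-width to half-width:

    def full2half(s):
        n = []
        s = s.decode('utf-8')  # expects a UTF-8 encoded byte string
        for char in s:
            num = ord(char)
            if num == 0x3000:  # full-width space
                num = 32
            elif 0xFF01 <= num <= 0xFF5E:
                num -= 0xfee0
            num = unichr(num)
            n.append(num)
        return ''.join(n)

Half-width to full-width:

    def half2full(s):
        n = []
        s = s.decode('utf-8')  # expects a UTF-8 encoded byte string
        for char in s:
            num = ord(char)
            if num == 32:  # half-width space
                num = 0x3000
            elif 0x21 <= num <= 0x7E:
                num += 0xfee0
            num = unichr(num)
            n.append(num)
        return ''.join(n)

The implementation above is simple, but real requirements may not be. For example, in a Chinese article we may want every letter and digit to be half-width while ordinary punctuation stays full-width. The blanket conversion above does not fit that case. The solution is a custom mapping dictionary.

    #!/usr/bin/env python
    # -*- coding: utf-8 -*-

    # Each pair is (full-width, half-width).
    FH_SPACE = FHS = ((u"\u3000", u" "),)
    FH_NUM = FHN = (
        (u"0", u"0"), (u"1", u"1"), (u"2", u"2"), (u"3", u"3"), (u"4", u"4"),
        (u"5", u"5"), (u"6", u"6"), (u"7", u"7"), (u"8", u"8"), (u"9", u"9"),
    )
    FH_ALPHA = FHA = (
        (u"a", u"a"), (u"b", u"b"), (u"c", u"c"), (u"d", u"d"), (u"e", u"e"),
        (u"f", u"f"), (u"g", u"g"), (u"h", u"h"), (u"i", u"i"), (u"j", u"j"),
        (u"k", u"k"), (u"l", u"l"), (u"m", u"m"), (u"n", u"n"), (u"o", u"o"),
        (u"p", u"p"), (u"q", u"q"), (u"r", u"r"), (u"s", u"s"), (u"t", u"t"),
        (u"u", u"u"), (u"v", u"v"), (u"w", u"w"), (u"x", u"x"), (u"y", u"y"),
        (u"z", u"z"),
        (u"A", u"A"), (u"B", u"B"), (u"C", u"C"), (u"D", u"D"), (u"E", u"E"),
        (u"F", u"F"), (u"G", u"G"), (u"H", u"H"), (u"I", u"I"), (u"J", u"J"),
        (u"K", u"K"), (u"L", u"L"), (u"M", u"M"), (u"N", u"N"), (u"O", u"O"),
        (u"P", u"P"), (u"Q", u"Q"), (u"R", u"R"), (u"S", u"S"), (u"T", u"T"),
        (u"U", u"U"), (u"V", u"V"), (u"W", u"W"), (u"X", u"X"), (u"Y", u"Y"),
        (u"Z", u"Z"),
    )
    FH_PUNCTUATION = FHP = (
        (u".", u"."), (u",", u","), (u"!", u"!"), (u"?", u"?"), (u"”", u'"'),
        (u"’", u"'"), (u"‘", u"`"), (u"@", u"@"), (u"_", u"_"), (u":", u":"),
        (u";", u";"), (u"#", u"#"), (u"$", u"$"), (u"%", u"%"), (u"&", u"&"),
        (u"(", u"("), (u")", u")"),
        (u"\u2010", u"-"),  # U+2010 HYPHEN (assumed; the source lists "-" twice)
        (u"=", u"="), (u"*", u"*"), (u"+", u"+"), (u"-", u"-"), (u"/", u"/"),
        (u"<", u"<"), (u">", u">"), (u"[", u"["), (u"¥", u"\\"), (u"]", u"]"),
        (u"^", u"^"), (u"{", u"{"), (u"|", u"|"), (u"}", u"}"), (u"~", u"~"),
    )
    FH_ASCII = HAC = lambda: ((fr, to) for m in (FH_ALPHA, FH_NUM, FH_PUNCTUATION) for fr, to in m)

    # Reverse tables: each pair is (half-width, full-width).
    HF_SPACE = HFS = ((u" ", u"\u3000"),)
    HF_NUM = HFN = lambda: ((h, z) for z, h in FH_NUM)
    HF_ALPHA = HFA = lambda: ((h, z) for z, h in FH_ALPHA)
    HF_PUNCTUATION = HFP = lambda: ((h, z) for z, h in FH_PUNCTUATION)
    HF_ASCII = ZAC = lambda: ((h, z) for z, h in FH_ASCII())


    def convert(text, *maps, **ops):
        """Full-width/half-width conversion.

        args:
            text: the unicode string to convert
            maps: one or more mapping tables (tuples of pairs, dicts, or callables)
            skip: characters to leave unconverted, as a tuple or a string
        """
        if "skip" in ops:
            skip = ops["skip"]
            if isinstance(skip, basestring):
                skip = tuple(skip)

            def replace(text, fr, to):
                return text if fr in skip else text.replace(fr, to)
        else:
            def replace(text, fr, to):
                return text.replace(fr, to)

        for m in maps:
            if callable(m):
                m = m()
            elif isinstance(m, dict):
                m = m.items()
            for fr, to in m:
                text = replace(text, fr, to)
        return text


    if __name__ == '__main__':
        text = u"成田机场—【JR特快成田特快横滨方向第2站】—东京—【JR新干线隼鸟巴士新青森方向第6站】—新青森—【JR特急超级白鸟号函馆方向第四站】—函馆"
        print(convert(text, FH_ASCII,
                      {u"【": u"[", u"】": u"]", u",": u",", u".": u"。", u"?": u"?", u"!": u"!"},
                      skip=u",。?!“”"))

Special note: half-width (English) quotation marks do not distinguish between opening and closing quotes.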

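The offset rule described above (half-width + 65248 = full-width, with the space handled separately) can also be written for Python 3, where `unichr()` and `xrange()` no longer exist and `str` is already Unicode. The following is only a minimal sketch under that assumption, not the article's original implementation: it swaps the per-character loop for `str.translate` with a precomputed table. The names `OFFSET`, `FULL_TO_HALF`, `HALF_TO_FULL`, `ALNUM_TO_HALF` and `alnum_to_half` are hypothetical, introduced here for illustration; `full2half` and `half2full` reuse the article's names but take `str` rather than UTF-8 bytes.

```python
import string

OFFSET = 0xFEE0  # 65248: full-width code point minus half-width code point

# Full-width -> half-width table: shift U+FF01..U+FF5E down by OFFSET,
# and map the full-width space U+3000 to the ASCII space separately.
FULL_TO_HALF = {code: code - OFFSET for code in range(0xFF01, 0xFF5F)}
FULL_TO_HALF[0x3000] = 0x20

# Half-width -> full-width is simply the inverse mapping.
HALF_TO_FULL = {half: full for full, half in FULL_TO_HALF.items()}

# Hypothetical selective table: full-width letters and digits only,
# so full-width punctuation is left untouched.
ALNUM_TO_HALF = {
    ord(ch) + OFFSET: ord(ch) for ch in string.ascii_letters + string.digits
}


def full2half(text: str) -> str:
    """Convert every full-width ASCII variant (and U+3000) to half-width."""
    return text.translate(FULL_TO_HALF)


def half2full(text: str) -> str:
    """Convert every printable ASCII character (and the space) to full-width."""
    return text.translate(HALF_TO_FULL)


def alnum_to_half(text: str) -> str:
    """Convert only full-width letters and digits; keep punctuation as is."""
    return text.translate(ALNUM_TO_HALF)


if __name__ == "__main__":
    sample = "JR第2站,OK!"
    print(full2half(sample))      # everything becomes half-width
    print(alnum_to_half(sample))  # punctuation stays full-width
    print(half2full("abc 123"))
```

Characters outside the tables, such as Chinese text, pass through `str.translate` unchanged, which is why no explicit range check is needed. The selective table covers the "letters and digits half-width, punctuation full-width" case without a skip list, at the cost of being less flexible than the `convert()` function with custom maps.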