Notes on a bug caused by a BOM

Time: 2023-04-04 01:05:46 Node.js

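The bug

Today a teammate gave me a JSON config file. Its contents are not the point, so the following can stand in for it:

    {"text": "this is a example"}

This JSON does not need to stay resident in memory, so I did not load it with require. Node's module cache would keep the parsed object alive for the lifetime of the process, which is effectively a leak for data that is only needed once. I read it like this instead:

    const fs = require('fs');

    fs.readFile(`${__dirname}/y.json`, 'utf8', function (err, str) {
        if (err) {
            throw err;
        }
        try {
            const data = JSON.parse(str);
            // ...
        } catch (err) {
            throw err;
        }
    });

Then something odd happened. JSON.parse threw an error:

    Unexpected token in JSON at position 0

I was baffled, so I tried require instead, and it worked without any problem. My teammate uses Windows, so I asked him and learned that the file had been written in Notepad++. Combined with the BOM problems I used to run into when writing PHP, I guessed that a BOM was the cause. Converting the str I had read into a Buffer confirmed it: the data starts with ef bb bf.

BOM

First, what is a BOM? The byte order mark (BOM) is the name of the Unicode character at code point U+FEFF. When a string of UCS/Unicode characters is encoded as UTF-16 or UTF-32, this character indicates the byte order. It is also often used as a signature showing that a file is encoded in UTF-8, UTF-16 or UTF-32.

Put simply, it sits at the very start of a text file and signals which encoding the file uses. Files created on a Mac usually do not have one, but files saved from Notepad++ on Windows commonly do.

You can also reproduce the problem by writing a file with a BOM from Python:

    # Python 3: the 'utf_8_sig' codec writes the UTF-8 BOM before the content
    code = '''{"x":20}'''
    with open('y.json', 'w', encoding='utf_8_sig') as f:
        f.write(code)

Now the cause is known, but one question remains: why does require work? What does require do with a JSON file? As far as I remembered, require reads the file synchronously with fs.readFileSync, so why would it behave differently? Guessing is useless, so I looked at the Node source and found this excerpt (the tail of the .json loader):

      try {
        module.exports = JSON.parse(internalModule.stripBOM(content));
      } catch (err) {
        err.message = filename + ': ' + err.message;
        throw err;
      }
    };

That makes it clear: after reading the file, require removes the BOM from the string before parsing it. Here is the implementation of internalModule.stripBOM:

    function stripBOM(content) {
        // Check whether the first character is a BOM
        if (content.charCodeAt(0) === 0xFEFF) {
            content = content.slice(1);
        }
        return content;
    }

At this point the bug itself is solved, but I still did not understand two things. Why are the bytes ef bb bf in a utf8 file, and why does the check compare against FEFF? Isn't FEFF the UTF-16 big-endian representation?

Unicode and utf8

Let's start with some history. The first character encoding was ASCII. A byte has eight bits and can represent 256 states, but ASCII defines only 128 characters and uses the low seven bits. That is enough for English but not for other languages, so some European countries used the spare values to define their own characters. For example, 130 stands for one character in a French encoding and a different one in a Russian encoding. The result is that 0-127 agree everywhere, while 128-255 can differ widely.

To solve this, an international body designed Unicode, a scheme that can hold the characters of every language in the world. Unicode only assigns each symbol a number (its code point). It does not say how that number is stored: a Chinese character may need at least 2 bytes, while an English letter needs only 1.

utf8 is one implementation of Unicode and is widely used on the Internet. Its encoding rules are:

1. For a single-byte symbol, the first bit is 0 and the remaining 7 bits hold the symbol's code, so for English text utf8 is identical to ASCII.
2. For an n-byte symbol (n > 1), the first n bits of the first byte are 1, bit n+1 is 0, and the first two bits of every following byte are 10. The remaining bits hold the symbol's Unicode code.

The table below shows the mapping:

    Bytes  Unicode range (hex)   utf8 encoding (binary)
    1      00000000-0000007F     0xxxxxxx
    2      00000080-000007FF     110xxxxx 10xxxxxx
    3      00000800-0000FFFF     1110xxxx 10xxxxxx 10xxxxxx
    4      00010000-0010FFFF     11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
    5      00200000-03FFFFFF     111110xx 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx
    6      04000000-7FFFFFFF     1111110x 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx

The 5-byte and 6-byte rows come from the older definition of utf8. The current standard (RFC 3629) stops at 4 bytes and U+10FFFF.

Now let's see how FEFF turns into EF BB BF. Reading with the 'utf8' encoding performs a buffer -> string conversion. The buffer holds utf8 bytes, while a JavaScript string holds Unicode as UTF-16 code units, so the BOM character U+FEFF shows up as charCodeAt(0) === 0xFEFF. That is why stripBOM compares against FEFF even though the file on disk starts with ef bb bf.

Working it out from the table:

    FEFF                          = 1111 1110 1111 1111
    range 0800-FFFF, so 3 bytes   : 1110xxxx 10xxxxxx 10xxxxxx
    fill in the 16 bits           : 11101111 10111011 10111111
    in hex                        = EF BB BF

The conversion from Unicode to utf8 can also be written as code:

    # Python 3
    def UnicodeToUtf8(unic):
        res = list()
        if unic <= 0x7F:
            # 0xxxxxxx
            res.append(unic & 0x7F)
        elif unic >= 0x80 and unic <= 0x7FF:
            # 110xxxxx
            res.append(((unic >> 6) & 0x1F) | 0xC0)
            # 10xxxxxx
            res.append((unic & 0x3F) | 0x80)
        elif unic >= 0x800 and unic <= 0xFFFF:
            # 1110xxxx
            res.append(((unic >> 12) & 0x0F) | 0xE0)
            # the rest are all 10xxxxxx
            res.append(((unic >> 6) & 0x3F) | 0x80)
            res.append((unic & 0x3F) | 0x80)
        elif unic >= 0x10000 and unic <= 0x1FFFFF:
            # 11110xxx
            res.append(((unic >> 18) & 0x07) | 0xF0)
            # the rest are all 10xxxxxx
            res.append(((unic >> 12) & 0x3F) | 0x80)
            res.append(((unic >> 6) & 0x3F) | 0x80)
            res.append((unic & 0x3F) | 0x80)
        elif unic >= 0x200000 and unic <= 0x3FFFFFF:
            # 111110xx
            res.append(((unic >> 24) & 0x03) | 0xF8)
            # the rest are all 10xxxxxx
            res.append(((unic >> 18) & 0x3F) | 0x80)
            res.append(((unic >> 12) & 0x3F) | 0x80)
            res.append(((unic >> 6) & 0x3F) | 0x80)
            res.append((unic & 0x3F) | 0x80)
        elif unic >= 0x4000000 and unic <= 0x7FFFFFFF:
            # 1111110x
            res.append(((unic >> 30) & 0x01) | 0xFC)
            # the rest are all 10xxxxxx
            res.append(((unic >> 24) & 0x3F) | 0x80)
            res.append(((unic >> 18) & 0x3F) | 0x80)
            res.append(((unic >> 12) & 0x3F) | 0x80)
            res.append(((unic >> 6) & 0x3F) | 0x80)
            res.append((unic & 0x3F) | 0x80)
        return [hex(r) for r in res]

    # test
    print(UnicodeToUtf8(0xFEFF))

Converting utf8 back to Unicode only requires stripping the flag bits, so I will not implement it here.

That is the end of the investigation. I finally understand what happened and can tell my teammate how to fix the bug: use the stripBOM function above. Thanks for reading, and if anything here is wrong, please point it out!

Parts of the Unicode and utf8 section are based on Teacher Ruan's (阮老师) article.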
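As a footnote to the fix, here is a minimal sketch (not from the original investigation) that reproduces the problem and applies the same BOM check as the stripBOM shown earlier. The helper name readJsonFile and the temporary file location are assumptions made for illustration. Only fs, os, path and Buffer from Node's standard library are used.

```javascript
const fs = require('fs');
const os = require('os');
const path = require('path');

// Same check as the stripBOM shown earlier: drop a leading U+FEFF if present.
function stripBOM(content) {
  return content.charCodeAt(0) === 0xFEFF ? content.slice(1) : content;
}

// Hypothetical helper (not part of Node): read a JSON file, tolerating a BOM.
function readJsonFile(file) {
  return JSON.parse(stripBOM(fs.readFileSync(file, 'utf8')));
}

// Demo: write a BOM-prefixed JSON file into a temp directory.
const dir = fs.mkdtempSync(path.join(os.tmpdir(), 'bom-'));
const file = path.join(dir, 'y.json');
const bom = Buffer.from([0xef, 0xbb, 0xbf]);
fs.writeFileSync(file, Buffer.concat([bom, Buffer.from('{"x":20}')]));

// The raw bytes start with the utf8 form of the BOM.
const raw = fs.readFileSync(file);
console.log(raw.subarray(0, 3).toString('hex')); // efbbbf

// Parsing after stripping the BOM succeeds.
console.log(readJsonFile(file).x); // 20
```

Stripping the BOM from the decoded string, rather than from the raw bytes, keeps the check to a single comparison, which is the approach the Node excerpt above takes.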
