# Elasticsearch Analyzer: Built-in Analyzers

This article mainly introduces how an Analyzer is composed in Elasticsearch and walks through the analyzers that ship with ES and how to use them.

## Pre-knowledge

ES provides the `_analyze` API so we can quickly run an analyzer against some input text and see how it is tokenized, which is very handy for learning and experimenting with analyzers.

```
POST _analyze
{
  "analyzer": "standard",
  "text": "The 2 QUICK Brown-Foxes jumped over the lazy dog's bone."
}
```

```
[the, 2, quick, brown, foxes, jumped, over, the, lazy, dog's, bone]
```

## 1. Analyzer in ES

A very important concept in ES is analysis (tokenization): full-text search in ES is built on tokenization combined with the inverted index. So in this article we look at what analysis is and how text gets split into tokens.

The analyzer is the component dedicated to this job. As in many middleware designs, the responsibilities of each component are clearly divided; the single-responsibility principle makes later changes easy to extend. An analyzer consists of three parts:

- Character Filters: pre-process the raw text, for example stripping HTML tags
- Tokenizer: split the text into terms according to some rule
- Token Filters: post-process the terms: lowercasing, removing stop words, adding synonyms, expansion, and so on

Analysis happens in two scenarios: when documents are written into the index, and at query time, when the query text itself also has to be analyzed.

## 2. Built-in analyzers

- Standard Analyzer: the default analyzer; splits text into words and lowercases them; it handles Chinese, but only character by character, which is not very meaningful
- Simple Analyzer: splits on any non-letter character and lowercases
- Stop Analyzer: lowercases and removes stop words (the, a, is); by default it uses the pre-defined English stop token filter
- Whitespace Analyzer: splits whenever whitespace is encountered; does not lowercase
- Keyword Analyzer: does no tokenization at all; the input is passed straight through as the output
- Pattern Analyzer: splits with a regular expression
- Language analyzers: more than 30 language-specific analyzers
- Custom Analyzer: an analyzer you assemble yourself

## 3. Standard Analyzer

Standard is the default analyzer in ES. It splits text according to the Unicode Text Segmentation algorithm.

```
POST _analyze
{
  "analyzer": "standard",
  "text": "The 2 QUICK Brown-Foxes jumped over the lazy dog's bone."
}
```

```
[the, 2, quick, brown, foxes, jumped, over, the, lazy, dog's, bone]
```

### 3.1 Definition

It consists of the standard tokenizer plus a lowercase token filter and a stop token filter (which removes stop words):

- Tokenizer
  - [Standard Tokenizer]
- Token Filters
  - [Standard Token Filter]: does nothing; it is kept only as a placeholder in case some filtering feature has to be added in a future version
  - [Lower Case Token Filter]: lowercases every token
  - [Stop Token Filter]: removes stop words; disabled by default

### 3.2 Configuration

- `max_token_length`: the maximum token length; a token longer than this is split at that length. Defaults to 255.
- `stopwords`: a pre-defined stop word list such as `_english_`, or an array of stop words. Defaults to `_none_` (not set).
- `stopwords_path`: the path of a file containing stop words.

### 3.3 Experiment

A custom analyzer based on `standard`:

```
PUT my_index
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_english_analyzer": {
          "type": "standard",
          "max_token_length": 5,      // maximum token length
          "stopwords": "_english_"    // enable the English stop word list
        }
      }
    }
  }
}

GET my_index/_analyze
{
  "analyzer": "my_english_analyzer",
  "text": "The hello goodname jack"
}
```

You can see that tokens longer than 5 characters are split, and the stop word "the" is gone:

```
["hello", "goodn", "ame", "jack"]
```

## 4. Simple Analyzer

The simple analyzer has a simple rule: whenever it meets a non-letter character it splits, and it lowercases the result (it is just the lowercase tokenizer).

```
POST _analyze
{
  "analyzer": "simple",
  "text": "The 2 QUICK Brown-Foxes jumped over the lazy dog's bone."
}
```

```
[the, quick, brown, foxes, jumped, over, the, lazy, dog, s, bone]
```

### 4.1 Definition

- Tokenizer
  - Lower Case Tokenizer

### 4.2 Configuration

No configuration parameters.

### 4.3 Experiment

The simple analyzer is implemented as follows:

```
PUT /simple_example
{
  "settings": {
    "analysis": {
      "analyzer": {
        "rebuilt_simple": {
          "tokenizer": "lowercase",
          "filter": []
        }
      }
    }
  }
}
```

## 5. Stop Analyzer

The stop analyzer is the same as the simple analyzer, except that it adds a token filter that removes stop words, using the English stop word list by default.

```
POST _analyze
{
  "analyzer": "stop",
  "text": "The 2 QUICK Brown-Foxes jumped over the lazy dog's bone."
}
```

You can see that non-letters are split on, the tokens are lowercased, and then the stop words are removed:

```
[quick, brown, foxes, jumped, over, lazy, dog, s, bone]
```

### 5.1 Definition

- Tokenizer
  - Lower Case Tokenizer: lowercases while tokenizing
- Token Filters
  - Stop Token Filter: removes stop words; uses the `_english_` list by default

### 5.2 Configuration

- `stopwords`: the stop word list to use, `_english_` by default, or an array of stop words
- `stopwords_path`: the path of a stop word file

### 5.3 Experiment

Below is the implementation of the Stop Analyzer: lowercase first, then filter out the stop words.

```
PUT /stop_example
{
  "settings": {
    "analysis": {
      "filter": {
        "english_stop": {
          "type": "stop",
          "stopwords": "_english_"
        }
      },
      "analyzer": {
        "rebuilt_stop": {
          "tokenizer": "lowercase",
          "filter": ["english_stop"]
        }
      }
    }
  }
}
```

The `stopwords` parameter can be set to give the filter its own stop word list:

```
PUT my_index
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_stop_analyzer": {
          "type": "stop",
          "stopwords": ["the", "over"]
        }
      }
    }
  }
}

POST my_index/_analyze
{
  "analyzer": "my_stop_analyzer",
  "text": "The 2 QUICK Brown-Foxes jumped over the lazy dog's bone."
}
```

```
[quick, brown, foxes, jumped, lazy, dog, s, bone]
```

## 6. Whitespace Analyzer

The whitespace analyzer splits whenever it meets whitespace and does not lowercase.

```
POST _analyze
{
  "analyzer": "whitespace",
  "text": "The 2 QUICK Brown-Foxes jumped over the lazy dog's bone."
}
```

```
[The, 2, QUICK, Brown-Foxes, jumped, over, the, lazy, dog's, bone.]
```

### 6.1 Definition

- Tokenizer
  - Whitespace Tokenizer

### 6.2 Configuration

No configuration.

### 6.3 Experiment

The whitespace analyzer is implemented as follows; token filters can be added to it:

```
PUT /whitespace_example
{
  "settings": {
    "analysis": {
      "analyzer": {
        "rebuilt_whitespace": {
          "tokenizer": "whitespace",
          "filter": []
        }
      }
    }
  }
}
```

## 7. Keyword Analyzer

The keyword analyzer is special: it does no tokenization at all, so whatever you put in comes out unchanged.

```
POST _analyze
{
  "analyzer": "keyword",
  "text": "The 2 QUICK Brown-Foxes jumped over the lazy dog's bone."
}
```

Note that nothing is tokenized here; the text is output as-is, as a single token:

```
[The 2 QUICK Brown-Foxes jumped over the lazy dog's bone.]
```
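Because the keyword analyzer keeps the whole input as one token, it is handy when a text field should only ever match as an exact value. The snippet below is a minimal sketch of that idea (the index name `keyword_demo` and the field `tags` are made up for illustration): the field is mapped with the keyword analyzer, and a field-based `_analyze` call confirms the whole value comes back as a single token.

```
// Hypothetical index, only to illustrate the keyword analyzer applied to a field.
PUT /keyword_demo
{
  "mappings": {
    "properties": {
      "tags": {
        "type": "text",
        "analyzer": "keyword"    // the whole field value becomes one token
      }
    }
  }
}

// Analyze some text the way the "tags" field would at index time.
GET /keyword_demo/_analyze
{
  "field": "tags",
  "text": "The 2 QUICK Brown-Foxes"
}

// => a single token: [The 2 QUICK Brown-Foxes]
```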
### 7.1 Definition

- Tokenizer
  - Keyword Tokenizer

### 7.2 Configuration

No configuration.

### 7.3 Experiment

Below is the rebuilt implementation of the Keyword Analyzer:

```
PUT /keyword_example
{
  "settings": {
    "analysis": {
      "analyzer": {
        "rebuilt_keyword": {
          "tokenizer": "keyword",
          "filter": []
        }
      }
    }
  }
}
```

## 8. Pattern Analyzer

The pattern analyzer splits text with a regular expression. Note that the regular expression matches the separators between tokens, not the tokens themselves; by default it is `\W+`, i.e. it splits on runs of non-word characters.

```
POST _analyze
{
  "analyzer": "pattern",
  "text": "The 2 QUICK Brown-Foxes jumped over the lazy dog's bone."
}
```

With the default `\W+` rule:

```
[the, 2, quick, brown, foxes, jumped, over, the, lazy, dog, s, bone]
```

### 8.1 Definition

- Tokenizer
  - Pattern Tokenizer
- Token Filters
  - Lower Case Token Filter
  - Stop Token Filter (disabled by default)

### 8.2 Configuration

- `pattern`: a Java regular expression, defaults to `\W+`
- `flags`: Java regular expression flags
- `lowercase`: lowercase the tokens; enabled (`true`) by default
- `stopwords`: the stop word list; not enabled by default, defaults to `_none_`
- `stopwords_path`: the path of a stop word file

### 8.3 Experiment

The Pattern Analyzer is implemented as follows:

```
PUT /pattern_example
{
  "settings": {
    "analysis": {
      "tokenizer": {
        "split_on_non_word": {
          "type": "pattern",
          "pattern": "\\W+"
        }
      },
      "analyzer": {
        "rebuilt_pattern": {
          "tokenizer": "split_on_non_word",
          "filter": ["lowercase"]
        }
      }
    }
  }
}
```

## 9. Language Analyzers

ES provides a long list of language analyzers, including: arabic, armenian, basque, bengali, bulgarian, catalan, czech, dutch, english, finnish, french, galician, german, hindi, hungarian, indonesian, irish, italian, latvian, lithuanian, norwegian, portuguese, romanian, russian, sorani, spanish, swedish, turkish.

```
GET _analyze
{
  "analyzer": "english",
  "text": "The 2 QUICK Brown-Foxes jumped over the lazy dog's bone."
}
```

```
[2, quick, brown, foxes, jumped, over, lazy, dog, bone]
```

## 10. Custom Analyzer

Needless to say, when the built-in analyzers do not meet your needs, you can combine the three parts yourself:

- Character Filters: pre-process the raw text, for example stripping HTML tags
- Tokenizer: split the text into terms according to some rule
- Token Filters: process the terms: lowercasing, removing stop words, adding synonyms, expansion, and so on

```
PUT my_index
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_custom_analyzer": {
          "type": "custom",
          "char_filter": ["emoticons"],
          "tokenizer": "punctuation",
          "filter": ["lowercase", "english_stop"]
        }
      },
      "tokenizer": {
        "punctuation": {
          "type": "pattern",
          "pattern": "[ .,!?]"
        }
      },
      "char_filter": {
        "emoticons": {
          "type": "mapping",
          "mappings": [
            ":) => _happy_",
            ":( => _sad_"
          ]
        }
      },
      "filter": {
        "english_stop": {
          "type": "stop",
          "stopwords": "_english_"
        }
      }
    }
  }
}

POST my_index/_analyze
{
  "analyzer": "my_custom_analyzer",
  "text": "I'm a :) person, and you?"
}
```

```
[i'm, _happy_, person, you]
```

## Summary

This article mainly introduced the Analyzers that are built into Elasticsearch. These built-in analyzers may not be the ones you use most often, but sorting them out definitely helps you understand what an Analyzer is: you see what character filters, tokenizers, and token filters are for. When there is a chance I will talk about Chinese analyzers such as IK Analyzer, ICU Analyzer, and Thulac, since Chinese analyzers are what get used most when developing for Chinese text.

You are welcome to visit my personal blog Johnnyhut and to follow my public account.
