## 1 Preface

Elasticsearch is an open-source search engine built on top of Apache Lucene, a full-text search engine library. So how to connect Elasticsearch with Python naturally becomes the question we care about (why does everything have to be tied to Python?).

## 2 Python Interaction

Accordingly, Python provides a client library for connecting to Elasticsearch:

```
pip install elasticsearch
```

First, connect to Elasticsearch and build the operation object:

```python
def __init__(self, index_type: str, index_name: str, ip="127.0.0.1"):
    # self.es = Elasticsearch([ip], http_auth=('username', 'password'), port=9200)
    self.es = Elasticsearch("localhost:9200")
    self.index_type = index_type
    self.index_name = index_name
```

The default port is 9200; make sure a local Elasticsearch instance is up and running before initializing.

Fetch a document by ID:

```python
def get_doc(self, uid):
    return self.es.get(index=self.index_name, id=uid)
```

Insert documents (note that `doc_type` is deprecated in newer Elasticsearch versions):

```python
def insert_one(self, doc: dict):
    self.es.index(index=self.index_name, doc_type=self.index_type, body=doc)

def insert_array(self, docs: list):
    for doc in docs:
        self.es.index(index=self.index_name, doc_type=self.index_type, body=doc)
```

Search documents:

```python
def search(self, query, count: int = 30):
    dsl = {
        "query": {
            "multi_match": {
                "query": query,
                "fields": ["title", "content", "link"]
            }
        },
        "highlight": {
            "fields": {
                "title": {}
            }
        }
    }
    match_data = self.es.search(index=self.index_name, body=dsl, size=count)
    return match_data

def __search(self, query: dict, count: int = 20):  # count: size of the result set
    results = []
    params = {
        'size': count
    }
    match_data = self.es.search(index=self.index_name, body=query, params=params)
    for hit in match_data['hits']['hits']:
        results.append(hit['_source'])
    return results
```

Delete the index:

```python
def delete_index(self):
    try:
        self.es.indices.delete(index=self.index_name)
    except Exception:
        pass
```

Nice. Wrapping all of this in a search class makes it convenient to call, so here it is in full:

```python
from elasticsearch import Elasticsearch


class elasticSearch():

    def __init__(self, index_type: str, index_name: str, ip="127.0.0.1"):
        # self.es = Elasticsearch([ip], http_auth=('elastic', 'password'), port=9200)
        self.es = Elasticsearch("localhost:9200")
        self.index_type = index_type
        self.index_name = index_name

    def create_index(self):
        if self.es.indices.exists(index=self.index_name) is True:
            self.es.indices.delete(index=self.index_name)
        self.es.indices.create(index=self.index_name, ignore=400)

    def delete_index(self):
        try:
            self.es.indices.delete(index=self.index_name)
        except Exception:
            pass

    def get_doc(self, uid):
        return self.es.get(index=self.index_name, id=uid)

    def insert_one(self, doc: dict):
        self.es.index(index=self.index_name, doc_type=self.index_type, body=doc)

    def insert_array(self, docs: list):
        for doc in docs:
            self.es.index(index=self.index_name, doc_type=self.index_type, body=doc)

    def search(self, query, count: int = 30):
        dsl = {
            "query": {
                "multi_match": {
                    "query": query,
                    "fields": ["title", "content", "link"]
                }
            },
            "highlight": {
                "fields": {
                    "title": {}
                }
            }
        }
        match_data = self.es.search(index=self.index_name, body=dsl, size=count)
        return match_data
```

Now try inserting the data stored in MongoDB into ES:

```python
from datetime import datetime

# db: a pymongo database handle (the connection setup is omitted in the original post)
sheet = db['spider'].find({}, {'_id': 0, })

es = elasticSearch(index_type="spider_data", index_name="spider")
es.create_index()

for i in sheet:
    data = {
        'title': i["title"],
        'content': i["data"],
        'link': i["link"],
        'create_time': datetime.now()
    }
    es.insert_one(doc=data)
```

Then start the elasticsearch-head plugin (if it was installed via npm, cd into its root directory and run `npm run start`) and open http://localhost:9100/ locally: the newly added spider data has indeed made it into the index. It's that simple.

## 3 Storing Crawled Data

To make ES search work you first need data to back it, and large amounts of data usually come from crawlers. To save time, I wrote a simple crawler that scrapes Baidu Baike. Quick and dirty: first, recursively collect lots and lots of URL links.

```python
import requests
import re
import time

exist_urls = []
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/62.0.3202.62 Safari/537.36',
}

def get_link(url):
    try:
        response = requests.get(url=url, headers=headers)
        response.encoding = 'UTF-8'
        html = response.text
        # ... the original post is cut off here, inside the re.findall call that extracts links
```
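A minimal sketch of how the function and the recursive collection might continue — the link regex, the `/item/` URL shape, the `crawl` driver, and the one-second delay are all assumptions standing in for the truncated original, not the author's code:

```python
import re
import time

import requests

exist_urls = []
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 '
                  '(KHTML, like Gecko) Chrome/62.0.3202.62 Safari/537.36',
}

def get_link(url):
    """Fetch a page and return the Baidu Baike entry links found on it."""
    try:
        response = requests.get(url=url, headers=headers)
        response.encoding = 'UTF-8'
        html = response.text
        # Assumed pattern: relative links to other entries, e.g. href="/item/Python"
        return re.findall(r'href="(/item/[^"#?]+)"', html)
    except Exception as exc:
        print(exc)
        return []

def crawl(start_url, depth=2):
    """Breadth-first recursion over entry links, skipping URLs already seen."""
    frontier = [start_url]
    for _ in range(depth):
        next_frontier = []
        for url in frontier:
            if url in exist_urls:
                continue
            exist_urls.append(url)
            for link in get_link(url):
                next_frontier.append('https://baike.baidu.com' + link)
            time.sleep(1)  # throttle requests to be polite to the server
        frontier = next_frontier
    return exist_urls

if __name__ == '__main__':
    print(crawl('https://baike.baidu.com/item/Python'))
```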

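To round things off, here is a minimal usage sketch of the `elasticSearch` wrapper from section 2 — assuming a local node on port 9200 and the field names (`title`, `content`, `link`) used above; the sample document and the highlight handling are illustrative, not from the original post:

```python
from datetime import datetime

# assumes the elasticSearch class defined in section 2 is in scope
es = elasticSearch(index_type="spider_data", index_name="spider")

# insert one illustrative document (hypothetical sample data)
es.insert_one(doc={
    'title': 'Python',
    'content': 'Python is a widely used programming language.',
    'link': 'https://baike.baidu.com/item/Python',
    'create_time': datetime.now(),
})

# full-text search across title/content/link, as wired up in search()
result = es.search('Python')
for hit in result['hits']['hits']:
    source = hit['_source']
    # highlight fragments are only present when the title field matched
    fragments = hit.get('highlight', {}).get('title', [])
    print(source['link'], fragments or source['title'])
```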