How does Tomcat handle search engine crawler requests?

Published: 2023-03-21 17:53:41, 科技观察

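Every site on the Internet needs to be indexed by search engines and shown promptly in their results so that it can serve information to users and readers. How does a search engine index our site? Through the process by which a "search engine crawler" fetches the site's content. Only content that has been crawled and indexed has a chance of appearing in the results when a matching query is issued.

These content-collection tools are also called crawlers, spiders, web crawlers, and so on. On the one hand we welcome their visits to collect content; on the other hand they can be a headache because they interfere with normal service. A Spider consumes server resources too, and when there are too many Spiders or they visit too frequently, the handling of normal user requests suffers. Some sites therefore dedicate a separate server to search engines and route ordinary user requests to another one.

Whether a request comes from a Spider is determined from the User-Agent field of the HTTP request headers; each search engine has its own identifier. With that identifier, an administrator can also see from the access log what the search engines have crawled.

The robots.txt file, the "crawl declaration file" that search engines read, carries similar User-agent entries. For example, here is Taobao's robots.txt:

    User-agent: Baiduspider
    Allow: /article
    Allow: /oshtml
    Disallow: /product/
    Disallow: /

    User-Agent: Googlebot
    Allow: /article
    Allow: /oshtml
    Allow: /product
    Allow: /spu
    Allow: /dianpu
    Allow: /oversea
    Allow: /list
    Disallow: /

    User-agent: Bingbot
    Allow: /article
    Allow: /oshtml
    Allow: /product
    Allow: /spu
    Allow: /dianpu
    Allow: /oversea
    Allow: /list
    Disallow: /

    User-Agent: 360Spider
    Allow: /article
    Allow: /oshtml
    Disallow: /

    User-Agent: Yisouspider
    Allow: /article
    Allow: /oshtml
    Disallow: /

    User-Agent: Sogouspider
    Allow: /article
    Allow: /oshtml
    Allow: /product
    Disallow: /

    User-Agent: Yahoo! Slurp
    Allow: /product
    Allow: /spu
    Allow: /dianpu
    Allow: /oversea
    Allow: /list
    Disallow: /

So what special handling does Tomcat apply to search engine requests?

Consider requests that involve a Session. A Session lets the server identify a specific user. When a large number of Spider requests arrive, frequent and in high volume, the server ends up creating a huge number of Sessions. They consume a great deal of memory and quietly take resources away from normal users.

To deal with this, Tomcat provides a Valve for Spider requests. It first identifies the request as coming from a Spider, then makes the Spider reuse the same session ID for its subsequent requests, which avoids piling up Session data.

Note that the session is chosen by client IP: if the Spider sends a session ID that does not correspond to a valid session, that ID is replaced, so a given Spider (client IP) gets only one Session. Here is the code:

    // Locals used below, declared at the start of the method
    boolean isBot = false;
    String sessionId = null;
    String clientIp = null;

    // If the incoming request has a valid session ID, no action is required
    if (request.getSession(false) == null) {

        // Is this a crawler - check the UA headers
        Enumeration<String> uaHeaders = request.getHeaders("user-agent");
        String uaHeader = null;
        if (uaHeaders.hasMoreElements()) {
            uaHeader = uaHeaders.nextElement();
        }

        // If more than one UA header - assume not a bot
        if (uaHeader != null && !uaHeaders.hasMoreElements()) {
            if (uaPattern.matcher(uaHeader).matches()) {
                isBot = true;
                if (log.isDebugEnabled()) {
                    log.debug(request.hashCode() + ": Bot found. UserAgent=" + uaHeader);
                }
            }
        }

        // If this is a bot, is the session ID known?
        if (isBot) {
            clientIp = request.getRemoteAddr();
            sessionId = clientIpSessionId.get(clientIp);
            if (sessionId != null) {
                request.setRequestedSessionId(sessionId); // reuse the session
            }
        }
    }

    getNext().invoke(request, response);

    if (isBot) {
        if (sessionId == null) {
            // Has bot just created a session, if so make a note of it
            HttpSession s = request.getSession(false);
            if (s != null) {
                clientIpSessionId.put(clientIp, s.getId()); // remember the session created for the Spider
                sessionIdClientIp.put(s.getId(), clientIp);
                // #valueUnbound() will be called on session expiration
                s.setAttribute(this.getClass().getName(), this);
                s.setMaxInactiveInterval(sessionInactiveInterval);
                if (log.isDebugEnabled()) {
                    log.debug(request.hashCode() + ": New bot session. SessionID=" + s.getId());
                }
            }
        } else {
            if (log.isDebugEnabled()) {
                log.debug(request.hashCode() + ": Bot session accessed. SessionID=" + sessionId);
            }
        }
    }

A Spider is recognized with a regular expression:

    private String crawlerUserAgents =
        ".*[bB]ot.*|.*Yahoo! Slurp.*|.*Feedfetcher-Google.*";

    // Compiled when the Valve is initialized
    uaPattern = Pattern.compile(crawlerUserAgents);

This way, when a Spider arrives it is recognized by its User-Agent and handled specially to reduce its impact. The Valve is named CrawlerSessionManagerValve, a name that explains itself.

Are there any other issues? Consider the use of the client IP to decide which session is shared. Tomcat recently fixed a bug here: when the Valve was configured at the Engine level and shared by multiple Hosts, it took effect for only one Host. After the fix, the request's Host and Context are taken into account in addition to the client IP. Together these elements form the client identifier under which a Session is shared.

To sum up, once the Valve has identified a Spider request by its User-Agent, it assigns the Spider a fixed Session and so avoids the resource usage caused by creating Sessions in bulk. The Valve is not enabled by default; it has to be enabled in server.xml.

Now look again at the regular expression above. Compared with Taobao's robots.txt, it does not cover several of the Chinese search engines listed there, such as Baiduspider and 360Spider. What can be done about that? Pass your own pattern in the configuration. crawlerUserAgents is a public property with a setter:

    public void setCrawlerUserAgents(String crawlerUserAgents) {
        this.crawlerUserAgents = crawlerUserAgents;
        if (crawlerUserAgents == null || crawlerUserAgents.length() == 0) {
            uaPattern = null;
        } else {
            uaPattern = Pattern.compile(crawlerUserAgents);
        }
    }

[This article is an original contribution by columnist "侯书城". To reprint it, please obtain authorization through the author's WeChat official account "Tomcat那些事儿".]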
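As a rough illustration of passing a custom pattern, the sketch below extends the default expression quoted in the article with the crawler names from the robots.txt example and checks a few User-Agent strings against it. The class name CrawlerUaPatternDemo, the helper isCrawler, and the extra alternatives are assumptions made for this example; they are not part of Tomcat. Like the Valve, the sketch uses matches(), so every alternative needs the leading and trailing ".*".

```java
import java.util.regex.Pattern;

// Hypothetical demo class, not part of Tomcat.
final class CrawlerUaPatternDemo {

    // Default pattern from the article, plus assumed alternatives for the
    // crawlers named in the robots.txt example.
    static final String CRAWLER_USER_AGENTS =
            ".*[bB]ot.*|.*Yahoo! Slurp.*|.*Feedfetcher-Google.*"
            + "|.*Baiduspider.*|.*360Spider.*|.*Yisouspider.*|.*Sogou.*";

    // Compiled once, mirroring what the Valve does in its setter.
    static final Pattern UA_PATTERN = Pattern.compile(CRAWLER_USER_AGENTS);

    // True when the whole User-Agent string matches one of the alternatives.
    static boolean isCrawler(String userAgent) {
        return userAgent != null && UA_PATTERN.matcher(userAgent).matches();
    }

    public static void main(String[] args) {
        String baidu = "Mozilla/5.0 (compatible; Baiduspider/2.0; "
                + "+http://www.baidu.com/search/spider.html)";
        String browser = "Mozilla/5.0 (Windows NT 10.0; Win64; x64) Firefox/115.0";
        System.out.println("Baiduspider is crawler: " + isCrawler(baidu));
        System.out.println("Browser is crawler: " + isCrawler(browser));
    }
}
```

The same pattern string is what would be handed to setCrawlerUserAgents, for instance through the Valve's configuration in server.xml.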
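The idea behind the bug fix, a client identifier built from client IP, Host, and Context, can be sketched as follows. This is a simplified illustration under assumed names (CrawlerSessionRegistry, clientIdentifier, remember, lookup, forget), not Tomcat's actual implementation; the key format in particular is an arbitrary choice for the example.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical sketch, not Tomcat's actual code.
final class CrawlerSessionRegistry {

    // Maps a client identifier to the session ID shared by that crawler.
    private final Map<String, String> clientIdToSessionId = new ConcurrentHashMap<>();

    // Combines IP, host and context, so the same crawler IP gets a separate
    // shared session per Host and per web application.
    static String clientIdentifier(String clientIp, String hostName, String contextName) {
        return clientIp + "-" + hostName + contextName;
    }

    // Returns the known session ID for this client, or null if none is recorded.
    String lookup(String clientId) {
        return clientIdToSessionId.get(clientId);
    }

    // Records the session that was just created for this client.
    void remember(String clientId, String sessionId) {
        clientIdToSessionId.put(clientId, sessionId);
    }

    // Drops the mapping when the session expires.
    void forget(String sessionId) {
        clientIdToSessionId.values().remove(sessionId);
    }

    public static void main(String[] args) {
        CrawlerSessionRegistry registry = new CrawlerSessionRegistry();
        String shopA = clientIdentifier("203.0.113.7", "a.example.com", "/shop");
        String shopB = clientIdentifier("203.0.113.7", "b.example.com", "/shop");
        registry.remember(shopA, "SESSION-A");
        System.out.println("Host a: " + registry.lookup(shopA));
        System.out.println("Host b: " + registry.lookup(shopB));
    }
}
```

With a key made of the client IP alone, the two Hosts in the usage example would collide on a single entry, which is the kind of conflict the article describes when one Valve is shared by several Hosts.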
