
Photo courtesy of Thomas Shahan.

As a server administrator, I fight a never-ending battle against the likes of exploit bots, spam bots and crackers. But one recurring problem has come from an unlikely source: search engine spiders or, more specifically, Microsoft's MSNBot.

Search engine spiders have the ambitious responsibility of staying up-to-date on all of the available content on the Web. They "crawl" known Web pages, checking for updates, following links to discover new pages and sites, and generally collecting the information needed to provide current, relevant results in their respective search engines. This is a very difficult task, given not only the vast size of the Web but also many site-specific constraints: some servers can't respond to page requests as quickly as a spider might want to make them, some Web sites have content they would prefer search engines did not index, and the list goes on.

Over the years, most major search engines have refined their spiders to account for these constraints. Google's spider has become fairly good at figuring out how quickly a particular Web site can serve up pages, so it doesn't pummel the server too hard. All search engines worth mentioning have agreed on a standard file called "robots.txt" that lets a Web site administrator declare which areas of the site spiders shouldn't attempt to access or index. Most search engine spiders follow these rules just fine.
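For illustration, here is what a minimal robots.txt might look like. The paths are hypothetical, and the Crawl-delay directive is a nonstandard extension that some spiders (MSNBot among them) honor and others ignore:

    # Rules for all spiders
    User-agent: *
    Disallow: /admin/
    Disallow: /search/
    Crawl-delay: 10

    # Rules for one specific spider: keep it out entirely
    User-agent: msnbot
    Disallow: /

A spider that honors the file simply never requests the disallowed paths. As we're about to see, not all of them do.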

However, one major spider has recently been standing out as an exception. MSNBot, which powers Microsoft's Bing search engine, seems to have a different understanding of what constitutes reasonable spider behavior. It's quite common for MSNBot to request "/robots.txt" five or six times within a few seconds on any given Web site. I have seen it request URLs that have never existed on the site, as though they were invented outright or belonged to some other site entirely. In general, it tends to request pages too frequently, with too many repeat requests in a short span of time. I have also seen MSNBot request pages or areas of a site that are specifically disallowed in its robots.txt file.

Photo courtesy of Matt Mets.

MSNBot has long been criticized for its poorly tuned crawling behavior and for the fact that its various spider instances do not appear to coordinate their efforts. But recently, the server of an important open source project was reportedly taken down by abusive behavior from MSNBot.

The server belonged to CPAN Testers, the beta-testing initiative for the popular CPAN Perl code distribution system. On its blog, a CPAN Testers representative made a post entitled "MSNBOT must die!" criticizing MSNBot for effectively launching a DDoS attack on the project's server, despite a robots.txt file that expressly disallows MSNBot from accessing anything on the domain. MSNBot apparently ignored that rule, even though Microsoft's documentation says the bot fully adheres to the standard, and "decided to unleash 20-30 bots every few seconds" to crawl the site, according to the post. The CPAN Testers team responded by rebooting the server and then forcefully blocking MSNBot's known IP address ranges, which unfortunately means that Bing will no longer be able to show the site in its search results.
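For anyone considering the same last resort, the block can be done at the firewall so the requests never reach the Web server at all. Here is a minimal sketch using iptables on a Linux host; the CIDR range shown is one that has been publicly associated with MSNBot, but anyone doing this should verify Microsoft's currently published ranges first:

    # Drop all traffic from an MSNBot address range
    # (range is illustrative -- verify current ranges before blocking)
    iptables -A INPUT -s 65.52.0.0/14 -j DROP

The trade-off is total: once the range is dropped, Bing can neither crawl nor list the site, well-behaved or not. Blocking at the firewall rather than in the Web server configuration does have one advantage, though: the abusive requests stop consuming application resources entirely.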

In response to the blog post, other Web site administrators have come out with their own horror stories about MSNBot and how they've resorted to similar drastic measures. One commenter wrote, "I banned MSN a long time ago from perldesignpatterns.com. If you don't respect the robots.txt there and attempt to crawl the site, you'll wind up downloading the same content over and over in vast numbers of permutations. Guess who was ignoring robots.txt?" A commenter in a Reddit.com discussion of this story wrote, "I see this too at web2py.com. Those IP addresses you mention ignore robots.txt and also continuously scan for files that are not on the system and never were."

Since Bing went live in June 2009, it has been steadily increasing its usage share, mostly at the expense of Yahoo! and smaller search engines. But if Microsoft hopes to win the mind share of Web site administrators, it will need to fix these major crawling bugs.

Thomas Shahan photo credits: http://www.flickr.com/photos/opoterser/ / CC BY 2.0

Matt Mets photo credits: http://www.flickr.com/photos/cibomahto/ / CC BY-SA 2.0