Yahoo-Overture does not respect robots.txt

Today I received the following message in my mailbox:
> an improper scan has caused a ban on your site
> date: Tue Feb 24 18:30:20 2004
> ip: 66.77.73.32
> host: shop-gw.sac.overture.com
> agent: Yahoo-VerticalCrawler-FormerWebCrawler/3.9 crawler at trd dot overture dot com; http://www.alltheweb.com/help/webmaster/crawler
I regularly receive messages like this, usually because bad robots or script kiddies have hit my spam trap. But Yahoo and Overture are well-respected companies, and I assumed they would respect my `robots.txt` file, in which I explicitly deny access to the `/private` folder:
> User-agent: *
> Disallow: /cgi-bin
> Disallow: /dummy/dummy.html
> Disallow: /errors
> Disallow: /fimcap
> Disallow: /js
> Disallow: /mailtemplates
> Disallow: /mt-static
> Disallow: /private
> Disallow: /spam
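
As a quick sanity check that these rules really do cover the trap, here is a minimal sketch using Python's standard `urllib.robotparser`. The rules are copied from the file above; the helper name `is_allowed` is my own, and the short agent token is just the first part of the crawler's User-Agent string.

```python
from urllib.robotparser import RobotFileParser

# Rules copied from the robots.txt shown above.
ROBOTS_TXT = """\
User-agent: *
Disallow: /cgi-bin
Disallow: /dummy/dummy.html
Disallow: /errors
Disallow: /fimcap
Disallow: /js
Disallow: /mailtemplates
Disallow: /mt-static
Disallow: /private
Disallow: /spam
"""

# First token of the crawler's User-Agent string (it falls under "User-agent: *").
AGENT = "Yahoo-VerticalCrawler-FormerWebCrawler/3.9"

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())


def is_allowed(path, agent=AGENT):
    """Return True if the rules above permit `agent` to fetch `path`."""
    return parser.can_fetch(agent, path)


if __name__ == "__main__":
    for path in ("/index.html", "/private/", "/private/welcome.html"):
        print(path, "allowed" if is_allowed(path) else "disallowed")
```

Because `Disallow` lines are matched as path prefixes, `/private` covers `/private/` and everything below it, so a well-behaved crawler has no reason to end up there.
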
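So I looked in my access log and found that they had indeed violated my robots file!! The crawler fetched `robots.txt` first, and about 37 minutes later requested `/private/` anyway: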
> 66.77.73.32 - - [24/Feb/2004:17:20:24 -0500] "GET /robots.txt HTTP/1.0" 200 758 "-" "Yahoo-VerticalCrawler-FormerWebCrawler/3.9 crawler at trd dot overture dot com; http://www.alltheweb.com/help/webmaster/crawler"
> 66.77.73.32 - - [24/Feb/2004:17:57:22 -0500] "GET /private/ HTTP/1.0" 200 4815 "-" "Yahoo-VerticalCrawler-FormerWebCrawler/3.9 crawler at trd dot overture dot com; http://www.alltheweb.com/help/webmaster/crawler"
> 66.77.73.32 - - [24/Feb/2004:18:30:20 -0500] "GET /private/welcome.html HTTP/1.0" 200 351 "-" "Yahoo-VerticalCrawler-FormerWebCrawler/3.9 crawler at trd dot overture dot com; http://www.alltheweb.com/help/webmaster/crawler"
Notice that the page mentioned in the User Agent string states that Yahoo-Overture _does_ support the robots exclusion protocol!
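
For anyone who wants to run the same check on their own logs, here is a rough sketch in Python. It assumes the access log is in the usual "combined" format, as in my excerpt, and does plain prefix matching against the `Disallow` paths (no wildcard handling). The names `LOG_LINE`, `SAMPLE_LOG`, and `find_violations` are made up for this example; the sample lines are the three entries from my log.

```python
import re

# Disallowed path prefixes, copied from my robots.txt.
DISALLOWED = [
    "/cgi-bin", "/dummy/dummy.html", "/errors", "/fimcap", "/js",
    "/mailtemplates", "/mt-static", "/private", "/spam",
]

# Assumed "combined" log format:
# ip ident user [time] "METHOD path proto" status size "referer" "agent"
LOG_LINE = re.compile(
    r'^(?P<ip>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+) [^"]*" '
    r'(?P<status>\d{3}) (?P<size>\S+) '
    r'"(?P<referer>[^"]*)" "(?P<agent>[^"]*)"$'
)

AGENT = (
    "Yahoo-VerticalCrawler-FormerWebCrawler/3.9 crawler at trd dot overture "
    "dot com; http://www.alltheweb.com/help/webmaster/crawler"
)

# The three entries from the log excerpt.
SAMPLE_LOG = [
    f'66.77.73.32 - - [24/Feb/2004:17:20:24 -0500] "GET /robots.txt HTTP/1.0" 200 758 "-" "{AGENT}"',
    f'66.77.73.32 - - [24/Feb/2004:17:57:22 -0500] "GET /private/ HTTP/1.0" 200 4815 "-" "{AGENT}"',
    f'66.77.73.32 - - [24/Feb/2004:18:30:20 -0500] "GET /private/welcome.html HTTP/1.0" 200 351 "-" "{AGENT}"',
]


def find_violations(lines, disallowed=DISALLOWED):
    """Return (time, ip, path) for every request to a disallowed prefix."""
    hits = []
    for line in lines:
        m = LOG_LINE.match(line.strip())
        if m and any(m["path"].startswith(prefix) for prefix in disallowed):
            hits.append((m["time"], m["ip"], m["path"]))
    return hits


if __name__ == "__main__":
    for time, ip, path in find_violations(SAMPLE_LOG):
        print(f"{time}  {ip}  {path}")
```

Lines that do not match the assumed format are silently skipped, so a differently configured log would need its own pattern.
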
