Jon Udell - News about Google News (about the site Infoworld.com not appearing in Google News, though its weblog subdomain does appear):
According to Google News product manager Nathan Stoll, the omission is a technical problem rather than an editorial one. The Google News crawler, he says, is a very different beast from the regular Google crawler. And while the regular crawler happily includes our stuff, the news crawler -- for reasons as yet undetermined -- doesn't.
I was surprised to learn this because I've only ever been aware of three user-agent strings (i.e., crawler signatures) broadcast by Google bots:
1. GoogleBot (for the main index)
2. GoogleBot-Image (for images)
3. Feedfetcher-Google (for RSS feeds)
There's no separate signature for the news crawler. It identifies itself as GoogleBot too. Given that the main crawler and the news crawler use different algorithms for site traversal and page analysis, according to Stoll, I'd expect them to identify themselves differently. But perhaps for historical reasons, they don't.
Despite a tendency for snarky sites to play "Gotcha" with that explanation, it does seem to be true.
According to an older mailing list report,
Leaving out the version numbers, Google News user agent is "Mozilla (Googlebot)" whereas regular Google is just Googlebot.
I suspect that's slightly incorrect now: Google News has the "Mozilla (Googlebot)" signature, but not all instances of that signature are Google News. (It may have been true at the time it was written, given various lag times in the use of different code.)
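To make the signature distinction concrete, here is a minimal sketch of how a server log script might sort Google user-agent strings into the families discussed above. The function name, the labels, and the sample strings are my own illustrations (assumptions, not taken from any log or from Google documentation), and a "Mozilla"-style Googlebot hit only suggests a possible news crawl; it does not prove one.

```python
def classify_google_agent(user_agent: str) -> str:
    """Roughly bucket a user-agent string by Google crawler family.

    The labels are hypothetical, chosen for this sketch only.
    """
    ua = user_agent.lower()
    # Check the more specific names first, since "googlebot-image"
    # also contains the substring "googlebot".
    if "feedfetcher-google" in ua:
        return "feed"
    if "googlebot-image" in ua:
        return "image"
    if "googlebot" in ua:
        # A "Mozilla ... (Googlebot)" wrapper MAY be the news crawler,
        # but other Google crawls can use the same form.
        if ua.startswith("mozilla"):
            return "mozilla-googlebot"
        return "googlebot"
    return "other"


if __name__ == "__main__":
    # Illustrative sample strings (assumed, not copied from real logs).
    samples = [
        "Googlebot/2.1 (+http://www.google.com/bot.html)",
        "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)",
        "Googlebot-Image/1.0",
        "Feedfetcher-Google; (+http://www.google.com/feedfetcher.html)",
    ]
    for s in samples:
        print(classify_google_agent(s), "<-", s)
```

Matching on substrings rather than exact strings keeps the sketch indifferent to version numbers, in the same spirit as the mailing list report's "leaving out the version numbers".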
Given that Google News does include "www.infoworld.nl", my guess is that someone made a typo somewhere in the sources file for "www.infoworld.com", and the Google News crawler is mistakenly looking at a cybersquatted site. That would explain why it isn't reporting a can't-find-site error, yet isn't finding any useful news content either.
By Seth Finkelstein | posted in google | on July 20, 2006 03:57 PM (Infothought permalink)