Whenever I write a log post that gets published on Planet Debian, shortly afterwards I get various hits for the URL of the log post with either "_hide" or "_show" appended.

Looking at the planet.debian.org source, this is due to:

document.write( "<a href="#" id="http://jmtd.net/log/six_silicates/_hide"...

I'm not sure why Google is treating the id attribute of an anchor as something worth crawling. Is this a Google bug? Does anyone else experience this? Does anyone have any idea how we might get it to stop (whilst still indexing real pages, of course :))?


I'm not surprised that Google crawls URIs wherever it finds them, not just in hypertext links. From Google's point of view, there's always a chance that the URI points to something that someone might want to search.
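To illustrate the point, here is a minimal sketch of a naive scanner that collects anything that looks like an absolute URL, regardless of which attribute it sits in. This is only an assumption about how such a harvester could behave, not Google's actual logic; the function name findUrlCandidates is made up for the example, and the input is the truncated fragment quoted above.

```javascript
// Hypothetical sketch: treat any absolute http(s) URL found anywhere in the
// markup as a crawl candidate. This is NOT how Google is known to work.
function findUrlCandidates(markup) {
  // Match http:// or https:// and then everything up to whitespace,
  // a quote character, or an angle bracket.
  const pattern = /https?:\/\/[^\s"'<>]+/g;
  return markup.match(pattern) || [];
}

// The fragment from the Planet Debian source, exactly as far as it was quoted.
const fragment = '<a href="#" id="http://jmtd.net/log/six_silicates/_hide"';
console.log(findUrlCandidates(fragment));
```

The href="#" is ignored because it is not an absolute URL, while the value of the id attribute is picked up purely because of its shape.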

This could be fixed by Planet Debian changing (or just removing) the @id attributes in the hide/show links. They aren't valid HTML or XML IDs, and anyway they don't seem to be used. (Some people consider it rude to generate and publish invalid URIs pointing to other people's domains, precisely because greedy automated scripts will eventually come along and try to dereference them.)
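If the IDs do turn out to be needed, one option is to derive them from the post URL in a form that no longer looks like a URL. The following is a rough sketch under that assumption; toSafeId and the "entry-" prefix are invented for illustration and are not part of the Planet Debian code.

```javascript
// Hypothetical helper: turn a post URL plus an action ("hide" or "show")
// into an id made only of letters, digits, and hyphens.
function toSafeId(url, action) {
  const slug = url
    .replace(/^https?:\/\//, "")      // drop the scheme so it no longer reads as a URL
    .replace(/[^A-Za-z0-9]+/g, "-")   // collapse every other run of characters into "-"
    .replace(/^-+|-+$/g, "");         // trim leading and trailing hyphens
  // The fixed prefix guarantees the id starts with a letter.
  return "entry-" + slug + "-" + action;
}

console.log(toSafeId("http://jmtd.net/log/six_silicates/", "hide"));
// prints "entry-jmtd-net-log-six-silicates-hide"
```

Two different URLs can in principle collapse to the same slug (for example, "a_b" and "a/b"), so a real implementation might prefer a counter or a hash of the URL instead.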

-- Matt Brubeck http://limpet.net/mbrubeck/, 2009-05-21
