Prying Into the Google Crowbar

I'm studying the technical implementation of the Google Toolbar to find a JavaScript technique a Web publisher could use to detect and defeat autolinking.

I wouldn't use it on Workbench, but if I were Barnes & Noble or another online bookseller, I wouldn't have much patience for software that adds links to my competitors on my pages.

I thought I might be able to compare document.fileSize to the size of the page in document.body.innerHTML.length, but these values don't match and don't change after AutoLink is pressed.
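
For anyone who wants to repeat the experiment, here is a minimal sketch of the comparison. The function name compareSizes and the shape of its return value are my own for illustration. It assumes an IE-style document object, where fileSize is an IE-only property that may arrive as a string. The two numbers measure different things (bytes downloaded versus characters in the body markup as the browser serializes it), so a mismatch on its own is not evidence of tampering.

```javascript
// Sketch only: compare the size IE reports for the downloaded file with the
// length of the body markup as currently serialized by the browser.
// `doc` is any object shaped like IE's document (fileSize is IE-specific).
function compareSizes(doc) {
  var downloaded = parseInt(doc.fileSize, 10); // handles a string or a number
  var live = doc.body.innerHTML.length;        // characters, not bytes
  return {
    downloaded: downloaded,
    live: live,
    difference: live - downloaded
  };
}
```

In a browser you would call compareSizes(document) once after the page loads and again after pressing AutoLink, then compare the two results.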

I found some good and bad things about the implementation.

The good: the current beta enables a user to choose a map provider, which can be Google Maps, MapQuest, or Yahoo Maps. (Click Options, AutoLink Settings to choose one.)

The bad: You can't use View Source to learn what the Google Toolbar has done to a page after autolinking.

When you try, Internet Explorer shows the source code of the original page. To my knowledge, this is the first time I have ever viewed a Web page where I couldn't examine the exact HTML, JavaScript, and CSS formatting used to create it.

Comments

"I wouldn't have much patience for software that adds links to my competitors on my pages."

I went to a Barnes and Noble page for a book, scanned the page, and saw the ISBN, unlinked. Then I clicked "Show Book Info" and yes, the ISBN turned into a link to what ended up being an Amazon.com page. This is a default install with no settings changed. The point: Google Toolbar 3 is software that adds links to websites (yes, okay, Barnes and Noble's competitors), but *only if the user specifically requests it*.

If there's a fire, it's not the software feature, but rather the way Google is handling this. Yes, there's a pointer on Google's FAQ about it, but come on guys, that's not entering into the conversation. It's been several days since people got upset about this. Scoble would have been all over this if it happened to Microsoft (again).

Try the link I just gave with my name below. It's been cracked: the JS rewrites the links so that they are just as if the toolbar had never touched your page.

Be sure to read all the comments though, toward the end, there is some understandable paranoia and some advice for avoiding letting Google know what you're up to :-)

Well, theoretically, it wouldn't be too hard to create an "opaque" web page in much the same manner, but entirely from the web. The key point is that the browser doesn't display DOM modifications when doing a View Source (and Explorer has no equivalent to Mozilla's DOM Inspector, which (I assume) would show the DOM modifications Google makes).

Anyways, you'd initially serve a "shell" page consisting essentially only of JavaScript and a unique ID. The script polls the server with an XMLHttpRequest, sending back the initially generated unique ID. The server sends back either JS code to pass to eval() that makes the correct modifications, or XML/HTML to use to replace an element's innerHTML property. Once the server has served the page content for one unique ID, that ID is disabled until it's recycled for another page.

Boom.
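
The server-side half of that scheme, the ID that works only once, might be sketched like this. Everything here is hypothetical: the OneTimeContentStore name, the issue and redeem methods, and the sequential IDs (a real deployment would presumably want unguessable ones). It only illustrates the "serve once, then disable" bookkeeping, not the XMLHttpRequest plumbing.

```javascript
// Hypothetical store for the one-time page IDs described above.
function OneTimeContentStore() {
  this.pending = {}; // id -> page content that has not been served yet
  this.nextId = 1;   // sequential for clarity; guessable, so not for real use
}

// Register content for a shell page and return the ID to embed in it.
OneTimeContentStore.prototype.issue = function (content) {
  var id = "page-" + this.nextId++;
  this.pending[id] = content;
  return id;
};

// Return the content for an ID exactly once; any later request gets null.
OneTimeContentStore.prototype.redeem = function (id) {
  if (!Object.prototype.hasOwnProperty.call(this.pending, id)) {
    return null;
  }
  var content = this.pending[id];
  delete this.pending[id]; // disable the ID after its first use
  return content;
};
```

The shell page would send its ID back in the polling request, and the server would answer with whatever redeem returns, so a second request for the same ID comes back empty.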

I wonder if the toolbar actually uses JavaScript to search the page and make modifications, or if it calls lower-level IE-specific APIs. If it's the former, it ought to be pretty trivial for page authors to wrap the JS functions the toolbar calls and check the source for arguments.callee.caller. That's a pretty ugly hack for general-purpose deployment, though.
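
A rough sketch of the wrapping idea, assuming the toolbar really does go through functions a page script can replace (which is exactly the open question). The wrapMethod helper and its onCall callback are made-up names. This version only records the arguments; in non-strict code the wrapper could also look at arguments.callee.caller, as suggested above.

```javascript
// Replace obj[name] with a wrapper that reports each call to onCall and then
// delegates to the original function. Returns a function that undoes the wrap.
function wrapMethod(obj, name, onCall) {
  var original = obj[name];
  obj[name] = function () {
    var args = Array.prototype.slice.call(arguments);
    onCall(name, args);                // let the page inspect the call
    return original.apply(this, args); // keep normal behavior intact
  };
  return function restore() {
    obj[name] = original;
  };
}
```

A page might try wrapMethod(document, "createElement", logger), though whether a given browser lets script replace methods on its built-in objects is something you'd have to test.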

I think that as long as the page modifications are only made at the user's explicit request, page authors don't have much right to tell users they can't modify the HTML they've downloaded. Filtering proxies, after all, already exist, and can automatically do what Google's doing, to every page.

To see the DOM rendering in IE, go to the page you want to observe, then copy this JS into the location bar and hit return:



javascript:document.write(document.getElementsByTagName('HTML')[0].innerHTML);

After the page reloads, view source.
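
A variation, if you'd rather read the markup as text than have the browser re-render it: escape the serialized DOM before writing it out. The escapeHtml helper below is a hypothetical name, and this is just one way to do it.

```javascript
// Escape the characters that would otherwise be parsed as markup, so the
// serialized DOM is displayed as text instead of being rendered again.
function escapeHtml(markup) {
  return markup
    .replace(/&/g, "&amp;") // must run first so later entities survive
    .replace(/</g, "&lt;")
    .replace(/>/g, "&gt;")
    .replace(/"/g, "&quot;");
}
```

In the bookmarklet you would pass the innerHTML string through escapeHtml and write the result inside a pre element.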

"Google Crowbar"?

Isn't this the sort of catty nicknaming usually reserved for slashdot posters? A bit childish.

Given that Google Toolbar doesn't rewrite ISBNs, addresses, etc. that already have links, isn't an author's best defence simply to write decent, well-linked pages to begin with?

Scoble complained that Google Toolbar linked to a non-Microsoft map from an address on a Microsoft web page. If that page had been well-authored to begin with, the user wouldn't have had to resort to a client-side helper to find a map. Step up.

(And frankly, any website that blocked Google Toolbar in an obnoxious way -- some of the discussion on the site Nick W linked to suggested popping up a "you are using Google Toolbar..." dialog box -- would be a website I'm unlikely to visit again. Life's too short.)
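
The "link it yourself" defence could be as simple as the sketch below, which takes the claim above at face value that the toolbar leaves already-linked ISBNs alone. The linkIsbns function and the urlFor callback are hypothetical, the pattern only covers bare 10-digit ISBNs, and it expects plain text rather than HTML (it would happily match digits inside attributes).

```javascript
// Wrap bare 10-digit ISBNs found in plain text with links to the
// publisher's own pages. urlFor(isbn) is supplied by the caller.
function linkIsbns(text, urlFor) {
  // Nine digits plus a final digit or X check character, on word boundaries.
  return text.replace(/\b\d{9}[\dXx]\b/g, function (isbn) {
    return '<a href="' + urlFor(isbn) + '">' + isbn + "</a>";
  });
}
```

A bookseller would run this over product copy when the page is generated, with urlFor pointing at its own catalog.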

Publishers who block the Crow-, er, Toolbar would probably pay a price. That's one of the reasons I wouldn't implement a block here and would be reluctant to do it elsewhere. I don't do anything to control how my RSS feeds are used, either.

But if you're Barnes & Noble or another ecommerce site, it's probably worth the price to avoid leaking customers.

"I think that as long as the page modifications are only made at the user's explicit request, page authors don't have much right to tell users they can't modify the HTML they've downloaded."

Sure, but a Web publisher has the right to make the edits difficult or impossible.

Here's a blog entry I posted a while back about an IE web extension I wrote to pop-up a window and display the current HTML. It leveraged the javascript that IE uses to display XML, so you can collapse or expand parts of the HTML tree.

The link in the entry is incorrect. You can get the extension at:
http://radio.weblogs.com/0105476/stories/2002/03/26/menuext.zip

Ooops, my a tag didn't take. Here's the URL: radio.weblogs.com

Roger, the publishers have the right to make whatever use of web technologies they want, but...

1) Locking down pages goes against the spirit of sharing that defines the web. The source code of every page and data of every image that a user requests is handed by the web server to the user's computer to display/mangle/capture as he pleases. And DRM/control schemes that try to subvert this spirit of sharing are facing massively-connected networks of users that have much more power to demand that things be the way they want. Like being able to modify the HTML documents they download (*on demand*).

2) Surfers also have the very easily exercisable right to not visit any website that makes their browsing experience less pleasurable, e.g. sites that don't let them run JavaScript (or whatever the Google Toolbar does) on any page they want to.

So it might not be a wise decision to exercise that "make edits difficult" right. Especially when you consider how many people are realistically going to browse at B&N, then use the Google Toolbar to buy the book at Amazon.

"You can't use View Source to learn what the Google Toolbar has done to a page after autolinking... To my knowledge, this is the first time I have ever viewed a Web page where I couldn't examine the exact HTML, JavaScript, and CSS formatting used to create it."


This is entirely to do with the way IE handles Javascript rewriting of the page. If you've never seen it before, then it sounds like you've never looked at a page rewritten by Javascript, which is surprising since it's fairly common these days. "View Source" will always show the content that was originally downloaded. This is in no way a deliberate maneuver on Google's part, although you're making it sound like it is.


OK, so that didn't work either. I should try reading the allowed tag list.

However I posted the code here

How many times do I have to post this link in someone's comments before you, Dave, or Scoble look at it? :)

www.threadwatch.org

Thanks to Stephen Hay who pointed it out to me.

From the link
"The following downloads will stop the Google Autolink functionality. They will stop the Google Toolbar from altering your webpages and placing links to it's chosen partners."

Direct link to the JS
www.searchguild.com

Viewing the DOM tree of a rendered page (bookmarklet):
javascript:str=window.open('','source');str.document.write(''+document.getElementsByTagName('HTML') [0].innerHTML+'');void(0);

Urrggg... commenting systems!
Attempt 2:

javascript:str=window.open('','source');str.document.write(''+document.getElementsByTagName('HTML') [0].innerHTML+'');void(0);

Thanks for posting these code examples. I'll write a new entry about them today or tomorrow.
