Thursday, Feb 03, 2011
So I've been avidly following the drama brewing about Bing allegedly copying Google's search results, and I don't think any of the facts are in dispute about how Microsoft collects user data and how that data is used to improve Bing's search results. Up until this morning I've been in Google's camp, basing my opinion on the logic that, no matter what happens in Bing's black box, if Google changed their search result ranking, those changes would be reflected in Bing's search results and, no matter how you equivocate, that's copying. This morning I had a different thought though, one that's completely reversed my thinking on the issue: People aren't robots, and they don't always click on the first result.
Microsoft's algorithm is assumed to give weight to how frequently users click on links from a source page (in this instance a Google search results page) to a destination page. It uses those weights to tie relevance between the unique characteristics of the source page (in this case, the search term) and the destination page. When someone searches for 'Girl Talk' and clicks on the first result, Microsoft strengthens the correlation between the search term and the page the user clicked through to. This kind of user input into Microsoft's system can result in a higher ranking for that page when a user makes the same search on Bing.
So this is copying, right? It would seem so, and Google even crafted a honeypot experiment to prove this was happening. By creating handmade results pages for 100 unlikely queries, and sending Google engineers home to search on those terms and click on the handmade first result, they successfully changed Bing's search result pages for several of those queries.
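The click weighting described above can be sketched as a toy model. This is only an illustration of the assumed behavior, not Microsoft's actual algorithm: the `ClickModel` class, its method names, and the example.com URLs are all hypothetical, and a real system would blend click data with many other signals.

```python
from collections import defaultdict


class ClickModel:
    """Toy click-weighted relevance model (hypothetical, for illustration only)."""

    def __init__(self):
        # weights[query][url] = number of observed click-throughs
        # from a results page for `query` to `url`.
        self.weights = defaultdict(lambda: defaultdict(int))

    def record_click(self, query, url):
        """Strengthen the tie between a search term and the page clicked."""
        self.weights[query.lower()][url] += 1

    def rank(self, query):
        """Return URLs for a query, most-clicked first (ties broken by URL)."""
        clicks = self.weights.get(query.lower(), {})
        return sorted(clicks, key=lambda url: (-clicks[url], url))


# Hypothetical URLs standing in for real destination pages.
model = ClickModel()
for _ in range(3):
    model.record_click("Girl Talk", "https://example.com/girl-talk-official")
model.record_click("Girl Talk", "https://example.com/girl-talk-fan-page")

print(model.rank("girl talk"))
# ['https://example.com/girl-talk-official', 'https://example.com/girl-talk-fan-page']
```

Note that the model never sees the order in which the source page listed its links. The only input is which link the user chose.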
Here's where Google's logical argument breaks down: Google's implication is that if Google were to change the search result page for, say, 'iPhone', by putting a link to California Trout as the first result, Microsoft's algorithm would copy Google, gradually causing the trout page to be their first result, too. In reality, however, people would click on the second result on Google's page, the one to Apple's iPhone page, because it's clearly the better result. Microsoft would track those clicks as being more strongly tied to the search term iPhone, and that link, not the trout link, would be strengthened in Bing's results.
To put it a different way, Bing isn't mining the first results on Google's search result pages as Google claims; they're mining the pages that users click on the most. This is a subtle and important difference, because the former would be the wholesale copying of the output of Google's algorithms, but the latter is determining the quality of the links based on user behavior. If Google's results weren't well-ranked, then Microsoft's data-mining would result not in a copy of Google's rankings, but in an improvement on them based on user behavior.
So let's go back to the Google honeypot experiment: If Google had sent those engineers home with their laptops and asked them to search on obscure queries and told them to click on the best result, then Google's experiment would have failed. The engineers were told to click on the first result, and so that result was deemed to have quality, not because it was the first result, but because it was the result that the user clicked on. Google's claim is rooted in the assumption that the first result is the one that's clicked on the most, and in almost all cases that's true, because Google has very, very good search results and the first option is usually the best one.
Combine this with the fact that, given a number of apparently equal items, a user will click on the first one (hence the reason that ad rankings are conducted by auction, with higher places being more valuable), and you get roughly the same outcome as if Microsoft were simply copying top results. Microsoft's means for weighting term relevancies for pages seems to be consistent across all web pages, and nobody has asserted that they special-case Google or any other search engine for data tracking. It would seem inconsistent for Microsoft to learn from click-paths across the web but deliberately turn a blind eye when the user ventures on to a competing search engine.
So long as Microsoft's ranking algorithms rely on user behavior as the signal, and not the explicit ranking of search results on Google search result pages, then it's not copying, even by inference, because the algorithm isn't using user behavior to infer what Google's top result is on a given query; it's using user behavior to estimate what the best link is for a given query. As subtle as that difference is, in my mind it makes a world of difference.
(Full disclosure: I am a Google shareholder.)
Update: While I don't think robots.txt should be read as letting site owners dictate what users can do with their own clickstream data, this might be a good example of where a site-supplied 'do-not-track' header would be useful, if Microsoft would choose to respect it. It would complement the proposed user agent 'do not track' header nicely, allowing tracking only by mutual consent.
Further Update: In response to the estimable Matt Cutts's follow-up post on the controversy I add the following thoughts:
Hi, I'm Kevin Fox.
©2012 Kevin Fox