Facebook rivaling Google by building its own Web Crawler powered by… You!

Update April 14th 2012: Sergey Brin warns of the threat Facebook poses to the Internet and the open web by building up content that is inaccessible from the outside. But is Facebook a threat to the Internet itself? Or just to Google?

By giving publishers easy ways to mirror their external pages on Facebook, Facebook is effectively building the most relevant kind of search engine: a semantic search engine.

And it’s doing that using free labor: viewers of pages who “Like” or “Recommend” a page using the new Like button.

Here is an example of this search engine (using this very article), already available and directly integrated in Facebook, listing all the web pages external to Facebook that users “Liked” in the “Page” section of the results:

On the technical side, you can see Facebook’s crawler in the access log of your HTTP server.
Its User-Agent is set to facebookexternalhit/*
Here is an example of Facebook pinging this very article in the access.log of the Apache server:
– - [05/May/2010:08:31:40 -0600] "GET /2010/05/facebook-rivaling-google-by-building-its-own-web-crawler-powered-by-you/ HTTP/1.1" 200 18379 "-" "facebookexternalhit/1.0 (+http://www.facebook.com/externalhit_uatext.php)"
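If you want to spot these visits in your own logs, here is a minimal Python sketch. It assumes your server writes the standard Apache "combined" log format. The function name `facebook_hits`, the sample paths, and the 192.0.2.x addresses are placeholders made up for illustration; they are not taken from my real log.

```python
import re

# Tail of an Apache "combined" log line:
#   "METHOD path protocol" status size "referer" "user-agent"
LOG_TAIL = re.compile(
    r'"(?P<method>[A-Z]+) (?P<path>\S+) [^"]*" '
    r'(?P<status>\d{3}) (?P<size>\S+) '
    r'"(?P<referer>[^"]*)" "(?P<agent>[^"]*)"'
)


def facebook_hits(lines):
    """Return the paths requested by a client whose User-Agent
    starts with facebookexternalhit/ (hypothetical helper)."""
    paths = []
    for line in lines:
        match = LOG_TAIL.search(line)
        if match and match.group("agent").startswith("facebookexternalhit/"):
            paths.append(match.group("path"))
    return paths


# Made-up sample lines; the IP addresses are documentation placeholders.
sample = [
    '192.0.2.1 - - [05/May/2010:08:31:40 -0600] "GET /2010/05/some-article/ HTTP/1.1" '
    '200 18379 "-" "facebookexternalhit/1.0 (+http://www.facebook.com/externalhit_uatext.php)"',
    '192.0.2.2 - - [05/May/2010:08:32:10 -0600] "GET /about/ HTTP/1.1" '
    '200 5120 "-" "Mozilla/5.0"',
]

print(facebook_hits(sample))
```

To run it against a real log, pass an open file instead of `sample`, for example `facebook_hits(open("access.log"))`. Keep in mind that a User-Agent string can be spoofed, so this only tells you who the client claims to be.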

Complex Expensive Mathematical Proprietary Algorithms vs. You

What will the result be?

A search engine more powerful than Google as it will index only real pages that actual users like, and not fake pages.

Google, Bing and all the serious search engines on the market must spend millions of dollars building complex automated Web Crawlers that surf around the clock, retrieving pages, following links and crunching each page to extract relevant keywords to index.

So far, the ranking of your site has depended on how this complex algorithm works and classifies keywords, with inner workings so obscure and complex that they gave birth to a totally new field of Internet technology: Search Engine Optimization (aka SEO).

Facebook is moving the power from complex, obscure and proprietary algorithms to the end users, the only people who can really tell if a page is worth reading.
Not only can Facebook tell that a page is worth indexing, it can also tell what its rank should be, just by looking at the number of people who shared the same page.
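To make the idea concrete, here is a toy Python sketch of ranking by the number of people who liked a page. It illustrates the principle only, and I am not claiming this is how Facebook actually does it: the `rank_by_likes` function, the (user, URL) event format and the sample data are all assumptions of mine.

```python
from collections import Counter


def rank_by_likes(like_events):
    """Rank URLs by how many distinct users liked them.

    like_events is an iterable of (user, url) pairs (an assumed format).
    Repeated likes of the same URL by the same user count only once.
    """
    unique_likes = set(like_events)
    counts = Counter(url for _user, url in unique_likes)
    # most_common() sorts by count, highest first
    return counts.most_common()


# Made-up sample data for illustration
events = [
    ("alice", "http://example.com/good-article"),
    ("bob", "http://example.com/good-article"),
    ("alice", "http://example.com/good-article"),  # duplicate click, ignored
    ("carol", "http://example.com/parked-domain"),
]

for url, likes in rank_by_likes(events):
    print(likes, url)
```

Counting distinct users rather than raw clicks is a deliberate choice in this sketch: it keeps one enthusiastic clicker from inflating the rank of a page.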

Every time you click on a “Like” button, you effectively tell Facebook: here is the address of a page I read and liked, and it’s worth sharing with others.

Facebook is actually building the first successful crowdsourced search engine, and it will be a powerful one with pages chosen by users, getting rid of all the fake websites out there like parked domains and parasite websites that are just exploiting the biggest weakness of search algorithms: they are not human.

Those parasite websites usually do their own web crawling and build pages (sometimes on the fly) using keyword stuffing to lure search engines into thinking their content is relevant in order to achieve higher ranking, then serve lots of ads on the pages to generate revenue from that traffic.
I personally leave those websites as soon as I land on them, and I would never click a “Like” button if they had one, because it’s too obvious they are just copy-pasting content from somewhere else.

You can’t fool the human eye so easily: only a small percentage of the people who land on parasite pages will actually “Like” them, while a regular search engine would rank them high based on their keywords.

Not to mention the porn sites.
Who will “Like” or “Recommend” a porn webpage when the link is posted directly to their Facebook profile and broadcast to all their friends in their News Feed?

The rise of Microsoft’s Bing (under the cover of Facebook)

In a way, the “Like” button is how Facebook added a Captcha to all websites, so that only content worth indexing is saved for search.
Even more powerful than a Captcha on a webpage, it’s also a Captcha on your brain and morality, as users will not reference questionable websites such as porn.

Who will benefit from that?
Facebook of course, but also Microsoft’s own search engine, Bing, which so far has been struggling against Google even after Microsoft spent billions of dollars on it.
Don’t forget that Microsoft invested $240M in Facebook back in 2007 (see Facebook’s press release), and they could well be behind Facebook’s strategy to take over the web.

Bing is already omnipresent in Facebook search and it keeps growing.

Facebook Bing Search

At more than a billion clicks per day on the “Like” button, it’s happening really fast.

The consequences are as follows:
- Facebook is referencing a “cleaner” web: it will have an inventory of real pages with fewer parasite websites referenced and more general-audience content.
- Bing from Microsoft will benefit directly from this crowdsourced search engine.
- The importance of SEO will diminish.

Of course people will adapt:
- SEO guys will get on the Facebook bandwagon and their job will be to add a “Like” button to your site (and still charge you a lot for that)
- Parasite websites will have to make their content much nicer to the human eye to fool humans into thinking they have the original content. It probably means that a simple copy of the original content, instead of keyword stuffing, will do better than a page belching out lots of content gathered from multiple places.
- Google will be (and certainly already is) restless, and will start spending billions to compete with the Microsoft+Facebook alliance.

But don’t forget Google also has its own social network: Orkut.

Could Google’s response rely on finally getting Orkut to take off?
I would start by renaming it to something I don’t have to Google every time for spelling and pronunciation…

While Google will not go away any time soon, the search engine wars are just starting, and the key is to make it social.


6 Responses to “Facebook rivaling Google by building its own Web Crawler powered by… You!”

  1. Right on. Excellent write up.

    Wowd (www.wowd.com) has had a “crowdsourced search engine” available since October 2009.

    It’s got some nice properties with respect to anonymity (meaning that all personally identifiable information stays on your own computer, if you use the Wowd client).

    And the ranking is based on popularity, as you suggest, but on a more subtle inclusion of that information in a link-based ranking algorithm.

    Thanks for the analysis.

  2. PlF says:

    Funny, Facebook hired the ReCaptcha co-founder this week:


    Maybe they are really implementing Captcha in our brains :-)

  3. Great thoughts. I downloaded the facebook like plugin and am really starting to like your blog. Very good insights.

    I do think that Facebook is contributing in a big way to how we find content on the web, but I wouldn’t classify it or even consider it to compete with Google (yet and not because Google is better). It delivers information in a different way. Facebook will be much more like how we channel surf our tvs. It will be just an online way of killing time or hanging out.

    Google is for finding important information and finding it fast. You have a question. You go to Google. Need an opinion. You probably will end up turning to Facebook.

    One thing for sure is that it will be interesting to see how it all plays out in the next 3-4 years. I personally predict Google will have less of a spotlight but will still be greatly important.

  4. Jeremy says:

    Thank you for explaining why I kept getting “facebook.com/externalhit_uatext.php” as a source referrer for my site. Makes sense how Facebook and MS are taking on Google with more user-relevant and friend-suggested content.

  5. I think this will fail. Well at least in the sense that they will “rival” Google. Nobody can rival Google, they’re miles ahead.