Collateral Damage of the Robots Race (on the Web)

13 March 2011

It is now official: Google has declared war on ‘content farms’. Last week, Google changed its algorithm with the explicit goal of filtering out the low-quality content that content farms produce, changing up to 15% of its search results. I am sure that, because of this measure, the average quality of high-ranking pages has gone up, but I am also confident that this is just one step in a long battle yet to come. I believe Google is causing content farming as much as it is trying to fight it, which is a recipe for a never-ending story. In this post I will try to explain why I believe this is the case, and I will discuss some side effects of the battle.

What struck me most about content farms when I first learned about them was not so much that they produced an ingenious form of web spam, but that these seemed to be legitimate companies with a ‘respected’ business model. Wikipedia describes content farms as “companies that employ large numbers of writers to generate large amounts of textual content which is specifically designed to satisfy algorithms for maximal retrieval by automated search engines”. By that definition, both Demand Media and The Huffington Post are content farms. According to Wired Magazine, Demand Media aims to “predict any question anyone might ask and generate an answer that will show up at the top of Google’s search results”, leading it to produce up to 4,000 videos and articles a day. 4,000 videos? Yes, I guess there is something fishy about that baffling amount of content production, but in a way Demand Media’s ‘mission’ appears to be respectable. The Huffington Post is an even more respected source: in March 2009, The Guardian called it the world’s most powerful blog; in 2010 it won the Webby People’s Voice Award for best political blog. Seemingly, there is nothing wrong with content farming, and it shouldn’t be called web spam without further thought.

Or should it? What The Huffington Post and Demand Media have in common is that they produce content in response to trending search topics. Let me draw a continuum running from purely expressive to purely responsive.

Expressive ←——————————————————→ Responsive

Egoblog —— SEO —— Huffington Post —— Demand Media —— Spam

When I write a blog post without caring what anyone else thinks about it, or whether they want to read it, it is purely expressive. As soon as I start to care about visitor stats, I move towards being more responsive. I could move further still, for instance by writing about ‘hot’ topics, or writing specifically to get attention from an audience. One step further, I would hire a search engine optimization (SEO) professional to adapt my site to rank high in Google’s results. Finally, site stats may turn out to be the only measure I care about (possibly because of advertising revenues). When a single party, such as Google, decides on relevance, responsive content wins the ranking, and thus our attention. As a result, responsive content tends to be more accessible and more prevalent on the web.

The content of content farms differs from other ‘responsive’ types of content because content farms use robots (or computer algorithms) to do the search engine optimization (see this ‘Wired’ post). This makes perfect sense. As search engines rely on robots to decide on quality, the best way to figure out how to respond to them is by using ‘SEO’ robots. In any case: if you consider the web as a passive and independent body of information, and Google as a company trying to guide you to the best places in this ‘passive’ information space, think again. Google has been dominant for so long now that much of the content on the web is being produced specifically to ‘please’ Google’s robots, with the use of SEO robots as the latest highlight. Google’s algorithms are shaping the information that is produced for the web as much as they are selecting it for us. The web is an active and dynamic battlefield, not a cleverly indexed, passive information space.
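To make this feedback loop concrete, here is a toy sketch in Python. The scoring rule and the ‘robot’ are my own inventions for illustration, not anything Google or a real SEO tool uses: a naive ranker rewards keyword density, and an equally naive SEO robot keeps an edit whenever the ranker rewards it.

```python
# Toy model of the robots race: a stand-in 'ranker' and an SEO robot
# that rewrites a page to please it. Purely illustrative; the scoring
# rule is a caricature, not a real search engine's algorithm.

def rank_score(page: str, query: str) -> float:
    """Score a page by keyword density for the query terms."""
    words = page.lower().split()
    if not words:
        return 0.0
    query_terms = set(query.lower().split())
    hits = sum(1 for w in words if w in query_terms)
    return hits / len(words)

def seo_robot(page: str, query: str, rounds: int = 5) -> str:
    """Blindly append query terms, keeping any edit the ranker rewards."""
    for term in query.lower().split() * rounds:
        candidate = page + " " + term
        if rank_score(candidate, query) > rank_score(page, query):
            page = candidate  # the ranker approved, so the edit stays
    return page

original = "A thoughtful essay that mentions content quality once."
optimized = seo_robot(original, "content quality")
print(rank_score(original, "content quality"))   # 0.25
print(rank_score(optimized, "content quality"))  # ~0.67: a 'better' page, a worse text
```

The point of the toy: nothing in the loop ever asks whether the text got better for a human reader; it only asks whether the ranker’s score went up.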

Is this a problem? Well, there is no one-to-one relationship between responsiveness and quality of content. But one way to be ‘responsive’ is to produce a lot of content, and as soon as quantity of content becomes more important than quality, you need to take shortcuts in producing it. This is the collateral damage of search dominance. Content farms produce many posts at the lowest possible fees for their authors, resulting in content that copies a lot from other sources and offers little new or insightful information. Let us call this content, which is just a slightly modified copy of an existing piece, ‘derivative content’.
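How would a search engine spot derivative content? One classic family of techniques, sketched below, compares documents by their overlapping word ‘shingles’. The example texts and the 0.5 threshold are made up for this post, and I am not claiming this is what Google actually does:

```python
# Minimal sketch of near-duplicate detection using word shingles and
# Jaccard similarity -- one standard way to flag derivative content.
# The example texts and the 0.5 threshold are invented for illustration.

def shingles(text: str, k: int = 3) -> set:
    """Return the set of overlapping k-word windows in the text."""
    words = text.lower().split()
    return {" ".join(words[i:i + k]) for i in range(len(words) - k + 1)}

def jaccard(a: set, b: set) -> float:
    """Shared shingles relative to the combined set of shingles."""
    return len(a & b) / len(a | b) if (a | b) else 0.0

source  = "how to prune a rose bush in early spring for healthy growth"
rewrite = "how to prune a rose bush in early spring for strong growth"

similarity = jaccard(shingles(source), shingles(rewrite))
print(f"similarity: {similarity:.2f}")  # 0.67 for this pair
if similarity > 0.5:
    print("likely derivative content")
```

A single-word swap still leaves two thirds of the shingles intact, which is exactly why lightly rewritten copies are detectable in principle, yet cheap to produce at scale.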

The effect of large-scale production of derivative content is largely negative. Derivative content makes Google’s (and the searcher’s) task more difficult. Consider the case in which you want to find several points of view on a topic and you have to browse through large bulks of similar derivative content, rather than getting directly to the reliable sources. At the least it would be inconvenient. Derivative content also strengthens our representativeness bias. We adjust our beliefs to the availability of evidence even if we know it is biased, neglecting “base rates”; one amplifier of this bias is the wealth of (online) sources expressing a particular view (see my post on the impact of search on our cognitive biases). If most content becomes fully responsive to any information need, we enter a dystopian scenario. Say ‘content farms’ were responsible for most information we see on the web. In that case we would only produce the information that we search for. We would create a massively schizophrenic information space, and we would all feel like Nina in “Black Swan” or John Nash in “A Beautiful Mind”, no longer able to trust our own judgement. So, to prevent this, we need a proper balance between responsive and expressive information, and, because the web has become such a dominant information source, our search engines will, in the end, need to try to safeguard it.

But let me try to put the dystopian scenarios aside. I do believe it is wrong to see content farming as an accidental evil that Google is trying to fight. If it is an evil, it isn’t an independent one. Rather, it is a cancer growing on an ecosystem that Google’s dominant position created in the first place. We shouldn’t see the web as a passive information space, crawled by Google’s bots. Rather, it is an active body of information, strongly shaped by the preferences of those very same bots. Content farms expose a soft force that Google exerts on all information that we are producing for the web. Even if Google manages to rule out content farms explicitly, it is still pushing content production towards a lower-quality format. The robots race may turn out to be a long battle: spam bots will try to behave more and more like humans, and Google’s crawlers will get smarter and smarter at carrying out an automated Turing test of web content. But it is unlikely that this will automatically result in the production of high-quality web pages. To stop the decline in the quality of content on the web, Google needs to realize it is causing the ‘race to the bottom’ that we are seeing. Google needs to adapt its algorithms so that quality becomes more important than ‘relevance to query’. To do this, Google, and more importantly, the rest of the world, need to realize that there is no high-quality answer to every question anyone could ask. Even in Googletopia, (high-quality) information remains a limited resource.

Further reading

I wrote about the effect of Google’s dominance in my post about Googlization. I wrote about the way search amplifies our cognitive biases in my post “Cognitive Bias in the Global Information Subway”. I also wrote about Search Skills and about Openness According to Google.

As mentioned, ‘Wired’ has an excellent article about content farming. GigaOM has been following the content-farming debate from the start. Two good posts are “The benefits and risks of content farms” and “Google Tightens the Screws on Content Farms”.



