Is there an algorithm for Wikipedia?

Google’s latest offering,

is rather fun, but I’m not convinced that I will use it very often.

Compare search results like this:


The page on Wikipedia is much more useful. It seems that humans are better at making tables of data from diverse sources of information than computers are at this point. Will it always be this way?

Wikipedia has strict guidelines on how articles are written and how propositions should be backed by reliable sources. Could these guidelines be further formalised and pave the way for an algorithm that could write something like Wikipedia from scratch? Google seem to be attempting to build a system that can produce the pages on Wikipedia with names like “List_of_*”. For all I know, Google might have looked at all the articles on Wikipedia whose names match that pattern and used them to get their tables started.

Sport is a popular subject. It’s safe to say that there are a lot of people who are willing to give up their free time to collate data on the subject. If some joker changed the Wikipedia table to say that Manchester United were relegated at the end of the previous season, this error would be corrected quickly as there is no lack of people who care deeply about the matter.

During a presentation for Wolfram Alpha, Stephen Wolfram was asked whether he had taken data from Wikipedia. He denied it and said that the problem with Wikipedia was that one user might conscientiously add accurate data for 200 or so different chemical compounds in various articles. Over the course of a couple of years, every single article would get edited by different groups, and the data would diverge. He argued that these sorts of projects needed a director, such as himself. However, he said that his team had used Wikipedia to find out what people were interested in. If the article on carbon dioxide is thousands of characters long, is edited five times a day, has an extensive talk page, is available in dozens of languages, and has 40 references, it is safe to say that carbon dioxide is a chemical compound that people are interested in. This is true regardless of the accuracy of the article’s content. It would be pretty trivial for Google (or any Perl hacker with a couple of hours to spare and a few gigs of hard disk space) to rank all of the pages on Wikipedia according to public interest using the criteria that I just listed.
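To give a rough idea of how little code such a ranking might take, here is a minimal sketch in Python. It assumes the five figures listed above have already been extracted for each article. The `ArticleStats` fields, the `interest_score` function and the equal weighting are all my own inventions for illustration, and the sample numbers are made up to loosely follow the carbon dioxide description.

```python
from dataclasses import dataclass
from math import log1p


@dataclass
class ArticleStats:
    """Per-article figures. All field names here are hypothetical."""
    title: str
    length_chars: int      # size of the article text in characters
    edits_per_day: float   # average number of edits per day
    talk_chars: int        # size of the talk page in characters
    languages: int         # number of language editions
    references: int        # number of cited references


def interest_score(a: ArticleStats) -> float:
    # log1p damps very large values so that no single signal dominates.
    # Weighting every signal equally is an arbitrary assumption.
    return (log1p(a.length_chars) + log1p(a.edits_per_day)
            + log1p(a.talk_chars) + log1p(a.languages)
            + log1p(a.references))


def rank_by_interest(articles):
    """Return the articles sorted from most to least 'interesting'."""
    return sorted(articles, key=interest_score, reverse=True)


# Illustrative values only, not real Wikipedia data.
sample = [
    ArticleStats("Obscure compound", 1_200, 0.01, 0, 2, 1),
    ArticleStats("Carbon dioxide", 9_000, 5.0, 20_000, 36, 40),
]
ranked = rank_by_interest(sample)
for article in ranked:
    print(article.title, round(interest_score(article), 2))
```

Note that nothing in the score looks at whether the article is accurate, which is exactly the point: it measures attention, not quality.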

In many ways, an algorithmic encyclopaedia is to be preferred because of the notorious problems of vandalism and bias. However, tasks like condensing and summarising are not straightforward. The problem of deciding what to write about could be tackled by analysing Wikipedia, as described above, and tracking visitor trends. Is there going to be a move to unseat Wikipedia in the coming years? How long before humans can be removed from the algorithm completely?

Distributed Image Recognition

One of the most enduring myths of the internet is that it was designed to survive a nuclear war. If NORAD gets obliterated, the system as a whole keeps running. This is true for many of the systems of the internet, but anyone who administers an Apache, MySQL or SVN server can tell you that this isn’t as simple as it sounds.

One of the worries about the internet is that we all depend on Google too much and that search is too important for one company to handle. I try to use all the search engines out there in rotation (although Google always seems to be best at finding what I’m looking for).

There are a number of active projects to build an open, distributed search engine, for example Grub, Faroo and YaCy. I like the idea of these but have a few concerns. What’s to stop the client, running on your computer in the background, downloading a lot of content that’s objectionable or illegal? Perhaps we should all be running such software as an act of civil disobedience to make it impossible for the police to track traffic.

There are other sorts of search that could be distributed. Image recognition is very processor intensive but should be parallelisable.

A distributed clone of the ESP Game could be written. Not that I want to knock the existing version of the game; it’s great, but centralised. Users give Google all this data. Is Google obliged to give it all back to the users, or just the parts that it wants to show?

Is it really that important that search is distributed? Eventually, I hope that it becomes that way. It’s a similar deal to Windows vs Linux or Java vs PHP or WordPress vs Blogspot. The cathedral vs the bazaar.

Standing up for slow computers

I work with computers a lot.

I write PHP, SQL and JavaScript for a living and have a rough idea of what a usable web site looks like. You might have guessed from my choice of theme for my blog that I’m not really into frills. At work, I leave all that in the more than capable hands of our resident CSS and Photoshop guru, Saul Howard.

However, designing web sites is all about the user’s experience and I am very interested in that.

As anyone who has designed a web page will be able to tell you, consistency across browsers is still some way off. To avoid the pitfalls of this, we keep a variety of computers in the office. In fact, I’m using an ancient PIII 700 box right now with a 14" CRT monitor. I never let a web site out on the web without first checking that it is usable (if a little cramped) on this machine. My life would be somewhat easier if I could put messages like this:

on my pages but it just wouldn’t feel right.

One of the buzzwords in web design for the last couple of years has been AJAX. When implemented well, I think that it does add greatly to the snappiness of web pages.

However, turning web pages into full-blown applications does have a downside. A lot of AJAX pages seem to ask much too much of a user’s browser and sometimes (even with the faster machines in the office) my CPU will run at 100% just to surf the net. Reading web pages on an old computer is often an exercise in pain unless you stick to Wikipedia (and my sites :)). A technique that is meant to improve the performance of web pages can fail completely on the machines that need it most.

I watched this video the other day. For what it’s worth, I think that the video has overly dramatic music but it raises some interesting points. It finishes up by predicting that $1000 computers will one day be more powerful than all the human brains on the planet. I doubt that the arithmetic for this is particularly meaningful. Even if we do have supercomputers for pennies, I fear dreadfully that all this extra power won’t cure cancer, forecast hurricanes or even enslave us as deluded brains in jars. Most likely the extra power will be eaten up by increasingly untuned pages in fantastically bloated browsers.