
One of venture capitalist Ed Sims’ companies is Greenplum, which offers an open source, massively parallel upgrade from MySQL. I agree with Ed that data mining will grow dramatically in importance, so we had better have superb scaling solutions in our toolkit, like Greenplum Database.
Google’s early implementation of Shared Nothing Architecture is what has allowed them to keep scaling their massive infrastructure at very low cost per parallel node. In observations on the Semantic Web, Tim O’Reilly touches on some current examples of how data mining can create a new business or enhance an existing one. Google’s PageRank and Amazon’s “people who bought this item also bought” are outstanding examples.
By contrast, I’ve argued that one of the core attributes of “web 2.0” (another ambiguous and widely misused term) is “collective intelligence.” That is, the application is able to draw meaning and utility from data provided by the activity of its users, usually large numbers of users performing a very similar activity. So, for example, collaborative filtering applications like Amazon’s “people who bought this item also bought” or last.fm’s music recommendations use specialized algorithms to match users with each other on the basis of their purchases or listening habits. There are many other examples: digg users voting up stories, or wikipedia’s crowdsourced encyclopedia and news stories.
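To make the “also bought” idea concrete, here is a minimal sketch in Python that counts how often two items appear in the same purchase. It is only an illustration under stated assumptions: the function names `also_bought` and `recommend` and the sample baskets are hypothetical, and plain co-occurrence counting stands in for whatever specialized algorithms Amazon or last.fm actually use.

```python
from collections import Counter
from itertools import combinations


def also_bought(baskets):
    """Count, for every item, how often each other item shares a basket with it."""
    co_counts = {}
    for basket in baskets:
        # Deduplicate so an item repeated within one basket is counted once.
        for a, b in combinations(sorted(set(basket)), 2):
            co_counts.setdefault(a, Counter())[b] += 1
            co_counts.setdefault(b, Counter())[a] += 1
    return co_counts


def recommend(co_counts, item, k=3):
    """Return up to k items most often bought together with `item`."""
    return [other for other, _ in co_counts.get(item, Counter()).most_common(k)]


# Hypothetical purchase history: each inner list is one customer's basket.
baskets = [
    ["book", "lamp"],
    ["book", "lamp", "pen"],
    ["book", "lamp"],
    ["pen", "mug"],
]

co_counts = also_bought(baskets)
top_for_book = recommend(co_counts, "book", k=1)
```

The point of the sketch is that no customer is asked for a rating: the recommendation falls out of aggregating what many people already did.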
But for me, the paradigmatic example of Web 2.0 is Google’s Pagerank. Not only did it lead to the biggest financial success story to date, it is the example that makes us think hardest about the true meaning of “collective intelligence.” What Larry Page realized was that meaning was already being encoded unconsciously by web page creators when they linked one page to another. And that understanding that a link was a vote allowed Google to give better search results than people who, up to that time, were just searching the contents of the various documents on the web.
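The “link as a vote” idea can be sketched with the simplified textbook form of PageRank, computed by repeated redistribution of rank along links. This is an assumption-laden illustration, not Google's production system: the damping factor of 0.85, the fixed iteration count, the handling of pages with no outgoing links, and the tiny four-page graph are all choices made for the example.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Simplified PageRank. `links` maps each page to the pages it links to."""
    pages = set(links) | {t for targets in links.values() for t in targets}
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}

    for _ in range(iterations):
        # Every page starts each round with a small baseline share.
        new_rank = {p: (1.0 - damping) / n for p in pages}
        for page in pages:
            targets = links.get(page, [])
            if targets:
                # A page splits its vote evenly among the pages it links to.
                share = damping * rank[page] / len(targets)
                for target in targets:
                    new_rank[target] += share
            else:
                # Assumption: a page with no outgoing links votes for everyone.
                share = damping * rank[page] / n
                for p in pages:
                    new_rank[p] += share
        rank = new_rank
    return rank


# Hypothetical web of four pages; three of them link to "c".
links = {"a": ["c"], "b": ["c"], "c": ["a"], "d": ["c"]}
ranks = pagerank(links)
best = max(ranks, key=ranks.get)
```

In this toy graph the page that collects the most incoming links ends up with the highest rank, and the page it links to outranks pages that nobody links to, which is the sense in which a vote from a well-regarded page counts for more.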
And so, it seems to me that Pagerank illustrates the fundamental difference between the approaches of the Semantic Web and Web 2.0. The Semantic Web sees meaning as something that needs to be added to documents so that computers can act intelligently about them. Web 2.0 seeks to discover the ways that meaning has already been implicitly encoded by the way people use documents and digital objects, and then to extract that meaning, often by statistical means by studying large aggregates of related documents.
Looking at it this way, you can see that Wesabe is very much a Web 2.0 company. Their fundamental insight is that the way that people spend money is a vote, just like a link is for Pagerank, and that you can use that aggregated vote to build various kinds of intelligent user-facing services.
It’s clear Google thinks data mining is a Very Big Business, and they are the current champs at databasing your personal data cloud so they can analyze it [e.g., Gmail, search history, the data underneath Google Apps]. See also Marcin Zukowski’s thesis Parallel query execution in the Monet DBMS.