Google Books and the rise of culturomics

James McElvenny writes:

As many of Fully (sic)’s readers will be aware, there’s been quite a bit of buzz lately about a project led by Erez Lieberman Aiden and Jean-Baptiste Michel at Harvard University that bills itself as the beginning of ‘culturomics’, a new paradigm for studying cultural trends using large amounts of textual data. The main report on this project is a Science paper from December last year. The project is very exciting for the sheer amount of data it makes use of, but unfortunately it’s also become a bit overblown with journalistic hyperbole.

The data set is drawn from a corpus (collection of texts) from Google Books containing over 500 billion words in English, French, Spanish, German, Russian, Chinese and Hebrew, spanning a period between the years 1500 and 2000. Aiden, Michel and their collaborators claim that their corpus represents 4% of all books ever published. This corpus has been reduced to a collection of n-grams (up to n = 5) with counts of their frequency of occurrence in different years. An n-gram is any sequence of n words that appear consecutively in a text – so ‘dog’, for example, would be a 1-gram, ‘shaggy dog’ a 2-gram, and ‘smelly, shaggy dog’ a 3-gram, and so on. N-grams are one of the basic building blocks of modern computational linguistics and have many applications, the most immediately entertaining being Markov chain-based text masher-upperers.
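For readers who like to see this sort of thing spelled out, here is a minimal sketch in Python of how a pile of texts might be boiled down to n-grams with per-year counts. It is only an illustration: the function names, the crude tokeniser and the toy `books` list are my own assumptions, not the culturomics team's actual pipeline or data format.

```python
from collections import Counter
import re

def tokenize(text):
    """Lowercase the text and split it into word tokens, dropping punctuation."""
    return re.findall(r"[a-z']+", text.lower())

def ngrams(tokens, n):
    """Return every run of n consecutive tokens as a tuple."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def ngram_counts_by_year(books, max_n=5):
    """Map (year, n-gram) -> count, for n from 1 up to max_n.

    `books` is an iterable of (year, text) pairs: a toy stand-in
    for a real corpus, not the format of the published data set.
    """
    counts = Counter()
    for year, text in books:
        tokens = tokenize(text)
        for n in range(1, max_n + 1):
            for gram in ngrams(tokens, n):
                counts[(year, gram)] += 1
    return counts

# Two invented 'books' to show the idea.
books = [
    (1900, "The smelly, shaggy dog chased the shaggy dog."),
    (1950, "A shaggy dog story."),
]
counts = ngram_counts_by_year(books, max_n=3)
print(counts[(1900, ("shaggy", "dog"))])            # 2
print(counts[(1950, ("shaggy", "dog"))])            # 1
print(counts[(1900, ("smelly", "shaggy", "dog"))])  # 1
```

Notice that once the texts have been reduced like this, the counts are all that is left: there is no way to get from `("shaggy", "dog")` back to the sentence it came from.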

In the December Science paper the culturomics team showcase the sort of analyses that can be done with their data. Among the more interesting of these is the frequency of mention of the names of ‘inventions’ from different periods. They find that inventions from the period 1800-1840 took on average 66 years from the date of their first appearance to reach more than a quarter of their peak frequency of mention, while those from 1840-1880 took 50 years and those from 1880-1920 took 27 years. From this they conclude that over time we have come to adopt new technology more rapidly.
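The measure described here is simple enough to sketch in a few lines of Python. This is only my reading of the description above, under the assumption that we have one frequency value per year for a given term; the function name and the numbers are invented for illustration, and the real analysis may well involve smoothing or other steps not shown.

```python
def years_to_quarter_peak(series):
    """Years from a term's first appearance until its frequency first
    exceeds a quarter of its peak frequency.

    `series` maps year -> frequency of mention (hypothetical data).
    Returns None if the term never appears.
    """
    years = sorted(year for year, freq in series.items() if freq > 0)
    if not years:
        return None
    first_appearance = years[0]
    threshold = 0.25 * max(series.values())
    for year in years:
        if series[year] > threshold:
            return year - first_appearance
    return None

# Invented frequencies (say, mentions per million words) for one 'invention'.
gadget = {1850: 0, 1860: 1, 1880: 2, 1900: 5, 1920: 10, 1940: 9}
print(years_to_quarter_peak(gadget))  # 40
```

In this made-up case the term first appears in 1860, peaks at 10 in 1920, and first climbs above 2.5 in 1900, giving 40 years. Averaging that figure over all the inventions from a given period would give numbers comparable to the 66, 50 and 27 years reported.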

They also do some more traditional linguistic analyses, such as the replacement over time of irregular past tense verb forms in English with regularised variants, e.g. burnt vs burned, throve vs thrived; and a verb that seems to be demonstrating the opposite trend: sneaked vs snuck.

They’ve made their n-gram data freely available so anyone can download it and try out their own analyses. They’ve also developed a nice little online tool, the ngram viewer, that makes it easy to do simple searches over the data. There’s a tumblr page where people are showing off some of the interesting searches they’ve already tried out.

Frequency of "the press" and "the media"

To their credit, they continually stress that any trends revealed by the data need to be interpreted carefully – we need to come up with external explanations to account for what we see. They specifically state that ‘culturomics’ is not a replacement for traditional close reading. Unfortunately, this is a point that seems to have been quickly lost in much of the mainstream media coverage, e.g. NYT, Scientific American, WSJ.

The problems of shaky interpretation and the limits of the data have already begun to show themselves in some other efforts linked to the project. In a recent (admittedly rather lighthearted) piece in Science, Adrian Veres, a co-author of the original Science paper, and John Bohannon introduce a ‘Science Hall of Fame’, ranking scientists according to how famous they are, measured by how often their names are mentioned.

But is a simple count of how often someone’s name is mentioned really a good way to judge how ‘famous’ they are? We don’t know the context they’re being mentioned in so we can’t tell what’s being said about them – there’s no way to distinguish between ‘fame’ and ‘infamy’ (assuming that’s important to us). We also can’t be sure we’ve got the right guy, so to speak. Veres and Bohannon say that they eliminated scientists like James Watson (he of DNA) from the study because their names are just too common and they could easily be confused with someone else.

Linguistics’ own Noam Chomsky, for example, comes in with a level of fame roughly half that of Charles Darwin (whose ‘fame’ is taken as the point of reference). Chomsky is of course one of the ‘Great Men’ of modern linguistics and his linguistic work would have to be the most cited in the discipline and beyond. But this ignores the fact that he is also very famous (probably more famous) for his political activism. In the count presented by Veres and Bohannon these two sources of fame are undoubtedly mixed and there’s no way to separate the ‘scientist’ from the ‘political activist’. It’s a bit like saying Elvis is the world’s most famous truck driver. In fact, as the quote from Chomsky in Veres and Bohannon’s Science article demonstrates, he is probably aware of this.

This difficulty of ensuring that we’re talking about the right thing is actually inherent in the data set. All we have are the n-grams and frequency counts. To solve these sorts of problems and do many other kinds of interesting analyses we need access to the actual underlying texts and their accompanying metadata (that’s information like book title, place of publication, etc). Several prominent linguists have already discussed these points in detail, e.g. Geoffrey Nunberg, Mark Liberman and David Crystal.

But to be fair to the culturomics crowd, reducing the data set as they have done is a fairly ingenious way of getting around possible copyright problems. The vast majority of the more recent works are protected by sometimes over-restrictive copyright laws whose infringement is zealously guarded against by paranoid publishers. It would be difficult for publishers to claim that this reduced data set represents a pirated copy of any existing book. Of course, Google could release the full text of books published before 1922, on which the copyright has already expired, but they haven’t. They’re sitting on their own intellectual property goldmine there.

But there are actually many other corpora (plural of corpus) in English and other languages that we can use for linguistic and cultural analysis. The difference is that none of them has the size of the underlying corpus the culturomics data is based on (but all of them make the richer underlying data available to the users). Some more notable English-language corpora, which all have online interfaces, include the British National Corpus and the corpora from Mark Davies. These can also be bought for offline use for amounts varying from a little to a lot.

In fact, any text can be used as a corpus, as long as there’s a way to search it. In centuries past, hardcore philologists would spend their weekends making concordances of important texts, like the Bible or the works of Shakespeare. These listed every word found in the text and its immediate context. Computers have taken a lot of the drudgery out of this work by producing concordances automatically. Lots of useful material can already be found online, through particular portals, such as Trove at the National Library of Australia, which provides a digital archive of Australian newspapers, or even through a simple Google search. The problem is that the more unstructured the data is and the less metadata there is describing it, the more difficult it becomes to automatically process it.
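To give a sense of how little it takes for a computer to do the philologist's weekend work, here is a rough sketch of a keyword-in-context concordance in Python. The function name, the simple tokeniser and the `width` parameter are assumptions of mine for the sake of the example, not a description of any particular concordancing tool.

```python
import re

def concordance(text, keyword, width=3):
    """List every occurrence of `keyword` with up to `width` words of
    context on either side, as (left, word, right) tuples."""
    tokens = re.findall(r"[A-Za-z']+", text)
    hits = []
    for i, token in enumerate(tokens):
        if token.lower() == keyword.lower():
            left = " ".join(tokens[max(0, i - width):i])
            right = " ".join(tokens[i + 1:i + 1 + width])
            hits.append((left, token, right))
    return hits

text = "To be, or not to be, that is the question."
for left, word, right in concordance(text, "be", width=2):
    print(f"{left:>10} [{word}] {right}")
```

For the line from Hamlet this finds both occurrences of ‘be’, one with ‘To’ on the left and ‘or not’ on the right, the other flanked by ‘not to’ and ‘that is’. Unlike the n-gram counts, a concordance keeps each word tied to its place in the text, which is exactly the information the reduced culturomics data set gives up.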

So what the culturomics people have done for us is to take a huge amount of data and masticate it into a useful form. They’ve also given us a little tool that we can use to play with it. This is not that new and is certainly not an answer to all problems in cultural history. The grander claims are really the product of media hyperbole, but it has to be said that Aiden, Michel and co are attempting a bit of shameless branding with their coinage of ‘culturomics’ (cf WEIRD). But despite all that, this project is still very cool.

And it’s good for us social scientists to try our hand at some of the quantitative methods that have made the natural sciences (aka ‘real’ sciences) what they are today, providing of course we proceed carefully and think about what we’re actually doing. Maybe the day will still come when maths gives us social scientists the predictive power we envy physicists for, as in Isaac Asimov’s modernist fantasy of psychohistory.

First image: ‘The Comptometer Kid’, derived from an image posted by gildedcentury, via Boing Boing