What is a corpus and why should we have one?

Simon Musgrave writes:

It sounds like the way a low-life character in Dickens might refer to a dead body. (The word does not occur in Dickens in that sense, but his contemporaries did use it in that way – see OED.) But in recent usage (at least in my world), the word refers to “a collection of specimens of a language as used in real life…selected as a sizable ‘fair sample’ of the language as a whole or of some linguistic genre” (Sampson and McCarthy 2004: 1). These collections of specimens are now often very large, and are growing rapidly in line with Moore’s Law: the British National Corpus (BNC), completed in 1994, consists of 100 million words, but the Corpus of Contemporary American English, in the version available online since 2011, contains 425 million words. Of course such large collections of data can only be exploited using computers; but what can one actually do with all those words?

A word cloud based on word frequencies in Fully (sic)

Well, not surprisingly, with large numbers and computers involved, one can do a lot of counting. The simplest sort of counting is just to examine the frequencies of words – extensive data of this type from the BNC can be consulted online. These numbers become really interesting when we start to compare varieties of a language. For example, the five most frequent words in the spoken English component of the BNC are “the”, “I”, “you”, “and” and “it”, while the equivalent list for the written component is “the”, “of”, “and”, “a” and “in”. These differences are not surprising: speakers are much more likely than writers to refer to themselves and to their hearer, but it is satisfying to see one’s intuition confirmed by solid empirical data!

More subtly, Dirk Geeraerts and his colleagues at the University of Leuven in Belgium have examined the differences between Dutch as spoken in the Netherlands and as spoken in the northern part of Belgium. This research uses techniques such as examining the relative frequencies of synonymous words in the two varieties. Dutch has a word of its own, spijkerbroek, which can refer to what we call jeans, but the English word is also commonly borrowed. The work at Leuven establishes that the relative frequency of the two words differs between the two varieties of Dutch.

Another area of investigation is collocation – which words are likely to occur together. Such information can tell us a huge amount about the usage of words and about their meaning; indeed collocation searching is a standard technique for contemporary lexicographers. To get a taste of what this research is like, have a look at the tool which generates word cloud visualisations for words in the BNC.
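For readers who like to see the counting spelled out, here is a minimal sketch in Python of the two techniques described above: word frequencies and collocation. It is only an illustration. The two sample sentences are invented, the function names are hypothetical, the tokeniser is deliberately crude, and none of this reflects how the BNC tools themselves are implemented.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and pull out runs of letters and apostrophes.

    A deliberately crude tokeniser, assumed here for illustration only.
    """
    return re.findall(r"[a-z']+", text.lower())


def relative_frequencies(tokens):
    """Return each word's share of the total number of tokens."""
    counts = Counter(tokens)
    total = len(tokens)
    return {word: n / total for word, n in counts.items()}


def collocates(tokens, node, window=2):
    """Count the words found within `window` tokens of each occurrence of `node`."""
    counts = Counter()
    for i, tok in enumerate(tokens):
        if tok == node:
            start = max(0, i - window)
            counts.update(tokens[start:i])                   # words to the left
            counts.update(tokens[i + 1:i + 1 + window])      # words to the right
    return counts


# Toy "corpora", invented for this example (not taken from the BNC).
spoken = "well I think you know it is the one I told you about"
written = "the results of the survey and the findings of a study in the region"

spoken_tokens = tokenize(spoken)
written_tokens = tokenize(written)

# Most frequent words in each sample.
print(Counter(spoken_tokens).most_common(3))
print(Counter(written_tokens).most_common(3))

# Relative frequency of one word in each sample, the basis for comparing varieties.
print(relative_frequencies(spoken_tokens).get("the", 0.0))
print(relative_frequencies(written_tokens).get("the", 0.0))

# Immediate neighbours of "of" in the written sample.
print(collocates(written_tokens, "of", window=1))
```

Relative rather than raw frequencies are used for the comparison because the two samples differ in size, just as the spoken and written components of a real corpus do.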

The BNC has existed since 1994; there is also an American National Corpus project, and many other corpora exist at a national level. But does Australia need such a resource? In the last few years, a consensus has emerged among researchers in various disciplines that Australia lacks a substantial collection of computerised language data and that such a collection would be an important research resource. A result of this consensus is an initiative aimed at the establishment of an Australian National Corpus (AusNC), the first phase of which will be officially launched in Brisbane on Monday evening. Building on earlier work collecting corpora in different disciplines, AusNC will bring existing and newly collected samples together in one place and will contain collections of:

published texts from many genres
transcribed speech, often with aligned audio files
visual records of interaction (video)
electronic texts including email, blogs and social media

AusNC aims to illustrate Australian English in all its variety: situational, social, generational, and ethnic; and to document languages other than English used in Australia, including Auslan and the community languages of immigrants. All these different types of language data will then support a very wide range of researchers and their needs, including those of:

linguistic researchers
English language teachers, for school and adult education
lexicographers and terminologists
translators and interpreters
speech and language pathologists
researchers in natural language processing and language engineering
researchers doing other language-oriented work in the humanities and social sciences

The last category mentioned above could include work in history, sociology, social psychology, cultural studies, indeed any area which relates to Australian society and culture.

If any of this sounds interesting to you, please visit the website next week when the initial collections go live. Have a look around, have a play, and let us know what you like; more importantly, let us know what you would like to see available! The committee steering the project is keen to make this a resource which is useful to as many people and communities as possible, and we can only do that if people and communities become involved – we look forward to hearing from you.