University Library, University of Illinois at Urbana-Champaign

Linguistics Library Guide: Linguistics Corpora

Linguistics journals list

Books on corpora

To find more books on linguistic corpora, search the Library Catalog using the subject heading: Corpora (Linguistics)

International Corpus of English

The International Corpus of English (ICE) began in 1990 with the primary aim of collecting material for comparative studies of English worldwide. Twenty-six research teams around the world are preparing electronic corpora of their own national or regional variety of English. Each ICE corpus consists of one million words of spoken and written English produced after 1989.

HathiTrust Research Center

The HathiTrust Research Center (HTRC) facilitates non-profit and educational uses of the HathiTrust Digital Library by enabling computational analysis of public domain works and (on limited terms) in-copyright works from its collection.

The HTRC is a collaborative research center launched jointly by Indiana University and the University of Illinois, along with the HathiTrust Digital Library. It helps researchers meet the technical challenges of working with massive amounts of digital text by developing software tools and cyberinfrastructure that enable advanced computational access to the growing digital record of human knowledge.

Leveraging data storage and computational infrastructure at Indiana University and the University of Illinois at Urbana-Champaign, the HTRC will provision a secure computational and data environment for scholars to perform research using the HathiTrust Digital Library. The center will break new ground in the areas of text mining and non-consumptive research, allowing scholars to fully utilize content of the HathiTrust Library while preventing intellectual property misuse within the confines of current U.S. copyright law. 

Select Corpora

English

Corpus of Contemporary American English (COCA)

The Corpus of Contemporary American English (COCA) is the largest freely-available corpus of English, and the only large and balanced corpus of American English. The corpus contains more than 450 million words of text and is equally divided among spoken, fiction, popular magazines, newspapers, and academic texts.

MICASE: Michigan Corpus of Academic Spoken English
Online collection of transcripts of academic speeches presented at the University of Michigan (may also include some sound recordings of speeches). Intended for use in a research project examining the characteristics of academic speech.

CDs of MICASE, the Michigan Corpus of Academic Spoken English, are available at the Literatures and Languages Library, along with the accompanying book.

Penn-Helsinki Parsed Corpora of Historical English
The Penn Corpora of Historical English, including the Penn-Helsinki Parsed Corpus of Middle English, second edition (PPCME2), the Penn-Helsinki Parsed Corpus of Early Modern English (PPCEME), and the Penn Parsed Corpus of Modern British English (PPCMBE), are running texts and text samples of British English prose across its history, from the earliest Middle English documents up to the First World War. For more, go to: Penn Corpora of Historical English.

Web 1T 5-gram version 1, Thorsten Brants, Alex Franz, eds. [Philadelphia, Pa.]: Linguistic Data Consortium, c2006.

This data set, contributed by Google Inc., contains English word n-grams and their observed frequency counts. The length of the n-grams ranges from unigrams (single words) to five-grams. The data will be useful for statistical language modeling, e.g., for machine translation or speech recognition, as well as for other uses.
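
To illustrate what "n-grams and their observed frequency counts" means, here is a minimal Python sketch that counts every word sequence of length one through five in a short sample. It is only an illustration of the general idea: the function name ngram_counts and the sample sentence are hypothetical, and the code does not reflect the file format of the Web 1T data set or the method used to build it.

```python
from collections import Counter


def ngram_counts(tokens, max_n=5):
    """Count every n-gram of length 1..max_n in a list of word tokens.

    Hypothetical helper for illustration; not part of any corpus toolkit.
    """
    counts = {n: Counter() for n in range(1, max_n + 1)}
    for n in range(1, max_n + 1):
        # Slide a window of width n across the token list.
        for i in range(len(tokens) - n + 1):
            counts[n][tuple(tokens[i:i + n])] += 1
    return counts


# Made-up sample text, split on whitespace (real corpora need proper tokenization).
tokens = "the cat sat on the mat and the cat slept".split()
counts = ngram_counts(tokens)

print(counts[1][("the",)])        # frequency of the unigram "the"
print(counts[2][("the", "cat")])  # frequency of the bigram "the cat"
print(counts[5].most_common(1))   # one of the five-grams with its count
```

In a data set of this kind, each n-gram is stored with its count, so researchers look up frequencies rather than recomputing them from the underlying text.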

Other Languages

Catalan
AnCora consists of a Catalan corpus (AnCora-CA) and a Spanish corpus (AnCora-ES), each of 500,000 words. The corpora are annotated at different levels.

Chinese

Chinese-English Parallel Corpora

The following Chinese-English parallel corpora downloads were developed by TranslateFX researchers and linguists for public use. The corpora are made up of aligned sentence pairs from quality bilingual texts, covering the financial and legal domains in Hong Kong. The sources include government legislation and regulations, stock exchange announcements, financial offering documents, regulatory filings, regulatory guidelines, corporate constitutional documents and others. All the texts are from the Hong Kong Stock Exchange, the Securities and Futures Commission of Hong Kong, and Hong Kong government websites.

French

Corpus de Français Parlé Parisien des années 2000. Provides recent (last two decades) interviews with Parisians. Audio files and transcripts are available for download.

Spanish

Corpus del Español
This corpus allows you to search, quickly and simply, more than 100 million words in more than 20,000 Spanish texts from the thirteenth to the twentieth century.

CREA: Corpus de Referencia del Español Actual (Real Academia Española)

CORDE: Corpus Diacrónico del Español (Real Academia Española)

Diccionario crítico etimológico castellano e hispánico

Elcastellano.org

Glosario electrónico de términos lingüísticos

Nuevo tesoro lexicográfico de la lengua española

SIL Bibliography

Wikilengua del español


Multi-language Resources

Hermann von Helmholtz-Zentrum für Kulturtechnik: a phonographic archive of very early recordings of various languages spoken by First World War prisoners held in German prison camps.


For more, consult The Linguist List