LibGuides: Linguistics Library Guide: Linguistics Corpora

Books on corpora

The Cambridge Handbook of English Corpus Linguistics by Douglas Biber (Editor); Randi Reppen (Editor)
Call Number: Electronic Book

ISBN: 9781107037380

Publication Date: 2015
Corpus Linguistics
ISBN: 9783110180435

Publication Date: 2008
Current Trends in Corpus Linguistics by Alexander Brock (Series edited by); José Luis Oncins Martínez (Editor)
ISBN: 9783631828717

Publication Date: 2020
Doing Corpus Linguistics by William Crawford; Eniko Csomay
ISBN: 9781317688068

Publication Date: 2015
English Corpus Linguistics by Charles F. Meyer
ISBN: 9781107057159

Publication Date: 2023
The Routledge Handbook of Corpus Linguistics by Anne O'Keeffe (Editor); Michael J. McCarthy (Editor)
ISBN: 9780429632648

Publication Date: 2022

To find more books on linguistic corpora, search the Library Catalog using the subject heading: Corpora (Linguistics)

International Corpus of English

The International Corpus of English (ICE) began in 1990 with the primary aim of collecting material for comparative studies of English worldwide. Twenty-six research teams around the world are preparing electronic corpora of their own national or regional variety of English. Each ICE corpus consists of one million words of spoken and written English produced after 1989.

HathiTrust Research Center

The HathiTrust Research Center (HTRC) facilitates non-profit and educational uses of the HathiTrust Digital Library by enabling computational analysis of public domain works and (on limited terms) in-copyright works from its collection.

The HTRC is a collaborative research center launched jointly by Indiana University and the University of Illinois, along with the HathiTrust Digital Library. It helps researchers meet the technical challenges of working with massive amounts of digital text by developing software tools and cyberinfrastructure that enable advanced computational access to the growing digital record of human knowledge.

Leveraging data storage and computational infrastructure at Indiana University and the University of Illinois at Urbana-Champaign, the HTRC will provision a secure computational and data environment for scholars to perform research using the HathiTrust Digital Library. The center will break new ground in the areas of text mining and non-consumptive research, allowing scholars to fully utilize content of the HathiTrust Library while preventing intellectual property misuse within the confines of current U.S. copyright law.

Select Corpora

English

Corpus of Contemporary American English (COCA)

The Corpus of Contemporary American English (COCA) is the largest freely available corpus of English, and the only large and balanced corpus of American English. The corpus contains more than 450 million words of text and is equally divided among spoken, fiction, popular magazines, newspapers, and academic texts.

MICASE: Michigan Corpus of Academic Spoken English
Online collection of transcripts of academic speeches presented at the University of Michigan (may also include some sound recordings of speeches). Intended for use in a research project examining the characteristics of academic speech.

CDs of MICASE are available at the Literatures and Languages Library, along with the accompanying book.

Penn-Helsinki Parsed Corpora of Historical English
The Penn Corpora of Historical English include the Penn-Helsinki Parsed Corpus of Middle English, second edition (PPCME2), the Penn-Helsinki Parsed Corpus of Early Modern English (PPCEME), and the Penn Parsed Corpus of Modern British English (PPCMBE). They consist of running texts and text samples of British English prose across its history, from the earliest Middle English documents up to the First World War. For more information, see the Penn Corpora of Historical English.

Web 1T 5-gram version 1, Thorsten Brants, Alex Franz, eds. [Philadelphia, Pa.] : Linguistic Data Consortium, c2006.

This data set, contributed by Google Inc., contains English word n-grams and their observed frequency counts. The length of the n-grams ranges from unigrams (single words) to five-grams. The data will be useful for statistical language modeling, e.g., for machine translation or speech recognition, as well as for other uses.
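To make the idea of n-gram frequency counts concrete, here is a minimal Python sketch that counts unigrams through five-grams in a short token list. It is an illustration only: the function name `count_ngrams` and the sample sentence are invented for this guide, and the code does not read the Web 1T files or reproduce how that data set was built.

```python
from collections import Counter


def count_ngrams(tokens, max_n=5):
    """Return a dict mapping n (1..max_n) to a Counter of n-gram tuples."""
    counts = {}
    for n in range(1, max_n + 1):
        # Slide a window of length n across the token list.
        counts[n] = Counter(
            tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)
        )
    return counts


# Hypothetical toy input; a real corpus would be tokenized far more carefully.
sample = "the cat sat on the mat and the cat slept".split()
ngram_counts = count_ngrams(sample)

print(ngram_counts[1].most_common(2))  # most frequent unigrams
print(ngram_counts[2].most_common(1))  # most frequent bigram
```

Each n-gram is stored as a tuple of words, so the count for the bigram "the cat" is looked up as `ngram_counts[2][("the", "cat")]`.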

Other Languages

Catalan
AnCora consists of a Catalan corpus (AnCora-CA) and a Spanish corpus (AnCora-ES), each containing 500,000 words. The corpora are annotated at different levels.

Chinese

Chinese-English Parallel Corpora

The following Chinese-English parallel corpora downloads were developed by TranslateFX researchers and linguists for public use. The corpora consist of aligned sentence pairs from quality bilingual texts, covering the financial and legal domains in Hong Kong. The sources include government legislation and regulations, stock exchange announcements, financial offering documents, regulatory filings, regulatory guidelines, corporate constitutional documents, and others. All the texts are from the Hong Kong Stock Exchange, the Securities and Futures Commission of Hong Kong, and Hong Kong government websites.

French

ARTFL
Corpus of texts, ranging from classic works of French literature to various kinds of non-fiction prose and technical writing. The eighteenth, nineteenth and twentieth centuries are about equally represented, with a smaller selection of seventeenth-century texts as well as some medieval and Renaissance texts. There is also a Provençal database that includes texts in their original spellings. Genres include novels, verse, theater, journalism, essays, correspondence, and treatises. Subjects include literary criticism, biology, history, economics, and philosophy. In most cases standard scholarly editions were used in converting the text into machine-readable form, and the data contain page references to these editions.

Corpus de Français Parlé Parisien des années 2000. Provides recent (last two decades) interviews of Parisians. Audio files and transcripts are available for download.

Spanish

El Corpus del Español

CREA: Corpus de Referencia del Español Actual (Real Academia Española)

CORDE: Corpus Diacrónico del Español (Real Academia Española)

Diccionario crítico etimológico castellano e hispánico

Elcastellano.org

Glosario electrónico de términos lingüísticos

Nuevo tesoro lexicográfico de la lengua española

SIL Bibliography

Wikilengua del español

Multi-language Resources

Hermann von Helmholtz-Zentrum für Kulturtechnik: A phonographic archive of very early recordings of various languages spoken by First World War prisoners in German prison camps.

For more resources, consult The Linguist List.