What's New! What's Free! Archive
Free TalkBank Corpora. TalkBank is an interdisciplinary research project funded by a five-year NSF grant to foster research and development in communicative behavior by providing tools and standards for analysis and distribution of language data. The LDC distributes grant-covered copies of the following TalkBank corpora:
Free copies for all of the above corpora are still available; shipping and handling fees apply for data on disc. TalkBank also funded the distribution of 50 free copies of American National Corpus (ANC) Second Release and 100 free copies of SLX Corpus of Classic Sociolinguistic Interviews.
Additional Free Corpora - shipping and handling fees apply for data on disc
American English Nickname Collection ~a compilation of 331K American English nickname-to-given-name mappings
Asian Spoken Language Sampler ~a variety of speech and transcript samples from LDC's Asian language publications
Buckwalter Arabic Morphological Analyzer Version 1.0 ~Arabic-English prefix, suffix, and stem lexicons supplemented by three morphological compatibility tables
Catalan TimeBank 1.0 ~210 Catalan documents annotated with temporal and event information
English Web Treebank ~50,000 words of English weblogs, newsgroups, email, reviews and question-answers manually annotated for syntactic structure; first 50 copies available at no cost
Indian Language Part-of-Speech Tagset: Hindi ~98K words of manually annotated Hindi text
Indian Language Part-of-Speech Tagset: Sanskrit ~57K words of manually annotated Sanskrit text
LDC Spoken Language Sampler ~a variety of speech, transcript, and lexicon samples from LDC's publications
Malto Speech and Transcripts ~8 hours of transcribed Malto speech data from 27 speakers
ModeS TimeBank 1.0 ~Modern Spanish text annotated with TimeML and SpatialML mark-up
Manually Annotated Sub-Corpus First Release ~80K words of spoken and written American English with various annotations
OntoNotes 1.0 ~English and Chinese news text material with Treebank, PropBank, word sense, and coreference annotation
OntoNotes 2.0 ~English and Chinese news text and broadcast news material with Treebank, PropBank, word sense, and coreference annotation
SemEval-2010 Task 1 OntoNotes English: Coreference Resolution in Multiple Languages ~120K words extracted from the OntoNotes corpus and formatted for the SemEval task
TimeBank 1.2 ~news text annotated with temporal information, adding events, times and temporal links between events and times
XTrans Toolkit - tool to support transcription tasks in multiple languages on multiple platforms
Tools for converting SPHERE speech files to other formats. Nearly all LDC speech corpora are published with speech files in NIST SPHERE format. LDC provides two programs that will convert SPHERE files to other formats:
For further information on these tools, please visit LDC's Using page and scroll down to the section entitled "Speech (Digital Audio) Files".
ESPS Software - signal processing programs that can be used for the analysis, manipulation and labeling of speech.
Annotation Graph Toolkit (AGTK) - software infrastructure for linguistic annotation.
Transcriber - tool for segmenting, labeling and transcribing speech.
Champollion - parallel text sentence alignment tool designed to work with as many language pairs as possible.
Etc. - recent collaborations and grant awards plus other announcements.
Membership Mailbag Archive - to address the questions that our data users have asked, we introduced our Membership Mailbag series of newsletter articles in May 2008. This periodic series answers frequently asked questions about LDC data, the LDC Intranet, and the benefits of an LDC membership.
Member Surveys - LDC conducted two end-of-year surveys to obtain feedback on satisfaction levels with LDC Membership and data releases as well as our corpus catalog, and to gather suggestions on future publications.
Milestones and Celebrations - information on our landmark corpora distributions and events to celebrate our 10th and 15th anniversary years.
Use of LDC Corpora by Students - ways LDC corpora have been used for student research and for teaching purposes at university summer school programs.