Linguistic Resources
New Corpora Archive
Chronological list of our corpora releases for recent years. Please visit The LDC Corpus Catalog for a complete list of publications the LDC distributes.
2008 Releases
An English Dictionary of the Tamil Verb ~contains translations for 6597 English verbs and defines 9716 Tamil verbs
GALE Phase 1 Chinese Blog Parallel Text ~313K characters of Chinese blog text and its translation from eight sources
CSLU: National Cellular Telephone Speech Release 2.3 ~approximately one minute of transcribed speech from 2336 speakers throughout the US
GALE Phase 1 Arabic Blog Parallel Text ~102K words of Arabic blog text and its English translation from thirty-three sources
STC-TIMIT 1.0 ~entire TIMIT database recorded through a single telephone channel
OntoNotes Release 2.0 ~Treebank, PropBank, word sense, and coreference annotated English and Chinese news text
Penn Discourse Treebank Version 2.0 ~Wall Street Journal text annotated for discourse relations and their arguments
ACE 2005 English SpatialML Annotations ~newswire text annotated for spatial expressions
CSLU: Portland Cellular Telephone Speech Version 1.3 ~cellular telephone speech with orthographic and phonetic transcription
Hungarian-English Parallel Text, Version 1.0 ~approximately two million sentence pairs plus additional resources for Hungarian
2007 Releases
2004 Spring NIST Rich Transcription (RT-04S) Development Data ~development data used for speech-to-text and metadata extraction tasks
Chinese Treebank 6.0 (CTB 6.0) ~780K words with POS-tagging and syntactic bracketing
Arabic Gigaword Third Edition ~comprehensive archive of Arabic news text acquired by the LDC
CSLU: Kids' Speech Version 1.1 ~transcribed read and free response speech
GALE Phase 1 Distillation Training ~English, Chinese and/or Arabic queries and responses for the GALE Distillation task
2004 Spring NIST Rich Transcription (RT-04S) Evaluation Data ~evaluation data used for speech-to-text and metadata extraction tasks
MITRE 1997 Mandarin Broadcast News Speech Translations (Hub-4NE) ~translated and aligned broadcast news transcripts
CSLU: Apple Words and Phrases ~telephone speech from over 3000 callers
GALE Phase 1 Chinese Broadcast News Parallel Text - Part 1 ~transcripts and English translations of 23 hours of Chinese broadcast news selected from a variety of sources
Nationwide Speech Project ~read speech representing the primary regional varieties of American English
Chinese Gigaword Third Edition ~comprehensive archive of Chinese news text acquired by the LDC
2003 NIST Rich Transcription Evaluation Data ~evaluation data used for speech-to-text and metadata extraction tasks
CSLU: Yes/No Version 1.2 ~18,000 speakers saying "yes" or "no" in response to various questions
GALE Phase 1 Arabic Broadcast News Parallel Text - Part 1 ~transcripts and English translations of 17 hours of Arabic broadcast news selected from a variety of sources
Mandarin Affective Speech ~read speech in five different emotional states
2001 Topic Annotated Enron Email Data Set ~manually indexed email data set
Tagged Chinese Gigaword ~newstext annotated with full POS tags
CSLU: Foreign Accented English Release 1.2 ~free response English speech by native speakers of 22 languages
English Gigaword Third Edition ~comprehensive archive of newswire text that has been acquired over several years by LDC
OntoNotes v 1.0 ~Treebank, PropBank, word sense, and coreference annotated English and Chinese news text
ISI Chinese-English Automatically Extracted Parallel Text ~over 500K sentence pairs from newswire sources
TRECVID 2003 Keyframes & Transcripts ~keyframes extracted from English language broadcast programming
Fisher Levantine Arabic Conversational Telephone Speech and Transcripts ~279 transcribed telephone conversations totaling 45 hours of speech
TRECVID 2005 Keyframes & Transcripts ~keyframes extracted from Arabic, Chinese and English language broadcast programming
ARL Urdu Speech Database, Training Data ~transcribed read speech from 200 native speakers
ISI Arabic-English Automatically Extracted Parallel Text ~over 1 million sentence pairs from newswire sources
English Chinese Translation Treebank v. 1.0 ~English translation, part-of-speech tagged and treebanked
Levantine Arabic Conversational Telephone Speech, Transcripts ~transcribed conversations from over 900 speakers
2006 Releases
Arabic Broadcast News Speech and Transcripts ~10 hours of transcribed satellite radio news
TDT5 Multilingual Text, Topics and Annotations ~English, Arabic, and Chinese newswire text with corresponding topic relevance annotations
Iraqi Arabic Conversational Telephone Speech and Transcripts ~transcribed conversations from over 250 speakers
French Gigaword First Edition ~comprehensive archive of newswire text that has been acquired over several years by LDC
2004 NIST Speaker Recognition Evaluation ~training and test data from the 2004 evaluation
CSLU: Stories ~transcribed speech with time-aligned phonetic labels
West Point Heroico Spanish Speech ~read and free response speech
Gulf Arabic Conversational Telephone Speech and Transcripts ~transcribed conversations from over 900 speakers
Web 1T 5-gram Version 1 ~English word n-grams and their observed frequency counts
Korean Broadcast News Speech and Transcripts ~transcribed VOA satellite radio news
West Point Korean Speech ~read and free response speech
Prague Dependency Treebank 2.0 ~morphological, syntactic, and semantic annotation of Czech news text
Russian through Switched Telephone Network (RuSTeN) ~multiple recorded calls from 125 speakers
CSLU: Multilanguage Telephone Speech Version 1.2 ~fixed vocabulary and continuous speech in eleven languages
NIST 2003 Language Recognition Evaluation ~to test the detection of a given target language
Spanish Gigaword First Edition ~comprehensive archive of newswire text that has been acquired over several years by LDC
CSLU: Speaker Recognition Version 1.1 ~transcribed speech from 90 speakers including the same utterances recorded multiple times
English-Arabic Treebank V1.0 ~52K Arabic words POS tagged and treebanked with parallel English translations
Middle East Technical University Turkish Microphone Speech V 1.0 ~transcribed speech from 120 speakers
CSLU Spoltech Brazilian Portuguese ~transcribed read and spontaneous speech
Korean Treebank Annotations Version 2.0 ~Korean texts annotated with morphological and syntactic information
N4 NATO Native and Non-Native Speech ~military-oriented database of multilingual and non-native speech
Timebank 1.2 ~newstext annotated with temporal information, adding events, times and temporal links between events and times; free copies available!
CSLU: Spelled and Spoken Words ~transcribed speech from over 3000 speakers
Korean Propbanks ~semantic annotation containing over 30K annotated predicate tokens
Speech Controlled Computing ~supports the development of ASR applications in the domain of voice control for the home
ACE 2005 Multilingual Training Corpus ~Arabic, Chinese, and English data annotated for entities, relations, and events
Levantine Arabic QT Training Data Set 5, Speech and Transcripts ~250 hours of transcribed telephone speech
Arabic Gigaword Second Edition ~text from five Arabic news sources
CSLU Voices ~twelve speakers reading phonetically rich sentences
Multiple Translation Chinese (MTC) Part 4 ~human and machine translations of 100 Chinese news stories