


LDC Catalog by Type and Source

| | lexicon | lexicon, speech, text | speech | speech, text | text | video |

Corpora are grouped first into major categories by the type of data they contain, and then into minor categories by the source of the data. A corpus that draws on several sources is listed under each of them.
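The two-level grouping described above can be pictured as a nested index keyed first by data type and then by source. The following is only a minimal sketch: the record layout and the name `build_index` are assumptions made for illustration, not part of the catalog itself, and the sample rows are copied from the listing on this page.

```python
from collections import defaultdict

# Sample records: (catalog ID, title, data type, source).
# The tuple layout is an assumption for illustration only.
ENTRIES = [
    ("LDC96L14", "CELEX2", "lexicon", "dictionaries"),
    ("LDC93S1", "TIMIT Acoustic-Phonetic Continuous Speech Corpus",
     "speech", "microphone speech"),
    ("LDC2010T05", "NPS Internet Chatroom Conversations, Release 1.0",
     "text", "text chat conversations"),
    # The same corpus may appear under more than one source.
    ("LDC99T42", "Treebank-3", "text", "newswire"),
    ("LDC99T42", "Treebank-3", "text", "varied"),
]


def build_index(entries):
    """Group entries by data type (major category), then by source (minor category)."""
    index = defaultdict(lambda: defaultdict(list))
    for catalog_id, title, data_type, source in entries:
        index[data_type][source].append((catalog_id, title))
    return index


index = build_index(ENTRIES)

# Print the grouped listing: type, then source, then the corpora under it.
for data_type in sorted(index):
    print(data_type)
    for source in sorted(index[data_type]):
        print("  " + source)
        for catalog_id, title in index[data_type][source]:
            print("    " + catalog_id + " " + title)
```

Looking up `index["text"]["newswire"]` then returns the corpora listed under the text type with a newswire source.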


| microphone speech | telephone speech |

microphone speech
LDC2010S07 Asian Spoken Language Sampler

telephone speech
LDC2010S07 Asian Spoken Language Sampler

lexicon
| dictionaries | field recordings | microphone speech | newswire | telephone conversations | varied | web collection |

dictionaries [lexicon]
LDC2008L01 An English Dictionary of the Tamil Verb
LDC2002L49 Buckwalter Arabic Morphological Analyzer Version 1.0
LDC96L14 CELEX2
LDC2002L27 Chinese-English Translation Lexicon Version 3.0
LDC2008L03 Global Yoruba Lexical Database v. 1.0
LDC2008L02 Hindi WordNet

field recordings [lexicon]
LDC2003L01 Grassfields Bantu Fieldwork: Dschang Lexicon

microphone speech [lexicon]
LDC99L23 American English Spoken Lexicon

newswire [lexicon]
LDC98L21 COMLEX English Syntax Lexicon

telephone conversations [lexicon]
LDC97L20 CALLHOME American English Lexicon (PRONLEX)
LDC97L18 CALLHOME German Lexicon
LDC96L17 CALLHOME Japanese Lexicon
LDC96L15 CALLHOME Mandarin Chinese Lexicon
LDC96L16 CALLHOME Spanish Lexicon
LDC99L22 Egyptian Colloquial Arabic Lexicon
LDC2003L02 Korean Telephone Conversations Lexicon

varied [lexicon]
LDC2009L01 An English Dictionary of the Tamil Verb Second Edition
LDC98L21 COMLEX English Syntax Lexicon
LDC2008L03 Global Yoruba Lexical Database v. 1.0
LDC2004L01 Klex: Finite-State Lexical Transducer for Korean

web collection [lexicon]
LDC2002L27 Chinese-English Translation Lexicon Version 3.0

lexicon, speech, text
| field recordings | meeting speech | microphone speech | telephone speech |

field recordings [lexicon, speech, text]
LDC2008S08 LDC Spoken Language Sampler

meeting speech [lexicon, speech, text]
LDC2008S08 LDC Spoken Language Sampler

microphone speech [lexicon, speech, text]
LDC2008S08 LDC Spoken Language Sampler

telephone speech [lexicon, speech, text]
LDC2008S08 LDC Spoken Language Sampler

speech
| broadcast conversation | broadcast news | field recordings | meeting speech | microphone conversation | microphone speech | telephone conversations | telephone speech | transcribed speech |

broadcast conversation [speech]
LDC2009S02 Czech Broadcast Conversation Speech
LDC2013S02 GALE Phase 2 Arabic Broadcast Conversation Speech Part 1
LDC2013S04 GALE Phase 2 Chinese Broadcast Conversation Speech

broadcast news [speech]
LDC97S66 1996 English Broadcast News Dev and Eval (HUB4)
LDC97S44 1996 English Broadcast News Speech (HUB4)
LDC98S71 1997 English Broadcast News Speech (HUB4)
LDC2001S91 1997 HUB4 Broadcast News Evaluation Non-English Test Material
LDC2002S11 1997 HUB4 English Evaluation Speech and Transcripts
LDC98S73 1997 Mandarin Broadcast News Speech (HUB4-NE)
LDC98S74 1997 Spanish Broadcast News Speech (HUB4-NE)
LDC2000S86 1998 HUB4 Broadcast News Evaluation English Test Material
LDC2000S88 1999 HUB4 Broadcast News Evaluation English Test Material
LDC2004S04 2002 NIST Speaker Recognition Evaluation
LDC2004S11 2002 Rich Transcription Broadcast News and Conversational Telephone Speech
LDC2007S10 2003 NIST Rich Transcription Evaluation Data
LDC2011S02 2006 NIST Spoken Term Detection Development Set
LDC2011S03 2006 NIST Spoken Term Detection Evaluation Set
LDC2006S46 Arabic Broadcast News Speech
LDC96S31 CSR-IV HUB4
LDC2004S01 Czech Broadcast News Speech
LDC2006S42 Korean Broadcast News Speech
LDC2004S08 RT-03 MDE Training Data Speech
LDC2005S16 RT-04 MDE Training Data Speech
LDC2000S92 TDT2 Careful Transcription Audio
LDC99S84 TDT2 English Audio
LDC2001S93 TDT2 Mandarin Audio Corpus
LDC2001S94 TDT3 English Audio
LDC2001S95 TDT3 Mandarin Audio
LDC2005S11 TDT4 Multilingual Broadcast News Speech Corpus
LDC99S82 USC Marketplace Broadcast News Speech
LDC2000S89 Voice of America (VOA) Czech Broadcast News Audio

field recordings [speech]
LDC94S14B Air Traffic Control BOS
LDC94S14A Air Traffic Control Complete
LDC94S14C Air Traffic Control DCA
LDC94S14D Air Traffic Control DFW
LDC2010S05 Asian Elephant Vocalizations
LDC2003S02 Grassfields Bantu Fieldwork: Dschang Tone Paradigms
LDC2001S16 Grassfields Bantu Fieldwork: Ngomba Tone Paradigms
LDC2003T15 SLX Corpus of Classic Sociolinguistic Interviews
LDC2004S12 TalkBank Ethology Data: Field Recordings of Vervet Monkey Calls

meeting speech [speech]
LDC2007S12 2004 Spring NIST Rich Transcription (RT-04S) Evaluation Data
LDC2007S11 2004 Spring NIST Rich Transcription (RT-04S) Development Data
LDC2011S02 2006 NIST Spoken Term Detection Development Set
LDC2011S03 2006 NIST Spoken Term Detection Evaluation Set
LDC2004S09 NIST Meeting Pilot Corpus Speech

microphone conversation [speech]
LDC93S12 HCRC Map Task Corpus
LDC2004S02 ICSI Meeting Speech
LDC2004S05 ISL Meeting Speech Part 1
LDC96S64-4 JEIDA/JCSD-Channel 0 Four Digit Sequences
LDC2000S96 Speech in Noisy Environments (SPINE) Evaluation Audio
LDC2000S87 Speech in Noisy Environments (SPINE) Training Audio
LDC95S25 TRAINS Spoken Dialog Corpus

microphone speech [speech]
LDC2004S04 2002 NIST Speaker Recognition Evaluation
LDC2007S03 ARL Urdu Speech Database, Training Data
LDC2005S22 Articulation Index
LDC93S4A ATIS0 Complete
LDC93S4B ATIS0 Pilot
LDC93S4B-2 ATIS0 Read
LDC93S4B-3 ATIS0 SD Read
LDC93S5 ATIS2
LDC95S26 ATIS3 Test Data
LDC94S19 ATIS3 Training Data
LDC2005S08 BBN/AUB DARPA Babylon Levantine Arabic Speech and Transcripts
LDC96S36 Boston University Radio Speech Corpus
LDC94S20 BRAMSHILL
LDC2008S09 CHAracterizing INdividual Speakers (CHAINS)
LDC2007S18 CSLU: Kids' Speech Version 1.1
LDC2008S07 CSLU: ISOLET Spoken Letter Database Version 1.3
LDC2006S16 CSLU: Spoltech Brazilian Portuguese Version 1.0
LDC2006S01 CSLU: Voices
LDC93S6A CSR-I (WSJ0) Complete
LDC93S6C CSR-I (WSJ0) Other
LDC93S6B CSR-I (WSJ0) Sennheiser
LDC94S13A CSR-II (WSJ1) Complete
LDC94S13C CSR-II (WSJ1) Other
LDC94S13B CSR-II (WSJ1) Sennheiser
LDC95S23 CSR-III Speech
LDC96S33 CSR-IV HUB3
LDC96S38 DCIEM/HCRC
LDC2002S28 Emotional Prosody Speech and Transcripts
LDC96S32 FFMTIMIT
LDC2003S02 Grassfields Bantu Fieldwork: Dschang Tone Paradigms
LDC2001S16 Grassfields Bantu Fieldwork: Ngomba Tone Paradigms
LDC96S64-1 JEIDA/JCSD-Channel 0 City Names
LDC96S64 JEIDA/JCSD-Channel 0 Complete
LDC96S64-2 JEIDA/JCSD-Channel 0 Control Words
LDC96S64-3 JEIDA/JCSD-Channel 0 Isolated Digits
LDC96S64-5 JEIDA/JCSD-Channel 0 Mono Syllables
LDC96S65-1 JEIDA/JCSD-Channel 1 City Names
LDC96S65 JEIDA/JCSD-Channel 1 Complete
LDC96S65-2 JEIDA/JCSD-Channel 1 Control Words
LDC96S65-4 JEIDA/JCSD-Channel 1 Four Digit Sequences
LDC96S65-3 JEIDA/JCSD-Channel 1 Isolated Digits
LDC96S65-5 JEIDA/JCSD-Channel 1 Mono Syllables
LDC95S28 LATINO-40 Spanish Read News
LDC2007S09 Mandarin Affective Speech
LDC2006S33 Middle East Technical University Turkish Microphone Speech v 1.0
LDC2006S13 N4 NATO Native and Non-Native Speech
LDC2007S15 Nationwide Speech Project
LDC93S3A Resource Management Complete Set 2.0
LDC93S3B Resource Management RM1 2.0
LDC93S3C Resource Management RM2 2.0
LDC96S39 RM Isolated and Spelled Word Data
LDC93S11 Road Rally
LDC2000S85 Santa Barbara Corpus of Spoken American English Part I
LDC2003S06 Santa Barbara Corpus of Spoken American English Part II
LDC2004S10 Santa Barbara Corpus of Spoken American English Part III
LDC2005S25 Santa Barbara Corpus of Spoken American English Part IV
LDC2006S30 Speech Controlled Computing
LDC2000S96 Speech in Noisy Environments (SPINE) Evaluation Audio
LDC2001S04 Speech in Noisy Environments (SPINE2) Part 1 Audio
LDC2001S06 Speech in Noisy Environments (SPINE2) Part 2 Audio
LDC2001S08 Speech in Noisy Environments (SPINE2) Part 3 Audio
LDC2001S99 Speech in Noisy Environments 1 (SPINE1 CODED) Coded Audio
LDC99S78 SUSAS
LDC99S83 Tactical Speaker Identification Speech Corpus (TSID)
LDC98S72 Taiwanese Putonghua Speech and Transcripts
LDC97S63 The CMU Kids Corpus
LDC93S9 TI 46-Word
LDC93S10 TIDIGITS
LDC93S1W TIMIT Acoustic-Phonetic Continuous Speech (MS-WAV version)
LDC93S1 TIMIT Acoustic-Phonetic Continuous Speech Corpus
LDC2002S04 Translanguage English Database (TED) Speech
LDC2002S02 West Point Arabic Speech
LDC2008S04 West Point Brazilian Portuguese Speech
LDC2005S30 West Point Company G3 American English Speech
LDC2005S28 West Point Croatian Speech
LDC2006S37 West Point Heroico Spanish Speech
LDC2006S36 West Point Korean Speech
LDC2003S05 West Point Russian Speech
LDC95S24 WSJCAM0 Cambridge Read News
LDC94S16 YOHO Speaker Verification

telephone conversations [speech]
LDC96S61 1996 Speaker Recognition Benchmark
LDC2002S22 1997 HUB5 Arabic Evaluation
LDC2002S23 1997 HUB5 English Evaluation
LDC2002S24 1997 HUB5 German Evaluation
LDC2002S25 1997 HUB5 Spanish Evaluation
LDC2002S10 1998 HUB5 English Evaluation
LDC2002S09 2000 HUB5 English Evaluation Speech
LDC2002S13 2001 HUB5 English Evaluation
LDC2002S12 2001 HUB5 Mandarin Evaluation
LDC2004S11 2002 Rich Transcription Broadcast News and Conversational Telephone Speech
LDC2006S31 2003 NIST Language Recognition Evaluation
LDC2010S03 2003 NIST Speaker Recognition Evaluation
LDC2008S05 2005 NIST Language Recognition Evaluation
LDC2011S02 2006 NIST Spoken Term Detection Development Set
LDC2011S03 2006 NIST Spoken Term Detection Evaluation Set
LDC2005S07 Arabic CTS Levantine Fisher Training Data Set 3, Speech
LDC96S46 CALLFRIEND American English-Non-Southern Dialect
LDC96S47 CALLFRIEND American English-Southern Dialect
LDC96S48 CALLFRIEND Canadian French
LDC96S49 CALLFRIEND Egyptian Arabic
LDC96S50 CALLFRIEND Farsi
LDC96S51 CALLFRIEND German
LDC96S52 CALLFRIEND Hindi
LDC96S53 CALLFRIEND Japanese
LDC96S54 CALLFRIEND Korean
LDC96S55 CALLFRIEND Mandarin Chinese-Mainland Dialect
LDC96S56 CALLFRIEND Mandarin Chinese-Taiwan Dialect
LDC96S57 CALLFRIEND Spanish-Caribbean Dialect
LDC96S58 CALLFRIEND Spanish-Non-Caribbean Dialect
LDC96S59 CALLFRIEND Tamil
LDC96S60 CALLFRIEND Vietnamese
LDC97S42 CALLHOME American English Speech
LDC97S45 CALLHOME Egyptian Arabic Speech
LDC2002S37 CALLHOME Egyptian Arabic Speech Supplement
LDC97S43 CALLHOME German Speech
LDC96S37 CALLHOME Japanese Speech
LDC96S34 CALLHOME Mandarin Chinese Speech
LDC96S35 CALLHOME Spanish Speech
LDC2008S06 CSLU: Alphadigit Version 1.3
LDC2008S01 CSLU: Portland Cellular Telephone Speech Version 1.3
LDC2006S26 CSLU: Speaker Recognition Version 1.1
LDC2009T01 English CTS Treebank with Structural Metadata
LDC2005S13 Fisher English Training Part 2, Speech
LDC2004S13 Fisher English Training Speech Part 1 Speech
LDC2007S02 Fisher Levantine Arabic Conversational Telephone Speech
LDC2010S01 Fisher Spanish Speech
LDC96S29 Frontiers in Speech Processing 93
LDC2006S43 Gulf Arabic Conversational Telephone Speech
LDC98S69 HUB5 Mandarin Telephone Speech Corpus
LDC98S70 HUB5 Spanish Telephone Speech Corpus
LDC2006S45 Iraqi Arabic Conversational Telephone Speech
LDC95S22 KING Speaker Verification
LDC2003S03 Korean Telephone Conversations Speech
LDC2007S01 Levantine Arabic Conversational Telephone Speech
LDC2005S14 Levantine Arabic QT Training Data Set 4 (Speech + Transcripts)
LDC2006S29 Levantine Arabic QT Training Data Set 5, Speech
LDC2005S16 RT-04 MDE Training Data Speech
LDC2006S34 Russian through Switched Telephone Network (RuSTeN)
LDC2008S03 STC-TIMIT 1.0
LDC2001S13 Switchboard Cellular Part 1 Audio
LDC2001S15 Switchboard Cellular Part 1 Transcribed Audio
LDC2004S07 Switchboard Cellular Part 2 Audio
LDC93S8 Switchboard Credit Card
LDC97S62 Switchboard-1 Release 2
LDC98S75 Switchboard-2 Phase I
LDC99S79 Switchboard-2 Phase II

telephone speech [speech]
LDC2001S97 2000 NIST Speaker Recognition Evaluation
LDC2002S34 2001 NIST Speaker Recognition Evaluation Corpus
LDC2004S04 2002 NIST Speaker Recognition Evaluation
LDC2007S10 2003 NIST Rich Transcription Evaluation Data
LDC2006S44 2004 NIST Speaker Recognition Evaluation
LDC2011S04 2005 NIST Speaker Recognition Evaluation Test Data
LDC2011S01 2005 NIST Speaker Recognition Evaluation Training Data
LDC2009S05 2007 NIST Language Recognition Evaluation Supplemental Training Set
LDC2009S04 2007 NIST Language Recognition Evaluation Test Set
LDC2005S07 Arabic CTS Levantine Fisher Training Data Set 3, Speech
LDC94S20 BRAMSHILL
LDC2007S08 CSLU: Foreign Accented English Release 1.2
LDC2006S15 CSLU: Spelled and Spoken Words
LDC2006S14 CSLU: Stories v 1.2
LDC2007S13 CSLU: Apple Words and Phrases
LDC2006S35 CSLU: Multilanguage Telephone Speech Version 1.2
LDC2006S39 CSLU: Names Release 1.3
LDC2008S02 CSLU: National Cellular Telephone Speech Release 2.3
LDC2009S01 CSLU: Numbers Version 1.3
LDC2009S03 CSLU: S4X Release 1.2
LDC2006S26 CSLU: Speaker Recognition Version 1.1
LDC2007S05 CSLU: Yes/No Version 1.2
LDC96S30 CTIMIT
LDC2005S15 HKUST Mandarin Telephone Speech, Part 1
LDC98S67 HTIMIT
LDC98S68 LLHDB
LDC94S21 MACROPHONE
LDC93S2 NTIMIT
LDC94S17 OGI Multilanguage Corpus
LDC94S18 OGI Spelled and Spoken Word
LDC95S27 PhoneBook: NYNEX Isolated Words
LDC2004S08 RT-03 MDE Training Data Speech
LDC94S15 SPIDRE
LDC2002S06 Switchboard-2 Phase III Audio
LDC96S41 VAHA (POLYPHONE II)
LDC98S77 Voicemail Corpus Part I
LDC2002S35 Voicemail Corpus Part II
LDC2010S02 WTIMIT 1.0

transcribed speech [speech]
LDC99S80 1997 Speaker Recognition Benchmark
LDC98S76 1998 Speaker Recognition Benchmark
LDC99S81 1999 Speaker Recognition Benchmark
LDC2002S56 2000 Communicator Evaluation
LDC2003S01 2001 Communicator Evaluation

speech, text
| broadcast news | field recordings | meeting speech | microphone speech | telephone speech |

broadcast news [speech, text]
LDC2012S06 Turkish Broadcast News Speech and Transcripts

field recordings [speech, text]
LDC2012S03 Digital Archive of Southern Speech
LDC2012S04 Malto Speech and Transcripts
LDC2012S05 USC-SFI MALACH Interviews and Transcripts English

meeting speech [speech, text]
LDC2011S06 2005 Spring NIST Rich Transcription (RT-05S) Evaluation Set

microphone speech [speech, text]
LDC2011S10 2006 NIST Speaker Recognition Evaluation Test Set Part 1
LDC2012S01 2006 NIST Speaker Recognition Evaluation Test Set Part 2
LDC2011S11 2008 NIST Speaker Recognition Evaluation Supplemental Set
LDC2011S08 2008 NIST Speaker Recognition Evaluation Test Set
LDC2011S05 2008 NIST Speaker Recognition Evaluation Training Set Part 1
LDC2011S07 2008 NIST Speaker Recognition Evaluation Training Set Part 2
LDC2012S02 TORGO Database of Dysarthric Articulation

telephone speech [speech, text]
LDC2011S10 2006 NIST Speaker Recognition Evaluation Test Set Part 1
LDC2012S01 2006 NIST Speaker Recognition Evaluation Test Set Part 2
LDC2011S09 2006 NIST Speaker Recognition Evaluation Training Set
LDC2011S08 2008 NIST Speaker Recognition Evaluation Test Set
LDC2011S05 2008 NIST Speaker Recognition Evaluation Training Set Part 1
LDC2011S07 2008 NIST Speaker Recognition Evaluation Training Set Part 2

text
| broadcast conversation | broadcast news | dictionaries | email | fiction | government documents | journal articles | journal entries | meeting speech | microphone conversation | microphone speech | news magazine | newsgroups | newswire | question-answers | reviews | telephone conversations | telephone speech | text chat conversations | transcribed speech | varied | web collection | weblogs |

broadcast conversation [text]
LDC2011T05 2008/2010 NIST Metrics for Machine Translation (MetricsMaTr) GALE Evaluation Set
LDC2008T03 ACE 2005 English SpatialML Annotations
LDC2011T02 ACE 2005 English SpatialML Annotations Version 2
LDC2006T06 ACE 2005 Multilingual Training Corpus
LDC2010T07 Chinese Treebank 7.0
LDC2009T20 Czech Broadcast Conversation MDE Transcripts
LDC94T5 ECI Multilingual Text
LDC2009T02 GALE Phase 1 Chinese Broadcast Conversation Parallel Text - Part 1
LDC2009T06 GALE Phase 1 Chinese Broadcast Conversation Parallel Text - Part 2
LDC2007T20 GALE Phase 1 Distillation Training
LDC2012T06 GALE Phase 2 Arabic Broadcast Conversation Parallel Text Part 1
LDC2012T14 GALE Phase 2 Arabic Broadcast Conversation Parallel Text Part 2
LDC2013T04 GALE Phase 2 Arabic Broadcast Conversation Transcripts Part 1
LDC2013T08 GALE Phase 2 Chinese Broadcast Conversation Transcripts
LDC2009T10 Language Understanding Annotation Corpus
LDC2010T22 Manually Annotated Sub-Corpus First Release
LDC2010T17 NIST 2006 Open Machine Translation (OpenMT) Evaluation
LDC2009T24 OntoNotes Release 3.0
LDC2011T03 OntoNotes Release 4.0

broadcast news [text]
LDC98T31 1996 CSR HUB4 Language Model
LDC97T22 1996 English Broadcast News Transcripts (HUB4)
LDC98T28 1997 English Broadcast News Transcripts (HUB4)
LDC98T24 1997 Mandarin Broadcast News Transcripts (HUB4-NE)
LDC98T29 1997 Spanish Broadcast News Transcripts (HUB4-NE)
LDC2011T05 2008/2010 NIST Metrics for Machine Translation (MetricsMaTr) GALE Evaluation Set
LDC2005T09 ACE 2004 Multilingual Training Corpus
LDC2008T03 ACE 2005 English SpatialML Annotations
LDC2011T02 ACE 2005 English SpatialML Annotations Version 2
LDC2010T09 ACE 2005 Mandarin SpatialML Annotations
LDC2006T06 ACE 2005 Multilingual Training Corpus
LDC2010T18 ACE Time Normalization (TERN) 2004 English Evaluation Data V1.0
LDC2003T11 ACE-2 Version 1.0
LDC2006T20 Arabic Broadcast News Transcripts
LDC2012T07 Arabic Treebank - Broadcast News v1.0
LDC2011T06 Broadcast News Lattices
LDC2010T07 Chinese Treebank 7.0
LDC2008T22 Czech Academic Corpus 2.0
LDC2010T02 Czech Broadcast News MDE Transcripts
LDC2004T01 Czech Broadcast News Transcripts
LDC2011T08 Datasets for Generic Relation Extraction (reACE)
LDC94T5 ECI Multilingual Text
LDC2007T24 GALE Phase 1 Arabic Broadcast News Parallel Text - Part 1
LDC2008T09 GALE Phase 1 Arabic Broadcast News Parallel Text - Part 2
LDC2007T23 GALE Phase 1 Chinese Broadcast News Parallel Text - Part 1
LDC2008T08 GALE Phase 1 Chinese Broadcast News Parallel Text - Part 2
LDC2008T18 GALE Phase 1 Chinese Broadcast News Parallel Text - Part 3
LDC2007T20 GALE Phase 1 Distillation Training
LDC2012T18 GALE Phase 2 Arabic Broadcast News Parallel Text
LDC2006T14 Korean Broadcast News Transcripts
LDC2009T10 Language Understanding Annotation Corpus
LDC2010T22 Manually Annotated Sub-Corpus First Release
LDC2010T17 NIST 2006 Open Machine Translation (OpenMT) Evaluation
LDC2008T04 OntoNotes Release 2.0
LDC2009T24 OntoNotes Release 3.0
LDC2011T03 OntoNotes Release 4.0
LDC2004T12 RT-03 MDE Training Data Text and Annotations
LDC2011T01 SemEval-2010 Task 1 OntoNotes English: Coreference Resolution in Multiple Languages
LDC98T25 TDT Pilot Study Corpus
LDC2000T44 TDT2 Careful Transcription Text
LDC2001T57 TDT2 Multilanguage Text Version 4.0
LDC2001T58 TDT3 Multilanguage Text Version 2.0
LDC2005T16 TDT4 Multilingual Text and Annotations
LDC2004T09 TIDES Extraction (ACE) 2003 Multilingual Training Data
LDC99T36 USC Marketplace Broadcast News Transcripts
LDC2000T53 Voice of America (VOA) Czech Broadcast News Transcripts

dictionaries [text]
LDC93T1 ACL/DCI
LDC94T5 ECI Multilingual Text
LDC2004T25 Prague Czech-English Dependency Treebank 1.0
LDC2003T10 SAID

email [text]
LDC2007T22 2001 Topic Annotated Enron Email Data Set
LDC2012T13 English Web Treebank
LDC2009T10 Language Understanding Annotation Corpus
LDC2010T22 Manually Annotated Sub-Corpus First Release

fiction [text]
LDC2011T04 Indian Language Part-of-Speech Tagset: Sanskrit
LDC2012T12 Spanish TimeBank 1.0

government documents [text]
LDC2013T06 1993-2007 United Nations Parallel Text
LDC95T20 Hansard French/English
LDC2000T50 Hong Kong Hansards Parallel Text
LDC2000T47 Hong Kong Laws Parallel Text
LDC2004T08 Hong Kong Parallel Text
LDC2010T12 NIST 2004 Open Machine Translation (OpenMT) Evaluation
LDC2003T16 SummBank 1.0
LDC94T4A UN Parallel Text (Complete)
LDC94T4B-1 UN Parallel Text (English)
LDC94T4B-2 UN Parallel Text (French)
LDC94T4B-3 UN Parallel Text (Spanish)

journal articles [text]
LDC2009T29 ACL Anthology Reference Corpus
LDC93T1 ACL/DCI
LDC2005T35 American National Corpus (ANC) Second Release
LDC2009T04 BioProp Version 1.0
LDC2013T02 Chinese-English Biology and Chemistry Abstract Parallel Text
LDC2012T22 Chinese-English Semiconductor Parallel Text
LDC94T5 ECI Multilingual Text
LDC2008T20 PennBioIE CYP 1.0
LDC2008T21 PennBioIE Oncology 1.0
LDC2001T10 Prague Dependency Treebank 1.0
LDC2006T01 Prague Dependency Treebank 2.0

journal entries [text]
LDC2012T01 ModeS TimeBank 1.0

meeting speech [text]
LDC2004T04 ICSI Meeting Transcripts
LDC2004T10 ISL Meeting Transcripts Part 1
LDC2004T13 NIST Meeting Pilot Corpus Transcripts and Metadata

microphone conversation [text]
LDC2000T54 Speech in Noisy Environments (SPINE) Evaluation Transcripts
LDC2000T49 Speech in Noisy Environments (SPINE) Training Transcripts
LDC2001T05 Speech in Noisy Environments (SPINE2) Part 1 Transcripts
LDC2001T07 Speech in Noisy Environments (SPINE2) Part 2 Transcripts
LDC2001T09 Speech in Noisy Environments (SPINE2) Part 3 Transcripts

microphone speech [text]
LDC2000T54 Speech in Noisy Environments (SPINE) Evaluation Transcripts
LDC2000T49 Speech in Noisy Environments (SPINE) Training Transcripts
LDC2001T05 Speech in Noisy Environments (SPINE2) Part 1 Transcripts
LDC2001T07 Speech in Noisy Environments (SPINE2) Part 2 Transcripts
LDC2001T09 Speech in Noisy Environments (SPINE2) Part 3 Transcripts
LDC99T33 SUSAS Transcripts
LDC2002T03 Translanguage English Database (TED) Transcripts
LDC95T7 Treebank-2
LDC99T42 Treebank-3

news magazine [text]
LDC2009T12 2008 CoNLL Shared Task Data
LDC2005T35 American National Corpus (ANC) Second Release
LDC2012T10 Catalan TimeBank 1.0
LDC2010T07 Chinese Treebank 7.0
LDC2008T22 Czech Academic Corpus 2.0
LDC94T5 ECI Multilingual Text
LDC2008T01 Hungarian-English Parallel Text, Version 1.0
LDC2001T10 Prague Dependency Treebank 1.0
LDC2006T01 Prague Dependency Treebank 2.0

newsgroups [text]
LDC2006T06 ACE 2005 Multilingual Training Corpus
LDC2012T13 English Web Treebank
LDC2009T03 GALE Phase 1 Arabic Newsgroup Parallel Text - Part 1
LDC2009T09 GALE Phase 1 Arabic Newsgroup Parallel Text - Part 2
LDC2009T15 GALE Phase 1 Chinese Newsgroup Parallel Text - Part 1
LDC2010T03 GALE Phase 1 Chinese Newsgroup Parallel Text - Part 2
LDC2013T01 GALE Phase 2 Arabic Web Parallel Text
LDC2012T15 MADCAT Phase 1 Training Set
LDC2013T09 MADCAT Phase 2 Training Set
LDC2011T03 OntoNotes Release 4.0

newswire [text]
LDC2009T12 2008 CoNLL Shared Task Data
LDC2009T05 2008 NIST Metrics for Machine Translation (MetricsMATR08) Development Data
LDC2011T05 2008/2010 NIST Metrics for Machine Translation (MetricsMaTr) GALE Evaluation Set
LDC2012T03 2009 CoNLL Shared Task Part 1
LDC2012T04 2009 CoNLL Shared Task Part 2
LDC2005T09 ACE 2004 Multilingual Training Corpus
LDC2008T03 ACE 2005 English SpatialML Annotations
LDC2011T02 ACE 2005 English SpatialML Annotations Version 2
LDC2010T18 ACE Time Normalization (TERN) 2004 English Evaluation Data V1.0
LDC2005T07 ACE Time Normalization (TERN) 2004 English Training Data v 1.0
LDC2003T11 ACE-2 Version 1.0
LDC93T1 ACL/DCI
LDC2005T35 American National Corpus (ANC) Second Release
LDC2012T21 Annotated English Gigaword
LDC2008T25 AQUAINT-2 Information-Retrieval Text Research Collection
LDC2004T18 Arabic English Parallel News Part 1
LDC2003T12 Arabic Gigaword
LDC2011T11 Arabic Gigaword Fifth Edition
LDC2009T30 Arabic Gigaword Fourth Edition
LDC2006T02 Arabic Gigaword Second Edition
LDC2007T40 Arabic Gigaword Third Edition
LDC2004T17 Arabic News Translation Text Part 1
LDC2009T22 Arabic Newswire English Translation Collection
LDC2001T55 Arabic Newswire Part 1
LDC2003T07 Arabic Treebank: Part 1 - 10K-word English Translation
LDC2003T06 Arabic Treebank: Part 1 v 2.0
LDC2005T02 Arabic Treebank: Part 1 v 3.0 (POS with full vocalization + syntactic analysis)
LDC2010T13 Arabic Treebank: Part 1 v 4.1
LDC2004T02 Arabic Treebank: Part 2 v 2.0
LDC2011T09 Arabic Treebank: Part 2 v 3.1
LDC2004T11 Arabic Treebank: Part 3 v 1.0
LDC2010T08 Arabic Treebank: Part 3 v 3.2
LDC2000T43 BLLIP 1987-89 WSJ Corpus Release 1
LDC2008T13 BLLIP North American News Text, Complete
LDC2008T14 BLLIP North American News Text, General Release
LDC2012T10 Catalan TimeBank 1.0
LDC2005T13 CCGbank
LDC2001T62 CETEMpublico
LDC2005T34 Chinese <-> English Named Entity Lists v 1.0
LDC2012T05 Chinese Dependency Treebank 1.0
LDC2005T10 Chinese English News Magazine Parallel Text
LDC2003T09 Chinese Gigaword
LDC2011T13 Chinese Gigaword Fifth Edition
LDC2009T27 Chinese Gigaword Fourth Edition
LDC2005T14 Chinese Gigaword Second Edition
LDC2007T38 Chinese Gigaword Third Edition
LDC2005T06 Chinese News Translation Text Part 1
LDC2005T23 Chinese Proposition Bank 1.0
LDC2008T07 Chinese Proposition Bank 2.0
LDC2001T11 Chinese Treebank 2.0
LDC2004T05 Chinese Treebank 4.0
LDC2005T01 Chinese Treebank 5.0
LDC2007T36 Chinese Treebank 6.0
LDC2010T07 Chinese Treebank 7.0
LDC96T11 COMLEX Syntax Text Corpus Version 2.0
LDC2008T24 COMNOM v 1.0
LDC95T6 CSR-III Text
LDC2008T22 Czech Academic Corpus 2.0
LDC2011T08 Datasets for Generic Relation Extraction (reACE)
LDC97T12 DSO Corpus of Sense-Tagged English
LDC94T5 ECI Multilingual Text
LDC2007T02 English Chinese Translation Treebank v 1.0
LDC2003T05 English Gigaword
LDC2011T07 English Gigaword Fifth Edition
LDC2009T13 English Gigaword Fourth Edition
LDC2005T12 English Gigaword Second Edition
LDC2007T07 English Gigaword Third Edition
LDC2012T02 English Translation Treebank: An-Nahar Newswire
LDC95T11 European Language Newspaper Text
LDC2009T23 FactBank 1.0
LDC2009T28 French Gigaword Second Edition
LDC2011T10 French Gigaword Third Edition
LDC2013T10 GALE Arabic-English Parallel Aligned Treebank -- Newswire
LDC2012T16 GALE Chinese-English Word Alignment and Tagging Training Part 1 -- Newswire and Web
LDC2012T20 GALE Chinese-English Word Alignment and Tagging Training Part 2 -- Newswire
LDC2012T17 GALE Phase 2 Arabic Newswire Parallel Text
LDC2005T28 HARD 2004 Text
LDC2000T46 Hong Kong News Parallel Text
LDC2008T01 Hungarian-English Parallel Text, Version 1.0
LDC95T8 Japanese Business News Text
LDC99T34 Japanese Business News Text Supplement
LDC2000T45 Korean Newswire
LDC2010T19 Korean Newswire Second Edition
LDC2006T03 Korean Propbank
LDC2006T09 Korean Treebank Annotations Version 2.0
LDC2009T10 Language Understanding Annotation Corpus
LDC2012T15 MADCAT Phase 1 Training Set
LDC2013T09 MADCAT Phase 2 Training Set
LDC95T13 Mandarin Chinese News Text
LDC2010T22 Manually Annotated Sub-Corpus First Release
LDC2003T13 Message Understanding Conference (MUC) 6
LDC96T10 Message Understanding Conference (MUC) 6 Additional News Text
LDC2001T02 Message Understanding Conference (MUC) 7
LDC2010T15 Message Understanding Conference 7 Timed (MUC7_T)
LDC2004T03 Morphologically Annotated Korean Text
LDC2003T18 Multiple-Translation Arabic (MTA) Part 1
LDC2005T05 Multiple-Translation Arabic (MTA) Part 2
LDC2003T17 Multiple-Translation Chinese (MTC) Part 2
LDC2004T07 Multiple-Translation Chinese (MTC) Part 3
LDC2006T04 Multiple-Translation Chinese (MTC) Part 4
LDC2002T01 Multiple-Translation Chinese Corpus
LDC2010T10 NIST 2002 Open Machine Translation (OpenMT) Evaluation
LDC2010T11 NIST 2003 Open Machine Translation (OpenMT) Evaluation
LDC2010T12 NIST 2004 Open Machine Translation (OpenMT) Evaluation
LDC2010T14 NIST 2005 Open Machine Translation (OpenMT) Evaluation
LDC2010T17 NIST 2006 Open Machine Translation (OpenMT) Evaluation
LDC2010T21 NIST 2008 Open Machine Translation (OpenMT) Evaluation
LDC2013T07 NIST 2008-2012 Open Machine Translation (OpenMT) Progress Test Sets
LDC2010T23 NIST 2009 Open Machine Translation (OpenMT) Evaluation
LDC2013T03 NIST 2012 Open Machine Translation (OpenMT) Evaluation
LDC2010T01 NIST Open MT 2008 Evaluation (MT08) Selected References and System Translations
LDC2008T23 NomBank v 1.0
LDC95T21 North American News Text Corpus
LDC98T30 North American News Text Supplement
LDC2008T15 North American News Text, Complete
LDC2008T16 North American News Text, General Release
LDC2007T21 OntoNotes Release 1.0
LDC2008T04 OntoNotes Release 2.0
LDC2009T24 OntoNotes Release 3.0
LDC2011T03 OntoNotes Release 4.0
LDC2008T05 Penn Discourse Treebank Version 2.0
LDC99T40 Portuguese Newswire Text
LDC2004T25 Prague Czech-English Dependency Treebank 1.0
LDC2012T08 Prague Czech-English Dependency Treebank 2.0
LDC2001T10 Prague Dependency Treebank 1.0
LDC2006T01 Prague Dependency Treebank 2.0
LDC2004T14 Proposition Bank I
LDC2009T11 REFLEX Entity Translation Training/DevTest
LDC2002T07 RST Discourse Treebank
LDC2011T01 SemEval-2010 Task 1 OntoNotes English: Coreference Resolution in Multiple Languages
LDC2006T12 Spanish Gigaword First Edition
LDC2009T21 Spanish Gigaword Second Edition
LDC2011T12 Spanish Gigaword Third Edition
LDC95T9 Spanish News Text
LDC99T41 Spanish Newswire Text, Volume 2
LDC2012T12 Spanish TimeBank 1.0
LDC2007T03 Tagged Chinese Gigaword
LDC2009T14 Tagged Chinese Gigaword Version 2.0
LDC98T25 TDT Pilot Study Corpus
LDC2001T57 TDT2 Multilanguage Text Version 4.0
LDC2001T58 TDT3 Multilanguage Text Version 2.0
LDC2005T16 TDT4 Multilingual Text and Annotations
LDC2006T19 TDT5 Topics and Annotations
LDC2002T31 The AQUAINT Corpus of English News Text
LDC2008T19 The New York Times Annotated Corpus
LDC2004T09 TIDES Extraction (ACE) 2003 Multilingual Training Data
LDC2006T08 TimeBank 1.2
LDC93T3A TIPSTER Complete
LDC93T3B TIPSTER Volume 1
LDC93T3C TIPSTER Volume 2
LDC93T3D TIPSTER Volume 3
LDC2000T52 TREC Mandarin
LDC2000T51 TREC Spanish
LDC95T7 Treebank-2
LDC99T42 Treebank-3

question-answers [text]
LDC2012T13 English Web Treebank

reviews [text]
LDC2012T13 English Web Treebank

telephone conversations [text]
LDC2002T39 1997 HUB5 Arabic Transcripts
LDC2003T03 1997 HUB5 German Transcripts
LDC2003T04 1997 HUB5 Spanish Transcripts
LDC2003T02 1998 HUB5 English Transcripts
LDC2004T15 2000 Communicator Dialogue Act Tagged
LDC2004T16 2001 Communicator Dialogue Act Tagged
LDC2003T01 2001 HUB5 Mandarin Transcripts
LDC97T14 CALLHOME American English Transcripts
LDC97T19 CALLHOME Egyptian Arabic Transcripts
LDC2002T38 CALLHOME Egyptian Arabic Transcripts Supplement
LDC97T15 CALLHOME German Transcripts
LDC96T18 CALLHOME Japanese Transcripts
LDC96T16 CALLHOME Mandarin Chinese Transcripts
LDC2008T17 CALLHOME Mandarin Chinese Transcripts - XML version
LDC2001T61 CALLHOME Spanish Dialogue Act Annotation
LDC96T17 CALLHOME Spanish Transcripts
LDC2005T19 Fisher English Training Part 2, Transcripts
LDC2007T04 Fisher Levantine Arabic Conversational Telephone Speech, Transcripts
LDC2010T04 Fisher Spanish - Transcripts
LDC2006T15 Gulf Arabic Conversational Telephone Speech, Transcripts
LDC2005T32 HKUST Mandarin Telephone Transcript Data, Part 1
LDC98T26 HUB5 Mandarin Transcripts
LDC98T27 HUB5 Spanish Transcripts
LDC2006T16 Iraqi Arabic Conversational Telephone Speech, Transcripts
LDC2003S07 Korean Telephone Conversations Complete Set
LDC2003T08 Korean Telephone Conversations Transcripts
LDC2007T01 Levantine Arabic Conversational Telephone Speech, Transcripts
LDC2006T07 Levantine Arabic QT Training Data Set 5, Transcripts
LDC2009T26 NXT Switchboard Annotations
LDC2004T12 RT-03 MDE Training Data Text and Annotations
LDC2001T14 Switchboard Cellular Part 1 Transcription
LDC2001T60 Syllable-Final /s/ Lenition

telephone speech [text]
LDC2005T35 American National Corpus (ANC) Second Release
LDC2006T16 Iraqi Arabic Conversational Telephone Speech, Transcripts
LDC2009T10 Language Understanding Annotation Corpus
LDC2010T22 Manually Annotated Sub-Corpus First Release
LDC99T42 Treebank-3

text chat conversations text
[ text ]
LDC2010T05 NPS Internet Chatroom Conversations, Release 1.0

transcribed speech text
[ text ]
LDC2010T04 Fisher Spanish - Transcripts
LDC2008T09 GALE Phase 1 Arabic Broadcast News Parallel Text - Part 2
LDC2009T02 GALE Phase 1 Chinese Broadcast Conversation Parallel Text - Part 1
LDC2009T06 GALE Phase 1 Chinese Broadcast Conversation Parallel Text - Part 2
LDC2007T23 GALE Phase 1 Chinese Broadcast News Parallel Text - Part 1
LDC2008T08 GALE Phase 1 Chinese Broadcast News Parallel Text - Part 2
LDC2008T18 GALE Phase 1 Chinese Broadcast News Parallel Text - Part 3
LDC2010T22 Manually Annotated Sub-Corpus First Release
LDC98T25 TDT Pilot Study Corpus
LDC2001T57 TDT2 Multilanguage Text Version 4.0
LDC2001T58 TDT3 Multilanguage Text Version 2.0
LDC2004T09 TIDES Extraction (ACE) 2003 Multilingual Training Data
LDC95T7 Treebank-2
LDC99T42 Treebank-3

varied text
[ text ]
LDC2012T11 American English Nickname Collection
LDC2005T35 American National Corpus (ANC) Second Release
LDC96T11 COMLEX Syntax Text Corpus Version 2.0
LDC97T12 DSO Corpus of Sense-Tagged English
LDC94T5 ECI Multilingual Text
LDC2008T01 Hungarian-English Parallel Text, Version 1.0
LDC98T32 JURIS
LDC2002T26 Korean English Treebank Annotations
LDC2009T10 Language Understanding Annotation Corpus
LDC2010T22 Manually Annotated Sub-Corpus First Release
LDC2001T10 Prague Dependency Treebank 1.0
LDC93T3A TIPSTER Complete
LDC93T3B TIPSTER Volume 1
LDC93T3C TIPSTER Volume 2
LDC93T3D TIPSTER Volume 3
LDC95T7 Treebank-2
LDC99T42 Treebank-3

web collection text
[ text ]
LDC2011T05 2008/2010 NIST Metrics for Machine Translation (MetricsMaTr) GALE Evaluation Set
LDC2005T35 American National Corpus (ANC) Second Release
LDC2012T09 Arabic-Dialect/English Parallel Text
LDC2010T07 Chinese Treebank 7.0
LDC2010T06 Chinese Web 5-gram Version 1
LDC2012T24 GALE Chinese-English Word Alignment and Tagging Training Part 3 -- Web
LDC2013T05 GALE Chinese-English Word Alignment and Tagging Training Part 4 -- Web
LDC2008T01 Hungarian-English Parallel Text, Version 1.0
LDC2010T16 Indian Language Part-of-Speech Tagset: Bengali
LDC2010T24 Indian Language Part-of-Speech Tagset: Hindi
LDC2009T08 Japanese Web N-gram Version 1
LDC2010T22 Manually Annotated Sub-Corpus First Release
LDC2010T17 NIST 2006 Open Machine Translation (OpenMT) Evaluation
LDC2010T21 NIST 2008 Open Machine Translation (OpenMT) Evaluation
LDC2013T07 NIST 2008-2012 Open Machine Translation (OpenMT) Progress Test Sets
LDC2013T03 NIST 2012 Open Machine Translation (OpenMT) Evaluation
LDC2010T01 NIST Open MT 2008 Evaluation (MT08) Selected References and System Translations
LDC2012T23 Russian-English Computer Security Parallel Text
LDC2006T13 Web 1T 5-gram Version 1
LDC2009T25 Web 1T 5-gram, 10 European Languages Version 1

weblogs text
[ text ]
LDC2006T06 ACE 2005 Multilingual Training Corpus
LDC2012T09 Arabic-Dialect/English Parallel Text
LDC2012T13 English Web Treebank
LDC2012T16 GALE Chinese-English Word Alignment and Tagging Training Part 1 -- Newswire and Web
LDC2008T02 GALE Phase 1 Arabic Blog Parallel Text
LDC2009T03 GALE Phase 1 Arabic Newsgroup Parallel Text - Part 1
LDC2008T06 GALE Phase 1 Chinese Blog Parallel Text
LDC2013T01 GALE Phase 2 Arabic Web Parallel Text
LDC2010T16 Indian Language Part-of-Speech Tagset: Bengali
LDC2012T15 MADCAT Phase 1 Training Set
LDC2013T09 MADCAT Phase 2 Training Set
LDC2010T23 NIST 2009 Open Machine Translation (OpenMT) Evaluation
LDC2011T03 OntoNotes Release 4.0

video
| broadcast conversation | broadcast news | field recordings | meeting speech | video |
[ top ]

broadcast conversation video
[ video ]
LDC2010V02 TRECVID 2006 Keyframes

broadcast news video
[ video ]
LDC2012V01 2005 NIST/USF Evaluation Resources for the VACE Program - Broadcast News
LDC2007V02 TRECVID 2003 Keyframes & Transcripts
LDC2010V01 TRECVID 2004 Keyframes & Transcripts
LDC2007V01 TRECVID 2005 Keyframes & Transcripts
LDC2010V02 TRECVID 2006 Keyframes

field recordings video
[ video ]
LDC2003V01 FORM2 Kinematic Gesture

meeting speech video
[ video ]
LDC2011V05 2006 NIST/USF Evaluation Resources for the VACE Program - Meeting Data Test Set Part 1
LDC2011V06 2006 NIST/USF Evaluation Resources for the VACE Program - Meeting Data Test Set Part 2
LDC2011V03 NIST/USF Evaluation Resources for the VACE Program - Meeting Data Test Set Part 1
LDC2011V04 NIST/USF Evaluation Resources for the VACE Program - Meeting Data Test Set Part 2
LDC2011V01 NIST/USF Evaluation Resources for the VACE Program - Meeting Data Training Set Part 1
LDC2011V02 NIST/USF Evaluation Resources for the VACE Program - Meeting Data Training Set Part 2

video video
[ video ]
LDC2009V01 Audiovisual Database of Spoken American English
LDC2004V01 FORM1 Kinematic Gesture

Contact: ldc@ldc.upenn.edu

(c) 1992-2010 Linguistic Data Consortium, University of Pennsylvania. All Rights Reserved.