


LDC Projects

Many of the corpora available from the LDC have been used in multi-site, multi-year research projects, with benchmark tests carried out under government sponsorship. Typically, a test protocol is defined by the Natural Language Processing Group at NIST, and methods and findings are reported by researchers, based on data provided by the LDC for both training and testing of language-based systems. Below is a list of the projects that have used LDC data.
  ACE
 
LDC2005T09 ACE 2004 Multilingual Training Corpus
 
LDC2008T03 ACE 2005 English SpatialML Annotations
 
LDC2011T02 ACE 2005 English SpatialML Annotations Version 2
 
LDC2010T09 ACE 2005 Mandarin SpatialML Annotations
 
LDC2006T06 ACE 2005 Multilingual Training Corpus
 
LDC2010T18 ACE Time Normalization (TERN) 2004 English Evaluation Data V1.0
 
LDC2005T07 ACE Time Normalization (TERN) 2004 English Training Data v 1.0
 
LDC2003T11 ACE-2 Version 1.0
 
LDC2005T33 BBN Pronoun Coreference and Entity Type Corpus
 
LDC2011T08 Datasets for Generic Relation Extraction (reACE)
 
LDC2004T14 Proposition Bank I
 
LDC2009T11 REFLEX Entity Translation Training/DevTest
 
LDC2004T09 TIDES Extraction (ACE) 2003 Multilingual Training Data
  American National Corpus (ANC)
 
LDC2010T22 Manually Annotated Sub-Corpus First Release
  AQUAINT
 
LDC2008T25 AQUAINT-2 Information-Retrieval Text Research Collection
 
LDC2005T33 BBN Pronoun Coreference and Entity Type Corpus
  ATIS
 
LDC93S4A ATIS0 Complete
 
LDC93S4B ATIS0 Pilot
 
LDC93S4B-2 ATIS0 Read
 
LDC93S4B-3 ATIS0 SD Read
 
LDC93S5 ATIS2
 
LDC95S26 ATIS3 Test Data
 
LDC94S19 ATIS3 Training Data
  BEST
  BOLT
  Communicator
 
LDC2004T15 2000 Communicator Dialogue Act Tagged
 
LDC2002S56 2000 Communicator Evaluation
 
LDC2004T16 2001 Communicator Dialogue Act Tagged
 
LDC2003S01 2001 Communicator Evaluation
  CoNLL
 
LDC2012T03 2009 CoNLL Shared Task Part 1
 
LDC2012T04 2009 CoNLL Shared Task Part 2
  DARPA-CSR
 
LDC2005S08 BBN/AUB DARPA Babylon Levantine Arabic Speech and Transcripts
 
LDC93S6A CSR-I (WSJ0) Complete
 
LDC93S6C CSR-I (WSJ0) Other
 
LDC93S6B CSR-I (WSJ0) Sennheiser
 
LDC94S13A CSR-II (WSJ1) Complete
 
LDC94S13C CSR-II (WSJ1) Other
 
LDC94S13B CSR-II (WSJ1) Sennheiser
 
LDC95S23 CSR-III Speech
 
LDC95T6 CSR-III Text
 
LDC96S33 CSR-IV HUB3
 
LDC96S31 CSR-IV HUB4
  DASL
 
LDC2003T15 SLX Corpus of Classic Sociolinguistic Interviews
  DEFT
  DOE/IRS2008-0256
  DUC
  EARS
 
LDC97S66 1996 English Broadcast News Dev and Eval (HUB4)
 
LDC97S44 1996 English Broadcast News Speech (HUB4)
 
LDC97T22 1996 English Broadcast News Transcripts (HUB4)
 
LDC98S71 1997 English Broadcast News Speech (HUB4)
 
LDC98T28 1997 English Broadcast News Transcripts (HUB4)
 
LDC2001S91 1997 HUB4 Broadcast News Evaluation Non-English Test Material
 
LDC2002S11 1997 HUB4 English Evaluation Speech and Transcripts
 
LDC2002S22 1997 HUB5 Arabic Evaluation
 
LDC2002T39 1997 HUB5 Arabic Transcripts
 
LDC2002S24 1997 HUB5 German Evaluation
 
LDC2003T03 1997 HUB5 German Transcripts
 
LDC2002S25 1997 HUB5 Spanish Evaluation
 
LDC2003T04 1997 HUB5 Spanish Transcripts
 
LDC98S73 1997 Mandarin Broadcast News Speech (HUB4-NE)
 
LDC98T24 1997 Mandarin Broadcast News Transcripts (HUB4-NE)
 
LDC2002S10 1998 HUB5 English Evaluation
 
LDC2003T02 1998 HUB5 English Transcripts
 
LDC2002S13 2001 HUB5 English Evaluation
 
LDC2002S12 2001 HUB5 Mandarin Evaluation
 
LDC2003T01 2001 HUB5 Mandarin Transcripts
 
LDC2004S11 2002 Rich Transcription Broadcast News and Conversational Telephone Speech
 
LDC99L23 American English Spoken Lexicon
 
LDC2005S07 Arabic CTS Levantine Fisher Training Data Set 3, Speech
 
LDC2005T03 Arabic CTS Levantine Fisher Training Data Set 3, Transcripts
 
LDC2003T12 Arabic Gigaword
 
LDC2001T55 Arabic Newswire Part 1
 
LDC2005S08 BBN/AUB DARPA Babylon Levantine Arabic Speech and Transcripts
 
LDC96S46 CALLFRIEND American English-Non-Southern Dialect
 
LDC96S47 CALLFRIEND American English-Southern Dialect
 
LDC96S49 CALLFRIEND Egyptian Arabic
 
LDC96S55 CALLFRIEND Mandarin Chinese-Mainland Dialect
 
LDC96S56 CALLFRIEND Mandarin Chinese-Taiwan Dialect
 
LDC97L20 CALLHOME American English Lexicon (PRONLEX)
 
LDC97S42 CALLHOME American English Speech
 
LDC97T14 CALLHOME American English Transcripts
 
LDC97S45 CALLHOME Egyptian Arabic Speech
 
LDC2002S37 CALLHOME Egyptian Arabic Speech Supplement
 
LDC97T19 CALLHOME Egyptian Arabic Transcripts
 
LDC2002T38 CALLHOME Egyptian Arabic Transcripts Supplement
 
LDC96L15 CALLHOME Mandarin Chinese Lexicon
 
LDC96S34 CALLHOME Mandarin Chinese Speech
 
LDC96T16 CALLHOME Mandarin Chinese Transcripts
 
LDC2003T09 Chinese Gigaword
 
LDC2005T14 Chinese Gigaword Second Edition
 
LDC2005T08 Discourse Graphbank
 
LDC99L22 Egyptian Colloquial Arabic Lexicon
 
LDC2003T05 English Gigaword
 
LDC2005T12 English Gigaword Second Edition
 
LDC2005S13 Fisher English Training Part 2, Speech
 
LDC2005T19 Fisher English Training Part 2, Transcripts
 
LDC2004S13 Fisher English Training Speech Part 1 Speech
 
LDC2004T19 Fisher English Training Speech Part 1 Transcripts
 
LDC2005S15 HKUST Mandarin Telephone Speech, Part 1
 
LDC2005T32 HKUST Mandarin Telephone Transcript Data, Part 1
 
LDC98S69 HUB5 Mandarin Telephone Speech Corpus
 
LDC98T26 HUB5 Mandarin Transcripts
 
LDC2005S14 Levantine Arabic QT Training Data Set 4 (Speech + Transcripts)
 
LDC2006S29 Levantine Arabic QT Training Data Set 5, Speech
 
LDC2006T07 Levantine Arabic QT Training Data Set 5, Transcripts
 
LDC95T13 Mandarin Chinese News Text
 
LDC95T21 North American News Text Corpus
 
LDC98T30 North American News Text Supplement
 
LDC2004S08 RT-03 MDE Training Data Speech
 
LDC2004T12 RT-03 MDE Training Data Text and Annotations
 
LDC2005S16 RT-04 MDE Training Data Speech
 
LDC2005T24 RT-04 MDE Training Data Text/Annotations
 
LDC2004S10 Santa Barbara Corpus of Spoken American English Part III
 
LDC2005S25 Santa Barbara Corpus of Spoken American English Part IV
 
LDC2006T12 Spanish Gigaword First Edition
 
LDC2009T21 Spanish Gigaword Second Edition
 
LDC2001S13 Switchboard Cellular Part 1 Audio
 
LDC2001S15 Switchboard Cellular Part 1 Transcribed Audio
 
LDC2001T14 Switchboard Cellular Part 1 Transcription
 
LDC2004S07 Switchboard Cellular Part 2 Audio
 
LDC97S62 Switchboard-1 Release 2
 
LDC98S75 Switchboard-2 Phase I
 
LDC99S79 Switchboard-2 Phase II
 
LDC2002S06 Switchboard-2 Phase III Audio
 
LDC98S72 Taiwanese Putonghua Speech and Transcripts
 
LDC98T25 TDT Pilot Study Corpus
 
LDC2000S92 TDT2 Careful Transcription Audio
 
LDC2000T44 TDT2 Careful Transcription Text
 
LDC99S84 TDT2 English Audio
 
LDC2001S93 TDT2 Mandarin Audio Corpus
 
LDC2001T57 TDT2 Multilanguage Text Version 4.0
 
LDC2001S94 TDT3 English Audio
 
LDC2001S95 TDT3 Mandarin Audio
 
LDC2001T58 TDT3 Multilanguage Text Version 2.0
  GALE
 
LDC97S66 1996 English Broadcast News Dev and Eval (HUB4)
 
LDC97S44 1996 English Broadcast News Speech (HUB4)
 
LDC97T22 1996 English Broadcast News Transcripts (HUB4)
 
LDC98S71 1997 English Broadcast News Speech (HUB4)
 
LDC98T28 1997 English Broadcast News Transcripts (HUB4)
 
LDC2001S91 1997 HUB4 Broadcast News Evaluation Non-English Test Material
 
LDC2002S11 1997 HUB4 English Evaluation Speech and Transcripts
 
LDC2002S22 1997 HUB5 Arabic Evaluation
 
LDC2002T39 1997 HUB5 Arabic Transcripts
 
LDC2002S24 1997 HUB5 German Evaluation
 
LDC2003T03 1997 HUB5 German Transcripts
 
LDC2002S25 1997 HUB5 Spanish Evaluation
 
LDC2003T04 1997 HUB5 Spanish Transcripts
 
LDC98S73 1997 Mandarin Broadcast News Speech (HUB4-NE)
 
LDC98T24 1997 Mandarin Broadcast News Transcripts (HUB4-NE)
 
LDC2002S10 1998 HUB5 English Evaluation
 
LDC2003T02 1998 HUB5 English Transcripts
 
LDC2002S13 2001 HUB5 English Evaluation
 
LDC2002S12 2001 HUB5 Mandarin Evaluation
 
LDC2003T01 2001 HUB5 Mandarin Transcripts
 
LDC2004S11 2002 Rich Transcription Broadcast News and Conversational Telephone Speech
 
LDC2009T05 2008 NIST Metrics for Machine Translation (MetricsMATR08) Development Data
 
LDC2011T05 2008/2010 NIST Metrics for Machine Translation (MetricsMaTr) GALE Evaluation Set
 
LDC2005T09 ACE 2004 Multilingual Training Corpus
 
LDC2005T07 ACE Time Normalization (TERN) 2004 English Training Data v 1.0
 
LDC2003T11 ACE-2 Version 1.0
 
LDC93T1 ACL/DCI
 
LDC99L23 American English Spoken Lexicon
 
LDC2012T21 Annotated English Gigaword
 
LDC2005S07 Arabic CTS Levantine Fisher Training Data Set 3, Speech
 
LDC2005T03 Arabic CTS Levantine Fisher Training Data Set 3, Transcripts
 
LDC2004T18 Arabic English Parallel News Part 1
 
LDC2003T12 Arabic Gigaword
 
LDC2011T11 Arabic Gigaword Fifth Edition
 
LDC2009T30 Arabic Gigaword Fourth Edition
 
LDC2007T40 Arabic Gigaword Third Edition
 
LDC2004T17 Arabic News Translation Text Part 1
 
LDC2001T55 Arabic Newswire Part 1
 
LDC2012T07 Arabic Treebank - Broadcast News v1.0
 
LDC2003T07 Arabic Treebank: Part 1 - 10K-word English Translation
 
LDC2003T06 Arabic Treebank: Part 1 v 2.0
 
LDC2005T02 Arabic Treebank: Part 1 v 3.0 (POS with full vocalization + syntactic analysis)
 
LDC2010T13 Arabic Treebank: Part 1 v 4.1
 
LDC2004T02 Arabic Treebank: Part 2 v 2.0
 
LDC2011T09 Arabic Treebank: Part 2 v 3.1
 
LDC2005T20 Arabic Treebank: Part 3 (full corpus) v 2.0 (MPG + Syntactic Analysis)
 
LDC2004T11 Arabic Treebank: Part 3 v 1.0
 
LDC2012T09 Arabic-Dialect/English Parallel Text
 
LDC2005T33 BBN Pronoun Coreference and Entity Type Corpus
 
LDC2005S08 BBN/AUB DARPA Babylon Levantine Arabic Speech and Transcripts
 
LDC2000T43 BLLIP 1987-89 WSJ Corpus Release 1
 
LDC2002L49 Buckwalter Arabic Morphological Analyzer Version 1.0
 
LDC2004L02 Buckwalter Arabic Morphological Analyzer Version 2.0
 
LDC96S46 CALLFRIEND American English-Non-Southern Dialect
 
LDC96S47 CALLFRIEND American English-Southern Dialect
 
LDC96S49 CALLFRIEND Egyptian Arabic
 
LDC96S55 CALLFRIEND Mandarin Chinese-Mainland Dialect
 
LDC96S56 CALLFRIEND Mandarin Chinese-Taiwan Dialect
 
LDC97L20 CALLHOME American English Lexicon (PRONLEX)
 
LDC97S42 CALLHOME American English Speech
 
LDC97T14 CALLHOME American English Transcripts
 
LDC97S45 CALLHOME Egyptian Arabic Speech
 
LDC2002S37 CALLHOME Egyptian Arabic Speech Supplement
 
LDC97T19 CALLHOME Egyptian Arabic Transcripts
 
LDC2002T38 CALLHOME Egyptian Arabic Transcripts Supplement
 
LDC96L15 CALLHOME Mandarin Chinese Lexicon
 
LDC96S34 CALLHOME Mandarin Chinese Speech
 
LDC96T16 CALLHOME Mandarin Chinese Transcripts
 
LDC2005T13 CCGbank
 
LDC96L14 CELEX2
 
LDC2005T10 Chinese English News Magazine Parallel Text
 
LDC2003T09 Chinese Gigaword
 
LDC2011T13 Chinese Gigaword Fifth Edition
 
LDC2009T27 Chinese Gigaword Fourth Edition
 
LDC2005T14 Chinese Gigaword Second Edition
 
LDC2007T38 Chinese Gigaword Third Edition
 
LDC2005T06 Chinese News Translation Text Part 1
 
LDC2005T23 Chinese Proposition Bank 1.0
 
LDC2001T11 Chinese Treebank 2.0
 
LDC2004T05 Chinese Treebank 4.0
 
LDC2005T01 Chinese Treebank 5.0
 
LDC2002L27 Chinese-English Translation Lexicon Version 3.0
 
LDC2005T08 Discourse Graphbank
 
LDC99L22 Egyptian Colloquial Arabic Lexicon
 
LDC2009T01 English CTS Treebank with Structural Metadata
 
LDC2003T05 English Gigaword
 
LDC2011T07 English Gigaword Fifth Edition
 
LDC2009T13 English Gigaword Fourth Edition
 
LDC2005T12 English Gigaword Second Edition
 
LDC2007T07 English Gigaword Third Edition
 
LDC2012T02 English Translation Treebank: An-Nahar Newswire
 
LDC2012T13 English Web Treebank
 
LDC2006T10 English-Arabic Treebank v 1.0
 
LDC95T11 European Language Newspaper Text
 
LDC2005S13 Fisher English Training Part 2, Speech
 
LDC2005T19 Fisher English Training Part 2, Transcripts
 
LDC2004S13 Fisher English Training Speech Part 1 Speech
 
LDC2004T19 Fisher English Training Speech Part 1 Transcripts
 
LDC2007S02 Fisher Levantine Arabic Conversational Telephone Speech
 
LDC2007T04 Fisher Levantine Arabic Conversational Telephone Speech, Transcripts
 
LDC2013T10 GALE Arabic-English Parallel Aligned Treebank -- Newswire
 
LDC2012T16 GALE Chinese-English Word Alignment and Tagging Training Part 1 -- Newswire and Web
 
LDC2012T20 GALE Chinese-English Word Alignment and Tagging Training Part 2 -- Newswire
 
LDC2012T24 GALE Chinese-English Word Alignment and Tagging Training Part 3 -- Web
 
LDC2013T05 GALE Chinese-English Word Alignment and Tagging Training Part 4 -- Web
 
LDC2008T02 GALE Phase 1 Arabic Blog Parallel Text
 
LDC2007T24 GALE Phase 1 Arabic Broadcast News Parallel Text - Part 1
 
LDC2008T09 GALE Phase 1 Arabic Broadcast News Parallel Text - Part 2
 
LDC2009T03 GALE Phase 1 Arabic Newsgroup Parallel Text - Part 1
 
LDC2009T09 GALE Phase 1 Arabic Newsgroup Parallel Text - Part 2
 
LDC2008T06 GALE Phase 1 Chinese Blog Parallel Text
 
LDC2009T02 GALE Phase 1 Chinese Broadcast Conversation Parallel Text - Part 1
 
LDC2009T06 GALE Phase 1 Chinese Broadcast Conversation Parallel Text - Part 2
 
LDC2007T23 GALE Phase 1 Chinese Broadcast News Parallel Text - Part 1
 
LDC2008T08 GALE Phase 1 Chinese Broadcast News Parallel Text - Part 2
 
LDC2008T18 GALE Phase 1 Chinese Broadcast News Parallel Text - Part 3
 
LDC2009T15 GALE Phase 1 Chinese Newsgroup Parallel Text - Part 1
 
LDC2010T03 GALE Phase 1 Chinese Newsgroup Parallel Text - Part 2
 
LDC2007T20 GALE Phase 1 Distillation Training
 
LDC2012T06 GALE Phase 2 Arabic Broadcast Conversation Parallel Text Part 1
 
LDC2012T14 GALE Phase 2 Arabic Broadcast Conversation Parallel Text Part 2
 
LDC2013S02 GALE Phase 2 Arabic Broadcast Conversation Speech Part 1
 
LDC2013T04 GALE Phase 2 Arabic Broadcast Conversation Transcripts Part 1
 
LDC2012T18 GALE Phase 2 Arabic Broadcast News Parallel Text
 
LDC2012T17 GALE Phase 2 Arabic Newswire Parallel Text
 
LDC2013T01 GALE Phase 2 Arabic Web Parallel Text
 
LDC2013S04 GALE Phase 2 Chinese Broadcast Conversation Speech
 
LDC2013T08 GALE Phase 2 Chinese Broadcast Conversation Transcripts
 
LDC2005S15 HKUST Mandarin Telephone Speech, Part 1
 
LDC2005T32 HKUST Mandarin Telephone Transcript Data, Part 1
 
LDC2000T50 Hong Kong Hansards Parallel Text
 
LDC2000T47 Hong Kong Laws Parallel Text
 
LDC2000T46 Hong Kong News Parallel Text
 
LDC2004T08 Hong Kong Parallel Text
 
LDC98S69 HUB5 Mandarin Telephone Speech Corpus
 
LDC98T26 HUB5 Mandarin Transcripts
 
LDC95T8 Japanese Business News Text
 
LDC99T34 Japanese Business News Text Supplement
 
LDC2000T45 Korean Newswire
 
LDC2005S14 Levantine Arabic QT Training Data Set 4 (Speech + Transcripts)
 
LDC95T13 Mandarin Chinese News Text
 
LDC2001T02 Message Understanding Conference (MUC) 7
 
LDC2003T18 Multiple-Translation Arabic (MTA) Part 1
 
LDC2005T05 Multiple-Translation Arabic (MTA) Part 2
 
LDC2003T17 Multiple-Translation Chinese (MTC) Part 2
 
LDC2004T07 Multiple-Translation Chinese (MTC) Part 3
 
LDC2002T01 Multiple-Translation Chinese Corpus
 
LDC2010T21 NIST 2008 Open Machine Translation (OpenMT) Evaluation
 
LDC2010T01 NIST Open MT 2008 Evaluation (MT08) Selected References and System Translations
 
LDC95T21 North American News Text Corpus
 
LDC98T30 North American News Text Supplement
 
LDC2007T21 OntoNotes Release 1.0
 
LDC2008T04 OntoNotes Release 2.0
 
LDC2009T24 OntoNotes Release 3.0
 
LDC2011T03 OntoNotes Release 4.0
 
LDC2004T23 Prague Arabic Dependency Treebank 1.0
 
LDC2004T14 Proposition Bank I
 
LDC2004S08 RT-03 MDE Training Data Speech
 
LDC2004T12 RT-03 MDE Training Data Text and Annotations
 
LDC2005S16 RT-04 MDE Training Data Speech
 
LDC2005T24 RT-04 MDE Training Data Text/Annotations
 
LDC2004S10 Santa Barbara Corpus of Spoken American English Part III
 
LDC2005S25 Santa Barbara Corpus of Spoken American English Part IV
 
LDC2006T12 Spanish Gigaword First Edition
 
LDC2009T21 Spanish Gigaword Second Edition
 
LDC95T9 Spanish News Text
 
LDC99T41 Spanish Newswire Text, Volume 2
 
LDC2001S13 Switchboard Cellular Part 1 Audio
 
LDC2001S15 Switchboard Cellular Part 1 Transcribed Audio
 
LDC2001T14 Switchboard Cellular Part 1 Transcription
 
LDC2004S07 Switchboard Cellular Part 2 Audio
 
LDC97S62 Switchboard-1 Release 2
 
LDC98S75 Switchboard-2 Phase I
 
LDC99S79 Switchboard-2 Phase II
 
LDC2002S06 Switchboard-2 Phase III Audio
 
LDC98S72 Taiwanese Putonghua Speech and Transcripts
 
LDC98T25 TDT Pilot Study Corpus
 
LDC2000S92 TDT2 Careful Transcription Audio
 
LDC2000T44 TDT2 Careful Transcription Text
 
LDC99S84 TDT2 English Audio
 
LDC2001S93 TDT2 Mandarin Audio Corpus
 
LDC2001T57 TDT2 Multilanguage Text Version 4.0
 
LDC2001S94 TDT3 English Audio
 
LDC2001S95 TDT3 Mandarin Audio
 
LDC2001T58 TDT3 Multilanguage Text Version 2.0
 
LDC2005S11 TDT4 Multilingual Broadcast News Speech Corpus
 
LDC2005T16 TDT4 Multilingual Text and Annotations
 
LDC2004T09 TIDES Extraction (ACE) 2003 Multilingual Training Data
 
LDC93T3A TIPSTER Complete
 
LDC2000T52 TREC Mandarin
 
LDC2000T51 TREC Spanish
 
LDC99T42 Treebank-3
 
LDC94T4B-1 UN Parallel Text (English)
 
LDC94T4B-3 UN Parallel Text (Spanish)
  GENOA
 
LDC2004S05 ISL Meeting Speech Part 1
 
LDC2004T10 ISL Meeting Transcripts Part 1
  HARD
  HAVIC
  Hub4
 
LDC98T31 1996 CSR HUB4 Language Model
 
LDC97S66 1996 English Broadcast News Dev and Eval (HUB4)
 
LDC97S44 1996 English Broadcast News Speech (HUB4)
 
LDC97T22 1996 English Broadcast News Transcripts (HUB4)
 
LDC98S71 1997 English Broadcast News Speech (HUB4)
 
LDC98T28 1997 English Broadcast News Transcripts (HUB4)
 
LDC2001S91 1997 HUB4 Broadcast News Evaluation Non-English Test Material
 
LDC2002S11 1997 HUB4 English Evaluation Speech and Transcripts
 
LDC98S73 1997 Mandarin Broadcast News Speech (HUB4-NE)
 
LDC98T24 1997 Mandarin Broadcast News Transcripts (HUB4-NE)
 
LDC98S74 1997 Spanish Broadcast News Speech (HUB4-NE)
 
LDC98T29 1997 Spanish Broadcast News Transcripts (HUB4-NE)
 
LDC2000S86 1998 HUB4 Broadcast News Evaluation English Test Material
 
LDC95T21 North American News Text Corpus
 
LDC98T30 North American News Text Supplement
  Hub5-LVCSR
 
LDC2002S22 1997 HUB5 Arabic Evaluation
 
LDC2002T39 1997 HUB5 Arabic Transcripts
 
LDC2002S24 1997 HUB5 German Evaluation
 
LDC2003T03 1997 HUB5 German Transcripts
 
LDC2002S25 1997 HUB5 Spanish Evaluation
 
LDC2003T04 1997 HUB5 Spanish Transcripts
 
LDC2002S10 1998 HUB5 English Evaluation
 
LDC2003T02 1998 HUB5 English Transcripts
 
LDC2002T43 2000 HUB5 English Evaluation Transcripts
 
LDC2002S13 2001 HUB5 English Evaluation
 
LDC2002S12 2001 HUB5 Mandarin Evaluation
 
LDC2003T01 2001 HUB5 Mandarin Transcripts
 
LDC97L20 CALLHOME American English Lexicon (PRONLEX)
 
LDC97S42 CALLHOME American English Speech
 
LDC97T14 CALLHOME American English Transcripts
 
LDC97S45 CALLHOME Egyptian Arabic Speech
 
LDC2002S37 CALLHOME Egyptian Arabic Speech Supplement
 
LDC97T19 CALLHOME Egyptian Arabic Transcripts
 
LDC2002T38 CALLHOME Egyptian Arabic Transcripts Supplement
 
LDC97L18 CALLHOME German Lexicon
 
LDC97S43 CALLHOME German Speech
 
LDC97T15 CALLHOME German Transcripts
 
LDC96L17 CALLHOME Japanese Lexicon
 
LDC96S37 CALLHOME Japanese Speech
 
LDC96T18 CALLHOME Japanese Transcripts
 
LDC96L15 CALLHOME Mandarin Chinese Lexicon
 
LDC96S34 CALLHOME Mandarin Chinese Speech
 
LDC96T16 CALLHOME Mandarin Chinese Transcripts
 
LDC96L16 CALLHOME Spanish Lexicon
 
LDC96S35 CALLHOME Spanish Speech
 
LDC96T17 CALLHOME Spanish Transcripts
 
LDC99L22 Egyptian Colloquial Arabic Lexicon
 
LDC98S69 HUB5 Mandarin Telephone Speech Corpus
 
LDC98T26 HUB5 Mandarin Transcripts
 
LDC98S70 HUB5 Spanish Telephone Speech Corpus
 
LDC98T27 HUB5 Spanish Transcripts
 
LDC97S62 Switchboard-1 Release 2
 
LDC2001T60 Syllable-Final /s/ Lenition
  JANUS
 
LDC2004S05 ISL Meeting Speech Part 1
 
LDC2004T10 ISL Meeting Transcripts Part 1
  LCTL
  LID
 
LDC96S46 CALLFRIEND American English-Non-Southern Dialect
 
LDC96S47 CALLFRIEND American English-Southern Dialect
 
LDC96S48 CALLFRIEND Canadian French
 
LDC96S49 CALLFRIEND Egyptian Arabic
 
LDC96S50 CALLFRIEND Farsi
 
LDC96S51 CALLFRIEND German
 
LDC96S52 CALLFRIEND Hindi
 
LDC96S53 CALLFRIEND Japanese
 
LDC96S54 CALLFRIEND Korean
 
LDC96S55 CALLFRIEND Mandarin Chinese-Mainland Dialect
 
LDC96S56 CALLFRIEND Mandarin Chinese-Taiwan Dialect
 
LDC96S57 CALLFRIEND Spanish-Caribbean Dialect
 
LDC96S58 CALLFRIEND Spanish-Non-Caribbean Dialect
 
LDC96S59 CALLFRIEND Tamil
 
LDC96S60 CALLFRIEND Vietnamese
  Linguistic Atlas Project
 
LDC2012S03 Digital Archive of Southern Speech
  Machine Reading
  MADCAT
 
LDC2012T15 MADCAT Phase 1 Training Set
 
LDC2013T09 MADCAT Phase 2 Training Set
  MALACH
 
LDC2012S05 USC-SFI MALACH Interviews and Transcripts English
  MDE
  MED
  MED-11
  MIXER 8
  MIXER-7
  MSE
  MT-06
  MT08
 
LDC2010T01 NIST Open MT 2008 Evaluation (MT08) Selected References and System Translations
  MT09
  MT12
  MUC
 
LDC2003T13 Message Understanding Conference (MUC) 6
 
LDC96T10 Message Understanding Conference (MUC) 6 Additional News Text
 
LDC2001T02 Message Understanding Conference (MUC) 7
 
LDC2010T15 Message Understanding Conference 7 Timed (MUC7_T)
 
LDC95T21 North American News Text Corpus
 
LDC93T3A TIPSTER Complete
 
LDC93T3B TIPSTER Volume 1
 
LDC93T3C TIPSTER Volume 2
 
LDC93T3D TIPSTER Volume 3
  NIST Automatic Meeting Recognition
 
LDC2004S09 NIST Meeting Pilot Corpus Speech
 
LDC2004T13 NIST Meeting Pilot Corpus Transcripts and Metadata
  NIST LRE
 
LDC2008S05 2005 NIST Language Recognition Evaluation
 
LDC2009S05 2007 NIST Language Recognition Evaluation Supplemental Training Set
 
LDC2009S04 2007 NIST Language Recognition Evaluation Test Set
  NIST MT
 
LDC2009T05 2008 NIST Metrics for Machine Translation (MetricsMATR08) Development Data
 
LDC2010T10 NIST 2002 Open Machine Translation (OpenMT) Evaluation
 
LDC2010T11 NIST 2003 Open Machine Translation (OpenMT) Evaluation
 
LDC2010T12 NIST 2004 Open Machine Translation (OpenMT) Evaluation
 
LDC2010T14 NIST 2005 Open Machine Translation (OpenMT) Evaluation
 
LDC2010T17 NIST 2006 Open Machine Translation (OpenMT) Evaluation
 
LDC2010T21 NIST 2008 Open Machine Translation (OpenMT) Evaluation
 
LDC2013T07 NIST 2008-2012 Open Machine Translation (OpenMT) Progress Test Sets
 
LDC2010T23 NIST 2009 Open Machine Translation (OpenMT) Evaluation
 
LDC2013T03 NIST 2012 Open Machine Translation (OpenMT) Evaluation
  NIST SRE
 
LDC2001S97 2000 NIST Speaker Recognition Evaluation
 
LDC2010S03 2003 NIST Speaker Recognition Evaluation
 
LDC2006S44 2004 NIST Speaker Recognition Evaluation
 
LDC2011S04 2005 NIST Speaker Recognition Evaluation Test Data
 
LDC2011S01 2005 NIST Speaker Recognition Evaluation Training Data
 
LDC2011S10 2006 NIST Speaker Recognition Evaluation Test Set Part 1
 
LDC2012S01 2006 NIST Speaker Recognition Evaluation Test Set Part 2
 
LDC2011S09 2006 NIST Speaker Recognition Evaluation Training Set
 
LDC2011S11 2008 NIST Speaker Recognition Evaluation Supplemental Set
 
LDC2011S08 2008 NIST Speaker Recognition Evaluation Test Set
 
LDC2011S05 2008 NIST Speaker Recognition Evaluation Training Set Part 1
 
LDC2011S07 2008 NIST Speaker Recognition Evaluation Training Set Part 2
  NTCIR
  OntoNotes
 
LDC2011T03 OntoNotes Release 4.0
  OpenHaRT
  PHANOTICS
  RATS
  Reflex-LCTL
  REFLEX-MTE
 
LDC2009T11 REFLEX Entity Translation Training/DevTest
  RM
 
LDC96S39 RM Isolated and Spelled Word Data
  ROAR
 
LDC2004S05 ISL Meeting Speech Part 1
 
LDC2004T10 ISL Meeting Transcripts Part 1
  RT
 
LDC2007S12 2004 Spring NIST Rich Transcription (RT-04S) Evaluation Data
 
LDC2007S11 2004 Spring NIST Rich Transcription (RT-04S) Development Data
 
LDC2011S06 2005 Spring NIST Rich Transcription (RT-05S) Evaluation Set
  SCIL
  SemEval
 
LDC2011T01 SemEval-2010 Task 1 OntoNotes English: Coreference Resolution in Multiple Languages
  SID
 
LDC96S61 1996 Speaker Recognition Benchmark
 
LDC99S80 1997 Speaker Recognition Benchmark
 
LDC98S76 1998 Speaker Recognition Benchmark
 
LDC99S81 1999 Speaker Recognition Benchmark
 
LDC2002S34 2001 NIST Speaker Recognition Evaluation Corpus
 
LDC2004S04 2002 NIST Speaker Recognition Evaluation
 
LDC2001S13 Switchboard Cellular Part 1 Audio
 
LDC2001S15 Switchboard Cellular Part 1 Transcribed Audio
 
LDC2001T14 Switchboard Cellular Part 1 Transcription
 
LDC2004S07 Switchboard Cellular Part 2 Audio
 
LDC98S75 Switchboard-2 Phase I
 
LDC99S79 Switchboard-2 Phase II
 
LDC2002S06 Switchboard-2 Phase III Audio
  SIGHAN
  SPINE
 
LDC2000S96 Speech in Noisy Environments (SPINE) Evaluation Audio
 
LDC2000T54 Speech in Noisy Environments (SPINE) Evaluation Transcripts
 
LDC2000S87 Speech in Noisy Environments (SPINE) Training Audio
 
LDC2000T49 Speech in Noisy Environments (SPINE) Training Transcripts
 
LDC2001S04 Speech in Noisy Environments (SPINE2) Part 1 Audio
 
LDC2001T05 Speech in Noisy Environments (SPINE2) Part 1 Transcripts
 
LDC2001S06 Speech in Noisy Environments (SPINE2) Part 2 Audio
 
LDC2001T07 Speech in Noisy Environments (SPINE2) Part 2 Transcripts
 
LDC2001S08 Speech in Noisy Environments (SPINE2) Part 3 Audio
 
LDC2001T09 Speech in Noisy Environments (SPINE2) Part 3 Transcripts
 
LDC2001S99 Speech in Noisy Environments 1 (SPINE1 CODED) Coded Audio
  SRE-12
  STD
  TAC
  Talkbank
 
LDC2005T35 American National Corpus (ANC) Second Release
 
LDC2004V01 FORM1 Kinematic Gesture
 
LDC2003V01 FORM2 Kinematic Gesture
 
LDC2003L01 Grassfields Bantu Fieldwork: Dschang Lexicon
 
LDC2003S02 Grassfields Bantu Fieldwork: Dschang Tone Paradigms
 
LDC2001S16 Grassfields Bantu Fieldwork: Ngomba Tone Paradigms
 
LDC2004L01 Klex: Finite-State Lexical Transducer for Korean
 
LDC2004T03 Morphologically Annotated Korean Text
 
LDC2003S06 Santa Barbara Corpus of Spoken American English Part II
 
LDC2004S10 Santa Barbara Corpus of Spoken American English Part III
 
LDC2005S25 Santa Barbara Corpus of Spoken American English Part IV
 
LDC2003T15 SLX Corpus of Classic Sociolinguistic Interviews
 
LDC2004S12 TalkBank Ethology Data: Field Recordings of Vervet Monkey Calls
  TDT
 
LDC2010T18 ACE Time Normalization (TERN) 2004 English Evaluation Data V1.0
 
LDC98T25 TDT Pilot Study Corpus
 
LDC2000S92 TDT2 Careful Transcription Audio
 
LDC2000T44 TDT2 Careful Transcription Text
 
LDC99S84 TDT2 English Audio
 
LDC2001S93 TDT2 Mandarin Audio Corpus
 
LDC2001T57 TDT2 Multilanguage Text Version 4.0
 
LDC2001S94 TDT3 English Audio
 
LDC2001S95 TDT3 Mandarin Audio
 
LDC2001T58 TDT3 Multilanguage Text Version 2.0
 
LDC2005S11 TDT4 Multilingual Broadcast News Speech Corpus
 
LDC2005T16 TDT4 Multilingual Text and Annotations
 
LDC2007V02 TRECVID 2003 Keyframes & Transcripts
 
LDC2007V01 TRECVID 2005 Keyframes & Transcripts
  TERN
 
LDC2010T18 ACE Time Normalization (TERN) 2004 English Evaluation Data V1.0
  TIDES
 
LDC2005T09 ACE 2004 Multilingual Training Corpus
 
LDC2010T18 ACE Time Normalization (TERN) 2004 English Evaluation Data V1.0
 
LDC2005T07 ACE Time Normalization (TERN) 2004 English Training Data v 1.0
 
LDC2003T11 ACE-2 Version 1.0
 
LDC93T1 ACL/DCI
 
LDC2004T18 Arabic English Parallel News Part 1
 
LDC2003T12 Arabic Gigaword
 
LDC2004T17 Arabic News Translation Text Part 1
 
LDC2001T55 Arabic Newswire Part 1
 
LDC2003T07 Arabic Treebank: Part 1 - 10K-word English Translation
 
LDC2003T06 Arabic Treebank: Part 1 v 2.0
 
LDC2005T02 Arabic Treebank: Part 1 v 3.0 (POS with full vocalization + syntactic analysis)
 
LDC2004T02 Arabic Treebank: Part 2 v 2.0
 
LDC2005T20 Arabic Treebank: Part 3 (full corpus) v 2.0 (MPG + Syntactic Analysis)
 
LDC2004T11 Arabic Treebank: Part 3 v 1.0
 
LDC2005T33 BBN Pronoun Coreference and Entity Type Corpus
 
LDC2000T43 BLLIP 1987-89 WSJ Corpus Release 1
 
LDC2002L49 Buckwalter Arabic Morphological Analyzer Version 1.0
 
LDC2004L02 Buckwalter Arabic Morphological Analyzer Version 2.0
 
LDC2005T13 CCGbank
 
LDC96L14 CELEX2
 
LDC2005T10 Chinese English News Magazine Parallel Text
 
LDC2003T09 Chinese Gigaword
 
LDC2005T14 Chinese Gigaword Second Edition
 
LDC2005T06 Chinese News Translation Text Part 1
 
LDC2005T23 Chinese Proposition Bank 1.0
 
LDC2001T11 Chinese Treebank 2.0
 
LDC2004T05 Chinese Treebank 4.0
 
LDC2005T01 Chinese Treebank 5.0
 
LDC2002L27 Chinese-English Translation Lexicon Version 3.0
 
LDC2007T02 English Chinese Translation Treebank v 1.0
 
LDC2003T05 English Gigaword
 
LDC2005T12 English Gigaword Second Edition
 
LDC95T11 European Language Newspaper Text
 
LDC2000T50 Hong Kong Hansards Parallel Text
 
LDC2000T47 Hong Kong Laws Parallel Text
 
LDC2000T46 Hong Kong News Parallel Text
 
LDC2004T08 Hong Kong Parallel Text
 
LDC95T8 Japanese Business News Text
 
LDC99T34 Japanese Business News Text Supplement
 
LDC2000T45 Korean Newswire
 
LDC95T13 Mandarin Chinese News Text
 
LDC2001T02 Message Understanding Conference (MUC) 7
 
LDC2003T18 Multiple-Translation Arabic (MTA) Part 1
 
LDC2005T05 Multiple-Translation Arabic (MTA) Part 2
 
LDC2003T17 Multiple-Translation Chinese (MTC) Part 2
 
LDC2004T07 Multiple-Translation Chinese (MTC) Part 3
 
LDC2006T04 Multiple-Translation Chinese (MTC) Part 4
 
LDC2002T01 Multiple-Translation Chinese Corpus
 
LDC95T21 North American News Text Corpus
 
LDC98T30 North American News Text Supplement
 
LDC2004T23 Prague Arabic Dependency Treebank 1.0
 
LDC2004T14 Proposition Bank I
 
LDC2006T12 Spanish Gigaword First Edition
 
LDC2009T21 Spanish Gigaword Second Edition
 
LDC95T9 Spanish News Text
 
LDC99T41 Spanish Newswire Text, Volume 2
 
LDC98T25 TDT Pilot Study Corpus
 
LDC2000S92 TDT2 Careful Transcription Audio
 
LDC2000T44 TDT2 Careful Transcription Text
 
LDC99S84 TDT2 English Audio
 
LDC2001S93 TDT2 Mandarin Audio Corpus
 
LDC2001T57 TDT2 Multilanguage Text Version 4.0
 
LDC2001S94 TDT3 English Audio
 
LDC2001S95 TDT3 Mandarin Audio
 
LDC2001T58 TDT3 Multilanguage Text Version 2.0
 
LDC2005S11 TDT4 Multilingual Broadcast News Speech Corpus
 
LDC2005T16 TDT4 Multilingual Text and Annotations
 
LDC2004T09 TIDES Extraction (ACE) 2003 Multilingual Training Data
 
LDC93T3A TIPSTER Complete
 
LDC2000T52 TREC Mandarin
 
LDC2000T51 TREC Spanish
 
LDC99T42 Treebank-3
 
LDC94T4B-1 UN Parallel Text (English)
 
LDC94T4B-3 UN Parallel Text (Spanish)
  Tipster
 
LDC95T13 Mandarin Chinese News Text
 
LDC95T9 Spanish News Text
 
LDC93T3A TIPSTER Complete
 
LDC93T3B TIPSTER Volume 1
 
LDC93T3C TIPSTER Volume 2
 
LDC93T3D TIPSTER Volume 3
  Transtac
  TREC
 
LDC2001T55 Arabic Newswire Part 1
 
LDC95T13 Mandarin Chinese News Text
 
LDC95T9 Spanish News Text
 
LDC93T3A TIPSTER Complete
 
LDC93T3B TIPSTER Volume 1
 
LDC93T3C TIPSTER Volume 2
 
LDC93T3D TIPSTER Volume 3
 
LDC2000T52 TREC Mandarin
 
LDC2000T51 TREC Spanish
 
LDC2007V02 TRECVID 2003 Keyframes & Transcripts
 
LDC2010V01 TRECVID 2004 Keyframes & Transcripts
 
LDC2007V01 TRECVID 2005 Keyframes & Transcripts
 
LDC2010V02 TRECVID 2006 Keyframes
  TRECVid
  VACE
 
LDC2012V01 2005 NIST/USF Evaluation Resources for the VACE Program - Broadcast News
 
LDC2011V05 2006 NIST/USF Evaluation Resources for the VACE Program - Meeting Data Test Set Part 1
 
LDC2011V06 2006 NIST/USF Evaluation Resources for the VACE Program - Meeting Data Test Set Part 2
 
LDC2011V03 NIST/USF Evaluation Resources for the VACE Program - Meeting Data Test Set Part 1
 
LDC2011V04 NIST/USF Evaluation Resources for the VACE Program - Meeting Data Test Set Part 2
 
LDC2011V01 NIST/USF Evaluation Resources for the VACE Program - Meeting Data Training Set Part 1
 
LDC2011V02 NIST/USF Evaluation Resources for the VACE Program - Meeting Data Training Set Part 2
  ViperToxin


Contact: ldc@ldc.upenn.edu

(c) 1992-2010 Linguistic Data Consortium, University of Pennsylvania. All Rights Reserved.