New Corpora Archive
Chronological list of our corpus releases from recent years. Please visit The LDC Corpus Catalog for a complete list of publications that LDC distributes.
1993-2007 United Nations Parallel Text ~ ~673K raw text documents and 520K word alignment documents in the official languages of the UN
GALE Chinese-English Word Alignment and Tagging Training Part 4 -- Web ~ 158K tokens of word aligned Chinese and English parallel text enriched with linguistic tags
GALE Phase 2 Arabic Broadcast Conversation Speech Part 1 ~ 123 hours of Arabic broadcast conversation speech
GALE Phase 2 Arabic Broadcast Conversation Transcripts Part 1 ~ 752K transcribed Arabic broadcast conversation data tokens
NIST 2012 Open Machine Translation (OpenMT) Evaluation ~ 222 Chinese newswire and web data documents with corresponding source and reference files
Chinese-English Biology and Chemistry Abstract Parallel Text ~ ~2000 parallel sentences from scientific article abstracts published in Mandarin and translated into English
GALE Phase 2 Arabic Web Parallel Text ~ 42K words of Arabic source text and its English translation
GALE Chinese-English Word Alignment and Tagging Training Part 3 -- Web ~ 154K tokens of word aligned Chinese and English parallel text enriched with linguistic tags
Russian-English Computer Security Parallel Text ~ 6000 parallel sentences from a set of computer security reports published in Russian and translated into English
Annotated English Gigaword ~ adds automatically-generated syntactic and discourse structure annotation to English Gigaword Fifth Edition (LDC2011T07)
Chinese-English Semiconductor Parallel Text ~ ~2000 parallel sentences from abstracts of scientific articles on semiconductors published in Mandarin and translated into English
GALE Phase 2 Arabic Newswire Parallel Text ~ ~400 source-translation pairs, comprising 181K tokens of Arabic source text and its English translation
GALE Chinese-English Word Alignment and Tagging Training Part 2 -- Newswire ~ 169K tokens of word aligned Chinese and English parallel text enriched with linguistic tags
GALE Phase 2 Arabic Broadcast News Parallel Text ~ 29K of Arabic source text and its English translation
GALE Chinese-English Word Alignment and Tagging Training Part 1 -- Newswire and Web ~ 150K tokens of word aligned Chinese and English parallel text enriched with linguistic tags
MADCAT Phase 1 Training Set ~ handwritten Arabic documents annotated for physical coordinates and token.
English Web Treebank ~ 250K words of English web text manually annotated for syntactic structure; first 50 copies available at no-cost.
GALE Phase 2 Arabic Broadcast Conversation Parallel Text Part 2 ~ 169K words of Arabic source text and its English translation.
Spanish TimeBank 1.0 ~ stand-off annotations for 210 documents with over 75K tokens; available at no-cost.
American English Nickname Collection ~ 331K distinct mappings between nicknames and given names; available at no-cost.
Arabic Treebank - Broadcast News v1.0 ~ 120 transcribed Arabic broadcast news stories with part-of-speech, morphology, gloss and syntactic tree annotation.
Catalan TimeBank 1.0 ~ stand-off annotations for 210 documents with over 75K tokens; available at no-cost.
Arabic-Dialect/English Parallel Text ~ 3.5 million tokens of Arabic dialect sentences and their English translations.
Prague Czech-English Dependency Treebank 2.0 ~ Czech-English parallel resources annotated for dependency structure, semantic labeling, argument structure, ellipsis and anaphora resolution.
Chinese Dependency Treebank 1.0 ~ 49K Chinese sentences annotated with syntactic dependency structures.
GALE Phase 2 Arabic Broadcast Conversation Parallel Text Part 1 ~ 169K words of Arabic broadcast conversation source text and corresponding English translations.
Turkish Broadcast News Speech and Transcripts ~ 130 hours of VOA Turkish radio broadcasts and corresponding transcripts.
2005 NIST/USF Evaluation Resources for the VACE Program - Broadcast News ~ 60 hours of English broadcast news video data annotated for 2005 VACE tasks.
2009 CoNLL Shared Task Part 1 ~ Catalan, Czech, German and Spanish data used for 2009 CoNLL.
2009 CoNLL Shared Task Part 2 ~ Chinese and English data used for 2009 CoNLL.
USC-SFI MALACH Interviews and Transcripts English ~ 375 hours of interviews from 784 interviewees along with transcripts.
English Translation Treebank: An-Nahar Newswire ~ 599 newswire stories translated from Arabic to English and annotated for POS and syntactic structure.
Digital Archive of Southern Speech ~ 370 hours of English speech data from 30 female speakers and 34 male speakers
ModeS TimeBank 1.0 ~ Modern Spanish text annotated with TimeML and SpatialML mark-up.
2006 NIST Speaker Recognition Evaluation Test Set Part 2 ~ 568 hours of multilingual conversational telephone and microphone speech
TORGO Database of Dysarthric Articulation ~ 23 hours of transcribed English speech data from dysarthric and non-dysarthric speakers