
Linguistic Resources  
Use of LDC Corpora in University Summer Schools



EMLS Summer School -- July 21, 2006


European Masters in Language and Speech (EMLS) is a network of European universities providing education in natural language processing and speech communication sciences. EMLS organizes regular summer schools, which attract considerable interest from students in both the NLP and speech processing domains.

For this year’s summer school in Utrecht (NL), members of the Speech@FIT group (Faculty of Information Technology, Brno University of Technology, Czech Republic) prepared two tutorials making use of LDC corpora:

- In “Speech recognition based on Hidden Markov Models”, given by Jan “Honza” Cernocky, students built a connected-digit recognizer using the HTK tools. The recognizer is comparable to the Aurora-ETSI standard and achieves more than 99% word accuracy on clean data. The TIDIGITS database from LDC was used in this tutorial.

- In “Phoneme posterior estimation and acoustic keyword-spotting”, given by Igor Szoke, students got acquainted with the theory and practice of phoneme recognition and posterior estimation by neural networks, and with their use in acoustic keyword spotting. The tutorial was based on one of the LDC classics: TIMIT.

LDC supported these two tutorials with data: while the use of TIDIGITS was limited to the EMLS summer school, EMLS students were offered in-kind copies of TIMIT, including the documentation, for home use.

More Information:
EMLS homepage
Utrecht summer school page
Speech@FIT home


LSA Summer Institute and LDC Corpora -- August 17, 2007

The LDC was pleased to provide access to several LDC corpora for students at the Linguistic Society of America (LSA) 2007 Summer Institute. This year's institute, entitled 'Empirical Foundations for Theories of Language', was hosted by Stanford University. The institute drew researchers and students from across the globe and included a number of courses that gave students hands-on experience in working with linguistic data. The following examples, which demonstrate how large-scale databases can be used for teaching purposes, were submitted by course instructors at the LSA 2007 Summer Institute:

In "Information Structure and Word Order Variation", taught by Betty J. Birner and Gregory Ward, students used LDC corpora to collect tokens of various constructions displaying non-canonical word order, with the goal of discovering how various categories of information are distributed in these non-canonical constructions. Among the corpora used were the Treebank and Brown Corpus (including the Wall Street Journal), and Switchboard.

In “Pronunciation Variation and Psycholinguistics”, taught by Susanne Gahl, students examined pronunciation variants and fluctuations in speaking rate in the Switchboard corpus, with the aim of understanding the mechanisms underlying human language production and comprehension.

For "Paraphrase and Usage", taught by Annie Zaenen, Cathy O'Connor, and Tom Wasow, students were required to initiate a small corpus study in order to receive credit. The focus of the class was grammatical alternations and the factors that determine their relative frequencies. The purpose of the project requirement was to give students hands-on experience in exploring usage data. Students used a variety of corpora for their projects, including the Treebank and TIPSTER.

The LDC looks forward to collaborating with LSA for future institutes.




Contact ldc@ldc.upenn.edu
Last modified: Monday, 24-Mar-2008 12:30:44 EST
© 1992-2007 Linguistic Data Consortium, University of Pennsylvania. All Rights Reserved.