Introduction
Klex: Finite-State Lexical Transducer for Korean was produced by the Linguistic Data
Consortium (LDC), catalog number LDC2004L01, ISBN 1-58563-283-x.
Klex is a finite-state lexical transducer for the Korean language,
with the lexical string on the upper side and the inflected surface
string on the lower side. Klex was built on the XFST (Xerox Finite
State Tool) software platform, which is developed and distributed by the Xerox
Corporation. The most common application for such lexical transducers is
morphological analysis and generation.
Data
The distribution is approximately 7.8 MB in size.
Characters in Hangul (the Korean alphabet) can be displayed by selecting a Korean encoding in your browser.
A lexicon in the form of a transducer has the following basic structure:
    fly/VV+s/ECS        돕/VV+었/EPF+다/EFN
         |                      |
       flies                  도왔다
A sequence of morphemes, each with its part-of-speech tag,
constitutes the upper string; a fully lexicalized form constitutes the
lower string. A transducer network as a whole consists of all such
possible morpheme sequence / word pairs in the language. Given the
lower lexicalized form, the transducer can produce the analyzed
morpheme sequence (the process of "looking-up"); conversely, the
transducer can be used to produce the fully inflected surface form
of a grammatical sequence of morphemes (the opposite of "looking-up," hence
Xerox's terminology of "looking-down"). These two operations are the
most typical applications of such lexical transducers, namely
morphological analysis and generation.
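The two directions described above can be illustrated with a minimal Python sketch. It is not XFST and not Klex itself: it assumes a toy table holding only the two upper/lower pairs shown in the diagram, whereas a real lexical transducer encodes the relation as a finite-state network. The names PAIRS, build_tables, look_up and look_down are hypothetical and chosen only for this illustration.

```python
from collections import defaultdict

# Toy (upper, lower) pairs copied from the diagram in this section.
# Assumption: a plain lookup table stands in for the transducer network.
PAIRS = [
    ("fly/VV+s/ECS", "flies"),
    ("돕/VV+었/EPF+다/EFN", "도왔다"),
]


def build_tables(pairs):
    """Index the pairs in both directions; one key may map to several strings."""
    up = defaultdict(list)    # lower (surface) string -> upper (lexical) strings
    down = defaultdict(list)  # upper (lexical) string -> lower (surface) strings
    for upper, lower in pairs:
        up[lower].append(upper)
        down[upper].append(lower)
    return up, down


UP, DOWN = build_tables(PAIRS)


def look_up(surface):
    """Morphological analysis: surface form -> analyzed morpheme sequences."""
    return list(UP.get(surface, []))


def look_down(lexical):
    """Morphological generation: morpheme sequence -> inflected surface forms."""
    return list(DOWN.get(lexical, []))


if __name__ == "__main__":
    print(look_up("도왔다"))
    print(look_down("fly/VV+s/ECS"))
```

Both functions return lists because, in general, a surface form may have more than one analysis and a morpheme sequence may have more than one realization; a string that is not in the table yields an empty list.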
The output of Klex, when used as a morphological analyzer, is compatible with the
Morphologically Annotated Korean Text corpus.
It also conforms to the Korean Treebank POS annotation standards, with slight variations.
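For downstream processing, an analysis string in the morpheme/TAG+morpheme/TAG notation shown earlier in this section can be split into (morpheme, tag) pairs. The following Python sketch assumes that "+" separates morphemes and that the last "/" in each chunk separates the morpheme from its tag; it does not handle morphemes that themselves contain a literal "+", and the function name split_analysis is hypothetical.

```python
def split_analysis(analysis):
    """Split 'morph/TAG+morph/TAG...' into a list of (morpheme, tag) pairs.

    Assumption: '+' is used only as the morpheme boundary and the last '/'
    in each chunk precedes the part-of-speech tag.
    """
    pairs = []
    for chunk in analysis.split("+"):
        morpheme, sep, tag = chunk.rpartition("/")
        if not sep:
            # No tag present: keep the chunk as the morpheme with an empty tag.
            morpheme, tag = chunk, ""
        pairs.append((morpheme, tag))
    return pairs


if __name__ == "__main__":
    for morpheme, tag in split_analysis("돕/VV+었/EPF+다/EFN"):
        print(morpheme, tag)
```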
The Korean morphological grammar employed by Klex was constructed by
Na-Rae Han, under the guidance of Ken Beesley, Lauri Karttunen and
Martha Palmer. The lexicon was fine-tuned by testing against
various corpora, by fixing undesirable outputs and adding
missing lexical entries. Klex was partially supported by the Korean
Treebank Project, the results of which were published in 2002 as the Korean English Treebank Annotations.
Updates
There are no updates available at this time.
Sponsorship
The Klex corpus was funded in part through a five-year grant (BCS-998009, KDI, SBE) from the National Science Foundation via TalkBank, an interdisciplinary project to foster research and development in communicative behavior by providing tools and standards for analysis and distribution of language data. Additional funding was provided by the Linguistic Data Consortium.
Note
The cost of the first 50 copies of this publication (not counting the copies distributed to
LDC members) is covered by NSF Grant Number BCS-998009, so these copies are free of charge.
After these first 50 copies are distributed, additional copies will be available at a cost of $2000.
Content Copyright
Portions © 2004 Trustees of the University of Pennsylvania