Morphologically Annotated Korean Text was produced by the Linguistic Data
Consortium (LDC), catalog number LDC2004T03, ISBN 1-58563-284-8.
This is a collection of Korean text with annotated morphological analysis and part-of-speech tags.
The source text was extracted from the Korean Newswire corpus.
The newswire corpus is a collection of Korean Press Agency news articles from June 2, 1994 to
March 20, 2000. The portion included in this release consists of a
small number of hand-picked articles.
The corpus is part of the Korean Treebank Phase 2.
Between 2001 and 2002, the project was conducted under subcontract
from Cogentex Inc., sponsor number Cogentex 5-33436. The text was
tokenized and then automatically analyzed using
Klex. Since there can be multiple possible morphological analyses, the
output was fed through a statistical ranking system in order to select the
best possible analysis for the word in the text environment.
The part-of-speech tagged result was then manually corrected by
Seung-yun Yang and Na-Rae Han, graduate students in the University
of Pennsylvania Linguistics Department.
The data consists of a single file, totalling approximately 880KB in
size. The text contains 1,574 sentences with 41,024
words and 77,173 morphemes in total. The text file is in ksc-5601
encoding. Characters in Hangul (Korean alphabet) can be
displayed with Korean X-terminals such as hanterm, or by
selecting Korean encoding in common web browsers such as Netscape
or Internet Explorer.
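As a minimal sketch of reading the data outside a Korean-aware terminal or browser, the bytes can be decoded in Python. This assumes the file uses the EUC-KR byte serialization of KS C 5601, which Python exposes as the euc_kr codec. The function name and the choice of errors="replace" are illustrative and not part of the release.

```python
def decode_ksc5601(raw: bytes) -> str:
    """Decode KS C 5601 (EUC-KR) bytes into a Unicode string."""
    # errors="replace" substitutes U+FFFD for undecodable bytes
    # instead of raising UnicodeDecodeError (an illustrative choice).
    return raw.decode("euc_kr", errors="replace")


# Round trip on a short Hangul string.
sample = "한국어".encode("euc_kr")
print(decode_ksc5601(sample))
```

To read the file itself, the same codec name can be passed to the built-in open(), for example open(path, encoding="euc_kr"), where path is whatever location the file was copied to.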
The data is formatted as follows: one head word per line, with the word
and its morphological analysis separated by a tab.
Each morpheme is followed by "/" and its part-of-speech;
morphemes are separated by "+". ^EOS is a special symbol denoting the
end of a sentence.
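The format described above can be parsed with a few lines of Python. The following is a sketch under stated assumptions, not a reference reader: it assumes ^EOS appears at the start of its own line, that "+" and "/" inside an analysis are always separators (a literal "+" or "/" token would need special handling), and the sample lines and their tags are invented for illustration rather than copied from the corpus. All function and type names are hypothetical.

```python
from typing import Iterable, Iterator, List, Tuple

Morpheme = Tuple[str, str]          # (morpheme form, part-of-speech tag)
Token = Tuple[str, List[Morpheme]]  # (head word, its morphemes)

EOS = "^EOS"


def parse_analysis(analysis: str) -> List[Morpheme]:
    """Split 'm1/T1+m2/T2' into [(m1, T1), (m2, T2)]."""
    morphemes = []
    for chunk in analysis.split("+"):
        # Split on the last "/" so the tag is whatever follows it.
        form, sep, tag = chunk.rpartition("/")
        if not sep:
            # No "/" found: keep the chunk as the form with an empty tag.
            form, tag = chunk, ""
        morphemes.append((form, tag))
    return morphemes


def read_sentences(lines: Iterable[str]) -> Iterator[List[Token]]:
    """Group tab-separated lines into sentences, ending each at ^EOS."""
    sentence: List[Token] = []
    for raw in lines:
        line = raw.rstrip("\r\n")
        if not line:
            continue
        if line.startswith(EOS):
            if sentence:
                yield sentence
            sentence = []
            continue
        word, _, analysis = line.partition("\t")
        sentence.append((word, parse_analysis(analysis)))
    if sentence:
        # Emit a trailing sentence that lacks a final ^EOS marker.
        yield sentence


# Invented sample lines in the described layout (not taken from the corpus).
sample_lines = [
    "학교에\t학교/NNC+에/PAD\n",
    "갔다\t가/VV+았/EPF+다/EFN\n",
    ".\t./SFN\n",
    "^EOS\n",
]
for sent in read_sentences(sample_lines):
    for word, morphemes in sent:
        print(word, morphemes)
```

Counting the sentences, head words, and morphemes yielded by such a reader is one way to check a local copy against the totals stated in this document.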
Morphologically analyzed and part-of-speech tagged data can be useful
for training statistical morphological analyzers and part-of-speech
taggers, and for evaluating existing morphological analyzers and
part-of-speech taggers.
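For the evaluation use case, one simple measure is the fraction of head words whose predicted analysis string matches the hand-corrected one exactly. The sketch below illustrates that idea only. The metric, the function name, and the toy analysis strings are assumptions for illustration and are not prescribed by this release.

```python
from typing import Sequence


def word_accuracy(gold: Sequence[str], predicted: Sequence[str]) -> float:
    """Fraction of words whose full analysis matches the gold analysis."""
    if len(gold) != len(predicted):
        raise ValueError("gold and predicted must have the same length")
    if not gold:
        raise ValueError("cannot compute accuracy on empty input")
    # A word counts as correct only if every morpheme and tag matches.
    correct = sum(1 for g, p in zip(gold, predicted) if g == p)
    return correct / len(gold)


# Toy analyses in the 'morpheme/TAG+morpheme/TAG' layout (invented).
gold = ["a/X+b/Y", "c/Z"]
predicted = ["a/X+b/Y", "c/W"]
print(word_accuracy(gold, predicted))
```

Exact match per word is strict. A per-morpheme score would give partial credit, at the cost of needing an alignment when the two analyses segment a word differently.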
The morphologically tagged output is compatible with
Klex: Finite-State Lexical Transducer for Korean.
It also conforms to the
Korean Treebank POS annotation standards.
There are no updates available at this time.
The Morphologically Annotated Korean Text corpus was funded in part through a 5-year grant (BCS-998009, KDI, SBE) from the National Science Foundation via TalkBank, an interdisciplinary project to foster research and development in communicative behavior by providing tools and standards for analysis and distribution of language data. Additional funding was provided by the Linguistic Data Consortium.
The cost of the first 50 copies of this publication (not counting the copies distributed to
LDC members) is covered by NSF Grant Number BCS-998009, so these copies are free of charge.
After these first 50 copies are distributed, additional copies will be available at a cost of $300.
Portions © 1994-2000 Korean Press Agency, © 2004 Trustees of the University of Pennsylvania