Catalan TimeBank 1.0 was developed by researchers at Barcelona
Media and consists of Catalan texts in the AnCora
corpus annotated with temporal and event information according to the TimeML
specification language.
TimeML (Pustejovsky et al., 2005) is a schema for annotating eventualities
and time expressions in natural language as well as the temporal relations among
them, thus facilitating the task of extraction, representation and exchange
of temporal information. Catalan TimeBank 1.0 is annotated at three levels,
marking events, time expressions and event metadata. The TimeML annotation scheme
was tailored to the specifics of the Catalan language: temporal relations in
Catalan involve distinctions of verbal mood (e.g., indicative, subjunctive,
conditional) and grammatical aspect (e.g., imperfective) that are absent
in English. Catalan TimeBank 1.0 joins the family of TimeBank annotated corpora
which includes languages such as English, Spanish, Italian, French, Korean and
Chinese. Through their common layer of annotation, these corpora provide resources
useful for multilingual temporal extraction and processing, such as multilingual
text entailment, opinion mining or question answering.
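As a rough illustration of the kind of information TimeML encodes (events, time expressions and the temporal relations among them), the sketch below parses a small TimeML-style fragment with Python's standard library. The fragment is an invented English example, its attribute names are simplified assumptions, and the function `summarize` is hypothetical; none of this reflects the actual file layout of Catalan TimeBank 1.0, which uses stand-off annotation.

```python
import xml.etree.ElementTree as ET

# Hypothetical TimeML-style fragment, invented for illustration only.
# It is NOT taken from Catalan TimeBank, and the attribute names are
# simplified assumptions rather than the corpus's real schema.
SAMPLE = """<TimeML>
<TIMEX3 tid="t1" type="DATE" value="2000-03-15">Yesterday</TIMEX3> the council
<EVENT eid="e1" class="OCCURRENCE">approved</EVENT> the budget.
<TLINK lid="l1" eventID="e1" relatedToTime="t1" relType="IS_INCLUDED"/>
</TimeML>"""


def summarize(xml_text):
    """Collect events, time expressions and temporal links from the fragment."""
    root = ET.fromstring(xml_text)
    # Map each event id to the text span it marks.
    events = {e.get("eid"): e.text for e in root.iter("EVENT")}
    # Map each time expression id to its normalized value.
    times = {t.get("tid"): t.get("value") for t in root.iter("TIMEX3")}
    # Each link relates an event to a time expression via a relation type.
    links = [
        (l.get("eventID"), l.get("relType"), l.get("relatedToTime"))
        for l in root.iter("TLINK")
    ]
    return events, times, links


events, times, links = summarize(SAMPLE)
for event_id, relation, time_id in links:
    print(events[event_id], relation, times[time_id])
```

Keeping events, time expressions and links in separate id-keyed structures mirrors the layered design described above, where each annotation level can be read independently and joined through identifiers.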
LDC has released the following corpora incorporating TimeBank annotation:
TimeBank 1.2 (LDC2006T08), FactBank 1.0 (LDC2009T23) and ModeS TimeBank 1.0
(LDC2012T01).
Data
Catalan TimeBank 1.0 contains stand-off annotations for 210 documents with
over 75,800 tokens (including punctuation marks) and 68,000 tokens (excluding
punctuation). The source documents are from the EFE news agency, the ACN
Catalan news agency and the Catalan version of the El Periódico newspaper,
and span the period from January to December 2000.
The AnCora corpus is the largest multilayer annotated corpus of Spanish and
Catalan. AnCora contains 400,000 words in Spanish and 275,000 words in Catalan.
The AnCora documents are annotated on many linguistic levels including structure,
syntax, dependencies, semantics and pragmatics. That information is not included
in this release, but it can be mapped to the present annotations. The data contained
in the AnCora corpus has been used in several international natural language
processing evaluations such as CoNLL-2006,
CoNLL-2007 and SemEval-2007.
The corpus is freely available from the Centre
de Llenguatge i Computació (CLiC).
Updates
Additional information, updates and bug fixes may be available in the LDC
catalog entry for this corpus at LDC2012T10.
Content Copyright
Portions © 2012 Roser Saurí, Toni Badia, © 2012 Trustees
of the University of Pennsylvania