Introduction
FactBank 1.0, Linguistic Data Consortium (LDC) catalog number LDC2009T23 and
ISBN 1-58563-522-7, consists of 208 documents (over 77,000 tokens) from newswire
and broadcast news reports in which event mentions are annotated with their
degree of factuality, that is, the degree to which they correspond to real-world
events. FactBank 1.0 was built on top of TimeBank 1.2 and a fragment of the AQUAINT
TimeML Corpus, both of which used the TimeML specification language. This
resulted in a double-layered annotation of event factuality. TimeBank 1.2 and
AQUAINT TimeML encode most of the basic structural elements expressing factuality
information while FactBank 1.0 represents the resulting factuality interpretation.
The combination of the factuality values in FactBank with the structural information
in TimeML-annotated corpora facilitates the development of tools aimed at automatically
identifying the factuality values of events, a component fundamental in tasks
requiring some degree of text understanding, such as Textual Entailment, Question
Answering, or Narrative Understanding.
FactBank annotations indicate whether the event mention describes actual situations
in the world, situations that have not happened, or situations of uncertain
interpretation. Event factuality is not an inherent feature of events but a
matter of perspective. Different discourse participants may present divergent
views about the factuality of the very same event. Consequently, in FactBank,
the factuality degree of events is assigned relative to the relevant sources
at play. In this way, it can adequately reflect the divergence of opinions regarding
the factual status of events, as is common in news reports.
The annotation language is grounded in established linguistic analyses of the
phenomenon, which facilitated the creation of a battery of discriminatory tests
for distinguishing between factuality values. Furthermore, the annotation procedure
was carefully designed and divided into basic, sequential annotation tasks.
This made it possible for hard tasks to be built on top of simpler ones, while
at the same time allowing annotators to become incrementally familiar with the
complexity of the problem. As a result, FactBank annotation achieved a relatively
high interannotator agreement (kappa = 0.81), a positive result when considered
against similar annotation efforts.
Data
All FactBank markup is standoff and is represented through a set of 20 tables
which can be easily loaded into a database. Each table resides in an independent
text file, where fields are separated by three consecutive bars (i.e., |||).
Values in string-type fields are enclosed in single quotation marks (').
Because FactBank 1.0 was built on top of TimeBank 1.2 and AQUAINT TimeML, both
of which are marked up with inline XML-based annotation, this release contains
the TimeBank 1.2 and AQUAINT TimeML annotation in standoff, table-based format
as well.
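As an illustration of the table format described above, the following Python
sketch splits one row on the three-bar separator and strips the single
quotation marks from string fields. It is a minimal sketch, not part of the
release: the function name parse_row and the sample row are hypothetical, and
it assumes that string values contain neither the ||| separator nor escaped
quotation marks, which this description does not specify.

```python
from typing import List, Union

FIELD_SEP = "|||"  # three consecutive bars, as described for the table files


def parse_row(line: str) -> List[Union[str, int]]:
    """Split one table row into fields (hypothetical helper).

    Fields wrapped in single quotation marks are returned as strings with
    the quotes removed; bare integer fields are converted to int; anything
    else is returned unchanged.
    """
    fields: List[Union[str, int]] = []
    for raw in line.rstrip("\r\n").split(FIELD_SEP):
        if len(raw) >= 2 and raw.startswith("'") and raw.endswith("'"):
            fields.append(raw[1:-1])   # string-type field
        elif raw.lstrip("-").isdigit():
            fields.append(int(raw))    # integer field
        else:
            fields.append(raw)         # unknown type: keep as-is
    return fields


# Made-up row for illustration only; not taken from the corpus.
row = parse_row("'example_doc.tml'|||3|||'said'\n")
print(row)
```

Rows parsed this way can then be passed to a database insert statement, one
table per file.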
Content Copyright
Portions © 1998 American Broadcasting Corporation, © 1998 The Associated
Press, © 1998 Cable News Network, LP, LLLP, © 1987-1989 Dow Jones
& Company, Inc., © 1998 New York Times, © 1998 Public Radio International,
© 1998, 1999 Xinhua News Agency, © 2002-2009 Brandeis University,
© 2009 Trustees of the University of Pennsylvania
The World is a co-production of Public Radio International and the British
Broadcasting Corporation and is produced at WGBH Boston.