Introduction
Spanish Gigaword Third Edition, Linguistic Data Consortium (LDC) catalog number
LDC2011T12 and ISBN 1-58563-596-0, was produced by LDC. It is a comprehensive
archive of Spanish newswire text data that has been acquired over several years
by LDC. Spanish Gigaword Third Edition includes all of the content of the second
edition (LDC2009T21) and adds data collected from January 1, 2009 through
December 31, 2010.
The three distinct international sources of Spanish newswire in this
edition, and the time spans of collection covered for each, are as
follows:
- Agence France-Presse, Spanish (afp_spa) May 1994 - Dec 2010
- Associated Press, Spanish (apw_spa) Nov 1993 - Dec 2010
- Xinhua News Agency, Spanish (xin_spa) Sep 2001 - Dec 2010
The seven-character codes in the parentheses above consist of the
three-character source name abbreviation and the three-character
language code (spa), separated by an underscore (_) character. The
three-letter language code conforms to LDC's internal convention based on the
ISO 639-3 standard.
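As an illustration of the code layout described above, the following minimal sketch splits a code into its source and language parts. The function name split_source_code is hypothetical and not part of the corpus distribution.

```python
from typing import Tuple


def split_source_code(code: str) -> Tuple[str, str]:
    """Split a code such as 'afp_spa' into (source, language).

    Hypothetical helper: assumes exactly three characters on each
    side of a single underscore, as described in the text above.
    """
    source, sep, language = code.partition("_")
    if not sep or len(source) != 3 or len(language) != 3:
        raise ValueError(f"unexpected source code: {code!r}")
    return source, language


for code in ("afp_spa", "apw_spa", "xin_spa"):
    print(split_source_code(code))
```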
Data
All text data are presented in SGML/XML form, using a very simple,
minimal markup structure. All text consists of printable ASCII,
whitespace, and printable code points in the Latin-1 Supplement
character table, as defined by both ISO-8859-1 and the Unicode
Standard (ISO 10646) for the accented characters used in Spanish.
The Latin-1 Supplement (accented) characters are rendered using UTF-8 encoding.
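The character inventory described above can be checked with a short script. This is a sketch under stated assumptions: "whitespace" is taken to mean tab, newline, carriage return and space, and "printable Latin-1 Supplement" is taken to mean U+00A0 through U+00FF. The helper names are hypothetical.

```python
from typing import Set

# Assumption: these four characters are the "whitespace" meant in the text.
ASCII_WHITESPACE = {"\t", "\n", "\r", " "}


def is_expected_char(ch: str) -> bool:
    """True for printable ASCII, whitespace, or U+00A0..U+00FF."""
    cp = ord(ch)
    return (
        ch in ASCII_WHITESPACE
        or 0x21 <= cp <= 0x7E   # printable ASCII (space is handled above)
        or 0xA0 <= cp <= 0xFF   # assumed printable Latin-1 Supplement range
    )


def unexpected_chars(raw: bytes) -> Set[str]:
    """Decode UTF-8 bytes and return any characters outside the inventory."""
    text = raw.decode("utf-8")
    return {ch for ch in text if not is_expected_char(ch)}


print(unexpected_chars("El año pasado\n".encode("utf-8")))     # set()
print(unexpected_chars("precio: 5 \u20ac\n".encode("utf-8")))  # {'€'}
```

Decoding with the strict UTF-8 codec also raises UnicodeDecodeError on malformed byte sequences, which gives a basic encoding check for free.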
For all of the documents in this corpus, we have applied a rudimentary
(and _approximate_) categorization of DOC units into four distinct
types. The classification is indicated by the type="string"
attribute that is included in each opening DOC tag. The four
types are:
- story : This is by far the most frequent type, and it represents the
most typical newswire item: a coherent report on a particular topic
or event, consisting of paragraphs and full sentences.
- multi : This type of DOC contains a series of unrelated blurbs, each of
which briefly describes a particular topic or event; this is typically applied
to DOCs that contain "summaries of today's news", "news briefs in ... (some
general area like finance or sports)", and so on.
- advis : (short for advisory) These are DOCs that the news service addresses
to news editors -- they are not intended for publication to the end users
(the populations who read the news). This type contains formulaic,
repetitive content (contact phone numbers, etc.).
- other : This represents DOCs that clearly do not fall into any of
the above types -- in general, items of this type are intended for
broad circulation (they are not advisories), they may be topically
coherent (unlike multi type DOCs), and they typically do not
contain paragraphs or sentences (they aren't really stories);
these are things like lists of sports scores, stock prices,
temperatures around the world, and so on.
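To show how the type attribute might be used, the sketch below tallies DOC types in a block of text. It assumes the attribute value is double-quoted inside the opening DOC tag; the sample markup, the id values and the function name are hypothetical and are not taken from the corpus.

```python
import re
from collections import Counter

# Matches an opening DOC tag and captures the value of its type attribute.
# Assumption: the value is enclosed in double quotes.
DOC_TYPE_RE = re.compile(r'<DOC\b[^>]*\btype="([^"]*)"', re.IGNORECASE)


def count_doc_types(text: str) -> Counter:
    """Count opening DOC tags by the value of their type attribute."""
    return Counter(match.group(1) for match in DOC_TYPE_RE.finditer(text))


# Hypothetical sample; real files may carry other attributes and elements.
sample = """<DOC id="hypothetical-001" type="story">
Texto de ejemplo.
</DOC>
<DOC id="hypothetical-002" type="advis">
Nota para editores.
</DOC>
<DOC id="hypothetical-003" type="story">
Otro texto de ejemplo.
</DOC>
"""

counts = count_doc_types(sample)
print(counts["story"], counts["advis"], counts["multi"])  # 2 1 0
```

A regular expression is used here rather than an XML parser because the text describes the markup only as SGML/XML, so strict XML well-formedness is not assumed.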
Updates
An update to Spanish Gigaword Third Edition was issued to correct a problem with 26
consecutive months of data files from Xinhua Spanish: xin_spa_200601 through
xin_spa_200802, i.e., all files from 2006 and 2007, plus the first two files
from 2008. The problem was that all letters with diacritic marks had been omitted
in the text data for that portion of the collection. For example, the
word "año" was presented as "ao" (minus the n-with-tilde character),
"aspiración" appeared as "aspiracin", and similarly for all accented
characters (UTF-8 letters outside the ASCII range). All copies of Spanish
Gigaword Third Edition ordered after February 2013 will have this update included.
More information is included in the readme associated with this update.
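The affected range can be enumerated to confirm the count of 26 months. The sketch below also includes a rough heuristic for spotting text with no characters outside ASCII, which is the symptom described above. Both helper names are hypothetical, and the heuristic is an assumption rather than LDC's own check.

```python
from typing import List, Tuple


def month_range(prefix: str, start: Tuple[int, int], end: Tuple[int, int]) -> List[str]:
    """List names like 'xin_spa_200601' for every month from start to end, inclusive."""
    names = []
    year, month = start
    while (year, month) <= end:
        names.append(f"{prefix}_{year:04d}{month:02d}")
        month += 1
        if month > 12:
            year, month = year + 1, 1
    return names


def lacks_non_ascii(text: str) -> bool:
    """Heuristic: a large body of Spanish text with no non-ASCII characters is suspicious."""
    return all(ord(ch) < 128 for ch in text)


affected = month_range("xin_spa", (2006, 1), (2008, 2))
print(len(affected))              # 26
print(affected[0], affected[-1])  # xin_spa_200601 xin_spa_200802
print(lacks_non_ascii("ao"), lacks_non_ascii("año"))  # True False
```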
Content Copyright
Portions © 1994-2010 Agence France Presse, © 1993-2010 The Associated
Press, © 2001-2010 Xinhua News Agency, © 2006, 2009, 2011, 2013 Trustees
of the University of Pennsylvania