2008/2010 NIST Metrics for Machine Translation (MetricsMaTr) GALE Evaluation
Set, Linguistic Data Consortium (LDC) catalog number LDC2011T05 and isbn 1-58563-575-8,
is a package containing source data, reference translations, machine translations
and associated human judgments used in the NIST 2008 and 2010 MetricsMaTr evaluations.
The package was compiled by researchers at NIST, making use of Arabic and Chinese
broadcast, newswire and web data and reference translations collected and developed
by LDC for Phase 2 and Phase 2.5 of the DARPA GALE
is a series of research challenge events for machine translation (MT) metrology,
promoting the development of innovative MT metrics that correlate highly with
human assessments of MT quality. Participants submit their metrics to NIST (National
Institute of Standards and Technology). NIST runs those metrics on certain held-back
test data for which it has human assessments measuring quality and then calculates
correlations between the automatic metric scores and the human assessments.
Specifically, the goals of MetricsMATR are: to inform other MT technology evaluation
campaigns and conferences with regard to improved metrology to establish an
infrastructure that encourages the development of innovative metrics to build
a diverse community that will bring new perspectives to MT metrology research
and to provide a forum for MT metrology discussion and for establishing future
directions of MT metrology.
The first MetricsMaTr challenge was held in 2008
the development data from the 2008 program is available from LDC, 2008
NIST Metrics for Machine Translation (MetricsMATR08) Development Data LDC2009T05.
The MetricsMaTr10 evaluation plan is included in this release.
This release contains 149 documents with corresponding reference translations
(Arabic-to-English and Chinese-to-English), system translations and human assessments.
The human assessments include the following: Adequacy7 (a 7-point scale for
judging the meaning of a system translation with respect to the reference translation)
Adequacy Yes/No (whether the given system segment meant essentially the same
as the reference translation) Preference (the judges preference between two
candidate translations when compared to a human reference translation) and
HTER (Human Targeted Error Rate, human edits to a system translation to have
the same meaning as a reference translation).
Additional information, updates, bug fixes may be available in the LDC catalog
entry for this corpus at LDC2011T05.
This work is supported in part by the Defense Advanced Research Projects Agency,
GALE Program Grant No. HR0011-06-1-003. The content of this publication does
not necessarily reflect the position or policy of the Government, and no official
endorsement should be inferred.
Portions © 2006 Abu Dhabi TV, Agence France Presse, Al-Ahram, Al Alam
News Channel, Al Arabiya, Al Hayat, Al Iraqiyah, Al Quds-Al Arabi, An Nahar,
Asharq al-Awsat, China Central TV, China Military Online, Chinanews.com, Guangming
Daily, Kuwait TV, New Tang Dynasty TV, Nile TV, PAC, Ltd, Peoples Daily Online,
Phoenix TV, Syria TV, Xinhua News Agency, © 2006, 2011 Trustees of the
University of Pennsylvania
Linguistic Data Consortium , Trustees
of the University of Pennsylvania . All Rights Reserved.