Notes on mapping into ITRANS from ISCII and
Unicode.
ITRANS offers a widely used 7-bit, phonetically mnemonic encoding for Indian
languages, which appears to be losslessly interconvertible with the ISCII and
Unicode encodings for these languages, assuming one keeps careful track of
which parts of the output are really ITRANS encodings and which are not. If you
don't do this, the results will NOT be safely round-trippable: ITRANS text is
plain ASCII, so nothing in the output itself distinguishes it from ASCII text
that was passed through unchanged.
If one wanted a safely
round-trippable roman-alphabet encoding of (for instance) Hindi, one would need
to use techniques like those employed by the (once?) widely-used HZ encoding
scheme for Chinese.
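A minimal sketch of that idea, assuming a hypothetical HZ-style convention in
which `~{` opens an ITRANS span, `~}` closes it, and a literal `~` anywhere is
written as `~~`. The delimiters and the function names encode_segments and
decode_segments are illustrative only; they are not part of ITRANS or of any
tool mentioned in these notes.

```python
def encode_segments(segments):
    """Join (is_itrans, text) pairs into one string with HZ-style escapes."""
    out = []
    for is_itrans, text in segments:
        escaped = text.replace("~", "~~")  # double every literal tilde
        out.append("~{" + escaped + "~}" if is_itrans else escaped)
    return "".join(out)


def decode_segments(encoded):
    """Invert encode_segments, returning a list of (is_itrans, text) pairs."""
    segments, buf, in_itrans = [], [], False
    i = 0
    while i < len(encoded):
        ch = encoded[i]
        if ch != "~":
            buf.append(ch)
            i += 1
            continue
        if i + 1 >= len(encoded):
            raise ValueError("dangling '~' at end of input")
        nxt = encoded[i + 1]
        if nxt == "~":                      # escaped literal tilde
            buf.append("~")
        elif nxt == "{" and not in_itrans:  # start of an ITRANS span
            if buf:
                segments.append((False, "".join(buf)))
            buf, in_itrans = [], True
        elif nxt == "}" and in_itrans:      # end of an ITRANS span
            segments.append((True, "".join(buf)))
            buf, in_itrans = [], False
        else:
            raise ValueError("unexpected escape '~%s' at offset %d" % (nxt, i))
        i += 2
    if in_itrans:
        raise ValueError("unterminated ITRANS span")
    if buf:
        segments.append((False, "".join(buf)))
    return segments
```

Because every literal tilde is doubled in both kinds of text, a `~{` in the
encoded string can only be a delimiter, which is what makes the round trip
safe. Adjacent plain segments are merged on decoding, and empty plain segments
are dropped.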
The ITRANS 5.3 software converts from ITRANS to UTF-8 (among other things).
However, it doesn't do any mappings into ITRANS. For folks who can't read
Devanagari easily, mapping into ITRANS from other common encodings of Hindi may
be helpful.
1. Conversion from ISCII to ITRANS
This uses a simple Python script due to Arun Sharma (arun@sharma-home.net).
We start with a sample from the start of the IIIT English-Hindi dictionary,
processed by Xiaoyi Ma to contain lines of the form
ENGLISH-WORD POS HINDI-WORD
where the HINDI-WORD is encoded in ISCII.
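As a rough illustration of that line format, here is a sketch that splits one
entry into its three fields while leaving the Hindi field as raw bytes, so that
no ISCII byte is altered before conversion. The function name parse_entry is
hypothetical, and the sample bytes in the usage note are arbitrary high-bit
bytes standing in for ISCII text, not taken from the dictionary.

```python
def parse_entry(line: bytes):
    """Split b'ENGLISH-WORD POS HINDI-WORD' into (word, pos, hindi_bytes)."""
    # bytes.split(None, 2) splits on ASCII whitespace only, at most twice,
    # so the high-bit bytes and any spaces inside the Hindi field are kept.
    parts = line.rstrip(b"\r\n").split(None, 2)
    if len(parts) != 3:
        raise ValueError("expected three fields, got %d" % len(parts))
    word, pos, hindi = parts
    return word.decode("ascii"), pos.decode("ascii"), hindi
```

For example, parse_entry(b"abdomen N \xc8\xe1\xbd\n") returns the English word
and part of speech as strings and the last field as the three untouched bytes.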
Testing
$ ./iscii2itrans.py <hdictsample1
produces
a Det Eka
aback Adv pIChE
aback Adv -arka-taprabha
abacus N ginataAraA
abandon V ChODa??~dEnaA
abandoned Adj ChODa??A~-arkauA
abandonment N parityaAga
abase V avamaAnita~karanaA
abashed Adj lajjita
abate V kama~-arkaOnaA
abatement N kamI
abattoir N vadhashaAlaA
abbess N maThaAdhyakssaA
abbey N IsaAiyO~kaA~maTha
abbot N maThaAdhyakssa
abbreviate VT saamkssipta~karanaA
abbreviation N saamkssipti
abdicate V svEchChaA~sE~ChODa??naA
abdication N pada~tyaAga
abdomen N pETa
which looks OK.
2. Conversion from UTF-8 to ITRANS
We start with a sample from the
Naidunia newswire (from 1999), translated to UTF-8 and SGML by Kevin Walker and
Dave Graff.
The iconverter software from IIT Kanpur is used to convert from UTF-8 to ISCII,
reducing the problem to the previously solved one:
$ utf2unicode naidunia_sample.utf8 naidunia_sample.uni
$ iconverter -e unicode_iscii_dev naidunia_sample.uni naidunia_sample.isc
$ ./iscii2itrans.py <naidunia_sample.isc
produces
þÿ<!DOCTYPE NEWSWDAY SYSTEM
"cynewulf.dtd">
<NEWSWDAY>
<DOC>
<DOCID>NAI19990901.6707</DOCID>
<HEADER>
<DATE>19990901</DATE>
<LANG>Hindi</LANG>
<CATE>1</CATE>
<SRCE>Naidunia</SRCE>
</HEADER>
<BODY>
<HEADLINE>
kuCha
laEgaEam nE raAjanIti kaE dhaamdhaA banaA liyaA
svayaam kE -arkaitaEam kE liE
gaThabaamdhana aAIra maEchaE banE aha
saEniyaAjI
</HEADLINE>
<TEXT>
<P>
maamgalavaAra
ôò agastaa kaAamgarEsa adhyakssa shrImatI saEniyaA gaAandhI nE ka-arka-A -arkaAI
ki kuCha laEgaEam nE raAjanIti kaE vyavasaAya banaA liyaA -arkaAIa EsE laEgaEam
nE svayaam kE -arkaitaEam kE liE gaThabaamdhana aAIra maEchaE banaA liE
-arkaAIama un-arka-EamnE pradhaAnamaamtrI shrI aTalabi-arka-ArI vaAjapEyI aAIra
rakssaAmaamtrI shrI ja??rja phanaArraDIsa para kaAragila muddE kaE lEkara
5-arka-malaA 2 baElaAa un-arka-EamnE aAraEpa lagaAyaA ki paAkistaAna dvaAraA
kaAragila mEam ghusapAITha kaA mukaAbalaA karanE mEam bhI sarakaAra asaphala
ra-arkaI -arkaAIa un-arka-EamnE ka-arka-A ki paAka sE chInI aAyaAta muddE kaE vE
janataA kE bIcha lE jaAEangIa
</P>
<P>
which looks OK.
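For comparison, here is a sketch of mapping Unicode Devanagari straight to
ITRANS without the intermediate ISCII step. This is a different route from the
pipeline used in these notes, and the tables are deliberately tiny (a handful
of consonants, three vowel signs, the virama, and a few other signs); the
function name deva_to_itrans is an assumption. It follows the usual ITRANS
convention that a consonant carries an inherent "a" unless a vowel sign or
virama follows, and is not meant to reproduce the script's output character
for character.

```python
# Partial tables only; a real converter would need the full Devanagari block.
CONSONANTS = {
    "\u0915": "k",  # ka
    "\u0924": "t",  # ta
    "\u0926": "d",  # da
    "\u0928": "n",  # na
    "\u092C": "b",  # ba
    "\u092E": "m",  # ma
    "\u0932": "l",  # la
    "\u0939": "h",  # ha
}
MATRAS = {
    "\u093E": "A",  # vowel sign aa
    "\u093F": "i",  # vowel sign i
    "\u0940": "I",  # vowel sign ii
}
VIRAMA = "\u094D"   # suppresses the inherent vowel
OTHER = {
    "\u0905": "a",  # independent vowel a
    "\u0906": "A",  # independent vowel aa
    "\u0902": "M",  # anusvara
}


def deva_to_itrans(text):
    """Transliterate the subset of Devanagari covered by the tables above."""
    out = []
    i = 0
    while i < len(text):
        ch = text[i]
        nxt = text[i + 1] if i + 1 < len(text) else ""
        if ch in CONSONANTS:
            out.append(CONSONANTS[ch])
            if nxt in MATRAS:        # explicit vowel replaces the inherent "a"
                out.append(MATRAS[nxt])
                i += 1
            elif nxt == VIRAMA:      # bare consonant, no vowel at all
                i += 1
            else:                    # inherent vowel
                out.append("a")
        else:
            # Unmapped characters (ASCII markup, digits, ...) pass through.
            out.append(OTHER.get(ch, ch))
        i += 1
    return "".join(out)
```

Characters outside the tables are copied unchanged, so SGML markup survives,
but so would any Devanagari letter the tables do not cover.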
For more kinds of encoding and font interconversion for
varieties of digital Hindi, see the relevant section of my notes on Hindi
resources.