HIML and AMD are two of the most recently published Ayurvedic indexes. HIML is great; AMD is mediocre. But both have several issues in common. Yes, the sorting has gone wrong. And sorting is rather important for a dictionary or word index, is it not?
Directions for Use
The harmonization of the indexes requires attention to some points which may facilitate finding the lemma searched for. This is because of variations in spelling between volumes I/II and volume III.
Majuscules and minuscules may not always have been used consistently.
Compound nouns are sometimes written as one word, sometimes as two; the latter may be with or without a hyphen.
Slight variants in spelling (for instance: mythic/mythical) may be disregarded.
Spelling variants are retained when present in the sources.
Further, it is not clear in some instances whether a word is a proper name or a title (see, for example: Bindusāra, Viśva).
It is useful to compare, in the general index, lemmata such as: kinds of…/types of…/varieties of…, and: diseases/disorders.
Those acquainted with Sanskrit may compare fever/jvara, etc.
Peculiar features of the index program:
— letters with a diacritic mark precede those without such a mark;
— words with a bracketed part precede those without brackets;
— lemmata consisting of two or more words precede those written as one word;
— compounds with a hyphen come after those without it.
The titles/author featuring in the headings of the volumes 1A and 1B (Caraka-saṃhitā, Suśrutasaṃhitā, Aṣṭāṅgahṛdayasaṃhitā, Aṣṭāṅgasaṃgraha, Vāgbhaṭa) are not indexed as far as these parts are concerned. For these the reader is referred to the contents.
The sorting in both books is miserable. There is no sorting logic in AMD and only some logic in HIML.
In HIML:
1) Nowhere is it stated that the index is sorted by the English alphabet. When opening a Sanskrit book, I expect Devanagari ordering. Why should I look for «bh» somewhere inside «b»? And why are «ś» and «ṣ» somewhere in the middle of «s»? Fascinating? No. In the age of automatic part-of-speech tagging for Sanskrit corpora, we still can’t sort Sanskrit as it should be sorted. Yes, it was 10 years ago, but in the last 20 years not much has changed for Sanskrit. If we don’t speak about it, nothing will change.
2) The statement that «letters with a diacritic mark precede those without such a mark» does not prepare you for the mix you actually get. So it’s not a feature, it’s an indexing software failure. Total failure. All diacritics are treated as if they were equal to the base character. Finding a word that begins with ā or ū takes a miracle (good that there are not many starting with ī).
If I could see the source text, I could get it right. But I guess I never will. And there will be a lot of errors in indexes related to Sanskrit matters.
Old books have correct sorting. New files have none. Why? Because nobody cares.
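To make the complaint concrete, here is a minimal sketch of how romanized lemmata could be sorted in Sanskrit alphabetical order instead of English order. It assumes IAST transliteration and a greedy reading of digraphs («kh», «bh», «ai», «au» are always taken as one letter), which is a simplification: a real index would need a way to mark the rare cases where the two characters belong to separate sounds. The names `IAST_ORDER`, `RANK` and `iast_key` are mine, not taken from any indexing program.

```python
import unicodedata

# Sanskrit alphabetical order in IAST; digraphs count as single letters.
IAST_ORDER = (
    "a ā i ī u ū ṛ ṝ ḷ e ai o au ṃ ḥ "
    "k kh g gh ṅ c ch j jh ñ ṭ ṭh ḍ ḍh ṇ "
    "t th d dh n p ph b bh m y r l v ś ṣ s h"
).split()

# Normalize to NFC so that precomposed and combining spellings compare equal.
RANK = {unicodedata.normalize("NFC", letter): rank
        for rank, letter in enumerate(IAST_ORDER)}


def iast_key(word):
    """Return the word as a list of alphabet ranks, usable as a sort key."""
    text = unicodedata.normalize("NFC", word.lower())
    key = []
    i = 0
    while i < len(text):
        pair = text[i:i + 2]
        if len(pair) == 2 and pair in RANK:   # greedy digraph match (assumption)
            key.append(RANK[pair])
            i += 2
        elif text[i] in RANK:
            key.append(RANK[text[i]])
            i += 1
        else:                                  # hyphens, spaces, brackets: ignored
            i += 1
    return key


LEMMATA = ["bhaga", "bala", "buddhi", "śiva", "sūtra", "ṣaṣ", "ātman", "agni"]
print(sorted(LEMMATA, key=iast_key))
```

With this key «bh» follows every plain «b», «ā» follows «a», and the sibilants come out as «ś», «ṣ», «s», so the sample list prints as agni, ātman, bala, buddhi, bhaga, śiva, ṣaṣ, sūtra.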
It should be (as in the book)
ह्वल्
ह्वला
In Google Docs and MS Word it comes out as
ह्वला
ह्वल्
instead. Shorter words always sort first, but क् and का have the same number of Unicode code points. क (one code point) is shorter than क् and का (two code points), and those are shorter than क्क (three code points). I was thinking about a preprocessor that converts terminal viramas to an auxiliary character and a postprocessor that converts this auxiliary character back to the virama. Viramas play two roles, so we need two distinct characters in the sort table. Shorter words should come first; see screenshot.
Excel sorts as:
ह्वरस्
ह्वर्
ह्वला
ह्वल्
ह्वा
ह्वान
ह्वार
ह्वारय्
In the book (and how it should be):
ह्वर्
ह्वरस्
ह्वल्
ह्वला
ह्वा
ह्वान
ह्वार
ह्वारय्
So no, Google Docs is as miserable as MS Office. See pages 24, 25, 27 and 29 of the .pdf.
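The preprocessor idea can be tried out in a few lines. The sketch below assumes plain code point comparison, which reproduces the Excel order for the sample words, and uses the rewritten string only as a sort key, so no postprocessor is needed. The stand-in character `AUX` (U+0001) is my own arbitrary choice; any character that sorts below every Devanagari code point would do. It only addresses the terminal virama and is not a full Devanagari collation.

```python
VIRAMA = "\u094D"   # DEVANAGARI SIGN VIRAMA
AUX = "\u0001"      # hypothetical stand-in, lower than any Devanagari code point


def virama_key(word):
    """Sort key: a word-final virama becomes AUX; medial viramas are kept."""
    if word.endswith(VIRAMA):
        return word[:-1] + AUX
    return word


# The eight words in the order Excel produces.
EXCEL_ORDER = ["ह्वरस्", "ह्वर्", "ह्वला", "ह्वल्", "ह्वा", "ह्वान", "ह्वार", "ह्वारय्"]

print(sorted(EXCEL_ORDER))                  # naive code point order
print(sorted(EXCEL_ORDER, key=virama_key))  # terminal virama sorts first
```

For these eight words the second line gives the order found in the book. The key leaves the original strings untouched, which is why the auxiliary character never has to be converted back.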
Gérard Huet
This is pretty trivial, a simple lexicographic ordering:
(* lexicographic comparison *)
value rec lexico l1 l2 = match l1 with
  [ [] -> True
  | [ c1 :: r1 ] -> if c1=50 (* hiatus *) then lexico r1 l2
                    else match l2 with
      [ [] -> False
      | [ c2 :: r2 ] -> if c2=50 (* hiatus *) then lexico l1 r2
                        else if c2>50 then c1>50 && c1<c2 (* homonym indexes *)
                        else if c1>50 then True
                        else if c2<c1 then False
                        else if c2=c1 then lexico r1 r2
                        else True
      ]
  ]
;
Every Sanskrit phoneme is represented as an integer between 1 (a) and 49 (h). A word is a list of phonemes. Words are thus sorted by lexicographic ordering over lists of integers.
Homophony indexes are suffix codes, from 50 to 59.
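For readers who do not read Camlp4 revised syntax, here is my own line-by-line Python port of the comparison, offered as a reading aid and not as part of the Zen toolkit. `lexico` keeps the original meaning ("l1 sorts at or before l2"); `compare` and `sort_words` are hypothetical helpers I added to plug it into Python's `sorted`. The integers in the sample are arbitrary phoneme codes, not a claim about the real encoding table.

```python
from functools import cmp_to_key

HIATUS = 50  # skipped during comparison, as in the OCaml code


def lexico(l1, l2):
    """True when the code list l1 sorts at or before l2."""
    if not l1:
        return True
    c1, r1 = l1[0], l1[1:]
    if c1 == HIATUS:
        return lexico(r1, l2)
    if not l2:
        return False
    c2, r2 = l2[0], l2[1:]
    if c2 == HIATUS:
        return lexico(l1, r2)
    if c2 > 50:                      # l2 ends in a homonym index
        return c1 > 50 and c1 < c2
    if c1 > 50:                      # l1 ends in a homonym index, l2 goes on
        return True
    if c2 < c1:
        return False
    if c2 == c1:
        return lexico(r1, r2)
    return True


def compare(a, b):
    """Three-way comparison built from lexico (helper added for this sketch)."""
    before, after = lexico(a, b), lexico(b, a)
    if before == after:
        return 0
    return -1 if before else 1


def sort_words(words):
    return sorted(words, key=cmp_to_key(compare))


print(sort_words([[1, 2], [1, 52], [1, 51], [1], [3]]))
```

In this sample the bare word `[1]` comes first, then its homonyms `[1, 51]` and `[1, 52]`, then the longer word `[1, 2]`, then `[3]`.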
Two pitfalls to avoid for computing on Sanskrit words or sentences:
— Do not compute on syllables, but on phonemes — thus translate devanagarii at the phonemic level
— Do not use strings — specially Unicode strings — use lists
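Both pieces of advice can be illustrated with a small Python sketch that turns a Devanagari string into a list of phoneme codes and sorts on that list. The numbering is my assumption: it follows the traditional alphabet with a = 1 and h = 49 as stated above, with anusvara, candrabindu and visarga placed at 14 to 16; the actual Zen tables may differ. The function covers only the basic Sanskrit letters and raises an error on anything else.

```python
VIRAMA = 0x094D
A = 1  # the inherent vowel "a"

# Independent vowels a ā i ī u ū ṛ ṝ ḷ e ai o au -> 1..13
INDEPENDENT = dict(zip(
    [0x0905, 0x0906, 0x0907, 0x0908, 0x0909, 0x090A, 0x090B, 0x0960,
     0x090C, 0x090F, 0x0910, 0x0913, 0x0914], range(1, 14)))
# Dependent vowel signs ā ... au -> 2..13 (there is no sign for "a")
SIGNS = dict(zip(
    [0x093E, 0x093F, 0x0940, 0x0941, 0x0942, 0x0943, 0x0944, 0x0962,
     0x0947, 0x0948, 0x094B, 0x094C], range(2, 14)))
# Anusvara, candrabindu, visarga -> 14..16 (assumed placement)
MARKS = {0x0902: 14, 0x0901: 15, 0x0903: 16}
# The 33 consonants k ... h -> 17..49, skipping non-Sanskrit letters
CONSONANTS = dict(zip(
    list(range(0x0915, 0x0929)) + list(range(0x092A, 0x092F))
    + [0x092F, 0x0930, 0x0932, 0x0935, 0x0936, 0x0937, 0x0938, 0x0939],
    range(17, 50)))


def phonemes(word):
    """Translate a Devanagari word into a list of phoneme codes."""
    out = []
    pending_a = False  # last consonant still carries its inherent "a"
    for ch in word:
        cp = ord(ch)
        if cp == VIRAMA:
            pending_a = False          # virama cancels the inherent vowel
            continue
        if cp in SIGNS:
            out.append(SIGNS[cp])      # vowel sign replaces the inherent vowel
            pending_a = False
            continue
        if pending_a:
            out.append(A)              # make the inherent vowel explicit
            pending_a = False
        if cp in CONSONANTS:
            out.append(CONSONANTS[cp])
            pending_a = True
        elif cp in INDEPENDENT:
            out.append(INDEPENDENT[cp])
        elif cp in MARKS:
            out.append(MARKS[cp])
        else:
            raise ValueError(f"unsupported character {ch!r} in {word!r}")
    if pending_a:
        out.append(A)
    return out


WORDS = ["ह्वरस्", "ह्वर्", "ह्वला", "ह्वल्", "ह्वा", "ह्वान", "ह्वार", "ह्वारय्"]
print(sorted(WORDS, key=phonemes))
```

At the phonemic level the virama disappears altogether: ह्वल् is h-v-a-l and ह्वला is h-v-a-l-ā, so the shorter list sorts first without any special treatment, and the eight sample words come out in the order of the printed book.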
My methodology is very simple. I have designed a toolkit Zen for computational linguistics based on very simple notions, and Sanskrit is just an application of these generic techniques.
The whole Zen toolkit may be downloaded as open source software from a URL given at section Zen on my site entry page. A pdf manual documents the library.
http://icebearsoft.euweb.cz/download/zwxindy.pdf is the best documentation on the Nāgarī alphabet, which is called ‘Varṇamālā’ (वर्णमाला). It is also called ‘kakaharā’ (ककहरा) or ‘Akṣaramālā’ (अक्षरमाला; Akshar-mala). Varṇa means letter; mālā means chain or garland.
#!/usr/bin/perl
# a string describing the language (to be exact, the sorting order)
$language = "Hindi";
$prefix = "hi";
$script = "devanagari";
# Technically speaking, $alphabet is (a reference to) an array of arrays of
# arrays. Sounds complicated? Don’t worry! Explanation follows:
# Every line describes one letter of the alphabet (in all its variants).
# The first string is the name of the letter; this appears in the heading of
# letter groups (when defined with the proper markup). Currently the maximum
# number of letters is limited to 95. A future expansion up to 223 letters
# should be no problem.
# Next follows a sequence of arrays, delimited by commas. Each of these arrays
# describes one variant of the letter with different diacritical marks
# (accents). The order of those describes the sorting order if two words
# appear which differ only in the diacritical variant of this letter.
# Currently the maximum supported number of diacritical variants of one letter
# is 93.
# Each of these arrays contains first the lowercase variant of the letter,
# followed by uppercase variant(s). You might wonder: How can there be other
# than one uppercase variant? Consider the letter combination `ch': Uppercase
# variants here are: `Ch' and `CH'. Also, in some character sets there might
# not exist an uppercase variant of a letter, e.g. the letter `ß' in the
# ISO-8859-1 character set. In this case we just leave it out.
# The sum of the number of uppercase and lowercase variants of one diacritical
# version of a letter should be 10 or less. (In case of `ch' it is 3:
# `ch', `Ch' and `CH')
# There can be empty arrays [] which are called slots. They are used for
# mixing alphabets of different languages.
# The next should be pretty easy:
# It means: `ß' is a ligature which is sorted like the letter sequence `ss'
# but in case two words differ only there, the word with `ß' comes after the
# one with `ss' (e.g. Masse, Maße.)
# The same with /, only this time with uppercase/lowercase variants.
# The order of the lines in $ligatures does not matter.
# `special' are those characters which are normally ignored in the sorting
# process, but e.g. to sort the words "coop" and "co-op" we must also define
# an order here.
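The array-of-arrays-of-arrays layout described in these comments is easier to follow with a toy example. The Python sketch below is my own illustration of the idea and not how xindy is implemented: each letter has a heading and a list of diacritical variants, each variant lists its lowercase form followed by uppercase forms, and words are compared first by letter, then by diacritic variant, then by case. The four-letter `ALPHABET` and the name `collation_key` are hypothetical.

```python
import unicodedata

# (heading, [variant, ...]); each variant is [lowercase, uppercase, ...].
ALPHABET = [
    ("A",  [["a", "A"], ["á", "Á"]]),
    ("C",  [["c", "C"]]),
    ("CH", [["ch", "Ch", "CH"]]),
    ("D",  [["d", "D"]]),
]

# Every spelling -> (letter rank, diacritic rank, case rank).
SPELLINGS = {}
for letter_rank, (_heading, variants) in enumerate(ALPHABET):
    for variant_rank, forms in enumerate(variants):
        for case_rank, form in enumerate(forms):
            SPELLINGS[unicodedata.normalize("NFC", form)] = (
                letter_rank, variant_rank, case_rank)
LONGEST = max(len(form) for form in SPELLINGS)


def collation_key(word):
    """Three-level key: letters first, then diacritics, then case."""
    text = unicodedata.normalize("NFC", word)
    ranks = []
    i = 0
    while i < len(text):
        # Longest match first, so that "ch" is one letter and not "c" + "h".
        for size in range(min(LONGEST, len(text) - i), 0, -1):
            piece = text[i:i + size]
            if piece in SPELLINGS:
                ranks.append(SPELLINGS[piece])
                i += size
                break
        else:
            raise ValueError(f"no alphabet entry for {text[i]!r} in {word!r}")
    letters = tuple(r[0] for r in ranks)
    diacritics = tuple(r[1] for r in ranks)
    cases = tuple(r[2] for r in ranks)
    return (letters, diacritics, cases)


print(sorted(["da", "cha", "cád", "Cad", "cad", "ca"], key=collation_key))
```

Because the diacritic and case ranks are only consulted when the letter sequence is identical, «cad», «Cad» and «cád» stay together, and «cha» sorts after every word in «c». This is the behaviour one would want for ā, ī and ū in a Sanskrit index, with the variant order reversed if plain letters are meant to come first.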