Sanskrit Sorting (Devanagari)

There is no correct sorting of Sanskrit words in MS Excel 2007. I was looking for it for years.

Sample file for sorting can be downloaded here.

Old books have correct sorting. New files have none. Why? Because nobody cares.

It should be (as in the book)

  • ह्वल्
  • ह्वला

It’s in Google Docs, MS Word

  • ह्वला
  • ह्वल्

Instead. Shorter words are always above but क् and का have the same

number of unicode codepoints. क (one codepoint) is shorter than क्
and का (two codepoints) and they are shorter than क्क (three
codepoints). I was thinking about a preprocessor that will convert
terminal viramas to an auxilliary character and a postprocessor that
will convert this auxilliary character back to the virama. Viramas
play two roles thus we need two distinct characters in the sort table.
Shorter words should come first, see screenshot.
Excel sorts as:

  • ह्वरस्
  • ह्वर्
  • ह्वला
  • ह्वल्
  • ह्वा
  • ह्वान
  • ह्वार
  • ह्वारय्
  • In the book (and how it should be):
  • ह्वर्
  • ह्वरस्
  • ह्वल्
  • ह्वला
  • ह्वा
  • ह्वान
  • ह्वार
  • ह्वारय्

So no, Google Docs are as miserable as MS Office. See .pdf page 24, 25, 27, 29.

http://greenmesg.org/sanskrit_online_tools/sanskrit_sorting_tool.php

  • ह्वा
  • ह्वान
  • ह्वार
  • ह्वारय्
  • ह्वरस्
  • ह्वर्
  • ह्वला
  • ह्वल्

Is totally wrong as well.

http://sanskrit.inria.fr/DICO/73.html#hlaad

  • √ ह्लाद् hlād
  • ह्लाद hlāda
  • ह्लादक hlādaka
  • ह्लादन hlādana
  • ह्लादयत् hlādayat
  • ह्लादि hlādi
  • ह्लादित hlādita
  • ह्लादिन् hlādin
  • √ ह्वल् hval
  • ह्वाय् hvāy

Seems to be ok.

 

http://www.flickr.com/photos/gasyoun/9862110173/sizes/c/in/photostream/

 

1) Gérard’s Zen method

Gérard Huet
This is pretty trivial, a simple lexicographic ordering:

(* lexicographic comparison *)
value rec lexico l1 l2 = match l1 with
[ [] -> True
| [ c1 :: r1 ] -> if c1=50 (* hiatus *) then lexico r1 l2
else match l2 with
[ [] -> False
| [ c2 :: r2 ] -> if c2=50 (* hiatus *) then lexico l1 r2
else if c2>50 then c1>50 && c1<c2 (* homonym indexes *)
else if c1>50 then True
else if c2<c1 then False
else if c2=c1 then lexico r1 r2
else True
]
]
;

Every Sanskrit phoneme is represented as an integer between 1 (a) and 49 (h). A word is a list of phonemes. Words are thus sorted by lexicographic ordering over lists of integers.
Homophony indexes are suffix codes, from 50 to 59.
Two pitfalls to avoid for computing on Sanskrit words or sentences:
— Do not compute on syllables, but on phonemes — thus translate devanagarii at the phonemic level
— Do not use strings — specially Unicode strings — use lists

My methodology is very simple. I have designed a toolkit Zen for computational linguistics based on very simple notions, and Sanskrit is just an application of these generic techniques.
The whole Zen toolkit may be downloaded as open source software from a URL given at section Zen on my site entry page. A pdf manual documents the library.

 

2) Zdanek’s method

http://icebearsoft.euweb.cz/xindy-devanagari/

http://icebearsoft.euweb.cz/download/zwxindy.pdf — best documentation on Nāgarī, which is called ‘Varṇamālā’ (वर्णमाला). It is also called ‘kakaharā’ (ककहरा) or ‘Akṣharamālā’ (अक्षरमाला; Akshar-mala)! Varṇa means letter; mālā means chain or garland.

#!/usr/bin/perl

# a string describing the language (to be exact, the sorting order)
$language = «Hindi»;
$prefix = «hi»;
$script = «devanagari»;
# Technically speaking, $alphabet is (a reference to) an array of arrays of
# arrays. Sounds complicated? Don’t worry! Explanation follows:

# Every line describes one letter of the alphabet (in all its variants).
# The first string is the name of the letter; this appears in the heading of
# letter groups (when defined with the proper markup). Currently the maximum
# number of letters is limited to 95. A future expansion up to 223 letters
# should be no problem.

# Next follows a sequence of arrays, delimited by commas. Each of these arrays
# describes one variant of the letter with different diacritical marks
# (accents). The order of those describes the sorting order if two words
# appear which differ only in the diacritical variant of this letter.
# Currently the maximum supported number of diacritical variants of one letter
# is 93.

# Each of these arrays contains first the lowercase variant of the letter,
# followed by uppercase variant(s). You might wonder: How can there be other
# than one uppercase variant? Consider the letter combination `ch’: Uppercase
# variants here are: `Ch’ and `CH’. Also, in some character sets there might
# not exist an uppercase variant of a letter, e.g. the letter `’ in the
# ISO-8859-1 character set. In this case we just leave it out.

# The sum of the number of uppercase and lowercase variants of one diacritical
# version of a letter should be 10 or less. (In case of `ch’ it is 3:
# `ch’, `Ch’ and `CH’)

# There can be empty arrays [] which are called slots. They are used for
# mixing alphabets of different languages.

$alphabet = [
[‘ं’, [‘ं’, ‘ँ’]],
[‘ः’, [‘ः’]],
[‘अ’, [‘अं’, ‘अँ’]],
[‘अ’, [‘अ’]],
[‘आ’, [‘आं’, ‘आँ’]],
[‘आ’, [‘आ’, ‘ऑ’]],
[‘इ’, [‘इं’, ‘इँ’]],
[‘इ’, [‘इ’]],
[‘ई’, [‘ईं’]],
[‘ई’, [‘ई’]],
[‘उ’, [‘उं’, ‘उँ’]],
[‘उ’, [‘उ’]],
[‘ऊ’, [‘ऊं’, ‘ऊँ’]],
[‘ऊ’, [‘ऊ’]],
[‘ऋ’, [‘ऋ’]],
[‘ॠ’, [‘ॠ’]],
[‘ऌ’, [‘ऌ’]],
[‘ॡ’, [‘ॡ’]],
[‘ए’, [‘एं’, ‘एँ’]],
[‘ए’, [‘ए’]],
[‘ऐ’, [‘ऐं’]],
[‘ऐ’, [‘ऐ’]],
[‘ओ’, [‘ओं’]],
[‘ओ’, [‘ओ’]],
[‘औ’, [‘औं’]],
[‘औ’, [‘औ’]],
[‘्’, [‘्’]],
[‘क’, [‘कं’, ‘कँ’]],
[‘क’, [‘क’]],
[‘ख’, [‘खं’, ‘खँ’]],
[‘ख’, [‘ख’]],
[‘ग’, [‘गं’, ‘गँ’]],
[‘ग’, [‘ग’]],
[‘घ’, [‘घं’, ‘घँ’]],
[‘घ’, [‘घ’]],
[‘ङ’, [‘ङं’, ‘ङँ’]],
[‘ङ’, [‘ङ’]],
[‘च’, [‘चं’, ‘चँ’]],
[‘च’, [‘च’]],
[‘छ’, [‘छं’, ‘छँ’]],
[‘छ’, [‘छ’]],
[‘ज’, [‘जं’, ‘जँ’]],
[‘ज’, [‘ज’]],
[‘झ’, [‘झं’, ‘झँ’]],
[‘झ’, [‘झ’]],
[‘ञ’, [‘ञं’, ‘ञँ’]],
[‘ञ’, [‘ञ’]],
[‘ट’, [‘टं’, ‘टँ’]],
[‘ट’, [‘ट’]],
[‘ठ’, [‘ठं’, ‘ठँ’]],
[‘ठ’, [‘ठ’]],
[‘ड’, [‘डं’, ‘डँ’]],
[‘ड’, [‘ड’]],
[‘ढ’, [‘ढं’, ‘ढँ’]],
[‘ढ’, [‘ढ’]],
[‘ण’, [‘णं’, ‘णँ’]],
[‘ण’, [‘ण’]],
[‘त’, [‘तं’, ‘तँ’]],
[‘त’, [‘त’]],
[‘थ’, [‘थं’, ‘थँ’]],
[‘थ’, [‘थ’]],
[‘द’, [‘दं’, ‘दँ’]],
[‘द’, [‘द’]],
[‘ध’, [‘धं’, ‘धँ’]],
[‘ध’, [‘ध’]],
[‘न’, [‘नं’, ‘नँ’]],
[‘न’, [‘न’]],
[‘प’, [‘पं’, ‘पँ’]],
[‘प’, [‘प’]],
[‘फ’, [‘फं’, ‘फँ’]],
[‘फ’, [‘फ’]],
[‘ब’, [‘बं’, ‘बँ’]],
[‘ब’, [‘ब’]],
[‘भ’, [‘भं’, ‘भँ’]],
[‘भ’, [‘भ’]],
[‘म’, [‘मं’, ‘मँ’]],
[‘म’, [‘म’]],
[‘य’, [‘यं’, ‘यँ’]],
[‘य’, [‘य’]],
[‘र’, [‘रं’, ‘रँ’]],
[‘र’, [‘र’]],
[‘ल’, [‘लं’, ‘लँ’]],
[‘ल’, [‘ल’]],
[‘व’, [‘वं’, ‘वँ’]],
[‘व’, [‘व’]],
[‘श’, [‘शं’, ‘शँ’]],
[‘श’, [‘श’]],
[‘ष’, [‘षं’, ‘षँ’]],
[‘ष’, [‘ष’]],
[‘स’, [‘सं’, ‘सँ’]],
[‘स’, [‘स’]],
[‘ह’, [‘हं’, ‘हँ’]],
[‘ह’, [‘ह’]],
];

# The next should be pretty easy:
# It means: » is a ligature which is sorted like the letter sequence `ss’
# but in case two words differs only there, the word with » comes after the
# one with ‘ss’ (e.g. Masse, Mae.)

# The same with /, only this time with uppercase/lowercase variants.
# The order of the lines in $ligatures does not matter.

$ligatures = [
[[‘क़’], ‘after’, [[‘क’]]],
[[‘ख़’], ‘after’, [[‘ख’]]],
[[‘ग़’], ‘after’, [[‘ग’]]],
[[‘ज़’], ‘after’, [[‘ज’]]],
[[‘ड़’], ‘after’, [[‘ड’]]],
[[‘ढ़’], ‘after’, [[‘ढ’]]],
[[‘ऩ’], ‘after’, [[‘न’]]],
[[‘फ़’], ‘after’, [[‘फ’]]],
[[‘य़’], ‘after’, [[‘य’]]],
[[‘ऱ’], ‘after’, [[‘र’]]],
[[‘ळ’], ‘after’, [[‘ल’]]],
];

# `special’ are those characters which are normally ignored in the sorting
# process, but e.g. to sort the words «coop» and «co-op» we must also define
# an order here.

@special = (‘?’, ‘!’, ‘.’, ‘letters’, ‘-‘, ‘\», ‘\\/’);

# first lower or upper case?

#$sortcase = «Aa»;
$sortcase = «aA»;

#@letter_group_names = (‘अ’, ‘आ’, ‘इ’, ‘ई’, ‘उ’, ‘ऊ’, ‘ऋ’, ‘ॠ’,
#’ऌ’, ‘ॡ’, ‘ए’, ‘ऐ’, ‘ओ’, ‘औ’,
#’क’, ‘ख’, ‘ग’, ‘घ’, ‘ङ’, ‘च’, ‘छ’, ‘ज’, ‘झ’, ‘ञ’,
#’ट’, ‘ठ’, ‘ड’, ‘ढ’, ‘ण’, ‘त’, ‘थ’, ‘द’, ‘ध’, ‘न’,
#’प’, ‘फ’, ‘ब’, ‘भ’, ‘म’, ‘य’, ‘र’, ‘ल’, ‘व’, ‘श’, ‘ष’, ‘स’, ‘ह’);

do ‘make-rules.pl’;

3) Mimer’s method

http://developer.mimer.com/charts/sanskrit.htm

CREATE COLLATION sanskrit FROM eor USING
'[sa][Indic]'
  -- 
  -- Strict nasalization 
  -- 
'&#0919##0915#<<#0902##0915#'
'&#0919##0916#<<#0902##0916#'
'&#0919##0917#<<#0902##0917#'
'&#0919##0918#<<#0902##0918#'
'&#0919##0919#<<#0902##0919#'
'&#091E##091A#<<#0902##091A#'
'&#091E##091B#<<#0902##091B#'
'&#091E##091C#<<#0902##091C#'
'&#091E##091D#<<#0902##091D#'
'&#091E##091E#<<#0902##091E#'
'&#0923##091F#<<#0902##091F#'
'&#0923##0920#<<#0902##0920#'
'&#0923##0921#<<#0902##0921#'
'&#0923##0922#<<#0902##0922#'
'&#0923##0923#<<#0902##0923#'
'&#0928##0924#<<#0902##0924#'
'&#0928##0925#<<#0902##0925#'
'&#0928##0926#<<#0902##0926#'
'&#0928##0927#<<#0902##0927#'
'&#0928##0928#<<#0902##0928#'
'&#092E##092A#<<#0902##092A#'
'&#092E##092B#<<#0902##092B#'
'&#092E##092C#<<#0902##092C#'
'&#092E##092D#<<#0902##092D#'
'&#092E##092E#<<#0902##092E#';