Sanskrit Sorting (Devanagari)

MS Excel 2007 cannot sort Sanskrit words correctly. I have been looking for a solution for years.

Sample file for sorting can be downloaded here.

Old books have correct sorting. New files have none. Why? Because nobody cares.

It should be (as in the book)

  • ह्वल्
  • ह्वला

In Google Docs and MS Word it is instead:

  • ह्वला
  • ह्वल्

Shorter words always come first, but क् and का have the same number of Unicode codepoints. क (one codepoint) is shorter than क् and का (two codepoints), and they are shorter than क्क (three codepoints). I was thinking about a preprocessor that converts terminal viramas to an auxiliary character and a postprocessor that converts this auxiliary character back to the virama. The virama plays two roles (joining consonants inside a word and closing a word that ends in a consonant), thus we need two distinct characters in the sort table. Shorter words should come first, see screenshot.
Excel sorts as:

  • ह्वरस्
  • ह्वर्
  • ह्वला
  • ह्वल्
  • ह्वा
  • ह्वान
  • ह्वार
  • ह्वारय्

In the book (and how it should be):

  • ह्वर्
  • ह्वरस्
  • ह्वल्
  • ह्वला
  • ह्वा
  • ह्वान
  • ह्वार
  • ह्वारय्
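
The preprocessor/postprocessor idea can be sketched in a few lines of Python. This is only a minimal sketch under assumptions: the auxiliary character (U+0001) and the function names are hypothetical choices, and the sort itself is a plain codepoint sort, not what Excel or Word do internally. It also does not cover a word that is a bare prefix of another (ह्वर् against ह्वर), because the end of a string always sorts lowest.

```python
VIRAMA = "\u094D"  # Devanagari sign virama
AUX = "\u0001"     # hypothetical auxiliary character, sorts below every Devanagari codepoint


def preprocess(word):
    """Replace a word-final virama with the auxiliary character."""
    return word[:-1] + AUX if word.endswith(VIRAMA) else word


def postprocess(word):
    """Turn the auxiliary character back into a virama."""
    return word[:-1] + VIRAMA if word.endswith(AUX) else word


def sort_words(words):
    """Codepoint sort in which a final virama ranks before any other sign."""
    return [postprocess(w) for w in sorted(preprocess(w) for w in words)]


words = ["ह्वरस्", "ह्वर्", "ह्वला", "ह्वल्", "ह्वा", "ह्वान", "ह्वार", "ह्वारय्"]
for w in sort_words(words):
    print(w)
```

On the eight sample words this prints them in the order of the book.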

So no, Google Docs is as miserable as MS Office. See .pdf page 24, 25, 27, 29.

http://greenmesg.org/sanskrit_online_tools/sanskrit_sorting_tool.php

  • ह्वा
  • ह्वान
  • ह्वार
  • ह्वारय्
  • ह्वरस्
  • ह्वर्
  • ह्वला
  • ह्वल्

Is totally wrong as well.

http://sanskrit.inria.fr/DICO/73.html#hlaad

  • √ ह्लाद् hlād
  • ह्लाद hlāda
  • ह्लादक hlādaka
  • ह्लादन hlādana
  • ह्लादयत् hlādayat
  • ह्लादि hlādi
  • ह्लादित hlādita
  • ह्लादिन् hlādin
  • √ ह्वल् hval
  • ह्वाय् hvāy

Seems to be ok.

 

http://www.flickr.com/photos/gasyoun/9862110173/sizes/c/in/photostream/

 

1) Gérard’s Zen method

Gérard Huet
This is pretty trivial, a simple lexicographic ordering:

(* lexicographic comparison *)
value rec lexico l1 l2 = match l1 with
  [ [] -> True
  | [ c1 :: r1 ] -> if c1=50 (* hiatus *) then lexico r1 l2
                    else match l2 with
      [ [] -> False
      | [ c2 :: r2 ] -> if c2=50 (* hiatus *) then lexico l1 r2
                        else if c2>50 then c1>50 && c1<c2 (* homonym indexes *)
                        else if c1>50 then True
                        else if c2<c1 then False
                        else if c2=c1 then lexico r1 r2
                        else True
      ]
  ]
;

Every Sanskrit phoneme is represented as an integer between 1 (a) and 49 (h). A word is a list of phonemes. Words are thus sorted by lexicographic ordering over lists of integers.
Homophony indexes are suffix codes, from 50 to 59.
Two pitfalls to avoid when computing on Sanskrit words or sentences:
– Do not compute on syllables, but on phonemes; thus translate Devanāgarī at the phonemic level
– Do not use strings, especially Unicode strings; use lists
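
The same idea can be sketched in Python: translate Devanāgarī into a list of phoneme numbers, then compare the lists. This is a rough sketch under assumptions, not the Zen toolkit itself. The numbering below is an illustrative one built from the traditional alphabet order, the names are hypothetical, and the hiatus and homonym codes from the OCaml function are left out.

```python
# Illustrative phoneme inventory in traditional order (NOT the Zen toolkit's own codes).
VOWELS = [0x0905, 0x0906, 0x0907, 0x0908, 0x0909, 0x090A, 0x090B,
          0x0960, 0x090C, 0x0961, 0x090F, 0x0910, 0x0913, 0x0914]
# Dependent vowel signs, in the same order as VOWELS[1:] (short a has no sign).
MATRAS = [0x093E, 0x093F, 0x0940, 0x0941, 0x0942, 0x0943,
          0x0944, 0x0962, 0x0963, 0x0947, 0x0948, 0x094B, 0x094C]
MARKS = [0x0902, 0x0903]  # anusvara, visarga
CONSONANTS = [cp for cp in range(0x0915, 0x093A)
              if cp not in (0x0929, 0x0931, 0x0933, 0x0934)]
VIRAMA = 0x094D

RANK = {cp: i for i, cp in enumerate(VOWELS + MARKS + CONSONANTS, start=1)}
MATRA_TO_VOWEL = dict(zip(MATRAS, VOWELS[1:]))
CONSONANT_SET = set(CONSONANTS)
SHORT_A = VOWELS[0]


def phonemes(word):
    """Translate a Devanagari word into a list of phoneme numbers."""
    cps = [ord(ch) for ch in word]
    out, i = [], 0
    while i < len(cps):
        cp = cps[i]
        nxt = cps[i + 1] if i + 1 < len(cps) else None
        if cp in CONSONANT_SET:
            out.append(RANK[cp])
            if nxt == VIRAMA:              # bare consonant, no vowel follows
                i += 1
            elif nxt in MATRA_TO_VOWEL:    # explicit vowel sign
                out.append(RANK[MATRA_TO_VOWEL[nxt]])
                i += 1
            else:                          # inherent short a
                out.append(RANK[SHORT_A])
        elif cp in RANK:                   # independent vowel, anusvara, visarga
            out.append(RANK[cp])
        else:
            raise ValueError("unsupported character: U+%04X" % cp)
        i += 1
    return out


words = ["ह्वरस्", "ह्वर्", "ह्वला", "ह्वल्", "ह्वा", "ह्वान", "ह्वार", "ह्वारय्"]
for w in sorted(words, key=phonemes):   # Python compares lists lexicographically
    print(w)
```

Because the comparison runs over phonemes, ह्वल् (h v a l) is a prefix of ह्वला (h v a l ā) and comes first, which is the order of the book.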

My methodology is very simple. I have designed a toolkit Zen for computational linguistics based on very simple notions, and Sanskrit is just an application of these generic techniques.
The whole Zen toolkit may be downloaded as open source software from a URL given at section Zen on my site entry page. A pdf manual documents the library.

 

2) Zdeněk’s method

http://icebearsoft.euweb.cz/xindy-devanagari/

http://icebearsoft.euweb.cz/download/zwxindy.pdf – the best documentation on the Nāgarī alphabet, which is called ‘Varṇamālā’ (वर्णमाला). It is also called ‘kakaharā’ (ककहरा) or ‘Akṣaramālā’ (अक्षरमाला). Varṇa means letter; mālā means chain or garland.

#!/usr/bin/perl

# a string describing the language (to be exact, the sorting order)
$language = "Hindi";
$prefix = "hi";
$script = "devanagari";
# Technically speaking, $alphabet is (a reference to) an array of arrays of
# arrays. Sounds complicated? Don’t worry! Explanation follows:

# Every line describes one letter of the alphabet (in all its variants).
# The first string is the name of the letter; this appears in the heading of
# letter groups (when defined with the proper markup). Currently the maximum
# number of letters is limited to 95. A future expansion up to 223 letters
# should be no problem.

# Next follows a sequence of arrays, delimited by commas. Each of these arrays
# describes one variant of the letter with different diacritical marks
# (accents). The order of those describes the sorting order if two words
# appear which differ only in the diacritical variant of this letter.
# Currently the maximum supported number of diacritical variants of one letter
# is 93.

# Each of these arrays contains first the lowercase variant of the letter,
# followed by uppercase variant(s). You might wonder: How can there be other
# than one uppercase variant? Consider the letter combination `ch’: Uppercase
# variants here are: `Ch’ and `CH’. Also, in some character sets there might
# not exist an uppercase variant of a letter, e.g. the letter `’ in the
# ISO-8859-1 character set. In this case we just leave it out.

# The sum of the number of uppercase and lowercase variants of one diacritical
# version of a letter should be 10 or less. (In case of `ch’ it is 3:
# `ch’, `Ch’ and `CH’)

# There can be empty arrays [] which are called slots. They are used for
# mixing alphabets of different languages.

$alphabet = [
['ं', ['ं', 'ँ']],
['ः', ['ः']],
['अ', ['अं', 'अँ']],
['अ', ['अ']],
['आ', ['आं', 'आँ']],
['आ', ['आ', 'ऑ']],
['इ', ['इं', 'इँ']],
['इ', ['इ']],
['ई', ['ईं']],
['ई', ['ई']],
['उ', ['उं', 'उँ']],
['उ', ['उ']],
['ऊ', ['ऊं', 'ऊँ']],
['ऊ', ['ऊ']],
['ऋ', ['ऋ']],
['ॠ', ['ॠ']],
['ऌ', ['ऌ']],
['ॡ', ['ॡ']],
['ए', ['एं', 'एँ']],
['ए', ['ए']],
['ऐ', ['ऐं']],
['ऐ', ['ऐ']],
['ओ', ['ओं']],
['ओ', ['ओ']],
['औ', ['औं']],
['औ', ['औ']],
['्', ['्']],
['क', ['कं', 'कँ']],
['क', ['क']],
['ख', ['खं', 'खँ']],
['ख', ['ख']],
['ग', ['गं', 'गँ']],
['ग', ['ग']],
['घ', ['घं', 'घँ']],
['घ', ['घ']],
['ङ', ['ङं', 'ङँ']],
['ङ', ['ङ']],
['च', ['चं', 'चँ']],
['च', ['च']],
['छ', ['छं', 'छँ']],
['छ', ['छ']],
['ज', ['जं', 'जँ']],
['ज', ['ज']],
['झ', ['झं', 'झँ']],
['झ', ['झ']],
['ञ', ['ञं', 'ञँ']],
['ञ', ['ञ']],
['ट', ['टं', 'टँ']],
['ट', ['ट']],
['ठ', ['ठं', 'ठँ']],
['ठ', ['ठ']],
['ड', ['डं', 'डँ']],
['ड', ['ड']],
['ढ', ['ढं', 'ढँ']],
['ढ', ['ढ']],
['ण', ['णं', 'णँ']],
['ण', ['ण']],
['त', ['तं', 'तँ']],
['त', ['त']],
['थ', ['थं', 'थँ']],
['थ', ['थ']],
['द', ['दं', 'दँ']],
['द', ['द']],
['ध', ['धं', 'धँ']],
['ध', ['ध']],
['न', ['नं', 'नँ']],
['न', ['न']],
['प', ['पं', 'पँ']],
['प', ['प']],
['फ', ['फं', 'फँ']],
['फ', ['फ']],
['ब', ['बं', 'बँ']],
['ब', ['ब']],
['भ', ['भं', 'भँ']],
['भ', ['भ']],
['म', ['मं', 'मँ']],
['म', ['म']],
['य', ['यं', 'यँ']],
['य', ['य']],
['र', ['रं', 'रँ']],
['र', ['र']],
['ल', ['लं', 'लँ']],
['ल', ['ल']],
['व', ['वं', 'वँ']],
['व', ['व']],
['श', ['शं', 'शँ']],
['श', ['श']],
['ष', ['षं', 'षँ']],
['ष', ['ष']],
['स', ['सं', 'सँ']],
['स', ['स']],
['ह', ['हं', 'हँ']],
['ह', ['ह']],
];

# The next should be pretty easy:
# It means: `ß' is a ligature which is sorted like the letter sequence `ss'
# but in case two words differ only there, the word with `ß' comes after the
# one with `ss' (e.g. Masse, Maße.)

# The same with /, only this time with uppercase/lowercase variants.
# The order of the lines in $ligatures does not matter.

$ligatures = [
[['क़'], 'after', [['क']]],
[['ख़'], 'after', [['ख']]],
[['ग़'], 'after', [['ग']]],
[['ज़'], 'after', [['ज']]],
[['ड़'], 'after', [['ड']]],
[['ढ़'], 'after', [['ढ']]],
[['ऩ'], 'after', [['न']]],
[['फ़'], 'after', [['फ']]],
[['य़'], 'after', [['य']]],
[['ऱ'], 'after', [['र']]],
[['ळ'], 'after', [['ल']]],
];

# `special’ are those characters which are normally ignored in the sorting
# process, but e.g. to sort the words “coop” and “co-op” we must also define
# an order here.

@special = ('?', '!', '.', 'letters', '-', '\'', '\\/');

# first lower or upper case?

#$sortcase = "Aa";
$sortcase = "aA";

#@letter_group_names = ('अ', 'आ', 'इ', 'ई', 'उ', 'ऊ', 'ऋ', 'ॠ',
#'ऌ', 'ॡ', 'ए', 'ऐ', 'ओ', 'औ',
#'क', 'ख', 'ग', 'घ', 'ङ', 'च', 'छ', 'ज', 'झ', 'ञ',
#'ट', 'ठ', 'ड', 'ढ', 'ण', 'त', 'थ', 'द', 'ध', 'न',
#'प', 'फ', 'ब', 'भ', 'म', 'य', 'र', 'ल', 'व', 'श', 'ष', 'स', 'ह');

do 'make-rules.pl';

3) Mimer’s method

http://developer.mimer.com/charts/sanskrit.htm

CREATE COLLATION sanskrit FROM eor USING
'[sa][Indic]'
  -- 
  -- Strict nasalization 
  -- 
'&#0919##0915#<<#0902##0915#'
'&#0919##0916#<<#0902##0916#'
'&#0919##0917#<<#0902##0917#'
'&#0919##0918#<<#0902##0918#'
'&#0919##0919#<<#0902##0919#'
'&#091E##091A#<<#0902##091A#'
'&#091E##091B#<<#0902##091B#'
'&#091E##091C#<<#0902##091C#'
'&#091E##091D#<<#0902##091D#'
'&#091E##091E#<<#0902##091E#'
'&#0923##091F#<<#0902##091F#'
'&#0923##0920#<<#0902##0920#'
'&#0923##0921#<<#0902##0921#'
'&#0923##0922#<<#0902##0922#'
'&#0923##0923#<<#0902##0923#'
'&#0928##0924#<<#0902##0924#'
'&#0928##0925#<<#0902##0925#'
'&#0928##0926#<<#0902##0926#'
'&#0928##0927#<<#0902##0927#'
'&#0928##0928#<<#0902##0928#'
'&#092E##092A#<<#0902##092A#'
'&#092E##092B#<<#0902##092B#'
'&#092E##092C#<<#0902##092C#'
'&#092E##092D#<<#0902##092D#'
'&#092E##092E#<<#0902##092E#';
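
The rules above read as though an anusvara before a stop is sorted like the nasal of that stop's row, with the anusvara spelling placed after the nasal spelling when two words are otherwise equal. A minimal Python sketch of that reading follows. It is an assumption about the intent, not Mimer's implementation: the names are hypothetical, and the primary comparison is a plain codepoint comparison rather than the `eor` base collation.

```python
ANUSVARA = "\u0902"

# Class nasal for each of the five stop rows, copied from the rules above.
ROWS = [
    (0x0919, range(0x0915, 0x091A)),  # ṅ : k kh g gh ṅ
    (0x091E, range(0x091A, 0x091F)),  # ñ : c ch j jh ñ
    (0x0923, range(0x091F, 0x0924)),  # ṇ : ṭ ṭh ḍ ḍh ṇ
    (0x0928, range(0x0924, 0x0929)),  # n : t th d dh n
    (0x092E, range(0x092A, 0x092F)),  # m : p ph b bh m
]
NASAL_FOR = {chr(cp): chr(nasal) for nasal, row in ROWS for cp in row}


def nasal_key(word):
    """Two-level key: nasalized spelling first, then a marker for each anusvara."""
    primary, secondary = [], []
    for i, ch in enumerate(word):
        nxt = word[i + 1] if i + 1 < len(word) else ""
        if ch == ANUSVARA and nxt in NASAL_FOR:
            primary.append(NASAL_FOR[nxt])   # sort as the class nasal
            secondary.append(1)              # but after the real nasal on a tie
        else:
            primary.append(ch)
            secondary.append(0)
    return "".join(primary), secondary


with_anusvara = "\u0905\u0902\u0915"   # अ + anusvara + क
with_nasal = "\u0905\u0919\u0915"      # अ + ङ + क
print(sorted([with_anusvara, with_nasal], key=nasal_key))
```

An anusvara before a non-stop such as श is left untouched by this key.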

Sanskrit OCR Software Review

Oliver’s OCR tool is the best option for recognizing Sanskrit text. It was so before 2010. It is even more so in 2013. I have heard Indian coders give long talks (at Sanskrit conferences) about how hard it is to make, and only this German guy made it possible. There are no real alternatives. Don’t even waste your time. Years ago Oliver’s software had a recognition rate 20-30% lower. It was free of charge at that time. It can still be downloaded (but no longer from the official website). Since 2010 the updated version has been sold, and its accuracy rate is around 95%. Which is fantastic.

Indian books are printed on bad paper. There are enough ligatures to make even a font designer go mad.

Lots of ligatures

I have been working with Sanskrit OCRing for 11 years now. Until today – mostly romanized IAST. Indian quality printed IAST, good IAST – all kinds of them. I know how to make things run, train templates. Oliver’s Sanskrit OCR is where ABBYY FineReader was before v.7. It does the simple things and does them badly. It splits more frames than it should (instead of 2 columns it makes 7, but there have never been 7-column Sanskrit books). All OCR software has the same problems. Most of them never get solved. Not because they cannot be. Because the wrong algorithm is used.

ABBYY

Major issues:

1) No batch recognition of pages. I recognize one page at a time. It takes a few days to recognize a 200-page book, just pressing the same shortcut on every page. That’s insane. I have never seen this even in the earliest versions of ABBYY FineReader.

Where is it? Batch export, I mean. And is there batch analysis of the same layout for, let’s say, 1000 pages at once? Now I recognize one page at a time. It takes around a minute per page. So I have to keep the window open all the time and do monkey stuff.

Too many, instead of two columns

2) No batch export of recognized text. Maybe the batch export function is in HindiOCR (professional) for 199 Euros, but it is not there in SanskritOCR. Having no batch option makes it very hard to use.

[Professional version only:] To store the text of all recognized pages in one file, use the option “All pages” in the list Batch export. Activate the option Store in separate pages if every recognized page should be stored in a separate file on disk.

No batch export

 

Smaller OCR Issues:

1) It loses dots in references all the time. It “clears” them as junk data, but when 2-4 Indian numbers come in a line, there is a good chance that a dot will come after the 1st or 2nd. And in most cases it is clear enough, but still gets killed.
Superscript numbers are not recognized as such.

OCR working screen

2) There is no Save project option. Only when you close the tool does it ask if you want to save. So you can lose all your work in a minute. A project is not, as in ABBYY, all the images and the OCR layer in one folder. It’s not even the OCR. It’s the master file, and I don’t know how it is even connected. If I move the images to another disk, should I redo the OCR one page at a time?

3) When “Text recognition” mode starts, the main window closes and only a small one is visible. Does it have to be that way? When recognition ends, it opens the main window again.

4) Options are rather too simplistic. And having Mangal instead of Sanskrit 2003 at least for Sanskrit ligatures is a bit… strange. Default font size 14 is rather small, 16 or 18 should be preferred.

Fonts to choose from Fonts folder

5) The zoom for pages is not remembered. If I zoom to 200%, when I click on the next page it’s back to 10%. So there are next to no preferences. If I zoom in, the recognize shortcut does not work anymore. So I have to go to the menu for every page. Too many clicks for a thing as simple as this. And yes – the shortcut can’t be pressed with a single hand. Why not attach it to a single key, like F5? Why on earth should I press 3 buttons?

See video http://www.youtube.com/watch?v=MjYNvHm_uyw

6) Let’s ignore Latin text. In scientific literature (and we worry about Sanskrit books, not Sanskrit newspapers) there happen to be footnotes. Many of which are in Latin letters. It would be huge progress if, after encountering a lot of “bad” Devanagari, it would stop producing nonsense OCR at all. If it manages 800 ligatures, it can manage 30 Latin letters, right? At least to kill ’em.

No OCR of non-Devanagari

7) Initial importing of images can take a few hours. It does not matter much, but for a 1020-page dictionary one must be patient. Speed is not too high, and I don’t mean the recognition, just the loading of .jpgs. An option to import .pdf would be nice as well, because exporting .jpg from a .pdf means lost pixels and worse OCR results.

8) The icons and UI look like they were built in 1999. If Oliver would agree, I could draw icons for free in the same style as http://www.fatcow.com/free-icons just to make it look like it’s not ’99.

9) If you don’t mark a text block before starting recognition, the program automatically uses layout analysis to detect the text blocks.

Does this mean that if I draw the layout and don’t touch anything, it will be applied to all pages?

It is the best tool out there. The others are much worse. I’ve scanned thousands of books. I’ve OCRed hundreds of them. Believe me. But without batch features even the best Sanskrit OCR tool does not make much sense.

On the DSC website we see that it’s possible to add a “dirty” OCR layer of Devanagari to book scans. To see it not online, but offline and in .pdf books, would be a revolution. But we are not yet ready for it. And there is not a strong enough demand. Searching Devanagari in .pdf is still a pain, even with a “clean” Unicode text.