Sanskrit NLP Tools

To do: write up how I see https://sites.google.com/site/sanskritcode/survey. In most cases there is only one real solution; the others are theoretical abstractions that will never be realized in practice. Still, it is a good starting point.

Natural language processing in general is a thriving field, with open-source projects such as OpenNLP.

Dictionaries and thesauruses

  1. Digitize dictionaries (D, B), comparative dictionaries (T), sUtras and thesauruses (H), enable online search (B, 2B), and make them available on phones as convenient apps (S, A). Some online dictionaries enable collaborative editing. They do have the following limitations:
    • The database updated in this manner is not publicly available.
    • They don’t currently provide an online API (application programming interface) that others could easily build on.
  2. Concordance tools (J), perhaps using word-roots (D).
  3. Mobile dictionaries: see this.
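
As a rough illustration of what a concordance tool does, the sketch below builds a word-to-lines index over a tiny made-up sample. The function name `build_concordance`, the Harvard-Kyoto transliteration, and the sample sentences are all assumptions made for this example; a real tool would first undo sandhi and probably index by word-root rather than by surface form.

```python
from collections import defaultdict


def build_concordance(lines):
    """Map each surface word to the (line_number, line) pairs containing it.

    Hypothetical helper: words are split on whitespace only, so sandhi-joined
    words and inflected forms of one root are NOT grouped together.
    """
    index = defaultdict(list)
    for number, line in enumerate(lines, start=1):
        # set() ensures a word repeated within one line is listed once.
        for word in set(line.split()):
            index[word].append((number, line))
    return dict(index)


# Toy sample in Harvard-Kyoto transliteration (invented for illustration).
sample = [
    "rAmaH vanaM gacchati",
    "sItA vanaM pazyati",
]

concordance = build_concordance(sample)
for number, line in concordance["vanaM"]:
    print(number, line)
```

Indexing by root instead would require a morphological analyzer of the kind surveyed in the next section.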

Grammar and Parsing

  1. Develop tools which model and illustrate the application of various sandhi (F, H, MH, C, C2), prAtipadika declension (D, I=H1, H2, F, B), dhAtu conjugation (F, I=H, 3B, 5B, H2, Dl), kRdanta (I=H1, H2, F) and taddhitAnta (I=H1, H2) rules. These can in turn be used to analyze inflected words (1F, 2F, I=H, B, Dl), do sandhi analysis (1H, 2H), produce dictionaries of inflected words (F) and find concordances (G).
    • Inflected word generation is usually based on the ‘word and paradigm’ model, close to works such as ruupa chandrikaa, which gives the naamaruupaavalii for ‘typical’ words ending in different var.nas in different lingas. This approach has been found to be very useful and accurate in the analysis of classical Sanskrit texts.
    • Limitation: as a generative model, the above is not perfect. Because it is not firmly based on pANini’s rules (which separate saMskR^ita from apabhraMShA), it may generate wrong inflections.
  2. Tools to help understand grammar sUtras (H, B, 3V, T, D, A, Ar).
  3. Domain-specific languages tailored for saMskR^ita grammar are beginning to appear (V).
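
The ‘word and paradigm’ model mentioned above can be sketched in a few lines: take the endings of one ‘typical’ word and attach them to any stem of the same class. The table below is a hypothetical, partial paradigm for masculine stems ending in short a (Harvard-Kyoto transliteration), and `inflect` is an illustrative name, not a function from any of the surveyed tools. It also shows one way such a model can produce a wrong form when a rule outside the paradigm applies.

```python
# Partial, illustrative paradigm for masculine a-stems (e.g. deva).
# Keys are (case, number); values replace the final "a" of the stem.
A_STEM_MASCULINE = {
    ("nominative", "singular"): "aH",
    ("nominative", "dual"): "au",
    ("nominative", "plural"): "AH",
    ("accusative", "singular"): "am",
    ("accusative", "dual"): "au",
    ("accusative", "plural"): "An",
    ("instrumental", "singular"): "ena",
}


def inflect(stem, case, number, paradigm=A_STEM_MASCULINE):
    """Attach the paradigm ending for (case, number) to an a-final stem."""
    if not stem.endswith("a"):
        raise ValueError("this sketch only handles stems ending in short 'a'")
    return stem[:-1] + paradigm[(case, number)]


print(inflect("deva", "nominative", "plural"))      # devAH
print(inflect("deva", "instrumental", "singular"))  # devena

# Pure table lookup knows nothing about the retroflexion of n after r,
# so it yields "rAmena" where the accepted form is "rAmeNa".
print(inflect("rAma", "instrumental", "singular"))  # rAmena
```

A rule-based generator would apply the n-to-N change as a separate step after suffixation, which is the kind of behaviour a paradigm table alone cannot express.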

Parsing and Translation

  1. Mechanically parsing Sanskrit text (H) and doing part-of-speech tagging (D). Producing and standardizing Sanskrit corpora (I, Ms..).
  2. Translating Sanskrit into a more familiar language (F=H, 2H).

Prosody

  1. Tools to identify metre (D, M, C, C2).
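
The first step of metre identification is scanning a line into light (laghu) and heavy (guru) syllables. The sketch below does this for Harvard-Kyoto input under simplified rules: a syllable is heavy if its vowel is long, or if the vowel is followed by anusvAra, visarga, or two or more consonants. The name `scan` and the tokenizer are assumptions for this example; matching the resulting pattern against a table of known metres is left out, and a single consonant at the very end of a line is not treated specially.

```python
import re

# Harvard-Kyoto tokens: diphthongs and long vocalic r first, then aspirated
# consonants (counted as one consonant), then any single letter.
TOKEN = re.compile(r"ai|au|RR|lR|[kgcjTDtdpb]h|[A-Za-z]")

VOWELS = {"a", "A", "i", "I", "u", "U", "R", "RR", "lR", "e", "ai", "o", "au"}
LONG_VOWELS = {"A", "I", "U", "RR", "e", "ai", "o", "au"}


def scan(line):
    """Return one letter per syllable: 'G' for guru (heavy), 'L' for laghu."""
    tokens = TOKEN.findall(line)  # spaces and punctuation are skipped
    weights = []
    for i, token in enumerate(tokens):
        if token not in VOWELS:
            continue
        # Collect everything between this vowel and the next one.
        following = []
        for nxt in tokens[i + 1:]:
            if nxt in VOWELS:
                break
            following.append(nxt)
        heavy = (
            token in LONG_VOWELS
            or len(following) >= 2
            or "M" in following  # anusvAra
            or "H" in following  # visarga
        )
        weights.append("G" if heavy else "L")
    return "".join(weights)


print(scan("dharmakSetre kurukSetre"))  # GGGGLGGG
```

Because consonant clusters are counted across word boundaries, the input should be one metrical line at a time.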

Script and Input

  1. Text-to-speech tools (G, D). See also hindii.
  2. Transliteration tools (S, Ls, H, W, B, B, Gv, V, D, C, Rd, Rp, Ar…)
  3. IMEs
    1. Linux: m17n with ibus. Suggested Ubuntu packages: ibus sanskrit ibus-m17n ibus-qt4 m17n-db m17n-contrib ttf-indic-fonts. See our note here.
    2. Windows (I, G, M, B..): tools to input Indic script directly, without copy-pasting.
    3. Mac: See here. Mac UIM lets you use m17n.
    4. Other lists are available at [N, N2, W, W2].
  4. Some tools and websites (1W, 2W) enable viewing text in the script of the reader’s choice.
  5. Also see this hindI wiki page.
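
At their core, most of the transliteration tools listed above apply a mapping table between two schemes. Below is a minimal sketch converting Harvard-Kyoto to IAST by longest-match replacement. The function name `hk_to_iast` is made up for this example, and the table covers only the letters where the two schemes differ; everything else passes through unchanged.

```python
# Harvard-Kyoto sequences that differ from IAST; all other characters
# (a, i, u, e, o, k, g, ...) are identical in both schemes.
HK_TO_IAST = {
    "A": "ā", "I": "ī", "U": "ū",
    "R": "ṛ", "RR": "ṝ", "lR": "ḷ", "lRR": "ḹ",
    "M": "ṃ", "H": "ḥ",
    "G": "ṅ", "J": "ñ",
    "T": "ṭ", "D": "ḍ", "N": "ṇ",
    "z": "ś", "S": "ṣ",
}

# Try longer keys first so that "RR" is not read as two separate "R"s.
_KEYS = sorted(HK_TO_IAST, key=len, reverse=True)


def hk_to_iast(text):
    """Convert Harvard-Kyoto text to IAST, leaving unknown characters as-is."""
    out = []
    i = 0
    while i < len(text):
        for key in _KEYS:
            if text.startswith(key, i):
                out.append(HK_TO_IAST[key])
                i += len(key)
                break
        else:
            # No table entry starts here: copy the character unchanged.
            out.append(text[i])
            i += 1
    return "".join(out)


print(hk_to_iast("saMskRta"))  # saṃskṛta
print(hk_to_iast("kRSNa"))     # kṛṣṇa
```

Conversion to an Indic script such as Devanagari needs more than a flat table, since consonants carry an inherent vowel and vowels have separate independent and dependent signs.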

Fonts, OCR and Scanning

  1. Sanskrit optical character recognition (OCR) tools (T, A, D1, B, X).
  2. Formal attempts at encoding Indian scripts in Unicode (B, I), and fonts. Tools to convert old custom fonts to Unicode (P, T).
  3. DLI downloaders
    1. See here.

Text Archives

  • Note that we have focused on computer programs above. More general, curated collections of links, texts and corpora are available elsewhere (F, N, 2N, D, S…). Other summaries are also available (I), as are tools to download from those corpora (A).

Desiderata

In some of the cases above, source code for the Sanskrit tools is available (the links in bold are said to be so; our gratitude!), but much good software is not open source, and there is quite a bit of duplicated effort. Besides the limitations noted above, what is conspicuously missing are tools directed at meeting important needs of the popular spoken Sanskrit movement, especially as we increasingly interact with information through computers and the internet.

  1. Reading documents and webpages written in other languages in saMskRRita (there is no Google Translate-like tool at present, nor will there be one in the near future).
  2. Sanskrit UI versions of commonly used software don’t exist (unlike for Arabic, Hebrew..).
  3. There are no good Sanskrit browser scripts or extensions to do common things like look up word meanings with a click or a mouse-over.
  4. There is no effort to make generating Sanskrit content easy. E.g., the Sanskrit Wikipedia is nowhere close to the English version. The same goes for Wiktionary.