Abhijit Speaks :

On Life, Technology, Stocks and Movies

On Infrastructure for Indian Language Computing (written for i.t. Magzine)


(A draft copy of an article I wrote for I.T. Magzine, published in Jan 2008. Website: http://www.itmagz.com)

One of my favourite pastimes is playing with Google search. One could even accuse me of “nit-picking”, but Google has pleasantly surprised me a few times. While playing with it one day, I decided to search for “क्रिकेट (kriket)”, and, living up to its promise, Google showed a few search results. Next, I tried the search strings ‘क्रिकिट (krikit)’ and ‘क्रिकट (krikat)’. I was expecting Google to dutifully ask me whether I was looking for “क्रिकेट (kriket)”. To my surprise, and not a pleasant one, Google did not ask that question, nor could I find the Wikipedia page for ‘क्रिकेट (kriket)’ among the first few search results. This is not an exercise in fault-finding at all. But a thought came to my mind: if some of the smartest engineers at Google are not getting this right, there certainly is some distance left to cover.

The point I am trying to arrive at is this: if Indic language content is to become ubiquitous, what specific technical challenges must be overcome, and at what stage are we today? Or, to continue my earlier argument about Indic Language Computing, what infrastructure needs to be in place for it to become a lot more commonplace? Thinking a bit further, it seems to me that the challenges broadly belong to three aspects of information in Indic languages: Information Generation, Information Retrieval and Information Presentation. Let’s see how we are addressing these challenges and what more needs to be done to build a sustainable infrastructure for Indic Language Computing.

Information Generation is essentially being able to type in Indian languages. Since computers were introduced to us mostly in English, we do not have standard mechanisms for input. Of course, we do have typewriters that support various regional languages, but for the common man, learning the typewriter layout is a bit of a barrier. However, we have all been writing SMSes and emails with “aap kaise ho?” type sentences, so in a way we have already addressed this problem. While this kind of ‘hinglish’ (just a name for the sake of argument) is suitable for quick communication, I’d find it hard to read even my favourite sports news that way, let alone books or serious literature. In comes transliteration, and we are somewhat better off. According to Wikipedia, “Transliteration is the practice of transcribing a word or text written in one writing system into another writing system or system of rules for such practice.” Or, to cut a long story short, when I type “aap kaise ho?”, it appears as “आप कैसे हो?” on my screen.

There are a few systems that allow one to do this. The oldest one that I know of is ‘Baraha’, which is a Windows application. It has been around since the days of ‘Desktop Applications’, but we are now in the world of “Web Applications”, and a few are already available. Some names that come to mind: the famous one, of course, is Google Indic Transliteration; then there are others, such as Quillpad, developed by Tachyon Technologies, and ‘scratchpad’, developed by paahijen.com (selfish plug! ;-) ). Each of these systems has chosen a specific model for transliteration, and I will not venture to compare them (primarily because I have personal interests). But broadly, they all address the problem of ‘Information Generation’, and they will improve over time. There still is a small limitation, though: they require knowledge of the English alphabet. So I am not sure whether this is the ideal system for someone who is new to computers. But so far so good; we at least have something usable.
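To make the “aap kaise ho?” example a little more concrete, here is a minimal sketch of rule-based transliteration in Python. It is only a toy: the tiny mapping tables and the function name `transliterate` are my own assumptions for illustration, it expects lowercase input, and it is not how Baraha, Google, Quillpad or any other system mentioned above actually works. It only shows the basic idea: match the longest Latin chunk first, write a vowel as a vowel sign when it follows a consonant, and join two adjacent consonants with a virama.

```python
# Toy Latin-to-Devanagari transliteration (illustrative tables, far from complete).
CONSONANTS = {"k": "क", "p": "प", "s": "स", "h": "ह", "r": "र", "t": "ट"}

# Each vowel maps to (independent form, dependent vowel sign used after a consonant).
VOWELS = {
    "a": ("अ", ""),         # inherent vowel: no sign is written after a consonant
    "aa": ("आ", "\u093e"),  # sign AA
    "i": ("इ", "\u093f"),   # sign I
    "e": ("ए", "\u0947"),   # sign E
    "ai": ("ऐ", "\u0948"),  # sign AI
    "o": ("ओ", "\u094b"),   # sign O
}

VIRAMA = "\u094d"  # joins two consonants with no vowel between them


def transliterate(text: str) -> str:
    """Transliterate lowercase Latin text using the toy tables above."""
    out = []
    prev_consonant = False
    i = 0
    while i < len(text):
        step = 1
        for size in (2, 1):  # longest match first, so "aa" wins over "a"
            chunk = text[i:i + size]
            if chunk in VOWELS:
                independent, sign = VOWELS[chunk]
                out.append(sign if prev_consonant else independent)
                prev_consonant = False
                step = len(chunk)
                break
            if chunk in CONSONANTS:
                if prev_consonant:
                    out.append(VIRAMA)
                out.append(CONSONANTS[chunk])
                prev_consonant = True
                step = len(chunk)
                break
        else:
            # Anything we do not know (spaces, punctuation) passes through unchanged.
            out.append(text[i])
            prev_consonant = False
        i += step
    return "".join(out)


print(transliterate("aap kaise ho?"))  # आप कैसे हो?
```

Real systems have to deal with much more than this (ambiguous spellings, conjuncts, nasalisation, word-level guessing), which is exactly where the different models chosen by each system come in.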

Information Retrieval. Being able to search for information in Hindi is itself a big step, in my opinion. But strictly speaking, the level at which we are able to do it is not yet on par with what one can do in a language like English. Taking the example above, one should try the search terms “cricket” and “crickit”. While the search engine correctly understands, when one types “crickit”, that the user is perhaps curious about cricket, it does not do so in Hindi. This is perhaps because Indic languages do not lend themselves to some of the algorithms commonly used for finding related words, such as ‘Levenshtein Distance’, or at least because these algorithms are not being applied in a suitable way here. I am not an expert in this area and will leave the question open for now, but as a user, I can certainly see some of the work that remains to be done.
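For the curious, here is a small sketch of what ‘Levenshtein Distance’ computes, written in Python. It counts the minimum number of single-character insertions, deletions and substitutions needed to turn one string into another. The function names `levenshtein` and `suggest`, the threshold of 1, and the tiny word list are my own assumptions for illustration; this says nothing about how Google or any other search engine actually produces its suggestions. Note that Python compares Unicode code points here, so a Devanagari vowel sign counts as one character.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance between two strings, compared code point by code point."""
    if len(a) < len(b):
        a, b = b, a  # keep the shorter string in the inner loop
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(
                previous[j] + 1,         # deletion
                current[j - 1] + 1,      # insertion
                previous[j - 1] + cost,  # substitution (or match)
            ))
        previous = current
    return previous[-1]


def suggest(query: str, known_words: list[str], max_distance: int = 1) -> list[str]:
    """Return known words within max_distance edits of the query (hypothetical helper)."""
    return [w for w in known_words if levenshtein(query, w) <= max_distance]


print(levenshtein("crickit", "cricket"))   # 1
print(levenshtein("क्रिकिट", "क्रिकेट"))   # 1
print(suggest("क्रिकट", ["क्रिकेट", "किताब"]))
```

At least in this toy setting, both misspellings from my experiment are one edit away from “क्रिकेट”, so the distance itself is computable for Devanagari. Whether code points are the right unit for Indic scripts, and how such a measure should be combined with everything else a search engine does, is the part I would leave to the experts.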

Information Presentation is the final aspect I would like to deal with. Some of my discomfort comes primarily from the fact that I am not able to open a lot of news sites on my Ubuntu laptop. No, I am not one of the Microsoft haters or freedom fanatics, but I am particularly disappointed that I, as a user, am not considered at all by these news sites. This might have been acceptable when the Unicode standard was not popular and there were not enough Unicode-capable fonts available, but that has changed substantially since then. And offering web based content “which will work on specific versions of specific operating systems and specific browsers only, is against the very nature of web.” While the technology today makes it possible to do better, we need an initiative to move in the right direction.

To conclude: we have taken the first few essential steps, and we now need a concerted effort by the interested parties to fill in the missing pieces and create an infrastructure that will enable the growth of Indic language content. This will naturally attract more vernacular users.

Written by gabhijit

November 8, 2008 at 12:37 pm
