Indian language search engine unveiled by DEITY

The Department of Electronics and Information Technology (DEITY) unveiled an Internet search engine, Sandhan, yesterday to help users find tourism-related information across websites. Sandhan returns search results for user queries in five Indian languages: Bengali, Hindi, Marathi, Tamil and Telugu.


Launched by J Satyanarayana, Secretary of DEITY, Sandhan was developed over six years by 120 researchers from 12 institutions, led by Dr. Pushpak Bhattacharyya under the Technology Development for Indian Languages (TDIL) programme of DEITY. According to an official release, the project aims to meet users' need for information held in text documents on the web.

The new search engine by DEITY




A query entered in one of the five languages is processed to retrieve a set of relevant documents in the same language, drawn from tourism-domain data crawled from the World Wide Web. The retrieved documents are then presented to the user as a list ordered by relevance.
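The flow described above (filter by the query's language, score each document, return an ordered list) can be illustrated with a minimal sketch. It is not Sandhan's actual implementation: the release does not describe the ranking method, so plain term-frequency scoring is assumed here, and the corpus, URLs and the names `tokenize` and `search` are hypothetical.

```python
from collections import Counter

# Hypothetical in-memory "crawl": each document carries a language tag.
CORPUS = [
    {"url": "example.org/hi/agra", "lang": "hi", "text": "आगरा ताजमहल आगरा पर्यटन"},
    {"url": "example.org/hi/jaipur", "lang": "hi", "text": "जयपुर पर्यटन किला"},
    {"url": "example.org/ta/madurai", "lang": "ta", "text": "மதுரை கோயில் சுற்றுலா"},
]


def tokenize(text):
    """Split on whitespace; a real system would use language-aware analysis."""
    return text.split()


def search(query, lang, corpus):
    """Return (score, url) pairs for documents in `lang`, best match first.

    The score is an assumed stand-in for relevance: the number of times
    the query terms occur in the document.
    """
    query_terms = tokenize(query)
    results = []
    for doc in corpus:
        if doc["lang"] != lang:
            continue  # only documents in the query's language are considered
        counts = Counter(tokenize(doc["text"]))
        score = sum(counts[term] for term in query_terms)
        if score > 0:
            results.append((score, doc["url"]))
    # Highest score first; ties broken by URL for a stable order.
    results.sort(key=lambda item: (-item[0], item[1]))
    return results
```

Calling `search("आगरा पर्यटन", "hi", CORPUS)` in this sketch would rank the Agra page above the Jaipur page and never return the Tamil page, since it is filtered out before scoring.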


With this service, the government aims to plug the wide gap that exists "in fulfilling the information needs of Indians not conversant with English- estimated at 90 percent of the population".


At the unveiling, the Secretary said that six years of research had resulted in this milestone, but that it was only the beginning. He said that for the search engine to succeed, it is equally important to develop and promote content in Indian languages. He added that real success would come when even village-level e-services are available in local languages.


Although designed mainly for tourism, sectors such as business and academia would also benefit from Sandhan. It can also be deployed as part of e-governance and e-learning. 


The following are the salient features of Sandhan: 

  • The user can submit a query using either the InScript keyboard or the phonetic keyboard. For InScript, users can type directly in that keyboard layout or use an onscreen keyboard to submit the query.
  • It can process a query based on its language and retrieve results only in that language.
  • Snippets generated for each of the retrieved documents help the user understand the context of the query terms in that document.
  • A summary is generated for each retrieved document. This helps the user get an idea of the document's overall content without opening it.
  • An additional URL-based semantic search facility is provided for the Tamil language.
  • Ten results are displayed at a time to improve readability.
  • Many Indian-language web pages use custom fonts, which makes it difficult for the system to retrieve documents. Sandhan uses a font transcoder that converts custom-font text into Unicode for processing.
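Two of the features above, handling a query according to its language and showing ten results at a time, can be sketched as follows. This is only an illustration under stated assumptions: the release does not say how Sandhan identifies a query's language, so detection by Unicode script block is assumed, and `detect_script` and `page_of` are hypothetical names. Script detection alone cannot tell Hindi from Marathi, since both are written in Devanagari.

```python
# Unicode block ranges for the scripts of the five supported languages.
# Hindi and Marathi share Devanagari, so they map to a single entry.
SCRIPT_RANGES = {
    "Devanagari": (0x0900, 0x097F),
    "Bengali": (0x0980, 0x09FF),
    "Tamil": (0x0B80, 0x0BFF),
    "Telugu": (0x0C00, 0x0C7F),
}


def detect_script(query):
    """Return the script with the most characters in `query`, or None."""
    counts = {name: 0 for name in SCRIPT_RANGES}
    for ch in query:
        code_point = ord(ch)
        for name, (low, high) in SCRIPT_RANGES.items():
            if low <= code_point <= high:
                counts[name] += 1
                break
    best = max(counts, key=counts.get)
    return best if counts[best] > 0 else None


def page_of(results, page, page_size=10):
    """Return one page of results; pages are numbered from 1."""
    start = (page - 1) * page_size
    return results[start:start + page_size]
```

For example, `detect_script("தமிழ்")` identifies the Tamil script, and `page_of(results, 2)` would return results 11 to 20 of a longer list.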


Sandhan is a mission-mode project of a consortium of academic and research institutions and industry partners. The institutes involved are IIT Bombay (Consortium leader), CDAC Noida (Co-Consortium leader), IIT Kharagpur, Dhirubhai Ambani Institute of Information and Communication Technology Gandhinagar, Anna University-Centre for Electronics, Anna University-Knowledge-based Computing Centre, CDAC Pune, Gauhati University, Indian Institute of Information Technology Bhubaneswar, International Institute of Information Technology Hyderabad, ISI Kolkata and Jadavpur University. The project was conceptualised, evolved and funded by DEITY as a national-level project in the emerging area of Information Retrieval and Access in Indian Languages.


The Sandhan project has been put together by the TDIL Programme, a flagship programme of DEITY involved in the research, development, standardisation and proliferation of language technology in India for the 22 constitutionally recognised Indian languages. The TDIL Programme is also associated with international standardisation bodies such as the Unicode Consortium, W3C, IETF and ELRA.

Published Date: Sep 21, 2012 12:50 pm | Updated Date: Sep 21, 2012 12:50 pm