CDAC Noida
Gyan Nidhi: Multilingual Parallel Corpus

 

‘GyanNidhi’, which stands for ‘Knowledge Resource’, is a parallel corpus in 11 Indian languages, a project sponsored by TDIL, DIT, MC&IT, and the Government of India.

What is it? A multilingual parallel text corpus contains the same text translated into more than one language.

What does GyanNidhi contain? The GyanNidhi corpus consists of text in English and 11 Indian languages (Hindi, Punjabi, Marathi, Bengali, Oriya, Gujarati, Telugu, Tamil, Kannada, Malayalam, Assamese). It aims to digitise 1 million pages altogether, with at least 50,000 pages in each Indian language and in English.

Sources for the Parallel Corpus

  • National Book Trust India

  • Sahitya Akademi

  • Navjivan Publishing House

  • Publications Division

  • SABDA, Pondicherry

  • Pustak Mahal

Prabandhika: Corpus Manager

  • Categorisation of corpus data in various user-defined domains

  • Addition/Deletion/Modification of Indian language data files in HTML/RTF/TXT/XML format

  • Selection of languages for viewing parallel corpus with data aligned up to paragraph level

  • Automatic selection and viewing of parallel paragraphs in multiple languages

  • Abstract and Metadata

  • Printing and saving parallel data in Unicode format

Platform            : Windows
Data Encoding       : XML, Unicode
Portability of Data : Data in XML format supports various platforms
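
To illustrate what paragraph-level alignment in portable XML could look like, here is a minimal sketch in Python. The element names (`document`, `para`, `text`), the `lang` attribute, and the sample sentences are assumptions made for this example; the page does not describe the actual GyanNidhi schema, and this is not Prabandhika's implementation.

```python
import xml.etree.ElementTree as ET

# Hypothetical layout: one <para> per aligned paragraph, one <text> per language.
# The real GyanNidhi XML schema is not documented on this page.
SAMPLE = """<document id="sample">
  <para id="1">
    <text lang="en">Knowledge is a resource.</text>
    <text lang="hi">ज्ञान एक संसाधन है।</text>
  </para>
  <para id="2">
    <text lang="en">Books preserve knowledge.</text>
    <text lang="hi">पुस्तकें ज्ञान को सुरक्षित रखती हैं।</text>
  </para>
  <para id="3">
    <text lang="en">This paragraph has no Hindi version yet.</text>
  </para>
</document>"""


def aligned_paragraphs(xml_text, languages):
    """Return one tuple per paragraph that exists in every requested language."""
    root = ET.fromstring(xml_text)
    rows = []
    for para in root.findall("para"):
        # Map language code -> paragraph text for this aligned unit.
        by_lang = {t.get("lang"): (t.text or "").strip()
                   for t in para.findall("text")}
        # Keep the paragraph only if all selected languages are present.
        if all(lang in by_lang for lang in languages):
            rows.append(tuple(by_lang[lang] for lang in languages))
    return rows


if __name__ == "__main__":
    for english, hindi in aligned_paragraphs(SAMPLE, ["en", "hi"]):
        print(english, "|", hindi)
```

Selecting a different list of language codes changes which columns are returned, which mirrors the idea of choosing languages for side-by-side viewing.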



Applications of Gyan Nidhi
  • Automatic Dictionary extraction

  • Creation of Translation memory

  • Example Based Machine Translation (EBMT)

  • Language research study and analysis

  • Language Modeling
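
As a rough illustration of the translation memory application listed above, the following Python sketch builds a small lookup table from aligned segment pairs and retrieves a stored translation for an exact or near-exact match. The function names, the similarity cutoff, and the sample sentences are assumptions for this example, not part of GyanNidhi or any C-DAC tool.

```python
import difflib


def build_translation_memory(pairs):
    """Map each normalised source segment to its stored translation."""
    return {source.strip().lower(): target for source, target in pairs}


def lookup(memory, query, cutoff=0.8):
    """Return the translation of the closest stored segment, or None."""
    key = query.strip().lower()
    if key in memory:
        return memory[key]
    # Fuzzy match: accept the nearest stored segment above the cutoff.
    close = difflib.get_close_matches(key, list(memory), n=1, cutoff=cutoff)
    return memory[close[0]] if close else None


if __name__ == "__main__":
    # Illustrative English-Hindi pairs, not taken from the corpus.
    memory = build_translation_memory([
        ("Knowledge is a resource.", "ज्ञान एक संसाधन है।"),
        ("Books preserve knowledge.", "पुस्तकें ज्ञान को सुरक्षित रखती हैं।"),
    ])
    print(lookup(memory, "knowledge is a resource"))
```

A production translation memory would typically match at sentence level and record match scores; this sketch only shows the basic store-and-retrieve idea.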

© 2008 C-DAC. All rights reserved.