

Code-switching is the practice of moving back and forth between two languages in spoken or written communication. In this paper, we address the problem of word-level language identification in code-switched sentences. The second stage uses the mixing-combination class predicted in the previous stage to identify the language of each word. This simplification is essential; otherwise the number of states in the model would be huge and the resulting predictions very noisy. We further apply Conditional Random Fields (CRF) in the second stage to improve the performance of word-level language identification. Our methods improve the F-score of word-level language identification by over 10% compared to the baseline.

The text exchanged in social media conversations is often noisy, containing a mixture of stylistic and misspelt variations of the original words. Mixed-script text is also prevalent among social media users. Standard NLP techniques applied to such data, such as POS tagging and named entity recognition, suffer because of the noisy nature of the input. The core of the problem is identifying the language, from among a set of languages, by looking at small fragments of text. The current work addresses language identification at the word level in mixed-script scenarios, where all the text is written in Roman script but the words are transliterations of native-language words into English. We propose a two-stage approach for word-level language identification. In the first stage, the mixing language combination is identified using character n-grams of the sentence.
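
To make the first stage concrete, the following is a minimal sketch of sentence-level classification from character n-grams. It is not the implementation used in this work: the function names, the n-gram range (2 to 3), the space padding, and the simple overlap score are all assumptions chosen for illustration, and a real system would likely use a trained classifier over the same kind of features.

```python
from collections import Counter


def char_ngrams(text, n_min=2, n_max=3):
    """Count character n-grams of a sentence for n in [n_min, n_max].

    The sentence is lowercased and padded with one space on each side so
    that word-initial and word-final n-grams are captured (an assumption).
    """
    padded = f" {text.lower()} "
    grams = Counter()
    for n in range(n_min, n_max + 1):
        for i in range(len(padded) - n + 1):
            grams[padded[i:i + n]] += 1
    return grams


def train_profiles(labelled_sentences):
    """Accumulate one n-gram profile per mixing-combination label.

    `labelled_sentences` is an iterable of (sentence, label) pairs, where
    each label names a hypothetical language combination.
    """
    profiles = {}
    for sentence, label in labelled_sentences:
        profiles.setdefault(label, Counter()).update(char_ngrams(sentence))
    return profiles


def predict_combination(sentence, profiles):
    """Return the label whose profile overlaps most with the sentence.

    The score is the sentence's n-gram counts weighted by each n-gram's
    relative frequency in the profile (an illustrative choice only).
    """
    grams = char_ngrams(sentence)

    def overlap(label):
        profile = profiles[label]
        total = sum(profile.values())
        return sum(count * profile[g] / total for g, count in grams.items())

    return max(profiles, key=overlap)
```

Under these assumptions, `train_profiles` would be called once on sentences labelled with their language combination, and `predict_combination` would then supply the class that the second stage uses to restrict the word-level labels.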
