AN ANALYTICAL APPROACH TO MEASURING SENTIMENT MISCLASSIFICATION FOR INDIAN ENGLISH AND HINGLISH TO ASSESS THE ‘DIALECT BIAS’ IN LARGE LANGUAGE MODELS (LLMs)

Om Venkatesh Sharma

Pages: 47-53

Vol 15, Issue 1, Jan-Jun, 2022

Date of Submission: 2022-01-21 | Date of Acceptance: 2022-03-15 | Date of Publication: 2022-04-18

Abstract

Large language models (LLMs) are increasingly relied upon for sentiment analysis, yet they tend to underperform on dialectal and code-mixed variants of English. This paper investigates dialect bias in sentiment classification for Indian English (IndE) and Hinglish (Romanized Hindi-English code-mixing), compared with Standard American English (SAE). We curate 2,000 samples from each of the three dialects (SAE, IndE, and Hinglish), balanced across positive, neutral, and negative sentiment and manually annotated by bilingual experts (Cohen's κ ≥ 0.8). Zero-shot sentiment prompts are evaluated on GPT-3.5 and GPT-4, alongside fine-tuned Indic-focused models (MuRIL, IndicBERT) and a BERT-base baseline.
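The κ ≥ 0.8 inter-annotator agreement criterion stated above can be checked with a short script. The sketch below is illustrative only and is not taken from the paper: it assumes two annotators per sample and the three-way positive/neutral/negative label set, and uses scikit-learn's cohen_kappa_score.

    # Illustrative sketch (not from the paper): checking a Cohen's kappa >= 0.8
    # inter-annotator agreement threshold such as the one reported in the abstract.
    from sklearn.metrics import cohen_kappa_score

    LABELS = ["positive", "neutral", "negative"]

    def agreement_ok(annotator_a, annotator_b, threshold=0.8):
        # Returns the kappa score and whether it meets the threshold.
        kappa = cohen_kappa_score(annotator_a, annotator_b, labels=LABELS)
        return kappa, kappa >= threshold

    # Toy example with ten doubly-annotated samples (hypothetical data).
    a = ["positive", "neutral", "negative", "negative", "positive",
         "neutral", "positive", "negative", "neutral", "positive"]
    b = ["positive", "neutral", "negative", "positive", "positive",
         "neutral", "positive", "negative", "neutral", "positive"]
    kappa, ok = agreement_ok(a, b)
    print(f"Cohen's kappa = {kappa:.3f}, meets 0.8 threshold: {ok}")

In a per-dialect setting, the same check would simply be run separately on the SAE, IndE, and Hinglish subsets; the paper's actual annotation workflow may differ.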

