Comparison of the Influence of Different Normalization Methods on Tweet Sentiment Analysis in the Serbian language

Adela Ljajić, Ulfeta Marovac, Milena Stanković

First page
683
Last page
696

Abstract


Given the growing need to process texts quickly and extract information from data for various purposes, correct normalization that contributes to better and faster processing is of great importance. This paper compares different methods for normalizing short texts (tweets). The comparison is illustrated using text sentiment analysis as an example. The results of applying the different normalizations are presented, taking into account both time complexity and the classification accuracy of the sentiment algorithm. It is shown that normalization by cutting words to n-grams gives better or similar results compared with language-dependent normalizations. When time complexity is also taken into account, we conclude that this language-independent normalization gives optimal results in the classification of short informal texts.
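As a rough illustration of the language-independent normalization mentioned in the abstract, the sketch below assumes that "cutting to n-gram" means keeping only the first n characters of each token. That reading, the function names `cut_to_ngram` and `normalize_tweet`, the regular-expression tokenizer, and the default n = 4 are all assumptions made for this example; they are not taken from the paper.

```python
import re
from typing import List


def cut_to_ngram(token: str, n: int = 4) -> str:
    """Keep only the first n characters of a token (assumed meaning of 'cutting to n-gram')."""
    return token[:n]


def normalize_tweet(text: str, n: int = 4) -> List[str]:
    """Lowercase the text, split it into word tokens, and cut each token to n characters.

    The tokenizer is a hypothetical choice: \\w+ matches Unicode word
    characters in Python 3, so Serbian letters such as č, ć, š, ž, đ are kept.
    """
    tokens = re.findall(r"\w+", text.lower())
    return [cut_to_ngram(token, n) for token in tokens]


if __name__ == "__main__":
    # Inflected forms of the same word collapse to one shared prefix.
    print(normalize_tweet("Odlični filmovi!"))
    print(normalize_tweet("odličan film"))
```

Unlike a stemmer or lemmatizer, this step needs no language-specific rules or dictionaries, and it costs one slice per token, which is the kind of trade-off between accuracy and processing time that the abstract refers to.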


Keywords

sentiment analysis, normalization, stemming, n-gram, lemmatization, data mining

© University of Niš | Created on November, 2013
ISSN 0352-9665 (Print)
ISSN 2406-047X (Online)