THE INFLUENCE OF TEXT PREPROCESSING METHODS AND TOOLS ON CALCULATING TEXT SIMILARITY

Đorđe Petrović; Milena Stanković

doi:10.22190/FUMI1905973D

First page

973

Last page

994

Abstract

Text mining depends to a great extent on text preprocessing techniques. The preprocessing methods and tools used to prepare texts for further mining can be divided into those that are language-dependent and those that are not. The subject of this research was the analysis of the influence of these methods and tools on further text mining. We first focused on their influence on the reduction of the vector space model for the multidimensional representation of text documents. We then analyzed their influence on calculating text similarity, which is the focus of this research. The conclusion we reached is that the implementation of various text preprocessing methods for the Serbian language, used to reduce the vector space model for the multidimensional representation of text documents, achieves the required results. However, the implementation of various text preprocessing methods specific to the Serbian language for the purpose of calculating text similarity can lead to great differences in the results.
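The two effects described in the abstract can be illustrated with a minimal sketch. It is not the authors' method or tooling: the stop-word list, the suffix list, the helper names, and the two English sample sentences are all hypothetical, chosen only to show how preprocessing can shrink a bag-of-words vocabulary and shift a cosine similarity score.

```python
import math
from collections import Counter

# Hypothetical toy resources for illustration only; they are NOT the
# Serbian stemmers or stop-word lists evaluated in the paper.
STOP_WORDS = {"the", "a", "is", "on", "of"}
SUFFIXES = ("ing", "ed", "s")


def preprocess(tokens):
    """Lowercase, strip punctuation, drop stop words, strip one suffix."""
    result = []
    for token in tokens:
        token = token.lower().strip(".,;:!?")
        if not token or token in STOP_WORDS:
            continue
        for suffix in SUFFIXES:
            # Keep at least three characters of the stem.
            if token.endswith(suffix) and len(token) - len(suffix) >= 3:
                token = token[: -len(suffix)]
                break
        result.append(token)
    return result


def cosine(tokens_a, tokens_b):
    """Cosine similarity between two term-frequency vectors."""
    vec_a, vec_b = Counter(tokens_a), Counter(tokens_b)
    dot = sum(vec_a[term] * vec_b[term] for term in vec_a)
    sq_a = sum(count * count for count in vec_a.values())
    sq_b = sum(count * count for count in vec_b.values())
    norm = math.sqrt(sq_a * sq_b)
    return dot / norm if norm else 0.0


doc1 = "The students walked home."
doc2 = "A student walks home."

raw1, raw2 = doc1.split(), doc2.split()
clean1, clean2 = preprocess(raw1), preprocess(raw2)

vocab_raw = set(raw1) | set(raw2)
vocab_clean = set(clean1) | set(clean2)

print(len(vocab_raw), cosine(raw1, raw2))
print(len(vocab_clean), cosine(clean1, clean2))
```

On this toy input the vocabulary drops from 7 terms to 3, and the cosine similarity rises from 0.25 to 1.0. A different suffix list or stop-word list would give different numbers, which is the kind of sensitivity the abstract reports for similarity calculations.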

Keywords

Text preprocessing; text mining; text similarity.

ISSN 2406-047X (Online)
