COMPARISON OF DATA MINING ALGORITHMS, INVERTED INDEX SEARCH AND SUFFIX TREE CLUSTERING SEARCH

Miloš Ilić, Dejan Rančić, Petar Spalević

DOI Number: 10.22190/FUACR1603171I
First page: 171
Last page: 185

Abstract


New documents are created every day, and the number of digital documents in the world is growing exponentially. Search engines make these documents easily available to users worldwide. Data mining operates on large data sets and offers data to the end user; it comprises many different techniques and algorithms that allow faster and better search over large amounts of data. Clustering is one of the techniques used in the data mining process: it groups data according to features, or any other property they have in common, so that the search process is faster and the user gets better search results. An inverted index is a structure that also provides fast search, but it does not create clusters or groups of similar data. Instead, it processes all the data in a document and records the occurrences of specific terms in that document. The goal of this paper is to compare these two approaches, inverted index search and suffix tree clustering search. The authors created applications that use the two algorithms and tested them on the same corpus of documents. For both algorithms, the authors present improvements that provide faster search and better search results.
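
To illustrate the inverted index idea described above, the following is a minimal Python sketch, not the authors' application and not the Lucene API. The function names (tokenize, build_inverted_index, search), the simple tokenization rule, and the ranking by raw term count are all assumptions made for illustration only.

```python
import re
from collections import Counter, defaultdict


def tokenize(text):
    """Lowercase the text and split it into alphanumeric terms (assumed rule)."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_inverted_index(documents):
    """Map each term to {doc_id: number of occurrences in that document}."""
    index = defaultdict(dict)
    for doc_id, text in documents.items():
        for term, count in Counter(tokenize(text)).items():
            index[term][doc_id] = count
    return index


def search(index, query):
    """Return ids of documents that contain every query term.

    Results are ordered by the summed occurrence count of the query terms
    (highest first); ties are broken by document id. This ranking is a
    placeholder assumption, not the scoring used in the paper.
    """
    terms = tokenize(query)
    if not terms:
        return []
    postings = [index.get(term, {}) for term in terms]
    matching = set(postings[0]).intersection(*postings[1:])

    def score(doc_id):
        return sum(p[doc_id] for p in postings)

    return sorted(matching, key=lambda doc_id: (-score(doc_id), doc_id))
```

A lookup touches only the posting lists of the query terms rather than scanning every document, which is why the structure supports fast search without grouping similar documents into clusters.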

Keywords

application; clustering; data mining; inverted index; Lucene; suffix tree

DOI: https://doi.org/10.22190/FUACR1603171I


Print ISSN: 1820-6417
Online ISSN: 1820-6425