Elham Amouee, Morteza Mohammadi Zanjireh, Mahdi Bahaghighat, Mohsen Ghorbani



The increasing volume of text data in databases requires appropriate classification and analysis in order to extract knowledge and improve the quality of decision-making in organizations. Data mining, the process of discovering hidden patterns in a data set, requires access to high-quality data in order to produce valid results. Detecting and removing anomalous data is one of the pre-processing and data-cleaning steps in this process. Anomaly detection methods are generally classified into three groups: supervised, semi-supervised, and unsupervised. This research offers an unsupervised approach for spotting anomalous data in text collections. The proposed method combines two approaches, clustering-based and distance-based, to detect anomalies in text data. To evaluate its efficiency, the method is applied to four labeled data sets. The accuracy of the Naïve Bayes and decision tree classification algorithms is compared before and after removing anomalous data with the proposed method and with other methods such as Density-Based Spatial Clustering of Applications with Noise (DBSCAN). The proposed method achieves an accuracy of more than 92.39%. In general, the results reveal that in most cases the proposed method performs well.
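The combined clustering-based and distance-based idea described in the abstract can be sketched roughly as follows. This is a minimal illustration under assumed parameters (hand-rolled TF-IDF features, k = 2, a mean-plus-two-standard-deviations cutoff), not the authors' exact pipeline; the corpus, threshold, and initialization below are all hypothetical:

```python
# Hypothetical sketch: cluster TF-IDF document vectors, then flag documents
# unusually far from their assigned cluster centre as candidate anomalies.
import numpy as np

docs = [
    "stocks fell as investors worried about rising interest rates",
    "the market rallied after strong quarterly earnings reports",
    "analysts expect bank shares to recover next quarter",
    "bond yields climbed while equity markets stayed volatile",
    "the team won the championship after a dramatic final match",
    "the striker scored twice in the second half",
    "fans celebrated as the coach praised the defence",
    "the tournament final drew a record television audience",
    "whisk two eggs with flour and sugar for the cake batter",  # off-topic
]

# --- TF-IDF vectorization (raw term counts, log inverse document frequency) ---
tokenized = [d.lower().split() for d in docs]
vocab = sorted({w for toks in tokenized for w in toks})
idx = {w: i for i, w in enumerate(vocab)}
tf = np.zeros((len(docs), len(vocab)))
for r, toks in enumerate(tokenized):
    for w in toks:
        tf[r, idx[w]] += 1
idf = np.log(len(docs) / (tf > 0).sum(axis=0))
X = tf * idf
X /= np.linalg.norm(X, axis=1, keepdims=True)  # unit-length rows

# --- simple k-means (fixed iteration count, first-k init for determinism) ---
k = 2
centres = X[:k].copy()
for _ in range(20):
    labels = np.argmin(((X[:, None, :] - centres[None]) ** 2).sum(-1), axis=1)
    for c in range(k):
        if (labels == c).any():
            centres[c] = X[labels == c].mean(axis=0)

# --- distance-based anomaly score within clusters ---
dist = np.linalg.norm(X - centres[labels], axis=1)
threshold = dist.mean() + 2 * dist.std()  # hypothetical cutoff
anomalies = np.where(dist > threshold)[0]
```

The same TF-IDF matrix could equally be fed to a density-based method such as DBSCAN for the comparison the abstract describes; stemming and stop-word removal, cited below, would normally precede vectorization.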


Anomaly detection, text mining, unsupervised learning, clustering, pre-processing, DBSCAN algorithm




Z. A. Bakar, R. Mohemad, A. Ahmad, and M. M. Deris, “A comparative study for outlier detection techniques in data mining,” in 2006 IEEE conference on cybernetics and intelligent systems. IEEE, 2006, pp. 1–6.

V. Chandola, A. Banerjee, and V. Kumar, “Anomaly detection: A survey,” ACM computing surveys (CSUR), vol. 41, no. 3, pp. 1–58, 2009.

J. D. Parmar and J. T. Patel, “Anomaly detection in data mining: A review,” International Journal, vol. 7, no. 4, 2017.

R. Kaur and S. Singh, “A survey of data mining and social network analysis based anomaly detection techniques,” Egyptian informatics journal, vol. 17, no. 2, pp. 199–216, 2016.

D. Guthrie, “Unsupervised detection of anomalous text,” Ph.D. dissertation, Citeseer, 2008.

A. Mahapatra, N. Srivastava, and J. Srivastava, “Contextual anomaly detection in text data,” Algorithms, vol. 5, no. 4, pp. 469–489, 2012.

M. Goldstein and S. Uchida, “A comparative evaluation of unsupervised anomaly detection algorithms for multivariate data,” PloS one, vol. 11, no. 4, p. e0152173, 2016.

R. Kannan, H. Woo, C. C. Aggarwal, and H. Park, “Outlier detection for text data,” in Proceedings of the 2017 SIAM International Conference on Data Mining. SIAM, 2017, pp. 489–497.

M. Montes-y-Gómez, A. Gelbukh, and A. López-López, “Detecting deviations in text collections: An approach using conceptual graphs,” in Mexican International Conference on Artificial Intelligence. Springer, 2002, pp. 176–184.

S. Chellamuthu and M. Punithavalli, “Enhanced k-means with greedy algorithm for outlier detection,” International Journal of Advanced Research in Computer Science, vol. 3, no. 3, 2012.

J. Wang and X. Su, “An improved k-means clustering algorithm,” in 2011 IEEE 3rd International Conference on Communication Software and Networks. IEEE, 2011, pp. 44–46.

M. Marghny and A. I. Taloba, “Outlier detection using improved genetic k-means,” arXiv preprint arXiv:1402.6859, 2014.

D. Lei, Q. Zhu, J. Chen, H. Lin, and P. Yang, “Automatic k-means clustering algorithm for outlier detection,” in Information engineering and applications. Springer, 2012, pp. 363–372.

C. Yin and S. Zhang, “Parallel implementing improved k-means applied for image retrieval and anomaly detection,” Multimedia Tools and Applications, vol. 76, no. 16, pp. 16911–16927, 2017.

X.-j. Tong, F.-R. Meng, and Z.-x. Wang, “Optimization to k-means initial cluster centers,” Computer Engineering and Design, vol. 32, no. 8, pp. 2721–2723, 2011.

A. Esmaeili Kelishomi, A. Garmabaki, M. Bahaghighat, and J. Dong, “Mobile user indoor-outdoor detection through physical daily activities,” Sensors, vol. 19, no. 3, p. 511, 2019.

M. Ghorbani, M. Bahaghighat, Q. Xin, and F. Özen, “ConvLSTMConv network: a deep learning approach for sentiment analysis in cloud computing,” Journal of Cloud Computing, vol. 9, no. 1, pp. 1–12, 2020.

M. Bahaghighat, L. Akbari, and Q. Xin, “A machine learning-based approach for counting blister cards within drug packages,” IEEE Access, vol. 7, pp. 83785–83796, 2019.

M. Bahaghighat, S. A. Motamedi, and Q. Xin, “Image transmission over cognitive radio networks for smart grid applications,” Applied Sciences, vol. 9, no. 24, p. 5498, 2019.

F. Abedini, M. Bahaghighat, and M. S’hoyan, “Wind turbine tower detection using feature descriptors and deep learning,” Facta Universitatis, Series: Electronics and Energetics, vol. 33, no. 1, pp. 133–153, 2019.

M. Bahaghighat, F. Abedini, M. S’hoyan, and A.-J. Molnar, “Vision inspection of bottle caps in drink factories using convolutional neural networks,” in 2019 IEEE 15th International Conference on Intelligent Computer Communication and Processing (ICCP). IEEE, 2019, pp. 381–385.

S. Hasani, M. Bahaghighat, and M. Mirfatahia, “The mediating effect of the brand on the relationship between social network marketing and consumer behavior,” Acta Technica Napocensis, vol. 60, no. 2, pp. 1–6, 2019.

J. Zhang, C.-T. Lu, M. Zhou, S. Xie, Y. Chang, and S. Y. Philip, “Heer: Heterogeneous graph embedding for emerging relation detection from news,” in 2016 IEEE International Conference on Big Data (Big Data). IEEE, 2016, pp. 803–812.

D. Greene and P. Cunningham, “Efficient ensemble methods for document clustering,” Department of Computer Science, Trinity College Dublin, Tech. Rep., 2006.

J. Manoharan, S. H. Ganesh, and J. Sathiaseelan, “Outlier detection using enhanced k-means clustering algorithm and weight-based center approach,” Int. J. Comput. Sci. Mobile Comput., vol. 5, no. 4, pp. 453–464, 2016.

D. Greene and P. Cunningham, “Practical solutions to the problem of diagonal dominance in kernel document clustering,” in Proc. 23rd International Conference on Machine learning (ICML’06). ACM Press, 2006, pp. 377–384.

F. N. Flores and V. P. Moreira, “Assessing the impact of stemming accuracy on information retrieval–a multilingual perspective,” Information Processing & Management, vol. 52, no. 5, pp. 840–854, 2016.

W. J. Wilbur and K. Sirotkin, “The automatic identification of stop words,” Journal of information science, vol. 18, no. 1, pp. 45–55, 1992.

Z. S. Harris, “Distributional structure,” Word, vol. 10, no. 2-3, pp. 146–162, 1954.

H. P. Luhn, “A statistical approach to mechanized encoding and searching of literary information,” IBM Journal of research and development, vol. 1, no. 4, pp. 309–317, 1957.

K. S. Jones, “A statistical interpretation of term specificity and its application in retrieval,” Journal of documentation, 1972.

S. Robertson, “Understanding inverse document frequency: on theoretical arguments for idf,” Journal of documentation, 2004.

G. Salton and C. Buckley, “Term-weighting approaches in automatic text retrieval,” Information processing & management, vol. 24, no. 5, pp. 513–523, 1988.

C. D. Manning, P. Raghavan, and H. Schütze, “Scoring, term weighting and the vector space model,” Introduction to information retrieval, vol. 100, pp. 2–4, 2008.

K. Church and W. Gale, “Inverse document frequency (idf): A measure of deviations from poisson,” in Natural language processing using very large corpora. Springer, 1999, pp. 283–295.

A. Huang, “Similarity measures for text document clustering,” in Proceedings of the sixth new zealand computer science research student conference (NZCSRSC2008), Christchurch, New Zealand, vol. 4, 2008, pp. 49–56.



ISSN: 0353-3670 (Print)

ISSN: 2217-5997 (Online)

COBISS.SR-ID 12826626