GENERATING KNOWLEDGE STRUCTURES FROM OPEN DATASETS' TAGS - AN APPROACH BASED ON FORMAL CONCEPT ANALYSIS

Miloš Bogdanović, Milena Frtunić Gligorijević, Nataša Veljković, Leonid Stoimenov

DOI Number
https://doi.org/10.22190/FUACR201225002B
First page
021
Last page
031

Abstract


Under the influence of data transparency initiatives, a variety of institutions have published a significant number of datasets. In most cases, data publishers use open data portals (ODPs) to make their datasets publicly available. To improve the datasets' discoverability, ODPs group open datasets into categories using criteria such as publishers, institutions, formats, and descriptions. For these purposes, portals rely on the metadata accompanying datasets. However, part of the metadata may be missing, incomplete, or redundant. Each of these situations makes it difficult for users to find appropriate datasets and obtain the desired information. As the number of available datasets grows, this problem becomes increasingly noticeable. This paper focuses on a first step towards mitigating this problem: implementing knowledge structures to be used when part of a dataset's metadata is missing. In particular, we focus on developing knowledge structures capable of suggesting the best-matching category for an uncategorized dataset. Our approach relies on dataset descriptions provided by users in dataset tags. We use formal concept analysis to reveal the shared conceptualization originating from tag usage by developing one concept lattice for each category of open datasets. Since tags are free-text metadata entered by users, we present a method for optimizing their usage by means of semantic similarity measures based on natural language processing mechanisms. Finally, we demonstrate the advantage of our proposal by comparing concept lattices generated using formal concept analysis before and after the optimization process. The main experimental results show that our approach is capable of reducing the number of nodes within a lattice by more than 40%.
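
To make the idea of a tag-based concept lattice more concrete, the following minimal sketch builds the formal concepts of a small dataset-by-tag context and counts them before and after near-synonymous tags are merged. Everything in it is an illustrative assumption rather than the method or data of the paper: the dataset names, the tags, and the hand-written `canonical` mapping are hypothetical, and the mapping merely stands in for a semantic similarity measure. The toy counts are not the paper's experimental results.

```python
def formal_concepts(context):
    """Return all formal concepts (extent, intent) of a context.

    `context` maps each object (dataset) to its set of attributes (tags).
    """
    objects = list(context)
    all_attrs = set().union(*context.values()) if context else set()
    # Every concept intent is an intersection of object intents (the full
    # attribute set is the intersection of the empty family), so we close
    # the object intents under intersection.
    intents = {frozenset(all_attrs)}
    for obj in objects:
        row = frozenset(context[obj])
        intents |= {row & intent for intent in intents}
    concepts = []
    for intent in intents:
        # The extent is the set of objects that carry every tag in the intent.
        extent = frozenset(o for o in objects if intent <= context[o])
        concepts.append((extent, intent))
    return concepts


def merge_tags(context, canonical):
    """Replace each tag by its canonical form (hypothetical stand-in for
    a semantic-similarity-based merge)."""
    return {obj: {canonical.get(t, t) for t in tags}
            for obj, tags in context.items()}


# Hypothetical datasets of one category with free-text tags.
raw_tags = {
    "d1": {"bus", "transport"},
    "d2": {"buses", "transport"},
    "d3": {"transit", "bus"},
    "d4": {"transport", "timetable"},
}

# Hypothetical mapping of near-synonymous tags to one representative.
canonical = {"buses": "bus", "transit": "transport"}

before = formal_concepts(raw_tags)
after = formal_concepts(merge_tags(raw_tags, canonical))
print(len(before), len(after))  # 8 4
```

In this toy context, merging the tag variants halves the number of lattice nodes, because datasets that previously differed only in spelling or wording now share the same intent.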

Keywords

Open data, formal concept analysis, semantic similarity, natural language processing








Print ISSN: 1820-6417
Online ISSN: 1820-6425