Vlado Delić, Milan Gnjatović, Nikša Jakovljević, Branislav Popović, Ivan Jokić, Milana Bojanić



This paper addresses the research question of how to develop user-aware and adaptive conversational agents. A conversational agent is user-aware to the extent that it recognizes the user's identity and the emotional states of the user that are relevant in a given interaction domain. It is user-adaptive to the extent that it dynamically adapts its dialogue behavior to the user and the user's emotional state. The paper summarizes some aspects of our previous work and presents work in progress in the field of speech-based human-machine interaction. It focuses in particular on the development of speech recognition modules that cooperate with the emotion recognition, speaker recognition, and dialogue management modules. Finally, it proposes an architecture for a conversational agent that integrates these modules and improves each of them by exploiting the synergies among them.







ISSN: 0353-3670 (Print)

ISSN: 2217-5997 (Online)

COBISS.SR-ID 12826626