Automatic Personality Evaluation from Transliterations of YouTube Vlogs Using Classical and State of the art Word Embeddings
DOI: https://doi.org/10.15446/ing.investig.93803
Keywords: Personality, Word Embeddings, YouTube, Regression, Classification
The study of automatic personality recognition has gained attention in the last decade thanks to the variety of applications that derive from this field. The Big Five model (also known as OCEAN) is a well-known method for labeling personality traits. This work considers transliterations of video recordings collected from YouTube (originally provided by the Idiap Research Institute), together with the automatically generated scores for the five personality traits that are also provided in the database. The transliterations are modeled with two word embedding approaches, Word2Vec and GloVe, and three levels of analysis are included: regression to predict the score of each personality trait, binary classification between strong and weak presence of each trait, and tri-class classification according to three levels of manifestation of each trait (low, medium, and high). According to our findings, the proposed approach yields results similar to others reported in the state of the art. We think that further research is required to obtain better results. Our results, as well as others reported in the literature, suggest that there is a big gap in the study of personality traits based on linguistic patterns, which makes it necessary to collect and label data with the input of expert psychologists and psycholinguists.
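To make the three levels of analysis concrete, the following is a minimal sketch of how a text can be turned into a fixed-length vector and how continuous trait scores can be converted into binary and three-level labels. It is an illustration under stated assumptions, not the authors' implementation: the toy vectors, the averaging step, the median split, and the tercile split are all hypothetical choices, since the abstract does not specify how documents are pooled or where the class boundaries lie.

```python
from statistics import median, quantiles

# Toy 3-dimensional word vectors (hypothetical values). Real Word2Vec or
# GloVe vectors are pretrained and usually have 100 to 300 dimensions.
TOY_VECTORS = {
    "i": [0.1, 0.0, 0.2],
    "love": [0.9, 0.3, 0.1],
    "travel": [0.4, 0.8, 0.0],
    "hate": [-0.8, 0.2, 0.1],
    "mondays": [-0.2, 0.1, 0.7],
}


def document_vector(text, vectors=TOY_VECTORS):
    """Average the vectors of known words; unknown words are skipped.

    Averaging is an assumed pooling strategy, not one stated in the abstract.
    """
    dim = len(next(iter(vectors.values())))
    known = [vectors[w] for w in text.lower().split() if w in vectors]
    if not known:
        return [0.0] * dim
    return [sum(v[i] for v in known) / len(known) for i in range(dim)]


def binary_labels(scores):
    """Assumed rule: 'strong' if the score is above the median, else 'weak'."""
    cut = median(scores)
    return ["strong" if s > cut else "weak" for s in scores]


def three_level_labels(scores):
    """Assumed rule: split the scores at the two terciles."""
    low_cut, high_cut = quantiles(scores, n=3)
    labels = []
    for s in scores:
        if s <= low_cut:
            labels.append("low")
        elif s > high_cut:
            labels.append("high")
        else:
            labels.append("medium")
    return labels


if __name__ == "__main__":
    # Hypothetical trait scores for six vlogs.
    scores = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
    print(document_vector("I love travel"))
    print(binary_labels(scores))
    print(three_level_labels(scores))
```

In this setup the document vector would be the input to a regressor (for the raw score) or a classifier (for the derived labels). The same features serve all three tasks, and only the target changes.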
License
Copyright (c) 2022 Felipe Orlando López Pabón, Juan Rafael Orozco Arroyave

This work is licensed under a Creative Commons Attribution 4.0 International License.
The authors or holders of the copyright for each article hereby confer exclusive, limited and free authorization on the Universidad Nacional de Colombia's journal Ingeniería e Investigación concerning the aforementioned article which, once it has been evaluated and approved, will be submitted for publication, in line with the following items:
1. The version which has been corrected according to the evaluators' suggestions will be remitted and it will be made clear whether the aforementioned article is an unedited document regarding which the rights to be authorized are held and total responsibility will be assumed by the authors for the content of the work being submitted to Ingeniería e Investigación, the Universidad Nacional de Colombia and third-parties;
2. The authorization conferred on the journal will come into force from the date on which it is included in the respective volume and issue of Ingeniería e Investigación in the Open Journal Systems and on the journal's main page (https://revistas.unal.edu.co/index.php/ingeinv), as well as in different databases and indices in which the publication is indexed;
3. The authors authorize the Universidad Nacional de Colombia's journal Ingeniería e Investigación to publish the document in whatever required format (printed, digital, electronic or whatsoever known or yet to be discovered form) and authorize Ingeniería e Investigación to include the work in any indices and/or search engines deemed necessary for promoting its diffusion;
4. The authors accept that such authorization is given free of charge and they, therefore, waive any right to receive remuneration from the publication, distribution, public communication and any use whatsoever referred to in the terms of this authorization.










