Big Data in Social Sciences. An Introduction to the Automation of Textual Data Analysis Using Natural Language Processing and Machine Learning

Authors

Alba Taboada, Universidad Autónoma de Madrid

DOI:

https://doi.org/10.54790/rccs.51

Keywords:

big data, natural language processing, social sciences, machine learning, text mining

Abstract

Innovations in computer engineering and artificial intelligence provide new methodological opportunities for scientific research, enabling the study of emerging social phenomena that originate in and inhabit virtual spaces. The purpose of this paper is to familiarise social scientists with the well-established processes for large-scale text analysis using machine learning techniques, which together make up what is known today as natural language processing (NLP). First, the paper gives a brief overview of the history of NLP and its relation to text analysis in the social sciences. Each subsequent section then assesses one of the steps to follow when applying NLP to social research, providing information on software, tools, data sources and useful links, with the aim of offering a simplified introductory guide to the discipline. Finally, the main challenges that the social sciences face when implementing NLP techniques are examined and assessed.

Author Biography

Alba Taboada, Universidad Autónoma de Madrid

PhD student in the Economics and Business programme at the Universidad Autónoma de Madrid. She has been awarded a grant as Predoctoral Research Staff in Training (FPI) linked to the R&D project CONCERN (PID2020-115095RB-I00). She is also part of the working team of the R&D project NON-CONSPIRA-HATE! (PID2021-123983OB-I00). She graduated in Sociology from UCM and holds a Master's in Big Data Science from the University of Navarra. She was a fellow at the Centro de Investigaciones Sociológicas (CIS), class of 2022. She is currently researching new methodological approaches that apply Big Data and Machine Learning to the Social Sciences.

Published

2024-01-11

How to Cite

Taboada Villamarín, A. (2024). Big Data in Social Sciences. An Introduction to the Automation of Textual Data Analysis Using Natural Language Processing and Machine Learning. CENTRA Journal of Social Sciences, 3(1). https://doi.org/10.54790/rccs.51