Social Scientists and Data Mining

By Javier Surasky


How much do AI users really know about how the data that, together with computing power, enables its advances is generated and managed? AI has transformed many social dynamics and will continue to modify them, but do those of us who conduct research in the social sciences understand the processes behind the big data used by the models and AI tools we integrate into our investigations? Do we know how to “operate” a dataset?

Data is no longer an accessory input; it has become a strategic asset. Much is said about competition among companies and States to lead data markets, as well as privacy issues and data appropriation. Yet, social science researchers continue to have limited training in this area, which requires working with complex mathematical and statistical formulas that we often leave to engineers, computer scientists, mathematicians, and related specialists. That is a huge mistake.

Understanding how data is generated, cleaned, transformed, and modeled is a responsibility that no researcher in the social sciences can renounce; it is part of the responsible exercise of our work.

Data mining and the KDD cycle (Knowledge Discovery in Databases) are part of a new literacy for social scientists, as they help us transition from data to knowledge, and from knowledge to more informed and responsible decisions.

The distinction between data mining and machine learning is crucial to avoid confusing tools with purposes. While machine learning tends to prioritize the accuracy of results, data mining focuses on understanding the data being generated. Data mining favors explanatory, transparent, and interpretable models, which are essential for those of us working with social phenomena, where processes matter as much as outcomes: we seek to know why something happens, not just how to manage what happens.

From this perspective, the KDD cycle provides an organized framework for transforming data into information and then into knowledge through a sequential and iterative process, in which each stage builds upon the previous one, starting with a clear definition of the objective. Without a well-formulated question, data only produces noise. Following the definition of the objective, we proceed to data collection and preparation, where we address missing values, inconsistencies, and incoherence within our datasets.

The next step is data exploration and transformation, the longest and most decisive phase, whose outcomes determine the final quality of the data that will later be used in model construction. Model results then provide the information we need to create evidence-based knowledge that can be incorporated into concrete social decision-making.

None of this is foreign to the way social sciences operate: it is simply a more systematic and automated version of our own research method. It does not limit our ability to find answers; it enhances our capacity to do so.

In AI, a commonly cited estimate is that up to 80% of a project’s time is spent cleaning, organizing, and transforming data. How much effort do we devote to this in social research? The time required is not “wasted time” but an investment in the quality of final results.

Decisions about which data to use, which to retain, and which to discard require the application of clear criteria based on the defined objective, contextual knowledge, and a critical reading of the phenomenon under study.
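To make this concrete, here is a minimal sketch of what writing those criteria down might look like. Everything in it is an assumption made for illustration: the survey records, the field names, the rule that discards a record with more than one missing field, the plausible age range, and the choice of the median to fill remaining gaps. Other criteria may be more appropriate depending on the objective and the phenomenon under study.

```python
from statistics import median

# Hypothetical survey records; None marks a missing answer.
records = [
    {"id": 1, "region": "North", "age": 34, "income": 1200.0},
    {"id": 2, "region": "north ", "age": None, "income": 950.0},   # inconsistent spelling
    {"id": 3, "region": "South", "age": 51, "income": None},
    {"id": 4, "region": "South", "age": 230, "income": 1800.0},    # implausible age
    {"id": 5, "region": None, "age": None, "income": None},        # almost empty
]

FIELDS = ("region", "age", "income")
MAX_MISSING = 1        # assumed rule: discard records with more than one missing field
AGE_RANGE = (0, 120)   # assumed plausibility range for age


def clean(rows):
    """Return cleaned copies of the rows; the input list is left untouched."""
    kept = []
    for row in rows:
        row = dict(row)  # work on a copy
        # Inconsistency: harmonize spelling and stray whitespace in categories.
        if row["region"] is not None:
            row["region"] = row["region"].strip().title()
        # Incoherence: treat implausible values as missing.
        if row["age"] is not None and not AGE_RANGE[0] <= row["age"] <= AGE_RANGE[1]:
            row["age"] = None
        # Discard records that carry too little information.
        missing = sum(row[field] is None for field in FIELDS)
        if missing <= MAX_MISSING:
            kept.append(row)
    # Missing values: fill numeric gaps with the median of the observed values.
    for field in ("age", "income"):
        observed = [r[field] for r in kept if r[field] is not None]
        if not observed:
            continue  # nothing to impute from
        fill = median(observed)
        for r in kept:
            if r[field] is None:
                r[field] = fill
    return kept


cleaned = clean(records)
for row in cleaned:
    print(row)
```

The point of the sketch is not the specific thresholds but the fact that each decision (what counts as implausible, when a record is dropped, how a gap is filled) is stated explicitly and can be questioned by another researcher.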

These tasks are complemented by attribute transformation, which involves creating new variables, combining existing ones, selecting appropriate scales, and converting variables to numeric form or discretizing them according to the requirements of each algorithm we plan to use. Social sciences already work with derived concepts such as indices, rates, proportions, categories, and rules. Data mining formalizes the operations that social scientists already perform, making explicit the mathematical and statistical criteria behind them, which gives the entire research process greater transparency.
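A short sketch may help show how familiar these operations are. It derives a rate from two raw counts, rescales it to the 0-1 range, and discretizes it into categories. The municipal figures, the field names, and the cut points (8% and 15%) are hypothetical and chosen only for illustration; in real research the cut points would come from theory or from the distribution of the data.

```python
# Hypothetical municipal data, invented for illustration.
municipalities = [
    {"name": "A", "population": 10000, "unemployed": 500},
    {"name": "B", "population": 25000, "unemployed": 3000},
    {"name": "C", "population": 5000, "unemployed": 1000},
]


def min_max(values):
    """Rescale values linearly so the smallest becomes 0 and the largest 1."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]  # no variation to rescale
    return [(v - lo) / (hi - lo) for v in values]


def discretize(value, cut_points, labels):
    """Return the label of the first interval whose upper cut point exceeds value."""
    for cut, label in zip(cut_points, labels):
        if value < cut:
            return label
    return labels[-1]  # at or above the last cut point


# New variable: a rate built by combining two existing attributes.
rates = [m["unemployed"] / m["population"] for m in municipalities]

# Scale selection: min-max rescaling, one common option among several.
scaled = min_max(rates)

# Discretization with assumed cut points at 8% and 15%.
levels = [discretize(r, (0.08, 0.15), ("low", "medium", "high")) for r in rates]

for m, rate, s, level in zip(municipalities, rates, scaled, levels):
    print(m["name"], round(rate, 3), round(s, 3), level)
```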

Once the data is ready, patterns begin to emerge, and again, the choice of data combinations is critical to the quality of the results. Principal Component Analysis (PCA) reduces dimensions and reveals underlying structures, while techniques such as Kohonen networks (self-organizing maps) or autoencoders capture nonlinear relationships and uncover complex structures that are difficult or impossible to detect with traditional linear methods. Each of these approaches opens a different window into the data, and therefore into information and knowledge.
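As an illustration of the idea behind PCA, the sketch below finds the first principal component of a small table: the direction along which the observations vary most. It is a simplification under stated assumptions. The data are invented, the function name is my own, and power iteration is used in place of the full eigendecomposition that statistical libraries perform. It also assumes the starting vector is not orthogonal to the component being sought, which holds for this example.

```python
import math


def first_principal_component(rows, iterations=200):
    """Return (direction, explained variance ratio, scores) for the first component."""
    n, d = len(rows), len(rows[0])
    # Center each column on its mean.
    means = [sum(r[j] for r in rows) / n for j in range(d)]
    centered = [[r[j] - means[j] for j in range(d)] for r in rows]
    # Sample covariance matrix.
    cov = [[sum(c[i] * c[j] for c in centered) / (n - 1) for j in range(d)]
           for i in range(d)]
    # Power iteration: repeatedly apply the covariance matrix and renormalize.
    v = [1.0] * d
    for _ in range(iterations):
        w = [sum(cov[i][j] * v[j] for j in range(d)) for i in range(d)]
        norm = math.sqrt(sum(x * x for x in w))
        if norm == 0:
            raise ValueError("the data have no variance along the starting direction")
        v = [x / norm for x in w]
    # Variance captured by the component, as a share of the total variance.
    cov_v = [sum(cov[i][j] * v[j] for j in range(d)) for i in range(d)]
    eigenvalue = sum(v[i] * cov_v[i] for i in range(d))
    total = sum(cov[i][i] for i in range(d))
    # Scores: each observation projected onto the component.
    scores = [sum(c[j] * v[j] for j in range(d)) for c in centered]
    return v, eigenvalue / total, scores


# Hypothetical data: two strongly related indicators for five cases.
data = [[1.0, 2.0], [2.0, 4.1], [3.0, 5.9], [4.0, 8.0], [5.0, 10.0]]
direction, ratio, scores = first_principal_component(data)
print("direction:", [round(x, 3) for x in direction])
print("share of variance explained:", round(ratio, 4))
```

When two indicators move together as closely as in this example, a single component retains almost all of the variation, which is what reducing dimensions means in practice.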

Clustering is the most intuitive technique for those of us coming from the social sciences: it groups similar elements without predefined categories. It underlies voter segmentation, cultural consumption patterns, profiles of beneficiaries of social policies, and typologies of countries in international relations. Algorithms such as K-means (which requires choosing the number of groups in advance) or HDBSCAN (which infers the groups from the density of the data) identify natural groups without imposing predefined categories, while metrics like the Silhouette or Davies-Bouldin indices help assess the quality of clustering.
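The logic of K-means and of the Silhouette index fits in a few lines. The sketch below is a teaching version under stated assumptions: the six points are invented, the function names are my own, and the starting centroids are chosen with a simple deterministic farthest-point rule so that the result is reproducible, whereas library implementations typically use randomized schemes such as k-means++. Returning 0 for a point that is alone in its group follows the usual convention for the index.

```python
import math


def kmeans(points, k, max_iter=100):
    """Basic K-means with a deterministic farthest-point start (an assumption)."""
    centroids = [points[0]]
    while len(centroids) < k:
        # Next centroid: the point farthest from the centroids chosen so far.
        centroids.append(
            max(points, key=lambda p: min(math.dist(p, c) for c in centroids))
        )
    labels = None
    for _ in range(max_iter):
        # Assignment step: each point joins its nearest centroid.
        new_labels = [
            min(range(k), key=lambda j: math.dist(p, centroids[j])) for p in points
        ]
        if new_labels == labels:
            break  # assignments stopped changing
        labels = new_labels
        # Update step: each centroid moves to the mean of its members.
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:
                centroids[j] = tuple(sum(c) / len(members) for c in zip(*members))
    return labels, centroids


def silhouette(points, labels):
    """Mean silhouette: near 1 for compact, well-separated groups."""
    clusters = {}
    for p, lab in zip(points, labels):
        clusters.setdefault(lab, []).append(p)
    scores = []
    for p, lab in zip(points, labels):
        own = clusters[lab]
        if len(own) == 1 or len(clusters) < 2:
            scores.append(0.0)  # convention for singletons or a single group
            continue
        # a: mean distance to the other members of the point's own group.
        a = sum(math.dist(p, q) for q in own) / (len(own) - 1)
        # b: mean distance to the members of the nearest other group.
        b = min(
            sum(math.dist(p, q) for q in other) / len(other)
            for other_lab, other in clusters.items() if other_lab != lab
        )
        scores.append((b - a) / max(a, b) if max(a, b) > 0 else 0.0)
    return sum(scores) / len(scores)


# Hypothetical cases described by two indicators, forming two visible groups.
points = [(1.0, 1.0), (1.2, 0.8), (0.8, 1.1), (5.0, 5.0), (5.2, 4.9), (4.8, 5.1)]
labels, centroids = kmeans(points, k=2)
print("labels:", labels)
print("silhouette:", round(silhouette(points, labels), 3))
```

Note that the number of groups (k=2) is a decision taken by the researcher, not something the algorithm discovers; comparing the Silhouette index across several values of k is one common way to inform that decision.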

All these technical processes have an unavoidable ethical and legal dimension. Personal data protection, data minimization, non-discrimination in algorithms, responsibility for erroneous models, and the right to explanation are essential conditions for data-based research to be legitimate. Those of us working with data must assume this responsibility rigorously.

The conclusion is clear: social sciences cannot remain on the sidelines of data-driven work processes. They must not only be “users” but also take part in data production and processing. Data mining requires well-formulated questions and objectives, contextualized interpretations, critical perspectives, an understanding of structural biases, expert knowledge, and social sensitivity. No algorithm offers this by nature.

Integrating social sciences and data opens new possibilities for strengthening research. Understanding the logic of the KDD cycle, preprocessing, transformations, and core techniques does not seek to turn social scientists into mathematicians, but rather to increase their autonomy and judgment in a world where data increasingly organizes the societies and phenomena we study.

I will soon be publishing a work titled Data Mining for Social Scientists: An Introduction to Concepts, Methods, and Metrics for the Responsible Use of AI in Social Research, which provides a deep analysis of these issues and explains—using minimal mathematical-statistical formulas—the main data-processing steps. I truly believe, as I state at the end of that work, that the integration between social sciences and data analysis is essential to generating solid knowledge in the 21st century, and that it requires social researchers to navigate and appropriate new data cartographies for the benefit of societies and individuals, the ultimate purpose of our scientific work.