By Javier Surasky
Data is no
longer an accessory input; it has become a strategic asset. Much is said about
competition among companies and States to lead data markets, as well as privacy
issues and data appropriation. Yet, social science researchers continue to have
limited training in this area, which requires working with complex mathematical
and statistical formulas that we often leave to engineers, computer scientists,
mathematicians, and related specialists. That is a huge mistake.
Understanding
how data is generated, cleaned, transformed, and modeled is an inescapable
responsibility for any researcher in the social sciences as part of the
responsible exercise of their work.
Data mining
and the KDD cycle (Knowledge Discovery in Databases) are part of a new literacy
for social scientists, as they help us transition from data to knowledge, and
from knowledge to more informed and responsible decisions.
The
distinction between data mining and machine learning is crucial to avoid
confusing tools with purposes. While machine learning prioritizes result
accuracy, data mining focuses on understanding the data being generated. It
creates explanatory, transparent, and interpretable models which, for those of
us working with social phenomena—where processes matter as much as outcomes—are
essential: we seek to know why something happens, not just how to manage what
happens.
From this
perspective, the KDD cycle provides an organized framework for transforming
data into information and then into knowledge through a sequential and
iterative process, in which each stage builds upon the previous one, starting with a
clear definition of the objective. Without a well-formulated question, data
only produces noise. Following the definition of the objective, we proceed to
data collection and preparation, where we address missing values,
inconsistencies, and incoherence within our datasets.
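The preparation step described above can be illustrated with a minimal sketch. The records, field names, and the choice of median imputation are all assumptions made for the example, not a recommended recipe; the appropriate treatment of missing values depends on the research question.

```python
from statistics import median

# Hypothetical survey records; None marks a missing value and the
# region labels are deliberately inconsistent.
records = [
    {"region": "North", "income": 1200.0},
    {"region": "north ", "income": None},
    {"region": "SOUTH", "income": 800.0},
    {"region": "South", "income": 1000.0},
]


def clean(rows):
    """Harmonize region labels and impute missing incomes with the median."""
    observed = [r["income"] for r in rows if r["income"] is not None]
    fill = median(observed)
    cleaned = []
    for r in rows:
        cleaned.append({
            # Strip whitespace and lowercase so "North" and "north " match.
            "region": r["region"].strip().lower(),
            "income": fill if r["income"] is None else r["income"],
            # Keep a flag so the imputation stays visible to later analysis.
            "income_imputed": r["income"] is None,
        })
    return cleaned


cleaned = clean(records)
for row in cleaned:
    print(row)
```

Keeping an explicit flag for imputed values is one way to make the cleaning decision traceable, so that it can be revisited when results are interpreted.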
The next
step is data exploration and transformation, the longest and most decisive
phase, whose outcomes determine the final quality of the data that will later
be used in model construction. Model results then provide the information we
need to create evidence-based knowledge that can be incorporated into concrete
social decision-making.
None of
this is foreign to the way social sciences operate: it is simply a more
systematic and automated version of our own research method. It does not limit
our ability to find answers; it enhances our capacity to do so.
In AI, it
is commonly estimated that up to 80% of a project’s time is spent cleaning, organizing,
and transforming data. How much effort do we devote to this in social research?
The time required is not “wasted time” but an investment in the quality of
final results.
Decisions
about which data to use, which to retain, and which to discard require the
application of clear criteria based on the defined objective, contextual
knowledge, and a critical reading of the phenomenon under study.
These tasks
are complemented by attribute transformation, which involves creating new
variables, combining existing ones, selecting appropriate scales, and
encoding them numerically or discretizing them, according to the requirements of each algorithm we
plan to use. Social sciences already work with derived concepts such as
indices, rates, proportions, categories, and rules. Data mining formalizes the
operations that social scientists already perform, making explicit the
mathematical and statistical criteria behind them, which gives the entire
research process greater transparency.
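To make these transformations concrete, here is a small sketch that builds a derived rate, rescales it, and discretizes it into categories. The municipalities, the figures, and the cut points (0.08 and 0.12) are invented for illustration only.

```python
def min_max(values):
    """Rescale values to the [0, 1] range (assumes they are not all equal)."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]


def discretize(value, cut_points, labels):
    """Return the label of the first interval whose upper cut exceeds value."""
    for cut, label in zip(cut_points, labels):
        if value < cut:
            return label
    return labels[-1]


# Hypothetical municipal data.
municipalities = [
    {"name": "A", "population": 10000, "unemployed": 500},
    {"name": "B", "population": 20000, "unemployed": 3000},
    {"name": "C", "population": 5000, "unemployed": 500},
]

# New variable built by combining two existing ones.
rates = [m["unemployed"] / m["population"] for m in municipalities]

# Same information on a common 0-1 scale.
scaled = min_max(rates)

# Numeric rate turned into ordered categories; the cut points are assumptions.
levels = [discretize(r, [0.08, 0.12], ["low", "medium", "high"]) for r in rates]

for m, r, s, lv in zip(municipalities, rates, scaled, levels):
    print(m["name"], round(r, 3), round(s, 3), lv)
```

Writing the cut points down in code is exactly the kind of explicit criterion mentioned above: anyone reading the analysis can see where "low" ends and "medium" begins, and contest it.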
Once the
data is ready, patterns begin to emerge, and again, the choice of data
combinations is critical to the quality of the results. Principal Component
Analysis (PCA) reduces dimensions and reveals underlying structures, while
techniques such as Kohonen networks or autoencoders capture nonlinear
relationships and uncover complex structures that are difficult or impossible
to detect with linear methods. Each of these approaches opens a different window
into the data, and therefore into information and knowledge.
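As a rough illustration of what PCA does, the sketch below finds only the first principal component of a tiny two-indicator dataset using power iteration on the covariance matrix. The data are invented, and the function name and parameters are hypothetical; real projects would normally rely on an established numerical library rather than this hand-written version.

```python
import math


def first_principal_component(rows, iterations=200):
    """Return (direction, explained_variance_ratio, scores) for the first PC."""
    n, d = len(rows), len(rows[0])
    # Center each variable on its mean.
    means = [sum(r[j] for r in rows) / n for j in range(d)]
    centered = [[r[j] - means[j] for j in range(d)] for r in rows]
    # Sample covariance matrix.
    cov = [[sum(c[i] * c[j] for c in centered) / (n - 1) for j in range(d)]
           for i in range(d)]
    # Power iteration: repeatedly multiply by cov and normalize.
    v = [1.0] * d
    for _ in range(iterations):
        w = [sum(cov[i][j] * v[j] for j in range(d)) for i in range(d)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    # Variance captured along v, relative to the total variance.
    eigenvalue = sum(v[i] * sum(cov[i][j] * v[j] for j in range(d))
                     for i in range(d))
    total = sum(cov[i][i] for i in range(d))
    # Each observation projected onto the component.
    scores = [sum(c[j] * v[j] for j in range(d)) for c in centered]
    return v, eigenvalue / total, scores


# Hypothetical data: two strongly related indicators for four cases.
rows = [[1.0, 2.1], [2.0, 3.9], [3.0, 6.2], [4.0, 7.8]]
direction, ratio, scores = first_principal_component(rows)
print("direction:", [round(x, 3) for x in direction])
print("share of variance explained:", round(ratio, 4))
```

Because the two indicators move almost together, a single component captures nearly all of the variance, which is the sense in which PCA "reduces dimensions".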
Clustering
is the most intuitive technique for those of us coming from the social
sciences: it groups similar elements without predefined categories. It
underlies voter segmentation, cultural consumption patterns, profiles of
beneficiaries of social policies, or typologies of countries in international
relations. Algorithms such as K-means or HDBSCAN identify groups without
predefined labels (K-means still requires the number of groups to be chosen in
advance), while metrics like the Silhouette or
Davies-Bouldin indices help assess the quality of clustering.
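The sketch below shows the idea with a bare-bones K-means (Lloyd's algorithm) and a mean Silhouette score, written in plain Python. The points, the starting centroids, and the function names are assumptions for the example; it presumes at least two clusters and is not a substitute for a tested library implementation.

```python
import math


def dist(a, b):
    """Euclidean distance between two points."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def k_means(points, start_centroids, max_iter=100):
    """Lloyd's algorithm from given starting centroids."""
    centroids = [list(c) for c in start_centroids]
    labels = None
    for _ in range(max_iter):
        # Assignment step: each point goes to its nearest centroid.
        new_labels = [min(range(len(centroids)),
                          key=lambda c: dist(p, centroids[c]))
                      for p in points]
        if new_labels == labels:
            break  # assignments stopped changing
        labels = new_labels
        # Update step: move each centroid to the mean of its members.
        for c in range(len(centroids)):
            members = [p for p, l in zip(points, labels) if l == c]
            if members:  # an empty cluster keeps its previous centroid
                centroids[c] = [sum(col) / len(members)
                                for col in zip(*members)]
    return labels, centroids


def silhouette(points, labels):
    """Mean Silhouette score; singleton clusters score 0 by convention."""
    scores = []
    for i, p in enumerate(points):
        own = [dist(p, q) for j, q in enumerate(points)
               if labels[j] == labels[i] and j != i]
        if not own:
            scores.append(0.0)
            continue
        a = sum(own) / len(own)  # mean distance within own cluster
        b = min(  # mean distance to the nearest other cluster
            sum(dist(p, q) for q, l in zip(points, labels) if l == c)
            / labels.count(c)
            for c in set(labels) if c != labels[i]
        )
        scores.append((b - a) / max(a, b))
    return sum(scores) / len(scores)


# Hypothetical two-dimensional profiles forming two visible groups.
points = [[1, 1], [1, 2], [2, 1], [8, 8], [8, 9], [9, 8]]
labels, centroids = k_means(points, [points[0], points[3]])
score = silhouette(points, labels)
print(labels, round(score, 3))
```

A Silhouette value close to 1 indicates compact, well-separated groups; values near 0 suggest that the groups overlap and that the typology may be an artifact of the algorithm rather than a feature of the data.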
All these
technical processes have an unavoidable ethical and legal dimension. Personal
data protection, minimization, non-discrimination in algorithms, responsibility
for erroneous models, and the right to explanation are essential conditions for
research based on data to be legitimate. Those of us working with data must
assume this responsibility rigorously.
The
conclusion is clear: social sciences cannot remain on the sidelines of
data-driven work processes. They must not only be “users” but also take part in
data production and processing. Data mining requires well-formulated questions
and objectives, contextualized interpretations, critical perspectives, an
understanding of structural biases, expert knowledge, and social sensitivity.
No algorithm offers this by nature.
Integrating
social sciences and data opens new possibilities for strengthening research.
Understanding the logic of the KDD cycle, preprocessing, transformations, and
core techniques does not seek to turn social scientists into mathematicians,
but rather to increase their autonomy and judgment in a world where data
increasingly organizes the societies and phenomena we study.
I will soon
be publishing a work titled Data Mining for Social Scientists: An
Introduction to Concepts, Methods, and Metrics for the Responsible Use of AI in
Social Research, which provides a deep analysis of these issues and
explains—using minimal mathematical-statistical formulas—the main
data-processing steps. I truly believe, as I state at the end of that work,
that the integration between social sciences and data analysis is essential to
generating solid knowledge in the 21st century, and that it requires social
researchers to navigate and appropriate new data cartographies for the benefit
of societies and individuals, the ultimate purpose of our scientific work.
