Course Notes: Curating Data

Comprehensive notes from lectures and readings

Introduction: Curating Data as Knowledge Making Practice

Core Concept

Curating data is not a neutral act. It is a practice of knowledge creation and power. The idea of "raw data" is a myth. Data is always already processed, framed, and interpreted for a purpose.

Critical Data Studies (CDS)

CDS examines cultural, ethical, and political dimensions of data. It rejects the view of data as neutral and instead investigates how data is generated, curated, and exerts power within sociotechnical systems.

Data Assemblage

The complex, interconnected sociotechnical system that produces and gives meaning to data. It includes thought, knowledge, finance, politics, materiality, practices, organizations, and laws.

Raw Data is an Oxymoron

The concept that data is never truly raw or untouched. It is always cooked/shaped, framed, and interpreted by human choices, tools, and disciplinary norms before it can function as data.

Technological Determinism

The flawed belief that technology (like Big Data) autonomously drives social change. CDS argues that technology is a product of society that in turn shapes it.

Key Readings

Dalton & Thatcher (2014): "What Does Critical Data Studies Look Like"

Argue against technological determinism. Big data is a product of society and in turn shapes it. Data is never raw.

Kitchin (2022): "Critical Data Studies"

Introduces "data assemblages," the complex sociotechnical systems that produce data. Distinguishes between Data Holdings (informal, personal storage) and Data Archives (formal, curated collections for long term preservation).

boyd & Crawford (2012): "Critical Questions for Big Data"

Pose six critical questions for Big Data, challenging its claims to objectivity, highlighting issues of context and ethics, and warning of new digital divides between data-rich and data-poor institutions.

The DIKW (Wisdom) Hierarchy

Core Model

A model representing a hierarchy of understanding: Data → Information → Knowledge → Wisdom. Each higher level is dependent on and includes the levels below it.

Data

Symbols or facts without context, meaning, or value. Discrete, objective, unorganized. Example: a number, "72".

Information

Data that has been processed and organized to be meaningful and relevant for a specific purpose. Answers "who, what, when, where." Example: "The temperature at noon was 72°F".

Knowledge

Information that has been understood and internalized through experience and context. It is "actionable information." Involves "know how" and patterns. Answers "how." Example: "A temperature of 72°F in May is unseasonably warm for this region".

Wisdom

The evaluated understanding of knowledge. Involves judgment, ethics, values, and understanding long term consequences. Answers "why" and is concerned with effectiveness. Example: "Given this warmth is part of a long term climate trend, we should invest in new public health initiatives".
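
The first three levels can be sketched in code using the temperature example above. This is a minimal illustration, not part of the course material: the Reading class, the to_information and to_knowledge functions, and the baseline of 60°F are all hypothetical. The sketch deliberately stops before wisdom, which involves judgment and values rather than computation.

```python
from dataclasses import dataclass


@dataclass
class Reading:
    """Information: a bare value plus the context that gives it meaning."""
    value: float
    unit: str
    when: str
    where: str


def to_information(datum: float) -> Reading:
    # Data -> Information: attach what/when/where context.
    # The context values here are invented for illustration.
    return Reading(value=datum, unit="°F", when="noon", where="this region")


def to_knowledge(reading: Reading, typical_may_high: float) -> str:
    # Information -> Knowledge: compare against experience (a baseline).
    # typical_may_high is a hypothetical baseline, not real climate data.
    if reading.value > typical_may_high:
        return "unseasonably warm"
    return "typical for the season"


datum = 72                      # Data: a symbol without context
info = to_information(datum)    # Information: the value with its context
knowledge = to_knowledge(info, typical_may_high=60.0)
print(info)
print(knowledge)
```

Changing the baseline changes the conclusion, which illustrates why the Information/Knowledge step depends on context rather than on the number alone.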

Critical Tensions

  • The boundaries between levels (especially Information/Knowledge) are fuzzy.
  • Wisdom is severely under-theorized and cannot be automated; it requires human judgment.
  • The model is often presented as a clean progression, but the process is messy and iterative.

Efficiency vs. Effectiveness

Efficiency: Doing things right; the use of resources relative to an objective. Can be automated.

Effectiveness: Doing the right things; efficiency multiplied by value. Requires human wisdom and judgment.

Collecting Data

Metadata

Data about data. Provides crucial context for other data, such as how, when, and by whom it was created. It is not neutral and enacts particular worldviews.
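
To make the definition concrete, here is a small sketch of a data value paired with a metadata record. All field names and values are hypothetical, chosen for illustration rather than taken from any metadata standard. Even the choice of which fields to record reflects a judgment about what context matters.

```python
import json

# A hypothetical record: the value alone says little;
# the metadata says how, when, and by whom it was created.
record = {
    "value": 72,
    "metadata": {
        "unit": "degrees Fahrenheit",
        "created_by": "field technician",     # by whom (illustrative)
        "created_at": "2024-05-14T12:00:00",  # when (illustrative)
        "method": "handheld thermometer",     # how (illustrative)
    },
}


def describe(rec: dict) -> str:
    """Render the value together with its context."""
    meta = rec["metadata"]
    return (
        f"{rec['value']} {meta['unit']}, recorded by {meta['created_by']} "
        f"at {meta['created_at']} using a {meta['method']}"
    )


print(describe(record))
print(json.dumps(record, indent=2))
```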

Quantification

The process of turning qualities into quantities. It is not a neutral, descriptive act but a situated, creative, and agential practice deeply entangled with power and world-making.

Synthetic Data

Artificially generated data that mimics the statistical properties of real-world data. Used to address privacy concerns or data scarcity but raises questions about fidelity.
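
A minimal sketch of the idea, assuming the only statistical properties worth preserving are the mean and standard deviation, and that a normal distribution is an adequate model. Both assumptions are simplifications, and the real_ages values are invented for illustration.

```python
import random
import statistics

# Hypothetical "real" sample (invented values, for illustration only).
real_ages = [23, 35, 31, 42, 28, 55, 39, 47, 33, 29]

mu = statistics.mean(real_ages)
sigma = statistics.stdev(real_ages)

# Draw synthetic values from a normal distribution with the same mean
# and standard deviation. No synthetic value corresponds to a real person.
rng = random.Random(0)  # fixed seed so the run is repeatable
synthetic_ages = [rng.gauss(mu, sigma) for _ in range(1000)]

print(f"real:      mean={mu:.1f}, stdev={sigma:.1f}")
print(f"synthetic: mean={statistics.mean(synthetic_ages):.1f}, "
      f"stdev={statistics.stdev(synthetic_ages):.1f}")
```

The fidelity question shows up directly: anything the model does not capture (skew, outliers, relationships between variables) is absent from the synthetic data.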

Surveillance Capitalism

An economic system centered on the commodification of personal data for profit and behavioral prediction and modification.

Materiality of Information Systems

Materiality refers to the idea that data, tools and infrastructures have concrete forms and constraints that shape how knowledge is created, organised and used.

Data Justice

An ethical framework concerned with fairness in the way data is used, highlighting how data-driven systems can reinforce existing inequalities and power structures.

Categorizing & Classifying

Classification Systems

Systematic grouping of things (such as objects, animals, people, or information) based on characteristics shared across the collected items.

Taxonomy

The science or technique of classification. Arranging groups in ordered levels produces a hierarchical taxonomy.

Semantic Infrastructure

The tools, systems, and structured data that create, transmit, and give meaning to facts across the web. It uses ontological classification systems to organize and label information as facts.

Linked Open Data (LOD)

  • Linked: Data points connect across datasets using shared identifiers
  • Open: Data is freely available and reusable
  • Data: Structured in machine-readable formats (e.g. RDF)
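
The "linked" part can be sketched with plain Python tuples standing in for RDF triples (subject, predicate, object). This is a toy illustration rather than real RDF tooling: the example.org identifiers, both datasets, and the describe function are hypothetical. The point is that two independently published datasets can be combined because they use the same identifier for the same thing.

```python
# Each triple is (subject, predicate, object), mirroring the RDF data model.
# All URIs below are hypothetical placeholders under example.org.
STATION = "http://example.org/station/42"

weather_dataset = [
    (STATION, "http://example.org/prop/noonTemperatureF", 72),
]

registry_dataset = [
    (STATION, "http://example.org/prop/locatedIn", "http://example.org/region/north"),
    (STATION, "http://example.org/prop/operator", "Example Weather Service"),
]


def describe(subject: str, *datasets: list) -> dict:
    """Collect every predicate/object pair about one subject across datasets."""
    facts = {}
    for dataset in datasets:
        for s, p, o in dataset:
            if s == subject:
                facts[p] = o
    return facts


# The shared identifier is what links the two datasets.
merged = describe(STATION, weather_dataset, registry_dataset)
for predicate, obj in merged.items():
    print(predicate, "->", obj)
```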

Recommender Systems

Systems that suggest relevant items to users based on various algorithms and data points.
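
One of the simplest such algorithms is co-occurrence counting ("people who liked X also liked Y"). The sketch below is a toy version under that assumption, not a description of any real platform; the users, items, and the recommend function are hypothetical.

```python
from collections import Counter

# Hypothetical interaction data: which items each user liked.
likes = {
    "ana":   {"jazz", "blues", "soul"},
    "ben":   {"jazz", "blues", "funk"},
    "chloe": {"jazz", "funk"},
    "dev":   {"metal", "punk"},
}


def recommend(user: str, likes: dict, k: int = 2) -> list:
    """Suggest items liked by users who share at least one item with `user`."""
    mine = likes[user]
    scores = Counter()
    for other, theirs in likes.items():
        if other == user:
            continue
        overlap = len(mine & theirs)  # how similar the two users are
        if overlap == 0:
            continue
        for item in theirs - mine:    # only items the user has not liked yet
            scores[item] += overlap
    return [item for item, _ in scores.most_common(k)]


print(recommend("chloe", likes))  # → ['blues', 'soul']
```

Because suggestions are drawn only from what similar users already chose, a user with no overlap (such as "dev" here) receives nothing, and everyone else is steered toward more of the same.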

Ethical Challenges of Recommender Systems

  • Echo chambers & filter bubbles
  • Engagement farming & rage-bait
  • Mediated subjectivity & material optimisation
  • Content moderation trauma
  • Physical harms from promoted risky behavior

Displaying & Visualizing

Visual Analytics

The science of analytical reasoning facilitated by interactive visual interfaces. It integrates human judgment into data analysis through visualization and interaction.

Critical Infographics

Approach that questions the neutrality of data visualizations and considers how they reflect perspectives, tell stories, and influence perception.

Feminist Data Visualization

Challenges the traditional, masculinized ideal of a "rational, scientific, objective viewpoint," arguing that such a standpoint is a "mythical, imaginary, impossible" construct. Advocates for valuing multiple forms of knowledge, including emotional and embodied experiences.

W.E.B. Du Bois's Data Portraits

A series of statistical charts, shown at the 1900 Paris Exposition, illustrating the condition of the descendants of former African slaves in the United States. They exemplify data as a medium of self-representation and resistance.

Data Visceralization

Moving beyond visualizing data for the eyes to visceralizing it for the whole body, engaging multiple senses (sight, sound, touch, even taste) and emotions to communicate data.

Archiving

Data Holdings vs. Data Archives

Data Holdings

Informal, often personal storage of data (e.g., backups, personal files). Lacks metadata, standards, and long-term preservation planning.

Data Archive

A formal, curated, and documented collection of data intended for long-term preservation and reuse. Includes data, metadata, and context, and is managed by specialists.

Trusted Digital Repository (TDR)

A certified digital repository that ensures long-term access to data, complying with standards like the Open Archival Information System (OAIS) model.

Cyber-infrastructure

Large-scale, standardized, and interoperable data infrastructures that are cross-institutional (e.g., for genomics or climate data).

FAIR Principles

  • Findable
  • Accessible
  • Interoperable
  • Reusable

CARE Principles for Indigenous Data

A complement to the FAIR principles, focusing on:

  • Collective Benefit
  • Authority to Control
  • Responsibility
  • Ethics

Key Terms & Definitions

Apophenia

The human tendency to perceive meaningful patterns or connections in random or meaningless data. A significant risk when analyzing large datasets.

Big Data

Datasets that are too large or complex to be processed by traditional software. Characterized by volume, velocity, and variety, and generated continuously.

Data Friction

The costs, resistances, and challenges involved in the collection, integration, and sharing of data.

Digital Divide (in Big Data)

The inequality in access to large-scale data resources, creating a divide between Big Data-rich institutions (e.g., corporations, well-funded universities) and Big Data-poor ones.

Metadata Justice

The ongoing struggle to resist harmful classifications and update biased metadata standards (e.g., changing "illegal aliens" to "undocumented immigrants" in library systems).

Posthuman Curating

A concept where curating is no longer performed solely by humans but is a distributed process involving nonhuman agents like algorithms, software, and platforms.

The Curatorial

A philosophy or distinct field of discourse and thought about the practice and theory of curating.
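
The risk described under Apophenia above can be demonstrated numerically. The sketch below generates purely random series and then searches for the most strongly correlated pair; with enough series, a seemingly meaningful correlation appears by chance alone. The number of series, their length, and the seed are arbitrary choices for illustration.

```python
import itertools
import math
import random


def pearson(xs: list, ys: list) -> float:
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


rng = random.Random(0)  # fixed seed so the run is repeatable

# 200 unrelated random series, 10 observations each.
series = [[rng.random() for _ in range(10)] for _ in range(200)]

# Search all 19,900 pairs for the strongest correlation.
best = max(
    abs(pearson(a, b)) for a, b in itertools.combinations(series, 2)
)
print(f"strongest correlation among random series: {best:.2f}")
```

None of the series are related, yet the search reliably turns up a strong correlation. The larger the dataset, the more such coincidences there are to find.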