Visualizing Consumer Psychology

A Critical Analysis of E-commerce Behavioral Data

Author: Clara Holst (CH)

Student ID: AU737540 | 202306022@post.au.dk

Program: BSc Cognitive Science, 5th Semester

Institution: Arts Faculty, Aarhus University

Course: Critical Data Studies (Curating Data)

Location: Jens Chr. Skous Vej 2, 8000 Aarhus, Denmark

Characters: 14,399 (~6 pages)

Introduction

This report analyzes 1.25 million e-commerce transactions from a Kaggle dataset, shown in Figure 1, through two distinct data visualization approaches: a traditional visual analytics method and an interactive feminist/critical data studies perspective. The goal is to explore how these different methods reveal or suppress understandings of consumer psychology and price perception, aligning with the course objective to practice data visualization from multiple viewpoints.

Figure 1

Figure 1: Screenshot of the E-commerce App Transactional Dataset on Kaggle (Pratama, 2023), showing an overview of customer, product, transaction, and click-stream tables, which capture user behavior and transaction data from a fashion e-commerce app.

Visual Analytics Approach

The first visualization is a scatter plot mapping perceived value against actual product prices, with customers divided into nine behavioral segments (e.g., Value-Optimistic, Budget-Pessimistic). This follows Cui’s (2019) definition of visual analytics as “the science of analytical reasoning facilitated by interactive visual interfaces.”


To clarify the analytical framework, the behavioral segments were constructed by cross-tabulating two dimensions. First, customers were categorized into three spending tiers based on their transaction history: "Budget" (lowest average spend), "Value" (mid-range), and "Premium" (highest average spend). Second, their "perceived value" rating for products was classified as "Optimistic" (perceived value > 110% of actual price), "Realistic" (90–110%), or "Pessimistic" (<90%). This creates the nine segments, such as a "Budget-Optimistic" shopper who spends little but tends to overvalue products.
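As a minimal sketch, the cross-tabulation can be expressed as a single function. The spend cut-offs below are taken from the segmentation logic in the technical appendix; the helper name `classify_shopper` and the example values are illustrative.

```python
def classify_shopper(avg_spend: float, perceived: float, actual: float) -> str:
    """Combine a spending tier with a perception class into one of nine segments."""
    # Spending tier (cut-offs from the appendix segmentation logic)
    if avg_spend < 100_000:
        tier = "Budget"
    elif avg_spend < 250_000:
        tier = "Value"
    else:
        tier = "Premium"

    # Perception class, based on perceived value relative to actual price
    ratio = perceived / actual
    if ratio > 1.10:        # perceived value > 110% of price
        perception = "Optimistic"
    elif ratio < 0.90:      # perceived value < 90% of price
        perception = "Pessimistic"
    else:                   # within 90-110% of price
        perception = "Realistic"

    return f"{tier}-{perception}"


# A shopper who spends little but overvalues products:
print(classify_shopper(avg_spend=80_000, perceived=120_000, actual=100_000))
# -> Budget-Optimistic
```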


The data pipeline, implemented in Python, began by merging three primary datasets: transactional records, customer demographics, and product information. A key technical challenge was extracting structured data from the JSON-like product_metadata field, which was parsed using Python’s ast.literal_eval() to isolate product IDs and prices for analysis.
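The parsing step can be sketched as follows; the `raw` string is a hypothetical metadata value shaped like the dataset's field, not an actual record.

```python
import ast

# Hypothetical product_metadata value, stored as a Python-literal string:
raw = "[{'product_id': 101, 'quantity': 2, 'item_price': 159000}]"

# ast.literal_eval safely parses literals without executing arbitrary code
items = ast.literal_eval(raw)
product_ids = [item["product_id"] for item in items]
prices = [item["item_price"] for item in items]
print(product_ids, prices)  # [101] [159000]
```

Unlike `eval()`, `ast.literal_eval()` raises an exception on anything that is not a plain literal, which is why the pipeline wraps it in a try/except for malformed rows.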


The visualization itself was generated using Plotly to create an interactive, multi-faceted scatter plot. A stratified sampling method ensured clarity while preserving representativeness. Subplots compare demographic groups, using color for behavioral categories, point size for transaction amounts, and include a perfect perception line (y=x) with ±10% confidence bands.
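One way such stratified sampling can be implemented with pandas is sketched below; the DataFrame is synthetic and the per-segment quota of 300 is an assumption for illustration (the full pipeline appears in the technical appendix).

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the merged transaction table (illustrative only)
rng = np.random.default_rng(42)
df = pd.DataFrame({
    "buying_behavior": rng.choice(
        ["Budget-Optimistic", "Value-Realistic", "Premium-Pessimistic"],
        size=10_000,
    ),
    "product_revenue": rng.uniform(50_000, 400_000, size=10_000),
})

# Stratified sampling: draw the same number of rows from every behavioral
# segment so that small segments remain visible in the scatter plot.
per_segment = 300
sample = df.groupby("buying_behavior", group_keys=False).sample(
    n=per_segment, random_state=42
)
print(sample["buying_behavior"].value_counts())
```

Sampling per group keeps the plot legible at 1.25 million rows while ensuring no segment disappears behind the dominant ones.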


The statistical results reveal a consumer landscape dominated by mid-range, value-conscious shoppers. The three Value segments (Value-Optimistic, Value-Realistic, and Value-Pessimistic) each represent 16.8% of customers, together forming 50.4% of the population, indicating that spending level and perception bias operate as independent psychological dimensions.

Figure 2

Figure 2: Scatter plot comparing actual product prices (x-axis) against perceived value (y-axis) across 1.25 million transactions. Points colored by nine behavioral segments and sized by transaction amount. The black diagonal indicates perfect perception.


Figure 2 reveals clear systematic patterns. Perceived value aligns closely with actual price at lower levels but becomes more scattered at higher prices, indicating greater uncertainty in the perception of high-value products.


Despite its analytical clarity, this approach has key limitations. It privileges quantitative precision and abstracts complex psychological processes into data points (D’Ignazio & Klein, 2020). The “perfect perception” line imposes a normative standard that equates accuracy with value.


Furthermore, the statistical aggregation masks individual stories and cultural contexts. The dataset’s Indonesian origin is significant, as price points and perception patterns must be interpreted within their specific economic environment.

Critical Data Studies Perspective

I developed The Algorithmic Mirror website to visceralize datafication through interactive engagement (D’Ignazio & Klein, 2020), as seen in Figure 3. The website invites users to input demographic information and spending habits, mirroring e-commerce platform data collection practices. Users receive a personalized consumer persona based on the same behavioral segments as the dataset.

Figure 3-1

Figure 3-1: Interactive website interface where users input demographic and spending data.

Figure 3-2

Figure 3-2: Generated algorithmic consumer persona based on user input.

The website operationalizes critical data studies principles through three key design choices: (1) applying the same segmentation logic to user input, (2) incorporating a multi-currency model contextualizing Indonesian price points within Danish, Eurozone, and US spending norms, and (3) intentionally restricting gender options to highlight the “tyranny of categories.”
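The currency contextualization can be sketched as below; the exchange rates and the helper `contextualize_price` are illustrative assumptions, not the website's actual implementation.

```python
# Illustrative exchange rates (assumptions, not live data):
# units of target currency per 1 Indonesian rupiah (IDR).
RATES_FROM_IDR = {"DKK": 0.00045, "EUR": 0.00006, "USD": 0.000065}


def contextualize_price(price_idr: float) -> dict:
    """Express an Indonesian rupiah price in the comparison currencies,
    so users can relate the dataset's price points to their own norms."""
    return {cur: round(price_idr * rate, 2) for cur, rate in RATES_FROM_IDR.items()}


# The average actual price from the appendix, re-expressed:
print(contextualize_price(181_000))  # roughly 81 DKK / 11 EUR / 12 USD
```

Because a typical purchase converts to only a dozen euros, the contextualization makes concrete how misleading raw rupiah figures can be outside their economic environment.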


This design surfaces the “semantic gap” (Seaver, 2022) between lived identity and computational representation. The moment of recognition, or misrecognition, when users see their assigned persona creates visceral engagement with datafication.


The website also foregrounds ethical feedback loops (Milano et al., 2020), showing how algorithmic profiling can reinforce behavioral assumptions. By revealing these mechanisms, The Algorithmic Mirror supports what Seaver (2022) calls “infrastructural recognition.”

Reflection on Design and Viewer Experience

The design choices in both visualizations shape viewer understanding through different epistemological frameworks. The scatter plot positions viewers as distant analysts, revealing macro-trends but risking dehumanization. The interactive website places users in the position of the datafied subject, generating visceral engagement but also reproducing reductive categories.


These approaches illustrate Drucker’s (2011) assertion that all data is performative and arranged. The simulated “perceived value” metric underscores how psychological constructs are inferred, not observed, offering another example of engineers attempting to “close the semantic gap.”


The dataset’s geographical specificity is essential. The price ranges and behavioral distributions are rooted in the Indonesian market and cannot be universalized.

Conclusion

These contrasting visualization approaches demonstrate that data representation is never neutral. The visual analytics method reveals systematic patterns but reduces human complexity to quantifiable metrics. The interactive critical approach fosters personal engagement but must interrogate its own categorical foundations.


Both approaches show consistent patterns of value misperception across consumer segments, reflecting psychological tendencies that complicate demographic targeting. Together, they reveal how data systems reconstruct human identity as analyzable patterns, raising ethical questions for fairness and autonomy (Milano et al., 2020).


The website enacts feminist data principles by making datafication processes visible and emotionally resonant, modeling situated, partial knowledge rather than an impossible “view from nowhere.” Ultimately, the project demonstrates that a single dataset can tell multiple stories: macro-level behavioral patterns and personal experiences of algorithmic classification.

Appendix

Figure 4: Demographic Behavioral Distribution

Figure 4: Stacked bar chart showing behavioral distribution across demographic groups

Stacked bar chart showing behavioral distribution across 5 age groups (18-24 to 55+) and 2 genders. Height represents customer counts per segment; 18-24 age group dominates (680K customers), primarily Value-Optimistic. Gender split shows female customers (800K) favoring Value-Realistic behaviors, males (454K) preferring Value-Optimistic patterns.

Dataset Overview

Total Transactions Analyzed: 1.25 million | Customer Segments: 9 behavioral profiles | Data Source: Kaggle E-commerce Transactional Dataset

Behavioral Segment Distribution

Behavioral Segment   | Customer Count | Percentage | Primary Characteristics
Value-Optimistic     | 211,128        | 16.8%      | Overestimates value by +20% on average
Value-Realistic      | 211,117        | 16.8%      | Accurate price perception (±0%)
Value-Pessimistic    | 210,799        | 16.8%      | Underestimates value significantly
Premium-Realistic    | 185,212        | 14.8%      | Accurate perception of high-value items
Premium-Optimistic   | 184,808        | 14.7%      | Overvalues premium products
Premium-Pessimistic  | 184,722        | 14.7%      | Undervalues premium products
Budget Segments      | 66,799         | 5.4%       | Combined budget-conscious consumers

Most Common Behavior by Age Group

  • 18-24: Value-Optimistic (680,221 customers)
  • 25-34: Value-Optimistic (457,140 customers)
  • 35-44: Value-Pessimistic (105,345 customers)
  • 45-54: Value-Optimistic (11,480 customers)
  • 55+: Value-Realistic (399 customers)

Most Common Behavior by Gender

  • Female: Value-Realistic (800,114 customers)
  • Male: Value-Optimistic (454,471 customers)

Financial Characteristics by Behavior

Behavioral Segment | Avg. Actual Price | Avg. Perceived Value | Value Gap | Avg. Transaction
Value-Realistic    | Rp 181,145        | Rp 181,105           | -0.0%     | Rp 1,081,615
Value-Optimistic   | Rp 181,341        | Rp 217,591           | +20.0%    | Rp 1,089,215
Value-Pessimistic  | Rp 180,892        | Rp 144,714           | -20.0%    | Rp 1,075,328

Technical Implementation Details

Methodology: The analysis was conducted using Python with Pandas for data processing and Plotly for visualization. Behavioral segments were created based on spending levels and value perception gaps.

Data Processing: The dataset underwent comprehensive cleaning and feature engineering, including age group categorization, perceived value simulation, and behavioral segmentation algorithms.

# Sample of the segmentation logic
def categorize_behavior(row):
    # Spending tier from average transaction value
    if row['product_revenue'] < 100000:
        spending_segment = 'Budget'
    elif row['product_revenue'] < 250000:
        spending_segment = 'Value'
    else:
        spending_segment = 'Premium'

    # Perception class from the perceived-value gap
    if row['value_gap_percentage'] > 10:
        perception_segment = 'Optimistic'
    elif row['value_gap_percentage'] < -10:
        perception_segment = 'Pessimistic'
    else:
        perception_segment = 'Realistic'

    return f'{spending_segment}-{perception_segment}'

Technical Appendix: Analysis Code

Python Analysis Code
        
        import pandas as pd
        import numpy as np
        import ast
        import os
        import csv


        def load_csv_safe(filename):
            """Automatically detect encoding + delimiter and load a CSV, skipping malformed rows."""
            base_path = r"C:\Users\clara\.cache\kagglehub\datasets\bytadit\transactional-ecommerce\versions\1"
            path = os.path.join(base_path, filename)


            # Detect delimiter
            with open(path, 'r', encoding='latin1') as f:
                sample = f.read(2048)
                try:
                    dialect = csv.Sniffer().sniff(sample)
                    sep = dialect.delimiter
                except Exception:
                    sep = ','


            # Load file with error handling
            df = pd.read_csv(
                path,
                encoding='latin1',
                sep=sep,
                low_memory=False,
                on_bad_lines='warn'
            )
            print(f"✓ Loaded {filename} - shape: {df.shape}")
            return df


        def prepare_behavioral_data():
            """Load and prepare data for behavioral analysis"""
        
            # Load datasets
            print("Loading datasets...")
            transactions = load_csv_safe("transactions.csv")
            customers = load_csv_safe("customer.csv")
            products = load_csv_safe("product.csv")
        
            # Extract product IDs from metadata
            def extract_product_ids(metadata):
                if pd.isna(metadata):
                    return []
                try:
                    data = ast.literal_eval(metadata)
                    return [item.get("product_id") for item in data]
                except Exception:
                    return []


            transactions["product_ids"] = transactions["product_metadata"].apply(extract_product_ids)


            # Explode product_ids and merge datasets
            print("Merging datasets...")
            transactions_exploded = transactions.explode("product_ids")
            transactions_exploded["product_ids"] = pd.to_numeric(transactions_exploded["product_ids"], errors="coerce")
            merged = transactions_exploded.merge(products, left_on="product_ids", right_on="id", how="left")
            merged = merged.merge(customers, on="customer_id", how="left")


            # ------------------------------
            # DATA PREPARATION FOR BEHAVIORAL ANALYSIS
            # ------------------------------


            # Extract item_price from product_metadata
            def extract_item_price(metadata):
                if pd.isna(metadata):
                    return np.nan
                try:
                    data = ast.literal_eval(metadata)
                    if len(data) > 0:
                        return data[0].get("item_price", np.nan)
                    return np.nan
                except Exception:
                    return np.nan


            merged['product_revenue'] = merged['product_metadata'].apply(extract_item_price)


            # Calculate age from birthdate
            print("Calculating age groups...")
            merged['birthdate'] = pd.to_datetime(merged['birthdate'], errors='coerce')
            merged['age'] = (pd.to_datetime('2020-01-01') - merged['birthdate']).dt.days / 365.25
            merged['age_group'] = pd.cut(merged['age'],
                                        bins=[0, 24, 34, 44, 54, 100],
                                        labels=['18-24', '25-34', '35-44', '45-54', '55+'])


            # Simulate perceived value (in real scenario, this would come from survey data or ML model)
            print("Creating behavioral features...")
            np.random.seed(42)
            merged['perceived_value'] = merged['product_revenue'] * np.random.uniform(0.7, 1.3, len(merged))


            # Calculate value gap
            merged['value_gap'] = merged['perceived_value'] - merged['product_revenue']
            merged['value_gap_percentage'] = (merged['value_gap'] / merged['product_revenue']) * 100


            # Create behavioral segments
            def categorize_behavior(row):
                if pd.isna(row['product_revenue']) or pd.isna(row['value_gap_percentage']):
                    return 'Unknown'
            
                # Spending level segments
                if row['product_revenue'] < 100000:
                    spending_segment = 'Budget'
                elif row['product_revenue'] < 250000:
                    spending_segment = 'Value'
                else:
                    spending_segment = 'Premium'
            
                # Perception segments
                if row['value_gap_percentage'] > 10:
                    perception_segment = 'Optimistic'
                elif row['value_gap_percentage'] < -10:
                    perception_segment = 'Pessimistic'
                else:
                    perception_segment = 'Realistic'
            
                return f'{spending_segment}-{perception_segment}'


            merged['buying_behavior'] = merged.apply(categorize_behavior, axis=1)


            # Clean data for plotting
            merged = merged.dropna(subset=['product_revenue', 'perceived_value', 'age_group', 'gender_y'])
        
            print(f"Final dataset shape: {merged.shape}")
            print("Behavioral segments distribution:")
            print(merged['buying_behavior'].value_counts())
        
            return merged


        # Load and prepare the data
        merged_data = prepare_behavioral_data()


        import plotly.express as px
        import plotly.graph_objects as go
        from plotly.subplots import make_subplots


        def create_behavioral_visualizations(merged):
            """Create comprehensive behavioral visualizations"""
        
            # Convert categorical columns to string to avoid type issues
            merged = merged.copy()
            merged['age_group'] = merged['age_group'].astype(str)
            merged['gender_y'] = merged['gender_y'].astype(str)
        
            # Create a sample for better visualization performance
            sample_data = merged.sample(min(3000, len(merged)), random_state=42)
        
            # ------------------------------
            # PLOT 1: COMPREHENSIVE USER BEHAVIOR MATRIX
            # ------------------------------


            fig1 = px.scatter(
                sample_data,
                x='product_revenue',
                y='perceived_value',
                color='buying_behavior',
                size='total_amount',
                hover_data=['masterCategory', 'subCategory', 'gender_y', 'age_group', 'payment_method'],
                facet_col='gender_y',
                facet_row='age_group',
                title='COMPREHENSIVE USER BEHAVIOR MATRIX: Perceived Value vs Actual Spending by Gender, Age, and Behavioral Patterns',
                labels={
                    'product_revenue': 'ACTUAL PRODUCT PRICE',
                    'perceived_value': 'PERCEIVED VALUE',
                    'buying_behavior': 'BEHAVIORAL PROFILE',
                    'gender_y': 'GENDER',
                    'age_group': 'AGE GROUP'
                },
                color_discrete_map={
                    # Budget behaviors
                    'Budget-Optimistic': '#FF6B6B',  # Red - overestimates value
                    'Budget-Realistic': '#FFA726',    # Orange - accurate perception  
                    'Budget-Pessimistic': '#FFD54F',  # Yellow - underestimates value
                
                    # Value behaviors
                    'Value-Optimistic': '#4ECDC4',    # Teal - overestimates value
                    'Value-Realistic': '#26A69A',     # Darker teal - accurate
                    'Value-Pessimistic': '#80CBC4',   # Light teal - underestimates
                
                    # Premium behaviors
                    'Premium-Optimistic': '#45B7D1',  # Blue - overestimates
                    'Premium-Realistic': '#1976D2',   # Dark blue - accurate
                    'Premium-Pessimistic': '#90CAF9'  # Light blue - underestimates
                }
            )


            # Add perfect perception line and confidence bands to each subplot
            for row in range(1, 6):  # 5 age groups
                for col in range(1, 3):  # 2 genders
                    max_val = sample_data[['product_revenue', 'perceived_value']].max().max()
                
                    # Perfect perception line
                    fig1.add_trace(
                        go.Scatter(
                            x=[0, max_val],
                            y=[0, max_val],
                            mode='lines',
                            line=dict(dash='solid', color='black', width=2),
                            name='Perfect Perception',
                            showlegend=False
                        ),
                        row=row, col=col
                    )
                
                    # ±10% confidence band
                    fig1.add_trace(
                        go.Scatter(
                            x=[0, max_val],
                            y=[0, max_val * 1.1],
                            mode='lines',
                            line=dict(dash='dot', color='gray', width=1),
                            fillcolor='rgba(128,128,128,0.1)',
                            fill='tonexty',
                            showlegend=False
                        ),
                        row=row, col=col
                    )
                
                    fig1.add_trace(
                        go.Scatter(
                            x=[0, max_val],
                            y=[0, max_val * 0.9],
                            mode='lines',
                            line=dict(dash='dot', color='gray', width=1),
                            fill='tonexty',
                            showlegend=False
                        ),
                        row=row, col=col
                    )


            # Update layout for comprehensive view
            fig1.update_layout(
                height=1200,
                showlegend=True,
                template='plotly_white',
                legend=dict(
                    orientation="h",
                    yanchor="bottom",
                    y=1.02,
                    xanchor="center",
                    x=0.5,
                    font=dict(size=10)
                ),
                title_font=dict(size=20),
                font=dict(size=11)
            )


            # Update axes for better readability
            fig1.update_xaxes(title_text="ACTUAL PRICE", matches=None, showticklabels=True)
            fig1.update_yaxes(title_text="PERCEIVED VALUE", matches=None, showticklabels=True)


            # Add annotations for interpretation
            fig1.add_annotation(
                text="Points ABOVE line = Overestimate value | Points BELOW line = Underestimate value",
                xref="paper", yref="paper",
                x=0.5, y=-0.05,
                showarrow=False,
                font=dict(size=12, color="gray")
            )


            fig1.show()


            # ------------------------------
            # PLOT 2: BEHAVIORAL SEGMENTATION DASHBOARD
            # ------------------------------


            # Create a comprehensive dashboard with multiple coordinated views
            fig2 = make_subplots(
                rows=3, cols=2,
                subplot_titles=(
                    'Spending Distribution by Demographic Segments',
                    'Value Perception Accuracy by Segment',
                    'Behavioral Profile Distribution',
                    'Product Category Preferences by Behavior',
                    'Price Sensitivity vs Spending Power',
                    'Payment Method Preferences by Behavior'
                ),
                specs=[
                    [{"type": "violin", "rowspan": 2}, {"type": "bar"}],
                    [None, {"type": "bar"}],
                    [{"type": "scatter"}, {"type": "bar"}]
                ],
                vertical_spacing=0.08,
                horizontal_spacing=0.1,
                column_widths=[0.5, 0.5]
            )


            # 4.1 Spending Distribution by Demographic Segments (Violin plot)
            demographic_combinations = []
            for age in ['18-24', '25-34', '35-44', '45-54', '55+']:
                for gender in ['M', 'F']:
                    demographic_combinations.append(f"{age}_{gender}")  # Using underscore instead of hyphen


            for demo in demographic_combinations:
                age, gender = demo.split('_')  # Split by underscore
                demo_data = merged[(merged['age_group'] == age) & (merged['gender_y'] == gender)]['product_revenue'].dropna()
                if len(demo_data) > 0:
                    fig2.add_trace(
                        go.Violin(
                            y=demo_data,
                            x=[demo] * len(demo_data),
                            name=demo,
                            legendgroup=demo,
                            showlegend=False,
                            side='positive' if gender == 'F' else 'negative',
                            scalegroup=demo,
                            meanline_visible=True,
                            points=False,
                            marker_color='#FF6B6B' if gender == 'F' else '#4ECDC4',
                            line_color='#FF6B6B' if gender == 'F' else '#4ECDC4'
                        ),
                        row=1, col=1
                    )


            # 4.2 Value Perception Accuracy by Segment (Bar chart)
            perception_accuracy = merged.groupby(['age_group', 'gender_y'])['value_gap_percentage'].mean().reset_index()
            perception_accuracy['demographic'] = perception_accuracy['age_group'] + '_' + perception_accuracy['gender_y']  # Using underscore


            for gender in ['M', 'F']:
                gender_data = perception_accuracy[perception_accuracy['gender_y'] == gender].sort_values('value_gap_percentage')
                fig2.add_trace(
                    go.Bar(
                        x=gender_data['demographic'],
                        y=gender_data['value_gap_percentage'],
                        name=f"Gender {gender}",
                        marker_color='#FF6B6B' if gender == 'F' else '#4ECDC4',
                        showlegend=True
                    ),
                    row=1, col=2
                )


            # 4.3 Behavioral Profile Distribution (Stacked Bar)
            behavior_dist = merged.groupby(['age_group', 'gender_y', 'buying_behavior']).size().reset_index(name='count')
            behavior_dist['demographic'] = behavior_dist['age_group'] + '_' + behavior_dist['gender_y']  # Using underscore


            # Get behavior types for stacking
            behavior_types = sorted(behavior_dist['buying_behavior'].unique())


            for behavior in behavior_types:
                behavior_data = behavior_dist[behavior_dist['buying_behavior'] == behavior]
                fig2.add_trace(
                    go.Bar(
                        x=behavior_data['demographic'],
                        y=behavior_data['count'],
                        name=behavior,
                        showlegend=True
                    ),
                    row=2, col=2
                )


            # 4.4 Price Sensitivity vs Spending Power (Bubble chart)
            segment_analysis = merged.groupby(['age_group', 'gender_y']).agg({
                'product_revenue': 'mean',
                'value_gap_percentage': 'std',  # Higher std = more variable perception = more sensitive
                'customer_id': 'nunique'
            }).reset_index()
            segment_analysis['demographic'] = segment_analysis['age_group'] + '_' + segment_analysis['gender_y']  # Using underscore


            fig2.add_trace(
                go.Scatter(
                    x=segment_analysis['product_revenue'],
                    y=segment_analysis['value_gap_percentage'].fillna(0),
                    mode='markers+text',
                    marker=dict(
                        size=segment_analysis['customer_id'] / 50,
                        color=segment_analysis['value_gap_percentage'].fillna(0),
                        colorscale='RdYlBu_r',
                        showscale=True,
                        colorbar=dict(title="Price Sensitivity")
                    ),
                    text=segment_analysis['demographic'],
                    textposition="middle center",
                    hovertemplate='%{text}<br>Avg Spending: %{x:,.0f}<br>Sensitivity: %{y:.1f}%<br>Customers: %{marker.size:.0f}',
                    showlegend=False
                ),
                row=3, col=1
            )


            # 4.5 Payment Method Preferences by Behavior
            payment_behavior = merged.groupby(['buying_behavior', 'payment_method']).size().reset_index(name='count')
            payment_totals = payment_behavior.groupby('buying_behavior')['count'].sum().reset_index()
            payment_behavior = payment_behavior.merge(payment_totals, on='buying_behavior', suffixes=('', '_total'))
            payment_behavior['percentage'] = (payment_behavior['count'] / payment_behavior['count_total']) * 100


            for payment_method in payment_behavior['payment_method'].unique():
                method_data = payment_behavior[payment_behavior['payment_method'] == payment_method]
                fig2.add_trace(
                    go.Bar(
                        x=method_data['buying_behavior'],
                        y=method_data['percentage'],
                        name=payment_method,
                        showlegend=True
                    ),
                    row=3, col=2
                )


            # Update layout for comprehensive dashboard
            fig2.update_layout(
                title_text="BEHAVIORAL SEGMENTATION DASHBOARD: Comprehensive User Profiling by Demographics and Psychology",
                title_x=0.5,
                title_font=dict(size=18),
                height=1400,
                showlegend=True,
                template="plotly_white",
                barmode='stack',
                legend=dict(
                    orientation="h",
                    yanchor="bottom",
                    y=1.02,
                    xanchor="center",
                    x=0.5
                )
            )


            # Update axes with better labels
            fig2.update_xaxes(title_text="Demographic Segments", row=1, col=1)
            fig2.update_yaxes(title_text="Spending Distribution", row=1, col=1)
            fig2.update_xaxes(title_text="Demographic Segments", row=1, col=2)
            fig2.update_yaxes(title_text="Avg Value Gap (%)", row=1, col=2)
            fig2.update_xaxes(title_text="Demographic Segments", row=2, col=2)
            fig2.update_yaxes(title_text="Count of Users", row=2, col=2)
            fig2.update_xaxes(title_text="Average Spending", row=3, col=1)
            fig2.update_yaxes(title_text="Price Sensitivity (Std of Value Gap)", row=3, col=1)
            fig2.update_xaxes(title_text="Behavioral Profiles", row=3, col=2)
            fig2.update_yaxes(title_text="Percentage (%)", row=3, col=2)


            fig2.show()


        # Create visualizations
        print("Creating visualizations...")
        create_behavioral_visualizations(merged_data)

Note: This code was used to generate the behavioral segments and visualizations analyzed in this report. The analysis processed 1.25 million transactions from the e-commerce dataset.

References