Week 9|Data Science

Data Science: Telling Stories with Numbers

Analyze real datasets, create visualizations, and learn to spot misleading statistics.

Materials for this lesson

  • Laptop (charged)
  • Pencil and paper
  • Pre-downloaded CSV dataset (NBA stats or weather data)

Warm-Up: What's Wrong with This Picture?

Look at these two claims based on the same data:

  • Headline A: "Crime SKYROCKETS — Up 200% This Year!"
  • Headline B: "Crime remains near historic lows despite small uptick"

Both are technically "true." The first one is talking about a jump from 1 incident to 3 incidents in a small town. The 200% increase is real — but the actual numbers are tiny.

Now consider this graph scenario: imagine a bar chart showing Company A's revenue at $4.2 billion and Company B's revenue at $4.0 billion. If the y-axis starts at $3.9 billion instead of $0, Company A's bar looks three times taller than Company B's — even though the actual difference is only about 5%.
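
The arithmetic behind both examples can be checked in a few lines. This is an illustrative sketch only: the helper names `percent_change` and `apparent_ratio` are made up for this example, and the numbers are the ones quoted above.

```python
def percent_change(old, new):
    """Relative change from old to new, as a percentage."""
    return (new - old) / old * 100


def apparent_ratio(value_a, value_b, axis_start):
    """How many times taller bar A looks than bar B when the y-axis starts at axis_start."""
    return (value_a - axis_start) / (value_b - axis_start)


# Crime headline: 1 incident -> 3 incidents
print(f"Percent change in crime: {percent_change(1, 3):.0f}%")

# Revenue chart: $4.2 billion vs $4.0 billion
print(f"Real revenue difference: {percent_change(4.0, 4.2):.0f}%")
print(f"Bar height ratio, axis from 3.9: {apparent_ratio(4.2, 4.0, 3.9):.1f}x")
print(f"Bar height ratio, axis from 0:   {apparent_ratio(4.2, 4.0, 0):.2f}x")
```

Changing only `axis_start` changes how big the gap looks, while the underlying values stay the same.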

Question: Can you think of other ways a graph or statistic could be technically correct but deeply misleading?

💡 Key Concept

"There are three kinds of lies: lies, damned lies, and statistics." — Mark Twain (probably quoting Benjamin Disraeli). Data doesn't lie, but the people presenting it can mislead. A data scientist's job is to find the truth in the numbers — and to present it honestly.


Core Lesson: The Language of Data

Types of Data

Not all data is the same. Understanding the type tells you which tools to use.

| Type | Description | Examples | What You Can Do |
|------|-------------|----------|-----------------|
| Categorical (Qualitative) | Labels or categories | Favorite color, team name, state | Count, find mode, make bar charts |
| Numerical — Discrete | Countable whole numbers | Number of siblings, goals scored | Mean, median, histograms |
| Numerical — Continuous | Measurable, any value | Height, temperature, time | Mean, median, std deviation, scatter plots |

Which of the examples in the table above are continuous numerical data?

Measures of Center: Mean, Median, Mode

These three numbers answer the question: "What's a typical value in this dataset?"

Mean (Average): Add everything up and divide by how many values there are.

  • Mean of = 30/5 = 6
  • Sensitive to outliers (extreme values)

Median: The middle value when sorted. If there's an even number of values, average the two middle ones.

  • Median of = 6
  • Resistant to outliers — not affected by extreme values

Mode: The most frequent value.

  • Mode of = 4
  • The only measure that works for categorical data
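
All three measures are available in Python's standard `statistics` module. A minimal sketch, assuming a hypothetical list `[4, 4, 6, 7, 9]` picked only because it reproduces the results quoted above (mean 6, median 6, mode 4):

```python
import statistics

# Hypothetical example values (an assumption, chosen to match the results above)
values = [4, 4, 6, 7, 9]

print("Mean:  ", statistics.mean(values))    # sum of 30 divided by 5 values
print("Median:", statistics.median(values))  # middle value of the sorted list
print("Mode:  ", statistics.mode(values))    # most frequent value

# Mode also works on categorical data, where mean and median do not apply
colors = ["red", "blue", "red", "green"]
print("Most common color:", statistics.mode(colors))
```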

When Mean and Median Disagree

Here's where it gets interesting. Consider the salaries at a small company:

| Employee | Salary |
|----------|--------|
| Worker 1 | $40,000 |
| Worker 2 | $42,000 |
| Worker 3 | $45,000 |
| Worker 4 | $48,000 |
| CEO | $500,000 |

  • Mean salary: $135,000
  • Median salary: $45,000

Which number better represents a "typical" salary at this company? The median, clearly. The CEO's salary pulls the mean way up, making it misleading.
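
The same comparison in Python, using the salaries from the table. This is a small sketch and the variable names are illustrative.

```python
import statistics

salaries = [40_000, 42_000, 45_000, 48_000, 500_000]  # last value is the CEO

print(f"Mean:   ${statistics.mean(salaries):,.0f}")
print(f"Median: ${statistics.median(salaries):,.0f}")

# Drop the CEO to see how much a single extreme value moves each measure
workers = salaries[:-1]
print(f"Mean without CEO:   ${statistics.mean(workers):,.0f}")
print(f"Median without CEO: ${statistics.median(workers):,.0f}")
```

Removing one value moves the mean by more than $90,000 but the median by only $1,500.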

💡 Key Concept

When data is skewed (has extreme values on one side), the median is usually more informative than the mean. This is why economists often report median household income rather than mean — a few billionaires would distort the average for everyone.

A neighborhood has houses worth: $200K, $210K, $220K, $230K, and $2,000K. Which measure of center best represents a 'typical' house price?

Spread: How Scattered Is the Data?

Center isn't the whole story. These two datasets have the same mean (50), but they feel very different:

  • Dataset A: — very consistent
  • Dataset B: — wildly spread out

Range = maximum - minimum. Simple but crude (one outlier changes everything).

  • Range of A = 2, Range of B = 80

Standard Deviation measures how far values typically sit from the mean (formally, it is the square root of the average squared distance from the mean). Think of it as "roughly, how far is a typical data point from the center?"

  • Low standard deviation = data clustered tightly around the mean
  • High standard deviation = data spread out widely

You don't need to calculate standard deviation by hand (Python will do it), but understanding what it means is essential.
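
A minimal sketch of both measures of spread. The two lists are hypothetical stand-ins, chosen to match the figures above (mean 50, ranges 2 and 80); `statistics.pstdev` is the population standard deviation from Python's standard library.

```python
import statistics

# Hypothetical datasets (assumptions): same mean, very different spread
dataset_a = [49, 50, 50, 50, 51]
dataset_b = [10, 30, 50, 70, 90]

for name, data in [("A", dataset_a), ("B", dataset_b)]:
    data_range = max(data) - min(data)
    print(f"Dataset {name}: mean={statistics.mean(data)}, "
          f"range={data_range}, std dev={statistics.pstdev(data):.1f}")
```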

Tip

The 68-95-99.7 Rule: For data that follows a bell curve (normal distribution), about 68% of values fall within 1 standard deviation of the mean, 95% within 2, and 99.7% within 3. If a data point is more than 3 standard deviations from the mean, it's extremely unusual — likely an outlier or an error.
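
These percentages can be checked with `statistics.NormalDist` (available in Python 3.8 and later). A short sketch; the helper name `share_within` is made up for illustration.

```python
from statistics import NormalDist

standard_normal = NormalDist(mu=0, sigma=1)


def share_within(k):
    """Fraction of a normal distribution within k standard deviations of the mean."""
    return standard_normal.cdf(k) - standard_normal.cdf(-k)


for k in (1, 2, 3):
    print(f"Within {k} standard deviation(s): {share_within(k):.1%}")
```

The rule's 68, 95, and 99.7 are rounded versions of these values, and they apply only when the data is roughly bell-shaped.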

Bias in Data Collection

Even before you analyze data, the way it was collected can introduce bias:

  • Selection bias: Your sample doesn't represent the population. Surveying only people at a gym about exercise habits would give skewed results.
  • Response bias: People don't answer honestly. "Do you always wash your hands?" — most people say yes, but studies show many don't.
  • Confirmation bias: You unconsciously look for data that supports what you already believe.
  • Survivorship bias: You only see the successes. "Every successful entrepreneur dropped out of college" — you're not counting the millions who dropped out and failed.
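
Selection bias is easy to see in a small simulation. The sketch below invents a population of 1,000 people in which gym members exercise more on average. Every number in it is made up for illustration and is not real survey data.

```python
import random
import statistics

random.seed(42)  # fixed seed so the run is repeatable

# Made-up population: (is_gym_member, weekly exercise hours)
population = []
for _ in range(1000):
    is_member = random.random() < 0.2       # assumption: 20% are gym members
    typical_hours = 6 if is_member else 2   # assumed group averages
    hours = max(0.0, random.gauss(typical_hours, 1.5))
    population.append((is_member, hours))

everyone = [hours for _, hours in population]
gym_only = [hours for is_member, hours in population if is_member]

print(f"True population mean: {statistics.mean(everyone):.1f} hours/week")
print(f"Gym-only survey mean: {statistics.mean(gym_only):.1f} hours/week")
```

Surveying only at the gym overstates how much the whole population exercises, even though every individual answer is accurate.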

Survivorship Bias — Veritasium

A school surveys students in the AP classes about whether school is too easy. 80% say yes. What type of bias is this?


Hands-On Lab: Analyzing Real Data with Python

Time to get your hands on real data! We'll use Python with matplotlib to load a dataset, calculate statistics, and create visualizations.

🧪 Materials Needed

Setup: You'll need Python with matplotlib installed. If using Replit, create a Python project — matplotlib is pre-installed. If running locally, install it with pip install matplotlib.

We'll use a dataset of NBA player statistics. You can download a CSV from Basketball Reference or use any CSV dataset you find interesting (weather, Spotify songs, etc.).

Step 1: Create Your Dataset

If you don't have a CSV handy, paste this code to create a sample dataset directly:

# Create a sample NBA dataset
data = """Player,Team,Points,Rebounds,Assists,Age
Luka Doncic,DAL,33.9,9.2,9.8,24
Giannis Antetokounmpo,MIL,30.4,11.5,6.5,29
Shai Gilgeous-Alexander,OKC,30.1,5.5,6.2,25
Jayson Tatum,BOS,26.9,8.1,4.9,25
Kevin Durant,PHX,27.1,6.6,5.0,35
Joel Embiid,PHI,34.7,11.0,5.6,29
Anthony Edwards,MIN,25.9,5.4,5.1,22
LeBron James,LAL,25.7,7.3,8.3,39
Nikola Jokic,DEN,26.4,12.4,9.0,28
Damian Lillard,MIL,24.3,4.4,7.0,33
Stephen Curry,GSW,26.4,4.5,5.1,35
Tyrese Maxey,PHI,25.9,3.7,6.2,23
De'Aaron Fox,SAC,26.6,4.6,5.6,26
Paolo Banchero,ORL,22.6,6.9,5.4,21
Jalen Brunson,NYK,28.7,3.5,6.7,27
Donovan Mitchell,CLE,26.6,5.1,4.6,27
Devin Booker,PHX,27.1,4.5,6.9,27
Trae Young,ATL,25.7,2.8,10.8,25
Anthony Davis,LAL,24.7,12.6,3.5,30
Kawhi Leonard,LAC,23.7,6.1,3.6,32"""

# Save to file
with open("nba_stats.csv", "w") as f:
    f.write(data)

print("Dataset saved to nba_stats.csv!")

Step 2: Load and Explore the Data

import csv

# Read the CSV file
players = []
with open("nba_stats.csv", "r", newline="") as f:
    reader = csv.DictReader(f)
    for row in reader:
        # Convert numeric fields from strings to numbers (floats, and int for Age)
        row["Points"] = float(row["Points"])
        row["Rebounds"] = float(row["Rebounds"])
        row["Assists"] = float(row["Assists"])
        row["Age"] = int(row["Age"])
        players.append(row)

# Show what we loaded
print(f"Loaded {len(players)} players\n")
print(f"{'Player':<30} {'Team':<5} {'Pts':>5} {'Reb':>5} {'Ast':>5} {'Age':>4}")
print("-" * 59)  # 59 = total width of the header row above
for p in players:
    print(f"{p['Player']:<30} {p['Team']:<5} {p['Points']:>5.1f} "
          f"{p['Rebounds']:>5.1f} {p['Assists']:>5.1f} {p['Age']:>4}")

Step 3: Calculate Summary Statistics

import math

def mean(values):
    return sum(values) / len(values)

def median(values):
    sorted_vals = sorted(values)
    n = len(sorted_vals)
    if n % 2 == 1:
        return sorted_vals[n // 2]
    else:
        return (sorted_vals[n // 2 - 1] + sorted_vals[n // 2]) / 2

def std_dev(values):
    # Population standard deviation: divides by n (not n - 1)
    avg = mean(values)
    variance = sum((x - avg) ** 2 for x in values) / len(values)
    return math.sqrt(variance)

# Extract the points column
points = [p["Points"] for p in players]
rebounds = [p["Rebounds"] for p in players]
assists = [p["Assists"] for p in players]
ages = [p["Age"] for p in players]

print("=== POINTS PER GAME ===")
print(f"  Mean:     {mean(points):.1f}")
print(f"  Median:   {median(points):.1f}")
print(f"  Std Dev:  {std_dev(points):.1f}")
print(f"  Min:      {min(points):.1f} ({min(players, key=lambda p: p['Points'])['Player']})")
print(f"  Max:      {max(points):.1f} ({max(players, key=lambda p: p['Points'])['Player']})")

print("\n=== REBOUNDS PER GAME ===")
print(f"  Mean:     {mean(rebounds):.1f}")
print(f"  Median:   {median(rebounds):.1f}")
print(f"  Std Dev:  {std_dev(rebounds):.1f}")

print("\n=== ASSISTS PER GAME ===")
print(f"  Mean:     {mean(assists):.1f}")
print(f"  Median:   {median(assists):.1f}")
print(f"  Std Dev:  {std_dev(assists):.1f}")

print("\n=== AGE ===")
print(f"  Mean:     {mean(ages):.1f}")
print(f"  Median:   {median(ages):.1f}")
print(f"  Youngest: {min(ages)} ({min(players, key=lambda p: p['Age'])['Player']})")
print(f"  Oldest:   {max(ages)} ({max(players, key=lambda p: p['Age'])['Player']})")
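
The `std_dev` function in Step 3 divides by the number of values, which makes it a population standard deviation. Python's standard `statistics` module offers the same calculation as `pstdev`, plus `stdev`, which divides by n - 1 and is the usual choice when the data is a sample from a larger group. A quick cross-check, with the points column copied inline so the snippet runs on its own:

```python
import math
import statistics

# Points column from the sample dataset
points = [33.9, 30.4, 30.1, 26.9, 27.1, 34.7, 25.9, 25.7, 26.4, 24.3,
          26.4, 25.9, 26.6, 22.6, 28.7, 26.6, 27.1, 25.7, 24.7, 23.7]


def std_dev(values):
    """Population standard deviation, same formula as in Step 3."""
    avg = sum(values) / len(values)
    return math.sqrt(sum((x - avg) ** 2 for x in values) / len(values))


print(f"Hand-written:      {std_dev(points):.3f}")
print(f"statistics.pstdev: {statistics.pstdev(points):.3f}")  # divides by n
print(f"statistics.stdev:  {statistics.stdev(points):.3f}")   # divides by n - 1
```

With only 20 players the two versions differ slightly; the gap shrinks as the dataset grows.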

Step 4: Create Visualizations

import matplotlib.pyplot as plt

# --- Bar Chart: Points per Game ---
fig, ax = plt.subplots(figsize=(12, 6))

# Sort players by points
sorted_players = sorted(players, key=lambda p: p["Points"], reverse=True)
names = [p["Player"].split()[-1] for p in sorted_players]  # Last names only
pts = [p["Points"] for p in sorted_players]

colors = ["#e74c3c" if p > 30 else "#3498db" if p > 25 else "#95a5a6" for p in pts]
ax.bar(names, pts, color=colors)
ax.set_ylabel("Points Per Game")
ax.set_title("NBA Top Scorers — Points Per Game")
# Rotate the existing tick labels (avoids the fixed-locator warning from set_xticklabels)
plt.setp(ax.get_xticklabels(), rotation=45, ha="right")
ax.axhline(y=mean(points), color="black", linestyle="--", alpha=0.5, label=f"Mean: {mean(points):.1f}")
ax.legend()

plt.tight_layout()
plt.savefig("points_bar_chart.png", dpi=150)
plt.show()
print("Bar chart saved to points_bar_chart.png")

import matplotlib.pyplot as plt

# --- Histogram: Distribution of Points ---
fig, ax = plt.subplots(figsize=(10, 5))

ax.hist(points, bins=8, color="#3498db", edgecolor="white", linewidth=1.2)
ax.axvline(mean(points), color="red", linestyle="--", linewidth=2, label=f"Mean: {mean(points):.1f}")
ax.axvline(median(points), color="green", linestyle="-.", linewidth=2, label=f"Median: {median(points):.1f}")
ax.set_xlabel("Points Per Game")
ax.set_ylabel("Number of Players")
ax.set_title("Distribution of Points Per Game")
ax.legend()

plt.tight_layout()
plt.savefig("points_histogram.png", dpi=150)
plt.show()
print("Histogram saved to points_histogram.png")

import matplotlib.pyplot as plt

# --- Scatter Plot: Age vs Points ---
fig, ax = plt.subplots(figsize=(10, 6))

ax.scatter(ages, points, s=100, color="#e74c3c", alpha=0.7, edgecolors="black", linewidth=0.5)

# Label each point with the player's last name
for p in players:
    ax.annotate(p["Player"].split()[-1], (p["Age"], p["Points"]),
                fontsize=7, ha="center", va="bottom", alpha=0.8)

ax.set_xlabel("Age")
ax.set_ylabel("Points Per Game")
ax.set_title("Age vs Points Per Game — Is There a Pattern?")

plt.tight_layout()
plt.savefig("age_vs_points.png", dpi=150)
plt.show()
print("Scatter plot saved to age_vs_points.png")

Tip

When to use which chart:

  • Bar chart — comparing values across categories (e.g., points by player)
  • Histogram — showing the distribution/shape of one variable (e.g., how points are spread)
  • Scatter plot — exploring the relationship between two variables (e.g., age vs. points)
  • Line chart — showing change over time (e.g., stock prices, temperature by month)
  • Pie chart — showing parts of a whole (use sparingly — bar charts are almost always better)

Challenge: Find Something Surprising

Now it's your turn to be a data detective. Using the dataset (or a different one you find interesting), explore the data and find something surprising, counterintuitive, or interesting.

Here are some questions to investigate:

  1. Is there a relationship between age and scoring? Do older players score more or less? Make a scatter plot and see.

  2. Who is the most "well-rounded" player? Can you create a metric that combines points, rebounds, and assists? Maybe: overall = points + rebounds + assists?

  3. Are there any outliers? A player whose stats are more than 2 standard deviations from the mean in any category?

  4. Can you create a misleading graph? Make a graph that tells a technically-true but misleading story. Then make an honest version of the same data. Compare them.
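
Questions 1 and 2 could be started like this. It is only a sketch: the sum is the simple `overall` metric suggested above, and `pearson_r` is a hand-written correlation helper (a hypothetical name), with the dataset copied inline so the snippet runs on its own.

```python
import math

# (player, points, rebounds, assists, age) copied from the sample dataset
rows = [
    ("Luka Doncic", 33.9, 9.2, 9.8, 24),
    ("Giannis Antetokounmpo", 30.4, 11.5, 6.5, 29),
    ("Shai Gilgeous-Alexander", 30.1, 5.5, 6.2, 25),
    ("Jayson Tatum", 26.9, 8.1, 4.9, 25),
    ("Kevin Durant", 27.1, 6.6, 5.0, 35),
    ("Joel Embiid", 34.7, 11.0, 5.6, 29),
    ("Anthony Edwards", 25.9, 5.4, 5.1, 22),
    ("LeBron James", 25.7, 7.3, 8.3, 39),
    ("Nikola Jokic", 26.4, 12.4, 9.0, 28),
    ("Damian Lillard", 24.3, 4.4, 7.0, 33),
    ("Stephen Curry", 26.4, 4.5, 5.1, 35),
    ("Tyrese Maxey", 25.9, 3.7, 6.2, 23),
    ("De'Aaron Fox", 26.6, 4.6, 5.6, 26),
    ("Paolo Banchero", 22.6, 6.9, 5.4, 21),
    ("Jalen Brunson", 28.7, 3.5, 6.7, 27),
    ("Donovan Mitchell", 26.6, 5.1, 4.6, 27),
    ("Devin Booker", 27.1, 4.5, 6.9, 27),
    ("Trae Young", 25.7, 2.8, 10.8, 25),
    ("Anthony Davis", 24.7, 12.6, 3.5, 30),
    ("Kawhi Leonard", 23.7, 6.1, 3.6, 32),
]


def pearson_r(xs, ys):
    """Pearson correlation: near +1 rises together, near -1 moves opposite, near 0 no linear link."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (spread_x * spread_y)


# Question 1: is age related to scoring?
ages = [r[4] for r in rows]
points = [r[1] for r in rows]
print(f"Age vs points correlation: {pearson_r(ages, points):+.2f}")

# Question 2: rank players by points + rebounds + assists
ranked = sorted(rows, key=lambda r: r[1] + r[2] + r[3], reverse=True)
for name, pts, reb, ast, _age in ranked[:5]:
    print(f"{name:<25} {pts + reb + ast:.1f}")
```

A plain sum rewards scoring most, because points sit on a larger scale than rebounds or assists. Comparing z-scores per category is one alternative worth trying.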

Outlier Detection Code

import math

def find_outliers(players, stat_name):
    """Find players whose stat is more than 2 standard deviations from the mean."""
    values = [p[stat_name] for p in players]
    avg = sum(values) / len(values)
    variance = sum((x - avg) ** 2 for x in values) / len(values)
    sd = math.sqrt(variance)

    print(f"\n{'='*40}")
    print(f"Outlier Analysis: {stat_name}")
    print(f"Mean: {avg:.1f}, Std Dev: {sd:.1f}")
    print(f"Normal range: {avg - 2*sd:.1f} to {avg + 2*sd:.1f}")
    print(f"{'='*40}")

    outliers = []
    for p in players:
        z_score = (p[stat_name] - avg) / sd
        if abs(z_score) > 2:
            direction = "HIGH" if z_score > 0 else "LOW"
            print(f"  OUTLIER: {p['Player']} — {p[stat_name]} "
                  f"(z-score: {z_score:+.2f}, {direction})")
            outliers.append(p)

    if not outliers:
        print("  No outliers found (all within 2 standard deviations)")

    return outliers

find_outliers(players, "Points")
find_outliers(players, "Rebounds")
find_outliers(players, "Assists")

Create a Misleading Graph (Then Fix It)

import matplotlib.pyplot as plt

# --- MISLEADING VERSION ---
fig, axes = plt.subplots(1, 2, figsize=(14, 5))

# Pick two players with similar, but not identical, stats (Durant 27.1, Mitchell 26.6)
player_a = next(p for p in players if p["Player"] == "Kevin Durant")
player_b = next(p for p in players if p["Player"] == "Donovan Mitchell")

# Misleading: y-axis starts at 26
axes[0].bar(["Durant", "Mitchell"], [player_a["Points"], player_b["Points"]],
            color=["#e74c3c", "#3498db"], width=0.5)
axes[0].set_ylim(26, 28)  # Truncated axis!
axes[0].set_title("MISLEADING: Durant DOMINATES Mitchell!")
axes[0].set_ylabel("Points Per Game")

# Honest: y-axis starts at 0
axes[1].bar(["Durant", "Mitchell"], [player_a["Points"], player_b["Points"]],
            color=["#e74c3c", "#3498db"], width=0.5)
axes[1].set_ylim(0, 35)  # Full axis
axes[1].set_title("HONEST: Nearly identical scoring")
axes[1].set_ylabel("Points Per Game")

plt.suptitle("Same Data, Different Stories", fontsize=14, fontweight="bold")
plt.tight_layout()
plt.savefig("misleading_vs_honest.png", dpi=150)
plt.show()

🏆 Challenge

The real challenge: Write a 2-3 sentence "data story" about your most interesting finding. A data story answers: What did you find? Why does it matter? What questions does it raise?

Example: "Among the top 20 NBA scorers, the youngest player (Paolo Banchero, 21) scores fewer points than the oldest (LeBron James, 39), challenging the assumption that scoring peaks early. This raises the question: does experience compensate for declining athleticism?"

A dataset has values: 10, 12, 11, 13, 12, 100. The mean is 26.3 and the median is 12. Which is more representative of the 'typical' value?


Resources