Diverse Data Hub

    Key Features of the Data Set

    Each row represents a single team’s appearance in a specific tournament year and includes information such as:

    • Tournament Seed (seed) – Seed assigned to the team.

    • Tournament Results (tourney_wins, tourney_losses, tourney_finish) – Number of wins, losses, and how far the team advanced.

    • Bid Type (bid) – Whether the team received an automatic bid or was selected at-large.

    • Season Records (reg_wins, reg_losses, reg_wins_pct, conf_wins, conf_losses, conf_wins_pct, total_wins, total_losses, total_wins_pct) – Regular season, conference, and total win/loss stats and percentages.

    • Conference Information (conference, conf_wins, conf_losses, conf_wins_pct, conf_rank) – Team’s conference name, record, and rank within the conference.

    • Division (division) – Regional placement in the tournament bracket (e.g., East, West).

    • Home Game Indication (first_game_at_home) – Whether the team played its first game at home.

    Purpose and Use Cases

    This data set is designed to support analysis of:

    • Team performance over time

    • Impact of seeding and bid types on tournament results

    • Conference strength

    • Emergence and decline of winning teams in women’s college basketball

    Case Study

    Objective

    How much does a team’s tournament seed predict its success in the NCAA Division I Women’s Basketball Tournament?

    This analysis explores the relationship between a team’s seed and its tournament results to evaluate whether teams with lower-numbered (better) seeds consistently outperform teams with higher-numbered seeds.

    By examining historical data, we aim to:

    • Identify trends in tournament advancement by seed level

    Seeding is intended to reflect a team’s regular-season performance. In theory, lower-numbered seeds (e.g., #1, #2) are given to the strongest teams, who should be more likely to advance. But upsets, bracket surprises, and standout performances from higher-numbered seeds raise questions like “How reliable is seeding as a predictor of success?”

    Understanding these dynamics can inform fan expectations and bracket predictions.

    Analysis

    First, we load the libraries used in this analysis, including the diversedata package.

    • R
    • Python
    library(diversedata)
    library(tidyverse)
    library(AER)
    library(broom)
    library(knitr)
    library(MASS)
    import diversedata as dd
    import pandas as pd
    import statsmodels.api as sm
    import statsmodels.formula.api as smf
    import altair as alt
    import numpy as np
    from scipy import stats
    from IPython.display import Markdown

    1. Data Cleaning & Processing

    Next, let’s load our data (womensmarchmadness from the diversedata package) and remove all rows with NA values in our variables of interest, seed and tourney_wins.

    • R
    • Python
    # Reading Data
    marchmadness <- womensmarchmadness
    
    # Review total rows
    nrow(marchmadness)
    [1] 2092
    # Removing NA but only in selected columns
    marchmadness <- marchmadness |> drop_na(seed, tourney_wins)
    
    # Notice no rows were removed
    nrow(marchmadness)
    [1] 2092
    # Reading Data
    marchmadness = dd.load_data("womensmarchmadness")
    
    # Review total rows
    marchmadness.shape[0]
    2092
    # Removing NA but only in selected columns
    marchmadness = marchmadness.dropna(subset=['seed', 'tourney_wins'])
    
    # Notice no rows were removed
    marchmadness.shape[0]
    2092

    Note that the seed = 0 designation in 1983 marks the eight teams that played an opening-round game to become the No. 8 seed in each region. For this exercise, we exclude them. Because seed is an ordinal categorical variable, we will later encode it as an ordered factor (see Section 3).

    • R
    • Python
    marchmadness <- marchmadness |> 
      filter(seed != 0)
    marchmadness = marchmadness[marchmadness['seed'] != 0]

    2. Exploratory Data Analysis

    We can see which seeds appear most often.

    • R
    • Python
    seed_count <- marchmadness |> 
      count(seed) |> 
      arrange(desc(n)) |>
      mutate(seed = factor(seed, levels = seed))
    
    ggplot(
      seed_count, 
      aes(x = seed, y = n)
      ) +
      geom_col(fill = "skyblue2") +
      labs(
        title = "Distribution of Tournament Seeds",
        x = "Seed",
        y = "Number of Teams") +
      theme_minimal()

    alt.Chart(marchmadness).mark_bar().encode(
        x=alt.X("seed:O").title("Seed").axis(alt.Axis(labelAngle=0)),
        y=alt.Y("count()", title="Number of Teams"),
    ).properties(title="Distribution of Tournament Seeds", width=600)

    We can also take a look at the average tournament wins for each seed:

    • R
    • Python
    marchmadness |> 
      filter(!is.na(seed), seed != 0) |> 
      group_by(seed) |> 
      summarise(
        avg_tourney_wins = mean(tourney_wins, na.rm = TRUE)
        ) |>
      arrange(desc(avg_tourney_wins)) |>
      mutate(seed = factor(seed, levels = seed)) |>
      ggplot(
        aes(
          x = as.factor(seed),
          y = avg_tourney_wins)
        ) +
      geom_col(fill = "skyblue2") +
      labs(
        title = "Average Tournament Wins by Seed",
        x = "Seed",
        y = "Avg. Tourney Wins"
      ) +
      theme_minimal()

    avg_per_seed = marchmadness.groupby(by="seed")["tourney_wins"].mean().reset_index()
    
    alt.Chart(avg_per_seed).mark_bar().encode(
        x=alt.X("seed:O").title("Seed").axis(alt.Axis(labelAngle=0)),
        y=alt.Y("tourney_wins").title("Avg. Tourney Wins"),
    ).properties(title="Average Tournament Wins by Seed", width=600)

    Teams with a better (lower-numbered) seed tend to win more tournament games. We can also look at the distribution of tournament wins for each seed.

    • R
    • Python
    seed_order <- marchmadness |> 
      filter(!is.na(seed), seed != 0) |> 
      group_by(seed) |> 
      summarise(avg_wins = mean(tourney_wins, na.rm = TRUE)) |> 
      arrange(desc(avg_wins)) |> 
      pull(seed)
    
    marchmadness |> 
      filter(!is.na(seed), seed != 0) |> 
      mutate(seed = factor(seed, levels = seed_order)) |> 
      ggplot(
        aes(x = seed, y = tourney_wins)
      ) +
      geom_jitter(alpha = 0.25, color = "skyblue2") +
      geom_violin(fill = "skyblue2") +
      labs(
        title = "Distribution of Tournament Wins by Seed",
        x = "Seed",
        y = "Tournament Wins"
      ) +
      theme_minimal()

    alt.Chart(marchmadness).transform_density(
        "tourney_wins", as_=["tourney_wins", "density"], groupby=["seed"]
    ).mark_area(orient="horizontal").encode(
        x=alt.X("density:Q")
        .stack("center")
        .impute(None)
        .title(None)
        .axis(labels=False, values=[0], grid=False, ticks=True),
        y=alt.Y("tourney_wins").title("Tournament Wins"),
        column=alt.Column("seed:O")
        .spacing(0)
        .header(titleOrient="bottom", labelOrient="bottom", labelPadding=0)
        .title("Seed"),
    ).configure_view(
        stroke=None
    ).properties(
        width=70,
        title="Distribution of Tournament Wins by Seed"
    )

    3. Seed Treatment: Numeric vs Factor

    An important decision in this analysis is whether to use seed as a numeric or an ordered categorical predictor. Treating seed as a numeric explanatory variable assumes that the effect of seed is linear on the log scale of the expected number of tourney_wins.

    To test if this assumption is appropriate, we can compare models that make different assumptions about seed. We’ll create models using seed as both a numeric variable and a factor.
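    To make the two assumptions concrete, the competing Poisson models can be sketched as follows. The symbols \(\beta_0\), \(\beta_1\), and \(\alpha_k\) are notation introduced here for illustration and do not appear in the original analysis; \(\lambda\) denotes the expected number of tournament wins.

```latex
% Numeric seed: a single slope, so every one-step increase in seed
% multiplies the expected wins by the same factor exp(beta_1).
\log \lambda_i = \beta_0 + \beta_1\,\mathrm{seed}_i
\quad\Longrightarrow\quad
\frac{\lambda(\mathrm{seed}+1)}{\lambda(\mathrm{seed})} = e^{\beta_1}

% Factor seed: a separate effect alpha_k for each seed level k
% (alpha_1 = 0 for the reference level), so the ratio between
% adjacent seeds may change from one step to the next.
\log \lambda_i = \beta_0 + \alpha_{\mathrm{seed}_i}
\quad\Longrightarrow\quad
\frac{\lambda(k+1)}{\lambda(k)} = e^{\alpha_{k+1} - \alpha_k}
```

    The numeric model is the special case of the factor model in which \(\alpha_k = \beta_1 (k - 1)\), which is why the two models are nested and can be compared with a likelihood ratio test.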

    However, first, we need to encode seed as an ordered factor.

    • R
    • Python
    marchmadness_factor <- marchmadness |> 
      mutate(seed = as.ordered(seed)) |> 
      mutate(seed = fct_relevel(seed, as.character(1:16)))
    marchmadness_factor = marchmadness.copy()
    
    marchmadness_factor["seed"] = pd.Categorical(
        marchmadness["seed"], categories=range(1, 17), ordered=True
    )

    Because tourney_wins is the response, a linear regression model could predict negative values at high seed numbers. A Poisson regression model is better suited, since tournament wins are counts and are always non-negative.
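    As a rough illustration of this point, the sketch below compares a log-link mean with a straight line. It assumes the treatment-coded coefficients reported for the factor model in Section 5, and it fits the straight line to the sixteen per-seed means rather than to the raw data, so it is not the regression a full analysis would run.

```python
import math

# Treatment-coded coefficients (log scale) copied from the Python summary
# table in Section 5; seed 1 is the reference level.
INTERCEPT = 1.25
SEED_COEF = {1: 0.0, 2: -0.34, 3: -0.66, 4: -0.80, 5: -1.25, 6: -1.41,
             7: -1.57, 8: -1.91, 9: -1.87, 10: -2.11, 11: -2.08,
             12: -2.73, 13: -3.65, 14: -26.95, 15: -26.95, 16: -4.48}

# Poisson (log link): expected wins are exp(linear predictor), always > 0.
poisson_mean = {s: math.exp(INTERCEPT + c) for s, c in SEED_COEF.items()}

# Least-squares straight line through the (seed, mean) points.
# Illustration only: fitted to per-seed means, not to the raw data.
seeds = sorted(poisson_mean)
x_bar = sum(seeds) / len(seeds)
y_bar = sum(poisson_mean.values()) / len(seeds)
slope = (sum((s - x_bar) * (poisson_mean[s] - y_bar) for s in seeds)
         / sum((s - x_bar) ** 2 for s in seeds))
intercept = y_bar - slope * x_bar
linear_pred = {s: intercept + slope * s for s in seeds}

for s in (1, 8, 16):
    print(f"seed {s:>2}: log-link mean = {poisson_mean[s]:.2f}, "
          f"straight line = {linear_pred[s]:.2f}")
```

    The straight line drops below zero for the highest seed numbers, while the log-link means stay positive by construction.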

    • R
    • Python
    options(contrasts = c("contr.treatment", "contr.sdif"))
    
    poisson_model <- glm(tourney_wins ~ seed, family = "poisson", data = marchmadness)
    
    poisson_model_factor <- glm(tourney_wins ~ seed, family = "poisson", data = marchmadness_factor)
    Note

    At the moment, the statsmodels Python package does not support successive contrasts.

    In this Python analysis, seed 1 will be used as the reference seed. To change the reference seed, such as seed 2:

    poisson_model_factor = smf.glm(
        "tourney_wins ~ C(seed, Treatment(reference=2))", # specify reference seed number here
        data=marchmadness_factor,
        family=sm.families.Poisson()
    ).fit()
    poisson_model = smf.glm(
        "tourney_wins ~ seed", data=marchmadness, family=sm.families.Poisson()
    ).fit()
    
    poisson_model_factor = smf.glm(
        "tourney_wins ~ seed", data=marchmadness_factor, family=sm.families.Poisson()
    ).fit()
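    For readers curious about what successive-difference contrasts do, here is a minimal sketch in plain Python. The helper names and the coefficient values are hypothetical, and the coding matrix is intended to mirror the one produced by contr.sdif from the MASS package in R; it is not part of statsmodels.

```python
from fractions import Fraction


def successive_difference_contrasts(n_levels):
    """Return the n x (n-1) successive-difference coding matrix.

    Entry [i][j] (0-based) is -(n-j-1)/n for rows i <= j and (j+1)/n
    otherwise. Intended to mirror MASS::contr.sdif in R.
    """
    n = n_levels
    return [[Fraction(-(n - j - 1), n) if i <= j else Fraction(j + 1, n)
             for j in range(n - 1)]
            for i in range(n)]


def linear_predictors(intercept, betas):
    """Linear predictor of each factor level under the coding above."""
    contrasts = successive_difference_contrasts(len(betas) + 1)
    return [intercept + sum(c * b for c, b in zip(row, betas))
            for row in contrasts]


# Hypothetical coefficients for a four-level factor (not from the data).
betas = [Fraction(-3, 10), Fraction(-1, 2), Fraction(-1, 5)]
eta = linear_predictors(Fraction(1), betas)

# Differences between adjacent levels reproduce the coefficients.
diffs = [eta[k + 1] - eta[k] for k in range(len(betas))]
print(diffs == betas)
```

    Under this coding each coefficient is the difference in the linear predictor between one level and the level just before it, and the intercept is the average of the level-wise linear predictors. That is why the R output in Section 5 reports terms such as seed2-1 and seed3-2.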

    We can visualize how the two models fit the data to evaluate if treating seed as numeric or factor would have a significant impact on our modelling process.

    • R
    • Python
    marchmadness <- marchmadness |> 
      mutate(
        Numeric_Seed = predict(poisson_model, type = "response"),
        Factor_Seed = predict(poisson_model_factor, type = "response")
      )
    
    plot_data <- marchmadness |> 
      dplyr::select(seed, tourney_wins, Numeric_Seed, Factor_Seed) |> 
      pivot_longer(cols = c("Numeric_Seed","Factor_Seed"), names_to = "model", values_to = "predicted")
    
    ggplot(plot_data, aes(x = seed, y = predicted, color = model)) +
      geom_point(aes(y = tourney_wins), alpha = 0.3, color = "black") +
      geom_line(stat = "smooth", method = "loess", se = FALSE, linewidth = 1.2) +
      labs(title = "Predicted Tournament Wins by Seed",
           x = "Seed",
           y = "Predicted Wins",
           color = "Model") +
      theme_minimal()

    marchmadness["Numeric_Seed"] = poisson_model.predict()
    marchmadness["Factor_Seed"] = poisson_model_factor.predict()
    
    plot_data = pd.melt(
        marchmadness,
        id_vars=["seed", "tourney_wins"],
        value_vars=["Numeric_Seed", "Factor_Seed"],
        var_name="model",
        value_name="predicted",
    )
    
    points = alt.Chart(plot_data).mark_circle(opacity=0.3, color="black").encode(
        x=alt.X("seed").title("Seed"),
        y=alt.Y("tourney_wins:Q").title("Predicted Wins")
    )
    
    lines = alt.Chart(plot_data).mark_line(size=2).encode(
        x=alt.X("seed"),
        y="predicted:Q",
        color=alt.Color("model:N")
            .title("Model")
            .scale(alt.Scale(domain=["Factor_Seed", "Numeric_Seed"],
                             range=["salmon", "darkturquoise"]
                            )
                  )
    ).transform_loess(on="seed", loess="predicted", groupby=["model"])
    
    
    (points + lines).properties(title='Predicted Tournament Wins by Seed', height=400, width=600)

    Visually, the two models do not appear to differ much. To formally evaluate which approach is better, we can use likelihood-based model selection tools.

    • R
    • Python
    kable(glance(poisson_model), digits = 2)
    null.deviance df.null logLik AIC BIC deviance df.residual nobs
    3438.76 2083 -2117.42 4238.84 4250.12 1610.34 2082 2084
    kable(glance(poisson_model_factor), digits = 2)
    null.deviance df.null logLik AIC BIC deviance df.residual nobs
    3438.76 2083 -2079.52 4191.04 4281.31 1534.54 2068 2084
    summary_poisson_model_dict = {
        'null.deviance' : poisson_model.null_deviance,
        'df.null' : poisson_model.nobs - 1,
        'logLik': poisson_model.llf,
        'AIC': poisson_model.aic,
        # bic_llf is the log-likelihood-based BIC, comparable to R's BIC;
        # the default .bic of a statsmodels GLM result is deviance-based
        'BIC': poisson_model.bic_llf,
        'deviance': poisson_model.deviance,
        'df.residual': poisson_model.df_resid,
        'nobs': poisson_model.nobs
    }
    
    summary_poisson_model_df = pd.DataFrame([summary_poisson_model_dict]).round(2)
    
    Markdown(summary_poisson_model_df.to_markdown(index = False))
    null.deviance df.null logLik AIC BIC deviance df.residual nobs
    3438.76 2083 -2117.42 4238.84 4250.12 1610.34 2082 2084
    summary_poisson_model_factor_dict = {
        'null.deviance' : poisson_model_factor.null_deviance,
        'df.null' : poisson_model_factor.nobs - 1,
        'logLik': poisson_model_factor.llf,
        'AIC': poisson_model_factor.aic,
        # bic_llf is the log-likelihood-based BIC, comparable to R's BIC;
        # the default .bic of a statsmodels GLM result is deviance-based
        'BIC': poisson_model_factor.bic_llf,
        'deviance': poisson_model_factor.deviance,
        'df.residual': poisson_model_factor.df_resid,
        'nobs': poisson_model_factor.nobs
    }
    
    summary_poisson_model_factor_df = pd.DataFrame([summary_poisson_model_factor_dict]).round(2)
    
    Markdown(summary_poisson_model_factor_df.to_markdown(index = False))
    null.deviance df.null logLik AIC BIC deviance df.residual nobs
    3438.76 2083 -2079.52 4191.04 4281.31 1534.54 2068 2084

    Based on lower residual deviance, higher log-likelihood, and a lower AIC, the model that treats seed as a factor fits the data better. (In the R output, BIC, which penalizes the 14 extra parameters more heavily, is lower for the numeric model.) This suggests that the relationship between tournament seed and number of wins is not linear on the log scale, and supports using an approach that does not assume a constant effect per unit change in seed.
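    The information criteria in the R tables can be reproduced by hand from the reported log-likelihoods. The sketch below assumes the standard definitions AIC = 2k - 2 logLik and BIC = k log(n) - 2 logLik, with the parameter counts k inferred from the residual degrees of freedom (2084 - 2082 = 2 and 2084 - 2068 = 16); the function names are illustrative.

```python
import math


def aic(log_lik, n_params):
    """Akaike information criterion."""
    return 2 * n_params - 2 * log_lik


def bic(log_lik, n_params, n_obs):
    """Bayesian information criterion (log-likelihood based)."""
    return n_params * math.log(n_obs) - 2 * log_lik


N_OBS = 2084
models = {
    # intercept + one slope
    "numeric seed": {"log_lik": -2117.42, "n_params": 2},
    # intercept + 15 seed contrasts
    "factor seed": {"log_lik": -2079.52, "n_params": 16},
}

for name, m in models.items():
    print(name, round(aic(**m), 2), round(bic(**m, n_obs=N_OBS), 2))
```

    Because the log-likelihoods are rounded to two decimals, the results can differ from the tables in the last digit.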

    However, before deciding on using a more complex model, we can evaluate if this complexity offers a significantly better modeling approach. To do this, we can perform a likelihood ratio test.

    \(H_0\): The simpler model poisson_model (numeric seed) fits the data as well as poisson_model_factor; the additional parameters do not improve the fit.

    \(H_a\): Model poisson_model_factor fits the data significantly better than model poisson_model.

    • R
    • Python
    anova_result <- tidy(anova(poisson_model, poisson_model_factor, test = "Chisq"))
    
    kable(anova_result, digits = 2) 
    term df.residual residual.deviance df deviance p.value
    tourney_wins ~ seed 2082 1610.34 NA NA NA
    tourney_wins ~ seed 2068 1534.54 14 75.79 0
    # manual chi sq test for GLMs:
    
    # deviance
    deviance = 2 * (poisson_model_factor.llf - poisson_model.llf)
    df_diff = poisson_model_factor.df_model - poisson_model.df_model
    p_value = stats.chi2.sf(deviance, df_diff)
    
    # results
    anova_result = pd.DataFrame({
        "model comparison": ["poisson_model vs poisson_model_factor"],
        "deviance": [deviance],
        "df": [df_diff],
        "p-value": [p_value]
    }).round(2)
    
    Markdown(anova_result.to_markdown(index = False))
    model comparison deviance df p-value
    poisson_model vs poisson_model_factor 75.79 14 0

    These results indicate strong evidence that the model with seed treated as a factor fits the data significantly better than treating it as a numeric predictor. Therefore, we will proceed by using seed as a factor.
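    The p-value is displayed as 0 only because of rounding. As a rough check, the sketch below recomputes the test from the rounded log-likelihoods reported earlier, using a closed-form expression for the chi-square upper tail that holds only for even degrees of freedom; the helper name is hypothetical, and scipy.stats.chi2.sf would normally be used instead.

```python
import math


def chi2_sf_even_df(x, df):
    """Upper-tail probability of a chi-square variable (even df only).

    Uses P(X > x) = exp(-x/2) * sum_{j=0}^{df/2 - 1} (x/2)^j / j!.
    """
    if df <= 0 or df % 2 != 0:
        raise ValueError("closed form requires a positive even df")
    half = x / 2
    term, total = 1.0, 1.0  # j = 0 term
    for j in range(1, df // 2):
        term *= half / j
        total += term
    return math.exp(-half) * total


# Rounded log-likelihoods from the model summaries above.
lr_stat = 2 * (-2079.52 - (-2117.42))
df_diff = 16 - 2  # difference in number of estimated parameters
p_value = chi2_sf_even_df(lr_stat, df_diff)

print(f"LR statistic = {lr_stat:.2f}, df = {df_diff}, p-value = {p_value:.2e}")
```

    The statistic differs from the reported 75.79 in the second decimal because the inputs are rounded; the p-value is tiny but not exactly zero.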

    4. Overdispersion Testing

    Poisson regression assumes that the mean of the count variable equals its variance. If the variance is much greater than the mean (overdispersion), we might need a Negative Binomial model. We can run a dispersion test to check this assumption.

    Letting \(Y_i\) be the \(i\)th Poisson response in the count regression model, in the presence of equidispersion, \(Y_i\) has the following parameters:

    \(E(Y_i)=\lambda_i, Var(Y_i)=\lambda_i\)

    The test uses the following mathematical expression (using a \(1+\gamma\) dispersion factor):

    \(Var(Y_i)=(1+\gamma)\times\lambda_i\)

    with the hypotheses:

    \(H_0:1 + \gamma = 1\)

    \(H_a: 1 + \gamma > 1\)

    When there is evidence of overdispersion in our data, we will reject \(H_0\).

    • R
    • Python
    kable(tidy(dispersiontest(poisson_model_factor)), digits = 2)
    estimate statistic p.value method alternative
    0.96 -0.56 0.71 Overdispersion test greater
    Note

    At the moment, the statsmodels Python package does not have a built-in hypothesis test for dispersion in count models. However, a manual Pearson chi-square test can be performed to assess whether overdispersion or underdispersion is present in the data.

    Note that R’s dispersiontest() function from the AER package uses a score test, which is more statistically rigorous but also more complex to implement in Python.

    pearson_chi2 = np.sum(poisson_model_factor.resid_pearson**2)
    df_resid = poisson_model_factor.df_resid
    dispersion = pearson_chi2 / df_resid
    p_value = 1 - stats.chi2.cdf(pearson_chi2, df_resid)
    
    dispersion_result = pd.DataFrame({
        "statistic": [pearson_chi2],
        "df": [df_resid],
        "dispersion": [dispersion],
        "p-value": [p_value]
    }).round(2)
    
    Markdown(dispersion_result.to_markdown(index = False))
    statistic df dispersion p-value
    1795.44 2068 0.87 1
    • A dispersion of ~1.0 indicates that the dispersion matches the Poisson assumption, in which the variance of the data is equal to the mean.

    • A dispersion of >1.0 indicates overdispersion, in which the variance is greater than the mean.

    • A dispersion of <1.0 indicates underdispersion, in which the variance is less than the mean.

    Since the p-value is much greater than 0.05, we fail to reject the null hypothesis. This suggests that there is no significant evidence of overdispersion in the Poisson model.
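    The dispersion estimate in the Python output is simply the Pearson chi-square statistic divided by the residual degrees of freedom. A minimal sketch of that calculation follows; the helper name and the toy counts are hypothetical and are used only to show the arithmetic.

```python
def pearson_dispersion(observed, fitted, n_params):
    """Pearson chi-square statistic divided by residual degrees of freedom."""
    pearson_chi2 = sum((y - mu) ** 2 / mu for y, mu in zip(observed, fitted))
    df_resid = len(observed) - n_params
    return pearson_chi2 / df_resid


# Values reported in the Python output above: statistic and residual df.
reported_dispersion = 1795.44 / 2068
print(round(reported_dispersion, 2))

# Hypothetical toy counts with an intercept-only fit (mean 1.5).
toy_counts = [0, 1, 2, 3]
toy_fitted = [1.5] * len(toy_counts)
toy_dispersion = pearson_dispersion(toy_counts, toy_fitted, n_params=1)
print(round(toy_dispersion, 3))
```

    A ratio below 1, as reported here, points away from overdispersion, which is consistent with the result of the R score test.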

    5. Hypothesis Testing: Are Seed and Wins Associated?

    • R
    • Python
    summary_model <- tidy(poisson_model_factor) |> 
      mutate(exp_estimate = exp(estimate)) |> 
      mutate_if(is.numeric, round, 3) |> 
      filter(p.value <= 0.05)
      
    kable(summary_model, digits = 2)
    term estimate std.error statistic p.value exp_estimate
    seed2-1 -0.34 0.07 -4.96 0.00 0.71
    seed3-2 -0.33 0.08 -4.05 0.00 0.72
    seed5-4 -0.46 0.10 -4.34 0.00 0.63
    seed8-7 -0.34 0.15 -2.22 0.03 0.71
    seed12-11 -0.64 0.23 -2.75 0.01 0.52
    seed13-12 -0.92 0.38 -2.40 0.02 0.40
    Note

    As mentioned before, the statsmodels Python package does not support successive contrasts at the moment.

    In this Python analysis, seed 1 is used as the reference seed by default.

    To obtain successive differences, you can compute them manually by subtracting the corresponding coefficients. For example, the difference between seed 3 and seed 2 is \(-0.66 - (-0.34) = -0.32\), which agrees with the R estimate of -0.33 up to rounding.

    You can change the reference seed (such as seed 2) when fitting the model, which would allow you to assess whether other seeds differ significantly from the new reference seed:

    poisson_model_factor = smf.glm(
        "tourney_wins ~ C(seed, Treatment(reference=2))", # specify reference seed number here
        data=marchmadness_factor,
        family=sm.families.Poisson()
    ).fit()
    poisson_model_factor_summary = poisson_model_factor.summary2().tables[1]
    poisson_model_factor_summary['Exp Coef.'] = np.exp(poisson_model_factor_summary['Coef.'])
    poisson_model_factor_summary = poisson_model_factor_summary.rename(columns={'P>|z|' : 'p.value'}).round(2)
    Markdown(poisson_model_factor_summary.to_markdown())
    Coef. Std.Err. z p.value [0.025 0.975] Exp Coef.
    Intercept 1.25 0.04 28.43 0 1.16 1.33 3.48
    seed[T.2] -0.34 0.07 -4.96 0 -0.47 -0.2 0.71
    seed[T.3] -0.66 0.08 -8.83 0 -0.81 -0.52 0.51
    seed[T.4] -0.8 0.08 -10.11 0 -0.95 -0.64 0.45
    seed[T.5] -1.25 0.09 -13.46 0 -1.44 -1.07 0.29
    seed[T.6] -1.41 0.1 -14.15 0 -1.61 -1.21 0.24
    seed[T.7] -1.57 0.11 -14.84 0 -1.78 -1.36 0.21
    seed[T.8] -1.91 0.13 -15.25 0 -2.15 -1.66 0.15
    seed[T.9] -1.87 0.13 -14.72 0 -2.12 -1.63 0.15
    seed[T.10] -2.11 0.14 -14.97 0 -2.38 -1.83 0.12
    seed[T.11] -2.08 0.15 -14.33 0 -2.37 -1.8 0.12
    seed[T.12] -2.73 0.19 -14.06 0 -3.11 -2.35 0.07
    seed[T.13] -3.65 0.34 -10.84 0 -4.3 -2.99 0.03
    seed[T.14] -26.95 23290 -0 1 -45674.4 45620.5 0
    seed[T.15] -26.95 23173.2 -0 1 -45445.6 45391.7 0
    seed[T.16] -4.48 0.5 -8.92 0 -5.46 -3.49 0.01
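    The successive differences reported by R can be approximated from the treatment-coded coefficients in the table above. The sketch below uses the rounded values from that table, so results may differ from the R output in the second decimal. It yields point estimates only: standard errors and p-values for the differences would also require the coefficient covariance matrix, which is not shown here.

```python
import math

# Treatment-coded coefficients from the Python table (reference: seed 1).
# Seeds 14 and 15 are left out because their estimates are unstable
# (very large standard errors in the table).
coef = {1: 0.0, 2: -0.34, 3: -0.66, 4: -0.80, 5: -1.25, 6: -1.41, 7: -1.57,
        8: -1.91, 9: -1.87, 10: -2.11, 11: -2.08, 12: -2.73, 13: -3.65}

# Difference on the log scale between each seed and the one before it.
successive = {seed: coef[seed] - coef[seed - 1] for seed in range(2, 14)}

for seed, diff in successive.items():
    ratio = math.exp(diff)  # multiplicative change in expected wins
    print(f"seed {seed} vs {seed - 1}: diff = {diff:+.2f}, "
          f"rate ratio = {ratio:.2f}")
```

    A rate ratio of about 0.71 for seed 2 versus seed 1, for instance, corresponds to roughly 29% fewer expected wins.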

    Based on these results, we can see that seed is significantly associated with tourney_wins, with significant drops between several pairs of adjacent seeds. The significant successive differences in the R output can be interpreted as:

    • Seed 2 teams are expected to win 29% (1-0.71) fewer games than Seed 1 teams.

    • Seed 3 teams are expected to win 28% fewer games than Seed 2 teams.

    • Seed 5 teams are expected to win 37% fewer games than Seed 4 teams.

    • Seed 8 teams are expected to win 29% fewer games than Seed 7 teams.

    • Seed 12 teams are expected to win 48% fewer games than Seed 11 teams.

    • Seed 13 teams are expected to win 60% fewer games than Seed 12 teams.

    This conclusion is easier to interpret visually, so let’s plot our Poisson regression model to view the impact of seed on tourney_wins:

    • R
    • Python
    marchmadness$predicted_wins <- predict(poisson_model_factor, type = "response")
    
    model_plot <- ggplot(marchmadness, aes(x = seed, y = tourney_wins)) +
      geom_jitter(width = 0.3, alpha = 0.5) +
      geom_line(aes(y = predicted_wins), color = "skyblue2", linewidth = 1.2) +
      labs(title = "Poisson Regression: Predicted Tournament Wins by Seed",
           x = "Seed",
           y = "Tournament Wins") +
      theme_minimal()
    
    model_plot

    marchmadness["predicted_wins"] = poisson_model_factor.predict()
    
    result_plot_data = (
        marchmadness
        .groupby(['seed', 'predicted_wins', 'tourney_wins'])
        .size()
        .reset_index(name='count')
    )
    
    # Plot
    bubbles = alt.Chart(result_plot_data).mark_circle(opacity=0.6).encode(
        x=alt.X("seed:N").title("Seed").axis(alt.Axis(labelAngle=0, labelPadding=10)),
        y=alt.Y("tourney_wins:Q").title("Tournament Wins").axis(alt.Axis(format='d')),
        size=alt.Size("count:Q").title("Count").scale(alt.Scale(range=[10, 1000])),
        color=alt.value("black")
    )
    
    line = alt.Chart(result_plot_data).mark_line(size=3, color="skyblue").encode(
        x=alt.X("seed:N").axis(alt.Axis(labelAngle=0, labelPadding=10)),
        y=alt.Y("predicted_wins:Q").axis(alt.Axis(format='d')),
    )
    
    (bubbles + line).properties(
        title="Poisson Regression: Predicted Tournament Wins by Seed",
        width=600,
        height=400
    )

    Discussion

    This analysis examined the relationship between a team’s tournament seed and its performance in the NCAA Division I Women’s Basketball Tournament. The results suggest that:

    1. Poisson regression supports seeding as a predictor: The Poisson regression model indicates that seed is a significant predictor of tournament wins, suggesting that higher-ranked teams tend to win more, even between closely ranked seeds.

    2. Proper variable coding is essential: Treating seed as a numeric variable would assume a linear relationship on the log scale of expected tournament wins, meaning each one-unit increase in seed (e.g., from 1 to 2, or 10 to 11) would be associated with the same proportional decrease in wins. This oversimplifies the real pattern. By encoding seed as an ordered factor, the model can estimate distinct effects for each seed level, allowing for more accurate and nuanced interpretation.

    3. There is a lot of variation around the prediction: While seeding generally reflects team strength, upsets and unexpected performances do occur, showing that other factors also influence tournament outcomes.

    Seeding is an important predictor of success, but clearly other factors influence the results. Unexpected performances are common in March Madness, so investigating additional variables could provide a fuller picture of what drives tournament outcomes.

     
     
