Diverse Data Hub
  • Data Sets
  • Citation
  • Collaborate
  1. Clean up for womens-march-madness.csv

On this page

  • Loading CSV
  • Renaming columns
  • Column Analysis and Reformatting
  • Changing datatypes:
  • NaN Standardization
  • Results Review
  • Final Reorganization of Columns
  • Saving CSV
  • Validating

Clean up for womens-march-madness.csv

Loading CSV

Load the necessary libraries for the clean-up, as well as the provided dataset

import pandas as pd
import re
import numpy as np
data = pd.read_csv("../data/raw/womens-march-madness/womens-march-madness.csv")
data.head()
Year School Seed Conference Conf. W Conf. L Conf. % Conf. place Reg. W Reg. L Reg. % How qual 1st game at home? Tourney W Tourney L Tourney finish Full W Full L Full %
0 1982 Arizona St. 4 Western Collegiate - - - - 23 6 79.3 at-large Y 1 1 RSF 24 7 77.4
1 1982 Auburn 7 Southeastern - - - - 24 4 85.7 at-large N 0 1 1st 24 5 82.8
2 1982 Cheyney 2 Independent - - - - 24 2 92.3 at-large Y 4 1 N2nd 28 3 90.3
3 1982 Clemson 5 Atlantic Coast 6 3 66.7 4th 20 11 64.5 at-large N 0 1 1st 20 12 62.5
4 1982 Drake 4 Missouri Valley - - - - 26 6 81.3 auto Y 2 1 RF 28 7 80

Renaming columns

Changing the column names to a more interpretable version:

column_names = {
    'Year' : 'year',
    'School' : 'school',
    'Seed' : 'seed',
    'Conference' : 'conference',
    'Conf. W' : 'conf_wins',
    'Conf. L' : 'conf_losses',
    'Conf. %' : 'conf_wins_pct',
    'Conf. place' : 'conf_place',
    'Reg. W' : 'reg_wins',
    'Reg. L' : 'reg_losses',
    'Reg. %' : 'reg_wins_pct',
    'How qual' : 'bid',
    '1st game at home?' : 'first_game_at_home',
    'Tourney W' : 'tourney_wins',
    'Tourney L' : 'tourney_losses',
    'Tourney finish' : 'tourney_finish',
    'Full W' : 'total_wins',
    'Full L' : 'total_losses',
    'Full %' : 'total_wins_pct'
}
data = data.rename(columns = column_names)
data.head()
year school seed conference conf_wins conf_losses conf_wins_pct conf_place reg_wins reg_losses reg_wins_pct bid first_game_at_home tourney_wins tourney_losses tourney_finish total_wins total_losses total_wins_pct
0 1982 Arizona St. 4 Western Collegiate - - - - 23 6 79.3 at-large Y 1 1 RSF 24 7 77.4
1 1982 Auburn 7 Southeastern - - - - 24 4 85.7 at-large N 0 1 1st 24 5 82.8
2 1982 Cheyney 2 Independent - - - - 24 2 92.3 at-large Y 4 1 N2nd 28 3 90.3
3 1982 Clemson 5 Atlantic Coast 6 3 66.7 4th 20 11 64.5 at-large N 0 1 1st 20 12 62.5
4 1982 Drake 4 Missouri Valley - - - - 26 6 81.3 auto Y 2 1 RF 28 7 80

Column Analysis and Reformatting

Analyzing and replacing the contents of the ‘Conf. place’ : ‘conf_place’ column:

data['conf_place'].unique()
array(['-', '4th', '1st', '2nd', '3rd', 'T1st W', '3rd E', '2nd E',
       'T2nd', '1st E', 'T2nd W', 'T1st E', 'T1st', 'T1st N', 'T3rd',
       '1st W', '2nd W', '1st S', '2nd S', 'T4th', 'T2nd E', 'T6th',
       '5th', '6th', 'T5th', 'T8th', 'T7th', '7th', '1st-B', '1st-W',
       '2nd-E', '2nd-6', '1st-E', '1st-R', 'T2nd-W', '1st-6', 'T1st-B',
       'T1st-W', '1st-P', '1st-M', 'T1st-M', '2nd-A', '1st-A', '1st-N',
       '2nd-7', 'T3rd-M', 'T2nd-6', '1st-7', 'T2nd-P', '2nd-N', '2nd-W',
       '8th', 'T1st-A', '2nd-M', '3rd-W', '4th-N', '9th', 'T1st-E',
       '10th', 'T9th', '1st Div.', '3rd Div', 'T3rd Div.'], dtype=object)
  • Numerical Places:

‘1st’, ‘2nd’, ‘3rd’, etc. — clearly ranked within the conference.

  • Tied Places:

‘T1st’, ‘T2nd’, etc. — team tied for that position.

‘T1st-W’, ‘T2nd-E’, etc. — tied for a position within a division (e.g., West, East).

  • Divisions within Conferences:

‘1st W’, ‘2nd E’, ‘3rd-W’, etc. — first, second, or third place in a regional division (West, East, etc.).

  • Ambiguous Codes:

‘1st-6’, ‘T2nd-6’, ‘2nd-7’ — unknown for divisions (e.g., “Group 6” or region 6).

‘1st-B’, ‘1st-R’, ‘1st-P’, ‘1st-M’, etc. — probably shorthand for colored or named divisions (e.g., Blue, Red, Pacific, Mountain).

‘1st Div.’, ‘3rd Div’, ‘T3rd Div.’ — older way of denoting division standing.

  • ‘-’ — missing or not applicable (e.g., for Independent schools not in a conference).

Functions to replace ‘conf_place’ with two columns ‘conf_rank’ and ‘division’

def extract_conf_rank(conf_place):
    if pd.isna(conf_place) or conf_place == '-':
        return np.nan
    # Match patterns like '1st', 'T2nd', '3rd-W', 'T1st-N', etc.
    match = re.search(r'(\d+)(st|nd|rd|th)', conf_place)
    if match:
        return int(match.group(1))
    return np.nan

def extract_division(conf_place):
    if pd.isna(conf_place) or conf_place == '-':
        return np.nan
    # Look for division info after rank (like 'W', 'E', '6', etc.)
    match = re.search(r'(?:\d+(?:st|nd|rd|th)[\s-]?)([A-Za-z0-9]+)?', conf_place)
    if match:
        return match.group(1)
    return np.nan
data['conf_rank'] = data['conf_place'].apply(extract_conf_rank)
data['division'] = data['conf_place'].apply(extract_division)
data.head()
year school seed conference conf_wins conf_losses conf_wins_pct conf_place reg_wins reg_losses ... bid first_game_at_home tourney_wins tourney_losses tourney_finish total_wins total_losses total_wins_pct conf_rank division
0 1982 Arizona St. 4 Western Collegiate - - - - 23 6 ... at-large Y 1 1 RSF 24 7 77.4 NaN NaN
1 1982 Auburn 7 Southeastern - - - - 24 4 ... at-large N 0 1 1st 24 5 82.8 NaN NaN
2 1982 Cheyney 2 Independent - - - - 24 2 ... at-large Y 4 1 N2nd 28 3 90.3 NaN NaN
3 1982 Clemson 5 Atlantic Coast 6 3 66.7 4th 20 11 ... at-large N 0 1 1st 20 12 62.5 4.0 None
4 1982 Drake 4 Missouri Valley - - - - 26 6 ... auto Y 2 1 RF 28 7 80 NaN NaN

5 rows × 21 columns

Analyzing and replacing the contents of the ‘Tourney finish’ : ‘tourney_finish’ column:

According to the documentation:

The round of the final game for each team. OR=opening-round loss (1983 only); 1st=first-round loss; 2nd=second-round loss; RSF=loss in the Sweet 16; RF=loss in the Elite Eight; NSF=loss in the national semifinals; N2nd=national runner-up; Champ=national champions.

We’ll recode the content of the column to more interpretable names.

data['tourney_finish'].unique()
array(['RSF', '1st', 'N2nd', 'RF', 'Champ', 'NSF', 'OR', '2nd'],
      dtype=object)
finish_map = {
    'OR': 'opening_round_loss',
    '1st': 'first_round_loss',
    '2nd': 'second_round_loss',
    'RSF': 'top_16_loss',
    'RF': 'top_8_loss',
    'NSF': 'top_4_loss',
    'N2nd': 'top_2_loss',
    'Champ': 'champ'
}

data['tourney_finish'] = data['tourney_finish'].replace(finish_map)

data.head()
year school seed conference conf_wins conf_losses conf_wins_pct conf_place reg_wins reg_losses ... bid first_game_at_home tourney_wins tourney_losses tourney_finish total_wins total_losses total_wins_pct conf_rank division
0 1982 Arizona St. 4 Western Collegiate - - - - 23 6 ... at-large Y 1 1 top_16_loss 24 7 77.4 NaN NaN
1 1982 Auburn 7 Southeastern - - - - 24 4 ... at-large N 0 1 first_round_loss 24 5 82.8 NaN NaN
2 1982 Cheyney 2 Independent - - - - 24 2 ... at-large Y 4 1 top_2_loss 28 3 90.3 NaN NaN
3 1982 Clemson 5 Atlantic Coast 6 3 66.7 4th 20 11 ... at-large N 0 1 first_round_loss 20 12 62.5 4.0 None
4 1982 Drake 4 Missouri Valley - - - - 26 6 ... auto Y 2 1 top_8_loss 28 7 80 NaN NaN

5 rows × 21 columns

We’re going to make this column an ordered categorical.

finish_order = [
    'opening_round_loss',
    'first_round_loss',
    'second_round_loss',
    'top_16_loss',
    'top_8_loss',
    'top_4_loss',
    'top_2_loss',
    'champ'
]

data['tourney_finish'] = pd.Categorical(
    data['tourney_finish'], 
    categories=finish_order, 
    ordered=True
)

data.head()
year school seed conference conf_wins conf_losses conf_wins_pct conf_place reg_wins reg_losses ... bid first_game_at_home tourney_wins tourney_losses tourney_finish total_wins total_losses total_wins_pct conf_rank division
0 1982 Arizona St. 4 Western Collegiate - - - - 23 6 ... at-large Y 1 1 top_16_loss 24 7 77.4 NaN NaN
1 1982 Auburn 7 Southeastern - - - - 24 4 ... at-large N 0 1 first_round_loss 24 5 82.8 NaN NaN
2 1982 Cheyney 2 Independent - - - - 24 2 ... at-large Y 4 1 top_2_loss 28 3 90.3 NaN NaN
3 1982 Clemson 5 Atlantic Coast 6 3 66.7 4th 20 11 ... at-large N 0 1 first_round_loss 20 12 62.5 4.0 None
4 1982 Drake 4 Missouri Valley - - - - 26 6 ... auto Y 2 1 top_8_loss 28 7 80 NaN NaN

5 rows × 21 columns

Analyzing and replacing the contents of the ‘Seed’ : ‘seed’ column:

data.loc[data['seed'] == '(OR)', 'seed'] = '0'

Changing datatypes:

data.replace("-", np.nan, inplace=True)

data['seed'] = data['seed'].astype('int')
data['conf_wins'] = data['conf_wins'].astype('float')
data['conf_losses'] = data['conf_losses'].astype('float')
data['conf_wins_pct'] = data['conf_wins_pct'].astype('float')
data['total_wins_pct'] = data['total_wins_pct'].str.replace('\\', '')
data['first_game_at_home'] = data['first_game_at_home'].str.replace('^', '')
data['total_wins_pct'] = data['total_wins_pct'].astype('float')

data.head()
year school seed conference conf_wins conf_losses conf_wins_pct conf_place reg_wins reg_losses ... bid first_game_at_home tourney_wins tourney_losses tourney_finish total_wins total_losses total_wins_pct conf_rank division
0 1982 Arizona St. 4 Western Collegiate NaN NaN NaN NaN 23 6 ... at-large Y 1 1 top_16_loss 24 7 77.4 NaN NaN
1 1982 Auburn 7 Southeastern NaN NaN NaN NaN 24 4 ... at-large N 0 1 first_round_loss 24 5 82.8 NaN NaN
2 1982 Cheyney 2 Independent NaN NaN NaN NaN 24 2 ... at-large Y 4 1 top_2_loss 28 3 90.3 NaN NaN
3 1982 Clemson 5 Atlantic Coast 6.0 3.0 66.7 4th 20 11 ... at-large N 0 1 first_round_loss 20 12 62.5 4.0 None
4 1982 Drake 4 Missouri Valley NaN NaN NaN NaN 26 6 ... auto Y 2 1 top_8_loss 28 7 80.0 NaN NaN

5 rows × 21 columns

NaN Standardization

data.replace(to_replace=[pd.NA, "nan", "NaN", "None", None], value=np.nan, inplace=True)

Results Review

for column in data.columns:
    print(f"Column: {column}")
    print(f"Data type: {data[column].dtype}")
    print(f"Unique values: {data[column].unique()}\n")
Column: year
Data type: int64
Unique values: [1982 1983 1984 1985 1986 1987 1988 1989 1990 1991 1992 1993 1994 1995
 1996 1997 1998 1999 2000 2001 2002 2003 2004 2005 2006 2007 2008 2009
 2010 2011 2012 2013 2014 2015 2016 2017 2018]

Column: school
Data type: object
Unique values: ['Arizona St.' 'Auburn' 'Cheyney' 'Clemson' 'Drake' 'East Carolina'
 'Georgia' 'Howard' 'Illinois' 'Jackson St.' 'Kansas St.' 'Kent St.'
 'Kentucky' 'Long Beach St.' 'Louisiana Tech' 'Maryland' 'Memphis'
 'Missouri' 'NC State' 'Northwestern' 'Ohio St.' 'Old Dominion' 'Ole Miss'
 'Oregon' 'Penn St.' "Saint Peter's" 'SFA' 'South Carolina'
 'Southern California' 'Stanford' 'Tennessee' 'Tennessee Tech'
 'Central Mich.' 'Dartmouth' 'Florida St.' 'Illinois St.' 'Indiana'
 'La Salle' 'La.-Monroe' 'Louisville' 'Middle Tenn.' 'Monmouth' 'Montana'
 'North Carolina' 'Oregon St.' 'South Carolina St.' "St. John's (NY)"
 'Texas' 'UCLA' 'Utah' 'Alabama' 'BYU' 'LSU' 'San Diego St.' 'Texas Tech'
 'UNLV' 'Virginia' 'Holy Cross' 'Idaho' "Saint Joseph's" 'Southern Miss.'
 'Syracuse' 'Washington' 'Western Ky.' 'Western Mich.' 'Arkansas' 'Iowa'
 'James Madison' 'North Texas' 'Ohio' 'Oklahoma' 'Providence' 'Rutgers'
 'Southern Ill.' 'Vanderbilt' 'Villanova' 'Bowling Green' 'Duke'
 'Eastern Wash.' 'Kansas' 'Manhattan' 'New Mexico St.' 'New Orleans'
 'Northwestern St.' 'South Alabama' 'Colorado' 'Eastern Ill.' 'Fairfield'
 'Houston' 'Nebraska' 'Wake Forest' 'Cal St. Fullerton' 'Chattanooga'
 'Cincinnati' 'Hawaii' 'Miami (FL)' 'Oklahoma St.' 'Purdue' 'Temple'
 'UConn' 'West Virginia' 'Appalachian St.' 'California' 'DePaul'
 'Michigan' 'Northern Ill.' 'Richmond' 'George Washington' 'Lamar'
 'Michigan St.' 'Missouri St.' 'Toledo' 'Washington St.' 'Creighton'
 'Notre Dame' 'Santa Clara' 'UC Santa Barbara' 'Vermont' 'Wisconsin'
 'Florida' 'Ga. Southern' 'Georgetown' 'Georgia Tech' 'Montana St.'
 'San Diego' 'Xavier' 'Boise St.' 'Brown' 'FIU' 'Fordham' 'Grambling'
 'Green Bay' 'Loyola (MD)' 'Marquette' 'Minnesota' "Mt. St. Mary's"
 'N.C. A&T' 'Portland' 'Radford' 'Seton Hall' 'SMU' 'Tennessee St.'
 'Texas A&M' 'UAB' 'Virginia Tech' 'Florida A&M' 'Furman' 'Maine'
 'San Francisco' 'Tulane' 'UC Irvine' 'Western Ill.' 'Austin Peay'
 'Butler' 'Colorado St.' 'Harvard' 'Massachusetts' 'Rhode Island'
 'St. Francis (PA)' 'UCF' 'Youngstown St.' 'Arizona' 'Detroit Mercy'
 'Eastern Ky.' 'Iowa St.' 'Lehigh' 'Liberty' 'Marshall' 'Texas St.' 'Troy'
 'New Mexico' 'Saint Francis (PA)' 'UNC Greensboro' 'Boston College'
 'CSUN' 'Evansville' 'Mississippi St.' 'Northeastern' 'Oral Roberts'
 "St. Mary's (CA)" 'Alcorn St.' 'Campbell' 'Hampton' 'Pepperdine' 'Rice'
 'Baylor' 'Delaware' 'Denver' 'Georgia St.' 'Idaho St.' 'Long Island'
 'Milwaukee' 'Penn' 'Siena' 'TCU' 'Bucknell' 'Hartford' 'Norfolk St.'
 'Oakland' 'Southern' 'Weber St.' 'Alabama St.' 'Boston University'
 'Charlotte' 'Valparaiso' 'Colgate' 'Eastern Mich.' 'Lipscomb'
 'Loyola Marymount' 'Marist' 'Canisius' 'Coppin St.' 'Stetson'
 'UT Arlington' 'Western Carolina' 'Army' 'Fla. Atlantic'
 'Northern Arizona' 'Sacred Heart' 'South Florida' 'Southeast Mo. St.'
 'Tulsa' 'UC Riverside' 'Belmont' 'Delaware St.' 'Gonzaga' 'Louisiana'
 'Pittsburgh' 'Prairie View' 'Robert Morris' 'UMBC' 'Asheville'
 'Cleveland St.' 'Cornell' 'ETSU' 'Fresno St.' 'Miami (OH)' 'Murray St.'
 'UTEP' 'UTSA' 'Wyoming' 'Ball St.' 'Drexel' 'NC A&T' 'South Dakota St.'
 'VCU' 'Dayton' 'Little Rock' 'Portland St.' 'Princeton' 'UNI'
 'Gardner-Webb' 'McNeese' 'Navy' 'Samford' 'UC Davis' 'UT Martin' 'Albany'
 'FGCU' 'St. Bonaventure' 'Cal Poly' 'Quinnipiac' 'Wichita St.' 'Akron'
 'North Dakota' 'South Dakota' 'Winthrop' 'Wright St.' 'American'
 'Savannah St.' 'St. Francis Brooklyn' 'Army West Point' 'Buffalo'
 'Central Arkansas' 'Duquesne' 'Iona' 'Jacksonville' 'UNC Asheville'
 'Elon' 'Texas Southern' 'Mercer' 'Nicholls St.' 'Northern Colo.'
 'Seattle']

Column: seed
Data type: int64
Unique values: [ 4  7  2  5  6  8  1  3  0 10  9 12 11 16 14 15 13]

Column: conference
Data type: object
Unique values: ['Western Collegiate' 'Southeastern' 'Independent' 'Atlantic Coast'
 'Missouri Valley' 'Mid-Eastern' 'Big Ten' 'Southwestern' 'Big Eight'
 'Mid-American' 'Metro' 'Metro Atlantic' 'Northern California'
 'Ohio Valley' 'Ivy' 'East Coast' 'Southland' 'Cosmopolitan'
 'Mountain West Athl.' 'Sun Belt' 'Northern Pacific' 'Atlantic 10'
 'Big East' 'Southwest' 'High Country' 'Pacific Coast' 'Colonial'
 'Pacific West' 'Western Athletic' 'Gulf Star' 'Pacific-10' 'Big West'
 'American South' 'Southern' 'Big Sky' 'North Star' 'Patriot'
 'Great Midwest' 'Midwestern' 'West Coast' 'North Atlantic'
 'Mid-Continent' 'Trans America' 'Northeast' 'Big South' 'Conference USA'
 'Trans-America' 'Big 12' 'America East' 'Horizon' 'Mountain West'
 'Atlantic Sun' ' Mountain West' 'Summit' 'Pac-12' 'American Atletic'
 'American Athletic' 'ASUN' 'Colonial Athletic' 'Western Atlantic'
 'Atlantic-10' 'SWAC' 'Atlantic']

Column: conf_wins
Data type: float64
Unique values: [nan  6. 11.  5.  9. 10.  4. 12.  7. 18. 16. 13. 14. 17.  8. 15.  3. 19.
 20.]

Column: conf_losses
Data type: float64
Unique values: [nan  3.  1.  2.  0.  5.  4.  6.  8.  7.  9. 11. 10. 13. 14. 12.]

Column: conf_wins_pct
Data type: float64
Unique values: [  nan  66.7  85.7  84.6 100.   75.   83.3  64.3  91.7  50.   76.9  92.3
  87.5  62.5  90.   88.9  92.9  70.   71.4  94.4  81.3  78.6  80.   77.8
  44.4  42.9  57.1  55.6  61.1  93.8  68.8  86.7  72.2  60.   63.6  56.3
  90.9  54.5  81.8  36.4  58.3  72.7  21.4  53.8  46.7  41.7  93.3  41.2
  43.8  95.   84.2  94.1  38.9  55.   82.4  27.8  37.5  22.2  61.9  85.
  33.3]

Column: conf_place
Data type: object
Unique values: [nan '4th' '1st' '2nd' '3rd' 'T1st W' '3rd E' '2nd E' 'T2nd' '1st E'
 'T2nd W' 'T1st E' 'T1st' 'T1st N' 'T3rd' '1st W' '2nd W' '1st S' '2nd S'
 'T4th' 'T2nd E' 'T6th' '5th' '6th' 'T5th' 'T8th' 'T7th' '7th' '1st-B'
 '1st-W' '2nd-E' '2nd-6' '1st-E' '1st-R' 'T2nd-W' '1st-6' 'T1st-B'
 'T1st-W' '1st-P' '1st-M' 'T1st-M' '2nd-A' '1st-A' '1st-N' '2nd-7'
 'T3rd-M' 'T2nd-6' '1st-7' 'T2nd-P' '2nd-N' '2nd-W' '8th' 'T1st-A' '2nd-M'
 '3rd-W' '4th-N' '9th' 'T1st-E' '10th' 'T9th' '1st Div.' '3rd Div'
 'T3rd Div.']

Column: reg_wins
Data type: int64
Unique values: [23 24 20 26 19 21 14 28 17 22 30 25 27 15 18 16 29 12 33 31 13 32 34 10]

Column: reg_losses
Data type: int64
Unique values: [ 6  4  2 11  7  8 10  5 13  1  3  9 12 14 15  0 16 17 18]

Column: reg_wins_pct
Data type: float64
Unique values: [ 79.3  85.7  92.3  64.5  81.3  73.1  72.4  58.3  80.   82.8  56.7  75.9
  81.5  96.8  78.6  86.2  74.2  82.1  75.   76.   80.8  87.1  83.3  65.2
  87.   67.9  66.7  76.7  71.4  92.9  72.   69.   64.3  85.2  57.1  96.4
  50.   89.7  83.9  69.6  92.6  93.3  73.3  93.1  84.6  88.   51.6  89.3
  63.3  93.8  70.4  77.8  96.6  90.   67.7  60.   70.   96.3  96.7  87.5
  58.6  76.5  65.5 100.   90.6  62.1  86.7  61.3  74.1  80.6  60.9  71.9
  60.7  68.8  78.8  71.   77.4  59.4  46.7  88.9  62.5  61.5  59.3  76.9
  63.   41.4  69.2  88.5  84.4  51.7  93.5  90.9  75.8  53.6  81.8  69.7
  65.4  93.9  65.6  90.3  53.3  63.6  58.1  68.6  96.9  55.2  54.8  43.3
  78.1  87.9  43.8  72.7  55.6  46.9  84.8  56.3  97.   53.1  45.5  51.5
  60.6  88.2  57.6  52.9  82.4  45.2  64.7  91.2  97.1  67.6  79.4  55.9
  86.4  94.1]

Column: bid
Data type: object
Unique values: ['at-large' 'auto']

Column: first_game_at_home
Data type: object
Unique values: ['Y' 'N']

Column: tourney_wins
Data type: int64
Unique values: [1 0 4 2 5 3 6]

Column: tourney_losses
Data type: int64
Unique values: [1 0]

Column: tourney_finish
Data type: category
Unique values: ['top_16_loss', 'first_round_loss', 'top_2_loss', 'top_8_loss', 'champ', 'top_4_loss', 'opening_round_loss', 'second_round_loss']
Categories (8, object): ['opening_round_loss' < 'first_round_loss' < 'second_round_loss' < 'top_16_loss' < 'top_8_loss' < 'top_4_loss' < 'top_2_loss' < 'champ']

Column: total_wins
Data type: int64
Unique values: [24 28 20 19 21 14 26 17 23 35 25 22 27 15 18 16 31 29 30 32 34 33 12 39
 36 37 13 40 38]

Column: total_losses
Data type: int64
Unique values: [ 7  5  3 12  8  9 11  6 14  1  4 10 13  2 15 16  0 17 18 53 19]

Column: total_wins_pct
Data type: float64
Unique values: [ 77.4  82.8  90.3  62.5  80.   70.4  70.   56.   77.8  81.3  54.8  75.
  79.3  97.2  78.1  83.9  72.7  72.4  73.1  78.6  84.4  83.3  73.3  84.6
  68.8  64.5  76.7  69.   90.   69.2  79.4  66.7  63.3  80.6  82.1  55.2
  93.9  50.   86.7  82.9  78.8  72.   68.   93.8  75.8  90.9  62.1  75.9
  71.9  85.2  51.5  65.5  61.3  87.9  69.7  91.4  67.9  85.3  93.3  71.
  65.6  91.2  90.6  84.8  58.1  92.9  87.5  81.8  56.7  89.7  86.1  86.2
  70.6 100.   88.9  60.   87.1  60.6  91.7  71.4  58.3  82.4  64.3  67.7
  93.5  94.1  58.6  67.6  74.2  76.5  85.7  88.2  94.6  97.   57.6  56.3
  45.2  96.7  77.1  61.8  59.3  57.1  96.6  88.6  40.   94.3  60.7  63.6
  91.9  89.2  73.5  74.3  74.1  51.7  89.5  81.6  94.4  74.4  97.1  63.
  79.6  48.3  78.9  51.6  97.3  89.3  68.4  65.7  68.6  81.1  81.5  59.4
  58.8  86.8  97.4  53.1  36.9  41.9  48.4  86.5  42.4  53.6  45.5  83.8
  48.5  64.7  72.2  55.9  94.7  62.9  78.4  44.1  87.2  69.4  73.   71.1
  54.5  79.5  51.4  76.3  43.8  92.3  96.9  61.1  70.3  62.2  63.9  84.2
  54.3  94.9  92.1]

Column: conf_rank
Data type: float64
Unique values: [nan  4.  1.  2.  3.  6.  5.  8.  7.  9. 10.]

Column: division
Data type: object
Unique values: [nan 'W' 'E' 'N' 'S' 'B' '6' 'R' 'P' 'M' 'A' '7' 'Div']

Final Reorganization of Columns

data.columns
Index(['year', 'school', 'seed', 'conference', 'conf_wins', 'conf_losses',
       'conf_wins_pct', 'conf_place', 'reg_wins', 'reg_losses', 'reg_wins_pct',
       'bid', 'first_game_at_home', 'tourney_wins', 'tourney_losses',
       'tourney_finish', 'total_wins', 'total_losses', 'total_wins_pct',
       'conf_rank', 'division'],
      dtype='object')
data = data[['year', 'school', 'seed', 'conference', 'conf_wins', 'conf_losses',
       'conf_wins_pct', 'conf_rank', 'division', 'reg_wins', 'reg_losses', 'reg_wins_pct',
       'bid', 'first_game_at_home', 'tourney_wins', 'tourney_losses',
       'tourney_finish', 'total_wins', 'total_losses', 'total_wins_pct'
       ]]

data.head()
year school seed conference conf_wins conf_losses conf_wins_pct conf_rank division reg_wins reg_losses reg_wins_pct bid first_game_at_home tourney_wins tourney_losses tourney_finish total_wins total_losses total_wins_pct
0 1982 Arizona St. 4 Western Collegiate NaN NaN NaN NaN NaN 23 6 79.3 at-large Y 1 1 top_16_loss 24 7 77.4
1 1982 Auburn 7 Southeastern NaN NaN NaN NaN NaN 24 4 85.7 at-large N 0 1 first_round_loss 24 5 82.8
2 1982 Cheyney 2 Independent NaN NaN NaN NaN NaN 24 2 92.3 at-large Y 4 1 top_2_loss 28 3 90.3
3 1982 Clemson 5 Atlantic Coast 6.0 3.0 66.7 4.0 NaN 20 11 64.5 at-large N 0 1 first_round_loss 20 12 62.5
4 1982 Drake 4 Missouri Valley NaN NaN NaN NaN NaN 26 6 81.3 auto Y 2 1 top_8_loss 28 7 80.0

Saving CSV

data.to_csv('../data/clean/womensmarchmadness.csv', index=False)

Validating

clean = pd.read_csv("../data/clean/womensmarchmadness.csv")

clean.head()
year school seed conference conf_wins conf_losses conf_wins_pct conf_rank division reg_wins reg_losses reg_wins_pct bid first_game_at_home tourney_wins tourney_losses tourney_finish total_wins total_losses total_wins_pct
0 1982 Arizona St. 4 Western Collegiate NaN NaN NaN NaN NaN 23 6 79.3 at-large Y 1 1 top_16_loss 24 7 77.4
1 1982 Auburn 7 Southeastern NaN NaN NaN NaN NaN 24 4 85.7 at-large N 0 1 first_round_loss 24 5 82.8
2 1982 Cheyney 2 Independent NaN NaN NaN NaN NaN 24 2 92.3 at-large Y 4 1 top_2_loss 28 3 90.3
3 1982 Clemson 5 Atlantic Coast 6.0 3.0 66.7 4.0 NaN 20 11 64.5 at-large N 0 1 first_round_loss 20 12 62.5
4 1982 Drake 4 Missouri Valley NaN NaN NaN NaN NaN 26 6 81.3 auto Y 2 1 top_8_loss 28 7 80.0
clean.shape
(2092, 20)
 
 

This page is built with Quarto.