import pandas as pd
import re
import numpy as np
Clean up for womens-march-madness.csv
Loading CSV
Load the necessary libraries for the clean-up, as well as the provided dataset
= pd.read_csv("../data/raw/womens-march-madness/womens-march-madness.csv")
data data.head()
Year | School | Seed | Conference | Conf. W | Conf. L | Conf. % | Conf. place | Reg. W | Reg. L | Reg. % | How qual | 1st game at home? | Tourney W | Tourney L | Tourney finish | Full W | Full L | Full % | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 1982 | Arizona St. | 4 | Western Collegiate | - | - | - | - | 23 | 6 | 79.3 | at-large | Y | 1 | 1 | RSF | 24 | 7 | 77.4 |
1 | 1982 | Auburn | 7 | Southeastern | - | - | - | - | 24 | 4 | 85.7 | at-large | N | 0 | 1 | 1st | 24 | 5 | 82.8 |
2 | 1982 | Cheyney | 2 | Independent | - | - | - | - | 24 | 2 | 92.3 | at-large | Y | 4 | 1 | N2nd | 28 | 3 | 90.3 |
3 | 1982 | Clemson | 5 | Atlantic Coast | 6 | 3 | 66.7 | 4th | 20 | 11 | 64.5 | at-large | N | 0 | 1 | 1st | 20 | 12 | 62.5 |
4 | 1982 | Drake | 4 | Missouri Valley | - | - | - | - | 26 | 6 | 81.3 | auto | Y | 2 | 1 | RF | 28 | 7 | 80 |
Renaming columns
Changing the column names to a more interpretable version:
= {
column_names 'Year' : 'year',
'School' : 'school',
'Seed' : 'seed',
'Conference' : 'conference',
'Conf. W' : 'conf_wins',
'Conf. L' : 'conf_losses',
'Conf. %' : 'conf_wins_pct',
'Conf. place' : 'conf_place',
'Reg. W' : 'reg_wins',
'Reg. L' : 'reg_losses',
'Reg. %' : 'reg_wins_pct',
'How qual' : 'bid',
'1st game at home?' : 'first_game_at_home',
'Tourney W' : 'tourney_wins',
'Tourney L' : 'tourney_losses',
'Tourney finish' : 'tourney_finish',
'Full W' : 'total_wins',
'Full L' : 'total_losses',
'Full %' : 'total_wins_pct'
}= data.rename(columns = column_names)
data data.head()
year | school | seed | conference | conf_wins | conf_losses | conf_wins_pct | conf_place | reg_wins | reg_losses | reg_wins_pct | bid | first_game_at_home | tourney_wins | tourney_losses | tourney_finish | total_wins | total_losses | total_wins_pct | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 1982 | Arizona St. | 4 | Western Collegiate | - | - | - | - | 23 | 6 | 79.3 | at-large | Y | 1 | 1 | RSF | 24 | 7 | 77.4 |
1 | 1982 | Auburn | 7 | Southeastern | - | - | - | - | 24 | 4 | 85.7 | at-large | N | 0 | 1 | 1st | 24 | 5 | 82.8 |
2 | 1982 | Cheyney | 2 | Independent | - | - | - | - | 24 | 2 | 92.3 | at-large | Y | 4 | 1 | N2nd | 28 | 3 | 90.3 |
3 | 1982 | Clemson | 5 | Atlantic Coast | 6 | 3 | 66.7 | 4th | 20 | 11 | 64.5 | at-large | N | 0 | 1 | 1st | 20 | 12 | 62.5 |
4 | 1982 | Drake | 4 | Missouri Valley | - | - | - | - | 26 | 6 | 81.3 | auto | Y | 2 | 1 | RF | 28 | 7 | 80 |
Column Analysis and Reformatting
Analyzing and replacing the contents of the ‘Conf. place’ : ‘conf_place’ column:
'conf_place'].unique() data[
array(['-', '4th', '1st', '2nd', '3rd', 'T1st W', '3rd E', '2nd E',
'T2nd', '1st E', 'T2nd W', 'T1st E', 'T1st', 'T1st N', 'T3rd',
'1st W', '2nd W', '1st S', '2nd S', 'T4th', 'T2nd E', 'T6th',
'5th', '6th', 'T5th', 'T8th', 'T7th', '7th', '1st-B', '1st-W',
'2nd-E', '2nd-6', '1st-E', '1st-R', 'T2nd-W', '1st-6', 'T1st-B',
'T1st-W', '1st-P', '1st-M', 'T1st-M', '2nd-A', '1st-A', '1st-N',
'2nd-7', 'T3rd-M', 'T2nd-6', '1st-7', 'T2nd-P', '2nd-N', '2nd-W',
'8th', 'T1st-A', '2nd-M', '3rd-W', '4th-N', '9th', 'T1st-E',
'10th', 'T9th', '1st Div.', '3rd Div', 'T3rd Div.'], dtype=object)
- Numerical Places:
‘1st’, ‘2nd’, ‘3rd’, etc. — clearly ranked within the conference.
- Tied Places:
‘T1st’, ‘T2nd’, etc. — team tied for that position.
‘T1st-W’, ‘T2nd-E’, etc. — tied for a position within a division (e.g., West, East).
- Divisions within Conferences:
‘1st W’, ‘2nd E’, ‘3rd-W’, etc. — first, second, or third place in a regional division (West, East, etc.).
- Ambiguous Codes:
‘1st-6’, ‘T2nd-6’, ‘2nd-7’ — unknown for divisions (e.g., “Group 6” or region 6).
‘1st-B’, ‘1st-R’, ‘1st-P’, ‘1st-M’, etc. — probably shorthand for colored or named divisions (e.g., Blue, Red, Pacific, Mountain).
‘1st Div.’, ‘3rd Div’, ‘T3rd Div.’ — older way of denoting division standing.
- ‘-’ — missing or not applicable (e.g., for Independent schools not in a conference).
Functions to replace ‘conf_place’ with two columns ‘conf_rank’ and ‘division’
def extract_conf_rank(conf_place):
if pd.isna(conf_place) or conf_place == '-':
return np.nan
# Match patterns like '1st', 'T2nd', '3rd-W', 'T1st-N', etc.
= re.search(r'(\d+)(st|nd|rd|th)', conf_place)
match if match:
return int(match.group(1))
return np.nan
def extract_division(conf_place):
if pd.isna(conf_place) or conf_place == '-':
return np.nan
# Look for division info after rank (like 'W', 'E', '6', etc.)
= re.search(r'(?:\d+(?:st|nd|rd|th)[\s-]?)([A-Za-z0-9]+)?', conf_place)
match if match:
return match.group(1)
return np.nan
'conf_rank'] = data['conf_place'].apply(extract_conf_rank)
data['division'] = data['conf_place'].apply(extract_division)
data[ data.head()
year | school | seed | conference | conf_wins | conf_losses | conf_wins_pct | conf_place | reg_wins | reg_losses | ... | bid | first_game_at_home | tourney_wins | tourney_losses | tourney_finish | total_wins | total_losses | total_wins_pct | conf_rank | division | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 1982 | Arizona St. | 4 | Western Collegiate | - | - | - | - | 23 | 6 | ... | at-large | Y | 1 | 1 | RSF | 24 | 7 | 77.4 | NaN | NaN |
1 | 1982 | Auburn | 7 | Southeastern | - | - | - | - | 24 | 4 | ... | at-large | N | 0 | 1 | 1st | 24 | 5 | 82.8 | NaN | NaN |
2 | 1982 | Cheyney | 2 | Independent | - | - | - | - | 24 | 2 | ... | at-large | Y | 4 | 1 | N2nd | 28 | 3 | 90.3 | NaN | NaN |
3 | 1982 | Clemson | 5 | Atlantic Coast | 6 | 3 | 66.7 | 4th | 20 | 11 | ... | at-large | N | 0 | 1 | 1st | 20 | 12 | 62.5 | 4.0 | None |
4 | 1982 | Drake | 4 | Missouri Valley | - | - | - | - | 26 | 6 | ... | auto | Y | 2 | 1 | RF | 28 | 7 | 80 | NaN | NaN |
5 rows × 21 columns
Analyzing and replacing the contents of the ‘Tourney finish’ : ‘tourney_finish’ column:
According to the documentation:
The round of the final game for each team. OR=opening-round loss (1983 only); 1st=first-round loss; 2nd=second-round loss; RSF=loss in the Sweet 16; RF=loss in the Elite Eight; NSF=loss in the national semifinals; N2nd=national runner-up; Champ=national champions.
We’ll recode the content of the column to more interpretable names.
'tourney_finish'].unique() data[
array(['RSF', '1st', 'N2nd', 'RF', 'Champ', 'NSF', 'OR', '2nd'],
dtype=object)
= {
finish_map 'OR': 'opening_round_loss',
'1st': 'first_round_loss',
'2nd': 'second_round_loss',
'RSF': 'top_16_loss',
'RF': 'top_8_loss',
'NSF': 'top_4_loss',
'N2nd': 'top_2_loss',
'Champ': 'champ'
}
'tourney_finish'] = data['tourney_finish'].replace(finish_map)
data[
data.head()
year | school | seed | conference | conf_wins | conf_losses | conf_wins_pct | conf_place | reg_wins | reg_losses | ... | bid | first_game_at_home | tourney_wins | tourney_losses | tourney_finish | total_wins | total_losses | total_wins_pct | conf_rank | division | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 1982 | Arizona St. | 4 | Western Collegiate | - | - | - | - | 23 | 6 | ... | at-large | Y | 1 | 1 | top_16_loss | 24 | 7 | 77.4 | NaN | NaN |
1 | 1982 | Auburn | 7 | Southeastern | - | - | - | - | 24 | 4 | ... | at-large | N | 0 | 1 | first_round_loss | 24 | 5 | 82.8 | NaN | NaN |
2 | 1982 | Cheyney | 2 | Independent | - | - | - | - | 24 | 2 | ... | at-large | Y | 4 | 1 | top_2_loss | 28 | 3 | 90.3 | NaN | NaN |
3 | 1982 | Clemson | 5 | Atlantic Coast | 6 | 3 | 66.7 | 4th | 20 | 11 | ... | at-large | N | 0 | 1 | first_round_loss | 20 | 12 | 62.5 | 4.0 | None |
4 | 1982 | Drake | 4 | Missouri Valley | - | - | - | - | 26 | 6 | ... | auto | Y | 2 | 1 | top_8_loss | 28 | 7 | 80 | NaN | NaN |
5 rows × 21 columns
We’re going to make this column an ordered categorical.
= [
finish_order 'opening_round_loss',
'first_round_loss',
'second_round_loss',
'top_16_loss',
'top_8_loss',
'top_4_loss',
'top_2_loss',
'champ'
]
'tourney_finish'] = pd.Categorical(
data['tourney_finish'],
data[=finish_order,
categories=True
ordered
)
data.head()
year | school | seed | conference | conf_wins | conf_losses | conf_wins_pct | conf_place | reg_wins | reg_losses | ... | bid | first_game_at_home | tourney_wins | tourney_losses | tourney_finish | total_wins | total_losses | total_wins_pct | conf_rank | division | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 1982 | Arizona St. | 4 | Western Collegiate | - | - | - | - | 23 | 6 | ... | at-large | Y | 1 | 1 | top_16_loss | 24 | 7 | 77.4 | NaN | NaN |
1 | 1982 | Auburn | 7 | Southeastern | - | - | - | - | 24 | 4 | ... | at-large | N | 0 | 1 | first_round_loss | 24 | 5 | 82.8 | NaN | NaN |
2 | 1982 | Cheyney | 2 | Independent | - | - | - | - | 24 | 2 | ... | at-large | Y | 4 | 1 | top_2_loss | 28 | 3 | 90.3 | NaN | NaN |
3 | 1982 | Clemson | 5 | Atlantic Coast | 6 | 3 | 66.7 | 4th | 20 | 11 | ... | at-large | N | 0 | 1 | first_round_loss | 20 | 12 | 62.5 | 4.0 | None |
4 | 1982 | Drake | 4 | Missouri Valley | - | - | - | - | 26 | 6 | ... | auto | Y | 2 | 1 | top_8_loss | 28 | 7 | 80 | NaN | NaN |
5 rows × 21 columns
Analyzing and replacing the contents of the ‘Seed’ : ‘seed’ column:
'seed'] == '(OR)', 'seed'] = '0' data.loc[data[
Changing datatypes:
"-", np.nan, inplace=True)
data.replace(
'seed'] = data['seed'].astype('int')
data['conf_wins'] = data['conf_wins'].astype('float')
data['conf_losses'] = data['conf_losses'].astype('float')
data['conf_wins_pct'] = data['conf_wins_pct'].astype('float')
data['total_wins_pct'] = data['total_wins_pct'].str.replace('\\', '')
data['first_game_at_home'] = data['first_game_at_home'].str.replace('^', '')
data['total_wins_pct'] = data['total_wins_pct'].astype('float')
data[
data.head()
year | school | seed | conference | conf_wins | conf_losses | conf_wins_pct | conf_place | reg_wins | reg_losses | ... | bid | first_game_at_home | tourney_wins | tourney_losses | tourney_finish | total_wins | total_losses | total_wins_pct | conf_rank | division | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 1982 | Arizona St. | 4 | Western Collegiate | NaN | NaN | NaN | NaN | 23 | 6 | ... | at-large | Y | 1 | 1 | top_16_loss | 24 | 7 | 77.4 | NaN | NaN |
1 | 1982 | Auburn | 7 | Southeastern | NaN | NaN | NaN | NaN | 24 | 4 | ... | at-large | N | 0 | 1 | first_round_loss | 24 | 5 | 82.8 | NaN | NaN |
2 | 1982 | Cheyney | 2 | Independent | NaN | NaN | NaN | NaN | 24 | 2 | ... | at-large | Y | 4 | 1 | top_2_loss | 28 | 3 | 90.3 | NaN | NaN |
3 | 1982 | Clemson | 5 | Atlantic Coast | 6.0 | 3.0 | 66.7 | 4th | 20 | 11 | ... | at-large | N | 0 | 1 | first_round_loss | 20 | 12 | 62.5 | 4.0 | None |
4 | 1982 | Drake | 4 | Missouri Valley | NaN | NaN | NaN | NaN | 26 | 6 | ... | auto | Y | 2 | 1 | top_8_loss | 28 | 7 | 80.0 | NaN | NaN |
5 rows × 21 columns
NaN Standardization
=[pd.NA, "nan", "NaN", "None", None], value=np.nan, inplace=True) data.replace(to_replace
Results Review
for column in data.columns:
print(f"Column: {column}")
print(f"Data type: {data[column].dtype}")
print(f"Unique values: {data[column].unique()}\n")
Column: year
Data type: int64
Unique values: [1982 1983 1984 1985 1986 1987 1988 1989 1990 1991 1992 1993 1994 1995
1996 1997 1998 1999 2000 2001 2002 2003 2004 2005 2006 2007 2008 2009
2010 2011 2012 2013 2014 2015 2016 2017 2018]
Column: school
Data type: object
Unique values: ['Arizona St.' 'Auburn' 'Cheyney' 'Clemson' 'Drake' 'East Carolina'
'Georgia' 'Howard' 'Illinois' 'Jackson St.' 'Kansas St.' 'Kent St.'
'Kentucky' 'Long Beach St.' 'Louisiana Tech' 'Maryland' 'Memphis'
'Missouri' 'NC State' 'Northwestern' 'Ohio St.' 'Old Dominion' 'Ole Miss'
'Oregon' 'Penn St.' "Saint Peter's" 'SFA' 'South Carolina'
'Southern California' 'Stanford' 'Tennessee' 'Tennessee Tech'
'Central Mich.' 'Dartmouth' 'Florida St.' 'Illinois St.' 'Indiana'
'La Salle' 'La.-Monroe' 'Louisville' 'Middle Tenn.' 'Monmouth' 'Montana'
'North Carolina' 'Oregon St.' 'South Carolina St.' "St. John's (NY)"
'Texas' 'UCLA' 'Utah' 'Alabama' 'BYU' 'LSU' 'San Diego St.' 'Texas Tech'
'UNLV' 'Virginia' 'Holy Cross' 'Idaho' "Saint Joseph's" 'Southern Miss.'
'Syracuse' 'Washington' 'Western Ky.' 'Western Mich.' 'Arkansas' 'Iowa'
'James Madison' 'North Texas' 'Ohio' 'Oklahoma' 'Providence' 'Rutgers'
'Southern Ill.' 'Vanderbilt' 'Villanova' 'Bowling Green' 'Duke'
'Eastern Wash.' 'Kansas' 'Manhattan' 'New Mexico St.' 'New Orleans'
'Northwestern St.' 'South Alabama' 'Colorado' 'Eastern Ill.' 'Fairfield'
'Houston' 'Nebraska' 'Wake Forest' 'Cal St. Fullerton' 'Chattanooga'
'Cincinnati' 'Hawaii' 'Miami (FL)' 'Oklahoma St.' 'Purdue' 'Temple'
'UConn' 'West Virginia' 'Appalachian St.' 'California' 'DePaul'
'Michigan' 'Northern Ill.' 'Richmond' 'George Washington' 'Lamar'
'Michigan St.' 'Missouri St.' 'Toledo' 'Washington St.' 'Creighton'
'Notre Dame' 'Santa Clara' 'UC Santa Barbara' 'Vermont' 'Wisconsin'
'Florida' 'Ga. Southern' 'Georgetown' 'Georgia Tech' 'Montana St.'
'San Diego' 'Xavier' 'Boise St.' 'Brown' 'FIU' 'Fordham' 'Grambling'
'Green Bay' 'Loyola (MD)' 'Marquette' 'Minnesota' "Mt. St. Mary's"
'N.C. A&T' 'Portland' 'Radford' 'Seton Hall' 'SMU' 'Tennessee St.'
'Texas A&M' 'UAB' 'Virginia Tech' 'Florida A&M' 'Furman' 'Maine'
'San Francisco' 'Tulane' 'UC Irvine' 'Western Ill.' 'Austin Peay'
'Butler' 'Colorado St.' 'Harvard' 'Massachusetts' 'Rhode Island'
'St. Francis (PA)' 'UCF' 'Youngstown St.' 'Arizona' 'Detroit Mercy'
'Eastern Ky.' 'Iowa St.' 'Lehigh' 'Liberty' 'Marshall' 'Texas St.' 'Troy'
'New Mexico' 'Saint Francis (PA)' 'UNC Greensboro' 'Boston College'
'CSUN' 'Evansville' 'Mississippi St.' 'Northeastern' 'Oral Roberts'
"St. Mary's (CA)" 'Alcorn St.' 'Campbell' 'Hampton' 'Pepperdine' 'Rice'
'Baylor' 'Delaware' 'Denver' 'Georgia St.' 'Idaho St.' 'Long Island'
'Milwaukee' 'Penn' 'Siena' 'TCU' 'Bucknell' 'Hartford' 'Norfolk St.'
'Oakland' 'Southern' 'Weber St.' 'Alabama St.' 'Boston University'
'Charlotte' 'Valparaiso' 'Colgate' 'Eastern Mich.' 'Lipscomb'
'Loyola Marymount' 'Marist' 'Canisius' 'Coppin St.' 'Stetson'
'UT Arlington' 'Western Carolina' 'Army' 'Fla. Atlantic'
'Northern Arizona' 'Sacred Heart' 'South Florida' 'Southeast Mo. St.'
'Tulsa' 'UC Riverside' 'Belmont' 'Delaware St.' 'Gonzaga' 'Louisiana'
'Pittsburgh' 'Prairie View' 'Robert Morris' 'UMBC' 'Asheville'
'Cleveland St.' 'Cornell' 'ETSU' 'Fresno St.' 'Miami (OH)' 'Murray St.'
'UTEP' 'UTSA' 'Wyoming' 'Ball St.' 'Drexel' 'NC A&T' 'South Dakota St.'
'VCU' 'Dayton' 'Little Rock' 'Portland St.' 'Princeton' 'UNI'
'Gardner-Webb' 'McNeese' 'Navy' 'Samford' 'UC Davis' 'UT Martin' 'Albany'
'FGCU' 'St. Bonaventure' 'Cal Poly' 'Quinnipiac' 'Wichita St.' 'Akron'
'North Dakota' 'South Dakota' 'Winthrop' 'Wright St.' 'American'
'Savannah St.' 'St. Francis Brooklyn' 'Army West Point' 'Buffalo'
'Central Arkansas' 'Duquesne' 'Iona' 'Jacksonville' 'UNC Asheville'
'Elon' 'Texas Southern' 'Mercer' 'Nicholls St.' 'Northern Colo.'
'Seattle']
Column: seed
Data type: int64
Unique values: [ 4 7 2 5 6 8 1 3 0 10 9 12 11 16 14 15 13]
Column: conference
Data type: object
Unique values: ['Western Collegiate' 'Southeastern' 'Independent' 'Atlantic Coast'
'Missouri Valley' 'Mid-Eastern' 'Big Ten' 'Southwestern' 'Big Eight'
'Mid-American' 'Metro' 'Metro Atlantic' 'Northern California'
'Ohio Valley' 'Ivy' 'East Coast' 'Southland' 'Cosmopolitan'
'Mountain West Athl.' 'Sun Belt' 'Northern Pacific' 'Atlantic 10'
'Big East' 'Southwest' 'High Country' 'Pacific Coast' 'Colonial'
'Pacific West' 'Western Athletic' 'Gulf Star' 'Pacific-10' 'Big West'
'American South' 'Southern' 'Big Sky' 'North Star' 'Patriot'
'Great Midwest' 'Midwestern' 'West Coast' 'North Atlantic'
'Mid-Continent' 'Trans America' 'Northeast' 'Big South' 'Conference USA'
'Trans-America' 'Big 12' 'America East' 'Horizon' 'Mountain West'
'Atlantic Sun' ' Mountain West' 'Summit' 'Pac-12' 'American Atletic'
'American Athletic' 'ASUN' 'Colonial Athletic' 'Western Atlantic'
'Atlantic-10' 'SWAC' 'Atlantic']
Column: conf_wins
Data type: float64
Unique values: [nan 6. 11. 5. 9. 10. 4. 12. 7. 18. 16. 13. 14. 17. 8. 15. 3. 19.
20.]
Column: conf_losses
Data type: float64
Unique values: [nan 3. 1. 2. 0. 5. 4. 6. 8. 7. 9. 11. 10. 13. 14. 12.]
Column: conf_wins_pct
Data type: float64
Unique values: [ nan 66.7 85.7 84.6 100. 75. 83.3 64.3 91.7 50. 76.9 92.3
87.5 62.5 90. 88.9 92.9 70. 71.4 94.4 81.3 78.6 80. 77.8
44.4 42.9 57.1 55.6 61.1 93.8 68.8 86.7 72.2 60. 63.6 56.3
90.9 54.5 81.8 36.4 58.3 72.7 21.4 53.8 46.7 41.7 93.3 41.2
43.8 95. 84.2 94.1 38.9 55. 82.4 27.8 37.5 22.2 61.9 85.
33.3]
Column: conf_place
Data type: object
Unique values: [nan '4th' '1st' '2nd' '3rd' 'T1st W' '3rd E' '2nd E' 'T2nd' '1st E'
'T2nd W' 'T1st E' 'T1st' 'T1st N' 'T3rd' '1st W' '2nd W' '1st S' '2nd S'
'T4th' 'T2nd E' 'T6th' '5th' '6th' 'T5th' 'T8th' 'T7th' '7th' '1st-B'
'1st-W' '2nd-E' '2nd-6' '1st-E' '1st-R' 'T2nd-W' '1st-6' 'T1st-B'
'T1st-W' '1st-P' '1st-M' 'T1st-M' '2nd-A' '1st-A' '1st-N' '2nd-7'
'T3rd-M' 'T2nd-6' '1st-7' 'T2nd-P' '2nd-N' '2nd-W' '8th' 'T1st-A' '2nd-M'
'3rd-W' '4th-N' '9th' 'T1st-E' '10th' 'T9th' '1st Div.' '3rd Div'
'T3rd Div.']
Column: reg_wins
Data type: int64
Unique values: [23 24 20 26 19 21 14 28 17 22 30 25 27 15 18 16 29 12 33 31 13 32 34 10]
Column: reg_losses
Data type: int64
Unique values: [ 6 4 2 11 7 8 10 5 13 1 3 9 12 14 15 0 16 17 18]
Column: reg_wins_pct
Data type: float64
Unique values: [ 79.3 85.7 92.3 64.5 81.3 73.1 72.4 58.3 80. 82.8 56.7 75.9
81.5 96.8 78.6 86.2 74.2 82.1 75. 76. 80.8 87.1 83.3 65.2
87. 67.9 66.7 76.7 71.4 92.9 72. 69. 64.3 85.2 57.1 96.4
50. 89.7 83.9 69.6 92.6 93.3 73.3 93.1 84.6 88. 51.6 89.3
63.3 93.8 70.4 77.8 96.6 90. 67.7 60. 70. 96.3 96.7 87.5
58.6 76.5 65.5 100. 90.6 62.1 86.7 61.3 74.1 80.6 60.9 71.9
60.7 68.8 78.8 71. 77.4 59.4 46.7 88.9 62.5 61.5 59.3 76.9
63. 41.4 69.2 88.5 84.4 51.7 93.5 90.9 75.8 53.6 81.8 69.7
65.4 93.9 65.6 90.3 53.3 63.6 58.1 68.6 96.9 55.2 54.8 43.3
78.1 87.9 43.8 72.7 55.6 46.9 84.8 56.3 97. 53.1 45.5 51.5
60.6 88.2 57.6 52.9 82.4 45.2 64.7 91.2 97.1 67.6 79.4 55.9
86.4 94.1]
Column: bid
Data type: object
Unique values: ['at-large' 'auto']
Column: first_game_at_home
Data type: object
Unique values: ['Y' 'N']
Column: tourney_wins
Data type: int64
Unique values: [1 0 4 2 5 3 6]
Column: tourney_losses
Data type: int64
Unique values: [1 0]
Column: tourney_finish
Data type: category
Unique values: ['top_16_loss', 'first_round_loss', 'top_2_loss', 'top_8_loss', 'champ', 'top_4_loss', 'opening_round_loss', 'second_round_loss']
Categories (8, object): ['opening_round_loss' < 'first_round_loss' < 'second_round_loss' < 'top_16_loss' < 'top_8_loss' < 'top_4_loss' < 'top_2_loss' < 'champ']
Column: total_wins
Data type: int64
Unique values: [24 28 20 19 21 14 26 17 23 35 25 22 27 15 18 16 31 29 30 32 34 33 12 39
36 37 13 40 38]
Column: total_losses
Data type: int64
Unique values: [ 7 5 3 12 8 9 11 6 14 1 4 10 13 2 15 16 0 17 18 53 19]
Column: total_wins_pct
Data type: float64
Unique values: [ 77.4 82.8 90.3 62.5 80. 70.4 70. 56. 77.8 81.3 54.8 75.
79.3 97.2 78.1 83.9 72.7 72.4 73.1 78.6 84.4 83.3 73.3 84.6
68.8 64.5 76.7 69. 90. 69.2 79.4 66.7 63.3 80.6 82.1 55.2
93.9 50. 86.7 82.9 78.8 72. 68. 93.8 75.8 90.9 62.1 75.9
71.9 85.2 51.5 65.5 61.3 87.9 69.7 91.4 67.9 85.3 93.3 71.
65.6 91.2 90.6 84.8 58.1 92.9 87.5 81.8 56.7 89.7 86.1 86.2
70.6 100. 88.9 60. 87.1 60.6 91.7 71.4 58.3 82.4 64.3 67.7
93.5 94.1 58.6 67.6 74.2 76.5 85.7 88.2 94.6 97. 57.6 56.3
45.2 96.7 77.1 61.8 59.3 57.1 96.6 88.6 40. 94.3 60.7 63.6
91.9 89.2 73.5 74.3 74.1 51.7 89.5 81.6 94.4 74.4 97.1 63.
79.6 48.3 78.9 51.6 97.3 89.3 68.4 65.7 68.6 81.1 81.5 59.4
58.8 86.8 97.4 53.1 36.9 41.9 48.4 86.5 42.4 53.6 45.5 83.8
48.5 64.7 72.2 55.9 94.7 62.9 78.4 44.1 87.2 69.4 73. 71.1
54.5 79.5 51.4 76.3 43.8 92.3 96.9 61.1 70.3 62.2 63.9 84.2
54.3 94.9 92.1]
Column: conf_rank
Data type: float64
Unique values: [nan 4. 1. 2. 3. 6. 5. 8. 7. 9. 10.]
Column: division
Data type: object
Unique values: [nan 'W' 'E' 'N' 'S' 'B' '6' 'R' 'P' 'M' 'A' '7' 'Div']
Final Reorganization of Columns
data.columns
Index(['year', 'school', 'seed', 'conference', 'conf_wins', 'conf_losses',
'conf_wins_pct', 'conf_place', 'reg_wins', 'reg_losses', 'reg_wins_pct',
'bid', 'first_game_at_home', 'tourney_wins', 'tourney_losses',
'tourney_finish', 'total_wins', 'total_losses', 'total_wins_pct',
'conf_rank', 'division'],
dtype='object')
= data[['year', 'school', 'seed', 'conference', 'conf_wins', 'conf_losses',
data 'conf_wins_pct', 'conf_rank', 'division', 'reg_wins', 'reg_losses', 'reg_wins_pct',
'bid', 'first_game_at_home', 'tourney_wins', 'tourney_losses',
'tourney_finish', 'total_wins', 'total_losses', 'total_wins_pct'
]]
data.head()
year | school | seed | conference | conf_wins | conf_losses | conf_wins_pct | conf_rank | division | reg_wins | reg_losses | reg_wins_pct | bid | first_game_at_home | tourney_wins | tourney_losses | tourney_finish | total_wins | total_losses | total_wins_pct | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 1982 | Arizona St. | 4 | Western Collegiate | NaN | NaN | NaN | NaN | NaN | 23 | 6 | 79.3 | at-large | Y | 1 | 1 | top_16_loss | 24 | 7 | 77.4 |
1 | 1982 | Auburn | 7 | Southeastern | NaN | NaN | NaN | NaN | NaN | 24 | 4 | 85.7 | at-large | N | 0 | 1 | first_round_loss | 24 | 5 | 82.8 |
2 | 1982 | Cheyney | 2 | Independent | NaN | NaN | NaN | NaN | NaN | 24 | 2 | 92.3 | at-large | Y | 4 | 1 | top_2_loss | 28 | 3 | 90.3 |
3 | 1982 | Clemson | 5 | Atlantic Coast | 6.0 | 3.0 | 66.7 | 4.0 | NaN | 20 | 11 | 64.5 | at-large | N | 0 | 1 | first_round_loss | 20 | 12 | 62.5 |
4 | 1982 | Drake | 4 | Missouri Valley | NaN | NaN | NaN | NaN | NaN | 26 | 6 | 81.3 | auto | Y | 2 | 1 | top_8_loss | 28 | 7 | 80.0 |
Saving CSV
'../data/clean/womensmarchmadness.csv', index=False) data.to_csv(
Validating
= pd.read_csv("../data/clean/womensmarchmadness.csv")
clean
clean.head()
year | school | seed | conference | conf_wins | conf_losses | conf_wins_pct | conf_rank | division | reg_wins | reg_losses | reg_wins_pct | bid | first_game_at_home | tourney_wins | tourney_losses | tourney_finish | total_wins | total_losses | total_wins_pct | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 1982 | Arizona St. | 4 | Western Collegiate | NaN | NaN | NaN | NaN | NaN | 23 | 6 | 79.3 | at-large | Y | 1 | 1 | top_16_loss | 24 | 7 | 77.4 |
1 | 1982 | Auburn | 7 | Southeastern | NaN | NaN | NaN | NaN | NaN | 24 | 4 | 85.7 | at-large | N | 0 | 1 | first_round_loss | 24 | 5 | 82.8 |
2 | 1982 | Cheyney | 2 | Independent | NaN | NaN | NaN | NaN | NaN | 24 | 2 | 92.3 | at-large | Y | 4 | 1 | top_2_loss | 28 | 3 | 90.3 |
3 | 1982 | Clemson | 5 | Atlantic Coast | 6.0 | 3.0 | 66.7 | 4.0 | NaN | 20 | 11 | 64.5 | at-large | N | 0 | 1 | first_round_loss | 20 | 12 | 62.5 |
4 | 1982 | Drake | 4 | Missouri Valley | NaN | NaN | NaN | NaN | NaN | 26 | 6 | 81.3 | auto | Y | 2 | 1 | top_8_loss | 28 | 7 | 80.0 |
clean.shape
(2092, 20)