import pandas as pd
Gender Assessment data cleaning
# Load the dataset
df = pd.read_csv("../data/raw/gender-assessment/gender_assessment.csv")
# Inspect the data
print(df.info())
print(f"Initial number of rows: {len(df)}")
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2000 entries, 0 to 1999
Data columns (total 79 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 WBA ID 2000 non-null object
1 Company Name 2000 non-null object
2 HQ Country 2000 non-null object
3 HQ Region 2000 non-null object
4 ISIN 1475 non-null object
5 WBA Industry 1995 non-null object
6 Ownership 1996 non-null object
7 Year assessed 2000 non-null int64
8 Overall Gender Assessment Score 2000 non-null float64
9 Percentage of Total Possible Score
(out of 52.3) 2000 non-null int64
10 A01. Strategic action 2000 non-null int64
11 A01.EA The company made a public commitment to gender equality and women’s empowerment (e.g. signatory to the UN Women’s Empowerment Principles, or having made another public commitment at CEO level). 2000 non-null object
12 A02. Gender targets 2000 non-null int64
13 A02.EA The company discloses one or more time-bound targets on gender equality and women’s empowerment with regard to its workplace. 2000 non-null object
14 A02.EC The company discloses one or more time-bound targets on gender equality and women’s empowerment with regard to its supply chain. 2000 non-null object
15 A04. Gender-responsive human rights due diligence process 2000 non-null int64
16 A04.EA The company discloses what gender-related human rights impacts it has assessed and prioritised as being salient (i.e. most severe and potentially irremediable if not addressed). 2000 non-null object
17 A04.EB The company consults with women or women's groups as part of the risk identification and assessment process. 2000 non-null object
18 A05. Grievance mechanisms 2000 non-null float64
19 A05.EA The company has a gender-responsive mechanism through which employees can report grievances. 2000 non-null object
20 A05.EB The company has one or more channel(s)/mechanism(s), or participates in a shared mechanism, accessible to all external individuals and communities who may be adversely impacted by the company (or individuals or organisations acting on their behalf or who are otherwise in a position to be aware of adverse impacts), to raise complaints or concerns. 2000 non-null object
21 A05.EC The company collects, analyses and monitors sex-disaggregated grievance data (e.g. number of grievances reported, number of grievances remediated). 2000 non-null object
22 A06. Stakeholder engagement 2000 non-null int64
23 A06.EA The company does employee surveys or other engagement mechanisms that specifically address gender equality & women’s empowerment issues. 2000 non-null object
24 A07. Corrective action process 2000 non-null float64
25 A07.EA The company screens for gender-related issues among its suppliers as part of its audit process. Can score Partially Met for .5. 2000 non-null object
26 A07.EB The company identifies any gender-related issues as requiring corrective action to be taken by a supplier within a set period of time in order to remediate the issue. 2000 non-null object
27 B01. Gender equality in leadership 2000 non-null int64
28 B01.EA The company maintains a gender balance (between 40-60%) at the highest governance body. 2000 non-null object
29 B01.EB The company maintains a gender balance (between 40-60%) at the senior executive level. 2000 non-null object
30 B01.EC The company maintains a gender balance (between 40-60%) at the senior management level. 2000 non-null object
31 B01.ED The company maintains a gender balance (between 40-60%) at the middle/other management level. 2000 non-null object
32 B02. Professional development and recruitment 2000 non-null int64
33 B02.EA The company offers professional development programmes (e.g. mentoring programme(s), leadership coaching, access to internal and/or external professional networks, educational programs, and formal sponsorship programmes. 2000 non-null object
34 B02.EB The company tracks the number of women who are participating in these programmes. 2000 non-null object
35 B03. Sex-disaggregated employee data 2000 non-null int64
36 B03.EA The company collects sex-disaggregated data on the gender balance of its employees by occupational function. 2000 non-null object
37 B03.EB The company collects sex-disaggregated data on the percentage of employees promoted. 2000 non-null object
38 B03.EC The company collects sex-disaggregated data on the annual turnover of employees. 2000 non-null object
39 B03.ED The company collect sex-disaggregated data on the annual absenteeism levels of employees. 2000 non-null object
40 B04. Gender equality leadership in the supply chain 2000 non-null int64
41 B04.EA The company collects or requires its suppliers to collect sex-disaggregated data by leadership level across the supply chain. 2000 non-null object
42 B06. Enabling environment for freedom of association and collective bargaining 2000 non-null int64
43 B06.EB The company describes how it supports the practices of its business relationships in relation to freedom of association and collective bargaining. 2000 non-null object
44 B07. Gender-responsive procurement 2000 non-null int64
45 B07.EA The company made a public commitment to gender-responsive procurement. 2000 non-null object
46 B07.EB The company procures from women-owned businesses. 2000 non-null object
47 C01. Gender pay gap 2000 non-null int64
48 C01.EA The company collects sex-disaggregated pay data. 2000 non-null object
49 C01.EB The company collects sex-disaggregated pay data by different pay bands, occupational functions, or other financial benefits. 2000 non-null object
50 C01.EC The company uses a third party to undertake/verify its pay gap analysis. 2000 non-null object
51 C02. Paid primary and secondary carer leave 2000 non-null float64
52 C02.EA The company has a global policy of providing at least 14 weeks of paid primary carer leave offered to full-time employees. 2000 non-null object
53 C02.EB The company monitors the return-to-work rate of employees after primary carer leave and their retention a year after primary carer leave. 2000 non-null object
54 C02.EC The company has a global policy of providing at least two weeks of paid secondary carer leave offered to full-time employees. 2000 non-null object
55 C02.ED The company tracks the number of workers who take secondary carer leave. 2000 non-null object
56 C03. Childcare and other family support 2000 non-null int64
57 C03.EA The company offers childcare support to employees. 2000 non-null object
58 C03.EB The company offers other family support to its employees. 2000 non-null object
59 C04. Flexible work 2000 non-null int64
60 C04.EA The company offers flexible working hours to its employees (the ability to alter the start and end of the day). 2000 non-null object
61 C04.EB The company collects sex-disaggregated data on the number of employees who have flexible working hour arrangements. 2000 non-null object
62 C04.EC The company offers flexible work locations to its employees (the ability to work from home/telecommuting). 2000 non-null object
63 C04.ED The company collects sex-disaggregated data on the number of employees who have flexible work location arrangements. 2000 non-null object
64 C06. Living wage in the supply chain 2000 non-null int64
65 C06.EA The company requires its suppliers to pay their workers a living wage. 2000 non-null object
66 C06.EB The company takes specific actions to help ensure its suppliers pay their workers a living wage. 2000 non-null object
67 D01. Health, safety and well-being in the workplace 2000 non-null float64
68 D01.EA The company has a publicly available policy statement committing it to respect the health and safety of its employees. 2000 non-null object
69 D01.EB The company discloses sex-disaggregated information on health and safety for its employees. 2000 non-null object
70 D01.EC The company provides coverage of the costs associated with any of the following health information and services: maternal health, sexual and reproductive health, and mental health. It has to provide more than two different services for a full score. Partially met if only one service is provided. 2000 non-null object
71 D02. Safe and healthy work in the supply chain 2000 non-null int64
72 D02.EA The company has a publicly available statement of policy that expects its business relationships to commit to respecting the health and safety of their workers. 2000 non-null object
73 D02.EC The company discloses how it monitors the health and safety performance of its business relationships. 2000 non-null object
74 E01. Violence and harassment prevention 2000 non-null float64
75 E01.EA The company has publicly available policies in place regarding violence and harassment in the workplace (e.g., zero tolerance policy, safe transport policy, etc.). Can score Partially Met for .5. 2000 non-null object
76 E02. Violence and harassment remediation 2000 non-null float64
77 E02.EA The company has a remediation process for addressing violence and harassment grievances in the workplace. Can score Partially Met for .5. 2000 non-null object
78 E02.EB The company collects, analyses and monitors sex-disaggregated data on the remediation of violence and harassment grievances. 2000 non-null object
dtypes: float64(7), int64(17), object(55)
memory usage: 1.2+ MB
None
Initial number of rows: 2000
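The `info()` listing above shows that a few columns (ISIN, WBA Industry, Ownership) contain nulls. As a minimal sketch, run on a small made-up frame rather than the real file, the per-column missing counts could be listed directly like this. The `toy` frame and its values are hypothetical.

```python
import pandas as pd

# Hypothetical toy frame standing in for the raw data; values are illustrative only.
toy = pd.DataFrame({
    "Company Name": ["A", "B", "C", "D"],
    "ISIN": ["X1", None, None, "X4"],
    "Ownership": ["Public", "Private", None, "Public"],
})

# Count nulls per column and keep only the columns that have any.
missing = toy.isna().sum()
missing = missing[missing > 0].sort_values(ascending=False)
print(missing)
```

Applied to `df`, the same two lines would give a shorter view than scanning all 79 rows of the `info()` output.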
# Choosing only indicator scores and removing element scores
# Drop columns whose values are all in ['Met', 'Unmet', 'Partially Met']
columns_to_drop = []
for col in df.columns:
    unique_vals = df[col].dropna().unique()
    if all(val in ['Met', 'Unmet', 'Partially Met'] for val in unique_vals):
        columns_to_drop.append(col)
# WBA ID and ISIN are not required in the analysis, so drop them as well
columns_to_drop.extend(['WBA ID', 'ISIN'])
df = df.drop(columns=columns_to_drop)
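One detail of the element-column check is worth illustrating: `all()` over an empty sequence is `True`, so a column with no non-null values at all would also be flagged. The sketch below, on a hypothetical toy frame, uses an assumed helper named `element_columns` (not part of the original notebook) that requires at least one value before flagging a column.

```python
import pandas as pd

ELEMENT_VALUES = {"Met", "Unmet", "Partially Met"}

def element_columns(frame):
    """Return names of columns whose non-null values are all element ratings."""
    found = []
    for col in frame.columns:
        values = frame[col].dropna()
        # Require at least one value so an all-null column is not flagged.
        if len(values) > 0 and values.isin(ELEMENT_VALUES).all():
            found.append(col)
    return found

# Hypothetical toy frame; column names only mimic the real ones.
toy = pd.DataFrame({
    "A01. Strategic action": [1, 0, 1],
    "A01.EA": ["Met", "Unmet", "Partially Met"],
    "Ownership": ["Public", None, "Private"],
    "Empty": [None, None, None],
})
flagged = element_columns(toy)
print(flagged)
```

In this dataset no column is entirely null according to the `info()` output, so the guard should not change the result here; it only matters if the input changes.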
# Drop rows with missing values in critical columns
df = df.dropna(subset=["Company Name ", "HQ Country", "Overall Gender Assessment Score"])
# Clean column names (convert to lowercase and replace spaces with underscores)
df.columns = df.columns.str.strip().str.lower().str.replace(' ', '_')
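To see what this normalization does to awkward headers, here is a small sketch on a hypothetical list of raw names. It assumes, based on the keys used in the rename map, that one header carries a trailing space and another contains a space followed by a newline; the real file may differ.

```python
import pandas as pd

# Hypothetical raw headers, including a trailing space and an embedded newline.
raw = pd.Index([
    "Company Name ",
    "HQ Country",
    "Percentage of Total Possible Score \n(out of 52.3)",
])

# strip() only trims the ends, so the inner space before "\n" becomes "_".
cleaned = raw.str.strip().str.lower().str.replace(" ", "_")
print(list(cleaned))
```

Note that the newline itself survives the cleaning, which is why the corresponding rename key has to include `\n`.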
# Rename long column names to shorter ones
rename_map = {
    'company_name': 'company',
    'hq_country': 'country',
    'hq_region': 'region',
    'wba_industry': 'industry',
    'year_assessed': 'year',
    'overall_gender_assessment_score': 'score',
    'percentage_of_total_possible_score_\n(out_of_52.3)': 'percent_score',
    "a01._strategic_action": "strategic_action",
    "a02._gender_targets": "gender_targets",
    "a04._gender-responsive_human_rights_due_diligence_process": "gender_due_diligence",
    "a05._grievance_mechanisms": "grievance_mechanisms",
    "a06._stakeholder_engagement": "stakeholder_engagement",
    "a07._corrective_action_process": "corrective_action",
    "b01._gender_equality_in_leadership": "gender_leadership",
    "b02._professional_development_and_recruitment": "development_recruitment",
    "b03._sex-disaggregated_employee_data": "employee_data_by_sex",
    "b04._gender_equality_leadership_in_the_supply_chain": "supply_chain_gender_leadership",
    "b06._enabling_environment_for_freedom_of_association_and_collective_bargaining": "enabling_environment_union_rights",
    "b07._gender-responsive_procurement": "gender_procurement",
    "c01._gender_pay_gap": "gender_pay_gap",
    "c02._paid_primary_and_secondary_carer_leave": "carer_leave_paid",
    'c03._childcare_and_other_family_support': 'childcare_support',
    'c04._flexible_work': 'flex_work',
    'c06._living_wage_in_the_supply_chain': 'living_wage_supply_chain',
    'd01._health,_safety_and_well-being_in_the_workplace': 'health_safety',
    'd02._safe_and_healthy_work_in_the_supply_chain': 'health_safety_supply_chain',
    'e01._violence_and_harassment_prevention': 'violence_prevention',
    'e02._violence_and_harassment_remediation': 'violence_remediation'
}
# Apply renaming
df = df.rename(columns=rename_map)
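By default `DataFrame.rename` silently ignores keys that do not match any column, so a typo in a long map like this one would go unnoticed. A minimal sketch of a check, using a hypothetical two-column frame and a map with a deliberate typo:

```python
import pandas as pd

# Hypothetical frame and map; the second key is misspelled on purpose.
toy = pd.DataFrame(columns=["company_name", "hq_country"])
toy_map = {"company_name": "company", "hq_countrry": "country"}

# List the keys that match no column before renaming.
unmatched = [key for key in toy_map if key not in toy.columns]
print(unmatched)

# The unmatched key is skipped without an error.
renamed = toy.rename(columns=toy_map)
print(list(renamed.columns))
```

Passing `errors="raise"` to `rename` is an alternative that turns an unmatched key into a `KeyError`.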
# Ensure 'score' and 'percent_score' are numeric
df['score'] = pd.to_numeric(df['score'], errors='coerce')
df['percent_score'] = pd.to_numeric(df['percent_score'], errors='coerce')
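With `errors='coerce'`, any value that cannot be parsed becomes `NaN` instead of raising. The sketch below, on a hypothetical series of strings, shows one way to count how many values were lost that way; comparing null counts before and after is an illustration, not a step from the original notebook.

```python
import pandas as pd

# Hypothetical raw values: two parseable strings, one bad token, one real null.
raw = pd.Series(["11.3", "16.9", "n/a", None])
numeric = pd.to_numeric(raw, errors="coerce")

# Values that were present but could not be parsed show up as extra NaNs.
newly_missing = int(numeric.isna().sum() - raw.isna().sum())
print(newly_missing)
```

In the data above both columns already have numeric dtypes in the `info()` output, so the conversion is expected to be a no-op safeguard.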
# Remove duplicates
df = df.drop_duplicates()
# Save cleaned file
df.to_csv("../data/clean/genderassessment.csv", index=False)
# Validate cleaned data
clean_data = pd.read_csv("../data/clean/genderassessment.csv")
print(f"Final cleaned dataset rows: {len(clean_data)}")  # Final row count
clean_data.head()
Final cleaned dataset rows: 2000
| | company | country | region | industry | ownership | year | score | percent_score | strategic_action | gender_targets | ... | gender_procurement | gender_pay_gap | carer_leave_paid | childcare_support | flex_work | living_wage_supply_chain | health_safety | health_safety_supply_chain | violence_prevention | violence_remediation |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 3M | United States | North America | Chemicals | Public | 2023 | 11.3 | 22 | 1 | 0 | ... | 1 | 0 | 0.0 | 0 | 2 | 0 | 1.0 | 2 | 1.0 | 0.0 |
1 | Asos | United Kingdom | Europe & Central Asia | Apparel & Footwear | Public | 2023 | 16.9 | 32 | 1 | 0 | ... | 0 | 0 | 0.0 | 1 | 1 | 2 | 0.5 | 2 | 0.5 | 0.0 |
2 | A.P. Moller - Maersk | Denmark | Europe & Central Asia | Freight & logistics | Public | 2024 | 10.9 | 21 | 1 | 1 | ... | 0 | 0 | 0.0 | 0 | 0 | 0 | 1.0 | 2 | 1.0 | 0.0 |
3 | ABB | Switzerland | Europe & Central Asia | Capital Goods | Public | 2023 | 12.8 | 25 | 1 | 1 | ... | 0 | 0 | 1.0 | 0 | 0 | 0 | 1.0 | 2 | 1.0 | 0.0 |
4 | AbbVie | United States | North America | Pharmaceuticals & Biotechnology | Public | 2023 | 15.4 | 30 | 1 | 0 | ... | 1 | 0 | 0.0 | 2 | 1 | 0 | 1.0 | 2 | 1.0 | 0.0 |
5 rows × 29 columns
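Beyond the row count, a few assertions can confirm that the cleaned file has the expected shape. This is a minimal sketch on a two-row frame built from the first two rows shown above; the specific checks (no duplicate rows, numeric `score`, `percent_score` between 0 and 100) are assumptions about what "valid" means here, not rules stated in the notebook.

```python
import pandas as pd
from pandas.api.types import is_numeric_dtype

# Two rows copied from the preview above, limited to three columns.
sample = pd.DataFrame({
    "company": ["3M", "Asos"],
    "score": [11.3, 16.9],
    "percent_score": [22, 32],
})

# Each check evaluates to a plain bool so failures are easy to report.
checks = {
    "no_duplicate_rows": not sample.duplicated().any(),
    "score_is_numeric": bool(is_numeric_dtype(sample["score"])),
    "percent_in_range": bool(sample["percent_score"].between(0, 100).all()),
}
failed = [name for name, ok in checks.items() if not ok]
print(failed)
```

Replacing `sample` with `clean_data` would run the same checks against the saved file.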