Diverse Data Hub
  • Data Sets
  • Citation
  • Collaborate
  1. Wildfire Dataset Cleaning Steps

On this page

  • Handle missing values
    • Drop uninformative or very sparse columns
    • Handle moderate missing values
  • Fix the Data types
    • Date/Time
  • Column Names:
  • Investigation of the Date columns
    • Based on Alberta Historical Wildfire Data Dictionary (2006–2024)
    • I just keep these date columns and I think these columns are most informative
    • Please let me know which columns you think are better to keep
  • Cleaning some values o Columns
  • Export cleaned dataset
    • Validate cleaned file

Wildfire Dataset Cleaning Steps

import pandas as pd

wildfire = pd.read_csv("../data/raw/wildfire/wildfire.csv")
wildfire.head()
YEAR FIRE_NUMBER FIRE_NAME CURRENT_SIZE SIZE_CLASS LATITUDE LONGITUDE FIRE_ORIGIN GENERAL_CAUSE INDUSTRY_IDENTIFIER ... DISTANCE_FROM_WATER_SOURCE FIRST_BUCKET_DROP_DATE FIRST_BH_DATE FIRST_BH_SIZE FIRST_UC_DATE FIRST_UC_SIZE FIRST_TO_DATE FIRST_TO_SIZE FIRST_EX_DATE FIRST_EX_SIZE_PERIMETER
0 2006 PWF001 NaN 0.10 A 56.249956 -117.181960 Private Land Resident NaN ... NaN NaN 2006-04-02 22:00:00 0.01 2006-04-02 22:00:00 0.01 NaN NaN 2006-04-03 10:20:00 0.10
1 2006 EWF002 NaN 0.20 B 53.606367 -115.915733 Provincial Land Incendiary NaN ... NaN NaN 2006-04-03 13:20:00 0.20 2006-04-03 13:20:00 0.20 NaN NaN 2006-04-03 14:00:00 0.20
2 2006 EWF001 NaN 0.50 B 53.610933 -115.594267 Provincial Land Incendiary NaN ... NaN NaN 2006-04-03 13:23:00 0.50 2006-04-03 13:23:00 0.50 NaN NaN 2006-04-03 15:00:00 0.50
3 2006 EWF003 NaN 0.01 A 53.608867 -115.609467 Provincial Land Incendiary NaN ... NaN NaN 2006-04-03 14:08:00 0.01 2006-04-03 14:08:00 0.01 NaN NaN 2006-04-03 15:05:00 0.01
4 2006 PWF002 NaN 0.10 A 56.249956 -117.050249 Provincial Land Other Industry Waste Disposal ... NaN NaN 2006-04-03 19:57:00 0.10 2006-04-03 20:19:00 0.10 2006-04-03 20:20:00 0.1 2006-04-05 10:18:00 0.10

5 rows × 50 columns

Handle missing values

## First I checked how many rows do we have : 
num_rows = wildfire.shape[0]
print("Number of rows:", num_rows)
Number of rows: 26551
wildfire.isna().sum()
YEAR                                0
FIRE_NUMBER                         0
FIRE_NAME                       25756
CURRENT_SIZE                        0
SIZE_CLASS                          0
LATITUDE                            0
LONGITUDE                           0
FIRE_ORIGIN                        12
GENERAL_CAUSE                       0
INDUSTRY_IDENTIFIER             26071
RESPONSIBLE_GROUP               17029
ACTIVITY_CLASS                  10416
TRUE_CAUSE                      12529
FIRE_START_DATE                   689
DETECTION_AGENT_TYPE                0
DETECTION_AGENT                     0
DISCOVERED_DATE                  5409
DISCOVERED_SIZE                 26402
REPORTED_DATE                       0
DISPATCHED_RESOURCE                12
DISPATCH_DATE                      12
START_FOR_FIRE_DATE                17
ASSESSMENT_RESOURCE                 0
ASSESSMENT_DATETIME                 0
ASSESSMENT_HECTARES                 0
FIRE_SPREAD_RATE                 2806
FIRE_TYPE                        2617
FIRE_POSITION_ON_SLOPE           2863
WEATHER_CONDITIONS_OVER_FIRE     2870
TEMPERATURE                      2872
RELATIVE_HUMIDITY                2878
WIND_DIRECTION                   2880
WIND_SPEED                       2880
FUEL_TYPE                        7559
INITIAL_ACTION_BY                  17
IA_ARRIVAL_AT_FIRE_DATE          7703
IA_ACCESS                       14858
FIRE_FIGHTING_START_DATE         7572
FIRE_FIGHTING_START_SIZE         7572
BUCKETING_ON_FIRE                7770
DISTANCE_FROM_WATER_SOURCE      18958
FIRST_BUCKET_DROP_DATE          18957
FIRST_BH_DATE                       0
FIRST_BH_SIZE                       0
FIRST_UC_DATE                       0
FIRST_UC_SIZE                       0
FIRST_TO_DATE                   23809
FIRST_TO_SIZE                   23809
FIRST_EX_DATE                       6
FIRST_EX_SIZE_PERIMETER             6
dtype: int64

Drop uninformative or very sparse columns

These columns have more than 50% missing values and we can remove them.

## These columns are very sparse
# FIRE_NAME                       25756
# INDUSTRY_IDENTIFIER             26071
# DISCOVERED_SIZE                 26402
# DISTANCE_FROM_WATER_SOURCE      18958
# FIRST_BUCKET_DROP_DATE          18957
# FIRST_TO_DATE                   23809
# FIRST_TO_SIZE                   23809



wildfire = wildfire.drop(columns=[
    "FIRE_NAME", "INDUSTRY_IDENTIFIER", "DISCOVERED_SIZE", 
    "DISTANCE_FROM_WATER_SOURCE", "FIRST_BUCKET_DROP_DATE", "FIRST_TO_DATE", "FIRST_TO_SIZE"
], errors="ignore")

Handle moderate missing values

These are useful variables but have some missing data. We can fill with “unknown”, 0, or mean.

# few missing
wildfire["DISPATCHED_RESOURCE"] = wildfire["DISPATCHED_RESOURCE"].fillna("Unknown")
wildfire["DISPATCH_DATE"] = pd.to_datetime(wildfire["DISPATCH_DATE"], errors="coerce") # I converted this column to date, and If a value can't be converted to a date (e.g., it's missing or malformed), it will be replaced with NaT
wildfire["START_FOR_FIRE_DATE"] = pd.to_datetime(wildfire["START_FOR_FIRE_DATE"], errors="coerce")
wildfire["INITIAL_ACTION_BY"] = wildfire["INITIAL_ACTION_BY"].fillna("Unknown")
wildfire["FIRE_ORIGIN"] = wildfire["INITIAL_ACTION_BY"].fillna("Unknown")
wildfire["FIRST_EX_DATE"] = pd.to_datetime(wildfire["FIRST_EX_DATE"], errors="coerce")
wildfire["FIRST_EX_SIZE_PERIMETER"] = wildfire["FIRST_EX_SIZE_PERIMETER"].fillna(-1) # missing or unknown numeric data

# Moderate missing
wildfire["FIRE_START_DATE"] = pd.to_datetime(wildfire["FIRE_START_DATE"], errors="coerce")
wildfire["DISCOVERED_DATE"] = pd.to_datetime(wildfire["DISCOVERED_DATE"], errors="coerce")
wildfire["FIRE_SPREAD_RATE"] = wildfire["FIRE_SPREAD_RATE"].fillna(wildfire["FIRE_SPREAD_RATE"].mean())  ## mean value
wildfire["FIRE_POSITION_ON_SLOPE"] = wildfire["FIRE_POSITION_ON_SLOPE"].fillna("Unknown")
wildfire["WEATHER_CONDITIONS_OVER_FIRE"] = wildfire["WEATHER_CONDITIONS_OVER_FIRE"].fillna("Unknown")
wildfire["WIND_DIRECTION"] = wildfire["WIND_DIRECTION"].fillna("Unknown")
wildfire["FUEL_TYPE"] = wildfire["FUEL_TYPE"].fillna("Unknown")
wildfire["TRUE_CAUSE"] = wildfire["TRUE_CAUSE"].fillna("Unknown") 
wildfire["FIRE_TYPE"] = wildfire["FIRE_TYPE"].fillna("Unknown") 

# Higher missing 
wildfire["RESPONSIBLE_GROUP"] = wildfire["RESPONSIBLE_GROUP"].fillna("Unknown")
wildfire["ACTIVITY_CLASS"] = wildfire["ACTIVITY_CLASS"].fillna("Unknown")
wildfire["IA_ARRIVAL_AT_FIRE_DATE"] = pd.to_datetime(wildfire["IA_ARRIVAL_AT_FIRE_DATE"], errors="coerce")
wildfire["IA_ACCESS"] = wildfire["IA_ACCESS"].fillna("Unknown")
wildfire["FIRE_FIGHTING_START_DATE"] = pd.to_datetime(wildfire["FIRE_FIGHTING_START_DATE"], errors="coerce")
wildfire["FIRE_FIGHTING_START_SIZE"] = wildfire["FIRE_FIGHTING_START_SIZE"].fillna(wildfire["FIRE_FIGHTING_START_SIZE"].median()) ## median value
wildfire["BUCKETING_ON_FIRE"] = wildfire["BUCKETING_ON_FIRE"].fillna("Unknown")

## Double check everything one more time  --> alll of the date columns now have NaT 
wildfire.isna().sum()
## Please let me know about the other columns, like humidity or temperature and wind_s, what should we do 
YEAR                               0
FIRE_NUMBER                        0
CURRENT_SIZE                       0
SIZE_CLASS                         0
LATITUDE                           0
LONGITUDE                          0
FIRE_ORIGIN                        0
GENERAL_CAUSE                      0
RESPONSIBLE_GROUP                  0
ACTIVITY_CLASS                     0
TRUE_CAUSE                         0
FIRE_START_DATE                  693
DETECTION_AGENT_TYPE               0
DETECTION_AGENT                    0
DISCOVERED_DATE                 5409
REPORTED_DATE                      0
DISPATCHED_RESOURCE                0
DISPATCH_DATE                     12
START_FOR_FIRE_DATE               17
ASSESSMENT_RESOURCE                0
ASSESSMENT_DATETIME                0
ASSESSMENT_HECTARES                0
FIRE_SPREAD_RATE                   0
FIRE_TYPE                          0
FIRE_POSITION_ON_SLOPE             0
WEATHER_CONDITIONS_OVER_FIRE       0
TEMPERATURE                     2872
RELATIVE_HUMIDITY               2878
WIND_DIRECTION                     0
WIND_SPEED                      2880
FUEL_TYPE                          0
INITIAL_ACTION_BY                  0
IA_ARRIVAL_AT_FIRE_DATE         7703
IA_ACCESS                          0
FIRE_FIGHTING_START_DATE        7572
FIRE_FIGHTING_START_SIZE           0
BUCKETING_ON_FIRE                  0
FIRST_BH_DATE                      0
FIRST_BH_SIZE                      0
FIRST_UC_DATE                      0
FIRST_UC_SIZE                      0
FIRST_EX_DATE                      6
FIRST_EX_SIZE_PERIMETER            0
dtype: int64

Fix the Data types

## First I checked the data types
print(wildfire.dtypes)
YEAR                                     int64
FIRE_NUMBER                             object
CURRENT_SIZE                           float64
SIZE_CLASS                              object
LATITUDE                               float64
LONGITUDE                              float64
FIRE_ORIGIN                             object
GENERAL_CAUSE                           object
RESPONSIBLE_GROUP                       object
ACTIVITY_CLASS                          object
TRUE_CAUSE                              object
FIRE_START_DATE                 datetime64[ns]
DETECTION_AGENT_TYPE                    object
DETECTION_AGENT                         object
DISCOVERED_DATE                 datetime64[ns]
REPORTED_DATE                           object
DISPATCHED_RESOURCE                     object
DISPATCH_DATE                   datetime64[ns]
START_FOR_FIRE_DATE             datetime64[ns]
ASSESSMENT_RESOURCE                     object
ASSESSMENT_DATETIME                     object
ASSESSMENT_HECTARES                    float64
FIRE_SPREAD_RATE                       float64
FIRE_TYPE                               object
FIRE_POSITION_ON_SLOPE                  object
WEATHER_CONDITIONS_OVER_FIRE            object
TEMPERATURE                            float64
RELATIVE_HUMIDITY                      float64
WIND_DIRECTION                          object
WIND_SPEED                             float64
FUEL_TYPE                               object
INITIAL_ACTION_BY                       object
IA_ARRIVAL_AT_FIRE_DATE         datetime64[ns]
IA_ACCESS                               object
FIRE_FIGHTING_START_DATE        datetime64[ns]
FIRE_FIGHTING_START_SIZE               float64
BUCKETING_ON_FIRE                       object
FIRST_BH_DATE                           object
FIRST_BH_SIZE                          float64
FIRST_UC_DATE                           object
FIRST_UC_SIZE                          float64
FIRST_EX_DATE                   datetime64[ns]
FIRST_EX_SIZE_PERIMETER                float64
dtype: object

Date/Time

date_cols = [
    "FIRE_START_DATE", "DISCOVERED_DATE", "REPORTED_DATE", "DISPATCH_DATE",
    "START_FOR_FIRE_DATE", "IA_ARRIVAL_AT_FIRE_DATE", "FIRE_FIGHTING_START_DATE", "FIRST_BH_DATE", "FIRST_UC_DATE",
    "FIRST_EX_DATE", "ASSESSMENT_DATETIME"
]

for col in date_cols:
    wildfire[col] = pd.to_datetime(wildfire[col], errors="coerce")

## Format all float columns to 2 decimal places

float_cols = wildfire.select_dtypes(include=["float64"]).columns
wildfire[float_cols] = wildfire[float_cols].round(2)

Column Names:

wildfire.columns.tolist()
['YEAR',
 'FIRE_NUMBER',
 'CURRENT_SIZE',
 'SIZE_CLASS',
 'LATITUDE',
 'LONGITUDE',
 'FIRE_ORIGIN',
 'GENERAL_CAUSE',
 'RESPONSIBLE_GROUP',
 'ACTIVITY_CLASS',
 'TRUE_CAUSE',
 'FIRE_START_DATE',
 'DETECTION_AGENT_TYPE',
 'DETECTION_AGENT',
 'DISCOVERED_DATE',
 'REPORTED_DATE',
 'DISPATCHED_RESOURCE',
 'DISPATCH_DATE',
 'START_FOR_FIRE_DATE',
 'ASSESSMENT_RESOURCE',
 'ASSESSMENT_DATETIME',
 'ASSESSMENT_HECTARES',
 'FIRE_SPREAD_RATE',
 'FIRE_TYPE',
 'FIRE_POSITION_ON_SLOPE',
 'WEATHER_CONDITIONS_OVER_FIRE',
 'TEMPERATURE',
 'RELATIVE_HUMIDITY',
 'WIND_DIRECTION',
 'WIND_SPEED',
 'FUEL_TYPE',
 'INITIAL_ACTION_BY',
 'IA_ARRIVAL_AT_FIRE_DATE',
 'IA_ACCESS',
 'FIRE_FIGHTING_START_DATE',
 'FIRE_FIGHTING_START_SIZE',
 'BUCKETING_ON_FIRE',
 'FIRST_BH_DATE',
 'FIRST_BH_SIZE',
 'FIRST_UC_DATE',
 'FIRST_UC_SIZE',
 'FIRST_EX_DATE',
 'FIRST_EX_SIZE_PERIMETER']
wildfire.columns = wildfire.columns.str.lower()

Investigation of the Date columns

date_cols = [
    'fire_start_date', 'discovered_date', 'reported_date', 'dispatch_date',
    'start_for_fire_date', 'assessment_datetime', 'ia_arrival_at_fire_date',
    'fire_fighting_start_date', 'first_bh_date', 'first_uc_date', 'first_ex_date'
]

results = []

for col in date_cols:
    n_unique = wildfire[col].nunique()
    total = wildfire[col].notna().sum()
    ratio = round(n_unique / total, 3) if total else None
    results.append({
        "column": col,
        "unique_values": n_unique,
        "non_null_values": total,
        "uniqueness_ratio": ratio
    })

date_stats_df = pd.DataFrame(results)
date_stats_df
column unique_values non_null_values uniqueness_ratio
0 fire_start_date 20199 25858 0.781
1 discovered_date 20276 21142 0.959
2 reported_date 25442 26551 0.958
3 dispatch_date 25408 26539 0.957
4 start_for_fire_date 25432 26534 0.958
5 assessment_datetime 25678 26551 0.967
6 ia_arrival_at_fire_date 18146 18848 0.963
7 fire_fighting_start_date 18128 18979 0.955
8 first_bh_date 25450 26551 0.959
9 first_uc_date 25406 26551 0.957
10 first_ex_date 25140 26545 0.947

Based on Alberta Historical Wildfire Data Dictionary (2006–2024)

Column Name Official Definition (From Data Dictionary)
fire_start_date The estimated or known time and date the wildfire began. May come from storm tracking, witness statements, or lightning map data.
discovered_date The time the detection agent first discovered the wildfire. May be blank for unplanned detections.
reported_date The time and date the wildfire was reported to Alberta Wildfire (usually when the detection agent contacts dispatch).
dispatch_date The date and time the first resource was dispatched to respond to the wildfire.
start_for_fire_date The time and date the dispatched resource left for the wildfire, e.g., wheels roll, skids up, or vehicle departure.
assessment_datetime The date and time of the initial wildfire assessment, which determines the fire year.
ia_arrival_at_fire_date The date and time the initial action resource arrived at the wildfire site.
fire_fighting_start_date The date and time suppression efforts began (e.g., digging firelines, water drops).
first_bh_date The date/time the wildfire status changed to Being Held (BH) — unlikely to spread further under current conditions.
first_uc_date The date/time the fire was declared Under Control (UC) — fire perimeter secured, no further spread expected.
first_ex_date The date/time the fire was declared Extinguished (EX) — fully out, no hot spots remaining.

I just keep these date columns and I think these columns are most informative

## I keep these columns
##   'fire_start_date',
##   'discovered_date',
##   'ia_arrival_at_fire_date',
##   'fire_fighting_start_date',
##   'first_bh_date',
##   'first_uc_date',
##   'first_ex_date'

# I deleted these columns
cols_to_delete = [
  'reported_date',
    'dispatch_date',
    'start_for_fire_date',
    'first_ex_date',
    'discovered_date', 
    'assessment_datetime',
    'dispatched_resource', ## This is not a date column, but it is not useful so I deleted it here
    'assessment_resource' ## This is not date column but it is not usefull so I deleted here
]

wildfire = wildfire.drop(columns=cols_to_delete, errors='ignore')
wildfire.head()
year fire_number current_size size_class latitude longitude fire_origin general_cause responsible_group activity_class ... ia_arrival_at_fire_date ia_access fire_fighting_start_date fire_fighting_start_size bucketing_on_fire first_bh_date first_bh_size first_uc_date first_uc_size first_ex_size_perimeter
0 2006 PWF001 0.10 A 56.25 -117.18 Land Owner Resident Resident Grass ... NaT Unknown NaT 0.02 Unknown 2006-04-02 22:00:00 0.01 2006-04-02 22:00:00 0.01 0.10
1 2006 EWF002 0.20 B 53.61 -115.92 Fire Department Incendiary Others (explain in remarks) Lighting Fires ... NaT Unknown NaT 0.02 Unknown 2006-04-03 13:20:00 0.20 2006-04-03 13:20:00 0.20 0.20
2 2006 EWF001 0.50 B 53.61 -115.59 Fire Department Incendiary Others (explain in remarks) Lighting Fires ... NaT Unknown NaT 0.02 Unknown 2006-04-03 13:23:00 0.50 2006-04-03 13:23:00 0.50 0.50
3 2006 EWF003 0.01 A 53.61 -115.61 Industry Incendiary Others (explain in remarks) Lighting Fires ... NaT Unknown NaT 0.02 Unknown 2006-04-03 14:08:00 0.01 2006-04-03 14:08:00 0.01 0.01
4 2006 PWF002 0.10 A 56.25 -117.05 Fire Department Other Industry Employees Refuse ... NaT Unknown NaT 0.02 Unknown 2006-04-03 19:57:00 0.10 2006-04-03 20:19:00 0.10 0.10

5 rows × 35 columns

wildfire.columns.tolist()
['year',
 'fire_number',
 'current_size',
 'size_class',
 'latitude',
 'longitude',
 'fire_origin',
 'general_cause',
 'responsible_group',
 'activity_class',
 'true_cause',
 'fire_start_date',
 'detection_agent_type',
 'detection_agent',
 'assessment_hectares',
 'fire_spread_rate',
 'fire_type',
 'fire_position_on_slope',
 'weather_conditions_over_fire',
 'temperature',
 'relative_humidity',
 'wind_direction',
 'wind_speed',
 'fuel_type',
 'initial_action_by',
 'ia_arrival_at_fire_date',
 'ia_access',
 'fire_fighting_start_date',
 'fire_fighting_start_size',
 'bucketing_on_fire',
 'first_bh_date',
 'first_bh_size',
 'first_uc_date',
 'first_uc_size',
 'first_ex_size_perimeter']

Please let me know which columns you think are better to keep

Category Columns
Identifiers 'year', 'fire_number'
Fire Size & Status 'current_size', 'size_class', 'fire_fighting_start_size', 'first_bh_size', 'first_uc_size', 'first_ex_size_perimeter'
Location 'latitude', 'longitude', 'fire_origin'
Cause & Activity 'general_cause', 'true_cause', 'activity_class', 'responsible_group'
Key Dates 'fire_start_date', 'discovered_date', 'ia_arrival_at_fire_date', 'fire_fighting_start_date', 'first_bh_date', 'first_uc_date'
Detection & Resources 'detection_agent_type', 'detection_agent', 'initial_action_by', 'ia_access'
Assessment & Behavior 'assessment_hectares', 'fire_spread_rate', 'fire_type', 'fire_position_on_slope', 'fuel_type'
Weather 'weather_conditions_over_fire', 'temperature', 'relative_humidity', 'wind_direction', 'wind_speed'
Suppression 'bucketing_on_fire'

Cleaning some values o Columns

wildfire['responsible_group'] = wildfire['responsible_group'].replace(
    'Others (explain in remarks)', 
    'Others'
)

Export cleaned dataset

wildfire.to_csv("../data/clean/wildfire.csv", index=False)

Validate cleaned file

clean = pd.read_csv("../data/clean/wildfire.csv")
clean.shape
clean.head()
year fire_number current_size size_class latitude longitude fire_origin general_cause responsible_group activity_class ... ia_arrival_at_fire_date ia_access fire_fighting_start_date fire_fighting_start_size bucketing_on_fire first_bh_date first_bh_size first_uc_date first_uc_size first_ex_size_perimeter
0 2006 PWF001 0.10 A 56.25 -117.18 Land Owner Resident Resident Grass ... NaN Unknown NaN 0.02 Unknown 2006-04-02 22:00:00 0.01 2006-04-02 22:00:00 0.01 0.10
1 2006 EWF002 0.20 B 53.61 -115.92 Fire Department Incendiary Others Lighting Fires ... NaN Unknown NaN 0.02 Unknown 2006-04-03 13:20:00 0.20 2006-04-03 13:20:00 0.20 0.20
2 2006 EWF001 0.50 B 53.61 -115.59 Fire Department Incendiary Others Lighting Fires ... NaN Unknown NaN 0.02 Unknown 2006-04-03 13:23:00 0.50 2006-04-03 13:23:00 0.50 0.50
3 2006 EWF003 0.01 A 53.61 -115.61 Industry Incendiary Others Lighting Fires ... NaN Unknown NaN 0.02 Unknown 2006-04-03 14:08:00 0.01 2006-04-03 14:08:00 0.01 0.01
4 2006 PWF002 0.10 A 56.25 -117.05 Fire Department Other Industry Employees Refuse ... NaN Unknown NaN 0.02 Unknown 2006-04-03 19:57:00 0.10 2006-04-03 20:19:00 0.10 0.10

5 rows × 35 columns

 
 

This page is built with Quarto.