Diverse Data Hub

On this page

  • About the Data
    • Download
    • Metadata
    • Variables
    • Key Features of the Data Set
    • Purpose and Use Cases
  • Case Study
    • Objective
    • Analysis
      • 1. Data Cleaning & Processing
      • 2. Exploratory Data Analysis
      • 3. Logistic Regression Model
      • 4. Interpretation/Evaluation of our Model
    • Discussion
  • Attribution

Historical Alberta Wildfire Data

Note: Try it live!

Use Binder to explore and run this dataset and analysis interactively in your browser with R. Ideal for students and instructors to run, modify, and test the analysis in a live JupyterLab environment—no setup needed.

👉 Launch on Binder

About the Data

This data set contains information on wildfires in Canada, compiled from official government sources under the Open Government Licence – Alberta.

The data was gathered to monitor, assess, and respond to wildfire risks across different regions. Wildfires have far-reaching environmental, social, and economic consequences. From an equity and inclusion perspective, analyzing wildfire data can reveal geographic and resource-based disparities in detection and containment efforts, and highlight how certain populations face greater risks due to climate change and limited infrastructure.

In particular, Alberta experiences some of the most severe and frequent wildfires in Canada due to its vast forested areas, dry climate, and increasing temperatures linked to climate change. Wildfires in Alberta can lead to widespread evacuations, destroy homes and livelihoods, and disproportionately affect rural and Indigenous communities, who may lack access to adequate emergency services and infrastructure. Understanding the patterns of wildfire occurrence and spread helps policymakers, environmental planners, and emergency services allocate resources more equitably and implement effective mitigation strategies. This data set enables data-driven approaches to reduce the impact of wildfires and support more resilient and inclusive disaster management practices across Alberta and beyond.

Download

Download CSV

Metadata

CSV Name
wildfire.csv
Data Set Characteristics
Multivariate
Subject Area
Climate Change
Associated Tasks
Classification, Time Series, Geospatial Analysis
Feature Type
Categorical, Integer
Instances
26551 records
Features
50
Has Missing Values?
Yes

Variables

Variable Name Role Type Description Units Missing Values
year ID Integer Year in which the wildfire incident was first detected Year No
fire_number ID String Identifier for the wildfire - No
current_size Feature Numeric Final estimated area burned by the wildfire Hectares No
size_class Feature Categorical Size classification based on final area burned - No
latitude Feature Numeric Latitude coordinate of the wildfire origin Degrees No
longitude Feature Numeric Longitude coordinate of the wildfire origin Degrees No
fire_origin Metadata Categorical Who owns or administers the land the wildfire ignited on - No
general_cause Feature Categorical Classification of the wildfire cause - No
responsible_group Metadata Categorical Recreational group responsible for causing the wildfire - No
activity_class Feature Categorical Activity that was going on when the wildfire started - No
true_cause Feature Categorical Specific reason why the wildfire started (e.g., “Arson Known”, “Hot Exhaust”, “Line Impact”, “Unattended Fire”, etc.) - No
fire_start_date Time DateTime Datetime when the wildfire started YYYY-MM-DD Yes
detection_agent_type Feature Categorical Type of detection agent that discovered the wildfire (e.g., lookout (“LKT”), aircraft (“AIR”)) - No
detection_agent Feature Categorical Specific type of detection agent that discovered the wildfire - No
assessment_hectares Feature Numeric Size of the wildfire at the time of assessment Hectares No
fire_spread_rate Feature Numeric Rate at which the wildfire spread at the time of initial assessment Metres/minute No
fire_type Feature Categorical Predominant wildfire behavior classification at the time of initial assessment (e.g., “Surface”, “Ground”, “Crown”) - No
fire_position_on_slope Feature Categorical Position of the wildfire relative to the slope it is travelling on at the time of initial assessment (e.g., “Bottom”, “Middle 1/3”, “Unknown”) - No
weather_conditions_over_fire Feature Categorical Weather conditions over the wildfire at the time of initial assessment - No
temperature Feature Numeric Temperature at the wildfire location at the time of initial assessment °C Yes
relative_humidity Feature Numeric Relative humidity at the wildfire location at the time of initial assessment % Yes
wind_direction Feature Categorical Wind direction at the wildfire location at the time of initial assessment - No
wind_speed Feature Numeric Wind speed at the wildfire location in km/h at the time of initial assessment km/h Yes
fuel_type Feature Categorical Dominant fuel type (vegetation cover) in which the wildfire is burning at the wildfire location at the time of initial assessment - No
initial_action_by Metadata Categorical Group that initiated suppression efforts - No
ia_arrival_at_fire_date Time DateTime Datetime when the initial action group arrived at the wildfire YYYY-MM-DD Yes
ia_access Feature Categorical Method of access that the initial action group used - No
fire_fighting_start_date Time DateTime Datetime when the initial action group began firefighting activities YYYY-MM-DD Yes
fire_fighting_start_size Feature Numeric Wildfire size at the time firefighting began Hectares No
bucketing_on_fire Feature Binary Whether aerial bucketing was used on the wildfire Yes/No No
first_bh_date Time DateTime Datetime when wildfire was first declared being held YYYY-MM-DD No
first_bh_size Feature Numeric Wildfire size when wildfire was first declared being held Hectares No
first_uc_date Time DateTime Datetime when wildfire was first declared under control YYYY-MM-DD No
first_uc_size Feature Numeric Wildfire size when first declared under control Hectares No
first_ex_size_perimeter Feature Numeric Wildfire size when first declared extinguished Hectares No

Key Features of the Data Set

Each row represents a single wildfire incident and includes information such as:

  • Environmental conditions (temperature, wind_speed, relative_humidity) – Air temperature (°C), wind speed (km/h) and relative humidity (%) near the fire; influence fire intensity and spread.

  • Fire behavior metrics (fire_spread_rate, fire_type, fuel_type) – Fire spread rate (m/min), behavior class (surface, crown) and dominant fuel; determine growth and management.

  • Access difficulty (ia_access) – Ease of crew access; limited access delays response.

  • Location coordinates (latitude, longitude) – Decimal degrees of fire origin; used for mapping and spatial analysis.

Purpose and Use Cases

This data set is designed to support analysis of:

  • Factors contributing to the spread, intensity, and size of wildfires

  • The impact of weather conditions and fuel types on fire behavior

  • Geographic and seasonal patterns in wildfire occurrence

  • The effectiveness and timeliness of initial suppression efforts

  • Relationships between fire causes, detection methods, and responsible parties

Case Study

Objective

  • R
  • Python

Large wildfires pose serious environmental, social, and economic challenges, especially as climate conditions become more extreme. Identifying the key environmental and human factors linked to these fires can help guide more effective prevention and response strategies.
So, our main question is:

Can we identify the environmental and human factors most associated with large wildfires?

According to Natural Resources Canada, wildfires exceeding 200 hectares in final size are classified as “large fires.” While these fires represent a small percentage of all wildfires, they account for the majority of the total area burned annually.
The goal is to explore potential predictors of fire size, such as weather, fire cause, and detection method, and provide insights that could inform early interventions and resource planning.

Large wildfires pose serious environmental, social, and economic challenges, especially as climate conditions become more extreme. A machine learning model may be able to guide prevention and response strategies.

So, our main question is:

How well can a logistic regression model be used to predict large wildfires?

According to Natural Resources Canada, wildfires exceeding 200 hectares in final size are classified as “large fires”. While these fires represent a small percentage of all wildfires, they account for the majority of the total area burned annually.
The goal is to explore a predictive logistic regression model, tune its hyperparameters, and evaluate its performance.
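As a quick illustration of the 200-hectare cutoff that defines a "large fire" (the sizes below are toy values, not drawn from the data set):

```python
# Toy final fire sizes in hectares (illustrative values only)
sizes_ha = [0.1, 3.5, 12.0, 250.0, 1800.0]

# A fire is "large" when its final size exceeds 200 ha
large = [s > 200 for s in sizes_ha]

print(large)                    # [False, False, False, True, True]
print(sum(large) / len(large))  # 0.4 -> share of large fires
```

Even in this toy example, the two large fires account for nearly all of the total area burned, mirroring the pattern Natural Resources Canada describes.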

Analysis

Loading Libraries

  • R
  • Python
# Data
library(diversedata)      # Diverse Data Hub data sets

# Core libraries
library(tidyverse)        
library(lubridate)      

# Spatial & mapping
library(sf)               
library(terra)            
library(ggmap)            
library(ggspatial)        
library(maptiles)         
library(leaflet)          
library(leaflet.extras)   

# Visualization & color
library(viridis)          

# Tables & reporting
library(gt)               
library(kableExtra)       

# Modeling & interpretation
library(marginaleffects)  
library(broom)           
import diversedata as dd
import numpy as np
import pandas as pd
import altair as alt
import plotly.express as px
from sklearn.model_selection import train_test_split
from sklearn.compose import make_column_transformer
from sklearn.pipeline import make_pipeline
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV
from scipy.stats import randint
from sklearn.metrics import (
    f1_score,
    precision_score,
    recall_score,
    ConfusionMatrixDisplay,
    confusion_matrix,
    classification_report
)
from sklearn.dummy import DummyClassifier
from IPython.display import Markdown

1. Data Cleaning & Processing

  • R
  • Python
  • Converted fire size to numeric
  • Created a binary variable large_fire (TRUE if >200 ha)
  • Filtered out incomplete records
# Reading Data  
wildfire_data <- wildfire

# Clean and prepare base data
wildfire_clean <- wildfire_data |>
  filter(!is.na(assessment_hectares), assessment_hectares > 0) |>
  mutate(
    large_fire = current_size > 200,
    true_cause = as.factor(true_cause),
    detection_agent_type = as.factor(detection_agent_type),
    temperature = as.numeric(temperature),
    wind_speed = as.numeric(wind_speed)
  )

# Drop unused levels for modeling
wildfire_clean <- wildfire_clean |>
  filter(!is.na(true_cause), !is.na(detection_agent_type)) |>
  mutate(
    true_cause = droplevels(true_cause),
    detection_agent_type = droplevels(detection_agent_type)
  )
  • Created a binary variable large_fire (TRUE if >200 ha)
  • Filtered out incomplete records
# Reading Data
wildfire = dd.load_data("wildfire")

# Clean and prepare base data
wildfire = wildfire.dropna(
    subset=["assessment_hectares", "true_cause", "detection_agent_type"]
)
wildfire["large_fire"] = wildfire["current_size"] > 200

2. Exploratory Data Analysis

Map of Wildfire Size and Location in Alberta

This interactive map displays the geographic distribution and relative size of wildfires across Alberta, using red circles sized by fire area. Each point represents a wildfire event, with larger circles indicating more extensive burns. The map reveals regions with concentrated wildfire activity and visually emphasizes differences in fire magnitude across the province.

  • R
  • Python
Note

To provide geographic context for our wildfire data, we added a shapefile representing Alberta’s boundaries.
This shapefile was sourced from the Alberta Government Open Data Portal and specifically corresponds to the Electoral Division Shapefile (Bill 33, 2017).
The data was processed and transformed to the appropriate geographic coordinate system to enable mapping alongside our wildfire data set.

# map
leaflet() |>
  addProviderTiles("CartoDB.Positron") |> 
  setView(lng = -115, lat = 55, zoom = 5.5) |> 
  addPolygons(data = alberta_shape, 
              color = "#CCCCCC",     
              weight = 0.5,         
              fillOpacity = 0.02,   
              group = "Alberta Boundaries") |>
  addCircles(data = wildfire_sf,
             radius = ~sqrt(current_size) * 30,  
             fillOpacity = 0.6,
             color = "red",
             stroke = FALSE,
             group = "Wildfires") |>
  addLayersControl(overlayGroups = c("Alberta Boundaries", "Wildfires"),
                   options = layersControlOptions(collapsed = FALSE)) |>
  addLegend(position = "bottomright", 
            title = "Wildfire Size (approx.)",
            colors = "red", 
            labels = "Larger = Bigger fire")
wildfire["display_size"] = wildfire["current_size"] + 1000

fig_map = px.scatter_map(
    wildfire,
    lat="latitude",
    lon="longitude",
    opacity=0.6,
    zoom=4.7,
    center={"lat": 55, "lon": -116},
    size="display_size",
    color_discrete_sequence=["red"],
    hover_name="fire_number",
    hover_data={"display_size": False, "current_size": True},
    labels={"current_size": "size (hectares)"},
)

fig_map = fig_map.update_layout(
    margin=dict(l=0, r=0, t=30, b=10),
    width=700,
    height=800,
    mapbox_style="open-street-map",
)

Proportion of Large Fires by Cause

  • R
  • Python
wildfire_clean |>
  group_by(true_cause) |>
  summarize(prop_large = mean(large_fire, na.rm = TRUE)) |>
  filter(prop_large > 0) |>
  ggplot(aes(x = reorder(true_cause, -prop_large), y = prop_large)) +
  geom_col(fill = 'skyblue') +
  coord_flip() +
  labs(
    title = "Proportion of Large Fires by True Cause",
    x = "True Cause",
    y = "Proportion of Fires > 200 ha",
  ) +
  theme_minimal() +
  theme(
    plot.title = element_text(size = 14, face = "bold"),
    axis.text.y = element_text(size = 10),
    axis.title = element_text(size = 12)
  )

wildfire_by_true_cause = (
    wildfire.groupby("true_cause")
    .agg(prop_large=("large_fire", lambda x: x.mean(skipna=True)))
    .query("prop_large > 0")
    .reset_index()
)

alt.Chart(wildfire_by_true_cause).mark_bar().encode(
    x=alt.X("prop_large:Q").title("Proportion of Fires > 200 ha"),
    y=alt.Y("true_cause").sort("x").title("True Cause"),
).properties(title="Proportion of Large Fires by True Cause", width=500, height=300)

3. Logistic Regression Model

Governments and emergency planners often care about whether a fire becomes large, not exactly how large it becomes. This makes binary classification more interpretable and actionable for planning responses. Logistic regression models the probability of a binary outcome, such as "Will this fire exceed 200 hectares?", and modeling the probability of crossing such a threshold is often more operationally useful than predicting exact hectares.

In Decision Making for Wildfires: A Guide for Applying a Risk Management Process at the Incident Level, the authors outline a risk-based decision-making framework. A key concept is identifying risk thresholds to guide decisions. This aligns with the broader goals of wildfire risk management, which emphasize anticipating fire severity and planning appropriate responses. Logistic regression is well-suited to this context because it models the probability of crossing such thresholds, such as a fire becoming large, enabling a clear, interpretable basis for making timely, risk-informed decisions.

We built a logistic regression model to predict the likelihood of a fire becoming large based on a set of environmental and operational factors that are consistently measurable, statistically significant, and operationally relevant.

In our logistic regression model, we selected key environmental and contextual variables from the wildfire_clean data set that are well-established drivers of wildfire behavior. These include:

  • temperature: Higher temperatures increase evaporation, dry out vegetation, and promote ignition and fire spread.

  • wind_speed: Strong winds feed oxygen to the fire and can carry embers over long distances, accelerating spread.

  • relative_humidity: Low humidity conditions dry fuels, increasing the probability and intensity of fire ignition.

  • fire_spread_rate: Reflects how quickly a fire expands; a dynamic indicator of fire behavior.

  • fire_type: Indicates how the fire behaves (e.g., surface vs. crown fire), which affects controllability and risk.

  • fuel_type: Different vegetation types burn at different intensities and rates; fuels like grass or timber respond differently under the same conditions.

  • ia_access: Stands for Initial Attack Access. Limited access can delay suppression efforts, allowing fires to grow larger.

These variables were chosen because they each represent core determinants of ignition likelihood, fire intensity, and suppression difficulty. As explained in Introduction to Wildland Fire by Pyne, Andrews, and Laven, this theoretical foundation supports our inclusion of both meteorological and environmental variables in the model.

Mathematical Definition of the Logistic Regression Model

We aim to estimate the probability that a wildfire becomes large (i.e., burns more than 200 hectares), using logistic regression.

Let:

\[ Y_i = \begin{cases} 1, & \text{if fire } i \text{ is large (area > 200 ha)} \\ 0, & \text{otherwise} \end{cases} \]

\[ \mathbf{x}_i = \left( \text{temperature}_i, \text{wind\_speed}_i, \text{relative\_humidity}_i, \text{fire\_spread\_rate}_i, \text{fire\_type}_i, \text{fuel\_type}_i, \text{ia\_access}_i \right) \]

Then the model is defined as:

\[ \Pr(Y_i = 1 \mid \mathbf{x}_i) = \frac{1}{1 + \exp(-\eta_i)} \]

where the linear predictor is:

\[ \eta_i = \beta_0 + \beta_1 \cdot \text{temperature}_i + \beta_2 \cdot \text{wind\_speed}_i + \beta_3 \cdot \text{relative\_humidity}_i + \beta_4 \cdot \text{fire\_spread\_rate}_i + \boldsymbol{\beta}_{\text{cat}}^\top \mathbf{x}_{i,\text{cat}} \]

Here:

\[ \begin{aligned} \beta_0 & \text{ is the intercept} \\ \boldsymbol{\beta}_{\text{cat}} & \text{ is the vector of coefficients for the categorical predictors} \\ \mathbf{x}_{i,\text{cat}} & \text{ is the vector of dummy variables representing those categorical predictors} \end{aligned} \]
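To make the formula concrete, here is a small numeric sketch of the inverse-logit transform applied to the numeric part of the linear predictor. The coefficient and predictor values below are invented for illustration, not taken from the fitted model:

```python
import math

def inv_logit(eta):
    """Map a linear predictor to a probability via the logistic function."""
    return 1 / (1 + math.exp(-eta))

# Hypothetical coefficients and one fire's numeric predictors (illustration only)
beta0, b_temp, b_wind, b_rh, b_spread = -2.0, 0.05, 0.05, -0.03, 0.10
temperature, wind_speed, rel_humidity, spread_rate = 25.0, 20.0, 30.0, 5.0

eta = (beta0 + b_temp * temperature + b_wind * wind_speed
       + b_rh * rel_humidity + b_spread * spread_rate)   # eta = -0.15
print(round(inv_logit(eta), 3))                          # 0.463
```

In the full model, the dummy-variable terms for fire_type, fuel_type, and ia_access would be added to eta in the same way before applying the inverse logit.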

  • R
  • Python

In this R analysis, we will fit a logistic regression model for inference.

model <- glm(
  large_fire ~ temperature + wind_speed + relative_humidity +
    fire_spread_rate + fire_type + fuel_type + ia_access,
  data = wildfire_clean,
  family = "binomial"
)

tidy_model <- tidy(model) |>
  mutate(
    estimate = round(estimate, 3),
    std.error = round(std.error, 3),
    statistic = round(statistic, 2),
    p.value = round(p.value, 4)
  )

kable(tidy_model
      , caption = "Logistic Regression Results: Predicting Large Fires (> 200 ha)") |>
  kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive"), full_width = FALSE)
Logistic Regression Results: Predicting Large Fires (> 200 ha)
term estimate std.error statistic p.value
(Intercept) -2.016 0.401 -5.03 0.0000
temperature 0.048 0.010 4.66 0.0000
wind_speed 0.050 0.004 11.03 0.0000
relative_humidity -0.031 0.004 -7.64 0.0000
fire_spread_rate 0.104 0.010 10.30 0.0000
fire_typeGround -1.354 0.184 -7.37 0.0000
fire_typeSurface -1.093 0.129 -8.48 0.0000
fire_typeUnknown -13.012 3683.753 0.00 0.9972
fuel_typeC2 -0.945 0.175 -5.40 0.0000
fuel_typeC3 -1.024 0.276 -3.71 0.0002
fuel_typeC4 0.454 0.337 1.35 0.1775
fuel_typeC6 -16.192 4492.954 0.00 0.9971
fuel_typeC7 -16.211 1258.218 -0.01 0.9897
fuel_typeD1 -16.322 274.554 -0.06 0.9526
fuel_typeM1 -1.798 0.362 -4.97 0.0000
fuel_typeM2 -2.272 0.279 -8.13 0.0000
fuel_typeM3 -15.610 3734.948 0.00 0.9967
fuel_typeM4 -15.179 6522.639 0.00 0.9981
fuel_typeO1a -2.912 0.289 -10.08 0.0000
fuel_typeO1b -3.514 0.456 -7.70 0.0000
fuel_typeS1 -2.189 0.486 -4.51 0.0000
fuel_typeS2 -2.242 0.546 -4.11 0.0000
fuel_typeUnknown -4.148 0.481 -8.62 0.0000
ia_accessGround -0.861 0.410 -2.10 0.0356
ia_accessHover Exit -0.083 0.412 -0.20 0.8407
ia_accessRappel -14.797 554.243 -0.03 0.9787
ia_accessUnknown 0.428 0.107 3.99 0.0001

Higher wind speed (0.050, p < 0.001), temperature (0.048, p < 0.001), and fire spread rate (0.104, p < 0.001) significantly increase the odds of a large wildfire, while relative humidity (-0.031, p < 0.001) decreases it. Compared to crown fires, ground fires (-1.354, p < 0.001) and surface fires (-1.093, p < 0.001) are much less likely to result in large fires.

This logistic regression model estimates the factors influencing the likelihood of a large wildfire. Higher temperature, wind speed, and fire spread rate significantly increase the odds of a large fire, while higher relative humidity decreases them. Certain fire types (e.g., Ground, Surface) and fuel types (e.g., M1, M2, O1a, O1b) are associated with significantly lower odds than their respective reference categories.
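Because the coefficients are on the log-odds scale, exponentiating them gives odds ratios, which are often easier to communicate. A quick calculation using the point estimates from the table above:

```python
import math

# Log-odds coefficients from the fitted model's summary table
coefs = {"temperature": 0.048, "wind_speed": 0.050,
         "relative_humidity": -0.031, "fire_spread_rate": 0.104}

odds_ratios = {name: round(math.exp(b), 3) for name, b in coefs.items()}
print(odds_ratios)
# Each additional km/h of wind multiplies the odds of a large fire
# by about exp(0.050) ≈ 1.051, holding the other predictors fixed
```

Similarly, each percentage point of relative humidity multiplies the odds by about 0.969, i.e., a roughly 3% reduction.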

Let’s dig into our model further by examining how different environmental and operational factors interact to influence fire outcomes.

In this Python analysis, we will fit a logistic regression model for prediction.

First, let’s separate our predictors from the target.

Then, split our data into train and test sets, with the random state specified to make our work reproducible.

X = wildfire[[
  "temperature",
  "wind_speed",
  "relative_humidity",
  "fire_spread_rate",
  "fire_type",
  "fuel_type",
  "ia_access",
]]
y = wildfire["large_fire"]

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=123
)

As we can see below, there are some missing values in the numeric features: temperature, wind_speed, and relative_humidity. We’ll impute these with scikit-learn’s SimpleImputer, whose default strategy fills in the column mean.

X_train.isna().sum()
temperature          1985
wind_speed           1991
relative_humidity    1990
fire_spread_rate        0
fire_type               0
fuel_type               0
ia_access               0
dtype: int64
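Mean imputation simply replaces each missing entry with the mean of that column's observed (training) values. A minimal NumPy equivalent of what SimpleImputer's default strategy does, on toy data:

```python
import numpy as np

# One numeric column with missing values (toy data)
temperature = np.array([20.0, np.nan, 30.0, 25.0, np.nan])

# Mean of the observed entries only
col_mean = np.nanmean(temperature)  # (20 + 30 + 25) / 3 = 25.0

# Fill the gaps with that mean
imputed = np.where(np.isnan(temperature), col_mean, temperature)
print(imputed)  # [20. 25. 30. 25. 25.]
```

Inside the pipeline, the mean is learned on the training split only, so no information from the test set leaks into preprocessing.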

We’ll use StandardScaler on the numerical features and encode the categorical features with OneHotEncoder. A column transformer will allow us to specify which features should be preprocessed in a specific way and a pipeline will put together our preprocessing and modelling steps.

numeric_feats = X_train.select_dtypes('number').columns.to_list()
categorical_feats = X_train.select_dtypes(exclude=['number']).columns.to_list()

print(f"The numeric features that will be scaled and have missing values imputed: {numeric_feats} \n")
print(f"The categorical features that will be one-hot encoded: {categorical_feats}")
The numeric features that will be scaled and have missing values imputed: ['temperature', 'wind_speed', 'relative_humidity', 'fire_spread_rate'] 
The categorical features that will be one-hot encoded: ['fire_type', 'fuel_type', 'ia_access']
numeric_preprocessor = make_pipeline(SimpleImputer(), StandardScaler())
categorical_preprocessor = OneHotEncoder(handle_unknown="ignore")

preprocessor = make_column_transformer(
    (numeric_preprocessor, numeric_feats),
    (categorical_preprocessor, categorical_feats),
)

pipeline = make_pipeline(preprocessor, LogisticRegression(random_state=123, n_jobs=-1))

Using grid search with cross-validation, we’ll identify the optimal hyperparameter settings for class_weight and C. We’ll prioritize f1-score over accuracy since there is a class imbalance in the data for the target.

Setting class_weight to None treats all target classes equally, whereas balanced adjusts the weights inversely proportional to their frequencies in the input data.

Meanwhile, C controls regularization strength: larger values reduce regularization and allow the model to fit the training data more closely, while smaller values increase regularization to avoid overfitting to the training data.
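For class_weight="balanced", scikit-learn computes each class weight as n_samples / (n_classes * bincount(y)). A toy calculation (invented class counts, chosen to mimic the large-fire imbalance) shows how the minority class gets up-weighted:

```python
import numpy as np

# Toy labels: 8 "not large" fires (0) vs 2 large fires (1)
y = np.array([0, 0, 0, 0, 0, 0, 0, 0, 1, 1])

n_samples, n_classes = len(y), 2
weights = n_samples / (n_classes * np.bincount(y))
print(weights)  # class 0 -> 0.625, class 1 -> 2.5 (minority weighted 4x more)
```

This reweighting makes misclassifying a rare large fire cost more during training, which is why the balanced setting tends to help f1 on imbalanced targets.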

param_grid = {
    "logisticregression__class_weight": [None, "balanced"],
    "logisticregression__C": [0.001, 0.01, 0.1, 1.0, 10, 100],
}

gs = GridSearchCV(
    pipeline, param_grid=param_grid, scoring="f1", n_jobs=-1, return_train_score=True
)
gs.fit(X_train, y_train)
GridSearchCV(estimator=Pipeline(steps=[('columntransformer',
                                        ColumnTransformer(transformers=[('pipeline',
                                                                         Pipeline(steps=[('simpleimputer',
                                                                                          SimpleImputer()),
                                                                                         ('standardscaler',
                                                                                          StandardScaler())]),
                                                                         ['temperature',
                                                                          'wind_speed',
                                                                          'relative_humidity',
                                                                          'fire_spread_rate']),
                                                                        ('onehotencoder',
                                                                         OneHotEncoder(handle_unknown='ignore'),
                                                                         ['fire_type',
                                                                          'fuel_type',
                                                                          'ia_access'])])),
                                       ('logisticregression',
                                        LogisticRegression(n_jobs=-1,
                                                           random_state=123))]),
             n_jobs=-1,
             param_grid={'logisticregression__C': [0.001, 0.01, 0.1, 1.0, 10,
                                                   100],
                         'logisticregression__class_weight': [None,
                                                              'balanced']},
             return_train_score=True, scoring='f1')
cv_results = pd.DataFrame(gs.cv_results_)[
    [
        "mean_train_score",
        "std_train_score",
        "mean_test_score",
        "std_test_score",
        "param_logisticregression__class_weight", 
        "param_logisticregression__C",
        "mean_fit_time",
        "rank_test_score",
    ]
].set_index("rank_test_score").sort_index().head(5)

Markdown(cv_results.to_markdown())
rank_test_score mean_train_score std_train_score mean_test_score std_test_score param_logisticregression__class_weight param_logisticregression__C mean_fit_time
1 0.196485 0.0058278 0.195341 0.00693425 balanced 0.001 0.0521213
2 0.1924 0.00467025 0.187695 0.00898124 balanced 0.01 0.0655165
3 0.189944 0.00449934 0.181763 0.00690405 balanced 0.1 0.0828915
4 0.188844 0.00435476 0.180443 0.00739874 balanced 10 0.0891777
5 0.189138 0.00460983 0.180185 0.00724185 balanced 1 0.0874284

In the table above, we can see that a class_weight='balanced' and C=0.001 achieves the best mean validation score.

We can also see that there is not much variation between the validation scores, with a standard deviation (std_test_score) of 0.0069. This indicates that the model has a stable performance and is not overly sensitive to which subset of the training data it sees.

4. Interpretation/Evaluation of our Model

  • R
  • Python

The Role of Wind Across Fire Causes

As you’ve likely noticed, we did not include true_cause as a predictor in our logistic regression model because it contains many categories, some of which have very few observations. This can lead to large standard errors, unstable coefficient estimates, and a higher risk of overfitting.

However, true_cause remains an important variable for exploratory analysis. Certain ignition causes may be more sensitive to wind. For example, sparks from power lines can spread quickly in high winds, embers from debris burning can travel farther, and unattended fires can escalate rapidly under windy conditions. We chose wind speed for this analysis because it is a well-documented environmental amplifier of fire behavior, particularly when interacting with certain ignition causes.

Given these interactions, it’s valuable to visualize and investigate how the effect of wind speed on large wildfire development varies across different causes. Let’s explore this further through faceted marginal effect plots.

This plot shows how wind speed affects the probability of a large wildfire, and how that effect varies across different true causes of fire.

  • The x-axis shows the different values of true_cause (fire ignition source categories).

  • The y-axis shows the marginal effect of wind speed on the predicted probability (i.e., on the response scale) of a fire becoming large.

The plot helps us compare the sensitivity of different fire causes to wind speed. For example, if one cause shows a higher marginal effect, it means that increasing wind speed has a stronger impact on the likelihood of large fires for that cause.

p <- plot_comparisons(
  model,
  variables = "wind_speed",
  by = "true_cause",
  type = "response"
)

p +
  scale_color_manual(values = rep("darkgreen", length(unique(p$data$by)))) +
  scale_fill_manual(values = rep("darkgreen", length(unique(p$data$by)))) +
  labs(
    title = "Effect of Wind Speed on Large Fire Probability by True Cause",
    x = "True Cause",
    y = "Marginal Effect of Wind Speed"
  ) +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))

The marginal effects analysis suggests that wind speed may have a relatively stronger impact on the probability of a fire becoming large when the ignition cause is related to Unknown, Insufficient Buffer, Incendiary Device, Hot Exhaust, or Line Impact. However, the overall changes in predicted probabilities are quite small, and the error bars across causes show substantial overlap. This indicates a high degree of uncertainty and suggests that these differences should be interpreted with caution. While the findings may not yet be strong enough to inform operational decisions, they do highlight areas for further exploration. In particular, when wind is forecasted in regions with a history of certain ignition causes, fire managers could use this exploratory insight to guide risk monitoring strategies, while recognizing that more robust evidence is needed before drawing firm conclusions.

Interaction Effects via Marginal Effects

We can also use marginal effects to explore interactions without explicitly adding them to the model, by conditioning on another variable.

This plot shows the marginal effect of relative_humidity on the predicted probability of a large fire, calculated separately for each level of fire_type, based on our fitted model.

plot_comparisons(model, variables = "relative_humidity", by = "fire_type")

When comparing the marginal effect of relative humidity across fire types, we found that three fire types showed almost no change in large fire probability as humidity increased. However, one fire type exhibited a noticeably stronger negative effect. This suggests that humidity plays a more meaningful role in suppressing fire growth for certain fire types, likely those that are more sensitive to moisture availability, such as crown fires, while having minimal effect on others.

Now that we’ve found the best hyperparameter settings, let’s evaluate a model that uses them. We’ll take a look at the f1 score on the full training set and the testing set using the best parameters and with “Large Fire” being the positive class.

print(f"Train score on the full training set: {gs.score(X_train, y_train):.4f}")
print(f"Test score on the testing set: {gs.score(X_test, y_test):.4f}")
Train score on the full training set: 0.1938
Test score on the testing set: 0.1777

This doesn’t seem great, but let’s compare these scores to a baseline model that only predicts the most common class in the training data (“Not a Large Fire”).

baseline_model = DummyClassifier()
baseline_model.fit(X_train, y_train)
baseline_y_pred_train = baseline_model.predict(X_train)
baseline_y_pred_test = baseline_model.predict(X_test)

print(
    f"Baseline train score on the full training set: {f1_score(y_train, baseline_y_pred_train):.4f}"
)
print(
    f"Baseline test score on the testing set: {f1_score(y_test, baseline_y_pred_test, pos_label=1):.4f}"
)
Baseline train score on the full training set: 0.0000
Baseline test score on the testing set: 0.0000

As we can see, the baseline model achieves an f1 score of 0 since it does not make any positive class (“Large Fire”) predictions.
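We can verify by hand why the f1 score is exactly zero: with no positive predictions there are no true positives, so precision and recall are both 0, and f1 is defined as 0 in that case. A toy sketch with hypothetical labels:

```python
# Toy imbalanced labels: 1 = "Large Fire", 0 = "Not a Large Fire" (hypothetical)
y_true = [0] * 18 + [1] * 2
y_pred = [0] * 20          # a majority-class baseline never predicts 1

tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
fp = sum(t == 0 and p == 1 for t, p in zip(y_true, y_pred))
fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))

precision = tp / (tp + fp) if tp + fp else 0.0
recall = tp / (tp + fn) if tp + fn else 0.0
f1 = (2 * precision * recall / (precision + recall)
      if precision + recall else 0.0)
print(f1)  # 0.0
```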

As you’ve likely noticed, we did not include true_cause as a predictor in our logistic regression model because it contains many categories, some of which have very few observations. This can lead to large standard errors, unstable coefficient estimates, and a higher risk of overfitting. But let’s incorporate true_cause into the model to see its effect on model performance.
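The concern about sparse categories is easy to see with one-hot encoding: each level becomes its own column, and a level with only a handful of observations yields a column that is almost all zeros, so its coefficient is estimated from very little data. A toy sketch (the category counts below are hypothetical, not the real true_cause distribution):

```python
from collections import Counter

# Hypothetical ignition-cause column with a couple of rare levels
causes = (["lightning"] * 500 + ["campfire"] * 80
          + ["hot_exhaust"] * 3 + ["line_impact"] * 2)

counts = Counter(causes)
n_onehot_columns = len(counts)                      # one column per level
rare = {c: n for c, n in counts.items() if n < 10}  # thinly supported levels

print(n_onehot_columns)  # columns added to the design matrix
print(rare)              # coefficients for these rest on < 10 rows each
```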

X_tc = wildfire[[
  "temperature",
  "wind_speed",
  "relative_humidity",
  "fire_spread_rate",
  "fire_type",
  "fuel_type",
  "ia_access",
  "true_cause",
]]
X_train_tc, X_test_tc, y_train, y_test = train_test_split(
    X_tc, y, test_size=0.3, random_state=123
)

categorical_feats_tc = categorical_feats + ["true_cause"]

preprocessor_tc = make_column_transformer(
    (numeric_preprocessor, numeric_feats),
    (categorical_preprocessor, categorical_feats_tc),
)

pipeline_tc = make_pipeline(
    preprocessor_tc, LogisticRegression(random_state=123, n_jobs=-1)
)

gs_tc = GridSearchCV(
    pipeline_tc, param_grid=param_grid, scoring="f1", n_jobs=-1, return_train_score=True
)

gs_tc.fit(X_train_tc, y_train)
cv_results_tc = pd.DataFrame(gs_tc.cv_results_)[
    [
        "mean_train_score",
        "std_train_score",
        "mean_test_score",
        "std_test_score",
        "param_logisticregression__class_weight", 
        "param_logisticregression__C",
        "mean_fit_time",
        "rank_test_score",
    ]
].set_index("rank_test_score").sort_index()

Markdown(cv_results_tc.to_markdown())
| rank_test_score | mean_train_score | std_train_score | mean_test_score | std_test_score | param_logisticregression__class_weight | param_logisticregression__C | mean_fit_time |
|---|---|---|---|---|---|---|---|
| 1 | 0.196485 | 0.0058278 | 0.195341 | 0.00693425 | balanced | 0.001 | 0.0521213 |
| 2 | 0.1924 | 0.00467025 | 0.187695 | 0.00898124 | balanced | 0.01 | 0.0655165 |
| 3 | 0.189944 | 0.00449934 | 0.181763 | 0.00690405 | balanced | 0.1 | 0.0828915 |
| 4 | 0.188844 | 0.00435476 | 0.180443 | 0.00739874 | balanced | 10 | 0.0891777 |
| 5 | 0.189138 | 0.00460983 | 0.180185 | 0.00724185 | balanced | 1 | 0.0874284 |
print(f"Train score with the original model: {gs.score(X_train, y_train):.4f}")
print(f"Test score with the original model: {gs.score(X_test, y_test):.4f} \n")

print(f"Train score with true_cause as a feature: {gs_tc.score(X_train_tc, y_train):.4f}")
print(f"Test score with true_cause as a feature: {gs_tc.score(X_test_tc, y_test):.4f}")
Train score with the original model: 0.1938
Test score with the original model: 0.1777 
Train score with true_cause as a feature: 0.1873
Test score with true_cause as a feature: 0.1710

It looks like our model performs a bit better without using true_cause, as we suspected.

Let’s take a closer look at how the model performs on the train set and test set with the best hyperparameters (without true_cause and with class_weight='balanced' and C=0.001).

  • On the train set:
  • On the test set:
cm = ConfusionMatrixDisplay.from_estimator(
    gs,
    X_train,
    y_train,
    display_labels=["Not a Large Fire", "Large Fire"],
)

From the above confusion matrix, we can observe that:

  • The “Not a Large Fire” class has the largest number of samples and correct predictions.
  • There are many incorrect “Large Fire” predictions.
  • There are relatively few “Large Fire” samples.
  • The predictions for the true “Large Fire” class are mostly correct.
y_pred_train = gs.predict(X_train)
print(
    classification_report(
        y_train,
        y_pred_train,
        target_names=["Not a Large Fire", "Large Fire"],
    )
)
                  precision    recall  f1-score   support

Not a Large Fire       1.00      0.88      0.93     18240
      Large Fire       0.11      0.79      0.19       345

        accuracy                           0.88     18585
       macro avg       0.55      0.84      0.56     18585
    weighted avg       0.98      0.88      0.92     18585

From the above metrics, we can see that accuracy is high on the train set (0.88), but accuracy can be misleading with imbalanced data.
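To see just how misleading accuracy can be here: with this class balance, a model that always predicts “Not a Large Fire” would score higher accuracy than our tuned model. A quick check using the support counts from the classification report above:

```python
# Support counts from the classification report above
n_not_large = 18240
n_large = 345
n_total = n_not_large + n_large

# Accuracy of a classifier that always predicts the majority class
majority_accuracy = n_not_large / n_total
print(round(majority_accuracy, 3))  # 0.981, well above the model's 0.88
```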

In general, scores for the dominant class (“Not a Large Fire”) are high. Meanwhile, the precision (0.11) and consequently the f1 score (0.19) for the minority class (“Large Fire”) are low. However, the recall is decent (0.79). This means that the model is able to predict most of the actual “Large Fire” incidents, but at the cost of many false positives.

cm = ConfusionMatrixDisplay.from_estimator(
    gs,
    X_test,
    y_test,
    display_labels=["Not a Large Fire", "Large Fire"],
)

From the above confusion matrix, we can observe that the model performs similarly on the test set as on the train set:

  • The “Not a Large Fire” class has the largest number of samples and correct predictions.
  • There are many incorrect “Large Fire” predictions.
  • There are relatively few “Large Fire” samples.
  • The predictions for the true “Large Fire” class are mostly correct.
y_pred = gs.predict(X_test)
print(
    classification_report(
        y_test,
        y_pred,
        target_names=["Not a Large Fire", "Large Fire"],
    )
)
                  precision    recall  f1-score   support

Not a Large Fire       1.00      0.88      0.93      7826
      Large Fire       0.10      0.76      0.18       140

        accuracy                           0.88      7966
       macro avg       0.55      0.82      0.56      7966
    weighted avg       0.98      0.88      0.92      7966

From the above metrics, we can see that accuracy is high on the test set (0.88), but accuracy can be misleading with imbalanced data.

In general, scores for the dominant class (“Not a Large Fire”) are again high. Meanwhile, the precision (0.10) and recall (0.76) are slightly lower than on the train set (0.11 and 0.79, respectively), and consequently the f1 score (0.18) for the minority class (“Large Fire”) is also lower. The recall is still decent (0.76), which again means that the model catches most of the actual “Large Fire” incidents, but at the cost of many false positives.
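The f1 score is simply the harmonic mean of precision and recall, which explains why the low precision drags it down despite the decent recall. Recomputing from the rounded test-set values above:

```python
precision = 0.10
recall = 0.76

# f1 is the harmonic mean: the low precision dominates despite high recall
f1 = 2 * precision * recall / (precision + recall)
print(round(f1, 2))  # 0.18, matching the classification report
```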

Overall, the model performance isn’t great, so let’s try a random forest model.

Random Forest Model

We’ll first find the optimal hyperparameter settings via cross-validation, again prioritizing the f1 score over accuracy since the target classes are imbalanced. Because the hyperparameter space is large, we’ll use a randomized hyperparameter search instead of a grid search for efficiency.
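The efficiency argument is easy to quantify: the full grid implied by these ranges contains far more combinations than the 100 configurations a randomized search will sample. A quick count using the same ranges as param_dist (note that randint(a, b) samples integers from a to b − 1):

```python
# Sizes of the discrete ranges in param_dist (randint(a, b) covers a..b-1)
n_max_depth = len(range(1, 20))          # 19 values
n_min_samples_split = len(range(2, 20))  # 18 values
n_min_samples_leaf = len(range(1, 20))   # 19 values
n_class_weight = 2                       # [None, "balanced"]

full_grid = (n_max_depth * n_min_samples_split
             * n_min_samples_leaf * n_class_weight)
print(full_grid)        # 12996 combinations in the full grid
print(100 / full_grid)  # n_iter=100 tries under 1% of them
```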

param_dist = {
    "randomforestclassifier__max_depth": randint(1, 20),
    "randomforestclassifier__min_samples_split": randint(2, 20),
    "randomforestclassifier__min_samples_leaf": randint(1, 20),
    "randomforestclassifier__class_weight": [None, "balanced"],
}
pipeline = make_pipeline(preprocessor, RandomForestClassifier(random_state=123))
random_search = RandomizedSearchCV(
    pipeline,
    param_distributions=param_dist,
    n_iter=100,
    cv=5,
    random_state=123,
    scoring="f1",
    return_train_score=True,
)
random_search.fit(X_train, y_train)
rs_results = pd.DataFrame(random_search.cv_results_)[
    [
        "mean_train_score",
        "std_train_score",
        "mean_test_score",
        "std_test_score",
        "param_randomforestclassifier__max_depth", 
        "param_randomforestclassifier__min_samples_split",
        "param_randomforestclassifier__min_samples_leaf",
        "param_randomforestclassifier__class_weight",
        "mean_fit_time",
        "rank_test_score",
    ]
].set_index("rank_test_score").sort_index().head(5)

Markdown(rs_results.to_markdown())
| rank_test_score | mean_train_score | std_train_score | mean_test_score | std_test_score | param_randomforestclassifier__max_depth | param_randomforestclassifier__min_samples_split | param_randomforestclassifier__min_samples_leaf | param_randomforestclassifier__class_weight | mean_fit_time |
|---|---|---|---|---|---|---|---|---|---|
| 1 | 0.544729 | 0.0183762 | 0.30126 | 0.0200132 | 17 | 4 | 4 | balanced | 1.26537 |
| 2 | 0.487543 | 0.0167833 | 0.289423 | 0.0196836 | 17 | 6 | 5 | balanced | 1.23495 |
| 3 | 0.463022 | 0.0125878 | 0.288967 | 0.0138957 | 19 | 4 | 6 | balanced | 1.23164 |
| 4 | 0.471739 | 0.0160131 | 0.287304 | 0.0168119 | 15 | 2 | 5 | balanced | 1.22173 |
| 5 | 0.47545 | 0.0199179 | 0.287131 | 0.0242556 | 15 | 19 | 3 | balanced | 1.23574 |

From the above, we can see that the hyperparameter settings of max_depth=17, min_samples_split=4, min_samples_leaf=4, and class_weight='balanced' achieve the best mean validation f1 score (0.30) with little variation (standard deviation = 0.020). This is an improvement over the logistic regression model, but let’s take a closer look at how this random forest model performs on the full training and testing sets.

  • On the train set:
  • On the test set:
cm = ConfusionMatrixDisplay.from_estimator(
    random_search,
    X_train,
    y_train,
    display_labels=["Not a Large Fire", "Large Fire"],
)

From the above confusion matrix, we can observe that:

  • The “Not a Large Fire” class has the largest number of samples and correct predictions. There are more correct predictions for this class than with the logistic regression model.
  • There are quite a few incorrect “Large Fire” predictions, but fewer than with the logistic regression model.
  • There are relatively few “Large Fire” samples.
  • The predictions for the true “Large Fire” class are mostly correct. There are more correct predictions for this class than with the logistic regression model.
y_pred_train_rf = random_search.predict(X_train)
print(
    classification_report(
        y_train,
        y_pred_train_rf,
        target_names=["Not a Large Fire", "Large Fire"],
    )
)
                  precision    recall  f1-score   support

Not a Large Fire       1.00      0.97      0.98     18240
      Large Fire       0.36      0.99      0.52       345

        accuracy                           0.97     18585
       macro avg       0.68      0.98      0.75     18585
    weighted avg       0.99      0.97      0.97     18585

From the above metrics, we can see that accuracy is very high on the train set (0.97), but accuracy can be misleading with imbalanced data.

In general, scores for the dominant class (“Not a Large Fire”) are very high. Meanwhile, the precision (0.36) and consequently the f1 score (0.52) for the minority class (“Large Fire”) are lower. However, the recall is very high (0.99). This means that the model is able to predict most of the actual “Large Fire” incidents, but at the cost of false positives.

These scores are better compared to the logistic regression model (accuracy = 0.88; precision = 0.11; f1 score = 0.19; recall = 0.79)!

cm = ConfusionMatrixDisplay.from_estimator(
    random_search,
    X_test,
    y_test,
    display_labels=["Not a Large Fire", "Large Fire"],
)

From the above confusion matrix, we can observe that the model performs similarly on the test set as on the train set:

  • The “Not a Large Fire” class has the largest number of samples and correct predictions. There are more correct predictions for this class than with the logistic regression model.
  • There are quite a few incorrect “Large Fire” predictions, but fewer than with the logistic regression model.
  • There are relatively few “Large Fire” samples.
  • Slightly more than half of the predictions for the true “Large Fire” class are correct. There are fewer correct predictions for this class than with the logistic regression model.
y_pred_rf = random_search.predict(X_test)
print(
    classification_report(
        y_test,
        y_pred_rf,
        target_names=["Not a Large Fire", "Large Fire"],
    )
)
                  precision    recall  f1-score   support

Not a Large Fire       0.99      0.96      0.98      7826
      Large Fire       0.21      0.56      0.30       140

        accuracy                           0.95      7966
       macro avg       0.60      0.76      0.64      7966
    weighted avg       0.98      0.95      0.96      7966

From the above metrics, we can see that accuracy is high on the test set (0.95), but accuracy can be misleading with imbalanced data.

In general, scores for the dominant class (“Not a Large Fire”) are again high. Meanwhile, the precision (0.21) and recall (0.56) are lower than on the train set (0.36 and 0.99, respectively), and consequently the f1 score (0.30) for the minority class (“Large Fire”) is lower. This indicates some overfitting on the training set.

The random forest classifier performs much better than the logistic regression model in terms of testing f1 score (logistic regression: accuracy = 0.88; precision = 0.10; recall = 0.76; f1 score = 0.18). There are far fewer false positives and more true negatives, but also fewer true positives and more false negatives.

If the goal is safety and preparedness, the logistic regression model’s performance (testing recall = 0.76; testing f1 score = 0.18) may be preferable. It might even be better to perform the hyperparameter search using recall or an f-beta score weighted toward recall instead of the f1 score.

If the goal is efficient use of resources, the random forest model’s performance (testing precision = 0.21; testing f1 score = 0.30) is a bit better than the logistic regression model’s, but still not desirable due to many false positives. It would be better to perform the hyperparameter search using precision or an f-beta score weighted toward precision instead of the f1 score.
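The f-beta score makes this trade-off explicit: beta > 1 weights recall more heavily, while beta < 1 favours precision. A minimal sketch using the rounded logistic regression test metrics above:

```python
def fbeta(precision, recall, beta):
    """Weighted harmonic mean; beta > 1 favours recall, beta < 1 precision."""
    b2 = beta ** 2
    return (1 + b2) * precision * recall / (b2 * precision + recall)

# Logistic regression test metrics (rounded) from above
p_lr, r_lr = 0.10, 0.76

print(round(fbeta(p_lr, r_lr, beta=2.0), 2))  # ≈ 0.33, recall-weighted
print(round(fbeta(p_lr, r_lr, beta=0.5), 2))  # ≈ 0.12, precision-weighted
```

Optimizing a recall-weighted f2 would reward models like the logistic regression here, while a precision-weighted f0.5 would penalize its many false positives.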

Overall, it’s important to keep in mind the purpose and goal of the model before any analysis and evaluation.

Discussion

  • R
  • Python

This analysis aimed to uncover the key environmental and operational factors associated with the development of large wildfires in Alberta. Using logistic regression and marginal effects, we modeled the probability of a wildfire exceeding 200 hectares, which is a meaningful operational threshold for fire managers.

We found that higher temperature, stronger wind speed, and faster spread rates significantly increase the likelihood of a large wildfire. Conversely, higher relative humidity consistently reduced fire size probability, highlighting its protective role in fire suppression.

In addition to main effects, we explored interaction patterns using marginal effects. While we did not include true_cause directly in the model due to its high cardinality and sparse categories, we used it in conjunction with wind_speed to explore how ignition causes might modulate wind’s influence on fire growth. The marginal effect plots suggest that certain causes, such as incendiary devices, line impacts, or insufficient buffer zones, are more sensitive to wind speed, potentially due to the nature of how these fires spread under wind-driven conditions.

Similarly, by plotting the effect of relative_humidity across different fire_type categories, we observed that humidity suppresses large fire development more strongly in certain fire types, particularly crown fires. This implies that moisture-based mitigation strategies may be more effective in forests prone to crown fire behavior, while having less impact on surface or ground fires.

While some observed differences in marginal effects were small in magnitude and had overlapping confidence intervals, these nuanced insights still provide value. They highlight where future research or localized fire management protocols could focus, especially under climate scenarios with increasing temperature and wind extremes.

Overall, this case study illustrates how statistical modeling paired with domain-informed exploratory tools like marginal effects can support risk-informed, data-driven wildfire management.

This analysis demonstrates how to tune a model’s hyperparameters using grid search and cross-validation. More importantly, it highlights the critical role of feature selection and metric choice.

The features included in the model were guided by domain expertise documented in the literature for their relevance to fire behavior, and a particularly sparse feature was excluded to avoid introducing noise. This approach could be extended by using machine learning–based feature selection techniques, such as recursive feature elimination or regularization, to further refine the chosen predictors.

The performance of the model depends on its intended purpose, particularly in the presence of class imbalance, and the chosen evaluation metric should align with the model’s goal. Metrics like precision, recall, and f1 score provide a more informative and holistic evaluation compared to accuracy alone. If the goal is to maximize detection of large fires, recall should be considered more heavily. If the goal is to minimize false alarms, precision should be prioritized.

The logistic regression model performs well for the majority class and achieves relatively high recall for the minority class (large fires), showing some promise. However, its low precision for large fires indicates many false positives. Despite achieving a higher f1 score than the logistic regression model, the random forest model also has low precision. Since this may be due to class imbalance in the data, future iterations could explore resampling strategies such as SMOTE, or decision threshold adjustment, to improve minority class performance and overall balance.
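Decision threshold adjustment is straightforward to sketch: instead of using the default 0.5 cutoff implied by predict(), we score the predicted probabilities against a chosen cutoff. A toy example with hypothetical probabilities (in practice these would come from the pipeline's predict_proba):

```python
# Hypothetical predicted probabilities for the positive class ("Large Fire")
probs = [0.05, 0.20, 0.35, 0.45, 0.60, 0.80]
y_true = [0, 0, 1, 0, 1, 1]

def predict_at(probs, threshold):
    """Label as positive when the predicted probability meets the cutoff."""
    return [int(p >= threshold) for p in probs]

def recall(y_true, y_pred):
    tp = sum(t and p for t, p in zip(y_true, y_pred))
    fn = sum(t and not p for t, p in zip(y_true, y_pred))
    return tp / (tp + fn)

# Lowering the threshold flags more fires, raising recall
# (at the expense of precision)
print(recall(y_true, predict_at(probs, 0.5)))  # 2 of 3 large fires caught
print(recall(y_true, predict_at(probs, 0.3)))  # all 3 caught
```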

Attribution

Data sourced from the Government of Alberta via the Government of Canada’s Open Government Portal, available under an Open Government Licence - Alberta. Original data set: Historical wildfire data: 2006-2024.


This page is built with Quarto.