Predicting Notable Wildfire Incidence in the USA

A county-month panel model predicting large wildfire incidence across the US using logistic regression and random forest.

March 01, 2025 projects R Machine Learning Random Forest Logistic Regression Statistics

Overview

This project develops a predictive model for the monthly incidence of notable wildfires across U.S. counties from 1992 to 2015. Using a county-month panel dataset constructed from historical wildfire records, land cover data, and monthly climate measurements, we define a binary outcome indicating whether at least one wildfire of 300 or more acres occurred in a given county and month.

The Data

The modeling dataset is constructed by joining three sources:

Historical wildfire records — fire occurrence and size by county and date
Land cover data — NLCD land cover fractions per county
Monthly climate measurements — temperature, drought, and precipitation by county-month
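
As a rough illustration of how the three sources might be combined, here is a minimal sketch in base R (the project itself uses dplyr; base R just keeps the example dependency-free). The data frames, column names (`fips`, `frac_forest`, `temp_mean`, and so on) and values are hypothetical toy data, not the project's actual schema.

```r
# Toy inputs; all column names and values here are hypothetical.
fire_panel <- data.frame(            # one row per county-month
  fips = c("06037", "06037", "16001"),
  year = 2003, month = c(7, 8, 7),
  notable_fire = c(1L, 0L, 0L)
)
land_cover <- data.frame(            # one row per county
  fips = "06037",
  frac_forest = 0.22, frac_shrub = 0.41
)
climate <- data.frame(               # one row per county-month
  fips = c("06037", "06037", "16001"),
  year = 2003, month = c(7, 8, 7),
  temp_mean = c(24.1, 24.8, 23.0), precip = c(0.1, 0.0, 8.2)
)

# Left joins keep every county-month in the panel even when a source has no match.
model_df <- merge(fire_panel, land_cover, by = "fips", all.x = TRUE)
model_df <- merge(model_df, climate, by = c("fips", "year", "month"), all.x = TRUE)
```

County 16001 has no land cover row in this toy example, so its land cover columns come back as `NA` after the join and would need to be handled during preprocessing.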

A notable fire is defined as any fire with a final burned area of ≥ 300 acres, following operational wildfire reporting standards used by U.S. fire management agencies. This threshold filters out the vast majority of small, quickly contained fires.
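
The binary outcome can be derived from fire-level records roughly as follows. This is a sketch under assumptions: the column names (`fips`, `discovery_date`, `fire_size_acres`) and the toy records are hypothetical, and the month is taken from a discovery date, which may differ from how the project assigns fires to months.

```r
# Toy fire records; column names and values are hypothetical.
fires <- data.frame(
  fips = c("06037", "06037", "06037", "16001", "16001"),
  discovery_date = as.Date(c("2003-07-04", "2003-07-19", "2003-08-02",
                             "2003-07-11", "2003-09-30")),
  fire_size_acres = c(12, 450, 80, 299, 1200)
)

# Full county-month grid so that months without any fire get outcome 0.
panel <- expand.grid(
  fips = unique(fires$fips),
  year = 2003,
  month = 7:9,
  stringsAsFactors = FALSE
)

# Keep only notable fires (>= 300 acres) and reduce them to distinct county-months.
notable <- fires[fires$fire_size_acres >= 300, ]
notable_keys <- unique(paste(
  notable$fips,
  as.integer(format(notable$discovery_date, "%Y")),
  as.integer(format(notable$discovery_date, "%m"))
))

# Binary outcome: 1 if at least one notable fire occurred in that county-month.
panel$notable_fire <- as.integer(
  paste(panel$fips, panel$year, panel$month) %in% notable_keys
)
```

Building the full grid first matters: county-months with no recorded fire never appear in the fire records, so they would otherwise be missing from the panel rather than coded as 0.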

How I Built It

Data Preprocessing

The panel is built at the county-month level. Missing land cover fractions (counties with no NLCD record) are filled with 0. Missing climate variables are replaced with the county's median for the same calendar month, computed from previous years only so that no future information leaks into the imputed values. Any rows still missing key predictors after imputation are dropped.
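
One way to implement this leakage-safe imputation is sketched below in base R. It reads "past years" as years strictly before the row's own year, which is an assumption about the original procedure. The column names (`fips`, `year`, `month`, `temp`) and the toy values are hypothetical.

```r
# Toy climate panel; column names and values are hypothetical.
climate <- data.frame(
  fips  = c(rep("06037", 8), "16001"),
  year  = c(rep(2000:2003, each = 2), 2000),
  month = c(rep(c(7, 8), times = 4), 7),
  temp  = c(30, 28, 32, 29, NA, 27, NA, NA, NA)
)

# Median of the same county and calendar month over strictly earlier years.
# Only originally observed values feed the median; returns NA if none exist.
past_median <- function(i, df, var) {
  prior <- df[[var]][df$fips == df$fips[i] &
                     df$month == df$month[i] &
                     df$year < df$year[i]]
  prior <- prior[!is.na(prior)]
  if (length(prior) == 0) NA_real_ else median(prior)
}

missing_idx <- which(is.na(climate$temp))
fill <- vapply(missing_idx, past_median, numeric(1), df = climate, var = "temp")
climate$temp_imputed <- climate$temp
climate$temp_imputed[missing_idx] <- fill

# Rows still missing after imputation (no earlier history) are dropped.
climate_complete <- climate[!is.na(climate$temp_imputed), ]
```

The median is always taken over the original `temp` column, so imputed values never feed into later imputations. The single row for county 16001 has no earlier history, so it stays missing and is dropped.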

Modeling

Two models are fit:
- Logistic Regression — baseline model
- Random Forest — second model
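
Models are evaluated using time-aware expanding-window cross-validation: each fold trains on all data up to a cutoff and tests on the period that follows, so no future observations leak into training. Performance is assessed via the area under the ROC curve (AUC).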
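
The evaluation scheme can be sketched as follows, using synthetic data and only the logistic regression baseline so the example runs in base R. The fold layout (one test year per fold), the predictor names (`drought`, `temp`) and the simulated coefficients are all assumptions for illustration, not the project's actual configuration. A random forest would slot into the same loop in place of `glm`.

```r
set.seed(42)

# Synthetic county-month panel; variable names are hypothetical.
n_per_year <- 300
panel <- data.frame(
  year    = rep(2008:2015, each = n_per_year),
  drought = rnorm(8 * n_per_year),
  temp    = rnorm(8 * n_per_year)
)
# Simulated outcome: fire probability rises with drought and temperature.
panel$notable_fire <- rbinom(
  nrow(panel), 1, plogis(-2 + 1.0 * panel$drought + 0.8 * panel$temp)
)

# AUC via the rank-sum (Mann-Whitney) identity; ties receive average ranks.
auc <- function(y, score) {
  r <- rank(score)
  n_pos <- sum(y == 1)
  n_neg <- sum(y == 0)
  (sum(r[y == 1]) - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)
}

# Expanding window: train on all years before the test year, test on that year.
test_years <- 2012:2015
fold_auc <- vapply(test_years, function(ty) {
  train <- panel[panel$year < ty, ]
  test  <- panel[panel$year == ty, ]
  fit <- glm(notable_fire ~ drought + temp, data = train, family = binomial())
  auc(test$notable_fire, predict(fit, newdata = test, type = "response"))
}, numeric(1))

names(fold_auc) <- test_years
print(round(fold_auc, 3))
print(mean(fold_auc))
```

Because the training window only ever grows forward in time, each fold mimics forecasting a future year from the data that would have been available at that point.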

Results

Exploratory analysis reveals strong seasonality, pronounced geographic clustering in the western United States, and clear associations between fire incidence and drought, temperature, and land cover type. The random forest modestly outperforms logistic regression, and both models perform meaningfully better than chance (an AUC of 0.5).

Full Report

View the full report here

Tools Used

R — core language
Quarto — report generation
tidyverse / dplyr — data wrangling
randomForest — random forest modeling
NLCD — land cover data source