# Pckgs -------------------------------------
library(fs) # Cross-Platform File System Operations Based on 'libuv'
library(tidyverse) # Easily Install and Load the 'Tidyverse'
library(janitor) # Simple Tools for Examining and Cleaning Dirty Data
library(skimr) # Compact and Flexible Summaries of Data
library(here) # A Simpler Way to Find Your Files
library(paint) # paint data.frames summaries in colour
library(readxl) # Read Excel Files
library(kableExtra) # Construct Complex Table with 'kable' and Pipe Syntax
library(glue) # Interpreted String Literals
# ML & Text Mining -------------------------------------
library(tidymodels) # Easily Install and Load the 'Tidymodels' Packages
library(textrecipes) # Extra 'Recipes' for Text Processing
library(tidytext) # Text Mining using 'dplyr', 'ggplot2', and Other Tidy Tools
library(SnowballC) # Snowball Stemmers Based on the C 'libstemmer' UTF-8 Library
library(rvest) # Easily Harvest (Scrape) Web Pages
library(cleanNLP) # A Tidy Data Model for Natural Language Processing
library(themis) # Extra Recipes for Dealing with Unbalanced Classes
library(discrim) # Model Wrappers for Discriminant Analysis
set.seed(123) # for reproducibility
WB Project PDO features classification
WORK IN PROGRESS! (Please expect unfinished sections and unpolished code. Feedback is welcome!)
Set up
# 1) --- Set the font as the default for ggplot2
# Who else? https://datavizf24.classes.andrewheiss.com/example/05-example.html
lulas_theme <- theme_minimal(base_size = 12) +
theme(panel.grid.minor = element_blank(),
# Bold, bigger title
plot.title = element_text(face = "bold", size = rel(1.6)),
# Plain, slightly bigger subtitle that is grey
plot.subtitle = element_text(face = "plain", size = rel(1.4), color = "#A6A6A6"),
# Italic, smaller, grey caption that is left-aligned
plot.caption = element_text(face = "italic", size = rel(0.7),
color = "#A6A6A6", hjust = 0),
# Bold legend titles
legend.title = element_text(face = "bold"),
# Bold, slightly larger facet titles that are left-aligned for the sake of repetition
strip.text = element_text(face = "bold", size = rel(1.1), hjust = 0),
# Bold axis titles
axis.title = element_text(face = "bold"),
# Change X-axis label size
axis.text.x = element_text(size = rel(1.4)),
# Change Y-axis label size
axis.text.y = element_text(size = 14),
# Add some space above the x-axis title and make it left-aligned
axis.title.x = element_text(margin = margin(t = 10), hjust = 0),
# Add some space to the right of the y-axis title and make it top-aligned
axis.title.y = element_text(margin = margin(r = 10), hjust = 1),
# Add a light grey background to the facet titles, with no borders
strip.background = element_rect(fill = "grey90", color = NA),
# Add a thin grey border around all the plots to tie in the facet titles
panel.border = element_rect(color = "grey90", fill = NA))
# 2) --- use
# ggplot + lulas_theme
Load data
__________
PREDICT MISSING FEATURES
__________
What ML models work with text?
Remember that text data is SPARSE!
To predict a missing feature (e.g., sector) based on available features from text data, several supervised machine learning algorithms can be applied. Given that we have a mixture of text and structured data, here are some suitable algorithms:
- Logistic Regression / Multinomial Logistic Regression: If you’re predicting a categorical variable like “sector”, logistic regression can work well, especially with appropriate feature engineering for text (e.g., converting text data into numeric features using TF-IDF or word embeddings).
- Lasso regression/classification: learns how much of a penalty to put on some features (sometimes penalizing all the way down to zero) so that we can select only some features out of the high-dimensional space of original possible variables (tokens) for the final model.
- k-Nearest Neighbors (k-NN): k-NN can be useful for text data, especially when you have a mix of structured and unstructured data. It’s a simple algorithm that can work well with text data, but it can be computationally expensive.
- Decision Trees / Random Forests: These algorithms handle both numeric and categorical data efficiently and can manage missing values quite well. You can input text-based features as well, though you might need to preprocess the text into numeric form (e.g., using embeddings).
- Naive Bayes: Naive Bayes is a simple and efficient algorithm for text classification. It assumes feature independence, which may not always hold, but it’s often a good baseline, particularly with short texts.
- Support Vector Machines (SVMs): SVMs are useful when you have high-dimensional data, which is common with text after feature extraction (like TF-IDF). They can perform well with a mix of structured and unstructured data.
All available models are listed at parsnip.
Some model parameters can be learned from the data during fitting/training. Some CANNOT 😱: these are the hyperparameters of a model, and we estimate them by training lots of models with different hyperparameter values and comparing them.
— Check missing feature
sum(!is.na(projs_train$pdo))
sum(!is.na(projs_train$sector1)) / 4403 # 99%
sum(!is.na(projs_train$regionname)) / 4403 # 100%
sum(!is.na(projs_train$countryname)) / 4403 # 100%
sum(!is.na(projs_train$status)) / 4403 # 100%
sum(!is.na(projs_train$lendinginstr)) / 4403 # 99%
sum(!is.na(projs_train$ESrisk)) / 4403 # 9.6%
sum(!is.na(projs_train$curr_total_commitment)) / 4403 # 100%
sum(!is.na(projs_train$theme1)) /4403 # 71%
table(projs_train$theme1, useNA = "ifany") # 71 levels
sum(!is.na(projs_train$env_cat)) / 4403 # 73%
table(projs_train$env_cat , useNA = "ifany") # 7 levels
# source function
source(here("R","f_recap_values.R"))
# check candidate labels for classification
f_recap_values(projs_train, c("sector1", "theme1","env_cat","ESrisk" ) ) %>% kable()
| skim_variable | total_rows | n_distinct | n_missing | missing_perc |
|---|---|---|---|---|
| ESrisk | 4403 | 5 | 3980 | 90.4% |
| theme1 | 4403 | 73 | 1254 | 28.5% |
| env_cat | 4403 | 8 | 1167 | 26.5% |
| sector1 | 4403 | 71 | 17 | 0.4% |
— Identify features for classification
These could be:
- features derived from raw text (e.g. characters, words, ngrams, etc.),
- feature vectors (e.g. word embeddings), or
- meta-linguistic features (e.g. part-of-speech tags, syntactic parses, or semantic features)
How do we use them?
- Do we use raw token counts?
- Do we use normalized frequencies?
- Do we use some type of weighting scheme? ✅ yes, we use tf-idf (a weighting scheme which downweights words that are common across all documents and upweights words that are unique to a document)
- Do we use some type of dimensionality reduction? ✅

# TEXT CLASSIFICATION for Environmental Assessment Category
____________________________________________________________
———————- BINARY OUTCOME ——————-
____________________________________________________________
One candidate variable that could be predicted is the env_cat variable (“Environmental Assessment Category”). This is a categorical variable with 7 levels (A, B, C, F, H, M, U), but, to simplify, I converted it into a binary outcome defined as in Table 1.
#__________ # COMMON STEPS
1) Select and engineer the features [using projs_train]
Recode env_cat variable
projs_train <- projs_train %>%
# useful for later
rename(., proj_id = "id") %>%
# recode as factors
mutate (across (c(status, boardapprovalFY, regionname, countryname, sector1,
theme1, lendinginstr), as.factor)) %>%
# env risk category 7 levels
mutate(env_cat_f = fct_na_value_to_level(env_cat, level = "Missing")) %>%
mutate(env_cat_f = fct_recode(env_cat_f,
"A_high risk" = "A",
"B_med risk" = "B",
"C_low risk" = "C",
"F_fin expos" = "F",
"Other" = "H",
"Other" = "M",
"Other" = "U",
"Missing" = "Missing" )) %>%
relocate(env_cat_f , .after = env_cat) %>%
# collapse env_cat_f into 2 levels
mutate(env_cat_f2 = fct_collapse(env_cat_f,
"High-Med-risk" = c("A_high risk", "B_med risk"),
"Low-risk_Othr" = c("C_low risk", "F_fin expos", "Other", "Missing")
)) %>%
relocate(env_cat_f2, .after = env_cat_f)
tabyl(projs_train, env_cat, show_na = TRUE) # 7 levels
tabyl(projs_train, env_cat_f, show_na = TRUE) # 6 levels
tabyl(projs_train, env_cat_f2, show_na = TRUE) # 2 levels
# Show as contingency table env_cat_f and env_cat_f2
env_cat_f_f2_k <- table(projs_train$env_cat_f, projs_train$env_cat_f2) %>%
kable() %>%
kable_styling("striped", full_width = F)
env_cat_f_f2_k
saveRDS(env_cat_f_f2_k, here("analysis", "output", "tables", "env_cat_f_f2_k.rds"))
|  | High-Med-risk | Low-risk_Othr |
|---|---|---|
| A_high risk | 351 | 0 |
| B_med risk | 1830 | 0 |
| C_low risk | 0 | 880 |
| F_fin expos | 0 | 127 |
| Other | 0 | 48 |
| Missing | 0 | 1167 |
Recode sector1 variable
(this I use later as the MULTI-CLASS outcome)
# !!!!! `sector_f` is different from `tok_sector_broad` BECAUSE it is based on `sector1` !!!!
projs_train <- projs_train %>% # sector1 99 levels
mutate(sector_f = case_when(
str_detect(sector1, regex("water|wastewater|sanitat|Sewer|Irrigat|Drainag", ignore_case = T)) ~ "WAT & SAN",
str_detect(sector1, regex("transport|railway|road|airport|waterway|bus|metropolitan|inter-urban|aviation", ignore_case = T)) ~ "TRANSPORT",
sector1 == "port" ~ "TRANSPORT",
str_detect(sector1, regex("urban|housing|inter-urban|peri-urban|waste", ignore_case = T)) ~ "URBAN",
str_detect(sector1, regex("energ|electri|hydroele|hydropow|renewable|transmis", ignore_case = T)) ~ "ENERGY", # Matches either "energy" or "power"
str_detect(sector1, regex("health|hospital|medicine|drugs|epidem|pandem|covid-19|vaccin", ignore_case = T)) ~ "HEALTH",
str_detect(sector1, regex("educat|school|vocat|teach|univers|student|literacy|training|curricul", ignore_case = T)) ~ "EDUCATION",
str_detect(sector1, regex("Agricultural|Agro|Fish|Forest|Crop|livestock|agri-business", ignore_case = T)) ~ "AGR FOR FISH",
str_detect(sector1, regex("Minin|oil|gas|mineral|Extract", ignore_case = T)) ~ "MINING OIL&GAS",
str_detect(sector1, regex("Social Protec", ignore_case = T)) ~ "SOCIAL PROT.",
str_detect(sector1, regex("Bank|finan|Investment", ignore_case = T)) ~ "FINANCIAL",
str_detect(sector1, regex("Information|Communication|ICT|Internet|Technologies", ignore_case = T)) ~ "ICT",
str_detect(sector1, regex("Tourism|Trade and Services|Manuf|Other Industry|Trade and Services", ignore_case = T)) ~ "IND TRADE SERV",
str_detect(sector1, regex("Government|Public Admin|Institution|Central Agenc|Sub-national Gov|law|justice|governance", ignore_case = T)) ~ "INSTIT. SUPP.",
TRUE ~ "Missing")) %>%
relocate(sector_f, .after = sector1)
tabyl(projs_train, sector1, show_na = TRUE) # 99 levels
tabyl(projs_train, sector_f, show_na = TRUE) # 7 levels
#tabyl(projs_train, tok_sector_broad, show_na = TRUE) # 7 levels
2) Split sample based on env_cat variable
— [1/2] Proportional sub-set dplyr::slice_sample
# # Calculate proportions of missing and non-missing values in env_cat
# proportion_missing <- projs_train %>%
# mutate(class_status = ifelse(is.na(env_cat), "Missing", "Not Missing")) %>%
# count(class_status) %>%
# mutate(prop = n / sum(n))
#
# # Ensure the number of samples adds up to 500
# set.seed(123) # For reproducibility
#
# # Define the number of samples for each group (rounding and adjusting for total = 500)
# n_samples <- round(500 * proportion_missing$prop)
# n_samples[1] <- 500 - sum(n_samples[-1]) # Adjust to ensure total is exactly 500
#
# # Split the data into missing and non-missing groups
# missing_group <- projs_train %>%
# filter(is.na(theme1)) %>%
# slice_sample(n = n_samples[proportion_missing$class_status == "Missing"])
#
# not_missing_group <- projs_train %>%
# filter(!is.na(theme1)) %>%
# slice_sample(n = n_samples[proportion_missing$class_status == "Not Missing"])
#
# # Combine the two groups
# projs_train_smpl <- bind_rows(missing_group, not_missing_group)
#
# # View the sample
# print(projs_train_smpl)
— [2/2] Proportional sub-set rsample::initial_split
Based on availability of env_cat_f.
We will use the strata argument to stratify the data by the outcome variable (env_cat_f). This will ensure that the training and validation sets have the same proportion of each outcome level.
# Create a stratified split based on missing vs non-missing env_cat
projs_train %>% tabyl(env_cat_f) # 6 levels
# Split BUT only "Not Missing" `env_cat_f`
## --- 0) THIS WILL BE 4 TRAINING & VALIDATION
env_cat_use <- projs_train %>%
filter(env_cat_f != "Missing") # 3236 proj
# SPLIT INTO TRAINING, VALIDATION
set.seed(123) # Ensure reproducibility
env_split <- initial_split(env_cat_use, prop = 0.7, # 70% training, 30% testing
strata = env_cat_f) # stratify by OUTCOME
## -- 1) for training (labelled `env_cat_f`)
env_cat_train <- training(env_split) # 2265 proj
## -- 2) for validation (labelled `env_cat_f`)
env_cat_test <- testing(env_split) # 971 proj
# # UNLABELLED PORTION
## -- 3) for actual test (UNlabelled `env_cat_f`)
env_cat_missing <- projs_train %>%
filter(env_cat_f == "Missing") # 1167 proj
# check distribution of `env_cat_f` in training and validation
tabyl(env_cat_train, env_cat_f) |> adorn_totals("row") |> adorn_pct_formatting(digits = 1)#
tabyl(env_cat_test, env_cat_f)|> adorn_totals("row") |> adorn_pct_formatting(digits = 1)#
rm( env_cat_use, env_cat_missing)
# # Rebalance the classes
# tabyl(env_cat_train, env_cat_f) |> adorn_totals("row") |> adorn_pct_formatting(digits = 1)
#
# ds_rec <- recipe(env_cat_f ~ ., data = env_cat_train) %>%
# step_downsample(env_cat_f , under_ratio = 3, trained = FALSE) %>%
# prep()
#
# training <- ds_rec %>%
# bake(new_data = NULL) %>%
# count(env_cat_f, name = "training")
# training
#
#
# env_cat_train %>% count(env_cat_f)
#__________
MODEL-SPECIFIC STEPS
0. LASSO classification - log w/ text with tf-idf
Lasso regression or classification learns how much of a penalty to put on some features (sometimes penalizing all the way down to zero) so that we can select only some features out of the high-dimensional space of original possible variables (tokens) for the final model.
3) Define preprocessing steps
Prep recipe [env_recipe]
{textrecipes} provides a number of step functions for pre-processing text data. These include functions to tokenize (e.g. step_tokenize()), remove stop words (e.g. step_stopwords()), and derive meta-features (e.g. step_lemma(), step_stem(), etc.). Furthermore, there are functions to engineer features in ways that are particularly relevant to text data, such as feature frequencies and weights (e.g. step_tf(), step_tfidf(), etc.) and token filtering (e.g. step_tokenfilter()).
- step_tokenize()
- step_tfidf() with smooth_idf = FALSE (no smoothing term is added, so terms that appear in many or all documents are downweighted more heavily than they would be with smoothing)
# add step to recipe
env_recipe <- recipe (formula = env_cat_f2 ~ pdo,
data = env_cat_train) %>%
step_tokenize(pdo) %>% # tokenize
step_tfidf(pdo, smooth_idf = FALSE) # tf-idf creates matrix of term frequencies weighted by inverse document frequency
# Review the recipe
env_recipe
# Run the recipe
env_recipe_bake <- env_recipe %>%
# chooses the parameters for the recipe based on the data
prep(training = NULL) %>%
# applies the recipe to the data
bake(new_data = NULL)
# preview the baked recipe -- TOO SPARSE
dim(env_recipe_bake)
#[1] 2264 7516
The resulting engineered-features data frame has 2264 observations and 7516 variables! I.e. for each PDO text, only a small subset of tokens will actually appear, so most of the cells are filled with zeros. This is what is known as a sparse matrix. Furthermore, the more features we have, the more likely these features will capture the idiosyncrasies of these particular texts, increasing the risk of overfitting the model.
Improve recipe [env_recipe]
- We can filter out features by a stopword list, or
- filter by frequency of occurrence
# -- Rebuild recipe with tokenfilter step
env_recipe <- recipe (formula = env_cat_f2 ~ pdo,
data = env_cat_train) %>%
# tokenize
step_tokenize(pdo) %>%
# filter by frequency of occurrence
step_tokenfilter(pdo, max_tokens = 100) %>%
# tf-idf creates matrix of weighted term frequencies
step_tfidf(pdo, smooth_idf = FALSE)
# -- Run the recipe
env_recipe_bake <- env_recipe %>%
# chooses the parameters for the recipe based on the data
prep(training = NULL) %>%
# applies the recipe to the data
bake(new_data = NULL)
# -- preview the baked recipe
dim(env_recipe_bake)
#[1] 2264 7516 --> #[1] 2264 101
# subset check
env_recipe_bake[1:5, 1:10 ]
4) Select classification model
Logistic Regr Model [env_spec]
Let’s start with a simple logistic regression model to see how well we can classify the texts in the training set with the features we have engineered. We will use the parsnip::logistic_reg() function to specify the logistic regression model. We then select the implementation engine (glmnet, regularized generalized linear models) and the mode of the model (classification).
— Model specification
# Create a model specification
env_spec <-
# lasso regularized model
parsnip::logistic_reg(
# non-negative number ~ the total amount of regularization
penalty = tune(), # 0 = no penalty, 1 = max
# number between zero and one (inclusive)
mixture = 1 # specifies a pure lasso model,
) %>%
set_mode("classification") %>%
set_engine("glmnet")
# Preview
env_spec
Create workflow [env_wf]
Build the resulting workflow (a container object that aggregates the information required to fit and predict from a model) by adding the recipe and the model to it.
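The chunk that builds env_wf is not shown above; a minimal sketch, assuming the env_recipe and env_spec objects defined earlier:

```r
# Sketch: combine the recipe and the model specification into a workflow
env_wf <- workflow() %>%
  add_recipe(env_recipe) %>%
  add_model(env_spec)

# Preview
env_wf
```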
— Evaluate so far
# set.seed(123)
# # 1. Define the resampling folds
# env_folds <- vfold_cv(env_cat_train)
#
# # 2. Define a tuning grid for the parameters
# grid <- grid_regular(penalty(), mixture(), levels = 10) # Adjust the levels as needed
#
# env_rs <- fit_resamples(
# env_wf, # Workflow
# resamples = env_folds, # Cross-validation folds
# grid = grid, # Tuning grid
# control = control_resamples(save_pred = TRUE)
# )
5) Tuning hyperparameters
Different algorithms have different parameters that can be adjusted and that affect the performance of the model. These hyperparameters (≠ parameters learned from the features) need tuning, which is typically done by fitting the model to the training set under different values and evaluating its performance.
— Penalty tuning [env_grid]
In logistic regression, the penalty hyperparameter is like a control that helps prevent the model from becoming too complex and overfitting to the training data. There are two common types of penalties:
L1 (Lasso): Encourages the model to use fewer features by making some of the coefficients exactly zero. This can simplify the model.
L2 (Ridge): Tries to keep all the coefficients small but not exactly zero, which can help stabilize the model and avoid overfitting.
- The logistic regression model using glmnet can be tuned to prevent overfitting by adjusting the penalty and mixture (combination of L1 and L2) hyperparameters.
- In our env_spec model, tune() was a placeholder for a range of values for the penalty hyperparameter.
- To tune the penalty hyperparameter, we use the grid_regular() function from {dials} to specify a grid of values to try.

The package dials contains infrastructure to create and manage values of tuning parameters for the tidymodels packages.
# Create a grid of values for the penalty hyperparameter (random set of 10 values)
env_grid <- dials::grid_regular(
penalty(), levels = 10)
# Preview
env_grid
# penalty values range from 1e-10 (essentially no penalty)
# ...
# up to 1 (the maximum of the default range)
— K-fold cross-validation (for optimal penalty value) [env_vfold, env_tune]
Now, to perform the tuning and choose an optimal value for penalty, we need to create a tuning workflow. We use the resampling strategy known as k-fold cross-validation (splitting env_cat_train into multiple training/testing sets) to arrive at the optimal value for the penalty hyperparameter.
The env_tune object contains the results of the tuning for each fold. We can see them by calling the collect_metrics() function on the env_tune object.
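The chunk that creates the folds and runs the grid search is not shown; a minimal sketch, assuming the env_wf workflow and the env_grid defined above:

```r
# Sketch: resample the training set and tune the penalty over the grid
set.seed(123)
# 10-fold cross-validation splits of env_cat_train
env_vfold <- rsample::vfold_cv(env_cat_train, v = 10)

env_tune <- tune::tune_grid(
  object = env_wf,          # workflow (recipe + tunable model)
  resamples = env_vfold,    # CV folds
  grid = env_grid,          # candidate penalty values
  control = control_grid(save_pred = TRUE)
)
```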
# Collect the results of the tuning
env_tune_metrics <- tune::collect_metrics(env_tune)
# visualize the results
autoplot(env_tune) +
labs(
title = "Lasso model performance across regularization penalties",
subtitle = "Performance metrics can be used to identify the best penalty"
)
# in roc_auc: many of the penalty values performed similarly, with a drop-off in performance at the higher values
The most common metrics for model performance in classification are accuracy and the area under the receiver operating characteristic curve (ROC-AUC). Accuracy is simply the proportion of correct predictions. The ROC-AUC provides a single score that summarizes how well the model can distinguish between classes: the closer to 1, the more discriminative power the model has.
Conveniently, the show_best() function from {tune} takes a tune_grid object and returns the best performing hyperparameter values.
# Show the best hyperparameter values
show_best(env_tune, metric = "roc_auc") # 0.779
# Make selection programmatically
env_best <- select_best(env_tune, metric ="roc_auc")
env_best_acc <- select_best(env_tune, metric ="accuracy")
6) Update model specification and workflow with best HP
Update workflow [env_wf_lasso]
Now we can update the model specification and workflow with the best performing hyperparameter value, using the previous env_wf workflow and the finalize_workflow() function.
# Update the workflow/model specification
env_wf_lasso <- env_wf %>%
tune::finalize_workflow(env_best)
# Preview the updated workflow object (with the selected penalty param ≈ 0.00599)
env_wf_lasso
Instead of penalty = tune() like before, now our workflow has finalized values for all arguments.
7) Assess the model performance on training set
— See the results [env_lasso_fit]
Here we access the model coefficients to see which features are most important in the model. We see, for the penalty we chose, which terms contribute the most to an environmental category NOT being high risk.
# Fit the model to the training data
env_lasso_fit <- fit (env_wf_lasso, data = env_cat_train)
— Coefficients [enf_fitted_coeff]
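The coefficient-extraction chunk is not shown; a minimal sketch, assuming env_lasso_fit from the chunk above (the object name enf_fitted_coeff is kept from the heading):

```r
# Sketch: coefficients of the fitted lasso model at the chosen penalty
enf_fitted_coeff <- env_lasso_fit %>%
  extract_fit_parsnip() %>%   # pull the parsnip/glmnet fit out of the workflow
  tidy() %>%                  # term, estimate (log-odds), penalty
  arrange(desc(abs(estimate)))

enf_fitted_coeff
```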
— [FIG] Assessing performance on training set
# Example of a data frame containing actual and predicted values
pred_long <- predict(env_lasso_fit, new_data = env_cat_train , type = "prob")|>
bind_cols(env_cat_train) %>%
select(env_cat_f2, pred_high_risk = '.pred_High-Med-risk', pred_low_risk = '.pred_Low-risk_Othr') %>%
pivot_longer(cols = c(pred_high_risk, pred_low_risk),
names_to = "risk_type", values_to = "risk_value")
# Plot the predictions with boxplots and jittered points without duplicate legends
ggplot(pred_long, aes(x = env_cat_f2, y = risk_value, fill = risk_type)) +
geom_boxplot(alpha = 0.4, position = position_dodge(width = 0.8)) +
#geom_jitter(alpha = 0.6, position = position_dodge(width = 0.8)) + # Remove width and use position_dodge
labs(title = "Predicted High and Low Risk Distribution by Env Category",
subtitle = "Model: Lasso Regression fitted on training data",
x = "ACTUAL",
y = "PREDICTED",
fill = "Risk Type") + # Set label for fill legend
theme_minimal() +
guides(color = "none") # Suppress the color legend
— Assess the model w cross-valid [env_lasso_cv]
The next step is to assess the performance of the model (in workflow env_wf_lasso) on the training set, given the features we have engineered, the algorithm we have selected, and the hyperparameters we have tuned.
Instead of evaluating the model on the training set directly, we will use cross-validation on the training set to gauge the variability of the model (the model’s performance on the entire training set at once is not a reliable indicator of its performance on new data). In fact:
- Cross-validation estimates the model’s performance on new data by simulating the process of training and testing the model on different subsets of the training data (from env_vfold).
- The function tune::fit_resamples() fits the model to the training data and evaluates its performance.
# (this is similar to env_tune)
set.seed(123)
# Cross validate the (optimized) workflow
env_lasso_cv <- env_wf_lasso %>%
tune::fit_resamples(
# 10 fold cross validation splits
resamples = env_vfold,
# save predictions for confusion matrix
control = control_resamples(save_pred = TRUE)
)
We want to aggregate the metrics across the folds to get a sense of the variability of the model. The collect_metrics() function takes the results of a cross-validation and returns a data frame with the metrics.
env_lasso_cv[[3]][[1]] # 1st split
env_lasso_cv[[3]][[3]] # 3rd split
# Collect the results of the cross-validation (this is the average from the 10 splits!)
collect_metrics(env_lasso_cv)
# accuracy binary 0.745 10 0.00923 Preprocessor1_Model1
# brier_class binary 0.175 10 0.00462 Preprocessor1_Model1
# roc_auc binary 0.778 10 0.0111 Preprocessor1_Model1
From the accuracy and ROC-AUC metrics it appears we have a decent candidate model, but there is room for potential improvement. A good next step is to evaluate the model errors and see if there are any patterns that can be addressed before considering what approach to take to improve the model.
— [FIG] Visualize the confusion matrix
The confusion matrix is a table that shows the number of true positives, true negatives, false positives, and false negatives. It is a useful tool for understanding the model’s errors and can help identify patterns that can be addressed to improve the model.
- The conf_mat_resampled() function takes a fit_resamples object (with predictions saved) and, with tidy = FALSE, returns the confusion matrix averaged over the folds. We can pass this to the autoplot() function to plot it. This is done on the cross-validation results env_lasso_cv.
# Plot the confusion matrix
env_lasso_cv %>%
tune::conf_mat_resampled(tidy = FALSE) %>%
autoplot(type = "heatmap")
# 137.2 = true positives
# 33.6 = true negatives
# 40.5 = false positives
# 15.1 = false negatives
There are more false positives (low risk predicted to be high risk) than false negatives. (This is a common issue in imbalanced datasets and can be addressed by adjusting the decision threshold of the model.)
__________
00. NULL model - log w/ text with tf-idf
__________
We can assess a model like this one by comparing its performance to a “null model” or baseline model, a simple, non-informative model that always predicts the largest class for classification. Such a model is perhaps the simplest heuristic or rule-based alternative that we can consider as we assess our modeling efforts.
Workflow and specification for NULL MODEL
We can build a classification null_model() specification and add it to a workflow() with the same preprocessing recipe we used in the previous section, to estimate performance.
# Create a NULL model
null_classification <- null_model() %>%
set_engine("parsnip") %>%
set_mode("classification")
null_rs <- workflow() %>%
add_recipe(env_recipe) %>%
add_model(null_classification) %>%
fit_resamples(
env_vfold
)
Assess the model performance on training set
What results do we obtain from the null model, in terms of performance metrics?
collect_metrics(null_rs)
# 1 accuracy binary 0.673 10 0.00750 Preprocessor1_Model1
# 2 brier_class binary 0.220 10 0.00258 Preprocessor1_Model1
# 3 roc_auc binary 0.5 10 0 Preprocessor1_Model1
The accuracy and ROC AUC indicate that this null model is dramatically worse than even our first model.
OTHER ATTEMPTS to improve the model
__________
1. LASSO classification - log w/ text with tf-idf + stopwords
To improve supervised learning models, consider:
- Engineering the features differently
  - we set a token filter to limit the number of features to 100, which we could adjust (max_tokens)
- Selecting different (or additional) features
- Changing the algorithm
- Tuning the hyperparameters differently
[🤞🏻] Add stopwords exclusion to the recipe
3) Define preprocessing steps (NEW!)
— Improve recipe [env_stop_recipe]
# --- Create a recipe with a token filter step that excludes stopwords
# Rebuild recipe with tokenfilter step
env_stop_recipe <- recipes::recipe (
formula = env_cat_f2 ~ pdo,
data = env_cat_train) %>%
step_tokenize(pdo) %>% # tokenize
# remove CUSTOM stopwords
step_stopwords(pdo, custom_stopword_source = stop_vector) %>%
step_tokenfilter(pdo, max_tokens = 100) %>% # filter by frequency of occurrence
step_tfidf(pdo, smooth_idf = FALSE) # tf-idf creates matrix of weighted term frequencies
# prep and bake the recipe
env_stop_recipe_bake <- env_stop_recipe %>%
prep() %>%
bake(new_data = NULL)
# preview the baked recipe
dim(env_stop_recipe_bake)
#[1] 2264 101
env_recipe_bake[1:5, 1:10]
env_stop_recipe_bake[1:5, 1:10]
4) Select a classification algorithm (same)
— Model specification
# Create a model specification
env_spec <-
# generalized linear model for binary outcomes
parsnip::logistic_reg(
# A non-negative number representing the total amount of regularization
penalty = tune(), # 0 = no penalty, 1 = max
#A number between zero and one (inclusive)
mixture = 1 # specifies a pure lasso model,
) %>%
set_engine("glmnet")
##### tune() IS A PLACEHOLDER
# Preview
env_spec
— Create workflow [env_stop_wf]
env_stop_recipe is actually the part that changed in this workflow, adding step_stopwords().
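A minimal sketch of the workflow, assuming env_stop_recipe and env_spec defined above:

```r
# Sketch: workflow with the stopword-aware recipe
env_stop_wf <- workflow() %>%
  add_recipe(env_stop_recipe) %>%
  add_model(env_spec)

env_stop_wf
```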
5) Tuning hyperparameters (same)
— Penalty tuning [env_grid] (same)
# Create a grid of values for the penalty hyperparameter (random set of 10 values)
env_grid <- dials::grid_regular(
penalty(), levels = 10
)
# Preview
env_grid
— K-fold cross-val [env_vfold, env_stop_tune] (same/NEW)
The env_stop_tune object contains the results of the tuning for each fold. We can see them by calling the collect_metrics() function on the env_stop_tune object.
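The tuning chunk is not shown; a minimal sketch, assuming env_stop_wf, the env_vfold folds, and env_grid:

```r
# Sketch: tune the penalty for the stopword workflow
set.seed(123)
env_stop_tune <- tune::tune_grid(
  object = env_stop_wf,
  resamples = env_vfold,
  grid = env_grid,
  control = control_grid(save_pred = TRUE)
)
```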
# Collect the results of the tuning
env_stop_tune_metrics <- tune::collect_metrics(env_stop_tune)
# visualize the results
autoplot(env_stop_tune) +
labs(
title = "Lasso model performance across regularization penalties",
subtitle = "Performance metrics can be used to identify the best penalty"
)
# in roc_auc: many of the penalty values performed similarly, with a drop-off in performance at the higher values
Conveniently, the tune::show_best() function takes a tune_grid object and returns the best performing hyperparameter values.
# Show the best hyperparameter values
show_best(env_stop_tune, metric = "roc_auc")
# Make selection programmatically
env_stop_best <- select_best(env_stop_tune, metric ="roc_auc")
env_stop_best_acc <- select_best(env_stop_tune, metric ="accuracy")
env_stop_best_brier <- select_best(env_stop_tune, metric ="brier_class")
6) Update model specification and workflow with best HP (NEW)
Update workflow [env_stop_wf_lasso]
Now we can update the model specification and workflow with the best performing hyperparameter value, using the previous env_stop_wf workflow and the finalize_workflow() function.
# Update the model specification
env_stop_wf_lasso <- env_stop_wf %>%
tune::finalize_workflow(env_stop_best)
# Preview the updated workflow object (with the selected penalty param ≈ 0.00599)
env_stop_wf_lasso
7) Assess the model performance on training set
— See the results [env_stop_lasso_fit]
Here we access the model coefficients to see which features are most important in the model. We see, for the penalty we chose, which terms contribute the most to an environmental category NOT being high risk.
# Fit the model to the training data
env_stop_lasso_fit <- fit (env_stop_wf_lasso, data = env_cat_train)
— Coefficients [enf_fitted_coeff]
“and” is not a top coefficient anymore!!!
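A minimal sketch of how this can be checked, assuming env_stop_lasso_fit from the chunk above:

```r
# Sketch: largest (absolute) coefficients after removing stopwords
env_stop_lasso_fit %>%
  extract_fit_parsnip() %>%
  tidy() %>%
  arrange(desc(abs(estimate))) %>%
  slice_head(n = 20)
```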
— [FIG] Assessing performance [env_stop_lasso_fit] on training set
# Example of a data frame containing actual and predicted values
pred_stop_long <- predict(env_stop_lasso_fit, new_data = env_cat_train , type = "prob")|>
bind_cols(env_cat_train) %>%
select(env_cat_f2, pred_high_risk = '.pred_High-Med-risk', pred_low_risk = '.pred_Low-risk_Othr') %>%
pivot_longer(cols = c(pred_high_risk, pred_low_risk),
names_to = "risk_type", values_to = "risk_value")
# Plot the predictions with boxplots and jittered points without duplicate legends
ggplot(pred_stop_long, aes(x = env_cat_f2, y = risk_value, fill = risk_type)) +
geom_boxplot(alpha = 0.4, position = position_dodge(width = 0.8)) +
#geom_jitter(alpha = 0.6, position = position_dodge(width = 0.8)) + # Remove width and use position_dodge
labs(title = "Predicted High and Low Risk Distribution by Env Category",
subtitle = "Model: Lasso Regression fitted on training data (stopwords)",
x = "ACTUAL",
y = "PREDICTED",
fill = "Risk Type") + # Set label for fill legend
theme_minimal() +
guides(color = "none") # Suppress the color legend
This model did not improve much (especially on the Low-risk_Othr level prediction)!
— Assess the model w cross-valid [env_stop_lasso_cv]
# (this is similar to env_tune)
# Cross validate the (optimized) workflow
env_stop_lasso_cv <- env_stop_wf_lasso %>%
tune::fit_resamples(
# 10 fold cross validation splits
resamples = env_vfold,
# save predictions for confusion matrix
control = control_resamples(save_pred = TRUE)
)
We want to aggregate the metrics across the folds to get a sense of the variability of the model. The collect_metrics() function takes the results of a cross-validation and returns a data frame with the metrics.
env_stop_lasso_cv[[3]][[1]] # 1st split
env_stop_lasso_cv[[3]][[3]] # 3rd split
# Collect the results of the cross-validation (this is the average from the 10 splits!)
collect_metrics(env_stop_lasso_cv)
# 1 accuracy binary 0.750 10 0.00745 Preprocessor1_Model1
# 2 brier_class binary 0.172 10 0.00322 Preprocessor1_Model1
# 3 roc_auc binary 0.784 10 0.00881 Preprocessor1_Model1
# OLD
collect_metrics(env_lasso_cv)
# 1 accuracy binary 0.750 10 0.0102 Preprocessor1_Model1
# 2 brier_class binary 0.174 10 0.00352 Preprocessor1_Model1
# 3 roc_auc binary 0.777 10 0.00771 Preprocessor1_Model1
Same accuracy, but improved roc_auc and brier_class compared to the previous model!
— [FIG] Visualize the confusion matrix
# BTW
collect_predictions(env_stop_lasso_cv)
This is done on the cross-validation results env_stop_lasso_cv.
# Plot the confusion matrix
env_stop_lasso_cv %>%
tune::conf_mat_resampled(tidy = FALSE) %>%
autoplot(type = "heatmap")
# 137.2 = true positives --> 136.4
# 33.6 = true negatives --> 33.1
# 40.5 = false positives --> 40.8
# 15.1 = false negatives --> 15.3
There are more false positives (low risk predicted to be high risk) than false negatives. (This is a common issue in imbalanced datasets and can be addressed by adjusting the decision threshold of the model.)
8) Evaluate the best model on the validation/test set
To do this we need to fit the tuned workflow to the training set (the actual training phase) and evaluate it on the test set. We will use the last_fit() function from {tune}.
Let’s use the updated workflow env_stop_wf_lasso.
— Fit the best model(training) and evaluate on the test set
# fit the model to the training set and evaluate on the validation set
env_stop_lasso_final_fit <- last_fit(
env_stop_wf_lasso,
split = env_split)
# Evaluate the model on the validation set (in my case)
collect_metrics(env_stop_lasso_final_fit)
# 1 accuracy binary 0.762 Preprocessor1_Model1
# 2 roc_auc binary 0.807 Preprocessor1_Model1
# 3 brier_class binary 0.163 Preprocessor1_Model1
The performance metrics are very close to those we achieved on the training set –> good sign that the model is robust as it performs well on both training and test (validation) sets.
— [FIG] Visualize the confusion matrix
# Plot the confusion matrix
env_stop_lasso_final_fit %>%
collect_predictions() %>%
conf_mat(truth = env_cat_f2, estimate = .pred_class) %>%
autoplot(type = "heatmap")
Still (on the test set) imbalanced false positives (well) and false negatives (poor).
— Visualize the ROC AUC curve
Take the output of last_fit() and use it (env_stop_lasso_final_fit) to plot the ROC curve.
colnames(env_stop_lasso_final_fit)
# Extract predictions from the final fit object
# Extract the tibble from the list
env_stop_lassofinal_fit_pred <- env_stop_lasso_final_fit$.predictions[[1]]
str(env_stop_lassofinal_fit_pred)
# Visualize the ROC curve
env_stop_lassofinal_fit_pred %>%
roc_curve(truth = env_cat_f2, '.pred_High-Med-risk') %>%
autoplot() +
labs(
title = "ROC Curve for High-Med Risk Prediction",
x = "1 - Specificity (False Positive Rate)",
y = "Sensitivity (True Positive Rate)",
caption = "logistic regression model on text (stopwords)"
)
#__________
2. model - log w/ text with tf-idf + stopwords + 3 factors
[👍🏻] Logistic with pdo + sector_f | regionname | boardapprovalFY as predictors
3) Define preprocessing steps [env_FEAT_recipe] (same)
— Improve recipe [env_FEAT_recipe] (NEW!)
- using sector_f to include the sector tag but with fewer dimensions
- step_dummy() because logistic regression, especially when using certain tuning functions in tidymodels, requires numeric or dummy variables.
# --- Create a recipe with a token filter step that excludes stopwords
# Rebuild recipe with tokenfilter step
env_FEAT_recipe <- recipe (env_cat_f2 ~ pdo + sector_f + regionname + boardapprovalFY,
data = training(env_split)) %>%
# tokenize the text
step_tokenize(pdo) %>%
# remove CUSTOM stopwords
step_stopwords(pdo, custom_stopword_source = stop_vector) %>%
# filter by frequency of occurrence
step_tokenfilter(pdo, max_tokens = 100) %>%
# creates tf-idf matrix of weighted term frequencies
step_tfidf(pdo, smooth_idf = FALSE) %>%
# add NA as special factor level
step_unknown(sector_f ,new_level = "Unknown sect" ) %>%
step_unknown(regionname ,new_level = "Unknown reg" ) %>%
step_unknown(boardapprovalFY ,new_level = "Unknown FY" ) %>%
# convert to dummy variables
step_dummy(sector_f, regionname, boardapprovalFY, one_hot = TRUE)
check what changed…
4) Select algorithm + workflow (same)
— Model specification [env_spec]
# Create a model specification
env_spec <-
# generalized linear model for binary outcomes
parsnip::logistic_reg(
mode = "classification",
# A non-negative number representing the total amount of regularization
penalty = tune(), # 0 = no penalty, 1 = max
#A number between zero and one (inclusive)
mixture = 1 # specifies a pure lasso model,
) %>%
set_engine("glmnet")
##### tune() IS A PLACEHOLDER
# Preview
env_spec
— Create workflow [env_FEAT_wf] (NEW!)
env_FEAT_recipe is actually the part that changed in this workflow, adding the factor predictors (with step_unknown() and step_dummy()).
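A minimal sketch of the workflow, assuming env_FEAT_recipe and env_spec defined above:

```r
# Sketch: workflow with text + factor predictors
env_FEAT_wf <- workflow() %>%
  add_recipe(env_FEAT_recipe) %>%
  add_model(env_spec)

env_FEAT_wf
```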
5) Tuning hyperparameters (same)
— Penalty tuning + folds [env_grid, env_vfold] (same)
# Create a grid of values for the penalty hyperparameter (random set of 10 values)
env_grid <- dials::grid_regular(
penalty(), levels = 10
)
# Create a resampling object
env_vfold <- rsample::vfold_cv(env_cat_train, v = 10)
— K-fold cross-val tuning [env_FEAT_tune] (NEW)
# Create a tuning workflow
env_FEAT_tune <- tune::tune_grid(
object = env_FEAT_wf, # changed !
resamples = env_vfold,
grid = env_grid,
control = control_grid(save_pred = TRUE)
)
# preview
env_FEAT_tune
The env_FEAT_tune object contains the results of the tuning for each fold. We can see the results of the tuning for each fold by calling the collect_metrics() function on the env_FEAT_tune object.
# Collect the results of the tuning
env_FEAT_tune_metrics <- tune::collect_metrics(env_FEAT_tune)
# visualize the results
autoplot(env_FEAT_tune)
# in roc_auc: many of the penalty values performed similarly, with a drop-off in performance at the higher values
Conveniently, the tune::show_best() function takes a tune_grid object and returns the best performing hyperparameter values.
# Show the best hyperparameter values
show_best(env_FEAT_tune, metric = "roc_auc")
# Make selection programmatically
env_FEAT_best <- select_best(env_FEAT_tune, metric ="roc_auc")
env_FEAT_best_acc <- select_best(env_FEAT_tune, metric ="accuracy")
env_FEAT_best_brier <- select_best(env_FEAT_tune, metric ="brier_class")
6) Update model specification and workflow with best HP (NEW)
Update workflow [env_FEAT_wf2]
Now we can update the model specification and workflow with the best performing hyperparameter value, using the previous env_FEAT_wf workflow and the finalize_workflow() function.
# Update the model specification
env_FEAT_wf2 <- env_FEAT_wf %>%
tune::finalize_workflow(env_FEAT_best)
# Preview the updated workflow object (with the selected penalty param ≈ 0.00599)
env_FEAT_wf2
7) Assess the model performance on training set
— See the results [env_FEAT_fit]
Here we access the model coefficients to see which features are most important in the model. We see, for the penalty we chose, which terms contribute the most to an environmental category NOT being high risk.
# Fit the model to the training data
env_FEAT_fit <- fit (env_FEAT_wf2, data = env_cat_train)
— Coefficients [enf_fitted_coeff]
- “sector_f…” variables appear among the top coefficients!!!
— [FIG] Assessing performance [env_FEAT_fit] on training set
Now the data used is training(env_split).
# Fit the model on the training set
env_FEAT_fit <- env_FEAT_wf2 %>% # NEW
fit(data = training(env_split))
# Example of a data frame containing actual and predicted values
pred_FEAT_long <- predict(env_FEAT_fit, new_data = training(env_split), type = "prob")|>
bind_cols(training(env_split)) %>%
select(env_cat_f2, pred_high_risk = '.pred_High-Med-risk', pred_low_risk = '.pred_Low-risk_Othr') %>%
pivot_longer(cols = c(pred_high_risk, pred_low_risk),
names_to = "risk_type", values_to = "risk_value")
# Plot the predictions with boxplots and jittered points without duplicate legends
ggplot(pred_FEAT_long, aes(x = env_cat_f2, y = risk_value, fill = risk_type)) +
geom_boxplot(alpha = 0.4, position = position_dodge(width = 0.8)) +
#geom_jitter(alpha = 0.6, position = position_dodge(width = 0.8)) + # Remove width and use position_dodge
labs(title = "Predicted High and Low Risk Distribution by Env Category",
subtitle = "Model: Lasso Regression fitted on training data (stop and feat)",
x = "ACTUAL",
y = "PREDICTED",
fill = "Risk Type") + # Set label for fill legend
theme_minimal() +
guides(color = "none") # Suppress the color legend
The LOW risk prediction also seems improved (at least on average, although with more dispersion).
# Fit the model on the testing set
env_FEAT_fit <- env_FEAT_wf2 %>% # NEW
fit(data = testing(env_split))
# Example of a data frame containing actual and predicted values
pred_FEAT_long <- predict(env_FEAT_fit, new_data = testing(env_split), type = "prob")|>
bind_cols(testing(env_split)) %>%
select(env_cat_f2, pred_high_risk = '.pred_High-Med-risk', pred_low_risk = '.pred_Low-risk_Othr') %>%
pivot_longer(cols = c(pred_high_risk, pred_low_risk),
names_to = "risk_type", values_to = "risk_value")
# Plot the predictions with boxplots and jittered points without duplicate legends
ggplot(pred_FEAT_long, aes(x = env_cat_f2, y = risk_value, fill = risk_type)) +
geom_boxplot(alpha = 0.4, position = position_dodge(width = 0.8)) +
#geom_jitter(alpha = 0.6, position = position_dodge(width = 0.8)) + # Remove width and use position_dodge
labs(title = "Predicted High and Low Risk Distribution by Env Category",
subtitle = "Model: Lasso Regression fitted on testing data (stop and feat)",
x = "ACTUAL",
y = "PREDICTED",
fill = "Risk Type") + # Set label for fill legend
theme_minimal() +
guides(color = "none") # Suppress the color legend
— Assess the model w cross-valid [env_FEAT_cv] (NEW)
# (this is similar to env_tune)
set.seed(123)
# Cross validate the (optimized) workflow
env_FEAT_cv <- env_FEAT_wf2 %>%
tune::fit_resamples(
# 10 fold cross validation splits
resamples = env_vfold,
# save predictions for confusion matrix
control = control_resamples(save_pred = TRUE)
)
We want to aggregate the metrics across the folds to get a sense of the variability of the model. The collect_metrics() function takes the results of a cross-validation and returns a data frame with the metrics.
env_FEAT_cv[[3]][[1]] # 1st split
env_FEAT_cv[[3]][[3]] # 3rd split
# Collect the results of the cross-validation (this is the average from the 10 splits!)
collect_metrics(env_FEAT_cv)
# 1 accuracy binary 0.777 10 0.0118 Preprocessor1_Model1
# 2 brier_class binary 0.155 10 0.00503 Preprocessor1_Model1
# 3 roc_auc binary 0.820 10 0.0101 Preprocessor1_Model1
# OLD
collect_metrics(env_stop_lasso_cv)
#1 accuracy binary 0.754 10 0.00997 Preprocessor1_Model1
#2 brier_class binary 0.170 10 0.00371 Preprocessor1_Model1
#3 roc_auc binary 0.785 10 0.00813 Preprocessor1_Model1
roc_auc and accuracy improved a bit, and brier_class is lower compared to the previous model!
— [FIG] Visualize the confusion matrix
This is done on the cross-validation results env_FEAT_cv. NOTE: the number of observations is lower because of the dropped missing factors.
set.seed(123)
# Plot the confusion matrix
env_FEAT_cv %>%
tune::conf_mat_resampled(tidy = FALSE) %>%
autoplot(type = "heatmap")
# env_stop_lasso_cv --> env_FEAT_cv
# 137.2 = true positives --> 135
# 33.6 = true negatives --> 33.7
# 40.5 = false positives --> 33.8
# 15.1 = false negatives --> 16.7
Compared to the previous model, there is a small improvement in false positives (low risk predicted to be high risk), as there are fewer than before. This model did improve (especially in the low risk category): in fact, the probability of being classified as HIGH RISK is below 50% and of being classified LOW RISK is above 50%.
8) Evaluate the best model on the validation/test set
To do this we need to fit the tuned workflow to the training set (the actual training phase) and evaluate it on the test set. We will use the last_fit() function from {tune}.
Let’s use the updated workflow env_FEAT_wf2.
RECALL I DID:
- for training (labelled env_cat_f): env_cat_train <- training(env_split) # 2265 proj
- for validation (labelled env_cat_f): env_cat_test <- testing(env_split) # 971 proj
— Fit the best model(training) and evaluate on the test set
After determining the best model, the final fit on the entire training set is needed and is then evaluated on the test set.
# fit the model to the training set and evaluate on the validation set
env_FEAT_final_fit <- last_fit(
env_FEAT_wf2,
split = env_split)
# Evaluate the model on the validation set (in my case)
collect_metrics(env_FEAT_final_fit)
# 1 accuracy 0.793 Preprocessor1_Model1 --> 0.79
# 2 roc_auc 0.851 Preprocessor1_Model1 --> 0.85
# 3 brier_class 0.143 Preprocessor1_Model1 --> 0.144
The performance metrics are very close to those we achieved on the training set (actually better!!) –> good sign that the model is robust as it performs well on both training and test (validation) sets.
— [FIG] Visualize the confusion matrix
# Plot the confusion matrix
env_FEAT_final_fit %>%
collect_predictions() %>%
conf_mat(truth = env_cat_f2, estimate = .pred_class) %>%
autoplot(type = "heatmap")
library(tidymodels)
library(ggplot2)
# Create a table for the confusion matrix counts
ML_final_fit_cm_p <- env_FEAT_final_fit %>%
collect_predictions() %>%
conf_mat(truth = env_cat_f2, estimate = .pred_class) %>%
autoplot(type = "heatmap") +
labs(
title = "Confusion Matrix for Lasso Logistic Regression Model",
x = "Predicted Class",
y = "True Class",
fill = "Count"
) +
scale_fill_gradient(low = "#f2e8ea", high = "#964957") + # Adjust color gradient for better contrast
theme_minimal(base_size = 14) + # Set a clean theme with larger base text size
theme(
plot.title = element_text(hjust = 0.5, face = "bold"), # Center and bold the title
#axis.text.x = element_text(angle = 45, hjust = 1), # Angle x-axis text for readability
legend.position = "right" # Place the legend on the right
)
ML_final_fit_cm_p
f_save_plot_obj <- function(plot_object, plot_obj_name) {
# Save the plot object
saveRDS(plot_object, here("analysis", "output", "figures", paste0(plot_obj_name, ".rds")))
}
f_save_plot_obj(ML_final_fit_cm_p, "ML_final_fit_cm_p")
Still (on the validation set) imbalanced false positives (138) and false negatives (63).
— [FIG] Visualize the ROC AUC curve
Take the output of last_fit() and use it (env_FEAT_final_fit) to plot the ROC curve.
colnames(env_FEAT_final_fit)
# Extract predictions from the final fit object
# Extract the tibble from the list
env_FEAT_final_fit_pred <- env_FEAT_final_fit$.predictions[[1]]
str(env_FEAT_final_fit_pred)
# Visualize the ROC curve
env_FEAT_final_fit_pred %>%
roc_curve(truth = env_cat_f2, '.pred_High-Med-risk') %>%
autoplot() +
labs(
title = "ROC Curve for High-Med Risk Prediction",
x = "1 - Specificity (False Positive Rate)",
y = "Sensitivity (True Positive Rate)",
caption = "logistic regression model on text (stopwords) + features"
)
9) Interpret the model
— Inspecting what levels of the outcome are most difficult to estimate
I will select the columns with the actual outcome (env_cat_f2), the predicted outcome, the env_cat_f level, and the pdo text, and separate the predicted outcomes to inspect them.
Inspecting in which actual category (env_cat_f) projects fall when they are actually env_cat_f2 == 'High-Med-risk' but falsely predicted to be .pred_class == 'Low-risk_Othr': not surprisingly, most of them are at the medium risk level (this makes sense).
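The inspection chunk is not shown; a minimal sketch, assuming env_FEAT_final_fit from last_fit() above and that collect_predictions() returns rows in the same order as env_cat_test:

```r
# Sketch: which actual env_cat_f level are the misclassified projects in?
env_FEAT_final_fit %>%
  collect_predictions() %>%
  bind_cols(env_cat_test %>% select(env_cat_f, pdo)) %>%
  filter(env_cat_f2 == "High-Med-risk",
         .pred_class == "Low-risk_Othr") %>%
  count(env_cat_f, sort = TRUE)   # mostly medium-risk (B) projects, as noted above
```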
— Inspecting the most important features for predicting the outcome
- Using the extract_fit_parsnip() function from the {workflows} package to extract the model object from the workflow.
- Estimates are the log odds of the outcome for each feature (i.e. the log of the probability of the “Low-risk_Othr” outcome divided by the probability of “High-Med-risk”).
- Positive coefficient: indicates an increased likelihood of being in the “Low-risk_Othr” category compared to the “High-Med-risk” category.
- Negative coefficient: indicates an increased likelihood of being in the “High-Med-risk” category compared to the “Low-risk_Othr” category.
- The odds ratio (exponentiated coefficient) tells whether a feature is associated with a lower or a higher probability of the outcome:
  - Odds ratio > 1: the predictor increases the likelihood of the outcome (“Low-risk_Othr”).
  - Odds ratio < 1: the predictor decreases the likelihood of the outcome, i.e. it favours “High-Med-risk”.
# Extract the estimate (log-odds)
env_FEAT_final_fit_features <- extract_fit_parsnip(env_FEAT_final_fit) %>%
tidy() %>%
# Calculate the exponentiated estimate
mutate(odds = exp(estimate),
probability = odds / (1 + odds))
# tibble: 206 × 5
# term estimate penalty odds probability
# <chr> <dbl> <dbl> <dbl> <dbl>
# (Intercept) -1.26 0.00599 0.284 0.221
# tfidf_pdo_access -0.00452 0.00599 0.995 0.499
# tfidf_pdo_activities 0.786 0.00599 2.19 0.687
- tfidf_pdo_activities: est 0.786 | odds 2.19 | prob 0.687 (associated with LOW risk)
- tfidf_pdo_access: est -0.004 | odds 0.995 | prob 0.499 (associated with HIGH risk)
— Extract the most important features
A quick way to extract the most important features for predicting each outcome is to use the vi() function from {vip}.
- The vi() function calculates the variable importance of each feature in the model (here, based on the absolute value of the logistic regression coefficients).
library(vip)
# Extract the most important features
env_FEAT_var_importance <- extract_fit_parsnip(env_FEAT_final_fit) %>%
vip::vi() %>%
# it is kinda counterintuitive
mutate(note = case_when(
Sign == "POS" ~ "More likely to be in Low-risk_Othr",
Sign == "NEG" ~ "More likely to be in High-Med-risk",
TRUE ~ "NEU"
))
— [FIG] Plot the most important features
# Recode variable and sign
var_importance_tbl <- env_FEAT_var_importance %>%
mutate(Feature = str_remove(Variable, "tfidf_"),
EnvRiskOutcome = case_when(
Sign == "NEG" ~ "High-Med-risk",
Sign == "POS" ~ "Low-risk_Othr") ) %>%
select(Feature, Importance, EnvRiskOutcome)
summary(var_importance_tbl$EnvRiskOutcome)
# Plot the most important features
ML_feature_importance_p <- var_importance_tbl %>%
slice_max(Importance, n = 50) %>%
ggplot(aes(x = reorder(Feature, Importance), y = Importance, color = EnvRiskOutcome)) +
geom_point() +
coord_flip() +
facet_wrap(~EnvRiskOutcome, ncol = 2, scales = "free_y") +
labs(
title = glue("Most influential features for predicting Environmental Risk"),
subtitle = "LASSO Logistic Regression model on text + metadata tags \n(Importance = absolute value of the logistic regression coefficients)",
#caption = "(Importance of each feature calculated as the absolute value of the logistic regression coefficients)",
x = "",
y = "",
fill = ""
) +
lulas_theme +
theme(
plot.title = element_text(hjust = 0.5, face = "bold")
) +
guides(color = "none")
ML_feature_importance_p
The feature importance plot highlights the top 50 predictors of the (binary) environmental risk category, ranked by their influence in a LASSO logistic regression model. For better readability the predictors are split according to the level of risk predicted. It should be no surprise that words in the PDO text (those variables starting with pdo_*) are the most important predictors given the data. Still, some of the other predictors are also important, such as sector_f_TRANSPORT (left panel) or regionname and sector_f_FINANCIAL (right panel).
Each facet groups features by environmental risk outcome, allowing a comparison of which factors contribute most to each category. Features with higher importance values, like [mention a few key features], play a significant role in predicting environmental risk, offering insights for targeted risk assessment and decision-making.
# show plot
ML_feature_importance_p
# save as rds
f_save_plot_obj(ML_feature_importance_p, "ML_feature_importance_p")
__________
3. model - NB w/ text with tf-idf + stopwords + 3 factors
[🫤] Naive Bayes classification with text and sector_f, regionname, and boardapprovalFY as predictors
3) Define preprocessing steps (same)
4) Select algorithm + workflow (same)
— Model specification [env_NB_spec]
Let’s use a naive Bayes model, which is available via the tidymodels extension package discrim.
One of the main advantages is its ability to handle a large number of features, such as those we deal with when using word count methods. Here we have only kept the 100 most frequent tokens, but we could have kept more and a naive Bayes model would still be able to handle such predictors well. For now, we will limit the model to a moderate number of tokens.
# needed for naive Bayes
library(discrim)
# Create a model specification
env_NB_spec <-
# generalized linear model for binary outcomes
parsnip::naive_Bayes() %>%
# Specify the mode of the model
set_mode("classification") %>%
# Specify the engine
set_engine("naivebayes")
# Preview
env_NB_spec
— Create workflow [env_NB_wf] (NEW!)
env_NB_spec is actually the part that changed in this workflow (the preprocessing recipe stays the same).
— Fit the classification model to the training set [env_NB_fit] (NEW!)
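Neither the workflow nor the fitting chunk is shown; a minimal sketch of both, assuming the model reuses the env_FEAT_recipe (text + 3 factors) defined earlier:

```r
# Sketch: naive Bayes workflow and fit on the training set
env_NB_wf <- workflow() %>%
  add_recipe(env_FEAT_recipe) %>%
  add_model(env_NB_spec)

env_NB_fit <- fit(env_NB_wf, data = env_cat_train)
```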
5) Tuning hyperparameters (same)
Let’s use resampling to estimate the performance of the naive Bayes classification model we just fit. We can do this using resampled data sets built from the training set. Let’s create 10-fold cross-validation sets, and use these resampled sets for performance estimates.
The env_NB_rs object contains the resampling results for each fold. We can extract the relevant information using collect_metrics() and collect_predictions().
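The resampling chunk is not shown; a minimal sketch, assuming env_NB_wf and the env_vfold folds:

```r
# Sketch: estimate naive Bayes performance with 10-fold cross-validation
set.seed(123)
env_NB_rs <- fit_resamples(
  env_NB_wf,
  resamples = env_vfold,
  control = control_resamples(save_pred = TRUE)
)
```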
env_NB_rs_metrics <- collect_metrics(env_NB_rs)
#1 accuracy binary 0.684 10 0.00843 Preprocessor1_Model1
#2 brier_class binary 0.314 10 0.00884 Preprocessor1_Model1
#3 roc_auc binary 0.762 10 0.00715 Preprocessor1_Model1
env_NB_rs_predictions <- collect_predictions(env_NB_rs)
Worse than logistic!
- Accuracy is the proportion of the data that is predicted correctly. Be aware that accuracy can be misleading in some situations, such as for imbalanced data sets.
- ROC AUC measures how well a classifier performs at different thresholds. The ROC curve plots the true positive rate against the false positive rate; AUC closer to 1 indicates a better-performing model, while AUC closer to 0.5 indicates a model that does no better than random guessing.
- The Brier score is a measure of the mean squared difference between the predicted probabilities and the actual outcomes.
— Visualize the ROC AUC curve
The area under each of these curves is the roc_auc metric we have computed. (If the curve were close to the diagonal line, the model’s predictions would be no better than random guessing.)
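The plotting chunk is not shown; a minimal sketch, assuming the env_NB_rs_predictions collected above:

```r
# Sketch: one ROC curve per cross-validation fold
env_NB_rs_predictions %>%
  group_by(id) %>%   # id = fold
  roc_curve(truth = env_cat_f2, '.pred_High-Med-risk') %>%
  autoplot() +
  labs(
    title = "ROC curve for High-Med risk prediction",
    caption = "Naive Bayes model, 10-fold cross-validation"
  )
```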
— [FIG] Visualize the confusion matrix
# Plot the confusion matrix
conf_mat_resampled(env_NB_rs, tidy = FALSE) %>%
autoplot(type = "heatmap")
- Different from the logistic model.
- It does very well with true positives (high risk), but very badly with true negatives (low risk).
8) Evaluate the best model on the validation/test set
— Fit the best model(training) and evaluate on the test set
After determining the best model, the final fit on the entire training set is needed and is then evaluated on the test set.
# fit the model to the training set and evaluate on the validation set
env_NB_rs_final_fit <- last_fit(
env_NB_wf,
split = env_split)
# Evaluate the model on the validation set (in my case)
collect_metrics(env_NB_rs_final_fit)
# 1 accuracy binary 0.691 Preprocessor1_Model1
# 2 roc_auc binary 0.784 Preprocessor1_Model1
# 3 brier_class binary 0.307 Preprocessor1_Model1
____________________________________________________________
———————- MULTICLASS OUTCOME ——————-
____________________________________________________________
4. & 5. model - Multiclass outcome sector_f
[🙃] Multiclass outcome
2) Split sample (based on sector_f)
This time I need to create a new split of the data using initial_split(), based on the levels of sector_f (here I collapsed the original 99 levels into 7 macro levels).
We will use the strata argument to stratify the data by the outcome variable (sector_f). This will ensure that the training and validation sets have the same class proportions.
# Create a stratified split based on missing vs non-missing sector_f
projs_train %>% tabyl(sector_f) # 7 levels
# Split BUT keep only "Not Missing" `sector_f`
## --- 0) THIS WILL BE FOR TRAINING & VALIDATION
sec_use <- projs_train %>%
filter(sector_f != "Missing") # 4316 proj
# SPLIT INTO TRAINING, VALIDATION
set.seed(123) # Ensure reproducibility
sec_split <- initial_split(sec_use, prop = 0.7, # 70% training, 30% testing
strata = sector_f) # stratify by OUTCOME
## -- 1) for training (labelled `sector_f`)
sec_train <- training(sec_split) # 3019 proj
## -- 2) for validation (labelled `sector_f`)
sec_test <- testing(sec_split) # 1297 proj
# # UNLABELLED PORTION
## -- 3) for actual test (UNlabelled `sector_f`)
sec_missing <- projs_train %>%
filter(sector_f == "Missing") # 87 proj
# check distribution of `sector_f` in training and validation
tabyl(sec_train, sector_f) |> adorn_totals("row") |> adorn_pct_formatting(digits = 1)#
tabyl(sec_test, sector_f)|> adorn_totals("row") |> adorn_pct_formatting(digits = 1)#
There is no terrible imbalance between the levels of sector_f in the training and validation sets. However, compared to binary classification, there are several additional issues to keep in mind when working with multiclass classification:
- Many machine learning algorithms do not handle imbalanced data well and are likely to have a hard time predicting minority classes.
- Not all machine learning algorithms are built for multiclass classification at all.
- Many evaluation metrics need to be reformulated to describe multiclass predictions.
3) Preprocessing steps [multi_rec] (same)
We have added step_downsample(sector_f) to the end of the recipe specification to downsample after all the text preprocessing. We want to downsample last so that we still generate features on the full training data set. The downsampling will then only affect the modeling step, not the preprocessing steps, with hopefully better results.
multi_rec <- recipe(
#sector_f ~ pdo, data = sec_train) %>%
sector_f ~ pdo + regionname + boardapprovalFY + env_cat_f, data = sec_train) %>%
step_tokenize(pdo) %>%
step_stopwords(pdo, custom_stopword_source = stop_vector) %>%
step_tokenfilter(pdo, max_tokens = 100) %>%
step_tfidf(pdo, smooth_idf = FALSE) %>%
# add NA as special factor level
step_unknown(regionname ,new_level = "Unknown reg" ) %>%
step_unknown(boardapprovalFY ,new_level = "Unknown FY" ) %>%
step_unknown(env_cat_f ,new_level = "Unknown env cat" ) %>%
# convert to dummy variables
step_dummy(regionname, boardapprovalFY,env_cat_f, one_hot = TRUE) %>%
# resolve imbalance
step_downsample(sector_f)
— COULD use step_word_embeddings to see a different representation and check what changes…
4.a) Specify models (NEW!)
— i) Multin. Lasso Logistic Regress [logistic_model]
Some model algorithms and computational engines (examples are most random forests and SVMs) automatically detect when we perform multiclass classification from the number of classes in the outcome variable and do not require any changes to our model specification. For lasso regularization, we need to create a new special model specification just for the multiclass case, using multinom_reg().
# MULTINOMIAL LASSO REGRESSION MODEL
# For lasso regularization, we need to create a new special model specification just for the multiclass class
logistic_model <- multinom_reg(penalty = tune(), mixture = 1) %>%
set_mode("classification") %>%
set_engine("glmnet")
logistic_model
— ii) KNN [knn_model]
knn_model <- nearest_neighbor(neighbors = tune()) %>%
set_engine("kknn") %>%
set_mode("classification")
4.b) Specify workflows (NEW!)
— i) Multin. Lasso Logistic Regress [logistic_wf]
To keep our text data sparse throughout modeling and use the sparse capabilities of set_engine(“glmnet”), we need to explicitly set a non-default preprocessing blueprint, using the package hardhat. This “blueprint” lets us specify during modeling how we want our data passed around from the preprocessing into the model. The composition “dgCMatrix” is the most common sparse matrix type (from the Matrix package) used in R for modeling. We can use this blueprint argument when we add our recipe to our modeling workflow, to define how the data should be passed into the model.
library(hardhat)
sparse_bp <- default_recipe_blueprint(composition = "dgCMatrix")
# workflow
logistic_wf <- workflow() %>%
add_recipe(multi_rec, blueprint = sparse_bp) %>%
add_model(logistic_model)
logistic_wf
— ii) KNN [knn_wf]
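A sketch of the KNN workflow, reusing the same multiclass recipe; whether the sparse blueprint also pays off with the “kknn” engine is an assumption, so it is left out here.
knn_wf <- workflow() %>%
  add_recipe(multi_rec) %>%
  add_model(knn_model)
knn_wf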
5) Hyperparameters tuning
We used the same arguments for penalty and mixture as before, as well as the same mode and engine, but this model specification is set up to handle more than just two classes.
— folds for cross-validation
We also need a new cross-validation object since we are using a different data set.
set.seed(123)
# random splits ("folds") of the data for cross-validation
multi_folds <- vfold_cv(sec_train)
— Define Grids
The last time we tuned a lasso model, we used the defaults for the penalty parameter and 30 levels. This time, let’s try fewer values (levels = 10 in the grids below).
- grid_regular() chooses sensible values to try for each tuning parameter; here we ask for 10 different possible values.
— i) Multin. Lasso Logistic Regress [logistic_grid]
logistic_grid <- grid_regular(hardhat::extract_parameter_set_dials(logistic_model), levels = 10)
— ii) KNN [knn_grid]
knn_grid <- grid_regular(hardhat::extract_parameter_set_dials(knn_model),
                         levels = 10,
                         filter = c(neighbors > 1))
— Define tuning process
We define the tuning control and a metric set with multiclass support.
model_control <- control_grid(save_pred = TRUE)
model_metrics <- metric_set( accuracy, sens, spec, mn_log_loss, roc_auc)
Now we have everything we need to tune the regularization penalty and find an appropriate value.
- tune_grid() can fit a model at each of the values for the regularization penalty in our regular grid.
- Note that we specify save_pred = TRUE so we can create ROC curves and a confusion matrix later.
— i) Multin. Lasso Logistic Regress [logistic_model]
# Tune model
multi_lasso_rs <- tune_grid(
# A) unfitted workflow
# logistic_wf,
# B) model specification
logistic_model,
# B) recipe
multi_rec,
grid = logistic_grid,
metrics = model_metrics, # pre defined metrics
control = model_control, # pre defined control
# folds for cross-validation
resamples = multi_folds)
multi_lasso_rs
— ii) KNN [knn_grid]
# Tune model
multi_knn_rs <- tune_grid(
# A) unfitted workflow
# logistic_wf,
# B) model specification
knn_model,
# B) recipe
multi_rec,
grid = knn_grid,
metrics = model_metrics, # pre defined metrics
control = model_control, # pre defined control
# folds for cross-validation
resamples = multi_folds)
multi_knn_rs
6) Evaluate the model performance & update WF
— Accuracy metric
What do we see, in terms of performance metrics?
Even the very best “accuracy” value here is quite low (0.44), significantly lower than the binary classification model. This is expected because multiclass classification is more difficult than binary classification.
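A quick way to confirm those values (sketch):
# Best configurations by accuracy for each tuned model
show_best(multi_lasso_rs, metric = "accuracy")
show_best(multi_knn_rs, metric = "accuracy")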
— [FIG] performance metrics
— i) Multin. Lasso Logistic Regress [logistic_model]
— ii) KNN [knn_grid]
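A sketch of the plots for both models; autoplot() on a tune_grid() result shows each metric across the tuning grid.
autoplot(multi_lasso_rs) + labs(title = "Multinomial lasso: tuning results")
autoplot(multi_knn_rs) + labs(title = "KNN: tuning results")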
— choose final hyperparameters
— i) Multin. Lasso Logistic Regress [log_final_param]
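A sketch of one way to pick the final penalty (by accuracy), consistent with the inner_join() below; select_best() returns a one-row tibble with the penalty and .config.
log_final_param <- multi_lasso_rs %>%
  select_best(metric = "accuracy")
log_final_param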
— TIDY conf matrix
multi_lasso_rs %>%
collect_predictions() %>%
inner_join(log_final_param) %>%
conf_mat(truth = sector_f, estimate = .pred_class)
— [FIG] conf matrix (1st fold)
To get a more detailed view of how our classifier is performing, let us look at the confusion matrix for one of the folds (Fold01):
final_penalty <- multi_lasso_rs %>%
show_best(metric = "accuracy") %>%
slice(1) %>%
pull(penalty)
multi_lasso_rs %>%
collect_predictions() %>%
filter(penalty == final_penalty) %>%
filter(id == "Fold01") %>%
conf_mat(truth = sector_f, estimate = .pred_class) %>%
autoplot(type = "heatmap") +
scale_y_discrete(labels = function(x) str_wrap(x, 20)) +
scale_x_discrete(labels = function(x) str_wrap(x, 20))
- The diagonal is fairly well populated, which is a good sign. This means that the model generally predicted the right class.
- The off-diagonal numbers are all the failures and where we should direct our focus. It is a little hard to see these cases well since the majority class affects the scale. A trick to deal with this problem is to remove all the correctly predicted observations.
— [FIG] conf matrix (1st fold) only wrong pred
multi_lasso_rs %>%
collect_predictions() %>%
# final param
filter(penalty == final_penalty) %>%
filter(id == "Fold01") %>%
# exclude correct
filter(.pred_class != sector_f) %>%
conf_mat(truth = sector_f, estimate = .pred_class) %>%
autoplot(type = "heatmap", values = "prop") +
scale_y_discrete(labels = function(x) str_wrap(x, 20)) +
scale_x_discrete(labels = function(x) str_wrap(x, 20))
Here it is clearer where the model breaks down: “INSTIT. SUPP.”, which makes sense because such projects often created an agency FOR a specific sector!
— ii) KNN [knn_final_param]
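Analogous sketch for the KNN model:
knn_final_param <- multi_knn_rs %>%
  select_best(metric = "accuracy")
knn_final_param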
— TIDY conf matrix
— [FIG] conf matrix (1st fold)
To get a more detailed view of how the KNN classifier is performing, let us look at the confusion matrix for one of the folds (Fold01):
neighbors_final <- multi_knn_rs %>%
show_best(metric = "accuracy") %>%
slice(1) %>%
pull(neighbors)
multi_knn_rs %>%
collect_predictions() %>%
# final param
filter(neighbors == neighbors_final) %>%
filter(id == "Fold01") %>%
conf_mat(truth = sector_f, estimate = .pred_class) %>%
autoplot(type = "heatmap") +
scale_y_discrete(labels = function(x) str_wrap(x, 20)) +
scale_x_discrete(labels = function(x) str_wrap(x, 20))
— [FIG] conf matrix (1st fold) only wrong pred
As before, we remove the correctly predicted observations to focus on where the KNN classifier fails:
multi_knn_rs %>%
collect_predictions() %>%
# final param
filter(neighbors == neighbors_final) %>%
filter(id == "Fold01") %>%
# exclude correct
filter(.pred_class != sector_f) %>%
conf_mat(truth = sector_f, estimate = .pred_class) %>%
autoplot(type = "heatmap") +
scale_y_discrete(labels = function(x) str_wrap(x, 20)) +
scale_x_discrete(labels = function(x) str_wrap(x, 20))
6.b) Finalize the workflow
This I do only for the best model
— Finalize workflow
After we have the chosen parameters (final_penalty, i.e. the penalty {and possibly max_tokens}), we can finalize our earlier tunable workflow by updating it with these values.
logistic_wf_final <- logistic_wf %>%
# here it wants a tibble
finalize_workflow(parameters = log_final_param )
7) Assess the model performance on training set
— See the results [multi_fit]
Here we access the model coefficients to see which features are most important in the model. For the penalty we chose, we can see which terms contribute the most to predicting each sector.
# Fit the model to the training data
multi_fit <- fit(logistic_wf_final, data = sec_train)
— Coefficients [multi_fitted_coeff]
In this model, it seems that words contained in PDO are the most influential in predicting the sector of a project.
- “tfidf_pdo_education” appears among the top coefficients!!!
- “tfidf_pdo_health” appears among the top coefficients!!!
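A sketch of how multi_fitted_coeff could be obtained; for a glmnet engine fit, tidy() should return one row per class/term at the penalty stored in the finalized spec.
multi_fitted_coeff <- multi_fit %>%
  extract_fit_parsnip() %>%
  tidy() %>%
  filter(term != "(Intercept)", estimate != 0) %>%
  arrange(desc(abs(estimate)))
# Preview
multi_fitted_coeff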
8) Evaluate the best model on the test set [last_fit()]
We can now fit this finalized workflow on training data and finally return to our testing data.
- last_fit() emulates the process where, after determining the best model, the final fit is done on the entire training set and is then evaluated on the test set.
— metrics
multi_final_fitted <- last_fit(logistic_wf_final, sec_split)
collect_metrics(multi_final_fitted)
#1 accuracy multiclass 0.448 Preprocessor1_Model1
#2 roc_auc hand_till 0.800 Preprocessor1_Model1
#3 brier_class multiclass 0.367 Preprocessor1_Model1
— predictions
— [FIG] confusion matrix
multi_final_fitted %>%
collect_predictions() %>%
conf_mat(truth = sector_f, estimate = .pred_class) %>%
autoplot(type = "heatmap") +
scale_y_discrete(labels = function(x) str_wrap(x, 20)) +
scale_x_discrete(labels = function(x) str_wrap(x, 20)) +
theme(axis.text.x = element_text(angle = 30, hjust = 1))
— [FIG] ROC curve
pred_tibble <- multi_final_fitted %>%
collect_predictions()
# Assuming you have your results (with columns for true classes and predicted probabilities)
results <- pred_tibble # e.g., tibble with true and predicted columns
paint(results)
# Compute ROC curves (one-vs-all, per class)
roc_curve(results, truth = sector_f,
'.pred_AGR FOR FISH',
'.pred_EDUCATION' ,
'.pred_ENERGY' ,
'.pred_FINANCIAL' ,
'.pred_HEALTH' ,
'.pred_ICT' ,
'.pred_IND TRADE SERV' ,
'.pred_INSTIT. SUPP.' ,
'.pred_MINING OIL&GAS' ,
'.pred_SOCIAL PROT.' ,
'.pred_TRANSPORT' ,
'.pred_URBAN' ,
'.pred_WAT & SAN'
) %>%
autoplot()
It makes sense that the sector levels into which I collapsed more categories, and/or that have more “blurred” definitions, are the ones that are harder to predict:
- “URBAN”
- “IND TRADE SERV”
- “INSTIT. SUPP.”
- “ICT” (?)
—– 🔴🟢 HERE 🟠🟡 —-
9) Interpret the model
The output of last_fit() (multi_final_fitted) also contains a fitted model (a workflow, to be more specific) that has been trained on the training data. We can use the vip package to understand what the most important variables are in the predictions.
multi_final_fitted$.workflow
multi_final_fitted$.metrics
multi_final_fitted$.predictions
# N of observations is the same as in the `sec_test` set (1267)
nrow(multi_final_fitted$.predictions[[1]])
— Inspecting what levels of the outcome are most difficult to estimate
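A sketch of one way to check this from the test-set predictions (share of correct predictions within each true class):
multi_final_fitted %>%
  collect_predictions() %>%
  group_by(sector_f) %>%
  summarise(n = n(),
            share_correct = mean(.pred_class == sector_f)) %>%
  arrange(share_correct)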
— Inspecting the most important features for predicting the outcome
We can use the vip package to understand what the most important variables are in the predictions:
- using the extract_fit_parsnip() function from the workflows package to extract the model object from the workflow.
A quick way to extract the most important features for predicting each outcome is to use the vi() function from {vip}.
- The vi() function returns variable importance scores; for a glmnet model these are based on the absolute value of the coefficients at the chosen penalty (lambda).
library(vip)
sector_feat_imp <- extract_fit_parsnip(multi_final_fitted$.workflow[[1]]) %>%
vi(lambda = final_penalty) %>%
mutate(
Sign = case_when(Sign == "POS" ~ "More important for predicting a given sector",
Sign == "NEG" ~ "Less important for predicting a given sector",
TRUE ~ "Neutral"),
Importance = abs(Importance)) %>%
#filter(!str_detect(Variable, "boardapprovalFY")) %>%
filter(Importance != 0)
— [FIG] Variable importance for predicting the sector category
sector_feat_imp %>% group_by(Sign) %>%
top_n(30, Importance) %>%
mutate(Variable = str_replace(Variable, "tfidf_pdo_", "PDO_"),
Variable = str_replace(Variable, "env_cat_f", "ENVCAT_"),
Variable = str_replace(Variable, "regionname_", "REG_"),
Variable = str_replace(Variable, pattern = "boardapprovalFY_X" , replacement = "APPR_FY_")
) %>%
ungroup %>%
ggplot(aes(x = Importance,
y = fct_reorder(Variable, Importance),
fill = Sign
)) +
geom_col(show.legend = FALSE) +
scale_x_continuous(expand = c(0, 0)) +
facet_wrap(~Sign, scales = "free", ncol = 2) +
labs(
y = NULL,
title = "Variable importance for predicting the sector category",
#subtitle = paste0("These features are the most important in predicting a given class" )
)
— Misclassified sector
We can gain some final insight into our model by looking at observations from the test set that it misclassified. I will then select the columns with the actual outcome (sector_f) and compare them with the predicted sector: what was predicted wrong?
Let’s see for example cases labelled “AGR FOR FISH” but misclassified
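A sketch of how to pull those cases back together with their PDO text; this assumes that .row from collect_predictions() indexes the data frame that was split (sec_use) and that the text column is pdo.
multi_final_fitted %>%
  collect_predictions() %>%
  filter(sector_f == "AGR FOR FISH", .pred_class != sector_f) %>%
  mutate(pdo = sec_use$pdo[.row]) %>% # assumed indexing via .row
  select(sector_f, .pred_class, pdo)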
Interesting: they all have very brief PDOs!
same !!!
____________________
SET OF MODELS
____________________
What if I want to see them all together?
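One possible approach (not used above) is the workflowsets package, which tunes several preprocessing/model combinations in one go; a sketch reusing the objects defined earlier:
library(workflowsets)
all_models <- workflow_set(
  preproc = list(tfidf = multi_rec),
  models = list(lasso = logistic_model, knn = knn_model)
) %>%
  workflow_map(
    "tune_grid",
    resamples = multi_folds,
    grid = 10,
    metrics = model_metrics,
    control = model_control,
    seed = 123
  )
# Rank all model/recipe combinations by accuracy
rank_results(all_models, rank_metric = "accuracy")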
__________
TOPIC MODELING w ML
__________
Topic modeling is an unsupervised machine learning technique that exploratively identifies latent topics based on frequently co-occurring words.
It can identify topics or themes that occur in a collection of documents, allowing hidden patterns and relationships within text data to be discovered. It is widely applied in fields such as social sciences and humanities.
https://bookdown.org/valerie_hase/TextasData_HS2021/tutorial-13-topic-modeling.html
https://m-clark.github.io/text-analysis-with-R/topic-modeling.html
https://sicss.io/2020/materials/day3-text-analysis/topic-modeling/rmarkdown/Topic_Modeling.html
Document-Term Matrix
…
How do I include independent variables in my topic model?
https://bookdown.org/valerie_hase/TextasData_HS2021/tutorial-13-topic-modeling.html#how-do-i-include-independent-variables-in-my-topic-model
Compare PDO text v. project METADATA [CMPL 🟠]
NLP models trained on document metadata and structure can be combined with text analysis to improve classification accuracy.
STEPS
- Use document text (abstracts) as features to train a supervised machine learning model. The labeled data (documents with sector tags) will serve as training data, and the model can predict the missing sector tags for unlabeled documents.
- TEXT preprocessing (e.g. tokenization, lemmatization, stopword removal, TF-IDF)
- Convert the processed text into a numerical format using Term Frequency-Inverse Document Frequency (TF-IDF), which gives more weight to terms that are unique to a document but less frequent across the entire corpus.
- Define data features, e.g.
- Document Length: Public sector documents might be longer, more formal.
- Presence of Certain Keywords: Use specific keywords that correlate with either the public or private sector.
- Sector Tags: In documents where the “sector tag” is present, you can use it as a feature for training.
- Predicting Missing Sector Tags (Classification):
- Use models like Logistic Regression for binary classification (e.g., public vs. private), or Random Forest / XGBoost for a more complex tagging scheme (e.g., multiple sector categories).
- Cross-validation: Ensure the model generalizes well by validating with the documents that already have the sector tag filled in.
- Evaluate the model: Use metrics like accuracy, precision, recall, and F1 score to evaluate the model’s performance.
# #| eval: FALSE
# #| echo: FALSE
# library(tidytext) # Text Mining using 'dplyr', 'ggplot2', and Other Tidy Tools
# library(dplyr) # A Grammar of Data Manipulation
# library(tidyr) # Tidy Messy Data
# library(caret) # Classification and Regression Training
#
# # ----- 1. Tokenization and stopwords removal using tidytext.
# # Assuming df is your dataframe with "abstract" and "sector_tag"
# # Keep a stable document id for the joins below
# df <- df %>% mutate(document_id = row_number())
#
# # Tokenize the text and remove stopwords
# tidy_abstracts <- df %>%
#   unnest_tokens(word, abstract) %>%
#   anti_join(stop_words) # Remove stopwords
#
# # Optional: Apply stemming (you can also use `SnowballC` if you prefer)
# tidy_abstracts <- tidy_abstracts %>%
# mutate(word = SnowballC::wordStem(word))
#
# # ----- 2. Document-term matrix or TF-IDF calculation using bind_tf_idf().
# # Create a term-frequency matrix
# abstract_dtm <- tidy_abstracts %>%
#   count(document_id, word) %>% # count words within each document
# cast_dtm(document_id, word, n)
#
# # Alternatively, use TF-IDF weighting
# abstract_tfidf <- tidy_abstracts %>%
#   count(document_id, word) %>%
# bind_tf_idf(word, document_id, n)
#
# # ----- 3. Model training using caret (Random Forest, Logistic Regression, etc.).
# # Split data into training (with sector tags) and testing (missing sector tags)
# train_data <- df[!is.na(df$sector_tag), ]
# test_data <- df[is.na(df$sector_tag), ]
#
# # Combine the DFM or TF-IDF with the training dataset
# train_tfidf <- abstract_tfidf %>%
#   filter(document_id %in% train_data$document_id) %>%
# spread(word, tf_idf, fill = 0)
#
# # Merge with sector tags
# train_tfidf <- left_join(train_tfidf, train_data, by = "document_id")
#
# # Prepare for machine learning by ensuring you have sector tags in the final dataset
#
# # ----- 4. Prediction of missing sector tags based on the trained model.
# # Random Forest Model
# model <- train(sector_tag ~ ., data = train_tfidf, method = "rf")
#
# # Predict missing sector tags for the test data
# test_tfidf <- abstract_tfidf %>%
#   filter(document_id %in% test_data$document_id) %>%
# spread(word, tf_idf, fill = 0)
#
# # Predict sector tags for the missing observations
# predicted_tags <- predict(model, newdata = test_tfidf)
#
# # Add the predicted sector tags to the original dataset
# df$sector_tag[is.na(df$sector_tag)] <- predicted_tags
#
# # ----- 5. Evaluate and Refine the Model
# confusionMatrix(predicted_tags, test_data$sector_tag)
— I could check whether this corresponds to the sector flags in the project metadata
(more missing values, but more objective!)
Topic modeling with LDA (Latent Dirichlet Allocation)
Topic modeling algorithms like Latent Dirichlet Allocation (LDA) can be applied to automatically uncover underlying themes within a corpus. The detected topics may highlight key terms or subject areas that are strongly associated with either the public or private sector.
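A minimal sketch, assuming the PDO text column is pdo and using the topicmodels package (one common choice, not prescribed by the text):
library(topicmodels)
# Document-term matrix from the PDO text
pdo_dtm <- projs_train %>%
  mutate(doc_id = row_number()) %>%
  unnest_tokens(word, pdo) %>%
  anti_join(get_stopwords(), by = "word") %>%
  count(doc_id, word) %>%
  cast_dtm(doc_id, word, n)
# Fit an LDA model; k = 8 topics is an arbitrary assumption
pdo_lda <- LDA(pdo_dtm, k = 8, control = list(seed = 123))
# Top terms per topic
tidy(pdo_lda, matrix = "beta") %>%
  group_by(topic) %>%
  slice_max(beta, n = 10) %>%
  ungroup()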
Named Entity Recognition using CleanNLP and spaCy
NER is especially useful for analyzing unstructured text.
NER can identify key entities (organizations, people, locations) mentioned in the text. By tracking which entities appear frequently (e.g., government agencies vs. corporations), it’s possible to categorize a document as more focused on the public or private sector.
— Summarise the tokens by parts of speech
# Initialize the spacy backend
cnlp_init_spacy()
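A sketch of the annotation step, assuming the PDO text column is pdo and that a project id column (here called proj_id, hypothetical) serves as the document id:
# Annotate the PDO text (can be slow on large corpora)
pdo_anno <- cnlp_annotate(projs_train, text_name = "pdo", doc_name = "proj_id")
# Summarise tokens by part of speech
pdo_anno$token %>%
  count(upos, sort = TRUE)
# Named entities found by spaCy (organizations, locations, etc.)
pdo_anno$entity %>%
  count(entity_type, sort = TRUE)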
__________
STRUCTURAL TOPIC MODELING (STM)
__________
The Structural Topic Model is a general framework for topic modeling with document-level covariate information. The covariates can improve inference and qualitative interpretability and are allowed to affect topical prevalence, topical content or both.
MAIN REFERENCE: stm R package http://www.structuraltopicmodel.com/
EXAMPLE UN corpus: https://content-analysis-with-r.com/6-topic_models.html
STM 1/2: https://jovantrajceski.medium.com/structural-topic-modeling-with-r-part-i-2da2b353d362
STM 2/2: https://jovantrajceski.medium.com/structural-topic-modeling-with-r-part-ii-462e6e07328
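A minimal sketch with the stm package, using quanteda (an assumption, not loaded above) to build the inputs and regionname as a prevalence covariate:
library(stm)
library(quanteda)
# Document-feature matrix from the PDO text, keeping project metadata as docvars
pdo_corpus <- corpus(projs_train, text_field = "pdo")
pdo_dfm <- pdo_corpus %>%
  tokens(remove_punct = TRUE) %>%
  tokens_remove(stopwords("en")) %>%
  dfm()
# STM with topical prevalence depending on a document-level covariate
pdo_stm <- stm(
  documents = pdo_dfm,
  K = 10, # number of topics is an assumption
  prevalence = ~ regionname,
  data = docvars(pdo_dfm),
  verbose = FALSE
)
# Top words per topic
labelTopics(pdo_stm)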
BERTopic
Developed by Maarten Grootendorst, BERTopic enhances the process of discovering topics by using document embeddings and a class-based variation of Term Frequency-Inverse Document Frequency (TF-IDF).
https://medium.com/@supunicgn/a-beginners-guide-to-bertopic-5c8d3af281e8
__________
(Dynamic) TOPIC MODELING OVER TIME
__________
Example: An analysis of Peter Pan using the R package koRpus https://ladal.edu.au/topicmodels.html#Topic_proportions_over_time
RENDER this
quarto render analysis/01c_WB_project_pdo_feat_class.qmd --to html
open ./docs/analysis/01c_WB_project_pdo_feat_class.html