# Pckgs -------------------------------------
library(fs) # Cross-Platform File System Operations Based on 'libuv'
library(tidyverse) # Easily Install and Load the 'Tidyverse'
library(janitor) # Simple Tools for Examining and Cleaning Dirty Data
library(skimr) # Compact and Flexible Summaries of Data
library(here) # A Simpler Way to Find Your Files
library(paint) # paint data.frames summaries in colour
library(readxl) # Read Excel Files
library(kableExtra) # Construct Complex Table with 'kable' and Pipe Syntax
library(glue) # Interpreted String Literals
# ML & Text Mining -------------------------------------
library(tidymodels) # Easily Install and Load the 'Tidymodels' Packages
library(textrecipes) # Extra 'Recipes' for Text Processing
library(tidytext) # Text Mining using 'dplyr', 'ggplot2', and Other Tidy Tools
library(SnowballC) # Snowball Stemmers Based on the C 'libstemmer' UTF-8 Library
library(rvest) # Easily Harvest (Scrape) Web Pages
library(cleanNLP) # A Tidy Data Model for Natural Language Processing
library(themis) # Extra Recipes for Dealing with Unbalanced Classes
library(discrim) # Model Wrappers for Discriminant Analysis
set.seed(123) # for reproducibility
WB Project PDO features classification
WORK IN PROGRESS! (Please expect unfinished sections and unpolished code. Feedback is welcome!)
Set up
# 1) --- Set the font as the default for ggplot2
# Who else? https://datavizf24.classes.andrewheiss.com/example/05-example.html
lulas_theme <- theme_minimal(base_size = 12) +
theme(panel.grid.minor = element_blank(),
# Bold, bigger title
plot.title = element_text(face = "bold", size = rel(1.6)),
# Plain, slightly bigger subtitle that is grey
plot.subtitle = element_text(face = "plain", size = rel(1.4), color = "#A6A6A6"),
# Italic, smaller, grey caption that is left-aligned
plot.caption = element_text(face = "italic", size = rel(0.7),
color = "#A6A6A6", hjust = 0),
# Bold legend titles
legend.title = element_text(face = "bold"),
# Bold, slightly larger facet titles that are left-aligned for the sake of repetition
strip.text = element_text(face = "bold", size = rel(1.1), hjust = 0),
# Bold axis titles
axis.title = element_text(face = "bold"),
# Change X-axis label size
axis.text.x = element_text(size = rel(1.4)),
# Change Y-axis label size
axis.text.y = element_text(size = 14),
# Add some space above the x-axis title and make it left-aligned
axis.title.x = element_text(margin = margin(t = 10), hjust = 0),
# Add some space to the right of the y-axis title and make it top-aligned
axis.title.y = element_text(margin = margin(r = 10), hjust = 1),
# Add a light grey background to the facet titles, with no borders
strip.background = element_rect(fill = "grey90", color = NA),
# Add a thin grey border around all the plots to tie in the facet titles
panel.border = element_rect(color = "grey90", fill = NA))
# 2) --- use
# ggplot + lulas_theme
Load data
__________
PREDICT MISSING FEATURES
__________
What ML models work with text?
Remember that text data is SPARSE!
To predict a missing feature (e.g., sector) based on available features from text data, several supervised machine learning algorithms can be applied. Given that we have a mixture of text and structured data, here are some suitable algorithms:
- Logistic Regression / Multinomial Logistic Regression: If you’re predicting a categorical variable like “sector”, logistic regression can work well, especially with appropriate feature engineering for text (e.g., converting text data into numeric features using TF-IDF or word embeddings).
- Lasso regression/classification: learns how much of a penalty to put on some features (sometimes penalizing all the way down to zero) so that we can select only some features out of the high-dimensional space of original possible variables (tokens) for the final model.
- k-Nearest Neighbors (k-NN): k-NN can be useful for text data, especially when you have a mix of structured and unstructured data. It’s a simple algorithm that can work well with text data, but it can be computationally expensive.
- Decision Trees / Random Forests: These algorithms handle both numeric and categorical data efficiently and can manage missing values quite well. You can input text-based features as well, though you might need to preprocess the text into numeric form (e.g., using embeddings).
- Naive Bayes: Naive Bayes is a simple and efficient algorithm for text classification. It assumes feature independence, which may not always hold, but it’s often a good baseline, particularly with short texts.
- Support Vector Machines (SVMs): SVMs are useful when you have high-dimensional data, which is common with text after feature extraction (like TF-IDF). They can perform well with a mix of structured and unstructured data.
All available models are listed at parsnip.
Some model parameters can be learned from the data during fitting/training. Some CANNOT 😱: these are the hyperparameters of a model, and we estimate them by training lots of models with different hyperparameter values and comparing them.
— Check missing feature
sum(!is.na(projs_train$pdo))
sum(!is.na(projs_train$sector1)) / 4403 # 99%
sum(!is.na(projs_train$regionname)) / 4403 # 100%
sum(!is.na(projs_train$countryname)) / 4403 # 100%
sum(!is.na(projs_train$status)) / 4403 # 100%
sum(!is.na(projs_train$lendinginstr)) / 4403 # 99%
sum(!is.na(projs_train$ESrisk)) / 4403 # 9.6%
sum(!is.na(projs_train$curr_total_commitment)) / 4403 # 100%
sum(!is.na(projs_train$theme1)) /4403 # 71%
table(projs_train$theme1, useNA = "ifany") # 71 levels
sum(!is.na(projs_train$env_cat)) / 4403 # 73%
table(projs_train$env_cat , useNA = "ifany") # 7 levels
# source function
source(here("R","f_recap_values.R"))
# check candidate labels for classification
f_recap_values(projs_train, c("sector1", "theme1","env_cat","ESrisk" ) ) %>% kable()
| skim_variable | total_rows | n_distinct | n_missing | missing_perc |
|---|---|---|---|---|
| ESrisk | 4403 | 5 | 3980 | 90.4% |
| theme1 | 4403 | 73 | 1254 | 28.5% |
| env_cat | 4403 | 8 | 1167 | 26.5% |
| sector1 | 4403 | 71 | 17 | 0.4% |
— Identify features for classification
These could be:
- features derived from raw text (e.g. characters, words, ngrams, etc.),
- feature vectors (e.g. word embeddings), or
- meta-linguistic features (e.g. part-of-speech tags, syntactic parses, or semantic features)
How do we use them?
- Do we use raw token counts?
- Do we use normalized frequencies?
- Do we use some type of weighting scheme? ✅ yes, we use tf-idf (a weighting scheme which downweights words that are common across all documents and upweights words that are unique to a document)
- Do we use some type of dimensionality reduction? ✅

# TEXT CLASSIFICATION for Environmental Assessment Category
____________________________________________________________
———————- BINARY OUTCOME ——————-
____________________________________________________________
One candidate variable that could be predicted is the env_cat variable (“Environmental Assessment Category”). This is a categorical variable with 7 levels (A, B, C, F, H, M, U), but, to simplify, I converted it into a binary outcome defined as in Table 1.
#__________ # COMMON STEPS
1) Select and engineer the features [using projs_train]
Recode env_cat variable
projs_train <- projs_train %>%
# useful for later
rename(., proj_id = "id") %>%
# recode as factors
mutate (across (c(status, boardapprovalFY, regionname, countryname, sector1,
theme1, lendinginstr), as.factor)) %>%
# env risk category 7 levels
mutate(env_cat_f = fct_na_value_to_level(env_cat, level = "Missing")) %>%
mutate(env_cat_f = fct_recode(env_cat_f,
"A_high risk" = "A",
"B_med risk" = "B",
"C_low risk" = "C",
"F_fin expos" = "F",
"Other" = "H",
"Other" = "M",
"Other" = "U",
"Missing" = "Missing" )) %>%
relocate(env_cat_f , .after = env_cat) %>%
# collapse env_cat_f into 2 levels
mutate(env_cat_f2 = fct_collapse(env_cat_f,
"High-Med-risk" = c("A_high risk", "B_med risk"),
"Low-risk_Othr" = c("C_low risk", "F_fin expos", "Other", "Missing")
)) %>%
relocate(env_cat_f2, .after = env_cat_f)
tabyl(projs_train, env_cat, show_na = TRUE) # 7 levels
tabyl(projs_train, env_cat_f, show_na = TRUE) # 6 levels
tabyl(projs_train, env_cat_f2, show_na = TRUE) # 2 levels
# Show as contingency table env_cat_f and env_cat_f2
env_cat_f_f2_k <- table(projs_train$env_cat_f, projs_train$env_cat_f2) %>%
kable() %>%
kable_styling("striped", full_width = F)
env_cat_f_f2_k
saveRDS(env_cat_f_f2_k, here("analysis", "output", "tables", "env_cat_f_f2_k.rds"))
|  | High-Med-risk | Low-risk_Othr |
|---|---|---|
| A_high risk | 351 | 0 |
| B_med risk | 1830 | 0 |
| C_low risk | 0 | 880 |
| F_fin expos | 0 | 127 |
| Other | 0 | 48 |
| Missing | 0 | 1167 |
Recode sector1 variable
(this I use later as the MULTI-CLASS outcome)
# !!!!! `sector_f` is different from `tok_sector_broad` BECAUSE it is based on `sector1` !!!!
projs_train <- projs_train %>% # sector1 99 levels
mutate(sector_f = case_when(
str_detect(sector1, regex("water|wastewater|sanitat|Sewer|Irrigat|Drainag", ignore_case = T)) ~ "WAT & SAN",
str_detect(sector1, regex("transport|railway|road|airport|waterway|bus|metropolitan|inter-urban|aviation", ignore_case = T)) ~ "TRANSPORT",
sector1 == "port" ~ "TRANSPORT",
str_detect(sector1, regex("urban|housing|inter-urban|peri-urban|waste", ignore_case = T)) ~ "URBAN",
str_detect(sector1, regex("energ|electri|hydroele|hydropow|renewable|transmis", ignore_case = T)) ~ "ENERGY", # Matches either "energy" or "power"
str_detect(sector1, regex("health|hospital|medicine|drugs|epidem|pandem|covid-19|vaccin", ignore_case = T)) ~ "HEALTH",
str_detect(sector1, regex("educat|school|vocat|teach|univers|student|literacy|training|curricul", ignore_case = T)) ~ "EDUCATION",
str_detect(sector1, regex("Agricultural|Agro|Fish|Forest|Crop|livestock|agri-business", ignore_case = T)) ~ "AGR FOR FISH",
str_detect(sector1, regex("Minin|oil|gas|mineral|Extract", ignore_case = T)) ~ "MINING OIL&GAS",
str_detect(sector1, regex("Social Protec", ignore_case = T)) ~ "SOCIAL PROT.",
str_detect(sector1, regex("Bank|finan|Investment", ignore_case = T)) ~ "FINANCIAL",
str_detect(sector1, regex("Information|Communication|ICT|Internet|Technologies", ignore_case = T)) ~ "ICT",
str_detect(sector1, regex("Tourism|Trade and Services|Manuf|Other Industry|Trade and Services", ignore_case = T)) ~ "IND TRADE SERV",
str_detect(sector1, regex("Government|Public Admin|Institution|Central Agenc|Sub-national Gov|law|justice|governance", ignore_case = T)) ~ "INSTIT. SUPP.",
TRUE ~ "Missing")) %>%
relocate(sector_f, .after = sector1)
tabyl(projs_train, sector1, show_na = TRUE) # 99 levels
tabyl(projs_train, sector_f, show_na = TRUE) # 7 levels
#tabyl(projs_train, tok_sector_broad, show_na = TRUE) # 7 levels
2) Split sample based on env_cat variable
— [1/2] Proportional sub-set dplyr::slice_sample
# # Calculate proportions of missing and non-missing values in env_cat
# proportion_missing <- projs_train %>%
# mutate(class_status = ifelse(is.na(env_cat), "Missing", "Not Missing")) %>%
# count(class_status) %>%
# mutate(prop = n / sum(n))
#
# # Ensure the number of samples adds up to 500
# set.seed(123) # For reproducibility
#
# # Define the number of samples for each group (rounding and adjusting for total = 500)
# n_samples <- round(500 * proportion_missing$prop)
# n_samples[1] <- 500 - sum(n_samples[-1]) # Adjust to ensure total is exactly 500
#
# # Split the data into missing and non-missing groups
# missing_group <- projs_train %>%
# filter(is.na(theme1)) %>%
# slice_sample(n = n_samples[proportion_missing$class_status == "Missing"])
#
# not_missing_group <- projs_train %>%
# filter(!is.na(theme1)) %>%
# slice_sample(n = n_samples[proportion_missing$class_status == "Not Missing"])
#
# # Combine the two groups
# projs_train_smpl <- bind_rows(missing_group, not_missing_group)
#
# # View the sample
# print(projs_train_smpl)
— [2/2] Proportional sub-set rsample::initial_split
Based on availability of env_cat_f.
We will use the strata argument to stratify the data by the outcome variable (env_cat_f). This will ensure that the training and validation sets have the same proportion of each outcome level.
# Create a stratified split based on missing vs non-missing env_cat
projs_train %>% tabyl(env_cat_f) # 6 levels
# Split BUT only "Not Missing" `env_cat_f`
## --- 0) THIS WILL BE 4 TRAINING & VALIDATION
env_cat_use <- projs_train %>%
filter(env_cat_f != "Missing") # 3236 proj
# SPLIT INTO TRAINING, VALIDATION
set.seed(123) # Ensure reproducibility
env_split <- initial_split(env_cat_use, prop = 0.7, # 70% training, 30% testing
strata = env_cat_f) # stratify by OUTCOME
## -- 1) for training (labelled `env_cat_f`)
env_cat_train <- training(env_split) # 2265 proj
## -- 2) for validation (labelled `env_cat_f`)
env_cat_test <- testing(env_split) # 971 proj
# # UNLABELLED PORTION
## -- 3) for actual test (UNlabelled `env_cat_f`)
env_cat_missing <- projs_train %>%
filter(env_cat_f == "Missing") # 1167 proj
# check distribution of `env_cat_f` in training and validation
tabyl(env_cat_train, env_cat_f) |> adorn_totals("row") |> adorn_pct_formatting(digits = 1)#
tabyl(env_cat_test, env_cat_f)|> adorn_totals("row") |> adorn_pct_formatting(digits = 1)#
rm( env_cat_use, env_cat_missing)
# # Rebalance the classes
# tabyl(env_cat_train, env_cat_f) |> adorn_totals("row") |> adorn_pct_formatting(digits = 1)
#
# ds_rec <- recipe(env_cat_f ~ ., data = env_cat_train) %>%
# step_downsample(env_cat_f , under_ratio = 3, trained = FALSE) %>%
# prep()
#
# training <- ds_rec %>%
# bake(new_data = NULL) %>%
# count(env_cat_f, name = "training")
# training
#
#
# env_cat_train %>% count(env_cat_f)
#__________
MODEL-SPECIFIC STEPS
0. LASSO classification - log w/ text with tf-idf
Lasso regression or classification learns how much of a penalty to put on some features (sometimes penalizing all the way down to zero) so that we can select only some features out of the high-dimensional space of original possible variables (tokens) for the final model.
3) Define preprocessing steps
Prep recipe [env_recipe]
{textrecipes} provides a number of step functions for pre-processing text data. These include functions to tokenize (e.g. step_tokenize()), remove stop words (e.g. step_stopwords()), and derive meta-features (e.g. step_lemma(), step_stem(), etc.). Furthermore, there are functions to engineer features in ways that are particularly relevant to text data, such as feature frequencies and weights (e.g. step_tf(), step_tfidf(), etc.) and token filtering (e.g. step_tokenfilter()).
- step_tokenize()
- step_tfidf() with smooth_idf = FALSE (no smoothing term is added, so terms that appear in many or all documents are downweighted more heavily than they would be with smoothing)
# add step to recipe
env_recipe <- recipe (formula = env_cat_f2 ~ pdo,
data = env_cat_train) %>%
step_tokenize(pdo) %>% # tokenize
step_tfidf(pdo, smooth_idf = FALSE) # tf-idf creates matrix of term frequencies weighted by inverse document frequency
# Review the recipe
env_recipe
# Run the recipe
env_recipe_bake <- env_recipe %>%
# chooses the parameters for the recipe based on the data
prep(training = NULL) %>%
# applies the recipe to the data
bake(new_data = NULL)
# preview the baked recipe -- TOO SPARSE
dim(env_recipe_bake)
#[1] 2264 7516
The resulting engineered-features data frame has 2264 observations and 7516 variables! I.e. for each PDO text, only a small subset of tokens will actually appear, so most of the cells are filled with zeros. This is what is known as a sparse matrix. Furthermore, the more features we have, the more likely these features will capture the idiosyncrasies of these particular texts, increasing the risk of overfitting the model.
Improve recipe [env_recipe]
- We can filter out features by a stopword list, or
- filter by frequency of occurrence
# -- Rebuild recipe with tokenfilter step
env_recipe <- recipe (formula = env_cat_f2 ~ pdo,
data = env_cat_train) %>%
# tokenize
step_tokenize(pdo) %>%
# filter by frequency of occurrence
step_tokenfilter(pdo, max_tokens = 100) %>%
# tf-idf creates matrix of weighted term frequencies
step_tfidf(pdo, smooth_idf = FALSE)
# -- Run the recipe
env_recipe_bake <- env_recipe %>%
# chooses the parameters for the recipe based on the data
prep(training = NULL) %>%
# applies the recipe to the data
bake(new_data = NULL)
# -- preview the baked recipe
dim(env_recipe_bake)
#[1] 2264 7516 --> #[1] 2264 101
# subset check
env_recipe_bake[1:5, 1:10 ]
4) Select classification model
Logistic Regr Model [env_spec]
Let’s start with a simple logistic regression model to see how well we can classify the texts in the training set with the features we have engineered. We will use the parsnip::logistic_reg() function to specify the logistic regression model. We then select the implementation engine (glmnet, regularized generalized linear models) and the mode of the model (classification).
— Model specification
# Create a model specification
env_spec <-
# lasso regularized model
parsnip::logistic_reg(
# non-negative number ~ the total amount of regularization
penalty = tune(), # 0 = no penalty, 1 = max
# number between zero and one (inclusive)
mixture = 1 # specifies a pure lasso model,
) %>%
set_mode("classification") %>%
set_engine("glmnet")
# Preview
env_spec
Create workflow [env_wf]
Build the resulting workflow (a container object that aggregates the information required to fit and predict from a model) by adding the recipe and the model to it.
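The chunk that builds env_wf is not shown above; a minimal sketch, assuming the env_recipe and env_spec objects defined earlier:

```r
# Sketch: combine the recipe and the model specification into a workflow
env_wf <- workflow() %>%
  add_recipe(env_recipe) %>%
  add_model(env_spec)

# Preview
env_wf
```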
— Evaluate so far
# set.seed(123)
# # 1. Define the resampling folds
# env_folds <- vfold_cv(env_cat_train)
#
# # 2. Define a tuning grid for the parameters
# grid <- grid_regular(penalty(), mixture(), levels = 10) # Adjust the levels as needed
#
# env_rs <- fit_resamples(
# env_wf, # Workflow
# resamples = env_folds, # Cross-validation folds
# grid = grid, # Tuning grid
# control = control_resamples(save_pred = TRUE)
# )
5) Tuning hyperparameters
Different algorithms have different parameters that can be adjusted and that affect the performance of the model. These hyperparameters (≠ parameters learned from the features) need tuning, which is typically done by fitting the model to the training set under different values and evaluating its performance.
— Penalty tuning [env_grid]
In logistic regression, the penalty hyperparameter is like a control that helps prevent the model from becoming too complex and overfitting to the training data. There are two common types of penalties:
L1 (Lasso): Encourages the model to use fewer features by making some of the coefficients exactly zero. This can simplify the model.
L2 (Ridge): Tries to keep all the coefficients small but not exactly zero, which can help stabilize the model and avoid overfitting.
- The logistic regression model using glmnet can be tuned to prevent overfitting by adjusting the penalty and mixture (combination of L1 and L2) hyperparameters.
- In our env_spec model, tune() was a placeholder for a range of values for the penalty hyperparameter.
- To tune the penalty hyperparameter, we use the grid_regular() function from {dials} to specify a grid of values to try.

The package dials contains infrastructure to create and manage values of tuning parameters for the tidymodels packages.
# Create a grid of values for the penalty hyperparameter (random set of 10 values)
env_grid <- dials::grid_regular(
penalty(), levels = 10)
# Preview
env_grid
# penalty values range from 1e-10 (essentially no penalty)
# ...
# up to 1 (the maximum of the default range)
— K-fold cross-validation (for optimal penalty value) [env_vfold, env_tune]
Now, to perform the tuning and choose an optimal value for penalty, we need to create a tuning workflow. We use the resampling strategy known as k-fold cross-validation (splitting env_cat_train into multiple training/testing sets) to arrive at the optimal value for the penalty hyperparameter.
The env_tune object contains the results of the tuning for each fold. We can see them by calling the collect_metrics() function on the env_tune object.
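The chunk that creates the folds and runs the grid search is not shown; a minimal sketch, assuming the env_wf workflow and the env_grid defined above:

```r
# Sketch: resample the training set and tune the penalty over the grid
set.seed(123)
# 10-fold cross-validation splits of env_cat_train
env_vfold <- rsample::vfold_cv(env_cat_train, v = 10)

env_tune <- tune::tune_grid(
  object = env_wf,          # workflow (recipe + tunable model)
  resamples = env_vfold,    # CV folds
  grid = env_grid,          # candidate penalty values
  control = control_grid(save_pred = TRUE)
)
```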
# Collect the results of the tuning
env_tune_metrics <- tune::collect_metrics(env_tune)
# visualize the results
autoplot(env_tune) +
labs(
title = "Lasso model performance across regularization penalties",
subtitle = "Performance metrics can be used to identify the best penalty"
)
# in roc_auc: many of the penalty values performed similarly, with a drop-off in performance at the higher values
The most common metrics for model performance in classification are accuracy and the area under the receiver operating characteristic curve (ROC-AUC). Accuracy is simply the proportion of correct predictions. The ROC-AUC provides a single score that summarizes how well the model can distinguish between classes: the closer to 1, the more discriminative power the model has.
Conveniently, the show_best() function from {tune} takes a tune_grid object and returns the best performing hyperparameter values.
# Show the best hyperparameter values
show_best(env_tune, metric = "roc_auc") # 0.779
# Make selection programmatically
env_best <- select_best(env_tune, metric ="roc_auc")
env_best_acc <- select_best(env_tune, metric ="accuracy")
6) Update model specification and workflow with best HP
Update workflow [env_wf_lasso]
Now we can update the model specification and workflow with the best performing hyperparameter value, using the previous env_wf workflow and the finalize_workflow() function.
# Update the workflow/model specification
env_wf_lasso <- env_wf %>%
tune::finalize_workflow(env_best)
# Preview the updated workflow object (with the selected penalty param ≈ 0.00599)
env_wf_lasso
Instead of penalty = tune() like before, now our workflow has finalized values for all arguments.
7) Assess the model performance on training set
— See the results [env_lasso_fit]
Here we access the model coefficients to see which features are most important in the model. We see, for the penalty we chose, which terms contribute the most to an environmental category NOT being high risk.
# Fit the model to the training data
env_lasso_fit <- fit (env_wf_lasso, data = env_cat_train)
— Coefficients [enf_fitted_coeff]
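The coefficient-extraction chunk is not shown; a minimal sketch, assuming env_lasso_fit from the chunk above (the object name enf_fitted_coeff is kept from the heading):

```r
# Sketch: coefficients of the fitted lasso model at the chosen penalty
enf_fitted_coeff <- env_lasso_fit %>%
  extract_fit_parsnip() %>%   # pull the parsnip/glmnet fit out of the workflow
  tidy() %>%                  # term, estimate (log-odds), penalty
  arrange(desc(abs(estimate)))

enf_fitted_coeff
```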
— [FIG] Assessing performance on training set
# Example of a data frame containing actual and predicted values
pred_long <- predict(env_lasso_fit, new_data = env_cat_train , type = "prob")|>
bind_cols(env_cat_train) %>%
select(env_cat_f2, pred_high_risk = '.pred_High-Med-risk', pred_low_risk = '.pred_Low-risk_Othr') %>%
pivot_longer(cols = c(pred_high_risk, pred_low_risk),
names_to = "risk_type", values_to = "risk_value")
# Plot the predictions with boxplots and jittered points without duplicate legends
ggplot(pred_long, aes(x = env_cat_f2, y = risk_value, fill = risk_type)) +
geom_boxplot(alpha = 0.4, position = position_dodge(width = 0.8)) +
#geom_jitter(alpha = 0.6, position = position_dodge(width = 0.8)) + # Remove width and use position_dodge
labs(title = "Predicted High and Low Risk Distribution by Env Category",
subtitle = "Model: Lasso Regression fitted on training data",
x = "ACTUAL",
y = "PREDICTED",
fill = "Risk Type") + # Set label for fill legend
theme_minimal() +
guides(color = "none") # Suppress the color legend
— Assess the model w cross-valid [env_lasso_cv]
The next step is to assess the performance of the model (in workflow env_wf_lasso) on the training set, given the features we have engineered, the algorithm we have selected, and the hyperparameters we have tuned.
Instead of evaluating the model on the training set directly, we will use cross-validation on the training set to gauge the variability of the model (the model’s performance on the entire training set at once is not a reliable indicator of its performance on new data). In fact:
- Cross-validation estimates the model’s performance on new data by simulating the process of training and testing the model on different subsets of the training data (from env_vfold).
- The function tune::fit_resamples() fits the model to the training data and evaluates its performance.
# (this is similar to env_tune)
set.seed(123)
# Cross validate the (optimized) workflow
env_lasso_cv <- env_wf_lasso %>%
tune::fit_resamples(
# 10 fold cross validation splits
resamples = env_vfold,
# save predictions for confusion matrix
control = control_resamples(save_pred = TRUE)
)
We want to aggregate the metrics across the folds to get a sense of the variability of the model. The collect_metrics() function takes the results of a cross-validation and returns a data frame with the metrics.
env_lasso_cv[[3]][[1]] # 1st split
env_lasso_cv[[3]][[3]] # 3rd split
# Collect the results of the cross-validation (this is the average from the 10 splits!)
collect_metrics(env_lasso_cv)
# accuracy binary 0.745 10 0.00923 Preprocessor1_Model1
# brier_class binary 0.175 10 0.00462 Preprocessor1_Model1
# roc_auc binary 0.778 10 0.0111 Preprocessor1_Model1
From the accuracy and ROC-AUC metrics it appears we have a decent candidate model, but there is room for potential improvement. A good next step is to evaluate the model errors and see if there are any patterns that can be addressed before considering what approach to take to improve the model.
— [FIG] Visualize the confusion matrix
The confusion matrix is a table that shows the number of true positives, true negatives, false positives, and false negatives. It is a useful tool for understanding the model’s errors and can help identify patterns that can be addressed to improve the model.
- The conf_mat_resampled() function takes a fit_resamples object (with predictions saved) and, with tidy = FALSE, returns the confusion matrix averaged over the folds. We can pass this to the autoplot() function to plot it. This is done on the cross-validation results env_lasso_cv.
# Plot the confusion matrix
env_lasso_cv %>%
tune::conf_mat_resampled(tidy = FALSE) %>%
autoplot(type = "heatmap")
# 137.2 = true positives
# 33.6 = true negatives
# 40.5 = false positives
# 15.1 = false negatives
There are more false positives (low risk predicted to be high risk) than false negatives. (This is a common issue in imbalanced datasets and can be addressed by adjusting the decision threshold of the model.)
__________
00. NULL model - log w/ text with tf-idf
__________
We can assess a model like this one by comparing its performance to a “null model” or baseline model, a simple, non-informative model that always predicts the largest class for classification. Such a model is perhaps the simplest heuristic or rule-based alternative that we can consider as we assess our modeling efforts.
Workflow and specification for NULL MODEL
We can build a classification null_model() specification and add it to a workflow() with the same preprocessing recipe we used in the previous section, to estimate performance.
# Create a NULL model
null_classification <- null_model() %>%
set_engine("parsnip") %>%
set_mode("classification")
null_rs <- workflow() %>%
add_recipe(env_recipe) %>%
add_model(null_classification) %>%
fit_resamples(
env_vfold
)
Assess the model performance on training set
What results do we obtain from the null model, in terms of performance metrics?
collect_metrics(null_rs)
# 1 accuracy binary 0.673 10 0.00750 Preprocessor1_Model1
# 2 brier_class binary 0.220 10 0.00258 Preprocessor1_Model1
# 3 roc_auc binary 0.5 10 0 Preprocessor1_Model1
The accuracy and ROC AUC indicate that this null model is dramatically worse than even our first model.
OTHER ATTEMPTS to improve the model
__________
1. LASSO classification - log w/ text with tf-idf + stopwords
To improve supervised learning models, consider:
- Engineering the features differently
  - we set a token filter to limit the number of features to 100, which we could adjust (max_tokens)
- Selecting different (or additional) features
- Changing the algorithm
- Tuning the hyperparameters differently
[🤞🏻] Add stopwords exclusion to the recipe
3) Define preprocessing steps (NEW!)
— Improve recipe [env_stop_recipe]
# --- Create a recipe with a token filter step that excludes stopwords
# Rebuild recipe with tokenfilter step
env_stop_recipe <- recipes::recipe (
formula = env_cat_f2 ~ pdo,
data = env_cat_train) %>%
step_tokenize(pdo) %>% # tokenize
# remove CUSTOM stopwords
step_stopwords(pdo, custom_stopword_source = stop_vector) %>%
step_tokenfilter(pdo, max_tokens = 100) %>% # filter by frequency of occurrence
step_tfidf(pdo, smooth_idf = FALSE) # tf-idf creates matrix of weighted term frequencies
# prep and bake the recipe
env_stop_recipe_bake <- env_stop_recipe %>%
prep() %>%
bake(new_data = NULL)
# preview the baked recipe
dim(env_stop_recipe_bake)
#[1] 2264 101
env_recipe_bake[1:5, 1:10]
env_stop_recipe_bake[1:5, 1:10]
4) Select a classification algorithm (same)
— Model specification
# Create a model specification
env_spec <-
# generalized linear model for binary outcomes
parsnip::logistic_reg(
# A non-negative number representing the total amount of regularization
penalty = tune(), # 0 = no penalty, 1 = max
#A number between zero and one (inclusive)
mixture = 1 # specifies a pure lasso model,
) %>%
set_engine("glmnet")
##### tune() IS A PLACEHOLDER
# Preview
env_spec
— Create workflow [env_stop_wf]
env_stop_recipe is actually the part that changed in this workflow, adding step_stopwords().
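A minimal sketch of the workflow, assuming env_stop_recipe and env_spec defined above:

```r
# Sketch: workflow with the stopword-aware recipe
env_stop_wf <- workflow() %>%
  add_recipe(env_stop_recipe) %>%
  add_model(env_spec)

env_stop_wf
```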
5) Tuning hyperparameters (same)
— Penalty tuning [env_grid] (same)
# Create a grid of values for the penalty hyperparameter (random set of 10 values)
env_grid <- dials::grid_regular(
penalty(), levels = 10
)
# Preview
env_grid
— K-fold cross-val [env_vfold, env_stop_tune] (same/NEW)
The env_stop_tune object contains the results of the tuning for each fold. We can see them by calling the collect_metrics() function on the env_stop_tune object.
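The tuning chunk is not shown; a minimal sketch, assuming env_stop_wf, the env_vfold folds, and env_grid:

```r
# Sketch: tune the penalty for the stopword workflow
set.seed(123)
env_stop_tune <- tune::tune_grid(
  object = env_stop_wf,
  resamples = env_vfold,
  grid = env_grid,
  control = control_grid(save_pred = TRUE)
)
```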
# Collect the results of the tuning
env_stop_tune_metrics <- tune::collect_metrics(env_stop_tune)
# visualize the results
autoplot(env_stop_tune) +
labs(
title = "Lasso model performance across regularization penalties",
subtitle = "Performance metrics can be used to identify the best penalty"
)
# in roc_auc: many of the penalty values performed similarly, with a drop-off in performance at the higher values
Conveniently, the tune::show_best() function takes a tune_grid object and returns the best performing hyperparameter values.
# Show the best hyperparameter values
show_best(env_stop_tune, metric = "roc_auc")
# Make selection programmatically
env_stop_best <- select_best(env_stop_tune, metric ="roc_auc")
env_stop_best_acc <- select_best(env_stop_tune, metric ="accuracy")
env_stop_best_brier <- select_best(env_stop_tune, metric ="brier_class")
6) Update model specification and workflow with best HP (NEW)
Update workflow [env_stop_wf_lasso]
Now we can update the model specification and workflow with the best performing hyperparameter value, using the previous env_stop_wf workflow and the finalize_workflow() function.
# Update the model specification
env_stop_wf_lasso <- env_stop_wf %>%
tune::finalize_workflow(env_stop_best)
# Preview the updated workflow object (with the selected penalty param ≈ 0.00599)
env_stop_wf_lasso
7) Assess the model performance on training set
— See the results [env_stop_lasso_fit]
Here we access the model coefficients to see which features are most important in the model. We see, for the penalty we chose, which terms contribute the most to an environmental category NOT being high risk.
# Fit the model to the training data
env_stop_lasso_fit <- fit (env_stop_wf_lasso, data = env_cat_train)
— Coefficients [enf_fitted_coeff]
“and” is not a top coefficient anymore!!!
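A minimal sketch of how this can be checked, assuming env_stop_lasso_fit from the chunk above:

```r
# Sketch: largest (absolute) coefficients after removing stopwords
env_stop_lasso_fit %>%
  extract_fit_parsnip() %>%
  tidy() %>%
  arrange(desc(abs(estimate))) %>%
  slice_head(n = 20)
```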
— [FIG] Assessing performance [env_stop_lasso_fit] on training set
# Example of a data frame containing actual and predicted values
pred_stop_long <- predict(env_stop_lasso_fit, new_data = env_cat_train , type = "prob")|>
bind_cols(env_cat_train) %>%
select(env_cat_f2, pred_high_risk = '.pred_High-Med-risk', pred_low_risk = '.pred_Low-risk_Othr') %>%
pivot_longer(cols = c(pred_high_risk, pred_low_risk),
names_to = "risk_type", values_to = "risk_value")
# Plot the predictions with boxplots and jittered points without duplicate legends
ggplot(pred_stop_long, aes(x = env_cat_f2, y = risk_value, fill = risk_type)) +
geom_boxplot(alpha = 0.4, position = position_dodge(width = 0.8)) +
#geom_jitter(alpha = 0.6, position = position_dodge(width = 0.8)) + # Remove width and use position_dodge
labs(title = "Predicted High and Low Risk Distribution by Env Category",
subtitle = "Model: Lasso Regression fitted on training data (stopwords)",
x = "ACTUAL",
y = "PREDICTED",
fill = "Risk Type") + # Set label for fill legend
theme_minimal() +
guides(color = "none") # Suppress the color legend
This model did not improve much (especially on the Low-risk_Othr level prediction)!
— Assess the model w cross-valid [env_stop_lasso_cv]
# (this is similar to env_tune)
# Cross validate the (optimized) workflow
env_stop_lasso_cv <- env_stop_wf_lasso %>%
tune::fit_resamples(
# 10 fold cross validation splits
resamples = env_vfold,
# save predictions for confusion matrix
control = control_resamples(save_pred = TRUE)
)
We want to aggregate the metrics across the folds to get a sense of the variability of the model. The collect_metrics() function takes the results of a cross-validation and returns a data frame with the metrics.
env_stop_lasso_cv[[3]][[1]] # 1st split
env_stop_lasso_cv[[3]][[3]] # 3rd split
# Collect the results of the cross-validation (this is the average from the 10 splits!)
collect_metrics(env_stop_lasso_cv)
# 1 accuracy binary 0.750 10 0.00745 Preprocessor1_Model1
# 2 brier_class binary 0.172 10 0.00322 Preprocessor1_Model1
# 3 roc_auc binary 0.784 10 0.00881 Preprocessor1_Model1
# OLD
collect_metrics(env_lasso_cv)
# 1 accuracy binary 0.750 10 0.0102 Preprocessor1_Model1
# 2 brier_class binary 0.174 10 0.00352 Preprocessor1_Model1
# 3 roc_auc binary 0.777 10 0.00771 Preprocessor1_Model1
Same accuracy, but improved roc_auc and brier_class compared to the previous model!
— [FIG] Visualize the confusion matrix
# BTW
collect_predictions(env_stop_lasso_cv)
This is done on the cross-validation results env_stop_lasso_cv.
# Plot the confusion matrix
env_stop_lasso_cv %>%
tune::conf_mat_resampled(tidy = FALSE) %>%
autoplot(type = "heatmap")
# 137.2 = true positives --> 136.4
# 33.6 = true negatives --> 33.1
# 40.5 = false positives --> 40.8
# 15.1 = false negatives --> 15.3
There are more false positives (low risk predicted to be high risk) than false negatives. (This is a common issue in imbalanced datasets and can be addressed by adjusting the decision threshold of the model.)
8) Evaluate the best model on the validation/test set
To do this we need to fit the tuned workflow to the training set (the actual training phase) and evaluate it on the test set. We will use the last_fit() function from {tune}.
Let’s use the updated workflow env_stop_wf_lasso.
— Fit the best model(training) and evaluate on the test set
# fit the model to the training set and evaluate on the validation set
env_stop_lasso_final_fit <- last_fit(
env_stop_wf_lasso,
split = env_split)
# Evaluate the model on the validation set (in my case)
collect_metrics(env_stop_lasso_final_fit)
# 1 accuracy binary 0.762 Preprocessor1_Model1
# 2 roc_auc binary 0.807 Preprocessor1_Model1
# 3 brier_class binary 0.163 Preprocessor1_Model1
The performance metrics are very close to those we achieved on the training set –> good sign that the model is robust as it performs well on both training and test (validation) sets.
— [FIG] Visualize the confusion matrix
# Plot the confusion matrix
env_stop_lasso_final_fit %>%
collect_predictions() %>%
conf_mat(truth = env_cat_f2, estimate = .pred_class) %>%
autoplot(type = "heatmap")
Still (on the test set) imbalanced false positives (well) and false negatives (poor).
— Visualize the ROC AUC curve
Take the output of last_fit() and use it (env_stop_lasso_final_fit) to plot the ROC curve.
colnames(env_stop_lasso_final_fit)
# Extract predictions from the final fit object
# Extract the tibble from the list
env_stop_lassofinal_fit_pred <- env_stop_lasso_final_fit$.predictions[[1]]
str(env_stop_lassofinal_fit_pred)
# Visualize the ROC curve
env_stop_lassofinal_fit_pred %>%
roc_curve(truth = env_cat_f2, '.pred_High-Med-risk') %>%
autoplot() +
labs(
title = "ROC Curve for High-Med Risk Prediction",
x = "1 - Specificity (False Positive Rate)",
y = "Sensitivity (True Positive Rate)",
caption = "logistic regression model on text (stopwords)"
)
#__________
2. model - log w/ text with tf-idf + stopwords + 3 factors
[👍🏻] Logistic with pdo + sector_f | regionname | boardapprovalFY as predictors
3) Define preprocessing steps [env_FEAT_recipe] (same)
— Improve recipe [env_FEAT_recipe] (NEW!)
- using sector_f to include the sector tag but with fewer dimensions
- step_dummy() because logistic regression, especially when using certain tuning functions in tidymodels, requires numeric or dummy variables.
# --- Create a recipe with a token filter step that excludes stopwords
# Rebuild recipe with tokenfilter step
env_FEAT_recipe <- recipe (env_cat_f2 ~ pdo + sector_f + regionname + boardapprovalFY,
data = training(env_split)) %>%
# tokenize the text
step_tokenize(pdo) %>%
# remove CUSTOM stopwords
step_stopwords(pdo, custom_stopword_source = stop_vector) %>%
# filter by frequency of occurrence
step_tokenfilter(pdo, max_tokens = 100) %>%
# creates tf-idf matrix of weighted term frequencies
step_tfidf(pdo, smooth_idf = FALSE) %>%
# add NA as special factor level
step_unknown(sector_f ,new_level = "Unknown sect" ) %>%
step_unknown(regionname ,new_level = "Unknown reg" ) %>%
step_unknown(boardapprovalFY ,new_level = "Unknown FY" ) %>%
# convert to dummy variables
step_dummy(sector_f, regionname, boardapprovalFY, one_hot = TRUE)
check what changed…
4) Select algorithm + workflow (same)
— Model specification [env_spec]
# Create a model specification
env_spec <-
# generalized linear model for binary outcomes
parsnip::logistic_reg(
mode = "classification",
# A non-negative number representing the total amount of regularization
penalty = tune(), # 0 = no penalty, 1 = max
#A number between zero and one (inclusive)
mixture = 1 # specifies a pure lasso model,
) %>%
set_engine("glmnet")
##### tune() IS A PLACEHOLDER
# Preview
env_spec
— Create workflow [env_FEAT_wf] (NEW!)
env_FEAT_recipe is actually the part that changed in this workflow, adding the factor predictors (with step_unknown() and step_dummy()).
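A minimal sketch of the workflow, assuming env_FEAT_recipe and env_spec defined above:

```r
# Sketch: workflow with text + factor predictors
env_FEAT_wf <- workflow() %>%
  add_recipe(env_FEAT_recipe) %>%
  add_model(env_spec)

env_FEAT_wf
```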
5) Tuning hyperparameters (same)
— Penalty tuning + folds [env_grid, env_vfold] (same)
# Create a grid of values for the penalty hyperparameter (random set of 10 values)
env_grid <- dials::grid_regular(
penalty(), levels = 10
)
# Create a resampling object
env_vfold <- rsample::vfold_cv(env_cat_train, v = 10)
— K-fold cross-val tuning [env_FEAT_tune] (NEW)
# Create a tuning workflow
env_FEAT_tune <- tune::tune_grid(
object = env_FEAT_wf, # changed !
resamples = env_vfold,
grid = env_grid,
control = control_grid(save_pred = TRUE)
)
# preview
env_FEAT_tune
The env_FEAT_tune object contains the results of the tuning for each fold. We can see the results of the tuning for each fold by calling the collect_metrics() function on the env_FEAT_tune object.
# Collect the results of the tuning
env_FEAT_tune_metrics <- tune::collect_metrics(env_FEAT_tune)
# visualize the results
autoplot(env_FEAT_tune)
# in roc_auc: many of the penalty values performed similarly, with a drop-off in performance at the higher values
Conveniently, the tune::show_best() function takes a tune_grid object and returns the best performing hyperparameter values.
# Show the best hyperparameter values
show_best(env_FEAT_tune, metric = "roc_auc")
# Make selection programmatically
env_FEAT_best <- select_best(env_FEAT_tune, metric ="roc_auc")
env_FEAT_best_acc <- select_best(env_FEAT_tune, metric ="accuracy")
env_FEAT_best_brier <- select_best(env_FEAT_tune, metric ="brier_class")
6) Update model specification and workflow with best HP (NEW)
Update workflow [env_FEAT_wf2]
Now we can update the model specification and workflow with the best performing hyperparameter value, using the previous env_FEAT_wf workflow and the finalize_workflow() function.
# Update the model specification
env_FEAT_wf2 <- env_FEAT_wf %>%
tune::finalize_workflow(env_FEAT_best)
# Preview the updated workflow object (with the selected penalty param ≈ 0.00599)
env_FEAT_wf2
7) Assess the model performance on training set
— See the results [env_FEAT_fit]
Here we access the model coefficients to see which features are most important in the model. We see, for the penalty we chose, which terms contribute the most to an environmental category NOT being high risk.
# Fit the model to the training data
env_FEAT_fit <- fit (env_FEAT_wf2, data = env_cat_train)
— Coefficients [enf_fitted_coeff]
- “sector_f…” variables appear among the top coefficients!!!
— [FIG] Assessing performance [env_FEAT_fit] on training set
Now the data used is training(env_split).
# Fit the model on the training set
env_FEAT_fit <- env_FEAT_wf2 %>% # NEW
fit(data = training(env_split))
# Example of a data frame containing actual and predicted values
pred_FEAT_long <- predict(env_FEAT_fit, new_data = training(env_split), type = "prob")|>
bind_cols(training(env_split)) %>%
select(env_cat_f2, pred_high_risk = '.pred_High-Med-risk', pred_low_risk = '.pred_Low-risk_Othr') %>%
pivot_longer(cols = c(pred_high_risk, pred_low_risk),
names_to = "risk_type", values_to = "risk_value")
# Plot the predictions with boxplots and jittered points without duplicate legends
ggplot(pred_FEAT_long, aes(x = env_cat_f2, y = risk_value, fill = risk_type)) +
geom_boxplot(alpha = 0.4, position = position_dodge(width = 0.8)) +
#geom_jitter(alpha = 0.6, position = position_dodge(width = 0.8)) + # Remove width and use position_dodge
labs(title = "Predicted High and Low Risk Distribution by Env Category",
subtitle = "Model: Lasso Regression fitted on training data (stop and feat)",
x = "ACTUAL",
y = "PREDICTED",
fill = "Risk Type") + # Set label for fill legend
theme_minimal() +
guides(color = "none") # Suppress the color legend
The LOW risk prediction also seems improved (at least on average, although with more dispersion).
# Fit the model on the testing set
env_FEAT_fit <- env_FEAT_wf2 %>% # NEW
fit(data = testing(env_split))
# Example of a data frame containing actual and predicted values
pred_FEAT_long <- predict(env_FEAT_fit, new_data = testing(env_split), type = "prob")|>
bind_cols(testing(env_split)) %>%
select(env_cat_f2, pred_high_risk = '.pred_High-Med-risk', pred_low_risk = '.pred_Low-risk_Othr') %>%
pivot_longer(cols = c(pred_high_risk, pred_low_risk),
names_to = "risk_type", values_to = "risk_value")
# Plot the predictions with boxplots and jittered points without duplicate legends
ggplot(pred_FEAT_long, aes(x = env_cat_f2, y = risk_value, fill = risk_type)) +
geom_boxplot(alpha = 0.4, position = position_dodge(width = 0.8)) +
#geom_jitter(alpha = 0.6, position = position_dodge(width = 0.8)) + # Remove width and use position_dodge
labs(title = "Predicted High and Low Risk Distribution by Env Category",
subtitle = "Model: Lasso Regression fitted on testing data (stop and feat)",
x = "ACTUAL",
y = "PREDICTED",
fill = "Risk Type") + # Set label for fill legend
theme_minimal() +
guides(color = "none") # Suppress the color legend
— Assess the model w cross-valid [env_FEAT_cv] (NEW)
# (this is similar to env_tune)
set.seed(123)
# Cross validate the (optimized) workflow
env_FEAT_cv <- env_FEAT_wf2 %>%
tune::fit_resamples(
# 10 fold cross validation splits
resamples = env_vfold,
# save predictions for confusion matrix
control = control_resamples(save_pred = TRUE)
)
We want to aggregate the metrics across the folds to get a sense of the variability of the model. The collect_metrics() function takes the results of a cross-validation and returns a data frame with the metrics.
env_FEAT_cv[[3]][[1]] # 1st split
env_FEAT_cv[[3]][[3]] # 3rd split
# Collect the results of the cross-validation (this is the average from the 10 splits!)
collect_metrics(env_FEAT_cv)
# 1 accuracy binary 0.777 10 0.0118 Preprocessor1_Model1
# 2 brier_class binary 0.155 10 0.00503 Preprocessor1_Model1
# 3 roc_auc binary 0.820 10 0.0101 Preprocessor1_Model1
# OLD
collect_metrics(env_stop_lasso_cv)
#1 accuracy binary 0.754 10 0.00997 Preprocessor1_Model1
#2 brier_class binary 0.170 10 0.00371 Preprocessor1_Model1
#3 roc_auc binary 0.785 10 0.00813 Preprocessor1_Model1
roc_auc and accuracy improved a bit, and brier_class is lower compared to the previous model!
— [FIG] Visualize the confusion matrix
This is done on the cross-validation results env_FEAT_cv. NOTE: the number of observations is lower because of the dropped missing factors.
set.seed(123)
# Plot the confusion matrix
env_FEAT_cv %>%
tune::conf_mat_resampled(tidy = FALSE) %>%
autoplot(type = "heatmap")
# env_stop_lasso_cv --> env_FEAT_cv
# 137.2 = true positives --> 135
# 33.6 = true negatives --> 33.7
# 40.5 = false positives --> 33.8
# 15.1 = false negatives --> 16.7
Compared to the previous model, there is a small improvement in false positives (low risk predicted to be high risk), as there are fewer than before. This model did improve (especially in the low risk category): in fact, the probability of being classified as HIGH RISK is below 50% and of being classified LOW RISK is above 50%.
8) Evaluate the best model on the validation/test set
To do this we need to fit the tuned workflow to the training set (the actual training phase) and evaluate it on the test set. We will use the last_fit() function from {tune}.
Let’s use the updated workflow env_FEAT_wf2.
RECALL I DID:
- for training (labelled env_cat_f): env_cat_train <- training(env_split) # 2265 proj
- for validation (labelled env_cat_f): env_cat_test <- testing(env_split) # 971 proj
— Fit the best model(training) and evaluate on the test set
After determining the best model, the final fit on the entire training set is needed and is then evaluated on the test set.
# fit the model to the training set and evaluate on the validation set
env_FEAT_final_fit <- last_fit(
env_FEAT_wf2,
split = env_split)
# Evaluate the model on the validation set (in my case)
collect_metrics(env_FEAT_final_fit)
# 1 accuracy 0.793 Preprocessor1_Model1 --> 0.79
# 2 roc_auc 0.851 Preprocessor1_Model1 --> 0.85
# 3 brier_class 0.143 Preprocessor1_Model1 --> 0.144
The performance metrics are very close to those we achieved on the training set (actually better!!) –> good sign that the model is robust as it performs well on both training and test (validation) sets.
— [FIG] Visualize the confusion matrix
# Plot the confusion matrix
env_FEAT_final_fit %>%
collect_predictions() %>%
conf_mat(truth = env_cat_f2, estimate = .pred_class) %>%
autoplot(type = "heatmap")
library(tidymodels)
library(ggplot2)
# Create a table for the confusion matrix counts
ML_final_fit_cm_p <- env_FEAT_final_fit %>%
collect_predictions() %>%
conf_mat(truth = env_cat_f2, estimate = .pred_class) %>%
autoplot(type = "heatmap") +
labs(
title = "Confusion Matrix for Lasso Logistic Regression Model",
x = "Predicted Class",
y = "True Class",
fill = "Count"
) +
scale_fill_gradient(low = "#f2e8ea", high = "#964957") + # Adjust color gradient for better contrast
theme_minimal(base_size = 14) + # Set a clean theme with larger base text size
theme(
plot.title = element_text(hjust = 0.5, face = "bold"), # Center and bold the title
#axis.text.x = element_text(angle = 45, hjust = 1), # Angle x-axis text for readability
legend.position = "right" # Place the legend on the right
)
ML_final_fit_cm_p
f_save_plot_obj <- function(plot_object, plot_obj_name) {
# Save the plot object
saveRDS(plot_object, here("analysis", "output", "figures", paste0(plot_obj_name, ".rds")))
}
f_save_plot_obj(ML_final_fit_cm_p, "ML_final_fit_cm_p")
Still (on the validation set) imbalanced false positives (138) and false negatives (63).
— [FIG] Visualize the ROC AUC curve
Take the output of last_fit() and use it (env_FEAT_final_fit) to plot the ROC curve.
colnames(env_FEAT_final_fit)
# Extract predictions from the final fit object
# Extract the tibble from the list
env_FEAT_final_fit_pred <- env_FEAT_final_fit$.predictions[[1]]
str(env_FEAT_final_fit_pred)
# Visualize the ROC curve
env_FEAT_final_fit_pred %>%
roc_curve(truth = env_cat_f2, '.pred_High-Med-risk') %>%
autoplot() +
labs(
title = "ROC Curve for High-Med Risk Prediction",
x = "1 - Specificity (False Positive Rate)",
y = "Sensitivity (True Positive Rate)",
caption = "logistic regression model on text (stopwords) + features"
)
9) Interpret the model
— Inspecting what levels of the outcome are most difficult to estimate
I will select the columns with the actual outcome (env_cat_f2), the predicted outcome, the env_cat_f level, and the pdo text, and separate the predicted outcomes to inspect them.
Inspecting in which actual category (env_cat_f) projects fall when they are actually env_cat_f2 == 'High-Med-risk' but falsely predicted to be .pred_class == 'Low-risk_Othr': not surprisingly, most of them are at the medium risk level (this makes sense).
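The inspection chunk is not shown; a minimal sketch, assuming env_FEAT_final_fit from last_fit() above and that collect_predictions() returns rows in the same order as env_cat_test:

```r
# Sketch: which actual env_cat_f level are the misclassified projects in?
env_FEAT_final_fit %>%
  collect_predictions() %>%
  bind_cols(env_cat_test %>% select(env_cat_f, pdo)) %>%
  filter(env_cat_f2 == "High-Med-risk",
         .pred_class == "Low-risk_Othr") %>%
  count(env_cat_f, sort = TRUE)   # mostly medium-risk (B) projects, as noted above
```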
— Inspecting the most important features for predicting the outcome
- Using the extract_fit_parsnip() function from the {workflows} package to extract the model object from the workflow.
- Estimates are the log odds of the outcome for each feature (i.e. the log of the probability of the “Low-risk_Othr” outcome divided by the probability of “High-Med-risk”).
- Positive coefficient: indicates an increased likelihood of being in the “Low-risk_Othr” category compared to the “High-Med-risk” category.
- Negative coefficient: indicates an increased likelihood of being in the “High-Med-risk” category compared to the “Low-risk_Othr” category.
- The odds ratio (exponentiated coefficient) tells whether a feature is associated with a lower or a higher probability of the outcome:
  - Odds ratio > 1: the predictor increases the likelihood of the outcome (“Low-risk_Othr”).
  - Odds ratio < 1: the predictor decreases the likelihood of the outcome, i.e. it favours “High-Med-risk”.
# Extract the estimate (log-odds)
env_FEAT_final_fit_features <- extract_fit_parsnip(env_FEAT_final_fit) %>%
tidy() %>%
# Calculate the exponentiated estimate
mutate(odds = exp(estimate),
probability = odds / (1 + odds))
# tibble: 206 × 5
# term estimate penalty odds probability
# <chr> <dbl> <dbl> <dbl> <dbl>
# (Intercept) -1.26 0.00599 0.284 0.221
# tfidf_pdo_access -0.00452 0.00599 0.995 0.499
# tfidf_pdo_activities 0.786 0.00599 2.19 0.687
- tfidf_pdo_activities: est 0.786 | odds 2.19 | prob 0.687 (associated with LOW risk)
- tfidf_pdo_access: est -0.004 | odds 0.995 | prob 0.499 (associated with HIGH risk)
— Extract the most important features
A quick way to extract the most important features for predicting each outcome is to use the vi() function from {vip}.
- The vi() function calculates the variable importance of each feature in the model (here, based on the absolute value of the logistic regression coefficients).
library(vip)
# Extract the most important features
env_FEAT_var_importance <- extract_fit_parsnip(env_FEAT_final_fit) %>%
vip::vi() %>%
# it is kinda counterintuitive
mutate(note = case_when(
Sign == "POS" ~ "More likely to be in Low-risk_Othr",
Sign == "NEG" ~ "More likely to be in High-Med-risk",
TRUE ~ "NEU"
))
— [FIG] Plot the most important features
# Recode variable and sign
var_importance_tbl <- env_FEAT_var_importance %>%
mutate(Feature = str_remove(Variable, "tfidf_"),
EnvRiskOutcome = case_when(
Sign == "NEG" ~ "High-Med-risk",
Sign == "POS" ~ "Low-risk_Othr") ) %>%
select(Feature, Importance, EnvRiskOutcome)
summary(var_importance_tbl$EnvRiskOutcome)
# Plot the most important features
ML_feature_importance_p <- var_importance_tbl %>%
slice_max(Importance, n = 50) %>%
ggplot(aes(x = reorder(Feature, Importance), y = Importance, color = EnvRiskOutcome)) +
geom_point() +
coord_flip() +
facet_wrap(~EnvRiskOutcome, ncol = 2, scales = "free_y") +
labs(
title = glue("Most influential features for predicting Environmental Risk"),
subtitle = "LASSO Logistic Regression model on text + metadata tags \n(Importance = absolute value of the logistic regression coefficients)",
#caption = "(Importance of each feature calculated as the absolute value of the logistic regression coefficients)",
x = "",
y = "",
fill = ""
) +
lulas_theme +
theme(
plot.title = element_text(hjust = 0.5, face = "bold")
) +
guides(color = "none")
ML_feature_importance_p
The feature importance plot highlights the top 50 predictors of the (binary) environmental risk category, ranked by their influence in a LASSO logistic regression model. For better readability the predictors are split according to the level of risk predicted. It should be no surprise that words in the PDO text (those variables starting with pdo_*) are the most important predictors given the data. Still, some of the other predictors are also important, such as sector_f_TRANSPORT (left panel) or regionname and sector_f_FINANCIAL (right panel).
Each facet groups features by environmental risk outcome, allowing a comparison of which factors contribute most to each category. Features with higher importance values, like [mention a few key features], play a significant role in predicting environmental risk, offering insights for targeted risk assessment and decision-making.
# show plot
ML_feature_importance_p
# save as rds
f_save_plot_obj(ML_feature_importance_p, "ML_feature_importance_p")
__________
3. model - NB w/ text with tf-idf + stopwords + 3 factors
[🫤] Naive Bayes classification with text and sector_f, regionname, and boardapprovalFY as predictors
3) Define preprocessing steps (same)
4) Select algorithm + workflow (same)
— Model specification [env_NB_spec]
Let’s use a naive Bayes model, which is available via the tidymodels extension package discrim.
One of the main advantages is its ability to handle a large number of features, such as those we deal with when using word count methods. Here we have only kept the 100 most frequent tokens, but we could have kept more and a naive Bayes model would still be able to handle such predictors well. For now, we will limit the model to a moderate number of tokens.
# needed for naive Bayes
library(discrim)
# Create a model specification
env_NB_spec <-
# generalized linear model for binary outcomes
parsnip::naive_Bayes() %>%
# Specify the mode of the model
set_mode("classification") %>%
# Specify the engine
set_engine("naivebayes")
# Preview
env_NB_spec
— Create workflow [env_NB_wf] (NEW!)
env_NB_spec is actually the part that changed in this workflow (the preprocessing recipe stays the same).
— Fit the classification model to the training set [env_NB_fit] (NEW!)
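Neither the workflow nor the fitting chunk is shown; a minimal sketch of both, assuming the model reuses the env_FEAT_recipe (text + 3 factors) defined earlier:

```r
# Sketch: naive Bayes workflow and fit on the training set
env_NB_wf <- workflow() %>%
  add_recipe(env_FEAT_recipe) %>%
  add_model(env_NB_spec)

env_NB_fit <- fit(env_NB_wf, data = env_cat_train)
```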
5) Tuning hyperparameters (same)
Let’s use resampling to estimate the performance of the naive Bayes classification model we just fit. We can do this using resampled data sets built from the training set. Let’s create 10-fold cross-validation sets, and use these resampled sets for performance estimates.
The env_NB_rs object contains the resampling results for each fold. We can extract the relevant information using collect_metrics() and collect_predictions().
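The resampling chunk is not shown; a minimal sketch, assuming env_NB_wf and the env_vfold folds:

```r
# Sketch: estimate naive Bayes performance with 10-fold cross-validation
set.seed(123)
env_NB_rs <- fit_resamples(
  env_NB_wf,
  resamples = env_vfold,
  control = control_resamples(save_pred = TRUE)
)
```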
env_NB_rs_metrics <- collect_metrics(env_NB_rs)
#1 accuracy binary 0.684 10 0.00843 Preprocessor1_Model1
#2 brier_class binary 0.314 10 0.00884 Preprocessor1_Model1
#3 roc_auc binary 0.762 10 0.00715 Preprocessor1_Model1
env_NB_rs_predictions <- collect_predictions(env_NB_rs)
Worse than logistic!
- Accuracy is the proportion of the data that is predicted correctly. Be aware that accuracy can be misleading in some situations, such as for imbalanced data sets.
- ROC AUC measures how well a classifier performs at different thresholds. The ROC curve plots the true positive rate against the false positive rate; AUC closer to 1 indicates a better-performing model, while AUC closer to 0.5 indicates a model that does no better than random guessing.
- The Brier score is a measure of the mean squared difference between the predicted probabilities and the actual outcomes.
— Visualize the ROC AUC curve
The area under each of these curves is the roc_auc metric we have computed. (If the curve were close to the diagonal line, the model’s predictions would be no better than random guessing.)
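The plotting chunk is not shown; a minimal sketch, assuming the env_NB_rs_predictions collected above:

```r
# Sketch: one ROC curve per cross-validation fold
env_NB_rs_predictions %>%
  group_by(id) %>%   # id = fold
  roc_curve(truth = env_cat_f2, '.pred_High-Med-risk') %>%
  autoplot() +
  labs(
    title = "ROC curve for High-Med risk prediction",
    caption = "Naive Bayes model, 10-fold cross-validation"
  )
```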
— [FIG] Visualize the confusion matrix
# Plot the confusion matrix
conf_mat_resampled(env_NB_rs, tidy = FALSE) %>%
autoplot(type = "heatmap")
- Different from the logistic model.
- It does very well with true positives (high risk), but very badly with true negatives (low risk).
8) Evaluate the best model on the validation/test set
— Fit the best model(training) and evaluate on the test set
After determining the best model, the final fit on the entire training set is needed and is then evaluated on the test set.
# fit the model to the training set and evaluate on the validation set
env_NB_rs_final_fit <- last_fit(
env_NB_wf,
split = env_split)
# Evaluate the model on the validation set (in my case)
collect_metrics(env_NB_rs_final_fit)
# 1 accuracy binary 0.691 Preprocessor1_Model1
# 2 roc_auc binary 0.784 Preprocessor1_Model1
# 3 brier_class binary 0.307 Preprocessor1_Model1
____________________________________________________________
———————- MULTICLASS OUTCOME ——————-
____________________________________________________________
4. & 5. model - Multiclass outcome sector_f
[🙃] Multiclass outcome
2) Split sample (based on sector_f)
This time I need to create a new split of the data using initial_split(), based on the levels of sector_f (here I collapsed the original 99 levels into 7 macro levels).
We will use the strata argument to stratify the data by the outcome variable (sector_f). This will ensure that the training and validation sets have the same class proportions.
# Create a stratified split based on missing vs non-missing sector_f
projs_train %>% tabyl(sector_f) # 7 levels
# Split BUT keep only "Not Missing" `sector_f`
## --- 0) THIS WILL BE FOR TRAINING & VALIDATION
sec_use <- projs_train %>%
filter(sector_f != "Missing") # 4316 proj
# SPLIT INTO TRAINING, VALIDATION
set.seed(123) # Ensure reproducibility
sec_split <- initial_split(sec_use, prop = 0.7, # 70% training, 30% testing
strata = sector_f) # stratify by OUTCOME
## -- 1) for training (labelled `sector_f`)
sec_train <- training(sec_split) # 3019 proj
## -- 2) for validation (labelled `sector_f`)
sec_test <- testing(sec_split) # 1297 proj
# # UNLABELLED PORTION
## -- 3) for actual test (UNlabelled `sector_f`)
sec_missing <- projs_train %>%
filter(sector_f == "Missing") # 87 proj
# check distribution of `sector_f` in training and validation
tabyl(sec_train, sector_f) |> adorn_totals("row") |> adorn_pct_formatting(digits = 1)#
tabyl(sec_test, sector_f)|> adorn_totals("row") |> adorn_pct_formatting(digits = 1)#
There is no terrible imbalance between the levels of sector_f in the training and validation sets. However, compared to binary classification, there are several additional issues to keep in mind when working with multiclass classification:
- Many machine learning algorithms do not handle imbalanced data well and are likely to have a hard time predicting minority classes.
- Not all machine learning algorithms are built for multiclass classification at all.
- Many evaluation metrics need to be reformulated to describe multiclass predictions.
3) Preprocessing steps [multi_rec] (same)
We have added step_downsample(sector_f) to the end of the recipe specification to downsample after all the text preprocessing. We want to downsample last so that we still generate features on the full training data set. The downsampling will then only affect the modeling step, not the preprocessing steps, with hopefully better results.
multi_rec <- recipe(
#sector_f ~ pdo, data = sec_train) %>%
sector_f ~ pdo + regionname + boardapprovalFY + env_cat_f, data = sec_train) %>%
step_tokenize(pdo) %>%
step_stopwords(pdo, custom_stopword_source = stop_vector) %>%
step_tokenfilter(pdo, max_tokens = 100) %>%
step_tfidf(pdo, smooth_idf = FALSE) %>%
# add NA as special factor level
step_unknown(regionname ,new_level = "Unknown reg" ) %>%
step_unknown(boardapprovalFY ,new_level = "Unknown FY" ) %>%
step_unknown(env_cat_f ,new_level = "Unknown env cat" ) %>%
# convert to dummy variables
step_dummy(regionname, boardapprovalFY,env_cat_f, one_hot = TRUE) %>%
# resolve imbalance
step_downsample(sector_f)
— COULD use step_word_embeddings to see a different representation and check what changes…
4.a) Specify models (NEW!)
— i) Multin. Lasso Logistic Regress [logistic_model]
Some model algorithms and computational engines (examples are most random forests and SVMs) automatically detect when we perform multiclass classification from the number of classes in the outcome variable and do not require any changes to our model specification. For lasso regularization, we need to create a new special model specification just for the multiclass case, using multinom_reg().
# MULTINOMIAL LASSO REGRESSION MODEL
# For lasso regularization, we need to create a new special model specification just for the multiclass class
logistic_model <- multinom_reg(penalty = tune(), mixture = 1) %>%
set_mode("classification") %>%
set_engine("glmnet")
logistic_model
— ii) KNN [knn_model]
knn_model <- nearest_neighbor(neighbors = tune()) %>%
set_engine("kknn") %>%
set_mode("classification")
4.b) Specify workflows (NEW!)
— i) Multin. Lasso Logistic Regress [logistic_wf]
To keep our text data sparse throughout modeling and use the sparse capabilities of set_engine(“glmnet”), we need to explicitly set a non-default preprocessing blueprint, using the package hardhat. This “blueprint” lets us specify during modeling how we want our data passed around from the preprocessing into the model. The composition “dgCMatrix” is the most common sparse matrix type (from the Matrix package) used in R for modeling. We can use this blueprint argument when we add our recipe to our modeling workflow, to define how the data should be passed into the model.
library(hardhat)
sparse_bp <- default_recipe_blueprint(composition = "dgCMatrix")
# workflow
logistic_wf <- workflow() %>%
add_recipe(multi_rec, blueprint = sparse_bp) %>%
add_model(logistic_model)
logistic_wf
— ii) KNN [knn_wf]
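A sketch of the KNN workflow, reusing the same multiclass recipe; whether the sparse blueprint also pays off with the “kknn” engine is an assumption, so it is left out here.
knn_wf <- workflow() %>%
  add_recipe(multi_rec) %>%
  add_model(knn_model)
knn_wf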
5) Hyperparameters tuning
We used the same arguments for penalty and mixture as before, as well as the same mode and engine, but this model specification is set up to handle more than just two classes.
— folds for cross-validation
We also need a new cross-validation object since we are using a different data set.
set.seed(123)
# random splits ("folds") of the data for cross-validation
multi_folds <- vfold_cv(sec_train)
— Define Grids
The last time we tuned a lasso model, we used the defaults for the penalty parameter and 30 levels. This time, let’s try fewer values (levels = 10 in the grids below).
- grid_regular() chooses sensible values to try for each tuning parameter; here we ask for 10 different possible values.
— i) Multin. Lasso Logistic Regress [logistic_grid]
logistic_grid <- grid_regular(hardhat::extract_parameter_set_dials(logistic_model), levels = 10)
— ii) KNN [knn_grid]
knn_grid <- grid_regular(hardhat::extract_parameter_set_dials(knn_model),
                         levels = 10,
                         filter = c(neighbors > 1))
— Define tuning process
We define the tuning control and a metric set with multiclass support.
model_control <- control_grid(save_pred = TRUE)
model_metrics <- metric_set( accuracy, sens, spec, mn_log_loss, roc_auc)
Now we have everything we need to tune the regularization penalty and find an appropriate value.
- tune_grid() can fit a model at each of the values for the regularization penalty in our regular grid.
- Note that we specify save_pred = TRUE so we can create ROC curves and a confusion matrix later.
— i) Multin. Lasso Logistic Regress [logistic_model]
# Tune model
multi_lasso_rs <- tune_grid(
# A) unfitted workflow
# logistic_wf,
# B) model specification
logistic_model,
# B) recipe
multi_rec,
grid = logistic_grid,
metrics = model_metrics, # pre defined metrics
control = model_control, # pre defined control
# folds for cross-validation
resamples = multi_folds)
multi_lasso_rs
— ii) KNN [knn_grid]
# Tune model
multi_knn_rs <- tune_grid(
# A) unfitted workflow
# logistic_wf,
# B) model specification
knn_model,
# B) recipe
multi_rec,
grid = knn_grid,
metrics = model_metrics, # pre defined metrics
control = model_control, # pre defined control
# folds for cross-validation
resamples = multi_folds)
multi_knn_rs
6) Evaluate the model performance & update WF
— Accuracy metric
What do we see, in terms of performance metrics?
Even the very best “accuracy” value here is quite low (0.44), significantly lower than the binary classification model. This is expected because multiclass classification is more difficult than binary classification.
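A quick way to confirm those values (sketch):
# Best configurations by accuracy for each tuned model
show_best(multi_lasso_rs, metric = "accuracy")
show_best(multi_knn_rs, metric = "accuracy")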
— [FIG] performance metrics
— i) Multin. Lasso Logistic Regress [logistic_model]
— ii) KNN [knn_grid]
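A sketch of the plots for both models; autoplot() on a tune_grid() result shows each metric across the tuning grid.
autoplot(multi_lasso_rs) + labs(title = "Multinomial lasso: tuning results")
autoplot(multi_knn_rs) + labs(title = "KNN: tuning results")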
— choose final hyperparameters
— i) Multin. Lasso Logistic Regress [log_final_param]
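A sketch of one way to pick the final penalty (by accuracy), consistent with the inner_join() below; select_best() returns a one-row tibble with the penalty and .config.
log_final_param <- multi_lasso_rs %>%
  select_best(metric = "accuracy")
log_final_param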
— TIDY conf matrix
multi_lasso_rs %>%
collect_predictions() %>%
inner_join(log_final_param) %>%
conf_mat(truth = sector_f, estimate = .pred_class)
— [FIG] conf matrix (1st fold)
To get a more detailed view of how our classifier is performing, let us look at the confusion matrix for one of the folds (Fold01):
final_penalty <- multi_lasso_rs %>%
show_best(metric = "accuracy") %>%
slice(1) %>%
pull(penalty)
multi_lasso_rs %>%
collect_predictions() %>%
filter(penalty == final_penalty) %>%
filter(id == "Fold01") %>%
conf_mat(truth = sector_f, estimate = .pred_class) %>%
autoplot(type = "heatmap") +
scale_y_discrete(labels = function(x) str_wrap(x, 20)) +
scale_x_discrete(labels = function(x) str_wrap(x, 20))
- The diagonal is fairly well populated, which is a good sign. This means that the model generally predicted the right class.
- The off-diagonal numbers are all the failures and where we should direct our focus. It is a little hard to see these cases well since the majority class affects the scale. A trick to deal with this problem is to remove all the correctly predicted observations.
— [FIG] conf matrix (1st fold) only wrong pred
multi_lasso_rs %>%
collect_predictions() %>%
# final param
filter(penalty == final_penalty) %>%
filter(id == "Fold01") %>%
# exclude correct
filter(.pred_class != sector_f) %>%
conf_mat(truth = sector_f, estimate = .pred_class) %>%
autoplot(type = "heatmap", values = "prop") +
scale_y_discrete(labels = function(x) str_wrap(x, 20)) +
scale_x_discrete(labels = function(x) str_wrap(x, 20))
Here it is clearer where the model breaks down: “INSTIT. SUPP.”, which makes sense because such projects often created an agency FOR a specific sector!
— ii) KNN [knn_final_param]
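Analogous sketch for the KNN model:
knn_final_param <- multi_knn_rs %>%
  select_best(metric = "accuracy")
knn_final_param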
— TIDY conf matrix
— [FIG] conf matrix (1st fold)
To get a more detailed view of how the KNN classifier is performing, let us look at the confusion matrix for one of the folds (Fold01):
neighbors_final <- multi_knn_rs %>%
show_best(metric = "accuracy") %>%
slice(1) %>%
pull(neighbors)
multi_knn_rs %>%
collect_predictions() %>%
# final param
filter(neighbors == neighbors_final) %>%
filter(id == "Fold01") %>%
conf_mat(truth = sector_f, estimate = .pred_class) %>%
autoplot(type = "heatmap") +
scale_y_discrete(labels = function(x) str_wrap(x, 20)) +
scale_x_discrete(labels = function(x) str_wrap(x, 20))
— [FIG] conf matrix (1st fold) only wrong pred
As before, we remove the correctly predicted observations to focus on where the KNN classifier fails:
multi_knn_rs %>%
collect_predictions() %>%
# final param
filter(neighbors == neighbors_final) %>%
filter(id == "Fold01") %>%
# exclude correct
filter(.pred_class != sector_f) %>%
conf_mat(truth = sector_f, estimate = .pred_class) %>%
autoplot(type = "heatmap") +
scale_y_discrete(labels = function(x) str_wrap(x, 20)) +
scale_x_discrete(labels = function(x) str_wrap(x, 20))
6.b) Finalize the workflow
This I do only for the best model
— Finalize workflow
After we have the chosen parameters (final_penalty, i.e. the penalty {and possibly max_tokens}), we can finalize our earlier tunable workflow by updating it with these values.
logistic_wf_final <- logistic_wf %>%
# here it wants a tibble
finalize_workflow(parameters = log_final_param )
7) Assess the model performance on training set
— See the results [multi_fit]
Here we access the model coefficients to see which features are most important in the model. For the penalty we chose, we can see which terms contribute the most to predicting each sector.
# Fit the model to the training data
multi_fit <- fit(logistic_wf_final, data = sec_train)
— Coefficients [multi_fitted_coeff]
In this model, it seems that words contained in PDO are the most influential in predicting the sector of a project.
- “tfidf_pdo_education” appears among the top coefficients!!!
- “tfidf_pdo_health” appears among the top coefficients!!!
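A sketch of how multi_fitted_coeff could be obtained; for a glmnet engine fit, tidy() should return one row per class/term at the penalty stored in the finalized spec.
multi_fitted_coeff <- multi_fit %>%
  extract_fit_parsnip() %>%
  tidy() %>%
  filter(term != "(Intercept)", estimate != 0) %>%
  arrange(desc(abs(estimate)))
# Preview
multi_fitted_coeff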
8) Evaluate the best model on the test set [last_fit()]
We can now fit this finalized workflow on training data and finally return to our testing data.
- last_fit() emulates the process where, after determining the best model, the final fit is done on the entire training set and is then evaluated on the test set.
— metrics
multi_final_fitted <- last_fit(logistic_wf_final, sec_split)
collect_metrics(multi_final_fitted)
#1 accuracy multiclass 0.448 Preprocessor1_Model1
#2 roc_auc hand_till 0.800 Preprocessor1_Model1
#3 brier_class multiclass 0.367 Preprocessor1_Model1
— predictions
— [FIG] confusion matrix
multi_final_fitted %>%
collect_predictions() %>%
conf_mat(truth = sector_f, estimate = .pred_class) %>%
autoplot(type = "heatmap") +
scale_y_discrete(labels = function(x) str_wrap(x, 20)) +
scale_x_discrete(labels = function(x) str_wrap(x, 20)) +
theme(axis.text.x = element_text(angle = 30, hjust = 1))
— [FIG] ROC curve
pred_tibble <- multi_final_fitted %>%
collect_predictions()
# Assuming you have your results (with columns for true classes and predicted probabilities)
results <- pred_tibble # e.g., tibble with true and predicted columns
paint(results)
# Compute ROC curves (one-vs-all, per class)
roc_curve(results, truth = sector_f,
'.pred_AGR FOR FISH',
'.pred_EDUCATION' ,
'.pred_ENERGY' ,
'.pred_FINANCIAL' ,
'.pred_HEALTH' ,
'.pred_ICT' ,
'.pred_IND TRADE SERV' ,
'.pred_INSTIT. SUPP.' ,
'.pred_MINING OIL&GAS' ,
'.pred_SOCIAL PROT.' ,
'.pred_TRANSPORT' ,
'.pred_URBAN' ,
'.pred_WAT & SAN'
) %>%
autoplot()
It makes sense that the sector levels into which I collapsed more categories, and/or that have more “blurred” definitions, are the ones that are harder to predict:
- “URBAN”
- “IND TRADE SERV”
- “INSTIT. SUPP.”
- “ICT” (?)
—– 🔴🟢 HERE 🟠🟡 —-
9) Interpret the model
The output of last_fit() (multi_final_fitted) also contains a fitted model (a workflow, to be more specific) that has been trained on the training data. We can use the vip package to understand what the most important variables are in the predictions.
multi_final_fitted$.workflow
multi_final_fitted$.metrics
multi_final_fitted$.predictions
# N of observations is the same as in the `sec_test` set (1267)
nrow(multi_final_fitted$.predictions[[1]])
— Inspecting what levels of the outcome are most difficult to estimate
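A sketch of one way to check this from the test-set predictions (share of correct predictions within each true class):
multi_final_fitted %>%
  collect_predictions() %>%
  group_by(sector_f) %>%
  summarise(n = n(),
            share_correct = mean(.pred_class == sector_f)) %>%
  arrange(share_correct)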
— Inspecting the most important features for predicting the outcome
We can use the vip package to understand what the most important variables are in the predictions:
- using the extract_fit_parsnip() function from the workflows package to extract the model object from the workflow.
A quick way to extract the most important features for predicting each outcome is to use the vi() function from {vip}.
- The vi() function returns variable importance scores; for a glmnet model these are based on the absolute value of the coefficients at the chosen penalty (lambda).
library(vip)
sector_feat_imp <- extract_fit_parsnip(multi_final_fitted$.workflow[[1]]) %>%
vi(lambda = final_penalty) %>%
mutate(
Sign = case_when(Sign == "POS" ~ "More important for predicting a given sector",
Sign == "NEG" ~ "Less important for predicting a given sector",
TRUE ~ "Neutral"),
Importance = abs(Importance)) %>%
#filter(!str_detect(Variable, "boardapprovalFY")) %>%
filter(Importance != 0)
— [FIG] Variable importance for predicting the sector category
sector_feat_imp %>% group_by(Sign) %>%
top_n(30, Importance) %>%
mutate(Variable = str_replace(Variable, "tfidf_pdo_", "PDO_"),
Variable = str_replace(Variable, "env_cat_f", "ENVCAT_"),
Variable = str_replace(Variable, "regionname_", "REG_"),
Variable = str_replace(Variable, pattern = "boardapprovalFY_X" , replacement = "APPR_FY_")
) %>%
ungroup %>%
ggplot(aes(x = Importance,
y = fct_reorder(Variable, Importance),
fill = Sign
)) +
geom_col(show.legend = FALSE) +
scale_x_continuous(expand = c(0, 0)) +
facet_wrap(~Sign, scales = "free", ncol = 2) +
labs(
y = NULL,
title = "Variable importance for predicting the sector category",
#subtitle = paste0("These features are the most important in predicting a given class" )
)
— Misclassified sector
We can gain some final insight into our model by looking at observations from the test set that it misclassified. I will then select the columns with the actual outcome (sector_f) and compare them with the predicted sector: what was predicted wrong?
Let’s see for example cases labelled “AGR FOR FISH” but misclassified
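A sketch of how to pull those cases back together with their PDO text; this assumes that .row from collect_predictions() indexes the data frame that was split (sec_use) and that the text column is pdo.
multi_final_fitted %>%
  collect_predictions() %>%
  filter(sector_f == "AGR FOR FISH", .pred_class != sector_f) %>%
  mutate(pdo = sec_use$pdo[.row]) %>% # assumed indexing via .row
  select(sector_f, .pred_class, pdo)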
Interesting: they all have very brief PDOs!
same !!!
____________________
SET OF MODELS
____________________
What if I want to see them all together?
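One possible approach (not used above) is the workflowsets package, which tunes several preprocessing/model combinations in one go; a sketch reusing the objects defined earlier:
library(workflowsets)
all_models <- workflow_set(
  preproc = list(tfidf = multi_rec),
  models = list(lasso = logistic_model, knn = knn_model)
) %>%
  workflow_map(
    "tune_grid",
    resamples = multi_folds,
    grid = 10,
    metrics = model_metrics,
    control = model_control,
    seed = 123
  )
# Rank all model/recipe combinations by accuracy
rank_results(all_models, rank_metric = "accuracy")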
__________
TOPIC MODELING w ML
__________
Topic modeling is an unsupervised machine learning technique that exploratively identifies latent topics based on frequently co-occurring words.
It can identify topics or themes that occur in a collection of documents, allowing hidden patterns and relationships within text data to be discovered. It is widely applied in fields such as social sciences and humanities.
https://bookdown.org/valerie_hase/TextasData_HS2021/tutorial-13-topic-modeling.html
https://m-clark.github.io/text-analysis-with-R/topic-modeling.html
https://sicss.io/2020/materials/day3-text-analysis/topic-modeling/rmarkdown/Topic_Modeling.html
Document-Term Matrix
…
How do I include independent variables in my topic model?
https://bookdown.org/valerie_hase/TextasData_HS2021/tutorial-13-topic-modeling.html#how-do-i-include-independent-variables-in-my-topic-model
Compare PDO text v. project METADATA [CMPL 🟠]
NLP models trained on document metadata and structure can be combined with text analysis to improve classification accuracy.
STEPS
- Use document text (abstracts) as features to train a supervised machine learning model. The labeled data (documents with sector tags) will serve as training data, and the model can predict the missing sector tags for unlabeled documents.
- TEXT preprocessing (e.g. tokenization, lemmatization, stopword removal, TF-IDF)
- Convert the processed text into a numerical format using Term Frequency-Inverse Document Frequency (TF-IDF), which gives more weight to terms that are unique to a document but less frequent across the entire corpus.
- Define data features, e.g.
- Document Length: Public sector documents might be longer, more formal.
- Presence of Certain Keywords: Use specific keywords that correlate with either the public or private sector.
- Sector Tags: In documents where the “sector tag” is present, you can use it as a feature for training.
- Predicting Missing Sector Tags (Classification):
- Use models like Logistic Regression for binary classification (e.g., public vs. private), or Random Forest / XGBoost for a more complex tagging scheme (e.g., multiple sector categories).
- Cross-validation: Ensure the model generalizes well by validating with the documents that already have the sector tag filled in.
- Evaluate the model: Use metrics like accuracy, precision, recall, and F1 score to evaluate the model’s performance.
# #| eval: FALSE
# #| echo: FALSE
# library(tidytext) # Text Mining using 'dplyr', 'ggplot2', and Other Tidy Tools
# library(dplyr) # A Grammar of Data Manipulation
# library(tidyr) # Tidy Messy Data
# library(caret) # Classification and Regression Training
#
# # ----- 1. Tokenization and stopwords removal using tidytext.
# # Assuming df is your dataframe with "abstract" and "sector_tag"
# # Keep a stable document id for the joins below
# df <- df %>% mutate(document_id = row_number())
#
# # Tokenize the text and remove stopwords
# tidy_abstracts <- df %>%
#   unnest_tokens(word, abstract) %>%
#   anti_join(stop_words) # Remove stopwords
#
# # Optional: Apply stemming (you can also use `SnowballC` if you prefer)
# tidy_abstracts <- tidy_abstracts %>%
# mutate(word = SnowballC::wordStem(word))
#
# # ----- 2. Document-term matrix or TF-IDF calculation using bind_tf_idf().
# # Create a term-frequency matrix
# abstract_dtm <- tidy_abstracts %>%
#   count(document_id, word) %>% # count words within each document
# cast_dtm(document_id, word, n)
#
# # Alternatively, use TF-IDF weighting
# abstract_tfidf <- tidy_abstracts %>%
#   count(document_id, word) %>%
# bind_tf_idf(word, document_id, n)
#
# # ----- 3. Model training using caret (Random Forest, Logistic Regression, etc.).
# # Split data into training (with sector tags) and testing (missing sector tags)
# train_data <- df[!is.na(df$sector_tag), ]
# test_data <- df[is.na(df$sector_tag), ]
#
# # Combine the DFM or TF-IDF with the training dataset
# train_tfidf <- abstract_tfidf %>%
#   filter(document_id %in% train_data$document_id) %>%
# spread(word, tf_idf, fill = 0)
#
# # Merge with sector tags
# train_tfidf <- left_join(train_tfidf, train_data, by = "document_id")
#
# # Prepare for machine learning by ensuring you have sector tags in the final dataset
#
# # ----- 4. Prediction of missing sector tags based on the trained model.
# # Random Forest Model
# model <- train(sector_tag ~ ., data = train_tfidf, method = "rf")
#
# # Predict missing sector tags for the test data
# test_tfidf <- abstract_tfidf %>%
#   filter(document_id %in% test_data$document_id) %>%
# spread(word, tf_idf, fill = 0)
#
# # Predict sector tags for the missing observations
# predicted_tags <- predict(model, newdata = test_tfidf)
#
# # Add the predicted sector tags to the original dataset
# df$sector_tag[is.na(df$sector_tag)] <- predicted_tags
#
# # ----- 5. Evaluate and Refine the Model
# confusionMatrix(predicted_tags, test_data$sector_tag)
— I could check whether this corresponds to the sector flags in the project metadata
(more missing values, but more objective!)
Topic modeling with LDA (Latent Dirichlet Allocation)
Topic modeling algorithms like Latent Dirichlet Allocation (LDA) can be applied to automatically uncover underlying themes within a corpus. The detected topics may highlight key terms or subject areas that are strongly associated with either the public or private sector.
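A minimal sketch, assuming the PDO text column is pdo and using the topicmodels package (one common choice, not prescribed by the text):
library(topicmodels)
# Document-term matrix from the PDO text
pdo_dtm <- projs_train %>%
  mutate(doc_id = row_number()) %>%
  unnest_tokens(word, pdo) %>%
  anti_join(get_stopwords(), by = "word") %>%
  count(doc_id, word) %>%
  cast_dtm(doc_id, word, n)
# Fit an LDA model; k = 8 topics is an arbitrary assumption
pdo_lda <- LDA(pdo_dtm, k = 8, control = list(seed = 123))
# Top terms per topic
tidy(pdo_lda, matrix = "beta") %>%
  group_by(topic) %>%
  slice_max(beta, n = 10) %>%
  ungroup()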
Named Entity Recognition using CleanNLP and spaCy
NER is especially useful for analyzing unstructured text.
NER can identify key entities (organizations, people, locations) mentioned in the text. By tracking which entities appear frequently (e.g., government agencies vs. corporations), it’s possible to categorize a document as more focused on the public or private sector.
— Summarise the tokens by parts of speech
# Initialize the spacy backend
cnlp_init_spacy()
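A sketch of the annotation step, assuming the PDO text column is pdo and that a project id column (here called proj_id, hypothetical) serves as the document id:
# Annotate the PDO text (can be slow on large corpora)
pdo_anno <- cnlp_annotate(projs_train, text_name = "pdo", doc_name = "proj_id")
# Summarise tokens by part of speech
pdo_anno$token %>%
  count(upos, sort = TRUE)
# Named entities found by spaCy (organizations, locations, etc.)
pdo_anno$entity %>%
  count(entity_type, sort = TRUE)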
__________
STRUCTURAL TOPIC MODELING (STM)
__________
The Structural Topic Model is a general framework for topic modeling with document-level covariate information. The covariates can improve inference and qualitative interpretability and are allowed to affect topical prevalence, topical content or both.
MAIN REFERENCE: stm R package http://www.structuraltopicmodel.com/
EXAMPLE UN corpus: https://content-analysis-with-r.com/6-topic_models.html
STM 1/2: https://jovantrajceski.medium.com/structural-topic-modeling-with-r-part-i-2da2b353d362
STM 2/2: https://jovantrajceski.medium.com/structural-topic-modeling-with-r-part-ii-462e6e07328
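A minimal sketch with the stm package, using quanteda (an assumption, not loaded above) to build the inputs and regionname as a prevalence covariate:
library(stm)
library(quanteda)
# Document-feature matrix from the PDO text, keeping project metadata as docvars
pdo_corpus <- corpus(projs_train, text_field = "pdo")
pdo_dfm <- pdo_corpus %>%
  tokens(remove_punct = TRUE) %>%
  tokens_remove(stopwords("en")) %>%
  dfm()
# STM with topical prevalence depending on a document-level covariate
pdo_stm <- stm(
  documents = pdo_dfm,
  K = 10, # number of topics is an assumption
  prevalence = ~ regionname,
  data = docvars(pdo_dfm),
  verbose = FALSE
)
# Top words per topic
labelTopics(pdo_stm)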
BERTopic
Developed by Maarten Grootendorst, BERTopic enhances the process of discovering topics by using document embeddings and a class-based variation of Term Frequency-Inverse Document Frequency (TF-IDF).
https://medium.com/@supunicgn/a-beginners-guide-to-bertopic-5c8d3af281e8
__________
(Dynamic) TOPIC MODELING OVER TIME
__________
Example: An analysis of Peter Pan using the R package koRpus https://ladal.edu.au/topicmodels.html#Topic_proportions_over_time
RENDER this
quarto render analysis/01c_WB_project_pdo_feat_class.qmd --to html
open ./docs/analysis/01c_WB_project_pdo_feat_class.html