WB Project PDO text EDA

Warning

WORK IN PROGRESS! (Please expect unfinished sections, and unpolished code. Feedback is welcome!)

Set up

# Pckgs -------------------------------------
library(fs) # Cross-Platform File System Operations Based on 'libuv'
library(here) # A Simpler Way to Find Your Files
library(paint) # paint data.frames summaries in colour
library(tidyverse) # Easily Install and Load the 'Tidyverse' 
library(stringr) # Simple, Consistent Wrappers for Common String Operations
library(janitor) # Simple Tools for Examining and Cleaning Dirty Data
library(skimr) # Compact and Flexible Summaries of Data
library(readxl) # Read Excel Files
library(kableExtra) # Construct Complex Table with 'kable' and Pipe Syntax)
library(patchwork) # The Composer of Plots
library(ragg) # Anti-Grain Geometry Graphics)

# TEXT ANALYTICS -------------------------------------
library(tidytext) # Text Mining using 'dplyr', 'ggplot2', and Other Tidy Tools   

# Set options to prevent scientific notation
options(scipen = 999)

Set up ggplot2 theme

# 1) --- Set the font as the default for ggplot2
# Who else? https://datavizf24.classes.andrewheiss.com/example/05-example.html 
lulas_theme <- theme_minimal(base_size = 12) +
  theme(panel.grid.minor = element_blank(),
        # Bold, bigger title
        plot.title = element_text(face = "bold", size = rel(1.6)),
        # Plain, slightly bigger subtitle that is grey
        plot.subtitle = element_text(face = "plain", size = rel(1.4), color = "#A6A6A6"),
        # Italic, smaller, grey caption that is left-aligned
        plot.caption = element_text(face = "italic", size = rel(0.7), 
                                    color = "#A6A6A6", hjust = 0),
        # Bold legend titles
        legend.title = element_text(face = "bold"),
        # Bold, slightly larger facet titles that are left-aligned for the sake of repetition
        strip.text = element_text(face = "bold", size = rel(1.1), hjust = 0),
        # Bold axis titles
        axis.title = element_text(face = "bold"),
        # Change X-axis label size
        axis.text.x = element_text(size = rel(1.4)),   
        # Change Y-axis label size
        axis.text.y = element_text(size = 14),   
        # Add some space above the x-axis title and make it left-aligned
        axis.title.x = element_text(margin = margin(t = 10), hjust = 0),
        # Add some space to the right of the y-axis title and make it top-aligned
        axis.title.y = element_text(margin = margin(r = 10), hjust = 1),
        # Add a light grey background to the facet titles, with no borders
        strip.background = element_rect(fill = "grey90", color = NA),
        # Add a thin grey border around all the plots to tie in the facet titles
        panel.border = element_rect(color = "grey90", fill = NA))

# 2) --- use 
# ggplot + lulas_theme

—————————————————————————-

Data sources

The data used in this analysis comes from the World Bank’s Projects & Operations database.

Since some pre-processing steps are computationally expensive, I did that in a separate read-only notebook (analysis/01a_WB_project_pdo_prep.qmd), where:

Retrieved manually ALL WB projects (22,569) approved between FY 1947 and 2026 as of 31/08/2024 using simply the Excel button on this page WBG Projects
- Of these, approximately 50% (11,322 projects) had a “viable” PDO text in the dataset (i.e., not blank or labeled as “TBD”, etc.).
- There are no Project Development Objectives available in projects approved before FY2001
- However, other than approval year, based on some tests on available projects’ features, PDO texts’ missingness seems to happen at random.
Dropped from the analysis: projects with no PDO text (including those approved before FY2001), and projects with missing status, FY of approval –> 8,811 usable projects selected.
Split the dataset into training / validation / test subsets (proportional to FY and regional distribution).
- Here I work only on training set (~50% of usable ones, i.e. 4,403 PDOs).
Cleaned the data (parse dates, recode variables, etc.), fix typos, unwanted special characters, and other unimportant issues in PDOs.
Obtained PoS tagging + tokenization with cleanNLP package (functions cnlp_init_udpipe() + cnlp_annotate()) and saved projs_train_t (cleaned train dataset).

— Note on cleanNLP package

cleanNLP supports multiple backends for processing text, such as CoreNLP, spaCy, udpipe, and stanza. Each of these backends has different capabilities and might require different initialization procedures.

CoreNLP ~ powerful Java-based NLP toolkit developed by Stanford, which includes many linguistic tools like tokenization, part-of-speech tagging, and named entity recognition.
- ❕❗️ NEEDS EXTERNAL INSTALLATION (must be installed in Java with cnlp_install_corenlp() which installs the Java JAR files and models)
spaCy ~ fast and modern NLP library written in Python. It provides advanced features like dependency parsing, named entity recognition, and tokenization.
- ❕❗️ NEEDS EXTERNAL INSTALLATION (fmst be installed in Python (with spacy_install() which installs both spaCy and necessary Python dependencies) and the spacyr R package must be installed to interface with it.
udpipe ~ R package that provides bindings to the UDPipe NLP toolkit. Fast, lightweight and language-agnostic NLP library for tokenization, part-of-speech tagging, lemmatization, and dependency parsing.
stanza~ another modern NLP library from Stanford, similar to CoreNLP but built on PyTorch and supports over 66 languages…

when you initialize a back-end (like CoreNLP) in cleanNLP, it stays active for the entire session unless you reinitialize or explicitly change it.

# ---- 1) Initialize the CoreNLP backend
library(cleanNLP) # A Tidy Data Model for Natural Language Processing  
cnlp_init_corenlp()
# If you want to specify a language or model path:
cnlp_init_corenlp(language = "en", 
                  # model_path = "/path/to/corenlp-models"
                  )

# ---- 2) Initialize the spaCy backend 
library(cleanNLP) # A Tidy Data Model for Natural Language Processing  
library(spacyr) # Wrapper to the 'spaCy' 'NLP' Library
# Initialize spaCy in cleanNLP
cnlp_init_spacy()
# Optional: specify language model
cnlp_init_spacy(model_name = "en_core_web_sm")

# ---- 3) Initialize the udpipe backend
library(cleanNLP) # A Tidy Data Model for Natural Language Processing # 
# Initialize udpipe backend
cnlp_init_udpipe(model_name = "english")

# ---- 4) Initialize the stanza backend

[TBL] Illustrative PDOs text in Projects’ documents

Project_ID	Project_Name	Project_Development_Objective
P127665	Second Economic Recovery Development Policy Loan	This development policy loan supports the Government of Croatia’s reform efforts with the aim to: (i) enhance fiscal sustainability through expenditure-based consolidation; and (ii) strengthen investment climate.
P069934	PERNAMBUCO INTEGRATED DEVELOPMENT: EDUCATION QUALITY IMPROVEMENT PROJECT	The development objectives of the Pernambuco Integrated Development: Education Quality Improvement Project are to (a) improve the quality, efficiency, and inclusiveness of the public education system; (b) modernize and strengthen the managerial, financial, and administrative capacity of the Secretariat of Education to set policies and guidelines for the sector and deliver public education efficiently; and (c) support the overall state modernization effort through interventions to be carried out in the Secretariat of Education and to be replicated in other state institutions.

Notes on PDO text data quality

First, it is important to notice that all 7,548 projects approved before FY2001 had no PDO text available.

The exploratory analysis of the 11,353 projects WITH PDO text revealed some interesting findings:

PDO text length: The PDO text is quite short, with a median of 2 sentences and a maximum of 9 sentences.
PDO text missingness: besides 11,306 projects with missing PDOs, 31 projects had some invalid PDO values, namely:
- 11 have PDO as one of: “.”,“-”,“NA”, “N/A”
- 7 have PDO as one of: “No change”, “No change to PDO following restructuring.”,“PDO remains the same.”
- 9 have PDO as one of: “TBD”, “TBD.”, “Objective to be Determined.”
- 4 have PDO as one of: “XXXXXX”, “XXXXX”, “XXXX”, “a”

Of the remaining 11,322 projects with a valid PDO, some more projects were excluded from the analysis for incompleteness:

3 projects without “project status”
2,176 projects without “board approval FY”
332 projects approved in FY >= FY2024 (for incomplete approval stage)

Lastly (and this was quite surprising to me) the remaining, viable 8,811 unique projects, were matched by only 7,582 unique PDOs! In fact, 2,235 projects share 1,006 NON-UNIQUE PDO text in the “cleaned” dataset. Why? Apparently, the same PDO is re-used for multiple projects (from 2 to as many as 9 times), likely in cases of follow-up phases of a parent project or components of the same lending program.”

In sum, the cleaning process yielded a usable set of 8,811 functional projects, which was split into a training subset (4,403) to explore and test models and a testing subset (4408), held out for post-prediction evaluation.

Note

Evidently, in some cases,the same PDO is used for multiple projects (from a minimum of 2 to a maximum of 9 time!!!), most likely when there is a parent project or subsequent phases of the same lending program.

—————————————————————————

Load pre-processed Projs’ dataset + PDO dataset

Here I will just load the pre-processed data (training set only).

[Saved file `projs_train_t` & `pdo_train_t`]

# Load Proj train dataset `projs_train_t`
projs_train <- readRDS("~/Github/slogan/data/derived_data/projs_train.rds")  

# Load clean tokenized-PDO dataset `pdo_train_t`
pdo_train_t <- readRDS(here::here("data" , "derived_data", "pdo_train_t.rds"))

Previous Tokenization and PoS Tagging

Typically, one of the first steps in this transformation from natural language to feature, or any of kind of text analysis, is tokenization.

i) Explain Tokenization

Breaking units of language into components relevant for the research question is called “tokenization”. Components can be words, n-grams, sentences, etc. or combining smaller units into larger units.

Tokenization is a row-wise operation: it changes the number of rows in the dataset.

The choices of tokenization

Should words be lower cased?
Should punctuation be removed?
Should numbers be replaced by some placeholder?
Should words be stemmed (also called lemmatization)? ☑️
Should bigrams/multi-word phrase be used instead of single word phrases? ☑️
Should stopwords (the most common words) be removed? ☑️
Should rare words be removed? ❌
Should hyphenated words be split into two words? ❌

for the moment I keep all as conservatively as possible

ii) Explain Pos Tagging

Linguistic annotation is a common for of enriching text data, i.e. adding information about the text that is not directly present in the text itself.

Upon this, e.g. classifying noun, verb, adjective, etc., one can discover intent or action in a sentence, or scanning “verb-noun” patterns.

Here I have a training dataset file with:

Variable	Type	Provenance	Description	Example
proj_id	chr	original PDO data
pdo	chr	original PDO data
word	chr	original PDO data		Governments
sid	int	output cleanNLP	sentence ID
tid	chr	output cleanNLP	token ID within sentence
token	chr	output cleanNLP	Tokenized form of the token.	government
token_with_ws	chr	output cleanNLP	Token with trailing whitespace	government
lemma	chr	output cleanNLP	The base form of the token	government
stem	chr	output SnowballC	The base form of the token	govern
upos	chr	output cleanNLP	Universal part-of-speech tag (e.g., NOUN, VERB, ADJ).
xpos	chr	output cleanNLP	Language-specific part-of-speech tags.
feats	chr	output cleanNLP	Morphological features of the token
tid_source	chr	output cleanNLP	Token ID in the source document
relation	chr	output cleanNLP	Dependency relation between the token and its head token
pr_name	chr	output cleanNLP	Name of the parent token
FY_appr	dbl	original PDO data
FY_clos	dbl	original PDO data
status	chr	original PDO data
regionname	chr	original PDO data
countryname	chr	original PDO data
sector1	chr	original PDO data
theme1	chr	original PDO data
lendinginstr	chr	original PDO data
env_cat	chr	original PDO data
ESrisk	chr	original PDO data
curr_total_commitment	dbl	original PDO data

— PoS Tagging: `upos` (Universal Part-of-Speech)

upos	n	percent	explan
ADJ	21261	0.0852623	Adjective
ADP	27050	0.1084777	Adposition
ADV	2950	0.0118303	Adverb
AUX	3588	0.0143888	Auxiliary
CCONJ	14236	0.0570902	Coordinating conjunction
DET	21505	0.0862408	Determiner
INTJ	57	0.0002286	Interjection
NOUN	70752	0.2837344	Noun
NUM	2190	0.0087825	Numeral
PART	8691	0.0348532	Particle
PRON	2330	0.0093439	Pronoun
PROPN	14856	0.0595765	Proper noun
PUNCT	28393	0.1138635	Punctuation
SCONJ	2160	0.0086622	Subordinating conjunction
SYM	316	0.0012672	Symbol
VERB	25806	0.1034889	Verb
X	3219	0.0129090	Other

Warning

On random visual check, these are not always correct, but they are a good starting point for now.

iii) Custom Stopwords

Remove stop words, which are the most common words in a language.

but I don’t want to remove any meaningful word for now

# Custom list of articles, prepositions, and pronouns
custom_stop_words <- c(
   # Articles
   "the", "a", "an",   
   "and", "but", "or", "yet", "so", "for", "nor", "as", "at", "by", "per",  
   # Prepositions
   "of", "in", "on", "at", "by", "with", "about", "against", "between", "into", "through", 
   "during", "before", "after", "above", "below", "to", "from", "up", "down", "under",
   "over", "again", "further", "then", "once",  
   # Pronouns
   "i", "me", "my", "myself", "we", "our", "ours", "ourselves", "you", "your",
   "yours", "yourself", "yourselves", "he", "him", "his", "himself", "she", "her", 
   "hers", "herself", "it", "its", "itself", "they", "them", "their", "theirs", "themselves" ,
   "this", "that", "these", "those", "which", "who", "whom", "whose", "what", "where",
   "when", "why", "how", "all", "any", "both", "each", "few", "more", "most", "other",
   # "some", "such", "no",  "not", 
   # "too", "very",   
   # verbs
   "is", "are", "would", "could", "will", "be", "e.g", "e.g.", "i.e.",
   "i", "ii", "iii", "iv", "v",
   # because tautology
   "pdo"
)

# Convert to a data frame if needed for consistency with tidytext
custom_stop_words_df <- tibble(word = custom_stop_words)

saveRDS(custom_stop_words, here("data" , "derived_data", "custom_stop_words.rds"))
saveRDS(custom_stop_words_df, here("data" , "derived_data", "custom_stop_words_df.rds"))

iv) Stemming

Often documents contain different versions of one base word, often called a stem. Stemming is the process of reducing words to their base or root form.

Snowball is one framework released in 1980 with an open-source license that can be found in R package SnowballC.

# Using `SnowballC::wordStem` to stem the words. e.g.
pdo_train_t <- pdo_train_t %>% 
   mutate(stem = SnowballC::wordStem(token_l)) %>%
   relocate(stem, .after = lemma)

Why Stemming?: For example, in topic modeling, stemming reduces noise by making it easier for the model to identify core topics without being distracted by grammatical variations. (Lemmatization is more computationally intensive as it requires linguistic context and dictionaries, making it slower, especially on large datasets)

Token	Lemma	Stem
development	development	develop
quality	quality	qualiti
high-quality	high-quality	high-qual
include	include	includ
logistics	logistic	logist
government/governance	Governemnt/government/governance	govern

NOTE: Among word / stems encountered in PDOs, there are a lot of acronyms which may refer to World Bank lingo, or local agencies, etc… Especially when looked at in low case form they don’t make much sense…

Notes on sparsity

Sparsity in the context of a document-term matrix refers to the proportion of cells in the matrix that contain zeros. High sparsity means that most terms do not appear in most documents.

removing stopwords before stemming can reduce sparsity
tidytext::cast_tdm turns a “tidy” one-term-per-document-per-row data frame into a Document-Term Matrix (DTM) from the tm package.
- this dataset contains 4403 documents (each of them a PDO) and 11029 terms (distinct words). Notice that this DTM is 100% sparse (100% of document-word pairings are zero, bc most pairings of document and term do not occur (they have the value zero).

# create document-word matrix
DTM <- pdo_train_t %>% 
   anti_join(custom_stop_words_df, by = c("token_l" = "word")) %>% 
   count(proj_id, token_l) %>%
   tidytext::cast_dtm(proj_id, token_l, n) # HIGH!!!

DTM
# <<DocumentTermMatrix (documents: 4403, terms: 11029)>>
# Non-/sparse entries: 129940/48430747
# Sparsity           : 100%
# Maximal term length: 34
# Weighting          : term frequency (tf)

v) Document-term matrix or TF-IDF

The tf-idf is the product of the term frequency and the inverse document frequency::

\[ \begin{aligned} tf(\text{term}) &= \frac{n_{\text{term}}}{n_{\text{terms in document}}} \\ idf(\text{term}) &= \ln{\left(\frac{n_{\text{documents}}}{n_{\text{documents containing term}}}\right)} \\ tf\text{-}idf(\text{term}) &= tf(\text{term}) \times idf(\text{term}) \end{aligned} \]

— TF-IDF matrix on train pdo

# reduce size 

pdo_train_4_tf_idf <- pdo_train_t %>% # 255964
   # Keep only content words [very restrictive for now]
   # normally c("NOUN", "VERB", "ADJ", "ADV")
   filter(upos %in% c("NOUN")) %>% #    72,668 
   filter(!token_l %in% c("development", "objective", "project")) %>%   #  66,741
   # get rid of stop words (from default list)   
   filter(!token_l %in% custom_stop_words_df$word) %>%   #  66,704
   # Optional: Remove lemmas of length 1 or shorter
   filter(nchar(lemma) > 1)  #  66,350

Now, count the occurrences of each lemma for each document. (This is the term frequency or tf)

# This is the term frequency or `tf`

# Count lemmas per document
lemma_counts <- pdo_train_4_tf_idf %>%
  count(proj_id, lemma, sort = TRUE)
# Preview the result
head(lemma_counts)

With the lemma counts prepared, the bind_tf_idf() function from the tidytext package computes the TF-IDF scores.

# Compute the TF-IDF scores
lemma_tf_idf <- lemma_counts %>%
  bind_tf_idf(lemma, proj_id, n) %>%
  arrange(desc(tf_idf))

head(lemma_tf_idf)

Note

What to use: token, lemma, or stem?

General Preference in Real-World NLP:

Tokens for analyses where word forms matter or for sentiment analysis.
Lemmas (*) for most general-purpose NLP tasks where you want to reduce dimensionality while maintaining accuracy and clarity of meaning.
Stems for very large datasets, search engines, and applications where speed and simplicity are more important than linguistic precision.

(*) I use lemma, after “aggressively” reducing the number of words to consider, and removing stop words (at least for now).

_______

TEXT ANALYSIS/SUMMARY

We are looking at (training data subset) pdo_train_t which has 249360 rows and 26 columns obtained from 4,403 PDOs (of which 4050 unique) of 4403 Wold Bank projects approved in Fiscal Years ranging from 2001 to 2023.

[TBL] Frequencies of documents/words/stems

entity	counts
N proj	4403
N PDOs	4050
N words	12953
N token	11114
N lemma	11178
N stem	8541

Term frequency

Note: normally, the most frequent words are function words (e.g. determiners, prepositions, pronouns, and auxiliary verbs), which are not very informative. Moreover, even content words (e.g. nouns, verbs, adjectives, and adverbs) can often be quite generic semantically speaking (e.g. “good” may be used for many different things).

However, in this analysis, I do not use the STOPWORD approach, but use the POS tags to reduce – in a more controlled way – the dataset, filtering the content words such as nouns, verbs, adjectives, and adverbs.

[FUNC] save plot Ouptput

Pointless bc does not render in the HTML output.

[FUNC] save plot Object

[FIG] Overall `token` freq ggplot

Excluding “project” “develop”,“objective”
Including only “content words” (NOUN, VERB, ADJ, ADV)

# Evaluate the title with glue first
title_text <- glue::glue("Most frequent TOKEN in {n_distinct(pdo_train_t$proj_id)} PDOs from projects approved between FY {min(pdo_train_t$FY_appr)}-{max(pdo_train_t$FY_appr)}") 

pdo_wrd_freq <- pdo_train_t %>%   # 123,927
   # include only content words
   filter(upos %in% c("NOUN", "VERB", "ADJ", "ADV")) %>%
   #filter (!(upos %in% c("AUX","CCONJ", "INTJ", "DET", "PART","ADP", "SCONJ", "SYM", "PART", "PUNCT"))) %>%
   filter (!(relation %in% c("nummod" ))) %>% # 173,686 
 filter (!(token_l %in% c("pdo","project", "development", "objective","objectives", "i", "ii", "iii",
                          "is"))) %>% # whne it is VERB
   count(token_l) %>% 
   filter(n > 800) %>% 
   mutate(token_l = reorder(token_l, n))   # reorder values by frequency

# plot 
pdo_wrd_freq_p <- pdo_wrd_freq %>% 
   ggplot(aes(token_l, n)) +
   geom_col(fill = "#d7b77b") +
   scale_y_continuous(breaks = seq(0, max(pdo_wrd_freq$n), by = 400)) + # directly use 'n' instead of .data$n
   coord_flip() + # flip x and y coordinates so we can read the words better
   labs(#title = title_text,
      subtitle = "[TOKEN with count > 800]", y = "", x = "")+
   geom_hline(yintercept = 800, linetype = "dashed", color = "#873c4a") +
   lulas_theme +
   theme(# Adjust angle and alignment of x labels
      axis.text.x = element_text(angle = 45, hjust = 1))

[FIG] Overall `stem` freq ggplot

Without “project” “develop”,“objective”
Including only “content words” (NOUN, VERB, ADJ, ADV)

# Evaluate the title with glue first
title_text <- glue::glue("Most frequent STEM in {n_distinct(pdo_train_t$proj_id)} PDOs from projects approved between FY {min(pdo_train_t$FY_appr)}-{max(pdo_train_t$FY_appr)}") 

# Plot
pdo_stem_freq <- pdo_train_t %>%   # 256,632
   # include only content words
   filter(upos %in% c("NOUN", "VERB", "ADJ", "ADV")) %>%
   filter (!(relation %in% c("nummod" ))) %>% # 173,686 
   filter (!(stem %in% c("pdo", "project", "develop", "object", "i", "ii", "iii"))) %>%
   count(stem) %>% 
   filter(n > 800) %>%
   mutate(stem = reorder(stem, n))    # reorder values by frequency
   
# plot 
pdo_stem_freq_p <-    pdo_stem_freq %>% 
   ggplot(aes(stem, n)) +
   geom_col(fill = "#d7b77b") +
   scale_y_continuous(breaks = seq(0, max(pdo_stem_freq$n), by = 400)) + # directly use 'n' instead of .data$n
   coord_flip() + # flip x and y coordinates so we can read the words better
   labs(#title = title_text,
      subtitle = "[STEM with count > 800]", y = "", x = "") +
   geom_hline(yintercept = 800, linetype = "dashed", color = "#873c4a") +
   lulas_theme +
   theme(# Adjust angle and alignment of x labels
      axis.text.x = element_text(angle = 45, hjust = 1))

Evidently, after stemming, more words (or stems) reach the threshold frequency count of 800 (they have been combined by root).

[FIG] `token` + `stem` freq ggplot

title2_text <- glue::glue("Most frequent TOKEN & STEM in {n_distinct(pdo_train_t$proj_id)} PDOs") 

subtitle2_text <- glue::glue("From projects approved between FY {min(pdo_train_t$FY_appr)}-{max(pdo_train_t$FY_appr)}") 

combo_freq_p <-  pdo_wrd_freq_p + pdo_stem_freq_p + 
   plot_annotation(title = title2_text,
                    subtitle = subtitle2_text,
                   # caption = "Source: World Bank Project Documents",
                   theme = theme(plot.title = element_text(size = 12, face = "bold"),
                                 plot.subtitle = element_text(size = 10, face = "italic"),
                                 plot.caption = element_text(size = 10, face = "italic"))
                   )  

combo_freq_p

#f_save_plot("combo_freq_p", combo_freq_p)
f_save_plot_obj(combo_freq_p, "combo_freq_p")

_______

SECTOR-related term frequency

Isolate SECTOR words and see frequency over years

To try and make it a bit more meaningful, let’s focus on the frequency of the most common words related to SECTORS.

From token_l, I created a “broad SECTOR” variable to group the sectors in broader definitions:

WAT_SAN = water|wastewater|sanitat|Sewer|sewage|Irrigat|Drainag|river basin|groundwater
TRANSPORT = transport|railway|road|airport|waterway|bus|metropolitan|inter-urban|aviation|highway|transit|bridge|port
URBAN = urban|housing|inter-urban|peri-urban|waste manag|slum|city|megacity|intercity|inter-city|town
ENERGY = energ|electri|hydroele|hydropow|renewable|transmis|grid|transmission|electric power|geothermal|solar|wind|thermal|nuclear power|energy generation
HEALTH = health|hospital|medicine|drugs|epidem|pandem|covid-19|vaccin|immuniz|diseas|malaria|HIV|AIDS|TB|maternal|clinic|nutrition
EDUCATION = educat|school|vocat|teach|univers|student|literacy|training|curricul|pedagog
AGR_FOR_FISH (Agriculture, Forestry, Fishing) = Agricultural|Agro|Fish|Forest|Crop|livestock|fishery|land|soil
** MINING_OIL_GAS** = Minin|oil|gas|mineral|quarry|extract|coal|natural gas|mine|petroleum|hydrocarbon
SOCIAL_PROT = Social Protec|social risk|social assistance|living standard|informality|insurance|social choesion|gig economy|human capital|employment|unemploy|productivity|wage lev|intergeneration|lifelong learn|vulnerab|empowerment|sociobehav
FINANCIAL = Bank|finan|Investment|credit|microfinan|loan|financial stability|banking|financial intermed|fintech
ICT = Information|Communication|ICT|Internet|telecom|cyber|data|AI|artificial intelligence|blockchain|e-learn|e-commerce|platform|software|hardware|digital
IND_TRADE_SERV = Industry|Trade|Service|manufactur|Tourism|Trade and Services|market|export|import|supply chain|logistic|distribut|e-commerce|retail|wholesale|trade facilitation|trade policy|trade agreement|trade barrier|trade finance|trade promotion|trade integration|trade liberalization|trade balance|trade deficit|trade surplus|trade war|trade dispute|trade negotiation|trade cooperation|trade relation|trade partner|trade route|trade corridor
“INSTIT_SUPP” = Government|Public Admin|Institution|Central Agenc|Sub-national Gov|law|justice|governance|Policy|Regulation|Public Expenditure|Public Investment|Public Procurement
“GENDER_EQUAL” = Gender|Women|Girl|Woman|femal|Gender Equal|gender-base|gender inclus|gender mainstream|gender sensit|gender respons|gender gap|gender-based|gender-sensitive|gender-responsive|gender-transform|gender-equit|gender-balance
“CLIMATE” = Climate|Environment|Sustain|Resilience|Adaptation|Mitigation|Green|Eco|Eco-|carbon|carbon cycle|carbon dioxide|climate change|ecosystem|emission|energy effic|greenhouse|greenhouse gas|temperature anomalies|zero net|green growth|low carbon|climate resilient|climate smart|climate tech|climate variab

pdo_train_t <- pdo_train_t %>%
   # dealing with water/watershed/waterway
   mutate(tok_sector_broad = case_when(
      stringr::str_detect(token_l, regex("water|wastewater|sanitat|Sewer|sewage|Irrigat|Drainag|river basin|groundwater", ignore_case = T)) ~ "WAT_SAN",
      stringr::str_detect(token_l, regex("transport|railway|road|airport|waterway|bus|metropolitan|inter-urban|aviation|highway|transit|bridge|port", ignore_case = T)) ~ "TRANSPORT",
      stringr::str_detect(token_l, regex("urban|housing|inter-urban|peri-urban|waste manag|slum|city|megacity|intercity|inter-city|town", ignore_case = T)) ~ "URBAN",
      stringr::str_detect(token_l, regex("energ|electri|hydroele|hydropow|renewable|transmis|grid|transmission|electric power|geothermal|solar|wind|thermal|nuclear power|energy generation", ignore_case = T)) ~ "ENERGY",   
      stringr::str_detect(token_l, regex("health|hospital|medicine|drugs|epidem|pandem|covid-19|vaccin|immuniz|diseas|malaria|HIV|AIDS|TB|maternal|clinic|nutrition", ignore_case = T))  ~ "HEALTH",
      stringr::str_detect(token_l, regex("educat|school|vocat|teach|univers|student|literacy|training|curricul|pedagog", ignore_case = T)) ~ "EDUCATION",
      # not infra 
      stringr::str_detect(token_l, regex("Agricultural|Agro|Fish|Forest|Crop|livestock|fishery|land|soil", ignore_case = T)) ~ "AGR_FOR_FISH",
      stringr::str_detect(token_l, regex("Minin|oil|gas|mineral|quarry|extract|coal|natural gas|mine|petroleum|hydrocarbon", ignore_case = T)) ~ "MINING_OIL_GAS",
      stringr::str_detect(token_l, regex("Social Protec|social risk|social assistance|living standard|informality|insurance|social choes|gig economy|human capital|employment|unemploy|productivity|wage lev|intergeneration|lifelong learn|vulnerab|empowerment|sociobehav", ignore_case = T)) ~ "SOCIAL_PROT",
      stringr::str_detect(token_l, regex("Bank|finan|Investment|credit|microfinan|loan|financial stability|banking|financial intermed|fintech", ignore_case = T)) ~ "FINANCIAL",
      stringr::str_detect(token_l, regex("Information|Communication|ICT|Internet|telecom|cyber|data|AI|artificial intelligence|blockchain|e-learn|platform|software|hardware|digital", ignore_case = T)) ~ "ICT",
      stringr::str_detect(token_l, regex("Industry|Trade|Service|manufactur|Tourism|Trade and Services|market|export|import|supply chain|logistic|distribut|e-commerce|retail|wholesale|trade facilitation|trade policy|trade agreement|trade barrier|trade finance|trade promotion|trade integration|trade liberalization|trade balance|trade deficit|trade surplus|trade war|trade dispute|trade negotiation|trade cooperation|trade relation|trade partner|trade route|trade corridor", ignore_case = T)) ~ "IND_TRADE_SERV",
      stringr::str_detect(token_l, regex("Government|Public Admin|Institution|Central Agenc|Sub-national Gov|law|justice|governance|Policy|Regulation|Public Expenditure|Public Investment|Public Procurement", ignore_case = T)) ~ "INSTIT_SUPP",
      stringr::str_detect(token_l, regex("Gender|Women|Girl|Woman|femal|Gender Equal|gender-base|gender inclus|gender mainstream|gender sensit|gender respons|gender gap|gender-based|gender-sensitive|gender-responsive|gender-transform|gender-equit|gender-balance", ignore_case = T)) ~  "GENDER_EQUAL" ,
            stringr::str_detect(token_l, regex("Climate chan|Environment|Sustain|Resilience|Adaptation|Mitigation|Green|Eco|Eco-|carbon|carbon cycle|carbon dioxide|climate change|ecosystem|emission|energy effic|greenhouse|greenhouse gas|temperature anomalies|zero net|green growth|low carbon|climate resilient|climate smart|climate tech|climate variab", ignore_case = T)) ~ "CLIMATE" ,
      TRUE ~ NA_character_)) %>% 
   relocate(tok_sector_broad, .after = token_l) # move the new column to the right of token_l

Data prep for sector plots

tabyl(pdo_train_t$tok_sector_broad)

# Create a custom color list for each sector
sector_colors <- c(
   "WAT_SAN" = "#26BDE2",  # SDG 6
   "ENERGY" = "#FCC30B", # 7SDG 
   "MINING_OIL_GAS" = "#23399b", # no SDG! 
   "URBAN" = "#FD9D24", # SDG 11
   "ICT" = "#0f7184",    # no SDG!
   "HEALTH" = "#4C9F38", # SDG 3
   "EDUCATION" = "#C5192D", # SDG 4
   # SDGS 
   "POVERTY" = "#E5243B", # SDG 1
   "ZERO_HUNGER" = "#DDA63A", # SDG 2
   "GENDER_EQUAL" = "#FF3A21", # SDG 5
   "WORK" = "#A21942", # SDG 8
   "INDUSTRY" = "#FD9D24", # SDG 9
   "INEQUALITY" =  "#DD1367", # SDG 10
   "CONSUMPTION" = "#BF8B2E", # SDG 12
   "CLIMATE" = "#3F7E44", # SDG 13
   "OCEANS" = "#0A97D9", # SDG 14
   "BIODIVERSITY" = "#56C02B", # SDG 15
   "PEACE" = "#00689D", # SDG 16
   "PARTNERSHIP" = "#19486A", # SDG 17
   # MINE
   "TRANSPORT" =  "#A6A6A6", 
   "AGR_FOR_FISH" = "#56C02B" , 
   "SOCIAL_PROT" = "#e28293", 
   "FINANCIAL" =  "#e60066",
   "IND_TRADE_SERV" = "#85239b",
   "INSTIT_SUPP" = "#49239b"
   )

# prepare data for plotting (count)
sector_broad_pdo <- pdo_train_t %>% 
   filter(!is.na(tok_sector_broad)) %>% 
   filter(tok_sector_broad %in% c("WAT_SAN", "ENERGY", "TRANSPORT", "URBAN", "MINING_OIL_GAS", "ICT", "HEALTH", "EDUCATION", 
                                  # not infrastructure
                                  "AGR_FOR_FISH", "GENDER_EQUAL", "CLIMATE",  "SOCIAL_PROT", "FINANCIAL", "IND_TRADE_SERV", "INSTIT_SUPP" )) %>%
   count(tok_sector_broad, FY_appr) %>% 
   #filter(n > 0) %>% 
   mutate(tok_sector_broad = factor(tok_sector_broad, levels = c(
      "WAT_SAN", "ENERGY", "TRANSPORT","URBAN","MINING_OIL_GAS","ICT", "HEALTH", "EDUCATION",
      # not infrastructure
       "AGR_FOR_FISH", "GENDER_EQUAL", "CLIMATE",  "SOCIAL_PROT", "FINANCIAL", "IND_TRADE_SERV", "INSTIT_SUPP"))) # reorder values by frequency
#df$FY

[FIG] faceted sector (`tok_sector_broad`) freq ggplot

# Evaluate the title with glue first
title_text <- glue::glue("Sector words frequency in PDO over FY {min(pdo_train_t$FY_appr)}-{max(pdo_train_t$FY_appr)}") 

# Plot
pdo_sect_broad_freq <- sector_broad_pdo %>% 
   # only the "original group
  filter(tok_sector_broad %in% c("WAT_SAN", "ENERGY", "TRANSPORT", "URBAN", "MINING_OIL_GAS", "ICT", "HEALTH", "EDUCATION" )) %>% 
   ggplot(.,
          aes(x = FY_appr, y = n, 
              group = tok_sector_broad, color = tok_sector_broad)) +
   geom_line(linetype = "dotted", alpha = 0.5, size = 1) +
   geom_point(size = 2) +
   scale_x_continuous(breaks =  seq(2001, 2023, by=  1)) +
   scale_y_continuous(breaks =  seq(0,300, by=  25)) +
   # ~ SDG colors 
    # scale_color_viridis_d(option = "magma", end = 0.9) + 
   scale_color_manual(values = sector_colors) +
   facet_wrap(~tok_sector_broad, ncol = 3, scales = "free")+ 
   guides(color = FALSE) +
   lulas_theme +
   theme(# Adjust angle and alignment of x labels
      axis.text.x = element_text(angle = 45, hjust = 1)) + 
   labs(title = "Sector words frequency in PDOs by fiscal years of approval",
        subtitle = "[Using \"custom\" broad sector definition]",
        x =  "",# "Board approval FY", 
        y = ""#"Counts of 'sector' word (tok_sector_broad)"
   ) + 
   # Add the reference line at y = 50, red, dashed, and transparent (50% opacity)
   geom_hline(yintercept = 50, linetype = "longdash", color = "#d02e4c", alpha = 0.40)   
   # geom_vline(data = subset(sector_broad_pdo, tok_sector_broad == "HEALTH"), 
   #            aes(xintercept = 2020), 
   #            linetype = "solid", color = "#9b6723",alpha = 0.35) +
   # geom_text(data = subset(sector_broad_pdo, tok_sector_broad == "HEALTH"), 
   #           aes(x = 2020, y = max(sector_broad_pdo$n)*0.65, label = "Covid"), 
   #           angle = 90, vjust = -0.5, color = "#9b6723") 

pdo_sect_broad_freq

[FUNC] Figure split sector (`tok_sector_broad`) freq ggplot

# --- Get a LIST of unique sectors (facets) and split the data
PDOsector_list <- base::split(x = sector_broad_pdo, f = sector_broad_pdo$tok_sector_broad)

# --- Create a function to plot for each sector with custom colors
f_plot_sector <- function(data) {
   # Get the sector name
   sector <- unique(data$tok_sector_broad)
   # Create the plot
   p <-ggplot(data = data, 
              aes(x = FY_appr, y = n, 
                  group = tok_sector_broad, color = tok_sector_broad)) +
      # By sector ... 
      geom_line(color = sector_colors[sector], linetype = "dotted", alpha = 0.5, size = 1) +   
      geom_point(color = sector_colors[sector], size = 3) +               
      scale_x_continuous(breaks = seq(2001, 2023, by = 1)) +
      scale_y_continuous(breaks = seq(0, max(data$n), by = 25)) +
      # custom
      lulas_theme + 
      theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
      labs(
         title = paste("\"",sector,"\" in PDOs by fiscal years of approval"),  # Use facet-specific title
         subtitle = "[Using a \"custom\" broad sector definition &\nshowing relevant WDR publication(s)]",
         x = "", 
         y = ""  # Remove y-axis label
      ) +
      # Ensure y-axis limit includes 50
      expand_limits(y = 50) + 
      # Add the reference line at y = 50, red, dashed, and transparent (50% opacity)
      geom_hline(yintercept = 50, linetype = "longdash", color = "#d02e4c", alpha = 0.75)
   # # Add vline and text annotation only for the HEALTH sector
   if (sector == "HEALTH") {
      p <- p +
         geom_vline(aes(xintercept = 2020), linetype = "solid", color = "#9b6723",alpha = 0.35) +
         geom_text(aes(x = 2020, y = max(n) * 0.75, label = "Covid"),
                   angle = 90, vjust = -0.5, color = "#9b6723")
   }

   return(p)
}

# --- Use purrr::map to create a LIST of plots, one for each sector
sector_plots <- map(PDOsector_list, f_plot_sector)

# --- (exe) Extract the first plot to display
#sector_plots$HEALTH

# ---- Optionally print each plot to the console
#walk(sector_plots, print)

# Define the output directory using the 'here' function
output_dir <- here("analysis", "output", "figures")

## Save each plot to a file in the specified directory
#walk2(sector_plots, names(PDOsector_list), 
#      ~ggsave(filename = file.path(output_dir, paste0(.y, "_sector_plot.png")), plot = .x))

# Save iteratively the plots' objects as RDS files
walk2(sector_plots, names(PDOsector_list), 
      ~f_save_plot_obj(.x, paste0("pdo_", .y, "_sect_p")))

#pdo_FINANCIAL_sect_p

_______

SECTOR in PDO v. WDR publications

For the (broadly defined) HEALTH sector, it is quite clear that Covid-19 is the main driver of the peak in 2020.

What about the other sectors? I was struck by the fact that, observing PDOs over time, the broadly defined “sector term” in the PDO always presents at least one peak and I wonder what could trigger it.

One possible explanation is that the PDOs somehow reflect the topics discussed by the World Development Reports (WDR) published annually by the World Bank. The WDR is a flagship publication of the World Bank that provides in-depth analysis of a specific aspect of development.

It is important to remark that these publications are not some speculative research endeavor, as they are deeply rooted in the concrete information that the Bank retrieves on the ground from projects and operations as they are supported and evaluated. In turn, the WDRs themselves inform the Bank’s policy priorities and operational strategies.

Therefore, it is reasonable to expect some kind of correlation between the topics discussed in the WDRs and the objectives of projects stated in in the PDOs.

Ingest WDR data

Previously created, as explained in data/derived_data/_provenance.md

# Read the WDR data
wdr <- readRDS(here("data","derived_data", "wdr.rds"))

— Manually add WDR 2023 ✍🏻

OKR Full item

library(tibble) # Simple Data Frames # Simple Data Frames

# Create a named list of NA values for subj_11 to subj_46
na_values <- setNames(rep(NA, 35), paste0("subj_", 12:46))

# Add a new row with the existing columns and NA for subj_11 to subj_46
wdr <- wdr %>%
  add_row(
    date_issued = 2023,
    decade = "2020s",
    id = NA, # ? 
    ISBN = "978-1-4648-1941-4",
    title = "Migrants, Refugees, and Societies",
    doc_mt_identifier_1 = "oai:openknowledge.worldbank.org:10986/39696", #? 
    subject_miss = NA,
    abstract = "Migration is a development challenge. About 184 million people-2.3 percent of the world’s population-live outside of their country of nationality. Almost half of them are in low- and middle-income countries. But what lies ahead? As the world struggles to cope with global economic imbalances, diverging demographic trends, and climate change, migration will become a necessity in the decades to come for countries at all levels of income. If managed well, migration can be a force for prosperity and can help achieve the United Nations’ Sustainable Development Goals. World Development Report 2023 proposes an innovative approach to maximize the development impacts of cross-border movements on both destination and origin countries and on migrants and refugees themselves. The framework it offers, drawn from labor economics and international law, rests on a “Match and Motive Matrix” that focuses on two factors: how closely migrants’ skills and attributes match the needs of destination countries and what motives underlie their movements. This approach enables policy makers to distinguish between different types of movements and to design migration policies for each. International cooperation will be critical to the effective management of migration.",
    url_keys = "https://openknowledge.worldbank.org/handle/10986/39696",
    altmetric = 150,
    all_topic = "Poverty Reduction,Social Development,Conflict and Development",
    all_subj = "migration,migrants,refugees,force displacement,crss-border mobility,remittances,origin country,international protection,refugee-hosting country,irregular migration,international cooperation",
    subj_1 = "migration",
    subj_2 = "migrants",
    subj_3 = "refugees",
    subj_4 = "force displacement",
    subj_5 = "crss-border mobility",
    subj_6 = "remittances",
    subj_7 = "origin country",
    subj_8 = "international protection",
    subj_9 = "refugee-hosting country",
    subj_10 = "irregular migration",
    subj_11 = "international cooperation",
    !!!na_values  # Unpack the NA values for subj_12 to subj_46
  )

— Manually add WDR 2024 ✍🏻

OKR Full item

library(tibble) # Simple Data Frames # Simple Data Frames

# Create a named list of NA values for subj_11 to subj_46
na_values <- setNames(rep(NA, 35), paste0("subj_", 12:46))
# https://documents.worldbank.org/en/publication/documents-reports/documentdetail/099042523192514880/p17826903573340450b2d00e8cfd3baf7ac
# https://openknowledge.worldbank.org/entities/publication/5e5ac9f1-71ee-4734-825e-60966658395f/full

# Add a new row with the existing columns and NA for subj_11 to subj_46
wdr <- wdr %>%
  add_row(
    date_issued = 2024,
    decade = "2020s",
    id = NA, # ? 
    ISBN = "978-1-4648-2078-6",
    title = "The Middle-Income Trap",
    doc_mt_identifier_1 = "oai:openknowledge.worldbank.org:10986/41919", #? 
    subject_miss = NA,
    abstract = "Middle-income countries are in a race against time. Many of them have done well since the 1990s to escape low-income levels and eradicate extreme poverty, leading to the perception that the last three decades have been great for development. But the ambition of the more than 100 economies with incomes per capita between US$1,100 and US$14,000 is to reach high-income status within the next generation. When assessed against this goal, their record is discouraging. Since the 1970s, income per capita in the median middle-income country has stagnated at less than a tenth of the US level. With aging populations, growing protectionism, and escalating pressures to speed up the energy transition, today’s middle-income economies face ever more daunting odds. To become advanced economies despite the growing headwinds, they will have to make miracles. Drawing on the development experience and advances in economic analysis since the 1950s, World Development Report 2024 identifies pathways for developing economies to avoid the “middle-income trap.” It points to the need for not one but two transitions for those at the middle-income level: the first from investment to infusion and the second from infusion to innovation. Governments in lower-middle-income countries must drop the habit of repeating the same investment-driven strategies and work instead to infuse modern technologies and successful business processes from around the world into their economies. This requires reshaping large swaths of those economies into globally competitive suppliers of goods and services. Upper-middle-income countries that have mastered infusion can accelerate the shift to innovation—not just borrowing ideas from the global frontiers of technology but also beginning to push the frontiers outward. This requires restructuring enterprise, work, and energy use once again, with an even greater emphasis on economic freedom, social mobility, and political contestability. Neither transition is automatic. The handful of economies that made speedy transitions from middle- to high-income status have encouraged enterprise by disciplining powerful incumbents, developed talent by rewarding merit, and capitalized on crises to alter policies and institutions that no longer suit the purposes they were once designed to serve. Today’s middle-income countries will have to do the same.",
    url_keys = "https://openknowledge.worldbank.org/handle/10986/41919",
    altmetric = 13,
   all_topic = "Macroeconomics,Economic Growth,Business Cycles and Stabilization Policies,Poverty Reduction,Achieving Shared Growth,Science and Technology Development,Innovation",
    all_subj = "middle-income trap,investment,infusion,innovation,technologies,competitive suppliers,economic freedom",
    subj_1   = "middle-income trap",
    subj_2   = "investment",
    subj_3   = "infusion",
    subj_4   = "innovation",
    subj_5   = "technologies",
    subj_6   = "competitive suppliers",
    subj_7   = "economic freedom",
    subj_8   = NA,
    subj_9   = NA,
    subj_10  = NA,
    subj_11  = NA,
    !!!na_values  # Unpack the NA values for subj_12 to subj_46
  )

— Manually correct WDR 2011 ✍🏻

wdr$url_keys [wdr$id == "4389"] <- "https://openknowledge.worldbank.org/handle/10986/4389"

wdr$altmetric [wdr$id == "4389"] <- "210"

wdr$abstract [wdr$id == "4389"] <- "The 2011 World development report looks across disciplines and experiences drawn from around the world to offer some ideas and practical recommendations on how to move beyond conflict and fragility and secure development. The key messages are important for all countries-low, middle, and high income-as well as for regional and global institutions: first, institutional legitimacy is the key to stability. When state institutions do not adequately protect citizens, guard against corruption, or provide access to justice; when markets do not provide job opportunities; or when communities have lost social cohesion-the likelihood of violent conflict increases. Second, investing in citizen security, justice, and jobs is essential to reducing violence. But there are major structural gaps in our collective capabilities to support these areas. Third, confronting this challenge effectively means that institutions need to change. International agencies and partners from other countries must adapt procedures so they can respond with agility and speed, a longer-term perspective, and greater staying power. Fourth, need to adopt a layered approach. Some problems can be addressed at the country level, but others need to be addressed at a regional level, such as developing markets that integrate insecure areas and pooling resources for building capacity Fifth, in adopting these approaches, need to be aware that the global landscape is changing. Regional institutions and middle income countries are playing a larger role. This means should pay more attention to south-south and south-north exchanges, and to the recent transition experiences of middle income countries."

wdr$all_topic [wdr$id == "4389"] <- tolower("Justice,Jobs,Political Violence and Civil War,Political Violence and War,Organized Crime,Fragility,Conflict and Violence,Crime,Social Cohesion,Public Sector Management,Social Development,Law and Development, Social Protections and Labor,Conflict and Development,Water Supply and Sanitation,Judicial System Reform, Labor Markets,Armed Conflict,Urban Solid Waste Management") 

# Define the subjects to be added for the specific row
subjects <- c(
  "Armed Conflict",
  "Civil Wars",
  "Conflict Prevention",
  "Conflict Resolution",
  "Development Policy",
  "Fragile States",
  "International Development",
  "Peacebuilding",
  "Political Instability",
  "Post-Conflict Reconstruction",
  "Security and Development"
) %>% tolower()  # Convert subjects to lowercase

# Ensure id is handled as character and enforce lowercase comparison
wdr <- wdr %>%
   mutate(across(starts_with("subj_"),
                 ~ ifelse(id == "4389", 
                          subjects[as.numeric(sub("^subj_", "", cur_column()))], 
                          NA_character_))) %>% 
   mutate (all_subj = if_else(id == "4389", paste0(subjects, collapse = ","), all_subj)) 

# Check the result for the row with id == "4389"
wdr %>% filter(id == "4389") %>% select(starts_with("subj_"))  # Display the updated subject columns


# check <- wdr[wdr$id == "4389",]

— Remove extra space in title column

# Check and remove leading space in the 'title' column
wdr <- wdr %>%
  mutate(title = str_trim(title, side = "left"))

— Re-save (upon correction) `wrd2.rds`

wdr2 <-  wdr
write_rds(x = wdr2, file = here::here("data", "derived_data","wdr2.rds"))

[TBL] World Develompent Reports 2000-2024

Below are the titles of the World Development Reports from 2000 to 2024.

date_issued	title	url_keys
2001	Attacking Poverty	https://openknowledge.worldbank.org/handle/10986/11856?show=full
2002	Building Institutions for Markets	https://openknowledge.worldbank.org/handle/10986/5984?show=full
2003	Sustainable Development in a Dynamic World--Transforming Institutions, Growth, and Quality of Life	https://openknowledge.worldbank.org/handle/10986/5985?show=full
2004	Making Services Work for Poor People	https://openknowledge.worldbank.org/handle/10986/5986?show=full
2005	A Better Investment Climate for Everyone	https://openknowledge.worldbank.org/handle/10986/5987?show=full
2006	Equity and Development	https://openknowledge.worldbank.org/handle/10986/5988?show=full
2007	Development and the Next Generation	https://openknowledge.worldbank.org/handle/10986/5989?show=full
2008	Agriculture for Development	https://openknowledge.worldbank.org/handle/10986/5990?show=full
2009	Reshaping Economic Geography	https://openknowledge.worldbank.org/handle/10986/5991?show=full
2010	Development and Climate Change	https://openknowledge.worldbank.org/handle/10986/4387?show=full
2011	Conflict, Security, and Development	https://openknowledge.worldbank.org/handle/10986/4389
2012	Gender Equality and Development	https://openknowledge.worldbank.org/handle/10986/4391?show=full
2013	Jobs	https://openknowledge.worldbank.org/handle/10986/11843?show=full
2014	Risk and Opportunity—Managing Risk for Development	https://openknowledge.worldbank.org/handle/10986/16092?show=full
2015	Mind, Society, and Behavior	https://openknowledge.worldbank.org/handle/10986/20597?show=full
2016	Digital Dividends	https://openknowledge.worldbank.org/handle/10986/23347?show=full
2017	Governance and the Law	https://openknowledge.worldbank.org/handle/10986/25880?show=full
2018	Learning to Realize Education's Promise	https://openknowledge.worldbank.org/handle/10986/28340?show=full
2019	The Changing Nature of Work	https://openknowledge.worldbank.org/handle/10986/30435?show=full
2020	Trading for Development in the Age of Global Value Chains	https://openknowledge.worldbank.org/handle/10986/32437?show=full
2021	Data for Better Lives	https://openknowledge.worldbank.org/handle/10986/35218?show=full
2022	Finance for an Equitable Recovery	https://openknowledge.worldbank.org/handle/10986/36883?show=full
2023	Migrants, Refugees, and Societies	https://openknowledge.worldbank.org/handle/10986/39696
2024	The Middle-Income Trap	https://openknowledge.worldbank.org/handle/10986/41919

Qualify: peak or trend (by sector)

Add `geomvline` to sector plots v WDR title [CMPL 🟠]

tabyl(pdo_train_t$tok_sector_broad)
# pdo_train_t$tok_sector_broad      n  WDR 

#                  AGR_FOR_FISH    665 WDR 2008  Agriculture for Development
#                     EDUCATION   1180 WDR 2004 Making Services Work for Poor People
#                        ENERGY    886 WDR 
#                     FINANCIAL   1843 WDR 
#                  GENDER_EQUAL    213 WDR 2012  Gender Equality and Development
#                        HEALTH    946 WDR 
#                           ICT    548 WDR 
#                IND TRADE SERV     60 WDR 
#           INSTITUTIONAL SUPP.   2171 WDR 
#                 MINING_OIL_GAS    299 WDR 
#                     TRANSPORT   1371 WDR 
#                         URBAN    553 WDR 
#                       WAT_SAN   1069 WDR

— ✅ AGR_FOR_FISH ( Agriculture, forestry, and fishing)

The WDR of 2008 was titled “Agriculture for Development”, link

# --- Get a LIST of unique sectors (facets) and split the data
PDOsector_list <- base::split(x = sector_broad_pdo, f = sector_broad_pdo$tok_sector_broad)
# Specific split df 
#PDOsector_list$'AGR_FOR_FISH'

# Specific plot 
pdo_agr_WDR_plot <- sector_plots$'AGR_FOR_FISH' +  
  geom_vline(xintercept = 2008, linetype = "solid", color = "#9b6723",alpha = 0.35) +
  geom_text(aes(x = 2008, y = max(n) * 0.5, label = "WDR Agric"), 
                   angle = 90, vjust = -0.5, color = "#9b6723")
pdo_agr_WDR_plot

#f_save_plot("pdo_agr_plot", pdo_agr_plot)
f_save_plot_obj (pdo_agr_WDR_plot, "pdo_agr_WDR_plot")

— ✅ EDUCATION

WDR 2007 was titled “Development and the Next Generation” WDR 2018 was titled “Learning to Realize Education’s Promise”

# Specific split df 
#PDOsector_list$EDUCATION

# Specific plot 
pdo_edu_WDR_plot <- sector_plots$EDUCATION + 
  geom_vline(xintercept = 2007, linetype = "solid", color = "#9b6723",alpha = 0.35) +
  geom_text(aes(x = 2007, y = max(n) * 0.45, label = "WDR Youth"), 
                   angle = 90, vjust = -0.5, color = "#9b6723") +
  geom_vline(xintercept = 2018, linetype = "solid", color = "#9b6723",alpha = 0.35) +
  geom_text(aes(x = 2018, y = max(n) * 0.30, label = "WDR Educ"), 
                   angle = 90, vjust = -0.5, color = "#9b6723")

pdo_edu_WDR_plot

#f_save_plot("pdo_edu_plot", pdo_edu_plot)
f_save_plot_obj (pdo_edu_WDR_plot, "pdo_edu_WDR_plot")

— ✅ CLIMATE (climate change)

The WDR of 2010 was titled ” Development and Climate Change”, link

# --- Get a LIST of unique sectors (facets) and split the data
PDOsector_list <- base::split(x = sector_broad_pdo, f = sector_broad_pdo$tok_sector_broad)
# GENDER split df 
#PDOsector_list$'CLIMATE'

# Specific plot 
pdo_clim_WDR_plot <- sector_plots$'CLIMATE' + 
   # geom_vline(xintercept = 2003, linetype = "solid", color = "#9b6723",alpha = 0.35) +
   # geom_text(aes(x = 2003, y = max(n) * 0.5, label = "WDR Sust Dev"), 
   #           angle = 90, vjust = -0.5, color = "#9b6723")+ 
   geom_vline(xintercept = 2010, linetype = "solid", color = "#9b6723",alpha = 0.35) +
   geom_text(aes(x = 2010, y = max(n) * 0.45, label = "WDR Climate change"), 
             angle = 90, vjust = -0.5, color = "#9b6723")

pdo_clim_WDR_plot

#f_save_plot("pdo_clim_plot", pdo_clim_plot)
f_save_plot_obj (pdo_clim_WDR_plot, "pdo_clim_WDR_plot")

— ✅ GENDER EQUALITY

the WDR of 2012 was titled “Gender Equality and Development”, link

# --- Get a LIST of unique sectors (facets) and split the data
PDOsector_list <- base::split(x = sector_broad_pdo, f = sector_broad_pdo$tok_sector_broad)
# GENDER split df 
#PDOsector_list$GENDER_EQUAL

# Specific plot 
pdo_gen_WDR_plot <- sector_plots$GENDER_EQUAL +
  geom_vline(xintercept = 2012, linetype = "solid", color = "#9b6723",alpha = 0.35) +
  geom_text(aes(x = 2012, y = max(n) * 0.45, label = "WDR Gender equal"), 
                   angle = 90, vjust = -0.5, color = "#9b6723")

pdo_gen_WDR_plot

#f_save_plot("pdo_gen_plot", pdo_gen_plot)
f_save_plot_obj (pdo_gen_WDR_plot, "pdo_gen_WDR_plot")

— SOCIAL PROTECTION

WDR 2004 ” Making Services Work for Poor People”

# Specific split df 
#PDOsector_list$SOCIAL_PROT

# Specific plot 
pdo_soc_WDR_plot <- sector_plots$SOCIAL_PROT + 
  geom_vline(xintercept = 2004, linetype = "solid", color = "#9b6723",alpha = 0.35) +
  geom_text(aes(x = 2004, y = max(n) * 0.75, label = "WDR Services"), 
                   angle = 90, vjust = -0.5, color = "#9b6723")

pdo_soc_WDR_plot

#f_save_plot("pdo_soc_plot", pdo_soc_plot)
f_save_plot_obj (pdo_soc_WDR_plot, "pdo_soc_WDR_plot")

— INSTITUTIONAL SUPPORT

WDR 2002 ” Building Institutions for Markets” WDR 2007 ” Governance and the Law”

# Specific split df 
# PDOsector_list$INSTIT_SUP

# Specific plot 
pdo_inst_WDR_plot <- sector_plots$INSTIT_SUPP + 
   geom_vline(xintercept = 2002, linetype = "solid", color = "#9b6723",alpha = 0.35) +
   geom_text(aes(x = 2002, y = max(n) * 0.75, label = "WDR Institutions"), 
             angle = 90, vjust = -0.5, color = "#9b6723") + 
   geom_vline(xintercept = 2007, linetype = "solid", color = "#9b6723",alpha = 0.35) +
   geom_text(aes(x = 2007, y = max(n) * 0.75, label = "WDR Governance"), 
             angle = 90, vjust = -0.5, color = "#9b6723")

pdo_inst_WDR_plot

#f_save_plot("pdo_inst_plot", pdo_inst_plot)
f_save_plot_obj (pdo_inst_WDR_plot, "pdo_inst_WDR_plot")

— ICT

WDR 2016 ” Digital Dividends” WDR 2021 ” Data for Better Lives”

# Specific split df 
# PDOsector_list$ICT

# Specific plot 
pdo_ict_WDR_plot <- sector_plots$ICT + 
   geom_vline(xintercept = 2016, linetype = "solid", color = "#9b6723",alpha = 0.35) +
   geom_text(aes(x = 2016, y = max(n) * 0.75, label = "WDR Digital Div"), 
             angle = 90, vjust = -0.5, color = "#9b6723") + 
   geom_vline(xintercept = 2021, linetype = "solid", color = "#9b6723",alpha = 0.35) +
   geom_text(aes(x = 2021, y = max(n) * 0.75, label = "WDR Data"), 
             angle = 90, vjust = -0.5, color = "#9b6723")

pdo_ict_WDR_plot

#f_save_plot("pdo_ict_plot", pdo_ict_plot)
f_save_plot_obj (pdo_ict_WDR_plot, "pdo_ict_WDR_plot")

— FINANCIAL

WDR 2005 ” A Better Investment Climate for Everyone” WDR 2022 ” Finance for an Equitable Recovery”

# Specific split df 
#PDOsector_list$FINANCIAL

# Specific plot 
pdo_fin_WDR_plot <- sector_plots$FIN + 
  geom_vline(xintercept = 2005, linetype = "solid", color = "#9b6723",alpha = 0.35) +
  geom_text(aes(x = 2005, y = max(n) * 0.80, label = "WDR Inv Clim"), 
                   angle = 90, vjust = -0.5, color = "#9b6723") +
  geom_vline(xintercept = 2022, linetype = "solid", color = "#9b6723",alpha = 0.35) +
  geom_text(aes(x = 2022, y = max(n) * 0.75, label = "WDR Finance"), 
                   angle = 90, vjust = -0.5, color = "#9b6723")
pdo_fin_WDR_plot

#f_save_plot("pdo_fin_plot", pdo_fin_plot)
f_save_plot_obj (pdo_fin_WDR_plot, "pdo_fin_WDR_plot")

_______

BIGRAMS

Here I use [clnp_annotate() output + ] dplyr to combine consecutive tokens into bigrams.

# Create bigrams by pairing consecutive tokens by sentence ID and token IDs
bigrams <- pdo_train_t %>%
   # keeping FY with tokens
   group_by(FY_appr, proj_id, pdo, sid ) %>%
   arrange(tid) %>%
   # Using mutate() and lead(), we create bigrams from consecutive tokens 
   mutate(next_token = lead(token), 
          bigram = paste(token, next_token)) %>%
   # make bigram low case
   mutate(bigram = tolower(bigram)) %>%
   # only includes the rows where valid bigrams are formed
   filter(!is.na(next_token)) %>%
   ungroup() %>%
   arrange(FY_appr, proj_id, sid, tid) %>%
   select(FY_appr,proj_id, pdo,sid, tid, token, bigram)

# most frequent bigrams 
count_bigram <- bigrams %>% 
   count(bigram, sort = TRUE)

Clean bigrams

The challenge is to clean but without separating consecutive words… so I do this split-reunite process to remove stopwords and punctuation. Basically only keeping bigrams made of 2 nouns or ADJ+noun.

# Separate the bigram column into two words
bigrams_cleaned <- bigrams %>%
  tidyr::separate(bigram, into = c("word1", "word2"), sep = " ")

# Remove stopwords and bigrams in EACH component word containing punctuation
bigrams_cleaned <- bigrams_cleaned %>%
   # custom stop words
   filter(!word1 %in% custom_stop_words_df$word, !word2 %in% custom_stop_words_df$word) %>% 
   # Remove punctuation   
   filter(!stringr::str_detect(word1, "[[:punct:]]"), !stringr::str_detect(word2, "[[:punct:]]"))  

# Reunite the component cleaned words into the bigram column
bigrams_cleaned <- bigrams_cleaned %>%
   unite(bigram, word1, word2, sep = " ") %>% 
   # Remove too obvious bigrams 
   filter(!bigram %in% c("development objective", "development objectives", 
                         "proposed project", "project development", "program development"))

# View the cleaned dataframe
bigrams_cleaned

# Count the frequency of each bigram
bigram_freq <- bigrams_cleaned %>%
  count(bigram, sort = TRUE)

[FIG] most frequent bigrams in PDOs

Excluding bigrams where 1 word is among stopwords or a punctuation sign
Excluding “development objective/s”, “proposed project”, “program development” because not very informative

# ---- Prepare data for plotting
# Evaluate the title with glue first
title_text <- glue::glue("Frequency of bigrams in PDOs over FY {min(pdo_train_t$FY_appr)}-{max(pdo_train_t$FY_appr)}") 

# Define the bigrams you want to highlight
bigrams_to_highlight <- c("public sector", "private sector", "eligible crisis",
                          "health care", "health services", "public health")   

 
# ---- Plot the most frequent bigrams
pdo_bigr_freq <- bigram_freq %>%
   slice_max(n, n = 25) %>%
   ggplot(aes(x = reorder(bigram, n), y = n,
              fill = ifelse(bigram %in% bigrams_to_highlight, bigram, "Other"))) +
   geom_col() +
   # coord flipped so n is Y axis
   scale_y_continuous(breaks = seq(min(bigram_freq$n)-1, max(bigram_freq$n), by = 50)) +
   scale_fill_manual(values = c("public sector" = "#005ca1", 
                                "private sector" = "#9b2339", 
                                "eligible crisis"= "#8e550a", 
                                "health care"= "#4C9F38",
                                "health services"= "#4C9F38",
                                "public health"= "#4C9F38", 
                                "Other" = "grey")) +
   guides(fill = "none") +
   coord_flip() +
   labs(title = title_text, subtitle = "(top 25 bigrams)",
        x = "", y = "") +
   theme(axis.text.y = element_text(
            # obtain vector of colors 2 match x axis labels color to fill
            color = bigram_freq %>%
               slice_max(n, n = 25) %>%
               # mutate(color = ifelse(bigram %in% bigrams_to_highlight,
               #                       ifelse(bigram == "public sector", "#005ca1",
               #                              ifelse(bigram == "private sector", "#9b2339", "#8e550a")),
               #                       "#4c4c4c")) 
               mutate(color = dplyr::case_when (
                  bigram == "public sector" ~ "#005ca1",
                  bigram == "private sector" ~ "#9b2339",
                  bigram == "eligible crisis" ~ "#8e550a",
                  bigram %in% c("health care", "health services", "public health") ~ "#4C9F38",
                  TRUE ~ "#4c4c4c")) %>%
               # Ensure the order matches the reordered bigrams (AS BINS)
               arrange(reorder(bigram, n)) %>%  
               # Extract the color column in bin order as vector to be passed to element_text()
               pull(color)
            )
         ) + lulas_theme

pdo_bigr_freq

Results are not surprising in terms of frequent bigram recurrence:

See for example “increase access”, “service delivery” ,“institutional capacity”, “poverty reduction” etc, at the top.
Although, while “health” recurred in several bigrams (e.g. “health services”, “public health”, “health care”) among the top 25, “education” did not appear at all.
A bit mysterious is perhaps “eligible crisis” (> 100 mentions)?! (coming back to this later)

[FIG] Changes over time BY 1FY

Besides huge, counter intuitive, difference between “health” and “education”, “climate change” appears in the top 25 (ranking above “financial sector” and “capacity building”) which begs the question: Has the frequency of these bigrams has changed over time?

# 
# ## too busy to be useful
# 
# # Step 1: Count the frequency of each bigram by year
# top_bigrams_1FY <- bigrams_cleaned %>%
#    group_by(FY_appr, bigram) %>%
#    summarise(count = n(), .groups = 'drop') %>%
#    arrange(FY_appr, desc(count)) %>%
#    # ---  +/- top 10  
#    group_by(FY_appr) %>%
#    top_n(10, count) %>%
#    ungroup()
#    # # ---  STRICT  top 10  
#    # mutate(rank = dense_rank(desc(count))) %>%  # Rank bigrams by frequency
#    # filter(rank <= 10) %>%  # Keep only the top 10 by rank
#    # ungroup()
# 
#   
# # Add specific bigrams to highlight, if any
# bigrams_to_highlight <- c("climate change",  "climate resilience", "public sector", "private sector")
# 
# # Step 2: Plot the top bigrams by frequency over time   
# pdo_bigr_FY_freq  <-  top_bigrams_1FY %>% 
#  ggplot(aes(x = reorder(bigram, count), 
#              y = count,
#              fill = ifelse(bigram %in% bigrams_to_highlight, bigram, "Other"))) +
#   geom_col() +
#   scale_fill_manual(values = c("public sector" = "#005ca1", "private sector" = "#e60066", 
#                                "climate change" = "#399B23", "climate resilience" = "#d8e600",
#                                "Other" = "grey")) +
#   guides(fill = "none") +
#   coord_flip() +
#   facet_wrap(~ FY_appr, scales = "free_y") +
#   labs(title = "Top 10 Bigrams by Frequency Over Time",
#        subtitle = "(Faceted by Fiscal Year Approval)",
#        x = "Bigrams",
#        y = "Count") +
#   theme_minimal() +
#   theme(plot.title.position = "plot",
#         axis.text.x = element_text(angle = 45, hjust = 1))+
#      lulas_theme
# 
# pdo_bigr_FY_freq

[FIG] Changes over time BY 3FY

To reduce the noise and make the plot more readable, we can group the data by 3 fiscal years (FY) intervals.

# generate FY group 
f_generate_year_groups <- function(years, interval) {
  breaks <- seq(floor(min(years, na.rm = TRUE) / interval) * interval, 
                ceiling(max(years, na.rm = TRUE) / interval) * interval, 
                by = interval)
  
  labels <- paste(breaks[-length(breaks)], "-", breaks[-1] - 1)
  
  return(list(breaks = breaks, labels = labels))
}

# --- Step 1: Create n-year groups (using `f_generate_year_groups`)
interval_i = 3 # decide the interval
year_groups <- f_generate_year_groups(bigrams_cleaned$FY_appr, interval = interval_i)
top_n_i = 12 # decide the top n bigrams to show

# --- Step 2: Add the generated FY breaks and labels to data frame
top_bigrams_FYper <- bigrams_cleaned %>%
   # cut divides the range of x into intervals
   mutate(FY_group = base::cut(FY_appr, 
                               breaks = year_groups$breaks, 
                               include.lowest = TRUE, 
                               right = FALSE, 
                               labels = year_groups$labels)) %>% 
   # Count the frequency of each bigram by n-year groups
   group_by(FY_group, bigram) %>%
   summarise(count = n(), .groups = 'drop') %>%
   arrange(FY_group, desc(count)) %>%
   # Top ? bigrams for each n-year period
   group_by(FY_group) %>%
   top_n(top_n_i, count) %>%
   ungroup()

# --- Step 3: Add specific bigrams to highlight, if any
bigrams_to_highlight <- c("climate change",  "climate resilience", 
                          "eligible crisis",  
                          "public sector", "private sector",
                          "water supply", "sanitation services",
                          "health care", "health services", "public health", "health preparedness"
                          )

# --- Step 4: Plot the top bigrams by frequency over n-year periods
pdo_bigr_FY_freq  <-  top_bigrams_FYper %>% 
 ggplot(aes(x = reorder(bigram, count), 
             y = count,
             fill = ifelse(bigram %in% bigrams_to_highlight, bigram, "Other"))) +
  geom_col() +
  scale_fill_manual(values = c(
     # "public sector" = "#005ca1", 
     # "private sector" = "#e60066", 
     "water supply" = "#26BDE2",
      "sanitation services" = "#26BDE2",
     "climate change" = "#3F7E44", 
     "climate resilience" = "#a6bd23",
     "eligible crisis" = "#e68000",  
     "health care" = "#E5243B",
     "health services" = "#E5243B",
     "public health" = "#E5243B",
     "Other" = "grey")) +
  guides(fill = "none") +
  coord_flip() +
  facet_wrap(~ FY_group, ncol = 3 , scales = "free_y" )+ 
              #strip.position = "top") +  # Facet wrap with columns
  labs(title = glue::glue("Top 12 Bigrams by Frequency Over {interval_i}-Year Periods"),
       subtitle =  "(Some sectors highlighted)",
       x = "",
       y = "") +
     lulas_theme

# print the plot
pdo_bigr_FY_freq

Frequency observed over FY intervals is very revealing.

Interesting to see the trend of “water supply” and “sanitation services” bigrams, which are quite stable over time.
The bigram “health care” and “health services” are also quite stable, while “public health” obviously gained relevance since the 2019-2021 FY period.
Conversely, “private sector” and “public sector” loose importance over time (around mid 2010s), while “climate change” and “climate resilience” gain relevance from the same point on.
Still quite surprising the bigram “eligible crisis”, which actually appears in the top 12 bigrams starting in FY 2016-2018!

🤔 Which are the most frequent and persistent Bigrams Over Time?

For this, I am looking for a ranking that considers Mean frequency across periods arrange(desc(mean_count)) + Stability (low standard deviation) across periods [this is hard bc of NAs], and NOT total count overall…

Using top_bigrams_FYper which had breaks of 3FY

# ------------------------------[REPEATED just to see the table]

# --- Step 1: Create n-year groups (using `f_generate_year_groups`)
interval_i = 3 # decide the interval
year_groups <- f_generate_year_groups(bigrams_cleaned$FY_appr, interval = interval_i)
top_n_i = 12 # decide the top n bigrams to show

# --- Step 2: Add the generated FY breaks and labels to data frame
top_bigrams_FYper <- bigrams_cleaned %>%
   # cut divides the range of x into intervals
   mutate(FY_group = base::cut(FY_appr, 
                               breaks = year_groups$breaks, 
                               include.lowest = TRUE, 
                               right = FALSE, 
                               labels = year_groups$labels)) %>% 
   # Count the frequency of each bigram by n-year groups
   group_by(FY_group, bigram) %>%
   summarise(count = n(), .groups = 'drop') %>%
   arrange(FY_group, desc(count)) %>%
   # Top ? bigrams for each n-year period
   group_by(FY_group) %>%
   top_n(top_n_i, count) %>%
   ungroup()

sd() returns NA for bigrams that are not present in any periods (or are present in just 1 period).

# Calculate the mean frequency and standard deviation of the counts for each bigram across periods
stable_and_frequent_bigrams_per <- top_bigrams_FYper %>%
   group_by(bigram) %>%
   summarise(mean_count = mean(count, na.rm = TRUE),     # Mean frequency across periods
             sd_count = sd(count, na.rm = TRUE),         # Stability (lower sd = more stable)
             count_non_na = sum(!is.na(count)),  # Count non-NA values
             sd_count2 = if_else(count_non_na >= 1, sd(count, na.rm = TRUE), NA_real_),  # Only calculate sd if >= 3 non-NA
             total_count = sum(count)) %>%               # Total count across all periods (optional)
   arrange(desc(mean_count)) %>%                      # Sort by frequency and then stability
   # Filter out bigrams with low mean frequency or high instability (you can adjust thresholds)
   # Focus on the top 25% most frequent bigrams
   filter(mean_count > quantile(mean_count, 0.70, na.rm = TRUE)) #%>% 
   # Focus on the most stable 50% (lower sd) ---> NO bc NA values
   #filter( sd_count < quantile(sd_count, 0.5, na.rm = TRUE))

[TBL] Bigrams Over Time [3FY]

# View the most frequent and stable bigrams
stable_and_frequent_bigrams_per %>% 
   slice_head(n = 15)  %>% kableExtra::kable()

bigram	mean_count	sd_count	count_non_na	sd_count2	total_count
increase access	39.83333	6.080022	6	6.080022	239
eligible crisis	37.33333	1.527525	3	1.527525	112
threat posed	33.00000	NA	1	NA	33
private sector	31.20000	10.917875	5	10.917875	156
health preparedness	31.00000	NA	1	NA	31
strengthen national	28.00000	NA	1	NA	28
service delivery	27.71429	5.313953	7	5.313953	194
climate change	27.00000	2.828427	2	2.828427	54
poverty reduction	27.00000	14.514361	4	14.514361	108
public health	25.50000	16.263456	2	16.263456	51
public sector	25.25000	8.301606	4	8.301606	101
institutional capacity	24.87500	6.577831	8	6.577831	199
improve access	24.57143	8.521681	7	8.521681	172
national systems	24.00000	NA	1	NA	24

Using top_bigrams_1FY which had breaks of 1FY

# --- Step 1: Create n-year groups (using `f_generate_year_groups`)
interval_i = 1 # decide the interval
year_groups <- f_generate_year_groups(bigrams_cleaned$FY_appr, interval = interval_i)
top_n_i = 12 # decide the top n bigrams to show

# --- Step 2: Add the generated FY breaks and labels to data frame
top_bigrams_1FY <- bigrams_cleaned %>%
   # cut divides the range of x into intervals
   mutate(FY_group = base::cut(FY_appr, 
                               breaks = year_groups$breaks, 
                               include.lowest = TRUE, 
                               right = FALSE, 
                               labels = year_groups$labels)) %>% 
   # Count the frequency of each bigram by n-year groups
   group_by(FY_group, bigram) %>%
   summarise(count = n(), .groups = 'drop') %>%
   arrange(FY_group, desc(count)) %>%
   # Top ? bigrams for each n-year period
   group_by(FY_group) %>%
   top_n(top_n_i, count) %>%
   ungroup()

# Calculate the mean frequency and standard deviation of the counts for each bigram across periods
stable_and_frequent_bigrams_1FY <- top_bigrams_1FY %>%
   group_by( bigram) %>%
   summarise(mean_count = mean(count, na.rm = TRUE),     # Mean frequency across periods
             sd_count = sd(count, na.rm = TRUE),         # Stability (lower sd = more stable)
             total_count = sum(count)) %>%               # Total count across all periods (optional)
   arrange(desc(mean_count)) %>%                      # Sort by frequency and then stability
   # Filter out bigrams with low mean frequency or high instability (you can adjust thresholds)
   # Focus on the top 25% most frequent bigrams
   filter(mean_count > quantile(mean_count, 0.70, na.rm = TRUE)) #%>% 
   # Focus on the most stable 50% (lower sd) ---> NO bc NA values
   #filter( sd_count < quantile(sd_count, 0.5, na.rm = TRUE))

[TBL] Bigrams Over Time [1FY]

# View the most frequent and stable bigrams
stable_and_frequent_bigrams_1FY %>% 
   slice_head(n = 15)   %>% kableExtra::kable()

bigram	mean_count	sd_count	total_count
mobile applications	21.00000	NA	21
public health	16.66667	3.0550505	50
threat posed	16.50000	2.1213203	33
health preparedness	15.50000	0.7071068	31
increase access	14.64706	5.1713293	249
eligible crisis	14.62500	10.1971635	117
strengthen national	14.00000	2.8284271	28
vulnerable households	13.00000	NA	13
respond promptly	12.50000	10.6066017	25
action plan	12.00000	NA	12
disaster risk	12.00000	NA	12
local governments	12.00000	NA	12
national systems	12.00000	1.4142136	24
world bank	12.00000	NA	12
climate resilience	11.66667	4.5092498	35

_______

Explore specific bigrams

— Public/Private ~ compare frequency over FY

A case in which looking at bigrams may be better than tokens is the question whether WB project are more focused on public or private sector. It is not easy to capture this information from the text, because:

“government” may be referred to the subject/counterpart of the project (e.g. “government of Mozambique”)
“private” is not necessarily referred to the “private sector” (e.g. “private households”)
“public” is not necessarily referred to the “public sector” (e.g. “public health”)

So, I narrow down to consecutive bigrams “public sector” and “private sector” to get an indicative frequency of these terms.

[FIG] Bigrams (“public sector”, “private sector”) freq plot

# Filter for the specific bigrams "public sector" and "private sector"
bigrams_pub_priv_sec <- bigrams %>%
   filter(bigram %in% c("public sector", "private sector"))

# Display the result
#bigrams_pub_priv_sec

# prepare data for plotting (count)
sector_bigr_df <- bigrams_pub_priv_sec %>% 
   count(FY_appr, bigram) %>% 
   # reorder values by frequency
   mutate(bigram = factor(bigram, levels = c("public sector", "private sector")))

# ---- Prepare data for plotting
# Evaluate the title with glue first
title_text <- glue::glue("Frequency of bigrams \"public sector\" and \"private sector\" in PDOs over FY {min(sector_bigr_df$FY_appr)}-{max(sector_bigr_df$FY_appr)}") 

two_col_contrast <- c( "#005ca1",  "#e60066" )

# Create a named vector for the legend labels with totals in a single pipeline
legend_labels <- sector_bigr_df %>%
   group_by(bigram) %>%
   # Calculate total counts for each bigram
   summarize(total_n = sum(n)) %>% 
   # Append totals to bigram names
   mutate(label = paste0(bigram, " (", total_n, ")")) %>%  
   # Create a named vector with bigram as names and labels as values
   {setNames(.$label, .$bigram)} # curly braces {} in a dplyr pipeline using . as ouptu from previous..

# ---- Plot
pdo_pub_pri_bigr <- ggplot(data = sector_bigr_df, aes(x = FY_appr, y = n, group = bigram, color = bigram)) +
   geom_line(linetype = "solid", alpha = 0.75, size = .5) +
   geom_point(size = 3) +
   scale_x_continuous(breaks = seq(2001, 2023, by = 1)) +
   scale_color_manual(values = two_col_contrast, 
                      labels = legend_labels) +  # Use modified labels
   lulas_theme +
   theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
   labs(title = title_text, 
        x = "", 
        y = "", 
        color = "") 
   

pdo_pub_pri_bigr

# Save the plot
#f_save_plot("pdo_pub_pri_bigr", pdo_pub_pri_bigr)
f_save_plot_obj(pdo_pub_pri_bigr, "pdo_pub_pri_bigr")

Note:

these are much less common than the single words.
What happens in FY 2014-2016 that makes these bigram drop in frequency of mention?

_______

TRIGRAMS

# Create bigrams by pairing consecutive tokens by sentence ID and token IDs
trigrams <- pdo_train_t %>%
   # keeping FY with tokens
   group_by(FY_appr, proj_id, pdo, sid ) %>%
   arrange(tid) %>%
   # Using mutate() and lead(), we create bigrams from consecutive tokens 
   mutate(next_token = lead(token), 
          next_next_token = lead(next_token),
          trigram = paste(token, next_token, next_next_token)) %>%
   # make  tri-grams lower case
   mutate(trigram = tolower(trigram)) %>%
   # remove NA values
   filter(!is.na(trigram)) %>%
   ungroup() %>% 
   arrange(FY_appr, proj_id, sid, tid ) %>% 
   select (FY_appr, proj_id, pdo, sid, tid, token, trigram)

# most frequent  trigrams
count_trigram <- trigrams %>% 
   count(trigram, sort = TRUE)

Clean trigrams

# Split the trigrams into three tokens
trigrams_split <- trigrams %>% 
   separate(trigram, c("token1", "token2", "token3"), sep = " ")  

# Remove stopwords and punctuation
trigrams_clean <- trigrams_split %>% 
   filter(!token1 %in% custom_stop_words,
          !token2 %in% custom_stop_words,
          !token3 %in% custom_stop_words) %>%
   filter(token1 != "na",
          token2 != "na",
          token3 != "na") %>%
   # Remove punctuation   
   filter(!stringr::str_detect(token1, "[[:punct:]]"), 
          !stringr::str_detect(token2, "[[:punct:]]"),
          !stringr::str_detect(token3, "[[:punct:]]"))  %>% 
   unite(trigram, token1, token2, token3, sep = " ") %>%
   select(FY_appr, proj_id, pdo, sid, tid, trigram)

# Count the frequency of each trigram
trigram_freq <- trigrams_clean %>% 
   count(trigram, sort = TRUE)

[FIG] Most frequent trigrams in PDOs

Excluding bigrams where 1 word is among stopwords or a punctuation sign
Excluding “development objective/s”, “proposed project”, “program development” because not very informative

# Evaluate the title with glue first
title_text <- glue::glue("Most frequent trigrams in PDOs over FY {min(trigrams_clean$FY_appr)}-{max(trigrams_clean$FY_appr)}")
 
# Define colors for specific highlights
highlight_colors <- c("Health" =  "#d02e4c", "Environment" ="#3F7E44", "Other" = "grey")


# Plot the most frequent trigrams
pdo_trigram_freq_plot <- trigram_freq %>%
   dplyr::filter(!trigram %in% c("project development objective",
                          "project development objectives",
                          "overall development objective",
                          "program development objective",
                          "program development objectives",
                          "proposed project development", 
                          "proposed development objectives", 
                          "proposed development objective",
                          "revised project development"
                          )) %>%
   top_n(25) %>%
   # plot the top 25 trigrams
   ggplot(aes(x = reorder(trigram, n), y = n,
              fill = dplyr::case_when(
                 stringr::str_detect(trigram, "health") ~ "Health",
                 stringr::str_detect(trigram, "environment") ~ "Environment",
                 stringr::str_detect(trigram, "climate") ~ "Environment",
                 stringr::str_detect(trigram, "greenhouse") ~ "Environment",
                 # stringr::str_detect(trigram, "sustain") ~ "Environment",
                 TRUE ~ "Other"))) +
   geom_col() +
   # coord flipped so n is Y axis
   scale_y_continuous(breaks = seq(min(trigram_freq$n)-1, max(trigram_freq$n), by = 50)) +
   coord_flip() +
   labs(title = title_text, subtitle = "(top 25 trigrams)",
        x = "", y = "") +
   scale_fill_manual(values = highlight_colors) +
   guides(fill = "none") +
   lulas_theme

pdo_trigram_freq_plot

_______

CONCORDANCES & KWIC

Concordances with specific bigrams

Concordancing is central to analyses of text and they often represents the first step in more sophisticated analyses of language data, because concordances are extremely valuable for understanding how a word or phrase is used, how often it is used, and in which contexts is used.

A concordance list is a list of all contexts in which a particular token appears in a corpus or text. Here I use it in association with the bigram “eligible crisis” to see in which context it appears in the PDOs.

Here I did it at the level of sentence, i.e. without tokenizing the text into words.

— eligible crisis ~ notable bigrams over FY

# reduce back to the original data
pdo_t <- pdo_train_t %>% 
   select(proj_id, pdo,pr_name, FY_appr, FY_clos, status, regionname, countryname,
          sector1, theme1,lendinginstr, env_cat, ESrisk, curr_total_commitment) %>%
   group_by(proj_id) %>% 
   slice(1)

First of all, let’s see what are the sentence that contain the bigram “eligible crisis” in the PDOs.

# Tokenize the text data into sentences
sentences <- pdo_t %>%
   unnest_tokens(sentence, pdo, token = "sentences", drop = FALSE)

# Count the number of sentences in each document
sentence_count <- sentences %>%
   group_by(proj_id) %>%
   summarise(num_sentences = n())

n_distinct(sentence_count$proj_id)  # number of projects
sum(sentence_count$num_sentences)   # total number of sentences

# ---- Define the bigram you want to find
target_bigram <- "eligible crisis"


# Filter sentences that contain the specific bigram
sentences_with_targ <- sentences %>%
   filter(stringr::str_detect(sentence, target_bigram))

# Define how many characters before and after the bigram to extract
chars_before <- 60  # Number of characters before the bigram
chars_after <- 60   # Number of characters after the bigram

# Add the extracted bigram and surrounding characters to the same dataframe
sentences_with_eligcris <- sentences_with_targ %>%
   mutate(closest_text = str_extract(sentence, paste0(".{0,", chars_before, "}", target_bigram, ".{0,", chars_after, "}"))) %>% 
   # View the updated dataframe with the closest_text column
   select(proj_id, #sentence, 
          closest_text)


# Define how many words before and after the bigram to extract
words_before <- 8  # Number of words before the bigram
words_after <- 8   # Number of words after the bigram

# Add the extracted bigram and surrounding words to the same dataframe
sentences_with_eligcris2 <- sentences_with_targ %>%
   mutate(closest_text = str_extract(sentence, 
                                     paste0("(", # Start a capture group
                                            # Match preceding words
                                            "(?:\\S+\\s+){0,", words_before, "}", 
                                            target_bigram, 
                                            # Match following words
                                            "(?:\\s+\\S+){0,", words_after, "}", 
                                            ")"
                                           ))) %>% 
   # View the updated dataframe with the closest_text column
   select(proj_id,  sentence, 
          closest_text)

n_distinct(sentences_with_eligcris2$proj_id)

There are 112 projects, for which the PDO has a sentences containing the bigram “eligible crisis” in the PDOs.

[TBL] Close phrase around bigram “eligible crisis”

It appears “eligible crisis or emergency” is a commonly used phrase in the PDOs, often accompanied by similar phrasing: “to respond promptly and effectively”. as well as “provide immediate and effective response to”. Presumably, a standard sentence that refers to a situation that qualifies for specific types of assistance or intervention under certain policies.

# Define the phrase you want to search for in the vicinity of the target bigram
phrase_to_search <- "respond promptly and effectively"

# Count how often the phrase appears in the vicinity of the target bigram
phrase_count <- sentences_with_eligcris2 %>%
  mutate(contains_phrase = stringr::str_detect(closest_text, phrase_to_search)) %>%  # Check if the phrase is present
  summarise(count = sum(contains_phrase))  # Count how many times the phrase is found

# View the result
tabyl(phrase_count$count)

32% of the (112) times, the bigram “eligible crisis” in the PDOs, it is accompanied by the phrase “respond promptly and effectively”.

Here are a few examples of the sentences containing the bigram “eligible crisis” and the phrase “respond promptly and effectively” OR immediate and effective response:

set.seed(555)
# Filter the sentences that contain the phrase
sample_with_eligcris2 <- sentences_with_eligcris2 %>% 
   ungroup() %>% 
   # take a random sample of 5 sentences
   sample_n(10) %>%
   select(proj_id, closest_text) %>% 
   mutate(
     closest_text = paste0("(...) ", closest_text),
     # Make "eligible crisis" bold by adding <b> tags
     closest_text = gsub("eligible crisis", "<b>eligible crisis</b>", closest_text),
     # Highlight by adding <mark> tags
     closest_text = gsub("(?i)(promptly and effectively|immediate and effective response)", # (?i) makes the match case-insensitive.
                         "<mark style='background-color: #d8e600;'>\\1</mark>", closest_text, perl = TRUE)
     )

# Print out sample in a kable 
elcr_k <- kable(sample_with_eligcris2, format = "html", 
                # Display the table with bold formatting
                escape = FALSE,
                col.names = c("WB Project ID","Excerpt of PDO Sentences with 'Eligible Crisis'")) %>% 
   kable_styling(full_width = FALSE)

elcr_k

WB Project ID	Excerpt of PDO Sentences with 'Eligible Crisis'
P179499	(...) and effective response in the case of an eligible crisis or emergency.
P176608	(...) promptly and effectively in the event of an eligible crisis or emergency.
P151442	(...) assistance programs and, in the event of an eligible crisis or emergency, to provide immediate and effective response
P177329	(...) eligible crisis or emergency, respond promptly and effectively to it.
P127338	(...) capacity to respond promptly and effectively in an eligible crisis or emergency, asrequired.
P158504	(...) immediate and effective response in case of an eligible crisis or emergency.
P173368	(...) immediate and effective response in case of an eligible crisis or emergency in the kingdom of cambodia.
P178816	(...) the project regions and to respond to an eligible crisis
P160505	(...) theproject area, and, in the event of an eligible crisis or emergency, to provide immediate and effective response
P149377	(...) mozambique to respond promptly and effectively to an eligible crisis or emergency.

# Save the table as an HTML file
write_rds(elcr_k, here("analysis", "output", "tables" ,"elcr_k.rds"))

— climate change ~ notable bigrams over FY [CMPL 🟠]

First of all, let’s see what are the sentence that contain the bigram “eligible crisis” in the PDOs.

# ---- Define the bigram you want to find
target_bigram <- "climate change"

# # Filter sentences that contain the specific bigram
# sentences_with_targ <- sentences %>%
#    filter(stringr::str_detect(sentence, target_bigram))
# 
# # Define how many words before and after the bigram to extract
# words_before <- 8  # Number of words before the bigram
# words_after <- 8   # Number of words after the bigram

# Add the extracted bigram and surrounding words to the same dataframe
sentences_with_climchang <- sentences %>%
   filter(stringr::str_detect(sentence, target_bigram)) %>% 
   mutate(closest_text = str_extract(sentence, 
                                     paste0("(", # Start a capture group
                                            # Match preceding words
                                            "(?:\\S+\\s+){0,", words_before, "}", 
                                            target_bigram, 
                                            # Match following words
                                            "(?:\\s+\\S+){0,", words_after, "}", 
                                            ")"
                                           ))) %>% 
   # View the updated dataframe with the closest_text column
   select(proj_id, pdo, sentence,
          closest_text)

There are 92 projects, for which the PDO has a sentences containing the bigram “climate change”in the PDOs.

[TBL] Close phrase around bigram “climate change”

I want to know which of these commonly used phrases are most often found in the vicinity of the bigram “climate change” in the PDOs.

# Count how often the phrase appears in the vicinity of the target bigram
close_words <- sentences_with_climchang %>%
  mutate(contains_what = dplyr::case_when(
     stringr::str_detect(sentence, "mitigat") ~ "mitigate",
     stringr::str_detect(sentence, "adapt") ~ "adapt",
     stringr::str_detect(sentence, "vulnerab") ~ "vulnerability",
     stringr::str_detect(sentence, "hazard") ~ "hazard",
     stringr::str_detect(sentence, "resil") ~ "resilience",
     TRUE ~ "..."))
 
# Count how often the phrase is found
close_words_sort <-  close_words %>% 
   filter(contains_what != "...") %>% 
  group_by(contains_what) %>% 
  summarise(count = n()) %>% 
  mutate(percentage = scales::percent(count/sum(count))) %>% 
   arrange(desc(count))

# Specify the words to highlight
highlight_words <- c("mitigate")
highlight_words2 <- c(  "resilience", "adapt")


clch_close_k <- close_words_sort %>% 
   kable(format = "html", 
         col.names = c("Near 'climate change'", "Count", "Percentage")) %>% 
   kable_styling(full_width = FALSE) %>% 
    # Light yellow background
   row_spec(which(close_words_sort$contains_what %in% highlight_words), 
            background = "#d8e600") %>% 
   row_spec(which(close_words_sort$contains_what %in% highlight_words2), 
            background = "#a6bd23")  

clch_close_k

Near 'climate change'	Count	Percentage
vulnerability	25	39.1%
mitigate	14	21.9%
resilience	14	21.9%
adapt	6	9.4%
hazard	5	7.8%

# save as object
write_rds(clch_close_k, here("analysis", "output", "tables" ,"clch_close_k.rds"))


#   <chr>         <int>       <chr>     
# 1 vulnerab         18       32.1%     
# 2 mitigate         12       21.4%     
# 3 resil            12       21.4%     
# 4 hazard            9       16.1%     
# 5 adapt             5       8.9%

Here are a few examples of the sentences containing the bigram “climate change” and the words “mitigate|adaptation”:

set.seed(888)
# Filter the sentences that contain the phrase
sentences_with_climchang2_k <-  sentences_with_climchang %>%
   filter(proj_id != "P125447") %>%
   # add a column to identify the phrases
   mutate(contains_what = case_when(
      stringr::str_detect(closest_text, "mitig") ~ "mitig",
      stringr::str_detect(closest_text, "adapt") ~ "adapt",
      stringr::str_detect(closest_text, "vulnerab") ~ "vulnerab",
      stringr::str_detect(closest_text, "hazard") ~ "hazard",
      stringr::str_detect(closest_text, "resil") ~ "resil",
      TRUE ~ "...")) %>% 
   filter(contains_what != "...") %>% 
   # take a random sample of 3  by word 
   group_by(contains_what) %>% 
   slice_sample(n = 3, replace = FALSE ) %>%
   select(contains_what, proj_id, closest_text) %>%
   mutate(closest_text = paste0("(...) ", closest_text),
          # Make "mutate" bold by adding <b> tags
          closest_text = gsub("climate change", "<b>climate change</b>", closest_text), 
          # highlight the phrases by adding <mark> tags (adapt, mitigate, etc.)
          closest_text = gsub("(?i)(adaptation|resilience)", # (?i) makes the match case-insensitive.
                              "<mark style='background-color: #a6bd23;'>\\1</mark>", closest_text, perl = TRUE),
          closest_text = gsub("(?i)(mitigation|mitigate)",  
                              "<mark style='background-color: #8e94d6;'>\\1</mark>", closest_text, perl = TRUE),
          closest_text = gsub("(?i)(hazard|vulnerability)", 
                              "<mark style='background-color: #e28293;'>\\1</mark>", closest_text, perl = TRUE)
          )

# save as object
write_rds(sentences_with_climchang2_k, here("analysis", "output", "tables" ,"sentences_with_climchang2_k.rds"))

#paint(sentences_with_climchang2_k)

# Prepare the kable table with subheaders based on 'contains_what'
sentences_with_climchang2_k %>%
  ungroup() %>%
  arrange(contains_what) %>%
  select(contains_what, proj_id, closest_text) %>%
  kable(format = "html", 
        escape = FALSE,
        col.names = c("Near word (root)", "WB Project ID", "Closest Text")) %>%
  kable_styling(full_width = FALSE)

Near word (root)	WB Project ID	Closest Text
adapt	P090731	(...) pilot adaptation measures addressing primarily, the impacts of climate change on their natural resource base, focused on biodiversity
adapt	P120170	(...) a multi-sectoral dpl to enhance climate change adaptation capacity is anticipated in the cps.
adapt	P129375	(...) objectives of the project are to: (i) integrate climate change adaptation and disaster risk reduction across the recipient’s
hazard	P174191	(...) and health-related hazards, including the adverse effects of climate change and disease outbreaks.
hazard	P123896	(...) agencies to financial protection from losses caused by climate change and geological hazards.
hazard	P117871	(...) buildings and infrastructure due to natural hazards or climate change impacts; and (b) increased capacity of oecs governments
mitig	P074619	(...) to help mitigate global climate change through carbon emission reductions (ers) of 138,000 tco2e
mitig	P164588	(...) institutional capacity for sustainable agriculture, forest conservation and climate change mitigation.
mitig	P094154	(...) removing carbon from the atmosphere and to mitigateclimate change in general.
resil	P154784	(...) to increase agricultural productivity and build resilience to climate change risks in the targeted smallholder farming and pastoralcommunities
resil	P112615	(...) the resilience of kiribati to the impacts of climate change on freshwater supply and coastal infrastructure.
resil	P157054	(...) to improve durability and enhance resilience to climate change
vulnerab	P149259	(...) to measurably reduce vulnerability to natural hazards and climate change impacts in grenada and in the eastern caribbean
vulnerab	P146768	(...) at measurably reducing vulnerability to natural hazards and climate change impacts in the eastern caribbean sub-region.
vulnerab	P117871	(...) at measurably reducing vulnerability to natural hazards and climate change impacts in the eastern caribbean sub-region.

   # Add subheaders based on the unique values in `contains_what`
  #group_rows(index = table(sentences_with_climchang2$contains_what))

— Keyword In Context (KWIC)

Keyword In Context (KWIC), or concordances, are the most frequently used method in corpus linguistics. The idea is very intuitive: we get to know more about the semantics of a word by examining how it is being used in a wider context.

Usually, the process involves: 1) tokenizing the text, 2) perform a search for a word and retrieve its concordances from the corpus. Typically, these extractions are displayed through keyword-in-context displays (KWICs), where the search term, also referred to as the node word, is showcased within its surrounding context, comprising both preceding and following words.

— Concordances

Using quanteda

file:///Users/luisamimmi/Github/slogan_old/docs/01b_WDR_data-exploration_abstracts.html

# I use again data = pdo_words
pdo_q_corpus <- quanteda::corpus(as.data.frame(projs_train), 
                               docid_field = "id",
                               text_field = "pdo",
                               meta = list("pr_name", "boardApprovalFY")
)
 
# --- example with individual keyword 
# Step 1) tokens
pdo_q_tokens <- quanteda::tokens(x = pdo_q_corpus,
                       remove_punct = TRUE,
                       remove_symbols = TRUE#,remove_numbers = TRUE
 ) %>% 
  quanteda::tokens_tolower() #%>%
 #quanteda::tokens_remove(pattern = custom_stop_words) %>%
 #quanteda::tokens_remove(pattern = c("project", "development", "bank", "world", "project", "projects"))
                                      
# #______ Step 2) kwic (individual exe )
# kwic_pdo_data <- quanteda::kwic(x = pdo_q_tokens, # define text(s)
#                                  # define pattern
#                                  pattern = quanteda::phrase(c("gender", "climate", "sustainab*")),
#                                  # define window size
#                                  window = 5) %>%
#     # convert into a data frame
#     as_tibble() %>%
#     left_join(projs_train, by = c("docname" =  "id")) %>%
#     # remove superfluous columns
#      dplyr::select( 'Year' = boardapprovalFY, 'Prj title' = pr_name, pre, keyword, post) %>%
#   #  slice_sample( n = 50) %>%
#    kbl(align = "c") # %>% kable_styling()
 
# ____ Step 2) kwic (on vector)
# Iterate `quanteda::kwic` over a vector of tokens | regex-modified-keywords
keywords <- c("gender", "climate", "sustainab*", "conditional*" )

# apply iteratively kwic over a vector of keywords
outputs_key <-  map(keywords, 
      ~quanteda::kwic(pdo_q_tokens,
                      pattern =  .x,
                      window = 5) %>% 
        as_tibble() %>%
        left_join(projs_train, by = c("docname" =  "id")) %>%  
        # remove superfluous columns
       dplyr::select( 'Year' = boardapprovalFY, 'Prj title' = pr_name, pre, keyword, post)
  )

# # all togetha 3
n = length(keywords)

# check the first element  
outputs_key[[1]] %>%
   kbl(align = "c")
outputs_key[[2]] %>%
   kbl(align = "c")

# this list  has no element names 
names(outputs_key)

— create kwic with phrases | purrr + print + save png

# Iterate `quanteda::kwic` over a vector of phrases/bigrams 
keywords_phrase <- c("pro-poor", "gender equality", "gender mainstreaming" )
 
# Step 1) tokens
# (done above) -> abs_q_tokens

# Step 2) kwic 
# apply iteratively kwic over a vector of bigrams
outputs_bigrams <- map(keywords_phrase,
                       ~quanteda::kwic(x = pdo_q_tokens, # define text(s) 
                                       # define pattern
                                       pattern = quanteda::phrase(.x),
                                       # define window size
                                       window = 5) %>%
                          # convert into a data frame
                          as_tibble() %>%
                          left_join(projs_train, by = c("docname" =  "id")) %>%  
                          ## remove superfluous columns
                          dplyr::select( 'Year' = boardapprovalFY, 'Prj title' = pr_name, pre, keyword, post)
)  

#  number ofo cbigrams 
n_bi = length(keywords_phrase)
n_bi # 7
# name this list's elements 
outputs_bigrams <- outputs_bigrams %>% 
  set_names(paste0("kwic_", keywords_phrase))  

# get rid of empty output dfs in list  
outputs_bigrams2 <- outputs_bigrams[sapply(
  outputs_bigrams, function(x) dim(x)[1]) > 0] # 4 left!
 
#or 
outputs_bigrams3 <- purrr::keep(outputs_bigrams, ~nrow(.) > 0)  # 4 left!

# -------------- print all 
#  walk + print -
#walk(.x = outputs_bigrams2, .f = print)

# -------------- save  all -> create multiple tables from a single dataframe and save them as images
# https://stackoverflow.com/questions/69323569/how-to-save-multiple-tables-as-images-using-kable-and-map/69323893#69323893

out_dir_tab <-  here::here("analysis", "output","tables")
 
outputs_bigrams2  %>%
  imap(~save_kable(file = paste0(out_dir_tab, '/', 'pdo_', .y, '_.png'),
                   # bs_theme = 'journal', 
                   self_contained = T, 
                   x = kbl(.x, booktabs = T, align = c('l','l', 'c')) %>%
                     kable_styling() 
  )
  )

_______

CORRESPONDENCE TEXT v. FEATURES

The particular dataset containing PDO text, offers the advantage of having also some other features that have been used to categorize the projects. This represents a fantastic opportunity to explore the correspondence between the text and such features.

—- Old plot

THIS STARTED FROM pdo_train_t bc I needed the tokens

# data long by PDOsector (sector_broad_pdo)
paint(sector_broad_pdo)
PDOsector_list

PDOsector_list$`MINING_OIL_GAS`
# Which PDOsector
tabyl(pdo_train_t$tok_sector_broad ) 

# --- Split data long into a LIST of subset by sector
PDOsector_list <- base::split(x = sector_broad_pdo, f = sector_broad_pdo$tok_sector_broad)
str(PDOsector_list)

#### [FUNC] Figure split sector (`tok_sector_broad`) freq ggplot
# --- Create a function to plot for each sector with custom colors
f_plot_pdo_sector <- function(PDO_sec_name) {

   # Retrieve specific sector data (sec as character)
   PDO_data_sec <- PDOsector_list[[as.character(PDO_sec_name)]]
   # Group by fiscal year of approval
   PDO_data_sec <-  PDO_data_sec %>% 
      group_by(FY_appr)  
   
   # Create the plot
   ggplot(data = PDO_data_sec, 
          aes(x = FY_appr, y = n)) +
      # By sector ... 
      geom_line(color = sector_colors[PDO_sec_name], linetype = "dotted", alpha = 0.5, size = 1) +   
      geom_point(color = sector_colors[PDO_sec_name], size = 3) +               
      scale_x_continuous(breaks = seq(2001, 2023, by = 1)) +
      scale_y_continuous(breaks = seq(0, max(PDO_data_sec$n), by = 25)) +
      # custom
      lulas_theme + 
      theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
      labs(
         title = paste("\"",PDO_sec_name,"\" in PDOs by fiscal years of approval"),  # Use facet-specific title
         subtitle = "[Using a \"custom\" broad sector definition]",
         x = "", 
         y = ""  # Remove y-axis label
      ) +
      # Ensure y-axis limit includes 50
      expand_limits(y = 50) + 
      # Add the reference line at y = 50, red, dashed, and transparent (50% opacity)
      geom_hline(yintercept = 50, linetype = "longdash", color = "#d02e4c", alpha = 0.75)

}

# --- Plot one sector 
# PDO_sec_name EDUCATION    ENERGY    HEALTH       ICT     MINING_OIL_GAS TRANSPORT    WAT_SAN 
# data_sec EX sector_list[["WAT_SAN"]]
f_plot_pdo_sector( PDO_sec_name = "WAT_SAN")

f_plot_pdo_sector( PDO_sec_name = "ENERGY")

f_plot_pdo_sector( PDO_sec_name = "TRANSPORT")

f_plot_pdo_sector( PDO_sec_name = "URBAN")

f_plot_pdo_sector( PDO_sec_name = "MINING_OIL_GAS")

f_plot_pdo_sector( PDO_sec_name = "ICT")

f_plot_pdo_sector( PDO_sec_name = "HEALTH")

f_plot_pdo_sector( PDO_sec_name = "EDUCATION")

# Use purrr::map to apply the function over the names of sector_list
map(names(PDOsector_list), f_plot_pdo_sector)

—- NEW plot

COMPARE PDO words v sector (the tag)

Basically I want to compare the trend over time of the frequency of my custom sector word (pdo_train_t$tok_sector_broad) in the PDO text, against the frequency of the sector tag in the dataset (sector1).

—- make `sector1_broad`

THIS STARTs FROM projs_train bc I needed the PROJECTS

# Data input
tabyl(projs_train$sector1)   

# let's select some clear cut sectors e.g. WATER AND SANITATION
projs_train <- projs_train %>%
   mutate (sector1_broad = case_when(
      # WAT_SAN
      sector1 == "Other Water Supply, Sanitation and Waste Management" ~ "WAT_SAN",
      sector1 == "Public Administration - Water, Sanitation and Waste Management" ~ "WAT_SAN",
      sector1 == "Sanitation" ~ "WAT_SAN",
      sector1 == "Water Supply" ~ "WAT_SAN",
      sector1 == "Waste Management" ~ "WAT_SAN",
      # ENERGY
      sector1 == "Energy Transmission and Distribution" ~ "ENERGY",
      sector1 == "Non-Renewable Energy Generation" ~ "ENERGY",
      sector1 == "Other Energy and Extractives" ~ "ENERGY",
      sector1 == "Public Administration - Energy and Extractives" ~ "ENERGY",
      sector1 == "Renewable Energy Biomass"  ~ "ENERGY", 
      sector1 == "Renewable Energy Geothermal"  ~ "ENERGY", 
      sector1 == "Renewable Energy Hydro"  ~ "ENERGY", 
      sector1 == "Renewable Energy Solar"  ~ "ENERGY", 
      sector1 == "Renewable Energy Wind"  ~ "ENERGY", 
      sector1 == "Renewable energy"  ~ "ENERGY", 
      
      # TRANSPORT
      sector1 == "Other Transportation" ~ "TRANSPORT",
      sector1 == "Public Administration - Transportation" ~ "TRANSPORT",
      sector1 == "Urban Transport" ~ "TRANSPORT",
      sector1 == "Rural and Inter-Urban Roads" ~ "TRANSPORT",
      sector1 == "Roads and highways" ~ "TRANSPORT",
      sector1 == "Ports/Waterways" ~ "TRANSPORT",
      sector1 == "Railways" ~ "TRANSPORT",
      sector1 == "Airports" ~ "TRANSPORT",
        # URBAN
      #niente  
      #  MINING_OIL_GAS
      sector1 == "MINING_OIL_GAS" ~ "MINING_OIL_GAS",
      sector1 == "Oil and Gas" ~ "MINING_OIL_GAS",
      # ICT
      sector1 == "ICT Infrastructure" ~ "ICT",
      sector1 == "ICT Services" ~ "ICT",
      sector1 == "Public Administration - Information and Communications Technologies" ~ "ICT",
      sector1 == "Other Information and Communications Technologies" ~ "ICT",
     # EDUCATION
      sector1 == "Other Education" ~ "EDUCATION",
      sector1 == "Primary education" ~ "EDUCATION",
      sector1 == "Public Administration - Education" ~ "EDUCATION",
      sector1 == "Tertiary education" ~ "EDUCATION",
      sector1 == "Secondary education" ~ "EDUCATION",
      sector1 == "Workforce Development and Vocational Education" ~ "EDUCATION",
      sector1 == "Adult, Basic and Continuing Education" ~ "EDUCATION",
      sector1 == "Early Childhood Education" ~ "EDUCATION",
      # HEALTH
      sector1 == "Health" ~ "HEALTH",
      sector1 == "Public Administration - Health" ~ "HEALTH",
      sector1 == "Health facilities and construction" ~ "HEALTH",
       # else
      TRUE ~ sector1
   ))

# check
pdo_train_t %>% 
   filter(tok_sector_broad %in% 
             c("WAT_SAN", "ENERGY", "TRANSPORT", "MINING_OIL_GAS", "ICT", "EDUCATION", "HEALTH")) %>%
   tabyl(tok_sector_broad, show_missing_levels =  T) 


projs_train %>% 
   filter(sector1_broad %in% 
             c("WAT_SAN", "ENERGY", "TRANSPORT", "MINING_OIL_GAS",           "ICT", "EDUCATION", "HEALTH")) %>% 
   tabyl(sector1_broad)

—- prep data `sector_broad_tag`

paint(projs_train )

# prep data
sector_broad_tag <- projs_train %>% 
      mutate(FY_appr = boardapprovalFY) %>%
   filter(!is.na(sector1_broad)) %>% 
   filter(sector1_broad %in% 
             c("WAT_SAN", "ENERGY", "TRANSPORT", "MINING_OIL_GAS",           "ICT", "EDUCATION", "HEALTH")) %>% 
   select (FY_appr,  sector1_broad )  %>% 
   # count(FY_appr, sector1_broad) %>% 
   # filter(n > 0) %>% 
   mutate(sector1_broad = factor(sector1_broad, levels = c(
      "WAT_SAN", "ENERGY", "TRANSPORT",#"URBAN",
      "MINING_OIL_GAS","ICT", "HEALTH", "EDUCATION" ))) # reorder values by frequency
#df$FY

# data long by sector1_broad (sector_broad_tag)
paint(sector_broad_tag)

# which tag_sector (gia tolto NA!!!)
#tabyl(sector_broad_tag$sector1 )
tabyl(sector_broad_tag$sector1_broad )

—- [FUNC] Plot each tag sector

Here I have much bigger numbers

# --- Split data long into a LIST of subset by sector
sector_list <- base::split(x =  sector_broad_tag, f = sector_broad_tag$sector1_broad)
str(sector_list) 
 
# --- FUNCTION to plot iteratively each sector (like f_plot_sector)
f_plot_tag_sector <- function( name_sec){
 # Ensure name_sec is treated as a character
   data_sec <- sector_list[[as.character(name_sec)]]
 #  data_sec <- sector_list[["ENERGY"]]

      data_sec <- data_sec %>% 
      group_by(FY_appr)  %>% 
       count() %>% 
       ungroup() #%>% 
      # #mutate(FY_appr = as.Date(FY_appr, format = "%Y-%m-%d"))
   
   # plot
   ggplot(data = data_sec, aes(x = FY_appr, y = n)) +
      geom_line(color = sector_colors[name_sec], linetype = "dotted", alpha = 0.5, size = 1) +   
      geom_point(color = sector_colors[name_sec], size = 3) +     
      scale_x_continuous(breaks = seq(2001, 2023, by = 1)) +
      scale_y_continuous(breaks = seq(0, max(data_sec$n), by = 5)) +
      labs(title = name_sec, x = "Year", y = "Number of projects") +
      # custom
      lulas_theme + 
      theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
      labs(
         title = paste("\"",name_sec,"\" in tag by fiscal years of approval"),  # Use facet-specific title
         subtitle = "[Using variable \"sector1\"]",
         x = "", 
         y = ""  # Remove y-axis label
      ) 

}

# --- Plot one sector 
# name_sec EDUCATION    ENERGY    HEALTH       ICT     MINING_OIL_GAS TRANSPORT    WAT_SAN 
# data_sec EX sector_list[["WAT_SAN"]]
f_plot_tag_sector( name_sec = "WAT_SAN")

f_plot_tag_sector( name_sec = "ENERGY")

f_plot_tag_sector( name_sec = "TRANSPORT")

f_plot_tag_sector( name_sec = "ICT")

f_plot_tag_sector( name_sec = "HEALTH")

f_plot_tag_sector( name_sec = "EDUCATION")

# Use purrr::map to apply the function over the names of sector_list
 map(names(sector_list), f_plot_tag_sector)

—- Combine two sets of data `sector_broad_tag` and `sector_broad_pdo`

# not sure why 
sector_broad_tag <- sector_broad_tag %>% 
   count(FY_appr, sector1_broad)

# Combine two sets of data
str(sector_broad_pdo)
str(sector_broad_tag)

sector_broad_combo <- left_join(sector_broad_pdo, sector_broad_tag, 
                                by = c("FY_appr", "tok_sector_broad" = "sector1_broad") ,
                                suffix = c("_pdo", "_tag")
                                ) %>% 
   filter (!is.na(n_tag))

sector_broad_combo

— [TAB] Kolmorogov-Smirnov test test of similarity with a table

In Kolmorogov-Smirnov test:

the null hypothesis is that the two distributions are the same
The alternative hypothesis is that the two distributions are different.

The test statistic is the maximum difference between the two cumulative distribution functions. The p-value is the probability of observing a test statistic as extreme as the one observed, assuming the null hypothesis is true.

# Function to calculate KS test results and save to a table without plotting
ks_results_k <- sector_broad_combo  %>%
   group_by(tok_sector_broad) %>%
   summarize(
      # ks_alt_hyp = ks.test(
      #   (n_pdo - min(n_pdo)) / (max(n_pdo) - min(n_pdo)),
      #   (n_tag - min(n_tag)) / (max(n_tag) - min(n_tag))
      # )$alternative, 
 
      # a) with normalization     
      ks_statistic = ks.test(
         (n_pdo - min(n_pdo)) / (max(n_pdo) - min(n_pdo)),
         (n_tag - min(n_tag)) / (max(n_tag) - min(n_tag)))$statistic, 
      
      ks_p_value = ks.test(
         (n_pdo - min(n_pdo)) / (max(n_pdo) - min(n_pdo)),
         (n_tag - min(n_tag)) / (max(n_tag) - min(n_tag)))$p.value,
      similarity = ifelse(ks_p_value > 0.05, "Similar", "Dissimilar"),
      
      # # b) without normalization
      #  ks_statistic_raw = ks.test(n_pdo, n_tag)$statistic,
      #  ks_p_value_raw = ks.test(n_pdo, n_tag)$p.value,
       ) %>%
   ungroup()  %>% 
   arrange(ks_p_value)

# save as object
write_rds(ks_results_k, here("analysis", "output", "tables" ,"ks_results_k.rds"))

# Count how often the phrase appears in the vicinity of the target bigram
ks_results_k %>% 
   kable(format = "html", 
         col.names = c("SECTORS", "KS statistic","KS p-value",  "Distributions"#, "KS statistic R","KS p-valueR" 
                       ),
         # Round  to 4 digits) 
         digits = c(0, 4, 4, 0#, 4,4
                    )) %>% 
   kable_styling(full_width = FALSE) %>%
   row_spec(which(ks_results_k$similarity == "Dissimilar"), background = "#e7d8da")

SECTORS	KS statistic	KS p-value	Distributions
ENERGY	0.6522	0.0001	Dissimilar
HEALTH	0.3913	0.0487	Dissimilar
WAT_SAN	0.3913	0.0544	Similar
EDUCATION	0.3478	0.1002	Similar
ICT	0.2857	0.3399	Similar
MINING_OIL_GAS	0.3333	0.3442	Similar
TRANSPORT	0.2174	0.6410	Similar

Additional plots

— [FIG] Kolmogorov-Smirnov Test for Similarity of PDO and TAG Distributions by Sector

ks_results_k %>% 
   ggplot(aes(x = reorder(tok_sector_broad, ks_p_value), y = ks_p_value)) +
   geom_col(fill = "#0073C2FF") +
   geom_text(aes(label = round(ks_p_value, 4)), vjust = -0.5) +
   coord_flip() +
   labs(
      title = "Kolmogorov-Smirnov Test for Similarity of PDO and TAG Distributions by Sector",
      x = "Sector",
      y = "P-Value"
   ) +
   theme_minimal() +
   theme(axis.text.x = element_text(angle = 45, hjust = 1))

Figure 1: Kolmogorov-Smirnov Test for Similarity of PDO and TAG Distributions by Sector

— [FIG] Correlation between PDO and TAG by sector

Correlation Coefficients: Display the correlation between n_pdo and n_tag within each sector

sector_broad_combo %>%
   ggplot(aes(x = n_pdo, y = n_tag)) +
   geom_point() +
   geom_smooth(method = "lm", se = FALSE) +
   facet_wrap(~tok_sector_broad, scales = "free") +
   labs(
      title = "Correlation between PDO and TAG by Sector",
      x = "Normalized PDO",
      y = "Normalized TAG"
   ) +
   theme_minimal()

Figure 2: Correlation between PDO and TAG by sector

— [FIG] KS test for PDO and TAG by sector

However, for comparing distributions of two continuous variables like n_pdo and n_tag (normalized in this case), the chi-square test may not be directly applicable. Instead, other methods are often more suitable, depending on the nature of the distributions:

Kolmogorov-Smirnov (KS) Test P-Values: Display the p-value from a Kolmogorov-Smirnov test comparing n_pdo and n_tag within each sector producing a p-value that indicates the probability of observing these distributions if they were the same. + A low p-value (typically < 0.05) suggests the distributions are significantly different, while a higher p-value suggests similarity. + A high p-value does not necessarily mean the distributions are identical, only that there is not enough evidence to reject the null hypothesis of similarity. + The KS test is non-parametric and makes no assumptions about the underlying distributions, making it a versatile tool for comparing distributions. + We should not standardize the samples if we wish to know if their distributions are identical or not. !!!

— [FUNC] Plot by sector for PDO and TAG by sector

# Define the plotting function
f_plot_sector_trends <- function(data, sector_name, title = NULL, subtitle = NULL) {
   # Filter data for the specified sector
   sector_data <- data %>% filter(tok_sector_broad == sector_name)
   
   # Normalize n_pdo and n_tag to range [0, 1] for the selected sector
   sector_data <- sector_data %>%
      mutate(
         n_pdo_norm = (n_pdo - min(n_pdo)) / (max(n_pdo) - min(n_pdo)),
         n_tag_norm = (n_tag - min(n_tag)) / (max(n_tag) - min(n_tag))
      )
   
   # ---- Calculate  KS test p-value for the selected sector
   sector_stats <- sector_data %>%
      summarize(
         # spearman_cor = cor(n_pdo_norm, n_tag_norm, method = "spearman", use = "complete.obs"),
         ks_p_value = ks.test(n_pdo_norm, n_tag_norm)$p.value,
         similarity = ifelse(ks_p_value > 0.05, "Similar", "Dissimilar")
      )
   
  # --- Create custom legend title with statistical information
   # custom_legend_title <- paste0(
   #    "Frequency Metrics"#,
   #    #"KS test p-value: ", round(sector_stats$ks_p_value, 4)
   # )   
   # Plot with both KS test annotations
   ggplot(sector_data, aes(x = FY_appr)) +
      # ---- (OR) Lighter line plot for normalized n_pdo
      # geom_line(aes(y = n_pdo_norm, color = "n_pdo"), size = 1, alpha = 0.3) +
      # # Points for normalized n_pdo
      # geom_point(aes(y = n_pdo_norm, color = "n_pdo"), size = 4, alpha = 0.5) +
      
      # ---- (OR) Bar plot for normalized n_pdo with fill in aes() to include in legend
      geom_bar(aes(y = n_pdo_norm, fill = "n_pdo"), stat = "identity", alpha = 0.5) +
      # -----Lighter line plot for normalized n_tag
      geom_line(aes(y = n_tag_norm, color = "n_tag"), size = 1, alpha = 0.3) +
      # -----Points for normalized n_tag
      geom_point(aes(y = n_tag_norm, color = "n_tag"), size = 4) +
      
      # --- Annotate KS test results directly on the plot
      annotate(
         "text", x = Inf, y = Inf, label = paste("KS test p-value:", round(sector_stats$ks_p_value, 4)),
         hjust = 1.1, vjust = 1.1, color = "black", size = 4
      ) +
      
      
      # ---- Set titles, labels, and theme
      labs(
         title = paste0(title, " - ", sector_name),
         subtitle = subtitle,
         x = "", 
         y = "",
          fill =  "PDO term frequency",
          color = "Sector label frequency"
         ) +
     # Customize colors for lines and bar fill
      scale_color_manual(
         values = c("n_tag" = "#00689D") 
      ) +
      scale_fill_manual(
         values = c("n_pdo" = "#8e550a") 
      ) +
      scale_x_continuous(breaks = unique(sector_data$FY_appr)) + 
      lulas_theme +
      theme(legend.position = "right",
            axis.text.x = element_text(angle = 45, hjust = 1) )
}

— [FIG] WATER & SANITATION

— 🟢 [FIG] ENERGY (most dissimilar trend)

save for blog

comp_pdo_sec_ENE_plot <-  f_plot_sector_trends(data = sector_broad_combo, sector_name = "ENERGY",
                   title = "Comparing frequency of sector trends over time" ,
                subtitle = "\`n_pdo\` = frequency in PDOs text; \`n_tag\` = frequency sector label (normalized values) \nKolmogorov-Smirnov test for similarity between the two trends")
 
#f_save_plot("pdo_agr_plot", pdo_agr_plot)
f_save_plot_obj (comp_pdo_sec_ENE_plot, "comp_pdo_sec_ENE_plot")

— 🟢 [FIG] TRANSPORT (most similar trend)

save for blog

comp_pdo_sec_TRANSP_plot <-  f_plot_sector_trends(data = sector_broad_combo, sector_name = "TRANSPORT",
                   title = "Comparing frequency of sector trends over time" ,
                subtitle = "\`n_pdo\` = frequency in PDOs text; \`n_tag\` = frequency sector label (normalized values) \nKolmogorov-Smirnov test for similarity between the two trends")
 
#f_save_plot("pdo_agr_plot", pdo_agr_plot)
f_save_plot_obj (comp_pdo_sec_TRANSP_plot, "comp_pdo_sec_TRANSP_plot")

— [FIG] MINING_OIL_GAS

— [FIG] ICT

— [FIG] HEALTH

— [FIG] EDUCATION

— [FIG] URBAN

# u_pdo <- f_plot_pdo_sector( PDO_sec_name = "URBAN")
# # u_tag <- f_plot_tag_sector( name_sec = "URBAN")
# 
# u_pdo + #u_tag + 
#    # plot_layout(guides = 'collect') + 
#    plot_layout(ncol = 1)  # Use `ncol = 1` to stack vertically or `ncol = 2` for side-by-side

COMPARE PDO words v sector ($$)

—- count of PDO with sector words `sector_broad_pdo` corresponding to each `tok_sector_broad` per year

This is a count of OBS per cell

sector_broad_pdo   # 345

—- sum of `sum_curr_total_commitment` corresponding to each `sector1_broad` per year

THIS STARTs FROM projs_train bc I needed the PROJECTS

# prep data
sector_broad_commit <- projs_train %>% 
   select(FY_appr = boardapprovalFY, sector1_broad, curr_total_commitment) %>% 
   # group_by(FY_appr, sector1_broad) %>% 
   # summarise(sum_curr_total_commitment = sum(curr_total_commitment)) %>% 
   # ungroup() %>% 
   # mutate(FY_appr = as.character(FY_appr)) %>% 
   filter (sector1_broad %in% c("WAT_SAN", "ENERGY","TRANSPORT", "MINING_OIL_GAS", "ICT", "HEALTH", "EDUCATION", "URBAN")) %>% 
   mutate(FY_appr = as.numeric(FY_appr)) %>% 
   arrange(sector1_broad,FY_appr) %>% 
   group_by(sector1_broad, FY_appr ) %>% 
   summarise(sum_commit = sum(curr_total_commitment) , .groups = "drop") %>%
   complete(sector1_broad, FY_appr = full_seq(FY_appr, 1), fill = list(sum_commit = 0))  # Fill missing years

sector_broad_pdo <- sector_broad_pdo %>% 
   filter(tok_sector_broad %in% c("WAT_SAN", "ENERGY","TRANSPORT", "MINING_OIL_GAS", "ICT", "HEALTH", "EDUCATION" ))  

paint(sector_broad_pdo)
nrow(sector_broad_pdo) # 161 
tabyl(sector_broad_pdo$tok_sector_broad, show_missing_levels = F, show_na = F)  
tabyl(sector_broad_pdo$FY_appr, show_missing_levels = F, show_na = F)  

paint(sector_broad_commit)
nrow(sector_broad_commit) # 161  
tabyl(sector_broad_commit$sector1_broad, show_missing_levels = F, show_na = F)  
tabyl(sector_broad_commit$FY_appr, show_missing_levels = F, show_na = F)

# merge the two datasets
sector_broad_pdo_comm <-  left_join (sector_broad_pdo, sector_broad_commit, by = c("tok_sector_broad" = "sector1_broad", "FY_appr" = "FY_appr"))

— [TAB] Kolmorogov-Smirnov test test of similarity with a table

In Kolmorogov-Smirnov test:

the null hypothesis is that the two distributions are the same
The alternative hypothesis is that the two distributions are different.

# Function to calculate KS test results and save to a table without plotting
ks_results2_k <- sector_broad_pdo_comm  %>%
   group_by(tok_sector_broad) %>%
   # min -max normalization
   mutate(n_scaled = (n - min(n, na.rm = TRUE)) / 
             (max(n, na.rm = TRUE) - min(n, na.rm = TRUE)),
          sum_commit_scaled = (sum_commit - min(sum_commit, na.rm = TRUE)) / 
             (max(sum_commit, na.rm = TRUE) - min(sum_commit, na.rm = TRUE)) ) %>%
   summarize(
      # -- a) with normalization     
      ks_statistic = ks.test(n_scaled, sum_commit_scaled)$statistic,
      ks_p_value = ks.test(n_scaled, sum_commit_scaled )$p.value,
      similarity = ifelse(ks_p_value > 0.05, "Similar", "Dissimilar"),
      # -- b) without normalization
      #  ks_statistic_raw = ks.test(n_pdo, n_tag)$statistic,
      #  ks_p_value_raw = ks.test(n_pdo, n_tag)$p.value,
       ) %>%
   ungroup()  %>% 
   arrange(ks_p_value)

# save as object
write_rds(ks_results2_k, here("analysis", "output", "tables" ,"ks_results2_k.rds"))

# Count how often the phrase appears in the vicinity of the target bigram
ks_results2_k %>% 
   kable(format = "html", 
         col.names = c("SECTORS", "KS statistic","KS p-value",  "Distributions"#, "KS statistic R","KS p-valueR" 
                       ),
         # Round  to 4 digits) 
         digits = c(0, 4, 4, 0#, 4,4
                    )) %>% 
   kable_styling(full_width = FALSE) %>%
   row_spec(which(ks_results2_k$similarity == "Dissimilar"), background = "#e7d8da")

SECTORS	KS statistic	KS p-value	Distributions
EDUCATION	0.6522	0.0001	Dissimilar
ICT	0.6522	0.0001	Dissimilar
HEALTH	0.5652	0.0010	Dissimilar
MINING_OIL_GAS	0.5217	0.0031	Dissimilar
ENERGY	0.3478	0.1235	Similar
TRANSPORT	0.2609	0.4218	Similar
WAT_SAN	0.2609	0.4218	Similar

— [FUNC] standardize and plot

Standardization is done by subtracting the mean and dividing by the standard deviation. This is done for both the n and sum_commit columns. In this way we can compare the two distributions on the same scale.
Robust Scaling: Subtract the median and divide by the IQR. This is more robust to outliers than standardization, but it doesn’t ensure the distributions have the same variance.
✅ Min-Max Scaling: Rescale both n and sum_commit to a [0, 1] range. This doesn’t assume normality and ensures both distributions are within the same bounds, though it doesn’t account for the shape of the distributions.

robust or min-max scaling alternatives can provide more reliable comparisons, especially with skewed data.

Kolmogorov-Smirnov (KS) Test P-Values: Display the p-value from a Kolmogorov-Smirnov test comparing rescaled trends within each sector producing a p-value that indicates the probability of observing these distributions if they were the same.
- DIFFERENT = A low p-value (typically < 0.05) suggests the distributions are significantly different, while a higher p-value suggests similarity.
- SIMILAR = A high p-value does not necessarily mean the distributions are identical, only that there is not enough evidence to reject the null hypothesis of similarity.
- The KS test is non-parametric and makes no assumptions about the underlying distributions, making it a versatile tool for comparing distributions.

# --- FUNCTION to 1) standardize 2 distributions and 2) plot iteratively each sector  
f_plot_sector_comm <- function(data, sector) {

      # ---- Filter data for the specified sector
   sector_data <- data %>%
      filter(tok_sector_broad == sector) %>%
      group_by(tok_sector_broad) %>%
      # # ---- Standardize n and sum_commit within each sector
      # mutate(n_standardized = (n - mean(n, na.rm = TRUE)) / sd(n, na.rm = TRUE),
      #        sum_commit_standardized = (
      #           sum_commit - mean(sum_commit, na.rm = TRUE)) / sd(sum_commit, na.rm = TRUE)) %>%
      # ---- Min-Max Scaling
      mutate(n_scaled = (n - min(n, na.rm = TRUE)) / 
                (max(n, na.rm = TRUE) - min(n, na.rm = TRUE)),
             sum_commit_scaled = (sum_commit - min(sum_commit, na.rm = TRUE)) / 
                (max(sum_commit, na.rm = TRUE) - min(sum_commit, na.rm = TRUE)) ) %>%
      ungroup()

      # ---- Calculate Spearman correlation and KS test p-value for the selected sector
   sector_stats <- sector_data %>%
      summarize(
         spearman_cor = cor(n_scaled, sum_commit_scaled, method = "spearman", use = "complete.obs"),
         ks_p_value = ks.test(n_scaled, sum_commit_scaled)$p.value,
         similarity = ifelse(ks_p_value > 0.05, "Similar", "Dissimilar")
      )

      
   # Extract the color for the sector line (you can set a specific color or use ggplot's color palette)
   pdo_color <- "#8e550a" # Get a color from ggplot's default palette
   commit_color <-"#00689D"  # Set a color for the secondary line
   
   # Plot the data for the selected sector
   ggplot(sector_data, aes(x = FY_appr)) +
      # --- geom_bar for rel_freq_n_pdo
      geom_line(aes(y = n_scaled), color = pdo_color, alpha = 0.75) +
      geom_point(aes(y = n_scaled), color = pdo_color, alpha = 0.75, size = 2) +
      # --- geom_line and geom_point for rel_freq_commitment
      geom_line(aes(y = sum_commit_scaled ), color = commit_color, linetype = "dashed" ) +
      geom_point(aes(y = sum_commit_scaled), color = commit_color, size = 2) +
      # --- scale
      scale_x_continuous(breaks = seq(2001, 2023, by = 1)) +
      scale_y_continuous(
         name = "N words in PDOs (mean = 0, sd = 1)",
         sec.axis = sec_axis(~., name = "$$ Committed (mean = 0, sd = 1)")
      ) +
      
            # --- Annotate KS test results directly on the plot
      annotate(
         "text", x = Inf, y = Inf, label = paste("KS test p-value:", round(sector_stats$ks_p_value, 4)),
         hjust = 1.1, vjust = 1.1, color = "black", size = 4
      ) +
      
       # Customize colors and set common legend title
        labs(
         title = paste("Word frequency in PDO v. amount committed for", sector),
         subtitle = "n_pdo and sum_commit are rescaled with Min-Max scaling \nKolmogorov-Smirnov test for similarity between the two trends",
         x = ""
      ) +
     
      # custom
      lulas_theme + 
      theme(   legend.position = "none",
               axis.text.x = element_text(angle = 45, hjust = 1),
               axis.title.y = element_text(color = pdo_color ), 
               axis.title.y.right = element_text(color =  commit_color ) 
      )
}
# --- Plot one sector 
# name_sec EDUCATION    ENERGY    HEALTH       ICT     MINING_OIL_GAS TRANSPORT    WAT_SAN 
# data_sec EX sector_list[["WAT_SAN"]]
f_plot_sector_comm(sector_broad_pdo_comm, sector = "TRANSPORT") # KS p-val = 0.42 similar

f_plot_sector_comm(sector_broad_pdo_comm, sector = "WAT_SAN") # KS p-val = 0.42 similar

f_plot_sector_comm(sector_broad_pdo_comm, sector = "ENERGY") # KS p-val = 0.12 similar

f_plot_sector_comm(sector_broad_pdo_comm, sector = "ICT") # KS p-val = 0.0001 DIFFERENT

f_plot_sector_comm(sector_broad_pdo_comm, sector = "HEALTH") # KS p-val = 0.001 DIFFERENT

f_plot_sector_comm(sector_broad_pdo_comm, sector = "EDUCATION") # KS p-val = 0.0001 DIFFERENT

# many with non commitment 
f_plot_sector_comm(sector_broad_pdo_comm, sector = "MINING_OIL_GAS") # KS p-val = 0.0031 DIFFERENT

— [FIG] Plot sector WAT_SAN

very similar trends in PDO and commitment

— [FIG] Plot sector ICT

very different trends in PDO and commitment

Set up

Set up ggplot2 theme

—————————————————————————-

Data sources

[TBL] Illustrative PDOs text in Projects’ documents

Notes on PDO text data quality

—————————————————————————

Load pre-processed Projs’ dataset + PDO dataset

[Saved file projs_train_t & pdo_train_t]

Previous Tokenization and PoS Tagging

i) Explain Tokenization

The choices of tokenization

ii) Explain Pos Tagging

— PoS Tagging: upos (Universal Part-of-Speech)

iii) Custom Stopwords

iv) Stemming

Notes on sparsity

v) Document-term matrix or TF-IDF

— TF-IDF matrix on train pdo

_______

TEXT ANALYSIS/SUMMARY

[TBL] Frequencies of documents/words/stems

Term frequency

[FUNC] save plot Ouptput

[FUNC] save plot Object

[FIG] Overall token freq ggplot

[FIG] Overall stem freq ggplot

[FIG] token + stem freq ggplot

_______

SECTOR-related term frequency

Isolate SECTOR words and see frequency over years

Data prep for sector plots

[FIG] faceted sector (tok_sector_broad) freq ggplot

[FUNC] Figure split sector (tok_sector_broad) freq ggplot

_______

SECTOR in PDO v. WDR publications

Ingest WDR data

— Manually add WDR 2023 ✍🏻

— Manually add WDR 2024 ✍🏻

— Manually correct WDR 2011 ✍🏻

— Remove extra space in title column

— Re-save (upon correction) wrd2.rds

[TBL] World Develompent Reports 2000-2024

Qualify: peak or trend (by sector)

Add geomvline to sector plots v WDR title [CMPL 🟠]

— ✅ AGR_FOR_FISH ( Agriculture, forestry, and fishing)

— ✅ EDUCATION

— ✅ CLIMATE (climate change)

— ✅ GENDER EQUALITY

— SOCIAL PROTECTION

— INSTITUTIONAL SUPPORT

— ICT

— FINANCIAL

_______

BIGRAMS

Clean bigrams

[FIG] most frequent bigrams in PDOs

[FIG] Changes over time BY 1FY

[FIG] Changes over time BY 3FY

🤔 Which are the most frequent and persistent Bigrams Over Time?

[TBL] Bigrams Over Time [3FY]

[TBL] Bigrams Over Time [1FY]

_______

Explore specific bigrams

— Public/Private ~ compare frequency over FY

[FIG] Bigrams (“public sector”, “private sector”) freq plot

_______

TRIGRAMS

Clean trigrams

[FIG] Most frequent trigrams in PDOs

_______

CONCORDANCES & KWIC

Concordances with specific bigrams

— eligible crisis ~ notable bigrams over FY

[TBL] Close phrase around bigram “eligible crisis”

— climate change ~ notable bigrams over FY [CMPL 🟠]

[TBL] Close phrase around bigram “climate change”

— Keyword In Context (KWIC)

— Concordances

— create kwic with phrases | purrr + print + save png

[Saved file `projs_train_t` & `pdo_train_t`]

— PoS Tagging: `upos` (Universal Part-of-Speech)

[FIG] Overall `token` freq ggplot

[FIG] Overall `stem` freq ggplot

[FIG] `token` + `stem` freq ggplot

[FIG] faceted sector (`tok_sector_broad`) freq ggplot

[FUNC] Figure split sector (`tok_sector_broad`) freq ggplot

— Re-save (upon correction) `wrd2.rds`

Add `geomvline` to sector plots v WDR title [CMPL 🟠]

—- make `sector1_broad`

—- prep data `sector_broad_tag`

—- Combine two sets of data `sector_broad_tag` and `sector_broad_pdo`

—- count of PDO with sector words `sector_broad_pdo` corresponding to each `tok_sector_broad` per year

—- sum of `sum_curr_total_commitment` corresponding to each `sector1_broad` per year