Project_ID | Project_Name | Project_Development_Objective |
---|---|---|
P127665 | Second Economic Recovery Development Policy Loan | This development policy loan supports the Government of Croatia's reform efforts with the aim to: (i) enhance fiscal sustainability through expenditure-based consolidation; and (ii) strengthen investment climate. |
P179010 | Tunisia Emergency Food Security Response Project | To (a) ensure, in the short-term, the supply of (i) agricultural inputs for farmers to secure the next cropping seasons and for continued dairy production, and (ii) wheat for uninterrupted access to bread and other grain products for poor and vulnerable households; and (b) strengthen Tunisia’s resilience to food crises by laying the ground for reforms of the grain value chain. |
A text exploration of the World Bank’s projects objectives (PDO)
TL;DR
The idea of analyzing language as data has always intrigued me. In this deep dive, I focus on ~4,000 World Bank Projects & Operations, zooming in on the short texts that describe the Project Development Objectives (PDOs), an abstract of sorts for the Bank’s operations.
This exploratory analysis revealed fascinating (and surprising) insights: patterns and correlations in the text, but also ways to enhance the quality of the project data itself.
(This is an ongoing project, so comments, questions, and suggestions are welcome. The R source code is open, albeit not very polished).
MOTIVATION
I have always been fascinated by the idea of analyzing language as data and I finally found some time to study Natural Language Processing (NLP) and Text Analytics techniques.
For this learning project, I explore a dataset of World Bank Projects & Operations, with a focus on the text data contained in the Project Development Objective (PDO) section of World Bank’s projects (loans, grants, technical assistance). A PDO outlines, in synthetic form, the proposed objectives of operations, as defined in the early stages of the World Bank project cycle.
Normally, a few objectives are listed in paragraphs that are a couple sentences long. Table 1 shows two examples.
The dataset also includes some relevant metadata about the projects, including country, fiscal year of approval, project status, main sector, main theme, environmental risk category, and lending instrument.
I retrieved the data from the WBG Projects page. The World Bank classifies this data as “public”, and it is accessible under a Creative Commons Attribution 4.0 International License.
DATA
The original dataset included 22,569 World Bank projects approved from fiscal year 1947 through 2025, as of August 31, 2024. Approximately half—11,322 projects—had a viable Project Development Objective (PDO) text (i.e., not blank or labeled as “TBD”, etc.), all approved after FY2001. From this group, some projects were excluded due to missing key variables.
This left 8,811 projects as usable observations for analysis.
Interestingly, within this refined subset, 2,235 projects share only 1,006 unique PDOs: recycled PDOs often appear in follow-up projects or components of a larger parent project.
Finally, from these 8,811 projects, a representative sample of 4,403 projects with PDOs was selected for further analysis.
First, it is important to notice that all 7,548 projects approved before FY2001 had no PDO text available.
The exploratory analysis of the 11,353 projects WITH PDO text revealed some interesting findings:
- PDO text length: The PDO text is quite short, with a median of 2 sentences and a maximum of 9 sentences.
- PDO text missingness: besides 11,306 projects with missing PDOs, 31 projects had some invalid PDO values, namely:
  - 11 have PDO as one of: “.”, “-”, “NA”, “N/A”
  - 7 have PDO as one of: “No change”, “No change to PDO following restructuring.”, “PDO remains the same.”
  - 9 have PDO as one of: “TBD”, “TBD.”, “Objective to be Determined.”
  - 4 have PDO as one of: “XXXXXX”, “XXXXX”, “XXXX”, “a”
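The placeholder values listed above can be screened out with a simple lookup. The following is a minimal Python sketch (the original analysis was done in R), assuming case-insensitive matching after trimming whitespace; the function name `is_valid_pdo` and the rule that treats any run of "X" as a placeholder are illustrative assumptions, not the author's code.

```python
import re

# Placeholder values reported in the text, lowercased for matching.
INVALID_PDO = {
    ".", "-", "na", "n/a",
    "no change", "no change to pdo following restructuring.",
    "pdo remains the same.",
    "tbd", "tbd.", "objective to be determined.",
    "a",
}

def is_valid_pdo(text):
    """Return True when a PDO string carries usable text (hypothetical helper)."""
    if text is None:
        return False
    cleaned = text.strip().lower()
    if cleaned == "" or cleaned in INVALID_PDO:
        return False
    # Assumption: any run of two or more "x" characters is a placeholder.
    if re.fullmatch(r"x{2,}", cleaned):
        return False
    return True

examples = ["TBD", "XXXXX", " N/A ", None, "To improve access to safe water."]
print([is_valid_pdo(p) for p in examples])
```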
Of the available 11,322 projects with a valid PDO, some more projects were excluded from the analysis for incompleteness:
- 3 projects without “project status”
- 2,176 projects without “board approval FY”
- 332 projects approved in FY >= FY2024 (for incomplete approval stage)
Lastly (and this was quite surprising to me), the remaining 8,811 viable projects were matched by only 7,582 unique PDOs! In fact, 2,235 projects share 1,006 non-unique PDO texts in the clean dataset. Why? Apparently, the same PDO is reused for multiple projects (from 2 to as many as 9 times), likely in cases of follow-up phases of a parent project or components of the same lending program.
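The arithmetic behind these figures is: unique PDOs = projects − projects sharing a PDO + distinct shared PDOs (8,811 − 2,235 + 1,006 = 7,582). A minimal Python sketch of that bookkeeping follows (the original analysis was done in R); the function name and the choice to normalise by trimming and lowercasing are assumptions made for illustration.

```python
from collections import Counter

def duplicate_pdo_summary(pdos):
    """Summarise how many projects share a PDO text (hypothetical helper)."""
    # Assumption: two PDOs count as identical after trimming and lowercasing.
    counts = Counter(p.strip().lower() for p in pdos)
    shared = {text: n for text, n in counts.items() if n > 1}
    return {
        "projects": len(pdos),
        "unique_pdos": len(counts),
        "projects_sharing": sum(shared.values()),
        "shared_pdos": len(shared),
    }

# Toy input: six projects, three distinct texts, two of them reused.
summary = duplicate_pdo_summary(["A", "a", "B", "C", "c ", "c"])
print(summary)
```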
In sum, the cleaning process yielded a usable set of 8,811 projects, which was split into a training subset (4,403) to explore and test models and a testing subset (4,408), held out for post-prediction evaluation.
Preprocessing the PDO text data
Cleaning text data entails extra steps compared to numerical data. A key process is tokenization, which breaks text into smaller units like words, bigrams, n-grams, or sentences. After that, a common cleaning task is normalization, where text is standardized (e.g., converting to lowercase). Similarly, data reduction techniques like stemming and lemmatization simplify words to their root form (e.g., “running,” “ran,” and “runs” become “run”). This can help to reduce dimensionality, especially with very large datasets, when the word form is not relevant.
Upon tokenization, it is very common to remove irrelevant elements like punctuation or stop words (unimportant words like “the”, “ii)”, “at”, or words repeated in context like “PDO”), which add noise to the data.
In contrast, data enhancement techniques like part-of-speech tagging add value by identifying grammatical components, allowing focus on meaningful elements like nouns, verbs, or adjectives.
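To make the preprocessing steps concrete, here is a minimal Python sketch (the original analysis was done in R) of tokenization, lowercasing, stop-word removal, a toy suffix-stripping stemmer, and n-gram construction. The stop-word list, the function names, and the example sentence are all hypothetical; a real pipeline would rely on a full stop-word lexicon and an established stemmer.

```python
import re

# Tiny illustrative stop-word list (an assumption; real lists are far longer).
STOP_WORDS = {"the", "to", "of", "and", "in", "for", "a", "an", "at",
              "is", "i", "ii", "pdo"}

def tokenize(text):
    """Lowercase the text and keep alphabetic word tokens only."""
    return re.findall(r"[a-z]+", text.lower())

def remove_stop_words(tokens):
    return [t for t in tokens if t not in STOP_WORDS]

def crude_stem(token):
    """Strip a few common suffixes (a toy stand-in for a real stemmer)."""
    if token.endswith("ss"):
        return token
    for suffix in ("ing", "es", "s"):
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token

def ngrams(tokens, n):
    """Join each run of n consecutive tokens into one string."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

pdo = ("The PDO is to (i) strengthen the investment climate; "
       "and (ii) improve access to health services.")
tokens = remove_stop_words(tokenize(pdo))
print(tokens)
print([crude_stem(t) for t in tokens])
print(ngrams(tokens, 2))
```

Note that the bigrams here are built after stop-word removal, so some pairs (for example "climate improve") join words that were not adjacent in the original sentence.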
TERM FREQUENCY PATTERNS
Figure 1 shows the most recurrent tokens and stems in the PDO text data.
Words and stems
Evidently, after stemming, more words (or stems) reach the threshold frequency count of 800 (as they have been combined by root). Despite the pre-processing of the PDO text data, these aren’t particularly informative words.
Bigrams
Figure 2 shows the most frequent bigrams in the PDO text data. The top-ranking bigrams align with expectations, featuring phrases like “increase access”, “service delivery”, “institutional capacity”, “poverty reduction”, etc., at the top. Notably, while “health” appears in several bigrams (e.g., “health services”, “public health”, “health care”), “education” is absent from the top 25. Another noteworthy observation is the frequent mention (over 100 instances) of “eligible crisis”, which was somewhat unexpected.
Trigrams
Figure 3 shows the most frequent trigrams in the PDO text data. Here, phrases involving “health” recur again, along with a few phrases revolving around “environmental” goals and terms that inherently belong together, like “water resource management”, “social safety net”, etc.
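The bigram and trigram rankings above boil down to counting n-grams across all PDOs. A minimal Python sketch (the original analysis was done in R) is shown below; the three sample PDOs and the short stop-word set are invented for illustration.

```python
import re
from collections import Counter

def top_ngrams(texts, n, stop_words, k=3):
    """Count n-grams over all texts and return the k most frequent."""
    counts = Counter()
    for text in texts:
        tokens = [t for t in re.findall(r"[a-z]+", text.lower())
                  if t not in stop_words]
        counts.update(" ".join(tokens[i:i + n])
                      for i in range(len(tokens) - n + 1))
    return counts.most_common(k)

# Hypothetical PDO-like sentences, not taken from the dataset.
sample_pdos = [
    "To improve service delivery and increase access to health services.",
    "To increase access to water resource management services.",
    "To strengthen institutional capacity for service delivery.",
]
stop_words = {"to", "and", "for", "the", "of", "in"}

print(top_ngrams(sample_pdos, 2, stop_words, k=2))
print(top_ngrams(sample_pdos, 3, stop_words, k=1))
```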
Sectors in the PDO text
To focus on a meaningful set of tokens, I examined the frequency of sector-related terms within the PDO text data. To capture the broader concept of “sector,” I created a comprehensive SECTOR variable that encompasses all relevant words within an expanded definition.
The “sector” term discussed here is not the sector variable available in the data, but an artificial construct reflecting the occurrence of terms that belong to the same sector’s semantic field. Besides conceptual association, these definitions are rooted in the World Bank’s own classification of sectors and sub-sectors.
Below are the “broad SECTOR” definitions used in this analysis:
- WAT_SAN = water|wastewater|sanitat|sewer|sewage|irrigat|drainag|river basin|groundwater
- TRANSPORT = transport|railway|road|airport|waterway|bus|metropolitan|inter-urban|aviation|highway|transit|bridge|port
- URBAN = urban|housing|inter-urban|peri-urban|waste manag|slum|city|megacity|intercity|inter-city|town
- ENERGY = energ|electri|hydroele|hydropow|renewable|transmis|grid|transmission|electric power|geothermal|solar|wind|thermal|nuclear power|energy generation
- HEALTH = health|hospital|medicine|drugs|epidem|pandem|covid-19|vaccin|immuniz|diseas|malaria|hiv|aids|tb|maternal|clinic|nutrition
- EDUCATION = educat|school|vocat|teach|univers|student|literacy|training|curricul|pedagog
- AGR_FOR_FISH = agricultural|agro|fish|forest|crop|livestock|fishery|land|soil
- MINING_OIL_GAS = minin|oil|gas|mineral|quarry|extract|coal|natural gas|mine|petroleum|hydrocarbon
- SOCIAL_PROT = social protec|social risk|social assistance|living standard|informality|insurance|social cohesion|gig economy|human capital|employment|unemploy|productivity|wage lev|intergeneration|lifelong learn|vulnerab|empowerment|sociobehav
- FINANCIAL = bank|finan|investment|credit|microfinan|loan|financial stability|banking|financial intermed|fintech
- ICT = information|communication|ict|internet|telecom|cyber|data|ai|artificial intelligence|blockchain|e-learn|e-commerce|platform|software|hardware|digital
- IND_TRADE_SERV = industry|trade|service|manufactur|tourism|trade and services|market|export|import|supply chain|logistic|distribut|e-commerce|retail|wholesale|trade facilitation|trade policy|trade agreement|trade barrier|trade finance|trade promotion|trade integration|trade liberalization|trade balance|trade deficit|trade surplus|trade war|trade dispute|trade negotiation|trade cooperation|trade relation|trade partner|trade route|trade corridor
- INSTIT_SUPP = government|public admin|institution|central agenc|sub-national gov|law|justice|governance|policy|regulation|public expenditure|public investment|public procurement
- GENDER_EQUAL = gender|women|girl|woman|femal|gender equal|gender-base|gender inclus|gender mainstream|gender sensit|gender respons|gender gap|gender-based|gender-sensitive|gender-responsive|gender-transform|gender-equit|gender-balance
- CLIMATE = climate chang|environment|sustain|resilience|adaptation|mitigation|green|eco|eco-|carbon|carbon cycle|carbon dioxide|climate change|ecosystem|emission|energy effic|greenhouse|greenhouse gas|temperature anomalies|zero net|green growth|low carbon|climate resilient|climate smart|climate tech|climate variab
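The definitions above read as regular-expression alternations, so they can be applied directly to flag PDOs and tally matches per fiscal year. Below is a minimal Python sketch (the original analysis was done in R) using three of the patterns copied from the list; the sample projects are invented. Whether the original code anchored patterns at word boundaries is not stated, so this sketch assumes plain substring search, which can over-match (for instance, “tb” also matches inside “outbreak”).

```python
import re
from collections import defaultdict

# Three of the broad SECTOR patterns, copied from the definitions above.
SECTOR_PATTERNS = {
    "WAT_SAN": r"water|wastewater|sanitat|sewer|sewage|irrigat|drainag|river basin|groundwater",
    "HEALTH": r"health|hospital|medicine|drugs|epidem|pandem|covid-19|vaccin|immuniz|diseas|malaria|hiv|aids|tb|maternal|clinic|nutrition",
    "EDUCATION": r"educat|school|vocat|teach|univers|student|literacy|training|curricul|pedagog",
}
COMPILED = {s: re.compile(p, re.IGNORECASE) for s, p in SECTOR_PATTERNS.items()}

def sector_counts_by_year(projects):
    """projects: iterable of (fiscal_year, pdo_text) pairs.
    Returns {sector: {year: number of PDOs mentioning the sector}}."""
    counts = {s: defaultdict(int) for s in COMPILED}
    for year, pdo in projects:
        for sector, rx in COMPILED.items():
            if rx.search(pdo):  # substring match, no word boundaries
                counts[sector][year] += 1
    return {s: dict(by_year) for s, by_year in counts.items()}

# Hypothetical projects for illustration only.
sample = [
    (2019, "Improve access to water supply and sanitation services."),
    (2020, "Strengthen the health system response to the covid-19 pandemic."),
    (2020, "Improve learning outcomes in primary schools and expand vaccination."),
]
print(sector_counts_by_year(sample))
```

Wrapping each alternative in `\b...` would reduce false positives, at the cost of missing some inflected forms; which trade-off suits the data is a judgment call.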
The occurrence trends over time for key sector terms are shown in Figure 4.
Interestingly, all the broadly defined “sector terms” in the PDOs present one or more peaks at some point in time. For the (broadly defined) HEALTH sector, it is likely that Covid-19 triggered the peak in 2020. What about the other sectors? What could be the driving reason?
A possible explanation is that the PDOs may echo themes from the World Development Reports (WDR), the World Bank’s flagship annual publication that analyzes a key development issue each year. Far from being speculative research, each WDR is grounded in the Bank’s field-based insights and, in turn, it informs the Bank’s policy and operational priorities. This would suggest a likely alignment between WDR themes and project objectives in the PDOs.
To some extent, visual exploration (see examples below) seems to support this hypothesis: thematically relevant WDRs consistently appear in close proximity to peaks in sector-related term frequencies. However, further validation is necessary. Additionally, preparing each WDR typically takes 2-3 years, so a temporal alignment with project documents may include some lag.
Examples of sectors-term trend
Figure 5 shows a “combined sector” that is quite broadly defined (AGRICULTURE, FORESTRY, FISHING) with the highest peak in 2010, two years after the publication of the WDR on “Agriculture for Development”. Perhaps the “alignment” hypothesis is not very meaningful with such a broadly defined sector.
Figure 6, tracking frequency of CLIMATE-related terms, shows how the highest peak coincided with the publication of the WDR on “Development and Climate Change” in 2010.
Figure 7 reports two WDR publications relevant to EDUCATION, which seemingly preceded two peaks in the sector-related terms in the PDOs:
- in 2007, on “Development and the Next Generation”
- in 2018, on “Learning to Realize Education’s Promise”
Figure 8 shows that the highest frequency of terms related to GENDER EQUALITY was instead recorded a couple of years before the publication of the WDR on “Gender Equality and Development” in 2012.
Comparing PDO text against variable sector
The available data includes not only text but also relevant metadata, such as the sector1 variable, which captures the project’s primary sector. Do the terms in the PDO text align with this sector label? To examine this, I applied the two-sample Kolmogorov-Smirnov test to compare the distribution of sector-related terms in the PDO text with the distribution of sector1.
The Kolmogorov-Smirnov test is non-parametric and makes no assumptions about the underlying distributions, making it a versatile tool for comparing distributions. The null hypothesis is that the two samples are drawn from the same distribution. Hence, if the p-value is less than the significance level (0.05), the null hypothesis is rejected, suggesting the observed distributions are in fact different. The test statistic is the maximum difference between the cumulative distribution functions (CDF) of the two samples.
- Note on the KS statistic: the vectors of observed distributions were rescaled (bringing n_pdo and n_tag to a [0, 1] range) before applying the Kolmogorov-Smirnov (KS) test. This is useful when distributions differ substantially in scale or units, as it makes them directly comparable in relative terms.
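As an illustration of the rescaling and of the test statistic, here is a minimal Python sketch (the original analysis was done in R) that min-max rescales two vectors and computes the two-sample KS statistic as the largest gap between their empirical CDFs. The yearly counts assigned to `n_pdo` and `n_tag` are invented, and the sketch computes only the statistic, not the p-value.

```python
from bisect import bisect_right

def min_max_rescale(values):
    """Map values linearly onto the [0, 1] range."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]

def ks_statistic(sample_a, sample_b):
    """Maximum absolute gap between the two empirical CDFs."""
    a, b = sorted(sample_a), sorted(sample_b)
    d = 0.0
    # The gap between two step functions can only change at observed points.
    for x in sorted(set(a) | set(b)):
        cdf_a = bisect_right(a, x) / len(a)
        cdf_b = bisect_right(b, x) / len(b)
        d = max(d, abs(cdf_a - cdf_b))
    return d

# Hypothetical yearly counts: sector terms in PDOs vs. projects tagged in sector1.
n_pdo = [12, 15, 20, 30, 28, 25]
n_tag = [40, 48, 70, 95, 90, 85]

d = ks_statistic(min_max_rescale(n_pdo), min_max_rescale(n_tag))
print(round(d, 4))
```

In practice a library routine such as `scipy.stats.ks_2samp`, which returns both the statistic and a p-value, would be the usual choice. Note also that the test compares the distributions of values and ignores the order of the years.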
As shown in Table 2, the results indicate similar distributions across most sectors. This is promising, as it suggests that in cases where metadata is lacking, sector assignments can be reasonably inferred from the PDO text.
SECTORS | KS statistic | KS p-value | Distributions |
---|---|---|---|
ENERGY | 0.6522 | 0.0001 | Dissimilar |
HEALTH | 0.3913 | 0.0487 | Dissimilar |
WAT_SAN | 0.3913 | 0.0544 | Similar |
EDUCATION | 0.3478 | 0.1002 | Similar |
ICT | 0.2857 | 0.3399 | Similar |
MINING_OIL_GAS | 0.3333 | 0.3442 | Similar |
TRANSPORT | 0.2174 | 0.6410 | Similar |
Below is a graphical representation of two illustrative sectors, showing the most similar and the most dissimilar distributions of the sector as deduced from text data, versus the proper metadata sector labeling.
Figure 9 shows the distributions of the TRANSPORT sector in the PDOs’ text and in the metadata. The two distributions are the most similar, as confirmed by the Kolmogorov-Smirnov test with a p-value of 0.641.
Figure 10 compares visually the distributions of the ENERGY sector in the PDOs’ text data and the metadata. The two distributions are the most dissimilar, as the Kolmogorov-Smirnov test confirms with a p-value of 0.0001.
Comparing PDO text against variable amount committed
A similar question is: do word trends observed in PDOs also reflect the allocation of funds by sector? I explored this question with the same approach as before, but this time I compared the distribution of sector-related terms in the PDO text against the distribution of the sum of the amount committed in corresponding projects (i.e. filtered by sector1 category). Given the very different ranges, I compared rescaled values (using the Kolmogorov-Smirnov two-sample test) to evaluate whether the two distributions differ.
As shown in Table 3, the results indicate less homogeneity of the distributions across key sectors, something that could be further investigated.
SECTORS | KS statistic | KS p-value | Distributions |
---|---|---|---|
EDUCATION | 0.6522 | 0.0001 | Dissimilar |
ICT | 0.6522 | 0.0001 | Dissimilar |
HEALTH | 0.5652 | 0.0010 | Dissimilar |
MINING_OIL_GAS | 0.5217 | 0.0031 | Dissimilar |
ENERGY | 0.3478 | 0.1235 | Similar |
TRANSPORT | 0.2609 | 0.4218 | Similar |
WAT_SAN | 0.2609 | 0.4218 | Similar |
Let us pick a couple of examples of specific sectors to check visually.
WATER & SANITATION sector: words v. funding
The distributions in the “WATER & SANITATION” sector are among the most similar pairs (K-S test p-value = 0.4218).
ICT sector: words v. funding
The distributions in the ICT sector are among the least similar (K-S test p-value = 0.0001).
Concordances: a.k.a. keywords in context
Another useful way to explore text data is concordance analysis, which enables a closer look at the context surrounding a word (or combination of words). This approach can help clarify the word’s specific meaning or reveal underlying patterns in the data.
The bigram “eligible crisis” in the PDOs
For instance, among the most frequent bigrams (two-word combinations) in the PDO text (illustrated in Figure 2), the phrase “eligible crisis” stands out. Besides appearing in the PDOs of 112 projects, this phrase is often used in a similar context. Specifically, in 32% of these cases, it is paired with phrases like “respond promptly and effectively” or “immediate and effective response”. As shown in Table 4, this suggests a sort of recurring standard phrasing.
WB Project ID | Excerpt of PDO Sentences with 'Eligible Crisis' |
---|---|
P179499 | (...) and effective response in the case of an eligible crisis or emergency. |
P176608 | (...) promptly and effectively in the event of an eligible crisis or emergency. |
P151442 | (...) assistance programs and, in the event of an eligible crisis or emergency, to provide immediate and effective response |
P177329 | (...) eligible crisis or emergency, respond promptly and effectively to it. |
P127338 | (...) capacity to respond promptly and effectively in an eligible crisis or emergency, asrequired. |
P158504 | (...) immediate and effective response in case of an eligible crisis or emergency. |
P173368 | (...) immediate and effective response in case of an eligible crisis or emergency in the kingdom of cambodia. |
P178816 | (...) the project regions and to respond to an eligible crisis |
P160505 | (...) theproject area, and, in the event of an eligible crisis or emergency, to provide immediate and effective response |
P149377 | (...) mozambique to respond promptly and effectively to an eligible crisis or emergency. |
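A keyword-in-context (KWIC) view like Table 4 can be produced by slicing a fixed number of characters around each match. The following is a minimal Python sketch (the original analysis was done in R); the function name, the character-based window, and the example sentence (modelled on the phrasing in Table 4 but not an actual PDO) are assumptions.

```python
import re

def kwic(text, phrase, width=40):
    """Return (left, match, right) context tuples for each occurrence of phrase."""
    results = []
    for m in re.finditer(re.escape(phrase), text, flags=re.IGNORECASE):
        left = text[max(0, m.start() - width): m.start()]
        right = text[m.end(): m.end() + width]
        results.append((left, m.group(0), right))
    return results

# Hypothetical PDO sentence for illustration.
pdo = ("To improve road safety and to respond promptly and effectively "
       "to an eligible crisis or emergency.")

for left, match, right in kwic(pdo, "eligible crisis", width=30):
    print(f"(...) {left}[{match}]{right}")
```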
The bigram “climate change” in the PDOs
Another frequently occurring bigram is “climate change”, found in 92 PDOs. Table 5 displays words that commonly appear near this bigram. Notably, the word “mitigation” (which I associate with a more aspirational, long-term response) appears more frequently than “adaptation” (which I view as a more practical, short-term response). However, the ratio would flip considering that “resilience” may convey a similar practical intent as “adaptation”. This is another interesting insight worth exploring further in the future.
Near 'climate change' | Count | Percentage |
---|---|---|
vulnerability | 25 | 39.1% |
mitigate | 14 | 21.9% |
resilience | 14 | 21.9% |
adapt | 6 | 9.4% |
hazard | 5 | 7.8% |
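Counts like those in Table 5 can be obtained by checking which word roots occur within a window of tokens around the bigram. Below is a minimal Python sketch (the original analysis was done in R); the window of 8 tokens and the function name are assumptions, since the text does not say how “near” was defined. The three example texts are excerpts quoted in Table 6.

```python
import re
from collections import Counter

ROOTS = ["vulnerab", "mitig", "resil", "adapt", "hazard"]

def roots_near(texts, phrase, roots, window=8):
    """Count the texts in which each root occurs within `window` tokens of phrase."""
    counts = Counter()
    phrase_tokens = phrase.lower().split()
    n = len(phrase_tokens)
    for text in texts:
        tokens = re.findall(r"[a-z]+", text.lower())
        found = set()
        for i in range(len(tokens) - n + 1):
            if tokens[i:i + n] == phrase_tokens:
                nearby = tokens[max(0, i - window): i] + tokens[i + n: i + n + window]
                for root in roots:
                    if any(tok.startswith(root) for tok in nearby):
                        found.add(root)
        counts.update(found)  # each text counts at most once per root
    return counts

excerpts = [
    "to improve durability and enhance resilience to climate change",
    "institutional capacity for sustainable agriculture, forest conservation and climate change mitigation.",
    "at measurably reducing vulnerability to natural hazards and climate change impacts in the eastern caribbean sub-region.",
]
print(roots_near(excerpts, "climate change", ROOTS))
```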
Table 6 shows a few examples for each of the words most frequently found in the vicinity of the bigram “climate change”.
Near word (root) | WB Project ID | Closest Text |
---|---|---|
adapt | ||
adapt | P090731 | (...) pilot adaptation measures addressing primarily, the impacts of climate change on their natural resource base, focused on biodiversity |
adapt | P120170 | (...) a multi-sectoral dpl to enhance climate change adaptation capacity is anticipated in the cps. |
adapt | P129375 | (...) objectives of the project are to: (i) integrate climate change adaptation and disaster risk reduction across the recipient’s |
hazard | ||
hazard | P174191 | (...) and health-related hazards, including the adverse effects of climate change and disease outbreaks. |
hazard | P123896 | (...) agencies to financial protection from losses caused by climate change and geological hazards. |
hazard | P117871 | (...) buildings and infrastructure due to natural hazards or climate change impacts; and (b) increased capacity of oecs governments |
mitig | ||
mitig | P074619 | (...) to help mitigate global climate change through carbon emission reductions (ers) of 138,000 tco2e |
mitig | P164588 | (...) institutional capacity for sustainable agriculture, forest conservation and climate change mitigation. |
mitig | P094154 | (...) removing carbon from the atmosphere and to mitigateclimate change in general. |
resil | ||
resil | P154784 | (...) to increase agricultural productivity and build resilience to climate change risks in the targeted smallholder farming and pastoralcommunities |
resil | P112615 | (...) the resilience of kiribati to the impacts of climate change on freshwater supply and coastal infrastructure. |
resil | P157054 | (...) to improve durability and enhance resilience to climate change |
vulnerab | ||
vulnerab | P149259 | (...) to measurably reduce vulnerability to natural hazards and climate change impacts in grenada and in the eastern caribbean |
vulnerab | P146768 | (...) at measurably reducing vulnerability to natural hazards and climate change impacts in the eastern caribbean sub-region. |
vulnerab | P117871 | (...) at measurably reducing vulnerability to natural hazards and climate change impacts in the eastern caribbean sub-region. |
DATA QUALITY ENHANCEMENT
This section shifts focus to a new area of exploration: the possibility of enhancing metadata quality by predicting missing features in the World Bank project documents. The idea is to use the words of the Project Development Objective (PDO) as input to predict missing categorical descriptors (sector, environmental risk category, etc.) for some of the observations. Table 7 shows some missing features in the source dataset.
Variable | N obs. | N Distinct | N Missing | Percent Missing |
---|---|---|---|---|
ESrisk | 4403 | 5 | 3980 | 90.4% |
theme1 | 4403 | 73 | 1254 | 28.5% |
env_cat | 4403 | 8 | 1167 | 26.5% |
sector1 | 4403 | 71 | 17 | 0.4% |
One candidate variable that could be predicted is env_cat (“Environmental Assessment Category”). This is a categorical variable with 7 levels (A, B, C, F, H, M, U), but, to simplify, I collapsed it into a binary outcome defined as “High-Med-risk” and “Low-risk_Othr” (as illustrated in Table 8).
env_cat variable | High-Med-risk | Low-risk_Othr |
---|---|---|
A_high risk | 351 | 0 |
B_med risk | 1830 | 0 |
C_low risk | 0 | 880 |
F_fin expos | 0 | 127 |
Other | 0 | 48 |
Missing | 0 | 1167 |
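The recoding in Table 8 amounts to a two-way mapping. A minimal Python sketch follows (the original analysis was done in R), assuming that categories A and B map to “High-Med-risk” and everything else, including missing values, maps to “Low-risk_Othr”; that the “Other” row corresponds to levels H, M, and U is an inference from the list of levels, not something the text states.

```python
from collections import Counter

def collapse_env_cat(env_cat):
    """Collapse the env_cat levels into the binary outcome of Table 8."""
    if env_cat in ("A", "B"):
        return "High-Med-risk"
    # C, F, the remaining levels, and missing values (None) all land here.
    return "Low-risk_Othr"

# Rebuild the sample from the counts reported in Table 8
# ("H" stands in for the whole "Other" row, as an assumption).
table8 = {"A": 351, "B": 1830, "C": 880, "F": 127, "H": 48, None: 1167}
labels = [collapse_env_cat(cat) for cat, n in table8.items() for _ in range(n)]
print(Counter(labels))
```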
Using ML models to predict a missing feature
The goal at hand is text classification, that is, assigning categories to observations. To predict a missing feature based on a mix of text data and other available predictors, several machine learning (ML) algorithms can be applied. I tested a few suitable algorithms.
The sample splitting (necessary in ML to set aside a testing dataset for model evaluation) was done based on the availability of the env_cat variable. The sample was actually split into three groups:
- Training set (with env_cat available): 2,264 observations
- Testing set (with env_cat available): 972 observations
- Validation set (with env_cat missing): 1,167 observations
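A split of this kind can be sketched in Python as follows (the original analysis was done in R). The roughly 70/30 share between training and testing is inferred from the reported group sizes; the random seed, the function name, and the absence of stratification are assumptions.

```python
import random

def three_way_split(projects, train_share=0.7, seed=123):
    """Split projects into training, testing, and validation (unlabelled) sets."""
    labelled = [p for p in projects if p["env_cat"] is not None]
    unlabelled = [p for p in projects if p["env_cat"] is None]
    rng = random.Random(seed)  # fixed seed so the split is reproducible
    rng.shuffle(labelled)
    cut = round(len(labelled) * train_share)
    return labelled[:cut], labelled[cut:], unlabelled

# Hypothetical toy sample: 7 projects with env_cat, 3 without.
toy = [{"id": i, "env_cat": "B" if i < 7 else None} for i in range(10)]
train, test, validation = three_way_split(toy)
print(len(train), len(test), len(validation))
```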
Choosing the ML algorithm
To predict the missing binary categorical outcome env_cat_f2, I tried several models, including LASSO logistic regression (with different specifications including only text or a mix of text and other predictors) and Naïve Bayes classification. (Here I only report the results, but details can be found on this webpage.) Since text data is sparse and high-dimensional, it is critical to perform some pre-treatment of the features (i.e. the explanatory variables) before modeling.
LASSO (for logistic regression) is an approach that basically defines how much of a penalty to put on some features in order to select only the most useful out of all the original possible variables (tokens). It is a good choice when dealing with a high-dimensional dataset, like text data.
Naïve Bayes classification is a simple and efficient algorithm for text classification. It assumes feature independence, which may not always hold, but it’s often a good baseline, particularly with short texts.
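To illustrate how the LASSO penalty selects features, here is a minimal Python sketch (the original analysis was done in R) of L1-penalised logistic regression fitted by proximal gradient descent. This is a teaching sketch on invented toy data, not the author's implementation; the feature names, the learning rate, and the penalty values are assumptions. The key step is the soft threshold, which sets small coefficients exactly to zero.

```python
import math

def soft_threshold(value, threshold):
    """Shrink value toward zero; anything inside the band becomes exactly 0."""
    if value > threshold:
        return value - threshold
    if value < -threshold:
        return value + threshold
    return 0.0

def fit_lasso_logistic(X, y, penalty, lr=0.5, n_iter=2000):
    """Proximal gradient descent for L1-penalised logistic regression."""
    n, p = len(X), len(X[0])
    weights, intercept = [0.0] * p, 0.0
    for _ in range(n_iter):
        # Predicted probabilities under the current coefficients.
        probs = [1.0 / (1.0 + math.exp(-(intercept + sum(w * x for w, x in zip(weights, row)))))
                 for row in X]
        errors = [pr - yi for pr, yi in zip(probs, y)]
        intercept -= lr * sum(errors) / n  # the intercept is not penalised
        for j in range(p):
            grad = sum(e * row[j] for e, row in zip(errors, X)) / n
            weights[j] = soft_threshold(weights[j] - lr * grad, lr * penalty)
    return intercept, weights

# Hypothetical token indicators: columns are "road", "training", "improve";
# y = 1 stands for High-Med-risk.
X = [[1, 0, 1], [1, 0, 0], [1, 0, 1], [1, 1, 0],
     [0, 1, 1], [0, 1, 0], [0, 0, 1], [0, 1, 0]]
y = [1, 1, 1, 1, 0, 0, 0, 0]

_, strong = fit_lasso_logistic(X, y, penalty=0.3)
_, mild = fit_lasso_logistic(X, y, penalty=0.05)
print("strong penalty:", strong)
print("mild penalty:  ", [round(w, 3) for w in mild])
```

With the strong penalty every coefficient stays at zero, while the milder penalty lets the informative "road" token enter the model. In practice the penalty is a hyperparameter chosen by cross-validation.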
Other supervised ML algorithms could be used in this case, such as Random Forest, Support-Vector Machines, K-Nearest Neighbors, but they were not tested here.
The steps to predict the missing feature
- Outcome label engineering: Define what to predict (outcome variable, \(y\)) and its functional form (binary or multiclass; log form or not, if numeric).
- Sample design: Select the observations to use. In ML this is typically done by splitting the sample into training and testing sets.
- Feature engineering: Define the input data (predictors, \(X\)) and their format. Here, text data was combined with other predictors (e.g. sector, region, FY approved, etc.) to create a feature matrix.
  - Text preprocessing: The text data was preprocessed by tokenization, filtering of tokens by frequency, removal of stopwords, and weighting via TF-IDF (Term Frequency-Inverse Document Frequency), to make it suitable for ML algorithms.
- Model selection and fitting: The models were trained on the training set.
  - Different algorithms have different parameters that can be adjusted and that affect the performance of the model (hyperparameter tuning, typically done while training the model).
- Prediction: The best model was used to predict the missing env_cat_f2 and evaluate the model’s performance on the hold-out sample (testing set).
- Evaluation: The predictions were evaluated on the testing set based on performance metrics:
  - accuracy, which is the proportion of correct predictions, and
  - ROC-AUC (Receiver Operating Characteristic - Area Under the Curve), which summarizes how well the model can distinguish between classes.
- Interpretation: The model was interpreted to understand which features were most important in predicting the outcome.
ML is an iterative process, so it is common to revise (some of) the above steps multiple times to refine the model.
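The TF-IDF weighting mentioned in the text preprocessing step can be sketched in a few lines of Python (the original analysis was done in R). The sketch uses one common variant, term frequency as a share of the document's tokens times the natural log of the inverse document frequency; whether the original pipeline used exactly this variant is not stated. The three tokenized documents are invented.

```python
import math
from collections import Counter

def tf_idf(docs):
    """docs: list of token lists. Returns one {token: weight} dict per document."""
    n_docs = len(docs)
    # Number of documents in which each token appears at least once.
    doc_freq = Counter(tok for doc in docs for tok in set(doc))
    weights = []
    for doc in docs:
        counts = Counter(doc)
        weights.append({
            tok: (c / len(doc)) * math.log(n_docs / doc_freq[tok])
            for tok, c in counts.items()
        })
    return weights

# Hypothetical tokenized PDOs.
docs = [
    ["improve", "road", "safety"],
    ["improve", "health", "services"],
    ["improve", "road", "access"],
]
for w in tf_idf(docs):
    print({tok: round(val, 3) for tok, val in w.items()})
```

A token that appears in every document (here "improve") receives a weight of zero, which is why TF-IDF downplays ubiquitous words without needing an explicit stop-word list.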
Models and Results
Table 9 reports the specifications of the models and their performance.
Algorithm | Features | Specification | Accuracy | ROC_auc |
---|---|---|---|---|
LASSO logistic regression | Text only | env_cat_f2 ~ pdo | 0.750 | 0.777 |
LASSO logistic regression (more preprocessing) | Text only | env_cat_f2 ~ pdo | 0.762 | 0.807 |
LASSO logistic regression (more preprocessing) | Text + other predictors | env_cat_f2 ~ pdo + sector_f + regionname + FYapprov | 0.790 | 0.850 |
Naïve Bayes classification | Text + other predictors | env_cat_f2 ~ pdo + sector_f + regionname + FYapprov | 0.691 | 0.784 |
The best model performance was achieved by the LASSO logistic regression model that combined both the PDO text and some available metadata to predict the missing env_cat_f2 in the testing set. The model achieved an accuracy of 0.79 and a ROC-AUC of 0.85, where:
- accuracy is the proportion of correct predictions made by the model out of all predictions or, in other words, how often the model is correct overall.
- ROC-AUC (Receiver Operating Characteristic - Area Under the Curve) goes further by evaluating the model’s ability to distinguish between classes across various thresholds. It summarizes how well the model separates the classes, providing a more nuanced view of its performance, especially useful when the class distribution is uneven.
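Both metrics are simple to compute by hand, which helps in reading Table 9. Below is a minimal Python sketch (the original analysis was done in R); the labels and scores are invented. The ROC-AUC is computed through its rank interpretation: the probability that a randomly chosen positive case receives a higher score than a randomly chosen negative one, with ties counted as one half.

```python
def accuracy(y_true, y_pred):
    """Share of predictions that match the true labels."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

def roc_auc(y_true, scores):
    """Share of (positive, negative) pairs ranked correctly; ties count 1/2."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Hypothetical true labels and predicted probabilities.
y_true = [1, 1, 1, 0, 0]
scores = [0.9, 0.8, 0.4, 0.6, 0.2]
y_pred = [1 if s >= 0.5 else 0 for s in scores]  # 0.5 cut-off is an assumption

print(accuracy(y_true, y_pred))
print(round(roc_auc(y_true, scores), 4))
```

Accuracy depends on the chosen cut-off, while ROC-AUC depends only on how the scores rank the cases, which is why the two can tell different stories.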
Performance of the preferred ML model
Figure 13 presents the confusion matrix for the preferred ML model used to predict the missing environment risk category assigned to World Bank projects. This matrix shows the distribution of true and predicted classifications. Ideally, a high-performing model would have most observations (or darker shading) along the diagonal, indicating correct classifications: specifically, true positives in the top-left quadrant and true negatives in the bottom-right quadrant.
In this case, the model performs well in predicting the environment risk category for the High-Med group but struggles with the Low & Other group. Many of these cases are incorrectly classified as High-Med Risk (false positives). This result is understandable, as the Low & Other category is more loosely defined and even includes Missing observations (which, in hindsight, could have been excluded from the prediction).
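A confusion matrix like the one in Figure 13 is a cross-tabulation of true against predicted labels. Here is a minimal Python sketch (the original analysis was done in R) with invented labels; rows are the true classes and columns the predicted ones, an orientation chosen for this example that may differ from the figure.

```python
from collections import Counter

def confusion_matrix(y_true, y_pred, labels):
    """Rows follow the true label, columns the predicted label."""
    counts = Counter(zip(y_true, y_pred))
    return [[counts[(t, p)] for p in labels] for t in labels]

labels = ["High-Med-risk", "Low-risk_Othr"]
H, L = labels

# Hypothetical predictions: two Low-risk cases are wrongly flagged as High-Med.
y_true = [H, H, H, L, L, L]
y_pred = [H, H, H, H, H, L]

matrix = confusion_matrix(y_true, y_pred, labels)
print(matrix)  # [[3, 0], [2, 1]]
false_positives = matrix[1][0]  # true Low-risk predicted as High-Med
print(false_positives)
```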
Most important features for prediction
It’s also insightful to examine which coefficients are most influential in the model. This can be done visually through the feature importance plot (see Figure 14).
The feature importance plot displays the top 50 predictors of the environmental risk (binary) category, ranked by their impact in a LASSO logistic regression model. For clarity, predictors are divided according to the risk level they predict. As expected, given the structure of the data, words from the PDO text (those variables starting with pdo_*) are among the most important predictors. However, other predictors also play a significant role, such as sector_f_TRANSPORT (left panel), regionname, and sector_f_FINANCIAL (right panel).
Prediction and Interpretation
While the model’s prediction performance is not particularly remarkable, it is sufficient to illustrate the potential of this analysis to enhance the quality of incomplete datasets. With further improvements in preprocessing, feature engineering, algorithm selection, and hyperparameter tuning, there is significant potential to optimize a similar ML model.
Although not reported here, I also explored predicting a multiclass outcome (sector, grouped into 7 levels). However, the results were less favorable compared to the binary classification. This outcome is expected, as multiclass classification is inherently more challenging, particularly with imbalanced data or limited sample sizes.
CONCLUSIONS
This project was primarily a proof-of-concept for learning purposes, so optimizing ML performance and conducting in-depth data analysis were not priorities. Nevertheless, it showcased the potential of applying NLP techniques to unstructured text data, uncovering insights such as:
- identifying trends in sector-specific language and topics over time,
- revealing unexpected patterns and relationships, like recurring phrases or topics,
- enhancing text classification and metadata tagging with ML models,
- sparking additional text-based questions that could guide further research.
Future steps could include exploring explanations for observed patterns by combining this NLP analysis with other data sources (e.g., World Bank official statements or project data) and experimenting with advanced NLP techniques for topic modeling.
One pain point with this type of work is accessing document data efficiently. Even with the World Bank’s “Access to Information” policy, getting programmatic access to their text data is still tricky (no dedicated API, outdated pages, broken links). This could benefit from an approach similar to the accessible, well-maintained World Development Indicators (WDI) data.
With all the buzz around AI and Large Language Models (LLMs), this kind of analysis might seem like yesterday’s news. But I think there’s still huge, untapped potential for using NLP in development studies, policy analysis, and beyond—especially when it’s backed by domain expertise.
Acknowledgements
Below are some great resources—especially geared toward programmers—to learn and implement NLP techniques.