A text exploration of the World Bank’s Project Development Objectives (PDOs)

TL;DR

Tags: WB, NLP, TextAnalytics, ML, DigitalHumanities, Rstats

The idea of analyzing language as data has always intrigued me. In this deep dive, I focus on ~4,000 World Bank Projects & Operations, zooming in on the short texts that describe the Project Development Objectives (PDOs), an abstract of sorts for the Bank’s operations.
This explorative analysis revealed fascinating, and at times surprising, insights: it uncovered patterns and correlations in the text, but also ways to enhance the quality of the projects’ data themselves.

(This is an ongoing project, so comments, questions, and suggestions are welcome. The R source code is open, albeit not very polished).

Published

October 29, 2024

MOTIVATION

I have always been fascinated by the idea of analyzing language as data and I finally found some time to study Natural Language Processing (NLP) and Text Analytics techniques.

For this learning project, I explore a dataset of World Bank Projects & Operations, focusing on the text data contained in the Project Development Objective (PDO) section of World Bank projects (loans, grants, technical assistance). A PDO outlines, in concise form, the proposed objectives of an operation, as defined in the early stages of the World Bank project cycle.

Normally, a few objectives are listed in paragraphs that are a couple of sentences long. Table 1 shows two examples.

Table 1: Illustrative PDOs text in Projects’ documents
Project_ID Project_Name Project_Development_Objective
P127665 Second Economic Recovery Development Policy Loan This development policy loan supports the Government of Croatia's reform efforts with the aim to: (i) enhance fiscal sustainability through expenditure-based consolidation; and (ii) strengthen investment climate.
P179010 Tunisia Emergency Food Security Response Project To (a) ensure, in the short-term, the supply of (i) agricultural inputs for farmers to secure the next cropping seasons and for continued dairy production, and (ii) wheat for uninterrupted access to bread and other grain products for poor and vulnerable households; and (b) strengthen Tunisia’s resilience to food crises by laying the ground for reforms of the grain value chain.

The dataset also includes relevant metadata about the projects, including country, fiscal year of approval, project status, main sector, main theme, environmental risk category, and lending instrument.

I retrieved the data from the WBG Projects page. The data is classified by the World Bank as “public” and accessible under a Creative Commons Attribution 4.0 International License.

DATA

The original dataset included 22,569 World Bank projects approved from fiscal year 1947 through 2025, as of August 31, 2024. Approximately half—11,322 projects—had a viable Project Development Objective (PDO) text (i.e., not blank or labeled as “TBD”, etc.), all approved after FY2001. From this group, some projects were excluded due to missing key variables.

This left 8,811 projects as usable observations for analysis.

Interestingly, within this refined subset, 2,235 projects share only 1,006 unique PDOs: recycled PDOs often appear in follow-up projects or components of a larger parent project.

Finally, from these 8,811 projects, a representative sample of 4,403 projects with PDOs was selected for further analysis.

First, it is important to notice that all 7,548 projects approved before FY2001 had no PDO text available.

The exploratory analysis of the 11,353 projects WITH PDO text revealed some interesting findings:

  1. PDO text length: The PDO text is quite short, with a median of 2 sentences and a maximum of 9 sentences.
  2. PDO text missingness: besides 11,216 projects with missing PDOs, 31 projects had invalid PDO values, namely:
    • 11 have PDO as one of: “.”,“-”,“NA”, “N/A”
    • 7 have PDO as one of: “No change”, “No change to PDO following restructuring.”,“PDO remains the same.”
    • 9 have PDO as one of: “TBD”, “TBD.”, “Objective to be Determined.”
    • 4 have PDO as one of: “XXXXXX”, “XXXXX”, “XXXX”, “a”
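The validity rule above can be sketched as a small filter. The original analysis is in R; here is a minimal Python illustration, where the `is_valid_pdo` helper and its pattern list are hypothetical, built only from the invalid values listed above:

```python
import re

# Hypothetical rule: a PDO string is invalid if it is blank, a placeholder,
# or a no-change note, mirroring the categories listed above
INVALID_PATTERNS = re.compile(
    r"^(\.|-|na|n/a|a|x{4,6}|tbd\.?|objective to be determined\.?"
    r"|no change.*|pdo remains the same\.?)$",
    re.IGNORECASE,
)

def is_valid_pdo(text):
    text = (text or "").strip()
    return bool(text) and not INVALID_PATTERNS.match(text)

pdos = ["TBD", "N/A", "XXXXX", "To improve rural road access.", ""]
valid = [p for p in pdos if is_valid_pdo(p)]
# keeps only the real objective text
```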

Of the 11,322 projects with a valid PDO, additional projects were excluded from the analysis for incompleteness:

  • 3 projects without “project status”
  • 2,176 projects without “board approval FY”
  • 332 projects approved in FY2024 or later (approval stage still incomplete)

Lastly (and this was quite surprising to me), the remaining 8,811 viable, unique projects were matched by only 7,582 unique PDOs! In fact, 2,235 projects share 1,006 NON-UNIQUE PDO texts in the clean dataset. Why? Apparently, the same PDO is re-used for multiple projects (from 2 to as many as 9 times), likely in follow-up phases of a parent project or components of the same lending program.

In sum, the cleaning process yielded a usable set of 8,811 functional projects, which was split into a training subset (4,403 projects) to explore and test models and a testing subset (4,408 projects), held out for post-prediction evaluation.

Preprocessing the PDO text data

Cleaning text data entails extra steps compared to numerical data. A key step is tokenization, which breaks text into smaller units such as words, bigrams, n-grams, or sentences. A common follow-up is normalization, where text is standardized (e.g., converted to lowercase). Similarly, data reduction techniques like stemming and lemmatization simplify words to their root form (e.g., “running,” “ran,” and “runs” become “run”). This helps reduce dimensionality, especially with very large datasets, when the word form is not relevant.

After tokenization, it is very common to remove irrelevant elements like punctuation and stop words (uninformative words like “the”, “ii)”, or “at”, or words repeated throughout the corpus like “PDO”), which add noise to the data.

In contrast, data enhancement techniques like part-of-speech tagging add value by identifying grammatical components, allowing focus on meaningful elements like nouns, verbs, or adjectives.
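The preprocessing steps just described can be illustrated with a short sketch. The post’s own pipeline is in R; this Python stand-in uses a toy stopword list and a naive suffix-stripping rule in place of a real stemmer (such as Porter), so it is only a conceptual illustration:

```python
import re

# Toy stopword list for illustration only
STOPWORDS = {"the", "of", "to", "and", "in", "for", "an", "a", "pdo"}

def preprocess(text):
    # Normalize: lowercase, then tokenize on alphabetic runs
    tokens = re.findall(r"[a-z]+", text.lower())
    # Remove stop words and very short enumeration artifacts like "ii"
    return [t for t in tokens if t not in STOPWORDS and len(t) > 2]

def naive_stem(token):
    # Toy suffix stripper standing in for a real stemmer
    for suffix in ("ing", "ed", "es", "s"):
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token

sample = "The PDO is to strengthen institutions and improve service delivery."
print([naive_stem(t) for t in preprocess(sample)])
# → ['strengthen', 'institution', 'improve', 'service', 'delivery']
```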

TERM FREQUENCY PATTERNS

Figure 1 shows the most recurrent tokens and stems in the PDO text data.

Words and stems

Evidently, after stemming, more words (stems) reach the frequency threshold of 800, since variants have been combined by root. Even after pre-processing the PDO text data, these are not particularly informative words.

Figure 1

Bigrams

Figure 2 shows the most frequent bigrams in the PDO text data. The top-ranking bigrams align with expectations, featuring phrases like “increase access”, “service delivery”, “institutional capacity”, and “poverty reduction” at the top. Notably, while “health” appears in several bigrams (e.g., “health services”, “public health”, “health care”), “education” is absent from the top 25. Another noteworthy observation is the frequent mention (over 100 instances) of “eligible crisis”, which was somewhat unexpected.

Figure 2

Trigrams

Figure 3 shows the most frequent trigrams in the PDO text data. Here, the recurrence of phrases involving “health” is confirmed, along with a few phrases revolving around environmental goals and terms that inherently belong together, like “water resource management” and “social safety net”.

Figure 3
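Counting n-grams of any order reduces to sliding a fixed-length window over each token list. A minimal Python sketch (the tokenized PDOs are invented examples):

```python
from collections import Counter

def ngrams(tokens, n):
    # Slide a window of length n across the token list
    return [" ".join(tokens[i : i + n]) for i in range(len(tokens) - n + 1)]

# Invented tokenized PDO fragments
docs = [
    ["increase", "access", "to", "basic", "health", "services"],
    ["improve", "access", "to", "basic", "education"],
]
bigram_counts = Counter(bg for doc in docs for bg in ngrams(doc, 2))
```

In practice the same `ngrams` helper is reused with `n=2` for bigrams and `n=3` for trigrams, after stopword removal.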

Sectors in the PDO text

To focus on a meaningful set of tokens, I examined the frequency of sector-related terms within the PDO text data. To capture the broader concept of “sector,” I created a comprehensive SECTOR variable that encompasses all relevant words within an expanded definition.

The “sector” term discussed here is not the sector variable available in the data, but an artificial construct reflecting the occurrence of terms referring to the same sector’s semantic field. Besides conceptual association, these definitions are rooted in the World Bank’s own classification of sectors and sub-sectors.

Below are the “broad SECTOR” definitions used in this analysis:

  • WAT_SAN = water|wastewater|sanitat|sewer|sewage|irrigat|drainag|river basin|groundwater
  • TRANSPORT = transport|railway|road|airport|waterway|bus|metropolitan|inter-urban|aviation|highway|transit|bridge|port
  • URBAN = urban|housing|inter-urban|peri-urban|waste manag|slum|city|megacity|intercity|inter-city|town
  • ENERGY = energ|electri|hydroele|hydropow|renewable|transmis|grid|transmission|electric power|geothermal|solar|wind|thermal|nuclear power|energy generation
  • HEALTH = health|hospital|medicine|drugs|epidem|pandem|covid-19|vaccin|immuniz|diseas|malaria|hiv|aids|tb|maternal|clinic|nutrition
  • EDUCATION = educat|school|vocat|teach|univers|student|literacy|training|curricul|pedagog
  • AGR_FOR_FISH = agricultural|agro|fish|forest|crop|livestock|fishery|land|soil
  • MINING_OIL_GAS = minin|oil|gas|mineral|quarry|extract|coal|natural gas|mine|petroleum|hydrocarbon
  • SOCIAL_PROT = social protec|social risk|social assistance|living standard|informality|insurance|social cohesion|gig economy|human capital|employment|unemploy|productivity|wage lev|intergeneration|lifelong learn|vulnerab|empowerment|sociobehav
  • FINANCIAL = bank|finan|investment|credit|microfinan|loan|financial stability|banking|financial intermed|fintech
  • ICT = information|communication|ict|internet|telecom|cyber|data|ai|artificial intelligence|blockchain|e-learn|e-commerce|platform|software|hardware|digital
  • IND_TRADE_SERV = industry|trade|service|manufactur|tourism|trade and services|market|export|import|supply chain|logistic|distribut|e-commerce|retail|wholesale|trade facilitation|trade policy|trade agreement|trade barrier|trade finance|trade promotion|trade integration|trade liberalization|trade balance|trade deficit|trade surplus|trade war|trade dispute|trade negotiation|trade cooperation|trade relation|trade partner|trade route|trade corridor
  • INSTIT_SUPP = government|public admin|institution|central agenc|sub-national gov|law|justice|governance|policy|regulation|public expenditure|public investment|public procurement
  • GENDER_EQUAL = gender|women|girl|woman|femal|gender equal|gender-base|gender inclus|gender mainstream|gender sensit|gender respons|gender gap|gender-based|gender-sensitive|gender-responsive|gender-transform|gender-equit|gender-balance
  • CLIMATE = climate chang|environment|sustain|resilience|adaptation|mitigation|green|eco|eco-|carbon|carbon cycle|carbon dioxide|climate change|ecosystem|emission|energy effic|greenhouse|greenhouse gas|temperature anomalies|zero net|green growth|low carbon|climate resilient|climate smart|climate tech|climate variab
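Each definition above is a regex alternation that can be matched against the PDO text. A minimal Python sketch using a subset of those patterns (the `tag_sectors` helper is hypothetical):

```python
import re

# A subset of the broad SECTOR definitions above, used as regex alternations
SECTOR_PATTERNS = {
    "WAT_SAN": r"water|wastewater|sanitat|sewer|sewage|irrigat|drainag",
    "ENERGY": r"energ|electri|hydropow|renewable|grid|solar|wind",
    "HEALTH": r"health|hospital|epidem|pandem|vaccin|diseas|nutrition",
}

def tag_sectors(pdo_text):
    # Return every broad sector whose pattern occurs in the PDO text
    text = pdo_text.lower()
    return [s for s, pat in SECTOR_PATTERNS.items() if re.search(pat, text)]

pdo = "To expand access to safe water and improve health service delivery."
print(tag_sectors(pdo))  # → ['WAT_SAN', 'HEALTH']
```

Counting these tags by fiscal year of approval yields the occurrence trends plotted below.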

The occurrence trends over time for key sector terms are shown in Figure 4.

Interestingly, all the broadly defined sector terms in the PDOs present one or more peaks at some point in time. For the (broadly defined) HEALTH sector, it is likely that Covid-19 triggered the peak in 2020. What about the other sectors? What could be the driving reason?

Figure 4

A possible explanation is that the PDOs may echo themes from the World Development Reports (WDR), the World Bank’s flagship annual publication that analyzes a key development issue each year. Far from being speculative research, each WDR is grounded in the Bank’s field-based insights and, in turn, it informs the Bank’s policy and operational priorities. This would suggest a likely alignment between WDR themes and project objectives in the PDOs.

To some extent, visual exploration (see examples below) seems to support this hypothesis: thematically relevant WDRs consistently appear in close proximity to peaks in sector-related term frequencies. However, further validation is necessary. Additionally, preparing each WDR typically takes 2-3 years, so a temporal alignment with project documents may include some lag.

Examples of sectors-term trend

Figure 5 shows a “combined sector” that is quite broadly defined (AGRICULTURE, FORESTRY, FISHING) with the highest peak in 2010, two years after the publication of the WDR on “Agriculture for Development”. Perhaps the “alignment” hypothesis is not very meaningful with such a broadly defined sector.

Figure 5

Figure 6, tracking frequency of CLIMATE-related terms, shows how the highest peak coincided with the publication of the WDR on “Development and Climate Change” in 2010.

Figure 6

Figure 7 reports two WDR publications relevant to EDUCATION, which seemingly preceded two peaks in the sector-related terms in the PDOs:

  • in 2007, on “Development and the Next Generation”
  • in 2018, on “Learning to Realize Education’s Promise”

Figure 7

Figure 8 shows that the highest frequency of terms related to GENDER EQUALITY was instead recorded a couple of years before the publication of the WDR on “Gender Equality and Development” in 2012.

Figure 8

Comparing PDO text against variable sector

The available data includes not only text but also relevant metadata, such as the sector1 variable, which captures the project’s primary sector. Do the terms in the PDO text align with this sector label? To examine this, I applied the two-sample Kolmogorov-Smirnov test to compare the distribution of sector-related terms in the PDO text with the distribution of sector1.

The Kolmogorov-Smirnov test is non-parametric and makes no assumptions about the underlying distributions, making it a versatile tool for comparing distributions. The null hypothesis is that the two samples are drawn from the same distribution. Hence, if the p-value is less than the significance level (0.05), the null hypothesis is rejected, suggesting the observed distributions are in fact different. The test statistic is the maximum difference between the cumulative distribution functions (CDF) of the two samples.

  • KS statistic: the vectors of observed distributions were rescaled, bringing n_pdo and n_tag to a [0, 1] range, before applying the Kolmogorov-Smirnov (KS) test. This is useful when distributions differ substantially in scale or units, as it makes them directly comparable in relative terms.
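The rescaling and the KS statistic itself are simple to compute from scratch. A Python illustration of the two-sample statistic only (no p-value computation; the yearly counts are invented):

```python
def rescale(v):
    # Min-max rescaling to [0, 1], making series with different units comparable
    lo, hi = min(v), max(v)
    return [(x - lo) / (hi - lo) for x in v]

def ks_statistic(a, b):
    # Two-sample KS statistic: maximum distance between the empirical CDFs
    def ecdf(sample, x):
        return sum(1 for t in sample if t <= x) / len(sample)
    points = sorted(set(a) | set(b))
    return max(abs(ecdf(a, p) - ecdf(b, p)) for p in points)

# Invented yearly frequencies: a sector term in PDOs vs. sector1 tag counts
n_pdo = [5, 9, 14, 30, 42, 50, 18, 11]
n_tag = [2, 3, 7, 20, 25, 28, 10, 6]
d = ks_statistic(rescale(n_pdo), rescale(n_tag))
```

In the actual analysis the p-value would come from a library implementation (e.g. `ks.test` in R).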

As shown in Table 2, the results indicate similar distributions across most sectors. This is promising, as it suggests that in cases where metadata is lacking, sector assignments can be reasonably inferred from the PDO text.

Table 2: Comparing the frequency distributions of SECTOR in text and metadata
SECTORS KS statistic KS p-value Distributions
ENERGY 0.6522 0.0001 Dissimilar
HEALTH 0.3913 0.0487 Dissimilar
WAT_SAN 0.3913 0.0544 Similar
EDUCATION 0.3478 0.1002 Similar
ICT 0.2857 0.3399 Similar
MINING_OIL_GAS 0.3333 0.3442 Similar
TRANSPORT 0.2174 0.6410 Similar

Below is a graphical representation of two illustrative sectors, showing the most similar and the most dissimilar distributions of the sector as deduced from the text data versus the proper metadata sector labeling.

Figure 9 shows the distributions of the TRANSPORT sector in the PDOs’ text and in the metadata. The two distributions are the most similar, as confirmed by the Kolmogorov-Smirnov test with a p-value of 0.641.

Figure 9

Figure 10 compares visually the distributions of the ENERGY sector in the PDOs’ text data and the metadata. The two distributions are the most dissimilar, as the Kolmogorov-Smirnov test confirms with a p-value of 0.0001.

Figure 10

Comparing PDO text against variable amount committed

A similar question is: do word trends observed in PDOs also reflect the allocation of funds by sector? I explored this question with the same approach as before, but this time I compared the distribution of sector-related terms in the PDOs’ text against the distribution of the total amount committed in the corresponding projects (i.e., filtered by sector1 category). Given the very different ranges, I compared rescaled values (using the Kolmogorov-Smirnov two-sample test) to evaluate whether the two distributions differ.

As shown in Table 3, the results indicate less homogeneity of the distributions across key sectors, something that could be further investigated.

Table 3: Comparing the distributions of SECTOR in text and in corresponding $$ committed
SECTORS KS statistic KS p-value Distributions
EDUCATION 0.6522 0.0001 Dissimilar
ICT 0.6522 0.0001 Dissimilar
HEALTH 0.5652 0.0010 Dissimilar
MINING_OIL_GAS 0.5217 0.0031 Dissimilar
ENERGY 0.3478 0.1235 Similar
TRANSPORT 0.2609 0.4218 Similar
WAT_SAN 0.2609 0.4218 Similar

Let us pick a couple of examples of specific sectors to check visually.

WATER & SANITATION sector: words v. funding

The distributions in the “WATER & SANITATION” sector are among the most similar pairs (K-S test p-value = 0.4218).

Figure 11

ICT sector: words v. funding

The distributions in the ICT sector are among the least similar (K-S test p-value = 0.0001).

Figure 12

Concordances: a.k.a. keywords in context

Another useful analysis for exploring text data is concordance, which enables a closer look at the context surrounding a word (or combination of words). This approach can help clarify a word’s specific meaning or reveal underlying patterns in the data.
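A concordance (keyword-in-context) view boils down to locating a phrase and collecting the words on either side of it. A minimal Python sketch (the `kwic` helper is hypothetical; dedicated functions exist in R packages such as quanteda):

```python
def kwic(texts, phrase, window=4):
    # Return (left context, match, right context) for each occurrence
    hits = []
    target = phrase.lower().split()
    for text in texts:
        words = text.lower().split()
        for i in range(len(words) - len(target) + 1):
            if words[i : i + len(target)] == target:
                left = " ".join(words[max(0, i - window) : i])
                right = " ".join(words[i + len(target) : i + len(target) + window])
                hits.append((left, phrase, right))
    return hits

pdos = ["To respond promptly and effectively in the event of an eligible crisis or emergency."]
print(kwic(pdos, "eligible crisis", window=3))
# → [('event of an', 'eligible crisis', 'or emergency.')]
```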

The bigram “eligible crisis” in the PDOs

For instance, among the most frequent bigrams (two-word combinations) in the PDO text (illustrated in Figure 2), the phrase “eligible crisis” stands out. Besides appearing in the PDOs of 112 projects, this phrase is often used in a similar context: in 32% of these cases, it is paired with phrases like “respond promptly and effectively” or “immediate and effective response”. As shown in Table 4, this suggests a recurring standard phrasing.

Table 4: Context of the bigram “eligible crisis” in the PDOs
WB Project ID Excerpt of PDO Sentences with 'Eligible Crisis'
P179499 (...) and effective response in the case of an eligible crisis or emergency.
P176608 (...) promptly and effectively in the event of an eligible crisis or emergency.
P151442 (...) assistance programs and, in the event of an eligible crisis or emergency, to provide immediate and effective response
P177329 (...) eligible crisis or emergency, respond promptly and effectively to it.
P127338 (...) capacity to respond promptly and effectively in an eligible crisis or emergency, as required.
P158504 (...) immediate and effective response in case of an eligible crisis or emergency.
P173368 (...) immediate and effective response in case of an eligible crisis or emergency in the kingdom of cambodia.
P178816 (...) the project regions and to respond to an eligible crisis
P160505 (...) the project area, and, in the event of an eligible crisis or emergency, to provide immediate and effective response
P149377 (...) mozambique to respond promptly and effectively to an eligible crisis or emergency.

The bigram “climate change” in the PDOs

Another frequently occurring bigram is “climate change”, found in 92 PDOs. Table 5 displays words that commonly appear near this bigram. Notably, the word “mitigation” (which I associate with a more aspirational, long-term response) appears more frequently than “adaptation” (which I view as a more practical, short-term response). However, the ratio would flip if “resilience” is counted as conveying a similar practical intent as “adaptation”. This is another insight worth exploring further in the future.

Table 5: Frequent words near “climate change”
Near 'climate change' Count Percentage
vulnerability 25 39.1%
mitigate 14 21.9%
resilience 14 21.9%
adapt 6 9.4%
hazard 5 7.8%

Table 6 shows a few examples for each of the words most frequently found in the vicinity of the bigram “climate change”.

Table 6: Context of the bigram “climate change” in the PDOs
Near word (root) WB Project ID Closest Text
adapt
adapt P090731 (...) pilot adaptation measures addressing primarily, the impacts of climate change on their natural resource base, focused on biodiversity
adapt P120170 (...) a multi-sectoral dpl to enhance climate change adaptation capacity is anticipated in the cps.
adapt P129375 (...) objectives of the project are to: (i) integrate climate change adaptation and disaster risk reduction across the recipient’s
hazard
hazard P174191 (...) and health-related hazards, including the adverse effects of climate change and disease outbreaks.
hazard P123896 (...) agencies to financial protection from losses caused by climate change and geological hazards.
hazard P117871 (...) buildings and infrastructure due to natural hazards or climate change impacts; and (b) increased capacity of oecs governments
mitig
mitig P074619 (...) to help mitigate global climate change through carbon emission reductions (ers) of 138,000 tco2e
mitig P164588 (...) institutional capacity for sustainable agriculture, forest conservation and climate change mitigation.
mitig P094154 (...) removing carbon from the atmosphere and to mitigate climate change in general.
resil
resil P154784 (...) to increase agricultural productivity and build resilience to climate change risks in the targeted smallholder farming and pastoralcommunities
resil P112615 (...) the resilience of kiribati to the impacts of climate change on freshwater supply and coastal infrastructure.
resil P157054 (...) to improve durability and enhance resilience to climate change
vulnerab
vulnerab P149259 (...) to measurably reduce vulnerability to natural hazards and climate change impacts in grenada and in the eastern caribbean
vulnerab P146768 (...) at measurably reducing vulnerability to natural hazards and climate change impacts in the eastern caribbean sub-region.
vulnerab P117871 (...) at measurably reducing vulnerability to natural hazards and climate change impacts in the eastern caribbean sub-region.

DATA QUALITY ENHANCEMENT

This section shifts focus to a new area of exploration: the possibility of enhancing metadata quality by predicting missing features in the World Bank project documents. The idea is to use the Project Development Objective (PDO) words as input to predict missing categorical descriptors (sector, environmental risk category, etc.) for some of the observations. Table 7 shows some missing features in the source dataset.

Table 7: Missing features in source dataset
Variable N obs. N distinct N missing % missing
ESrisk 4403 5 3980 90.4%
theme1 4403 73 1254 28.5%
env_cat 4403 8 1167 26.5%
sector1 4403 71 17 0.4%

One candidate variable that could be predicted is env_cat (“Environmental Assessment Category”). This is a categorical variable with 7 levels (A, B, C, F, H, M, U), but, to simplify, I collapsed it into a binary outcome defined as “High-Med-risk” and “Low-risk-Othr” (as illustrated in Table 8).

Table 8: Binary outcome obtained from the env_cat variable
High-Med-risk Low-risk-Othr
A_high risk 351 0
B_med risk 1830 0
C_low risk 0 880
F_fin expos 0 127
Other 0 48
Missing 0 1167
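The collapse from the seven env_cat levels to the binary outcome amounts to a simple mapping. A Python sketch mirroring Table 8 (the grouping of levels H, M, and U under “Other” is an assumption inferred from the table):

```python
# Assumed mapping from env_cat levels to the binary outcome of Table 8
BINARY_MAP = {
    "A": "High-Med-risk",  # high risk
    "B": "High-Med-risk",  # medium risk
    "C": "Low-risk-Othr",  # low risk
    "F": "Low-risk-Othr",  # financial exposure
    "H": "Low-risk-Othr",  # grouped under "Other" (assumed)
    "M": "Low-risk-Othr",  # grouped under "Other" (assumed)
    "U": "Low-risk-Othr",  # grouped under "Other" (assumed)
}

def collapse(env_cat):
    # Missing values stay missing: they become the validation set to predict
    return BINARY_MAP.get(env_cat) if env_cat else None

print([collapse(c) for c in ["A", "B", "C", None]])
# → ['High-Med-risk', 'High-Med-risk', 'Low-risk-Othr', None]
```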

Using ML models to predict a missing feature

The task at hand is text classification, that is, assigning categories to observations. To predict a missing feature from a mix of text data and other available predictors, several machine learning (ML) algorithms can be applied. I tested a few suitable ones.

The sample split (necessary in ML to hold out a testing dataset for model evaluation) was based on the availability of the env_cat variable. The sample was actually split into three groups:

  1. Training set (with env_cat available) 2,264 observations
  2. Testing set (with env_cat available) 972 observations
  3. Validation set (with env_cat missing) 1,167 observations

Choosing the ML algorithm

To predict the missing binary categorical outcome env_cat_f2, I tried several models, including Lasso logistic regression (with different specifications, using either text only or a mix of text and other predictors) and Naïve Bayes classification (here I only report the results; details can be found on this webpage). Since text data is sparse and high-dimensional, it is critical to pre-treat the features (i.e., the explanatory variables) before modeling.

  • LASSO (for logistic regression) is an approach that applies a penalty to the coefficients so that only the most useful of all the possible features (tokens) are selected. It is a good choice when dealing with a high-dimensional dataset, like text data.

  • Naïve Bayes classification is a simple and efficient algorithm for text classification. It assumes feature independence, which may not always hold, but it’s often a good baseline, particularly with short texts.

Other supervised ML algorithms could be used in this case, such as Random Forest, Support-Vector Machines, K-Nearest Neighbors, but they were not tested here.
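As a rough sketch of this text-classification setup: the post’s models are fit in R, but the same idea can be illustrated with scikit-learn, combining TF-IDF weighting with an L1-penalized (LASSO-style) logistic regression. The toy PDO snippets and binary labels below are invented:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Invented PDO snippets with hypothetical binary risk labels
pdos = [
    "construct highway and expand the power transmission grid",
    "build dam irrigation canals and rural roads",
    "rehabilitate port infrastructure and coastal works",
    "strengthen public financial management capacity",
    "improve teacher training and school curriculum",
    "support policy reform and institutional capacity building",
]
labels = ["High-Med", "High-Med", "High-Med", "Low-Othr", "Low-Othr", "Low-Othr"]

# TF-IDF features + L1 penalty, which zeroes out uninformative tokens
model = make_pipeline(
    TfidfVectorizer(),
    LogisticRegression(penalty="l1", solver="liblinear"),
)
model.fit(pdos, labels)
preds = model.predict(["expand grid and construct roads"])
```

In the actual analysis, non-text predictors (sector, region, FY of approval) would be appended to the TF-IDF feature matrix before fitting.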

The steps to predict the missing feature

  1. Outcome label engineering: Define what to predict (outcome variable, \(y\)), and its functional form (binary or multiclass; log form or not if numeric).

  2. Sample design: Select the observations to use. In ML this is typically done by splitting the sample into training and testing sets.

  3. Feature Engineering: Define the input data (predictors, \(X\)) and their format. Here, text data was combined with other predictors (e.g. sector, region, FY approved, etc.) to create a feature matrix.

    • Text preprocessing: The text data was preprocessed by tokenization, filtering of tokens by frequency, removal of stopwords, weighting via TF-IDF (Term Frequency-Inverse Document Frequency), to make it suitable for ML algorithms.

  4. Model selection and fitting: The models were trained on the training set.

    • Different algorithms will have different parameters that can be adjusted which can affect the performance of the model (hyperparameters tuning, typically done while training the model).

  5. Prediction: The best model was used to predict the missing env_cat_f2 and evaluate the model’s performance on the hold-out sample (testing set).

  6. Evaluation: The predictions were evaluated on the testing set based on performance metrics:

    • accuracy, which is the proportion of correct predictions, and
    • ROC-AUC (Receiver Operating Characteristic - Area Under the Curve), which summarizes how well the model can distinguish between classes.

  7. Interpretation: The model was interpreted to understand which features were most important in predicting the outcome.

ML is an iterative process, so it is common to revise (some of) the above steps multiple times to refine the model.

Models and Results

Table 9 reports the specifications of the models and their performance.

Table 9: Comparison of models and results for binary outcome
Algorithm Features Specification Accuracy ROC_auc
LASSO logistic regression Text only env_cat_f2 ~ pdo 0.750 0.777
LASSO logistic regression (more preprocessing) Text only env_cat_f2 ~ pdo 0.762 0.807
LASSO logistic regression (more preprocessing) Text + other predictors env_cat_f2 ~ pdo + sector_f + regionname + FYapprov 0.790 0.850
Naïve Bayes classification Text + other predictors env_cat_f2 ~ pdo + sector_f + regionname + FYapprov 0.691 0.784

The best performance was achieved by the LASSO logistic regression model that combined the PDOs’ text with available metadata to predict the missing env_cat_f2 in the testing set. The model achieved an accuracy of 0.79 and a ROC-AUC of 0.85, where:

  • accuracy is the proportion of correct predictions made by the model out of all predictions or, in other words, how often the model is correct overall.
  • ROC-AUC (Receiver Operating Characteristic - Area Under the Curve) goes further by evaluating the model’s ability to distinguish between classes across various thresholds. It summarizes how well the model separates the classes, providing a more nuanced view of its performance, especially useful when the class distribution is uneven.
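Both metrics are easy to compute from scratch. A minimal Python illustration (ROC-AUC via its rank interpretation: the probability that a random positive scores above a random negative; labels and scores are invented):

```python
def accuracy(y_true, y_pred):
    # Proportion of correct predictions
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

def roc_auc(y_true, scores):
    # Probability that a random positive outranks a random negative,
    # with ties counted as half; equals the area under the ROC curve
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

y_true = [1, 1, 1, 0, 0]
scores = [0.9, 0.8, 0.4, 0.5, 0.2]
y_pred = [1 if s >= 0.5 else 0 for s in scores]
print(accuracy(y_true, y_pred), roc_auc(y_true, scores))
# → 0.6 0.8333333333333334
```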

Performance of the preferred ML model

Figure 13 presents the confusion matrix for the preferred ML model used to predict the missing environment risk category assigned to World Bank projects. This matrix shows the distribution of true and predicted classifications. Ideally, a high-performing model would have most observations (or darker shading) along the diagonal, indicating correct classifications—specifically, true positives in the top-left quadrant and true negatives in the bottom-right quadrant.

In this case, the model performs well in predicting the environment risk category for the High-Med group but struggles with the Low & Other group. Many of these cases are incorrectly classified as High-Med Risk (false positives). This result is understandable, as the Low & Other category is more loosely defined and even includes Missing observations (which, in hindsight, could have been excluded from the prediction).

Figure 13

Most important features for prediction

It’s also insightful to examine which coefficients are most influential in the model. This can be done visually through the feature importance plot (see Figure 14).

The feature importance plot displays the top 50 predictors of the environmental risk (binary) category, ranked by their impact in a LASSO logistic regression model. For clarity, predictors are divided according to the risk level they predict. As expected, given the structure of the data, words from the PDO text (those variables starting with pdo_*) are among the most important predictors. However, other predictors also play a significant role, such as sector_f_TRANSPORT (left panel), regionname, and sector_f_FINANCIAL (right panel).

Figure 14: Top 50 most important features in the preferred ML model

Prediction and Interpretation

While the model’s prediction performance is not particularly remarkable, it is sufficient to illustrate the potential of this analysis to enhance the quality of incomplete datasets. With further improvements in preprocessing, feature engineering, algorithm selection, and hyperparameter tuning, there is significant potential to optimize a similar ML model.

Although not reported here, I also explored predicting a multiclass outcome (sector, grouped into 7 levels). However, the results were less favorable compared to the binary classification. This outcome is expected, as multiclass classification is inherently more challenging, particularly with imbalanced data or limited sample sizes.

CONCLUSIONS

  • This project was primarily a proof-of-concept for learning purposes, so optimizing ML performance and conducting in-depth data analysis were not priorities. Nevertheless, it showcased the potential of applying NLP techniques to unstructured text data, uncovering insights such as:

    • identifying trends in sector-specific language and topics over time,
    • revealing unexpected patterns and relationships, like recurring phrases or topics,
    • enhancing text classification and metadata tagging with ML models,
    • sparking additional text-based questions that could guide further research.
  • Future steps could include exploring explanations for observed patterns by combining this NLP analysis with other data sources (e.g., World Bank official statements or project data) and experimenting with advanced NLP techniques for topic modeling.

  • One pain point with this type of work is accessing document data efficiently. Even with the World Bank’s “Access to Information” policy, getting programmatic access to their text data is still tricky (no dedicated API, outdated pages, broken links). This could benefit from an approach similar to the accessible, well-maintained World Development Indicators (WDI) data.

  • With all the buzz around AI and Large Language Models (LLMs), this kind of analysis might seem like yesterday’s news. But I think there’s still huge, untapped potential for using NLP in development studies, policy analysis, and beyond—especially when it’s backed by domain expertise.

Acknowledgements

Below are some great resources—especially geared toward programmers—to learn and implement NLP techniques.

References

Engel, Claudia, and Scott Bailey. 2022. Text Analysis with R. https://cengel.github.io/R-text-analysis/.
Francom, Jerid. 2024. An Introduction to Quantitative Text Analysis for Linguistics: Reproducible Research Using R. 1st ed. London: Routledge. https://doi.org/10.4324/9781003393764.
Future Mojo, dir. 2022. Natural Language Processing Demystified - YouTube. https://www.youtube.com/playlist?list=PLw3N0OFSAYSEC_XokEcX8uzJmEZSoNGuS.
Heiss, Andrew. 2022. “Text.” Data Visualization Course. 2022. https://datavizs22.classes.andrewheiss.com/example/13-example/#sentiment-analysis.
Hvitfeldt, Emil, and Julia Silge. 2022. Supervised Machine Learning for Text Analysis in R. First edition. Data Science Series. Boca Raton London New York: CRC Press. https://smltar.com/.
Silge, Julia, and David Robinson. 2017. Text Mining with R: A Tidy Approach. O’Reilly. https://www.tidytextmining.com/.