Predicting Alpha Thalassemia Phenotype using Clinical and Hematological Measures
Ashley Williams
Alpha thalassemia is a prevalent genetic disorder with a wide spectrum of clinical severity, ranging from asymptomatic carrier states to fatal severe anemia. Genotype-phenotype correlations are generally strong but variable, so reliable prognostic tools are needed for clinical management and improved disease prevention. This project aimed to develop and validate a logistic regression model to predict the clinical phenotype (silent carrier or alpha trait) of alpha thalassemia patients using clinical and hematological variables.
Results indicate that hemoglobin concentration, red blood cell count, and lymphocyte percentage together are useful predictors of phenotype. On the held-out test data, the final model achieved 74.42% accuracy and 83.33% sensitivity while remaining easy to interpret, demonstrating its potential use as a clinical decision support tool. The resulting model provides a cost-effective method to aid in alpha thalassemia assessment and prevention, with particular application in low-resource settings where advanced testing is inaccessible.
Alpha thalassemia is an inherited blood disorder in which the body produces an insufficient amount of hemoglobin, leading to anemia.
This model is intended as a tool that can support clinical decisions, especially in low-resource environments. Its overall purpose is to explore how well blood test results alone classify patients as Silent Carriers or Alpha Trait carriers of alpha thalassemia, and which biomarkers contribute most to accurate classification.
Among hemoglobin (hb), red blood cell count (rbc), lymphocyte % (lymph), neutrophil % (neut), and platelet count (plt), which biomarkers are most predictive of whether a patient is a Silent Carrier or Alpha Trait carrier?
phenotype: Phenotype of the patient, either Silent Carrier or Alpha Trait
hb: Hemoglobin concentration in grams per decilitre (g/dL)
rbc: Red blood cell count in \(10^{12}\)/L
mchc: Mean corpuscular hemoglobin concentration in g/dL
plt: Total platelet count in \(10^6\)/L
pcv: Packed cell volume (hematocrit) in %
wbc: Total white blood cell count in \(10^6\)/L
lymph: Percentage of white blood cells that are lymphocytes
neut: Percentage of white blood cells that are neutrophils
I obtained my data for this project from Kaggle.
There is one missing value in this data set, for the variable mch, mean corpuscular hemoglobin. However, I am not using this variable in my model so this is not a concern.
I next converted the categorical variables sex (Male = 0, Female = 1) and phenotype (Silent Carrier = 0, Alpha Trait = 1) into factors, such that they can be used in my logistic regression approach.
# A tibble: 2 × 2
phenotype medhb
<chr> <dbl>
1 Alpha Trait 10.8
2 Silent Carrier 11.8
# A tibble: 2 × 2
phenotype medrbc
<chr> <dbl>
1 Alpha Trait 5.21
2 Silent Carrier 5.03
# A tibble: 2 × 2
phenotype medpcv
<chr> <dbl>
1 Alpha Trait 33.1
2 Silent Carrier 35.9
# A tibble: 2 × 2
phenotype medmcv
<chr> <dbl>
1 Alpha Trait 65.5
2 Silent Carrier 72.5
# A tibble: 2 × 2
phenotype medlymph
<chr> <dbl>
1 Alpha Trait 41.5
2 Silent Carrier 45.5
# A tibble: 2 × 2
phenotype medplt
<chr> <dbl>
1 Alpha Trait 354.
2 Silent Carrier 332.
For this project, I am going to employ a binary logistic regression approach.
Logistic regression is a method used to predict the probability of a discrete outcome of two mutually exclusive events.
Logistic regression analyzes the relationship between the target and predictor variables by utilizing a logistic function to model the probability of an event occurring, rather than a continuous value as seen in linear regression.
To complete my logistic regression approach, I utilize several R packages such as: caret, nnet, pROC, and pscl.
Training data:
silent_carrier alpha_trait
74 30
Testing data:
silent_carrier alpha_trait
31 12
Additionally, phenotype was coded as 0 = silent carrier
and 1 = alpha trait to facilitate binary logistic regression.
My original model was phenotype ~ hb + pcv + rbc + mcv + mchc + rdw + wbc + lymph + neut + plt.
This full logistic regression model included hb, pcv, rbc, mcv, mchc, rdw, wbc, lymph, neut, and plt as potential predictors of phenotype.
These variables were included due to their clinical relevance when assessing a patient for alpha thalassemia. Variables were then systematically excluded based on p-value until reaching the model phenotype ~ hb + rbc + lymph + mchc, where I had 2 significant variables. Then, mchc was excluded from the model due to high collinearity with the significant variable hb: MCHC is calculated from hemoglobin and packed cell volume, so the two carry overlapping information.
A binary logistic regression model was fitted predicting phenotype from hemoglobin concentration (hb), red blood cell count (rbc), and percentage of lymphocytes among white blood cells (lymph).
This model was selected after testing several reduced models, as it ultimately had the best performance on key logistic regression outputs and had the most significant variables out of many combinations.
Call:
glm(formula = formula_logit, family = binomial, data = train)
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) 3.57589 2.24398 1.594 0.111038
hb -0.69551 0.20504 -3.392 0.000694 ***
rbc 0.95577 0.46345 2.062 0.039182 *
lymph -0.03237 0.01848 -1.752 0.079794 .
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 124.96 on 103 degrees of freedom
Residual deviance: 109.47 on 100 degrees of freedom
AIC: 117.47
Number of Fisher Scoring iterations: 4
After assessing the performance of the model phenotype ~ hb + rbc + lymph, I was dissatisfied with its results.
Because the response variable, phenotype, is imbalanced, with far fewer alpha trait cases than silent carriers, the sensitivity is low.
As such, I employed two alternative approaches, downsampling and upsampling, to mitigate this disparity. Downsampling shrinks the larger class, and upsampling enlarges the smaller class by resampling. Both techniques address class imbalance so that a predictive model does not become biased toward the majority class.
Downsampled data:
silent_carrier alpha_trait
30 30
Upsampled data:
silent_carrier alpha_trait
74 74
I then fit 2 additional regression models using the same variables with these new data.
Downsampled model:
Call:
glm(formula = formula_logit, family = binomial, data = train_down)
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) 3.27326 2.44235 1.340 0.1802
hb -0.40747 0.21123 -1.929 0.0537 .
rbc 0.41620 0.49873 0.835 0.4040
lymph -0.01873 0.02222 -0.843 0.3993
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 83.178 on 59 degrees of freedom
Residual deviance: 78.568 on 56 degrees of freedom
AIC: 86.568
Number of Fisher Scoring iterations: 4
Upsampled model:
Call:
glm(formula = formula_logit, family = binomial, data = train_up)
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) 5.02473 1.89356 2.654 0.00796 **
hb -0.79168 0.17127 -4.622 3.79e-06 ***
rbc 1.13951 0.36343 3.135 0.00172 **
lymph -0.04157 0.01522 -2.731 0.00631 **
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 205.17 on 147 degrees of freedom
Residual deviance: 174.51 on 144 degrees of freedom
AIC: 182.51
Number of Fisher Scoring iterations: 3
\(\log\left(\frac{p}{1-p}\right) = 5.025 - 0.792\,(\text{hb}) + 1.140\,(\text{rbc}) - 0.0416\,(\text{lymph})\)
This model predicts the probability of a patient having the alpha trait phenotype (Y=1).
hb rbc lymph
1.682957 1.599386 1.114799
Analysis of Deviance Table
Model 1: phenotype ~ 1
Model 2: phenotype ~ hb + rbc + lymph
Resid. Df Resid. Dev Df Deviance Pr(>Chi)
1 147 205.17
2 144 174.51 3 30.661 1.002e-06 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
fitting null model for pseudo-r2
llh llhNull G2 McFadden r2ML r2CU
-87.2553144 -102.5857827 30.6609366 0.1494405 0.1871173 0.2494898
(Intercept) hb rbc lymph
152.1297849 0.4530813 3.1252225 0.9592807
Confusion Matrix and Statistics
Reference
Prediction silent_carrier alpha_trait
silent_carrier 22 2
alpha_trait 9 10
Accuracy : 0.7442
95% CI : (0.5883, 0.8648)
No Information Rate : 0.7209
P-Value [Acc > NIR] : 0.44239
Kappa : 0.4607
Mcnemar's Test P-Value : 0.07044
Sensitivity : 0.8333
Specificity : 0.7097
Pos Pred Value : 0.5263
Neg Pred Value : 0.9167
Prevalence : 0.2791
Detection Rate : 0.2326
Detection Prevalence : 0.4419
Balanced Accuracy : 0.7715
'Positive' Class : alpha_trait
The likelihood ratio test (LRT) compares the logistic regression model to a null model containing only an intercept. The residual deviance decreases from 205.17 to 174.51 when the predictors are added, producing a test statistic of 30.661 on 3 degrees of freedom (p < 0.0001). This small p-value indicates that the predictors significantly improve model fit compared to the null model. The included clinical variables therefore add meaningful explanatory power for predicting the phenotype of alpha thalassemia.
Pseudo-\(R^2\) values provide additional measures of model fit for logistic regression. McFadden’s pseudo-\(R^2\) was 0.1494, which indicates moderate fit. The maximum-likelihood \(R^2\) (r2ML = 0.1871, Cox & Snell) similarly suggests improvement over the null model. Nagelkerke’s pseudo-\(R^2\) was 0.2495, meaning the model achieves about 24.95% of the maximum possible improvement in fit, relative to the null model. Overall, these values indicate that the logistic regression model provides a moderate fit to the data and predicts phenotype better than the null model.
For each additional 1 g/dL increase in hemoglobin concentration, the odds of the patient presenting the alpha trait phenotype of alpha thalassemia decrease by about 54.69%, holding all other variables constant. Here, 54.69% comes from the odds ratio (1 − 0.4531 = 0.5469 = 54.69%).
For each \(1 \times 10^{12}\) cells/L increase in red blood cell count, the odds of the patient presenting the alpha trait phenotype of alpha thalassemia more than triple (about 3.13 times), holding all other predictors constant.
For each additional 1 percentage point of white blood cells that are lymphocytes, the odds of the patient presenting the alpha trait phenotype of alpha thalassemia decrease by about 4.07%, holding all other variables constant. Here, 4.07% comes from the odds ratio (1 − 0.9593 = 0.0407 = 4.07%).
Using a 0.5 probability cutoff on the test data, the model achieved an accuracy of 74.42%, with 83.33% sensitivity and 70.97% specificity.
The ROC curve yielded an AUC of 0.872, indicating excellent discrimination between patients with the silent carrier phenotype and those with the alpha trait phenotype.
Overall, three predictors showed statistically significant associations with the phenotype of alpha thalassemia. After adjusting for all other variables in the model, red blood cell count was a strong predictor: each one-unit (\(1 \times 10^{12}\) cells/L) increase in red blood cell count more than tripled the odds of presenting the alpha trait (OR≈3.13).
Patients with lower hemoglobin concentration are at an elevated risk: after accounting for other variables, each additional 1 g/dL increase in hemoglobin concentration decreases the odds of presenting the alpha trait by about 54.69%.
Increasing lymphocyte % was likewise associated with lower odds of presenting the alpha trait; for each additional 1 percentage point of white blood cells that are lymphocytes, the odds decrease by about 4.07%. Although this percentage is small, lymphocyte % is a statistically significant predictor of phenotype in the model, so it may still be useful.
A sensitive test reduces false negatives, making it good for screening; a specific test reduces false positives, making it good for confirmation. In this case, diagnoses of alpha thalassemia are confirmed using genetic tests. This model is intended for screening; thus the heightened sensitivity of the final model is the focus.
Overall, this logistic regression model demonstrates that a small set of clinical predictors provides explanatory and predictive power for assessing the phenotype of alpha thalassemia patients, while remaining easy to interpret for use in clinical screening and disease prevention.
rbc had a p-value of <0.01 (from the Wald z-test) and, after accounting for other predictors, each additional \(1 \times 10^{12}\) cells/L increase in red blood cells more than tripled (about 3.13 times) the odds of the patient presenting the alpha trait phenotype of alpha thalassemia. Overall, this model uses this red blood cell abnormality, which is more prominent in Alpha Trait carriers, to aid in predicting disease phenotype.
One major limitation in this study is the small sample size. A small sample can lead to biased and imprecise results that don’t accurately represent the larger population.
Another major limitation in this study, that goes along with the small sample size, is the disproportionality in the size of classes of the response variable. I noted throughout this study how this impacted and destabilized the model, and eventually led to using an upsampling approach.
Data: Kaggle.
---
title: "AT Analysis"
output:
flexdashboard::flex_dashboard:
theme:
version: 4
navbar-bg: "#b53533"
bootswatch: default
orientation: columns
vertical_layout: fill
source_code: embed
---
<head>
<base target="_blank">
</head>
```{r setup, include=FALSE}
library(flexdashboard)
pacman::p_load(caret, nnet, pROC, pscl, tidyverse, DT)
data <- read.csv("twoalphas.csv")
data$phenotype <- ifelse(data$phenotype == "alpha trait", 1, 0)
#alpha trait = 1
#female = 1
data$sex <- ifelse(data$sex == "female", 1, 0)
data2 <- data %>%
mutate(
phenotype = factor(phenotype, levels = c(0, 1),
labels = c("silent_carrier", "alpha_trait")),
sex = factor(sex, levels = c(0, 1),
labels = c("Male", "Female")))
dataEDA <- data
dataEDA$phenotype <- recode(dataEDA$phenotype, `0` = "Silent Carrier", `1` = "Alpha Trait")
```
Title
===
Column {data-width=450}
---
### <b><span Style="color:#4f0c0b">Title</span></b>
<font size=8><b><span Style="color:#b53533">Predicting Alpha Thalassemia Phenotype using Clinical and Hematological Measures</span></b></font>
<font size=6><b><span Style="color:#d15c5a">Ashley Williams</span></b></font>
Column {data-width=550}
---
### <b><span Style="color:#4f0c0b">Abstract</span></b>
Alpha thalassemia is a prevalent genetic disorder with a wide spectrum of clinical severity, ranging from asymptomatic carrier states to fatal severe anemia. Genotype-phenotype correlations are generally strong but variable, so reliable prognostic tools are needed for clinical management and improved disease prevention. This project aimed to develop and validate a logistic regression model to predict the clinical phenotype (silent carrier or alpha trait) of alpha thalassemia patients using clinical and hematological variables.
Results indicate that hemoglobin concentration, red blood cell count, and lymphocyte percentage together are useful predictors of phenotype. On the held-out test data, the final model achieved 74.42% accuracy and 83.33% sensitivity while remaining easy to interpret, demonstrating its potential use as a clinical decision support tool. The resulting model provides a cost-effective method to aid in alpha thalassemia assessment and prevention, with particular application in low-resource settings where advanced testing is inaccessible.
Background
===
Column {.tabset data-width=500}
-----------------------------------------------------------------------
### <font size=2.8><span Style="color:#4f0c0b">Background</span></font>
Alpha thalassemia is an inherited blood disorder in which the body produces an insufficient amount of hemoglobin, leading to anemia.
- Alpha thalassemia occurs when 1 or more of the 4 total alpha-globin genes (2 inherited from each parent), which contribute to the synthesis of hemoglobin molecules, are mutated or deleted.
- There are multiple types of alpha thalassemia with a range of severities. In this project, I focus on the following:
- <b>Alpha thalassemia silent carrier:</b> One alpha-globin gene is affected, the other 3 are wildtype. Blood tests are often normal, but their red blood cells may be smaller than normal. Being a silent carrier means you don’t have signs of the disease, but you can pass the damaged gene on to progeny. This is confirmed by DNA tests.
- <b>Alpha thalassemia trait carrier:</b> Two genes are affected. Patient likely to have mild anemia.
- Having 3 affected genes leads to Hemoglobin H disease, where the patient has moderate to severe anemia. Having all 4 affected genes causes severe anemia, where most cases lead to prenatal death.
- There is no cure for Alpha thalassemia. Thus, effective screening to detect Thalassemia carriers is vital to prevention. There are many challenges to an effective screening program, especially in low-resource settings. Considering alpha-thalassemia, genetic testing is needed for a confirmatory diagnosis of a carrier, which is expensive and not widely available. Thus follows the importance of building predictive models that can act as decision-support tools, because they are easy to deploy and use in low-resource settings where other options are limited.
### <font size=2.8><span Style="color:#4f0c0b">Research Questions</span></font>
This model is intended as a tool that can support clinical decisions, especially in low-resource environments. Its overall purpose is to explore how well blood test results alone classify patients as Silent Carriers or Alpha Trait carriers of alpha thalassemia, and which biomarkers contribute most to accurate classification.
- Which minimal subset of hematologic predictors faithfully captures the biological distinction between the silent carrier and alpha trait phenotypes?
- Which blood parameters best distinguish Silent Carriers from Alpha Trait carriers?
- Among hemoglobin (`hb`), red blood cell count (`rbc`), lymphocyte % (`lymph`), neutrophil % (`neut`), and platelet count (`plt`), which biomarkers are most predictive of whether a patient is a Silent Carrier or Alpha Trait carrier?
- Are red-blood-cell abnormalities more prominent in Alpha Trait carriers?
- Is RBC count a significant independent predictor of Alpha Trait status after controlling for other blood parameters?
- Will hemoglobin concentration be the strongest predictor of phenotype, and will a decrease in hemoglobin correspond to higher log-odds of the more severe alpha thalassemia phenotype?
### <font size=2.8><span Style="color:#4f0c0b">Variables of Interest</span></font>
- This dataset contains 16 total variables. The following were the key variables of interest considered in this project:
- `phenotype`, Phenotype of the patient, either Silent Carrier or Alpha Trait
- `hb`, Hemoglobin concentration in grams per decilitre (g/dL)
- `rbc`, Red blood cell count in $10^{12}$/L
- `mchc`, Mean corpuscular hemoglobin concentration in g/dL
- `plt`, Total platelet count in $10^6$/L
- `pcv`, Packed cell volume (hematocrit) in %
- `wbc`, Total white blood cell count in $10^6$/L
- `lymph`, Percentage of white blood cells that are lymphocytes
- `neut`, Percentage of white blood cells that are neutrophils
### <font size=2.8><span Style="color:#4f0c0b">Source & Cleaning</span></font>
I obtained my data for this project from [Kaggle](https://www.kaggle.com/datasets/letslive/alpha-thalassemia-dataset?select=twoalphas.csv).
- About the dataset
- This dataset is from a database of 288 cases from the Human Genetics Unit (HGU) of the Faculty of Medicine, Colombo, Sri Lanka.
- The data used in this project (n=147) was collected from Alpha thalassemia carrier children and their family members screened, from 2016 to 2020.
- Data Cleaning
- There is one missing value present in this data set. It was missing for the variable `mch`, mean corpuscular hemoglobin. However, I am not using this variable in my model so this is not a concern.
- I next converted the categorical variables `sex` and `phenotype` into factors as described below, such that they can be used in my logistic regression approach.
- Sex, where <b>Male = 0</b> and <b>Female = 1</b>
- Phenotype, where <b>Silent Carrier = 0</b> and <b>Alpha Trait = 1</b>
- Finally, I checked the distributions of all the variables for outliers. There were some, but none were outside the realm of biological possibility, so all 147 observations in the data used for this project were included.
Column {.tabset data-width=500}
-----------------------------------------------------------------------
### <span Style="color:#4f0c0b">Data Cleaning Intro</span>
```{r}
library(DataExplorer)
plot_intro(data)
```
### <span Style="color:#4f0c0b">Data Cleaning Histogram</span>
```{r}
plot_histogram(data)
```
EDA
===
Column {.tabset data-width=400}
---
### 1
- <b>Phenotype Analysis:</b>
- The most frequently observed phenotype in this data set was the silent carrier.
- There is an imbalance in the proportion of the phenotypes, which may lead to difficulties generating a strong, sensitive model.
### 2
- <b>Hb Analysis:</b>
- The median hemoglobin concentration for patients who have the alpha trait is 10.8 g/dL, which is smaller than the median for silent carriers, which is 11.8 g/dL.
- This difference is consistent with the disease presentation, as patients with more severe forms of Alpha Thalassemia are deficient in hemoglobin.
```{r}
dataEDA %>% group_by(phenotype) %>% summarize(medhb = median(hb))
```
### 3
- <b>Rbc Analysis:</b>
- The median red blood cell count for patients who have the alpha trait is $5.21 \times 10^{12}$/L, which is larger than that of silent carriers, $5.03 \times 10^{12}$/L.
- This difference is not incredibly large, however this distinction between the two phenotypes may make this a useful predictor of alpha thalassemia phenotype.
```{r}
dataEDA %>% group_by(phenotype) %>% summarize( medrbc = median(rbc))
```
### 4
- <b>Pcv Analysis:</b>
- The median packed cell volume (hematocrit) for patients who have the alpha trait is 33.1%, which is smaller than the median for silent carriers, 35.9%.
- This variable shows another interesting distinction between the two phenotypes which may make this a useful predictor.
```{r}
dataEDA %>% group_by(phenotype) %>% summarize( medpcv = median(pcv))
```
### 5
- <b>Mcv Analysis:</b>
- The median mean cell volume for patients who have the alpha trait is 65.5 fL, which is considerably smaller than the median observed in silent carriers, 72.5 fL.
- The difference in the medians indicates this variable may be a useful predictor of phenotype.
```{r}
dataEDA %>% group_by(phenotype) %>% summarize( medmcv = median(mcv))
```
### 6
- <b>Lymph Analysis:</b>
- The median lymphocyte percentage for patients who have the alpha trait is 41.5%, which is slightly smaller than the median for silent carriers, 45.5%.
- This difference is not large; however, it may be of use in predicting phenotype.
```{r}
dataEDA %>% group_by(phenotype) %>% summarize( medlymph = median(lymph))
```
### 7
- <b>Plt Analysis:</b>
- The median platelet count for patients who have the alpha trait is $354 \times 10^6$/L, which is larger than the median for silent carriers, $332 \times 10^6$/L.
- This difference between the two phenotypes indicates that this variable may also be a useful predictor.
```{r}
dataEDA %>% group_by(phenotype) %>% summarize( medplt = median(plt))
```
Column {.tabset data-width=600}
---
### Phenotype
```{r}
# Label each bar with its computed count instead of hard-coded text
ggplot(dataEDA, aes(x=phenotype))+geom_bar(fill="#d15c5a", color="black")+labs(title="Distribution of Phenotype", x="Phenotype", y="Count") + geom_text(stat="count", aes(label=after_stat(count)), vjust=-0.5)
```
### Hb
```{r}
ggplot(dataEDA, aes(x=phenotype, y=hb))+geom_boxplot(fill="#d15c5a")+labs(title="Distribution of Hemoglobin Concentration by Phenotype", x="Phenotype", y="Hemoglobin concentration (g/dL)")
```
### Rbc
```{r}
ggplot(dataEDA, aes(x=phenotype, y=rbc))+geom_boxplot(fill="#d15c5a")+labs(title="Distribution of RBC Count by Phenotype", x="Phenotype", y="RBC Count (10^12/L)")
```
### Pcv
```{r}
ggplot(dataEDA, aes(x=phenotype, y=pcv))+geom_boxplot(fill="#d15c5a")+labs(title="Distribution of Packed Cell Volume by Phenotype", x="Phenotype", y="Packed Cell Volume (hematocrit, %)")
```
### Mcv
```{r}
ggplot(dataEDA, aes(x=phenotype, y=mcv))+geom_boxplot(fill="#d15c5a")+labs(title="Distribution of Mean Cell Volume by Phenotype", x="Phenotype", y="Mean Cell Volume (fL)")
```
### Lymph
```{r}
ggplot(dataEDA, aes(x=phenotype, y=lymph))+geom_boxplot(fill="#d15c5a")+labs(title="Distribution of Lymphocyte Percentage by Phenotype", x="Phenotype", y="Lymphocyte Percentage")
```
### Plt
```{r}
ggplot(dataEDA, aes(x=phenotype, y=plt))+geom_boxplot(fill="#d15c5a")+labs(title="Distribution of Platelet Count by Phenotype", x="Phenotype", y="Platelet Count (1x10^6/L)")
```
Methods
===
Column {.tabset data-width=1000}
---
### Methods
For this project, I am going to employ a binary logistic regression approach.
- Logistic regression is a method used to predict the probability of a discrete outcome of two mutually exclusive events.
- In this case, the model predicts the probability of a person being either a silent carrier or possessing the alpha trait phenotype.
- Logistic regression analyzes the relationship between the target and predictor variables by utilizing a logistic function to model the probability of an event occurring, rather than a continuous value as seen in linear regression.
- To complete my logistic regression approach, I utilize several R packages such as: caret, nnet, pROC, and pscl.
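
As a small illustration of the logistic function described above, the sketch below maps a few linear-predictor values to probabilities in base R. The values in `eta` are arbitrary and chosen only for illustration; `inv_logit` is a helper name introduced here.

```r
# Logistic (inverse-logit) function: maps any real number into (0, 1)
inv_logit <- function(eta) 1 / (1 + exp(-eta))

eta <- c(-2, 0, 2)      # arbitrary linear-predictor values (illustrative)
p   <- inv_logit(eta)   # corresponding probabilities

# Base R's plogis() implements the same function; qlogis() is its inverse
all.equal(p, plogis(eta))
all.equal(qlogis(p), eta)
```

A linear predictor of 0 corresponds to a probability of exactly 0.5, which is why a 0.5 cutoff on the probability is equivalent to asking whether the log-odds are positive.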
### Set up
- To fit and evaluate the logistic regression model, the data needs to be partitioned into two sets.
- (1) Training data, used to estimate model parameters.
- (2) Test data, to assess how well the model works on new, unseen data.
- The dataset was split into 70% training and 30% testing using stratified random sampling, which preserves the class proportions in both sets.
```{r}
library(caret)
set.seed(11)
idx <- createDataPartition(data2$phenotype, p = 0.7, list = FALSE)
train <- data2[idx, ]
test <- data2[-idx, ]
```
Training data:
```{r}
table(train$phenotype)
```
Testing data:
```{r}
table(test$phenotype)
```
Additionally, `phenotype` was coded as 0 = silent carrier and 1 = alpha trait to facilitate binary logistic regression.
### Model selection
My original model was `phenotype` ~ `hb` + `pcv` + `rbc` + `mcv` + `mchc` + `rdw` + `wbc` + `lymph` + `neut` + `plt`.
This full logistic regression model included hb, pcv, rbc, mcv, mchc, rdw, wbc, lymph, neut, and plt as potential predictors of phenotype.
These variables were included due to their clinical relevance when assessing a patient for alpha thalassemia. Variables were then systematically excluded based on p-value until reaching the model
`phenotype` ~ `hb` + `rbc` + `lymph` + `mchc`
where I had 2 significant variables. Then, `mchc` was excluded from the model due to high collinearity with the significant variable `hb`: MCHC is calculated from hemoglobin and packed cell volume, so the two carry overlapping information.
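
The link between `mchc` and `hb` can be made concrete with the standard red-cell index formula, MCHC = Hb / PCV × 100. The sketch below applies it to the Alpha Trait group medians reported in the EDA tab. Because medians do not combine exactly, the result is only indicative, and the helper name `mchc_from` is made up for this illustration.

```r
# Standard red-cell index: MCHC (g/dL) = Hb (g/dL) / PCV (%) * 100
mchc_from <- function(hb, pcv) hb / pcv * 100

# Group medians for Alpha Trait patients, taken from the EDA summaries
hb_med  <- 10.8   # g/dL
pcv_med <- 33.1   # %

mchc_est <- mchc_from(hb_med, pcv_med)
round(mchc_est, 1)
```

Since hemoglobin sits in the numerator, any change in `hb` moves `mchc` as well, which is one plausible reason the two predictors compete in the same model.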
A binary logistic regression model was fitted predicting `phenotype` from hemoglobin concentration `hb`, red blood cell count `rbc`, and percentage of lymphocytes among white blood cells `lymph`.
This model was selected after testing several reduced models, as it ultimately had the best performance on key logistic regression outputs and had the most significant variables out of many combinations.
```{r}
formula_logit <- phenotype ~ hb + rbc + lymph
logit_model <- glm(formula_logit, data = train, family = binomial)
summary(logit_model)
```
### Up/Down sampling
After assessing the performance of the model `phenotype` ~ `hb` + `rbc` + `lymph`,
I was dissatisfied with its results.
Because the response variable, `phenotype`, is imbalanced, with far fewer alpha trait cases than silent carriers, the sensitivity is low.
As such, I employed two alternative approaches, downsampling and upsampling, to mitigate this disparity. Downsampling shrinks the larger class, and upsampling enlarges the smaller class by resampling. Both techniques address class imbalance so that a predictive model does not become biased toward the majority class.
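
To show the idea behind the two resampling approaches, here is a minimal base R sketch on toy data. It is conceptually similar to what `downSample()` and `upSample()` from caret do, but it is not their implementation, and the data frame `toy` is invented for the example.

```r
set.seed(1)  # reproducible resampling

# Toy imbalanced data (illustrative only, not the study data)
toy <- data.frame(
  x = rnorm(10),
  y = factor(c(rep("silent_carrier", 7), rep("alpha_trait", 3)))
)

# Row indices grouped by class
rows_by_class <- split(seq_len(nrow(toy)), toy$y)
n_min <- min(lengths(rows_by_class))
n_max <- max(lengths(rows_by_class))

# Downsampling: keep n_min rows from every class, without replacement
down_idx <- unlist(lapply(rows_by_class, function(i) {
  i[sample.int(length(i), n_min)]
}))

# Upsampling: keep every row, then redraw minority rows with replacement
up_idx <- unlist(lapply(rows_by_class, function(i) {
  c(i, i[sample.int(length(i), n_max - length(i), replace = TRUE)])
}))

table(toy$y[down_idx])
table(toy$y[up_idx])
```

Resampling is applied to the training data only, so the test set keeps its natural class proportions. One caveat worth keeping in mind: upsampled rows are duplicates, so standard errors and p-values from a model fit to them tend to be optimistic.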
Downsampled data:
```{r}
set.seed(2025)
train_down <- downSample(x = train %>% select(-phenotype),
y = train$phenotype,
yname = "phenotype")
table(train_down$phenotype)
```
Upsampled data:
```{r}
set.seed(2025)
train_up <- upSample(x = train %>% select(-phenotype),
y = train$phenotype,
yname = "phenotype")
table(train_up$phenotype)
```
I then fit 2 additional regression models using the same variables with these new data.
Downsampled model:
```{r}
logit_model_down <- glm(formula_logit, data = train_down, family = binomial)
summary(logit_model_down)
```
- Since no predictor was significant at the 0.05 level in this model, I did not proceed further with it.
Upsampled model:
```{r}
logit_model_up <- glm(formula_logit, data = train_up, family = binomial)
summary(logit_model_up)
```
- In this upsampled model, all three predictors were significant at the 0.05 level. As such, I proceeded to check diagnostics and performance. It performed better or similarly on most checks, and in particular it improved the sensitivity from 50% to 83.33%. <b>As such, I selected this upsampled model as my final model.</b>
$\log\left(\frac{p}{1-p}\right)$ = 5.025 - 0.792(`hb`) + 1.140(`rbc`) - 0.0416(`lymph`)
This model predicts the probability of a patient having the alpha trait phenotype (Y=1).
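
The fitted equation can be turned into a predicted probability by hand. The sketch below uses the coefficients from the upsampled model summary and two hypothetical patient profiles built from the group medians in the EDA tab; the function name `predict_alpha_trait` is introduced only for this example.

```r
# Coefficients copied from the upsampled model summary
beta <- c(intercept = 5.02473, hb = -0.79168, rbc = 1.13951, lymph = -0.04157)

# Predicted probability of the alpha trait phenotype for given blood values
predict_alpha_trait <- function(hb, rbc, lymph) {
  log_odds <- beta[["intercept"]] + beta[["hb"]] * hb +
    beta[["rbc"]] * rbc + beta[["lymph"]] * lymph
  plogis(log_odds)  # inverse logit: log-odds -> probability
}

# Hypothetical profiles built from the group medians in the EDA summaries
p_alpha  <- predict_alpha_trait(hb = 10.8, rbc = 5.21, lymph = 41.5)
p_silent <- predict_alpha_trait(hb = 11.8, rbc = 5.03, lymph = 45.5)
round(c(alpha_profile = p_alpha, silent_profile = p_silent), 3)
```

With these inputs the first profile lands above the 0.5 cutoff and the second below it, which matches the direction of the coefficients: lower hemoglobin and higher red blood cell count push the prediction toward alpha trait.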
### Assumptions
- Linearity of continuous predictors with log-odds was assessed using logit plots. This condition was satisfied.
- Multicollinearity was checked using VIF values. All VIFs are below 2, indicating minimal multicollinearity among predictors, so this condition is satisfied.
```{r}
library(car)
vif(logit_model_up)
```
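
For intuition about what a VIF measures, the sketch below computes the textbook definition, $1 / (1 - R_j^2)$, on simulated predictors. This is an illustration under assumptions: the data are made up, and `car::vif()` applied to a `glm` works from the fitted model's coefficient covariance, so its values need not match this linear-regression version exactly.

```r
set.seed(42)

# Toy predictors (illustrative only): x2 is deliberately correlated with x1
n  <- 200
x1 <- rnorm(n)
x2 <- 0.6 * x1 + rnorm(n, sd = 0.8)
x3 <- rnorm(n)
X  <- data.frame(x1, x2, x3)

# VIF_j = 1 / (1 - R^2_j), with R^2_j from regressing x_j on the other predictors
vif_manual <- sapply(names(X), function(v) {
  f  <- reformulate(setdiff(names(X), v), response = v)
  r2 <- summary(lm(f, data = X))$r.squared
  1 / (1 - r2)
})
round(vif_manual, 2)
```

A VIF of 1 means a predictor is uncorrelated with the others; the correlated pair `x1` and `x2` should show larger values than the independent `x3`.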
Model Performance
===
Column {.tabset data-width=500}
---
### Goodness of fit
1.
```{r}
null_model <- glm(phenotype ~ 1, data = train_up, family = binomial)
anova(null_model, logit_model_up, test = "Chisq")
```
2.
```{r}
pR2(logit_model_up)
```
### Key effects
```{r}
or <- exp(coef(logit_model_up))
or
```
### CM
```{r}
test_prob <- predict(logit_model_up, newdata = test, type = "response")
test_pred <- ifelse(test_prob >= 0.5, "alpha_trait", "silent_carrier") %>%
factor(levels = levels(test$phenotype))
cm <- confusionMatrix(test_pred, test$phenotype, positive = "alpha_trait")
cm
```
### ROC/AUC
```{r}
roc_obj <- roc(response = test$phenotype,
predictor = test_prob,
levels = c("silent_carrier", "alpha_trait"),
direction = "<")
plot(roc_obj,
print.auc = TRUE,
legacy.axes = TRUE,
main = "ROC Curve for Alpha Thalassemia Model")
```
Column {.tabset data-width=500}
---
### LRT/pR2
(1) The likelihood ratio test (LRT) compares the logistic regression model to a null model containing only an intercept. The residual deviance decreases from 205.17 to 174.51 when the predictors are added, producing a test statistic of 30.661 on 3 degrees of freedom (p < 0.0001). This small p-value indicates that the predictors significantly improve model fit compared to the null model. The included clinical variables therefore add meaningful explanatory power for predicting the phenotype of alpha thalassemia.
(2) Pseudo-$R^2$ values provide additional measures of model fit for logistic regression.
McFadden’s pseudo-$R^2$ was 0.1494, which indicates moderate fit.
The maximum-likelihood $R^2$ (r2ML = 0.1871, Cox & Snell) similarly suggests improvement over the null model.
Nagelkerke’s pseudo-$R^2$ was 0.2495, meaning the model achieves about 24.95% of the maximum possible improvement in fit, relative to the null model. Overall, these values indicate that the logistic regression model provides a moderate fit to the data and predicts phenotype better than the null model.
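
The figures quoted above can be reproduced from the two deviances alone. The sketch below does so in base R, assuming the upsampled training set has 148 rows (74 + 74) and using the fact that, for ungrouped binary data, the deviance equals $-2$ times the log-likelihood.

```r
# Values copied from the model output
dev_null  <- 205.17   # null deviance
dev_model <- 174.51   # residual deviance
df_diff   <- 3        # number of predictors added
n         <- 148      # rows in the upsampled training data (74 + 74)

# Likelihood ratio test: drop in deviance compared to a chi-squared distribution
g2    <- dev_null - dev_model
p_val <- pchisq(g2, df = df_diff, lower.tail = FALSE)

# Pseudo-R^2 measures (deviance = -2 * log-likelihood for ungrouped binary data)
mcfadden   <- 1 - dev_model / dev_null
cox_snell  <- 1 - exp(-g2 / n)
nagelkerke <- cox_snell / (1 - exp(-dev_null / n))

round(c(G2 = g2, McFadden = mcfadden, CoxSnell = cox_snell,
        Nagelkerke = nagelkerke), 4)
signif(p_val, 3)
```

Nagelkerke's measure is simply Cox & Snell's rescaled by its maximum attainable value, which is why it is always the larger of the two.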
### Key effects
- For each additional 1 g/dL increase in hemoglobin concentration, the odds of the patient presenting the alpha trait phenotype of alpha thalassemia decrease by about 54.69%, holding all other variables constant. Here, 54.69% comes from the odds ratio of 0.4531 (1 − 0.4531 = 0.5469 = 54.69%).
- For each $1 \times 10^{12}$ cells/L increase in red blood cell count, the odds of the patient presenting the alpha trait phenotype of alpha thalassemia more than triple (odds ratio of about 3.13), holding all other predictors constant.
- For each additional 1% of the white blood cell population that is lymphocytes, the odds of the patient presenting the alpha trait phenotype of alpha thalassemia decrease by about 4.07%, holding all other variables constant. Here, 4.07% comes from the odds ratio of 0.9593 (1 − 0.9593 = 0.0407 = 4.07%).
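
The conversion from an odds ratio to a percent change in odds is the same for all three predictors. A minimal sketch, using the odds ratios quoted in the bullets above (the vector name `odds_ratios` is illustrative):

```r
# Odds ratios as reported in the text (exponentiated model coefficients)
odds_ratios <- c(hb = 0.4531, rbc = 3.13, lymph = 0.9593)

# Percent change in odds per one-unit increase: (OR - 1) * 100
# Negative values are decreases, positive values are increases
pct_change <- (odds_ratios - 1) * 100
round(pct_change, 2)
# hb: -54.69, rbc: 213.00, lymph: -4.07
```

An odds ratio of 3.13 therefore corresponds to a 213% increase in odds, which is the same statement as the odds being multiplied by about 3.13.
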
### CM
Using a 0.5 probability cutoff on the test data, the model achieved an accuracy of 74.42%, with 83.33% sensitivity and 70.97% specificity.
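
To show how these three metrics relate to the cells of a confusion matrix, the sketch below uses hypothetical counts. They are the smallest set of whole numbers consistent with the reported percentages and are an assumption for illustration; the actual test-set counts are those printed by `cm`.

```r
# Hypothetical counts (assumption, chosen to match the reported percentages)
tp <- 10  # alpha_trait correctly predicted as alpha_trait
fn <- 2   # alpha_trait predicted as silent_carrier
tn <- 22  # silent_carrier correctly predicted as silent_carrier
fp <- 9   # silent_carrier predicted as alpha_trait

accuracy    <- (tp + tn) / (tp + tn + fp + fn)  # share of all correct calls
sensitivity <- tp / (tp + fn)                   # share of alpha_trait detected
specificity <- tn / (tn + fp)                   # share of silent_carrier detected

round(100 * c(accuracy = accuracy,
              sensitivity = sensitivity,
              specificity = specificity), 2)
# accuracy: 74.42, sensitivity: 83.33, specificity: 70.97
```
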
### ROC/AUC
The ROC curve yielded an AUC of 0.872, indicating excellent discrimination between patients with the silent carrier phenotype and those with the alpha trait phenotype.
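
One way to read the AUC is as the probability that a randomly chosen alpha trait patient receives a higher predicted probability than a randomly chosen silent carrier. The sketch below illustrates that reading on toy scores that are invented for the example and are not taken from the study data.

```r
# Toy predicted probabilities (illustrative only, not study data)
p_alpha  <- c(0.9, 0.8, 0.4)       # patients who truly have the alpha trait
p_silent <- c(0.7, 0.3, 0.2, 0.1)  # patients who are truly silent carriers

# Compare every alpha trait score with every silent carrier score;
# a tie counts as half a concordant pair
wins <- outer(p_alpha, p_silent, ">")
ties <- outer(p_alpha, p_silent, "==")
auc  <- mean(wins + 0.5 * ties)

auc  # 11 of 12 pairs are concordant, about 0.917
```
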
### Conclusion
Overall, the three predictors in the final model (hemoglobin concentration, red blood cell count, and lymphocyte %) showed statistically significant associations with the phenotype of alpha thalassemia. After adjusting for all other variables in the model, red blood cell count was a strong predictor: each one-unit ($1 \times 10^{12}$ cells/L) increase in red blood cell count more than tripled the odds of presenting the alpha trait (OR≈3.13).
Patients with lower hemoglobin concentration have elevated odds of presenting the alpha trait: after accounting for other variables, each additional 1 g/dL of hemoglobin concentration decreases the odds of presenting the alpha trait by about 54.69%.
Similarly, increasing lymphocyte % was associated with lower odds of presenting the alpha trait; for each additional 1% of the white blood cell population that is lymphocytes, the odds decrease by about 4.07%. Although this per-unit change is small, lymphocyte % is a statistically significant predictor of phenotype in the model, and the effect accumulates over differences of several percentage points, so it may still be useful.
A sensitive test reduces false negatives, making it good for screening; a specific test reduces false positives, good for confirmation. In this case, diagnoses for alpha thalassemia are confirmed using genetic tests. The purpose of this model is for screening purposes; thus the heightened sensitivity of the final model is the focus.
Overall, this logistic regression model demonstrates that a small set of clinical predictors provides explanatory and predictive power for assessing phenotype of Alpha Thalassemia patients, while maintaining easy interpretability for use in clinical screenings and application in disease prevention.
Conclusion
===
Column {.tabset data-width=1000}
---
### Conclusions
- <b>Which minimal subset of hematologic predictors faithfully captures the biological distinction between the silent carrier and alpha trait phenotypes? Which blood parameters best distinguish Silent Carriers from Alpha Trait carriers?</b>
- Hemoglobin concentration, red blood cell count, and lymphocyte percentage of the white blood cell population were found to be the minimal subset of predictors that provide predictive strength and sensitivity for alpha thalassemia phenotype.
- Of these hematologic variables in the final model, hemoglobin concentration and red blood cell count were the most predictive of patient phenotype (based on the p-values from the Wald tests on the model coefficients).
- <b>Will hemoglobin concentration be the strongest predictor of alpha thalassemia, and will a decrease in hemoglobin lead to a higher log-odds ratio of a more severe phenotype of alpha thalassemia?</b>
- Hemoglobin concentration was found to be the strongest predictor of phenotype, as hypothesized. It had the smallest p-value (from the Wald test) of the predictors in the final model, and for each additional 1 g/dL increase in hemoglobin concentration, the odds of the patient presenting the alpha trait phenotype of alpha thalassemia decrease by about 54.69%, holding all other variables constant.
- This result makes sense considering the pathology of alpha thalassemia. Alpha thalassemia is a genetic disorder in which the body produces a reduced amount of alpha-globin protein chains, which are a key component of normal hemoglobin. This reduced production leads to a shortage of functional hemoglobin. Having more abnormal copies of the alpha-globin genes (as in the alpha trait phenotype) leads to less production of hemoglobin, so the result of the model is consistent with the pathophysiology of alpha thalassemia.
- <b>Are red-blood-cell abnormalities more prominent in Alpha Trait carriers? Is RBC count a significant independent predictor of Alpha Trait status after controlling for other blood parameters?</b>
- RBC count was found to be a significant predictor of phenotype in the final model, indicating the prominence of red blood cell anomalies in alpha trait patients. In the final model, `rbc` had a p-value of <0.01 (from the Wald test), and after accounting for other predictors, each additional $1 \times 10^{12}$ cells/L increase in red blood cell count more than tripled (about 3.13 times) the odds of the patient presenting the alpha trait phenotype of alpha thalassemia. Overall, the model uses this red blood cell abnormality, which is more prominent in Alpha Trait carriers, to aid in predicting disease phenotype.
- When considering the pathophysiology of alpha thalassemia, this result makes sense. People with the alpha trait have a reduced ability to carry oxygen in individual red blood cells. The body compensates for this chronic, mild oxygen deficit by producing more red blood cells overall, a phenomenon called polycythemia.
### Limitations
- One major limitation in this study is the small sample size. A small sample can lead to biased and imprecise results that don't accurately represent the larger population.
- Another major limitation in this study, related to the small sample size, is the imbalance between the classes of the response variable. I noted throughout this study how this imbalance destabilized the model, which eventually led to using an upsampling approach.
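
For readers unfamiliar with upsampling, the idea is to resample rows of the smaller class with replacement until both classes are the same size, and to do so on the training data only so that the test set keeps its original class balance. The sketch below is a base R illustration on toy data invented for the example; it is not the code used in this analysis, and `train_toy` and `train_up` are hypothetical names. The caret package offers an `upSample()` function for the same purpose.

```r
set.seed(42)  # arbitrary seed so the resampling is reproducible

# Toy imbalanced training data (illustrative only, not study data)
train_toy <- data.frame(
  hb = c(11.2, 12.5, 13.1, 12.8, 13.4, 10.9),
  phenotype = factor(c("alpha_trait", "silent_carrier", "silent_carrier",
                       "silent_carrier", "silent_carrier", "alpha_trait"),
                     levels = c("silent_carrier", "alpha_trait"))
)

# Size of the largest class
target_n <- max(table(train_toy$phenotype))

# For each class, keep every original row and draw extra rows with
# replacement until the class reaches target_n rows
rows_by_class <- split(seq_len(nrow(train_toy)), train_toy$phenotype)
row_idx <- unlist(lapply(rows_by_class, function(idx) {
  extra <- idx[sample.int(length(idx), target_n - length(idx), replace = TRUE)]
  c(idx, extra)
}))

train_up <- train_toy[row_idx, ]
table(train_up$phenotype)  # both classes now have 4 rows
```
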
### Future Directions
- In future studies, it would be interesting to apply this approach and fit a model to predict alpha thalassemia phenotype in a global demographic. This dataset is strictly cases from Sri Lanka, and while the disease is most prevalent in tropical and subtropical regions (most prominently South and Southeast Asia, the Mediterranean, and Africa), it still affects people globally and it would be an important direction of research to determine if a similar set of predictors can effectively predict alpha thalassemia cases from different geographical and ethnic populations.
- Another direction that would be interesting to explore is a multinomial logistic regression model for alpha thalassemia that can predict additional phenotypes of alpha thalassemia (Hemoglobin H disease and Alpha Thalassemia Major) along with the two studied in this project.
- Finally, it would be interesting to employ this approach to predict phenotypes of the disease beta thalassemia, and compare the models to understand how predictors of disease phenotype are related between the two types of thalassemia.
### References
- Harewood J, Azevedo AM. Alpha Thalassemia. [Updated 2023 Sep 4]. In: StatPearls [Internet]. Treasure Island (FL): StatPearls Publishing; 2025 Jan-. Available from: [https://www.ncbi.nlm.nih.gov/books/NBK441826/](https://www.ncbi.nlm.nih.gov/books/NBK441826/)
- Motiani A, Zubair M, Sonagra AD. Laboratory Evaluation of Alpha Thalassemia. [Updated 2024 Feb 9]. In: StatPearls [Internet]. Treasure Island (FL): StatPearls Publishing; 2025 Jan-. Available from: [https://www.ncbi.nlm.nih.gov/books/NBK587402/](https://www.ncbi.nlm.nih.gov/books/NBK587402/)
- Phanthong B, Charoenkwan P, Kamlungkuea T, Luewan S, Tongsong T. Accuracy of Red Blood Cell Parameters in Predicting α0-Thalassemia Trait Among Non-Anemic Males. J Clin Med. 2025 May 21;14(10):3591. Available from: [https://pmc.ncbi.nlm.nih.gov/articles/PMC12111872/](https://pmc.ncbi.nlm.nih.gov/articles/PMC12111872/)
Data: [Kaggle](https://www.kaggle.com/datasets/letslive/alpha-thalassemia-dataset?select=twoalphas.csv).
About the Author
===
Column {data-width=500}
---
### Background
My name is Ashley Williams and I am an undergraduate student attending the University of Dayton. I am majoring in Biology and I am minoring in Chemistry, Data Analytics, Neuroscience, and Research in the Biological Sciences. My anticipated graduation is in May of 2027.
I am an undergraduate researcher and have co-authorship of two peer-reviewed scientific papers, one from [2022](https://journals.plos.org/plosgenetics/article?id=10.1371/journal.pgen.1010653) and one from [this year](https://academic.oup.com/mbe/article/42/9/msaf213/8248050)! I conduct my research in the [Williams Lab](https://thetomwilliamslab.com/), where I specifically study the regulation of the <i>Drosophila melanogaster pale</i> gene, and its origin during the evolution of a dimorphic pigmentation trait. I have been heavily involved in scientific research since 2021, and I have also presented my research on numerous occasions including twice at the University of Dayton's <span Style="color:#cf311e">Stander Symposium</span>, at <span Style="color:#267c28">the Society for Developmental Biology's 83rd Annual Meeting</span>, and at <span Style="color:#0066b6">the American Society for Biochemistry and Molecular Biology's conference, "Evolution and core processes in gene regulation"</span>.
I am interested in pursuing a Ph.D. in the field of genetics after my graduation, and continuing my career in academia and biological research.
Column {data-width=500}
---
### Presenting
```{r, fig.width=6, echo=FALSE, fig.align='right'}
knitr::include_graphics("IMG_6751.jpeg")
```