
June 16, 2026 · Carlos Crespo

The Algorithm That Reads Your Genes

Give a random forest a cell's gene-expression pattern and it can classify the cell type and show which gene markers drove the call. A look at how this older machine-learning method does real work in medicine.

Machine Learning · Healthcare · Random Forest · Data Science

An algorithm can read your genes, and I mean that in a very practical way. Hand a random forest a single cell's gene-expression pattern and it will tell you what type of cell it is and which genes gave it away!

A random forest is not a large language model, and it is not trying to write you an email or paint you a picture. It is a machine-learning method built around decision trees: instead of trusting just one tree, it grows many of them, each trained on a random resample of the data, lets them vote, and uses that group decision to classify new data. This kind of machine learning has been around for decades, quietly doing pattern recognition, classification, and prediction long before the current wave of chatbots took the spotlight. It does not get much attention, but it still does a huge amount of the real work.
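The voting step can be sketched in a few lines of base R. This is only an illustration: the five tree votes and the class names `e1` and `not_e1` are made up for the example, and a real forest would use hundreds of trees.

```r
# Hypothetical votes from five decision trees for one new cell.
tree_votes <- c("e1", "e1", "not_e1", "e1", "not_e1")

# Count the votes per class and turn them into vote fractions.
vote_counts <- table(tree_votes)
vote_fractions <- vote_counts / length(tree_votes)

# The forest's answer is the class with the most votes.
forest_prediction <- names(vote_counts)[which.max(vote_counts)]

print(vote_fractions)
print(forest_prediction)
```

The vote fractions are also what a forest reports when you ask it for class probabilities, which is why a prediction comes with a percentage attached.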

Figure: Where a random forest sits on the spectrum from generative models to classic machine learning

That might sound less flashy than a chatbot, but in a lot of settings it is exactly what you want. One of those settings is medicine and biological research, where the question is usually closer to "can we classify this sample correctly?" or "which features mattered most?" than to "can you write me a paragraph?"

For one of my machine-learning studies this past year, I worked through a random forest pipeline on single-cell transcriptomics data. The dataset came straight out of research on cell type discovery, where scientists study gene-expression patterns to identify and describe different cell types. The study was never about building a chatbot. It was about taking a real dataset, checking it, training a model, testing it, and then asking the question that actually matters: does the model's reasoning line up with the known biology? That last part is what made the whole thing so fun!

The dataset had 871 samples and 608 gene-expression features, and each sample was labeled as either inside or outside the e1 cluster. The biological ground truth came from published research that pinned down the e1 cluster through marker genes like TESPA1, LINC00507, and SLC17A7, with KCNIP1 as a negative marker. I tried different settings for the number of trees (ntree), the number of variables sampled at each split (mtry), and the classification cutoff. The best version landed on 500 trees, an mtry of 49, and a cutoff of 0.3.
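To make the cutoff setting less abstract, here is a minimal base R sketch of the rule the randomForest package documents: the winning class is the one with the largest ratio of vote fraction to cutoff. The post only gives the cutoff as a single number, so the sketch assumes 0.3 applies to the e1 class and 0.7 to everything else. The vote fractions and the helper name `pick_class` are made up for illustration.

```r
# Hypothetical vote fractions for one borderline cell.
vote_fractions <- c(e1 = 0.40, not_e1 = 0.60)

# Winning class = largest ratio of vote fraction to cutoff.
pick_class <- function(votes, cutoff) {
  ratios <- votes / cutoff[names(votes)]
  names(votes)[which.max(ratios)]
}

# Default cutoffs (1 / number of classes) give a plain majority vote.
default_call <- pick_class(vote_fractions, c(e1 = 0.5, not_e1 = 0.5))

# Assumed reading of "cutoff of 0.3": e1 needs a smaller share of votes to win.
lowered_call <- pick_class(vote_fractions, c(e1 = 0.3, not_e1 = 0.7))

print(default_call)
print(lowered_call)

# For comparison, the package's default mtry for classification is
# floor(sqrt(number of features)), so 49 is roughly double the default here.
default_mtry <- floor(sqrt(608))
print(default_mtry)
```

Under those assumptions the same cell flips from "not e1" to "e1" once the cutoff is lowered, which is the kind of trade-off you are tuning when one class is rarer or more important to catch.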

The model performed really well, reaching 99.655% accuracy on the out-of-bag samples, which is the same thing as a tiny 0.345% out-of-bag error rate. On the two held-out verification samples it nailed both, calling the positive sample positive with 78.2% confidence and the negative sample negative with a confident 99.4%!
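Those percentages are easier to picture as counts. The arithmetic below assumes the two verification samples were set aside from the 871, leaving 869 training rows, and that three of them were misclassified out of bag. Neither number is stated outright in the write-up, but together they reproduce the reported figures.

```r
total_samples <- 871
held_out <- 2
training_rows <- total_samples - held_out   # assumed: 869 rows used for training

misclassified <- 3                          # assumed out-of-bag mistakes

oob_error_pct <- round(100 * misclassified / training_rows, 3)
oob_accuracy_pct <- round(100 - 100 * misclassified / training_rows, 3)

print(oob_error_pct)
print(oob_accuracy_pct)
```

If that reading is right, the model got all but a handful of cells correct on data each tree had not seen during its own training.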

Figure: Random forest results: 871 samples, 608 features, 99.655 percent accuracy

The accuracy is cool, but honestly the feature ranking is the part that really got me. The top three ranked features came back as TESPA1, SLC17A7, and LINC00507, which lined up exactly with the marker genes the biology paper expected for the e1 cluster. KCNIP1 showed up near the top too, which made total sense, since it sits on the negative-marker side of the definition and so also helps tell e1 cells apart from the rest.

Figure: Feature ranking with the top marker genes highlighted

That is the part I would really want people to take away. The model handed us a prediction and also handed us a way to look back at which features drove it. In a medical or research setting that matters a lot: if a system says "this sample belongs here," researchers and clinicians almost always need more than a final answer. They need to understand what signals pushed the model toward that call, and that is exactly where classic machine learning and explainability still carry so much weight.

Generative AI was still in the room. I leaned on ChatGPT and Claude throughout the project for debugging, organizing my code, and checking that my report lined up with my results, which was genuinely helpful. But the model actually doing the classification was a random forest. So two very different kinds of AI showed up in the same workflow: one helping me move faster, the other doing the real analysis on the biological data.

That is the bigger picture I keep coming back to: AI is not one single thing. It is a whole set of methods. Some are great at language, some are great at images, some are better for structured data, some are easier to explain, and some are just more useful when you need a clear feature ranking instead of a polished paragraph.

Medicine is a perfect example of why that matters. The FDA already keeps a public list of AI-enabled medical devices authorized in the United States, and a lot of those are not chatbots at all. They are tools that process images, signals, measurements, and patient data. NIH talks about machine learning across electronic health records, omics data, imaging, and disease-specific data. In plain English, a huge amount of AI in medicine is really just pattern recognition at scale. Can we spot something in an image? Can we classify a sample? Can we find the genes that separate one cell type from another? Can we give a doctor or researcher a faster and more consistent way to work through complicated data?

The random forest study was a small, hands-on version of that whole idea: a dataset went in, a trained model came out, and the final result got checked against real biological ground truth. It reminded me that chatbots are great, and I use them constantly, but they are only one layer. If we stop at generative AI, we miss a lot of the older and still incredibly useful tools that quietly make AI work in fields like medicine, finance, logistics, and research. I thought this one was fun to revisit because it shows both sides at once. AI can be the assistant helping you write cleaner code, and AI can also be the model quietly classifying biological data. Same umbrella, very different job!

Try It Yourself

You do not need the original dataset to get the idea. The code below shows the basic shape of a random forest pipeline in R, and you can swap in any CSV where the last column is the label you want to predict.

library(randomForest)

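# Replace this with your own CSV.
# The final column should be the class label.
data <- read.csv("your_data.csv", header = TRUE)

# Rename the final column to "Label" so the formula below works
# no matter what that column is called in your file.
names(data)[ncol(data)] <- "Label"
data$Label <- as.factor(data$Label)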

set.seed(123)

# Hold out a small sample for a final check.
holdout_index <- sample(seq_len(nrow(data)), 2)
verification_data <- data[holdout_index, ]
training_data <- data[-holdout_index, ]

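# Train a random forest classifier.
# ntree and mtry are the settings from the project described above.
# Lower mtry if your data has fewer than 49 feature columns.
# importance = TRUE is needed for the accuracy-based ranking below.
model <- randomForest(
  Label ~ .,
  data = training_data,
  ntree = 500,
  mtry = 49,
  importance = TRUE
)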

# Check model performance.
print(model)

# See which features mattered most.
feature_importance <- importance(model, type = 1)
feature_ranking <- data.frame(
  Feature = rownames(feature_importance),
  Score = feature_importance[, 1]
)
feature_ranking <- feature_ranking[order(feature_ranking$Score, decreasing = TRUE), ]
head(feature_ranking, 10)

# Test on held-out samples.
predicted_class <- predict(model, verification_data)
predicted_probability <- predict(model, verification_data, type = "prob")

print(predicted_class)
print(predicted_probability)

This is really just the surface-level version of the project: you load the data, hold out a few examples, train the model, check the accuracy, rank the features, and then test on examples the model never saw during training. That last step matters a ton, because a model that only works on the data it already memorized is not very useful!
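One more step worth trying is the comparison against known biology: check where the expected marker genes land in the ranking. The sketch below is self-contained, so it builds a small stand-in ranking table with the same two columns as `feature_ranking`. The scores, the placeholder genes `GENE_A` and `GENE_B`, and the `top_n` threshold are all made up for illustration and are not results from the project.

```r
# Stand-in ranking table, already sorted by score (scores are invented).
feature_ranking <- data.frame(
  Feature = c("TESPA1", "SLC17A7", "LINC00507", "GENE_A", "KCNIP1", "GENE_B"),
  Score = c(30, 28, 25, 12, 11, 4)
)

# Marker genes named in the post (three positive, one negative).
expected_markers <- c("TESPA1", "LINC00507", "SLC17A7", "KCNIP1")

# Position of each expected marker in the ranking (NA if it is missing).
marker_rank <- match(expected_markers, feature_ranking$Feature)
names(marker_rank) <- expected_markers
print(marker_rank)

# Did every expected marker land within the top few features?
top_n <- 5
all_in_top <- all(!is.na(marker_rank) & marker_rank <= top_n)
print(all_in_top)
```

With a real model you would skip the stand-in table and run the last few lines against the `feature_ranking` produced by the pipeline above.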

Closing Note

This is not medical advice, and it is not a claim that this model is ready for clinical use. It is a learning example. The reason I like it so much is that it makes AI feel a lot less abstract: you can see the full pipeline, you can see the features, and you can compare the model's ranking against published biological ground truth. That makes it much easier to understand why AI in medicine goes way past chatbots answering questions. Sometimes it is just a model quietly sorting through hundreds of measurements and pointing researchers toward the signals that actually matter.

Sources

  • Leo Breiman, "Random Forests", Machine Learning, 2001. https://link.springer.com/article/10.1023/A:1010933404324
  • Aevermann et al., "Cell type discovery using single-cell transcriptomics", Human Molecular Genetics, 2018. https://academic.oup.com/hmg/article/27/R1/R40/4953379
  • FDA, Artificial Intelligence-Enabled Medical Devices. https://www.fda.gov/medical-devices/software-medical-device-samd/artificial-intelligence-enabled-medical-devices
  • NIH, Artificial Intelligence. https://datascience.nih.gov/artificial-intelligence
  • R randomForest package. https://cran.r-project.org/web/packages/randomForest/index.html
