Thoughts and Theory, MATHEMATICAL STATISTICS AND MACHINE LEARNING FOR LIFE SCIENCES

How you can find Yersinia pestis pathogen in any random sample

Pieter Bruegel the Elder “The Triumph of Death”, Museo del Prado, Madrid, image source

This is the twenty-third article from my column Mathematical Statistics and Machine Learning for Life Sciences, where I discuss in plain language some mysterious analytical techniques that are common in Computational Biology. DNA sequencing technologies applied to archeological material have tremendously enriched our knowledge of the human past. For example, analyzing DNA from human remains in historical burials can provide valuable information about ancient pandemics such as plague, which is caused by the bacterium Yersinia pestis. However, common methods of detecting ancient pathogens often suffer from a lack of specificity and may result in false discoveries. In the present article, I…


Making Sense of Big Data, MATHEMATICAL STATISTICS AND MACHINE LEARNING FOR LIFE SCIENCES

How the Curse of Dimensionality complicates Genetics research

Modified from Wikipedia Emperor’s New Clothes

This is the twenty-second article from my column Mathematical Statistics and Machine Learning for Life Sciences, where I discuss in plain language some mysterious analytical techniques that are common in Computational Biology. Genome-wide genotyping and whole-genome sequencing (WGS) brought unprecedented resolution to genetic studies in Life Sciences, but also resulted in rapid growth of high-dimensional genetic variation data that suffer from the Curse of Dimensionality. In the present article, I will give some theoretical background on the problem and discuss why it is extremely challenging to do any meaningful and robust analysis in modern Genetics and Genomics.

Genomics: Progress or Regress?

Everyone who works…


MATHEMATICAL STATISTICS AND MACHINE LEARNING FOR LIFE SCIENCES

DESeq2 vs. LASSO predictive capacity on gene expression

Image by Author

This is the twenty-first article from my column Mathematical Statistics and Machine Learning for Life Sciences where I try to explain some mysterious analytical techniques used in Bioinformatics and Computational Biology in a simple way. In my previous post Select Features for OMICs Integration I gave examples of multivariate feature selection and mentioned its advantages over univariate feature selection without actually demonstrating them. In this post, we will compare the predictive capacities of multivariate models such as LASSO, PLS and Random Forest with those of univariate models, e.g. the famous differential gene expression tool DESeq2 as well as traditional Mann-Whitney U…


Mathematical Statistics and Machine Learning for Life Sciences

Graph-based Single Cell Omics integration with UMAP

Image by Author

This is the twentieth article from the column Mathematical Statistics and Machine Learning for Life Sciences where I try to explain some mysterious analytical techniques used in Bioinformatics and Computational Biology in a simple way. Data integration is an important next step for improving analysis accuracy by utilizing synergistic effects via combining multiple sources of information. In Computational Biology and Biomedicine, data integration is making particular advances in the Single Cell research area. Last year, Nature Methods recognized Single Cell Multimodal Omics as its Method of the Year 2019. …


Mathematical Statistics and Machine Learning for Life Sciences

Linear Mixed Model via Restricted Maximum Likelihood (REML)

Image by Author

This is the nineteenth article from the column Mathematical Statistics and Machine Learning for Life Sciences where I try to explain some mysterious analytical techniques used in Bioinformatics and Computational Biology in a simple way. This is the final article in the series dedicated to the Linear Mixed Model (LMM). Previously we talked about How Linear Mixed Model Works, and about how to derive and program Linear Mixed Model from Scratch in R from the Maximum Likelihood (ML) principle. …


Mathematical Statistics and Machine Learning for Life Sciences

Derive and code LMM using Maximum Likelihood

Image source

This is the eighteenth article from the column Mathematical Statistics and Machine Learning for Life Sciences where I try to explain some mysterious analytical techniques used in Bioinformatics and Computational Biology in a simple way. Linear Mixed Model (also called Linear Mixed Effects Model) is widely used in Life Sciences, and there are many tutorials showing how to run the model in R; however, it is sometimes unclear how exactly the Random Effects parameters are optimized in the likelihood maximization procedure. In my previous post How Linear Mixed Model Works I gave an introduction to the concepts of the model, and…


Mathematical Statistics and Machine Learning for Life Sciences

And how to understand LMM through Bayesian lenses

Image source: Wikipedia Simpson’s Paradox

This is the seventeenth article from my column Mathematical Statistics and Machine Learning for Life Sciences where I try to explain some mysterious analytical techniques used in Bioinformatics and Computational Biology in a simple way. Linear Mixed Model (LMM), also known as Linear Mixed Effects Model, is one of the key techniques in traditional Frequentist statistics. Here I will attempt to derive the LMM solution from scratch from the Maximum Likelihood principle by optimizing mean and variance parameters of Fixed and Random Effects. However, before diving into derivations, I will start slowly in this post with an introduction of when and how…


Mathematical Statistics and Machine Learning for Life Sciences

At large Perplexity

This is the sixteenth article from the column Mathematical Statistics and Machine Learning for Life Sciences where I try to explain some mysterious analytical techniques used in Bioinformatics and Computational Biology in a simple way. In my previous post, tSNE vs. UMAP: Global Structure, I touched on the limit of large perplexity as a potential way for tSNE to preserve more of the global data structure, which becomes important when attempting to use tSNE beyond visualization for addressing hierarchical relations between clusters of data points (clustering). …


Mathematical Statistics and Machine Learning for Life Sciences

Why preservation of global structure is important

Image source

This is the fifteenth article from the column Mathematical Statistics and Machine Learning for Life Sciences where I try to explain some mysterious analytical techniques used in Bioinformatics and Computational Biology in a simple way. Dimension reduction techniques such as tSNE and UMAP are absolutely central for many types of data analysis, yet there is surprisingly little understanding of how exactly they work. Previously I started comparing tSNE vs. UMAP in my articles How Exactly UMAP Works, How to Program UMAP from Scratch, and Why UMAP is Superior over tSNE. Today I will share my views on to what extent…


Mathematical Statistics and Machine Learning for Life Sciences

Does initialization really matter?

This is the fourteenth post from the Mathematical Statistics and Machine Learning for Life Sciences column, where I try to explain in a simple way some mysterious analytical techniques used in Bioinformatics, Biomedicine, Genetics, etc. In my previous posts How Exactly UMAP Works and How to Program UMAP from Scratch I explained the limitations of tSNE and the way UMAP overcomes them. From the feedback I received, it seemed to me that the main message of the posts was not emphasized enough. …

Nikolay Oskolkov

Bioinformatician, SciLifeLab, Sweden
