Quantitative methods

At SAMPLED Analytics we combine quantitative models, statistical methods and software to analyze complex data.

Models

Our approach starts with wrapping all the known facts, hypotheses and assumptions about the data into quantitative models that allow us to define the quantities of interest, including data diversity and random effects.

Quantitative models are built around the specific problem being considered. In particular, the relationship between all variables of interest is explicitly modeled using mathematical expressions, and the model parameters have a precise meaning.

For instance, a set of differential equations describing the kinetics of a chemical system is a mechanistic model whose parameters are chemical rate constants with a well-defined physical meaning.
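As a minimal illustration, a first-order reaction \(A \rightarrow B\) is described by a single rate equation whose only parameter is the rate constant \(k\): $$\frac{d[A]}{dt} = -k\,[A], \qquad [A](t) = [A]_0\, e^{-kt}$$ Here \(k\) has units of inverse time and can be estimated or measured directly, which is exactly the kind of interpretability we are after.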

In contrast, machine learning models (such as neural networks, decision trees, generalized linear models, etc.) provide an abstract representation of the relationships between observables. For this reason, the parameters of such models are much harder to interpret.

The success of a machine learning model relies on its performance on unseen data (generalization). In most cases this requires training on a large dataset, which may not always be available.
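As a small sketch of this point (scikit-learn on synthetic data, purely illustrative), generalization is judged by scoring the trained model on data it has never seen:

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Synthetic one-dimensional data (invented for illustration).
rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(200, 1))
y = np.sin(X[:, 0]) + rng.normal(0, 0.2, size=200)

# Hold out 30% of the data and evaluate the model on it.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)
model = DecisionTreeRegressor(max_depth=4).fit(X_train, y_train)
print("train R^2:", model.score(X_train, y_train))
print("test  R^2:", model.score(X_test, y_test))  # performance on unseen data
```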

In addition to parameter interpretability, there are two important advantages to designing custom quantitative models. First, by explicitly modeling the relationship between the variables of interest we provide structural constraints that naturally limit overfitting. Second, the process of optimizing model parameters is simplified by choosing realistic "educated" ranges for their values.
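A minimal sketch of the second point, reusing the exponential-decay model from above and SciPy's curve_fit with explicit parameter bounds (the bounds and data are invented for illustration):

```python
import numpy as np
from scipy.optimize import curve_fit

def decay(t, a0, k):
    """First-order decay model: [A](t) = a0 * exp(-k * t)."""
    return a0 * np.exp(-k * t)

# Synthetic noisy measurements (illustrative only).
rng = np.random.default_rng(1)
t = np.linspace(0, 10, 30)
y = decay(t, 2.0, 0.4) + rng.normal(0, 0.05, t.size)

# "Educated" ranges: the initial concentration is positive and below 10, and the
# rate constant is assumed to lie between 0.01 and 10 per unit time.
popt, pcov = curve_fit(decay, t, y, p0=[1.0, 1.0], bounds=([0.0, 0.01], [10.0, 10.0]))
print("estimated a0, k:", popt)
```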

Quantitative/mechanistic models are not always available. In many cases we simply do not have good descriptions of how the variables of interest might influence each other. Machine learning models can provide an alternative way to fit data and make predictions.

Optimization and inference

Given a model and some data, one of the first questions we might ask is:

"How well does the model fit the data?"

One approach to address this question is to find the model parameters that minimize the discrepancy between model and data. Such an optimization approach can give us a "best fit", but it leaves open questions regarding the precision of the estimated parameters. How well do we know those parameters? Are the data informative enough? Should we change the model or produce more data?
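In its simplest form, this means minimizing a sum of squared residuals over the model parameters \(\theta\) (a standard least-squares criterion, written here for illustration): $$\hat{\theta} = \arg\min_{\theta} \sum_{i} \big(y_i - f(x_i; \theta)\big)^2$$ The result \(\hat{\theta}\) is a single point in parameter space and, by itself, says nothing about how well constrained each parameter actually is.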

At SAMPLED Analytics we use statistical inference to answer these questions. Instead of generating a single set of best-fit parameters, statistical inference gives access to a distribution of values, the so-called "posterior" probabilities of the model parameters.

These distributions are obtained from Bayes' theorem $$P(X | data)=\frac{P(X)\times P(data | X)}{P(data)}$$ where \(X\) is the unknown parameter of interest. The posterior distribution \(P(X|data)\) answers, probabilistically, the question: what is \(X\) given the data?

In order to calculate these probabilities we need two key ingredients: the prior distribution \(P(X)\), which represents our knowledge about \(X\) prior to the observations, and the likelihood of the data \(P(data|X)\).
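As a worked example (a textbook conjugate case, included purely for illustration), suppose \(X\) is the unknown success probability of a binary experiment and we observe \(k\) successes in \(n\) trials. With a Beta prior, the posterior is again a Beta distribution: $$P(X)=\mathrm{Beta}(\alpha,\beta), \qquad P(data|X) \propto X^{k}(1-X)^{n-k}, \qquad P(X|data)=\mathrm{Beta}(\alpha+k,\; \beta+n-k)$$ With the uniform prior \(\alpha=\beta=1\), the posterior mean is \((k+1)/(n+2)\), slightly pulled towards \(1/2\) compared to the raw frequency \(k/n\).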

This Bayesian formulation mimics the process of updating our current knowledge with observations. An important aspect of this approach is the design of prior distributions. Indeed, translating our current knowledge into a set of probabilities is non-trivial, and several methodologies have been developed to guide the design of so-called non-informative priors.
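A well-known example of such a methodology is the Jeffreys prior, which is proportional to the square root of the Fisher information and is invariant under reparametrization: $$P(X) \propto \sqrt{I(X)}$$ For the binomial example above it gives \(\mathrm{Beta}(1/2, 1/2)\) rather than the uniform \(\mathrm{Beta}(1, 1)\).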

In spite of the simple expression given by Bayes' theorem, obtaining posterior distributions for real applications is not easy and requires approximations. We employ modern Monte Carlo algorithms to provide unbiased estimates of posterior distributions.

Sampled value

Monte Carlo methods are a class of algorithms to draw statistical samples from probability distributions. Why should we sample from a distribution? It turns out that we can sample from a distribution even if we do not know how to write it explicitly. Moreover, by collecting a large number of samples we can generate histograms which provide a numerical approximation of the distribution used to generate them.

These approximations can be used to estimate mean values as well as statistical uncertainties. Moreover, a well-known result in statistics is that Monte Carlo estimates are unbiased. This is in contrast to other approaches used to compute posteriors, such as variational inference.
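A minimal sketch of this idea (NumPy, with a Gamma distribution standing in for a posterior we can already sample from; everything here is illustrative): draw samples, use the histogram as an approximation of the density, and report the sample mean together with its Monte Carlo standard error.

```python
import numpy as np

rng = np.random.default_rng(0)

# Draw samples from a distribution (a Gamma standing in for a posterior).
samples = rng.gamma(shape=3.0, scale=1.0, size=100_000)

# The histogram approximates the underlying density ...
counts, edges = np.histogram(samples, bins=60, density=True)

# ... and sample averages approximate expectations, with a quantifiable error.
mean = samples.mean()
mc_error = samples.std(ddof=1) / np.sqrt(samples.size)
print(f"estimated mean: {mean:.3f} +/- {mc_error:.3f}")  # true mean is 3.0
```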

In the past 20 years the field of computational statistics has seen a major breakthrough with the development of sequential Monte Carlo methods, enabling us to carry out statistical inference with complex models and multi-dimensional data.
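As a toy sketch of the sequential idea (a bootstrap particle filter on an invented random-walk model; the noise levels and sizes are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)

# Invented state-space model: a random walk observed with Gaussian noise.
T, n_particles = 50, 1000
sigma_x, sigma_y = 0.3, 0.5
x_true = np.cumsum(rng.normal(0, sigma_x, T))
y_obs = x_true + rng.normal(0, sigma_y, T)

# Bootstrap particle filter: propagate, weight by the likelihood, resample.
particles = rng.normal(0, 1, n_particles)
estimates = []
for y in y_obs:
    particles = particles + rng.normal(0, sigma_x, n_particles)  # predict
    weights = np.exp(-0.5 * ((y - particles) / sigma_y) ** 2)    # likelihood
    weights /= weights.sum()
    estimates.append(np.sum(weights * particles))                # filtered mean
    idx = rng.choice(n_particles, size=n_particles, p=weights)   # resample
    particles = particles[idx]

print("filter RMSE:", np.sqrt(np.mean((np.array(estimates) - x_true) ** 2)))
```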

Since the introduction of the classic Gibbs sampler and the Metropolis-Hastings acceptance rule, there has been a proliferation of new algorithms and variants that improve efficiency and speed. Our job is to employ these methods to develop dedicated algorithms, adapted to the models and data at hand.
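To make the Metropolis-Hastings rule concrete, here is a minimal random-walk sampler for the binomial posterior of the earlier example (uniform prior, \(k=7\) successes in \(n=10\) trials; all numbers are invented):

```python
import numpy as np

rng = np.random.default_rng(0)
k, n = 7, 10  # observed successes and trials (illustrative)

def log_posterior(theta):
    """Unnormalized log posterior for a binomial success probability, uniform prior."""
    if not 0.0 < theta < 1.0:
        return -np.inf
    return k * np.log(theta) + (n - k) * np.log(1.0 - theta)

theta, samples = 0.5, []
for _ in range(20_000):
    proposal = theta + rng.normal(0, 0.1)               # symmetric random-walk proposal
    log_ratio = log_posterior(proposal) - log_posterior(theta)
    if np.log(rng.uniform()) < log_ratio:               # Metropolis-Hastings acceptance rule
        theta = proposal
    samples.append(theta)

posterior = np.array(samples[2_000:])                   # discard burn-in
print(f"posterior mean: {posterior.mean():.3f}")        # analytic value is 8/12, about 0.667
```

The histogram of the retained samples approximates \(P(X|data)\), which for this toy case can be checked against the exact \(\mathrm{Beta}(8, 4)\) posterior.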

Software

We provide software implementations of our models, simulations and inferences as needed, in C++, R or Python. We are keen on making our analyses accessible and user-friendly. We have extensive experience in building graphical user interfaces (GUIs) to support complex analysis pipelines and data management.

Very often the data to be analyzed is distributed across multiple files and formats. Take the example of a PhD-long project involving hundreds of imaging datasets. Experiments need to be classified, sorted and filtered before running any analysis.

We provide support for data organization with MySQL relational databases that make it possible to link all experimental conditions and relevant information, which is then systematically passed on to the analysis stage. Systematic data organization guarantees transparent and reproducible analyses, and it is the backbone of any software that we develop.
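As a minimal sketch of the relational idea (a hypothetical schema; Python's built-in sqlite3 module is used here only as a lightweight stand-in for a MySQL database):

```python
import sqlite3

# In-memory database with one table linking each experiment to its metadata.
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE experiments (
        id INTEGER PRIMARY KEY,
        sample TEXT,
        exp_condition TEXT,
        acquired_on TEXT,
        file_path TEXT
    )
""")
conn.execute(
    "INSERT INTO experiments (sample, exp_condition, acquired_on, file_path) VALUES (?, ?, ?, ?)",
    ("S01", "control", "2024-01-15", "imaging/S01_control.tif"),  # invented entry
)

# The analysis stage then selects exactly the experiments it needs, with all
# metadata attached, instead of scanning folders by hand.
rows = conn.execute(
    "SELECT id, sample, file_path FROM experiments WHERE exp_condition = ?",
    ("control",),
).fetchall()
print(rows)
```

In a production setting the same kind of schema would live in MySQL and be queried directly from the analysis code, so that every result can be traced back to the experiment it came from.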