Regression models for compositional data: an introduction

December 20, 2018, 11:00–12:30

Toulouse

Room MS 003

MAD-Stat. Seminar

Abstract

A data vector is a composition if it contains positive values and if the total of its components is irrelevant to the analysis at hand. Mathematically, it can be modeled by a vector in the simplex of R^n. In a regression framework, one or several compositions may appear as dependent variables, as independent variables, or both. Classical linear regression is not suited to such data because of their inherent constraints: positivity and constant sum. After describing the vector space structure and the natural geometry of the simplex, we introduce definitions of the mean and variance adapted to simplex-valued random variables. We review the classical probability distributions on the simplex. We then concentrate on the log-ratio, or CODA, approach to defining a regression model with such variables. This approach is based on transforming the initial composition vector into a so-called coordinate space equipped with classical Euclidean geometry. After defining the model, we derive the maximum likelihood estimators of the parameters in the simplex space as well as in the coordinate space, and the relationships between them. We describe several methods for interpreting these parameters, based either on analyzing the predictions or on computing elasticities. We illustrate the concepts on data sets from marketing, political economy and medicine.
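
As an illustration of the log-ratio approach described in the abstract, here is a minimal Python sketch, not the speaker's code. It assumes an isometric log-ratio (ilr) transformation built on a Helmert-type basis, simulated three-part compositions, and a single hypothetical covariate t. All function and variable names are chosen for this example only. Under the additional assumption of Gaussian errors in coordinate space, the least squares fit used here coincides with the maximum likelihood estimate of the coefficients.

```python
import numpy as np

def closure(x):
    """Rescale positive parts so that each composition sums to 1."""
    x = np.asarray(x, dtype=float)
    return x / x.sum(axis=-1, keepdims=True)

def clr(x):
    """Centered log-ratio: log of the parts minus their mean log."""
    logx = np.log(closure(x))
    return logx - logx.mean(axis=-1, keepdims=True)

def helmert_basis(d):
    """Rows form an orthonormal basis of the sum-zero (clr) hyperplane."""
    V = np.zeros((d - 1, d))
    for i in range(1, d):
        V[i - 1, :i] = 1.0 / i
        V[i - 1, i] = -1.0
        V[i - 1] *= np.sqrt(i / (i + 1.0))
    return V

def ilr(x):
    """Map compositions with d parts to d-1 Euclidean coordinates."""
    d = np.shape(x)[-1]
    return clr(x) @ helmert_basis(d).T

def ilr_inv(u):
    """Map coordinates back to the simplex."""
    u = np.asarray(u, dtype=float)
    d = u.shape[-1] + 1
    return closure(np.exp(u @ helmert_basis(d)))

# Simulated data (assumption): a 3-part composition driven by one covariate t.
rng = np.random.default_rng(0)
n = 200
t = rng.uniform(-1.0, 1.0, size=n)
X = np.column_stack([np.ones(n), t])            # design matrix with intercept
B_true = np.array([[0.2, -0.1], [0.8, 0.5]])    # coefficients in ilr coordinates
Y = ilr_inv(X @ B_true + 0.1 * rng.standard_normal((n, 2)))

# Ordinary least squares in coordinate space, then back-transformation.
B_hat, *_ = np.linalg.lstsq(X, ilr(Y), rcond=None)
Y_pred = ilr_inv(X @ B_hat)                     # fitted compositions
```

The fitted compositions in Y_pred are positive and sum to one by construction, which is the property that a linear regression fitted directly to the raw shares does not guarantee.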
