Data imputation for compositional data sets

19 November 2020, 11:00–12:15

Toulouse

Zoom room

MAD-Stat. Seminar

Abstract

However carefully a study is designed and however closely formal data collection protocols are followed, real-world data sets typically contain empty or invalid entries. The practical problem is how to deal with them in a sound way for the purposes of data analysis and modelling. The statistical literature on this topic is broad and has a long history. Data imputation, a popular approach for multivariate data sets, replaces the unobserved values with plausible ones. Applied at the data pre-processing stage, imputation produces a completed data set that facilitates subsequent statistical analysis. We will discuss the particularities of the problem and imputation methods for the class of compositional data sets, i.e. multivariate data representing parts of a whole carrying relative information, commonly expressed in percentage units. Examples include chemical and nutritional compositions, time or land use allocations, and electoral vote shares. In this context, we will review desirable properties and focus on methods and software for the exploration, statistical testing, and treatment of zeros and missing data in compositional data sets.
