Seminar

On the use of optimal transportation theory to recode variables and application to database merging

Valérie Gares

April 18, 2019, 11:00–12:15

Toulouse

Room MS003

MAD-Stat. Seminar

Abstract

When databases are constructed from heterogeneous sources, it is not unusual for different encodings to be used for the same outcome. This work considers the problem of finding a relevant way to recode a categorical variable before merging two databases. The method is an application of optimal transportation in which we search for a bijective mapping between the distributions of this variable in the two databases. Using the fact that common covariates appear in both databases, the objective is to minimize the expectation of a cost function reflecting a distance measure in the space of the covariates. The first form of the algorithm requires the assumption that the covariates follow the same distribution in the two databases [1]. We propose several models that offer a novel approach to the problem and relax this hypothesis. These models are compared in a simulation study under different scenarios and are applied to a real dataset.

[1] Dimeglio C.*, Garès V.*, Kosorok M. R., Guernec G., Fantin R., Lepage B. and Savy N. On the use of optimal transportation theory to merge databases. Application to clinical trials. Under revision.
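
To make the recoding idea in the abstract concrete, here is a minimal toy sketch in Python. It is not the authors' algorithm. It replaces the optimal transportation step with a brute-force search over one-to-one assignments between the levels of the two encodings, using the Euclidean distance between the mean covariate vectors of each level as the cost. The function names (`level_centroids`, `best_recoding`), the covariates (age, BMI) and the data are all hypothetical.

```python
from itertools import permutations
from math import dist
from statistics import fmean


def level_centroids(rows):
    """Map each category level to the mean covariate vector of its rows.

    rows: iterable of (level, covariates) pairs, covariates being a tuple of floats.
    """
    grouped = {}
    for level, covariates in rows:
        grouped.setdefault(level, []).append(covariates)
    return {
        level: tuple(fmean(column) for column in zip(*vectors))
        for level, vectors in grouped.items()
    }


def best_recoding(rows_a, rows_b):
    """Return the one-to-one level mapping A -> B with the smallest total cost.

    Assumption: the cost of pairing two levels is the Euclidean distance
    between their covariate centroids (a stand-in for the cost function
    mentioned in the abstract, not the authors' choice).
    """
    centroids_a = level_centroids(rows_a)
    centroids_b = level_centroids(rows_b)
    levels_a = sorted(centroids_a)
    levels_b = sorted(centroids_b)
    if len(levels_a) != len(levels_b):
        raise ValueError("a bijection needs the same number of levels")

    best_cost, best_map = float("inf"), None
    # Brute force: only practical for a handful of levels.
    for perm in permutations(levels_b):
        cost = sum(
            dist(centroids_a[a], centroids_b[b]) for a, b in zip(levels_a, perm)
        )
        if cost < best_cost:
            best_cost, best_map = cost, dict(zip(levels_a, perm))
    return best_map, best_cost


# Hypothetical toy data: (level, (age, bmi)) in each database.
database_a = [
    ("low", (30.0, 21.0)), ("low", (34.0, 22.0)),
    ("high", (61.0, 29.0)), ("high", (65.0, 31.0)),
]
database_b = [
    ("grade2", (63.0, 30.0)), ("grade1", (32.0, 21.5)),
    ("grade1", (31.0, 22.5)), ("grade2", (60.0, 28.0)),
]

mapping, cost = best_recoding(database_a, database_b)
print(mapping)  # {'high': 'grade2', 'low': 'grade1'}
```

On this toy data, the levels whose covariate profiles are closest are paired together. The method presented in the talk works on the distributions of the variable rather than on level centroids, so this sketch only conveys the intuition.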