Working paper

Parsimonious Wasserstein Text-mining

Sébastien Gadat and Stéphane Villeneuve


This document introduces a novel, parsimonious method for processing textual data, based on non-negative matrix factorization (NMF) and on supervised clustering with Wasserstein barycenters to reduce the dimension of the model. This dual treatment allows a text to be represented as a probability distribution on the space of profiles, which accounts for both uncertainty and semantic interpretability through the Wasserstein distance. The full textual information of a given period is represented as a random probability measure. This opens the door to a statistical inference method that seeks to predict financial data using the information generated by the texts of a given period.


Natural Language Processing; Textual Analysis; Wasserstein distance; clustering


Sébastien Gadat and Stéphane Villeneuve, Parsimonious Wasserstein Text-mining, TSE Working Paper, n. 23-1471, September 2023.

Published in

TSE Working Paper, n. 23-1471, September 2023