This document introduces a parsimonious novel method of processing textual data based on the NMF factorization and on supervised clustering withWasserstein barycenter’s to reduce the dimension of the model. This dual treatment of textual data allows for a representation of a text as a probability distribution on the space of profiles which accounts for both uncertainty and semantic interpretability with the Wasserstein distance. The full textual information of a given period is represented as a random probability measure. This opens the door to a statistical inference method that seeks to predict a financial data using the information generated by the texts of a given period.
Natural Language Processing; Textual Analysis; Wasserstein distance; clustering;
TSE Working Paper, n° 23-1471, septembre 2023