Learning Company Embeddings from Annual Reports for Fine-grained Industry Characterization

Tomoki Ito, Jose Camacho-Collados, Hiroki Sakaji, Steven Schockaert

July 2020

Abstract

Organizing companies by industry segment (e.g. artificial intelligence, healthcare or fintech) is useful for analyzing stock market performance and for designing theme-based investment funds, among other applications. Current practice is to manually assign companies to sectors or industries from a small predefined list, which has two key limitations. First, due to the manual effort involved, this strategy is only feasible for relatively mainstream industry segments, and thus cannot easily be used for niche or emerging topics. Second, the use of hard label assignments ignores the fact that different companies will be more or less exposed to a particular segment. To address these limitations, we propose to learn vector representations of companies based on their annual reports. The key challenge is to distill from these reports the information that is relevant for characterizing a company's industry, since annual reports also contain a lot of information that is not relevant for our purpose. To this end, we introduce a multi-task learning strategy, which is based on fine-tuning the BERT language model on (i) existing sector labels and (ii) stock market performance. Experiments in both English and Japanese demonstrate the usefulness of this strategy.

Publication

Proceedings of the Second Workshop on Financial Technology and Natural Language Processing

Jose Camacho-Collados

Professor & UKRI Future Leaders Fellow

Steven Schockaert

Professor