Simple and Effective Multimodal Learning Based on Pre-Trained Transformer Models

Transformer-based models have garnered attention because of their success in natural language processing and in several other fields, such as image recognition and automatic speech recognition. In addition to models trained on unimodal information, many transformer-based models have been proposed for multimodal information. A common problem encountered in multimodal learning is the insufficiency of multimodal training data. To address this problem, this study proposes a simple and effective method that uses 1) unimodal pre-trained transformer models as encoders for each modal input and 2) a set of transformer layers to fuse their output representations. The proposed method is evaluated through several experiments on two common benchmarks: the CMU Multimodal Opinion Sentiment Intensity (CMU-MOSI) dataset and the Multimodal Internet Movie Database (MM-IMDb).
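To make the approach concrete, below is a minimal PyTorch sketch of the idea: pre-trained unimodal encoders produce token-level representations for each modality, and a small stack of transformer layers fuses the concatenated sequences. The choice of BERT as the text encoder, the raw audio/visual feature sizes, the pooling, and the classification head are illustrative assumptions, not the paper's exact configuration.

```python
# Minimal sketch of "unimodal pre-trained encoders + transformer fusion layers".
# Assumptions (not from the paper): BERT-base as the text encoder, pre-extracted
# 74-dim audio and 35-dim visual features, mean pooling, and a linear head.
import torch
import torch.nn as nn
from transformers import BertModel

class FusionTransformer(nn.Module):
    def __init__(self, num_classes: int, d_model: int = 768,
                 n_fusion_layers: int = 2, n_heads: int = 8):
        super().__init__()
        # 1) Unimodal pre-trained transformer encoder for the text modality.
        self.text_encoder = BertModel.from_pretrained("bert-base-uncased")
        # Linear projections map pre-extracted audio/visual features into d_model.
        self.audio_proj = nn.Linear(74, d_model)
        self.visual_proj = nn.Linear(35, d_model)
        # 2) A small stack of transformer layers fuses the modalities.
        fusion_layer = nn.TransformerEncoderLayer(
            d_model=d_model, nhead=n_heads, batch_first=True)
        self.fusion = nn.TransformerEncoder(fusion_layer, num_layers=n_fusion_layers)
        self.head = nn.Linear(d_model, num_classes)

    def forward(self, input_ids, attention_mask, audio_feats, visual_feats):
        # Token-level text representations: (batch, T_text, d_model)
        text_repr = self.text_encoder(
            input_ids=input_ids, attention_mask=attention_mask).last_hidden_state
        audio_repr = self.audio_proj(audio_feats)     # (batch, T_audio, d_model)
        visual_repr = self.visual_proj(visual_feats)  # (batch, T_vis, d_model)
        # Concatenate along the sequence axis and fuse with transformer layers.
        fused = self.fusion(torch.cat([text_repr, audio_repr, visual_repr], dim=1))
        # Mean-pool the fused sequence and classify.
        return self.head(fused.mean(dim=1))
```

Because the heavy lifting is done by encoders that were already pre-trained on large unimodal corpora, only the projections and the small fusion stack must be learned from multimodal data, which is what makes this kind of design attractive when multimodal training examples are scarce.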

The proposed model achieves state-of-the-art performance on both benchmarks and is robust to reductions in the amount of training data.
