(46) VATT: Transformers for Multimodal Self-Supervised Learning from Raw Video, Audio and Text

Popularity: 87   Published: 2023-11-17 07:40:59.0

  • Abstract
  • 1. Introduction
  • 2. Related work
    • 2.1. Transformers in Vision
    • 2.2. Self-Supervised Learning
  • 3. Approach
    • 3.1. Tokenization and Positional Encoding
      • 3.1.1. DropToken
    • 3.2. The Transformer Architecture
    • 3.3. Common Space Projection
    • 3.4. Multimodal Contrastive Learning