List: transformer (4)
정화 코딩
ImageBind: One Embedding Space To Bind Them All
https://arxiv.org/abs/2305.05665
We present ImageBind, an approach to learn a joint embedding across six different modalities - images, text, audio, depth, thermal, and IMU data. We show that all combinations of paired data are not necessary to train such a joint embedding, and only image.. (arxiv.org)
1. Introduction — Idea: the binding capability of images -> various sensors and ..
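The preview above describes ImageBind's core idea: align each modality's encoder to a shared embedding space using only image-paired data. As a minimal sketch of that kind of contrastive alignment (not ImageBind's actual architecture), the toy linear encoders `W_img` and `W_aud` below are hypothetical stand-ins, and the loss is a standard InfoNCE-style objective over a batch of paired features:

```python
import numpy as np

rng = np.random.default_rng(0)

def embed(x, W):
    """Project raw features into the shared space and L2-normalize."""
    z = x @ W
    return z / np.linalg.norm(z, axis=1, keepdims=True)

# Hypothetical toy encoders: one linear projection per modality.
d_img, d_aud, d_joint = 8, 6, 4
W_img = rng.normal(size=(d_img, d_joint))
W_aud = rng.normal(size=(d_aud, d_joint))

imgs = rng.normal(size=(5, d_img))  # batch of image features
auds = rng.normal(size=(5, d_aud))  # paired audio features

z_img = embed(imgs, W_img)
z_aud = embed(auds, W_aud)

# InfoNCE-style alignment: the paired (i, i) similarity should
# dominate the unpaired (i, j) similarities in each row.
tau = 0.07  # temperature
logits = z_img @ z_aud.T / tau
loss = -np.mean(np.log(np.exp(np.diag(logits)) / np.exp(logits).sum(axis=1)))
print(float(loss))
```

Training would minimize this loss for each (image, other-modality) pair; the paper's claim is that aligning every modality to images is enough to bind all modalities to each other transitively.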
MDETR -- Modulated Detection for End-to-End Multi-Modal Understanding
Paper: https://arxiv.org/abs/2104.12763
Multi-modal reasoning systems rely on a pre-trained object detector to extract regions of interest from the image. However, this crucial module is typically used as a black box, trained independently of the downstream task and on a fixed vocabulary of obje.. (arxiv.org)
GitHub (code): https://github.com..
Towards Multi-modal Transformers in Federated Learning
https://arxiv.org/abs/2404.12467
Multi-modal transformers mark significant progress in different domains, but siloed high-quality data hinders their further improvement. To remedy this, federated learning (FL) has emerged as a promising privacy-preserving paradigm for training models wit.. (arxiv.org)
0. Abstract — Multi-modal transformers: images and text..
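The preview above mentions federated learning as the privacy-preserving paradigm for training on siloed data. As a minimal sketch of the standard FedAvg aggregation step (a generic FL baseline, not this paper's specific method), the client data below is hypothetical:

```python
import numpy as np

def fedavg(client_weights, client_sizes):
    """Aggregate client parameters as a data-size-weighted average (FedAvg)."""
    total = sum(client_sizes)
    return [
        sum((n / total) * w[k] for w, n in zip(client_weights, client_sizes))
        for k in range(len(client_weights[0]))
    ]

# Hypothetical setup: 3 clients, each holding one weight matrix and one bias.
clients = [
    [np.full((2, 2), 1.0), np.zeros(2)],
    [np.full((2, 2), 3.0), np.ones(2)],
    [np.full((2, 2), 5.0), np.ones(2)],
]
sizes = [100, 100, 200]  # number of local training samples per client

avg = fedavg(clients, sizes)
print(avg[0][0, 0])  # 0.25*1 + 0.25*3 + 0.5*5 = 3.5
```

Each communication round, clients train locally on their private data, the server averages the resulting parameters as above, and the averaged model is broadcast back; raw images and text never leave the client.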
MDETR -- Modulated Detection for End-to-End Multi-Modal Understanding
https://arxiv.org/abs/2104.12763 (arxiv.org)
0. Abstract — Multi-modal reasoni..