List: AI (10)
정화 코딩
Meerkat: Audio-Visual Large Language Model for Grounding in Space and Time
https://www.ecva.net/papers/eccv_2024/papers_ECCV/papers/08071.pdf
https://arxiv.org/abs/2407.01851
Leveraging Large Language Models' remarkable proficiency in text-based tasks, recent works on Multi-modal LLMs (MLLMs) extend them to other modalities like vision and audio. However, the progress in these directions has been ..
Text ↔ Image Datasets
Flickr30k Entities
- Extends the original Flickr30k (images + sentence captions) with bounding-box annotations for each noun phrase
- Images + 5 captions per image + noun phrase ↔ bounding box links within each caption ⇒ usable without extra preprocessing
- 31,783 images, 8.7 objects per image on average, 276K boxes in total
- https://arxiv.org/abs/1505.04870
- https://github.com/BryanPlummer/flickr30k_entities
- https://bryanplummer.com/Flickr30kEntities/
Visual Genome (VG)
- Flickr-based images + for each image ..
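To make the annotation structure concrete, here is a minimal parsing sketch, assuming the bracketed [/EN#<chain-id>/<type> <phrase>] markup used in the dataset's Sentences files; the example line and helper names below are hypothetical, not part of the official toolkit.

```python
import re

# Hypothetical example line in the Flickr30k Entities Sentences/*.txt style,
# where each noun phrase is wrapped as [/EN#<chain-id>/<type> <phrase>].
line = ("[/EN#38/people A man] in [/EN#39/clothing a blue shirt] "
        "is standing on [/EN#40/scene a ladder] .")

# Capture chain id, entity type(s), and the phrase text.
PHRASE = re.compile(r"\[/EN#(\d+)/(\S+) ([^\]]+)\]")

def parse_caption(raw: str):
    """Return the plain caption and its annotated noun phrases."""
    phrases = [
        {"chain_id": int(m.group(1)),
         "types": m.group(2).split("/"),   # a phrase may carry several types
         "phrase": m.group(3)}
        for m in PHRASE.finditer(raw)
    ]
    plain = PHRASE.sub(lambda m: m.group(3), raw)  # strip annotation markup
    return plain, phrases

caption, phrases = parse_caption(line)
print(caption)  # "A man in a blue shirt is standing on a ladder ."
print(phrases)  # chain ids link each phrase to boxes in the Annotations files
```

The chain id is what ties a phrase to its bounding boxes, so the same id appearing in several captions refers to the same object in the image.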
ImageBind: One Embedding Space To Bind Them All
https://arxiv.org/abs/2305.05665
We present ImageBind, an approach to learn a joint embedding across six different modalities - images, text, audio, depth, thermal, and IMU data. We show that all combinations of paired data are not necessary to train such a joint embedding, and only image..
1. Introduction
Idea: exploit the binding ability of images -> various sensors and ..
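The core idea in the preview (binding other modalities to images with a contrastive objective) can be illustrated with a short sketch. Below is a minimal NumPy version of a symmetric InfoNCE loss over paired image/audio embeddings; the random placeholder embeddings and the temperature of 0.07 are assumptions for illustration, not ImageBind's actual encoders or hyperparameters.

```python
import numpy as np

rng = np.random.default_rng(0)

def l2_normalize(x):
    return x / np.linalg.norm(x, axis=-1, keepdims=True)

def info_nce(anchor, other, temperature=0.07):
    """Symmetric InfoNCE loss over a batch of paired embeddings."""
    a, b = l2_normalize(anchor), l2_normalize(other)
    logits = a @ b.T / temperature                 # (B, B) cosine similarities
    # Log-softmax along rows; matched pairs sit on the diagonal.
    log_p = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    log_p_t = logits.T - np.log(np.exp(logits.T).sum(axis=1, keepdims=True))
    return (-np.diag(log_p).mean() - np.diag(log_p_t).mean()) / 2

# Placeholder embeddings standing in for encoder outputs (B=8, d=512).
img = rng.normal(size=(8, 512))   # image encoder output
aud = rng.normal(size=(8, 512))   # audio encoder output for the paired clips
print(info_nce(img, aud))
```

Because every modality is trained against images this way, modality pairs that never co-occur in training data (e.g., audio and depth) still land in a shared space through the image anchor.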
AudioCLIP: Extending CLIP to Image, Text and Audio
https://arxiv.org/abs/2106.13043
In the past, the rapidly evolving field of sound classification greatly benefited from the application of methods from other domains. Today, we observe the trend to fuse domain-specific tasks and approaches together, which provides the community with new o..
1. Introduction
- Progress in the field of audio classification. But until now, only audio ..
MDETR -- Modulated Detection for End-to-End Multi-Modal Understanding
Paper: https://arxiv.org/abs/2104.12763
Multi-modal reasoning systems rely on a pre-trained object detector to extract regions of interest from the image. However, this crucial module is typically used as a black box, trained independently of the downstream task and on a fixed vocabulary of obje..
GitHub (code): https://github.com..
https://dl.acm.org/doi/10.1145/3534678.3539384
1. Introduction
Federated Learning (FL)
- A distributed learning framework in which multiple clients train a model together without sharing their data with one another
- Significance: lower communication cost and privacy protection
Multimodal Federated Learning (MFL)
- Background: advances in sensor technology and the growing variety of data -> FL was extended into MFL
- Multiple clients, each collecting data through a different combination of sensors (modalities), train a model together without sharing the data
Limitations of existing FL/MFL research
- Most focus on statistical heterogeneity, i.e., each client's data dist..
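As a concrete reference for the FL setup described above, here is a minimal sketch of the classic FedAvg loop (local SGD on each client's private data, then a size-weighted average at the server); the linear-regression clients and hyperparameters are placeholder assumptions, not the paper's method.

```python
import numpy as np

rng = np.random.default_rng(42)

def client_update(global_w, X, y, lr=0.1, epochs=5):
    """Local SGD on one client's private data (linear regression as a stand-in)."""
    w = global_w.copy()
    for _ in range(epochs):
        grad = 2 * X.T @ (X @ w - y) / len(y)  # MSE gradient
        w -= lr * grad
    return w

def fedavg(client_weights, client_sizes):
    """Server step: average client models, weighted by local dataset size."""
    total = sum(client_sizes)
    return sum(n / total * w for w, n in zip(client_weights, client_sizes))

# Three clients with private datasets of different sizes (never sent to the server).
datasets = [(rng.normal(size=(n, 3)), rng.normal(size=n)) for n in (50, 120, 80)]
global_w = np.zeros(3)

for _ in range(10):  # communication rounds: only model weights travel
    local = [client_update(global_w, X, y) for X, y in datasets]
    global_w = fedavg(local, [len(y) for _, y in datasets])

print(global_w)  # jointly trained model; raw data stayed on each client
```

MFL keeps this round structure but must also handle clients whose local models consume different modality combinations, which is where the heterogeneity issues discussed above come in.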
Towards Multi-modal Transformers in Federated Learning
https://arxiv.org/abs/2404.12467
Multi-modal transformers mark significant progress in different domains, but siloed high-quality data hinders their further improvement. To remedy this, federated learning (FL) has emerged as a promising privacy-preserving paradigm for training models wit..
0. Abstract
Multi-modal transformers: images and text..
MDETR -- Modulated Detection for End-to-End Multi-Modal Understanding
https://arxiv.org/abs/2104.12763
Multi-modal reasoning systems rely on a pre-trained object detector to extract regions of interest from the image. However, this crucial module is typically used as a black box, trained independently of the downstream task and on a fixed vocabulary of obje..
0. Abstract
Multi-modal reasoni..