X-LLaVA: Optimizing Bilingual Large Vision-Language Alignment

  • Dongjae Shin
  • Hyeonseok Lim
  • Inho Won
  • Changsu Choi
  • Minjun Kim
  • Seungwoo Song
  • Hangyeol Yoo
  • Sangmin Kim
  • Kyungtae Lim

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution › peer-review

5 Scopus citations

Abstract

The impressive development of large language models (LLMs) is expanding into the realm of large multimodal models (LMMs), which incorporate types of data beyond text. However, the nature of multimodal models makes the creation of training data significantly expensive. Furthermore, constructing multilingual data for LMMs presents its own set of challenges due to language diversity and complexity. Therefore, in this study, we propose two cost-effective methods to solve this problem: (1) vocabulary expansion and pretraining of a multilingual LLM for specific languages, and (2) automatic and elaborate construction of multimodal datasets using GPT-4V. Based on these methods, we constructed a 91K English-Korean-Chinese multilingual, multimodal training dataset. Additionally, we developed a bilingual multimodal model that exhibits excellent performance in both Korean and English, surpassing existing approaches.
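Method (1), vocabulary expansion, generally means adding target-language tokens to an existing tokenizer and growing the embedding matrix to match. The toy sketch below illustrates the idea only; the function name, data, and mean-initialization scheme are assumptions for illustration, not the paper's actual implementation (which operates on a real LLM tokenizer and embedding table).

```python
# Toy sketch of vocabulary expansion (illustrative only, not the paper's code).
# New tokens are appended to the vocab and their embedding rows are
# initialized to the mean of the existing rows -- one common heuristic.

def expand_vocab(vocab, embeddings, new_tokens):
    """Add new tokens to `vocab` and append mean-initialized embedding rows."""
    # Column-wise mean over the existing embedding rows.
    mean_row = [sum(col) / len(embeddings) for col in zip(*embeddings)]
    for tok in new_tokens:
        if tok not in vocab:
            vocab[tok] = len(vocab)          # next free token id
            embeddings.append(list(mean_row))
    return vocab, embeddings

# Tiny example: expand a 2-token English vocab with two Korean tokens.
vocab = {"hello": 0, "world": 1}
emb = [[0.1, 0.1, 0.1, 0.1], [0.3, 0.3, 0.3, 0.3]]
vocab, emb = expand_vocab(vocab, emb, ["안녕", "세계"])
```

In practice this step corresponds to calls like `tokenizer.add_tokens(...)` followed by `model.resize_token_embeddings(len(tokenizer))` in the Hugging Face `transformers` library, after which the expanded model is further pretrained on target-language text.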

Original language: English
Title of host publication: Findings of the Association for Computational Linguistics
Subtitle of host publication: NAACL 2024 - Findings
Editors: Kevin Duh, Helena Gomez, Steven Bethard
Publisher: Association for Computational Linguistics (ACL)
Pages: 2463-2473
Number of pages: 11
ISBN (Electronic): 9798891761193
State: Published - 2024
Event: 2024 Findings of the Association for Computational Linguistics: NAACL 2024 - Hybrid, Mexico City, Mexico
Duration: 16 Jun 2024 - 21 Jun 2024

Publication series

Name: Findings of the Association for Computational Linguistics: NAACL 2024 - Findings

Conference

Conference: 2024 Findings of the Association for Computational Linguistics: NAACL 2024
Country/Territory: Mexico
City: Hybrid, Mexico City
Period: 16/06/24 - 21/06/24

