TY - GEN
T1 - X-LLaVA: Optimizing Bilingual Large Vision-Language Alignment
T2 - Findings of the Association for Computational Linguistics: NAACL 2024
AU - Shin, Dongjae
AU - Lim, Hyeonseok
AU - Won, Inho
AU - Choi, Changsu
AU - Kim, Minjun
AU - Song, Seungwoo
AU - Yoo, Hangyeol
AU - Kim, Sangmin
AU - Lim, Kyungtae
N1 - Publisher Copyright:
© 2024 Association for Computational Linguistics.
PY - 2024
Y1 - 2024
AB - The impressive development of large language models (LLMs) is expanding into the realm of large multimodal models (LMMs), which incorporate multiple types of data beyond text. However, the nature of multimodal models leads to significant expenses in the creation of training data. Furthermore, constructing multilingual data for LMMs presents its own set of challenges due to language diversity and complexity. Therefore, in this study, we propose two cost-effective methods to solve this problem: (1) vocabulary expansion and pretraining of multilingual LLM for specific languages, and (2) automatic and elaborate construction of multimodal datasets using GPT4-V. Based on these methods, we constructed a 91K English-Korean-Chinese multilingual, multimodal training dataset. Additionally, we developed a bilingual multimodal model that exhibits excellent performance in both Korean and English, surpassing existing approaches.
UR - https://www.scopus.com/pages/publications/85197887014
U2 - 10.18653/v1/2024.findings-naacl.158
DO - 10.18653/v1/2024.findings-naacl.158
M3 - Conference contribution
AN - SCOPUS:85197887014
T3 - Findings of the Association for Computational Linguistics: NAACL 2024
SP - 2463
EP - 2473
BT - Findings of the Association for Computational Linguistics: NAACL 2024
A2 - Duh, Kevin
A2 - Gomez, Helena
A2 - Bethard, Steven
PB - Association for Computational Linguistics (ACL)
Y2 - 16 June 2024 through 21 June 2024
ER -