Tag2Text: Guiding Vision-Language Model via Image Tagging

Xinyu Huang^1,2, Youcai Zhang², Jinyu Ma², Weiwei Tian¹, Rui Feng¹,
Yuejie Zhang¹, Yaqian Li², Yandong Guo², Lei Zhang³

¹Fudan University, ²OPPO Research Institute, ³International Digital Economy Academy (IDEA)

Paper Official Code Tag2Text Demo Tag2Text & Grounded-SAM

Tag2Text marry Grounded-SAM, which can automatically recognize, detect, and segment for an image! Tag2Text showcases powerful image recognition capabilities (right figure).

Highlight

Tag2Text is an efficient and controllable vision-language model with tagging guidance.

Tagging. Without manual annotations, Tag2Text achieves superior image tag recognition ability of 3,429 commonly human-used categories.

Efficient. Tagging guidance effectively enhances the performance of vision-language models on both generation-based and alignment-based tasks.

Controllable. Tag2Text permits users to input desired tags, providing the flexibility in composing corresponding texts based on the input tags.

Visualization Results on Image Captioning

Tag2Text integrates recognized image tags into text generation as guiding elements (highlighted in green underline), resulting in the generation with more comprehensive text descriptions. Moreover, Tag2Text permits users to input desired tags, providing the flexibility in composing corresponding texts based on the input tags.

Visualization Results on Image-Text Retrieval

Tag2Text provides tags as additional visible alignment indicators (highlighted in green underline).

BibTeX

@article{huang2023tag2text,
  title={Tag2Text: Guiding Vision-Language Model via Image Tagging},
  author={Huang, Xinyu and Zhang, Youcai and Ma, Jinyu and Tian, Weiwei and Feng, Rui and Zhang, Yuejie and Li, Yaqian and Guo, Yandong and Zhang, Lei},
  journal={arXiv preprint arXiv:2303.05657},
  year={2023}
}