Tag2Text: Guiding Vision-Language Model via Image Tagging

1Fudan University, 2OPPO Research Institute, 3International Digital Economy Academy (IDEA)

Tag2Text has been combined with Grounded-SAM, which can automatically recognize, detect, and segment objects in an image! Tag2Text showcases powerful image recognition capabilities (right figure).

Highlights

Tag2Text is an efficient and controllable vision-language model with tagging guidance.

  • Tagging. Without manual annotations, Tag2Text achieves superior image tag recognition across 3,429 commonly used categories.
  • Efficient. Tagging guidance effectively enhances the performance of vision-language models on both generation-based and alignment-based tasks.
  • Controllable. Tag2Text permits users to input desired tags, providing flexibility in composing corresponding texts based on the input tags.
    Visualization Results on Image Captioning

    Tag2Text integrates recognized image tags into text generation as guiding elements (highlighted in green underline), resulting in more comprehensive text descriptions. Moreover, users can input desired tags to compose corresponding texts based on those tags.
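    The tag-guidance idea above can be sketched as a tiny helper that turns recognized (or user-supplied) tags into a single guidance string for the text decoder. This is an illustrative sketch only; the function name, the `" | "` separator, and the override behavior are assumptions, not the actual Tag2Text interface.

    ```python
    def compose_tag_guidance(recognized_tags, user_tags=None):
        """Build a tag-guidance string for caption generation.

        Hypothetical helper: recognized tags guide generation by default,
        and user-supplied tags override them to control the output caption.
        """
        tags = user_tags if user_tags else recognized_tags
        # Join tags into one guidance string, e.g. "dog | frisbee | grass".
        return " | ".join(tags)
    ```

    For example, `compose_tag_guidance(["dog", "grass"], user_tags=["frisbee"])` yields guidance built only from the user's tag, mirroring the controllability described above.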

    Visualization Results on Image-Text Retrieval

    Tag2Text provides tags as additional visible alignment indicators (highlighted in green underline).
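    As a toy illustration of tags serving as visible alignment indicators, one could re-rank candidate texts for an image by how many recognized tags each text mentions. This sketch is not the model's actual alignment mechanism (which is learned); the function names and the overlap heuristic are assumptions for illustration.

    ```python
    def tag_overlap(image_tags, text):
        """Count how many recognized image tags appear in a candidate text."""
        words = set(text.lower().split())
        return sum(1 for tag in image_tags if tag.lower() in words)

    def rerank_by_tags(image_tags, candidates):
        """Order candidate texts by tag overlap, highest first (stable sort)."""
        return sorted(candidates, key=lambda c: -tag_overlap(image_tags, c))
    ```

    Here the overlapping tags play the role of the green-underlined indicators in the figure: they make the image-text match inspectable rather than a single opaque similarity score.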

    BibTeX

    @article{huang2023tag2text,
      title={Tag2Text: Guiding Vision-Language Model via Image Tagging},
      author={Huang, Xinyu and Zhang, Youcai and Ma, Jinyu and Tian, Weiwei and Feng, Rui and Zhang, Yuejie and Li, Yaqian and Guo, Yandong and Zhang, Lei},
      journal={arXiv preprint arXiv:2303.05657},
      year={2023}
    }