Text-guided image manipulation has advanced notably in recent years. To mitigate linguistic ambiguity, few-shot learning with visual examples has been applied to instructions that are underrepresented in the training set or difficult to describe purely in language. However, learning from visual prompts requires strong reasoning capability, which diffusion models struggle with. To address this issue, we introduce a novel multi-modal autoregressive model, dubbed InstaManip, that can instantly learn a new image manipulation operation from textual and visual guidance via in-context learning, and apply it to new query images. Specifically, we propose an innovative group self-attention mechanism that breaks the in-context learning process into two separate stages, learning and applying, which simplifies the complex problem into two easier tasks. We also introduce a relation regularization method to further disentangle image transformation features from irrelevant content in the exemplar images. Extensive experiments suggest that our method surpasses previous few-shot image manipulation models by a notable margin (≥ 19% in human evaluation). We also find that our model can be further improved by increasing the number or diversity of exemplar images.
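To make the relation regularization more concrete, the snippet below is a minimal PyTorch sketch of one way such a constraint could be implemented, assuming it takes the form of a supervised contrastive objective over pooled manipulation representations. The function name relation_regularization, the operation labels op_ids, and the temperature tau are illustrative assumptions, not the paper's exact formulation.

import torch
import torch.nn.functional as F

def relation_regularization(z, op_ids, tau=0.1):
    # Hypothetical sketch (not the paper's definition).
    # z:      (B, D) pooled manipulation representations, one per exemplar pair
    # op_ids: (B,)   integer id of the editing operation each exemplar illustrates
    z = F.normalize(z, dim=-1)
    sim = z @ z.t() / tau                                # pairwise cosine similarities
    same_op = (op_ids[:, None] == op_ids[None, :]).float()
    eye = torch.eye(len(z), device=z.device)
    pos_mask = same_op - eye                             # positives: same operation, excluding self
    logits = sim - 1e9 * eye                             # mask out self-similarity
    log_prob = logits - torch.logsumexp(logits, dim=1, keepdim=True)
    denom = pos_mask.sum(dim=1).clamp(min=1)
    loss = -(pos_mask * log_prob).sum(dim=1) / denom     # InfoNCE-style average over positives
    return loss.mean()

In this sketch, exemplars illustrating the same operation are pulled together in the manipulation-representation space while others are pushed apart, which is one plausible way to suppress image-specific content in favor of the transformation itself.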
Overview of the proposed InstaManip architecture (left) and group self-attention mechanism (right, represented by the query-key matrix). We first tokenize all input texts and images and insert them into a prompt template together with learnable manipulation and generation tokens. The prompt is then fed into the proposed model, which is composed of N blocks. The group self-attention layer in each block learns an explicit manipulation representation Z and applies it to the new query image. The final generation tokens and the query image are forwarded to the image decoder for image synthesis. For brevity, the left part only shows the self-attention correlations connected with the manipulation or generation tokens. Encoders, input projection layers and skip connections are omitted for simplicity.
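As an illustration of the group self-attention layer described above, the following PyTorch sketch builds a two-group attention mask, assuming the prompt is laid out as [instruction text, exemplar images, manipulation tokens, query image, generation tokens]. The segment names and the grouping rule are inferred from the figure and are assumptions rather than the released implementation: tokens in the learning group distill the transformation into the manipulation tokens (yielding Z), while tokens in the applying group condition generation on Z and the query image without attending to the raw exemplars.

import torch

def group_attention_mask(n_text, n_exemplar, n_manip, n_query, n_gen):
    # Segment lengths in assumed prompt order.
    segs = [("text", n_text), ("exemplar", n_exemplar), ("manip", n_manip),
            ("query", n_query), ("gen", n_gen)]
    total = sum(n for _, n in segs)
    idx, offset = {}, 0
    for name, n in segs:
        idx[name] = torch.arange(offset, offset + n)
        offset += n

    # Learning group: distill the transformation into the manipulation tokens.
    learning = torch.cat([idx["text"], idx["exemplar"], idx["manip"]])
    # Applying group: generate conditioned on Z and the query, not the raw exemplars.
    applying = torch.cat([idx["manip"], idx["query"], idx["gen"]])

    mask = torch.zeros(total, total, dtype=torch.bool)   # True = attention allowed
    mask[learning[:, None], learning[None, :]] = True
    mask[applying[:, None], applying[None, :]] = True
    return mask

Because the manipulation tokens belong to both groups in this sketch, they act as the only channel through which information from the exemplars reaches the generation tokens, mirroring the learning-then-applying decomposition.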
Qualitative evaluation of the contribution of (a) each component, and (b) each modality in the context.
Visualization of manipulating the query image with the same textual instruction but different visual examples. When the exemplar target images show Lamborghinis in different colors, our model successfully captures this local feature from the visual guidance and changes the color of the generated images accordingly.
The performance of our model can be improved either by providing more exemplar images (in all three settings) or by increasing the diversity of visual prompts (green line vs. blue line).
@article{lai2024unleashing,
  title={Unleashing In-context Learning of Autoregressive Models for Few-shot Image Manipulation},
  author={Lai, Bolin and Juefei-Xu, Felix and Liu, Miao and Dai, Xiaoliang and Mehta, Nikhil and Zhu, Chenguang and Huang, Zeyi and Rehg, James M and Lee, Sangmin and Zhang, Ning and Xiao, Tong},
  journal={arXiv preprint arXiv:2412.01027},
  year={2024}
}