Unleashing In-context Learning of Autoregressive Models for Few-shot Image Manipulation

¹GenAI, Meta; ²Georgia Institute of Technology; ³University of Illinois Urbana-Champaign; ⁴Sungkyunkwan University; ⁵University of Wisconsin-Madison

When learning a new image manipulation operation that is unseen in the training set (as shown above), textual instructions directly identify the subject and provide high-level semantic guidance, while exemplar images mitigate linguistic ambiguity and show local details that are difficult to describe in language. Our proposed multi-modal autoregressive model, InstaManip, takes advantage of both textual and visual guidance to learn a representation of the desired transformation and applies it to a new query image.



Abstract

Text-guided image manipulation has advanced notably in recent years. To mitigate linguistic ambiguity, few-shot learning with visual examples has been applied to instructions that are underrepresented in the training set or difficult to describe purely in language. However, learning from visual prompts requires strong reasoning capability, with which diffusion models struggle. To address this issue, we introduce a novel multi-modal autoregressive model, dubbed InstaManip, that can instantly learn a new image manipulation operation from textual and visual guidance via in-context learning and apply it to new query images. Specifically, we propose a group self-attention mechanism that breaks the in-context learning process into two separate stages, learning and applying, which simplifies the complex problem into two easier tasks. We also introduce a relation regularization method to further disentangle image transformation features from irrelevant content in exemplar images. Extensive experiments suggest that our method surpasses previous few-shot image manipulation models by a notable margin (≥ 19% in human evaluation). We also find that our model can be further improved by increasing the number or diversity of exemplar images.

The Proposed Method
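
The two-stage split described in the abstract (learning, then applying) can be pictured as an attention mask over token groups. The snippet below is only a minimal sketch under stated assumptions, not the authors' implementation: it assumes the sequence is laid out as context tokens (instruction plus exemplar images), then a small set of hypothetical "manipulation" tokens that carry the learned transformation, then query and output tokens. It also assumes causal attention within each group. The function name, the token layout, and the group sizes are all illustrative.

```python
import numpy as np


def build_group_attention_mask(n_context, n_manip, n_query):
    """Return a boolean (T, T) mask; mask[i, j] is True if token i may attend to token j.

    Assumed layout (hypothetical, for illustration only):
      [context tokens | manipulation tokens | query/output tokens]
    Learning group: context + manipulation tokens.
    Applying group: manipulation + query/output tokens.
    """
    total = n_context + n_manip + n_query

    # Group membership per token; manipulation tokens belong to both groups.
    in_learn = np.zeros(total, dtype=bool)
    in_apply = np.zeros(total, dtype=bool)
    in_learn[: n_context + n_manip] = True
    in_apply[n_context:] = True

    # Two tokens may interact only if they share at least one group.
    same_group = (in_learn[:, None] & in_learn[None, :]) | (
        in_apply[:, None] & in_apply[None, :]
    )

    # Assumed causal ordering: a token attends only to itself and earlier tokens.
    causal = np.tril(np.ones((total, total), dtype=bool))
    return same_group & causal


# Example: 4 context tokens, 2 manipulation tokens, 3 query/output tokens.
mask = build_group_attention_mask(n_context=4, n_manip=2, n_query=3)
print(mask.astype(int))
```

Under this layout, query and output tokens never attend to the exemplar context directly; they see it only through the manipulation tokens, which is one way to force the transformation to be summarized before it is applied. In a softmax attention layer, a boolean mask like this is typically turned into an additive mask by setting disallowed positions to negative infinity before the softmax.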

Visualization

Ablation Study

Same Textual Instructions + Different Exemplar Images

Scaling Up with More Exemplar Images

Additional Visualization

BibTeX