V7 at NeurIPS: In-Context Learning in Computer Vision
Dec 11, 2023
Explore V7's innovative approach to in-context learning for computer vision, presented at NeurIPS R0-FoMo workshop.

Ioana Croitoru
Research Engineer
Discover V7's innovative approach to in-context learning in computer vision, leveraging Video Object Segmentation for the binary semantic segmentation task. Our method, set to be presented at the NeurIPS R0-FoMo workshop on December 15th, 2023, offers enhanced flexibility and exceptional performance on out-of-distribution datasets.
In computer vision, training segmentation models usually demands a lot of data and extensive training. Moreover, these models are typically specialized: if you want to switch to a different domain (say, from urban landscapes to medical imagery), you often face the daunting task of gathering a new, domain-specific dataset and training a new model from scratch.
But what if there was a smarter, more adaptable way?
Imagine a model that could switch between different types of images or domains just by looking at a few examples. This principle is the basis of the V7 AI team’s latest research. We've blended in-context learning (ICL) with video object segmentation (VOS) to create a model that's not only quick to learn but also flexible across various domains. Such adaptability is particularly beneficial for tasks like binary semantic segmentation, where the challenge is to split an image into two distinct parts: one highlighting the object of interest and the other comprising everything else. This method allows our model to efficiently adapt and accurately perform in various scenarios, from urban landscapes to medical imagery.
In this article:
- What Is In-Context Learning (ICL)?
- The Challenge With Current Methods
- Our Innovative Approach to In-Context Learning
- Challenges and Solutions 
- Benefit and Impact 
What Is In-Context Learning (ICL)?
In-Context Learning (ICL) is an approach originally used in language models (like GPT-3), where the model works out what it needs to do from a few examples supplied in its input rather than from additional training. It's like giving a model a mini-tutorial on what you expect from it.
Let’s see ICL in action for language:

You give the model a few specific examples to illustrate the task:
- Show the model a few example translations, and end the prompt with the phrase to translate, such as "English: Hello".
- The model then predicts the translation in the target language (such as "Italian: Ciao").
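The translation example can be written out as a plain-text prompt. The sketch below is only an illustration of the idea: the function name `build_few_shot_prompt`, the extra example pairs, and the exact prompt layout are assumptions, and no particular language model API is implied.

```python
def build_few_shot_prompt(examples, query, source_lang="English", target_lang="Italian"):
    """Format (source, target) example pairs, then an unanswered query."""
    lines = []
    for source_text, target_text in examples:
        lines.append(f"{source_lang}: {source_text}")
        lines.append(f"{target_lang}: {target_text}")
    # The last target line is left open for the model to complete.
    lines.append(f"{source_lang}: {query}")
    lines.append(f"{target_lang}:")
    return "\n".join(lines)


# Hypothetical example pairs; only "Hello" / "Ciao" comes from the article.
examples = [("Thank you", "Grazie"), ("Good morning", "Buongiorno")]
prompt = build_few_shot_prompt(examples, "Hello")
print(prompt)
```

A language model that receives this prompt is expected to continue the final line with "Ciao". The examples themselves tell it what the task is, so no retraining is involved.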
Next, let's see how in-context learning principles apply to vision.
- Support Set: In this step, we provide the model with a few specific images, much like showing it what to look for in new images. For instance, we might input images containing road cracks along with their corresponding segmentation masks. 
- Image Segmentation: Now the model takes new images of roads and uses its understanding from the support set to accurately segment these new images, identifying and isolating road cracks. 
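To make the support set and query roles concrete, here is a deliberately tiny stand-in, not the model described in this article. It assumes grayscale images stored as nested lists, and it replaces the learned model with a nearest-mean rule: it averages the foreground and background intensities seen in the support set, then labels each query pixel by the closer average. The names `fit_prototypes` and `segment` are hypothetical.

```python
def fit_prototypes(support_set):
    """Average the intensity of background (0) and foreground (1) pixels over the support set."""
    sums = {0: 0.0, 1: 0.0}
    counts = {0: 0, 1: 0}
    for image, mask in support_set:
        for image_row, mask_row in zip(image, mask):
            for pixel, label in zip(image_row, mask_row):
                sums[label] += pixel
                counts[label] += 1
    if counts[0] == 0 or counts[1] == 0:
        raise ValueError("support set must contain both foreground and background pixels")
    return {label: sums[label] / counts[label] for label in (0, 1)}


def segment(query_image, prototypes):
    """Label each query pixel 1 if it is closer to the foreground average, else 0."""
    return [
        [1 if abs(pixel - prototypes[1]) < abs(pixel - prototypes[0]) else 0 for pixel in row]
        for row in query_image
    ]


# Toy data: dark "cracks" (low values) on a bright "road" (high values).
support_set = [
    ([[200, 40, 210], [190, 35, 205]], [[0, 1, 0], [0, 1, 0]]),
]
query_image = [[195, 198, 50], [30, 201, 199]]

prototypes = fit_prototypes(support_set)
predicted_mask = segment(query_image, prototypes)
print(predicted_mask)
```

Swapping in a different support set (for example, bright cells on a dark background) changes the prediction without changing any code or parameters, which is the behaviour in-context learning aims for at a much larger scale.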

In short, in-context learning is efficient and adaptable:
- Rapid Adaptation: No need for long, resource-heavy training cycles. ICL adapts instantly with just a few examples. 
- Flexibility: It can switch between image domains effortlessly. 
The Challenge With Current Methods
Earlier uses of in-context learning in computer vision were quite limited and had some drawbacks. The traditional method packed the example images into a single large image with a pre-defined grid layout for the model to learn from. This causes two major issues. First, to fit the pre-defined grid, the images have to be downscaled, which loses detail. Second, the rigid structure of the grid makes it difficult to add more examples when needed. Essentially, the grid fixes the number of images that can be used in the support set, limiting the model's ability to adapt to a broader array of examples.
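A quick back-of-the-envelope calculation shows both effects. All numbers below are hypothetical (a 448×448 canvas, a 2×2 grid, a 1024×1024 source image), and the layout rule is an assumption: each example takes two cells (image and mask), and one pair of cells is reserved for the query and its predicted mask.

```python
def grid_budget(canvas_size, rows, cols):
    """Return the (height, width) of one grid cell and how many support pairs fit."""
    canvas_h, canvas_w = canvas_size
    cell_size = (canvas_h // rows, canvas_w // cols)
    # Assumed layout: two cells per example (image + mask), with one pair
    # of cells reserved for the query image and its predicted mask.
    max_support_pairs = (rows * cols) // 2 - 1
    return cell_size, max_support_pairs


def downscale_factor(image_size, cell_size):
    """Largest scale (capped at 1.0) at which the image fits in one cell, keeping its aspect ratio."""
    image_h, image_w = image_size
    cell_h, cell_w = cell_size
    return min(1.0, cell_h / image_h, cell_w / image_w)


cell_size, max_support_pairs = grid_budget(canvas_size=(448, 448), rows=2, cols=2)
scale = downscale_factor(image_size=(1024, 1024), cell_size=cell_size)
pixels_kept = scale ** 2

print(cell_size, max_support_pairs)
print(f"scale={scale:.3f}, pixels kept={pixels_kept:.1%}")
```

Under these assumptions the grid holds a single support pair, and each image is shrunk to roughly 22% of its side length, keeping under 5% of its pixels. Adding grid cells to fit more examples would shrink each image further.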
Our Innovative Approach to In-Context Learning
What's New?
At V7, we recognized the potential of using techniques from Video Object Segmentation (VOS) for in-context learning in images. VOS is particularly adept at understanding and tracking different objects over time in video sequences, akin to how someone might identify and follow a specific individual moving through a crowded market. This precision in object tracking from VOS is what inspired us to apply it to in-context learning. By doing so, we can incorporate a broader range of examples while preserving the integrity and quality of each image. This integration of VOS into our in-context learning method effectively overcomes the limitations of previous methods.

Overview of our method: A query image of zebras initiates the process by influencing the selection of a support set. This support set guides the In-Context Learning Model to produce an accurate segmentation mask, demonstrating the model's ability to adapt and identify similar objects.
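One way to picture the sequence-based framing, assuming the support pairs play the role of annotated video frames and the query plays the role of the next, unannotated frame, is the small data-structure sketch below. The `Frame` class and `build_sequence` function are hypothetical names for illustration and are not taken from the released code.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple

Image = List[List[int]]
Mask = List[List[int]]


@dataclass
class Frame:
    image: Image
    mask: Optional[Mask]  # None marks the query frame, whose mask is to be predicted


def build_sequence(support_set: List[Tuple[Image, Mask]], query_image: Image) -> List[Frame]:
    """Arrange support pairs as annotated frames, followed by the query frame."""
    frames = [Frame(image, mask) for image, mask in support_set]
    frames.append(Frame(query_image, None))
    return frames


# Toy data: frames keep their own sizes; nothing is resized into a shared canvas.
support_set = [
    ([[0, 9], [9, 0]], [[0, 1], [1, 0]]),
    ([[5, 5, 5]], [[1, 0, 1]]),
]
query_image = [[7, 8, 9], [1, 2, 3]]

sequence = build_sequence(support_set, query_image)
longer_sequence = build_sequence(support_set + [([[4]], [[1]])], query_image)
print(len(sequence), len(longer_sequence))
```

Unlike a fixed grid, the sequence simply grows by one frame for every additional support example, so the number of examples is not baked into the input layout.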
Challenges and Solutions
Our approach includes an intelligent support set selection mechanism, which functions like a smart assistant to help choose the most effective images to show the model. This mechanism is essential for in-context learning, as it ensures that the model is exposed to examples that are most representative and informative for the task at hand. By selecting the best examples, we enable the model to swiftly adapt to new images using just a few examples, enhancing its ability to accurately understand and respond to diverse visual scenarios.
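The article does not spell out how the selection works, so the following is only a minimal sketch of one common strategy: rank candidate images by the cosine similarity between their feature vectors and the query's, and keep the top k. The embeddings, image names, and the `select_support_set` function are all hypothetical.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def select_support_set(query_embedding, candidates, k):
    """Return the names of the k candidates most similar to the query."""
    ranked = sorted(
        candidates,
        key=lambda candidate: cosine_similarity(query_embedding, candidate[1]),
        reverse=True,
    )
    return [name for name, _ in ranked[:k]]


# Made-up 3-dimensional embeddings; a real system would use a feature extractor.
query_embedding = [1.0, 0.0, 0.2]
candidates = [
    ("city_street", [0.0, 1.0, 0.1]),
    ("zebra_herd", [0.9, 0.1, 0.3]),
    ("brain_scan", [0.1, 0.2, 1.0]),
    ("zebra_closeup", [1.0, 0.0, 0.0]),
]

selected = select_support_set(query_embedding, candidates, k=2)
print(selected)
```

With these toy vectors, the two zebra images are chosen for a zebra-like query, mirroring the overview figure, where the query image influences which support examples are shown to the model.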

Benefit and Impact
Our method offers several advantages over traditional in-context learning techniques. By treating the examples as a sequence rather than a fixed grid, our model gains a more comprehensive understanding of the visual context, leading to improved accuracy.
- Greater flexibility: Our method removes the constraint of having a fixed grid, providing flexibility in how many images can be included in the support set. 
- Great performance on unseen data: One of the standout advantages of our method is its ability to excel on out-of-distribution datasets. In simple terms, this means our model is remarkably good at understanding and interpreting images that are different from the ones it was trained on. 
- Real-world applications: The potential applications for this method are vast and impactful. It enables a single, versatile model to assist in a variety of fields without necessitating retraining. For instance, it can aid doctors in identifying specific cells in medical imaging, crucial for accurate diagnoses. In disaster relief efforts, it can be used to efficiently identify impacted zones, facilitating a quicker response time. 
Conclusion
At V7, our aim is to develop AI solutions that effectively meet the complexities and demands of the real world. Our latest work, set to be presented at the NeurIPS R0-FoMo workshop, makes two contributions: it introduces intelligent support set selection, and it leverages video object segmentation for in-context learning to improve performance on out-of-distribution datasets. This approach underlines our commitment to advancing AI adaptability and reliability, ensuring that our technology aligns with the diverse and ever-changing requirements of our world.
For those eager to explore this technology further, V7 offers resources to help you understand and implement our methods:
- GitHub Repository: Visit our GitHub page for detailed insights into the technical aspects of our approach. Here, you'll find code, documentation, and examples that will guide you through implementing and experimenting with our method in your projects.
- Hugging Face Demo: Experience our method in action through the Hugging Face demo. It allows you to witness firsthand how the model adapts to different image domains using in-context learning.
Ioana is a Research Engineer with a PhD in Computer Vision, where she explored a range of topics from object detection to text-video retrieval. Her work has been featured in numerous top-tier conferences and journals. She is particularly passionate about exploring a wide range of tasks, with a special focus on multimodal challenges and the application of self-supervised learning techniques.






