Microsoft drops Florence-2, a unified model to handle a variety of vision tasks
Today, Microsoft’s Azure AI team dropped a new vision foundation model called Florence-2 on Hugging Face.
Available under a permissive MIT license, the model can handle a variety of vision and vision-language tasks using a unified, prompt-based representation. It comes in two sizes, 232M and 771M parameters, and already excels at tasks such as captioning, object detection, visual grounding and segmentation, performing on par with or better than many large vision models.
While the real-world performance of the model is yet to be tested, the work is expected to give enterprises a single, unified approach to handling different types of vision applications. This could spare them from investing in separate task-specific vision models that cannot go beyond their primary function without extensive fine-tuning.
What makes Florence-2 unique?
Today, large language models (LLMs) sit at the heart of enterprise operations. A single model can provide summaries, write marketing copy and even handle customer service in many cases. This level of adaptability across domains and tasks has been remarkable. But this success has also left researchers wondering: Can vision models, which have been largely task-specific, do the same?
At their core, vision tasks are more complex than text-based natural language processing (NLP) because they demand comprehensive perceptual ability. To achieve a universal representation of diverse vision tasks, a model must understand spatial data across different scales, from broad image-level concepts such as object location down to fine-grained pixel details, as well as semantic information at different granularities, from high-level captions to detailed descriptions.
When Microsoft tried solving this, it found two key roadblocks: the scarcity of comprehensively annotated visual datasets, and the absence of a unified pretraining framework with a single network architecture that integrates the ability to understand both spatial hierarchy and semantic granularity.
To address this, the company first used specialized models to generate a visual dataset called FLD-5B. It included a total of 5.4 billion annotations for 126 million images, covering details from high-level descriptions to specific regions and objects. Then, using this data, it trained Florence-2, which uses a sequence-to-sequence architecture (a type of neural network designed for tasks involving sequential data) integrating an image encoder and a multi-modality encoder-decoder. This enables the model to handle various vision tasks, without requiring task-specific architectural modifications.
“All annotations in the dataset, FLD-5B, are uniformly standardized into textual outputs, facilitating a unified multi-task learning approach with consistent optimization with the same loss function as the objective,” the researchers wrote in the paper detailing the model. “The outcome is a versatile vision foundation model capable of performing a variety of tasks… all within a single model governed by a uniform set of parameters. Task activation is achieved through textual prompts, reflecting the approach used by large language models.”
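To make the idea of "annotations standardized into textual outputs" concrete, here is a minimal Python sketch of how a caption and a bounding-box annotation could both be expressed as plain (prompt, target) text pairs for one sequence-to-sequence model. It is an illustration under stated assumptions, not Microsoft's actual pipeline: the task prompts `<CAPTION>` and `<OD>`, the `<loc_N>` token format, the 1,000-bin coordinate quantization, and every helper name below are assumptions that do not come from the article.

```python
from dataclasses import dataclass
from typing import List, Tuple

NUM_BINS = 1000  # assumption: number of discrete bins per image axis


@dataclass
class Box:
    """A labeled bounding box in pixel coordinates (hypothetical helper type)."""
    label: str
    x1: float
    y1: float
    x2: float
    y2: float


def quantize(value: float, size: int, num_bins: int = NUM_BINS) -> int:
    """Map a pixel coordinate in [0, size] to a bin index in [0, num_bins - 1]."""
    bin_index = int(value * num_bins / size)
    return min(max(bin_index, 0), num_bins - 1)


def box_to_text(box: Box, width: int, height: int) -> str:
    """Serialize one box as its label followed by four location tokens."""
    coords = (
        quantize(box.x1, width),
        quantize(box.y1, height),
        quantize(box.x2, width),
        quantize(box.y2, height),
    )
    return box.label + "".join(f"<loc_{c}>" for c in coords)


def detection_target(boxes: List[Box], width: int, height: int) -> str:
    """Serialize all boxes of an image into a single target string."""
    return "".join(box_to_text(b, width, height) for b in boxes)


# Both tasks end up as (text prompt, text target) pairs, so one model with
# one loss function can be trained on either of them.
examples: List[Tuple[str, str]] = [
    ("<CAPTION>", "A dog chasing a ball on a lawn."),
    ("<OD>", detection_target([Box("dog", 64, 120, 320, 400)], width=640, height=480)),
]

for prompt, target in examples:
    print(prompt, "->", target)
# <CAPTION> -> A dog chasing a ball on a lawn.
# <OD> -> dog<loc_100><loc_250><loc_500><loc_833>
```

Because every target is just a token sequence, switching tasks in this sketch means changing only the prompt string, which mirrors the prompt-based task activation the researchers describe.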
Performance better than larger models
When prompted with images and text inputs, Florence-2 handles a variety of tasks, including object detection, captioning, visual grounding and visual question answering. More importantly, it delivers this with quality on par with or better than many larger models.
For instance, in a zero-shot captioning test on the COCO dataset, both the 232M and 771M versions of Florence-2 outperformed DeepMind’s 80B-parameter Flamingo visual language model, with CIDEr scores of 133 and 135.6, respectively. They even did better than Microsoft’s own visual grounding-specific Kosmos-2 model.
When fine-tuned with public human-annotated data, Florence-2, despite its compact size, was able to compete closely with several larger specialist models across tasks like visual question answering.
“The pre-trained Florence-2 backbone enhances performance on downstream tasks, e.g. COCO object detection and instance segmentation, and ADE20K semantic segmentation, surpassing both supervised and self-supervised models,” the researchers noted. “Compared to pre-trained models on ImageNet, ours improves training efficiency by 4X and achieves substantial improvements of 6.9, 5.5, and 5.9 points on COCO and ADE20K datasets.”
As of now, both pre-trained and fine-tuned versions of Florence-2 232M and 771M are available on Hugging Face under a permissive MIT license that allows unrestricted distribution and modification for commercial or private use.
It will be interesting to see how developers put it to use and whether it removes the need for separate vision models for different tasks. Small, task-agnostic models can not only spare developers from working with several different models but also cut compute costs by a significant margin.