Introduction
Expanding on the breakthrough efficiencies of Phi Silica, Microsoft’s state-of-the-art on-device small language model (SLM), this blog shows how we added vision-based multimodal capabilities to it. This additional dimension unlocks new possibilities for local SLMs on Windows, as illustrated by a couple of exciting new experiences for accessibility and productivity, which we also dive into in this blog. By introducing support for image understanding, Windows now comes with a built-in multimodal SLM offloaded to the NPU on your Copilot+ PC (available on Snapdragon-based Copilot+ devices and coming to Intel and AMD Copilot+ PCs), which powers several of our experiences and developer APIs. Effectively understanding images opens up numerous possibilities for reasoning over multimodal inputs, spanning images and text, to generate textual output.

After implementing Phi Silica with text capabilities, it was crucial that the addition of vision capability did not necessitate deploying a separate, dedicated vision SLM on-device. This consideration is especially important as we strive to be resource-conscious regarding disk space, memory utilization and compute resources with built-in OS language models. The solution we developed coexists with the existing Phi Silica and other models we deploy on-device, extensively reusing them while adding only a relatively small 80-million-parameter projector model as overhead. Integrating multiple models to introduce new capabilities by training smaller connectors is a method we anticipate will become increasingly prevalent in the client-AI space.

Instead of updating the base Phi Silica SLM weights, we feed adapted multimodal inputs to the existing Phi Silica embedding layer. The multimodal input adaptation is done using the vision encoder and the small projector model. For the vision encoder, we reuse the Florence image encoder, which is already deployed in Windows Recall (Preview) and the improved Windows search features. The small multimodal projector module, which translates vision embeddings into Phi-compatible embeddings, is trained afresh while the vision encoder remains frozen to avoid impacting existing use cases. Further, we ensure that the behavior of the newly introduced multimodal component is compatible with the existing quantized vision encoder and Phi Silica, with acceptable quality. This ensures the best user experience at minimal extra memory footprint.

Phi Silica-powered Image Description needs to coexist, and often run concurrently, with other SLM-based experiences like Click to Do (Preview) and the text rewrite and summarize capabilities, in addition to the user’s own AI workloads. Reusing all the existing components also helps save cost and time on training additional components, optimizes feature loading time and reduces overall memory footprint, providing the user an improved experience.
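To make the data flow described above concrete, here is a minimal, hypothetical sketch (PyTorch-style Python) of how adapted multimodal inputs could be assembled: a frozen vision encoder produces image features, a small trained projector maps them into the language model's embedding space, and the result is concatenated with the text embeddings before being passed to the unchanged SLM. The module names, shapes and the `generate_from_embeddings` helper are illustrative assumptions, not the actual Phi Silica implementation.

```python
import torch

def describe_image(image_pixels, prompt_token_ids,
                   vision_encoder, projector, slm):
    """Conceptual sketch: feed projected vision embeddings plus text
    embeddings to a frozen language model (names are assumptions)."""
    with torch.no_grad():
        # Frozen vision encoder -> per-patch visual features
        visual_features = vision_encoder(image_pixels)        # (1, 49, d_vision)

        # Small trained projector -> SLM-compatible embeddings
        visual_embeddings = projector(visual_features)        # (1, 49, d_slm)

        # Text prompt -> embeddings from the SLM's own embedding layer
        text_embeddings = slm.embed_tokens(prompt_token_ids)  # (1, T, d_slm)

        # Concatenate modalities and decode with the unchanged SLM weights
        inputs = torch.cat([visual_embeddings, text_embeddings], dim=1)
        return slm.generate_from_embeddings(inputs)
```

The key property of this arrangement is that only the projector carries new trained weights; the vision encoder and the SLM are reused exactly as they already ship on the device.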
How multimodal functionality for Phi Silica improves accessibility
Understanding the content on the screen, whether text or images, is an important step towards making computing more accessible. Today, many of Microsoft’s products, like Word and PowerPoint, automatically generate Alt Text for images, making it easier for screen readers like Microsoft Narrator to describe the content on the screen. The current Alt Text generation methods leverage cloud-based models to provide a short visual summary. The multimodal functionality for Phi Silica enhances the description of screen contents for people who are blind or have low vision and use screen readers. Phi Silica can generate descriptions with varying levels of detail, from shorter Alt Text to comprehensive descriptions, making the AI-generated Alt Text more useful to the person. On-device Phi Silica makes these high-quality descriptions possible, delivering a more accessible and performant experience.
Extracting the Vision Embedding
To extract the vision embeddings for the image, we use the Florence image encoder (Florence: A New Foundation Model for Computer Vision - Microsoft Research). The visual features from the input image are fed into a modality projector model. The projector is trained to produce embeddings aligned with the Phi Silica embedding space, which can be fed directly into Phi Silica alongside the Phi Silica embeddings of the accompanying text. Our modality projector has a simple architecture comprising two linear layers stacked on top of each other. In addition to being efficient at inference time, this design minimizes the number of new parameters added to our ecosystem of AI models, enabling multimodal functionality with just a few million new parameters, compared to the billions of additional parameters a separate vision language model would require. To maintain high inference efficiency, we train the system to operate with only one crop over the input image, unlike comparable models that require multiple crops. Thus, our training approach ensures that the system needs only 49 visual tokens from the vision encoder per image, keeping the end-to-end pipeline efficient.
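As a rough illustration of how small such a projector can be, the sketch below defines a two-layer linear projector in PyTorch that maps the 49 visual tokens from the vision encoder into the language model's embedding space. The feature dimensions and the GELU non-linearity between the two layers are assumptions made for illustration; the actual Phi Silica projector configuration is not specified here.

```python
import torch
import torch.nn as nn

class ModalityProjector(nn.Module):
    """Two stacked linear layers mapping vision-encoder features into
    the SLM embedding space (dimensions and activation are assumptions)."""

    def __init__(self, d_vision: int = 1024, d_hidden: int = 3072, d_slm: int = 3072):
        super().__init__()
        self.fc1 = nn.Linear(d_vision, d_hidden)
        self.act = nn.GELU()   # assumed non-linearity; not stated in the blog
        self.fc2 = nn.Linear(d_hidden, d_slm)

    def forward(self, visual_features: torch.Tensor) -> torch.Tensor:
        # visual_features: (batch, 49, d_vision) -> (batch, 49, d_slm)
        return self.fc2(self.act(self.fc1(visual_features)))

# Example: project one image's 49 visual tokens
projector = ModalityProjector()
tokens = torch.randn(1, 49, 1024)
print(projector(tokens).shape)  # torch.Size([1, 49, 3072])
```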



Evaluation
We evaluate the image descriptions generated by the multimodal functionality for Phi Silica against existing foundation models, such as the Florence model currently used for describing images. The evaluation is conducted using the LLM-as-a-judge technique: given the image and the generated description, we prompt GPT-4o to score the response between 0 and 5, focusing on accuracy and completeness. The validation dataset is divided into categories like natural photographs, charts, maps, diagrams, tables and screenshots, to represent a generic distribution of images. The radar chart compares the quality of the existing Florence-driven short accessibility descriptions with the short and detailed descriptions generated by the multimodal functionality for Phi Silica.
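For readers unfamiliar with the LLM-as-a-judge setup, the snippet below is a minimal sketch of what such a scoring call could look like using the OpenAI Python client with GPT-4o. The prompt wording, the rubric phrasing and the `score_description` helper are illustrative assumptions, not the team's actual evaluation harness.

```python
import base64
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

JUDGE_PROMPT = (
    "You are grading an image description. Given the image and the "
    "candidate description, return a single integer score from 0 to 5 "
    "based on the accuracy and completeness of the description."
)

def score_description(image_path: str, description: str) -> str:
    """Ask GPT-4o to judge a generated image description (illustrative sketch)."""
    with open(image_path, "rb") as f:
        image_b64 = base64.b64encode(f.read()).decode("utf-8")

    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[{
            "role": "user",
            "content": [
                {"type": "text",
                 "text": f"{JUDGE_PROMPT}\n\nDescription: {description}"},
                {"type": "image_url",
                 "image_url": {"url": f"data:image/png;base64,{image_b64}"}},
            ],
        }],
    )
    return response.choices[0].message.content
```

Scores returned this way can then be aggregated per image category (photographs, charts, maps, and so on) to produce the kind of per-category comparison the radar chart summarizes.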
Conclusion
In conclusion, we introduce the NPU-enabled multimodal functionality for Phi Silica, capable of performing image descriptions within the Windows ecosystem. By integrating existing components like the Florence vision encoder with Phi Silica, this solution provides an efficient on-device feature that generates detailed, meaningful and contextually rich descriptions of on-screen content. By providing both short and detailed descriptions, the multimodal functionality for Phi Silica enhances Alt Text generation and improves accessibility for blind and low-vision individuals by leveraging local on-device models running on the NPU.
These breakthroughs would not have been possible without the members of the Applied Science Group who contributed to this work, including Daniel Rings, Dimitrios Mallios, Eric Sommerlade, Henry Jackson-Flux, Karthik Vijayan, Mohsen Fayyaz, Parth Pathak, Pashmina Cameron, Sunando Sengupta, Tamara Turnadzic and Vojin Dedovic, with extended thanks to our partner teams Azure GenAI, Windows Developer Platform, Office Growth Ecosystems and Windows Accessibility.
References
- Florence: A New Foundation Model for Computer Vision - Microsoft Research
- Phi Silica, small but mighty on-device SLM
- Everything you need to know to write effective Alt Text - Microsoft Support
- Add alternative text to a shape, picture, chart, SmartArt graphic, or other object - Microsoft Support
- Retrace your steps with Recall - Microsoft Support