Introduction
Expanding on the breakthrough efficiencies of Phi Silica, Microsoft’s state-of-the-art on-device small language model (SLM), this blog shows how we added vision-based multimodal capabilities to it. This additional dimension unlocks new possibilities for local SLMs on Windows, as illustrated by a couple of exciting new experiences for accessibility and productivity, which we also dive into in this blog. By introducing support for image understanding, Windows now comes with a built-in multimodal SLM offloaded to the NPU on your Copilot+ PC (available on Snapdragon-based Copilot+ devices and coming to Intel and AMD Copilot+ PCs), which powers several of our experiences and developer APIs. Effectively understanding images opens up numerous possibilities for reasoning over multimodal inputs, spanning images and text, to generate textual output.

After implementing Phi Silica with text capabilities, it was crucial that the addition of vision capability did not necessitate deploying a separate, dedicated vision SLM on-device. This consideration is especially important as we strive to be resource-conscious regarding disk space, memory utilization and compute resources with built-in OS language models. The solution we developed coexists with the existing Phi Silica and other models we deploy on-device, extensively reusing them while adding only a relatively small 80-million-parameter projector model as overhead. Integrating multiple models to introduce new capabilities by training smaller connectors is a method we anticipate will become increasingly prevalent in the client-AI space.

Instead of updating the base Phi Silica SLM weights, we feed adapted multimodal inputs to the existing Phi Silica embedding layer. The multimodal input adaptation is done using the vision encoder and the small projector model. For the vision encoder, we reuse the Florence image encoder, which is already deployed in Windows Recall (Preview) and the improved Windows search features. The small multimodal projector module, which translates vision embeddings into Phi-compatible embeddings, is trained afresh while the vision encoder remains frozen to avoid impacting existing use cases. Further, we ensure that the behavior of the newly introduced multimodal component is compatible with the existing quantized vision encoder and Phi Silica, with acceptable quality. This ensures the best user experience at minimal extra memory footprint.

Phi Silica-powered Image Description needs to coexist, and often run concurrently, with other SLM-based experiences like Click to Do (Preview) and the text rewrite and summarize capabilities, in addition to the user’s own AI workloads. Reusing all the existing components also helps save cost and time on training additional components, optimizes feature loading time and reduces overall memory footprint, providing the user an improved experience.
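To make the data flow described above concrete, here is a minimal, hypothetical sketch (PyTorch-style Python) of how adapted multimodal inputs could be assembled: a frozen vision encoder produces image features, a small trained projector maps them into the language model's embedding space, and the result is concatenated with the text embeddings before being passed to the unchanged SLM. The module names, shapes and the `generate_from_embeddings` helper are illustrative assumptions, not the actual Phi Silica implementation.

```python
import torch

def describe_image(image_pixels, prompt_token_ids,
                   vision_encoder, projector, slm):
    """Conceptual sketch: feed projected vision embeddings plus text
    embeddings to a frozen language model (names are assumptions)."""
    with torch.no_grad():
        # Frozen vision encoder -> per-patch visual features
        visual_features = vision_encoder(image_pixels)        # (1, 49, d_vision)

        # Small trained projector -> SLM-compatible embeddings
        visual_embeddings = projector(visual_features)        # (1, 49, d_slm)

        # Text prompt -> embeddings from the SLM's own embedding layer
        text_embeddings = slm.embed_tokens(prompt_token_ids)  # (1, T, d_slm)

        # Concatenate modalities and decode with the unchanged SLM weights
        inputs = torch.cat([visual_embeddings, text_embeddings], dim=1)
        return slm.generate_from_embeddings(inputs)
```

The key property of this arrangement is that only the projector carries new trained weights; the vision encoder and the SLM are reused exactly as they already ship on the device.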
How multimodal functionality for Phi Silica improves accessibility
Understanding the content on the screen, whether text or images, is an important step towards making computing more accessible. Today, many of Microsoft’s products, like Word and PowerPoint, automatically generate Alt Text for images, making it easier for screen readers like Microsoft Narrator to describe the content on the screen. The current Alt Text generation methods leverage cloud-based models to provide a short visual summary. The multimodal functionality for Phi Silica enhances the description of screen contents for people who are blind or have low vision and use screen readers. Phi Silica can generate descriptions with varying levels of detail, from shorter Alt Text to comprehensive descriptions, making the AI-generated Alt Text more useful to the person. On-device Phi Silica makes these high-quality descriptions possible, delivering a more accessible and performant experience.
Extracting the Vision Embedding
To extract the vision embeddings for the image, we use the Florence image encoder (Florence: A New Foundation Model for Computer Vision - Microsoft Research). The visual features from the input image are fed into a modality projector model. The projector is trained to produce embeddings aligned with the Phi Silica embedding space, which can be fed directly into Phi Silica alongside the Phi Silica embeddings of the accompanying text. Our modality projector has a simple architecture comprising two linear layers stacked on top of each other. In addition to being efficient at inference time, this design minimizes the number of new parameters added to our ecosystem of AI models, enabling multimodal functionality with just a few million new parameters, compared to the billions of additional parameters a separate vision language model would require. To maintain high inference efficiency, we train the system to operate with only one crop over the input image, unlike comparable models that require multiple crops. Thus, our training approach ensures that the system needs only 49 visual tokens from the vision encoder per image, keeping the end-to-end pipeline efficient.
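As a rough illustration of how small such a projector can be, the sketch below defines a two-layer linear projector in PyTorch that maps the 49 visual tokens from the vision encoder into the language model's embedding space. The feature dimensions and the GELU non-linearity between the two layers are assumptions made for illustration; the actual Phi Silica projector configuration is not specified here.

```python
import torch
import torch.nn as nn

class ModalityProjector(nn.Module):
    """Two stacked linear layers mapping vision-encoder features into
    the SLM embedding space (dimensions and activation are assumptions)."""

    def __init__(self, d_vision: int = 1024, d_hidden: int = 3072, d_slm: int = 3072):
        super().__init__()
        self.fc1 = nn.Linear(d_vision, d_hidden)
        self.act = nn.GELU()   # assumed non-linearity; not stated in the blog
        self.fc2 = nn.Linear(d_hidden, d_slm)

    def forward(self, visual_features: torch.Tensor) -> torch.Tensor:
        # visual_features: (batch, 49, d_vision) -> (batch, 49, d_slm)
        return self.fc2(self.act(self.fc1(visual_features)))

# Example: project one image's 49 visual tokens
projector = ModalityProjector()
tokens = torch.randn(1, 49, 1024)
print(projector(tokens).shape)  # torch.Size([1, 49, 3072])
```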



Evaluation
We evaluate the image descriptions generated by the multimodal functionality for Phi Silica against existing foundation models, such as the Florence model currently used for describing images. The evaluation is conducted using the LLM-as-a-judge technique: given the image and the generated description, we prompt GPT-4o to score the response between 0 and 5, focusing on accuracy and completeness. The validation dataset is divided into categories like natural photographs, charts, maps, diagrams, tables and screenshots, to represent a generic distribution of images. The radar chart compares the quality of the existing Florence-driven short accessibility descriptions with the short and detailed descriptions generated by the multimodal functionality for Phi Silica.
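For readers unfamiliar with the LLM-as-a-judge setup, the snippet below is a minimal sketch of what such a scoring call could look like using the OpenAI Python client with GPT-4o. The prompt wording, the rubric phrasing and the `score_description` helper are illustrative assumptions, not the team's actual evaluation harness.

```python
import base64
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

JUDGE_PROMPT = (
    "You are grading an image description. Given the image and the "
    "candidate description, return a single integer score from 0 to 5 "
    "based on the accuracy and completeness of the description."
)

def score_description(image_path: str, description: str) -> str:
    """Ask GPT-4o to judge a generated image description (illustrative sketch)."""
    with open(image_path, "rb") as f:
        image_b64 = base64.b64encode(f.read()).decode("utf-8")

    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[{
            "role": "user",
            "content": [
                {"type": "text",
                 "text": f"{JUDGE_PROMPT}\n\nDescription: {description}"},
                {"type": "image_url",
                 "image_url": {"url": f"data:image/png;base64,{image_b64}"}},
            ],
        }],
    )
    return response.choices[0].message.content
```

Scores returned this way can then be aggregated per image category (photographs, charts, maps, and so on) to produce the kind of per-category comparison the radar chart summarizes.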
Conclusion
In conclusion, we introduce the NPU-enabled multimodal functionality for Phi Silica, capable of performing image descriptions within the Windows ecosystem. By integrating existing components like the Florence vision encoder with Phi Silica, this solution provides an efficient on-device feature that generates detailed, meaningful and contextually rich descriptions of on-screen content. By providing both short and detailed descriptions, the multimodal functionality for Phi Silica enhances Alt Text generation and improves accessibility for blind and low-vision individuals by leveraging local on-device models running on the NPU.
These breakthroughs would not have been possible without the members of the Applied Science Group who contributed to this work, including Daniel Rings, Dimitrios Mallios, Eric Sommerlade, Henry Jackson-Flux, Karthik Vijayan, Mohsen Fayyaz, Parth Pathak, Pashmina Cameron, Sunando Sengupta, Tamara Turnadzic and Vojin Dedovic, with extended thanks to our partner teams Azure GenAI, Windows Developer Platform, Office Growth Ecosystems and Windows Accessibility.
References
- Florence: A New Foundation Model for Computer Vision - Microsoft Research
- Phi Silica, small but mighty on-device SLM
- Everything you need to know to write effective Alt Text - Microsoft Support
- Add alternative text to a shape, picture, chart, SmartArt graphic, or other object - Microsoft Support
- Retrace your steps with Recall - Microsoft Support