Once again, AI takes the headlines. However, this time is different. We are at the beginning of a complete redesign of how humans interact with computers, possibly the biggest shift since Apple popularized the graphical user interface (GUI) or since devices first connected to the Internet.
Microsoft recently announced Copilot+ PCs, while Apple is reportedly in talks with OpenAI to integrate GPT-4o. These moves by the tech giants are setting the stage for a revolution in on-device smart assistants and AI computers. This first generation of assistants (which, unlike Alexa and Siri, will be truly intelligent) will generate and perceive audio, video, images, and text, and apply live multimodal understanding and basic reasoning capabilities.
Despite the current limitations of models like GPT-4o, Google's Gemini Nano, and small language models (SLMs), the first seeds of seamless interaction with on-device AI assistants have been planted. Through multimodal perception and voice, AI smart assistants aim to tackle everyday tasks far more efficiently than text-only LLMs and traditional virtual assistants.
In other words, your laptops, tablets, and phones are about to get a lot smarter, helping you work more efficiently and creatively by providing AI assistance and recommendations right where you need them, even in places with poor or no internet access. This is a huge step toward making AI a core part of our daily lives, making work and personal tasks easier, more productive, and more personalized.
We’ve seen a few false starts in this area, most notably the Humane AI Pin and the Rabbit R1, both of which were universally panned by reviewers. The general consensus was that relying on an always-on Internet connection to interact with a large cloud model simply didn’t work.
New superpowers unlocked with on-device AI
Some of the main issues slowing down the adoption of AI apps are latency, the cost of training and serving AI models, and privacy concerns (including fears of copyright theft). With AI integrated directly into our devices, data and inference are processed locally, at the edge, resulting in faster responses, reduced training and inference costs, and enhanced privacy, since users’ data never leaves the device.
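To make this concrete, here is a minimal sketch of local inference with a quantized SLM using the open-source llama-cpp-python library. The model file path and prompt are illustrative, and this is just one of several ways to run a model at the edge, not how Copilot+ PCs or Apple devices actually do it.

```python
# A minimal sketch of on-device inference: a quantized small language model
# running entirely locally via llama-cpp-python. No data leaves the machine.
from llama_cpp import Llama

# Hypothetical path to a quantized GGUF model downloaded beforehand.
llm = Llama(model_path="models/phi-3-mini-4k-instruct-q4.gguf", n_ctx=4096)

response = llm.create_chat_completion(
    messages=[
        {"role": "system", "content": "You are a concise on-device assistant."},
        {"role": "user", "content": "Draft a two-line reply declining a meeting."},
    ],
    max_tokens=128,
)
print(response["choices"][0]["message"]["content"])
```

Everything above runs on the local CPU or GPU: the latency is bounded by the device, not the network, which is exactly the trade-off driving this shift.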
During Microsoft’s recent announcement, one of the most impressive (albeit controversial) new features we saw was the ability to remember anything you have done on your device.
“We introduced memory into the PC. It’s called Recall. It’s not keyword search, it’s semantic search over all your history. It’s not about just any document. We can recreate moments from the past essentially,”
said Microsoft’s CEO, Satya Nadella, in a recent interview. Basically, you can now scroll back in time to easily find apps, websites, documents, and more.
It’s a “creepy cool” feature: you gain what amounts to perfect photographic memory, but it also raises huge privacy concerns, since Recall captures screenshots of your activity every few seconds (Microsoft says the snapshots are stored and processed locally). Elon Musk even said the feature should be turned off, as it felt like a “Black Mirror” episode.
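To see what “semantic search, not keyword search” means in practice, here is a minimal sketch over OCR’d snapshot text using the open-source sentence-transformers library. The snapshot strings, model choice, and query are illustrative; this shows the general technique, not Microsoft’s actual implementation.

```python
# A minimal sketch of semantic search over "history", in the spirit of Recall.
# Assumes screenshots have already been OCR'd into text snippets.
from sentence_transformers import SentenceTransformer, util

snapshots = [
    "2024-05-21 10:02 - Excel: Q2 budget spreadsheet, marketing tab",
    "2024-05-21 10:15 - Edge: flight search LIS -> SFO in July",
    "2024-05-21 11:40 - Slack: thread about the onboarding redesign",
]

model = SentenceTransformer("all-MiniLM-L6-v2")  # small local embedding model
corpus_embeddings = model.encode(snapshots, convert_to_tensor=True)

query = "that trip I was planning to San Francisco"
query_embedding = model.encode(query, convert_to_tensor=True)

# Embedding similarity, not keyword matching: the flight-search snapshot can
# surface even though it shares no literal keywords with the query.
hits = util.semantic_search(query_embedding, corpus_embeddings, top_k=1)
print(snapshots[hits[0][0]["corpus_id"]])
```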
When it comes to creative tools, expressing your ideas without any technical knowledge is becoming the norm. For example, with the new Paint, PC users can generate endless images for free, and Microsoft is also partnering with Adobe, DaVinci Resolve, CapCut, and others to improve app performance directly on Surface devices. Microsoft will also offer live captions and translations in 40 languages.
AI assistants and live multimodal understanding
On-device AI and cloud AI each have their own benefits and challenges, and the way they interact is shaping the (near) future of assistant-driven user experiences. Because everything is processed right on the device, responses are near-instant, enabling faster and more personalized voice copilots that can perceive and understand your screen and multimodal data on the go.
According to OpenAI’s CEO, Sam Altman, some users might want an extension of themselves: an augmented alter ego that acts on their behalf, for example responding to emails without even informing them. Alternatively, the assistant could be a separate entity that acts like an experienced senior employee: it has access to the user’s email, works within the constraints the user sets, and is always available, detail-oriented, consistent, and incredibly capable.
A key capability of these agents is live multimodal understanding, a technology we already use for, among other things, letting you search through your video content.
A demo that illustrates these capabilities (still in alpha for GPT-4o users) is screen sharing, in which Sal Khan (founder of Khan Academy) uses ChatGPT to guide his son Imran through a math problem with hints rather than answers. Once ChatGPT has access to your screen, it can provide solutions and recommendations on the fly.
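Developers can experiment with the same idea today by sending a screenshot to GPT-4o through OpenAI's Python SDK. The sketch below is illustrative: the screenshot path and prompt are made up, and the alpha screen-sharing feature in the ChatGPT apps works differently under the hood.

```python
# A minimal sketch of screen understanding with GPT-4o: capture a screenshot,
# encode it, and ask the model for a hint about what is on screen.
import base64
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

with open("screenshot.png", "rb") as f:  # hypothetical screenshot file
    image_b64 = base64.b64encode(f.read()).decode("utf-8")

response = client.chat.completions.create(
    model="gpt-4o",
    messages=[{
        "role": "user",
        "content": [
            {"type": "text",
             "text": "Here is my screen. Give me a hint for the next step "
                     "of this math problem, without revealing the answer."},
            {"type": "image_url",
             "image_url": {"url": f"data:image/png;base64,{image_b64}"}},
        ],
    }],
)
print(response.choices[0].message.content)
```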
Another trend in this space, albeit realistically a few years away, is large action models. Unlike today's agents, which focus on retrieval and understanding, these models can orchestrate workflows, coordinate with team members and other AI assistants, and connect and act across multiple apps and systems; a sketch of the idea follows below. (If you're interested, check out H, a company focused on large action models that recently raised $220 million in seed funding.)
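As a sketch of that idea, imagine a registry of app-level “tools” that an action model can invoke to act across applications. The tool names, the plan format, and the hard-coded plan here are all hypothetical; real systems (including H's) are far more involved.

```python
# A minimal sketch of the "action" layer a large action model sits on top of:
# a registry of callable tools spanning different apps, executed from a plan.
from typing import Callable

TOOLS: dict[str, Callable[..., str]] = {}

def tool(name: str):
    """Register a function under a tool name the model can reference."""
    def register(fn):
        TOOLS[name] = fn
        return fn
    return register

@tool("calendar.create_event")
def create_event(title: str, when: str) -> str:
    return f"Created '{title}' at {when}"

@tool("email.send")
def send_email(to: str, body: str) -> str:
    return f"Emailed {to}: {body}"

# In a real system the model would emit this plan; here it is hard-coded.
plan = [
    ("calendar.create_event", {"title": "Design review", "when": "Fri 10:00"}),
    ("email.send", {"to": "team@example.com", "body": "Review booked for Friday."}),
]

for name, args in plan:
    print(TOOLS[name](**args))  # execute each step across different "apps"
```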
On-device models are not enough
On-device AI also has big drawbacks: limited computing power on mobile devices, data fragmentation and compatibility issues across teams and databases/storage systems, and increased battery and memory consumption. It may also lead to different devices running different versions of AI models, making it harder to keep the UX consistent across platforms. These are not small challenges.
That’s why cloud AI is still so attractive for startups and enterprises alike. It can draw on vast computing resources to train and serve more complex, fine-tuned AI models and provide scalable infrastructure on demand (especially for video and 3D), which would be out of reach for a phone or a PC. Enterprise cloud providers also typically offer contractual guarantees that your data will not be used for training.
Cloud AI also makes it easy to update and improve apps instantly, ensuring they always run the latest version. Finally, cloud AI assistants can gather data from many sources, including on-device systems, cloud storage, and the Internet as a whole, making the AI smarter and more accurate.
What’s next?
The efforts to develop and enhance on-device and cloud AI models and assistants are starting to converge, especially after the latest moves announced by tech giants like Microsoft and Apple (with Samsung and Google showing similar strategies). Everything indicates that combining on-device and cloud AI to power assistants and apps will be the way forward: users get the speed and privacy of on-device AI along with the power and scalability of cloud AI.
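As a closing illustration, here is a minimal sketch of what such a hybrid setup could look like: a router that answers on-device when the task is small enough and escalates to the cloud otherwise. Both model calls and the routing heuristic are placeholders, not any vendor's actual design.

```python
def run_on_device(prompt: str) -> str | None:
    """Placeholder for a local SLM call; returns None if the task is too hard."""
    if len(prompt.split()) < 50 and "video" not in prompt.lower():
        return f"[local model] quick answer to: {prompt}"
    return None  # signal that we should escalate

def run_in_cloud(prompt: str) -> str:
    """Placeholder for a cloud LLM call (e.g., an HTTPS request to a hosted model)."""
    return f"[cloud model] full answer to: {prompt}"

def assistant(prompt: str) -> str:
    # Prefer the edge: lower latency, and the data stays on the device.
    local = run_on_device(prompt)
    if local is not None:
        return local
    # Fall back to the cloud for heavier, multimodal, or long-context work.
    return run_in_cloud(prompt)

print(assistant("Summarize this paragraph for me"))   # handled locally
print(assistant("Edit this 4K video for my channel")) # escalated to the cloud
```

In a production system the routing decision would hinge on model confidence, task type, battery state, and connectivity rather than a word count, but the division of labor is the same one the tech giants are now converging on.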