Multimodality, it’s so hot right now. 2024 was the year that all the major Large Language Models – ChatGPT, Gemini, Claude and others – introduced new modalities and new ways to interact.
Like most new technological fields, AI is full to the brim with technical jargon, some of it totally unnecessary, but some of it quite consequential.
Multimodal is one of the consequential ones.
So what does multimodal mean?
Well, it’s actually quite simple. In AI terms, a “modality” is a type of medium through which an AI model can consume, understand, and respond to information – think text, audio, image, video.
Historically, most AI systems used only text as their training data, their input and their output, and so were single-modality. In the last decade or so, AI image recognition systems have become increasingly common, with products like Google Lens and Amazon’s Rekognition. These computer vision models were obviously a step up in complexity from text-based models, but were still limited to images alone, and so were also single-modality.
The next evolution was text-to-image models like Stable Diffusion and DALL·E, which, technically speaking, are multimodal: they take a text prompt and produce an image – two modalities! In practice, however, “multimodal AI” has come to mean systems that combine two or more inputs or outputs simultaneously or alongside one another. This is sometimes called multimodal perception, because once you introduce multiple modalities an AI can begin to perceive (or at least give the impression of perceiving) what it is looking at, rather than just matching text or visual patterns.
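To make that concrete, here is a minimal sketch of what a combined text-and-image request looks like in code. It uses the OpenAI Python SDK as one example of a multimodal API; the model name, image URL and prompt are illustrative only, not a recommendation.

```python
# A minimal sketch of a multimodal request: one text prompt plus one image,
# sent together in a single call. Assumes the OpenAI Python SDK is installed
# and OPENAI_API_KEY is set; the model name and URL are placeholders.
from openai import OpenAI

client = OpenAI()

response = client.chat.completions.create(
    model="gpt-4o",  # any model that accepts image inputs
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "Describe what is happening in this photo."},
                {"type": "image_url", "image_url": {"url": "https://example.com/crowd.jpg"}},
            ],
        }
    ],
)

print(response.choices[0].message.content)
```

The interesting part is the `content` list: instead of a single string, the model receives two different modalities in one message and reasons over them together.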
Imagine providing an image recognition AI with a photograph. A basic system will recognise individual elements: man, woman, nose, hair. A more mature system will put them together to understand the image as a whole: a crowd watching something.
However, if you showed a multimodal system a video of that same crowd, it would recognise movements, facial expressions, sound effects, music and more to build a complete description of the scene.
You can think of AI modalities as roughly equivalent to human senses. There’s a lot you can do with just one sense, but when combined they provide a more complete understanding of the world around you.
How many modalities are there?
The main modalities in regular use right now are:
- Text, which can include things like:
  - Normal written chat
  - Numerical data
  - Code such as HTML and JavaScript
- Image
- Video
- Text on screen (through optical character recognition – see the sketch after this list)
- Non-dialogue audio (music, sound effects etc.)
- Dialogue
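As a concrete example of the “text on screen” modality above, here is a minimal sketch of pulling text out of a single video frame with optical character recognition. It assumes the pytesseract and Pillow packages (plus the underlying Tesseract engine) are installed, and the filename is a placeholder.

```python
# A minimal OCR sketch: extract on-screen text from one exported video frame.
# Assumes pytesseract, Pillow and the Tesseract engine are installed;
# "frame_0042.png" is a placeholder filename.
from PIL import Image
import pytesseract

frame = Image.open("frame_0042.png")  # a frame exported from a video
on_screen_text = pytesseract.image_to_string(frame)
print(on_screen_text)
```

A fully multimodal system does this kind of extraction implicitly, alongside recognising the imagery, audio and dialogue in the same clip.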
The vast majority of AI usage is still confined to single-modality text (and the vast majority of that usage is text inputs and outputs through ChatGPT’s web interface, app and API). However, as visual and audio AI systems become more mainstream and cheaper to operate, this will no doubt change – in much the same way that the early internet was mostly text and images but now contains a huge amount of video, music, podcasts and more.
Another curiosity of current-generation AI is that, although language and visual models can often give the impression of approaching human intelligence, they lack many of the building blocks of intelligence that we humans take for granted. These building blocks are, in effect, also modalities, and they could be incorporated into AI systems in the future.
An example: the first season of HBO’s House of the Dragon takes place over almost three decades, with the lead character of Rhaenyra Targaryen played by Milly Alcock during the first five episodes and by Emma D’Arcy for the remaining five. As humans we can recognise – through production cues, wardrobe, context and many other signals – that we’re dealing with a time jump, and that this is the same character later in life.
An AI system, not understanding as basic a concept as the passage of time, will recognise two different faces and fail to understand they are the same character.
This is one of the most interesting challenges of building AI today – we’re trying to reconstruct human intelligence but starting in the wrong place, so we have to backfill many of the fundamental aspects of basic intelligence.
We are also seeing the emergence of so-called action models, which can complete tasks on behalf of their users – for example logging into Amazon and ordering something. It’s certainly possible that actions will become another modality that is incorporated into larger models in time.
What’s the future for multimodality?
The only guarantee in the AI space right now is that the pace of innovation will continue to be relentless. Even the term “multimodal” only entered mainstream conversation around 18 months ago, and the number of people Googling it has increased tenfold in the last year.
Mobile devices and wearables are an obvious category that can benefit from multimodal models. Although the first attempts at devices built around multimodal AI were a huge miss, we’ll likely see these features incorporated into smartphones over time. The main limiting factor right now is the size of the models, which means queries have to be processed in the cloud over a stable, fast internet connection. This, too, will change as on-device models become more feasible.
Aside from being a nice-to-have, on-device multimodal AI has clear benefits for people with limited vision or hearing: a smartphone that can perceive the world around it is genuinely useful to these groups. Imagine a visually impaired person pointing their iPhone at a supermarket shelf and asking for help finding a specific product. Taking our human-senses metaphor to its logical conclusion, these models can fill in the gaps for people who have lost those senses.
Away from consumer products, we are already seeing some genuinely exciting developments in robotics, where multimodal perception is allowing off-the-shelf robotic products to engage with the world without task-specific programming.
Until now, industrial robots have needed specific instructions for each task (close your claw 60%, raise your arm 45°, rotate 180°, and so on), but multimodal models may allow them to work out how to complete a task themselves when given a desired outcome, by understanding the world around them and how to interact with it. Imagine a robot arm that can pick specific groceries of differing shapes and textures, and pack them while taking into account how susceptible each item is to damage.
In time, multimodal AI will become just another technology that we all take for granted. But right now, in 2024, it’s the frontier where the most exciting artificial intelligence developments are taking place, and it’s worth keeping an eye on.