Mathieu De Coster

Plato's multimodal large language models

17 March 2024

Multimodal large language models (MLLMs) can analyze both text and images, suggesting a step towards visual reasoning. They can also generate new data in both modalities, suggesting that they are creative. However, this ability doesn't equate to true artificial general intelligence (AGI). MLLMs, in a way, resemble the prisoners in Plato's allegory of the cave.

Large language models

Large language models (LLMs) started popping up around 2018 with BERT (1) and GPT (2). Based on the transformer architecture, they showed immense potential with regard to natural language modeling. This potential was partially realized in 2022 with the release of ChatGPT (3). It had such a big impact that it made LLMs part of public discourse. The general availability of ChatGPT caused a stir in mainstream media. For a while, you couldn't go to a bar or restaurant without hearing the term at least once. Just like that, everyone and their mother was talking about AI.

However, language alone has limitations. LLMs lack the ability to build a comprehensive understanding of the world. To bridge this gap, researchers have recently equipped LLMs with visual capabilities, creating MLLMs like GPT-4 (4). As the saying goes, a picture is worth a thousand words.

Renowned computer scientist and artificial intelligence researcher Yann LeCun stated recently that "A 4 year-old child has been awake a total 16,000 hours, which translates into 1x10^15 bytes. [...] In a mere 4 years, a child has seen 50 times more data than the biggest LLMs trained on all the text publicly available on the internet." (5, 6) Clearly, visual capabilities are needed to increase the amount of information available to LLMs.
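As a rough sanity check, the numbers in that quote can be unpacked with a few lines of arithmetic. The sketch below uses only the three figures from the quote (16,000 hours, 1x10^15 bytes, a factor of 50). The bandwidth and the text corpus size it prints are quantities implied by those figures, not numbers stated in the quote, and the variable names are purely illustrative.

```python
# Figures taken directly from the quote above.
AWAKE_HOURS = 16_000        # hours a 4-year-old has been awake
VISUAL_BYTES = 1e15         # bytes of visual input over that time
CHILD_TO_LLM_RATIO = 50     # "50 times more data than the biggest LLMs"

awake_seconds = AWAKE_HOURS * 3600
hours_per_day = AWAKE_HOURS / (4 * 365)

# Derived quantities: implied by the quote, not stated in it.
implied_bandwidth = VISUAL_BYTES / awake_seconds        # bytes per second
implied_llm_bytes = VISUAL_BYTES / CHILD_TO_LLM_RATIO   # bytes of training text

print(f"Awake time: {hours_per_day:.1f} hours/day")
print(f"Implied visual bandwidth: {implied_bandwidth / 1e6:.1f} MB/s")
print(f"Implied LLM training text: {implied_llm_bytes / 1e12:.0f} TB")
```

Taken at face value, the quote amounts to roughly 11 waking hours a day, a visual stream of about 17 MB per second, and a text corpus of about 20 TB for the largest LLMs.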

The allegory of the cave

Nearly 2,500 years ago, the ancient Greek philosopher Plato described a cave, an allegory for a separation between, on the one hand, the world we observe and its ephemeral nature, and on the other hand the abstract world of (mathematical) concepts which is pure and is not subject to the influence of time and space.

In his allegory, Plato describes humans as beings who are stuck in a cave where no natural daylight can enter. They are chained to a wall and can only look straight ahead, observing shadows on the wall in front of them. These shadows are projections, cast by a fire, of objects held high by people walking behind the wall.

[Image: Plato's cave, by Wikimedia user 4edges (CC BY-SA 4.0)]

These projections are the reality of the prisoners, as far as they are concerned. These two-dimensional moving figures are their absolute truth and they are unaware that there is a higher form of reality. Should a prisoner somehow free themselves from their shackles and leave the cave, they would be blinded by the sun outside. This represents the incomprehensible nature of the true reality. Eventually, however, their eyes would adjust, and they would finally see the real world clearly.

Multimodal large language models as the prisoners in the cave

MLLMs, like Plato's prisoners, are chained to the world of text and image data. Their understanding is limited to the patterns they can discern within this data. Their abilities, while impressive, don't equate to true comprehension of the world. MLLMs are restricted to observing a pre-processed and filtered version, a projection, of the world, and this inherently limits their ability to achieve intelligence. One could even say that LLMs trained on text alone are similar to a blind prisoner in Plato's cave. This prisoner does not even see the projections; instead, someone sits behind them, describing the shadows by whispering in their ear. Surely, this telephone game version of observing the real world can only lead to an incomplete world view.

Overcoming MLLMs' limitations requires venturing beyond the cave. For true AGI to emerge, AI systems will need the ability to interact with and learn from the physical world, not just through text and images, but through embodied experience. This will involve incorporating additional sensory modalities, motor skills, and the ability to form and test hypotheses in the real world. LeCun mentions that vision, audio and touch are all required (6).

Is intelligent reasoning simply being able to interpolate observations?

But are these three sensory inputs enough to achieve intelligence? What if we were to incorporate taste and smell, too? Even embodied observation may be insufficient. What MLLMs are trained to do is essentially interpolation. Every observation is a point in a space, and when we ask them to generate new data, they interpolate in this space. By providing them with a higher quantity and variety of data points, the space eventually becomes so expansive that interpolating within it gives the illusion of intelligence (7).
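The interpolation argument can be made concrete with a toy sketch. It is only an analogy: an MLLM is not a piecewise-linear interpolator, and the names below (`make_interpolator`, `observe`, `max_error`) as well as the choice to clamp predictions outside the observed range are assumptions made for illustration. The "world" here is a sine wave, observed only on the interval from 0 to 2π.

```python
import math
from bisect import bisect_right


def make_interpolator(xs, ys):
    """Piecewise-linear interpolator over sorted xs.

    Outside the observed range it simply repeats the nearest observation
    (an assumption of this sketch; it has nothing else to go on).
    """
    def predict(x):
        if x <= xs[0]:
            return ys[0]
        if x >= xs[-1]:
            return ys[-1]
        i = bisect_right(xs, x)          # xs[i-1] <= x < xs[i]
        x0, x1 = xs[i - 1], xs[i]
        y0, y1 = ys[i - 1], ys[i]
        t = (x - x0) / (x1 - x0)
        return y0 + t * (y1 - y0)
    return predict


def observe(n):
    """Sample the 'world' (a sine wave) at n evenly spaced points in [0, 2*pi]."""
    xs = [2 * math.pi * k / (n - 1) for k in range(n)]
    return xs, [math.sin(x) for x in xs]


def max_error(predict, lo, hi, steps=1000):
    """Largest gap between prediction and truth on a grid over [lo, hi]."""
    grid = (lo + (hi - lo) * k / steps for k in range(steps + 1))
    return max(abs(predict(x) - math.sin(x)) for x in grid)


results = {}
for n in (5, 50, 500):
    predict = make_interpolator(*observe(n))
    inside = max_error(predict, 0.0, 2 * math.pi)           # seen region
    outside = max_error(predict, 2 * math.pi, 4 * math.pi)  # unseen region
    results[n] = (inside, outside)
    print(f"n={n:>3}  error inside: {inside:.5f}  error outside: {outside:.5f}")
```

With more observations, the error inside the observed interval shrinks towards zero, which is the "illusion" described above. The error outside the interval stays large no matter how many points are added, because more data of the same kind never tells the interpolator what lies beyond it.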

As humans, our nature is shaped by our past experiences and by the physical limitations of ourselves and our world. In his book Free Play, Stephen Nachmanovitch describes the creation of something new as a marriage of "patterning outside [...] with the patterning [...] within." (8) Nachmanovitch's book emphasizes that creativity arises from navigating within limitations and from play. These limitations are different from those of Plato's prisoners, who are restricted to observation. The key to intelligence lies not just in being impressed by the outside world but also in manipulation and the ability to express oneself creatively. Manipulation allows us to build reasoning based on extrapolation. We learn "If I do this, then that will happen."

Superhuman intelligence and the arts

Give five humans a paintbrush, some paint, and instructions to draw a summer's day, and you will get five distinct paintings. Hand them the same tools and instructions ten years later, and you will receive another five distinct paintings. Our intelligence and our creativity are shaped by our experiences, both positive and negative. Our highest highs and our lowest lows and everything in between. These, too, are limits that promote creativity. As we get molded by life, so do our expressions. In contrast, MLLMs may generate beautiful text and images, artsy even, but I would not call them art, precisely due to this lack of experience of the world outside the cave. There's a reason why Nick Cave calls AI "a grotesque mockery of what it is to be human." (9)

(M)LLMs are incredibly impressive and disruptive. They came out of left field for many of us, including leading AI researchers. Yet, we are still far from achieving AGI. We can continue to throw data and compute at neural networks, but without a paradigm shift, they will remain confined to the realms of interpolation.

In Plato's allegory, the escapee returns to the cave after experiencing the real world and tries to convey his newfound insights to his fellow prisoners. They cannot comprehend it and even attack him for it. This raises the question, "Do we even want to achieve AGI, let alone superhuman intelligence?" Is the beauty of humanity not the marriage of our own unique experiences with our empathy for other beings? We attach feelings to other humans, to animals, to plants and even to inanimate objects. What makes life worth living is experiencing the entire spectrum of our emotions: joy, sadness, love, hurt, happiness.

While MLLMs like ChatGPT and Gemini represent a significant leap forward in AI capabilities, it's crucial to continue to think of them as tools. They should augment our creativity, not replace it, and enhance communication, not isolate us. Continued research into AI is vital, but the true power lies in our ability to leverage these tools for human progress, not to supplant the very aspects that make us human: artistic expression and social connection.

(1) Devlin, Jacob, et al. "BERT: Pre-training of deep bidirectional transformers for language understanding." arXiv preprint arXiv:1810.04805 (2018).
(2) Radford, Alec, et al. "Improving language understanding by generative pre-training." (2018).
(3) Do I even need to put the URL here? Doesn't everyone have it bookmarked or memorized?
(4)
(5) In the same tweet, he also says "Yes, humans can get smart without vision, even pretty smart [sic] without vision and audition. But not without touch. Touch is pretty high bandwidth, too." We'll get back to that.
(6)
(7) Benjie Holson has some great illustrations and an intuitive explanation for this in his blog post on ML for Robots, Specialization vs Overfitting.
(8) Nachmanovitch, Stephen. "Free Play: Improvisation in Life and Art." (1990).
(9)