Sunday, December 3, 2023

GPT4 and the Multi-Modal, Multi-Model, Multi-Everything Future of AGI

As was rumored and then confirmed by Microsoft Germany, GPT-4 was released yesterday in ChatGPT with a blogpost, paper, livestream, and a couple of short videos:

To use simple measures of how anticipated this was – GPT-4 is already the 11th-most upvoted Hacker News story of ALL TIME, the Developer Livestream got 1.5 million views in 20 hours (currently #5 trending video on all of YouTube) and the announcement tweet got 4x more likes than the same for ChatGPT, itself the biggest story of 2022.

  • “Today has been a great year in AI” – Tobi Lutke, Shopify CEO

  • “Not sure I can think of a time where there was this much unexplored territory with this much new capability in the hands of this many users.” – Karpathy

There are lots of screenshots and bad takes flying around, so I figure it would be most useful to do the same executive-summary-style recap I did for ChatGPT, for GPT-4.

GPT-4 is the newest version of OpenAI’s flagship language model. It is:

That alone would qualify it as a huge release, but GPT-4 is also OpenAI’s first multimodal model, being able to natively understand image input as well as text. This is orders of magnitude better than existing OCR and Image-to-Text (e.g. BLIP) solutions and has to be seen to be fully understood, but the capabilities that you must know include:

GPT-4 can be tried out today by ChatGPT Plus subscribers ($20/month), while text API access is granted via a waitlist or by contributing to OpenAI Evals. The multimodal visual API capability is exclusive to Be My Eyes for now. API pricing is now split into prompt tokens and completion tokens and is 30-60x higher than GPT-3.5.
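To make the prompt/completion split concrete, here is a minimal sketch of how a per-request cost could be estimated. The function name and the two per-1K-token prices are my assumptions for illustration (roughly the 8K-context list prices as I recall them at launch), not figures taken from this post, so check the current pricing page before relying on them.

```python
# Assumed, illustrative prices in USD per 1,000 tokens (not from this post).
ASSUMED_PROMPT_PRICE_PER_1K = 0.03
ASSUMED_COMPLETION_PRICE_PER_1K = 0.06


def request_cost_usd(
    prompt_tokens: int,
    completion_tokens: int,
    prompt_price_per_1k: float = ASSUMED_PROMPT_PRICE_PER_1K,
    completion_price_per_1k: float = ASSUMED_COMPLETION_PRICE_PER_1K,
) -> float:
    """Estimate the cost of one request when input and output are billed separately."""
    prompt_cost = prompt_tokens / 1000 * prompt_price_per_1k
    completion_cost = completion_tokens / 1000 * completion_price_per_1k
    return prompt_cost + completion_cost


if __name__ == "__main__":
    # A long prompt with a short answer costs less than the reverse,
    # because completion tokens carry the higher assumed price.
    print(round(request_cost_usd(1500, 500), 4))
    print(round(request_cost_usd(500, 1500), 4))
```

The practical consequence of the split is that verbose outputs are what you pay most for, so trimming the requested completion length matters more than trimming the prompt.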


In a break from the past, OpenAI declined to release any technical details of GPT-4, citing competition and safety concerns. This means the Small Circle, Big Circle memes were neither confirmed nor denied, and another round of criticism of OpenAI not being open has started.

In place of technical detail, OpenAI instead focused on demonstrating capabilities (explained above), scaling and safety research (including evaluations done with the Alignment Research Center, an independent nonprofit) and demonstrating use cases with launch partners in an impressively coordinated launch (with a full slate of Built With GPT-4 examples on launch day):

Race Dynamics. The coordination reached beyond OpenAI – GPT-4 wasn’t the only foundation model launch on Tuesday. Google and Anthropic launched their PaLM API and Claude+ models respectively, with Quora Poe being the first app to launch with both OpenAI’s GPT-4 AND Anthropic’s Claude+ models. This ultra-competitive launch cycle across companies on Pi Day smacks of last month’s Google vs Microsoft race for special events and is causing concern among AI safety worriers and sleep-deprived Substack writers alike.

(end of summary! phew! but discussions ongoing @ Hacker News and Twitter)

GPT-4’s Multimodality is a glimpse of the AGI future to come. It didn’t end up fitting all the speculated capabilities – it doesn’t have image output, and audio was notably missing from the accepted inputs given the Whisper API release, but Jim Fan’s hero image here was mostly spot on:

However, 3 days ago Microsoft Research Asia released another approach to multiple modalities with Visual ChatGPT, allowing you to converse with your images much as you can with GPT-4:

This is a multi-modal project, but is more accurately described as a multi-model project, because it really is basically “22 models in a trenchcoat”.


This hints at two ways of achieving multi-modality – the cheap way (chaining together models, likely with LangChain), and the “right” way (training and embedding on mixed-modality datasets). We have some reason to believe that multimodal training gives benefits over and above single-modality training – in the same way that adding a corpus of code to language model training has been observed to improve results for non-code natural language, we might observe that teaching an AI what something looks like improves its ability to describe it, and vice versa.
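The “cheap way” is easy to picture in code. Below is a minimal sketch of chaining, assuming two hypothetical stand-in functions (`caption_image` and `answer_with_llm`) in place of a real vision model and a real language model; it is not how Visual ChatGPT or LangChain is actually implemented, just the shape of the idea: one model turns pixels into text, and a text-only model reasons over that text.

```python
from typing import Callable


def caption_image(image_path: str) -> str:
    """Hypothetical stub for an image-to-text model (e.g. a captioner)."""
    return f"a picture loaded from {image_path}"


def answer_with_llm(prompt: str) -> str:
    """Hypothetical stub for a text-only language model."""
    return f"[answer based on] {prompt}"


def chained_visual_qa(
    image_path: str,
    question: str,
    captioner: Callable[[str], str] = caption_image,
    llm: Callable[[str], str] = answer_with_llm,
) -> str:
    """Answer a question about an image by chaining two single-modality models."""
    # Step 1: the vision model converts the image into a text description.
    caption = captioner(image_path)
    # Step 2: the language model only ever sees text, never the image itself.
    prompt = f"Image description: {caption}\nQuestion: {question}"
    return llm(prompt)


if __name__ == "__main__":
    print(chained_visual_qa("cat.png", "What animal is this?"))
```

The weakness is visible in the structure: anything the captioner fails to mention is invisible to the language model, which is one reason to expect natively multimodal training to do better.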


Even being single-modality but multi-model is proving to be useful. Quora founder Adam D’Angelo chose to launch his new Poe bot with both OpenAI GPT-4 and Anthropic Claude support, and former GitHub CEO Nat Friedman built a tool to help compare outputs across the largest possible range of text models:

Eliezer Yudkowsky has also commented that being multi-model can be useful for model distillation, pointing to the recent Stanford Alpaca result, which finetuned Meta’s LLaMA on outputs generated by OpenAI’s GPT-3 family (text-davinci-003) to achieve comparable results with a 25x smaller model.

This seems to be a tremendously fruitful area of development (not forgetting PaLM-E, Kosmos-1, ViperGPT, and other developments I don’t have room to cover) and I expect multimodal, multi-model developments to dominate research and engineering cycles through at least the rest of 2023, edging us closer and closer to the AGI event horizon.

Moravec’s Paradox can be summarized as “computers find easy things that humans find hard, and vice versa”. But human capabilities evolve about 100,000x slower than computers, and it does not take long for computers to go from sub-human to super-human. By now we are familiar with the idea that LLMs are effortlessly multilingual (across the most popular human and programming languages, but also increasingly with lower-resource languages) and multidisciplinary (GPT-4 is simultaneously capable of being a great sommelier, law student, med student and coder, though English lit is safe).

But those are just two dimensions we can think of. ARC (the Alignment Research Center, in its GPT-4 evaluations) and Meta FAIR tested AI’s ability to be duplicitous, and we are increasingly seeing AI be effortlessly multi-personality as well – with the Waluigi Effect recently entering the AI discourse as a formal shorthand and Bing’s Sydney showing wildly disturbing alternative personalities variously known as Venom and Dark Sydney. And yet we press on.

AI is under no obligation to only be multi- in ways that we expect. I am reminded of the ending of the movie Her, when Joaquin Phoenix’s character learns that Samantha is simultaneously in love with 641 people, a number so big it boggles his mind but is functionally the same as loving 1 person for a multi-everything AI:

Moloch, thy name is race dynamics.
