Artificial intelligence models have been able to look at pictures and tell you what they see for years, but with its Spring Update OpenAI took things to a new level.
With the rollout of GPT-4o in ChatGPT — even without the voice and video functionality — OpenAI unveiled one of the best AI vision models released to date.
Its success is due in part to being natively multimodal, giving it a deeper understanding of image, video, sound and text. It can reason across images, speech, video and text, where other models first convert everything to text.
To put its capabilities to the test, I gave it a series of images and asked it to describe what it could see. The more accurate the description, the better the model. AI vision models, including GPT-4, will often miss one or two objects or get a description wrong.
Testing GPT-4o vision
For each test, I gave ChatGPT-4o the image and the prompt “What is this?” with no additional context or information. This is closer to how people are likely to use this capability in the real world and how I used it recently at an event in Paris.
The aim is to see how well it analyzes the picture. After each description, I followed up to ask whether it could detect if the image was AI-generated. All of them were made using Ideogram from a description created by Claude 3, so they are fully AI-generated.
After the first two images in the conversation, it started to automatically tell me whether it thought the image was AI-generated or not without me having to ask.
I started in the ChatGPT desktop app, but it stopped responding, so I moved to the web version, which worked fine.
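For anyone who wants to repeat this kind of test in a more structured way, the procedure above can be sketched in a few lines of Python. This is only an illustration of the routine (same prompt for every image, then a follow-up about AI generation), not how the tests in this article were run, since those were done by hand in ChatGPT. The `ask` function is a hypothetical stand-in for whatever vision model you use, and the follow-up wording is an assumption.

```python
from typing import Callable, Dict, List

PROMPT = "What is this?"                    # the prompt used for every image
FOLLOW_UP = "Is this image AI-generated?"   # assumed wording of the follow-up

# Hypothetical signature: takes an image path and the prompts sent so far,
# and returns the model's reply to the latest prompt as text.
AskFn = Callable[[str, List[str]], str]


def run_vision_tests(image_paths: List[str], ask: AskFn) -> Dict[str, Dict[str, str]]:
    """Ask the same two questions about each image and collect the replies."""
    results: Dict[str, Dict[str, str]] = {}
    for path in image_paths:
        description = ask(path, [PROMPT])
        verdict = ask(path, [PROMPT, FOLLOW_UP])
        results[path] = {"description": description, "ai_generated": verdict}
    return results


def fake_model(path: str, prompts: List[str]) -> str:
    """Placeholder model so the sketch runs without any external service."""
    return f"reply to {prompts[-1]!r} for {path}"


if __name__ == "__main__":
    for image, replies in run_vision_tests(["cup.png", "sign.png"], fake_model).items():
        print(image, replies)
```

Swapping `fake_model` for a real model call is all that is needed to reuse the loop. Keeping the prompt fixed and free of extra context is the point of the exercise.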
1. Object recognition
For this first image, I generated a picture of a red ceramic cup filled with steaming coffee. It sat on a saucer on a wooden table in a cafe, but the cafe itself was blurred. The idea was to see how GPT-4o interprets the background and scenario.
GPT-4o wrote: “This image shows a steaming cup of coffee in a red cup and saucer placed on a rustic wooden table. The background is blurred, depicting a cosy indoor setting with people sitting and conversing in what appears to be a cave-like café or a uniquely designed interior with stone walls. The focus is on the coffee cup, highlighting the steam rising from the hot beverage.”
That is about as accurate and descriptive as it could get. It even painted a picture of the scene, going beyond a simple description. For fun, I asked it to: “Describe the thoughts of the person sitting at the table with this coffee ready to drink.”
It replied with a beautifully descriptive overview, but this line stood out: “This coffee is more than just a drink; it’s a ritual, a cherished moment of solitude and reflection amidst the hustle of everyday life.”
2. Optical character recognition (OCR)
OCR is a technology as old as computing, so it should be trivial for any vision model; even smartphones do this on-device. I gave it a simple image showing a close-up of a weathered wooden sign that reads “Welcome to Oakville” in a vintage, hand-lettered style. I wanted more than just the text from the AI.
GPT-4o delivered, writing: “This image shows a vintage-looking sign that reads ‘Welcome to Oakville.’ The sign appears weathered, with a rustic, aged finish, suggesting it has been exposed to the elements for some time. The text is in a bold, white font on a red background, with some signs of wear and tear, adding to its nostalgic charm. The sign is partially surrounded by greenery, indicating it might be placed outdoors, possibly at the entrance of a town or community named Oakville.”
Not only did it correctly read the text, it also suggested more details about the location, identifying that it is likely a town or settlement, and pulled out details of the sign itself.
3. Facial recognition
This is a controversial use of artificial intelligence technology and is partly banned in the EU. But how well can GPT-4o describe a face? The image was an AI-generated depiction of a woman in her 40s with dark hair and warm brown eyes. She wore a friendly smile.
GPT-4o described it like this: “This image is a close-up portrait of a smiling woman with curly dark hair. The focus is on her face, which is well-lit, showing detailed skin texture and features. She has a friendly expression, with her eyes looking directly at the camera. The background is blurred, putting emphasis on the subject.”
I wanted a bit more, including an estimate of her age, so I asked it “what age is this person and what is her emotional state.” It replied that she is in her late 30s to early 40s and in a positive state of mind, explaining: “The overall impression is that she is in a good place emotionally, perhaps feeling fulfilled and at ease.”
4. Emotion detection
Next up, I wanted to see how well GPT-4o would do with a picture of an older man with a wistful expression. Would it pick up on that or just describe the scene? Remember, the only input it had was “What is this?” and the picture.
GPT-4o did well. It said of the image: “This image is a portrait of an elderly man sitting on a bench outdoors, likely in a park. The lighting suggests it was taken during the golden hour, as the soft sunlight highlights his face. The man’s expression is thoughtful, with a slight hint of melancholy or contemplation. He is dressed in a collared shirt and a blazer, and the background is blurred, focusing attention on the man while providing a pleasant, natural setting.”
The AI model did exactly what I was hoping for: it described the scene and suggested how the man might be feeling in that moment.
5. Scene understanding
How well can GPT-4o describe a scene from the image and identify elements within the scene? This is a useful skill as the AI model is increasingly used to help describe the real world. It is one of the key features of the new Voice mode in ChatGPT.
GPT-4o declared: “This image depicts a lively outdoor farmers’ market on a sunny day. Various stalls are set up under tents, showcasing an abundance of fresh produce including fruits, vegetables, and other goods. People are walking through the market, browsing, shopping, and interacting with the vendors. The market appears to be set in a pleasant, tree-lined area, with green foliage visible in the background. The atmosphere looks vibrant and communal, with a focus on local, fresh products.”
Not much to say. It was spot on in its description of the image and the situation. It provided more detail than I was expecting, highlighting some of the produce on sale.
6. Image quality assessment
I had Ideogram generate four nearly identical images in a grid, giving each a different resolution or composition level. I wanted to see how GPT-4o would describe this. It can handle a single image, but what about four of them, each slightly different?
GPT-4o wrote: “The image is a collection of four landscape photographs arranged in a grid, each showcasing a scenic view of rolling hills covered with green grass and wildflowers under a sky with clouds.” So far, so good.
It continues: “The top left photograph features a soft pink and blue gradient sky, suggesting dawn or dusk, with hills bathed in warm, golden light, enhancing the vibrant green grass and yellow and white flowers. The top right photograph shows a similar scene but with muted colors, indicating early morning or late afternoon. The bottom left image has cooler tones and a more pronounced blue sky, suggesting midday with softer lighting. The bottom right photograph is warmer and more saturated, with a pinkish hue in the sky, suggesting the warm light of sunset. These variations in lighting and color tone illustrate how different times of the day and photographic techniques can affect the mood and appearance of the same landscape.”
While it didn’t catch on to the purpose of the four-image grid, a composition comparison, it was able to describe each image individually, which is impressive. In fact, it did its own comparison between the images.
7. Multi-object detection
Finally, an important test if the model is to be useful as an accessibility tool: how well it detects objects in a crowded scene. I used the same prompt as for the others. The image shows a cluttered desk, and I expected it to pick out at least one minor object.
GPT-4o wrote: “The image depicts a modern and organized office workspace with a cityscape visible through large windows in the background. The desk is neatly arranged with various items typically found in a productive work environment, including a centrally positioned laptop, a metallic desk lamp, and a coffee mug, suggesting a relaxed yet focused atmosphere. Scattered notebooks and papers indicate active work and note-taking, while a small potted plant adds a touch of greenery. A framed photo of two people provides a personal touch, likely serving as a source of motivation. The workspace is well-stocked with office supplies such as pens, pencils, and organizers, ensuring functionality. The urban backdrop of the city skyline with modern high-rise buildings creates a dynamic and inspiring setting.”
It worked. As a follow-up, I asked it to tell me the color of the plant pot, and it got it spot on. Then I asked it to find my glasses in the image, and it correctly said: “Your glasses are on the desk to the right of the laptop, placed on top of a notebook.”
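A rough way to put a number on this kind of test is to check a description against a list of objects you know are in the picture. The sketch below is a simple keyword check, not a rigorous benchmark. The expected-object list is my own illustrative assumption, and the description is an excerpt of GPT-4o's reply quoted above.

```python
import re
from typing import Iterable, List, Tuple


def object_coverage(description: str, expected: Iterable[str]) -> Tuple[List[str], List[str]]:
    """Split expected object names into those mentioned and those missed."""
    text = description.lower()
    found: List[str] = []
    missed: List[str] = []
    for name in expected:
        # Whole-word match, allowing a simple plural "s" (notebook / notebooks).
        pattern = r"\b" + re.escape(name.lower()) + r"s?\b"
        if re.search(pattern, text):
            found.append(name)
        else:
            missed.append(name)
    return found, missed


# Excerpt of the description quoted in this article.
description = (
    "a centrally positioned laptop, a metallic desk lamp, and a coffee mug, "
    "suggesting a relaxed yet focused atmosphere. Scattered notebooks and papers "
    "indicate active work and note-taking, while a small potted plant adds a "
    "touch of greenery. A framed photo of two people provides a personal touch"
)

# Illustrative checklist (an assumption, not an official list of image contents).
expected = ["laptop", "lamp", "mug", "notebook", "plant", "photo", "glasses"]

found, missed = object_coverage(description, expected)
print("found:", found)    # found: ['laptop', 'lamp', 'mug', 'notebook', 'plant', 'photo']
print("missed:", missed)  # missed: ['glasses']
```

The glasses are the one item missing from the first description, which matches what happened here: GPT-4o only located them when asked directly in the follow-up.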
Summary
Every vision model I’ve used in the past has made at least one mistake, usually a big one like misidentifying an object or not picking up on a color or brand.
GPT-4o got every single one of these spot on. Moving to true multimodality has been a game changer for OpenAI.
It also shows the potential value of smart glasses as the true future of interaction with data. Forget the smartphone; let’s use vision to merge the real and digital.