In the last couple of years, we’ve seen amazing advancements in AI systems when it comes to recognizing and analyzing the contents of complicated images. But a new paper highlights how many state-of-the-art vision language models (VLMs) often fail at simple, low-level visual analysis tasks that are trivially easy for a human.
In the provocatively titled pre-print paper “Vision language models are blind” (which has a PDF version that includes a dark sunglasses emoji in the title), researchers from Auburn University and the University of Alberta create eight simple visual acuity tests with objectively correct answers. These range from identifying how many times two colored lines intersect to identifying which letter in a long word has been circled to counting how many nested shapes exist in an image (representative examples and results can be viewed on the research team’s webpage).
Crucially, these tests are generated by custom code and don’t rely on pre-existing images or tests that could be found on the public Internet, thereby “minimiz[ing] the chance that VLMs can solve by memorization,” according to the researchers. The tests also “require minimal to zero world knowledge” beyond basic 2D shapes, making it difficult for the answer to be inferred from “textual question and choices alone” (which has been identified as an issue for some other visual AI benchmarks).
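The paper’s actual generation code isn’t reproduced in this article, but as a rough illustration of what “generated by custom code” with an objectively correct answer can mean, here is a minimal Python sketch of a line-intersection task in the same spirit: two random polylines are created, and the ground-truth crossing count is computed geometrically rather than looked up anywhere. The function names and parameters are hypothetical, not the researchers’ own.

```python
import random

def segments_intersect(p1, p2, p3, p4):
    """Return True if segment p1-p2 properly crosses segment p3-p4."""
    def cross(o, a, b):
        # 2D cross product of vectors o->a and o->b
        return (a[0] - o[0]) * (b[1] - o[1]) - (a[1] - o[1]) * (b[0] - o[0])
    d1 = cross(p3, p4, p1)
    d2 = cross(p3, p4, p2)
    d3 = cross(p1, p2, p3)
    d4 = cross(p1, p2, p4)
    # Endpoints of each segment must lie strictly on opposite sides
    # of the other segment's supporting line.
    return (d1 * d2 < 0) and (d3 * d4 < 0)

def make_line_task(n_points=4, seed=None):
    """Generate two random polylines over shared x positions, plus the
    objectively correct number of times they cross (hypothetical sketch)."""
    rng = random.Random(seed)
    xs = list(range(n_points))
    line_a = [(x, rng.uniform(0, 10)) for x in xs]
    line_b = [(x, rng.uniform(0, 10)) for x in xs]
    # Because both polylines share the same x positions, segment i of one
    # line can only cross segment i of the other, so a same-index check
    # yields the exact ground-truth crossing count.
    crossings = sum(
        segments_intersect(line_a[i], line_a[i + 1], line_b[i], line_b[i + 1])
        for i in range(n_points - 1)
    )
    return line_a, line_b, crossings
```

A benchmark built this way can render the two polylines to an image, ask a VLM “how many times do the lines cross?”, and grade the answer against `crossings`, with no dependence on any image the model might have memorized from the web.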
Are you smarter than a fifth grader?
After running multiple tests across four different vision language models (GPT-4o, Gemini-1.5 Pro, Claude 3 Sonnet, and Claude 3.5 Sonnet), the researchers found all four fell well short of the 100 percent accuracy you might expect for such simple visual analysis tasks (and which most sighted humans would have little trouble achieving). But the size of the AI underperformance varied greatly depending on the specific task. When asked to count the number of rows and columns in a blank grid, for instance, the best-performing model gave an accurate answer less than 60 percent of the time. On the other hand, Gemini-1.5 Pro hit nearly 93 percent accuracy in identifying circled letters, approaching human-level performance.
Even small changes to the tasks could lead to huge changes in results. While all four tested models were able to correctly identify five overlapping hollow circles, the accuracy of every model dropped to well below 50 percent when six to nine circles were involved. The researchers hypothesize that this “suggests that VLMs are biased towards the well-known Olympic logo, which has 5 circles.” In other cases, models occasionally hallucinated nonsensical answers, such as guessing “9,” “n,” or “©” as the circled letter in the word “Subdermatoglyphic.”
Overall, the results highlight how AI models that can perform well at high-level visual reasoning have some significant “blind spots” (sorry) when it comes to low-level abstract images. It’s all somewhat reminiscent of similar capability gaps that we often see in state-of-the-art large language models, which can create extremely cogent summaries of lengthy texts while at the same time failing extremely basic math and spelling questions.
These gaps in VLM capabilities could come down to the inability of these systems to generalize beyond the kinds of content they are explicitly trained on. Yet when the researchers tried fine-tuning a model using specific images drawn from one of their tasks (the “are two circles touching?” test), that model showed only modest improvement, from 17 percent accuracy up to around 37 percent. “The loss values for all these experiments were very close to zero, indicating that the model overfits the training set but fails to generalize,” the researchers write.
The researchers propose that the VLM capability gap may be related to the so-called “late fusion” of vision encoders onto pre-trained large language models. An “early fusion” training approach that integrates visual encoding alongside language training could lead to better results on these low-level tasks, the researchers suggest (without providing any sort of analysis of this question).