Researchers from Auburn University and the University of Alberta recently published a paper titled “Vision language models are blind.”

The study used eight straightforward visual acuity tests to highlight deficiencies in vision language models (VLMs).

The tasks included counting intersecting lines, identifying circled letters, and counting nested shapes, among others.

Study shows the best vision language models fail at very basic visual identification tests

These tests have objectively correct answers and require minimal knowledge beyond basic 2D shapes.

To prevent the models from solving the tasks through memorization, the researchers generated the tests using custom code rather than pre-existing images.
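
The article does not reproduce the researchers' generation code. As a rough illustration only, a short Python script along the following lines could render a "count the line intersections" test with a known ground-truth answer; the use of matplotlib, the helper names, and all parameters here are assumptions for this sketch, not the paper's actual method.

```python
# Illustrative sketch (not the paper's code): generate a "count the
# intersections" test image with a programmatically known answer.
import itertools
import random

import matplotlib.pyplot as plt


def count_intersections(segments):
    """Count how many pairs of segments cross (ignores collinear edge cases)."""
    def ccw(a, b, c):
        return (c[1] - a[1]) * (b[0] - a[0]) > (b[1] - a[1]) * (c[0] - a[0])

    def crosses(p1, p2, p3, p4):
        return (ccw(p1, p3, p4) != ccw(p2, p3, p4)
                and ccw(p1, p2, p3) != ccw(p1, p2, p4))

    return sum(crosses(*s1, *s2)
               for s1, s2 in itertools.combinations(segments, 2))


def make_test_image(n_segments=3, path="intersections.png", seed=0):
    """Draw random line segments and return the ground-truth intersection count."""
    rng = random.Random(seed)
    segments = [((rng.random(), rng.random()), (rng.random(), rng.random()))
                for _ in range(n_segments)]

    fig, ax = plt.subplots(figsize=(4, 4))
    for (x1, y1), (x2, y2) in segments:
        ax.plot([x1, x2], [y1, y2], linewidth=2)
    ax.set_axis_off()
    fig.savefig(path, dpi=150, bbox_inches="tight")
    plt.close(fig)

    return count_intersections(segments)


if __name__ == "__main__":
    answer = make_test_image()
    print(f"Ground-truth intersection count: {answer}")
```

Because the image and its answer come from the same code, new test instances can be produced on demand, which is what makes memorization of existing images an unlikely shortcut.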

They evaluated four VLMs: GPT-4o, Gemini-1.5 Pro, Sonnet-3, and Sonnet-3.5.

Even so, Gemini-1.5 Pro approached human-level performance on the circled-letter task, identifying the correct letter 93 percent of the time.

Furthermore, even minor modifications to the tasks resulted in significant performance changes.

These findings underscore a significant limitation in the ability of VLMs to handle low-level abstract visual tasks.

The researchers hypothesized that these gaps might stem from the models' inability to generalize beyond their training data.

However, they did not provide an analysis to support this hypothesis.

You can view the results and other examples on the team’s website.