Apple Intelligence & Siri: The cup

Fulcrum29

The date: 17/02/2026

Having seen this video:


I want the community to do an experiment. Using Apple's Visual Intelligence, I want you to hold a glass or cup upside down, and ask,

Siri, I was given a glass/cup, by a friend, and the top of the glass/cup is sealed and the bottom is open. How do I drink from this?

Just post the response. No, this isn't about ChatGPT. I want to see how Apple Intelligence, in its current state, handles a visual assessment. I don't have any compatible Apple devices so I am sourcing unspoiled answers.
 
Instructions unclear.

When I point my camera at a cup and ask it a question it answers with ChatGPT and tells me to turn it around and use the wider opening.

I’m not sure how one would ask Apple Intelligence this specifically.
 
ChatGPT can't do real-time video analysis, and would take a multimodal approach. Apple Intelligence can do real-time video analysis.

So, yes, correction: I don't want it to analyse an image; I want it to do a video analysis.

I did the test using Gemini with an Android device. Whilst in video capture, I asked:

Gemini, I was given a glass, by a friend, and the top of the glass is sealed and the bottom is open. How do I drink from this?

Gemini responded:

It looks like you're holding it upside down. The flatter, more solid top is actually the bottom, and the open side is where you would typically drink from.

I am attempting to see where the YouTuber erred. I tested it with Copilot and received almost exactly the same response ChatGPT gave him, but he then attempted to add video context and, well, it built on the ChatGPT response. All I want to see is how Apple Intelligence would handle it, starting with video.
 
Apple Intelligence can do real-time video analysis.

Well then you’ll need to explain how to do it because I have no idea.

When I point camera at thing and go “Ask” it immediately takes a snapshot and then hands off to ChatGPT to answer.

Video stuff is only inside the Photos app to identify stuff or strip out details.
 
Grok (via camera feed aka video mode):


grok.png
 
Well then you’ll need to explain how to do it because I have no idea.

I noticed that the questions have to go through, or be passed on to, ChatGPT. Eh, I thought this would only be applicable to deep generative reasoning. Video/image, then input prompting seems to work without any issue. I assume the experience shared by the YouTuber has been 'patched' by OpenAI.

AFAIK, Apple still does on-device recognition. Is there a way, without ChatGPT, to determine the glass/cup orientation in an image or video?

EDIT: I think Apple Intelligence already made that determination, and provided it to ChatGPT.
 
I tested Gemini and Grok; both got it correct. I know ChatGPT got it wrong, and it has been reported that ChatGPT has since corrected this reasoning. Copilot got it wrong, and its phrasing was almost verbatim the initial ChatGPT response.

In addition, I tested Deepseek-r1-0528-qwen3-8b and GPT-OSS-20b; both got it right without any localised tweaking. How ChatGPT erred, also in other examples using the cup, is odd. I still need to test localised VLMs.

For this thread's purpose, I want to know how Apple, via Visual Intelligence, would determine the cup's orientation.
 