This perfectly explains why I don’t trust AI at all. It has PhD-level knowledge and crèche-level intuition:
A few things matter here for context. That is ChatGPT, and though it is capable at vision, it is not a specialised vision model. Memory exists, and many are exploiting it by poisoning the AI with pre-existing memory. Some will use this to harvest content that they can sell to the public via social media. When you conduct such an experiment, do so with a clean slate, so to speak.
To ensure optimal visual support, you would need a layered approach. YOLO is very good at detection, but it isn't great at contextualising. I talked about this a bit in the thread on the AI cameras that Cape Town wants to roll out.
Hmmm. In layman's terms, and frameworks aside (so ignoring any particular paradigm, which would include ongoing training), the layers are:
1) Detecting multiple objects within a frame (commonly by bounding an object to a box, and labelling the box)
2) Classifying the objects, and this goes beyond labelling, e.g. facial recognition, specific tagging, etc.
3) Image segmentation. This breaks up an image into identifiable areas or regions, and can be applied to bounding boxes. E.g. in the video it would mark which pixels belong to the cup, helping to identify its orientation and the positions of the open and closed sides.
4) Contextualising the frame, objects and images. In essence, this is the explainer of what is being observed in the whole frame or in part of it.
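To make the four layers above a bit more concrete, here is a rough Python sketch of how they could chain together. Everything in it is an assumption for illustration: the names (Detection, detect, classify, segment, contextualise, describe) are hypothetical, the stages are stubs standing in for real models such as a YOLO detector or a segmentation network, and the "frame" is just a toy grid of 0s and 1s.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Set, Tuple

# Hypothetical sketch: each function is a stub standing in for a real model.


@dataclass
class Detection:
    label: str                                  # coarse label from the detector
    box: Tuple[int, int, int, int]              # (x1, y1, x2, y2), x2/y2 exclusive
    fine_label: Optional[str] = None            # filled in by the classifier layer
    mask: Set[Tuple[int, int]] = field(default_factory=set)  # object pixels


def detect(frame: List[List[int]]) -> List[Detection]:
    """Layer 1: find objects and bound each with a labelled box (stubbed)."""
    return [Detection("cup", (2, 1, 5, 5))]


def classify(frame: List[List[int]], det: Detection) -> Detection:
    """Layer 2: refine the coarse label into something more specific (stubbed)."""
    det.fine_label = {"cup": "paper cup"}.get(det.label, det.label)
    return det


def segment(frame: List[List[int]], det: Detection) -> Detection:
    """Layer 3: keep only the pixels inside the box that belong to the object."""
    x1, y1, x2, y2 = det.box
    det.mask = {
        (x, y)
        for y in range(y1, y2)
        for x in range(x1, x2)
        if frame[y][x] == 1
    }
    return det


def contextualise(dets: List[Detection]) -> str:
    """Layer 4: turn the structured results into a description of the frame."""
    parts = [f"{d.fine_label} covering {len(d.mask)} pixels at {d.box}" for d in dets]
    return "Frame contains: " + "; ".join(parts)


def describe(frame: List[List[int]]) -> str:
    """Run the layers in order: detect, classify, segment, contextualise."""
    dets = [segment(frame, classify(frame, d)) for d in detect(frame)]
    return contextualise(dets)


# Toy 6x6 frame: 1 marks a pixel that belongs to the object, 0 is background.
FRAME = [
    [0, 0, 0, 0, 0, 0],
    [0, 0, 1, 1, 1, 0],
    [0, 0, 1, 1, 1, 0],
    [0, 0, 1, 1, 1, 0],
    [0, 0, 0, 1, 0, 0],
    [0, 0, 0, 0, 0, 0],
]

if __name__ == "__main__":
    print(describe(FRAME))
```

The point is only the shape of it: each layer adds something the previous one couldn't, and the last layer needs the structured output of the others before it can explain anything. In a real deployment each stub would be its own model.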
ChatGPT can't do this. You would require different models for a more accurate answer.
In practice, it is far more complex. There are processes for switching between image and text, using models for encoding, decoding, both, or dual-coding, as well as multimodal models, etc. This is also applicable to audio. ChatGPT isn't designed for this.
I am currently experimenting with YOLO, and am having quite some fun. I want to try some transformer stuff, but eh, my RDNA 4 GPU is not up for the task. This is also why edge computing will start to play a larger role in localised deployments.