Vibecoding: a beginner's guide

This summarises it perfectly for me … we’re moving so fast at the moment, I’m just trying to keep vaguely in the loop from the side. I feel sorry for people actually trying to stay on top of this all given the current pace of development:
 
Okay can we keep this thread on topic instead of a meme fest please.
 
This perfectly explains why I don’t trust AI at all. It has PhD-level knowledge and crèche-level intuition:


A few things are important here for context. That is ChatGPT, and though it is capable of vision, it is not a specialised vision model. Memory exists, and many are exploiting it by poisoning the AI with pre-existing memory. There will be those using this to harvest content that they can sell to the public via social media. When you conduct such an experiment, do so with a clean slate, so to speak.

For optimal visual support you would need a layered approach. YOLO is very good, but it isn't great at contextualising. I talked about this a bit in the thread on the AI cameras that Cape Town wants to roll out.

Hmmm. In layman's terms, and leaving frameworks aside (so ignoring any particular paradigm, which would include ongoing training), the layers are:

1) Detecting multiple objects within a frame (commonly by bounding each object with a box and labelling the box).
2) Classifying the objects, which goes beyond labelling, e.g. facial recognition, specific tagging, etc.
3) Image segmentation. This breaks up an image into identifiable areas or regions, and can be applied within bounding boxes. E.g. in the video it would divide the cup into pixels, helping to identify its orientation and the positions of the open and closed sides.
4) Contextualising the frame, objects and images. In essence, this is the explainer of what is being observed in the whole or in part of the frame.

ChatGPT can't do this. You would require different models for a more accurate answer.
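
To make the four layers above a bit more concrete, here is a minimal toy sketch in Python. It is not how YOLO or any real vision stack works: the "frame" is just a few strings of characters, the classifier is hard-coded, and every name in it (detect, classify, segment, contextualise, describe) is made up for illustration. It only shows how each layer feeds the next, and how the segmentation mask is what lets the last layer say which side of the cup is open.

```python
from dataclasses import dataclass

# Toy "frame": '#' marks object pixels, '.' marks background.
# This one shows a cup held with its opening facing down.
UPSIDE_DOWN_CUP = [
    ".......",
    ".#####.",
    ".#...#.",
    ".#...#.",
    ".......",
]


@dataclass
class Box:
    top: int
    left: int
    bottom: int
    right: int


def detect(frame):
    """Layer 1: bound the object with a box (assumes one object per frame)."""
    rows = [r for r, line in enumerate(frame) if "#" in line]
    cols = [c for line in frame for c, ch in enumerate(line) if ch == "#"]
    return Box(min(rows), min(cols), max(rows), max(cols))


def classify(frame, box):
    """Layer 2: label the box. Hard-coded stand-in for a real classifier."""
    return "cup"


def segment(frame, box):
    """Layer 3: per-pixel mask of the object inside the box."""
    return [
        [ch == "#" for ch in line[box.left:box.right + 1]]
        for line in frame[box.top:box.bottom + 1]
    ]


def contextualise(label, mask):
    """Layer 4: turn the mask into a statement about the scene."""
    top_closed = all(mask[0])
    bottom_closed = all(mask[-1])
    if top_closed and not bottom_closed:
        return f"{label}: opening faces down, flip it over to use it"
    if bottom_closed and not top_closed:
        return f"{label}: opening faces up, ready to use"
    return f"{label}: orientation unclear"


def describe(frame):
    """Run the four layers in order on a single frame."""
    box = detect(frame)
    label = classify(frame, box)
    return contextualise(label, segment(frame, box))


print(describe(UPSIDE_DOWN_CUP))
print(describe(UPSIDE_DOWN_CUP[::-1]))  # same cup, turned the right way up
```

The point of the toy is that the orientation comes from the pixels, not from whatever the person said about the cup beforehand.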

In practice, it is far more complex. There are processes for switching between image and text, using models for encoding, decoding, both, or dual-coding, plus multimodal models, etc. This is also applicable to audio. ChatGPT isn't designed for this.

I am currently experimenting with YOLO, and am having quite a lot of fun. I want to try some transformer stuff, but eh, my RDNA 4 GPU is not up to the task. This is also why edge computing will start to play a larger role in localised deployments.
 
The video I shared shows that the guy even explains to ChatGPT what he has, and explains the setup (effectively an upside-down glass) quite clearly. ChatGPT just couldn’t reason that you can turn an upside-down glass the other way around and then drink from it like normal.

This is exactly like another setup someone on here shared, where AI told them to “walk their car” to the nearby carwash, because it didn’t make sense to drive such a short distance (AI conflating the good idea of walking short distances with the idea that cars can walk).

Image recognition aside, are LLM-based AIs ever going to be able to reason their way out of even trivial problems like these, without an understanding of the physical world and an ability to reason through a problem? I suspect not … although they might get clever at faking it by regurgitating correct answers they are given but don’t understand.

EDIT: I just want to add that I have tested ChatGPT with some text riddles and its reasoning has been spot-on, so your results seem very dependent on the setup (phrasing and order of terms introduced). This is what I got from ChatGPT with my framing of the problem … which might be logical but isn’t practical:
IMG_9128.jpeg
 
LLMs are notoriously bad at interpretation, or simply at comprehension. I was specifically referencing the moment in which he showed the cup to ChatGPT for visual context. It would reference the earlier contextualization as a starting weight. The result is spoiled from the get-go, unless you inform the model that the cup is described in reverse.

A specialized computer vision model will fix this issue autonomously by identifying the cup and segmenting it into layers to determine its shape, form, and orientation. ChatGPT failed at this because it isn't designed for that. To conduct this experiment effectively, he would have to improve his workflow by combining a vision model with an LLM, using a multimodal approach, or by using clearer language that the model can parse more easily. This is why I made my previous post in response to that video. It is exploiting a known LLM fault line, and you can poison it even more.
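
As a rough picture of the "vision model plus LLM" workflow, here is a small Python sketch. It makes no API calls: vision_facts is a hypothetical dictionary that a separate vision model would have produced, and build_prompt is a made-up helper that simply places those measured facts ahead of the user's wording, so the earlier description does not become the starting weight.

```python
def build_prompt(user_text, vision_facts):
    """Put measured facts from the vision stage ahead of the user's wording.

    vision_facts is a hypothetical dict from a separate vision model,
    e.g. {"object": "cup", "opening": "down"}.
    """
    lines = ["Observed by the vision model (trust these over the description):"]
    for key in sorted(vision_facts):
        lines.append(f"- {key}: {vision_facts[key]}")
    lines.append("")
    lines.append("User's description (may be relative to how the object is held):")
    lines.append(user_text)
    lines.append("")
    lines.append(
        "Answer using the observed orientation, and say so if the "
        "description conflicts with it."
    )
    return "\n".join(lines)


prompt = build_prompt(
    "I bought a cup. The top is closed and the bottom is open. How do I use it?",
    {"object": "cup", "opening": "down"},
)
print(prompt)
```

Whether a given LLM then answers correctly is a separate question; the sketch only shows where the vision output would slot in.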

ChatGPT will evolve; in which ways, I can't say. I doubt they will make it heavy on the vision side without vast improvements on the compression side of the stack. As far as output goes, it is crowdsourced, and that would include poor human comprehension of things too.
 
Nice explanation, always good to see experts taking the time to share detailed information in a meaningful way, thank you.

I wonder if we’ll ever find a way to project information from AIs into human brains, so that we can do the processing ourselves. What would that even feel like, to have access to all the world’s knowledge only a thought away? It at least lets humans do what we’re good at (the fuzzy logic, image recognition and intuitive thinking) and lets the AIs do what they’re good at (cataloguing info and making instantaneous suggestions of solutions and trend-spotting), without us competing with each other.
 
The Neuralink chip project?
 
This perfectly explains why I don’t trust AI at all. It has PhD-level knowledge and crèche-level intuition:

Odd. I asked ChatGPT, and it just said:

1️⃣ Flip it upside down

If you turn it upside down, the open bottom becomes the top.
Now you can pour liquid in and drink like a normal cup.
 
I did the same. But it’s possible we’re witnessing AI learning after other people have challenged it on wrong answers.
 
Some LLM models hand off image classification to other models, or they have classification built in. I have experimented extensively with various models and built my own classification models. Both ChatGPT and Claude have performed phenomenally in detecting the particular objects I was testing with. Compression has nothing to do with this (unless you're referring to image compression itself, which is not a model problem per se); these models are already trained on billions of image classification problems. Once the weights have been learnt, the images are discarded.
 
Nice explanation, always good to see experts taking the time to share detailed information in a meaningful way, thank you.

I wish. Every day, I come across a new AI thought leader who has little to no experience in AI, never mind the absence of any track record in computers or networking (infrastructure). I am an AI explorer within my own home lab, applying agentic AI and reasoning to my own workflow. Currently, I am more critical than supportive because many are irresponsibly taking shortcuts with AI. On top of that, I am a big proponent of localised AI and privacy. AI has certainly made me more productive, and given me the ingenuity to explore new fields of science and improve processes.

I have used analytical AI for a long time already. Just for fun, I do persona creation.

I wonder if we’ll ever find a way to project information from AIs into human brains, so that we can do the processing ourselves. What would that even feel like, to have access to all the world’s knowledge only a thought away?

In theory, it's possible, but it would involve humans interacting with a neural network, likely external to the brain. Unless, of course, the human itself is enhanced or otherwise augmented. I just can't see myself having to plug into a wall socket to charge the battery of my neural processor. I have seen The Outer Limits episodes on this.

It at least lets humans do what we’re good at (the fuzzy logic, image recognition and intuitive thinking) and lets the AIs do what they’re good at (cataloguing info and making instantaneous suggestions of solutions and trend-spotting), without us competing with each other.

Agree, but AI is not ready for broad consumption, though it is needed for data. In its current condition, it is best for assistant-driven solutions. There are usable forms of AI, but not for final decision-making.

That said, AI is being hyped hard AF. Whenever I go on the internet or social media, people are being rained on with noise. Everything traditional is being drowned out by it. In the event of a disagreement, quantify the differences by means of ratio. Technology has simplified this process, making AI an incredibly valuable tool even in its current condition. Case in point:


We are rolling out more detection for automation & spam (and a lot more to come). If a human is not tapping on the screen, the account and all associated accounts will likely be suspended—even if you’re just experimenting. While we aim to support legitimate use-cases of agents, this will take some time to do properly. For now, we recommend holding off on plugging in your bots. If it’s critical, you can use the official API.

My greatest fear about AI is not the use of it, but those who believe they have a right to specific prompting. AI should be free of any such form of ownership, and remain accessible for all to use.
 
I did the same. But it’s possible we’re witnessing AI learning after other people have challenged it on wrong answers.
I don't know if they've openly stated that they use RLHF in real time.

Looks like you can use the models as a base and employ RLHF, but not on the core model that's trained by OpenAI (by that I mean I don't think it's going to learn, but I have no idea for sure).
 
OK, I just saw this in my YT feed and I found it very interesting. Massive news: ClawdBot has been acquired by OpenAI.

The video as a whole is fascinating and I agree with the conclusions:
 
Some LLM models hand off image classification to other models, or they have classification built in. I have experimented extensively with various models and built my own classification models.

Remember, the video was added to existing context that came in via voice and was turned into text before being input. In the video, I would assume that it calls on Apple Visual Intelligence. With ChatGPT + AI (yes, that is what Apple calls it), it is more about convenience than technical precision.

Both ChatGPT and Claude have performed phenomenally in detecting the particular objects I was testing with. Compression has nothing to do with this (unless you're referring to image compression itself, which is not a model problem per se); these models are already trained on billions of image classification problems. Once the weights have been learnt, the images are discarded.

ChatGPT and Claude excel at text, but they are both multimodal-capable. Claude can also do OCR. ChatGPT needs additional support. It is important to separate it into text, image, audio and video. For high precision, you would still require image segmentation, and though that can be pre-trained too, pixels don't work like that in real time. Training can help to accelerate that process, but there are also ample scenarios that require it to be context-sensitive.

Say you want to roll out AI ticketing cameras: then it will need to be video. A single image can't determine whether a person has a seatbelt on; it will need to take multiple stills and segment the images to make a high-precision determination, and even then it can still be wrong. I haven't even touched on how such a process will work. Frameworks are important to have here.
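
For the multiple-stills part, a minimal Python sketch of what combining frames could look like. The per-frame scores are assumed to come from some upstream model (not shown), and the function name, the minimum of 5 frames and the 0.8 agreement threshold are all made-up illustrative values, not anything a real ticketing system is known to use.

```python
def seatbelt_verdict(frame_scores, min_frames=5, threshold=0.8):
    """Combine per-frame 'seatbelt worn' scores (0.0 to 1.0) into one verdict.

    Returns "worn", "not worn", or "undetermined".
    """
    total = len(frame_scores)
    if total < min_frames:
        return "undetermined"  # too few stills to judge
    worn_votes = sum(1 for score in frame_scores if score >= 0.5)
    not_worn_votes = total - worn_votes
    if worn_votes / total >= threshold:
        return "worn"
    if not_worn_votes / total >= threshold:
        return "not worn"
    return "undetermined"  # frames disagree, so leave it to a human


print(seatbelt_verdict([0.9]))                            # a single still
print(seatbelt_verdict([0.9, 0.8, 0.95, 0.7, 0.85]))      # frames agree
print(seatbelt_verdict([0.9, 0.1, 0.8, 0.2, 0.6, 0.4]))   # frames disagree
```

The "undetermined" outcome is the important bit: when the stills disagree, the sketch refuses to decide rather than guessing.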

This now boils down to the ecosystem. I can guarantee that anyone here with an Android device using Gemini, holding a glass upside down and telling Gemini via video that they have purchased a cup whose top side is closed and whose bottom is open, and asking how they should proceed, in almost any phrasing, would receive a correct response. Then again, Google's plain LLM models are already good at context, and I wouldn't be surprised if it got the answer correct where ChatGPT did not.

Though I would like anyone here with an iPhone with /cough AI support to do it too.
 