Vibecoding: a beginner's guide

New paper by Microsoft Research:

Our large-scale experiment with 19 LLMs reveals that current models degrade documents during delegation: even frontier models (Gemini 3.1 Pro, Claude 4.6 Opus, GPT 5.4) corrupt an average of 25% of document content by the end of long workflows, with other models failing more severely. Additional experiments reveal that agentic tool use does not improve performance on DELEGATE-52, and that degradation severity is exacerbated by document size, length of interaction, or presence of distractor files. Our analysis shows that current LLMs are unreliable delegates: they introduce sparse but severe errors that silently corrupt documents, compounding over long interaction.
Some of this is just going to be the big players playing the game and defending their own ecosystems. This comes FROM Microsoft after all, which is really bitter no doubt that nobody is rating their efforts with Copilot.
 
Since I started using this, the quality of my projects has gone up in strides - https://www.promptcowboy.ai/

My workflow:
Use promptcowboy to create my prompt.
Paste the finished prompt into Claude.
Once I have used up my free Claude credits, drag and drop the project into Deepseek and continue there for the rest of the day.
 
I'm sure you were not supposed to have dozens of open chats and dozens of days-long running chats?

Also worth knowing for the uninitiated... be clear as **** about what you actually want and what you are going to do, otherwise the LLM will assume. Hell, it does not have your context, nor does it live in your mind and code.

To add: go "hey claude be jesus take the wheel" and you are going to get ****ed, and it doesn't matter if you are on opus or not.
 
Copilot is not a model, it's an agent that uses ChatGPT.

Also, this is a research paper, not some opinion piece in Huisgenoot.
 
It's not about time or prompt specificity, it's about interactions. From the title page of the paper:

[Image: screenshot from the title page of the paper]
 
Yes, cigarette companies never paid for research papers published in reputable journals that proved that smoking is safe for you. Oh wait...
 
Giving Deepseek a go and so far it's pretty good. Gemini Pro was great up until I hit about 300 lines of code, after which it refused to return complete code after multiple changes, offering only snippets with little context. It tells me my code is taxing its memory; I tell it that 300 lines of code is barely anything...

[Attached image: IMG_20260425_113357.png]

It then proceeds to give me code with multiple strings of code completely changed.

Gemini is very well priced and great for getting something going, but you hit a very obvious wall quite soon. Hopefully that gets fixed, because I like how it works.
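For anyone who wants to catch this kind of silent change before pasting the result back into a project, here is a minimal sketch of the check I mean. It uses only Python's standard `difflib` module. The function names `changed_line_ratio` and `unexpected_changes` are made up for illustration, and it assumes you saved a copy of the code you sent to the model. It compares line by line, so it will not tell you whether a change was good or bad, only that something moved.

```python
import difflib


def changed_line_ratio(before: str, after: str) -> float:
    """Fraction of the original lines that were altered or dropped."""
    old = before.splitlines()
    new = after.splitlines()
    if not old:
        return 0.0
    matcher = difflib.SequenceMatcher(a=old, b=new, autojunk=False)
    # Sum the sizes of all runs of lines that survived unchanged.
    kept = sum(block.size for block in matcher.get_matching_blocks())
    return 1 - kept / len(old)


def unexpected_changes(before: str, after: str) -> list[str]:
    """Unified diff (no context lines) between what was sent and what came back."""
    return list(
        difflib.unified_diff(
            before.splitlines(),
            after.splitlines(),
            fromfile="sent",
            tofile="returned",
            n=0,
            lineterm="",
        )
    )


# Hypothetical example: the model was asked to leave this file alone.
sent = "a = 1\nb = 2\nc = 3\nd = 4\n"
returned = "a = 1\nb = 20\nc = 3\nd = 4\n"

print(changed_line_ratio(sent, returned))
for line in unexpected_changes(sent, returned):
    print(line)
```

If the ratio is well above what the requested change should have touched, read the diff before accepting the output.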
 
Gemini is trash.
 
In other news: seems opus 4.7 can now reach its limits 'too soon' - even on max:

Had a chuckle, as I did that in less than one hour for the one before the 1pm reset.

All I basically did was run the sonnet 4.6 cli to do an initial analysis of the work done against the provided markdown for some unit tests I created/revised, and then had opus read it and check it against the running workbook I had for it.

I assume that happened because of sonnet firing off its own agents, and because I asked opus to make a condensed conversation catch-up markdown.

Still actually just ****ing around with a web chat driving the process and the cli doing some actions for it, like analysis. Of course you need to be strict with the cli, otherwise it will do **** you don't want, and like any llm it needs clear and concise messages...

Kinda works, but it is still best if these tools augment the process rather than 'jesus take the wheel', or you end up being a glorified proofreader knowing ****. Lol.
 
Anthropic has been nerfing limits every week, even on max accounts. I am currently using a ProX5 on OpenAI, but will upgrade to 20x. Limits are exceptionally generous on those. I've also applied to OpenAI for OSS support on the project I'm working on; I just need to see if they accept.
 