Our large-scale experiment with 19 LLMs reveals that current models degrade documents during delegation: even frontier models (Gemini 3.1 Pro, Claude 4.6 Opus, GPT 5.4) corrupt an average of 25% of document content by the end of long workflows, with other models failing more severely. Additional experiments reveal that agentic tool use does not improve performance on DELEGATE-52, and that degradation severity is exacerbated by document size, length of interaction, or presence of distractor files. Our analysis shows that current LLMs are unreliable delegates: they introduce sparse but severe errors that silently corrupt documents, compounding over long interaction.
Some of this is just going to be the big players playing the game and defending their own ecosystems. This comes FROM Microsoft after all, which is really bitter no doubt that nobody is rating their efforts with Copilot.
Promptcowboy to create my prompt
Prompt created, I paste it into Claude
Once I have used up my free Claude credits
Drag and drop the project into Deepseek and continue for the rest of the day
I'm sure you were not supposed to have dozens of open chats and dozens of long-running chats that go on for days?
Also worth knowing for the uninitiated... be clear as **** on what you actually want and what you are going to do, otherwise the LLM will assume. Hell, it does not have your context, nor does it live in your mind and code.
To add: "hey Claude, be Jesus, take the wheel" - you are going to get ****ed and it doesn't matter if you are on Opus or not.
Giving Deepseek a go and so far it's pretty good. Gemini Pro was great up until I hit about 300 lines of code, after which it refused to return complete code after multiple changes, offering only snippets with little context. It tells me my code is taxing its memory, I tell it that 300 lines of code is barely anything...
Proceeds to give me code with multiple strings of code completely changed.
Gemini is very well priced and great to get something going, but you hit a very obvious wall quite soon. Hopefully it gets fixed because I like how it works.
In other news: seems opus 4.7 can now reach its limits 'too soon' - even on max:
Had a chuckle as I did that in less than one hour for the one before the 1pm reset.
All I basically did was run the Sonnet 4.6 CLI to do an initial analysis of work done against the provided markdown for some unit tests I created/revised, and then had Opus read and check against the running workbook I had for it.
I assume that happened because of Sonnet firing off its own agents and my having asked Opus to make a condensed conversation catch-up markdown.
Still actually just ****ing around with a web chat driving the process and the CLI doing some actions for it, like analysis. Of course you need to be strict with the CLI, otherwise it will do **** you don't want, and like any LLM it needs clear and concise messages...
Kinda works, but still best if these tools augment the process and not 'Jesus take the wheel' / you end up being a glorified proofreader knowing ****. Lol.
Anthropic has been nerfing limits every week, even on Max accounts. I am currently using a ProX5 on OpenAI, but will upgrade to 20x. Limits are exceptionally generous on those. I've also applied to OpenAI for OSS support on the project I'm working on, just need to see if they accept.
Fake news, you can't vibe code something that doesn't exist.
Meanwhile detectives are on the hunt for A records that were stolen from DNS. We will work through the night and keep the public updated...