
Testing Where LLMs Actually Break

What can’t AI do yet?

Why I started poking at this

Lately, while experimenting with LLMs, I’ve had the strong feeling that they can do almost anything. GPT-4o’s image generation in particular is absurdly good right now. That said, I’m not actually that interested in image generation—I don’t really have many pictures I want it to make anyway 😂

What I care about much more is text generation, because that’s still the core job of an LLM. AI is already very strong at solving problems, but if it still hasn’t replaced people on a massive scale, then there must be things it still fundamentally can’t do. I wanted to explore that boundary a bit.

The problem with analyzing very large bodies of text

A lot of current LLMs advertise long context windows, but that “long” usually only means something on the scale of tens of thousands of words. That’s enough for reading one paper, or maybe a few articles. But if you want it to analyze hundreds of articles, things get much less workable.

For example, I wanted an AI to read everything I’ve written on my blog and then give me an evaluation of myself.

My blog already has more than a hundred posts. I had previously built full-text search, so every post could be pulled from search.json, which made it a good source file for AI analysis. But feeding the whole thing directly into a model's context wasn't realistic. The JSON file is around 1 MiB, while even many of the stronger models only offer a little over 100k tokens of context. They simply can't read it all in one pass.
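To put rough numbers on that gap, here is a minimal sketch. It assumes the "100k" figure is a token count and that text averages about 3 bytes per token; that ratio is a guess for illustration (real tokenizers vary a lot by language and by how much JSON punctuation is in the file), and the function names are made up for this example.

```python
# Back-of-the-envelope check: does a file of a given size fit in a context window?
# ASSUMPTION: ~3 bytes per token is a rough guess, not the output of a real tokenizer.


def estimate_tokens(num_bytes: int, bytes_per_token: float = 3.0) -> int:
    """Return a rough token count for a payload of num_bytes."""
    return int(num_bytes / bytes_per_token)


def fits_in_context(num_bytes: int, context_tokens: int,
                    bytes_per_token: float = 3.0) -> bool:
    """True if the estimated token count is within the context window."""
    return estimate_tokens(num_bytes, bytes_per_token) <= context_tokens


SEARCH_JSON_BYTES = 1 * 1024 * 1024  # "around 1 MiB", per the post
CONTEXT_TOKENS = 100_000             # "a little over 100k", rounded down

print(estimate_tokens(SEARCH_JSON_BYTES))
print(fits_in_context(SEARCH_JSON_BYTES, CONTEXT_TOKENS))
```

Under these assumptions the file comes out a few times larger than the window, so no amount of prompt trimming would make it fit as-is.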

I also tried some models that claim extremely long context, for example one with a 10M-token context window. But the output quality was poor, and it didn't seem to draw on the earlier material properly 😓

I tried another route too: giving the AI article content as attachments. That seemed to work by chunking the file and then reading pieces of it, probably something similar to RAG—retrieving the most relevant text fragments based on the question. That works if the task is to answer something specific, but it doesn’t solve the problem of analyzing all articles as a whole.
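A toy version of that chunk-and-retrieve behaviour makes the limitation easier to see. This is only a sketch of the general idea: real systems typically rank chunks with embeddings, whereas this example swaps in plain word overlap, and all the function names and sample chunks are invented for illustration.

```python
# Toy chunk-and-retrieve: split text into pieces, keep only the pieces that
# best match the question. ASSUMPTION: word overlap stands in for the
# embedding similarity a real RAG pipeline would use.


def chunk_text(text: str, size: int) -> list:
    """Split text into consecutive pieces of at most `size` characters."""
    return [text[i:i + size] for i in range(0, len(text), size)]


def score(chunk: str, question: str) -> int:
    """Count how many distinct question words also appear in the chunk."""
    return len(set(question.lower().split()) & set(chunk.lower().split()))


def retrieve(chunks: list, question: str, k: int = 2) -> list:
    """Return the k highest-scoring chunks; everything else is never seen by the model."""
    ranked = sorted(chunks, key=lambda c: score(c, question), reverse=True)
    return ranked[:k]


chunks = [
    "cloudflare workers broke my click counter",
    "raspberry pi runs android",
    "my year-end review about sleep",
]
print(retrieve(chunks, "what happened with cloudflare workers", k=1))
```

A specific question pulls back the one matching chunk. A question like "what is this author like overall?" matches nothing in particular, so whichever chunks get returned are close to arbitrary and the rest of the writing is never read.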

I also tested some agents, but those mostly just wanted to write code to analyze my posts: graphs of article length over time, article counts by year, word frequency analysis, and so on. For me, that was basically useless 😅

A workaround: let AI read AI summaries

So was there really no way around this?

Around the same time, something else happened. Cloudflare started acting weird. Requests from Workers to my D1 database would randomly throw an internal error. I even posted about it on their forum, and of course nobody responded 😅 At that point I suddenly realized that I didn't really have a replacement for Cloudflare at all. If something broke, I was just stuck 😰

That issue took down several things on my blog: AI summaries, article recommendations, and the click counter. It was a good reminder that I shouldn’t depend so heavily on Cloudflare Workers.

The click counter didn’t have an obvious fix, but the AI summaries were different. Once a summary is generated for a post, it basically never changes. So instead of requesting it every time, why not periodically export those summaries and cache them locally on the blog itself?

That would make summaries display instantly, and because no API request is needed, it would also avoid Cloudflare-related failures. So I exported the summary data from the database into ai-cache.json. If a summary already exists there, the site skips the API request entirely.
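The lookup logic is simple enough to sketch. This is not the blog's actual code (which presumably runs in the browser); it is a Python illustration that assumes ai-cache.json is a flat object mapping a post identifier to its summary text. The function names and the `fetch_remote` callback are hypothetical.

```python
import json
from typing import Callable, Dict

# ASSUMPTION: ai-cache.json looks like {"<post id>": "<summary text>", ...}.
# The real file layout is not described in the post.


def load_cache(path: str) -> Dict[str, str]:
    """Load the exported summaries; fall back to an empty cache if the file is missing."""
    try:
        with open(path, encoding="utf-8") as f:
            return json.load(f)
    except FileNotFoundError:
        return {}


def get_summary(post_id: str, cache: Dict[str, str],
                fetch_remote: Callable[[str], str]) -> str:
    """Return the cached summary if present; only call the remote API on a miss."""
    cached = cache.get(post_id)
    if cached:
        return cached
    return fetch_remote(post_id)


# Usage with an in-memory cache and a stand-in for the API call.
demo_cache = {"/posts/example-a": "Cached summary."}
print(get_summary("/posts/example-a", demo_cache, lambda pid: "Fetched summary."))
print(get_summary("/posts/example-b", demo_cache, lambda pid: "Fetched summary."))
```

Because the cache is checked first, an outage on the API side only affects posts whose summaries have not been exported yet.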

And after doing that, I realized I had accidentally built the perfect way to let AI read all my writing.

Instead of feeding it every full article, I could feed it the summaries AI had already generated for every article, then ask it to summarize those and evaluate me from there. In other words: let AI read AI’s summaries of my writing. That gives it a much better shot at understanding my whole body of work.

And the best part is that this summary file is only about 100 KiB—small enough to fit into the context window nicely.
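Turning that file into a single prompt could look something like the sketch below. The summary texts, post identifiers, and instruction wording are placeholders, and it again assumes the cache is a simple mapping from post identifier to summary; the size printed at the end is just a sanity check before pasting the result into a model.

```python
from typing import Mapping

# ASSUMPTION: summaries arrive as {"<post id>": "<summary text>", ...}.
# The sample entries below are invented stand-ins, not real blog data.


def build_prompt(summaries: Mapping[str, str], instruction: str) -> str:
    """Join an instruction and all per-post summaries into one prompt string."""
    sections = [instruction.strip(), ""]
    for post_id, summary in summaries.items():
        sections.append(f"[{post_id}]")   # label each summary with its post
        sections.append(summary.strip())
        sections.append("")               # blank line between posts
    return "\n".join(sections)


sample = {
    "/2024/01/01/example-a.html": "A post about moving a script to Cloudflare Workers.",
    "/2024/02/01/example-b.html": "A year-end review reflecting on sleep and plans.",
}
prompt = build_prompt(sample, "Read these summaries and describe the author.")
print(len(prompt.encode("utf-8")), "bytes")
```

Labelling each summary with its post identifier is optional, but it lets the model refer back to specific posts instead of treating the input as one undifferentiated blob.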

Originally I wanted to try using DeepSeek for this, but apparently some keyword triggered a refusal and it wouldn’t generate the result 🤣 So I ended up asking GPT-4o to do it.

The result was honestly pretty impressive. The structure was clear, and unlike older ChatGPT output, it didn't feel overwhelmingly "AI-written." This time it actually sounded pretty human, which surprised me. So I'm sharing what it wrote below, so you can see what AI thinks of me 🤣

A Technical Drifter, a Digital Nomad: Observations on a Blogger

Among the vast sea of Chinese-language technology blogs, some writers resemble faint yet resilient stars, shining alone in their own orbit. They do not necessarily chase trends, nor do they always present themselves as authorities, but the independence, technical enthusiasm, and attentiveness to reality that emerge in their writing are often more compelling than the many flashy “tutorial-style” blogs. The blogger discussed here, Mayx, is one such figure: both programmer and “life hacker”; attentive to device performance as well as technological ethics; fond of practical tools yet equally capable of personal reflection.

1. Technology as the body, thought as the soul

Technical articles make up the overwhelming majority of Mayx’s blog. From homemade email subscription scripts, Cloudflare Workers automation, and notes on internal network penetration, to experiments with low-power development boards, running AI models locally, and in-depth experiences with Hackintosh and Linux, the topics span many parts of the current mainstream technical ecosystem.

Yet he is not the kind of technical writer who shows off for the sake of showing off. On the contrary, most of his writing is grounded in pragmatism—he cares about cost-performance ratio, power consumption, stability, and openness, rather than chasing technology for its own sake. For example, when discussing Hackintosh, he did not become obsessed with whether macOS could be made to run, but instead carefully pointed out the gap between that experience and a real Mac. When experimenting with AI models, he chose a path that balanced performance and cost instead of blindly pursuing the biggest models and the most powerful GPUs.

His technical exploration usually begins with an actual need. When an old check-in script stopped working, he tried replacing it with Cloudflare Workers; when his blog was blocked by GitHub, he started researching anti-censorship architectures himself; when Heroku shut down, he quickly moved to Koyeb and noted how easy it was to use. This reflects an engineer’s habit of solving problems hands-on, but also a certain skepticism toward ready-made tools and platforms—nothing is irreplaceable, but nothing is flawless either.

2. Independent, reflective, and a little rebellious

It is easy to sense a certain distance from, or even resistance to, mainstream technology discourse when reading Mayx’s blog. He does not place much trust in so-called authoritative recommendations, rarely cites big-name influencers, questions paid tools, remains wary of closed platforms, and openly complains about advertising and forced apps. In multiple posts about the BaoTa panel, he not only criticized its bloated feature set and inflated pricing, but also argued at the code level that its technical standards were limited. In another post, after Server酱 introduced charges, he built his own notification platform and strongly conveyed the view that developers should not be paying for functionality like that in the first place.

This can be seen as a kind of digital libertarian spirit: valuing individual choice, control, and creativity, while remaining suspicious of the lazy convenience brought by platformization and commercialization. That also explains his interest in containers, virtualization, i2p, VPNs, DNS pollution avoidance, and anti-hotlinking countermeasures. These are not just technical experiments—they are also a form of resistance: resistance to surveillance, to platform lock-in, and to digital domestication.

At the same time, he is also deeply self-reflective. In several year-end reviews, he openly admits that irregular sleep hurt his health, gaming disrupted his plans, and a disordered life rhythm left him directionless. Those candid passages reveal the more human side of a technical person: not everyone lives in a state of perfect discipline and efficient execution. Faced with reality and anxiety, he does not avoid them, but tries to find some kind of balance.

3. Islands of exploration and a technological utopia

If Mayx’s blog were compared to an island in the digital world, then he would be the watchman standing on it. He stubbornly maintains his own servers, reverse proxies, script schedules, and open-source tools; he repeatedly tries to revive discarded computers, aging development boards, and Linux containers; he even experiments with making an FM radio station on his own, or running OpenFyde and Android on a Raspberry Pi.

This is not just about reusing electronic waste. It also reflects a kind of technological utopian imagination—a small universe that does not need to rely on mainstream supply chains, does not need to trust giant platforms, and can remain under one’s own control in every detail.

The same tendency appears in his interest in decentralized technologies such as xLog, IPFS, i2pd, and ZeroNet. Even though he openly acknowledges that these systems are “not yet mature,” “content-poor,” and “rough around the edges,” he still deploys them and experiments with them enthusiastically. The question he seems to keep asking is: “If not platforms, then what else can we depend on?”

That is also what distinguishes his blog from many other technical blogs: it is not “here’s how to do this,” but rather “here are some other possibilities worth looking at.”

4. Personal difficulty and technological consolation

It is also hard to deny that a certain loneliness and struggle run through Mayx’s blog. More than once he mentions discomfort with the pace of society, uncertainty about life goals, and confusion about the future. In his 2022 and 2023 year-end summaries, he even expressed concern about the risk of human extinction and global disorder. When discussing personal plans, he repeatedly writes about lacking self-discipline and wanting to do too many things while not quite having the strength to do them.

But it is precisely under those conditions that technology becomes a refuge. Through repeated experimentation he regains a sense of order; through coding he gains a sense of control; through deployment he experiences the satisfaction of solving problems. This posture—using technology to push back against the disorder of life—gives his blog a certain psychological meaning.

5. In defense of the “clumsy method”

Overall, Mayx is not a star in the technology world, nor an opinion leader. But he is a respectable defender of the “clumsy method”: in a technical media environment dominated by shortcuts and traffic, he still chooses to build his understanding through careful experiments, plain language, and extensive records of failure.

His blog does not offer “answers”; it offers “routes.” It does not pursue success narratives; it documents failure and persistence. He is not trying to become someone else—he is trying to remain himself.

That may be the most thought-provoking thing his writing leaves behind. Technology is not merely a professional tool; it can also be a path for self-construction, self-understanding, and self-repair.

Perhaps that is the road Mayx is walking—alone, but steady.

When the code is short but the logic is still too tangled

After having AI analyze my writing, I remembered an old forum engine I had written a long time ago: Mabbs. I had once planned to refactor it, but after learning other languages, I kind of lost interest in it 😂

Still, with AI around now, I started wondering whether I could just let it handle the refactor.

My old code is terrible in terms of readability, and the coupling is very high, but I assumed that shouldn’t be a big deal for AI. The total code size is only 22 KiB, so it should easily fit in context. So I started trying various models, asking them to turn the code into something human-readable and then refactor it.

The results were extremely disappointing.

No matter which AI I tried, they could only produce a little bit of code before acting as if the job was done. Grok 3 didn’t even write any code at all 😆 Some other models did appear to write a few fragments, but when I looked at them closely, the output had very little to do with my original code. It was more like they analyzed a small portion and then rewrote the whole thing from scratch.

The weird part is that the code itself isn’t even long. So why couldn’t any of them accurately refactor it?

Maybe the problem is that, while the file is short, the variable names are also extremely short. If those names were expanded to something a human could understand, the code might effectively grow too large for the model's context window, causing it to lose track of earlier parts. Another possibility is that there simply isn't as much high-quality Shell code available online, so the models don't have enough prior knowledge to refactor Shell well.

Right now I don’t really have a good solution for getting AI to handle this kind of task. Maybe when it can solve a problem like this, that’s when it will really be capable of replacing humans 😁

Even though the refactor didn’t work out, I still thought it might be fun to let other people try this old forum engine for themselves. So I made a Docker image for it. If anyone wants to play with it, they can just download and run it. The whole image is only a little over 2 MiB, so calling it the smallest forum engine in the world probably wouldn’t be too outrageous 🤣

What this seems to say about the ceiling of LLMs

At this point, it really looks to me like the current ceiling of LLMs is still the context window.

That limitation blocks a surprising number of things. And there doesn’t seem to be a clean way around it. Because of that constraint, AI still can’t look over a whole system the way a human can, which is one big reason it still can’t replace people. Even if you compress earlier material in clever ways, a lot of detail still gets lost.

And under the current LLM architecture, this may simply not be solvable yet.

If one day AI could modify its own weights during the thinking process, then maybe it could achieve something like truly unbounded context. If that ever happens, maybe that’s when it will break past the current ceiling—and finally become capable of replacing humans.