Musings While Building with LLMs
When you’re working with large language models (LLMs) to bring an idea to life, it’s easy to be impressed by how quickly they can write code, generate explanations, or stitch together components. But once you try to build something that actually works end-to-end (a real demo with a frontend, a backend, integrations, and maybe even some bells and whistles), the experience shifts dramatically.
Recently, I decided to push a fun idea all the way to a functioning demo using LLMs as my main assistants. I relied on them to suggest code, fix bugs, structure the project, and more. Along the way, I discovered what works, what really doesn’t, and a few key practices that made the whole process smoother. Most interestingly, many of the “hacks” I stumbled upon turn out to be practical workarounds for deep, theoretically understood behaviors of these models.
Here are some lessons I learned the hard way, backed by a dive into the research explaining why LLMs behave like this.
1. Always Ask for a Directory Structure Right at the Start
Before writing a single line of code, ask the LLM for a full directory structure. Without it, your project becomes a mess. Make the model detail folders, files, and even placeholder comments. This sets the stage and keeps both you and the model aligned on architecture.
While not directly addressed by attention sink research, this relates to the LLM’s limited reasoning capacity and susceptibility to losing track. Providing a clear, stable structure acts as an external “anchor” or scaffold, helping the model (and you) maintain coherence throughout the development process, much like the `<bos>` token might anchor attention within a sequence ([1]).
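To make this concrete, here is the kind of scaffold-first prompt I start with. The `ask_llm` helper below is a hypothetical stand-in for whichever client or chat window you actually use, and the prompt wording is simply what worked for me, not a required format.

```python
# Sketch of a "scaffold first" request. `ask_llm` is a hypothetical stand-in
# for whatever LLM client (or chat window) you actually use.
def ask_llm(prompt: str) -> str:
    raise NotImplementedError("wire this up to your LLM client of choice")

SCAFFOLD_PROMPT = """\
Before writing any code, propose the complete directory structure for this project:

{project_brief}

Rules:
- Show every folder and file as a tree.
- Add a one-line comment per file describing its responsibility.
- Do not write any implementation yet.
"""

def request_scaffold(project_brief: str) -> str:
    # The returned tree becomes the shared anchor that later prompts refer back to.
    return ask_llm(SCAFFOLD_PROMPT.format(project_brief=project_brief))
```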
2. Regenerate Full Files Every Few Steps to Avoid Drift
LLMs are clever, but they can lose track of the big picture pretty quickly. As conversations go on, they might forget earlier details or start to contradict themselves without realizing it.
One of the best things you can do is ask the model to regenerate the entire file after every three or four rounds of changes. It may feel redundant, but it keeps everything internally consistent.
Without this, you will find yourself fixing logic errors that crept in because a function name changed slightly, or an assumption made earlier was silently dropped.
Treat this like refreshing the page. It helps you avoid stale or corrupted states.
This directly combats representational collapse and the “lost in the middle” phenomenon. Research like Barbero et al. (2024, 2025) shows that as sequences get longer or models deeper, token representations can homogenize, making distinct concepts blend together ([1], [2]). Furthermore, studies like Liu et al. (2023) demonstrate that LLMs struggle to recall or accurately use information presented in the middle of long contexts, favoring the beginning and end ([3]). By regenerating the full file, you force the model to re-encode the entire structure and logic based on the current state, effectively “refreshing” its internal representation and mitigating the drift caused by these effects.
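Here is a minimal sketch of that cadence. The three-round threshold, the helper class, and the prompt wording are my own habits rather than anything prescribed:

```python
# Minimal sketch of a "refresh" cadence: after every few edit rounds, ask the
# model to re-emit the whole file instead of another patch. The three-round
# threshold and the prompt wording are assumptions, not a rule.
class FileSession:
    def __init__(self, path: str, contents: str, refresh_every: int = 3):
        self.path = path
        self.contents = contents            # last known-good full version
        self.refresh_every = refresh_every
        self.rounds_since_refresh = 0

    def record_output(self, new_contents: str) -> None:
        # Paste the model's latest full version back in so the next refresh uses it.
        self.contents = new_contents

    def next_prompt(self, change_request: str) -> str:
        self.rounds_since_refresh += 1
        if self.rounds_since_refresh >= self.refresh_every:
            self.rounds_since_refresh = 0
            return (
                f"Here is the current, authoritative version of {self.path}:\n\n"
                f"{self.contents}\n\n"
                f"Apply this change: {change_request}\n"
                "Then output the COMPLETE updated file, not a diff, so nothing drifts."
            )
        return f"In {self.path}, apply this change and show only the affected section: {change_request}"
```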
3. When the LLM Gets Stuck, Step In and Debug Manually
Sometimes the LLM gets stuck in a loop, suggesting similar, ineffective fixes. This is your cue to take over. List potential failure modes, guide the model step-by-step through them, and clearly state what you’ve already tried.
This could be a manifestation of the model getting “stuck” in suboptimal attention patterns or being overly influenced by earlier (potentially incorrect) parts of the context due to phenomena like attention sinks or representational collapse. It might be unable to distinguish between subtly different states or re-evaluate its approach ([1], [2]). Your manual intervention and precise guidance help break this cycle by forcing it to reconsider specific components or logic paths it might otherwise overlook. Think of yourself as redirecting its focus away from collapsed or overly mixed representations.
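When I take over, it helps to hand the model an explicit checklist instead of another round of “it still doesn’t work”. A rough sketch of the prompt shape I fall back to (the error message, suspects, and wording below are purely illustrative):

```python
# A sketch of the "take the wheel" prompt I fall back to when the model loops.
# The error message, suspects, and wording here are purely illustrative.
def debug_prompt(error: str, suspects: list[str], already_tried: list[str]) -> str:
    suspect_lines = "\n".join(f"{i + 1}. {s}" for i, s in enumerate(suspects))
    tried_lines = "\n".join(f"- {t}" for t in already_tried)
    return (
        f"We are stuck on this error:\n{error}\n\n"
        f"Do NOT repeat earlier suggestions. I have already tried:\n{tried_lines}\n\n"
        "Work through these candidate failure modes one at a time, and for each "
        f"one explain how to confirm or rule it out:\n{suspect_lines}"
    )

print(debug_prompt(
    error="TypeError: Cannot read properties of undefined (reading 'id')",
    suspects=[
        "the API returns null before auth finishes",
        "a stale cached response",
        "a wrong variable in the frontend config",
    ],
    already_tried=["restarting the dev server", "regenerating the fetch wrapper"],
))
```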
4. Work Around the Context Window
Even models with huge advertised context windows (like 1 million tokens) often show performance degradation well before that limit; in practice, issues tend to start cropping up around the 50k-100k token mark. Plan your workflow accordingly: break tasks into smaller milestones, and periodically restart the session with a summary of progress so far.
This aligns perfectly with research on effective context length and the “lost in the middle” problem ([3]). Studies show that the effective context length (how much context the model can actually use reliably) is often much smaller than the training context length. The U-shaped performance curve (better at ends, worse in the middle) persists even in very long context models ([3]). Techniques like attention sinks help ([1]), but they don’t fully solve the problem of utilizing distant information. Restarting sessions is a pragmatic way to avoid hitting the point where information loss becomes critical.
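Here is roughly how I keep an eye on that budget. The four-characters-per-token estimate and the 50k soft limit are ballpark assumptions (a real tokenizer such as tiktoken would be more precise), and the handoff prompt is just one way to carry a summary into a fresh session:

```python
# Rough sketch of a context-budget warning light. The ~4 characters/token
# estimate and the 50k soft limit are ballpark assumptions; a real tokenizer
# (e.g. tiktoken) would be more accurate.
SOFT_LIMIT_TOKENS = 50_000

def approx_tokens(text: str) -> int:
    return len(text) // 4  # crude heuristic, good enough for a warning light

def should_restart(conversation: list[str]) -> bool:
    return sum(approx_tokens(message) for message in conversation) > SOFT_LIMIT_TOKENS

def handoff_prompt(milestones_done: list[str], next_milestone: str) -> str:
    """Prompt for the fresh session: carry over a summary, not the whole history."""
    done = "\n".join(f"- {m}" for m in milestones_done)
    return (
        "We are continuing an existing project. Completed so far:\n"
        f"{done}\n\n"
        f"Next milestone: {next_milestone}\n"
        "Ask me for any file you need before changing it."
    )
```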
5. For Authentication and Security, Trust External Resources
LLMs are notoriously bad at generating secure code, especially for critical areas like authentication. Ask for the general flow, but never trust the implementation details. Rely on official documentation, battle-tested libraries (Firebase, Auth0), and established security best practices.
Research consistently shows that LLM-generated code often contains vulnerabilities ([5]). Models are trained on vast amounts of public code, including insecure examples. They lack a true understanding of security consequences. Studies report high percentages of misuse or vulnerabilities in generated code ([5]). Furthermore, LLMs can be susceptible to prompt injection, potentially leading to insecure outputs even from seemingly benign prompts ([4]). Security requires nuanced, context-aware reasoning that current LLMs struggle with. Human oversight and reliance on verified resources are non-negotiable here. The OWASP Top 10 for LLM Applications specifically highlights risks like insecure output handling ([4]).
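For example, rather than accepting an LLM-written token check, I let the Firebase Admin SDK do the verification. This is a minimal sketch assuming `firebase-admin` is installed and you have a service-account key; the file path and the broad exception handling are illustrative only:

```python
# Minimal sketch: verify a Firebase ID token server-side with the Firebase Admin
# SDK (pip install firebase-admin) instead of trusting LLM-written auth logic.
# The key path and the broad exception handling are illustrative only.
import firebase_admin
from firebase_admin import auth, credentials

cred = credentials.Certificate("path/to/serviceAccountKey.json")  # your key, never committed
firebase_admin.initialize_app(cred)

def verify_request_token(id_token: str) -> str | None:
    """Return the authenticated user's uid, or None if the token is invalid."""
    try:
        decoded = auth.verify_id_token(id_token)  # signature and expiry checked by the SDK
        return decoded["uid"]
    except Exception:
        # In real code, catch the SDK's specific error classes and log the failure.
        return None
```

The important part is that the signature and expiry checks happen inside a maintained SDK, not in generated code you have to audit yourself.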
Wrapping Up
Building complex applications with LLMs is an exercise in navigating these inherent limitations. They are incredibly powerful assistants, but not infallible oracles. Understanding why they sometimes fail (mechanisms like attention sinks, representational collapse, or context window effects) makes it easier to anticipate problems and apply effective workarounds.
Guide the LLM, correct it often, verify its outputs (especially security-critical code!), and don’t hesitate to take the wheel. Let the models do the heavy lifting, but remember: you’re still the engineer steering the ship.
References
- [1] Barbero, F., Arroyo, Á., Gu, X., Perivolaropoulos, C., Bronstein, M., Veličković, P., & Pascanu, R. (2025). Why do LLMs attend to the first token? arXiv preprint arXiv:2504.02732.
- [2] Barbero, F., Banino, A., Kapturowski, S., Kumaran, D., et al. (2024). Transformers need glasses! Information over-squashing in language tasks. Advances in Neural Information Processing Systems (NeurIPS).
- [3] Liu, N. F., Lin, K., Hewitt, J., et al. (2023). Lost in the Middle: How Language Models Use Long Contexts. arXiv preprint arXiv:2307.03172.
- [4] OWASP. OWASP Top 10 for Large Language Model Applications.
- [5] Li, Z., et al. (2023). Can ChatGPT Replace StackOverflow? A Study on Robustness and Reliability of Large Language Model Code Generation. arXiv preprint arXiv:2308.10335.