Musings While Building with LLMs

When you are working with large language models (LLMs) to bring an idea to life, it is easy to be impressed by how quickly they can write code, generate explanations, or stitch together components. However, when you try to build something that actually works end to end—a real demo with a frontend, a backend, integrations, and a few extra features—the experience shifts dramatically.

Recently, I decided to push a fun idea all the way to a functioning demo using LLMs as my main assistants. I relied on them to suggest code, fix bugs, structure the project, and more. Along the way, I discovered what works, what really does not, and a few key practices that made the whole process smoother. Even more interesting, many of the “hacks” I stumbled upon turn out to be practical workarounds for model behaviors that the research literature already describes.

Here are some lessons I learned the hard way, backed by a dive into research that explains why LLMs behave in certain ways.

1. Always Ask for a Directory Structure Right at the Start

Before writing a single line of code, ask the LLM for a full directory structure. Without it, your project can quickly become a mess. Ask the model to detail the folders, files, and even include placeholder comments. This sets the stage and keeps both you and the model aligned on the architecture.

Although not directly addressed by attention sink research, this practice relates to the LLM’s limited reasoning capacity and its tendency to lose track over time. Providing a clear, stable structure acts as an external anchor or scaffold, helping both the model and you maintain coherence throughout development. This is similar in concept to how the <bos> token might anchor attention within a sequence ([1]).
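As a concrete example, here is a minimal sketch of how I turn the model’s proposed layout into files on disk before writing any real code. The layout below is purely hypothetical; substitute whatever structure your model proposes, and let it fill in the placeholder contents later.

```python
# scaffold.py -- materialize the directory layout the LLM proposed, so the
# agreed-upon structure exists on disk before any real code is written.
# The LAYOUT entries are illustrative placeholders, not a recommended architecture.
from pathlib import Path

LAYOUT = [
    "backend/app/main.py",
    "backend/app/routes/__init__.py",
    "backend/tests/test_routes.py",
    "frontend/src/index.tsx",
    "frontend/src/components/App.tsx",
    "docs/architecture.md",
]

def scaffold(root: str = ".") -> None:
    for rel in LAYOUT:
        path = Path(root) / rel
        path.parent.mkdir(parents=True, exist_ok=True)  # create intermediate folders
        path.touch(exist_ok=True)  # empty placeholder file, to be filled in later

if __name__ == "__main__":
    scaffold()
```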

2. Regenerate Full Files Every Few Steps to Avoid Drift

LLMs are clever, but they can lose sight of the big picture quickly. As conversations continue, they might forget earlier details or begin to contradict themselves without realizing it.

One effective strategy is to ask the model to regenerate the entire file after every three or four rounds of updates or modifications. Although it might feel redundant, this approach keeps everything internally consistent.

Without this practice, you may find yourself fixing subtle logic errors caused by minor changes—a function name might change slightly or an earlier assumption might be silently dropped. Think of this process as refreshing the page; it helps you avoid stale or corrupted states.

This technique directly combats representational collapse and the “lost in the middle” phenomenon. Research such as Barbero et al. (2024, 2025) shows that as sequences get longer or models become deeper, token representations tend to homogenize, causing distinct concepts to blend together ([1], [2]). Additionally, studies like Liu et al. (2023) demonstrate that LLMs struggle to recall or accurately use information presented in the middle of long contexts, favoring the beginning and the end ([3]). Regenerating the full file forces the model to re-encode the entire structure and logic based on the current state, effectively refreshing its internal representation and mitigating drift.
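Here is a rough sketch of how this can be scripted. The `call_llm` callable is a placeholder for whichever client you actually use, and the three-round threshold is simply the cadence that worked for me.

```python
# regen.py -- every few edit rounds, ask the model to re-emit the whole file from
# its current on-disk state instead of patching it incrementally.
from pathlib import Path
from typing import Callable

REGEN_EVERY = 3  # regenerate after this many incremental edit rounds

def maybe_regenerate(path: Path, rounds_since_regen: int,
                     call_llm: Callable[[str], str]) -> int:
    """Bump the round counter; rewrite the whole file once it hits the limit."""
    rounds_since_regen += 1
    if rounds_since_regen < REGEN_EVERY:
        return rounds_since_regen
    current = path.read_text()
    prompt = (
        f"Here is the current, authoritative version of {path.name}:\n\n"
        f"{current}\n\n"
        "Re-emit the COMPLETE file, keeping all existing behavior, with "
        "consistent names and nothing omitted or summarized."
    )
    path.write_text(call_llm(prompt))  # call_llm wraps whatever client you use
    return 0  # reset the counter after a full regeneration
```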

3. When the LLM Gets Stuck, Step In and Debug Manually

Sometimes the LLM gets stuck in a loop, suggesting similar, ineffective fixes. When this happens, take over. List all potential failure modes, guide the model through each one step by step, and clearly state what you have already tried.

This issue may be a manifestation of the model getting locked into suboptimal attention patterns or being overly influenced by earlier parts of the context due to phenomena like attention sinks or representational collapse. It might struggle to distinguish between subtly different states or re-evaluate its approach. Your manual intervention and precise guidance help break this cycle by forcing the model to reconsider specific components or logic paths it might otherwise overlook. In other words, you redirect its focus away from collapsed or overly mixed representations.
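In practice, I turn this manual intervention into a structured prompt so the model cannot gloss over hypotheses or resuggest old fixes. A minimal sketch, with purely illustrative failure modes and notes:

```python
# debug_prompt.py -- turn a stuck debugging session into a structured prompt that
# forces the model to work through one hypothesis at a time.
# The failure modes and "already tried" notes below are examples only.

failure_modes = [
    "CORS misconfiguration between frontend and backend",
    "Stale environment variable not picked up after restart",
    "Race condition in the async handler",
]
already_tried = [
    "Restarted the dev server and cleared the cache",
    "Verified the request reaches the backend (logs show 200)",
]

def build_debug_prompt(error_text: str) -> str:
    modes = "\n".join(f"{i + 1}. {m}" for i, m in enumerate(failure_modes))
    tried = "\n".join(f"- {t}" for t in already_tried)
    return (
        f"We are stuck on this error:\n{error_text}\n\n"
        f"Do NOT repeat fixes from this list, which we already tried:\n{tried}\n\n"
        f"Work through these hypotheses ONE AT A TIME, and for each one explain "
        f"how to confirm or rule it out before proposing a fix:\n{modes}"
    )
```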

4. Work Around the Context Window

LLMs operate within a finite context window, which limits how much information they can process coherently at any given time. Although recent models have shown remarkable progress in handling long contexts (for example, one of the key differentiators for Gemini models is their one million token context window), there is a consensus in the industry that performance degrades as the context length increases.

Practical experience suggests that beyond a certain point, typically between 50,000 and 100,000 tokens, issues begin to surface. The model may start mixing information, lose track of earlier logic, or even introduce bugs that seem to appear out of nowhere.

Plan your workflow with this limitation in mind. Break your project into smaller, self-contained milestones that can each be completed within a reasonable context window. Once you approach that upper limit, or if you notice obvious degradation in performance, start a fresh session. Summarize what you have accomplished so far—perhaps by copying over key files or notes—and continue from there. This practice can help you avoid the phenomenon where the project becomes “haunted” by forgotten or corrupted context.
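A rough sketch of how I keep an eye on the budget. The four-characters-per-token ratio and the 50,000-token soft limit are crude heuristics drawn from my own experience, not exact figures; use your provider’s tokenizer if you need precision.

```python
# context_budget.py -- rough guard for deciding when to checkpoint and start a
# fresh session, plus a helper for building the summary you carry over.

SOFT_LIMIT_TOKENS = 50_000  # where degradation tends to start in my experience

def estimate_tokens(text: str) -> int:
    return len(text) // 4  # crude approximation: ~4 characters per token

def should_restart_session(conversation_log: str) -> bool:
    return estimate_tokens(conversation_log) > SOFT_LIMIT_TOKENS

def checkpoint_summary(key_files: dict[str, str], notes: str) -> str:
    """Build the summary to paste at the top of a fresh session."""
    parts = [f"Project notes so far:\n{notes}\n"]
    for name, content in key_files.items():
        parts.append(f"--- {name} (current version) ---\n{content}\n")
    return "\n".join(parts)
```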

5. For Authentication and Security, Trust External Resources

LLMs are notorious for generating insecure code, especially when it comes to authentication and security. While you can ask the LLM for a general flow or a rough outline, never rely solely on its implementation details. Instead, consult official documentation, use battle-tested libraries (such as Firebase or Auth0), and adhere strictly to established security best practices.

Research consistently shows that LLM-generated code often contains vulnerabilities ([5]). Models are trained on vast amounts of public code, which includes insecure examples. They lack a deep understanding of security implications. Studies report high rates of misuse or vulnerabilities in generated code ([5]). Moreover, LLMs can be susceptible to prompt injection, leading to insecure outputs even from benign prompts ([4]). Human oversight and reliance on verified resources are essential.
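For example, rather than accepting hand-rolled JWT parsing from the model, delegate token verification to the library. A minimal sketch assuming the Firebase Admin SDK is installed (`pip install firebase-admin`) and application-default credentials are configured:

```python
# verify_token.py -- let a battle-tested SDK do the security-critical work
# instead of trusting LLM-generated token-parsing code.
import firebase_admin
from firebase_admin import auth

firebase_admin.initialize_app()  # picks up application-default credentials

def get_verified_uid(id_token: str) -> str | None:
    """Return the user's uid if the ID token is valid, otherwise None."""
    try:
        decoded = auth.verify_id_token(id_token)  # signature, expiry, issuer checked by the SDK
        return decoded["uid"]
    except Exception:
        # Treat any verification failure as unauthenticated; never fall back to trusting the client.
        return None
```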

Wrapping Up

Building complex applications with LLMs is an exercise in navigating inherent limitations. They are incredibly powerful assistants, but not infallible oracles. Understanding why they sometimes fail—due to mechanisms like attention sinks, representational collapse, or context window effects—helps you anticipate problems and apply effective workarounds.

Guide the LLM, verify its outputs (especially for security-critical code), and do not hesitate to take the lead when necessary. Let the models handle the heavy lifting, but remember that you are still the engineer steering the ship.

References

  1. Barbero, F., Arroyo, Á., Gu, X., Perivolaropoulos, C., Bronstein, M., Veličković, P., & Pascanu, R. (2025). Why do LLMs attend to the first token? arXiv preprint arXiv:2504.02732.
  2. Barbero, F., Banino, A., Kapturowski, S., Kumaran, D., et al. (2024). Transformers need glasses! Information over-squashing in language tasks. Advances in Neural Information Processing Systems.
  3. Liu, N. F., Lin, K., Hewitt, J., et al. (2023). Lost in the Middle: How Language Models Use Long Contexts. arXiv preprint arXiv:2307.03172.
  4. OWASP. OWASP Top 10 for Large Language Model Applications.
  5. Li, Z., et al. (2023). Can ChatGPT Replace StackOverflow? A Study on Robustness and Reliability of Large Language Model Code Generation. arXiv preprint arXiv:2308.10335.