Recently a friend suggested an interesting idea: what if Cursor-like auto-complete for code were available for a completely different domain, namely Indian law, since advocates-on-record spend a significant chunk of their time drafting? This took me down a rabbit hole for a few weeks, and I found some interesting things.
I took a dataset from the legal domain and did a LoRA fine-tune for left-to-right (auto-regressive) generation. It seemed obvious to use the base model rather than the instruct model, since instruct models are optimised to "chat" and follow instructions. The results were decent but not great: the model kept getting important things like section numbers of laws wrong. I had a hunch that we needed to remove the "memory" of code/math etc. from the model and instead introduce memory of the legal material, so I started a full fine-tune of the raw pre-trained Llama 3.2 3B base model. The good news was that it now got some section numbers correct. Since training on a single A100 on Google Colab would have needed 60-80 hours and Colab kept shutting down the kernel, I had to take an 8xH200 server and do DDP training.
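For reference, here's a minimal sketch of the kind of LoRA setup this involved, assuming the Hugging Face transformers and peft libraries; the rank, alpha, and target modules are illustrative choices, not my exact configuration:

```python
# Minimal sketch of a LoRA fine-tune on a base (not instruct) model.
# Assumes Hugging Face transformers + peft; hyperparameters are illustrative.
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import LoraConfig, get_peft_model

model_name = "meta-llama/Llama-3.2-3B"  # base model, not the instruct variant
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

lora_config = LoraConfig(
    r=16,                                 # low-rank adapter dimension
    lora_alpha=32,
    target_modules=["q_proj", "v_proj"],  # adapt attention projections only
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # only a small fraction of the 3B weights train
```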
So now, let's actually think about what code auto-complete does. When I am editing a file, it pulls context from before the line I am editing and from after it, and tries to match my style. It sometimes takes in repository-level info as well. I was initially under the impression that some amount of post-training would get the model to work like auto-complete.
Through some random coincidences on X/LinkedIn I saw a few people using the phrase "Fill-In-The-Middle", reached out to them, and dug deeper. I found a paper by OpenAI, "Efficient Training of Language Models to Fill in the Middle", which begins with a deceptively simple insight: auto-regressive models don't have to be limited to left-to-right generation. The key is a straightforward data transformation that requires no architectural changes whatsoever.
What FIM Actually Is (And Why It Matters)
Here's the core idea: During training, you take a document and split it randomly into three parts: prefix, middle, and suffix. Then you rearrange them as (prefix, suffix, middle) with special sentinel tokens (<PRE>, <SUF>, <MID>). The model still predicts tokens left-to-right, but now it's learning to generate the middle section while conditioning on both the prefix and the suffix.
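As a toy sketch (the sentinel strings below stand in for dedicated special tokens in a real tokenizer), the transformation is just a few lines of string manipulation:

```python
import random

PRE, SUF, MID = "<PRE>", "<SUF>", "<MID>"

def fim_transform(doc: str) -> str:
    """Split a document at two random character positions and rearrange
    the pieces as (prefix, suffix, middle) with sentinel tokens."""
    i, j = sorted(random.sample(range(len(doc) + 1), 2))
    prefix, middle, suffix = doc[:i], doc[i:j], doc[j:]
    # Training still proceeds left-to-right over this string, so the
    # model learns to produce the middle conditioned on both sides.
    return f"{PRE}{prefix}{SUF}{suffix}{MID}{middle}"
```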
For my legal domain use case, this was the missing piece. When a lawyer is drafting a clause, they aren't just continuing from what came before; they're trying to bridge between existing text on both sides. A statute might start with "Section 144: Whoever..." and end with "...shall be punishable with imprisonment." The lawyer needs to fill in the precise conditions in the middle that match both the legislative intent (prefix) and the prescribed consequences (suffix).
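At inference time, the idea is to assemble the same layout with the middle left open and let the model generate it. Here's a hypothetical prompt assembly for the statute example above, not tied to any specific serving stack:

```python
prefix = "Section 144: Whoever..."
suffix = "...shall be punishable with imprisonment."

# A FIM-trained model completes the middle after <MID> until it emits
# its end-of-middle token.
prompt = f"<PRE>{prefix}<SUF>{suffix}<MID>"
# completion = model.generate(prompt, stop_token="<EOM>")  # pseudocode call
```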
The paper's most surprising finding is what they call the "FIM-for-free property": models trained this way achieve the same left-to-right performance as regular auto-regressive models, while gaining infilling capability at no additional computational cost. This explained why my initial approach felt so inefficient: I was forcing a left-to-right model to do a task it was never designed for.
The Devil in the Details
- Character-level splitting is crucial. The authors tested splitting by lines, tokens, and characters. While line-level splitting performed slightly better on line-based benchmarks, it failed catastrophically when asked to complete partial lines, and robustness to partial lines is exactly what you need in real-world use. Character-level splitting ensures the model can handle cases where the cursor cuts through a word or a legal citation. For Indian law, where you might have citations like "Section 2(1)(a)" that could be split anywhere, this is essential.
- High FIM rates work better. Contrary to intuition, they found that using a 50-90% FIM rate (transforming most of your training data) actually improves infilling performance without harming left-to-right capabilities. Many previous implementations used rates as low as 15%, which the paper shows is sub-optimal. This was my first mistake: I had been fine-tuning on purely left-to-right data, giving the model almost no exposure to the actual task I wanted it to perform.
- Context-level vs. document-level matters. Document-level FIM (transforming before chunking into contexts) can result in fragmented examples where prefix, middle, and suffix get separated during training. Context-level FIM (applying the transformation after documents are already chunked) provides a significant performance boost. This is especially relevant when working with long legal documents that exceed your context window.
- SPM mode has an edge. The paper introduces two ordering schemes: PSM (<PRE> prefix <SUF> suffix <MID> middle) and SPM (<PRE> <SUF> suffix <MID> prefix middle). While both work, SPM mode (suffix-prefix-middle) showed slightly better performance and has caching advantages during inference. For a legal assistant where you're constantly editing the same document, this matters. A sketch covering these details follows this list.
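Here's a hedged sketch tying these details together: character-level splits, a high FIM rate applied at the context level, and the paper's joint-training SPM variant alongside PSM. The 0.9 rate and the function name are chosen for exposition, not taken from my actual code:

```python
import random

PRE, SUF, MID = "<PRE>", "<SUF>", "<MID>"

def fim_example(chunk: str, fim_rate: float = 0.9, spm: bool = True) -> str:
    """Context-level FIM: transform an already-chunked training context
    with probability fim_rate; otherwise keep it plain left-to-right."""
    if random.random() > fim_rate:
        return chunk  # untransformed auto-regressive example
    # Character-level split: the cut can land mid-word or mid-citation,
    # e.g. inside "Section 2(1)(a)".
    i, j = sorted(random.sample(range(len(chunk) + 1), 2))
    prefix, middle, suffix = chunk[:i], chunk[i:j], chunk[j:]
    if spm:
        # Joint-training SPM variant from the paper: the suffix is encoded
        # first, so its KV cache stays valid as the user extends the prefix.
        return f"{PRE}{SUF}{suffix}{MID}{prefix}{middle}"
    # PSM ordering.
    return f"{PRE}{prefix}{SUF}{suffix}{MID}{middle}"
```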
The Fine-Tuning Trap (That I Fell Into)
Here's the part that made me laugh out of frustration: the paper explicitly shows that fine-tuning for FIM is computationally inefficient. They took a 1.4B parameter model pretrained without FIM and fine-tuned it with various hyperparameters. Even after 50 billion tokens of fine-tuning (half their pretraining budget) at aggressive learning rates, they couldn't match the performance of a model trained with FIM from scratch.
I had done exactly what the research warned against: full fine-tuning after my LoRA experiments, spending compute to retrofit a capability that could have been learned for free during pretraining. The paper suggests this might be due to "ossification": the model's learned attention patterns become rigid during pretraining, making it difficult to adapt to FIM's global structure.
Connecting It Back
This research explained why my model kept getting section numbers wrong. Left-to-right generation doesn't know where you're trying to end up. It doesn't know that Section 144 must connect to the penalty clause that follows. FIM training forces the model to consider both context and destination, making it far more likely to generate legally coherent middle sections that actually bridge the intended concepts.
I think the worst part is that I should have started this exercise by understanding how code auto-complete actually works before adapting it to a different domain; it would have saved quite some time. I found out about "Fill-In-The-Middle" only after I had done a full fine-tune and was looking for ways to do post-training. Nevertheless, I still learned a lot about LoRA, full fine-tuning, DDP, setting up training runs so they can resume from checkpoints, and more!
I did some demand testing, but the response was lukewarm, which definitely wouldn't justify the budget for pre-training a base model from scratch.
Here's the source code: https://github.com/pranitbauva1997/legal-ai-fine-tune