The LLM-shaped hole in your XGBoost pipeline

Every team I've worked with that ships a tabular ML model has, at some point in the last two years, been asked the same question by someone in leadership: "Have we tried GPT for this?"

A fair question to ask once, less so by the third time. Tabular prediction (predicting a target from a row of mixed numeric and categorical features) is still the domain where gradient-boosted trees beat everything else. That hasn't fundamentally changed since 2017. Foundation tabular models like TabPFN have won some ground on small data, but the broader pattern holds, and it hasn't shifted because LLMs got bigger.

But there is an LLM-shaped hole in most XGBoost pipelines. It's just not where most people put it.

Why trees still win on tabular

The Grinsztajn et al. (2022) result, and the follow-ups since (McElfresh et al. 2023, the TabArena and TabRepo benchmarks), keep reaching the same conclusion: tree-based models outperform deep learning on tabular data, especially with high-cardinality categoricals, modest dataset sizes (under a million rows), and feature interactions that aren't already linearised.

The reasons are mostly mechanical. Trees handle mixed types natively (numeric, categorical, ordinal, missing all go in). Feature scaling doesn't matter. Training is fast enough that iteration time goes into features instead of hyperparameters. And SHAP works on every prediction, which means you can explain the score to whoever has to act on it.

The serious alternatives

When someone asks "should we try something other than XGBoost for tabular," the honest short list is short.

TabPFN v2 (Hollmann et al., Nature 2025). A pre-trained transformer for tabular prediction. Native sweet spot is around ten thousand rows; the chunked and ensembled variants push higher. No per-dataset training loop. You give it data, it gives you predictions in seconds. Genuinely competitive with tuned XGBoost on small, clean datasets. Worth running on a stratified subsample of any new project just to set a baseline (a sketch of that run follows this list).

FT-Transformer / SAINT. Attention over feature embeddings, with SAINT additionally attending across rows. Closes the gap with trees on high-cardinality categorical-heavy data, occasionally beats them. Integration cost is real: GPU inference, less mature tooling, no SHAP. Rarely the right call until you've exhausted feature engineering.

LLM-as-feature-engineer. CAAFE (Hollmann et al., 2023) and follow-ups use an LLM offline, given a dataset's schema and sample rows, to propose candidate feature transforms, which a gradient booster then validates empirically. The LLM never touches the prediction path. Highest expected value of these alternatives, lowest production risk, runs on a developer laptop.

Fine-tuned LLMs on serialised rows (TabLLM, Hegselmann et al. 2023). Genuine few-shot lifts at hundred-row scale; no useful transfer to production tabular. Two orders of magnitude more inference cost per row, no SHAP. Look at it if you have 200 labeled rows and a deadline; otherwise, no.

What's not on this list. An LLM-as-regressor without fine-tuning, pushing rows through a context window at inference time to get a prediction back. Orders of magnitude more expensive than a tree, no calibration, no SHAP, no clear retrain story. The reason it keeps getting proposed is that LLMs are genuinely transformative for other problem shapes. That enthusiasm doesn't transfer to tabular regression.
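
The TabPFN baseline mentioned above is cheap enough to try before anything else on the list. A minimal sketch, assuming the tabpfn package's scikit-learn-style interface (exact constructor arguments vary between releases) and synthetic data standing in for your real table:

```python
# Quick TabPFN baseline on a stratified ~10k-row subsample.
# Assumes the tabpfn package; synthetic data stands in for your real table.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from tabpfn import TabPFNClassifier

X, y = make_classification(n_samples=50_000, n_features=20, random_state=0)

# TabPFN's native sweet spot is ~10k rows, so subsample before fitting.
X_small, _, y_small, _ = train_test_split(
    X, y, train_size=10_000, stratify=y, random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(
    X_small, y_small, test_size=0.2, stratify=y_small, random_state=0
)

clf = TabPFNClassifier()            # no per-dataset training loop to configure
clf.fit(X_train, y_train)
print(clf.score(X_test, y_test))    # compare against your tuned XGBoost number
```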

Where LLMs earn their keep

The pattern that works: LLMs upstream of trees. Three concrete shapes.

Embeddings as features

Take any unstructured text adjacent to your prediction problem (product titles a customer browsed, free-form notes on a customer record, descriptions of items in a basket) and embed it. Mean-pool per row. Concatenate the embedding vector to your existing feature set. Let XGBoost decide whether the new dimensions carry signal.
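
A sketch of the wiring, assuming the OpenAI SDK and XGBoost; the names (browsed_titles, X_tab, y) are stand-ins for whatever your pipeline actually calls them:

```python
# Sketch: embed per-row text, mean-pool, concatenate onto tabular features.
# Assumes the openai and xgboost packages; df["browsed_titles"] (a list of
# strings per row), X_tab, and y are stand-ins for your real data.
import numpy as np
import xgboost as xgb
from openai import OpenAI

client = OpenAI()

def embed(texts: list[str]) -> np.ndarray:
    resp = client.embeddings.create(model="text-embedding-3-small", input=texts)
    return np.array([item.embedding for item in resp.data])

def row_vector(texts: list[str]) -> np.ndarray:
    """One fixed-width vector per row: embed each snippet, then mean-pool."""
    if not texts:
        return np.zeros(1536)            # text-embedding-3-small dimension
    return embed(texts).mean(axis=0)     # batch and cache by source object in production

text_matrix = np.vstack([row_vector(t) for t in df["browsed_titles"]])
X = np.hstack([X_tab, text_matrix])      # let XGBoost decide what carries signal

model = xgb.XGBClassifier(n_estimators=500, tree_method="hist")
model.fit(X, y)
```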

The cost is trivial. As of 2026, text-embedding-3-small lists around $0.02 per million tokens; a batch of 150,000 customers with a few hundred tokens of relevant text each is on the order of 50 million tokens, which comes out to a couple of dollars per full re-embedding. Cache by source object rather than by row and the marginal cost on incremental data is rounding-error.

The argument against this pattern is empirical, not economic. Do the embeddings carry signal beyond what the existing categorical IDs already encode? Often yes, sometimes no. You find out by running it.

On a recent project, we discussed embedding pre-purchase text fields to lift recall on a stubborn minority-class classifier. The cost math worked out to a few dollars per re-embedding run. We never shipped it. Two weeks of offline eval couldn't separate the embeddings' contribution from the hand-curated categorical that was already capturing most of the same signal, and we couldn't justify the operational complexity for a lift we couldn't measure cleanly. The empirical question is real. The answer was no.

The same pattern works on images. If your prediction problem has product photos, listing thumbnails, scanned documents, or any other image content attached to rows, multimodal embeddings (CLIP, SigLIP, or your provider's equivalent) plug in the same way. They often produce bigger lifts than text embeddings, because image content is even less likely to already be captured by your structured columns.
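
A sketch with an open CLIP checkpoint via sentence-transformers; a provider's multimodal embedding endpoint slots into the same place, and the thumbnail_path column is illustrative:

```python
# Sketch: image embeddings join the feature matrix the same way text ones do.
# Assumes sentence-transformers and a local thumbnail path per row.
import numpy as np
from PIL import Image
from sentence_transformers import SentenceTransformer

clip = SentenceTransformer("clip-ViT-B-32")
image_vecs = clip.encode([Image.open(p) for p in df["thumbnail_path"]])
X = np.hstack([X_tab, image_vecs])   # then train XGBoost exactly as above
```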

When the upstream embedding model version changes, your features silently change too. Pin the model version in the embedding pipeline, log it alongside your model artifact, and alert on it the way you'd alert on a schema migration. Embedding drift is a real production failure mode.
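
One cheap way to do that is to write the embedding config into a sidecar next to the model artifact and refuse to score on a mismatch; paths and names here are illustrative:

```python
# Sketch: pin the embedding model alongside the trained artifact and fail
# loudly if the serving pipeline drifts. Paths and names are illustrative.
import json
from pathlib import Path

EMBEDDING_CONFIG = {"model": "text-embedding-3-small", "dim": 1536}
SIDECAR = Path("model_artifacts/embedding_config.json")

# Training time: save the config next to the XGBoost artifact.
SIDECAR.parent.mkdir(parents=True, exist_ok=True)
SIDECAR.write_text(json.dumps(EMBEDDING_CONFIG))

# Serving time: check the live pipeline against what the model was trained on.
def assert_embedding_config(live_model: str, live_dim: int) -> None:
    trained = json.loads(SIDECAR.read_text())
    if (live_model, live_dim) != (trained["model"], trained["dim"]):
        raise RuntimeError(
            f"Embedding drift: trained on {trained}, serving {live_model}/{live_dim}"
        )
```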

Text-to-features

Most businesses have a pile of unstructured text adjacent to the prediction problem (support tickets, sales notes, complaint logs, onboarding emails). None of it is in your tabular feature set. An LLM classifier or extractor, run in batch over that text, produces a small set of structured columns: intent, urgency, topic mix, sentiment, B2B signal. Bolt those onto the existing feature set. XGBoost still trains. SHAP still works.
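
A sketch of the batch extractor, assuming the OpenAI SDK's JSON mode; the output schema (intent, urgency, sentiment, b2b) is illustrative and should be whatever your domain actually needs:

```python
# Sketch: turn free text into a handful of structured columns.
# Assumes the openai package; the output schema is illustrative.
import json
from openai import OpenAI

client = OpenAI()

PROMPT = (
    "Extract from the text: intent (question/complaint/churn_risk/other), "
    "urgency (1-5), sentiment (-1 to 1), b2b (true/false). Reply as JSON."
)

def text_to_features(text: str) -> dict:
    resp = client.chat.completions.create(
        model="gpt-4o-mini",
        response_format={"type": "json_object"},
        messages=[
            {"role": "system", "content": PROMPT},
            {"role": "user", "content": text[:4000]},   # keep token cost bounded
        ],
    )
    return json.loads(resp.choices[0].message.content)

# Run in batch over the source documents, join the resulting columns onto the
# training table, and XGBoost and SHAP treat them like any other feature.
```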

Of the three patterns, text-to-features tends to produce the biggest MAE lifts in practice. The projects I've seen succeed land somewhere in the single-digit to low-double-digit percent range on MAE; my sample is small enough that the range is anecdotal. In the projects where it doesn't help, the unstructured signal turned out to already be captured by existing categoricals. Where it does help, it's because that signal was sitting unused, and the LLM is the first tool that makes it cheap to structure.

Be honest about the cost. Text-to-features is materially more expensive than embeddings. Running a gpt-4o-mini-class classifier over hundreds of tokens per row at 150k rows is single-digit dollars per pass, and the cost recurs every time you reprocess. Cache aggressively by source document, not by customer.
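
Caching by source document is mechanical; a sketch, reusing the extractor above:

```python
# Sketch: cache extractions by a hash of the source document, so a reprocess
# only pays for documents that actually changed. Paths are illustrative.
import hashlib
import json
from pathlib import Path

CACHE_DIR = Path("llm_feature_cache")
CACHE_DIR.mkdir(exist_ok=True)

def cached_features(doc_text: str) -> dict:
    key = hashlib.sha256(doc_text.encode("utf-8")).hexdigest()
    path = CACHE_DIR / f"{key}.json"
    if path.exists():
        return json.loads(path.read_text())
    features = text_to_features(doc_text)    # the extractor from the sketch above
    path.write_text(json.dumps(features))
    return features
```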

Enrichment gap-filling

Any tabular feature set that includes external lookups (registry data, firmographic enrichment, geographic classification) has coverage gaps. Typical match rates on external registries are 60-80%, leaving a long tail where the feature is "unknown."

LLMs can't invent registry data from a key they've never seen. But a tool-using agent can close most of the gap: take the missing key, find a public website, classify the entity from the website, validate against a small labeled set. This has been done by enough teams to be unremarkable, and the win is mostly integration ergonomics. You're replacing a fragile multi-step ETL with a single agent loop. Don't oversell it as a model improvement; sell it as a data pipeline upgrade.
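
A sketch of the classification step in that loop, assuming requests and the OpenAI SDK; the search step (missing registry key to candidate website) depends on whichever search tool you already have, so it isn't shown:

```python
# Sketch: the classify-from-website step of the gap-filling agent.
# Assumes requests and the openai package; the upstream search is not shown.
import requests
from openai import OpenAI

client = OpenAI()

def classify_entity(url: str) -> str:
    """Fetch a public page and ask an LLM for a coarse industry label."""
    html = requests.get(url, timeout=10).text[:20_000]   # bound token cost
    resp = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[
            {"role": "system", "content": (
                "Classify the business described by this page into one coarse "
                "industry label. Reply with the label only.")},
            {"role": "user", "content": html},
        ],
    )
    return resp.choices[0].message.content.strip()

# Validate the labels against a small hand-curated set before they enter the
# feature pipeline; treat the whole thing as ETL, not modelling.
```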

So when leadership asks again

The next time someone asks whether you've tried GPT for it: you have. It's feeding features into the model that does the predicting.