Enterprise AI Trends
Finetuning LLMs for Enterprises: Interview with Travis Addair, CTO of Predibase

Plus, how RFT (reinforcement finetuning) will really change the game for finetuning AI models



Fine-tuning large language models isn't flashy enough to trend on AI social media—but it's quietly becoming a workhorse for enterprises: it cuts inference costs (often by 80%+), strengthens data privacy, and dramatically reduces latency. This is especially relevant for industries with strict performance and regulatory requirements, where even the latest frontier model with RAG can fall short.

Also, the space is changing fast: a recent advance called Reinforcement Finetuning (RFT) may dramatically increase the demand for finetuning. RFT is used by OpenAI to train custom models for customers, and many practitioners see it as the future of model personalization.

So to break down exactly what's happening—and why you can't afford to ignore it—we're joined by Travis Addair, CTO and co-founder of Predibase, a Greylock-backed platform that has helped enterprises like Checkr and Nubank train and serve finetuned LLMs.

Before co-founding Predibase, Travis worked on ML infrastructure at Uber, where he built deep learning infrastructure for self-driving cars, customer support automation, and recommendation systems. Now, at Predibase, he’s focused on helping companies deploy fine-tuned models quickly and affordably—without needing a massive ML team or months of engineering work.

In this interview, we talk about finetuning and RFT in non-technical terms so that it’s accessible to both business and technical leaders. By the end, you will understand:

  • the use cases of finetuning,

  • what’s possible with finetuning,

  • whether it’s relevant for your company,

  • how RFT works, and whether you should spin up an RFT project,

  • and much more

If you’re curious about how fine-tuning and RFT are reshaping AI, read the transcript below, listen to the podcast version, or watch the full episode on YouTube.



How to get started finetuning LLMs and reinforcement finetuning: Predibase is the first platform to bring reinforcement fine-tuning to enterprise AI. With as little as 10 rows of data, engineering teams can customize and serve their own high-performance reasoning models.


Timestamps for Audio and Video

[00:06:53] Real-World Use Cases and Benefits of Fine-Tuning

[00:12:14] Technical Insights: Achieving Faster Latency and Privacy

[00:14:20] Build vs buy decisions for AI infrastructure: how CTOs at enterprises think

[00:18:43] Challenges in building AI Infrastructure from scratch

[00:23:24] Did DeepSeek increase interest among enterprise customers in finetuning?

[00:29:13] What is Reinforcement Finetuning (RFT) and why is it a big deal

[00:35:55] Commercial implications of Reinforcement Fine-Tuning

[00:45:27] Vertical Use Cases and Future Plans for RFT

[00:47:59] How Predibase built its RFT platform

[00:49:28] The practical challenges of running RFT

[00:50:32] What is reward function modeling and why it’s important

[00:51:46] RFT training process is interactive and vibes driven

[00:53:13] Other challenges in model training

[00:58:59] Biggest bottlenecks in RFT

[01:03:02] More use cases for RFT

[01:06:22] Transpilation and legacy code modernization are big deals

[01:10:31] Future of Inference Costs - will it keep coming down

[01:17:43] Predibase's Vision and Goals

[01:19:10] Enterprise AI Adoption Challenges

[01:26:44] Closing Remarks and Future Predictions

Transcript

John
Today, we have Travis with us—he's the CTO at Predibase. Since day one, he's been focused on democratizing fine-tuning for enterprises and is building some really exciting technology.

Before we dive in, could you tell us a bit about yourself and your journey—from working at Uber to what inspired you to co-found Predibase?

Travis

Yeah, for sure. As you mentioned, I worked at Uber for several years—about four and a half, to be exact. I was part of the machine learning platform team, Michelangelo, where I led a team focused on building deep learning infrastructure.

At the time, what we now call AI was simply referred to as deep learning. Initially, we worked on self-driving cars, specifically on perception models. Over time, our focus expanded to areas like customer support automation, Uber Eats recommendation systems, and developing the foundational infrastructure for training and serving models.

Along the way, we also contributed to open-source technology. Around 2020, my colleague and friend from Uber, Piero—who was in the research group—and I decided to start Predibase. Our goal was to take the foundational technology we had been building in open source and make AI more accessible to a broader audience.

We specifically wanted to help organizations sitting on large amounts of text or image data who were wondering how to leverage it to improve their business and products. Many of these companies lacked AI researchers who could write PyTorch or build custom neural networks. Our vision was to bring that technology to them through declarative and low-code interfaces, making it much more accessible.

John

That sounds great. I know declarative ML is one of the biggest differentiators for Predibase, but you also started with a low-code open-source project.

Could you talk about that real quick and how you decided to go all-in on the LLM training stack? I think that shift happened around 2023—what was that time like? When did you realize, "Okay, we need to focus on LLMs now"?

Travis

Yeah, exactly. As you mentioned, the first version of our product was a low-code tool for fine-tuning and serving deep learning models—things like BERT classification models, for those in the audience who are familiar. At the time, we were working a lot with engineering teams looking to productionize these systems.

Looking back, one interesting thing was that in 2021 and 2022, we would approach large organizations and say, "Hey, you have a lot of text and image data—you should be using deep learning because traditional methods don't fully leverage that data." But the response we often got was, "I don’t really care about that. Can you give me XGBoost? Can you give me something that understands tabular data?"

The biggest shift came with ChatGPT—it created an awakening around NLP use cases. Suddenly, companies started realizing the potential of leveraging raw text data in meaningful ways.

Then, in 2023, LLaMA came out. That was the first time an open-source model was truly competitive with proprietary foundation models, like those from OpenAI. Almost immediately, people started fine-tuning and customizing it for specific tasks. We quickly realized that this was very aligned with what we had already been doing in predictive AI—just now applied to generative models.

From a technology standpoint, there was already strong synergy. The use cases evolved—things like text generation and chatbots—but fine-tuning remained central. What really stood out to me was the nature of foundation models like GPT-4. They’re incredibly powerful but also heavyweight—huge, general-purpose sledgehammers that can do a lot.

For B2C applications, like ChatGPT, that generalization makes sense because you don’t know what any one user will want. But in enterprise use cases, where AI is embedded into a specific product or core infrastructure, you don’t need a generalist—you need something highly optimized for a particular task.

There’s always a tradeoff: quality for a specific use case, speed (because large models are slower), or cost (since running massive models isn’t cheap). Our vision with Predibase was to help companies build customized, fit-for-purpose AI solutions that solve their exact problem as efficiently as possible. We wanted to guide organizations exploring generative AI toward this more tailored, fine-tuned paradigm.

John

Speaking of use cases, I talk to a lot of people, and even two years after ChatGPT, there's still a lot of uncertainty about when to fine-tune a model and when not to.

Many teams start with frontier models from OpenAI, but later realize they can save a lot of money by fine-tuning or using smaller, specialized models. Let's dive into some concrete use cases you've seen. Could you walk us through a typical end-to-end onboarding scenario—how a company goes from exploring fine-tuning to actually productionizing a model? Something that would really resonate with most of our listeners.

Travis

For sure. We see a lot of customers follow a similar journey. They start by using an API service like OpenAI, but eventually hit a wall—whether it's quality (the model doesn’t understand their data well enough), speed (response latency is too high for end users), or cost (as usage scales, the bill becomes unsustainable).

A great example is Convirza, a company in the customer service and call center technology space. We worked with them on real-time information extraction and summarization from call transcripts. Their goal was to extract metadata, generate summaries, and more.

The challenge? They had around 60 different use cases—each requiring specific metadata extraction and its own tailored prompt.

John

Like structuring the output into JSON from a prompt, for example?

Travis

Exactly. Imagine a call transcript comes in, and you need to extract metadata—things like identifying the speakers and pulling out key semantic information from what was originally an audio recording, later converted into a massive text transcript.

Convirza deals with millions of these calls each month, coming from hundreds of customers. This is a real-time use case, meaning data is continuously flowing in, and the system needs to scale efficiently to handle that volume.

Initially, they were using OpenAI’s models for this, but costs quickly spiraled out of control. Latency also became a major issue—they needed responses fast, but the existing setup wasn’t keeping up. Surprisingly, even quality became a concern. The general-purpose model wasn’t always extracting the right metadata accurately.

So, what we did was fine-tune over 60 adapters—task-specific models that add a small number of parameters on top of the base model. This approach led to a nearly 10% accuracy improvement over OpenAI’s best models for their specific use cases.

What made this even more efficient was our serving architecture. We built a system called LoRAX, which allows us to serve hundreds of fine-tuned models simultaneously on a single GPU. This gave Convirza much higher throughput than they had before, reducing both cost and latency.

And this isn’t an uncommon pattern—we see many companies start with foundation models and later realize that fine-tuning can improve not just cost and speed, but also quality. It’s often surprising, but fine-tuning can actually outperform even the strongest general-purpose models when applied to domain-specific tasks.
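For readers who want a concrete picture of the adapter approach described above, here is a minimal sketch using Hugging Face's PEFT library. It is not Predibase's internal code; the base model, rank, and target modules are illustrative assumptions.

```python
# Minimal LoRA adapter sketch with Hugging Face PEFT (illustrative; not Predibase's stack).
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

# Assumed base model; any causal LM from the Hub works the same way.
base_model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-3.1-8B-Instruct")

# LoRA adds a small number of trainable parameters on top of the frozen base model,
# which is why many task-specific adapters can later share one copy of the base weights.
lora_config = LoraConfig(
    r=16,                                 # rank of the low-rank update matrices
    lora_alpha=32,                        # scaling applied to the adapter output
    target_modules=["q_proj", "v_proj"],  # attention projections that receive adapters
    task_type="CAUSAL_LM",
)

model = get_peft_model(base_model, lora_config)
model.print_trainable_parameters()  # typically well under 1% of the base model's parameters
```

Serving many such adapters on one GPU, as LoRAX does, is possible because only these small adapter weights differ per task; the base model is loaded once and shared.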

John

Could you comment on the key motivators for adopting Predibase—or really, any infrastructure provider focused on fine-tuning—besides cost? Would you say agility, privacy, or better developer experience are major drivers? What would you rank as the top three reasons companies choose fine-tuning, aside from cost?

Travis

Yeah, cost is definitely a problem of scale—it doesn’t necessarily hit you right away. But there are other constraints that can be immediate blockers to productionizing an AI system.

I’d say the top three motivators, aside from cost, are quality, speed, and governance.

  1. Quality – If you're building a system where accuracy is critical—like medical diagnoses or financial risk assessments—you cannot afford hallucinations or errors. The accuracy bar is extremely high in these cases, and fine-tuning helps ensure the model performs optimally for your specific use case.

  2. Speed – This is often an issue for end-user applications. If someone is using an app and expects an immediate response, LLMs that take multiple seconds can be unacceptable. Many real-time applications need sub-second latency, sometimes even in the range of 100ms or less. Fine-tuning allows companies to optimize models for speed while maintaining quality.

  3. Governance – Using a foundation model’s API is fine for some applications, but not for those dealing with sensitive data. Many of our customers have strict constraints on where their data can be processed and stored. They need models deployed in VPCs (Virtual Private Clouds) to ensure no data leaves their environment. Fine-tuning allows them to maintain full control over the model and its data, which is essential for industries like healthcare, finance, and government.

These factors—accuracy, speed, and data control—are often more pressing than cost, especially when companies need AI systems to work reliably at scale.

John

Let’s talk about how you achieve faster latency and ensure privacy. Are you improving latency by deploying models closer to the customer’s environment?

Is that the key mechanism for reducing time to first token, or are there other optimizations at play? A lot of companies are focusing on reducing latency—so what’s the extra edge that allows you to hit sub-100ms response times?

Travis

Yeah, there are a few different ways we achieve lower latency.

One approach is VPC deployment—when we run a model directly inside a customer’s VPC, all traffic stays within their network. There’s no external API call overhead, so you get much lower latency just from that direct VPC-to-VPC communication.

But more critically, fine-tuning itself is a major factor. If you start with a general-purpose model that’s hundreds of billions of parameters, you inherently have higher latency. Through fine-tuning, we can get better quality with a much smaller model—say, an 8-billion-parameter model instead of a massive one. That kind of reduction in model size leads to order-of-magnitude speed improvements while maintaining or even improving accuracy.

Beyond that, we’ve also developed custom fine-tuning optimizations that further boost speed. One example is TurboLoRA, a technique we built and released last year that’s now available in our product.

TurboLoRA doesn’t just fine-tune a model for accuracy—it also trains the model to predict multiple tokens at once. Instead of just predicting the next token (which is what LLMs usually do), the model learns to predict the next three tokens in a single step. This can lead to up to 3x faster inference, simply by reducing the number of compute steps needed per response.

So, between VPC-based deployments, model size optimizations through fine-tuning, and custom techniques like TurboLoRA, we’re able to deliver much lower latency—often below 100ms—while maintaining strong model performance in production.
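TurboLoRA itself is not open source, so the snippet below is only a conceptual sketch of the general multi-token-prediction idea described above: extra output heads that propose the next few tokens from a single forward step. The class, shapes, and head count are assumptions for illustration, not Predibase's implementation.

```python
# Conceptual multi-token prediction sketch (illustrative only; not TurboLoRA's actual design).
import torch
import torch.nn as nn

class MultiTokenHeads(nn.Module):
    """Proposes logits for the next n_future tokens from one hidden state."""

    def __init__(self, hidden_size: int, vocab_size: int, n_future: int = 3):
        super().__init__()
        # One small output head per future position (t+1, t+2, t+3).
        self.heads = nn.ModuleList(nn.Linear(hidden_size, vocab_size) for _ in range(n_future))

    def forward(self, hidden_state: torch.Tensor) -> list[torch.Tensor]:
        # hidden_state: [batch, hidden_size] from the base model's final layer.
        # At inference time the proposed tokens can be verified in one batched forward pass,
        # so fewer sequential decode steps are needed when the proposals are accepted.
        return [head(hidden_state) for head in self.heads]
```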

John

With enterprises, there’s always a challenge when it comes to rolling out new infrastructure—especially when companies already have ML teams and have spent years building their own stack on something like Kubeflow.

Now, with LLMs evolving so rapidly, these companies face a critical build vs. buy decision. Do they double down on their internal ML infrastructure, or do they adopt solutions like Predibase or other ISVs to move faster?

What are the key decision variables you've seen CTOs grapple with when making this call? What ultimately pushes them toward choosing a platform like Predibase? What's going through their minds when they realize, “Okay, we need to move faster”?

Travis

Yeah, the way I think about it is this—because as a CTO, I also make build vs. buy decisions for other products.

If you're in a high-growth, highly competitive market, and you're under pressure to figure out your GenAI strategy and execute quickly, then unless you're confident that you can build a truly differentiated, best-in-class platform, I don’t see why you’d want to build your own from scratch.

LLMs are moving fast, and companies like us are solely focused on this problem—investing heavily to stay ahead. The cost of buying a solution is almost always lower than the cost of hiring a large team to build, maintain, and evolve an internal platform over the next 10 years. So, for most companies, buying is the logical choice unless you're Google or a similarly resourced company with a very unique platform vision.

Now, when it comes to choosing Predibase vs. other providers, it really depends on what you're trying to achieve. One area where we differentiate is reinforcement fine-tuning—we’ve invested a lot in making sure our platform isn’t just providing raw infrastructure, but also the tools needed to deliver the best possible model quality and seamlessly deploy models into production.

We don’t just offer a self-service training tool or fast inference. We think end-to-end about the AI lifecycle—training, deployment, collecting real-world usage data, and continuously fine-tuning over time.

That full flywheel—where models improve continuously as they are used—is something we actively manage and optimize for. That’s a fundamental difference in how we approach the problem compared to many other solutions in the space.

John

Let’s talk about privacy and on-prem deployment.

How long does the end-to-end process typically take? I’m not sure how Predibase or similar ISVs onboard customers, so can you walk me through it?

For example, when you fine-tune a model, are you copying it into the customer’s VPC? Do you store it in S3 and load it from there? Where do the different components of the infrastructure sit—what runs in the cloud versus what stays on-prem? And is there a hosted service dependency on Predibase, or can customers operate fully independently?

Travis

Yeah, good question. The way we handle VPC deployments is through a hybrid deployment model. We have a centralized control plane that lives in our cloud environment, and a localized data plane that resides in the customer’s VPC. The control plane handles orchestration, which includes submitting training jobs, monitoring their progress, handling updates, and tracking state. It also manages all metadata, such as metrics, telemetry, and records of what jobs have been run and their current status.

However, everything that is sensitive—training data, user data, customer data—remains entirely within the customer’s VPC. If the customer has their training dataset stored in an S3 bucket, for example, there will be a secure processing environment within their VPC that reads the data for training, processes it, and then discards it once training is complete. Nothing ever persists on our end, and no customer data is sent back to the control plane.

For inference, we follow the same approach. The model weights are written to an S3 bucket inside the customer's environment, ensuring that all assets remain localized. When it comes time to deploy the model, another pod spins up within the VPC, reads those model weights into memory, and begins serving requests. At this point, all inference requests flow directly to that server, meaning that nothing ever escapes the customer’s environment. This ensures that even if customers are working with sensitive data, such as personally identifiable information, everything remains fully isolated within their own infrastructure.

This setup allows enterprises to maintain the security and compliance of an on-prem solution while benefiting from a cloud-managed control plane that handles orchestration and monitoring without ever touching their actual data.

John

Switching back to the build versus buy discussion, what do you think enterprises tend to underestimate the most when they try to build this type of infrastructure for full fine-tuning and productionizing models entirely in-house? What’s the most underrated and difficult challenge they run into?

Travis

I think the biggest challenge is that doing something once isn’t that hard—but doing it repeatedly and at scale is what’s truly difficult. If you just want to train one model, you can download some code from Hugging Face, spin up a notebook, get a GPU on AWS, and probably figure it out. If you want to serve a model, you can grab an open-source inference framework, spin up a container, and start sending requests. That part isn’t the hard part.

The real challenge starts when you need to operate at scale. What happens when you need multiple replicas of a model running reliably? What if you need SLAs on your deployment and care about cold start times? How do you efficiently manage scaling from one to eight to four replicas without over-provisioning resources? How do you ensure your training jobs don’t fail, track experiments properly, and continuously improve quality? These are all complex infrastructure challenges that aren’t trivial to solve.

At that point, you need a dedicated team just to maintain and optimize the system. And on top of that, things in this space move incredibly fast. Every week, there are new techniques for serving models, new optimizations, and new approaches. Who on your team is responsible for running A/B tests to evaluate the latest methods? Is it even compatible with your current stack, or will you need to update all your dependencies? It becomes a huge maintenance burden.

Even some of the larger organizations I keep in touch with are reconsidering their approach to ML infrastructure. They’re asking themselves whether it makes more sense to standardize on an industry solution rather than continuing to build and maintain everything in-house. The reality is that unless you have a very specific reason to develop and maintain your own platform, just taking open-source code and running it yourself isn’t the most cost-effective or scalable solution in the long run.

John

This is just a personal anecdote, but I know of a Fortune 100 company that spent seven months trying to build out a GPU cluster with all the features needed for fine-tuning and inference, only to scrap the entire thing. They even attempted to pre-train their own models and spent about $20 million on it.

I think a lot of people underestimate how fast the industry moves, and it’s that speed that makes these kinds of technical investments so risky. The reality is that these bets aren’t like traditional ML projects from three years ago, where you had a much longer lifecycle for software development. Would you agree that it’s a lot riskier now compared to the pre-LLM era?

Travis

Yeah, I think the velocity is a huge factor. Data science used to be treated more like an R&D division, where companies would try a bunch of things, knowing that 99.9% of them wouldn’t go anywhere, but at least they could tell their board and shareholders they were investing in AI.

It’s completely different now. There are real expectations around AI, and companies are actually shipping products—and they’re doing it in weeks, not years. If you get slowed down because you’ve invested in building infrastructure with only a small team, while the rest of the industry is moving at full speed, there’s a real risk of being left behind.

Even for a company like ours, where this is all we do, and we’ve been focused exclusively on this problem for four years, it’s still a challenge to keep up. Just staying on top of the latest research, reading papers, and tracking where the industry is headed is almost a full-time job for me.

So I definitely don’t envy someone trying to build all of this in-house with a smaller, more constrained setup. The industry is evolving too quickly, and making the wrong bet on infrastructure today could mean wasting years of effort.

John

Let’s switch gears and talk about the state of fine-tuning as an industry. There are so many techniques coming out all the time, which makes it hard to keep up—not just with fine-tuning methods, but also with new inference optimizations.

DeepSeek came out about a month ago, and I’m curious if you’ve seen an uptick in enterprise interest in fine-tuning because of it, or if this is mostly just Substack and Twitter hype. Is DeepSeek actually driving companies toward fine-tuning, or is it just another model release with a lot of buzz but little real-world impact?

Travis

Yeah, I’d definitely say there’s been real interest, but it comes in a few different forms.

Some companies are interested in DeepSeek itself because, for the first time, they see an open-source model that could genuinely compete with top-tier proprietary models. For them, it’s about evaluating whether they can move off closed models entirely and switch to open-source alternatives.

Others are looking at fine-tuning DeepSeek specifically because it’s already a strong model, and they’re thinking, “What if we fine-tune it on our own data? Could we make it even better?” That’s where fine-tuning starts becoming a real differentiator.

Then there’s a third group—companies that are more interested in the techniques that DeepSeek used. In their paper, they also introduced DeepSeek-R1-Zero, a pure reinforcement learning (RL) model, and we’ve seen a lot of people experimenting with reproducing that. What’s particularly fascinating is that their RL approach seems to unlock reasoning from first principles—what they describe as an “aha moment” where the model learns to reason in a way that wasn’t explicitly programmed. We’ve actually seen similar effects in our own work, which makes this an exciting area to watch.

That said, I don’t think DeepSeek itself will necessarily have staying power. Models come and go, and something newer will inevitably replace it. I also don’t think companies should tie themselves too closely to a single model, since the landscape evolves so quickly.

What DeepSeek has done, though, is reinforce a few key trends. First, open-source is the future. The moat around proprietary foundation models is shrinking fast, and the gap between closed and open models is closing quicker than many expected. Second, it has highlighted new paradigms for fine-tuning, like RL-based training, that unlock capabilities companies hadn’t considered before. We’re already seeing real enterprise interest in applying these methods to their own fine-tuning workflows, especially reinforcement learning as a fine-tuning technique.

John

When it comes to the state of fine-tuning, is LLaMA still the dominant model for fine-tuning, or has something else taken over? Is it still the most popular base model simply because it's a U.S.-based model, or do you think there's a level of nervousness around DeepSeek because it's not?

I know in theory, it shouldn’t really matter since it’s just model weights, but have you seen any nuance around that? And more broadly, is LLaMA still relevant, or is its dominance fading? When LLaMA first launched, people were shocked that Meta had caught up. Now that we have models like DeepSeek, I agree it’s just another competitor, not a fundamental shift. But could you comment on LLaMA vs. DeepSeek and how enterprises are thinking about them?

Travis

LLaMA is still the most popular model on our platform, and we actually track these stats internally. From what we see, about 60% of fine-tuning jobs are still done on LLaMA. That said, the two most commonly fine-tuned models right now are LLaMA (and its variants) and Qwen 2.5. Those two categories are pretty evenly split in terms of demand.

One of LLaMA’s biggest limitations is its license. It’s not fully open, and there are restrictions on how it can be used. For companies that want complete freedom in how they deploy and modify a model, they tend to lean toward Qwen, which has a more permissive license. So, there’s definitely a tradeoff there, but in general, those two models dominate fine-tuning right now.

As for DeepSeek, do people hesitate because it’s a Chinese-developed model? Honestly, I think that geopolitical concerns are more of a mainstream media narrative than something we actually see affecting practitioner decisions. If that were a major issue, then Qwen wouldn’t be as popular as it is, considering it comes from Alibaba, another Chinese company.

What actually matters to people fine-tuning and deploying these models is the license—can they use it however they want, or are there restrictions? The second big factor is the model itself—does it actually serve their use case well, or does it have implicit guardrails that limit what it can do?

So while geopolitical concerns might be part of the public conversation, what we see among enterprises and ML teams is that practical constraints—like licensing and performance—drive the decision far more than the origin of the model.

John

Yeah, that’s actually quite surprising. I don’t think most people realize that Qwen had a bit of a head start over DeepSeek and has already gained double-digit adoption—at least on your platform. I didn’t expect it to be that widely used. Do you think Qwen is underappreciated in the broader fine-tuning community? Should practitioners be giving it more attention when considering which models to fine-tune?

Travis

Yeah, I definitely think Qwen is underrated in a lot of ways. One thing I find interesting is that if you look at benchmarks for top coding models, you’ll see names like GPT-4 Turbo, Gemini, and Claude, but Qwen performs surprisingly well in those comparisons. And this isn’t just about open-source models—it holds up against the best proprietary models too.

John

Yeah, I think part of the excitement around DeepSeek isn’t just about performance and cost, but also the fact that they were able to kickstart a self-reinforcing learning loop—what people are calling the "aha moment." The fact that this result was also independently verified gives practitioners confidence that they can reproduce it themselves.

For this audience, could you briefly explain what reinforcement fine-tuning (RFT) is and why people in this space are so excited about it? What makes it potentially a game-changer?

Travis

Yeah, absolutely. I think this all starts with the idea of reasoning models—what reasoning actually means from the perspective of LLMs. This goes back to o1 and its preview (o1-preview) from OpenAI last year, which, as far as I know, was the first widely available model designed as a native chain-of-thought (CoT) model.

What that means is that instead of just predicting an answer directly, the model was trained to generate intermediate reasoning steps, effectively thinking before giving a final answer. It’s not just prompted to do this—it’s actually trained to generate tokens in a way that follows a structured, step-by-step reasoning process rather than responding instantly.

There’s a lot of research suggesting that for certain tasks, LLMs perform significantly better when they break problems down step by step instead of just trying to predict a final answer in one shot. Math is a great example because LLMs are notoriously bad at doing math in their "heads"—meaning, just through their activations alone. But math problems have a nice property: they can be broken down into smaller subproblems, which allows a model to incrementally build toward a correct solution. When you force an LLM to solve a math problem step by step, it tends to perform much better. That was the intuition behind this first generation of reasoning-focused models.

What made DeepSeek R1 particularly notable is that after OpenAI’s o1 model, everyone was expecting that, eventually, someone would release a strong open-source reasoning model. But the real significance of R1 is what it did beyond just reasoning capabilities.

With OpenAI’s models, people were already trying to distill their outputs—essentially taking GPT-4's responses and using them to train smaller, open models that could perform comparably. Companies like OpenAI obviously don’t like this, because it reduces their competitive moat. So, what they did with o1 was train the model to think through problems internally but hide that process from users—only showing the final answer. That made it much harder to distill the model’s thought process and train competitive open alternatives.

For months, the industry was waiting for someone to release an open-source model that could reason just as well—but transparently. DeepSeek R1 was that moment. It wasn’t just another attempt at building a reasoning model—it was the first open-source model that was actually competitive with OpenAI’s models on reasoning benchmarks.

Up until that point, most open-source attempts at reasoning models hadn’t been that great. R1 was different—it wasn’t just good for an open-source model, it was good, period. It held its own against o1 on key benchmarks, proving that reasoning capabilities could be achieved in an open model without relying on closed systems. That’s what made it such a big deal.

John

Oh, I just wanted to confirm—so for practitioners in the field, it wasn’t a total surprise that an open-source reasoning model at this level came out? It was just a matter of time?

Travis

Yeah, I figured it was only a matter of time before someone did it. If there had been a betting market on it, most people probably would have expected Meta to be the first to release a high-quality open-source reasoning model. As for how close it would be to frontier models like OpenAI’s, that was more of an open question.

I think people probably didn’t expect DeepSeek R1 to be as close in performance to o1 as it was. But there’s also the timing factor—if o3 had been released earlier, and DeepSeek had been benchmarked against that instead of o1, the comparison might not have looked quite as strong. It’s hard to say. But the timing worked out really well for them—they published their results just before o3 came out, so when they compared against the state of the art at that moment, they looked very competitive.

It was definitely expected that someone would release a high-quality open-source reasoning model, but I don’t think anyone expected it to be DeepSeek. They had already done some solid work with models like DeepSeek V3, but this release really caught people off guard.

What made it particularly interesting to me was that they didn’t just release the model—they gave the full recipe for how they trained it. Their paper was extremely transparent, and that’s why people were able to independently reproduce many of their results. Hugging Face, for example, has been working on a project called OpenR1, which aims to replicate DeepSeek R1. Others have also verified the "aha moment" effect through their own experiments.

To explain that a bit more, DeepSeek’s paper described a method where you take a general-purpose instruction-tuned model—in their case, DeepSeek V3—and you simply prompt it with something like, “Think step by step and solve the problem.” Then, using a verifiable reward signal, which is a form of reinforcement learning with verifiable rewards (RLVR), you can tell the model, "Yes, this answer was correct" or "No, this was wrong," and automatically refine its approach over time.

The breakthrough came when the model started developing novel problem-solving strategies on its own. It wasn’t just memorizing examples—it was generalizing and discovering new ways to solve problems without explicit demonstrations. That’s what they called the "aha moment"—the realization that the model spontaneously figures out better reasoning techniques that weren’t manually taught to it.

To me, that was actually the most important takeaway from the DeepSeek paper. Since then, this effect has been successfully reproduced, and it was a major inspiration for how we built reinforcement learning into our own platform as well.

John

Let’s talk about the commercial and enterprise implications of being able to replicate DeepSeek R1’s reinforcement fine-tuning (RFT) process. Now that this is possible, how can enterprises use these techniques to create their own company-specific reasoning models, if that’s even feasible?

Before diving into how your platform facilitates that process, could you first give a high-level overview of what use cases might motivate a company to train a custom reasoning model using RFT? And could you also give a clear definition of what we mean by reinforcement fine-tuning (RFT)?

Travis

Yeah, absolutely. RFT, or reinforcement fine-tuning, is another name for reinforcement learning with verifiable rewards (RLVR). OpenAI coined the term RFT late last year when discussing similar methodologies they use internally. Essentially, it’s a fusion of reinforcement learning (RL) and supervised fine-tuning (SFT).

The way I think about it, RFT takes problems that would traditionally be solved with supervised fine-tuning and instead applies reinforcement learning techniques to optimize the model. The key difference is that RFT is designed for objective, verifiable tasks—cases where there is a clear right or wrong answer—as opposed to subjective tasks where human preference plays a bigger role.

Traditional reinforcement learning with human feedback (RLHF), for example, is based on subjective preferences. You show a model two responses, and a human ranks them—"this one is better than that one"—and the model learns based on those rankings. That works well for tasks like chatbot personality alignment, where there’s no single correct answer.

But RFT is different because it applies reinforcement learning to tasks where correctness can be verified. Examples include classification tasks, code generation, and problem-solving with a clear ground truth.

For classification, you have labeled data, so you know the correct category an input should belong to. For code generation, even if you don’t have labeled data, you can still use a compiler and unit tests to verify whether the output actually works. That verification signal allows the model to self-improve without requiring massive amounts of manually labeled data.
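To make "verifiable" concrete, here is a hedged sketch of the two signals just described: exact-match checking for classification, and unit-test execution for generated code. The function names and the test harness are hypothetical.

```python
# Sketch of verifiable reward signals (illustrative; names and harness are hypothetical).
import subprocess
import tempfile

def classification_reward(predicted_label: str, true_label: str) -> float:
    """Labeled data gives a clear right/wrong signal: 1.0 if correct, 0.0 otherwise."""
    return 1.0 if predicted_label.strip().lower() == true_label.strip().lower() else 0.0

def code_reward(generated_code: str, unit_tests: str) -> float:
    """No labels needed: run the candidate code against unit tests and reward only passes."""
    with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
        f.write(generated_code + "\n\n" + unit_tests)
        path = f.name
    result = subprocess.run(["python", path], capture_output=True, timeout=10)
    return 1.0 if result.returncode == 0 else 0.0
```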

When thinking about how RFT compares to SFT, the key difference is data efficiency. Supervised fine-tuning requires large amounts of labeled data, because the model essentially memorizes patterns and tries to generalize from them. That’s a problem when labeled data is scarce.

In contrast, RFT doesn’t explicitly tell the model what the right answer looks like. Instead, it gives a reward signal that tells the model whether its output was correct and how close or far off it was. The model adjusts itself dynamically, optimizing for higher rewards. This creates room for the model to be creative and discover new problem-solving strategies, rather than just imitating past examples.

This flexibility is particularly important for reasoning models, because they don’t just generate a final answer—they construct a chain of thought along the way. A key challenge for training reasoning models is where do you get enough high-quality reasoning data? One approach is distilling a strong reasoning model, like R1, where you generate inputs, let R1 solve them, and then filter for correctness before training a smaller model on that dataset.

That approach works if you can generate hundreds of thousands of examples, but it has a hard ceiling—the student model can never surpass the teacher model. If you need a model to develop novel strategies beyond what pre-trained models can do, RFT becomes essential because it allows the model to actively explore new approaches instead of being limited to mimicking existing solutions.
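Predibase's RFT stack is not public, but the general recipe described above (sample completions, score them with a verifiable reward, update the policy) can be approximated with open-source tooling. The sketch below uses Hugging Face TRL's GRPOTrainer as a stand-in; the base model, dataset, and reward function are placeholder assumptions, not a recommended configuration.

```python
# RLVR-style training sketch using TRL's GRPOTrainer (a stand-in, not Predibase's pipeline).
from datasets import Dataset
from trl import GRPOConfig, GRPOTrainer

# Tiny placeholder dataset of prompts with known answers; RFT can start from very few rows.
examples = [
    {"prompt": "What is 17 + 25? Think step by step, then answer with just the number.",
     "answer": "42"},
]
train_dataset = Dataset.from_list(examples)

def correctness_reward(completions, answer, **kwargs):
    # Verifiable reward: 1.0 if the expected answer appears in the completion, else 0.0.
    return [1.0 if ans in completion else 0.0 for completion, ans in zip(completions, answer)]

config = GRPOConfig(output_dir="rft-sketch", num_generations=4, logging_steps=1)
trainer = GRPOTrainer(
    model="Qwen/Qwen2.5-1.5B-Instruct",  # assumed small base model
    reward_funcs=correctness_reward,
    args=config,
    train_dataset=train_dataset,
)
trainer.train()
```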

John

One possible interpretation of why OpenAI doesn’t provide the raw chain of thought could be that they want to deter this kind of behavior—essentially stopping the community from using their model outputs to generate training samples for competing models. Does that seem like a reasonable explanation for why we only get a cosmetically summarized version instead of the unfiltered reasoning process?

Travis

That’s exactly right. I think their main concern is competitors—they don’t want DeepSeek or another company distilling their models and replicating their capabilities overnight. It makes sense from their perspective—if you put in years of work and billions of dollars into training a model, you wouldn’t want someone to copy it in a day by extracting its reasoning process.

But while this makes sense for OpenAI’s business strategy, it’s not great for enterprises that want to build their own models for custom use cases. That’s why open-source models are so important—they provide transparency and allow companies to train models that actually fit their specific needs instead of relying on a closed system.

This is where RFT becomes particularly valuable. Since it allows the model to explore different problem-solving strategies, it enables reasoning models to discover new approaches that might be better suited to a company’s unique domain. It’s not just about mimicking human reasoning; it’s about enabling models to develop novel ways of thinking that weren’t explicitly trained into them.

And this isn’t just relevant to reasoning tasks—it also applies to code generation. You might have a reference implementation of a function, but there’s always room for improvement. Maybe the model can come up with a version that’s twice as fast or uses fewer resources. By allowing the model to explore different solutions through reinforcement fine-tuning, you can uncover optimizations that wouldn’t have been possible with traditional supervised fine-tuning alone.

John

So the big catch here seems to be that RFT is primarily designed for verifiable tasks—things like coding, math, protein classification, or gene mutation prediction, which OpenAI has also highlighted in their demos.

But from a high-level perspective, can’t you reframe a lot of subjective problems to fit into this framework? For example, you could use an LLM as a judge, assigning scores to outputs. That way, you could theoretically apply RFT to train a writing assistant by giving partial credit to essays that aren’t perfect but still decent.

Is there a way to cast subjective tasks as verifiable problems so that RFT can be applied to domains where human preference plays a bigger role?

Travis

Yeah, I think you can, but there’s an important caveat to consider.

For something like creative writing, you’re heavily relying on the judge model’s reliability. If you’re using an LLM as a judge, you’re essentially saying, "I fully trust this model to decide what is good and what is bad." That might be fine for some tasks, but it can also be limiting in ways that aren’t immediately obvious.

One risk is that the model could end up optimizing for a narrow, homogeneous style—one that mimics the biases of the LLM judge rather than actually improving creativity. If the LLM judge prefers overly structured, conventional writing, the fine-tuned model might learn to produce formulaic responses, even when more originality would be better.

So while using an LLM judge is one way to create a reward function for subjective tasks, it’s worth being aware of the trade-offs. The more objective a task is, the better it aligns with RFT. For subjective problems, it’s still possible to apply RFT, but you need to be careful about how the reward function is designed to avoid reinforcing unintended biases.

John

Got it. So what are some verticals or specific use cases where RFT can be applied right now, besides coding and math?

And maybe this is a good segue into talking about how you’re building RFT into your platform.

Since you mentioned that RFT is very data-efficient, could this be a way for non-technical users to participate in fine-tuning? Could business users provide training labels and help drive the fine-tuning workflow, making it more accessible to entire organizations?

Travis

Yeah, absolutely.

In terms of use cases, my favorite right now is definitely text-to-SQL and text-to-programming-language, especially for domain-specific languages (DSLs). RFT is a perfect fit for these because the outputs are highly structured and easily verifiable, and we’ve already worked with several companies on these types of applications.

Other great use cases include classification tasks and named entity recognition (NER), particularly in domains where labeled data is scarce, such as legal and medical fields. Traditional supervised fine-tuning struggles when there isn’t much training data, often leading to overfitting. But RFT doesn’t suffer from that problem as badly, making it a strong alternative when high-quality labeled datasets are limited.

Another area where I see a lot of potential is advanced reasoning tasks, particularly in hard RAG (retrieval-augmented generation) problems. We’re starting to see chain-of-thought reasoning become valuable in these contexts. Recent research on chain-of-thought RAG and planning agents suggests that RFT could significantly improve multi-step retrieval and reasoning.

Another really exciting use case is function calling. This is another highly verifiable task where you can execute the function calls selected by the model and verify whether they worked as expected. Since function calling is crucial for LLM-driven automation, RFT could play a big role in optimizing API orchestration and autonomous agent workflows.
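Text-to-SQL, mentioned above, is a good illustration of why these tasks count as verifiable: the generated query can simply be executed against a database and its results compared with a reference query. A hypothetical sketch:

```python
# Hypothetical verifiable reward for text-to-SQL: run both queries and compare result sets.
import sqlite3

def sql_reward(generated_sql: str, reference_sql: str, db_path: str) -> float:
    conn = sqlite3.connect(db_path)
    try:
        generated_rows = conn.execute(generated_sql).fetchall()
        reference_rows = conn.execute(reference_sql).fetchall()
    except sqlite3.Error:
        return 0.0  # the generated query did not even run: no reward
    finally:
        conn.close()
    # Full credit only when the generated query returns the same rows (order ignored).
    return 1.0 if set(generated_rows) == set(reference_rows) else 0.0
```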

Now, in terms of what we’re building into our platform, we’re making reinforcement fine-tuning a fully integrated, generally available feature within Predibase. The goal is to deliver a strong offering on three key fronts.

First, we’re focusing on managed infrastructure that’s fast, scalable, and highly efficient. We want to abstract away the complexity of orchestrating all the moving parts—handling scaling, keeping throughput high, and ensuring that all components interact seamlessly.

Second, we’re building out advanced training optimization techniques to improve the efficiency of learning. One of the biggest mistakes we see is people dumping all training data into the model at once, which isn’t the most effective way for a model to learn.

If you think about how humans learn math, for example, you wouldn’t just throw every math problem at someone and say, “Figure it out.” Instead, you structure learning in phases—starting with basic arithmetic, then algebra, then geometry, then calculus.

We see the same effect with reinforcement fine-tuning. Instead of overwhelming the model with complex problems all at once, it makes more sense to gradually expose it to increasing difficulty levels so it can build up reasoning capabilities step by step. We’re baking this structured learning approach directly into the platform, allowing users to strategically curate training data for optimal learning efficiency.

By making RFT more structured and accessible, we think we can open the door for non-technical users to actively participate in fine-tuning workflows, whether that’s by curating datasets, setting evaluation criteria, or guiding model training through reward mechanisms.

John

Is it basically monitoring the loss and adapting what kind of data gets fed into the RL process? Is that what’s happening? Is it a form of adaptive training, or is there something else going on?

Travis

Yes, it’s adaptive training in the sense that we monitor how difficult the examples are perceived to be by the model. Based on that, we dynamically adjust what data we send for training at each step. The idea is to avoid sending examples that are too difficult too early, because that would just waste compute cycles without meaningful learning. Instead, we structure the training so that the model builds up its capabilities incrementally, handling progressively harder examples as it improves.
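A rough sketch of that difficulty-based scheduling idea: estimate how hard each example currently is for the model, then only feed it examples it has a fighting chance on. The helpers (model.generate, reward_fn) are hypothetical stand-ins, and Predibase's actual scheduling logic is not public.

```python
# Simplified difficulty-based curriculum sketch (illustrative; helpers are hypothetical).
def estimate_difficulty(model, example, reward_fn, n_samples: int = 8) -> float:
    """Fraction of sampled completions that fail the reward check: 0.0 = easy, 1.0 = hopeless."""
    completions = [model.generate(example["prompt"]) for _ in range(n_samples)]
    failures = sum(1 for c in completions if reward_fn(c, example) == 0.0)
    return failures / n_samples

def select_batch(model, pool, reward_fn, max_difficulty: float = 0.9, batch_size: int = 16):
    """Pick the hardest examples the model can still occasionally solve."""
    scored = [(estimate_difficulty(model, ex, reward_fn), ex) for ex in pool]
    workable = [(d, ex) for d, ex in scored if d <= max_difficulty]  # skip hopeless examples for now
    workable.sort(key=lambda pair: -pair[0])                         # hardest solvable first
    return [ex for _, ex in workable[:batch_size]]
```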

One of the biggest differences between RFT and supervised fine-tuning (SFT) is in where the effort is concentrated. With SFT, all the effort is front-loaded—you need to generate a large, high-quality labeled dataset, clean it, verify it, and ensure everything is error-free before training even begins. But once that dataset is prepared, fine-tuning is fairly simple: upload the data, push a button, and get a model.

With RFT, the process is different. You don’t need massive labeled datasets—you can start with as few as 10 hand-crafted examples, which makes the initial setup much easier. But the real effort shifts to a new challenge: reward function modeling.

This is where you, as an operator, play the role of a teacher. The model generates an output, and you have to evaluate it: what’s right, what’s wrong, and where did it go wrong? Instead of just providing labeled examples upfront, you write reward functions—a set of rules or heuristics that score the model’s responses at every training step.

Our product is designed to make this process interactive and visual. You can see how the model’s responses evolve over time, tracking its learning step by step. For example, you might notice that at epoch 1, the model gets certain basic structures correct, but consistently misuses a specific variable name. By epoch 10, it starts structuring its answers correctly but still makes subtle mistakes in syntax or logic.

At that point, it becomes an interactive loop—you refine your reward functions to steer the model away from persistent mistakes. Instead of treating training as a one-and-done process, RFT makes it adaptive, allowing you to actively guide the model’s learning in response to how it performs.

And yes, we’re building a UI for this, so you can visually inspect how the model improves over time and adjust your reward functions without needing to dive into raw logs or manual scoring.
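To illustrate the kind of rule-based scoring being described (and the grading-rubric framing John raises next), here is a hypothetical rubric-style reward function for a JSON-extraction task. The point weights and checks are assumptions, not a recommended recipe.

```python
# Hypothetical rubric-style reward for a JSON-extraction task (weights are assumptions).
import json

def rubric_reward(completion: str, expected: dict) -> float:
    score = 0.0
    # 0.2 points: the output parses as JSON at all.
    try:
        parsed = json.loads(completion)
        score += 0.2
    except json.JSONDecodeError:
        return score
    # 0.3 points: every expected field is present.
    if all(key in parsed for key in expected):
        score += 0.3
    # 0.5 points: proportional credit for field values that match the ground truth.
    correct = sum(1 for key, value in expected.items() if parsed.get(key) == value)
    score += 0.5 * (correct / len(expected))
    return score  # between 0.0 and 1.0
```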

John

So you’re saying that during the training process, you can interact with the model’s learning behavior through a UI, see where it’s repeatedly making mistakes, and adjust the reward function on the fly?

And the mental model for the reward function is that it works like a grading rubric—you might assign X points for formatting, Y points for correctness, and so on. As an ML engineer, your role shifts from just curating high-quality data (which is the focus in supervised fine-tuning) to taking on a more active teaching role.

With SFT, the hard part was gathering and labeling thousands of data points, which could take months. But now, with RFT, it’s more like working in a UI where you’re actively guiding the model’s learning, almost like having a conversation with it.

Would you say that’s a fair interpretation of this process?

Travis

That’s exactly right. The way we envision it, you’re essentially having a dialogue with the model. It’s like, "Hey, is this right?" and you, as the teacher, have a chance to audit its reasoning, correct mistakes, and refine the reward function accordingly.

The model continues to train, but at any point, you can step in, review its behavior, and tweak the reward function to steer it toward better answers.

A good example of why this active feedback loop is crucial comes from a common problem in reinforcement fine-tuning, known as reward hacking. This is similar to overfitting in supervised fine-tuning but manifests differently. Instead of memorizing training data, the model finds loopholes in the reward function and exploits them.

We ran into this issue when working on code translation—converting source code from one language to another. The model figured out a hack: instead of actually translating the code, it would generate the solution in the original language, then write a dummy function in the target language that did nothing, and return the original answer.

Obviously, that’s not what we wanted—the model was gaming the reward function instead of solving the problem properly. To fix this, we had to modify the reward function to check:

  • If the model's output was the same both with and without the function in the target language, it was likely hacking.

  • If the translated code actually executed correctly and produced the right output, then it was a valid translation.

This kind of adaptive intervention is what makes RFT fundamentally different from SFT—instead of just fine-tuning a model with static labeled data, you’re actively shaping its learning process in real time.
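As a concrete illustration, here is a hedged sketch of what the hardened check described above might look like in code. The helpers (run_program, strip_target_function) are hypothetical stand-ins for a sandboxed execution harness.

```python
# Sketch of a reward function hardened against the translation "hack" described above.
def translation_reward(model_output: str, reference_output: str, test_input: str) -> float:
    # Rule 1: if the program behaves the same with and without the target-language function,
    # that function is doing nothing: the model is gaming the reward, so it gets none.
    with_target = run_program(model_output, test_input)
    without_target = run_program(strip_target_function(model_output), test_input)
    if with_target == without_target:
        return 0.0

    # Rule 2: otherwise, reward the translation only if it actually executes and reproduces
    # the reference behavior on the test input.
    return 1.0 if with_target == reference_output else 0.0
```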

John

Models are getting smart enough to find shortcuts, hacking their way through training instead of genuinely learning. As a teacher, you have to actively monitor and correct this behavior, punishing these kinds of reward exploits when necessary.

This feels very different from traditional fine-tuning methods. For someone like me, who has fine-tuned models before, this level of interactivity—where you’re engaged throughout the process rather than just staring at TensorBoard metrics—feels like a UX problem as much as a machine learning problem.

Travis

Yeah, I completely agree. This is probably the most UX-heavy feature we’ve ever built, because it’s fundamentally about how humans interact with and guide a model’s learning in real time.

One last point on this—you mentioned whether non-technical users could get involved. Right now, the biggest limitation is writing reward functions. Theoretically, someone has to code these functions to evaluate model outputs, which seems like a barrier for non-technical users.

But in reality, they don’t have to code—that’s the interesting part. We’ve reached a point where AI coding assistants like Cursor are good enough that you can just describe what you want, and it will generate the function for you.

So we’re working on a way to let users define reward functions in natural language, rather than needing to write Python. Instead of manually coding evaluation rules, users will be able to describe their grading rubric in plain English—things like "It must follow this format" or "It should answer in this way"—and the system will automatically translate that into executable code.

This would open up RFT to non-technical users, allowing domain experts to contribute directly to fine-tuning, even if they don’t know how to program.

John

What if your platform suggested some pre-built reward functions that users could compose together when first brainstorming how to structure the reward system?

For example, you could provide a rubric template for different types of problems. I’m not sure how effective that would be, but it might help reduce cognitive load, so users don’t have to write a perfect reward function on their first attempt.

Travis

Yeah, I think there’s definitely a lot of potential for that. We’re already exploring template-style rubrics that can serve as defaults for common tasks—things like grading a code generation task or evaluating a named entity recognition (NER) model. Many of these grading structures end up being fairly standard, so having predefined rubrics could help users get started quickly.

That said, once you dig into the specifics of a problem, things often become more nuanced. A simple reward function—like 1 point if correct, 0 if wrong—is a great way to start. But as the model progresses, you start to see where it gets stuck, where it fails to improve, or where it finds exploits.

At that point, the process becomes much more interactive. Instead of trying to design a perfect reward function upfront, you build it incrementally, refining it as the model evolves. That’s a major difference from supervised fine-tuning (SFT), where all the effort goes into preparing a perfect dataset before training even begins.

With RFT, the reward function evolves alongside the model, making it a more iterative, adaptive process.

John

So with RFT, you still need to start with a powerful enough base model, especially if you don’t have a lot of fine-tuning samples. You can apply RFT to a 7-billion-parameter model, but wouldn’t a larger model—like 70 billion parameters—be more sample-efficient and learn faster?

Is that still the case? And is reward hacking and shaping the reward function the biggest bottleneck in achieving high performance with RFT? Or is there some hidden bottleneck that people aren’t talking about when it comes to fine-tuning with RFT?

Travis

Yeah, in terms of model size, the biggest difference I’ve noticed between larger and smaller models is how quickly they pick up fundamentals—things like formatting, syntax, and structure. Larger models get those things right almost immediately, while smaller models may require a few more iterations to lock them down.

But in terms of final convergence, they actually perform surprisingly well depending on the task. For code generation, for example, you don’t need a massive 700-billion parameter model. Many use cases work really well with something in the 30-billion parameter range, or even 8 billion parameters, depending on task complexity.

So I actually still recommend starting with smaller models because they’re cheaper and faster to fine-tune. If you’re not getting the performance you need, you can scale up gradually rather than defaulting to a massive model upfront.

As for bottlenecks, reward hacking and shaping the reward function are definitely big challenges, but they’re not necessarily the biggest hidden issue. One thing that isn’t talked about enough is the stability of the training process itself.

When working with RFT, it’s not just about defining a good reward function—you also have to make sure the reward signal doesn’t collapse or lead to unintended learning dynamics. If the model starts exploiting reward shortcuts or the reward signal becomes too weak or inconsistent, training can fail to generalize or converge in a suboptimal way.

A lot of hidden challenges come from ensuring the reward shaping process remains stable, which requires careful tuning—not just of the reward function, but also of learning rates, exploration strategies, and training schedules. These factors can have a huge impact on how well RFT actually works in practice.
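
One concrete symptom of a weak or collapsing reward signal, at least in group-based RL methods like the one popularized by R1, is that every sampled completion for a prompt ends up with the same reward, so the relative advantages are all zero and that prompt contributes nothing to learning. A tiny illustrative health check, not taken from any particular library:

```python
import statistics

def reward_signal_health(group_rewards: list[float], min_std: float = 1e-3) -> dict:
    """Illustrative between-step check: if every sampled completion in a
    group gets (nearly) the same reward, the advantage signal is ~zero and
    this prompt provides no useful gradient for the current step."""
    if not group_rewards:
        return {"mean": 0.0, "std": 0.0, "collapsed": True}
    std = statistics.pstdev(group_rewards)
    return {
        "mean": statistics.mean(group_rewards),
        "std": std,
        "collapsed": std < min_std,
    }
```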

John

Second question—what is the bottleneck now in this whole process?

If having a large number of training samples is no longer the main issue, then what is? Is it the reward function and monitoring the training process, or is there something else?

Travis

Yeah, the reward function is definitely the biggest bottleneck. Getting a good reward function can be really challenging. That’s where you end up spending about 80 percent of your time when working with this.

John

So is it basically just people guessing right now—just going off their intuition or experience to decide what makes a good reward function? Or is there a more systematic approach to this?

Travis

Yeah, I think at the beginning, it’s a bit subjective and vibes-based. To your point, you start with criteria like formatting, compilation, successful output—you have a sense of what matters. But you’re also responsible for weighting these factors, and in the abstract, it’s hard to know the perfect balance.

So early on, it’s more intuition-driven than strictly scientific. But once you start running the training process, it becomes a lot more empirical. You can observe how the model is learning, what mistakes it keeps making, and use that feedback to refine the reward function. If it’s consistently getting something wrong, you tweak the function to push it toward the correct behavior. That iteration process is where most of the time goes.
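
As an illustration of that weighting-and-iteration loop, here is a hedged sketch of a composite reward for a code-generation task. The criteria, weights, and the run_in_sandbox helper are all hypothetical; the weights dictionary is exactly the knob you end up revisiting after watching a few training runs:

```python
import ast

# Hypothetical weights; in practice these are what you keep adjusting
# as you watch how the model learns.
WEIGHTS = {"formatting": 0.2, "compiles": 0.3, "correct": 0.5}

def grade_code(code: str, expected_output: str, run_in_sandbox) -> float:
    """Composite reward for a code-generation task (illustrative sketch)."""
    scores = {"formatting": 0.0, "compiles": 0.0, "correct": 0.0}

    # Formatting: the completion is bare code that starts with a function definition.
    scores["formatting"] = 1.0 if code.strip().startswith("def ") else 0.0

    # Compilation: the code parses as valid Python.
    try:
        ast.parse(code)
        scores["compiles"] = 1.0
    except SyntaxError:
        pass

    # Correctness: run_in_sandbox is an assumed helper that executes the
    # code safely and returns its output for comparison.
    if scores["compiles"] and run_in_sandbox(code) == expected_output:
        scores["correct"] = 1.0

    return sum(WEIGHTS[k] * scores[k] for k in WEIGHTS)
```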

There’s also an aspect of data that matters. If the task is too hard from the start and the model can’t get anything right initially, learning can stall. In those cases, simplifying examples can help it overcome that cold start problem.

This is also where supervised fine-tuning (SFT) can play a role. If you look at the R1 paper—going back to DeepSeek for a second—one thing they did was bootstrap the model with a set of high-quality reasoning traces, then fine-tune it with SFT before applying reinforcement learning.

So SFT still has a place alongside RFT, especially when the problem is so difficult that the model doesn’t even know how to begin solving it. In some cases, data can still be a bottleneck, but by and large, the reward function is where most of the effort goes.

John

I think what surprises a lot of people is that even practitioners don’t always realize how much model babysitting is involved, even with frontier techniques like RFT. There are so many judgment calls in the process.

For a field that’s supposed to be very quantitative, we still see a lot of high-level abstraction and intuition-based decisions. Ten years ago, people were frustrated with deep learning because they couldn’t fully explain how it worked at a mathematical level, and even now, things are becoming more abstract rather than more transparent.

What are your top five use cases for RFT—both generally applicable ones and domain-specific ones? What are the highest-conviction use cases where you see immediate results?

Travis

Yeah, my highest-conviction use case is definitely anything in the text-to-SQL space.

Text-to-SQL problems are really common, and I think the reason is that a lot of companies develop their own domain-specific languages (DSLs) internally—essentially extensions of existing query languages.

For example, say a company has an internal BI tool for accessing all their data. They might develop a custom DSL to streamline queries. Some companies also build domain-specific languages for their customers.

One example that comes to mind is the Grafana and Prometheus ecosystem. There’s a language called PromQL that lets users query over their metrics. Since those tools are built around storing and visualizing metrics, they needed an efficient way to query those results, and PromQL was developed for exactly that.

A great RFT use case would be a tool where a user just says, “I want these metrics,” without knowing PromQL syntax. The model could then generate the correct PromQL query for them. That kind of application—bridging natural language and domain-specific languages—is a really strong fit for RFT. There are tons of use cases like this.
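
For the plain text-to-SQL flavor of this, the usual verifiable signal is execution accuracy: run the generated query and a reference query against the same database and compare the results. A minimal sketch, assuming an in-memory SQLite database purely for illustration (a PromQL or custom-DSL variant would swap in that engine's executor as the verifier):

```python
import sqlite3
from collections import Counter

def execution_match_reward(generated_sql: str, reference_sql: str,
                           db_path: str = ":memory:") -> float:
    """Reward 1.0 if the generated query returns the same multiset of rows
    as the reference query, 0.0 otherwise (including execution errors)."""
    conn = sqlite3.connect(db_path)
    try:
        expected = conn.execute(reference_sql).fetchall()
        try:
            actual = conn.execute(generated_sql).fetchall()
        except sqlite3.Error:
            return 0.0  # invalid SQL gets no credit
        # Order-insensitive comparison of result rows; a stricter grader
        # might also give partial credit for overlapping rows.
        return 1.0 if Counter(expected) == Counter(actual) else 0.0
    finally:
        conn.close()
```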

John

Maybe with RFT, you guys could create something that writes OpenTelemetry specs and queries. No one really wants to learn it, but it's kind of important. I just realized you were talking about DSLs, but yeah.

Travis

Yes, exactly—DSLs, domain-specific languages. There are a lot of these out there.

And it’s not just about query languages; it extends to frameworks too. Say you’re trying to do an API integration with something like HubSpot. Tons of companies have their own SDKs and APIs, and integrating with them can be a huge pain.

Instead of figuring everything out yourself or hopping on a call with their solutions engineering team, what if you had a model that could just generate the integration code for you? Companies could use it to help their customers onboard faster. This is almost like a…

John

A new category of integrations—like a neural integration or something, right? You just feed in the entire documentation or schema, and then it just... I mean, to your point, billions of dollars are spent on these integrations. There are entire companies dedicated to this. If you frame it that way, it actually sounds really exciting.

Travis

For sure. I love the potential here. GPT-4 and similar models are already great at writing Python and general-purpose code, but they struggle with these more niche use cases. Even if they can find documentation, they haven’t seen enough real-world examples to generate truly reliable solutions.

But the advantage is that these are verifiable tasks. You can curate a set of common integrations, write unit tests for them, and then let the model generate code, run the tests, and iterate until it produces a working solution—all in an automated process. That’s why I see this as my number one use case.
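
That automated loop maps naturally onto a reward function: run the model's generated integration code against a curated set of unit-style tests and score the fraction that pass. A minimal sketch, assuming the generated function has already been extracted and executed safely elsewhere (the helper and test-case format are hypothetical):

```python
def test_pass_rate_reward(candidate_fn, test_cases) -> float:
    """Fraction of unit-style test cases the generated function passes.

    candidate_fn: the callable produced by the model (assumed already
    extracted and loaded in a sandbox); test_cases: list of (args, expected).
    """
    if not test_cases:
        return 0.0
    passed = 0
    for args, expected in test_cases:
        try:
            if candidate_fn(*args) == expected:
                passed += 1
        except Exception:
            pass  # crashes count as failed tests, not as training errors
    return passed / len(test_cases)
```

A reward of 0.8 then just means eight of ten tests passed, and the training loop keeps proposing code until that fraction climbs.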

Another major use case is transpilation. When I was an intern, I worked on a lot of Fortran-to-C++ and Fortran-to-Java conversions. Those are also highly verifiable. If you have a compiler or interpreter for one language and another for the target language, you can execute both and directly verify whether the output matches.
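
The transpilation check is essentially differential testing: run the original program and the candidate translation on the same inputs and compare outputs. A rough sketch, assuming both sides are already compiled to executables that read a test case from stdin (hypothetical setup):

```python
import subprocess

def transpile_reward(reference_bin: str, candidate_bin: str,
                     test_inputs: list[str]) -> float:
    """Fraction of test inputs on which the transpiled program's output
    matches the original's (illustrative differential-testing reward)."""
    if not test_inputs:
        return 0.0
    matches = 0
    for case in test_inputs:
        try:
            ref = subprocess.run([reference_bin], input=case, capture_output=True,
                                 text=True, timeout=10)
            cand = subprocess.run([candidate_bin], input=case, capture_output=True,
                                  text=True, timeout=10)
        except subprocess.TimeoutExpired:
            continue  # hangs count as mismatches
        if cand.returncode == 0 and cand.stdout == ref.stdout:
            matches += 1
    return matches / len(test_inputs)
```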

John: For this audience, can you help us appreciate just how big that market is? If you could really figure out transpiling, how big of a deal would that be?

Travis: Yeah, well, if your business has been around for decades, you’ve probably already had to deal with modernization efforts in some form. The languages we use today aren’t the same as those used 40 or 50 years ago—or even 20 years ago, in many cases.

And it’s not always about a complete overhaul. Even upgrading from an older version of Java to a much newer one can feel like switching to a completely different language. There are tons of legacy codebases out there still running on COBOL, Fortran, and even C that companies are trying to migrate off of.

John

If you could modernize all the legacy code into something like the Python stack—or whatever the preferred modern equivalent is—how much value would that create? That would be measured in billions, right?

Travis

I think so, yeah. You also have to consider the legacy maintenance costs, which are a huge issue. A lot of these systems are maintained by COBOL developers who are in their fifties or older. Younger engineers don’t want to work on those problems, so hiring is incredibly difficult. In fact, COBOL programmers actually get paid really well precisely because of this shortage.

So as these people start to retire, who’s going to maintain these systems? The ability to modernize them automatically is massive—not just in terms of cost savings, but also for the long-term health and sustainability of these businesses.

Another big opportunity I’d call out is in traditional supervised fine-tuning use cases where there isn’t much labeled data. Classification, domain-specific named entity recognition, and information extraction are all common problems. Tons of companies struggle with these tasks because they don’t have enough labeled examples. Since labeled data is expensive and time-consuming to curate, any approach that helps here could really move the needle.

John

What’s the most oddball but genius use case of RFT you’ve heard of in the last month?

Travis

So I thought of something the other day—maybe someone will hear this and go write a paper about it.

One idea I found really interesting was using RFT to optimize CUDA kernels. We did some work on writing CUDA kernels with RFT, but the real goal isn’t necessarily to generate a model that knows how to write CUDA in general. Sometimes, you just want to optimize a single, highly specialized kernel—like an attention kernel.

People have spent countless hours trying to make attention computation faster. So imagine an RFT process that starts with just one example—the attention kernel—and then runs for hundreds of hours, continuously optimizing it. The model could be trained to maximize speed while still producing the correct output, using a verifiable process where the reward function measures both accuracy and execution time.

If the reward function is pushing for faster performance while maintaining correctness, the system would keep finding clever optimizations to speed it up. I think there’s a lot of potential for creative applications like this, especially in scientific computing.
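
The shape of that reward is easy to sketch: gate hard on correctness, then pay out for speed relative to a baseline. The snippet below is purely illustrative, with Python-level timing and hypothetical candidate_fn/baseline_fn wrappers; real CUDA benchmarking would also need warm-up runs and device synchronization:

```python
import time
import numpy as np

def kernel_reward(candidate_fn, baseline_fn, inputs, repeats: int = 20) -> float:
    """Illustrative 'same output, but faster' reward. candidate_fn wraps the
    model-generated kernel, baseline_fn the reference implementation."""
    expected = baseline_fn(*inputs)
    try:
        actual = candidate_fn(*inputs)
    except Exception:
        return 0.0  # kernels that crash get no reward

    # Hard gate on correctness: a fast-but-wrong kernel is worthless.
    if not np.allclose(actual, expected, rtol=1e-3, atol=1e-5):
        return 0.0

    def best_time(fn):
        times = []
        for _ in range(repeats):
            start = time.perf_counter()
            fn(*inputs)
            times.append(time.perf_counter() - start)
        return min(times)

    speedup = best_time(baseline_fn) / max(best_time(candidate_fn), 1e-9)
    # Correctness earns a base reward; any speedup beyond 1x earns more.
    return 1.0 + max(speedup - 1.0, 0.0)
```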

John

Would something like that make you optimistic about the ongoing reduction in inference costs?

Do you think the cost of serving models will keep dropping at a rate of 5x per year, or is there some kind of limit? I know we have the test-time compute paradigm now, but how much juice is left in this approach—are we just getting started?

For people who are skeptical about the continued progress, do you have any soundbites that address that specifically? Looking at every part of the inference stack, are you optimistic that costs will keep dropping?

Travis

For sure, I think it will. It’s just hard to say exactly what will be the main drivers.

On the hardware side, we’re seeing a Moore’s Law-type effect with GPUs—every new generation is significantly more energy-efficient, faster, and has more capacity. But the big open question is whether this scaling can continue indefinitely. Will we keep seeing major improvements, or will we hit a point where each new GPU generation is like new iPhones—incremental improvements rather than game-changers?

On the software side, I definitely think optimizations will continue, but they’re fundamentally limited by hardware. In the first few years of LLMs, we saw big breakthroughs like paged attention and flash attention, but software-side inference innovations seem to have slowed down recently.

At this point, I don’t see major differentiation in base model inference between companies and open-source solutions in terms of speed. Most of the real performance gains come from task-specific optimizations—like Turbo LoRA—rather than general-purpose inference improvements. That suggests inference is already becoming commoditized, with everyone converging on the same basic techniques.

However, hardware is still an area with big disparities. If you look at the tokens per second that specialized chips from Cerebras or Groq can achieve, you can see that there’s still a lot of room for optimization at the hardware level. That’s where I think we’ll continue to see the most significant innovations for some time.

John

Predibase started in the low-code space in its early days, and now you’re a full-featured platform. But with RFT, do you think we might see a renaissance of low-code ML?

Do you see a future where we have vibe-based AI researchers—just like we now have vibe coders, thanks to powerful copilots?

Travis

Yeah, well, that was always the killer feature of LLMs, right? The fact that you don’t have to be super explicit about what you need—it just works through natural language. That was the magical breakthrough ChatGPT brought to the table.

So seeing that shift happen on the training side with RFT, where intuition plays a bigger role than deep technical fundamentals, is a good thing. As long as the evaluation remains objective—if you can say, “Yeah, the accuracy is 99%,” then it doesn’t matter whether you got there by running complex plots as a data scientist or through an iterative process of telling the model, “You got that wrong, fix it,” until it gets it right.

We’re already in an era of vibe-based programming with tools like Cursor. I spend most of my time just hitting Command+K and asking for code that does what I need, and it figures it out. I think that’s the direction we’re heading as an industry—moving from human precision to human intuition. And that’s probably a good thing for knowledge work overall.

John

For people interested in getting started with RFT—which I really think more people should experiment with to understand its potential—where can they go? I know you guys are either in beta or about to launch. When do you plan to release the platform? Are you still accepting beta signups?

Travis: Yeah, we’re currently in a closed beta. If anyone wants to reach out and collaborate with us, we’d love to work with them on use cases that might be a good fit. We’re also hands-on in helping with that process.

We’re planning to release our first self-service feature around this within the next week or so. Then, we’re gearing up for a bigger launch of the full platform, which will be generally available in about a month. That’s our target date.

So it’s definitely coming soon—along with all the UX improvements and everything we’ve discussed. The full version will bring this capability to customers at scale.

John: Will the platform support any open-source model for this purpose, or is it limited to LLaMA, DeepSeek, or something specific?

Travis: Yeah, it’ll work with any model that we support on the platform.

We’ll have first-class support for LLaMA models, Qwen, and DeepSeek distills initially—unless something really cool drops in the next few days. But through the SDK, you’ll technically be able to use any model that we support.

John: Just to wrap up—you’ve been at this for a while now in your entrepreneurial journey. What do you think is the biggest thing you’ve learned while working in AI infrastructure?

And if you could go back four years to when you were just starting out, what advice would you give yourself?

Travis: Oh, man. There’s been a lot of learning along the way about how to build a great product.

But I think the biggest lesson for me was that, in the beginning, we spent too much time thinking about the product in the abstract—what made sense logically—without spending enough time just listening to people about their real problems. When we first started, we were designing based on what we thought should exist, rather than deeply understanding the pain points people were actually facing.

Now, as a company, we spend a lot more time just listening—really understanding what customers are trying to do, what’s painful for them, and most importantly, what’s painful enough that they’re actually willing to pay to solve it. Designing the product around those pain points has been a huge shift.

For me personally, coming from a technical background rather than a business one, this was a hard lesson to learn. So if I could go back, I’d tell myself: spend more time talking to people in the space who could be your customers and ask them what they’re struggling with. That would have saved us a lot of time.

John: I think this is a good way to close out this episode. What do you think is the number one pain point in enterprise AI adoption?

Travis: I think it’s that the models just aren’t good enough yet. The quality isn’t where it needs to be for production use, and it’s difficult to get the data into a form that helps push them to that level.

Every time we talk to customers about potential features or improvements, it always comes back to quality. They need the models to be better. They struggle with data collection, labeling, and all the surrounding infrastructure—but ultimately, it all leads back to the fact that they just need better models.

John: That’s probably surprising to a lot of people, since many assume we already have AGI or that frontier models are powerful enough to handle anything. They’ve blown past all the benchmarks, so people assume we’re already there.

But what you’re saying is that, from a boots-on-the-ground perspective, the feedback is that models still need to improve before they can truly function as a turnkey solution. Is that what you’re getting at?

Travis: Yeah. I mean, models like Sora and other frontier models are incredible for their generality. But when you're dealing with a highly specific application—especially something that's core infrastructure or core IP—being "good" or even 80% accurate just isn’t enough for production.

It needs to be nearly perfect, or as close as possible, because every small incremental improvement can drive millions of dollars in revenue. There are definitely enterprise use cases where that level of precision is critical.

Even as models continue to get stronger, they will never understand an enterprise’s data as well as a model that’s fine-tuned specifically for that company. That’s the inherent limitation.

So I don’t see a future where there’s just one ultimate model that everyone uses for everything. Instead, we’ll see a lot of specialized models for different domains, and even those will likely be further fine-tuned for specific enterprise needs—because in many cases, every ounce of performance matters.

John: That really speaks to the importance of having the data and the process in place to fine-tune these models—whether it’s RFT or whatever the next hot technique turns out to be.

It seems like model improvement will always require that hands-on effort to squeeze out extra performance, and that’ll always be an ongoing challenge. So even if GPT-7 comes out, it’s unlikely that one model will rule everything. That’s what you’re getting at, right?

Travis: Exactly. This has been true in every organization I’ve worked at—once something is critical to the business, even the smallest optimizations can drive massive upside.

At Uber, for example, we worked on ETA prediction, migrating from an XGBoost-based model to a deep learning model. They were pushing for just a 1–2% improvement in accuracy, but making that transition was a massive effort that took over a year.

They moved from a lightweight, fast model to a much more expensive deep learning model—but that tiny improvement in accuracy had a massive financial impact, easily in the seven-figure range. When you get something like that right, it ripples through the entire business and improves core metrics in a way that’s hard to overstate.

John: That’s super counterintuitive, and I think a lot of people are sleeping on this fact.

As models become more powerful and widely available, the effort to squeeze out those extra gains—whether through fine-tuning, better data, or workflow optimizations—actually increases, not decreases.

Sure, there will always be some edge in having access to the best model, but once those frontier models become more evenly distributed, the real competition shifts to optimization. Platforms like Predibase and other workflow tools will become even more important because, at that point, everyone is playing on a level field, and those marginal improvements become the real differentiator.

Travis: Exactly. If this is the core of your product—like, say, Cursor, which is trying to be the best code generation platform for developers—at some point, competition is inevitable. Other companies will be using the same models, so what’s your differentiation? What’s your moat?

It comes down to your data and your ability to build models that are better suited to what your customers need. If you’re not customizing these models for your business, you’re effectively commoditizing yourself. You’re leaving the door open for competitors to replicate what you’ve already done.

So the need to differentiate through custom AI solutions is only going to become more important over time.

John: Yeah, speaking of Cursor, I’m always shocked that they managed to become relevant despite Microsoft’s huge head start. A couple of years ago, I never would’ve predicted Copilot losing mindshare among developers.

If I had to give advice back then, I would’ve told people not to build a coding assistant—Copilot’s market share was over 90%, even in the enterprise. It seemed like an impossible space to break into. But Cursor’s rise is a great example of how startups can compete, even against entrenched incumbents. It shattered a lot of assumptions about what was possible.

And honestly, I think that’s relevant to a company like Predibase too. AWS and GCP already have fine-tuning solutions—Vertex AI, Bedrock—but there are always underserved needs, whether it’s around developer experience, model parity, or customization.

From my perspective, that seems like a key opportunity for you as well. Large companies do fumble the bag sometimes, even when they seem to have an overwhelming advantage. Cursor’s success is proof of that. So I’d imagine that applies to you guys too, right?

Travis: Yeah, I think that’s definitely true. When we started this company, there weren’t many players in deep learning infrastructure. Now, obviously, there are a lot of companies claiming to do what we do—whether it’s on the fine-tuning side, the serving side, or elsewhere.

So it’s critical for us to focus on how we can win by being the best at something very specific. If AWS and GCP aim to be an "everything for everyone" kind of company, our strategy is to be exceptional for a particular type of customer. That’s how all great companies start out.

For us, that means being the best platform for building customized models and getting them into production. That requires thinking about everything holistically—efficiency, user experience, and ultimately the quality of the models that come out on the other side.

So yeah, I completely agree with your point. And, you…

John: Especially if RFT becomes as big as we think it could be—if it turns into a popular way for even non-technical people to fine-tune models with just 20 data points—this could be a breakout moment.

It reminds me of Cursor. Their big use case was merging code, and that turned into a massive differentiator. For fine-tuning platforms, RFT could be that use case. It could blow up in ways we don’t expect, especially since OpenAI hasn’t even GA’d their platform yet, and everyone’s still figuring out what this is going to look like.

Personally, I’m really bullish on RFT—that’s why I got in touch with you guys in the first place. So maybe we can wrap this up with your predictions. Where do you see RFT by the end of this year? And where do you see your product? Can you give us some goalposts to strive for?

Travis: Yeah, well, my hope for this year is that we see RFT used to train a state-of-the-art specialized model in some domain—whether that’s CUDA kernel generation or something else. I’d love to see it recognized as a go-to approach that solves real problems, particularly in a sample-efficient way.

For our product, we want to launch our reinforcement fine-tuning platform as a generally available offering. Beyond that, my goal is to keep refining it so we can lower the barrier to entry—not just for technical users, but also for non-technical users.

I want RFT to become something intuitive and easy to use, almost Cursor-like in terms of adoption and user experience. That’s what I’m aiming for this year.

John

Awesome. Well, Travis, thanks so much for taking the time to talk with us and for being so generous with your insights.

Really excited for the release of your product, and for anyone interested in signing up for Predibase RFT beta, stay tuned—we’ll probably have something from Predibase soon.

Thanks again, Travis, and have a great weekend.

Travis

Thanks, you too, John. Appreciate it. Bye.

John

All right.
