Logos Research: Is New Math the Answer to AI Slop Code?
Po Bronson

Logos Research has recently started working with a small, exclusive set of top universities and quant funds. It is a new entrant among mathematical reasoning companies, joining startups such as Axiom, Harmonic, Math.Inc and Logical Intelligence.

Logos stands out for being built by a mix of world-class mathematicians and quant finance managers, a combination reminiscent of Renaissance Technologies (Jim Simons’ legendary quant fund). Prof. Cris Salvi, the CEO, left his tenured position at Imperial College London to found the company. Robert Smith, the CTO, was formerly Global Head of Automated Trading and Data Science at Citi. Heavyweights such as leading formal mathematics expert Prof. Kevin Buzzard, Fields Medalist Prof. Martin Hairer, and Oxford Wallis Professor of Mathematics Prof. Massimiliano Gubinelli are among the many top mathematicians and financial engineers who have joined Logos as scientific collaborators.

Logos has been operating in stealth, but recently opened early access to LogosLib, an infrastructure layer designed to let AI agents reason mathematically over long horizons without hallucinating.

So we sat down with CEO Prof. Cris Salvi, joined by Parikshit Sharma, to discuss Logos, how formal mathematical verification can eliminate hallucinations from AI-generated code, and why that matters for industries from quantitative finance to mission-critical software.

Po Bronson: The media has covered many stories about the problems with AI-generated code – everything from low-end vibe coding resulting in glitchy social media apps to mission-critical servers at big tech firms taken down by faulty software updates. But are there any actual industry statistics about how prevalent the problem really is?

Cris Salvi: Yes. At the end of 2025, CodeRabbit analyzed hundreds of GitHub pull requests that were either human-only or AI-coauthored. They found that when AI was in the loop, the code had 1.7 times more bugs, and 75% more logic errors. Separately, METR (a research nonprofit that measures AI capabilities) showed that this problem scales dramatically with task complexity. If it’s something simple a human coder could do in, say, just 4 minutes, then AI code is fine and will work nearly 100% of the time. But if it’s much more substantial work, let’s say the kind of work a human coder would complete in 4 hours, AI code works only about 10% of the time. And the trend continues downward from there.

Parikshit Sharma: The conventional wisdom is that programmers run AI-generated code back through scaffolding agents to test and fix it.

Cris Salvi: That’s the story. But the truth is different. There are published leaderboards, like SWE-bench, that track how well models do at finding and fixing real bugs in open-source repositories. Even the top models still fail on a significant fraction of real-world issues. And these are well-scoped bugs with clear test suites, not the kind of ambiguous, multi-file logic errors you encounter in production. 

Po Bronson: So how do programmers then feel about AI generated code? 

Cris Salvi: It depends on the domain. AI coding agents are becoming increasingly popular; indeed, a recent Stack Overflow developer survey shows the majority of programmers now use them daily. But what’s surprising is how little penetration there is in quantitative and technical fields. Quant developers, engineers writing mission-critical systems, applied mathematicians: they are far more skeptical. In effect, the more complex the task, the more likely the model is to hallucinate. In these domains a logic error isn’t a minor bug; it can mean a failed proof, a crashed system, or millions lost on a bad trade. According to the same survey, two-thirds of developers who do use AI cite lack of trust in the output as their biggest challenge. In quantitative finance, that distrust is closer to a wall.

Po Bronson: Most people might assume, “Well, it’ll get better fast.” Two years ago, LLMs hallucinated on basic research questions, but today they are incredible. I need you to slowly explain to the audience why this doesn’t just resolve itself with time, practice, and reinforcement.

Cris Salvi: Coding agents generate code with a certain level of randomness. Classical compilers catch syntax errors. Unit tests catch some logic errors, the ones you’ve seen before and thought about writing tests for. But the problems are edge cases you haven’t observed in historical data or haven’t thought about, like a crash, a cyber-attack, an earthquake. Unit tests can’t cover what you can’t anticipate. 

Po Bronson: Say more. Give me some context on what that’s like day to day, not just in rare events, catastrophic as they might be.

Cris Salvi: Here’s a real example. A quant team builds a complex feature extraction pipeline to train prediction models for future return distributions. They split data into train, validation, and test sets over rolling windows. The code is clean, all tests pass, backtests look exceptional. But somewhere in the pipeline, information from the future might be subtly leaking into the training set, and now the model is secretly overfitting on data it should never have seen. This isn’t a syntax error. It’s a logical flaw in the data flow. When the trading strategy is deployed live, you lose real money.
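
To make the leak concrete, here is a minimal sketch of the pattern Salvi describes. It is an illustrative example of ours, not Logos or client code: a centered rolling window silently folds future returns into each feature value.

```python
# Illustrative look-ahead leakage (hypothetical example, not Logos code).
# A centered rolling window makes the feature at time t depend on returns
# from t+1 and t+2, so the backtest trains on information that would never
# be available live. Syntax checks and unit tests on the code still pass.
import pandas as pd

returns = pd.Series(range(100), dtype=float)  # stand-in for a daily return series

# BUG: center=True averages over past AND future observations.
feature_leaky = returns.rolling(window=5, center=True).mean()

# Clean version: trailing window, shifted so only data up to t-1 is visible.
feature_clean = returns.rolling(window=5).mean().shift(1)
```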

Po Bronson: Okay and back to catastrophes – are there famous examples of poor code causing huge losses? 

Cris Salvi: Oh, for sure. The Knight Capital meltdown was caused by one of its 8 servers not getting upgraded: an old algorithm got activated and made 4 million trades. Or there are examples where algorithms respond to what other algorithms are doing – the Flash Crash of 2010 briefly vaporized $1 trillion in market value. Infinium turned on an algorithm for just five seconds before they got to the kill switch – it made 3,000 trades per second, spiking oil futures.

Po Bronson: Can you explain the different methods used to address this problem and explain what formal verification is?

Cris Salvi: There’s a spectrum. At one end, you have testing (run the code, see if it breaks). You can make it more sophisticated with property-based testing, fuzzing, mutation testing. But fundamentally, testing is empirical. You’re sampling from a very large, often infinite dimensional, space of possible inputs. You can almost surely never test them all.
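
For readers unfamiliar with property-based testing, here is a minimal sketch using Python’s hypothesis library; the function under test is our own toy example, not anything from Logos. The tool checks a stated property against thousands of random inputs, which is stronger than hand-picked unit tests but still, as Salvi says, sampling.

```python
# Property-based testing samples the input space rather than enumerating it
# (illustrative example). hypothesis generates many random inputs and checks
# the property on each run; it can falsify, but never exhaust, an infinite
# space of possible inputs.
from hypothesis import given, strategies as st

def clip_payoff(x: float, cap: float) -> float:
    """Clamp a payoff into [0, cap] (toy function for illustration)."""
    return max(0.0, min(x, cap))

@given(
    st.floats(allow_nan=False, allow_infinity=False),
    st.floats(min_value=0.0, allow_nan=False, allow_infinity=False),
)
def test_clip_stays_within_bounds(x: float, cap: float) -> None:
    y = clip_payoff(x, cap)
    assert 0.0 <= y <= cap
```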

Po Bronson: The trials of debugging. And on the other end?

Cris Salvi: At the other end you have formal verification. Instead of testing individual cases, you write a mathematical proof that the code satisfies its specification for all possible inputs. The tools for this, interactive theorem provers like Lean, Coq, Isabelle, and formal verification languages like TLA+ and Dafny, have been around for many years. Every step of a proof is machine-checked. If it compiles, it’s correct. Not probably correct. Not “correct on the data we tested”. Always correct.
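
As a flavour of what “for all possible inputs” means, here is a toy Lean4 example of ours, not Logos code: the theorem is stated once for every natural number, and the compiler mechanically checks each proof step.

```lean
-- Toy Lean4 example (illustrative). `double` is ordinary code; the theorem
-- below is a machine-checked guarantee that holds for ALL n, not just the
-- values a test suite happens to sample.
def double (n : Nat) : Nat := n + n

theorem double_eq_two_mul (n : Nat) : double n = 2 * n := by
  unfold double
  omega
```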

Po Bronson: So all of this has been the case for decades, right? How exactly is AI making this worse? Is there something inherent to the use of AI that results in logic errors? 

Parikshit Sharma: Yes, absolutely. Let’s take a historical perspective. Early programming languages were extremely logical and formal, but very inaccessible. As languages gradually became more accessible, a little more like human language and easier to use, the tradeoff was less formal precision, which resulted in more implementation errors: stack and buffer overflows, null pointers, race conditions, deadlocks, floating-point errors, and the like.

We’re all familiar with the evolution of programming languages – Fortran, Pascal, Ada, C++, Java, and today’s Python, et cetera. Alongside those came companion verification languages and proof assistants, far less well known, that allowed coders to write logic proofs to check their code. If the programming language wasn’t entirely logical, the verifier language could be used to catch logic errors. Pascal was quite logical and concise, and it soon had the first real companion verifier. Languages like Ada – used by the Department of Defense – were very logical and had guardrails, while C++ was very loose and almost impossible to verify, so it fell out of favor for such systems. Today so much is written in Python. It’s extremely accessible, but loose, and the most popular verifier is Lean, now in its fourth version, Lean4.

And then, since 2022, all of a sudden, anyone can spit out code just by typing into a chatbot in human language, describing what they want. Looseness and imprecision – as Cris said, randomness – are inherently the tradeoff. The need for a verifier in this new era is greater than ever.

Po Bronson: And what percent of the world’s programmers use, or what percent of the world’s code goes through this rigorous formal verification? Roughly? 

Cris Salvi: Companies like AWS and Intel already use formal verification for their most critical infrastructure. But writing a formal proof is extremely complex and technical; the learning curve is brutal. So adoption has been limited to a handful of extreme cases.

Parikshit Sharma: Industry estimates suggest that less than 1% of developers actively use formal verification methods like Lean4. For every one line of code, you might have to write 6 to 60 lines of formal proof in Lean. So it’s really hard. There are 47 million software developers worldwide. Lean4 has only 11,000 regular users.

Po Bronson: Wow, that’s small. There’s a solution – but it’s a very big friction.

Cris Salvi: This is where Logos comes in. We train AI agents to interact with our proof co-pilots: they read your code, translate the correctness requirements into a formal specification, and then attempt to prove or disprove it automatically. If the proof goes through, you have a mathematical guarantee. If the proof fails, the system constructs a counterexample showing exactly where and why. Humans never need to touch these AI proof assistants; they are shielded from their complexity. They get the guarantees of formal verification without the cost and technical debt.

All in all, it doesn’t matter whether the code was written by a human or an LLM, the proof assistant running in the backend is the arbiter. It detects logical flaws that no amount of testing would catch, and that’s the most promising technology today to effectively eliminate hallucinations from AI-generated code.
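
Schematically, the prove-or-refute workflow Salvi describes looks something like the sketch below. Every name in it is an illustrative placeholder of ours, not the Logos API.

```python
# Schematic prove-or-refute loop (placeholder names, NOT the Logos API).
# The backend either returns a proof, i.e. a guarantee over all inputs,
# or a concrete counterexample pinpointing where the code violates its spec.
from dataclasses import dataclass
from typing import Callable, Optional

@dataclass
class Verdict:
    proved: bool
    counterexample: Optional[dict] = None   # failing inputs, when refuted

def verify(
    source_code: str,
    formalize: Callable[[str], str],        # agent: code -> formal specification
    prove: Callable[[str], Optional[str]],  # proof assistant: spec -> proof, or None
    refute: Callable[[str], dict],          # search: spec -> concrete counterexample
) -> Verdict:
    spec = formalize(source_code)
    proof = prove(spec)
    if proof is not None:
        return Verdict(proved=True)         # machine-checked for all inputs
    return Verdict(proved=False, counterexample=refute(spec))
```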

Parikshit Sharma: Once we have autoformalization, all code can level up to mission-critical grade.

Logos MCP in action: model-agnostic agents with Logos-powered verified chain-of-thought, indexed on LogosLib, being function-called for financial, regulatory, and mathematical verification.

Po Bronson: Can we complete the thought? Let’s go back to why LLMs won’t just solve this with time. Are you saying that they can get better at debugging, but can’t make the jump to proving code, because they rely inherently on probabilistic inference rather than logical deduction?

Cris Salvi: Not exactly. LLMs can generate proofs, but cannot verify that a proof is correct. That’s what the proof assistant does, deterministically. So LLMs generate hypotheses, proof assistants verify them. You need both. But the final authority can never be the LLM.

Now, Chain of Thought (CoT), which is how modern LLMs reason through complex problems, improves results by breaking a task into simpler steps. But it’s still statistical. The model is predicting the form of a logical argument it’s seen in training data. 

Po Bronson: And Logos automated formal proof? How?

Cris Salvi: What Logos provides is what we call Verified Chain of Thought. It’s an infrastructure that interleaves classical CoT with formal verification. The verification feedback from the compiler of the underlying formal language provides intermediate reward signals that steer and guardrail the LLM’s reasoning towards logical correctness. This is a game changer, especially for complex long-horizon reasoning tasks.
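
In pseudocode, the idea is roughly the following. This is a schematic sketch under our own naming, not Logos internals: only steps the proof assistant accepts enter the reasoning context, and compiler errors are fed back as the steering signal.

```python
# Schematic verified chain-of-thought loop (illustrative stand-ins, not the
# Logos implementation). The LLM proposes; the formal-language compiler
# disposes. Failed checks return error messages that act as intermediate
# reward signals guiding the next proposal.
from typing import Callable, List, Tuple

def verified_cot(
    propose: Callable[[List[str], str], str],        # LLM: (verified steps, feedback) -> candidate step
    check: Callable[[List[str]], Tuple[bool, str]],  # compiler: steps -> (accepted?, error message)
    done: Callable[[List[str]], bool],               # has the goal been reached?
    budget: int = 64,
) -> List[str]:
    steps: List[str] = []
    feedback = ""
    for _ in range(budget):
        candidate = propose(steps, feedback)         # statistical step
        ok, feedback = check(steps + [candidate])    # deterministic verification
        if ok:
            steps.append(candidate)                  # only verified steps persist
            feedback = ""
            if done(steps):
                return steps
    raise RuntimeError("no fully verified reasoning chain within budget")
```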

Po Bronson: How does this integrate with the LLMs which are helping write AI code?

Parikshit Sharma: Logos connects to the LLMs through its MCP server and A2A (agent-to-agent) infrastructure layer. When a code-writing task or open-ended research task is complex enough to need formal verification, the LLM simply calls out to Logos.

Po Bronson: Okay I want to switch the train of thought for a moment. Let’s bring in the mathematicians. 

Parikshit Sharma: So, Lean’s rise wasn’t driven by coders. It was mathematicians who adopted it to write formal proofs of their theorems. It worked better than any verifier coders had used, so the programming community adopted it as well.

Po Bronson: Tell me about Mathlib.

Cris Salvi: Mathlib is the largest open-source library of formalised, verified mathematics: almost 200,000 machine-checked theorems spanning algebra, analysis, geometry, combinatorics. It plugs into Lean4. It’s an extraordinary achievement, and it took the community ten years to build.
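
In practice, using Mathlib looks like this toy example of ours: a result someone formalised and machine-checked years ago is invoked as a single lemma.

```lean
import Mathlib

-- One of Mathlib's ~200,000 machine-checked results, used off the shelf:
-- the logarithm of a product, valid for all nonzero reals.
example (x y : ℝ) (hx : x ≠ 0) (hy : y ≠ 0) :
    Real.log (x * y) = Real.log x + Real.log y :=
  Real.log_mul hx hy
```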

Po Bronson: And Logos is taking that to another level.

Cris Salvi: Mathlib has real limitations. First, it covers almost exclusively pure mathematics. Large areas of applied mathematics like probability, PDEs, numerical analysis, optimisation, and statistics are essentially absent. If you’re a quant developer trying to verify a pricing model or generate a trading strategy, Mathlib has very little for you. Second, it’s governed by a small group that decides what gets accepted, Bourbaki-style. That’s fine for maintaining quality, but it means progress is slow, contributions sit in long review queues, and the priorities reflect a narrow set of interests.

Parikshit Sharma: So its scalability is very human-constrained.

Cris Salvi: Yes. And that’s the most important point. Mathlib was built by humans, for humans. The fact that it’s a file-based library rather than a graph database, the way tactic-based proofs are structured, none of it was designed to be consumed or extended by an LLM. In a world where AI agents need to reason mathematically about the physical world, I think the underlying mathematical infrastructure has to be designed for them.

Po Bronson: And so what is Logos building? What’s LogosLib?

Cris Salvi: Because quant finance is our first vertical, LogosLib will effectively launch with a kind of “FinanceLib”: a training ground and infrastructure layer for agents that reason mathematically about algo trading, portfolio optimisation, hedging, and risk management. That’s deliberate. It’s the background of most of the team, and we have reputational respect in that industry. We know what it takes to deploy strategies and models to hedge-fund spec.

The finance space taps into mathematical domains that are either missing or underdeveloped in Mathlib but crucial for real-world use, as I said earlier: probability, stochastic analysis, PDEs, numerical analysis, optimisation, statistics.

On top of this infrastructure, we’re building our own agents optimised to use it. Before we open it up to third-party agents, we’ll have a significant advantage.

Po Bronson: Help me understand what this means for quants and hedge funds. You gave an example earlier. Maybe you can go back to that, and explain how it’d be different with Logos. 

Cris Salvi: Sure. We talked about the data leakage example earlier. A quant team has a complex feature extraction pipeline, everything looks clean, backtests look great, but somewhere information from the future is leaking into the training set. Logos goes a step beyond verified debugging. Logos reads the entire codebase, models the data dependencies mathematically, and either proves the pipeline is clean or constructs a counterexample showing exactly where the leak is. Sort of red-teaming your work. 

Po Bronson: What keeps a quant up at night?

Cris Salvi: Quantitative models in finance are full of assumptions, about distributions, boundary conditions, convergence. These assumptions are often implicit and untested. Logos formalises them and checks whether they actually hold.
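
A tiny Lean4 illustration of that point, ours rather than Logos’s: the “obvious” cancellation below is only a theorem under an explicit nonzero-denominator assumption, which the proof assistant forces you to state and use.

```lean
import Mathlib

-- An implicit modelling assumption made explicit: without `hy`, this
-- simplification is not a theorem (in Lean's reals, x / 0 = 0).
example (x y : ℝ) (hy : y ≠ 0) : x * y / y = x := by
  rw [mul_div_assoc, div_self hy, mul_one]
```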

Parikshit Sharma: What’s the hardest part of a quant’s job?

Cris Salvi: Knowing which algorithm, of all the algorithms in the world, will best fit your data. 

Po Bronson: And how do you help that?

Cris Salvi: A quant gives Logos a specification; it can be as simple as a LaTeX document describing an American put option pricing problem. No hints about which algorithm to use. Logos autonomously discovers the correct algorithm and produces a formally verified implementation with correctness proofs. The specification goes in; an implementation of the algorithm and a proof of its correctness come out.
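
For concreteness, the kind of LaTeX specification Salvi means might look like this. It is our illustrative formulation of the American put as a Black–Scholes linear complementarity problem, not an actual Logos input:

```latex
% Illustrative specification (not a Logos document; requires amsmath):
% price an American put with strike K, rate r, volatility \sigma, maturity T.
Find $V(S,t)$ on $[0,\infty) \times [0,T]$ with payoff
$\psi(S) = \max(K - S,\, 0)$ such that
\begin{align*}
  &\mathcal{L}V := \frac{\partial V}{\partial t}
    + \tfrac{1}{2}\sigma^2 S^2 \frac{\partial^2 V}{\partial S^2}
    + r S \frac{\partial V}{\partial S} - r V \le 0,
  \qquad V \ge \psi, \\
  &(\mathcal{L}V)\,(V - \psi) = 0,
  \qquad V(S,T) = \psi(S).
\end{align*}
```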

Parikshit Sharma: Something to point out here is how important a strong library is. Enriching these libraries with metadata, workflows, strategies, and mathematical tactics will create real network effects and make superintelligence deployable. Also, given how you’ve set up LogosLib and your MCP infrastructure, the libraries become self-growing, adding new domains quickly, with every entry machine-checked.

Cris Salvi: The library is crucial, no doubt. The core challenge is designing an optimised mathematical API: how formalised mathematics should be decomposed, indexed, and structured so that AI systems can reliably retrieve and recombine it. My strong conviction is that this is mainly a data and infrastructure problem, more than a model problem.

I should also say, we’re being very careful not to bulldoze existing community projects. For example, Rémy Degenne, who maintains Mathlib, is a scientific collaborator at Logos. With his stochastic integral blueprint, he’s fully in charge: the goal is to train the system to write Mathlib-standard Lean under his rules and supervision, not to flood the project with uncontrolled agent output.

Po Bronson: I need to ask you about a competitor, Axiom. A lot of what you are describing today does sound very similar to descriptions of Axiom. Would you agree that you’re similar? And if so, how is Logos different from Axiom?

Cris Salvi: Look, Axiom is a good team. But if you actually go to their website, they’re building proof verification tooling for pure mathematics. “Check any mathematical proof, instantly.” That’s their pitch. More generally, the companies in this space (Axiom, Harmonic, Math.Inc) are asking: can AI prove theorems? Can it solve olympiad problems? That’s interesting research. But it doesn’t help you if you’re a quant developer and you need to verify that your hedging strategy is correct.
