A gap-first computational framework for what AAV capsid engineering cannot do today
THE PROBLEM
AAV capsid engineering is not waiting for better models. It is waiting for better computational tools built for the problems it actually has.
The field has accumulated powerful individual tools: fitness models, library design methods, tropism predictors, each occupying a space it already knew how to think about. What it lacks are solutions for the problems it has learned to work around rather than solve. The workarounds are expensive. Six specific gaps cost programs the most. None of them has a solution today.
WHAT THE DIGITAL TWIN IS
The AAV Digital Twin is not a pipeline or a workflow. It is a set of purpose-built tools, each designed around one gap the field has accepted and left open. The gaps came first. The tools followed. Programs use whichever tools address the specific bottleneck they face. No fixed sequence, no required combination.
WHAT THE DIGITAL TWIN DELIVERS
Value 01
LONG-LENGTH VARIANT DISCOVERY
The sequence spaces programs avoid because the field has no computational starting point for them.
GAP 1:
Receptor-targeted design at long insertion lengths has nowhere to start.
When a receptor is identified as the entry point for a desired tropism, the standard next step is to screen a massive random library and hope a small fraction happen to bind. At short insertion lengths this is inefficient but survivable. At 10-15 residues and above, the lengths complex receptor binding surfaces typically require, functional variants become exponentially rarer while sequence space expands beyond practical coverage. The step between knowing the receptor and having long insertion candidates worth testing does not exist.
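The scale problem can be made concrete with back-of-the-envelope arithmetic (illustrative only, not part of any ReceptorFit method; the 10^9 library size is an assumed, generous estimate of screenable variants per campaign):

```python
# Illustrative arithmetic: how peptide sequence space outgrows
# practical library sizes as insertion length increases.

LIBRARY_SIZE = 1e9  # assumed upper bound on screenable variants per campaign

for length in (7, 10, 15):
    space = 20 ** length             # possible peptides of this length
    coverage = LIBRARY_SIZE / space  # fraction of space a library can sample
    print(f"{length:>2} residues: 20^{length} = {space:.2e} "
          f"-> coverage {coverage:.2e}")
```

At 7 residues a billion-variant library samples most of the space; at 15 residues it samples roughly three parts in a hundred billion, which is why random screening stops being a viable starting point.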
ReceptorFit creates that step. It converts receptor identity directly into a prioritized set of peptide candidates of user-defined length before a library is synthesized, and can run against multiple species orthologs simultaneously. This addresses one of the most persistent failure modes in translational programs: a capsid optimized for rodent receptor biology that loses function when the ortholog shifts in primates.
GAP 2:
Long-length organ-targeted screening ignores the evolutionary record that already solved it.
Every virus that has evolved to infect a specific organ carries a record of how to get there within its surface-exposed receptor-binding domains. Those sequences have been pressure-tested across deep evolutionary time against exactly the biological barriers AAV programs are trying to cross. They are in public databases, specific to the tissues programs care about most: brain, liver, lung, muscle, kidney.
At short insertion lengths, random libraries can find organ-tropic candidates without this prior, inefficiently, but reliably enough. At long insertion lengths, where functional variants become exponentially rarer and sequence space expands beyond practical coverage, that reliability disappears. There is no principled starting point.
HijackFit converts that evolutionary record into organ-labeled peptide libraries at user-defined insertion lengths. The screen starts with biological precedent as its prior rather than random chance, concentrated in sequence space that evolution validated as functional in the target tissue, before a single screen is run.
GAP 3:
ML loses predictive signal at high mutational burden, making the most promising design spaces unreachable.
Long insertions, dual-loop engineering, distributed immune evasion substitutions. These design spaces carry enormous therapeutic potential and are almost universally avoided. Not because the biology is inaccessible, but because ML cannot guide them.
As mutational burden increases, functional variants become increasingly rare in conventionally designed libraries. When functional variants are rare, the data surrounding them carries vanishing learnable gradient. The field has accepted this as a limitation of ML. It is a limitation of library design.
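The rarity curve can be sketched with a toy independent-effects model (an assumption for illustration, not SpanFit's method, and real epistasis makes the picture worse): if each mutation independently retains function with probability p, the functional fraction at k mutations decays as p to the power k.

```python
# Toy independent-effects model of functional-variant rarity.
# p = hypothetical probability that a single mutation retains function.

def functional_fraction(p_per_mutation: float, k_mutations: int) -> float:
    """Expected functional fraction at k mutations, assuming independence."""
    return p_per_mutation ** k_mutations

p = 0.6  # hypothetical per-mutation tolerance
for k in (5, 10, 20, 30):
    print(f"{k:>2} mutations: functional fraction ~ {functional_fraction(p, k):.2e}")
```

Under these assumed numbers, the functional fraction falls below one in ten thousand by 20 mutations, leaving conventionally designed libraries with almost no informative positives to learn from.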
SpanFit addresses this at the source by changing what a library is optimized for so the resulting training data remains informative across the full mutational range. A model trained on SpanFit-designed data remains predictive at 20 or 30 mutations from the starting point. Variable-length fitness modeling at this scale has no substitute.
Value 02
MULTI-TRAIT VARIANT DISCOVERY
Finding candidates that satisfy multiple functional constraints simultaneously, from datasets that share no sequences and were never designed to connect.
GAP 4:
Multi-trait candidates cannot be found across the disconnected datasets most programs actually have.
Programs accumulate phenotypic knowledge across campaigns that were never designed to connect and thus cannot be directly compared, combined, or jointly queried. Multi-trait candidates typically cannot be found without running sequential screens that destroy diversity with each round, compounding the timeline cost.
MultiFit learns sequence-to-function relationships independently from each dataset, regardless of library origin, institution, or timepoint, then finds their intersection computationally, without returning to the bench for sequential screens.
Value 03
MORE DISCOVERY FROM EVERY SCREEN
Recovering ML signal from screens already written off, and eliminating the trade-off between affordable and precisely targeted library designs.
GAP 5:
Conventional ML training cannot use the screening data most programs actually have.
Most organizations with any history in AAV capsid engineering have accumulated randomized library screening data, typically not trusted for machine learning. Not because the biology is absent from it, but because conventional ML training typically cannot extract reliable signal from noisy randomized screens. The conclusion the field has drawn, that this data is not systematically usable, is wrong: noisy data and uninformative data are not the same thing.
NoisyFit is a training approach built specifically for this data type. It extracts meaningful predictive signal from randomized library screens that conventional training cannot use, turning accumulated archives into an active ML foundation without new experiments. This also lowers the bar for future screens: programs no longer need expensive controlled assays to generate ML-compatible training data.
GAP 6:
The library design trade-off every AAV program accepts has no practical middle path.
After a first screening round, a program knows roughly where functional variants cluster in sequence space. The next library should exploit that knowledge.
Random degenerate libraries, while affordable and diverse, cannot be used for second-round screening because they are uncontrolled: a significant fraction of variants carry stop codons, and the library distributes diversity across sequence space indiscriminately, ignoring what the first round already revealed. Printed oligonucleotide pools can incorporate that knowledge precisely, but their cost and diversity ceiling make them impractical for the scale many follow-up campaigns require.
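The stop-codon burden of standard degenerate libraries is simple to quantify. In the common NNK scheme (N = A/C/G/T, K = G/T), each position draws from 32 codons, one of which (TAG) is a stop. A quick sketch of the resulting losses (general codon arithmetic, independent of BaseFit):

```python
# NNK degenerate codons: 4 * 4 * 2 = 32 codons per position, one stop (TAG).
# Fraction of peptides of n codons that are stop-free is (31/32) ** n.

STOP_FREE_PER_CODON = 31 / 32

def stop_free_fraction(n_codons: int) -> float:
    """Fraction of NNK-encoded peptides of length n_codons with no stop codon."""
    return STOP_FREE_PER_CODON ** n_codons

for n in (7, 10, 15):
    lost = 1 - stop_free_fraction(n)  # fraction carrying at least one stop
    print(f"{n:>2} codons: ~{lost:.0%} of variants carry a premature stop")
```

Roughly a fifth of a 7-mer NNK library is dead on arrival, and the loss grows with insertion length, which is part of what "uncontrolled" costs.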
Every program doing iterative capsid screening has lived inside this trade-off at some point, and no library design approach has existed that resolves it.
BaseFit defines a third option: NNK-scale combinatorial diversity, stop codons suppressed to a negligible fraction, coverage concentrated on the sequence space the first round identified as promising, and all of it implementable through standard doped oligonucleotide synthesis at degenerate library cost. The trade-off disappears.
AVAILABILITY
Ready now
Ready for project engagements.
Early access
Available for early-access programs with experimental validation built in.
AAV DIGITAL TWIN: TECHNICAL OVERVIEW
Module-level detail, validation anchors, and the combinations that unlock capabilities no individual tool delivers alone.
Built by one scientist over one year. No team, no lab, no funding. Drawing on a decade working at the intersection of AAV biology and machine learning, and validated against the field's own published evidence base. The gaps described here were not identified theoretically. They were observed across programs, repeatedly, until the pattern was undeniable.
WHAT COMES NEXT
Version 1.0 addresses capsid discovery. Version 2.0 extends into the bottlenecks that sit between a promising capsid and a viable IND.
Manufacturability modeling
Titer, stability, and production yield modeled as design constraints, incorporated before experimental commitment, not as filters applied after scale-up fails.
Immunogenicity prediction
Sequence-level immunogenic risk modeling before clinical exposure. No public data exists to build this alone. This is where I am looking for programs with clinical and immunological data to build it with.
If any of these gaps appear in your program, or if Version 2.0 describes where your program is heading, the conversation is open.
— TheBioMLClinic