The AAV Digital Twin 1.0

A gap-first computational framework for what AAV capsid engineering cannot do today



WHAT THE DIGITAL TWIN IS

The AAV Digital Twin is not a pipeline. It is not a platform. It is a set of purpose-built computational tools, each designed around one gap the field has accepted and left open, because no existing tool was built to close it.

The gaps came first. The tools followed. A program uses whichever tools address the specific bottleneck it faces. No fixed sequence. No required combination. And for the programs where multiple gaps converge at once, the full system.



THE HARDEST CAPSID PROGRAMS

Some programs are not looking for an incremental improvement on what already exists. They are trying to engineer something the field has not built before: long insertions that accommodate a receptor with a large binding footprint, immune evasion substitutions distributed where it matters, multi-trait targeting and detargeting, function preserved across species whose receptor biology does not match.

This combination is not a collection of separate problems. It is a system. Each requirement compounds the others. Long insertions push ML into the mutational range where predictive signal collapses. Immune evasion substitutions interact with receptor binding in ways no individual model was built to handle. Cross-species function requires ortholog-aware design from the first library, not as a filter applied after candidates fail in a second species. Multi-trait screening across disconnected datasets has no conventional computational path.

No team, platform, or CRO addresses this combination as a whole.

The AAV Digital Twin was built for exactly this.



THIS IS FOR YOU IF:

Your program identified the receptor. The binding footprint is large enough that 7-mers cannot engage it meaningfully. You need peptide candidates at 10 or more residues and there is no principled starting point before library synthesis.

You are engineering for organ tropism through loop IV or other surface-exposed regions. The insertion lengths and substitution patterns required are beyond what random 7-mer libraries explore reliably. And the functional variants you need are too rare at that scale to find by chance.

You need a capsid that evades immunity, crosses species, or carries mutations at a burden where ML loses predictive signal before you reach the design space you actually need.

You have accumulated years of screening data from randomized libraries. You want to apply ML to what you already have, without returning to the bench to generate cleaner data first.

Your second-round library needs to concentrate on what the first round found, but degenerate libraries cannot do that, and printed oligo pools are too expensive at this scale.

Your program is approaching the translational gap. Manufacturability and immunogenicity are the next constraints, and sequence-level modeling for either does not exist yet in any accessible form.



THE PROBLEM

AAV capsid engineering is not waiting for better models. It is waiting for better computational tools built for the problems it actually has.

The field has accumulated powerful individual tools: fitness models, library design methods, tropism predictors. Each occupies a space the field already knew how to think about. What it lacks are solutions for the problems it has learned to work around rather than solve. The workarounds are expensive. Six specific gaps cost programs the most. None of them has a solution today.

WHAT THE DIGITAL TWIN DELIVERS


Value 01

LONG-LENGTH VARIANT DISCOVERY

The sequence spaces programs avoid because the field has no computational starting point for them.

GAP 1:

Receptor-targeted design at long insertion lengths has nowhere to start.

When a receptor is identified as the entry point for a desired tropism, the standard next step is to screen a massive random library and hope a small fraction happen to bind. At short insertion lengths this is inefficient but survivable. At 10-15 residues and above, the lengths complex receptor binding surfaces typically require, functional variants become exponentially rarer while sequence space expands beyond practical coverage. The step between knowing the receptor and having long insertion candidates worth testing does not exist.
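The scaling behind this claim is simple arithmetic. The sketch below compares the size of peptide sequence space at 7, 10, and 15 residues with the largest fraction a single library could sample. The library size of 10^8 unique variants is a hypothetical round number chosen for illustration, not a figure from any specific program.

```python
# Illustrative arithmetic only: peptide sequence space vs. library coverage.
AMINO_ACIDS = 20
LIBRARY_SIZE = 10**8  # hypothetical number of unique variants in one library


def sequence_space(length: int) -> int:
    """Number of distinct peptides of the given length."""
    return AMINO_ACIDS ** length


def max_coverage(length: int, library_size: int = LIBRARY_SIZE) -> float:
    """Upper bound on the fraction of sequence space one library can sample."""
    return min(1.0, library_size / sequence_space(length))


for n in (7, 10, 15):
    print(f"{n}-mer: {sequence_space(n):.2e} sequences, "
          f"coverage <= {max_coverage(n):.1e}")
```

Under that assumption, a 7-mer library can sample a few percent of its space, while a 15-mer library of the same size samples a fraction on the order of one in 10^11.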

Most programs absorb this by defaulting to 7-mers. The receptor engagement is suboptimal. The binding footprint is too small for the surface. The decision gets rationalized as sufficient for the screen and revisited, if at all, after candidates fail.

ReceptorFit creates that step. It converts receptor identity directly into a prioritized set of peptide candidates of user-defined length before a library is synthesized, and can run against multiple species orthologs simultaneously. This addresses one of the most persistent failure modes in translational programs: a capsid optimized for rodent receptor biology that loses function when the ortholog shifts in primates.

GAP 2:

Long-length organ-targeted screening ignores the evolutionary record that already solved it.

Every virus that has evolved to infect a specific organ carries a record of how to get there within its surface-exposed receptor-binding domains. Those sequences have been pressure-tested across deep evolutionary time against exactly the biological barriers AAV programs are trying to cross. They are in public databases, specific to the tissues programs care about most: brain, liver, lung, muscle, kidney.

At short insertion lengths, random libraries can find organ-tropic candidates without this prior, inefficiently, but reliably enough. At long insertion lengths, where functional variants become exponentially rarer and sequence space expands beyond practical coverage, that reliability disappears. There is no principled starting point.

Most programs absorb this by running the standard 7-mer library anyway. The insertion length fits what the screen can find, not what the receptor needs.

HijackFit converts that evolutionary record into organ-labeled peptide libraries at user-defined insertion lengths. The screen starts with biological precedent as its prior rather than random chance, concentrated in sequence space that evolution validated as functional in the target tissue, before a single screen is run.

GAP 3:

ML loses predictive signal at high mutational burden, making the most promising design spaces unreachable.

Long insertions, dual-loop engineering, distributed immune evasion substitutions. These design spaces carry enormous therapeutic potential and are almost universally avoided. Not because the biology is inaccessible, but because ML cannot guide them.

As mutational burden increases, functional variants become increasingly rare in conventionally designed libraries. When functional variants are rare, the data surrounding them carries vanishing learnable gradient. The field has accepted this as a limitation of ML. It is a limitation of library design.
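A toy calculation makes the rarity argument concrete. It assumes, purely for illustration, that each mutation is tolerated independently with the same probability; real capsids show epistasis, and both the probability and the library size below are hypothetical placeholders rather than measured values.

```python
# Toy model (assumption): each mutation is tolerated independently with
# probability p_tolerated, so the functional fraction decays geometrically.
def functional_fraction(n_mutations: int, p_tolerated: float) -> float:
    """Fraction of variants expected to remain functional at this burden."""
    return p_tolerated ** n_mutations


def expected_functional(library_size: int, n_mutations: int,
                        p_tolerated: float) -> float:
    """Expected count of functional variants in a uniformly random library."""
    return library_size * functional_fraction(n_mutations, p_tolerated)


LIBRARY_SIZE = 10**7   # hypothetical library size
P_TOLERATED = 0.5      # hypothetical per-mutation tolerance

for burden in (5, 10, 20, 30):
    count = expected_functional(LIBRARY_SIZE, burden, P_TOLERATED)
    print(f"{burden} mutations: ~{count:.3g} functional variants expected")
```

Under these placeholder numbers, a random library of ten million variants is expected to hold only a handful of functional variants at 20 mutations and fewer than one at 30, which leaves a model almost nothing to learn from.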

The cost is not a failed experiment. It is a design space that never gets explored. Programs building toward immune evasion, dual-loop engineering, or high-burden multi-site modification are making therapeutic compromises before the first library is synthesized, without knowing that the constraint is computational, not biological.

SpanFit addresses this at the source by changing what a library is optimized for so the resulting training data remains informative across the full mutational range. A model trained on SpanFit-designed data remains predictive at 20 or 30 mutations from the starting point. Variable-length fitness modeling at this scale has no substitute.

Value 02

MULTI-TRAIT VARIANT DISCOVERY

Finding candidates that satisfy multiple functional constraints simultaneously, from datasets that share no sequences and were never designed to connect.

GAP 4:

Multi-trait candidates cannot be found across the disconnected datasets most programs actually have.

Programs accumulate phenotypic knowledge across campaigns that were never designed to connect and thus cannot be directly compared, combined, or jointly queried.

Most programs run the screens sequentially: optimize for one trait, take the survivors, screen again for the next. Each round costs three to six months and collapses the sequence diversity the next round needs. By the time a multi-trait candidate is found, the program has spent a year finding something a broader starting pool might have contained from the beginning.

The only non-sequential alternative is narrow: find variants that target organ X while detargeting liver, then filter them for production fitness. It does not scale to other traits, and it requires concurrent screening.

MultiFit learns sequence-to-function relationships independently from each dataset, regardless of library origin, institution, or timepoint. It then finds their intersection computationally, without returning to the bench for sequential screens.
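The general idea of per-dataset models followed by a computational intersection can be sketched in a few lines. This is not MultiFit's method, which is not described here. It is a minimal stand-in that assumes equal-length peptides, one-hot features, ridge regression, and a quantile cutoff, with synthetic data and toy traits (lysine and tryptophan counts) invented for the example. All function names are hypothetical.

```python
import numpy as np

ALPHABET = "ACDEFGHIKLMNPQRSTVWY"
AA_INDEX = {aa: i for i, aa in enumerate(ALPHABET)}


def one_hot(seqs):
    """Flattened one-hot encoding for equal-length peptides."""
    length = len(seqs[0])
    x = np.zeros((len(seqs), length * len(ALPHABET)))
    for row, seq in enumerate(seqs):
        for pos, aa in enumerate(seq):
            x[row, pos * len(ALPHABET) + AA_INDEX[aa]] = 1.0
    return x


def fit_ridge(seqs, scores, alpha=1.0):
    """Fit one ridge-regression model on a single dataset."""
    x = one_hot(seqs)
    y = np.asarray(scores, dtype=float)
    offset = y.mean()
    gram = x.T @ x + alpha * np.eye(x.shape[1])
    weights = np.linalg.solve(gram, x.T @ (y - offset))
    return weights, offset


def predict(model, seqs):
    """Score sequences with a fitted (weights, offset) model."""
    weights, offset = model
    return one_hot(seqs) @ weights + offset


def joint_candidates(models, pool, quantile=0.8):
    """Keep pool sequences that score in the top tail for every model."""
    keep = np.ones(len(pool), dtype=bool)
    for model in models:
        scores = predict(model, pool)
        keep &= scores >= np.quantile(scores, quantile)
    return [seq for seq, kept in zip(pool, keep) if kept]


# Synthetic stand-ins for two independently generated screens.
rng = np.random.default_rng(0)


def random_peptides(n, length=7):
    return ["".join(rng.choice(list(ALPHABET), size=length)) for _ in range(n)]


screen_a = random_peptides(300)
screen_b = random_peptides(300)
trait_a = [s.count("K") + rng.normal(0, 0.3) for s in screen_a]  # toy trait A
trait_b = [s.count("W") + rng.normal(0, 0.3) for s in screen_b]  # toy trait B

models = [fit_ridge(screen_a, trait_a), fit_ridge(screen_b, trait_b)]
pool = random_peptides(1000)
hits = joint_candidates(models, pool)
print(f"{len(hits)} of {len(pool)} candidates pass both models")
```

Each model sees only its own screen, and the candidate pool is scored by both, so no sequence has to appear in more than one dataset for a joint query to be possible.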

Value 03

MORE DISCOVERY FROM EVERY SCREEN

Recovering ML signal from screens already written off, and eliminating the trade-off between affordable libraries and libraries focused on fit sequence space.

GAP 5:

Conventional ML training cannot use the screening data most programs actually have.

Most organizations with any history in AAV capsid engineering have accumulated randomized library screening data that is typically not trusted for machine learning. Not because the biology is absent from it, but because conventional ML training typically cannot extract reliable signal from noisy randomized screens.

Most programs discard the archive entirely and commission new controlled screens, adding cost and timeline before ML can begin. Or they accept that the data is ML-unusable and never attempt ML at all.

The conclusion that this data is not systematically usable is wrong and expensive. Noisy data and uninformative data are not the same thing.

NoisyFit is a training approach built specifically for this data type. It extracts meaningful predictive signal from randomized library screens that conventional training cannot use, turning accumulated archives into an active ML foundation without new experiments. This also lowers the bar for future screens: programs no longer need expensive controlled assays to generate ML-compatible training data.

GAP 6:

The library design trade-off every AAV program accepts has no practical middle path.

After a first screening round, a program knows roughly where functional variants cluster in sequence space. The next library should exploit that knowledge.

Random degenerate libraries, while affordable and diverse, cannot be used for second-round screening because they are uncontrolled: a significant fraction of variants carry stop codons, and the library distributes diversity across sequence space indiscriminately, ignoring what the first round already revealed. Printed oligonucleotide pools can incorporate that knowledge precisely, but at a cost and diversity ceiling that makes them impractical for the scale many follow-up campaigns require.
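The stop-codon burden of a degenerate library can be computed directly from the standard genetic code. The sketch below enumerates the codons encoded by the NNN and NNK schemes and estimates the fraction of inserts carrying at least one stop, assuming equimolar bases at every degenerate position and independence between codons. It describes conventional degenerate libraries only and says nothing about how BaseFit works.

```python
from itertools import product

STOP_CODONS = {"TAA", "TAG", "TGA"}
IUPAC = {"N": "ACGT", "K": "GT"}  # standard IUPAC degenerate base codes


def expand(scheme: str) -> list[str]:
    """All codons encoded by a degenerate scheme such as 'NNK'."""
    return ["".join(bases) for bases in product(*(IUPAC[b] for b in scheme))]


def stop_fraction_per_codon(scheme: str) -> float:
    """Fraction of encoded codons that are stops (equimolar bases assumed)."""
    codons = expand(scheme)
    return sum(codon in STOP_CODONS for codon in codons) / len(codons)


def fraction_with_stop(scheme: str, n_codons: int) -> float:
    """Fraction of inserts with at least one stop codon."""
    return 1.0 - (1.0 - stop_fraction_per_codon(scheme)) ** n_codons


for scheme in ("NNN", "NNK"):
    for n in (7, 10, 15):
        print(f"{scheme} x {n}: {fraction_with_stop(scheme, n):.1%} "
              f"of inserts carry a stop")
```

NNK encodes 32 codons, one of which (TAG) is a stop, so about 20% of 7-codon NNK inserts are expected to carry at least one stop under these assumptions, and the share grows with insertion length.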

Every program doing iterative capsid screening has lived inside this trade-off at some point, and no library design approach has existed that resolves it.

The cost is an extra screening round. Three to six months, another synthesis budget, and a second pass through sequence space that a better-designed first follow-up library would have covered. Programs that iterate twice where once was sufficient are not making a scientific decision; they are paying for the absence of a tool.

BaseFit defines a third option: NNK-scale combinatorial diversity, stop codons suppressed to a negligible fraction, coverage concentrated on the sequence space the first round identified as promising, and all of it implementable through standard doped oligonucleotide synthesis at degenerate library cost. The trade-off disappears.

WHERE PROGRAMS ENTER

Available for project and full-program engagements

SpanFit
MultiFit
NoisyFit

Available for programs with experimental validation built in

BaseFit
ReceptorFit
HijackFit

Not every module fits every program. The right combination depends on where your data is, what your screen has produced, and where the program is heading. That is where the conversation starts.

AAV DIGITAL TWIN: TECHNICAL OVERVIEW

Module-level detail, validation anchors, and the combinations that unlock capabilities no individual tool delivers alone.

Built by one scientist over one year. No team, no lab, no funding. Drawing on a decade working at the intersection of AAV biology and machine learning, and validated against the field's own published evidence base. The gaps described here were not identified theoretically. They were observed across programs, repeatedly, until the pattern was undeniable.



WHAT COMES NEXT

Version 1.0 addresses capsid discovery. Version 2.0 extends into the three bottlenecks that determine whether a promising capsid becomes a viable IND candidate: 1) the immune system, 2) the full vector system, and 3) the translatability gap.

These are open problems the field has not yet solved computationally. Some will be built inside program-building engagements where the data and the problem arrive together. Some require a co-development partner: a clinical-stage program sitting on immunological or manufacturing data that has not yet been modeled.

If either of these describes your situation, that is the right conversation to have.


If any of these gaps appear in your program, or if Version 2.0 describes where your program is heading, the conversation is open.

— TheBioMLClinic