GAP 1:
Receptor-targeted design at long insertion lengths has nowhere to start.
When a receptor is identified as the entry point for a desired tropism, the standard next step is to screen a massive random library and hope a small fraction happen to bind. At short insertion lengths this is inefficient but survivable. At 10-15 residues and above, the lengths complex receptor binding surfaces typically require, functional variants become exponentially rarer while sequence space expands beyond practical coverage. The step between knowing the receptor and having long insertion candidates worth testing does not exist.
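The scale of the coverage problem can be made concrete with a little arithmetic. The sketch below assumes 20 amino acids per position and a hypothetical library of one billion distinct variants; both the library size and the function names are illustrative assumptions, not figures from any specific program.

```python
# Toy arithmetic: how much of peptide sequence space a library can sample.
AMINO_ACIDS = 20


def sequence_space(length: int) -> int:
    """Number of distinct peptides of the given length."""
    return AMINO_ACIDS ** length


def max_coverage(library_size: int, length: int) -> float:
    """Upper bound on the fraction of sequence space a library can contain."""
    return min(1.0, library_size / sequence_space(length))


if __name__ == "__main__":
    library_size = 10**9  # assumed library size, for illustration only
    for length in (7, 10, 15):
        print(
            f"{length}-mer: {sequence_space(length):.2e} sequences, "
            f"coverage <= {max_coverage(library_size, length):.2e}"
        )
```

Under these assumptions a 7-mer library can in principle touch most of its sequence space, while a 15-mer library of the same size covers a vanishing fraction of it.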
Most programs absorb this by defaulting to 7-mers. The receptor engagement is suboptimal. The binding footprint is too small for the surface. The decision gets rationalized as sufficient for the screen and revisited, if at all, after candidates fail in.
ReceptorFit creates that step. It converts receptor identity directly into prioritized peptide candidates of user-defined length before a library is synthesized, and can run against multiple species orthologs simultaneously. This addresses one of the most persistent failure modes in translational programs: a capsid optimized for rodent receptor biology that loses function when the ortholog shifts in primates.
GAP 2:
Long-length organ-targeted screening ignores the evolutionary record that already solved it.
Every virus that has evolved to infect a specific organ carries a record of how to get there within its surface-exposed receptor-binding domains. Those sequences have been pressure-tested across deep evolutionary time against exactly the biological barriers AAV programs are trying to cross. They are in public databases, specific to the tissues programs care about most: brain, liver, lung, muscle, kidney.
At short insertion lengths, random libraries can find organ-tropic candidates without this prior, inefficiently, but reliably enough. At long insertion lengths, where functional variants become exponentially rarer and sequence space expands beyond practical coverage, that reliability disappears. There is no principled starting point.
Most programs absorb this by running the standard 7-mer library anyway. The insertion length fits what the screen can find, not what the receptor needs.
HijackFit converts that evolutionary record into organ-labeled peptide libraries at user-defined insertion lengths. The screen starts with biological precedent as its prior rather than random chance, concentrated in sequence space that evolution validated as functional in the target tissue, before a single screen is run.
GAP 3:
ML loses predictive signal at high mutational burden, making the most promising design spaces unreachable.
Long insertions, dual-loop engineering, distributed immune evasion substitutions. These design spaces carry enormous therapeutic potential and are almost universally avoided. Not because the biology is inaccessible, but because ML cannot guide them.
As mutational burden increases, functional variants become increasingly rare in conventionally designed libraries. When functional variants are rare, the data surrounding them carries vanishing learnable gradient. The field has accepted this as a limitation of ML. It is a limitation of library design.
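A deliberately simple toy model shows why rarity bites so quickly. It assumes every mutation is tolerated independently with the same probability, which real capsids do not obey; the tolerance value, library size, and function names are all hypothetical and chosen only to make the trend visible.

```python
def functional_fraction(tolerance: float, mutations: int) -> float:
    """Fraction of variants expected to stay functional.

    Toy assumption: each mutation is tolerated independently
    with probability `tolerance`.
    """
    return tolerance ** mutations


def expected_functional(library_size: int, tolerance: float, mutations: int) -> float:
    """Expected number of functional variants in a library of the given size."""
    return library_size * functional_fraction(tolerance, mutations)


if __name__ == "__main__":
    library_size = 10**6  # hypothetical library size
    tolerance = 0.5       # hypothetical per-mutation tolerance
    for k in (5, 10, 20, 30):
        print(k, expected_functional(library_size, tolerance, k))
```

With these placeholder numbers, a million-member library is expected to hold tens of thousands of functional variants at 5 mutations and fewer than one at 20, leaving a model almost nothing to learn from.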
The cost is not a failed experiment. It is a design space that never gets explored. Programs building toward immune evasion, dual-loop engineering, or high-burden multi-site modification are making therapeutic compromises before the first library is synthesized, without knowing that the constraint is computational, not biological.
SpanFit addresses this at the source by changing what a library is optimized for so the resulting training data remains informative across the full mutational range. A model trained on SpanFit-designed data remains predictive at 20 or 30 mutations from the starting point. Variable-length fitness modeling at this scale has no substitute.
GAP 4:
Multi-trait candidates cannot be found across the disconnected datasets most programs actually have.
Programs accumulate phenotypic knowledge across campaigns that were never designed to connect and thus cannot be directly compared, combined, or jointly queried.
Most programs run the screens sequentially: optimize for one trait, take the survivors, screen again for the next. Each round costs three to six months and collapses the sequence diversity the next round needs. By the time a multi-trait candidate is found, the program has spent a year finding something a broader starting pool might have contained from the beginning.
The non-sequential alternative is narrow: it is limited to finding variants that target organ X while detargeting liver, which are then filtered for production fitness. This does not scale to other traits and requires concurrent screening.
MultiFit learns sequence-to-function relationships independently from each dataset, regardless of library origin, institution, or timepoint. It finds their intersection computationally, without returning to the bench for sequential screens.
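The general idea of a computational intersection can be sketched as follows. This is not MultiFit's method: the scorers here are arbitrary toy functions with no biological meaning, standing in for models that would each be trained on a separate dataset, and every name and threshold is hypothetical.

```python
from typing import Callable, Dict, List

Scorer = Callable[[str], float]


def shortlist(
    candidates: List[str],
    scorers: Dict[str, Scorer],
    thresholds: Dict[str, float],
) -> List[str]:
    """Keep candidates whose predicted score meets the threshold for every trait."""
    return [
        seq
        for seq in candidates
        if all(scorers[trait](seq) >= thresholds[trait] for trait in scorers)
    ]


# Stand-in scorers (hypothetical; no biological meaning).
def trait_a(seq: str) -> float:
    return seq.count("R") / len(seq)


def trait_b(seq: str) -> float:
    return 1.0 - seq.count("L") / len(seq)


if __name__ == "__main__":
    pool = ["RRAGTS", "LLAGTS", "RLRLRL", "AGTSAG"]
    hits = shortlist(
        pool,
        {"trait_a": trait_a, "trait_b": trait_b},
        {"trait_a": 0.3, "trait_b": 0.8},
    )
    print(hits)
```

Because each scorer is applied to the same candidate pool, the traits never have to be measured in the same experiment; only the predictions are combined.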
GAP 5:
Conventional ML training cannot use the screening data most programs actually have.
Most organizations with any history in AAV capsid engineering have accumulated randomized library screening data that is typically not trusted for machine learning. Not because the biology is absent from it, but because conventional ML training typically cannot extract reliable signal from noisy randomized screens.
Most programs discard the archive entirely and commission new controlled screens, adding cost and timeline before ML can begin. Or they accept that the data is ML-unusable and never attempt ML at all.
The conclusion that this data is not systematically usable is wrong and expensive. Noisy data and uninformative data are not the same thing.
NoisyFit is a training approach built specifically for this data type. It extracts meaningful predictive signal from randomized library screens that conventional training cannot use, turning accumulated archives into an active ML foundation without new experiments. This also lowers the bar for future screens: programs no longer need expensive controlled assays to generate ML-compatible training data.
GAP 6:
The library design trade-off every AAV program accepts has no practical middle path.
After a first screening round, a program knows roughly where functional variants cluster in sequence space. The next library should exploit that knowledge.
Random degenerate libraries, while affordable and diverse, cannot be used for second-round screening because they are uncontrolled: a significant fraction of variants carry stop codons, and the library distributes diversity across sequence space indiscriminately, ignoring what the first round already revealed. Printed oligonucleotide pools can incorporate that knowledge precisely, but at a cost and diversity ceiling that makes them impractical for the scale many follow-up campaigns require.
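The stop codon burden is easy to estimate for one common degenerate scheme. The sketch below assumes an NNK design (N = A/C/G/T, K = G/T), equimolar bases, the standard genetic code, and independent codons; other degenerate schemes would give different numbers.

```python
from itertools import product

# Standard genetic code, indexed in T, C, A, G order.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: AMINO[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}

# NNK: any base at positions 1 and 2, G or T at position 3.
NNK_CODONS = [a + b + c for a, b, c in product("ACGT", "ACGT", "GT")]


def stop_fraction_per_codon() -> float:
    """Fraction of NNK codons that encode a stop."""
    stops = [codon for codon in NNK_CODONS if CODON_TABLE[codon] == "*"]
    return len(stops) / len(NNK_CODONS)


def fraction_with_stop(n_codons: int) -> float:
    """Fraction of inserts carrying at least one stop codon."""
    return 1.0 - (1.0 - stop_fraction_per_codon()) ** n_codons


if __name__ == "__main__":
    for n in (7, 10, 15):
        print(f"{n} codons: {fraction_with_stop(n):.1%} of variants contain a stop")
```

Under these assumptions only TAG survives as a stop among the 32 NNK codons, yet the per-codon rate compounds with insertion length, so longer inserts lose a growing share of the library to truncation.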
Every program doing iterative capsid screening has lived inside this trade-off at some point, and no library design approach has existed that resolves it.
The cost is an extra screening round. Three to six months, another synthesis budget, and a second pass through sequence space that a better-designed first follow-up library would have covered. Programs that iterate twice where once was sufficient are not making a scientific decision; they are paying for the absence of a tool.
BaseFit defines a third option: NNK-scale combinatorial diversity, stop codons suppressed to a negligible fraction, coverage concentrated on the sequence space the first round identified as promising, and all of it implementable through standard doped oligonucleotide synthesis at degenerate library cost. The trade-off disappears.
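To illustrate what a doped mixture changes, the sketch below compares the per-codon stop probability of a plain NNK codon with that of a codon doped around a parent sequence. It is not BaseFit's design procedure: the parent codon, the 70% parent-base fraction, and the helper names are hypothetical, and bases at each position are assumed to be incorporated independently.

```python
from typing import Dict, List

STOP_CODONS = ("TAA", "TAG", "TGA")
Mix = List[Dict[str, float]]  # one base -> probability mapping per codon position


def doped_mix(parent_codon: str, parent_fraction: float) -> Mix:
    """Per-position base mixture centered on a parent codon.

    The parent base gets `parent_fraction`; the remainder is split
    evenly across the other three bases.
    """
    other = (1.0 - parent_fraction) / 3.0
    return [
        {base: (parent_fraction if base == parent else other) for base in "ACGT"}
        for parent in parent_codon
    ]


def stop_probability(mix: Mix) -> float:
    """Probability that a codon drawn from the mixture is a stop codon."""
    return sum(
        mix[0].get(codon[0], 0.0) * mix[1].get(codon[1], 0.0) * mix[2].get(codon[2], 0.0)
        for codon in STOP_CODONS
    )


if __name__ == "__main__":
    uniform = {base: 0.25 for base in "ACGT"}
    nnk = [uniform, uniform, {"G": 0.5, "T": 0.5}]
    doped = doped_mix("GCT", 0.7)  # hypothetical parent codon and doping level
    print(f"NNK stop probability:   {stop_probability(nnk):.4f}")
    print(f"Doped stop probability: {stop_probability(doped):.4f}")
```

In this toy case the doped codon both stays close to the parent and carries a lower stop probability than NNK, at the price of sampling amino acids unevenly; how mixtures are actually chosen per position is the substantive design problem and is not addressed here.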