Editorial

Library size in virtual screening: is it truly a number’s game?

Pages 1177-1179 | Received 15 Jun 2022, Accepted 26 Sep 2022, Published online: 04 Oct 2022

1. Introduction

Drug discovery and development is a long and expensive enterprise. Looking back at the advances made at various stages of the drug discovery paradigm, a persistent pattern emerges: researchers invent a breakthrough technology, a novel methodological approach, or even a discipline in the hope of expediting the process, and expectations are justifiably raised. Depending on the specifics of the technology, the pharmaceutical sector invests heavily, commensurate with its culture and its approach to risk and innovation. Yet every new technology needs time for optimization before its potential can be realized. Years later, these high expectations are replaced by realism stemming from real-world determinations of the strengths and limitations of what seemed to be ‘the next big thing’ that would shorten the process and/or deliver a therapeutic faster and more cheaply. High throughput screening, several -omics disciplines, the fall and reemergence of phenotypic screening, cheminformatics, and novel targets from the human genome project come to mind as a few examples of advancements that launched substantial research efforts. This brings us to a contradiction: scientists are reserved and critical, yet they place great faith in, and high expectations on, promising new approaches. Among the root causes of this seemingly discrepant tendency could be our deep-seated aspiration to contribute, our disappointment with the cost of R&D relative to the low success rates of investigational new drugs, and our desire to translate knowledge into clinical applicability. Many of these novel methodologies have been instrumental in opening new avenues; nevertheless, they often did not meet the magnitude and breadth of the initial expectations. One of the latest developments with seemingly high expectations is large-scale docking (ultra-high throughput virtual screening).
It is premature to say whether it will follow the same pattern as some of the advancements mentioned above. Capitalizing on the vastness of chemical space presents a technological advantage; it also comes with a higher chance of ranking errors, inaccurate pose predictions, overlooked target–ligand interactions, and inefficient compound selection from a small fraction of the ultra-large dataset. Rather than viewing these and future advancements in isolation, we should keep in mind what we know by now: drug discovery is a balancing act and is sector-specific.

2. Virtual screening and library size in perspective

Structure-based virtual screening (VS) has been a staple in early drug discovery for more than two decades now. Pitfalls of VS are discussed elsewhere [Citation1], but it is worth mentioning the high interdependence between the sources of its inaccuracies. Specifically, scoring functions are not able to accurately rank returned solutions, and rarely are the predicted target–ligand complexes experimentally confirmed in prospective studies. Thus, the top-ranked solutions may not be the ones to advance to the hit-to-lead stage, either due to scoring errors or due to failures in binding pose prediction. Another widely discussed aspect of VS is the choice of libraries and their inadequate coverage of chemical space. If the entirety of all possible synthetic targets is not considered in a VS experiment, how can we be assured we will find the most active and novel hits to initiate drug discovery efforts?

Before we consider whether the above question is indeed the cornerstone of VS in drug discovery, let us look at its current status. For those who have followed recent advances in ultra-large chemical libraries and giga-docking, the future seems less bleak and once again more promising than ever [Citation2–7]. Lyu et al. [Citation2] docked make-on-demand, lead-like collections of 99 million and 138 million compounds against AmpC β-lactamase and the dopamine D4 receptor, respectively. Forty-four of the 51 top-ranked selected compounds against AmpC were synthesized, and five of them showed activities ranging from 1.3 to 400 μM. With respect to D4, 549 molecules were synthesized, 81 of which showed activities ranging from 18.4 nM to 8.3 μM. They concluded that the reported compounds would not have been discovered had they not screened the entire collection. Notably, two targets do not justify such a general statement regarding the merits of ultra-large libraries, in particular when the activity range for one target is modest and typical of findings from smaller VS campaigns. In a later report on six additional targets [Citation3], employing the entire DUD-E collection and stratified samples of it, the worst performance (in terms of enrichment factors and hit rates) for four of the six targets was observed with the entire, largest DUD-E library, although the best performance was achieved with the full DUD-E for the remaining two targets. To explore whether screening larger compound collections results in the identification of more potent actives, the same researchers assayed the five most potent compounds in the top 1% of the docked poses. For the two targets, the potencies were proportional to the library size, but this was not consistent throughout the investigated target space. These findings are undeniably powerful but not without issues.
They are limited to a few targets and the docking algorithms differed; therefore, these inherent limitations preclude any conclusive statements on generality. By their own admission, the researchers noted high false-positive and false-negative rates [Citation2]. In post-docking processing, compounds resembling known AmpC inhibitors in ChEMBL were excluded from the top-ranked 1 million molecules in order to identify novel chemotypes. In a prospective VS experiment, actives are unknown, and screening is performed with the intent to identify novel hits, without preexisting knowledge of the active space. Analog and decoy bias are reportedly hidden in the DUD-E dataset [Citation8]. Finally, the positive impact of the size of chemical space on enrichment factors and potencies of identified hits is suggestive at best, inconclusive at worst.
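As a back-of-the-envelope illustration of the metrics discussed above, the following Python sketch computes the hit rates implied by the counts quoted from [Citation2] and a conventional enrichment factor. The function names and the toy ranked list are hypothetical and introduced only for illustration. The enrichment factor follows the commonly used definition (hit rate in the top-ranked fraction divided by the hit rate in the whole library) and is not necessarily the exact formulation applied in [Citation3].

```python
from typing import Sequence


def hit_rate(n_active: int, n_tested: int) -> float:
    """Fraction of experimentally tested compounds that turned out active."""
    if n_tested <= 0:
        raise ValueError("n_tested must be positive")
    return n_active / n_tested


def enrichment_factor(ranked_is_active: Sequence[bool], top_fraction: float) -> float:
    """Hit rate in the top-ranked fraction divided by the hit rate overall.

    ranked_is_active is ordered from best to worst docking score.
    """
    n_total = len(ranked_is_active)
    n_actives = sum(ranked_is_active)
    if n_total == 0 or n_actives == 0:
        raise ValueError("need a non-empty list with at least one active")
    n_top = max(1, round(n_total * top_fraction))
    actives_in_top = sum(ranked_is_active[:n_top])
    return (actives_in_top / n_top) / (n_actives / n_total)


# Counts quoted in the text for the two targets of [Citation2].
ampc_rate = hit_rate(5, 44)    # 5 actives among 44 synthesized compounds
d4_rate = hit_rate(81, 549)    # 81 actives among 549 synthesized compounds

# Toy ranked list (an assumption, not data from any study):
# 10 compounds, 2 actives, both ranked at the very top.
toy_ranking = [True, True] + [False] * 8
toy_ef = enrichment_factor(toy_ranking, top_fraction=0.2)

print(f"AmpC hit rate: {ampc_rate:.1%}")
print(f"D4 hit rate:   {d4_rate:.1%}")
print(f"Toy EF at 20%: {toy_ef:.1f}")
```

Note that these hit rates are conditional on the compounds that were selected and synthesized, so they do not by themselves show what a smaller library would have yielded.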

Let us now revisit the present trend of ultra-large databases in VS and the recurrent claims that (i) such libraries guarantee improved VS performance, whereas the best compounds will be missed if smaller libraries are screened, and (ii) the number of compounds that subsequently need to be synthesized and assayed will be smaller [Citation5]. The underlying belief is that a larger space increases the chances of discovering better actives in terms of potency and/or novelty. Having a bigger pool of molecules to select from might increase our chances of better VS outcomes. However, these molecules need to be ranked at the top, which is uncertain [Citation9]. Moreover, increasing the library size increases the possibility of false-positives as well. False-negatives may not be as detrimental in the long run, since fundamentally VS is not about discovering all actives. Further, the simple reactions used to generate molecules in REAL Space and similar make-on-demand libraries may yield compounds that are not sufficiently complex or diverse [Citation10,Citation11]. Hit identification through VS is a multi-dimensional phase; it involves ranking, patentability, and data mining with the objective of identifying scaffolds to initiate lead optimization efforts. Ranking and data mining of VS outputs become far more challenging with increased library size, while patentability concerns remain the same for large or small(er) datasets. As for the second claim, that fewer molecules will need to be synthesized following ultra-large docking: even if the presupposition of more potent and synthesizable actives is met, improving potency in lead optimization is rarely a rate-limiting step. What has historically been challenging is the seemingly inverse relationship between potency and ADME. Moreover, R&D attrition is attributed not to suboptimal potency but to efficacy and safety, which are related to target validation and toxicology [Citation12].
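The point about false-positives can be made concrete with a deliberately simple sketch. It assumes, purely for illustration, that the fraction of true actives, the sensitivity of the scoring function, and its false-positive rate all stay constant as the library grows; none of these numbers comes from the cited studies, and the function name is hypothetical. Under those assumptions the absolute number of false-positives grows linearly with library size while the precision of the predicted-active set does not improve.

```python
def expected_outcomes(library_size: int,
                      active_fraction: float,
                      sensitivity: float,
                      false_positive_rate: float) -> dict:
    """Expected counts for a scoring function treated as a binary classifier.

    All four inputs are illustrative assumptions, not measured values.
    """
    actives = library_size * active_fraction
    inactives = library_size - actives
    true_positives = actives * sensitivity
    false_positives = inactives * false_positive_rate
    precision = true_positives / (true_positives + false_positives)
    return {
        "true_positives": true_positives,
        "false_positives": false_positives,
        "precision": precision,
    }


# Hypothetical rates: 0.1% true actives, 50% sensitivity, 1% false-positive rate.
small = expected_outcomes(1_000_000, 0.001, 0.5, 0.01)
large = expected_outcomes(100_000_000, 0.001, 0.5, 0.01)

for name, result in (("1 million", small), ("100 million", large)):
    print(f"{name:>12}: "
          f"TP ~ {result['true_positives']:,.0f}, "
          f"FP ~ {result['false_positives']:,.0f}, "
          f"precision ~ {result['precision']:.3f}")
```

In this toy model a hundredfold larger library yields a hundredfold more false-positives to sift through, which is one way to read the concern that ranking and data mining become harder as libraries grow. Real scoring functions will not behave this uniformly, so the sketch should be taken as a framing device rather than a prediction.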

Nevertheless, what is emerging through this push for big data science is the need to mine the increasingly vast virtual space in order to identify the most relevant compounds to the target under investigation. Toward that end, machine learning (ML) and deep learning techniques are seemingly outperforming consensus scoring [Citation8,Citation13,Citation14]. However, even these reports are not without confounding factors [Citation9], such as the bias of the datasets used in prospective studies, data quality, and data paucity. Therefore, more validation experiments and suitable benchmark data sets are needed to conclude that ML is superior.

3. Expert opinion

The challenge of any drug discovery endeavor is that protocols/workflows that led to a successful clinical candidate in one area will not necessarily be applicable in another. Critical thinking, a deeper understanding of the system at hand, well-designed experiments, and a balance of multi-disciplinary inputs increase our likelihood of success in pharmaceutical research. Over the years, we have been fascinated by chemical space; however, quantity is meaningless if not coupled with efficient ways to identify subsets of molecules that are novel and relevant to the target under investigation [Citation11,Citation15]. It seems promising that we are now able to navigate through a large number of molecules in the hope of expediting early drug discovery; nonetheless, we should approach this new territory with caution, always remembering the lessons we learned from the past. Integration of information, careful processing of the libraries employed prior to the respective screens, and thorough post-processing are major contributors to the numerous examples of optimal VS outcomes to date. Even in those cases that have recently been highlighted for their ability to identify more potent hits/leads via large-scale docking, an extensive series of pre- and post-screening controls, in addition to subjective visual inspection, was undertaken. We do not have sufficient data thus far to make unequivocal statements that ‘unless one screens the biggest possible space, the chance of discovering optimizable hits is limited.’ Instead, we should balance the pros and cons on a per-project basis, including hardware, software, and manpower requirements and availability, and proceed accordingly. The numerous successes of the past, despite using smaller libraries, should remind us that efficiency stems from thorough and manageable analyses, carefully planned rounds of optimization, and chemical intuition.

Declaration of interest

The author has no relevant affiliations or financial involvement with any organization or entity with a financial interest in or financial conflict with the subject matter or materials discussed in the manuscript. This includes employment, consultancies, honoraria, stock ownership or options, expert testimony, grants or patents received or pending, or royalties.

Reviewer disclosures

Peer reviewers on this manuscript have no relevant financial or other relationships to disclose.

Acknowledgment

The author thanks Dr Bill Seibel for carefully reading and providing feedback on the manuscript.

Additional information

Funding

This paper was not funded.

References

  • Slater O, Kontoyianni M. The compromise of virtual screening and its impact on drug discovery. Expert Opin Drug Discov. 2019;14(7):619–637.
  • Lyu J, Wang S, Balius TE, et al. Ultra-large library docking for discovering new chemotypes. Nature. 2019;566:224–229.
  • Fresnais L, Ballester PJ. The impact of compound library size on the performance of scoring functions for structure-based virtual screening. Brief Bioinform. 2021;22(3).
  • Bender BJ, Gahbauer S, Luttens A, et al. A practical guide to large-scale docking. Nat Protoc. 2021;16:4799–4832.
  • Gloriam DE. Bigger is better in virtual screens. Nature. 2019;566:193–194.
  • Clark DE. Virtual screening: is bigger always better? Or can small be beautiful? J Chem Inf Model. 2020;60:4120–4123.
  • Warr WA, Nicklaus MC, Nicolaou CA, et al. Exploration of ultralarge compound collections for drug discovery. J Chem Inf Model. 2022;62:2021–2034.
  • Chen L, Cruz A, Ramsey S, et al. Hidden bias in the DUD-E dataset leads to misleading performance of deep learning in structure-based virtual screening. PLoS One. 2019;14:e0220113.
  • Ross GA, Morris GM, Biggin PC. One size does not fit all: the limits of structure-based models in drug discovery. J Chem Theory Comput. 2013;9:4266–4274.
  • Grygorenko OO, Radchenko DS, Dziuba I, et al. Generating multibillion chemical space of readily accessible screening compounds. iScience. 2020;23(11):101681.
  • Hoffmann T, Gastreich M. The next level in chemical space navigation: going far beyond enumerable compound libraries. Drug Discov Today. 2019;24(5):1148–1156.
  • Waring MJ, Arrowsmith J, Leach AR, et al. An analysis of the attrition of drug candidates from four major pharmaceutical companies. Nat Rev Drug Discov. 2015;14(7):475–486.
  • Ricci-Lopez J, Aguila SA, Gilson MK, et al. Improving structure-based virtual screening with ensemble docking and machine learning. J Chem Inf Model. 2021;61(11):5362–5376.
  • Xiong GL, Ye WL, Shen C, et al. Improving structure-based virtual screening performance via learning from scoring function components. Brief Bioinform. 2021;22(3):1–14.
  • van Hilten N, Chevillard F, Kolb P. Virtual compound libraries in computer-assisted drug discovery. J Chem Inf Model. 2019;59(2):644–651.
