Data Visualization in Medicinal Chemistry

Hiking Trails in Activity Landscapes
by Prof. Dr Jürgen Bajorath

The massive growth of compound activity data provides opportunities and challenges for medicinal chemistry. Conventional approaches for the analysis of structure-activity relationships (SARs) are not suitable for the exploration and exploitation of this unprecedented knowledge base. Recently, new computational methodologies have been introduced for large-scale SAR analysis that emphasize visualization to provide intuitive access to complex SAR patterns and to identify key compounds.

Chemical tradition and subjectivity

Chemists are trained on the basis of two-dimensional representations of molecular structure, i.e., molecular graphs. In medicinal chemistry, the exploration of structure-activity relationships (SARs), a cornerstone of compound optimization efforts, is largely based on comparisons of molecular graphs of active compounds. Traditionally, structure-activity data are recorded and monitored in R-group tables that list chemical core structures and substituents (R-groups) of active compounds together with their potency information. To this day, such R-group tables are indispensable tools for practicing medicinal chemists.

Traditionally, medicinal chemistry efforts are centered on individual compound series. Analogs of active compounds are made and tested in an attempt to optimize compound potency and other optimization-relevant properties (for example, solubility or metabolic stability). Compound series are considered on a case-by-case basis, usually one series at a time. Individual compound series can be conveniently represented in R-group tables to deduce SAR information as long as compound numbers do not become too large. For example, it is hardly possible to subjectively analyze, with a chemist’s eye, and understand SAR information associated with more than 100 or so compounds. Although there are individual differences, we quickly reach our limits in analyzing and comparing larger numbers of chemical structures and their activities to derive SAR rules.

Despite these constraints, subjective criteria, chemical intuition, and experience continue to play a major role in compound analysis and design, more so than one might anticipate in the era of virtually unlimited information resources. Furthermore, despite the undisputed role of chemical ingenuity, it is well documented that even seasoned and successful medicinal chemists rarely agree in their assessment of chemical characteristics that render compounds ‘drug-like’ and attractive for further optimization [1]. In addition, our perception of chemical structure and properties is strongly context-dependent and conclusions drawn about preferred candidate compounds typically change with the ordering of molecules presented to us [1]. Chemical experience and intuition have a successful history in medicinal chemistry, but there is ample room for more systematic and ‘objective’ data analysis and compound design concepts.

Fig. 1 Section of a prototypic SAR network of a large compound data set. In a so-called network-like similarity graph (NSG), compounds are displayed as nodes and edges indicate molecular similarity relationships. Nodes are colored according to compound potency using a continuous color spectrum from green (lowest potency in the data set) over yellow to red (highest potency). In addition, nodes are scaled in size according to their contribution to local SAR discontinuity.

Fig. 2 Three-dimensional activity landscape. A 3D model of a compound data set is displayed that is reminiscent of ‘true’ activity landscapes containing gently sloped and rugged regions. This activity landscape view is obtained by a 2D projection of chemical reference space with an interpolated biological activity surface added as the third dimension. The surface is colored by compound potency according to Figure 1. White/transparent surface areas are interpolated and not populated with active compounds. Activity cliff and smooth regions are indicated. In activity cliff regions, small chemical modifications of compounds (i.e., very short ‘moves’ in chemical space) have a profound effect on biological activity. By contrast, in smooth regions, structurally diverse compounds have similar activity.

Fig. 3 Correspondence of SAR features in alternative landscape representations. On the right, a three-dimensional activity landscape is shown. Compound positions are represented as dots and the surface is colored according to compound potency. On the left, the corresponding network-like similarity graph is shown (represented according to Figure 1). Corresponding regions in these activity landscape views are indicated. In addition, molecular graphs of active compounds are shown that form an activity cliff (top) or map to regions of SAR continuity (bottom).

The compound activity data deluge

In recent years, we have been experiencing nearly exponential growth in compound activity data, with no end in sight, and currently available data volumes are beginning to impede traditional medicinal chemistry strategies. Compound activity data are growing at unprecedented rates not only in pharmaceutical companies but also in the public domain. For example, PubChem [2], the major public domain repository for biological screening data, and ChEMBL [3], a major public source of compound activity data from medicinal chemistry projects, already contain more than 10 million active molecules, the majority of which are annotated with activities against multiple biological targets.

In addition to increasing data volumes, the heterogeneity of SAR data has become a substantial complication for SAR exploration. For attractive therapeutic targets, many chemically different compound series are typically available that have originated from diverse sources and have been subjected to different types of biological activity measurements. This applies equally to public domain data, which are collected from the scientific and patent literature, and to proprietary compound data that accumulate within large pharmaceutical companies. For high-profile target families such as G protein-coupled receptors in the central nervous system or protein kinases implicated in various forms of cancer, the pharmaceutical industry has been generating large amounts of increasingly heterogeneous compound data in the course of drug discovery projects. Learning from this information for medicinal chemistry applications has become a challenge that is as yet largely unmet.

Fig. 4 Exploring the basis of chemical modifications of receptor ligands leading to changes in the molecular mechanism-of-action. In (a), an NSG variant is shown for a large set of receptor ligands with different mechanisms-of-action, in which the potency-based color code according to Figure 1 is replaced by mechanism-based coloring. As expected, compounds with the same mechanism are often more similar to each other than compounds with different mechanisms and thus form clusters in the network. However, there are exceptions such as the enlarged subgraph on the right that consists of very similar compounds with different mechanisms. In (b), the structures of analogs from this subgraph are compared, which reveals small chemical modifications leading to ‘mechanism hopping’.

Computational analysis and predictions

Increasing compound data volumes and heterogeneity conflict with the traditional ‘one compound series at a time’ focus of practical medicinal chemistry and make subjective case-by-case analysis very difficult. The tasks at hand go far beyond what could possibly be handled with the aid of molecular graphs and R-group tables. In light of this situation, should there not be a push to complement and support practical medicinal chemistry efforts with more systematic and ‘objective’ data analysis schemes? Yes, indeed. However, computational methods have traditionally played a different role in medicinal chemistry.

In order to assess the role of computational analysis and predictions in medicinal chemistry, we should distinguish between new computer-aided drug design methods that are typically promoted and applied by computational chemistry groups, often fairly remote from practicing medicinal chemists (which presents a problem for drug discovery), and computational techniques that have long been considered a more or less integral part of medicinal chemistry. First and foremost, this applies to the Quantitative SAR (QSAR) paradigm that has dominated computational approaches in medicinal chemistry since the 1960s [4]. QSAR analysis generally attempts to derive linear models of biological activity on the basis of known sets of structurally similar compounds that are represented by various two- or three-dimensional descriptors of molecular structure and properties. Then, the resulting series-specific QSAR models are utilized to predict the potency of new analogs. Although QSAR methods often substantially differ in their computational details, they share the traditional medicinal chemistry focus on individual compound series and, even more importantly, try to answer the cardinal question that governs the efforts of a practicing medicinal chemist: ‘Which compound to make next?’
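The series-specific modeling idea described above can be illustrated with a minimal sketch. It fits a one-descriptor linear model by ordinary least squares and uses it to estimate the potency of a new analog. The descriptor values, the pIC50 values, and the function names are hypothetical and chosen only for illustration; practical QSAR models typically use many descriptors and careful validation.

```python
# Minimal single-descriptor QSAR sketch: ordinary least squares in pure Python.
# All descriptor and potency values below are invented for illustration.

def fit_linear_qsar(descriptor, potency):
    """Fit potency = slope * descriptor + intercept by least squares."""
    n = len(descriptor)
    if n != len(potency) or n < 2:
        raise ValueError("need at least two (descriptor, potency) pairs")
    mean_x = sum(descriptor) / n
    mean_y = sum(potency) / n
    sxx = sum((x - mean_x) ** 2 for x in descriptor)
    if sxx == 0:
        raise ValueError("descriptor values must not all be identical")
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(descriptor, potency))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def predict_potency(model, descriptor_value):
    """Apply a fitted (slope, intercept) model to a new analog."""
    slope, intercept = model
    return slope * descriptor_value + intercept


# Hypothetical analog series: one descriptor value and one pIC50 per compound.
series_descriptor = [1.0, 2.0, 3.0, 4.0]
series_pic50 = [5.1, 5.9, 7.1, 7.9]

model = fit_linear_qsar(series_descriptor, series_pic50)
predicted = predict_potency(model, 5.0)  # estimate for a not-yet-made analog
print(model, predicted)
```

A model of this kind is only meaningful within the compound series it was derived from, which is exactly the single-series focus discussed in the text.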

This question is usually more important to a medicinal chemist than any other that might conceivably be addressed by computational analysis. It is therefore not surprising that QSAR-based predictions have long been the major focal point of computational medicinal chemistry and, in addition, that medicinal chemists are generally much more interested in compound activity predictions than in compound data mining and knowledge extraction. This presents a conundrum for medicinal chemistry that is beginning to be addressed.

New computational concepts

Given the large volumes of heterogeneous compound activity data that are becoming available, there has recently been increasing awareness, inside and outside the pharmaceutical industry, that medicinal chemistry must go beyond conventional (Q)SAR paradigms and learn from these large volumes of available proprietary as well as public data. Given declining drug approval rates and the enormous budgetary requirements of pharmaceutical R&D, one cannot possibly afford not to make use of such data as a knowledge base to learn from the past (considering both successes and failures) and make more data-driven decisions going forward.

To these ends, computational methodologies are required for systematic large-scale SAR analysis, taking structural heterogeneity of active compounds and different types of activity measurements into account. This is an area where data mining and SAR exploration meet and where new questions are addressed that go beyond individual compound activity predictions. For example, one would like to monitor the evolution of SAR information in the context of lead optimization projects involving different compound series over time and revisit decisions made by project teams to select one or the other analog or series for further exploration. In addition, one would like to compile and compare compound and SAR information that is currently available for a given therapeutic target and view newly identified active compounds in the context of this information to select the most promising candidates for further development.

Such tasks bring along new computational requirements. For example, numerical SAR analysis functions have been developed to systematically compare compound structures and potencies in very large data sets and make it possible to quantify SAR information on a large scale [5]. Furthermore, algorithms have been introduced to systematically identify compound pairs that are only distinguished by a single substructure exchange [6] and associate these so-called matched molecular pairs with SAR information [7]. Moreover, many results of large-scale SAR analysis are made accessible through visualization techniques.
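The matched molecular pair idea can be sketched in a few lines, under a strong simplifying assumption: each compound has already been decomposed into a named core and substituents at labeled sites. Published algorithms such as [6] derive such decompositions by systematically fragmenting molecular graphs, which is not attempted here. Compound names, cores, substituents, and pKi values are hypothetical.

```python
from itertools import combinations

# Hypothetical pre-fragmented compounds: (name, core, substituents by site, pKi).
compounds = [
    ("cpd1", "quinazoline", {"R1": "H", "R2": "OMe"}, 6.2),
    ("cpd2", "quinazoline", {"R1": "Cl", "R2": "OMe"}, 8.4),
    ("cpd3", "quinazoline", {"R1": "Cl", "R2": "OEt"}, 8.1),
    ("cpd4", "indole", {"R1": "H", "R2": "OMe"}, 5.0),
]


def matched_pairs(compounds):
    """Return pairs that share a core and differ at exactly one substitution site."""
    pairs = []
    for first, second in combinations(compounds, 2):
        name_a, core_a, subs_a, pot_a = first
        name_b, core_b, subs_b, pot_b = second
        if core_a != core_b or subs_a.keys() != subs_b.keys():
            continue
        changed = [site for site in subs_a if subs_a[site] != subs_b[site]]
        if len(changed) == 1:
            site = changed[0]
            pairs.append({
                "pair": (name_a, name_b),
                "site": site,
                "exchange": (subs_a[site], subs_b[site]),
                # Potency change associated with the exchange (log units).
                "delta_potency": round(pot_b - pot_a, 2),
            })
    return pairs


pairs = matched_pairs(compounds)
for entry in pairs:
    print(entry)
```

Attaching the potency difference to each exchange is what links a matched pair to SAR information: in this toy data, the H to Cl exchange at R1 is associated with a large potency gain, the OMe to OEt exchange at R2 with a small loss.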

SAR visualization

For the practice of medicinal chemistry, a numerical description of SAR characteristics is usually insufficient. Rather, results of data mining and analysis efforts must be presented to chemists in an intuitive manner. For large data sets, this ultimately requires the consideration of SAR visualization methods, which are becoming increasingly popular [8].

The concept of ‘activity landscapes’ is particularly suitable for visualization purposes. An activity landscape is generally defined as any graphical representation that systematically integrates compound similarity and potency relationships [9]. In a particularly intuitive form, an activity landscape can be rationalized as a two-dimensional projection of chemical space with biological activity added as the third dimension. Computationally, this requires the application of dimension reduction techniques as well as the interpolation of a coherent activity surface from arrays of compound potency values. Activity landscapes generated in this way resemble geographical maps in which smooth and rugged regions have a concrete SAR meaning. For example, in gently sloped and smooth regions, propagating structural changes of compounds (corresponding to moves in chemical space) are accompanied by only small to moderate changes in activity. Thus, structurally diverse compounds retain similar activity, a phenotype often referred to as ‘SAR continuity’. By contrast, in rugged landscape regions containing mountains and peaks, small chemical changes lead to significant potency alterations, a phenotype rationalized as ‘SAR discontinuity’. Pairs or groups of structurally very similar compounds (closely related analogs) with large potency differences (e.g., two to three orders of magnitude or more) represent the extreme form of SAR discontinuity and are termed ‘activity cliffs’ [10]. These activity cliffs are the most prominent feature of activity landscapes and are rich in SAR information, because small chemical modifications of active compounds lead to large-magnitude biological effects. Not surprisingly, once identified in large compound data sets, activity cliffs often become an immediate focal point of medicinal chemistry efforts.
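The activity cliff definition given above translates directly into a pairwise search: flag compound pairs whose structural similarity is high and whose potency difference is large. The following minimal sketch assumes that each compound is described by a set of structural features (integers stand in for fingerprint bits) and a potency on a logarithmic scale. The Tanimoto threshold of 0.7 and the toy data are illustrative assumptions; the potency gap of 2 log units follows the ‘two orders of magnitude’ mentioned in the text.

```python
from itertools import combinations


def tanimoto(features_a, features_b):
    """Tanimoto coefficient of two feature sets (shared / total distinct)."""
    union = len(features_a | features_b)
    return len(features_a & features_b) / union if union else 0.0


def find_activity_cliffs(compounds, min_similarity=0.7, min_potency_gap=2.0):
    """Return pairs that are structurally similar but far apart in potency.

    compounds maps a name to (feature_set, potency on a log scale such as pKi).
    Both thresholds are illustrative assumptions, not recommended values.
    """
    cliffs = []
    for a, b in combinations(sorted(compounds), 2):
        feats_a, pot_a = compounds[a]
        feats_b, pot_b = compounds[b]
        similarity = tanimoto(feats_a, feats_b)
        gap = abs(pot_a - pot_b)
        if similarity >= min_similarity and gap >= min_potency_gap:
            cliffs.append((a, b, round(similarity, 2), round(gap, 2)))
    return cliffs


# Hypothetical toy data set.
toy_set = {
    "A": ({1, 2, 3, 4, 5, 6, 7, 8}, 9.0),
    "B": ({1, 2, 3, 4, 5, 6, 7, 9}, 6.1),
    "C": ({1, 2, 3, 4, 5, 6, 7, 8, 10, 11}, 8.8),
    "D": ({20, 21, 22, 23}, 8.9),
}
cliffs = find_activity_cliffs(toy_set)
print(cliffs)
```

In this toy set, A and B are close analogs with a potency gap of nearly three log units and are reported as a cliff, whereas A and C are similar in both structure and potency and therefore belong to a region of SAR continuity.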

In addition to three-dimensional landscape models, there are many different, and equally informative, ways to represent activity landscapes of compound data sets and visualize SAR information. Among others, these include molecular network representations in which nodes represent active compounds and edges represent pairwise similarity relationships. Such SAR networks can be annotated with additional layers of information. For example, in so-called ‘network-like similarity graphs’ (NSGs) [9], large red and green nodes connected by edges represent activity cliffs that are easy to spot. Compounds forming such activity cliffs can be interactively selected from NSGs for further analysis. NSGs are designed to explore relationships between global and local SAR features in large and heterogeneous compound data sets. They enable the identification of information-rich SAR microenvironments in compound data sets having varying levels of SAR information content as well as data sets yielding heterogeneous activity landscapes.
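The network representation can be sketched as a plain adjacency structure: compounds become nodes, and an edge is drawn when the pairwise similarity reaches a threshold. The per-node score below, a similarity-weighted mean potency gap to neighboring compounds, is a simplified stand-in for a local discontinuity measure and is not the scoring function used in published NSGs [9]. The threshold, the potency values, and the precomputed similarity values are hypothetical.

```python
def build_similarity_graph(potency, similarity, threshold=0.65):
    """Return adjacency sets; an edge joins compounds with similarity >= threshold."""
    graph = {name: set() for name in potency}
    for (a, b), value in similarity.items():
        if value >= threshold:
            graph[a].add(b)
            graph[b].add(a)
    return graph


def local_discontinuity(graph, potency, similarity):
    """Score each node by the similarity-weighted mean potency gap to its neighbors."""
    scores = {}
    for node, neighbors in graph.items():
        if not neighbors:
            scores[node] = 0.0  # isolated compounds carry no local SAR signal here
            continue
        total = 0.0
        for other in neighbors:
            # Similarities are stored once per pair, in either key order.
            sim = similarity.get((node, other), similarity.get((other, node)))
            total += abs(potency[node] - potency[other]) * sim
        scores[node] = total / len(neighbors)
    return scores


# Hypothetical potencies (log scale) and precomputed pairwise similarities.
potency = {"A": 9.0, "B": 6.0, "C": 8.5, "D": 5.5}
similarity = {("A", "B"): 0.8, ("A", "C"): 0.7, ("B", "C"): 0.4, ("C", "D"): 0.3}

graph = build_similarity_graph(potency, similarity)
scores = local_discontinuity(graph, potency, similarity)
print(graph)
print(scores)
```

In a drawing of such a graph, potency would set the node color and the score would set the node size, so that a large high-potency node connected to a large low-potency node stands out as a candidate activity cliff.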

Methods for large-scale SAR visualization based upon the activity landscape concept and data mining techniques add a new dimension to computational medicinal chemistry and complement traditional QSAR analysis. They are designed to explore and exploit the rapidly growing amounts of compound activity data for the practice of medicinal chemistry.

Bibliography
[1] Lajiness, M.S. et al. (2004) J. Med. Chem. 47, 4891-4896
[2] Wang, Y. et al. (2012) Nucleic Acids Res. 40, D400-D412
[3] Gaulton, A. et al. (2012) Nucleic Acids Res. 40, D1100-D1107
[4] Esposito, E.X. et al. (2004) Methods Mol. Biol. 275, 131-214
[5] Peltason, L. & Bajorath, J. (2009) Future Med. Chem. 1, 451-466
[6] Hussain, J. & Rea, C. (2010) J. Chem. Inf. Model. 50, 339-348
[7] Wassermann, A.M. et al. (2012) Drug Develop. Res. 73, 518-527
[8] Stumpfe, D. & Bajorath, J. (2012) RSC Adv. 2, 369-378
[9] Wassermann, A.M. et al. (2010) J. Med. Chem. 53, 8209-8223
[10] Stumpfe, D. & Bajorath, J. (2012) J. Med. Chem. 55, 2932-2942

L&M int. 2 / 2013

This article was published in issue L&M int. 2 / 2013.

The Author:

Prof. Dr Jürgen Bajorath
