The number of publications describing chemical structures has increased steadily over the last decades. However, the majority of published chemical information is currently not available in machine-readable form in public databases. It remains a challenge to automate the process of information extraction in a way that requires less manual intervention - especially the mining of chemical structure depictions. As an open-source platform that leverages recent advancements in deep learning, computer vision, and natural language processing, DECIMER.ai (Deep lEarning for Chemical IMagE Recognition) strives to automatically segment, classify, and translate chemical structure depictions from the printed literature. The segmentation and classification tools are the only openly available packages of their kind, and the optical chemical structure recognition (OCSR) core application yields outstanding performance on all benchmark datasets. The source code, the trained models and the datasets developed in this work have been published under permissive licences. An instance of the DECIMER web application is available at https://decimer.ai.
Hydrogen bonding is an interaction of great importance in drug discovery and development as it may significantly affect chemical and biological processes including the interaction of small molecules with other molecules, proteins, and membranes. In particular, hydrogen bonding can impact drug-like properties such as target affinity and oral availability, which are critical to developing effective pharmaceuticals, and therefore numerous methods for the calculation of properties such as hydrogen-bond strengths, free energy of hydration, or water solubility have been proposed over time. However, access to efficient methods for predicting such properties is still limited. Here, we present the development of Jazzy, an open-source tool for the prediction of hydrogen-bond strengths and free energies of hydration of small molecules. Jazzy also allows the visualisation of hydrogen-bond strengths with atomistic resolution to support the design of compounds with desired properties and the interpretation of existing data. The tool is described in its implementation, parameter fitting, and validation against two data sets of experimental hydration free energies. Jazzy is also applied to two chemical series of bioactive compounds to show that hydrogen-bond strengths can be used to understand their structure–activity relationships. Results from the validations highlight the strengths and limitations of Jazzy, and suggest its suitability for interactive design, screening, and machine-learning featurisation.
Over the past three years, significant progress has been made in the development of novel promising drug candidates against COVID-19. However, SARS-CoV-2 mutations resulting in the emergence of new viral strains that can be resistant to the drugs used currently in the clinic necessitate the development of novel potent and broad therapeutic agents targeting different vulnerable spots of the viral proteins. In this study, two deep learning generative models were developed and used in combination with molecular modeling tools for de novo design of small molecule compounds that can inhibit the catalytic activity of SARS-CoV-2 main protease (Mpro), an enzyme critically important for mediating viral replication and transcription. As a result, the seven best scoring compounds that exhibited low values of binding free energy comparable with those calculated for two potent inhibitors of Mpro, via the same computational protocol, were selected as the most probable inhibitors of the enzyme catalytic site. In light of the data obtained, the identified compounds are assumed to present promising scaffolds for the development of new potent and broad-spectrum drugs inhibiting SARS-CoV-2 Mpro, an attractive therapeutic target for anti-COVID-19 agents.
Natural products represent an excellent source of unprecedented anticancer compounds. However, the identification of the mechanism of action remains a major challenge. Several techniques and methodologies have been considered, but with limited success. In this work, we explored the combination of live cell imaging and machine learning techniques as a promising tool to determine, in a fast and affordable test, the mode of action of natural compounds with antiproliferative activity. To develop the model, we selected the non-small cell lung cancer cell line SW1573, which was exposed to the known antimitotic drugs paclitaxel, colchicine and vinblastine. The novelty of our methodology lies in two main features: (a) meaningful phenotypic metrics, and (b) fast Fourier transform (FFT) of the time series of the phenotypic parameters into their corresponding amplitudes and phases. The resulting algorithm was able to cluster the microtubule disruptors, while also showing a negative correlation between paclitaxel and the other treatments. The FFT approach was able to group the samples as efficiently as visual inspection. This methodology could easily scale to group a large amount of data without visual supervision.
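The transformation of a phenotypic time series into amplitudes and phases can be illustrated with a minimal sketch. This is not the authors' pipeline: it uses a naive discrete Fourier transform in place of an FFT (both yield the same coefficients), and the function name and sample series are hypothetical.

```python
import cmath
import math


def dft_amplitude_phase(series):
    """Return (amplitudes, phases) of the discrete Fourier transform of a real series.

    Naive O(n^2) DFT used for clarity; an FFT gives the same coefficients faster.
    Amplitudes are normalised by the series length.
    """
    n = len(series)
    amplitudes, phases = [], []
    for k in range(n):
        coeff = sum(
            x * cmath.exp(-2j * math.pi * k * t / n) for t, x in enumerate(series)
        )
        amplitudes.append(abs(coeff) / n)
        phases.append(cmath.phase(coeff))
    return amplitudes, phases


# Hypothetical phenotypic metric sampled at 8 time points:
# a constant offset of 2.0 plus one full cosine cycle.
series = [2.0 + math.cos(2 * math.pi * t / 8) for t in range(8)]
amps, phases = dft_amplitude_phase(series)
```

Each time series is thereby reduced to a fixed-length vector of amplitudes and phases, which could then be passed to a clustering or correlation step.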
Drug discovery (DD) is a time-consuming and expensive process. Thus, the industry employs strategies such as drug repositioning and drug repurposing, which allow the application of already approved drugs to treat a different disease, as occurred in the first months of 2020, during the COVID-19 pandemic. The prediction of drug-target interactions (DTI) is an essential part of the DD process because it can accelerate it and reduce the required costs. In silico DTI prediction has used approaches based on molecular docking simulations, including similarity-based and network- and graph-based ones. This paper presents MPS2IT-DTI, a DTI prediction model obtained from research conducted in the following steps: the definition of a new method for encoding molecule and protein sequences onto images; and the definition of a deep-learning approach based on a convolutional neural network in order to create a new method for DTI prediction. Training results obtained with the Davis and KIBA datasets show that MPS2IT-DTI is viable compared to other state-of-the-art (SOTA) approaches in terms of performance and complexity of the neural network model. With the Davis dataset, we obtained 0.876 for the concordance index and 0.276 for the MSE; with the KIBA dataset, we obtained 0.836 and 0.226 for the concordance index and the MSE, respectively. Moreover, the MPS2IT-DTI model represents molecule and protein sequences as images, instead of treating them as an NLP task, and as such, does not employ an embedding layer, which is present in other models.
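The two metrics reported above, the concordance index and the MSE, can be sketched as follows. This is an illustrative implementation of the usual pairwise definition of the concordance index (prediction ties count as half), not code from the paper; the affinity values are made up.

```python
def concordance_index(y_true, y_pred):
    """Fraction of comparable pairs whose predicted order matches the true order.

    Pairs with equal true values are skipped; tied predictions count as 0.5.
    """
    concordant = 0.0
    comparable = 0
    n = len(y_true)
    for i in range(n):
        for j in range(i + 1, n):
            if y_true[i] == y_true[j]:
                continue  # pair is not comparable
            comparable += 1
            d_true = y_true[i] - y_true[j]
            d_pred = y_pred[i] - y_pred[j]
            if d_pred == 0:
                concordant += 0.5
            elif (d_true > 0) == (d_pred > 0):
                concordant += 1.0
    return concordant / comparable if comparable else 0.0


def mean_squared_error(y_true, y_pred):
    """Average squared difference between true and predicted values."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


# Hypothetical binding affinities and model predictions.
affinities = [5.0, 6.2, 7.1, 8.4]
predictions = [5.3, 6.0, 7.5, 7.2]
ci = concordance_index(affinities, predictions)
mse = mean_squared_error(affinities, predictions)
```

In this toy case five of the six pairs are ranked correctly, so the concordance index is 5/6, while the MSE is dominated by the single large error on the last compound.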
Subtle variations in the lipid composition of mitochondrial membranes can have a profound impact on mitochondrial function. The inner mitochondrial membrane contains the phospholipid cardiolipin, which has been demonstrated to act as a biomarker for a number of diverse pathologies. Small molecule dyes capable of selectively partitioning into cardiolipin membranes enable visualization and quantification of the cardiolipin content. Here we present a data-driven approach that combines a deep learning-enabled active learning workflow with coarse-grained molecular dynamics simulations and alchemical free energy calculations to discover small organic compounds able to selectively permeate cardiolipin-containing membranes. By employing transferable coarse-grained models we efficiently navigate the all-atom design space corresponding to small organic molecules with molecular weight less than ≈500 Da. After direct simulation of only 0.42% of our coarse-grained search space we identify molecules with considerably increased levels of cardiolipin selectivity compared to a widely used cardiolipin probe 10-N-nonyl acridine orange. Our accumulated simulation data enables us to derive interpretable design rules linking coarse-grained structure to cardiolipin selectivity. The findings are corroborated by fluorescence anisotropy measurements of two compounds conforming to our defined design rules. Our findings highlight the potential of coarse-grained representations and multiscale modelling for materials discovery and design.
Artificial intelligence (AI) and machine learning (ML) are expanding in popularity for broad applications to challenging tasks in chemistry and materials science. Examples include the prediction of properties, the discovery of new reaction pathways, or the design of new molecules. The machine needs to read and write fluently in a chemical language for each of these tasks. Strings are a common tool to represent molecular graphs, and the most popular molecular string representation, SMILES, has powered cheminformatics since the late 1980s. However, in the context of AI and ML in chemistry, SMILES has several shortcomings; most pertinently, most combinations of symbols lead to invalid results with no valid chemical interpretation. To overcome this issue, a new language for molecules was introduced in 2020 that guarantees 100% robustness: SELF-referencing embedded strings (SELFIES). SELFIES has since simplified and enabled numerous new applications in chemistry. In this perspective, we look to the future and discuss molecular string representations, along with their respective opportunities and challenges. We propose 16 concrete future projects for robust molecular representations. These involve the extension toward new chemical domains, exciting questions at the interface of AI and robust languages, and interpretability for both humans and machines. We hope that these proposals will inspire several follow-up works exploiting the full potential of molecular string representations for the future of AI in chemistry and materials science.
The elucidation of a compound's Mechanism of Action (MoA) is a challenging task in the drug discovery process, but it is important in order to rationalise phenotypic findings and to anticipate potential side-effects. Bioinformatic approaches, advances in machine learning techniques and the increasing deposition of high-throughput data in public databases have significantly contributed to recent advances in the field, but it is not straightforward to decide which data and methods are most suitable to use in a given case. In this review, we focus on these methods and data and their applications in generating MoA hypotheses for subsequent experimental validation. We discuss compound-specific data such as -omics, cell morphology and bioactivity data, as well as commonly used supplementary prior knowledge such as network and pathway data, and provide information on databases where this data can be accessed. In terms of methodologies, we discuss both well-established methods (connectivity mapping, pathway enrichment) and emerging methods (neural networks and multi-omics integration). Finally, we review case studies where the MoA of a compound was successfully suggested from computational analysis by incorporating multiple data modalities and/or methodologies. Our aim for this review is to provide researchers with insights into the benefits and drawbacks of both the data and methods in terms of level of understanding, biases and interpretation, and to highlight future avenues of investigation which we foresee will improve the field of MoA elucidation, including greater public access to -omics data and methodologies which are capable of data integration.
Deep learning methods have in recent years shown promising results in characterizing proteins and extracting complex sequence–structure–function relationships. This Analysis describes a benchmarking study to compare the performances and advantages of recent deep learning approaches in a range of protein prediction tasks.
Drug discovery aims at finding new compounds with specific chemical properties for the treatment of diseases. In recent years, computer science has become an important component of this search, driven by the rapid growth of machine learning techniques and their democratization. Given the objectives set by the Precision Medicine initiative and the new challenges they generate, it is necessary to establish robust, standard and reproducible computational methodologies to achieve them. Currently, predictive models based on machine learning have gained great importance in the step prior to preclinical studies. This stage manages to drastically reduce costs and research times in the discovery of new drugs. This review article focuses on how these new methodologies have been used in recent years of research. Analyzing the state of the art in this field will give us an idea of where cheminformatics will develop in the short term, the limitations it presents and the positive results it has achieved. This review will focus mainly on the methods used to model the molecular data, as well as the biological problems addressed and the machine learning algorithms used for drug discovery in recent years.
MOTIVATION: With the rapid increase of biomedical articles, large-scale automatic Medical Subject Headings (MeSH) indexing has become increasingly important. FullMeSH, the only method for large-scale MeSH indexing with full text, suffers from three major drawbacks: FullMeSH (i) uses Learning To Rank, which is time-consuming, (ii) can capture some pre-defined sections only in full text and (iii) ignores the whole MEDLINE database.
RESULTS: We propose a computationally lighter, full-text and deep-learning-based MeSH indexing method, BERTMeSH, which is flexible for section organization in full text. BERTMeSH has two technologies: (i) the state-of-the-art pre-trained deep contextual representation, Bidirectional Encoder Representations from Transformers (BERT), which lets BERTMeSH capture deep semantics of full text; and (ii) a transfer learning strategy for using both full text in PubMed Central (PMC) and title and abstract (only, with no full text) in MEDLINE, to take advantage of both. In our experiments, BERTMeSH was pre-trained with 3 million MEDLINE citations and trained on ∼1.5 million full texts in PMC. BERTMeSH outperformed various cutting-edge baselines. For example, for 20 K test articles of PMC, BERTMeSH achieved a Micro F-measure of 69.2%, which was 6.3% higher than FullMeSH, with the difference being statistically significant. In addition, prediction for 20 K test articles needed 5 min with BERTMeSH, while it took more than 10 h with FullMeSH, demonstrating the computational efficiency of BERTMeSH.
SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
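The Micro F-measure quoted in the BERTMeSH results pools true positives, false positives and false negatives over all articles before computing precision and recall. The sketch below shows the standard definition on made-up label sets; it is not taken from the BERTMeSH code, and the function name is hypothetical.

```python
def micro_f_measure(true_labels, predicted_labels):
    """Micro-averaged F-measure over a collection of multi-label predictions.

    Each element of the two lists is the set of labels for one article.
    """
    tp = fp = fn = 0
    for truth, pred in zip(true_labels, predicted_labels):
        tp += len(truth & pred)   # labels correctly predicted
        fp += len(pred - truth)   # labels predicted but not annotated
        fn += len(truth - pred)   # annotated labels that were missed
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Hypothetical MeSH annotations for two articles.
truth = [{"Humans", "Neoplasms"}, {"Mice", "Liver", "Apoptosis"}]
pred = [{"Humans", "Neoplasms", "Adult"}, {"Mice", "Liver"}]
score = micro_f_measure(truth, pred)
```

Because counts are pooled before averaging, frequent headings influence the micro score more than rare ones.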
The automatic recognition of chemical structure diagrams from the literature is an indispensable component of workflows to re-discover information about chemicals and to make it available in open-access databases. Here we report preliminary findings in our development of Deep lEarning for Chemical ImagE Recognition (DECIMER), a deep learning method based on existing show-and-tell deep neural networks, which makes very few assumptions about the structure of the underlying problem. It translates a bitmap image of a molecule, as found in publications, into a SMILES. The training state reported here does not yet rival the performance of existing traditional approaches, but we present evidence that our method will reach a comparable detection power with sufficient training time. Training success of DECIMER depends on the input data representation: DeepSMILES are superior to SMILES and we have a preliminary indication that the recently reported SELFIES outperform DeepSMILES. An extrapolation of our results towards larger training data sizes suggests that we might be able to achieve near-accurate prediction with 50 to 100 million training structures. This work is entirely based on open-source software and open data and is available to the general public for any purpose.
Computational methods to predict Z-DNA regions are in high demand to understand the functional role of Z-DNA. The previous state-of-the-art method Z-Hunt is based on statistical mechanical and energy considerations about B- to Z-DNA transition using sequence information. Z-DNA ChIP-seq experiment results showed little overlap with Z-Hunt predictions, implying that sequence information alone is not sufficient to explain the emergence of Z-DNA at different genomic locations. Adding epigenetic and other functional genomic mark-ups to the DNA sequence level can help reveal the functional Z-DNA sites. Here we take advantage of the deep learning approach that can analyze and extract information from large volumes of molecular biology data. We developed a machine learning approach, DeepZ, that aggregates information from genome-wide maps of epigenetic markers, transcription factor and RNA polymerase binding sites, and chromosome accessibility maps. With the developed model we not only verify the experimental Z-DNA predictions, but also generate the whole-genome annotation, introducing new possible Z-DNA regions, which have not yet been found in experiments and can be of interest to researchers from various fields.
The task of drug-target interaction prediction holds significant importance in pharmacology and therapeutic drug design. In this paper, we present FRnet-DTI, an auto-encoder based feature manipulation and a convolutional neural network based classifier for drug target interaction prediction. Two convolutional neural networks are proposed: FRnet-Encode and FRnet-Predict. Here, one model is used for feature manipulation and the other one for classification. Using the first method, FRnet-Encode, we generate 4096 features for each of the instances in each of the datasets and use the second method, FRnet-Predict, to identify interaction probability employing those features. We have tested our method on four gold standard datasets extensively used by other researchers. Experimental results show that our method significantly improves over the state-of-the-art method on three out of four drug-target interaction gold standard datasets on both the area under curve for Receiver Operating Characteristic (auROC) and the area under Precision Recall curve (auPR) metrics. We also introduce twenty new potential drug-target pairs for interaction based on high prediction scores. The source codes and implementation details of our methods are available from https://github.com/farshidrayhanuiu/FRnet-DTI/ and also readily available to use as a web application from http://farshidrayhan.pythonanywhere.com/FRnet-DTI/ .
Therapeutic effects of drugs are mediated via interactions between them and their intended targets. As such, prediction of drug-target interactions is of great importance. Drug-target interaction prediction is especially relevant in the case of drug repositioning where attempts are made to repurpose old drugs for new indications. While experimental wet-lab techniques exist for predicting such interactions, they are tedious and time-consuming. On the other hand, computational methods also exist for predicting interactions, and they do so with reasonable accuracy. In addition, computational methods can help guide their wet-lab counterparts by recommending interactions for further validation. In this chapter, a computational method for predicting drug-target interactions is presented. Specifically, we describe a machine learning method that utilizes ensemble learning to perform predictions. We also mention details pertaining to the preparation of the data required for the prediction effort and demonstrate how to evaluate and improve prediction performance.
Identifying drug-target interactions is crucial for the success of drug discovery. Approaches based on machine learning for this problem can be divided into two types: feature-based and similarity-based methods. By utilizing the "Learning to rank" framework, we propose a new method, DrugE-Rank, to combine these two different types of methods for improving the prediction performance of new candidate drugs and targets. DrugE-Rank is available at http://datamining-iip.fudan.edu.cn/service/DrugE-Rank/ .
Generating molecules with desired chemical properties is important for drug discovery. The use of generative neural networks is promising for this task. However, from visual inspection, it often appears that generated samples lack diversity. In this paper, we quantify this internal chemical diversity, and we raise the following challenge: can a nontrivial AI model reproduce natural chemical diversity for desired molecules? To illustrate this question, we consider two generative models: a Reinforcement Learning model and the recently introduced ORGAN. Both fail at this challenge. We hope this challenge will stimulate research in this direction.
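One common way to quantify the internal diversity of a set of generated molecules is the mean pairwise Tanimoto distance between their fingerprints. The abstract does not state the exact metric used, so the sketch below is an assumption, with fingerprints represented as hypothetical sets of "on" bit indices.

```python
from itertools import combinations


def tanimoto_similarity(a, b):
    """Tanimoto (Jaccard) similarity between two fingerprints given as sets of on-bits."""
    union = len(a | b)
    return len(a & b) / union if union else 1.0


def internal_diversity(fingerprints):
    """Mean pairwise Tanimoto distance (1 - similarity) over all distinct pairs."""
    pairs = list(combinations(fingerprints, 2))
    if not pairs:
        return 0.0
    return sum(1.0 - tanimoto_similarity(a, b) for a, b in pairs) / len(pairs)


# Hypothetical fingerprints for three generated molecules.
fingerprints = [{1, 2, 3}, {2, 3, 4}, {7, 8}]
diversity = internal_diversity(fingerprints)
```

A value near 0 indicates that the generated samples are near-duplicates, while a value near 1 indicates that they share almost no fingerprint bits.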
MOTIVATION: With the booming of interactome studies, a lot of interactions can be measured in a high throughput way and large scale datasets are available. It is becoming apparent that many different types of interactions can be potential drug targets. Compared with inhibition of a single protein, inhibition of protein-protein interaction (PPI) is promising to improve the specificity with fewer adverse side-effects. Also it greatly broadens the drug target search space, which makes the drug target discovery difficult. Computational methods are highly desired to efficiently provide candidates for further experiments and hold the promise to greatly accelerate the discovery of novel drug targets.
RESULTS: Here, we propose a machine learning method to predict PPI targets on a genome-wide scale. Specifically, we develop a computational method, named PrePPItar, to Predict PPIs as drug targets by uncovering the potential associations between drugs and PPIs. First, we survey the databases and manually construct a gold-standard positive dataset for drug and PPI interactions. This effort leads to a dataset with 227 associations among 63 PPIs and 113 FDA-approved drugs and allows us to build models to learn the association rules from the data. Second, we characterize drugs by profiling in chemical structure, drug ATC-code annotation, and side-effect space and represent PPI similarity by a symmetrical S-kernel based on protein amino acid sequence. Then the drugs and PPIs are correlated by a Kronecker product kernel. Finally, a support vector machine (SVM) is trained to predict novel associations between drugs and PPIs. We validate our PrePPItar method on the well-established gold-standard dataset by cross-validation. We find that chemical structure, drug ATC-code, and side-effect information are all predictive of PPI targets. Moreover, we can increase the PPI target prediction coverage by integrating multiple data sources. Follow-up database search and pathway analysis indicate that our new predictions are worthy of future experimental validation.
CONCLUSION: In conclusion, PrePPItar can serve as a useful tool for PPI target discovery and provides a general heterogeneous data integrative framework.
AVAILABILITY AND IMPLEMENTATION: PrePPItar is available at http://doc.aporc.org/wiki/PrePPItar.
CONTACT: ycwang@nwipb.cas.cn or ywang@amss.ac.cn
SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
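The Kronecker product kernel mentioned in the PrePPItar results combines a drug kernel and a PPI kernel into a kernel over (drug, PPI) pairs: the similarity of two pairs is the product of the drug similarity and the PPI similarity. The sketch below illustrates that construction on made-up 2×2 kernels; the function name and values are hypothetical and the code is not from PrePPItar.

```python
def kronecker_pair_kernel(k_drug, k_ppi, pairs):
    """Gram matrix over (drug, ppi) index pairs.

    Entry [a][b] equals k_drug[d_a][d_b] * k_ppi[p_a][p_b], which is the
    corresponding entry of the Kronecker product of the two kernel matrices.
    """
    n = len(pairs)
    return [
        [
            k_drug[pairs[a][0]][pairs[b][0]] * k_ppi[pairs[a][1]][pairs[b][1]]
            for b in range(n)
        ]
        for a in range(n)
    ]


# Hypothetical similarity kernels for two drugs and two PPIs.
k_drug = [[1.0, 0.4], [0.4, 1.0]]
k_ppi = [[1.0, 0.5], [0.5, 1.0]]
pairs = [(0, 0), (0, 1), (1, 0), (1, 1)]  # (drug index, PPI index)
gram = kronecker_pair_kernel(k_drug, k_ppi, pairs)
```

The resulting Gram matrix could be supplied to any SVM implementation that accepts a precomputed kernel.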
Computationally predicting drug-target interactions is useful to select possible drug (or target) candidates for further biochemical verification. We focus on machine learning-based approaches, particularly similarity-based methods that use drug and target similarities, which show relationships among drugs and those among targets, respectively. These two similarities represent two emerging concepts, the chemical space and the genomic space. Typically, the methods combine these two types of similarities to generate models for predicting new drug-target interactions. This process is also closely related to a lot of work in pharmacogenomics or chemical biology that attempts to understand the relationships between the chemical and genomic spaces. This background makes the similarity-based approaches attractive and promising. This article reviews the similarity-based machine learning methods for predicting drug-target interactions, which are state-of-the-art and have aroused great interest in bioinformatics. We describe each of these methods briefly, and empirically compare these methods under a uniform experimental setting to explore their advantages and limitations.
The early phases of commercial drug discovery programs are increasingly guided by information extracted from three‐dimensional structures of the target proteins and in silico design techniques. This review addresses key issues of docking and scoring, a popular technique in structure‐based drug design. The pros and cons of computational tools currently used will be outlined as well as the integration of these methods in the lead finding and lead optimization process.