CHINA BIOTECHNOLOGY

China Biotechnology, 2022, Vol. 42, Issue 4: 40-48    DOI: 10.13523/j.cb.2111037
Review
Research Progress of Drug Target Interaction Prediction Based on Machine Learning*
LIU Hao-miao, YANG Zhi-wei**, WANG Li-zhuo, ZHOU Yan-zhang, LONG Jian-gang
Center of Mitochondrial Biology and Medicine, Key Laboratory of Biomedical Information Engineering, Ministry of Education, School of Life Science and Technology, Xi’an Jiaotong University, Xi’an 710049, China
Abstract:

In recent years, continued advances in computer hardware, software tools and data abundance have allowed artificial intelligence technologies, represented by machine learning, to expand into and integrate with biology, basic medicine and pharmacy. This has greatly promoted the development of these fields, and of drug research and development (R&D) in particular. The identification of drug-target interactions (DTI) is an important problem in drug R&D and a popular direction for the cross-integration of artificial intelligence technology. As the starting point of innovative drug development, DTI prediction can supply high-probability candidate drug targets for biological experiments, thereby increasing the rate of lead compound discovery, raising the success rate of late-stage drug development and shortening the overall development cycle. Researchers have done a great deal of work on DTI prediction, building many important databases and developing or extending machine learning algorithms and software tools. In most of this work, data are transformed into feature vectors or similarities, and suitable machine learning methods are then used to build predictive models. This paper introduces the basic workflow of machine learning-based DTI prediction, reviews research progress in the area, and briefly summarizes the advantages and disadvantages of the different machine learning methods, with the aim of supporting the development of more effective prediction algorithms and of DTI prediction as a whole.

Key words: Machine learning    Drug-target interaction    Drug research and development    Algorithm
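Received: 2021-11-18    Published: 2022-05-05
CLC number (ZTFLH): Q819
Funding: * National Natural Science Foundation of China (31870848); Key Project of the Shaanxi Provincial Science Foundation (2018JZ3005)
Corresponding author: ** YANG Zhi-wei    E-mail: yzws-123@xjtu.edu.cn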
Cite this article:

LIU Hao-miao, YANG Zhi-wei, WANG Li-zhuo, ZHOU Yan-zhang, LONG Jian-gang. Research Progress of Drug Target Interaction Prediction Based on Machine Learning. China Biotechnology, 2022, 42(4): 40-48.

Link to this article:

https://manu60.magtech.com.cn/biotech/CN/10.13523/j.cb.2111037
https://manu60.magtech.com.cn/biotech/CN/Y2022/V42/I4/40

Figure 1  Technical roadmap of machine learning applied to DTI prediction
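| Database | Link | Description |
| --- | --- | --- |
| DrugBank[7] | https://go.drugbank.com | Contains detailed drug data and comprehensive drug target information; one of the most popular databases |
| PubChem[8] | https://pubchem.ncbi.nlm.nih.gov | A collection of compounds and their associated activities; supports complex queries and download of search results |
| TTD[9] | https://db.idrblab.net/ttd | Records information on known protein and nucleic acid targets, together with the diseases and pathways these targets relate to and the corresponding interacting drug-target molecules |
| BindingDB[10] | https://www.bindingdb.org/bind/index.jsp | Provides large amounts of small-molecule activity data for many kinds of proteins, including binding affinity information for the interactions |
| KEGG DRUG[11] | https://www.genome.jp/kegg/drug | A collection of genomes and biological pathways, including information on diseases, drugs and compounds |
| ChEMBL[12] | https://www.ebi.ac.uk/chembl | Contains detailed information on bioactive molecules with drug-like properties and provides bioactivity data against drug targets |
| STITCH[13] | https://stitch.embl.de | Stores interaction information between proteins and small molecules, collected from other databases and the literature |
| ZINC[14] | https://zinc.docking.org | Provides purchasing information, targets, clinical trials and related information for compounds, and includes a target prediction function |
| DGIdb[15] | https://dgidb.genome.wustl.edu/ | Contains drug-gene interaction information; interacting drugs can be found by entering a gene, and interacting genes by entering a drug |
| BRENDA[16] | https://www.brenda-enzymes.org | A comprehensive enzyme database containing a large number of enzymes and the corresponding enzyme-ligand information |
| UniProt[17] | https://www.uniprot.org | A protein database containing information on protein sequences and their biological functions |
| SIDER[18] | https://sideeffects.embl.de | Integrates data on drugs, targets and drug side effects to give a comprehensive view of drug actions and adverse reactions |

Table 1  Commonly used databases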
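| Tool | Link | Description |
| --- | --- | --- |
| CDK[24] | http://cdk.github.io/ | The software should be installed under Linux; can compute 16 types of molecular fingerprints |
| PaDEL[25] | http://www.yapcwsoft.com/dd/padeldescriptor | Software for calculating molecular descriptors and fingerprints; can compute 12 types of fingerprints |
| RDKit[26] | http://www.rdkit.org | A toolkit that generates various descriptors for compounds; runs on a range of operating systems |
| ChemDes[27] | http://www.scbdd.com/chemdes/list-fingerprints/ | A web platform providing format conversion, descriptor calculation, fingerprint generation, similarity calculation and related functions |
| Rcpi[28] | http://bioconductor.org/packages/release/bioc/html/Rcpi.html | Used for complex representation of drugs, proteins and their interactions; computes various chemical, physicochemical and structural descriptors |
| PyDPI[29] | http://sourceforge.net/projects/pydpicao/ | Designed for DTI work; computes molecular descriptors of drugs and structural and physicochemical properties of proteins |

Table 2  Tools for computing drug and protein descriptors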
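As a rough illustration of how fingerprints such as those produced by the tools in Table 2 can be turned into drug-drug similarities, the sketch below compares toy fingerprints with the Tanimoto coefficient. The drug names and bit positions are made-up values, and Tanimoto similarity is one common choice assumed here for illustration rather than a method taken from the article; in practice the fingerprints would come from a toolkit such as RDKit or PaDEL.

```python
def tanimoto(fp_a, fp_b):
    """Tanimoto coefficient of two fingerprints given as sets of on-bit indices."""
    if not fp_a and not fp_b:
        return 0.0  # convention assumed here for two empty fingerprints
    shared = len(fp_a & fp_b)
    return shared / (len(fp_a) + len(fp_b) - shared)


def similarity_matrix(fps):
    """Return sorted drug names and the pairwise Tanimoto similarity matrix."""
    names = sorted(fps)
    matrix = [[tanimoto(fps[a], fps[b]) for b in names] for a in names]
    return names, matrix


# Hypothetical fingerprints: indices of the bits set to 1 (toy values, not real molecules).
fingerprints = {
    "drug_A": {1, 4, 7, 9},
    "drug_B": {1, 4, 8},
    "drug_C": {2, 3},
}

names, sim = similarity_matrix(fingerprints)
for name, row in zip(names, sim):
    print(name, [round(value, 2) for value in row])
```

For example, drug_A and drug_B share two on bits out of five distinct ones, so their similarity is 2 / (4 + 3 - 2) = 0.4.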
Figure 2  Binary label matrix R
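A binary label matrix of the kind named in Figure 2 can be sketched as follows, assuming the usual convention that rows index drugs, columns index targets, and an entry of 1 marks a known interaction while 0 marks a pair with no recorded interaction. The drug and target identifiers and the interaction pairs are hypothetical placeholders, not data from the article.

```python
def build_label_matrix(drugs, targets, known_pairs):
    """Return R as a list of rows, with R[i][j] = 1 if (drugs[i], targets[j]) is a known interaction."""
    drug_index = {drug: i for i, drug in enumerate(drugs)}
    target_index = {target: j for j, target in enumerate(targets)}
    R = [[0] * len(targets) for _ in drugs]
    for drug, target in known_pairs:
        R[drug_index[drug]][target_index[target]] = 1
    return R


# Hypothetical identifiers and interactions, for illustration only.
drugs = ["D1", "D2", "D3"]
targets = ["T1", "T2", "T3", "T4"]
known_pairs = [("D1", "T2"), ("D2", "T1"), ("D2", "T4"), ("D3", "T3")]

R = build_label_matrix(drugs, targets, known_pairs)
for drug, row in zip(drugs, R):
    print(drug, row)

# Count the known interactions and the fraction of all drug-target pairs they cover.
positives = sum(sum(row) for row in R)
density = positives / (len(drugs) * len(targets))
print("known interactions:", positives, "density:", round(density, 3))
```

In this convention a 0 usually means "not observed" rather than "confirmed non-interacting", which is worth keeping in mind when such a matrix is used as training labels.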
[1] Adams C P, Brantner V V. Estimating the cost of new drug development: is it really $802 million? Health Affairs, 2006, 25(2): 420-428.
doi: 10.1377/hlthaff.25.2.420
[2] Chen S C, Zhu Y L, Zhang D Q, et al. Feature extraction approaches based on matrix pattern: MatPCA and MatFLDA. Pattern Recognition Letters, 2005, 26(8): 1157-1167.
doi: 10.1016/j.patrec.2004.10.009
[3] Dejori M, Schuermann B, Stetter M. Hunting drug targets by systems-level modeling of gene expression profiles. IEEE Transactions on Nanobioscience, 2004, 3(3): 180-191.
doi: 10.1109/TNB.2004.833690
[4] Russ A P, Lampel S. The druggable genome: an update. Drug Discovery Today, 2005, 10(23-24): 1607-1610.
doi: 10.1016/S1359-6446(05)03666-4
[5] Li Z P, Wang R S, Zhang X S. Two-stage flux balance analysis of metabolic networks for drug target identification. BMC Systems Biology, 2011, 5(Suppl 1): S11.
[6] Chatr-Aryamontri A, Ceol A, Palazzi L M, et al. MINT: the molecular INTeraction database. Nucleic Acids Research, 2007, 35(Database): D572-D574.
[7] Wishart D S, Knox C, Guo A C, et al. DrugBank: a comprehensive resource for in silico drug discovery and exploration. Nucleic Acids Research, 2006, 34(suppl_1): D668-D672.
doi: 10.1093/nar/gkj067
[8] Kim S, Thiessen P A, Bolton E E, et al. PubChem substance and compound databases. Nucleic Acids Research, 2015, 44(D1): D1202-D1213.
doi: 10.1093/nar/gkv951
[9] Chen X, Ji Z L, Chen Y Z. TTD: therapeutic target database. Nucleic Acids Research, 2002, 30(1): 412-415.
[10] Liu T Q, Lin Y, Wen X, et al. BindingDB: a web-accessible database of experimentally determined protein-ligand binding affinities. Nucleic Acids Research, 2006, 35(suppl_1): D198-D201.
[11] Kanehisa M, Furumichi M, Tanabe M, et al. KEGG: new perspectives on genomes, pathways, diseases and drugs. Nucleic Acids Research, 2016, 45(D1): D353-D361.
doi: 10.1093/nar/gkw1092
[12] Gaulton A, Bellis L J, Bento A P, et al. ChEMBL: a large-scale bioactivity database for drug discovery. Nucleic Acids Research, 2011, 40(D1): D1100-D1107.
[13] Szklarczyk D, Santos A, von Mering C, et al. STITCH 5: augmenting protein-chemical interaction networks with tissue and affinity data. Nucleic Acids Research, 2015, 44(D1): D380-D384.
doi: 10.1093/nar/gkv1277
[14] Sterling T, Irwin J J. ZINC 15-ligand discovery for everyone. Journal of Chemical Information and Modeling, 2015, 55(11): 2324-2337.
doi: 10.1021/acs.jcim.5b00559 pmid: 26479676
[15] Cotto K C, Wagner A H, Feng Y Y, et al. DGIdb 3.0: a redesign and expansion of the drug-gene interaction database. Nucleic Acids Research, 2018, 46(D1): D1068-D1073.
[16] Schomburg I, Chang A, Ebeling C, et al. BRENDA, the enzyme database: updates and major new developments. Nucleic Acids Research, 2004, 32(suppl_1): D431-D433.
[17] The UniProt Consortium. UniProt: a hub for protein information. Nucleic Acids Research, 2015, 43(Database issue): D204-D212.
doi: 10.1093/nar/gku989
[18] Kuhn M, Letunic I, Jensen L J, et al. The SIDER database of drugs and side effects. Nucleic Acids Research, 2016, 44(D1): D1075-D1079.
[19] Pozzan A. Molecular descriptors and methods for ligand based virtual high throughput screening in drug discovery. Current Pharmaceutical Design, 2006, 12(17): 2099-2110.
doi: 10.2174/138161206777585247
[20] Chen I J, Hubbard R E. Lessons for fragment library design: analysis of output from multiple screening campaigns. Journal of Computer-Aided Molecular Design, 2009, 23(8): 603-620.
doi: 10.1007/s10822-009-9280-5 pmid: 19495994
[21] Feng H W, Zhang L, Li S M, et al. Predicting the reproductive toxicity of chemicals using ensemble learning methods and molecular fingerprints. Toxicology Letters, 2021, 340: 4-14.
doi: 10.1016/j.toxlet.2021.01.002
[22] Batista J, Godden J W, Bajorath J. Assessment of molecular similarity from the analysis of randomly generated structural fragment populations. Journal of Chemical Information and Modeling, 2006, 46(5): 1937-1944.
doi: 10.1021/ci0601261 pmid: 16995724
[23] Biasini M, Bienert S, Waterhouse A, et al. SWISS-MODEL: modelling protein tertiary and quaternary structure using evolutionary information. Nucleic Acids Research, 2014, 42(Web Server issue): W252-W258.
doi: 10.1093/nar/gku340
[24] Steinbeck C, Han Y Q, Kuhn S, et al. The chemistry development kit (CDK): an open-source Java library for chemo- and bioinformatics. ChemInform, 2003, 34(21): 493-500.
[25] Yap C W. PaDEL-descriptor: an open source software to calculate molecular descriptors and fingerprints. Journal of Computational Chemistry, 2011, 32(7): 1466-1474.
doi: 10.1002/jcc.21707
[26] Lovrić M, Molero J M, Kern R. PySpark and RDKit: moving towards big data in cheminformatics. Molecular Informatics, 2019, 38(6): 1800082.
doi: 10.1002/minf.201800082
[27] Dong J, Cao D S, Miao H Y, et al. ChemDes: an integrated web-based platform for molecular descriptor and fingerprint computation. Journal of Cheminformatics, 2015, 7: 60.
doi: 10.1186/s13321-015-0109-z pmid: 26664458
[28] Cao D S, Xiao N, Xu Q S, et al. Rcpi: R/Bioconductor package to generate various descriptors of proteins, compounds and their interactions. Bioinformatics, 2014, 31(2): 279-281.
doi: 10.1093/bioinformatics/btu624
[29] Cao D S, Liang Y Z, Yan J, et al. PyDPI: freely available Python package for chemoinformatics, bioinformatics, and chemogenomics studies. Journal of Chemical Information and Modeling, 2013, 53(11): 3086-3096.
doi: 10.1021/ci400127q
[30] Johnson M, Maggiora G. Concepts and applications of molecular similarity. New York: Wiley Interscience, 1990.
[31] González-Díaz H, Prado-Prado F, García-Mera X, et al. MIND-BEST: web server for drugs and target discovery; design, synthesis, and assay of MAO-B inhibitors and theoretical-experimental study of G3PDH protein from Trichomonas gallinae. Journal of Proteome Research, 2011, 10(4): 1698-1718.
doi: 10.1021/pr101009e pmid: 21184613
[32] Shoichet B K, Kuntz I D, Bodian D L. Molecular docking using shape descriptors. Journal of Computational Chemistry, 1992, 13(3): 380-397.
doi: 10.1002/jcc.540130311
[33] Chen X, Liu X E, Wu J. Research progress on drug representation learning. Journal of Tsinghua University (Science and Technology), 2020(2): 171-180.
[34] Pedregosa F, Varoquaux G, Gramfort A, et al. Scikit-learn: machine learning in Python. Journal of Machine Learning Research, 2011, 12: 2825-2830.
[35] Quinlan J R. Induction of decision trees. Machine Learning, 1986, 1(1): 81-106.
[36] Deb K, Pratap A, Agarwal S, et al. A fast and elitist multiobjective genetic algorithm: NSGA-II. IEEE Transactions on Evolutionary Computation, 2002, 6(2): 182-197.
doi: 10.1109/4235.996017
[37] Mountrakis G, Im J, Ogole C. Support vector machines in remote sensing: a review. ISPRS Journal of Photogrammetry and Remote Sensing, 2011, 66(3): 247-259.
doi: 10.1016/j.isprsjprs.2010.11.001
[38] Biau G. Analysis of a random forests model. Journal of Machine Learning Research, 2012, 13: 1063-1095.
[39] Peduzzi P, Concato J, Kemper E, et al. A simulation study of the number of events per variable in logistic regression analysis. Journal of Clinical Epidemiology, 1996, 49(12): 1373-1379.
doi: 10.1016/s0895-4356(96)00236-3 pmid: 8970487
[40] Srivastava N, Hinton G E, Krizhevsky A, et al. Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research, 2014, 15(1): 1929-1958.
[41] Wu Z R, Li W H, Liu G X, et al. Network-based methods for prediction of drug-target interactions. Frontiers in Pharmacology, 2018, 9: 1134.
doi: 10.3389/fphar.2018.01134
[42] Zeng X X, Zhu S Y, Liu X R, et al. deepDR: a network-based deep learning approach to in silico drug repositioning. Bioinformatics, 2019, 35(24): 5191-5198.
doi: 10.1093/bioinformatics/btz418
[43] Zhang R L, Ding Y R. Identification of key features of CNS drugs based on SVM and greedy algorithm. Current Computer-Aided Drug Design, 2020, 16(6): 725-733.
doi: 10.2174/1573409915666191212095340
[44] Madhukar N S, Khade P K, Huang L, et al. A Bayesian machine learning approach for drug target identification using diverse data types. Nature Communications, 2019, 10: 5221.
doi: 10.1038/s41467-019-12928-6 pmid: 31745082
[45] Mahmud S M H, Chen W Y, Liu Y S, et al. PreDTIs: prediction of drug-target interactions based on multiple feature information using gradient boosting framework with data balancing and feature selection techniques. Briefings in Bioinformatics, 2021, 22(5): bbab046.
doi: 10.1093/bib/bbab046
[46] Piazza I, Beaton N, Bruderer R, et al. A machine learning-based chemoproteomic approach to identify drug targets and binding sites in complex proteomes. Nature Communications, 2020, 11: 4200.
doi: 10.1038/s41467-020-18071-x
[47] Chu Y Y, Kaushik A C, Wang X G, et al. DTI-CDF: a cascade deep forest model towards the prediction of drug-target interactions based on hybrid features. Briefings in Bioinformatics, 2021, 22(1): 451-462.
doi: 10.1093/bib/bbz152
[48] Li Y, Liu X Z, You Z H, et al. A computational approach for predicting drug-target interactions from protein sequence and drug substructure fingerprint information. International Journal of Intelligent Systems, 2021, 36(1): 593-609.
doi: 10.1002/int.22332
[49] Sachdev K, Gupta M K. A comprehensive review of feature based methods for drug target interaction prediction. Journal of Biomedical Informatics, 2019, 93: 103159.
doi: 10.1016/j.jbi.2019.103159
[50] Li X Y, Li W K, Zeng M, et al. Network-based methods for predicting essential genes or proteins: a survey. Briefings in Bioinformatics, 2020, 21(2): 566-583.
doi: 10.1093/bib/bbz017
[51] Huang K, Xiao C, Glass L M, et al. SkipGNN: predicting molecular interactions with skip-graph networks. Scientific Reports, 2020, 10: 21092.
doi: 10.1038/s41598-020-77766-9
[52] Parvizi P, Azuaje F, Theodoratou E, et al. A network-based embedding method for drug-target interaction prediction. Annual International Conference of the IEEE Engineering in Medicine and Biology Society, 2020, 2020: 5304-5307.
[53] Yue Y, He S. DTI-HeNE: a novel method for drug-target interaction prediction based on heterogeneous network embedding. BMC Bioinformatics, 2021, 22(1): 418.
doi: 10.1186/s12859-021-04327-w
[54] Wan F P, Hong L X, Xiao A, et al. NeoDTI: neural integration of neighbor information from a heterogeneous network for discovering new drug-target interactions. Bioinformatics, 2018, 35(1): 104-111.
doi: 10.1093/bioinformatics/bty543
[55] Mohamed S K, Nováček V, Nounu A. Discovering protein drug targets using knowledge graph embeddings. Bioinformatics, 2019, 36(2): 603-610.
[56] Shang Y F, Gao L, Zou Q, et al. Prediction of drug-target interactions based on multi-layer network representation learning. Neurocomputing, 2021, 434: 80-89.
doi: 10.1016/j.neucom.2020.12.068
[57] Zhao T Y, Hu Y, Valsdottir L R, et al. Identifying drug-target interactions based on graph convolutional network and deep neural network. Briefings in Bioinformatics, 2020, 22(2): 2141-2150.
doi: 10.1093/bib/bbaa044
[58] Xu X, Xuan P, Zhang T, et al. Inferring drug-target interactions based on random walk and convolutional neural network. IEEE/ACM Transactions on Computational Biology and Bioinformatics, 2021.
doi: 10.1109/TCBB.2021.3066813
[59] Lee D D, Seung H S. Learning the parts of objects by non-negative matrix factorization. Nature, 1999, 401(6755): 788-791.
doi: 10.1038/44565
[60] Stokes J M, Yang K, Swanson K, et al. A deep learning approach to antibiotic discovery. Cell, 2020, 180(4): 688-702.e13.
doi: 10.1016/j.cell.2020.01.021
[61] Meng Y J, Jin M, Tang X F, et al. Drug repositioning based on similarity constrained probabilistic matrix factorization: COVID-19 as a case study. Applied Soft Computing, 2021, 103: 107135.
doi: 10.1016/j.asoc.2021.107135
[62] Bagherian M, Kim R B, Jiang C, et al. Coupled matrix-matrix and coupled tensor-matrix completion methods for predicting drug-target interactions. Briefings in Bioinformatics, 2020, 22(2): 2161-2171.
doi: 10.1093/bib/bbaa025 pmid: 32186716
[63] Yang M Y, Wu G Y, Zhao Q C, et al. Computational drug repositioning based on multi-similarities bilinear matrix factorization. Briefings in Bioinformatics, 2020, 22(4): bbaa267.
doi: 10.1093/bib/bbaa267
[64] Ceddia G, Pinoli P, Ceri S, et al. Matrix factorization-based technique for drug repurposing predictions. IEEE Journal of Biomedical and Health Informatics, 2020, 24(11): 3162-3172.
doi: 10.1109/JBHI.2020.2991763
[65] Hao M, Bryant S H, Wang Y. Predicting drug-target interactions by dual-network integrated logistic matrix factorization. Scientific Reports, 2017, 7: 40376.
doi: 10.1038/srep40376
[66] Wang M H, Tang C, Chen J J. Drug-target interaction prediction via dual Laplacian graph regularized matrix completion. BioMed Research International, 2018, 2018: 1425608.
[67] Peng Y H, Gao P P, Shi L, et al. Central and peripheral metabolic defects contribute to the pathogenesis of Alzheimer’s disease: targeting mitochondria for diagnosis and prevention. Antioxidants & Redox Signaling, 2020, 32(16): 1188-1236.
[68] Hao J J, Shen W L, Tian C, et al. Mitochondrial nutrients improve immune dysfunction in the type 2 diabetic Goto-Kakizaki rats. Journal of Cellular and Molecular Medicine, 2009, 13(4): 701-711.
doi: 10.1111/j.1582-4934.2008.00342.x