
Recent Submissions

  • Identifying Novel Drug Targets by iDTPnd: A Case Study of Kinase Inhibitors.

    Naveed, Hammad; Reglin, Corinna; Schubert, Thomas; Gao, Xin; Arold, Stefan T.; Maitland, Michael L (Genomics, proteomics & bioinformatics, Elsevier BV, 2021-04-01) [Article]
    Current FDA-approved kinase inhibitors cause diverse adverse effects, some of which are due to the mechanism-independent effects of these drugs. Identifying these mechanism-independent interactions could improve drug safety and support drug repurposing. We have developed iDTPnd (integrated Drug Target Predictor with negative dataset), a computational approach for large-scale discovery of novel targets for known drugs. For a given drug, we construct a positive and a negative structural signature that captures the weakly conserved structural features of drug binding sites. To facilitate assessment of unintended targets, iDTPnd also provides a docking-based interaction score and its statistical significance. We were able to confirm the interaction of sorafenib, imatinib, dasatinib, sunitinib, and pazopanib with their known targets at a sensitivity and specificity of 52% and 55%, respectively. We have validated 10 predicted novel targets by using in vitro experiments. Our results suggest that proteins other than kinases, such as nuclear receptors, cytochrome P450, or MHC Class I molecules can also be physiologically relevant targets of kinase inhibitors. Our method is general and broadly applicable for the identification of protein-small molecule interactions, when sufficient drug-target 3D data are available. The code for constructing the structural signature is available at
  • Analysis of the effects of related fingerprints on molecular similarity using an eigenvalue entropy approach

    Kuwahara, Hiroyuki; Gao, Xin (Journal of Cheminformatics, Springer Nature, 2021-03-23) [Article]
    Two-dimensional (2D) chemical fingerprints are widely used as binary features for the quantification of structural similarity of chemical compounds, which is an important step in similarity-based virtual screening (VS). Here, using an eigenvalue-based entropy approach, we identified 2D fingerprints with little to no contribution to shaping the eigenvalue distribution of the feature matrix as related ones and examined the degree to which these related 2D fingerprints influenced molecular similarity scores calculated with the Tanimoto coefficient. Our analysis identified many related fingerprints in publicly available fingerprint schemes and showed that their presence in the feature set could have substantial effects on the similarity scores and bias the outcome of molecular similarity analysis. Our results have implications for the optimal selection of 2D fingerprints for compound similarity analysis and the identification of potential hits for compounds with target biological activity in VS.
  • Radiomics of Tumor Heterogeneity in Longitudinal Dynamic Contrast-Enhanced Magnetic Resonance Imaging for Predicting Response to Neoadjuvant Chemotherapy in Breast Cancer

    Fan, Ming; Chen, Hang; You, Chao; Liu, Li; Gu, Yajia; Peng, Weijun; Gao, Xin; Li, Lihua (Frontiers in Molecular Biosciences, Frontiers Media SA, 2021-03-22) [Article]
    Breast tumor morphological and vascular characteristics can be changed during neoadjuvant chemotherapy (NACT). The early changes in tumor heterogeneity can be quantitatively modeled by longitudinal dynamic contrast-enhanced magnetic resonance imaging (DCE-MRI), which is useful in predicting responses to NACT in breast cancer. In this retrospective analysis, 114 female patients with unilateral unifocal primary breast cancer who received NACT were included in a development dataset (n = 61) and a testing dataset (n = 53). DCE-MRI was performed for each patient before and after treatment (two cycles of NACT) to generate baseline and early follow-up images, respectively. Feature-level changes (delta) of the entire tumor were evaluated by calculating the relative net feature change (deltaRAD) between baseline and follow-up images. The voxel-level change inside the tumor was evaluated, which yielded a Jacobian map by registering the follow-up image to the baseline image. Clinical information and the radiomic features were fused to enhance the predictive performance. The area under the curve (AUC) values were assessed to evaluate the prediction performance. Predictive models using radiomics based on pre- and post-treatment images, Jacobian maps and deltaRAD showed AUC values of 0.568, 0.767, 0.630 and 0.726, respectively. When features from these images were fused, the predictive model generated an AUC value of 0.771. After adding the molecular subtype information in the fused model, the performance was increased to an AUC of 0.809 (sensitivity of 0.826 and specificity of 0.800), which is significantly higher than that of the baseline imaging- and Jacobian map-based predictive models (p = 0.028 and 0.019, respectively). The level of tumor heterogeneity reduction (evaluated by texture feature) is higher in the NACT responders than in the nonresponders. The results suggested that changes in DCE-MRI features that reflect a reduction in tumor heterogeneity following NACT could provide early prediction of breast tumor response. The prediction was improved when the molecular subtype information was combined into the model.
  • HMD-ARG: hierarchical multi-task deep learning for annotating antibiotic resistance genes

    Li, Yu; Xu, Zeling; Han, Wenkai; Cao, Huiluo; Umarov, Ramzan; Yan, Aixin; Fan, Ming; Chen, Huan; Duarte, Carlos M.; Li, Lihua; Ho, Pak-Leung; Gao, Xin (Microbiome, Springer Nature, 2021-02-08) [Article]
    Background: The spread of antibiotic resistance has become one of the most urgent threats to global health, which is estimated to cause 700,000 deaths each year globally. Its surrogates, antibiotic resistance genes (ARGs), are highly transmittable between food, water, animal, and human to mitigate the efficacy of antibiotics. Accurately identifying ARGs is thus an indispensable step to understanding the ecology and transmission of ARGs between environmental and human-associated reservoirs. Unfortunately, the previous computational methods for identifying ARGs are mostly based on sequence alignment, which cannot identify novel ARGs, and their applications are limited by currently incomplete knowledge about ARGs. Results: Here, we propose an end-to-end Hierarchical Multi-task Deep learning framework for ARG annotation (HMD-ARG). Taking raw sequence encoding as input, HMD-ARG can identify, without querying against existing sequence databases, multiple ARG properties simultaneously, including if the input protein sequence is an ARG, and if so, what antibiotic family it is resistant to, what resistant mechanism the ARG takes, and if the ARG is an intrinsic one or acquired one. In addition, if the predicted antibiotic family is beta-lactamase, HMD-ARG further predicts the subclass of beta-lactamase that the ARG is resistant to. Comprehensive experiments, including cross-fold validation, third-party dataset validation in human gut microbiota, wet-experimental functional validation, and structural investigation of predicted conserved sites, demonstrate not only the superior performance of our method over the state-of-the-art methods, but also the effectiveness and robustness of the proposed method. Conclusions: We propose a hierarchical multi-task method, HMD-ARG, which is based on deep learning and can provide detailed annotations of ARGs from three important aspects: resistant antibiotic class, resistant mechanism, and gene mobility. We believe that HMD-ARG can serve as a powerful tool to identify antibiotic resistance genes and therefore mitigate their global threat. Our method and the constructed database are available at
  • Introduction of Progress in Education under Recent Technology Revolution

    Li, Chengyan; Gao, Xin; Sun, Qingquan (Mobile Networks and Applications, Springer Nature, 2021-02-05) [Article]
    The new technology revolution is driving rapid progress in pedagogy. The popularization of cloud computing, edge computing and 5G networks provides a new opportunity for the development of mobile education, making lifelong and fragmented education possible. Meanwhile, the worldwide use of artificial intelligence (AI) and big data technology is changing the research domain of educational technology and will bring novel studies in individualized education and teaching reform. Many scientific and technical problems remain, for example the use of the edge computing pattern and multi-modal information in education, the application of AI and big data technology in teaching reform, image information extraction from users’ handwritten manuscripts, and the stability of cloud platforms for mobile education. Emerging methods that can improve the efficiency of this domain are also welcome.
  • Predicting Entropy and Heat Capacity of Hydrocarbons using Machine Learning

    Aldosari, Mohammed; Yalamanchi, Kiran K.; Gao, Xin; Sarathy, Mani (Energy and AI, Elsevier BV, 2021-02) [Article]
    Chemical substances are essential in all aspects of human life, and understanding their properties is essential for developing chemical systems. The properties of chemical species can be accurately obtained by experiments or ab initio computational calculations; however, these are time-consuming and costly. In this work, machine learning models (ML) for estimating entropy, S, and constant pressure heat capacity, Cp, at 298.15 K, are developed for alkanes, alkenes, and alkynes. The training data for entropy and heat capacity are collected from the literature. Molecular descriptors generated using alvaDesc software are used as input features for the ML models. Support vector regression (SVR), v-support vector regression (v-SVR), and random forest regression (RFR) algorithms were trained with K-fold cross-validation on two levels. The first level assessed the models' performance, and the second level generated the final models. Between the three ML models chosen, SVR shows better performance on the test dataset. The SVR model was then compared against traditional Benson's group additivity to illustrate the advantages of using the ML model. Finally, a sensitivity analysis is performed to find the most critical descriptors in the property estimations.
  • Stable maintenance of hidden switches as a strategy to increase the gene expression stability

    Kuwahara, Hiroyuki; Gao, Xin (Nature Computational Science, Springer Nature, 2021-01-14) [Article]
    In response to severe genetic and environmental perturbations, wild-type organisms can express hidden alternative phenotypes adaptive to such adverse conditions. While our theoretical understanding of the population-level fitness advantage and evolution of phenotypic switching under variable environments has grown, the mechanism by which these organisms maintain phenotypic switching capabilities under static environments remains to be elucidated. Here, using computational simulations, we analyzed the evolution of gene circuits under natural selection and found that different strategies evolved to increase the gene expression stability near the optimum level. In a population comprising bistable individuals, a strategy of maintaining bistability and raising the potential barrier separating the bistable regimes was consistently taken. Our results serve as evidence that hidden bistable switches can be stably maintained during environmental stasis—an essential property enabling the timely release of adaptive alternatives with small genetic changes in the event of substantial perturbations.
  • Robust and ultrafast fiducial marker correspondence in electron tomography by a two-stage algorithm considering local constraints

    Han, Renmin; Li, Guojun; Gao, Xin (Bioinformatics, Oxford University Press (OUP), 2021-01-08) [Article]
    Motivation: Electron tomography (ET) has become an indispensable tool for structural biology studies. In ET, the tilt series alignment and the projection parameter calibration are the key steps towards high-resolution ultrastructure analysis. Usually, fiducial markers are embedded in the sample to aid the alignment. Despite the advances in developing algorithms to find correspondence of fiducial markers from different tilted micrographs, the error rate of the existing methods is still high such that manual correction has to be conducted. In addition, existing algorithms do not work well when the number of fiducial markers is high. Results: In this paper, we try to completely solve the fiducial marker correspondence problem. We propose to divide the workflow of fiducial marker correspondence into two stages: (i) initial transformation determination, and (ii) local correspondence refinement. In the first stage, we model the transform estimation as a correspondence pair inquiry and verification problem. The local geometric constraints and invariant features are used to reduce the complexity of the problem. In the second stage, we encode the geometric distribution of the fiducial markers by a weighted Gaussian mixture model and introduce drift parameters to correct the effects of beam-induced motion and sample deformation. Comprehensive experiments on real-world datasets demonstrate the robustness, efficiency and effectiveness of the proposed algorithm. In particular, the proposed two-stage algorithm is able to produce an accurate tracking within an average of ≤ ms per image, even for micrographs with hundreds of fiducial markers, which makes the real-time ET data processing possible. Availability: The code is available at . Additionally, the detailed original figures demonstrated in the experiments can be accessed at
  • Transcriptomic analysis identifies organ-specific metastasis genes and pathways across different primary sites.

    Zhang, Lin; Fan, Ming; Napolitano, Francesco; Gao, Xin; Xu, Ying; Li, Lihua (Journal of translational medicine, Springer Nature, 2021-01-08) [Article]
    Background: Metastasis is the most devastating stage of cancer progression and often shows a preference for specific organs. Methods: To reveal the mechanisms underlying organ-specific metastasis, we systematically analyzed gene expression profiles for three common metastasis sites across all available primary origins. A rank-based method was used to detect differentially expressed genes between metastatic tumor tissues and corresponding control tissues. For each metastasis site, the common differentially expressed genes across all primary origins were identified as organ-specific metastasis genes. Results: Pathways enriched by these genes reveal an interplay between the molecular characteristics of the cancer cells and those of the target organ. Specifically, the neuroactive ligand-receptor interaction pathway and HIF-1 signaling pathway were found to have prominent roles in adapting to the target organ environment in brain and liver metastases, respectively. Finally, the identified organ-specific metastasis genes and pathways were validated using a primary breast tumor dataset. Survival and cluster analysis showed that organ-specific metastasis genes and pathways tended to be expressed uniquely by a subgroup of patients having metastasis to the target organ, and were associated with the clinical outcome. Conclusions: Elucidating the genes and pathways underlying organ-specific metastasis may help to identify drug targets and develop treatment strategies to benefit patients.
  • A Siamese neural network model for the prioritization of metabolic disorders by integrating real and simulated data.

    Messa, Gian Marco; Napolitano, Francesco; Elsea, Sarah H; di Bernardo, Diego; Gao, Xin (Bioinformatics (Oxford, England), Oxford University Press (OUP), 2020-12-31) [Article]
    Motivation: Untargeted metabolomic approaches hold great promise as a diagnostic tool for inborn errors of metabolism (IEMs) in the near future. However, the complexity of the involved data makes its application difficult and time consuming. Computational approaches, such as metabolic network simulations and machine learning, could significantly help to exploit metabolomic data to aid the diagnostic process. While the former suffers from limited predictive accuracy, the latter is normally able to generalize only to IEMs for which sufficient data are available. Here, we propose a hybrid approach that exploits the best of both worlds by building a mapping between simulated and real metabolic data through a novel method based on Siamese neural networks (SNN). Results: The proposed SNN model is able to perform disease prioritization for the metabolic profiles of IEM patients even for diseases that it was not trained to identify. To the best of our knowledge, this has not been attempted before. The developed model is able to significantly outperform a baseline model that relies on metabolic simulations only. The prioritization performances demonstrate the feasibility of the method, suggesting that the integration of metabolic models and data could significantly aid the IEM diagnosis process in the near future. Availability and implementation: Metabolic datasets used in this study are publicly available from the cited sources. The original data produced in this study, including the trained models and the simulated metabolic profiles, are also publicly available (Messa et al., 2020).
  • PATHcre8: A Tool That Facilitates the Searching for Heterologous Biosynthetic Routes

    Motwalli, Olaa Amin; Uludag, Mahmut; Mijakovic, Ivan; Alazmi, Meshari; Bajic, Vladimir B.; Gojobori, Takashi; Gao, Xin; Essack, Magbubah (ACS Synthetic Biology, American Chemical Society (ACS), 2020-11-16) [Article]
    Developing computational tools that can facilitate the rational design of cell factories producing desired products at increased yields is challenging, as the tool needs to take into account that the preferred host organism usually has compounds that are consumed by competing reactions that reduce the yield of the desired product. On the other hand, the preferred host organisms may not have the native metabolic reactions needed to produce the compound of interest; thus, the computational tool needs to identify the metabolic reactions that will most efficiently produce the desired product. In this regard, we developed the generic tool PATHcre8 to facilitate an optimized search for heterologous biosynthetic pathway routes. PATHcre8 finds and ranks biosynthesis routes in a large number of organisms, including Cyanobacteria. The tool ranks the pathways based on feature scores that reflect reaction thermodynamics, the potentially toxic products in the pathway (compound toxicity), intermediate products in the pathway consumed by competing reactions (product consumption), and host-specific information such as enzyme copy number. A comparison with several other similar tools shows that PATHcre8 is more efficient in ranking functional pathways. To illustrate the effectiveness of PATHcre8, we further provide case studies focused on isoprene production and the biodegradation of cocaine. PATHcre8 is free for academic and nonprofit users and can be accessed at
  • Few-shot learning for classification of novel macromolecular structures in cryo-electron tomograms

    Li, Ran; Yu, Liangyong; Zhou, Bo; Zeng, Xiangrui; Wang, Zhenyu; Yang, Xiaoyan; Zhang, Jing; Gao, Xin; Jiang, Rui; Xu, Min (PLOS Computational Biology, Public Library of Science (PLoS), 2020-11-11) [Article]
    Cryo-electron tomography (cryo-ET) provides 3D visualization of subcellular components in the near-native state and at sub-molecular resolutions in single cells, demonstrating an increasingly important role in structural biology in situ. However, systematic recognition and recovery of macromolecular structures in cryo-ET data remain challenging as a result of low signal-to-noise ratio (SNR), small sizes of macromolecules, and high complexity of the cellular environment. Subtomogram structural classification is an essential step for such task. Although acquisition of large amounts of subtomograms is no longer an obstacle due to advances in automation of data collection, obtaining the same number of structural labels is both computation and labor intensive. On the other hand, existing deep learning based supervised classification approaches are highly demanding on labeled data and have limited ability to learn about new structures rapidly from data containing very few labels of such new structures. In this work, we propose a novel approach for subtomogram classification based on few-shot learning. With our approach, classification of unseen structures in the training data can be conducted given few labeled samples in test data through instance embedding. Experiments were performed on both simulated and real datasets. Our experimental results show that we can make inference on new structures given only five labeled samples for each class with a competitive accuracy (> 0.86 on the simulated dataset with SNR = 0.1), or even one sample with an accuracy of 0.7644. The results on real datasets are also promising with accuracy > 0.9 on both conditions and even up to 1 on one of the real datasets. Our approach achieves significant improvement compared with the baseline method and has strong capabilities of generalizing to other cellular components.
  • Poly(A)-DG: A deep-learning-based domain generalization method to identify cross-species Poly(A) signal without prior knowledge from target species

    Zheng, Yumin; Wang, Haohan; Zhang, Yang; Gao, Xin; Xing, Eric P.; Xu, Min (PLOS Computational Biology, Public Library of Science (PLoS), 2020-11-05) [Article]
    In eukaryotes, polyadenylation (poly(A)) is an essential process during mRNA maturation. Identifying the cis-determinants of poly(A) signal (PAS) on the DNA sequence is the key to understanding the mechanism of translation regulation and mRNA metabolism. Although machine learning methods have been widely used in computationally identifying PAS, the need for tremendous amounts of annotation data hinders applications of existing methods in species without experimental data on PAS. Therefore, cross-species PAS identification, which makes it possible to predict PAS in untrained species, naturally becomes a promising direction. In this work, we propose a novel deep learning method named Poly(A)-DG for cross-species PAS identification. Poly(A)-DG consists of a Convolution Neural Network-Multilayer Perceptron (CNN-MLP) network and a domain generalization technique. It learns PAS patterns from the training species and identifies PAS in target species without re-training. To test our method, we use three species, build cross-species training sets with two of them, and evaluate the performance on the remaining one. Moreover, we test our method against insufficient data and imbalanced data issues and demonstrate that Poly(A)-DG not only outperforms state-of-the-art methods but also maintains relatively high accuracy when it comes to a smaller or imbalanced training set.
  • RepAHR: an improved approach for de novo repeat identification by assembly of the high-frequency reads.

    Liao, Xingyu; Gao, Xin; Zhang, Xiankai; Wu, Fang-Xiang; Wang, Jianxin (BMC bioinformatics, Springer Nature, 2020-10-19) [Article]
    BACKGROUND: Repetitive sequences account for a large proportion of eukaryotic genomes. Identification of repetitive sequences plays a significant role in many applications, such as structural variation detection and genome assembly. Many existing de novo repeat identification pipelines or tools make use of assembly of the high-frequency k-mers to obtain repeats. However, a certain degree of sequence coverage is required for assemblers to get the desired assemblies. On the other hand, assemblers cut the reads into shorter k-mers for assembly, which may destroy the structure of the repetitive regions. For the above reasons, it is difficult to obtain complete and accurate repetitive regions in the genome by using existing tools. RESULTS: In this study, we present a new method called RepAHR for de novo repeat identification by assembly of the high-frequency reads. Firstly, RepAHR scans next-generation sequencing (NGS) reads to find the high-frequency k-mers. Secondly, RepAHR filters the high-frequency reads from whole NGS reads according to certain rules based on the high-frequency k-mer. Finally, the high-frequency reads are assembled to generate repeats by using SPAdes, which is considered as an outstanding genome assembler with NGS sequences. CONCLUSIONS: We test RepAHR on five data sets, and the experimental results show that RepAHR outperforms RepARK and REPdenovo for detecting repeats in terms of N50, reference alignment ratio, coverage ratio of reference, mask ratio of Repbase and some other metrics.
  • Semantic similarity and machine learning with ontologies.

    Kulmanov, Maxat; Smaili, Fatima Z.; Gao, Xin; Hoehndorf, Robert (Briefings in bioinformatics, Oxford University Press (OUP), 2020-10-13) [Article]
    Ontologies have long been employed in the life sciences to formally represent and reason over domain knowledge and they are employed in almost every major biological database. Recently, ontologies are increasingly being used to provide background knowledge in similarity-based analysis and machine learning models. The methods employed to combine ontologies and machine learning are still novel and actively being developed. We provide an overview over the methods that use ontologies to compute similarity and incorporate them in machine learning methods; in particular, we outline how semantic similarity measures and ontology embeddings can exploit the background knowledge in ontologies and how ontologies can provide constraints that improve machine learning models. The methods and experiments we describe are available as a set of executable notebooks, and we also provide a set of slides and additional resources at
  • A unified linear convergence analysis of k-SVD

    Xu, Zhiqiang; Ke, Yiping; Cao, Xin; Zhou, Chunlai; Wei, Pengfei; Gao, Xin (Memetic Computing, Springer Nature, 2020-10-12) [Article]
    Eigenvector computation, e.g., k-SVD for finding top-k singular subspaces, is often of central importance to many scientific and engineering tasks. There has been resurgent interest recently in analyzing relevant methods in terms of singular value gap dependence. Particularly, when the gap vanishes, the convergence of k-SVD is considered to be capped by a gap-free sub-linear rate. We argue in this work both theoretically and empirically that this is not necessarily the case, refreshing our understanding on this significant problem. Specifically, we leverage the recently proposed structured gap in a careful analysis to establish a unified linear convergence of k-SVD to one of the ground-truth solutions, regardless of what target matrix and how large target rank k are given. Theoretical results are evaluated and verified by experiments on synthetic or real data.
  • Impact of data preprocessing on cell-type clustering based on single-cell RNA-seq data.

    Wang, Chunxiang; Gao, Xin; Liu, Juntao (BMC bioinformatics, Springer Nature, 2020-10-07) [Article]
    BACKGROUND: Advances in single-cell RNA-seq technology have led to great opportunities for the quantitative characterization of cell types, and many clustering algorithms have been developed based on single-cell gene expression. However, we found that different data preprocessing methods show quite different effects on clustering algorithms. Moreover, there is no specific preprocessing method that is applicable to all clustering algorithms, and even for the same clustering algorithm, the best preprocessing method depends on the input data. RESULTS: We designed a graph-based algorithm, SC3-e, specifically for discriminating the best data preprocessing method for SC3, which is currently the most widely used clustering algorithm for single cell clustering. When tested on eight frequently used single-cell RNA-seq data sets, SC3-e always accurately selects the best data preprocessing method for SC3 and therefore greatly enhances the clustering performance of SC3. CONCLUSION: The SC3-e algorithm is practically powerful for discriminating the best data preprocessing method, and therefore largely enhances the performance of cell-type clustering of SC3. It is expected to play a crucial role in the related studies of single-cell clustering, such as the studies of human complex diseases and discoveries of new cell types.
  • Automatic and Interpretable Model for Periodontitis Diagnosis in Panoramic Radiographs

    Li, Haoyang; Zhou, Juexiao; Zhou, Yi; Chen, Jieyu; Gao, Feng; Xu, Ying; Gao, Xin (Springer Nature, 2020-09-29) [Conference Paper]
    Periodontitis is a prevalent and irreversible chronic inflammatory disease both in developed and developing countries, and affects about 20%-50% of the global population. The tool for automatically diagnosing periodontitis is highly demanded to screen at-risk people for periodontitis and its early detection could prevent the onset of tooth loss, especially in local community and health care settings with limited dental professionals. In the medical field, doctors need to understand and trust the decisions made by computational models and developing interpretable machine learning models is crucial for disease diagnosis. Based on these considerations, we propose an interpretable machine learning method called Deetal-Perio to predict the severity degree of periodontitis in dental panoramic radiographs. In our method, alveolar bone loss (ABL), the clinical hallmark for periodontitis diagnosis, could be interpreted as the key feature. To calculate ABL, we also propose a method for teeth numbering and segmentation. First, Deetal-Perio segments and indexes the individual tooth via Mask R-CNN combined with a novel calibration method. Next, Deetal-Perio segments the contour of the alveolar bone and calculates a ratio for individual tooth to represent ABL. Finally, Deetal-Perio predicts the severity degree of periodontitis given the ratios of all the teeth. The entire architecture could not only outperform state-of-the-art methods and show robustness on two data sets in both periodontitis prediction, and teeth numbering and segmentation tasks, but also be interpretable for doctors to understand the reason why Deetal-Perio works so well.
  • AttPNet: Attention-Based Deep Neural Network for 3D Point Set Analysis

    Yang, Yufeng; Ma, Yixiao; Zhang, Jing; Gao, Xin; Xu, Min (Sensors, MDPI AG, 2020-09-23) [Article]
    Point set is a major type of 3D structure representation format characterized by its data availability and compactness. Most former deep learning-based point set models pay equal attention to different point set regions and channels, thus having limited ability in focusing on small regions and specific channels that are important for characterizing the object of interest. In this paper, we introduce a novel model named Attention-based Point Network (AttPNet). It uses attention mechanism for both global feature masking and channel weighting to focus on characteristic regions and channels. There are two branches in our model. The first branch calculates an attention mask for every point. The second branch uses convolution layers to abstract global features from point sets, where channel attention block is adapted to focus on important channels. Evaluations on the ModelNet40 benchmark dataset show that our model outperforms the existing best model in classification tasks by 0.7% without voting. In addition, experiments on augmented data demonstrate that our model is robust to rotational perturbations and missing points. We also design an Electron Cryo-Tomography (ECT) point cloud dataset and further demonstrate our model’s ability in dealing with fine-grained structures on the ECT dataset.
  • Efficient locality-sensitive hashing over high-dimensional streaming data

    Wang, Hao; Yang, Chengcheng; Zhang, Xiangliang; Gao, Xin (Neural Computing and Applications, Springer Nature, 2020-09-17) [Article]
    Approximate nearest neighbor (ANN) search in high-dimensional spaces is fundamental in many applications. Locality-sensitive hashing (LSH) is a well-known methodology to solve the ANN problem. Existing LSH-based ANN solutions typically employ a large number of individual indexes optimized for searching efficiency. Updating such indexes might be impractical when processing high-dimensional streaming data. In this paper, we present a novel disk-based LSH index that offers efficient support for both searches and updates. The contributions of our work are threefold. First, we use the write-friendly LSM-trees to store the LSH projections to facilitate efficient updates. Second, we develop a novel estimation scheme to estimate the number of required LSH functions, with which the disk storage and access costs are effectively reduced. Third, we exploit both the collision number and the projection distance to improve the efficiency of candidate selection, improving the search performance with theoretical guarantees on the result quality. Experiments on four real-world datasets show that our proposal outperforms the state-of-the-art schemes.
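
As a reader aid for the locality-sensitive hashing abstract above, the following is a minimal sketch of the "collision number" idea: nearby points should share a bucket under more hash functions than distant points. It is not the authors' disk-based, LSM-tree-backed index. The hash family (Gaussian random projections with a fixed bucket width), the helper names `make_hash` and `collision_count`, and all parameter values are assumptions chosen only for illustration; the abstract does not specify them, and the projection-distance criterion it mentions is not modeled here.

```python
import math
import random


def make_hash(dim, width, rng):
    """Return one hash h(v) = floor((a . v + b) / width) with Gaussian a.

    Assumption: the standard p-stable (Gaussian) LSH family; the abstract
    does not say which family the authors use.
    """
    a = [rng.gauss(0.0, 1.0) for _ in range(dim)]
    b = rng.uniform(0.0, width)

    def h(v):
        projection = sum(ai * vi for ai, vi in zip(a, v))
        return math.floor((projection + b) / width)

    return h


def collision_count(hashes, u, v):
    """Count the hash functions under which u and v share a bucket."""
    return sum(1 for h in hashes if h(u) == h(v))


rng = random.Random(0)  # fixed seed so the sketch is repeatable
hashes = [make_hash(dim=8, width=4.0, rng=rng) for _ in range(64)]

query = [rng.gauss(0.0, 1.0) for _ in range(8)]
near = [x + 0.01 for x in query]  # tiny perturbation of the query
far = [x + 10.0 for x in query]   # large shift away from the query

near_hits = collision_count(hashes, query, near)
far_hits = collision_count(hashes, query, far)
print(near_hits, far_hits)
```

A candidate-selection step could then keep only points whose collision count with the query exceeds some threshold; how that threshold is chosen, and how it combines with projection distance, is described in the paper rather than here.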
