For more information visit: https://cemse.kaust.edu.sa/cs

Recent Submissions

  • Learning from Scholarly Attributed Graphs for Scientific Discovery

    Akujuobi, Uchenna Thankgod (2020-10-18) [Dissertation]
    Advisor: Zhang, Xiangliang
    Committee members: Moshkov, Mikhail; Hoehndorf, Robert; Zhang, Min
    Research and experimentation in various scientific fields are based on the knowledge and ideas from scholarly literature. The advancement of research and development has, thus, strengthened the importance of literary analysis and understanding. However, in recent years, researchers have been facing a massive volume of scholarly documents published at an exponentially increasing rate. Analyzing this vast number of publications is far beyond the capability of individual researchers. This dissertation is motivated by the need for large-scale analyses of the rapidly growing body of scholarly literature for scientific knowledge discovery. In the first part of this dissertation, the interdependencies between scholarly literature are studied. First, I develop Delve, a data-driven search engine supported by our designed semi-supervised edge classification method. This system enables users to search and analyze the relationship between datasets and scholarly literature. Based on the Delve system, I propose to study information extraction as a node classification problem in attributed networks. Specifically, if we can learn the research topics of documents (nodes in a network), we can aggregate documents by topics and retrieve information specific to each topic (e.g., top-k popular datasets). Node classification in attributed networks has several challenges: a limited number of labeled nodes, effective fusion of topological structure and node/edge attributes, and the co-existence of multiple labels for one node. Existing node classification approaches can only address or partially address a few of these challenges. This dissertation addresses these challenges by proposing semi-supervised multi-class/multi-label node classification models to integrate node/edge attributes and topological relationships. The second part of this dissertation examines the problem of analyzing the interdependencies between terms in scholarly literature.
    I present two algorithms for the automatic hypothesis generation (HG) problem, which refers to the discovery of meaningful implicit connections between scientific terms, including but not limited to diseases, drugs, and genes extracted from databases of biomedical publications. The automatic hypothesis generation problem is modeled as a future connectivity prediction in a dynamic attributed graph. The key is to capture the temporal evolution of node-pair (term-pair) relations. Experimental results and case study analyses highlight the effectiveness of the proposed algorithms compared with extensions of the baselines.
  • Dynamic Programming Multi-Objective Combinatorial Optimization

    Mankowski, Michal (2020-10-18) [Dissertation]
    Advisor: Moshkov, Mikhail
    Committee members: Keyes, David E.; Shihada, Basem; Boros, Endre
    In this dissertation, we consider extensions of dynamic programming for combinatorial optimization. We introduce two exact multi-objective optimization algorithms: the multi-stage optimization algorithm that optimizes the problem relative to the ordered sequence of objectives (lexicographic optimization) and the bi-criteria optimization algorithm that simultaneously optimizes the problem relative to two objectives (Pareto optimization). We also introduce a counting algorithm to count optimal solutions before and after every optimization stage of multi-stage optimization. We propose a fairly universal approach based on so-called circuits without repetitions in which each element is generated exactly one time. Such circuits represent the sets of elements under consideration (the sets of feasible solutions) and are used by counting, multi-stage, and bi-criteria optimization algorithms. For a given optimization problem, we should describe an appropriate circuit and cost functions. Then, we can use the designed algorithms for which we already have proofs of their correctness and ways to evaluate the required number of operations and the time. We construct conventional (which work directly with elements) circuits without repetitions for matrix chain multiplication, global sequence alignment, optimal paths in directed graphs, binary search trees, convex polygon triangulation, line breaking (text justification), one-dimensional clustering, optimal bitonic tour, and segmented least squares. For these problems, we evaluate the number of operations and the time required by the optimization and counting algorithms, and consider the results of computational experiments.
    If we cannot find a conventional circuit without repetitions for a problem, we can either create custom algorithms for optimization and counting from scratch or can transform a circuit with repetitions into a so-called syntactical circuit, which is a circuit without repetitions that works not with elements but with formulas representing these elements. We apply both approaches to the optimization of matchings in trees and apply the second approach to the 0/1 knapsack problem. We also briefly introduce our work in operations research with applications to health care. This work extends our interest in the optimization field from developing new methods included in this dissertation towards the practical application.
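    As a rough illustration of optimizing and counting in the same dynamic program, the sketch below handles matrix chain multiplication, one of the problems named in the abstract. It is a plain memoized recursion, not the circuit-based framework of the dissertation; the function name and the `dims` convention are assumptions made for this example. Because every parenthesization is produced by exactly one top-level split, each optimal solution is counted once.

```python
from functools import lru_cache


def count_optimal_parenthesizations(dims):
    """Return (min_cost, number_of_optimal_parenthesizations).

    Assumed convention: matrix i has shape dims[i] x dims[i + 1],
    so len(dims) must be at least 2 (one matrix or more).
    """
    n = len(dims) - 1

    @lru_cache(maxsize=None)
    def solve(i, j):
        # A single matrix needs no multiplication and has one (empty) plan.
        if i == j:
            return 0, 1
        best, ways = None, 0
        for k in range(i, j):  # split into (i..k) and (k+1..j)
            left_cost, left_ways = solve(i, k)
            right_cost, right_ways = solve(k + 1, j)
            cost = left_cost + right_cost + dims[i] * dims[k + 1] * dims[j + 1]
            if best is None or cost < best:
                best, ways = cost, left_ways * right_ways
            elif cost == best:
                ways += left_ways * right_ways
        return best, ways

    return solve(0, n - 1)
```

    Calling `count_optimal_parenthesizations([10, 30, 5, 60])` returns the cheapest scalar-multiplication count together with how many parenthesizations attain it. A second objective could be optimized over only those cost-optimal solutions, which is the idea behind lexicographic (multi-stage) optimization.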
  • Flexible Cross-Modal Hashing

    Yu, Guoxian; Liu, Xuanwu; Wang, Jun; Domeniconi, Carlotta; Zhang, Xiangliang (IEEE Transactions on Neural Networks and Learning Systems, Institute of Electrical and Electronics Engineers (IEEE), 2020-10-14) [Article]
    Hashing has been widely adopted for large-scale data retrieval in many domains due to its low storage cost and high retrieval speed. Existing cross-modal hashing methods optimistically assume that the correspondence between training samples across modalities is readily available. This assumption is unrealistic in practical applications. In addition, existing methods generally require the same number of samples across different modalities, which restricts their flexibility. We propose a flexible cross-modal hashing approach (FlexCMH) to learn effective hashing codes from weakly paired data, whose correspondence across modalities is partially (or even totally) unknown. FlexCMH first introduces a clustering-based matching strategy to explore the structure of each cluster and, thus, to find the potential correspondence between clusters (and samples therein) across modalities. To reduce the impact of an incomplete correspondence, it jointly optimizes the potential correspondence, the cross-modal hashing functions derived from the correspondence, and a hashing quantitative loss in a unified objective function. An alternative optimization technique is also proposed to coordinate the correspondence and hash functions and reinforce the reciprocal effects of the two objectives. Experiments on public multimodal data sets show that FlexCMH achieves significantly better results than state-of-the-art methods, and it, indeed, offers a high degree of flexibility for practical cross-modal hashing tasks.
  • Semantic similarity and machine learning with ontologies.

    Kulmanov, Maxat; Smaili, Fatima Z.; Gao, Xin; Hoehndorf, Robert (Briefings in bioinformatics, Oxford University Press (OUP), 2020-10-13) [Article]
    Ontologies have long been employed in the life sciences to formally represent and reason over domain knowledge, and they are employed in almost every major biological database. Recently, ontologies are increasingly being used to provide background knowledge in similarity-based analysis and machine learning models. The methods employed to combine ontologies and machine learning are still novel and actively being developed. We provide an overview of the methods that use ontologies to compute similarity and incorporate them in machine learning methods; in particular, we outline how semantic similarity measures and ontology embeddings can exploit the background knowledge in ontologies and how ontologies can provide constraints that improve machine learning models. The methods and experiments we describe are available as a set of executable notebooks, and we also provide a set of slides and additional resources at https://github.com/bio-ontology-research-group/machine-learning-with-ontologies.
  • FAME: 3D Shape Generation via Functionality-Aware Model Evolution

    Guan, Yanran; Liu, Han; Liu, Kun; Yin, Kangxue; Hu, Ruizhen; vanKaick, Oliver; Zhang, Yan; Yumer, Ersin; Carr, Nathan; Mech, Radomir; Zhang, Richard (IEEE Transactions on Visualization and Computer Graphics, IEEE, 2020-10-12) [Article]
    We introduce a modeling tool which can evolve a set of 3D objects in a functionality-aware manner. Our goal is for the evolution to generate large and diverse sets of plausible 3D objects for data augmentation, constrained modeling, as well as open-ended exploration to possibly inspire new designs. Starting with an initial population of 3D objects belonging to one or more functional categories, we evolve the shapes through part re-combination to produce generations of hybrids or crossbreeds between parents from the heterogeneous shape collection. Evolutionary selection of offsprings is guided both by a functional plausibility score derived from functionality analysis of shapes in the initial population and user preference, as in a design gallery. Since cross-category hybridization may result in offsprings not belonging to any of the known functional categories, we develop a means for functionality partial matching to evaluate functional plausibility on partial shapes. We show a variety of plausible hybrid shapes generated by our functionality-aware model evolution, which can complement existing datasets as training data and boost the performance of contemporary data-driven segmentation schemes, especially in challenging cases.
  • Impact of data preprocessing on cell-type clustering based on single-cell RNA-seq data.

    Wang, Chunxiang; Gao, Xin; Liu, Juntao (BMC bioinformatics, Springer Science and Business Media LLC, 2020-10-08) [Article]
    BACKGROUND: Advances in single-cell RNA-seq technology have led to great opportunities for the quantitative characterization of cell types, and many clustering algorithms have been developed based on single-cell gene expression. However, we found that different data preprocessing methods show quite different effects on clustering algorithms. Moreover, there is no specific preprocessing method that is applicable to all clustering algorithms, and even for the same clustering algorithm, the best preprocessing method depends on the input data. RESULTS: We designed a graph-based algorithm, SC3-e, specifically for discriminating the best data preprocessing method for SC3, which is currently the most widely used clustering algorithm for single cell clustering. When tested on eight frequently used single-cell RNA-seq data sets, SC3-e always accurately selects the best data preprocessing method for SC3 and therefore greatly enhances the clustering performance of SC3. CONCLUSION: The SC3-e algorithm is practically powerful for discriminating the best data preprocessing method, and therefore largely enhances the performance of cell-type clustering of SC3. It is expected to play a crucial role in the related studies of single-cell clustering, such as the studies of human complex diseases and discoveries of new cell types.
  • Optimal Gradient Compression for Distributed and Federated Learning

    Albasyoni, Alyazeed; Safaryan, Mher; Condat, Laurent; Richtarik, Peter (arXiv, 2020-10-07) [Preprint]
    Communicating information, like gradient vectors, between computing nodes in distributed and federated learning is typically an unavoidable burden, resulting in scalability issues. Indeed, communication might be slow and costly. Recent advances in communication-efficient training algorithms have reduced this bottleneck by using compression techniques, in the form of sparsification, quantization, or low-rank approximation. Since compression is a lossy, or inexact, process, the iteration complexity is typically worsened; but the total communication complexity can improve significantly, possibly leading to large computation time savings. In this paper, we investigate the fundamental trade-off between the number of bits needed to encode compressed vectors and the compression error. We perform both worst-case and average-case analysis, providing tight lower bounds. In the worst-case analysis, we introduce an efficient compression operator, Sparse Dithering, which is very close to the lower bound. In the average-case analysis, we design a simple compression operator, Spherical Compression, which naturally achieves the lower bound. Thus, our new compression schemes significantly outperform the state of the art. We conduct numerical experiments to illustrate this improvement.
  • Multi-typed Objects Multi-view Multi-instance Multi-label Learning

    Yang, Yuanlin; Yu, Guoxian; Wang, Jun; Domeniconi, Carlotta; Zhang, Xiangliang (arXiv, 2020-10-06) [Preprint]
    Multi-typed objects Multi-view Multi-instance Multi-label Learning (M4L) deals with interconnected multi-typed objects (or bags) that are made of diverse instances, represented with heterogeneous feature views and annotated with a set of non-exclusive but semantically related labels. M4L is more general and powerful than the typical Multi-view Multi-instance Multi-label Learning (M3L), which only accommodates single-typed bags and lacks the power to jointly model the naturally interconnected multi-typed objects in the physical world. To tackle this novel and challenging learning task, we develop a joint matrix factorization based solution (M4L-JMF). In particular, M4L-JMF first encodes the diverse attributes and multiple inter(intra)-associations among multi-typed bags into respective data matrices, and then jointly factorizes these matrices into low-rank ones to explore the composite latent representation of each bag and its instances (if any). In addition, it incorporates a dispatch and aggregation term to distribute the labels of bags to individual instances and, in reverse, aggregate the labels of instances to their affiliated bags in a coherent manner. Experimental results on benchmark datasets show that M4L-JMF achieves significantly better results than simple adaptations of existing M3L solutions on this novel problem.
  • Stereo Event-Based Particle Tracking Velocimetry for 3D Fluid Flow Reconstruction

    Wang, Yuanhao; Idoughi, Ramzi; Heidrich, Wolfgang (Springer International Publishing, 2020-10-06) [Conference Paper]
    Existing Particle Imaging Velocimetry techniques require the use of high-speed cameras to reconstruct time-resolved fluid flows. These cameras provide high-resolution images at high frame rates, which generates bandwidth and memory issues. By capturing only changes in the brightness with a very low latency and at low data rate, event-based cameras have the ability to tackle such issues. In this paper, we present a new framework that retrieves dense 3D measurements of the fluid velocity field using a pair of event-based cameras. First, we track particles inside the two event sequences in order to estimate their 2D velocity in the two sequences of images. A stereo-matching step is then performed to retrieve their 3D positions. These intermediate outputs are incorporated into an optimization framework that also includes physically plausible regularizers, in order to retrieve the 3D velocity field. Extensive experiments on both simulated and real data demonstrate the efficacy of our approach.
  • Lower Bounds and Optimal Algorithms for Personalized Federated Learning

    Hanzely, Filip; Hanzely, Slavomir; Horvath, Samuel; Richtarik, Peter (arXiv, 2020-10-05) [Preprint]
    In this work, we consider the optimization formulation of personalized federated learning recently introduced by Hanzely and Richtárik (2020) which was shown to give an alternative explanation to the workings of local SGD methods. Our first contribution is establishing the first lower bounds for this formulation, for both the communication complexity and the local oracle complexity. Our second contribution is the design of several optimal methods matching these lower bounds in almost all regimes. These are the first provably optimal methods for personalized federated learning. Our optimal methods include an accelerated variant of FedProx, and an accelerated variance-reduced version of FedAvg/Local SGD. We demonstrate the practical superiority of our methods through extensive numerical experiments.
  • Temporal Positive-unlabeled Learning for Biomedical Hypothesis Generation via Risk Estimation

    Akujuobi, Uchenna Thankgod; Chen, Jun; Elhoseiny, Mohamed; Spranger, Michael; Zhang, Xiangliang (arXiv, 2020-10-05) [Preprint]
    Understanding the relationships between biomedical terms like viruses, drugs, and symptoms is essential in the fight against diseases. Many attempts have been made to introduce the use of machine learning to the scientific process of hypothesis generation (HG), which refers to the discovery of meaningful implicit connections between biomedical terms. However, most existing methods fail to truly capture the temporal dynamics of scientific term relations and also assume unobserved connections to be irrelevant (i.e., in a positive-negative (PN) learning setting). To break these limits, we formulate this HG problem as a future connectivity prediction task on a dynamic attributed graph via positive-unlabeled (PU) learning. Then, the key is to capture the temporal evolution of node pair (term pair) relations from just the positive and unlabeled data. We propose a variational inference model to estimate the positive prior, and incorporate it in the learning of node pair embeddings, which are then used for link prediction. Experimental results on real-world biomedical term relationship datasets and case study analyses on a COVID-19 dataset validate the effectiveness of the proposed model.
  • Maximizing I/O Bandwidth for Out-of-Core HPC Applications on Homogeneous and Heterogeneous Large-Scale Systems

    Alturkestani, Tariq (2020-09-30) [Dissertation]
    Advisor: Keyes, David E.
    Committee members: Shihada, Basem; Moshkov, Mikhail; Sun, Xian-He
    Out-of-Core simulation systems often produce a massive amount of data that cannot fit on the aggregate fast memory of the compute nodes, and they also need to read these data back for computation. As a result, I/O data movement can be a bottleneck in large-scale simulations. Advances in memory architecture have made it feasible and affordable to integrate hierarchical storage media on large-scale systems, starting from the traditional Parallel File Systems (PFSs) to intermediate fast disk technologies (e.g., node-local and remote-shared NVMe and SSD-based Burst Buffers) and up to CPU main memory and GPU High Bandwidth Memory (HBM). However, while adding additional and faster storage media increases I/O bandwidth, it pressures the CPU, as it becomes responsible for managing and moving data between these layers of storage. Simulation systems are thus vulnerable to being blocked by I/O operations. The Multilayer Buffer System (MLBS) proposed in this research demonstrates a general and versatile method for overlapping I/O with computation that helps to ameliorate the strain on the processors through asynchronous access. The main idea consists in decoupling I/O operations from computational phases using dedicated hardware resources to perform expensive context switches. MLBS monitors I/O traffic in each storage layer, allowing fair utilization of shared resources. By continually prefetching up and down across all hardware layers of the memory and storage subsystems, MLBS transforms the original I/O-bound behavior of evaluated applications and shifts it closer to a memory-bound or compute-bound regime. The evaluation on the Cray XC40 Shaheen-2 supercomputer for a representative I/O-bound application, seismic inversion, shows that MLBS outperforms state-of-the-art PFSs, i.e., Lustre, Data Elevator and DataWarp by 6.06X, 2.23X, and 1.90X, respectively.
    On the IBM-built Summit supercomputer, using 2048 compute nodes equipped with a total of 12288 GPUs, MLBS achieves up to 1.4X performance speedup compared to the reference PFS-based implementation. MLBS is also demonstrated on applications from cosmology and combustion, and on classic out-of-core computational physics and linear algebra routines.
  • Error Compensated Distributed SGD Can Be Accelerated

    Qian, Xun; Richtarik, Peter; Zhang, Tong (arXiv, 2020-09-30) [Preprint]
    Gradient compression is a recent and increasingly popular technique for reducing the communication cost in distributed training of large-scale machine learning models. In this work we focus on developing efficient distributed methods that can work for any compressor satisfying a certain contraction property, which includes both unbiased (after appropriate scaling) and biased compressors such as RandK and TopK. Applied naively, gradient compression introduces errors that either slow down convergence or lead to divergence. A popular technique designed to tackle this issue is error compensation/error feedback. Due to the difficulties associated with analyzing biased compressors, it is not known whether gradient compression with error compensation can be combined with Nesterov's acceleration. In this work, we show for the first time that error compensated gradient compression methods can be accelerated. In particular, we propose and study the error compensated loopless Katyusha method, and establish an accelerated linear convergence rate under standard assumptions. We show through numerical experiments that the proposed method converges with substantially fewer communication rounds than previous error compensated algorithms.
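    For readers unfamiliar with the compressors named in this abstract, the following is a minimal sketch of TopK, RandK (with the d/k scaling that makes it unbiased), and one generic error-feedback step. It is not the error compensated loopless Katyusha method proposed in the paper; the function names and the plain-list vector representation are assumptions chosen to keep the example dependency-free.

```python
import random


def top_k(vec, k):
    """TopK: keep the k largest-magnitude entries and zero the rest (biased)."""
    keep = sorted(range(len(vec)), key=lambda i: abs(vec[i]), reverse=True)[:k]
    out = [0.0] * len(vec)
    for i in keep:
        out[i] = vec[i]
    return out


def rand_k(vec, k, rng):
    """RandK: keep k uniformly chosen entries, scaled by d/k so that the
    compressed vector equals the input in expectation."""
    d = len(vec)
    out = [0.0] * d
    for i in rng.sample(range(d), k):
        out[i] = vec[i] * d / k
    return out


def error_feedback_step(grad, residual, k):
    """Generic error feedback: compress (grad + residual) with TopK and
    carry whatever was dropped into the next round's residual."""
    corrected = [g + e for g, e in zip(grad, residual)]
    sent = top_k(corrected, k)
    new_residual = [c - s for c, s in zip(corrected, sent)]
    return sent, new_residual
```

    In a training loop, each worker would transmit only `sent` and keep `new_residual` locally, so information discarded by the compressor is delayed rather than lost.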
  • AttPNet: Attention-Based Deep Neural Network for 3D Point Set Analysis

    Yang, Yufeng; Ma, Yixiao; Zhang, Jing; Gao, Xin; Xu, Min (Sensors, MDPI AG, 2020-09-23) [Article]
    Point set is a major type of 3D structure representation format characterized by its data availability and compactness. Most former deep learning-based point set models pay equal attention to different point set regions and channels, thus having limited ability in focusing on small regions and specific channels that are important for characterizing the object of interest. In this paper, we introduce a novel model named Attention-based Point Network (AttPNet). It uses an attention mechanism for both global feature masking and channel weighting to focus on characteristic regions and channels. There are two branches in our model. The first branch calculates an attention mask for every point. The second branch uses convolution layers to abstract global features from point sets, where a channel attention block is adapted to focus on important channels. Evaluations on the ModelNet40 benchmark dataset show that our model outperforms the existing best model in classification tasks by 0.7% without voting. In addition, experiments on augmented data demonstrate that our model is robust to rotational perturbations and missing points. We also design an Electron Cryo-Tomography (ECT) point cloud dataset and further demonstrate our model’s ability in dealing with fine-grained structures on the ECT dataset.
  • Alternating maximization: unifying framework for 8 sparse PCA formulations and efficient parallel codes

    Richtarik, Peter; Jahani, Majid; Ahipaşaoğlu, Selin Damla; Takáč, Martin (Optimization and Engineering, Springer Science and Business Media LLC, 2020-09-22) [Article]
    Given a multivariate data set, sparse principal component analysis (SPCA) aims to extract several linear combinations of the variables that together explain the variance in the data as much as possible, while controlling the number of nonzero loadings in these combinations. In this paper we consider 8 different optimization formulations for computing a single sparse loading vector: we employ two norms for measuring variance (L2, L1) and two sparsity-inducing norms (L0, L1), which are used in two ways (constraint, penalty). Three of our formulations, notably the one with L0 constraint and L1 variance, have not been considered in the literature. We give a unifying reformulation which we propose to solve via the alternating maximization (AM) method. We show that AM is equivalent to GPower for all formulations. Besides this, we provide 24 efficient parallel SPCA implementations: 3 codes (multi-core, GPU and cluster) for each of the 8 problems. Parallelism in the methods is aimed at (1) speeding up computations (our GPU code can be 100 times faster than an efficient serial code written in C++), (2) obtaining solutions explaining more variance and (3) dealing with big data problems (our cluster code can solve a 357 GB problem in a minute).
  • Optimal correlation order in super-resolution optical fluctuation microscopy

    Vlasenko, S.; Mikhalychev, A. B.; Karuseichyk, I. L.; Lyakhov, D. A.; Michels, Dominik L.; Mogilevtsev, D. (arXiv, 2020-09-21) [Preprint]
    Here, we show that, contrary to common opinion, super-resolution optical fluctuation microscopy might not lead to ideally infinite super-resolution enhancement as the order of the measured cumulants increases. Using information analysis to estimate error bounds on the determination of point-source positions, we show that the reachable precision per measurement might saturate as the order of the measured cumulants increases in the super-resolution regime. In fact, there is an optimal correlation order beyond which there is practically no improvement for objects of three or more point sources. However, for objects of just two sources, one still has the intuitively expected resolution increase with the cumulant order.
  • Multi-label zero-shot learning with graph convolutional networks

    Ou, Guangjin; Yu, Guoxian; Domeniconi, Carlotta; Lu, Xuequan; Zhang, Xiangliang (Neural Networks, Elsevier BV, 2020-09-21) [Article]
    The goal of zero-shot learning (ZSL) is to build a classifier that recognizes novel categories with no corresponding annotated training data. The typical routine is to transfer knowledge from seen classes to unseen ones by learning a visual-semantic embedding. Existing multi-label zero-shot learning approaches either ignore correlations among labels, suffer from large label combinations, or learn the embedding using only local or global visual features. In this paper, we propose a Graph Convolution Networks based Multi-label Zero-Shot Learning model, abbreviated as MZSL-GCN. Our model first constructs a label relation graph using label co-occurrences and compensates the absence of unseen labels in the training phase by semantic similarity. It then takes the graph and the word embedding of each seen (unseen) label as inputs to the GCN to learn the label semantic embedding, and to obtain a set of inter-dependent object classifiers. MZSL-GCN simultaneously trains another attention network to learn compatible local and global visual features of objects with respect to the classifiers, and thus makes the whole network end-to-end trainable. In addition, the use of unlabeled training data can reduce the bias toward seen labels and boost the generalization ability. Experimental results on benchmark datasets show that our MZSL-GCN competes with state-of-the-art approaches.
  • Decentralized Embedding Framework for Large-Scale Networks

    Imran, Mubashir; Yin, Hongzhi; Chen, Tong; Shao, Yingxia; Zhang, Xiangliang; Zhou, Xiaofang (Springer International Publishing, 2020-09-21) [Conference Paper]
    Network embedding aims to learn vector representations of vertices that preserve both network structures and properties. However, most existing embedding methods fail to scale to large networks. A few frameworks have been proposed by extending existing methods to cope with network embedding on large-scale networks. These frameworks update the global parameters iteratively or compress the network while learning vector representations. Such network embedding schemes inevitably incur either high communication overhead or sub-optimal embedding quality. In this paper, we propose a novel decentralized large-scale network embedding framework called DeLNE. As the name suggests, DeLNE divides a network into smaller partitions and learns vector representations in a distributed fashion, avoiding any unnecessary communication overhead. First, our proposed framework uses Variational Graph Convolution Auto-Encoders to embed the structure and properties of each sub-network. Second, we propose an embedding aggregation mechanism that captures the global properties of each node. Third, we propose an alignment function that reconciles all sub-network embeddings into the same vector space. Due to the parallel nature of DeLNE, it scales well on large clustered environments. Through extensive experimentation on realistic datasets, we show that DeLNE produces high-quality embeddings and outperforms existing large-scale network embedding frameworks in terms of both efficiency and effectiveness.
  • Efficient locality-sensitive hashing over high-dimensional streaming data

    Wang, Hao; Yang, Chengcheng; Zhang, Xiangliang; Gao, Xin (Neural Computing and Applications, Springer Science and Business Media LLC, 2020-09-17) [Article]
    Approximate nearest neighbor (ANN) search in high-dimensional spaces is fundamental in many applications. Locality-sensitive hashing (LSH) is a well-known methodology to solve the ANN problem. Existing LSH-based ANN solutions typically employ a large number of individual indexes optimized for searching efficiency. Updating such indexes might be impractical when processing high-dimensional streaming data. In this paper, we present a novel disk-based LSH index that offers efficient support for both searches and updates. The contributions of our work are threefold. First, we use the write-friendly LSM-trees to store the LSH projections to facilitate efficient updates. Second, we develop a novel estimation scheme to estimate the number of required LSH functions, with which the disk storage and access costs are effectively reduced. Third, we exploit both the collision number and the projection distance to improve the efficiency of candidate selection, improving the search performance with theoretical guarantees on the result quality. Experiments on four real-world datasets show that our proposal outperforms the state-of-the-art schemes.
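    To make the notion of "collision number" in this abstract concrete, here is a minimal sketch of random-projection LSH with collision counting. It illustrates the general technique only; it does not reproduce the paper's LSM-tree-based index or its estimation scheme. The function names, the bucket width `w`, and the number of hash functions `m` are assumptions for this example.

```python
import math
import random


def make_hashes(dim, m, w, seed=0):
    """Draw m hash functions h(v) = floor((a . v + b) / w), with a sampled
    from a standard Gaussian and b uniform in [0, w)."""
    rng = random.Random(seed)
    return [
        ([rng.gauss(0.0, 1.0) for _ in range(dim)], rng.uniform(0.0, w))
        for _ in range(m)
    ]


def signature(vec, hashes, w):
    """Bucket index of vec under every hash function."""
    return [
        math.floor((sum(a_i * x for a_i, x in zip(a, vec)) + b) / w)
        for a, b in hashes
    ]


def collision_count(sig_a, sig_b):
    """Number of hash functions under which two points share a bucket."""
    return sum(1 for x, y in zip(sig_a, sig_b) if x == y)
```

    A candidate-selection step could then keep only the points whose collision count with the query exceeds some threshold, since nearby points tend to share buckets under more hash functions than distant ones.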
  • Spark-based parallel calculation of 3D fourier shell correlation for macromolecule structure local resolution estimation

    Lü, Yongchun; Zeng, Xiangrui; Tian, Xinhui; Shi, Xiao; Wang, Hui; Zheng, Xiaohui; Liu, Xiaodong; Zhao, Xiaofang; Gao, Xin; Xu, Min (BMC Bioinformatics, Springer Science and Business Media LLC, 2020-09-17) [Article]
    BACKGROUND: Resolution estimation is the main evaluation criterion for the reconstruction of macromolecular 3D structure in the field of cryoelectron microscopy (cryo-EM). At present, there are many methods to evaluate the 3D resolution for reconstructed macromolecular structures from Single Particle Analysis (SPA) in cryo-EM and subtomogram averaging (SA) in electron cryotomography (cryo-ET). As global methods, they measure the resolution of the structure as a whole, but they are inaccurate in detecting subtle local changes of reconstruction. In order to detect the subtle changes of reconstruction of SPA and SA, a few local resolution methods have been proposed. The mainstream local resolution evaluation methods are based on local Fourier shell correlation (FSC), which is computationally intensive. However, the existing resolution evaluation methods are based on multi-threading implementation on a single computer with very poor scalability. RESULTS: This paper proposes a new fine-grained 3D array partition method by key-value format in Spark. Our method first converts 3D images to key-value data (K-V). Then the K-V data is used for 3D array partitioning and data exchange in parallel. The Spark-based distributed parallel computing framework can thus solve the above scalability problem. In this distributed computing framework, all 3D local FSC tasks are simultaneously calculated across multiple nodes in a computer cluster. Through the calculation of experimental data, the 3D local resolution evaluation algorithm based on Spark fine-grained 3D array partition has a magnitude change in computing speed compared with the mainstream FSC algorithm under the condition that the accuracy remains unchanged, and has better fault tolerance and scalability. CONCLUSIONS: In this paper, we proposed a K-V format based fine-grained 3D array partition method in Spark to calculate 3D FSC in parallel for getting a 3D local resolution density map.
    The 3D local resolution density map evaluates the three-dimensional density maps reconstructed from single particle analysis and subtomogram averaging. Our proposed method can significantly increase the speed of the 3D local resolution evaluation, which is important for the efficient detection of subtle variations among reconstructed macromolecular structures.
