Projects
User-friendly programming of GPU-enhanced clusters

By developing a simple directive-based programming model, together with a fully automated source-to-source code translator and domain-specific optimizer, we aim to greatly simplify the programming of scientific codes that can run efficiently on accelerator-enhanced computer clusters. The project is motivated by an urgent need among computational scientists for programming methodologies that are easy to use, yet capable of harnessing non-conventional computing resources such as the GPUs that dominate today's HPC field. Building on proof-of-concept work that has already successfully automated C-to-CUDA translation and optimization for the single-GPU scenario and stencil methods, the project aims to extend this success along the following lines:
- improving the newly developed directive-based programming model and its accompanying framework of automated code translation and optimization
- extending to the scenario of multiple GPUs
- extending to the scenario of GPU-accelerated CPU clusters
- tackling a number of real-world scientific codes
The project has the potential to considerably enhance the productivity of computational scientists, letting them focus on the scientific investigations at hand instead of spending precious time painstakingly writing complex codes.
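To make the target concrete: the translator operates on ordinary sequential stencil loop nests. The sketch below is in Python for brevity (the actual framework works on directive-annotated C code) and only illustrates the data-parallel structure that such a translator maps onto GPU threads:

```python
def jacobi_sweep(u):
    """One 5-point Jacobi stencil sweep over the interior of grid u.

    Every interior point is computed from the previous iterate only, so
    all points can be updated independently; this is the property a
    source-to-source translator exploits when mapping the loop nest to
    GPU threads."""
    n = len(u)
    v = [row[:] for row in u]  # boundary values are carried over unchanged
    for i in range(1, n - 1):
        for j in range(1, n - 1):
            v[i][j] = 0.25 * (u[i - 1][j] + u[i + 1][j]
                              + u[i][j - 1] + u[i][j + 1])
    return v
```

In the directive-based model, the programmer keeps writing exactly this kind of sequential loop nest and adds a handful of annotations; parallelization, data transfers, and halo exchanges are generated automatically.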
Funding source:
Research Council of Norway, FRINATEK program
All partners:
- Simula Research Laboratory
- University of California, San Diego (UCSD)
- San Diego Supercomputer Center (SDSC)
- National University of Defense Technology (NUDT)
- SINTEF
Publications for User-friendly programming of GPU-enhanced clusters
Journal Article
Panda: A Compiler Framework for Concurrent CPU+GPU Execution of 3D Stencil Computations on GPU-accelerated Supercomputers
International Journal of Parallel Programming (2016). Status: Published
We present a new compiler framework for truly heterogeneous 3D stencil computation on GPU clusters. Our framework consists of a simple directive-based programming model and a tightly integrated source-to-source compiler. Annotated with a small number of directives, sequential stencil C codes can be automatically parallelized for large-scale GPU clusters. The most distinctive feature of the compiler is its capability to generate hybrid MPI+CUDA+OpenMP code that uses concurrent CPU+GPU computing to unleash the full potential of powerful GPU clusters. The auto-generated hybrid codes hide the overheads of various forms of data motion by overlapping them with computation. Test results on the Titan supercomputer and the Wilkes cluster show that auto-translated codes can achieve about 90% of the performance of highly optimized handwritten codes, for both a simple stencil benchmark and a real-world application in cardiac modeling. The user-friendliness and performance of our domain-specific compiler framework allow harnessing the full power of GPU-accelerated supercomputing without painstaking coding effort.
Affiliation | Scientific Computing |
Project(s) | User-friendly programming of GPU-enhanced clusters, Center for Biomedical Computing (SFF) |
Publication Type | Journal Article |
Year of Publication | 2016 |
Journal | International Journal of Parallel Programming |
Date Published | 10/2016 |
Publisher | ACM/Springer |
Keywords | code generation, code optimisation, CPU+GPU computing, CUDA, heterogeneous computing, MPI, OpenMP, source-to-source translation, stencil computation |
DOI | 10.1007/s10766-016-0454-1 |
Accelerating Detailed Tissue-Scale 3D Cardiac Simulations Using Heterogeneous CPU-Xeon Phi Computing
International Journal of Parallel Programming (2016): 1-23. Status: Published
We investigate heterogeneous computing, which involves both multicore CPUs and manycore Xeon Phi coprocessors, as a new strategy for computational cardiology. In particular, 3D tissues of the human cardiac ventricle are studied with a physiologically realistic model that has 10,000 calcium release units per cell and 100 ryanodine receptors per release unit, together with tissue-scale simulations of the electrical activity and calcium handling. In order to attain resource-efficient use of heterogeneous computing systems that consist of both CPUs and Xeon Phis, we first direct the coding effort at ensuring good performance on the two types of compute devices individually. Although SIMD code vectorization is the main theme of performance programming, the actual implementation details differ considerably between CPU and Xeon Phi. Moreover, in addition to combined OpenMP+MPI programming, a suitable division of the cells between the CPUs and Xeon Phis is important for resource-efficient usage of an entire heterogeneous system. Numerical experiments show that good resource utilization is indeed achieved and that such a heterogeneous simulator paves the way for ultimately understanding the mechanisms of arrhythmia. The uncovered good programming practices can be used by computational scientists who want to adopt similar heterogeneous hardware platforms for a wide variety of applications.
Affiliation | Scientific Computing |
Project(s) | User-friendly programming of GPU-enhanced clusters, Center for Biomedical Computing (SFF) |
Publication Type | Journal Article |
Year of Publication | 2016 |
Journal | International Journal of Parallel Programming |
Pagination | 1-23 |
Date Published | 10/2016 |
Publisher | ACM/Springer |
Keywords | Calcium handling, multiscale cardiac tissue simulation, supercomputing, Xeon Phi |
DOI | 10.1007/s10766-016-0461-2 |
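The division of cells between device types described in the abstract boils down to proportional load balancing: once each device's throughput has been measured, assigning cells in proportion to throughput lets all devices finish a time step at roughly the same moment. A minimal sketch of that idea (device names and rates are hypothetical, not taken from the paper):

```python
def split_cells(total_cells, throughputs):
    """Divide cells among devices in proportion to measured throughput
    (cells per second), so that all devices finish at roughly the same
    time. Illustrative sketch only."""
    total_rate = sum(throughputs.values())
    shares = {dev: int(total_cells * rate / total_rate)
              for dev, rate in throughputs.items()}
    # Give any remainder from integer truncation to the fastest device.
    fastest = max(throughputs, key=throughputs.get)
    shares[fastest] += total_cells - sum(shares.values())
    return shares
```

For example, a coprocessor measured at three times the CPU's cell throughput would receive three quarters of the cells.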
Proceedings, refereed
Enabling Tissue-Scale Cardiac Simulations Using Heterogeneous Computing on Tianhe-2
In IEEE 22nd International Conference on Parallel and Distributed Systems (ICPADS). ACM/IEEE, 2016. Status: Published
We develop a simulator for 3D tissue of the human cardiac ventricle with a physiologically realistic cell model and deploy it on the supercomputer Tianhe-2. In order to attain the full performance of the heterogeneous CPU-Xeon Phi design, we use carefully optimized codes for both devices and combine them to obtain suitable load balancing. Using a large number of nodes, we are able to perform tissue-scale simulations of the electrical activity and calcium handling in millions of cells, at a level of detail that tracks the states of trillions of ryanodine receptors. We can thus simulate arrhythmogenic spiral waves and other complex arrhythmogenic patterns which arise from calcium handling deficiencies in human cardiac ventricle tissue. Due to extensive code tuning and parallelization via OpenMP, MPI, and SCIF/COI, large-scale simulations of 10 heartbeats can be performed in a matter of hours. Test results indicate excellent scalability, thus paving the way for detailed whole-heart simulations on future generations of leadership-class supercomputers.
Affiliation | Scientific Computing |
Project(s) | User-friendly programming of GPU-enhanced clusters, Center for Biomedical Computing (SFF) |
Publication Type | Proceedings, refereed |
Year of Publication | 2016 |
Conference Name | IEEE 22nd International Conference on Parallel and Distributed Systems (ICPADS) |
Pagination | 843-852 |
Date Published | 12/2016 |
Publisher | ACM/IEEE |
ISSN Number | 1521-9097 |
Keywords | Calcium handling, multiscale cardiac tissue simulation, supercomputing, Xeon Phi |
DOI | 10.1109/ICPADS.2016.0114 |
Talks, invited
Heterogeneous HPC solutions in cardiac electrophysiology
In Lawrence Berkeley National Laboratory, Berkeley, CA, USA, 2016. Status: Published
Detailed simulations of electrical signal transmission in the human heart require immense processing power, thereby creating the need for large scale parallel implementations. We present two heterogeneous codes solving such problems, focusing on the interaction between OpenMP, MPI, and CUDA in irregular computations, and discuss practical experiences on different supercomputers.
Affiliation | Scientific Computing |
Project(s) | User-friendly programming of GPU-enhanced clusters, Center for Biomedical Computing (SFF) |
Publication Type | Talks, invited |
Year of Publication | 2016 |
Location of Talk | Lawrence Berkeley National Laboratory, Berkeley, CA, USA |
New Industrial Capabilities Through Embedded Multi-Core Systems

Embedded Multi-Core Systems for Mixed Criticality Applications in Dynamic and Changeable Real-Time Environments (EMC2)
Embedded systems are the key innovation driver for improving almost all mechatronic products with cheaper and even entirely new functionalities. Furthermore, as enablers of inter-system communication, they strongly support today's information society. Consequently, the boundaries between application domains are dissolving, and ad-hoc connections and interoperability play an increasing role. At the same time, multi-core and many-core computing platforms are becoming available on the market and provide a breakthrough for system (and application) integration. This raises a major industrial challenge: the cost-efficient integration of different applications with different levels of safety and security on a single computing platform in an open context.
The objective of the EMC2 project is to foster these changes through an innovative and sustainable service-oriented architecture approach for mixed criticality applications in dynamic and changeable real-time environments. The project focuses on the industrialization of European research outcomes and builds on the results of previous ARTEMIS, European and national projects. It provides the paradigm shift to a new and sustainable system architecture which is suitable to handle open dynamic systems. EMC2 is part of the European Embedded Systems industry strategy to maintain its leading edge position by providing solutions for:
- Dynamic Adaptability in Open Systems
- Utilization of expensive system features only as Service-on-Demand in order to reduce the overall system cost
- Handling of mixed criticality applications under real-time conditions
- Scalability and utmost flexibility
- Full scale deployment and management of integrated tool chains, through the entire lifecycle
Final goal:
To foster cost-efficient integration of different applications with different levels of safety and security on a single computing platform in an open context, through an innovative and sustainable service-oriented architecture approach for mixed criticality applications in dynamic and changeable real-time environments.
Funding source:
- EU Artemis program
- Research Council of Norway
All partners (in Norway):
- Simula
- WesternGeco
- Fornebu Consulting
- UiO
The project in its entirety has about 100 partners throughout Europe.
PREAPP: PRoductivity and Energy-efficiency through Abstraction-based Parallel Programming

In today's exponential world of digital data, big-data services have made power consumption the lion's share of the total cost. For instance, Google's data centres consume almost 260 MW, about a quarter of the output of a nuclear power plant and enough to power 200,000 homes. Energy efficiency is therefore considered a major criterion for "sustainable" computing systems and services over the data deluge. However, energy-efficient computing systems make parallel programming even more complex, and thereby less robust, due to their requirements of massive parallelism, heterogeneity and data locality.
The PREAPP project aims to devise novel programming models that will form foundations for a paradigm shift from energy "blind" to energy "aware" software development. The new models will enable one order of magnitude improvement in energy efficiency in comparison with today's multicore computing, thereby greatly advancing green computing and sustainable services. The new models will facilitate unprecedented productivity for implementing scientific big data applications that run effectively on large-scale high-performance computing (HPC) platforms, which are based on cutting-edge manycore architectures. The threshold of adopting large-scale parallel computing will thus be considerably lowered for a large number of computational scientists in several disciplines.
Funding source:
- Research Council of Norway
- FRINATEK program
All partners:
- Simula Research Laboratory
- University of Tromsø
Publications for PREAPP: PRoductivity and Energy-efficiency through Abstraction-based Parallel Programming
PhD Thesis
The PGAS Programming Model and Mesh Based Computation: an HPC Challenge
In The University of Oslo, 2020. Status: Published
Affiliation | Scientific Computing |
Project(s) | PREAPP: PRoductivity and Energy-efficiency through Abstraction-based Parallel Programming |
Publication Type | PhD Thesis |
Year of Publication | 2020 |
Degree awarding institution | The University of Oslo |
Date Published | June, 2020 |
Journal Article
Performance optimization and modeling of fine-grained irregular communication in UPC
Scientific Programming 2019 (2019): Article ID 6825728. Status: Published
The UPC programming language offers parallelism via logically partitioned shared memory, which typically spans physically disjoint memory sub-systems. One convenient feature of UPC is its ability to automatically execute between-thread data movement, such that the entire content of a shared data array appears to be freely accessible by all the threads. The programmer friendliness, however, can come at the cost of substantial performance penalties. This is especially true when indirectly indexing the elements of a shared array, for which the induced between-thread data communication can be irregular and have a fine-grained pattern. In this paper we study performance enhancement strategies specifically targeting such fine-grained irregular communication in UPC. Starting from explicit thread privatization, continuing with block-wise communication, and arriving at message condensing and consolidation, we obtained considerable performance improvement of UPC programs that originally require fine-grained irregular communication. Besides the performance enhancement strategies, the main contribution of the present paper is to propose performance models for the different scenarios, in form of quantifiable formulas that hinge on the actual volumes of various data movements plus a small number of easily obtainable hardware characteristic parameters. These performance models help to verify the enhancements obtained, while also providing insightful predictions of similar parallel implementations, not limited to UPC, that also involve between-thread or between-process irregular communication. As a further validation, we also apply our performance modeling methodology and hardware characteristic parameters to an existing UPC code for solving a 2D heat equation on a uniform mesh.
Affiliation | Scientific Computing |
Project(s) | PREAPP: PRoductivity and Energy-efficiency through Abstraction-based Parallel Programming, Meeting Exascale Computing with Source-to-Source Compilers |
Publication Type | Journal Article |
Year of Publication | 2019 |
Journal | Scientific Programming |
Volume | 2019 |
Pagination | Article ID 6825728 |
Date Published | 03/2019 |
Publisher | Hindawi |
Keywords | Fine-grained irregular communication, performance modeling, Performance optimization, Sparse matrix-vector multiplication, UPC programming language |
URL | https://www.hindawi.com/journals/sp/2019/6825728/ |
DOI | 10.1155/2019/6825728 |
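The performance models referred to in the abstract are quantifiable formulas over data volumes and a few hardware characteristic parameters. Their generic shape is the classic latency-bandwidth cost model, sketched here with hypothetical parameter values (the paper instantiates much more detailed variants):

```python
def predicted_comm_time(num_messages, total_bytes, latency, bandwidth):
    """Generic latency-bandwidth cost model: each message pays a fixed
    startup cost, and the payload streams at the link bandwidth.
    Illustrative sketch; all parameter values are hypothetical."""
    return num_messages * latency + total_bytes / bandwidth
```

Message consolidation attacks the first term (fewer messages, fewer latency payments), while condensing attacks the second (fewer bytes on the wire); these are the two levers behind the optimisations studied.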
Talks, invited
Heterogeneous computing for cardiac electrophysiology
In PREAPP workshop on Efficient Frameworks for Compute- and Data-intensive Computing (EFFECT), University of Tromsø, Norway, 2019. Status: Published
Electrical activities inside the heart are immensely important for the functioning of this vital organ. In the pursuit of a scientific understanding of the processes and mechanisms in electrophysiology, computer simulations have become an established paradigm of research. Both the complex mathematical models and the extreme physiological details require huge-scale simulations, which nowadays see an increasing use of heterogeneous computing. That is, the computational power is delivered by more than one processor type. We will discuss some of the resulting challenges in programming and performance optimization. Successful applications from the domain of cardiac electrophysiology will be used to demonstrate the usefulness of heterogeneous computing. We will also take a peek into the future of heterogeneous computing through eX3: the brand-new national infrastructure for experimental exploration of exascale computing.
Affiliation | Scientific Computing |
Project(s) | PREAPP: PRoductivity and Energy-efficiency through Abstraction-based Parallel Programming, Meeting Exascale Computing with Source-to-Source Compilers, Department of High Performance Computing |
Publication Type | Talks, invited |
Year of Publication | 2019 |
Location of Talk | PREAPP workshop on Efficient Frameworks for Compute- and Data-intensive Computing (EFFECT), University of Tromsø, Norway |
Proceedings, refereed
On the Performance and Energy Efficiency of the PGAS Programming Model on Multicore Architectures
In High Performance Computing & Simulation (2016) - International Workshop on Optimization of Energy Efficient HPC & Distributed Systems. ACM IEEE, 2016. Status: Published
Affiliation | Scientific Computing |
Project(s) | PREAPP: PRoductivity and Energy-efficiency through Abstraction-based Parallel Programming, Center for Biomedical Computing (SFF) |
Publication Type | Proceedings, refereed |
Year of Publication | 2016 |
Conference Name | High Performance Computing & Simulation (2016) - International Workshop on Optimization of Energy Efficient HPC & Distributed Systems |
Date Published | 08/2016 |
Publisher | ACM IEEE |
URL | http://dx.doi.org/10.1109/HPCSim.2016.7568416 |
DOI | 10.1109/HPCSim.2016.7568416 |
Meeting Exascale Computing with Source-to-Source Compilers

Future computing platforms are expected to be heterogeneous in architecture, that is, consisting of conventional CPUs and powerful hardware accelerators. The hardware heterogeneity, combined with the huge scale of these future platforms, will make the task of programming extremely difficult.
To overcome the programming challenge for the important class of scientific computations that are based on meshes, this project aims to develop two fully automated source-to-source compilers. These two compilers will help computational scientists to quickly prepare implementations of, respectively, implicit and explicit mesh-based computations for truly heterogeneous and resource-efficient execution on CPU+accelerator computing platforms. Two real-world simulators from computational cardiology will be used as testbeds of the fully automated compilers.
The success of such real-world heterogeneous simulations will not only verify the usefulness of the source-to-source compilers, but more importantly will allow unprecedented resolution and fidelity when investigating the particular topics of heart failure and arrhythmia.
Funding source
The Research Council of Norway (IKTPLUSS)
Partners
- University of California, San Diego
- Imperial College London
- Oslo University Hospital
Publications for Meeting Exascale Computing with Source-to-Source Compilers
Journal Article
On memory traffic and optimisations for low-order finite element assembly algorithms on multi-core CPUs
ACM Transactions on Mathematical Software 48, no. 2 (2022): 1-31. Status: Published
Motivated by the wish to understand the achievable performance of finite element assembly on unstructured computational meshes, we dissect the standard cellwise assembly algorithm into four kernels, two of which are dominated by irregular memory traffic. Several optimisation schemes are studied together with associated lower and upper bounds on the estimated memory traffic volume. Apart from properly reordering the mesh entities, the two most significant optimisations include adopting a lookup table in adding element matrices or vectors to their global counterparts, and using a row-wise assembly algorithm for multi-threaded parallelisation. Rigorous benchmarking shows that, due to the various optimisations, the actual volumes of memory traffic are in many cases very close to the estimated lower bounds. These results confirm the effectiveness of the optimisations, while also providing a recipe for developing efficient software for finite element assembly.
Affiliation | Scientific Computing |
Project(s) | Meeting Exascale Computing with Source-to-Source Compilers, Department of High Performance Computing |
Publication Type | Journal Article |
Year of Publication | 2022 |
Journal | ACM Transactions on Mathematical Software |
Volume | 48 |
Issue | 2 |
Number | 19 |
Pagination | 1–31 |
Date Published | 05/2022 |
Publisher | Association for Computing Machinery (ACM) |
ISSN | 0098-3500 |
DOI | 10.1145/3503925 |
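The lookup-table optimisation mentioned in the abstract can be sketched as follows (an illustrative Python sketch under assumed CSR storage, not the paper's code): for each cell, the position in the CSR value array of every local (i, j) entry is found once and stored, so that the scatter-add of element matrices needs no repeated column searches.

```python
def csr_positions(rowptr, colidx, rows, cols):
    """Build one cell's lookup table: the CSR value-array position of
    every local (a, b) pair, found by a single column search per entry."""
    table = {}
    for a, i in enumerate(rows):
        for b, j in enumerate(cols):
            for k in range(rowptr[i], rowptr[i + 1]):
                if colidx[k] == j:
                    table[(a, b)] = k
                    break
    return table

def assemble(cells, element_matrix, rowptr, colidx):
    """Cellwise assembly: scatter each element matrix into the CSR value
    array through the precomputed lookup tables."""
    vals = [0.0] * len(colidx)
    for cell in cells:
        table = csr_positions(rowptr, colidx, cell, cell)
        for (a, b), k in table.items():
            vals[k] += element_matrix[a][b]
    return vals
```

On a 1D mesh of two linear elements over nodes (0, 1) and (1, 2), with element stiffness matrix [[1, -1], [-1, 1]], this reproduces the familiar tridiagonal assembly; in a real solver the tables are built once per cell and reused across repeated assemblies.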
PhD Thesis
High-performance finite element computations: Performance modelling, optimisation, GPU acceleration & automated code generation
In University of Oslo. Vol. PhD. Oslo, Norway: University of Oslo, 2021. Status: Published
Computer experiments have become a valuable tool for investigating various physical and biological processes described by partial differential equations (PDEs), such as weather forecasting or modelling the mechanical behaviour of cardiac tissue. Finite element methods are a class of numerical methods for solving PDEs that are often preferred, but these methods are rather difficult to implement correctly, let alone efficiently.
This thesis investigates the performance of several key computational kernels involved in finite element methods. First, a performance model is developed to better understand sparse matrix-vector multiplication, which is central to solving linear systems of equations that arise during finite element calculations. Second, the process of assembling linear systems is considered through careful benchmarking and analysis of the memory traffic involved. This results in clear guidelines for finite element assembly on shared-memory multicore CPUs.
Finally, hardware accelerators are incorporated by extending the FEniCS PDE solver framework to carry out assembly and solution of linear systems on a graphics processing unit (GPU). Example problems show that GPU-accelerated finite element solvers can exhibit substantial speedup over optimised multicore CPU codes. Moreover, the use of automated code generation makes these techniques much more accessible to domain scientists and non-experts.
Affiliation | Scientific Computing |
Project(s) | Meeting Exascale Computing with Source-to-Source Compilers |
Publication Type | PhD Thesis |
Year of Publication | 2021 |
Degree awarding institution | University of Oslo |
Degree | PhD |
Number of Pages | 132 |
Date Published | 09/2020 |
Publisher | University of Oslo |
Place Published | Oslo, Norway |
Other Numbers | ISSN 1501-7710 |
Poster
Automated Code Generation for GPU-Based Finite Element Computations in FEniCS
SIAM Conference on Computational Science and Engineering (CSE21): SIAM, 2021. Status: Published
Developing high-performance finite element codes normally requires hand-crafting and fine tuning of computational kernels, which is not an easy task to carry out for each and every problem. Automated code generation has proved to be a highly productive alternative for frameworks like FEniCS, where a compiler is used to automatically generate suitable kernels from high-level mathematical descriptions of finite element problems. This strategy has so far enabled users to develop and run a variety of high-performance finite element solvers on clusters of multicore CPUs. We have recently enhanced FEniCS with GPU acceleration by enabling its internal compiler to generate CUDA kernels that are needed to offload finite element calculations to GPUs, particularly the assembly of linear systems. This poster presents the results of GPU-accelerating FEniCS and explores performance characteristics of auto-generated CUDA kernels and GPU-based assembly of linear systems for finite element methods.
Affiliation | Scientific Computing |
Project(s) | Meeting Exascale Computing with Source-to-Source Compilers |
Publication Type | Poster |
Year of Publication | 2021 |
Date Published | 03/2021 |
Publisher | SIAM |
Place Published | SIAM Conference on Computational Science and Engineering (CSE21) |
Journal Article
Cache simulation for irregular memory traffic on multi-core CPUs: Case study on performance models for sparse matrix–vector multiplication
Journal of Parallel and Distributed Computing 144 (2020): 189-205. Status: Published
Parallel computations with irregular memory access patterns are often limited by the memory subsystems of multi-core CPUs, though it can be difficult to pinpoint and quantify performance bottlenecks precisely. We present a method for estimating volumes of data traffic caused by irregular, parallel computations on multi-core CPUs with memory hierarchies containing both private and shared caches. Further, we describe a performance model based on these estimates that applies to bandwidth-limited computations. As a case study, we consider two standard algorithms for sparse matrix–vector multiplication, a widely used, irregular kernel. Using three different multi-core CPU systems and a set of matrices that induce a range of irregular memory access patterns, we demonstrate that our cache simulation combined with the proposed performance model accurately quantifies performance bottlenecks that would not be detected using standard best- or worst-case estimates of the data traffic volume.
Affiliation | Scientific Computing |
Project(s) | Meeting Exascale Computing with Source-to-Source Compilers, Department of High Performance Computing |
Publication Type | Journal Article |
Year of Publication | 2020 |
Journal | Journal of Parallel and Distributed Computing |
Volume | 144 |
Pagination | 189-205 |
Date Published | 06/2020 |
Publisher | Elsevier |
ISSN | 0743-7315 |
Keywords | AMD Epyc, Cache simulation, Intel Xeon, Performance model, Sparse matrix–vector multiplication |
URL | http://www.sciencedirect.com/science/article/pii/S0743731520302999 |
DOI | 10.1016/j.jpdc.2020.05.020 |
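The case-study kernel, and the standard best- and worst-case traffic estimates the paper improves upon, can be sketched compactly (illustrative only; the byte counts assume 8-byte matrix/vector values and 4-byte indices):

```python
def spmv_csr(rowptr, colidx, vals, x):
    """Row-major CSR sparse matrix-vector product y = A @ x."""
    y = [0.0] * (len(rowptr) - 1)
    for i in range(len(y)):
        for k in range(rowptr[i], rowptr[i + 1]):
            y[i] += vals[k] * x[colidx[k]]
    return y

def traffic_bounds(n_rows, n_cols, nnz, val_bytes=8, idx_bytes=4):
    """Best- and worst-case memory-traffic estimates (bytes) for CSR SpMV.
    Best case: the input vector x is streamed through cache exactly once;
    worst case: every indirect access to x misses."""
    fixed = (nnz * (val_bytes + idx_bytes)      # matrix values and column indices
             + (n_rows + 1) * idx_bytes        # row pointers
             + n_rows * val_bytes)             # output vector y
    best = fixed + n_cols * val_bytes
    worst = fixed + nnz * val_bytes
    return best, worst
```

The paper's cache simulation produces an estimate between these two bounds, which is what makes the resulting performance model accurate for irregular access patterns.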
Poster
Towards detailed Organ-Scale Simulations in Cardiac Electrophysiology
GPU Technology Conference (GTC), Silicon Valley, San Jose, USA, 2020. Status: Published
Affiliation | Scientific Computing |
Project(s) | Meeting Exascale Computing with Source-to-Source Compilers, Department of High Performance Computing |
Publication Type | Poster |
Year of Publication | 2020 |
Place Published | GPU Technology Conference (GTC), Silicon Valley, San Jose, USA |
Type of Work | Poster |
Proceedings, refereed
Karp-Sipser based Kernels for Bipartite Graph Matching
In Algorithm Engineering and Experiment (ALENEX). Society for Industrial and Applied Mathematics, 2020. Status: Published
We consider Karp–Sipser, a well-known matching heuristic, in the context of data reduction for the maximum cardinality matching problem. We describe an efficient implementation, as well as modifications to reduce its time complexity on worst-case instances, both in theory and in practical cases. We compare experimentally against its widely used simpler variant and show cases for which the full algorithm yields better performance.
Affiliation | Scientific Computing |
Project(s) | Meeting Exascale Computing with Source-to-Source Compilers, UMOD: Understanding and Monitoring Digital Wildfires, Department of High Performance Computing |
Publication Type | Proceedings, refereed |
Year of Publication | 2020 |
Conference Name | Algorithm Engineering and Experiment (ALENEX) |
Pagination | 134-145 |
Publisher | Society for Industrial and Applied Mathematics |
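The heuristic itself can be stated in a few lines. The sketch below is a deliberately simple quadratic-time version (efficient implementation is precisely the paper's contribution): the degree-1 rule matches a vertex with a single remaining neighbour, which provably never excludes a maximum matching; only when no degree-1 vertex exists is an arbitrary edge matched.

```python
def karp_sipser(adj, n_right):
    """Karp-Sipser heuristic for bipartite matching (illustrative sketch).
    adj[u] lists the right-side neighbours of left vertex u."""
    n_left = len(adj)
    radj = [[] for _ in range(n_right)]
    for u, nbrs in enumerate(adj):
        for v in nbrs:
            radj[v].append(u)
    alive_l, alive_r = [True] * n_left, [True] * n_right
    matching = []

    def degree(nbrs, alive):
        return sum(1 for w in nbrs if alive[w])

    def take(u, v):
        matching.append((u, v))
        alive_l[u] = False
        alive_r[v] = False

    while True:
        # Degree-1 rule, applied on either side of the bipartite graph.
        u = next((i for i in range(n_left)
                  if alive_l[i] and degree(adj[i], alive_r) == 1), None)
        if u is not None:
            take(u, next(j for j in adj[u] if alive_r[j]))
            continue
        v = next((j for j in range(n_right)
                  if alive_r[j] and degree(radj[j], alive_l) == 1), None)
        if v is not None:
            take(next(i for i in radj[v] if alive_l[i]), v)
            continue
        # Random rule: match an arbitrary remaining edge (deterministic here).
        edge = next(((i, j) for i in range(n_left) if alive_l[i]
                     for j in adj[i] if alive_r[j]), None)
        if edge is None:
            return matching
        take(*edge)
```

Removing matched vertices lowers the remaining degrees, which cascades new degree-1 vertices; the paper studies how to implement exactly this cascade efficiently and how it compares with the simpler variant that applies the degree-1 rule on one side only.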
Journal Article
Performance optimization and modeling of fine-grained irregular communication in UPC
Scientific Programming 2019 (2019): Article ID 6825728. Status: Published
The UPC programming language offers parallelism via logically partitioned shared memory, which typically spans physically disjoint memory sub-systems. One convenient feature of UPC is its ability to automatically execute between-thread data movement, such that the entire content of a shared data array appears to be freely accessible by all the threads. The programmer friendliness, however, can come at the cost of substantial performance penalties. This is especially true when indirectly indexing the elements of a shared array, for which the induced between-thread data communication can be irregular and have a fine-grained pattern. In this paper we study performance enhancement strategies specifically targeting such fine-grained irregular communication in UPC. Starting from explicit thread privatization, continuing with block-wise communication, and arriving at message condensing and consolidation, we obtained considerable performance improvement of UPC programs that originally require fine-grained irregular communication. Besides the performance enhancement strategies, the main contribution of the present paper is to propose performance models for the different scenarios, in form of quantifiable formulas that hinge on the actual volumes of various data movements plus a small number of easily obtainable hardware characteristic parameters. These performance models help to verify the enhancements obtained, while also providing insightful predictions of similar parallel implementations, not limited to UPC, that also involve between-thread or between-process irregular communication. As a further validation, we also apply our performance modeling methodology and hardware characteristic parameters to an existing UPC code for solving a 2D heat equation on a uniform mesh.
Affiliation | Scientific Computing |
Project(s) | PREAPP: PRoductivity and Energy-efficiency through Abstraction-based Parallel Programming, Meeting Exascale Computing with Source-to-Source Compilers |
Publication Type | Journal Article |
Year of Publication | 2019 |
Journal | Scientific Programming |
Volume | 2019 |
Pagination | Article ID 6825728 |
Date Published | 03/2019 |
Publisher | Hindawi |
Keywords | Fine-grained irregular communication, performance modeling, Performance optimization, Sparse matrix-vector multiplication, UPC programming language |
URL | https://www.hindawi.com/journals/sp/2019/6825728/ |
DOI | 10.1155/2019/6825728 |
Poster
Towards Detailed Real-Time Simulations of Cardiac Arrhythmia
International Conference in Computing in Cardiology, Singapore, 2019. Status: Published
Recent advances in personalized arrhythmia risk prediction show that computational models can provide not only safer but also more accurate results than invasive procedures. However, biophysically accurate simulations require solving linear systems over fine meshes and time resolutions, which can take hours or even days. This limits the use of such simulations in the clinic where diagnosis and treatment planning can be time sensitive, even if it is just for the reason of operation schedules. Furthermore, the non-interactive, non-intuitive way of accessing simulations and their results makes it hard to study these collaboratively.
Overcoming these limitations requires speeding up computations from hours to seconds, which requires a massive increase in computational capabilities.
Fortunately, the cost of computing has fallen dramatically in the past decade. A prominent reason for this is the recent introduction of manycore processors such as GPUs, which by now power the majority of the world’s leading supercomputers. These devices owe their success to the fact that they are optimized for massively parallel workloads, such as applying similar ODE kernel computations to millions of mesh elements in scientific computing applications. Unlike CPUs, which are typically optimized for sequential performance, this allows GPU architectures to dedicate more transistors to performing computations, thereby increasing parallel speed and energy efficiency.
In this poster, we present ongoing work on the parallelization of finite volume computations over an unstructured mesh as well as the challenges involved in building scalable simulation codes and discuss the steps needed to close the gap to accurate real-time computations.
Affiliation | Scientific Computing |
Project(s) | Meeting Exascale Computing with Source-to-Source Compilers, Department of High Performance Computing |
Publication Type | Poster |
Year of Publication | 2019 |
Date Published | 09/2019 |
Place Published | International Conference in Computing in Cardiology, Singapore |
Proceedings, refereed
Towards Detailed Real-Time Simulations of Cardiac Arrhythmia
In Computing in Cardiology. Vol. 46. IEEE, 2019. Status: Published
Towards Detailed Real-Time Simulations of Cardiac Arrhythmia
Recent advances in personalized arrhythmia risk prediction show that computational models can provide not only safer but also more accurate results than invasive procedures. However, biophysically accurate simulations require solving linear systems over fine meshes and time resolutions, which can take hours or even days. This limits the use of such simulations in the clinic where diagnosis and treatment planning can be time sensitive, even if it is just for the reason of operation schedules. Furthermore, the non-interactive, non-intuitive way of accessing simulations and their results makes it hard to study these collaboratively. Overcoming these limitations requires speeding up computations from hours to seconds, which requires a massive increase in computational capabilities.
Fortunately, the cost of computing has fallen dramatically in the past decade. A prominent reason for this is the recent introduction of manycore processors such as GPUs, which by now power the majority of the world’s leading supercomputers. These devices owe their success to the fact that they are optimized for massively parallel workloads, such as applying similar ODE kernel computations to millions of mesh elements in scientific computing applications. Unlike CPUs, which are typically optimized for sequential performance, this allows GPU architectures to dedicate more transistors to performing computations, thereby increasing parallel speed and energy efficiency.
Affiliation | Scientific Computing |
Project(s) | Meeting Exascale Computing with Source-to-Source Compilers |
Publication Type | Proceedings, refereed |
Year of Publication | 2019 |
Conference Name | Computing in Cardiology |
Volume | 46 |
Date Published | 12/2019 |
Publisher | IEEE |
Combining algorithmic rethinking and AVX-512 intrinsics for efficient simulation of subcellular calcium signaling
In International Conference on Computational Science (ICCS 2019). Springer, 2019. Status: Published
Combining algorithmic rethinking and AVX-512 intrinsics for efficient simulation of subcellular calcium signaling
Calcium signaling is vital for the contraction of the heart. Physiologically realistic simulation of this subcellular process requires nanometer resolutions and a complicated mathematical model of differential equations. Since the subcellular space is composed of several irregularly-shaped and intricately-connected physiological domains with distinct properties, one particular challenge is to correctly compute the diffusion-induced calcium fluxes between the physiological domains. The common approach is to pre-calculate the effective diffusion coefficients between all pairs of neighboring computational voxels, and store them in large arrays. Such a strategy avoids complicated if-tests when looping through the computational mesh, but suffers from substantial memory overhead. In this paper, we adopt a memory-efficient strategy that uses a small lookup table of diffusion coefficients. The memory footprint and traffic are both drastically reduced, while also avoiding the if-tests. However, the new strategy induces more instructions on the processor level. To offset this potential performance pitfall, we use AVX-512 intrinsics to effectively vectorize the code. Performance measurements on a Knights Landing processor and a quad-socket Skylake server show a clear performance advantage of the manually vectorized implementation that uses lookup tables, over the counterpart using coefficient arrays.
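The lookup-table strategy described above can be sketched as follows. This is an illustrative NumPy version with invented domain counts and coefficient values; the paper's implementation targets vectorised CPU code:

```python
import numpy as np

# Illustrative sketch (invented domain count and coefficients): instead of
# pre-calculating one effective diffusion coefficient per voxel face, store a
# small table indexed by the pair of physiological domain IDs on either side.
n_domains = 4
rng = np.random.default_rng(0)
D = rng.random((n_domains, n_domains))  # small lookup table of coefficients
D = (D + D.T) / 2                       # diffusion is symmetric across a face

n_faces = 1_000_000
face_left = rng.integers(0, n_domains, size=n_faces)   # domain ID per side
face_right = rng.integers(0, n_domains, size=n_faces)

# Branch-free, vectorised gather: no if-tests in the face loop, and the
# footprint is a 4x4 table plus two small integer arrays, rather than one
# floating-point coefficient stored per face.
coeff_per_face = D[face_left, face_right]
```

The gather replaces both the large coefficient arrays and the if-tests; the trade-off noted in the abstract, extra instructions per face, is what the paper's AVX-512 vectorisation offsets.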
Affiliation | Scientific Computing |
Project(s) | Meeting Exascale Computing with Source-to-Source Compilers, Department of High Performance Computing |
Publication Type | Proceedings, refereed |
Year of Publication | 2019 |
Conference Name | International Conference on Computational Science (ICCS 2019) |
Pagination | 681-687 |
Publisher | Springer |
DOI | 10.1007/978-3-030-22750-0_66 |
MicroCard: Numerical modeling of cardiac electrophysiology at the cellular scale
A groundbreaking approach to studying cardiac electrophysiology is to model the heart cell by cell, which will lead to a mathematical problem that is 10,000 times larger, and also harder to solve. We will need larger supercomputers than those that exist today, and a lot of inventiveness to compute efficiently on these future machines.
The purpose of the MICROCARD project is to develop a software code that will be able to solve this problem on future "exascale" supercomputers. We will develop algorithms that are tailored to the specific mathematical problem, to the size of the computations, and to the particular design of these future computers, which will probably owe most of their computing power to ultra-parallel computing elements such as Graphics Processing Units. We will not content ourselves with a "proof of concept", but will use the code that we develop to solve real-life problems in cardiology. Therefore the project includes computer experts, mathematicians, and biomedical engineers, and collaborates with cardiologists and physiologists.
The code will be adaptable to similar biological systems such as nerves, and some components will be reusable in an even wider range of applications.
Funding source
50% EU (through EuroHPC), and 50% from the Research Council of Norway
Partners
- University of Bordeaux (France)
- University of Strasbourg (France)
- Inria (France)
- Karlsruhe Institute of Technology (Germany)
- Zuse Institute Berlin (Germany)
- University of Pavia (Italy)
- Università Della Svizzera Italiana (Switzerland)
- Megware (Germany)
- Orobix (Italy)
- NumeriCor (Austria)
Project coordinator:
- Mark Potse (Univ. Bordeaux/Inria)
- Simula contact person: Xing Cai
Project website at EU: http://www.microcard.eu
Publications for MicroCard: Numerical modeling of cardiac electrophysiology at the cellular scale
Journal Article
Resource-efficient use of modern processor architectures for numerically solving cardiac ionic cell models
Frontiers in Physiology 13 (2022). Status: Published
Resource-efficient use of modern processor architectures for numerically solving cardiac ionic cell models
A central component in simulating cardiac electrophysiology is the numerical solution of nonlinear ordinary differential equations, also called cardiac ionic cell models, that describe cross-cell-membrane ion transport. Biophysically detailed cell models often require a considerable amount of computation, including calls to special mathematical functions. This paper systematically studies how to efficiently use modern multicore CPUs for this costly computational task. We start by investigating the code restructurings needed to effectively enable compiler-supported SIMD vectorisation, which is the most important performance booster in this context. It is found that suitable OpenMP directives are sufficient for achieving both vectorisation and parallelisation. We then continue with an evaluation of the performance optimisation technique of using lookup tables. Due to increased challenges for automated vectorisation, the obtainable benefits of lookup tables are dependent on the hardware platforms chosen. Throughout the study, we report detailed time measurements obtained on Intel Xeon, Xeon Phi, AMD Epyc and two ARM processors including Fujitsu A64FX, while attention is also paid to the impact of SIMD vectorisation and lookup tables on the computational accuracy. As a realistic example, the benefits of performance enhancement are demonstrated by a 10^9-run ensemble on the OakForest-PACS system, where code restructurings and SIMD vectorisation yield an 84% reduction in computing time, corresponding to 63,270 node hours.
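The lookup-table technique evaluated in the paper can be illustrated in miniature. The rate function, table size and voltage range below are our own choices, not those of any specific cell model, and accuracy depends on the table resolution, mirroring the accuracy concerns raised in the abstract:

```python
import numpy as np

# Illustrative sketch (toy rate function, assumed voltage range and table
# size): tabulate an expensive exponential once, then evaluate via linear
# interpolation instead of calling the special function per mesh point.
v_min, v_max, n = -100.0, 50.0, 3000
grid = np.linspace(v_min, v_max, n)
table = np.exp(0.1 * grid)        # expensive function evaluated only n times

def rate_lut(v):
    """Lookup with linear interpolation; v assumed to lie in [v_min, v_max]."""
    x = (v - v_min) / (v_max - v_min) * (n - 1)
    i = np.minimum(x.astype(int), n - 2)   # left grid index per sample
    w = x - i                              # interpolation weight
    return (1 - w) * table[i] + w * table[i + 1]

v = np.random.default_rng(1).uniform(v_min, v_max, 1_000_000)
err = np.max(np.abs(rate_lut(v) - np.exp(0.1 * v)))
```

For this table resolution the worst-case interpolation error stays below 1e-3, but halving the table size roughly quadruples it, which is why the accuracy impact has to be measured alongside the speedup.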
Affiliation | Scientific Computing |
Project(s) | Department of Computational Physiology, MicroCard: Numerical modeling of cardiac electrophysiology at the cellular scale |
Publication Type | Journal Article |
Year of Publication | 2022 |
Journal | Frontiers in Physiology |
Volume | 13 |
Date Published | 06/2022 |
Publisher | Frontiers |
ISSN | 1664-042X |
URL | https://www.frontiersin.org/article/10.3389/fphys.2022.904648 |
DOI | 10.3389/fphys.2022.904648 |
SparCity: An Optimization and Co-design Framework for Sparse Computation
Perfectly aligned with the vision of the EuroHPC Joint Undertaking, the SparCity project aims at creating a supercomputing framework that will provide efficient algorithms and coherent tools specifically designed for maximising the performance and energy efficiency of sparse computations on emerging HPC systems, while also opening up new usage areas for sparse computations in data analytics and deep learning. The framework enables comprehensive application characterization and modelling, performing synergistic node-level and system-level software optimizations. By creating a digital SuperTwin, the framework is also capable of evaluating existing hardware components and addressing what-if scenarios on emerging architectures and systems in a co-design perspective. To demonstrate the effectiveness, societal impact, and usability of the framework, the SparCity project will enhance the computing scale and energy efficiency of four challenging real-life applications that come from drastically different domains, namely, computational cardiology, social networks, bioinformatics and autonomous driving. By targeting this collection of challenging applications, SparCity will develop world-class, extreme-scale and energy-efficient HPC technologies, and contribute to building a sustainable exascale ecosystem and increasing Europe’s competitiveness.
Funding source
50% EU (through EuroHPC), and 50% from the Research Council of Norway
Partners
- Koc University (Turkey)
- Sabanci University (Turkey)
- INESC-ID (Portugal)
- Ludwig-Maximilians-Universität (Germany)
- Graphcore AS (Norway)
Project coordinator:
- Didem Unat (Koc Univ)
- Simula contact person: Xing Cai
Publications for SparCity: An Optimization and Co-design Framework for Sparse Computation
Journal Article
Detailed Modeling of Heterogeneous and Contention-Constrained Point-to-Point MPI Communication
IEEE Transactions on Parallel and Distributed Systems 34, no. 5 (2023): 1580-1593. Status: Published
Detailed Modeling of Heterogeneous and Contention-Constrained Point-to-Point MPI Communication
The network topology of modern parallel computing systems is inherently heterogeneous, with a variety of latency and bandwidth values. Moreover, contention for the bandwidth can exist on different levels when many processes communicate with each other. Many-pair, point-to-point MPI communication is thus characterized by heterogeneity and contention, even on a cluster of homogeneous multicore CPU nodes. To get a detailed understanding of the individual communication cost per MPI process, we propose a new modeling methodology that incorporates both heterogeneity and contention. First, we improve the standard max-rate model to better quantify the actually achievable bandwidth depending on the number of MPI processes in competition. Then, we make a further extension that models the bandwidth contention in more detail, for the case where the competing MPI processes have different numbers of neighbors and non-uniform message sizes. Thereafter, we include more flexibility by considering interactions between intra-socket and inter-socket messaging. Through a series of experiments done on different processor architectures, we show that the new heterogeneous and contention-constrained performance models can adequately explain the individual communication cost associated with each MPI process. The largest test of realistic point-to-point MPI communication involves 8,192 processes and in total 2,744,632 simultaneous messages over 64 dual-socket AMD Epyc Rome compute nodes connected by InfiniBand, for which the overall prediction accuracy achieved is 84%.
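The flavor of a max-rate-style bandwidth model can be sketched as follows. Parameter names and values are illustrative assumptions, not the paper's calibrated model:

```python
# Illustrative max-rate-style sketch (assumed parameter values): with k
# processes on a node sending concurrently, aggregate injection bandwidth
# grows linearly in k only until a shared node-level cap binds.

def achievable_bandwidth(k, b_single, b_max_node):
    """Per-process share of the achievable aggregate bandwidth."""
    return min(k * b_single, b_max_node) / k

# A lone process enjoys its full link rate...
assert achievable_bandwidth(1, b_single=12e9, b_max_node=24e9) == 12e9
# ...but with 8 competing processes each gets only 1/8 of the node cap.
assert achievable_bandwidth(8, b_single=12e9, b_max_node=24e9) == 3e9
```

The paper's contribution is to extend this kind of model so that processes with different neighbor counts and message sizes receive individually accurate cost predictions, rather than one uniform share.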
Affiliation | Scientific Computing |
Project(s) | Department of High Performance Computing, SparCity: An Optimization and Co-design Framework for Sparse Computation |
Publication Type | Journal Article |
Year of Publication | 2023 |
Journal | IEEE Transactions on Parallel and Distributed Systems |
Volume | 34 |
Issue | 5 |
Pagination | 1580 - 1593 |
Date Published | 03/2023 |
Publisher | IEEE |
ISSN | 1045-9219 |
URL | https://ieeexplore.ieee.org/document/10064025 |
DOI | 10.1109/TPDS.2023.3253881 |
Enabling unstructured-mesh computation on massively tiled AI processors: An example of accelerating in silico cardiac simulation
Frontiers in Physics 11 (2023). Status: Published
Enabling unstructured-mesh computation on massively tiled AI processors: An example of accelerating in silico cardiac simulation
A new trend in processor architecture design is the packaging of thousands of small processor cores into a single device, where there is no device-level shared memory but each core has its own local memory. Thus, both the work and data of an application code need to be carefully distributed among the small cores, also termed as tiles. In this paper, we investigate how numerical computations that involve unstructured meshes can be efficiently parallelized and executed on a massively tiled architecture. Graphcore IPUs are chosen as the target hardware platform, to which we port an existing monodomain solver that simulates cardiac electrophysiology over realistic 3D irregular heart geometries. There are two computational kernels in this simulator, where a 3D diffusion equation is discretized over an unstructured mesh and numerically approximated by repeatedly executing sparse matrix-vector multiplications (SpMVs), whereas an individual system of ordinary differential equations (ODEs) is explicitly integrated per mesh cell. We demonstrate how a new style of programming that uses Poplar/C++ can be used to port these commonly encountered computational tasks to Graphcore IPUs. In particular, we describe a per-tile data structure that is adapted to facilitate the inter-tile data exchange needed for parallelizing the SpMVs. We also study the achievable performance of the ODE solver that heavily depends on special mathematical functions, as well as their accuracy on Graphcore IPUs. Moreover, topics related to using multiple IPUs and performance analysis are addressed. In addition to demonstrating an impressive level of performance that can be achieved by IPUs for monodomain simulation, we also provide a discussion on the generic theme of parallelizing and executing unstructured-mesh multiphysics computations on massively tiled hardware.
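The row-partitioned SpMV at the heart of the solver can be sketched serially. The tiny CSR example below is our own; the per-tile data structures and inter-tile exchange on the IPU are abstracted away:

```python
import numpy as np

# Illustrative sketch (hand-made 4x4 CSR matrix): rows are partitioned among
# "tiles", and each tile computes its local part of y = A @ x. On the IPU the
# x-entries owned by other tiles would first be exchanged between tiles; here
# we simply read from a replicated x for clarity.
indptr = np.array([0, 2, 4, 6, 8])
indices = np.array([0, 2, 1, 3, 0, 2, 1, 3])
data = np.array([4., 1., 3., 2., 1., 5., 2., 6.])
x = np.array([1., 2., 3., 4.])

def spmv_rows(rows, x_global):
    """Local SpMV over one tile's assigned row range."""
    y = np.zeros(len(rows))
    for k, r in enumerate(rows):
        for j in range(indptr[r], indptr[r + 1]):
            y[k] += data[j] * x_global[indices[j]]
    return y

# Tile 0 owns rows 0-1, tile 1 owns rows 2-3; concatenating the per-tile
# results reproduces the full matrix-vector product.
y = np.concatenate([spmv_rows([0, 1], x), spmv_rows([2, 3], x)])
```

The column indices referenced by each tile's rows determine exactly which remote x-entries must be gathered, which is the inter-tile exchange the paper's per-tile data structure is designed to facilitate.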
Affiliation | Scientific Computing |
Project(s) | Department of High Performance Computing, SparCity: An Optimization and Co-design Framework for Sparse Computation |
Publication Type | Journal Article |
Year of Publication | 2023 |
Journal | Frontiers in Physics |
Volume | 11 |
Date Published | 03/2023 |
Publisher | Frontiers |
ISSN | 2296-424X |
Keywords | hardware accelerator, heterogenous computing, irregular meshes, scientific computation, scientific computation on MIMD processors, sparse matrix-vector multiplication (SpMV) |
URL | https://www.frontiersin.org/articles/10.3389/fphy.2023.979699/full |
DOI | 10.3389/fphy.2023.979699 |
Proceedings, refereed
iPUG for multiple Graphcore IPUs: Optimizing performance and scalability of parallel breadth-first search
In 28th IEEE International Conference on High Performance Computing, Data, & Analytics (HiPC). Bangalore, India: IEEE, 2021. Status: Published
iPUG for multiple Graphcore IPUs: Optimizing performance and scalability of parallel breadth-first search
Parallel graph algorithms have become one of the principal applications of high-performance computing besides numerical simulations and machine learning workloads. However, due to their highly unstructured nature, graph algorithms remain extremely challenging for most parallel systems, with large gaps between observed performance and theoretical limits. Further-more, most mainstream architectures rely heavily on single instruction multiple data (SIMD) processing for high floating-point rates, which is not beneficial for graph processing which instead requires high memory bandwidth, low memory latency, and efficient processing of unstructured data. On the other hand, we are currently observing an explosion of new hardware architectures, many of which are adapted to specific purposes and diverge from traditional designs. A notable example is the Graphcore Intelligence Processing Unit (IPU), which is developed to meet the needs of upcoming machine intelligence applications. Its design eschews the traditional cache hierarchy, relying on SRAM as its main memory instead. The result is an extremely high-bandwidth, low-latency memory at the cost of capacity. In addition, the IPU consists of a large number of independent cores, allowing for true multiple instruction multiple data (MIMD) processing. Together, these features suggest that such a processor is well suited for graph processing. We test the limits of graph processing on multiple IPUs by implementing a low-level, high-performance code for breadth-first search (BFS), following the specifications of Graph500, the most widely used benchmark for parallel graph processing. Despite the simplicity of the BFS algorithm, implementing efficient parallel codes for it has proven to be a challenging task in the past. We show that our implementation scales well on a system with 8 IPUs and attains roughly twice the performance of an equal number of NVIDIA V100 GPUs using state-of-the-art CUDA code.
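Level-synchronous BFS, the algorithmic core of the Graph500 benchmark, has the following basic structure. This is a serial Python sketch of the standard algorithm; the paper's contribution is the low-level, parallel IPU implementation of it:

```python
# Serial sketch of level-synchronous BFS: in each round, the current
# frontier of vertices is expanded one hop, and newly discovered vertices
# form the next frontier. (Graph and variable names are illustrative.)

def bfs_levels(adj, source):
    """Return the BFS level (hop distance) of every vertex reachable from source."""
    level = {source: 0}
    frontier = [source]
    while frontier:
        next_frontier = []
        for u in frontier:              # expand every frontier vertex
            for v in adj[u]:
                if v not in level:      # first visit fixes the level
                    level[v] = level[u] + 1
                    next_frontier.append(v)
        frontier = next_frontier
    return level

# A small undirected example graph as an adjacency dict.
adj = {0: [1, 2], 1: [0, 3], 2: [0, 3], 3: [1, 2]}
```

The irregular, data-dependent memory accesses in the inner loop are precisely what makes BFS hard on SIMD-oriented hardware and a natural fit for the IPU's MIMD cores and low-latency SRAM.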
Affiliation | Scientific Computing |
Project(s) | Department of High Performance Computing, SparCity: An Optimization and Co-design Framework for Sparse Computation |
Publication Type | Proceedings, refereed |
Year of Publication | 2021 |
Conference Name | 28th IEEE International Conference on High Performance Computing, Data, & Analytics (HiPC) |
Pagination | 162-171 |
Date Published | 12/2021 |
Publisher | IEEE |
Place Published | Bangalore, India |
DOI | 10.1109/HiPC53243.2021.00030 |
Explaining the Performance of Supervised and Semi-Supervised Methods for Automated Sparse Matrix Format Selection
In 50th International Conference on Parallel Processing Workshop. Chicago, Illinois, USA: ACM, 2021. Status: Published
Explaining the Performance of Supervised and Semi-Supervised Methods for Automated Sparse Matrix Format Selection
Affiliation | Scientific Computing, Machine Learning |
Project(s) | Department of High Performance Computing, SparCity: An Optimization and Co-design Framework for Sparse Computation |
Publication Type | Proceedings, refereed |
Year of Publication | 2021 |
Conference Name | 50th International Conference on Parallel Processing Workshop |
Pagination | 1-10 |
Date Published | 08/2021 |
Publisher | ACM |
Place Published | Chicago, Illinois, USA |
Department of High Performance Computing
For computational modelling and simulation, widely considered the third paradigm of science, efficient use of computer hardware is vital. The same is true for the newly resurgent fields of machine learning and artificial intelligence. The Department of High-Performance Computing (HPC) therefore sees it as its mission to enable large- and huge-scale computations in these contexts, through resource-efficient use of modern computing platforms.
To achieve this ambitious goal, the HPC department will research several topics, including methodologies of parallel programming, hardware-compatible and/or hardware-inspired numerical strategies, software tools for user-friendly deployment and optimization of scientific code, and real-world applications from various branches of computational science. The department will actively engage in multi-disciplinary collaborations across departmental boundaries within Simula, as well as nationally and internationally.
eX3
The national research infrastructure eX3 (Experimental Infrastructure for Exploration of Exascale Computing), hosted at Simula, will be the primary hardware testbed for the HPC department. The various bleeding-edge processor architectures and interconnect solutions available on this extremely heterogeneous infrastructure will be carefully studied and tested in preparation for the upcoming exascale computing era and beyond. In particular, the machine-learning-specific hardware within the eX3 platform is expected to pave the way for the forward-looking, multi-disciplinary research direction of scientific machine learning.
More about eX3
- Feature article: "Preparing Norway for the next generation supercomputers"
- ex3 website: ex3.simula.no
Publications for Department of High Performance Computing
Journal Article
Detailed Modeling of Heterogeneous and Contention-Constrained Point-to-Point MPI Communication
IEEE Transactions on Parallel and Distributed Systems 34, no. 5 (2023): 1580-1593. Status: Published
Detailed Modeling of Heterogeneous and Contention-Constrained Point-to-Point MPI Communication
The network topology of modern parallel computing systems is inherently heterogeneous, with a variety of latency and bandwidth values. Moreover, contention for the bandwidth can exist on different levels when many processes communicate with each other. Many-pair, point-to-point MPI communication is thus characterized by heterogeneity and contention, even on a cluster of homogeneous multicore CPU nodes. To get a detailed understanding of the individual communication cost per MPI process, we propose a new modeling methodology that incorporates both heterogeneity and contention. First, we improve the standard max-rate model to better quantify the actually achievable bandwidth depending on the number of MPI processes in competition. Then, we make a further extension that models the bandwidth contention in more detail, for the case where the competing MPI processes have different numbers of neighbors and non-uniform message sizes. Thereafter, we include more flexibility by considering interactions between intra-socket and inter-socket messaging. Through a series of experiments done on different processor architectures, we show that the new heterogeneous and contention-constrained performance models can adequately explain the individual communication cost associated with each MPI process. The largest test of realistic point-to-point MPI communication involves 8,192 processes and in total 2,744,632 simultaneous messages over 64 dual-socket AMD Epyc Rome compute nodes connected by InfiniBand, for which the overall prediction accuracy achieved is 84%.
Affiliation | Scientific Computing |
Project(s) | Department of High Performance Computing, SparCity: An Optimization and Co-design Framework for Sparse Computation |
Publication Type | Journal Article |
Year of Publication | 2023 |
Journal | IEEE Transactions on Parallel and Distributed Systems |
Volume | 34 |
Issue | 5 |
Pagination | 1580 - 1593 |
Date Published | 03/2023 |
Publisher | IEEE |
ISSN | 1045-9219 |
URL | https://ieeexplore.ieee.org/document/10064025 |
DOI | 10.1109/TPDS.2023.3253881 |
Enabling unstructured-mesh computation on massively tiled AI processors: An example of accelerating in silico cardiac simulation
Frontiers in Physics 11 (2023). Status: Published
Enabling unstructured-mesh computation on massively tiled AI processors: An example of accelerating in silico cardiac simulation
A new trend in processor architecture design is the packaging of thousands of small processor cores into a single device, where there is no device-level shared memory but each core has its own local memory. Thus, both the work and data of an application code need to be carefully distributed among the small cores, also termed as tiles. In this paper, we investigate how numerical computations that involve unstructured meshes can be efficiently parallelized and executed on a massively tiled architecture. Graphcore IPUs are chosen as the target hardware platform, to which we port an existing monodomain solver that simulates cardiac electrophysiology over realistic 3D irregular heart geometries. There are two computational kernels in this simulator, where a 3D diffusion equation is discretized over an unstructured mesh and numerically approximated by repeatedly executing sparse matrix-vector multiplications (SpMVs), whereas an individual system of ordinary differential equations (ODEs) is explicitly integrated per mesh cell. We demonstrate how a new style of programming that uses Poplar/C++ can be used to port these commonly encountered computational tasks to Graphcore IPUs. In particular, we describe a per-tile data structure that is adapted to facilitate the inter-tile data exchange needed for parallelizing the SpMVs. We also study the achievable performance of the ODE solver that heavily depends on special mathematical functions, as well as their accuracy on Graphcore IPUs. Moreover, topics related to using multiple IPUs and performance analysis are addressed. In addition to demonstrating an impressive level of performance that can be achieved by IPUs for monodomain simulation, we also provide a discussion on the generic theme of parallelizing and executing unstructured-mesh multiphysics computations on massively tiled hardware.
Affiliation | Scientific Computing |
Project(s) | Department of High Performance Computing, SparCity: An Optimization and Co-design Framework for Sparse Computation |
Publication Type | Journal Article |
Year of Publication | 2023 |
Journal | Frontiers in Physics |
Volume | 11 |
Date Published | 03/2023 |
Publisher | Frontiers |
ISSN | 2296-424X |
Keywords | hardware accelerator, heterogenous computing, irregular meshes, scientific computation, scientific computation on MIMD processors, sparse matrix-vector multiplication (SpMV) |
URL | https://www.frontiersin.org/articles/10.3389/fphy.2023.979699/full |
DOI | 10.3389/fphy.2023.979699 |
Journal Article
Impacts of Covid-19 on Norwegian salmon exports: A firm-level analysis
Aquaculture 561 (2022): 738678. Status: Published
Impacts of Covid-19 on Norwegian salmon exports: A firm-level analysis
Affiliation | Scientific Computing, Machine Learning |
Project(s) | Department of High Performance Computing |
Publication Type | Journal Article |
Year of Publication | 2022 |
Journal | Aquaculture |
Volume | 561 |
Pagination | 738678 |
Date Published | 12/2022 |
Publisher | Elsevier |
ISSN | 0044-8486 |
URL | https://www.sciencedirect.com/science/article/pii/S0044848622007955 |
DOI | 10.1016/j.aquaculture.2022.738678 |
On memory traffic and optimisations for low-order finite element assembly algorithms on multi-core CPUs
ACM Transactions on Mathematical Software 48, no. 2 (2022): 1-31. Status: Published
On memory traffic and optimisations for low-order finite element assembly algorithms on multi-core CPUs
Motivated by the wish to understand the achievable performance of finite element assembly on unstructured computational meshes, we dissect the standard cellwise assembly algorithm into four kernels, two of which are dominated by irregular memory traffic. Several optimisation schemes are studied together with associated lower and upper bounds on the estimated memory traffic volume. Apart from properly reordering the mesh entities, the two most significant optimisations include adopting a lookup table in adding element matrices or vectors to their global counterparts, and using a row-wise assembly algorithm for multi-threaded parallelisation. Rigorous benchmarking shows that, due to the various optimisations, the actual volumes of memory traffic are in many cases very close to the estimated lower bounds. These results confirm the effectiveness of the optimisations, while also providing a recipe for developing efficient software for finite element assembly.
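The lookup-table optimisation for adding element matrices into the global sparse matrix can be sketched as follows. This is a minimal 1D P1 example of our own construction, not the paper's code: the table maps each cell's local matrix entries directly to flat positions in the CSR data array, so the hot assembly loop does pure scatter-adds with no search over column indices.

```python
import numpy as np

# Illustrative sketch: 1D P1 elements on a uniform mesh of 4 cells give a
# tridiagonal 5x5 global matrix, stored here in CSR form.
n_cells, n_dofs = 4, 5
indptr = np.array([0, 2, 5, 8, 11, 13])
indices = np.array([0, 1, 0, 1, 2, 1, 2, 3, 2, 3, 4, 3, 4])
data = np.zeros(len(indices))

def csr_slot(r, c):
    """Search for entry (r, c) in the CSR arrays -- setup phase only."""
    for p in range(indptr[r], indptr[r + 1]):
        if indices[p] == c:
            return p
    raise KeyError((r, c))

# Lookup table, built once: one flat CSR position per (cell, local_i, local_j).
lut = np.array([[[csr_slot(c + a, c + b) for b in range(2)]
                 for a in range(2)] for c in range(n_cells)])

ke = np.array([[1.0, -1.0], [-1.0, 1.0]])  # element stiffness matrix (h = 1)
for c in range(n_cells):                    # hot loop: scatter-add, no search
    for a in range(2):
        for b in range(2):
            data[lut[c, a, b]] += ke[a, b]
```

The result is the familiar 1D Laplacian stencil (diagonal 1, 2, 2, 2, 1 with off-diagonal -1); the table trades a one-time setup search for irregular-memory-traffic-free inner iterations, in the spirit of the optimisation the paper benchmarks.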
Affiliation | Scientific Computing |
Project(s) | Meeting Exascale Computing with Source-to-Source Compilers, Department of High Performance Computing |
Publication Type | Journal Article |
Year of Publication | 2022 |
Journal | ACM Transactions on Mathematical Software |
Volume | 48 |
Issue | 2 |
Number | 19 |
Pagination | 1–31 |
Date Published | 05/2022 |
Publisher | Association for Computing Machinery (ACM) |
ISSN | 0098-3500 |
DOI | 10.1145/3503925 |
PhD Thesis
Explaining News Spreading Phenomena in Social Networks
In Technische Universität Berlin. Vol. PhD, 2022. Status: Published
Explaining News Spreading Phenomena in Social Networks
Affiliation | Machine Learning |
Project(s) | Department of High Performance Computing |
Publication Type | PhD Thesis |
Year of Publication | 2022 |
Degree awarding institution | Technische Universität Berlin |
Degree | PhD |
SmartIO: Device sharing and memory disaggregation in PCIe clusters using non-transparent bridging
In The University of Oslo. Vol. PhD. University of Oslo (UiO), 2022. Status: Published
SmartIO: Device sharing and memory disaggregation in PCIe clusters using non-transparent bridging
Distributed and parallel computing applications are becoming increasingly compute-heavy and data-driven, accelerating the need for disaggregation solutions that enable sharing of I/O resources between networked machines. For example, in a heterogeneous computing cluster, different machines may have different devices available to them, but distributing I/O resources in a way that maximizes both resource utilization and overall cluster performance is a challenge. To facilitate device sharing and memory disaggregation among machines connected using PCIe non-transparent bridges, we present SmartIO. SmartIO makes all machines in the cluster, including their internal devices and memory, part of a common PCIe domain. By leveraging the memory mapping capabilities of non-transparent bridges, remote resources may be used directly, as if these resources were local to the machines using them. Whether devices are local or remote is made transparent by SmartIO. NVMes, GPUs, FPGAs, NICs, and any other PCIe device can be dynamically shared with and distributed to remote machines, and it is even possible to disaggregate devices and memory, in order to share component parts with multiple machines at the same time. Software is entirely removed from the performance-critical path, allowing remote resources to be used with native PCIe performance. To demonstrate that SmartIO is an efficient solution, we have performed a comprehensive evaluation consisting of a wide range of performance experiments, including both synthetic benchmarks and realistic, large-scale workloads. Our experimental results show that remote resources can be used without any performance overhead compared to using local resources, in terms of throughput and latency. Thus, compared to existing disaggregation solutions, SmartIO provides more efficient, low-cost resource sharing, increasing the overall system performance and resource utilization.
Affiliation | Communication Systems |
Project(s) | Unified PCIe IO: Unified PCI Express for Distributed Component Virtualization, Department of Holistic Systems, Department of High Performance Computing |
Publication Type | PhD Thesis |
Year of Publication | 2022 |
Degree awarding institution | The University of Oslo |
Degree | PhD |
Number of Pages | 236 |
Date Published | 10/2022 |
Publisher | University of Oslo (UiO) |
Thesis Type | Paper Collection |
URL | https://www.duo.uio.no/handle/10852/97351 |
Proceedings, refereed
Efficient Minimum Weight Vertex Cover Heuristics Using Graph Neural Networks
In 20th International Symposium on Experimental Algorithms (SEA 2022). Vol. 233. Dagstuhl, Germany: Schloss Dagstuhl – Leibniz-Zentrum für Informatik, 2022.Status: Published
Efficient Minimum Weight Vertex Cover Heuristics Using Graph Neural Networks
Affiliation | Scientific Computing |
Project(s) | Department of High Performance Computing |
Publication Type | Proceedings, refereed |
Year of Publication | 2022 |
Conference Name | 20th International Symposium on Experimental Algorithms (SEA 2022) |
Volume | 233 |
Pagination | 12:1–12:17 |
Publisher | Schloss Dagstuhl – Leibniz-Zentrum für Informatik |
Place Published | Dagstuhl, Germany |
ISBN Number | 978-3-95977-251-8 |
ISSN Number | 1868-8969 |
URL | https://drops.dagstuhl.de/opus/volltexte/2022/16546 |
DOI | 10.4230/LIPIcs.SEA.2022.12 |
Implementing Spatio-Temporal Graph Convolutional Networks on Graphcore IPUs
In 2022 IEEE International Parallel and Distributed Processing Symposium Workshops (IPDPSW). Lyon, France: IEEE, 2022.Status: Published
Implementing Spatio-Temporal Graph Convolutional Networks on Graphcore IPUs
Affiliation | Machine Learning |
Project(s) | Department of High Performance Computing |
Publication Type | Proceedings, refereed |
Year of Publication | 2022 |
Conference Name | 2022 IEEE International Parallel and Distributed Processing Symposium Workshops (IPDPSW) |
Pagination | 45-54 |
Publisher | IEEE |
Place Published | Lyon, France |
URL | https://ieeexplore.ieee.org/document/9835385/ |
DOI | 10.1109/IPDPSW55747.2022.00016 |
Host Bypassing: Let your GPU speak Ethernet
In IEEE 8th International Conference on Network Softwarization (NetSoft). IEEE, 2022.Status: Published
Host Bypassing: Let your GPU speak Ethernet
Hardware acceleration of network functions is essential to meet the challenging Quality of Service requirements in today's computer networks. Graphics Processing Units (GPUs) are a widely deployed technology that can also be used for computing tasks, including the acceleration of network functions. In this work, we demonstrate how commodity GPUs, which do not provide any network interfaces, can be used to accelerate network functions. Our approach leverages PCIe peer-to-peer capabilities and allows the GPU to control the network interface card directly, without any assistance from the operating system or a control application. The presented evaluation results demonstrate the feasibility of our approach and its performance of up to 10 Gbit/s, even for small packets.
Affiliation | Communication Systems |
Project(s) | Unified PCIe IO: Unified PCI Express for Distributed Component Virtualization, Department of Holistic Systems, Department of High Performance Computing |
Publication Type | Proceedings, refereed |
Year of Publication | 2022 |
Conference Name | IEEE 8th International Conference on Network Softwarization (NetSoft) |
Pagination | 85-90 |
Date Published | 06/2022 |
Publisher | IEEE |
ISBN Number | 978-1-6654-0694-9 |
URL | https://ieeexplore.ieee.org/document/9844090 |
DOI | 10.1109/NetSoft54395.2022.9844090 |
A Streaming System for Large-scale Temporal Graph Mining of Reddit Data
In 2022 IEEE International Parallel and Distributed Processing Symposium Workshops (IPDPSW). Lyon, France: IEEE, 2022.Status: Published
A Streaming System for Large-scale Temporal Graph Mining of Reddit Data
Affiliation | Scientific Computing, Machine Learning |
Project(s) | Department of High Performance Computing, Enabling Graph Neural Networks at Exascale |
Publication Type | Proceedings, refereed |
Year of Publication | 2022 |
Conference Name | 2022 IEEE International Parallel and Distributed Processing Symposium Workshops (IPDPSW) |
Pagination | 1153-1162 |
Publisher | IEEE |
Place Published | Lyon, France |
URL | https://ieeexplore.ieee.org/document/9835250/ |
DOI | 10.1109/IPDPSW55747.2022.00189 |
Publications
Journal Article
A cell-based framework for modeling cardiac mechanics
Biomechanics and Modeling in Mechanobiology (2023).Status: Published
A cell-based framework for modeling cardiac mechanics
Cardiomyocytes are the functional building blocks of the heart – yet most models developed to simulate cardiac mechanics do not represent the individual cells and their surrounding matrix. Instead, they work on a homogenized tissue level, assuming that cellular and subcellular structures and processes scale uniformly. Here we present a mathematical and numerical framework for exploring tissue level cardiac mechanics on a microscale given an explicit three-dimensional geometrical representation of cells embedded in a matrix. We defined a mathematical model over such a geometry, and parametrized our model using publicly available data from tissue stretching and shearing experiments. We then used the model to explore mechanical differences between the extracellular and the intracellular space. Through sensitivity analysis, we found the stiffness in the extracellular matrix to be most important for the intracellular stress values under contraction. Strain and stress values were observed to follow a normal-tangential pattern concentrated along the membrane, with substantial spatial variations both under contraction and stretching. We also examined how the framework scales to larger simulations over multicellular domains. Our work extends existing continuum models, providing a new geometry-based framework for exploring complex cell-cell and cell-matrix interactions.
Affiliation | Scientific Computing |
Project(s) | Department of Computational Physiology |
Publication Type | Journal Article |
Year of Publication | 2023 |
Journal | Biomechanics and Modeling in Mechanobiology |
Date Published | 01/2023 |
Publisher | Springer |
Keywords | Cardiac Mechanics, cardiomyocyte contraction, cell geometries, intracellular and extracellular mechanics, microscale modeling |
A cell-based framework for modeling cardiac mechanics
Biomechanics and Modeling in Mechanobiology (2023).Status: Published
A cell-based framework for modeling cardiac mechanics
Affiliation | Scientific Computing |
Project(s) | Department of Computational Physiology |
Publication Type | Journal Article |
Year of Publication | 2023 |
Journal | Biomechanics and Modeling in Mechanobiology |
Date Published | 01/2023 |
Publisher | Springer |
ISSN | 1617-7959 |
Keywords | Cardiac Mechanics, cardiomyocyte contraction, cell geometries, intracellular and extracellular mechanics, microscale modeling |
URL | https://link.springer.com/article/10.1007/s10237-022-01660-8 |
DOI | 10.1007/s10237-022-01660-8 |
Detailed Modeling of Heterogeneous and Contention-Constrained Point-to-Point MPI Communication
IEEE Transactions on Parallel and Distributed Systems 34, no. 5 (2023): 1580-1593.Status: Published
Detailed Modeling of Heterogeneous and Contention-Constrained Point-to-Point MPI Communication
The network topology of modern parallel computing systems is inherently heterogeneous, with a variety of latency and bandwidth values. Moreover, contention for the bandwidth can exist on different levels when many processes communicate with each other. Many-pair, point-to-point MPI communication is thus characterized by heterogeneity and contention, even on a cluster of homogeneous multicore CPU nodes. To get a detailed understanding of the individual communication cost per MPI process, we propose a new modeling methodology that incorporates both heterogeneity and contention. First, we improve the standard max-rate model to better quantify the actually achievable bandwidth depending on the number of MPI processes in competition. Then, we make a further extension that models the bandwidth contention in more detail for cases where the competing MPI processes have different numbers of neighbors and non-uniform message sizes. Thereafter, we include more flexibility by considering interactions between intra-socket and inter-socket messaging. Through a series of experiments done on different processor architectures, we show that the new heterogeneous and contention-constrained performance models can adequately explain the individual communication cost associated with each MPI process. The largest test of realistic point-to-point MPI communication involves 8,192 processes and in total 2,744,632 simultaneous messages over 64 dual-socket AMD Epyc Rome compute nodes connected by InfiniBand, for which the overall prediction accuracy achieved is 84%.
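The max-rate starting point can be sketched as follows (a simplified illustration with our own parameter names; the paper's refined models add per-neighbor and per-socket detail): with k processes competing for a node's injection bandwidth, each process is capped both by its single-process rate and by an equal share of the node maximum.

```python
# Simplified max-rate sketch (ours): predicted time for one of k concurrent
# point-to-point messages. alpha is the latency, B1 the bandwidth achievable
# by a single process, Bmax the node's maximum aggregate injection rate.

def ptp_time(msg_bytes, k, alpha, B1, Bmax):
    eff_bw = min(B1, Bmax / k)   # contention-constrained effective bandwidth
    return alpha + msg_bytes / eff_bw

# Illustrative numbers: one process gets B1 = 3 GB/s; with eight processes,
# a 12 GB/s node cap limits each to 1.5 GB/s.
t1 = ptp_time(1e6, 1, 1e-6, 3e9, 12e9)
t8 = ptp_time(1e6, 8, 1e-6, 3e9, 12e9)
```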
Affiliation | Scientific Computing |
Project(s) | Department of High Performance Computing, SparCity: An Optimization and Co-design Framework for Sparse Computation |
Publication Type | Journal Article |
Year of Publication | 2023 |
Journal | IEEE Transactions on Parallel and Distributed Systems |
Volume | 34 |
Issue | 5 |
Pagination | 1580 - 1593 |
Date Published | 03/2023 |
Publisher | IEEE |
ISSN | 1045-9219 |
URL | https://ieeexplore.ieee.org/document/10064025 |
DOI | 10.1109/TPDS.2023.3253881 |
Enabling unstructured-mesh computation on massively tiled AI processors: An example of accelerating in silico cardiac simulation
Frontiers in Physics 11 (2023).Status: Published
Enabling unstructured-mesh computation on massively tiled AI processors: An example of accelerating in silico cardiac simulation
A new trend in processor architecture design is the packaging of thousands of small processor cores into a single device, where there is no device-level shared memory but each core has its own local memory. Thus, both the work and data of an application code need to be carefully distributed among the small cores, also termed tiles. In this paper, we investigate how numerical computations that involve unstructured meshes can be efficiently parallelized and executed on a massively tiled architecture. Graphcore IPUs are chosen as the target hardware platform, to which we port an existing monodomain solver that simulates cardiac electrophysiology over realistic 3D irregular heart geometries. There are two computational kernels in this simulator, where a 3D diffusion equation is discretized over an unstructured mesh and numerically approximated by repeatedly executing sparse matrix-vector multiplications (SpMVs), whereas an individual system of ordinary differential equations (ODEs) is explicitly integrated per mesh cell. We demonstrate how a new style of programming that uses Poplar/C++ can be used to port these commonly encountered computational tasks to Graphcore IPUs. In particular, we describe a per-tile data structure that is adapted to facilitate the inter-tile data exchange needed for parallelizing the SpMVs. We also study the achievable performance of the ODE solver that heavily depends on special mathematical functions, as well as their accuracy on Graphcore IPUs. Moreover, topics related to using multiple IPUs and performance analysis are addressed. In addition to demonstrating an impressive level of performance that can be achieved by IPUs for monodomain simulation, we also provide a discussion on the generic theme of parallelizing and executing unstructured-mesh multiphysics computations on massively tiled hardware.
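The communication pattern behind the per-tile SpMV can be rendered as a toy Python sketch (ours; the actual implementation is Poplar/C++): each tile owns a block of matrix rows and of the vector x, and before multiplying it gathers only the remote x-entries its rows reference.

```python
# Toy sketch of tile-partitioned SpMV with halo exchange. Each tile t owns
# row_blocks[t] (rows as lists of (global_col, value) pairs) and x_blocks[t]
# (its slice of x). block_of maps a global column index to (tile, offset).

def tiled_spmv(row_blocks, x_blocks, block_of):
    y_blocks = []
    for rows in row_blocks:
        # Gather the halo: x-values this tile's rows reference, fetched
        # from the owning tiles (the inter-tile data exchange).
        halo = {}
        for row in rows:
            for col, _ in row:
                t, off = block_of(col)
                halo[col] = x_blocks[t][off]
        # Local multiply using only owned rows and the gathered halo.
        y_blocks.append([sum(v * halo[c] for c, v in row) for row in rows])
    return y_blocks
```

A small usage example: with x = [1, 2, 3, 4] split in blocks of two across two tiles and `block_of = lambda c: (c // 2, c % 2)`, each tile computes its slice of y after exchanging at most two halo values.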
Affiliation | Scientific Computing |
Project(s) | Department of High Performance Computing, SparCity: An Optimization and Co-design Framework for Sparse Computation |
Publication Type | Journal Article |
Year of Publication | 2023 |
Journal | Frontiers in Physics |
Volume | 11 |
Date Published | 03/2023 |
Publisher | Frontiers |
ISSN | 2296-424X |
Keywords | hardware accelerator, heterogenous computing, irregular meshes, scientific computation, scientific computation on MIMD processors, sparse matrix-vector multiplication (SpMV) |
URL | https://www.frontiersin.org/articles/10.3389/fphy.2023.979699/full |
DOI | 10.3389/fphy.2023.979699 |
Talks, contributed
Modeling cardiac mechanics using a cell-based framework
In 15th World Congress on Computational Mechanics (WCCM-XV), Yokohama, Japan. 15th World Congress on Computational Mechanics (WCCM-XV), 2022.Status: Published
Modeling cardiac mechanics using a cell-based framework
Cardiac tissue primarily consists of interconnected cardiac cells which contract in a synchronized manner as the heart beats. Most computational models of cardiac tissue, however, homogenize out the individual cells and their surroundings. This approach has been immensely useful for describing cardiac mechanics on an overall level, but gives very limited understanding of the interaction between individual cells and their intermediate surroundings. Several models have been developed for single cells, see e.g. [1, 2]. In this work, we extend the mechanical part of these frameworks to a domain representing multiple cells, allowing us to investigate cell-matrix and cell-cell interactions. We present a mechanical model in which each cell and the extracellular matrix have an explicit geometrical representation, similar to the electrophysiological model presented in [3]. The strain energy functions are defined separately for each of the intracellular and extracellular subdomains, while we assume continuity of displacement and stresses along the membrane. Active tension is only assigned to the intracellular subdomain. For each state, we find an equilibrium solution using the finite element method. We explore passive and active mechanics for a single cell surrounded by an extracellular matrix and for small collections of cells combined into tissue blocks. The explicit geometric representation gives rise to highly varying strain and stress patterns. We show that the extracellular matrix stiffness highly influences the cardiomyocyte stresses during contraction. Through large-scale simulations enabled by high-performance computing, we also demonstrate that our model can be scaled to small collections of cells, resembling small cardiac tissue samples.
[1] Tracqui, T. and Ohayon, J. An integrated formulation of anisotropic force–calcium relations driving spatio-temporal contractions of cardiac myocytes. Philosophical Transactions of the Royal Society A: Mathematical, Physical and Engineering Sciences (2009).
[2] Ruiz-Baier, R., Gizzi, A., Rossi, S., Cherubini, C., Laadhari, A., Filippi, S. and Quarteroni, A. Mathematical modelling of active contraction in isolated cardiomyocytes. Mathematical Medicine and Biology (2014).
[3] Tveito, A., Jæger, K.H., Kuchta, M., Mardal, K.-A. and Rognes, M.E. A cell-based framework for numerical modeling of electrical conduction in cardiac tissue. Frontiers in Physics (2017).
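The model structure described in the abstract can be summarized compactly (a sketch with our own notation, not the talk's exact formulation): separate strain energies per subdomain, active tension assigned only intracellularly, and continuity of displacement and traction across the membrane.

```latex
% P: first Piola-Kirchhoff stress, F: deformation gradient, u: displacement,
% Psi_i / Psi_e: intracellular / extracellular strain energies,
% T_a: active tension, [[.]]: jump across the membrane Gamma.
\begin{aligned}
  \mathbf{P} &= \frac{\partial \Psi_i}{\partial \mathbf{F}}
               + \mathbf{P}_{\mathrm{active}}(T_a)
               && \text{in } \Omega_i \ \text{(intracellular)}, \\
  \mathbf{P} &= \frac{\partial \Psi_e}{\partial \mathbf{F}}
               && \text{in } \Omega_e \ \text{(extracellular)}, \\
  \llbracket \mathbf{u} \rrbracket &= \mathbf{0}, \quad
  \llbracket \mathbf{P}\,\mathbf{N} \rrbracket = \mathbf{0}
               && \text{on } \Gamma \ \text{(membrane)}.
\end{aligned}
```

For each state, the equilibrium solution of this system is then found with the finite element method, as described above.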
Affiliation | Scientific Computing |
Project(s) | Department of Computational Physiology |
Publication Type | Talks, contributed |
Year of Publication | 2022 |
Location of Talk | 15th World Congress on Computational Mechanics (WCCM-XV), Yokohama, Japan |
Publisher | 15th World Congress on Computational Mechanics (WCCM-XV) |
Type of Talk | Contributed |
Keywords | cardiomyocyte contraction, cell-based geometries, intracellular and extracellular mechanics, microscale cardiac mechanics |
URL | https://prezi.com/view/uGIK0kQvrZ6G1CNOkc73/ |
Journal Article
On memory traffic and optimisations for low-order finite element assembly algorithms on multi-core CPUs
ACM Transactions on Mathematical Software 48, no. 2 (2022): 1-31.Status: Published
On memory traffic and optimisations for low-order finite element assembly algorithms on multi-core CPUs
Motivated by the wish to understand the achievable performance of finite element assembly on unstructured computational meshes, we dissect the standard cellwise assembly algorithm into four kernels, two of which are dominated by irregular memory traffic. Several optimisation schemes are studied together with associated lower and upper bounds on the estimated memory traffic volume. Apart from properly reordering the mesh entities, the two most significant optimisations include adopting a lookup table in adding element matrices or vectors to their global counterparts, and using a row-wise assembly algorithm for multi-threaded parallelisation. Rigorous benchmarking shows that, due to the various optimisations, the actual volumes of memory traffic are in many cases very close to the estimated lower bounds. These results confirm the effectiveness of the optimisations, while also providing a recipe for developing efficient software for finite element assembly.
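The lookup-table optimisation can be sketched in a few lines (our reconstruction of the idea, not the paper's code): precompute, once per mesh, where each local element-matrix entry lands in the global CSR value array, so that assembly becomes a table-driven scatter-add with no searching in the inner loop.

```python
# Sketch of table-driven cellwise assembly into a CSR matrix.

def build_lookup(cells, row_ptr, col_idx):
    """For each cell, map its local (i, j) entries to CSR value positions."""
    table = []
    for dofs in cells:  # dofs: global DoF numbers of this cell
        pos = []
        for i in dofs:
            for j in dofs:
                # locate column j within row i of the sparsity pattern
                k = row_ptr[i] + col_idx[row_ptr[i]:row_ptr[i + 1]].index(j)
                pos.append(k)
        table.append(pos)
    return table

def assemble(cells, element_matrices, table, nnz):
    vals = [0.0] * nnz
    for dofs, A_e, pos in zip(cells, element_matrices, table):
        flat = (a for row in A_e for a in row)   # row-major local matrix
        for p, a in zip(pos, flat):
            vals[p] += a                          # table-driven scatter-add
    return vals
```

The table costs memory but removes the per-entry index search from the assembly loop, which is one way irregular memory traffic can be traded for streaming access.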
Affiliation | Scientific Computing |
Project(s) | Meeting Exascale Computing with Source-to-Source Compilers, Department of High Performance Computing |
Publication Type | Journal Article |
Year of Publication | 2022 |
Journal | ACM Transactions on Mathematical Software |
Volume | 48 |
Issue | 2 |
Number | 19 |
Pagination | 1–31 |
Date Published | 05/2022 |
Publisher | Association for Computing Machinery (ACM) |
ISSN | 0098-3500 |
DOI | 10.1145/3503925 |
Resource-efficient use of modern processor architectures for numerically solving cardiac ionic cell models
Frontiers in Physiology 13 (2022).Status: Published
Resource-efficient use of modern processor architectures for numerically solving cardiac ionic cell models
A central component in simulating cardiac electrophysiology is the numerical solution of nonlinear ordinary differential equations, also called cardiac ionic cell models, that describe cross-cell-membrane ion transport. Biophysically detailed cell models often require a considerable amount of computation, including calls to special mathematical functions. This paper systematically studies how to efficiently use modern multicore CPUs for this costly computational task. We start by investigating the code restructurings needed to effectively enable compiler-supported SIMD vectorisation, which is the most important performance booster in this context. It is found that suitable OpenMP directives are sufficient for achieving both vectorisation and parallelisation. We then continue with an evaluation of the performance optimisation technique of using lookup tables. Due to increased challenges for automated vectorisation, the obtainable benefits of lookup tables are dependent on the hardware platforms chosen. Throughout the study, we report detailed time measurements obtained on Intel Xeon, Xeon Phi, AMD Epyc and two ARM processors including Fujitsu A64FX, while attention is also paid to the impact of SIMD vectorisation and lookup tables on the computational accuracy. As a realistic example, the benefits of performance enhancement are demonstrated by a 10^9-run ensemble on the OakForest-PACS system, where code restructurings and SIMD vectorisation yield an 84% reduction in computing time, corresponding to 63,270 node hours.
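The lookup-table technique evaluated in the paper can be illustrated as follows (our own toy version; the argument range and table size are hypothetical): an expensive special function, here exp, is replaced by linear interpolation in a precomputed table, trading accuracy and memory for fewer costly function calls.

```python
import math

def make_exp_table(lo, hi, n):
    """Tabulate exp on [lo, hi] at n+1 points; return an interpolating stand-in."""
    h = (hi - lo) / n
    ys = [math.exp(lo + i * h) for i in range(n + 1)]
    def fast_exp(x):
        t = (x - lo) / h
        i = min(int(t), n - 1)
        w = t - i
        return (1 - w) * ys[i] + w * ys[i + 1]   # linear interpolation
    return fast_exp

# Hypothetical range covering typical cell-model arguments.
fast_exp = make_exp_table(-100.0, 100.0, 400000)
```

The relative interpolation error scales with the square of the table spacing, which is why the paper's observed accuracy impact depends on how the table is sized.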
Affiliation | Scientific Computing |
Project(s) | Department of Computational Physiology, MicroCard: Numerical modeling of cardiac electrophysiology at the cellular scale |
Publication Type | Journal Article |
Year of Publication | 2022 |
Journal | Frontiers in Physiology |
Volume | 13 |
Date Published | 06/2022 |
Publisher | Frontiers |
ISSN | 1664-042X |
URL | https://www.frontiersin.org/article/10.3389/fphys.2022.904648 |
DOI | 10.3389/fphys.2022.904648 |
Poster
Automated Code Generation for GPU-Based Finite Element Computations in FEniCS
SIAM Conference on Computational Science and Engineering (CSE21): SIAM, 2021.Status: Published
Automated Code Generation for GPU-Based Finite Element Computations in FEniCS
Developing high-performance finite element codes normally requires hand-crafting and fine tuning of computational kernels, which is not an easy task to carry out for each and every problem. Automated code generation has proved to be a highly productive alternative for frameworks like FEniCS, where a compiler is used to automatically generate suitable kernels from high-level mathematical descriptions of finite element problems. This strategy has so far enabled users to develop and run a variety of high-performance finite element solvers on clusters of multicore CPUs. We have recently enhanced FEniCS with GPU acceleration by enabling its internal compiler to generate CUDA kernels that are needed to offload finite element calculations to GPUs, particularly the assembly of linear systems. This poster presents the results of GPU-accelerating FEniCS and explores performance characteristics of auto-generated CUDA kernels and GPU-based assembly of linear systems for finite element methods.
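To give a flavor of what such an auto-generated element kernel computes, here is a hand-written stand-in (ours, not actual FFC or CUDA output): the local P1 mass matrix on one triangle, the kind of small dense kernel that assembly repeats for every cell.

```python
# P1 (linear) mass matrix on a single triangle: M = (area / 12) *
# [[2, 1, 1], [1, 2, 1], [1, 1, 2]]. This is the exact closed form for
# linear basis functions, evaluated here from the vertex coordinates.

def p1_mass_matrix(coords):
    (x0, y0), (x1, y1), (x2, y2) = coords
    area = 0.5 * abs((x1 - x0) * (y2 - y0) - (x2 - x0) * (y1 - y0))
    scale = area / 12.0
    return [[scale * (2.0 if i == j else 1.0) for j in range(3)]
            for i in range(3)]

M = p1_mass_matrix([(0.0, 0.0), (1.0, 0.0), (0.0, 1.0)])
```

In the GPU-accelerated setting described above, one generated kernel of this kind is executed per cell, and the resulting local matrices are scattered into the global linear system on the device.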
Affiliation | Scientific Computing |
Project(s) | Meeting Exascale Computing with Source-to-Source Compilers |
Publication Type | Poster |
Year of Publication | 2021 |
Date Published | 03/2021 |
Publisher | SIAM |
Place Published | SIAM Conference on Computational Science and Engineering (CSE21) |
Journal Article
Efficient numerical solution of the EMI model representing the extracellular space (E), cell membrane (M) and intracellular space (I) of a collection of cardiac cells
Frontiers in Physics 8 (2021): 579461.Status: Published
Efficient numerical solution of the EMI model representing the extracellular space (E), cell membrane (M) and intracellular space (I) of a collection of cardiac cells
The EMI model represents excitable cells in a more accurate manner than traditional homogenized models at the price of increased computational complexity. The increased complexity of solving the EMI model stems from a significant increase in the number of computational nodes and from the form of the linear systems that need to be solved. Here, we will show that the latter problem can be solved by careful use of operator splitting of the spatially coupled equations. By using this method, the linear systems can be broken into sub-problems that are of the classical type of linear, elliptic boundary value problems. Therefore, the vast collection of methods for solving linear, elliptic partial differential equations can be used. We demonstrate that this enables us to solve the systems using shared-memory parallel computers. The computing time scales perfectly with the number of physical cells. For a collection of 512×256 cells, we manage to solve linear systems with about 2.5×10^8 unknowns. Since the computational effort scales linearly with the number of physical cells, we believe that larger computers can be used to simulate millions of excitable cells and thus allow careful analysis of physiological systems of great importance.
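The splitting structure can be sketched schematically (ours, in 1D with a deliberately simple solver): each time step first advances the node-local membrane dynamics, then solves a classical elliptic problem, here (I - dt·Δ_h)u = v by Jacobi iteration standing in for any standard elliptic solver.

```python
# Schematic operator-splitting step: ODE sub-step, then elliptic sub-step.

def jacobi_solve(b, dt, iters=400):
    """Solve (I - dt * Delta_h) u = b on a 1D grid, zero Dirichlet ends."""
    n, u = len(b), list(b)
    for _ in range(iters):
        u = [(b[i] + dt * ((u[i - 1] if i > 0 else 0.0) +
                           (u[i + 1] if i < n - 1 else 0.0))) / (1 + 2 * dt)
             for i in range(n)]
    return u

def split_step(v, membrane_rhs, dt):
    # 1) membrane/ODE sub-step: node-local, embarrassingly parallel
    v = [vi + dt * membrane_rhs(vi) for vi in v]
    # 2) elliptic sub-step: any Poisson-type solver applies here
    return jacobi_solve(v, dt)
```

The point of the splitting is visible in the structure: step 1 parallelises trivially over nodes, and step 2 is a textbook elliptic solve for which mature parallel methods exist.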
Affiliation | Scientific Computing |
Project(s) | Department of Computational Physiology, Department of High Performance Computing |
Publication Type | Journal Article |
Year of Publication | 2021 |
Journal | Frontiers in Physics |
Volume | 8 |
Pagination | 579461 |
Publisher | Frontiers |
URL | https://www.frontiersin.org/articles/10.3389/fphy.2020.579461/full |
DOI | 10.3389/fphy.2020.579461 |
On the impact of heterogeneity-aware mesh partitioning and non-contributing computation removal on parallel reservoir simulations
Journal of Mathematics in Industry 11 (2021).Status: Published
On the impact of heterogeneity-aware mesh partitioning and non-contributing computation removal on parallel reservoir simulations
Parallel computations have become standard practice for simulating the complicated multi-phase flow in a petroleum reservoir. Increasingly sophisticated numerical techniques have been developed in this context. During the chase of algorithmic superiority, however, there is a risk of forgetting the ultimate goal, namely, to efficiently simulate real-world reservoirs on realistic parallel hardware platforms. In this paper, we quantitatively analyse the negative performance impact caused by non-contributing computations that are associated with the “ghost computational cells” per subdomain, which is an insufficiently studied subject in parallel reservoir simulation. We also show how these non-contributing computations can be avoided by reordering the computational cells of each subdomain, such that the ghost cells are grouped together. Moreover, we propose a new graph-edge weighting scheme that can improve the mesh partitioning quality, aiming at a balance between handling the heterogeneity of geological properties and restricting the communication overhead. To put the study in a realistic setting, we enhance the open-source Flow simulator from the OPM framework, and provide comparisons with industrial-standard simulators for real-world reservoir models.
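The cell-reordering idea can be sketched in a few lines (ours, with hypothetical cell identifiers): placing a subdomain's owned cells first and its ghost cells last lets local per-cell kernels simply stop at the owned/ghost boundary instead of testing every cell.

```python
# Reorder a subdomain's cells so ghost cells form a contiguous tail.

def reorder_ghosts_last(cells, is_ghost):
    owned = [c for c in cells if not is_ghost(c)]
    ghosts = [c for c in cells if is_ghost(c)]
    # n_owned is the loop bound for computations that should skip ghosts.
    return owned + ghosts, len(owned)

cells = [3, 7, 1, 9, 4]                      # hypothetical global cell ids
new_order, n_owned = reorder_ghosts_last(cells, lambda c: c in {7, 9})
# local kernels now iterate over new_order[:n_owned] only
```

This is how the non-contributing computations associated with ghost cells are avoided: the work loop simply never touches the ghost tail, while the ghost values remain available for communication.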
Affiliation | Scientific Computing |
Project(s) | Department of High Performance Computing |
Publication Type | Journal Article |
Year of Publication | 2021 |
Journal | Journal of Mathematics in Industry |
Volume | 11 |
Date Published | 06/2021 |
Publisher | Springer |
URL | https://mathematicsinindustry.springeropen.com/articles/10.1186/s13362-0... |
DOI | 10.1186/s13362-021-00108-5 |
Proceedings, refereed
iPUG for multiple Graphcore IPUs: Optimizing performance and scalability of parallel breadth-first search
In 28th IEEE International Conference on High Performance Computing, Data, & Analytics (HiPC). Bangalore, India: IEEE, 2021.Status: Published
iPUG for multiple Graphcore IPUs: Optimizing performance and scalability of parallel breadth-first search
Parallel graph algorithms have become one of the principal applications of high-performance computing besides numerical simulations and machine learning workloads. However, due to their highly unstructured nature, graph algorithms remain extremely challenging for most parallel systems, with large gaps between observed performance and theoretical limits. Furthermore, most mainstream architectures rely heavily on single instruction multiple data (SIMD) processing for high floating-point rates, which is not beneficial for graph processing, which instead requires high memory bandwidth, low memory latency, and efficient processing of unstructured data. On the other hand, we are currently observing an explosion of new hardware architectures, many of which are adapted to specific purposes and diverge from traditional designs. A notable example is the Graphcore Intelligence Processing Unit (IPU), which is developed to meet the needs of upcoming machine intelligence applications. Its design eschews the traditional cache hierarchy, relying on SRAM as its main memory instead. The result is an extremely high-bandwidth, low-latency memory at the cost of capacity. In addition, the IPU consists of a large number of independent cores, allowing for true multiple instruction multiple data (MIMD) processing. Together, these features suggest that such a processor is well suited for graph processing. We test the limits of graph processing on multiple IPUs by implementing a low-level, high-performance code for breadth-first search (BFS), following the specifications of Graph500, the most widely used benchmark for parallel graph processing. Despite the simplicity of the BFS algorithm, implementing efficient parallel codes for it has proven to be a challenging task in the past. We show that our implementation scales well on a system with 8 IPUs and attains roughly twice the performance of an equal number of NVIDIA V100 GPUs using state-of-the-art CUDA code.
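The algorithmic core being benchmarked is level-synchronous BFS, shown here sequentially for reference (ours; the paper's contribution is mapping exactly this frontier expansion onto many IPU tiles):

```python
# Level-synchronous breadth-first search: expand one frontier per iteration.

def bfs_levels(adj, src):
    """adj: dict mapping vertex -> list of neighbours. Returns vertex -> level."""
    level = {src: 0}
    frontier = [src]
    d = 0
    while frontier:
        d += 1
        nxt = []
        for u in frontier:
            for v in adj[u]:
                if v not in level:      # first visit determines the level
                    level[v] = d
                    nxt.append(v)
        frontier = nxt                   # one synchronised level per iteration
    return level
```

The per-level synchronisation point is what makes this formulation natural for bulk-parallel hardware: within a level, frontier vertices can be expanded independently.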
Affiliation | Scientific Computing |
Project(s) | Department of High Performance Computing, SparCity: An Optimization and Co-design Framework for Sparse Computation |
Publication Type | Proceedings, refereed |
Year of Publication | 2021 |
Conference Name | 28th IEEE International Conference on High Performance Computing, Data, & Analytics (HiPC) |
Pagination | 162-171 |
Date Published | 12/2021 |
Publisher | IEEE |
Place Published | Bangalore, India |
DOI | 10.1109/HiPC53243.2021.00030 |
Book Chapter
Operator Splitting and Finite Difference Schemes for Solving the EMI Model
In Modeling Excitable Tissue: The EMI Framework, 44-55. Vol. 7. Cham: Springer International Publishing, 2021.Status: Published
Operator Splitting and Finite Difference Schemes for Solving the EMI Model
We want to be able to perform accurate simulations of a large number of cardiac cells based on mathematical models where each individual cell is represented in the model. This implies that the computational mesh has to have a typical resolution of a few µm, leading to huge computational challenges. In this paper we use a certain operator splitting of the coupled equations and show that this leads to systems that can be solved in parallel. This opens up the possibility of simulating large numbers of coupled cardiac cells.
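The finite-difference building block underlying such schemes can be illustrated with a minimal 1D sketch (ours, not the chapter's scheme; note the explicit update requires dt ≤ h²/2 for stability):

```python
# Explicit finite-difference diffusion update on a 1D grid with reflecting
# (zero-flux) boundaries: u_new = u + dt/h^2 * (u_left - 2u + u_right).

def fd_diffusion_step(u, dt, h):
    n = len(u)
    return [u[i] + dt / (h * h) *
            ((u[i - 1] if i > 0 else u[i]) - 2 * u[i] +
             (u[i + 1] if i < n - 1 else u[i]))
            for i in range(n)]
```

Each grid point's update depends only on its immediate neighbours, which is why such stencil updates, like the split sub-problems discussed in the chapter, parallelise naturally.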
Affiliation | Scientific Computing |
Project(s) | Department of Computational Physiology, Department of High Performance Computing |
Publication Type | Book Chapter |
Year of Publication | 2021 |
Book Title | Modeling Excitable Tissue: The EMI Framework |
Volume | 7 |
Chapter | 4 |
Pagination | 44 - 55 |
Publisher | Springer International Publishing |
Place Published | Cham |
ISBN Number | 978-3-030-61156-9 |
ISBN | 2512-1677 |
URL | http://link.springer.com/content/pdf/10.1007/978-3-030-61157-6_4 |
DOI | 10.1007/978-3-030-61157-6_4 |
Journal Article
Cache simulation for irregular memory traffic on multi-core CPUs: Case study on performance models for sparse matrix–vector multiplication
Journal of Parallel and Distributed Computing 144 (2020): 189-205.Status: Published
Cache simulation for irregular memory traffic on multi-core CPUs: Case study on performance models for sparse matrix–vector multiplication
Parallel computations with irregular memory access patterns are often limited by the memory subsystems of multi-core CPUs, though it can be difficult to pinpoint and quantify performance bottlenecks precisely. We present a method for estimating volumes of data traffic caused by irregular, parallel computations on multi-core CPUs with memory hierarchies containing both private and shared caches. Further, we describe a performance model based on these estimates that applies to bandwidth-limited computations. As a case study, we consider two standard algorithms for sparse matrix–vector multiplication, a widely used, irregular kernel. Using three different multi-core CPU systems and a set of matrices that induce a range of irregular memory access patterns, we demonstrate that our cache simulation combined with the proposed performance model accurately quantifies performance bottlenecks that would not be detected using standard best- or worst-case estimates of the data traffic volume.
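The traffic-estimation idea can be illustrated with a toy cache-line-granularity LRU simulator (ours, far simpler than the paper's multi-level private/shared hierarchy): replay an access stream, count misses, and estimate traffic as misses times the line size.

```python
from collections import OrderedDict

def simulate_lru(addresses, lines, line_size=64):
    """Count misses for an address stream in an LRU cache of `lines` lines."""
    cache, misses = OrderedDict(), 0
    for a in addresses:
        tag = a // line_size
        if tag in cache:
            cache.move_to_end(tag)          # refresh recency on a hit
        else:
            misses += 1
            cache[tag] = None
            if len(cache) > lines:
                cache.popitem(last=False)   # evict least recently used
    return misses
```

For an SpMV, one would replay the (irregular) accesses to the input vector x, e.g. `addresses = [8 * j for j in col_idx]` for 8-byte values; `misses * line_size` then estimates the data traffic that best/worst-case formulas cannot pin down.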
Affiliation | Scientific Computing |
Project(s) | Meeting Exascale Computing with Source-to-Source Compilers, Department of High Performance Computing |
Publication Type | Journal Article |
Year of Publication | 2020 |
Journal | Journal of Parallel and Distributed Computing |
Volume | 144 |
Pagination | 189-205 |
Date Published | 06/2020 |
Publisher | Elsevier |
ISSN | 0743-7315 |
Keywords | AMD Epyc, Cache simulation, Intel Xeon, Performance model, Sparse matrix–vector multiplication |
URL | http://www.sciencedirect.com/science/article/pii/S0743731520302999 |
DOI | 10.1016/j.jpdc.2020.05.020 |
Poster
Efficient simulations of patient-specific electrical heart activity on the DGX-2
GPU Technology Conference (GTC) 2020, Silicon Valley, USA: Nvidia, 2020.Status: Published
Efficient simulations of patient-specific electrical heart activity on the DGX-2
Patients who have suffered a heart attack have an elevated risk of developing arrhythmia. The use of computer simulations of the electrical activity in the hearts of these patients is emerging as an alternative to traditional, more invasive examinations performed by doctors today. Recent advances in personalised arrhythmia risk prediction show that computational models can provide not only safer but also more accurate results than invasive procedures. However, biophysically accurate simulations of the electrical activity in the heart require solving linear systems over fine meshes and time resolutions, which can take hours or even days. This limits the use of such simulations in the clinic, where diagnosis and treatment planning can be time-sensitive, if only because of operating schedules. Furthermore, the non-interactive, non-intuitive way of accessing simulations and their results makes it hard to study them collaboratively. Overcoming these limitations requires speeding up computations from hours to seconds, which demands a massive increase in computational capabilities.
We have developed a code that is capable of performing highly efficient heart simulations on the DGX-2, making use of all 16 V100 GPUs. Using a patient-specific unstructured tetrahedral mesh with 11.7 million cells, we are able to simulate the electrical heart activity at 1/30 of real-time. Moreover, we are able to show that the throughput achieved using all 16 GPUs in the DGX-2 is 77.6% of the theoretical maximum.
We achieved this through extensive optimisations of the two kernels constituting the body of the main loop in the simulator. In the kernel solving the diffusion equation (governing the spread of the electrical signal), which consists of a sparse matrix-vector multiplication, we minimise the memory traffic by reordering the mesh (and matrix) elements into clusters that fit in the V100's L2 cache. In the kernel solving the cell model (describing the complex interactions of ion channels in the cell membrane), we apply sophisticated domain-specific optimisations to reduce the number of floating point operations to the point where the kernel becomes memory bound. After optimisation, both kernels are memory bound, and we derive the minimum memory traffic, which we then divide by the aggregate memory bandwidth to obtain a lower bound on the execution time.
Topics discussed include optimisations for sparse matrix-vector multiplications, strategies for handling inter-device communication for unstructured meshes, and lessons we learnt while programming the DGX-2.
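The final step described above, dividing minimum memory traffic by aggregate bandwidth, amounts to a simple roofline-style bound. A numerical sketch, assuming the V100's nominal 900 GB/s HBM2 bandwidth (an assumption; the actual analysis would use measured bandwidth):

```python
def lower_bound_seconds(min_traffic_bytes, n_gpus=16, bw_per_gpu=900e9):
    """Roofline-style lower bound on the execution time of a
    memory-bound kernel: unavoidable memory traffic divided by
    aggregate memory bandwidth.  Bandwidth value is nominal."""
    return min_traffic_bytes / (n_gpus * bw_per_gpu)

# E.g. 1 GB of unavoidable traffic per time step across a DGX-2:
t_min = lower_bound_seconds(1e9)
```

The 77.6% figure quoted above is a ratio of this kind: achieved throughput relative to a bandwidth-derived theoretical maximum.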
Affiliation | Scientific Computing |
Project(s) | Department of Computational Physiology, Department of High Performance Computing |
Publication Type | Poster |
Year of Publication | 2020 |
Date Published | 03/2020 |
Publisher | Nvidia |
Place Published | GPU Technology Conference (GTC) 2020, Silicon Valley, USA |
Towards detailed Organ-Scale Simulations in Cardiac Electrophysiology
GPU Technology Conference (GTC), Silicon Valley, San Jose, USA, 2020. Status: Published
Affiliation | Scientific Computing |
Project(s) | Meeting Exascale Computing with Source-to-Source Compilers, Department of High Performance Computing |
Publication Type | Poster |
Year of Publication | 2020 |
Place Published | GPU Technology Conference (GTC), Silicon Valley, San Jose, USA |
Type of Work | Poster |
Talks, invited
On parallel simulation of porous media flow
In Schlumberger Eureka Applied Math Special Interest Group Meeting, 2020. Status: Published
Parallel computing has become an indispensable tool for efficiently simulating the complicated multi-phase flows in porous media. To achieve good performance, we need to pay special attention to the foundation of parallelization, namely, how the computation is decomposed among the hardware processing units. The prerequisite mesh-partitioning problem boils down to an intricate interplay among load balance, communication overhead, and effectiveness of the resulting numerical calculation. Specifically, we quantitatively analyse the negative performance impact caused by non-contributing computations that are associated with the "ghost computational cells" per subdomain. We also show how these non-contributing computations can be avoided by reordering the computational cells of each subdomain. Moreover, we propose a new graph-edge weighting scheme that can improve the mesh partitioning quality, aiming at a balance between handling the heterogeneity of geological properties and restricting the communication overhead. The findings are applied to the open-source Flow simulator from the OPM framework (https://opm-project.org), leading to substantial improvements of the parallel performance.
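As a hypothetical illustration of such an edge-weighting scheme (not the exact formula used in Flow), one can damp the orders-of-magnitude spread of inter-cell transmissibilities logarithmically before handing integer weights to a graph partitioner:

```python
import math

def edge_weight(transmissibility, floor=1e-12):
    """Map an inter-cell transmissibility to a positive integer
    edge weight for a graph partitioner.  Logarithmic damping keeps
    many orders of magnitude of geological heterogeneity within a
    modest integer range -- a hypothetical scheme for illustration."""
    t = max(transmissibility, floor)
    # Shift so the smallest representable coefficient maps to weight 1.
    return 1 + int(round(math.log10(t / floor)))

# Strongly coupled cells get heavier edges than weakly coupled ones,
# so the partitioner prefers to cut the weak couplings.
strong = edge_weight(1e-3)   # high transmissibility
weak = edge_weight(1e-10)    # low transmissibility
```

A partitioner minimising edge cut will then tend to keep strongly coupled cells in one subdomain, which is the numerical goal described in the abstract.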
Affiliation | Scientific Computing |
Project(s) | Department of High Performance Computing |
Publication Type | Talks, invited |
Year of Publication | 2020 |
Location of Talk | Schlumberger Eureka Applied Math Special Interest Group Meeting |
Type of Talk | Invited guest talk |
Talks, contributed
Balancing the numerical and parallel performance for reservoir simulations
In SIAM Conference on Computational Science and Engineering (CSE19), Spokane, Washington, USA, 2019. Status: Published
The overall performance of a PDE-based simulator depends on two factors: the algorithmic efficiency of the numerical scheme chosen and the parallel efficiency of the software implementation. Since aspects from the two factors may influence each other's performance, a suitable balance between the two is important. The focus of this talk is on the OPM framework of oil reservoir simulation, for which the computational core is to solve the black-oil model: a coupled system of nonlinear PDEs. Due to large variations in the geological properties of a reservoir, the sparse matrix that arises from discretizing the coupled PDEs exhibits a strong heterogeneity in its nonzero values. These reflect the strength of coupling between the degrees of freedom. It is thus necessary to consider this heterogeneity in the unstructured mesh partitioning process, typically translated to partitioning a graph with weighted edges. In particular, we study the impact of different strategies of edge weighting on both the numerical and parallel performance. The ordering of the degrees of freedom, which also affects both sides, is studied in addition. Our purpose is to shed some light on a suitable mesh partitioning and ordering methodology, which is also relevant beyond the context of reservoir simulation. The issue of how to allow users of OPM to inject such flexibility into the existing software framework is also discussed.
Affiliation | Scientific Computing |
Project(s) | Department of High Performance Computing |
Publication Type | Talks, contributed |
Year of Publication | 2019 |
Location of Talk | SIAM Conference on Computational Science and Engineering (CSE19), Spokane, Washington, USA |
Keywords | HPC |
Compiling finite element variational forms for GPU-based assembly
In FEniCS'19, Washington DC, USA, 2019. Status: Published
We present an experimental form compiler for exploring GPU-based algorithms for assembling vectors, matrices, and higher-order tensors from finite element variational forms.
Previous studies by Cecka et al. (2010), Markall et al. (2013), and Reguly and Giles (2015) have explored different strategies for using GPUs for finite element assembly, demonstrating the potential rewards and highlighting some of the difficulties in offloading assembly to a GPU. Even though these studies are limited to a few specific cases, mostly related to the Poisson problem, they already indicate that to achieve high performance, the appropriate assembly strategy depends on the problem at hand and the chosen discretisation.
By using a form compiler to automatically generate code for GPU-based assembly, we can explore a range of problems based on different variational forms and finite element discretisations. In this way, we aim to get a better picture of the potential benefits and challenges of assembling finite element variational forms on a GPU. Ultimately, the goal is to explore algorithms based on automated code generation that offload entire finite element methods to a GPU, including assembly of vectors and matrices and solution of linear systems.
In this talk, we give an exact characterisation of the class of finite element variational forms supported by our compiler, comprising a small subset of the Unified Form Language that is used by FEniCS and Firedrake. Furthermore, we describe a denotational semantics that explains how expressions in the form language are translated to low-level C or CUDA code for performing assembly over a computational mesh. We also present some initial results and discuss the performance of the generated code.
Affiliation | Scientific Computing |
Project(s) | Meeting Exascale Computing with Source-to-Source Compilers, Department of High Performance Computing, Department of Numerical Analysis and Scientific Computing |
Publication Type | Talks, contributed |
Year of Publication | 2019 |
Location of Talk | FEniCS'19, Washington DC, USA |
Keywords | Code translation, GPU, HPC |
Proceedings, refereed
Combining algorithmic rethinking and AVX-512 intrinsics for efficient simulation of subcellular calcium signaling
In International Conference on Computational Science (ICCS 2019). Springer, 2019. Status: Published
Calcium signaling is vital for the contraction of the heart. Physiologically realistic simulation of this subcellular process requires nanometer resolutions and a complicated mathematical model of differential equations. Since the subcellular space is composed of several irregularly-shaped and intricately-connected physiological domains with distinct properties, one particular challenge is to correctly compute the diffusion-induced calcium fluxes between the physiological domains. The common approach is to pre-calculate the effective diffusion coefficients between all pairs of neighboring computational voxels, and store them in large arrays. Such a strategy avoids complicated if-tests when looping through the computational mesh, but suffers from substantial memory overhead. In this paper, we adopt a memory-efficient strategy that uses a small lookup table of diffusion coefficients. The memory footprint and traffic are both drastically reduced, while also avoiding the if-tests. However, the new strategy induces more instructions on the processor level. To offset this potential performance pitfall, we use AVX-512 intrinsics to effectively vectorize the code. Performance measurements on a Knights Landing processor and a quad-socket Skylake server show a clear performance advantage of the manually vectorized implementation that uses lookup tables, over the counterpart using coefficient arrays.
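The difference between the two storage strategies can be sketched as follows, with hypothetical domain labels and coefficient values (the real model distinguishes more physiological domains and uses AVX-512 intrinsics rather than Python):

```python
# Hypothetical effective diffusion coefficients between domain
# types (e.g. 0 = cytosol, 1 = SR, 2 = cleft); values are made up.
COEFF = {
    (0, 0): 0.35, (0, 1): 0.05, (0, 2): 0.10,
    (1, 1): 0.20, (1, 2): 0.02, (2, 2): 0.15,
}
# Symmetrise the lookup table once, so the inner loop needs no
# branch on the order of the voxel pair.
for (a, b), c in list(COEFF.items()):
    COEFF[(b, a)] = c

def flux(domain, conc, i, j):
    """Diffusive flux between voxels i and j: a tiny table lookup,
    indexed by the two voxels' domain labels, replaces a large
    per-pair coefficient array."""
    return COEFF[(domain[i], domain[j])] * (conc[j] - conc[i])

domain = [0, 0, 1, 2]
conc = [1.0, 0.8, 0.5, 0.2]
f = flux(domain, conc, 1, 2)   # between a type-0 and a type-1 voxel
```

A per-pair coefficient array stores one value per voxel face and grows with the mesh, whereas the lookup table stays a few dozen entries regardless of resolution, which is the memory saving the paper exploits.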
Affiliation | Scientific Computing |
Project(s) | Meeting Exascale Computing with Source-to-Source Compilers, Department of High Performance Computing |
Publication Type | Proceedings, refereed |
Year of Publication | 2019 |
Conference Name | International Conference on Computational Science (ICCS 2019) |
Pagination | 681-687 |
Publisher | Springer |
DOI | 10.1007/978-3-030-22750-0_66 |
Towards Detailed Real-Time Simulations of Cardiac Arrhythmia
In Computing in Cardiology. Vol. 46. IEEE, 2019. Status: Published
Recent advances in personalized arrhythmia risk prediction show that computational models can provide not only safer but also more accurate results than invasive procedures. However, biophysically accurate simulations require solving linear systems over fine meshes and time resolutions, which can take hours or even days. This limits the use of such simulations in the clinic, where diagnosis and treatment planning can be time sensitive, if only because of operating schedules. Furthermore, the non-interactive, non-intuitive way of accessing simulations and their results makes it hard to study them collaboratively. Overcoming these limitations requires speeding up computations from hours to seconds, which in turn demands a massive increase in computational capabilities.
Fortunately, the cost of computing has fallen dramatically in the past decade. A prominent reason for this is the recent introduction of manycore processors such as GPUs, which by now power the majority of the world's leading supercomputers. These devices owe their success to the fact that they are optimized for massively parallel workloads, such as applying similar ODE kernel computations to millions of mesh elements in scientific computing applications. Unlike CPUs, which are typically optimized for sequential performance, GPU architectures can dedicate more transistors to performing computations, thereby increasing parallel speed and energy efficiency.
Affiliation | Scientific Computing |
Project(s) | Meeting Exascale Computing with Source-to-Source Compilers |
Publication Type | Proceedings, refereed |
Year of Publication | 2019 |
Conference Name | Computing in Cardiology |
Volume | 46 |
Date Published | 12/2019 |
Publisher | IEEE |
Talks, invited
Heterogeneous computing for cardiac electrophysiology
In PREAPP workshop on Efficient Frameworks for Compute- and Data-intensive Computing (EFFECT), University of Tromsø, Norway, 2019. Status: Published
Electrical activities inside the heart are immensely important for the functioning of this vital organ. In the pursuit of a scientific understanding of the processes and mechanisms in electro-physiology, computer simulations have become an established paradigm of research. Both the complex mathematical models and the extreme physiological details require huge-scale simulations, which nowadays see an increasing use of heterogeneous computing. That is, the computational power is delivered by more than one processor type. We will discuss some of the resulting challenges in programming and performance optimization. Successful applications from the domain of cardiac electro-physiology will be used to demonstrate the usefulness of heterogeneous computing. We will also take a peek into the future of heterogeneous computing through eX3: the brand-new national infrastructure for experimental exploration of exascale computing.
Affiliation | Scientific Computing |
Project(s) | PREAPP: PRoductivity and Energy-efficiency through Abstraction-based Parallel Programming, Meeting Exascale Computing with Source-to-Source Compilers, Department of High Performance Computing |
Publication Type | Talks, invited |
Year of Publication | 2019 |
Location of Talk | PREAPP workshop on Efficient Frameworks for Compute- and Data-intensive Computing (EFFECT), University of Tromsø, Norway |
Unstructured computational meshes and data locality
In Fifth Workshop on Programming Abstractions for Data Locality (PADAL'19), Inria Bordeaux, France, 2019. Status: Published
Many scientific and engineering applications rely on unstructured computational meshes to capture the irregular shapes and intricate details involved. With respect to software implementation, unstructured meshes require indirectly-indexed, irregular accesses to data arrays. Attaining data locality in the memory hierarchy is thus challenging. This talk touches on two related topics. First, we look at the ordering/clustering of entities in an unstructured mesh with respect to cache efficiency. Second, we re-examine the currently widely-used strategy of mesh partitioning, which is based on partitioning a corresponding graph with edge-cut as the optimisation objective. Mismatches between this mainstream methodology of data decomposition and the increasingly heterogeneous computing platforms will be discussed.
Affiliation | Scientific Computing |
Project(s) | Meeting Exascale Computing with Source-to-Source Compilers, Department of High Performance Computing |
Publication Type | Talks, invited |
Year of Publication | 2019 |
Location of Talk | Fifth Workshop on Programming Abstractions for Data Locality (PADAL'19), Inria Bordeaux, France |
Keywords | data locality, HPC |
Use of modern processor architectures for computing the electrical activity in the heart
In Schlumberger Eureka Applied Math Special Interest Group Meeting, 2019. Status: Published
Recent advances in the mathematical models of electro-cardiology have prompted new requirements for computation. The computational challenges span from a detailed simulation of subcellular calcium handling, through a proper cell-tissue coupling, to a whole heart modeled on an unstructured computational mesh. The sheer size of the computation can often only be met by cutting-edge computer clusters, where the main computing power is provided by non-conventional processors (also called accelerators), such as GPUs or manycore processors. In this talk, we discuss the related issues of programming and performance engineering. The particular topics include CPU-accelerator computing, memory traffic minimization, and code vectorization.
Affiliation | Scientific Computing |
Project(s) | Department of High Performance Computing, Meeting Exascale Computing with Source-to-Source Compilers |
Publication Type | Talks, invited |
Year of Publication | 2019 |
Location of Talk | Schlumberger Eureka Applied Math Special Interest Group Meeting |
Type of Talk | Invited guest talk |
Journal Article
Performance optimization and modeling of fine-grained irregular communication in UPC
Scientific Programming 2019 (2019): Article ID 6825728. Status: Published
The UPC programming language offers parallelism via logically partitioned shared memory, which typically spans physically disjoint memory sub-systems. One convenient feature of UPC is its ability to automatically execute between-thread data movement, such that the entire content of a shared data array appears to be freely accessible by all the threads. The programmer friendliness, however, can come at the cost of substantial performance penalties. This is especially true when indirectly indexing the elements of a shared array, for which the induced between-thread data communication can be irregular and have a fine-grained pattern. In this paper we study performance enhancement strategies specifically targeting such fine-grained irregular communication in UPC. Starting from explicit thread privatization, continuing with block-wise communication, and arriving at message condensing and consolidation, we obtained considerable performance improvement of UPC programs that originally require fine-grained irregular communication. Besides the performance enhancement strategies, the main contribution of the present paper is to propose performance models for the different scenarios, in the form of quantifiable formulas that hinge on the actual volumes of various data movements plus a small number of easily obtainable hardware characteristic parameters. These performance models help to verify the enhancements obtained, while also providing insightful predictions of similar parallel implementations, not limited to UPC, that also involve between-thread or between-process irregular communication. As a further validation, we also apply our performance modeling methodology and hardware characteristic parameters to an existing UPC code for solving a 2D heat equation on a uniform mesh.
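The general shape of such volume-based performance models can be sketched as a latency-plus-bandwidth cost, with illustrative (not measured) hardware parameters:

```python
def comm_time(n_messages, total_bytes, latency=1e-6, bandwidth=10e9):
    """Latency-bandwidth cost model for between-thread data
    movement: each message pays a fixed latency, and the payload
    pays inverse bandwidth.  Parameter values are illustrative;
    the paper derives machine-specific ones."""
    return n_messages * latency + total_bytes / bandwidth

# Fine-grained: one million separate 8-byte transfers.
fine = comm_time(n_messages=10**6, total_bytes=8 * 10**6)
# Condensed and consolidated: the same payload in 100 messages.
bulk = comm_time(n_messages=100, total_bytes=8 * 10**6)
```

Condensing and consolidating messages leaves the payload term unchanged but collapses the per-message latency term, which is why the strategy pays off for fine-grained irregular communication.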
Affiliation | Scientific Computing |
Project(s) | PREAPP: PRoductivity and Energy-efficiency through Abstraction-based Parallel Programming, Meeting Exascale Computing with Source-to-Source Compilers |
Publication Type | Journal Article |
Year of Publication | 2019 |
Journal | Scientific Programming |
Volume | 2019 |
Pagination | Article ID 6825728 |
Date Published | 03/2019 |
Publisher | Hindawi |
Keywords | Fine-grained irregular communication, performance modeling, Performance optimization, Sparse matrix-vector multiplication, UPC programming language |
URL | https://www.hindawi.com/journals/sp/2019/6825728/ |
DOI | 10.1155/2019/6825728 |
Poster
Towards Detailed Real-Time Simulations of Cardiac Arrhythmia
International Conference in Computing in Cardiology, Singapore, 2019. Status: Published
Recent advances in personalized arrhythmia risk prediction show that computational models can provide not only safer but also more accurate results than invasive procedures. However, biophysically accurate simulations require solving linear systems over fine meshes and time resolutions, which can take hours or even days. This limits the use of such simulations in the clinic, where diagnosis and treatment planning can be time sensitive, if only because of operating schedules. Furthermore, the non-interactive, non-intuitive way of accessing simulations and their results makes it hard to study them collaboratively.
Overcoming these limitations requires speeding up computations from hours to seconds, which requires a massive increase in computational capabilities.
Fortunately, the cost of computing has fallen dramatically in the past decade. A prominent reason for this is the recent introduction of manycore processors such as GPUs, which by now power the majority of the world's leading supercomputers. These devices owe their success to the fact that they are optimized for massively parallel workloads, such as applying similar ODE kernel computations to millions of mesh elements in scientific computing applications. Unlike CPUs, which are typically optimized for sequential performance, GPU architectures can dedicate more transistors to performing computations, thereby increasing parallel speed and energy efficiency.
In this poster, we present ongoing work on the parallelization of finite volume computations over an unstructured mesh as well as the challenges involved in building scalable simulation codes and discuss the steps needed to close the gap to accurate real-time computations.
Affiliation | Scientific Computing |
Project(s) | Meeting Exascale Computing with Source-to-Source Compilers, Department of High Performance Computing |
Publication Type | Poster |
Year of Publication | 2019 |
Date Published | 09/2019 |
Place Published | International Conference in Computing in Cardiology, Singapore |
Talks, contributed
Education in HPC and Data Science at Simula Research Lab and UiO
In SUPERDATA Workshop on curriculum development, Yunnan, China, 2018. Status: Published
Affiliation | Scientific Computing |
Publication Type | Talks, contributed |
Year of Publication | 2018 |
Location of Talk | SUPERDATA Workshop on curriculum development, Yunnan, China |
Unstructured mesh partitioning in the presence of strong coefficient heterogeneity
In PDESoft 2018 Conference, Bergen, Norway, 2018. Status: Published
Mesh partitioning is the first step in enabling parallel computation for solving PDEs. For an unstructured computational mesh, the task of partitioning is nontrivial, which can be formulated as an optimization problem with two goals. The first goal is load balancing, i.e., dividing the computational work evenly among the subdomains. The second goal is communication overhead minimization, i.e., limiting the subsequent inter-subdomain communication. Traditionally, an unstructured mesh is translated to a graph before partitioning, where the graph's nodes correspond to the mesh entities and the edges reflect the direct couplings between mesh entities.
In the presence of strong coefficient heterogeneity in a PDE, the coupling between the degrees of freedom (e.g., the nonzero values in a linear system arising from the discretization) will also exhibit strong heterogeneity. It is numerically beneficial to group the degrees of freedom that have strong in-between couplings in the same subdomain, whereas the weaker couplings are prioritized as places for the separation cuts between subdomains. This is achieved by heterogeneously weighting the edges of the graph, which is then partitioned by a graph partitioning algorithm that almost exclusively aims to minimize the so-called edge cut (sum of weights of all the cut-through edges).
Such an approach requires care, because an over-weighting of edges that represent strong couplings may result in too few "light-weight" edges that can be candidates for cuts between subdomains. This may again lead to bad partitioning results where, e.g., a subdomain consists of several disjoint patches. An accompanying weakness is that the associated edge-cut value no longer bears resemblance to the true communication overhead that will arise in the parallel solution process of the PDE.
This talk will report the results of ongoing work on investigating the weaknesses of the edge-cut-oriented graph partitioning strategy in the presence of strong coefficient heterogeneity. We will use concrete examples from reservoir simulations that involve strong disparity in key geological coefficients. Moreover, we aim to experiment with possible improvements to the current graph partitioning paradigm by giving less emphasis to edge cut while incorporating a new metric that better resembles the overhead of inter-subdomain communication.
Affiliation | Scientific Computing |
Publication Type | Talks, contributed |
Year of Publication | 2018 |
Location of Talk | PDESoft 2018 Conference, Bergen, Norway |
Talks, invited
Heterogeneous Computing: Programming, Performance and Applications
In CoSaS 2018 Symposium, Erlangen, Germany, 2018. Status: Published
Affiliation | Scientific Computing |
Project(s) | Meeting Exascale Computing with Source-to-Source Compilers |
Publication Type | Talks, invited |
Year of Publication | 2018 |
Location of Talk | CoSaS 2018 Symposium, Erlangen, Germany |
Type of Talk | Invited keynote talk |
Proceedings, refereed
Memory Bandwidth Contention: Communication vs Computation Tradeoffs in Supercomputers with Multicore Architectures
In International Conference on Parallel and Distributed Systems (ICPADS). Singapore: ACM/IEEE, 2018. Status: Published
We study the problem of contention for memory bandwidth between computation and communication in supercomputers that feature multicore CPUs. The problem arises when communication and computation are overlapped, and both operations compete for the same memory bandwidth. This contention is most visible at the limits of scalability, when communication and computation take similar amounts of time, and thus must be taken into account in order to reach maximum scalability in memory bandwidth bound applications. Typical examples of codes affected by the memory bandwidth contention problem are sparse matrix-vector computations, graph algorithms, and many machine learning problems, as they typically exhibit a high demand for both memory bandwidth and inter-node communication, while performing a relatively low number of arithmetic operations.
The problem is even more relevant in truly heterogeneous computations where CPUs and accelerators are used in concert. In that case it can lead to mispredictions of expected performance and consequently to suboptimal load balancing between CPU and accelerator, which in turn can lead to idling of powerful accelerators and thus to a large decrease in performance.
We propose a simple benchmark in order to quantify the loss of performance due to memory bandwidth contention. Based on that, we derive a theoretical model to determine the impact of the phenomenon on parallel memory-bound applications. We test the model on scientific computations, discuss the practical relevance of the problem and suggest possible techniques to remedy it.
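The essence of such a contention model can be sketched as follows (a simplified illustration, not the paper's full model): without contention, overlapped phases cost the maximum of their individual times; when both phases are bound by the same memory bandwidth, their traffic volumes add instead.

```python
def overlap_time(comp_bytes, comm_bytes, mem_bw=100e9):
    """Time for overlapped computation and communication phases
    that are both memory-bandwidth bound.  Illustrative model:
    without contention the phases proceed independently; with
    contention their memory traffic shares one bandwidth."""
    no_contention = max(comp_bytes, comm_bytes) / mem_bw
    contention = (comp_bytes + comm_bytes) / mem_bw
    return no_contention, contention

# At the limits of scalability the two phases take similar time,
# which is exactly where the contention penalty is largest:
ideal, actual = overlap_time(comp_bytes=50e9, comm_bytes=50e9)
```

In this balanced case contention doubles the overlapped time, which is why ignoring it leads to the load-balancing mispredictions described above.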
Affiliation | Scientific Computing |
Project(s) | Meeting Exascale Computing with Source-to-Source Compilers |
Publication Type | Proceedings, refereed |
Year of Publication | 2018 |
Conference Name | International Conference on Parallel and Distributed Systems (ICPADS) |
Publisher | ACM/IEEE |
Place Published | Singapore |
Keywords | Hybrid MPI/OpenMP, Memory bandwidth contention, Multicore supercomputers, performance modeling, Scientific Computing |
Poster
Quantifying data traffic of sparse matrix-vector multiplication in a multi-level memory hierarchy
London, UK, 2018. Status: Published
Sparse matrix-vector multiplication (SpMV) is the central operation in an iterative linear solver. On a computer with a multi-level memory hierarchy, SpMV performance is limited by memory or cache bandwidth. Furthermore, for a given sparse matrix, the volume of data traffic depends on the location of the matrix non-zeros. By estimating the volume of data traffic with Aho, Denning and Ullman’s page replacement model [1], we can locate bottlenecks in the memory hierarchy and evaluate optimizations such as matrix reordering. The model is evaluated by comparing with measurements from hardware performance counters on Intel Sandy Bridge.
[1]: Alfred V. Aho, Peter J. Denning, and Jeffrey D. Ullman. 1971. Principles of Optimal Page Replacement. J. ACM 18, 1 (January 1971), pp. 80-93.
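The core of estimating data traffic with a page replacement model is replaying an access trace through a cache simulation; a minimal single-level LRU sketch (a toy illustration, not the poster's multi-level model):

```python
from collections import OrderedDict

def lru_misses(trace, capacity):
    """Count the misses of a fully associative LRU cache holding
    `capacity` blocks, replayed over an access trace.  Each miss
    corresponds to one block of traffic from the next level."""
    cache = OrderedDict()
    misses = 0
    for block in trace:
        if block in cache:
            cache.move_to_end(block)       # mark most recently used
        else:
            misses += 1
            if len(cache) >= capacity:
                cache.popitem(last=False)  # evict least recently used
        cache[block] = None
    return misses

# Irregular accesses, e.g. to the source vector of an SpMV:
m = lru_misses([0, 1, 2, 0, 3, 0, 4, 2], capacity=3)
```

Replaying the column-index trace of a given sparse matrix through such a simulation, once per cache level, yields per-level traffic estimates that can be checked against hardware performance counters.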
Affiliation | Scientific Computing |
Project(s) | Meeting Exascale Computing with Source-to-Source Compilers |
Publication Type | Poster |
Year of Publication | 2018 |
Date Published | 06/2018 |
Place Published | London, UK |
Towards Detailed Organ-Scale Simulations in Cardiac Electrophysiology
International Symposium on Computational Science at Scale (CoSaS), Erlangen, Germany, 2018. Status: Published
We present implementations of tissue-scale 3D simulations of the human cardiac ventricle using a physiologically realistic cell model. Computational challenges in such simulations arise from two factors, the first of which is the sheer amount of computation when simulating a large number of cardiac cells in a detailed model containing 10^4 calcium release units, 10^6 stochastically changing ryanodine receptors and 1.5 × 10^5 L-type calcium channels per cell.
Additional challenges arise from the fact that the computational tasks have various levels of arithmetic intensity and control complexity, which require careful adaptation of the simulation code to the target device. By exploiting the strengths of GPUs and manycore accelerators, we obtain a performance that is far superior to that of the basic CPU implementation, thus paving the way for detailed whole-heart simulations in future generations of leadership class supercomputers.
Affiliation | Scientific Computing |
Project(s) | Meeting Exascale Computing with Source-to-Source Compilers |
Publication Type | Poster |
Year of Publication | 2018 |
Date Published | 09/2018 |
Place Published | International Symposium on Computational Science at Scale (CoSaS), Erlangen, Germany |
Type of Work | Poster |
Keywords | Cardiac electrophysiology, GPU, Scientific Computing, Xeon Phi |
Talk, keynote
Accelerated high-performance computing for computational cardiac electrophysiology
In The University of Tokyo, Tokyo, Japan, 2017. Status: Published
Massively parallel hardware accelerators, such as GPUs, are nowadays prevalent in the HPC hardware landscape. While having tremendous computing power, these accelerators also bring programming challenges. Often, a different programming standard applies for the accelerators than that for the conventional CPUs. For computing clusters that consist of both accelerators and CPUs, where the latter are hosts of the accelerators, elaborate hybrid parallel programming is needed to ensure an efficient use of the heterogeneous hardware.
This talk aims to share some experiences of implementing computational science software for heterogeneous computing platforms. We look at two scenarios: CPU+GPU [1] and CPU+Xeon Phi [2][3] heterogeneous computing. Common for both scenarios is the necessity of a proper pipelining of the involved computational and communication tasks, such that the overhead of various data movements can be reduced or completely masked. Moreover, suitable multi-threading with thread divergence is needed on the CPU host side. This is for enforcing computation-communication overlap, coordinating the accelerators, and allowing the CPU hosts to also contribute with their computing power. We have successfully applied hybrid CPU+Knights Corner co-processor computing [2][3] to two topics of computational cardiac electrophysiology, making use of the Tianhe-2 supercomputer. Results [4] about using the new Xeon Phi Knights Landing processor will also be presented.
[1]. J. Langguth, M. Sourouri, G. T. Lines, S. B. Baden, and X. Cai. Scalable heterogeneous CPU-GPU computations for unstructured tetrahedral meshes. IEEE Micro, 35(4):6–15, 2015.
[2]. J. Chai, J. Hake, N. Wu, M. Wen, X. Cai, G. T. Lines, J. Yang, H. Su, C. Zhang, and X. Liao. Towards simulation of subcellular calcium dynamics at nanometre resolution. International Journal of High Performance Computing Applications, 29(1):51–63, 2015.
[3]. J. Langguth, Q. Lan, N. Gaur, and X. Cai. Accelerating detailed tissue-scale 3D cardiac simulations using heterogeneous CPU-Xeon Phi computing. International Journal of Parallel Programming, 45(5):1236–1258, 2017.
[4]. J. Langguth, C. Jarvis, and X. Cai. Porting tissue-scale cardiac simulations to the Knights Landing platform. Proceedings of ISC High Performance 2017, 376–388, 2017.
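The pipelining idea emphasized in the talk can be sketched in plain C++: while the current chunk of data is being processed, the next chunk is fetched asynchronously, with a simple copy standing in for a host-device or MPI transfer. The chunking scheme and the sum-of-squares kernel are illustrative, not taken from the cited codes.

```cpp
#include <algorithm>
#include <cstddef>
#include <future>
#include <vector>

// Process data chunk by chunk, overlapping the "transfer" of chunk i+1
// (here: an asynchronous copy) with the computation on chunk i.
double pipelined_sum_of_squares(const std::vector<double>& data, std::size_t chunk) {
    double total = 0.0;
    std::vector<double> buf(data.begin(),
                            data.begin() + std::min(chunk, data.size()));
    for (std::size_t off = 0; off < data.size(); off += chunk) {
        std::size_t next = off + chunk;
        // Start fetching the next chunk asynchronously (the "communication").
        std::future<std::vector<double>> fetch;
        if (next < data.size()) {
            std::size_t end = std::min(next + chunk, data.size());
            fetch = std::async(std::launch::async, [&data, next, end] {
                return std::vector<double>(data.begin() + next, data.begin() + end);
            });
        }
        // Meanwhile, compute on the current chunk (the "computation").
        for (double x : buf) total += x * x;
        if (fetch.valid()) buf = fetch.get();
    }
    return total;
}
```

In real accelerator codes the asynchronous copy is replaced by an offload or MPI transfer, but the control structure, compute on one buffer while the other is in flight, is the same.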
Affiliation | Scientific Computing |
Project(s) | Meeting Exascale Computing with Source-to-Source Compilers |
Publication Type | Talk, keynote |
Year of Publication | 2017 |
Location of Talk | The University of Tokyo, Tokyo, Japan |
Notes | 2nd International Symposium on Research and Education of Computational Science |
Proceedings, refereed
Automated Translation of MATLAB Code to C++ with Performance and Traceability
In The Eleventh International Conference on Advanced Engineering Computing and Applications in Sciences (ADVCOMP 2017). International Academy, Research and Industry Association (IARIA), 2017. Status: Published
Automated Translation of MATLAB Code to C++ with Performance and Traceability
In this paper, we discuss the implementation and performance of m2cpp: an automated translator from MATLAB code to its matching Armadillo counterpart in the C++ language. A non-invasive strategy has been adopted, meaning that the user of m2cpp does not insert annotations or additional code lines into the input serial MATLAB code. Instead, a combination of code analysis, automated preprocessing and a user-editable metainfo file ensures that m2cpp overcomes some specialties of the MATLAB language, such as implicit typing of variables and multiple return values from functions. Thread-based parallelisation, using either OpenMP or Intel's Threading Building Blocks (TBB) library, can also be carried out by m2cpp for designated for-loops. Such an automated and non-invasive strategy allows maintaining an independent MATLAB code base that is favoured by algorithm developers, while an updated translation into an easily readable C++ counterpart can be obtained at any time. Illustrative examples from seismic data processing are provided in this paper, with performance results obtained on multicore Sandy Bridge CPUs and Intel's Knights Landing Xeon Phi processor.
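To illustrate the kind of translation involved, consider a small MATLAB function next to a hand-written C++ counterpart of the form such a translator could emit once the implicit types are resolved. The example is illustrative: m2cpp itself targets Armadillo types such as arma::vec, whereas the sketch below uses std::vector only to stay dependency-free.

```cpp
#include <cstddef>
#include <vector>

// MATLAB source (implicitly typed):
//   function y = scale_and_shift(x, a, b)
//       y = a * x + b;
//   end
// A C++ counterpart after type inference has resolved x to a vector of
// doubles and a, b to scalars; the element-wise loop replaces MATLAB's
// vectorised expression a*x + b.
std::vector<double> scale_and_shift(const std::vector<double>& x,
                                    double a, double b) {
    std::vector<double> y(x.size());
    for (std::size_t i = 0; i < x.size(); ++i)
        y[i] = a * x[i] + b;
    return y;
}
```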
Affiliation | Scientific Computing |
Project(s) | Meeting Exascale Computing with Source-to-Source Compilers |
Publication Type | Proceedings, refereed |
Year of Publication | 2017 |
Conference Name | The Eleventh International Conference on Advanced Engineering Computing and Applications in Sciences (ADVCOMP 2017) |
Pagination | 50-55 |
Date Published | 11/2017 |
Publisher | International Academy, Research and Industry Association (IARIA) |
ISBN Number | 978-1-61208-599-9 |
ISSN Number | 2308-4499 |
Keywords | C++, Code translation, Image processing, Matlab, Seismology |
URL | http://www.thinkmind.org/index.php?view=article&articleid=advcomp_2017_4... |
Porting Tissue-Scale Cardiac Simulations to the Knights Landing Platform
In International Conference on High Performance Computing. Lecture Notes in Computer Science, Springer, 2017. Status: Published
Porting Tissue-Scale Cardiac Simulations to the Knights Landing Platform
Affiliation | Scientific Computing |
Project(s) | Meeting Exascale Computing with Source-to-Source Compilers |
Publication Type | Proceedings, refereed |
Year of Publication | 2017 |
Conference Name | International Conference on High Performance Computing |
Date Published | 10/2017 |
Publisher | Lecture Notes in Computer Science, Springer |
ISBN Number | 978-3-319-67629-6 |
DOI | 10.1007/978-3-319-67630-2_28 |
Talks, contributed
Heterogeneous Manycore Simulations in Cardiac Electrophysiology
In Tenth International Workshop on Programmability and Architectures for Heterogeneous Multicores (MULTIPROG-2017), Stockholm, Sweden, 2017. Status: Published
Heterogeneous Manycore Simulations in Cardiac Electrophysiology
The demand for computing power in computational cardiology is continuously increasing, due to the use of physiologically realistic cell models and the need for higher resolutions in time and space when simulating the heart. While parallel computing itself is widely used in the field by now, the arrival of heterogeneous multicore architectures presents an important opportunity for speeding up the cardiac simulation process. However, this speedup is not obtained effortlessly, and achieving it presents numerous computational challenges of its own.
We present a summary of our experiences from the implementation of multiple cardiac research codes on multicore, manycore, and GPU processors. Our main goal is to highlight the platform-specific challenges we face in our applications, ranging from efficient data movement to optimizations for large-scale irregular computations. Based on that, we discuss the suitability of the different platforms for cardiac computations.
Affiliation | Scientific Computing |
Project(s) | Meeting Exascale Computing with Source-to-Source Compilers, Center for Biomedical Computing (SFF) |
Publication Type | Talks, contributed |
Year of Publication | 2017 |
Location of Talk | Tenth International Workshop on Programmability and Architectures for Heterogeneous Multicores (MULTIPROG-2017), Stockholm, Sweden |
Journal Article
Accelerating Detailed Tissue-Scale 3D Cardiac Simulations Using Heterogeneous CPU-Xeon Phi Computing
International Journal of Parallel Programming (2016): 1-23. Status: Published
Accelerating Detailed Tissue-Scale 3D Cardiac Simulations Using Heterogeneous CPU-Xeon Phi Computing
We investigate heterogeneous computing, which involves both multicore CPUs and manycore Xeon Phi coprocessors, as a new strategy for computational cardiology. In particular, 3D tissues of the human cardiac ventricle are studied with a physiologically realistic model that has 10,000 calcium release units per cell and 100 ryanodine receptors per release unit, together with tissue-scale simulations of the electrical activity and calcium handling. In order to attain resource-efficient use of heterogeneous computing systems that consist of both CPUs and Xeon Phis, we first direct the coding effort at ensuring good performance on the two types of compute devices individually. Although SIMD code vectorization is the main theme of performance programming, the actual implementation details differ considerably between CPU and Xeon Phi. Moreover, in addition to combined OpenMP+MPI programming, a suitable division of the cells between the CPUs and Xeon Phis is important for resource-efficient usage of an entire heterogeneous system. Numerical experiments show that good resource utilization is indeed achieved and that such a heterogeneous simulator paves the way for ultimately understanding the mechanisms of arrhythmia. The uncovered good programming practices can be used by computational scientists who want to adopt similar heterogeneous hardware platforms for a wide variety of applications.
Affiliation | Scientific Computing |
Project(s) | User-friendly programming of GPU-enhanced clusters, Center for Biomedical Computing (SFF) |
Publication Type | Journal Article |
Year of Publication | 2016 |
Journal | International Journal of Parallel Programming |
Pagination | 1-23 |
Date Published | 10/2016 |
Publisher | ACM/Springer |
Keywords | Calcium handling, multiscale cardiac tissue simulation, supercomputing, Xeon Phi |
DOI | 10.1007/s10766-016-0461-2 |
Panda: A Compiler Framework for Concurrent CPU+GPU Execution of 3D Stencil Computations on GPU-accelerated Supercomputers
International Journal of Parallel Programming (2016). Status: Published
Panda: A Compiler Framework for Concurrent CPU+GPU Execution of 3D Stencil Computations on GPU-accelerated Supercomputers
We present a new compiler framework for truly heterogeneous 3D stencil computation on GPU clusters. Our framework consists of a simple directive-based programming model and a tightly integrated source-to-source compiler. Annotated with a small number of directives, sequential stencil C codes can be automatically parallelized for large-scale GPU clusters. The most distinctive feature of the compiler is its capability to generate hybrid MPI+CUDA+OpenMP code that uses concurrent CPU+GPU computing to unleash the full potential of powerful GPU clusters. The auto-generated hybrid codes hide the overhead of various data movements by overlapping them with computation. Test results on the Titan supercomputer and the Wilkes cluster show that auto-translated codes can achieve about 90% of the performance of highly optimized handwritten codes, for both a simple stencil benchmark and a real-world application in cardiac modeling. The user-friendliness and performance of our domain-specific compiler framework allow harnessing the full power of GPU-accelerated supercomputing without painstaking coding effort.
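For reference, the computational core that such a framework starts from is an ordinary sequential stencil loop. Below is a minimal serial 3D 7-point stencil sweep in C++; the directive shown in the comment is a hypothetical placeholder, not Panda's actual syntax.

```cpp
#include <cstddef>
#include <vector>

// One sweep of a 3D 7-point stencil on an n*n*n grid stored in a flat
// array: each interior point is updated from its six axis neighbours.
// In the Panda setting, the triple loop would carry an annotation and the
// compiler would emit the hybrid MPI+CUDA+OpenMP version.
void stencil_sweep(const std::vector<double>& u, std::vector<double>& v,
                   std::size_t n) {
    auto idx = [n](std::size_t i, std::size_t j, std::size_t k) {
        return (i * n + j) * n + k;
    };
    // #pragma panda stencil   <-- illustrative placeholder, not real syntax
    for (std::size_t i = 1; i + 1 < n; ++i)
        for (std::size_t j = 1; j + 1 < n; ++j)
            for (std::size_t k = 1; k + 1 < n; ++k)
                v[idx(i, j, k)] = u[idx(i - 1, j, k)] + u[idx(i + 1, j, k)]
                                + u[idx(i, j - 1, k)] + u[idx(i, j + 1, k)]
                                + u[idx(i, j, k - 1)] + u[idx(i, j, k + 1)]
                                - 6.0 * u[idx(i, j, k)];
}
```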
Affiliation | Scientific Computing |
Project(s) | User-friendly programming of GPU-enhanced clusters, Center for Biomedical Computing (SFF) |
Publication Type | Journal Article |
Year of Publication | 2016 |
Journal | International Journal of Parallel Programming |
Date Published | 10/2016 |
Publisher | ACM/Springer |
Keywords | code generation, code optimisation, CPU+GPU computing, CUDA, heterogeneous computing, MPI, OpenMP, source-to-source translation, stencil computation |
DOI | 10.1007/s10766-016-0454-1 |
Solving 3D Time-Fractional Diffusion Equations by High-Performance Parallel Computing
Fractional Calculus and Applied Analysis 19, no. 1 (2016): 140-160. Status: Published
Solving 3D Time-Fractional Diffusion Equations by High-Performance Parallel Computing
Numerically solving time-fractional diffusion equations, especially in three space dimensions, is a daunting computational task. This is due to the huge requirements of both computation time and memory storage. Compared with solving integer-order diffusion equations, the costs in both time and storage increase by a factor that equals the number of time steps involved. Aiming to overcome these two obstacles, we study in this paper three programming techniques: loop unrolling, vectorization and parallelization. For a representative numerical scheme that adopts finite differencing and explicit time integration, the performance-enhancing techniques are indeed shown to dramatically reduce the computation time, while allowing the use of many CPU cores and thereby a large amount of memory storage. Moreover, we have developed simple-to-use performance models that support our empirical findings, which are based on using up to 8192 CPU cores and 12.2 terabytes of memory.
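The history dependence that drives the cost can be made concrete: at every time step the explicit scheme evaluates a weighted sum over all stored time levels, which is also the natural target for loop unrolling. The sketch below shows such a history sum with 4-way unrolling; the weights and history values are placeholders, not the scheme's actual coefficients.

```cpp
#include <cstddef>
#include <vector>

// Weighted sum over the full solution history (one weight per stored time
// level), unrolled four-fold so that four independent accumulators can be
// kept in registers and the loop body exposes instruction-level parallelism.
double history_sum_unrolled(const std::vector<double>& w,
                            const std::vector<double>& h) {
    std::size_t n = w.size();
    double s0 = 0, s1 = 0, s2 = 0, s3 = 0;
    std::size_t j = 0;
    for (; j + 4 <= n; j += 4) {          // unrolled main loop
        s0 += w[j]     * h[j];
        s1 += w[j + 1] * h[j + 1];
        s2 += w[j + 2] * h[j + 2];
        s3 += w[j + 3] * h[j + 3];
    }
    for (; j < n; ++j) s0 += w[j] * h[j]; // remainder loop
    return (s0 + s1) + (s2 + s3);
}
```

Because this sum runs over every previous time step, its length grows with the simulation, which is exactly the extra cost factor described above.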
Affiliation | Scientific Computing |
Project(s) | Center for Biomedical Computing (SFF) |
Publication Type | Journal Article |
Year of Publication | 2016 |
Journal | Fractional Calculus and Applied Analysis |
Volume | 19 |
Issue | 1 |
Pagination | 140-160 |
Publisher | DE GRUYTER |
Keywords | fractional differential equations, loop unrolling, parallel computing, vectorization |
URL | http://www.degruyter.com/view/j/fca.2016.19.issue-1/fca-2016-0008/fca-20... |
DOI | 10.1515/fca-2016-0008 |
Proceedings, refereed
Enabling Tissue-Scale Cardiac Simulations Using Heterogeneous Computing on Tianhe-2
In IEEE 22nd International Conference on Parallel and Distributed Systems (ICPADS). ACM/IEEE, 2016. Status: Published
Enabling Tissue-Scale Cardiac Simulations Using Heterogeneous Computing on Tianhe-2
We develop a simulator for 3D tissue of the human cardiac ventricle with a physiologically realistic cell model and deploy it on the supercomputer Tianhe-2. In order to attain the full performance of the heterogeneous CPU-Xeon Phi design, we use carefully optimized codes for both devices and combine them to obtain suitable load balancing. Using a large number of nodes, we are able to perform tissue-scale simulations of the electrical activity and calcium handling in millions of cells, at a level of detail that tracks the states of trillions of ryanodine receptors. We can thus simulate arrhythmogenic spiral waves and other complex arrhythmogenic patterns which arise from calcium handling deficiencies in human cardiac ventricle tissue. Due to extensive code tuning and parallelization via OpenMP, MPI, and SCIF/COI, large-scale simulations of 10 heartbeats can be performed in a matter of hours. Test results indicate excellent scalability, thus paving the way for detailed whole-heart simulations on future generations of leadership-class supercomputers.
Affiliation | Scientific Computing |
Project(s) | User-friendly programming of GPU-enhanced clusters, Center for Biomedical Computing (SFF) |
Publication Type | Proceedings, refereed |
Year of Publication | 2016 |
Conference Name | IEEE 22nd International Conference on Parallel and Distributed Systems (ICPADS) |
Pagination | 843-852 |
Date Published | 12/2016 |
Publisher | ACM/IEEE |
ISSN Number | 1521-9097 |
Keywords | Calcium handling, multiscale cardiac tissue simulation, supercomputing, Xeon Phi |
DOI | 10.1109/ICPADS.2016.0114 |
Matlab2cpp: A Matlab-to-C++ code translator
In IEEE 2016 11th System of Systems Engineering Conference (SoSE). IEEE, 2016. Status: Published
Matlab2cpp: A Matlab-to-C++ code translator
This paper discusses the source-to-source Matlab2cpp translator, which is currently being developed in the EMC2 project. With the help of user-supplied information about variable data types and a few special translation rules, Matlab code can be automatically translated into C++ code that makes use of the Armadillo C++ library. Preliminary tests with examples from the SeismicLab package have confirmed that this Matlab-to-C++ translator is indeed capable of handling realistic Matlab code. This tool thus has the potential of closing the gap between human-friendly experimentation offered by interactive Matlab scripting and performance-critical production runs that rely on C++ programming.
Affiliation | Scientific Computing |
Project(s) | Center for Biomedical Computing (SFF) |
Publication Type | Proceedings, refereed |
Year of Publication | 2016 |
Conference Name | IEEE 2016 11th System of Systems Engineering Conference (SoSE) |
Date Published | 06/2016 |
Publisher | IEEE |
Keywords | Armadillo, C++, Code translation, Matlab |
DOI | 10.1109/SYSOSE.2016.7542966 |
On the Performance and Energy Efficiency of the PGAS Programming Model on Multicore Architectures
In High Performance Computing & Simulation (2016) - International Workshop on Optimization of Energy Efficient HPC & Distributed Systems. ACM IEEE, 2016. Status: Published
On the Performance and Energy Efficiency of the PGAS Programming Model on Multicore Architectures
Affiliation | Scientific Computing |
Project(s) | PREAPP: PRoductivity and Energy-efficiency through Abstraction-based Parallel Programming, Center for Biomedical Computing (SFF) |
Publication Type | Proceedings, refereed |
Year of Publication | 2016 |
Conference Name | High Performance Computing & Simulation (2016) - International Workshop on Optimization of Energy Efficient HPC & Distributed Systems |
Date Published | 08/2016 |
Publisher | ACM IEEE |
URL | http://dx.doi.org/10.1109/HPCSim.2016.7568416 |
DOI | 10.1109/HPCSim.2016.7568416 |
Journal Article
An Analytical GPU Performance Model for 3D Stencil Computations from the Angle of Data Traffic
The Journal of Supercomputing 71, no. 7 (2015): 2433-2453. Status: Published
An Analytical GPU Performance Model for 3D Stencil Computations from the Angle of Data Traffic
The achievable GPU performance of many scientific computations is not determined by a GPU's peak floating-point rate, but rather by how fast data are moved through the different stages of the entire memory hierarchy. We take low-order 3D stencil computations as a representative class to study the reachable GPU performance from the angle of data traffic. Specifically, we propose a simple analytical model to estimate the execution time based on quantifying the data traffic volume at three stages: (1) between registers and on-SMX storage, (2) between on-SMX storage and L2 cache, (3) between L2 cache and GPU's device memory. Three associated granularities are used: a CUDA thread, a thread block, and a set of simultaneously active thread blocks. For four 3D stencil computations, NVIDIA's profiling tools have been used to verify the accuracy of the quantified data traffic volumes, by trying a large number of executions with different problem sizes and thread block configurations. Moreover, by introducing an imbalance coefficient, together with the known realistic memory bandwidths, we can predict the execution time based on the quantified data traffic volumes. For the four 3D stencils, the average error of the time prediction is 6.9% for a baseline implementation approach, whereas for a blocking implementation approach the average prediction error is 9.5%.
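A minimal sketch of a traffic-based estimate: each stage of the memory hierarchy contributes volume divided by bandwidth, and the slowest stage, scaled by an imbalance coefficient, bounds the execution time. This bottleneck rule is a simplification of the paper's model, and the bandwidth figures below are illustrative, not measured values.

```cpp
#include <algorithm>

// Predict execution time from per-stage data traffic volumes (bytes).
// The three stages mirror the model: registers<->on-SMX storage,
// on-SMX storage<->L2, and L2<->device memory. Bandwidths are hypothetical.
double predict_time(double bytes_reg_smx, double bytes_smx_l2,
                    double bytes_l2_dram, double imbalance) {
    const double bw_reg_smx = 2.0e12;  // bytes/s, illustrative
    const double bw_smx_l2  = 1.0e12;  // bytes/s, illustrative
    const double bw_l2_dram = 2.0e11;  // bytes/s, illustrative
    double bottleneck = std::max({bytes_reg_smx / bw_reg_smx,
                                  bytes_smx_l2  / bw_smx_l2,
                                  bytes_l2_dram / bw_l2_dram});
    return imbalance * bottleneck;     // imbalance >= 1 inflates the bound
}
```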
Affiliation | Scientific Computing |
Project(s) | Center for Biomedical Computing (SFF) |
Publication Type | Journal Article |
Year of Publication | 2015 |
Journal | The Journal of Supercomputing |
Volume | 71 |
Issue | 7 |
Pagination | 2433-2453 |
Date Published | 02/2015 |
Publisher | Springer |
ISSN | 0920-8542 |
Keywords | 3D stencil methods, GPU, performance modeling |
URL | http://link.springer.com/article/10.1007/s11227-015-1392-1 |
DOI | 10.1007/s11227-015-1392-1 |
Communication-Hiding Programming for Clusters with Multi-Coprocessor Nodes
Concurrency and Computation: Practice and Experience 27, no. 16 (2015): 4172-4185. Status: Published
Communication-Hiding Programming for Clusters with Multi-Coprocessor Nodes
Future exascale systems are expected to adopt compute nodes that incorporate many accelerators. To shed some light on the upcoming software challenge, this paper investigates the particular topic of programming clusters that have multiple Xeon Phi coprocessors in each compute node. A new offload approach is considered for intra-node communication, which combines Intel’s APIs of coprocessor offload infrastructure (COI) and symmetric communication interface (SCIF) for achieving low latency. While the conventional pragma-based offload approach allows simpler programming, the COI-SCIF approach has three advantages: (1) lower overhead associated with launching offloaded code, (2) higher data transfer bandwidths, and (3) more advanced asynchrony between computation and data movement. The low-level COI-SCIF approach is also shown to have benefits over the MPI-OpenMP counterpart, which belongs to the symmetric usage mode. Moreover, a hybrid programming strategy based on COI-SCIF is presented for joining the computational force of all CPUs and coprocessors, while realizing communication hiding. All the programming approaches are tested by a real-world 3D application, for which the COI-SCIF-based approach shows a performance advantage on Tianhe-2.
Affiliation | Scientific Computing |
Project(s) | Center for Biomedical Computing (SFF) |
Publication Type | Journal Article |
Year of Publication | 2015 |
Journal | Concurrency and Computation: Practice and Experience |
Volume | 27 |
Issue | 16 |
Pagination | 4172–4185 |
Date Published | 05/2015 |
Publisher | John Wiley & Sons, Ltd |
Keywords | hybrid programming, Intel Xeon Phi coprocessor, offload model, SCIF, Tianhe-2 |
Notes | Published online before print. |
DOI | 10.1002/cpe.3507 |
Enabling a Uniform OpenCL Device View for Heterogeneous Platforms
IEICE Transactions on Information and Systems E98-D, no. 4 (2015): 812-823. Status: Published
Enabling a Uniform OpenCL Device View for Heterogeneous Platforms
Aiming to ease the parallel programming for heterogeneous architectures, we propose and implement a high-level OpenCL runtime that conceptually merges multiple heterogeneous hardware devices into one virtual heterogeneous compute device (VHCD). Moreover, automated workload distribution among the devices is based on offline profiling, together with new programming directives that define the device-independent data access range per work-group. Therefore, an OpenCL program originally written for a single compute device can, after inserting a small number of programming directives, run efficiently on a platform consisting of heterogeneous compute devices. Performance is ensured by introducing the technique of virtual cache management, which minimizes the amount of host-device data transfer. Our new OpenCL runtime is evaluated by a diverse set of OpenCL benchmarks, demonstrating good performance on various configurations of a heterogeneous system.
Affiliation | Scientific Computing |
Project(s) | Center for Biomedical Computing (SFF) |
Publication Type | Journal Article |
Year of Publication | 2015 |
Journal | IEICE Transactions on Information and Systems |
Volume | E98-D |
Issue | 4 |
Pagination | 812-823 |
Date Published | 04/2015 |
Publisher | IEICE |
ISSN | 1745-1361 |
Keywords | automated workload distribution, data transfer minimization, heterogeneous devices, OpenCL, virtualized single device |
DOI | 10.1587/transinf.2014EDP7244 |
Parallel performance modeling of irregular applications in cell-centered finite volume methods over unstructured tetrahedral meshes
Journal of Parallel and Distributed Computing 76 (2015): 120-131. Status: Published
Parallel performance modeling of irregular applications in cell-centered finite volume methods over unstructured tetrahedral meshes
Finite volume methods are widely used numerical strategies for solving partial differential equations. This paper aims at obtaining a quantitative understanding of the achievable performance of the cell-centered finite volume method on 3D unstructured tetrahedral meshes, using traditional multicore CPUs as well as modern GPUs. By using an optimized implementation and a synthetic connectivity matrix that exhibits a perfect structure of equal-sized blocks lying on the main diagonal, we can closely relate the achievable computing performance to the size of these diagonal blocks. Moreover, we have derived a theoretical model for identifying characteristic levels of the attainable performance as a function of hardware parameters, based on which a realistic upper limit of the performance can be predicted accurately. For real-world tetrahedral meshes, the key to high performance lies in a reordering of the tetrahedra, such that the resulting connectivity matrix resembles a block diagonal form where the optimal size of the blocks depends on the hardware. Numerical experiments confirm that the achieved performance is close to the practically attainable maximum and it reaches 75% of the theoretical upper limit, independent of the actual tetrahedral mesh considered. From this, we develop a general model capable of identifying bottleneck performance of a system’s memory hierarchy in irregular applications.
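The reordering idea can be illustrated with a breadth-first renumbering in the spirit of Cuthill-McKee: neighbouring cells in the connectivity graph receive nearby indices, pushing the connectivity matrix's nonzeros toward the diagonal. This sketch only conveys the basic locality-improving principle; it does not reproduce the hardware-tuned, equal-sized diagonal blocks that the paper targets.

```cpp
#include <cstddef>
#include <queue>
#include <vector>

// Compute a breadth-first ordering of mesh cells from their adjacency
// lists. Position p in the returned vector holds the old index of the cell
// that receives new index p, so face-neighbouring tetrahedra end up close
// together in the new numbering.
std::vector<std::size_t>
bfs_order(const std::vector<std::vector<std::size_t>>& adj) {
    std::size_t n = adj.size();
    std::vector<std::size_t> order;
    std::vector<bool> seen(n, false);
    for (std::size_t start = 0; start < n; ++start) {
        if (seen[start]) continue;      // handle disconnected components
        std::queue<std::size_t> q;
        q.push(start);
        seen[start] = true;
        while (!q.empty()) {
            std::size_t c = q.front(); q.pop();
            order.push_back(c);
            for (std::size_t nb : adj[c])
                if (!seen[nb]) { seen[nb] = true; q.push(nb); }
        }
    }
    return order;
}
```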
Affiliation | Scientific Computing |
Project(s) | Center for Biomedical Computing (SFF) |
Publication Type | Journal Article |
Year of Publication | 2015 |
Journal | Journal of Parallel and Distributed Computing |
Volume | 76 |
Pagination | 120-131 |
Date Published | 02/2015 |
Publisher | Elsevier |
DOI | 10.1016/j.jpdc.2014.10.005 |
Scalable heterogeneous CPU-GPU computations for unstructured tetrahedral meshes
IEEE Micro 35, no. 4 (2015): 6-15. Status: Published
Scalable heterogeneous CPU-GPU computations for unstructured tetrahedral meshes
A recent trend in modern high-performance computing environments is the introduction of powerful, energy-efficient hardware accelerators such as GPUs and Xeon Phi coprocessors. These specialized computing devices coexist with CPUs and are optimized for highly parallel applications. In regular computing-intensive applications with predictable data access patterns, these devices often far outperform CPUs and thus relegate the latter to pure control functions instead of computations. For irregular applications, however, the performance gap can be much smaller and is sometimes even reversed. Thus, maximizing the overall performance on heterogeneous systems requires making full use of all available computational resources, including both accelerators and CPUs.
Affiliation | Scientific Computing |
Project(s) | Center for Biomedical Computing (SFF) |
Publication Type | Journal Article |
Year of Publication | 2015 |
Journal | IEEE Micro |
Volume | 35 |
Issue | 4 |
Pagination | 6-15 |
Date Published | 07/2015 |
Publisher | ACM IEEE |
DOI | 10.1109/MM.2015.70 |
Towards Simulation of Subcellular Calcium Dynamics at Nanometre Resolution
International Journal of High Performance Computing Applications 29, no. 1 (2015): 51-63. Status: Published
Towards Simulation of Subcellular Calcium Dynamics at Nanometre Resolution
Numerical simulation of subcellular dynamics with a resolution down to one nanometre can be an important tool for discovering the physiological cause of many heart diseases. The requirement of enormous computational power, however, has made such simulations prohibitive so far. By using up to 12,288 Intel Xeon Phi 31S1P coprocessors on the new hybrid cluster Tianhe-2, currently the number one supercomputer in the world, we have achieved 1.27 Pflop/s in double precision, which brings us much closer to the nanometre resolution. This is the result of efficiently using the hardware on different levels: (1) a single Xeon Phi, (2) a single compute node that consists of a host and three coprocessors, and (3) a huge number of interconnected nodes. To overcome the challenge of programming Intel’s new many-integrated core (MIC) architecture, we have adopted techniques such as vectorization, hierarchical data blocking, register data reuse, offloading computations to the coprocessors, and pipelining computations with intra-/inter-node communications.
Affiliation | Scientific Computing |
Project(s) | Center for Biomedical Computing (SFF) |
Publication Type | Journal Article |
Year of Publication | 2015 |
Journal | International Journal of High Performance Computing Applications |
Volume | 29 |
Issue | 1 |
Pagination | 51-63 |
Publisher | SAGE |
DOI | 10.1177/1094342013514465 |
Proceedings, refereed
CPU+GPU Programming of Stencil Computations for Resource-Efficient Use of GPU Clusters
In IEEE 18th International Conference on Computational Science and Engineering. IEEE Computer Society, 2015. Status: Published
CPU+GPU Programming of Stencil Computations for Resource-Efficient Use of GPU Clusters
On modern GPU clusters, the role of the CPUs is often restricted to controlling the GPUs and handling MPI communication. The unused computing power of the CPUs, however, can be considerable for computations whose performance is bounded by memory traffic. This paper investigates the challenges of simultaneous usage of CPUs and GPUs for computation. Our emphasis is on deriving a heterogeneous CPU+GPU programming approach that combines MPI, OpenMP and CUDA. To effectively hide the overhead of various inter- and intra-node communications, a new level of task parallelism is introduced on top of the conventional data parallelism. Combined with a suitable workload division between the CPUs and GPUs, our CPU+GPU programming approach is able to fully utilize the different processing units. The programming details and achievable performance are exemplified by a widely used 3D 7-point stencil computation, which shows high performance and scaling in experiments using up to 64 CPU-GPU nodes.
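A minimal sketch of the static workload division between the two device types, assuming their throughputs have been measured beforehand: the grid's planes are split so that CPU and GPU finish their shares at roughly the same time. The function and its inputs are illustrative, not code from the paper.

```cpp
#include <cstddef>

// Split the planes of a 3D grid between GPU and CPU in proportion to
// their measured stencil throughputs (e.g. lattice updates per second),
// rounding to the nearest whole plane. The remaining planes go to the CPU.
std::size_t planes_for_gpu(std::size_t total_planes,
                           double gpu_throughput, double cpu_throughput) {
    double frac = gpu_throughput / (gpu_throughput + cpu_throughput);
    std::size_t g = static_cast<std::size_t>(frac * total_planes + 0.5);
    if (g > total_planes) g = total_planes;
    return g;
}
```

In practice such a static split is a starting point; overlap of MPI communication, halo exchange, and host-device transfers then determines how well both processing units stay busy.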
Affiliation | Scientific Computing |
Project(s) | Center for Biomedical Computing (SFF) |
Publication Type | Proceedings, refereed |
Year of Publication | 2015 |
Conference Name | IEEE 18th International Conference on Computational Science and Engineering |
Pagination | 17-26 |
Date Published | 10/2015 |
Publisher | IEEE Computer Society |
Keywords | CPU+GPU computing, CUDA, GPU, MPI, stencil |
DOI | 10.1109/CSE.2015.33 |
Multi-GPU Implementations of Parallel 3D Sweeping Algorithms with Application to Geological Folding
In ICCS 2015. Elsevier, 2015. Status: Published
Multi-GPU Implementations of Parallel 3D Sweeping Algorithms with Application to Geological Folding
This paper studies the CUDA programming challenges with using multiple GPUs inside a single machine to carry out plane-by-plane updates in parallel 3D sweeping algorithms. In particular, care must be taken to mask the overhead of various data movements between the GPUs. Multiple OpenMP threads on the CPU side should be combined with multiple CUDA streams per GPU to hide the data transfer cost related to the halo computation on each 2D plane. Moreover, the technique of peer-to-peer data motion can be used to reduce the impact of 3D volumetric data shuffles that have to be done between mandatory changes of the grid partitioning. We have investigated the performance improvement of 2- and 4-GPU implementations that are applicable to 3D anisotropic front propagation computations related to geological folding. In comparison with a straightforward multi-GPU implementation, the overall performance improvement due to masking of data movements on four GPUs of the Fermi architecture was 23%. The corresponding improvement obtained on four Kepler GPUs was 47%.
Affiliation | Scientific Computing |
Publication Type | Proceedings, refereed |
Year of Publication | 2015 |
Conference Name | ICCS 2015 |
Pagination | 1494-1503 |
Date Published | 06/2015 |
Publisher | Elsevier |
Keywords | 3D sweeping, anisotropic front propagation, CUDA programming, NVIDIA GPU, OpenMP |
DOI | 10.1016/j.procs.2015.05.339 |
Towards Detailed Tissue-Scale 3D Simulations of Electrical Activity and Calcium Handling in the Human Cardiac Ventricle
In The 15th International Conference on Algorithms and Architectures for Parallel Processing (ICA3PP 2015). Lecture Notes in Computer Science, Springer Verlag, 2015. Status: Published
Towards Detailed Tissue-Scale 3D Simulations of Electrical Activity and Calcium Handling in the Human Cardiac Ventricle
We adopt a detailed human cardiac cell model, which has 10,000 calcium release units, in connection with simulating the electrical activity and calcium handling at the tissue scale. This is a computationally intensive problem requiring a combination of efficient numerical algorithms and parallel programming. To this end, we use a method that is based on binomial distributions to collectively study the stochastic state transitions of the 100 ryanodine receptors inside every calcium release unit, instead of individually following each ryanodine receptor. Moreover, the implementation of the parallel simulator has incorporated optimizations in the form of code vectorization and removal of redundant calculations. Numerical experiments show very good parallel performance of the 3D simulator and demonstrate that various physiological behaviors are correctly reproduced. This work thus paves the way for high-fidelity 3D simulations of human ventricular tissues, with the ultimate goal of understanding the mechanisms of arrhythmia.
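The binomial idea can be sketched in a few lines: rather than drawing 100 individual Bernoulli transitions per calcium release unit per time step, one binomial sample gives the number of receptors that change state. The channel count and transition probability in the usage below are illustrative, not the model's actual rates.

```cpp
#include <random>

// Collectively sample how many of n_closed ryanodine receptors open during
// one time step, each with opening probability p_open: one binomial draw
// replaces n_closed independent Bernoulli draws.
int ryr_transitions(int n_closed, double p_open, std::mt19937& rng) {
    std::binomial_distribution<int> dist(n_closed, p_open);
    return dist(rng);
}
```

With 10,000 release units per cell and millions of cells, reducing the per-unit sampling from 100 draws to one is a substantial saving.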
Affiliation | Scientific Computing |
Project(s) | Center for Biomedical Computing (SFF) |
Publication Type | Proceedings, refereed |
Year of Publication | 2015 |
Conference Name | The 15th International Conference on Algorithms and Architectures for Parallel Processing (ICA3PP 2015) |
Pagination | 79-92 |
Date Published | 11/2015 |
Publisher | Lecture Notes in Computer Science, Springer Verlag |
ISBN Number | 978-3-319-27136-1 |
Keywords | Calcium handling, multiscale cardiac tissue simulation, supercomputing |
URL | http://link.springer.com/chapter/10.1007/978-3-319-27137-8_7 |
DOI | 10.1007/978-3-319-27137-8_7 |
Poster
Dysfunctional Sarcoplasmic Reticulum Ca2+ Release Underlies Arrhythmogenic Triggers in Catecholaminergic Polymorphic Ventricular Tachycardia: A Simulation Study in a Human Ventricular Myocyte Model
In Gordon Research Conference on Cardiac Arrhythmia. Lucca, Italy: Gordon Research Conference on Cardiac Arrhythmia, 2015. Status: Published
Affiliation | Scientific Computing |
Publication Type | Poster |
Year of Publication | 2015 |
Secondary Title | Gordon Research Conference on Cardiac Arrhythmia |
Publisher | Gordon Research Conference on Cardiac Arrhythmia |
Place Published | Lucca, Italy |
Technical reports
Is PGAS ready for the challenge of energy efficiency? A study with the NAS benchmark.
Tromsø: UiT, 2015. Status: Published
In this study we compare the performance and power efficiency of Unified Parallel C (UPC), MPI and OpenMP by running a set of kernels from the NAS Benchmark. One goal of this study is to focus on the Partitioned Global Address Space (PGAS) model, in order to describe it and compare it to MPI and OpenMP. In particular, we consider the power efficiency, expressed in millions of operations per second per watt, as a criterion to evaluate the suitability of PGAS compared to MPI and OpenMP. Based on these measurements, we provide an analysis explaining the performance differences between UPC, MPI, and OpenMP.
Affiliation | Scientific Computing |
Project(s) | Center for Biomedical Computing (SFF) |
Publication Type | Technical reports |
Year of Publication | 2015 |
Publisher | UiT |
Place Published | Tromsø |
Keywords | MPI, NAS Benchmark, OpenMP, performance evaluation, PGAS, power efficiency, UPC |
URL | http://munin.uit.no/bitstream/handle/10037/8207/article.pdf?sequence=1&i... |
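The power-efficiency criterion used in the report, millions of operations per second per watt, is a simple derived metric; the helper below is a hypothetical illustration, not code from the study:

```c
/* Power efficiency in Mops/W: total operations divided by elapsed
 * time (-> ops/s), scaled to millions, divided by average power. */
double mops_per_watt(double operations, double seconds, double watts)
{
    return operations / seconds / 1.0e6 / watts;
}
```

For example, 2×10^12 operations executed in 10 s at an average draw of 200 W gives 1000 Mops/W.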
Book Chapter
Parallel Computing
In Encyclopedia of Applied and Computational Mathematics, 1129-1132. Springer Berlin Heidelberg, 2015. Status: Published
Parallel computing can be understood as solving a computational problem through the collaborative use of multiple resources that belong to a parallel computer system. Here, a parallel system can be anything between a single multiprocessor machine and an Internet-connected cluster made up of hybrid compute nodes. There are two main motivations for adopting parallel computations. The first is reducing the computational time, because employing more computational units to solve the same problem usually results in lower wall-time usage. The second – and perhaps more important – motivation is the desire to obtain more details, which can arise from higher temporal and spatial resolutions, more advanced mathematical and numerical models, and more realizations.
Affiliation | Scientific Computing |
Project(s) | Center for Biomedical Computing (SFF) |
Publication Type | Book Chapter |
Year of Publication | 2015 |
Book Title | Encyclopedia of Applied and Computational Mathematics |
Pagination | 1129-1132 |
Date Published | 11/2015 |
Publisher | Springer Berlin Heidelberg |
ISBN Number | 978-3-540-70528-4 |
DOI | 10.1007/978-3-540-70529-1_424 |
Talks, contributed
Arrhythmogenic Mechanisms and Therapeutic Targets for Catecholaminergic Polymorphic Ventricular Tachycardia: A Simulation Study in a Human Ventricular Myocyte
In Simula Research Laboratory, 2014. Status: Published
Affiliation | Scientific Computing |
Project(s) | Center for Biomedical Computing (SFF) |
Publication Type | Talks, contributed |
Year of Publication | 2014 |
Location of Talk | Simula Research Laboratory |
Type of Talk | Cardiac Modeling Workshop |
Mathematical Modeling of Ca Handling and Computational Studies of Ca-related Arrhythmogenesis in Heart
In National University of Defense Technology, China. Changsha, China, 2014. Status: Published
Affiliation | Scientific Computing |
Project(s) | Center for Biomedical Computing (SFF) |
Publication Type | Talks, contributed |
Year of Publication | 2014 |
Location of Talk | National University of Defense Technology, China |
Place Published | Changsha, China |
Type of Talk | Workshop |
Proceedings, refereed
Automated Transformation of GPU-Specific OpenCL Kernels Targeting Performance Portability on Multi-Core/Many-Core CPUs
In Proceedings of Euro-Par 2014. Vol. 8632. LNCS 8632. Berlin Heidelberg New York: Springer, 2014. Status: Published
Affiliation | Scientific Computing |
Project(s) | Center for Biomedical Computing (SFF) |
Publication Type | Proceedings, refereed |
Year of Publication | 2014 |
Conference Name | Proceedings of Euro-Par 2014 |
Volume | 8632 |
Pagination | 210-221 |
Publisher | Springer |
Place Published | Berlin Heidelberg New York |
Keywords | Conference |
DOI | 10.1007/978-3-319-09873-9_18 |
Effective Multi-GPU Communication Using Multiple CUDA Streams and Threads
In 20th International Conference on Parallel and Distributed Systems (ICPADS 2014). IEEE, 2014. Status: Published
In the context of multiple GPUs that share the same PCIe bus, we propose a new communication scheme that leads to a more effective overlap of communication and computation. Multiple CUDA streams and OpenMP threads are adopted so that data can simultaneously be sent and received. A representative 3D stencil example is used to demonstrate the effectiveness of our scheme. We compare the performance of our new scheme with an MPI-based state-of-the-art scheme. Results show that our approach outperforms the state-of-the-art scheme, being up to 1.85× faster. However, our performance results also indicate that the current underlying PCIe bus architecture needs improvements to handle the future scenario of many GPUs per node.
Affiliation | Scientific Computing |
Project(s) | Center for Biomedical Computing (SFF) |
Publication Type | Proceedings, refereed |
Year of Publication | 2014 |
Conference Name | 20th International Conference on Parallel and Distributed Systems (ICPADS 2014) |
Pagination | 981-986 |
Publisher | IEEE |
DOI | 10.1109/PADSW.2014.7097919 |
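The overlap scheme described in the abstract can be illustrated structurally. The sketch below uses OpenMP sections to mimic one host thread per CUDA stream, and plain memcpy as a stand-in for cudaMemcpyAsync between devices; all buffer names and sizes are invented for illustration:

```c
#include <string.h>

#define HALO 1024

static double interior[HALO];                 /* this device's interior    */
static double send_buf[HALO], recv_buf[HALO]; /* this device's halo copies */
static double peer_in[HALO], peer_out[HALO];  /* "other GPU's" memory      */

static void compute_interior(void)
{
    for (int i = 1; i < HALO - 1; i++)        /* toy interior update */
        interior[i] = 0.5 * (interior[i - 1] + interior[i + 1]);
}

/* Interior computation and the two halo transfers are independent, so
 * they can proceed concurrently: with OpenMP enabled, each section runs
 * on its own thread, mimicking one thread driving each CUDA stream. */
void exchange_and_compute(void)
{
    #pragma omp parallel sections
    {
        #pragma omp section
        compute_interior();                          /* compute     */
        #pragma omp section
        memcpy(peer_in, send_buf, sizeof send_buf);  /* "send" halo */
        #pragma omp section
        memcpy(recv_buf, peer_out, sizeof peer_out); /* "recv" halo */
    }
}
```

Without the OpenMP flag the pragmas are ignored and the three activities simply run one after another, which is exactly the serialization the paper's scheme avoids.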
Heterogeneous CPU-GPU Computing for the Finite Volume Method on 3D Unstructured Meshes
In 20th International Conference on Parallel and Distributed Systems (ICPADS 2014). IEEE, 2014. Status: Published
A recent trend in modern high-performance computing environments is the introduction of accelerators such as the GPU and Xeon Phi, i.e. specialized computing devices that are optimized for highly parallel applications and coexist with CPUs. In regular compute-intensive applications with predictable data access patterns, these devices often outperform traditional CPUs by far and thus relegate them to pure control functions instead of computations. For irregular applications, however, the gap in relative performance can be much smaller, and sometimes even reversed. Maximizing overall performance in such systems therefore requires making full use of all available computational resources. In this paper we study the attainable performance of the cell-centered finite volume method on 3D unstructured tetrahedral meshes using heterogeneous systems consisting of CPUs and multiple GPUs. Finite volume methods are widely used numerical strategies for solving partial differential equations. The advantages of using finite volumes include built-in support for conservation laws and suitability for unstructured meshes. Our focus lies in demonstrating how a workload distribution that maximizes overall performance can be derived from the actual performance attained by the different computing devices in the heterogeneous environment. We also highlight the dual role of partitioning software in reordering and partitioning the input mesh, thus giving rise to a new combined approach to partitioning.
Affiliation | Scientific Computing |
Project(s) | Center for Biomedical Computing (SFF) |
Publication Type | Proceedings, refereed |
Year of Publication | 2014 |
Conference Name | 20th International Conference on Parallel and Distributed Systems (ICPADS 2014) |
Pagination | 191-199 |
Publisher | IEEE |
DOI | 10.1109/PADSW.2014.7097808 |
Utilizing Multiple Xeon Phi Coprocessors on One Compute Node
In International Conference on Algorithms and Architectures for Parallel Processing. Vol. 8631. LNCS 8631. Berlin Heidelberg New York: Springer, 2014. Status: Published
Affiliation | Scientific Computing |
Project(s) | Center for Biomedical Computing (SFF) |
Publication Type | Proceedings, refereed |
Year of Publication | 2014 |
Conference Name | International Conference on Algorithms and Architectures for Parallel Processing |
Volume | 8631 |
Pagination | 68-81 |
Publisher | Springer |
Place Published | Berlin Heidelberg New York |
DOI | 10.1007/978-3-319-11194-0_6 |
Poster
Cellular Arrhythmogenesis in CPVT in a computational model of cardiac ventricular myocyte
Maastricht, Netherlands: European Working Group of Cardiac Cellular Electrophysiology, 2014. Status: Published
Affiliation | Scientific Computing |
Project(s) | Center for Biomedical Computing (SFF) |
Publication Type | Poster |
Year of Publication | 2014 |
Date Published | 09/2014 |
Publisher | European Working Group of Cardiac Cellular Electrophysiology |
Place Published | Maastricht, Netherlands |
Cellular Arrhythmogenesis in CPVT in a Computational Model of Cardiac Ventricular Myocyte
Scandinavian Physiological Society Meeting, 2014. Status: Published
Affiliation | Scientific Computing |
Project(s) | Center for Biomedical Computing (SFF) |
Publication Type | Poster |
Year of Publication | 2014 |
Date Published | 08/2014 |
Place Published | Scandinavian Physiological Society Meeting |
Type of Work | Poster at Scandinavian Physiological Society Meeting |
Spontaneous Ca2+ Release and Ca2+ Waves Underlie Early and Delayed Afterdepolarizations, and Triggered Activity in Ryanodine Receptor Mutation associated with Catecholaminergic Polymorphic Ventricular Tachycardia
Scandinavian Physiological Society, 2014. Status: Published
Affiliation | Scientific Computing |
Project(s) | Center for Biomedical Computing (SFF) |
Publication Type | Poster |
Year of Publication | 2014 |
Date Published | 08/2014 |
Place Published | Scandinavian Physiological Society |
Spontaneous Ca2+ Release and Ca2+ Waves Underlie Early and Delayed Afterdepolarizations, and Triggered Activity, in Ryanodine Receptor Mutations Associated With Catecholaminergic Polymorphic Ventricular Tachycardia
2014. Status: Published
Affiliation | Scientific Computing |
Project(s) | Center for Biomedical Computing (SFF) |
Publication Type | Poster |
Year of Publication | 2014 |
Date Published | 08/2014 |
Keywords | Conference |
Journal Article
High Efficient Sedimentary Basin Simulations on Hybrid CPU-GPU Clusters
Cluster Computing 17 (2014): 359-369. Status: Published
Affiliation | Scientific Computing |
Project(s) | Center for Biomedical Computing (SFF) |
Publication Type | Journal Article |
Year of Publication | 2014 |
Journal | Cluster Computing |
Volume | 17 |
Number | 2 |
Pagination | 359-369 |
DOI | 10.1007/s10586-013-0300-9 |
Performance Modeling of Serial and Parallel Implementations of the Fractional Adams-Bashforth-Moulton Method
Fractional Calculus and Applied Analysis 17 (2014): 617-637. Status: Published
Affiliation | Scientific Computing |
Project(s) | Center for Biomedical Computing (SFF) |
Publication Type | Journal Article |
Year of Publication | 2014 |
Journal | Fractional Calculus and Applied Analysis |
Volume | 17 |
Number | 3 |
Pagination | 617-637 |
Time-Fractional Heat Equations and Negative Absolute Temperatures
Computers & Mathematics with Applications 67 (2014): 164-171. Status: Published
Affiliation | Scientific Computing |
Project(s) | Center for Biomedical Computing (SFF) |
Publication Type | Journal Article |
Year of Publication | 2014 |
Journal | Computers & Mathematics with Applications |
Volume | 67 |
Number | 1 |
Pagination | 164-171 |
DOI | 10.1016/j.camwa.2013.11.007 |
Public outreach
Supercomputing-Enabled Study of Subcellular Calcium Dynamics
2014. Status: Published
Affiliation | Scientific Computing |
Project(s) | Center for Biomedical Computing (SFF) |
Publication Type | Public outreach |
Year of Publication | 2014 |
Type of Work | Article in "meta" - a magazine published by the Notur project |
Talks, invited
Adopting Heterogeneous Hardware Platforms for Scientific Computing
In Guest lecture at Technical University of Denmark, December 5, 2013. Status: Published
Affiliation | Scientific Computing |
Project(s) | Center for Biomedical Computing (SFF) |
Publication Type | Talks, invited |
Year of Publication | 2013 |
Location of Talk | Guest lecture at Technical University of Denmark, December 5 |
Introduction to Scientific Writing
In Intensive course given at National University of Defense Technology, China, October 17-19, 2013. Status: Published
Affiliation | Scientific Computing |
Project(s) | Center for Biomedical Computing (SFF) |
Publication Type | Talks, invited |
Year of Publication | 2013 |
Location of Talk | Intensive course given at National University of Defense Technology, China, October 17-19 |
Scientific Computing on Accelerator-Based Supercomputers
In Guest lecture at FFI, September 20, 2013. Status: Published
Affiliation | Scientific Computing |
Project(s) | Center for Biomedical Computing (SFF) |
Publication Type | Talks, invited |
Year of Publication | 2013 |
Location of Talk | Guest lecture at FFI, September 20 |
Journal Article
Balancing Efficiency and Accuracy for Sediment Transport Simulations
Computational Science & Discovery 6 (2013): 015011. Status: Published
Simulating multi-lithology sediment transport requires numerically solving a fully-coupled system of nonlinear partial differential equations. The most standard approach is to simultaneously update all the unknown fields at each time step. Such a fully-implicit strategy is computationally demanding due to the need for Newton-Raphson iterations, each having to set up and solve a large system of linearized algebraic equations. Fully-explicit numerical schemes that do not solve linear systems are possible to devise, but suffer from lower numerical stability and accuracy. If we count the total number of floating-point operations needed to achieve stable numerical solutions with a prescribed level of accuracy, the fully-implicit approach probably wins over its fully-explicit counterpart. However, the latter may nevertheless win in the overall computation time, because computers achieve higher hardware efficiency for simpler numerical computations. Adding to this competition, there are semi-implicit numerical schemes that lie between the two extremes. This paper has two novel contributions. First, we devise a new semi-implicit scheme that has second-order accuracy in the temporal direction. Second, and more importantly, we propose a simple prediction model for the overall computation time on multicore architectures, applicable to many numerical implementations. Based on performance prediction, appropriate numerical schemes can be chosen by considering accuracy, stability, and computing speed at the same time. Our methodology is tested by numerical experiments modeling the sediment transport in Monterey Bay.
Affiliation | Scientific Computing |
Project(s) | Center for Biomedical Computing (SFF) |
Publication Type | Journal Article |
Year of Publication | 2013 |
Journal | Computational Science & Discovery |
Volume | 6 |
Number | 1 |
Pagination | 015011 |
DOI | 10.1088/1749-4699/6/1/015011 |
Resource-Efficient Utilization of CPU/GPU-Based Heterogeneous Supercomputers for Bayesian Phylogenetic Inference
The Journal of Supercomputing 66 (2013): 364-380. Status: Published
Affiliation | Scientific Computing |
Project(s) | Center for Biomedical Computing (SFF) |
Publication Type | Journal Article |
Year of Publication | 2013 |
Journal | The Journal of Supercomputing |
Volume | 66 |
Number | 1 |
Pagination | 364-380 |
DOI | 10.1007/s11227-013-0911-1 |
Simulating Cardiac Electrophysiology in the Era of GPU-Cluster Computing
IEICE Transactions on Information and Systems E96-D (2013): 2587-2595. Status: Published
Affiliation | Scientific Computing |
Publication Type | Journal Article |
Year of Publication | 2013 |
Journal | IEICE Transactions on Information and Systems |
Volume | E96-D |
Number | 12 |
Pagination | 2587-2595 |
DOI | 10.1587/transinf.E96.D.2587 |
Talks, contributed
Mint: a User-Friendly C-to-CUDA Code Translator
In Talk given at SIAM CSE'13, February 25, 2013. Status: Published
Aiming at automated source-to-source code translation from C to CUDA, we have developed the Mint framework. Users only need to annotate serial C code with a few compiler directives, specifying host-device data transfers plus the parallelization depth and granularity of loop nests. Mint then generates CUDA code as output, while carrying out on-chip memory optimizations that greatly benefit 3D stencil computations. Several real-world applications have been ported to the GPU using Mint.
Affiliation | Scientific Computing |
Project(s) | Center for Biomedical Computing (SFF) |
Publication Type | Talks, contributed |
Year of Publication | 2013 |
Location of Talk | Talk given at SIAM CSE'13, February 25 |
Keywords | Conference |
Proceedings, refereed
On the GPU Performance of 3D Stencil Computations Implemented in OpenCL
In Proceedings of International Supercomputing Conference, ISC 2013. Vol. 7905. Lecture Notes in Computer Science 7905. Berlin Heidelberg New York: Springer, 2013. Status: Published
Affiliation | Scientific Computing |
Publication Type | Proceedings, refereed |
Year of Publication | 2013 |
Conference Name | Proceedings of International Supercomputing Conference, ISC 2013 |
Volume | 7905 |
Pagination | 125-135 |
Publisher | Springer |
Place Published | Berlin Heidelberg New York |
Keywords | Conference |
DOI | 10.1007/978-3-642-38750-0_10 |
On the GPU Performance of Cell-Centered Finite Volume Method Over Unstructured Tetrahedral Meshes
In Proceedings of the 3rd Workshop on Irregular Applications: Architectures and Algorithms. New York: ACM, 2013. Status: Published
Affiliation | Scientific Computing |
Publication Type | Proceedings, refereed |
Year of Publication | 2013 |
Conference Name | Proceedings of the 3rd Workshop on Irregular Applications: Architectures and Algorithms |
Publisher | ACM |
Place Published | New York |
DOI | 10.1145/2535753.2535765 |
On the GPU-CPU Performance Portability of OpenCL for 3D Stencil Computations
In Proceedings of IEEE 19th International Conference on Parallel and Distributed Systems. Los Alamitos, California • Washington • Tokyo: IEEE, 2013. Status: Published
Affiliation | Scientific Computing |
Publication Type | Proceedings, refereed |
Year of Publication | 2013 |
Conference Name | Proceedings of IEEE 19th International Conference on Parallel and Distributed Systems |
Pagination | 78-85 |
Publisher | IEEE |
Place Published | Los Alamitos, California • Washington • Tokyo |
Keywords | Conference |
DOI | 10.1109/ICPADS.2013.23 |
Performance of Sediment Transport Simulations on NVIDIA's Kepler Architecture
In The International Conference on Computational Science, ICCS 2013. Vol. 18. Procedia Computer Science 18. Elsevier, 2013. Status: Published
Affiliation | Scientific Computing |
Project(s) | Center for Biomedical Computing (SFF) |
Publication Type | Proceedings, refereed |
Year of Publication | 2013 |
Conference Name | The International Conference on Computational Science, ICCS 2013 |
Volume | 18 |
Pagination | 1275-1281 |
Publisher | Elsevier |
Keywords | Conference |
DOI | 10.1016/j.procs.2013.05.294 |
Proceedings, refereed
A New Parallel 3D Front Propagation Algorithm for Fast Simulation of Geological Folds
In The International Conference on Computational Science, ICCS 2012. Vol. 9. Procedia Computer Science 9. Amsterdam: ICCS, 2012. Status: Published
We present a novel method for 3D anisotropic front propagation and apply it to the simulation of geological folding. The new iterative algorithm has a simple structure and abundant parallelism, and is easily adapted to multithreaded architectures using OpenMP. Moreover, we have used the automated C-to-CUDA source code translator, Mint, to achieve greatly enhanced computing speed on GPUs. Both the OpenMP and CUDA implementations have been tested and benchmarked on several examples of 3D geological folding.
Affiliation | Scientific Computing |
Project(s) | Center for Biomedical Computing (SFF) |
Publication Type | Proceedings, refereed |
Year of Publication | 2012 |
Conference Name | The International Conference on Computational Science, ICCS 2012 |
Volume | 9 |
Pagination | 947-955 |
Publisher | ICCS |
Place Published | Amsterdam |
Keywords | Conference |
DOI | 10.1016/j.procs.2012.04.101 |
Efficient Implementations of the Adams-Bashforth-Moulton Method for Solving Fractional Differential Equations
In Proceedings of FDA'12, 2012. Status: Published
Affiliation | Scientific Computing |
Publication Type | Proceedings, refereed |
Year of Publication | 2012 |
Conference Name | Proceedings of FDA'12 |
Keywords | Conference |
Using 1000+ GPUs and 10000+ CPUs for Sedimentary Basin Simulations
In Proceedings of IEEE Cluster 2012. Los Alamitos, California • Washington • Tokyo: IEEE, 2012. Status: Published
Affiliation | Scientific Computing |
Publication Type | Proceedings, refereed |
Year of Publication | 2012 |
Conference Name | Proceedings of IEEE Cluster 2012 |
Pagination | 27-35 |
Publisher | IEEE |
Place Published | Los Alamitos, California • Washington • Tokyo |
Keywords | Conference |
DOI | 10.1109/CLUSTER.2012.37 |
Journal Article
Accelerating a 3D Finite-Difference Earthquake Simulation With a C-to-CUDA Translator
Computing in Science & Engineering 14 (2012): 48-59. Status: Published
Affiliation | Scientific Computing |
Project(s) | Center for Biomedical Computing (SFF) |
Publication Type | Journal Article |
Year of Publication | 2012 |
Journal | Computing in Science & Engineering |
Volume | 14 |
Number | 3 |
Pagination | 48-59 |
DOI | 10.1109/MCSE.2012.44 |
Talks, invited
Elements of Scientific Computing
In 3-day intensive course given at National University of Defense Technology, China, October 16-18, 2012. Status: Published
Affiliation | Scientific Computing |
Project(s) | Center for Biomedical Computing (SFF) |
Publication Type | Talks, invited |
Year of Publication | 2012 |
Location of Talk | 3-day intensive course given at National University of Defense Technology, China, October 16-18 |
Scientific Computing Needs Supercomputers, But Also Something Else!
In Invited lecture at National University of Defense Technology, China, March 29, 2012. Status: Published
Affiliation | Scientific Computing |
Project(s) | Center for Biomedical Computing (SFF) |
Publication Type | Talks, invited |
Year of Publication | 2012 |
Location of Talk | Invited lecture at National University of Defense Technology, China, March 29 |
Public outreach
Simulating Basin Evolution on GPU-Enhanced Hybrid Supercomputers
2012. Status: Published
According to the Top500 list published in November 2011, three of the world's five most powerful supercomputers are GPU-enhanced clusters of multicore CPUs. This hardware trend is expected to prevail for the foreseeable future. It is therefore our intention to report here some experiences of using one such cutting-edge GPU-CPU cluster, when applied to simulations of sediment deposition in connection with basin evolution. Our observations are twofold: (1) Simple numerical algorithms are to be favored on homogeneous clusters of CPUs, and even more so on hybrid CPU-GPU clusters. (2) It is possible but challenging to utilize the computing power of both the CPU and GPU sides of a hybrid cluster.
Affiliation | Scientific Computing |
Project(s) | Center for Biomedical Computing (SFF) |
Publication Type | Public outreach |
Year of Publication | 2012 |
Talks, contributed
Some Perspectives on High-Performance Computing in the Geosciences
In Computational Geoscience Workshop, Geilo, January 19, 2012. Status: Published
Affiliation | Scientific Computing |
Project(s) | Center for Biomedical Computing (SFF) |
Publication Type | Talks, contributed |
Year of Publication | 2012 |
Location of Talk | Computational Geoscience Workshop, Geilo, January 19 |
Understanding the Performance of Stencil-Based Computations on Multicore CPU
In CBC Seminar series, 2012. Status: Published
Affiliation | Scientific Computing |
Project(s) | Center for Biomedical Computing (SFF) |
Publication Type | Talks, contributed |
Year of Publication | 2012 |
Location of Talk | CBC Seminar series |
Talks, contributed
A Function-Centric Generic Framework for Parallelization
In Talk at CLS Workshop at UiO on April 13, 2011. Status: Published
Affiliation | Scientific Computing |
Project(s) | Center for Biomedical Computing (SFF) |
Publication Type | Talks, contributed |
Year of Publication | 2011 |
Location of Talk | Talk at CLS Workshop at UiO on April 13 |
Efficient Computations of Initial-Value Problems Involving Fractional Derivatives
In Talk at the seminar on wave propagation in complex media, November 23, 2011. Status: Published
Affiliation | Scientific Computing |
Project(s) | Center for Biomedical Computing (SFF) |
Publication Type | Talks, contributed |
Year of Publication | 2011 |
Location of Talk | Talk at the seminar on wave propagation in complex media, November 23 |
Study of the Computational Efficiency for Different Usages of Pythoning
In Talk at CLS Workshop at UiO on April 13, 2011. Status: Published
Affiliation | Scientific Computing |
Project(s) | Center for Biomedical Computing (SFF) |
Publication Type | Talks, contributed |
Year of Publication | 2011 |
Location of Talk | Talk at CLS Workshop at UiO on April 13 |
Proceedings, refereed
An OpenMP-Enabled Parallel Simulator for Particle Transport in Fluid Flows
In Proceedings of the International Conference on Computational Science, ICCS 2011. Vol. 4. Procedia Computer Science 4. Elsevier Science, 2011. Status: Published
By using C/C++ programming and OpenMP parallelization, we implement a newly developed numerical strategy for simulating particle transport in sparsely particle-laden fluid flows. Due to the highly dynamic nature of the chosen numerical framework, the implementation needs to properly handle the moving, merging and splitting of a large number of particle lumps. We show that a careful division of the entire computational work into a set of distinctive tasks not only produces a clearly structured code, but also allows taskwise parallelization through appropriate use of OpenMP compiler directives. The performance of the OpenMP-enabled parallel simulator is tested on representative architectures of multicore-based shared memory, by running a large case of particle transport in a pipe flow. Attention is also given to a number of performance-critical features of the simulator.
Affiliation | Scientific Computing |
Publication Type | Proceedings, refereed |
Year of Publication | 2011 |
Conference Name | Proceedings of the International Conference on Computational Science, ICCS 2011 |
Volume | 4 |
Pagination | 1475-1484 |
Publisher | Elsevier Science |
DOI | 10.1016/j.procs.2011.04.160 |
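The directive-based parallelization mentioned in the abstract can be hinted at with a minimal sketch. Only the simplest kind of task is shown, advecting independent particle lumps; the Lump type and its update rule are invented for illustration and are not taken from the simulator:

```c
#include <stddef.h>

/* Hypothetical particle-lump record: position and velocity only. */
typedef struct { double x, v; } Lump;

/* Each lump update is independent of the others, so the loop can be
 * parallelized with a single OpenMP directive; merging and splitting
 * of lumps (handled by the real simulator) would need more care. */
void move_lumps(Lump *lumps, size_t n, double dt)
{
    #pragma omp parallel for
    for (size_t i = 0; i < n; i++)
        lumps[i].x += lumps[i].v * dt;
}
```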
Mint: Realizing CUDA Performance in 3D Stencil Methods With Annotated C
In Proceedings of the 25th International Conference on Supercomputing (ICS'11). ACM Press, 2011.Status: Published
Mint: Realizing CUDA Performance in 3D Stencil Methods With Annotated C
We present Mint, a programming model that enables the non-expert to enjoy the performance benefits of hand coded CUDA without becoming entangled in the details. Mint targets stencil methods, which are an important class of scientific applications. We have implemented the Mint programming model with a source-to-source translator that generates optimized CUDA C from traditional C source. The translator relies on annotations to guide translation at a high level. The set of pragmas is small, and the model is compact and simple. Yet, Mint is able to deliver performance competitive with painstakingly hand-optimized CUDA. We show that, for a set of widely used stencil kernels in two and three dimensions, Mint realized 80% of the performance obtained from aggressively optimized CUDA on the 200 series NVIDIA GPUs. Our optimizations target three dimensional kernels, which present a daunting array of optimizations.
Affiliation | Scientific Computing |
Project(s) | Center for Biomedical Computing (SFF) |
Publication Type | Proceedings, refereed |
Year of Publication | 2011 |
Conference Name | Proceedings of the 25th International Conference on Supercomputing (ICS'11) |
Pagination | 214-224 |
Publisher | ACM Press |
ISBN Number | 978-1-4503-0102-2 |
DOI | 10.1145/1995896.1995932 |
Talks, invited
Parallel Simulation of Particle Transport Using OpenMP
In Guest lecture at UCSD on January 31, 2011. Status: Published
Affiliation | Scientific Computing |
Project(s) | Center for Biomedical Computing (SFF) |
Publication Type | Talks, invited |
Year of Publication | 2011 |
Location of Talk | Guest lecture at UCSD on January 31 |
Programming With OpenMP and Mixed MPI-OpenMP
In Invited lecture during USIT's Research Computing Services training week, November 14-17, 2011. Status: Published
Affiliation | Scientific Computing |
Project(s) | Center for Biomedical Computing (SFF) |
Publication Type | Talks, invited |
Year of Publication | 2011 |
Location of Talk | Invited lecture during USIT's Research Computing Services training week, November 14-17 |
Programming With OpenMP and Mixed MPI-OpenMP
In Invited lecture at pre-conference workshop of NOTUR 2011, 2011. Status: Published
Affiliation | Scientific Computing |
Project(s) | Center for Biomedical Computing (SFF) |
Publication Type | Talks, invited |
Year of Publication | 2011 |
Location of Talk | Invited lecture at pre-conference workshop of NOTUR 2011 |
Journal Article
Stability of Two Time-Integrators for the Aliev-Panfilov System
International Journal of Numerical Analysis and Modeling 8 (2011): 427-442. Status: Published
Affiliation | Scientific Computing |
Project(s) | Center for Biomedical Computing (SFF) |
Publication Type | Journal Article |
Year of Publication | 2011 |
Journal | International Journal of Numerical Analysis and Modeling |
Volume | 8 |
Number | 3 |
Pagination | 427-442 |
Talks, invited
A Non-Invasive Approach to Parallelizing Sequential Simulators of Partial Differential Equations
In Guest lecture at UCSD on October 28, 2010. Status: Published
Affiliation | Scientific Computing |
Publication Type | Talks, invited |
Year of Publication | 2010 |
Location of Talk | Guest lecture at UCSD on October 28 |
Journal Article
Computational Modeling of the Initiation and Development of Spontaneous Intracellular Ca2+ Waves in Ventricular Myocytes
Philosophical Transactions of the Royal Society A 368 (2010): 3953-3965. Status: Published
Affiliation | Scientific Computing |
Project(s) | Center for Biomedical Computing (SFF) |
Publication Type | Journal Article |
Year of Publication | 2010 |
Journal | Philosophical Transactions of the Royal Society A |
Volume | 368 |
Number | 1925 |
Pagination | 3953-3965 |
Date Published | August |
DOI | 10.1098/rsta.2010.0146 |
Simplifying the Parallelization of Scientific Codes by a Function-Centric Approach in Python
Computational Science & Discovery 3 (2010): 015003. Status: Published
Affiliation | Scientific Computing |
Project(s) | Center for Biomedical Computing (SFF) |
Publication Type | Journal Article |
Year of Publication | 2010 |
Journal | Computational Science & Discovery |
Volume | 3 |
Pagination | 015003 |
DOI | 10.1088/1749-4699/3/1/015003 |
Talks, contributed
Detailed Numerical Analyses of the Aliev-Panfilov Model on GPGPU
In Talk at PARA2010 Conference, 2010. Status: Published
Affiliation | Scientific Computing |
Project(s) | Center for Biomedical Computing (SFF) |
Publication Type | Talks, contributed |
Year of Publication | 2010 |
Location of Talk | Talk at PARA2010 Conference |
OpenMP: an Easy Parallel Approach for Scientific Computing on Multi-Core Architecture
In A short course given at Simula in March and at the University of Oslo in May, 2010. Status: Published
Affiliation | Scientific Computing |
Publication Type | Talks, contributed |
Year of Publication | 2010 |
Location of Talk | A short course given at Simula in March and at the University of Oslo in May |
Optimizing the Aliev-Panfilov Model of Cardiac Excitation on Heterogeneous Systems
In Talk at Para 2010: State of the Art in Scientific and Parallel Computing, Reykjavik, June 6-9, 2010. Status: Published
The Aliev-Panfilov model is a simple model for signal propagation in cardiac tissue, and accounts for complex behavior such as how spiral waves break up and form elaborate patterns. Spiral waves can lead to life-threatening situations such as ventricular fibrillation. We discuss an implementation and underlying optimizations for the NVIDIA Tesla C1060 GPU, as well as an implementation on multiple GPUs running under MPI. We achieve nearly perfect scaling on 4 GPUs, running 58 times faster than a CPU-only implementation in single precision and 26 times faster in double precision.
Affiliation | Scientific Computing |
Project(s) | Center for Biomedical Computing (SFF) |
Publication Type | Talks, contributed |
Year of Publication | 2010 |
Location of Talk | Talk at Para 2010: State of the Art in Scientific and Parallel Computing in Reykjavik on June 6-9, 2010 |
Parallel Programming Using Python
In CBC Seminar on advanced use of the Python programming language, 2010. Status: Published
Affiliation | Scientific Computing |
Project(s) | Center for Biomedical Computing (SFF) |
Publication Type | Talks, contributed |
Year of Publication | 2010 |
Location of Talk | CBC Seminar on advanced use of the Python programming language |
Book
Elements of Scientific Computing
Berlin / Heidelberg: Springer, 2010. Status: Published
Affiliation | Scientific Computing |
Project(s) | Center for Biomedical Computing (SFF) |
Publication Type | Book |
Year of Publication | 2010 |
Date Published | October |
Publisher | Springer |
Place Published | Berlin / Heidelberg |
ISBN Number | 978-3-642-11298-0 |
DOI | 10.1007/978-3-642-1129 |
Proceedings, refereed
Numerical Analysis of a Dual-Sediment Transport Model Applied to Lake Okeechobee, Florida
In Proceedings of the 2010 Ninth International Symposium on Parallel and Distributed Computing (ISPDC '10). IEEE Computer Society Press, 2010. Status: Published
In this work, we study two numerical strategies for solving a coupled system of distinct nonlinear partial differential equations, which can be used to model dual-lithology sedimentation. Using high-resolution bathymetry data of Lake Okeechobee, Florida, we study the stability and computational speed of these numerical strategies. The fully explicit scheme is straightforward to implement and requires a relatively small amount of computation per time step. However, this simple numerical strategy has to use small time steps to ensure stability. These small time steps may render the explicit solver impractical for long-term and high-resolution basin simulations. As a comparison, we have implemented a semi-implicit scheme, where the two partial differential equations at each time step are solved implicitly in sequence. This semi-implicit scheme is numerically stable even for very large time steps. Using parallel computing, we have applied both schemes to a realistic case, Lake Okeechobee, Florida. The simulation successfully diffused material along a river channel and into the lake. Both MPI-based implementations demonstrated satisfactory parallel efficiency on a multicore-based cluster.
Affiliation | Scientific Computing |
Project(s) | Center for Biomedical Computing (SFF) |
Publication Type | Proceedings, refereed |
Year of Publication | 2010 |
Conference Name | Proceedings of the 2010 Ninth International Symposium on Parallel and Distributed Computing |
Pagination | 189-194 |
Publisher | IEEE Computer Society Press |
ISBN Number | 978-1-4244-7602-2 |
DOI | 10.1109/ISPDC.2010.29 |
Book Chapter
Parallel Computing Engines for Subsurface Imaging Technologies
In Advanced Computational Infrastructures for Parallel and Distributed Adaptive Applications, 29-43. Wiley Series on Parallel and Distributed Computing. Hoboken, New Jersey: Wiley, 2010. Status: Published
Affiliation | Scientific Computing |
Project(s) | Center for Biomedical Computing (SFF) |
Publication Type | Book Chapter |
Year of Publication | 2010 |
Book Title | Advanced Computational Infrastructures for Parallel and Distributed Adaptive Applications |
Secondary Title | Wiley Series on Parallel and Distributed Computing |
Chapter | 3 |
Pagination | 29-43 |
Publisher | Wiley |
Place Published | Hoboken, New Jersey |
ISBN Number | 978-0-470-07294-3 |
Journal Article
A Multilevel Approach for the Satisfiability Problem
ISAST Transactions on Computers and Intelligent Systems 1 (2009): 29-37. Status: Published
Affiliation | Scientific Computing |
Project(s) | Center for Biomedical Computing (SFF) |
Publication Type | Journal Article |
Year of Publication | 2009 |
Journal | ISAST Transactions on Computers and Intelligent Systems |
Volume | 1 |
Number | 2 |
Pagination | 29-37 |
A Study on Modified Szabo's Wave Equation Modeling of Frequency-Dependent Dissipation in Ultrasonic Medical Imaging
Physica Scripta 2009 (2009): 014014. Status: Published
Affiliation | Scientific Computing |
Project(s) | Center for Biomedical Computing (SFF) |
Publication Type | Journal Article |
Year of Publication | 2009 |
Journal | Physica Scripta |
Volume | 2009 |
Number | T136 |
Pagination | 014014 |
DOI | 10.1088/0031-8949/2009/T136/014014 |
Analysis of Tracer Tomography Using Temporal Moments of Tracer Breakthrough Curves
Advances in Water Resources 32 (2009): 391-400. Status: Published
Affiliation | Scientific Computing |
Project(s) | Center for Biomedical Computing (SFF) |
Publication Type | Journal Article |
Year of Publication | 2009 |
Journal | Advances in Water Resources |
Volume | 32 |
Number | 3 |
Pagination | 391-400 |
DOI | 10.1016/j.advwatres.2008.12.001 |
Towards a Computational Method for Imaging the Extracellular Potassium Concentration During Regional Ischemia
Mathematical Biosciences 220 (2009): 118-130. Status: Published
Affiliation | Scientific Computing |
Publication Type | Journal Article |
Year of Publication | 2009 |
Journal | Mathematical Biosciences |
Volume | 220 |
Number | 2 |
Pagination | 118-130 |
DOI | 10.1016/j.mbs.2009.05.004 |
Proceedings, refereed
Evolution of Intracellular Ca2+ Waves From About 10,000 RyR Clusters: Towards Solving a Computationally Daunting Task
In The Fifth International Conference on Functional Imaging and Modeling of the Heart. Lecture Notes in Computer Science, vol. 5528. Springer, 2009. Status: Published
Affiliation | Scientific Computing |
Project(s) | Center for Biomedical Computing (SFF) |
Publication Type | Proceedings, refereed |
Year of Publication | 2009 |
Conference Name | The Fifth International Conference on Functional Imaging and Modeling of the Heart |
Pagination | 11-20 |
Publisher | Springer |
DOI | 10.1007/978-3-642-01932-6 |
Poster
Parallel Simulation of Dual Lithology Sedimentation
2009. Status: Published
Affiliation | Scientific Computing |
Project(s) | Center for Biomedical Computing (SFF) |
Publication Type | Poster |
Year of Publication | 2009 |
Notes | Second prize winner of the poster competition at the conference. |
Book Chapter
Past and Future Perspectives on Scientific Software
In Simula Research Laboratory - by thinking constantly about it, 321-362. Heidelberg: Springer, 2009. Status: Published
Affiliation | Scientific Computing |
Project(s) | Center for Biomedical Computing (SFF) |
Publication Type | Book Chapter |
Year of Publication | 2009 |
Book Title | Simula Research Laboratory - by thinking constantly about it |
Chapter | 23 |
Pagination | 321-362 |
Publisher | Springer |
Place Published | Heidelberg |
ISBN Number | 978-3-642-01155-9 |
Book Chapter
A Multilevel Greedy Algorithm for the Satisfiability Problem
In Advances in Greedy Algorithms, 39-54. Vienna: IN-TECH Education and Publishing, 2008. Status: Published
Affiliation | Scientific Computing |
Project(s) | Center for Biomedical Computing (SFF) |
Publication Type | Book Chapter |
Year of Publication | 2008 |
Book Title | Advances in Greedy Algorithms |
Chapter | 3 |
Pagination | 39-54 |
Publisher | IN-TECH Education and Publishing |
Place Published | Vienna |
ISBN Number | 978-953-7619-27-5 |
Journal Article
A View Toward the Future of Subsurface Characterization: CAT Scanning Groundwater Basins
Water Resources Research 44 (2008). Status: Published
In this opinion paper we contend that high-resolution characterization, monitoring, and prediction are the key elements to advancing and reducing uncertainty in our understanding and prediction of subsurface processes at basin scales. First, we advocate that recently developed tomographic surveying is an effective and high-resolution approach for characterizing the field-scale subsurface. Fusion of different types of tomographic surveys further enhances the characterization. A basin is an appropriate scale for many water resources management purposes. We therefore propose the expansion of the tomographic surveying and data fusion concept to basin-scale characterization. In order to facilitate basin-scale tomographic surveys, different types of passive, basin-scale, CAT scan technologies are suggested that exploit recurrent natural stimuli (e.g., lightning, earthquakes, storm events, barometric variations, river-stage variations) as sources of excitations, along with implementation of sensor networks that provide long-term and spatially distributed monitoring of excitation as well as response signals on the land surface and in the subsurface. This vision for basin-scale subsurface characterization faces many significant technological challenges and requires interdisciplinary collaborations (e.g., surface and subsurface hydrology, geophysics, geology, geochemistry, information and sensor technology, applied mathematics, atmospheric science). We nevertheless contend that this should be a future direction for subsurface science research.
Affiliation | Scientific Computing |
Project(s) | Center for Biomedical Computing (SFF) |
Publication Type | Journal Article |
Year of Publication | 2008 |
Journal | Water Resources Research |
Volume | 44 |
Notes | Citation number: W03301 |
DOI | 10.1029/2007WR006375 |
Talks, contributed
High-Performance Computing on Distributed-Memory Architecture
In Lecture at the 2008 Winter School on Parallel Computing, Jan. 20-25, Geilo, Norway, 2008. Status: Published
Affiliation | Scientific Computing |
Project(s) | Center for Biomedical Computing (SFF) |
Publication Type | Talks, contributed |
Year of Publication | 2008 |
Location of Talk | Lecture at the 2008 Winter School on Parallel Computing, Jan. 20-25, Geilo, Norway |
Parallel Computing; Why & How?
In Lecture at the 2008 Winter School on Parallel Computing, Jan. 20-25, Geilo, Norway, 2008. Status: Published
Affiliation | Scientific Computing |
Project(s) | Center for Biomedical Computing (SFF) |
Publication Type | Talks, contributed |
Year of Publication | 2008 |
Location of Talk | Lecture at the 2008 Winter School on Parallel Computing, Jan. 20-25, Geilo, Norway |
Resource-Efficient Simulation Of Tsunami Wave Propagation on Parallel Computers
In Invited talk at 2nd International Symposium for Integrated Predictive Simulation System for Earthquake and Tsunami Disaster, October 21-22, Tokyo, Japan, 2008. Status: Published
Affiliation | Scientific Computing |
Project(s) | Center for Biomedical Computing (SFF) |
Publication Type | Talks, contributed |
Year of Publication | 2008 |
Location of Talk | Invited talk at 2nd International Symposium for Integrated Predictive Simulation System for Earthquake and Tsunami Disaster, October 21-22, Tokyo, Japan |
Simulation of Tsunami Propagation
In Talk at the 2nd eScience Meeting, Jan. 21-22, Geilo, Norway, 2008. Status: Published
Affiliation | Scientific Computing |
Project(s) | Center for Biomedical Computing (SFF) |
Publication Type | Talks, contributed |
Year of Publication | 2008 |
Location of Talk | Talk at the 2nd eScience Meeting, Jan. 21-22, Geilo, Norway |
Use of Advanced Computing in Tomographic Surveys
In Talk at PARA 2008, May 13-16, Trondheim, Norway, 2008. Status: Published
Affiliation | Scientific Computing |
Project(s) | Center for Biomedical Computing (SFF) |
Publication Type | Talks, contributed |
Year of Publication | 2008 |
Location of Talk | Talk at PARA 2008, May 13-16, Trondheim, Norway |
Proceedings, refereed
On the Efficiency of Python for High-Performance Computing: a Case Study Involving Stencil Updates for Partial Differential Equations
In Modeling, Simulation and Optimization of Complex Processes. LNCSE. Springer, 2008. Status: Published
Affiliation | Scientific Computing |
Project(s) | Center for Biomedical Computing (SFF) |
Publication Type | Proceedings, refereed |
Year of Publication | 2008 |
Conference Name | Modeling, Simulation and Optimization of Complex Processes |
Pagination | 337-358 |
Publisher | Springer |
ISBN Number | 978-3-540-23027-4 |
Edited books
Quantitative Information Fusion for Hydrological Sciences
Vol. 79 in Studies in Computational Intelligence. Springer, 2008. Status: Published
In a rapidly evolving world of knowledge and technology, do you ever wonder how hydrology is catching up? This book takes the angle of computational hydrology and envisions one of the future directions, namely, quantitative integration of high-quality hydrologic field data with geologic, hydrologic, chemical, atmospheric, and biological information to characterize and predict natural systems in hydrological sciences. Intelligent computation and information fusion are the key words. The aim is to provide both established scientists and graduate students with a summary of recent developments in this topic. The chapters of this edited volume cover some of the most important ingredients for quantitative hydrological information fusion, including data fusion techniques, interactive computational environments, and supporting mathematical and numerical methods. Real-life applications of hydrological information fusion are also addressed.
Affiliation | Scientific Computing |
Project(s) | Center for Biomedical Computing (SFF) |
Publication Type | Edited books |
Year of Publication | 2008 |
Volume | 79 in Studies in Computational Intelligence |
Date Published | January, 2008 |
Publisher | Springer |
ISBN Number | 978-3-540-75383-4 |
Journal Article
A Note on the Efficiency of the Conjugate Gradient Method for a Class of Time-Dependent Problems
Numerical Linear Algebra with Applications 14 (2007): 459-467. Status: Published
Affiliation | Scientific Computing |
Project(s) | Center for Biomedical Computing (SFF) |
Publication Type | Journal Article |
Year of Publication | 2007 |
Journal | Numerical Linear Algebra with Applications |
Volume | 14 |
Number | 5 |
Pagination | 459-467 |
A Unified Framework of Multi-Objective Cost Functions for Partitioning Unstructured Finite Element Meshes
Applied Mathematical Modelling 31 (2007): 1711-1728. Status: Published
Affiliation | Scientific Computing |
Project(s) | Center for Biomedical Computing (SFF) |
Publication Type | Journal Article |
Year of Publication | 2007 |
Journal | Applied Mathematical Modelling |
Volume | 31 |
Number | 9 |
Pagination | 1711-1728 |
An Order Optimal Solver for the Discretized Bidomain Equations
Numerical Linear Algebra with Applications 14 (2007): 83-98. Status: Published
Affiliation | Scientific Computing |
Project(s) | Center for Biomedical Computing (SFF) |
Publication Type | Journal Article |
Year of Publication | 2007 |
Journal | Numerical Linear Algebra with Applications |
Volume | 14 |
Number | 2 |
Pagination | 83-98 |
On the Possibility for Computing the Transmembrane Potential in the Heart With a One Shot Method; an Inverse Problem
Mathematical Biosciences 210 (2007): 523-553. Status: Published
Affiliation | Scientific Computing |
Project(s) | Center for Biomedical Computing (SFF) |
Publication Type | Journal Article |
Year of Publication | 2007 |
Journal | Mathematical Biosciences |
Volume | 210 |
Number | 2 |
Pagination | 523-553 |
Talks, contributed
Bridging the Gap Between Computational Scientists and HPC
In Article published in Meta, Number 3, 2007. Status: Published
Affiliation | Scientific Computing |
Project(s) | Center for Biomedical Computing (SFF) |
Publication Type | Talks, contributed |
Year of Publication | 2007 |
Location of Talk | Article published in Meta, Number 3 |
Building Hybrid Parallel PDE Software by Domain Decomposition and Object-Oriented Programming
In Talk at the ICCM 2007 Conference, April 4-6, Hiroshima, Japan, 2007. Status: Published
Affiliation | Scientific Computing |
Project(s) | Center for Biomedical Computing (SFF) |
Publication Type | Talks, contributed |
Year of Publication | 2007 |
Location of Talk | Talk at the ICCM 2007 Conference, April 4-6, Hiroshima, Japan |
Making Parallel PDE Software by Object-Oriented Programming
In Guest lecture given at Hohai University, China, May 17, 2007. Status: Published
Affiliation | Scientific Computing |
Project(s) | Center for Biomedical Computing (SFF) |
Publication Type | Talks, contributed |
Year of Publication | 2007 |
Location of Talk | Guest lecture given at Hohai University, China, May 17 |
On a Future Software Platform for Demanding Multi-Scale and Multi-Physics Problems
In Talk at SIAM CSE07 Conference, Costa Mesa, CA, Feb. 19-23, 2007. Status: Published
Affiliation | Scientific Computing |
Project(s) | Center for Biomedical Computing (SFF) |
Publication Type | Talks, contributed |
Year of Publication | 2007 |
Location of Talk | Talk at SIAM CSE07 Conference, Costa Mesa, CA, Feb. 19-23 |
On Building Parallel Algorithms and Software for Hydraulic Tomography
In Talk at SIAM GS2007 Conference, March 19-22, Santa Fe, New Mexico, USA, 2007. Status: Published
Affiliation | Scientific Computing |
Project(s) | Center for Biomedical Computing (SFF) |
Publication Type | Talks, contributed |
Year of Publication | 2007 |
Location of Talk | Talk at SIAM GS2007 Conference, March 19-22, Santa Fe, New Mexico, USA |
Parallelisation and Numerical Performance of a 3D Model for Coupled Deformation, Fluid Flow, and Heat Transport in Porous Geological Formations
In Talk at the Fourth National Conference on Computational Mechanics (MekIT'07), Trondheim, Norway, 2007. Status: Published
In this paper, we present some parallel performance results for a 3D simulator of coupled deformation, fluid flow and heat transfer in sedimentary basins. The model parameters are derived from an industry simulator, with realistic material properties and complex irregular grids of up to 1.5 million nodes with 7.3 million degrees of freedom. We have performed parallelisation on the linear algebra level using the ML algebraic multigrid preconditioner with iterative methods in the Diffpack finite element framework. Implementation and speedup results are presented.
Affiliation | Scientific Computing |
Project(s) | Center for Biomedical Computing (SFF) |
Publication Type | Talks, contributed |
Year of Publication | 2007 |
Location of Talk | Talk at the Fourth National Conference on Computational Mechanics (MekIT'07), Trondheim, Norway |
Notes | Presented by J. B. Haga |
Simulating Tsunami Propagation on Parallel Computers Using a Hybrid Software Framework
In Guest lecture given at the University of Stuttgart, March 12, 2007. Status: Published
Affiliation | Scientific Computing |
Project(s) | Center for Biomedical Computing (SFF) |
Publication Type | Talks, contributed |
Year of Publication | 2007 |
Location of Talk | Guest lecture given at the University of Stuttgart, March 12 |
Proceedings, refereed
Making Hybrid Tsunami Simulators in a Parallel Software Framework
In International Workshop on Applied Parallel Computing (PARA'06). Lecture Notes in Computer Science, volume 4699. Berlin Heidelberg: Springer Verlag, 2007. Status: Published
Affiliation | Scientific Computing |
Project(s) | Center for Biomedical Computing (SFF) |
Publication Type | Proceedings, refereed |
Year of Publication | 2007 |
Conference Name | International Workshop on Applied Parallel Computing (PARA'06) |
Pagination | 686-693 |
Publisher | Springer Verlag |
Place Published | Berlin Heidelberg |
ISBN Number | 978-3-540-75754-2 |
Parallelisation and Numerical Performance of a 3D Model for Coupled Deformation, Fluid Flow and Heat Transfer in Sedimentary Basins
In MekIT'07. Fourth National Conference on Computational Mechanics. Trondheim: Tapir Academic Press, 2007. Status: Published
In this paper, we present some parallel performance results for a 3D simulator of coupled deformation, fluid flow and heat transfer in sedimentary basins. The model parameters are derived from an industry simulator, with realistic material properties and complex irregular grids of up to 1.5 million nodes with 7.3 million degrees of freedom. We have performed parallelisation on the linear algebra level using the ML algebraic multigrid preconditioner with iterative methods in the Diffpack finite element framework. Implementation and speedup results are presented.
Affiliation | Scientific Computing |
Project(s) | Center for Biomedical Computing (SFF) |
Publication Type | Proceedings, refereed |
Year of Publication | 2007 |
Conference Name | MekIT'07. Fourth National Conference on Computational Mechanics |
Pagination | 151-162 |
Date Published | May |
Publisher | Tapir Academic Press |
Place Published | Trondheim |
ISBN Number | 978-82-519-2235-7 |
Talks, contributed
A Hybrid Software Framework for Parallel Tsunami Simulations
In Talk at SIAM PP06 Conference, February 22-24, San Francisco, 2006. Status: Published
Publication Type | Talks, contributed |
Year of Publication | 2006 |
Location of Talk | Talk at SIAM PP06 Conference, February 22-24, 2006, San Francisco |
Computational Issues in Heart Modeling
In Presented at the Johann Radon Institute for Computational and Applied Mathematics, Linz, Austria, 2006. Status: Published
Affiliation | Scientific Computing |
Publication Type | Talks, contributed |
Year of Publication | 2006 |
Location of Talk | Presented at the Johann Radon Institute for Computational and Applied Mathematics, Linz, Austria |
Fusion of Hydraulic and Tracer Tomography for DNAPL Detection
In Poster presented at AGU Fall Meeting 2006, Dec. 11-15, San Francisco, 2006. Status: Published
Publication Type | Talks, contributed |
Year of Publication | 2006 |
Location of Talk | Poster presented at AGU Fall Meeting 2006, Dec. 11-15, San Francisco |
Hybrid Parallelization of a 3D Transient Hydraulic Tomography Code
In Poster presented at Western Pacific Geophysics Meeting 2006, Beijing, July 24-27, 2006. Status: Published
Publication Type | Talks, contributed |
Year of Publication | 2006 |
Location of Talk | Poster presented at Western Pacific Geophysics Meeting 2006, Beijing, July 24-27 |
On the Use of the Bidomain Equations for Computing the Transmembrane Potential Throughout the Heart Wall: an Inverse Problem
In Presented at the Computers in Cardiology conference in Valencia, Spain, 2006. Status: Published
Affiliation | Scientific Computing |
Publication Type | Talks, contributed |
Year of Publication | 2006 |
Location of Talk | Presented at the Computers in Cardiology conference in Valencia, Spain |
Parallel Computational Methodology for Hydraulic Tomography
In Poster presented at AGU Fall Meeting 2006, San Francisco, Dec. 11-15, 2006. Status: Published
Publication Type | Talks, contributed |
Year of Publication | 2006 |
Location of Talk | Poster presented at AGU Fall Meeting 2006, San Francisco, Dec. 11-15 |
Parallel Programming and Computing for Large-Scale Hydraulic Tomography
In Poster presented at Workshop on Hydraulic Tomography, Boise, June 8-9, 2006. Status: Published
Publication Type | Talks, contributed |
Year of Publication | 2006 |
Location of Talk | Poster presented at Workshop on Hydraulic Tomography, Boise, June 8-9 |
Parallelizing Serial PDE Software Using a Generic Approach
In Seminar at the University of Arizona, February 27, 2006. Status: Published
Publication Type | Talks, contributed |
Year of Publication | 2006 |
Location of Talk | Seminar at the University of Arizona, February 27 |
Python in High Performance Computing
In Tutorial presented at the Para06 Workshop, 2006. Status: Published
Affiliation | Scientific Computing |
Publication Type | Talks, contributed |
Year of Publication | 2006 |
Location of Talk | Tutorial presented at the Para06 Workshop |
Simulating Tsunamis on Parallel Computers
In Invited talk at Notur 2006 Conference, May 11-12, Bergen, Norway, 2006. Status: Published
Publication Type | Talks, contributed |
Year of Publication | 2006 |
Location of Talk | Invited talk at Notur 2006 Conference, May 11-12, Bergen, Norway |
Book
Computing the Electrical Activity in the Heart
Berlin Heidelberg: Springer, 2006. Status: Published
This book describes mathematical models and numerical techniques for simulating the electrical activity in the heart. The book gives an introduction to the most important models of the field, followed by a detailed description of numerical techniques for the models. Particular focus is on efficient numerical methods for large scale simulations on both scalar and parallel computers. The results presented in the book will be of particular interest to researchers in bioengineering and computational biology, who face the challenge of solving these complex mathematical models efficiently. The book will also serve as a valuable introduction to a new and exciting field for computational scientists and applied mathematicians.
Affiliation | Scientific Computing |
Publication Type | Book |
Year of Publication | 2006 |
Publisher | Springer |
Place Published | Berlin Heidelberg |
ISBN Number | 3-540-33432-7 |
Book Chapter
Full-Scale Simulation of Cardiac Electrophysiology on Parallel Computers
In Numerical Solution of Partial Differential Equations on Parallel Computers, 385-411. Lecture Notes in Computational Science and Engineering. Springer, 2006. Status: Published
Affiliation | Scientific Computing |
Publication Type | Book Chapter |
Year of Publication | 2006 |
Book Title | Numerical Solution of Partial Differential Equations on Parallel Computers |
Secondary Title | Lecture Notes in Computational Science and Engineering |
Pagination | 385-411 |
Publisher | Springer |
Parallelizing PDE Solvers Using the Python Programming Language
In Numerical Solution of Partial Differential Equations on Parallel Computers, 295-325. Lecture Notes in Computational Science and Engineering. Springer, 2006. Status: Published
Affiliation | Scientific Computing |
Publication Type | Book Chapter |
Year of Publication | 2006 |
Book Title | Numerical Solution of Partial Differential Equations on Parallel Computers |
Secondary Title | Lecture Notes in Computational Science and Engineering |
Pagination | 295-325 |
Publisher | Springer |
Proceedings, non-refereed
Identifying Ischemic Heart Disease in Terms of ECG Recordings and an Inverse Problem for the Bidomain Equations; Modeling and Experiments
In The Third International Conference "Inverse Problems: Modeling and Simulation". Literatür Yayincilik Ltd, 2006. Status: Published
Affiliation | Scientific Computing |
Publication Type | Proceedings, non-refereed |
Year of Publication | 2006 |
Conference Name | The Third International Conference "Inverse Problems: Modeling and Simulation" |
Pagination | 138-140 |
Publisher | Literatür Yayincilik Ltd. |
ISBN Number | 975-04-0381-9 |
Proceedings, refereed
Improving the Performance of Large-Scale Unstructured PDE Applications
In Proceedings of the PARA'04 Workshop, June 20-23, 2004, Lyngby, Denmark. Lecture Notes in Computer Science, volume 3732. Springer, 2006. Status: Published
Publication Type | Proceedings, refereed |
Year of Publication | 2006 |
Conference Name | Proceedings of the PARA'04 Workshop, June 20-23, 2004, Lyngby, Denmark |
Pagination | 699-708 |
Publisher | Springer |
ISBN Number | 3-540-29067-2 |
On the Use of the Bidomain Equations for Computing the Transmembrane Potential Throughout the Heart Wall: an Inverse Problem
In Computers in Cardiology 2006. Computers in Cardiology, 2006. Status: Published
Affiliation | Scientific Computing |
Publication Type | Proceedings, refereed |
Year of Publication | 2006 |
Conference Name | Computers in Cardiology 2006 |
Pagination | 797-800 |
Publisher | Computers in Cardiology |
ISSN Number | 0276-6547 |
Parallel Simulation of Tsunamis Using a Hybrid Software Approach
In Proceedings of the International Conference ParCo 2005, September 13-16, Malaga, Spain. Volume 33 in NIC series. John von Neumann Institute for Computing, 2006. Status: Published
Affiliation | Scientific Computing |
Publication Type | Proceedings, refereed |
Year of Publication | 2006 |
Conference Name | Proceedings of the International Conference ParCo 2005, September 13-16, Malaga, Spain |
Pagination | 383-390 |
Publisher | John von Neumann Institute for Computing |
ISBN Number | 3-00-017352-8 |
Journal Article
On the Computational Complexity of the Bidomain and the Monodomain Models of Electrophysiology
Annals of Biomedical Engineering 34 (2006): 1088-1097. Status: Published
Affiliation | Scientific Computing |
Publication Type | Journal Article |
Year of Publication | 2006 |
Journal | Annals of Biomedical Engineering |
Volume | 34 |
Number | 7 |
Pagination | 1088-1097 |
Date Published | July |
Journal Article
A Numerical Method for Computing the Profile of Weld Pool Surfaces
International Journal for Computational Methods in Engineering Science and Mechanics 6 (2005): 115-125. Status: Published
Affiliation | Scientific Computing |
Publication Type | Journal Article |
Year of Publication | 2005 |
Journal | International Journal for Computational Methods in Engineering Science and Mechanics |
Volume | 6 |
Number | 2 |
Pagination | 115-125 |
A Parallel Multi-Subdomain Strategy for Solving Boussinesq Water Wave Equations
Advances in Water Resources 28 (2005): 215-233. Status: Published
Affiliation | Scientific Computing |
Publication Type | Journal Article |
Year of Publication | 2005 |
Journal | Advances in Water Resources |
Volume | 28 |
Number | 3 |
Pagination | 215-233 |
Date Published | March |
On the Performance of the Python Programming Language for Serial and Parallel Scientific Computations
Scientific Programming 13 (2005): 31-56. Status: Published
Affiliation | Scientific Computing |
Publication Type | Journal Article |
Year of Publication | 2005 |
Journal | Scientific Programming |
Volume | 13 |
Number | 1 |
Pagination | 31-56 |
Technical reports
An Order Optimal Solver for the Discretized Bidomain Equations
Simula Research Laboratory, 2005. Status: Published
Affiliation | Scientific Computing |
Project(s) | No Simula project |
Publication Type | Technical reports |
Year of Publication | 2005 |
Publisher | Simula Research Laboratory |
Notes | This technical report is an earlier version of a journal article. The journal article can be found here: https://www.simula.no/publications/order-optimal-solver-discretized-bidomain-equations |
Talks, contributed
Parallel Simulation of Tsunamis Using a Hybrid Software Approach
In Talk at ParCo 2005 Conference, 13 - 16 September, Malaga, Spain, 2005. Status: Published
Affiliation | Scientific Computing |
Publication Type | Talks, contributed |
Year of Publication | 2005 |
Location of Talk | Talk at ParCo 2005 Conference, 13 - 16 September, Malaga, Spain |
Parallelization of PDE Codes
In Talk at the CMA Workshop on High-Performance Computing in Physics, November 4, Oslo, Norway, 2005. Status: Published
Affiliation | Scientific Computing |
Publication Type | Talks, contributed |
Year of Publication | 2005 |
Location of Talk | Talk at the CMA Workshop on High-Performance Computing in Physics, November 4, Oslo, Norway |
Solving Boussinesq Water Wave Equations on Parallel Computers
In Talk at the International Workshop on Numerical Ocean Modeling, Oslo, Norway, 2005. Status: Published
Affiliation | Scientific Computing |
Publication Type | Talks, contributed |
Year of Publication | 2005 |
Location of Talk | Talk at the International Workshop on Numerical Ocean Modeling, Oslo, Norway |
Book Chapter
A Numerical Study of Some Parallel Algebraic Preconditioners
In Parallel and Distributed Scientific and Engineering Computing: Practice and Experience, 9-21. Nova Science Publishers, 2004. Status: Published
Publication Type | Book Chapter |
Year of Publication | 2004 |
Book Title | Parallel and Distributed Scientific and Engineering Computing: Practice and Experience |
Pagination | 9-21 |
Publisher | Nova Science Publishers |
Notes | An earlier version is included in Proceedings of the IPDPS 2003 Conference, Nice, France, April 2003, IEEE Computer Society |
Parallel Solution of the Bidomain Equations With High Resolutions
In Parallel Computing: Software Technology, Algorithms, Architectures & Applications, 837-844. Elsevier Science, 2004. Status: Published
Affiliation | Scientific Computing |
Publication Type | Book Chapter |
Year of Publication | 2004 |
Book Title | Parallel Computing: Software Technology, Algorithms, Architectures & Applications |
Pagination | 837-844 |
Publisher | Elsevier Science |
Talks, contributed
Using Linux Clusters for Full-Scale Simulation of Cardiac Electrophysiology
In Invited talk at the fifth annual workshop on Linux Clusters for Super Computing, October 18-21, 2004, Linköping, Sweden, 2004. Status: Published
Publication Type | Talks, contributed |
Year of Publication | 2004 |
Location of Talk | Invited talk at the fifth annual workshop on Linux Clusters for Super Computing, October 18-21, 2004, Linköping, Sweden |
Journal Article
Using the Parallel Algebraic Recursive Multilevel Solver in Modern Physical Applications
Future Generation Computer Systems 20 (2004): 489-500. Status: Published
Publication Type | Journal Article |
Year of Publication | 2004 |
Journal | Future Generation Computer Systems |
Volume | 20 |
Number | 3 |
Pagination | 489-500 |
Notes | An earlier version appeared as Technical Report 2002-106 at the Minnesota Supercomputing Institute |
Proceedings, refereed
A Flexible Architecture for Welding Simulators Used in Weld Planning
In Proceedings of the International Conference on Productive Welding in Industrial Applications. Lappeenranta, Finland, 2003. Status: Published
Affiliation | Scientific Computing |
Publication Type | Proceedings, refereed |
Year of Publication | 2003 |
Conference Name | Proceedings of International Conference on Productive Welding in Industrial Applications |
Date Published | May |
Place Published | Lappeenranta, Finland |
Talks, contributed
A Numerical Study of Some Parallel Algebraic Preconditioners
In Talk at the IPDPS 2003 Conference, April 22-26, 2003, Nice, France, 2003. Status: Published
Publication Type | Talks, contributed |
Year of Publication | 2003 |
Location of Talk | Talk at the IPDPS 2003 Conference, April 22-26, 2003, Nice, France |
Computing the Electrical Activity in the Human Heart
In Presented at the European Conference on Numerical Mathematics and Advanced Applications, Prague, Czech Republic, 2003. Status: Published
Affiliation | Scientific Computing |
Publication Type | Talks, contributed |
Year of Publication | 2003 |
Location of Talk | Presented at the European Conference on Numerical Mathematics and Advanced Applications, Prague, Czech Republic |
Computing the Electrical Activity in the Human Heart
In Presented at the Centre of Mathematics for Applications, Oslo, 2003. Status: Published
Affiliation | Scientific Computing |
Publication Type | Talks, contributed |
Year of Publication | 2003 |
Location of Talk | Presented at the Centre of Mathematics for Applications, Oslo |
Computing the Heart
In Presented at the 21st CAD-FEM users' meeting 2003 - International congress on FEM technology, Potsdam, Germany, 2003. Status: Published
Affiliation | Scientific Computing |
Publication Type | Talks, contributed |
Year of Publication | 2003 |
Location of Talk | Presented at the 21st CAD-FEM users' meeting 2003 - International congress on FEM technology, Potsdam, Germany |
Mathematical and Numerical Modeling of Medical Ultrasound Wave Propagation
In Invited talk at the MACSI-Workshop for Numerical Simulations for Ultrasound Imaging and Inversion, St. Georgen, Austria, pages 8-13, 2003. Status: Published
Affiliation | Scientific Computing |
Publication Type | Talks, contributed |
Year of Publication | 2003 |
Location of Talk | Invited talk at the MACSI-Workshop for Numerical Simulations for Ultrasound Imaging and Inversion, St. Georgen, Austria, pages 8-13 |
Parallel Algorithms for Simulating the Electrical Activity of the Heart
In Presented at the Dagstuhl seminar Challenges in computational science and engineering, 2003. Status: Published
Affiliation | Scientific Computing |
Publication Type | Talks, contributed |
Year of Publication | 2003 |
Location of Talk | Presented at the Dagstuhl seminar Challenges in computational science and engineering |
Notes | Presented by Joakim Sundnes, March 2003. |
Toward Extremely High-Resolution Simulation of Human Heart
In Talk at the ParCo 2003 Conference, 2 - 5 September 2003, Dresden, Germany, 2003. Status: Published
Affiliation | Scientific Computing |
Publication Type | Talks, contributed |
Year of Publication | 2003 |
Location of Talk | Talk at the ParCo 2003 Conference, 2 - 5 September 2003, Dresden, Germany |
Book Chapter
Overlapping Domain Decomposition Methods
In Advanced Topics in Computational Partial Differential Equations - Numerical Methods and Diffpack Programming, 57-95. Springer, 2003. Status: Published
Publication Type | Book Chapter |
Year of Publication | 2003 |
Book Title | Advanced Topics in Computational Partial Differential Equations - Numerical Methods and Diffpack Programming |
Pagination | 57-95 |
Publisher | Springer |
Parallel Computing
In Advanced Topics in Computational Partial Differential Equations - Numerical Methods and Diffpack Programming, 1-55. Lecture Notes in Computational Science and Engineering. Springer, 2003. Status: Published
Affiliation | Scientific Computing |
Publication Type | Book Chapter |
Year of Publication | 2003 |
Book Title | Advanced Topics in Computational Partial Differential Equations - Numerical Methods and Diffpack Programming |
Secondary Title | Lecture Notes in Computational Science and Engineering |
Pagination | 1-55 |
Publisher | Springer |
Performance Modeling of PDE Solvers
In Advanced Topics in Computational Partial Differential Equations - Numerical Methods and Diffpack Programming, 361-399. Lecture Notes in Computational Science and Engineering. Springer, 2003. Status: Published
Affiliation | Scientific Computing |
Publication Type | Book Chapter |
Year of Publication | 2003 |
Book Title | Advanced Topics in Computational Partial Differential Equations - Numerical Methods and Diffpack Programming |
Secondary Title | Lecture Notes in Computational Science and Engineering |
Pagination | 361-399 |