EXTRA

Results from the EXTRA project

1) “5 whys” of main issues our FET project is solving.

Why-1: There has been a lot of research on novel reconfigurable architectures (mostly coarse-grain) in the last two decades. However, all these efforts were individual proposals: every researcher proposing a novel architecture had to evaluate it with their own evaluation tools and had to develop their own tools for implementing applications on it. Why do we always have to redo everything from scratch instead of designing a common research platform that can be used and customized by all researchers?

Why-2: Several run-time reconfigurable systems have been proposed over the years but they still suffer from obstacles that prevent them from becoming mainstream compute nodes. Why are we still in such a scenario?

Why-3: Why do we need to enable the joint optimization of architecture, tools, applications, and reconfiguration technology to design the necessary HPC hardware nodes of the future?

Why-4: Future HPC systems that address exascale applications will need to be flexible and extensible even after the system is operational, in order to cope with changing user requirements, improvements in system features, changing protocol and data-coding standards, evolving demands for support of different user applications, and newly emerging HPC applications. Hence, to achieve the full range of technological capabilities, research on optimal reconfigurable architectures is also necessary. In a context like this, why are we still discussing and evaluating the possibility of adopting reconfigurable technologies in exascale and high-performance computing instead of using them?

Why-5: The duration of the reconfiguration process is one of the important bottlenecks in current reconfigurable systems. Reducing the time needed to reconfigure the hardware significantly reduces the overhead of programming new kernels into the hardware, so more tasks can be accelerated. Within this context, why does a novel concept for on-the-fly configuration generation, including the supporting tool flow, need to be developed to support flexibility at run time according to application requirements?

2) How will our solution contribute to society?

The EXTRA project focuses on strengthening European research leadership in HPC technologies, not only by providing opportunities for further exploration of the project results in follow-up European and national projects, but also by enabling a whole new research era in reconfigurable computing. Indeed, the EXTRA project invested, for the first time, significant effort in setting up an open exploration framework for both architecture exploration as well as design tool exploration which specifically targets reconfigurability.

The open exploration platform for reconfigurable computing is intended to boost research on reconfiguration. If more research institutes are able to perform research on run-time reconfigurable systems, more students will be exposed to this novel research domain in HPC and will introduce it in the companies they later work for, increasing opportunities for commercial use of such reconfigurable HPC systems. Moreover, novel research results on architectures, tools and applications, enabled by our exploration platform, may lead to new start-ups (university spin-offs) and new jobs.

The energy cost of HPC applications is vast and growing. Data centers housing such HPC hardware have carbon emissions that are forecast to overtake those of the airline industry by 2020, and as such will have considerable impact on society. A major benefit of reconfigurable architectures is their substantially smaller energy usage compared with conventional CPU-based hardware. One of the goals of EXTRA is to popularize reconfigurable hardware in the scientific software community and to increase its uptake for HPC applications. This will have a profound impact on HPC energy usage, ideally leading to significant reductions in carbon emissions.

Finally, we envisage that the EXTRA project will have both a direct and an indirect impact on future HPC markets. The direct impact relates to the activities for optimizing architectures and tools to design reconfigurable HPC systems, as well as to the explicit exploitation of reconfigurability for three important applications in the HPC domain: a finite element application, a medical application, and a scientific computation application. The indirect impact will come from the opportunities arising from the increased research efforts that will become possible with our architecture and tool exploration platform targeted at reconfigurable HPC systems.

3) What were the main scientific/ engineering challenges of our FET project?

The objective of the EXTRA project is to make Europe the worldwide leader in reconfigurable HPC technology. Making HPC systems highly reconfigurable will allow European industry to greatly improve the efficiency of HPC systems. In order to reach this long term goal, the project targets several key objectives.

Because the exploitation of system reconfigurability is relatively new, much more research is needed on the optimal HPC architectures that can maximally benefit from reconfiguration, on improvements in the tools to exploit reconfiguration while designing high-performance and power-efficient implementations, and on the appropriate application optimizations. For this reason, the EXTRA project targets the development and promotion of an open reconfigurable technology exploration platform that combines a reconfigurable architecture description with reconfigurable design tools and thus allows reconfigurable applications to be evaluated and optimized (key challenge 1).

Apart from developing the open exploration platform, the EXTRA project also makes significant contributions in reconfigurable architecture development, reconfigurable tools, and the optimization of reconfigurable HPC applications (key challenge 2).

The contributions on architecture, tools and applications are validated using the EXTRA platform and valorised by implementing three HPC applications using reconfiguration with the aim to improve performance, area and power efficiency (key challenge 3).

Given these three key challenges, the main technical challenges (TC) of the EXTRA project are the following.

To enable a co-design approach for developing reconfigurable HPC architectures, tools and applications by providing a joint exploration platform where individual contributions can be tested on their merits using the complete chain from device up to the application (TC1);
To include reconfigurability as an explicit design concept in future HPC systems (TC2);
To develop novel reconfiguration approaches for processing, BRAMs, special blocks and interconnection in a coarse-grain reconfiguration architecture to allow speeding up the reconfiguration process to a single cycle for applications with dynamic data access (TC3);
To provide just-in-time synthesis methods for reconfiguration on the fly, based on application requirements. This requires a complete tool flow that targets this just-in-time synthesis with minimal reconfiguration overhead (TC4);
To improve the running time, area requirements and power efficiency of the HPC applications under consideration on current reconfigurable devices (TC5);
To use the common research platform for proving the efficiency of new reconfiguration technology proposals and to suggest new reconfiguration features for future technologies (TC6).

4) Results from the project EXTRA

CAOS, the CAD as an Adaptive Open platform Service

The ever increasing amount of computational power required by High Performance Computing (HPC) systems fostered the research towards Heterogeneous System Architectures, in order to achieve better performance without losing energy efficiency.

Within this context, FPGAs emerged since they provide higher performance per watt in several applicative scenarios. Nevertheless, the wide adoption of such devices is still limited by the programmability challenge that they pose.

Indeed, highly experienced designers are required in order to implement both an efficient FPGA-based accelerator as well as the needed communication infrastructure to interface the accelerator with the host system.

That is exactly why in EXTRA the Politecnico di Milano designed CAOS, the CAD as an Adaptive Open platform Service.

CAOS is a framework that aims to provide a fully integrated platform for automating and assisting all the steps of the hardware design flow, targeting high-performance and exascale reconfigurable computing systems.

CAOS has been developed around three key points: MODULARITY, USABILITY and INTERACTIVITY.

Modularity makes it possible to create independent components that can be unplugged and replaced with custom modules. In this regard, CAOS provides a set of well-defined APIs to help external contributors integrate a custom module. As for usability, CAOS allows users with little expertise to develop and optimize their application, providing support in analysing potential performance and suggesting targeted optimizations. Finally, the tool is highly interactive in guiding the user through the optimization process, providing suggestions and error reports and allowing the user to modify the solution automatically provided by the tool.

The overall CAOS design flow is organised into three main flows:

the frontend,
the function optimization flow
and the backend

Within the frontend, CAOS analyzes the application provided by the user, matches it against one or more architectural templates, profiles it against the user-specified datasets and, finally, guides the user through the hardware-software partitioning of the application.

The function optimization flow performs a static analysis and a hardware resource estimation of the functionalities to be accelerated on the FPGA. Such analyses are dependent on the considered architectural template and the derived information is used to estimate the performance of the hardware functions and to derive the optimizations to apply (such as loop pipelining, loop tiling and loop unrolling). After one or more iterations of the function optimization flow, the resulting functions are given to the backend flow, where the desired architectural template for implementing the system is selected and the required high level synthesis and hardware synthesis tools are leveraged to generate the final FPGA bitstreams.

Furthermore, within the backend, CAOS also takes care of generating the host code for running the FPGA accelerators.
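To make one of the loop optimizations named above more concrete, the following is a minimal sketch of loop tiling applied to a matrix multiplication. It is an illustration only and not code produced by or taken from CAOS; the function names and the tile size are assumptions chosen for the example.

```python
def matmul_naive(a, b):
    """Reference triple loop: c = a x b for lists of lists."""
    n, m, p = len(a), len(b), len(b[0])
    c = [[0] * p for _ in range(n)]
    for i in range(n):
        for j in range(p):
            for k in range(m):
                c[i][j] += a[i][k] * b[k][j]
    return c


def matmul_tiled(a, b, tile=2):
    """Same computation with every loop split into tiles.

    The three outer loops walk over tile origins, the three inner
    loops stay inside one tile, so each tile touches only a small
    block of a, b and c (the block a hardware kernel could keep in
    on-chip memory). The tile size is a hypothetical parameter.
    """
    n, m, p = len(a), len(b), len(b[0])
    c = [[0] * p for _ in range(n)]
    for ii in range(0, n, tile):
        for jj in range(0, p, tile):
            for kk in range(0, m, tile):
                # min() handles sizes that are not a multiple of tile
                for i in range(ii, min(ii + tile, n)):
                    for j in range(jj, min(jj + tile, p)):
                        for k in range(kk, min(kk + tile, m)):
                            c[i][j] += a[i][k] * b[k][j]
    return c
```

Both functions return the same result for integer inputs; only the order in which the iterations are visited changes, which is what makes the transformation safe to apply automatically.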

Three applications accelerated as test cases

Within the scope of the EXTRA project, we have used CAOS to develop three different applications and investigate their scaling behavior for exascale computing:

A novel image segmentation application targeting the medical sector (by partner Synelixis)

In the scope of the EXTRA project, Synelixis has developed a novel image segmentation application targeting the medical sector. The implemented Retinal Image Segmentation application enables the extraction of morphological attributes of retinal blood vessels, assisting the diagnosis, screening, treatment and evaluation of various cardiovascular and ophthalmologic diseases. By employing FPGAs with embedded ARM processors, Synelixis managed to accelerate the process of segmentation by at least 100 times over cutting-edge general-purpose server CPUs while being 800x better in terms of power efficiency. However, these gains came with two serious drawbacks. The hardware-accelerated solution required a very significant amount of time to be developed and verified, while the final system was not flexible, requiring major redesigns when additional features were to be added.

The CAOS framework developed within EXTRA provided a very elegant solution to the aforementioned productivity and deployment problems. Through an easy-to-use web interface, Synelixis’ engineers were able to design, test and iterate on the application very quickly, and thus to produce ready-to-deploy solutions in a fraction of the time (a few hours compared to several weeks). This fast pace of iteration also enabled several algorithmic improvements in the application, resulting in a significantly better end product. The performance and efficiency achieved were significantly better than software (more than 4x speed improvements at a fraction of the energy cost). A final advantage of CAOS is that it made it trivial to retarget the developed solution to different FPGA-based platforms (including Amazon’s AWS F1 instance), giving unprecedented flexibility and options to Synelixis, as well as to Synelixis’ partners and clients, to optimize the deployment of the application under different cost and performance scenarios.

Application that allows financial institutions to model risk in their portfolios more accurately (by partner Maxeler)

Maxeler Technologies commercialises Dataflow supercomputing systems that are based on reconfigurable accelerators. In the EXTRA project, Maxeler has participated in the development of a reconfigurable tool platform that can help reconfigurable computing systems to become more mainstream. Based on its commercial experience with such systems, Maxeler has contributed to the design exploration and analytical modelling techniques in EXTRA. Furthermore, Maxeler extended its MaxCompiler Dataflow development tools with a new front end that leverages the techniques developed in EXTRA. This has allowed Maxeler to develop an optimised financial Value-at-Risk application for significantly faster and more energy-efficient processing on Maxeler's FPGA-based MAX5 Dataflow Engines as well as on Amazon EC2 F1 Instances and Xilinx Alveo Datacenter cards. With reconfigurable acceleration, this application will allow financial institutions to model risk in their portfolios more accurately.

A computational chemistry application (by partner Cambridge University)

In Computational Chemistry many applications need a lot of computing power, such as the Quantum Monte Carlo (QMC) application. We implemented this application on a Maxeler system using the MAIA FPGA boards. The reference implementation managed to outperform the multithreaded software reference by 5x to 35x depending on the ensemble sizes of walkers, with energy/performance gains of similar magnitude. Using the Parallel Memory proposed within EXTRA, it was possible to achieve an order of magnitude better performance at the cost of using additional memory resources.

Other open innovations as a result of EXTRA

Alongside the CAOS framework, all partners of the EXTRA project have built new methods and tools for several supporting actions needed to improve the flexibility of the platform and aid the development of future HPC applications. The most important contributions are:

A novel meta-programming approach: Artisan

The Imperial College team, in collaboration with various partners, made several contributions to EXTRA. First, Artisan, a novel meta-programming approach, was developed to codify complex design methods and optimisation strategies for automatically mapping high-level descriptions to the next-generation large-scale reconfigurable platforms. With Artisan, all stages of a compilation design-flow can be described by Python programs, supporting tasks such as source-code analysis and instrumentation as well as the control of third-party tools and hardware platforms. Artisan enables exploration of a wide range of optimisation opportunities for a computing landscape that is increasingly heterogeneous and distributed. Second, PERKS, a performance modeling capability, was developed for next-generation hardware platforms with reconfigurable dataflow accelerators. Such hardware platforms are ideal for future large-scale distributed computing, since they can significantly speed up high-performance computing applications while consuming much less power than conventional many-core processor platforms. Additional innovations from Imperial include prototyping and evaluation of the EURECA on-chip configuration generation method, and new in-circuit assertions and exceptions for effective fault detection and management in reconfigurable systems.
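As a rough illustration of what describing a source-code analysis step as a Python program can look like, the sketch below uses only Python's standard `ast` module to list the loops in a piece of Python source together with their nesting depth. It does not use Artisan's API and analyses Python rather than the languages Artisan targets; the function name `loop_report` is a hypothetical helper introduced for this example.

```python
import ast


def loop_report(source):
    """Return sorted (line number, nesting depth) pairs for every loop.

    Depth 1 is an outermost loop, depth 2 a loop nested inside it, and
    so on. A design flow could use such a report to pick candidate
    loops for acceleration (an assumption made for illustration).
    """
    tree = ast.parse(source)
    report = []

    def visit(node, depth):
        for child in ast.iter_child_nodes(node):
            if isinstance(child, (ast.For, ast.While)):
                report.append((child.lineno, depth + 1))
                visit(child, depth + 1)
            else:
                visit(child, depth)

    visit(tree, 0)
    return sorted(report)
```

For a source string containing a doubly nested `for` loop starting on line 1 and a separate `while` loop on line 4, the report contains one entry per loop, with the inner `for` loop at depth 2.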

The PolyMem polymorphic memory system.

The PolyMem system was developed at the University of Amsterdam. It automatically analyses memory access patterns and infers the potential impact of employing parallel memory for hardware accelerated kernels. It automatically derives a parallel memory configuration for that application.
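The idea of matching access patterns to a parallel memory configuration can be sketched as follows. This is a simplified illustration and not PolyMem's actual analysis or mapping functions: the two bank-mapping schemes, their names and the helper functions are assumptions chosen for the example. A pattern is conflict-free under a scheme when all elements accessed together fall into different memory banks.

```python
def column_interleaved(row, col, num_banks):
    """Hypothetical scheme: the bank depends on the column only."""
    return col % num_banks


def skewed(row, col, num_banks):
    """Hypothetical scheme: each row is shifted by one bank."""
    return (row + col) % num_banks


SCHEMES = {"column_interleaved": column_interleaved, "skewed": skewed}


def is_conflict_free(coords, bank_fn, num_banks):
    """True if every (row, col) in one parallel access hits a distinct bank."""
    banks = [bank_fn(r, c, num_banks) for r, c in coords]
    return len(set(banks)) == len(banks)


def suitable_schemes(patterns, num_banks):
    """Names of the schemes that serve all given patterns without conflicts."""
    return [
        name
        for name, bank_fn in SCHEMES.items()
        if all(is_conflict_free(p, bank_fn, num_banks) for p in patterns)
    ]
```

With four banks, a kernel that reads four consecutive elements of a row can use either scheme, whereas a kernel that also reads four consecutive elements of a column is only served conflict-free by the skewed scheme, because the column-interleaved scheme maps a whole column to one bank.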

Just-in-time synthesis, mapping, and P&R tools

The researchers at Ruhr University Bochum focused on using the existing TCON/TLUT tool flow from Ghent University to show the advantages of Dynamic Circuit Specialization (DCS) for Virtual Coarse-Grained Reconfigurable Arrays (VCGRAs). With this, we can provide faster synthesis based on a hierarchical multi-level approach. The results showed area savings of up to 75% using the TLUT/TCON backend for our VCGRA designs. Nevertheless, for further research we focused on commercial vendor backends to implement the VCGRA overlays on FPGA designs and to run the EXTRA applications for benchmarking the coarse-grained design approach. On the architectural side, we created a coarse-grained architecture overlay for arithmetic tasks, which works as a parameterizable overlay for FPGAs. In the first half of the project, we developed, together with Ghent University, a CGRA overlay named “Pixie” that implements a convolution algorithm to accelerate image processing applications. In the second half of the project, we developed an AXI-Lite/AXI-Stream interface to create IP blocks usable in Vivado block designs. Results showed that almost 50% of the execution time was needed for data transfer. This is due to the complexity of the AXI4 protocol in terms of handshaking and to the unavailability of burst mode. Thus, we created an AXI4-Stream interface to reduce transport delays and increase the execution frequency of the whole design. AXI4-Stream gives essentially the same data transmission rate as the AXI4-Full burst mode, but with lower complexity, fewer signals and lower resource utilization. EXTRA delivers a complete tool-chain for the development of application-specific CGRAs for FPGA-based reconfigurable hardware architectures. The description of the application is performed at a high level of abstraction, which enables wide usage and acceptance of the approach by non-specialists in the domain.
Together, the tool-chain and the realized hardware structure form a platform for further research performed by the EXTRA partners and also by the wider community within the domain of dynamic and partial reconfigurable hardware. Novel hardware architectures from all vendors can be used to apply the developed approach. There is no restriction to a specific vendor since the backend tools for the physical CGRA design can be substituted while the EXTRA tool chain remains on top.
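The observation above that almost 50% of the execution time went into data transfer bounds how much a faster interface can help. A minimal sketch of that arithmetic, using the standard Amdahl-style formula, is given below; the transfer speedup factors in the usage note are hypothetical values chosen for illustration and are not measurements from the project.

```python
def overall_speedup(transfer_fraction, transfer_speedup):
    """Overall speedup when only the data-transfer part gets faster.

    transfer_fraction: share of the original run time spent on transfer
                       (about 0.5 according to the text above).
    transfer_speedup:  factor by which the transfer itself is accelerated
                       (hypothetical input, not a measured value).
    """
    if not 0.0 <= transfer_fraction <= 1.0:
        raise ValueError("transfer_fraction must be between 0 and 1")
    if transfer_speedup <= 0:
        raise ValueError("transfer_speedup must be positive")
    compute_part = 1.0 - transfer_fraction
    transfer_part = transfer_fraction / transfer_speedup
    return 1.0 / (compute_part + transfer_part)
```

With a transfer share of 0.5, doubling the transfer rate gives an overall speedup of about 1.33, and even an infinitely fast transfer (`float("inf")`) cannot exceed 2, since the compute half of the run time is unchanged.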

Adding debugging tools to the framework

Ghent University focused on an approach to enhance the observability of VCGRA-based designs for functional debugging. The proposed approach is a custom-made in-circuit-debugger that is installed during the initial recompilation and can be used to rapidly trace functional errors in high-performance computing applications that can be implemented as VCGRAs.

In order to improve the debugging capabilities of the EXTRA platform, Ghent University uses parameterised configurations in order to introduce a low overhead debugging infrastructure. This is achieved with a semi-automated in-circuit debugging method for theoretical and commercial FPGA architectures, which integrates FPGA overlays with debugging properties, with minimal user intervention. The custom two-level overlay architecture has virtual and automated debugging instrumentation and it is integrated in the design during the original compilation. The debugging infrastructure is added incrementally and optimized alongside the target FPGA. The first overlay level is the VCGRA, whereas the second overlay implements the proposed functionality.

This second-level overlay infrastructure offers increased signal observability for on-the-fly low overhead accelerated debugging by using parameterised configurations. The result is a parameterised Design Under Test, with parameters that define different circuit instances that describe different signal sets. Our debugging flow follows the typical stages of the VCGRA flow and consists of two phases: the design phase and the verification phase.

Ghent University also created the Superimposed Debugging Architecture (SDA) to debug and repair a VCGRA on the fly.

RACOS (a Reconfigurable ACcelerator OS) and DAER

The Telecommunications Systems Institute in Greece developed advanced accelerator crafting and dispatching techniques, RACOS and DAER.

RACOS extends Linux so that applications can easily attach and detach reconfigurable accelerators for their use. RACOS also allows the virtualization of reconfigurable accelerators so that multiple applications can use them concurrently (or in a time-shared fashion). DAER is a framework to structure accelerators and their accesses to data memories. Based on queues that decouple the memory accesses from processing, it offers an intuitive way to structure both the circuits of the accelerator and the data accesses, and for many applications it supports the seamless prefetching of data.
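The queue-based decoupling described for DAER can be illustrated in software with a bounded queue between a fetch stage and a processing stage. This is a minimal sketch of the general access/execute decoupling idea, not DAER's implementation or API: the function names, the queue depth and the squaring operation are assumptions made for the example.

```python
import queue
import threading

_DONE = object()  # sentinel marking the end of the data stream


def fetch_stage(memory, addresses, out_q):
    """Access side: reads memory and runs ahead of the consumer.

    put() blocks when the bounded queue is full, so the fetcher
    prefetches at most `depth` items beyond what has been processed.
    """
    for addr in addresses:
        out_q.put(memory[addr])
    out_q.put(_DONE)


def process_stage(in_q, results):
    """Execute side: consumes operands without knowing their addresses."""
    while True:
        item = in_q.get()
        if item is _DONE:
            break
        results.append(item * item)  # placeholder computation


def run_decoupled(memory, addresses, depth=4):
    """Run both stages concurrently, connected by a queue of given depth."""
    q = queue.Queue(maxsize=depth)
    results = []
    fetcher = threading.Thread(target=fetch_stage, args=(memory, addresses, q))
    fetcher.start()
    process_stage(q, results)
    fetcher.join()
    return results
```

Because the processing stage only sees a stream of operands, the address sequence (here an arbitrary list of indices) can change without touching the processing code, which is the structural benefit the text attributes to the queue-based design.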

EDRA, an H2020 FET Launchpad project (https://edra-project.eu/, contract #851631), focuses on the commercialization of the DAER framework.

The EXTRA project has received funding from the European Union Horizon 2020 Framework Programme (H2020-EU.1.2.2.) under grant agreement number 671653.