The trial period of a supercomputer

Supercomputers are built to meet specific requirements. For this reason, extensive experiments and functional tests are carried out before the actual start of operation: researchers and specialists from the Leibniz Supercomputing Centre (LRZ) are currently working hand in hand with technology companies to prepare SuperMUC-NG Phase 2 for its work.

Anticipation and frustration go hand in hand in high-performance computing (HPC): since mid-February, Prof. Klaus Dolag, an astrophysicist at the Max Planck Institute for Astrophysics (MPA), has been logging into SuperMUC-NG Phase 2, or SNG-2, every day from his office to try out the GADGET code for modelling gravitational forces and fluids. "Unfortunately, big jobs don't run long enough. With our complex programmes, which we also want to run on the whole system if possible, we always have a lot of technical problems at the beginning with new HPC systems," he says. "I'd like to get off to a good start, but things don't always go as planned when initialising a supercomputer." During the trial period, SNG-2 still has issues and bugs. Dolag adds: "GADGET is part of the acceptance and approval process. Intel and Lenovo have to show that our code works on the new system and its GPUs."

Turning on new computer technology

Getting a supercomputer up and running is a team effort: at the LRZ, almost 20 employees from the technology companies Intel and Lenovo are currently working hand in hand with the centre's HPC specialists and with researchers to prepare SNG-2 for simulations and other calculations. Some adapt the computer technology, others the scientific codes. Constant coordination is required, but above all experience and patience: "Phase 2 is the first accelerated HPC system that we are putting into operation," says Dr Gerald Mathias, head of Computational X Support (CXS) at the LRZ. "It's not unusual for the latest technology to cause problems at first."

The 240 nodes of the newly installed high-performance computer are based on Lenovo ThinkSystem SD650-I V3 Neptune DWC servers and are cooled with 45°C warm water for greater energy efficiency. In addition to two central processing units (CPU: Intel Xeon Platinum 8480+), each node for the first time contains four graphics processing units (GPU: Intel Data Center GPU Max 1550). These accelerate data processing and are also suited to highly scalable, data-intensive workloads such as machine learning. Such workloads are supported by the Intel Optane-based Distributed Asynchronous Object Storage (DAOS) system, which enables fast access to large amounts of data. There are currently only two systems in the world with a comparable architecture: Aurora at Argonne National Laboratory near Chicago and Dawn at the University of Cambridge. "SNG-2 is a novel system," says Adam Roe, who is responsible for Intel's HPC business unit in EMEA. "It requires collaboration to get the most out of the GPU for science and also to enable future workloads involving artificial intelligence." Karsten Kutzer, the system architect at Lenovo who coordinated the design, adds: "We start by designing supercomputers on paper. The exact interaction of the individual components is determined when the system is set up. We then work together to find the optimum solution and refine the plans accordingly."

Changing and adapting technology and codes

Since February, teams from both companies have been working regularly on SNG-2, either on-site or remotely, and more than 20 research groups with HPC experience have been invited to implement their codes on SNG-2, pushing the limits of processor power, communication link bandwidth and memory module functionality. "The challenge is the stability of the system," says Dolag. "Initially, I was only able to run small tasks and calculate for short periods of time; larger simulations were not yet possible."

Using benchmarking and analysis software such as DGEMM, STREAM, VTune or the HPL Linpack benchmark, the technical teams can see where processors are underperforming or where interconnect cables with reduced transmission quality are impeding the flow of data because they were bent during transport. The key challenge is to coordinate the interaction of the 480 CPUs and 960 GPUs with the interconnects and dynamic memory so that all compute nodes are kept equally busy. This requires not only technology, but also software and code: "Phase 2 requires new programming paradigms to execute parts of the codes and workloads on the GPUs, so programmes have to be adapted and routines reprogrammed," says Mathias, explaining that some of this work had already been done before setup and initialisation. "Programming models such as OpenMP and SYCL, an extension of C++, play an important role in exploiting the potential of GPUs. OpenMP is widely used in academic applications, but most applications still need to be adapted to SYCL."
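
What such an adaptation looks like can be sketched with a minimal SYCL example. This is a generic vector addition written purely for illustration; the queue, buffers and kernel shown are ours and do not come from GADGET or any other code named in this article:

    #include <sycl/sycl.hpp>
    #include <vector>

    int main() {
        const size_t n = 1 << 20;
        std::vector<float> a(n, 1.0f), b(n, 2.0f), c(n);

        sycl::queue q{sycl::gpu_selector_v};   // run on an attached GPU

        {
            // Buffers hand the host arrays over to the SYCL runtime.
            sycl::buffer<float> A(a.data(), sycl::range<1>(n));
            sycl::buffer<float> B(b.data(), sycl::range<1>(n));
            sycl::buffer<float> C(c.data(), sycl::range<1>(n));

            q.submit([&](sycl::handler& h) {
                sycl::accessor x{A, h, sycl::read_only};
                sycl::accessor y{B, h, sycl::read_only};
                sycl::accessor z{C, h, sycl::write_only, sycl::no_init};
                // One work-item per array element, executed on the GPU.
                h.parallel_for(sycl::range<1>(n),
                               [=](sycl::id<1> i) { z[i] = x[i] + y[i]; });
            });
        }   // leaving the scope copies the result back to the host

        return c[0] == 3.0f ? 0 : 1;
    }

This pattern, queues, buffers and kernels replacing plain loops, is why porting legacy CPU code to SYCL is a substantial effort.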

Aligning the components

Software controls the components and initiates the exchange of data, for example between the CPU and the GPU or memory. This process often takes too long, which negates the acceleration offered by GPUs over CPUs: "We therefore use new instructions and commands in the code to determine which data continues to be processed on the CPU and which is processed faster on the GPU," explains Dolag. However, this work is complicated by the fact that processors from different manufacturers can only be addressed with certain programming languages and schemes: although GADGET had already been prepared for NVIDIA GPUs using OpenACC, it had to be rewritten for the Intel GPUs in SNG-2, this time using OpenMP. "Parts of it are already working," says Dolag. "But it will probably be a while before we have rewritten GADGET as a whole and can use it." The ECHO code, used in astrophysics to model magnetohydrodynamics around black holes, has been translated from Fortran to SYCL and OpenMP by researchers at the LRZ and Intel. And the SeisSol group, which models earthquakes and seismological phenomena, has developed a code generator to prepare its proven software for SNG-2; tools like this, or the migration tool SYCLomatic from the Intel toolset, can help to adapt codes.
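
The porting step Dolag describes can be illustrated with a hypothetical particle loop; the function name update_forces, the arrays pos and force, and the loop body are all invented for this sketch and are not actual GADGET code. The OpenACC directive used for NVIDIA GPUs is shown as a comment above its OpenMP target equivalent for the Intel GPUs in SNG-2:

    // OpenACC version for NVIDIA GPUs (shown for comparison):
    //   #pragma acc parallel loop copyin(pos[0:n]) copy(force[0:n])
    //   for (int i = 0; i < n; ++i) { ... }

    // OpenMP target version for the Intel GPUs:
    void update_forces(int n, const double* pos, double* force) {
        // The map clauses decide which data moves to the GPU and back,
        // i.e. which data is processed where, as described above.
        #pragma omp target teams distribute parallel for \
                map(to: pos[0:n]) map(tofrom: force[0:n])
        for (int i = 0; i < n; ++i) {
            force[i] += pos[i] * pos[i];   // placeholder for the real physics
        }
    }

The directives look superficially similar, but clause semantics, supported features and compiler maturity differ between the two models, which is why such a rewrite takes time.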

Replacing faulty components, optimising the network, coordinating technology and codes: every two weeks, the companies' teams discuss with the researchers and LRZ staff where there are obstacles, where technology needs to be replaced or where the firmware, the programmes that control the hardware, needs to be optimised. "If individual nodes are not performing as expected, we check them, correct parameters and even replace parts," says Kutzer. "We are in constant dialogue with the Intel development team to classify observations and test results and to coordinate the procedure." Performance and stability testing during the SNG-2 trial period lasted until around May; the operating system will be updated before the final launch in June.

Exploring the GPU and experimenting with AI models    

For greater energy efficiency, SNG-2 is designed not only to accelerate computations, but also to help integrate artificial intelligence (AI) methods into established HPC processes. "With SNG-2, we have not built a dedicated AI system, but rather an AI-enabled high-performance computer," explains Intel's Roe. "In close cooperation with the LRZ, we will now further optimise the supercomputer's architecture for HPC and AI user groups. It is well known that AI requires more and more computing power, and research is now increasingly combining classical simulation with AI methods." In so-called surrogate models, statistical methods such as pattern recognition replace the most complex calculations within simulations. The LRZ Big Data & Artificial Intelligence (BDAI) team was therefore involved in the initialisation of a supercomputer for the first time, and a working group led by Prof. Frank Hutter, who works on machine learning at the University of Freiburg, was also able to experiment on SNG-2.
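
The surrogate idea can be sketched in a few lines. This is a generic illustration, not an LRZ workflow: both functions below are invented, and in practice the surrogate would be a trained neural network evaluated on the GPU rather than a hand-fitted formula:

    #include <cmath>
    #include <cstdio>

    // Stand-in for an expensive simulation kernel: averages sin() over
    // a unit window by brute force (illustrative only).
    double expensive_step(double x) {
        double acc = 0.0;
        for (int i = 0; i < 100000; ++i)
            acc += std::sin(x + i * 1e-5);
        return acc / 100000.0;
    }

    // Surrogate: a cheap closed-form fit to samples of expensive_step();
    // 0.9589 ~ 2*sin(0.5) mimics the analytic mean over the window.
    double surrogate_step(double x) {
        return 0.9589 * std::sin(x + 0.5);
    }

    int main() {
        // The simulation calls the surrogate wherever full accuracy is
        // not required, trading a little precision for large savings.
        for (double x = 0.0; x <= 1.0; x += 0.25)
            std::printf("x=%.2f  full=%.5f  surrogate=%.5f\n",
                        x, expensive_step(x), surrogate_step(x));
        return 0;
    }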

In addition to proven HPC tools and development environments from the LRZ, frameworks such as PyTorch and TensorFlow have been implemented on the GPUs of SNG-2, and AI models such as BLOOM and GPT-3 have been ported, with further models to follow as research projects progress. "BLOOM is suitable for inference or for developing pattern recognition, while GPT-3 is a generative pre-trained transformer, a basis for neural networks of generative AI applications such as large language models. We can use it to investigate training steps," explains Dr Nicolay Hammer, head of the LRZ BDAI team. "Together with researchers, we want to show that large AI models can also run on supercomputers. We also want to gain experience with Intel's GPUs, which is generally still limited."

AI applications require a lot of computing power, their integration into HPC systems is complex, and new workloads and tools are needed. As AI becomes more pervasive, technical alternatives are being sought: many of the smart systems are based on processors from pioneer and market leader NVIDIA, and more diversity is desirable not only for economic reasons, but also because researchers want to be able to run their applications on different processors. This is why the Gauss Centre for Supercomputing (GCS), of which the LRZ is a part, relies on a variety of chips: the research centre in Jülich uses NVIDIA accelerators, the LRZ Intel accelerators. This allows researchers to select, compare and evaluate: interesting aspects include power consumption and efficiency, use in parallel supercomputers, and potential bottlenecks in (scientific) applications.

Faster or more calculations with accelerators

Three months after the supercomputer was switched on for the first time, the first practical experience from the test phase has already been gathered. The CXS team, for example, has compared the performance of different chips, and Mathias will present the results of these tests at the International Supercomputing Conference (ISC 2024) in Hamburg: "Whether from Intel, NVIDIA or AMD, GPU accelerators really show their full performance with large models and simulations, while CPUs also work well with smaller tasks." Performance data and findings like these, such as the fact that CPUs are better suited to complex tasks while GPUs excel at data-intensive ones, will be the subject of workshops and hackathons organised by the LRZ to familiarise user groups with the characteristics of hybrid systems.

Klaus Dolag is also very pleased with the performance of SNG-2: "In some very important segments of our programme, we can see a speed-up of a factor of 10; for the whole machine, it is between a factor of 2 and 3." Although larger, SuperMUC-NG Phase 1, which has no GPUs, takes two to ten times longer to complete the same tasks and consumes significantly more power than SNG-2. For an astrophysicist, these numbers sound promising: "We are constantly coming across interesting data and need more and more power for the calculations," says Dolag. "For us, acceleration means that we can run some simulations faster. Even more interesting for us is that we can model more and with higher resolution, which means that the volume of the simulated part of the universe can grow." For the first time, the team from the MPA and the LMU University Observatory plans to use SNG-2 to develop simulations of the universe that start from a fixed rather than a random starting point. "This will create a picture of the real universe that we can use to test our current assumptions and understand the formation or position of a galaxy. It will be exciting to see how deep we can go into the calculations with SNG-2."
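
Dolag's two figures fit together as Amdahl's law predicts (a back-of-the-envelope check on our part, not a statement from the article's sources): if a fraction f of the runtime is accelerated by a factor s, the overall speed-up is

    S = 1 / ((1 - f) + f / s)

With s = 10 inside the accelerated segments, an overall S of about 2.5 corresponds to roughly two thirds of the runtime falling in those segments; the unaccelerated remainder caps the total gain.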

Like Dolag and other researchers, the corporate teams from Intel and Lenovo will continue to work on SNG-2 in the computer cube or remotely: "In supercomputers, the software stack, including the firmware, is regularly adapted and updated to reflect the latest developments," says Kutzer. "A system consisting of hundreds or thousands of components is never completely finished. We are also busy identifying potential improvements and optimising the way the components work together in day-to-day operations." (vs/ssc, LRZ)

Fact sheet: SuperMUC-NG Phase 2 (SNG-2)

  • Total Memory: 123 Terabytes DDR5
  • Peak Performance: 27.96 PetaFLOPS (quadrillions of floating-point operations per second)
  • Compute Nodes: 240
  • CPU Cores per Node: 112
  • CPUs per Node: 2
  • GPUs per Node: 4
  • Memory per Node: 512 GB DDR5 plus 512 GB HBM2e
  • Network: NVIDIA Mellanox HDR InfiniBand
  • Scientific HPC Codes/Frameworks: AIMD, ALPACA, AMBER, ATHENA, CP2K, DeTol, DPEcho, ExaHyPE, GADGET, Ginkgo, GRID, Gromacs, HemeLB, HyTeG, Kokkos, LQCD, MGLET, OpenMM, SeisSol, waLBerla
  • AI Frameworks: PyTorch, TensorFlow
  • AI Models: BLOOM, GPT-3
  • Video SNG-2: https://www.youtube.com/watch?v=ruYyR1_xfIw
