Performance and Energy Aware Training of a Deep Neural Network in a Multi-GPU Environment with Power Capping

Published: 2024 Page: 5-16
ISSN:0302-9743
Container-title:Lecture Notes in Computer Science
language:en

Author:

Koszczał Grzegorz, Dobrosolski Jan, Matuszek Mariusz, Czarnul Paweł

Abstract

In this paper we demonstrate that enforcing selected, non-default power caps on the GPUs can considerably improve performance- and energy-aware metrics for training deep neural networks on a modern parallel multi-GPU system. We measure the power and energy consumption of the whole node using a professional, certified hardware power meter. For a high-performance workstation with 8 GPUs, we found non-default GPU power cap settings in the range of 160–200 W that, compared to the default power cap of 260 W per GPU, improve the difference between percentage energy gain and performance loss by over 15.0%, EDP by over 17.3%, EDS with k = 1.5 by over 2.2%, EDS with k = 2.0 by over 7.5%, and pure energy by over 25% (abbreviations and terms are described in the main text). These findings demonstrate the potential for improving the configuration of today’s CPU+GPU systems with respect to performance and energy consumption metrics.
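As a rough illustration of how two of the metrics named in the abstract can be compared between a default and a capped run, the sketch below computes the energy-delay product (EDP, taken here in its common form of energy multiplied by execution time) and the difference between percentage energy gain and percentage performance loss. It is only a sketch under stated assumptions: the `Run` class, the function names, the reading of "performance loss" as the relative increase in training time, and all numeric values are hypothetical and are not taken from the paper. EDS is left out because its definition is given only in the main text.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Run:
    """One training run at a fixed per-GPU power cap (hypothetical record)."""
    power_cap_w: int   # per-GPU power cap in watts
    time_s: float      # wall-clock training time in seconds
    energy_j: float    # whole-node energy in joules


def edp(run: Run) -> float:
    """Energy-delay product: energy multiplied by execution time."""
    return run.energy_j * run.time_s


def percent_reduction(baseline: float, value: float) -> float:
    """Percentage by which `value` is lower than `baseline` (negative if higher)."""
    return 100.0 * (baseline - value) / baseline


def gain_minus_loss(baseline: Run, capped: Run) -> float:
    """Percentage energy gain minus percentage performance loss.

    Assumption: performance loss is the relative increase in training time.
    """
    energy_gain = percent_reduction(baseline.energy_j, capped.energy_j)
    perf_loss = 100.0 * (capped.time_s - baseline.time_s) / baseline.time_s
    return energy_gain - perf_loss


# Hypothetical measurements, NOT results from the paper.
default_run = Run(power_cap_w=260, time_s=1000.0, energy_j=2_000_000.0)
capped_run = Run(power_cap_w=180, time_s=1100.0, energy_j=1_600_000.0)

energy_saved = percent_reduction(default_run.energy_j, capped_run.energy_j)
edp_saved = percent_reduction(edp(default_run), edp(capped_run))
difference = gain_minus_loss(default_run, capped_run)

print(f"energy reduction: {energy_saved:.1f}%")
print(f"EDP reduction:    {edp_saved:.1f}%")
print(f"gain minus loss:  {difference:.1f} percentage points")
```

With these made-up numbers the capped run takes 10% longer but uses 20% less energy, so the difference metric is positive and EDP still improves; a cap that slowed training more than it saved energy would yield a negative difference.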

Publisher

Springer Nature Switzerland

Link

https://link.springer.com/content/pdf/10.1007/978-3-031-48803-0_1

References (25 articles)

1. Chen, G., Wang, X.: Performance optimization of machine learning inference under latency and server power constraints. In: 2022 IEEE 42nd International Conference on Distributed Computing Systems (ICDCS), pp. 325–335 (2022). https://doi.org/10.1109/ICDCS54860.2022.00039

2. Chollet, F.: Xception: deep learning with depthwise separable convolutions. In: 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 1800–1807, July 2017. https://doi.org/10.1109/CVPR.2017.195

3. Czarnul, P., Proficz, J., Drypczewski, K.: Survey of methodologies, approaches, and challenges in parallel programming using high-performance computing systems. Sci. Program. 2020, 4176794:1–4176794:19 (2020). https://doi.org/10.1155/2020/4176794

4. García-Martín, E., Rodrigues, C.F., Riley, G., Grahn, H.: Estimation of energy consumption in machine learning. J. Parallel Distrib. Comput. 134, 75–88 (2019). https://doi.org/10.1016/j.jpdc.2019.07.007, https://www.sciencedirect.com/science/article/pii/S0743731518308773

5. He, X., et al.: Enabling energy-efficient DNN training on hybrid GPU-FPGA accelerators. In: Proceedings of the ACM International Conference on Supercomputing, ICS 2021, pp. 227–241. Association for Computing Machinery, New York, NY, USA (2021). https://doi.org/10.1145/3447818.3460371