CPU vs GPU Benchmarking with PyTorch

When working with deep learning and massive datasets, choosing between a CPU and a GPU is critical for execution speed. This article explores a benchmark test comparing CPU and GPU performance using PyTorch for matrix multiplication.

We start with a tiny 1x1 matrix and scale all the way up to 50,000x50,000 to observe when the GPU truly outpaces the CPU, and where memory constraints become an issue.
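As a rough intuition for why size matters so much: an n x n matrix multiplication costs about 2n³ floating-point operations, so the workload grows cubically across the benchmark sizes. A quick back-of-the-envelope sketch (pure Python, no PyTorch required):

```python
# Approximate FLOP count for an n x n matrix multiplication:
# ~n^3 multiplications plus roughly n^3 additions, i.e. ~2 * n^3.
def matmul_flops(n: int) -> int:
    return 2 * n ** 3

for n in [1, 10, 100, 1000, 10000, 50000]:
    print(f"{n:>6} x {n:<6} -> {matmul_flops(n):.2e} FLOPs")
```

Going from 10,000 to 50,000 multiplies the work by 125x, which is why the large sizes dominate the benchmark's runtime.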

Benchmark Script

The following Python script was used to determine the environment information and execute matrix multiplication across different sizes:

import torch
import time
import platform
import psutil


def print_env_info():
    print(f"PyTorch Version: {torch.__version__}")
    print(f"CUDA Available: {torch.cuda.is_available()}")
    print(f"CUDA Version: {torch.version.cuda}")


def print_system_info():
    print("\nSystem Information:")
    try:
        import cpuinfo
        cpu = cpuinfo.get_cpu_info()
        cpu_name = cpu.get('brand_raw', 'Unknown')
    except ImportError:
        cpu_name = platform.processor() or 'Unknown'
    print(f"CPU: {cpu_name}")
    freq = psutil.cpu_freq()
    print(f"CPU Frequency: {freq.current:.2f} MHz" if freq else "CPU Frequency: Unknown")
    print(f"CPU Cores: {psutil.cpu_count(logical=False)} Physical, "
          f"{psutil.cpu_count(logical=True)} Logical")
    mem = psutil.virtual_memory()
    print(f"RAM: {mem.total / 1024 ** 3:.2f} GB "
          f"(Available: {mem.available / 1024 ** 3:.2f} GB)")
    print(f"OS: {platform.system()}")


def print_gpu_info():
    if torch.cuda.is_available():
        print(f"\nGPU: {torch.cuda.get_device_name(0)}")
        props = torch.cuda.get_device_properties(0)
        print(f"GPU Memory: {props.total_memory / 1e9:.2f} GB")


def benchmark(size):
    print(f"\n=== Benchmark: {size}x{size} Matrix ===")
    import gc

    # CPU TEST
    cpu_duration = None
    x_cpu = y_cpu = None  # predefine so the finally block can't raise NameError
    try:
        x_cpu = torch.randn(size, size)
        y_cpu = torch.randn(size, size)
        _ = torch.matmul(x_cpu, y_cpu)  # Warm-up
        start = time.perf_counter()
        _ = torch.matmul(x_cpu, y_cpu)
        cpu_duration = time.perf_counter() - start
        print(f"CPU Time: {cpu_duration:.6f} seconds")
    except Exception as e:
        print(f"CPU Benchmark Error: {e}")
    finally:
        del x_cpu, y_cpu
        gc.collect()

    # GPU TEST
    if torch.cuda.is_available():
        x_gpu = y_gpu = None
        try:
            device = torch.device("cuda")
            x_gpu = torch.randn(size, size, device=device)
            y_gpu = torch.randn(size, size, device=device)
            for _ in range(5):
                _ = torch.matmul(x_gpu, y_gpu)  # Warm-up
            torch.cuda.synchronize()
            start = time.perf_counter()
            _ = torch.matmul(x_gpu, y_gpu)
            torch.cuda.synchronize()  # wait for the async kernel before stopping the clock
            gpu_duration = time.perf_counter() - start
            print(f"GPU Time: {gpu_duration:.6f} seconds")
            if cpu_duration is not None and gpu_duration > 0:
                print(f"Speedup: {cpu_duration / gpu_duration:.2f}x")
        except Exception as e:
            print(f"GPU Benchmark Error: {e}")
        finally:
            del x_gpu, y_gpu
            torch.cuda.empty_cache()
            gc.collect()
    else:
        print("GPU not found.")


if __name__ == "__main__":
    print_env_info()
    print_system_info()
    print_gpu_info()
    sizes = [1, 5, 10, 100, 1000, 10000, 25000, 50000]
    for size in sizes:
        try:
            benchmark(size)
        except Exception as e:
            print(f"Error for size {size}: {e}")

Benchmark Output Results

The execution of the script produced the following output:

PyTorch Version: 2.10.0+cu130
CUDA Available: True
CUDA Version: 13.0

System Information:
CPU: Intel64 Family 6 Model 85 Stepping 7, GenuineIntel
CPU Frequency: 2200.00 MHz
CPU Cores: 6 Physical, 12 Logical
RAM: 48.00 GB (Available: 41.02 GB)
OS: Windows

GPU: NVIDIA L4
GPU Memory: 23.91 GB

=== Benchmark: 1x1 Matrix ===
CPU Time: 0.000078 seconds
GPU Time: 0.000163 seconds
Speedup: 0.48x

=== Benchmark: 5x5 Matrix ===
CPU Time: 0.000063 seconds
GPU Time: 0.000058 seconds
Speedup: 1.10x

=== Benchmark: 10x10 Matrix ===
CPU Time: 0.000009 seconds
GPU Time: 0.000066 seconds
Speedup: 0.13x

=== Benchmark: 100x100 Matrix ===
CPU Time: 0.000105 seconds
GPU Time: 0.000095 seconds
Speedup: 1.11x

=== Benchmark: 1000x1000 Matrix ===
CPU Time: 0.003641 seconds
GPU Time: 0.000667 seconds
Speedup: 5.45x

=== Benchmark: 10000x10000 Matrix ===
CPU Time: 2.631520 seconds
GPU Time: 0.174161 seconds
Speedup: 15.11x

=== Benchmark: 25000x25000 Matrix ===
CPU Time: 42.635571 seconds
GPU Time: 2.966667 seconds
Speedup: 14.37x

=== Benchmark: 50000x50000 Matrix ===
CPU Time: 323.114121 seconds
GPU Benchmark Error: CUDA out of memory. Tried to allocate 9.31 GiB. GPU 0 has a total capacity of 22.27 GiB of which 3.42 GiB is free. Of the allocated memory 18.63 GiB is allocated by PyTorch, and 14.39 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)
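As a quick sanity check on these numbers, effective throughput can be estimated as 2n³ operations divided by elapsed time. A minimal sketch using the timings reported above (a rough estimate only; real throughput depends on the BLAS/cuBLAS kernels actually dispatched):

```python
# Effective throughput in TFLOP/s: ~2 * n^3 operations / elapsed seconds / 1e12.
def tflops(n: int, seconds: float) -> float:
    return 2 * n ** 3 / seconds / 1e12

# Timings copied from the benchmark output above.
print(f"CPU @ 10000: {tflops(10000, 2.631520):.2f} TFLOP/s")
print(f"GPU @ 10000: {tflops(10000, 0.174161):.2f} TFLOP/s")
print(f"GPU @ 25000: {tflops(25000, 2.966667):.2f} TFLOP/s")
```

The GPU sustains roughly 10x the CPU's throughput at these sizes, which lines up with the ~15x speedup printed by the benchmark.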

Questions Covered in the Video

Based on the code, output, and benchmarking process above, the video walks through detailed answers and explanations for the following questions:

  • Why do smaller computations, like the 1x1 and 10x10 matrices, run faster on the CPU than on the GPU?
  • At what matrix size does the GPU consistently start outperforming the CPU?
  • What is the magnitude of the performance leap when scaling up to the 10,000x10,000 matrix level?
  • Why did the 50,000x50,000 matrix multiplication trigger a CUDA out-of-memory error?
  • How does the available 23.91 GB GPU memory impact the scaling potential of these models compared to CPU memory?
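For the out-of-memory question, the arithmetic is worth sketching: a 50,000x50,000 float32 tensor needs 50,000² x 4 bytes, and the matmul needs two inputs plus an output resident at once. A minimal estimate (pure Python; assumes float32 and ignores allocator overhead):

```python
# Memory needed for one n x n float32 tensor, in GiB (1 GiB = 2**30 bytes).
def tensor_gib(n: int, bytes_per_elem: int = 4) -> float:
    return n * n * bytes_per_elem / 2 ** 30

n = 50_000
one = tensor_gib(n)
print(f"One {n}x{n} float32 tensor: {one:.2f} GiB")  # 9.31 GiB -- matches the error message
print(f"Two inputs + output: {3 * one:.2f} GiB")     # 27.94 GiB, well over the L4's 22.27 GiB
```

That is exactly the "Tried to allocate 9.31 GiB" in the error: the two input tensors were already on the device, and the output allocation pushed the total past what the GPU can hold.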

Want to find out the answers and learn more about exactly why this happens? Subscribe to AI360Xpert for weekly articles and tutorials on AI, machine learning, and cloud computing.

Watch the complete video explanation of this benchmark on our YouTube channel: https://www.youtube.com/watch?v=O4VJr_q1KFg 

Have questions? Drop them in the comments below!
