
CPU vs GPU Benchmarking with PyTorch
When working with deep learning and large datasets, the choice between CPU and GPU execution has a major impact on speed. This article walks through a benchmark comparing CPU and GPU performance on matrix multiplication using PyTorch.
We start with tiny 1x1 matrices and scale all the way up to 50,000x50,000 to observe where the GPU truly outpaces the CPU and where memory constraints become an issue.
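Before looking at the script, it helps to remember that multiplying two n x n matrices costs roughly 2n^3 floating-point operations, so the workload grows cubically with n. A quick back-of-the-envelope sketch (the 2n^3 count is the standard approximation, not something the benchmark itself measures):

```python
def matmul_flops(n: int) -> int:
    """Approximate floating-point operations for an n x n by n x n matmul:
    n * n output entries, each a dot product of length n (n multiplies + n adds)."""
    return 2 * n ** 3

# Going from 10,000 to 50,000 is a 5x increase in n, hence ~125x the work:
ratio = matmul_flops(50_000) / matmul_flops(10_000)
print(ratio)  # 125.0
```

This cubic growth is why the CPU timings below jump from seconds to minutes over just a few size steps.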
Benchmark Script
The following Python script was used to determine the environment information and execute matrix multiplication across different sizes:
import gc
import platform
import time

import psutil
import torch


def print_env_info():
    print(f"PyTorch Version: {torch.__version__}")
    print(f"CUDA Available: {torch.cuda.is_available()}")
    print(f"CUDA Version: {torch.version.cuda}")


def print_system_info():
    print("\nSystem Information:")
    try:
        import cpuinfo  # optional dependency: pip install py-cpuinfo
        cpu_name = cpuinfo.get_cpu_info().get('brand_raw', 'Unknown')
    except ImportError:
        cpu_name = platform.processor() or 'Unknown'
    print(f"CPU: {cpu_name}")
    freq = psutil.cpu_freq()
    print(f"CPU Frequency: {freq.current:.2f} MHz" if freq else "CPU Frequency: Unknown")
    print(f"CPU Cores: {psutil.cpu_count(logical=False)} Physical, {psutil.cpu_count(logical=True)} Logical")
    mem = psutil.virtual_memory()
    print(f"RAM: {mem.total / 1024 ** 3:.2f} GB (Available: {mem.available / 1024 ** 3:.2f} GB)")
    print(f"OS: {platform.system()}")


def print_gpu_info():
    if torch.cuda.is_available():
        print(f"\nGPU: {torch.cuda.get_device_name(0)}")
        props = torch.cuda.get_device_properties(0)
        print(f"GPU Memory: {props.total_memory / 1e9:.2f} GB")


def benchmark(size):
    print(f"\n=== Benchmark: {size}x{size} Matrix ===")

    # CPU TEST
    cpu_duration = None
    x_cpu = y_cpu = None  # ensure the names exist even if allocation fails
    try:
        x_cpu = torch.randn(size, size)
        y_cpu = torch.randn(size, size)
        _ = torch.matmul(x_cpu, y_cpu)  # Warm-up
        start = time.perf_counter()
        _ = torch.matmul(x_cpu, y_cpu)
        cpu_duration = time.perf_counter() - start
        print(f"CPU Time: {cpu_duration:.6f} seconds")
    except Exception as e:
        print(f"CPU Benchmark Error: {e}")
    finally:
        del x_cpu, y_cpu
        gc.collect()

    # GPU TEST
    if torch.cuda.is_available():
        x_gpu = y_gpu = None
        try:
            device = torch.device("cuda")
            x_gpu = torch.randn(size, size, device=device)
            y_gpu = torch.randn(size, size, device=device)
            for _ in range(5):
                _ = torch.matmul(x_gpu, y_gpu)  # Warm-up
            torch.cuda.synchronize()  # drain queued kernels before starting the clock
            start = time.perf_counter()
            _ = torch.matmul(x_gpu, y_gpu)
            torch.cuda.synchronize()  # wait for the timed kernel to actually finish
            gpu_duration = time.perf_counter() - start
            print(f"GPU Time: {gpu_duration:.6f} seconds")
            if cpu_duration is not None and gpu_duration > 0:
                print(f"Speedup: {cpu_duration / gpu_duration:.2f}x")
        except Exception as e:
            print(f"GPU Benchmark Error: {e}")
        finally:
            del x_gpu, y_gpu
            torch.cuda.empty_cache()
            gc.collect()
    else:
        print("GPU not found.")


if __name__ == "__main__":
    print_env_info()
    print_system_info()
    print_gpu_info()
    sizes = [1, 5, 10, 100, 1000, 10000, 25000, 50000]
    for size in sizes:
        try:
            benchmark(size)
        except Exception as e:
            print(f"Error for size {size}: {e}")

Benchmark Output Results
The execution of the script produced the following output:
PyTorch Version: 2.10.0+cu130
CUDA Available: True
CUDA Version: 13.0
System Information:
CPU: Intel64 Family 6 Model 85 Stepping 7, GenuineIntel
CPU Frequency: 2200.00 MHz
CPU Cores: 6 Physical, 12 Logical
RAM: 48.00 GB (Available: 41.02 GB)
OS: Windows
GPU: NVIDIA L4
GPU Memory: 23.91 GB
=== Benchmark: 1x1 Matrix ===
CPU Time: 0.000078 seconds
GPU Time: 0.000163 seconds
Speedup: 0.48x
=== Benchmark: 5x5 Matrix ===
CPU Time: 0.000063 seconds
GPU Time: 0.000058 seconds
Speedup: 1.10x
=== Benchmark: 10x10 Matrix ===
CPU Time: 0.000009 seconds
GPU Time: 0.000066 seconds
Speedup: 0.13x
=== Benchmark: 100x100 Matrix ===
CPU Time: 0.000105 seconds
GPU Time: 0.000095 seconds
Speedup: 1.11x
=== Benchmark: 1000x1000 Matrix ===
CPU Time: 0.003641 seconds
GPU Time: 0.000667 seconds
Speedup: 5.45x
=== Benchmark: 10000x10000 Matrix ===
CPU Time: 2.631520 seconds
GPU Time: 0.174161 seconds
Speedup: 15.11x
=== Benchmark: 25000x25000 Matrix ===
CPU Time: 42.635571 seconds
GPU Time: 2.966667 seconds
Speedup: 14.37x
=== Benchmark: 50000x50000 Matrix ===
CPU Time: 323.114121 seconds
GPU Benchmark Error: CUDA out of memory. Tried to allocate 9.31 GiB. GPU 0 has a total capacity of 22.27 GiB of which 3.42 GiB is free. Of the allocated memory 18.63 GiB is allocated by PyTorch, and 14.39 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)

Questions Covered in the Video
Based on the code, the output, and the benchmarking process above, watch the video for detailed answers and explanations to the following questions:
- Why do smaller computations, like a 1x1 and 10x10 matrix, run faster on the CPU compared to the GPU?
- At what matrix size does the GPU consistently start outperforming the CPU?
- What is the magnitude of the performance leap when scaling up to the 10,000x10,000 matrix level?
- Why did the 50,000x50,000 matrix multiplication trigger a CUDA out-of-memory error?
- How does the available 23.91 GB GPU memory impact the scaling potential of these models compared to CPU memory?
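As a preview of the memory question, the out-of-memory error at 50,000x50,000 lines up with simple arithmetic on tensor sizes (a sketch, assuming the default float32 dtype that torch.randn produces):

```python
def tensor_gib(n: int, bytes_per_element: int = 4) -> float:
    """Memory needed for an n x n matrix of 4-byte float32 values, in GiB."""
    return n * n * bytes_per_element / 2 ** 30

one = tensor_gib(50_000)
print(f"{one:.2f} GiB per matrix")            # 9.31 GiB, matching the error message
print(f"{3 * one:.2f} GiB for x, y, result")  # 27.94 GiB, well over 22.27 GiB
```

The two input matrices alone consume 18.63 GiB (exactly what the error reports as allocated), and the 9.31 GiB result tensor pushes the total past the L4 GPU's 22.27 GiB capacity. Since the total requirement exceeds capacity, fragmentation workarounds like the expandable_segments setting suggested in the error message cannot rescue this case, while the CPU run succeeds because 48 GB of system RAM is available.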
Want to find out the answers and learn more about exactly why this happens? Subscribe to AI360Xpert for weekly articles and tutorials on AI, machine learning, and cloud computing.
Watch the complete video explanation of this benchmark on our YouTube channel: https://www.youtube.com/watch?v=O4VJr_q1KFg
Have questions? Drop them in the comments below!