"One scientific epoch ended and another began with James Clerk Maxwell"
  - Albert Einstein

 

NVIDIA has recently launched new Tesla GPUs based on the Maxwell architecture.
Table 1 lists key specifications for the new Maxwell-based Tesla M40 and M60 compared to the Kepler-based Tesla K10, K40 and K80.

 

 

| | Tesla K10 | Tesla K40 | Tesla M40 | Tesla M60 | Tesla K80 |
|---|---|---|---|---|---|
| Architecture | Kepler | Kepler | Maxwell | Maxwell | Kepler |
| Cores | 3072 (2x1536) | 2880 | 3072 | 4096 (2x2048) | 4992 (2x2496) |
| Memory (GB) | 8 (2x4) | 12 | 12 | 16 (2x8) | 24 (2x12) |
| Memory Bandwidth (GB/s) | 320 (2x160) | 288 | 288 | 320 (2x160) | 480 (2x240) |
| Peak Single Precision (TFlops) | 4.580 | 5.040 | 6.84 | 9.7 | 8.74 |
| Peak Double Precision (TFlops) | 0.190 | 1.680 | 0.213 | 0.3 | 2.91 |
| Double Precision:Single Precision Throughput Ratio | 1:24 | 1:3 | 1:32 | 1:32 | 1:3 |
| Maximum Shared Memory / Streaming Multiprocessor (KB) | 48 | 48 | 96 | 96 | 112 |
| Registers / Streaming Multiprocessor (KB) | 256 | 256 | 256 | 256 | 512 |
| Compute Capability | 3.0 | 3.5 | 5.2 | 5.2 | 3.7 |
| Cooling Solution | Passive | Passive / Active | Passive | Passive / Active | Passive |

 Table 1 - Key Specifications of NVIDIA Tesla GPUs

 

The Tesla M60, like the K10 and K80, features two GPUs on a single PCIe card. Each GPU appears to CUDA as a separate device, so your applications must be designed to work with multiple GPUs to leverage all the resources on these cards.
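As a rough illustration of what "designed to work with multiple GPUs" means in practice, the sketch below splits a batch of independent work items into one contiguous chunk per GPU. It is plain Python with no GPU calls; the function name `split_across_gpus` and the item counts are hypothetical, and a real application would still need to select each device and launch its share of the work through the CUDA runtime.

```python
def split_across_gpus(n_items: int, n_gpus: int) -> list:
    """Split range(n_items) into n_gpus contiguous (start, stop) chunks.

    Chunk sizes differ by at most one item, so each GPU gets a
    near-equal share. Hypothetical helper, for illustration only.
    """
    if n_gpus < 1:
        raise ValueError("need at least one GPU")
    base, extra = divmod(n_items, n_gpus)
    chunks = []
    start = 0
    for gpu in range(n_gpus):
        # The first `extra` GPUs take one additional item each.
        stop = start + base + (1 if gpu < extra else 0)
        chunks.append((start, stop))
        start = stop
    return chunks


if __name__ == "__main__":
    # A dual-GPU card such as the M60, K10 or K80 exposes two devices.
    for gpu, (start, stop) in enumerate(split_across_gpus(1000, 2)):
        print(f"GPU {gpu}: items {start} to {stop - 1}")
```

An even split like this assumes both GPUs on the card are identical and the work items cost about the same, which is a reasonable starting point for the dual-GPU cards in Table 1.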

If your applications use predominantly single precision floating-point arithmetic, keep in mind that the Maxwell architecture is significantly more efficient than Kepler. For example, our benchmarks of the single precision dense matrix-matrix multiplication routine in cuBLAS (SGEMM) sustain ~87% of peak throughput on Maxwell, compared to ~75% on Kepler.
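To put those efficiency figures in absolute terms, the sketch below multiplies the peak single precision numbers from Table 1 by the quoted SGEMM efficiencies. It assumes the per-architecture percentages apply to the specific cards shown (the K40 for Kepler, the M40 for Maxwell), which the benchmarks above do not state explicitly, so treat the results as estimates.

```python
# Peak single precision throughput in TFlops, copied from Table 1.
PEAK_SP_TFLOPS = {"Tesla K40": 5.040, "Tesla M40": 6.84}

# Fraction of peak sustained by SGEMM, from the benchmark figures quoted above.
# Assumption: the per-architecture figures apply to these specific cards.
SGEMM_EFFICIENCY = {"Tesla K40": 0.75, "Tesla M40": 0.87}


def sustained_tflops(peak_tflops: float, efficiency: float) -> float:
    """Return the throughput sustained at the given fraction of peak."""
    return peak_tflops * efficiency


if __name__ == "__main__":
    for card, peak in PEAK_SP_TFLOPS.items():
        estimate = sustained_tflops(peak, SGEMM_EFFICIENCY[card])
        print(f"{card}: ~{estimate:.2f} TFlops sustained SGEMM (estimate)")
```

Under these assumptions the M40 would sustain roughly 5.95 TFlops in SGEMM against roughly 3.78 TFlops for the K40, a wider gap than the peak numbers alone suggest.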

The Maxwell architecture is not optimized for double precision arithmetic. You can expect the K40 and K80 to outperform the new M40 and M60 for workloads that require high-throughput double precision arithmetic.
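The double precision gap can be read directly off Table 1. The short sketch below recomputes the double:single throughput ratio from the peak figures and ranks the cards by peak double precision; the helper names `sp_per_dp` and `rank_by_double_precision` are illustrative only, and the numbers are the table's theoretical peaks rather than measured results.

```python
# (peak single precision, peak double precision) in TFlops, copied from Table 1.
PEAK_TFLOPS = {
    "Tesla K10": (4.580, 0.190),
    "Tesla K40": (5.040, 1.680),
    "Tesla M40": (6.84, 0.213),
    "Tesla M60": (9.7, 0.3),
    "Tesla K80": (8.74, 2.91),
}


def sp_per_dp(single: float, double: float) -> int:
    """Return N such that the double:single throughput ratio is roughly 1:N."""
    return round(single / double)


def rank_by_double_precision(specs: dict) -> list:
    """Return card names sorted from highest to lowest peak double precision."""
    return sorted(specs, key=lambda card: specs[card][1], reverse=True)


if __name__ == "__main__":
    for card in rank_by_double_precision(PEAK_TFLOPS):
        single, double = PEAK_TFLOPS[card]
        print(f"{card}: {double} TFlops double precision, ratio 1:{sp_per_dp(single, double)}")
```

The ranking puts the K80 and K40 well ahead of both Maxwell cards, and the recomputed ratios match the 1:3, 1:24 and 1:32 entries in the table.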

The new Tesla GPUs support Compute Capability 5.2, which provides Dynamic Parallelism and Hyper-Q, like the K40 and K80. They also feature 96 KB of shared memory per streaming multiprocessor, double the maximum available on Compute Capability 3.0 and 3.5 devices.
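One practical effect of the larger shared memory is that more thread blocks can be resident on each multiprocessor at once. The sketch below computes the upper bound imposed by shared memory alone, using the per-multiprocessor figures from Table 1. The 16 KB per-block usage is a hypothetical example value, and real occupancy is also limited by registers, thread counts and hardware block limits that are not modeled here.

```python
# Maximum shared memory per streaming multiprocessor in KB, copied from Table 1,
# keyed by compute capability.
SHARED_KB_PER_SM = {"3.0": 48, "3.5": 48, "3.7": 112, "5.2": 96}


def max_blocks_by_shared_memory(shared_kb_per_sm: int, shared_kb_per_block: int) -> int:
    """Upper bound on resident blocks per multiprocessor from shared memory alone.

    Other limits (registers, threads, hardware block limits) are ignored.
    """
    if shared_kb_per_block <= 0:
        raise ValueError("per-block shared memory must be positive")
    return shared_kb_per_sm // shared_kb_per_block


if __name__ == "__main__":
    block_kb = 16  # hypothetical kernel using 16 KB of shared memory per block
    for capability, sm_kb in sorted(SHARED_KB_PER_SM.items()):
        bound = max_blocks_by_shared_memory(sm_kb, block_kb)
        print(f"Compute {capability}: at most {bound} blocks per multiprocessor")
```

For this hypothetical 16 KB kernel the bound rises from 3 blocks on Compute 3.0 and 3.5 devices to 6 on Compute 5.2, with Compute 3.7 (the K80) at 7.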

The Tesla K80 is still a compelling offering. In addition to the highest double precision performance of the cards in Table 1, it provides the highest global memory bandwidth. The K80 also provides the most shared memory per multiprocessor, as well as twice the shared memory bandwidth per streaming multiprocessor, because the Kepler shared memory banks are twice as wide as those of Fermi and Maxwell. Finally, the K80 provides twice as many registers per multiprocessor, which improves efficiency in some applications.

The Tesla M40 comes with a passive cooling solution, so it can only be installed in servers and workstations specifically designed to provide airflow to cool the GPU. The Tesla M60 comes in active and passive variants. The actively cooled variant of the M60 can be installed in any server or workstation with sufficient space and power.