Estimating a machine's compute power
# lscpu
Architecture: aarch64
CPU op-mode(s): 32-bit, 64-bit
Byte Order: Little Endian
CPU(s): 4
On-line CPU(s) list: 0-3
Vendor ID: ARM
Model name: Cortex-A72
Model: 3
Thread(s) per core: 1
Core(s) per cluster: 4
Socket(s): -
Cluster(s): 1
Stepping: r0p3
CPU max MHz: 1800.0000
CPU min MHz: 600.0000
BogoMIPS: 108.00
Flags: fp asimd evtstrm crc32 cpuid
Caches (sum of all):
L1d: 128 KiB (4 instances)
L1i: 192 KiB (4 instances)
L2: 1 MiB (1 instance)
Vulnerabilities:
Gather data sampling: Not affected
Itlb multihit: Not affected
L1tf: Not affected
Mds: Not affected
Meltdown: Not affected
Mmio stale data: Not affected
Reg file data sampling: Not affected
Retbleed: Not affected
Spec rstack overflow: Not affected
Spec store bypass: Vulnerable
Spectre v1: Mitigation; __user pointer sanitization
Spectre v2: Vulnerable
Srbds: Not affected
Tsx async abort: Not affected
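Note (Raspberry Pi 4):
Theoretical peak = 4 cores × 1.8 GHz × 8 FLOPs/cycle = 57.6 GFLOPS (FP32), where 8 = 4 FP32 lanes (128-bit NEON) × 2 (fused multiply-add).
ARM cores like these are noticeably limited by memory bandwidth; as a rule of thumb:
Usable FP32: 30%–60% of peak ≈ 17–35 GFLOPS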
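The estimate above is just cores × clock × FLOPs per cycle. Here is a minimal sketch of that arithmetic; the function name `peak_gflops` is made up for illustration, and the 30%–60% band is the rule of thumb from this note, not a measurement.

```python
def peak_gflops(cores: int, ghz: float, flops_per_cycle: int) -> float:
    """Theoretical peak in GFLOPS: cores x clock (GHz) x FLOPs per cycle per core."""
    return cores * ghz * flops_per_cycle


# Raspberry Pi 4 figures from the lscpu output: 4 cores, 1800 MHz max clock.
# 8 FP32 FLOPs/cycle = 4 lanes (128-bit NEON) x 2 (fused multiply-add).
pi4_peak = peak_gflops(cores=4, ghz=1.8, flops_per_cycle=8)

# Rule-of-thumb band from this note: 30%-60% of peak is actually reachable.
pi4_usable = (0.3 * pi4_peak, 0.6 * pi4_peak)

print(f"peak   : {pi4_peak:.1f} GFLOPS")
print(f"usable : {pi4_usable[0]:.1f} - {pi4_usable[1]:.1f} GFLOPS")
```

The same function works for any homogeneous CPU once you know its core count, clock, and vector width.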
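For Apple Silicon, which mixes performance (P) and efficiency (E) cores, sum both clusters (frequencies in GHz):
FP32 GFLOPS = (P-cores × P-core frequency + E-cores × E-core frequency) × 8
The factor 8 assumes one 128-bit FMA per core per cycle (4 FP32 lanes × 2). This is a simplification: cores that can issue several vector FMAs per cycle have a higher peak.
Count the cores of each type:
sysctl -n hw.perflevel0.physicalcpu   # performance cores (P-cores)
sysctl -n hw.perflevel1.physicalcpu   # efficiency cores (E-cores)
Check the chip model:
sysctl -n machdep.cpu.brand_string
Approximate clocks by chip:

| Chip | P-core frequency | E-core frequency |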
|---|---|---|
| M1 | ~3.2 GHz | ~2.0 GHz |
| M2 | ~3.5 GHz | ~2.4 GHz |
| M3 | ~4.0 GHz | ~2.8 GHz |

$ sysctl -n hw.perflevel0.physicalcpu
6
$ sysctl -n hw.perflevel1.physicalcpu
2
$ sysctl -n machdep.cpu.brand_string
Apple M1 Pro
CPU peak ≈ (6 × 3.2 + 2 × 2.0) × 8 = 185.6 GFLOPS (FP32)
Apple Silicon has ample memory bandwidth, but its scheduling is conservative:
Usable FP32 ≈ 60%–80% of peak ≈ 110–150 GFLOPS
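The P-core/E-core formula can be sketched as a small lookup plus one line of arithmetic. The clock table is copied from this note (approximate values), the names `APPROX_GHZ`, `chip_family` and `apple_cpu_peak_gflops` are made up for illustration, and the default of 8 FLOPs per cycle is this note's simplifying assumption.

```python
# Approximate clocks in GHz as (P-core, E-core), copied from the table in this note.
APPROX_GHZ = {"M1": (3.2, 2.0), "M2": (3.5, 2.4), "M3": (4.0, 2.8)}


def chip_family(brand: str) -> str:
    """Map a brand string such as 'Apple M1 Pro' to its family key, e.g. 'M1'."""
    for token in brand.split():
        if token in APPROX_GHZ:
            return token
    raise ValueError(f"no clock estimate for {brand!r}")


def apple_cpu_peak_gflops(p_cores: int, e_cores: int, brand: str,
                          flops_per_cycle: int = 8) -> float:
    """(P-cores x P-clock + E-cores x E-clock) x FLOPs per cycle, in GFLOPS."""
    p_ghz, e_ghz = APPROX_GHZ[chip_family(brand)]
    return (p_cores * p_ghz + e_cores * e_ghz) * flops_per_cycle


# Values as printed by the three sysctl commands in the session above.
peak = apple_cpu_peak_gflops(p_cores=6, e_cores=2, brand="Apple M1 Pro")

print(f"peak   : {peak:.1f} GFLOPS")
print(f"usable : {0.6 * peak:.0f} - {0.8 * peak:.0f} GFLOPS")
```

Passing a different `flops_per_cycle` is the easiest way to see how sensitive the result is to that assumption.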
$ system_profiler SPDisplaysDataType
Graphics/Displays:
Apple M1 Pro:
Chipset Model: Apple M1 Pro
Type: GPU
Bus: Built-In
Total Number of Cores: 14
Vendor: Apple (0x106b)
Metal Support: Metal 4
Displays:
Color LCD:
Display Type: Built-in Liquid Retina XDR Display
Resolution: 3024 x 1964 Retina
Main Display: Yes
Mirror: Off
Online: Yes
Automatically Adjust Brightness: Yes
Connection Type: Internal
Official reference figures for GPU FP32 theoretical peak:

| Chip / GPU cores | FP32 TFLOPS |
|---|---|
| M1 Pro, 14-core GPU | ~4.5 TFLOPS |
| M1 Pro, 16-core GPU | ~5.2 TFLOPS |

$ lscpu
Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Address sizes: 46 bits physical, 57 bits virtual
Byte Order: Little Endian
CPU(s): 128
On-line CPU(s) list: 0-127
Vendor ID: GenuineIntel
Model name: INTEL(R) XEON(R) GOLD 6530
CPU family: 6
Model: 207
Thread(s) per core: 2
Core(s) per socket: 32
Socket(s): 2
Stepping: 2
CPU max MHz: 4000.0000
CPU min MHz: 800.0000
BogoMIPS: 4200.00
Flags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc art arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc cpuid aperfmperf tsc_known_freq pni pclmulqdq dtes64 ds_cpl vmx smx est tm2 ssse3 sdbg fma cx16 xtpr pdcm pcid dca sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand lahf_lm abm 3dnowprefetch cpuid_fault epb cat_l3 cat_l2 cdp_l3 intel_ppin cdp_l2 ssbd mba ibrs ibpb stibp ibrs_enhanced tpr_shadow flexpriority ept vpid ept_ad fsgsbase tsc_adjust bmi1 avx2 smep bmi2 erms invpcid cqm rdt_a avx512f avx512dq rdseed adx smap avx512ifma clflushopt clwb intel_pt avx512cd sha_ni avx512bw avx512vl xsaveopt xsavec xgetbv1 xsaves cqm_llc cqm_occup_llc cqm_mbm_total cqm_mbm_local split_lock_detect user_shstk avx_vnni avx512_bf16 wbnoinvd dtherm ida arat pln pts vnmi avx512vbmi umip pku ospke waitpkg avx512_vbmi2 gfni vaes vpclmulqdq avx512_vnni avx512_bitalg tme avx512_vpopcntdq la57 rdpid bus_lock_detect cldemote movdiri movdir64b enqcmd fsrm md_clear serialize tsxldtrk pconfig arch_lbr ibt amx_bf16 avx512_fp16 amx_tile amx_int8 flush_l1d arch_capabilities
Virtualization features:
Virtualization: VT-x
Caches (sum of all):
L1d: 3 MiB (64 instances)
L1i: 2 MiB (64 instances)
L2: 128 MiB (64 instances)
L3: 320 MiB (2 instances)
NUMA:
NUMA node(s): 2
NUMA node0 CPU(s): 0-31,64-95
NUMA node1 CPU(s): 32-63,96-127
Vulnerabilities:
Gather data sampling: Not affected
Itlb multihit: Not affected
L1tf: Not affected
Mds: Not affected
Meltdown: Not affected
Mmio stale data: Not affected
Reg file data sampling: Not affected
Retbleed: Not affected
Spec rstack overflow: Not affected
Spec store bypass: Mitigation; Speculative Store Bypass disabled via prctl
Spectre v1: Mitigation; usercopy/swapgs barriers and __user pointer sanitization
Spectre v2: Mitigation; Enhanced / Automatic IBRS; IBPB conditional; RSB filling; PBRSB-eIBRS SW sequence; BHI BHI_DIS_S
Srbds: Not affected
Tsx async abort: Not affected
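Note:
2 × Intel Xeon Gold 6530: 128 logical CPUs in total (64 physical cores × 2 hyperthreads)
Theoretical peak = 64 cores × 4 GHz × 32 FLOPs/cycle ≈ 8.2 TFLOPS (FP32), where 32 = 16 FP32 lanes (AVX-512) × 2 (fused multiply-add). This uses the 4 GHz maximum clock reported by lscpu, so treat it as an upper bound: sustained all-core AVX-512 clocks are lower.
Usable throughput is typically 70–80% of peak ≈ 5.74–6.56 TFLOPS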
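The numbers in this estimate all come straight out of lscpu, so the calculation is easy to script. This is a rough sketch, not a complete lscpu parser: `parse_lscpu` and `x86_peak_tflops` are hypothetical helper names, and the default of 32 FLOPs per cycle is this note's assumption of one AVX-512 FMA per core per cycle.

```python
def parse_lscpu(text: str) -> dict[str, str]:
    """Turn 'Key: value' lines of lscpu output into a dict (first occurrence wins)."""
    fields: dict[str, str] = {}
    for line in text.splitlines():
        key, sep, value = line.partition(":")
        if sep and key.strip() not in fields:
            fields[key.strip()] = value.strip()
    return fields


def x86_peak_tflops(fields: dict[str, str], flops_per_cycle: int = 32) -> float:
    """Physical cores x max clock (GHz) x FLOPs per cycle, in TFLOPS."""
    cores = int(fields["Socket(s)"]) * int(fields["Core(s) per socket"])
    ghz = float(fields["CPU max MHz"]) / 1000.0
    return cores * ghz * flops_per_cycle / 1000.0


# The relevant lines, copied from the lscpu output above.
SAMPLE = """\
Thread(s) per core: 2
Core(s) per socket: 32
Socket(s): 2
CPU max MHz: 4000.0000
"""

peak = x86_peak_tflops(parse_lscpu(SAMPLE))

print(f"peak   : {peak:.2f} TFLOPS")
print(f"usable : {0.7 * peak:.2f} - {0.8 * peak:.2f} TFLOPS")
```

On a live Linux machine you could feed it `subprocess.run(["lscpu"], capture_output=True, text=True).stdout` instead of `SAMPLE`. Hyperthreads are deliberately ignored, since two threads on one core share the same vector units.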
$ nvidia-smi
Wed Jan 7 17:51:05 2026
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 570.169 Driver Version: 570.169 CUDA Version: 12.8 |
|-----------------------------------------+------------------------+----------------------+
| GPU Name Persistence-M | Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|=========================================+========================+======================|
| 0 NVIDIA GeForce RTX 4090 Off | 00000000:4B:00.0 Off | Off |
| 30% 29C P8 8W / 450W | 7449MiB / 24564MiB | 0% Default |
| | | N/A |
+-----------------------------------------+------------------------+----------------------+
| 1 NVIDIA GeForce RTX 4090 Off | 00000000:4C:00.0 Off | Off |
| 30% 31C P8 8W / 450W | 21079MiB / 24564MiB | 0% Default |
| | | N/A |
+-----------------------------------------+------------------------+----------------------+
| 2 NVIDIA GeForce RTX 4090 Off | 00000000:4E:00.0 Off | Off |
| 30% 29C P8 10W / 450W | 3457MiB / 24564MiB | 0% Default |
| | | N/A |
+-----------------------------------------+------------------------+----------------------+
| 3 NVIDIA GeForce RTX 4090 Off | 00000000:4F:00.0 Off | Off |
| 30% 31C P8 7W / 450W | 2007MiB / 24564MiB | 0% Default |
| | | N/A |
+-----------------------------------------+------------------------+----------------------+
| 4 NVIDIA GeForce RTX 4090 Off | 00000000:CB:00.0 Off | Off |
| 30% 33C P8 7W / 450W | 20443MiB / 24564MiB | 0% Default |
| | | N/A |
+-----------------------------------------+------------------------+----------------------+
| 5 NVIDIA GeForce RTX 4090 Off | 00000000:CC:00.0 Off | Off |
| 30% 34C P8 5W / 450W | 20359MiB / 24564MiB | 0% Default |
| | | N/A |
+-----------------------------------------+------------------------+----------------------+
| 6 NVIDIA GeForce RTX 4090 Off | 00000000:CE:00.0 Off | Off |
| 30% 31C P8 7W / 450W | 20359MiB / 24564MiB | 0% Default |
| | | N/A |
+-----------------------------------------+------------------------+----------------------+
| 7 NVIDIA GeForce RTX 4090 Off | 00000000:CF:00.0 Off | Off |
| 30% 32C P8 5W / 450W | 22098MiB / 24564MiB | 0% Default |
| | | N/A |
+-----------------------------------------+------------------------+----------------------+
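
The table above is laid out for people. For scripts, nvidia-smi also has a CSV query mode (`--query-gpu=... --format=csv`). A small sketch of parsing that output follows; `SAMPLE` is transcribed by hand from the table above rather than captured from a real CSV run, and `parse_gpus` is a hypothetical helper name.

```python
import csv
import io

# Command that would produce rows shaped like SAMPLE (not executed here):
#   nvidia-smi --query-gpu=index,name,memory.used,memory.total --format=csv,noheader,nounits
SAMPLE = """\
0, NVIDIA GeForce RTX 4090, 7449, 24564
1, NVIDIA GeForce RTX 4090, 21079, 24564
2, NVIDIA GeForce RTX 4090, 3457, 24564
3, NVIDIA GeForce RTX 4090, 2007, 24564
4, NVIDIA GeForce RTX 4090, 20443, 24564
5, NVIDIA GeForce RTX 4090, 20359, 24564
6, NVIDIA GeForce RTX 4090, 20359, 24564
7, NVIDIA GeForce RTX 4090, 22098, 24564
"""


def parse_gpus(csv_text: str) -> list[dict]:
    """Parse 'index, name, used, total' rows into dicts; memory values are in MiB."""
    gpus = []
    for row in csv.reader(io.StringIO(csv_text), skipinitialspace=True):
        if not row:  # skip blank lines
            continue
        index, name, used, total = row
        gpus.append({"index": int(index), "name": name,
                     "used_mib": int(used), "total_mib": int(total)})
    return gpus


gpus = parse_gpus(SAMPLE)
used = sum(g["used_mib"] for g in gpus)
total = sum(g["total_mib"] for g in gpus)

print(f"{len(gpus)} GPUs, {used} / {total} MiB in use")
```

Counting the rows this way gives the GPU count used in the peak estimate without reading it off the table by eye.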
| GPU | CUDA cores | Base clock | Boost clock | FP32 peak |
|---|---|---|---|---|
| RTX 4090 | 16,384 | 2.23 GHz | 2.52 GHz | ~82.6 TFLOPS (FP32, at boost clock) |

82.6 TFLOPS × 8 ≈ 660 TFLOPS (theoretical peak)
Usable (70–80%) ≈ 460–530 TFLOPS
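The per-card figure comes from CUDA cores × 2 FLOPs (one FMA) × clock. Here is a sketch of that calculation. One assumption to flag: the 82.6 TFLOPS figure corresponds to the card's 2.52 GHz reference boost clock, which nvidia-smi does not print in the output above. `gpu_peak_tflops` is a hypothetical helper name.

```python
def gpu_peak_tflops(cuda_cores: int, clock_ghz: float) -> float:
    """FP32 peak in TFLOPS: each CUDA core can retire one FMA (2 FLOPs) per cycle."""
    return cuda_cores * 2 * clock_ghz / 1000.0


CUDA_CORES = 16_384  # RTX 4090, from the table in this note
BOOST_GHZ = 2.52     # assumption: reference boost clock, not shown by nvidia-smi above
NUM_GPUS = 8         # GPUs 0-7 in the nvidia-smi output

per_card = gpu_peak_tflops(CUDA_CORES, BOOST_GHZ)
total = NUM_GPUS * per_card

print(f"per card : {per_card:.1f} TFLOPS")
print(f"8 cards  : {total:.0f} TFLOPS")
print(f"usable   : {0.7 * total:.0f} - {0.8 * total:.0f} TFLOPS")
```

Summing across cards assumes the workload scales perfectly over all eight GPUs, which is the optimistic case.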
| Item | Raspberry Pi 4 | M1 Pro | Server (2× Xeon Gold 6530) |
|---|---|---|---|
| Theoretical FP32 (CPU) | ~58 GFLOPS | ~186 GFLOPS | ~8.2 TFLOPS |
| Usable FP32 (CPU) | ~20–30 GFLOPS | ~110–150 GFLOPS | ~5.74–6.56 TFLOPS |
| Ratio (usable) | 1× | ~5–6× | ~190–330× |

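
Because every "usable" figure is a range, the speed-up over the Raspberry Pi is a range too: the pessimistic end divides the machine's low value by the Pi's high value, and the optimistic end does the reverse. A sketch, using 5.74 TFLOPS (70% of 8.2 TFLOPS) as the server's lower bound; `ratio_range` is a hypothetical helper name.

```python
def ratio_range(machine: tuple[float, float],
                baseline: tuple[float, float]) -> tuple[float, float]:
    """Most pessimistic and most optimistic speed-up of machine over baseline."""
    lo, hi = machine
    base_lo, base_hi = baseline
    return lo / base_hi, hi / base_lo


# Usable CPU FP32 ranges in GFLOPS, as estimated in this note.
PI4 = (20.0, 30.0)
M1_PRO = (110.0, 150.0)
SERVER = (5740.0, 6560.0)  # 70%-80% of 8.2 TFLOPS

m1_lo, m1_hi = ratio_range(M1_PRO, PI4)
srv_lo, srv_hi = ratio_range(SERVER, PI4)

print(f"M1 Pro vs Pi 4 : {m1_lo:.1f}x - {m1_hi:.1f}x")
print(f"Server vs Pi 4 : {srv_lo:.0f}x - {srv_hi:.0f}x")
```

The single figure of about 5–6× quoted for the M1 Pro sits in the middle of the range this produces.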
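| Item | Raspberry Pi 4 | M1 Pro | Server (8× RTX 4090) |
|---|---|---|---|
| GPU FP32 | ~10 GFLOPS | ~4.5 TFLOPS (14-core GPU) | ~660 TFLOPS |
| AI / dedicated compute units | ❌ | Neural Engine ≈ 11 TOPS | Tensor Cores (RTX 4090) ≈ 1300+ TOPS INT8 per card |

Overall, as rough order-of-magnitude ratios:
M1 Pro ≈ 100× Raspberry Pi
8× RTX 4090 server ≈ 60× M1 Pro
Server ≈ 8000× Raspberry Pi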


Author: 42tr
Copyright notice: Unless otherwise stated, all posts on this blog are licensed under BY-NC-SA. Please credit the source when reposting!