
Proof / Evidence Hub (Examples, Benchmarks, Research)

If you arrived from an ad or a referral, the main links are collected here so you can check the evidence directly.

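Acceptance checklist (3 items)

Fixing the definition of "ready to adopt" up front reduces indecision and shortens the rollout.

  1. Performance: confirm the minimum acceptable level with numbers (e.g. step/s, ns/day)
    → Benchmark: /benchmarks/lammps-allegro-llzo-bulk-50k/
  2. Reproducibility: the same input gives the same result (and does not break across environments)
    → A foundation you can break and restore (Baseline + Runbook): /lp/rescue-ready-baseline/
  3. Operability: it can be run day to day after acceptance (Slurm, logs, and maintenance all work)
    → Start by sharing your requirements in one minute: /lp/h200-nvl-1gpu-runbook/#contact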

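Fastest if you have input files

Paste whatever you are able to share of your pw.x (QE) input, in.lammps, and so on. If you cannot paste anything, three points are enough to get started: goal, deadline, and current status.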


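Real logs (excerpts): Performance / Reproducibility / Operability

These are the actual logs that correspond to the Evidence Hub acceptance checklist (no sensitive information is included).

1) Performance

  • Log: ~/qe_bench/llzo_demo_qe/log.300K.lammps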
    386000   651.10626     -471925.79     -471990.34      64.552323      5928.7809    
    388000   631.44725     -471921.21     -471983.81      62.603279      10206.577    
    390000   944.71392     -471883.58     -471977.25      93.66133       4906.1555    
    392000   360.11325     -471952.59     -471988.3       35.702539      8098.5491    
    394000   634.54382     -471899.04     -471961.95      62.910281     -1505.0129    
    396000   542.438       -471924.62     -471978.4       53.778676      13390.788    
    398000   537.233       -471915.97     -471969.23      53.262639      9760.8953    
    400000   768.84021     -471866.12     -471942.34      76.224765     -1216.0175    
Loop time of 2759.68 on 1 procs for 400000 steps with 768 atoms

Performance: 6.262 ns/day, 3.833 hours/ns, 144.945 timesteps/s, 111.317 katom-step/s
105.3% CPU use with 1 MPI tasks x 1 OpenMP threads

MPI task timing breakdown:
Section |  min time  |  avg time  |  max time  |%varavg| %total
---------------------------------------------------------------
Pair    | 2618.6     | 2618.6     | 2618.6     |   0.0 | 94.89
Neigh   | 101.68     | 101.68     | 101.68     |   0.0 |  3.68
Comm    | 18.388     | 18.388     | 18.388     |   0.0 |  0.67
Output  | 0.064166   | 0.064166   | 0.064166   |   0.0 |  0.00
Modify  | 15.671     | 15.671     | 15.671     |   0.0 |  0.57
Other   |            | 5.248      |            |       |  0.19

Nlocal:            768 ave         768 max         768 min
Histogram: 1 0 0 0 0 0 0 0 0 0
Nghost:           3495 ave        3495 max        3495 min
Histogram: 1 0 0 0 0 0 0 0 0 0
Neighs:              0 ave           0 max           0 min
Histogram: 1 0 0 0 0 0 0 0 0 0
FullNghs:       136138 ave      136138 max      136138 min
Histogram: 1 0 0 0 0 0 0 0 0 0

Total # of neighbors = 136138
Ave neighs/atom = 177.26302
Neighbor list builds = 10367
Dangerous builds = 0

write_data      final_${Tlab}.data
write_data      final_300K.data
System init for write_data ...
Generated 0 of 6 mixed pair_coeff terms from geometric mixing rule
Total wall time: 0:46:14
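
The "minimum acceptable level" check from the checklist can be automated against the summary lines of a log like the one above. The following is a minimal sketch, not the procedure used for this page: the helper names `parse_performance` and `meets_floor` and the 5 ns/day floor are assumptions made for illustration. It reads the `Loop time` and `Performance:` lines, recomputes timesteps/s from the loop time as a sanity check, and compares ns/day with the floor.

```python
import re

# Patterns for the two LAMMPS summary lines quoted in the excerpt above.
LOOP_RE = re.compile(
    r"Loop time of ([\d.]+) on (\d+) procs for (\d+) steps with (\d+) atoms"
)
PERF_RE = re.compile(
    r"Performance: ([\d.]+) ns/day, ([\d.]+) hours/ns, ([\d.]+) timesteps/s"
)


def parse_performance(log_text):
    """Extract reported and recomputed throughput from a LAMMPS log (hypothetical helper)."""
    loop = LOOP_RE.search(log_text)
    perf = PERF_RE.search(log_text)
    if loop is None or perf is None:
        raise ValueError("Loop time / Performance lines not found")
    loop_time = float(loop.group(1))
    steps = int(loop.group(3))
    atoms = int(loop.group(4))
    return {
        "ns_per_day": float(perf.group(1)),
        "timesteps_per_s": float(perf.group(3)),
        # Recomputed from the loop time, to cross-check the reported figure.
        "recomputed_timesteps_per_s": steps / loop_time,
        "katom_steps_per_s": steps / loop_time * atoms / 1000.0,
    }


def meets_floor(metrics, min_ns_per_day):
    """True when the reported ns/day reaches the agreed floor (the floor is an assumption)."""
    return metrics["ns_per_day"] >= min_ns_per_day


SAMPLE = """Loop time of 2759.68 on 1 procs for 400000 steps with 768 atoms

Performance: 6.262 ns/day, 3.833 hours/ns, 144.945 timesteps/s, 111.317 katom-step/s
"""

if __name__ == "__main__":
    metrics = parse_performance(SAMPLE)
    print(metrics)
    print("meets 5 ns/day floor:", meets_floor(metrics, 5.0))
```

For the excerpt above, the recomputed value (400000 steps / 2759.68 s) agrees with the reported 144.945 timesteps/s to within rounding.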

2) Reproducibility

  • Log: ~/qe_bench/llzo_demo_qe/log.lammps
LAMMPS (29 Aug 2024)
  using 1 OpenMP thread(s) per MPI task
Loaded 1 plugins from /opt/deepmd-kit/lib/deepmd_lmp
units           metal
atom_style      atomic
boundary        p p p

# Read the 96-atom base structure (LLZO), then expand it to 4x4x4 (= 96 × 64 = 6144 atoms)
read_data       data.llzo
Reading data file ...
  orthogonal box = (-5.1135447 -9.4665824e-07 -5.3294159) to (8.8460671 9.1068882 8.4665605)
  2 by 1 by 2 MPI processor grid
  reading atoms ...
  96 atoms
  read_data CPU = 0.007 seconds
replicate       4 4 4
Replication is creating a 4x4x4 = 64 times larger system...
  orthogonal box = (-5.1135447 -9.4665824e-07 -5.3294159) to (50.724903 36.427556 49.85449)
  2 by 1 by 2 MPI processor grid
  6144 atoms
  replicate CPU = 0.002 seconds

# Masses (may override values already in the data file) 1:Li, 2:La, 3:Zr, 4:O
mass 1 6.94
mass 2 138.905
mass 3 91.224
mass 4 15.999

# DeePMD model
pair_style      deepmd graph_3T.pb
Summary of lammps deepmd module ...
  >>> Info of deepmd-kit:
  installed to:       /opt/deepmd-kit
  source:             
  source branch:      HEAD
  source commit:      8b3dc08
  source commit at:   2025-06-11 13:00:46 +0200
  support model ver.: 1.1 
  build variant:      cuda
  build with tf inc:  /opt/deepmd-kit/lib/python3.12/site-packages/tensorflow/include;/opt/deepmd-kit/include
  build with tf lib:  /opt/deepmd-kit/lib/python3.12/site-packages/tensorflow/libtensorflow_cc.so.2
  build with pt lib:  torch;torch_library;/opt/deepmd-kit/lib/python3.12/site-packages/torch/lib/libc10.so;CUDA::nvrtc;torch::nvtoolsext;/opt/deepmd-kit/lib/python3.12/site-packages/torch/lib/libc10_cuda.so
  set tf intra_op_parallelism_threads: 1
  set tf inter_op_parallelism_threads: 1
  >>> Info of lammps module:
  use deepmd-kit at:  /opt/deepmd-kit
pair_coeff      * *  Li La Zr O

# Neighbor list
neighbor        2.0 bin
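
One way to check that two runs used the same environment is to compare the software versions recorded in the log header. The sketch below assumes the header format shown in the excerpt above. The field names and the `fingerprint` / `diff_fingerprints` helpers are hypothetical, and the second commit hash in the demo is made up purely to show a mismatch.

```python
import re

# Header fields taken from the log excerpt above; the dictionary keys are illustrative names.
FIELDS = {
    "lammps_version": re.compile(r"^LAMMPS \((.+?)\)", re.M),
    "deepmd_commit": re.compile(r"^\s*source commit:\s*(\S+)", re.M),
    "build_variant": re.compile(r"^\s*build variant:\s*(\S+)", re.M),
    "model_version": re.compile(r"^\s*support model ver\.:\s*(\S+)", re.M),
}


def fingerprint(log_text):
    """Collect version fields from a log header; a missing field becomes None."""
    result = {}
    for name, pattern in FIELDS.items():
        match = pattern.search(log_text)
        result[name] = match.group(1) if match else None
    return result


def diff_fingerprints(a, b):
    """Return {field: (value_in_a, value_in_b)} for every field that differs."""
    return {key: (a[key], b.get(key)) for key in a if a[key] != b.get(key)}


SAMPLE = """LAMMPS (29 Aug 2024)
  using 1 OpenMP thread(s) per MPI task
  source commit:      8b3dc08
  source commit at:   2025-06-11 13:00:46 +0200
  support model ver.: 1.1
  build variant:      cuda
"""

if __name__ == "__main__":
    reference = fingerprint(SAMPLE)
    # "0000000" is a made-up commit, used only to demonstrate a detected difference.
    other = fingerprint(SAMPLE.replace("8b3dc08", "0000000"))
    print(reference)
    print(diff_fingerprints(reference, other))
```

An empty result from `diff_fingerprints` only shows that the recorded versions match. It does not by itself show that the numerical results are identical.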

3) Operability (Slurm)

  • Slurm err: ~/qe_bench/llzo_demo_qe/slurm.llzo_700K_4gpu.2925.err
sh: 1: srun: not found
/var/spool/slurmd/job02925/slurm_script: line 33: 99446 Segmentation fault      (core dumped) apptainer exec --nv "$SIF" bash -lc '
  export OMP_NUM_THREADS=1
  export CUDA_DEVICE_ORDER=PCI_BUS_ID
  export PATH=/opt/deepmd-kit/bin:$PATH
  export LD_LIBRARY_PATH=/opt/deepmd-kit/lib:$LD_LIBRARY_PATH
  export OMPI_MCA_btl="^openib"  # exclude the openib BTL when InfiniBand is not used (a harmless safeguard)

  # Assign a local GPU to each rank: local_rank (0..3) → CUDA_VISIBLE_DEVICES
  /opt/deepmd-kit/bin/mpirun -np 4 \
    --map-by ppr:4:node --bind-to none \
    bash -lc "export CUDA_VISIBLE_DEVICES=\${OMPI_COMM_WORLD_LOCAL_RANK}; \
              echo [rank:\$OMPI_COMM_WORLD_RANK] gpu=\$CUDA_VISIBLE_DEVICES; \
              exec /opt/deepmd-kit/bin/lmp -in '"$IN"'"
'
  • Slurm out: ~/qe_bench/llzo_demo_qe/slurm.llzo_700K_4gpu.2927.out
[rank:0] GPU=0
[rank:1] GPU=0
[rank:3] GPU=0
[rank:2] GPU=0
LAMMPS (29 Aug 2024)
  using 1 OpenMP thread(s) per MPI task
Loaded 1 plugins from /opt/deepmd-kit/lib/deepmd_lmp
Reading data file ...
  orthogonal box = (-5.1135447 -9.4665824e-07 -5.3294159) to (8.8460671 9.1068882 8.4665605)
  2 by 1 by 2 MPI processor grid
  reading atoms ...
  96 atoms
  read_data CPU = 0.002 seconds
Replication is creating a 3x3x3 = 27 times larger system...
  orthogonal box = (-5.1135447 -9.4665824e-07 -5.3294159) to (36.765291 27.320666 36.058513)
  2 by 1 by 2 MPI processor grid
  2592 atoms
  replicate CPU = 0.001 seconds
Summary of lammps deepmd module ...
  >>> Info of deepmd-kit:
  installed to:       /opt/deepmd-kit
  source:             
  source branch:      HEAD
  source commit:      8b3dc08
  source commit at:   2025-06-11 13:00:46 +0200
  support model ver.: 1.1 
  build variant:      cuda
  build with tf inc:  /opt/deepmd-kit/lib/python3.12/site-packages/tensorflow/include;/opt/deepmd-kit/include
  build with tf lib:  /opt/deepmd-kit/lib/python3.12/site-packages/tensorflow/libtensorflow_cc.so.2
  build with pt lib:  torch;torch_library;/opt/deepmd-kit/lib/python3.12/site-packages/torch/lib/libc10.so;CUDA::nvrtc;torch::nvtoolsext;/opt/deepmd-kit/lib/python3.12/site-packages/torch/lib/libc10_cuda.so
  set tf intra_op_parallelism_threads: 2
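
The `[rank:N] GPU=M` lines at the top of the Slurm output can be checked mechanically for ranks that share a device. This is a minimal sketch assuming that line format. The `gpu_by_rank` and `shared_gpus` helpers are hypothetical, and whether sharing a GPU is a problem depends on how the job is meant to run.

```python
import re
from collections import defaultdict

# Matches lines such as "[rank:0] GPU=0"; case-insensitive because the
# launch script shown above echoes "gpu=" while the output shows "GPU=".
RANK_RE = re.compile(r"^\[rank:(\d+)\]\s+gpu=(\S+)", re.I | re.M)


def gpu_by_rank(text):
    """Map each MPI rank to the GPU id it reported."""
    return {int(rank): gpu for rank, gpu in RANK_RE.findall(text)}


def shared_gpus(mapping):
    """Return {gpu: [ranks]} for every GPU reported by more than one rank."""
    ranks_by_gpu = defaultdict(list)
    for rank, gpu in sorted(mapping.items()):
        ranks_by_gpu[gpu].append(rank)
    return {gpu: ranks for gpu, ranks in ranks_by_gpu.items() if len(ranks) > 1}


SAMPLE = """[rank:0] GPU=0
[rank:1] GPU=0
[rank:3] GPU=0
[rank:2] GPU=0
"""

if __name__ == "__main__":
    mapping = gpu_by_rank(SAMPLE)
    print(mapping)
    print(shared_gpus(mapping))
```

On the four lines quoted above, every rank reports GPU 0, so the check flags that device. The excerpt alone does not say whether that was intended.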

Simulation Proof Packs (template)

SIM-CFD-001: DoMINO (CFD surrogate), measured 4-GPU parallel throughput

  • Target: DoMINO-Automotive-Aero NIM (single instance / all-gpus), endpoint /v1/infer/surface
  • Metric: throughput_req_per_min_by_wall (wall-clock)

  • A100x4: conc=1: 7.50 req/min → conc=4: 31.79 req/min (4.24×)
  • H100 NVL 4GPU: conc=1: 9.16 req/min → conc=4: 36.92 req/min (4.03×); conc=8: 55.49 req/min (+50% from conc=4 to conc=8)
  • H100 advantage: +22% at conc=1, +16% at conc=4
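
The ratios quoted above follow directly from the measured req/min values. The sketch below reproduces that arithmetic. The numbers are copied from the figures above, while the dictionary layout and the `speedup` / `percent_gain` helpers are illustrative assumptions.

```python
def speedup(base, scaled):
    """Throughput ratio, e.g. conc=4 relative to conc=1."""
    return scaled / base


def percent_gain(base, other):
    """Relative gain of `other` over `base`, in percent."""
    return (other / base - 1.0) * 100.0


# req/min by concurrency, copied from the measurements listed above.
A100 = {1: 7.50, 4: 31.79}
H100 = {1: 9.16, 4: 36.92, 8: 55.49}

if __name__ == "__main__":
    print(f"A100 conc 1→4: {speedup(A100[1], A100[4]):.2f}x")
    print(f"H100 conc 1→4: {speedup(H100[1], H100[4]):.2f}x")
    print(f"H100 conc 4→8: {percent_gain(H100[4], H100[8]):+.0f}%")
    for conc in (1, 4):
        gain = percent_gain(A100[conc], H100[conc])
        print(f"H100 vs A100 at conc={conc}: {gain:+.0f}%")
```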

→ Full article: Trying external aerodynamics (CFD) in seconds with DoMINO (beginner-friendly, with figures)

Benchmarks (see the numbers)


Research Notes (technical background and reproduction)


Posts (articles and case examples)


Free Tools / Lead Magnets (freely available)


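Typical cases where consulting is needed (paid support is faster)

  • Production work that cannot be allowed to fail, or where formal acceptance is required
  • Stuck on Secure Boot / DKMS / kernel differences
  • Want to settle Slurm as well (multi-node setup, operations design)
  • Need performance under an FP64 requirement (including bottleneck analysis)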