Disclaimer: This post is quite long because I tried to provide all relevant configuration information.

Status and problem:

I administer a GPU cluster and want to use Slurm for job management. Unfortunately, I cannot request GPUs using Slurm's generic resources (gres) plugin.

Note: test.sh is a small script that prints the environment variable CUDA_VISIBLE_DEVICES.
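The script itself is not shown in the post. A minimal sketch of what such a test.sh might look like, assuming it simply echoes the variable in NAME=value form (the function name `print_cuda_visible_devices` is hypothetical and only there to keep the sketch easy to check):

```shell
#!/bin/bash
# Hypothetical reconstruction of test.sh; the original script is not shown.

# Print CUDA_VISIBLE_DEVICES as NAME=value; the value is empty if the variable is unset.
print_cuda_visible_devices() {
    echo "CUDA_VISIBLE_DEVICES=${CUDA_VISIBLE_DEVICES:-}"
}

print_cuda_visible_devices
```

Run under srun, the value printed is whatever Slurm exported for the allocated job step.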

Running a job with `--gres=gpu:1` does not complete

Running `srun -n1 --gres=gpu:1 test.sh` results in the following error:

srun: error: Unable to allocate resources: Requested node configuration is not available

Log:

gres: gpu state for job 83
    gres_cnt:4 node_cnt:0 type:(null)
    _pick_best_nodes: job 83 never runnable
    _slurm_rpc_allocate_resources: Requested node configuration is not available

Running a job with `--gres=gram:500` does complete

If I call `srun -n1 --gres=gram:500 test.sh`, the job runs and prints

CUDA_VISIBLE_DEVICES=NoDevFiles

Log:

sched: _slurm_rpc_allocate_resources JobId=76 NodeList=smurf01 usec=193
debug:  Configuration for job 76 complete
debug:  laying out the 1 tasks on 1 hosts smurf01 dist 1
job_complete: JobID=76 State=0x1 NodeCnt=1 WIFEXITED 1 WEXITSTATUS 0
job_complete: JobID=76 State=0x8003 NodeCnt=1 done

So Slurm seems to be correctly configured to run jobs with srun using generic resources requested via --gres, but for some reason it does not recognize the GPUs.

My first idea was to use a different name for the GPU generic resource, since the other generic resources seem to work, but I would like to stick with the gpu plugin.

Configuration

The cluster has more than two slave hosts, but for the sake of clarity I will stick to two differently configured slave hosts and the controller host: papa (controller), smurf01 and smurf02.

slurm.conf

The generic-resource-relevant parts of the Slurm configuration:

...
TaskPlugin=task/cgroup
...
GresTypes=gpu,ram,gram,scratch
...
NodeName=smurf01 NodeAddr=192.168.1.101 Feature="intel,fermi" Boards=1 SocketsPerBoard=2 CoresPerSocket=6 ThreadsPerCore=2 Gres=gpu:tesla:8,ram:48,gram:no_consume:6000,scratch:1300
NodeName=smurf02 NodeAddr=192.168.1.102 Feature="intel,fermi" Boards=1 SocketsPerBoard=2 CoresPerSocket=6 ThreadsPerCore=1 Gres=gpu:tesla:8,ram:48,gram:no_consume:6000,scratch:1300
...

Note: ram is in GB, gram is in MB, and scratch is in GB again.

Output of `scontrol show node`

NodeName=smurf01 Arch=x86_64 CoresPerSocket=6
   CPUAlloc=0 CPUErr=0 CPUTot=24 CPULoad=0.01 Features=intel,fermi
   Gres=gpu:tesla:8,ram:48,gram:no_consume:6000,scratch:1300
   NodeAddr=192.168.1.101 NodeHostName=smurf01 Version=14.11
   OS=Linux RealMemory=1 AllocMem=0 Sockets=2 Boards=1
   State=IDLE ThreadsPerCore=2 TmpDisk=0 Weight=1
   BootTime=2015-04-23T13:58:15 SlurmdStartTime=2015-04-24T10:30:46
   CurrentWatts=0 LowestJoules=0 ConsumedJoules=0
   ExtSensorsJoules=n/s ExtSensorsWatts=0 ExtSensorsTemp=n/s

NodeName=smurf02 Arch=x86_64 CoresPerSocket=6
   CPUAlloc=0 CPUErr=0 CPUTot=12 CPULoad=0.01 Features=intel,fermi
   Gres=gpu:tesla:8,ram:48,gram:no_consume:6000,scratch:1300
   NodeAddr=192.168.1.102 NodeHostName=smurf02 Version=14.11
   OS=Linux RealMemory=1 AllocMem=0 Sockets=2 Boards=1
   State=IDLE ThreadsPerCore=1 TmpDisk=0 Weight=1
   BootTime=2015-04-23T13:57:56 SlurmdStartTime=2015-04-24T10:24:12
   CurrentWatts=0 LowestJoules=0 ConsumedJoules=0
   ExtSensorsJoules=n/s ExtSensorsWatts=0 ExtSensorsTemp=n/s

smurf01 configuration

GPUs

 > ls /dev | grep nvidia
nvidia0
... 
nvidia7
 > nvidia-smi | grep Tesla
|   0  Tesla M2090         On   | 0000:08:00.0     Off |                    0 |
... 
|   7  Tesla M2090         On   | 0000:1B:00.0     Off |                    0 |
...

gres.conf

Name=gpu Type=tesla File=/dev/nvidia0 CPUs=0
Name=gpu Type=tesla File=/dev/nvidia1 CPUs=1
Name=gpu Type=tesla File=/dev/nvidia2 CPUs=2
Name=gpu Type=tesla File=/dev/nvidia3 CPUs=3
Name=gpu Type=tesla File=/dev/nvidia4 CPUs=4
Name=gpu Type=tesla File=/dev/nvidia5 CPUs=5
Name=gpu Type=tesla File=/dev/nvidia6 CPUs=6
Name=gpu Type=tesla File=/dev/nvidia7 CPUs=7
Name=ram Count=48
Name=gram Count=6000
Name=scratch Count=1300

smurf02 configuration

GPUs

Same configuration and output as on smurf01.

gres.conf on smurf02

Name=gpu Count=8 Type=tesla File=/dev/nvidia[0-7]
Name=ram Count=48
Name=gram Count=6000
Name=scratch Count=1300

Note: The daemons have been restarted, and the machines have been rebooted as well. The slurm user and the job-submitting user have the same IDs/groups on the slave and controller nodes, and munge authentication is working properly.

Log output

I added DebugFlags=Gres to the slurm.conf file, and the GPUs seem to be recognized by the plugin:

Controller log

gres/gpu: state for smurf01
   gres_cnt found:8 configured:8 avail:8 alloc:0
   gres_bit_alloc:
   gres_used:(null)
   topo_cpus_bitmap[0]:0
   topo_gres_bitmap[0]:0
   topo_gres_cnt_alloc[0]:0
   topo_gres_cnt_avail[0]:1
   type[0]:tesla
   topo_cpus_bitmap[1]:1
   topo_gres_bitmap[1]:1
   topo_gres_cnt_alloc[1]:0
   topo_gres_cnt_avail[1]:1
   type[1]:tesla
   topo_cpus_bitmap[2]:2
   topo_gres_bitmap[2]:2
   topo_gres_cnt_alloc[2]:0
   topo_gres_cnt_avail[2]:1
   type[2]:tesla
   topo_cpus_bitmap[3]:3
   topo_gres_bitmap[3]:3
   topo_gres_cnt_alloc[3]:0
   topo_gres_cnt_avail[3]:1
   type[3]:tesla
   topo_cpus_bitmap[4]:4
   topo_gres_bitmap[4]:4
   topo_gres_cnt_alloc[4]:0
   topo_gres_cnt_avail[4]:1
   type[4]:tesla
   topo_cpus_bitmap[5]:5
   topo_gres_bitmap[5]:5
   topo_gres_cnt_alloc[5]:0
   topo_gres_cnt_avail[5]:1
   type[5]:tesla
   topo_cpus_bitmap[6]:6
   topo_gres_bitmap[6]:6
   topo_gres_cnt_alloc[6]:0
   topo_gres_cnt_avail[6]:1
   type[6]:tesla
   topo_cpus_bitmap[7]:7
   topo_gres_bitmap[7]:7
   topo_gres_cnt_alloc[7]:0
   topo_gres_cnt_avail[7]:1
   type[7]:tesla
   type_cnt_alloc[0]:0
   type_cnt_avail[0]:8
   type[0]:tesla
...
gres/gpu: state for smurf02
   gres_cnt found:TBD configured:8 avail:8 alloc:0
   gres_bit_alloc:
   gres_used:(null)
   type_cnt_alloc[0]:0
   type_cnt_avail[0]:8
   type[0]:tesla

Slave log

Gres Name=gpu Type=tesla Count=8 ID=7696487 File=/dev/nvidia[0-7]
...
gpu 0 is device number 0
gpu 1 is device number 1
gpu 2 is device number 2
gpu 3 is device number 3
gpu 4 is device number 4
gpu 5 is device number 5
gpu 6 is device number 6
gpu 7 is device number 7

cluster hpc job-scheduler

— Pixchem

What happens when you request --gres=gpu:tesla:1?

— NNWizard

@NMWizard Exactly the same as without a specified type.

— Pixchem

The installed version (Slurm 14.11.5) seems to have problems with types assigned to GPUs. Removing Type=... from gres.conf and changing the node configuration lines accordingly (to Gres=gpu:N,ram:...) results in successful execution of jobs that require GPUs via --gres=gpu:N.

— Pixchem
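Applied to smurf02, the workaround described in this answer might look roughly like the following. This is only a sketch assembled from the configuration shown in the question, not the poster's actual files: the `Type=tesla` field is dropped from gres.conf, the `Gres=` entry in slurm.conf becomes `gpu:8`, and `...` stands for the node parameters that stay unchanged.

```
# gres.conf on smurf02 (sketch, Type= removed)
Name=gpu Count=8 File=/dev/nvidia[0-7]
Name=ram Count=48
Name=gram Count=6000
Name=scratch Count=1300

# slurm.conf node line (sketch, only the Gres= entry changes)
NodeName=smurf02 ... Gres=gpu:8,ram:48,gram:no_consume:6000,scratch:1300
```

After a change like this, jobs would request GPUs without a type, as in `srun -n1 --gres=gpu:1 test.sh`.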

Why does requesting GPUs as a generic resource fail on a cluster running SLURM with the built-in plugin?
