Open
786 commits
4c0774d
Fix the job template dropdown issue (#424)
LeoHongyi Jul 19, 2019
d421b66
no longer using "hostport", remove useness code
hao1939 Jul 19, 2019
54f7492
job state transition graph
hao1939 Jul 20, 2019
eb2981f
resubmit "Unknown" job
hao1939 Jul 20, 2019
8ccdbeb
refactoring
hao1939 Jul 20, 2019
02a6d43
no need anymore, all job pod "restartPolicy: Never"
hao1939 Jul 20, 2019
0e8cf0a
fix typo
hao1939 Jul 20, 2019
e9d1389
two potential problems
hao1939 Jul 20, 2019
93b75ba
profile data handler in jobmanager
xudifsd Jul 22, 2019
0acdf54
Merge pull request #6 from xudifsd/dixu/profile-handler
hao1939 Jul 22, 2019
29357c3
fix some bugs
xudifsd Jul 22, 2019
4042459
config exporter port
xudifsd Jul 22, 2019
91be5b1
Merge pull request #7 from xudifsd/dixu/profile-handler
hao1939 Jul 22, 2019
3c558a7
shorter port name
xudifsd Jul 22, 2019
93a8b8a
Merge pull request #8 from xudifsd/dixu/profile-handler
hao1939 Jul 22, 2019
fb3a9d7
fix some bug
xudifsd Jul 22, 2019
3eb2250
Merge pull request #9 from xudifsd/dixu/profile-handler
hao1939 Jul 22, 2019
b6540b0
expose in restfulapi
xudifsd Jul 22, 2019
928fb70
Merge pull request #10 from xudifsd/dixu/profile-handler
hao1939 Jul 22, 2019
7fa12c4
fix bug
xudifsd Jul 22, 2019
5218c61
Merge pull request #11 from xudifsd/dixu/profile-handler
hao1939 Jul 22, 2019
d4e2239
Change title of every table and adjust the orders (#427)
LeoHongyi Jul 23, 2019
ee72f95
fix typo
hao1939 Jul 23, 2019
7dc0556
When node back after "lost", the pod may turn into "NotFound".
hao1939 Jul 23, 2019
2d8fa8e
add useful link
hao1939 Jul 23, 2019
aeb2744
Merge pull request #422 from hao1939/dltsdev
Anbang-Hu Jul 24, 2019
42d8032
add performance dashboard
xudifsd Jul 23, 2019
3678e77
add manager histogram
xudifsd Jul 23, 2019
9e877ba
perf job deployer
xudifsd Jul 23, 2019
4c35dfa
Merge pull request #12 from xudifsd/dixu/perf-dashboard
hao1939 Jul 24, 2019
f79ad2f
add missing import
xudifsd Jul 24, 2019
47c97e5
Merge pull request #13 from xudifsd/dixu/perf-dashboard
hao1939 Jul 24, 2019
ad40629
fix conflict error
xudifsd Jul 24, 2019
000913f
Merge pull request #14 from xudifsd/dixu/perf-dashboard
hao1939 Jul 24, 2019
b21eca7
Merge pull request #429 from hao1939/jobmanager
Anbang-Hu Jul 24, 2019
99f57b4
add reaper
xudifsd Jul 24, 2019
8ba0ca0
Merge pull request #15 from xudifsd/dixu/reaper
hao1939 Jul 24, 2019
d0b7cd8
split jobmanager's metrics with restfulapi's
xudifsd Jul 24, 2019
11f3be0
Merge pull request #16 from xudifsd/dixu/refine-perf-dashboard
hao1939 Jul 24, 2019
4c45f52
share k8s client: to avoid connection leak
hao1939 Jul 24, 2019
fed8c67
add option for "force" delete pod
hao1939 Jul 25, 2019
ebfd771
forcing cleanup job before submit
hao1939 Jul 25, 2019
0d5f87e
Enable distribute job in low priority job template (#432)
LeoHongyi Jul 25, 2019
5b1df00
Clean up advance option & save job template issue
Jul 25, 2019
fbd2454
default using localhost as prometheus ip in grafana
xudifsd Jul 26, 2019
6618ecb
Merge pull request #434 from xudifsd/dixu/default-prometheus-ip
Anbang-Hu Jul 26, 2019
6e0b34b
force delete pod when killing
hao1939 Jul 26, 2019
8fd7693
no need to retry on submit
hao1939 Jul 26, 2019
3a3a14c
refine execption handle
hao1939 Jul 26, 2019
9263d66
reset endpoint when resubmit job
hao1939 Jul 26, 2019
5ecfa44
waiting 30s before resubmit (was 300s)
hao1939 Jul 26, 2019
de777d8
won't change 'pending' endpionts to 'stoped' or other status
hao1939 Jul 26, 2019
30ce59c
narrow "dead endpoint", execlude the endpoints for job in status pend…
hao1939 Jul 26, 2019
e08880e
fix func call parameter
hao1939 Jul 26, 2019
110eda7
correctly use the return value
hao1939 Jul 26, 2019
7c5bf14
Merge pull request #415 from deepak-ms/d8
Anbang-Hu Jul 26, 2019
538ba0a
fix restapi bugs
hao1939 Jul 29, 2019
2da3bb3
Merge pull request #433 from LeoHongyi/dltsdev
Anbang-Hu Jul 30, 2019
95cebe0
Hidden Priviledge docker & keep the logic when job type change
Jul 30, 2019
92b4192
Merge pull request #430 from hao1939/jobmanager
Anbang-Hu Jul 30, 2019
570cc9a
Merge pull request #437 from LeoHongyi/dltsdev
Anbang-Hu Jul 30, 2019
8349cfb
Rename change team request query to "current-team" from "team"
Gerhut Jul 3, 2019
c1ba1d2
Set authorized clusters in session when using password auth
Gerhut Jul 3, 2019
0328e3d
Add password validation and team checking
Gerhut Jul 4, 2019
375199e
add missing cluster prefix in alert manager
xudifsd Jul 10, 2019
ae4671c
add cluster gpu statistic dashboard
xudifsd Jul 11, 2019
d51a6bb
fix issue of not fetching through restfulapi without login
Jul 12, 2019
a76551e
consider used gpus while calculating reserved on unschedulable nodes
deepak-ms Jul 14, 2019
584cf85
Disable distrubed job when using low priority cluster job in job typ…
LeoHongyi Jul 16, 2019
796f942
Fix issue of changing job template based on low priority
Jul 17, 2019
878d7af
Fix issue of changing job template based on low priority
Jul 17, 2019
8abbc21
Fix issue of changing job template based on low priority
Jul 17, 2019
334fb1e
Add master key support (#421)
Gerhut Jul 17, 2019
704a672
tolerate master node in job/node-exporter (#420)
xudifsd Jul 18, 2019
199c8ca
Redirect to Wiki page for unauthorized login user (#423)
LeoHongyi Jul 18, 2019
4fbd4e5
Fix the job template dropdown issue (#424)
LeoHongyi Jul 19, 2019
bf5feb9
Change title of every table and adjust the orders (#427)
LeoHongyi Jul 23, 2019
214f278
Enable distribute job in low priority job template (#432)
LeoHongyi Jul 25, 2019
981482f
Clean up advance option & save job template issue
Jul 25, 2019
622d0cc
default using localhost as prometheus ip in grafana
xudifsd Jul 26, 2019
c375e6b
Hidden Priviledge docker & keep the logic when job type change
Jul 30, 2019
d0c86bd
Merge pull request #438 from hao1939/dltsdev
Anbang-Hu Jul 30, 2019
24ec7da
fix generate ssh config
hao1939 Jul 30, 2019
4c697c4
exec "sleep infinity" on workers
hao1939 Jul 30, 2019
792eb74
fix setup ssh script
hao1939 Jul 30, 2019
67391fb
Merge pull request #439 from hao1939/dltsdev
Anbang-Hu Jul 31, 2019
84b98bf
Enable the preemptible job and disable when low priority job
Jul 31, 2019
d2a9918
fix perf dashboard
xudifsd Jul 31, 2019
beaf0d5
Merge pull request #441 from xudifsd/dixu/fix
Anbang-Hu Jul 31, 2019
0f15860
Merge pull request #440 from LeoHongyi/dltsdev
Anbang-Hu Jul 31, 2019
b2268e0
fix dist job pod_name
hao1939 Jul 31, 2019
26f658c
fix endpoint
hao1939 Jul 31, 2019
d631160
robust
hao1939 Jul 31, 2019
e54c602
fix dist job path
hao1939 Jul 31, 2019
9b59bdd
Merge pull request #442 from hao1939/dltsdev
Anbang-Hu Jul 31, 2019
9cd49fa
use inter-pod affinity to achieve less fragmentation
xudifsd Aug 1, 2019
f5b5e94
fix bootstrap script
hao1939 Aug 1, 2019
a92f92e
add missing label
xudifsd Aug 2, 2019
c1dd2c9
Merge pull request #444 from hao1939/dltsdev
Anbang-Hu Aug 2, 2019
929e5d2
Merge pull request #443 from xudifsd/dixu/less-fragmentation
Anbang-Hu Aug 2, 2019
e33978a
reset endpoint before resubmit
hao1939 Aug 2, 2019
8f5f0a3
persist prometheus data into host path
xudifsd Aug 2, 2019
8201c1b
Expose environment variable DLWS_NUM_GPU_PER_WORKER for regular job
Anbang-Hu Aug 2, 2019
ceedfdd
Merge pull request #447 from Anbang-Hu/dltsdev
Anbang-Hu Aug 2, 2019
b25d9f9
setup ssh and hostfile for regular job
hao1939 Aug 4, 2019
b80a90d
Merge pull request #445 from hao1939/dltsdev
Anbang-Hu Aug 5, 2019
6992f64
Merge pull request #446 from xudifsd/dixu/persist-prometheus
Anbang-Hu Aug 5, 2019
7d5bb04
send out email while killing
xudifsd Aug 5, 2019
fb31b0b
same role anti affinity
xudifsd Aug 5, 2019
5e4efb4
Merge pull request #448 from xudifsd/dixu/kill-email
Anbang-Hu Aug 5, 2019
5cc0e07
Revert "persist prometheus data into host path"
Anbang-Hu Aug 5, 2019
d77ed4b
Merge pull request #450 from microsoft/revert-446-dixu/persist-promet…
Anbang-Hu Aug 5, 2019
18637fa
persist prometheus data into host path
xudifsd Aug 2, 2019
1ff0b27
fix permission of /prometheus-data
xudifsd Aug 6, 2019
e38da86
remove anti affinity
xudifsd Aug 6, 2019
b52b73d
Merge pull request #449 from xudifsd/dixu/anti-affinity
Anbang-Hu Aug 6, 2019
7c072d9
Merge pull request #451 from xudifsd/dixu/persist-prometheus
Anbang-Hu Aug 6, 2019
bd5f61f
change to required affinity
xudifsd Aug 6, 2019
d07c909
change order of cmd in restful api to speed up build & deployment
xudifsd Aug 6, 2019
2b80f3f
Merge pull request #452 from xudifsd/dixu/required-affinity
Anbang-Hu Aug 6, 2019
342d5d3
Merge pull request #453 from xudifsd/dixu/reorder-install
Anbang-Hu Aug 6, 2019
c395a40
fix create database
xudifsd Aug 6, 2019
4cb996a
Merge pull request #454 from xudifsd/dixu/fix-create-db
Anbang-Hu Aug 6, 2019
b0f9173
fix template error
xudifsd Aug 7, 2019
0b272db
Merge pull request #455 from xudifsd/dixu/fix
Anbang-Hu Aug 7, 2019
3c29c98
install missing pkg "openssl"
hao1939 Aug 7, 2019
61ec5f2
Merge pull request #456 from hao1939/dltsdev
Anbang-Hu Aug 7, 2019
3f3040e
fix apt-get install "-y"
hao1939 Aug 7, 2019
a4457d4
Merge pull request #457 from hao1939/dltsdev
Anbang-Hu Aug 7, 2019
7e28c47
fix apt-get hang
hao1939 Aug 7, 2019
4cd9b34
Merge pull request #458 from hao1939/dltsdev
Anbang-Hu Aug 7, 2019
5eca31e
Enable submit distrbuted job under low priority cluster
Aug 7, 2019
1375359
Merge pull request #459 from LeoHongyi/dltsdev
Anbang-Hu Aug 7, 2019
eb8988d
change DaemonSet to apps/v1
xudifsd Aug 2, 2019
1818d6f
change Deployment to apps/v1
xudifsd Aug 2, 2019
0d9d99a
remove --show-all
xudifsd Aug 5, 2019
e2f42ba
add PodShareProcessNamespace=true
xudifsd Aug 5, 2019
868e3dc
--allow-privileged in kubelet is default to be true
xudifsd Aug 5, 2019
a9dd37b
replace --admission-control with --enable-admission-plugins
xudifsd Aug 5, 2019
86bd9de
change argument of hypekube
xudifsd Aug 7, 2019
ae2fcb6
remove removed argument to kubelet
xudifsd Aug 7, 2019
00a96b0
finish upgrade scripts
xudifsd Aug 8, 2019
e8ac8a0
robust: fix query ssh port failed on some case
hao1939 Aug 8, 2019
42f303b
add description
xudifsd Aug 8, 2019
22be76d
Allow to set user quota in vc level.
hao1939 Aug 9, 2019
4e536c4
check if *_manager is hanging and restart accordingly
xudifsd Aug 9, 2019
d46dbb9
add selector to DaemonSet
xudifsd Aug 14, 2019
d127ed8
change proxy to kube-proxy
xudifsd Aug 14, 2019
a943515
use 1.11 nvidia device plugin
xudifsd Aug 14, 2019
a295f4f
upgrade nvidia driver
xudifsd Aug 14, 2019
abf20fb
upgrade cni
xudifsd Aug 16, 2019
a7b5b25
add comment
xudifsd Aug 16, 2019
514077f
add per vc gpu usage dashboard
xudifsd Aug 19, 2019
a2e6646
Add template API in rest server
Gerhut Aug 19, 2019
2b44a33
change cni url
xudifsd Aug 20, 2019
a9e8876
use noninteractive mode
xudifsd Aug 20, 2019
e6995e2
add script to disable kernel auto updates
xudifsd Aug 20, 2019
580e2d6
Merge pull request #466 from xudifsd/dixu/disable-kernel-auto-update
Anbang-Hu Aug 20, 2019
f593259
Merge pull request #462 from xudifsd/dixu/restart
Anbang-Hu Aug 21, 2019
0a6a434
Add vc level templates
Gerhut Aug 21, 2019
936a839
Merge pull request #461 from hao1939/dltsdev
Anbang-Hu Aug 21, 2019
12b19a6
Merge pull request #465 from Gerhut/restfulapi/templates
Anbang-Hu Aug 21, 2019
2f8b203
Merge pull request #464 from xudifsd/dixu/per-vc-gpu-usage
Anbang-Hu Aug 21, 2019
51023b5
Upgrade kubernetes client, and remove 'dry_run'. (#1)
hao1939 Aug 22, 2019
54a6e46
add gpu reporter
xudifsd Aug 21, 2019
1d117cc
add gpu fragmentation dashboard
xudifsd Aug 22, 2019
1d85031
Merge pull request #460 from xudifsd/dixu/upgrade
Anbang-Hu Aug 22, 2019
aae4077
Merge pull request #468 from xudifsd/dixu/gpu-fragmentation-dashboard
Anbang-Hu Aug 22, 2019
21f534a
cleanup
hao1939 Aug 13, 2019
e10214d
cleanup
hao1939 Aug 14, 2019
a78892d
refactoring
hao1939 Aug 14, 2019
c8d2031
code cleanup
hao1939 Aug 14, 2019
6cf1f85
cleanup
hao1939 Aug 14, 2019
4cedbb5
Support manually adjust job priority in queuing.
hao1939 Aug 15, 2019
384163b
add api /jobs/priorites
hao1939 Aug 19, 2019
6e60f94
ib topology awareness
xudifsd Aug 27, 2019
514cdfe
always use full qulified domain name
xudifsd Aug 27, 2019
5610a74
Merge pull request #471 from xudifsd/dixu/fqdn
Anbang-Hu Aug 27, 2019
43e499c
Merge pull request #463 from hao1939/priority
Anbang-Hu Aug 28, 2019
407d1ca
Merge pull request #467 from xudifsd/dixu/gpu-report
Anbang-Hu Aug 28, 2019
b00c969
Merge pull request #470 from xudifsd/dixu/ib-topology
Anbang-Hu Aug 28, 2019
cf1b709
adapt k8s 1.15 changes on date
xudifsd Aug 28, 2019
d501f12
generate dlws-scripts in pre-render
xudifsd Aug 28, 2019
3f15a3f
Merge pull request #472 from xudifsd/dixu/fix-tzinfo
Anbang-Hu Aug 28, 2019
5fc6fa4
Merge pull request #473 from xudifsd/dixu/dlws-scripts
Anbang-Hu Aug 28, 2019
db96ee3
add gpu retired page alert
xudifsd Aug 29, 2019
0c8c9ca
Merge pull request #474 from xudifsd/dixu/retired-page-alert
Anbang-Hu Aug 29, 2019
76bad75
fix typo
hao1939 Aug 29, 2019
5429176
change default job priority to 100
hao1939 Aug 29, 2019
0843331
Merge pull request #475 from hao1939/dltsdev
Anbang-Hu Aug 29, 2019
33857ac
devbox, docs, WebUI
Aug 29, 2019
c2d66b3
auto_deployment_after_DLTS
Aug 29, 2019
9deccf5
make gpu-reporter support CORS
xudifsd Aug 30, 2019
fee0a3c
Merge pull request #476 from xudifsd/dixu/cors
Anbang-Hu Aug 30, 2019
de09fb2
support config kill through config.yaml
xudifsd Aug 30, 2019
17333b2
support config notifier through config.yaml
xudifsd Aug 30, 2019
d8092ff
Merge branch 'dltsdev' of https://github.com/microsoft/DLWorkspace in…
Aug 30, 2019
af92e1a
bug fixing
Aug 30, 2019
6225332
try auto
Aug 30, 2019
e81182d
try auto
Aug 30, 2019
fbf5983
Merge branch 'dltsdev' of https://github.com/YinYangOfDao/DLWorkspace…
YinYangOfDao Aug 30, 2019
d0f45f7
fix template config for restful API
Aug 31, 2019
1f3579b
Merge pull request #477 from xudifsd/dixu/config-kill
hongzhili Sep 3, 2019
7bed99c
Merge pull request #478 from xudifsd/dixu/config-notify
hongzhili Sep 3, 2019
3e870fd
Merge branch 'dltsdev' of https://github.com/microsoft/DLWorkspace in…
Sep 3, 2019
f0c7201
add storage usage dashboard
xudifsd Sep 4, 2019
cf8860d
Merge pull request #481 from xudifsd/dixu/storage-usage
Anbang-Hu Sep 5, 2019
a69cb52
Disable cache
Anbang-Hu Sep 5, 2019
086c170
Merge pull request #482 from Anbang-Hu/dltsdev
hongzhili Sep 5, 2019
fde798b
user level quota
hao1939 Sep 5, 2019
b8195e5
Merge pull request #483 from hao1939/fix_user_quote
Anbang-Hu Sep 5, 2019
65daa39
Merge branch 'dltsdev' of https://github.com/microsoft/DLWorkspace in…
Sep 5, 2019
55aebfb
after resume, job will be "unapproved" state
hao1939 Sep 6, 2019
2a33ac2
Merge pull request #485 from hao1939/fix_user_quote
Anbang-Hu Sep 6, 2019
87d6bfb
format correction and doc update, mapping file for Azure, jinja rende…
YinYangOfDao Sep 6, 2019
c7df23a
Merge branch 'dltsdev' of https://github.com/microsoft/DLWorkspace in…
YinYangOfDao Sep 6, 2019
20f711e
apply nsg machine creation Azure
YinYangOfDao Sep 7, 2019
7663b6e
Allo 0 GPU job to go through
Anbang-Hu Sep 10, 2019
13da5d5
Merge pull request #486 from Anbang-Hu/dltsdev
Anbang-Hu Sep 10, 2019
639b15f
log more info about pod
hao1939 Sep 10, 2019
5e2e933
Merge pull request #487 from hao1939/add_logs
Anbang-Hu Sep 10, 2019
f0e8152
ignore preemptible GPUs
hao1939 Sep 10, 2019
c3fb783
auto approve preemptible job, preemptible GPUs are not take in count …
hao1939 Sep 10, 2019
3e8efaf
default mount home-folder for all jobs
hao1939 Sep 10, 2019
c53697d
Merge branch 'dltsdev' of https://github.com/microsoft/DLWorkspace in…
YinYangOfDao Sep 10, 2019
9bb2cc6
Merge branch 'dltsdev' of github.com:YinYangOfDao/DLWorkspace into dl…
YinYangOfDao Sep 10, 2019
ee5d377
Merge pull request #488 from hao1939/ignore_preemptible_gpus
Anbang-Hu Sep 10, 2019
c84444a
Update Get Pending Jobs
Anbang-Hu Sep 10, 2019
0889e6e
Only consider running jobs as active jobs
Anbang-Hu Sep 11, 2019
a8389c3
change gpu statistic dashboard
xudifsd Sep 11, 2019
a0903f5
Merge pull request #490 from xudifsd/dixu/gpu-utils
Anbang-Hu Sep 11, 2019
a755be8
Expose priority of more statuses of job .
Gerhut Sep 11, 2019
3282928
Merge pull request #489 from Anbang-Hu/dltsdev
hongzhili Sep 11, 2019
0f35a32
Merge pull request #491 from Gerhut/dltsdev
hongzhili Sep 11, 2019
3b5b58b
Merge branch 'dltsdev' of https://github.com/microsoft/DLWorkspace in…
YinYangOfDao Sep 11, 2019
5262a6a
NFS auto deployment attempt
YinYangOfDao Sep 11, 2019
2f0ff4f
Merge branch 'dltsdev' of github.com:YinYangOfDao/DLWorkspace into dl…
YinYangOfDao Sep 11, 2019
b93fbd9
bug fixing and corner cases, adding runscript on role
YinYangOfDao Sep 11, 2019
8f099c2
bug fix
YinYangOfDao Sep 11, 2019
a6a2645
default nfs rule, bug fix
YinYangOfDao Sep 13, 2019
0c74e65
NFS automation test, merged sku_mapping config info
Sep 14, 2019
3c8f17e
basic NFS automation version ready
Sep 15, 2019
70a3c61
fix sshd server port config for some case
hao1939 Sep 16, 2019
68d17a4
Merge pull request #492 from hao1939/fix_ssh_port
hongzhili Sep 16, 2019
413a056
Merge branch 'dltsdev' of https://github.com/microsoft/DLWorkspace in…
YinYangOfDao Sep 16, 2019
8eb6d43
autoshare naming bug fix, nfs nsg alto allow devbox
Sep 17, 2019
0080c58
Add dashboard
Gerhut Sep 17, 2019
b2fc21e
Merge pull request #479 from YinYangOfDao/dltsdev
hongzhili Sep 17, 2019
841d877
Merge pull request #493 from Gerhut/dashboard
hongzhili Sep 17, 2019
cc48c43
add azure-pipeline
hao1939 Sep 18, 2019
6 changes: 6 additions & 0 deletions .gitignore
@@ -38,6 +38,7 @@ src/WebUI/dotnet/WebPortal/userconfig.json
src/WebUI/dotnet/WebPortal/configAuth.json
src/WebUI/dotnet/WebPortal/Master-Templates.json
src/WebUI/dotnet/WebPortal/hosting.json
**package-lock.json
**/wwwroot/*
**/bin/Release/*
**/bin/Debug/*
@@ -61,3 +62,8 @@ src/WebUI/dotnet/WebPortal/hosting.json
/.vs/DLWorkspace/v15/.suo

cluster-autoscaler
src/ClusterBootstrap/services/monitor/grafana-config.yaml
src/ClusterBootstrap/services/monitor/prometheus-alerting.yaml
src/ClusterBootstrap/services/monitor/alert-templates.yaml
src/ClusterBootstrap/services/jobmanager/dlws-scripts.yaml
src/ClusterBootstrap/services/monitor/alerting/kill-idle.rules
40 changes: 40 additions & 0 deletions azure-pipelines.yml
@@ -0,0 +1,40 @@
# Starter pipeline
# Start with a minimal pipeline that you can customize to build and deploy your code.
# Add steps that build, run tests, deploy, and more:
# https://aka.ms/yaml

trigger:
- dltsdev

pool:
  name: 'DLTS-Platform'

# container: ubuntu:18.04

variables: { SUBSCRIPTION_NAME: "'Bing DLTS'" }

steps:
- script: |
    cd src/ClusterBootstrap/
    sudo ./install_prerequisites.sh
    az account set --subscription $(SUBSCRIPTION_NAME)
    az account list | grep -A5 -B5 '"isDefault": true'
  displayName: 'Install prerequisites'

- script: |
    cd src/ClusterBootstrap/
    cp /mnt/_work/dlts_ci_config.yaml config.yaml
    ./bash_step_by_step_deploy.sh
  displayName: 'Deploy DLWorkspace'

- script: |
    echo TODO: verify the cluster is ready!
  displayName: 'Verify deployment'

- script: |
    echo TODO: RUN functional tests!
  displayName: 'Functional tests'

- script: |
    echo TODO: cleanup the deployment!
  displayName: 'Cleanup'
94 changes: 63 additions & 31 deletions docs/deployment/Azure/FAQ.md
@@ -1,32 +1,64 @@
# Frequently Asked Questions (FAQ) for Azure Cluster Deployment.

Please refer to [this](../knownissues/Readme.md) for more general deployment issues.

## After setup, I cannot visit the deployed DL Workspace portal.

* Please wait a few minutes after the deployment script runs through to allow the portal container to be pulled and scheduled for execution.

## I can't execute Spark job on Azure.

The current default deployment procedure on Azure doesn't deploy HDFS/Spark. So Spark job execution is not available.

## For 'az login', when I type in the device code, the web page prompt me again for the code.

It seems that sometime the browser (Edge, Chrome) cache another identity not intended to be used with az login. To get around, please start the browser in (in-private) or (incognito) mode, you may then enter the proper device code.

## I have launched a job (e.g., TensorFlow-iPython-GPU). However, I am unable to access the endpoint with error

```This site can’t be reached
....cloudapp.azure.com refused to connect.
```

Please check the docker image of the job you are running. Sometime, the iPython (or SSH server) hasn't been properly started, which caused the endpoint to be not accessible.

## I notice that my azure command is failing.

Azure CLI may time out after inactivity. You may need to re-login via 'az login'.

## Common configuration errors.

* "merge_config( config["azure_cluster"], tmpconfig["azure_cluster"][config["azure_cluster"]["cluster_name"]], verbose )"
# Frequently Asked Questions (FAQ) for Azure Cluster Deployment.

Please refer to [this](../knownissues/Readme.md) for more general deployment issues.

## After setup, I cannot visit the deployed DL Workspace portal.

* Please wait a few minutes after the deployment script runs through to allow the portal container to be pulled and scheduled for execution.

## sudo ./az_tools.py create failed.

* Check whether your subscription is correct. Always execute ```az account list | grep -A5 -B5 '"isDefault": true'``` to double check.
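
The same check can be sketched in Python. This is a minimal illustration, assuming `az account list` prints a JSON array whose entries carry `name` and `isDefault` fields; the helper name `default_subscription` and the sample data are made up for this example.

```python
import json


def default_subscription(accounts_json):
    """Return the name of the subscription marked as default, or None."""
    accounts = json.loads(accounts_json)
    for account in accounts:
        if account.get("isDefault"):
            return account.get("name")
    return None


# Made-up sample data shaped like the output of `az account list`.
sample = json.dumps([
    {"name": "Some Other Subscription", "isDefault": False},
    {"name": "Bing DLTS", "isDefault": True},
])
print(default_subscription(sample))
```

In practice the JSON text could come from `subprocess.run(["az", "account", "list"], capture_output=True, text=True).stdout` instead of the sample string.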

## Lost connection at the very first step of deploying infra node to Azure, or ```./deploy.py runscriptonall ./scripts/prepare_vm_disk.sh```

* Check whether hostname and source address in config.yaml are correctly set. Also try to make sure that you can ssh to the node.

## I cannot ssh to the node when my devbox is a physical server instead of a virtual one.

* The source IP address in config.yaml should probably be the devbox's public IP, which you can get with ```curl ifconfig.me```, rather than the private IP reported by ```hostname -I``` that you use to ssh to the devbox. If you still cannot ssh to the node after creating it, first add a temporary rule in the Azure portal that allows any source and destination IP and sets the destination port range to 22. Then ssh to the node and run ```who``` to see the actual IP used to log in to the node. Delete the temporary rule and, in the Networking settings, add <broadened IP>/16 as a valid source IP, where <broadened IP> is the ```who``` IP with its last two numbers set to 0 (e.g., 167.220.2.105 becomes 167.220.0.0/16).
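
The /16 broadening described above can be computed rather than edited by hand. The sketch below uses Python's standard `ipaddress` module; the function name `broaden_to_slash16` is made up for illustration.

```python
import ipaddress


def broaden_to_slash16(ip):
    """Return the /16 network that contains the given IPv4 address."""
    # strict=False lets ip_network() accept an address with host bits set
    # and zero them out instead of raising ValueError.
    return str(ipaddress.ip_network(f"{ip}/16", strict=False))


print(broaden_to_slash16("167.220.2.105"))  # prints 167.220.0.0/16
```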

## How do I know the node has been deployed?

* You can log into the master node: ```./deploy.py connect master```

## I could not build docker image/No such image/An image does not exist locally with the tag/The repository XXX does not have a Release file

* Check whether your docker can resolve DNS correctly. First ping a website from your devbox, then do the same inside a container, such as `docker run -it busybox`.
If the ping works on the devbox but not in the container, find out whether your devbox needs a private DNS server to reach the public Internet, then configure it in `/etc/docker/daemon.json` on your devbox. Refer to [this article](https://medium.com/@faithfulanere/solved-docker-build-could-not-resolve-archive-ubuntu-com-apt-get-fails-to-install-anything-9ea4dfdcdcf2).
If DNS is not managed by network-manager, use `systemd-resolve --status` to get more information about it.

## I can connect master/infra node, but the UI is not working (cannot access from browser), how to debug?

* Log in to the master node and use ```docker ps | grep web``` to get the ID of the Web UI container, then use ```docker logs --follow <WebUI ID>``` to figure out what happened.
A better way is to use ```sudo docker logs --tail 100 --follow $(sudo docker ps | grep webui | awk '{print $1}') ```, since the ID changes every time the container is restarted.
Every time you modify /etc/WebUI/userconfig.json etc., remember to restart that container: ```docker rm -f <WebUI ID>```

## I finished the deployment, but cannot connect to the master node via ```./deploy.py connect master```; ssh is denied even with ``` ssh -i deploy/sshkey/id_rsa core@<infra node url>```.

* Change the owner with ```sudo chown -R <usr_name>:<usr_name> DLWorkspace/```. You can check ownership with ```ls -l```.

## I can't execute Spark job on Azure.

* The current default deployment procedure on Azure doesn't deploy HDFS/Spark. So Spark job execution is not available.

## For 'az login', when I type in the device code, the web page prompts me again for the code.

* It seems that the browser (Edge, Chrome) sometimes caches another identity that is not intended to be used with az login. To work around this, start the browser in InPrivate or incognito mode; you may then enter the proper device code.

## I have launched a job (e.g., TensorFlow-iPython-GPU). However, I am unable to access the endpoint with error

```This site can’t be reached
....cloudapp.azure.com refused to connect.
```

Please check the docker image of the job you are running. Sometimes the iPython (or SSH) server hasn't been properly started, which makes the endpoint inaccessible.

## I notice that my azure command is failing.

Azure CLI may time out after inactivity. You may need to re-login via 'az login'.

## Common configuration errors.

* "merge_config( config["azure_cluster"], tmpconfig["azure_cluster"][config["azure_cluster"]["cluster_name"]], verbose )"
Please check if the cluster_name used in azure_cluster is the same as the DL workspace cluster name.
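
The expression in that message indexes `tmpconfig["azure_cluster"]` by the configured `cluster_name`, so a mismatched name would presumably show up as a failed dictionary lookup. The sketch below is a hypothetical pre-check, not part of the deployment scripts; the function name and the sample dictionaries are made up, and only the key names are taken from the message above.

```python
def check_cluster_name(config, tmpconfig):
    """Return an error message if the lookup in the message above would fail, else None."""
    cluster_name = config["azure_cluster"]["cluster_name"]
    known = tmpconfig.get("azure_cluster", {})
    if cluster_name not in known:
        return "cluster_name %r not found; known clusters: %s" % (
            cluster_name, sorted(known))
    return None


# Made-up sample data: the configured name has a typo.
config = {"azure_cluster": {"cluster_name": "mycluster-typo"}}
tmpconfig = {"azure_cluster": {"mycluster": {"azure_location": "westus2"}}}
print(check_cluster_name(config, tmpconfig))
```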
59 changes: 44 additions & 15 deletions docs/deployment/Azure/Readme.md
@@ -4,20 +4,36 @@ This document describes the procedure to deploy a DL Workspace cluster on Azure.

Please note that the procedure below doesn't deploy HDFS/Spark on DLWorkspace cluster on Azure (Spark job execution is not available on Azure Cluster).

1. Follow [this document](../../DevEnvironment/Readme.md) to setup the dev environment of DLWorkspace. Login to your Azure subscription on your dev machine via:
Prerequisite steps:
First ask your manager to add you to a subscription group, then either
1. go to that group in the Azure Portal and add an Ubuntu server from resources (this virtual server is your devbox), or
2. if you have a physical machine, install Ubuntu Server 18.04 on it and use it as your devbox,
then use the devbox to deploy nodes to the cloud.

Workflow:
1. Please [configure](configure.md) your azure cluster. Put config.yaml under src/ClusterBootstrap

2. Change directory to src/ClusterBootstrap on devbox, and install prerequisite packages:
```
cd src/ClusterBootstrap/
sudo ./install_prerequisites.sh
```
3. Login to Azure, setup proper subscription and confirm
```
SUBSCRIPTION_NAME="<subscription name>"
az login
az account set --subscription "${SUBSCRIPTION_NAME}"
az account list | grep -A5 -B5 '"isDefault": true'
```

2. Please [configure](configure.md) your azure cluster.

3. Set proper [authentication](../authentication/Readme.md).

4. Initial cluster and generate certificates and keys:
Configure your location, which should be the same as the one you specified in the config.yaml file:
```AZ_LOCATION="<your location>"```
Add your user to the docker group (replace zhe_ms with your own user name), then log out (exit) and log back in:
```sudo usermod -aG docker zhe_ms```
4. Initiate cluster and generate certificates and keys:
```
./deploy.py -y build
```

5. Create Azure Cluster:
```
./az_tools.py create
@@ -40,9 +56,10 @@ Please note that if you are not Microsoft user, you should remove the
```
where machine1 is your azure infrastructure node. (you may get the address by ./deploy.py display)

The script block execute the following command in sequences: (you do NOT need to run the following commands if you have run step 5)
1. Setup basic tools on the Ubuntu image.
This command sequentially executes the following steps:
1. Setup basic tools on VM and on the Ubuntu image.
```
./deploy.py runscriptonall ./scripts/prepare_vm_disk.sh
./deploy.py runscriptonall ./scripts/prepare_ubuntu.sh
```

@@ -57,16 +74,28 @@ Please note that if you are not Microsoft user, you should remove the
./deploy.py -y kubernetes labels
```

4. Build and deploy jobmanager, restfulapi, and webportal. Mount storage.
4. Start Nvidia device plugins:
```
./deploy.py kubernetes start nvidia-device-plugin
```

5. Build and deploy jobmanager, restfulapi, and webportal. Mount storage.
```
./deploy.py webui
./deploy.py docker push restfulapi
./deploy.py docker push webui
./deploy.py webui
./deploy.py mount
./deploy.py kubernetes start jobmanager restfulapi webportal
```

8. If you run into a deployment issue, please check [here](FAQ.md) first.

9. If you want to deploy a DLWorkspace cluster that can be autoscaled (i.e., automatically create/release VM when needed), please follow the following additional steps.

8. Manually connect to the infrastructure/master node:
```./deploy.py connect master```
On the master node, manually add ```"Grafana": "",``` to /etc/WebUI/userconfig.json, under the "Restapi" entry.
Then restart the WebUI container. Use
```docker ps | grep web```
to get the ID of the Web UI container, then restart that container:
```docker rm -f <WebUI ID>```
Wait a few minutes for it to restart (you can follow progress with ```docker logs --follow <WebUI ID>```), then visit the infra node from a web browser.

9. If you run into a deployment issue, please check [here](FAQ.md) first.
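
The manual edit of /etc/WebUI/userconfig.json in step 8 could also be scripted. The sketch below is only an illustration: it assumes the file is plain JSON and that "Grafana" is a top-level key next to "Restapi", neither of which is confirmed here. The function name and the sample content are made up.

```python
import json


def add_grafana_entry(userconfig_text):
    """Return userconfig JSON text with an empty "Grafana" entry added if missing."""
    config = json.loads(userconfig_text)
    # setdefault leaves an existing value untouched.
    config.setdefault("Grafana", "")
    return json.dumps(config, indent=2)


# Hypothetical, heavily trimmed stand-in for /etc/WebUI/userconfig.json.
sample = '{"Restapi": "http://example.invalid:5000"}'
print(add_grafana_entry(sample))
```

The WebUI container still has to be restarted afterwards, as described in step 8.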
60 changes: 30 additions & 30 deletions docs/deployment/Azure/auto_scale.md
@@ -1,31 +1,31 @@
# The following describe the procedures to autoscale a DLWorkspace cluster (i.e., automatically create/release VM when needed).
1. Download auto_scaler binary
```
wget https://github.com/DLWorkspace/autoscaler/releases/download/v1.9.0/cluster-autoscaler
```
2. Setup azure running environment and login (via az login)
3. For the Azure machine types supported, please check the document at:
```
src/ClusterBootstrap/templates/machine-types/azure/machineTypes.yaml
```
A sample template is as follows. Please fill in additional worker VM SKUs if you need.
```
---
Standard_NC6:
  cpu: 6
  memoryInMb: 56339
  gpu: 1
Standard_D3_v2:
  cpu: 4
  memoryInMb: 14339
```
4. Start auto_scaler:
```
./cluster-autoscaler --v=5 --stderrthreshold=error --logtostderr=true --cloud-provider=aztools --skip-nodes-with-local-storage=false --nodes=0:10:Standard_NC6 --nodes=0:10:Standard_D3_v2 --leader-elect=false --scale-down-enabled=true --kubeconfig=./deploy/kubeconfig/kubeconfig.yaml --expander=least-waste
# The following describe the procedures to autoscale a DLWorkspace cluster (i.e., automatically create/release VM when needed).

1. Download auto_scaler binary
```
wget https://github.com/DLWorkspace/autoscaler/releases/download/v1.9.0/cluster-autoscaler
```

2. Setup azure running environment and login (via az login)

3. For the Azure machine types supported, please check the document at:
```
src/ClusterBootstrap/templates/machine-types/azure/machineTypes.yaml
```

A sample template is as follows. Please fill in additional worker VM SKUs if you need.

```
---
Standard_NC6:
  cpu: 6
  memoryInMb: 56339
  gpu: 1
Standard_D3_v2:
  cpu: 4
  memoryInMb: 14339
```

4. Start auto_scaler:
```
./cluster-autoscaler --v=5 --stderrthreshold=error --logtostderr=true --cloud-provider=aztools --skip-nodes-with-local-storage=false --nodes=0:10:Standard_NC6 --nodes=0:10:Standard_D3_v2 --leader-elect=false --scale-down-enabled=true --kubeconfig=./deploy/kubeconfig/kubeconfig.yaml --expander=least-waste
```
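
The `--nodes` flags in the command above repeat one pattern per VM SKU. As a rough sketch, they could be generated from the machine types. The `<min>:<max>:<SKU>` reading of the flag is inferred from the command above, and the function name and default values are assumptions made for this example.

```python
def nodes_flags(machine_types, min_nodes=0, max_nodes=10):
    """Build one --nodes=<min>:<max>:<SKU> flag per machine type."""
    return ["--nodes=%d:%d:%s" % (min_nodes, max_nodes, sku)
            for sku in machine_types]


# Same SKUs as the sample template above, written as a plain dict.
machine_types = {
    "Standard_NC6": {"cpu": 6, "memoryInMb": 56339, "gpu": 1},
    "Standard_D3_v2": {"cpu": 4, "memoryInMb": 14339},
}
print(" ".join(nodes_flags(machine_types)))
# prints --nodes=0:10:Standard_NC6 --nodes=0:10:Standard_D3_v2
```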