HSCloud Clusters
================

Current cluster: `k0.hswaw.net`

Accessing via kubectl
---------------------

There isn't yet a service for getting short-term user certificates. Instead, you'll have to get admin certificates:

    bazel run //cluster/clustercfg:clustercfg admincreds $(whoami)-admin
    kubectl get nodes

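A quick sanity check that the new credentials landed in your kubeconfig and that you're talking to the right cluster:

    kubectl config current-context
    kubectl cluster-info
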
Provisioning nodes
------------------

- bring up a new node with NixOS, running the `configuration.nix` from bootstrap (to be documented)
- `bazel run //cluster/clustercfg:clustercfg nodestrap bc01nXX.hswaw.net`

That's it!

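To verify that a freshly bootstrapped node registered and went `Ready` (the node name is a placeholder, as above):

    kubectl get nodes -o wide
    kubectl describe node bc01nXX.hswaw.net
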
Ceph
====

We run Ceph via Rook. The Rook operator is running in the `ceph-rook-system` namespace. To debug Ceph issues, start by looking at its logs.

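For example, to tail the operator logs (assuming Rook's stock `app=rook-ceph-operator` label on the operator deployment):

    kubectl -n ceph-rook-system logs -l app=rook-ceph-operator --tail=100
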
The following Ceph clusters are available:

ceph-waw1
---------

HDDs on bc01n0{1-3}. 3TB total capacity.

The following storage classes use this cluster:

- `waw-hdd-redundant-1` - erasure coded 2+1 (2 data chunks, 1 parity chunk)

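To consume this pool, point a PersistentVolumeClaim at the storage class. A minimal sketch (the claim name and size are made up):

    cat <<EOF | kubectl apply -f -
    apiVersion: v1
    kind: PersistentVolumeClaim
    metadata:
      name: example-pvc
    spec:
      storageClassName: waw-hdd-redundant-1
      accessModes:
        - ReadWriteOnce
      resources:
        requests:
          storage: 1Gi
    EOF
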
A dashboard is available at https://ceph-waw1.hswaw.net/. To get the admin password, run:

    kubectl -n ceph-waw1 get secret rook-ceph-dashboard-password -o yaml | grep "password:" | awk '{print $2}' | base64 --decode ; echo

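For lower-level debugging, the Rook toolbox lets you run the Ceph CLI directly; whether it's deployed here, and under which namespace and label, is an assumption (`app=rook-ceph-tools` is Rook's stock label):

    TOOLS_POD=$(kubectl -n ceph-waw1 get pod -l app=rook-ceph-tools -o jsonpath='{.items[0].metadata.name}')
    kubectl -n ceph-waw1 exec -it "$TOOLS_POD" -- ceph status
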
Known Issues
============

After running `nixos-rebuild switch` on the hosts, the shared host/container CNI plugin directory gets nuked, and pods will fail to schedule on that node (TODO(q3k): error message here). To fix this, restart the calico-node pods running on the affected nodes; they will be rescheduled automatically and will restore the CNI plugin directory.

    kubectl -n kube-system get pods -o wide | grep calico-node
    kubectl -n kube-system delete pod calico-node-XXXX
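
If you'd rather not copy pod names by hand, a single command can delete the calico-node pod on an affected node (assumes Calico's stock `k8s-app=calico-node` label; the node name is a placeholder):

    kubectl -n kube-system delete pod -l k8s-app=calico-node \
        --field-selector spec.nodeName=bc01nXX.hswaw.net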