HSCloud Clusters
================
Current cluster: `k0.hswaw.net`
Accessing via kubectl
---------------------

    prodaccess # get a short-lived certificate for your use via SSO
               # if your local username is not the same as your HSWAW SSO
               # username, pass `-username foo`
    kubectl version
    kubectl top nodes
Every user gets a `personal-$username` namespace. Feel free to use it for your own purposes, but watch out for resource usage!

    kubectl run -n personal-$username --image=alpine:latest -it foo
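To keep an eye on how much you are using, the commands below should help (a sketch: `kubectl top` needs cluster metrics to be available, and the ResourceQuota query only returns something if a quota object is actually set on your namespace):

    kubectl -n personal-$username top pods
    kubectl -n personal-$username describe resourcequota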
To proceed further, you should be somewhat familiar with Kubernetes; otherwise the rest of the terminology might not make sense. We recommend going through the official Kubernetes tutorials.
Persistent Storage (waw2)
-------------------------
HDDs on bc01n0{1-3}. 3TB total capacity. Don't use this pool; it should go away soon (the disks are slow, the network is slow, and the RAID controllers lie). Use ceph-waw3 instead.
The following storage classes use this cluster:
- `waw-hdd-paranoid-1` - 3 replicas
- `waw-hdd-redundant-1` - erasure coded 2.1
- `waw-hdd-yolo-1` - unreplicated (you _will_ lose your data)
- `waw-hdd-redundant-1-object` - erasure coded 2.1 object store
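These (and the waw3 classes below) can also be listed straight from the cluster:

    kubectl get storageclass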
Rados Gateway (S3) is available at https://object.ceph-waw2.hswaw.net/. To create a user, ask an admin.
PersistentVolumes currently bound to PersistentVolumeClaims get automatically backed up (hourly for the next 48 hours, then once every 4 weeks, then once every month for a year).
Persistent Storage (waw3)
-------------------------
HDDs on dcr01s2{2,4}. 40TB total capacity for now. Use this.
The following storage classes use this cluster:
- `waw-hdd-yolo-3` - 1 replica
- `waw-hdd-redundant-3` - 2 replicas
- `waw-hdd-redundant-3-object` - 2 replicas, object store
Rados Gateway (S3) is available at https://object.ceph-waw3.hswaw.net/. To create a user, ask an admin.
PersistentVolumes currently bound to PVCs get automatically backed up (hourly for the next 48 hours, then once every 4 weeks, then once every month for a year).
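A minimal sketch of a PersistentVolumeClaim using one of these classes (the claim name, namespace and size are placeholders, adjust to taste):

    cat <<EOF | kubectl -n personal-$username apply -f -
    apiVersion: v1
    kind: PersistentVolumeClaim
    metadata:
      name: example-data
    spec:
      storageClassName: waw-hdd-redundant-3
      accessModes:
        - ReadWriteOnce
      resources:
        requests:
          storage: 10Gi
    EOF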
Administration
==============
Provisioning nodes
------------------
- bring up a new node with NixOS; the configuration doesn't matter, as it will be nuked anyway
- edit `cluster/nix/defs-machines.nix`
- `bazel run //cluster/clustercfg nodestrap bc01nXX.hswaw.net`
Ceph - Debugging
-----------------
We run Ceph via Rook. The Rook operator is running in the `ceph-rook-system` namespace. To debug Ceph issues, start by looking at its logs.
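For example, to look at recent operator logs (assuming the operator deployment carries the usual `app=rook-ceph-operator` label):

    kubectl -n ceph-rook-system logs -l app=rook-ceph-operator --tail=100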
A dashboard is available at https://ceph-waw2.hswaw.net/; to get the admin password, run:

    kubectl -n ceph-waw2 get secret rook-ceph-dashboard-password -o yaml | grep "password:" | awk '{print $2}' | base64 --decode ; echo
Ceph - Backups
--------------
Kubernetes PVs backed by Ceph RBDs get backed up using Benji. An hourly cronjob runs in every Ceph cluster. You can also trigger a run manually:

    kubectl -n ceph-waw2 create job --from=cronjob/ceph-waw2-benji ceph-waw2-benji-manual-$(date +%s)
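To see how such a manual run went, list the jobs and pull the logs of the one you just created (the exact name is whatever the command above generated):

    kubectl -n ceph-waw2 get jobs
    kubectl -n ceph-waw2 logs job/ceph-waw2-benji-manual-<timestamp>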
Ceph ObjectStorage pools (RADOSGW) are _not_ backed up yet!
Ceph - Object Storage
---------------------
To create an object store user, consult the rook.io manual (https://rook.io/docs/rook/v0.9/ceph-object-store-user-crd.html).
The user authentication secret is generated in the Ceph cluster namespace (`ceph-waw2`), so it may need to be manually copied into the application namespace (see the comment in `app/registry/prod.jsonnet`).
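Roughly, creating the user boils down to something like the following (a sketch, assuming the Rook v1 `CephObjectStoreUser` CRD as documented for v0.9; the user name is a placeholder, and field names should be double-checked against the manual linked above):

    cat <<EOF | kubectl -n ceph-waw2 apply -f -
    apiVersion: ceph.rook.io/v1
    kind: CephObjectStoreUser
    metadata:
      name: some-user
    spec:
      store: waw-hdd-redundant-1-object
      displayName: "Some User"
    EOF
    # Rook should then drop the credentials into a secret named roughly
    # rook-ceph-object-user-<store>-<user> in the ceph-waw2 namespace.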
`tools/rook-s3cmd-config` can be used to generate a test configuration file for s3cmd.
Remember to append `:default-placement` to your region name (i.e. `waw-hdd-redundant-1-object:default-placement`).
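For reference, a hand-rolled s3cmd invocation along these lines should behave like the generated configuration (shown for the waw2 endpoint; the credentials come from the object store user created above, and whether vhost-style bucket addressing works depends on the RGW setup):

    s3cmd --host=object.ceph-waw2.hswaw.net \
          --host-bucket='%(bucket)s.object.ceph-waw2.hswaw.net' \
          --region=waw-hdd-redundant-1-object:default-placement \
          --access_key=... --secret_key=... \
          mb s3://some-test-bucket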