HSCloud Clusters

Admin documentation. For user documentation, see //cluster/doc/user.md.

Current cluster: k0.hswaw.net

Persistent Storage (waw2)

HDDs on bc01n0{1-3}. 3TB total capacity. Don't use this pool: it will be going away soon (the disks are slow, the network is slow, and the RAID controllers lie). Use ceph-waw3 instead.

The following storage classes use this cluster:

  • waw-hdd-paranoid-1 - 3 replicas
  • waw-hdd-redundant-1 - erasure coded 2+1 (2 data shards, 1 parity shard)
  • waw-hdd-yolo-1 - unreplicated (you will lose your data)
  • waw-hdd-redundant-1-object - erasure coded 2+1 object store

Rados Gateway (S3) is available at https://object.ceph-waw2.hswaw.net/. To create a user, ask an admin.

PersistentVolumes currently bound to PersistentVolumeClaims get automatically backed up (hourly backups are kept for the last 48 hours, then weekly backups for 4 weeks, then monthly backups for a year).

Persistent Storage (waw3)

HDDs on dcr01s2{2,4}. 40TB total capacity for now. Use this.

The following storage classes use this cluster (an example PVC using one of them follows the list):

  • waw-hdd-yolo-3 - 1 replica
  • waw-hdd-redundant-3 - 2 replicas
  • waw-hdd-redundant-3-object - 2 replicas, object store
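
For illustration, a minimal PersistentVolumeClaim against this cluster might look as follows (the claim name, namespace and size are hypothetical, pick your own):

kubectl apply -f - <<EOF
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: my-data        # hypothetical name
  namespace: my-app    # hypothetical namespace
spec:
  storageClassName: waw-hdd-redundant-3
  accessModes:
    - ReadWriteOnce
  resources:
    requests:
      storage: 10Gi
EOF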

Rados Gateway (S3) is available at https://object.ceph-waw3.hswaw.net/. To create a user, ask an admin.

PersistentVolumes currently bound to PVCs get automatically backed up (hourly backups are kept for the last 48 hours, then weekly backups for 4 weeks, then monthly backups for a year).

Administration

Provisioning nodes

  • bring up a new node with NixOS; the initial configuration doesn't matter, as it will be nuked anyway
  • edit cluster/nix/defs-machines.nix
  • bazel run //cluster/clustercfg nodestrap bc01nXX.hswaw.net

Ceph - Debugging

We run Ceph via Rook. The Rook operator runs in the ceph-rook-system namespace. To debug Ceph issues, start by looking at the operator's logs.
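
A quick way to get at those logs, assuming the operator pod carries the standard Rook label (verify with kubectl get pods if it doesn't match):

# List the operator pods, then tail their logs by the default Rook label.
kubectl -n ceph-rook-system get pods
kubectl -n ceph-rook-system logs -l app=rook-ceph-operator --tail=100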

Dashboards are available at https://ceph-waw2.hswaw.net/ and https://ceph-waw3.hswaw.net/. To get the admin password, run:

kubectl -n ceph-waw2 get secret rook-ceph-dashboard-password -o yaml | grep "password:" | awk '{print $2}' | base64 --decode ; echo
kubectl -n ceph-waw3 get secret rook-ceph-dashboard-password -o yaml | grep "password:" | awk '{print $2}' | base64 --decode ; echo

Ceph - Backups

Kubernetes PVs backed by Ceph RBDs get backed up using Benji. An hourly cronjob runs in every Ceph cluster. You can also trigger a run manually:

kubectl -n ceph-waw2 create job --from=cronjob/ceph-waw2-benji ceph-waw2-benji-manual-$(date +%s)
kubectl -n ceph-waw3 create job --from=cronjob/ceph-waw3-benji ceph-waw3-benji-manual-$(date +%s)
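
To watch a manually triggered run, follow the logs of the resulting job, where <timestamp> is the epoch value from the command above (a sketch; kubectl picks one of the job's pods):

kubectl -n ceph-waw3 logs -f job/ceph-waw3-benji-manual-<timestamp>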

Ceph ObjectStorage pools (RADOSGW) are not backed up yet!

Ceph - Object Storage

To create an object store user, consult the rook.io manual (https://rook.io/docs/rook/v0.9/ceph-object-store-user-crd.html). The user authentication secret is generated in the Ceph cluster namespace (e.g. ceph-waw2), and thus may need to be manually copied into the application namespace (see the comment in app/registry/prod.jsonnet).
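
As a sketch of what such a user definition might look like (the user name is hypothetical, and the store field is assumed to match the object store name, e.g. waw-hdd-redundant-3-object; the manual above is authoritative):

kubectl apply -f - <<EOF
apiVersion: ceph.rook.io/v1
kind: CephObjectStoreUser
metadata:
  name: my-user                      # hypothetical user name
  namespace: ceph-waw3               # the Ceph cluster namespace
spec:
  store: waw-hdd-redundant-3-object  # assumed to match the object store name
  displayName: My User
EOF

Rook then generates a secret holding the user's access and secret keys in that namespace; this is the secret that may need copying into your application's namespace.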

tools/rook-s3cmd-config can be used to generate a test configuration file for s3cmd. Remember to append :default-placement to your region name (e.g. waw-hdd-redundant-1-object:default-placement).
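
As a hedged example of using the resulting configuration (the config file name and bucket name are hypothetical; the exact flags of tools/rook-s3cmd-config are not covered here):

# List buckets using the generated configuration file.
s3cmd -c rook.s3cfg ls
# Create a bucket; note the :default-placement suffix on the region name.
s3cmd -c rook.s3cfg --bucket-location=waw-hdd-redundant-1-object:default-placement mb s3://my-bucket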