HSCloud Clusters
================
Current cluster: `k0.hswaw.net`
Accessing via kubectl
---------------------
There isn't yet a service for getting short-term user certificates. Instead, you'll have to get admin certificates:
    bazel run //cluster/clustercfg:clustercfg admincreds $(whoami)-admin
    kubectl get nodes
Provisioning nodes
------------------
- bring up a new node with NixOS, running the `configuration.nix` from bootstrap (to be documented)
- `bazel run //cluster/clustercfg:clustercfg nodestrap bc01nXX.hswaw.net`
That's it!
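Once `nodestrap` completes, the node should register itself with the Kubernetes apiserver. A quick sanity check (a sketch, assuming the admin credentials from above are in your kubectl config):

    kubectl get nodes -o wide | grep bc01nXX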
Ceph
====
We run Ceph via Rook. The Rook operator is running in the `ceph-rook-system` namespace. To debug Ceph issues, start by looking at its logs.
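For example, to tail the operator logs (a sketch, assuming the stock `rook-ceph-operator` deployment name that Rook uses):

    kubectl -n ceph-rook-system logs -f deployment/rook-ceph-operator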
The following Ceph clusters are available:
ceph-waw1
---------
HDDs on bc01n0{1-3}. 3TB total capacity.
The following storage classes use this cluster:
- `waw-hdd-redundant-1` - erasure coded 2+1 (k=2, m=1)
- `waw-hdd-yolo-1` - unreplicated (you _will_ lose your data)
- `waw-hdd-redundant-1-object` - erasure coded 2+1 (k=2, m=1) object store
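To consume one of these classes, a PersistentVolumeClaim can reference it by name. A minimal sketch (the claim name, namespace and size are illustrative):

    kubectl apply -f - <<EOF
    apiVersion: v1
    kind: PersistentVolumeClaim
    metadata:
      name: example-data   # hypothetical claim name
      namespace: example   # hypothetical namespace
    spec:
      storageClassName: waw-hdd-redundant-1
      accessModes:
        - ReadWriteOnce
      resources:
        requests:
          storage: 8Gi     # illustrative size
    EOF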
A dashboard is available at https://ceph-waw1.hswaw.net/. To get the admin password, run:
    kubectl -n ceph-waw1 get secret rook-ceph-dashboard-password -o yaml | grep "password:" | awk '{print $2}' | base64 --decode ; echo
Rados Gateway (S3) is available at https://object.ceph-waw1.hswaw.net/. To create
an object store user, consult the rook.io manual (https://rook.io/docs/rook/v0.9/ceph-object-store-user-crd.html).
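A minimal sketch of such a user, following the CRD shape documented in the v0.9 manual (the user name and display name are illustrative; the store name assumes the object store is named after the storage class above):

    kubectl apply -f - <<EOF
    apiVersion: ceph.rook.io/v1
    kind: CephObjectStoreUser
    metadata:
      name: example-user   # hypothetical user name
      namespace: ceph-waw1
    spec:
      store: waw-hdd-redundant-1-object   # assumed object store name
      displayName: Example User
    EOF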
The user authentication secret is generated in the Ceph cluster namespace (`ceph-waw1`)
and thus may need to be manually copied into the application namespace (see the comment
in `app/registry/prod.jsonnet`).
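One way to do the copy (a sketch; the user name is the hypothetical one from above and the target namespace is illustrative; Rook names these secrets roughly `rook-ceph-object-user-<store>-<user>`):

    kubectl -n ceph-waw1 get secret \
        rook-ceph-object-user-waw-hdd-redundant-1-object-example-user -o yaml \
        | grep -vE '^ +(namespace|uid|resourceVersion|creationTimestamp|selfLink):' \
        | kubectl -n example apply -f -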
`tools/rook-s3cmd-config` can be used to generate a test configuration file for s3cmd.
Remember to append `:default-placement` to your region name (i.e. `waw-hdd-redundant-1-object:default-placement`).
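For instance, to create a bucket with the generated configuration (the config path and bucket name are illustrative):

    s3cmd --config=rook.s3cfg --bucket-location=waw-hdd-redundant-1-object:default-placement mb s3://example-bucket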
Known Issues
============
After running `nixos-rebuild switch` on the hosts, the shared host/container CNI plugin directory gets nuked, and pods will fail to schedule on that node (TODO(q3k): error message here). To fix this, delete the calico-node pods running on the affected nodes; they will be rescheduled automatically and recreate the CNI plugin directory.
    kubectl -n kube-system get pods -o wide | grep calico-node
    kubectl -n kube-system delete pod calico-node-XXXX