HSCloud Clusters
================

Current cluster: `k0.hswaw.net`

Accessing via kubectl
---------------------

There isn't yet a service for getting short-term user certificates. Instead, you'll have to get admin certificates:

    bazel run //cluster/clustercfg:clustercfg admincreds $(whoami)-admin
    kubectl get nodes

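A quick sanity check that the new credentials landed in your kubeconfig and that you're talking to the right cluster:

    kubectl config current-context
    kubectl cluster-info
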
Provisioning nodes
------------------

- bring up a new node with NixOS, running the `configuration.nix` from bootstrap (to be documented)
- `bazel run //cluster/clustercfg:clustercfg nodestrap bc01nXX.hswaw.net`

That's it!

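To verify that a freshly bootstrapped node registered and went `Ready` (the node name is a placeholder, as above):

    kubectl get nodes -o wide
    kubectl describe node bc01nXX.hswaw.net
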
Ceph
====

We run Ceph via Rook. The Rook operator is running in the `ceph-rook-system` namespace. To debug Ceph issues, start by looking at its logs.

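For example, to tail the operator logs (assuming Rook's stock `app=rook-ceph-operator` label on the operator deployment):

    kubectl -n ceph-rook-system logs -l app=rook-ceph-operator --tail=100
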
The following Ceph clusters are available:

ceph-waw1
---------

HDDs on bc01n0{1-3}. 3TB total capacity.

The following storage classes use this cluster:

- `waw-hdd-redundant-1` - erasure coded 2+1 (2 data chunks, 1 parity chunk)

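To consume this pool, point a PersistentVolumeClaim at the storage class. A minimal sketch (the claim name and size are made up):

    cat <<EOF | kubectl apply -f -
    apiVersion: v1
    kind: PersistentVolumeClaim
    metadata:
      name: example-pvc
    spec:
      storageClassName: waw-hdd-redundant-1
      accessModes:
        - ReadWriteOnce
      resources:
        requests:
          storage: 1Gi
    EOF
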
A dashboard is available at https://ceph-waw1.hswaw.net/. To get the admin password, run:

    kubectl -n ceph-waw1 get secret rook-ceph-dashboard-password -o yaml | grep "password:" | awk '{print $2}' | base64 --decode ; echo

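For lower-level debugging, the Rook toolbox lets you run the Ceph CLI directly; whether it's deployed here, and under which namespace and label, is an assumption (`app=rook-ceph-tools` is Rook's stock label):

    TOOLS_POD=$(kubectl -n ceph-waw1 get pod -l app=rook-ceph-tools -o jsonpath='{.items[0].metadata.name}')
    kubectl -n ceph-waw1 exec -it "$TOOLS_POD" -- ceph status
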
Known Issues
============

After running `nixos-rebuild switch` on the hosts, the shared host/container CNI plugin directory gets nuked, and pods will fail to schedule on that node (TODO(q3k): error message here). To fix this, restart the calico-node pods running on the affected nodes; they will be rescheduled automatically and will restore the CNI plugin directory.

    kubectl -n kube-system get pods -o wide | grep calico-node
    kubectl -n kube-system delete pod calico-node-XXXX
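
If you'd rather not copy pod names by hand, a single command can delete the calico-node pod on an affected node (assumes Calico's stock `k8s-app=calico-node` label; the node name is a placeholder):

    kubectl -n kube-system delete pod -l k8s-app=calico-node \
        --field-selector spec.nodeName=bc01nXX.hswaw.net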