added documentation for the RAM consumption tests we just did

This commit is contained in:
pb 2025-05-28 11:30:22 +02:00
parent 0cff31b32f
commit a10021fb98
3 changed files with 206 additions and 0 deletions


@@ -0,0 +1,135 @@
# Goals
This originated from a request to know how much RAM is consumed by Open Cloud when running a large number of workflows at the same time on the same node.
We distinguish two components:
- The "oc-stack", which is the minimum set of services needed to create and schedule a workflow execution: oc-auth, oc-datacenter, oc-scheduler, oc-front, oc-schedulerd, oc-workflow, oc-catalog, oc-peer, oc-workspace, loki, mongo, traefik and nats
- oc-monitord, the daemon instantiated by the scheduling daemon (oc-schedulerd) that creates the YAML for Argo and the necessary Kubernetes resources.
We monitor both parts to see how much RAM the oc-stack uses before / during / after the execution, the RAM consumed by the monitord containers, and the total for the stack and the monitors combined.
# Setup
In order to have optimal performance we used a Proxmox server with ample resources (>370 GiB of RAM and 128 cores) to host the two VMs composing our Kubernetes cluster: one control plane node where the oc-stack runs and one worker node running only k3s.
## VMs
We instantiated a 2-node Kubernetes cluster (with k3s) on the superg PVE (https://superg-pve.irtse-pf.ext:8006/).
### VM Control
This VM runs the oc-stack and the monitord containers and carries the biggest part of the load. It must have k3s and Argo installed. We allocated **62 GiB of RAM** and **31 cores**.
### VM Worker
This VM holds the workload for all the pods created, acting as a worker node for the k3s cluster. We deploy k3s on it as an agent node, as explained in the K3s quick start guide:
`curl -sfL https://get.k3s.io | K3S_URL=https://myserver:6443 K3S_TOKEN=mynodetoken sh -`
The value to use for K3S_TOKEN is stored at `/var/lib/rancher/k3s/server/node-token` on the server node.
Verify on the control plane that the worker has been added to the cluster with `kubectl get nodes` and look for the hostname of the worker VM in the list of nodes.
### Delegate pods to the worker node
In order for the pods to be executed on the worker node we need to modify how we construct the Argo YAML and add a `nodeSelector` to the workflow spec. We have added the needed attributes to the `Spec` struct in `oc-monitord` on the `test-ram` branch.
```go
type Spec struct {
    ServiceAccountName string                `yaml:"serviceAccountName"`
    Entrypoint         string                `yaml:"entrypoint"`
    Arguments          []Parameter           `yaml:"arguments,omitempty"`
    Volumes            []VolumeClaimTemplate `yaml:"volumeClaimTemplates,omitempty"`
    Templates          []Template            `yaml:"templates"`
    Timeout            int                   `yaml:"activeDeadlineSeconds,omitempty"`
    NodeSelector       struct {
        NodeRole string `yaml:"node-role"`
    } `yaml:"nodeSelector"`
}
```
and set the selector in the `CreateDAG()` method:
```go
b.Workflow.Spec.NodeSelector.NodeRole = "worker"
```
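Below is a standalone sketch of the YAML this change produces, marshalling a trimmed-down `Spec` with `gopkg.in/yaml.v3` (the library actually used by oc-monitord may differ, and the entrypoint value is just a placeholder):
```go
package main

import (
	"fmt"

	"gopkg.in/yaml.v3"
)

// Trimmed-down version of the Spec struct above, limited to the
// fields relevant for node selection.
type Spec struct {
	Entrypoint   string `yaml:"entrypoint"`
	NodeSelector struct {
		NodeRole string `yaml:"node-role"`
	} `yaml:"nodeSelector"`
}

func main() {
	var s Spec
	s.Entrypoint = "dag" // placeholder value
	// Same assignment as in CreateDAG().
	s.NodeSelector.NodeRole = "worker"

	out, err := yaml.Marshal(s)
	if err != nil {
		panic(err)
	}
	// Expected output:
	// entrypoint: dag
	// nodeSelector:
	//     node-role: worker
	fmt.Print(string(out))
}
```
Note that a `nodeSelector` only matches nodes carrying the corresponding label, so the worker VM has to be labelled accordingly (e.g. `kubectl label nodes <worker-hostname> node-role=worker`).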
## Container monitoring
Docker Compose file used to instantiate the monitoring stack:
- Prometheus: stores the metrics
- cAdvisor: monitors the containers
```yml
version: '3.2'
services:
  prometheus:
    image: prom/prometheus:latest
    container_name: prometheus
    ports:
      - 9090:9090
    command:
      - --config.file=/etc/prometheus/prometheus.yml
    volumes:
      - ./prometheus.yml:/etc/prometheus/prometheus.yml:ro
    depends_on:
      - cadvisor
  cadvisor:
    image: gcr.io/cadvisor/cadvisor:latest
    container_name: cadvisor
    ports:
      - 9999:8080
    volumes:
      - /:/rootfs:ro
      - /var/run:/var/run:rw
      - /sys:/sys:ro
      - /var/lib/docker/:/var/lib/docker:ro
```
Prometheus scraping configuration:
```yml
scrape_configs:
  - job_name: cadvisor
    scrape_interval: 5s
    static_configs:
      - targets:
          - cadvisor:8080
```
## Dashboards
In order to monitor the resource consumption during our tests we need to create a dashboard in Grafana.
We create 4 different queries using Prometheus as the data source. For each one we can use the `code` mode to enter the PromQL expression directly.
### OC stack consumption
```
sum(container_memory_usage_bytes{name=~"oc-auth|oc-datacenter|oc-scheduler|oc-front|oc-schedulerd|oc-workflow|oc-catalog|oc-peer|oc-workspace|loki|mongo|traefik|nats"})
```
### Monitord consumption
```
sum(container_memory_usage_bytes{image="oc-monitord"})
```
### Total RAM consumption
```
sum(
container_memory_usage_bytes{name=~"oc-auth|oc-datacenter|oc-scheduler|oc-front|oc-schedulerd|oc-workflow|oc-catalog|oc-peer|oc-workspace|loki|mongo|traefik|nats"}
or
container_memory_usage_bytes{image="oc-monitord"}
)
```
### Number of monitord containers
```
count(container_memory_usage_bytes{image="oc-monitord"} > 0)
```
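These panel queries can also be run programmatically, which is handy for logging the measurements while a test is running. A minimal sketch using the official Prometheus Go client (`github.com/prometheus/client_golang`); the Prometheus address and the chosen query are assumptions to adapt to the actual setup:
```go
package main

import (
	"context"
	"fmt"
	"time"

	"github.com/prometheus/client_golang/api"
	v1 "github.com/prometheus/client_golang/api/prometheus/v1"
)

func main() {
	// Prometheus as exposed by the compose file above (assumed address).
	client, err := api.NewClient(api.Config{Address: "http://localhost:9090"})
	if err != nil {
		panic(err)
	}
	promAPI := v1.NewAPI(client)

	// Same PromQL expression as the "Monitord consumption" panel.
	query := `sum(container_memory_usage_bytes{image="oc-monitord"})`

	ctx, cancel := context.WithTimeout(context.Background(), 10*time.Second)
	defer cancel()

	// Instant query evaluated at the current time.
	result, warnings, err := promAPI.Query(ctx, query, time.Now())
	if err != nil {
		panic(err)
	}
	if len(warnings) > 0 {
		fmt.Println("warnings:", warnings)
	}
	fmt.Println(result)
}
```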
