added documentation for the RAM consumption tests we just did

This commit is contained in:
pb 2025-05-28 11:30:22 +02:00
parent 0cff31b32f
commit a10021fb98
3 changed files with 206 additions and 0 deletions


@@ -0,0 +1,135 @@
# Goals
This originated from a request to know how much RAM is consumed by Open Cloud when running a large number of workflows at the same time on the same node.
We distinguish two components:
- The "oc-stack", which is the minimum set of services needed to create and schedule a workflow execution: oc-auth, oc-datacenter, oc-scheduler, oc-front, oc-schedulerd, oc-workflow, oc-catalog, oc-peer, oc-workspace, loki, mongo, traefik and nats
- oc-monitord, the daemon instantiated by the scheduling daemon (oc-schedulerd) that creates the YAML for Argo and the necessary Kubernetes resources.
We monitor both parts to see how much RAM the oc-stack uses before / during / after the execution, the RAM consumed by the monitord containers, and the total for the stack and the monitors combined.
# Setup
In order to have optimal performance we used a Proxmox server with ample resources (>370 GiB of RAM and 128 cores) to host the two VMs composing our Kubernetes cluster: one control plane node where the oc-stack runs and one worker node running only k3s.
## VMs
We instantiated a 2-node Kubernetes cluster (with k3s) on the superg PVE (https://superg-pve.irtse-pf.ext:8006/).
### VM Control
This VM runs the oc-stack and the monitord containers and carries the biggest part of the load. It must have k3s and Argo installed. We allocated **62 GiB of RAM** and **31 cores**.
### VM Worker
This VM holds the workload for all the pods created, acting as a worker node for the k3s cluster. We deploy k3s on it as an agent node, as explained in the K3s quick start guide:
`curl -sfL https://get.k3s.io | K3S_URL=https://myserver:6443 K3S_TOKEN=mynodetoken sh -`
The value to use for K3S_TOKEN is stored at `/var/lib/rancher/k3s/server/node-token` on the server node.
Verify on the control plane that the worker has been added to the cluster with `kubectl get nodes` and look for the hostname of the worker VM in the list of nodes.
### Delegate pods to the worker node
In order for the pods to be executed on the worker node we need to modify how we construct the Argo YAML and add a `nodeSelector` to the workflow spec. We have added the needed attributes to the `Spec` struct in `oc-monitord` on the `test-ram` branch.
```go
type Spec struct {
    ServiceAccountName string                `yaml:"serviceAccountName"`
    Entrypoint         string                `yaml:"entrypoint"`
    Arguments          []Parameter           `yaml:"arguments,omitempty"`
    Volumes            []VolumeClaimTemplate `yaml:"volumeClaimTemplates,omitempty"`
    Templates          []Template            `yaml:"templates"`
    Timeout            int                   `yaml:"activeDeadlineSeconds,omitempty"`
    NodeSelector       struct {
        NodeRole string `yaml:"node-role"`
    } `yaml:"nodeSelector"`
}
```
and set the selector in the `CreateDAG()` method:
```go
b.Workflow.Spec.NodeSelector.NodeRole = "worker"
```
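Below is a standalone sketch of the YAML this change produces, marshalling a trimmed-down `Spec` with `gopkg.in/yaml.v3` (the library actually used by oc-monitord may differ, and the entrypoint value is just a placeholder):
```go
package main

import (
	"fmt"

	"gopkg.in/yaml.v3"
)

// Trimmed-down version of the Spec struct above, limited to the
// fields relevant for node selection.
type Spec struct {
	Entrypoint   string `yaml:"entrypoint"`
	NodeSelector struct {
		NodeRole string `yaml:"node-role"`
	} `yaml:"nodeSelector"`
}

func main() {
	var s Spec
	s.Entrypoint = "dag" // placeholder value
	// Same assignment as in CreateDAG().
	s.NodeSelector.NodeRole = "worker"

	out, err := yaml.Marshal(s)
	if err != nil {
		panic(err)
	}
	// Expected output:
	// entrypoint: dag
	// nodeSelector:
	//     node-role: worker
	fmt.Print(string(out))
}
```
Note that a `nodeSelector` only matches nodes carrying the corresponding label, so the worker VM has to be labelled accordingly (e.g. `kubectl label nodes <worker-hostname> node-role=worker`).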
## Container monitoring
Docker Compose file used to instantiate the monitoring stack:
- Prometheus: stores the metrics
- cAdvisor: monitors the containers
```yml
version: '3.2'
services:
  prometheus:
    image: prom/prometheus:latest
    container_name: prometheus
    ports:
      - 9090:9090
    command:
      - --config.file=/etc/prometheus/prometheus.yml
    volumes:
      - ./prometheus.yml:/etc/prometheus/prometheus.yml:ro
    depends_on:
      - cadvisor
  cadvisor:
    image: gcr.io/cadvisor/cadvisor:latest
    container_name: cadvisor
    ports:
      - 9999:8080
    volumes:
      - /:/rootfs:ro
      - /var/run:/var/run:rw
      - /sys:/sys:ro
      - /var/lib/docker/:/var/lib/docker:ro
```
Prometheus scraping configuration:
```yml
scrape_configs:
  - job_name: cadvisor
    scrape_interval: 5s
    static_configs:
      - targets:
          - cadvisor:8080
```
## Dashboards
In order to monitor the resource consumption during our tests we need to create a dashboard in Grafana.
We create 4 different queries using Prometheus as the data source. For each one we can use the `code` mode to enter the PromQL expression directly.
### OC stack consumption
```
sum(container_memory_usage_bytes{name=~"oc-auth|oc-datacenter|oc-scheduler|oc-front|oc-schedulerd|oc-workflow|oc-catalog|oc-peer|oc-workspace|loki|mongo|traefik|nats"})
```
### Monitord consumption
```
sum(container_memory_usage_bytes{image="oc-monitord"})
```
### Total RAM consumption
```
sum(
container_memory_usage_bytes{name=~"oc-auth|oc-datacenter|oc-scheduler|oc-front|oc-schedulerd|oc-workflow|oc-catalog|oc-peer|oc-workspace|loki|mongo|traefik|nats"}
or
container_memory_usage_bytes{image="oc-monitord"}
)
```
### Number of monitord containers
```
count(container_memory_usage_bytes{image="oc-monitord"} > 0)
```
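These panel queries can also be run programmatically, which is handy for logging the measurements while a test is running. A minimal sketch using the official Prometheus Go client (`github.com/prometheus/client_golang`); the Prometheus address and the chosen query are assumptions to adapt to the actual setup:
```go
package main

import (
	"context"
	"fmt"
	"time"

	"github.com/prometheus/client_golang/api"
	v1 "github.com/prometheus/client_golang/api/prometheus/v1"
)

func main() {
	// Prometheus as exposed by the compose file above (assumed address).
	client, err := api.NewClient(api.Config{Address: "http://localhost:9090"})
	if err != nil {
		panic(err)
	}
	promAPI := v1.NewAPI(client)

	// Same PromQL expression as the "Monitord consumption" panel.
	query := `sum(container_memory_usage_bytes{image="oc-monitord"})`

	ctx, cancel := context.WithTimeout(context.Background(), 10*time.Second)
	defer cancel()

	// Instant query evaluated at the current time.
	result, warnings, err := promAPI.Query(ctx, query, time.Now())
	if err != nil {
		panic(err)
	}
	if len(warnings) > 0 {
		fmt.Println("warnings:", warnings)
	}
	fmt.Println(result)
}
```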
