# Goals

This originated from a need to know how much RAM is consumed by Open Cloud when running a large number of workflows at the same time on the same node.

We differentiated between two components:

- The "oc-stack", which is the minimum set of services needed to create and schedule a workflow execution: oc-auth, oc-datacenter, oc-scheduler, oc-front, oc-schedulerd, oc-workflow, oc-catalog, oc-peer, oc-workspace, loki, mongo, traefik and nats.
- oc-monitord, the daemon instantiated by the scheduling daemon (oc-schedulerd) that creates the YAML for Argo and creates the necessary Kubernetes resources.

We monitor both parts to see how much RAM the oc-stack uses before / during / after the execution, the RAM consumed by the monitord containers, and the total for the stack and monitors combined.

# Setup

To get optimal performance we used a Proxmox server with ample resources (>370 GiB RAM and 128 cores) to host the two VMs composing our Kubernetes cluster: one control-plane node where the oc-stack runs, and one worker node running only k3s.

## VMs

We instantiated a 2-node Kubernetes cluster (with k3s) on the superg PVE (https://superg-pve.irtse-pf.ext:8006/).

### VM Control

This VM runs the oc-stack and the monitord containers, so it carries the biggest part of the load. It must have k3s and Argo installed. We allocated **62 GiB of RAM** and **31 cores**.

### VM Worker

This VM holds the workload for all the pods created, acting as a worker node for the k3s cluster. We deploy k3s as an agent node as explained in the k3s quick-start guide:

`curl -sfL https://get.k3s.io | K3S_URL=https://myserver:6443 K3S_TOKEN=mynodetoken sh -`

The value to use for K3S_TOKEN is stored at `/var/lib/rancher/k3s/server/node-token` on the server node.

Verify that the server has been added as a node to the cluster: on the control plane, run `kubectl get nodes` and look for the hostname of the worker VM in the list of nodes.
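For the `nodeSelector` used below to take effect, the worker node must actually carry the matching label. A minimal sketch, assuming the worker's hostname is `worker-vm` (replace it with the name shown by `kubectl get nodes`); the label key and value must match what oc-monitord writes into the Argo spec:

```shell
# Label the worker VM so that pods with nodeSelector {node-role: worker}
# are scheduled on it
kubectl label node worker-vm node-role=worker

# Confirm the label is present
kubectl get nodes --show-labels | grep node-role
```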
### Delegate pods to the worker node

For the pods to be executed on another node we need to modify how we construct the Argo YAML, adding a label in the metadata. We added the needed attributes to the `Spec` struct in `oc-monitord` on the `test-ram` branch:

```go
type Spec struct {
	ServiceAccountName string                `yaml:"serviceAccountName"`
	Entrypoint         string                `yaml:"entrypoint"`
	Arguments          []Parameter           `yaml:"arguments,omitempty"`
	Volumes            []VolumeClaimTemplate `yaml:"volumeClaimTemplates,omitempty"`
	Templates          []Template            `yaml:"templates"`
	Timeout            int                   `yaml:"activeDeadlineSeconds,omitempty"`
	NodeSelector       struct {
		NodeRole string `yaml:"node-role"`
	} `yaml:"nodeSelector"`
}
```

and set the tag in the `CreateDAG()` method:

```go
b.Workflow.Spec.NodeSelector.NodeRole = "worker"
```

## Container monitoring

Docker Compose file to instantiate the monitoring stack:

- Prometheus: stores the data
- cAdvisor: monitors the containers

```yml
version: '3.2'
services:
  prometheus:
    image: prom/prometheus:latest
    container_name: prometheus
    ports:
      - 9090:9090
    command:
      - --config.file=/etc/prometheus/prometheus.yml
    volumes:
      - ./prometheus.yml:/etc/prometheus/prometheus.yml:ro
    depends_on:
      - cadvisor
  cadvisor:
    image: gcr.io/cadvisor/cadvisor:latest
    container_name: cadvisor
    ports:
      - 9999:8080
    volumes:
      - /:/rootfs:ro
      - /var/run:/var/run:rw
      - /sys:/sys:ro
      - /var/lib/docker/:/var/lib/docker:ro
```

Prometheus scraping configuration:

```yml
scrape_configs:
- job_name: cadvisor
  scrape_interval: 5s
  static_configs:
  - targets:
    - cadvisor:8080
```

## Dashboards

To monitor resource consumption during our tests we create dashboards in Grafana, with four different queries using Prometheus as the data source. For each query we can use the `code` mode to enter it as a raw PromQL query.
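Outside of Grafana, the same PromQL expressions can also be checked by hand against Prometheus's HTTP API (`GET /api/v1/query`). A minimal sketch using only the Python standard library; the host and port are assumptions based on the Compose file above:

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

# Assumed Prometheus endpoint, per the ports mapping in the compose file
PROM_BASE = "http://localhost:9090"

def build_query_url(promql: str, base: str = PROM_BASE) -> str:
    """Build an instant-query URL for Prometheus's HTTP API,
    URL-encoding the PromQL expression."""
    return f"{base}/api/v1/query?{urlencode({'query': promql})}"

def instant_query(promql: str, base: str = PROM_BASE):
    """Run the query against Prometheus and return the decoded JSON body."""
    with urlopen(build_query_url(promql, base)) as resp:
        return json.load(resp)

# Example: the monitord-consumption query from the dashboards
url = build_query_url('sum(container_memory_usage_bytes{image="oc-monitord"})')
print(url)
```

Calling `instant_query(...)` returns the usual `{"status": "success", "data": {...}}` payload, from which the summed bytes can be read.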
## OC stack consumption

```
sum(container_memory_usage_bytes{name=~"oc-auth|oc-datacenter|oc-scheduler|oc-front|oc-schedulerd|oc-workflow|oc-catalog|oc-peer|oc-workspace|loki|mongo|traefik|nats"})
```

## Monitord consumption

```
sum(container_memory_usage_bytes{image="oc-monitord"})
```

## Total RAM consumption

```
sum(
  container_memory_usage_bytes{name=~"oc-auth|oc-datacenter|oc-scheduler|oc-front|oc-schedulerd|oc-workflow|oc-catalog|oc-peer|oc-workspace|loki|mongo|traefik|nats"}
  or
  container_memory_usage_bytes{image="oc-monitord"}
)
```

## Number of monitord containers

```
count(container_memory_usage_bytes{image="oc-monitord"} > 0)
```
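One caveat worth noting: `container_memory_usage_bytes` includes reclaimable page cache, so it can over-report "real" RAM pressure. cAdvisor also exports `container_memory_working_set_bytes` (the value the kernel's OOM killer considers), and the queries above can be swapped to it if a tighter measure is wanted; this variant is a suggestion, not part of the original dashboards:

```
sum(container_memory_working_set_bytes{name=~"oc-auth|oc-datacenter|oc-scheduler|oc-front|oc-schedulerd|oc-workflow|oc-catalog|oc-peer|oc-workspace|loki|mongo|traefik|nats"})
```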