
# Goals

This originated from a request to know how much RAM is consumed by Open Cloud when running a large number of workflows at the same time on the same node.

We differentiated between two components:

- The "oc-stack", which is the minimum set of services needed to create and schedule a workflow execution: oc-auth, oc-datacenter, oc-scheduler, oc-front, oc-schedulerd, oc-workflow, oc-catalog, oc-peer, oc-workspace, loki, mongo, traefik and nats

- oc-monitord, which is the daemon instantiated by the scheduling daemon (oc-schedulerd); it generates the YAML for Argo and creates the necessary Kubernetes resources.

We monitor both parts to see how much RAM the oc-stack uses before, during and after the execution, the RAM consumed by the monitord containers, and the total for the stack and the monitors combined.

# Setup

In order to have optimal performance we used a Proxmox server with large resources (>370 GiB of RAM and 128 cores) to host the two VMs composing our Kubernetes cluster: one control-plane node where the oc-stack is running and one worker node running only k3s.

## VMs

We instantiated a 2-node Kubernetes cluster (with k3s) on the superg PVE (https://superg-pve.irtse-pf.ext:8006/).

### VM Control

This VM runs the oc-stack and the monitord containers, so it carries the biggest part of the load. It must have k3s and Argo installed. We allocated 62 GiB of RAM and 31 cores.

### VM Worker

This VM holds the workload for all the pods created, acting as a worker node for the k3s cluster. We deploy k3s in agent mode on it, as explained in the k3s quick start guide:

```bash
curl -sfL https://get.k3s.io | K3S_URL=https://myserver:6443 K3S_TOKEN=mynodetoken sh -
```

The value to use for `K3S_TOKEN` is stored at `/var/lib/rancher/k3s/server/node-token` on the server node.
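
For reference, a minimal way to read that token on the server node (assuming a default k3s install and root access):

```bash
# On the control-plane (server) VM: print the join token to use as K3S_TOKEN
sudo cat /var/lib/rancher/k3s/server/node-token
```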

Verify that the worker has been added to the cluster by running `kubectl get nodes` on the control plane and looking for the hostname of the worker VM in the list of nodes.

### Delegate pods to the worker node

In order for the pods to be executed on the other node we need to modify how we construct the Argo YAML and add a nodeSelector to the workflow spec. We have added the needed attributes to the Spec struct in oc-monitord on the test-ram branch.

```go
type Spec struct {
	ServiceAccountName string                `yaml:"serviceAccountName"`
	Entrypoint         string                `yaml:"entrypoint"`
	Arguments          []Parameter           `yaml:"arguments,omitempty"`
	Volumes            []VolumeClaimTemplate `yaml:"volumeClaimTemplates,omitempty"`
	Templates          []Template            `yaml:"templates"`
	Timeout            int                   `yaml:"activeDeadlineSeconds,omitempty"`
	NodeSelector       struct {
		NodeRole string `yaml:"node-role"`
	} `yaml:"nodeSelector"`
}
```

and set the selector in the CreateDAG() method:

```go
b.Workflow.Spec.NodeSelector.NodeRole = "worker"
```
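
For this selector to have any effect, the worker node must carry a matching label. A minimal sketch of how it can be applied, where the node name `vm-worker` is hypothetical (use the hostname shown by `kubectl get nodes`):

```bash
# Label the worker node so it matches the nodeSelector "node-role: worker"
# ("vm-worker" is a hypothetical node name; use the actual worker hostname)
kubectl label node vm-worker node-role=worker
```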

# Container monitoring

Docker Compose file used to instantiate the monitoring stack:

- Prometheus: stores the metrics
- cAdvisor: monitors the containers

```yaml
version: '3.2'
services:
  prometheus:
    image: prom/prometheus:latest
    container_name: prometheus
    ports:
    - 9090:9090
    command:
    - --config.file=/etc/prometheus/prometheus.yml
    volumes:
    - ./prometheus.yml:/etc/prometheus/prometheus.yml:ro
    depends_on:
    - cadvisor
  cadvisor:
    image: gcr.io/cadvisor/cadvisor:latest
    container_name: cadvisor
    ports:
    - 9999:8080
    volumes:
    - /:/rootfs:ro
    - /var/run:/var/run:rw
    - /sys:/sys:ro
    - /var/lib/docker/:/var/lib/docker:ro
```
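
Assuming the Compose file above is saved as docker-compose.yml next to prometheus.yml, the monitoring stack can be brought up and sanity-checked with something like:

```bash
# Start Prometheus and cAdvisor in the background
docker compose up -d

# Quick sanity checks, using the ports published in the Compose file above
curl -s http://localhost:9999/metrics | head   # cAdvisor metrics endpoint
curl -s http://localhost:9090/-/ready          # Prometheus readiness endpoint
```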

Prometheus scraping configuration:

```yaml
scrape_configs:
- job_name: cadvisor
  scrape_interval: 5s
  static_configs:
  - targets:
    - cadvisor:8080
```
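
To confirm that Prometheus is actually scraping cAdvisor, the targets API can be queried; a quick check assuming the ports above and jq installed:

```bash
# List the scrape targets and their health; the cadvisor job should be "up"
curl -s http://localhost:9090/api/v1/targets \
  | jq '.data.activeTargets[] | {job: .labels.job, health: .health}'
```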

# Dashboards

In order to monitor the resource consumption during our tests we need to create dashboards in Grafana.

We create 4 different queries using Prometheus as the data source. For each query we can use the code mode to write it directly in PromQL.

## OC stack consumption

```promql
sum(container_memory_usage_bytes{name=~"oc-auth|oc-datacenter|oc-scheduler|oc-front|oc-schedulerd|oc-workflow|oc-catalog|oc-peer|oc-workspace|loki|mongo|traefik|nats"})
```

## Monitord consumption

```promql
sum(container_memory_usage_bytes{image="oc-monitord"})
```

## Total RAM consumption

```promql
sum(
  container_memory_usage_bytes{name=~"oc-auth|oc-datacenter|oc-scheduler|oc-front|oc-schedulerd|oc-workflow|oc-catalog|oc-peer|oc-workspace|loki|mongo|traefik|nats"}
  or
  container_memory_usage_bytes{image="oc-monitord"}
)
```

## Number of monitord containers

```promql
count(container_memory_usage_bytes{image="oc-monitord"} > 0)
```
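
These queries can also be run outside Grafana against the Prometheus HTTP API, which is handy for spot checks during a test run. A sketch assuming Prometheus is reachable on localhost:9090 and jq is installed:

```bash
# Count the running oc-monitord containers via the Prometheus HTTP API
curl -s -G http://localhost:9090/api/v1/query \
  --data-urlencode 'query=count(container_memory_usage_bytes{image="oc-monitord"} > 0)' \
  | jq '.data.result[0].value[1]'
```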