# Goals

This originated from a need to know how much RAM is consumed by Open Cloud when running a large number of workflows at the same time on the same node.

We differentiated between two components:

- The "oc-stack", which is the minimum set of services needed to create and schedule a workflow execution: oc-auth, oc-datacenter, oc-scheduler, oc-front, oc-schedulerd, oc-workflow, oc-catalog, oc-peer, oc-workspace, loki, mongo, traefik and nats.
- oc-monitord, the daemon instantiated by the scheduling daemon (oc-schedulerd) that creates the YAML for Argo and creates the necessary Kubernetes resources.

We monitor both parts to see how much RAM the oc-stack uses before / during / after the execution, the RAM consumed by the monitord containers, and the total for the stack and the monitors.

# Setup

To get optimal performance, we used a Proxmox server with large resources (>370 GiB of RAM and 128 cores) to host the two VMs composing our Kubernetes cluster: one control plane node where the oc-stack is running and one worker node running only k3s.

## VMs

We instantiated a 2-node Kubernetes cluster (with k3s) on the superg PVE (https://superg-pve.irtse-pf.ext:8006/).

### VM Control

This VM runs the oc-stack and the monitord containers, so it carries the biggest part of the load. It must have k3s and Argo installed. We allocated **62 GiB of RAM** and **31 cores**.

### VM Worker

This VM holds the workload for all the pods created, acting as a worker node for the k3s cluster. We deploy k3s as an agent node, as explained in the k3s quick start guide:

`curl -sfL https://get.k3s.io | K3S_URL=https://myserver:6443 K3S_TOKEN=mynodetoken sh -`

The value to use for `K3S_TOKEN` is stored at `/var/lib/rancher/k3s/server/node-token` on the server node.

Verify that the worker has been added as a node to the cluster by running `kubectl get nodes` on the control plane and looking for the hostname of the worker VM in the list of nodes.

### Delegate pods to the worker node

In order for the pods to be executed on another node, we need to modify how we construct the Argo YAML to add a `nodeSelector` to the workflow spec. We have added the needed attributes to the `Spec` struct in `oc-monitord` on the `test-ram` branch:

```go
type Spec struct {
	ServiceAccountName string                `yaml:"serviceAccountName"`
	Entrypoint         string                `yaml:"entrypoint"`
	Arguments          []Parameter           `yaml:"arguments,omitempty"`
	Volumes            []VolumeClaimTemplate `yaml:"volumeClaimTemplates,omitempty"`
	Templates          []Template            `yaml:"templates"`
	Timeout            int                   `yaml:"activeDeadlineSeconds,omitempty"`
	NodeSelector       struct {
		NodeRole string `yaml:"node-role"`
	} `yaml:"nodeSelector"`
}
```

and set the value in the `CreateDAG()` method:

```go
b.Workflow.Spec.NodeSelector.NodeRole = "worker"
```

## Container monitoring

We use Docker Compose to instantiate the monitoring stack:

- Prometheus: stores the data
- cAdvisor: monitors the containers

```yml
version: '3.2'
services:
  prometheus:
    image: prom/prometheus:latest
    container_name: prometheus
    ports:
      - 9090:9090
    command:
      - --config.file=/etc/prometheus/prometheus.yml
    volumes:
      - ./prometheus.yml:/etc/prometheus/prometheus.yml:ro
    depends_on:
      - cadvisor
  cadvisor:
    image: gcr.io/cadvisor/cadvisor:latest
    container_name: cadvisor
    ports:
      - 9999:8080
    volumes:
      - /:/rootfs:ro
      - /var/run:/var/run:rw
      - /sys:/sys:ro
      - /var/lib/docker/:/var/lib/docker:ro
```

Prometheus scraping configuration:

```yml
scrape_configs:
  - job_name: cadvisor
    scrape_interval: 5s
    static_configs:
      - targets:
          - cadvisor:8080
```

## Dashboards

To monitor the resource consumption during our tests, we create dashboards in Grafana.

We create 4 different queries using Prometheus as the data source. For each query we can use the `code` mode to write it directly as a PromQL query.

### OC stack consumption

```
sum(container_memory_usage_bytes{name=~"oc-auth|oc-datacenter|oc-scheduler|oc-front|oc-schedulerd|oc-workflow|oc-catalog|oc-peer|oc-workspace|loki|mongo|traefik|nats"})
```

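One caveat worth noting: `container_memory_usage_bytes` includes page cache, so it can overstate what the containers actually need to stay resident. cAdvisor also exports `container_memory_working_set_bytes` (the figure the kernel OOM killer considers), which can be substituted in any of these queries if that is the measure of interest, e.g.:

```
sum(container_memory_working_set_bytes{name=~"oc-auth|oc-datacenter|oc-scheduler|oc-front|oc-schedulerd|oc-workflow|oc-catalog|oc-peer|oc-workspace|loki|mongo|traefik|nats"})
```
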
### Monitord consumption

```
sum(container_memory_usage_bytes{image="oc-monitord"})
```

### Total RAM consumption

```
sum(
  container_memory_usage_bytes{name=~"oc-auth|oc-datacenter|oc-scheduler|oc-front|oc-schedulerd|oc-workflow|oc-catalog|oc-peer|oc-workspace|loki|mongo|traefik|nats"}
  or
  container_memory_usage_bytes{image="oc-monitord"}
)
```

### Number of monitord containers

```
count(container_memory_usage_bytes{image="oc-monitord"} > 0)
```