Compare commits

...

7 Commits

19 changed files with 476 additions and 53 deletions

Binary file added (3.5 MiB, not shown)

Binary file added (2.5 MiB, not shown)

Binary file added (1.9 MiB, not shown)

BIN docs/S3/img/workflow.png (new file, 124 KiB, not shown)


@@ -0,0 +1,44 @@
# Allowing deported Pods to use S3 storage
As a first way to transfer data from one processing node to another, we have implemented the mechanism that allows a pod to access a bucket on an S3-compatible server that is not hosted on the same Kubernetes cluster.
For this we use an example workflow run with Argo and Admiralty from the *Control* node, with the **curl** and **mosquitto** processings executing on the control node and the other processings on the *Target01* node.
To transfer data we use the **S3** and **output/input** annotations handled by Argo, with two *Minio* servers on Control and Target01.
![](./img/workflow.png)
When the user launches a booking on the UI, a request is sent to **oc-scheduler**, which:
- Checks whether another booking is scheduled at the requested time
- Creates the booking and workflow executions in the DB
- Creates the namespace, service accounts and rights for Argo to execute
![](./img/ns-creation-after-booking.gif)
We added another action to the existing calls made to **oc-datacenter**.
**oc-scheduler** retrieves all the storage resources in the workflow and, for each one, retrieves the *computing* resources that host a processing resource using that storage resource. Here we have:
- Minio Control:
  - Control (via the first cURL)
  - Target01 (via ImageMagick)
- Minio Target01:
  - Control (via alpine)
  - Target01 (via cURL, openalpr and mosquitto)
If the computing and storage resources are on the same node, **oc-scheduler** sends an empty POST request to the corresponding route, and **oc-datacenter** creates the credentials on the S3 server and stores them in a Kubernetes secret in the execution's namespace.
If the two resources are on different nodes, **oc-scheduler** sends a POST request stating that it needs to retrieve the credentials, reads the response and calls the appropriate **oc-datacenter** to create a Kubernetes secret. This means that, with three nodes:
- A from which the workflow is scheduled
- B where the storage is
- C where the computing is
A can contact B to retrieve the credentials, post them to C for storage, and then run an Argo workflow from which a pod is deported to C and is able to access the S3 server on B.
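As a rough sketch of this cross-node exchange (the routes and payloads below are illustrative placeholders, not the actual oc-datacenter API):
```bash
# Hypothetical sketch of the cross-node case: A retrieves credentials from B,
# then pushes them to C so the deported pod can reach B's S3 server.

# 1. Ask B's oc-datacenter to create S3 credentials and return them in the response.
CREDS=$(curl -s -X POST "https://<node-B>/oc-datacenter/<storage-id>/credentials?return=true" \
  -H "Authorization: Bearer $TOKEN")

# 2. Forward the credentials to C's oc-datacenter, which stores them as a
#    Kubernetes secret in the execution's namespace.
curl -s -X POST "https://<node-C>/oc-datacenter/<execution-namespace>/secret" \
  -H "Authorization: Bearer $TOKEN" \
  -H "Content-Type: application/json" \
  -d "$CREDS"

# 3. The Argo workflow is then submitted from A; the pod deported to C mounts
#    the secret and accesses the S3 bucket hosted on B.
```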
![](./img/secrets-created-in-s3.gif)
# Final
We can see that the different processings are able to access the required data on the different storage resources, and that our ALPR analysis is sent to the mosquitto server and to the HTTP endpoint we set in the last cURL.
![](./img/argo-watch-executing.gif)


@@ -1 +1 @@
This service offers realtime shared data synchronization between OpenCloud instances.


@@ -68,3 +68,4 @@ Several S3 compatible storage may be used in a workflow
A processing shall be connected to a computing link.
Argo volcano ?

Binary file added (31 KiB, not shown)

Binary file added (31 KiB, not shown)


@@ -3,6 +3,85 @@
We have written three playbooks, available on a private [GitHub repo](https://github.com/pi-B/ansible-oc/tree/384a5acc0713a0fa013a82f71fbe2338bf6c80c1/Admiralty):
- `deploy_admiralty.yml` installs Helm and the necessary charts in order to run Admiralty on the cluster
- `setup_admiralty_target.yml` creates the environment necessary to use a cluster as a target in an Admiralty federation running Argo Workflows. It creates the necessary serviceAccount, Target resource and token to authenticate the source
- `add_admiralty_target.yml` creates the environment to use a cluster as a source, providing the data necessary to use a given cluster as a target.
# Ansible playbook
Run `deploy_admiralty.yml` with: `ansible-playbook deploy_admiralty.yml -i <REMOTE_HOST_IP>, --extra-vars "user_prompt=<YOUR_USER>" --ask-pass`
```yaml
- name: Install Helm
  hosts: all:!localhost
  user: "{{ user_prompt }}"
  become: true
  # become_method: su
  vars:
    arch_mapping: # Map ansible architecture {{ ansible_architecture }} names to Docker's architecture names
      x86_64: amd64
      aarch64: arm64
  tasks:
    - name: Check if Helm does exist
      ansible.builtin.command:
        cmd: which helm
      register: result_which
      failed_when: result_which.rc not in [ 0, 1 ]
    - name: Install helm
      when: result_which.rc == 1
      block:
        - name: download helm from source
          ansible.builtin.get_url:
            url: https://get.helm.sh/helm-v3.15.0-linux-amd64.tar.gz
            dest: ./
        - name: unpack helm
          ansible.builtin.unarchive:
            remote_src: true
            src: helm-v3.15.0-linux-amd64.tar.gz
            dest: ./
        - name: copy helm to path
          ansible.builtin.command:
            cmd: mv linux-amd64/helm /usr/local/bin/helm

- name: Install admiralty
  hosts: all:!localhost
  user: "{{ user_prompt }}"
  tasks:
    - name: Install required python libraries
      become: true
      # become_method: su
      package:
        name:
          - python3
          - python3-yaml
        state: present
    - name: Add jetstack repo
      ansible.builtin.shell:
        cmd: |
          helm repo add jetstack https://charts.jetstack.io && \
          helm repo update
    - name: Install cert-manager
      kubernetes.core.helm:
        chart_ref: jetstack/cert-manager
        release_name: cert-manager
        context: default
        namespace: cert-manager
        create_namespace: true
        wait: true
        set_values:
          - value: installCRDs=true
    - name: Install admiralty
      kubernetes.core.helm:
        name: admiralty
        chart_ref: oci://public.ecr.aws/admiralty/admiralty
        namespace: admiralty
        create_namespace: true
        chart_version: 0.16.0
        wait: true
```
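Once the playbooks have run, a quick sanity check can be done (assuming `kubectl` access to the cluster; the CRD names below may vary slightly between Admiralty versions):
```bash
# cert-manager and Admiralty pods should be running in their namespaces.
kubectl get pods -n cert-manager
kubectl get pods -n admiralty

# After a target has been configured, the Admiralty resources should be listed.
kubectl get targets.multicluster.admiralty.io -A
kubectl get sources.multicluster.admiralty.io -A
```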


@@ -1,90 +1,123 @@
# Introduction
OpenCloud is an open-source, distributed cloud solution that enables you to selectively share, sell, or rent your infrastructure resources (data, algorithms, compute power, and storage) with other OpenCloud peers. It facilitates distributed workflow execution between partners, allowing seamless collaboration across decentralized networks.
Distributed execution within this peer-to-peer network can be optimized according to your own priorities:
* **Maximal sovereignty**
* **Accelerated computation**
* **Cost minimization**
* **Optimized infrastructure investments**
Each OpenCloud instance includes an OpenID-based distributed authentication system.
OpenCloud is entirely decentralized, with no central authority or single point of failure (SPOF). Additionally, OpenCloud provides transaction tracking, allowing all partners to be aware of their distributed resource consumption and ensuring transparent peer-to-peer billing.
---
## Features
Each OpenCloud instance runs a collection of services that allow users to interact with both their own deployment and other OpenCloud participants.
### Resource Catalog
The **Resource Catalog** service indexes all the resources provided by the current instance, including **Data**, **Algorithms**, **Compute Units**, **Storages**, and pre-built **Processing Workflows**.
All resources are described by metadata, as defined in the `catalog_metadata` document. Catalog resources can be either **public**, visible to all OpenCloud peers, or **private**, accessible only to selected partners or groups (e.g., projects, entities, etc.).
Access to specific resources may require credentials, payment, or other access agreements.
---
### Workspace Management
Each OpenCloud user can create **workspaces** to organize resources of interest.
Resources within a workspace can later be used to build processing workflows or set up new services.
Users can define as many workspaces as needed to manage their projects efficiently.
---
### Workflow Editor
Using elements selected in a workspace, a user can build a **distributed processing workflow** or establish a **permanent service**.
Workflows are constructed with OpenCloud's integrated workflow editor, offering a user-friendly interface for defining distributed processes.
---
### Collaborative Areas
OpenCloud enables the sharing of **workspaces** and **workflows** with selected partners, enhancing collaborative projects.
A **Collaborative Area** can include multiple management and operation rules that are enforced automatically or verified manually. Examples include:
* Enforcing the use of only open-source components
* Restricting the inclusion of personal data
* Defining result visibility constraints
* Imposing legal limitations
---
### Peer Management
OpenCloud allows you to define relationships with other peers, enabling the creation of private communities.
Access rights related to peers can be managed at a **global peer scope** or for **specific groups** within the peer community.
---
## Benefits
### Complete Control Over Data Location
OpenCloud encourages users to host their own data.
When external storage is necessary, OpenCloud enables users to carefully select partners and locations to ensure privacy, compliance, and performance.
---
### Cooperation Framework
OpenCloud provides a structured framework for sharing data, managing common workspaces, and defining usage regulations.
This framework covers both **technical** and **legal aspects** for distributed projects.
---
### Data Redundancy
Like traditional public cloud architectures, OpenCloud supports **data redundancy** but with finer-grained control.
You can distribute your data across multiple OpenCloud instances, ensuring availability and resilience.
---
### Compatibility with Public Cloud Infrastructure
When your workloads require massive storage or computational capabilities beyond what your OpenCloud peers can provide, you can seamlessly deploy an OpenCloud instance on any public cloud provider.
This hybrid approach allows you to scale effortlessly for workloads that are not sensitive to international competition.
---
### Fine-Grained Access Control
OpenCloud provides **fine-grained access control**, enabling you to precisely define access policies for partners and communities.
---
### Lightweight for Datacenter and Edge Deployments
The OpenCloud stack is developed in **Go**, generating **native code** and minimal **scratch containers**. All selected COTS (Commercial Off-The-Shelf) components used by OpenCloud services are chosen with these design principles in mind.
The objective is to enable OpenCloud to run on almost any platform:
* In **datacenters**, supporting large-scale processing workflows
* On **ARM-based single-board computers**, handling concurrent payloads for diverse applications like **sensor preprocessing**, **image recognition**, or **data filtering**
GUIs are built with **Flutter** and rendered as plain **HTML/JS** for lightweight deployment.
---
### Fully Distributed Architecture
OpenCloud is fully decentralized, eliminating any **single point of failure**.
There is no central administrator, and no central registration is required. This makes OpenCloud highly **resilient**, allowing partners to join or leave the network without impacting the broader OpenCloud community.
---
### Open Source and Transparent
To foster trust, OpenCloud is released as **open-source software**.
Its code is publicly available for audit. The project is licensed under **AGPL V3** to prevent the emergence of closed, private forks that could compromise the OpenCloud community's transparency and trust.

Binary file added (34 KiB, not shown)

Binary file added (31 KiB, not shown)

Binary file added (30 KiB, not shown)


@@ -0,0 +1,151 @@
# Goals
This originated from a request to know how much RAM is consumed by OpenCloud when running a large number of workflows at the same time on the same node.
We differentiated between two components:
- The "oc-stack", which is the minimum set of services needed to create and schedule a workflow execution: oc-auth, oc-datacenter, oc-scheduler, oc-front, oc-schedulerd, oc-workflow, oc-catalog, oc-peer, oc-workspace, loki, mongo, traefik and nats
- oc-monitord, the daemon instantiated by the scheduling daemon (oc-schedulerd), which creates the YAML for Argo and the necessary Kubernetes resources.
We monitor both parts to see how much RAM the oc-stack uses before, during and after the executions, the RAM consumed by the monitord containers, and the total for the stack and the monitors combined.
# Setup
In order to have optimal performance we used a Proxmox server with large resources (>370 GiB RAM and 128 cores) to host the two VMs composing our Kubernetes cluster: one control-plane node where the oc-stack is running and one worker node running only k3s.
## VMs
We instantiated a 2-node Kubernetes cluster (with k3s) on the superg PVE (https://superg-pve.irtse-pf.ext:8006/).
### VM Control
This VM runs the oc-stack and the monitord containers and carries the biggest part of the load. It must have k3s and Argo installed. We allocated **62 GiB of RAM** and **31 cores**.
### VM Worker
This VM holds the workload for all the pods created, acting as a worker node for the k3s cluster. We deploy k3s as an agent node, as explained in the k3s quick start guide:
`curl -sfL https://get.k3s.io | K3S_URL=https://myserver:6443 K3S_TOKEN=mynodetoken sh -`
The value to use for K3S_TOKEN is stored at `/var/lib/rancher/k3s/server/node-token` on the server node.
Verify that the VM has been added as a node to the cluster by running `kubectl get nodes` on the control plane and looking for the hostname of the worker VM in the list of nodes.
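Putting these steps together (the IP address, token and hostname are placeholders to replace with your own values):
```bash
# On the control-plane VM: retrieve the join token.
sudo cat /var/lib/rancher/k3s/server/node-token

# On the worker VM: install k3s as an agent pointing to the control plane.
curl -sfL https://get.k3s.io | K3S_URL=https://<control-plane-ip>:6443 K3S_TOKEN=<node-token> sh -

# Back on the control plane: the worker's hostname should now appear.
kubectl get nodes
```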
### Delegate pods to the worker node
In order for the pods to be executed on another node we need to modify how we construct the Argo YAML and add a node selector to the spec. We have added the needed attributes to the `Spec` struct in `oc-monitord` on the `test-ram` branch.
```go
type Spec struct {
	ServiceAccountName string                `yaml:"serviceAccountName"`
	Entrypoint         string                `yaml:"entrypoint"`
	Arguments          []Parameter           `yaml:"arguments,omitempty"`
	Volumes            []VolumeClaimTemplate `yaml:"volumeClaimTemplates,omitempty"`
	Templates          []Template            `yaml:"templates"`
	Timeout            int                   `yaml:"activeDeadlineSeconds,omitempty"`
	NodeSelector       struct {
		NodeRole string `yaml:"node-role"`
	} `yaml:"nodeSelector"`
}
```
and set the selector in the `CreateDAG()` method:
```go
b.Workflow.Spec.NodeSelector.NodeRole = "worker"
```
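For this selector to match, the worker node must carry the corresponding label. A minimal way to set it and to check where the pods land (hostnames are placeholders):
```bash
# Label the worker node so that pods with nodeSelector "node-role: worker" can be scheduled on it.
kubectl label node <worker-hostname> node-role=worker

# After launching an execution, verify that the Argo pods run on the worker node.
kubectl get pods -A -o wide | grep <worker-hostname>
```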
## Container monitoring
Docker Compose file used to instantiate the monitoring stack:
- Prometheus: stores the data
- cAdvisor: monitors the containers
```yml
version: '3.2'
services:
  prometheus:
    image: prom/prometheus:latest
    container_name: prometheus
    ports:
      - 9090:9090
    command:
      - --config.file=/etc/prometheus/prometheus.yml
    volumes:
      - ./prometheus.yml:/etc/prometheus/prometheus.yml:ro
    depends_on:
      - cadvisor
  cadvisor:
    image: gcr.io/cadvisor/cadvisor:latest
    container_name: cadvisor
    ports:
      - 9999:8080
    volumes:
      - /:/rootfs:ro
      - /var/run:/var/run:rw
      - /sys:/sys:ro
      - /var/lib/docker/:/var/lib/docker:ro
```
Prometheus scraping configuration:
```yml
scrape_configs:
  - job_name: cadvisor
    scrape_interval: 5s
    static_configs:
      - targets:
          - cadvisor:8080
```
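The monitoring stack can then be started from the directory containing the two files above (assuming they are saved as `docker-compose.yml` and `prometheus.yml`):
```bash
# Start Prometheus and cAdvisor in the background.
docker compose up -d

# Prometheus is exposed on port 9090 and cAdvisor on port 9999.
curl -s http://localhost:9090/-/ready
curl -s -o /dev/null -w '%{http_code}\n' http://localhost:9999/metrics
```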
## Dashboards
In order to monitor the resource consumption during our tests we need to create dashboards in Grafana.
We create 4 different queries using Prometheus as the data source; each query can be entered in `code` mode as a PromQL expression.
### OC stack consumption
```
sum(container_memory_usage_bytes{name=~"oc-auth|oc-datacenter|oc-scheduler|oc-front|oc-schedulerd|oc-workflow|oc-catalog|oc-peer|oc-workspace|loki|mongo|traefik|nats"})
```
### Monitord consumption
```
sum(container_memory_usage_bytes{image="oc-monitord"})
```
### Total RAM consumption
```
sum(
  container_memory_usage_bytes{name=~"oc-auth|oc-datacenter|oc-scheduler|oc-front|oc-schedulerd|oc-workflow|oc-catalog|oc-peer|oc-workspace|loki|mongo|traefik|nats"}
  or
  container_memory_usage_bytes{image="oc-monitord"}
)
```
### Number of monitord containers
```
count(container_memory_usage_bytes{image="oc-monitord"} > 0)
```
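These expressions can also be tested directly against the Prometheus HTTP API before building the Grafana panels (Prometheus being reachable on localhost:9090 as configured above):
```bash
# Evaluate the "number of monitord containers" query through the Prometheus API.
curl -s 'http://localhost:9090/api/v1/query' \
  --data-urlencode 'query=count(container_memory_usage_bytes{image="oc-monitord"} > 0)'
```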
# Launch executions
We use a script to insert into the DB the executions that will create the monitord containers.
We need to retrieve two pieces of information to run the scripted insertion:
- The **workflow id** of the workflow we want to instantiate, which can be found in the DB
- A **token** to authenticate against the API: connect to oc-front and retrieve the token from your browser's network analyzer tool.
Add these to the `insert_exex.sh` script.
The script takes two arguments:
- **$1**: the number of executions, created in chunks of 10 by using a CRON expression that generates 10 executions per execution/namespace
- **$2**: the number of minutes between now and the execution time for the executions.
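For example, to create 100 executions scheduled 5 minutes from now (with the token and workflow id filled in at the top of the script; script name as referenced above):
```bash
# 100 executions in chunks of 10, scheduled 5 minutes from now.
bash insert_exex.sh 100 5
```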


@@ -0,0 +1,72 @@
#!/bin/bash
# Inserts scheduled workflow executions through the oc-scheduler API.
# $1: number of executions (created in chunks of 10), $2: minutes from now until execution time.

TOKEN=""
WORKFLOW=""

NB_EXEC=$1
TIME=$2

if [ -z "$NB_EXEC" ]; then
    NB_EXEC=1
fi

# if (( NB_EXEC % 10 != 0 )); then
#     echo "Please use a round number"
#     exit 0
# fi

if [ -z "$TIME" ]; then
    TIME=1
fi

EXECS=$(((NB_EXEC+9) / 10))
echo EXECS=$EXECS

DAY=$(date +%d -u)
MONTH=$(date +%m -u)
HOUR=$(date +%H -u)
MINUTE=$(date -d "$TIME min" +"%M" -u)
SECOND=$(date +%s -u)

start_loop=$(date +%s)

for ((i = 1; i <= $EXECS; i++)); do
    (
        start_req=$(date +%s)
        echo "Exec $i"

        # Cron range 0-10 in the first field: each request fans out into the chunk of ~10 executions described in the docs.
        CRON="0-10 $MINUTE $HOUR $DAY $MONTH *"
        echo "$CRON"

        START="2025-$MONTH-$DAY"T"$HOUR:$MINUTE:00.012Z"
        # Force base-10 arithmetic so months 08 and 09 are not parsed as invalid octal numbers.
        END_MONTH=$(printf "%02d" $((10#$MONTH + 1)))
        END="2025-$END_MONTH-$DAY"T"$HOUR:$MINUTE:00.012Z"

        # PAYLOAD=$(printf '{"id":null,"name":null,"cron":"","mode":1,"start":"%s","end":"%s"}' "$START" "$END")
        PAYLOAD=$(printf '{"id":null,"name":null,"cron":"%s","mode":1,"start":"%s","end":"%s"}' "$CRON" "$START" "$END")
        # echo $PAYLOAD

        curl -X 'POST' "http://localhost:8000/scheduler/$WORKFLOW" \
            -H 'accept: application/json' \
            -H 'Content-Type: application/json' \
            -d "$PAYLOAD" \
            -H "Authorization: Bearer $TOKEN" -w '\n'

        end=$(date +%s)
        duration=$((end - start_req))
        echo "Start $start_req"
        echo "End $end"
        echo "Execution $i duration: $duration seconds"
    )&
done

wait

end_loop=$(date +%s)
total_time=$((end_loop - start_loop))
echo "Total execution time: $total_time seconds"


@@ -0,0 +1,43 @@
We used a very simple mono-node workflow which executes a simple sleep command within an alpine container.
![](wf_test_ram_1node.png)
# 10 monitors
![alt text](10_monitors.png)
# 100 monitors
![alt text](100_monitors.png)
# 150 monitors
![alt text](150_monitors.png)
# Observations
We see an increase in the memory usage of the OC stack, which initially is around 600/700 MiB:
```
CONTAINER ID NAME CPU % MEM USAGE / LIMIT MEM % NET I/O BLOCK I/O PIDS
7ce889dd97cc oc-auth 0.00% 21.82MiB / 11.41GiB 0.19% 125MB / 61.9MB 23.3MB / 5.18MB 9
93be30148a12 oc-catalog 0.14% 17.52MiB / 11.41GiB 0.15% 300MB / 110MB 35.1MB / 242kB 9
611de96ee37e oc-datacenter 0.32% 21.85MiB / 11.41GiB 0.19% 38.7MB / 18.8MB 14.8MB / 0B 9
dafb3027cfc6 oc-front 0.00% 5.887MiB / 11.41GiB 0.05% 162kB / 3.48MB 1.65MB / 12.3kB 7
d7601fd64205 oc-peer 0.23% 16.46MiB / 11.41GiB 0.14% 201MB / 74.2MB 27.6MB / 606kB 9
a78eb053f0c8 oc-scheduler 0.00% 17.24MiB / 11.41GiB 0.15% 125MB / 61.1MB 17.3MB / 1.13MB 10
bfbc3c7c2c14 oc-schedulerd 0.07% 15.05MiB / 11.41GiB 0.13% 303MB / 293MB 7.58MB / 176kB 9
304bb6a65897 oc-workflow 0.44% 107.6MiB / 11.41GiB 0.92% 2.54GB / 2.65GB 50.9MB / 11.2MB 10
62e243c1c28f oc-workspace 0.13% 17.1MiB / 11.41GiB 0.15% 193MB / 95.6MB 34.4MB / 2.14MB 10
3c9311c8b963 loki 1.57% 147.4MiB / 11.41GiB 1.26% 37.4MB / 16.4MB 148MB / 459MB 13
01284abc3c8e mongo 1.48% 86.78MiB / 11.41GiB 0.74% 564MB / 1.48GB 35.6MB / 5.35GB 94
14fc9ac33688 traefik 2.61% 49.53MiB / 11.41GiB 0.42% 72.1MB / 72.1MB 127MB / 2.2MB 13
4f1b7890c622 nats 0.70% 78.14MiB / 11.41GiB 0.67% 2.64GB / 2.36GB 17.3MB / 2.2MB 14
Total 631.2 Mb
```
However, over time, with the repetition of a large number of scheduling operations, the stack uses a larger amount of RAM.
In particular, **loki**, **nats**, **mongo**, **oc-datacenter** and **oc-workflow** each grow to over 150 MiB. This can be explained by the cache growing in these containers, which seems to be freed every time the containers are restarted.

Binary file added (16 KiB, not shown)

BIN performance_test (new file, 16 KiB, not shown)