Compare commits
16 commits: fba603c9a6...master

Commits: 8e74e2b399, 6722c365fd, 3da3ada710, a9b5f6dcad, a10021fb98, 572ab5d0c4, 4dca4b3a51, 0cff31b32f, 91c272d58f, 55794832ad, a0340e41b0, f93d5a662b, faa21b5da9, 6ae9655ca0, b31134c6cd, 22e22b98b4
BIN  docs/S3/img/argo-watch-executing.gif  (new file, 3.5 MiB)
BIN  docs/S3/img/ns-creation-after-booking.gif  (new file, 2.5 MiB)
BIN  docs/S3/img/secrets-created-in-s3.gif  (new file, 1.9 MiB)
BIN  docs/S3/img/workflow.png  (new file, 124 KiB)
44  docs/S3/reparted-S3-readme.md  (new file)
@@ -0,0 +1,44 @@

# Allowing reparted Pods to use S3 storage

As a first way to transfer data from one processing node to another, we have implemented the mechanics that allow a pod to access a bucket on an S3-compatible server which is not on the same Kubernetes cluster.

For this we will use an example workflow run with Argo and Admiralty on the *Control* node, with the **curl** and **mosquitto** processings executing on the Control node and the other processings on the *Target01* node.
To transfer data we will use the **S3** and **output/input** annotations handled by Argo, using two *Minio* servers on Control and Target01.

![schema of the workflow](img/workflow.png)

When the user launches a booking on the UI, a request is sent to **oc-scheduler**, which:

- Checks whether another booking is already scheduled at the requested time
- Creates the booking and workflow executions in the DB
- Creates the namespace, service accounts and rights needed for Argo to execute

![namespace creation after booking](img/ns-creation-after-booking.gif)

We added another action to the existing calls made to **oc-datacenter**.

**oc-scheduler** retrieves all the storage resources in the workflow and, for each of them, retrieves the *computing* resources that host a processing resource using that storage resource. Here we have:

- Minio Control:
  - Control (via the first cURL)
  - Target01 (via imagemagic)

- Minio Target01:
  - Control (via alpine)
  - Target01 (via cURL, openalpr and mosquitto)

If the computing and storage resources are on the same node, **oc-scheduler** sends an empty POST request to the route, and **oc-datacenter** creates the credentials on the S3 server and stores them in a Kubernetes secret in the execution's namespace.

If the two resources are on different nodes, **oc-scheduler** sends a POST request stating that it needs to retrieve the credentials, reads the response and calls the appropriate **oc-datacenter** to create a Kubernetes secret. This means that, with three nodes:

- A, from which the workflow is scheduled
- B, where the storage is
- C, where the computing is

A can contact B to retrieve the credentials, post them to C for storage and then run an Argo Workflow, from which a pod will be deported to C and will be able to access the S3 server on B.
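
As a rough illustration of this three-node exchange, here is a minimal sketch using `curl`, `jq` and `kubectl`. The oc-datacenter route, the JSON fields and the secret name are placeholders chosen for the example, not the actual Open Cloud API.

```bash
# Hypothetical three-node credential flow (routes, fields and names are assumptions).

# 1. From node A, ask oc-datacenter on node B to create S3 credentials and return them.
CREDS=$(curl -s -X POST "https://oc-datacenter.node-b.example/storage/credentials" \
  -H "Authorization: Bearer $TOKEN" \
  -d '{"return_credentials": true}')

ACCESS_KEY=$(echo "$CREDS" | jq -r '.access_key')
SECRET_KEY=$(echo "$CREDS" | jq -r '.secret_key')

# 2. Still from A, hand the credentials over to node C so that they end up as a
#    Kubernetes secret in the execution's namespace (shown here directly with kubectl).
kubectl --context node-c -n "$EXEC_NAMESPACE" create secret generic minio-node-b-creds \
  --from-literal=access-key="$ACCESS_KEY" \
  --from-literal=secret-key="$SECRET_KEY"

# 3. The Argo Workflow submitted from A can now reference that secret in pods
#    delegated to C, which can then reach the S3 server hosted on B.
```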

![execution of the workflow](img/argo-watch-executing.gif)

# Final

We can see that the different processings are able to access the required data on the different storage resources, and that our ALPR analysis is sent to the mosquitto server and to the HTTP endpoint we set in the last cURL.

![secrets created in minio](img/secrets-created-in-s3.gif)
33  docs/WP/authentication_access_control.md  (new file)
@@ -0,0 +1,33 @@

## General architecture

Each OpenCloud instance will provide an OpenId interface. This interface may be connected to an existing LDAP server or to a dedicated one.
The main advantage of this distributed solution is that each partner will manage its own users and profiles. It simplifies access control management, as each peer does not have to be aware of other peers' users, but only defines access rules globally for the peer.

## Users / roles / groups

Users in OpenCloud belong to a peer (company); they may be part of groups within the company (organisational unit, project, ...).
Within those groups, or globally for the peer, they may have different roles (project manager, workflow designer, accountant, ...).
Roles define the list of permissions granted to that role.

## User permissions definition

Each OpenCloud instance will manage its users and their permissions through the user/group/role scheme defined in the previous chapter.
On a local instance, basic permissions are:

* a user has permission to start a distributed workflow using remote peers
* a user has permission to view financial information on the instance
* a user has permission to change the service exchange rates

On a remote instance, basic permissions are:

* execute workflows (quota + peers subset?)
* store data (quota + peers subset?)


## Authentication process

Each OpenCloud peer will accept a company/group as a whole.
Upon user connection, it will receive the user's rights from the originating OpenId Connect server and apply them, e.g. specific pricing for a group (company agreement, project agreement, ...).
A collaborative workspace


## Resources don't have a static url

They will map to an internal url of the service.
Once a workflow is initialized and ready for launch, temporary urls proxying to the real service will be provided to the workflow at booking time.
0  docs/WP/oc-deploy.md  (new, empty file)
14  docs/WP/oc-peer.md  (new file)
@@ -0,0 +1,14 @@

# Description

This component holds a database of all known peers.
It also performs the required operations when receiving a new peer/group request:

* Shows the peer's identity/certificates
* Accepts or rejects a peer/group as a partner
* Defines the allowed services
* Defines visibility
* Creates a dedicated namespace and quotas if the peer is allowed to use our compute
* Defines storage quotas
* Generates access keys for the services
* Returns the answer and interfacing data to the requester

1  docs/WP/oc-sync.md  (new file)
@@ -0,0 +1 @@

This service offers realtime shared data synchronization between OpenCloud instances.
71  docs/WP/workflow_design.md  (new file)
@@ -0,0 +1,71 @@

## Workflow design rules

1. A data resource may be directly linked to a processing
1. A processing resource is always linked to the next processing resource
1. If two processing resources need to exchange file data, they need to be connected to the same file storage(s)
1. A processing shall be linked to a computing resource


### Data - Processing link

A data resource may be linked to a processing resource.
The processing resource shall be compatible with the data resource's format and API.

#### Example 1 : Trivial download

For a simple example:

* the data resource provides an http url to download a file
* the processing resource is a simple curl command that downloads the file onto the current computing resource (see the sketch below)
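
A minimal sketch of such a trivial download processing, assuming the data resource's url is injected into the container as `DATA_URL` (the variable name and output path are illustrative):

```bash
# Download the file provided by the data resource onto the current computing resource.
# DATA_URL and the output path are placeholders for whatever the workflow injects.
curl -fSL -o /tmp/input-file "$DATA_URL"
```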

#### Example 2 : Advanced download processing resource

For a more specific example:

* the data resource is a complex data archive
* the processing resource is a complex download component that can be configured to retrieve specific data or datasets from the archive


### Processing - Processing link

Dependent processings must be graphically linked; those links are used to build the workflow's acyclic diagram.


### Processing - Storage link

A processing may be linked to one or several storage resources.

#### Basic storage resource types

Storage resource types generally require a list of source - destination information that describes read/write operations.
This information is associated to the link between the processing and the storage resources, as illustrated below.

*In the case of a write to storage operation:*

* the source information specifies the local path/filename in the container where the file is created by the processing
* the destination information contains the url/path/filename where the file shall be stored

*In the case of a read from storage operation:*

* the source information specifies the url/path/filename where the file is stored
* the destination information contains the local path/filename in the container where the file shall be created for the processing to use it
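
Purely as an illustration of these source/destination pairs (the actual link annotation format is not defined here), the two cases map to S3 transfers such as the following; bucket and path names are placeholders:

```bash
# Write to storage: source = local path in the container, destination = storage url/path.
aws s3 cp /data/out/result.csv s3://my-bucket/runs/42/result.csv

# Read from storage: source = storage url/path, destination = local path in the container.
aws s3 cp s3://my-bucket/inputs/config.json /data/in/config.json
```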

##### Local cluster storage

The generated Argo workflow defines a local storage available to all containers in the current cluster.
This storage is available from every container under the path defined in the $LOCALSTORAGE environment variable.
On this special storage, as it is mounted in all containers, the source - destination information is implicit.
Any data can be read or written directly.
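
For instance, two processings in the same workflow can exchange a file through this mount without any explicit source/destination declaration (the file name below is illustrative):

```bash
# Producer step: write an intermediate result onto the shared local storage.
echo "some intermediate result" > "$LOCALSTORAGE/step1-output.txt"

# Consumer step, in any other container of the same workflow: read it back directly.
cat "$LOCALSTORAGE/step1-output.txt"
```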

##### S3 type storages

Several S3-compatible storages may be used in a workflow:

* OCS3 : the global MinIO deployed in the local OpenCloud instance
* Generic S3 : any externally accessible S3-compatible service
* WFS3 : an internal MinIO instance deployed for the workflow duration; that instance might be exposed outside the current cluster

##### Custom storage types


### Processing - Computing link

A processing shall be connected to a computing resource.

Argo volcano ?
BIN  docs/admiralty/Capture d’écran du 2025-05-20 16-03-39.png  (new file, 31 KiB)
BIN  docs/admiralty/Capture d’écran du 2025-05-20 16-04-21.png  (new file, 31 KiB)
BIN  docs/admiralty/auth_schema.jpg  (new file, 91 KiB)
90  docs/admiralty/authentication.md  (new file)
@@ -0,0 +1,90 @@

# Current authentication process

We are currently able to authenticate against a remote `Admiralty Target` to execute pods from the `Source` cluster in a remote cluster, in the context of an `Argo Workflow`. The resulting artifacts or data can then be retrieved in the source cluster.

In this document we present the steps needed for this authentication process, its flaws and the improvements we could make.

![](auth_schema.jpg)

## Requirements

### Namespace

The same `namespace` needs to exist in each cluster. Both namespaces need to have the same resources available, meaning here that Argo must be deployed in the same way.

> We haven't tested it yet, but the `version` of Argo Workflows should probably be the same in order to prevent mismatches between functionalities.

### ServiceAccount

A `serviceAccount` with the same name must be created on each side of the cluster federation.

In the case of Argo Workflows, it will be used to submit the workflow with the `Argo CLI`, or should be specified in the `spec.serviceAccountName` field of the Workflow.
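
A minimal sketch of this requirement, to be run on both clusters; the namespace and serviceAccount names are placeholders:

```bash
# Run the same commands on the Source and on the Target cluster.
kubectl create namespace your-namespace
kubectl -n your-namespace create serviceaccount shared-sa
```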

#### Roles

Given that the `serviceAccount` will be the same in both clusters, it must be bound to the appropriate `role` in order to execute both the Argo Workflow and the Admiralty actions.

So far we have only seen the need to add the `patch` verb on `pods` for the `apiGroup` "" in `argo-role`.

Once the role is patched, the `serviceAccount` that will be used must be added to the rolebinding `argo-binding`.
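
A possible way to apply these two steps with `kubectl patch`; the serviceAccount name is a placeholder, while the role and rolebinding names are the ones mentioned above:

```bash
# Add the "patch" verb on pods (apiGroup "") as an extra rule of the existing argo-role.
kubectl -n your-namespace patch role argo-role --type='json' \
  -p='[{"op":"add","path":"/rules/-","value":{"apiGroups":[""],"resources":["pods"],"verbs":["patch"]}}]'

# Add the shared serviceAccount to the subjects of the argo-binding rolebinding.
kubectl -n your-namespace patch rolebinding argo-binding --type='json' \
  -p='[{"op":"add","path":"/subjects/-","value":{"kind":"ServiceAccount","name":"shared-sa","namespace":"your-namespace"}}]'
```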

### Token

In order to authenticate against the Kubernetes API, we need to provide the Admiralty `Source` with a token stored in a secret. This token is created on the `Target` for the `serviceAccount` that we will use in the Admiralty communication. After copying the kubeconfig, we replace the IP in it with the IP that the source will target to reach the k8s API. The token generated for the serviceAccount is added in the "user" part of the kubeconfig.

This **edited kubeconfig** is then passed to the source cluster and converted into a secret, bound to the Admiralty `Target` resource. It is presented to the k8s API on the target cluster, first as part of the TLS handshake and then to authenticate the serviceAccount that performs the pod delegation.
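
A sketch of this preparation, reusing the placeholder names from above; the kubeconfig editing itself is done by hand or by a script and is not shown:

```bash
# On the Target cluster: create a token for the shared serviceAccount.
TOKEN=$(kubectl -n your-namespace create token shared-sa --duration=24h)

# Copy the Target kubeconfig, replace the API server address with the one reachable from
# the Source, and put $TOKEN in its "user" section, producing ./edited-kubeconfig.yaml.

# On the Source cluster: store the edited kubeconfig as the secret referenced by the
# Target resource (key name "config" is an assumption, as in Admiralty's examples).
kubectl -n your-namespace create secret generic secret-holding-kubeconfig-info \
  --from-file=config=./edited-kubeconfig.yaml
```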

### Source/Target

Each cluster in the Admiralty federation needs to declare **all of the other clusters**:

- that it will delegate pods to, with the `Target` resource

```yaml
apiVersion: multicluster.admiralty.io/v1alpha1
kind: Target
metadata:
  name: some-name
  namespace: your-namespace
spec:
  kubeconfigSecret:
    name: secret-holding-kubeconfig-info
```

- that it will accept pods from, with the `Source` resource

```yaml
apiVersion: multicluster.admiralty.io/v1alpha1
kind: Source
metadata:
  name: some-name
  namespace: your-namespace
spec:
  serviceAccountName: service-account-used-by-source
```


## Caveats

### Token

By default, a token created by the Kubernetes API is only valid for **1 hour**, which can pose a problem for:

- workflows taking more than 1 hour to execute, with pods requesting creation on a remote cluster after the token has expired

- retransferring the modified `kubeconfig`: we need a way to securely communicate this data between two clusters running Open Cloud

It is possible to create tokens with an **infinite duration** (in reality 10 years), but the Admiralty documentation **advises against** this for security reasons.

### Resources' names

When coupling Argo Workflows with a MinIO server to store the artifacts produced by a pod, we need to access, for example but not only, a secret containing the authentication data. If we launch a workflow on clusters A and B, the secret resource containing the auth data can't have the same content in cluster A and in cluster B.

At the moment, the only place where we have faced this issue is the MinIO S3 storage access. Since it is a service that we could deploy ourselves, we would have the possibility of using names containing a UUID linked to the OC instance.

## Possible improvements

- Pod-bound tokens: can they be issued to the remote cluster via an HTTP API call (see the sketch below)? [doc](https://kubernetes.io/docs/reference/kubernetes-api/authentication-resources/token-request-v1/)

- Using a service that contacts its counterpart in the target cluster to ask for a token with a validity set by the user in the workflow workspace. Communication would be over HTTPS, but how do we generate secure certificates on both ends?
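
A hedged sketch of the TokenRequest call mentioned in the first item above; the API server address, namespace, serviceAccount and expiration are placeholders:

```bash
# Request a short-lived token for the shared serviceAccount through the TokenRequest subresource.
curl -sk -X POST \
  -H "Authorization: Bearer $ADMIN_TOKEN" \
  -H "Content-Type: application/json" \
  "https://<target-api-server>:6443/api/v1/namespaces/your-namespace/serviceaccounts/shared-sa/token" \
  -d '{"apiVersion":"authentication.k8s.io/v1","kind":"TokenRequest","spec":{"expirationSeconds":3600}}'
```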
87  docs/admiralty/deployment.md  (new file)
@@ -0,0 +1,87 @@

# Deploying Admiralty on an Open Cloud cluster

We have written two playbooks, available on a private [GitHub repo](https://github.com/pi-B/ansible-oc/tree/384a5acc0713a0fa013a82f71fbe2338bf6c80c1/Admiralty):

- `deploy_admiralty.yml` installs Helm and the necessary charts in order to run Admiralty on the cluster

# Ansible playbook

ansible-playbook deploy_admiralty.yml -i <REMOTE_HOST_IP>, --extra-vars "user_prompt=<YOUR_USER>" --ask-pass

```yaml
- name: Install Helm
  hosts: all:!localhost
  user: "{{ user_prompt }}"
  become: true
  # become_method: su
  vars:
    arch_mapping: # Map ansible architecture {{ ansible_architecture }} names to Docker's architecture names
      x86_64: amd64
      aarch64: arm64

  tasks:
    - name: Check if Helm does exist
      ansible.builtin.command:
        cmd: which helm
      register: result_which
      failed_when: result_which.rc not in [ 0, 1 ]

    - name: Install helm
      when: result_which.rc == 1
      block:
        - name: download helm from source
          ansible.builtin.get_url:
            url: https://get.helm.sh/helm-v3.15.0-linux-amd64.tar.gz
            dest: ./

        - name: unpack helm
          ansible.builtin.unarchive:
            remote_src: true
            src: helm-v3.15.0-linux-amd64.tar.gz
            dest: ./

        - name: copy helm to path
          ansible.builtin.command:
            cmd: mv linux-amd64/helm /usr/local/bin/helm

- name: Install admiralty
  hosts: all:!localhost
  user: "{{ user_prompt }}"

  tasks:
    - name: Install required python libraries
      become: true
      # become_method: su
      package:
        name:
          - python3
          - python3-yaml
        state: present

    - name: Add jetstack repo
      ansible.builtin.shell:
        cmd: |
          helm repo add jetstack https://charts.jetstack.io && \
          helm repo update

    - name: Install cert-manager
      kubernetes.core.helm:
        chart_ref: jetstack/cert-manager
        release_name: cert-manager
        context: default
        namespace: cert-manager
        create_namespace: true
        wait: true
        set_values:
          - value: installCRDs=true

    - name: Install admiralty
      kubernetes.core.helm:
        name: admiralty
        chart_ref: oci://public.ecr.aws/admiralty/admiralty
        namespace: admiralty
        create_namespace: true
        chart_version: 0.16.0
        wait: true
```
@@ -1,27 +0,0 @@
# General architecture

Each OpenCloud instance will provide an OpenId interface. This interface may be connected to an existing LDAP Server or a dedicated one.
The main advanytage of this distributed solution is that each partner will manage it's own iusers and profiles. It simplifies access control management as each peer does not have to be aware of other peers users, but will only define access rules globally for the peers.

# Users / roles / groups


# User permissions definition

Each OpenCloud instance will manage it's users and their permissions :
On a local instance :
* a user has permission to start a distributed workflow in using remote peers
* a user has administrative rights and may change the service exchenge rates
* a user is limited to view financial information on the instance
* a user belongs to a group (that may represent a project, a department,...)

# Authentication process

Each OpenCloud peer will accept a company as a whole.
Upon user connection, it will receive user rights form the origninating OpenId connect server and apply them. ex: specific pricing for a group (company agreement, project agreement, ...)
A collaborative workspace


# Resources don't have an url
They will map to an internal url of the service
Once a workflow is initialized and ready for launch temporary urls proxying to the real service will be provided to the wokflow at booking time
@@ -220,6 +220,7 @@ class Storage {
    "support" : "string"
}

Resource -- Owner

Resource <|- Data

@@ -1,28 +0,0 @@
K8s Trex OK with alpha-level isolation
Bastion VM used to reach the HPC

2 managers instead of in the cluster

Helm deployment

No Helm,
a single namespace,
no access to the control plane
=> VCluster

Argo workflow

Contract with ATOS to be terminated


=> in the long run, a dedicated OpenShift cluster

====================

Access the cluster and test



========
Service operator account that owns the secrets in the namespace

@@ -43,7 +43,7 @@ To force routing information update :

bee generate routers

## GUI compoenents
## GUI components

The GUI are developped using Flutter framework

37  docs/glossary.md  (new file)
@@ -0,0 +1,37 @@

# Glossary

## Resource

An OpenCloud resource is an item that is shareable by any OpenCloud partner.
It may be:
* A data item
* An algorithm
* A compute unit
* A storage facility
* A workflow referring to any of the previous items

## Catalog

The OpenCloud catalog contains a resource metadata list.

## Workspace

A workspace is a user-selected set of resources.

## Workflow

A workflow is the processing of multiple resources.

## Service

A service is a deployment of permanent resources.

## Collaborative area

A collaborative area is an environment for sharing workspaces / workflows / services between selected partners.

## Rule book

A list of rules that a shared workspace shall conform to.

69  docs/minio.md  (new file)
@@ -0,0 +1,69 @@

# Setting up

Minio can be deployed using the Argo Workflows [documentation](https://argo-workflows.readthedocs.io/en/latest/configure-artifact-repository/#configuring-minio) or by using the ansible playbook written by `pierre.bayle[at]irt-saintexupery.com`, available [here](https://raw.githubusercontent.com/pi-B/ansible-oc/refs/heads/main/deploy_minio.yml?token=GHSAT0AAAAAAC5OBWUCGHWPA4OUAKHBKB4GZ4YTPGQ).

Launch the playbook with `ansible-playbook -i [your host ip or url], deploy_minio.yml --extra-vars "user_prompt=[your user]" [--ask-become-pass]`

- If your user doesn't have the `NOPASSWD` rights on the host, use `--ask-become-pass` to allow ansible to use `sudo`
- Fill in the values for `memory_req`, `storage_req` and `replicas` in the playbook's vars. The pods won't necessarily use them fully, but if the total memory or storage request of your pod pool exceeds your host's capacity the deployment might fail.


## Flaws of the default install

- Requests 16Gi of memory per pod
- Requests 500Gi of storage
- Creates 16 replicas
- Doesn't expose the MinIO GUI outside of the cluster

# Allow API access

Visit the MinIO GUI (on port 9001), create the bucket(s) you will use (here `oc-bucket`) and a set of access keys, then create a secret in the argo namespace:

```
kubectl create secret -n [name of your argo namespace] generic argo-artifact-secret \
  --from-literal=access-key=[your access key] \
  --from-literal=secret-key=[your secret key]
```

- Create a ConfigMap, which will be used by argo to configure the S3 artifact repository; its content must match the names and keys of the previously created secret

```
apiVersion: v1
kind: ConfigMap
metadata:
  # If you want to use this config map by default, name it "artifact-repositories". Otherwise, you can provide a reference to a
  # different config map in `artifactRepositoryRef.configMap`.
  name: artifact-repositories
  # annotations:
  #   # v3.0 and after - if you want to use a specific key, put that key into this annotation.
  #   workflows.argoproj.io/default-artifact-repository: oc-s3-artifact-repository
data:
  oc-s3-artifact-repository: |
    s3:
      bucket: oc-bucket
      endpoint: [ retrieve the cluster IP with kubectl get service argo-artifacts -o jsonpath="{.spec.clusterIP}" ]:9000
      insecure: true
      accessKeySecret:
        name: argo-artifact-secret
        key: access-key
      secretKeySecret:
        name: argo-artifact-secret
        key: secret-key
```

# Store Argo Workflow objects in a MinIO S3 bucket

Here is an example of how to store a file/dir from an argo pod to an existing S3 bucket:

```
outputs:
  parameters:
    - name: outfile [or OUTDIR]
      value: [NAME OF THE FILE OR DIR TO STORE]
  artifacts:
    - name: outputs
      path: [PATH OF THE FILE OR DIR IN THE CONTAINER]
      s3:
        key: [PATH OF THE FILE IN THE BUCKET].tgz
```
@@ -1,11 +1,123 @@
OpenCloud is an open source distributed cloud solution.
It allows selectively sharing/selling/renting your infrastrucure resources (data, algorithm, compute, storage) with other OpenCloud peers.
It allows distributed workflow execution between partners.
Distributed execution in that peer to peer network can be organized depending on your own priorites :
- maximal sovereingty
- speed up calculation
- minimising production cost
- optimizing your infrasturcutre investments
OpenCloud provides an OpenId based distributed authentication system.
OpenCloud is a fully distributed solution, without any central organization or SPOF.
OpenCloud provides trnasaction tracking, for every partner to be aware of it's distributed resource consumption and ensure peer to peer billing.
# Introduction

OpenCloud is an open-source, distributed cloud solution that enables you to selectively share, sell, or rent your infrastructure resources, such as data, algorithms, compute power, and storage, with other OpenCloud peers. It facilitates distributed workflow execution between partners, allowing seamless collaboration across decentralized networks.

Distributed execution within this peer-to-peer network can be optimized according to your own priorities:

* **Maximal sovereignty**
* **Accelerated computation**
* **Cost minimization**
* **Optimized infrastructure investments**

Each OpenCloud instance includes an OpenID-based distributed authentication system.
OpenCloud is entirely decentralized, with no central authority or single point of failure (SPOF). Additionally, OpenCloud provides transaction tracking, allowing all partners to be aware of their distributed resource consumption and ensuring transparent peer-to-peer billing.

---

## Features

Each OpenCloud instance runs a collection of services that allow users to interact with both their own deployment and other OpenCloud participants.

### Resource Catalog

The **Resource Catalog** service indexes all the resources provided by the current instance, including **Data**, **Algorithms**, **Compute Units**, **Storages**, and pre-built **Processing Workflows**.
All resources are described by metadata, as defined in the `catalog_metadata` document. Catalog resources can be either **public**, visible to all OpenCloud peers, or **private**, accessible only to selected partners or groups (e.g., projects, entities, etc.).
Access to specific resources may require credentials, payment, or other access agreements.

---

### Workspace Management

Each OpenCloud user can create **workspaces** to organize resources of interest.
Resources within a workspace can later be used to build processing workflows or set up new services.
Users can define as many workspaces as needed to manage their projects efficiently.

---

### Workflow Editor

Using elements selected in a workspace, a user can build a **distributed processing workflow** or establish a **permanent service**.
Workflows are constructed with OpenCloud's integrated workflow editor, offering a user-friendly interface for defining distributed processes.

---

### Collaborative Areas

OpenCloud enables the sharing of **workspaces** and **workflows** with selected partners, enhancing collaborative projects.
A **Collaborative Area** can include multiple management and operation rules that are enforced automatically or verified manually. Examples include:

* Enforcing the use of only open-source components
* Restricting the inclusion of personal data
* Defining result visibility constraints
* Imposing legal limitations

---

### Peer Management

OpenCloud allows you to define relationships with other peers, enabling the creation of private communities.
Access rights related to peers can be managed at a **global peer scope** or for **specific groups** within the peer community.

---

## Benefits

### Complete Control Over Data Location

OpenCloud encourages users to host their own data.
When external storage is necessary, OpenCloud enables users to carefully select partners and locations to ensure privacy, compliance, and performance.

---

### Cooperation Framework

OpenCloud provides a structured framework for sharing data, managing common workspaces, and defining usage regulations.
This framework covers both **technical** and **legal aspects** of distributed projects.

---

### Data Redundancy

Like traditional public cloud architectures, OpenCloud supports **data redundancy**, but with finer-grained control.
You can distribute your data across multiple OpenCloud instances, ensuring availability and resilience.

---

### Compatibility with Public Cloud Infrastructure

When your workloads require massive storage or computational capabilities beyond what your OpenCloud peers can provide, you can seamlessly deploy an OpenCloud instance on any public cloud provider.
This hybrid approach allows you to scale effortlessly for workloads that are not sensitive to international competition.

---

### Fine-Grained Access Control

OpenCloud provides **fine-grained access control**, enabling you to precisely define access policies for partners and communities.

---

### Lightweight for Datacenter and Edge Deployments

The OpenCloud stack is developed in **Go**, generating **native code** and minimal **scratch containers**. All selected COTS (Commercial Off-The-Shelf) components used by OpenCloud services are chosen with these design principles in mind.

The objective is to enable OpenCloud to run on almost any platform:

* In **datacenters**, supporting large-scale processing workflows
* On **ARM-based single-board computers**, handling concurrent payloads for diverse applications like **sensor preprocessing**, **image recognition**, or **data filtering**

GUIs are built with **Flutter** and rendered as plain **HTML/JS** for lightweight deployment.

---

### Fully Distributed Architecture

OpenCloud is fully decentralized, eliminating any **single point of failure**.
There is no central administrator, and no central registration is required. This makes OpenCloud highly **resilient**, allowing partners to join or leave the network without impacting the broader OpenCloud community.

---

### Open Source and Transparent

To foster trust, OpenCloud is released as **open-source software**.
Its code is publicly available for audit. The project is licensed under **AGPL V3** to prevent the emergence of closed, private forks that could compromise the OpenCloud community's transparency and trust.

BIN  docs/performance_test/100_monitors.png  (new file, 34 KiB)
BIN  docs/performance_test/10_monitors.png  (new file, 31 KiB)
BIN  docs/performance_test/150_monitors.png  (new file, 30 KiB)
151  docs/performance_test/README.md  (new file)
@@ -0,0 +1,151 @@

# Goals

This originated from a demand to know how much RAM is consumed by Open Cloud when running a large number of workflows at the same time on the same node.

We differentiated between different components:

- The "oc-stack", which is the minimum set of services needed to create and schedule a workflow execution: oc-auth, oc-datacenter, oc-scheduler, oc-front, oc-schedulerd, oc-workflow, oc-catalog, oc-peer, oc-workspace, loki, mongo, traefik and nats

- oc-monitord, which is the daemon instantiated by the scheduling daemon (oc-schedulerd) that creates the YAML for Argo and creates the necessary Kubernetes resources.

We monitor both parts to see how much RAM the oc-stack uses before / during / after the execution, the RAM consumed by the monitord containers, and the total for the stack and monitors.

# Setup

In order to have optimal performance we used a Proxmox server with large resources (>370 GiB RAM and 128 cores) to host the two VMs composing our Kubernetes cluster, with one control plane node where the oc-stack is running and a worker node running only k3s.

## VMs

We instantiated a 2-node Kubernetes (k3s) cluster on the superg PVE (https://superg-pve.irtse-pf.ext:8006/)

### VM Control

This VM runs the oc-stack and the monitord containers; it carries the biggest part of the load. It must have k3s and Argo installed. We allocated **62 GiB of RAM** and **31 cores**.

### VM Worker

This VM holds the workload for all the pods created, acting as a worker node for the k3s cluster. We deploy k3s as an agent node, as explained in the K3s quick start guide:

`curl -sfL https://get.k3s.io | K3S_URL=https://myserver:6443 K3S_TOKEN=mynodetoken sh -`

The value to use for K3S_TOKEN is stored at `/var/lib/rancher/k3s/server/node-token` on the server node.

Verify that the worker has been added as a node to the cluster by running `kubectl get nodes` on the control plane and looking for the hostname of the worker VM in the list of nodes.

### Delegate pods to the worker node

In order for the pods to be executed on another node we need to modify how we construct the Argo YAML, to add a node selector in the workflow spec. We have added the needed attributes to the `Spec` struct in `oc-monitord` on the `test-ram` branch.

```go
type Spec struct {
	ServiceAccountName string                `yaml:"serviceAccountName"`
	Entrypoint         string                `yaml:"entrypoint"`
	Arguments          []Parameter           `yaml:"arguments,omitempty"`
	Volumes            []VolumeClaimTemplate `yaml:"volumeClaimTemplates,omitempty"`
	Templates          []Template            `yaml:"templates"`
	Timeout            int                   `yaml:"activeDeadlineSeconds,omitempty"`
	NodeSelector       struct {
		NodeRole string `yaml:"node-role"`
	} `yaml:"nodeSelector"`
}
```

and added the tag in the `CreateDAG()` method:

```go
b.Workflow.Spec.NodeSelector.NodeRole = "worker"
```
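
For this selector to match, the worker node has to carry the corresponding label. Assuming the `node-role: worker` key/value used above, it can be set with:

```bash
# Label the worker node so that pods carrying the nodeSelector above are scheduled on it
# (replace <worker-hostname> with the node name shown by `kubectl get nodes`).
kubectl label node <worker-hostname> node-role=worker
```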

## Container monitoring

Docker compose to instantiate the monitoring stack:
- Prometheus : stores the data
- cAdvisor : monitors the containers

```yml
version: '3.2'
services:
  prometheus:
    image: prom/prometheus:latest
    container_name: prometheus
    ports:
      - 9090:9090
    command:
      - --config.file=/etc/prometheus/prometheus.yml
    volumes:
      - ./prometheus.yml:/etc/prometheus/prometheus.yml:ro
    depends_on:
      - cadvisor
  cadvisor:
    image: gcr.io/cadvisor/cadvisor:latest
    container_name: cadvisor
    ports:
      - 9999:8080
    volumes:
      - /:/rootfs:ro
      - /var/run:/var/run:rw
      - /sys:/sys:ro
      - /var/lib/docker/:/var/lib/docker:ro
```

Prometheus scraping configuration:

```yml
scrape_configs:
  - job_name: cadvisor
    scrape_interval: 5s
    static_configs:
      - targets:
          - cadvisor:8080
```

## Dashboards

In order to monitor the resource consumption during our tests we need to create dashboards in Grafana.

We create 4 different queries using Prometheus as the data source. For each query we can use the `code` mode to create it from a PromQL query.

### OC stack consumption

```
sum(container_memory_usage_bytes{name=~"oc-auth|oc-datacenter|oc-scheduler|oc-front|oc-schedulerd|oc-workflow|oc-catalog|oc-peer|oc-workspace|loki|mongo|traefik|nats"})
```

### Monitord consumption

```
sum(container_memory_usage_bytes{image="oc-monitord"})
```

### Total RAM consumption

```
sum(
  container_memory_usage_bytes{name=~"oc-auth|oc-datacenter|oc-scheduler|oc-front|oc-schedulerd|oc-workflow|oc-catalog|oc-peer|oc-workspace|loki|mongo|traefik|nats"}
  or
  container_memory_usage_bytes{image="oc-monitord"}
)
```

### Number of monitord containers

```
count(container_memory_usage_bytes{image="oc-monitord"} > 0)
```

# Launch executions

We use a script to insert into the DB the executions that will create the monitord containers.

We need to retrieve two pieces of information to execute the scripted insertion:

- The **workflow id** of the workflow we want to instantiate, which can be found in the DB
- A **token** to authenticate against the API: connect to oc-front and retrieve the token from your browser's network analyzer tool.

Add these to the `insert_exec.sh` script.

The script takes two arguments (an example invocation is shown after this list):
- **$1** : the number of executions, which are created in chunks of 10, using a CRON expression to create 10 executions for each execution/namespace

- **$2** : the number of minutes between now and the execution time for the executions.
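
For example, to schedule 100 executions starting 5 minutes from now (after filling in the TOKEN and WORKFLOW variables at the top of the script):

```bash
# 100 executions, i.e. 10 chunks of 10, scheduled 5 minutes from now.
./insert_exec.sh 100 5
```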
72  docs/performance_test/insert_exec.sh  (new executable file)
@@ -0,0 +1,72 @@

#!/bin/bash

TOKEN=""
WORKFLOW=""

NB_EXEC=$1
TIME=$2

if [ -z "$NB_EXEC" ]; then
    NB_EXEC=1
fi

# if (( NB_EXEC % 10 != 0 )); then
#     echo "Please use a round number"
#     exit 0
# fi

if [ -z "$TIME" ]; then
    TIME=1
fi


EXECS=$(((NB_EXEC+9) / 10))
echo EXECS=$EXECS

DAY=$(date +%d -u)
MONTH=$(date +%m -u)
HOUR=$(date +%H -u)
MINUTE=$(date -d "$TIME min" +"%M" -u)
SECOND=$(date +%s -u)

start_loop=$(date +%s)

for ((i = 1; i <= $EXECS; i++)); do
    (
    start_req=$(date +%s)

    echo "Exec $i"
    CRON="0-10 $MINUTE $HOUR $DAY $MONTH *"
    echo "$CRON"

    START="2025-$MONTH-$DAY"T"$HOUR:$MINUTE:00.012Z"

    # force base-10 arithmetic so months 08 and 09 are not read as invalid octal numbers
    END_MONTH=$(printf "%02d" $((10#$MONTH + 1)))
    END="2025-$END_MONTH-$DAY"T"$HOUR:$MINUTE:00.012Z"

    # PAYLOAD=$(printf '{"id":null,"name":null,"cron":"","mode":1,"start":"%s","end":"%s"}' "$START" "$END")
    PAYLOAD=$(printf '{"id":null,"name":null,"cron":"%s","mode":1,"start":"%s","end":"%s"}' "$CRON" "$START" "$END")

    # echo $PAYLOAD

    curl -X 'POST' "http://localhost:8000/scheduler/$WORKFLOW" \
        -H 'accept: application/json' \
        -H 'Content-Type: application/json' \
        -d "$PAYLOAD" \
        -H "Authorization: Bearer $TOKEN" -w '\n'

    end=$(date +%s)
    duration=$((end - start_req))

    echo "Request start: $start_req"
    echo "Request end: $end"
    echo "Execution time for $i: $duration seconds"
    )&

done

wait

end_loop=$(date +%s)
total_time=$((end_loop - start_loop))
echo "Total execution time: $total_time seconds"
43  docs/performance_test/performance_report.md  (new file)
@@ -0,0 +1,43 @@

We used a very simple mono-node workflow which executes a simple sleep command within an alpine container.

![](wf_test_ram_1node.png)

# 10 monitors

![](10_monitors.png)

# 100 monitors

![](100_monitors.png)

# 150 monitors

![](150_monitors.png)

# Observations

We see an increase in the memory usage of the OC stack, which initially is around 600/700 MiB:

```
CONTAINER ID   NAME            CPU %   MEM USAGE / LIMIT     MEM %   NET I/O           BLOCK I/O         PIDS
7ce889dd97cc   oc-auth         0.00%   21.82MiB / 11.41GiB   0.19%   125MB / 61.9MB    23.3MB / 5.18MB   9
93be30148a12   oc-catalog      0.14%   17.52MiB / 11.41GiB   0.15%   300MB / 110MB     35.1MB / 242kB    9
611de96ee37e   oc-datacenter   0.32%   21.85MiB / 11.41GiB   0.19%   38.7MB / 18.8MB   14.8MB / 0B       9
dafb3027cfc6   oc-front        0.00%   5.887MiB / 11.41GiB   0.05%   162kB / 3.48MB    1.65MB / 12.3kB   7
d7601fd64205   oc-peer         0.23%   16.46MiB / 11.41GiB   0.14%   201MB / 74.2MB    27.6MB / 606kB    9
a78eb053f0c8   oc-scheduler    0.00%   17.24MiB / 11.41GiB   0.15%   125MB / 61.1MB    17.3MB / 1.13MB   10
bfbc3c7c2c14   oc-schedulerd   0.07%   15.05MiB / 11.41GiB   0.13%   303MB / 293MB     7.58MB / 176kB    9
304bb6a65897   oc-workflow     0.44%   107.6MiB / 11.41GiB   0.92%   2.54GB / 2.65GB   50.9MB / 11.2MB   10
62e243c1c28f   oc-workspace    0.13%   17.1MiB / 11.41GiB    0.15%   193MB / 95.6MB    34.4MB / 2.14MB   10
3c9311c8b963   loki            1.57%   147.4MiB / 11.41GiB   1.26%   37.4MB / 16.4MB   148MB / 459MB     13
01284abc3c8e   mongo           1.48%   86.78MiB / 11.41GiB   0.74%   564MB / 1.48GB    35.6MB / 5.35GB   94
14fc9ac33688   traefik         2.61%   49.53MiB / 11.41GiB   0.42%   72.1MB / 72.1MB   127MB / 2.2MB     13
4f1b7890c622   nats            0.70%   78.14MiB / 11.41GiB   0.67%   2.64GB / 2.36GB   17.3MB / 2.2MB    14

Total: 631.2 MiB
```

However, over time, with the repetition of a large number of scheduling operations, the stack uses a larger amount of RAM.

In particular, it seems that **loki**, **nats**, **mongo**, **oc-datacenter** and **oc-workflow** grow over 150 MiB. This can be explained by the cache growing in these containers, which seems to shrink every time the containers are restarted.

BIN  docs/performance_test/wf_test_ram_1node.png  (new file, 16 KiB)
BIN  performance_test  (new file, 16 KiB)