<?xml version="1.0" encoding="utf-8"?><feed xmlns="http://www.w3.org/2005/Atom" ><generator uri="https://jekyllrb.com/" version="3.10.0">Jekyll</generator><link href="http://ahadas.com/feed.xml" rel="self" type="application/atom+xml" /><link href="http://ahadas.com/" rel="alternate" type="text/html" /><updated>2026-01-22T19:06:09+00:00</updated><id>http://ahadas.com/feed.xml</id><title type="html">Arik Hadas</title><subtitle>Yet Another Developer&apos;s Blog</subtitle><entry><title type="html">Production observability of agentic AI with MLflow</title><link href="http://ahadas.com/mlflow-otel/" rel="alternate" type="text/html" title="Production observability of agentic AI with MLflow" /><published>2026-01-22T00:00:00+00:00</published><updated>2026-01-22T00:00:00+00:00</updated><id>http://ahadas.com/mlflow-otel</id><content type="html" xml:base="http://ahadas.com/mlflow-otel/"><![CDATA[<p>It’s been a while since I posted here, but as part of my work on OpenShift AI (that transition could be a topic for another post), I expect to have more interesting content to share. This post covers an exploration I participated in, focused on sending traces to MLflow for monitoring agentic workflows in production.</p>

<h1 id="background">Background</h1>

<p>MLflow is known for its capabilities in managing the machine learning lifecycle. <a href="https://www.linuxfoundation.org/press/press-release/the-mlflow-project-joins-linux-foundation">It’s part of the Linux Foundation</a> and as of version 3 of OpenShift AI, it has become a key component for managing experiment tracking and model registry within OpenShift AI.</p>

<p>Our goal was to explore the observability side of MLflow. Specifically, <a href="https://mlflow.org/docs/latest/genai/eval-monitor/running-evaluation/traces/">MLflow already covers the evaluation of production traces</a>, and we wanted to investigate ways to integrate that with OpenShift AI and whether it makes sense to visualize more data on the MLflow console.</p>

<h1 id="custom-or-standard-solution">Custom or Standard Solution?</h1>

<p>Full compatibility with OpenTelemetry was introduced in MLflow not long before our exploration (at the end of 2025). This left us with the following choices:</p>
<ol>
  <li>Favor MLflow SDK for the best experience.</li>
  <li>Favor compatibility with OpenTelemetry for standardization.</li>
</ol>

<p>Demonstrations we saw using the MLflow SDK were very impressive. While I have no experience working as a data scientist, the out-of-the-box capabilities of the MLflow SDK, such as automatic trace generation by simply adding an annotation, looked appealing.</p>

<p>However, considering the broader picture, we determined it was worth exploring how it could work with OpenTelemetry instead of the MLflow SDK, especially since there are ways to bridge the gaps (e.g., by complying with <a href="https://arize.com/docs/ax/observe/tracing-concepts/what-is-openinference">OpenInference</a>).</p>

<h1 id="having-a-central-otel-collector">Having a Central OTEL Collector</h1>

<p>There are various ways to integrate MLflow with the typical observability stack in OpenShift. We chose to investigate using an OpenTelemetry collector that would receive traces from different components and export them to the OTEL endpoint of MLflow Tracking Server, as illustrated in the following diagram:</p>

<p><img src="../images/otel-collector-mlflow.png" alt="Propagating traces to MLflow through a central OTEL collector" /></p>

<h1 id="simple-export-fails">Simple Export Fails…</h1>

<p>The very first attempt to export traces to the OTEL endpoint of MLflow Tracking Server failed with an error: “Workspace context is required for this request”. I was already familiar with this error from previous experiments with MLflow and knew it stemmed from an additional layer, added to verify that a namespace/project is set for each request, that was rejecting the request.</p>

<p>I used Claude Code to identify the root cause and come up with a fix, which <a href="https://github.com/opendatahub-io/mlflow/pull/75">was posted</a> and is expected to be included in a future version of OpenShift AI. With that fix, I was able to post traces from within the MLflow pod using:</p>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code>curl <span class="nt">-i</span> <span class="nt">-X</span> POST <span class="se">\</span>
      http://mlflow.opendatahub.svc:9443/v1/traces <span class="se">\</span>
      <span class="nt">-H</span> <span class="s2">"x-mlflow-experiment-id: 8"</span> <span class="se">\</span>
      <span class="nt">-H</span> <span class="s2">"x-mlflow-workspace: opendatahub"</span> <span class="se">\</span>
      <span class="nt">-H</span> <span class="s2">"Authorization: Bearer &lt;token&gt;"</span> <span class="se">\</span>
      <span class="nt">-H</span> <span class="s2">"Content-Type: application/x-protobuf"</span> <span class="se">\</span>
      <span class="nt">-H</span> <span class="s2">"x-remote-user: kube:admin"</span> <span class="se">\</span>
      <span class="nt">--data-binary</span> @trace.bin
</code></pre></div></div>

<h1 id="fixing-a-resolved-issue">Fixing a Resolved Issue…</h1>

<p>Next, I configured an OpenTelemetry collector that can receive traces over gRPC or HTTP and export them to the OTEL endpoint of MLflow Tracking Server:</p>

<div class="language-yaml highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="na">apiVersion</span><span class="pi">:</span> <span class="s">opentelemetry.io/v1beta1</span>
<span class="na">kind</span><span class="pi">:</span> <span class="s">OpenTelemetryCollector</span>
<span class="na">metadata</span><span class="pi">:</span>
  <span class="na">name</span><span class="pi">:</span> <span class="s">collector</span>
  <span class="na">namespace</span><span class="pi">:</span> <span class="s">openshift-monitoring</span>
<span class="na">spec</span><span class="pi">:</span>
  <span class="na">mode</span><span class="pi">:</span> <span class="s">deployment</span>
  <span class="na">config</span><span class="pi">:</span>
    <span class="na">receivers</span><span class="pi">:</span>
      <span class="na">otlp</span><span class="pi">:</span>
        <span class="na">protocols</span><span class="pi">:</span>
          <span class="na">grpc</span><span class="pi">:</span> <span class="pi">{}</span>
          <span class="na">http</span><span class="pi">:</span> <span class="pi">{}</span>

    <span class="na">processors</span><span class="pi">:</span>
      <span class="na">batch</span><span class="pi">:</span> <span class="pi">{}</span>
      <span class="na">resource</span><span class="pi">:</span>
        <span class="na">attributes</span><span class="pi">:</span>
          <span class="pi">-</span> <span class="na">key</span><span class="pi">:</span> <span class="s">service.name</span>
            <span class="na">value</span><span class="pi">:</span> <span class="s2">"</span><span class="s">otel-collector-forwarder"</span>
            <span class="na">action</span><span class="pi">:</span> <span class="s">upsert</span>

    <span class="na">exporters</span><span class="pi">:</span>
      <span class="na">otlphttp/mlflow</span><span class="pi">:</span>
        <span class="na">endpoint</span><span class="pi">:</span> <span class="s2">"</span><span class="s">http://mlflow.opendatahub.svc:9443"</span>
        <span class="na">encoding</span><span class="pi">:</span> <span class="s">proto</span>
        <span class="na">compression</span><span class="pi">:</span> <span class="s">none</span>
        <span class="na">headers</span><span class="pi">:</span>
          <span class="na">x-mlflow-experiment-id</span><span class="pi">:</span> <span class="s2">"</span><span class="s">8"</span>
          <span class="na">x-mlflow-workspace</span><span class="pi">:</span> <span class="s2">"</span><span class="s">opendatahub"</span>
          <span class="na">Authorization</span><span class="pi">:</span> <span class="s2">"</span><span class="s">Bearer</span><span class="nv"> </span><span class="s">&lt;token&gt;"</span>
          <span class="na">x-remote-user</span><span class="pi">:</span> <span class="s2">"</span><span class="s">kube:admin"</span>
          <span class="na">Accept</span><span class="pi">:</span> <span class="s2">"</span><span class="s">application/json"</span>
        <span class="na">tls</span><span class="pi">:</span>
          <span class="na">insecure_skip_verify</span><span class="pi">:</span> <span class="no">true</span>

    <span class="na">service</span><span class="pi">:</span>
      <span class="na">pipelines</span><span class="pi">:</span>
        <span class="na">traces</span><span class="pi">:</span>
          <span class="na">receivers</span><span class="pi">:</span> <span class="pi">[</span><span class="nv">otlp</span><span class="pi">]</span>
          <span class="na">processors</span><span class="pi">:</span> <span class="pi">[</span><span class="nv">resource</span><span class="pi">,</span> <span class="nv">batch</span><span class="pi">]</span>
          <span class="na">exporters</span><span class="pi">:</span> <span class="pi">[</span><span class="nv">otlphttp/mlflow</span><span class="pi">]</span>
</code></pre></div></div>

<p>With this configuration, a trace I sent to the collector appeared in the MLflow Tracking Server database. However, I noticed the trace was being re-sent multiple times until a retry limit was reached. I’ll spare you the technical details of the issue [1], as the interesting part is the fix.</p>

<p>I thought this could be another issue I could contribute a fix for, so I asked Claude Code to produce one and was about to submit it to <a href="https://github.com/mlflow/mlflow">the main MLflow repository</a>. However, when rebasing, I discovered <a href="https://github.com/mlflow/mlflow/commit/6887d3a85ab50659ac78e252983b55c379c56887">the issue had already been fixed there</a>! The lesson learned from this is twofold: (A) I’ll need to get used to working with midstream repositories, which didn’t exist in other projects I’ve worked on; and (B) Claude Code doesn’t appear to check parent repositories even when GitHub integration is enabled.</p>

<h1 id="summary">Summary</h1>

<p>All in all, it was nice to see everything working smoothly eventually, as shown in this demonstration:</p>

<p><a href="https://www.youtube.com/watch?v=Kv6HQ5qJkv8"><img src="https://img.youtube.com/vi/Kv6HQ5qJkv8/1.jpg" alt="Demonstration" /></a></p>

<p>It turned out that the above-mentioned architecture considerations weren’t that relevant, as many would have seen the OpenTelemetry approach as the preferred option anyway. There were no surprising findings either. However, it was also nice to become more familiar with MLflow and with development processes in OpenShift AI.</p>

<p>[1] For those interested: the response sent by MLflow was encoded in JSON even though the request was encoded in Protobuf, which violated <a href="https://opentelemetry.io/docs/specs/otlp/#otlphttp-response">the OpenTelemetry specification</a> and resulted in the otlphttp exporter failing to parse the response and attempting to re-export the data repeatedly.</p>]]></content><author><name></name></author><summary type="html"><![CDATA[It’s been a while since I posted here, but as part of my work on OpenShift AI (that transition could be a topic for another post), I expect to have more interesting content to share. This post covers an exploration I participated in, focused on sending traces to MLflow for monitoring agentic workflows in production.]]></summary></entry><entry><title type="html">Push Custom Images to Openshift Local</title><link href="http://ahadas.com/push-to-openshift-local-registry/" rel="alternate" type="text/html" title="Push Custom Images to Openshift Local" /><published>2022-12-02T00:00:00+00:00</published><updated>2022-12-02T00:00:00+00:00</updated><id>http://ahadas.com/push-to-openshift-local-registry</id><content type="html" xml:base="http://ahadas.com/push-to-openshift-local-registry/"><![CDATA[<p>This post describes the next step in our journey to deploy MTV (Migration Toolkit for Virtualization) on Openshift Local: deploying a custom image to Openshift Local without going through an external registry like quay.io.</p>

<h1 id="prerequisits">Prerequisites</h1>

<p>Set up an Openshift Local cluster and deploy MTV on it as described <a href="http://ahadas.com/mtv-on-openshift-local/">here</a> (note that you should also make the PVs accessible to the pods so that VMs can be started).</p>

<p>Clone <a href="https://github.com/kubev2v/forklift">the repository of Forklift</a> and make sure you are able to build the image of forklift-controller with <code class="language-plaintext highlighter-rouge">make build-controller</code>.</p>

<h1 id="expose-the-image-registry">Expose the image-registry</h1>

<p>Perform the following steps, which are taken from the <a href="https://docs.openshift.com/container-platform/4.11/registry/securing-exposing-registry.html">Openshift documentation</a>:</p>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nv">$ HOST</span><span class="o">=</span><span class="si">$(</span>oc get route default-route <span class="nt">-n</span> openshift-image-registry <span class="nt">--template</span><span class="o">=</span><span class="s1">'{{ .spec.host }}'</span><span class="si">)</span>
</code></pre></div></div>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nv">$ </span>oc get secret <span class="nt">-n</span> openshift-ingress  router-certs-default <span class="nt">-o</span> go-template<span class="o">=</span><span class="s1">'{{index .data "tls.crt"}}'</span> | <span class="nb">base64</span> <span class="nt">-d</span> | <span class="nb">sudo tee</span> /etc/pki/ca-trust/source/anchors/<span class="k">${</span><span class="nv">HOST</span><span class="k">}</span>.crt  <span class="o">&gt;</span> /dev/null
</code></pre></div></div>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nv">$ </span><span class="nb">sudo </span>update-ca-trust <span class="nb">enable</span>
</code></pre></div></div>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nv">$ </span>podman login <span class="nt">-u</span> kubeadmin <span class="nt">-p</span> <span class="si">$(</span>oc <span class="nb">whoami</span> <span class="nt">-t</span><span class="si">)</span> <span class="nv">$HOST</span>
</code></pre></div></div>

<p>Note that the last step above differs from the above-mentioned Openshift documentation: this command shouldn’t be executed with <code class="language-plaintext highlighter-rouge">sudo</code>.</p>

<h1 id="push-an-image-to-the-exposed-registry">Push an image to the exposed registry</h1>

<p>Tag an image with $HOST as the registry and push it, for example:</p>
<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nv">$ </span>podman tag &lt;image-id&gt; <span class="nv">$HOST</span>/openshift/forklift-controller:devel
<span class="nv">$ </span>podman push <span class="nv">$HOST</span>/openshift/forklift-controller:devel
</code></pre></div></div>

<p>Specifically, for forklift-controller this can be achieved with:</p>
<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nv">$ </span><span class="nb">export </span><span class="nv">IMG</span><span class="o">=</span><span class="nv">$HOST</span>/openshift/forklift-controller:devel
<span class="nv">$ </span>make push-controller
</code></pre></div></div>

<h1 id="redirect-the-pod-to-use-the-image-from-the-clusters-image-registry">Redirect the pod to use the image from the cluster’s image registry</h1>
<p>This part depends on the way the application is deployed. Here I’ll describe a common practice for applications that are deployed using operators, in which the operator injects the image from the internal/cluster’s image registry.</p>

<p>First, identify the ClusterServiceVersion instance in the relevant namespace (in my case it was called <code class="language-plaintext highlighter-rouge">mtv-operator.v2.3.1</code> in the <code class="language-plaintext highlighter-rouge">openshift-mtv</code> namespace).</p>

<p>Then, edit it and specify the image that was pushed to the internal registry; in my case, it was done by editing <code class="language-plaintext highlighter-rouge">mtv-operator.v2.3.1</code>:</p>
<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nv">$ </span>oc edit csv <span class="nt">-n</span> openshift-mtv mtv-operator.v2.3.1
</code></pre></div></div>
<p>Then set the value of <code class="language-plaintext highlighter-rouge">RELATED_IMAGE_CONTROLLER</code> to: <code class="language-plaintext highlighter-rouge">image-registry.openshift-image-registry.svc:5000/openshift/forklift-controller:devel</code></p>
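<p>For illustration, the relevant fragment of the ClusterServiceVersion would look roughly as follows (the structure is abbreviated, the container name is hypothetical, and the exact layout depends on the operator version):</p>

<div class="language-yaml highlighter-rouge"><div class="highlight"><pre class="highlight"><code># abbreviated fragment of the CSV (illustrative)
spec:
  install:
    spec:
      deployments:
        - spec:
            template:
              spec:
                containers:
                  - name: forklift-operator  # hypothetical container name
                    env:
                      - name: RELATED_IMAGE_CONTROLLER
                        value: image-registry.openshift-image-registry.svc:5000/openshift/forklift-controller:devel
</code></pre></div></div>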

<p>Finally, delete the pod (in my case, forklift-controller) so that a new one starts with the image from the internal registry.</p>]]></content><author><name></name></author><summary type="html"><![CDATA[This post describes the next step in our journey to deploy MTV (Migration Toolkit for Virtualization) on Openshift Local: deploying a custom image to Openshift Local without going through an external registry like quay.io.]]></summary></entry><entry><title type="html">Migrating from RHV to Openshift Local</title><link href="http://ahadas.com/mtv-on-openshift-local/" rel="alternate" type="text/html" title="Migrating from RHV to Openshift Local" /><published>2022-06-07T00:00:00+00:00</published><updated>2022-06-07T00:00:00+00:00</updated><id>http://ahadas.com/mtv-on-openshift-local</id><content type="html" xml:base="http://ahadas.com/mtv-on-openshift-local/"><![CDATA[<p>Recently I wanted to experiment with <a href="https://github.com/konveyor/forklift">Forklift</a>, a project that enables migrating virtual machines from Red Hat Virtualization (RHV) to Openshift Virtualization. Here, I’ll describe how I did this using Openshift Local, which can be useful for development or experiments (Openshift Local is not meant for production use).</p>

<h1 id="prerequisits">Prerequisites</h1>

<p>We will need a bare-metal machine or a virtual machine with enough memory and CPU power. I used a virtual machine with 64G of RAM and 18 CPUs, but I believe that even 32G of RAM and 10 CPUs would be enough. Additionally, as I won’t explain how to deploy RHV, I assume there is a RHV deployment that is accessible from the machine that Openshift Local runs on.</p>

<h1 id="installing-openshift-local">Installing Openshift Local</h1>

<p>Installing Red Hat Openshift Local is really easy. There is a guide for how to do that on <a href="https://console.redhat.com/openshift/create/local">console.redhat.com</a>. 
Follow the instructions that appear there, make sure the process ends successfully, and verify that you are able to log in to the console with the credentials that appear at the end.</p>

<h1 id="adjusting-openshift-local">Adjusting Openshift Local</h1>

<p>As we’re going to deploy both Openshift Virtualization and the Migration Toolkit for Virtualization on this cluster, we need to override the default settings of the cluster. First, we’ll increase the memory to 64G:</p>
<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nv">$ </span>crc config <span class="nb">set </span>memory 64000
</code></pre></div></div>
<p>Similarly, we’ll increase the number of CPUs to 16:</p>
<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nv">$ </span>crc config <span class="nb">set </span>cpus 16
</code></pre></div></div>

<p>For the previous settings to take effect, and since we are about to extend the virtual disk used by the virtual machine that runs the cluster, we need to stop the cluster:</p>
<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nv">$ </span>crc stop
</code></pre></div></div>
<p>Once it is stopped, we can extend the aforementioned virtual disk. I extended it by 40G:</p>
<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nv">$ </span>qemu-img resize ~/.crc/machines/crc/crc.qcow2 +40g
</code></pre></div></div>

<p>Before starting the Openshift cluster again, you can inspect the updated settings with:</p>
<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nv">$ </span>crc config view
</code></pre></div></div>
<p>If the settings look alright, start the Openshift cluster with:</p>
<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nv">$ </span>crc start
</code></pre></div></div>

<p>Next, we will extend the filesystem to consume the additional space. In order to do that, we need to log in to the virtual machine using:</p>
<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nv">$ </span>ssh <span class="nt">-i</span> ~/.crc/machines/crc/id_ecdsa <span class="nt">-o</span> <span class="nv">StrictHostKeyChecking</span><span class="o">=</span>no core@&lt;vm ip&gt;
</code></pre></div></div>
<p>You can find the IP address of the virtual machine with:</p>
<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nv">$ </span>crc ip
</code></pre></div></div>
<p>Once you’re inside the virtual machine, execute the following command:</p>
<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nv">$ </span>xfs_growfs /sysroot/
</code></pre></div></div>

<p>Congratulations, you now have an Openshift cluster with enough resources to run Openshift Virtualization and Migration Toolkit for Virtualization :)</p>

<h1 id="installing-openshift-virtualization">Installing Openshift Virtualization</h1>

<p>Go to the OperatorHub (Operators -&gt; OperatorHub) and search for ‘cnv’. You’ll get the Openshift Virtualization operator. Install it and make sure that both hostpath-provisioner and kubevirt-hyperconverged are present at the end of the process.</p>

<h1 id="installing-migration-toolkit-for-virtualization">Installing Migration Toolkit for Virtualization</h1>

<p>Similarly, search for ‘mtv’ in the OperatorHub and install the Migration Toolkit for Virtualization Operator.</p>

<h1 id="setting-persistent-volumes">Setting persistent volumes</h1>

<p>Next, we will define a storage class that enables us to provision local persistent volumes (PVs) on the VM that runs the Openshift cluster. This is done by going to Storage-&gt;StorageClasses and creating a new StorageClass. Give it a name and set the ‘Provisioner’ field to ‘kubevirt.io.hostpath-provisioner’.</p>
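<p>Equivalently, the StorageClass can be created from a YAML manifest like the following sketch (the name is just an example):</p>

<div class="language-yaml highlighter-rouge"><div class="highlight"><pre class="highlight"><code>apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: hostpath-provisioner  # example name; pick your own
provisioner: kubevirt.io.hostpath-provisioner
</code></pre></div></div>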

<p>Now we can change some of the built-in PVs to be consumable by this StorageClass. This is done by editing a PV (e.g., with ‘oc edit pv pv0009’), setting ‘accessModes’ to ‘ReadWriteOnce’ only, and adding ‘storageClassName: &lt;storage-class-name&gt;’ within the ‘spec’.</p>
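<p>After the edit, the relevant part of the PV would look roughly like this (other fields are omitted, and the storage class name is a placeholder for the name you chose above):</p>

<div class="language-yaml highlighter-rouge"><div class="highlight"><pre class="highlight"><code>apiVersion: v1
kind: PersistentVolume
metadata:
  name: pv0009
spec:
  accessModes:
    - ReadWriteOnce
  storageClassName: &lt;storage-class-name&gt;  # the StorageClass created above
</code></pre></div></div>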

<p>You’re ready for your first migration!</p>

<h1 id="executing-a-migration-plan">Executing a migration plan</h1>

<p>In order to initiate a migration you first need to log in to the UI of the Migration Toolkit for Virtualization (MTV). You can find the URL in Networking -&gt; Routes and inspect the ‘Location’ of the ‘virt’ route within the openshift-mtv project (namespace). In my case it was ‘https://virt-openshift-mtv.apps-crc.testing’. Log in with the same credentials you used for the Openshift console.</p>

<p>Once you are logged in to the UI of MTV, add a RHV provider under ‘Providers’. You should find an Openshift Virtualization provider there as well.</p>

<p>Then, create mappings: go to ‘Mappings’ and define Network and Storage mapping from the source environment (RHV) to the target environment (Openshift Virtualization).</p>

<p>With that, you can now go to ‘Migration Plans’ and create a new migration plan. It is fairly simple to do by following the steps in that wizard. Assuming you chose virtual machine(s) installed with a valid guest operating system, the execution of the migration plan would succeed and you’ll find the converted virtual machines within the ‘Virtualization -&gt; VirtualMachines’ view in the Openshift console.</p>]]></content><author><name></name></author><summary type="html"><![CDATA[Recently I wanted to experiment with Forklift, a project that enables migrating virtual machines from Red Hat Virtualization (RHV) to Openshift Virtualization. Here, I’ll describe how I did this using Openshift Local, which can be useful for development or experiments (Openshift Local is not meant for production use).]]></summary></entry><entry><title type="html">Multi-Cluster Configuration with Kubernetes</title><link href="http://ahadas.com/multi-cluster-configurations/" rel="alternate" type="text/html" title="Multi-Cluster Configuration with Kubernetes" /><published>2020-07-04T00:00:00+00:00</published><updated>2020-07-04T00:00:00+00:00</updated><id>http://ahadas.com/multi-cluster-configurations</id><content type="html" xml:base="http://ahadas.com/multi-cluster-configurations/"><![CDATA[<p>A few months ago I examined the ability to propagate configurations to Kubernetes clusters in a multi-cluster environment. Here I describe the process and tools in the hope that they will be useful for others trying to do the same thing.</p>

<h1 id="background">Background</h1>

<p>When we speak about configurations in the context of Kubernetes, we typically speak about YAML files that describe how certain resources within the cluster should be defined. For instance, the configuration may include a URL that metrics should be sent to. As another example, the configuration may include all the properties of an application to be deployed to the cluster.</p>
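<p>For instance, the first kind of configuration could be expressed as a ConfigMap as simple as the following sketch (all names and values here are illustrative):</p>

<div class="language-yaml highlighter-rouge"><div class="highlight"><pre class="highlight"><code># illustrative example: configuration expressed as a Kubernetes resource
apiVersion: v1
kind: ConfigMap
metadata:
  name: metrics-config      # hypothetical name
  namespace: monitoring     # hypothetical namespace
data:
  metrics-url: "https://metrics.example.com"
</code></pre></div></div>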

<p>The GitOps paradigm promotes operating infrastructure using Git. A Git repository holds the state of the infrastructure and operators change the infrastructure using Git operations. For example, a property of one of the entities that comprise the infrastructure (e.g., a node) can be modified by modifying its corresponding resource within the Git repository (e.g., its labels). The GitOps paradigm has become the common practice for managing the above-mentioned configuration for Kubernetes.</p>

<p>Managing a single Kubernetes cluster using GitOps is relatively easy using tools like Argo CD. Argo CD, perhaps the most popular GitOps tool for Kubernetes nowadays, enables retrieving configuration from a variety of sources (e.g., GitHub) and applying it to the cluster. The process is pretty straightforward: one needs to deploy Argo CD to the cluster and then define applications that retrieve configuration from a predefined Git repository and apply it to that cluster.</p>

<p>However, the process becomes more challenging in a multi-cluster environment.</p>

<h1 id="challenges-in-multi-cluster-environments">Challenges in Multi-Cluster Environments</h1>

<p>One of the hot topics in the Kubernetes world today is the management of multi-cluster environments. The <a href="https://github.com/openshift/hive">Openshift/Hive project</a> enables provisioning clusters as a service. <a href="https://www.redhat.com/en/technologies/management/advanced-cluster-management">Advanced Cluster Management (ACM)</a>, which was demonstrated at the last Red Hat Summit, aims to facilitate various management operations (e.g., application delivery) in a multi-cluster environment.</p>

<p>When it comes to cluster configuration in a multi-cluster environment, several questions may arise:</p>
<ul>
  <li>How to propagate the configuration to the clusters? Do we want each cluster to pull its configuration from the Git repository or to propagate the configuration through a hub-cluster (or multiple hub-clusters)?</li>
  <li>Should all configurations get to all clusters or do we want to limit part of the configuration only to certain clusters?</li>
  <li>Do we need to alter the configuration with cluster-specific information (e.g., setting the name of the cluster on metrics that are reported from the cluster; or setting the URL of a hub-cluster that certain clusters should send their metrics to)?</li>
</ul>

<p>These questions, among others, may suggest that the aforementioned solution for a single-cluster might not be sufficient for a multi-cluster environment.</p>

<h1 id="argo-cd--openshithive">Argo CD + Openshift/Hive</h1>

<p>We have examined a multi-cluster solution that is based on Argo CD and Openshift/Hive. Conceptually, this solution is similar to the one that was <a href="https://assets.openshift.com/hubfs/Worldpay-fis-openshift-commons_COMMENTS.pptx.pdf">presented by Worldpay</a> to propagate configuration to Openshift clusters.</p>

<p>This solution assumes the system is composed of one or more hub-clusters and each hub-cluster manages one or more spoke/managed clusters. The spoke/managed clusters are provisioned by Openshift/Hive that runs on the hub-cluster. The configuration to propagate to the clusters resides in a remote Github repository. The next diagram depicts an example of such a system.</p>

<p><img src="../images/multi-cluster-conf/multi-cluster-gitops.png" alt="Multi-cluster environment" /></p>

<p>In our solution, the Argo CD instance deployed to the hub-cluster is defined with one or more applications that retrieve the configuration from the remote Git repository. Argo CD pulls the configuration from the remote Git repository and transforms it into Openshift/Hive entities named SelectorSyncSets; finally, Openshift/Hive propagates the SelectorSyncSets to the spoke/managed clusters. The following diagram illustrates this process.</p>

<p><img src="../images/multi-cluster-conf/argo-hive.png" alt="Argo CD and Openshift/Hive" /></p>

<p>More specifically, we define several applications that are set to pull the configuration from a Git repository on GitHub. These applications are also configured with <a href="https://argoproj.github.io/argo-cd/operator-manual/custom_tools/">custom tools</a> that transform the pulled configuration into SelectorSyncSets. During the transformation, variables within the original configuration can be replaced with details that are specific to a hub or spoke/managed cluster. By using SelectorSyncSets (rather than ordinary SyncSets), the configuration propagates to spoke/managed clusters based on their labels. The label-matching mechanism is the one commonly used by Kubernetes. The transformation code is available <a href="https://github.com/ahadas/syncset-gen">here</a>.</p>
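<p>To give a sense of the generated output, here is a minimal, illustrative SelectorSyncSet that propagates a ConfigMap only to clusters labeled with <code class="language-plaintext highlighter-rouge">environment: dev</code> (all names and values are hypothetical, and optional fields are omitted):</p>

<div class="language-yaml highlighter-rouge"><div class="highlight"><pre class="highlight"><code>apiVersion: hive.openshift.io/v1
kind: SelectorSyncSet
metadata:
  name: dev-config  # hypothetical name
spec:
  # propagate only to spoke/managed clusters whose
  # ClusterDeployment carries this label
  clusterDeploymentSelector:
    matchLabels:
      environment: dev
  resources:
    - apiVersion: v1
      kind: ConfigMap
      metadata:
        name: cluster-config
        namespace: default
      data:
        hub-url: "https://hub.example.com"
</code></pre></div></div>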

<p>Using this mechanism we managed to deploy <a href="https://kubevirt.io/">KubeVirt</a>, <a href="https://github.com/openshift-kni/performance-addon-operators">Openshift-KNI/Performance-Addon-Operator</a>, and other Openshift configuration that requires adding and patching Kubernetes entities. The configurations we used are available <a href="https://github.com/danielerez/acm-gitops">here</a>. The beauty of this solution is that it does not rely on a central multi-cluster management application (it only relies on provisioning the clusters using Openshift/Hive) and can work for a variety of configurations without having to deploy any additional component on the spoke/managed clusters. This mechanism is illustrated in more detail in <a href="https://youtu.be/E4lJSd7Q874">this recording</a>.</p>

<h1 id="conclusion">Conclusion</h1>

<p>The described solution for configuration management in a multi-cluster Kubernetes-based environment appeared to be useful for propagating a variety of configurations to spoke/managed clusters in our evaluation. For various reasons, those who continued the work decided to embrace alternative approaches. Yet, I think it might be handy for those looking for a simple and lightweight solution that doesn’t involve a centralized management application.</p>]]></content><author><name></name></author><summary type="html"><![CDATA[A few months ago I examined the ability to propagate configurations to Kubernetes clusters in a multi-cluster environment. Here I describe the process and tools in the hope that they will be useful for others trying to do the same thing.]]></summary></entry><entry><title type="html">Nightly Builds on GitHub using Jenkins</title><link href="http://ahadas.com/nightly-builds-jenkins-github/" rel="alternate" type="text/html" title="Nightly Builds on GitHub using Jenkins" /><published>2020-04-15T00:00:00+00:00</published><updated>2020-04-15T00:00:00+00:00</updated><id>http://ahadas.com/nightly-builds-jenkins-github</id><content type="html" xml:base="http://ahadas.com/nightly-builds-jenkins-github/"><![CDATA[<p>In this post I share a solution I have implemented for publishing nightly builds on GitHub.</p>

<h1 id="why-nightly-builds">Why Nightly Builds?</h1>

<p>Building and testing your project automatically is a common practice nowadays. Many projects run unit tests before merging pull requests (PRs). Some projects, like <a href="https://github.com/kubevirt">KubeVirt</a>, also run integration tests automatically before merging a PR, while others, like <a href="https://github.com/oVirt">oVirt</a>, run them on demand or after the fact. Various continuous integration tools such as <a href="https://travis-ci.com">Travis CI</a> and <a href="https://circleci.com">Circle CI</a> are available for this purpose.</p>

<p>However, many projects lack automatic releases. The practice of not only building the project but also periodically delivering unstable releases derived from the development branch, possibly on a daily basis (in which case they are generally referred to as <em>nightly builds</em>), is often missing despite its potential benefits. For users, it is a way to get exposed to new features and to provide early feedback. For developers, it eases validating certain capabilities without the need to compile the code locally and then copy the artifacts elsewhere.</p>

<h1 id="whats-the-challenge">What’s the Challenge?</h1>

<p>In some projects, frequent packaging and delivery of unstable releases may not be worthwhile for various reasons. For instance, distributed infrastructure management systems may require a relatively large amount of resources and a complex installation process, and as such their users may want to avoid deploying unstable releases. As another example, users of mission-critical systems may wish to minimize the amount of change in their systems.</p>

<p>But I believe that for the majority of projects out there, the reason for the lack of nightly builds is rather the absence of a proper place to store them. Nightly builds are clearly useful for most projects, especially for software that is delivered as-a-service (SaaS) or as standalone applications; however, it is not easy to find a free place to store them in a way that users can easily consume.</p>

<p>Let’s look at the muCommander project, for example. Nightly builds, which were produced by Jenkins, used to be stored on a local virtual machine, which made it easy to publish them on the project’s website. However, the project no longer possesses a local machine that can be available all the time. Today, both our source code and our website are hosted on GitHub (and GitHub Pages), and while GitHub provides a place to store releases, it lacks a mechanism for storing nightly builds.</p>

<h1 id="our-solution">Our Solution</h1>

<p>The recently introduced solution for muCommander produces the nightly builds on a local virtual machine (with Jenkins) that pushes the artifacts to GitHub. This way, the nightly builds are stored “in the cloud”, i.e., on a remote infrastructure that provides better availability than a local machine, alongside our stable releases that are also stored on GitHub. In addition, the local VM is a safe place to store the GitHub token that is required for pushing the artifacts. While the local VM may not be running all the time, it can easily be recovered in case of a problem (in the worst case, no new builds are produced but the latest one remains available).</p>

<p>However, I found no existing integration between Jenkins and GitHub that I could use for pushing the nightly builds. As previously mentioned, GitHub does not offer an out-of-the-box mechanism for third-party tools like Jenkins or other CI/CD tools to publish nightly builds. This required me to write the following script, based on the one I found <a href="https://medium.com/@systemglitch/continuous-integration-with-jenkins-and-github-release-814904e20776">here</a>:</p>

<div class="language-sh highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c"># Publish on github</span>
<span class="nb">echo</span> <span class="s2">"Publishing on Github..."</span>
<span class="nv">token</span><span class="o">=</span><span class="s2">"&lt;your-token&gt;"</span>
 
<span class="c"># Setup variables</span>
<span class="nv">api_endpoint</span><span class="o">=</span><span class="s2">"https://api.github.com/repos/mucommander/mucommander"</span>
<span class="nv">uploads_endpoint</span><span class="o">=</span><span class="s2">"https://uploads.github.com/repos/mucommander/mucommander"</span>
<span class="nv">tag</span><span class="o">=</span><span class="s2">"nightly"</span>
<span class="nv">name</span><span class="o">=</span><span class="s2">"Nightly"</span>
<span class="nv">artifact</span><span class="o">=</span><span class="si">$(</span><span class="nb">ls </span>build/distributions<span class="si">)</span>
<span class="nv">md5</span><span class="o">=</span><span class="si">$(</span><span class="nb">md5sum </span>build/distributions/<span class="nv">$artifact</span> | <span class="nb">awk</span> <span class="s1">'{print $1}'</span><span class="si">)</span>
<span class="nv">description</span><span class="o">=</span><span class="s2">"MD5:</span><span class="se">\n</span><span class="nv">$md5</span><span class="s2"> </span><span class="nv">$artifact</span><span class="s2">"</span>
<span class="nv">description</span><span class="o">=</span><span class="si">$(</span><span class="nb">echo</span> <span class="s2">"</span><span class="nv">$description</span><span class="s2">"</span> | <span class="nb">sed</span> <span class="nt">-z</span> <span class="s1">'s/\n/\\n/g'</span><span class="si">)</span> <span class="c"># Escape line breaks to prevent json parsing problems</span>
 
<span class="c"># Query the existing release</span>
<span class="nv">release</span><span class="o">=</span><span class="si">$(</span>curl <span class="nt">-XGET</span> <span class="nv">$api_endpoint</span>/releases/tags/<span class="nv">$tag</span><span class="si">)</span>
 
<span class="c"># Extract the id of the release from the response</span>
<span class="nb">id</span><span class="o">=</span><span class="si">$(</span><span class="nb">echo</span> <span class="s2">"</span><span class="nv">$release</span><span class="s2">"</span> | <span class="nb">sed</span> <span class="nt">-n</span> <span class="nt">-e</span> <span class="s1">'s/"id":\ \([0-9]\+\),/\1/p'</span> | <span class="nb">head</span> <span class="nt">-n</span> 1 | <span class="nb">sed</span> <span class="s1">'s/[[:blank:]]//g'</span><span class="si">)</span>
 
<span class="k">if</span> <span class="o">[</span> <span class="o">!</span> <span class="nt">-z</span> <span class="s2">"</span><span class="nv">$id</span><span class="s2">"</span> <span class="o">]</span><span class="p">;</span> <span class="k">then</span>
    <span class="c"># Deleting the existing release</span>
    curl <span class="nt">-XDELETE</span> <span class="nt">-H</span> <span class="s2">"Authorization:token </span><span class="nv">$token</span><span class="s2">"</span> <span class="nv">$api_endpoint</span>/releases/<span class="nv">$id</span>
<span class="k">fi</span>
 
<span class="c"># Delete the existing tag</span>
<span class="si">$(</span>curl <span class="nt">-XDELETE</span> <span class="nt">-H</span> <span class="s2">"Authorization:token </span><span class="nv">$token</span><span class="s2">"</span> <span class="nv">$api_endpoint</span>/git/refs/tags/<span class="nv">$tag</span><span class="si">)</span> <span class="o">||</span> <span class="nb">true</span>
 
<span class="c"># Create a release</span>
<span class="nv">release</span><span class="o">=</span><span class="si">$(</span>curl <span class="nt">-XPOST</span> <span class="nt">-H</span> <span class="s2">"Authorization:token </span><span class="nv">$token</span><span class="s2">"</span> <span class="nt">--data</span> <span class="s2">"{</span><span class="se">\"</span><span class="s2">tag_name</span><span class="se">\"</span><span class="s2">: </span><span class="se">\"</span><span class="nv">$tag</span><span class="se">\"</span><span class="s2">, </span><span class="se">\"</span><span class="s2">target_commitish</span><span class="se">\"</span><span class="s2">: </span><span class="se">\"</span><span class="s2">master</span><span class="se">\"</span><span class="s2">, </span><span class="se">\"</span><span class="s2">name</span><span class="se">\"</span><span class="s2">: </span><span class="se">\"</span><span class="nv">$name</span><span class="se">\"</span><span class="s2">, </span><span class="se">\"</span><span class="s2">body</span><span class="se">\"</span><span class="s2">: </span><span class="se">\"</span><span class="nv">$description</span><span class="se">\"</span><span class="s2">, </span><span class="se">\"</span><span class="s2">draft</span><span class="se">\"</span><span class="s2">: false, </span><span class="se">\"</span><span class="s2">prerelease</span><span class="se">\"</span><span class="s2">: true}"</span> <span class="nv">$api_endpoint</span>/releases<span class="si">)</span>
 
<span class="c"># Extract the id of the release from the creation response</span>
<span class="nb">id</span><span class="o">=</span><span class="si">$(</span><span class="nb">echo</span> <span class="s2">"</span><span class="nv">$release</span><span class="s2">"</span> | <span class="nb">sed</span> <span class="nt">-n</span> <span class="nt">-e</span> <span class="s1">'s/"id":\ \([0-9]\+\),/\1/p'</span> | <span class="nb">head</span> <span class="nt">-n</span> 1 | <span class="nb">sed</span> <span class="s1">'s/[[:blank:]]//g'</span><span class="si">)</span>
 
<span class="c"># Upload the artifact</span>
curl <span class="nt">-XPOST</span> <span class="nt">-H</span> <span class="s2">"Authorization:token </span><span class="nv">$token</span><span class="s2">"</span> <span class="nt">-H</span> <span class="s2">"Content-Type:application/octet-stream"</span> <span class="nt">--data-binary</span> @build/distributions/<span class="nv">$artifact</span> <span class="nv">$uploads_endpoint</span>/releases/<span class="nv">$id</span>/assets?name<span class="o">=</span><span class="nv">$artifact</span>
</code></pre></div></div>

<p>Let’s go over this script:<br />
First, we set our token for GitHub as explained in the above-mentioned post.<br />
Then we initialize some variables. The script makes use of the GitHub API, so the first two variables point to the endpoints of general API calls and upload calls for the <code class="language-plaintext highlighter-rouge">github.com/mucommander/mucommander</code> repository. The next two variables contain the name of the tag and the name of the release that the nightly build will be associated with. The two variables after that contain the name of the artifact and its MD5 hash. Lastly, we set the description of the release to contain the MD5 hash and the name of the artifact.<br />
Next, we query the existing release that is associated with the aforementioned tag and extract its identifier. If the identifier is not empty, a release of a previous nightly build exists and is therefore removed.<br />
We then remove the existing tag, if it exists, so that it will be recreated by the new release on top of the latest commit on the master branch.<br />
Finally, we create a new release with the above-mentioned tag, name and description, marked as non-draft and pre-release. We then extract the identifier of the created release and use it to upload the artifact that was already built by Jenkins.</p>
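<p>The sed-based extraction of the release identifier depends on the exact formatting of the JSON response. As a hedged alternative sketch (assuming <code class="language-plaintext highlighter-rouge">python3</code> is available on the Jenkins node), the identifier can be parsed from the JSON directly; the sample response below is hypothetical:</p>

```shell
# Hypothetical, truncated sample of GitHub's "get release by tag" response:
release='{"url": "https://api.github.com/repos/mucommander/mucommander/releases/1", "id": 12345678, "tag_name": "nightly"}'

# Parse the JSON properly instead of pattern-matching it; prints an empty
# string when the field is missing (e.g. when the response is an error object):
id=$(echo "$release" | python3 -c 'import json,sys
try:
    print(json.load(sys.stdin).get("id", ""))
except ValueError:
    print("")')
echo "$id"
```

<p>The same parsing could replace both id-extraction lines in the script above.</p>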

<h1 id="whats-next">What’s Next?</h1>

<p>With this mechanism, we can start publishing nightly builds in the hope that they will provide us with earlier feedback on the upcoming features in muCommander. The nightly builds can be found <a href="https://github.com/mucommander/mucommander/releases/tag/nightly">here</a>.</p>]]></content><author><name></name></author><summary type="html"><![CDATA[In this post I share a solution I have implemented for publishing nightly builds on GitHub.]]></summary></entry><entry><title type="html">Plain Kubernetes Cluster on Fedora 31</title><link href="http://ahadas.com/kubernetes-on-fedora-31/" rel="alternate" type="text/html" title="Plain Kubernetes Cluster on Fedora 31" /><published>2020-01-05T00:00:00+00:00</published><updated>2020-01-05T00:00:00+00:00</updated><id>http://ahadas.com/kubernetes-on-fedora-31</id><content type="html" xml:base="http://ahadas.com/kubernetes-on-fedora-31/"><![CDATA[<p>In this post I describe how to run a plain (vanilla) Kubernetes cluster that spreads across physical machines installed with Fedora 31.</p>

<p>Various projects, like the <a href="https://kubevirt.io/">Kubevirt project</a> that I’ve been involved in recently, provide an automated way to initiate a Kubernetes cluster. In Kubevirt, for instance, there is a framework named <a href="https://github.com/kubevirt/kubevirtci">kubevirtci</a> that enables one to quickly spin up and destroy Kubernetes clusters for testing. The kubevirtci framework is composed of two parts: the first initiates a cluster, and the second deploys Kubevirt on top of that cluster. I can say that kubevirtci served me well during the time I developed Kubevirt, and its scripts and configurations may be used by others to easily run Kubernetes or Openshift clusters across multiple virtual machines within a single physical machine.</p>

<p>However, as part of a new project I started to work on, I needed to run Kubernetes clusters across distributed virtual machines (which can be regarded as physical machines on the same network), a scenario that kubevirtci does not fit. Below are the steps I took, shared in the hope that others attempting the same thing will find them useful.</p>

<p>First, we need to install docker as explained <a href="https://linuxconfig.org/how-to-install-docker-on-fedora-31">in this guide</a>.</p>

<p>Then change the <code class="language-plaintext highlighter-rouge">cgroup-driver</code> of docker to be systemd by extending <code class="language-plaintext highlighter-rouge">ExecStart</code> in <code class="language-plaintext highlighter-rouge">/etc/systemd/system/multi-user.target.wants/docker.service</code> with:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>--exec-opt native.cgroupdriver=systemd
</code></pre></div></div>
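<p>As an alternative sketch (not the approach used in this setup), the same driver can be configured through Docker’s <code class="language-plaintext highlighter-rouge">daemon.json</code> instead of patching the unit file; use one approach or the other, not both:</p>

```shell
# Set the cgroup driver in /etc/docker/daemon.json (run as root);
# equivalent to passing --exec-opt native.cgroupdriver=systemd:
cat <<EOF > /etc/docker/daemon.json
{
  "exec-opts": ["native.cgroupdriver=systemd"]
}
EOF
systemctl daemon-reload
systemctl restart docker
```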

<p>The next step is following <a href="https://kubernetes.io/docs/setup/production-environment/tools/kubeadm/install-kubeadm/">this guide</a>, and specifically:</p>

<p>Add Kubernetes repository:</p>
<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nv">$ </span><span class="nb">cat</span> <span class="o">&lt;&lt;</span><span class="no">EOF</span><span class="sh"> &gt; /etc/yum.repos.d/kubernetes.repo
[kubernetes]
name=Kubernetes
baseurl=https://packages.cloud.google.com/yum/repos/kubernetes-el7-x86_64
enabled=1
gpgcheck=1
repo_gpgcheck=1
gpgkey=https://packages.cloud.google.com/yum/doc/yum-key.gpg https://packages.cloud.google.com/yum/doc/rpm-package-key.gpg
</span><span class="no">EOF
</span></code></pre></div></div>

<p>Disable SELinux:</p>
<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nv">$ </span><span class="nb">sed</span> <span class="nt">-i</span> <span class="s1">'s/^SELINUX=enforcing$/SELINUX=permissive/'</span> /etc/selinux/config
</code></pre></div></div>

<p>Disable swap in <code class="language-plaintext highlighter-rouge">/etc/fstab</code>.</p>
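<p>The fstab change is left implicit above; a minimal sketch, demonstrated here against a hypothetical fstab line (on the real host, apply the same sed to <code class="language-plaintext highlighter-rouge">/etc/fstab</code> and also run <code class="language-plaintext highlighter-rouge">swapoff -a</code> to disable swap immediately):</p>

```shell
# Hypothetical swap entry as it may appear in /etc/fstab:
fstab='/dev/mapper/fedora-swap none swap defaults 0 0'

# Comment out lines whose type field is "swap" so swap stays off after reboot;
# on the host: sed -i '/\sswap\s/ s/^/#/' /etc/fstab
disabled=$(echo "$fstab" | sed '/\sswap\s/ s/^/#/')
echo "$disabled"
```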

<p>Disable the firewall:</p>
<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nv">$ </span>systemctl disable firewalld
</code></pre></div></div>

<p>Install Kubernetes packages:</p>
<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nv">$ </span>yum <span class="nb">install</span> <span class="nt">-y</span> kubelet kubeadm kubectl <span class="nt">--disableexcludes</span><span class="o">=</span>kubernetes
</code></pre></div></div>

<p>Enable <code class="language-plaintext highlighter-rouge">kubelet</code>:</p>
<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nv">$ </span>systemctl <span class="nb">enable</span> <span class="nt">--now</span> kubelet
</code></pre></div></div>

<p>Ensure <code class="language-plaintext highlighter-rouge">net.bridge.bridge-nf-call-iptables</code> is set to 1 in your sysctl config:</p>
<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nv">$ </span>sysctl net.bridge.bridge-nf-call-iptables<span class="o">=</span>1
</code></pre></div></div>
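<p>The command above only lasts until reboot. A hedged sketch for making the setting persistent (the file name <code class="language-plaintext highlighter-rouge">k8s.conf</code> is an arbitrary choice, not part of the original steps):</p>

```shell
# Ensure the br_netfilter module is loaded so the key exists:
modprobe br_netfilter
# Persist the setting across reboots (run as root; file name is arbitrary):
cat <<EOF > /etc/sysctl.d/k8s.conf
net.bridge.bridge-nf-call-iptables = 1
EOF
# Reload all sysctl configuration files:
sysctl --system
```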

<p>Install <code class="language-plaintext highlighter-rouge">kubernetes-cni</code>:</p>
<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nv">$ </span>dnf <span class="nb">install </span>kubernetes-cni
</code></pre></div></div>

<p>Reboot the machine:</p>
<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nv">$ </span>shutdown <span class="nt">-r</span> now
</code></pre></div></div>

<p>On the master node, run:</p>
<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nv">$ </span>kubeadm init
</code></pre></div></div>

<p>Then deploy <code class="language-plaintext highlighter-rouge">weave-net</code>:</p>
<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nv">$ </span>kubectl apply <span class="nt">-f</span> <span class="s2">"https://cloud.weave.works/k8s/net?k8s-version=</span><span class="si">$(</span>kubectl version | <span class="nb">base64</span> | <span class="nb">tr</span> <span class="nt">-d</span> <span class="s1">'\n'</span><span class="si">)</span><span class="s2">&amp;env.IPALLOC_RANGE=10.32.0.0/16"</span>
</code></pre></div></div>

<p>On the worker nodes, run <code class="language-plaintext highlighter-rouge">kubeadm join</code> with the token that was returned by <code class="language-plaintext highlighter-rouge">kubeadm init</code> on the master node. This, as well as <code class="language-plaintext highlighter-rouge">kubeadm init</code> on the master node, can be reverted with <code class="language-plaintext highlighter-rouge">kubeadm reset</code>.</p>
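<p>For illustration, the join command printed by <code class="language-plaintext highlighter-rouge">kubeadm init</code> has roughly the following shape; the address, token and hash below are placeholders, not values from this setup:</p>

```shell
# On each worker node (all values below are placeholders):
kubeadm join 192.168.1.10:6443 --token abcdef.0123456789abcdef \
    --discovery-token-ca-cert-hash sha256:a1b2c3d4e5f60718293a4b5c6d7e8f901234567890abcdef1234567890abcdef

# If the token was lost or has expired, print a fresh join command
# on the master node:
kubeadm token create --print-join-command
```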

<p>To check that your cluster is up and running, inspect the nodes by running on the master node:</p>
<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nv">$ </span>kubectl get nodes
</code></pre></div></div>

<p>And the pods:</p>
<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nv">$ </span>kubectl get pods <span class="nt">-n</span> kube-system
</code></pre></div></div>

<p>To inspect the logs of <code class="language-plaintext highlighter-rouge">kubelet</code>, run:</p>
<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nv">$ </span>journalctl <span class="nt">-ur</span> kubelet.service
</code></pre></div></div>]]></content><author><name></name></author><summary type="html"><![CDATA[In this post I describe how to run a plain (vanilla) Kubernetes cluster that spreads across physical machines installed with Fedora 31.]]></summary></entry><entry><title type="html">muCommander</title><link href="http://ahadas.com/about-mucommander/" rel="alternate" type="text/html" title="muCommander" /><published>2018-08-09T00:00:00+00:00</published><updated>2018-08-09T00:00:00+00:00</updated><id>http://ahadas.com/about-mucommander</id><content type="html" xml:base="http://ahadas.com/about-mucommander/"><![CDATA[<p>Ten years ago (2008) I submitted my first contribution to an open source project named <a href="http://www.mucommander.com">muCommander</a> that I maintain to this day. This post provides a short description of the project, recent changes we have made, challenges we are facing, and some future plans.</p>

<h1 id="what">What?</h1>
<p><em><strong>muCommander is a lightweight, cross-platform file manager with a dual-pane interface. It runs on any operating system with Java support (Mac OS X, Windows, Linux, *BSD, Solaris…).</strong></em></p>

<p>In other words, muCommander is a long-standing (since 2002) open-source (GPLv3) file manager with a dual-pane interface (similar to that of <a href="https://en.wikipedia.org/wiki/Norton_Commander">Norton Commander</a>) that can run on all the mainstream operating systems.</p>

<h1 id="why">Why?</h1>
<p>First and foremost, muCommander supports various file formats (e.g., ZIP) and protocols (e.g., SMB). It complements common built-in file managers with additional file formats, like 7z, and capabilities, like on-the-fly editing of ZIP files on Mac OS X. Moreover, it eliminates the need to set up protocol-specific clients, such as an FTP-client.</p>

<p>Not only does muCommander support many file protocols, it also abstracts reads and writes with Java input/output streams. That way, reading from one remote file protocol and writing to another can be made very efficient. For instance, it can read files from an FTP server and write them to an SMB server without writing them to a temporary persistent location. This is done by reading a portion of the file from an input stream connected to the FTP server and immediately writing it to an output stream connected to the SMB server.</p>

<p>These great capabilities are provided on all mainstream operating systems (OS). By having its core and most of its functionality written in Java, muCommander is cross-platform. Except for some OS-specific features that use native code (e.g., moving files to the trash), everything is implemented in Java. For developers this is significant as it conforms to the principle of <em>write once, run anywhere</em>.</p>

<p>Another benefit of being implemented in Java is that muCommander can leverage the large variety of third-party client-side libraries for different file protocols.</p>

<h1 id="recent-changes">Recent changes</h1>
<p>Honestly, the <a href="https://github.com/mucommander/mucommander/pulse">project pulse</a> has not been that great recently. I will touch on that later. Nevertheless, some important changes have been made.</p>

<h2 id="technical">Technical</h2>
<ul>
  <li>Converted the code repository to Git.</li>
  <li>Moved the project to <a href="https://github.com/mucommander">GitHub</a>.</li>
  <li>Replaced the build system with Gradle.</li>
  <li><a href="https://github.com/mucommander/mucommander/pull/158">Enabled compiling the code with Java 9 and Java 10</a>.</li>
  <li>Started managing translations in the <a href="https://translate.zanata.org/project/view/mucommander">Zanata platform</a>.</li>
  <li>Fixed various bugs.</li>
</ul>

<h2 id="communal">Communal</h2>
<ul>
  <li>Updated the <a href="http://www.mucommander.com">website</a>.</li>
  <li>Revived the <a href="https://twitter.com/mucommander">twitter account</a>.</li>
  <li>Use <a href="https://gitter.im/mucommander/Lobby">Gitter</a>.</li>
</ul>

<h1 id="challenges">Challenges</h1>
<p>Next, I describe the four major challenges I see at the moment that hinder the progress of the project.</p>

<h2 id="scaling-the-development-model">Scaling the development model</h2>
<p>From time to time we get some really nice contributions, as well as issues that users report. However, we currently have a fairly large codebase that is hard to maintain by a single maintainer. As a result, PRs and issues occasionally wait a relatively long time to get attention.</p>

<h2 id="communicating-with-the-community">Communicating with the community</h2>
<p>The contributions we get and issues that are being filed are a good sign as they show that both developers and end-users are interested in and using muCommander.</p>

<p>We used to communicate with the developer and user communities via a <a href="https://groups.google.com/d/forum/mucommander-dev">Google group</a>, a <a href="http://mu-j.com/mucommander/forums/">Forum</a>, and an <a href="irc://irc.freenode.net/mucommander">IRC channel</a>. All of these are practically abandoned. Both GitHub issues/PRs and Gitter seem to provide good alternatives for communicating with developers. The tools we currently use may not feel like a good replacement for the Google group and the Forum when it comes to getting feedback and ideas from users, though that may also be a consequence of the relatively low overall traffic in the project these days.</p>

<h2 id="competitive-products-and-projects">Competitive products and projects</h2>
<p>There is a large variety of alternative file managers nowadays. Some of them target a specific operating system and are thus sometimes faster and better integrated (the most interesting are probably those that target Mac OS X, which is used by most of our user base). Others are backed by commercial organizations; these are typically proprietary products that provide base functionality for free along with other paid capabilities.</p>

<p>Another type of alternative product comprises those that have forked from muCommander in the past. Some forks that we have been in contact with complained that the development pace of the project is too slow. That is a shame, since migrating features from these forks back to muCommander is not always trivial and it would have been much more productive to join forces in a single project.</p>

<h2 id="technical-debts">Technical debt</h2>
<p>Some recently introduced features exposed gaps in our application. Here, I describe two of them:</p>

<ol>
  <li>As mentioned before, muCommander was designed to be a <strong>lightweight</strong> file manager, something one can deploy on a minimal USB stick and execute on different machines. That is why we use <em>proguard</em> to shrink the jar that is produced. Nonetheless, the size of the produced jar has increased due to features such as <a href="https://yuval.kohavi.info/vsphere/">supporting the vSphere VMs file system</a>, which brought with it a dependency of 3.3 MB.</li>
  <li>When considering something like supporting the qcow2 volume format (and other virtual disk formats, such as vmdk) using libguestfs, we encounter two issues. First, libguestfs is not available on all operating systems. As mentioned before, muCommander already includes OS-specific bits, but in this case it would mean shipping unused dependencies on many operating systems. Second, it requires not only the Java dependency (the Java bindings for libguestfs) but also library code installed on the OS (libguestfs itself). We currently have no way to specify such dependencies.</li>
</ol>

<h1 id="so-what-should-we-do">So what should we do?</h1>
<p>Next, I will share some thoughts on how to address the aforementioned challenges.</p>

<h2 id="redefine-the-mission-statement">Redefine the “mission statement”</h2>
<p>The development of muCommander started more than 15 years ago. That is a pretty long time in software terms, so some of the assumptions we began with may no longer be relevant. For example, the smallest USB stick today holds at least 1 GB, and internet connections are much faster. So considering that muCommander is unlikely to run on devices with limited resources (such as mobile phones), it would probably be alright to produce a larger application. As another example, today much more data is stored on cloud services such as Dropbox and Google Drive. Supporting such services may nowadays be more important to end-users than things like an advanced integrated text editor.</p>

<p>So I think this may be the right time to reconsider the goal of muCommander: what its strengths are, what it should provide, why users should continue using it and why developers should continue contributing to it.</p>

<h2 id="pluggable-design">Pluggable design</h2>
<p>In my opinion, the most important technical change at the moment is introducing a pluggable framework. First, it can target only extensions for new file formats and file protocols. Later, it can be extended with other types of extensions.</p>

<p>Such a pluggable framework would enable us to:</p>

<ul>
  <li>Separate out heavy file format or protocol implementations from the codebase.</li>
  <li>Install OS-specific extensions only when they are needed.</li>
  <li>Define external dependencies per-extension.</li>
</ul>

<p>The implementation of a pluggable mechanism for muCommander <a href="https://groups.google.com/d/msg/mucommander-dev/-IfxXALXo4U/CJKrhA6A1aYJ">was already discussed in the past</a>. A natural infrastructure to use for this would be OSGi.</p>

<h2 id="promoting-the-project">Promoting the project</h2>
<p>With an up-to-date “mission statement” and a pluggable framework available, we should then promote the project. There are several ways to do this, like:</p>

<ul>
  <li>Presenting in a conference, such as FOSDEM or DevConf.cz.</li>
  <li>Writing an article to a known website, such as <a href="https://opensource.com">opensource.com</a>.</li>
  <li>Defining a list of some desired features (such as PDF viewer, integrating dagger 2, etc) and submitting the project to GSoC (Google Summer of Code).</li>
</ul>

<h2 id="reassess-gitter-for-user-discussions">Reassess Gitter for user discussions</h2>
<p>When things are back on track, we can reassess Gitter as a tool for communicating with end-users. It may be a good idea to schedule a bi-weekly (or a monthly) conference meeting to discuss topics related to the project.</p>]]></content><author><name></name></author><summary type="html"><![CDATA[Ten years ago (2008) I submitted my first contribution to an open source project named muCommander that I maintain to this day. This post provides a short description of the project, recent changes we have made, challenges we are facing, and some future plans.]]></summary></entry><entry><title type="html">Addressing Abstraction Hell in oVirt</title><link href="http://ahadas.com/engine-xml/" rel="alternate" type="text/html" title="Addressing Abstraction Hell in oVirt" /><published>2017-11-01T00:00:00+00:00</published><updated>2017-11-01T00:00:00+00:00</updated><id>http://ahadas.com/engine-xml</id><content type="html" xml:base="http://ahadas.com/engine-xml/"><![CDATA[<p>When I was a kid we used to play this game in which someone thought of a message and told it to a friend. That friend told the message to another kid, and so on. Eventually, we were amused by the difference between the original message and the one the last kid heard. This post describes our approach to addressing a similar problem caused by having too many abstraction layers in oVirt. Each layer converted its input in order to report it to the next layer “in its own words”, resulting in cumbersome and error-prone business flows in our platform.</p>

<h1 id="background">Background</h1>
<p>oVirt originated from a management platform developed by a <a href="https://en.wikipedia.org/wiki/Qumranet">startup company that created the KVM hypervisor</a>. Back in its early days, not only was it implemented in a different technology than the one we use today, it was also designed differently.</p>

<p>Initially, the agent that resides on the distributed hosts, VDSM, was expected to interact directly with the hypervisor. Another protocol was defined for the communication between the central management unit (called ovirt-engine nowadays) and VDSM.</p>

<p>Consequently, flows that included interaction with the hypervisor required two data conversions. Figure 1 depicts such a flow. The management unit had its own representation of business entities that front-end clients used in order to communicate with the back-end. The management unit needed to convert those entities to the language VDSM speaks, which is dictionary-based. VDSM, in turn, needed to convert the dictionary it received into the set of parameters that conform with the language of the hypervisor.</p>

<p><img src="../images/engine_xml/without_libvirt.png" alt="Architecture Before Using Libvirt" /><br />
Figure 1: Architecture Before Using Libvirt</p>

<p>Later, Libvirt got into the picture. Libvirt provides an API for managing virtualization hosts. Its API is the de-facto standard in the industry, supporting a variety of hypervisors such as Hyper-V, Xen, ESX and QEMU. Although oVirt focuses on qemu-kvm, it was a natural decision to leverage this simpler, more general and widely supported API despite the downside of adding yet another abstraction layer.</p>

<h1 id="problem">Problem</h1>

<p>And so, an additional layer was added. Figure 2 depicts the previously mentioned flow with the new design. Now, VDSM converts its input into the language that Libvirt speaks (which is mostly XML-based) and Libvirt converts that into the language of the hypervisor.</p>

<p><img src="../images/engine_xml/with-libvirt.png" alt="Architecture with Libvirt" /><br />
Figure 2: Architecture with Libvirt</p>

<p>Since Libvirt is treated as a third-party tool in oVirt, we were left with two conversions within the scope of our platform. First, ovirt-engine converts its business entities into a dictionary. Second, VDSM converts the dictionary into Libvirt’s XML.</p>

<p>The need to convert data twice, in two different components, introduced several challenges. First, it required coding in two different programming languages, as ovirt-engine is written in Java while VDSM is written in Python. In practice, several developers were often involved in each feature, each responsible for a part of the implementation within a certain component, and they constantly had to stay in sync. Second, flows were buggier and harder to debug, since more code was spread over different repositories and deployed in different places. Third, and perhaps most importantly, the development process required reviews by the different people who maintain the different components. This generally slowed development down, mainly because the review process in VDSM has traditionally been slower due to its maintainership model.</p>

<h1 id="approach">Approach</h1>
<p>We observed that many of our features required changes on the client side (UI or REST-API based clients) as well as on the back-end side, but only minimal changes on the agent side. The agent-side part mostly consisted of converting data into Libvirt’s XML.</p>

<p>Thus, by letting ovirt-engine speak the language of Libvirt rather than that of VDSM, i.e., converting its business entities directly into Libvirt’s XML, we not only use a more standard API in ovirt-engine but also often avoid any change in VDSM. Furthermore, this reduces the chance of hacks being made on the host side and increases the chance that the data representation in ovirt-engine stays well aligned with the one in Libvirt.</p>

<h1 id="implementation">Implementation</h1>
<p>We modified ovirt-engine both to generate Libvirt’s Domain XML when running a VM and to consume Libvirt’s Domain XML when monitoring the devices of a VM.</p>

<p>In the run-VM flow (figure 3), ovirt-engine now generates a full Libvirt Domain XML. Since the dictionary used to contain data that VDSM requires but that is not part of the Domain XML specification, the XML is extended with a metadata section that carries this kind of data. VDSM, in turn, inspects the metadata section to gather the information it needs to prepare the host for running the VM (e.g., creating payload devices, activating LVs) and then passes the XML to Libvirt.</p>

<p><img src="../images/engine_xml/run_vm.png" alt="Run VM Flow" /><br />
Figure 3: Run VM Flow</p>
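<p>To make the metadata mechanism concrete, here is a small Python sketch of how an agent could read engine-specific data out of a Domain XML before handing the XML to Libvirt. The element names, namespace URI and fields below are made up for illustration; they are not the exact ones oVirt uses.</p>

```python
import xml.etree.ElementTree as ET

# A domain XML extended with a metadata section (illustrative names only).
DOMAIN_XML = """
<domain type="kvm">
  <name>vm1</name>
  <metadata>
    <ovirt:vm xmlns:ovirt="http://example.org/ovirt/vm">
      <ovirt:destroy_on_reboot>false</ovirt:destroy_on_reboot>
      <ovirt:custom_property>value</ovirt:custom_property>
    </ovirt:vm>
  </metadata>
  <devices/>
</domain>
"""

NS = {"ovirt": "http://example.org/ovirt/vm"}

def read_ovirt_metadata(domain_xml):
    """Return the engine-specific metadata as a plain dict, the way an
    agent could inspect it before passing the XML on to Libvirt."""
    root = ET.fromstring(domain_xml)
    md = root.find("metadata/ovirt:vm", NS)
    if md is None:
        return {}
    # Strip the '{namespace}' prefix that ElementTree puts on tag names.
    return {child.tag.split("}", 1)[1]: child.text for child in md}

meta = read_ovirt_metadata(DOMAIN_XML)
```

<p>The rest of the Domain XML stays standard, and Libvirt leaves such application-namespaced metadata to the application that wrote it.</p>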

<p>In the monitoring process (figure 4), ovirt-engine now queries the Domain XML of the running VMs whose devices hash has changed. ovirt-engine then matches the reported devices with the ones in the database (the kind of matching that VDSM used to do) and passes the reported devices, along with their correlation to the devices in the database, as a dictionary to the legacy devices-monitoring code. This is done in a conversion layer (a class) that was added to ovirt-engine.</p>

<p><img src="../images/engine_xml/devices_monitoring.png" alt="VM Devices Monitoring Flow" /><br />
Figure 4: VM Devices Monitoring Flow</p>
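<p>The correlation step can be sketched as follows. This is a simplified stand-in for the actual matching logic, keyed only by device id, with made-up field names.</p>

```python
def correlate_devices(reported, in_db):
    """Split host-reported devices into ones that match a database row,
    ones that are new to the engine, and database rows that are no
    longer reported (illustrative logic, not oVirt's actual rules)."""
    db_by_id = {d["id"]: d for d in in_db}
    matched, new = [], []
    reported_ids = set()
    for dev in reported:
        reported_ids.add(dev["id"])
        if dev["id"] in db_by_id:
            matched.append((dev, db_by_id[dev["id"]]))
        else:
            new.append(dev)  # appeared on the host, unknown to the engine
    missing = [d for d in in_db if d["id"] not in reported_ids]
    return matched, new, missing

reported = [{"id": "a", "type": "disk"}, {"id": "c", "type": "nic"}]
in_db = [{"id": "a", "type": "disk"}, {"id": "b", "type": "video"}]
matched, new, missing = correlate_devices(reported, in_db)
```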

<h1 id="summary">Summary</h1>
<p>In oVirt 4.2 we made a major change behind the scenes in the way ovirt-engine and VDSM interact. Now, ovirt-engine uses the commonly used API of Libvirt, and VDSM mostly routes data from/to Libvirt in virt (i.e., VM-lifecycle) flows. This eliminates the costly (in terms of development effort) conversion of the data by VDSM in these flows.</p>

<p>This change is supposed to be invisible to most of our users. However, when debugging issues that involve running a VM, one should be aware that the Domain XML is now generated by ovirt-engine and printed to engine.log. Likewise, when debugging issues with devices monitoring, one should be aware that it is now ovirt-engine, rather than VDSM, that matches the reported devices with the ones in the database.</p>

<p>We have already been able to add new functionality to oVirt 4.2 more easily with this change and we expect it to also simplify future changes.</p>]]></content><author><name></name></author><summary type="html"><![CDATA[When I was a kid we used to play this game in which someone thought about a message and told it to his friend. The latter told the message he heard to another kid, and so on. Eventually, we were amused to see the difference between the original message and the one that the last kid has heard. This post describes our approach for addressing a similar problem that was caused by having too many abstraction layers in oVirt. Each layer converted its input in order to report it to the next layer “in its own words”, resulting in cumbersome and error-prone business flows in our platform.]]></summary></entry><entry><title type="html">External DSL for API Specification in oVirt</title><link href="http://ahadas.com/dsl-for-api-spec-in-ovirt/" rel="alternate" type="text/html" title="External DSL for API Specification in oVirt" /><published>2016-10-19T00:00:00+00:00</published><updated>2016-10-19T00:00:00+00:00</updated><id>http://ahadas.com/dsl-for-api-spec-in-ovirt</id><content type="html" xml:base="http://ahadas.com/dsl-for-api-spec-in-ovirt/"><![CDATA[<p>Few weeks ago I attended a session on the new API specification in oVirt. While the motivation was well explained and the overall design of the solution made a lot of sense, the presented language made me wonder whether it is the best <em>language</em> for the problem at hand. In this post I argue we can achieve a better language for the API specification in oVirt by using an external DSL rather than an internal DSL.</p>

<h1 id="background">Background</h1>
<p>Let’s start with a brief overview of what domain-specific languages are and the difference between internal and external domain-specific languages, and describe the tools that are available today for creating and using external domain-specific languages.</p>

<h2 id="domain-specific-language">Domain Specific Language</h2>
<p>A domain-specific language (DSL) is a programming language that is tailored to a particular problem domain. Unlike general-purpose languages (such as Java, C, C#), DSLs do not aim at being Turing-complete. They generally provide syntax that is more declarative, concise and restrictive, at the expense of reusability.</p>

<p>The concept of DSLs is not new. Most probably, if you are a developer, you have already programmed with a DSL. Some notable examples of DSLs are HTML for creating webpages, SQL for interacting with databases and MATLAB for matrix programming.</p>

<h2 id="internal-vs-external-dsl">Internal vs External DSL</h2>
<p>Martin Fowler <a href="http://martinfowler.com/books/dsl.html">distinguishes between two types of DSLs</a>: internal and external. An internal DSL is a particular form of API in a general purpose host language (e.g., fluent API), while an external DSL is parsed independently and is not a part of a host language.</p>

<p>There is a clear trade-off between using internal and external DSLs. On the one hand, it is generally easier to create internal DSLs since one can leverage the parser and compilation tools of the host language. Moreover, one can leverage editing tools intended for the host language while programming with the internal DSL. On the other hand, internal DSLs are limited by the syntax and structure of the host language, which often results in more complicated languages to program with compared to external DSLs.</p>

<h2 id="language-workbench">Language Workbench</h2>
<p>Fowler noted language workbenches (LWs) as a possible <a href="http://www.martinfowler.com/articles/languageWorkbench.html">killer-app for DSLs</a>. These are tools that address the Achilles’ heel of external DSLs by facilitating their creation and use.</p>

<p>Today, LWs are typically based on mainstream IDEs. They provide tool support for defining the grammar of the DSL using some grammar-definition format (from which the parser and editing tools are generated), and for defining the semantics of the DSL using some code-transformation format (so DSL code can be transformed into code in a general-purpose language in order to leverage the compilation tools of the latter).</p>

<p>Some notable production-ready language workbenches that are available today are: <a href="http://www.eclipse.org/Xtext/">Xtext</a> and <a href="http://metaborg.org">Spoofax</a> that are based on Eclipse, and <a href="https://www.jetbrains.com/mps/">MPS</a> that is based on IntelliJ. I find Xtext to be the most practical LW nowadays among the ones mentioned above thanks to its ability to generate plugins for programming with the DSL in both Eclipse and IntelliJ, the <a href="https://eclipse.org/Xtext/documentation/305_xbase.html">integration one can achieve with Java</a> and the fact that it does not make use of <a href="http://martinfowler.com/bliki/ProjectionalEditing.html">projectional editing</a>.</p>

<h1 id="problem">Problem</h1>
<p>The problem we were trying to solve in oVirt 4.0 was twofold:</p>

<h2 id="dependencies">Dependencies</h2>
<p>oVirt provides several software development kits (SDKs) for different languages: Java, Python and, recently, also Ruby. These SDKs interact with oVirt-engine, the central management unit, through a REST-API interface.</p>

<p>Previously, the specification of the REST-API interface was integrated into the oVirt-engine project (figure 1). That led to two issues. First, the SDKs depended on a fat artifact that contained more than just the specification. Second, we could publish this artifact only when a new version of oVirt-engine was released.</p>

<p><img src="../images/ovirt_api/ovirt-api-arch-before.png" alt="Architecture with oVirt API v3" /><br />
Figure 1: Architecture with oVirt API v3</p>

<h2 id="documentation">Documentation</h2>
<p>While it was possible to document the specification on top of the Java implementation of the REST-API interface using Javadoc, many parts were missing or not up-to-date.</p>

<h1 id="current-solution-based-on-internal-dsl">Current Solution based on Internal DSL</h1>
<p>The solution that was presented in version 4 of the API consisted of two parts. First, there was an architectural design change: the specification of the API was extracted into a separate project. The use of a separate project with its own source code repository allows other projects, like the SDKs, to depend only on the specification artifact (figure 2), and allows publishing new versions of the API specification independently. This solves the first part of the problem, related to dependencies.</p>

<p><img src="../images/ovirt_api/ovirt-api-arch-after.png" alt="Architecture with oVirt API v4" /><br />
Figure 2: Architecture with oVirt API v4</p>

<p>Second, a new language was introduced to express the API specification in order to ease its documentation. This language is an internal DSL with Java as the host language.</p>

<p>In this language, data types are represented by Java interfaces and documentation is provided in the form of Javadoc comments. For example, the <code class="language-plaintext highlighter-rouge">Vm.java</code> file contains the specification of the <code class="language-plaintext highlighter-rouge">Vm</code> entity, which looks like this:</p>

<div class="language-java highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="cm">/**
 * Represents a virtual machine.
 */</span>
<span class="nd">@Type</span>
<span class="kd">public</span> <span class="kd">interface</span> <span class="nc">Vm</span> <span class="kd">extends</span> <span class="nc">VmBase</span> <span class="o">{</span>
   <span class="cm">/**
     * Contains the reason why this virtual machine was stopped. This reason is
     * provided by the user, via the GUI or via the API.
     */</span>
    <span class="nc">String</span> <span class="nf">stopReason</span><span class="o">();</span>
    <span class="nc">Date</span> <span class="nf">startTime</span><span class="o">();</span>
    <span class="nc">Date</span> <span class="nf">stopTime</span><span class="o">();</span>
    <span class="o">...</span>
<span class="o">}</span>
</code></pre></div></div>
<p>Services are represented in a similar way:</p>

<div class="language-java highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="cm">/**
 * This service manages a specific virtual machine.
 */</span>
<span class="nd">@Service</span>
<span class="kd">public</span> <span class="kd">interface</span> <span class="nc">VmService</span> <span class="kd">extends</span> <span class="nc">MeasurableService</span> <span class="o">{</span>

    <span class="cm">/**
     * This operation will start the virtual machine managed by this
     * service, if it isn't already running.
     */</span>
    <span class="kd">interface</span> <span class="nc">Start</span> <span class="o">{</span>
        <span class="cm">/**
         * Specifies if the virtual machine should be started in pause
         * mode. It is an optional parameter, if not given then the
         * virtual machine will be started normally.
         */</span>
        <span class="nd">@In</span> <span class="nc">Boolean</span> <span class="nf">pause</span><span class="o">();</span>
        <span class="o">...</span>
    <span class="o">}</span>
    <span class="o">...</span>
<span class="o">}</span>
</code></pre></div></div>

<p>More about the current language can be found <a href="https://github.com/oVirt/ovirt-engine-api-model/blob/master/README.adoc">here</a>.</p>

<h1 id="enhanced-solution-with-external-dsl">Enhanced Solution with External DSL</h1>
<p>The solution proposed in this post leaves the first part, the architectural design change, as is. That is, the API specification stays as a separate project. The difference is in the second part, namely the language introduced for the API specification, where an external DSL is used rather than an internal DSL.</p>

<p>An example of how the <code class="language-plaintext highlighter-rouge">Vm</code> entity mentioned before can be defined with an external DSL:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>'Represents a virtual machine.'
Type Vm : VmBase {
'Contains the reason why this virtual machine was stopped.
 This reason is provided by the user, via the GUI or via the API'
stopReason :: String;

TODO
startTime :: Date;

TODO
stopTime :: Date;
...
}
</code></pre></div></div>

<p>And <code class="language-plaintext highlighter-rouge">VmService</code> can be defined like this:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>"This service manages a specific virtual machine."
Service VmService : MeasurableService {
"This operation will start the virtual machine managed by
 this service, if it isn't already running."
Start {
"Specifies if the virtual machine should be started in pause
 mode. It is an optional parameter, if not given then the
 virtual machine will be started normally."
In paused :: Boolean;
}
...
}
</code></pre></div></div>

<p>These code quotes could make you underestimate the effectiveness of programming with such a language, due to the lack of support for the presented language in Github’s markdown format. IDE plugins, in contrast, provide the developer with the standard editing tools that are available today, like auto-completion and syntax highlighting. The next video demonstrates how a part of the <code class="language-plaintext highlighter-rouge">VmService</code> shown above was written in Eclipse:</p>

<p><a href="http://www.youtube.com/watch?feature=player_embedded&amp;v=PQgsuF9CvOQ" target="_blank"><img src="http://img.youtube.com/vi/PQgsuF9CvOQ/0.jpg" alt="Developing a DSAL for synchronization in oVirt" width="240" height="180" border="10" /></a></p>

<p>The language definition can be found <a href="https://github.com/ahadas/ovirt-engine-api-lang/blob/master/org.ovirt.api.model/src/org/ovirt/api/model/Spec.xtext">here</a>. The language is defined in the grammar definition format provided by Xtext. For more details about this format and Xtext in general see <a href="https://eclipse.org/Xtext/documentation/index.html">the documentation on eclipse.org</a>.</p>

<p>As a proof of concept, the <code class="language-plaintext highlighter-rouge">Vm</code> entity and <code class="language-plaintext highlighter-rouge">VmService</code> were transformed into their implementations in the internal DSL shown before. This transformation, written in Xtend, a programming language provided by Xtext, can be found <a href="https://github.com/ahadas/ovirt-engine-api-lang/blob/master/org.ovirt.api.model/src/org/ovirt/api/model/generator/SpecGenerator.xtend">here</a>. Note that in production it would be better to transform them directly into the target representation of the specification, without going through the internal DSL. The full definitions of <code class="language-plaintext highlighter-rouge">Vm</code> and <code class="language-plaintext highlighter-rouge">VmService</code> can be found <a href="https://github.com/ahadas/ovirt-engine-api-model/blob/master/ovirt-engine-api-model/src/Vm.ospec">here</a> and <a href="https://github.com/ahadas/ovirt-engine-api-model/blob/master/ovirt-engine-api-model/src/VmService.ospec">here</a> (and the generated code <a href="https://github.com/ahadas/ovirt-engine-api-model/blob/master/ovirt-engine-api-model/src-gen/types/Vm.java">here</a> and <a href="https://github.com/ahadas/ovirt-engine-api-model/blob/master/ovirt-engine-api-model/src-gen/services/VmService.java">here</a>).</p>
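<p>To give a feel for what the generated parser produces, here is a toy, hand-written parser for the Type syntax shown earlier, written in plain Python with regular expressions. It is only an illustration: with Xtext the parser is generated from the grammar definition, and the transformation is written in Xtend.</p>

```python
import re

SNIPPET = """
'Represents a virtual machine.'
Type Vm : VmBase {
'Contains the reason why this virtual machine was stopped.'
stopReason :: String;
startTime :: Date;
stopTime :: Date;
}
"""

def parse_type(src):
    """Extract the type name, its parent and its fields from a Type
    declaration (a toy version of what a generated parser yields)."""
    header = re.search(r"Type\s+(\w+)\s*:\s*(\w+)", src)
    fields = re.findall(r"(\w+)\s*::\s*(\w+)\s*;", src)
    return header.group(1), header.group(2), dict(fields)

name, parent, fields = parse_type(SNIPPET)
```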

<h1 id="why-is-external-dsl-better">Why is External DSL Better</h1>
<p>So what is the big deal between using the internal DSL and the external DSL, you may ask. Their syntax is quite similar, and the latter does not let one express anything that cannot be expressed with the former. I will point out the benefits of using an external DSL by addressing things that came up in the session I mentioned at the beginning, and from my own experience working with both languages.</p>

<p>The argument for basing the presented language on Java (which makes it an internal DSL) was to make it possible to leverage Java tools. But as we have seen before, the same capabilities can also be achieved for an external DSL by using a language workbench.</p>

<p>In the mentioned session, someone asked whether the language itself or the tools developed for it (for its transformation, I believe) could be reused by other projects. The answer was positive. Questions about reuse are often raised in order to find a way to reduce the amortized development cost. However, when using a proper language workbench, one uses third-party tools for the language definition and its transformation that significantly simplify the language development, making it (typically) cost-effective even for one-time use. By taking reusability out of the equation, the language can be kept minimal and optimal for the particular instance of the problem at hand.</p>

<p>The presenter showed an example of adding a color to the <code class="language-plaintext highlighter-rouge">Vm</code> entity. Someone asked if the type <code class="language-plaintext highlighter-rouge">byte</code> could be used for the RGB values of the color instead of an <code class="language-plaintext highlighter-rouge">int</code>. The answer was negative. This is an example of a downside of basing the language on Java: one may try to use any type provided by Java, even unexpected ones. In contrast, the external DSL provides only the supported types. Note that one can still easily define that a particular field may be of any Java type, if needed.</p>

<p>That example exposed another downside of the internal DSL. The presenter typed most of the definition of the color without specifying a comment for that field. Then he asked “is something missing?” and, although that question seemed suspicious, most of the crowd answered that nothing was missing. Documentation in the form of Javadoc is optional and easy to forget. The fact that documentation in the internal DSL is provided as Javadoc means one could reach the code review phase without the required documentation, and the reviewer can easily miss it. In the external DSL, however, documentation is part of the language, so lack of documentation produces an error in the IDE (without checkstyle or any other plugin), making it impossible to forget to document.</p>

<p>Another question was: why should we use the <code class="language-plaintext highlighter-rouge">@Type</code> and <code class="language-plaintext highlighter-rouge">@Service</code> annotations when we can derive that information from the package the file is located in (either <code class="language-plaintext highlighter-rouge">types</code> or <code class="language-plaintext highlighter-rouge">services</code>)? I think this question can be generalized to: how can we reduce the boilerplate code? Looking at code written in the internal DSL, we see redundant syntax that is repeated again and again: the visibility of the types and services, the <code class="language-plaintext highlighter-rouge">interface</code> keyword near every action in a service, empty parentheses near every property of a data type, and so on. In contrast, code written in the external DSL contains much less boilerplate.</p>

<p>Moreover, having the language less coupled with Java makes it easier to work with for non-programmers. Typically, those who mainly work on documentation are not programmers. Simplifying the grammar and making it more declarative (specifically the documentation part, as we discuss next) makes it easier for them to contribute.</p>

<p>And lastly, besides being optional, Javadoc comments have no clear structure, which makes it difficult to understand the expected format of the documentation. For example, one is expected to write the date and status of a comment one adds or modifies. It is easy to forget to write the date and, unless the reviewer catches it, documentation can be merged without its date or with the date typed incorrectly (having @data instead of @date, for instance). As for the status, not only is it easy to forget to specify it, it is also unclear what its allowed values are, since it is free text. Unlike the internal DSL, the external DSL provides a clear structure for documentation as part of the language, forcing developers to provide all the required values and reducing the chance of issues caused by typos. An example of structured documentation in the external DSL:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>summary: 'This operation stops any migration of a virtual machine to another physical host.'
description: '
[source]
----
POST /ovirt-engine/api/vms/123/cancelmigration
----

The cancel migration action does not take any action specific parameters,
so the request body should contain an empty `action`:

[source,xml]
----
&lt;action/&gt;
----
'
author: 'Arik Hadas &lt;ahadas@redhat.com&gt;'
date: '14 Sep 2016'
status: added
CancelMigration {
 'Indicates if the migration should be cancelled asynchronously.'
 In async :: Boolean; 
}
</code></pre></div></div>

<p>One downside of using an external DSL is that contributors are required to use an IDE plugin in order to program with it. On the other hand, language workbenches like Xtext are able to produce plugins for mainstream IDEs, and mechanisms like Eclipse’s update-site can simplify the installation of such plugins.</p>

<h1 id="conclusion">Conclusion</h1>
<p>The advanced language workbenches that are available today greatly increase the attractiveness of external DSLs. One is able to program with languages that typically fit the problem at hand better than general-purpose languages and internal DSLs, without the high effort that was traditionally needed to create them and program with them due to the lack of tools.</p>

<p>This post presents an external DSL for the API specification of oVirt that was partially implemented using the Xtext language workbench. Hopefully this post would interest people that work on oVirt’s API specification and push them toward giving a chance for such a language instead of the internal DSL that was introduced in oVirt 4.0.</p>]]></content><author><name></name></author><summary type="html"><![CDATA[Few weeks ago I attended a session on the new API specification in oVirt. While the motivation was well explained and the overall design of the solution made a lot of sense, the presented language made me wonder whether it is the best language for the problem at hand. In this post I argue we can achieve a better language for the API specification in oVirt by using an external DSL rather than an internal DSL.]]></summary></entry><entry><title type="html">Monitoring Improvements in oVirt</title><link href="http://ahadas.com/monitoring-improvements-in-ovirt/" rel="alternate" type="text/html" title="Monitoring Improvements in oVirt" /><published>2016-07-24T00:00:00+00:00</published><updated>2016-07-24T00:00:00+00:00</updated><id>http://ahadas.com/monitoring-improvements-in-ovirt</id><content type="html" xml:base="http://ahadas.com/monitoring-improvements-in-ovirt/"><![CDATA[<p>Recently I’ve been working on improving the scalability of monitoring in oVirt. That is, how to make oVirt-engine, the central management unit in the oVirt management system, able to process and report changes in a growing number of virtual machines that are running in a data-center. In this post I elaborate on what we did and share some measurements.</p>

<h1 id="background">Background</h1>

<h2 id="monitoring-in-ovirt">Monitoring in oVirt</h2>
<p>In short, <a href="http://ovirt.org">oVirt</a> is an open-source management platform for virtual data-centers. It allows centralized management of a distributed system of virtual machines, compute, storage and networking resources.</p>

<p>In this post, the term <em>monitoring</em> refers to the mechanism by which oVirt-engine, the central component in the oVirt distributed system, collects runtime data from the hosts that virtual machines are running on and reports it to clients. Some examples of such runtime data:</p>

<ul>
  <li>Statuses of hosts and VMs</li>
  <li>Statistics such as memory and CPU consumption</li>
  <li>Information about devices that are attached to VMs and hosts</li>
</ul>

<h2 id="notable-changes-in-the-monitoring-code-in-the-past">Notable changes in the monitoring code in the past</h2>

<h3 id="unchangablebyvdsm">@UnchangableByVdsm</h3>
<p>Generally speaking, the monitoring gets runtime data reported from the hosts, compares it with the previously known data and processes the changes.<br />
In order to distinguish dynamic data that is reported by the hosts from dynamic data that is not, we added in oVirt 3.5 an annotation called UnchangableByVdsm that should be put on every field of the VmDynamic class that is not expected to be reported by the hosts. This was meant to eliminate redundant saves of unchanged runtime data to the database.</p>
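<p>The idea can be sketched in Python as follows (the real implementation is a Java annotation on fields of VmDynamic; the field names below are made up):</p>

```python
# Dynamic fields that the hosts never report; the Java code marks such
# fields with @UnchangableByVdsm -- here we simply list them.
UNCHANGEABLE_BY_VDSM = {"console_user_id", "stop_reason"}

def needs_save(stored, reported):
    """Return True only if a host-reported field actually changed, so
    unchanged runtime data is not redundantly written to the database."""
    for field, value in reported.items():
        if field in UNCHANGEABLE_BY_VDSM:
            continue  # not the host's data; ignore it in the comparison
        if stored.get(field) != value:
            return True
    return False

stored = {"status": "Up", "stop_reason": "maintenance", "cpu_usage": 10}
reported = {"status": "Up", "stop_reason": None, "cpu_usage": 10}
changed = needs_save(stored, reported)
```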

<h3 id="split-hosts-monitoring-and-vms-monitoring">Split hosts-monitoring and VMs-monitoring</h3>
<p>Previously, before monitoring a host we locked the host and released it only after the host-related information and all the information about the VMs running on the host had been processed. As a result, doing an operation on a running VM locked the host the VM was running on.<br />
A major change in oVirt 3.5 was refactoring the monitoring to use a per-VM lock instead of the host lock while processing VM runtime data. That reduced the time during which both the monitoring and the threads executing commands are locked.</p>

<h3 id="introduction-of-events-based-protocol-with-the-hosts">Introduction of events based protocol with the hosts</h3>
<p>A highly desirable enhancement came in oVirt 3.6, in which changes in VM runtime data are reported as events rather than by polling. In a typical data-center many of the VMs are ‘stable’, i.e. their status does not change much. In such an environment, this change reduces the amount of data sent on the wire and the unnecessary processing in oVirt-engine.<br />
Note that not all monitoring cycles were replaced with events: every 15 seconds (by default), oVirt-engine still polls the statuses of all VMs, including their statistics. These cycles are called ‘statistics cycles’.</p>

<h2 id="scope-of-this-work">Scope of this work</h2>
<p>An indication that monitoring is inefficient is when it works hard while the system is stable (I prefer the term ‘stable’ over ‘idle’ since the virtual machines could actually be in use). For example, when virtual machines don’t change (no operation is done on them and nothing in their environment changes), one could expect the monitoring to do nothing except process and persist statistics data (which is likely to change frequently).</p>

<p>Unfortunately, that was not the case in oVirt. The next figure shows the ‘self time’ of hot-spots in the interaction with the database in oVirt 3.6, during one hour in which an environment with one host and 6000 VMs was stable. I will elaborate on these numbers later on, but for now just note that the red color is the overall execution time of database queries/updates. The more red we see, the busier the monitoring is.</p>

<p><img src="http://ahadas.github.io/images/ovirt_scale/3.6-self_time.png" alt="Execution time of DB queries in stable 3.6 environment" /></p>

<p>This work continues the effort to improve the monitoring in oVirt mentioned in the previous sub-section, in order to address this particular problem. In the next section I elaborate on the changes we made that led to the reduced execution times shown in the next figure, for the same environment and the same duration (note how much less red!).</p>

<p><img src="http://ahadas.github.io/images/ovirt_scale/master-self_time.png" alt="Execution time of DB queries in stable 4.1 environment" /></p>

<p>This work:</p>

<ul>
  <li>Takes for granted that the monitoring in oVirt hinders its scalability.</li>
  <li>Does not change hosts monitoring.</li>
  <li>Does not refer to other optimizations we did that do not improve monitoring of a stable system.</li>
</ul>

<h1 id="changes">Changes</h1>

<h2 id="not-to-process-numa-nodes-data-when-not-needed">Not to process numa nodes data when not needed</h2>
<p>We saw that a significant effort was put into processing runtime data of numa nodes.<br />
In terms of CPU time, 8.1% (which is 235 seconds) was wasted on getting all the numa nodes of the host from the database and 5.9% (which is 170 seconds) was wasted on getting all the numa nodes of VMs from the database. The overall CPU time spent on processing numa node data reached 14.6%! This finding is similar to what we saw in a profiling dump we got for another scaled environment.<br />
In terms of database interaction, getting this information is relatively cheap (the following are average numbers, in microseconds):</p>

<ul>
  <li>261 to get numa nodes by host</li>
  <li>259 to get assigned numa nodes</li>
  <li>255 to get numa node CPU by host</li>
  <li>246 to get numa node CPU by VM</li>
  <li>242 to get numa nodes by VM</li>
</ul>

<p>But these queries are called many times, so the overall portion of these calls is significant:</p>

<ul>
  <li>Getting numa nodes by host - 3% (48,546 msec)</li>
  <li>Getting assigned numa nodes - 3% (48,201 msec)</li>
  <li>Getting numa node CPU by host - 3% (47,569 msec)</li>
  <li>Getting numa node CPU by VM - 2% (45,918 msec)</li>
  <li>Getting numa nodes by VM - 2% (45,041 msec)</li>
</ul>

<p>I used the term ‘wasted’ because my host did not report any VM-related information about NUMA nodes! So to improve this, we changed the analysis of VM data to skip processing (and fetching from the database) NUMA-related data when no such data is reported for a VM.</p>
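<p>The fix is essentially a guard clause in the per-VM analysis. Here is a minimal Python sketch (the actual oVirt code is Java, and the field and function names below are illustrative, not the real engine API):</p>

```python
class FakeDb:
    """Stand-in for the engine database that counts NUMA queries."""
    def __init__(self):
        self.numa_queries = 0

    def get_vm_numa_nodes(self, vm_id):
        self.numa_queries += 1
        return []  # rows for this VM's virtual NUMA nodes

def update_numa_statistics(current_nodes, reported):
    pass  # merge the reported runtime data into the current nodes

def analyze_vm(reported_stats, db):
    # The guard: fetch NUMA nodes from the database only when the
    # host actually reported NUMA runtime data for this VM.
    numa_data = reported_stats.get("numaNodeRuntimeInfo")
    if numa_data:
        current = db.get_vm_numa_nodes(reported_stats["vmId"])
        update_numa_statistics(current, numa_data)

db = FakeDb()
analyze_vm({"vmId": "vm-1"}, db)  # no NUMA data reported -> no query
analyze_vm({"vmId": "vm-2", "numaNodeRuntimeInfo": {"0": ["0"]}}, db)
print(db.numa_queries)  # 1
```

<p>With hosts that report no per-VM NUMA data (like mine), all five queries listed above simply disappear from the monitoring cycle.</p>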

<h2 id="memoizing-hosts-numa-nodes">Memoizing host’s NUMA nodes</h2>
<p>But we cannot assume that hosts do not report NUMA node data for the VMs. So another improvement was to reduce the number of queries for host-level NUMA node data - by querying it once per host instead of once per VM. That’s fine since this data does not change while we process data received from the host. We used the memoization technique to cache this information during the host monitoring cycle.</p>
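<p>The memoization can be sketched like this (again an illustrative Python sketch of the Java code, with hypothetical names):</p>

```python
class HostMonitoringCycle:
    """Memoizes host-level NUMA data for the duration of one cycle,
    so it is fetched once per host instead of once per VM."""
    def __init__(self, db, host_id):
        self.db = db
        self.host_id = host_id
        self._host_numa_nodes = None

    def host_numa_nodes(self):
        # Safe to cache: the host's NUMA topology cannot change while
        # we process a single batch of data received from the host.
        if self._host_numa_nodes is None:
            self._host_numa_nodes = self.db.get_numa_nodes_by_host(self.host_id)
        return self._host_numa_nodes

class FakeDb:
    def __init__(self):
        self.calls = 0
    def get_numa_nodes_by_host(self, host_id):
        self.calls += 1
        return ["node0", "node1"]

db = FakeDb()
cycle = HostMonitoringCycle(db, "host-1")
for _ in range(100):        # e.g. one lookup per monitored VM
    cycle.host_numa_nodes()
print(db.calls)  # 1
```

<p>Since the cache lives only for one monitoring cycle, there is no staleness concern across cycles.</p>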

<h2 id="cache-vm-jobs">Cache VM jobs</h2>
<p>Another surprising finding was that we put non-negligible effort into processing and updating VM jobs (that is, jobs that represent live snapshot merges) without having a single such job (the system is stable, remember?). <br />
It amounted to 3.8% (111 sec) of the overall CPU time and 3% (47,140 msec) of the overall database interactions.</p>

<p>Therefore, another layer of in-memory management of VM jobs was added. Only when this layer detects that information should be retrieved from the database (i.e., not all the data is cached) does it access the database.</p>
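<p>The key property of this layer is that in a stable system - where hosts report no jobs at all - it never needs to touch the database. A minimal sketch of that idea (names are illustrative, not the actual engine code):</p>

```python
class VmJobsCache:
    """In-memory layer over the VM jobs table: the database is accessed
    only when a reported job is missing from the cache."""
    def __init__(self, db):
        self.db = db
        self.known_jobs = {}  # vm_id -> set of job ids known to exist

    def jobs_for(self, vm_id, reported_job_ids):
        cached = self.known_jobs.get(vm_id, set())
        if set(reported_job_ids) <= cached:
            # Everything reported is already cached - in a stable system
            # (no live merges) this branch is always taken with no jobs.
            return set(reported_job_ids)
        fresh = self.db.get_vm_jobs(vm_id)  # cache miss: go to the database
        self.known_jobs[vm_id] = fresh
        return fresh & set(reported_job_ids)

class FakeDb:
    def __init__(self):
        self.queries = 0
    def get_vm_jobs(self, vm_id):
        self.queries += 1
        return {"job-1"}

db = FakeDb()
cache = VmJobsCache(db)
for _ in range(100):
    cache.jobs_for("vm-1", [])     # stable system: nothing reported
print(db.queries)  # 0
cache.jobs_for("vm-1", ["job-1"])  # a live merge appears -> one query
print(db.queries)  # 1
```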

<h2 id="reduce-number-of-updates-of-dynamic-vm-data">Reduce number of updates of dynamic VM data</h2>
<p>Despite the use of @UnchangableByVdsm, I discovered that VM dynamic data (which includes, for example, the VM’s status, the IP of the client console connected to it, and so on) was being updated. Again, no such update should occur in a stable system… The implications of this issue are significant because this is a per-VM operation, so the time it takes accumulates; in our environment it reached 6% (101 sec) of the overall database interactions.</p>

<p>To solve this, VmDynamic was modified. Some of the fields that should not be compared against the reported data were marked with @UnchangableByVdsm, and some fields for which VmStatistics is a more appropriate place were moved there.</p>
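<p>Conceptually, the comparison now skips the excluded fields, so an update is triggered only when a field VDSM is actually allowed to change differs. A sketch (the field names below are made up; in oVirt the exclusion is expressed with the @UnchangableByVdsm annotation on VmDynamic fields):</p>

```python
# Illustrative field names standing in for the annotated Java fields.
UNCHANGEABLE_BY_VDSM = {"console_ip", "run_on_vds"}

def needs_update(stored, reported):
    """True only if a field VDSM is allowed to change actually differs."""
    return any(
        stored[field] != reported.get(field, stored[field])
        for field in stored
        if field not in UNCHANGEABLE_BY_VDSM
    )

stored = {"status": "Up", "console_ip": "10.0.0.1", "run_on_vds": "host-1"}
print(needs_update(stored, {"status": "Up"}))                     # False
print(needs_update(stored, {"status": "Up", "console_ip": "x"}))  # False - excluded field
print(needs_update(stored, {"status": "Down"}))                   # True
```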

<h2 id="split-vm-devices-monitoring-from-vms-monitoring">Split VM devices monitoring from VMs monitoring</h2>
<p>Hosts report a hash of the devices of each VM, and the monitoring of the VMs used to compare this hash against the previously reported one, triggering a poll request for full VM data (which contains the device information) only when the hash changed. Not only did the code become more complicated by being tangled with other VM analysis computations, but a change in the hash also triggered an update of the whole VM dynamic data.</p>

<p>Therefore, we split the VM devices monitoring into a separate module that caches the device hashes, and by that reduced even further the number of updates of VM dynamic data.</p>
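<p>The separated module boils down to a small hash cache that decouples device polling from the rest of the VM analysis (an illustrative sketch, not the actual engine code):</p>

```python
class VmDevicesMonitor:
    """Separate devices-monitoring module: caches per-VM device hashes
    and polls full VM data only when the reported hash changes."""
    def __init__(self, poll_full_vm_data):
        self.hashes = {}
        self.poll_full_vm_data = poll_full_vm_data

    def on_devices_hash(self, vm_id, devices_hash):
        if self.hashes.get(vm_id) != devices_hash:
            self.hashes[vm_id] = devices_hash
            # Devices changed (or this is the first report) - only now
            # do we ask the host for the full VM data.
            self.poll_full_vm_data(vm_id)

polled = []
monitor = VmDevicesMonitor(polled.append)
for _ in range(10):
    monitor.on_devices_hash("vm-1", "abc123")  # unchanged hash: no polling
monitor.on_devices_hash("vm-1", "def456")      # changed: one more poll
print(polled)  # ['vm-1', 'vm-1'] - the initial report and the change
```

<p>Because the hash comparison lives outside the VM analysis, an unchanged hash no longer causes any write to VM dynamic data.</p>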

<h2 id="lighter-dedicated-monitoring-views">Lighter, dedicated monitoring views</h2>
<p>Another observation from analyzing hot spots in the database interactions was that one of the queries we spent a lot of time on was the one for getting the network interfaces of the monitored VMs. This is a relatively cheap query, only 678 microseconds on average, but it is called per VM and therefore accumulated to 8% (126 sec) of the overall database interactions.</p>

<p>The way to improve it was by introducing another query based on a lighter view of network interfaces that contains only the information needed by the monitoring.</p>

<p>This technique was also used to improve the query for all VMs running on a given host. The following output depicts how much lighter the new view (vms_monitoring_view) is than the view the monitoring previously used (vms):</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>engine=&gt; explain analyze select * from vms where run_on_vds='043f5638-f461-4d73-b62d-bc7ccc431429';
 Planning time: 2.947 ms  
 Execution time: 765.774 ms
engine=&gt; explain analyze select * from vms_monitoring_view where run_on_vds='043f5638-f461-4d73-b62d-bc7ccc431429';
 Planning time: 0.387 ms
 Execution time: 275.600 ms
</code></pre></div></div>

<p>This new view is used by the monitoring in oVirt 4.0, but as we will see later on, the monitoring in oVirt 4.1 no longer uses it. Still, this new view is used in several other places instead of the costly ‘vms’ view.</p>

<h2 id="in-memory-management-of-vm-statistics">In-memory management of VM statistics</h2>
<p>The main argument for persisting data in a database is the ability to store information that should be recoverable after a restart of the application. However, in oVirt the database is often used to share data between threads and processes. This badly affects performance.</p>

<p>VM statistics are a type of data that is not supposed to be recoverable after a restart of the application. Thus, one could expect them not to be persisted in the database. But in order to share the statistics with the threads that query VMs for clients and with the DWH, they used to be persisted.</p>

<p>As part of this work, VM statistics are no longer persisted in the database. They are now managed in-memory. Threads that query VMs for clients retrieve them from memory, and for the DWH we can dump the statistics at longer intervals to wherever it takes the statistics from. By not persisting the statistics, the number of saves to the database is reduced; in our environment these saves amounted to 2% (38,669 msec) of the overall database interactions. It also reduces the time it takes to query VMs for clients.</p>
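<p>The in-memory management plus the periodic dump for the DWH can be sketched as follows (illustrative names and dump policy; the real engine logic differs in detail):</p>

```python
class VmStatisticsStore:
    """VM statistics kept in memory only: the monitoring writes here,
    client queries read from here, and a dump toward the DWH happens
    once every N monitoring cycles instead of on every update."""
    def __init__(self, dump, cycles_per_dump=20):
        self.stats = {}             # vm_id -> latest statistics
        self.dump = dump            # callback that ships a snapshot to the DWH
        self.cycles_per_dump = cycles_per_dump
        self.cycle = 0

    def end_of_cycle(self, updates):
        self.stats.update(updates)  # no database write on the hot path
        self.cycle += 1
        if self.cycle % self.cycles_per_dump == 0:
            self.dump(dict(self.stats))

    def get(self, vm_id):           # what client-facing VM queries use
        return self.stats.get(vm_id)

dumps = []
store = VmStatisticsStore(dumps.append, cycles_per_dump=20)
for _ in range(40):
    store.end_of_cycle({"vm-1": {"cpu_usage": 3}})
print(store.get("vm-1"), len(dumps))  # {'cpu_usage': 3} 2
```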

<h2 id="query-only-vm-dynamic-data-for-vm-analysis">Query only VM dynamic data for VM analysis</h2>
<p>So ‘vms_monitoring_view’ turned out to be much more efficient than the ‘vms’ view, as it returns only the statistics, dynamic, and static information of the VM (without additional information that is stored in other tables).</p>

<p>But obviously querying only the dynamic data is much more efficient than using the vms_monitoring_view:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>engine=&gt; explain analyze select * from vms_monitoring_view where run_on_vds='043f5638-f461-4d73-b62d-bc7ccc431429';
 Planning time: 0.405 ms
 Execution time: 275.850 ms
engine=&gt; explain analyze select * from vm_dynamic where run_on_vds='043f5638-f461-4d73-b62d-bc7ccc431429';
 Planning time: 0.109 ms
 Execution time: 2.703 ms
</code></pre></div></div>
<p>So as part of this work, not only are VM statistics no longer queried from the database, but the monitoring also no longer queries the static data of the VM from the database. Each update goes through VmManager, which caches only the information needed by the monitoring, and the monitoring uses this cached data instead of getting it from the database. That way, only the dynamic data is queried from the database.</p>
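<p>To illustrate the effect of caching the static data in VmManager, here is a sketch showing the per-cycle database work shrinking to the dynamic row only (illustrative names; VmManager is the real class, the rest is made up):</p>

```python
class FakeDb:
    def __init__(self):
        self.static_queries = 0
        self.dynamic_queries = 0
    def get_vm_static(self, vm_id):
        self.static_queries += 1
        return {"name": "vm-1", "num_of_cpus": 2}
    def get_vm_dynamic(self, vm_id):
        self.dynamic_queries += 1
        return {"status": "Up"}

class VmManager:
    """Caches the static VM fields the monitoring needs, so each
    monitoring cycle only queries the vm_dynamic table."""
    def __init__(self, db, vm_id):
        self.db = db
        self.vm_id = vm_id
        self._static = None

    def static_data(self):
        if self._static is None:  # fetched once, then served from memory
            self._static = self.db.get_vm_static(self.vm_id)
        return self._static

    def monitoring_cycle_data(self):
        # per-cycle database work is reduced to the dynamic row only
        return self.db.get_vm_dynamic(self.vm_id), self.static_data()

db = FakeDb()
manager = VmManager(db, "vm-1")
for _ in range(10):
    manager.monitoring_cycle_data()
print(db.static_queries, db.dynamic_queries)  # 1 10
```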

<h2 id="eliminate-redundant-queries-by-vm-pools-monitoring">Eliminate redundant queries by VM pools monitoring</h2>
<p>Although not directly related to VM monitoring, the VM pool monitoring that is responsible for running prestarted VMs also affects the amount of work done in a stable system. As part of this work, the amount of database interaction by the VM pool monitoring in a system that doesn’t contain prestarted VMs was reduced.</p>

<h1 id="results">Results</h1>

<h2 id="cpu">CPU</h2>
<p>CPU view on oVirt 3.6:<br />
<img src="http://ahadas.github.io/images/ovirt_scale/3.6-cpu.png" alt="CPU view on 3.6" /></p>

<p>CPU view on oVirt master:<br />
<img src="http://ahadas.github.io/images/ovirt_scale/master-cpu.png" alt="CPU view on master" /></p>

<ul>
  <li>The total CPU time used in one hour by the monitoring was reduced from 2297 sec to 1789 sec.</li>
  <li>We spend significantly less time in the monitoring code - 814 sec instead of 1451 sec.
    <ul>
      <li>The processing time was reduced from 896 sec to 687 sec.</li>
      <li>The time it takes to persist changes to the database was reduced from 546 sec to 114 sec.</li>
    </ul>
  </li>
</ul>

<p>Additional insight:</p>

<ul>
  <li>The time spent on host monitoring increased from 40,730 msec in oVirt 3.6 to 53,447 msec when using ‘vms_monitoring_view’. Thus, in 4.0 it is probably even higher due to additional operations that were added.</li>
</ul>

<h2 id="database">Database</h2>
<p>Database hot spots on oVirt 3.6:<br />
<img src="http://ahadas.github.io/images/ovirt_scale/3.6-db.png" alt="DB hot-spots on 3.6" /></p>

<p>Database hot spots on oVirt master:<br />
<img src="http://ahadas.github.io/images/ovirt_scale/master-db.png" alt="DB hot-spots on master" /></p>

<ul>
  <li>The time to query the network interfaces of a VM was reduced from 678 microseconds on average to 282 microseconds, resulting in an overall improvement from 126 sec to 108 sec (it is called much more often; I believe this is because Postgres now caches it differently).</li>
  <li>The time it takes to query all the running VMs on a host was reduced from 3,539 msec on average (!) to 909 msec, resulting in an overall improvement from 113 sec to 59,130 msec, thanks to querying only the dynamic data.</li>
  <li>The time it took to save the dynamic data of the VMs was 101 sec (6%, 544 microseconds on average). On master, the dynamic data was not saved at all.</li>
  <li>All queries for numa nodes that were described before were not called on our environment.</li>
  <li>Same for the query of VM jobs.</li>
  <li>The update of VM statistics, which took 261 microseconds per call and 38,669 msec overall (2%) on oVirt 3.6, is not called anymore.</li>
  <li>Queries related to guest-agent data on network interfaces that we spent time on in oVirt 3.6 (insert: 319 microseconds on average, 59,493 msec overall, which is 3%; delete: 223 microseconds on average, 41,605 msec overall, which is 2%) were not called on oVirt master.</li>
</ul>

<p>More insights:</p>

<ul>
  <li>Despite making the ‘regular’ VMs query lighter (since it no longer queries VM statistics from the database), it takes significantly more time on oVirt master: 996 msec on average, while it used to be ~570 msec on average on oVirt 3.6.</li>
  <li>Updates of the dynamic data of disks also seem inefficient. Although it is a relatively cheap operation (143 microseconds on average), the fact that it is done per VM makes the overall time relatively high on master (4%), especially considering that these VMs had no disks.</li>
  <li>The overall time spent on querying VM network interfaces is still too high.</li>
  <li>A result I find hard to explain is shown in the following diagrams of executed database statements; it is probably an effect of caching in Postgres (which might explain the reduced memory consumption we will see later):</li>
</ul>

<p>Executed statements in oVirt 3.6:<br />
<img src="http://ahadas.github.io/images/ovirt_scale/3.6-statements.png" alt="db statements on 3.6" /></p>

<p>Executed statements in oVirt master:<br />
<img src="http://ahadas.github.io/images/ovirt_scale/master-statements.png" alt="db statements on master" /></p>

<h2 id="memory">Memory</h2>
<p>Memory consumption on oVirt 3.6:<br />
<img src="http://ahadas.github.io/images/ovirt_scale/3.6-memory.png" alt="memory on 3.6" /></p>

<p>Memory consumption on oVirt master:<br />
<img src="http://ahadas.github.io/images/ovirt_scale/master-memory.png" alt="memory on master" /></p>

<p>One can argue that in-memory management, like the one introduced for VM statistics, or in-memory layers over the database, like the one introduced for VM jobs, leads to high memory consumption.</p>

<p>Surprisingly, the memory consumption on master is lower than that seen on 3.6. While at peaks (right before the garbage collector cleans it up) the memory on oVirt 3.6 gets to ~1.45 GB, on oVirt master it gets to ~1.2 GB. That is probably thanks to other improvements, or to reduced caching by Postgres, which compensates for the higher memory consumption by the monitoring.</p>

<h1 id="possible-future-work">Possible future work</h1>

<ul>
  <li>Although I refer to the code that includes the described changes as the ‘master branch’, some of the changes are not yet merged, so this work is not yet complete.</li>
  <li>We need to investigate what makes the VMs query take much longer on the master branch.</li>
  <li>Another improvement could be to replace the ‘statistics cycles’ polling with events. This could also prevent theoretical issues we currently have in the monitoring code.</li>
  <li>In order to create the testing environment, I played a bit with an environment running 6000 VMs (using fake-VDSM). Doing that via the webadmin is currently very inconvenient; better UI support for batch operations is something to consider.</li>
  <li>Also, we had an effort to introduce batch operations on hosts (like Run VM). We could consider batch scheduling that would allow us to resume that effort.</li>
  <li>Introduce in-memory layers for network interface and dynamic disk data as well.</li>
  <li>Split VM dynamic data into runtime data, which is reported by VDSM, and other kinds of data, to prevent redundant updates from happening again.</li>
  <li>Cache VM dynamic data. We planned to do this for VM statuses, but we should consider doing it for other kinds of dynamic VM data as well.</li>
</ul>]]></content><author><name></name></author><summary type="html"><![CDATA[Recently I’ve been working on improving the scalability of monitoring in oVirt. That is, how to make oVirt-engine, the central management unit in the oVirt management system, able to process and report changes in a growing number of virtual machines that are running in a data-center. In this post I elaborate on what we did and share some measurements.]]></summary></entry></feed>