9. ANNEX 1: Troubleshooting
9.1. How to know the version of your current OSM installation
Run the following command to know the version of the OSM client and the OSM NBI:
osm version
In some circumstances, it could be useful to check the osm-devops package installed in your system, since osm-devops is the package used to drive installations:
dpkg -l osm-devops
||/ Name Version Architecture Description
+++-======================-=================-=====================-=====================================
ii osm-devops 8.0.0-1 all
To know the current version of the OSM client, you can also check the python3-osmclient package:
dpkg -l python3-osmclient
||/ Name Version Architecture Description
+++-======================-=================-=====================-=====================================
ii python3-osmclient 8.0.0-1 all
9.2. Troubleshooting installation
9.2.1. Recommended installation to facilitate troubleshooting
It is highly recommended to save a log of your installation:
./install_osm.sh 2>&1 | tee osm_install_log.txt
9.2.2. Recommended checks after installation
9.2.2.1. Checking whether all processes/services are running in K8s
kubectl -n osm get all
All the deployments and statefulsets should have 1 replica: 1/1
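If you want a single check that waits until everything is up, the following sketch can be used (it assumes the default osm namespace; statefulsets are not covered by the wait condition, so also review the pod list):
# Wait up to 5 minutes until all OSM deployments report the Available condition
kubectl -n osm wait --for=condition=Available deployment --all --timeout=300s
# Check that no pod is stuck in Pending or CrashLoopBackOff
kubectl -n osm get pods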
9.2.3. Issues on standard installation
9.2.3.1. Juju
9.2.3.1.1. Juju bootstrap hangs
If the Juju bootstrap takes a long time and appears stuck at this status…
Installing Juju agent on bootstrap instance
Fetching Juju GUI 2.14.0
Waiting for address
Attempting to connect to 10.71.22.78:22
Connected to 10.71.22.78
Running machine configuration script...
…it usually indicates that the LXD container with the Juju controller is having trouble connecting to the internet.
Get the name of the LXD container. It will begin with 'juju-' and end with '-0'.
lxc list
+-----------------+---------+---------------------+------+------------+-----------+
| NAME | STATE | IPV4 | IPV6 | TYPE | SNAPSHOTS |
+-----------------+---------+---------------------+------+------------+-----------+
| juju-0383f2-0 | RUNNING | 10.195.8.57 (eth0) | | PERSISTENT | |
+-----------------+---------+---------------------+------+------------+-----------+
Next, tail the output of cloud-init to see where the bootstrap is stuck.
lxc exec juju-0383f2-0 -- tail -f /var/log/cloud-init-output.log
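To confirm whether the container actually has Internet connectivity, a quick check (reusing the container name from the example above) could be:
# Check name resolution and outbound connectivity from inside the container
lxc exec juju-0383f2-0 -- ping -c 3 archive.ubuntu.com
# If name resolution fails, try a well-known IP address directly
lxc exec juju-0383f2-0 -- ping -c 3 8.8.8.8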
9.2.3.1.2. Is Juju running?
If running, you should see something like this:
$ juju status
Model Controller Cloud/Region Version SLA
default osm localhost/localhost 2.3.7 unsupported
9.2.3.1.3. ERROR controller osm already exists
Did the OSM installation fail during the Juju installation with an error like "ERROR controller osm already exists"?
$ ./install_osm.sh
...
ERROR controller "osm" already exists
ERROR try was stopped
### Jum Agu 24 15:19:33 WIB 2018 install_juju: FATAL error: Juju installation failed
BACKTRACE:
### FATAL /usr/share/osm-devops/jenkins/common/logging 39
### install_juju /usr/share/osm-devops/installers/full_install_osm.sh 564
### install_lightweight /usr/share/osm-devops/installers/full_install_osm.sh 741
### main /usr/share/osm-devops/installers/full_install_osm.sh 1033
Try to destroy the Juju controller and run the installation again:
$ juju destroy-controller osm --destroy-all-models -y
$ ./install_osm.sh
If it does not work, you can destroy the Juju container and run the installation again:
#Destroy the Juju container
lxc stop juju-*     # replace "*" with the actual container name
lxc delete juju-*   # replace "*" with the actual container name
#Unregister the controller since we’ve manually freed the resources associated with it
juju unregister -y osm
#Verify that there are no controllers
juju list-controllers
#Run the installation again
./install_osm.sh
9.2.3.1.4. No controllers registered
The following error appears when the user used for installation does not belong to the required groups (sudo, lxd, docker):
Finished installation of juju Password: sg: failed to crypt password with previous salt: Invalid argument ERROR No controllers registered.
To fix it, just add the non-root user used for installation to the sudo, lxd and docker groups.
9.2.3.2. LXD
9.2.3.2.1. ERROR profile default: /etc/default/lxd-bridge has IPv6 enabled
Make sure that you follow the instructions in the Quickstart.
When asked whether you want to proceed with the installation and configuration of LXD, Juju and Docker CE, and the initialization of a local Docker swarm as pre-requirements, please answer "y".
When dialog messages related to LXD configuration are shown, please answer in the following way:
Do you want to configure the LXD bridge? Yes
Do you want to setup an IPv4 subnet? Yes
<< Default values apply for next questions >>
Do you want to setup an IPv6 subnet? No
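If LXD was already initialized with IPv6 enabled, one possible way to disable it afterwards is sketched below (assuming a recent LXD version where networks are managed with lxc network, and that the bridge is the default lxdbr0):
# Disable IPv6 on the default LXD bridge
lxc network set lxdbr0 ipv6.address none
lxc network set lxdbr0 ipv6.nat false
# Verify the resulting bridge configuration
lxc network show lxdbr0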
9.2.3.3. Docker Swarm
9.2.3.3.1. network netosm could not be found
The error is: network "netosm" is declared as external, but could not be found. You need to create a swarm-scoped network before the stack is deployed.
It usually happens when a docker system prune is done with the stack stopped. The following script will create the network again:
#!/bin/bash
# Create OSM Docker Network ...
[ -z "$OSM_STACK_NAME" ] && OSM_STACK_NAME=osm
OSM_NETWORK_NAME=net${OSM_STACK_NAME}
echo Creating OSM Docker Network
DEFAULT_INTERFACE=$(route -n | awk '$1~/^0.0.0.0/ {print $8}')
DEFAULT_MTU=$(ip addr show $DEFAULT_INTERFACE | perl -ne 'if (/mtu\s(\d+)/) {print $1;}')
echo \# OSM_STACK_NAME = $OSM_STACK_NAME
echo \# OSM_NETWORK_NAME = $OSM_NETWORK_NAME
echo \# DEFAULT_INTERFACE = $DEFAULT_INTERFACE
echo \# DEFAULT_MTU = $DEFAULT_MTU
sg docker -c "docker network create --driver=overlay --attachable \
--opt com.docker.network.driver.mtu=${DEFAULT_MTU} \
${OSM_NETWORK_NAME}"
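Afterwards, you can verify that the network exists and is swarm-scoped (netosm with the default stack name):
docker network ls | grep netosm
docker network inspect netosm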
9.2.4. Issues on advanced installation (manual build of docker images)
9.2.4.1. Manual build of images. Were all docker images successfully built?
Although this is controlled by the installer, you can check that the following images exist:
$ docker image ls
REPOSITORY TAG IMAGE ID CREATED SIZE
osm/ng-ui latest 1988aa262a97 18 hours ago 710MB
osm/lcm latest c9ad59bf96aa 46 hours ago 667MB
osm/ro latest 812c987fcb16 46 hours ago 791MB
osm/nbi latest 584b4e0084a7 46 hours ago 497MB
osm/pm latest 1ad1e4099f52 46 hours ago 462MB
osm/mon latest b17efa3412e3 46 hours ago 725MB
wurstmeister/kafka latest 7cfc4e57966c 10 days ago 293MB
mysql 5 0d16d0a97dd1 2 weeks ago 372MB
mongo latest 14c497d5c758 3 weeks ago 366MB
wurstmeister/zookeeper latest 351aa00d2fe9 18 months ago 478MB
9.2.4.2. Docker image failed to build
9.2.4.2.1. Err:1 http://archive.ubuntu.com/ubuntu xenial InRelease
In some cases, DNS resolution works on the host but fails when building the Docker container. This happens when Docker cannot automatically determine the DNS server to use.
Check if the following works:
docker run busybox nslookup archive.ubuntu.com
If it does not work, you have to configure Docker to use the available DNS.
# Get the IP address you’re using for DNS:
nmcli dev show | grep 'IP4.DNS'
# Create a new file, /etc/docker/daemon.json, that contains the following (but replace the DNS IP address with the output from the previous step):
{
"dns": ["192.168.24.10"]
}
# Restart docker
sudo service docker restart
# Re-run
docker run busybox nslookup archive.ubuntu.com
# Now you should be able to re-run the installer and move past the DNS issue.
9.2.4.2.2. TypeError: unsupported operand type(s) for -=: 'Retry' and 'int'
In some cases, an MTU mismatch between the host and docker interfaces will cause this error while running pip. You can check this by running ifconfig (or ip link) and comparing the MTU of your host interface with that of the docker_gwbridge interface.
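For instance, a quick comparison of the two MTUs could look like this (ens3 is just an example; replace it with your default interface):
# MTU of the host's default interface
ip link show ens3 | grep -o 'mtu [0-9]*'
# MTU of the docker_gwbridge interface
ip link show docker_gwbridge | grep -o 'mtu [0-9]*'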
# Create a new file, /etc/docker/daemon.json, that contains the following (but replace the MTU value with that of your host interface from the previous step):
{
"mtu": 1458
}
# Restart docker
sudo service docker restart
9.3. Common issues with VIMs
9.3.1. Is the VIM URL reachable and operational?
When there are problems accessing the VIM URL, an error message similar to the following is shown after attempts to instantiate network services:
Error: "VIM Exception vimmconnConnectionException ConnectFailure: Unable to establish connection to <URL>"
In order to debug potential issues with the connection, in the case of an OpenStack VIM, you can install the OpenStack client in the OSM VM and run some basic tests. For instance:
# Install the OpenStack client
sudo apt-get install python-openstackclient
# Load your OpenStack credentials. For instance, if your credentials are saved in a file named 'myVIM-openrc.sh', you can load them with:
source myVIM-openrc.sh
# Test if the VIM API is operational with a simple command. For instance:
openstack image list
If the openstack client works, then make sure that you can reach the VIM from the RO container:
# If running OSM on top of docker swarm, go to the container in docker swarm
docker exec -it osm_ro.1.xxxxx bash
# If running OSM on top of K8s, go to the RO deployment in kubernetes
kubectl -n osm exec -it deployment/ro bash
curl <URL_CONTROLLER>
In some cases, the errors come from the fact that the VIM was added to OSM using names in the URL that are not Fully Qualified Domain Names (FQDN).
When adding a VIM to OSM, you must always use FQDNs or IP addresses. It must be noted that "controller" or similar names are not proper FQDNs (a domain suffix should be added). Non-FQDN names might be interpreted by Docker's dnsmasq as the name of a Docker container to be resolved, which is not the case. In addition, all the VIM endpoints should also be FQDNs or IP addresses, thus guaranteeing that all subsequent API calls can reach the appropriate endpoint.
Think of an NFV infrastructure with tens of VIMs: first you would have to use a different name for each controller (controller1, controller2, etc.), and then you would have to add all those entries to the /etc/hosts file of every machine interacting with the different VIMs, not only OSM. This is bad practice.
However, it is useful to have a means to work with lab environments that use non-FQDN names. There are three options; you are probably looking for the third one, but we recommend the first:
Option 1. Change the admin URL and/or public URL of the endpoints to use an IP address or an FQDN. You might find this interesting if you want to bring your OpenStack setup to production.
Option 2. Modify /etc/hosts in the RO docker container. This is not persistent after reboots or restarts.
Option 3a (for docker swarm). Modify /etc/osm/docker/docker-compose.yaml in the host, adding extra_hosts in the ro section with the entries that you want to add to /etc/hosts in the RO docker.
Option 3b (for kubernetes). Modify /etc/osm/docker/osm_pods/ro.yaml in the host, adding hostAliases in the spec section with the entries that you want to add to /etc/hosts in the RO docker.
With docker swarm, the modification of /etc/osm/docker/docker-compose.yaml would be:
ro:
  extra_hosts:
    controller: 1.2.3.4
Then:
docker stack rm osm
docker stack deploy -c /etc/osm/docker/docker-compose.yaml osm
With kubernetes, the procedure is very similar. The modification of /etc/osm/docker/osm_pods/ro.yaml would be:
...
spec:
  ...
  hostAliases:
  - ip: "1.2.3.4"
    hostnames:
    - "controller"
  ...
Then:
kubectl -n osm apply -f /etc/osm/docker/osm_pods/ro.yaml
This is persistent after reboots and restarts.
9.3.2. VIM authentication
What should I check if the VIM authentication is failing?
Typically, you will get the following error message:
Error: "VIM Exception vimconnUnexpectedResponse Unauthorized: The request you have made requieres authentication. (HTTP 401)"
If your OpenStack URL is based on HTTPS, OSM will by default check the authenticity of your VIM using the appropriate public certificate. The recommended way to solve this is by modifying /etc/osm/docker/docker-compose.yaml in the host, sharing the host file (e.g. /home/ubuntu/cafile.crt) by adding a volume to the ro section as follows:
ro:
  ...
  volumes:
    - /home/ubuntu/cafile.crt:/etc/osm/cafile.crt
Then, when creating the VIM, you should use the config option ca_cert as follows:
$ # Create the VIM with all the usual options, and add the config option to specify the certificate
$ osm vim-create VIM-NAME ... --config '{ca_cert: /etc/osm/cafile.crt}'
For casual testing, when adding the VIM account to OSM, you can use insecure: True as part of the VIM config parameters:
$ osm vim-create VIM-NAME ... --config '{insecure: True}'
9.3.3. Issues when trying to access VM from OSM
Is the VIM management network reachable from OSM (e.g. via ssh, port 22)?
The simplest check would consist of deploying a VM attached to the management network and trying to access it via e.g. ssh from the OSM host.
For instance, in the case of an OpenStack VIM you could try something like this:
$ openstack server create --image ubuntu --flavor m1.small --network mgmtnet test
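Once the VM is active, a possible follow-up check is to retrieve its IP address and try to reach it from the OSM host (ubuntu is just the default user of the image used in the example; the IP address is a placeholder):
$ openstack server show test -c addresses
$ ping -c 3 <VM_IP_ADDRESS>
$ ssh ubuntu@<VM_IP_ADDRESS>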
If this does not work, typically it is due to one of these issues:
Security group policy in your VIM is blocking your traffic (contact your admin to fix it)
IP address space in the management network is not routable from outside (or in the reverse direction, for the ACKs).
9.4. Common issues with VCA/Juju
9.4.1. Juju status shows pending objects after deleting a NS
In extraordinary situations, the output of juju status could show pending units that should have been removed when deleting a NS. In those situations, you can clean up VCA by following the procedure below:
juju status -m <NS_ID>
juju remove-application -m <NS_ID> <application>
juju resolved -m <NS_ID> <unit> --no-retry   # You'll likely have to run it several times, as it will probably have an error in the next queued hook. Once the last hook is marked resolved, the charm will continue its removal.
The following page also shows how to remove different Juju objects
9.4.2. Dump Juju Logs
To dump the Juju debug logs, run one of the following commands:
juju debug-log --replay --no-tail > juju-debug.log
juju debug-log --replay --no-tail -m <NS_ID>
juju debug-log --replay --no-tail -m <NS_ID> --include <UNIT>
9.4.3. Manual recovery of Juju
If Juju gets into a corrupt state and you cannot run juju status or contact the Juju controller, you might need to manually remove the controller and register it again, making OSM aware of the new controller.
# Stop and delete all juju containers, then unregister the controller
lxc list
lxc stop juju-* #replace "*" by the right values
lxc delete juju-* #replace "*" by the right values
juju unregister -y osm
# Create the controller again
sg lxd -c "juju bootstrap --bootstrap-series=xenial localhost osm"
# Get controller IP and update it in relevant OSM env files
controller_ip=$(juju show-controller osm|grep api-endpoints|awk -F\' '{print $2}'|awk -F\: '{print $1}')
sudo sed -i 's/^OSMMON_VCA_HOST.*$/OSMMON_VCA_HOST='$controller_ip'/' /etc/osm/docker/mon.env
sudo sed -i 's/^OSMLCM_VCA_HOST.*$/OSMLCM_VCA_HOST='$controller_ip'/' /etc/osm/docker/lcm.env
#Get juju password and feed it to OSM env files
function parse_juju_password {
password_file="${HOME}/.local/share/juju/accounts.yaml"
local controller_name=$1
local s='[[:space:]]*' w='[a-zA-Z0-9_-]*' fs=$(echo @|tr @ '\034')
sed -ne "s|^\($s\):|\1|" \
-e "s|^\($s\)\($w\)$s:$s[\"']\(.*\)[\"']$s\$|\1$fs\2$fs\3|p" \
-e "s|^\($s\)\($w\)$s:$s\(.*\)$s\$|\1$fs\2$fs\3|p" $password_file |
awk -F$fs -v controller=$controller_name '{
indent = length($1)/2;
vname[indent] = $2;
for (i in vname) {if (i > indent) {delete vname[i]}}
if (length($3) > 0) {
vn=""; for (i=0; i<indent; i++) {vn=(vn)(vname[i])("_")}
if (match(vn,controller) && match($2,"password")) {
printf("%s",$3);
}
}
}'
}
juju_password=$(parse_juju_password osm)
sudo sed -i 's/^OSMMON_VCA_SECRET.*$/OSMMON_VCA_SECRET='$juju_password'/' /etc/osm/docker/mon.env
sudo sed -i 's/^OSMLCM_VCA_SECRET.*$/OSMLCM_VCA_SECRET='$juju_password'/' /etc/osm/docker/lcm.env
juju_pubkey=$(cat $HOME/.local/share/juju/ssh/juju_id_rsa.pub)
# Use "|" as sed delimiter and double quotes, since the public key contains "/" and spaces
sudo sed -i "s|^OSMLCM_VCA_PUBKEY.*$|OSMLCM_VCA_PUBKEY=$juju_pubkey|" /etc/osm/docker/mon.env
sudo sed -i "s|^OSMLCM_VCA_PUBKEY.*$|OSMLCM_VCA_PUBKEY=$juju_pubkey|" /etc/osm/docker/lcm.env
#Restart OSM stack
docker stack rm osm
docker stack deploy -c /etc/osm/docker/docker-compose.yaml osm
9.4.4. Slow deployment of charms
You can make deployment of charms quicker by:
Upgrading your LXD installation to use ZFS: LXD configuration for OSM Release FIVE. After LXD re-installation, you might need to reinstall the Juju controller: Reinstall Juju controller.
Preventing Juju from running apt-get update && apt-get upgrade when starting a machine: Disable OS upgrades in charms (see the sketch after this list).
Building periodically a custom image that will be used as base image for all the charms: Custom base image for charms.
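Regarding the second point, a minimal sketch of how OS upgrades can be disabled at the Juju model level (assuming you want to apply it to the currently selected model) is:
# Prevent Juju from running OS upgrades and package index refreshes when provisioning machines
juju model-config enable-os-upgrade=false
juju model-config enable-os-refresh-update=false
Note that this trades up-to-date packages in the charm machines for faster deployments.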
9.5. Common instantiation errors
9.5.1. File juju_id_rsa.pub not found
ERROR:
ERROR creating VCA model name 'xxxx': Traceback (most recent call last): File "/usr/lib/python3/dist-packages/osm_lcm/ns.py", line 822, in instantiate await ... [Errno 2] No such file or directory: '/root/.local/share/juju/ssh/juju_id_rsa.pub'
CAUSE: Normally, a migration from Release FIVE does not properly set the environment for LCM.
SOLUTION: Ensure that the variable OSMLCM_VCA_PUBKEY is properly set in the file /etc/osm/docker/lcm.env. The value must match the output of the command cat $HOME/.local/share/juju/ssh/juju_id_rsa.pub. If not, add or change it. Then restart OSM, or just the LCM service with docker service update osm_lcm --force --env-add OSMLCM_VCA_PUBKEY=""
9.6. Common issues when interacting with NBI
9.6.1. SSL certificate problem
By default, the OSM installer uses a self-signed certificate for HTTPS. That might lead to the error 'SSL certificate problem: self signed certificate' on the client side. For testing environments, you might want to ignore this error just by using the appropriate options to skip certificate validation (e.g. --insecure for curl, --no-check-certificate for wget, etc.). However, for more stable setups you might prefer to address this issue by installing the appropriate certificate in your client system.
These are the steps to install NBI certificate on the client side (tested for Ubuntu):
Get the certificate file cert.pem by any of these means:
From the running docker container:
docker ps | grep nbi
docker cp <docker-id>:/app/NBI/osm_nbi/http/cert.pem .
From source code: NBI-folder/osm_nbi/http/cert.pem
From ETSI’s git:
wget -O cert.pem "https://osm.etsi.org/gitweb/?p=osm/NBI.git;a=blob_plain;f=osm_nbi/http/cert.pem;hb=refs/heads/v8.0"
Then, you should install this certificate:
sudo cp cert.pem /usr/local/share/ca-certificates/osm_nbi_cert.pem.crt
sudo update-ca-certificates   # 1 added, 0 removed; done
Add to /etc/hosts a host called "nbi" with the IP address where OSM is running. It can be localhost if client and server are the same machine. For localhost, you would need to add (or edit) this line:
127.0.0.1 localhost nbi
If OSM runs on a different machine, add instead:
OSM-ip nbi
Finally, for the URL, use nbi as the host name (i.e. https://nbi:9999/osm). Do not use localhost or 127.0.0.1. You can run a quick test with curl:
curl https://nbi:9999/osm/version
9.6.2. Cannot login after migration to 6.0.2
ERROR: NBI always returns "UNAUTHORIZED". You cannot log in, neither with the UI nor with the CLI. The CLI shows the error "can't find a default project for this user" or "project admin not allowed for this user".
CAUSE: Normally, after a migration to release 6.0.2 there is a slight incompatibility with users created in older versions.
SOLUTION: Delete the admin user and restart NBI so that a new compatible user is created, by running these commands:
curl --insecure https://localhost:9999/osm/test/db-clear/users
docker service update osm_nbi --force
9.7. Other operational issues
9.7.1. Running out of disk space
If you are upgrading your OSM installation frequently, you might find that your disk is running out of space. The reason is that previous docker containers and images might be consuming disk space. Running the following two commands should be enough to clean up your docker setup:
docker system prune
docker image prune
If you are still experiencing issues with disk space, the logs in one of the containers could be the cause. Check which containers are consuming the most space (typically kafka-exporter):
du -sk /var/lib/docker/containers/* |sort -n
docker ps |grep <CONTAINER_ID>
Then, remove the stack, do a prune, and deploy it again:
docker stack rm osm_metrics
docker system prune
docker image prune
docker stack deploy -c /etc/osm/docker/osm_metrics/docker-compose.yml osm_metrics
9.8. Logs
9.8.1. Checking the logs of OSM in Kubernetes
You can check the logs of any container with the following commands:
kubectl -n osm logs deployment/mon --all-containers=true
kubectl -n osm logs deployment/pol --all-containers=true
kubectl -n osm logs deployment/lcm --all-containers=true
kubectl -n osm logs deployment/nbi --all-containers=true
kubectl -n osm logs deployment/ng-ui --all-containers=true
kubectl -n osm logs deployment/ro --all-containers=true
kubectl -n osm logs deployment/grafana --all-containers=true
kubectl -n osm logs deployment/keystone --all-containers=true
kubectl -n osm logs statefulset/mysql --all-containers=true
kubectl -n osm logs statefulset/mongo --all-containers=true
kubectl -n osm logs statefulset/kafka --all-containers=true
kubectl -n osm logs statefulset/zookeeper --all-containers=true
kubectl -n osm logs statefulset/prometheus --all-containers=true
For live debugging, the following commands can be useful to save the log output to a file and show it on the screen:
kubectl -n osm logs -f deployment/mon --all-containers=true 2>&1 | tee mon-log.txt
kubectl -n osm logs -f deployment/pol --all-containers=true 2>&1 | tee pol-log.txt
kubectl -n osm logs -f deployment/lcm --all-containers=true 2>&1 | tee lcm-log.txt
kubectl -n osm logs -f deployment/nbi --all-containers=true 2>&1 | tee nbi-log.txt
kubectl -n osm logs -f deployment/ng-ui --all-containers=true 2>&1 | tee ng-log.txt
kubectl -n osm logs -f deployment/ro --all-containers=true 2>&1 | tee ro-log.txt
kubectl -n osm logs -f deployment/grafana --all-containers=true 2>&1 | tee grafana-log.txt
kubectl -n osm logs -f deployment/keystone --all-containers=true 2>&1 | tee keystone-log.txt
kubectl -n osm logs -f statefulset/mysql --all-containers=true 2>&1 | tee mysql-log.txt
kubectl -n osm logs -f statefulset/mongo --all-containers=true 2>&1 | tee mongo-log.txt
kubectl -n osm logs -f statefulset/kafka --all-containers=true 2>&1 | tee kafka-log.txt
kubectl -n osm logs -f statefulset/zookeeper --all-containers=true 2>&1 | tee zookeeper-log.txt
kubectl -n osm logs -f statefulset/prometheus --all-containers=true 2>&1 | tee prometheus-log.txt
9.8.2. Changing the log level
You can change the log level of any container by updating the container with the right LOG_LEVEL environment variable.
Log levels are:
ERROR
WARNING
INFO
DEBUG
For instance, to set the log level to INFO for the MON in a deployment of OSM over K8s:
kubectl -n osm set env deployment mon OSMMON_GLOBAL_LOGLEVEL=INFO
For instance, to increase the log level to DEBUG for the NBI in a deployment of OSM over docker swarm:
docker service update --env-add OSMNBI_LOG_LEVEL=DEBUG osm_nbi
9.9. How to report an issue
If you have bugs or issues to be reported, please use Bugzilla
If you have questions or feedback, feel free to contact us through:
the mailing list OSM_TECH@list.etsi.org
the Slack workspace
Please be patient. Answers may take a few days.
Please provide some context to your questions. As an example, find below some guidelines:
In case of an installation issue:
The full command used to run the installer and the full output of the installer (or at least enough context) might help in finding the solution.
It is highly recommended to run the installer command capturing standard output and standard error, so that you can send them for analysis if needed. E.g.:
./install_osm.sh 2>&1 | tee osm_install.log
In case of operational issues, the following information might help:
Version of OSM that you are using
Logs of the system. Check https://osm.etsi.org/wikipub/index.php/Common_issues_and_troubleshooting to know how to get them.
Details on the actions you made to get that error so that we could reproduce it.
IP network details in order to help troubleshoot potential network issues. For instance:
Client IP address (browser, command line client, etc.) from where you are trying to access OSM
IP address of the machine where OSM is running
IP addresses of the containers
NAT rules in the machine where OSM is running
Common sense applies here, so you don’t need to send everything, but just enough information to diagnose the issue and find a proper solution.
9.10. How to troubleshoot issues in the new Service Assurance architecture
Since OSM Release FOURTEEN, the Service Assurance architecture is based on Apache Airflow and Prometheus. The Airflow DAGs, in addition to periodically collecting metrics from VIMs and storing them in Prometheus, implement auto-scaling and auto-healing closed-loop operations, which are triggered by Prometheus alerts. These alerts are managed by AlertManager and forwarded to the Webhook Translator, which re-formats them to match the webhook endpoints expected by Airflow. So the alert workflow is this: DAGs collect metrics => Prometheus => AlertManager => Webhook Translator => alarm-driven DAG
In case of any kind of error related to monitoring, the first thing to check should be the metrics stored in Prometheus. Its graphical interface can be visited at the URL http://$IP:9091/. Some useful metrics to review are the following (see the example queries after the list):
ns_topology: metric generated by a DAG with the current topology (VNFs and NSs) of instantiated VDUs in OSM.
vm_status: status (1: ok, 0: error) of the VMs in the VIMs registered in OSM.
vm_status_extended: metric enriched from the two previous ones, so it includes data about the VNF and NS the VM belongs to as part of the metric labels.
osm_*: resource consumption metrics. Only instantiated VNFs that include monitoring parameters have these kinds of metrics in Prometheus.
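For example, the following PromQL expressions (entered in the Prometheus web UI, and assuming the metric names listed above are present) can help to spot problems quickly:
# VMs currently reported in error state, with NS/VNF information in the labels
vm_status_extended == 0
# All resource consumption metrics collected for instantiated VNFs
{__name__=~"osm_.*"}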
In case you need to debug closed-loop operations, you will also need to check the Prometheus alerts at http://$IP:9091/alerts. On this page you can see the alerting rules and their status: inactive, pending or active. When an alert is fired (its status changes from pending to active) or is marked as resolved (from active to inactive), the appropriate DAG is run on Airflow. There are three types of alerting rules (see the example query after the list):
vdu_down: this alert is fired when a VDU remains in a not-OK state for several minutes and triggers the alert_vdu DAG. Its labels include information about NS, VNF, VIM, etc.
scalein_*: these rules manage scale-in operations based on the resource consumption metrics and the number of VDU instances. They trigger the scalein_vdu DAG.
scaleout_*: these rules manage scale-out operations based on the resource consumption metrics and the number of VDU instances. They trigger the scaleout_vdu DAG.
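Apart from the web UI, the currently firing alerts can also be queried from the command line through the Prometheus HTTP API (a sketch assuming Prometheus is exposed on port 9091, as above):
curl -s http://$IP:9091/api/v1/alerts
# If jq is installed, show only the alert names and their state
curl -s http://$IP:9091/api/v1/alerts | jq '.data.alerts[] | {alertname: .labels.alertname, state: .state}'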
Finally, it is also interesting for debugging to be able to view the logs of the DAG executions. To do this, you must visit the Airflow web UI, which is accessible on the port exposed by the airflow-webserver service in OSM's cluster (not a fixed port):
kubectl -n osm get svc airflow-webserver
NAME TYPE CLUSTER-IP EXTERNAL-IP PORT(S) AGE
airflow-webserver NodePort 10.100.57.168 <none> 8080:19371/TCP 12d
When you open the URL http://$IP:port (19371 in the example above) in a browser, you will be prompted for the user and password (admin/admin by default). After that you will see the dashboard with the list of DAGs:
alert_vdu: it is executed when a VDU down alarm is fired or resolved.
scalein_vdu, scaleout_vdu: executed when auto-scaling conditions in a VNF are met.
ns_topology: this DAG is executed periodically to update in Prometheus the topology metric of the instantiated NSs.
vim_status_*: there is one such DAG for each VIM in OSM. It checks the VIM's reachability every few minutes.
vm_status_vim_*: these DAGs (one per VIM) get the VM status from the VIM and store it in Prometheus.
vm_metrics_vim_*: these DAGs (one per VIM) store in Prometheus the resource consumption metrics from the VIM.
The logs of the executions can be accessed by clicking on the corresponding DAG in the dashboard and then selecting the required date and time in the grid. Each DAG has a set of tasks, and each task has its own logs.
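If the Airflow web UI is not reachable, the Airflow pods can also be inspected directly with kubectl (the exact pod names depend on the Airflow deployment, so list them first):
# Find the Airflow pods deployed in the osm namespace
kubectl -n osm get pods | grep airflow
# Check the logs of a given pod (replace with the actual pod name)
kubectl -n osm logs <airflow-pod-name>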