10. ANNEX 1: Troubleshooting

10.1. How to know the version of your current OSM installation

Run the following command to check the versions of the OSM client and the OSM NBI:

osm version
Server version: 17.0.0.post12+g194ced9 2020-04-17
Client version: 17.0.0+geffca72

In some circumstances, it can be useful to check the version of the osm-devops package installed in your system, since osm-devops is the package used to drive installations:

dpkg -l osm-devops

||/ Name                     Version            Architecture          Description
+++-======================-=================-=====================-=====================================
ii  osm-devops             17.0.0-1          all

To check the current version of the OSM client, you can also look up the python3-osmclient package:

dpkg -l python3-osmclient
||/ Name                     Version            Architecture          Description
+++-======================-=================-=====================-=====================================
ii  python3-osmclient      17.0.0-1          all
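
To compare the client and server versions from a script, the output of osm version can be parsed. The following is a minimal sketch that assumes the output format shown in the sample above; osm_base_version and osm_versions_match are hypothetical helper names, not part of the OSM client:

```shell
# Print the MAJOR.MINOR.PATCH part of the "Server" or "Client" line read from stdin.
# $1: "Server" or "Client"
osm_base_version() {
  sed -n "s/^$1 version: \([0-9][0-9]*\.[0-9][0-9]*\.[0-9][0-9]*\).*/\1/p"
}

# Succeed when both lines of an `osm version` output ($1) carry the same base version.
osm_versions_match() {
  osm_server=$(printf '%s\n' "$1" | osm_base_version Server)
  osm_client=$(printf '%s\n' "$1" | osm_base_version Client)
  [ -n "$osm_server" ] && [ "$osm_server" = "$osm_client" ]
}
```

For instance, osm_versions_match "$(osm version)" || echo "Client and server versions differ" would flag a mismatch. Only the numeric prefix is compared, so build suffixes such as .post12 are ignored.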

10.2. Logs

10.2.1. Checking the logs of OSM in Kubernetes

You can check the logs of any container with the following commands:

kubectl -n osm logs deployment/nbi --all-containers=true
kubectl -n osm logs deployment/lcm --all-containers=true
kubectl -n osm logs deployment/ro --all-containers=true
kubectl -n osm logs deployment/ngui --all-containers=true
kubectl -n osm logs deployment/mon --all-containers=true
kubectl -n osm logs deployment/grafana --all-containers=true
kubectl -n osm logs statefulset/mongodb-k8s --all-containers=true
kubectl -n osm logs statefulset/kafka-controller --all-containers=true
kubectl -n osm logs statefulset/prometheus --all-containers=true

For live debugging, the following commands save the log output to a file while also showing it on the screen:

kubectl -n osm logs -f deployment/nbi --all-containers=true 2>&1 | tee nbi-log.txt
kubectl -n osm logs -f deployment/lcm --all-containers=true 2>&1 | tee lcm-log.txt
kubectl -n osm logs -f deployment/ro --all-containers=true 2>&1 | tee ro-log.txt
kubectl -n osm logs -f deployment/ngui --all-containers=true 2>&1 | tee ngui-log.txt
kubectl -n osm logs -f deployment/mon --all-containers=true 2>&1 | tee mon-log.txt
kubectl -n osm logs -f deployment/grafana --all-containers=true 2>&1 | tee grafana-log.txt
kubectl -n osm logs -f statefulset/mongodb-k8s --all-containers=true 2>&1 | tee mongo-log.txt
kubectl -n osm logs -f statefulset/kafka-controller --all-containers=true 2>&1 | tee kafka-log.txt
kubectl -n osm logs -f statefulset/prometheus --all-containers=true 2>&1 | tee prometheus-log.txt
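
When a one-off snapshot of all the logs is enough (for example, to attach them to a bug report), the commands above can be wrapped in a loop. This is only a sketch: the function name osm_collect_logs, the default directory osm-logs and the <component>.log file names are assumptions, while the list of deployments and statefulsets is the one used above:

```shell
# Save a one-off snapshot of the logs of every OSM component into a directory.
# $1: output directory (defaults to ./osm-logs)
osm_collect_logs() {
  osm_logdir=${1:-osm-logs}
  mkdir -p "$osm_logdir" || return 1
  for osm_target in \
    deployment/nbi deployment/lcm deployment/ro deployment/ngui \
    deployment/mon deployment/grafana statefulset/mongodb-k8s \
    statefulset/kafka-controller statefulset/prometheus
  do
    osm_name=${osm_target#*/}   # e.g. "deployment/nbi" -> "nbi"
    kubectl -n osm logs "$osm_target" --all-containers=true \
      > "$osm_logdir/$osm_name.log" 2>&1 \
      || echo "Could not get logs of $osm_target" >&2
  done
}
```

Calling osm_collect_logs /tmp/osm-logs would leave one file per component in that directory. A failure for one component is reported on standard error and does not stop the loop.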

10.2.2. Changing the log level

You can change the log level of any container by updating the container with the right LOG_LEVEL environment variable.

Log levels are:

ERROR
WARNING
INFO
DEBUG

For instance, to set the log level to INFO for the LCM in a deployment of OSM over K8s:

LOGLEVEL="INFO"
kubectl patch configmap osm-lcm-configmap -n osm --type='merge' -p '{"data":{"OSMLCM_GLOBAL_LOGLEVEL":"'${LOGLEVEL}'"}}'
kubectl get configmap osm-lcm-configmap -n osm -o yaml
kubectl -n osm rollout restart deployment lcm
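
To avoid typos in the log level, the patch body can be generated by a small helper that only accepts the four levels listed above. This is a sketch for the LCM only; lcm_loglevel_patch is a hypothetical name:

```shell
# Print the JSON merge patch that sets the LCM log level.
# $1: one of ERROR, WARNING, INFO, DEBUG; any other value is rejected.
lcm_loglevel_patch() {
  case "$1" in
    ERROR|WARNING|INFO|DEBUG)
      printf '{"data":{"OSMLCM_GLOBAL_LOGLEVEL":"%s"}}\n' "$1"
      ;;
    *)
      echo "Unsupported log level: $1" >&2
      return 1
      ;;
  esac
}
```

It could then be used as kubectl patch configmap osm-lcm-configmap -n osm --type='merge' -p "$(lcm_loglevel_patch DEBUG)", followed by the rollout restart shown above.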

10.2.3. Debugging Kafka

To connect to Kafka bus and print the received messages:

kubectl -n osm exec -it kafka-controller-0 -- kafka-console-consumer.sh --bootstrap-server localhost:9092 --whitelist '.*' --formatter kafka.tools.DefaultMessageFormatter --property print.timestamp=true --property print.key=true --property print.value=true

10.2.4. Debugging MongoDB

To connect to MongoDB and run commands:

kubectl -n osm exec -it pod/mongodb-k8s-0 -- mongosh

use osm;
db.getCollectionNames()
db.k8sclusters.find().pretty()
db.k8sclusters.deleteOne({"_id":"21323ef6-23ec-4f33-8171-dcc863aa9832"})
db.okas.find().pretty()
db.okas.find({}, { _id: 1, name: 1}).pretty()
db.okas.find({}, { _id: 1, name: 1, _admin: {usageState: 1}}).pretty()
db.okas.find({ "_admin.usageState": "IN_USE" }, { _id: 1, name: 1, "_admin.usageState": 1 }).pretty()
db.okas.updateOne(
  { name: "oka_name" }, // Filter: the document to update
  { $set: { field_to_update: "new_value" } } // Update: the field and its new value
)
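
For quick read-only checks, a single expression can also be run without opening an interactive session, using the --quiet and --eval options of mongosh. The wrapper below is a sketch; osm_mongo_eval is a hypothetical name, and it assumes the same pod and database names used above:

```shell
# Run a single mongosh expression against the "osm" database and print the result.
# $1: JavaScript expression, e.g. 'db.getCollectionNames()'
osm_mongo_eval() {
  kubectl -n osm exec pod/mongodb-k8s-0 -- mongosh osm --quiet --eval "$1"
}
```

For example, osm_mongo_eval 'db.okas.find({}, { _id: 1, name: 1 }).toArray()' would print the identifiers and names of the stored OKAs.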

10.3. Troubleshooting installation

10.3.1. Recommended installation to facilitate troubleshooting

It is highly recommended to save a log of your installation:

./install_osm.sh 2>&1 | tee osm_install_log.txt

10.3.2. Recommended checks after installation

10.3.2.1. Checking whether all processes/services are running in K8s

kubectl -n osm get all

All the deployments and statefulsets should have 1 replica: 1/1
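
The replica check can be automated by filtering the READY column. The sketch below assumes the default kubectl output for deployments and statefulsets, where the second column has the form ready/desired; osm_not_ready is a hypothetical name:

```shell
# Read `kubectl get ... --no-headers` lines from stdin and print the resources
# whose READY column (second field, "ready/desired") is not complete.
osm_not_ready() {
  awk '{ split($2, r, "/"); if (r[1] != r[2]) print $1, $2 }'
}
```

A possible use is kubectl -n osm get deployments,statefulsets --no-headers | osm_not_ready, where an empty output means that every deployment and statefulset has all its replicas ready.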

10.4. How to troubleshoot issues in the new Service Assurance architecture

Since OSM Release FOURTEEN, the Service Assurance architecture is based on Apache Airflow and Prometheus. The Airflow DAGs, in addition to periodically collecting metrics from VIMs and storing them into Prometheus, implement auto-scaling and auto-healing closed-loop operations which are triggered by Prometheus alerts. These alerts are managed by AlertManager and forwarded to Webhook Translator, which re-formats them to adapt to Airflow expected webhook endpoints. So the alert workflow is this: DAGs collect metrics => Prometheus => AlertManager => Webhook Translator => Alarm driven DAG

In case of any kind of error related to monitoring, the first thing to check should be the metrics stored in Prometheus. Its graphical interface can be visited at the URL http://$IP:9091/. Some useful metrics to review are the following:

ns_topology: metric generated by a DAG with the current topology (VNFs and NSs) of instantiated VDUs in OSM.
vm_status: status (1: ok, 0: error) of the VMs in the VIMs registered in OSM.
vm_status_extended: metric enriched from the two previous ones, so it includes data about VNF and NS the VM belongs to as part of the metric labels.
osm_*: resource consumption metrics. Only instantiated VNFs that include monitoring parameters have this kind of metric in Prometheus.
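
These metrics can also be queried from the command line through the standard Prometheus HTTP API (/api/v1/query). The following sketch assumes that Prometheus is reachable at the same http://$IP:9091 address as its graphical interface; prom_query is a hypothetical helper name:

```shell
# Run an instant PromQL query against the Prometheus HTTP API and print the JSON reply.
# $1: Prometheus base URL, e.g. http://$IP:9091
# $2: PromQL expression
prom_query() {
  curl -sS -G "${1%/}/api/v1/query" --data-urlencode "query=$2"
}
```

For example, prom_query "http://$IP:9091" 'vm_status == 0' would return the VMs currently reported in error status, and prom_query "http://$IP:9091" 'ns_topology' the current topology metric.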

In case you need to debug closed-loop operations, you will also need to check the Prometheus alerts at http://$IP:9091/alerts. On this page you can see the alerting rules and their status: inactive, pending or active. When an alert is fired (its status changes from pending to active) or is marked as resolved (from active to inactive), the appropriate DAG is run on Airflow. There are three types of alerting rules:

vdu_down: this alert is fired when a VDU remains in a not OK state for several minutes and triggers alert_vdu DAG. Its labels include information about NS, VNF, VIM, etc.
scalein_*: these rules manage scale-in operations based on the resource consumption metrics and the number of VDU instances. They trigger scalein_vdu DAG.
scaleout_*: these rules manage scale-out operations based on the resource consumption metrics and the number of VDU instances. They trigger scaleout_vdu DAG.

Finally, the logs of the DAG executions are also useful for debugging. To view them, visit the Airflow web UI, which is accessible on the node port exposed by the airflow-webserver service in OSM’s cluster (not a fixed port):

kubectl -n osm get svc airflow-webserver
NAME                TYPE       CLUSTER-IP      EXTERNAL-IP   PORT(S)          AGE
airflow-webserver   NodePort   10.100.57.168   <none>        8080:19371/TCP   12d
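
Since the port is not fixed, it can be read from the service with a jsonpath expression instead of from the table. This is a sketch that assumes the service exposes a single port, as in the output above; airflow_url is a hypothetical name:

```shell
# Print the Airflow web UI URL for a given node IP address.
# $1: IP address of a cluster node
airflow_url() {
  airflow_port=$(kubectl -n osm get svc airflow-webserver \
    -o jsonpath='{.spec.ports[0].nodePort}') || return 1
  [ -n "$airflow_port" ] || return 1
  printf 'http://%s:%s\n' "$1" "$airflow_port"
}
```

Running airflow_url "$IP" prints the URL to open in the browser.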

When you open the URL http://$IP:port (19371 in the example above) in a browser, you will be prompted for the user and password (admin/admin by default). After that you will see the dashboard with the list of DAGs:

alert_vdu: it is executed when a VDU down alarm is fired or resolved.
scalein_vdu, scaleout_vdu: executed when auto-scaling conditions in a VNF are met.
ns_topology: this DAG is executed periodically for updating the topology metric in Prometheus of the instantiated NS.
vim_status_*: there is one such DAG for each VIM in OSM. It checks VIM’s reachability every few minutes.
vm_status_vim_*: these DAGs (one per VIM) get VM status from VIM and store them in Prometheus.
vm_metrics_vim_*: these DAGs (one per VIM) store in Prometheus resource consumption metrics from VIM.

The logs of the executions can be accessed by clicking on the corresponding DAG in dashboard and then selecting the required date and time in the grid. Each DAG has a set of tasks, and each task has its own logs.

10.5. Checking workflows in new OSM declarative framework

Since Release SIXTEEN, operations involve launching an ArgoWorkflows workflow, which will end up with a commit being created in a Git repo.

Be aware that workflows are automatically cleaned up after some time, so it is recommended to check them while the operation is running or a few seconds after it finishes.

10.5.1. How to expose ArgoWorkflows UI

# Get the kubeconfig and copy to your local machine
# Then, from your local machine
export KUBECONFIG=~/kubeconfig-osm.yaml
kubectl -n argo port-forward deployment/argo-server 2746:2746

Access Argo UI from web browser: https://localhost:2746. Then click on the workflow, then on the step, then on “Logs”.

10.5.2. How to check a workflow with kubectl

export KUBECONFIG=~/kubeconfig-osm.yaml
kubectl -n osm-workflows get workflows
kubectl -n osm-workflows get workflows/${WORKFLOW_NAME}
kubectl -n osm-workflows get workflows/${WORKFLOW_NAME} -o json
kubectl -n osm-workflows get workflows/${WORKFLOW_NAME} -o jsonpath='{.status.conditions}' | jq -r '.[] | select(.type=="Completed").status'
watch kubectl -n osm-workflows get workflows
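
To block a script until a workflow finishes, its .status.phase field can be polled. The sketch below relies on the standard Argo Workflows phases (Pending, Running, Succeeded, Failed, Error); the function name wait_for_workflow and its default retry values are assumptions:

```shell
# Poll a workflow until it reaches a final phase.
# $1: workflow name   $2: maximum number of polls (default 60)
# $3: seconds between polls (default 5)
# Returns 0 on Succeeded, 1 on Failed/Error, 2 if no final phase was seen.
wait_for_workflow() {
  wf_name=$1
  wf_tries=${2:-60}
  wf_interval=${3:-5}
  while [ "$wf_tries" -gt 0 ]; do
    wf_phase=$(kubectl -n osm-workflows get "workflows/$wf_name" \
      -o jsonpath='{.status.phase}' 2>/dev/null) || wf_phase=""
    case "$wf_phase" in
      Succeeded)    echo "$wf_phase"; return 0 ;;
      Failed|Error) echo "$wf_phase"; return 1 ;;
    esac
    wf_tries=$((wf_tries - 1))
    sleep "$wf_interval"
  done
  echo "Timeout"
  return 2
}
```

For example, wait_for_workflow "${WORKFLOW_NAME}" 120 5 would wait for up to ten minutes. If the workflow has already been cleaned up, the phase cannot be read and the function ends with the Timeout result.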

10.5.3. How to check a workflow with argo CLI

export KUBECONFIG=~/kubeconfig-osm.yaml
argo list -n osm-workflows
argo get -n osm-workflows @latest
argo watch -n osm-workflows @latest
argo logs -n osm-workflows @latest

10.6. Checking progress of operations in new OSM declarative framework

10.6.1. How to check progress of resources in Flux

export KUBECONFIG=~/kubeconfig-osm.yaml
watch 'echo; kubectl get managed; echo; kubectl get kustomizations -A; echo; kubectl get helmreleases -A'

10.7. Common issues with VIMs

10.7.1. Is the VIM URL reachable and operational?

When there are problems accessing the VIM URL, an error message similar to the following is shown after attempts to instantiate network services:

Error: "VIM Exception vimmconnConnectionException ConnectFailure: Unable to establish connection to <URL>"

In order to debug potential issues with the connection, in the case of an OpenStack VIM, you can install the OpenStack client in the OSM VM and run some basic tests. For example:

# Install the OpenStack client
sudo apt-get install python3-openstackclient
# Load your OpenStack credentials. For instance, if your credentials are saved in a file named 'myVIM-openrc.sh', you can load them with:
source myVIM-openrc.sh
# Test if the VIM API is operational with a simple command. For instance:
openstack image list

If the openstack client works, then make sure that you can reach the VIM from the RO container:

# If running OSM on top of docker swarm, go to the container in docker swarm
docker exec -it osm_ro.1.xxxxx bash
# If running OSM on top of K8s, go to the RO deployment in kubernetes
kubectl -n osm exec -it deployment/ro -- bash
curl <URL_CONTROLLER>

In some cases, the errors come from the fact that the VIM was added to OSM using names in the URL that are not Fully Qualified Domain Names (FQDN).

When adding a VIM to OSM, you must always use FQDNs or IP addresses. Kubernetes might try to resolve a non-FQDN name as a container name, which will not point to the VIM. In addition, all the VIM endpoints should also be FQDNs or IP addresses, thus guaranteeing that all subsequent API calls can reach the appropriate endpoint.
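
A quick way to spot unqualified names before registering a VIM is to extract the host from the URL and check that it contains a dot. This is only a rough heuristic sketch: the helper names url_host and host_is_qualified are hypothetical, and IPv6 literals and URLs with user information are not handled:

```shell
# Print the host part of a URL such as http://host:5000/v3
url_host() {
  url_h=${1#*://}       # drop the scheme, if any
  url_h=${url_h%%/*}    # drop the path
  url_h=${url_h%%:*}    # drop the port
  printf '%s\n' "$url_h"
}

# Succeed when the host contains at least one dot (FQDN or dotted IPv4 address).
host_is_qualified() {
  case "$1" in
    *.*) return 0 ;;
    *)   return 1 ;;
  esac
}
```

For example, host_is_qualified "$(url_host http://openstack:5000/v3)" fails because the host openstack is not fully qualified, while the same check passes for https://vim.example.com:5000/v3 (an illustrative URL).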

10.7.2. Issues when trying to access VM from OSM

Is the VIM management network reachable from OSM (e.g. via ssh, port 22)?

The simplest check would consist of deploying a VM attached to the management network and trying to access it, e.g. via ssh, from the OSM host.

For instance, in the case of an OpenStack VIM you could try something like this:

openstack server create --image ubuntu --flavor m1.small --network mgmtnet test

If this does not work, typically it is due to one of these issues:

Security group policy in your VIM is blocking your traffic (contact your admin to fix it)
IP address space in the management network is not routable from outside (or in the reverse direction, for the ACKs).

10.8. How to report an issue

If you have bugs or issues to be reported, please use Bugzilla

If you have questions or feedback, feel free to contact us through:

the mailing list OSM_TECH@list.etsi.org
the Slack workspace

Please be patient. Answers may take a few days.

Please provide some context to your questions. As an example, find below some guidelines:

In case of an installation issue:
- The full command used to run the installer and the full output of the installer (or at least enough context) might help in finding the solution.
- It is highly recommended to run the installer command capturing standard output and standard error, so that you can send them for analysis if needed. E.g.:

./install_osm.sh 2>&1 | tee osm_install.log

In case of operational issues, the following information might help:
- Version of OSM that you are using
- Logs of the system. Check https://osm.etsi.org/wikipub/index.php/Common_issues_and_troubleshooting to know how to get them.
- Details on the actions you made to get that error so that we could reproduce it.
- IP network details in order to help troubleshooting potential network issues. For instance:
  - Client IP address (browser, command line client, etc.) from where you are trying to access OSM
  - IP address of the machine where OSM is running
  - IP addresses of the containers
  - NAT rules in the machine where OSM is running

Common sense applies here, so you don’t need to send everything, but just enough information to diagnose the issue and find a proper solution.

10.9. (OLD) Common issues with VCA/Juju

10.9.1. Juju status shows pending objects after deleting a NS

In extraordinary situations, the output of juju status could show pending units that should have been removed when deleting a NS. In those situations, you can clean up VCA by following the procedure below:

juju status -m <NS_ID>
juju remove-application -m <NS_ID> <application>
juju resolved -m <NS_ID> <unit> --no-retry        # You'll likely have to run it several times, as it will probably have an error in the next queued hook. Once the last hook is marked resolved, the charm will continue its removal

The following page also shows how to remove different Juju objects

10.9.2. Dump Juju Logs

To dump the Juju debug-logs, run one of these commands (for the whole controller, for a given NS model, or for a given unit):

juju debug-log --replay --no-tail > juju-debug.log
juju debug-log --replay --no-tail -m <NS_ID>
juju debug-log --replay --no-tail -m <NS_ID> --include <UNIT>

10.9.3. Manual recovery of Juju

If Juju gets into a corrupt state and you cannot run juju status or contact the Juju controller, you might need to manually remove the controller and register it again, making OSM aware of the new controller.

# Stop and delete all juju containers, then unregister the controller
lxc list
lxc stop juju-*          #replace "*" by the right values
lxc delete juju-*        #replace "*" by the right values
juju unregister -y osm

# Create the controller again
sg lxd -c "juju bootstrap --bootstrap-series=xenial localhost osm"

# Get controller IP and update it in relevant OSM env files
controller_ip=$(juju show-controller osm|grep api-endpoints|awk -F\' '{print $2}'|awk -F\: '{print $1}')
sudo sed -i 's/^OSMMON_VCA_HOST.*$/OSMMON_VCA_HOST='$controller_ip'/' /etc/osm/docker/mon.env
sudo sed -i 's/^OSMLCM_VCA_HOST.*$/OSMLCM_VCA_HOST='$controller_ip'/' /etc/osm/docker/lcm.env

#Get juju password and feed it to OSM env files
function parse_juju_password {
   password_file="${HOME}/.local/share/juju/accounts.yaml"
   local controller_name=$1
   local s='[[:space:]]*' w='[a-zA-Z0-9_-]*' fs=$(echo @|tr @ '\034')
   sed -ne "s|^\($s\):|\1|" \
        -e "s|^\($s\)\($w\)$s:$s[\"']\(.*\)[\"']$s\$|\1$fs\2$fs\3|p" \
        -e "s|^\($s\)\($w\)$s:$s\(.*\)$s\$|\1$fs\2$fs\3|p" $password_file |
   awk -F$fs -v controller=$controller_name '{
      indent = length($1)/2;
      vname[indent] = $2;
      for (i in vname) {if (i > indent) {delete vname[i]}}
      if (length($3) > 0) {
         vn=""; for (i=0; i<indent; i++) {vn=(vn)(vname[i])("_")}
         if (match(vn,controller) && match($2,"password")) {
             printf("%s",$3);
         }
      }
   }'
}
juju_password=$(parse_juju_password osm)
sudo sed -i 's/^OSMMON_VCA_SECRET.*$/OSMMON_VCA_SECRET='$juju_password'/' /etc/osm/docker/mon.env
sudo sed -i 's/^OSMLCM_VCA_SECRET.*$/OSMLCM_VCA_SECRET='$juju_password'/' /etc/osm/docker/lcm.env

# The public key contains spaces and '/' characters, so the sed expression is
# double-quoted and uses '|' as delimiter
juju_pubkey=$(cat "$HOME/.local/share/juju/ssh/juju_id_rsa.pub")
sudo sed -i "s|^OSMLCM_VCA_PUBKEY.*\$|OSMLCM_VCA_PUBKEY=${juju_pubkey}|" /etc/osm/docker/mon.env
sudo sed -i "s|^OSMLCM_VCA_PUBKEY.*\$|OSMLCM_VCA_PUBKEY=${juju_pubkey}|" /etc/osm/docker/lcm.env

#Restart OSM stack
docker stack rm osm
docker stack deploy -c /etc/osm/docker/docker-compose.yaml osm

10.9.4. Slow deployment of charms

You can make deployment of charms quicker by:

Upgrading your LXD installation to use ZFS: LXD configuration for OSM Release FIVE
- After LXD re-installation, you might need to reinstall the juju controller: Reinstall Juju controller
Preventing Juju from running apt-get update && apt-get upgrade when starting a machine: Disable OS upgrades in charms
Building periodically a custom image that will be used as base image for all the charms: Custom base image for charms

10.10. Other operational issues

10.10.1. Running out of disk space

If you upgrade your OSM installation frequently, your disk might run out of space. The reason is that the containers and Docker images of previous versions might still be consuming disk space. Running the following two commands should be enough to clean up your Docker setup:

docker system prune
docker image prune
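
Before pruning, it can help to confirm that disk usage is actually high. The sketch below reads the usage percentage from df -P; the function names disk_usage_percent and disk_usage_above are hypothetical, and the threshold is up to you:

```shell
# Print the usage percentage (without the % sign) of the filesystem holding a path.
# $1: path to check (defaults to /)
disk_usage_percent() {
  df -P "${1:-/}" | awk 'NR == 2 { sub(/%/, "", $5); print $5 }'
}

# Succeed when the usage of the filesystem holding $2 is at or above $1 percent.
disk_usage_above() {
  [ "$(disk_usage_percent "$2")" -ge "$1" ]
}
```

For example, disk_usage_above 85 /var/lib/docker && docker system df would show how much space Docker is using only when the filesystem is at least 85% full. The path /var/lib/docker is the usual Docker data directory, but it may differ in your installation.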