Commit db894ea8, authored by Zygmunt Krynicki

Add OTA documentation

Add small notes on the OTA stack.

Signed-off-by: Zygmunt Krynicki <zygmunt.krynicki@huawei.com>
.. SPDX-FileCopyrightText: Huawei Inc.
.. SPDX-License-Identifier: CC-BY-4.0

.. include:: definitions.rst

Over The Air (OTA) Updates
==========================
|main_project_name| provides support for updating Linux devices in the field.
With certain preparations, derivative projects can prepare and distribute
periodic updates to bring in up-to-date security patches, new features and
capabilities.
This document is meant to be read top-to-bottom, with an increasing level of
detail presented at each stage. It starts with an overview of the two supported
distribution flows, centrally managed with HawkBit and anonymously updated with
NetOTA, together with their suggested deployment scenarios. It then describes
the architecture of the OTA stack on Linux devices and how mutable persistent
data is arranged. Lastly, the detailed architecture of the on-device stack is
described; that last chapter is meant to assist developers porting the stack to
new boards or debugging unexpected issues.
Important considerations
------------------------
This chapter contains specific advice for the implementer of the update system.
|main_project_name| provides some good defaults and a starting point, but any
complete product must tune and adjust a number of elements.
Failure to understand and correctly implement the following advice can cause
catastrophic failure in the field. When in doubt, re-test and re-check.
Partitions
..........
|main_project_name| devices use an A/B model with two immutable system
partitions, plus separate boot, application data, system data and immutable
device data partitions. The roles for those partitions were determined at the
design stage and should be used in accordance with that intent.
OS, not apps
............
The update stack is designed to update the operating system, not applications.
Applications *may* be embedded into the operating system image but *should* be
delivered as separate entities, for example as system containers, because that
de-couples their life-cycle and upgrade frequency from that of the base system.
The sizes of the A/B partitions are fixed during the lifetime of the device.
Care should be taken to plan ahead, so that their sizes are not a constraining
factor during the evolution of the system software. This is also related to any
applications that may be bundled in the system image.
Each update involves a system reboot. In case of failure (total or partial),
another reboot is performed for the rollback operation. In contrast, some
application update stacks may be able to achieve zero-downtime updates.
Plan your updates so that the least downtime and interruption occurs for the
users of your product.
Signature Validity
..................
The update payload, also known as the *update bundle*, is verified against a
known public key or keyring contained inside the system image. The validity of
the keyring should be long enough that a device taken from long-term storage,
without receiving any intermediate updates, may still successfully validate the
signature.
A conservative number of **ten years** is recommended. After the baked-in public
key expires, the device needs to be re-programmed externally, possibly involving
product recall.
Space Requirements
..................
An update involves downloading a complete copy of the system partition. The
device must either use the data partition (which should have enough storage for
typical use-cases) or must combine having enough memory for a RAM-based file
system *and* small enough images, to ensure that the copy can be fully
downloaded.
The choice of temporary storage for the download image depends on the update
model and will be discussed below.
Each update involves writing the system image to one of the available A/B slots.
Care should be taken to design the system with enough write endurance to support
updates during the entire lifetime of the product.
Time Requirements
.................
Update frequency incurs a proportional load on the update server. A large
enough fleet of devices merely *checking* for an update can take down any
single server. To alleviate this, product design should balance update
frequency (in some cases it can be controlled remotely post-deployment) and
spread the load over time. It is strongly advisable to evenly distribute update
checks with a random element. If any potential updates must occur at a specific
local time (e.g. between three and four AM), then the system must be correctly
configured to observe the correct time zone. The update server can be scaled
horizontally, to an extent. At least for NetOTA, care was taken to allow
efficiency at scale, with stateless operation and no need for a traditional
database. Any number of geographically distributed replicas, behind load
balancers and geo-routing, can take an arbitrarily large load. Both HawkBit and
NetOTA separate meta-data from file storage, allowing network traffic to be
offloaded to optimized CDN solutions.
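As a minimal sketch of the randomized spreading described above (the one-hour
window and the follow-up ``sysotactl update`` call are illustrative
assumptions, not part of the stock images):

```shell
# Sketch: sleep a random 0-3599 s before checking for updates, so that a
# fleet of devices does not hit the update server at the same instant.
jitter=$(( $(od -An -N2 -tu2 /dev/urandom) % 3600 ))
echo "waiting ${jitter}s before the update check"
# sleep "${jitter}" && sysotactl update   # illustrative follow-up command
```

Deployments driving the check from a systemd timer can achieve the same effect
declaratively with the ``RandomizedDelaySec=`` timer option.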
Partitions and data
-------------------
The system image, as built by the Yocto recipe ``oniro-image-base`` (or a
derivative), has a corresponding update bundle, defined by the Yocto recipe
``oniro-bundle-base``.
The full disk image is meant to be programmed once, typically during
manufacturing or during initial one-time setup. The update bundle is meant to
be uploaded to the update server and downloaded to individual devices in the
field.
The disk is partitioned into the following partitions:
- boot (FAT)
- system-a (squashfs)
- system-b (squashfs or empty)
- device-data (ext4, ro)
- system-data (ext4)
- app-data (ext4)
The update stack interacts with the boot partition, the system A and B
partitions and the system data partition. The remaining partitions may be used
by other parts of the system, but are not directly affected by anything that
happens during the update process.
Boot and update process
-----------------------
A platform-specific boot loader chooses one of the system partitions, either A
or B, and boots into it. On EFI systems the kernel is loaded from the system
partition. Other boot loaders may need to load the kernel from the boot
partition. An appropriate redundancy scheme is used to allow more than one
kernel to co-exist.
During early initialization of userspace, the immutable system partition mounted
at `/` is augmented with bind mounts to other partitions. In general application
data (e.g. containers and other large data sets) is meant to live on the
application data partition, which does not use the A/B update model.
A small amount of data, such as system configuration and identity, including
the state of the update stack, is kept in the system-data partition.
Applications that are compiled into the image need overrides in their Yocto
recipes to allow them to persist state. This is handled with the ``WRITABLES``
system, which is not documented here.
When an update is initiated, a complete image is downloaded to temporary
storage. The image is cryptographically verified against the RAUC signing key or
key-chain. Compatibility is checked against the RAUC ``COMPATIBLE`` string.
Valid and compatible images are mounted in a private mount namespace to reveal
the update image contained inside. That namespace is observed by the ``rauc``
process and the ``sysota-rauc-install-handler`` process. On |main_project_name|
systems, the update payload is a single squashfs image called ``system``. The
system image is then copied to the inactive slot, for example, when *slot A* is
active, then the image is copied to *slot B*. Platform-specific logic is then
used to configure the boot system to boot into the newly written slot **once**.
This acts as a safety mechanism, ensuring that power loss anywhere during the
update process has the effect of reverting back to the known-good image. After
the image is written, platform-specific post-install logic schedules the device
to reboot, perhaps in a special way to ensure the boot-once constraint.
During boot-up, platform firmware or the GRUB EFI application detects the
boot-once mode and uses the inactive slot for the remainder of the boot
process. This tests, in one go, the new kernel, kernel modules and any
userspace applications. On a successful boot, late userspace takes the decision
to commit the update transaction. A committed transaction atomically swaps the
active/inactive roles of the two system partitions.
If a failure, for example power loss or an unexpected software error, prevents
reaching the commit stage, then the update commit will not happen. Depending on
the nature of the failure the device may restart automatically, or may need to
be restarted externally. It is recommended to equip and configure a hardware
watchdog to avoid the need for manual recovery during this critical step.
Once restarted, the known-good slot is booted into automatically and the
upgrade is aborted. Temporary data saved during the update process is removed,
so that it does not accumulate in the boot partition.
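To make the A/B flow above concrete, the following is a minimal sketch of a
RAUC system configuration matching the described layout. The ``compatible``
string, partition labels and keyring path are illustrative assumptions; consult
the RAUC documentation for the options relevant to your board:

```ini
# /etc/rauc/system.conf (sketch; all values are placeholders)
[system]
compatible=example-board        ; must match the bundle's COMPATIBLE string
bootloader=grub                 ; platform-specific boot selection backend

[keyring]
path=/etc/rauc/keyring.pem      ; public key(s) used to verify update bundles

[slot.system.0]
device=/dev/disk/by-partlabel/system-a
type=raw
bootname=A

[slot.system.1]
device=/dev/disk/by-partlabel/system-b
type=raw
bootname=B
```

With such a configuration in place, ``rauc status`` shows which slot is active
and ``rauc install`` writes a verified bundle to the inactive slot.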
Supported update servers
------------------------
|main_project_name| supports two update servers: **HawkBit** and **NetOTA**.
HawkBit is a mature solution and is recommended for scenarios where devices are
managed centrally by a single authority. The device manufacturer may sell
white-label boxes, deferring all management to the integrator or reseller. The
integrator must deploy, operate and maintain a HawkBit installation for the
lifetime of the product. All devices deployed in the field must be explicitly
provisioned with a location and credentials before updates can be distributed.
NetOTA is not as mature but is recommended for scenarios where no central
authority manages devices, yet the device manufacturer or vendor still
maintains the software over time, releasing updates that devices may install at
any time. The manufacturer may pre-provision all devices with the location of
the update server, the name of the image and the default update channel, for
example the latest stable release. Depending on the product user interface and
other integration requirements, end users may trigger the update process
manually or the device may automatically attempt to update from time to time.
HawkBit update server
---------------------
Eclipse HawkBit can be used to manage any number of devices of diverse types.
Devices periodically contact HawkBit over HTTPS to check if an update is
available. Whoever operates the HawkBit server has total control of the
software deployed onto the devices.
HawkBit is most suited to environments where a single authority operates a
number of devices and wants to exert precise control over the update process.
Each device is separately addressable, although mass updates (roll-outs) are
also supported.
This mode requires explicit provisioning. A configuration file for
``rauc-hawkbit-updater`` needs to be prepared for each device and installed
either during manufacturing or on-premises. |main_project_name| does not offer
any support for provisioning; this part is left to the integrator.
HawkBit supports several types of authentication between itself and devices in
the field. Both per-device authentication tokens and shared gateway tokens are
supported. Control over the polling frequency is also available. HawkBit offers
advanced features for tracking and reporting devices, although not all of them
are supported by the ``rauc-hawkbit-updater`` client.
HawkBit is a complex piece of software with vast documentation. Refer to
https://www.eclipse.org/hawkbit/ for details. Small deployments, especially
useful for evaluation, can use the ``hawkbit`` snap package for quick local
setups. The snap package is not optimized for a high number of users or for
high availability, so larger deployments are encouraged to learn about the
HawkBit architecture and deploy a scalable installation across multiple
machines.
Deploying HawkBit
.................
To deploy HawkBit for an evaluation it is best to use the ``hawkbit`` snap
package. The package offers several stability levels expressed as distinct snap
tracks. Installation instructions can be found on the `hawkbit snap information
page <https://snapcraft.io/hawkbit>`_.
The *stable* track offers HawkBit 0.2.5 and is not recommended for deployment
due to its old age, the number of open bugs and missing features. The *beta*
track offers HawkBit 0.3.0M7 and is recommended for evaluation. The *edge*
track offers a periodic build of the latest upstream HawkBit. This version is
annotated with the git commit hash and a sequential number counted since the
most recent tag.
**Warning**: HawkBit 0.2.5 does not offer updates to 0.3.0. This is an upstream
issue caused by a faulty database migration.
Once the ``hawkbit`` snap is installed, consult the ``snap info hawkbit``
command and read the description explaining the available configuration
options. Those are managed through the snap configuration system. The name of
the administrative account can be set with ``snap set hawkbit username=user``.
The password of the administrative user can similarly be set with ``snap set
hawkbit password=secret``. By default HawkBit listens on ``localhost``, port
``8080``, and is meant to be exposed by a reverse HTTP proxy. Evaluation
installations can use the insecure HTTP protocol directly and skip setting up
the proxy. To use HawkBit for evaluation, set the listen address to ``0.0.0.0``
or ``::``, so that the service is reachable from all network interfaces. This
can be done with ``snap set hawkbit address=0.0.0.0``.
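Collected together, an evaluation setup along these lines might look as follows
(the account name and password are examples; the track choice follows the
recommendation above):

```shell
snap install hawkbit --channel=beta   # 0.3.0M7, recommended for evaluation
snap set hawkbit username=admin
snap set hawkbit password=secret
snap set hawkbit address=0.0.0.0      # listen on all interfaces (evaluation only)
```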
Once HawkBit is installed, either using the snap or in any other way, it should
be configured in one of several ways. The primary deciding factor is how
devices authenticate to HawkBit. The full documentation is beyond the scope of
this document, but for simple deployments we recommend either using a
*per-device authentication token*, in which case HawkBit has to be told about
the presence of every distinct device, or using the *gateway authentication
token*, in which case there is a shared secret among all the devices and they
all authenticate to the gateway this way. This configuration is exposed under
the **System Config** menu, available from the sidebar on the left.
In either mode any number of devices can be created under the **Deployment**
menu. In HawkBit nomenclature, a device is called a *target*. Targets may be
clustered into target types, which aid in maintaining a heterogeneous fleet
more easily. Each target has a *controller ID*, which is a unique string
identifying the device in the system. In some authentication modes, devices
need to be provisioned not only with the URL of the HawkBit server, but also
with their *controller ID* and *security token*. Mass deployments can be
performed using bulk upload or the management API.
The |main_project_name| project created a command line tool for working with
portions of the HawkBit management APIs. This tool is called ``hawkbitctl`` and
is similarly available as a snap package or as a container on DockerHub
(``zyga/hawkbitctl``). To install ``hawkbitctl`` as a snap, see the `hawkbitctl
snap information page <https://snapcraft.io/hawkbitctl>`_. Refer to the
documentation of ``hawkbitctl`` to see how to use it to create devices with
given controller IDs and security tokens.
Provisioning Devices for HawkBit
................................
SysOTA does not contain a native HawkBit client yet, so it leverages the
``rauc-hawkbit-updater`` program for this role. Said program reads the
configuration file ``/etc/rauc-hawkbit-updater/config.conf``, which must be
owned by the ``rauc-hawkbit`` user, connects to the given HawkBit server,
authenticates using either a device or gateway token and then listens for
events. |main_project_name| images contain a sample configuration file in
``/usr/share/rauc-hawkbit-updater/example.conf`` which can be used as a quick
reference.
At minimum, the following settings must be configured:
- The ``target_name`` field must be set to the *controller ID* of the target
created in HawkBit. The values may be generated separately, for example the
manufacturing process may generate a batch of identifiers and save them in a CSV file
to be imported into HawkBit later.
- The ``auth_token`` field must be set to the per-device authentication token.
If gateway authentication is used then ``gateway_token`` must be used instead.
Similarly the tokens may be generated in batches during manufacturing and
stored along with controller IDs in a CSV file.
- The ``hawkbit_server`` field must be set to the domain name or IP of your
HawkBit server. Domain names are recommended but toy deployments may use local
IP addresses as well.
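A hypothetical ``/etc/rauc-hawkbit-updater/config.conf`` using a per-device
token might look like the following sketch. The key names follow the sample
configuration shipped in ``/usr/share/rauc-hawkbit-updater/example.conf``; all
values are placeholders:

```ini
[client]
hawkbit_server           = hawkbit.example.com:443
ssl                      = true
ssl_verify               = true
tenant_id                = DEFAULT
target_name              = device-0001          ; controller ID created in HawkBit
auth_token               = secret-device-token  ; per-device security token
bundle_download_location = /data/bundle.raucb   ; temporary download storage
```

When gateway authentication is used, ``auth_token`` is replaced with
``gateway_token``, as described above.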
Once the file is created and has the right ownership, you can start the
``rauc-hawkbit-updater.service`` systemd unit to verify that the client can
connect and authenticate correctly. When the file is present, the service will
start automatically during system startup.
Working with HawkBit
....................
HawkBit has both a web dashboard and a complex set of REST APIs covering all
aspects of the management story. During exploration and evaluation it is
recommended to use the graphical user interface. As the workflow solidifies,
switching to the REST APIs and automation is encouraged.
The general data model related to updates is as follows:
- the *update bundle* is expressed as an *artifact* on a *software module*
- the *software module* is added as an element of a *distribution set*
- the *distribution set* is assigned to a *target* for deployment
The |main_project_name| project has created the ``hawkbitctl`` utility, which
can easily create the required scaffolding and upload the bundle to the server.
While useful, the tool does not cover the entire API surface yet and you may
find that specific functionality is missing. In cases like that, custom
solutions, for example scripts using ``curl``, may be used as a stop-gap
measure.
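As an example of such a stop-gap, a hypothetical ``curl`` call against the
HawkBit management API could create a target directly (the credentials, host
and controller ID are placeholders):

```shell
curl -u admin:secret \
     -H 'Content-Type: application/json' \
     -d '[{"controllerId": "device-0001", "name": "Device 0001"}]' \
     http://hawkbit.example.com:8080/rest/v1/targets
```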
HawkBit has one more essential complexity, the type system, where *targets*
(devices), *software modules* and *distribution sets* have corresponding type
entities: *target types*, *software module types* and *distribution set types*.
The type system allows constraining correct combinations and preventing
mistakes. Devices of the same type should refer to a *target type*, which
further refers to a compatible *distribution set type*, which finally refers to
a compatible *software module type*. This allows an actual update bundle to be
placed in a new software module of the right *type*, which in the end allows
HawkBit to prevent assigning or rolling out incorrect software to a given
specific device.
When using the graphical user interface you should be aware that some of the
operations are only expressed as a drag and drop interaction. This specifically
applies to the act of binding a *software module* to a *distribution set* and
the act of assigning a *distribution set* to a *target*.
Operators working with HawkBit are strongly encouraged to read the extensive
upstream documentation to understand the finer details of the data model,
specifically around cardinality of relations.
Updating with HawkBit
.....................
The basic checklist of updating with HawkBit, assuming the update server
is deployed and devices are provisioned, is as follows:
- Build the bundle recipe, for example ``oniro-bundle-base``. Products should
maintain a pair of recipes, one for the bundle and one for the whole image.
All the examples that refer to the base recipes here should be understood as
references to the actual recipe names used by the product.
- Collect the ``*.raucb`` (RAUC bundle) file from the Yocto deploy directory.
- Perform any QA process deemed necessary. This should at least involve copying
  the bundle to a real device and updating manually with ``rauc install``. It is
  recommended to test the software for at least a few days, to attempt to detect
  problems, such as memory leaks, that would not crash outright but may cause
  issues after the update transaction is committed.
- Create a new *software module* with a unique combination of name and
  version, and a reference to an appropriate *software module type* created
  out-of-band, which describes RAUC update bundles for a specific class of
  devices.
- Upload the bundle as an artifact to the software module created earlier.
- Create a new *distribution set* with a unique combination of name and
  version, and a reference to an appropriate *distribution set type* created
  out-of-band, which describes a distribution that contains the software module
  type with the RAUC update bundle.
- Bind the *software module* to the *distribution set* (by drag-and-drop).
At this stage, the update is uploaded and can be rolled out or assigned to
individual devices. Once a device is asked to update it will download and
install the bundle. Basic information about the process is relayed from the
device to HawkBit and can be seen in per-device action history.
When testing updates with a small number of devices, the distribution set may be
dragged and dropped onto the device to commence the update for that specific
device.
NetOTA update server
--------------------
The NetOTA project can be used to distribute software to diverse devices from
one or more servers. In this mode the device periodically contacts NetOTA over
HTTPS to check if an update is available. Whoever operates the NetOTA server
chooses the composition and number of available system images, and devices can
be configured to follow a specific image name and stability level.
Unlike in the HawkBit model, the central server has no control over the
devices. Instead, whoever controls individual devices chooses the server, the
image name and the stability level, and then follows along at a pace determined
by the device.
This mode requires minimal provisioning: either install a configuration file or
use the ``sysotactl`` utility to set the name of the package, the stability
level and the URL of the update server. In addition, a systemd timer or an
equivalent userspace agent must periodically call the ``sysotactl update``
command or the corresponding D-Bus API.
NetOTA is beta-quality software. It can be used and has documentation
sufficient for deployment, but it was not tested daily during the development
of the |main_project_name| release. This mode is documented for completeness,
since it complements the centrally managed HawkBit mode.
For more information about deploying NetOTA, creating an update repository and
uploading software to said repository, please refer to the `upstream
documentation <https://gitlab.com/zygoon/netota>`_.
Deploying NetOTA
................
To deploy NetOTA for an evaluation it is best to use the ``netota`` snap
package. The package offers several stability levels expressed as distinct snap
tracks. Installation instructions can be found on the `netota snap information
page <https://snapcraft.io/netota>`_.
The *stable* track offers NetOTA 0.3.2 and is recommended for deployment. The
*edge* track offers automatic builds from the continuous integration system.
This version is annotated with the git commit hash and a sequential number
counted since the most recent tag.
Once the ``netota`` snap is installed, consult the ``snap info netota`` command
and read the description explaining the available configuration options. Those
are managed through the snap configuration system. By default NetOTA listens on
``localhost``, port ``8000``, and is meant to be exposed by a reverse HTTP
proxy. Evaluation installations can use the insecure HTTP protocol directly and
skip setting up the proxy. To use NetOTA for evaluation, set the listen address
to ``0.0.0.0:8000`` or ``[::]:8000``, so that the service is reachable from all
network interfaces. This can be done with ``snap set netota
address=0.0.0.0:8000``.
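In short, an evaluation installation reachable from the network can be set up
with two commands:

```shell
snap install netota
snap set netota address=0.0.0.0:8000   # listen on all interfaces (evaluation only)
```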
NetOTA does not offer any graphical dashboards and is configured by placing
files in the file system. The snap package uses the directory
``/var/snap/netota/common/repository`` as the root of the data set. Upon
installation an ``example`` package is copied there. It can be used to
understand the data structure used by NetOTA. Evaluation deployments can edit
the data in place with a text editor. Production deployments are advised to use
a git repository to track deployment operations. Updates to the repository do
not need to be atomic. The systemd service ``snap.netota.netotad.service`` can
be restarted to re-scan the file system structure and present updated
information over the REST APIs used by devices in the field. Alternatively, the
``SIGHUP`` signal may be sent to the ``netotad`` process for the same effect,
without any observable downtime.
Provisioning Devices for NetOTA
...............................
SysOTA contains a native NetOTA client and maintains all associated state and
configuration. The configuration is exposed as a D-Bus API and is meant to be
consumed by custom device agents developed for a particular solution. The D-Bus
API has the ability to control the URL of the NetOTA server, the package name
and stream name to follow, as well as to perform an update and monitor the
progress.
For convenience, the same APIs are exposed via the command line tool
``sysotactl``. The tool has built-in help. By default, the status of the
current configuration and state is displayed. Use the command ``sysotactl
set-server URL`` to set the URL of the NetOTA deployment. Use the command
``sysotactl set-package`` to set the name of the package containing the system
image for your product. Use the command ``sysotactl set-stream`` to set the
name of the stream of the package to subscribe to.
Using the command ``sysotactl streams`` you can discover the set of streams
available for your package. Using streams allows a fleet of devices to follow
different versions of the same package. It can be useful for canary-testing,
major version upgrades or hot-fixing a particular device experiencing an issue,
without having to upgrade all devices at the same time.
Using the command ``sysotactl update`` you can trigger an update. Updated
software is downloaded and installed automatically. D-Bus signals are sent
throughout the process, allowing any user interface present on the device to
display appropriate information.
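Putting the commands above together, provisioning and updating a device might
look like this (the server URL, package name and stream name are placeholders):

```shell
sysotactl set-server https://updates.example.com
sysotactl set-package example-product
sysotactl set-stream stable
sysotactl streams   # discover the streams available for the package
sysotactl update    # trigger an update now
```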
The same configuration can be provided by editing SysOTA configuration
file ``/etc/sysota/sysotad.conf``. See ``sysotad.conf`` manual page for details.
Updating with NetOTA
....................
The basic checklist of updating with NetOTA, assuming the update server
is deployed and devices are provisioned, is as follows:
- Build the bundle recipe, for example ``oniro-bundle-base``. Products should
maintain a pair of recipes, one for the bundle and one for the whole image.
All the examples that refer to the base recipes here should be understood as
references to the actual recipe names used by the product.
- Collect the ``*.raucb`` (RAUC bundle) file from the Yocto deploy directory.
- Perform any QA process deemed necessary. This should at least involve copying
  the bundle to a real device and updating manually with ``rauc install``. It is
  recommended to test the software for at least a few days, to attempt to detect
  problems, such as memory leaks, that would not crash outright but may cause
  issues after the update transaction is committed.
- Choose which stream to publish the bundle to. You can create additional
  streams at will by touching a ``foo.stream`` file. Make sure to create the
  corresponding ``foo.stream.d`` directory as well. This will create the stream
  ``foo``. If you choose an existing stream, remember that all the *archives*
  present in that stream must have the exact same version. This means you may
  need to perform additional builds if the package is built for more than one
  architecture or ``MACHINE`` value.
- Create a new file with the extension ``.archive`` that describes the newly
  built bundle. This process is somewhat involved, as several pieces of
  information need to be provided. The archive file should be placed in the
  ``.stream.d`` directory of the stream you selected earlier. The archive
  must contain at least one ``[Download]`` section with the ``URL=`` entry
  pointing to an HTTP server that hosts the file. For local deployments you can
  use any web server you have available. In larger deployments you may choose
  to use a content delivery network provider, to offer high-availability
  services for your fleet.
- If you are doing this for the first time, make sure to read the upstream
  documentation of the NetOTA project and consult the sample repository created
  by the ``netota`` snap package on first install. Ideally keep the changes
  you've made in a git repository, so that you can both track any changes and
  revert back to a previous state.
- Restart the NetOTA service or send ``SIGHUP`` to the ``netotad`` process.
  Note that if the new repository is not consistent in any way, an error message
  will be logged and the service will refuse to start (if you had chosen to
  restart the service) or will keep serving the old content (if you had chosen
  to send the signal).
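A sketch of the resulting repository layout and archive file, assuming a
package named ``example-product`` with a ``stable`` stream (the per-package
directory layout is an assumption based on the sample repository; fields beyond
the ``[Download]`` section are deliberately omitted, see the upstream
documentation for the full format):

```ini
; /var/snap/netota/common/repository/
; └── example-product/
;     ├── stable.stream
;     └── stable.stream.d/
;         └── example-product-1.0.archive
;
; example-product-1.0.archive (sketch):
[Download]
URL=https://cdn.example.com/bundles/example-product-1.0.raucb
```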
At this stage the server will offer updates to devices if they choose to ask.
You can perform the update manually with ``sysotactl update`` or if you have a
custom device agent, you may instruct it to perform the corresponding D-Bus
call.
Limitations
-----------
The |main_project_name| update stack is by no means perfect. Knowing its
current weaknesses can help plan ahead. We tried to balance the design so that
no weakness is *fatal*, and so that remaining gaps can be updated in the field.
Firmware updates
................
Firmware is not updated by SysOTA. Product designers should consider
implementing ``fwupd`` and obtaining firmware from the LVFS. The safety of
updating firmware in the field is difficult to measure. For EFI-capable
systems, being able to at least update the EFI firmware is strongly
recommended.
CPU microcode
.............
CPU microcode may be updated by the EFI firmware and by the early boot process.
At present, CPU microcode is not updated by the early boot process. This is
tracked as https://gitlab.eclipse.org/eclipse/oniro-core/oniro/-/issues/508
GRUB application update
.......................
Updating the OS does not currently update the EFI application containing GRUB.
This is tracked as https://gitlab.eclipse.org/eclipse/oniro-core/sysota/-/issues/8
GRUB script update
..................
Updating the OS does not currently update the GRUB script. This is tracked as
https://gitlab.eclipse.org/eclipse/oniro-core/oniro/-/issues/523