What is an SOE?
Modern IT departments face huge changes in the way they deploy and maintain servers. When I first entered the industry, UNIX (and hence Linux) servers were, in the parlance of cloud workloads, ‘pets’: few in number, lovingly tended machines that were individually configured and maintained by hand. They often ran many workloads at the same time (mail server, file server, database, shell accounts) on expensive hardware. A typical ratio of system administrators to servers might be 1:10, yet hardware costs, rather than labour, accounted for the greatest share of the IT department’s budget.
A typical present-day bank may have 10,000 Linux servers deployed, with sysadmin-to-server ratios as high as 1:1000. Modern deployments are largely treated as ‘cattle’: numerous, single-workload devices that should be deployed quickly, replaced quickly and require no manual configuration. Commodity servers and virtualisation have completely changed the economics of IT departments from major capital cost centres to majority operating-expenditure cost centres: labour costs, software licences and maintenance contracts now form the bulk of the expenditure of most IT organisations.
And yet the majority of IT departments are still stuck in the mindset of the 1990-2000s where servers are deployed manually or semi-manually from a “gold image”, are configured more or less manually, and are patched manually (if at all!) on an ad hoc basis.
A Standard Operating Environment aims to permit any IT organisation to adopt automation for the deployment and maintenance of their servers, to effect rapid patching and configuration change, to manage configuration drift, and to be able to respond rapidly to demands from their user base. The SOE consists of the following components:
- A set of Concepts to define and describe the artefacts associated with the deployment and maintenance of operating infrastructure.
- A set of Workflows to release and maintain standard builds, to maintain deployed services, and to automatically test software and configuration changes.
- A set of Tools to maintain artefacts such as standard builds and configurations, to deploy these to servers, to track versions and history, and to perform automated testing.
SOE provides consistent definitions of various terms that can be used to describe infrastructure deployment and maintenance.
A Build is a set of installable software and configurations. A server is built using a specific build. The build may change over time as it is maintained; similarly, servers built from a specific build may change over time as they are maintained or their configuration is altered.
In SOE, we use a single build to deploy different kinds of servers. Thus a webserver and a database server will be deployed using exactly the same build; only the configuration will change. A build consists of:
- A base operating system e.g. Red Hat Enterprise Linux 7.2
- Additional software such as application software, internally developed software, agents etc.
- A set of configurations required to deploy servers into different Roles
Within Red Hat Satellite, builds are modelled using Content Views.
Builds are versioned, and any change to a build – for example, the addition of a patch – results in a new version of the build being released. Every deployed server is registered to a specific build version, permitting drift analysis and management. To apply patches to a server, it is exposed to a later build version which contains the required patches.
Within Red Hat Satellite, build versions are modelled using Content View Versions.
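The relationship between builds, versions and registered servers can be sketched conceptually. This is an illustrative Python model, not Satellite's API; all class and field names are hypothetical:

```python
from dataclasses import dataclass


@dataclass
class Build:
    """A build (a Content View, in Satellite terms); illustrative model only."""
    name: str
    latest_version: int = 1

    def publish(self) -> int:
        """Release a new build version, e.g. after adding a patch."""
        self.latest_version += 1
        return self.latest_version


@dataclass
class Server:
    """A deployed server, registered to a specific version of a build."""
    hostname: str
    build: Build
    version: int


def drifted(servers):
    """Drift analysis: servers not on their build's latest version still need patching."""
    return [s.hostname for s in servers if s.version < s.build.latest_version]


soe = Build("ACME-RHEL7.2")
web = Server("web01", soe, soe.latest_version)
db = Server("db01", soe, soe.latest_version)
soe.publish()                     # new version containing the latest patches
web.version = soe.latest_version  # patch web01 by exposing it to the new version
print(drifted([web, db]))         # → ['db01']
```

The key point the sketch captures is that patching is modelled as moving a server to a later version of its build, which is what makes drift queryable.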
A Role is a specific software configuration that is directed at fulfilling a business role. An example role is “EMEA HR Production SAP Server”. This would be a server used to provide SAP services to the European HR division of an organisation. A server has only a single role. Within SOE, role is considered immutable: if a deployed server is required in a different role it is re-deployed from scratch. However, the definition of the role can change over time, and servers deployed into that role will change with the changing definition.
A server role is defined by 3 broad areas:
- The server is deployed with initial characteristics suitable for the role. These include physical characteristics such as RAM and number of CPUs, and operating system characteristics such as partition tables, specific OS versions, networking configuration etc.
- The server is installed with all the software required to perform its role.
- The server operating system and application software is correctly configured to perform its role.
Within Red Hat Satellite, roles are implemented using Hostgroups.
Roles tend to have a great deal of commonality – for example, server hardening, registration to the monitoring system, and even the software to be installed will likely be common across roles. In order to encourage re-use, we introduce the concept of Profiles. (NB This concept of roles and profiles was proposed by Craig Dunn in an influential conference presentation.) While a server can have only one role, that role will include one or more profiles. The following is a list of typical profiles:
- Base server
- Websphere server
- SAP Cluster server
- DMZ hardening
- Netbackup client
- Splunk client
Our fictional role “EMEA HR Production SAP Server” might include the following profiles: Base server, SAP Cluster server, Netbackup client, Splunk client.
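The composition of a role from profiles can be shown with a minimal sketch. The profile names, role name and task names below are all hypothetical, chosen to mirror the fictional example above:

```python
# Each profile bundles related configuration tasks; names are illustrative only.
PROFILES = {
    "base-server":        ["harden_os", "register_to_monitoring"],
    "sap-cluster-server": ["install_sap", "configure_cluster"],
    "netbackup-client":   ["install_netbackup_agent"],
    "splunk-client":      ["install_splunk_forwarder"],
}

# A role is simply a named collection of profiles.
ROLES = {
    "emea-hr-prod-sap": [
        "base-server", "sap-cluster-server", "netbackup-client", "splunk-client",
    ],
}


def tasks_for_role(role: str) -> list:
    """Flatten a role into the configuration tasks of its constituent profiles."""
    return [task for profile in ROLES[role] for task in PROFILES[profile]]


print(tasks_for_role("emea-hr-prod-sap"))
```

Because profiles like "base-server" appear in many roles, a change to a profile propagates to every role that includes it, which is precisely the re-use the pattern is after.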
In a Puppet environment, each profile would be implemented as a module. In an Ansible environment, each profile would be implemented, confusingly, as a role. For the purposes of clarity, I will use the term Ansible-Role to distinguish the Ansible concept of a role from the SOE concept described above.
A build goes through the following lifecycle stages:
Inception -> Development -> Release -> Maintenance -> Retirement
Organisations usually have multiple builds, each at different lifecycle stages at any one time. The stages are described in further detail below.
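The lifecycle above is strictly linear, which can be captured in a small sketch. This is a conceptual illustration, not part of any tool; the names are hypothetical:

```python
from enum import Enum


class Stage(Enum):
    """Lifecycle stages of a build, in order."""
    INCEPTION = 1
    DEVELOPMENT = 2
    RELEASE = 3
    MAINTENANCE = 4
    RETIREMENT = 5


def advance(stage: Stage) -> Stage:
    """Builds move strictly forward through the lifecycle; retirement is terminal."""
    if stage is Stage.RETIREMENT:
        raise ValueError("a retired build cannot advance further")
    return Stage(stage.value + 1)


print(advance(Stage.RELEASE))  # Stage.MAINTENANCE
```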
The inception stage is the point at which a new build is initiated. There are several possible triggers for inception of a new build, however the majority of organisations use one of the two following schemes:
- In a time-based scheme a new build is created at regular time intervals. For example an organisation may choose to release a new build every 6 months. NB Do not confuse this with patch frequency which will be described in the Maintenance Stage section.
- In a Vendor release-based scheme the inception of a new build is triggered by an upstream vendor releasing a new version of software. For most Red Hat customers, this is simply when Red Hat releases a new major or minor version of Red Hat Enterprise Linux.
The name of the build should be settled on at the point of inception. The name of the build can be anything; however, most organisations will want to settle on a specific naming convention. For a vendor release-based build, the naming convention may be as simple as ACME-RHEL<maj>.<min>, e.g. ACME-RHEL7.2. This indicates that this is the ACME company’s build of RHEL7.2.
For a time-based scheme, a common convention is ACME-RHEL<maj>.<min>-<date>, e.g. ACME-RHEL7.2-20161027. This indicates that this is the ACME company’s build of RHEL7.2 incepted on the 27th October 2016. The datestamp is required if the build release period is shorter than the vendor’s software release period. I tend to use the datestamp of the inception stage, rather than the release date, as I usually do not know the release date in advance, and it may of course slip.
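Both naming conventions can be generated mechanically. A minimal sketch, assuming the fictional ACME organisation from the examples above:

```python
from datetime import date
from typing import Optional


def build_name(org: str, major: int, minor: int,
               incepted: Optional[date] = None) -> str:
    """Generate a build name under the conventions described above.

    Vendor release-based scheme: ACME-RHEL7.2
    Time-based scheme:           ACME-RHEL7.2-20161027 (inception datestamp)
    """
    name = "{}-RHEL{}.{}".format(org, major, minor)
    if incepted is not None:  # time-based scheme: append the inception datestamp
        name += incepted.strftime("-%Y%m%d")
    return name


print(build_name("ACME", 7, 2))                      # ACME-RHEL7.2
print(build_name("ACME", 7, 2, date(2016, 10, 27)))  # ACME-RHEL7.2-20161027
```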
The development stage is the period during which a new build is designed, developed and tested. This consists of the following steps:
- New Requirements are gathered. These will typically be design and functional requirements such as the need for the operating system build to support a specific end-user application, or function with specific hardware. The new requirement may simply be that the build be based on a newly released version of the base OS.
- Requirements are condensed into a Task Backlog describing the work that needs to be done to create the new Build.
- New base Build is created. Usually this will involve cloning and updating the existing build; however, a build may be created from scratch. Within Red Hat Satellite, this would involve the creation of a new Content View, possibly a copy of an existing Content View. Kickstarts, configuration scripts, Ansible playbooks and tests will be held in a revision control system, usually Git. The branching mechanism of the revision control system is used to create a development stage branch.
- The new build is Tested. Automated testing is key to rapid development of an SOE. A typical testing configuration is to use a group of virtual machines, one deployed into each of the Roles that the SOE supports. On every change to the SOE (for example a change in the Content Views or configuration repositories), the test VMs are redeployed and acceptance and unit tests are run against them. Re-deployment of test machines and test execution is handled using a Continuous Integration engine, usually Jenkins or Bamboo.
- Development takes place iteratively until the Task Backlog is run down and all Tests are passing. Many organisations will use an Agile approach such as Scrumban to manage this phase.
- The new Build is released.
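The redeploy-and-test loop from the testing step above can be sketched as follows. The `redeploy` and `run_tests` callables are hypothetical stand-ins for the CI engine's provisioning and test-execution jobs (e.g. Jenkins jobs), injected here so the loop itself can be shown:

```python
def run_pipeline(build, roles, redeploy, run_tests):
    """Redeploy one test VM per supported role, then run that role's tests.

    `redeploy` and `run_tests` are placeholder callables; a real pipeline
    would call the provisioning API and a test runner instead.
    """
    results = {}
    for role in roles:
        redeploy(build, role)                   # rebuild the test VM for this role
        results[role] = run_tests(build, role)  # acceptance and unit tests
    # every role must pass before the build can be released
    return all(results.values()), results


ok, results = run_pipeline(
    "ACME-RHEL7.2-20161027",
    ["webserver", "database"],
    redeploy=lambda build, role: None,   # stub: provision a fresh test VM
    run_tests=lambda build, role: True,  # stub: pretend all tests pass
)
print(ok)  # True
```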
During the maintenance stage, the build is continuously updated. Updates come in a variety of flavours:
- Vendor Patches are triaged and added to the build as required
- Defects arising from end user issues may result in configuration and software changes
- Enhancements resulting from end user requests may result in configuration and software changes to a current build rather than being deferred to the next build
Maintenance workflows need to answer several questions and are highly dependent on the organisation’s attitude towards risk and stability. These workflows will be expanded upon in detail in the next article in this series, but some of the questions that will be answered are listed below:
- What changes need to be applied to the current build and what can be deferred to the next build?
- How do I assess vendor patches for inclusion in the current build?
- What is my schedule for releasing a new version of the current build?
- How do I treat emergency patches that might need to be released outside of a formal schedule? For example a patch that is required to fix a critical security flaw, or a bug that is affecting a revenue-generating line of business?
- After including a patch into the current build and incrementing the build version, what is my policy on actually applying the patch to deployed servers?
Once the build is retired, it will no longer be updated. No new servers should be built using a retired build, and servers currently on the retired build should be migrated to a currently maintained build, or redeployed.
Managing Multiple Releases
One of the challenges of complex environments is that multiple builds need to be managed simultaneously. For example, a large organisation following the vendor release-based build strategy might have several builds in play at once, spanning multiple RHEL major and minor versions, each at a different lifecycle stage.
The workflows and processes required to manage complex environments such as this will be described in much greater detail in the next article in the series.