# Design Details for pe-admin

([original design doc (Google doc)](https://docs.google.com/document/d/16D3IEgEy6oFz9yF_BahmWeZDZTEHGtevvwVtrgkbhQU))

## Problems and Limitations of the Old Tools

The principal limitation of the current BASH installer script, its helper
scriptlets, and the puppet-infrastructure configure action it calls is that
they can only operate on the PE node that they are invoked on. In some cases
they attempt to make use of supporting tools, such as the psql client utility,
to query information from other nodes, but they only change the local node.
They have no concept of or capability for orchestrating the install/upgrade of
PE as a whole.

This orchestration limitation crops up during extra-large installs, extra-large
upgrades, HA upgrades, and any time we need the installer to make upgrade
judgements based on overall PE status. Examples include mco deprecation, legacy
split deprecation, puppet-agent platform changes, compiler to pdb-compiler
changes, postgresql migrations, and the correctness of PE Infrastructure
pxp-agent configuration.

To work around this, we've resorted to increasing the complexity of both the
puppet-enterprise-installer script and the `pe_manager` utilities it and `puppet
infra configure` use when they try to assess the current system state. In the
case of HA upgrades, a bolt plan is used to shut down replication before the
upgrade of the master. But resyncing HA does not occur until the administrator
runs a separate command to upgrade the replica, assuming the core upgrade
succeeded.

All of this is complicated by the fact that the current upgrade requires a
bootstrap trio of packages, pe-installer, pe-modules and puppet-agent, to be
installed early by the puppet-enterprise-installer script so that it can make
these various tests and have the correct version of puppet-infra configure to
invoke. But by upgrading the packages, we've already altered the state of the
system. This creates rollback problems when validations fail. It also
introduces a sequence problem in that some utilities we rely on, like psql,
must be used before puppet-agent is upgraded, since they rely on the agent's
openssl implementation, which may change too drastically across versions (LTS
-> LTS upgrades, for example) for an old version of psql to function with a
newer agent.

In addition, the same tools are invoked (puppet-enterprise-installer) for both
install and upgrade workflows. The tools use heuristics (presence of a current
install, version files, etc.) to test whether they are being invoked for a
fresh install or an upgrade, being reinvoked after a partly failed install or
upgrade, or being called separately to repair an existing installation that was
broken outside of the upgrade cycle. These tests are error prone and further
degrade the idempotency of the installation process.

Finally, there is an isolation problem. The current tools and their libraries
have dependencies on the local puppet, and get tangled up in the initialization
and bootstrap of the local puppet installation, settings and facts when they
require puppet.

Because of these limitations, and the inherent limitations of BASH, the tech
debt of the existing tools has grown unmanageable as we work to handle larger
and more complex installations of PE. This increases upgrade failures and
support tickets and contributes to customer frustration with larger upgrades. 

 * puppet-enterprise-installer is a large, awkward BASH script
   * BASH is inherently error prone to write and maintain
   * Difficult to test (the support for automated testing of BASH is limited,
     pushing failure cases out to expensive integration tests that use full
     vms)
   * Attempts to move functionality (such as validation) into Ruby run up
     against package order issues (shim has to install some packages, such that
     we have the code we need to validate with, but installation of some packages
     (such as puppet-agent) may break both our ability to test and PE's integrity)
   * Error handling is poor
 * `puppet infrastructure configure` is a Ruby Puppet Face
   * It was originally written to do one thing: run a puppet apply on the local
     node from a constrained modulepath
   * It is now burdened with:
     * logic to scrape current configuration from console/hiera/puppetdb for upgrades
     * logic related to determining install versus upgrade versus repair state,
       and also postgres migration state
     * logic for half of HA upgrade orchestration, the other half being handled
       by the `puppet infra upgrade replica` command
   * Fundamentally, its isolation is incomplete and there can still be bleed
     over of configuration from the installed agent into configure's apply

## Goals

This is a listing of the main project goals.

 * Make the pe-installer package the only requirement for installing PE
   (aside from a PE tarball)
 * Be idempotent
 * Separate install from upgrade
 * Use Bolt Inventory to manage classification of infrastructure for
   the plans
 * Allow for installation and upgrade of all the infrastructure,
   including compilers and replica
 * Allow manual per node install/upgrade where ssh is restricted or as
   a fallback when pcp is interrupted
 * Allow the Bolt plans for installing/upgrading PE to be used outside
   of pe-installer for custom workflows
 * Allow running with sudo
 * Eliminate four years of tech debt in the current installer tooling
   by handling all management actions as either Puppet code (plans and
   manifests) or Ruby tasks
 * Prefer orchestrated install/upgrade of PE from one controlling node
 * Prefer remote install/upgrade of PE from a bastion host or
   administrative workstation
 * Allow for XL-HA installations, either via peadm, or
   `enterprise_tasks` or whatever module ends up housing the PE
   install/upgrade plans
 * Allow for managed `ha_proxy` load balancers
 * Allow use of pcp transport for upgrades where ssh is a problem

## Name

I chose pe-admin because I needed to call it something, and this name defines
it as a PE tool and provides a concise description of its purpose.

It is close to the name of the peadm module, which could be a good thing or a bad thing.

Other name suggestions are welcome.

## Structure

The pe-admin tool is built as an extension of Bolt, providing UX
support for the specific tasks of installing, upgrading, repairing
and validating one or more PE installations. Each of these operations,
in the idealized remote case where Bolt can connect freely to all of
the required nodes, will be orchestrated by a single plan. One of the
goals of the project is that with sufficient Bolt expertise and familiarity
with the plans and their parameters, an end user can use the plans
directly in their own automation workflows.

Another goal is to allow manual install and upgrade where Bolt cannot
orchestrate (due to SSH/other transport restrictions). This is why the
top-level orchestration plans ultimately end up calling `pe-admin
local-install` on the core master and database nodes, for example, to
perform the actual Puppet applies that modify the node. This replaces use of
the overburdened and buggy `puppet-infrastructure configure` to bring a core
infrastructure master or database node into compliance.

The pe-admin tool is pure Ruby. The pe-installer package provides an isolated
Puppet runtime: Ruby and openssl, and additional gems such as Puppet, Facter
and Bolt. The cli is defined as a Thor class. Configuration is handled as a
Bolt project, which can be maintained at ~/.puppetlabs/pe-admin. (This likely
needs some updating with recent solidification of Bolt projects in 2.23?+)
Since pe-admin is essentially a wrapper around the Bolt library, specific,
enduring configuration can be managed for the pe-admin tool under the mentioned
configuration directory, or absorbed generally from any top level /etc Bolt
configuration on the host.

The intention is for pe-admin to be used on workstation nodes or bastion hosts
that a user can access with keys which allow them to reach the infrastructure
nodes (or test nodes) that they are going to manage.

The tool is equally useful for spinning up test infrastructure internally on
vmpooler, or whatever else you can reach via ssh.

Internally, pe-admin interacts with the Bolt API directly through a facade
class that hides most Bolt internals. It does not shell out, so it operates as
a single process with tighter control over Bolt's output, for example by
supplying its own Reporter class.
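
As a rough illustration of the facade idea, the sketch below wraps a backend
behind one small class and routes progress through an injected reporter. Every
name here (`PlanFacade`, `FakeBackend`, `CollectingReporter`, the plan name) is
hypothetical; none of them are taken from pe-admin or from the Bolt API, and the
backend is a stand-in so that the example runs on its own.

```ruby
# Hypothetical sketch: none of these class or method names come from
# pe-admin or Bolt.
class PlanFacade
  # backend:  any object responding to run_plan(name, params)
  # reporter: any object responding to report(event)
  def initialize(backend:, reporter:)
    @backend = backend
    @reporter = reporter
  end

  # Callers only see this method; the backend's internals stay hidden.
  def run_plan(name, params = {})
    @reporter.report("starting #{name}")
    result = @backend.run_plan(name, params)
    @reporter.report("finished #{name}: #{result[:status]}")
    result
  end
end

# Stand-in for the real library so the sketch is self-contained.
class FakeBackend
  def run_plan(name, params)
    { plan: name, params: params, status: 'success' }
  end
end

# Collects events instead of printing them, the way a custom reporter
# could decide what reaches the console.
class CollectingReporter
  attr_reader :events

  def initialize
    @events = []
  end

  def report(event)
    @events << event
  end
end

reporter = CollectingReporter.new
facade = PlanFacade.new(backend: FakeBackend.new, reporter: reporter)
result = facade.run_plan('example::install', 'master' => 'pe-master.example.com')
```

Because the reporter is injected, the same facade could drive a quiet console
reporter in normal use and a collecting one in unit tests.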

### Inventory

One of the key challenges to working with PE is concretely identifying which
hosts are in what architectural PE Infrastructure roles. Bolt's concept of
inventory fits this naturally. For fresh installation cases, we seed the
inventory directly from command line or pe.conf defaults. For upgrade/repair
cases with existing infrastructure, Bolt inventory can interact directly with
puppetdb to supply those groupings.

For cases when puppetdb is down, we could use services.conf, or rely on
user input, or potentially on stored inventory files. This could be used long
enough to repair the core PE node, re-enabling puppetdb for more thorough
validation and repair of the whole installation, for example.
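
For the fresh-install case, seeding could look roughly like the sketch below,
which turns a set of role assignments into a hash shaped like a Bolt inventory
file (a `groups` list whose entries have a `name` and `targets`). The
`NodeRoles` struct, its fields, the group names and the hostnames are
assumptions made for illustration, not the actual pe-admin classes.

```ruby
require 'yaml'

# Hypothetical sketch: NodeRoles and the group names are illustrative,
# not the names used by pe-admin.
NodeRoles = Struct.new(:master, :database, :compilers, :replica, keyword_init: true) do
  # Build a Bolt-inventory-shaped hash with one group per populated role.
  def to_inventory
    groups = {
      'master' => [master],
      'database' => [database],
      'compilers' => compilers,
      'replica' => [replica]
    }.map { |name, hosts| { 'name' => name, 'targets' => Array(hosts).compact } }
    groups = groups.reject { |group| group['targets'].empty? }

    { 'groups' => groups }
  end
end

# Roles as they might arrive from command line arguments or pe.conf.
roles = NodeRoles.new(
  master: 'pe-master.example.com',
  database: 'pe-db.example.com',
  compilers: ['pe-compiler-1.example.com', 'pe-compiler-2.example.com']
).freeze

inventory = roles.to_inventory
puts inventory.to_yaml
```

Roles that are not supplied (the replica here) simply produce no group, so the
same code covers standard, large and extra-large layouts.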

### Validation

By ensuring that we validate the existing system up front, we can fail early
and provide better feedback about problems. Some work was begun in the
puppet-enterprise-installer script along these lines, but the validation
attempts there are inherently fragile because they are tangled up in the state
of the node they are testing and its puppet-agent and pe-modules packages.

By instead focusing work on validation and feedback using pe-admin for
connectivity, existing service status, pxp-agent configuration, replica status,
puppet agent status, system memory, file system size, or whatever else we
check, we can write robust, tested routines in Ruby/Bolt that check what we
need up front, ideally before we've changed anything about PE itself.
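
One possible shape for such routines is a small runner that executes every
registered check before anything is changed and reports all failures together.
This is only a sketch: the `Validator` class, the check names, the 8 GB
threshold and the facts hash are invented for illustration, and real checks
would gather their data through Bolt.

```ruby
# Hypothetical sketch: check names, threshold and facts are made up.
CheckResult = Struct.new(:name, :ok, :message)

class Validator
  def initialize
    @checks = []
  end

  # Register a named check. The block receives the gathered facts and
  # returns [ok, message].
  def check(name, &block)
    @checks << [name, block]
    self
  end

  # Run every check, even after a failure, so that the user sees all
  # problems at once rather than one per attempt.
  def run(facts)
    @checks.map do |name, block|
      ok, message = block.call(facts)
      CheckResult.new(name, ok, message)
    end
  end
end

validator = Validator.new
validator.check('memory') do |facts|
  needed = 8 # illustrative threshold, not a PE requirement
  [facts[:memory_gb] >= needed, "#{facts[:memory_gb]} GB available, #{needed} GB required"]
end
validator.check('services') do |facts|
  stopped = facts[:services].reject { |_name, state| state == 'running' }.keys
  [stopped.empty?, stopped.empty? ? 'all services running' : "not running: #{stopped.join(', ')}"]
end

facts = {
  memory_gb: 4,
  services: { 'pe-puppetserver' => 'running', 'pe-puppetdb' => 'stopped' }
}
results = validator.run(facts)
failures = results.reject(&:ok)
failures.each { |r| warn "FAILED #{r.name}: #{r.message}" }
```

Since each check is a plain block over a facts hash, it can be unit tested
without a VM, which is the main advantage over the BASH validations.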

### Output and Logging

The old tooling's output is fairly exhaustive but difficult to follow and does
not distinguish between console and log file. The pe-admin tool uses the
Logging gem, just like Bolt, and allows us much more flexibility to concisely
explain what the tool is doing on the console while retaining details in the
log files.
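
The console/log split can be sketched as two destinations with different
thresholds and formats. The example uses Ruby's standard `Logger` as a stand-in
so that it runs without extra gems; it is not how pe-admin configures the
Logging gem, and `SplitLogger` is a hypothetical name.

```ruby
require 'logger'
require 'stringio'

# Hypothetical sketch using the standard library Logger as a stand-in
# for the Logging gem.
class SplitLogger
  def initialize(console_io, file_io)
    # Console: INFO and above, message text only.
    @console = Logger.new(console_io, level: Logger::INFO)
    @console.formatter = proc { |_severity, _time, _progname, msg| "#{msg}\n" }
    # Log file: everything, with the default severity and timestamp prefix.
    @file = Logger.new(file_io, level: Logger::DEBUG)
  end

  # Forward each call to both destinations; each applies its own level.
  %i[debug info warn error].each do |level|
    define_method(level) do |message|
      @console.public_send(level, message)
      @file.public_send(level, message)
    end
  end
end

console = StringIO.new # would be $stdout in a real tool
logfile = StringIO.new # would be a File opened for append
log = SplitLogger.new(console, logfile)
log.info('Installing PE on pe-master.example.com')
log.debug('apply finished with 0 failures')
```

Here the debug line reaches only the log destination, while the console shows
just the one-line summary.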

### Changes compared to the old tools

#### puppet-enterprise-installer

This BASH script used to be the starting point for installation and upgrade,
but is not used by pe-admin.

We will likely hollow it out and have it perform a bootstrap-installer and then
a `pe-admin install` for the simple monolithic case as a stepping stone to
the new workflow. It will need deprecation warnings and documentation pointers.

#### puppet infrastructure configure

The configure action of the infrastructure face has been saddled with three
roles from the beginning: 'install', 'upgrade', and 'repair'. Distinguishing
between them is problematic. Its basic job is to assemble an initial catalog of
classes to be applied from an isolated environmentpath. Its isolation is
incomplete, and it is encumbered with a lot of secondary validation and upgrade
code that was wedged in to deal with upgrade cases.

Bolt natively does a better job of isolating Puppet for an apply. Bolt running
local transport does not have any transport problems. So once pe-installer is
installed on a node, `pe-admin local-install` can directly apply all the
necessary classes to configure the node in a safe manner.

Most likely, local-upgrade and local-repair commands will follow the same
pattern so that we're explicit about what mode we're operating in.

## Codebase

Breakdown of the lib structure:

 * pe\_installer.rb - establishes the top level namespace, tool name and base requires.
 * pe\_installer
   * architecture.rb - provides abstractions for the type of PE installation
     (as documented by [PE Architecture Docs](https://puppet.com/docs/pe/latest/choosing_an_architecture.html)).
   * bolt\_interface.rb - the facade for initializing and interacting with the
     Bolt API. As best as practical, this class should be the cut line for
     exposing Bolt internals.
   * cli.rb - provides the Thor command line structure for the tool.
   * cli\_ext
     * setup.rb - common setup code used by the cli. Having this extracted
       separately allows the Cli class to just document the command and
       argument structure.
   * config.rb - manages configuration for pe-admin, whether supplied on the
     command line or from configuration files.
   * error.rb - root error class. More specific errors are subclassed from it
     to ensure that errors thrown by pe-admin are catchable concisely.
   * interviewer.rb - (TBD) placeholder class for commandline interaction.
   * inventory.rb - abstraction around a Bolt::Inventory that allows us to look
     up nodes for the PE infrastructure roles (masters, compilers, replica,
     databases, infrastructure) as Bolt::Targets for use in plans.
   * inventory
     * node_roles.rb - an immutable structure of PE node roles (master,
       database, etc) as provided from commandline args and/or pe.conf. Used to
       seed an Inventory for initial install, for example.
   * logger.rb - pe-admin uses the [logging](https://github.com/TwP/logging)
     gem. This module is a mixin of logger helper methods.
   * manager.rb - engine that handles execution of a command with the given
     config inputs using the BoltInterface to run a particular plan.
   * reporter.rb - nearly stub class that will ultimately handle command output.
   * util.rb - module of utility class methods that have external dependencies
     which were easier to isolate this way for testing, or which didn't
     otherwise fit in another class.
   * version.rb - the tool version.
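
The root error class described for error.rb makes it possible to rescue every
tool-specific failure in one place while letting unexpected exceptions
propagate. A minimal sketch follows, assuming a `PeInstaller` namespace; the
subclass names and the `run_command` helper are hypothetical.

```ruby
# Hypothetical sketch: the namespace, subclass names and run_command
# helper are illustrative only.
module PeInstaller
  class Error < StandardError; end
  class ConfigError < Error; end
  class InventoryError < Error; end
end

# Run a block and translate tool errors into an exit code. Anything that
# is not a PeInstaller::Error is left to propagate as a genuine bug.
def run_command
  yield
  0
rescue PeInstaller::Error => e
  warn "pe-admin: #{e.message}"
  1
end

exit_code = run_command do
  raise PeInstaller::InventoryError, 'no master found in inventory'
end
```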

## Future

Currently only an install command is prototyped.

### fetch

A fetch command would allow you to pull down requested tarballs by version and platform from pm.puppetlabs.com. That should allow you to get release or dev tarballs on the internal network, and release tarballs on an external network.

### upgrade

The upgrade command will put more effort into validating the existing system up front and, if it is found to be in a good state, then run the upgrade plan. It will look up the current infrastructure hosts via puppetdb queries in inventory ([example inventory file](./puppetdb_inventory.yaml)).

Validation needs to be informative but bypassable, since if an upgrade fails, it will usually leave you in a state where a validation would no longer succeed.

Some validation checks are upgrade specific:

* tarball platform versus install platform
* deprecated services
* deprecated configuration
* deprecated architectures
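
As an example of the first check, the sketch below compares the platform tag in
a tarball's file name with the platform of the installed node. It assumes
release tarballs are named `puppet-enterprise-<version>-<platform>.tar.gz` and
that the installed platform is available as the same kind of tag; both are
assumptions, the helper names are hypothetical, and dev builds with longer
version strings are not handled.

```ruby
# Hypothetical sketch. Assumes release tarballs are named
# puppet-enterprise-<version>-<platform>.tar.gz; adjust the pattern if not.
TARBALL_PATTERN = /\Apuppet-enterprise-(?<version>\d+\.\d+\.\d+)-(?<platform>.+)\.tar(?:\.gz)?\z/.freeze

# Return the platform tag from a tarball path, or nil if the name does
# not match the assumed pattern.
def tarball_platform(path)
  match = TARBALL_PATTERN.match(File.basename(path))
  match && match[:platform]
end

# True only when the tarball name parses and its platform tag equals the
# installed node's platform tag.
def platform_matches?(tarball, installed_platform)
  platform = tarball_platform(tarball)
  !platform.nil? && platform == installed_platform
end

tarball = '/tmp/puppet-enterprise-2019.8.1-el-7-x86_64.tar.gz'
same = platform_matches?(tarball, 'el-7-x86_64')
different = platform_matches?(tarball, 'ubuntu-18.04-amd64')
```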

### validate

Checking the current state of the PE installation should be runnable by itself.

Things that could be tested:

* connectivity
* service status
* installation architecture (standard/large/xl is correctly structured)
* PE node group structure
* pxp agent configuration
* puppet runs come back in a steady state

### repair

Currently you can run `puppet-infra configure` on a given PE node to correct
local configuration and ensure services are reinstalled and running. This
ultimately is a local apply of the same pe_install classes used during install
and upgrade.

A repair is required on the master or database whenever the infrastructure is
broken, for whatever reason, to the point where Puppet cannot get a catalog
from the master to correct itself.

It is also part of workflows to regenerate certificates on core infrastructure
nodes, since invalidating a certificate likewise means that services cannot
connect properly until the new certificate is generated and configured.

Just as with the `pe-admin local-install` case, a plan run via pe-admin's Bolt
interface over localhost will apply pe_install classes to repair.

The `pe-admin repair` command will handle recovering configuration on the core
nodes, running the local repair command on each node in the proper sequence,
and validating the services are back.

### support

Potentially a command could be used to run the support script on some/all of the
PE infrastructure, possibly collecting and collating the tarballs, and/or
uploading if on a Bastion.

### peadm

As mentioned above, we need to bridge the gap between the installation and
upgrade plans in enterprise_tasks and those in puppetlabs-peadm so that we can
support extra-large-ha installations. Whether pe-admin ends up calling plans
from enterprise_tasks or peadm doesn't matter much so long as the plans cover
the necessary use cases.

In addition, there are currently two flavors of extra-large in production use.

One is the documented PE extra-large configuration, which has a master node and
a separate pe-postgresql node with all the databases. This configuration was
the original alternative to the legacy 3-way master/console/puppetdb split that
was deprecated in 2018.1 and removed in 2019.0. In 2019.0 it's documented as
monolithic with compilers and standalone postgresql node. In 2019.2 it's simply
documented as extra-large. This configuration does not support HA.

The second is the flavor installed by peadm, which has running postgresql
services on both the master and the database node. The database node only has
the puppetdb database, while the master node provides the remaining console and
orchestrator databases. This variation can support HA using the peadm module. A
conversion step is required to get from the standard XL to the peadm XL; I
believe peadm does provide a convert plan for this, but there is work needed to
reconcile extra-large installations so that PE has only one flavor.

### replicated

Using Replicated to manage PE installations will (whenever that comes to pass)
fork PE management, in that I expect we will have a subset of PE installations
to manage with Replicated and a subset of installations to manage without it.

Replicated management of PE, to my current and limited understanding, is
essentially a containerized installation of PE, and will not be using any of
the PE Puppet code or tooling currently used in PE management.

The pe-admin tool doesn't change any of that. It's not an answer to Replicated,
it's an answer to better tooling for the subset that we manage outside of
Replicated, for however long that happens to be.
