PurpleTeam TLS Tester Implementation


Kim Carter

Tuesday, September 7, 2021

The PurpleTeam TLS Tester is now implemented. I’ve written this post to highlight the learnings, and to talk about the various significant changes that were made as part of the release. All core components were released as version 1.0.0-alpha.3.


The details of the above video can be found here.

Contents

All of the release notes can be accessed from the GitHub issue.

Documentation

Work items created

As a result of the TLS Tester Implementation

Synchronisation

There ended up being quite a bit of work done around synchronisation of the components, and there is still work to be done. There were architectural decisions made several years ago that needed some modification, and as you can see from the Work items created there is ongoing work that needs to be done.

For example, near the end of the implementation I discovered another edge-case where the state of a given Tester is incorrect if a different Tester is in a Tester failure: state. You can read about the issue here. We will be addressing this one soon.

Then there is the lack-of-retry issue in the orchestrator Tester models, which was also found near the end of the TLS implementation work. It probably won’t occur very often at all (we have never witnessed it), but it still needs to be fixed.

Before we get started discussing the synchronisation of components, you will need some understanding of the various relevant time-outs in the code base.

Time-outs

Many of the time-out issues with AWS just don’t exist when running locally. AWS API Gateway does not support streaming, so we need to use long polling (lp) between the CLI and the orchestrator in the cloud environment.

CLI

For the test command

The initial request to the orchestrator for the test command has a set of timeouts, but it must stop trying before the back-end fails due to:

  • Stage Two containers not being up and responsive within the current duration of 120000 (s2Containers.serviceDiscoveryServiceInstances.timeoutToBeAvailable) + 30000 (s2Containers.responsive.timeout)
  • The Stage Two container service discovery services not being up and responsive within the same duration as above

If the CLI continues to retry after a back-end timeout, then it may continue to do so indefinitely if unsupervised, as is likely if being used in noUi mode.

The time-out series for the test command currently looks like the following for the cloud environment. The CLI doesn’t time out at all for local:

Tries:

  1. 23000,
  2. 15000,
  3. 15000,
  4. 10010,
  5. 10010,
  6. 10010,
  7. 10010,
  8. 10010,
  9. 10010,
  10. 10010,
  11. 10010,
  12. 10010,
  13. 0 // Cancel

This adds up to 143090 ms plus some request and response latency, which is a little short of the back-end’s 150000 ms plus some comms latency in AWS.
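The arithmetic above can be checked with a short script. This is only a sketch: the array restates the schedule listed above, and the names testRetrySchedule, backEndBudget and cliBudget are illustrative rather than taken from the PurpleTeam code base.

```javascript
// Retry schedule for the test command in the cloud environment (milliseconds).
// The final 0 means "cancel", so it adds nothing to the total.
const testRetrySchedule = [23000, 15000, 15000, ...Array(9).fill(10010), 0];

// Back-end budget: timeoutToBeAvailable + responsive.timeout (milliseconds).
const backEndBudget = 120000 + 30000;

// Total time the CLI spends waiting across all tries.
const cliBudget = testRetrySchedule.reduce((sum, ms) => sum + ms, 0);

console.log(cliBudget);                 // 143090
console.log(backEndBudget - cliBudget); // 6910
console.log(cliBudget < backEndBudget); // true
```

The 6910 ms of headroom is what lets the CLI give up before the back-end does.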

For tester[ Progress | PctComplete | BugCount ] updates

Five long-poll request attempts with no data returned from the orchestrator and the CLI gives up.

// ...,
testerFeedbackComms: {
  longPoll: {
    nullProgressMaxRetries: {
      doc: 'The number of times (sequentially receiving an event with a data object containing a property with a null value) to poll the backend when the orchestrator is not receiving feedback from the testers.',
      format: 'int',
      default: 5
    }
  }
},
// ...
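A minimal sketch of how such a counter could behave. The helper name createNullProgressCounter and the event shape are assumptions made for illustration. Only the limit of 5 sequential empty responses comes from the configuration above.

```javascript
// Hypothetical helper: tracks sequential long-poll events whose data object
// contains a property with a null value.
const createNullProgressCounter = (maxRetries = 5) => {
  let consecutiveNulls = 0;
  return {
    // Returns true while the CLI should keep polling, false once it should give up.
    record(event) {
      const hasNullValue = Object.values(event.data).some((value) => value === null);
      // Any event carrying real data resets the count.
      consecutiveNulls = hasNullValue ? consecutiveNulls + 1 : 0;
      return consecutiveNulls < maxRetries;
    }
  };
};

const counter = createNullProgressCounter(5);
const empty = { data: { progress: null } }; // Assumed event shape.
const results = [empty, empty, empty, empty, empty].map((e) => counter.record(e));
console.log(results); // [ true, true, true, true, false ]
```

Resetting the count on any real data means that only an unbroken run of empty responses causes the CLI to stop polling.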

Orchestrator

The following is used in the testerWatcher and needs to be well under the AWS API Gateway timeout, which is 30 seconds:

// ...,
testerFeedbackComms: {
  // ...,
  longPoll: {
    timeout: {
      doc: 'A double that expresses seconds to wait for blocking Redis commands. We need to timeout well before the AWS Api Gateway timeout.',
      format: Number,
      default: 20.0
    }
  }
}

App Tester

// ...,
s2Containers: {
  serviceDiscoveryServiceInstances: {
    timeoutToBeAvailable: {
      doc: 'The duration in milliseconds before giving up on waiting for the s2 Service Discovery Service Instances to be available.',
      format: 'duration',
      default: 120000
    },
    retryIntervalToBeAvailable: {
      doc: 'The retry interval in milliseconds for the s2 Service Discovery Service Instances to be available.',
      format: 'duration',
      default: 5000
    }
  },
  responsive: {
    timeout: {
      doc: 'The duration in milliseconds before giving up on waiting for the s2 containers to be responsive.',
      format: 'duration',
      default: 30000
    },
    retryInterval: {
      doc: 'The retry interval in milliseconds for the s2 containers to be responsive.',
      format: 'duration',
      default: 2000
    }
  }
},
// ...

The emissary.apiFeedbackSpeed is used to send the CLI the following message types: testerProgress, testerPctComplete and testerBugCount, thus keeping the lp alive. This duration needs to be less than the orchestrator’s 20 second testerFeedbackComms.longPoll.timeout.

emissary: {
  // ...,
  apiFeedbackSpeed: {
    doc: 'The speed to poll the Zap API for feedback of test progress',
    format: 'duration',
    default: 5000
  },
  // ...

TLS Tester

If we don’t receive any update from the TLS Emissary within this duration (messageChannelHeartBeatInterval) then the TLS Tester sends the CLI a testerProgress message with the textData: Tester is awaiting Emissary feedback.... This duration needs to be less than the orchestrator’s 20 second testerFeedbackComms.longPoll.timeout to make sure the CLI continues to poll the orchestrator for tester[Progress|PctComplete|BugCount] updates.

// ...,
messageChannelHeartBeatInterval: {
  doc: 'This is used to send heart beat messages every n milliseconds. Primarily to keep the orchestrator\'s testerWatcher longPoll timeout from being reached.',
  format: 'duration',
  default: 15000
},
// ...
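The relationship between these durations can be written down as a simple check. This is a sketch only: the values are copied from the configuration fragments above, while the checkTimeouts function and its parameter names are hypothetical and not part of the PurpleTeam code base.

```javascript
// Values copied from the configuration fragments above, normalised to milliseconds.
const apiGatewayTimeoutMs = 30000;               // AWS API Gateway limit (30 seconds)
const longPollTimeoutMs = 20.0 * 1000;           // orchestrator longPoll.timeout (expressed in seconds)
const apiFeedbackSpeedMs = 5000;                 // App Tester emissary.apiFeedbackSpeed
const messageChannelHeartBeatIntervalMs = 15000; // TLS Tester heart beat

// Hypothetical check: every Tester must publish more often than the orchestrator's
// blocking wait expires, and that wait must expire before API Gateway gives up.
const checkTimeouts = ({ gateway, longPoll, testerIntervals }) =>
  longPoll < gateway && testerIntervals.every((interval) => interval < longPoll);

console.log(checkTimeouts({
  gateway: apiGatewayTimeoutMs,
  longPoll: longPollTimeoutMs,
  testerIntervals: [apiFeedbackSpeedMs, messageChannelHeartBeatIntervalMs]
})); // true
```

Raising a Tester interval to 20000 or more would make the check fail, because the orchestrator's blocking wait could then expire before the next message arrives.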

Message flows

There are two flow types in play between the orchestrator and the CLI, namely Server Sent Events (sse) and Long Polling (lp).

Before reading this section, head over to the orchestrator README for a quick run-down on how PurpleTeam is using sse and lp.

Before the TLS implementation, the testerFeedbackComms.medium was defined in the configuration for both the orchestrator and the CLI. Both configurations had to match. If they didn’t, the orchestrator would respond with an error message. Now this is defined in the orchestrator only, and the orchestrator tells the CLI which medium it should use before starting either sse or lp.


When the CLI runs the test command, there are three significant sequential events. I’ll brush over or omit less significant events to make the flow easier to understand. If you’d rather just read the code, it’s here:

  1. CLI makes a POST request to the orchestrator’s /test route with the Job, and continues to do so according to its retry schedule.
    The orchestrator’s testTeamAttack routine is where a lot of the decision making occurs
    • If a Test Run is already in progress (initTesterResponsesForCli is defined) and the orchestrator already has the responses from the requests to the Testers’ /init-tester routes (initTesterResponsesForCli has a length), whether the Testers were successfully initialised or not, then the Tester responses, along with whether to use sse or lp to subscribe to Tester feedback, are returned to the CLI
    • If a Test Run is already in progress (initTesterResponsesForCli is defined) but the responses from the requests to the Testers’ /init-tester routes have not yet been received, the orchestrator causes a client-side time-out, because it wants the CLI to try again once it times out
    • If execution gets past the above then a Test Run is not currently in progress, so the orchestrator:
      1. Sets an in-progress flag
      2. Asks its Tester models to initialise their Testers and waits for the responses
      3. Once all of the responses are received, the orchestrator populates a failedTesterInitialisations array with any Tester failure:… messages
      4. The orchestrator creates a startTesters boolean and assigns it true if every active Tester has its state set to Tester initialised.… (not Awaiting Job., Initialising Tester., or [App|Tls] tests are running.), otherwise false is assigned
      5. If there were any failedTesterInitialisations or startTesters is false:
        1. initTesterResponsesForCli is populated with the responses from trying to initialise the Testers (both successful and/or unsuccessful)
        2. A response is returned to the CLI with initTesterResponsesForCli and whether the orchestrator expects the CLI to use sse or lp
      6. Otherwise:
        1. The orchestrator invokes each Tester’s /start-tester route
        2. If we are running in cloud, the orchestrator warms up the Test Session message (Redis) channels and lists. This waits for all Testers of the represented Test Sessions to provide their first message set. These message sets are assigned to an array called warmUpTestSessionMessageSets, which looks like the following before being populated with messages:
          [
            {
               channelName: 'app-lowPrivUser',
               testerMessageSet: []
            }, {
               channelName: 'app-adminUser',
               testerMessageSet: []
            }, {
               channelName: 'tls-NA',
               testerMessageSet: []
            }
          ]
          

          If Testers are started and the orchestrator did not subscribe to the Test Session message channels, it would never know when the Test Sessions are finished in order to clean up, so this subscription must occur.

        3. initTesterResponsesForCli is populated with the responses from trying to initialise the Testers (only successful)
        4. A response is returned to the CLI with initTesterResponsesForCli and whether the orchestrator expects the CLI to use sse or lp
  2. CLI makes a GET request to one of the following routes. Currently this happens whether all Testers were initialised successfully or not. There is no point in it happening if any Tester failure: messages were returned from any Testers, and we will change this soon:
    • If using sse, /tester-feedback/{testerName}/{sessionId}:
      In this case messages from the Test Sessions continue to flow through the Redis channels and the orchestrator continues to push them to the CLI
    • If using lp, /poll-tester-feedback/{testerName}/{sessionId}:
      In this case the CLI starts the long-poll process. The orchestrator checks whether warmUpTestSessionMessageSets contains an element for the given channel name (channel names are constructed like ${testerName}-${sessionId}); this will only be the case in the cloud environment. If so, that element is spliced out and returned. If not, the pollTesterMessages routine of the testerWatcher is invoked. pollTesterMessages is responsible for providing a callback to each Redis channel which, when invoked, takes the given message from a Tester’s Test Session and pushes it onto the tail of a Redis list with the same name as the Redis channel that the message was received from. Each time the CLI requests a message set for a given Test Session, if no messages are yet available it waits (on Redis blpop, a blocking head pop); if messages are available, they are popped (Redis lpop, a non-blocking head pop) from the head of the Redis list
  3. CLI makes a GET request to the /outcomes route
    • This happens once the CLI receives a message starting with All Test Sessions of all Testers are finished. By the time this has happened, the orchestrator has already cleaned up the Testers and created the Outcomes archive based on the results and reports generated by the Testers
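The decisions described in the steps above can be condensed into a few small functions. This is a simplified, hypothetical model and not the orchestrator's actual code: the function names, the returned action strings and the shape of the state argument are all assumptions. The names initTesterResponsesForCli, startTesters and warmUpTestSessionMessageSets, and the channel name format, come from the text.

```javascript
// Hypothetical model of what the orchestrator does with a POST to /test.
// initTesterResponsesForCli is undefined when no Test Run is in progress.
const decideTestRouteAction = ({ initTesterResponsesForCli }) => {
  if (initTesterResponsesForCli && initTesterResponsesForCli.length) return 'return-responses';
  if (initTesterResponsesForCli) return 'let-cli-time-out'; // Responses not in yet, the CLI retries.
  return 'start-test-run';
};

// startTesters is true only if every active Tester reports that it was initialised.
// Matching on the prefix is an assumption, the full state text is elided above.
const shouldStartTesters = (testerStates) =>
  testerStates.every((state) => state.startsWith('Tester initialised.'));

// Channel names are built from the Tester name and the Test Session id.
const channelName = (testerName, sessionId) => `${testerName}-${sessionId}`;

// For lp in cloud: splice out and return the warmed-up message set for a channel.
// Returns undefined if there is none, in which case polling would take over.
const takeWarmUpMessageSet = (warmUpSets, testerName, sessionId) => {
  const name = channelName(testerName, sessionId);
  const index = warmUpSets.findIndex((set) => set.channelName === name);
  return index === -1 ? undefined : warmUpSets.splice(index, 1)[0].testerMessageSet;
};

const warmUpTestSessionMessageSets = [
  { channelName: 'app-lowPrivUser', testerMessageSet: ['msg1'] },
  { channelName: 'tls-NA', testerMessageSet: ['msg2'] }
];
console.log(decideTestRouteAction({}));                                        // start-test-run
console.log(takeWarmUpMessageSet(warmUpTestSessionMessageSets, 'tls', 'NA')); // [ 'msg2' ]
console.log(warmUpTestSessionMessageSets.length);                             // 1
```

Because the warm-up set is spliced out, a second request for the same channel finds nothing there and falls through to the regular polling path.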

TLS Tester Implementation

Unlike the App Tester (app-scanner) which supervises an external Emissary (Zaproxy), the TLS Tester (tls-scanner) supervises an embedded Emissary (testssl.sh). This means that the TLS Emissary runs within the same container as the TLS Tester.

The Job file which the Build User provides to the CLI contains everything required to get the TLS Emissary running and targeting your website or web API.

The implementation of the TLS Tester was actually the easy part of this release. An additional stage one container image was required for local, and also for cloud, where it took the form of an AWS ECS Task Definition modification in the Terraform configuration. The AWS ECR deployment script also needed extending.

The new TLS Tester isn’t that different from the App Tester, other than being a lot simpler, because we don’t have to bring up stage two containers or deal with all the potential synchronisation issues around external Emissaries.

The execution flow goes from the /init-tester and /start-tester routes to the model.

/init-tester basically sets the Tester up with the Build User supplied Job and sets the status.

/start-tester starts (spawns) the Cucumber CLI, which initialises the Cucumber world which is where most of the domain specific parts are glued together, and the actual Cucumber Steps (tests) are run.

The following are added to the Cucumber world:

  • The messagePublisher (pushes messages onto Redis ${testerName}-${sessionId} channels)
  • sut (System Under Test) domain object
  • testssl domain object

The testssl.sh process is spawned.

Whenever the TLS Emissary writes to stdout the Tester deals with it here.
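As a rough illustration of handling Emissary output, the sketch below turns raw stdout chunks into one message per complete line. It is not the TLS Tester's actual handler: createStdoutHandler, the publish callback (standing in for the messagePublisher) and the message shape are assumptions made for this example.

```javascript
// Hypothetical sketch: buffer stdout chunks and publish one testerProgress
// message per complete line. publish stands in for the messagePublisher.
const createStdoutHandler = (publish) => {
  let buffered = '';
  return (chunk) => {
    buffered += chunk.toString();
    const lines = buffered.split('\n');
    buffered = lines.pop(); // Keep any trailing partial line for the next chunk.
    lines
      .filter((line) => line.trim().length > 0)
      .forEach((line) => publish({ type: 'testerProgress', textData: line }));
  };
};

const published = [];
const onStdout = createStdoutHandler((message) => published.push(message));
onStdout('line one\nline t'); // Second line is incomplete, so it is held back.
onStdout('wo\n');
console.log(published.map((message) => message.textData)); // [ 'line one', 'line two' ]
```

With Node's child_process.spawn, a handler like this would be attached with child.stdout.on('data', onStdout). Buffering matters because a chunk boundary can fall in the middle of a line.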