
Introduction

During the past two OPNFV releases, we ran every scenario on bare metal with all the test cases declared by functest and yardstick, without first verifying the basics to see whether a scenario worked at all.
This resulted in

  • wasted resources, since we didn't check whether a scenario was worth carrying to bare metal.
  • reduced ability to know which tests worked, since we ran all the tests all the time.
  • more importantly, long feedback times, since running everything on the limited number of bare metal resources we have naturally increased the number of queued builds on Jenkins.

There were several reasons for this. For example, our CI coverage was insufficient, which let faulty commits slip into the master branch.

It is clear that we need to improve our CI. We now know more about the type of environment we are working in, and we must be smarter about what we do in order to increase the quality of what we deliver.

The proposal on this page aims to solve some of the issues we experienced by introducing a couple of things.
Some of them are well known (commit gating and so on), and others are either genuinely new or may look new because of their naming (promotion, confidence levels), so please be patient and continue reading.

What is proposed here impacts everything: our infra, our way of working, and how we see and do things. It is therefore important that you read what is documented here and come back with questions and comments on whether it makes sense.

Another important point is that this proposal tries to bring alignment across the board, both in technical matters and in terminology. Technical alignment is important because if we do things differently, people will spend (that is, waste) their time trying to understand what is happening and where. Alignment in terminology is important so that we can understand what each other is talking about.

Terminology

Confidence Levels

Confidence levels are simple stamps that state the quality of the thing they are applied to. Usually artifacts gain confidence levels, but in OPNFV we generally talk about scenarios, so we can apply confidence levels to scenarios.

It is proposed to introduce the following confidence levels per scenario.

  • latest: scenarios that pass commit gate get this confidence level
  • daily: scenarios that pass daily loops get this confidence level
  • weekly: scenarios that pass weekly loops get this confidence level

In technical terms, this would work as follows.

  • latest: metadata for the scenario that passes commit gate is stored on artifact repo under a directory latest
  • daily: metadata for the scenario that passes daily loop is stored on artifact repo under a directory daily
  • weekly: metadata for the scenario that passes weekly loop is stored on artifact repo under a directory weekly

Promotion in CI

Promotion is the activity of stamping something with a higher confidence level: from latest to daily, and from daily to weekly.

In technical terms, this would work as follows.

  • latest: metadata for the scenario that passes commit gate is stored on artifact repo under a directory latest
  • daily: daily jobs consume latest scenarios and if any scenario passes this loop, the metadata for the scenario gets removed from latest and moved into daily directory.
  • weekly: weekly jobs consume daily scenarios and if any scenario passes this loop, the metadata for the scenario gets removed from daily and moved into weekly directory.

If a new change comes in that impacts an already promoted scenario, this should be reflected as well. In this case, if the scenario fails in an earlier loop, it needs to be demoted to the last loop it passed with the given artifact.
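
As an illustration, promotion and demotion can be sketched as file moves between confidence-level directories. This is a minimal sketch, not an agreed implementation: the scenarios/{installer}/{level}/{scenario} layout, the use of a local filesystem in place of the artifact repo, and the function names are all assumptions made for the example.

```python
import shutil
from pathlib import Path

# Confidence levels from this proposal, lowest to highest.
LEVELS = ["latest", "daily", "weekly"]


def _move(repo: Path, installer: str, scenario: str, src: str, dst: str) -> Path:
    """Move a scenario metadata file between two confidence-level directories.

    The scenarios/<installer>/<level>/<scenario> layout is an assumption.
    """
    source = repo / "scenarios" / installer / src / scenario
    target = repo / "scenarios" / installer / dst / scenario
    target.parent.mkdir(parents=True, exist_ok=True)
    shutil.move(str(source), str(target))
    return target


def promote(repo: Path, installer: str, scenario: str, current: str) -> Path:
    """Stamp the scenario with the next higher confidence level."""
    index = LEVELS.index(current)
    if index == len(LEVELS) - 1:
        raise ValueError(f"{current} is already the highest confidence level")
    return _move(repo, installer, scenario, current, LEVELS[index + 1])


def demote(repo: Path, installer: str, scenario: str,
           current: str, last_passed: str) -> Path:
    """Move the scenario back to the last loop it passed."""
    if LEVELS.index(last_passed) >= LEVELS.index(current):
        raise ValueError("last_passed must be lower than the current level")
    return _move(repo, installer, scenario, current, last_passed)
```

Because the metadata file is moved rather than copied, a scenario has exactly one confidence level at any time, which matches the "removed from latest and moved into daily" wording above.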

How the Things Fit Together

 

Here is how things would work

  • A developer sends a new patch and states in the commit message which scenario is impacted by that patch.
  • Verify job on Jenkins gets triggered by the Gerrit patchset-created event and the change gets verified by Jenkins.
    • Success: Patch gets Verified+1
    • Failure: Patch gets Verified-1
  • Change gets submitted/merged to master if and when it gets Review+2 from project committers.
  • Merge job on Jenkins gets triggered by the Gerrit change-merged event, and the resulting artifact (if valid for the given installer) and the verified scenario are stored on the artifact repository in the latest folder.
  • {installer}-daily-{branch} job polls the latest folder periodically to find out if there is any new scenario to test.
    • If it finds one, it triggers the daily job for that scenario and exits.
    • If it doesn't find one, it just exits.
  • {installer}-{scenario}-baremetal-daily-{branch} job
    • Removes the scenario from latest folder
    • Runs deployment, functest and yardstick daily suites.
    • If the run succeeds, scenario gets promoted into daily folder
    • If the run fails, the scenario is put back into the latest folder if and only if a new file for the same scenario is not there already. If the scenario is put back, the file gets updated with the failed run information so we don't rerun that scenario on baremetal. (Needs further thinking.)
  • {installer}-weekly-{branch} job polls the daily and weekly folders periodically (starts Friday 00:00 UTC, ends Monday 00:00 UTC) to find out if there is any new scenario to test.
    • If it finds one, it triggers the weekly job for that scenario and exits.
    • If it doesn't find one, it just exits.
  • {installer}-{scenario}-baremetal-weekly-{branch} job
    • Removes the scenario from daily folder if the scenario comes from daily.
    • Runs deployment, functest and yardstick weekly suites.
    • If the run succeeds, scenario gets promoted into weekly folder.
    • If the run fails, scenario is put back into daily folder if and only if a new file for the same scenario is not there already.

One difference between the daily and weekly loops is that we do not rerun the daily loop for a scenario that has already passed it, since its metadata has been moved into the daily folder.
The weekly loop, however, must be rerun on scenarios that already have the weekly confidence level, even when no change has targeted that specific scenario, to make sure the scenario has not been broken by other changes.
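
The bookkeeping around a daily run and the weekly polling window can be sketched as follows. This is only a sketch under stated assumptions: the latest and daily folders are modelled as plain dicts mapping scenario name to metadata, and the OPNFV_FAILED_DAILY_URL key and all function names are hypothetical. How failed scenarios should really be handled still needs further thinking, as noted above.

```python
from datetime import datetime, timezone

# Hypothetical metadata key recording a failed baremetal run.
FAILED_RUN_KEY = "OPNFV_FAILED_DAILY_URL"


def pending_scenarios(latest: dict) -> list:
    """Scenarios in latest that have not already failed a daily run."""
    return sorted(name for name, meta in latest.items()
                  if FAILED_RUN_KEY not in meta)


def settle_daily_run(latest: dict, daily: dict, scenario: str,
                     metadata: dict, passed: bool, run_url: str) -> None:
    """Apply the outcome of a daily run.

    metadata is what the job removed from latest before it started.
    """
    if passed:
        # Promotion: the scenario gains the daily confidence level.
        daily[scenario] = metadata
    elif scenario not in latest:
        # Put back, annotated so the poller does not pick it up again.
        latest[scenario] = {**metadata, FAILED_RUN_KEY: run_url}
    # Otherwise a newer file for the scenario arrived during the run; keep it.


def in_weekly_window(now: datetime) -> bool:
    """True from Friday 00:00 UTC up to, but not including, Monday 00:00 UTC.

    now must be timezone-aware.
    """
    return now.astimezone(timezone.utc).weekday() in (4, 5, 6)
```

A job would call latest.pop(scenario) before deploying and settle_daily_run(...) afterwards, so a newer file that lands in latest during the run is never overwritten by stale failure information.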

Jenkins Jobs in Detail

Commit Gating

The names in parentheses are the names of the corresponding/similar activities done on OpenStack CI using Zuul.

Patch Verification (Check)

There is a consensus to run basic checks, builds, virtual deployments, and functest smoke tests for patches coming to Gerrit for a given installer. Based on this, the job structure below is proposed.

Possible values for the keys are

  • installer: apex, compass, fuel, joid
  • branch: master, colorado

Each of these jobs blocks the ones that follow it. If any job fails, the rest of the chain will not be run and the patch will get Verified-1 from Jenkins on Gerrit.

Here is what each job is expected to verify

  • {installer}-verify-basic-{branch}: This job verifies the most basic things about the patch, such as whether the scenario is specified in the commit message. If an installer has unit tests, they can be run here as well.
  • {installer}-verify-build-{branch}: This job verifies the build if the installer is artifact based. The resulting artifact should be either stored locally or uploaded to the Artifact Repo with the Gerrit change number so that the following deploy job can fetch it.
  • {installer}-verify-deploy-{branch}: This job verifies that the given scenario can be deployed virtually, using the artifact built by the previous job.
  • {installer}-verify-smoketest-{branch}: This job runs the functest smoke test suite to verify basic OpenStack functionality and determine whether the scenario is good enough to be carried to baremetal.
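
The blocking behaviour of this chain can be sketched as below. It is a simplified model rather than the actual Jenkins configuration: run_job is a hypothetical callback standing in for running one Jenkins job, and the return value models the Verified vote that Jenkins would leave on Gerrit.

```python
from typing import Callable, List, Tuple

# Job name templates from the list above, in execution order.
VERIFY_JOBS = [
    "{installer}-verify-basic-{branch}",
    "{installer}-verify-build-{branch}",
    "{installer}-verify-deploy-{branch}",
    "{installer}-verify-smoketest-{branch}",
]


def run_verify_chain(installer: str, branch: str,
                     run_job: Callable[[str], bool]) -> Tuple[int, List[str]]:
    """Run the verify jobs in order and stop at the first failure.

    run_job is a hypothetical callback that runs one job and returns
    True on success. Returns the Verified vote (+1 or -1) and the names
    of the jobs that were actually run.
    """
    executed = []
    for template in VERIFY_JOBS:
        name = template.format(installer=installer, branch=branch)
        executed.append(name)
        if not run_job(name):
            return -1, executed
    return 1, executed
```

With a callback that fails the deploy step, the smoke test job never runs and the vote is -1, which is the point of ordering the cheap checks first.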

Pre Merge (Gate)

This is still TBD since we have no proper way of running pre-merge verification jobs (no Zuul-like setup).

Post Merge (Post)

Post merge jobs will be used for building and promoting scenarios. There are still some things to clarify here (such as whether the artifact built from the merged change still works for the given scenario), but the basic idea is to build and upload the artifacts and promote the scenarios for daily loops. These jobs make daily-build and other artifact build jobs obsolete, since a new artifact is built whenever a change gets merged.

Daily Loops

Daily loops consume artifacts for the candidate scenarios.

It is proposed to take build jobs out of the main daily deploy-yardstick-functest loops so that build artifacts can be created post-merge.
This will allow any daily loop to start without waiting for new artifacts to be built.

Jobs for daily loops are structured as below.

 

This structure makes it possible to

  • increase the visibility of the scenario status by using 1 parent job per scenario, which links to downstream jobs making it possible to track single scenario deployment and testing
  • have common deploy, functest, and yardstick jobs per installer/branch

 

Possible values for the keys are

  • installer: apex, compass, fuel, joid
  • scenario: list of scenarios supported by a given installer
  • branch: master, colorado
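
To make the one-parent-per-scenario structure concrete, here is a small sketch that expands the job name templates used on this page for one installer and branch. The function name and the scenario list passed in are illustrative assumptions; the real jobs would be defined in the Jenkins job configuration.

```python
# Name templates taken from this page.
PARENT_JOB = "{installer}-{scenario}-baremetal-daily-{branch}"
COMMON_JOBS = [
    "{installer}-deploy-baremetal-daily-{branch}",
    "functest-{installer}-baremetal-daily-{branch}",
    "yardstick-{installer}-baremetal-daily-{branch}",
]


def daily_jobs(installer: str, branch: str, scenarios: list) -> dict:
    """Return the parent job names (one per scenario) and the shared downstream jobs."""
    return {
        "parents": [
            PARENT_JOB.format(installer=installer, scenario=s, branch=branch)
            for s in scenarios
        ],
        "common": [
            t.format(installer=installer, branch=branch) for t in COMMON_JOBS
        ],
    }
```

For an installer with N scenarios this yields N parent jobs but only three shared downstream jobs per branch, which is where the visibility and reuse benefits listed above come from.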

Weekly Loops

The job structure for the weekly loops is the same as for the daily loops, except for the test suites run against the deployed system and the artifacts used.
Weekly loops should use the latest and greatest artifact for the scenario being tested, so we will not rebuild an artifact for weekly loops but instead (re)use an artifact that has been proven to work for the given scenario.

Possible values for the keys are

  • installer: apex, compass, fuel, joid
  • scenario: list of scenarios supported by a given installer
  • branch: master, colorado

Way Forward

In order not to break everything at once, it is proposed to follow an incremental approach, split into several phases, and introduce the change gradually.

Timeline

Phases 1 and 2 will be delivered as part of Danube Release.

Phases

Phase 1 - Align Daily Job Structures

This phase deals with aligning the daily build and baremetal job structures between installers and does not introduce or use anything related to promotion. The diagram below shows the job structure.

Steps

  1. Create/configure {installer}-build-daily-{branch} jobs independently from the rest in order to run them independently, continuously producing artifacts and uploading them to Artifact Repository.
  2. Create/configure {installer}-{scenario}-baremetal-daily-{branch} jobs separately to increase the visibility and ease the troubleshooting.
  3. Create/configure common {installer}-deploy-baremetal-daily-{branch}, yardstick-{installer}-baremetal-daily-{branch}, and functest-{installer}-baremetal-daily-{branch} jobs separately to increase the visibility and ease the troubleshooting.
  4. Make sure {installer}-deploy-baremetal-daily-{branch} jobs consume artifacts from Artifact Repository or lab cache in order to enable the use of pooled baremetal resources and prevent tying jobs to certain PODs.

Phase 2 - Align Verify Job Structures and Enable Commit Gating

This phase deals with aligning verify job structures between installers and enabling commit gating on virtual environment.

Steps

  1. Create/configure {installer}-verify-{branch} jobs using Multijob so aborting and blocking builds work fine when a new patch for existing change or a new change ends up on Gerrit.
  2. Create/configure {installer}-verify-basic-{branch}, {installer}-verify-build-{branch}, {installer}-verify-deploy-virtual-{branch}, and {installer}-verify-smoke-test-{branch} jobs separately to increase the visibility and ease the troubleshooting.

Phase 3 - Enable Opt-in Scenario Promotion and Commit Gating

This phase deals with opt-in scenario promotion from post-merge jobs, as shown in the diagram below.

Steps

  1. Agree on a commit message format/structure so that scenarios can be stated in an aligned way.
  2. Based on the agreed commit message format, make it possible to opt in to additional verification activities that are normally not done by the installer in question.
  3. Create/configure {installer}-merge-{branch} jobs.
  4. Create/configure {installer}-merge-promote-{branch} jobs to promote scenario.
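
Since the commit message format is not agreed yet, the following is purely a hypothetical illustration of what stating scenarios could look like and how a verify or merge job might read them. The "Scenario:" keyword, the comma-separated list, and the function name are all assumptions for the sake of the example.

```python
import re

# Hypothetical keyword line, e.g. "Scenario: os-nosdn-nofeature-ha".
# The actual format is still to be agreed (step 1 above).
SCENARIO_LINE = re.compile(r"^scenario:[ \t]*(.+)$", re.IGNORECASE | re.MULTILINE)


def scenarios_from_commit_message(message: str) -> list:
    """Return the scenarios named in a commit message, in order, without duplicates."""
    found = []
    for match in SCENARIO_LINE.finditer(message):
        for name in match.group(1).split(","):
            name = name.strip()
            if name and name not in found:
                found.append(name)
    return found
```

A basic verify job could fail the patch when this returns an empty list, and a post-merge job could use the same list to decide which scenarios to promote.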

Phase 4 - Enable Commit Gating and Use of Promotion & Confidence Levels in Full

Once the earlier phases are deemed to be successful for all the installers and we have the alignment, we can start introducing the use of promoted scenarios and applying proper confidence levels.

Steps

  1. Enable commit gating fully.
  2. Create/configure {installer}-merge-build-{branch} jobs to upload artifacts.
  3. Disable/remove daily build jobs and start consuming artifacts produced by merge jobs.
  4. Create {installer}-daily-{branch} jobs, activate polling of scenarios and disable {installer}-{scenario}-baremetal-daily-{branch} triggers so only the promoted scenarios end up on baremetal.
  5. Create {installer}-weekly-{branch} jobs, activate polling of scenarios and disable {installer}-{scenario}-baremetal-weekly-{branch} triggers so only the promoted scenarios end up on baremetal.

Status and Progress

 

Installer | Phase 1: Daily Job Alignment | Phase 2: Verify Job Alignment and Limited Commit Gating | Phase 3: Opt-in Promotion and Commit Gating | Phase 4: Promotion, CL, and Commit Gating
Apex | In progress | Not started - Job structure alignment needed, smoke tests are already active | Not started | Not started
Compass | Done | In progress - Job structure alignment needed, smoke tests are already active | Not started | Not started
Fuel | Done | Done - Job structures aligned, Installer needs to enable virtual deploy and smoke test | Not started | Not started
Joid | Done | Done - Job structures aligned, Installer needs to enable virtual deploy and smoke test | Not started | Not started
Daisy | Not started | In progress - Job structures aligned, Installer needs to enable virtual deploy and smoke test | Not started | Not started

Prerequisites

  • Common for all phases
    • contributions from everyone involved/impacted are a must, or this will never happen
  • Phase 1:
    • separation of builds from the rest of the daily/weekly chains - prevent builds from blocking the rest of the chain. Make a recent version of the artifact always available to be consumed.
    • right use of resources - prevent the whole verify/daily/weekly chain from depending on the machine where the builds are done.
  • Phase 2:
    • check the applicability of smoke test for all installers
    • identify the pattern to specify scenarios in commit messages
    • check if promoting from verify jobs results in what we expect - should promotion be done as part of pre-merge? If so, what should be verified for patches?
  • Phase 3:
    • identify the pattern to make opt-in promotion possible
    • prototype the promotion

12 Comments

  1. Thank you, the implementation steps are very clear.

    I agree with your proposal.

    So there will be only one metadata file per scenario in the daily and weekly directories?

    If there is no metadata for the scenario in the daily directory, the weekly job won't run, right?

     

    I have another idea: maybe we should store the test results or a link to the logs together with the artifacts? That would show why we set the confidence level of the scenario.

    And a question: do we need extra pods to run weekly jobs? Our pods are fully scheduled.

  2. mei mei

    Yes, if a scenario hasn't achieved a lower confidence level, the next loop in line, which gives the higher confidence level, will not be run for that scenario since it failed an earlier loop. So if a scenario doesn't have the daily confidence level, no weekly loop will run for it.

    About storing the test results and so on: that is the scenario metadata, similar to what we already store for our artifacts. If you look at artifacts.opnfv.org/compass4nfv/latest.properties, you will see the BUILD_URL there together with other information. Similar information needs to be stored for scenarios as well.

    Here is a very basic structure of things on the artifact repo - this is just a quick example.

    artifacts.opnfv.org
    └── scenarios
        ├── apex
        │   ├── daily
        │   ├── latest
        │   └── weekly
        ├── joid
        │   ├── daily
        │   ├── latest
        │   └── weekly
        ├── fuel
        │   ├── daily
        │   ├── latest
        │   └── weekly
        └── compass4nfv
            ├── latest
            │   └── os-ocl-nofeature-ha
            ├── daily
            │   ├── os-odl_l3-nofeature-ha
            │   └── os-odl_l2-nofeature-ha
            └── weekly
                └── os-nosdn-nofeature-ha

    And each of these files should contain something similar to below - again, this is just an example.

    OPNFV_ARTIFACT_VERSION=2016-07-05_01-34-03
    OPNFV_GIT_URL=https://gerrit.opnfv.org/gerrit/compass4nfv
    OPNFV_GIT_SHA1=780256c594ade62a006b5bf740f868714a60ac8e
    OPNFV_ARTIFACT_URL=artifacts.opnfv.org/compass4nfv/opnfv-2016-07-05_01-34-03.iso
    OPNFV_ARTIFACT_MD5SUM=82629c6e6741e4d7b34fff5c38026fb6
    OPNFV_BUILD_URL=https://build.opnfv.org/ci/job/compass-merge-master/230/
    OPNFV_DAILY_URL=https://build.opnfv.org/ci/job/compass-os-onos-nofeature-ha-baremetal-daily-master/42/
    OPNFV_WEEKLY_URL=https://build.opnfv.org/ci/job/compass-os-onos-nofeature-ha-baremetal-weekly-master/9/

    About the resource situation, that's something we need to think about. We might try running weekly loops on existing PODs, but then we need to stop daily jobs during that time.

    Long story short, all these things require us to evaluate continuously and move on incrementally. That's why the use of promoted scenarios is left to the second phase: I think establishing all the basics will take some time and we might hit the Colorado branch-off date, and I'm personally not keen on making big changes once we branch off. So utilizing promotion for real (phase 2) might be available for the D release.

  3. Thanks Fatih Degirmenci for pushing forward the discussion at the Summit.

    I think the target of CI Evolution is to save resources, including hardware (Pharos lab) and people.

     

    some comments:

    1. It is hard for a developer to know which scenario(s) will be affected, so they will tend to list as many scenarios as possible in the commit message, which will then consume many more resources than now.
    2. A verification job is meaningful only when it can actually finish in several minutes, not hours. Maybe it can be similar to what OpenStack gate verification does.
    3. "This job makes daily-build or other artifact build jobs obsolete since we will always build a new artifact whenever a change gets merged." Is it necessary to produce an artifact from the merge job? It will build tens of times a day, and only the newest artifact will be consumed by the daily job. Each build takes 15-20 minutes for the Fuel installer. We would save build time and cost if we kept using the daily build job.
    4. Is there a separately built ISO for each scenario?
    5. It is important to introduce the verification job. Each project in OPNFV should support UT as much as possible. My concern is that tox and pep8 are not widely used yet.

     

     

  4. You touch on important points Julien and I agree with most of them but we have reality as well. Please see what you read on this page as steps towards more proper CI.

    And here are some additional notes for further clarification.

    Saving resources is secondary. The real purpose with this proposal is to reduce the feedback time. What we experienced during the past 2 releases is that we run everything on bare metal always. This increased the number of builds waiting in the queue, essentially contributing to long feedback times. Another important purpose is to increase the quality of master branches by running things close to reality as much as possible. Improvement in how we use our resources is a side benefit. But things could become worse if we don't have enough resources to run commit gating so we will need to evaluate what we are doing and fine tune things over time.

    And other points.

    1. Which scenarios are affected: You are right, it is not always possible to know this so the bigger responsibility here falls on project committers. Apart from this, changes might have side effects, impacting more scenarios than what we can run. We will start with running verification for one scenario and again fine tune while we move on.
    2. The time it takes to verify patches: I can't agree more. To me the feedback should be provided by CI in minutes, hours, and days. But we have the reality as I mentioned. For example Apex and Compass already run virtual deployments, Fuel runs builds and Joid runs nothing. We need to make sure we do similar things for our projects and based on what majority of installers do today, the others need to follow. I am not sure how much you follow Gerrit but what I noticed is that it normally takes hours (if not days) for code reviews to be done and submitted so we hopefully are not making things terribly worse. Another thing is that we have other ideas like using snapshots and so on but we basically do not have enough people to work on these right now so we choose the easier path by doing deployments from scratch. Over time I expect to improve the situation to reduce the feedback time further.
    3. Building artifacts from Merge Jobs: Today we either build stuff on timer couple of times a day or once a day. This is not really so beneficial since you either have to start a build manually or wait for next day to get new version of the artifact if you want your patch to be included in artifact and deployed. Again this increases the feedback time.
      Daily runs will consume promoted versions of scenarios together with the latest ISO - as long as the same scenario is not impacted by a later change that is under verification. Apart from this, having multiple builds a day will reduce the delta between different versions of the artifacts, and therefore between the ISOs that get deployed, making it easier to identify faulty commits. (5 commits vs 20 commits for example.) Also, post merge jobs will not add extra time since they will be triggered after merge.
    4. ISO: No separate ISO per scenario - latest ISO will be used if and only if the scenario is not impacted by a later patch that is under verification. If it is impacted, then I expect we use the ISO that gets uploaded to artifact repo when the scenario becomes candidate for next loop. There are things we will identify here while we move on and I have questions here as well.
    5. UT: I agree that we should have UT for all our projects, but this is not always possible for various reasons - lack of people, lack of understanding and so on. Virtual deployments are the lowest common denominator across the different installers - they all have virtual deployments.
    1. Agree with you. Fatih Degirmenci

      As you listed in the response, we can add the targets here:

      1. reduce the feedback time.
      2. increase the quality of master branches
      3. save resources involved
      4. ...

      It will be shared to the community besides INFRA WG.

      About the Building artifacts from Merge Jobs, we can get more opinions from WG team.

  5. I want to explore the Unit Test methods, and for there to be a team working to promote these practices and document them on the wiki. I'm willing to help.

    I assume there would be ability to mark some commits as "experimental" or "do not promote" i.e. do not push this commit to the next stage even if it passes the tests at the current stage, and then the ability to amend it to enable promotion later. Does that make sense (bit of a newb in this area)?

  6. Bryan Sullivan

    About unit tests; some projects do this right at the very beginning of their project (yardstick, moon), some others recently started employing (apex) and some do not have anything.
    So it would be very helpful if you can help out with this. I'm sure there are others who want to be involved such as Julien.

    About promotion: there are many details that will surface when we start trying promotion, but for Colorado, promotion will be experimental - meaning that we will just promote scenarios but will continue running things for all the available scenarios for the given installer instead of testing promoted scenarios only. About what you ask: promotion will be applied once a change gets merged to master (post-merge), so it is not possible to amend a merged change or block the promotion. What we can do is this: when the patch ends up on Gerrit and before merge, you can specify this in a certain way (with a keyword in the commit message, for example) so that the promotion that takes place during post-merge can act accordingly and not take this type of commit into account for promotion purposes.

    1. We are using tox in OPNFV projects for UT, such as Parser, Qtip etc.

  7. Fatih Degirmenci, regarding the commit message in the Colorado release, what will be supported or introduced?

    In OpenStack, there are some action keywords in the commit message:

    Closes-Bug: #######
    Partial-Bug: #######
    Related-Bug: #######

    In OPNFV now, we support JIRA: #### in the commit message to link JIRA tasks.

    In the last discussion, it was said that some keywords will be used for triggering scenario CI tests. We should define all the keywords we will use. Several tools in the infrastructure will be affected, such as the Gerrit server (and its its-jira plugin), Zuul or Jenkins, and maybe the commit-msg hook script.

  8. Fatih Degirmenci, this page does not seem to mention how to move the artifact from the build phase to the deployment phase during verification if build and deployment do not happen on the same machine. Can we use the artifact repo for this as well?

  9. One additional thing to consider as part of the promotion/demotion is the definition of "success". 
    Right now, success of a scenario validation isn't a digital decision and as we all know, there is no one size fits all test for scenarios, but the tests are scenario dependent. 
    IMHO the overall new approach would need to be complemented with a way to allow the scenario owner to define "success", i.e. define the test cases that must pass and those that are only optional/nice to have. For that, we'd need the testing to be fully configurable. 

  10. Zhijiang Hu The artifacts will be built on build servers and uploaded to OPNFV Artifact Repository. Once a deployment starts on another machine, it will download the artifact from OPNFV Artifact Repository and continue with the rest. The process is same no matter what the purpose is; it could be patchset verification, daily, or weekly runs.

    Frank Brockners You summarized it well. My expectation is that the scenario owners together with the Test WG will define the success criteria and CI will use it to come up with a verdict. As you said, the criteria should state which testcases must pass, what thresholds are there and so on. Apart from having scenario based list of test cases/thresholds that must pass, we need phase based test cases/thresholds defined as well. (such as verify doesn't do benchmarking and only smoke test cases must pass. But for weekly, the success criteria is much tougher, taking more things into account.) The promotion part will not be implemented by Danube so there is still time for scenario owners and Test WG to come to a conclusion.