BUILDING SIMULATION SOFTWARE TESTING

Overview

Many types of software tests may be used for building energy simulation software, each with a different objective or scope. Tests may be performed on the entire program or on individual subroutines or algorithms. The goal of software testing is to cost-effectively identify and communicate as many potential problems with the software as possible and to iterate with the development team until the identified bugs are eliminated. This goal is consistent with the development team’s goal of providing high-quality software that is free of errors. Note that creating bug-free software is not an attainable goal, since there are too many possible inputs and too many possible paths through the program. From the development team’s perspective, a successful test is one that reveals a problem; all other tests are unnecessary. Unfortunately, it is impossible to determine beforehand which tests will reveal problems, and that is why testing must be so exhaustive. From the users’ perspective, a successful test is one that shows that the software results match some type of standard with an adequate level of accuracy.

Typical Test Types

For building energy simulation software, the types of tests available include:

  • Analytical Tests - Analytical tests compare results to mathematical solutions for simple cases.

  • Comparative Tests - Comparative tests compare a program to itself or to other simulation programs.

  • Sensitivity Tests - Sensitivity tests compare results to a baseline case and exhaustively test the functioning of every modeling input, including weather data for a full range of climate zones. 

  • Full Code Tests - Full code tests are designed to exercise all lines of code by exercising combinations of inputs and tracking which lines of code have been executed. 

  • Range Tests - Range tests check the operation of the code over the complete range of valid inputs. The tests will also go beyond all valid ranges to ensure that adequate error messages are generated. 

  • Empirical Tests - Empirical tests compare results to experimental data. In many respects, these have proven to be the most difficult type of tests to do. It is important that high-quality data be used as the basis for comparison along with complete and accurate information for developing a simulation model that represents the test building or module as closely as possible.

General Concepts

A good rule of thumb is that testing should take approximately half the effort that was expended in developing and implementing the algorithms, including testing performed by the software development team. Common estimates of the cost of finding and fixing errors in programs range from 40% to 80% of the total development costs. In addition, it has been estimated that programs released to testing have one to three bugs per 100 lines of code, and this is after the programmers have fixed 99% of their mistakes.

Many likely and less-than-likely input variations should be tried during testing, with the emphasis on tests that are likely to discover bugs. It is not cost-effective, and often not possible, to test software so completely that it is bug-free. Instead, the strategy is to find the bugs that are the most likely to cause problems and those that cause the largest problems.

Generally, the most important bug to catch is one that delivers some or many incorrect numbers but does not cause the program to crash or issue warning messages. This includes output values reported in the wrong format or in an incorrect location, and outputs that are simply wrong. These bugs should be fixed first, since the user has a reasonable expectation that any number delivered is “correct.” The next most important type of bug to catch is one that causes the program to crash, to not deliver the expected reports, or to issue false error messages. In this case, the program at least indicates to the user that there is a problem, and the user will not depend on the results.

The testing effort should not be limited to a specific plan. In many cases, most of the testing that proves useful (that is, discovers bugs) is done on an ad hoc basis, following the flow of the program and the intuition of the tester. Adequate time for testing should be provided, since rushing this step serves only to increase the number of bugs likely to remain at the final release.

Glass Box Testing

Testing performed by a programmer with access to and understanding of the source code is called glass box testing or structural testing. Glass box testing is usually performed during source code development and prior to black box testing, which is also known as functional testing. The original programmers typically perform some glass box testing. Unlike glass box testing, black box testing tests the program from the user’s perspective, with no access to the source code. Glass box testing may include the following:

  • Code review – A careful line-by-line review of the source code by another programmer familiar with the program’s overall objectives. In development teams, this is best performed by exchanging source code with another programmer.

  • Complete coverage – A coverage monitor ensures that all lines of the source code are executed. Different types of coverage monitoring programs exist which look at executing all lines, all branches, or all logic conditions for each branch. 

  • Drivers – Adding code that calls the module being developed with artificial data and checks for a known response. This also allows for incremental or evolutionary programming, in which the entire program can be run even when intermediate modules have not yet been developed.

  • Module deletion – To narrow down how crashes occur, modules are removed and stubs (non-performing modules with the same name) are substituted. 

  • Assertion checks – Conditional statements of a premise at a certain point in the program. The result of an assertion that is incorrect should be a special error condition. Assertion checks may be conditionally compiled so that they do not exist in the final program.
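As a minimal sketch of an assertion check, assume a hypothetical routine `zone_heat_gain_w` (the name, parameters, and formula are illustrative only and are not taken from any particular simulation program). In Python, the `assert` statement states a premise and raises the special error condition `AssertionError` when the premise is false. Running the interpreter with the `-O` option removes `assert` statements, which plays the role of conditional compilation here.

```python
def zone_heat_gain_w(u_value_w_m2k, area_m2, t_out_c, t_in_c):
    """Steady-state conduction gain through one surface, in watts.

    Hypothetical routine used only to illustrate assertion checks.
    """
    # Premises the rest of the routine relies on. Each assert raises
    # AssertionError when false; `python -O` strips them from the program.
    assert u_value_w_m2k > 0.0, "U-value must be positive"
    assert area_m2 >= 0.0, "surface area cannot be negative"
    assert t_out_c > -273.15 and t_in_c > -273.15, "temperature below absolute zero"

    gain = u_value_w_m2k * area_m2 * (t_out_c - t_in_c)

    # Post-condition: heat flows from hot to cold, so the gain and the
    # temperature difference must never have opposite signs.
    assert gain * (t_out_c - t_in_c) >= 0.0, "heat gain has the wrong sign"
    return gain


# A tiny driver: call the module with artificial data and a known response.
print(zone_heat_gain_w(0.5, 10.0, 30.0, 20.0))
```

The last line doubles as a very small driver in the sense described above: the inputs are artificial, and the expected response can be worked out by hand.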

In addition, the programmer should identify the following during glass box testing if not already defined in the specification: 

  • Input ranges – The maximum and minimum allowable values for every data entry field. 

  • Internal boundaries – Threshold values of inputs that cause different program segments to be used.

  • Error handler triggers – Values of input which should cause error handling routines to be executed.

Black Box Testing

Black box testing is also known as functional or behavioral testing. It involves running a pre-release or release version of the program with various inputs and looking for incorrect outputs or program crashes. It is one of the best ways to diagnose bugs before users discover them. Given the complexity of many software packages, much thought must be devoted to optimizing the testing process. Ideally, only one test will be performed for each possible set of software conditions. Among all possible sets of software conditions, there are groups of conditions that, if all tested, would exercise the same code and reveal the same bugs (or, hopefully, the same lack of bugs). Each such group of conditions is called an equivalence class, or equivalence class partition.
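The idea of testing one representative per equivalence class can be sketched as follows. The input field (a window-to-wall ratio), its limits, and the class names are assumptions made up for this illustration; a real partition would come from the program's specification.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EquivalenceClass:
    name: str
    low: float          # inclusive lower edge of the class
    high: float         # inclusive upper edge of the class
    expect_error: bool  # True if the program should reject these inputs


# Hypothetical partition of a "window-to-wall ratio" input field.
CLASSES = [
    EquivalenceClass("below valid range", -1.0, -0.01, True),
    EquivalenceClass("no glazing branch", 0.0, 0.0, False),
    EquivalenceClass("normal glazing", 0.01, 0.95, False),
    EquivalenceClass("above valid range", 0.96, 2.0, True),
]


def representative_inputs(classes):
    """Return one mid-point test value per class instead of every value."""
    return [(c.name, (c.low + c.high) / 2.0, c.expect_error) for c in classes]


def classify(value, classes):
    """Return the class that contains value, or None if no class does."""
    for c in classes:
        if c.low <= value <= c.high:
            return c
    return None


for name, value, expect_error in representative_inputs(CLASSES):
    print(f"{name}: test input {value}, error expected: {expect_error}")
```

Four tests stand in for the unlimited number of possible inputs. In practice the class edges are usually tested as well, since boundaries are where partitions are most often drawn incorrectly.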

Software Testing Standards and Guiding Documents

The Institute of Electrical and Electronics Engineers has a number of publications on quality assurance procedures and standards for developing software. The most applicable standard, “Standard 829-1998 IEEE Standard for Software Test Documentation,” should be reviewed when developing a testing plan.

Acceptance Testing

Before any other tests are run, a range of simple buildings should be simulated automatically with the building simulation program undergoing testing. These simulations should be considered an acceptance test, intended to weed out unstable versions of the software that would not be fruitful to use for further debugging. Acceptance tests are often automated and may be provided to the programmers as a way to reduce the number of versions submitted for testing.

Regression Testing

Tests are performed multiple times, including once after each major source code change. The changes are usually made to correct inconsistencies found during testing or to implement new features, and they are likely to contain problems. The first tests run with each new version are regression tests, which compare the results before and after the changes. The results of the regression tests are all concatenated into a text file and compared to a text file prepared using the identical method on the previous version. The comparison is performed using a standard text file comparison utility, which reports any new differences. The inconsistencies that have been fixed by the development team should be easy to identify, which confirms that the fixes were made correctly. The series of regression tests should consist of automated, quick-to-perform tests and all tests that have previously found errors. The likelihood of an old error creeping back into the code when new ones are fixed is high enough to justify keeping tests in the regression suite for errors that have long since been fixed.
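The text-file comparison described above can be sketched with Python's standard `difflib` module. The result names and values below are placeholders invented for the illustration, not output from any real program.

```python
import difflib


def format_results(results):
    """Concatenate named test results into report lines, one per result."""
    return [f"{name}: {value:.3f}\n" for name, value in sorted(results.items())]


def regression_diff(previous, current):
    """Return unified-diff lines between two result sets (empty if identical)."""
    return list(difflib.unified_diff(
        format_results(previous),
        format_results(current),
        fromfile="previous_version",
        tofile="current_version",
    ))


# Placeholder results from two hypothetical program versions.
previous = {"case_a_heating_kwh": 5000.0, "case_a_cooling_kwh": 6500.0}
current = {"case_a_heating_kwh": 5000.0, "case_a_cooling_kwh": 6420.0}

for line in regression_diff(previous, current):
    print(line, end="")
```

An empty diff means nothing changed between versions. Any line that does appear is either an intended fix to confirm or a new problem to report.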

Release Tests

Special care must be taken just prior to a public release of the program to ensure that all bugs that were intended to be fixed were actually fixed. A release test is the most comprehensive automated test and, in large part, consists of a mixture of previously failed-and-fixed tests and tests that have always passed. It is critical that known problems are specifically identified prior to any form of public release. The public release should include a “readme” file that describes all known problems at the time of release, and it is usually the tester who is responsible for compiling this list of problems and any work-arounds that exist. Release tests should include virus checking of the final installation package. Too many cases of viruses being distributed with software have been reported to skip this additional precaution. One type of release test that also needs to be performed prior to a final public release is a comparison of the features that actually work reliably against the prepared literature. It is crucial that the literature reflect all design decisions made during development and testing.

Beta Tests

Other than the developers and testers, most software is not used by anyone until beta tests commence. A common practice is to recruit a group of beta testers who are knowledgeable users of similar products. The beta test group should be sent the program in executable form for their target environment, along with documentation. They should be warned extensively that the product still has bugs and should not be used for production purposes. It is unrealistic to expect a beta tester to contribute more than half a day of testing per week. Beta testers should be given a clear path for reporting bugs, general problems, and possible enhancements, including both e-mail and telephone. Often the lead tester is responsible for managing the feedback from the beta testers and works with them to verify reported bugs by trying to reproduce them. It is not uncommon for beta test support to require full-time effort, especially if the beta tester list is above 20 people. Beta testers should receive new versions of the program no more often than once every other week; otherwise, most of their time is spent installing and uninstalling the program. At times, a critical bug is found and beta testing needs to be halted, so all beta testers should be reachable by e-mail. Beta testers have a few different motivations for volunteering, and each needs to be catered to:

  • Desire to become an expert quickly,

  • Anticipation of a free copy of the final program,

  • Professional curiosity,

  • Complementary product, or

  • Competitive product.

At times, it even makes sense to pay beta testers to provide expert advice, if there are a limited number of experts in a field. Giving beta testers a copy of the final version of the program is very common, so if it is not part of the plan, the beta testers should be told explicitly.

Test Suspension and Resumption Criteria

Some bugs are so fundamental that the credibility of the results of other tests is affected, and testing needs to be suspended. These include tests that appear to reveal general bugs in the input and output processing, tests that show non-repeatable results using identical inputs, tests that produce order-of-magnitude errors in fundamental algorithms, and tests that produce order-of-magnitude errors in basic elements. These tests should result in an urgent bug report and a follow-up call to the responsible developer and the overall project manager. Both should be informed that no further testing can be fruitful until that bug is fixed. The overall project manager is informed so that they know the critical path may be delayed by the bug and can assign additional programmers if necessary. Upon receipt of an updated version, the tester should perform the normal regression testing that is performed on every new version and repeat the tests that originally revealed the critical bug. Once these tests pass, both the programmer and the overall project manager should again be informed that testing has resumed.

Full Code Tests

Full code tests are designed to exercise all lines of code by exercising combinations of inputs and tracking which lines of code have been executed. This is a glass box testing technique and must be performed by the programmer using software designed to aid in the testing process. Many of the range tests may also be appropriate for performing these tests. Full code tests are no panacea since they cannot be as comprehensive as full logic flow tests. Full logic flow tests test every possible set of preceding conditions that have been executed prior to a particular portion of the code and, by definition, require an almost infinite amount of effort. 
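A coverage monitor can be sketched in a few lines using Python's standard `sys.settrace` hook. This is only an illustration of the idea of tracking executed lines, not a substitute for a real coverage tool, and `shading_factor` is a made-up routine with two branches.

```python
import sys


def executed_lines(func, *args):
    """Run func(*args) and return the set of its source line numbers executed."""
    code = func.__code__
    seen = set()

    def tracer(frame, event, arg):
        if frame.f_code is not code:
            return None              # do not trace other functions
        if event == "line":
            seen.add(frame.f_lineno)
        return tracer                # keep tracing lines in this frame

    previous = sys.gettrace()
    sys.settrace(tracer)
    try:
        func(*args)
    finally:
        sys.settrace(previous)       # always restore the earlier hook
    return seen


def shading_factor(overhang_depth_m):
    # Hypothetical routine with two branches.
    if overhang_depth_m <= 0.0:
        return 1.0
    return 1.0 / (1.0 + overhang_depth_m)


no_overhang = executed_lines(shading_factor, 0.0)
with_overhang = executed_lines(shading_factor, 0.5)
print("lines reached only with an overhang:", sorted(with_overhang - no_overhang))
```

Neither input alone reaches every line of `shading_factor`; the union of the two runs does. Even so, as noted above, executing every line says nothing about every combination of conditions that can precede it.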

Documentation Tests

Comparing how the program operates with the documentation describing the program is often left to the tester. This is a crucial step. Even though many people don’t read the documentation, one can expect that any inconsistencies will be costly and embarrassing to fix.

Comparative Tests

Comparative tests compare a program to itself or to other simulation programs. This type of testing serves two purposes: validation and debugging. From a validation perspective, comparative tests show whether the software is computing solutions that are reasonable compared to similar programs. This is a very powerful method of assessment, but it is no substitute for determining whether the program is absolutely correct, since it may be just as incorrect as the benchmark program or programs. The biggest strength of comparative testing is the ability to compare any cases that two or more programs can both model. This is much more flexible than analytical tests, where only specific solutions exist for simple models, and much more flexible than empirical tests, where only specific data sets have been collected, usually for a very narrow band of operation. Comparative testing is also useful for field-by-field input debugging. Complex programs have so many inputs and outputs that the results are often difficult to interpret. To ascertain whether a given test passes or fails, engineering judgment or hand calculations are often needed. Field-by-field comparative testing eliminates the need for such calculations for the subset of fields that are equivalent in two or more simulation programs. The equivalent fields are exercised using equivalent inputs, and the relevant outputs are directly compared.

The most common comparative tests for building energy simulation programs are BESTEST and ASHRAE's Standard 140.
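A direct comparison of equivalent outputs from two programs might be sketched as below. The 5% tolerance is an assumption chosen only for illustration, as is the idea that each program's outputs have already been collected into a dictionary keyed by field name.

```python
def compare_outputs(program_a, program_b, rel_tol=0.05):
    """Compare the outputs that both programs report, field by field.

    Returns (field, value_a, value_b, relative_difference) for every shared
    field whose relative difference exceeds rel_tol. Fields reported by only
    one program are skipped, since they have no equivalent to compare with.
    """
    flagged = []
    for field in sorted(program_a.keys() & program_b.keys()):
        a, b = program_a[field], program_b[field]
        scale = max(abs(a), abs(b))
        rel_diff = 0.0 if scale == 0.0 else abs(a - b) / scale
        if rel_diff > rel_tol:
            flagged.append((field, a, b, rel_diff))
    return flagged


# Placeholder outputs from two hypothetical programs.
outputs_a = {"heating": 100.0, "cooling": 200.0, "fan_energy": 12.0}
outputs_b = {"heating": 102.0, "cooling": 260.0}

for field, a, b, rel_diff in compare_outputs(outputs_a, outputs_b):
    print(f"{field}: {a} vs {b} ({rel_diff:.1%} apart)")
```

A flagged field shows only that the two programs disagree. Deciding which one is wrong, if either, still takes engineering judgment.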

Analytical Tests

Analytical tests compare results to mathematical solutions for simple cases.
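As a sketch of an analytical test, consider a lumped thermal mass cooling toward a constant ambient temperature, for which the governing equation dT/dt = -(T - T_amb)/tau has a closed-form exponential solution. The explicit Euler loop below is only a stand-in for the program under test, and the parameter values and the 0.01 °C tolerance are assumptions chosen for the illustration.

```python
import math


def analytical_temperature(t_s, t0_c, t_amb_c, tau_s):
    """Exact solution of dT/dt = -(T - T_amb)/tau for a lumped thermal mass."""
    return t_amb_c + (t0_c - t_amb_c) * math.exp(-t_s / tau_s)


def simulated_temperature(t_s, t0_c, t_amb_c, tau_s, dt_s=1.0):
    """Stand-in for the program under test: explicit Euler time stepping."""
    temp = t0_c
    for _ in range(round(t_s / dt_s)):
        temp += dt_s * (-(temp - t_amb_c) / tau_s)
    return temp


def analytical_test(t_s=3600.0, t0_c=30.0, t_amb_c=10.0, tau_s=7200.0,
                    dt_s=1.0, tol_c=0.01):
    """Return (passed, exact, approximate) for one simple case."""
    exact = analytical_temperature(t_s, t0_c, t_amb_c, tau_s)
    approx = simulated_temperature(t_s, t0_c, t_amb_c, tau_s, dt_s)
    return abs(approx - exact) <= tol_c, exact, approx


passed, exact, approx = analytical_test()
print(f"passed={passed} exact={exact:.4f} simulated={approx:.4f}")
```

With a one-second time step the stand-in agrees with the exact solution to well within the tolerance. Rerunning with `dt_s=1800.0` makes the test fail, which is how a test of this kind exposes a numerical weakness.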

Empirical Tests

Empirical tests compare results to experimental data. In many respects, these have proven to be the most difficult type of tests to do. It is important that high-quality data be used as the basis for comparison along with complete and accurate information for developing a simulation model that represents the test building or module as closely as possible.

Range Tests

Range tests check the operation of the code over the complete range of valid inputs. The tests will also go beyond all valid ranges to ensure that adequate error messages are generated.
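A range test for a single input field could be sketched like this. The validator `check_ceiling_height` and its limits of 2 m and 30 m are hypothetical; the point is that the test probes the limits themselves, a value inside the range, and values just beyond each limit, and that it also checks for an error message.

```python
def check_ceiling_height(height_m, low=2.0, high=30.0):
    """Hypothetical input validator: returns the value or raises ValueError."""
    if not (low <= height_m <= high):
        raise ValueError(
            f"ceiling height {height_m} m is outside the valid range "
            f"{low} to {high} m"
        )
    return height_m


def range_test(validator, low, high, step_outside=0.01):
    """Probe both limits, the mid-point, and one value just beyond each limit.

    Returns a dict mapping each probed value to True (behaved as expected)
    or False (did not).
    """
    report = {}
    for value in (low, (low + high) / 2.0, high):
        report[value] = validator(value) == value       # must be accepted
    for value in (low - step_outside, high + step_outside):
        try:
            validator(value)
            report[value] = False                       # accepted a bad input
        except ValueError as err:
            report[value] = len(str(err)) > 0           # rejected with a message
    return report


for value, ok in range_test(check_ceiling_height, 2.0, 30.0).items():
    print(f"{value}: {'ok' if ok else 'FAILED'}")
```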

Sensitivity Tests

Sensitivity tests compare results to a baseline case and exhaustively test the functioning of every modeling input, including weather data for a full range of climate zones.
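The one-input-at-a-time comparison against a baseline can be sketched as follows. The function `annual_heating_kwh` is a toy stand-in for a full simulation run, and the input names, baseline values, and 10% perturbation are all assumptions for the illustration.

```python
def annual_heating_kwh(inputs):
    """Toy stand-in for a simulation run; not a real building model."""
    ua_w_per_k = inputs["u_value_w_m2k"] * inputs["envelope_area_m2"]
    degree_hours = inputs["heating_degree_days"] * 24.0
    return ua_w_per_k * degree_hours / 1000.0 / inputs["heating_efficiency"]


def sensitivity_sweep(model, baseline, factor=1.10):
    """Perturb one input at a time and return each relative output change."""
    base_result = model(baseline)
    changes = {}
    for name in baseline:
        case = dict(baseline)                 # copy so the baseline is untouched
        case[name] = baseline[name] * factor
        changes[name] = (model(case) - base_result) / base_result
    return changes


# Placeholder baseline case.
baseline = {
    "u_value_w_m2k": 0.5,
    "envelope_area_m2": 400.0,
    "heating_degree_days": 3000.0,
    "heating_efficiency": 0.9,
}

for name, change in sensitivity_sweep(annual_heating_kwh, baseline).items():
    print(f"{name}: {change:+.1%}")
```

An input whose perturbation produces no change at all, or a change in the wrong direction, is a strong hint that the input is not connected to the calculation correctly.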

Executable Tests

Executable tests interrupt and restart the program, and include tests that remove selected binary or input files, looking for graceful program stops with appropriate error messages. Executable tests also include:

  • Load tests – testing the program’s ability to handle large tasks. 

  • Error recovery – attempting to make the program generate as many error messages as possible. 

  • Compatibility – checking the functioning of the program simultaneously with other applications such as word processors, spreadsheets, day planners, etc.

  • Installation – the installation program must be tested to see if it properly installs in a variety of environments.
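A test that removes an input file and looks for a graceful stop might be sketched as below. Here `run_simulation` is a hypothetical stand-in for launching the real program, and the file name and contents are made up; with an actual executable, the same check would inspect its exit code and error output.

```python
import os
import tempfile


def run_simulation(input_path):
    """Hypothetical program entry point: returns (exit_code, message)."""
    try:
        with open(input_path, "r", encoding="utf-8") as handle:
            text = handle.read()
    except FileNotFoundError:
        return 1, f"Error: input file not found: {input_path}"
    if not text.strip():
        return 1, f"Error: input file is empty: {input_path}"
    return 0, "Simulation completed"


def missing_file_test():
    """Run once normally, remove the input file, and run again."""
    with tempfile.TemporaryDirectory() as folder:
        path = os.path.join(folder, "building.inp")
        with open(path, "w", encoding="utf-8") as handle:
            handle.write("ZONE, office;\n")
        ok_code, _ = run_simulation(path)
        os.remove(path)                          # simulate the missing file
        bad_code, message = run_simulation(path)
    # Graceful stop: non-zero exit code and a message naming the problem.
    return ok_code == 0 and bad_code == 1 and "not found" in message


print("graceful stop on missing input:", missing_file_test())
```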

Testing Deliverables

The ultimate deliverable from any testing effort is software that has fewer bugs than before the testing. In order to understand the value of specific testing activities for use in planning necessary revisions to the software, additional details should be provided in the form of:

  • Test Logs

  • Incident Reports

  • Summary Report

According to IEEE Standard 829, the summary report should include:

  • Summary

  • Variances

  • Comprehensiveness assessment

  • Summary of results

  • Evaluation

  • Summary of activities

The summary report should identify what software was tested, including the version number, and describe the software and hardware environment. It should reference the test plan, test logs, and test incident reports. In the variances section, it should describe which tests in the test plan were not followed exactly and what additional tests were performed beyond the test plan. The comprehensiveness assessment should evaluate which features or feature combinations of the software were not sufficiently tested, and it is expected to contain a subjective assessment. The summary of results and the evaluation should describe all incidents and whether they were resolved, and provide an overall evaluation of the testing and an estimate of the general reliability of the software. The summary of activities describes the effort level and calendar time needed for performing different aspects of the testing.

Results of GARD Analytics testing of the EnergyPlus program can be found here.

 


Contact info@gard.com or webmaster@gard.com
Copyright © 1996-2007 GARD Analytics, Inc.