| Level | Item | Definition |
|---|---|---|
Level 1 | Re-test reliability | The consistency of performer(s) results over repeated rounds of testing, typically conducted over a period of days or weeks. This represents the change in a participant’s results between repeated tests due to both systematic and random error, rather than true changes in performance [27, 36, 46] |
 | Intra-rater | The agreement (consistency) among two or more trials administered or scored by the same rater [4, 47] |
 | Inter-rater | The level of agreement (consistency) between assessments of the same performance when undertaken by two or more raters [4, 46, 47] |
 | Content validity | How well a specific test measures that which it intends to measure [4, 27] |
 | Discriminant validity | The extent to which results from a test relate to results on another test which measures a different construct (i.e., the ability to discriminate between dissimilar constructs) [42, 48, 49] |
 | Responsiveness/sensitivity to change | The ability of a test to detect worthwhile and ‘real’ improvements over time (e.g., between an initial bout of testing and subsequent rounds) [42, 50–54] |
 | MID/SWC (minimal important difference/smallest worthwhile change) | The smallest change or difference in a test result that is considered practically meaningful or important [55–58] |
 | Interpretability | The degree to which practical meaning can be assigned to a test result or change in result [25, 28] |
 | Familiarity required | The need to undertake a test familiarisation session with all participants prior to main testing to reduce or eliminate learning or reactivity effects [4] |
 | Duration | Expected and/or actual duration of the testing protocol [59, 60] |
Level 2 | Stability | The consistency of performer(s) results over repeated rounds of testing conducted over a period of months or years [40, 42, 61, 62] |
 | Internal consistency | The degree of inter-relatedness among test components that intend to measure the same construct/characteristic [28] |
 | Convergent validity | The extent to which results from tests that theoretically should be related to each other are, in fact, related to each other [42, 49] |
 | Concurrent validity | The extent to which the test relates to an alternate, previously validated measure of the same construct administered at the same time [42, 63] |
 | Predictive validity | The extent to which the test relates to a previously validated measure of a theoretically similar construct, administered at a future point in time [42, 63] |
 | Floor and ceiling effects | The ability of a test to distinguish between individuals at the lower and upper extremes of performance (i.e., ability to distinguish between high results (ceiling effect) and low results (floor effect)) [28, 64] |
 | Scoring complexity | The ease with which a test can be conducted and scored in a practical setting by the test administrator [65, 66] |
 | Completion complexity | The ease with which a test can be completed by a participant [65–67] |
 | Cost | The total amount of resources required for test administration including equipment, time, and administrator expertise/experience [25] |
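Two of the quantities above (re-test reliability and the SWC) are commonly estimated from repeated test data. The following is a minimal sketch using hypothetical jump-height data and one widely used convention from the sports-science literature: the typical error of measurement is the standard deviation of the difference scores divided by √2, and the SWC is taken as 0.2 × the between-subject standard deviation at baseline. The data values and variable names are illustrative only, not drawn from the source.

```python
import math
import statistics

# Hypothetical test-retest data (illustrative only): countermovement-jump
# heights (cm) for eight participants across two testing days.
trial_1 = [38.2, 41.5, 35.9, 44.0, 39.7, 36.8, 42.3, 40.1]
trial_2 = [39.0, 41.1, 36.5, 43.2, 40.4, 37.5, 41.8, 40.6]

# Typical error of measurement: SD of the difference scores / sqrt(2).
# This captures combined systematic and random error between repeated tests.
diffs = [b - a for a, b in zip(trial_1, trial_2)]
typical_error = statistics.stdev(diffs) / math.sqrt(2)

# Smallest worthwhile change: by one common convention, 0.2 x the
# between-subject SD of baseline scores (a small standardised effect).
swc = 0.2 * statistics.stdev(trial_1)

print(f"Typical error: {typical_error:.2f} cm")
print(f"SWC: {swc:.2f} cm")
```

An observed change larger than the SWC, judged against the typical error, is one way practitioners decide whether an improvement is 'real' and practically meaningful rather than measurement noise.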