Normative Values for Self-Reported Benchmark Workout Scores in CrossFit® Practitioners

Background CrossFit® practitioners commonly track progress by monitoring their ability to complete a variety of standardized benchmark workouts within a typical class setting. However, objective assessment of progress is challenging because normative data do not currently exist for any of these benchmark workouts. Therefore, the purpose of this study was to develop normative values for five common benchmark workouts (i.e., Fran, Grace, Helen, Filthy-50 [F50], and Fight-Gone-Bad [FGB]).
Methods Performance data from 133,857 profiles located on a publicly available website were collected, sorted by sex (i.e., male [M] and female [F]) and competitive age classification (i.e., teen [T], individual [I], or masters [M]), and screened for errors. Subsequently, 10,000 valid profiles were randomly selected for analysis.
Results Means and standard deviations were calculated for each category for Fran (IM 250 ± 106 s; IF 331 ± 181 s; MM 311 ± 138 s; MF 368 ± 138 s; TM 316 ± 136 s; and TF 334 ± 120 s), Grace (IM 180 ± 90 s; IF 213 ± 96 s; MM 213 ± 93 s; MF 238 ± 100 s; TM 228 ± 63 s; and TF 223 ± 69 s), Helen (IM 9.5 ± 1.9 min; IF 11.1 ± 2.4 min; MM 10.2 ± 2.0 min; MF 11.5 ± 2.3 min; TM 9.4 ± 1.6 min; and TF 12.7 ± 1.9 min), F50 (IM 24.4 ± 5.9 min; IF 27.3 ± 6.9 min; MM 26.7 ± 6.1 min; MF 28.2 ± 6.0 min; TM 25.9 ± 7.9 min; and TF 28.3 ± 8.1 min), and FGB (IM 335 ± 65 repetitions; IF 292 ± 62 repetitions; MM 311 ± 59 repetitions; MF 280 ± 54 repetitions; TM 279 ± 44 repetitions; and TF 238 ± 35 repetitions). These values were then used to calculate normative percentile values (in deciles) for each category within each workout. Separate one-way analyses of variance revealed significant (p < 0.05) differences between categories for each workout.
Conclusions These normative values can be used to assess proficiency and sport-specific progress, establish realistic training goals, and for standard inclusion/exclusion criteria for future research in CrossFit® practitioners.

Normative scores for five common benchmark workouts (i.e., Fran, Grace, Helen, Filthy-50, and Fight-Gone-Bad) were created for male and female competitors in the teen, individual, and masters' competitive age divisions for CrossFit®. On average, males in the individual and masters' age categories scored better than their female counterparts in each workout despite workouts being scaled for sex. The normative scores reported here may be used for standardized comparison between athletes, to track individual progress, and as an inclusionary/exclusionary criteria tool for future investigations on CrossFit®.

Background
CrossFit® training combines weightlifting, gymnastics, and traditional cardiovascular exercise modalities (e.g., running, rowing, and cycling) into a single workout that is performed at high intensity. Daily workouts are generally unique and vary in the number and type of exercises included, the prescribed intensity and volume loads, and whether rest intervals are enforced (e.g., Fight-Gone-Bad [FGB] requires a 1-min rest break between rounds) [1]. Performance during such workouts may be quantified through a variety of strategies. Trainees may be instructed to complete all exercises and/or rounds as quickly as possible, they may be asked to complete "as many repetitions as possible" (AMRAP) within a certain time frame, or they may be asked to maintain a specific workout pace (e.g., complete a specified number of repetitions every minute) for a set time frame. Regardless of formatting, workouts will typically challenge some combination of strength, power, endurance, and/or sport-specific skill. While monitoring progress in attributes such as strength, power, and endurance may be accomplished via traditional laboratory and field assessments, monitoring progress in sport-specific skill is not as simple. Assessments of individual skills (e.g., rope jumping or climbing, bar and ring muscle-ups, burpees, and box jumps) may provide some insight, but this practice lacks context. To this end, common benchmark workouts (i.e., FGB, Fran, Grace, Helen, and Filthy-50 [F50]) may be used to assist practitioners in gauging their ability to perform various movements within the context of a workout. Currently, normative values exist for several traditional physiological measures (e.g., maximal strength, aerobic capacity) [2], but not for these common benchmark workouts.
The CrossFit® website allows users to create a profile where they can upload their best scores for traditional measures of strength (i.e., squat, deadlift), power (i.e., clean and jerk, snatch), anaerobic performance (i.e., 400-m sprint), aerobic performance (i.e., 5000-m run), and common benchmark workouts. Previously, proficiency in some of these benchmark workouts (i.e., Fran and Grace) has been related to anaerobic performance and strength [3], while self-reported performances may distinguish competitive level within this sport [4]. For instance, Serafini and colleagues (2017) noted that performances in common benchmark workouts were greater in higher-ranking male and female competitors who placed within the top 1500 during the 2016 CrossFit® Open (CFO). However, considering that over 320,000 individuals participated in the 2016 CFO [5], this information is limited to a relatively small sample of CrossFit® practitioners, and only to those associated with the most competitive division (i.e., individual). Therefore, the purpose of this investigation was to create normative values for the five common benchmark workouts across the three primary competitive age divisions (i.e., individual, masters, and teens) in CrossFit® practitioners.

Study Design
Five hundred thousand uniform resource locators (URLs) were scraped (May 25-August 14, 2017) from a publicly available online database [6] and yielded 133,857 user profiles that contained self-reported anthropometric and performance data. Profiles were sorted by sex and competitive age classification (i.e., individual, masters, or teens) and then screened for errors. Profiles were eliminated from the analysis if they (a) contained data points that exceeded four standard deviations (i.e., < 0.001% of all cases) from their respective mean [7] or (b) did not contain more than one completed benchmark workout (i.e., Fran, Grace, Helen, Filthy-50, and Fight-Gone-Bad). Of the remaining cases (n = 39,884), exactly 10,000 profiles were randomly selected for analysis.
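The screening and sampling steps described above can be illustrated with a minimal Python sketch. It is not the procedure used in this study (which relied on SPSS); the profile layout (a dictionary keyed by workout name), the function names `screen_profiles` and `draw_sample`, and the fixed random seed are all assumptions made for illustration. The sketch also assumes that the caller passes in profiles from a single sex and age classification, so that each workout mean is the "respective" mean for that group.

```python
import random
import statistics

# Hypothetical keys for the five benchmark workouts.
WORKOUTS = ("fran", "grace", "helen", "f50", "fgb")


def screen_profiles(profiles, max_sd=4.0):
    """Keep profiles that report at least two workouts and have no score
    farther than max_sd standard deviations from that workout's mean."""
    limits = {}
    for w in WORKOUTS:
        scores = [p[w] for p in profiles if p.get(w) is not None]
        if len(scores) >= 2:  # stdev needs at least two values
            mean = statistics.mean(scores)
            sd = statistics.stdev(scores)
            limits[w] = (mean - max_sd * sd, mean + max_sd * sd)

    kept = []
    for p in profiles:
        reported = [w for w in WORKOUTS if p.get(w) is not None]
        if len(reported) < 2:  # criterion (b): more than one workout required
            continue
        in_range = all(
            limits[w][0] <= p[w] <= limits[w][1]
            for w in reported
            if w in limits
        )
        if in_range:  # criterion (a): no score beyond max_sd SDs
            kept.append(p)
    return kept


def draw_sample(profiles, n, seed=2017):
    """Draw a simple random sample without replacement (seed is arbitrary)."""
    rng = random.Random(seed)
    return rng.sample(profiles, min(n, len(profiles)))
```

In this sketch the mean and standard deviation are computed once, before any profile is removed, so a single extreme score inflates the standard deviation it is judged against. Whether screening should be repeated after removal is a design choice the text does not specify.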

Participants
Male (M) and female (F) participants, who were assigned to the individual (I; 18-34 years), masters (M; ≥ 35 years), or teens (T; < 18 years) age-classifications during the 2017 CFO, were selected for this study. All participants possessed, of their own volition and initiative, a profile on the CrossFit Games™ website [6] where their self-reported performance data was located. Profiles were selected by the numerical order of their URL. All data was downloaded from the CrossFit Games™ website and decoded so that no identifiable information (i.e., name) was available from any of the participants. Random sampling of all valid cases elicited 4397 profiles in I M (30.0 ± 4.2 years; 178.8 ± 7.2 cm; 86.3 ± 10.6 kg), 1628 profiles in I F (29.9 ± 4.0 years; 164.5 ± 6.7 cm; 65.2 ± 8.5 kg), 2955 profiles in M M (42.0 ± 5.9 years; 178.9 ± 7.1 cm; 87.3 ± 11.2 kg), 918 profiles in M F (41.7 ± 5.9 years; 164.7 ± 6.7 cm; 64.7 ± 9.0 kg), 69 profiles in T M (17.5 ± 2.7 years; 175.3 ± 6.5 cm; 74.5 ± 10.3 kg), and 33 profiles in T F (17.0 ± 0.8 years; 163.4 ± 6.5 cm; 61.4 ± 8.8 kg). Since these data were pre-existing and publicly available, the University's Institutional Review Board classified this study as exempt (Study# 16-215).

Performance Measures
Participants have the option on their profile to record their best performances for select benchmark workouts. These include Fight-Gone-Bad (FGB), Fran, Grace, Helen, and the Filthy 50 (F50). The details of each workout's design, repetition scheme, exercise list, standardized load or difficulty, and scoring method are described in Table 1. Briefly, four of the recorded events (i.e., Fran, Grace, Helen, and F50) were scored by time-to-completion (TTC), while FGB was scored as the total number of repetitions completed within the set time frame.

Statistical Analyses
Statistical software (SPSS, v.24.0, SPSS Inc., Chicago, IL) was used for random sampling, as well as to calculate means, standard deviations, and percentiles (in deciles) for each competitive group. Additionally, a one-way analysis of variance was used to examine differences between I M, I F, M M, M F, T M, and T F. Subsequent Tukey's post hoc tests were used to determine pairwise differences when significant F ratios were obtained. For all statistical tests, a probability level of p ≤ 0.05 was established to denote statistical significance.
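For readers who wish to reproduce this kind of analysis without SPSS, the decile and one-way analysis of variance calculations can be sketched with the Python standard library. The function names `deciles` and `one_way_anova_f` are hypothetical, and the interpolation rule used by `statistics.quantiles` is not necessarily the one SPSS applies, so cut points may differ slightly from those reported here. The sketch stops at the F ratio; obtaining a p value or Tukey's post hoc comparisons would require a dedicated statistics package.

```python
import statistics


def deciles(scores):
    """Return the nine cut points (10th to 90th percentile) of a score list."""
    return statistics.quantiles(scores, n=10, method="inclusive")


def one_way_anova_f(groups):
    """Return (F ratio, df between, df within) for a list of score lists."""
    all_scores = [x for g in groups for x in g]
    grand_mean = statistics.fmean(all_scores)

    # Between-group sum of squares: group size times squared mean deviation.
    ss_between = sum(
        len(g) * (statistics.fmean(g) - grand_mean) ** 2 for g in groups
    )
    # Within-group sum of squares: squared deviation from each group's mean.
    ss_within = 0.0
    for g in groups:
        group_mean = statistics.fmean(g)
        ss_within += sum((x - group_mean) ** 2 for x in g)

    df_between = len(groups) - 1
    df_within = len(all_scores) - len(groups)
    f_ratio = (ss_between / df_between) / (ss_within / df_within)
    return f_ratio, df_between, df_within
```

Here each inner list passed to `one_way_anova_f` would hold the scores of one competitive group for a single workout, so the function would be called once per workout.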

Fight Gone Bad
No differences were observed between teen competitors (T M = 316 ± 136 s; T F = 334 ± 120 s) and any other classification.

Helen
For Helen (Fig. 1d)  faster completion times than M F (p ≤ 0.001). No other differences were observed between T M (9.4 ± 1.6 min) and any other classification.

Discussion
CrossFit® training constantly varies daily workouts to promote general physical preparedness [1]. While this   [8,9], gauging sport-specific progress and proficiency is difficult. Traditional field and laboratory measures (e.g., aerobic capacity, anaerobic threshold, peak power) are commonly accepted tools for monitoring athletic progress [2], and a few have been related to CrossFit® performance [3,10]. However, in most instances, their precision is dependent on the availability of expensive equipment, and it may not be logistically feasible to assess several individuals from a single location or across locations, without sacrificing their validity and/or reliability. It is also difficult to simulate actual workouts or competitive environments with traditional assessment tools (e.g., metabolic cart, cycle ergometers, force plates) because of the likelihood that they would impair natural movement. Thus, CrossFit® practitioners commonly use standardized workouts to monitor sport-specific adaptations. These common benchmark workouts are identifiable by name (e.g., Fran, Grace), and their requirements are standardized across affiliates. Though commonly practiced, there is little information available to allow practitioners to determine the quality of their performance in such workouts. Here, we provide normative values for self-reported performance scores in five common benchmark workouts for male and female practitioners across the three primary age-classifications (i.e., teens, individuals, or masters) of the CrossFit® Open. Practitioners can use these data to project their status among their peers, as well as to monitor their individual progress and set realistic goals for training. In terms of absolute intensity, CrossFit® workouts prescribed for I M are the most challenging. 
For instance, in the workouts examined in the present study, men were typically required to lift more weight, jump onto a higher box, or throw a heavier medicine ball to a higher target than women. Workout prescription may be further scaled to accommodate less experienced and/or older individuals, but this does not occur in the common benchmark workouts (i.e., only one workout design exists for each sex, regardless of age). Accordingly, we observed that I M and I F performed better than their masters counterparts in all workouts aside from F50 (i.e., no differences were found between I F and M F). This is not surprising because younger practitioners would be expected to perform better when given the same task [11,12]. However, within the individual and masters age classifications, men reported better scores than women for each workout. This is interesting because appropriate scaling should equate workout difficulty and result in similar scores between men and women. Typically, clear differences exist between men and women when comparisons are made with absolute values for traditional measures of strength and endurance, but not when using relative figures (e.g., percentage of one-repetition maximum, per kilogram of body mass) [13-15]. Though comparisons between sexes are not common in CrossFit®, they may be possible if relative standards are used when prescribing intensity. Another possible explanation may be related to the fact that more men (n = 7352) than women (n = 2546), in the individual and masters age classifications, possessed a profile account and reported their performance scores. Likewise, only 102 teenage practitioners possessed an account in the present sample. Individuals who participate in CrossFit® and similar exercise forms are not required to create a profile on the CrossFit® website and have alternative platforms for tracking progress (e.g., Wodify, Zen Planner, Beyond the Whiteboard). 
Consequently, our findings may be limited to CrossFit® athletes who also possess an account on the CrossFit® website. Further, because the athletes report these data as their personal best performance in each workout, our findings may be most representative of peak fitness within each individual workout and not necessarily of ability across all workouts simultaneously.
These data may also be useful for developing more accurate inclusion/exclusion criteria in research. Currently, physiological research on CrossFit® is limited, and most studies have used training experience (i.e., the number of years of participation) as the primary indicator for training status. Though years of experience would likely indicate a degree of familiarity with the nuances of this training strategy, its use as an indicator of proficiency is complicated by individual variability in training frequency, regularity in utilizing prescribed (versus scaled) workouts, athletic talent, and previous experiences in other sports. Put simply, unless potential participants are recruited from a pool of individuals who have been previously ranked in international competitions (e.g., the Reebok CrossFit Games™), it is difficult to accurately identify their proficiency in the sport from experience alone. For instance, male and female participants have been previously recruited based on their experience (number of years was not reported) with CrossFit® to determine their physiological responses to two common benchmark workouts (including "Fran") [16]. However, it may not be correct to extrapolate their findings to all CrossFit® practitioners. Based on our findings, the "Fran" scores for male (331 ± 82.4 s) and female (331 ± 92.1 s) participants in that study would have placed them within the 20th and 50th percentiles, respectively. It may have been more appropriate to describe those individuals as beginner or intermediate CrossFit® practitioners, rather than simply stating they had experience. Likewise, Butcher and colleagues (2015) recruited participants who had previously progressed to the regional round of the Reebok CrossFit Games™ or at least participated in the CrossFit® Open, and who possessed at least 1 year of experience (~3.7-4.3 years). 
However, by examining their measured performances in Fran (203 ± 48 s; range = 130-289 s) and Grace (136 ± 32 s; range = 93-194 s), and depending on sex category (not specified), they could have ranked above the 70th percentile for "Fran" or as low as the 20th percentile for "Grace". Comparatively, less variability in reported performance scores can be observed in the study conducted by Serafini and colleagues (2017). In that study, the authors utilized final rankings in the 2016 CrossFit® Open to examine differences in benchmark workout scores reported by the top 1500 male and 1500 female athletes (i.e., the top ~1%). Although the reported scores would still vary by specific workout and sex, male and female participants typically ranked above the 80th and 70th percentiles, respectively. As more research is conducted on CrossFit®, it will become increasingly necessary to utilize more specific methods for participant recruitment to make accurate inferences across studies.
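When only a group mean and standard deviation are at hand, a rough percentile rank for an external score can be approximated as sketched below. This is an illustration rather than the method used in this study, whose percentiles are empirical deciles; the sketch assumes normally distributed scores, which completion times may not satisfy. The function name `approx_percentile` is hypothetical, and the example values are the Fran means and standard deviations reported for the individual male and female categories.

```python
from statistics import NormalDist


def approx_percentile(score, mean, sd, lower_is_better=True):
    """Approximate the share (in percent) of a group that a score outperforms,
    assuming the group's scores follow a normal distribution."""
    share_below = NormalDist(mean, sd).cdf(score)
    # For timed workouts a lower score is better, so the athlete outperforms
    # everyone with a higher (slower) time; for repetition counts the reverse.
    share_beaten = 1.0 - share_below if lower_is_better else share_below
    return 100.0 * share_beaten


# A 331 s Fran compared against individual males (250 ± 106 s)
# and individual females (331 ± 181 s).
male_rank = approx_percentile(331, 250, 106)
female_rank = approx_percentile(331, 331, 181)
```

Under this approximation, a Fran time of 331 s falls in the low 20s for individual males and at the 50th percentile for individual females, which is broadly in line with the placements described above. For FGB, which is scored in repetitions, `lower_is_better=False` would apply.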

Conclusions
In practice, the five benchmark workouts described here are typically made part of regular training but are not commonly completed under the scrutiny of a judge. Although it is possible that the self-reported data used in this study included invalid performance scores (i.e., the athlete did not meet all workout requirements), this method of reporting is consistent with how these workouts are commonly scored at a local affiliate. That is, coaches rely on trainees to follow the described standards for each workout and to accurately report their scores. Nevertheless, additional steps were taken to minimize the number of unrealistic performances (i.e., removing scores that were greater than four standard deviations from the mean). Though potentially limited to users of the CrossFit® website, the normative values we have presented appear to adequately describe sport-specific ability for five common benchmark workouts. Practitioners and coaches may use these values to assess individual progress, make comparisons between individuals, and establish realistic training goals. Further, as more research is conducted on this training strategy, these values may be used as inclusion/exclusion criteria to assist researchers when assessing the suitability of potential participants for a study's specific aims. Nevertheless, it may be worthwhile to verify these normative values, obtained from self-reported performance scores, with those obtained from observed performances. Additionally, the five workouts examined here represent a small sample of potential benchmark tools that could be used to assess sport-specific ability in CrossFit® participants. Future endeavors should seek to identify normative values for additional benchmark workouts (e.g., "Cindy", "Jackie", "Diane"), as well as for "Hero" workouts (e.g., "Jerry", "Murph", "Randy").