Original Contribution  |   October 2010
Maintenance and Improvement of Interobserver Reliability of Osteopathic Palpatory Tests Over a 4-Month Period
Author Notes
  • From the A.T. Still Research Institute at A.T. Still University in Kirksville, Missouri (Dr Degenhardt and Ms Johnson) and the Department of Osteopathic Manipulative Medicine at the Kirksville College of Osteopathic Medicine-A.T. Still University in Missouri (Drs Snider and Snider). 
  • Address correspondence to Brian F. Degenhardt, DO, A.T. Still Research Institute, A.T. Still University, 800 W Jefferson St, Kirksville, MO 63501-1443. E-mail: bdegenhardt@atsu.edu 
The Journal of the American Osteopathic Association, October 2010, Vol. 110, 579-586. doi:10.7556/jaoa.2010.110.10.579
Abstract

Context: Few studies have shown that diagnostic palpation is reliable. No studies have shown that the reliability of diagnostic palpatory skills can be maintained and improved over time.

Objective: To investigate whether the reliability of selected palpatory tests used to identify lumbar somatic dysfunction was maintained during a 4-month period as part of a clinical observational study.

Methods: Participants with low back pain and participants without low back pain, recruited from a rural Midwestern community, were examined during 6 separate sessions over a 4-month period. During each data collection session, two blinded examiners, who had previously completed comprehensive consensus training, evaluated the lumbar region with four tests: static segmental positional asymmetry of the transverse processes in the horizontal plane, tissue texture abnormalities, resistance to anterior springing on the spinous processes, and tenderness induced by pressure on the spinous processes. Detailed protocols for each test were defined during a previous comprehensive consensus training period and were not revised during the current study. To verify that established interobserver reliability was maintained throughout the clinical study, quality control sampling was performed on all data. When findings were inconsistent between the two examiners, focused consensus training was performed as a means of recalibration to understand why assessments were inconsistent. Interobserver reliability for determining the presence or absence of somatic dysfunction was assessed using kappa coefficients.

Results: The study enrolled 64 participants, and 14 to 33 participants were examined per session. All four tests had acceptable interobserver reliability by the final data collection session. The test for static segmental positional asymmetry of the transverse processes in the horizontal plane had moderate to substantial reliability in all 6 sessions. The test for tissue texture abnormalities had moderate reliability in 5 of the 6 sessions. The test for resistance to anterior springing on the spinous processes had moderate reliability for 3 of the 6 sessions. The test for tenderness had substantial to almost perfect reliability for all 6 sessions. In general, interobserver reliability improved over time.

Conclusions: Examiners were able to maintain and improve interobserver reliability of four lumbar diagnostic palpatory tests over a 4-month period.

Practitioners from multiple manual-therapy professions, such as chiropractic, massage therapy, osteopathic medicine, and physical therapy, use palpation to diagnose and treat patients with disorders of the spine.1,2 Despite the widespread use of palpation, scientific assessment of the value of diagnostic palpation and of the effectiveness of manual treatment has been limited, and there is a lack of fundamental evidence demonstrating the reliability and validity of the numerous tests and techniques used within each profession. Although variation exists in the techniques used by individual professions, tests evaluating the spine can be organized into categories that assess the following four characteristics: symmetry of spinal landmarks, soft tissue abnormalities, segmental and regional motion, and pain.3,4 
For more than 30 years, studies3-8 have investigated the reliability of spinal diagnostic palpatory tests, but systematic reviewers have criticized most of these studies for having poor methodologic designs. Despite variations in the inclusion criteria used in recent systematic reviews of research on diagnostic palpatory tests, the conclusions made by those reviewers have been similar: pain provocation tests are reliable both between and within observers; motion and landmark location tests are more reliable within the same observer (ie, intraobserver reliability) than between observers (ie, interobserver reliability); and, overall, tests other than pain provocation have poor interobserver reliability.3,4 
A previous study by the present authors9 showed that improvement in reliability of several commonly used palpatory tests for the lumbar spine could be achieved after a period of consensus training. These tests, performed at the L1 through L4 vertebral levels, evaluated static segmental positional asymmetry of the transverse processes in the horizontal plane, tissue texture abnormalities, resistance to induced motion on the transverse processes, resistance to anterior springing on the spinous processes, and tenderness induced by pressure on the spinous processes. In that study,9 three phases were part of the training process: pretraining interobserver reliability assessment, consensus training, and posttraining interobserver reliability assessment. 
The acceptable reliability of tests in the previous study9 was limited to the examiners in that study at that time. For the current study, we hypothesized that the reliability of these tests could be maintained and improved upon over an extended period. A review of the literature identified only one report10 on the evaluation of interobserver agreement over time. In that study,10 thoracic vertebrae were evaluated by two examiners in two sessions within a 10-day period. Interobserver reliability for the presence vs absence of motion restriction was fair in both sessions (Cohen's kappa coefficient [κ] = 0.22 and 0.24). 
In the current study, two blinded examiners evaluated spinal dysfunctions from L1 through L4 using four common palpatory tests over a 4-month period within a clinical observational study. Participants with and without low back pain were included in the study. The purpose of the current study was to verify that established interobserver reliability was maintained throughout the clinical study. Quality control sampling was performed on all data, and focused consensus training was used as an efficient recalibration procedure when disagreements in findings were identified. 
Methods
The current study sought to determine whether established interobserver reliability of four palpatory diagnostic tests for evaluating lumbar somatic dysfunction could be maintained during a clinical observational study. There were 6 data collection sessions over the 4-month period of the study, quality control sampling was performed on all data, and occasional recalibration procedures were performed. Study participants were recruited from the faculty, staff, and students of Kirksville College of Osteopathic Medicine-A.T. Still University (KCOM) in Missouri, as well as from the surrounding rural community. 
Individuals of both sexes and those with and without low back pain were included in the study. Individuals were excluded if they had known congenital vertebral anomalies or other conditions (eg, fractures) that could potentially alter lumbar bony anatomic features or if they had received spinal manipulation within the 8 weeks preceding the study. The KCOM Institutional Review Board reviewed and approved the study design. All participants signed institutional review board–approved informed consent forms prior to participation in the study. 
Two examiners who are American Osteopathic Association board certified in neuromusculoskeletal medicine—with 10 years (B.F.D.) and 3 years (K.T.S.) of clinical experience—performed the tests in the study. Pretraining levels of interobserver reliability were determined for the following four segmental vertebral palpatory diagnostic tests: positional asymmetry of the transverse processes in the horizontal plane, tissue texture abnormalities, segmental rotational motion asymmetry, and pain provocation to assess tenderness.9 
As described in the previous study,9 consensus training was performed following pretraining interobserver reliability assessment. After consensus training was completed, the interobserver reliability was again tested, revealing improved reliability for positional asymmetry, tissue texture abnormalities, and tenderness. However, because of nonsignificant improvement and overall poor reliability of the method for measuring rotational motion asymmetry during consensus training,9 that test was changed for the current study to resistance to anterior springing on the spinous processes. Results for pretraining and posttraining periods for the two examiners in the current study were obtained from our previous study.9 
After assessment of posttraining interobserver reliability, a 2-month interval occurred before the maintenance of interobserver reliability was studied for another 4-month period. During these 4 months, the protocol remained consistent with the method developed during the consensus training study.9 The time between the 6 data collection sessions of the current study ranged from 1 to 8 weeks, depending on the availability of examiners and participants. 
Participants were examined in the prone position. The spinous processes of L1 through L4 were identified by palpation. L5 was not included because of its high frequency of anatomic variability, which complicates palpation. A mark was placed on the skin overlying and bisecting the vertical midportion of the spinous processes. Both examining osteopathic physicians agreed on the placement of these marks, which were used to ensure consistent localization of the vertebrae being tested. 
Throughout the study, the examiners alternated in regard to which one evaluated the participants first. In each of the 6 data collection sessions, both examiners performed the tests in the same sequence (Figure), and each test (other than pain provocation to assess tenderness) was performed 2 or 3 times to determine a consistent outcome (ie, at least 2 of 3 tests in agreement). Pain provocation was performed 1 or 2 times to avoid causing hypersensitivity. 
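The repetition rule for the tests other than pain provocation can be written as a small decision function. The sketch below is illustrative only: the name `consistent_outcome` and the encoding of a finding as `True` (present) or `False` (absent) are assumptions, and the rule is read from the text as two matching trials settling the outcome, with a third trial used only to break a 1-1 split.

```python
from typing import Optional, Sequence


def consistent_outcome(trials: Sequence[bool]) -> Optional[bool]:
    """Return the finding agreed on by at least 2 trials, or None if undecided.

    Hypothetical helper: True means the finding was present on that trial.
    Two matching trials settle the outcome; a 1-1 split calls for a third trial.
    """
    if len(trials) not in (2, 3):
        raise ValueError("tests other than pain provocation use 2 or 3 trials")
    present = sum(1 for t in trials if t)
    absent = len(trials) - present
    if present >= 2:
        return True
    if absent >= 2:
        return False
    return None  # two trials that disagree: perform a third trial


print(consistent_outcome([True, False]))        # None, so a third trial is needed
print(consistent_outcome([True, False, True]))  # True
```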
The presence or absence of physical findings was recorded in a blinded manner after each test was completed. In addition, the examiners were blinded regarding participants' histories of low back pain. After both examiners completed their testing and recording of findings, they compared results. If lack of agreement for a specific test was found, that test was repeated by both examiners in an unblinded manner. The repeated testing allowed each examiner to observe the other's techniques (eg, location of digits, amount and direction of forces), leading to an understanding of why the findings were different and to a consensus on findings. This process functioned as an efficient and meaningful method of recalibration. Only data collected in a blinded manner were used for the assessment of interobserver reliability. 
Figure.
Description of osteopathic diagnostic palpatory tests used to measure maintenance of interobserver reliability over a 4-month period. The tests are listed in the sequence they were performed in each of 6 sessions during the study period.
Statistical Analysis
Data used in the statistical analysis were collected in a blinded manner at pretraining and posttraining and at the 6 data collection sessions of the clinical observational study. The data were analyzed for interobserver reliability by using Cohen's κ coefficient, 95% confidence intervals for κ, and percent agreement. The κ values were interpreted using the scale established by Landis and Koch,11 as follows: 0.81-1.00 indicates almost perfect reliability; 0.61-0.80, substantial reliability; 0.41-0.60, moderate reliability; 0.21-0.40, fair reliability; 0-0.20, slight reliability; and <0, poor reliability. Acceptable reliability was defined as κ≥0.40.12 Analyses were conducted using SAS 9.2 software (SAS Institute Inc, Cary, North Carolina). 
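For present vs absent findings, percent agreement and Cohen's κ can be computed directly from the two examiners' paired ratings and then labeled with the Landis and Koch scale given above. The sketch below assumes this two-category coding; the function names and the example ratings are hypothetical, and the study's own analyses were run in SAS rather than with code like this.

```python
from typing import Sequence, Tuple


def agreement_and_kappa(a: Sequence[bool], b: Sequence[bool]) -> Tuple[float, float]:
    """Percent agreement and Cohen's kappa for two examiners' present/absent findings."""
    n = len(a)
    if n == 0 or n != len(b):
        raise ValueError("need two non-empty rating lists of equal length")
    p_o = sum(x == y for x, y in zip(a, b)) / n   # observed agreement
    pa, pb = sum(a) / n, sum(b) / n               # each examiner's "present" rate
    p_e = pa * pb + (1 - pa) * (1 - pb)           # agreement expected by chance
    if p_e == 1:
        raise ValueError("kappa is undefined when chance agreement equals 1")
    return 100 * p_o, (p_o - p_e) / (1 - p_e)


def landis_koch(kappa: float) -> str:
    """Label kappa with the Landis and Koch categories listed in the text."""
    if kappa < 0:
        return "poor"
    for upper, label in ((0.20, "slight"), (0.40, "fair"), (0.60, "moderate"), (0.80, "substantial")):
        if kappa <= upper:
            return label
    return "almost perfect"


# Hypothetical findings for 10 vertebrae (True = finding present)
examiner_a = [True, True, False, True, False, False, True, True, False, True]
examiner_b = [True, False, False, True, False, True, True, True, False, True]
pct, kappa = agreement_and_kappa(examiner_a, examiner_b)
print(f"{pct:.0f}% agreement, kappa {kappa:.2f} ({landis_koch(kappa)})")
# 80% agreement, kappa 0.58 (moderate)
```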
Prevalence of specific somatic dysfunctions was calculated as the percent of examinations in which the physical finding was present. To test for changes in probability of agreement between the examiners from the pretraining period to the data collection sessions, P values were calculated using logistic regression analysis. For resistance to anterior springing on the spinous processes, change in probability of agreement was analyzed in comparison to the posttraining reliability session because of the change in this palpatory diagnostic test between the pretraining and the posttraining periods. 
Logistic regression analysis was also used to test for the effect of participant demographic characteristics (ie, age, body mass index [BMI], sex, and presence of low back pain) on the probability of agreement between the two examiners. 
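As an illustration of how a change in the probability of agreement can be tested, the sketch below fits a logistic regression with one binary predictor (for example, 0 for a pretraining examination and 1 for a data collection session, or 0 and 1 for participant sex) by Newton-Raphson and reports a two-sided Wald P value. It is a simplified stand-in, not a reproduction of the study's SAS models: it fits a single predictor, treats every examination as independent, and does not handle complete separation. The data are hypothetical.

```python
import math
from typing import Sequence, Tuple


def logistic_wald(x: Sequence[float], y: Sequence[int], iters: int = 50) -> Tuple[float, float, float, float]:
    """Fit logit P(agree) = b0 + b1*x by Newton-Raphson.

    Returns (b0, b1, standard error of b1, two-sided Wald P value for b1).
    Sketch only: one predictor, independent observations, no separation handling.
    """
    b0 = b1 = 0.0
    for _ in range(iters):
        # Score vector (g) and information matrix (h) at the current estimates
        g0 = g1 = h00 = h01 = h11 = 0.0
        for xi, yi in zip(x, y):
            p = 1.0 / (1.0 + math.exp(-(b0 + b1 * xi)))
            w = p * (1.0 - p)
            g0 += yi - p
            g1 += (yi - p) * xi
            h00 += w
            h01 += w * xi
            h11 += w * xi * xi
        det = h00 * h11 - h01 * h01
        if det == 0:
            raise ValueError("predictor has no variation")
        step0 = (h11 * g0 - h01 * g1) / det   # Newton step: inverse(h) times g
        step1 = (h00 * g1 - h01 * g0) / det
        b0, b1 = b0 + step0, b1 + step1
        if max(abs(step0), abs(step1)) < 1e-10:
            break
    se_b1 = math.sqrt(h00 / det)              # from the inverse information matrix
    p_value = math.erfc(abs(b1 / se_b1) / math.sqrt(2.0))
    return b0, b1, se_b1, p_value


# Hypothetical data: 6 of 10 agreements at pretraining (x=0), 8 of 10 in a session (x=1)
x = [0] * 10 + [1] * 10
y = [1] * 6 + [0] * 4 + [1] * 8 + [0] * 2
b0, b1, se, p = logistic_wald(x, y)
print(f"odds ratio {math.exp(b1):.2f}, P = {p:.2f}")
```

With a single binary predictor, the fitted slope equals the log odds ratio of agreement between the two groups, which makes the output easy to check by hand.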
Results
Sixty-four participants were included in the current study. Thirty-three participants were included in session 1, 31 in session 2, 14 in session 4, and 17 each in sessions 3, 5, and 6. Participants were aged between 20 and 40 years, with a mean (SD) age of 30 (6) years. Of the 64 participants, 16 (25%) had low back pain and 48 (75%) did not. Forty-nine (77%) of the study participants were women. 
Over the 6 data collection sessions during the 4-month study period, improvement in interobserver reliability was obtained for all tests. All four palpatory tests demonstrated acceptable interobserver reliability, as defined12 by κ≥0.40. 
For static segmental positional asymmetry of the transverse processes in the horizontal plane, agreement between examiners significantly improved from the pretraining session to the 6 data collection sessions (Table 1). The κ values in the 6 sessions ranged from 0.44 to 0.72, with moderate to substantial reliability achieved in all sessions. 
Table 1
Interobserver Reliability and Maintenance of Reliability for Static Segmental Positional Asymmetry of the Transverse Processes in the Horizontal Plane (N=64)
Testing Session (n) | Study Week | Total Vertebrae Examined, No. | Agreement, % | κ (95% CI) | Prevalence, %
Pretraining (42) | -38 | 168 | 40 | 0.18 (0.08 to 0.27) | 72
Posttraining (30) | -20 | 120 | 84* | 0.29 (0.02 to 0.56) | 93
Session 1 (33) | 0 | 132 | 77* | 0.56 (0.42 to 0.71) | 88
Session 2 (31) | 8 | 124 | 74* | 0.44 (0.27 to 0.61) | 88
Session 3 (17) | 9 | 68 | 88* | 0.60 (0.31 to 0.88) | 91
Session 4 (14) | 10 | 56 | 89* | 0.72 (0.52 to 0.91) | 95
Session 5 (17) | 13 | 68 | 90* | 0.66 (0.43 to 0.89) | 96
Session 6 (17) | 17 | 68 | 91* | 0.59 (0.26 to 0.92) | 91
Abbreviations: CI, confidence interval; κ, kappa coefficient.
*Significant difference in examiner agreement compared to pretraining session (P<.001).
For tissue texture abnormalities, agreement between examiners also significantly improved from the pretraining session to the 6 data collection sessions (Table 2). The κ values ranged from 0.23 to 0.55, with moderate reliability achieved in 5 of the 6 sessions. 
Table 2
Interobserver Reliability and Maintenance of Reliability for Tissue Texture Abnormalities (N=64)
Testing Session (n) | Study Week | Total Vertebrae Examined, No. | Agreement, % | κ (95% CI) | Prevalence, %
Pretraining (42) | -38 | 336 | 58 | -0.01 (-0.11 to 0.10) | 29
Posttraining (30) | -20 | 240 | 69* | 0.37 (0.26 to 0.49) | 57
Session 1 (33) | 0 | 264 | 76* | 0.41 (0.29 to 0.53) | 72
Session 2 (31) | 8 | 248 | 72* | 0.43 (0.32 to 0.55) | 57
Session 3 (17) | 9 | 136 | 74* | 0.45 (0.30 to 0.60) | 60
Session 4 (14) | 10 | 112 | 76* | 0.23 (0.01 to 0.44) | 81
Session 5 (17) | 13 | 136 | 82* | 0.55 (0.40 to 0.71) | 73
Session 6 (17) | 17 | 136 | 78* | 0.45 (0.29 to 0.62) | 72
Abbreviations: CI, confidence interval; κ, kappa coefficient.
*Significant difference in examiner agreement compared to pretraining session (P<.001).
For resistance to anterior springing on the spinous processes, no pretraining data were available because, as previously mentioned, the test used for this category of palpatory diagnosis was changed between the pretraining and posttraining sessions. Between the posttraining session and the 6 data collection sessions, improvement in agreement between examiners was not significant (Table 3). The κ values ranged from 0.30 to 0.50, with moderate reliability achieved in 3 of the 6 sessions. 
Table 3
Interobserver Reliability and Maintenance of Reliability for Resistance to Anterior Springing on the Spinous Processes (N=64)
Testing Session (n) | Study Week | Total Vertebrae Examined, No. | Agreement, % | κ (95% CI) | Prevalence, %
Posttraining (14) | -8 | 60 | 70 | 0.29 (0.03 to 0.55) | 73
Session 1 (33) | 0 | 132 | 70 | 0.30 (0.12 to 0.47) | 72
Session 2 (31) | 8 | 124 | 73 | 0.46 (0.30 to 0.62) | 55
Session 3 (17) | 9 | 68 | 76 | 0.50 (0.29 to 0.71) | 50
Session 4 (14) | 10 | 56 | 75 | 0.31 (0.04 to 0.58) | 52
Session 5 (17) | 13 | 68 | 71 | 0.37 (0.15 to 0.60) | 57
Session 6 (17) | 17 | 68 | 76 | 0.44 (0.21 to 0.67) | 61
Abbreviations: CI, confidence interval; κ, kappa coefficient.
*Because of a change in testing procedures between the pretraining and posttraining sessions, comparisons for resistance to anterior springing are made to the posttraining session.
For pain provocation to assess tenderness, agreement between examiners significantly improved from the pretraining session to 5 of the 6 data collection sessions (Table 4). The κ values in the 6 sessions ranged from 0.61 to 0.88, with substantial to almost perfect reliability achieved in all sessions. 
Table 4
Interobserver Reliability and Maintenance of Reliability for Pain Provocation to Assess Tenderness Over the Spinous Processes (N=64)
Testing Session (n) | Study Week | Total Vertebrae Examined, No. | Agreement, % | κ (95% CI) | Prevalence, %
Pretraining (42) | -38 | 168 | 75 | 0.32 (0.16 to 0.49) | 24
Posttraining (3) | -26 | 12 | 83 | 0.56 (0.01 to 1.00) | 25
Session 1 (33) | 0 | 132 | 91* | 0.70 (0.54 to 0.86) | 18
Session 2 (31) | 8 | 124 | 91* | 0.68 (0.50 to 0.86) | 17
Session 3 (17) | 9 | 68 | 97* | 0.88 (0.72 to 1.00) | 15
Session 4 (14) | 10 | 56 | 82 | 0.61 (0.40 to 0.83) | 36
Session 5 (17) | 13 | 68 | 94* | 0.81 (0.63 to 0.99) | 19
Session 6 (17) | 17 | 68 | 87* | 0.65 (0.45 to 0.85) | 24
Abbreviations: CI, confidence interval; κ, kappa coefficient.
*Significant difference in examiner agreement compared to pretraining session (P<.001).
Demographic characteristics of study participants (Table 5) had little impact on the reliability of the tests during the current study. The BMI of participants was not significantly related to interobserver reliability for static segmental positional asymmetry of the transverse processes in the horizontal plane (P=.17), tissue texture abnormalities (P=.75), resistance to anterior springing on the spinous processes (P>.99), or tenderness (P=.15). Interobserver reliability was significantly greater when examining women than men for static segmental positional asymmetry of the transverse processes in the horizontal plane (P=.04), but not for tissue texture abnormalities (P=.36), resistance to anterior springing on the spinous processes (P=.11), or tenderness (P=.22). 
Table 5
Demographic Characteristics of Participants in Study of Interobserver Reliability of Osteopathic Palpatory Tests (N=64)
Testing Session (n) | Study Week | Men, No. (%) | Women, No. (%) | Presence of Low Back Pain, No. (%) | Age, mean (SD), y | Body Mass Index, mean (SD), kg/m²
Pretraining (42) | -38 | 32 (76) | 10 (24) | ... | 26 (4) | ...
Posttraining* (3) | -26 | 2 (67) | 1 (33) | ... | 27 (1) | ...
Posttraining* (30) | -20 | 14 (47) | 16 (53) | ... | 29 (6) | ...
Posttraining* (14) | -8 | 7 (50) | 7 (50) | ... | 36 (13) | ...
Session 1 (33) | 0 | 8 (24) | 25 (76) | 2 (6) | 30 (6) | 26.3 (5.7)
Session 2 (31) | 8 | 8 (26) | 23 (74) | 2 (6) | 31 (6) | 26.5 (5.7)
Session 3 (17) | 9 | 2 (12) | 15 (88) | 0 | 27 (7) | 25.9 (4.2)
Session 4 (14) | 10 | 5 (36) | 9 (64) | 14 (100) | 29 (6) | 25.8 (4.1)
Session 5 (17) | 13 | 2 (12) | 15 (88) | 0 | 27 (7) | 25.9 (4.2)
Session 6 (17) | 17 | 2 (12) | 15 (88) | 0 | 27 (7) | 25.9 (4.2)
... indicates data not collected.
*Posttraining for the four palpatory tests occurred during different testing sessions.
Participant age was significantly related to interobserver reliability for resistance to anterior springing on the spinous processes (P=.01), with reliability being poorer in older participants. Participant age was not significantly related to interobserver reliability for static segmental positional asymmetry of the transverse processes in the horizontal plane (P=.82), tissue texture abnormalities (P=.82), or tenderness (P=.54). 
Comment
Results of the current study indicate that acceptable interobserver reliability can be maintained and improved over 4 months with occasional recalibration consisting of focused consensus training. In a comparison of individual palpatory diagnostic tests, pain provocation to assess tenderness had the highest κ values for almost all assessments. This result is consistent with previously reported findings.3,4 Static segmental positional asymmetry of the transverse processes in the horizontal plane had κ values similar to those of pain provocation to assess tenderness, showing a level of reliability in the moderate to substantial range. To our knowledge, such a high level of reliability for static segmental positional asymmetry of the transverse processes in the horizontal plane has not been reported previously. 
Only one session for tissue texture abnormalities (session 4) resulted in a κ value lower than the posttraining level. Overall, the test for tissue texture abnormalities showed moderate reliability, and the test for resistance to anterior springing on the spinous processes showed fair to moderate reliability. 
A comparison of the session 4 cohort vs the cohorts of the other sessions may explain the reduction in the κ value for tissue texture abnormalities for that session. Session 4 had the largest cohort of participants with low back pain. The diagnostic procedures may have irritated the back, causing tissues to react to the palpatory pressures and resulting in changes in baseline pain characteristics. In future studies, participants could complete preexamination and postexamination questionnaires to assess their perceptions of any symptom change as a result of the examination. An objective measure, such as algometry, could also be used, though use of such a measure would defeat the purpose of evaluating the usefulness of diagnostic palpation. Further refinement of the descriptors used for evaluating nuances of tissue texture abnormalities may be helpful during consensus training, especially for symptomatic patients. 
In contrast, static segmental positional asymmetry of the transverse processes in the horizontal plane had the highest level of reliability for session 4 compared to the other sessions. Perhaps positional asymmetry is related to the cause of low back pain and is more prominent in individuals with low back pain as a result of tissue hypertonicity, a condition commonly found in individuals with low back pain. 
When assessing static segmental positional asymmetry of the transverse processes in the horizontal plane, interobserver reliability was significantly higher for women than for men. This result may be caused by commonly found hypertrophy and hypertonicity of the erector spinae musculature in men, making localization of the transverse process more difficult. Further study is required to determine if the higher interobserver reliability of positional asymmetry in women is a consistent finding and, if so, why it occurs. 
Although the assumption that heavier individuals are more difficult to palpate than lighter individuals may seem to be valid, data from the current study's predominantly overweight population did not substantiate this assumption. The mean (SD) BMI for the 6 sessions ranged from 25.8 (4.1) to 26.5 (5.7). No statistically significant effect of BMI on the reliability of any one type of test was observed in this study. Because of the narrow range of BMIs of study participants, additional studies on individuals with a wider range of BMIs are needed to determine whether the thickness of soft tissue impedes the reliability of these tests. 
Similar to previous findings for motion testing,3,4,9 resistance to anterior springing on the spinous processes in the current study had the poorest level of interobserver reliability. Despite recalibration, this test also showed the least improvement over time. Assessment of reliability assumes, by definition, that the characteristics being observed are stable. During recalibration, motion characteristics observed by the initial examiner often changed, indicating that the finding for resistance to anterior springing was not stable with repetitive testing. Although this change was not quantified in the current study, the fair to moderate reliability of this test reinforces findings from previous studies3,4,9 and calls into question the reproducibility of vertebral motion testing on a segmental basis. 
Alternatively, we may not know the best way to interpret findings of segmental motion tests. Statistical analysis suggested that reliability of the resistance to anterior springing motion test was significantly related to participant age. However, closer analysis of the data indicated that this difference, though statistically significant, may not be clinically significant. This difference in reliability was observed between participants aged 21 to 25 years and those older than 25 years. Differences in the prevalence of findings could account for the statistically significant relationship of age to reliability. 
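The prevalence point can be made concrete. If both examiners record a finding at the same rate p, chance agreement is p² + (1 − p)², so the percent agreement implied by a fixed κ rises as prevalence moves away from 50%. The sketch below assumes equal prevalence for both examiners and uses a hypothetical κ of 0.40; it shows only that groups with different prevalence can differ in percent agreement, the quantity modeled by the logistic regression, without differing in κ.

```python
def expected_agreement(kappa: float, prevalence: float) -> float:
    """Percent agreement implied by kappa when both examiners share one prevalence."""
    p_e = prevalence ** 2 + (1 - prevalence) ** 2   # chance agreement
    return 100 * (p_e + kappa * (1 - p_e))


for prevalence in (0.5, 0.7, 0.9):
    print(f"prevalence {prevalence:.0%}: agreement {expected_agreement(0.40, prevalence):.0f}%")
# prevalence 50%: agreement 70%
# prevalence 70%: agreement 75%
# prevalence 90%: agreement 89%
```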
In other studies,13,14 investigators have preconditioned vertebral segments by repetitively inducing motion before the actual testing to improve the likelihood of stable findings. Because preconditioning is not commonly performed in practice or taught in osteopathic medical training programs, it was not used in the current study. 
Although consensus training has been shown to be useful in establishing interobserver reliability, such findings are meaningless unless adequate interobserver reliability can be demonstrated throughout clinical research studies that use palpation as a diagnostic tool or as an outcome measure. Consequently, development of quality control and recalibration methods that can be effectively performed in clinical research is crucial. 
In our previous study on the effect of consensus training on interobserver reliability,9 comprehensive consensus training involved three phases. The first phase was pretraining interobserver reliability assessment and involved defining and standardizing testing procedures and interpretation. The second phase was consensus training, which involved simultaneous evaluation of numerous individuals by all study examiners to clarify the testing procedures and to refine examiner skills to the consensus standard. The third phase was posttraining interobserver reliability assessment and involved reassessment of examiner skills. This training established calibration parameters for specific palpatory tests for that group of examiners. The interobserver reliability assessment in posttraining demonstrated the effectiveness of the other two phases of comprehensive consensus training.9 In the current study, quality control procedures with recalibration, consisting of focused consensus training, were shown to be successful in maintaining and improving the interobserver reliability of the two examiners. 
The recalibration procedures of the current study are distinct from the comprehensive consensus training procedures previously published.9 Recalibration in the current study does not involve the simultaneous evaluation of every participant by every examiner, as occurs in comprehensive consensus training, in which the definition, performance, and interpretation of tests are refined. During recalibration, no changes or refinements are made in the definition, intended performance, or interpretation of the techniques. Furthermore, recalibration is performed only when a disagreement exists in findings. As a result, feedback from recalibration is efficient and can be performed without derailing clinical research studies that use diagnostic palpation. 
In the current study, percent agreement for each data collection session ranged from 70% to 97% (Tables 1 through 4). Pooled across all four tests and six sessions, the examiners agreed on approximately 80% of examinations; thus, for 80% of the examinations, no interaction occurred between examiners. If recalibration with focused consensus training is required for more than 25% of the tests performed, we propose that comprehensive consensus training be repeated. 
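The approximately 80% figure can be checked by pooling the session-level agreement rates in Tables 1 through 4, weighting each session by the number of vertebrae examined. The sketch below is a minimal illustration, assuming the rounded percentages in the tables are precise enough for this purpose (so the pooled value is approximate) and treating each disagreement as one recalibration event; the variable names are illustrative, and the 25% trigger is the one proposed above.

```python
# Sessions 1-6, transcribed from Tables 1-4 as
# (vertebrae examined, percent agreement). Percentages are rounded in the source.
SESSIONS = {
    "static asymmetry":   [(132, 77), (124, 74), (68, 88), (56, 89), (68, 90), (68, 91)],
    "tissue texture":     [(264, 76), (248, 72), (136, 74), (112, 76), (136, 82), (136, 78)],
    "anterior springing": [(132, 70), (124, 73), (68, 76), (56, 75), (68, 71), (68, 76)],
    "tenderness":         [(132, 91), (124, 91), (68, 97), (56, 82), (68, 94), (68, 87)],
}

# Proposed trigger: repeat comprehensive consensus training if focused
# recalibration is needed for more than 25% of tests.
RETRAIN_THRESHOLD = 0.25


def pooled_agreement(sessions):
    """Return the vertebra-weighted proportion of examinations with agreement."""
    agreed = 0.0
    total = 0
    for rows in sessions.values():
        for n, pct in rows:
            agreed += n * pct / 100
            total += n
    return agreed / total


agreement = pooled_agreement(SESSIONS)
disagreement = 1 - agreement  # share of examinations that would trigger recalibration
print(f"pooled agreement: {agreement:.1%}")
print(f"recalibration rate: {disagreement:.1%}")
print("repeat comprehensive training:", disagreement > RETRAIN_THRESHOLD)
```

With these inputs the pooled agreement is about 79.6% over 2580 examinations, so roughly 20% of examinations would have prompted focused recalibration, below the proposed 25% trigger.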
Acceptable κ values were achieved, and reliability improved over time, for several palpatory tests in the lumbar region in the current study. However, the confidence intervals were wide, with widths ranging from 0.23 to 0.66. The lower bound of these confidence intervals periodically extended below the range of acceptable reliability (κ<0.40).12 Therefore, the widths of the confidence intervals should be considered when interpreting results in this type of study. The accuracy of an estimated κ value, as measured by the margin of error (ie, half the width of the confidence interval), is affected by the prevalence rate of test findings, the sample size, and the true (ie, actual) level of reliability. For example, a κ value estimated to be 0.44, which is classified as acceptable under the current standard, may actually be as low as 0.21, indicating only fair reliability (resistance to anterior springing, session 6; Table 3). 
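The interval widths quoted above, and how often the lower bound fell below 0.40, can be checked directly from the session-level confidence intervals in Tables 1 through 4. The sketch below simply transcribes those intervals (sessions 1 to 6) and takes the margin of error as half the width, as defined above; the variable names are illustrative.

```python
# 95% CIs for kappa in sessions 1-6, transcribed from Tables 1-4 as (low, high).
CIS = {
    "static asymmetry":   [(0.42, 0.71), (0.27, 0.61), (0.31, 0.88), (0.52, 0.91), (0.43, 0.89), (0.26, 0.92)],
    "tissue texture":     [(0.29, 0.53), (0.32, 0.55), (0.30, 0.60), (0.01, 0.44), (0.40, 0.71), (0.29, 0.62)],
    "anterior springing": [(0.12, 0.47), (0.30, 0.62), (0.29, 0.71), (0.04, 0.58), (0.15, 0.60), (0.21, 0.67)],
    "tenderness":         [(0.54, 0.86), (0.50, 0.86), (0.72, 1.00), (0.40, 0.83), (0.63, 0.99), (0.45, 0.85)],
}
ACCEPTABLE_KAPPA = 0.40  # lower limit of acceptable reliability used in the text

widths = []
flagged = []
for test, intervals in CIS.items():
    for session, (low, high) in enumerate(intervals, start=1):
        width = round(high - low, 2)  # round away floating-point noise
        widths.append(width)
        if low < ACCEPTABLE_KAPPA:
            # Margin of error = half the CI width (intervals capped at 1.00
            # are not symmetric, so this is only an approximation for them).
            flagged.append((test, session, low, width / 2))

print(f"CI widths: {min(widths):.2f} to {max(widths):.2f}")
for test, session, low, margin in flagged:
    print(f"{test}, session {session}: lower bound {low:.2f}, margin {margin:.2f}")
```

With these values the widths span 0.23 to 0.66, and 14 of the 24 session-level intervals have a lower bound below 0.40.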
To help assess the feasibility of future reliability studies, we determined the margins of error for 95% confidence intervals for various prevalence values, true κ values, and sample sizes (Table 6). As shown in Table 6, for a palpatory test with a true κ of 0.60 and a prevalence of test findings of 80%, 120 observations would be required to establish with 95% confidence that the test has a κ of at least 0.40: with 120 observations the margin of error is 0.18 (lower bound, 0.42), whereas with 60 observations it is 0.26 (lower bound, 0.34). However, if the test had a true κ of 0.70 and a prevalence of 70%, the same level of confidence could be achieved with only 30 observations (margin of error, 0.28; lower bound, 0.42). If estimating reliability with sufficient accuracy (ie, narrow confidence intervals) requires 120 observations, repeatedly reassessing reliability would be unrealistic, especially when palpatory tests are used during a clinical study. Thus, focused recalibration methods, as proposed in the current study, are practical and have sufficient scientific rigor for use in clinical research. 
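This section does not state how the Table 6 margins were computed. One plausible route, assumed here rather than taken from the paper, is the large-sample variance of Cohen's κ for two raters given by Fleiss, Cohen, and Everitt (1969), with both examiners sharing the same prevalence of positive findings. Under that assumption, the sketch below reproduces the Table 6 entries we spot-checked, including the two scenarios discussed above; the function names are illustrative.

```python
from statistics import NormalDist


def kappa_margin(true_kappa, prevalence, n, confidence=0.95):
    """Margin of error (CI half-width) for Cohen's kappa in the 2x2 case.

    Assumption: both examiners call a finding positive with the same
    marginal probability (`prevalence`); variance follows the large-sample
    formula of Fleiss, Cohen, and Everitt (1969).
    """
    p, q, k = prevalence, 1 - prevalence, true_kappa
    pe = p * p + q * q  # chance agreement
    # Joint cell probabilities implied by the marginals and the true kappa.
    cells = [[p * p + k * p * q, p * q * (1 - k)],
             [p * q * (1 - k), q * q + k * p * q]]
    row = [p, q]  # examiner A marginals
    col = [p, q]  # examiner B marginals (equal by assumption)
    a = sum(cells[i][i] * (1 - (row[i] + col[i]) * (1 - k)) ** 2 for i in range(2))
    b = (1 - k) ** 2 * sum(cells[i][j] * (col[i] + row[j]) ** 2
                           for i in range(2) for j in range(2) if i != j)
    c = (k - pe * (1 - k)) ** 2
    se = ((a + b - c) / (n * (1 - pe) ** 2)) ** 0.5
    z = NormalDist().inv_cdf(0.5 + confidence / 2)  # 1.96 for 95%
    return z * se


def smallest_n(true_kappa, prevalence, sizes=(30, 60, 120), floor=0.40):
    """Smallest tabulated n whose CI lower bound stays at or above `floor`."""
    for n in sizes:
        if true_kappa - kappa_margin(true_kappa, prevalence, n) >= floor:
            return n
    return None  # none of the tabulated sizes is large enough


print(round(kappa_margin(0.40, 0.50, 30), 2))
print(smallest_n(0.60, 0.80))
print(smallest_n(0.70, 0.70))
```

In this sketch the margin grows as prevalence moves away from 50% because chance agreement rises and the (1 - pe) term in the denominator shrinks, which matches the pattern across each row of Table 6.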
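Table 6
Margins of Error for 95% Confidence Intervals for Varying Prevalence Values (%) and True Kappa Coefficients (κ) for Three Sample Sizes

| True κ | n | Prevalence 50% | Prevalence 60% | Prevalence 70% | Prevalence 80% | Prevalence 90% |
|---|---|---|---|---|---|---|
| 0.40 | 30 | 0.33 | 0.33 | 0.36 | 0.40 | 0.53 |
| 0.40 | 60 | 0.23 | 0.24 | 0.25 | 0.29 | 0.38 |
| 0.40 | 120 | 0.16 | 0.17 | 0.18 | 0.20 | 0.27 |
| 0.50 | 30 | 0.31 | 0.32 | 0.34 | 0.39 | 0.52 |
| 0.50 | 60 | 0.22 | 0.22 | 0.24 | 0.27 | 0.37 |
| 0.50 | 120 | 0.16 | 0.16 | 0.17 | 0.19 | 0.26 |
| 0.60 | 30 | 0.29 | 0.29 | 0.31 | 0.36 | 0.49 |
| 0.60 | 60 | 0.20 | 0.21 | 0.22 | 0.26 | 0.34 |
| 0.60 | 120 | 0.14 | 0.15 | 0.16 | 0.18 | 0.24 |
| 0.70 | 30 | 0.26 | 0.26 | 0.28 | 0.32 | 0.44 |
| 0.70 | 60 | 0.18 | 0.19 | 0.20 | 0.23 | 0.31 |
| 0.70 | 120 | 0.13 | 0.13 | 0.14 | 0.16 | 0.22 |
| 0.80 | 30 | 0.22 | 0.22 | 0.24 | 0.27 | 0.37 |
| 0.80 | 60 | 0.15 | 0.16 | 0.17 | 0.19 | 0.26 |
| 0.80 | 120 | 0.11 | 0.11 | 0.12 | 0.14 | 0.18 |
| 0.90 | 30 | 0.16 | 0.16 | 0.17 | 0.20 | 0.26 |
| 0.90 | 60 | 0.11 | 0.11 | 0.12 | 0.14 | 0.19 |
| 0.90 | 120 | 0.08 | 0.08 | 0.09 | 0.10 | 0.13 |
| 0.99 | 30 | 0.05 | 0.05 | 0.06 | 0.06 | 0.08 |
| 0.99 | 60 | 0.04 | 0.04 | 0.04 | 0.05 | 0.06 |
| 0.99 | 120 | 0.03 | 0.03 | 0.03 | 0.03 | 0.04 |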
We propose that establishing and maintaining interobserver reliability during clinical studies that use diagnostic palpation, either to evaluate study participants or as an outcome measure, can be accomplished with a two-step process: (1) establishing adequate interobserver reliability (κ≥0.40) using comprehensive consensus training and (2) maintaining interobserver reliability, initially by quality control sampling of all data and, when examiners do not agree on palpatory findings, by focused consensus training. After reliability has been sustained at an appropriate level over several data collection sessions, the frequency of quality control sampling (eg, daily, weekly, or monthly) should be determined in light of such factors as the number of landmarks assessed per study participant, the prevalence of positive findings, the number of examiners, and the number of data collection sites. 
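The maintenance step of this process can be sketched as a per-session check. The sketch below is hypothetical and is not the authors' software: the record layout, participant IDs, landmark labels, and data are invented for illustration, κ is the standard Cohen's kappa for present/absent findings, and the κ≥0.40 and 25% thresholds are the ones stated in the text.

```python
from dataclasses import dataclass


@dataclass
class Finding:
    """One landmark rated by both examiners (hypothetical record layout)."""
    participant: str
    landmark: str      # hypothetical label, e.g. a lumbar level such as "L2"
    examiner_a: bool   # True = somatic dysfunction judged present
    examiner_b: bool


def cohen_kappa(findings):
    """Cohen's kappa for two examiners' present/absent calls."""
    n = len(findings)
    observed = sum(f.examiner_a == f.examiner_b for f in findings) / n
    pa = sum(f.examiner_a for f in findings) / n  # examiner A positive rate
    pb = sum(f.examiner_b for f in findings) / n  # examiner B positive rate
    expected = pa * pb + (1 - pa) * (1 - pb)      # chance agreement
    if expected == 1:
        return float("nan")  # kappa is undefined when chance agreement is 1
    return (observed - expected) / (1 - expected)


def quality_control(findings, kappa_floor=0.40, retrain_threshold=0.25):
    """Hypothetical per-session check following the proposed two-step process."""
    disagreements = [f for f in findings if f.examiner_a != f.examiner_b]
    kappa = cohen_kappa(findings)
    return {
        "kappa": kappa,
        "kappa_acceptable": kappa >= kappa_floor,
        # Each disagreement is reviewed jointly in focused consensus training.
        "recalibrate": [(f.participant, f.landmark) for f in disagreements],
        # A high disagreement rate suggests repeating comprehensive training.
        "repeat_comprehensive_training": len(disagreements) / len(findings) > retrain_threshold,
    }


session = [
    Finding("P01", "L1", True, True),
    Finding("P01", "L2", True, False),
    Finding("P01", "L3", False, False),
    Finding("P01", "L4", True, True),
    Finding("P02", "L1", False, False),
    Finding("P02", "L2", True, True),
    Finding("P02", "L3", True, True),
    Finding("P02", "L4", False, False),
]
print(quality_control(session))
```

For this invented session, one of eight landmarks (P01, L2) is queued for focused consensus training, κ is 0.75, and the 12.5% disagreement rate stays below the 25% trigger. A κ estimated from only eight observations would, of course, carry a very wide confidence interval.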
Further study is needed to determine the optimal frequency of quality control assessments and the conditions under which the quality control sampling rate can be reduced. Such standards may vary depending on the examiners performing the research. 
The current study has several strengths and limitations. The study was conducted in a busy patient care setting, demonstrating that practicing physicians can feasibly perform this type of research. The difference in clinical experience between the two examiners supports the generalizability of the findings, because interobserver reliability was maintained and improved for examiners whose clinical experience ranged from 3 to 10 years. The participant demographics are a limitation because of the narrow age and BMI ranges, the predominance of women, and the high prevalence rates of certain somatic dysfunctions. 
Conclusions
The current study indicates that quality control sampling on all data, with occasional recalibration consisting of focused consensus training, can maintain and improve the interobserver reliability of commonly used palpatory diagnostic tests of the lumbar spine in an efficient and meaningful manner. The study design may be particularly useful in manual therapy research and educational programs. Future studies should consider the prevalence rates of test findings when determining sample size and study feasibility, as well as the confidence intervals when interpreting the reliability of test findings. 
 Financial Disclosures: The authors have no conflicts of interest to declare. This study was supported by grants from the National Institutes of Health—National Center for Complementary and Alternative Medicine, grant no. 1R01AT00305, and the American Osteopathic Association, grant no. 00-04-505.
 
The authors thank Patty Lyons for her technical support and Deborah Goggin, MA, for her editorial support. 
1. Najm WI, Seffinger MA, Mishra SI, et al. Content validity of manual spinal palpatory exams - a systematic review [published online ahead of print May 7, 2003]. BMC Complement Altern Med. 2003;3(1):1. http://www.ncbi.nlm.nih.gov/pmc/articles/PMC156889. Accessed September 3, 2010.
2. Potter L, McCarthy C, Oldham J. Intraexaminer reliability of identifying a dysfunctional segment in the thoracic and lumbar spine. J Manipulative Physiol Ther. 2006;29(3):203-207.
3. Seffinger MA, Najm WI, Mishra SI, et al. Reliability of spinal palpation for diagnosis of back and neck pain: a systematic review of the literature. Spine. 2004;29(19):E413-E425.
4. Stochkendahl MJ, Christensen HW, Hartvigsen J, et al. Manual examination of the spine: a systematic critical literature review of reproducibility. J Manipulative Physiol Ther. 2006;29(6):475-485.
5. Haas M, Panzer D. Palpatory diagnosis of subluxation. In: Gatterman MI, ed. Foundations of Chiropractic: Subluxation. 2nd ed. St Louis, MO: Elsevier Mosby; 2005:104-114.
6. Hestbaek L, Leboeuf-Yde C. Are chiropractic tests for the lumbo-pelvic spine reliable and valid? A systematic critical literature review. J Manipulative Physiol Ther. 2000;23(4):258-275.
7. Huijbregts P. Spinal motion palpation: a review of reliability studies. J Man Manip Ther. 2002;10(1):24-39.
8. van der Wurff P, Hagmeijer RH, Meyne W. Clinical tests of the sacroiliac joint: a systemic methodological review. Part 1: reliability [review]. Man Ther. 2000;5(1):30-36.
9. Degenhardt BF, Snider KT, Snider EJ, Johnson JC. Interobserver reliability of osteopathic palpatory diagnostic tests of the lumbar spine: improvements from consensus training. J Am Osteopath Assoc. 2005;105(10):465-473. http://www.jaoa.org/cgi/reprint/105/10/465. Accessed September 3, 2010.
10. Christensen HW, Vach W, Vach K, et al. Palpation of the upper thoracic spine: an observer reliability study. J Manipulative Physiol Ther. 2002;25(5):285-292.
11. Landis JR, Koch GG. The measurement of observer agreement for categorical data. Biometrics. 1977;33(1):159-174.
12. Fjellner A, Bexander C, Faleij R, Strender LE. Interexaminer reliability in physical examination of the cervical spine. J Manipulative Physiol Ther. 1999;22(8):511-516.
13. Rothstein JM, Echternach JL. Primer on Measurement: An Introductory Guide to Measurement Issues. Alexandria, VA: American Physical Therapy Association; 1993.
14. van Trijffel E, Anderegg Q, Bossuyt PM, Lucas C. Inter-examiner reliability of passive assessment of intervertebral motion in the cervical and lumbar spine: a systematic review [published online ahead of print July 1, 2005]. Man Ther. 2005;10(4):256-269.
Figure.
Description of osteopathic diagnostic palpatory tests used to measure maintenance of interobserver reliability over a 4-month period. The tests are listed in the sequence they were performed in each of 6 sessions during the study period.
Table 1
Interobserver Reliability and Maintenance of Reliability for Static Segmental Positional Asymmetry of the Transverse Processes in the Horizontal Plane (N=64)

| Testing Session (n) | Study Week | Total Vertebrae Examined, No. | Agreement, % | κ (95% CI) | Prevalence, % |
|---|---|---|---|---|---|
| Pretraining (42) | -38 | 168 | 40 | 0.18 (0.08-0.27) | 72 |
| Posttraining (30) | -20 | 120 | 84* | 0.29 (0.02-0.56) | 93 |
| Session 1 (33) | 0 | 132 | 77* | 0.56 (0.42-0.71) | 88 |
| Session 2 (31) | 8 | 124 | 74* | 0.44 (0.27-0.61) | 88 |
| Session 3 (17) | 9 | 68 | 88* | 0.60 (0.31-0.88) | 91 |
| Session 4 (14) | 10 | 56 | 89* | 0.72 (0.52-0.91) | 95 |
| Session 5 (17) | 13 | 68 | 90* | 0.66 (0.43-0.89) | 96 |
| Session 6 (17) | 17 | 68 | 91* | 0.59 (0.26-0.92) | 91 |

Abbreviations: CI, confidence interval; κ, kappa coefficient.
*Significant difference in examiner agreement compared to pretraining session (P<.001).
Table 2
Interobserver Reliability and Maintenance of Reliability for Tissue Texture Abnormalities (N=64)

| Testing Session (n) | Study Week | Total Vertebrae Examined, No. | Agreement, % | κ (95% CI) | Prevalence, % |
|---|---|---|---|---|---|
| Pretraining (42) | -38 | 336 | 58 | -0.01 (-0.11 to 0.10) | 29 |
| Posttraining (30) | -20 | 240 | 69* | 0.37 (0.26-0.49) | 57 |
| Session 1 (33) | 0 | 264 | 76* | 0.41 (0.29-0.53) | 72 |
| Session 2 (31) | 8 | 248 | 72* | 0.43 (0.32-0.55) | 57 |
| Session 3 (17) | 9 | 136 | 74* | 0.45 (0.30-0.60) | 60 |
| Session 4 (14) | 10 | 112 | 76* | 0.23 (0.01-0.44) | 81 |
| Session 5 (17) | 13 | 136 | 82* | 0.55 (0.40-0.71) | 73 |
| Session 6 (17) | 17 | 136 | 78* | 0.45 (0.29-0.62) | 72 |

Abbreviations: CI, confidence interval; κ, kappa coefficient.
*Significant difference in examiner agreement compared to pretraining session (P<.001).
Table 3
Interobserver Reliability and Maintenance of Reliability for Resistance to Anterior Springing on the Spinous Processes (N=64)

| Testing Session (n) | Study Week | Total Vertebrae Examined, No. | Agreement, % | κ (95% CI) | Prevalence, % |
|---|---|---|---|---|---|
| Posttraining (14) | -8 | 60 | 70 | 0.29 (0.03-0.55) | 73 |
| Session 1 (33) | 0 | 132 | 70 | 0.30 (0.12-0.47) | 72 |
| Session 2 (31) | 8 | 124 | 73 | 0.46 (0.30-0.62) | 55 |
| Session 3 (17) | 9 | 68 | 76 | 0.50 (0.29-0.71) | 50 |
| Session 4 (14) | 10 | 56 | 75 | 0.31 (0.04-0.58) | 52 |
| Session 5 (17) | 13 | 68 | 71 | 0.37 (0.15-0.60) | 57 |
| Session 6 (17) | 17 | 68 | 76 | 0.44 (0.21-0.67) | 61 |

Abbreviations: CI, confidence interval; κ, kappa coefficient.
*Because of a change in testing procedures between the pretraining and posttraining sessions, comparisons for resistance to anterior springing are made to the posttraining session.
Table 4
Interobserver Reliability and Maintenance of Reliability for Pain Provocation to Assess Tenderness Over the Spinous Processes (N=64)

| Testing Session (n) | Study Week | Total Vertebrae Examined, No. | Agreement, % | κ (95% CI) | Prevalence, % |
|---|---|---|---|---|---|
| Pretraining (42) | -38 | 168 | 75 | 0.32 (0.16-0.49) | 24 |
| Posttraining (3) | -26 | 12 | 83 | 0.56 (0.01-1.00) | 25 |
| Session 1 (33) | 0 | 132 | 91* | 0.70 (0.54-0.86) | 18 |
| Session 2 (31) | 8 | 124 | 91* | 0.68 (0.50-0.86) | 17 |
| Session 3 (17) | 9 | 68 | 97* | 0.88 (0.72-1.00) | 15 |
| Session 4 (14) | 10 | 56 | 82 | 0.61 (0.40-0.83) | 36 |
| Session 5 (17) | 13 | 68 | 94* | 0.81 (0.63-0.99) | 19 |
| Session 6 (17) | 17 | 68 | 87* | 0.65 (0.45-0.85) | 24 |

Abbreviations: CI, confidence interval; κ, kappa coefficient.
*Significant difference in examiner agreement compared to pretraining session (P<.001).
Table 5
Demographic Characteristics of Participants in Study of Interobserver Reliability of Osteopathic Palpatory Tests (N=64)

| Testing Session (n) | Study Week | Men, No. (%) | Women, No. (%) | Presence of Low Back Pain, No. (%) | Age, mean (SD) | Body Mass Index, mean (SD) |
|---|---|---|---|---|---|---|
| Pretraining (42) | -38 | 32 (76) | 10 (24) | ... | 26 (4) | ... |
| Posttraining* (3) | -26 | 2 (67) | 1 (33) | ... | 27 (1) | ... |
| Posttraining* (30) | -20 | 14 (47) | 16 (53) | ... | 29 (6) | ... |
| Posttraining* (14) | -8 | 7 (50) | 7 (50) | ... | 36 (13) | ... |
| Session 1 (33) | 0 | 8 (24) | 25 (76) | 2 (6) | 30 (6) | 26.3 (5.7) |
| Session 2 (31) | 8 | 8 (26) | 23 (74) | 2 (6) | 31 (6) | 26.5 (5.7) |
| Session 3 (17) | 9 | 2 (12) | 15 (88) | 0 | 27 (7) | 25.9 (4.2) |
| Session 4 (14) | 10 | 5 (36) | 9 (64) | 14 (100) | 29 (6) | 25.8 (4.1) |
| Session 5 (17) | 13 | 2 (12) | 15 (88) | 0 | 27 (7) | 25.9 (4.2) |
| Session 6 (17) | 17 | 2 (12) | 15 (88) | 0 | 27 (7) | 25.9 (4.2) |

... indicates data not collected.
*Posttraining for the four palpatory tests occurred during different testing sessions.