What If Your Organization Gets Sued for Employment Testing? A Case Study of an EEO Employment Test Validation Suit: Smith vs. City of Boston - Part 2
In Part 1 we examined the court’s ruling on whether or not disparate impact had occurred. We will now examine the court’s ruling on the validity of the test.

Court Ruling on Test Validity: Is It Job-Related and Consistent with Business Necessity?

Since disparate impact was demonstrated, the second question before the court was whether the 2008 test was both job-related for the position of Boston Police Department (BPD) lieutenant and consistent with business necessity, as required by the Uniform Guidelines. "Job-related and consistent with business necessity" means that the test was related to the job and was necessary to help the business function effectively.

Three Types of Test Validation Methods Described in the Uniform Guidelines

Three types of test validation methods are described in the Uniform Guidelines for use in determining whether practices, procedures, or tests (PPT) are job-related and consistent with business necessity: criterion validation, content validation, and construct validation. The Uniform Guidelines provides a set of minimally accepted requirements to follow when conducting validation studies. It does not, however, prescribe a single, detailed methodology that must be followed for a validation study to be legally defensible. The first validation method, criterion validity, provides statistical evidence that those who perform better on the PPT are more likely to be successful on the job, showing that the PPT is job-related. The second validation method, content validity, provides inferential evidence that a PPT is job-related. This is accomplished through an in-depth study of the job (a job analysis) and a series of subject matter expert (SME) opinion surveys. The third validation method, construct validity, is demonstrated by identifying the relationships among three things: a specific job-related characteristic, a PPT measuring that characteristic, and measures of job performance. Due to the difficulty and complexity of demonstrating construct validity, this method is rarely used in PPT evaluations.

Both criterion and content validation studies typically begin with a review of documents containing previously developed job analyses, job descriptions, and other information the employer may have developed. These documents typically describe the important duties performed by those in the position and the knowledge, skills, abilities, and personal characteristics (KSAPCs) needed to perform those duties. Knowledge, skills, abilities, and personal characteristics are attributes that underlie the successful performance of job duties. This information serves as the foundation for the required job analysis (in the case of a content validation study) or analysis of the job (in the case of a criterion validation study). A job analysis is an in-depth analysis of the job a PPT is being created for, and includes documenting the important duties performed by those who hold the job and the KSAPCs needed to perform those duties. It also involves collecting survey data from SMEs on those KSAPCs in several areas, such as the level of importance and frequency of KSAPCs and job duties. An analysis of the job is much less rigorous than a job analysis; it involves reviewing job information to determine measures of work behaviors or performance that are relevant to the job. In the BPD examination case, a content validation approach was taken and therefore a job analysis was conducted.

Content Validity Documentation

A variety of methodologies may be used when conducting a job analysis. However, the more closely the methodology follows the Uniform Guidelines section 14C for content validity, the more legally defensible the job analysis will be in court. The Uniform Guidelines also outlines how job analysis, test development, and test validation must be documented differently for content, criterion, and construct validity studies. This is important for employers to keep in mind because the more thoroughly the test developer documents the specific steps followed for the job analysis, test development, and test validation, the more legally defensible the selection procedure is likely to be. Section 15C of the Uniform Guidelines outlines 19 essential elements and eight additional elements that should be included for content validity. The additional elements are not listed as essential, but should be included where applicable, because certain circumstances can make them difficult to include, or alternate approaches can sometimes be taken.

Required areas include the following:

  • dates and locations of the job analysis
  • the circumstances under which the study was conducted
  • elements of the job analysis
  • elements of the selection procedure and its content
  • the relationship between the selection procedure and the job
  • alternative selection procedures investigated
  • uses and application of the selection procedure
  • contact person
  • accuracy and completeness
The following sections review this documentation and discuss how the City did, or did not, address these elements.

Job Analysis of the Boston Police Department Lieutenant Examined

Next, the job analysis of the BPD lieutenant was examined to determine whether it met the requirements of content validation in the Uniform Guidelines. In the case of the BPD examination, the court felt that the City addressed the job analysis requirements of the Uniform Guidelines sufficiently.

The Role of a Boston Police Department Lieutenant

Before explaining the review of the job analysis documentation, it is important to first gain a high-level understanding of the role of a BPD lieutenant. The review of the job analysis documents, conducted to determine whether the test was job-related and consistent with business necessity, revealed that lieutenants at the Boston Police Department serve as second-line supervisors: they supervise sergeants, who in turn supervise police officers. Lieutenants are also in charge of station houses, are responsible for arresting suspects, and are responsible for the safety of prisoners. There is also a significant amount of desk work required in the station.

Lieutenants are required to work outside of the station, including talking with citizens at community meetings and taking control of scenes of major incidents. Supervisory skills required of lieutenants include the ability to motivate employees and to communicate information among ranks. The official department job description for lieutenant has not changed since 1979, and current Boston Police Department Commissioner William Evans testified that it was still accurate. The Uniform Guidelines emphasizes updating job analyses as the job changes over time and gives a rule of thumb of re-examining job analyses every five years.

Job Analyses Used as the Basis for Creating the BPD Exam

The following is a more detailed discussion of the job analyses used as evidence in the case. Three different job analyses served as the foundation for the 2008 exam: one conducted in 1991, one conducted in 2000 that incorporated some elements of the 1991 study, and an abbreviated job analysis conducted in 2008. The 2008 abbreviated job analysis was a slight update to the 2000 job analysis and was ultimately used in creating the 2008 exam. Only the 2000 and 2008 job analyses are discussed here because they were the most foundational to the exam's development.

2000 Job Analysis

For the 2000 job analysis, the City contracted with an outside consulting firm. The firm first created a list of 302 possibly relevant tasks Boston police lieutenants perform, as well as the knowledge, skills, and abilities (KSAs) necessary to perform those tasks. Twelve SMEs, consisting of Department employees holding the rank of lieutenant or higher, rated the tasks for frequency, importance, necessity of performing the task upon starting the job, and how related successful performance of the task was to successful job performance. If 10 of the 12 SMEs rated a task as "very important" or "important" upon entry to the job, and agreed that performance of that task clearly separated the best or better workers from inferior workers, then the task satisfied the City's criteria for inclusion in the final job analysis. Of the initial 302 tasks, 281 fulfilled the criteria.

The SMEs were then asked to determine which of the following dimensions were required for each task: oral communication, interpersonal skills, problem identification and analysis, judgment, and planning and organizing. Then a list of 149 KSAs potentially necessary to perform the 281 tasks was composed. Next, the SMEs were asked whether the KSAs related to the job of police lieutenant, when the KSA was learned (before or after assignment to the job), how long it took to learn the KSA, how the KSA differentiated performance, and whether the KSA was required to perform the job effectively.

For a KSA to be important enough to be tested, nine of the 12 SMEs must have rated the KSA as:

  • related to the job
  • learned before assignment to the job
  • requiring more training than a brief orientation period
  • having the capability of distinguishing performance to a high or moderate degree
  • required or desirable to perform the job effectively
Of the 149 KSAs rated by the SMEs, 145 met the criteria.
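The 9-of-12 screening rule described above can be sketched as a simple filter. This is a minimal illustration only: the rating fields and data structure below are hypothetical, not drawn from the actual study.

```python
# Minimal sketch of the KSA screening rule described above.
# The 9-of-12 threshold follows the article; field names are hypothetical.

def ksa_passes(ratings, threshold=9):
    """A KSA is retained if at least `threshold` SMEs gave it
    all five qualifying ratings."""
    qualifying = sum(
        1 for r in ratings
        if r["job_related"]
        and r["learned_before_assignment"]
        and r["needs_more_than_orientation"]
        and r["distinguishes_performance"] in ("high", "moderate")
        and r["required_or_desirable"]
    )
    return qualifying >= threshold

# Example: 10 of 12 SMEs give a KSA all qualifying ratings.
sme_yes = {
    "job_related": True,
    "learned_before_assignment": True,
    "needs_more_than_orientation": True,
    "distinguishes_performance": "high",
    "required_or_desirable": True,
}
sme_no = dict(sme_yes, job_related=False)
ratings = [sme_yes] * 10 + [sme_no] * 2
print(ksa_passes(ratings))  # True: 10 >= 9
```

Applying a rule like this to each of the 149 KSAs would yield the 145 that survived the screen.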

2008 Job Analysis

For the 2008 job analysis, SMEs were asked to re-rate each of the 149 KSAs used in the job analysis conducted in 2000. SMEs rated a sufficient number of the 149 KSAs in 2008 as meeting the criteria previously outlined as important enough to be tested.

Court Ruling on the Job Analysis

  • The court felt that the City addressed the job analysis requirements of the Uniform Guidelines sufficiently.
Test Development and Validation

In the next phase, the court examined the extent to which the content of the exam was related to the job. The court looked at the test’s development and its validation. The court also examined how the department used the exam to make promotional decisions. The 2008 exam consisted of two elements: a written, closed-book exam consisting of 100 multiple-choice questions, and an Education and Experience (E&E) rating. The next section will examine the method used to develop the exams, the extent to which the exam was a representative sample of the job, and how the exam was used to select lieutenants.

Often, the job analysis is finalized before the multiple-choice exam is created. The test developer then frequently converts the job analysis into a test plan document that outlines which KSAs will be assessed by the exam. This is done to ensure the exam is a representative sample of the job. In this case, a test outline was created and 100 test items were written to measure certain KSAs. The SMEs then reviewed the test questions, identified which KSAs matched the questions, and evaluated the questions for difficulty, readability, and recommendation for use. SME opinion is critical at this stage in the test development process because it provides validation that the test items are related to the job. The court felt that the City did an adequate job of addressing the Uniform Guidelines for this portion of the process.

The education and experience (E&E) portion of the exam was then examined by the court for its compliance with the Uniform Guidelines. The E&E score was a measure of prior education and experience. Out of 100 possible points on the written examination, the City required candidates to score at least 70 points to pass. The E&E score was then calculated only for candidates who passed the written exam. The written portion accounted for 80% of the final score; the E&E component for 20%. Every candidate was automatically awarded 14 of the 20 possible E&E points. The court ultimately decided to exclude the entire E&E portion of the exam from its analysis because, compared to the written exam, it contributed very little to the rank ordering of candidates on the eligible list (the list of candidates eligible for hire). In fact, the correlation between candidates' scores on the written exam and their final exam scores was .95, an almost perfect positive correlation. The City also provided no evidence linking the E&E with tasks or KSAs from the job analysis.

The court next examined the evidence regarding the extent to which the exam evaluated a representative sample of the job skills. This was done because the Uniform Guidelines states in 14C(1) that "A selection procedure can be supported by a content validity strategy to the extent that it is a representative sample of the content of the job." The 2000 job analysis identified 145 KSAs critical to performing the job. The written exam assessed 13 knowledge categories which, because they were very broadly worded, were estimated to cover about 80% of the critical knowledge. However, only two of the critical ability areas were assessed. Thus, the court concluded that the 2008 exam did not test a representative sample of the critical KSAs because it did not reflect many of the skills and abilities necessary to perform the job of lieutenant. In the overall assessment of content validity, this was one of the primary reasons the exam was ultimately found not to meet the standards of the Uniform Guidelines.

In prior years, the City had used an assessment center designed to test skills and abilities such as oral communication, interpersonal skills, the ability to quickly identify and analyze a problem, the ability to make sound decisions promptly, and the ability to break work down into subtasks and prioritize them. These were assessed through a variety of exercises, including an in-basket exercise (a simulated written exercise) and a situational exercise in which candidates were videotaped giving verbal responses to hypothetical scenarios a lieutenant might encounter. The City decided not to use an assessment center for the 2008 exam process; had it done so, it is much more likely that the court would have found the exam to be a representative sample of the job, because the process could have measured additional skills and abilities such as communication ability, interpersonal skills, and situational judgment.

The court then evaluated evidence regarding the reliability of the test. Section 14C(5) of the Uniform Guidelines states: "whenever feasible, appropriate statistical estimates should be made of the reliability of the selection procedure." Reliability in this situation would have likely measured the extent to which the items on the exam were measuring the same area, such as job knowledge. The City did not present evidence of conducting any type of reliability analysis, and the court faulted the City for this.

Court Ruling

  • Not enough KSAs were tested.
  • The reliability of the test wasn’t demonstrated.
Evaluating How the Exam Was Used to Make Selection Decisions

Another important consideration in evaluating the validity of a PPT is how it is used to make selection decisions. There are three primary ways a test can be used to make selection decisions. If the goal is to separate qualified from unqualified applicants, the test should be used on a pass/fail basis with a minimum passing score. If the goal is to make distinctions among candidates who are equally qualified but have slightly different raw scores on a PPT, then banding should be used. Banding is a statistical procedure that puts similarly scoring applicants into groups, with each group treated as having the same score. Ranking should be used if the goal is to make decisions based on test score on an applicant-by-applicant basis; in other words, the test is used in rank-order fashion to hire or advance applicants starting from the top of the list and moving down. If the goal is to make decisions based on several selection procedures across many KSAPCs that differ in levels of importance, then a weighted or combined selection process can be used. The level of validity and reliability the courts require increases going from pass/fail to banding to ranking (Biddle, 2011).¹ Because how a test is used is so important in determining its overall validity, courts examine test use with a high level of scrutiny.
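Banding, as described above, can be sketched with a simple grouping rule. The 3-point band width below is arbitrary and purely illustrative; in practice the width is typically derived from a statistic such as the test's standard error, not chosen by hand.

```python
# Hedged sketch of score banding: candidates whose scores fall within a
# chosen band width of the band's top score are treated as tied.
# The band width here is illustrative, not a recommended value.

def band_scores(scores, band_width=3):
    """Group scores (highest first) into bands, each band anchored
    at its top score."""
    bands = []
    for s in sorted(scores, reverse=True):
        if bands and bands[-1][0] - s <= band_width:
            bands[-1].append(s)   # within the current band: treat as tied
        else:
            bands.append([s])     # start a new band at this score
    return bands

print(band_scores([95, 94, 93, 88, 87, 80]))
# → [[95, 94, 93], [88, 87], [80]]
```

Everyone within a band can then be considered equally qualified, which reduces the weight placed on small, possibly unreliable score differences.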

The City chose to use a minimum passing score on the exam. Section 5(H) of the Uniform Guidelines states "where cutoff scores [minimum passing scores] are used, they should normally be set so as to be reasonable and consistent with normal expectations of acceptable proficiency within the work force." Suppose a multiple-choice test was created for an entry-level police officer where all the items were completely job-relevant, but an arbitrary 90% minimum passing score was set. What evidence is there that 90% is the correct cutoff score to accurately identify minimally qualified candidates? Without job expert input regarding what a minimally qualified candidate would score on the test, a 90% cutoff can't be justified. The City chose a 70% cutoff for the 2008 exam but provided no rationale for doing so. It also weighted the written portion of the exam 80% and the E&E 20%. Referencing past exams, the City argued that SMEs would probably have chosen the 80%/20% weighting formula; however, the SMEs were never actually surveyed to determine this. There was no indication that the City conducted any analyses to support either the cutoff score or the weighting.
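The kind of job-expert input described above is often gathered with a modified Angoff procedure, in which each SME estimates, item by item, the probability that a minimally qualified candidate would answer correctly. The City did not do this; the sketch below is illustrative only, and all ratings are invented.

```python
from statistics import mean

# Minimal sketch of a modified Angoff approach to setting a cutoff
# score. Each row holds one SME's per-item probability estimates for a
# minimally qualified candidate. All numbers are hypothetical.

sme_item_ratings = [
    [0.7, 0.8, 0.6, 0.9],   # SME 1's estimates for four items
    [0.6, 0.7, 0.7, 0.8],   # SME 2
    [0.8, 0.7, 0.6, 0.9],   # SME 3
]

# The cutoff is the expected score of a minimally qualified candidate:
# average each SME's estimates, then average across SMEs.
cutoff_pct = mean(mean(r) for r in sme_item_ratings)
print(round(cutoff_pct * 100, 1))  # → 73.3 (percent-correct cutoff)
```

A cutoff derived this way is anchored to expert judgment about minimal competence, which is far easier to defend than an arbitrary round number like 70%.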

For the candidates who passed the written exam with 70% correct, the E&E score was then applied to their overall score and the candidates were chosen for promotion in a rank-order fashion. Section 5(G) of the Uniform Guidelines states: “Evidence which may be sufficient to support the use of a selection procedure on a pass/fail (screening) basis may be insufficient to support the use of the same procedure on a ranking basis.” Since the reliability and validity standards are the highest for rank ordering, the court took a particularly close look at this aspect of the selection process.
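The rank-order selection described above amounts to sorting the eligible list by final score and promoting from the top down. The names, scores, and number of vacancies below are invented for illustration.

```python
# Sketch of rank-order selection from an eligible list: candidates who
# passed the written exam are ordered by final score and promoted from
# the top of the list down. All data here are hypothetical.

candidates = [("Avery", 88.0), ("Blake", 91.5), ("Casey", 84.2), ("Drew", 91.5)]

eligible_list = sorted(candidates, key=lambda c: c[1], reverse=True)
promoted = [name for name, _ in eligible_list[:2]]  # e.g., two vacancies
print(promoted)  # → ['Blake', 'Drew']
```

Because every fraction of a point can change who is promoted, rank ordering demands the strongest evidence of reliability and validity, which is why the court scrutinized this use so closely.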

Court Ruling

  • The weighting scheme of the written exam and E&E wasn’t justified.
  • An arbitrary 70% cutoff was used and couldn’t be justified.
Overall Court Rulings

In the case of the BPD exam administration, the court first ruled that disparate impact had occurred. The next portion of the legal proceedings examined if the exam met the standards of content validity outlined in the Uniform Guidelines. While the court determined that the job analysis portion met the standards of content validity, the court ruled that the exam itself did not meet the standards for content validity for the following reasons:

  • Not enough KSAs were tested.
  • The reliability of the test wasn’t demonstrated.
  • The weighting scheme of the written exam and E&E wasn’t justified.
  • An arbitrary 70% cutoff was used and couldn’t be justified.

The potential cost of litigation is high, and there is a lot of value in having a valid selection process that identifies the best candidates for the job. Understanding the process of a Title VII disparate impact case can help your agency make informed decisions about your testing process. This case has highlighted some of the many aspects that are important to consider in evaluating disparate impact and the role of job analysis, test development, and test validation in a Title VII disparate impact challenge. Gaining a thorough understanding of the Uniform Guidelines is crucial to understanding how courts evaluate disparate impact test validation lawsuits. While this case follows a typical process, every case is slightly different. For instance, had a criterion or construct validation approach been used, the test would have been evaluated for compliance with the Uniform Guidelines section 15(B) for criterion-related validity studies, or section 15(D) for construct validity studies. In the current case, a measure of education and experience and a multiple-choice written test were evaluated. However, a variety of other PPTs are litigated as well, including interviews, work-sample tests, personality tests, and physical ability tests, just to name a few.

1. Biddle, D. A. (2011). Adverse Impact and Test Validation: A Practitioner’s Handbook (3rd ed.). Scottsdale, AZ: Infinity Publishing.