Featured Speakers: Rebecca Lipner, American Board of Internal Medicine; and Bradley Brossman, American Board of Internal Medicine
In the spirit of information transparency and quality improvement, a redesign of the traditional score report was undertaken to deliver more meaningful and detailed feedback to physicians taking high-stakes medical certification examinations. A new score report was produced following many months of measurement research and input from the physician community through focus groups, “think-aloud” usability interviews, and surveys with randomly-selected examinees.
Based on the initial focus groups and usability studies, the redesign addressed several goals: a simpler design, graphical displays of information, meaningful content subscores, and detailed information to help examinees better understand their performance gaps. The report follows an inverted-pyramid style, presenting the broadest information first and adding detail in each subsequent section (i.e., the pass-fail decision first, followed by the exam score, subscores, and descriptions of questions missed).
This session will describe the changes made and the process used to make them. Examples include graphical displays that made it easier to understand where the examinee stood compared with the passing score and with other physicians, and measurement research leading to an improved method of reporting subscores that corrects the exaggerated estimates of ability that can sometimes occur when content areas contain only a small number of questions. A listing of blueprint descriptors for each question missed—along with the medical task of that question—was also used as a way to provide more detailed information without sacrificing test security.
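The abstract does not name the subscore-correction method, but shrinkage (Kelley-style regression toward the group mean) is one standard way to temper subscores estimated from only a few items. A minimal sketch, with invented reliability values chosen purely for illustration:

```python
# Hedged sketch: Kelley-style shrinkage of an observed subscore toward
# the group mean. The reliability values below are invented; a subscore
# built from few items has low reliability and is pulled strongly
# toward the mean, correcting exaggerated estimates of ability.

def shrink_subscore(observed, group_mean, reliability):
    """Regress an observed subscore toward the group mean.

    reliability: estimated subscore reliability in [0, 1]; lower
    reliability (fewer items) means more shrinkage toward the mean.
    """
    return reliability * observed + (1.0 - reliability) * group_mean

# A 4-item subscore (low reliability) is tempered heavily...
extreme = shrink_subscore(observed=95.0, group_mean=70.0, reliability=0.40)
# ...while a 40-item subscore (high reliability) is mostly left alone.
stable = shrink_subscore(observed=95.0, group_mean=70.0, reliability=0.90)
```

The shrunken estimate for the short subscale (80.0) sits much closer to the group mean than the raw 95.0, which is the behavior the redesigned report's subscores are described as achieving.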
Featured Speaker: Dave Winsborough, Hogan Assessment Systems
For decades, the bulk of traditional commercial assessment has involved participants responding to carefully researched items that are aggregated into scales. Arguments about test development, item response formats, statistical properties of items, modes of administration, and other questions may be useful for the academic community, but have changed the applied assessment industry very little over the last 40 years. Criterion measurement has fared about the same, and has often delivered even less in terms of advancing the industry. In many ways, moving test administration from test booklets and scantron sheets to online and adaptive testing simply shifted forms onto screens and sped up the calculation of scale scores.
On the other hand, digitization has created a fundamentally different testing landscape by radically converting manual, offline processes into online, networked, computer supported (and often dependent) processes. Entire organizations are undertaking digitization initiatives, and increasingly our individual lives are taking place in digitized environments as well. These significant shifts bear important implications for the testing industry.
Specifically, these changes have enabled four significant forces that disrupt traditional assessment. First, traditional assessment items may become less and less relevant over time—or perhaps eventually disappear—as useful behavioral signals are increasingly sampled from digitized human behavior. Voice recognition software, video-based interviewing, and other sources of digital summary data (e.g., geolocation, browser use, online response latencies, and email content analysis) are just a few examples. Second, some data scientists are putting aside theory development in favor of simply jumping into the data, mining vast pools of digital content for relationships between variables. Third, these disruptions have transformed our notions of outcome variables from traditional measures to aggregated digital data sources such as financial transactions and physiological information. Lastly, and most crucially, testing is either disappearing or being transformed through gamification and other forms of entertainment-based testing, and customers prefer it.
In short, disruption is already occurring and testing is being commodified. Given the choice between being disruptors or being disrupted, this session will discuss how the industry should respond.
Featured Speakers: Mark Gierl, University of Alberta; and Andre De Champlain, Medical Council of Canada
On-demand testing is commonplace among most large-scale testing programs because it affords greater flexibility in session scheduling as well as in the candidate's selection of a testing location. This flexibility imposes challenges on programs, however, including overexposure of items due to the high frequency at which exams are administered. Robust item banks—usually predicated on an increase in committee-based item writing efforts—are needed to support routine retirement and replenishment of items.
The Medical Council of Canada (MCC) has been exploring an item development process that might streamline costly traditional approaches while yielding a number of items necessary to support more frequent and flexible assessment. Specifically, the use of automated item generation (AIG)—which uses computer technology to generate test items from cognitive models—has been studied for over five years.
Cognitive models are representations of the knowledge and skills required to solve a given problem. In developing a cognitive model for a medical scenario, for example, content experts are asked to deconstruct the clinical reasoning process involved into clearly stated variables and related elements. These are then entered into a computer program that uses algorithms to generate multiple-choice questions, or MCQs (Gierl & Lai, 2013).
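As a toy illustration of that generation step, the sketch below instantiates a hypothetical item stem over a small cognitive-model-style set of variables. The stem, variables, and values are invented for illustration and are not MCC content; real AIG models are far richer than a simple cross-product of values.

```python
from itertools import product

# Hedged sketch: template-based item generation. A cognitive model is
# represented here as a mapping from variable names to allowed values;
# the generator instantiates the stem once per combination of values.

STEM = ("A {age}-year-old patient presents with {symptom}. "
        "What is the most appropriate next step?")

MODEL = {
    "age": ["25", "68"],
    "symptom": ["chest pain", "acute shortness of breath"],
}

def generate_items(stem, model):
    """Return one instantiated stem per combination of variable values."""
    names = list(model)
    items = []
    for values in product(*(model[n] for n in names)):
        items.append(stem.format(**dict(zip(names, values))))
    return items

items = generate_items(STEM, MODEL)
# 2 ages x 2 symptoms -> 4 generated stems
```

Even this toy model shows the leverage AIG offers: adding one value to one variable multiplies, rather than adds to, the number of generated items.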
The MCC has been piloting AIG items for over five years with a number of its examinations, including the MCC Qualifying Examination Part I (MCCQE I)—one of the requirements for medical licensure in Canada. The aim of this session will be to provide an overview of the lessons learned in the use and operational rollout of AIG with the MCCQE I.
AIG has proved beneficial from a number of perspectives in that it has: (1) offered a highly efficient process through which hundreds of MCQs can be generated from cognitive maps, (2) yielded items of a quality level that is at least equal—and in many instances superior—to that of traditionally written MCQs, based on difficulty and discrimination inclusion criteria, (3) provided a framework for the systematic creation of plausible distractors, adding value from the perspective of tailoring diagnostic feedback for remedial purposes, and (4) contributed to enhancing the test development process.
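The "difficulty and discrimination inclusion criteria" are not specified in the abstract; a common classical screen uses the item p-value (proportion correct) and the corrected point-biserial correlation (item score vs. rest-of-test score). A minimal sketch, with invented thresholds:

```python
from math import sqrt

def pearson(x, y):
    """Plain Pearson correlation; returns 0.0 for a constant vector."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy) if sx and sy else 0.0

def screen_items(responses, p_range=(0.2, 0.9), min_disc=0.15):
    """Return indices of items passing the classical screens.

    responses: one 0/1 vector per examinee. Discrimination is the
    corrected point-biserial (item vs. total score minus that item).
    The thresholds are illustrative, not MCC policy.
    """
    n = len(responses)
    totals = [sum(r) for r in responses]
    keep = []
    for j in range(len(responses[0])):
        item = [r[j] for r in responses]
        p = sum(item) / n
        rest = [totals[i] - item[i] for i in range(n)]
        if p_range[0] <= p <= p_range[1] and pearson(item, rest) >= min_disc:
            keep.append(j)
    return keep

# Six examinees, three items; item 2 is answered correctly by everyone
# (p = 1.0), so it is screened out.
responses = [
    [1, 1, 1],
    [1, 1, 1],
    [1, 0, 1],
    [0, 1, 1],
    [0, 0, 1],
    [0, 0, 1],
]
kept = screen_items(responses)
```

Generated items that fall outside such ranges would be revised or discarded before banking, regardless of whether they were written by a committee or by an algorithm.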
This session’s presenters are hopeful that sharing their experiences might not only help other testing organizations interested in adopting AIG, but also foster discussion that will benefit all attendees.
Featured Speaker: Catherine Taylor, Measured Progress
ESSA provides states with an option to develop alternatives to summative testing for accountability purposes. Specifically, these alternatives may "involve multiple up-to-date measures of student academic achievement, including measures that assess higher-order thinking skills and understanding, which may include measures of student academic growth and may be partially delivered in the form of portfolios, projects, or extended performance tasks."
Assessment specialists are actively considering ways to implement this ESSA provision. States are also considering how to incorporate classroom-based performance assessments into their accountability programs. Past efforts to collect student work over time for summative assessment programs have been criticized for their unwieldiness, their lack of evidence for reliability and validity, their lack of comparability across different sources, and the difficulty of applying common rubrics to collections of student work from different schools and districts.
This session will present a model for developing portfolios or collections of student evidence that addresses past criticisms. The model includes: (1) developing common scoring rubrics that can be applied across collections of work, (2) developing task shells for performance tasks that can be used to generate multiple comparable tasks that are anchored in classroom contexts, and (3) setting criteria for acceptable numbers and types of evidence.
This session will demonstrate: (1) how this model was applied in a state for a high school graduation portfolio, (2) how the collections were scored, and (3) a standard-setting method that was used to set performance standards comparable to those of the summative assessment.
Assessment providers can support states in implementing similar models by: (1) developing model performance tasks or task shells, (2) designing other generalizable classroom-based tools such as test maps for end of unit tests, and (3) using electronic portfolio methods to collect and score students' collections.
Lastly, this session will also present a summary of the evidence for validity and reliability obtained from this alternate program.
Featured Speakers: Robert McHenry, Independent; and Naeiry Vartevan, Cambridge Cognition
The ready availability of trackers, smart watches, fitness bands, body-worn devices, and clothing with sensors woven into it is providing access to biometric and neurophysiological data about individuals on a continuous and longitudinal basis. Some of this data provides direct information about people’s habits and lifestyles; other data may be interpreted indirectly to measure personality. Some devices, such as the smart watch, can even be used in conjunction with the smart phone to assess intelligence or to predict the early onset of Alzheimer’s disease. In recent developments, some employers are asking employees to share data from these devices 24/7 and are creating programs for monitoring employees in and out of the workplace in order to assess employees’ psychological state and current productivity and to predict their behavior at work.
This session will demonstrate the range of wearable devices currently available to consumers and professionals. It will examine the current outputs from these products (EEG, heart rate, skin conductivity, skin temperature, respiration, skin glucose, muscle mass, etc.) and consider their relevance not only to work behaviors but also to well-being and safety. Using a selection of case studies, this session will demonstrate how wearables can be used in test development and how data from wearables could—even more than at present—benefit both the wearer and the professional monitoring the output. This session will also argue that output from wearables could be used in place of questionnaires for assessment in the clinical, educational, and occupational fields.
Featured Speakers: Alina Von Davier, ACT, Inc.; Kristen DiCerbo, Pearson; and Greg Chung, CRESST
Society is now interested in developing Learning and Assessment Systems (LAS), not merely improving the systems we have. Educators request assessments that reflect the way people actually teach, learn, and work, and that are merged with the learning experience. There is renewed interest in performance assessments that are individualized and adaptive, and efforts are being made to develop these complex assessments in virtual settings. The desire to create better LAS that can provide actionable evidence based on “big data” to improve students’ and adults’ skills and to shape educational policies and methodologies, combined with recent advances in technology, has led to the proliferation of virtual systems. First, the fishbowl panelists will briefly discuss the use of games, simulations, and intelligent tutoring systems as LAS, drawing on actual case studies to discuss best practices and innovative approaches to developing and evaluating new LAS for facilitating learning. Then the participants will contribute to the discussion.
Featured Speakers: Emily Fedeles, BakerHostetler; and Melinda McLellan, BakerHostetler
When the European Court of Justice invalidated the "Safe Harbor" framework in October 2015, many multinational companies were left scrambling to find alternative legal mechanisms to transfer personal data from Europe to the U.S. Over 4,000 organizations that had been using the Safe Harbor to transfer data for more than 15 years lost that option overnight. After several months of legal uncertainty, in July 2016, the European Commission adopted the EU-U.S. Privacy Shield to replace the Safe Harbor. Companies were able to join the new framework as of August 1, 2016 by self-certifying on the Department of Commerce's website.
In this presentation, BakerHostetler Partner Melinda McLellan and Associate Emily Fedeles will provide information and insights for organizations considering joining the Privacy Shield, including:
Featured Speaker: George McCloskey, Philadelphia College of Osteopathic Medicine
This presentation will compare and contrast the psychological constructs of intelligence and executive functions and explore how schools of thought about these two constructs have evolved over a period of nearly one hundred years.