Submitted (and accepted) symposia

Symposium 1 / Tuesday, 3rd July / 11.00-12.30 / Room: Leeszaal

Linguistic diversity and testing

Chair

Paula Elosua (University of Basque Country, Spain)

 

Symposium abstract

In linguistically diverse contexts, where more than one language is used, student assessment has to deal with the problem of the language of the tests. From a psychometric point of view, the language of a test has to maximize the validity of the scores and avoid bias, but how should one choose between the home language and the language of instruction? How can we measure the impact of the language on performance and validity? Addressing these questions makes it necessary to differentiate between two linguistic contexts: linguistic minorities and bilingual environments. This panel presents different approaches to studying the validity of scores and to evaluating the influence of the language of testing on students' performance. The first paper, by Solano-Flores, discusses a theoretical framework to address linguistic diversity in testing. The second, by Ubieta-Muñuzuri and colleagues, describes the impact of the language of the test on bilingual students' performance. Ercikan and Oliveri investigate the effect on DIF of large degrees of heterogeneity among linguistic groups participating in PISA. The last paper, by Martiniello and Elosua, presents a model for predicting DIF for English language learners in mathematics assessments.

 

Paper 1

A conceptual framework on the testing of linguistically diverse populations

Guillermo Solano-Flores (University of Colorado, USA)

 

This paper addresses the need for rigorous methodologies in the testing of linguistically diverse populations in a global economy. It submits the notion that, in order to properly address language as a source of construct-irrelevant variance, different types of linguistic diversity need to be recognized. Current testing practices focus mainly on score differences between groups (e.g., students tested in different language versions of a test). The paper proposes a conceptual framework that identifies two factors as critical to properly addressing linguistic diversity in testing: the language(s) in which tests are administered - in one language (OL) or in multiple languages (ML) - and the linguistic group(s) tested - one group (OG) or multiple groups (MG). Four testing models result from the combination of these factors: OL-OG (e.g., NAEP or many national or state testing programs), OL-MG (e.g., LLECE), ML-OG (e.g., testing programs in which linguistic minorities are tested in one of two languages), and ML-MG (e.g., TIMSS or PISA). The conceptual framework allows identification of methodological issues relevant to each testing model, such as the pertinence of within- or cross-subject designs, the treatment of language as a population factor or as a source of measurement error, and the proper sampling of linguistic populations.

 

Paper 2

PISA-L: Influence of the language of testing on the achievement of bilinguals

Eduardo Ubieta-Muñuzuri (Basque Institute of Educational Evaluation and Research, Spain)

Araceli Angulo-Vargas (Basque Institute of Educational Evaluation and Research, Spain)

Amaia Arregi-Martínez (Basque Institute of Educational Evaluation and Research, Spain)

 

The educational system of the Basque Autonomous Community (Spain) is a bilingual system, in which different options can be chosen for distributing teaching time between the two official languages (Basque and Spanish). In Model D, educational content is taught in Basque, which is the L2 for the majority of students. Students' home language is essentially either Basque or Spanish (Basque for around 13%). The Basque Country has participated in the Program for International Student Assessment (PISA) with its own sample since 2003, and in linguistic Model D PISA has been administered in the students' mother tongue. In the 2009 edition of PISA, the Basque Autonomous Community and Luxembourg decided to carry out the so-called PISA-L (PISA Language) study. Two weeks after the main administration, a reduced version of the PISA test was administered by ISEI-IVEI (the above-mentioned Basque Institute of Educational Evaluation and Research) to a sample of students; half of them took the test in their mother tongue, whereas the other half took it in their language of instruction. In this paper the results of this research are presented.

 

Paper 3

Heterogeneity of linguistic minority students in differential item functioning analyses

Kadriye Ercikan (University of British Columbia, Canada)

Maria E. Oliveri (University of British Columbia, Canada)

 

Test developers investigate bias for linguistic minority students and develop accommodations to minimize disadvantage against these students. Previous research has demonstrated great degrees of heterogeneity among linguistic groups such as English language learners (ELLs) in the United States. For example, ELLs come from different cultural backgrounds, have different levels of English language proficiency, and vary in how this proficiency has developed. This presentation will focus on the effect of large degrees of heterogeneity among linguistic groups on bias investigations. In particular, when we investigate bias using differential item functioning (DIF) methodology, how accurate is the DIF identification in the presence of population heterogeneity? This research investigates population heterogeneity using a latent class modeling approach and examines differential functioning of items for latent classes within linguistic groups. We examine population heterogeneity of students from three countries - the United States, Canada, and Spain - using data from the most recent administration of PISA (2009). The focus is on heterogeneity within a subgroup of students whose test and home language are the same and a second subgroup of students whose test and home language are different. The results have implications for methodologies used to investigate bias and validity issues for examinees from diverse linguistic backgrounds.

 

Paper 4

Sources of DIF for language minority students in mathematics

Maria Martiniello (Educational Testing Service, USA)

Paula Elosua (University of Basque Country, Spain)

 

Employing a hypothesis-driven approach, this study examines item characteristics that are systematically associated with DIF in high school mathematics assessments administered in English to language minority students classified as English language learners (ELLs) in US schools. PLS path modeling was applied to predict DIF indices for ELLs and non-ELLs as a function of construct-relevant latent factors derived from the process standards for high-quality mathematics instruction proposed by the U.S. National Council of Teachers of Mathematics. These processes are Problem Solving, Connections, Representations, and Reasoning and Proof. Two factors were hypothesized. The first involves conceptual understanding, complex problem solving, and connection to real-world settings. The second involves logical and strategic thinking. The first factor predicted half of the DIF variation while the second predicted none. Items with high scores on the first factor tended to be more difficult for ELLs than for non-ELLs conditional on mathematics scores. In contrast, items with low scores (targeting basic skills, algorithm execution, and procedural fluency) tended to be easier for ELLs. Since items with real-world scenarios tend to have a greater language load (a potential source of construct-irrelevant variance), we examined the relationship between non-mathematical linguistic complexity in items and DIF for ELLs.
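As an aside on the kind of analysis described above, the following is a minimal, hypothetical sketch of predicting item-level DIF indices from item scores on two process factors. It uses ordinary least squares as a simplified stand-in for the PLS path model, and all names and values are invented for illustration; it is not the authors' actual analysis or data.

    import numpy as np

    # Hypothetical inputs: one row per item.
    # dif_index : a signed DIF statistic for ELLs vs. non-ELLs (e.g., an MH-type index)
    # f1, f2    : item scores on the two hypothesized process factors
    rng = np.random.default_rng(0)
    n_items = 40
    f1 = rng.uniform(0, 3, n_items)          # conceptual understanding / real-world connection
    f2 = rng.uniform(0, 3, n_items)          # logical and strategic thinking
    dif_index = 0.4 * f1 + rng.normal(0, 0.3, n_items)   # toy data: only f1 drives DIF

    # Ordinary least squares as a simplified stand-in for PLS path modelling
    X = np.column_stack([np.ones(n_items), f1, f2])
    coef, *_ = np.linalg.lstsq(X, dif_index, rcond=None)
    fitted = X @ coef
    r2 = 1 - np.sum((dif_index - fitted) ** 2) / np.sum((dif_index - dif_index.mean()) ** 2)
    print("intercept, b_f1, b_f2:", np.round(coef, 2))
    print("proportion of DIF variance explained:", round(r2, 2))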

 

 

Discussant

Kurt Geisinger (Buros Center for Testing, University of Nebraska-Lincoln, USA)

 

 

 

Symposium 2 / Tuesday, 3rd July / 11.00-12.30 / Room: Bestuurskamer

Psychometric modelling and the management of response biases in questionnaire-based assessment

Chair

Rob Meijer (University of Groningen, The Netherlands)

 

Symposium abstract

The use of computer-based testing is increasing in many fields of applied psychology. As with many new developments, this also raises new and interesting challenges. In this symposium the main aim is to present a number of applied studies that use factor-analytic and item response theory techniques to examine the psychometric quality of computer-based questionnaires. In particular, we would like to show how these techniques can help to determine item and test score quality in general (presentation 1), across different groups (presentation 2), and across different test administration modes (presentations 3 and 4).

 

Paper 1

Using bi-factor modeling to determine the dimensionality of the BDI-II

Danny Brouwer (University of Twente, The Netherlands)

 

In computerized testing, total scores and subtest scores are often reported. Questionnaires contain items that represent a broad range of the trait being measured; consequently, even when measuring a unidimensional construct, some multidimensionality often exists. Given this complex structure of psychological data, we asked to what extent practitioners can report one total score together with subscale scores. In the literature on clinical questionnaires there are frequent debates about which factor model demonstrates the best fit. For example, in recent studies of the Beck Depression Inventory-II (BDI-II), which is intended to measure severity of depression, there has been a debate about whether there is one general factor underlying the structure of the BDI-II or multiple correlated first-order factors. In the present research, we applied a bi-factor model to evaluate the extent to which scores reflect either a single variable or multiple variables, in a large sample of 1,530 clinical outpatients. The results showed that subscale scores do not add much information to the total score. Consequences for computer-based clinical assessment are discussed.
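For readers unfamiliar with how bi-factor results translate into such a conclusion, here is a minimal sketch of comparing total-score and general-factor reliability (omega total versus omega hierarchical) from a fitted bi-factor solution. The loadings below are invented for a six-item illustration and are not the BDI-II results reported in the paper.

    import numpy as np

    # Hypothetical standardized bi-factor loadings: one general factor plus
    # two group factors (items 1-3 on group 1, items 4-6 on group 2).
    general = np.array([0.70, 0.65, 0.72, 0.60, 0.68, 0.66])
    group   = np.array([0.30, 0.25, 0.28, 0.20, 0.22, 0.24])
    uniq    = 1 - general**2 - group**2

    total_var = (general.sum()**2
                 + group[:3].sum()**2 + group[3:].sum()**2
                 + uniq.sum())
    omega_total = (general.sum()**2 + group[:3].sum()**2 + group[3:].sum()**2) / total_var
    omega_h     = general.sum()**2 / total_var   # reliable variance due to the general factor only

    print(f"omega total = {omega_total:.2f}, omega hierarchical = {omega_h:.2f}")
    # When omega hierarchical is close to omega total, subscale scores add
    # little information beyond the total score.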

 

Paper 2

The use of effect size indices for differential item and test functioning in computer-based testing

Iris Egberink (University of Groningen, The Netherlands)

 

The aim of this study was to gain experience with effect size indices for differential item and test functioning recently discussed in Meade (2010). Measurement equivalence was investigated for personality questionnaires administered computer-based to applicants and incumbents, and to different ethnic groups in a personnel selection setting. Data were collected in cooperation with a Dutch human resources assessment firm; 4050 applicants and 4217 incumbents filled out a personality test. Scaling results were compared for applicants and incumbents, before differential item and test functioning was investigated using a likelihood ratio approach and different effect size measures. Results showed that the scalability was lower for the applicants than for the incumbents. Some items functioned differently for both groups, but differential test functioning was of no practical importance. Similar results were found for the comparison of different ethnic groups in a selection context. Implications for computer-based testing are discussed.

 

Paper 3

Experiences in applicant settings in combining UIT and verification CATs

Annette Maij-de Meij (Picompany, The Netherlands)

Lolle Schakel (Picompany, The Netherlands)

 

Unproctored internet testing (UIT) has some major competitive advantages over proctored testing in, for example, selection procedures. Since organizations demand it, the question is how to deal with UIT adequately. It is recommended to administer a verification test in a proctored setting after UIT (International Test Commission, 2006) to avoid and detect cheating or the inappropriate use of resources. Computerized adaptive tests (CATs) can measure accurately with relatively short tests and may offer a solution for short verification tests. However, there are limits: short tests run the risk of low ability estimates when the estimate initially drops, and may in this way unjustly flag applicants as possible cheaters. Experiences with two operational CATs, a UIT and a verification test, will be discussed, supported by empirical data covering different CAT conditions for the verification test.

 

Paper 4

Using cumulative sum statistics to detect inconsistencies in unproctored Internet testing

Jorge Tendeiro (University of Groningen, The Netherlands)

 

Unproctored Internet Testing (UIT) is becoming more popular in personnel recruitment and selection. A drawback of UIT is that cheating is easy and, therefore, a proctored test is often administered after a UIT procedure. Methods available in the literature to detect inconsistency between the unproctored and proctored tests are often based on total scores. In this talk we propose a new approach based on cumulative sum procedures (CUSUMs). In the current test-retest context, CUSUMs are used to check the compatibility between trait estimates from the unproctored test and item scores from the proctored test. CUSUMs allow evidence of inconsistency between the unproctored and proctored tests to accumulate as the test progresses. This procedure is applied to empirical data from an adaptive computer-based test in a real personnel selection context. The usefulness of the CUSUM is illustrated and its unique contribution relative to existing procedures is discussed. Special emphasis is given to CUSUM charts as valuable visual representations of the final results.
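To make the idea concrete, here is a minimal sketch (not the authors' exact statistic) of a one-sided lower CUSUM over standardized item residuals, where the expected probabilities on the proctored items come from the ability estimate obtained in the unproctored test. The Rasch model, the reference value k, and all data are illustrative assumptions.

    import numpy as np

    def rasch_p(theta, b):
        """Probability of a correct response under the Rasch model."""
        return 1.0 / (1.0 + np.exp(-(theta - b)))

    def lower_cusum(scores, b, theta_uit, k=0.5):
        """One-sided lower CUSUM over standardized item residuals.

        Accumulates evidence that the proctored item scores are lower
        than expected under the unproctored ability estimate theta_uit.
        """
        p = rasch_p(theta_uit, b)
        z = (scores - p) / np.sqrt(p * (1 - p))    # standardized residual per item
        c = np.zeros(len(scores) + 1)
        for i, zi in enumerate(z):
            c[i + 1] = min(0.0, c[i] + zi + k)     # drifts downward when performance drops
        return c[1:]

    # Toy example: ability estimated from the UIT, 10 verification items
    b = np.linspace(-1.5, 1.5, 10)                 # item difficulties
    scores = np.array([1, 0, 1, 0, 0, 1, 0, 0, 0, 0])
    chart = lower_cusum(scores, b, theta_uit=1.2)
    print(np.round(chart, 2))                      # flag if the chart crosses a lower threshold, e.g. -3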

 

Discussant

Wilco Emons (Tilburg University, The Netherlands)

 

 

Symposium 3 / Tuesday, 3rd July / 11.00-12.30 / Room: Raadszaal

Examinee motivation and secondary data analysis of large-scale assessment

Chair

Christina van Barneveld (Lakehead University, Canada)

 

Symposium abstract

The purpose of this symposium is to highlight issues and practices related to the problem of examinee motivation and the validity of interpretations based on a large-scale, standardized assessment of Grade 9 mathematics in Canada. The papers in this symposium address both the research problem and the methodological decision-making that was necessary to complete the research. The paper by Sarwar et al. on data merging strategies details the problems, benefits, outcomes, and implications of data-merging decisions for educational research. The paper by van Barneveld et al. examines teacher and student data on the use of large-scale assessment in students' class marks and discusses how the results may inform educational assessment policy. The paper by Zerpa and van Barneveld quantifies examinee motivation and its impact on item parameter and ability estimates using a modified item response model.

 

Paper 1

Data merging strategies and implications for educational research

Gul Shahzad Sarwar (University of Ottawa, Canada)

Carlos Zerpa (Lakehead University, Canada)

Christina van Barneveld (Lakehead University, Canada)

Marielle Simon (University of Ottawa, Canada)

Karieann Brinson (Lakehead University, Canada)

 

Data merging is a procedure for integrating different sets of data from multiple files into one. Through the merge process, an educational researcher may gain meaningful information and facilitate the secondary analysis of data to answer research questions. The two main advantages of merging data file sources are (a) an increase in the number of variables, which leads to a gain of related information, and (b) the possibility of obtaining new results that were not initially planned prior to data collection. This paper highlights the decision-making process when merging data files and compares approaches to merging data files using Structured Query Language and SPSS. The problems, benefits, outcomes, and implications of data-merging decisions for educational research are discussed.

 

Paper 2

Using large-scale assessment in students' class marks: Teacher and student perspectives

Christina van Barneveld (Lakehead University, Canada)

Carlos Zerpa (Lakehead University, Canada)

Gul Shahzad Sarwar (University of Ottawa, Canada)

 

 

The purpose of this paper was to describe teacher practices and student and teacher perspectives on using some or all parts of a large-scale assessment (LSA) in calculating students' class marks. Grade 9 mathematics teachers (n = 4,459) and students (n = 131,487) responded to questions on a self-report questionnaire as part of an LSA of Grade 9 mathematics. Results suggested that almost all teachers (95%-98%) reported using some or all of the LSA for class marks, most frequently counting it for 6-10% of their students' class grades. The percentage of students who indicated that their teacher would count some or all parts of the LSA as part of their class mark ranged from 37% to 59%, notably lower than the percentages reported by teachers. These results and other practices and perspectives are discussed in terms of their contribution to an educational assessment policy on the use of some or all parts of an LSA for class marks.

 

Paper 3

The problem of examinee motivation: Using an item response model which includes examinee motivation

Carlos Zerpa (Lakehead University, Canada)

Christina van Barneveld (Lakehead University, Canada)

 

Every year, about 140,000 grade nine students in the province of Ontario, Canada, are given a large-scale assessment in mathematics to monitor student academic achievement and performance. Current item response models (IRMs) do not account for the effect of student motivation. The purpose of this study was to develop an IRM that includes motivation in the model and to evaluate its usefulness in improving estimates of item parameters and student abilities when low motivation is present. Student motivation was identified from self-report student data using a principal component analysis. Two component scores, task-value and effort, were computed for each examinee and merged with student item responses to create two groups, high-motivated and low-motivated examinees. These groups were used to examine the effect of low motivation on the estimates of test item parameters and student abilities under a three-parameter logistic (3PL) IRM and a modified 3PL IRM. The results suggest that some item parameters are overestimated and some examinee abilities are underestimated when the model does not account for a motivation component. The results of this study are discussed in terms of bias and root mean square errors of estimates. Implications for testing organizations and future research will be discussed.
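The abstract does not specify the exact form of the modified model, so the following is purely an illustrative sketch of one way a motivation parameter could enter a 3PL item response function, with low motivation shrinking performance toward the guessing level; all parameter values are hypothetical and this is not the authors' model.

    import numpy as np

    def p_3pl(theta, a, b, c):
        """Standard three-parameter logistic item response function."""
        return c + (1 - c) / (1 + np.exp(-a * (theta - b)))

    def p_3pl_motivation(theta, a, b, c, m):
        """Illustrative motivation-weighted 3PL, with m in [0, 1].

        With full motivation (m = 1) this reduces to the standard 3PL;
        with no motivation (m = 0) responses approach the guessing level c.
        """
        return c + m * (1 - c) / (1 + np.exp(-a * (theta - b)))

    theta, a, b, c = 0.5, 1.2, 0.0, 0.2
    print(f"standard 3PL:            P(correct) = {p_3pl(theta, a, b, c):.2f}")
    for m in (1.0, 0.7, 0.3):
        print(f"motivation m = {m}: P(correct) = {p_3pl_motivation(theta, a, b, c, m):.2f}")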

 

Paper 4

A critical appraisal of recent research literature on examinee motivation and secondary data analysis of large-scale assessments

Christina van Barneveld (Lakehead University, Canada)

Carlos Zerpa (Lakehead University, Canada)


The purpose of this paper is to critically appraise recent research literature on examinee motivation and secondary data analysis of large-scale assessments. In this paper we identify key elements of the assessment context, assessment format, and student characteristics that are related to a student's motivation to engage in large-scale educational assessment. Literature on theories of motivation, measures of motivation, the proportion of students who are not motivated, and the potential impact of low motivation on large-scale assessment results are discussed. Future directions for research and technology are outlined.

 

Discussant

Don Klinger (Queen's University, Canada)

 

 

Symposium 4 / Tuesday, 3rd July / 13.45-15.15 / Room: Leeszaal

Assessment of linguistic minority students in Canada and the USA

 

Chair

Debra Sandilands (The University of British Columbia, Canada)

 

Symposium abstract

This symposium brings together researchers from Canada and the US to address critical issues in assessing linguistic minority students (LMs) in two very different contexts. In Canada, Francophone students living in English-speaking environments outside of Quebec are educated and tested in French. In the US, English Language Learners (ELLs) whose first language is not English are educated and typically tested in English. Despite different contexts, learning outcomes as measured by large-scale assessments are much lower for these LMs compared to their counterparts. What factors may explain lower performance levels for LMs? Do the tests provide accurate and unbiased measurement of these students' competencies? What study designs are recommended for evaluating testing accommodations for ELLs? Do developments in technology provide opportunities for ensuring greater access and validity for ELLs in the next generation of assessment systems? These questions are at the core of the presentations in this symposium.

 

Paper 1

Factors associated with science achievement for Francophone students in Canada

Debra Sandilands (The University of British Columbia, Canada)

Juliette Lyons-Thomas (The University of British Columbia, Canada)

Kadriye Ercikan (The University of British Columbia, Canada)

Stephanie Barclay McKeown (The University of British Columbia, Canada)

 

Although majority Francophone students (FS) in the French-Canadian province of Quebec tend to perform higher on large-scale assessments of achievement than Anglophone students in Quebec, the performance of minority FS in other Canadian provinces tends to be lower than that of their Anglophone counterparts. Using data from the PISA 2006 assessment, this study investigates whether and to what extent school contextual factors (such as the availability of human, technology-related, and other material resources; activities that promote engagement with science; time spent in science classes; and teachers' instructional practices) may be similar or different for minority and majority FS, and how these school factors correlate with science achievement for the two groups. It will present descriptive statistics for science achievement and school factors for each group and examine the statistical significance of group differences. It will also use multiple correlation analyses to investigate how much of the variance in students' science achievement can be explained by the selected school factors for each of the two groups. Preliminary analyses reveal large and statistically significant differences in science achievement and in mean values of student- and school-administrator-reported factors between minority and majority FS.

 

Paper 2

Assessment of linguistic minority students in Canada

Kadriye Ercikan (The University of British Columbia, Canada)

Wolff-Michael Roth (Griffith University, Australia)

Marielle Simon (University of Ottawa, Canada)

Debra Sandilands (The University of British Columbia, Canada)

Juliette Lyons-Thomas (The University of British Columbia, Canada)

 

In Canada, there is an interesting achievement gap between the two key linguistic groups of students. While Francophone students in the province of Quebec surpass their Anglophone counterparts on achievement tests, Francophone students in the other provinces, where they are a linguistic minority, consistently perform more poorly than their Anglophone counterparts. Although minority Francophone students receive instruction in French, most of them live in an English-speaking environment and are more likely to speak a home language different from the test language (approximately 40% in Ontario) than students attending English-language schools (approximately 9% in Ontario). This linguistic context raises questions about the extent to which test scores reflect minority Francophone students' achievement accurately and the extent to which they reflect French language competency. In this presentation, we plan to present findings regarding the measurement comparability of French versions of tests for the two groups of Francophone students in Canada: those who live in linguistic minority settings and those who live in majority settings. Measurement comparability is investigated using differential item functioning analyses, linguistic reviews of items for language load, and think-aloud protocols.

 

Paper 3

Improved approaches for evaluating testing accommodations for English language learners

Guillermo Solano-Flores (University of Colorado at Boulder, USA)

 

This paper addresses the need for robust methodologies for evaluating testing accommodations (TAs) for English language learners (ELLs) - students who are still developing English as a second language. TAs for ELLs are modifications of the ways in which tests are administered, with the intent of minimizing limited proficiency in the language of testing as a source of test invalidity. While large-scale testing programs in the U.S. use many forms of TAs (e.g., reading the items aloud to the students, allowing extra time to complete the tests), only a few are supported by empirical evidence of their effectiveness. Research on vignette illustrations - illustrations added to non-illustrated items with the intent of providing visual support for ELLs to understand their content - shows that the effectiveness of this TA is shaped by the characteristics of each student and the characteristics of each item. Drawing from this experience, I challenge current TA practices - which focus on score differences between ELLs and non-ELLs tested with and without TAs - and submit that the effectiveness of TAs should be evaluated using within-subject designs that examine the amount of score variance due to the student (the object of measurement) and its interaction with two facets: item and (the presence or absence of) TA.

 

Paper 4

Challenges for new U.S. assessment systems in assessing English learners

John W. Young, Ph.D. (Educational Testing Service, USA)

 

In the United States, two multi-state consortia are funded to develop the next generation of assessment systems for testing students in grades K-12 beginning in 2014-15. For students who do not have English as their first language, known as English language learners (ELLs), there are serious technical challenges in ensuring that the content assessments they are administered are valid and fair. Because of the complexity of the language used in current assessments, the validity of scores as indicators of ELLs' comprehension and knowledge is questionable. The new systems, which will employ computer delivery, technology-enhanced items, and constructed-response items, have the potential to produce scores of greater validity for ELLs. Some of the challenges include the following. Can items be created that measure higher-order skills while minimizing the degree of linguistic complexity? Can scoring rubrics be created that lead to valid scores while accounting for the language skills of ELLs? Can innovative testing accommodations, such as ones based on translanguaging, be developed that will increase the validity of scores for ELLs? This presentation will discuss the challenges and possible solutions for ensuring greater access and validity for ELLs in the next generation of assessment systems.

 

Discussant

Kurt Geisinger (Buros Center for Testing, University of Nebraska-Lincoln, USA)

 

 

Symposium 5 / Tuesday, 3rd July / 13.45-15.15 / Room: Bestuurskamer

Testing resources help promote test development and use in emerging countries

Chair

Thomas Oakland (University of Florida, USA)

 

Symposium abstract

The development of psychological tests constitutes psychology's most important technical contribution to society. Tests are readily available in some countries and rarely available in others, especially in countries in which the discipline of psychology and its clinical practices are emerging. The purpose of this symposium is to discuss models used by test companies that affect the availability and use of tests internationally.

 

Paper 1

Using technology to increase international access to psychological assessments

Hazel Wheldon (MHS Inc., Canada)

 

Test publishers face multiple challenges finding ways to make assessments available around the world, particularly in developing countries where demand is high but affordability is low. This demand has to be balanced with the interests of the publishers, not least of which are the protection of intellectual property, the assurance that local translations and adaptations are culturally appropriate and the need to ensure that the assessments are available for as wide a group of qualified users as possible. Solutions to these challenges must also take into consideration the commitments that the publishers have regarding revenue to sustain their business and royalties to authors. Creative models are needed to ensure that all these needs are met. Fortunately the increasing availability and flexibility of technology has the capability to address many of the above issues in a cost effective and accessible way. In this presentation, we will discuss three distinct models for working with developing countries to ensure cost effective access to valuable assessment tools. While each of the models is promising in their own right, they also contain varying amounts of shortcomings that will also be discussed.

 

Paper 2

Test adaptation in emerging countries: Standardizing Wechsler scales in China and India

Paul McKeown (Pearson Clinical and Talent Assessment, UK)

 


 

Paper 3

Drivers of demand in test usage in emerging countries

Dragos Iliescu (Testcentral Bucharest, Romania)

 

Test publishers and authors tend to focus their important work mainly on the technical aspects that affect test development and use. The psychometrician's point of view is of course important and has taken us to the current level of sophistication in tests, which ensures a higher quality of products and services than ever before in the history of psychology. In solid and established professional environments, such as the US and most of Western Europe, the attitudes of test users and other stakeholders towards tests and testing have grown together with the professional market of which test developers and publishers are a part. As such, there have rarely been significantly divergent attitudes, as both the user ("market") and the usage ("product") have grown and expanded together. Emerging markets, however, differ. The "adapt and test" model of test exporting, used by most, if not all, test publishers when the objective is to spread the use of a test to an international audience, has to be rethought in terms of local habits and attitudes in order to make the test usable in the new market. The presentation will focus on drivers of demand regarding tests and testing in Romania, addressing such issues as ethical test usage, test user qualifications, administration and scoring procedures, and local preferences for particular types of tests or measured variables. The impact of these attitudes on test usage in Romania will be discussed.

 

Paper 4

The challenge of employment testing in emerging markets

Ilke Inceoglu (SHL Group Ltd., UK)

Dave Bartram (SHL Group Ltd., UK)

 

Promoting occupational test use in emerging countries opens up opportunities for organizations to standardize assessment practices across a wider range of diverse countries. Benefits include more efficiency and fairness in selection through standardized processes. There are, however, challenges in implementing testing in emerging countries, largely determined by the maturity of testing markets and by cultural factors. The challenges fall into three main categories: the availability of resources, expertise, and infrastructure relating to (1) test design and development, (2) test distribution, and (3) test use. For example, at stage (1) there may be insufficient local expertise (i.e., psychometric training), and guidance on test development criteria may be lacking. Regarding (2), the infrastructure required to conduct testing (e.g., sufficient internet bandwidth for online testing) and the acceptance of psychometric tests may be lacking. Related to (3), candidates may not be familiar with testing and testing formats, which can affect test results; standards for test use may be lacking; and training for test administration and for giving feedback might not be available. We will discuss these challenges by providing examples of distributing tests in a number of diverse emerging markets such as Eastern Europe, Africa, China, and the Middle East.

 

Discussant

Thomas Oakland (University of Florida, USA)

 

 

Symposium 6 / Tuesday, 3rd July / 13.45-15.15 / Room: Raadszaal

Psychometric properties of the Multiple Mini-Interview: Efforts to assess applicants' non-cognitive qualities

Chair

Don A. Klinger (Queen's University, Canada)

 

Symposium abstract

The Multiple Mini-Interview (MMI) has become an increasingly popular method to select prospective candidates for admission into professional programs (e.g., medical school). The MMI uses a series of labour-intensive, short simulation stations to more effectively assess applicants' non-cognitive qualities, including empathy, critical thinking, ethics, and communication. This symposium introduces the MMI and presents findings from three separate analyses of its psychometric properties, highlighting its potential and the challenges to be addressed. MMI data from 455 potential students were used to: 1) estimate the generalizability of the MMI and the sources of error using generalizability theory; 2) identify misfitting examinees, items, and raters using a many-facet Rasch model; and 3) conduct factor analyses to determine the MMI's factor structure. Consistent with previous research, our results support the consistency of the MMI; however, it is unclear to what extent the MMI measures the intended non-cognitive qualities across stations.

 

Paper 1

Introduction to the Multiple Mini-Interview testing procedure

Don A. Klinger (Queen's University, Canada)

 

The Multiple Mini-Interview (MMI) has become a powerful assessment tool to select candidates for admission into highly competitive professional degree programs (e.g., medical school). This presentation introduces the MMI, its current uses, and its potential as an assessment of non-cognitive qualities (empathy, communication, ethics). Applicants complete a series of stations in which they interact with a current student under the premise of a simulated scenario. These short simulation stations are intended to more effectively assess applicants' non-cognitive qualities, including empathy, critical thinking, ethics, and communication (Queen's University, 2011). For each station, the student and an observing examiner score the applicant on between 2 and 5 non-cognitive qualities, each on a 9-point scale. Previous research has demonstrated high reliability and the ability to differentiate desired qualities amongst potential candidates (Dodson et al., 2009), suggesting the MMI provides a superior method to select candidates for admission into competitive professional programs. Although the MMI is labour intensive, requiring the participation of several students and examiners, research suggests the MMI has higher predictive validity than traditional interview formats (Eva et al., 2004), making it worth the time and expense.

 

Paper 2

Generalizability and sources of error on the Multiple Mini-Interview

Kin Luu (Queen's University, Canada)

Stefanie Sebok (Queen's University, Canada)

 

Generalizability theory (G-theory) provides an effective method to not only determine the generalizability (reliability) of an assessment process, but also the sources of error that exist across facets. In the case of the MMI, two important facets are stations and raters. An Applicant by Rater by Item nested in Station (A X R X I:S) G-theory analysis was conducted. Based on this analysis, the G-coefficient was found to be relatively high (>0.80). Estimates of the variance components for raters indicated little variation due to rater differences, and the rater-by-item interaction indicated that the raters used the scales consistently. For the stations, there was a large proportion of the variation attributable to the interaction between applicants and stations. Further analyses nesting raters within stations provided similar findings. These results suggest that the applicants were obtaining different scores across each of the stations, and the stations differentially rate applicants. Research is currently being completed using D-studies to determine the number of stations, items and raters required to obtain high levels of generalizability while minimizing the number of required raters.
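As background for how such a G coefficient is assembled from variance components, here is a minimal sketch for an Applicant x Rater x (Item:Station) design. The variance components and facet sample sizes below are invented for illustration and are not the study's estimates.

    # Hypothetical variance components for an Applicant x Rater x (Item:Station) design
    var = {
        "a": 0.50,        # applicants (object of measurement)
        "ar": 0.02,       # applicant x rater
        "as": 0.30,       # applicant x station
        "ars": 0.04,      # applicant x rater x station
        "a(i:s)": 0.10,   # applicant x (item within station)
        "residual": 0.15, # applicant x rater x (item:station), plus error
    }
    n_r, n_s, n_i = 2, 7, 4   # raters per station, stations, items per station

    # Relative error variance: applicant interactions divided by the facet sample sizes
    rel_error = (var["ar"] / n_r
                 + var["as"] / n_s
                 + var["ars"] / (n_r * n_s)
                 + var["a(i:s)"] / (n_s * n_i)
                 + var["residual"] / (n_r * n_s * n_i))

    g_coefficient = var["a"] / (var["a"] + rel_error)
    print(f"G coefficient = {g_coefficient:.2f}")
    # In this toy example the applicant x station component contributes most of the error,
    # mirroring the pattern described in the abstract.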

 

Paper 3

Identifying misfitting applicants, raters and stations on the Multiple Mini-Interview

Stefanie Sebok (Queen's University, Canada)

King Luu (Queen's University, Canada)

 

The many-facet Rasch model can be used to identify the extent to which the unit of measurement and the facets are consistently measured. The purpose of this study was to investigate the fit statistics for applicants, raters, and stations on the MMI using the many-facet Rasch model (Linacre, 1990). Individual rater, item, and applicant characteristics were examined based on the MMI data from 455 applicants. Through applicant reports we were able to identify individual fit characteristics for each applicant. With few exceptions, we did not identify significantly misfitting scores for the sample of applicants. The results from the item analysis indicated that every item in the model (e.g., communication, critical thinking) fit within a unidimensional construct. Lastly, the rater measurement reports suggest that both student raters and faculty examiners were fairly homogeneous in rating potential applicants. These findings provide evidence of the consistency of the MMI process for selecting applicants for acceptance into professional programs. There is also evidence that the student raters were as adept at using the rating scales as the faculty examiners.

 

Paper 4

Factor structure of the Multiple Mini-Interview

Don A. Klinger (Queen's University, Canada)

 

Two critical aspects of the MMI are rater consistency amongst the student raters and faculty examiners, and the extent to which the different stations measure the intended non-cognitive qualities of the applicants. This presentation provides the results from two sets of analyses. First, we compared rater consistency across stations and between student and faculty raters. The student raters and examiners tended to give similar scores across stations, although the students' scores were slightly higher across stations. Further, the variability in scores was similar across stations. Second, we completed a series of factor analyses (principal axis factoring with 1) varimax and 2) oblimin rotations) of the students' and examiners' scores for the 455 applicants. As a result of the MMI, each applicant received 88 different scores (7 stations with 2 to 5 categories per station and two raters per station). The factor analyses of these data clearly highlight the existence of distinct factors. These factors load together by station rather than by the different non-cognitive constructs intended to be measured across stations (e.g., communication, critical thinking, maturity). Hence it appears that raters did not differentiate amongst non-cognitive constructs, suggesting that a more holistic scoring system may be appropriate.

 

Discussant

Christina van Barneveld (Lakehead University, Canada)

 

 

Symposium 7 / Tuesday, 3rd July / 15.45-17.15 / Room: Leeszaal

Reducing bias and the achievement gap of minorities in selection procedures in the Low Countries

Chair

Johnny Fontaine (Ghent University, Belgium)

Eva Derous (Ghent University, Belgium)

 

Symposium abstract

The present symposium presents four studies from the Low Countries that investigate cultural bias in selection assessment. The first contribution investigates the impact of cultural differences in impression management tactics on interview outcomes; these tactics have a substantial biasing effect in interaction with the personality characteristics of the recruiter. The second contribution investigates achievement differences on a new constructed response multimedia test, which reduces the minority-majority achievement gap compared to more traditional selection procedures. The third contribution deals with speeded cognitive assessment procedures that focus on working memory; taking the speed-accuracy trade-off into account, surprisingly little bias and few achievement differences were observed. In the final contribution, a traditional numerical reasoning test is compared to an in-basket exercise: while the former showed substantial ethnic differences, they were negligible with the latter testing approach. The four presentations demonstrate that bias and the achievement gap can be reduced.

 

Paper 1

Effects of culture-specific impression management and recruiter characteristics on biased interview ratings

Veronique Verhees (Ghent University, Belgium)

Liesbeth De Beyter (Ghent University, Belgium)

Eva Derous (Ghent University, Belgium)

 

Ethnically diverse applicant pools demand more research on the way immigrant applicants present themselves in selection procedures, as well as on how recruiters perceive immigrant applicants. This field experimental study is one of the first examining the effect of culture-specific impression management (IM) tactics and recruiters' characteristics on interview ratings of equally qualified Moroccan applicants. Participants (N = 165, all native Belgians) were subjected to one of three culture-specific IM conditions (Belgian vs. Moroccan vs. control). Recruiters' characteristics (social dominance orientation (SDO), ethnic identification, and perceived similarity) were expected to moderate the effect of IM tactics on applicants' interview ratings. ANCOVAs revealed no main effect of culture-specific IM tactics. However, raters high in SDO and ethnic in-group identification rated applicants who used Moroccan culture-specific IM tactics significantly lower than those using Belgian culture-specific IM tactics. Perceived similarity positively affected interview ratings of applicants using Moroccan IM tactics (effect sizes were all moderate). The study results show how both recruiter characteristics (like SDO) and ethnic minority applicants' interview IM tactics can differentially affect interview ratings of equally qualified applicants and, hence, bias interview decisions. Practical implications for the way Moroccan minority applicants can present themselves to native Belgian recruiters and for how majority interviewers might avert biased interview ratings will be discussed.

 

Paper 2

Alternative predictors in personnel selection: Constructed response multimedia tests versus other instruments

Britt De Soete (Ghent University, Belgium)

Filip Lievens (Ghent University, Belgium)

Janneke Oostrom (Erasmus University Rotterdam, The Netherlands)

Lena Westerveld (Police Academy, Apeldoorn, The Netherlands)

 

As employee diversity has become a key challenge for the domain of I&O Psychology, subgroup differences have received increasing attention as a yardstick to evaluate selection instruments. The present study compared subgroup differences on an innovative constructed response multimedia test to other commonly used selection instruments. Constructed response multimedia tests present applicants with multimedia fragments depicting key situations that respondents are likely to encounter during job performance. At a critical point, the scene freezes and applicants are asked to answer orally by acting out their response to each fragment as if they actually take part in the presented situation. Apart from this constructed response multimedia test, two hundred forty-five applicants (27% ethnic minorities) for entry-level police jobs at the Dutch police academy completed both cognitive (cognitive ability test, language proficiency test) and non-cognitive (personality measure, interview, role-play) selection instruments. Results demonstrated minor subgroup differences on the constructed response multimedia test (d = 0.14) as compared to other selection instruments, suggesting the constructed response multimedia test to be a valuable alternative predictor to ensure a diverse applicant inflow. Subgroup differences were also examined on the dimensional level, with cognitively-loaded dimensions displaying larger subgroup differences.

 

Paper 3

Language and cultural effects in the speeded cognitive assessment of Belgian army recruits

Symen Brouwers (Ghent University, Belgium)

Johnny Fontaine (Ghent University, Belgium)

Jacques Mylle (Royal Military Academy, Belgium)

 

In the present study we examine whether test bias extends to the speed-accuracy trade-off in speeded cognitive tests. The GCTB battery of Irvine and Kyllonen was administered to 7,891 recruits of the Royal Belgian Armed Forces: 3,303 French-speaking mainstream, 679 French-speaking ethnic minority, 3,619 Dutch-speaking mainstream, and 290 Dutch-speaking ethnic minority recruits. For each subtest (Alphabet, Word Rules, Orientation, Number Fluency, Odds and Evens, and Reasoning) eleven items were administered, and item correctness and lag time were scored. Logistic regression revealed that average lag time across the correct items of a subtest was an important predictor of item correctness; linear regression revealed that the number of correct items was an important predictor of item lag time. For several items language region was a significant predictor, whereas ethnicity was rarely a significant predictor. An ANOVA revealed no substantial group differences except for Word Rules, which showed a lower number of correct answers and longer lag times for French-speaking recruits and recruits with an ethnic minority background. We conclude that when interpreting assessments with speeded cognitive tests, correctness also needs to be considered. Bias is mostly located in the inequivalence of word characteristics across cultures, such as word length and associative power.

 

Paper 4

Classical versus in basket approaches to cognitive ability assessment in selections of the Belgian government

Johnny Fontaine (Ghent University, Belgium)

Eva Derous (Ghent University, Belgium)

Cédric Danloy (SELOR, Belgium)

Vincent Van Malderen (SELOR, Belgium)

 

In the present study, impact, bias, and adverse impact were investigated for two very different selection instruments used by the Belgian government. The first was a classical numerical ability test taken by 1,558 applicants, of whom 226 belonged to an ethnic minority. The second was a new in-basket exercise taken by 625 applicants, of whom 159 belonged to an ethnic minority. On the classical ability test, there was a medium-sized effect of ethnic background, with the minority group scoring lower. On the basis of logistic regressions, item bias was identified for nine of the 33 items. The size of the item bias parameters showed a consistent curvilinear relationship with item order, with especially the items in the middle favouring the majority group. This pointed to very different test-taking styles in the two groups. On the in-basket exercise only a small effect of ethnic background could be identified. Moreover, there was no evidence for bias in the assessment of the individual competencies. Adverse impact ratios confirmed the same pattern across the two tests. The study thus found that a test which more closely resembles daily cognitive activities showed less impact, less bias, and less adverse impact.
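For readers unfamiliar with logistic-regression screening for item bias of the kind mentioned above, the following is a minimal, self-contained sketch; the data are simulated, and the matching variable, model, and flagging rule are illustrative assumptions rather than the procedure actually used in the study.

    import numpy as np
    import statsmodels.api as sm

    def logistic_dif(item, total, group):
        """Uniform-DIF check for one item via logistic regression.

        item  : 0/1 item responses
        total : matching variable (e.g., rest score or total test score)
        group : 0 = majority, 1 = ethnic minority
        Returns the likelihood-ratio chi-square (1 df) for adding group.
        """
        x0 = sm.add_constant(np.column_stack([total]))
        x1 = sm.add_constant(np.column_stack([total, group]))
        m0 = sm.Logit(item, x0).fit(disp=0)
        m1 = sm.Logit(item, x1).fit(disp=0)
        return 2 * (m1.llf - m0.llf)

    # Toy data: one item that is harder for the minority group at equal ability
    rng = np.random.default_rng(1)
    n = 500
    group = rng.integers(0, 2, n)
    theta = rng.normal(0, 1, n)
    total = theta + rng.normal(0, 0.5, n)
    p = 1 / (1 + np.exp(-(theta - 0.4 * group)))
    item = rng.binomial(1, p)
    print("LR chi-square (1 df):", round(logistic_dif(item, total, group), 2))
    # A large chi-square (e.g., beyond the .05 critical value of 3.84) would flag the item.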

 

Discussant

Fons van de Vijver (Tilburg University, The Netherlands)

 

 

Symposium 8 / Tuesday, 3rd July / 15.45-17.15 / Room: Bestuurskamer

Challenges of test adaptations in special contexts: The role of the ITC Guidelines

Chair

Jose-Luis Padilla (University of Granada, Spain)

Stephen G. Sireci (University of Massachusetts, USA)

 

Symposium abstract

Practitioners and cross-cultural researchers can face specific challenges when adapting, or validating adapted versions of, tests and questionnaires. Adapting a questionnaire to a target language in a bilingual community, obtaining validity evidence for an adapted version in clinical contexts, or researching a problem as elusive as the causes of DIF are significant examples of the challenges addressed by the papers in this symposium. There is no discussion about the role the ITC Guidelines should play; however, their real impact on practice is unknown. Proposals for following the ITC Guidelines in applied and research contexts will be discussed.

 

Paper 1

Guidelines versus practices in cross-lingual assessment: A disconcerting disconnect

Joseph A. Rios (University of Massachusetts, USA)

Stephen G. Sireci (University of Massachusetts, USA)

 

The Guidelines for Translating and Adapting Tests (ITC, 2010) provide important guidance to test developers, cross-cultural researchers, and test users on important validity issues in developing and evaluating tests for use across languages. These guidelines are widely applauded, but the degree to which they are followed in practice is unknown. In this study, we performed a comprehensive search of studies published in peer-reviewed journals. Approximately 300 relevant articles were found and 55 were selected for analysis. We were interested in whether practices improved after publication of the first ITC Guidelines in 1994. Therefore, the selected studies spanned 1956 to 2009. Very few studies provided adequate evidence regarding the quality of the translation or conducted a pilot study. With respect to statistical evaluations of invariance/comparability, reliability evidence was provided in less than half of the studies, about 1/3 evaluated construct equivalence (primarily using exploratory procedures), about 1/3 evaluated method bias, and only one study evaluated item bias. Very few studies provided more than one type of evidence. Our analysis of cross-lingual assessment practices suggests very few cross-cultural practitioners are following the ITC Guidelines, which threatens the validity of many of these instruments. None of the 22 articles published since 1996 cited the Guidelines, which suggests better dissemination strategies are needed.

 

Paper 2

Structural equivalence of adaptations in bilingual communities: An illustration

Itziar Alonso-Arbiol (University of the Basque Country, Spain)

Miriam Gallarin (Technische Universität Berlin, Germany)

Nekane Balluerka (University of the Basque Country, Spain)

Arantxa Gorostiaga (University of the Basque Country, Spain)

Mikel Haranburu (University of the Basque Country, Spain)

Aitor Aritzeta (University of the Basque Country, Spain)

 

The associations between attachment security and outcomes of positive development and/or the absence of psychopathological indexes illustrate the importance of research on attachment in adolescence. The Inventory of Parent and Peer Attachment (IPPA; Armsden & Greenberg, 1987) is one of the most widely used assessment tools. In the Basque region of Spain, two official languages - Basque and Spanish - are used for instruction in secondary schools, where the IPPA is most often administered to samples of adolescents. Stemming independently from the original English measure, there are currently Spanish and Basque versions. Those adaptations produced versions with slightly different items, which do not allow for comparison between the versions. In this study we carried out an adaptation of the IPPA that contains the same items in the two versions, so that they can be used interchangeably in some Basque schools. The samples consist of 477 adolescents (Spanish version) and 1,037 adolescents (Basque version). We used an exploratory factor-analytic approach to test structural equivalence. Items were then selected for use in both the Basque and Spanish versions. We discuss the need for strategies for adapting measures that are applied simultaneously in bilingual communities, and the implications for the ITC Guidelines.

 

Paper 3

Validation of the LittlEARS questionnaire in Polish and Spanish cochlear implanted children

Anita Obrycka (Institute of Physiology and Pathology of Hearing, Poland)

Jose-Luis Padilla (University of Granada, Spain)

Artur Lorens (Institute of Physiology and Pathology of Hearing, Warsaw/Kajetany, Poland)

Anna Piotrowska (Institute of Physiology and Pathology of Hearing, Warsaw/Kajetany, Poland)

Alba-Saida García (University San Cecilio Hospital, Granada, Spain)

Henryk Skarzynski (Institute of Physiology and Pathology of Hearing, Warsaw/Kajetany, Poland)

 

The introduction of pediatric cochlear implantation generated interest in new measures for assessing the auditory-verbal and language abilities of young implanted children. The LittlEARS Auditory Questionnaire (LEAQ) is intended to assess the auditory behavior of infants up to two years of age. Parental questionnaires can be useful instruments for complementing professional assessments in clinical contexts. The aim of this paper was to obtain cross-cultural validity evidence for LEAQ measures in Polish and Spanish cochlear-implanted (CI) children. 114 Polish and 65 Spanish CI children implanted before the age of two were tested with the Polish and Spanish versions of the LEAQ at the first fitting of the speech processor and at four subsequent follow-up visits. The psychometric properties of the LEAQ measures were evaluated, and the measures of both CI groups were compared with the national and international normative curves for children without hearing problems. Different cubic and logarithmic functions, with "total" scores as the dependent variable and "age" as the independent variable, were fitted for each country's database. The differences found between the two countries in psychometric properties, in the influence of age at the first fitting of the device, and in the degree of hearing loss can be related to cultural and linguistic factors. Lastly, the impact of the ITC Guidelines when performing the cross-cultural validation of the LEAQ will be discussed.

 

Paper 4

Searching for DIF sources by mixed-method research designs in cross-lingual assessment

Isabel Benítez (University of Granada, Spain)

Jose-Luis Padilla (University of Granada, Spain)

Stephen G. Sireci (University of Massachusetts, USA)

María-Dolores Hidalgo (University of Murcia, Spain)

Juana Gómez-Benito (University of Barcelona, Spain)

 

Research on the causes of Differential Item Functioning (DIF) can contribute to improving methods for testing structural and construct equivalence in cross-cultural and cross-lingual assessment. However, few general findings have emerged to explain DIF results. A "mixed methods" approach can be a promising way to advance our knowledge of DIF sources, as qualitative evidence from cognitive interviewing may be helpful in explaining DIF results. Starting from DIF results obtained by analysing the US and Spanish versions of the Student Questionnaire of the Program for International Student Assessment (PISA; OECD, 2006), 20 cognitive interviews in the US and 24 in Spain were conducted within a mixed-method research design. Interview protocols were developed taking expert appraisal evidence into account. Interviewees responded to general and follow-up probes after answering each Student Questionnaire scale. The analyses of the US and Spanish participants' narratives show different interpretation patterns for most of the items flagged with large DIF. The differences are related to different schooling experiences and educational contexts and, in some cases, to terms and expressions with different meanings in the two languages. Lastly, the potential benefits of resorting to a "mixed research" paradigm to develop the ITC Guidelines will be discussed.

 

Discussant

Kurt Geisinger (Buros Center for Testing, University of Nebraska-Lincoln, USA)

 

 

Symposium 9 / Tuesday, 3rd July / 15.45-17.15 / Room: Raadszaal

Assessment of children's temperament: A summary of international work

Chair

Thomas Oakland (USA)

 

Symposium abstract

The development of measures of adult personality has received considerable cross-national attention. The development of measures of personality and temperament for children has received more limited attention. The purpose of this symposium is to discuss efforts in three countries - Brazil, Poland, and Romania - to develop a measure of children's temperament. Data on children's temperament from 21 countries also will be summarized.

 

Paper 1

The development of a measure of children's temperament in Brazil

Ricardo Primi (University of São Francisco)

Tatiana de Cassia Nakano (Pontifical Catholic University of Campinas)

 

This study investigates the psychometric properties of the Brazilian version of the Student Styles Questionnaire (SSQ), using exploratory full-information factor analysis and Rasch item maps to develop a normative reference interpretation based on a more construct-centered approach to producing norms for psychological tests. Data from 1,267 children were analyzed, and the proportions of preferred styles are compared with those obtained in other countries.

 

Paper 2

The development of a measure of children's temperament in Poland

Jan Cieciuch (University of Finance and Management, Poland)

Tomasz Rowinski (Cardinal Stefan Wyszynski University, Warsaw, Poland)

 

The Polish adaptation of the Student Styles Questionnaire (SSQ; Oakland, Glutting, & Horton, 1996) was carried out in 2011. In accordance with psychometric requirements, confirmatory factor analysis was run as a test of factorial validity. Additionally, tests of measurement invariance were performed to make group comparisons meaningful. The analyses were performed on a group of 1,376 Polish pupils aged 8-18. In applying CFA we followed the parceling procedure used by Benson, Oakland and Shermis (2009). Measurement invariance across males and females, as well as across age groups, was tested within the framework of multigroup confirmatory factor analysis (Vandenberg & Lance, 2000). Results for three levels of measurement invariance (configural, metric and scalar) will be presented and discussed.
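
For readers unfamiliar with the three invariance levels mentioned above, the constraints in a multigroup factor model for group g, x_{ig} = \tau_g + \Lambda_g \xi_{ig} + \delta_{ig}, can be summarised as follows (a standard textbook formulation, not specific to this study): configural invariance requires only the same pattern of fixed and free loadings in every group; metric invariance additionally requires \Lambda_g = \Lambda (equal loadings); and scalar invariance requires \Lambda_g = \Lambda together with \tau_g = \tau (equal loadings and intercepts), the level usually needed before latent means are compared across groups.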

 

Paper 3

The development of a measure of children's temperament in Romania

Dragos Iliescu (SNSPA University, Romania)

 

Preferences for four bipolar temperament styles (i.e., extroversion-introversion, practical-imaginative, thinking-feeling, and organized-flexible) as measured by the Learning Styles Inventory (Oakland, Glutting, & Horton, 1996), are discussed for a sample of 2,400 Romanian children. Romanian children generally prefer extroversion, practical, thinking, and organized styles. Among Romanian children, more males than females prefer thinking and flexible styles and more females than males prefer feeling and organized styles. Gender differences are not apparent on extroversion-introversion and practical-imaginative styles. Age differences are most apparent on organized-flexible styles. Validity evidence in terms of preferred activities and grades for specific courses is also provided.

 

Paper 4

Children's temperament in 21 countries

Thomas Oakland (USA)

 

Data from 22,676 children from 21 countries who completed a scale of temperament that assesses four bipolar temperament traits (i.e., extroversion-introversion, practical-imaginative, thinking-feeling, and organized-flexible styles) are summarized. Children generally prefer organized to flexible styles in 20 countries, extroverted to introverted styles in 19 countries, and practical to imaginative styles in 16 countries. Preferences for thinking-feeling styles display gender differences in 14 countries, with females generally preferring feeling and males generally preferring thinking styles. Age differences are most apparent on extroversion-introversion and organized-flexible styles.

 

Discussant

Solange Wechsler (Pontifica Catholic University, Campinas, Brazil)

 

 

Symposium 10 / Wednesday, 4th July / 9.00-10.30 / Room: Leeszaal

Developments in the Netherlands Part 1 (COTAN)

Chair and discussant

JanHenk Kamphuis (University of Amsterdam, The Netherlands)

 

Symposium Abstract

The Dutch Committee on Tests and Testing (COTAN) was founded by the Dutch Association of Psychologists (NIP) in 1959 with the mission "to promote the better use of better tests". Representatives of the departments of psychology of all ten Dutch universities and representatives of psychologists practicing in various areas of psychology are selected as members. The most important criterion for election to COTAN membership is expertise in the field of test theory and test construction. The two symposia organized by COTAN showcase some of the expertise of COTAN members. The first symposium deals with three recent test development projects: the development of a dynamic test with tangible electronics for children, the utility of the Dutch adaptation of the WPPSI-III, and research on distress-induced eating with the Dutch questionnaire for eating disorders. The second symposium addresses more general issues concerning test construction: the influence of motivation on the scores of ability and performance tests, the pros and cons of using short tests, score differences on intelligence tests for various ethnic groups, and the relevance of (group) reliability coefficients for individual measurement.

 

Paper 1

Optimal performance and typical performance: "What's in it for me?"

Bas Hemker (Cito, Arnhem, The Netherlands)

Marie-Anne Mittelhaëuser (Tilburg University, The Netherlands; Cito, Arnhem, The Netherlands)

 

Many tests try to measure ability, or they try to measure and evaluate performance. But what type of performance do we actually measure? Is it a person's typical performance - the performance a person shows if nothing really depends on the outcome? Or is it a person's optimal performance - the performance shown when every party involved is highly motivated to get good results? And do these types of performance really differ? Students' motivation is often driven by self-interest. Results suggest that even young students, such as 12-year-olds, ask themselves: "What's in it for me?"

The effects of the difference between typical and optimal performance play an important role in large-scale studies, such as national assessments or studies for international comparison of ability. Can we use low-stakes tests, or do we need high-stakes tests? How should we report on the differences between these types of performance, or between these test-taking conditions, when comparing standards and reference levels?

The difference between optimal and typical performance is also found during pre-tests, norm studies, equating procedures and the construction of item banks. This difference in performance is an issue because, for many tests that measure optimal performance, practical restrictions mean that the pre-tests are carried out under the typical-performance test-taking condition.

Data were collected on performance on the same set of items in high-stakes and low-stakes settings. Based on the results of the analyses, in this presentation we discuss the challenges and propose possible solutions for dealing with the differences in performance.

 

Paper 2

Using short tests: Should we?

Wilco H. M. Emons (Tilburg University, The Netherlands)

 

Short tests containing at most 15 items are increasingly popular in clinical and health psychology for individual diagnosis and change assessment. Although short tests alleviate the burden of testing, they are also more vulnerable to measurement error than longer tests. It can therefore be questioned whether short tests provide enough accuracy, even if the items are well chosen given the application envisaged. In this presentation, we argue that even when total-score reliability is satisfactory, person measurement using short scales may be overly imprecise and, as a result, the certainty of making a correct decision may be low for many individuals. However, the relationship between test length/reliability and individual decision-making and change assessment is complex. We discuss this relationship and present results of recent psychometric studies on minimal test-length requirements. These studies aimed at producing hands-on rules for researchers to act upon. Both in research and in individual diagnosis, we generally recommend the use of highly reliable scales so as to reduce the chance of faulty decisions. We also note that this topic requires large-scale research before definitive advice can be given.
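
As background to the test-length argument (a standard classical test theory result, not a finding of the studies summarized here), the Spearman-Brown prophecy formula links reliability to test length:

    \rho_{kk} = \frac{k \rho_{xx'}}{1 + (k - 1)\rho_{xx'}}

where k is the factor by which the test is lengthened or shortened. Shortening a 30-item scale with reliability .90 to 10 items (k = 1/3), for example, gives a predicted reliability of .30/.40 = .75, with a correspondingly larger standard error of measurement.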

 

Paper 3

Ethnic score differences on intelligence tests in The Netherlands: Is the gap closing?

Remko van den Berg (NOA, Amsterdam, The Netherlands)

 

Research in intelligence test score differences between ethnic majority and minority groups started in The Netherlands almost twenty years ago. Most studies showed large score differences between ethnic majorities and first generation ethnic minorities, and smaller but still substantial differences between ethnic majorities and second generation ethnic minorities.

In the present study, data collected among adults with the Multicultural Capacity Test (MCT-M) in the period 2007-2010 are compared with data collected in the period 1995-1997. Test scores of first and second generation non-Western ethnic minorities with a Turkish, Moroccan, Surinamese, or Antillean background (Total N = 8,192) are compared with those of the Dutch majority group (N = 18,365).

The hypothesis is that intelligence test score differences between ethnic majorities and minorities diminish over time, with generation, length of stay in the Netherlands, and integration into Dutch society.

 

Paper 4

The end of reliability, and the beginning of individual measurement

Klaas Sijtsma (Tilburg University, The Netherlands)

 

Several authors have noticed that classical reliability is a group characteristic, which expresses the degree to which the variance in the group with respect to the test score can be attributed to variance with respect to the true score. A reliability equal to 0.8 on an intelligence test in a group of elementary-school students means that 80 percent of the test-score variance in the group is due to differences in true scores. The other 20 percent is due to random measurement error. If one wishes to say something about the precision with which an individual test score measures a particular intelligence level, knowing that 80 percent of the group differences are due to true-score differences is of little help, and one needs to resort to other statistics that allow making statements about the individual. In psychometrics, there are two possibilities: the standard error of measurement of the true score (classical test theory) and the standard error of the latent person variable (item response theory) can both be used to estimate confidence intervals for individual measurement and test hypotheses. Both approaches are discussed and their merits and drawbacks are illuminated.
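
As a worked illustration of this point (standard classical test theory formulas on the conventional IQ metric, not data from the paper), the standard error of measurement is

    SE = \sigma_X \sqrt{1 - \rho_{XX'}}

With a test-score standard deviation of 15 and a reliability of .80, SE = 15 × √.20 ≈ 6.7 IQ points, so a simple interval of the form x ± 1.96 × SE around an observed score of 100 runs from roughly 87 to 113, which is wide despite a group reliability that looks respectable.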

 

 

Symposium 11 / Wednesday, 4th July / 09.00-10.30 / Room: Bestuurskamer

Cooperating to enhance test security in international testing programs

Chair

John Fremer (Caveon Test Security, USA)

 

Symposium abstract

The test program attribute that is most closely associated with high levels of cheating and theft of intellectual property is the extent to which a program is international in scope. Increasingly, programs developed in one country are delivered through a network based elsewhere and possibly with the cooperation of agencies in several countries who have contributed to test and program design, setting of standards, piloting of test items, and/or scoring and reporting. This symposium looks at the key steps of test design, development, and administration as well as the need to plan and implement high quality procedures for investigating testing irregularities.

 

Paper 1

Assisting in the development of tests for international use

Aranka Krechtling (CITO, The Netherlands)

 

When it is known from the planning stages that tests will be used internationally, the items should be designed, as much as possible, for security purposes. The design should discourage memorization and sharing and make common methods of cheating less effective. They should limit item exposure, thereby prolonging the usefulness of items and test results. It will be important to be realistic about the time period within which items can be used with confidence that they have not been exposed to some candidates. Learning about the particular regions and countries where testing will take place is essential. There are some locations that are best viewed as the "last stop" for items, tests, and pools. Once administered in these places, no further use of intact tests or pools can be counted on. One can build into the testing process, though, the "rest" and restoration of such items so that many can be reused in future tests or pools. What must happen is that you design a realistic testing strategy, informed by those from other testing programs who have learned how to test in many different settings. You need to guard against thinking that your program must do exactly the same thing in all contexts. If you adhere to that rule, there will be a substantial number of countries within which no "standard" testing model will be feasible.

 

Paper 2

Building and maintaining the security of international tests

Cor Sluijter (CITO, The Netherlands)

 

When the intended domain of use for a test or collection of tests is international in scope, it is important that, during the development of items and tests, the content is protected through the use of confidentiality and nondisclosure agreements as well as through sound security procedures. The core group that has overall responsibility for test development and deployment often has a strong and shared commitment to protecting the examination questions and the fairness of the testing process. There are sometimes security breaches that can be traced to the developers of tests, but when this happens it is more likely to be due to inadequate security training and practices than to mischievous or malicious intent. Once you move beyond the core group of developers and expose examination questions to reviewers, problems can increase significantly. Developing careful "chain of custody" rules, clearly assigning responsibility, and auditing the process thoroughly and systematically are essential. When planning the administration of international tests, determining the level of security required is a critical step. Does the test primarily influence decisions regarding training, or do successful test takers move directly to practice? What are the consequences of a person being designated as "competent" or certified? In some instances lives could well be placed at risk.

 

Paper 3

Addressing global test administration security challenges

David Foster (Kryterion, USA)

 

More and more organizations are expanding the reach of their testing programs across borders into other countries and continents. With the Internet and other communication technologies in play today, it shouldn't matter whether a person takes a test in Brussels, Buenos Aires, or Bangkok. Following proper test administration standards will ensure that the test is completed fairly and that the results serve the purposes intended. However, testing across such distances introduces additional threats to the security of the exams that must be anticipated and dealt with. This session will cover proper security-related test administration standards and recent ITC security guidelines, list and describe relevant security threats, and provide a framework for dealing with those threats.

 

Paper 4

Conducting security investigations

John Fremer (Caveon Test Security, USA)

 

One of the tasks that test program managers for international testing programs must carry out effectively is that of designing and implementing solid procedures for investigating testing irregularities that may be indicative of cheating or theft of intellectual property. Towards that end each organization should have in place a set of procedures for:

  • Determining whether a security investigation is warranted,
  • Evaluating the extent of fraudulent activities and the damage from them, and
  • Guiding the execution of such investigations.

Other critical tasks include ensuring that:

  • Responsibility for managing the security investigation process has been clearly defined,
  • Investigation procedures have been piloted before they are used operationally,
  • Investigation procedures have received reviews from the perspective of program managers, as well as measurement, communications, and legal staff,
  • Sufficient funds have been allocated to permit thorough investigations,
  • Training materials have been developed and can be accessed by those who will be conducting investigations, and
  • Investigation procedures are applied consistently across the organization or program.

 

Discussant

Eugene Burke (SHL Group, UK)

 

 

Symposium 12 / Wednesday, 4th July / 08.45-10.30 / Room: Raadszaal

International perspectives on test reviewing

Chair

Kurt F. Geisinger (Buros Center for Testing, University of Nebraska-Lincoln, USA)

 

Symposium abstract

This symposium is based upon a special issue of the International Journal of Testing. The first paper, by Pat Lindley and David Bartram, discusses the development of the EFPA model of test reviewing. The second (Janet Carlson and Kurt Geisinger) describes the Buros Center for Testing approach. The third and fourth papers, by Jose Muniz and Carmen Hagemeister and their colleagues, describe test reviewing in Spain and Germany, respectively. Paula Elosua and Dragos Iliescu discuss the frequent use of adapted tests in Europe and the need to review such measures. Dave Bartram and Kurt Geisinger serve as discussants. Considering international approaches to test reviewing may provide benefits, at the least by establishing agreement on both dimensions and criteria for assessing the quality of tests, and could also provide guidance on best practices in test reviewing. Taking this idea further, one might facilitate review procedures by developing an international pool of competent reviewers. However, in any case, there will remain a need for local evaluation of the quality of test adaptations.

 

Paper 1

Issues raised by use of the EFPA test review model relating to the internationalization of test standards

Pat Lindley (British Psychological Society, UK)

 

In this paper we present the background to the development of test reviewing by the British Psychological Society (BPS) in the UK. We also describe the role played by the BPS in the development of the EFPA test review model and its adaptation for use in test reviewing in the UK. We conclude with a discussion of lessons learned from this experience for the internationalization of test reviews. Internationalization can refer to one or all of the following: the criteria used for reviewing tests; the procedures used to review tests; or the actual reviews themselves. While we see value in internationalizing the first two of these, the third is problematic, as there will remain a need for local reviews of local test adaptations.

 

Paper 2

Test reviewing at the Buros Center for Testing

Janet F. Carlson (Buros Center for Testing, University of Nebraska-Lincoln, USA)

Kurt F. Geisinger (Buros Center for Testing, University of Nebraska-Lincoln, USA)

 

The test review process used by the Buros Center for Testing is described as a series of 11 steps: (1) identifying tests to be reviewed, (2) obtaining tests and preparing test descriptions, (3) determining whether tests meet review criteria, (4) identifying appropriate reviewers, (5) selecting reviewers, (6) sending materials to reviewers, (7) checking reviews for factual accuracy, (8) editing content of reviews, (9) copyediting and updating reviews, (10) obtaining reviewer approval, and (11) seeking comments from publishers. Special considerations associated with each step are discussed. Preliminary efforts to extend the international relevance of Buros are reflected by a new product currently in development, Pruebas Publicadas en Español ("Tests in Print in Spanish").

 

Paper 3

Test reviewing in Spain

José Muñiz (Universidad de Oviedo, Spain)

José R. Fernández-Hermida (University of Oviedo, Spain)

Eduardo Fonseca-Pedrero (University of La Rioja, Spain)

Ángela Campillo-Álvarez (University of Oviedo, Spain)

Elsa Peña-Suárez (University of Oviedo, Spain)

 

The proper use of psychological tests requires that the measurement instruments have adequate psychometric properties, such as reliability and validity, and that the professionals who use the instruments have the necessary expertise. In this paper we present the first review of tests published in Spain, carried out with an Assessment Model developed by the European Test Commission, and adapted to the Spanish context. The model permits both qualitative and quantitative assessment of the test. Ten tests were reviewed, selected from among those most widely used by Spanish professionals. Each test was sent to two peer reviewers for its assessment, and based on this assessment a final report was drawn up. In general, it can be said that the quality of the ten measurement instruments is good, the reports highlighting their strong and weak points. In light of the reviews some improvements are suggested for future editions of the tests, emphasizing the need to include in the Manuals as much evidence as possible on the validity of the tests. Finally, we discuss the details of the review process and analyze possible future directions for test assessment in Spain.

 

Paper 4

Test reviewing in Germany

Carmen Hagemeister (Technische Universität Dresden, Germany)

Martin Kersting (Justus-Liebig-Universität, Gießen, Germany)

Gerhard Stemmler (Universität Marburg, Germany)

The German test review system is based on the German standard DIN 33430, "Requirements for proficiency assessment procedures and their implementation". This standard makes demands on the documentation of instruments (test manuals), demands that can be applied to any field of assessment. The German test review system and the process of test reviewing are described. Contrary to the EFPA and the Dutch test review systems, the German test review system does not assign ratings to fixed sizes of reliability or validity coefficients. The reasons for this difference are explained. Practical matters are explicated, as is what would be necessary to establish a new test-reviewing culture with the present system.

 

Paper 5

Tests in Europe. Where we are and where we should go

Paula Elosua (Universidad del Pais Vasco/University of the Basque Country, Spain)

Dragos Iliescu (National School of Political and Administrative Studies, Bucharest, Romania)


Psychometric practice does not always converge with the advances of psychometric theory. In order to investigate this gap, the authors focus on the 10 most used psychological tests in Europe, identified by recent surveys. The paper analyzes test manuals published in 6 different European countries for these 10 most used tests. A total of 32 test manuals (11 cognitive ability tests, 7 personality measures and 14 clinical measures) are analyzed in terms of their congruence with the latest precepts of the Joint Standards for Educational and Psychological Testing and the Guidelines of the International Test Commission. These two documents are seen as reflecting the latest accomplishments in psychometric theory and as sound recommendations of best psychometric practices. Issues related to reliability, measurement error, consistency as related to a measurement model, validity, validation procedure, scales, norms and score comparison, and test adaptation are analyzed for each test manual. The data show a gap between psychometric practice and psychometric theory. The authors try to explain the reasons for this gap and suggest ways of closing it in the future.

 

Discussant

Dave Bartram (SHL Group, UK)

 

 

Symposium 13 / Wednesday, 4th July / 11.00-12.15 / Room: Leeszaal

Developments in the Netherlands Part 2 (COTAN)

Chair and discussant

Arne Evers (University of Amsterdam, The Netherlands)

 

Symposium Abstract

The Dutch Committee on Tests and Testing (COTAN) was founded by the Dutch Association of Psychologists (NIP) in 1959 with the mission "to promote the better use of better tests". Representatives of the departments of psychology of all ten Dutch universities and representatives of psychologists practicing in various areas of psychology are selected as members. The most important criterion for election to COTAN membership is expertise in the field of test theory and test construction. The two symposia organized by COTAN showcase some of the expertise of COTAN members. The first symposium deals with three recent test development projects: the development of a dynamic test with tangible electronics for children, the utility of the Dutch adaptation of the WPPSI-III, and research on distress-induced eating with the Dutch questionnaire for eating disorders. The second symposium addresses more general issues concerning test construction: the influence of motivation on the scores of ability and performance tests, the pros and cons of using short tests, score differences on intelligence tests for various ethnic groups, and the relevance of (group) reliability coefficients for individual measurement.

 

Paper 1

Dynamic testing with tangible electronics: Measuring strategy change in solving series completion tasks

Wilma C. M. Resing (Leiden University, Developmental and Educational Psychology, The Netherlands)

 

In the recent past, various dynamic testing procedures have been developed from the perspective that cognitive/educational testing should not focus exclusively on the end result of previous learning, but mostly on the ability to learn, or learning as it occurs. Electronic tools, utilizing testing designs underpinned by graduated prompting, are assumed to offer opportunities to gain insight into how learning processes occur and vary within and between individuals. Interfaces using concrete materials, such as cars, Playmobil figures, or blocks, combined with light and speech technology and based on fine-grained cognitive task analyses, have much potential. These can be used both to provide adaptive prompting during learning and to measure details of children's problem-solving. The objective of our current research programme was to explore whether dynamic testing, incorporating a series of structured hints/prompts, can provide insights into both the learning processes and the potential of children. A total of 72 second-grade children were given series completion tasks. The study employed a pretest-post-test control group randomized block design, with two training sessions between pre- and post-test. Half of the children were allocated to the experimental condition, the other half to the control condition. Experimental-group children were involved in the study on four occasions; those in the control group were seen twice (they did not receive the two training sessions). Special attention will be paid to the dynamic testing procedure with electronic tangibles. Comparisons will be made between findings from dynamic and static testing in children. In more detail, changes in individual solving and learning strategies will be presented.

 

Paper 2

The utility of Wechsler's test of intelligence in preschool children

Petra P.M. Hurks (Maastricht University, The Netherlands)

 

Over the past decades, research on the importance of early experiences for later development has led to a more intense focus on early (cognitive) childhood development and early intervention (Garred & Gilmore, 2009). In this context, Wechsler's intelligence test for preschoolers has a long history, starting in 1967. In this presentation, its most recent version (i.e., the Dutch WPPSI-III) will be evaluated. A general description of this individually administered test of cognitive ability is provided, as well as brief information on its historical, conceptual, and theoretical background, its technical qualities, and the meaning of its results. Also, the WPPSI-III is compared to other tests of general cognitive ability, suitable for the preschool child and widely known in the fields of (school) psychology and neuropsychology. Finally, the value of testing the intelligence of preschool children in general, and in particular by use of the Dutch WPPSI-III, will be discussed.

 

Paper 3

Moderation of distress-induced eating by emotional eating scores

Tatjana van Strien (Behavioural Science Institute and Institute for Gender Studies, Radboud University Nijmegen, The Netherlands)

 

Earlier studies assessing the possible moderator effect of self-reported emotional eating on the relation between stress and actual food intake have obtained mixed results. The null findings in some of these studies might be attributed to misclassification of participants due to the use of the median splits and/or insufficient participants with extreme scores. The objective of the two presented studies was to test whether it is possible to predict distress-induced eating with a self-report emotional eating scale by using extreme scorers. In study 1 (n = 45) we used a between-subjects design and emotional eating was assessed after food intake during a negative or a neutral mood (induced by a movie). In study 2 (n = 47) we used a within-subjects design and emotional eating was assessed well before food intake, which occurred after a control or stress task (Trier Social Stress Task; Kirschbaum, Pirke & Hellhammer, 1993). The main outcome measure was actual food intake. In both studies self-reported emotional eating significantly moderated the relation between distress and food intake. As predicted, low emotional eaters ate less during the sad movie or after stress than during the neutral movie or after the control task, whereas high emotional eaters ate more. No such moderator effect was found for emotional eating in the entire sample (n = 124) of study 1 using the median-split procedure or the full range of emotional eating scores. We conclude that it is possible to predict distress-induced food intake using self-reports of emotional eating provided that the participants have sufficiently extreme emotional eating scores.
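
The moderation analysis described above amounts to testing a stress × emotional-eating interaction on food intake. A minimal sketch of such a model is given below (hypothetical file and variable names; this is not the authors' actual analysis code):

    import pandas as pd
    import statsmodels.formula.api as smf

    # Assumed layout: one row per participant with columns
    #   intake      - amount eaten (e.g., grams)
    #   stress      - 0 = control/neutral condition, 1 = stress/sad condition
    #   emo_eating  - self-reported emotional-eating score (e.g., a centred DEBQ-style scale)
    df = pd.read_csv("eating_study.csv")  # hypothetical data file

    # Moderation is carried by the interaction term stress:emo_eating
    model = smf.ols("intake ~ stress * emo_eating", data=df).fit()
    print(model.summary())  # a significant interaction indicates moderation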

 

 

Symposium 14 / Wednesday, 4th July / 11.00-12.30 / Room: Bestuurskamer

Cross-cultural assessment of psychopathology

Chair

Frederick Leong (Michigan State University, USA)

Zornitsa Kalibatseva (Michigan State University, USA)

 

Symposium abstract

This symposium focuses on the cross-cultural assessment of a wide range of psychopathology both within countries and across countries. The first paper examines the assessment of depression and the endorsement of specific depressive symptoms among Asian Americans and European Americans in the USA. The second study investigates cultural differences in anxiety measurement among three different cultural groups in Romania. The third paper analyzes the rates of schizotypal experiences and the psychometric characteristics of an assessment instrument (ESQUIZO-Q) of schizotypal experiences among nonclinical adolescents in Spain. Lastly, the fourth presentation employs a structural equation modelling (SEM) approach within the framework of a higher-order confirmatory factor analytic (CFA) model to test the Beck Depression Inventory for equivalence across Canadian, Swedish, and Bulgarian nonclinical adolescents.

 

Paper 1

Assessment of depression among Asian Americans and European Americans

Zornitsa Kalibatseva (Michigan State University, USA)

Frederick Leong (Michigan State University, USA)

Neal Schmitt (Michigan State University, USA)

 

The impact of culture, race, and ethnicity on the expression and experience of mental disorders is a field that has received increasing attention in the last decade. Previous research has suggested that the nature of depression among Asian Americans may differ from that among European Americans. This study used data from the National Latino and Asian American Study (NLAAS) to assess depressive symptomatology among Asian Americans diagnosed with Major Depressive Disorder (DSM-IV-TR). The frequency of depressive symptoms among Asian Americans (N = 189) was explored and compared to that of European Americans (N = 1,202) from the National Comorbidity Survey-Replication (NCS-R). In addition, the phenomenology of depression among Asian Americans was examined in relation to gender, acculturative stress, and language proficiency. The comparison of depressive symptom profiles between Asian Americans and European Americans revealed various similarities and differences between the two groups. IRT analyses revealed differential item functioning (DIF) for various symptoms. In particular, European Americans may endorse a variety of affective symptoms more easily than Asian Americans, while Asian Americans may more easily endorse some of the somatic symptoms. Implications for the assessment, diagnosis, and treatment of depression among Asian Americans are discussed.

 

Paper 2

Cultural differences in anxiety measurement - the Romanian multicultural environment

Andrei Ion (SNSPA University, Romania)

Dragos Iliescu (Testcentral, Romania)

 

Cross-cultural differences in anxiety have been investigated in several studies (Boehnke, Frindte, Reddy, & Singhal, 1993; Ginter, Glanser, & Richmond, 1994; Klonoff & Landrine, 1994; Magnusson & Stattin, 1978; Mumford, 1993). The current paper aims to explore cultural differences in anxiety measurement across three different cultural groups in Romania (Roma, Hungarians, and Romanians). Across the three cultural groups, anxiety was measured with different psychometric instruments: the STAI - State-Trait Anxiety Inventory (Spielberger, 1983), the MCMI-III - Millon Clinical Multiaxial Inventory, Anxiety scale (Millon, Millon, Davis, & Grossman, 1996), and the NEO PI-R - Neuroticism Extraversion Openness Personality Inventory Revised, Anxiety facet (Costa & McCrae, 1992). The data analysis focused both on exploring within- and between-group differences and on establishing measurement invariance across the three samples.

 

Paper 3

New developments in the assessment of schizotypal experiences

Eduardo Fonseca-Pedrero (University of La Rioja, Spain; Spanish Center for Biomedical Research in Mental Health - CIBERSAM, Spain)

José Muñiz (University of Oviedo, Spain)

Mercedes Paino (Department of Psychology, University of Oviedo, Spain; Spanish Center for Biomedical Research in Mental Health - CIBERSAM, Spain)

Serafín Lemos-Giráldez (Department of Psychology, University of Oviedo, Spain; Spanish Center for Biomedical Research in Mental Health - CIBERSAM, Spain)

 

Schizotypal experiences represent the behavioural expression of liability for psychotic disorders in the general population. Different measurement instruments exist for the assessment of schizotypal experiences in both adults and adolescents. The main goal of this study was to analyze the rates of schizotypal experiences and the internal structure and reliability of the Oviedo Questionnaire for Schizotypy Assessment (ESQUIZO-Q) in nonclinical adolescents. The final sample consisted of 3,056 participants, 1,469 of them males, with a mean age of 15.9 years (SD = 1.2). The results indicated that schizotypal experiences are very common in this age group. The analysis of the internal structure underlying the ESQUIZO-Q subscales revealed a three-factor solution comprising the following components: Reality Distortion, Anhedonia, and Interpersonal Disorganization. The levels of internal consistency for the subscales of the ESQUIZO-Q were acceptable. The ESQUIZO-Q is a brief and easily administered self-report with adequate psychometric characteristics for the assessment of schizotypal experiences in nonclinical adolescents. Future studies should explore the psychometric properties of the ESQUIZO-Q in more depth (e.g., predictive validity), as well as the development of computerized adaptive versions.

 

Paper 4

Exemplified measurement and structural nonequivalence: The BDI in cross-cultural perspective

Barbara M. Byrne (University of Ottawa, Canada)

 

Critical to multiple-group comparisons in general and cross-cultural comparisons in particular, is knowledge that the instrument of measurement is operating equivalently across groups. That is, perception of the item content, as well as meaningfulness and dimensional structure of the underlying construct are group-invariant. Based on a structural equation modeling (SEM) approach within the framework of a higher-order confirmatory factor analytic (CFA) model, and addressing both the non-normality and ordinality of the data, results from tests for equivalence of the Beck Depression Inventory (BDI; Beck, Ward, Mendelson, & Erbaugh, 1961) across Canadian (N = 658), Swedish (N = 661) and Bulgarian (N = 691) nonclinical adolescents exemplify not only how, but also the extent to which a measuring instrument can vary across groups. Results are presented from both the traditionally statistical and the more recent empirically practical perspectives. Discussion focuses on possible theoretical, methodological, and statistical reasons for these reported measurement and structural discrepancies.

 

Discussant

Thomas Oakland (University of Florida)

 

 

Symposium 15 / Wednesday, 4th July / 11.00-12.30 / Room: Raadszaal

Current issues in gifted children identification and assessment

Chair

Jacques Grégoire (Catholic University of Louvain, Belgium)

 

Symposium abstract

The understanding of giftedness has advanced significantly in the last twenty years. It is currently seen as a complex human characteristic, involving much more than an exceptional global intelligence. New models of giftedness were developed, which have several consequences for the identification and the assessment of gifted children. In this symposium, we will present and discuss four issues related to giftedness and their implications for assessment: creativity, emotional competences, motivation and cognitive heterogeneity.

 

Paper 1

Creative giftedness: Its nature and measure

Todd Lubart (Paris Descartes University, France)

Maud Besançon (Université Paris Descartes, LATI)
Baptiste Barbot (Yale University, Child Study Center)

 

Creative giftedness is defined as the ability to produce original, contextually appropriate new ideas. These creative productions may be artistic, literary, scientific, musical, or social problem solutions, among others. Creative giftedness refers to the potential for creative work rather than to creative accomplishments or achievement itself. This creative potential is distinct from classic intellectual ability and "academic" giftedness. Based on more than a century of research, a two-process model of creative potential, involving divergent-exploratory and convergent-integrative thinking, is proposed. This approach integrates cognitive, conative, and emotional factors as the basis of each processing mode. A new tool, the Evaluation of Potential for Creativity (EPoC), is presented. It allows the divergent-exploratory and convergent-integrative processes to be measured in the domain of creative activity. Thus a profile of creative potential can be established for children and adolescents, allowing identification of creative giftedness as well as the development of creative talent. EPoC was developed initially in France and is now the subject of international adaptations; psychometric results concerning EPoC will be discussed.

 

Paper 2

Evaluation of gifted students' emotional competences: How it can help to understand academic performance and social well-being

Sophie Brasseur (University of Namur, Belgium)

Jacques Grégoire (Catholic University of Louvain, Belgium)

 

In the literature, gifted people are often described as hypersensitive. Some authors have even proposed to include this trait as an identifying characteristic of giftedness. If this is true, such hypersensitivity should be associated with a specific emotional profile in gifted children. Many clinicians have reported that gifted people show difficulties in emotional regulation. However, only a few studies have actually examined the functioning of gifted people in terms of emotional competences. Results of our research on this issue will be presented, illustrating the multiplicity of existing profiles in this population. In the field of education, several studies have discussed how emotional competences (EC) could help students to improve their academic performance. For instance, good EC could help them to be more confident and to handle stressful situations (e.g., exams) efficiently. In this presentation, we will show that gifted students with high EC are more efficient than gifted students with low EC. We will also discuss how, and which kinds of, EC are involved in this case. Indeed, for gifted adolescents, the dimensions of EC that help them to succeed seem to be specific. In this presentation, the contribution of the evaluation of emotional competences to the identification of giftedness will also be discussed.

 

Paper 3

Evaluation of gifted students' motivation: More than to be or not to be motivated

Catherine Cuche (Catholic University of Louvain, Belgium)

Jacques Grégoire (Catholic University of Louvain, Belgium)

 

In the scientific literature on giftedness, two opposite categories of population are often investigated: successful students and underachievers. As we know, motivation is central to the learning process and school achievement. Therefore, motivation is also helpful in explaining the differences between successful gifted students and underachieving gifted students (Baker, Bridger, & Evans, 1998; McCoach & Siegle, 2003). The purpose of this presentation is to identify the specificities of gifted students' motivation, and to stimulate researchers and educators to pay attention to this variable as a multi-faceted factor rather than a monolithic one. The population we studied consisted of gifted young adults who completed secondary school with or without failures. A thematic analysis of their narratives about their motivation was conducted. Results showed that the subjective value they associated with the tasks (Wigfield & Eccles, 1983, 2002) and the way they constructed and protected their self-efficacy beliefs (Bandura, 1989, 2003) determined several motivational profiles. Moreover, a high or a low motivational level may sometimes hide important traits, like unhealthy perfectionism. For giftedness identification and counselling, these findings suggest that it is important to measure more than the intellectual level, and that motivational profiles should also be taken into account within a global approach to the individual.

 

Paper 4

A developmental model for gifted children identification

Jacques Grégoire (Catholic University of Louvain, Belgium)

 

Current intelligence tests based on hierarchical factorial models, such as the Cattell-Horn-Carroll (CHC) model, allow the measurement of a wide span of broad abilities, in addition to the classical IQ. The WISC-IV and KABC-II are well-known examples of recent tests based on the CHC model of intelligence. These tests have shown that a large number of children, previously identified as gifted using only their IQ, have heterogeneous ability profiles. At the same time, several models of giftedness have emphasized the importance of non-intellectual characteristics, such as motivation (e.g., Renzulli) or creativity (e.g., Sternberg), in its definition. This widening of the psychological picture of gifted children challenges the use of the classical criterion for identifying gifted children, i.e., an IQ of at least 130. Therefore, psychologists who have to identify gifted children are puzzled. What are the appropriate indices of giftedness? We answer this question by referring to an integrated developmental model of giftedness in which the concepts of "potential", "competency" and "performance" are clearly specified. Based on this model, we emphasize the most appropriate criteria for correctly identifying gifted children.

 

Discussant

Pending

 

 

Symposium 16 / Wednesday, 4th July / 13.45-15.15 / Room: Leeszaal

Test progress in Iberian Latin American countries: Trends and opportunities for international collaboration

Chair

Solange Wechsler (Pontifical Catholic University of Campinas, Brazil)

 

Symposium abstract

Test development and use in Iberian Latin American countries (e.g., Brazil, Portugal, and Spain) have progressed considerably during the last ten years. Initially, psychological tests were imported and translated. However, new efforts have focused on test construction and validation in light of each country's needs and cultural conditions. This work has resulted in increasing both the number and quality of psychological tests available in each country. Cross-national and international opportunities exist to collaborate, combining experiences between historic centres of test leadership and those emerging in Iberian Latin American countries, including the development and use of tests for those fluent in Portuguese and Spanish. The purposes of this symposium are to describe the status of test development and use in Brazil, Portugal, and Spain and to suggest ways historic and emerging centres for test development and use may work together.

 

Paper 1

The impact of Brazilian test movement: Possibilities for international partnerships

Solange Wechsler (Pontifical Catholic University of Campinas, Brazil)

 

The test movement in Brazil can be regarded as having gone through three big waves. In the first wave, tests were highly valued and imported from other nations, mainly the U.S. In the second wave, tests were criticized as not being relevant to the country's characteristics. Finally, in the third wave, laboratories for test construction were organized at universities, a national Institute on Psychological Assessment was founded by researchers, and a regulation passed by the Federal Council of Psychologists required scientific quality for all tests used in the country. As a consequence, a large number of psychological, educational and neurological tests have been developed in the last decade. This recent trend indicates various possibilities for international collaboration, through test companies, book publications, cross-national research and other related organizations for assessment.

 

Paper 2

Test development in Portugal

Leandro Almeida (University of Minho, Portugal)

Amanda Franco (University of Minho, Portugal)

 

Ethnic and cultural diversity in Portugal is long-standing and increasing. The discipline and practices of psychology within Portugal have transitioned from an emphasis on psychoanalysis to more modern models that consider social and other contextualized influences on behavior. This new emphasis has engendered the need for new tests and other assessment methods and recognizes the more specialized practices of psychologists. Psychologists are requesting revised, new and culturally relevant standardized instruments. Collaboration among those engaged in test development in the Ibero-American and Latin American countries can be extremely important. The purpose of this presentation is to discuss ways in which cooperation, leading to synergy, may occur.

 

Paper 3

Past, present and future uses of psychological tests in Spain

Maria Péres Solis (Complutense University, Madrid, Spain)

 

Psychologists in Spain have worked to establish close relationships between psychological evaluations, appropriate use of assessment instruments, and proposals for effective intervention tailored to the needs of people and contexts. Test development has increased considerably in Spain, paralleling efforts in other Latin American countries. Much of this development can be traced to the transformation of the Spanish Society of Psychological Evaluation's involvement in the European Association of Psychological Assessment. These issues will be discussed in greater detail.

 

Paper 4

Test progress and needs for Portuguese gifted children

Margarida Pocinho (University of Madeira- Portugal)

 

In Portugal, the concept of giftedness is not static; it is in constant evolution, and is characterized by diverse variables beyond cognitive and intellectual capacities. According to the World Council for Gifted and Talented Children, giftedness can be conceptualized as high performance or potential in any of the following elements, in isolation or in combination: general intellectual capacity, specific academic skill, creative or productive thinking, special talent for the visual, dramatic and musical arts, motor skills, and leadership. Thus, the variety of concepts demonstrates a multiplicity of criteria that should be considered in diagnosing giftedness. As a result, a broad range of psychological and educational assessment instruments have been developed in Portugal in the last 15 years. However, it is necessary to join forces and establish common guidelines at the national and international levels, involving different agents, procedures and tests, to assess giftedness under multidisciplinary approaches.

 

Discussant

Thomas Oakland (University of Florida, USA)

 

 

Symposium 17 / Wednesday, 4th July / 13.45-15.15 / Room: Bestuurskamer

Cross-cultural comparability of noncognitive assessments - construct and method issues

 

Chair

Jonas P. Bertling (Educational Testing Service, USA)

Patrick C. Kyllonen (Educational Testing Service, USA)

 

Symposium abstract

For noncognitive assessments administered internationally, such as personality tests for multinational employment selection or background questionnaires for international comparative studies (e.g., PISA), we assume that assessments are comparable in meaning cross-culturally (van de Vijver & Leung, 2000). But culture interacts with respondents' interpretations and response styles, diminishing our understanding of the relationships between noncognitive factors and educational or workforce outcomes (Hui & Triandis, 1985; van de Vijver et al., 2008). This is especially important in cases where Likert scales are used, which is very common for noncognitive assessments, both in education (Buckley, 2009) and the workforce (Ziegler, MacCann, & Roberts, 2011). The purpose of this symposium is to present findings from a variety of international educational and workforce studies that have used innovative new methods designed to detect and correct for cultural differences in response style, and thereby reduce that threat to the validity of international noncognitive assessments.

 

Paper 1

The overclaiming technique as a new method for international educational assessments

Patrick C. Kyllonen (Educational Testing Service, USA)

Richard D. Roberts (Educational Testing Service, USA)

Jonas P. Bertling (Educational Testing Service, USA)

 

The overclaiming technique (OCT; Paulhus, Harms, Bruce & Lysy, 2003; see also Zimmerman, Broder, Shaughnessy, & Underwood, 1977) is a method that can be used to estimate both respondents' concept familiarity and their tendency to overstate what they know. It does this by collecting recognition judgments for intermixed concepts and foils. We created an OCT test to estimate secondary students' familiarity with mathematics concepts (e.g., polynomial function) on an international large-scale mathematics assessment by intermixing such concepts with foils (e.g., proper number). For each concept and foil, presented in a list in a survey booklet, participants indicated familiarity on a 5-point scale (from "never saw it" to "very familiar with it"). Not surprisingly, correct recognition was found to be highly correlated with mathematics achievement scores at both the individual and country levels. However, susceptibility to overclaiming (the tendency to claim familiarity with a foil concept) was found to be highly negatively correlated with achievement at the country level: high-achieving countries were, on average, less susceptible to overclaiming. We review these findings, discuss possible implications, including for conditioning responses to other self-report questions, and present the results of such conditioning.
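
The OCT is typically scored by contrasting claims of familiarity with real concepts against claims about foils, often with signal-detection indices. A minimal sketch under that assumption, dichotomising the 5-point ratings at an arbitrary threshold (data layout hypothetical, not the authors' scoring code):

    import numpy as np
    from scipy.stats import norm

    def overclaiming_indices(ratings, is_foil, threshold=3):
        # ratings: 1-5 familiarity ratings for all listed terms
        # is_foil: True where the term is a foil, False for a real concept
        claimed = np.asarray(ratings) >= threshold                     # "claims to know it"
        foil = np.asarray(is_foil, dtype=bool)
        hit_rate = claimed[~foil].mean()                               # claims on real concepts
        fa_rate = claimed[foil].mean()                                 # claims on foils (overclaiming)
        hit_rate, fa_rate = np.clip([hit_rate, fa_rate], 0.01, 0.99)   # avoid infinite z-scores
        d_prime = norm.ppf(hit_rate) - norm.ppf(fa_rate)               # accuracy index
        bias_c = -0.5 * (norm.ppf(hit_rate) + norm.ppf(fa_rate))       # bias: lower = more claiming
        return d_prime, bias_c

Country-level means of the bias index would then correspond to the "susceptibility to overclaiming" referred to above.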

 

Paper 2

A multicountry comparison of social desirability with a shortened and simplified Marlowe-Crowne Social Desirability Scale

Jia He (Tilburg University, The Netherlands)

Byron Adams (Tilburg University, The Netherlands)

Fons van de Vijver (Tilburg University, The Netherlands)

 

Social desirability has long been considered a validity threat in surveys. The meaning of social desirability is unclear, notably in cross-cultural contexts. It is conceptualized as multidimensional (e.g., Paulhus, 1991), but in most applications it is treated as unidimensional. In this paper we addressed the factor structure of the Marlowe-Crowne Social Desirability Scale, its cross-cultural equivalence, and the patterning of country means. We administered a 15-item simplified Marlowe-Crowne Social Desirability Scale to highly educated young people in 7 countries (Bulgaria, China, Indonesia, Kenya, Mexico, the Netherlands, and South Africa). Exploratory factor analysis suggested a two- or three-factor model; both solutions showed satisfactory structural equivalence in most countries. In the two-factor model, we identified a positive and a negative impression management dimension, whereas in the three-factor model, positive impression management split into a social (e.g., I help others in trouble) and a personal dimension (e.g., I am careful about my way of dressing). There were significant differences on every dimension across countries, yet these did not correlate with GDP per capita or the Human Development Index. We conclude that social desirability is multidimensional and that, contrary to previous studies, cross-cultural differences in social desirability were unrelated to country-level affluence.
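
The abstract does not state which index was used to gauge structural equivalence; Tucker's congruence coefficient is the usual choice in this literature, and a minimal computation is sketched below (factor loadings per group assumed to be already estimated and matched; illustrative only):

    import numpy as np

    def tucker_phi(loadings_a, loadings_b):
        # Congruence between one factor's loadings in two groups;
        # values of roughly .90-.95 or higher are conventionally read as
        # indicating factorial similarity (a rule of thumb, not a result of this study)
        a, b = np.asarray(loadings_a), np.asarray(loadings_b)
        return np.sum(a * b) / np.sqrt(np.sum(a ** 2) * np.sum(b ** 2))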

 

Paper 3

Similarities and differences between normative and ipsative scale personality data across countries

Dave Bartram (SHL Group Ltd, UK)

 

Bartram (in press) has shown that much of the variance in country-level scale means and SDs for the Big Five can be accounted for by country-level metrics. These metrics include Hofstede's cultural dimensions and 'performance' metrics, such as GDP, life expectancy, quality of educational provision and global competitiveness. He compared the results obtained from Big Five measures based on the OPQ32i, which uses forced-choice self-report items and produces ipsative primary scale scores, with Big Five data from other researchers who have used Likert-rating format items, with self- and with peer-report, to produce normative scale scores. Differences in the results of these large-scale studies are difficult to attribute to data collection method or instrument format effects, as many other variables are also involved. In the present paper, comparisons will be drawn between data that have been both ipsatively scaled and normatively scaled, with both sets of scale scores being drawn from each of two related forced-choice item format instruments: the OPQ32i and OPQ32r. This provides us with the ability to compare data from two time periods using two similar forced-choice item formats scored, in each case, both normatively (using a multidimensional IRT scoring model) and ipsatively.
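
As a toy illustration of why ipsative and normative scores behave differently (this mimics the constant-sum property of forced-choice scoring with simple within-person centring; it is not the OPQ32 scoring model):

    import numpy as np

    rng = np.random.default_rng(0)
    # Hypothetical normative scores on five scales for 1,000 respondents
    normative = rng.normal(loc=[3.2, 3.8, 3.5, 3.0, 3.6], scale=0.5, size=(1000, 5))

    # Crude ipsatisation: centre each respondent on his or her own mean
    ipsative = normative - normative.mean(axis=1, keepdims=True)

    print(normative.mean(axis=0))    # scale means retain overall elevation differences
    print(ipsative.mean(axis=0))     # elevation is removed; only relative ordering remains
    print(ipsative.sum(axis=1)[:3])  # each respondent's scores now sum to (approximately) zero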

 

Paper 4

Can anchoring vignettes enhance the cross-cultural comparability of student background questionnaires?

 

Jonas P. Bertling (Educational Testing Service, USA)

Patrick C. Kyllonen (Educational Testing Service, USA)

Richard D. Roberts (Educational Testing Service, USA)

 

In PISA and other large-scale assessments, a quite robust phenomenon is a reliable difference between individual-level and national-level relationships for certain questionnaire scales (e.g., Loveless, 2006). A question is whether this reflects true differences between countries in attitudinal factors (e.g., high-achieving countries having worse attitudes towards mathematics) or merely a method artifact. If cross-cultural differences in survey response styles are not considered, and existing response styles are not corrected for, secondary analysts who use attitudinal data are at risk of reaching erroneous conclusions (Bartram, 2009). There can be, for instance, considerable differences in how students from different countries interpret the response scale. Anchoring vignettes (e.g., Wand & King, 2007) have been used successfully in various fields of survey research, but so far not in educational large-scale assessments. In this paper, we compare different anchoring methods (e.g., nonparametric vs. parametric) based on data from the PISA 2012 Field Trial. Results indicate that anchoring-vignette-type items could considerably improve the cross-cultural comparability of student background questionnaire scales. Substantial gains in measurement precision and validity could be achieved. Correlations with proficiency aligned strongly at the individual and the country level for anchored, but not for unanchored, Likert-scale responses.
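
As background, the nonparametric anchoring approach rescales each self-report relative to the respondent's own ratings of ordered vignettes. A minimal sketch with hypothetical inputs (the parametric methods compared in the paper are model-based and are not shown here):

    def anchor_nonparametric(self_rating, vignette_ratings):
        # vignette_ratings: the respondent's ratings of the vignettes, ordered
        # from the lowest to the highest vignette level.
        # Returns the anchored category C: 1 = below the lowest vignette,
        # 2 = equal to it, 3 = between the first and second, ..., 2K+1 = above the highest.
        # Ties and order violations are handled only crudely here.
        c = 1
        for rating in vignette_ratings:
            if self_rating > rating:
                c += 2
            elif self_rating == rating:
                c += 1
                break
        return c

    # Two respondents give the same raw self-rating (4 on a 1-5 scale) but rate
    # the vignettes differently, so their anchored categories differ:
    print(anchor_nonparametric(4, [2, 3]))  # 5: above both vignettes
    print(anchor_nonparametric(4, [4, 5]))  # 2: level with the milder vignette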

 

Discussant

Johnny Fontaine (Ghent University, Belgium)

 

 

Symposium 18 / Wednesday, 4th July / 13.45-15.15 / Room: Raadszaal

So Short! That Valid?

Chair

Rab MacIver (Brunel University, UK)

 

Symposium abstract

The seminal articles of Burisch (1984, 1997) clearly outlined measurement scale construction principles and urged test authors to aim for short and sharp assessment scales. The symposium offers an overview of development principles and their applications to the Behaviour, Ability and Global (B-A-G) components of the Saville Consulting Wave® Performance Culture Framework. With a view to Behaviour prediction, MacIver et al. show that halving the length of a set of 36 six-item scales markedly reduces internal consistency while maintaining high point-to-point validities. On the prediction of Abilities, Kurz demonstrates how measures of specific verbal, numerical and diagrammatic reasoning aptitude domains can be summarised through ‘Swift’ assessment of the higher-order ‘Analysis Aptitude’ construct while retaining most of the validity of longer measures. Comparing various criterion measures, Hopton finds that a succinct 3-item ‘Global’ measure attracts the best prediction through a Professional Styles ‘Great 8’ Total predictor composite. The discussant Helen Baron will offer views based on more than 25 years of experience in the development of psychometric assessment tools.

 

Paper 1

Why short scales can be more valid than long ones

Matthias Burisch (University of Hamburg, Germany)

 

In most life situations, more tends to be more: think health, wealth, or wisdom. Making test scales longer typically enhances their reliability. Not so with validity: somewhat counter-intuitively, adding items beyond some optimal number can be detrimental. This presentation will briefly review the evidence in favour of short scales and explain why the general rule does not always hold here. The hallmarks of inductive, deductive and criterion-centric development approaches will be reviewed, and the issues illustrated through a study in which a validity optimisation algorithm is used to construct short scales out of a lengthy personality measure that, following cross-validation, on average outperform the original scales.

 

Paper 2

Lowering internal consistency and maintaining criterion-related validity

Rab MacIver (Brunel University, UK)

Neil Anderson (Brunel University, UK)

Arne Evers (University of Amsterdam, The Netherlands)

Ana Cristina Costa (Brunel University, UK)

Jake Smith (Saville Consulting, UK)

 

Can behavioural questionnaire scales be shortened, reducing internal consistency, while still maintaining criterion-related validity? This would allow shorter scales to be developed which have lower internal consistency than conventional scales but still possess criterion-related validity. Part of the Wave Professional Styles standardisation included two Work Strengths assessments composed of self-efficacy/talent items forming 36 scales (n = 1,153). The two parallel forms are designed to be used separately. The two questionnaires were co-standardised, and a concurrent validity study was conducted on the same sample, with a proportion of the respondents being independently rated on 36 criteria (n = 500-632). It was possible to combine the talent items in both questionnaires to create 36 scales, each with six items (three items from each form). The average internal consistency of these six-item scales was 0.779, and the average convergent criterion-related validity of the 36 talent scales with 36 matched behavioural criteria was 0.204. The two forms were then considered as two separate questionnaires with three items per scale. The average internal consistency for the three-item scales dropped to 0.551, yet the average criterion-related validity was relatively unchanged at 0.193. The results are in line with the hypothesis that criterion-related validity stays relatively constant when content coverage does not change. The results are discussed in relation to the use of shorter scales and how to maximise the validity a test user receives. Scale length and the over-reliance on internal consistency as a key marker of scale adequacy are also discussed.
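To make the classical-test-theory arithmetic behind this comparison explicit, a short worked example can be added; the numbers are the reliabilities and validities reported above, and the formulas are the standard Spearman-Brown and attenuation expressions, offered here as an illustration rather than as the authors' own derivation.

\text{Spearman-Brown (halving the scale, } k = 0.5\text{):}\quad
\rho^{\text{short}}_{XX'} = \frac{k\,\rho_{XX'}}{1+(k-1)\,\rho_{XX'}}
= \frac{0.5 \times 0.779}{1 - 0.5 \times 0.779} \approx 0.64,

somewhat above the observed .551, since the two three-item halves are not strictly parallel. If validity were constrained only by predictor reliability, the attenuation formula would predict

r^{\text{short}}_{XY} \approx r_{XY}\sqrt{\rho^{\text{short}}_{XX'}/\rho_{XX'}}
= 0.204 \times \sqrt{0.551/0.779} \approx 0.17,

whereas the observed .193 remains close to the original .204, consistent with the argument that criterion-related validity tracks content coverage more than internal consistency.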

 

 

Paper 3

Swift yet valid aptitude assessment

Rainer Kurz (Kingston University, UK)

 

Kurz (2000) observed a positive manifold across Verbal, Numerical, Diagrammatic, Checking, Spatial and Mechanical tests, leading to the formulation of the Differential Reasoning Model (DREAM). Building on this model, Kurz et al. (2006) developed and co-standardised Work Aptitudes (WORK) Verbal, Numerical and Diagrammatic Analysis tests with 28 items and a 20-minute time limit each, and devised from a designated item sub-set a short-form Swift Analysis Aptitude (SAA) test with 24 items and an 18-minute time limit (i.e. 8 items in 6 minutes per sub-test). Validities against educational exams (GCSE English, Math & Science points composite; N = 273) reached .62 (uncorrected) for the WORK total and .53 for the SAA item sub-set. Against self-ratings (N = 226) of matched cognitive competencies the values were .40 and .34, and against Solving Problems effectiveness .28 and .20 respectively. The results suggest that the SAA achieves about 80% of the WORK validities with just 30% of the items, and conversely that WORK single tests would achieve about 25% higher validity than the SAA. Similar ratios were found for each sub-test. Where measurement accuracy for a specific aptitude area is of paramount importance and/or generous amounts of assessment time are available, the use of specific WORK tests would be advisable, with the additional benefit of item-type sub-scores. Where a GMA composite is required, candidate time is precious, a multi-stage approach is used (e.g. 'Screen Out' online unsupervised; 'Select In' supervised) and/or other relevant information sources are available (e.g. in Assessment Centres), use of the short SAA measure seems advisable.

 

Paper 4

Criterion measures length & validation

Tom Hopton (Saville Consulting, UK)

 

One of the major difficulties in comparing performance across different jobs is having a coherent, consistent and valid framework of criteria against which to assess performance. This paper asks why many people do not dedicate as much attention to the criterion space as they do to the measurement of predictive dispositional constructs. Although longer, more behaviourally specific criterion measures may seem more useful than shorter, general markers of overall performance, this paper presents evidence showing that a 3-item criterion measure of overall workplace performance can outperform a number of longer criterion measures. 308 individuals completed Wave Professional Styles and engaged one independent rater of their performance to complete a research version of the Saville Consulting Performance 360 assessment. The 360 assessment contained single items representing the Great Eight constructs (Bartram, 2005), as well as the hierarchical Wave Performance Culture Framework of behavioural, ability and global competencies of workplace performance. In this study, the strongest correlation between the predictor measure (Wave Professional Styles) and any independent criterion measure was for the 3-item criterion measure, at .32** (uncorrected). The paper concludes with a discussion of why this criterion measure may have outperformed the longer scales, focusing in particular on its improved breadth of measurement and its relevance to modern workplace performance, and presents some possible implications for future research and practice.

 

Discussant

Helen Baron (BPS, UK)

 

 

Symposium 19 / Wednesday, 4th July / 15.45-17.15 / Room: Leeszaal

Short scales for psychological research - applicability, benefits, and potential limitations

Chairs

Christoph J. Kemper (GESIS - Leibniz Institute for the social sciences, Mannheim, Germany)

Beatrice Rammstedt (GESIS - Leibniz Institute for the social sciences, Mannheim, Germany)

 

Symposium abstract

In recent years, the demand for short measures of psychological constructs has been growing, e.g., in studies conducted online or in large-scale surveys. If online questionnaires are too lengthy, break-off and item nonresponse may increase substantially, leading to lower quality of research data. Due to severe monetary and time constraints, measures used in large-scale surveys have to be short but still reliable and valid indicators of a psychological construct. In this symposium, the construction and validation of three short scales measuring diverse personality and ability constructs will be presented. The fourth presentation will focus on the delicate task of short-scale construction. It reviews different approaches and proposes guidelines for the construction process in order to ensure that short scales are not only economical but also sufficiently reliable and valid. Different aspects of psychometric short scales will be discussed, e.g., their applicability in different contexts, their benefits, and potential limitations.

 

Paper 1

Are you optimistic or pessimistic, or both? - The Life Orientation Test-Revised

Markus Zenger (University of Leipzig, Germany)

Andreas Hinz (University of Leipzig, Germany)

Yve Stöbel-Richter (University of Leipzig, Germany)

Elmar Brähler (University of Leipzig, Germany)

 

The Life Orientation Test-Revised (LOT-R) was developed to measure the construct of optimism, defined as generalized outcome expectancy. Although the LOT-R was originally designed as a uni-dimensional scale with optimism and pessimism as antipoles, several studies based on specific samples have shown that optimism and pessimism can be seen as distinct dimensions. With data from a representative sample of the German general population (N = 2,372, ages 18-93 years), the factorial structure of the LOT-R was tested using confirmatory factor analysis. Several model fit indices indicate that a bi-dimensional structure of the LOT-R fits the data much better than the uni-dimensional structure. In addition, the LOT-R can be applied in clinical settings to predict psychosocial outcomes. In a study with 387 cancer patients, LOT-R scores significantly predicted anxiety, depression and health-related quality of life three months later. The predictive power of optimism and pessimism differed between males and females. Furthermore, the brevity of the questionnaire (three items each for optimism and pessimism, plus four filler items) allows an economical assessment of optimism and pessimism for individual screening purposes as well as for epidemiological research.

 

Paper 2

Brief knowledge scales for the measurement of crystallized intelligence

Stefan Schipolowski (Humboldt-Universität zu Berlin, Germany)

Oliver Wilhelm (Ulm University, Germany)

Ulrich Schroeders (Humboldt-Universität zu Berlin, Germany)

Anastassiya Kovaleva (GESIS - Leibniz Institute for the social sciences, Mannheim, Germany)

Christoph J. Kemper (GESIS - Leibniz Institute for the social sciences, Mannheim, Germany)

 

Crystallized intelligence (gc) is a well-established cognitive ability factor that has been conceptualized as reflecting influences of learning, education, and acculturation. Although gc is best measured by capturing knowledge from different domains, in practice gc assessments are often limited to verbal skills (e.g., vocabulary). Based on a large item pool covering declarative knowledge from 16 domains in the natural sciences, the humanities, and civics, we compiled a 32-item gc assessment that can be completed in approximately 10 minutes and includes 2 items from each knowledge domain. Based on a sample of 1,100 German adults covering a broad age range and various educational backgrounds, we derive an ultra-short version requiring 5 minutes of testing time. First, we investigate mean, floor and ceiling effects as a function of context variables. Second, we compare competing measurement models in terms of model fit and reliability of the latent factors. Third, locally weighted measurement models will be used to evaluate the relationship of gc with age and educational background. These analyses are extended with an examination of mean trajectories and differentiation-dedifferentiation of the factor space. The potential and problems of contemporary ability measurement concepts for survey research will be discussed.

 

Paper 3

The Vocabulary and Overclaiming Test (VOC-T)

Matthias Ziegler (Humboldt-Universität zu Berlin, Germany)

Christoph J. Kemper (GESIS - Leibniz Institute for the social sciences, Mannheim, Germany)

Beatrice Rammstedt (GESIS - Leibniz Institute for the social sciences, Mannheim, Germany)

 

The present research aimed at constructing a questionnaire measuring overclaiming tendencies (VOC-T-bias) as an indicator of self-enhancement. An approach was used that also allows estimation of a score for vocabulary knowledge, the accuracy index (VOC-T-accuracy), using signal detection theory. For construction purposes, an online study was conducted with N = 1,176 participants. The resulting questionnaire, named the Vocabulary and Overclaiming Test (VOC-T), was investigated with regard to its psychometric properties in two further studies. Study 2 used data from a population-representative sample (N = 527) and Study 3 was another online survey (N = 933). Results show that reliability estimates were satisfactory for the VOC-T-bias index but less so for the VOC-T-accuracy index. Overclaiming did not correlate with knowledge but was sensitive to self-enhancement, supporting the test score's construct validity. The VOC-T-accuracy index in turn covaried with general knowledge, and more so with verbal knowledge, also supporting construct validity. Moreover, the VOC-T-accuracy index had a meaningful correlation with age in both validation studies. All in all, the psychometric properties can be regarded as sufficient to recommend the VOC-T for research purposes.
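To illustrate the signal-detection logic, a minimal sketch follows, assuming respondents rate familiarity with a mix of real words ("signals") and non-existent foils; the specific indices used here (d' for accuracy, criterion c for bias) are one common choice and not necessarily the exact VOC-T scoring.

from scipy.stats import norm

def overclaiming_indices(hit_rate, false_alarm_rate, eps=1e-3):
    """Signal-detection indices for an overclaiming task.

    hit_rate: proportion of real words the respondent claims to know.
    false_alarm_rate: proportion of non-existent foils claimed as known.
    Rates are clipped away from 0 and 1 so the normal quantiles stay finite.
    """
    h = min(max(hit_rate, eps), 1 - eps)
    f = min(max(false_alarm_rate, eps), 1 - eps)
    accuracy = norm.ppf(h) - norm.ppf(f)        # d': knowledge/discrimination
    bias = -0.5 * (norm.ppf(h) + norm.ppf(f))   # c: overall tendency to claim
    return accuracy, bias

# A respondent claiming 80% of real words and 30% of foils:
print(overclaiming_indices(0.80, 0.30))  # good accuracy, slightly liberal bias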

 

Paper 4

Multi-item scales: Quality criteria and reduction of length

Hilde Tobi (Wageningen University, The Netherlands)

Ynte van Dam (Wageningen University, The Netherlands)

 

 

There are several reasons to reduce scale length, and there are at least as many approaches to scale-length reduction in use. The literature, however, gives little guidance on how to reach the best possible short version. The aim of this paper is to offer some guidance based on three quality criteria: validity, reliability and discriminative power. An inventory is presented of methods, and of measurements used to assess each quality criterion. Each of these methods and measurements seems to favor or discard particular item characteristics. How this results in a particular short version is illustrated with the Connectedness to Nature Scale (Mayer & Frantz, 2004) and the NEP scale, which targets the construct "beliefs" within the Value-Belief-Norm theory (Dunlap, van Liere, Mertig, & Jones, 2000).

 

Discussant

Johannes Lutz (University of Potsdam, Germany)

 

 

Symposium 20 / Wednesday, 4th July / 15.45-17.15 / Room: Bestuurskamer

New developments of observed-score equating

 

Chair

Marie Wiberg (Department of Statistics, Umeå University, Sweden)

Alina von Davier (Educational Testing Service, USA)

 

Symposium abstract

The aim of this symposium is to highlight recent developments in observed-score equating. Equating is influenced by various factors, several of which are addressed in this symposium: how the data are generated, the equating design, the equating procedure, and how the results are reported. The first and second papers both consider a non-equivalent groups with anchor test design. Each paper proposes new observed-score methods within the kernel equating framework in order to improve the equating procedure. The third paper focuses on how data should be generated when evaluating item response theory (IRT) observed-score equating; if the generating model is the same as the equating model, this may influence the results. Finally, the last paper focuses on what should be reported from a test and discusses the possibility of equating subscores and reporting these equated subscores, as opposed to reporting the total score alone.

 

Paper 1

An observed-score local equating method within the Kernel equating framework

Marie Wiberg (Department of Statistics, Umeå University, Sweden)

 

When we have a non-equivalent groups with anchor test (NEAT) design, the information obtained from the observed anchor scores can be used in many ways. A newly proposed approach is a local NEAT equating method (van der Linden & Wiberg, 2010), which yields an equating that fulfills all equating criteria. If this method is integrated with the kernel method of test equating (von Davier, 2011; von Davier, Holland & Thayer, 2004), we obtain a local kernel observed-score NEAT method with a clear connection to classical test theory. The most important connection is the view of observed scores as composed of a true score and a random error. This new method also fulfills all equating criteria, especially equity and population invariance, with which traditional methods have problems. This presentation discusses the usability of observed scores in equating in general and the advantages of a local equating method within the kernel equating framework. The new method is examined in an empirical study and compared with traditional equating methods.
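As background for readers outside the equating literature, the core of any observed-score equipercentile equating, including its kernel version, is the transformation below (a schematic restatement of standard definitions, not of the new local method itself):

e_Y(x) \;=\; F_Y^{-1}\!\bigl(F_X(x)\bigr),

where F_X and F_Y are the cumulative score distributions of the two forms in the target population. In kernel equating these discrete distributions are first continuized with Gaussian kernels, and in the local NEAT variant they are replaced by distributions conditional on the anchor score A = a, giving a family of equating transformations e_Y(x; a) rather than a single curve.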

 

Paper 2

Non-linear Levine observed-score equating method in the (Gaussian) Kernel equating framework

Alina von Davier (Educational Testing Service, USA)

Henry Chen (Educational Testing Service, USA)

 

In a non-equivalent groups with anchor test (NEAT) design, there are several ways to use the information provided by the anchor in the equating process. One of the equating methods available in the NEAT design is the linear observed-score Levine method, which is rooted in classical test theory and uses estimated relationships between true scores on the test forms to be equated and on the anchor. The Levine method has received some attention in recent years and further developments have been considered, such as a curvilinear analogue of the method in the kernel equating framework (KE; von Davier, 2011; Chen, Livingston, & Holland, 2011; von Davier, Holland, & Thayer, 2004). The Levine observed-score equating method is often computed in practical applications for comparison purposes, because under some circumstances it is more accurate than other linear equating methods (see Petersen et al., 1982). This presentation discusses the usefulness of observed-score and classical test theory based methodologies in equating in general and, in particular, the advantages of a (non-)linear Levine observed-score equating method in the KE framework. The new equating function is investigated on real data and the results are compared with those from traditional equating functions.
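For orientation, every linear observed-score equating in the NEAT design shares the textbook form below; what distinguishes the Levine method is how the synthetic-population moments are estimated from anchor relationships under classical (congeneric) true-score assumptions. The formula is included only to situate the curvilinear kernel analogue discussed in the paper:

\mathrm{Lin}_Y(x) \;=\; \mu_{Y,s} \;+\; \frac{\sigma_{Y,s}}{\sigma_{X,s}}\bigl(x - \mu_{X,s}\bigr),

with \mu_{X,s}, \mu_{Y,s}, \sigma_{X,s} and \sigma_{Y,s} the means and standard deviations of forms X and Y in the synthetic population s.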

 

Paper 3

Empirical evaluation of IRT observed score equating

Anton Beguin (Cito Institute for Educational Measurement, Netherlands)

Maarten Marsman (Cito Institute for Educational Measurement, Netherlands)

 

Some of the aspects influencing equating are the design, the equating procedure, and particularities of the data. In the evaluation of IRT equating procedures, simulation studies are often carried out in which the data are sampled from an IRT model. In some studies deliberate misfit is added to the model, but often the simulated data are sampled from the same model that is also used in the equating procedure. In the current simulation study we construct data using a model-free resampling procedure based on existing large data sets. We divide the data in such a way that an equating design is simulated, with some parts of the data serving as unique items and other parts as common items. The size of the data set allows multiple equating samples to be drawn, and standard errors are estimated from the variance over the samples. To evaluate the robustness of the procedure, conditions are constructed based on the number of respondents and the number of common items. Furthermore, we take into account the multilevel structure present in the data in order to evaluate the effect of school-based sampling of students on the effectiveness of IRT observed-score equating.
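A schematic sketch of the resampling logic is given below; equate_forms is a placeholder for whatever IRT observed-score equating routine is being evaluated, and the data layout is assumed purely for illustration.

import numpy as np

def resampling_standard_errors(responses, equate_forms, n_samples=500,
                               sample_size=2000, seed=1):
    """Estimate equating standard errors by model-free resampling.

    responses: 2-D array (respondents x items) from a large existing data set,
               split elsewhere into unique and common (anchor) item blocks.
    equate_forms: placeholder callable that takes a resampled data matrix and
               returns the equated score for each raw score point.
    Returns the mean equating function and its standard error over samples.
    Resampling whole schools instead of individual respondents would be one
    way to respect the multilevel structure mentioned above.
    """
    rng = np.random.default_rng(seed)
    n_persons = responses.shape[0]
    results = []
    for _ in range(n_samples):
        idx = rng.choice(n_persons, size=sample_size, replace=False)
        results.append(equate_forms(responses[idx]))
    results = np.asarray(results)            # n_samples x n_score_points
    return results.mean(axis=0), results.std(axis=0, ddof=1)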

 

Paper 4

The value of reporting and equating observed admissions test subscores

Per-Erik Lyrén (Department of Applied Educational Sciences, Umeå University, Sweden)

 

It is becoming more common to report, use and equate observed subscores on large-scale assessments such as admissions tests. Before doing so, however, the added value of these subscores should be examined empirically, and there are several ways to do so. One of the most promising is the CTT-derived observed-score method proposed by Haberman (2008). This method was used in the present study to examine the value of reported observed subscores on the new Swedish Scholastic Assessment Test (SweSAT), which was first administered in the fall of 2011. The SweSAT is an admissions test used for selection to higher education in Sweden. Analyses of pretest data show that there is added value in only two out of eight subtest scores: one in the Verbal part and one in the Quantitative part. This differs from the old SweSAT, which had added value in four out of a total of five observed subscores. The presentation will also include analyses of regular observed-score data from the first two administrations of the test. The advantages of Haberman's method and the consequences of equating the observed subscores in the SweSAT context will be discussed.
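Haberman's criterion can be summarised compactly (our notation, under classical test theory; the abstract itself does not spell out the formulas): a subscore S with true score \tau_s has added value when the observed subscore predicts \tau_s better than the observed total score X does, i.e. when

\mathrm{PRMSE}_S \;=\; \rho^2(S,\tau_s) \;>\; \rho^2(X,\tau_s) \;=\; \mathrm{PRMSE}_X,

where \rho^2(S,\tau_s) is simply the reliability of the subscore and \rho^2(X,\tau_s) is the squared correlation of the total score with the subscore's true score (each term being the proportional reduction in mean squared error achieved by that predictor).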

 

Discussant

Kurt Geisinger (Buros Center for Testing, University of Nebraska-Lincoln, USA)

 

 

Symposium 21 / Wednesday, 4th July / 15.45-17.15 / Room: Raadszaal

The hierarchical structure of personality and performance

Chair

Rainer Kurz (Saville Consulting, UK)

 

Symposium abstract

The Big 5 personality factors have provided focus and direction to personality questionnaire development and validation research. Digman (1997) proposed higher-order factors labelled Alpha (Agreeableness, Emotional Stability & Conscientiousness) and Beta (Extraversion & Openness) that, according to Musek (2007), form a ‘Big One’ factor of personality. Van der Linden, Nijenhuis & Bakker (2010) demonstrated the existence of a General Factor of Personality (GFP) across numerous Big 5 measures and samples, and showed moderate validity for the construct. The Great 8 competencies (Kurz & Bartram, 2002) expanded the Big 5 by adding Need for Achievement & Power as well as Reasoning constructs. Kurz, MacIver & Saville (2009) proposed a Three Effectiveness Factor model building on this framework. The symposium outlines the various models and alternative approaches to stimulate discussion, and shows how they can be reconciled through hierarchical models, e.g. as demonstrated by Woods (2009) through factor-analysing five personality questionnaires.

 

Paper 1

A general factor of effectiveness, or three effectiveness factors?

Rainer Kurz (Saville Consulting, UK)

 

This paper reviews research on General Mental Ability (GMA) and introduces the Three Effectiveness Factor (3EF) model (Kurz, MacIver & Saville, 2009), which features broad Working Together, Promoting Change and Demonstrating Capability factors. The first two factors broadly correspond to the Alpha vs. Beta (Digman, 1997), Getting Along & Getting Ahead (Hogan & Holland, 2003) and Stability vs. Plasticity (DeYoung et al., 2002) constructs. The third factor covers information-processing themes related to reasoning and dependability. In a PCA of Wave Performance 360 data on N = 13,017 ratings, all 45 items showed positive loadings (≥ .25) on the first unrotated component. The first Varimax-rotated component, Promoting Change (20%), covered Openness, Need for Achievement, Need for Power and Extraversion themes, and global ratings of Accomplishing Objectives and Demonstrating Potential. Working Together (14%) covered Agreeableness and Emotional Stability themes. Demonstrating Capability (13%) covered Dependability aspects of Conscientiousness, Working with Information, and global ratings on Applying Specialist Expertise. The results confirm the structure of the Three Effectiveness Factor model (3EF) with a General factor at the apex. The hierarchical conceptualisation of effectiveness at work in terms of three broad factors synergises ability, personality and competency assessment, and enables better alignment of predictor and criterion variables at the level of specificity desired.
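A minimal sketch of this type of analysis (principal components of a ratings-by-items matrix followed by varimax rotation) is given below; the data are simulated placeholders and the code is not the authors' own.

import numpy as np

def varimax(loadings, max_iter=100, tol=1e-6):
    """Kaiser varimax rotation of a p x k loading matrix."""
    p, k = loadings.shape
    rotation = np.eye(k)
    d = 0.0
    for _ in range(max_iter):
        rotated = loadings @ rotation
        u, s, vt = np.linalg.svd(
            loadings.T @ (rotated ** 3
                          - rotated @ np.diag((rotated ** 2).sum(axis=0)) / p))
        rotation = u @ vt
        if s.sum() < d * (1 + tol):
            break
        d = s.sum()
    return loadings @ rotation

def pca_loadings(data, n_components):
    """Principal component loadings from the item correlation matrix."""
    corr = np.corrcoef(data, rowvar=False)
    eigvals, eigvecs = np.linalg.eigh(corr)
    order = np.argsort(eigvals)[::-1][:n_components]
    return eigvecs[:, order] * np.sqrt(eigvals[order])

# ratings: hypothetical N x 45 matrix of 360-degree item ratings
ratings = np.random.default_rng(0).normal(size=(500, 45))
unrotated = pca_loadings(ratings, n_components=3)
rotated = varimax(unrotated)   # inspect which items load on which component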

 

Paper 2

The general factor of personality

Dimitri van der Linden (Erasmus University Rotterdam, The Netherlands)

 

Recently it has been proposed that a General Factor of Personality (GFP) exists that reflects a continuum of socially desirable behaviour. The GFP is assumed to occupy the apex of the hierarchical structure of personality and emerges from the inter-correlations among lower-order personality factors such as the Big Five. Since it came to prominence in the literature in 2007, the GFP has stirred a strong scientific debate. Some researchers have suggested that the GFP may be little more than a measurement artefact, for example due to common method bias (e.g., faking on tests). Others, however, have suggested that the GFP reflects a substantive personality factor that plays a role in interpersonal behaviour. The presentation will briefly describe the background of the GFP and present studies (e.g., a meta-analysis with N = 144,114) showing that the GFP (i) is consistent over different personality measures, and in self-ratings and other-ratings, and (ii) is related to a range of real-life outcomes such as social status, supervisor-rated performance, and objective job performance. This evidence indicates that the GFP is a factor with practical and theoretical implications. Knowledge about the GFP may therefore also be relevant for people working in the area of personnel selection and test development.

 

Paper 3

Do I want candidates with more personality, or more suitable personalities?

Paula Cruise (OPP Ltd, UK)

Rob Bailey (OPP Ltd, UK)

 

This presentation explores to what extent the Big One, also known as the General Factor of Personality (Musek, 2007; van der Linden, 2011), can be found in the 16PF questionnaire. It reports the lack of a clear one-factor solution from principal components analysis and describes the two-factor model found instead (as well as the conventional 16PF structure of 5 Global Factors and 16 detailed Primary Factors). The same pattern was found in recent data from the UK and Ireland (N = 1,202) and US data (N = 30,567). This presentation also explores the utility of the GFP concept by contrasting the predictive validity of a two-factor solution, a five-factor solution, and detailed regression weights arising from the 16 factors. This will build on Herrmann and Bailey’s 2009 work contrasting the utility of 5 Factors vs. 16, in which regression equations based on the 16 Factors yielded higher validity than the 5 Global Factors; the criteria in Herrmann and Bailey’s study were 360 ratings on a variety of competencies. The paper concludes that specific, granular personality characteristics give more accurate prediction for a variety of discrete areas of work performance.

 

Paper 4

Co-validation of ‘Great 8 Competencies’ totals across seven personality questionnaires

Rab MacIver (Brunel University, UK)

 

This presentation explores the hierarchical structure of personality variables based on a co-validation (N = 308) of the 16PF, NEO, HPI, OPQ32i and three versions of Saville Consulting Wave. Scale mappings to the Great 8 Competencies were constructed a priori based on OPQ32 equations (Bartram, 2005). When the 56 Great 8 scores from the different instruments were entered into a PCA, the first component extracted showed high loadings for motivational constructs, while those underpinned by Agreeableness and Conscientiousness showed negative loadings. A rotated two-factor solution covered Alpha & Beta, while the three-factor solution resembled the Working Together, Promoting Change and Demonstrating Capability constructs of Kurz, MacIver & Saville (2009). Big 5 and Great 8 solutions were also viable. Great 8 Total unit-weight composite scores correlated on average .58, with a minimum convergence of .42 between scores from different instruments. The average raw validity against an external rating of Global performance was .21 (.36 adjusted for criterion unreliability), with values ranging from .18 to .32 (.31 to .57). The results show that Totals across the Great 8 Competencies are convergent across measures and show sizeable criterion-related validity against external ratings of overall performance. Unit-weight aggregation of criterion-centric Great 8 forecast scales into a ‘General’ construct seems viable and a promising alternative to extraction of the first PCA component.

 

Discussant

Jörg Prieler (Investigation, Research & Consulting Centre, Austria)

 

 

Symposium 22 / Thursday, 5th July / 08.30-10.00 / Room: Leeszaal

Personality assessment across and within cultures: Addressing some methodological issues

Chair

Ilke Inceoglu (SHL Group, UK)

 

Symposium abstract

With globalisation and increased physical mobility of both employers and employees, international organisations are faced with the question of how to use assessment tools for selection, development and performance management across different countries. As country populations are becoming more diverse, employers also have to consider whether assessments used within one country work equally well for applicants from diverse cultural backgrounds. This symposium addresses methodological and practical questions in personality assessment across and within cultures. Different methods used for assessing construct equivalence are compared, and data collected across a range of diverse countries and ethnic groups within a country are presented, examining construct equivalence, individual and country level effects on scale scores using multi-level analysis, and the influence of in-group and out-group comparisons on personality test scores.

 

Paper 1

Replicating personality structure with confirmatory factor analysis using the German 16PF Questionnaire

Anne Herrmann (Leuphana University Lüneburg, Germany)

 

The underlying structure of personality has been debated extensively in personality psychology and psychometrics. So far, the internal structure of personality inventories has been primarily examined using exploratory factor analysis (EFA). Increasingly, confirmatory factor analysis (CFA) is considered the method of choice, because it offers several advantages when studying internal construct validity. This study aims to replicate the EFA-based second-order structure of the German 16 Personality Factors (16PF) questionnaire using CFA (N = 786). Two models were compared: Model 1 was specified based on EFA results obtained from the German version of the 16PF during its development. Model 2 was specified based on EFA results of the US-English 16PF. This model also reflects how the 16 primary factors are assigned to compute the five global factors for the US-English version, and many other language versions of the questionnaire. Better fit was obtained for the latter model when applied to the German data. The findings are related to other research applying CFA to personality instruments, and implications are discussed. In addition, the differences between results from EFA and CFA are described, as well as the challenges faced in the replication of EFA results when using the more restrictive CFA.

 

Paper 2

Equivalence of personality constructs across 29 countries

Mathijs Affourtit (SHL Group, UK)

Ilke Inceoglu (SHL Group, UK)

 

With growing globalisation and pan-geographical operations by many companies, assessment has become increasingly international. For comparison of candidates across countries, constructs have to be transferable. This paper addresses the construct equivalence of the SHL Occupational Personality Questionnaire (OPQ32r) across diverse countries and language versions, including, for example, Europe, the US, Latin American countries, China, South Africa and India. Results of analyses of data from over 90,000 administrations of the OPQ32r across 29 countries will be reported. The OPQ32 was first developed in UK English after thorough review of the constructs and items by an international group of experts. To determine whether the other language versions of the OPQ32r measure the same constructs as the UK English version, the pattern of scale inter-correlations was examined using structural equation modelling (SEM). The model tested showed very good fit across all 29 countries. The results strongly support the construct (i.e., configural) equivalence of the other OPQ32r language versions with the UK English version. In addition to assessing construct equivalence through the use of SEM to examine the invariance of between-scale correlation patterns, we also investigate variations in scale means as a function of language and other demographics (age and gender).

 

Paper 3

Multi-level analysis of a large multicultural personality data set

Dave Bartram (SHL Group, UK)

 

Multi-level analysis provides a means of looking at individual- and country-level effects on scale scores simultaneously, and of partitioning the variance in scale scores that is associated with other individual- or country-level variables. In an earlier paper, Bartram (in press) showed that much of the variance in country-level scale means and SDs for the Big Five can be accounted for by country-level metrics. These metrics include Hofstede's cultural dimensions and 'performance' metrics such as GDP, life expectancy, quality of educational provision and global competitiveness. However, the analyses he reported were based on country-level data and did not examine individual-level effects at the same time. The present paper presents a re-analysis of these data using multi-level analysis and shows where this approach substantiates the findings from the earlier work, and where it helps to distinguish variance attributable to individual variation within countries from differences between countries.
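A minimal sketch of a two-level random-intercept model of the kind described, using the statsmodels package, might look as follows; the variable and file names are placeholders rather than the variables in the actual data set.

import pandas as pd
import statsmodels.formula.api as smf

# One row per respondent: a personality scale score, a country identifier,
# and a country-level predictor such as GDP per capita (hypothetical file).
df = pd.read_csv("personality_by_country.csv")

# Random intercept for country, fixed effect of the country-level predictor.
model = smf.mixedlm("scale_score ~ gdp_per_capita", data=df,
                    groups=df["country"])
result = model.fit()
print(result.summary())

# The country-intercept variance relative to the residual variance shows how
# much of the score variance lies between rather than within countries.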

 

Paper 4

The reference-group effect: Personality data from a Turkish-Dutch minority group

Marise Ph. Born (VU University of Amsterdam & Erasmus University Rotterdam, Netherlands)

Anita de Vries (NOA, Amsterdam & VU University of Amsterdam, The Netherlands)

Reinout De Vries (VU University of Amsterdam, The Netherlands)

 

Small or absent ethnicity effects on personality scales may indicate that ethnic groups do not differ on personality traits, but they may also be caused by the so-called reference-group effect (RGE). The RGE occurs when responses are based not on respondents' absolute level of a construct but rather on their level relative to a comparison group. This study examines to what extent Turkish-Dutch minorities are influenced by perceptions of comparison others when filling out a personality test. The results show that when the Turkish-Dutch compared themselves with people from their own Turkish-Dutch minority group (in-group comparison), there were no score differences between the Dutch majority and the Turkish-Dutch minority. Yet when Turkish-Dutch minorities thought about how they behave in comparison to the Dutch majority group (out-group comparison), they saw themselves as less honest and humble. Furthermore, when the Turkish-Dutch used an out-group comparison other, they saw themselves as more emotional, as well as less agreeable and less open to new experiences, than when they used an in-group comparison other. The findings suggest that Turkish-Dutch members are influenced by perceptions of comparison others when filling out a personality test.

 

Discussant

Fons van de Vijver (Tilburg University, Netherlands)

 

 

Symposium 23 / Thursday, 5th July / 08.15-10.00 / Room: Bestuurskamer

 

Monitoring the quality of performance assessment raters

 

Chairs

Avi Allalouf (National Institute for Testing and Evaluation, Israel)

Alvaro Arce-Ferrer (Pearson, USA)

 

Symposium abstract

In assessment programs around the world, achievement (or academic performance, such as mathematics ability or reading proficiency) is measured by means of multiple-choice and constructed-response items. Monitoring the quality of scores on constructed-response items is a complex and intricate process that has much to gain from quality control procedures (ITC Guidelines for Quality Control, 2011). Many problems arise in assessment programs that involve constructed-response items. Where reliable and valid, constructed-response items can provide unique information about students' performance; where they are not, the scores will reflect rater idiosyncrasy in addition to student performance. This symposium brings together measurement practitioners and scholars from several countries with a view to sharing insights and recommendations concerning the monitoring of rating quality. Collectively, the five papers review the common challenges encountered in constructed-response scoring; summarize findings from empirical research geared to addressing these challenges; and offer recommendations with regard to: (1) rater selection, (2) rater training, (3) identification of rater effects, (4) defining rating quality, and, based on this, (5) monitoring the quality of work of performance assessment raters. There is a need for more research in this area that combines psychometric indices of rating quality with investigations of the rating process, including a close study of the interrelations among rater, student and performance characteristics.

 

Paper 1

Assessing incidence and consequence of rater effects on open-ended scoring

Alvaro J. Arce (Pearson, USA)

Rense Lange (Illinois State Board of Education, USA)

 

The large-scale scoring of open-ended (OE) items involves an intricate net of operations and activities, and no single measure exists to adequately capture and monitor all the facets that determine the final quality of scoring results. In practice, the different facets involved are monitored using specialized quality control guidelines to ensure acceptable overall performance, and boundaries are set for individual facets one at a time.

We propose that the performance scoring process is well served by a conceptual framework that can model the (main and interaction) effects of the various facets involved. Our approach relies on Linacre's many-facet Rasch model (MFRM) to provide a common frame of reference for judging the various quality aspects involved in large-scale scoring efforts. Of particular interest is the study of rater effects and the tracking of raters over the duration of the scoring process, as the scoring window can be three to four weeks, with varying numbers of raters scoring papers on multiple occasions. The paper analyzes the performance of three approaches to pinpointing rater effects and their consequential impact on students' accountability scores. The study is framed within the context of OE human-mediated scoring for a large-scale testing program in reading and mathematics for grades 3 to 8.
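For reference, the many-facet Rasch model invoked here is usually written, for the probability of examinee n receiving category k rather than k−1 from rater j on item (or task) i, as

\ln\!\left(\frac{P_{nijk}}{P_{nij(k-1)}}\right) \;=\; \theta_n \;-\; \delta_i \;-\; \lambda_j \;-\; \tau_k,

where \theta_n is examinee proficiency, \delta_i is item difficulty, \lambda_j is rater severity, and \tau_k is the step threshold of category k; rater effects are then read off the estimated severities \lambda_j and their associated fit statistics.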

 

Paper 2

The ideal rater: Monitoring the efficiency of performance assessment raters

Avi Allalouf (National Institute for Testing and Evaluation, Jerusalem, Israel)

Galit Klapfer (National Institute for Testing and Evaluation, Jerusalem, Israel)

Marina Fronton (National Institute for Testing and Evaluation, Jerusalem, Israel)

 

There is a subjective component inherent in the rating of performance assessments, because it is conducted by people and affected by their input. A variety of means can be used to limit this subjectivity and to monitor the efficiency of the rating. Efficiency can be defined as the quality (reliability, validity) of the rating together with the speed with which the rating is conducted. This study is divided into two parts: 1) an analysis of rating efficiency and stability over a number of years, by rater and by demographic and KSAO variables (Knowledge, Skills, Abilities and Other characteristics), and 2) an examination of the relation between rating speed and quality, by rater and in general. The study is based on data from raters who have been employed in the rating of two questionnaires over eight years, 2004-2011. The questionnaires - a standardized Biographical Questionnaire and a Judgment & Decision Questionnaire - are both part of an assessment center for measuring non-cognitive attributes in medical school candidates. Determining efficiency is essential to identifying the potential ideal rater, i.e., one who will be quick and reliable, and produce valid results.

 

Paper 3

Monitoring raters: The effects of quality assurance system

Jo-Anne Baird (Oxford University Centre for Educational Assessment, UK)

Michelle Meadows (Assessment and Qualifications Alliance, UK)

George Leckie (Centre for Multilevel Modelling, University Of Bristol, UK)

 

Research on marking behaviour is reported, using operational data from samples of A-level marking which were re-marked by senior examiners in quality checks. For each question paper, these senior examiners had been trained in the application of the marking standard by a single principal examiner; senior examiners then trained teams of junior examiners. A multilevel analysis of 5,500 re-marks from 110 senior examiners, 567 junior examiners and 22 question papers was carried out, and two systems of setting 'true scores' were compared. Findings showed that some teams were more accurate than others, a phenomenon created by the senior examiners' training. Senior examiners were not, then, systematically applying the principal examiners' marking standards in a cascaded, hierarchical manner, as intended. Further, senior examiners took some junior examiners' professional views into account more than others when conducting the second marking. This was a bias, and senior examiners perceived there to be more variability in the accuracy of junior examiners' marking than would have been the case if the senior examiners had not been involved in the quality assurance process.

 

Paper 4

Stability of rating characteristics across time in high-stakes examinations

Iasonas Lamprianou (Department of Social and Political Sciences, University of Cyprus, Cyprus)

 

This study investigates the stability of the rating characteristics of a large group of raters across time, in the context of English-as-a-Foreign-Language high-stakes examinations. The study uses two different measures of rater severity, two measures of rater consistency and one measure of 'restriction of range'. Eighteen datasets from eight consecutive years are used, drawn from two different high-stakes examinations. A range of statistical models is used to investigate three groups of research questions. The study found that some raters exhibit very different rating characteristics from others. It also found a strong overall experience effect on rater severity, as well as a strong individual-rater experience effect that affected the stability of rater characteristics, sometimes negatively and sometimes positively.

 

Paper 5

Evaluating rater-mediated assessments

George Engelhard, Jr. (Emory University, Atlanta, USA)

Stefanie A. Wind (Emory University, Atlanta, USA)

 

The purpose of this study is to provide an introduction to the concept of invariant measurement for rater-mediated assessments. Rasch models provide an approach for creating item-invariant person measurement and person-invariant item calibration. This study extends these ideas to measurement situations that require raters to make judgments regarding student performances, such as the quality of written essays, in order to create rater-invariant person measurement and task calibration. The Many-Facet Rasch (MFR) Model is used for evaluating the psychometric quality of rater-mediated assessments. Several indices related to rater agreement, rater errors and systematic biases, and rater accuracy based on Rasch Measurement Theory are described and compared in this study. Assessments designed to measure communicative competence within the context of large-scale writing assessments are used to illustrate various indices of rating quality for evaluating rater-mediated assessments, as well as their consistency in providing information about rating quality. The implications of this approach for examining rating quality for research, theory, policy, and practice in large-scale writing assessment are examined.

Findings hold implications for future research on the quality of rater-mediated performance assessments. Identified differences among indices of rating quality will inform future studies that use them.

 

Discussant

George Engelhard, Jr. (Emory University, Atlanta, USA)

 

 

Symposium 24 / Thursday, 5th July / 08.15-10.00 / Room: Raadszaal

Advances in testing and measurement: Job performance, change and innovation

Chair

Neil Anderson (Brunel University, Business School, UK)

Kristina Potocnik (Brunel University, Business School, UK)

 

Symposium abstract

This symposium focuses on developments and advances in the measurement of job performance, change, and innovation in the workplace. It comprises four papers from different European countries: Spain, Greece, the U.K., and The Netherlands. In the first paper, Salgado et al. revisit relationships among emotional stability, conscientiousness, their component facets, and job performance. In the second, Athanasios et al. present a Situational Judgment Test to assess change agents' behaviors predictive of effective change management. In the third paper, Potocnik and Anderson explore the validity of innovative performance measures and propose how to measure this construct reliably and fairly. Finally, Roe raises important issues concerning time in validation research designs and proposes several directions for future research and practice in this regard. All papers have important theoretical and practical implications for psychometric measurement, as will be highlighted in the concluding discussant commentary.

 

Paper 1

Job performance, personality factors and their facets: Not much more than conscientiousness

Jesús F. Salgado (University of Santiago de Compostela, Spain)

Silvia Moscoso

Alfredo Berges

 

This study compared the predictions of the three basic perspectives in the bandwidth debate about the relationship between personality and job performance (Ones & Viswesvaran, 1996; Schneider et al., 1996; Hogan & Brett, 1996; Paunonen, 1999; Ashton, 1998; Paunonen & Nicol, 2001; Tett et al., 2003). A sample of 226 police officers responded to a personality inventory based on the Five-Factor Model, and the relationships among emotional stability, conscientiousness, their facets, and job performance were analyzed. The immediate supervisor rated each individual on ten behaviorally anchored rating scales, which served as criteria. Three criteria of different breadth were used with each personality dimension. Unrotated principal components served as estimates of conscientiousness, emotional stability and their facets. The results showed that conscientiousness and emotional stability predicted the three criteria, and that the facets, once the common factor variance was removed, showed smaller validity than the factors. The facets showed no incremental validity over conscientiousness and only a small amount of incremental validity over emotional stability. Finally, emotional stability and its facets did not show incremental validity over conscientiousness. The implications for the theory and practice of personnel selection are discussed.
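The incremental-validity question examined here amounts to comparing nested regression models; a minimal sketch follows (variable and file names are placeholders, not the study's data).

import pandas as pd
import statsmodels.formula.api as smf

df = pd.read_csv("selection_study.csv")   # hypothetical file with ratings

# Baseline: broad factor only; extended: broad factor plus its facets.
base = smf.ols("performance ~ conscientiousness", data=df).fit()
full = smf.ols("performance ~ conscientiousness + order + dutifulness "
               "+ achievement_striving + self_discipline", data=df).fit()

delta_r2 = full.rsquared - base.rsquared   # incremental validity of the facets
f_test = full.compare_f_test(base)         # (F statistic, p value, df diff)
print(delta_r2, f_test)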

 

Paper 2

Selecting for change: Developing a SJT to measure change agent's behaviors

Gouras Athanasios (Athens University of Economics and Business, Department of Management Science and Technology, Greece)

Ioannis Nikolaou

Maria Vakola

 

Change has been an important issue in contemporary research and practice. In the past, many change scholars elaborated on macro-level change, but recently the change management literature has shifted towards micro-level issues, such as resistance to change, readiness for change, managerial coping with change and dispositional resistance (Judge et al., 1999; Oreg, 2003; Oreg et al., 2011). Even though there has been important progress in this field, issues such as the individual characteristics and behaviors of change agents that contribute to change success have attracted relatively limited attention. The main aim of our research is to identify change agents' behaviors that predict change performance and to enrich the literature and practice by following a rigorous methodology for selecting effective change agents. Using a sample of 160 white-collar employees from various organizations in Greece, we developed a Situational Judgment Test (Chan & Schmitt, 2002; McDaniel et al., 2001; McDaniel & Nguyen, 2001) assessing change agents' behaviors. Following standard psychometric guidelines and the previous literature on SJTs, the questionnaire showed sufficient validity and reliability, leading us to the next step of establishing its nomological network. Implications for research and practice, potential methodological caveats and next steps will be discussed in detail.

 

Paper 3

Assessment of Innovative Performance

Kristina Potocnik (Brunel University, Business School, UK)

Neil Anderson (Brunel University, Business School, UK)

 

Past research has employed a wide variety of measures to assess employee innovative performance. Most commonly, these have consisted of survey-based questionnaire measures to which employees have responded, both for predictor variables and for outcome measures of innovation. While self-reports are a convenient way to collect data, as they are quick and easy to administer and facilitate data collection on large samples of respondents, this measurement method is subject to various biases that may distort respondents' true scores or feelings about the phenomena under consideration. Hence, the exclusive use of self-reports to assess innovative performance is questionable, and other, more appropriate ways of assessing this type of performance should be explored. One hypothesis is that independent observers provide more valid and reliable ratings of innovative performance. We examined this assumption in our research using different raters. Generally, we found that independent observers did indeed provide more reliable and valid ratings than self-assessed innovative performance. Hence, we encourage future studies of innovation to avoid relying exclusively on self-reports and to make greater use of independent ratings.

 

Paper 4

Test validity from a temporal perspective: Towards a new validation paradigm

Robert A. Roe (Maastricht University, The Netherlands)

 

As psychological theories and methodologies have become increasingly permeated with time over the past decades, the need for reviewing and updating test validation practices has become apparent. After a brief discussion of differential (between-subjects) and temporal (within-subjects) research, this paper examines the role that time has played in traditional validation research and identifies a number of conceptual and methodological inadequacies. Starting from the assumption that human traits and behaviors unfold over time, within the boundaries set by birth and death, it argues that time should be an explicit part of any measurement, whether test or criterion, and that validation should include an evaluation of dynamic aspects of the test-criterion relationship. It furthermore argues that whenever stability is assumed, it needs to be demonstrated in terms of measurement equivalence over time (rather than retest reliability). The paper then presents a general framework based on a variables x subjects x time data model and defines four validation designs: (a) stable predictor, stable criterion; (b) stable predictor, dynamic criterion; (c) dynamic predictor, stable criterion; and (d) dynamic predictor, dynamic criterion. It gives recommendations for data gathering and methods of analysis, and concludes with implications for test research and practical applications.

 

Discussant

Neil Anderson (Brunel University, Business School, UK)

 

 

Symposium 25 / Thursday, 5th July / 10.30-12.00 / Room: Leeszaal

Is the end in sight for traditional psychometric testing?

Chair

Janwillem Bast (Hogrefe Uitgevers, The Netherlands)

Nadine Schuchart (Hogrefe Verlag, Germany)

 

Symposium abstract

Psychometric testing has a long and successful past, but does it also have a future? Traditional psychometric testing is about constructing the best possible psychological instrument to measure the psychological construct we are interested in. In other words, it is about the specification of a measurement model to obtain the most reliable and valid measurement of a latent variable. When the optimal set of items is found, the instrument is administered in populations with known characteristics in order to give the measurement a scale and norms. After this, the instrument can be used in the diagnostic process. In this procedure several assumptions are made about the nature of the latent variable, the underlying factor structure, the measurement model, and the way norms are calculated. In this symposium these assumptions are questioned. Is the end in sight for traditional psychometric testing?

 

Paper 1

Psychological constructs: latent variables or networks?

Angélique O. J. Cramer (Psychological Methods, University of Amsterdam, The Netherlands)

 

Many psychological constructs are conceptualized as latent variables: unobserved entities (e.g., intelligence, major depression) for which we need measurement instruments (e.g., intelligence tests) to detect them. This conceptualization has gone largely unchallenged for several decades. In more recent years, however, efforts to identify the nature of such latent variables in psychology have increasingly turned out to be rather unsuccessful: major depression, for example, cannot be attributed to pathology in one specific part of the brain's chemistry. Such findings raise the question of whether psychological constructs are 1) unobserved but detectable if only we devise better measurement instruments, or 2) unobserved because they do not exist, at least not as latent variables. In recent work (Borsboom, 2008; Cramer, Waldorp, van der Maas & Borsboom, 2010), we argue for a different conceptualization of psychological constructs, namely as networks. For example, major depression is conceptualized as a network of its observable symptoms: e.g., insomnia → fatigue → concentration problems, and so on. As such, there is no latent variable, and the observable symptoms, which used to figure as measurements of the underlying major depression construct, are now causally autonomous agents. What does this novel conceptualization mean for psychological testing?

 

Paper 2

Modeling differentiation in the linear factor model

Dylan Molenaar (Department of Psychology, University of Amsterdam, The Netherlands)

 

The linear factor model is a popular and well-known psychometric measurement model linking observed test data to an underlying latent variable. An important assumption of the linear factor model is that the correlations between the observed variables are constant across the range of the latent variable. This assumption, together with the assumption of a normal distribution for the latent variable, implies that the data are normally distributed. In this talk it is discussed how a phenomenon called ability differentiation (Spearman, 1924) results in specific violations of these assumptions. Extensions of the traditional linear factor model are proposed to account for differentiation, and both first- and second-order models are considered. In addition, the new models are applied to real intelligence data sets.
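One common way to accommodate such differentiation, sketched here generically (this is a standard heteroscedastic parameterisation and not necessarily the exact specification used in the talk), is to let the residual variances of the indicators depend on the latent variable:

x_{ij} \;=\; \nu_j + \lambda_j\,\theta_i + \varepsilon_{ij},
\qquad
\operatorname{Var}(\varepsilon_{ij}\,|\,\theta_i) \;=\; \exp\!\bigl(\beta_{0j} + \beta_{1j}\,\theta_i\bigr),

so that a positive \beta_{1j} implies larger unique variances, and hence weaker correlations among the observed variables, at higher levels of \theta, which is the pattern predicted by ability differentiation.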

 

Paper 3

Continuous norming models: an empirical evaluation

Paul Oosterveld (TNS-NIPO, Amsterdam, The Netherlands)

 

Continuous norming refers to techniques that prevent discontinuities in norms developed for different levels of a candidate characteristic by applying a statistical model to the data. An example of such a discontinuity is a drop of several IQ points on an intelligence test when a child moves to the next age norm group. Several models for developing continuous norms have been proposed. These will be presented and applied to the actual norm data of the d2 test of attention. The criterion for judging model quality will be the extent to which the resulting norms follow the intended distribution, for example, normally distributed with mean 50 and standard deviation 10.
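A minimal sketch of one simple continuous-norming approach is given below (polynomial regression of the raw-score mean and SD on age, then standardisation against the fitted values); it illustrates the general idea rather than any specific model evaluated in the paper, and the data are simulated placeholders.

import numpy as np

def fit_continuous_norms(age, raw, degree=2):
    """Fit smooth age trends for the raw-score mean and SD."""
    mean_coef = np.polyfit(age, raw, degree)
    resid = raw - np.polyval(mean_coef, age)
    # sqrt(pi/2) rescales the mean absolute residual to an SD under
    # approximately normal residuals.
    sd_coef = np.polyfit(age, np.abs(resid) * np.sqrt(np.pi / 2), degree)
    return mean_coef, sd_coef

def norm_score(raw, age, mean_coef, sd_coef, mean=50.0, sd=10.0):
    """Convert a raw score to a continuously normed T-like score."""
    z = (raw - np.polyval(mean_coef, age)) / np.polyval(sd_coef, age)
    return mean + sd * z

# Hypothetical calibration data: ages in years and raw attention-test scores.
rng = np.random.default_rng(0)
ages = rng.uniform(8, 16, size=2000)
raws = 20 + 3 * ages + rng.normal(0, 8, size=2000)
m_coef, s_coef = fit_continuous_norms(ages, raws)
print(norm_score(70, 12.3, m_coef, s_coef))   # no jump between age groups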

 

Paper 4

Searching for a theory of intelligence

H.Steven Scholte (Cognitive Neuroscience Group Department of Psychology University of Amsterdam, The Netherlands)

 

Psychometrics deals with measuring psychological constructs, yet it is often not clear what these constructs are. This is usually resolved by performing a factor or latent variable analysis and referring to the result of this analysis as a theory of the construct. A typical example of this approach is the notion of g. It is well established that different measurements of cognitive abilities are positively correlated. In this situation a factor analysis will generate one factor (g) that correlates positively with all these tests. g is, of course, not a theory of intelligence, but it is often used as a starting point for the question of what intelligence is. This usually biases the outcome of such research towards finding correlates of g. The alternative is that there are many underlying sources of intelligence that are, for reasons of development, correlated with each other through connectivity. Here we will present the first results from an extensive study (1000 subjects) in which we relate the subscales of the IST intelligence test, which has a 3x3 facet design, to the functional connectivity of the brain. This will be contrasted with an analysis in which g is used directly.
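
A minimal sketch of the point about g, assuming simulated subtest scores with a positive manifold: the first principal component of the correlation matrix loads positively on every subtest, which summarises the correlations but does not by itself explain them.

    import numpy as np

    # Assumed scores on six positively correlated cognitive subtests.
    rng = np.random.default_rng(2)
    common = rng.normal(size=(800, 1))                  # shared variance source
    scores = 0.7 * common + rng.normal(scale=0.7, size=(800, 6))

    # With a positive manifold, the first principal component ("g") of the
    # correlation matrix loads positively on every subtest: a summary of the
    # correlations, not a theory of their origin.
    R = np.corrcoef(scores, rowvar=False)
    eigvals, eigvecs = np.linalg.eigh(R)
    g_loadings = eigvecs[:, -1] * np.sqrt(eigvals[-1])  # largest eigenvalue last
    g_loadings *= np.sign(g_loadings.sum())             # fix the arbitrary sign
    print(np.round(g_loadings, 2))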

 

Discussant

Mark Schittekatte (Testpracticum Department of Psychology Ghent University, Belgium)

 

 

Symposium 26 / Thursday, 5th July / 10.30-12.00 / Room: Bestuurskamer

Paper-pencil and computerized testing: Practical matters

Chair

Paula Elosua (University of Basque Country, Spain)

 

Symposium abstract

Over the last two decades, Computer-Based Tests (CBT) have been increasing in popularity relative to traditional Paper-and-Pencil Tests (PPT). CBT offers potential advantages in test construction, administration, scoring and reporting, as well as the opportunity to work with new item formats. Several comparability studies have been carried out to examine attitudes toward CBT and PPT, and to assess the effects of CBT on performance and score validity. This symposium presents some of this work. Arribas-Aguila studies the prevalence of CBT and PPT as well as observed-score equivalence between the two test formats. Iliescu and colleagues' work focuses on the comparison of results obtained in Romania on equivalent samples of online and PPT data for three tests. Oostrom and colleagues compare applicant perceptions regarding video applications and paper resumes, and finally Hagge and colleagues investigate whether item parameter drift impacts candidate ability estimates to the same extent on computerized adaptive tests and on fixed-length paper-and-pencil forms.

 

Paper 1

Two-speed psychologists: prevalence of use and equivalence data for computer-based and paper-and-pencil tests in Spanish-speaking populations

David Arribas-Aguila (TEA Ediciones, Spain)

 

The aim of this paper was to study the prevalence of use of new-technology testing vs. traditional testing among Spanish-speaking applied psychologists, as well as the equivalence between the two approaches. Equivalence and prevalence of use of the CTT paper-and-pencil and CTT computer-based forms of six well-known personality tests (16PF-5, NEO PI-R, BFQ, BIP, compeTEA and TPT) were analyzed. Data were collected from all administrations of these tests (N > 300,000) made in Spanish-speaking countries during 2011. Results showed that paper-and-pencil tests have a much higher prevalence of use than their equivalent computer-based versions. Regarding the analysis of equivalence, differences between raw scores for the two formats were statistically significant, but effect sizes were small. Taking these results into account, there appear to be two-speed applied psychologists in Spanish-speaking countries: most psychologists are still using paper-and-pencil questionnaires, and only a few take advantage of new technologies. The equivalence results do not seem to be one of the reasons behind the two-speed phenomenon.
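
A minimal sketch, with made-up scores, of how very large samples can make a negligible format difference statistically significant while the effect size remains small:

    import numpy as np
    from scipy import stats

    # Assumed raw scores for paper-and-pencil vs. computer-based administrations.
    rng = np.random.default_rng(3)
    ppt = rng.normal(loc=50.0, scale=10.0, size=150_000)
    cbt = rng.normal(loc=50.4, scale=10.0, size=15_000)   # tiny true difference

    t, p = stats.ttest_ind(ppt, cbt, equal_var=False)
    pooled_sd = np.sqrt((ppt.var(ddof=1) + cbt.var(ddof=1)) / 2)
    d = (cbt.mean() - ppt.mean()) / pooled_sd
    print(f"p = {p:.4g}, Cohen's d = {d:.3f}")   # significant p, negligible d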

 

Paper 2

Equivalence of online and paper-and-pencil forms for 3 tests in Romania

Dragos Iliescu (National School of Political and Administrative Studies, Romania)

Andrei Ion (National School of Political and Administrative Studies, Romania)

Dan Ispas (Illinois State University, U.S.A.)

Alexandra Ilie (University of South Florida, U.S.A.)

 

Online testing is rapidly replacing paper-and-pencil testing in some domains of psychological testing. In Romania this is less visible for educational and clinical tests and more visible in the I/O domain. Assessment providers and test publishers not only offer their clients the choice between online and paper-and-pencil testing, but sometimes offer nothing other than online testing. Many of the tests that are today deployed primarily online were not developed for this procedure and have never been validated for this usage. Their equivalence with the original paper-and-pencil form is assumed, but rarely demonstrated. This paper focuses on the comparison of results obtained in Romania on equivalent samples of online and paper-and-pencil data for three tests from three different domains: a cognitive ability test (Intelligenz-Struktur-Test-2000R; Liepmann, Beauducel, Brocke & Amthauer, 2007), a vocational interests test (Self-Directed Search; Holland, 1994) and a personality measure (Big Five Questionnaire-2; Caprara, Barbaranelli & Borgogni, 2005).

 

Paper 3

Video and paper resume applications: Predicting applicants' preference based on personality and cognitive ability

Janneke K. Oostrom (Erasmus University Rotterdam, The Netherlands)

Annemarie M. F. Hiemstra (Erasmus University Rotterdam, and GITP, The Netherlands)

Esther Smeitz-Cohen (Academy for Legislation, The Netherlands)

Eva Derous (Ghent University, Belgium)

Marise Ph. Born (Erasmus University Rotterdam, The Netherlands)

 

This study compared applicant perceptions regarding video applications and paper resumes. Furthermore, it examined whether personality and cognitive ability explained applicants' preference for either type of screening tool. Data were collected among 104 applicants for a legislative lawyer traineeship, who applied with a paper resume and a videotaped message. Additionally, invited applicants filled out a personality questionnaire and a cognitive ability test. Contrary to our hypotheses, applicants strongly preferred paper resumes over video applications in terms of fairness, face validity, predictive validity, and opportunity to perform (1.59 < d < 2.18). Extraversion, emotional stability, and cognitive ability were significant moderators. Applicants who scored high on cognitive ability preferred paper over video applications with regard to all four perceptions, compared to low-scoring applicants. Applicants who scored high on emotional stability preferred paper over video applications in terms of fairness, face validity, and opportunity to perform, compared to low-scoring applicants. In contrast, applicants who scored high on extraversion showed a weaker preference for paper over video applications in terms of face validity than low-scoring applicants. Although video applications are increasingly being used, this study shows that not all applicants consider this a positive trend.

 

Paper 4

Comparing the impact of item drift for paper and pencil tests versus computerized adaptive tests

Sarah L. Hagge (National Council of State Boards of Nursing, USA)

Ada Woo (National Council of State Boards of Nursing, USA)

Philip Dickison (National Council of State Boards of Nursing, USA)

Jerry Gorham (Pearson VUE)

 

Item security is of primary importance for licensure and certification examinations. If items are compromised, candidates may receive inflated estimates of ability. When examinations are created using items from previously calibrated item pools, large fluctuations in item parameter estimates may indicate compromised items. Rationales for robust item security exist for items administered through both computerized adaptive tests (CAT) and paper-and-pencil tests. The purpose of the current study is to investigate whether item parameter drift impacts candidate ability estimates to the same extent on CAT and on fixed-length paper-and-pencil forms. Data for this study come from two high-stakes examinations: a variable-length CAT licensure examination and a fixed-length paper-and-pencil certification examination. Both examinations use an item bank calibrated with the Rasch model. Candidate ability estimates from the operational administrations of the examinations will be treated as the baseline. Item parameter drift will be introduced into the item pools by varying the magnitude of drift and the percentage of items exhibiting drift, and candidate ability will then be re-estimated using the drifted item parameters. Results will be evaluated in terms of consistency of pass rates, differences between candidate ability estimates, and comparisons across the two examinations. The results may have implications for decisions regarding examination administration mode.
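
A minimal sketch of the drift manipulation, assuming a simulated fixed-length Rasch form (the study's operational procedure is more involved): ability is estimated once with the calibrated item difficulties and once after a chosen percentage of items has drifted by a chosen amount.

    import numpy as np

    # Simulate one candidate's responses to a 60-item Rasch-calibrated form.
    rng = np.random.default_rng(4)
    b = rng.normal(size=60)                      # calibrated item difficulties
    theta_true = 0.3
    resp = (rng.random(60) < 1 / (1 + np.exp(-(theta_true - b)))).astype(int)

    def mle_theta(responses, difficulties, iters=25):
        # Newton-Raphson maximum likelihood estimate of ability (Rasch model).
        theta = 0.0
        for _ in range(iters):
            p = 1 / (1 + np.exp(-(theta - difficulties)))
            grad = np.sum(responses - p)          # first derivative of log-lik
            info = np.sum(p * (1 - p))            # observed information
            theta += grad / info
        return theta

    # Introduce drift: 20% of items shift by 0.5 logits, then re-estimate.
    b_drift = b.copy()
    drifted = rng.choice(60, size=12, replace=False)
    b_drift[drifted] += 0.5
    print(mle_theta(resp, b), mle_theta(resp, b_drift))

Comparing the two estimates across many simulated candidates (and across administration modes) is one way to quantify how much a given amount of drift moves ability estimates and pass/fail decisions.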

 

Discussant

David Foster (Kryterion, USA)

 

 

Symposium 27 / Thursday, 5th July / 10.30-12.00 / Room: Raadszaal

The assessment and teaching of 21st century skills

Chair

Patrick Griffin (University of Melbourne, Australia)

 

Symposium abstract

The symposium will explore the challenges inherent in developing multinational, human-to-human collaborative problem solving assessment tasks. It will examine the process of development, the issues of implementation in two countries, and the solutions to what were thought to be intractable problems. Implementations in two European countries exposed the challenges of non-English-speaking applications and the translation requirements. Development identified a range of strategies, such as the definition of the construct and the mapping of performance onto a developmental continuum. The project was sponsored by Cisco, Intel and Microsoft and hosted in six countries under the supervision of ministerial-level country representatives.

 

Paper 1

Challenges of internet based collaborative problem solving assessment

Patrick Griffin (University of Melbourne, Australia)

Esther Care (University of Melbourne, Australia)

 

This presentation will examine the pedagogical and technical challenges experienced in the development of human-to-human collaborative problem-solving assessment tasks. The challenges included the definition of the construct, the development and delivery of tasks, student and school access, scoring, calibration that deals with the dependencies built into the process, and interpretation and reporting to students and teachers. There were also issues associated with developing teachers' skills in using the data to make instructional decisions about improving student performance in these areas. An additional challenge was bringing the work to scale, developing and negotiating policy implications, and working with education systems and jurisdictions to influence curriculum. Among the strategies adopted to meet these challenges were the adoption of cloud computing, assessment within a developmental framework, commissioning a policy paper by interviewing government and education jurisdiction officials, working with international testing agencies, developing an online delivery mode for teacher development, and providing feedback procedures for students, schools and teachers. In addition, a series of publications outlining the conceptualisation of the issues, the methodological approaches, and the research underpinning the study is planned as a means of addressing these issues and challenges.

 

Paper 2

Introducing assessment tools for 21st Century skills in Finland

Arto Ahonen (University of Jyväskylä, Finland)

Marja Kankaanranta (University of Jyväskylä, Finland)

 

As a partner in the international project Assessment and Teaching of 21st Century Skills (ATC21S), the Finnish government is looking ahead to future needs for teaching and learning. In today's digital world, students must adapt to emerging technologies and new social environments that change the way we communicate and work. Learning to collaborate effectively and connect digitally on local and global scales is essential for everyone in a knowledge-based economy. Although "reading, writing and arithmetic" are essential, today's curricula should include the skills necessary to prepare students for the workforce: critical thinking and problem-solving, communication, collaboration, creativity and innovation. Today's international and national standards primarily measure core subject performance in mathematics, science and reading. ATC21S is designing new assessment prototypes to help education systems include the 21st-century skills that are essential to functioning in the future. This presentation will discuss the experiences of introducing these entirely new types of assessment into Finnish comprehensive schools during the trial phase of the study. The data comprise teacher and student feedback and researchers' observations.

 

Paper 3

ATC21S: Collaborative problem solving trials in the Netherlands

Diederik Schonau (Cito, Arnhem, The Netherlands)

 

In the Netherlands twelve primary and secondary schools volunteered to take part in the ATC21S trials for collaborative problem solving. The assessment tasks were initially taken by 144 students in primary education (age 11 years) and about 450 students in secondary education (ages 13 and 15 years). The trials generated great interest in those schools, due in part to their interest in introducing new ways of learning (and assessment). Implementing the trials met with specific technological challenges presented by the particular characteristics of the delivery platform and the dynamic, dyad-based nature of the tasks. Notwithstanding state-of-the-art IT facilities in schools and good internet access, different types of firewalls and port issues highlighted the complexity of delivering these ambitious tasks in schools. Nevertheless, and most importantly, students were very motivated to work on the tasks and were enthusiastic about the problem-solving challenges they faced. The processes followed to diagnose technological difficulties, as well as the strategies proposed by researchers and students to improve delivery and management, are discussed.

 

Paper 4

Delivery of tasks online - creating tasks to indicate the constructs

Esther Care (University of Melbourne, Australia)

Patrick Griffin (University of Melbourne, Australia)

 

The ATC21S goal of implementing online assessments of 21st century skills in order to drive teaching and learning was approached through a detailed research and development plan. Early work by expert panels in defining and characterising the skills of interest was subjected to a process in which broad developmental learning progressions were drafted. These progressions were mapped onto matrices of the subskills within each skillset, and rubrics were written to describe activities presumed to demonstrate different levels of skill along these progressions. The definitions, learning progressions, and rubrics were then provided to task developers, who drafted task concepts followed by working models. These models were then subjected to intense scrutiny by back-referencing the individual actions that students would need to take to complete the tasks online against the subskill matrices. The steps in this process are presented with detailed examples for collaborative problem solving. The challenges inherent in presenting online tasks for dyad completion, which automatically nullify the possibility of symmetric stimulus presentation and response, are outlined together with their solutions.

 

Discussant

Martina Roth (Intel Corporation, USA)

 

 

 

Important Dates and Deadlines

Conference Dates:

July 3-5, 2012

July 2, 2012 (Pre-Conference Workshops)

 

Deadlines:

Submissions closed on 20 January 2012

Early bird registration closed on 15 April 2012

 


 

_____________________________________

 

DIAMOND SPONSORS:

 

GMAC

 

NIP

 

SHL

 

_____________________________________

 

PLATINUM SPONSORS:

 

BPS

 

BUROS

 

Thomas