Measuring design innovation for project-based design assessment: considerations of robustness and efficiency

Instructional approaches that support the acquisition of innovation design process skills for engineering and other design students are critical to developing design competencies. Resources that enable efficient and valid evaluation of design outcomes are needed; however, current evaluation methods do not apply well to the heterogeneous projects found in authentic project-based design classes such as capstone design. We develop a robust and efficient measure of design outcome innovation and validate our measure with a large and diverse set of design outcomes from a project-based design class. The measure is based on expert judgments of design concepts’ value and functionality derived from a set of importance-weighted design requirements. In the context of an engineering design class, the innovation score was a statistically significant predictor of success in terms of implementation status by the client. Thus, the measure’s validity was supported by its ability to predict design concepts’ implementation by clients in the context of a product realization class. New design outcome assessment measures in the context of authentic project-based design environments, such as the one developed in the present study, should interface with process-based metrics to create higher-quality input into the overall assessment of design team performance.

Introduction
A key concern of design education research is to discover ways to support the training of future innovators in design (i.e., engineering or industrial designers) who have the tools and skills to produce novel artifacts that add significant value to stakeholders (Bransford, 2007;Dym, et al., 2005). Researchers and educators increasingly recognize the importance of "authentic" learning contexts (e.g., learning-by-doing, problem/project-based learning) for design students to develop the skills needed for innovation (Crawley, et al., 2007;Dym, et al., 2005;Litzinger, et al., 2011;Mills and Treagust, 2003). Authentic learning contexts often take the form of a project-based class, such as capstone design. In such a class, students (individually or in teams) tackle a design problem, moving through major early phases of professional design, from problem formulation and understanding, to requirements definition, to concept generation and selection, to prototyping (stopping short of fabrication).
Assessing student innovation in project-based courses presents unique challenges to design educators. While traditional methods (e.g., exams, written reports, etc.) can be effectively leveraged to assess students' understanding of the design process and skill in executing key elements of the process (Atman, et al., 2014), assessment of design outcomes (e.g., the creativity or quality of the team's final product) remains a difficult challenge. Authentic design problems are ill-structured and admit multiple solutions. These properties invalidate key assumptions of traditional methods of assessment (e.g., existence of one "gold standard" answer) and consequently render them unusable. Little guidance exists in the literature as to how to design and implement design outcome measures that are objective, reliable, and robust across multiple contexts. We argue that this lack of guidance is a major reason that educators tend to either neglect assessment of design outcomes, or implement them in an ad-hoc manner (McKenzie, et al., 2004;Pembridge and Paretti, 2010;Sobek and Jain, 2004).
The lack of robust design outcome assessment practices in project courses is a major barrier to effective instruction on innovation. While process adeptness and conceptual and technical knowledge are important components of innovative skill, how will we know if students are able to innovate if we have no robust quantitative way of measuring the innovativeness of their designs? Presumably good design processes lead to good design outcomes, but the correlation is imperfect. Design outcome measures provide an important complement to process measures for calibrating understanding of how students are developing the ability to innovate, providing educators and students more precise information with which to focus learning efforts. In this paper, we contribute to addressing this gap by presenting a new objective measure of design innovation for use in project courses. We adapt a well-known measure of engineering design innovation from Shah and colleagues (Shah, Kulkarni and Vargas-Hernandez, 2000; Shah, Vargas-Hernandez and Smith, 2003), addressing key issues in adapting it to the project course context, including specifying a robust and principled process for devising rating scales across multiple problems with few available solutions for comparison.

1.1 Existing approaches for assessing capstone design and other project-based design courses
In evaluating capstone design and student performance, faculty who teach design use a wide range of assessments that typically include written reports, presentations, and the technical quality of the design. Pembridge and Paretti (2010) report that several stakeholders contribute to the assessment of a project, including the course instructor(s), project advisors, and other students, although fewer than half of the faculty surveyed in their study reported involving industry partners in assessment.
Most importantly, consistent assessment measures across evaluators and projects are critical, yet difficult to apply in a design project course with multiple design contexts (one for each student or student team), each involving different stakeholders. Rubrics that focus on important, assessable characteristics within each activity (e.g., design logbooks, final presentations) are used to communicate expectations to students and to apply consistent evaluation (Nassersharif and Rousseau, 2010). McKenzie, et al. (2004) found that faculty predominantly use the overall combination of written reports, final product technical quality, and design deliverables to determine individual student performance. However, faculty indicated that they lacked the information or knowledge on how to develop rubrics that work for all users in the capstone design setting. Further, the reported success of a project is mainly composed of delineated assessments, which may or may not have had client or industry input on the assessment of the final design (i.e., little emphasis is placed on the overall success of the design outcome, that is, the value provided to the stakeholders).
Few principled approaches to assessing design innovation in project courses exist. A notable exception is Sobek and Jain's (2004) Design Quality Rubric (DQR) and Client Satisfaction Questionnaire (CSQ). These assessments were developed to evaluate projects based on the design outcome per se rather than the process used. The DQR focused on key dimensions of design innovation synthesized from various engineering curricula and design competitions, including meeting technical requirements, and feasibility, novelty, and simplicity of the design. The CSQ focused on a number of different dimensions, including technical quality of the final design, benefits of the design for the company, level of involvement with the design team, complexity of the design task, and quality of final deliverables (e.g., report, presentation, engineering drawings, prototypes).
This work provides a good starting point, but leaves opportunities for improvement. We argue that holistic Likert-like ratings of technical quality do not adequately capture the inherent complexity of design, where design problems address multiple (and sometimes competing) design requirements of varying importance. Holistic rating risks collapsing performance into just one or two requirements, which can mask worthwhile advances for other aspects of the problem the students might have produced.

Existing quantitative measures of design innovation outcomes
To address the need for function-focused assessment of design outcomes, we look to the literature on design innovation research for measures that might apply in an assessment context. One prominent approach is Amabile's (1982; 1983) Consensual Assessment Technique (CAT), in which groups of domain experts judge the creativity of a set of creative products on an unanchored, global creativity scale (typically a Likert-type 1-6 scale). This approach is often used in studies of creativity in domains other than design. Its face validity is high, since its foundational assumption (that domain expert consensus judgments on a product's creativity are a valid and reliable measure) matches that of real-world judgments of creative achievement (e.g., Academy Awards, Nobel Prizes). Reliability of the average judgments across the group of experts is often high, with typical Cronbach's alphas ranging from 0.7 to 0.9. However, this validity and reliability are critically contingent on obtaining both the right kind of experts, since validity and reliability are compromised when raters are not experts in the relevant domain (Kaufman, et al., 2008; Kaufman, Baer and Cole, 2009), and the right number of experts, typically seven or more to obtain acceptable measure reliability (Amabile, 1982). In an engineering education context, it should be relatively easy to obtain the right kind of experts, but it may be prohibitively challenging to obtain the right number of experts.
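For readers who want to check the reliability of a small expert panel, the following is a minimal sketch of the standard Cronbach's alpha computation, treating each judge as an "item" and each product as a case. The function name and the example ratings are hypothetical and are not taken from the studies cited above.

```python
from statistics import variance


def cronbach_alpha(ratings):
    """Cronbach's alpha for a panel of judges.

    ratings: list of rows, one row per rated product; each row holds one
    score per judge (same judge order in every row).
    """
    n_judges = len(ratings[0])
    if n_judges < 2 or len(ratings) < 2:
        raise ValueError("need at least two judges and two products")
    if any(len(row) != n_judges for row in ratings):
        raise ValueError("every product needs a score from every judge")
    # Sample variance of each judge's scores across products.
    judge_vars = [variance([row[j] for row in ratings]) for j in range(n_judges)]
    # Sample variance of the per-product total score.
    total_var = variance([sum(row) for row in ratings])
    return n_judges / (n_judges - 1) * (1 - sum(judge_vars) / total_var)


# Hypothetical 1-6 creativity ratings: 5 products rated by 3 judges.
panel = [[2, 3, 2], [5, 5, 6], [4, 3, 4], [1, 2, 1], [6, 5, 5]]
print(round(cronbach_alpha(panel), 3))
```

With fewer judges than the recommended seven, the same computation shows directly how much reliability is lost for a given pattern of disagreement.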

Creative Product Semantic Scale (CPSS)
Another approach is the Creative Product Semantic Scale (CPSS) (Besemer, 1998; Besemer and O'Quin, 1999; O'Quin and Besemer, 1989). This method consists of providing 1-7 Likert ratings for 55 items, with each item anchored at the top and bottom end of the scale by bipolar adjectives. The adjectives are clustered according to three critical dimensions of creative products, with each dimension composed of sub-dimensions: novelty (composed of originality and surprise), resolution (composed of logical, useful, valuable, and understandable), and elaboration and synthesis (also called style; composed of organic, well-crafted, and elegant). These dimensions are based on Besemer and Treffinger's (1981) Creative Product Analysis Model. Validity has been established for the novelty dimension of the scale, which has been shown to be capable of satisfactorily discriminating between more and less novel known products; validity for the resolution and the elaboration and synthesis dimensions remains to be established convincingly. Reliability has also been shown to be generally good. However, similar to the CAT, a potential barrier to adoption in any area of education is its cost. On average, 10 minutes are required for the rating of a single product, per rater, and Besemer and colleagues (The CPSSAcademic, n.d.) recommend that at least 50-60 knowledgeable raters provide ratings for each product in order to achieve good reliability and validity. In addition, the current version of the scale is proprietary and is pay-per-use. Based on the current fee structure, the cost of obtaining the recommended minimum 50 ratings for a given product would be US$450 per product (The CPSSAcademic, n.d.). Few university departments have the budgetary resources for this approach.
A further issue shared by both CAT and CPSS is the lack of insight into the domain-specific (function-focused) dimensions of design innovation. If these measures are used for assessment, students will gain a global/holistic sense of their innovation performance (similar to Sobek and Jain's DQR), but it may be difficult for them to use that feedback to diagnose and fix deficiencies in the knowledge or skill required to innovate.

Shah and colleagues' system of ideation metrics
A final approach to consider is the system of design innovation metrics proposed by Shah and colleagues (Shah, Kulkarni and Vargas-Hernandez, 2000;Shah, Vargas-Hernandez and Smith, 2003).
Their system includes detailed versions of four traditional metrics of creative processes and products: quantity, variety, novelty, and quality. Because the focus of this paper is on approaches for measuring innovation output rather than process characteristics, we focus on Shah and colleagues' novelty and quality metrics.
Similar to the CPSS, the Shah and colleagues approach was developed specifically within the context of engineering design research (Shah, Kulkarni and Vargas-Hernandez, 2000;Shah, Vargas-Hernandez and Smith, 2003). As such, it has a heavy focus on functions and requirements, which are important in many (but not all) areas of design. Measurement begins with a functional decomposition of the overall product. Any whole product (a system) can be divided into functional subsystems. For instance, a car can be divided into a propulsion subsystem, a steering subsystem, a load-carrying subsystem, and a safety subsystem. Design outputs are then evaluated separately on each of the functions (e.g., its propulsion, its steering). Variations in designs that do not impact these main functions are considered irrelevant. Moreover, the functions may not be equal in overall importance. Thus, the overall novelty or quality score is a weighted-by-function-importance average across sub-scores for each function.
Within the functional decomposition, novelty is the novelty of particular function feature(s). The exact novelty calculation is a variation of an approach used by Torrance (1962), and can be determined using empirically derived or a priori estimates of the novelty of particular features. In the a priori method, the experimenter determines (before conducting the experiment) how common different choices for each function have been in the past-this method has questionable reliability and validity for complex real design applications because it requires very deep knowledge of every design task by the instructor or researcher. More importantly, in many authentic design-problem contexts, prior solutions for a given design problem may not even exist. In the empirical method, essentially focusing on ease-of-generation rather than novelty per se, data is derived from a large set of participant responses to a fixed design task: ideas are novel to the extent that few participants generated them. This method is very convenient for fixed design tasks given to many participants (e.g., in an experiment or at a design competition), but is not useful for evaluation of novelty when each designer/team tackles a different design task, such as in capstone design classes and other authentic project-based design classes.
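The empirical (frequency-based) calculation can be sketched as follows. This sketch assumes the commonly cited form of the score for one function, 10 × (T − C) / T, where T is the number of ideas in the pool and C is the number of ideas that use the same solution; the function name and the idea labels are hypothetical.

```python
from collections import Counter


def frequency_novelty(ideas, scale=10.0):
    """Score each distinct idea by how rarely it appears in the pool.

    ideas: list of solution labels for ONE function of ONE design task,
    one label per participant or team.
    """
    if not ideas:
        raise ValueError("idea pool is empty")
    counts = Counter(ideas)
    total = len(ideas)
    # Rare ideas score near `scale`; ideas everyone produced score near 0.
    return {idea: scale * (total - c) / total for idea, c in counts.items()}


# Fixed task given to many participants: the rare idea scores higher.
print(frequency_novelty(["spring", "spring", "spring", "magnet"]))
# -> {'spring': 2.5, 'magnet': 7.5}

# One team per design task, as in capstone design: the pool is the idea itself.
print(frequency_novelty(["magnet"]))
# -> {'magnet': 0.0}
```

The second call shows why the method breaks down when each team tackles a different task: with a pool of one, every design receives the same score regardless of how novel it actually is.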
The approach used by Shah and colleagues (Shah, Kulkarni and Vargas-Hernandez, 2000; Shah, Vargas-Hernandez and Smith, 2003) to measure idea quality borrows heavily from common engineering evaluation metric approaches such as Quality Function Deployment (Huang, 1996) and Decision Tables (Pahl and Beitz, 1996). In these approaches, design concepts are evaluated on the degree to which they meet the main functional criteria of the overall product (e.g., strength, cost, manufacturability, ease-of-use). An overall score is determined by a weighted sum of the scores on each functional criterion, with the weights reflecting the importance of each criterion (e.g., perhaps ease-of-use is more important in one design context, but cost is more important in another context). In contrast to simple holistic quality ratings, this method for evaluating quality is likely to be reliable because it is more objective, and it is likely to be valid because it is directly linked to design functionality.
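As a worked illustration of the weighted sum, suppose a concept is scored on three criteria. The weights and the 0-10 criterion scores below are hypothetical values chosen only to make the arithmetic concrete, and the weights are assumed to be normalized to sum to 1.

```latex
Q = \sum_{i=1}^{n} w_i \, s_i , \qquad \sum_{i=1}^{n} w_i = 1

% Hypothetical example:
% strength (w = 0.5, s = 8), cost (w = 0.3, s = 6), ease-of-use (w = 0.2, s = 9)
Q = 0.5 \cdot 8 + 0.3 \cdot 6 + 0.2 \cdot 9 = 4.0 + 1.8 + 1.8 = 7.6
```

If the weights were instead shifted toward cost (say 0.2, 0.6, 0.2), the same concept would score 1.6 + 3.6 + 1.8 = 7.0, which illustrates how the importance weights encode the design context.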
The Shah and colleagues (Shah, Kulkarni and Vargas-Hernandez, 2000; Shah, Vargas-Hernandez and Smith, 2003) quality and novelty metrics have generally been found to have good validity and reliability: inter-rater agreement for coding of ideas for functional elements (for novelty calculations) is often high (Cohen's kappa of 0.8 or higher) (Chan, et al., 2011; Tseng, et al., 2008), and inter-rater agreement is similarly high for quality ratings (Pearson's correlation of 0.7 or higher) (Chan, et al., 2011; Linsey, et al., 2010). The quality metric has the additional advantages of high face and construct validity due to its adaptation from industry-wide methods of concept screening (e.g., Pugh, 1991). Cost barriers are also relatively low, as the method does not require a large number of experts to achieve satisfactory reliability, and it is not proprietary pay-for-use. Perhaps for this reason, these metrics have been widely used to good effect in engineering design research (e.g., Chan, et al., 2011; Kurtoglu, Campbell and Linsey, 2009; Linsey, et al., 2010; Tseng, et al., 2008; Wilson, et al., 2010).
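For reference, agreement between two coders on categorical codes can be checked with Cohen's kappa, which corrects the observed agreement for the agreement expected by chance. The sketch below implements the standard two-rater formula; the function name and the coded labels are hypothetical and do not come from the cited studies.

```python
from collections import Counter


def cohens_kappa(rater_a, rater_b):
    """Cohen's kappa for two raters assigning one category to each item."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("raters must code the same non-empty set of items")
    n = len(rater_a)
    # Proportion of items on which the two raters agree.
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Agreement expected by chance from each rater's category frequencies.
    freq_a, freq_b = Counter(rater_a), Counter(rater_b)
    expected = sum(freq_a[c] * freq_b[c] for c in freq_a) / n**2
    if expected == 1:
        raise ValueError("kappa is undefined when both raters use one category")
    return (observed - expected) / (1 - expected)


# Hypothetical functional-element codes for four ideas.
coder_1 = ["lever", "lever", "gear", "gear"]
coder_2 = ["lever", "lever", "gear", "lever"]
print(cohens_kappa(coder_1, coder_2))  # -> 0.5
```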
However, some important details are left unspecified by Shah and colleagues (Shah, Kulkarni and Vargas-Hernandez, 2000; Shah, Vargas-Hernandez and Smith, 2003), which has potential impacts on the translation of their metrics into the capstone design assessment context; these issues are summarized in Table 1. First, regarding overall functional decomposition, what guidelines should be followed in decomposing a product into functional sub-systems? Ideally, we would like this design outcome measure to be comparable across projects within a class, across semesters of a class, and across instructors to support various scales of formative evaluation and empirical research. Second, when gauging the novelty of a final product, or set of final products, it is not clear how to establish the universe of ideas within which to estimate a given product's rarity. In capstone design courses, observed idea spaces are typically small because teams often solve different problems, and if they do solve the same problem, the number of teams is typically small. In these situations, estimates of the baseline frequencies of various concept types may either be circular (in the limiting case where a product is compared only against itself) or not be stable enough to generate valid and reliable estimates of a final product's relative novelty. Further, there are often few outside benchmarks from which to draw estimates of the novelty of concepts. This is especially the case if the teams are addressing authentic design problems, which are likely to be unsolved, and therefore by definition do not have an established universe of possible solutions to compare against. Third, there are questions surrounding the use of sub-scales for the quality metric. From where should these sub-scales come, and how should the sub-scales be weighted? Who should generate these sub-scales and assign weights, and by what criteria? Finally, the rating scale is also vaguely specified. How should instructors determine the size of the rating scale, and appropriate anchors for each portion of the rating scale? Without answers to these questions, applying this assessment approach in a capstone context may lead to invalid or unreliable assessments, potentially obstructing rather than facilitating accurate assessment of students' skill development.

Table 1
Aspect                      Question
Functional decomposition    How to decompose?
Novelty                     What universe of ideas serves as baseline?
Quality sub-scales          Who/how to define sub-scales?
Quality rating scale        Size and anchors for scale?

To evaluate the success of a design outcome in a capstone design course environment, where a variety of projects typically have one design outcome each, we developed a metric that addresses key issues in evaluating design outcomes for a course with a diverse set of projects. Our working definition of innovative design is the production of artifacts that add significant value over existing/prior artifacts and address a given need or want. A product that adds value to existing artifacts can be created through realizing new functions and properties (Pahl, et al., 2007) or meeting requirements in novel ways (Cropley and Cropley, 2005). Focusing on the performance of a product's functional components provides an evaluation measure that we can extend to a diverse set of design outcomes, including dimensions that are relatively qualitative.
Following Shah and colleagues' quality metric (Shah, Kulkarni and Vargas-Hernandez, 2000;Shah, Vargas-Hernandez and Smith, 2003), our innovation measure codes designs for innovation based on a set of project-specific subscales that relate to key functional requirements of the design, with each subscale weighted by importance to the overall design. Appropriate judges are selected to rate designs on their respective subscales, using a well-defined rating scale. The final innovation measure for a given design is a weighted combination of its subscale ratings.
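The weighted combination can be sketched as below. The exact combination rule is not spelled out in this passage, so the sketch assumes one plausible form: the importance-weighted sum of subscale ratings divided by the importance-weighted maximum attainable, giving a score between 0 and 1 that is comparable across projects with different numbers of requirements. The function name, parameter names, and example values are hypothetical.

```python
def innovation_score(ratings, weights, max_ratings):
    """Importance-weighted share of the maximum attainable rating.

    ratings:     judge's rating on each requirement subscale
    weights:     importance weight of each requirement
    max_ratings: top of the rating scale used for each requirement
    (Normalizing by the weighted maximum is an assumption of this sketch.)
    """
    if not ratings or not (len(ratings) == len(weights) == len(max_ratings)):
        raise ValueError("ratings, weights and max_ratings must align")
    if any(r < 0 or r > m for r, m in zip(ratings, max_ratings)):
        raise ValueError("each rating must lie between 0 and its scale maximum")
    earned = sum(w * r for w, r in zip(weights, ratings))
    possible = sum(w * m for w, m in zip(weights, max_ratings))
    return earned / possible


# Hypothetical project with three requirements rated on a 0-3 scale.
score = innovation_score(ratings=[3, 2, 1], weights=[5, 3, 1], max_ratings=[3, 3, 3])
print(round(score, 3))  # -> 0.815
```

Passing the scale maximum per requirement lets a single project mix subscales of different lengths without changing the calculation.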
Departing from Shah and colleagues (Shah, Kulkarni and Vargas-Hernandez, 2000;Shah, Vargas-Hernandez and Smith, 2003), we collapse novelty into this single outcome measure, leaving novelty implicit in the measure. Our motivation and justification for this is as follows. First, our theoretical assumption is that innovative design is the production of artifacts that add significant value over existing/prior artifacts that address a given need or want. This assumption privileges functionality and value-added over novelty per se (i.e., without functionality/value added), in a similar fashion to other theoretical conceptualizations of innovation (Chandy and Tellis, 1998;Garcia and Calantone, 2003). Novelty per se is not always beneficial, as the potential value of distinctiveness from the competition may be offset by additional costs required to bring that design to market, from manufacturing, supply chain, and user-support perspectives. Conversely, relatively small changes in functionality (i.e., low novelty) can sometimes lead to "game-changing" degrees of value-added. Second, in authentic design contexts, a minimal level of novelty is an implicit requirement of design briefs. Clients do not seek designed artifacts that are identical to existing competition; rather, they seek new designs that are different in some substantial way from the competition (usually in terms of value added). Finally, as discussed earlier, a frequency or a priori-based approach to estimating novelty of designs is potentially problematic in most design process-outcome research contexts. We believe that leaving novelty implicit in the measure, unless it is an explicit and separately specified client requirement, yields a clean and usable measure of design innovation, particularly in an engineering context, as well as other design contexts where functionality and value added are paramount considerations.

The client-defined design problem case
For capstone projects in which the design problem is determined by an external agent (e.g., entrepreneur, an end-user, or company), requirements are derived directly from the design process. The process we specify assumes that the design projects to which this measure will be applied will begin with some sort of design brief, and go through an initial requirements clarification phase, where customer needs are translated into specific design requirements. Requirements continue to be iteratively refined, dropped, or added, via continued conversation and feedback loops with stakeholders as the design process progresses. We further assume that requirements will at least implicitly be ordered in an importance hierarchy, where certain requirements may be core/critical, others less so, and still others optional. This iterative generating and refining of importance-weighted requirements is prescribed in many prominent engineering design texts (Otto and Wood, 2000; Ullman, 2002; Ulrich and Eppinger, 2008), and practiced in many (but not all!) capstone and experiential design courses.
Our proposed method extracts the final set of requirements for a given project and uses that set as the sub-scales for the innovation measure. Importance weights are extracted from project documents if they are explicitly specified. If design teams do not specify importance ratings for the set of design requirements, instructors using this method should query the team for explicit weightings. To ensure that requirements and importance weights are properly specified, we recommend an additional step prior to rating where a knowledgeable expert checks the requirements. With poorer-functioning/performing teams, one runs the risk with this method of obtaining importance-weighted requirements/specifications with serious flaws, such as incompleteness, poorly assigned importance weightings, and others. Even in higher-performing teams, however, there could be differences in how well requirements are captured and specified, and these differences could be confounded with other predictor variables of interest (e.g., conflict handling, which can influence accuracy of requirements, in addition to quality of solutions). Adding the extra step of screening the final set of requirements helps to ensure that the final measure of innovation validly measures the extent to which the design adds value over existing/prior designs.

The instructor-defined design problem case
For capstone projects where the instructor forms the design problem, requirements are generated a priori. The instructor can generate importance weightings a priori as well, since the instructor is, in effect, the client.

The rating scales
We define two separate scales for rating the degree to which requirements are met; both can be applied even to relatively qualitative requirements. In the general case, the scale consists of 4 points, as follows: 0 - Did not come close to meeting requirement; 1 - Fell just short of meeting requirement; 2 - Met requirement, but did not exceed significantly; 3 - Significantly exceeded the requirement.
In design, requirements can sometimes include a specification of both minimal and ideal values, where minimal values describe an outcome case with which stakeholders would be satisfied, and ideal values typically describe extremely high-quality thresholds that are often not possible without significant compromises on other sub-systems or significant breakthrough innovation. For example, a market analysis may find that a sufficient number of users would pay $20 for a product (minimal value), but a much larger number of users would pay $10 for the product (ideal value). In this more specialized case, a 5-point scale is employed, as follows: 0 - Did not come close to minimal; 1 - Fell just short of minimal; 2 - Met minimal but did not meet ideal; 3 - Met ideal; 4 - Significantly exceeded ideal value. This 5-point scale allows for a measure of design success that goes above and beyond normal standards of excellence. Each point on this scale below 4 corresponds to its matching point on the general 4-point scale.
Why use these categories rather than simply directly using the metrics upon which the categories are based? First, the categories normalize the measure across many very diverse metrics of performance that are on completely different dimensions (e.g., cost, usability, strength). That is, one cannot meaningfully directly average dollars, usability ratings, and tensile strength measurements. Second, the approach allows for the inclusion of more qualitative dimensions that do not have an underlying quantitative scale (e.g., inclusion of a certain design esthetic). Third, the categories take into account the satisficing nature of design, in which the quality of a design changes categorically in the minds of users when thresholds are met (Bansal, et al., 2009). For example, a pen priced at 120% of what the intended user is willing to pay is not so different from a pen priced at 150% of what the intended user is willing to pay (i.e., neither pen is purchased); similarly, 20% and 50% of target costs are close to equally good. Fourth, the categories likely reflect the realistic precision on many of the metrics: while some dimensions like strength can be measured precisely to many decimals, dimensions like usability, attractiveness, and manufacturing costs can only be roughly estimated during the design process, so more precision on the innovation scale is not warranted.
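For requirements that do have an underlying quantitative metric, the 5-point categorization can be sketched as a threshold function. The scale itself is rated by an expert judge, so this is only an illustration: the cut-offs for "fell just short" and "significantly exceeded" are not quantified in the text, and the `near` and `beyond` margins (fractions of the minimal-to-ideal span) are hypothetical parameters introduced here.

```python
def rate_requirement(value, minimal, ideal, near=0.10, beyond=0.10):
    """Map a measured value onto the 0-4 minimal/ideal scale.

    The direction of the metric is inferred from the thresholds: if the
    ideal value is below the minimal value (e.g., price), lower is better.
    `near` and `beyond` are assumed margins, not values from the paper.
    """
    if ideal == minimal:
        raise ValueError("ideal and minimal values must differ")
    # Flip the sign when lower is better so that larger always means better.
    sign = 1 if ideal > minimal else -1
    v, lo, hi = sign * value, sign * minimal, sign * ideal
    span = hi - lo
    if v >= hi + beyond * span:
        return 4  # significantly exceeded ideal value
    if v >= hi:
        return 3  # met ideal
    if v >= lo:
        return 2  # met minimal but did not meet ideal
    if v >= lo - near * span:
        return 1  # fell just short of minimal
    return 0      # did not come close to minimal


# Price example from the text: $20 is the minimal value, $10 the ideal value.
for price in (30, 20.5, 15, 10, 8):
    print(price, rate_requirement(price, minimal=20, ideal=10))
```

Because the output is a category rather than the raw metric, ratings for cost, usability, and strength can be combined on a common footing, which is the normalization argument made above.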

The client-defined design problem case
For each client-defined design problem case, an appropriate domain expert or stakeholder can serve as the judge.

The instructor-defined design problem case
In the instructor-defined design problem case, a faculty or staff member (e.g., a graduate student) with relevant content expertise can act as an appropriate judge.

Research context and participants
In this section, we describe an example implementation of our metric to investigate:
• Can the proposed metric be used as an effective assessment in the context of an educational setting?
• Does the proposed metric provide a valid and reliable measure of success for a capstone design course with a diverse set of design outcomes?
We apply our metric to a sample of engineering projects carried out by multidisciplinary engineering student design teams enrolled in a product realization course at a large research university in the northeastern United States. Courses employing a product-based learning pedagogical approach and advised or sponsored by an outside client are common in US schools of engineering (Hotaling, et al., 2012). The course also has many features in common with capstone design courses required of all engineering undergraduates in the US, except perhaps being more multidisciplinary and employing a more structured design process.
Our sample consists of 57 teams across 7 semester-long implementations of the course (from Spring 2007 to Summer 2009). In each implementation, multidisciplinary teams of 3 to 5 students took products from concept to functional prototype. Up to US$2,500 in funding was made available for student teams to make conceptual prototypes as part of their products. An industry mentor was made available for each project to assist the team in making design decisions.
Each team worked on a different project (typically 6 different teams/projects per semester), and projects were rarely repeated from prior semesters. Application domains were diverse, ranging from running shoe cushion monitoring devices to computerized pill minders for dispensing medication to Radio Frequency Identification (RFID) personnel badge systems. Examples of the final designs in Figure 1 illustrate the diversity in project outcomes and applications. Approximately 20% of design teams in these courses produced products that were later patented, and teams appeared to vary greatly in terms of overall innovation. This sample of teams was collected as part of a larger research project examining process antecedents of engineering team innovation, which motivated the need for a measure of innovation that is reliable and valid across this heterogeneous set of projects.

Deriving the sub-scales
As part of the guided design process used in the course, teams developed requirements by specifying the needed functions, behaviors, and attributes of their proposed design solution. Teams also rated each requirement on an importance scale of 1 (weakly important to the development of the design) to 5 (absolutely important to the development of the design), and provided minimal and ideal values for requirements, where applicable. We extracted the specified requirements, levels, and importance ratings from their design artifacts, such as design notebooks, project write-ups, and intermediate project presentations.
These team-provided requirements and importance ratings were treated as a first draft of the requirements officially used to evaluate innovation. Before assessing team performance on each sub-scale of our innovation metric, the course instructor assigned an importance rating to each requirement and evaluated the requirement set as a whole, both to identify requirements necessary to the success of the design solution that had been omitted during the requirement identification stage and to identify unnecessary or extraneous requirements. These "omitted requirements" were added (along with their respective weights) to the evaluation set, and unnecessary or extraneous requirements were removed, based on the instructor's evaluation.
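The assembly of the evaluation set and its aggregation into a single score can be illustrated with a minimal sketch. The aggregation rule shown (an importance-weighted mean rescaled to 0-100), the 0.0 to 1.0 rating scale, and all names and example values are assumptions made for illustration only; they are not the exact formula or data used in the study.

```python
from dataclasses import dataclass


@dataclass
class Requirement:
    name: str
    importance: int  # instructor-assigned weight, 1 (weakly) to 5 (absolutely important)
    rating: float    # hypothetical expert rating of fulfilment, 0.0 to 1.0 (assumed scale)


def build_evaluation_set(team_requirements, omitted, extraneous_names):
    """Drop requirements judged extraneous, then append the omitted ones."""
    kept = [r for r in team_requirements if r.name not in extraneous_names]
    return kept + list(omitted)


def innovation_score(requirements):
    """Importance-weighted mean rating rescaled to 0-100 (assumed aggregation)."""
    total_weight = sum(r.importance for r in requirements)
    if total_weight == 0:
        raise ValueError("evaluation set has no weighted requirements")
    weighted = sum(r.importance * r.rating for r in requirements)
    return 100.0 * weighted / total_weight


# Hypothetical pill-minder project (illustrative values, not study data)
team = [
    Requirement("dispenses correct dose", 5, 1.0),
    Requirement("battery life", 3, 0.5),
    Requirement("decorative casing", 1, 0.0),
]
omitted = [Requirement("child-resistant lock", 4, 0.5)]
evaluation_set = build_evaluation_set(team, omitted, {"decorative casing"})
score = innovation_score(evaluation_set)
```

Under these assumptions, removing an extraneous requirement or adding an omitted one changes both the numerator and the total weight, which is why the instructor's review of the requirement set precedes scoring.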

Selecting appropriate experts
In this implementation, the course instructor for the semester served as the expert who judged each team's innovation. In semesters where there were multiple instructors, supervision of teams tended to be divided among the instructors based on their relative expertise within each team's project domain. Thus, one expert rated each team's requirements, i.e., either the overall course instructor (for semesters where there was only one instructor) or the instructor who was most familiar with and had expertise in their project (for semesters where there were multiple instructors). As is often the case with instructors of such courses in schools of engineering, the course instructors had extensive experience in product realization, including numerous patents, startup company experience, and industry consulting, and also had extensive content knowledge about material sciences, mechanical engineering, and electrical engineering, the key elements of projects selected for analysis.

4. Assessing the validity of the measure

Methods
To examine the validity of the measure, we examined the relationship between teams' final innovation scores and whether their sponsor implemented their resulting design to some degree. To obtain this measure of implementation status, we queried sponsors at the end of the semester whether they planned to implement or were implementing the teams' final design (or at least some aspect of it) at the company. On the basis of each team's sponsor's responses, we constructed a three-level (yes, maybe, no) implementation status measure, where "yes" indexed responses that indicated at least some aspect of the team's design was currently being implemented, "maybe" indexed responses that indicated at least some aspect of the design was being considered for future implementation by the sponsor in some fashion, and "no" indexed unambiguous responses that indicated no part of the team's design was currently being implemented or considered for future implementation by the sponsor. Our sample for this analysis consisted of 47 teams across 7 semesters of the course; 10 of the 57 teams in the full dataset were excluded because we were either unable to obtain the implementation information from the sponsor, or the sponsor had not yet decided on implementation during our data collection period.

Results
Obtained innovation scores in our sample ranged from a minimum of 0 (only one team obtained this score) to a maximum of 100 (only one team obtained this score), with a mean of 63.9 and a standard deviation of 18.9. That is, the proposed measure shows substantial variation across teams of this sort (i.e., the full range of the measure is obtained, and the mean performance is somewhere near the middle of the scale). There were 18 teams with a "yes" response for implementation, 16 teams with "maybe", and 13 teams with "no". Figure 2 shows the relationship between teams' innovation score and implementation status. To statistically explore the association between the measures, we conducted a one-way analysis of variance (ANOVA) with implementation status as the between-subjects factor. The ANOVA revealed a significant main effect of implementation status on innovation score, F(2, 44) = 8.22, p < 0.01. Post-hoc pairwise Tukey tests revealed that teams with "yes" implementation status had significantly higher innovation scores (M = 73.3, SD = 13.2) than teams with "no" implementation status (M = 49.2, SD = 23.8), Cohen's d = 1.36, p < 0.001. Teams with "maybe" implementation status also had significantly higher innovation scores (M = 67.8, SD = 12.3) than teams with "no" implementation status, Cohen's d = 1.07, p < 0.05. "Yes" teams also had higher scores than "maybe" teams, d = 0.52, but the difference was not statistically significant, p = 0.49. While one might expect "yes" and "maybe" teams to be significantly different, we believe the small difference observed here reflects the complexity of sponsors' decision processes for implementing teams' projects. It is likely the case that many project outcomes were good enough to implement, but various internal/contingent factors (e.g., budget constraints, intellectual property considerations) might have prevented immediate implementation.
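For readers who wish to check the analysis, the one-way ANOVA F statistic and a Cohen's d can be approximately re-derived from the group sizes, means, and standard deviations reported in these Results. The sketch below is illustrative only: it starts from rounded summary statistics and assumes the pooled-standard-deviation form of Cohen's d (the variant used in the study is not stated), so its outputs are approximations and need not match the reported values exactly. Function and variable names are hypothetical.

```python
import math

# (n, mean, SD) of innovation scores per implementation status, as reported
groups = {
    "yes": (18, 73.3, 13.2),
    "maybe": (16, 67.8, 12.3),
    "no": (13, 49.2, 23.8),
}


def anova_from_summary(groups):
    """One-way ANOVA F statistic computed from group summary statistics."""
    n_total = sum(n for n, _, _ in groups.values())
    grand_mean = sum(n * m for n, m, _ in groups.values()) / n_total
    # Between-groups sum of squares: group means around the grand mean
    ss_between = sum(n * (m - grand_mean) ** 2 for n, m, _ in groups.values())
    # Within-groups sum of squares: recovered from each group's sample SD
    ss_within = sum((n - 1) * sd ** 2 for n, _, sd in groups.values())
    df_between = len(groups) - 1
    df_within = n_total - len(groups)
    f_stat = (ss_between / df_between) / (ss_within / df_within)
    return f_stat, df_between, df_within


def cohens_d(a, b):
    """Cohen's d using the pooled SD of two groups (assumed variant)."""
    (n1, m1, s1), (n2, m2, s2) = a, b
    pooled = math.sqrt(((n1 - 1) * s1 ** 2 + (n2 - 1) * s2 ** 2) / (n1 + n2 - 2))
    return (m1 - m2) / pooled


f_stat, df_between, df_within = anova_from_summary(groups)
d_yes_no = cohens_d(groups["yes"], groups["no"])
print(f"F({df_between}, {df_within}) = {f_stat:.2f}; d(yes vs. no) = {d_yes_no:.2f}")
```

The degrees of freedom follow directly from the design: three status levels give 2 between-groups degrees of freedom, and 47 teams minus 3 groups give 44 within-groups degrees of freedom.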

Conclusions
In this paper we have motivated the need for a robust and efficient measure of design innovation, suitable for design outcome assessment in authentic project/problem-based design courses.
We have presented an advance on Shah and colleagues' ideation metrics system (Shah, Kulkarni and Vargas-Hernandez, 2000; Shah, Vargas-Hernandez and Smith, 2003), adapting their quality metric into an improved innovation measure that meets this need, and addressing key issues in adapting this measure to the design education context (see Table 5 for a summary of how our measure addresses each key issue). Finally, we have demonstrated evidence of this measure's convergent validity in terms of a significant association with the probability of a final design concept being implemented by its client/sponsor, in the context of a product realization course.
Table 5. Summary of our answers to key methodological ambiguities in Shah and colleagues' ideation metrics system (surviving entry: rating scale is standardized across sub-scales using qualitative anchors).

Limitations
While the measure described and evaluated in this paper demonstrated a robust metric based on the functionality and requirement performance, we also identify potential limitations with this approach. First, a prerequisite of our innovation measure is that the designer or design team being measured goes through a process of explicit and iterative refinement of design requirements. While developing our measure based on this prerequisite makes our measure well suited for many realistic design contexts, it does mean that the measure may not be valid in settings where this assumption does not hold, for instance, in design courses where student teams do not follow a requirements-focused design process (e.g., in more esthetic-oriented design or when exploring new technologies more broadly without consideration of particular user groups' needs). Thus, it is possible that our measure cannot capture the full bottom range of design innovation performance. However, as we have shown in our example implementation, and in our validity study, our measure does have a useful range of variation, and is able to capture the performance differences of design teams whose final design concepts are of poor enough quality to preclude consideration for implementation.
Second, some evaluators might wonder about the reliability of our measure, given that we have not reported a reliability analysis. The reason for the lack of reliability data was that, in our implementation of the measure described above, we were not able to get more than one qualified expert for most of the teams. In most semesters, there was only one instructor, and in semesters where there were multiple instructors, they focused separately on teams whose projects they were most familiar with; in that case, obtaining ratings from multiple instructors for all teams was inappropriate, as their level of expertise and familiarity with some teams was not sufficient. We note, however, that the convergent validity of the measure can be treated as indirect evidence for its satisfactory reliability. If the measure were generally unreliable, this statistical noise would obscure meaningful relationships, and the relationship between the innovation score and implementation status presented above would be unlikely to be detected. In our other work, the innovation measure proposed here has also proven useful for discriminating between better and worse design process characteristics, such as the amount and timing of appropriate design tool use (Jang and Schunn, 2012; Paletz, Chan, and Schunn, 2017). From these two productive uses of the measure, we conclude that our measure has satisfactory reliability, at least for detecting effects of equivalent or greater size compared with the effects presented in this paper and in Jang and Schunn (2012) and Paletz, Chan, and Schunn (2017), and is robust to the potential statistical noise from using one expert.
Third, we have provided an approach to measuring outcomes, but it does not provide insights into underlying educational problems. Measurement is helpful to formative assessment, but quantitative measures alone are not enough to guide student learning. It is particularly likely that feedback about design process quality will be needed to complement the feedback obtained on design outcomes from our measures.
Finally, some researchers may worry that our measure fails to capture the distinction between incremental and radical innovation, at least in part because novelty is only implicit in the measure. However, as we have argued, the theoretical basis of our measure of innovation construes value-added as the primary component of the construct of innovation. Nevertheless, we freely admit that a single ideal measure of innovation for all purposes is neither feasible nor desirable. Design instructors and/or researchers may find it useful to view our measure as being part of a suite of possible innovation measures to illuminate the various aspects or nuances of design innovation skill. We recommend future research investigating the capacity of a more absolute scale that incorporates weightings based on instructor or client perceptions of difficulty.

Suggestions for design educators
Capstone design instructors are faced with assessing the success of the course projects, which is particularly difficult when only one design outcome exists for a particular design prompt. The culmination of a capstone design project exhibits the final design outcomes as well as various design documents that support the process used to arrive at the final design solutions, including the translation of requirements into design solutions. While a successful design outcome is beneficial to the client as well as the design team, what we can derive from this finding as design educators is the impact of developing students' design process skills. In this study, the link between requirement definition and successful design outcomes demonstrates the importance of instructing and supporting students in defining and writing requirements that address client needs/wants. Based on the results of our measure's implementation, we recommend that students iteratively work with their client(s) and possibly the instructor to define the applicable requirements as well as a method for meeting the requirements to the appropriate degree.
Typically, one instructor is responsible for assessing multiple projects as part of the capstone design course, which makes it difficult to employ many different types of assessment for the elements of the design project. The satisfactory convergent validity (and hence implicitly the reliability) of our measure is encouraging for instructors who wish to employ our measure but are also constrained by resource limitations, e.g., only 1 qualified expert per team (a situation that, we suspect, is not uncommon in capstone design courses). Nevertheless, we encourage instructors who wish to employ our measure to use at least two experts, if possible, and recommend explicit reliability analysis in future development work on the measure.
In the context of design education research and development, a results-focused approach to measuring student outcomes in project-based learning cannot and should not be viewed as a comprehensive measure of student progress and performance in developing innovation competencies. Important complements include knowledge and process-based assessments targeted specifically for creativity and innovation (Daly, Mosyjowski and Seifert, 2014). These types of assessments can also interface with more holistic assessment approaches of design outcome success, to create higher-quality input to the assessments of the design teams' performances.