How Was Your Day? Evaluating a Conversational Companion

The "How Was Your Day" (HWYD) companion is an embodied conversational agent that can discuss work-related issues, entering free-form dialogues while discussing issues surrounding a typical work day. The open-ended nature of these interactions requires new models of evaluation. Here, we describe a paradigm and methodology for evaluating the main aspects of such functionality in conjunction with overall system behavior, with respect to three parameters: functional ability (i.e., does it do the "rightâ thing conversationally), content (i.e., does it respond appropriately to the semantic context), and emotional behavior (i.e., given the emotional input from the user, does it respond in an emotionally appropriate way). We demonstrate the functionality of our evaluation paradigm as a method for both grading current system performance, and targeting areas for particular performance review. We show correlation between, for example, automatic speech recognition performance and overall system performance (as is expected in systems of this type), but beyond this, we show where individual utterances or responses, indicated as positive or negative, characterize system performance, and demonstrate how our combination evaluation approach highlights issues (both positive and negative) in the companion system's interaction behavior.

The paper discusses the use of objective measures, subjective measures and appropriateness annotation for evaluating Companions, and the general requirements and features of the approach. We evaluate such a system, the "How Was Your Day" (HWYD) Companion [2], [3], an embodied conversational agent that can discuss work-related issues. In addition to looking at traditional measures such as the length of the interaction, we evaluate the HWYD Companion's emotional capabilities, and investigate the use of appropriateness as a measure of conversation quality, the hypothesis being that good Companions need to be good conversational partners. This introduction describes the HWYD Companion system and discusses some previous efforts to evaluate spoken dialogue systems. Section 2 introduces the proposed evaluation paradigm for Companions with its subjective and objective measures. Section 3 discusses the evaluation methodology and how the user studies were set up and performed. The scenarios adopted for those studies play a vital role in the evaluations and are described in detail in Section 4. Results of experimental user studies carried out along these lines are presented and analysed in Section 5. Section 6 finally discusses the experiences from the experimental evaluations.

The "How Was Your Day" Companion
The user interface (UI) of the HWYD system [4] is illustrated in Figure 1. On the left we see an avatar exhibiting facial expressions and gestures. The system is rendered on an HD screen with a roughly life-size ECA. The HWYD Companion can engage in long, free-form conversations about events that have taken place during the user's working day. The system processes user input in both a 'short' and a 'long' loop. In the short loop, the avatar gives quick affective feedback in response to the user's utterance, unless the emotion classifier's confidence score is below a pre-set threshold value (depending on the competing valence categories).
In the 'long' loop, the rule-based Dialogue Manager (DM) takes the affect-annotated semantic output of the NLU and determines the next system turn, which is generated by the plan-based Affective Strategy Module (ASM) and handed to the Natural Language Generator (NLG). The NLG output is passed both to speech synthesis (an extension of the Loquendo TTS system, including paralinguistic elements such as exclamations and laughter, and emotional prosody generation for negative and positive utterances), and to the module guiding the movements of the avatar, producing gestures and facial expressions conveying the Companion's emotional state.
Two more modules are shown in Figure 1: the Knowledge Base (KB) acts as the central repository of data in the system and is available to all other modules, while the Interruption Manager (IM) [7] handles the system's responses to user barge-ins. When a genuine user interruption (rather than just a backchannel) is detected, the IM instructs the Companion to stop speaking (at the next natural stopping point) and the user's interruption utterance is processed by the long loop.
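As an illustration of this two-loop arrangement (and not the actual HWYD implementation), the sketch below routes a user turn either through an immediate affective-feedback path or through the full NLU-DM-ASM-NLG pipeline, suppressing the quick feedback when the affect classifier's confidence falls below a threshold; all names, stubs and the threshold value are hypothetical.

```python
# Illustrative sketch only: a minimal routing skeleton for the two-loop
# architecture described above. All class and function names are
# hypothetical, not the actual HWYD module interfaces.

from dataclasses import dataclass

@dataclass
class EmotionEstimate:
    label: str         # e.g. "NegativeActive", "Neutral", ...
    confidence: float  # classifier confidence in [0, 1]

SHORT_LOOP_THRESHOLD = 0.6  # assumed pre-set confidence threshold

def short_loop(estimate: EmotionEstimate) -> str:
    """Quick affective feedback rendered by the avatar."""
    return f"<avatar backchannel reflecting {estimate.label}>"

def long_loop(utterance: str, estimate: EmotionEstimate) -> str:
    """Full pipeline: NLU -> DM -> ASM -> NLG (all stubbed here)."""
    semantics = {"text": utterance, "affect": estimate.label}  # NLU stand-in
    plan = f"respond-to({semantics['affect']})"                # DM/ASM stand-in
    return f"<system turn generated from plan: {plan}>"        # NLG stand-in

def handle_user_turn(utterance: str, estimate: EmotionEstimate) -> list[str]:
    outputs = []
    # Immediate feedback only when the affect classifier is confident enough.
    if estimate.confidence >= SHORT_LOOP_THRESHOLD:
        outputs.append(short_loop(estimate))
    outputs.append(long_loop(utterance, estimate))
    return outputs

if __name__ == "__main__":
    est = EmotionEstimate("NegativeActive", 0.8)
    for out in handle_user_turn("My promotion was rejected today", est):
        print(out)
```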

Evaluating Companions
Companions are targeted as persistent, collaborative, conversational partners, where the user may have a wide degree of initiative in the resulting interaction. Rather than the singular, focused tasks seen in the majority of deployed dialogue systems, fully developed Companions can have a range of tasks and be expected to switch task on demand. Some tasks are not defined in such a way that an automatic system can know a priori when they are complete. It may be that the task itself is defined as maintaining a relationship, which is not something that can be measured using traditional metrics such as task completion. When devising an evaluation paradigm for such systems, we need to balance the completion of any tasks with some measure of "conversational performance". The assumption in traditional dialogue evaluation is that the quality of the conversation correlates with user satisfaction; that is, if the resulting dialogue is annoying or repetitive, we expect a corresponding drop in user satisfaction. However, user satisfaction is in some sense a composite score, covering the entire interaction. Thus, for example, poor text-to-speech performance can have a disproportionate effect on user satisfaction.
A significant amount of effort has been spent on evaluating spoken language dialogue systems, mostly relying on a combination of observable metrics and user feedback (cf. [8], [9], [10]). Efficiency and effectiveness metrics often include the number of user turns, the number of system turns, and the total elapsed time. For the "quality of interaction", it is usual to record speech recognition rejections, time-out prompts, help requests, barge-ins, mean recognition score (concept accuracy), and cancellation requests. Note that these are somewhat functional descriptors of the quality of interaction.
The DARPA Communicator Program made extensive use of the PARADISE metric [15]. PARADISE (PARAdigm for DIaLogue System Evaluation) was developed to evaluate the performance of spoken dialogue systems in a way that is de-coupled from the task the system was attempting. 'Performance' of a dialogue system is affected both by what the user and the dialogue agent accomplish working together, and by how it gets accomplished, in terms of the quality measures indicated above. PARADISE aims to maximise task completion whilst simultaneously minimising dialogue costs, measured both as the objective efficiency of the dialogue (length, measured in total turns, for example) and by some qualitative measure. A consequence of this model is that often the dialogue quality parameters are tuned to overcome the deficiencies highlighted by the observable metrics, as discussed by Hajdinjak and Mihelič [16]. For example, using explicit confirmation increases the likelihood of task completion, and so is often chosen, despite being regarded as somewhat unnatural in comparison with human-human speech data.
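For reference, PARADISE [15] expresses this trade-off as a single performance function, a weighted combination of task success and normalised dialogue costs:

\[
\mathrm{Performance} = \alpha \cdot \mathcal{N}(\kappa) - \sum_{i=1}^{n} w_i \cdot \mathcal{N}(c_i)
\]

where \(\kappa\) is the Kappa coefficient measuring task success, the \(c_i\) are efficiency and quality cost measures, \(\mathcal{N}\) is a z-score normalisation, and the weights \(\alpha\) and \(w_i\) are estimated by regression against user satisfaction ratings.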
The lack of a community-wide method for evaluating the conversational performance of spoken language dialogue systems acts as a barrier to the wholesale development of usable, practical systems beyond simple, task-oriented interaction. We want to develop a method of scoring conversational performance directly, measuring the system's capability to maintain a conversation based on the progression of the dialogue. We believe that conversational performance can be measured in terms of appropriateness, and indeed several researchers have previously looked at using appropriateness of dialogue as a measure of effective communication strategies (cf. [11], [12], [13], [14]).

EVALUATION PARADIGM
In order to evaluate a Companion, some overall system properties need to be charted: functional ability (does it do the 'right' thing?), content (does it respond appropriately to the semantic context?), and emotional behaviour (given the emotional input from the user, does it respond in an emotionally appropriate way?).
To this end, we have developed an evaluation process that considers, and correlates, three types of features:
1. Metric-centric: the use of quantitative methods to determine values for dialogue metric data, including the word error rate of speech recognition and the concept error rate of natural language understanding, in conjunction with readily computable scores such as dialogue duration, number of turns, and words per turn.
2. User-centric: qualitative methods used to acquire subjective impressions and opinions from the users of the Companions prototypes, including Likert-based surveys, focus groups and interviews.
3. Measure of appropriateness: an annotation of the data resulting from the metric-centric evaluation.
Each of the resulting appropriateness annotations over the transcript is then scored. Filled pauses are graded as generally human-like, and good for virtual agents to perform, but they do not add much (score 0). Appropriate responses and questions are very good (AP/AQ: +2), and extended contributions are good (COM: +0.5), but even better are new initiatives and responses pushing the interaction back on track (INI: +3). Repairs and clarifications are bad as such (RR: -0.5), but their use can still gain points by enabling a subsequent appropriate response. For example, if it takes two dialogue moves to complete a repair (with a combined score of -1) that then leads to an appropriate response (score +2), we still reward this sub-part of the interaction with an overall score of +1. Finally, inappropriate responses of all kinds (emotional, content or other) are bad (score -1), but no response at all is worse (NRN: -2).

Note that these values are set by hand. When working with such a reward-oriented approach to dialogue modelling in a Companion scenario, the measures may be weighted in alternative ways, requiring benchmarking. However, this evaluation methodology can be used to grade complete and partial dialogues: the total score (or indeed individual annotation scores) is not necessarily the most useful measure at all stages of development of a dialogue system. Instead, comparative scores and tag distributions across dialogues can be better measures, as will be examined further below.
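To make the scoring concrete, the following is a minimal sketch of how a dialogue's total appropriateness score could be computed from such hand-set weights. The label abbreviations follow those used in this paper; the exact tag set and the weights shown are illustrative only.

```python
# Minimal sketch of the appropriateness scoring described above.
# The tag set and weights are illustrative; in practice they are set
# by hand and may be re-weighted for different benchmarking purposes.

LABEL_SCORES = {
    "FP": 0.0,     # filled pause: human-like, but adds little
    "AP": 2.0,     # appropriate response
    "AQ": 2.0,     # appropriate question
    "COM": 0.5,    # extended contribution
    "INI": 3.0,    # new initiative / pushing the interaction back on track
    "RR": -0.5,    # repair or clarification request
    "NAPC": -1.0,  # inappropriate response (content)
    "NAPE": -1.0,  # inappropriate response (emotion)
    "NRN": -2.0,   # no response where one was warranted
}

def dialogue_score(labels: list[str]) -> float:
    """Total appropriateness score for one annotated dialogue."""
    return sum(LABEL_SCORES.get(label, 0.0) for label in labels)

# Two repair moves (combined -1) followed by an appropriate response (+2)
# still yield a positive sub-score, as in the example discussed above.
print(dialogue_score(["RR", "RR", "AP"]))  # 1.0
```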

EVALUATION METHODOLOGY
Using the paradigm outlined in Section 2, the "How Was Your Day" Companion was exposed to a number of participants, to test functionality aspects of the complete system. In all, twelve users had a total of 84 separate, fully logged and recorded formal interactions with the Companion in the Interactive Collaborative Environment at Edinburgh Napier University. Participants sat at a desk and faced a 42" LCD screen displaying the prototype interface. Audio-visual recordings were made of each session and affective data in the form of galvanic skin response was recorded. Figure 2 gives a graphical overview of the evaluation layout.

Participants and Data
The participants were recruited from staff and students at Edinburgh Napier University. Four had some prior familiarity with the Companions project; the remaining eight were completely new to it, although some had prior experience with affective or interactive computer systems. Three of the participants were female and nine male; their ages ranged from 22 to 54 with an average of 33. All were native speakers of British English. Users were rewarded for their participation. For each session, the following data was collected:
• HD video of the participant (front and side on)
• screen capture
• audio of the prototype system
• the Q™ sensor file for galvanic skin response (GSR) output (an Affectiva Q Sensor™ from MIT Media Lab measured skin conductance, a form of GSR that grows higher during excitement, attention or anxiety, and lower during boredom or relaxation)
• an XML log file detailing all module outputs
• video of the post-session participant interview
• the questionnaire response for each participant
All generated evaluation data (audio, video, affective) is available online for interested researchers.

Participant Session Protocol
The following is a description of the session protocol used with each participant of the Companions prototype when executing the HWYD dialogue session. Each session took approximately 2.5-3 hours to complete.
1. Introduction: The participant was greeted by an evaluator and asked to watch a short video introducing the research, the prototype, the data collection equipment and the scenarios they were to undertake, including EmoVoice and ASR training. After the introduction, the participant was asked to sign a video waiver and an experiment participant agreement (in line with IRB/ethical treatment of human subjects).
2. EmoVoice Session: The participant read a short overview of EmoVoice's functionality and was shown a video of someone training on the system, illustrating that the more emotive the user was, the more accurate the emotional condition allocation of EmoVoice would be.
The participant then undertook a training session consisting of reading aloud 42 statements for each emotional condition (as detailed in Section 3.3).

ASR Training
Next the participant went through a Dragon NaturallySpeaking new-user training session, the results of which provided the ASR model for the prototype.

Prototype Session
Once training was complete, the participant was reminded of the scenarios they would be undertaking with the prototype, and to emote as best they could when speaking with the Companion, using the emotional condition indicated in the scripts for each session. The participants were then asked whether they had any questions, after which the session commenced. All recording equipment was activated and the prototype was loaded. Between each of the scenarios the output logs were copied to an external server and the prototype rebooted.

Post Session Questionnaire and Interview
After all scenarios were completed, the participant filled out a Likert-scale online questionnaire, and was then interviewed for 5-10 minutes on their likes and dislikes of the prototype, the concept, and anything else that came to mind regarding their experience. Participants were then given a reward voucher and thanked. All data was copied to an external drive and collated into a redundant storage array.

EmoVoice Sessions
As shown in Figure 1, two different modules in the HWYD Companion aim to elicit the emotional content of user utterances: the EmoVoice module [5] analyses the speech input to determine whether it expresses a positive or negative sentiment and an active or passive form, information which the Sentiment Analysis module [6] in parallel tries to elicit from the linguistic data. This information is fused together by the Emotion Modelling component into a representation of the user's current emotional state, in the form of one of five possible values (Negative Active or Passive, Neutral, Positive Active or Passive).
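Purely as an illustration of the shape of this fusion step (the published Emotion Modelling strategy is not reproduced here), the sketch below maps the two classifier outputs onto one of the five emotional states; the tie-breaking rule and confidence handling are hypothetical.

```python
# Hypothetical fusion of acoustic (EmoVoice) and linguistic (Sentiment
# Analysis) estimates into one of the five emotional states. This is a
# sketch of the idea only, not the actual Emotion Modelling component.

FIVE_STATES = ["NegativeActive", "NegativePassive", "Neutral",
               "PositiveActive", "PositivePassive"]

def fuse(acoustic_label: str, acoustic_conf: float,
         linguistic_label: str, linguistic_conf: float) -> str:
    """Prefer agreement; otherwise back off to the more confident source."""
    if acoustic_label == linguistic_label:
        return acoustic_label
    if abs(acoustic_conf - linguistic_conf) < 0.1:
        return "Neutral"  # hypothetical rule for near-ties
    return acoustic_label if acoustic_conf > linguistic_conf else linguistic_label

print(fuse("NegativeActive", 0.7, "NegativePassive", 0.4))  # NegativeActive
```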
During the evaluation period each participant undertook an independent EmoVoice training and testing session, in order to examine the accuracy of the emotional condition allocation of the EmoVoice system for the users of the prototype system. The participants were given an introduction to the functionality and an overview of how the session would be undertaken.
During each EmoVoice session the participant was asked to read aloud a series of 42 emotionally appropriate statements in each of the five emotional conditions:
• Negative Active: "I really hate how he treats me"
• Negative Passive: "It's got to the stage where I don't care any more"
• Neutral: "Angela Merkel is German Chancellor"
• Positive Active: "I just love to sing and dance"
• Positive Passive: "Today has been a good day"
The 210 statements were provided by the EmoVoice developers and are the standard stimulus for EmoVoice training. The participants were asked to "act out" each statement as best they could in the appropriate emotional way, that is, to sound angry if appropriate to the statement, or sad, joyful, neutral, and so on. They were shown a video example of a user undertaking a session to illustrate this. The participants undertook the session in a different room from the Companion evaluation, in order to give them some privacy when reading aloud and so to best enable optimum conditions for emotional allocation by EmoVoice.

SCENARIO DESIGN AND SCRIPTS
Each participant evaluation session consisted of a set of user scenarios, based around templates provided by the system developers, outlining the areas the Companion was capable of discussing. We designed a set of scenarios to best evaluate the performance of the prototype under certain experimental conditions.

Pilot Study
We conducted an initial pilot phase, where members of the evaluation team exclusively interacted with the Companion, assessing what appeared to be anecdotal strengths and weaknesses. During this initial phase, the evaluation team developed a total of around twenty scenario combinations that best represented the breadth of interaction experience offered by the HWYD scenario. It was decided that this represented too large a set for comprehensive testing, and so these were then scaled down to a design of ten basic scenarios (14 with positive/negative variations). Each scenario session involved a variety of conditions. A subsequent round of pilot tests of the scenarios led to further refinements, including a series of notes that needed to be considered before using the scenarios:
• A user should add information to answer the ECA's questions more appropriately, such as a project name, a project leader, and the people they are working with.
• If and when the ECA takes over the conversation, there is a need to let it lead.
• Longer user utterances seem more successful.
• Negative events give the ECA more leverage for tirades, whereas overly positive user dialogues offer the Companion little to converse about.

Scenarios
With these considerations in mind, six complete scenarios were extracted and the evaluation team refined the scripts to be used for user testing. The scripts were designed to guide the domain of conversation whilst incorporating enough flexibility for the user to apply their own language choices, and to ensure the dialogues were varied. Explicit emotional indicators were provided in each script to ensure the participants were clear on the prescribed emotional state that was intended to guide their language choices and how they would emote, although the choice of, for example, lexical items was left to the user.

In addition to the six scenarios using the prototype user interface as provided, it was agreed that an additional interaction session would be undertaken with each participant, showing only the avatar and excluding any other UI elements such as the dialogues in text form. Each scenario contains the following:
1) A set of features:

• emotional state (constant or varied)
2) Rationale for using the features (for evaluators).
3) A script guiding the user during the conversation.
In most of the scenarios, we were explicit about events, their polarity (how the user should talk about them, in terms of emotional content), and duration (that is, a scenario, and by extension the interaction, was considered complete once the script ends). There are two scenarios which are more open-ended and do not have this duration constraint.

A summary of the scenarios in terms of the feature sets can be seen in Table 3. (In Scenario 5, all the feature settings were allowed to be user defined.) The rest of this section gives a full breakdown of each of the seven scenarios in turn.

Scenario 1a, Negative events: This is the baseline condition for the HWYD Companion. We found that the system performed best when presented with 'negative' events (events of a negative nature as they affect the user). We chose to present only a few events, and to make the overall utterances shorter (in this context, shorter means only one or two events presented to the system at a time). We kept the emotional state of the user constant over the interaction. This structure of scenario consistently gave the best performance in pilot studies. The following script was used:

NEG Greet Companion
NEG Had a bad day
NEG My promotion was rejected
NEG Gave a bad presentation
NEG Missed an important deadline
NEG Meeting with Nigel & Paul was a disaster
NEG Boss is very unhappy with my performance

An example dialogue between the user (U; here named David) and the Companion system (S; here called Matilda) generated from this scenario could be:

U: Morning Matilda.
S: Good morning David, how was your day?
U: Pretty awful Matilda, I've had a terrible day.
S: Please tell me.
U: Well. My promotion was rejected today.
U: It all happened after I gave a terrible presentation first thing this morning . . .

Scenario 1b, Positive events: In pilot studies, we found that overall negative events gave the Companion greater leverage. However, we wanted a direct contrast. To that end, we created a minor variant of Scenario 1a, where all the events were positive. This is the only change from the previous scenario, so it presents us with a clear and direct comparison. Script:

POS Greet Companion
POS You've had a good day
POS You've been offered a promotion
POS Gave a good presentation
POS Made an important deliverable deadline
POS Had a great meeting with Nigel & Paul
POS Boss is happy with your work

Scenario 2, Long utterances: This scenario was designed to explore whether system performance changes with long utterances, and whether it is more or less natural to use long or short utterances. It was also intended to see the impact on the dialogue of two or three events per utterance versus a single event. In this scenario, the significant change from Scenario 1a is that users are encouraged to offer more information (more concepts) to the system in a single user turn. As a consequence, we had to increase the overall number of events. We expected the outcome from this condition to be overall longer dialogues, but an interesting contrast in how the system understands the user (through a potential concept error rate increase, for example). Script:

NEG Greet Companion
NEG Had a bad day
NEG The traffic was really bad this morning
NEG My computer crashed as I was preparing the presentation today
NEG Missed an important deadline
NEG Gave a bad presentation
NEG Meeting with Nigel & Paul was a disaster
NEG Boss is very unhappy with my performance
NEG and so my promotion was rejected
NEG I lost my special parking space
NEG I will miss out on my Christmas holidays
NEG Jane is always harassing me

Scenario 3, Mixed emotional states: To this point, the scenarios used fixed emotional states. Scenario 3 was developed with the specific intention of exploring how the system copes with a switched emotional state during a conversation, that is, the display of empathy. Negative to positive gave better performance during pilot sessions than positive to negative, so this was the condition we chose to use in this scenario. This condition is a test of the performance and integration of the EmoVoice component, in conjunction with the overall dialogue strategy. To produce the clearest results (as indicated by pilot studies), this scenario reverted to using short utterances from the user. Script:

NEG Greet Companion
NEG Had a bad day
NEG The traffic was really bad this morning
NEG My computer crashed as I was preparing the presentation today
NEG Gave a bad presentation
NEG Missed an important deadline
NEG I must work over the Christmas holidays
POS Meeting with Nigel & Paul went very well
POS My promotion was accepted
POS Boss is very happy with my performance
POS I will have extra holidays this year
POS Jane always says how good my work is
POS I was given a special parking space

Scenario 4, Free-form conversation: Scenarios 1a-3 are extremely controlled. The next two release those controls as an investigation of user behaviour when presented with the system. Of course, neither of these scenarios is representative of completely free-form behaviour, as each participant will have executed the previous scenarios prior to these, and so is expected to have some primed behaviour with respect to the Companion. In Scenario 4, we explicitly prime the Companion with some information, using a correlate of Scenario 1a, before encouraging the user to engage it in free-form conversation for as long as they wished. Script:

NEG Had a bad day
NEG My promotion was rejected
NEG Gave a bad presentation
NEG Missed an important deadline
NEG Meeting with Nigel & Paul was a disaster
NEG Boss is very unhappy with my performance
BEGIN FREEFORM on any topic the user desires
Scenario 5, User-defined: In order to determine how the system copes with entirely user-defined discussion, we allowed users to talk about 'their' day as much as possible, and set no end point in the interaction. Again, as with Scenario 4, we understand the nature of implicit priming, and prior user interactions with the system act as a mechanism for users to understand, at least in part, system functionality.

Scenario 6, Avatar only: As seen in Figure 1, the HWYD system displays a wealth of information, including the avatar, visual feedback of what the speech recogniser had output, and textual output about to be rendered by the TTS. During pilot sessions there were mixed feelings about this interface, and Scenario 6 therefore excluded all UI elements except the avatar, as described above.

RESULTS AND ANALYSIS

Twelve participants followed the protocol of Section 3 and the set-up of Section 3.1 was used to collect three types of data: objective dialogue metrics, emotional speech data from EmoVoice, and appropriateness measurements. These data sets are described in turn below and the results of the data collection analysed.

Objective dialogue metrics form an important part of any speech system evaluation, and are standardised to some extent. We collected a set of metrics (as in Table 1) covering the extent of the scenario dialogues captured during each user session:

• number of turns (user and system),

Table 4 shows average dialogue metric scores for all participant sessions and each scenario's average. As expected, the shortest interactions are in Scenario 1a, which uses short utterances. Scenario 1b is a very close correlate, and similar in character. Short interactions are also seen in Scenario 2, where longer utterances are used (so taking fewer turns to complete the scenario in total), consequently giving a lower overall utterance count despite containing more events. Scenario 3 contains mixed emotional content and prompted longer overall interactions, in part due to the length of the scenario. Scenario 4 is similar initially to Scenario 1a, then allows for a portion of free user input, so is marginally longer than 1a; hence the number of utterances is above average. Interestingly, when users are allowed complete freedom in interaction, as in Scenario 5, the total number of utterances drops below average. Finally, Scenario 6 is a replica of Scenario 1a, but with reduced visual feedback to the user.

Error Rates
As shown in Figure 5, the word error rate was 37% on average and the concept error rate 33%. These represent very poor scores for speech recognition, and hence present a hard task for any voice interaction system. It is difficult to hypothesise why the ASR scores are so low; the recogniser used was a trainable system, tuned to each participant. However, the lowest CER, at 26%, came from the free-form scenario in which the participant was free to discuss any topic they liked, which in our estimation demonstrates a level of robustness of the system when dealing with concepts outside its core topics.
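For reference, the word error rate reported here follows the standard edit-distance formulation (deletions, insertions and substitutions over the number of words uttered by the user). A minimal implementation is sketched below; the example utterance pair is invented.

```python
# Standard WER via dynamic-programming (Levenshtein) alignment:
# (deletions + insertions + substitutions) / number of reference words.

def word_error_rate(reference: str, hypothesis: str) -> float:
    ref, hyp = reference.split(), hypothesis.split()
    # d[i][j] = minimum edits to turn ref[:i] into hyp[:j]
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            substitution = d[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])
            deletion = d[i - 1][j] + 1
            insertion = d[i][j - 1] + 1
            d[i][j] = min(substitution, deletion, insertion)
    return d[len(ref)][len(hyp)] / max(len(ref), 1)

print(word_error_rate("my promotion was rejected today",
                      "my promotion was respected to day"))  # 0.6
```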

Response Time
In order to establish the average time it took for the system to respond to a user utterance, the audio waveform from each session was analysed and the time from the end of user utterance to commencement of the audio output from the system was measured.
Typically the user interface would output the text response before the audio output began (on the order of 0.3-1.0 seconds earlier). However, for the purpose of this analysis, response time reflects the audio input-output of the system. The average time from the end of a user utterance to the response was 4.18 s (Figure 6). During the annotation of the waveforms, the evaluators noted whether the audio output came from the short loop or the long loop. When the short loop was activated, the response was at times as low as 1.20 s, with an average of 2.28 s. With long-loop responses and more complicated tirades (ignoring short-loop responses), the average response time was 6.47 s.
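A minimal sketch of this measurement, assuming the annotated timings are available as (end of user utterance, start of system audio, loop) triples; the timing values below are invented examples, not the evaluation data.

```python
# Compute overall and per-loop average response latency from
# waveform annotations. The triples below are invented examples.

from statistics import mean

# (end_of_user_utterance_s, start_of_system_audio_s, loop)
annotations = [
    (12.4, 13.7, "short"),
    (30.1, 36.9, "long"),
    (55.0, 57.2, "short"),
    (80.3, 86.5, "long"),
]

latencies = [(start - end, loop) for end, start, loop in annotations]
overall = mean(lat for lat, _ in latencies)
by_loop = {loop: mean(lat for lat, l in latencies if l == loop)
           for loop in ("short", "long")}

print(f"overall: {overall:.2f} s, by loop: {by_loop}")
```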

Emotional Response Analysis
EmoVoice automatically segmented each statement and the next statement was automatically presented to the user. EmoVoice then allocated one of the five emotional conditions to each audio segment. The session took approximately 45 minutes to complete. After each session the evaluators copied the resulting output from EmoVoice into a spreadsheet, allowing the assessment of the percentage of correct identifications in each emotional condition, the breakdown of emotion allocation in each condition, and a total correct identification average. The scores for eleven participants can be seen in Table 5 (one participant's data was corrupted and lost). As indicated by the last number of the table and the 'Total Average' bar in Figure 7, EmoVoice on average correctly classified 47.67% of the statements. It was significantly more successful when identifying Negative Active (58.92%) and Neutral (54.98%) statements than Negative Passive (45.45%), Positive Active (42.64%) or Positive Passive (36.36%). One possible user influence in this result is that participants typically reported finding it easier to "act" angry or neutral than the other emotional conditions, the passive variants being the hardest. This indicates why we found it expedient to skew evaluation scenarios towards negative events. Figure 8 illustrates the emotional condition allocation across all statements by all users. The EmoVoice results for the participants had a small skew towards Negative Active, with 23.8% of all statements allocated as Negative Active versus the actual 20%, and a skew away from Negative Passive (15.4% versus 20%).
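The per-condition accuracies and allocation breakdowns reported in Table 5 and Figures 7-9 can be derived from (intended, recognised) label pairs; the sketch below shows the computation on invented example pairs, not the evaluation data.

```python
# Per-condition accuracy and allocation breakdown from
# (intended, recognised) pairs. The pairs are invented examples.

from collections import Counter, defaultdict

pairs = [
    ("NegativeActive", "NegativeActive"),
    ("NegativeActive", "Neutral"),
    ("PositivePassive", "Neutral"),
    ("Neutral", "Neutral"),
]

per_condition = defaultdict(Counter)
for intended, recognised in pairs:
    per_condition[intended][recognised] += 1

for condition, counts in per_condition.items():
    total = sum(counts.values())
    accuracy = counts[condition] / total
    print(f"{condition}: accuracy {accuracy:.0%}, allocations {dict(counts)}")
```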

In order to identify where EmoVoice allocates incorrect emotional assessments, a similar analysis can be undertaken within a specific emotional condition (for example the Negative Active, Negative Passive and Positive Active conditions), as in Figure 9, rather than across all statements.

Appropriateness Annotation
Judging appropriateness is not straightforward. Among other things, appropriateness depends on the context of the ongoing dialogue (for example, when discussing persons in a picture, it seems inappropriate to ask when they were born). Further, there may be other factors to consider, such as the appropriate use of politeness, humour or error-correction strategies, which are outside the present evaluation.
To conduct the evaluation, annotators scored the level of appropriateness for every utterance, given the level of information it contained, and the progression of the dialogue so far.We want to reward appropriate behaviour (answering questions, using new knowledge correctly) and penalize mechanisms seen as inappropriate between humans: incorrect use of knowledge; asking unrelated or off-topic questions; over-verification; strong, one-sided initiative; and limited choices.
When working with the output of an automatic speech recogniser (ASR), it is necessary to account for the fact that there is often a large discrepancy between what a user actually says and what the system recognises. The annotations are based on what is recognised only, so that if there were recognition errors, the hope is that either the user spots them in subsequent conversation and can work with the system to correct them, or that the errors are minor in relation to the dialogue flow and hence can essentially be ignored. The system can only function with the content that has been recognised, rather than working on the assumption of completely correct and error-free ASR.
Annotators use a system that splits the system and user utterances and codes each with one of several annotations, as described in Section 2.3. Three annotators worked on the output of the evaluation sessions. 10% of the dialogues were annotated by all three annotators; pair-wise comparison between annotators on these dialogues shows agreement rates in excess of 90%.
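The agreement figure corresponds to the fraction of utterances assigned the same label by a pair of annotators on the doubly-annotated dialogues; a minimal sketch, with invented example labels:

```python
# Raw pairwise agreement between two annotators over the same utterances.
# The label sequences are invented examples.

def pairwise_agreement(labels_a: list[str], labels_b: list[str]) -> float:
    assert len(labels_a) == len(labels_b)
    matches = sum(a == b for a, b in zip(labels_a, labels_b))
    return matches / len(labels_a)

print(pairwise_agreement(["AQ", "RTS", "AP", "NRN"],
                         ["AQ", "RTS", "AP", "RR"]))  # 0.75
```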
To start the analysis, Figure 10 presents an overview of the distribution of labels across the entire evaluation. A quick breakdown shows that the majority of utterances in the evaluation sessions (almost 30% overall) are responses by the user to system utterances (RTS).
Unsurprisingly, the second largest category is appropriate questions asked by the system (AQ). As noted, the average total score is directly related to the length of a dialogue; Figure 11a shows that the average score per scenario is also related to dialogue length. The chosen benchmark, Scenario 1a, scores exactly on the overall system average. Most scenarios are at or above the average. Scenario 3 is significantly higher (but has significantly more total utterances) and Scenario 2 is significantly lower (for the inverse reason). What is interesting are the particularly low scores in Scenario 5, the free-form scenario. Normalising the appropriateness scores for length of dialogue and showing scores per utterance across scenarios gives the results of Figure 11b. Here the baseline condition, 1a, outperforms the average, being a very clean and concise interaction. Scenario 1b, by comparison, underperforms the average, despite the only difference being the polarity of events. Most noticeably, scenarios involving any deviation from the script (Scenario 4 with slight deviation, and Scenario 5 with no script) score lower than average. It is most useful to examine these scenarios in terms of annotation label distributions, and compare them to the average scores across the entire evaluation; Figures 11c through 11i give the distribution of labels for each scenario. Looking at inappropriate emotional responses (NAPE), the number in Scenario 1b is above average: the system struggled significantly more to recognise positive emotional events (represented in this scenario) than negative events (Scenario 1a).
The Scenario 2 (Figure 11e) label distribution differs significantly from the previous two. The number of responses to the system (RTS) is well below the average, as participants use longer utterances. As a consequence of receiving more information in the utterances, the system asks fewer questions (AQ is below average) and the user gives longer, more involved responses to single questions (RES is high). A trade-off is that emotional response is harder, resulting in a greater than average number of inappropriate emotional responses: perhaps it is harder to detect the overall emotional value than in shorter, clearer utterances.
Figure 11f shows the label distribution for Scenario 3, which involved mixed emotional content. Interestingly, it shows average scores across the scenario for label distribution, where we might have expected a greater number of inappropriate emotional outputs. Given the overall lack of accuracy of the EmoVoice component across our evaluation, we feel that any potential error revealed by this scenario is concealed beneath the general errors of the emotion classification system. Scenario 4 represents the first scenario where free-form user input is permissible, following a short script similar to Scenario 1a. Thus Figure 11g displays a similar distribution to that in Figure 11c: the system continues to ask some appropriate questions and the user responds. A slight increase in inappropriate content (NAPC, not recognising the information exchanged from user to system) is also observed. Scenario 5, where users have completely free access to the system, although guided by prior interactions, produced a change in the relative distribution of three labels. Encouragingly, there is no significant increase in inappropriate responses. However, as Figure 11h shows, there is an increase in utterances from the user that appear to warrant some response from the system, yet return nothing (NRN, where the system is silent in response to some question or emotional comment from the user). We also see a corresponding drop in appropriate responses, and fewer appropriate questions, all of which cause a drop in overall score. As the users deviate from the scripts (and the underlying template structure of the domain), the system has less to discuss that is within the topic of the conversation. Consequently, it appears the system chooses to stay silent. Using the simple conversational mechanisms found in chat-bots may help to address these issues.
Finally, Scenario 6 with an avatar-only user interface (Figure 11i) shows little deviation from Scenario 1a with avatar plus visual feedback (Figure 11c). This scenario was designed to test the user interface, and shows that the users and system performed more or less equally whether or not the user had access to visual feedback from the system.

DISCUSSION AND CONCLUSIONS

The development of Companion technologies requires new models of evaluation. In this paper, we have concentrated on assessing the HWYD Companion's functionality and overall system behaviour, with respect to three parameters: functional ability (does it do the 'right' thing), content (does it respond appropriately to the semantic context), and emotional behaviour (given the emotional input from the user, does it respond in an emotionally appropriate way).

We have shown how overall system performance, graded on these parameters, is a composite of the lower-level system functionality.

Equally importantly, we demonstrate the functionality of our evaluation paradigm as a method both for grading current system performance and for targeting areas for particular performance review. We show correlation between, e.g., ASR performance and overall system performance (as is expected in systems of this type), but also where individual utterances or responses, indicated as positive or negative, show an immediate response from the user, and demonstrate how our combined evaluation approach highlights issues (positive and negative) in the HWYD Companion. The evaluation shows that the system performs well, and has an interesting profile when comparing the distribution of appropriateness labels. It is also clear that this represents just a first step towards Companionable dialogue systems. However, the paradigm as deployed gives clear indicators of areas to improve upon.
We did not seek to perform a component analysis, although some components require particular attention. In particular, the overall high ASR word error rate hampers many efforts to create Companionable dialogue. Given this, the system performed reasonably well, although it has no particular strategies for managing speech recognition errors; incorporating these would improve overall scores and feedback. The EmoVoice component may have an effect here: by training to speak for this component, users are effectively shifted away from talking in a natural fashion, which directly (and negatively) impacts speech recognition performance. In any case, EmoVoice performance is not ideal, so it is surprising that the system does not output a higher number of inappropriate emotional statements on the basis of this module; possibly this is because it works in conjunction with a text-based sentiment analysis module, which perhaps mitigates the errors. However, the performance of EmoVoice and the low inappropriate-emotion scores correlate with the WER and CER figures; that is, one has an impact, although not a linear one, on the other.
An interesting point to note is that in the participant interviews after all sessions, the length of delay in response was considered far less of an issue than the timing of the response. Participants wanted feedback regarding the state of the Companion during the response delay, specifically whether the Companion was indeed going to deliver a response or not (there are several utterances per dialogue that receive no reply). They reported that the length of the delay was less impactful than not knowing if and when a response was coming, and the largest frustration was when they started talking again but the Companion then proceeded to talk over them.
The scenarios were chosen to test specific conditions of the HWYD Companion and were able to show some performance issues. For example, there was an implicit belief that the system would perform better with long user utterances, but this was shown not to be the case. As with most spoken language systems, shorter (although significantly longer than in most task-based systems), focused utterances proved most successful.
The appropriateness annotation provides several interesting features when analysing dialogues. First, specific annotation gives developers key insights into areas of system performance that can be addressed at both micro and macro levels. At the micro level, a list of utterances (with surrounding context) judged to be inappropriate on some level can be output from the system, providing direction for system improvements. At the macro level, the graphs of the distribution of labels indicate conversation trajectories that can be useful characterisations of both scenarios and systems.
For example, if we want the users to talk more, we need data corresponding to Figure 11e (Scenario 2), where users produce longer utterances. Conversely, if our profile looks more like Figure 11c, we have a more traditional short-utterance, interactive dialogue system. Different dialogue strategies may be planned around different dialogue trajectories as indicated by these graphs. Used at the data collection stage, such graphs might present interesting ways to determine optimal system performance, based on user expectation.
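A sketch of such a macro-level comparison, normalising per-scenario label counts into distribution profiles that can be compared against each other or against the overall average; the counts below are invented.

```python
# Turn raw label counts into normalised distribution profiles so that
# scenario 'trajectories' can be compared. Counts are invented examples.

from collections import Counter

def profile(counts: Counter) -> dict[str, float]:
    total = sum(counts.values())
    return {label: round(n / total, 3) for label, n in counts.items()}

scenario_1a = Counter({"RTS": 35, "AQ": 30, "RES": 20, "NAPE": 3})
scenario_2 = Counter({"RTS": 20, "AQ": 15, "RES": 40, "NAPE": 8})

print("Scenario 1a profile:", profile(scenario_1a))
print("Scenario 2 profile:", profile(scenario_2))
```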
If we take the goals of the evaluation paradigm, to develop metrics that can score conversational dialogue systems, the HWYD Companion is successful at achieving some of these 'goals':
Natural Dialogue: the user interacts with the artificial agent in a natural way. That is, there are no significant delays in the interaction, the agent uses knowledge in an appropriate way, asks appropriate questions, does not rely on overly strong confirmation strategies, etc. The interactions with the HWYD Companion within domain are mostly appropriate. Out of domain presents a more significant problem, as for most dialogue systems. There are no significant interaction delays, although users indicate that delays are not as important as clarity in signalling turn taking, and the paradigm may be modified on this basis.

Initiative: there is a balance between the initiative of the system and the initiative of the user. Either can ask questions, change the topic of conversation, or hold the floor if required. Further analysis indicates that the use of appropriateness labels can shed more light on initiative, e.g., at which points in the dialogue is initiative largely given to the user? By plotting initiative over time, an even exchange of initiative as the dialogue progresses should be seen. Again, this may lead to refinements of the evaluation paradigm.

Confusion: the system runs dialogues in a way that does not increase the user's cognitive load. This is the hardest to measure in systems with limited error-correction routines incorporated into the dialogue scenario: simple measures of requests for repair cannot be used to give an indication of cognitive load.
Stickiness: the Companion is desirable to talk to, both within an individual interaction and over a significant period of time (weeks or months). It would be very interesting to evaluate user interaction with the HWYD Companion over a longer period of time.
User Satisfaction: the measure of how happy a user is with the interaction, both in the immediate term (at the time of an interaction) and in the long term. The user satisfaction survey results are mixed, and clearly there are component-level issues (e.g., speech recognition) which are significant contributors to performance, but it is clear that the sheer novelty of the scenario has a significant impact on user evaluation; users are not yet prepared to hold conversations with computer systems in this way, although it would be interesting to see how users adapt to this scenario over time.


Fig. 2: Overview of the data collection and participant location during each evaluation session






Fig. 3: Average utterance count per scenario (blue line = combined average across all scenarios)



Acknowledgements: This work was partially carried out within the EC/FP6 integrated project COMPANIONS (IST-34434), and while Dr. Webb was at the State University of New York at Albany, New York, USA. Thanks to the developers of the HWYD Companion and the developers of EmoVoice, as well as to Jay Bradley and the participants in the user studies at Napier University, Edinburgh.

TABLE 1: Objective metrics

Objective Speech and Dialogue Metrics: The 16 objective metrics are outlined in Table 1. Standard timing information needs to be collected from each interaction. Delay times between utterances, both system and user, should be captured, as well as overall dialogue length, in time and in number of utterances. Vocabulary sizes and utterance lengths (in words) are expected to be available both based on ASR results and on transcriptions. Word error rate (WER) is calculated using the standard formula WER = (deletions + insertions + substitutions) / (number of words uttered by the user); regular dynamic-programming string alignment is used to calculate the errors. Concept error rate (CER) is calculated by ignoring the order of recognised concepts.

Human labellers assign categories to both system and user utterances, with particular focus on system behaviour. Labels capture the appropriateness of an utterance in the context of the ongoing dialogue. For example, if the system asks a particular question, it may be judged to be appropriate, but if the system subsequently repeats the same question when the user has provided a valid answer, the same utterance could be judged to be inappropriate in that context.

TABLE 3: Overview of the scenario features

TABLE 4: Dialogue metrics averages over all scenarios

TABLE 5: Results from the EmoVoice sessions
