An Investigation of Environmental Influence on the Benefits of Adaptation Mechanisms in Evolutionary Swarm Robotics

A robotic swarm that is required to operate for long periods in a potentially unknown environment can use both evolution and individual learning methods in order to adapt. However, the role played by the environment in influencing the effectiveness of each type of learning is not well understood. In this paper, we address this question by analysing the performance of a swarm in a range of simulated, dynamic environments where a distributed evolutionary algorithm for evolving a controller is augmented with a number of different individual learning mechanisms. The learning mechanisms themselves are defined by parameters which can be either fixed or inherited. We conduct experiments in a range of dynamic environments whose characteristics are varied so as to present different opportunities for learning. Results enable us to map environmental characteristics to the most effective learning algorithm.


INTRODUCTION
Recent advances in technology are driving novel research in swarm robotics, envisioning future applications in which swarms might be sent to remote or hazardous environments and in which they will need to survive over long periods of time. As these environments will be unknown to the designer a priori and are potentially dynamic, the swarm must be able to continuously adapt its behaviour to ensure that it both maintains sufficient energy to survive and successfully performs tasks. The importance of being able to adapt over time has been a subject of research within Evolutionary Robotics for some time [20].

Adaptation often takes one or all of three forms: evolutionary, individual and social learning. In evolutionary adaptation, information encoded on the genome adapts through selection and reproductive operators over many generations. In individual learning, a robot can adapt its own behaviour during the course of its lifetime, for example by updating weight values in a neural network controller. Finally, in social learning, robots can exchange information during a lifetime. The relative benefits of mixing the different types of adaptation have been studied both in simulation [3, 6] and in hardware [4, 10-12]. Typically, experiments are conducted in a single environment related to a specific task, therefore the role of the environment in influencing the result is not made explicit. An exception is recent work from Haasdijk [5], who explicitly studied the effect of combining conflicting environmental and task requirements in a simulated system. This showed that high selective pressure exerted by a task can outweigh any selective pressure from the environment. However, an arbitrary environment was defined to conduct experiments in, leaving open the question of whether the same effects would be observed in a different environment.

The goal of this paper is to investigate the interplay between evolution, individual learning and environment characteristics. We consider a swarm which undergoes distributed evolution of a neural-network based controller and is augmented with an individual learning mechanism: this modifies the information gleaned from the environment and fed to the controller over the lifetime of a robot. Specifically, we consider a swarm operating in an environment which is unknown a priori and in which robots must learn the relative values of positive and negative energy tokens. Each environment contains n positive and n negative energy tokens. Positive tokens increase the robot's energy by a fixed number of units, while negative ones reduce it by a fixed amount. As the token count and token value vary, each environment presents different opportunities for learning, in that there are either a small number of high-value tokens or a large number of low-value tokens. In addition, tokens change their nature across 'seasons', i.e. tokens of a specific colour switch value from negative to positive on a cyclical basis. This forces the swarm to re-learn the effect of any given colour of token every season. Various settings for individual learning are investigated in which the learning mechanism is either fixed or has components that can be simultaneously evolved. The following questions are investigated:
• How do the parameters of the environment (token count, token value) influence the effectiveness of different individual learning settings?
• How does the rate of change of a given environment influence the effectiveness of individual learning mechanisms?
• How does the nature of the individual learning mechanism influence performance in different environments?
We augment a distributed evolutionary algorithm previously described in [9] with mechanisms for individual learning in order to conduct experiments. Note that the goal is not to propose a novel method of either individual learning or evolutionary adaptation, but to explore the relationship between the environment and the value of different types of adaptation.

RELATED WORK
A reasonable body of research exists in relation to combining learning and evolution, and the factors that influence this relationship [7, 13, 14]. The relationship between the two methods in a swarm setting, in which it is necessary to simultaneously learn behaviours that enable reproduction in addition to task performance, is less well studied however. Haasdijk et al. propose a framework for evolution, individual and social learning in collective systems, and consider the interaction of evolution and individual learning in which the latter is achieved by reinforcement learning [19]. Their experiments show that in a collective system, it is possible for learning to counteract evolution. A hiding effect can occur in which individual learning acts to mask the ill-adapted nature of non-optimal agents and is therefore counter-productive. Although a number of environments were investigated, which essentially modified the reward system, all environments were static, and the relationship of the learning framework to specific parameterisations of the environmental features was not examined.
A dynamically changing reward system was investigated in [1], which proposed mEDEA, a completely distributed evolutionary algorithm for open-ended evolution. Here, efficient adaptation in a changing environment was demonstrated using a set-up that switched between phases: in the free-ride phase there is no cost to movement, therefore a robot only needs to meet a single other robot to pass on its genome, while in the alternating phase the robot is required to harvest energy in order to move, thereby creating opportunities for passing on its genome. Haasdijk et al. [8] extended mEDEA to add explicit task-selection in the MONEE framework [15]. In [5] they examine in more detail the relative selection pressures induced by task performance and survival in different environments, finding that task performance is optimised even if it reduces the lifetime of robots (and therefore their ability to reproduce). Heinerman et al. investigate the relationship between evolution, individual and social learning in a real swarm [10-12]. Here, the evolutionary part focuses on evolving a suitable sensory layout, while the individual learning runs an evolution strategy to learn the network weights during the robot's lifetime. Learnt weight vectors are broadcast to other robots during the social learning phase. The main focus of this work was to investigate the impact of social learning. Individual learning is required to learn a controller and hence cannot be omitted.
In contrast to the above, we consider scenarios in which individual learning has the potential to improve evolved behaviours, but is not essential. We investigate the relative benefits of evolution and individual learning using a variety of learning mechanisms and in a range of environments with different features. The goal is to specifically relate the performance of evolution and individual learning to features of the environment.

OVERVIEW
A swarm operates in an open environment in which there are two types of coloured tokens: driving over one colour increases the robot's energy while the other decreases it. Robots should learn to avoid the negative token. However, a "seasonal" change is imposed whereby the value of the tokens is reversed, i.e. red becomes positive and blue negative, or vice versa. A robot must thus adapt any previously evolved behaviour. All robots in the swarm evolve a neural network that controls their behaviour through a distributed evolutionary algorithm [9]. In addition, they can exploit an individual learning mechanism which can potentially learn the current value of a given colour of token. This information modifies an input to the evolved neural network. We investigate a number of types of individual learning in which some components of the learning mechanism can be either heritable, fixed or absent.
Experiments are conducted using the Roborobo simulator [2]. The robots have 8 ray-sensors distributed around the body which detect proximity to the nearest object and its type. Each robot is controlled by an evolved Elman recurrent neural network (RNN). The network has 16 sensory inputs and 2 motor outputs (translational and rotational speeds). The 16 inputs comprise two pieces of information for each of the 8 ray-sensors: proximity, and whether or not the detected object is an energy token. Although the colour/type of the object is also detected by the robot, it is not fed into the RNN as an input, but is only used in the adaptation mechanism.
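As a minimal sketch (not part of the Roborobo API; all names are illustrative), the input vector described above could be assembled as follows:

    def build_rnn_inputs(ray_sensors):
        # Assemble the 16 RNN inputs from the 8 ray-sensors.
        # Each sensor contributes a normalised proximity reading and a
        # binary flag indicating whether the detected object is an energy
        # token; the token's colour/type is deliberately not included.
        inputs = []
        for sensor in ray_sensors:                      # 8 sensors
            inputs.append(sensor.proximity)             # in [0, 1]
            inputs.append(1.0 if sensor.detects_token else 0.0)
        return inputs                                   # length 16

The token flag is the input that the adaptation mechanism described later modifies.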

mEDEA
Using the inputs and outputs just described, an RNN with 1 hidden layer containing 16 nodes is evolved by a distributed evolutionary algorithm [9]. This algorithm is an extension of mEDEA [1], and incorporates a selection mechanism based on relative fitness. In brief, for a fixed period, robots move according to their control algorithm, broadcasting their genome, which is received and stored by any robot within range. At the end of this period, a robot uses fitness-proportionate selection to choose a genome from its list of collected genomes according to a relative fitness value, and applies a variation operator.
This takes the form of a Gaussian random mutation operator, inspired by Evolution Strategies. Pseudo-code is given in Algorithm 1.
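As a minimal sketch of this variation step, assuming the genome is a flat list of network weights and a single fixed mutation step size (the actual operator parameters used in [9] are not reproduced here):

    import random

    def apply_variation(genome, sigma=0.1):
        # Gaussian random mutation in the spirit of Evolution Strategies:
        # add zero-mean Gaussian noise to every weight of the chosen genome.
        # The step size sigma is an illustrative value only.
        return [w + random.gauss(0.0, sigma) for w in genome]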
Each robot estimates its fitness in terms of its ability to survive, based on the balance between energy lost and energy gained, denoted δE: this term is initialised to 0 at t = 0 (when the current genome was activated), is decreased according to an energy model described below that accounts for both movement and the cost of communicating for evolution, and is increased by E_token each time the robot crosses an energy token. Given δE, a robot calculates a fitness value which is relative to those robots in its neighbourhood of range r, according to Equation 1, where f_i is the relative fitness of robot i at time t, mean_sub_i is the mean δE of the robots within the subpopulation defined by all robots in range r of robot i, and sd_sub_i is the standard deviation of the δE of the subpopulation.
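Equation 1 itself is not reproduced in the extracted text; one natural reconstruction from these definitions, assuming a simple standardisation of a robot's energy balance against its local subpopulation, is:

    f_i(t) = (δE_i(t) − mean_sub_i) / sd_sub_i        (1, reconstructed)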
    genomeList.addIfUnique(receivedGenomes);
    if genomeList.size() > 0 then
        genome = select_roulette-wheel(genomeList);
        load(currentGenome = applyVariation(genome));
        genomeList.empty();
        lifetime = 0;
    end
end

Algorithm 1: Pseudo-code of the adapted version of the mEDEA algorithm with relative fitness (mEDEA_rf) as introduced in [9], used with roulette-wheel selection as the explicit selection mechanism.

There is a fixed cost to living of 0.5 units per timestep, regardless of whether the robot moves or not. A moving robot consumes an additional amount of energy that is related to its rotational speed rot, its translational speed trans, and their respective maximum values rot_max and trans_max. The amount of energy spent on communication, E_com, is calculated using Equation 3, where i and j are the numbers of genomes received and transmitted respectively. The values a_rx = 0.0305, a_tx = 0.01379 and a_tx-amp = 0.000614 were determined based on the method described in [18]; the reader is referred to that publication for a description of the approach.
Equation 4 shows the change in energy at each simulation step, where n is the number of tokens that have been collected in that step.
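Equation 4 is not reproduced in the extracted text; from the description of the energy model, the per-step update takes the form (reconstruction, with E_living = 0.5 the fixed cost of living, E_move the movement cost and E_com the communication cost):

    δE(t+1) = δE(t) + n · E_token − E_living − E_move − E_com        (4, reconstructed)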

Environment
In Evolutionary Robotics, it is often unclear exactly how the parameterisation of a given environment might influence the emergence of particular behaviours. Often, the focus of reported studies is on algorithm performance, without serious consideration of how the choice of environment may influence results. This is particularly important for an open-ended distributed algorithm such as mEDEA, in which the survival of robots is crucial for evolution to occur. To counter this, Steyven et al. [17] recently proposed a technique by which preliminary experimentation can be used to generate a surface plot highlighting regions of the parameter space in which the environment provides the right balance between facilitating survival and exerting sufficient pressure for new behaviours to emerge. This enables a researcher to select appropriate settings for experimentation. For example, for a given task, on the one hand there will be regions in which the characteristics of the environment are such that robots find survival trivial (e.g. food supplies are unlimited and easy to find), and hence there is little pressure to evolve specialised behaviours. On the other hand, environmental characteristics which are harsh enough to cause individual robots to die prematurely, and therefore prevent any effective evolution, are also identified.
Using the algorithm described above, we conducted experiments in an environment parameterised by two variables: the number of energy tokens available, and the value of an energy token. In each environment tested, there are n positive tokens whose value depends on the environment, and n negative tokens with value −400. The delta-energy δE, i.e. the difference between start and end energy, is recorded for multiple points in the parameter space, resulting in the plot shown in Figure 1. From this plot, we identify three points along the energy-neutral line, i.e. the region in which the robot expends as much energy as it acquires, at which to conduct experiments. This represents a region in which the selection pressure from the environment to survive is neither too small nor too large, and so does not mask the behaviours we are interested in investigating. The points identified are specified in Table 1.

INDIVIDUAL LEARNING
The neural network described above has a set of binary inputs (one for each sensor ray of the robot) that denote the presence (1) or absence (0) of a token, independent of its type. Therefore, in an environment in which there are multiple types of tokens, the only way for an individual to distinguish between them is to pick up a token and observe the change in energy. If the environment in which the robot operates is known a priori, then clearly the neural network could be designed to include relevant information about each token type. However, if the environment is unknown, then the robot must learn to adapt to the different types and values of tokens it may encounter. We use an adaptation mechanism which enables a robot to modify the value input to the RNN corresponding to a token sensor: instead of simply having a binary input, the robot uses a learned/evolved multiplier to adapt the token input to a continuous value between −1 and 1.
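A minimal sketch of how the multiplier modifies the token input before it reaches the RNN (names are illustrative; unseen types default here to the baseline's fixed multiplier of 1):

    def token_input(sensor, multipliers):
        # Replace the binary token flag with a learned, signed value in [-1, 1].
        # 'multipliers' maps a token type (e.g. its colour) to its current
        # multiplier; a type without an entry falls back to 1.0, which is
        # equivalent to the baseline behaviour.
        if not sensor.detects_token:
            return 0.0
        return multipliers.get(sensor.token_type, 1.0)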
Each time a previously unseen type of token is encountered (detected by a sensor ray or through consumption; the sensor rays are not evenly distributed around the robot body, which can lead to a robot driving over a token before any of its sensor rays have detected it), a new multiplier is added to the multiplier set. As tokens are usually detected before they are consumed, no information regarding a new token's value is known: the robot therefore randomly initialises a value to associate with the type (x) of the detected token. Following consumption, the resulting change of energy is detected by the robot and its learning mechanism can modify the corresponding multiplier value (m_x).
All multiplier values are adjusted every time a token is consumed, according to Equation 5, where m_x is the current value of the multiplier for type x; C_x is the number of tokens of type x collected; C_total is the total number of tokens collected; V_x is the value of the token that has just been consumed and is therefore now known to the robot (being equivalent to the change in energy); V_min and V_max define the minimum and maximum values of all tokens encountered so far; LR is a learning rate that controls the magnitude of the change; and LS is either −1 or +1 and simply inverts the direction of change. The sign is required to adjust the learning mechanism to the internal value notation of the neural network and can be adapted via evolution. The learning mechanism is shown in Algorithm 2.

    if token_x is unknown then
        multipliers.add(token_x);
    end
    if token_x is consumed then
        tokenCounter_x.update(token_x);
        totalTokenCount.update();
        tokenValue_x.update(δE(t) − δE(t − 1));
        totalValueRange.update();
        for m_x in multipliers do
            m_x.update();  // Equation 5
        end
    end

Algorithm 2: Pseudo-code of the steps carried out to update all multipliers every time a token is encountered.

Three factors influence the learning mechanism: the initial value assigned to a token V_x, the learning rate LR and the associated sign LS. These factors can be randomly assigned, fixed to some specific value, or can themselves be subject to evolution. Allowing the learning sign to co-evolve enables the learning mechanism to self-adapt to the internal value convention of the neural network. Finally, enabling the robot to evolve an appropriate starting value for each type of token based on its experience may speed up learning in some circumstances. Even though token values change over seasons, inheriting a good starting value may be beneficial, likely depending on the rate of change of the environment. Table 2 defines four variants of the learning algorithm that we investigate in conjunction with the three environments described in section 3.2. Note that in no case is any Lamarckian evolution used, i.e. although the multiplier starting values are adapted over the course of a lifetime, they are never written back to the genome and are therefore not inherited.
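The exact form of Equation 5 is not reproduced above. The following sketch shows one plausible realisation of the update it describes: normalise the observed token value into [−1, 1] using the value range seen so far, then move the multiplier towards it at rate LR, weighted by the relative frequency of the token type and signed by LS. It illustrates the mechanism rather than reproducing the authors' exact formula.

    def update_multiplier(m_x, c_x, c_total, v_x, v_min, v_max, lr, ls):
        # Illustrative multiplier update (cf. Equation 5; exact form assumed).
        # m_x: current multiplier for type x; c_x / c_total: relative frequency
        # of type x; v_x: value of the token just consumed; v_min, v_max: value
        # range observed so far; lr: learning rate; ls: learning sign (+1 / -1).
        if v_max == v_min or c_total == 0:
            return m_x                          # no range information yet
        target = 2.0 * (v_x - v_min) / (v_max - v_min) - 1.0   # map to [-1, 1]
        weight = c_x / c_total                  # frequency-based weighting
        m_new = m_x + ls * lr * weight * (target - m_x)
        return max(-1.0, min(1.0, m_new))       # clamp to the valid range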

Experiments
An experiment is defined by a tuple <environment, seasonal change rate, algorithm>. Three environments (see section 3.2) and three different rates of seasonal change are investigated: 0 (no change, i.e. a static environment), every 5000 iterations, and every 15000 iterations. Note that the maximum lifetime of a genome before it is replaced is 2500 iterations, so every robot should go through at least one evolutionary generation during the shorter (5000 iterations) season and at least five during the longer (15000 iterations) season. In practice, as robots tend to die before their maximum lifetime, more evolutionary cycles are likely to occur.
Four algorithms are investigated, as detailed in Table 2. Note that in the baseline experiments, all tokens have a fixed multiplier of 1 and therefore the robots cannot distinguish between tokens of different types. Thus, in total 36 (= 3 × 3 × 4) experiments are conducted. In each experiment, we record the totalTokenRatio at the end of the season. This value is the ratio of the number of collected tokens with positive value divided by the total number of tokens collected within that season. A ratio of 0.5 indicates that equal numbers of positive and negative tokens were collected; below 0.5, more negative tokens were collected, and above 0.5, more positive tokens. Experimental and simulation parameters are given in Table 3. Parameters associated with the learning mechanism are given in Table 4. The values for LR_initial and LR_max were selected following limited empirical exploration. The positive value of an energy token is determined by the environment. In seasons when a token is negative, the value is fixed at −400, which is 80% of a robot's initial energy.
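Expressed as a formula, with N+ and N− denoting the numbers of positive-valued and negative-valued tokens collected within a season:

    totalTokenRatio = N+ / (N+ + N−)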
Following 30 runs of each experiment, statistical analysis was conducted based on the method in [16], using a significance level of 5%. The distributions of the two results being compared were checked using a Shapiro-Wilk test. If both followed a Gaussian distribution, then Levene's test for homogeneity of variances was performed. For equal variances the p-value was determined using an ANOVA test, otherwise using a Welch test. A Kruskal-Wallis rank sum test was performed to determine the p-value if one of the results followed a non-Gaussian distribution.
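A sketch of this pairwise testing procedure using SciPy (the choice of tests follows the description above; α = 0.05):

    from scipy import stats

    def compare(sample_a, sample_b, alpha=0.05):
        # Pairwise significance test following the procedure based on [16]:
        # Shapiro-Wilk checks normality of both samples; if both are Gaussian,
        # Levene's test decides between a standard ANOVA and a Welch test;
        # otherwise a Kruskal-Wallis rank sum test is used.
        normal_a = stats.shapiro(sample_a).pvalue > alpha
        normal_b = stats.shapiro(sample_b).pvalue > alpha
        if normal_a and normal_b:
            if stats.levene(sample_a, sample_b).pvalue > alpha:
                p = stats.f_oneway(sample_a, sample_b).pvalue
            else:
                p = stats.ttest_ind(sample_a, sample_b, equal_var=False).pvalue  # Welch
        else:
            p = stats.kruskal(sample_a, sample_b).pvalue
        return p, p < alpha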

RESULTS
This section provides summarised results; detailed experimental data is available as supplementary material. Table 5 shows the median totalTokenRatio for each of the three individual learning mechanisms (EVO, EVO+IL, IL) in each of the 3 environments and for each value of seasonal change. The values are compared to the result from the baseline experiment in each case, and statistical significance is indicated in the table. The EVO method (which evolves multiplier values but has no adaptation during a lifetime) outperforms the baseline method in all three static environments (season change = 0). Here, evolution is able to determine appropriate values for each multiplier type. However, in the dynamic environments, evolving the multiplier value is detrimental. In the first season, evolution can find appropriate multiplier values (particularly in a long season). However, as soon as the season changes, these become irrelevant; if these values have spread sufficiently through the population, it may take considerable time for evolution to reverse this change, while in the meantime the robot will continue to collect negative tokens. The IL method (fixed learning rate and random initialisation of values) never outperforms the baseline method in the static environment, and is worse than the baseline in the dynamic environments. The magnitude of the effect is highest in the seasonal change = 5000 setting for the balanced environment. It appears that the learning rate is not sufficient to adapt a randomly initialised multiplier to a suitable value, while the randomness can actually bias the robot towards collecting a particular type. On average, this is worse than the baseline case, in which the robot has an equal preference for both types.
In contrast, the EVO+IL method, which evolves LR, LS and the multiplier values and also adjusts the latter during a lifetime, shows a significant improvement with respect to the baseline method in all but the two dynamic scarce environments. In the scarce environments, the robots have little information available to inform learning, as there are few tokens. When the environment is changing rapidly this is particularly detrimental. In the other environments, there are more tokens to learn from. When this is coupled with the ability to both evolve useful multiplier values and adapt them at an appropriate rate, the swarm learns to adapt to the changing environments and improves its behaviour in the static environment.

Influence of environmental parameters
Next, we examine the first question posed in section 1 in more depth: under what environmental conditions is augmenting evolution with an individual learning mechanism beneficial? Table 6 provides a pairwise comparison of environments for the totalTokenRatio obtained at the end of each experiment. In this table and subsequent ones, the symbols =, <, > indicate whether the median values of totalTokenRatio are not significantly different, significantly smaller, or significantly larger respectively. p-values below the significance level of 0.05 are written in bold. Table 6 clearly indicates that for the methods that include an evolutionary component in the learning algorithm, in the static environment abundant > balanced > scarce. In contrast, when only a fixed individual learning mechanism is used, with no adaptation of the learning rate, the reverse appears to be true: the token ratio in the scarce environment is higher than in the abundant environment, with no significant difference between the balanced and abundant environments.
In the slowly changing environment (15k), the general trend is that abundant > balanced > scarce for all three mechanisms. In the rapidly changing environment, a mixed picture emerges: for the EVO+IL mechanism, it is clear that abundant > balanced > scarce. Table 7 illustrates how the rate of change of a given environment influences the interaction between environmental parameters and learning mechanisms. In 21/27 pairwise comparisons, statistically significant results are observed.

Influence of Environmental Change
In the scarce environments, there is a general pattern that, in terms of the rate of change, static > 5k > 15k for all mechanisms. In the balanced environments, the same general pattern is observed, with the exception that for the IL and EVO+IL mechanisms, no statistical differences are noted between the 5k and 15k environments. In the abundant environments, we also note the same general pattern, except that for IL the only significant result shows that 5k > 15k, while in contrast, for EVO, 5k < 15k. For the scarce environment, the general pattern is that EVO+IL outperforms the other two methods in 4/6 cases, with no statistical difference in the other two cases. In the balanced environment, EVO+IL also clearly dominates both EVO and IL, and EVO dominates IL in the static and 5k experiments. Finally, in the abundant environment, we again observe the supremacy of EVO+IL, while IL dominates EVO in both of the dynamic environments.

Analysis
The previous section showed that EVO+IL clearly outperforms IL and EVO in all parameterisations of the environment and for all rates of change. We examine its behaviour more closely by plotting the normalised difference between the number of positive tokens (p) and the number of negative tokens (n) collected per season over time (i.e. p − n). This is shown in Figure 2 for the (scarce, balanced, abundant) environments for the two cases in which the values of the tokens change dynamically with seasons. The solid lines on the graph represent this value combined over both seasons, while the dashed and dotted lines represent the value in season 0 and season 1 respectively. In all environments, the swarm adapts to each change in token value (i.e. an upward trend). The abundant environment proves most straightforward to learn in: having a large quantity of low-value information outweighs the situation in which a small quantity of high-value information is available. In contrast, in the baseline experiment, in which no information is available as to token value, the (p − n) metric continuously cycles. In this case, the best that evolution can do is learn a token-avoidance behaviour, as there is no means of distinguishing between tokens.

CONCLUSION
We have investigated the performance of a number of adaptation mechanisms that augment the evolution of a neural network controller. Adaptation mechanisms that included heritable and fixed components were analysed in three different environments, in which both the number of learning opportunities and the impact of each learning opportunity varied. We show that an adaptation mechanism in which all components evolve and are heritable (EVO+IL) copes well in static and dynamic environments, and is able to learn to distinguish between tokens of different value. In dynamic environments, the greatest effect is observed when the environment contains a large number of small learning opportunities. The fewer the learning opportunities, the less effective the mechanism becomes, despite the fact that the individual opportunities provide more energy and therefore more information to the learning mechanism. In contrast, the EVO and IL mechanisms both prove to be detrimental in a changing environment when compared to the baseline scenario. No clear pattern emerges, however, in terms of the magnitude of the effect with respect to the number of learning opportunities present. The IL method never outperforms the baseline experiments, whereas EVO is beneficial only in a static environment. In the latter case, performance is greatest in the environment with the most tokens, and decreases as the number of tokens decreases. The results clearly demonstrate the interaction between the learning mechanism and environmental parameters. This is of particular relevance for distributed algorithms such as mEDEA, in which environmental pressure influences reproductive abilities. The huge variety of behaviours displayed in different environments highlights how fundamental it is not to select parameters at random, but to perform a more thorough analysis: the emerging behaviour using a single set of algorithmic parameters varied from giving a massive advantage, to showing no difference, to even being counter-productive. Future work will extend the analysis to other mechanisms for adding individual learning and/or adaptation, as well as considering social learning, recently demonstrated by [11, 12] to be effective in some scenarios.