A Hyper-Heuristic Ensemble Method for Static Job-Shop Scheduling

We describe a new hyper-heuristic method NELLI-GP for solving job-shop scheduling problems (JSSP) that evolves an ensemble of heuristics. The ensemble adopts a divide-and-conquer approach in which each heuristic solves a unique subset of the instance set considered. NELLI-GP extends an existing ensemble method called NELLI by introducing a novel heuristic generator that evolves heuristics composed of linear sequences of dispatching rules: each rule is represented using a tree structure and is itself evolved. Following a training period, the ensemble is shown to outperform both existing dispatching rules and a standard genetic programming algorithm on a large set of new test instances. In addition, it obtains superior results on a set of 210 benchmark problems from the literature when compared to two state-of-the-art hyper-heuristic approaches. Further analysis of the relationship between heuristics in the evolved ensemble and the instances each solves provides new insights into features that might describe similar instances.


Introduction
The Job Shop Scheduling Problem (JSSP) is one of the most researched combinatorial problems studied by practitioners over recent decades, due to its relevance in many industrial applications.The simplest form is known as the static JSSP: a number of operations that need to be scheduled across multiple machines, with all operations available at the start.Dispatching rules that derive a priority index for each operation to be scheduled provide a quick and simple method for creating a schedule (Baker, 1984), and hence for practical reasons are commonly used in real-world applications.Many dispatching rules are described in the literature, designed by hand by human experts, although according to Geiger et al. (2006), often through considerable effort.
In an effort to automate this, hyper-heuristic methods have been used to create novel, reusable dispatching rules.A comprehensive recent survey is provided by Branke et al. (2015) who summarize the state of the art, noting that the most common approach has been to use variable-length grammar-based methods (e.g., genetic programming) to generate new dispatching rules.While this has achieved much success, they highlight a number of open questions and challenges, the first of which is the need to deal with complex scenarios by evolving ensembles of heuristics, rather than a single rule.
Currently, a few examples of small ensembles that evolve pairs of heuristics exist (e.g., Miyashita, 2000;Nguyen et al., 2014a;Park et al., 2013).Most recently, Park et al. (2015) evolve a larger ensemble of rules in which each heuristic in the ensemble votes to determine the priority of an operation at each iteration, using a similar methodology to ensemble approaches to classification typically seen in machine learning (Valentini and Masulli, 2002).From an optimisation perspective, Smith-Miles et al. (2014) point out that any given algorithm will have individual weaknesses on a given set of instances.Ensemble methods in which a mixture of algorithms is used to solve a set of instances would therefore appear a fruitful line of research.We propose an ensemble approach in which a set of heuristics is evolved using GP, each of which generates extra as a complete solution to a subset of instances in a distinct region of the instance space (i.e., the set of problem instances of interest).This can be be loosely described as a "mixtureof-experts" model, to borrow terminology from the machine learning literature.It is also similar to the algorithm portfolio methods that are common in Operations Research, particularly for solving SAT problems (e.g., Leyton-Brown et al., 2003).However, in contrast to many portfolio methods that focus mainly on selection from a portfolio, our approach addresses portfolio composition, ensuring that the portfolio contains a behaviourally diverse range of algorithms.
Our approach extends a previously described ensemble method called NELLI, that has been extensively tested in the bin-packing domain (Sim et al., 2015) and on a small set of 62 JSSP instances (Sim and Hart, 2014).NELLI-GP extends NELLI by evolving a pool of novel dispatching rules, each represented as an expression tree.The novel rules are combined into variable length sequences called heuristics, which are themselves subject to evolution.Using a large set of static JSSP instances, we address two main questions using NELLI-GP: • Do evolved heuristics, composed of variable length sequences of evolved dispatching rules, outperform the individual evolved rules?
• Do ensembles of evolved heuristics outperform individual heuristics?
Our approach extends existing work on tree-based rule evolution of single dispatching-rules by considering an extended node set and multiple methods for selection of eligible jobs.It extends current approaches to ensemble methods in that the role of each member of the ensemble is to solve a subset of instances, rather than voting to prioritise individual operations.Evaluation on over 700 problem instances shows significant improvement with respect to existing dispatching rules, and to state-of-the-art approaches (Park et al., 2015;Nguyen et al., 2013b), improving on published results on benchmark sets by up to 16%.Finally, analysis of the ensemble in terms of mapping heuristics to instances provides new insights into the relationship between simple instance parameters and algorithmic performance.
The article proceeds with a brief introduction to JSSP and to previous approaches for automating the production of dispatching rules.In Section 4, we provide both a conceptual and algorithmic description of the approach, clarifying the modifications that were made to NELLI and describing the new problem generator.After providing detailed experimental results, the article concludes with an analysis section that provides some insight into the function of the individual elements of the ensembles, the diversity of heuristics, and the difficulty of the problems in the instance space.

Background
Using the α|β|γ notation (Graham et al., 1979) the two JSSP problems investigated are denoted as J m||C max and J m|| ω i T i where the term J m translates as a Job-Shop environment and the last term corresponds to the objective (Makespan (C max ) and Summed Weighted Tardiness (SWT), respectively).An n × m JSSP problem has a set of n jobs, {J i } 1≤j ≤n and m machines, {M k } 1≤k≤m .Each job J i contains m operations, each of which has processing time p jk and weight given by w i .Each job is processed on every machine with no recursion allowed.Each job has a release time r i , which imposes a hard constraint on the earliest start time of a job and a due time of d i which imposes a soft constraint on the time that the job should be completed.Machines process a single operation at a time and all machines are available for the duration of the schedule.

Dispatching Rules
Many simple dispatching rules have been developed to quickly prioritise operations in order to select an appropriate operation for scheduling in an iterative process (Panwalkar and Iskander, 1977;Haupt, 1989;Blackstone et al., 1982).Typically, rules are hand-designed by experts in a trial-and-error process (Geiger et al., 2006), and each one optimises one or a limited number of scheduling objectives based on some particular conditions.Composite DRs combine single dispatching rules into a new rule and have been shown to outperform single DRs (Nguyen et al., 2013a).Both types of DR typically fall into one of two categories: DRs that can be evaluated before a schedule is commenced (e.g., Job Arrival Date and Operation Processing Time) or DRs that return different priorities for operations as the schedule is built (e.g., Number of Remaining Operations).
Many simple dispatching rules have been known for decades-a comprehensive survey in Panwalkar and Iskander (1977) describes over one hundred.Some of the earliest hyper-heuristic approaches to scheduling (though the term was not in use at the time) recognised improved performance could be achieved by combining existing dispatching rules into sequences that could be applied iteratively to build a solution.For instance, Ulrich and Erwin (1995) and Hart and Ross (1998) both use evolutionary algorithms to search for promising sequences of dispatching rules, showing promising results compared to the single rules.
In addition to specifying priority, a dispatching rule needs to also specify the eligible job set, that is, the subset of jobs that should be prioritised by the rule.Non-delay selection considers only jobs that are currently waiting to be scheduled, and hence minimises idle time.In the majority of hyper-heuristics that generate dispatching rules, eligibility is limited to waiting jobs, that is, follows the non-delay method.Non-delay schedules are not guaranteed to be optimal; the Giffler-Thompson algorithm (Giffler and Thompson, 1960) includes in the list of eligible jobs those arriving in the near future, before the shortest operation of waiting jobs can be completed.Several hyper-heuristic methods adopt this approach or variants of it (Geiger et al., 2006;Miyashita, 2000;Pickardt et al., 2013).Rather than predefining eligibility, the hyper-heuristic itself can optimise the method by which eligible jobs are selected.Hart and Ross (1998) evolved an extra parameter for each rule that defined the method that should be used to generate the conflict set of operations; Nguyen et al. (2013a) propose a representation for evolving scheduling rules that incorporates a hybrid scheduling strategy between non-delay and active scheduling.In Nguyen et al. (2013b), a separate function for the non-delay factor is evolved which can then adapt to changing shop conditions.

Evolving Novel Dispatching Rules
Despite the large number of existing dispatching rules and proposed methods for combining them, interactions between scheduling rules can be complex; in order to address this, more recent research has attempted to automate the process of designing rules, through the use of Genetic Programming (GP) to evolve arithmetic expressions of rules.This is a hyper-heuristic approach in that the space of heuristics (dispatching rules) is searched, rather than the space of potential solutions (schedules).A very comprehensive survey of hyper-heuristic approaches to automated design of dispatching rules is provided in Branke et al. (2015).
Most hyper-heuristic methods evolve a single dispatching rule.For example, Tay and Ho (2008) successfully evolved rules that determined the priority of operations to be scheduled for static problems.This work was later shown to be less useful in dynamic cases.Hildebrandt et al. (2010) advanced this work by using four varied training scenarios and found effective but complex rules.Note that most rule-based approaches lack a global perspective in that they consider only the current state of a machine and its queue.Nguyen et al. (2013b) evolve dispatching rules that iteratively improve the schedules by utilising the information from completed schedules; in Nguyen et al. (2013a), they extend the terminal set available to GP to include both recorded information regarding a previous schedule, "look-ahead" information and composite DRs, finding good results on static JSSP problems.Hunt et al. (2014) extend this approach of using "less-myopic" information, evolving new rules with an extended terminal set that were tested on a set of randomly created test scenarios.Their results show evolution of better DRs in terms of total weighted tardiness across the test simulation scenarios with high utilisation.In relation to dynamic JSSP problems, Nguyen et al. (2014b) compare surrogate-assisted selection methods in conjunction with GP, and automatic programming via iterated local search (Nguyen et al., 2015).Note that in addition to trees, other representations of dispatching rules are possible, for example, using neural networks to represent a function or simple linear combinations of attributes.However, a recent paper by Branke et al. (2014) evaluated potential representations and concluded that in terms of solution quality, the expression tree representation was preferable.
The GP systems described here evolve a single rule that should generalise across a range of benchmarks.Another strand of research utilises GP to evolve multiple rules that are used collaboratively to solve problems.Miyashita (2000) developed a system called GP-3 that evolves two dispatching rules and a choice function that selects between the two rules depending on whether a machine creates a bottleneck; this shows improvement over the use of a single evolved rule.Nguyen et al. (2014a) consider two approaches to evolving rules for dynamic JSSPs.In the first approach, a single individual is evolved that contains two separate subtrees that together specify a scheduling policy.In their second approach, a cooperative co-evolution method inspired by Potter and De Jong (2000) is used to evolve trees in two separate subpopulations: trees from each subpopulation are combined to specify a complete policy, and three objectives are used to select individuals based on dominance criteria.

Collectives of Rules
GP has previously been used to evolve ensembles of rules in the classification domain (e.g., Downey et al., 2012;Bhowan et al., 2013Bhowan et al., , 2014)).The evolved ensembles promote behavioural diversity amongst the evolved classifiers, and have proved promising when applied to classifying unbalanced data sets.Each evolved classifier votes on the class of a given data record, with the majority vote recorded.In the hyper-heuristic JSSP domain, at the time of writing we are aware of only one ensemble approach.Park et al. (2015) extend the cooperative co-evolution method first proposed by Nguyen et al. (2014a) to evolve s subpopulations of trees; a set consisting of one tree from each subpopulation is formed and each member of the subset votes as to which operation should be selected for scheduling.The operation with the most votes is then scheduled.The approach shows promising results but was limited to testing on 80 problems.
In contrast to the ensemble method in which members of the ensemble each votes as to which operation should be scheduled, we have previously described a system called NELLI that evolves an ensemble in which each member of the ensemble "wins" a unique subset of the instances under consideration (Sim et al., 2015(Sim et al., , 2013;;Sim and Hart, 2014).NELLI uses a method inspired by Artificial Immune Systems (AIS) to evolve a set of heuristics, which are behaviourally diverse in the sense that each solves different subsets of a large instance set.This approach bears considerable resemblance to bagging (bootstrap aggregating) approaches from the machine learning literature (Breiman, 1996) which have been shown to improve the stability and accuracy of machine learning algorithms, by reducing variance and avoiding overfitting.The approach can also be considered as mixture of experts methods (Valentini and Masulli, 2002): in these methods a supervising learner divides the input space and assigns an appropriate element of the ensemble to each division.While NELLI-GP retains the core AIS element of NELLI which ensures diverse heuristics are retained, it replaces the component that evolves the heuristics, using GP to evolve new dispatching rules which are sequenced into new heuristics.

Method
In NELLI-GP, an ensemble E is defined by a tuple H, R, D .H specifies the maximum number of heuristics in the ensemble, each of which is composed of a sequence of rules.The maximum length of the rule sequence is denoted by R. Each rule is defined by a tree that can have maximum depth D. Thus, an ensemble signified by the expression H 5 − R3 − D10, for example, describes an ensemble composed of a maximum of 5 heuristics, each of which is composed of a sequence of rules of maximum length 3, and where each rule has maximum depth 10.The created trees, rules, and heuristics that form a key contribution of this article are described in turn.

Tree-Based Dispatching Rules
In contrast to previous work in which NELLI used dispatching rules taken directly from the literature (Sim and Hart, 2014), in NELLI-GP we formulate novel dispatching rules as trees of depth d, where 0 ≤ d ≤ d max .The tree returns a real value for each operation that can be considered for scheduling-these values are subsequently used to prioritise each operation.Trees are formulated using a set of terminal and function nodes described next.

Terminals and Function Nodes
The terminal node set (see Table 1) contains 28 dispatching rules, compared to 14 in Hunt et al. (2014) and 7 in Tay and Ho (2008).The majority of these have been obtained from the literature and include examples of both simple and composite dispatching rules and examples from each of the three classes defined in Section 2.1-static, dynamic, and less myopic.Additional terminals have been added to increase diversity.We use a large set of function nodes as defined in Table 2.
Note that many of the acronyms used here differ from those commonly used in the literature due to the fact that the set of priorities returned by a rule can be ordered either ascending or descending.For example, the Shortest Processing Time rule (SPT) named OpPT here can sort operations from smallest to largest or from largest to smallest, Figure 1: Example rules comprising a conflict set and a tree dispatching rule of depth 0 (left) and 2 (right).Terminal nodes prefixed withreturn a negated value.depending on whether the terminal is set as positive or negative.Terminals prefixed with Op, J, or M denote that they operate on the operation, its associated job, or its allocated machine.For example, for each operation to be prioritised, MTPT returns the total processing time of the operations to be scheduled on the associated machine and JTPT returns the total processing time of the job.
The list includes simple DRs such as Weighted Shortest Processing Time (WSPT), Earliest Due Date (EDD), and Minimum Slack (MS) (named JWPT, JDD, and JS, respectively).These DRs have been shown to provide optimal solutions for the 1|| ω j T j single machine problem when release dates and due dates are zero (WSPT), or when they are sufficiently spread out (EDD or MS).The terminal set also includes composite DRs from the literature such as Apparent Tardiness Cost (JATC) which encapsulates features of both WSPT and MS.JATC has been shown to outperform its component parts in Vepsalainen and Morton (1987) for the SWT objective.Some less myopic rules such as Next Operation Processing Time (OpNPT) are also implemented.
When a terminal node is added to a tree, it is randomly assigned as a positive or negative node; that is, a rule can order its list of operations from largest to smallest or vice versa.Thus, in effect, the size of the terminal set is doubled.

Rule Wrappers: Eligible Operations
Each newly generated tree is encapsulated in a wrapper, named a rule (see Figure 1).A rule is randomly assigned a label A, B, or C specifying the algorithm that should be used to generate the set S * of operations that are eligible for scheduling, thus expanding the search-space of possible rules.Algorithm A is a modified version of the Giffler and Thompson algorithm (Giffler and Thompson, 1960) described in Algorithm 1; Algorithm B is the Non-delay algorithm (Baker, 1984) described in Algorithm 2; Algorithm C is the set of all available waiting operations.
The first two algorithms are known to generate schedules from a reduced solution space that includes the optimal solution.The Giffler and Thompson algorithm is modified as shown in Algorithm 1 such that if more than one machine shares the same minimum completion time, then all machines with that time are included in S * .The third method is included given that hyper-heuristics are often used in situations in which optimality is not required, merely an acceptable solution found quickly-note however, that the evolutionary process might subsequently discard this rule.An operation is chosen for scheduling from S * according to the value returned by the priority rule.
Note that there are 28 * 2 * 3 = 168 possible dispatching rules described by trees of depth = 0 only: a rule can contain one of 28 terminals, each of which can be set as positive or negative and use one of the 3 possible conflict sets.The number of potential complex rules composed of trees of depth > 0 is obviously considerably larger.

Heuristics
A heuristic, h i , is defined as a variable length sequence of rules, of minimum length 1 and maximum length R. To produce a solution for a problem instance, each rule in the heuristic is applied in turn and schedules a single operation.After the last rule in the sequence is used, the procedure repeats starting from the first rule until all operations have been scheduled.When a rule is selected, the subset S * of operations to be considered for scheduling is generated according to the scheduling algorithm A, B, or C defined by the rule.The tree defined by the rule is then applied to sort the available operations.The first operation from the sorted list is then selected.If more than one operation may be placed at the head of the sorted list (i.e., set of equal values is returned), then one is selected randomly.The selected operation is scheduled at the earliest available time on its predetermined machine as determined by the maximum of the completion time of the last operation to be processed on the machine, the completion time of the last operation from the selected operations associated job, and the job's arrival time.

Evolving Ensembles of Heuristics
Evolution of the ensemble is managed using the NELLI method previously mentioned in Section 2, modified in order to cope with the use of tree-based rules.In brief, the algorithm provides a cooperative method of evolving a set of heuristics that interact to cover a problem space, favouring heuristics that are able to find a niche in solving at least one problem better than any other heuristic in the system.Heuristics are sustained in the system by "winning" instances, for example, achieving a result on an instance that is better than anything else in the ensemble.
The heuristics in the ensemble continuously adapt due to evolutionary processes that continually trial new heuristics, and the size of the ensemble is an emergent property of the system.On the one hand, this enables the network to both continually improve its response and adapt to changing problem instances.On the other hand, it provides an immediate solution to new problem instances that appear as each heuristics generalises over some part of the problem space.A full description of the algorithm can be found in Sim et al. (2013Sim et al. ( , 2015)), Sim and Hart (2014), and Hart and Sim (2014), which describes applications in which heuristics were formed from predefined rules only.
Figure 2: NELLI-GP comprises three components: a heuristic generator to supply a continual stream of novel heuristics, a set of problem instances, and a network inspired by the idiotypic network theory of the immune system.The heuristic generator is expanded (the shaded area) here by the addition of a Rule Generator.

Implementation
A conceptual view of the system is given in Figure 2 and is described by Algorithm 3. The system contains three key components: a continual stream of problem instances (the problem-space), a continual stream of novel heuristics, and a network of heuristics inspired by the idiotypic model of the immune system.The dynamics of the network determine the final constitution of the system in retaining only problems that are representative of characteristic regions of the problem space, and heuristics that solve problems that cannot be solved by other heuristics.The system was originally proposed as a lifelong learning approach that could autonomously adapt to changing problem characteristics (Sim et al., 2015)-here we use a modified form to evolve a fixed ensemble of heuristics that collectively map the problem space based on a training set of problems.As in any machine learning method, the quality of the system relies on the training set being representative of the problem space, hence a large, diverse set of problem instances is desirable.
The key improvements to previous work using NELLI (Sim and Hart, 2014) are provided by the new heuristic generator.Heuristics are now represented as sequences of novel tree-based dispatching rules.This also differs from previous work in Sim et al. (2015) that uses Single-Node genetic programming to represent bin-packing heuristics.
where f (h p ) is the result (MS or SWT) achieved by heuristic h on problem p and min (f (H p )) is the best result obtained on problem p by the rest of the ensemble H = H − h.
The new approach provides increased diversity and results in a trained ensemble that better generalises across the problem space.A brief description of the core network is provided before describing the novel modifications in detail.

Immune Network
The immune network lying at the heart of the system is unchanged from the system used previously (e.g., Hart and Sim, 2014), and is covered here briefly for clarity.The immune network models heuristics as antibodies and problems as pathogens.The objective is to find a minimal repertoire of heuristics that effectively solve the problems being supplied to the system.A heuristic "wins" a problem instance if it obtains a solution to the instance that is better than any other heuristic in the system.The immune network promotes behavioural diversity within the ensemble in that for a heuristic to persist it must win at least one problem.Similarly, a problem survives in the system only if it is won by only one heuristic; that is, it is representative of a hard part of the instance space.Both heuristics and problems have their behaviours governed by two variables-concentration and stimulation.The key steps of the immune network simulation are described from line 10 onwards in Algorithm 3.With each iteration, after new problems and heuristics have been injected into the system, the stimulation levels of both heuristics and problems are calculated using Equations 1 and 2. Based on the stimulation received, concentration levels are adjusted up or down by a fixed amount ( c ) depending on whether the stimulation value of a cell lies within upper and lower thresholds.Heuristics and problems are injected into the system with an initial level of concentration (c init ) that allows them to persist in the network for a number of iterations, even without receiving any stimulation.If any concentration level falls to zero or below, it is removed from the system.Heuristics have an implicit effect on each other's behaviour leading to complex network behaviour; for example, a newly introduced heuristic that outperforms an existing heuristic may limit the stimulation received by the existing heuristic to such an extent that its concentration is reduced and it is eventually removed from the network.
Note that there is no global fitness function: a new heuristic is included in the ensemble only if it wins at least one instance from another heuristic; hence every addition of a new heuristic necessarily improves global fitness.The stimulation level of a heuristic provides a measure of its individual fitness: this drives selection for mutation, providing evolutionary pressure for individuals to improve.Specialisation of heuristics to specific niches with the instance space occurs due to the explicit need to win instances to survive.

Heuristic Generator
As previously described, a heuristic is composed of a sequence of rules.Each rule comprises two parts: a method of generating a conflict set of waiting operations and a tree dispatching rule that is used to prioritise the operations from the conflict set.New heuristics are created through cloning, initialisation, and mutation processes, the latter two of which are described by Algorithms 4 and 5 and remain identical to our previous work described in Sim and Hart (2015a).
Both the heuristic mutation and heuristic initialisation procedures may result in the requirement for new rules to be generated.In previous work (Sim and Hart, 2015a) when a rule was required, it was selected randomly from a set of 13 rules taken directly from the scheduling literature.In the work presented here, a rule generator is implemented that creates novel rules formulated as tree structures as described in Section 3.1.The process described in Algorithm 6 outlines the method for generating a new rule.New rules can be formed either through initialisation, using the ramped half and half method of Koza (1992), as described in Algorithm 7, or by mutating an existing rule using subtree swap mutation.
The current set of heuristics sustained by the network in NELLI-GP represents a population of available rules that are used as the starting point to generate new rules.This is motivated by the fact that any heuristic maintained in the system has already proved it provides a contribution to the ensemble, and thus can be considered as "high fitness."Unlike conventional GP, which uses crossover to converge to an optimal solution, only cloning and mutation procedures are used here to generate new heuristics and rules.The rationale for this implementation decision is that the purpose of the ensemble is to find diverse heuristics; therefore, utilising crossover was deemed unhelpful as crossover inevitably leads to convergence.Initial empirical investigation including crossover validated this assertion, leading to convergence of heuristics and no improvement in the collective performance of the ensemble: this is directly antagonistic to the goal of NELLI-GP to sustain a diverse population of heuristics.

Problem Generator
In addition to a stream of heuristics, NELLI-GP requires a continuous stream of problem instances that ideally are widely distributed over the potential problem space.Many JSSP datasets are available in the literature (Applegate and Cook, 1991;Lawrence, 1984;Taillard, 1993).Often, these problems have known lower bounds which provide useful benchmarks, but tend to consist of small sets of problems generated from a specified parameter set.A common approach from researchers studying JSSP has been to generate new problems using a specified problem generator (e.g., Hunt et al., 2014).This method has the advantage of being able to create very large numbers of problems; furthermore, by varying the parameters of the generator, a wide area of the potential problem space can be covered.For both of these reasons, we use the latter approach.The generator is made available on-line at Sim and Hart (2015b).
We consider problems specified by n jobs and m machines, where n ∈ [10, 25, 50, 100] and m ∈ [5,10,15,20,25].Instances are generated from each possible combination (n, m).The processing time for an operation is selected randomly from a uniform distribution following p ir = U [2, . . ., 10] where p ik refers to the processing time of the operation of job j on machine k.Release dates are drawn randomly from one of two distributions as in Tay and Ho (2008) using Equation 3depending on the number of jobs in the problem instance.
Due dates are defined as in Baker (1984) using Equation 4. The term c is fixed at 1.3, the average of the values used in Singer and Pinedo (1998) which investigated the difficulty of relatively "tight" problems generated with c = 1.4 and c = 1.2. (4) Job weights (w i . . .w n ) are selected using the 4:2:1 rule taken from Zhou et al. (2009) which is informed by research suggesting that 20% of a company's customers are the most important, 60% are of average importance, and 20% are less important.Consequently the first 20% of jobs in an instance are assigned a weight of 4, the next 60% receive a weight of 2, and the remaining 20% of jobs in an instance are given a weight of 1.The order in which a job's operations are assigned to machines may differ for each job and is allocated randomly during the problem generation process.All jobs visit each machine exactly once.In total, 20 different instance classes are defined by the combinations of (n, m, p, r, d).Within a class, each instance is unique due to the stochastic nature of p (processing time) and r (release date).

Training and Test Sets
Thirty-five instances are generated from each of the 20 classes.Within each class, 25 instances are selected at random and added to a training set, with the remaining 10 allocated to a test set.Thus, 700 problem instances are generated in total with 500 instances in the training set, and 200 in the test set.
Each of the data sets is considered using two different objective fitness functions.The makespan is a measure of the time to completion of the last job and is given by Evolutionary Computation Volume 24, Number 4 C max = max{C i } : 1 ≤ i ≤ n, where C i is the completion time of job i.The Summed Weighted Tardiness relates to the lateness of a job and is given by n i=1 ω i T i , where the tardiness of a job is denoted by T i and T i = max{C i − d i , 0}.In the following section, we describe a number of experiments that are conducted separately using each of these objectives.All problem instances are available at Sim and Hart (2015b).

Experimental Method
The experiments described here have the following objectives: • to investigate whether there is a performance benefit from evolving tree-based rules in comparison to a large set of predefined scheduling rules • to investigate if a performance benefit is obtained by combining rules into heuristics composed of linear sequences of rules, compared to using single rules • to investigate the benefit of using an ensemble of heuristics compared to single heuristics • to compare the results obtained to the state of the art from the literature To address the first three objectives, ensembles described by H x , D y , R z are evolved using the training and test sets described in Section 4.4.Note that experiments in which x > 1 describe ensembles consisting of more than one heuristic.Experiments in which y = 0 have rules of depth 0, that is, use only the scheduling rules defined in the literature.Experiments in which z = 1 use a single scheduling rule (either evolved or from the literature), rather than heuristic composed of a sequence of rules.
Parameters used in all experiments are given in Table 3.These are unchanged from those used in previous works with additional parameters added relating to the initial and maximum depth for trees which are both set to either 0 or to 6 (d init ) and 17 (d max ), respectively.The probability p ais and p m are set to high values to encourage exploitation, that is, mutation of existing rules and/or heuristics.This follows work reported in Sim and Hart (2014) that noted that the best results were obtained when new heuristics were formulated predominantly by mutating existing heuristics.Only two values of R max and D max were investigated: R max was either set to 1 or U = unlimited, that is, the length of the sequence of rules defining a heuristic was either exactly one or set to unlimited, so that the sequences could grow to any length depending on the evolutionary operators applied.The maximum depth of the trees was either set to 0 (use only a single terminal node) or 17 as in Koza (1992).The size of the ensemble was varied such that H ∈ 1, 2,4,8,16,32,64,128,256, U where U indicates unlimited, that is, no enforced restriction on the size of the ensemble.
As the rules are all stochastic (more than one operation may receive the highest priority score which results in a random selection from the set of highest priority operations) all experiments were conducted 30 times each for 10,000 iterations.This results in 30 sets of results for each training scenario: the heuristic(s) that gave the best results for each scenario during training were evaluated 30 times on a set of 200 test problems using the corresponding objective function.

Results
This section gives results from experiments in which ensembles were evolved using the training set of 500 instances and test set of 200 instances described in Section 4.4.Figure 3 shows box plots that directly contrast performance when switching between trees, heuristics, and ensembles on the test sets for both objectives.In each case, the four lefthand results (H = 1) indicate "one-size-fits-all" experiments in which a single heuristic is evolved.The right-hand four results represent ensembles, in which the components can consist of single rules (R = 1) or sequences of rules (R = U).Configurations with D max = 0 represent rules composed only of terminal nodes.Those with D max = 17 contain treebased rules.For the C max objective, H 1 D0 R1 corresponds exactly to the rule −J RP T (annotated with conflict set B) shown in Table 1 which is the best performing single terminal node and to −J ACT (again with conflict set B) for the SWT objective.The box plots present the summed fitness over the 200 test instances obtained from each configuration over 30 runs.The following general observations are made: • Using an evolved rule (D max = 17) is always preferable to creating heuristics from the terminal nodes only, regardless of whether ensembles or rule sequences are used.
• For the non-ensemble methods (H = 1), then a heuristic that contains a single rule rather than a sequence of rules is preferable: this is not observed in the ensemble methods in which the sequences (R = U) provide better results.This may result from overfitting on the training set when a single sequence is evolved, particularly for D = 0.
• Ensembles provide better results than a single "one-size-fits-all" approach (H = 1), regardless of R and D.
• The ensemble method used in conjunction with DU−RU is the clear winner, showing the benefit of an ensemble, of evolving trees, and sequencing them into heuristics.
Two-tailed t-tests between all pairs of results show that all observations are statistically significant at the 5% confidence level, with the exception of the comparison H 1D0R1 to H 1D17R1 using the SWT objective, where we were unable to find a statistically better single rule that generalised better than the JATC rule, known for its performance on the SWT objective.However, our results using ensembles (H > 1) improved on the single best rule on all occasions.Tables 4 and 5 provide results from each configuration from the single best run obtained for both objectives.As the summed fitness metric used in Figure 3 can be distorted by the larger instances, and masks performance on individual instances, we also provide a comparison to the theoretical optimal.makespan this can be easily calculated (Taillard, 1993); the table gives the median value for the ratio actual/optimal makespan.For the TWT objective, we estimate the theoretical optimal as the sum of the weighted due-dates, and calculate the ratio of the summed weighted actual arrival date to the summed weighted due-dates.For both objectives, we also give the actual number of optimal solutions found.Additionally, we rank each of the 8 configurations on each instance (where rank 1 is best) and calculate the average rank over all instances, and the number of instances assigned rank 1 for each configuration.
In terms of the summed fitness, the ensemble composed of heuristics containing rule-sequences (HUD17RU) provides the best result for both objectives.Comparing average ranks and instances of rank 1, the same ensemble proves best for the TWT objectives.It is marginally beaten by HUD17R1 for makespan in terms of average rank, but has more instances assigned rank 1.In terms of number of optimal solutions and median ratio, although the ensemble methods are clearly preferable, there is little difference between the ensemble of rule sequences (HUD17RU) and the ensemble of single rules (HUD17R1).The RU configuration appears to faciliate the emergence of heuristics that improve fitness (reducing the sum) on some of the larger instances, while having little impact on the overall ratio to optimal, most likely due to the increased size of the search space.

Comparison to a Disposable Hyper-Heuristic Approach
Hyper-heuristic methods aim to find quick and acceptable solutions to problems, often trading some loss in quality against speed of producing a solution.Given a new problem instance p and a trained ensemble containing h heuristics, then exactly h heuristics need to be executed once to find the best solution to the instance.In contrast, a typical  meta-heuristic approach applied to each instance to obtain a heuristic would run for i iterations.Although not entirely fair, it is instructive to compare the quality of solutions obtained by greedy selection from an evolved ensemble to the quality of the solutions obtained by directly evolving a new heuristic to solve each individual instance.A standard GP algorithm with parameters set as in Koza (1992) (population size = 500, initialised using ramped half and half up to a depth of 6 using crossover 90% and cloning 10% and maximum tree depth of 17) was executed for 20 generations (10,000 evaluations as with NELLI-GP).After initialisation, each tree is randomly assigned one of three possible rules to determine operation eligibility defined in Section 3.1.2.This rule does not undergo mutation.The following experiments are performed 30 times: • GP (1P) : GP is run on each of the 200 instances in the test set in isolation; that is, a single rule is evolved for each test instance.
• GP (200P) : GP is run on the full set of the 200 test instances; that is, 1 rule is evolved for the complete test set.
Table 6 shows the best, mean, and standard deviation obtained over 30 runs.t-test results in each column show a comparison with the previous columns results.As well as results from the two experiments described, 3 sets of results from the experiments conducted in the previous section are included for comparison.GP (1P) and H U D17 R1 both evolve exactly one rule per problem, and hence can be directly compared.Similarly H 1 D17 R1 and GP (200P) both evolve a single rule that is evaluated on all 200 problem instances and hence are directly comparable.The results of applying the best ensemble obtained during training (H 1 D17 RU ) to the test set are also included.
As would be expected GP(1P), which evolves a different rule for each problem instance, obtains better results than both GP(200P) and H 1 D17 R1 which evolve only a single generalist rule.Interestingly, when the evolved H 1 D17 R1 rule is applied to the test set, it outperforms the single rule that was evolved by GP directly on those instances, demonstrating good generalisation from the training set.All of these results are eclipsed by the results obtained by applying the reusable ensembles of heuristics generated by NELLI-GP.Furthermore, the best trained ensemble (H U D17 RU ), comprising of 264 heuristics, requires less than 3% of the evaluations required for either of the GP experiments.

Comparison to Existing Approaches
We compare our method to two recent approaches from the literature.Nguyen et al. (2013b) describe a new GP-based approach to learning new iterative dispatching rules, using four well-known benchmarks datatsets (LA, ORB, TA, DMU).A training set is formed consisting of all odd numbered instances from each of the four sets, with the remainder making up a test set.Each set has 105 instances.Park et al. (2015) use GP to evolve an ensemble of rules that vote to determine which operation is selected for scheduling at each iteration.Their approach is tested on the 80 instance Taillard (TA) set (which is included in Nguyen et al., 2013b), using three different training sets each containing 5 instances, with the remaining 65 instances making up the test set in each case.
A direct comparison of ensemble methods is given by comparing NELLI-GP to the ensemble method of Park et al. (2015), that is, the "mixture-of-experts" ensembles of the former and the majority voting approach taken by the latter.We use the same training and test sets described in Park et al. (2015) and evolve an ensemble of heuristics using NELLI-GP in the form H 4 D17 RU in order to allow a fair comparison to the 4 islands used by Park et al. (2015).In order to provide an exact comparison, NELLI-GP is first limited to exactly the same set of function and terminal nodes used by Park et al. (2015).The experiment is then repeated using the full set of nodes given in Table 7.Both NELLI-GP ensembles outperform the results from Park et al. (2015).t-tests at the 5% significance level confirm this result.t-tests additionally show no significant difference between the NELLI-GP results with different terminal sets.To compare to Nguyen et al. (2013b), we evolve an ensemble defined by HUD17RU using exactly the same training and test sets as their approach.Results averaged over 30 runs are given in the final line of Table 8.NELLI-GP significantly outperforms the published approach on both training and test sets.
The ensembles evolved by NELLI-GP should be reusable.We take ensembles evolved on the new 500 training instances (as described in Section 4.4) and reuse on the test sets used by Nguyen et al. (2013b) and Park et al. (2015), comparing to their published results.Results are given in Table 8 for three ensembles (H 4, H 64, H U). All three significantly outperform EGP-JSS from Park et al. (2015) on the 65 Taillard instances.H 8 and H U significantly outperform the results of the new GP method proposed by Nguyen et al. (2013b) and the results from their simple GP algorithm .Clearly the NELLI-GP ensembles are reusable-ensembles evolved on one data set can be directly applied to a different data set, and demonstrate improved performance over methods that were tailored to the tested data set.In addition, the ensemble provides a robust optimisation method.The instances in the data sets given in Table 8 have similar parameters in terms of (jobs, machines) but contain operations whose processing times are drawn from different distributions to those used by the instances used to evolve the ensembles.Hence, they can be considered perturbed versions of the original instance set.

Analysis
This section provides an analysis of the ensemble in terms of the role and structure of the constituent heuristics, and relationship between heuristic performance and problem structure.

Effect of Ensemble Size
The experiments in Section 5 allow ensembles of unlimited size (H = U ) to evolve, although in practice evolution limits the actual size that emerges.In the following experiments, we examine the effect of restricting the size of the ensemble to H ∈ (1,2,4,8,16,32,64,128,256) rather than allowing evolution to proceed unrestricted.Figure 4 shows results on the test sets for both objective functions as H is varied.
A clear correlation between the size of the ensemble and the collective performance of the ensemble is apparent in all cases.However, the benefit clearly tails off as the size of the ensemble increases beyond a saturation level.An ensemble size of |H | ≈ |P |/10 appears appropriate as a rule of thumb given the investigations conducted.However, this is clearly problem dependent; the larger and more diverse the training set the better the evolved heuristics.

Structure of Evolved Heuristics
The heuristics contained in evolved ensembles of size 1 − 264 are analysed in terms of the number of rules contained in each heuristic, and the number of nodes in each rule, with results given in Table 9.As previously noted, the number of heuristics in the ensemble, the number of rules in a heuristic, and the tree depth of the rules is an emergent property of NELLI-GP.Some general trends are observed.Heuristics tend to increase in complexity in terms of the number of rules per heuristic as the ensemble size increases, peaking at |H | = 64 when the trend reverses.This suggests that large ensembles favour less complex rules, as each heuristic needs to operate in a smaller region of the instance space.In terms of nodes per rule, the pattern is less obvious; the number of nodes tends to increase as the size of the ensemble increases, but does not grow linearly.The average depth per rule is given in the final column-note that in all Figure 5: An average rule from 1 of the 64 heuristics evolved in experiment H 64 R17 DU .The 64 heuristics contained on average 8 rules with rules having an average depth of 6.
In general, the depth of rules that emerged was much less than the maximum value of 17.
cases, this is significantly less that the maximum allowed depth of 17 and often produces small, easy to analyse heuristics with depth around 5. In contrast, in typical GP, bloat is common.We suggest that in NELLI-GP, bloat is largely suppressed by the elitist requirement that a heuristic must solve at least one instance better than the existing heuristics to replace another in the ensemble.In standard GP with a large population, weak trees can be sustained due to the replace-worst step, encouraging bloat.A typical rule taken from 1 of the heuristics from H 64 R17 DU is shown in Figure 5.

The Role of the Ensemble
To understand how each heuristic within the ensemble contributes to the collective performance we analyse an ensemble of h heuristics applied to the set of i = 200 test instances.For each instance we calculate f * i , that is, the best result achieved on the instance by the ensemble, and calculate c i as the number of heuristics that achieve f * i for a given instance i.For instances with c i > 1, multiple heuristics achieve the same result f * .From an algorithmic perspective, we suggest that these instances do not represent "interesting" regions in the instance space, given many heuristics perform equally well.In contrast, if c i = 1, then the instance has a unique winner, and these instances represent regions in which algorithm selection becomes a key issue.
Let w(h j ) be the number of instances won by a heuristic, h j .A specialist heuristic, one with small w(h j ), operates in a niche region of the instance space.A generalist heuristic on the other hand, with large w(h j ), operates across large regions of the space.The balance between the specialist/generalist nature of the heuristics evolved is an emergent property of NELLI-GP that is closely linked to the structure of the instances in the dataset.We consider the 20 classes of makespan instances described in Section 4.4 and the heuristics generated by the experiment H 8D17RU and H 64D17RU .Figure 6 shows the number of instances w(h i ) won by each of the heuristics for both experiments.For an ensemble of 8 heuristics, we observe that for 32% of the 200 instances, there is no uniquely best heuristic.For the remaining 134 instances "interesting" instances, no particular heuristic is dominant; the most general heuristic (H3) wins 26 instances while the most specialist (H2) wins 11.In the ensemble of 64 heuristics, only 10% of instances do not have a unique winner.The 64 heuristics tend to specialise equally: the number of instances won ranges from 2-7, with no heuristic generalising across many instances.

Relationship Between Instance Class and Heuristic Performance
NELLI-GP promotes the emergence of heuristics which are behaviourally diverse in that each heuristic has to uniquely win at least one instance in order to survive.Assuming that the instances within the subset won by each heuristic share similar features, we investigate the intra-class and inter-class membership of each subset, given the 20 classes defined in Section 4.4.Figure 7 uses a network representation to capture the relationships between heuristics and classes for an 8 heuristic ensemble.An edge exists between a heuristic and a class if the heuristic wins at least one instance in that class.The weight of the edge reflects how many instances within the class were won, and the size of a heuristic denotes how many different classes it wins instances in.The degree of a heuristic indicates the number of different classes the heuristic wins instances from.The degree of a class indicates how many different heuristics win instances from that class, thus is representative of intra-class niches.Heuristics have a median degree of 8.5, indicating there are clearly many inter-class relationships between instances.Classes Figure 7: A network capturing the relationship between a heuristic and the instance classes in which it wins instances.Edges are weighted according to the number of instances in the class won.have a median degree of 4, that is, a typical class of 10 instances generated from the same parameter set has at least 4 intra-class clusters.H3 is strongly associated with class 18 (defined by 25 jobs and 25 machines) winning 7/10 instances.Only 3 of the 20 classes are uniquely associated with a heuristic, implying uniformity across instances within these classes.Note that two of these classes (1,2) contain the easiest (in terms of finding optimal solutions) instances, due to the small number of jobs and machines.
Clusters of similar instances-those won by the same heuristic-naturally emerge from NELLI-GP.Each cluster can be examined with respect to a feature set to potentially identify correlations across instances.Ingimundardottir and Runarsson (2012) consider a set of 16 features and attempt to relate them to instance difficulty.Thirteen of these features are recalculated on a step-by-step basis after every operation has been scheduled (e.g., current makespan); the remaining 3 are static and apply to the instance as whole (e.g., total job processing time).Smith-Miles et al. (2009) cluster instances (using a tardiness objective) according to 6 features, then examine the distribution of fitness values to infer knowledge about the relationships between instance structure and heuristic performance.Taking the H 8 ensemble as an example, for each heuristic in turn we examine whether the features of the instances within the cluster defined by the instances it uniquely won differ significantly from the remainder of the instances.As NELLI-GP assigns a single heuristic to an instance, only static features are relevant.We consider four features: mean processing time, processing time range, and total processing time are used in both (Ingimundardottir and Runarsson, 2012;Smith-Miles et al., 2009); we add ratio machines:jobs as this is known to influence problem hardness (Streeter and Smith, 2006).
If I t is the 200 instances in the test set, then let I H j be the subset of instances uniquely won by heuristic H j ; I r the subset of all remaining instances from I t ; I u the instances in I t uniquely won1 by the remaining heuristics.For each of the four features, we apply a Wilcoxon rank sum test to compare the feature values in the instances in I H i to those in subset I r , and then separately to those in I u .Results are shown in Tables 10 and 11 which differentiates between results significant at the 5% level (+) and at the 1% level (++).For three heuristics (H2, H5, H6) none of the features are discriminatory.However, for the remaining heuristics, at least one feature discriminates the instances won by the heuristic from the remainder of all instances or the remainder of heuristics uniquely won.All features are discriminatory in at least two tests.The results generally concur with the findings of Ingimundardottir and Runarsson (2012) who attempt to correlate features with hardness, noting that "the features distinguishing hard problems were scarce.Most likely due to their more complex data structure their key features are of a more composite nature."NELLI-GP removes the need to define features in order to discriminate between instances; the clustering of instances emerges as a result of applying the algorithm.The emergent clusters capture structure within the problems that cannot be easily defined by humanly-intuitive features or by the features used to generate instances.

Conclusion
The automated design of reusable production scheduling heuristics is currently of great interest to the optimisation community as evidenced in detail by Branke et al. (2015).The need to move towards ensemble methods has been highlighted as an open challenge, given the requirement to deal with complex decisions and varied problem instances.We have described a novel ensemble method for evolving JSSP heuristics called NELLI-GP in which each heuristic in the ensemble generates solutions to problems in a niche region of the instance space.
Experiments on a large set of newly generated problem instances as well as existing benchmarks have shown that novel rules can be evolved that outperform existing scheduling rules and recent hyper-heuristic methods.The power of the method is attributed to (a) evolving new dispatching rules that both define job eligibility and use a large set of terminals, (b) combining rules into variable length sequences to form new heuristics, and (c) evolving ensembles of heuristics, in which each operates in a distinct part of the instance space.The reusability and robustness of the ensemble is demonstrated by applying an ensemble evolved on one set of instances to new instance sets, outperforming existing benchmarks.Analysing the evolved ensemble in terms of performance on different classes has enabled new insights to be gained relating to instance-structure.In agreement with Ingimundardottir and Runarsson (2012), we find that intuitive problem features are insufficient to characterise instances.Using the instance clusters that emerge from running NELLI-GP however, future work will focus on attempting to characterise the clusters in terms of combinations of known features or deriving new complex features.In addition, we intend to apply NELLI-GP to dynamic scheduling problems in which it is possible to derive more fine-grained features.

Figure 3 :
Figure 3: Results on the test problems for Makespan (left) and SWT (right).

Figure 6 :
Figure 6: Number of instances uniquely won by each heuristic.(a) Ensemble of 8 heuristics.(b) Ensemble of 64 heuristics (sorted by wins).

Table 4 :
C max .For each configuration <H*D*R*>, the table shows the median ratio to theoretical optimal and summed fitness over all instances, and average rank compared to the other configurations.Results are from the single best run in all cases.

Table 5 :
T W T .For each configuration <H*D*R*>, the table shows the median ratio to theoretical optimal and summed fitness over all instances, and average rank compared to the other configurations.Results are taken from the single best run in all cases.

Table 6 :
Comparison of ensembles (C max ) to disposable hyper-heuristic approaches.

Table 8 :
Reusability and robustness of ensembles: NELLI-GP T refers to ensembles trained on the 500 new instances and applied directly to new test sets.Results from EGP-SS and , taken directly from Park et al. (2015) and Nguyen et al. (2013b), respectively.For comparison, NELLI-GP E gives the result from evolving an ensemble on the same training set given in the relevant publications.

Table 9 :
Analysis of the structure of the evolved heuristics.

Table 10 :
Examining the distribution of features in set I H i compared to I r , that is, all remaining instances.The table shows p-values obtained from a Wilcoxon rank sum test for each feature.

Table 11 :
Examining the distribution of features in set I H i compared to I u , that is, the subset of remaining instances that are uniquely solved.The table shows p-values obtained from a Wilcoxon rank sum test for each feature.