Near-Data Prediction Based Speculative Optimization in a Distributed Environment

Abstract. Apache Hadoop is an open-source software framework that supports data-intensive distributed applications and is released under the Apache 2.0 license; with it, consumers no longer deal with the complex configuration of software and hardware but simply pay for cloud services on demand. How to improve the performance of such a cloud platform therefore becomes increasingly important in a consumer-centric environment. The distribution of tasks across a cluster is imbalanced, and the resulting slow ("straggling") tasks have a great influence on the performance of the Hadoop framework. Speculative execution (SE) handles Stragglers by monitoring the progress of running tasks in real time and replicating potential Stragglers on different nodes, so as to raise the probability that a backup task finishes before the original one. At present, however, the performance of SE in Hadoop is unsatisfactory because the current policy misjudges Stragglers and selects backup nodes inappropriately. This paper proposes an SE optimization strategy based on near-data prediction. The strategy first gathers real-time task execution information and predicts the remaining runtime of each task with a local prediction method; it then chooses a proper backup node according to data locality and actual demand. A cost-benefit model is also included to push the performance of SE to its peak. The results show that using this strategy in Hadoop effectively improves the accuracy of backup-task selection and performs even better in heterogeneous environments.


1 Introduction
As the internet occupies ever more aspects of people's lives, the amount of data stored in consumers' private clouds will grow exponentially in the next few years, and many consumers need to pay for cloud services on demand [1]. Meanwhile, cloud computing platforms such as Apache Storm [2], Spark [3], and Hadoop [4] have evolved rapidly.
As an Apache top-level project and one of the most prevalent cloud computing frameworks, Hadoop is widely used for distributed data storage, computing, and search [5]. Many strategies have been designed to improve the effectiveness and efficiency of Hadoop clusters and to facilitate big data storage and analytics [6], but inefficient resource allocation in Hadoop job scheduling still causes many difficulties.
Allocating and coordinating tasks among TaskTrackers has therefore become critical and challenging for a JobTracker, owing to the lack of runtime information about TaskTrackers and the difficulty of predicting the completion time of each task [7]. The most effective mechanism for improving Hadoop's fault tolerance is speculative execution (SE), which identifies and corrects inefficient allocations by the JobTracker [8]. Previous research efforts have been conducted to optimize the SE strategy; although these strategies aim to estimate the remaining time of slow tasks, such self-estimation is often inaccurate and therefore inappropriate [9].
In this paper, we pay attention to real-time task execution and collect the relevant information during a task's runtime. A locally weighted prediction method, called LWR-SE, is employed to estimate the remaining running time of a task. In parallel, a cost-benefit model and a more appropriate selection strategy for backup execution nodes are combined, so that both cloud platform providers and consumers can benefit. Section 2 reviews current user-centric cloud environment research and Hadoop-based fault-tolerance optimization strategies. Section 3 presents the LWR-SE strategy we designed, and Section 4 verifies the reliability of the method experimentally. Section 5 summarizes this article's work and lists key work to be done in the future.

2 Related Work

2.1 Service in User-Centric Cloud
The combination of the consumer electronics industry and cloud computing has led a growing number of researchers to focus on user-centric cloud services. A new identity management (IDM) architecture based on privacy and reputation extensions was put forward to enhance the security of consumers' identities [10]. An architecture called SuSSo was designed to overcome the limits of service continuity across different consumer electronic devices combined with cloud computing [11]. Abolfazli et al. gave an overall analysis and comparison of the different mobile cloud computing solutions in the field of consumer electronics [12]. Fu et al. proposed a useful multi-keyword ranked search strategy over encrypted cloud data that supports synonym queries at the same time [13]. Grzonkowski et al. presented a more secure authentication method for home networks in a user-centric cloud environment [14]. To cope with the complexity and diversity of big data, a cloud computation offloading method named COM was proposed, which dynamically schedules data- and control-constrained computing tasks [15]. A new, systematic smart home management system, deployed in the cloud and acting as the community broker, was presented to provide more electronic information services [16].

2.2 Fault Tolerance in Hadoop
Temporal fault tolerance, on the other hand, aims to automatically detect and restore faulty runtime tasks so as to shorten execution time and improve the computing performance and reliability of a cloud system; it involves strategies for MapReduce job and task scheduling, enhancements of speculative execution (SE) strategies, and so on [17]. The original speculative execution was implemented as Hadoop-Naïve in Hadoop [18]. Its primary idea was to recognize a task as a "Straggler" if its progress fell below the average level, which can misjudge tasks and waste cluster resources, and it performs even worse in a heterogeneous environment. LATE uses the average progress rate to calculate the remaining time of running tasks, which may lead to inaccurate or even incorrect predictions. In 2015, Wu's team improved prediction accuracy by taking the system load into account when calculating a task's remaining time [19]. MCP maximizes the benefit of launching backup tasks, addressing the weaknesses of previous SE strategies by dividing Map tasks into map and combine phases, and Reduce tasks into copy, sort, and reduce phases [20]. In 2014, an SE optimization algorithm called Ex-MCP was proposed, which additionally takes node values into account compared with MCP [21]. On top of that, some further optimization methods have been put forward. A speculative execution method was proposed that sorts nodes according to their hardware performance [22]. Wang et al. proposed a PSE optimization strategy that can ignore the differences between processors to improve the efficiency of speculative execution [23]. An effective speculative execution strategy (SECDT) was proposed in [24], in which the completion time required by a task is calculated with a decision tree [25]. Besides, the ATAS strategy improves Hadoop's scalability by increasing the accuracy of the estimated execution time of backup tasks [26]. Adaptive allocation scheduling can also be used for NILM algorithms based on power allocation [27]. Owing to the imbalance of cloud platform performance, Edge Computing Nodes (ECNs) have been proposed as an alternative to cloud computing in recent years; Xu's team used the non-dominated sorting genetic algorithm II (NSGA-II) to achieve multi-objective optimization, shortening the offloading time of computing tasks and reducing energy consumption [28].
In general, current SE strategies still have great difficulty in quickly backing up, and accurately identifying, potential Straggler tasks on appropriate nodes, and balancing the overall benefit of the cluster while preserving data locality remains a very large challenge.

3 Model and Algorithm
In this section, we introduce a speculative execution method named "LWR-SE". The flow of the method is shown in Fig. 1, and the details are discussed in the remaining parts of this section.

3.1 The Recognition of Straggler Candidates
Data Collection of Running Tasks. Stragglers are first identified by collecting detailed information, such as progress and execution time, on real-time tasks. To collect features, raw (progress, timestamp) pairs are gathered from HDFS to facilitate prediction; each pair is then converted into a (progress, execution time) pair in order to simplify the algorithm, as sketched below.
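The original algorithm listing is not reproduced here; the following is a minimal sketch of the conversion step just described, assuming the progress reports of one task have already been fetched from HDFS as (progress, timestamp) pairs. Names and data layout are illustrative, not the paper's actual implementation.

```python
def to_training_pairs(progress_log):
    """Convert raw (progress, timestamp) reports of one task into
    (progress, execution_time) training samples.

    progress_log: list of (progress, timestamp) tuples, where progress
    is in [0, 1] and timestamp is a UNIX time in seconds, sorted by time.
    """
    start = progress_log[0][1]  # task start time taken from the first report
    return [(p, ts - start) for p, ts in progress_log]

# Example: three progress reports of a single Map task
log = [(0.0, 1000.0), (0.25, 1012.5), (0.50, 1025.0)]
print(to_training_pairs(log))  # [(0.0, 0.0), (0.25, 12.5), (0.5, 25.0)]
```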
Let m denote the number of training samples, each recording the progress of a task at some moment. The input p is the progress augmented with a constant bias term, t is the execution time of the task, and the regression model is

t = \theta^{T} p,  (1)

where θ is the regression parameter. θ must minimize the squared error between the predicted and true values, as stated in Equations (2) and (3):

E(\theta) = \sum_{i=1}^{m} \omega_i \,(t_i - \theta^{T} p_i)^2,  (2)

\theta^{*} = \arg\min_{\theta} E(\theta).  (3)
Here E represents the error, (p_i, t_i) is the i-th training sample, and ω_i is the weight of the i-th sample in the local forecasting region, which depends on the local prediction point. To simplify the description, this can be written in the matrix form of Equation (4):

E(\theta) = (t - X\theta)^{T} W (t - X\theta).  (4)
Here X is the design matrix whose m rows are the training inputs p_1, p_2, ..., p_m, with the input dimension n set to 2 (a constant term plus the progress value), and W is the diagonal weight matrix of Equation (5):

W = \mathrm{diag}(\omega_1, \omega_2, \dots, \omega_m).  (5)
In addition, θ must give LWR the minimum loss at the prediction point q; the loss function of the LWR algorithm is shown in Equation (6):

J_q(\theta) = \sum_{i=1}^{m} \omega_i(q)\,(t_i - \theta^{T} p_i)^2.  (6)

The regression parameter θ can then be calculated by the weighted least-squares method, each prediction point having its own parameter θ, as shown in Equation (7); substituting the result into Equation (1) yields the predicted execution time at the corresponding progress, Equation (8):

\theta = (X^{T} W X)^{-1} X^{T} W t,  (7)

\hat{t}_q = \theta^{T} q.  (8)
The target of LWR is thus to find the θ that minimizes the loss for the present prediction, and the most important step is computing the weight function, which is obtained in two steps.

Step 1: Distance Calculation. The local region is first determined using the Euclidean distance between each sample and the prediction point q, as described in Equation (9):

d_i = \lVert p_i - q \rVert.  (9)

Step 2: Weight Calculation. The weight function depends on the distance d: the greater the distance from the predicted point, the smaller the assigned weight. We use the Gaussian kernel function of Equation (10), in which γ controls the rate at which the weight decreases with distance and is set to 0.08 in this paper:

\omega_i = \exp\!\left(-\frac{d_i^{2}}{2\gamma^{2}}\right).  (10)
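To make the procedure concrete, the following sketch implements the locally weighted regression of Equations (1)-(10) with NumPy: Gaussian weights around the query progress q, the closed-form weighted least-squares solution of Equation (7), and the prediction of Equation (8). Function and variable names are illustrative; this is a minimal reading of the method under the stated assumptions, not the authors' implementation. γ = 0.08 follows the paper.

```python
import numpy as np

GAMMA = 0.08  # kernel bandwidth, as set in the paper

def lwr_predict(progress, exec_time, q, gamma=GAMMA):
    """Predict execution time at progress q with locally weighted regression.

    progress:  (m,) array of observed progress values
    exec_time: (m,) array of corresponding execution times
    q:         query progress at which to predict
    """
    # Design matrix with a bias column: x_i = (1, p_i)
    X = np.column_stack([np.ones_like(progress), progress])
    # Gaussian kernel weights decaying with distance from q (Eq. 9-10)
    d = np.abs(progress - q)
    w = np.exp(-d**2 / (2 * gamma**2))
    W = np.diag(w)
    # Weighted least squares: theta = (X^T W X)^{-1} X^T W t  (Eq. 7)
    theta = np.linalg.solve(X.T @ W @ X, X.T @ W @ exec_time)
    # Prediction at q (Eq. 8)
    return np.array([1.0, q]) @ theta

p = np.array([0.1, 0.2, 0.3, 0.4, 0.5])
t = np.array([5.0, 10.2, 14.8, 20.1, 25.3])
print(lwr_predict(p, t, q=0.6))  # extrapolated execution time at 60% progress
```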
According to the consumption and benefit of launching or not launching a backup task, the profit of each choice for the cluster can be expressed as in Equations (11) and (12):

\text{profit}_{\text{launch}} = \alpha \cdot \text{gain}_{\text{launch}} - \beta \cdot \text{cost}_{\text{launch}},  (11)

\text{profit}_{\text{none}} = \alpha \cdot \text{gain}_{\text{none}} - \beta \cdot \text{cost}_{\text{none}},  (12)

where α and β represent the weights of the benefit and of the cost to the cluster. A backup task for an identified Straggler is launched only when the following condition is satisfied, so that the maximum efficiency is reached:

\text{profit}_{\text{launch}} > \text{profit}_{\text{none}}.  (13)
Here we let η replace α/β; substituting Equations (11) and (12) into (13) and dividing by β, the condition can be simplified as follows:

\eta\,(\text{gain}_{\text{launch}} - \text{gain}_{\text{none}}) > \text{cost}_{\text{launch}} - \text{cost}_{\text{none}}.  (14)
Here the gain and cost terms are determined by t_b, the running time of a backup task, and by φ, the load factor of the cluster, defined as the ratio of the number of pending tasks to the number of free containers in the cluster.
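A minimal sketch of the launch decision under this cost-benefit model follows. The concrete gain and cost expressions in the paper were not recoverable, so the ones below are illustrative placeholders built from η, t_b, and φ as defined above, together with the LWR estimate of a task's remaining time.

```python
def should_launch_backup(t_rem, t_b, phi, eta=1.0):
    """Decide whether to launch a backup task for a suspected Straggler.

    t_rem: predicted remaining time of the running task (from LWR)
    t_b:   estimated running time of a backup copy of the task
    phi:   cluster load factor = pending tasks / free containers
    eta:   weight ratio alpha/beta trading benefit against cost

    The gain/cost expressions below are placeholders, not the paper's
    exact formulas: launching saves roughly (t_rem - t_b) of job time,
    at the cost of occupying a container that phi pending tasks per
    free container are competing for.
    """
    gain = t_rem - t_b   # time saved if the backup finishes first
    cost = t_b * phi     # opportunity cost of the occupied container
    return eta * gain > cost  # launch condition in the spirit of Eq. (14)

# Example: 60 s predicted remaining, 20 s backup estimate, lightly loaded cluster
print(should_launch_backup(t_rem=60.0, t_b=20.0, phi=0.5))  # True
```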

4 Results and Evaluation
In this section, we first test the performance of our prediction model against linear regression and the actual values. The LWR-SE strategy is then evaluated against Hadoop-None, LATE, and MCP in a heterogeneous cloud environment under three different scenarios.

4.1 Experimental Environment Preparation
We use 64-bit Ubuntu Server as the operating system, and the experimental platform is Hadoop-2.6.0. There are eight virtual nodes in the Hadoop cluster; each server consists of four Intel® Xeon® CPUs, 288 GB of memory in total, and up to 10 TB of hard drive.
Table 3 shows detailed information about each node. In the Hadoop framework it is common to use workloads such as Wordcount and Sort as the experimental benchmarks; they are available in the Purdue MapReduce Benchmarks Suite.

The predictions of the LWR model and of linear regression are compared with the actual values in Fig. 3 and Fig. 4, respectively, where the red line represents the prediction error rate. The predictive accuracy of the LWR model considerably outperforms linear regression, especially once the progress reaches 80% and beyond. The root mean square error (RMSE) is used to evaluate prediction accuracy, calculated as in Equation (15):

\mathrm{RMSE} = \sqrt{\frac{1}{n}\sum_{i=1}^{n}(p_i - \hat{p}_i)^2},  (15)

where p_i is the actual value and \hat{p}_i the predicted value. Table 3 and Table 4 show the RMSE results of fifteen tasks randomly selected from the Wordcount and Sort jobs. The average prediction RMSE values of Wordcount and Sort are 1.56 and 1.75, owing to some unusually large values mainly caused by resource contention and non-local data in the copy phase of the Reduce process. If these outliers are ignored, the average prediction RMSE of Wordcount and Sort drops to 0.91 and 0.86.
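For completeness, a one-line RMSE computation over the per-task predictions (a sketch, assuming the actual and predicted values are collected into arrays):

```python
import numpy as np

def rmse(actual, predicted):
    """Root mean square error between actual and predicted values (Eq. 15)."""
    actual, predicted = np.asarray(actual), np.asarray(predicted)
    return np.sqrt(np.mean((actual - predicted) ** 2))

print(rmse([10.0, 12.0, 14.0], [10.5, 11.5, 14.2]))  # ~0.42
```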

4.2 Performance of the LWR-SE Strategy in Heterogeneous Scenarios
Three kinds of cluster workload scenarios are configured to evaluate the performance of LWR-SE: a Normal Load Scenario, a Busy Load Scenario, and a Busy Load with Data Skew Scenario. The final results are reported as the best, worst, and average outcomes of each strategy.

Performance of the LWR-SE Strategy in a Busy Load Scenario. A busy load scenario leaves the cluster with limited resources for additional replication. It is therefore all the more necessary to ensure the accuracy of speculative execution, since low accuracy causes cluster resources to be occupied irrationally and consequently slows down the whole cluster. The busy load scenario was configured by running other computing-intensive and/or IO-intensive tasks simultaneously, with Wordcount and Sort jobs submitted every 150 seconds. As can be seen in Fig. 6, LWR-SE also performs well when running Sort jobs in the busy load scenario. In terms of job execution time (JET), in the average case LWR-SE completed 9.7% earlier than MCP, 24.9% earlier than LATE, and 30.6% earlier than Hadoop-None. When considering cluster throughput (CT), LWR-SE increased throughput by 9.3% compared with MCP and 36.1% compared with LATE.

5 Conclusion
In this paper, we propose a strategy named LWR-SE based on the relationship between task progress and execution time, which obtains higher local prediction accuracy and maximizes the benefit to the cloud system. The experimental results show that it is superior to MCP, LATE, and Hadoop-None.