INTRODUCTION
High availability and reliability are crucial for different software systems. In Cloud Computing, for example, software reliability and availability stay as concerns for customers (CISCO, 2012; CDW, 2015; Kim, 2009). Previous works introduce software aging as a pertinent issue in this area (Araujo et al., 2011a; Araujo et al., 2011b; Matos et al., 2012; Torquato et al., 2015; Umesh and Srinivasan, 2017).
The software aging phenomenon consists in a gradual increase in software failure rate or performance degradation during its execution (Parnas, 1994). These effects usually happen because of errors accumulation in software state. This accumulation can lead software to hangs and total failures (Grottke et al., 2008; Huang et al., 1995). Systems with long-time of execution may suffer from software aging effects (Huang et al., 1995). On Cloud Computing systems, the VMM (Virtual Machine Monitor) is liable to suffer software aging, as presented in (Matos et al., 2012; Torquato et al., 2015).
Software rejuvenation is the countermeasure to software aging. Software rejuvenation consists of a proactive technique to clean software aging effects by rolling it back to a stable status. Software rejuvenation usually lies on an application restart or a system reboot (Cotroneo et al., 2014). Previous works propose a schedule to submit software rejuvenation actions (Melo et al., 2013a; Melo et al., 2013b) to minimize system downtime caused by these operations. Other techniques use genetic algorithms and time series forecasting to predict software aging behavior and support software rejuvenation actions (Umesh et al., 2017a; Umesh et al., 2017b). More details of software aging and rejuvenation are in Software Aging and Rejuvenation section.
Due to characteristics of software aging, it is hard to determine its roots. Errors, memory leaks and other aging-related symptoms are non-expected events (Torquato et al., 2015). A proper approach to deal with software aging and rejuvenation issues is to investigate and analyze them in an isolated environment. By running experiments and tests to understand software aging symptoms and rejuvenation effectiveness.
To support this type of research, this paper presents a methodology to support software aging and rejuvenation experiments. The approach is named SWARE (Stress-WAit-REjuvenation). The SWARE approach has three phases. (A) Stress Phase, which aims to observe workload exposure impacts in the internal software state. (B) Wait Phase which observes software behavior after the stress workload submission. This phase seeks to highlight software aging effects in the system. As software aging is a cumulative process, its effects may remain even after workload exposure. If the software returns to a stable state without software rejuvenation action, there is no evidence of software aging symptoms. (C) Rejuvenation Phase, with the goal to perceive consequences of software rejuvenation action (Melo, 2014; Torquato et al., 2015). This phase highlights the effectiveness of software rejuvenation action to mitigate software aging effects observed in the Wait Phase.
The phases adjustment depends on selected software for aging testing. The phases are sequential. The end of a phase triggers the start of next. The proposed approach does not comprise the detection of Time to Aging Related Failure (TTARF). SWARE Methodology section contains details about proposed approach.
Case Study section has an experimental setup which uses the proposed approach. This experiment consists of an investigation of software aging and rejuvenation on OpenNebula/KVM 12 Private Cloud (Torquato et al., 2015). In this case study, we investigate software aging symptoms in KVM 1.0 component using an accelerated workload. We also checked rejuvenation effectiveness through a VM Live Migration process. The obtained results are in the Results and Analysis section. From results, it is possible to understand system behavior during the three phases of the experiment. Rejuvenation Phase results show that VM Live Migration enables software rejuvenation of KVM software. We also present a numerical analysis of the results obtained from the Wait and Rejuvenation phases. These results show that the system may require more than 12 days to recover without software rejuvenation.
Related Works section has the related works comparison, and Conclusions and Future Work section presents our conclusions and future works. This paper is an improvement and extension of our previous articles (Torquato et al., 2017) and (De Melo et al., 2017).
SOFTWARE AGING AND REJUVENATION
Software aging is the accumulation of aging-related bugs effects. Aging-related bugs often appear when the system reaches conditions (e.g., lack of computational resources) which are difficult to reproduce (Vaidyanathan and Trivedi, 2001). The consequences of bugs activation lead to software performance degradation (or its failure rate increases). Software aging can influence the system from hangs to total failures (Huang et al., 1995).
A feasible way to determine software aging existence is to observe system monitoring reports to find anomalous behavior (Valentim et al., 2016; Cotroneo et al., 2014). The paper (Garg et al., 1998) presents a methodology based on SNMP distributed tool for monitoring OS resources on a LAN of UNIX machines to observe software aging existence.
The paper (Huang et al., 1995) presents first definitions of software rejuvenation. We can define software aging as a proactive technique to avoid aging effects to reach critical levels. Software rejuvenation is also considered a cost-effective because it does not require knowledge about roots of aging effects (Cotroneo et al., 2014). The rejuvenation actions rely on restart an application to conduct it to a clean state, without aging effects accumulated. Some papers propose scheduling of software rejuvenation actions (Melo et al., 2013a; Melo et al., 2013b) to determine when to perform rejuvenation actions to maximize overall system availability.
Experiments to measure and observe software aging symptoms may have a long duration. Some papers (Matos et al., 2012; Araujo et al., 2011a) proposes an investigation with accelerated experiments. Thus, it is possible to observe software internal state alterations in a shorter time.
SWARE METHODOLOGY
The experiment methodology is obtained from the papers (Torquato et al., 2017) and (De Melo et al., 2017). This methodology has two preliminary steps. First, the selection of software component to test. Second, the choice of a specific workload to stress this component. For example, a specific workload for Web Server is a workload of connections high rate of requests. The workload exposure aims to induce the system to operate in different levels of usage, seeking to trigger aging-related bugs. It is essential to ensure that workload does not cause premature failures. And, the monitoring activity should not cause high system intrusion.
After these preliminary steps, we can apply the proposed approach. As aforementioned, the SWARE methodology has three phases (Stress, Wait and Rejuvenation). The details of each are in next sections.
Stress Phase
This phase aims to stress the system with the selected workload. The stress workload leads to system internal state degradation. Monitoring reports of this phase should present an increase in software failure rate or a decrease in software performance. Figure 1 depicts the expected behavior of internal system state.
The stress phase duration varies according to workload submitted to the system. Workload submission stops when resources usage or performance degradation reach a critical level. At this point, probably aging bugs already be activated as the system passes through different usage states.
Wait Phase
The primary goal of Wait phase is to observe software aging symptoms existence. Software aging effects remain in the system even without incoming workload. After Stress Phase, there are two possibilities: (i) system recovers from workload overhead and returns to a stable state; (ii) or system persists degraded. If the software returns to a stable state without rejuvenation action, there is no evidence of software aging existence. In that conditions, the workload submitted to the system only causes overhead in resources usage.
Figure 2 presents a possible behavior of system state which highlights software aging evidence. Figure 3 shows a likely behavior of the system state which does not highlight evidence of software aging.
The duration of Wait phase should be long enough to ensure that system persists in a degraded state. Usually, to achieve this goal, Wait Phase should during approximately the same time of Stress phase.
Rejuvenation Phase
A requisite for Rejuvenation Phase start is the software rejuvenation action selection. This selection depends on the component stressed in the previous phase. Software rejuvenation usually relies on restart application or system reboot, but in some situations, other types of rejuvenation may also be valid.
Rejuvenation phase starts with the software rejuvenation action submission. The primary goal of this phase is to observe impacts of rejuvenation action on internal system state. Figures 4 and 5 present variations in software state when software rejuvenation is useful or not.
Figure 6 presents a flowchart with a summary of proposed approach.
CASE STUDY
System Architecture
The considered Cloud Computing testbed uses OpenNebula VIM 3.6 and VMM KVM 1.0. The environment has four Physical Machines (PMs) and one Virtual Machine (VM). The VM runs an Apache Web Server with a simple HTML page on Ubuntu Server 12.04 operating system. FrontEnd Machine is responsible for managing Cloud Environment. Main Node is the PM which executes VM. Standby Node is a spare host to receive VM Live Migration purposes. Stresser is an external machine which is responsible for sending workload and monitor Cloud system. All components are in a Private Network. The Table 1 has the Physical Machine configurations. Figure 7 depicts the testbed architecture.
Table 1. Physical Machine Configurations
Name |
Description |
Processor |
RAM |
Software |
FrontEnd |
Cloud Manager |
Intel Core i3 – 3.10 GHz |
4GB |
Ubuntu Server 12.04 (kernel 3.2.0-23), OpenNebula 3.6 |
Main Node, Standby Node |
Execution Nodes |
Intel Core i3 – 3.10 GHz |
4GB |
Ubuntu Server 12.04 (kernel 3.2.0-23), KVM 1.0 |
Stresser |
Cloud Monitor and Stresser |
Intel Core 2 Quad – 2.66 GHz |
4GB |
Ubuntu Desktop 12.10 (kernel 3.5.0-36), Autobench and monitoring tools |
|
Workload Selection
The selected workload to stress KVM is a sequential operation of mounting and unmounting 15 Virtual Disks (of 1GB) on VM (Matos et al., 2012). The pseudo-algorithm of this workload is in Algorithm 1.
Algorithm 1 – Software aging workload |
loop
while AttachedDisks < 15 do
Mount(Disk1GB);
Wait(15 seconds);
end while
while AttachedDisks >= 1 do
Unmount(Disk1GB);
Wait(15 seconds);
end while
end loop
|
Besides the mount and unmount workload, we decide to add a workload to Apache Web Server. We want to observe Web Server performance impacts during the experiment. To select the workload, we conducted a capacity test considering the same testbed used in software aging experiment. The capacity test uses the httperf tool with Autobench (Mosberger and Jin, 1998). This benchmark tool sends requests to the Web Server and collects results as response time (in milliseconds) and the number of errors observed. In httperf, the response time is the time between sending the first byte of a request and receiving the first byte of reply. And, the amount of errors takes account of errors such as connection timeout, socket timeout and connection refused. The rate of requests is
Observing these results, we selected the workload of 2000 requests per second. The results show that the server can handle this rate of requests with low response time and errors. Figure 10 shows a summary of the selected workload to software aging and rejuvenation experiments.
Software Rejuvenation Strategy
OpenNebula/KVM Clouds allow system managers to perform VM Live Migration. VM Live Migration consists in remapping a VM from a PM to another with reduced operation downtime (Clark et al., 2005). Previous studies (Machida et al., 2013; Melo et al., 2013a) shows VM Live Migration as a support mechanism to VMM software rejuvenation. Figure 11 presents software rejuvenation strategy used in the experiments.
On Stress Phase, the system is receiving workload to stress VMM software. In this early stage, the system does not present software aging effects yet. The Standby Node VMM is active but not receiving any system requests. Wait Phase starts when system status suffers from software aging. As Main Node VMM manages VM, software aging effects will affect VM performance too. VM Live Migration triggers the Rejuvenation Phase.
When VM arrives in the Standby Node, it can leverage a fresh state VMM. Thus, previous VMM software aging effects will not affect VM performance or failure rate. Finally, the Main Node restart removes software aging status.
System Monitoring reports support phases duration decision. Based on principles presented in SWARE Methodology section, Table 2 shows each phase period for our experiment. The entire process during 13 consecutive days.
Table 2. Experiment Phases Duration
Phase |
Duration |
1 |
2 |
3 |
4 |
5 |
6 |
7 |
8 |
9 |
10 |
11 |
12 |
13 |
Stress Phase |
6d |
|
|
|
|
|
|
|
|
|
|
|
|
|
Wait Phase |
5d |
|
|
|
|
|
|
|
|
|
|
|
|
|
Rej. Phase |
2d |
|
|
|
|
|
|
|
|
|
|
|
|
|
Disks Mount and Unmount |
6d |
|
|
|
|
|
|
|
|
|
|
|
|
|
Web Requests |
13d |
|
|
|
|
|
|
|
|
|
|
|
|
|
|
RESULTS AND ANALYSIS
CPU and RAM Monitoring
The Figures 10 and 11 present results of system resources monitoring. To improve results visualization, the graphics present monitoring data from both PMs: Main Node and Standby Node. Stress Phase and Wait Phase in these plots are the results from Main Node monitoring and the Rejuvenation Phase is the result from Standby Node monitoring (which receives VM Live Migration). All plots contain limits of Stress Phase, Wait Phase and Rejuvenation Phase. The monitoring intervals are 30 seconds.
The CPU utilization results contain four metrics: USER - CPU utilization percentage for User-level processes; SYS - CPU usage for Kernel-level processes; IO - Waiting for In/Out operations; IDLE - CPU idle percentage, excluding time waiting for In/Out.
Figure 12 presents results of CPU utilization. In the Stress Phase is possible to observe significant IO requests rate for CPU. This behavior occurs because PM has to communicate with VM during mount disk workload and also has to redirect income network traffic to VM. In the Wait Phase, CPU utilization tends to return to normal levels. But, SYS requests rate remains at higher levels than average (comparing to Rejuvenation Phase). As KVM resides on Linux Kernel (which is responsible for SYS requests), CPU SYS requests rate may return to normal state after a cleanup action. Rejuvenation Phase presents SYS requests rate returning to normal levels.
Figure 13 depicts results of RAM monitoring. Results for Wait Phase reveals that RAM consumption persists at high levels. In this phase, PM continues to receive web requests from the network. But, Rejuvenation Phase results presents a decreasing usage of RAM resources.
Extra consumption of resource is necessary when a PM receives a VM migration. Thus, there is a peak of RAM usage at the beginning of Rejuvenation Phase. After this, RAM usage results show a decreasing behavior.
Web Server Metrics Results
VM depends on VMM to interact with PM resources. Then, VMM state affects the applications and services which run in VM. Thus, applications and services monitoring may present possible effects of software aging. The results in this section show monitoring data collected from benchmark tool of Web Server.
Figures 14 and 15 present results of Web Server monitoring. As expected, Web Server suffers effects of high workload exposure in Stress Phase. Response time and Amount of Errors were increasing during this phase. Wait Phase results show Errors and Response Time remaining at high levels. Rejuvenation phase brings the system to a stable state.
Numerical Analysis
In this section, we show the results of a brief numerical analysis of obtained results. We selected the Response Time results of Web Server (Figure 14) for this analysis. First, we compute the details of the response time of the Web Server with software aging effects (Wait Phase results) and without software aging (Rejuvenation Phase). The summary of these results is in Table 3. This analysis aims to analyze the effects of software aging and rejuvenation in the response time of the Web Server. Therefore, we ignored the results of Stress Phase. It is important to highlight that the Stress Phase goal is only to accelerate software aging bugs activation. The obtained results reveal that the response time in a Web Server with software aging effects surpass the response time in a Web Server without software aging effects in more than 860%.
Table 3. Web Server Response Time Results (ms) Summary
Phase |
Min. |
1st Qu. |
Median |
Mean |
3rd Qu. |
Max. |
Wait Phase |
0.0 |
848.5 |
934.9 |
917.9 |
1004.8 |
2888.6 |
Rejuvenation Phase |
0.0 |
0.5 |
0.5 |
1.063 |
0.600 |
2982.9 |
|
We also perform a trend analysis in the results of Wait Phase. The goal of the trend analysis is to verify the existence of a trend in the Web Server response time. We used the Mann-Kendall Test. The Mann-Kendall test is usually applied in software aging tests (Grottke et al., 2006; Garg et al., 1998; Machida et al., 2013). The Mann-Kendall analysis (Mann, 1945) checks the null hypothesis, H0, which shows that there is no trend in the data during the time, against the alternative hypothesis, H1, which indicates an upward or a monotonic downward trend in the data. As software aging is a cumulative process, the Mann-Kendall test can be used to reveal patterns of software internal state degradation. Among Mann-Kendall tests results, we have the Z-value which is used to accept or reject the null hypothesis. Z-value close to zero suggests no trend in the data; a high absolute value indicates the existence of a trend. Figure 16 shows the results and trend of Web Server response time during the Wait Phase.
Table 4 presents the results of statistical analysis made in the data. The results show Mann-Kendall Z-value is lower than zero. Therefore, it is possible to reject the null hypothesis (no trend in the data). The negative value indicates a monotonic downward trend in the data. To calculate the slope of the monotonic trend, we used the Sen method (Sen, 1968). Table 4 also presents the estimated slope and the 95% confidence interval for this slope.
Table 4. Web Server Response Time Trend Analysis
Parameter |
Value |
Mann-Kendall Z-Value |
-2.3 |
Estimated slope |
-0.0009 ms/s |
95% confidence interval for the slope |
(-0.0001 ms/s, -0.0016 ms/s) |
|
These results show that the downward trend obeys the following function: y = -0.0009x + 939.6568. We compute the time to system returns to normal operation. To conduct this calculation, we consider the Web Server Response Time in the Rejuvenation Phase, which is of 1.063 milliseconds. This calculation shows that the time to Web Server return to normal operation (with Response Time equals to 1.063 ms) is about 12 days. The Figure 17 below shows the results comparing the Wait Phase and Rejuvenation Phase.
General Discussion
Observing CPU results of Stress Phase in Figure 10, it is possible to notice a similar behavior as presented in (Matos et al., 2012). Still, in Stress Phase, RAM results (Figure 11) and Web Server response time results (Figure 12) show similar results as on (Grottke et al., 2006).
As software aging causes internal software state degradation, its consequences may persist on software state until software rejuvenation (or failure) occurs (as explained in SWARE Methodology section). Results of Wait Phase shows a degraded software state. The numerical analysis reveals that, without rejuvenation actions, the system only achieves a normal operation state (with Response Time in 1.063 ms) after 12 days. It is important to highlight that this estimative considers that the workload is constant and of 2000 requests per second.
Results of Rejuvenation Phase emphasize software aging evidence on KVM software. After VM migration, VM arrives on a fresh state KVM software. As VM migration suffices to software internal state degradation removal, it is possible to corroborate that VM migration is a useful technique to KVM software rejuvenation.
RELATED WORK
The authors of (Matias et al., 2010) present an approach to apply accelerated degradation tests on software aging experiments. The paper uses a particular aging factor to control the aging effects on the experimental setup. This aging factor is obtained by sensitivity analyses based on a statistical design of experiment. We also applied accelerated tests on the Stress Phase of our experiment. Different from the authors of the mentioned paper, the SWARE approach also comprises software rejuvenation.
The paper (Li et al., 2002) presents a methodology to estimate software aging effects by using time-series analyses. The authors analyzed the behavior of a Web Server under a varying workload. The primary goal is to detect and estimate resource exhaustion time due to software aging effects. This paper helps us to define technologies used in our case study. As presented in the mentioned paper, we also used a Web Server and the httperf tool in our experimental setup. The paper also presents an extensive statistical analysis used in the resources estimation. Different from this paper we used a generic workload to induce software aging bugs activation. The SWARE methodology is simpler to apply as it does not require statistical expertise.
The paper (Grottke et al., 2006) provide valuable inputs on how to perform software aging experiments on a Web Server. The main goal of the authors is to conduct a software aging analysis of a Web Server. Different from the results presented in the mentioned paper, we also include the software rejuvenation effectiveness test on SWARE approach.
The paper (Matias et al., 2006) offers a comprehensive approach to software aging and rejuvenation experiments on a Web Server. Besides the techniques to observe software aging problems, the authors implemented a rejuvenation agent to mitigate aging effects in the Web Server. The presented results show substantial reduction of software aging when the rejuvenation agent is integrated into the environment. The proposed rejuvenation agent uses a predefined interval to submit software rejuvenation in the environment. Different from this paper, the Wait Phase of SWARE approach allows the system manager to decide when to perform rejuvenation action. Therefore, it is possible to reduce overhead caused by recurrent software rejuvenation actions. Nevertheless, the mentioned paper provides helpful insights into the SWARE methodology.
Cotroneo et al. (Cotroneo et al., 2010) provide a broad investigation of software aging causes in Linux Operating Systems. The adopted approach aims to trace kernel activities to observe possible software aging effects. The results of the paper show that the filesystem operation causes significant contribution in software aging indicators. This result may explain the behavior of Stress Phase of our experiment when the VM filesystem is dealing with software aging workload. Matias et al. (Matias et al., 2010) also present a methodology to measure software aging effects through OS kernel observation and instrumentation. Instead of showing specific causes of software aging, the SWARE approach aims to provide an overview of system status during and after software aging effects.
The paper (Silva et al., 2006) shows a study of software aging and rejuvenation in a SOAP-based Servers. The authors ran a variety of scenarios with different configurations. The adopted approach is focused on studies of software aging and rejuvenation in SOAP-based servers. The SWARE approach aims to be more generic and flexible to other types of software.
The paper (Melo et al., 2017) presents an investigation of software aging on OpenStack Cloud Computing Platform. The authors used a testbed which runs OpenStack, Apache and MySQL database. The presented results show that the MySQL processes present software aging issues. The authors also used a particular workload to stress the system and present trend analysis for resources consumption. The considered workload consists of sequential operations of start and terminates VM instances. We also adopted a sequential workload in our case study. However, the SWARE approach also comprises the observation of software aging problems and software rejuvenation effectiveness.
The paper (Meng et al., 2016) presents a comprehensive approach to investigating software aging and rejuvenation in a J2EE Application Server. The authors used a hierarchical approach to submit software rejuvenation in the system. The adopted methodology has two main steps: (i) software aging tests and (ii) application of the hierarchical software rejuvenation mechanism. Our experiments also comprise software aging and rejuvenation phases. But, we also include the Wait Phase to highlight effects caused by software aging bugs activation.
CONCLUSIONS AND FUTURE WORK
This paper presented the SWARE methodology. The SWARE is a comprehensive approach to investigating aging symptoms and rejuvenation effectiveness on software systems. The SWARE approach has three phases aiming to highlight software aging effects symptoms and rejuvenation effectiveness. (I) Stress Phase, when software aging workload reaches system to stress investigated component. (II) Wait Phase, to perceive indicators of software aging. (III) Rejuvenation Phase, which aims to detect rejuvenation action effectiveness on the environment.
Case Study section presents a case study to validate SWARE approach. This case study aims to investigate KVM software aging and also to study VM Live Migration effectiveness as a software rejuvenation action. The investigation shows results of software aging symptoms on KVM. Wait Phase highlights software aging effects on resources consumption and quality of service of a Web Server. Finally, after VM Live Migration it is possible to notice that degradation effects clean-up. The numerical analysis shows the relevance of the VM Live Migration as a rejuvenation mechanism. Using mathematical approximations, we conclude that the system may take more than 12 days to recover from aging effects. The VM Live Migration usually takes less than 10 seconds. Therefore, the system can improve service quality by applying this technique.
The major contribution of this paper is a generic methodology to conduct software aging and rejuvenation effectiveness investigation. The case study proposed consists of a Cloud Computing environment. But, the methodology phases guidelines can be reproduced in other types of software systems. This paper also presents (as a case study) a software aging and rejuvenation study on VM Live Migration as software rejuvenation for KVM hypervisor. The results of proposed case study show practical results of VM live migration effectiveness as rejuvenation for KVM.
The approach does not quantify software aging effects. With the lack of statistical techniques on data, it is hard to define software aging failure time and a proper rejuvenation schedule. The approach may require a substantial time to produce expected results. The Stress phase may not produce expected results on software aging bugs activation.
Future research directions aim to investigate further SWARE approach application on different scenarios as Software Defined Networking, Network Function Virtualization, Virtualized Containers and Fog Computing. Other research lines seek to improve approach adding multiple component software aging investigations.
ACKNOWLEDGEMENTS
This paper is an extended and improved version of our previous works (Torquato et al., 2017) and (De Melo et al., 2017).