Energy-Aware Scheduling of Streaming Applications on Edge-Devices in IoT-Based Healthcare

The reliance on Network-on-Chip (NoC)-based Multiprocessor Systems-on-Chip (MPSoCs) is growing in modern embedded systems to satisfy the higher performance requirements of multimedia streaming applications. Task-level coarse-grained software pipelining, also called re-timing, combined with Dynamic Voltage and Frequency Scaling (DVFS), has been shown to be an effective approach for significantly reducing the energy consumption of multiprocessor systems at the expense of additional delay. In this article, we develop a novel energy-aware scheduler for tasks with conditional constraints on Voltage Frequency Island (VFI)-based heterogeneous NoC-MPSoCs, deploying re-timing integrated with DVFS for real-time streaming applications. We propose a novel task-level re-timing approach called R-CTG and integrate it with a nonlinear-programming-based scheduling and voltage scaling approach referred to as ALI-EBAD. R-CTG aims to minimize the latency caused by re-timing without compromising energy-efficiency. Compared to R-DAG, the state-of-the-art approach designed for traditional Directed Acyclic Graph (DAG)-based task graphs, R-CTG significantly reduces the re-timing latency because it only re-times tasks that free up otherwise wasted slack. To validate our claims, we performed experiments using 12 real benchmarks; the results demonstrate that ALI-EBAD outperforms the CA-TMES-Search and CA-TMES-Quick task schedulers in terms of energy-efficiency.


I. INTRODUCTION
Healthcare is one of the fastest growing industries, with enormous potential for enhancement through the employment of technologies such as the Internet-of-Things (IoT), cloud computing, and mobile devices. An IoT-based ubiquitous healthcare system may provide patients the opportunity to lead a more independent life without the constant need for qualified medical staff to monitor their health. An advanced healthcare system using the IoT not only provides accurate medical data but also integrates an alarm system for emergency situations. The IoT is making healthcare more accessible and responsive to the needs of most users anywhere and at any time. The IoT consists of a combination of sensors and actuators, including analogue devices and cameras. Utilizing features such as video streams transmitted via the Internet, these systems are able to monitor the healthcare needs of anyone. This is considered most beneficial to people with increased support needs, such as the elderly, the infirm, or those with alternative abilities. The video and data streams are stored securely in the cloud but remain available for access by the right people when needed. An abstract IoT-based healthcare system architecture is demonstrated in Fig. 1 [1], [2].
Thanks to advancements in technology, multimedia data streaming and live video streaming have had significant positive impacts on healthcare by enabling professional, real-time virtual healthcare assistance in places where this was not possible before the era of the Internet. Multimedia applications are providing a new baseline for advanced healthcare. The literature predicted approximately 50 billion interconnected IoT digital devices by 2020 [3]. Multimedia content will constitute approximately 80% of the total Internet data traffic by 2021, and healthcare will be a significant proportion of this data. The multimedia data content is streamed over the network in an encoded form, with the video displayed to the end user and/or professional either in a recorded or prerecorded manner. Real-time video-streaming applications in the IoT are periodic in nature, as they tend to be executed repeatedly. In the IoT, video streams are usually compressed to reduce the video size and achieve better network load balancing. The MPEG encoder is executed numerous times for the whole video stream. Multimedia streaming data can be represented by a Conditional Task Graph (CTG); the tasks within the CTG depend on each other [4]. Prime examples of real-time IoT-based multimedia streaming applications in healthcare include human gait analysis, telemonitoring, and fall detection in people with infirmities [5], [6].
In real-time streaming applications, tasks depend on each other; thus, the slack within the processors of the MPSoC architecture is not efficiently utilized. Re-timing is a powerful technique applied at the application level to transform the intra-period dependencies between tasks by regrouping tasks from different periods. Concisely, re-timing transforms a dependent task model into an independent task model to efficiently utilize the resources [7]. However, the process of re-timing adds an unwanted delay, called the prologue, during video streaming. Ideally, an efficient video-streaming system should start playing the video with a reduced prologue.
Creation and manipulation of multimedia data is computationally expensive due to intensive processes such as video encoding, compression, and the Fourier Transform (FT). Consequently, Multiprocessor Systems-on-Chip (MPSoCs) have become an essential element of modern embedded systems for real-time multimedia data processing due to their higher performance, reliability, and exceptional Quality-of-Service (QoS) [8], [9]. Xilinx Zynq UltraScale+ MPSoCs and Tilera TILE-Gx72 are a few of the well-known high-performance computing architectures used in digital systems for healthcare. Examples of medical applications of multiprocessor systems include a real-time video-streaming system [10] developed to remotely monitor human ultrasound examinations. The ultrasound streams are transmitted wirelessly to a remote location where the information is accessed and analyzed by a medical specialist. Similar works in [11], [12] used heterogeneous MPSoCs and enhanced the image quality for ultrasound imaging; these developments improved the overall performance and reduced latency. In a further example, a human fall detection mechanism is presented in [13] using a ZYNQ MPSoC platform to perform operations such as segmentation, feature extraction, filtering, and recognition for distinguishing different types of falls.
Data-intensive real-time applications are increasing in the IoT; growing numbers of processing elements are therefore desirable in MPSoC designs to meet performance needs [9], [14], [15]. According to the International Technology Roadmap for Semiconductors (ITRS), MPSoCs will consist of hundreds of processors in the near future. In this case, traditional bus-based MPSoCs would become a computational bottleneck due to their higher inter-element communication requirements, leading to higher congestion and poor scalability. Alternatively, Network-on-Chip (NoC)-based communication, available in some MPSoCs, can offer improved scalability with higher flexibility [16], [17]. Recently, the Voltage Frequency Island (VFI) approach, based on the Globally Asynchronous Locally Synchronous (GALS) paradigm, has been introduced to the NoC interconnect, where the tiles/processors are partitioned into islands. Each island in an MPSoC operates at its own frequency and supply voltage to minimize the total energy consumption [18]. These attributes lead to higher throughput and lower hardware complexity, and make the VFI-based heterogeneous NoC-MPSoC (VFI-NoC-HMPSoC) the most suitable choice of computing platform for data-intensive applications [19].
Energy consumption reduction in digital systems using MPSoCs for IoT-based healthcare is an important research aspect because higher energy consumption produces an increased carbon footprint [9], [18]. Proper task scheduling approaches can reduce energy consumption and increase the performance and reliability of an embedded system [20]. Task scheduling is an NP-hard problem; therefore, different heuristics have been developed to achieve energy-efficient solutions [21]. Real-time multimedia streaming applications, as used in healthcare, are typical illustrations of static task scheduling. Dynamic Voltage and Frequency Scaling (DVFS) is a conventional approach integrated with scheduling to minimize the energy consumption of MPSoC computing architectures [22]. DVFS efficiently utilizes the available slack within the processors by dynamically reducing the supplied voltage/clock frequency without violating the tasks' deadline constraints, thereby minimizing the overall power consumption [7].
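The energy/delay trade-off that DVFS exploits can be sketched with the standard switched-capacitance model, where dynamic energy grows quadratically with supply voltage while execution time grows only linearly as frequency drops. The values below (effective capacitance, voltage/frequency pairs) are purely illustrative, not taken from this article:

```python
# Minimal sketch of the DVFS trade-off, assuming the common
# E = C_eff * V^2 * NC dynamic-energy model (values are illustrative).
def dynamic_energy(c_eff, voltage, cycles):
    """Dynamic energy of a task: effective capacitance * V^2 * clock cycles."""
    return c_eff * voltage**2 * cycles

def execution_time(cycles, freq):
    """Execution time stretches as frequency is scaled down."""
    return cycles / freq

# A 2e6-cycle task at 0.8 V / 400 MHz instead of 1.0 V / 800 MHz takes
# twice as long but consumes only 0.64x the dynamic energy.
e_hi = dynamic_energy(1e-9, 1.0, 2e6)
e_lo = dynamic_energy(1e-9, 0.8, 2e6)
```

This is why slack matters: a task may run slower and cheaper only if the stretched execution still meets its deadline.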
In this article, we investigate energy-efficient static scheduling on a VFI-NoC-HMPSoC for a set of periodic tasks with conditional precedence constraints representing a real-time periodic streaming application. Our contributions and innovations are as follows.
1) We develop a novel energy-aware static scheduler for tasks with conditional constraints on the VFI-NoC-HMPSoC computing architecture, deploying a re-timing technique integrated with DVFS for IoT-based real-time streaming applications in healthcare.
2) We present a nonlinear programming (NLP)-based scheduling and voltage scaling approach, which we refer to as ALI-EBAD. It performs task scheduling and voltage scaling in an integrated manner to steer the task scheduling towards a more energy-efficient solution.
3) We propose a novel task-level coarse-grained software pipelining approach called Re-timed CTG (R-CTG), which significantly reduces the re-timing latency compared to R-DAG [23] without an increase in energy consumption.
4) We show by experiments that ALI-EBAD deploying the VFI-NoC-HMPSoC achieves an average energy-efficiency improvement of ∼20% over CA-TMES-Search [24] and ∼25% over CA-TMES-Quick [24]. The energy-efficiency increases significantly when R-CTG is integrated with ALI-EBAD, attaining average energy savings of ∼40% and ∼45%, respectively. R-CTG efficiently reduces the prologue/latency by 50% when compared with the state-of-the-art re-timing technique R-DAG [23].
The remainder of this article is organized as follows: Section II discusses the related work performed so far on task scheduling using multiprocessors. Section III presents the application, system, and energy models used in the simulations. Section IV explains our novel offline pipelined scheduling. Section V presents experimental results, followed by the conclusion of this article in Section VI.

II. LITERATURE REVIEW
Multiprocessor systems are becoming de facto computing platforms due to their high performance and exceptional QoS. For this reason, different research studies have deployed multiprocessor computing architectures for energy-aware task scheduling.
Olafsson introduced one of the first dynamical models for the scheduling of tasks on heterogeneous multiprocessor systems [25]. Aydin et al. applied DVFS to determine the optimal voltage levels for the tasks and developed an algorithm called Earliest Deadline First (EDF) to generate a feasible task schedule [26]. Tosun [27] used Integer Linear Programming (ILP) to decrease the computational energy consumption of heterogeneous MPSoC architectures by assigning accurate voltage levels to the tasks. The authors also developed an energy-aware heuristic integrated with the EDF technique for efficient task ordering. Kumar and Vidyarthi [28] combined task mapping and discrete voltage level assignment within the single optimization loop of a Genetic Algorithm (GA) and compared the results in terms of energy savings with Genetic Algorithm-Struggle (GA-ST). Recently, Dziurzanski and Singh designed a feedback-control-based task scheduler called the Admission Control Algorithm (ACA) and performed schedulability analysis to determine the tasks that are expected to violate deadline constraints [29]. Though the task scheduling heuristics presented in [26], [27], [28], [29] efficiently reduced the energy consumption of multiprocessor computing systems, these investigations only considered independent task graphs, i.e., tasks without precedence constraints.
Other researchers investigated scheduling problems integrated with DVFS for tasks with precedence constraints to reduce the power overhead. For instance, Wang et al. formulated the scheduling problem as an ILP and reduced the computational and inter-processor communication overheads of a heterogeneous MPSoC for streaming applications. The authors obtained a solution with the minimum possible schedule length using an ILP-based algorithm and minimized the wasted slack within the schedule by deploying DVFS [30]. Chen et al. [31] applied Mixed Integer Linear Programming (MILP) on a NoC-MPSoC and developed a heuristic for generating a non-preemptive task schedule while applying a discrete voltage to each task. Ali et al. developed a meta-heuristic called Contention-aware Integrated Task Mapping and Voltage Assignment (CITM-VA) for static task mapping and performed task ordering using Earliest Latest Finish Time First (ELFTF) [9]. However, the investigations of task scheduling problems on MPSoCs in [9], [30], [31] focus only on dependent tasks represented by a Directed Acyclic Graph (DAG).
The research studies in [16], [32], [33], [34], [35] explored energy-efficient scheduling for CTGs. For example, Shin and Kim [32] developed a Nonlinear Programming (NLP)-based heuristic for assigning optimal discrete voltage levels to each task in order to reduce the computational energy consumption. Wu et al. [33] presented an algorithm that deploys the schedule table generated by Eles et al. [34] for calculating the available slack in the processors and assigns voltage levels to each task using a voltage scaling algorithm. Tariq and Wu [16] scheduled tasks represented by CTGs on a homogeneous MPSoC and formulated the scheduling problem as an NLP. An algorithm called Iterative Offline Energy-aware Task and Communication Scheduling (IOETCS) is used to perform scheduling and voltage scaling in an integrated manner. IOETCS uses the Earliest Successor-Tree-Consistent Deadline First (ESTCDF) algorithm for generating an initial task schedule and then applies voltage scaling using ILP [35]. Each of these research papers only investigated energy-aware conditional task scheduling with a single processor per VFI.
Recently, task scheduling deploying VFI-based MPSoCs has been explored in other studies that use a bus as the communication interconnect. For example, Pagani et al. [36] presented a Single Frequency Approximation (SFA) algorithm for optimal voltage assignment to the processor islands in an MPSoC architecture. The SFA is integrated with a Dynamic Programming Mapping Algorithm (DPMA) to increase the energy-efficiency and to minimize the running time. Liu and Guo [37] developed an algorithm called Voltage Island Largest Capacity First (VILCF) for task scheduling. VILCF reduces the energy consumption by fully utilizing an island that is already active before activating other islands. Han et al. [24] mapped tasks on the processors of the islands and communications on the NoC to reduce the overall makespan and inter-VFI communication. The authors developed two contention- and energy-aware task mapping and edge scheduling heuristics, called CA-TMES-Quick and CA-TMES-Search, for assigning tasks to processors and edges to the NoC. Tariq et al. [18] developed a metaheuristic for energy-efficient and contention-aware scheduling of dependent tasks with precedence constraints on the VFI-NoC-HMPSoC. Gammoudi et al. [38] scheduled periodic tasks on homogeneous NoC-VFI-MPSoC architectures deploying the well-known EDF task ordering policy. Though these investigations reduced energy consumption by utilizing appropriate task mapping and scheduling, they did not consider re-timing at the task level to further minimize the total energy consumption.
Researchers in other investigations have developed algorithms to transform the intra-period dependencies of DAG tasks into inter-period dependencies using re-timing integrated with DVFS to achieve higher energy-efficiency. Wang et al. [23] transformed dependent tasks into independent task models deploying an algorithm called R-DAG, and used a heuristic termed GeneS for voltage assignment plus task mapping. Wang et al. [39] reduced the inter-processor communication overhead using re-timing by deploying an algorithm called Joint Computation and Communication Task Scheduling (JCCTS). JCCTS combined with DVFS reduces the computational energy consumption for real-time tasks. In another research study, Wang et al. [40] reduced both memory and communication overhead using algorithms called Memory-Aware Optimal Task Scheduling (MAOTS) and Heuristic Memory-Aware Task Scheduling (HMATS). However, these scheduling approaches do not implement re-timing for tasks represented by CTGs on VFI-NoC-MPSoC architectures.
Concisely, to the best of our knowledge, no prior work has been done that focuses on energy-aware scheduling of tasks with conditional constraints represented by CTGs on VFI-NoC-HMPSoC deploying re-timing combined with DVFS technique.

III. MODELS AND DEFINITIONS
In this section, CTG is explained followed by the discussion of our computing architecture deployed for energy-aware task scheduling, and finally the energy model is presented that is used to carry out the simulations.

A. Application Model
The application in our model is represented by a CTG. A CTG is a weighted DAG, G(V, E, A, W, X) [16]. V = {v_1, v_2, ..., v_n} denotes a set of tasks; each task has a certain execution time represented by the number of clock cycles NC_{i,k} on a processor pe_k, a common period T, and an individual soft deadline d_i ≤ T. E ⊆ V × V is a set of directed edges, each denoting a dependency between two tasks.
A is a set of triplets (e_i, c_i, p(c_i)), where e_i ∈ E, and c_i and p(c_i) represent the condition associated with e_i and its probability [41], respectively. X is a set of edge weights. An edge weight χ_s ∈ X of an edge e_s = (v_i, v_j) denotes the communication volume in bits from task v_i to task v_j.
A scenario of a CTG G is a subgraph of G formed by all the tasks in a complete execution trace of the task set. Fig. 2(b) shows a scenario of the CTG G in Fig. 2(a) with a = true. Given a CTG G, an activation space AS is the set of all possible conditions, each of which corresponds to a unique scenario; for the CTG G shown in Fig. 2, c is a condition that belongs to a scenario s, and p(c) is the probability that c is true. Associated with each task is its activation probability, i.e., the probability with which the task is executed. Let S_j be the set of scenarios to which a task v_j belongs. The activation probability of v_j is calculated as follows:

p(v_j) = Σ_{s ∈ S_j} p(c_s)    (1)

where c_s is the condition associated with scenario s.
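The activation-probability computation of Eq. (1) can be sketched as follows. The scenario names, probabilities, and task memberships below are our own toy example, not data from the article:

```python
# Illustrative sketch of Eq. (1): the activation probability of a task is
# the sum of the probabilities of the scenarios in which it appears.
scenario_prob = {"a_true": 0.6, "a_false": 0.4}   # p(c_s) for each scenario s
task_scenarios = {
    "v1": {"a_true", "a_false"},   # v1 executes in every scenario
    "v2": {"a_true"},              # v2 executes only when condition a is true
}

def activation_probability(task):
    """Sum the probabilities of all scenarios the task belongs to (S_j)."""
    return sum(scenario_prob[s] for s in task_scenarios[task])
```

A task appearing in every scenario (such as the source of the CTG) thus has activation probability 1.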

B. System Model
We consider a NoC-based VFI-MPSoC with M processors, as shown in Fig. 3. Each tile contains a processor, local memory, and a network interface. In each tile, the processor executes tasks, the memory holds the scheduled tasks, and the network interface connects the processor with the router (R) of the mesh network. The processors of the target computing architecture are grouped into a set C = {c_1, c_2, c_3, ..., c_m} of m heterogeneous VFIs. Heterogeneous VFIs have processors with different energy-performance profiles; that is, one VFI contains lower-performance but more energy-efficient processors, while another consists of higher-performance but less energy-efficient processors. Each VFI c_i ∈ C of the computing system contains k homogeneous processors (processors with the same energy-performance profile). A single VFI can operate independently at n discrete voltage and frequency levels, as shown in Fig. 4.
1) Topology: Each tile of the NoC-based VFI-MPSoC is associated with a router. The NoC mesh contains N_R rows and N_C columns; thus, the total number of processors in the NoC-based VFI-MPSoC is N_R × N_C. Each router possesses five ports, of which four are used to communicate with the neighboring routers, while one is used for communicating with the processor. A link connects two routers or a router with a processor. All links are identical, full duplex, and have the same bandwidth, b_w.
2) Switching Technique: We assume virtual cut-through (VCT) switching, one of the most popular packet-switching techniques for NoC communication. In VCT, the buffer size is large and the entire packet is forwarded to the next node; thus, VCT has lower latency, higher link utilization, and a lower packet-blocking probability.
3) Routing Technique: We adopt XY routing, the most popular deterministic routing technique for NoCs. Routing decides the path of a packet from the source to the destination router/node. XY routing specifically targets the 2D-mesh topology and is the most suitable option for mesh networks. Moreover, XY routing is simple yet effective, and it is not prone to deadlock. In XY routing, packets are first routed in the x-direction and then in the y-direction.
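The deterministic XY rule described above can be sketched in a few lines. This is our own illustration of the general technique (tile coordinates and hop representation are assumptions, not the article's notation):

```python
# Minimal XY-routing sketch for an N_R x N_C mesh: a packet first moves
# along the x-direction (columns), then along the y-direction (rows).
def xy_route(src, dst):
    """Return the list of (row, col) hops from src to dst, inclusive."""
    (r, c), (rd, cd) = src, dst
    path = [(r, c)]
    while c != cd:                      # route in the x-direction first
        c += 1 if cd > c else -1
        path.append((r, c))
    while r != rd:                      # then route in the y-direction
        r += 1 if rd > r else -1
        path.append((r, c))
    return path
```

Because the path depends only on the source and destination coordinates, the route of every communication is known at scheduling time, which is what allows links to be scheduled offline like processors later in the article.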

C. Offline Schedule
We consider periodic applications, which execute repeatedly. Hence, the offline schedule of a CTG is a repeated pattern for the execution of one period of the corresponding periodic conditional dependent tasks. In this work, the offline schedule consists of both a task-to-processor allocation step and a control-step assignment. In the task-to-processor allocation step we decide which processor should execute each task, and in the control-step assignment we decide when to start each task/communication. The key notations used in this article are listed in Table I.
Suppose T is the period of the application; then T indicates the deadline of the schedule, and the schedule must complete within T. We use ρ(v_i) and Δt(v_i) to represent, respectively, the start time and execution time of a task node v_i. Similarly, ρ(v_j, L_k) and Δt(v_j, L_k) represent, respectively, the start time and transmission time of a communication node v_j on link L_k. ζ(v_i) represents the finish time of a node v_i.
In Section IV-A we discuss in detail our offline scheduling and voltage scaling approach. The energy model that is used in our simulations is also discussed in Section IV-A.

IV. SCHEDULE-AWARE PIPELINING
In this section, we discuss our coarse-grained task-level pipelining (re-timing) approach. Before presenting our proposed approach, we briefly review re-timing.
The notion of re-timing was originally introduced in [42] to reduce the cycle period of synchronous circuits. Recently, [23], [39], [40] extended re-timing to schedule applications represented by the classical DAG task model on MPSoCs; it is defined as follows.
Definition 1: Given a CTG G, re-timing is a function RT : V → Z that maps each node v_i ∈ G to an integer RT(v_i), where RT(v_i) is the number of periods of v_i rescheduled into the prologue. Re-timing v_i once, if it is legal, reschedules one period of v_i into the prologue.
From the program's point of view, re-timing regroups the loop body such that some or all dependencies within a period are transformed into inter-period dependencies. The re-timing function is valid if no reference is made to data from a future period. That is, for every edge (v_i, v_j) ∈ E, a valid re-timing function must satisfy RT(v_i) ≥ RT(v_j); if RT(v_i) < RT(v_j), the re-timing function is illegal because this condition implies a reference to unavailable data from a future period.

Fig. 5(a) shows the schedule of the first three periods of the CTG shown in Fig. 6(a). The application is scheduled on an MPSoC that consists of two processors, pe_1 and pe_2. Table II shows the execution time and energy consumption of each task of the CTG in Fig. 6(a) on the two processors; compared to pe_1, pe_2 is more energy-efficient. The schedule in Fig. 5(a) is generated by CA-TMES-Search. Notice that CA-TMES-Search fails to efficiently utilize the more energy-efficient processor pe_2 because it favors the processor on which a task can start the earliest: pe_2 remains idle until v_1 completes execution on pe_1 because of the intra-period dependency between v_1 and v_2, and CA-TMES-Search cannot utilize this slack.

Fig. 5(c) shows the schedule generated by our approach. Notice that if intra-period dependencies can be transformed into inter-period dependencies, the wasted slack can be utilized. This is obtained by regrouping tasks from different periods with computation and communication node rescheduling. As each task is periodic, in Fig. 5(c) we reschedule the periodic task v_1 and execute it one period before v_2 and v_3; the newly added period is called the prologue. In this way, the data required by v_2 and v_3 is available at the start of each period, so v_2 can start earlier on pe_2. Consequently, task v_4 can also be scheduled on pe_2.
Since our approach utilizes the available resources more efficiently, it is able to generate a more energy-efficient schedule. The energy consumption of the schedule in Fig. 5(a) is 7.5 nJ, whereas the schedule in Fig. 5(c) consumes 7 nJ. Our approach could further reduce the energy to 6 nJ if the MPSoC had another energy-efficient processor like pe_2.
Although re-timing is effective in reducing energy consumption, there is a cost associated with it: it adds a prologue. The prologue latency is the number of periods in the prologue times the period T. The number of periods in the prologue equals the maximum re-timing value RT_max of the nodes in G, RT_max = max{RT(v_i) : ∀v_i ∈ G}. Thus, the prologue latency is

prologLatency = RT_max × T.    (2)

Besides energy reduction, we also want to minimize prologLatency. Fig. 5(b) shows the re-timed schedule generated by R-DAG; compared to it, the prologue latency of the re-timed schedule generated by our approach is half. We are able to reduce the prologue latency because we take a different approach from R-DAG: we first transform the CTG into an independent task set by relaxing the precedence constraints between the nodes, and then we schedule the independent task model onto the MPSoC. Hence, the MPSoC resources are maximally utilized and do not remain idle due to precedence constraints between nodes. Finally, we calculate the re-timing values of the nodes and generate the re-timed schedule.
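The prologue-latency relation in Eq. (2) is directly computable from the re-timing values; the task names and period below are our own toy example:

```python
# Sketch of Eq. (2): prologue latency = RT_max * T, where RT_max is the
# largest re-timing value over all nodes of the CTG.
def prologue_latency(retiming, period):
    """retiming maps each node id to its RT value; period is T."""
    rt_max = max(retiming.values())
    return rt_max * period

# Example: only v1 is re-timed once, so the prologue is one period long.
rt = {"v1": 1, "v2": 0, "v3": 0}
```

Minimizing the number of re-timed tasks, as R-CTG does, therefore directly minimizes the added start-up delay of the stream.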
Algorithm 1 describes our schedule-aware software pipelining approach.

Algorithm 1: R-CTG
Input: A CTG G, task deadlines, an MPSoC
Output: Re-timed schedule
1 Use Algorithm 2 to generate the relaxed schedule π and the task-to-processor mapping map;
2 Given the CTG G and the task-to-processor mapping map, transform G into an extended graph G_e;
3 Set the re-timing values of all leaf nodes of G_e to zero;
4 for each node v_i in the reverse topological order of G_e do
...
10 Given the relaxed schedule π and the re-timing values, generate the re-timed schedule.

Our approach has three main steps.
1) Step 1 (Line 1): Use Algorithm 2 to generate the relaxed schedule π and the task-to-processor mapping map.
2) Step 2 (Line 2): Given the CTG G(V, E, A, W, X) and the task-to-processor mapping map, transform G into an extended graph G_e by adding an additional node for every directed edge in G whose tail and head nodes are mapped on different processors. An extended graph G_e is a directed acyclic graph G(V + V*, E'). V is the set of original nodes, which are kept unchanged and are called task nodes.
3) Step 3 (Lines 4-9): Calculate the re-timing values of the nodes. Given a node v_i and its child node v_j, our re-timing function is defined by (3).
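Since the exact form of Eq. (3) is not reproduced here, the sketch below uses one plausible reading of the step above (our assumption, not the article's definition): processing nodes in reverse topological order, a parent takes at least each child's re-timing value, plus one extra period whenever the relaxed schedule starts the child before the parent has finished, i.e., the child must consume data from an earlier period:

```python
# Hypothetical sketch of the reverse-topological re-timing computation.
# `children` maps node -> list of child nodes; `start`/`finish` come from
# the relaxed schedule; `reverse_topo` lists nodes leaves-first.
def retiming_values(children, start, finish, reverse_topo):
    rt = {}
    for v in reverse_topo:
        rt[v] = 0                       # leaf nodes keep RT = 0
        for child in children.get(v, []):
            # One more period is needed if v's output is not ready
            # by the time the child starts in the relaxed schedule.
            need = rt[child] + (1 if finish[v] > start[child] else 0)
            rt[v] = max(rt[v], need)
    return rt
```

Note that this rule re-times a node only when its relaxed start/finish times actually violate a child's data need, which matches the claim that R-CTG re-times only tasks that free up wasted slack.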

A. ALI-EBAD, A Relaxed Offline Scheduling and Voltage Scaling Algorithm
In this section, we describe our relaxed offline scheduling and voltage scaling approach. A relaxed schedule is generated by assuming that the precedence constraints between the nodes of a CTG do not exist, that is, the nodes are assumed to be independent. We propose an offline scheduler, ALI-EBAD (Algorithm 2), to generate the relaxed schedule.
Given a periodic CTG that models a streaming application, our main objective is to execute the application on the heterogeneous VFI-NoC-MPSoC such that the total expected energy consumption is minimized. Unlike other state-of-the-art algorithms for task scheduling, we consider energy-performance profiles, scheduling, and voltage scaling in an integrated manner in the design of ALI-EBAD.
ALI-EBAD maintains a ready list R that contains all the ready nodes and source nodes. A task is ready if all its parents have been scheduled. All the nodes of G are in R because we relax the precedence constraints between the tasks (Line 2). ALI-EBAD then repeats the following steps until all the tasks in R have been scheduled.

Algorithm 2: ALI-EBAD
Input: CTG G, matrix NC, set X, and NoC-based MPSoC
Output: Schedule π_best and an array Map reflecting the task mapping
1 Compute the successor-tree-consistent deadline of all the nodes in G;
2 Create a list R and insert in it all the nodes of G, R ← V;
3 Create an array Map of size |V|;
4 Create two empty sets V_s and V*_s;
5 repeat
6   Set E_exp^best to ∞;
7   for each v_l ∈ R do
8     for each pe_k ∈ P do
9       Tentatively map v_l to pe_k;
10      Insert v_l in V_s;
...
      Set π_best to π;
20    Set j to k;
...
    Delete v_i from R;
30 until R is empty;

1) Select, one by one, each ready task v_l ∈ R and tentatively map v_l on each processor pe_k ∈ P. For each task-processor pair (v_l, pe_k), repeat the following.
a) Insert task v_l in the set V_s (Line 10). The set V_s contains all the tasks that have been scheduled. For each parent and child node of v_l mapped on a different processor, insert a communication node in V*_s (Lines 11-14). The communication nodes are required because the precedence constraints have only been relaxed, not removed: the data has to be transmitted over the NoC from the processor where the parent node is mapped to the processor where the child node is mapped.
b) Solve the NLP described in the next section to generate the schedule π and calculate the expected energy consumption of G_s (Line 15). The schedule π specifies a unique start time and finish time for each node in G_s and a voltage setting for each island.
c) Round the voltage of each island that has been assigned an invalid voltage level by the NLP to the nearest higher valid voltage level (Line 16). Note that the schedule needs to be updated in this case. This involves re-calculating the start and finish times of the task and communication nodes under the new voltage settings such that the relative order of the task and communication nodes remains the same.
d) Delete v_l from V_s (Line 22) and the corresponding communication nodes from V*_s; v_l is deleted from V_s because its current mapping is tentative.
2) Find the task-processor pair (v_i, pe_k) such that mapping v_i to pe_k results in the minimum increase in energy consumption amongst all the pairs.
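The selection loop described above can be summarized at a high level as follows. This is only a structural sketch: in the real algorithm the cost of each tentative mapping comes from solving the NLP and voltage rounding, which we replace here with an abstract `expected_energy` callback (our own stand-in):

```python
# High-level sketch of the ALI-EBAD greedy selection loop: in each round,
# every remaining task is tentatively tried on every processor, and the
# (task, processor) pair with the smallest expected energy is committed.
def greedy_map(tasks, processors, expected_energy):
    mapping, scheduled = {}, set()
    ready = set(tasks)                  # precedence constraints are relaxed
    while ready:
        best = None
        for v in ready:
            for pe in processors:
                # Stand-in for the NLP solve over the tentative mapping.
                e = expected_energy(scheduled | {v}, {**mapping, v: pe})
                if best is None or e < best[0]:
                    best = (e, v, pe)
        _, v, pe = best
        mapping[v] = pe                 # commit the cheapest pair
        scheduled.add(v)
        ready.remove(v)
    return mapping
```

The quadratic tentative-mapping structure is what lets the energy model, rather than earliest start time alone, steer each placement decision.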

1) NLP-Based DVFS Approach:
We propose an NLP-based offline scheduler that is inspired by the schedulers proposed in [16], [43]. Before describing our NLP-based offline scheduler, we discuss the priority scheme it uses. Our approach uses the earliest successor-tree-consistent deadline first priority scheme [35] because it allows the DVFS scheme to efficiently utilize the available slack and significantly reduce energy consumption. The successor-tree-consistent deadline is defined as an upper bound on the latest finish time of a node in the CTG. Compared to the edge-consistent deadline, it is a tighter bound on the latest finish time because it takes into account the resource constraints of the MPSoC while calculating the latest finish time. Our NLP-based offline approach schedules task and communication nodes in earliest successor-tree-consistent-deadline order, which means that nodes with shorter successor-tree-consistent deadlines are scheduled earlier than nodes with longer ones.
Next, we describe our NLP-based offline scheduler.
Operating Frequency Constraints: The operating frequency f_j of each island c_j ∈ C is determined by the following constraint [44]:

f_j = ((1 + K_1)·V_j + K_2·V_bs − V_th1)^α / (K_6 · L_d · V_j)

where V_j is the supply voltage of island c_j, V_bs is the body-bias voltage, K_1, K_2, K_6 and V_th1 are circuit-dependent constants, L_d is the logic depth, and α (1.4 ≤ α ≤ 2) models the velocity saturation imposed by the technology used.
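The frequency constraint can be evaluated numerically as below. The constant values are purely illustrative defaults chosen by us for demonstration (the article only names the constants, not their values), and zero body bias is assumed:

```python
# Illustrative evaluation of the circuit-delay-based frequency model.
# All constant values here are assumptions for demonstration only.
def island_frequency(v_dd, v_bs=0.0, K1=0.06, K2=4.0, K6=5e-12,
                     L_d=10, V_th1=0.24, alpha=1.5):
    """f = ((1+K1)*Vdd + K2*Vbs - Vth1)^alpha / (K6 * L_d * Vdd)."""
    return ((1 + K1) * v_dd + K2 * v_bs - V_th1) ** alpha / (K6 * L_d * v_dd)
```

The key property the scheduler relies on is monotonicity: lowering an island's supply voltage lowers its feasible operating frequency, which stretches the execution times of all tasks mapped to that island.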

Execution and Transmission Time Constraints:
The execution time of each task node v_i ∈ V is given by the following equation: where v_i is mapped on processor pe_k, which belongs to island c_j whose frequency is f_j. Consider a communication node v_j whose parent task node v_p is mapped on pe_src and whose child task node v_c is mapped on pe_dest; the routing algorithm used by the network generates the route R_j from pe_src to pe_dest. The route R_j = <L_1, L_2, ..., L_l> is an ordered list of links, where L_1 is the first link and L_l is the last link on the route.
For communication nodes we only consider the link transmission time and ignore overheads such as inter-router delay and data copies between buffers. The transmission frequency λ of a link L_γ ∈ R_j is the minimum of the sender and receiver router frequencies. Hence, the transmission time of a communication node v_j on link L_γ ∈ R_j is given by the following constraint, where λ = min(f_u, f_v) and f_u and f_v are the frequencies of the sender and receiver routers respectively. Link Causality Constraints: In communication scheduling, network resources such as links are treated like processors in that each resource can serve only one communication at a time. Hence, communication nodes are scheduled on the links for the time they occupy them.
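As a concrete illustration of the transmission-time constraint, a minimal sketch follows; the flit width is an assumed parameter, not something fixed by the text:

```python
def link_transmission_time(data_bits, f_sender, f_receiver, link_width_bits=32):
    """Transmission time of a communication node on a single link.

    The effective link frequency lambda is the minimum of the sender and
    receiver router frequencies, as stated in the text. The link (flit)
    width of 32 bits is an illustrative assumption."""
    lam = min(f_sender, f_receiver)           # lambda = min(f_u, f_v)
    flits = -(-data_bits // link_width_bits)  # ceiling division: flit count
    return flits / lam                        # one flit per cycle at lambda
```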
Note that the route depends only on the source and destination of the communication because our network model assumes deterministic (XY) routing. Furthermore, the entire communication must be transmitted on the established route because the network model assumes circuit switching. A communication node utilizing this route must therefore be scheduled on all the links of the route, and the data traverses these links in the order they appear in the route vector.
The schedule of each communication node v_j ∈ V* on the links of the route R_j = <L_1, L_2, ..., L_l> that v_j traverses must obey the link causality constraints according to cut-through switching [18], [45]. The link causality constraints are defined as follows: Resource Exclusiveness Constraints: In a feasible schedule, nodes mapped on the same resource must not overlap. However, a schedule is still deemed feasible if mutually exclusive nodes are scheduled in the same time interval, because only one of the mutually exclusive nodes executes at run-time; thus, the resource exclusiveness constraints are not violated.
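Under cut-through switching, the head flit advances one link per hop while the remaining flits pipeline behind it. A hedged sketch of the resulting per-link start and finish times, assuming a uniform flit size and one flit per cycle per link:

```python
def cut_through_times(start, lambdas, flits):
    """Per-link (start, finish) times of one communication under
    cut-through switching. `lambdas` lists the effective frequency
    (min of the adjacent routers) of each link on the route, in order.
    Link g+1 starts one flit-time after link g (head-flit forwarding),
    while all `flits` flits must drain through each link."""
    times = []
    s = start
    for lam in lambdas:
        finish = s + flits / lam   # tail flit leaves this link
        times.append((s, finish))
        s = s + 1.0 / lam          # head flit reaches the next link
    return times
```

The monotone start times produced here are exactly what the link causality constraints enforce: the schedule on L_{γ+1} cannot begin before the head flit has crossed L_γ.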
We define resource exclusiveness constraints to order concurrent nodes mapped on the same resource in an exclusive manner. For each task node v_i ∈ V the resource exclusiveness constraints are defined as follows: where Π(v_i, pe_k) is the set of task nodes that are concurrent with v_i, have a successor-tree-consistent deadline shorter than or equal to that of v_i, and are mapped on pe_k. Nodes v_i and v_j are concurrent if they are not reachable from each other in the CTG and are not mutually exclusive.
Similarly, for each communication node v_j ∈ V* whose route is R_j = <L_1, L_2, ..., L_l>, the resource constraints are defined as follows: where Π(v_j, L_γ) is the set of communication nodes that are concurrent with v_j, have a successor-tree-consistent deadline shorter than or equal to that of v_j, and use the same link L_γ. Deadline Constraints: We define the deadline constraints so that tasks complete execution before their deadlines, as follows: Supply Voltage and Start Time Bounds: Given the minimum supply voltage V_dd^min and the maximum supply voltage V_dd^max, the following constraints define the upper and lower bounds on the supply voltage assigned to each island: The following constraint defines the lower bound on the start time of each task node v_i ∈ V: Objective Function: The objective of our NLP formulation is to minimize the total expected energy consumption, minimize E, where the total expected energy is given as follows: E_i is the energy consumed in the execution of a task v_i mapped on processor pe_k that belongs to VFI c_j: where C_eff_k is the effective switched capacitance of pe_k, L_g denotes the number of logic gates, {K_3, K_4, K_5} are technology-specific parameters, and v_bs and I_jn represent the body-bias voltage and leakage current respectively. E_u is the energy consumed in the execution of a communication node v_u that traverses the route R_u = <L_1, L_2, ..., L_l>; the parent and child task nodes of v_u are v_p and v_c respectively. E_u is calculated as follows: where E_bit(src, dest) is the energy consumed to transmit one bit from the src tile to the dest tile. E_bit(src, dest) is discussed in detail in [46].
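The structure of the objective can be illustrated with a deliberately simplified energy model. This is a sketch, not the article's exact formulation, which additionally includes the K_3-K_5 technology parameters and the body-bias term:

```python
def task_energy(c_eff, v_dd, f, exec_time, i_leak=0.0):
    """Simplified expected energy of one task: dynamic switching energy
    (C_eff * V_dd^2 per cycle, with f * exec_time cycles) plus a basic
    V_dd * I_leak leakage term. Stand-in for the article's fuller model."""
    dynamic = c_eff * v_dd ** 2 * f * exec_time
    leakage = v_dd * i_leak * exec_time
    return dynamic + leakage

def total_energy(task_terms, comm_terms):
    """Objective E: the sum of all task energies E_i and all
    communication energies E_u over the scheduled graph."""
    return sum(task_terms) + sum(comm_terms)
```

The NLP minimizes `total_energy` subject to the frequency, causality, exclusiveness, deadline, and voltage-bound constraints above, with the island voltages as the decision variables.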

V. PERFORMANCE EVALUATION
In this section we describe the experimental setup used to evaluate our scheduler on a set of real benchmarks and compare its performance against the state of the art.

A. Experimental Setup
We use the Samsung Exynos 5422 chip energy model adapted from [47]. In our simulations we deploy two types of processors: type 1 is a high-performance, high-energy-consuming Cortex A15 (big), and type 2 is a low-performance, low-power Cortex A7 (little). The Cortex A15 consumes ∼6-12 times more power than the Cortex A7 [48]. The operating frequencies and relative power consumption of both types are listed in Table III. We adopted the 70 nanometer (nm) processor technology parameters from Ali et al. [9] given in Table IV. We built the simulation environment in MATLAB version R2016a and use MATLAB's fmincon function to solve the NLP problem. We conducted the experiments on an Intel i5-3570 CPU with a clock frequency of 3.50 GHz, 16.00 GB of memory, and a 10 MB cache.
We perform experiments on the 12 real benchmarks listed in Table V and Table VI. In Table V, the robot benchmark contains tasks for automation and control. ATR is a real-time streaming application used for pattern recognition. Consumer-1 and consumer-2 consist of tasks that perform RGB-to-CMYK conversion and JPEG compression/decompression. The mp3-decoder benchmark performs Huffman Decoding (HD) and the Inverse Discrete Cosine Transform (IDCT). The office benchmark contains tasks for text processing, image rotation, and gray-scale-to-binary conversion. In Table VI, E shows the number of edges, O represents the number of OR-Fork nodes, and C denotes the number of conditions for each benchmark. The cruise-control (ctg-1) and mjpeg-decoder (ctg-2) benchmarks represent a vehicle cruise-controller application and a Motion-JPEG decoder respectively. The last four benchmarks are synthetic benchmarks with conditional precedence constraints.
We use the contention- and energy-aware task mapping and edge scheduling (CA-TMES) approach developed by Han et al. [24] as the baseline for the energy-efficiency comparison with our energy management technique, ALI-EBAD. The authors presented two scheduling techniques, CA-TMES-Search and CA-TMES-Quick, which select the processor on which a task can start earliest among all processors. CA-TMES-Search estimates the start time of each task while considering communication contention, whereas CA-TMES-Quick first maps the tasks and then determines the routes for the communications. CA-TMES-Search saves more energy than CA-TMES-Quick because it coordinates task mapping in an exhaustive way and consequently reduces the overall makespan significantly. Similarly, we compare the effectiveness of our re-timing technique, R-CTG, with the state-of-the-art approach R-DAG developed by Wang et al. [23]. R-DAG transforms a set of periodic dependent tasks into periodic independent tasks. The R-DAG technique assigns a re-timing value to each node in the DAG in reverse topological order: it assigns the value 0 to the sink node and increments the re-timing value by 1 for each parent node.
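The R-DAG assignment described above can be sketched as follows. This is our illustrative reading of the rule; in particular, taking the maximum over children when a node has several is our assumption:

```python
def rdag_retiming(children):
    """Re-timing values in the R-DAG style: sink nodes get 0, and each
    parent gets 1 + its child's value (max over children assumed for
    nodes with several children). `children` maps each node to the list
    of its child nodes; leaves map to an empty list."""
    memo = {}

    def r(node):
        if node not in memo:
            kids = children.get(node, [])
            memo[node] = 0 if not kids else 1 + max(r(k) for k in kids)
        return memo[node]

    return {n: r(n) for n in children}
```

The maximum value in the returned map corresponds to RT_max, the quantity used later to compare prologue latencies.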

B. Results and Discussion
In this section, we present experiments on the benchmarks for two scenarios: (1) without and (2) with conditional precedence constraints. We set the number of VFIs (NVFI) to 4 and the number of processors per VFI (NPI) to 4. We use two terms in the results: heterogeneous VFI-based NoC-MPSoC (VFI-NoC-HMPSoC) and homogeneous VFI-based NoC-MPSoC (VFI-NoC-MPSoC).
1) Without Re-Timing: Fig. 7 shows the energy consumption comparison of our task scheduler, ALI-EBAD, with the state-of-the-art energy management schemes CA-TMES-Search and CA-TMES-Quick. The horizontal axis represents the real benchmarks while the vertical axis denotes the energy consumption in millijoules (mJ). ALI-EBAD outperforms the CA-TMES-Search and CA-TMES-Quick scheduling techniques in terms of energy-efficiency: it achieves average energy savings of ∼15% and ∼20% over CA-TMES-Search and CA-TMES-Quick respectively when only homogeneous type 1 processors are used to form the VFI-NoC-MPSoC architecture. Unlike CA-TMES-Search and CA-TMES-Quick, ALI-EBAD performs task mapping, ordering, and voltage scaling in an integrated manner. Moreover, it schedules dependent tasks close to each other to avoid the energy dissipated by links, buffers, and switches during communication. For task scheduling using both type 1 and type 2 processors to form the VFI-NoC-HMPSoC, we randomly select the processor type for each VFI to generate a heterogeneous computing platform while ensuring unbiased experimentation. The energy-efficiency further increases to ∼20% and ∼25% over CA-TMES-Search and CA-TMES-Quick respectively when the VFI-NoC-HMPSoC computing architecture is considered during task scheduling. This further reduction in energy consumption occurs because ALI-EBAD maps higher energy-consuming tasks onto the more energy-efficient, lower-performance processors; in other words, it considers the energy-performance profiles of the processors during task scheduling.
2) With Re-Timing: Fig. 8 demonstrates the energy consumption of the real benchmarks without conditional precedence constraints when re-timing is deployed. We combine the ALI-EBAD static task scheduler with our re-timing technique R-CTG, while we integrate CA-TMES-Search and CA-TMES-Quick with R-DAG. R-CTG coarse-grained software pipelining transforms intra-period dependencies into inter-period dependencies to reclaim the wasted slack and let DVFS achieve higher energy-efficiency. The energy consumption reduces significantly when re-timing is used: compared to CA-TMES-Search without re-timing in Fig. 7, the energy-efficiency increases to an average of ∼40% and ∼45% for ALI-EBAD@VFI-NoC-MPSoC and ALI-EBAD@VFI-NoC-HMPSoC respectively. Similarly, Fig. 9 illustrates the energy consumption of the benchmarks with conditional precedence constraints. Both R-DAG and R-CTG perform similarly in terms of energy-efficiency when combined with the ALI-EBAD heuristic on both the homogeneous and heterogeneous VFI-NoC-MPSoC platforms. Although R-CTG offers no significant energy improvement over R-DAG, it reduces the maximum re-timing (RT_max) values significantly, as shown in Fig. 10: R-CTG reduces RT_max by 50% compared to R-DAG. We compare the prologue latency in terms of the maximum re-timing RT_max; the smaller the value of RT_max, the shorter the prologue latency. Unlike R-DAG, our novel re-timing technique R-CTG achieves a shorter prologue because it only re-times tasks that free up wasted slack.
Concisely, the ALI-EBAD static task scheduler on the VFI-NoC-HMPSoC outperforms CA-TMES-Search and CA-TMES-Quick: it achieves average energy savings of ∼20% over CA-TMES-Search and ∼25% over CA-TMES-Quick, and these savings increase to ∼40% and ∼45% when re-timing is deployed. R-CTG and R-DAG achieve similar energy-efficiency when integrated with ALI-EBAD, but R-CTG produces a ∼50% shorter prologue than R-DAG.

VI. CONCLUSION
The computational complexity of real-time multimedia applications is rapidly increasing; consequently, Voltage Frequency Island (VFI)-based Multiprocessor System-on-Chip (MPSoC) architectures are adopted for higher performance and effective energy management. In this article we investigated the complex scheduling problem for tasks both with and without conditional precedence constraints on a VFI-NoC-MPSoC computing platform. We proposed a novel re-timing technique, R-CTG, and integrated it with a nonlinear programming-based scheduling and voltage scaling approach referred to as ALI-EBAD. R-CTG minimizes the latency caused by re-timing without compromising energy-efficiency; it significantly reduces the re-timing latency because it only re-times tasks that free up wasted slack. We conducted experiments on 12 benchmarks, the results of which demonstrate that ALI-EBAD deployed on the VFI-NoC-HMPSoC outperforms CA-TMES-Search and CA-TMES-Quick, achieving average energy-efficiency improvements of ∼20% and ∼25% respectively. The energy savings increase significantly to ∼40% and ∼45% when R-CTG is used. Compared to the previous state-of-the-art re-timing technique, R-DAG, our coarse-grained software pipelining approach R-CTG achieves similar energy-efficiency when integrated with ALI-EBAD, but improves computational efficiency by reducing the prologue by ∼50%. In the future we plan to consider Quality-of-Experience (QoE), an interesting parameter from the user's perspective.
Umair Ullah Tariq received the master's degree from the University of Engineering and Technology, Taxila, Pakistan, and the Ph.D. degree in computer science and engineering from the University of New South Wales, Sydney, NSW, Australia. He is currently working on energy-aware task scheduling on multiprocessor systems and intrusion detection system for IoT. He has published several research papers in prominent conferences and journals. His research interests include energy-aware task scheduling, digital image processing, computer vision, and IoT security.
Haider Ali received the master's degree in electronic systems design engineering from Manchester Metropolitan University, U.K., and the Ph.D. degree from the University of Derby (UoD), U.K. He is currently a Lecturer with the Department of Electronics, Computing and Mathematics, UoD. He served with COMSATS University Islamabad, Abbottabad Campus, Pakistan, as a Lecturer from 2011 to 2016. He is currently working on energyefficient algorithms for task mappings on modern embedded systems for real-time applications. His research area of interest is electronic and biomedical systems design, Internetof-Things, algorithms design, and embedded systems. He has received two Best Paper Awards from international conferences. He has served as a member of the technical program committee for different workshops and serves many reputable journals as a reviewer.
Lu Liu (Member, IEEE) received the Ph.D. degree from the Surrey Space Center, University of Surrey. He is the Head of School of Informatics, University of Leicester. Prior to this, he was the Head of School of Electronics, Computing and Mathematics and a Professor of Distributed Computing with the University of Derby. He had worked as a Research Fellow with the WRG e-Science Center, University of Leeds. He has secured and participated in many research projects which are supported by research councils, BIS, Innovate U.K., British Council and leading industries. He has over 200 scientific publications in reputable journals, academic books and international conferences. His research interests are in the areas of data analytics, AI, cloud computing, service computing, and Internet of Things. He received the Vice-Chancellor's Awards for Excellence in Doctoral Supervision in 2018, the BCL Faculty Research Award in 2012, and was recognized as a Promising Researcher by the University of Derby in 2011. He has been a recipient of seven Best Paper Awards from international conferences and was invited to deliver seven keynote speeches at international conferences/workshops. He is a Fellow of British Computer Society and serves as an Editorial Board Member of six international journals and the Guest Editor for 19 journal special issues. He has chaired over 30 international conference and workshops, and presently or formerly serves as the program committee member for over 60 international conferences and workshops.