A method and system for scheduling workflows to reduce failures in a grid computing environment are disclosed. The method and system disclosed herein schedule the workflows based on failure predictions for nodes in the grid computing environment.
Method and System for Scheduling of Workflows to Reduce Failures in Grid Computing
Disclosed is a method and system for incorporating failure probabilities and characteristics of nodes in the scheduling of workflows to reduce failures in a grid computing environment. In order to schedule the workflows, the method and system disclosed herein involve collecting information about the performance of nodes of the grid computing environment. The information includes machine characteristics, network characteristics, hardware upgrade history, and job execution and failure statistics for each node of the grid computing environment. This information is collected over a sufficiently long time period so that the statistics derived from it are more accurate. Based on the collected information, the method and system disclosed herein further include computing a failure probability for each node. Thereafter, the workflows are scheduled based on the failure probabilities. For example, the system may schedule critical workflows on more robust nodes in the grid computing environment rather than on nodes that are prone to failures. The scheduling may be performed based on predefined tolerance thresholds of the failure probabilities. Additionally, the schedule may be generated based on priorities of data transfers and jobs in the workflows.
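By way of illustration only, the steps of computing failure probabilities and scheduling against tolerance thresholds may be sketched as follows. The disclosure does not specify an estimator or an assignment rule, so everything here is an assumption: the names NodeStats, Job, failure_probability, and schedule are hypothetical, the failure probability is taken to be a smoothed ratio of failed jobs to executed jobs, and jobs are placed in priority order on the least-loaded node whose failure probability is within the job's tolerance threshold.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class NodeStats:
    """Job execution and failure statistics collected for one node (hypothetical type)."""
    name: str
    jobs_run: int
    jobs_failed: int


@dataclass
class Job:
    """A workflow job with a priority and a predefined tolerance threshold (hypothetical type)."""
    name: str
    priority: int      # higher value means more critical
    tolerance: float   # maximum acceptable node failure probability


def failure_probability(stats: NodeStats) -> float:
    # Assumed estimator: failed/run ratio with add-one smoothing, so that
    # a node with little or no history is not treated as perfectly reliable.
    return (stats.jobs_failed + 1) / (stats.jobs_run + 2)


def schedule(jobs: list[Job], nodes: list[NodeStats]) -> dict[str, Optional[str]]:
    """Map each job name to a node name, or to None if no node is acceptable."""
    probs = {n.name: failure_probability(n) for n in nodes}
    load = {name: 0 for name in probs}
    assignment: dict[str, Optional[str]] = {}

    # Place the most critical jobs first.
    for job in sorted(jobs, key=lambda j: j.priority, reverse=True):
        eligible = [name for name, p in probs.items() if p <= job.tolerance]
        if not eligible:
            assignment[job.name] = None  # no node meets the tolerance threshold
            continue
        # Prefer the least-loaded eligible node; break ties by robustness.
        chosen = min(eligible, key=lambda name: (load[name], probs[name]))
        load[chosen] += 1
        assignment[job.name] = chosen
    return assignment


if __name__ == "__main__":
    nodes = [NodeStats("robust", 1000, 10), NodeStats("flaky", 1000, 200)]
    jobs = [Job("critical", priority=10, tolerance=0.05),
            Job("batch", priority=1, tolerance=0.5)]
    print(schedule(jobs, nodes))
```

In this sketch a critical job with a tight threshold can only land on a node with a low estimated failure probability, while a job with a loose threshold may be placed on a failure-prone node when the robust node is already in use. An actual embodiment may also weigh the machine characteristics, network characteristics, and hardware upgrade history described above, which this sketch omits.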
The figure below illustrates an exemplary architecture of the system for scheduling workflows in the grid computing environment. The system includes a failure modeler and a workflow scheduler. Th...