SmartScaleSystems (S3): AI-Driven Resource Management for Efficient and Sustainable Large-Scale Distributed Systems

Primary supervisor

Mohammad Goudarzi

In SmartScaleSystems (S3), we aim to design and build resource management solutions that learn from usage patterns, predict future needs, and allocate resources so as to minimize the latency, energy consumption, and cost of running diverse applications in large-scale distributed systems. This project offers researchers and students a chance to work on AI-driven infrastructure management, distributed computing, and energy-aware computing, preparing them for roles in industry and research.

Key Components and Example Scenarios

  1. Predictive Resource Allocation and Load Balancing:
    • Example: Imagine an e-commerce platform that experiences a surge in user activity during sales events. SmartScaleSystems (S3) would predict this peak demand using ML models and allocate additional resources to avoid downtime, adjusting dynamically as traffic fluctuates.
    • For researchers and students, this component focuses on developing ML models to predict resource needs, improving load distribution to avoid system bottlenecks, and ensuring low-latency performance.
  2. Energy-Efficient Operations with Carbon-Aware Scheduling:
    • Example: For non-urgent data processing, SmartScaleSystems (S3) could schedule tasks for off-peak hours when more renewable energy is available, reducing carbon emissions.
    • Researchers and students will design policies that align workloads with green energy availability, minimizing energy costs and environmental impact. This offers opportunities to work on practical applications in sustainability for cloud and edge systems.
  3. Privacy-Enhancing Resource Management:
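    • Example: A healthcare application needs to analyze sensitive patient data across distributed nodes. SmartScaleSystems (S3) could place the computation so that raw records stay on the nodes that hold them and only model updates or aggregated results are shared.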
    • Researchers and students can explore privacy-preserving algorithms and technologies like federated learning and zero-knowledge proofs, enabling secure, distributed data processing.
  4. Self-Healing and Fault Tolerance:
    • Example: If a node in a distributed system fails, SmartScaleSystems (S3) would automatically detect this and re-route tasks to maintain uninterrupted service.
    • This research area involves developing fault-tolerant systems that adapt to hardware and software failures. Students will work on predictive maintenance and automated recovery algorithms, improving system resilience.
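
As one illustration of the carbon-aware scheduling component above, the following minimal Python sketch picks the start hour for a deferrable job so that its run window has the lowest total forecast carbon intensity while still finishing by a deadline. It is not part of SmartScaleSystems (S3) itself: the function name `best_start_hour`, the hourly forecast values, and the units (gCO2/kWh) are all hypothetical and chosen only for illustration.

```python
from typing import Sequence


def best_start_hour(carbon_forecast: Sequence[float],
                    duration_hours: int,
                    deadline_hour: int) -> int:
    """Return the start hour whose run window has the lowest total carbon intensity.

    carbon_forecast[h] is the forecast intensity for hour h (hypothetical units).
    The job occupies hours [start, start + duration_hours) and must end
    no later than deadline_hour.
    """
    if duration_hours <= 0:
        raise ValueError("duration_hours must be positive")
    latest_start = min(deadline_hour, len(carbon_forecast)) - duration_hours
    if latest_start < 0:
        raise ValueError("job cannot finish before the deadline")

    best_start, best_cost = 0, float("inf")
    for start in range(latest_start + 1):
        # Total intensity over the hours the job would occupy.
        cost = sum(carbon_forecast[start:start + duration_hours])
        if cost < best_cost:
            best_start, best_cost = start, cost
    return best_start


# Hypothetical 8-hour forecast with a dip in the middle of the window.
forecast = [450, 420, 380, 210, 180, 190, 300, 410]
print(best_start_hour(forecast, duration_hours=2, deadline_hour=8))
```

A real policy would also have to account for forecast uncertainty, job priorities, and capacity limits; the sliding-window search here only shows the basic trade-off between delaying work and lowering emissions.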

Research Areas for Master’s and PhD Students

  1. AI-Enhanced Resource Forecasting and Optimization:
    • Research Focus: Developing and testing ML algorithms for predicting resource demand and optimizing allocations.
    • Skills Gained: Students will work with real-time data analytics, reinforcement learning, and predictive modeling to create efficient resource utilization strategies.
  2. Carbon-Aware and Sustainable Scheduling Algorithms:
    • Research Focus: Creating algorithms that factor in carbon intensity and renewable energy availability to minimize environmental impact.
    • Skills Gained: Students will study sustainability in computing, including integrating AI with real-time energy data and carbon monitoring, contributing to low-impact, sustainable cloud operations.
  3. Distributed Federated Learning for Resource Management:
    • Research Focus: Implementing federated learning to optimize resource management across multiple sites without compromising data privacy.
    • Skills Gained: This research offers experience in federated learning, data security, and privacy, preparing students for data-sensitive applications in domains such as healthcare and finance.
  4. Dynamic Scaling and Load Balancing with Reinforcement Learning:
    • Research Focus: Designing adaptive load-balancing algorithms using reinforcement learning to dynamically respond to changes in workload.
    • Skills Gained: Students will develop expertise in reinforcement learning and its applications in cloud and edge environments, optimizing performance in real-world scenarios.
  5. Edge-Cloud Collaboration for Low-Latency Resource Management:
    • Research Focus: Investigating hybrid architectures that balance workload between edge devices and cloud resources to minimize latency.
    • Skills Gained: Students will study distributed systems and hybrid architectures, ideal for IoT applications, real-time data processing, and autonomous systems.
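
To give a flavor of the reinforcement-learning-based load balancing described in the list above, here is a deliberately simplified Python sketch. It uses an epsilon-greedy multi-armed bandit, which is a stateless special case of reinforcement learning and not necessarily the method used in the project. The class name `EpsilonGreedyBalancer`, the `simulate` helper, and the server latency figures are hypothetical.

```python
import random


class EpsilonGreedyBalancer:
    """Routes requests to the server with the lowest observed average latency,
    exploring a random server with probability epsilon."""

    def __init__(self, n_servers: int, epsilon: float = 0.1, seed: int = 0):
        self.epsilon = epsilon
        self.rng = random.Random(seed)
        self.counts = [0] * n_servers
        self.avg_latency = [0.0] * n_servers

    def choose(self) -> int:
        # Try every server once before trusting the estimates.
        for server, count in enumerate(self.counts):
            if count == 0:
                return server
        if self.rng.random() < self.epsilon:
            return self.rng.randrange(len(self.counts))  # explore
        # Exploit: pick the server with the lowest estimated latency.
        return min(range(len(self.counts)), key=lambda s: self.avg_latency[s])

    def update(self, server: int, latency_ms: float) -> None:
        # Incremental running mean of the observed latency.
        self.counts[server] += 1
        n = self.counts[server]
        self.avg_latency[server] += (latency_ms - self.avg_latency[server]) / n


def simulate(steps: int = 2000, seed: int = 0) -> EpsilonGreedyBalancer:
    true_mean_ms = [120.0, 40.0, 80.0]  # hypothetical per-server mean latencies
    env_rng = random.Random(seed + 1)
    balancer = EpsilonGreedyBalancer(len(true_mean_ms), epsilon=0.1, seed=seed)
    for _ in range(steps):
        server = balancer.choose()
        latency = max(1.0, env_rng.gauss(true_mean_ms[server], 10.0))
        balancer.update(server, latency)
    return balancer


balancer = simulate()
print(balancer.counts)
```

In this simulation most requests end up on the fastest server, while the occasional exploratory request keeps the latency estimates for the other servers current. A full reinforcement learning formulation would add state (for example queue lengths or time of day) and handle workloads whose behavior changes over time.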

This project will allow students to gain hands-on experience in building, testing, and deploying intelligent resource management tools that not only improve performance but also reduce the environmental footprint of distributed computing systems. By working on SmartScaleSystems (S3), students will contribute to the future of sustainable and efficient computing infrastructure.

