A lightweight P2P-based cache system for model distribution on Kubernetes.
Name Story: the name Manta is inspired by the Dota 2 item Manta Style, which creates two images of your hero, much like peers in a P2P network.
Architecture
Note: llmaz is just one possible integration; Manta can be deployed and used independently.
Features Overview
Model Hub Support: Models can be downloaded directly from model hubs (Hugging Face, etc.) or object storage with no extra effort.
Model Preheat: Models can be preloaded to the whole cluster or to specified nodes to accelerate model serving.
Model Cache: Models are cached as chunks after downloading for faster model loading.
Model Lifecycle Management: The model lifecycle is managed automatically with different strategies, such as Retain or Delete.
Plugin Framework: Filter and Score plugins can be extended to pick the best candidates.
Memory Management (WIP): Manage the memory reserved for caching, with an LRU algorithm for GC.
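To make the Plugin Framework item more concrete, here is a minimal sketch of how a Filter and Score pipeline could pick a candidate. It is an illustration only, not Manta's actual API: the `Candidate` type, the `FilterPlugin` and `ScorePlugin` interfaces, the two example plugins, and the `Pick` function are all hypothetical names introduced for this example.

```go
package main

import (
	"fmt"
	"sort"
)

// Candidate is a hypothetical peer node that may serve a model chunk.
type Candidate struct {
	Name       string
	HasChunk   bool
	FreeDiskGB int
}

// FilterPlugin (hypothetical) rejects candidates that cannot serve a request.
type FilterPlugin interface {
	Filter(c Candidate) bool
}

// ScorePlugin (hypothetical) ranks candidates; a higher score is better.
type ScorePlugin interface {
	Score(c Candidate) int
}

// hasChunkFilter keeps only candidates that already hold the chunk.
type hasChunkFilter struct{}

func (hasChunkFilter) Filter(c Candidate) bool { return c.HasChunk }

// freeDiskScore prefers candidates with more free disk space.
type freeDiskScore struct{}

func (freeDiskScore) Score(c Candidate) int { return c.FreeDiskGB }

// Pick runs every filter, sums the scores of the survivors, and returns
// the highest-scoring candidate. The bool is false when none pass.
func Pick(cands []Candidate, filters []FilterPlugin, scorers []ScorePlugin) (Candidate, bool) {
	type scored struct {
		c     Candidate
		total int
	}
	var passed []scored
	for _, c := range cands {
		ok := true
		for _, f := range filters {
			if !f.Filter(c) {
				ok = false
				break
			}
		}
		if !ok {
			continue
		}
		total := 0
		for _, s := range scorers {
			total += s.Score(c)
		}
		passed = append(passed, scored{c: c, total: total})
	}
	if len(passed) == 0 {
		return Candidate{}, false
	}
	// Stable sort keeps input order among candidates with equal scores.
	sort.SliceStable(passed, func(i, j int) bool {
		return passed[i].total > passed[j].total
	})
	return passed[0].c, true
}

func main() {
	cands := []Candidate{
		{Name: "node-a", HasChunk: true, FreeDiskGB: 50},
		{Name: "node-b", HasChunk: false, FreeDiskGB: 500},
		{Name: "node-c", HasChunk: true, FreeDiskGB: 120},
	}
	best, ok := Pick(cands, []FilterPlugin{hasChunkFilter{}}, []ScorePlugin{freeDiskScore{}})
	fmt.Println(best.Name, ok) // prints "node-c true"
}
```

In this sketch node-b is dropped by the filter even though it has the most free disk, which is the usual reason to separate hard constraints (Filter) from preferences (Score).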
You Should Know Before
Manta is not an all-in-one solution for model management. Instead, it offers a lightweight way to use idle bandwidth and cost-effective disks, helping you save money.
It requires no additional components like databases or storage systems, simplifying setup and reducing effort.
All models are stored under the host path /mnt/models/.
Note: you can make a Torrent Standby by setting preheat to false (it is true by default). Preheating then happens at runtime, which slows down model loading.