FMInference/FlexGen: Running large language models like OPT-175B/GPT-3 on a single GPU. Focusing on high-throughput large-batch generation
FlexGen is a high-throughput generation engine for running large language models with limited GPU memory (e.g., a 16GB T4 GPU or a 24GB RTX3090 gaming card!).
If there’s a good, open source LLM, that could be a very big deal