serving-llms-vllm Guide

Name: serving-llms-vllm
Author: davila7

Serves LLMs with high throughput using vLLM's PagedAttention and continuous batching. Use when deploying production LLM APIs, optimizing inference latency/throughput, or serving models with limited GPU memory. Supports OpenAI-compatible endpoints, quantization (GPTQ/AWQ/FP8), and tensor parallelism.

27,615 starsby davila7

When to use serving-llms-vllm

How to use serving-llms-vllm

serving-llms-vllm is a Claude skill in the SKILL.md format. Add it to your Claude environment from the source repository below, then it activates as a user-invocable skill when your task matches its description.

Skill source

https://raw.githubusercontent.com/davila7/claude-code-templates/main/cli-tool/components/skills/ai-research/inference-serving-vllm/SKILL.md

Details

PlatformClaude

CategoryAI & ML

Invocationuser-invocable

Modelany

Maintainerdavila7

LicenseMIT

serving-llms-vllm Guide

When to use serving-llms-vllm

How to use serving-llms-vllm

Details

Resources