In this example, we deploy Infinity embedding server as a dockerized model. Infinity is a high-throughput, low-latency REST API server for serving vector embeddings.
To deploy a dockerized model, all you need is a config.yaml. It specifies how to build your Docker image, start the server, and manage resources. Let’s break down each section.
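For orientation, the sections discussed below all live in a single config.yaml. The skeleton here is only a hedged sketch: the base_image, docker_server, and resources values are illustrative assumptions (Infinity's public Docker image, default port, and health endpoints), not the exact configuration used in this example.
config.yaml
build_commands:         # pre-download model weights into the image (detailed below)
  - ...
environment_variables:  # runtime settings for the server (detailed below)
  ...
base_image:
  image: michaelf34/infinity:latest  # assumed public Infinity image; pin the tag you need
docker_server:
  start_command: infinity_emb v2 --model-id BAAI/bge-small-en-v1.5  # assumed start command
  server_port: 7997                  # Infinity's default port; adjust if you override it
  predict_endpoint: /embeddings      # OpenAI-compatible embeddings route
  readiness_endpoint: /health
  liveness_endpoint: /health
resources:
  accelerator: L4                    # illustrative GPU choice
  use_gpu: true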
The build_commands section pre-downloads the model weights during the image build so that the model is ready as soon as the container starts.
config.yaml
build_commands: # optional step to download the weights of the model into the image
  - sh -c "HF_TOKEN=$(cat /secrets/hf_access_token) infinity_emb v2 --preload-only --no-model-warmup --model-id BAAI/bge-small-en-v1.5 --revision main"
The environment_variables section defines the essential runtime settings: the Hugging Face access token, the per-request batch size, the queue size limit, and a flag that disables usage tracking.
config.yaml
environment_variables:
  hf_access_token: null
  # constrain api to at most 256 sentences per request, for better load-balancing
  INFINITY_MAX_CLIENT_BATCH_SIZE: 256
  # constrain model to a max backpressure of INFINITY_MAX_CLIENT_BATCH_SIZE * predict_concurrency = 10241 requests
  INFINITY_QUEUE_SIZE: 10241
  DO_NOT_TRACK: 1
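The queue-size comment above ties INFINITY_QUEUE_SIZE to predict_concurrency. The snippet below is a hedged sketch of where that knob typically lives in a Truss-style config.yaml; the value 40 is an illustrative assumption chosen so that 256 × 40 roughly matches the queue bound, not a setting taken from this example.
config.yaml
runtime:
  predict_concurrency: 40  # illustrative; 256 sentences per request × 40 concurrent requests ≈ the 10241-request queue bound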