Many developers are integrating large language models (LLMs) into their applications to process user input and generate responses. This integration presents challenges such as:
- Implementing rate limiting to prevent a single user from exceeding a predefined number of tokens and incurring excessive costs.
- Collecting metrics on token usage per model, request, or user.
- Providing a standardized entry point for various providers (OpenAI, Anthropic, Self-Hosted LLMs).
This project addresses these challenges by offering a gateway that facilitates interaction with different LLMs.
The gateway can be run in the following way (currently no prebuilt Docker image is available; open an issue if you want one):
```
OPENAI_TLS=true OPENAI_PORT=443 OPENAI_DOMAIN="api.openai.com" cargo run --release
```
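The same variables can also point the gateway at a self-hosted, OpenAI-compatible backend. The host name and port below are placeholders rather than project defaults, and TLS stays disabled simply by not setting `OPENAI_TLS`:
```
OPENAI_PORT=8000 OPENAI_DOMAIN="my-selfhosted-llm.internal" cargo run --release
```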
After that, you can send a request to the gateway:
```
curl -v -X POST http://127.0.0.1:8080/v1/chat/completions \
  -d '{"model": "gpt-4o", "messages": [{"role": "system", "content": "what are the best football players all time?"}], "max_tokens": 250, "temperature": 0.1, "stream": true}' \
  -H "Authorization: Bearer <API_KEY>"
```
The gateway can be configured via flags or environment variables. The following flags are available:
```
Usage: genai-gateway [OPTIONS]
Options:
--openai-tls
Enable TLS for downstream OpenAI compatible endpoints [env: OPENAI_TLS=]
--openai-port <OPENAI_PORT>
Port to use for downstream OpenAI compatible endpoints [env: OPENAI_PORT=] [default: 443]
--openai-domain <OPENAI_DOMAIN>
Domain to use for downstream OpenAI compatible endpoints [env: OPENAI_DOMAIN=] [default: 0.0.0.0]
--http-proxy-port <HTTP_PROXY_PORT>
Port to use for HTTP proxy [env: HTTP_PROXY_PORT=] [default: 8080]
--http-proxy-metrics-port <HTTP_PROXY_METRICS_PORT>
Port to use for HTTP proxy metrics [env: HTTP_PROXY_METRICS_PORT=] [default: 9090]
--enable-rate-limiting
Enable rate limiting on user key [env: ENABLE_RATE_LIMITING=]
--rate-limiting-redis-connection-string <RATE_LIMITING_REDIS_CONNECTION_STRING>
Redis connection string for the rate limiter [env: RATE_LIMITING_REDIS_CONNECTION_STRING=] [default: redis://127.0.0.1:6379/0]
--rate-limiting-redis-pool-size <RATE_LIMITING_REDIS_POOL_SIZE>
Redis pool size for the rate limiter [env: RATE_LIMITING_REDIS_POOL_SIZE=] [default: 5]
--rate-limiting-window-duration-size-min <RATE_LIMITING_WINDOW_DURATION_SIZE_MIN>
Rate limiting window duration size in minutes [env: RATE_LIMITING_WINDOW_DURATION_SIZE_MIN=] [default: 60]
--rate-limiting-max-prompt-tokens <RATE_LIMITING_MAX_PROMPT_TOKENS>
Rate limiting max prompt tokens [env: RATE_LIMITING_MAX_PROMPT_TOKENS=] [default: 1000]
--rate-limiting-user-header-key <RATE_LIMITING_USER_HEADER_KEY>
Rate limiting user header key [env: RATE_LIMITING_USER_HEADER_KEY=] [default: user]
-h, --help
Print help
-V, --version
Print version
```
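For example, per-user rate limiting can be enabled entirely through environment variables. The sketch below uses the documented defaults for the Redis connection string and header key and overrides only the token budget with an illustrative value:
```
ENABLE_RATE_LIMITING=true \
RATE_LIMITING_REDIS_CONNECTION_STRING="redis://127.0.0.1:6379/0" \
RATE_LIMITING_MAX_PROMPT_TOKENS=5000 \
RATE_LIMITING_USER_HEADER_KEY="user" \
OPENAI_TLS=true OPENAI_PORT=443 OPENAI_DOMAIN="api.openai.com" \
cargo run --release
```
Each caller is then identified by the value of the `user` header and budgeted within the configured window (60 minutes by default).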
The gateway exposes the following metrics:

| Metric Name | Labels | Explanation |
|---|---|---|
| prompt_tokens_total | None | Number of prompt tokens |
| completion_tokens_total | None | Number of completion tokens |
| tokens_total | None | Number of total tokens |
| prompt_tokens_by_model_total | model | Number of prompt tokens by model |
| completion_tokens_by_model_total | model | Number of completion tokens by model |
| tokens_by_model_total | model | Number of total tokens by model |
| prompt_tokens_by_user_by_model_total | user, model | Number of prompt tokens by user by model |
| completion_tokens_by_user_by_model_total | user, model | Number of completion tokens by user by model |
| tokens_by_user_by_model_total | user, model | Number of total tokens by user by model |
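As a quick sanity check, these counters can be read from the metrics port (9090 by default). The `/metrics` path here is an assumption based on the usual Prometheus convention, not something stated by the CLI help:
```
# Scrape the gateway's metrics endpoint and show the aggregate token counters
curl -s http://127.0.0.1:9090/metrics | grep tokens_total
```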
Here is an example of using it with the LangChain client:
```python
# Import path assumes a recent LangChain; on older versions use `from langchain.llms import OpenAI`
from langchain_openai import OpenAI

client = OpenAI(
    openai_api_base="http://localhost:8080",  # GenAI Gateway URL
    openai_api_key=config["api_key"],  # OpenAI API key; the gateway forwards it to the downstream OpenAI endpoint
    model_name="gpt-4o",  # Model name
    logit_bias=None,
    default_headers={"user": "user1"},  # User header which the rate limiter uses to enforce rate limiting per total tokens
)
```
Keep in mind that you need to set the user header on each request to enforce per-user rate limiting.
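When calling the gateway directly rather than through a client library, the same header can be attached to each request. The header name matches the default `--rate-limiting-user-header-key` value, and the request body is a minimal illustration:
```
curl -X POST http://127.0.0.1:8080/v1/chat/completions \
  -H "Authorization: Bearer <API_KEY>" \
  -H "user: user1" \
  -d '{"model": "gpt-4o", "messages": [{"role": "user", "content": "hello"}], "max_tokens": 50}'
```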
To build the project, run the following command:
```
cargo build
```
Or build a Docker image:
```
make docker-build-linux
```
To run tests, run the following command:
```
make test
```
To format the code, run the following command:
```
make fmt
```
To lint the code, run the following command:
```
make lint
```