feat(apigw): add API Gateway response streaming support (#207)

Replace ALB + Lambda architecture with API Gateway REST API + Lambda
using response streaming for SSE support. This provides:

- No VPC required, reducing complexity and cost
- Native streaming support via API Gateway response streaming
- Pay-per-request pricing model

Changes:
- Add Lambda Web Adapter to Dockerfile for streaming support (see the sketch after this list)
- Replace BedrockProxy.template with API Gateway configuration
- Update README with new deployment options and latest models
- Update architecture diagram for API Gateway flow
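For context on the Lambda Web Adapter change: wiring the adapter into a container image typically amounts to one extra `COPY` plus a couple of environment variables. A minimal Dockerfile sketch follows; the base image, adapter version, port, and entrypoint are assumptions for illustration, not the exact contents of this commit.

```dockerfile
# Hypothetical sketch of wiring the Lambda Web Adapter into the image.
FROM public.ecr.aws/docker/library/python:3.12-slim

# The adapter runs as a Lambda extension; copying it into /opt/extensions is
# the documented installation pattern (the version tag here is an assumption).
COPY --from=public.ecr.aws/awsguru/aws-lambda-adapter:0.8.4 /lambda-adapter /opt/extensions/lambda-adapter

WORKDIR /app
COPY . .
RUN pip install --no-cache-dir -r requirements.txt

# Tell the adapter which port the app listens on and enable response streaming.
ENV AWS_LWA_PORT=8000 AWS_LWA_INVOKE_MODE=response_stream

# Assumed app module path for illustration only.
CMD ["uvicorn", "api.app:app", "--host", "0.0.0.0", "--port", "8000"]
```

Because the adapter rides along as an extension, the same image keeps working unchanged outside Lambda.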
Author: Mengxin Zhu
Date: 2025-12-05 10:54:13 +08:00
Committed by: GitHub
Parent: 0411454b3a
Commit: b41633b826
4 changed files with 136 additions and 225 deletions

README.md

@@ -4,9 +4,16 @@ OpenAI-compatible RESTful APIs for Amazon Bedrock
## What's New 🔥
-This project now supports **Claude Sonnet 4.5**, Anthropic's most intelligent model with enhanced coding capabilities and complex agent support, available via global cross-region inference.
+**API Gateway Response Streaming Support** - You can now deploy with Amazon API Gateway REST API instead of ALB, enabling true response streaming for better latency and cost optimization. See [Deployment Options](#deployment-options) for details.
-It also supports reasoning for both **Claude 3.7 Sonnet** and **DeepSeek R1**. Check [How to Use](./docs/Usage.md#reasoning) for more details. You need to first run the Models API to refresh the model list.
+**Latest Models Supported:**
+- **Claude 4.5 Family**: Opus 4.5, Sonnet 4.5, Haiku 4.5 - Anthropic's most intelligent models with enhanced coding and agent capabilities
+- **Amazon Nova**: Nova Micro, Nova Lite, Nova Pro, Nova Premier - Amazon's native foundation models with multimodal support
+- **DeepSeek**: DeepSeek-R1 (reasoning), DeepSeek-V3.1 - Advanced reasoning and general-purpose models
+- **Qwen 3**: Qwen3-32B, Qwen3-235B, Qwen3-Coder-30B, Qwen3-Coder-480B - Alibaba's latest language and coding models
+- **OpenAI OSS**: gpt-oss-20b, gpt-oss-120b - Open-source GPT models available via Bedrock
+It also supports reasoning for **Claude 4/4.5** (extended thinking and interleaved thinking) and **DeepSeek R1**. Check [How to Use](./docs/Usage.md#reasoning) for more details. You need to first run the Models API to refresh the model list.
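As a quick illustration of the Models API step above, a minimal sketch, assuming `OPENAI_API_KEY` and `OPENAI_BASE_URL` already point at your deployed gateway:

```python
# Minimal sketch: call the Models API to see which model IDs the gateway
# exposes. Assumes OPENAI_API_KEY / OPENAI_BASE_URL are set, e.g.
#   OPENAI_BASE_URL=http://localhost:8000/api/v1
from openai import OpenAI

client = OpenAI()
for model in client.models.list():
    print(model.id)
```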
## Overview
@@ -46,13 +53,18 @@ Please make sure you have met below prerequisites:
### Architecture
-The following diagram illustrates the reference architecture. Note that it also includes a new **VPC** with two public subnets only for the Application Load Balancer (ALB).
+The following diagram illustrates the reference architecture. It uses [Amazon API Gateway response streaming](https://aws.amazon.com/blogs/compute/building-responsive-apis-with-amazon-api-gateway-response-streaming/) with Lambda for SSE support.
![Architecture](assets/arch.png)
-You can also choose to use [AWS Fargate](https://aws.amazon.com/fargate/) behind the ALB instead of [AWS Lambda](https://aws.amazon.com/lambda/), the main difference is the latency of the first byte for streaming response (Fargate is lower).
+### Deployment Options
-Alternatively, you can use Lambda Function URL to replace ALB, see [example](https://github.com/awslabs/aws-lambda-web-adapter/tree/main/examples/fastapi-response-streaming)
+| Option | Pros | Cons | Best For |
+|--------|------|------|----------|
+| **API Gateway + Lambda** | No VPC required, pay-per-request, native streaming support, lower operational overhead | Potential cold starts | Most use cases, cost-sensitive deployments |
+| **ALB + Fargate** | Lowest streaming latency, no cold starts | Higher cost, requires VPC | High-throughput, latency-sensitive workloads |
+You can also use Lambda Function URL as an alternative, see [example](https://github.com/awslabs/aws-lambda-web-adapter/tree/main/examples/fastapi-response-streaming)
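To make the streaming behavior in the table concrete, here is a minimal client-side sketch of consuming an SSE response through the gateway (the model ID is one of the examples used later in this README):

```python
# Minimal sketch: stream a chat completion through the gateway over SSE.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY and OPENAI_BASE_URL from the environment
stream = client.chat.completions.create(
    model="global.anthropic.claude-haiku-4-5-20251001-v1:0",
    messages=[{"role": "user", "content": "Write a haiku about rivers."}],
    stream=True,
)
for chunk in stream:
    # each SSE event carries an incremental delta
    if chunk.choices and chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="", flush=True)
```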
### Deployment
@@ -105,8 +117,8 @@ After creation, you'll see your secret in the Secrets Manager console. Make note
**Step 3: Deploy the CloudFormation stack**
1. Download the CloudFormation template you want to use:
-   - For API Gateway + Lambda... (original: For Lambda): [`deployment/BedrockProxy.template`](deployment/BedrockProxy.template)
-   - For Fargate: [`deployment/BedrockProxyFargate.template`](deployment/BedrockProxyFargate.template)
+   - For API Gateway + Lambda: [`deployment/BedrockProxy.template`](deployment/BedrockProxy.template)
+   - For ALB + Fargate: [`deployment/BedrockProxyFargate.template`](deployment/BedrockProxyFargate.template)
2. Sign in to AWS Management Console and navigate to the CloudFormation service in your target region.
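If you prefer scripting over the console, the same deployment can be done with boto3. This is a hypothetical sketch only: the parameter key `ApiKeySecretArn` and the stack/region values are assumptions, so check the template's `Parameters` section for the real names.

```python
# Hypothetical sketch: deploy the downloaded template programmatically.
import boto3

cfn = boto3.client("cloudformation", region_name="us-east-1")
cfn.create_stack(
    StackName="BedrockProxyAPI",
    TemplateBody=open("BedrockProxy.template").read(),
    Parameters=[
        # Assumed parameter name; verify against the template before use.
        {"ParameterKey": "ApiKeySecretArn", "ParameterValue": "arn:aws:secretsmanager:..."},
    ],
    Capabilities=["CAPABILITY_IAM"],
)
```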
@@ -227,7 +239,7 @@ For more information about creating and managing application inference profiles,
This proxy now supports **Prompt Caching** for Claude and Nova models, which can reduce costs by up to 90% and latency by up to 85% for workloads with repeated prompts.
**Supported Models:**
-- Claude 3+ models (Claude 3.5 Haiku, Claude 3.7 Sonnet, Claude 4, Claude 4.5, etc.)
+- Claude models (Claude 3.5 Haiku, Claude 4, Claude 4.5, etc.)
- Nova models (Nova Micro, Nova Lite, Nova Pro, Nova Premier)
**Enabling Prompt Caching:**
@@ -249,7 +261,7 @@ client = OpenAI()
# Cache system prompts
response = client.chat.completions.create(
-    model="us.anthropic.claude-3-7-sonnet-20250219-v1:0",
+    model="global.anthropic.claude-haiku-4-5-20251001-v1:0",
    messages=[
        {"role": "system", "content": "You are an expert assistant with knowledge of..."},
        {"role": "user", "content": "Help me with this task"}
@@ -271,7 +283,7 @@ curl $OPENAI_BASE_URL/chat/completions \
-H "Content-Type: application/json" \
-H "Authorization: Bearer $OPENAI_API_KEY" \
-d '{
"model": "us.anthropic.claude-3-7-sonnet-20250219-v1:0",
"model": "global.anthropic.claude-haiku-4-5-20251001-v1:0",
"messages": [
{"role": "system", "content": "Long system prompt..."},
{"role": "user", "content": "Question"}
@@ -334,9 +346,11 @@ print(response)
This application does not collect any of your data. Furthermore, it does not log any requests or responses by default.
-### Why not used API Gateway instead of Application Load Balancer?
+### Why choose API Gateway vs ALB?
-Short answer is that API Gateway does not support server-sent events (SSE) for streaming response.
+**API Gateway + Lambda** uses [API Gateway response streaming](https://aws.amazon.com/blogs/compute/building-responsive-apis-with-amazon-api-gateway-response-streaming/) with [Lambda Web Adapter](https://github.com/awslabs/aws-lambda-web-adapter) to support SSE streaming without requiring a VPC. This is a cost-effective, serverless option with up to 10 minutes timeout.
+**ALB + Fargate** provides the lowest streaming latency with no cold starts, ideal for high-throughput workloads.
### Which regions are supported?
@@ -360,9 +374,9 @@ The API base url should look like `http://localhost:8000/api/v1`.
### Any performance sacrifice or latency increase by using the proxy APIs
-Comparing with the AWS SDK call, the referenced architecture will bring additional latency on response, you can try and test that on you own.
+Compared with direct AWS SDK calls, the proxy architecture will add some latency. The default API Gateway + Lambda deployment provides good streaming performance with Lambda response streaming.
-Also, you can use Lambda Web Adapter + Function URL (see [example](https://github.com/awslabs/aws-lambda-web-adapter/tree/main/examples/fastapi-response-streaming)) to replace ALB or AWS Fargate to replace Lambda to get better performance on streaming response.
+For lowest latency on streaming responses, consider the ALB + Fargate deployment option which eliminates cold starts and provides consistent performance.
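A minimal sketch for measuring that latency yourself, via time to first streamed token (the model ID is an example):

```python
# Minimal sketch: measure time-to-first-token through the gateway.
import time

from openai import OpenAI

client = OpenAI()
start = time.perf_counter()
stream = client.chat.completions.create(
    model="global.anthropic.claude-haiku-4-5-20251001-v1:0",
    messages=[{"role": "user", "content": "Say hello."}],
    stream=True,
)
for chunk in stream:
    if chunk.choices and chunk.choices[0].delta.content:
        # the first content delta includes cold start, routing, and model overhead
        print(f"Time to first token: {time.perf_counter() - start:.2f}s")
        break
```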
### Any plan to support SageMaker models?