feat(apigw): add API Gateway response streaming support (#207)

Replace ALB + Lambda architecture with API Gateway REST API + Lambda
using response streaming for SSE support. This provides:

- No VPC required, reducing complexity and cost
- Native streaming support via API Gateway response streaming
- Pay-per-request pricing model

Changes:
- Add Lambda Web Adapter to Dockerfile for streaming support
- Replace BedrockProxy.template with API Gateway configuration
- Update README with new deployment options and latest models
- Update architecture diagram for API Gateway flow
Author: Mengxin Zhu
Date: 2025-12-05 10:54:13 +08:00 (committed by GitHub)
Parent: 0411454b3a
Commit: b41633b826
4 changed files with 136 additions and 225 deletions
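The streaming this commit enables carries OpenAI-compatible chat chunks as server-sent events through API Gateway. As a rough illustrative sketch (function names below are hypothetical, not from this repo), the SSE framing looks like:

```python
import json

def sse_event(chunk: dict) -> str:
    """Frame one OpenAI-style streaming chunk as a server-sent event."""
    return f"data: {json.dumps(chunk)}\n\n"

def sse_done() -> str:
    """OpenAI-compatible streams end with a literal [DONE] sentinel."""
    return "data: [DONE]\n\n"

# Example: two content deltas followed by the end-of-stream sentinel
chunks = [
    {"choices": [{"delta": {"content": "Hel"}}]},
    {"choices": [{"delta": {"content": "lo"}}]},
]
stream = "".join(sse_event(c) for c in chunks) + sse_done()
```

API Gateway response streaming forwards these events to the client as they are produced, instead of buffering the whole response as the REST API integration historically did.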

README.md

@@ -4,9 +4,16 @@ OpenAI-compatible RESTful APIs for Amazon Bedrock
 ## What's New 🔥
-This project now supports **Claude Sonnet 4.5**, Anthropic's most intelligent model with enhanced coding capabilities and complex agent support, available via global cross-region inference.
-It also supports reasoning for both **Claude 3.7 Sonnet** and **DeepSeek R1**. Check [How to Use](./docs/Usage.md#reasoning) for more details. You need to first run the Models API to refresh the model list.
+**API Gateway Response Streaming Support** - You can now deploy with Amazon API Gateway REST API instead of ALB, enabling true response streaming for better latency and cost optimization. See [Deployment Options](#deployment-options) for details.
+**Latest Models Supported:**
+- **Claude 4.5 Family**: Opus 4.5, Sonnet 4.5, Haiku 4.5 - Anthropic's most intelligent models with enhanced coding and agent capabilities
+- **Amazon Nova**: Nova Micro, Nova Lite, Nova Pro, Nova Premier - Amazon's native foundation models with multimodal support
+- **DeepSeek**: DeepSeek-R1 (reasoning), DeepSeek-V3.1 - Advanced reasoning and general-purpose models
+- **Qwen 3**: Qwen3-32B, Qwen3-235B, Qwen3-Coder-30B, Qwen3-Coder-480B - Alibaba's latest language and coding models
+- **OpenAI OSS**: gpt-oss-20b, gpt-oss-120b - Open-source GPT models available via Bedrock
+It also supports reasoning for **Claude 4/4.5** (extended thinking and interleaved thinking) and **DeepSeek R1**. Check [How to Use](./docs/Usage.md#reasoning) for more details. You need to first run the Models API to refresh the model list.
 ## Overview
@@ -46,13 +53,18 @@ Please make sure you have met below prerequisites:
 ### Architecture
-The following diagram illustrates the reference architecture. Note that it also includes a new **VPC** with two public subnets only for the Application Load Balancer (ALB).
+The following diagram illustrates the reference architecture. It uses [Amazon API Gateway response streaming](https://aws.amazon.com/blogs/compute/building-responsive-apis-with-amazon-api-gateway-response-streaming/) with Lambda for SSE support.
 ![Architecture](assets/arch.png)
-You can also choose to use [AWS Fargate](https://aws.amazon.com/fargate/) behind the ALB instead of [AWS Lambda](https://aws.amazon.com/lambda/), the main difference is the latency of the first byte for streaming response (Fargate is lower).
-Alternatively, you can use Lambda Function URL to replace ALB, see [example](https://github.com/awslabs/aws-lambda-web-adapter/tree/main/examples/fastapi-response-streaming)
+### Deployment Options
+| Option | Pros | Cons | Best For |
+|--------|------|------|----------|
+| **API Gateway + Lambda** | No VPC required, pay-per-request, native streaming support, lower operational overhead | Potential cold starts | Most use cases, cost-sensitive deployments |
+| **ALB + Fargate** | Lowest streaming latency, no cold starts | Higher cost, requires VPC | High-throughput, latency-sensitive workloads |
+You can also use Lambda Function URL as an alternative, see [example](https://github.com/awslabs/aws-lambda-web-adapter/tree/main/examples/fastapi-response-streaming)
 ### Deployment
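Whichever option is deployed, the stack exports an `APIBaseUrl` that becomes `OPENAI_API_BASE`. A small illustrative sketch of how the API Gateway variant's URL is shaped (the REST API id and region below are placeholders, not real resources):

```python
def api_base_url(rest_api_id: str, region: str, stage: str = "api") -> str:
    """Build the API Gateway invoke URL that the stack exports as APIBaseUrl.

    The 'api' path segment is the API Gateway stage; '/v1' is the
    application's route prefix.
    """
    return f"https://{rest_api_id}.execute-api.{region}.amazonaws.com/{stage}/v1"

base = api_base_url("abc123def4", "us-west-2")

# With the OpenAI SDK you would then point the client at this base URL, e.g.:
# client = OpenAI(base_url=base, api_key="<your key>")
# for chunk in client.chat.completions.create(model=..., messages=..., stream=True):
#     print(chunk.choices[0].delta.content or "", end="")
```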
@@ -105,8 +117,8 @@ After creation, you'll see your secret in the Secrets Manager console. Make note
 **Step 3: Deploy the CloudFormation stack**
 1. Download the CloudFormation template you want to use:
-   - For Lambda: [`deployment/BedrockProxy.template`](deployment/BedrockProxy.template)
-   - For Fargate: [`deployment/BedrockProxyFargate.template`](deployment/BedrockProxyFargate.template)
+   - For API Gateway + Lambda: [`deployment/BedrockProxy.template`](deployment/BedrockProxy.template)
+   - For ALB + Fargate: [`deployment/BedrockProxyFargate.template`](deployment/BedrockProxyFargate.template)
 2. Sign in to AWS Management Console and navigate to the CloudFormation service in your target region.
@@ -227,7 +239,7 @@ For more information about creating and managing application inference profiles,
 This proxy now supports **Prompt Caching** for Claude and Nova models, which can reduce costs by up to 90% and latency by up to 85% for workloads with repeated prompts.
 **Supported Models:**
-- Claude 3+ models (Claude 3.5 Haiku, Claude 3.7 Sonnet, Claude 4, Claude 4.5, etc.)
+- Claude models (Claude 3.5 Haiku, Claude 4, Claude 4.5, etc.)
 - Nova models (Nova Micro, Nova Lite, Nova Pro, Nova Premier)
 **Enabling Prompt Caching:**
@@ -249,7 +261,7 @@ client = OpenAI()
 # Cache system prompts
 response = client.chat.completions.create(
-    model="us.anthropic.claude-3-7-sonnet-20250219-v1:0",
+    model="global.anthropic.claude-haiku-4-5-20251001-v1:0",
     messages=[
         {"role": "system", "content": "You are an expert assistant with knowledge of..."},
         {"role": "user", "content": "Help me with this task"}
@@ -271,7 +283,7 @@ curl $OPENAI_BASE_URL/chat/completions \
   -H "Content-Type: application/json" \
   -H "Authorization: Bearer $OPENAI_API_KEY" \
   -d '{
-    "model": "us.anthropic.claude-3-7-sonnet-20250219-v1:0",
+    "model": "global.anthropic.claude-haiku-4-5-20251001-v1:0",
     "messages": [
       {"role": "system", "content": "Long system prompt..."},
      {"role": "user", "content": "Question"}
@@ -334,9 +346,11 @@ print(response)
 This application does not collect any of your data. Furthermore, it does not log any requests or responses by default.
-### Why not used API Gateway instead of Application Load Balancer?
-Short answer is that API Gateway does not support server-sent events (SSE) for streaming response.
+### Why choose API Gateway vs ALB?
+**API Gateway + Lambda** uses [API Gateway response streaming](https://aws.amazon.com/blogs/compute/building-responsive-apis-with-amazon-api-gateway-response-streaming/) with [Lambda Web Adapter](https://github.com/awslabs/aws-lambda-web-adapter) to support SSE streaming without requiring a VPC. This is a cost-effective, serverless option with up to 10 minutes timeout.
+**ALB + Fargate** provides the lowest streaming latency with no cold starts, ideal for high-throughput workloads.
 ### Which regions are supported?
@@ -360,9 +374,9 @@ The API base url should look like `http://localhost:8000/api/v1`.
 ### Any performance sacrifice or latency increase by using the proxy APIs
-Comparing with the AWS SDK call, the referenced architecture will bring additional latency on response, you can try and test that on you own.
-Also, you can use Lambda Web Adapter + Function URL (see [example](https://github.com/awslabs/aws-lambda-web-adapter/tree/main/examples/fastapi-response-streaming)) to replace ALB or AWS Fargate to replace Lambda to get better performance on streaming response.
+Compared with direct AWS SDK calls, the proxy architecture will add some latency. The default API Gateway + Lambda deployment provides good streaming performance with Lambda response streaming.
+For lowest latency on streaming responses, consider the ALB + Fargate deployment option which eliminates cold starts and provides consistent performance.
 ### Any plan to support SageMaker models?

assets/arch.png — binary file not shown (54 KiB before, 50 KiB after)

deployment/BedrockProxy.template

@@ -1,4 +1,4 @@
-Description: Bedrock Access Gateway - OpenAI-compatible RESTful APIs for Amazon Bedrock
+Description: Bedrock Access Gateway - OpenAI-compatible RESTful APIs for Amazon Bedrock (API Gateway + Lambda with Streaming)
 Parameters:
   ApiKeySecretArn:
     Type: String
@@ -19,116 +19,8 @@ Parameters:
       - "false"
     Description: Enable prompt caching for supported models (Claude, Nova). When enabled, adds cachePoint to system prompts and messages for cost savings.
 Resources:
-  VPCB9E5F0B4:
-    Type: AWS::EC2::VPC
-    Properties:
-      CidrBlock: 10.250.0.0/16
-      EnableDnsHostnames: true
-      EnableDnsSupport: true
-      InstanceTenancy: default
-      Tags:
-        - Key: Name
-          Value: BedrockProxy/VPC
-  VPCPublicSubnet1SubnetB4246D30:
-    Type: AWS::EC2::Subnet
-    Properties:
-      AvailabilityZone:
-        Fn::Select:
-          - 0
-          - Fn::GetAZs: ""
-      CidrBlock: 10.250.0.0/24
-      MapPublicIpOnLaunch: true
-      Tags:
-        - Key: aws-cdk:subnet-name
-          Value: Public
-        - Key: aws-cdk:subnet-type
-          Value: Public
-        - Key: Name
-          Value: BedrockProxy/VPC/PublicSubnet1
-      VpcId:
-        Ref: VPCB9E5F0B4
-  VPCPublicSubnet1RouteTableFEE4B781:
-    Type: AWS::EC2::RouteTable
-    Properties:
-      Tags:
-        - Key: Name
-          Value: BedrockProxy/VPC/PublicSubnet1
-      VpcId:
-        Ref: VPCB9E5F0B4
-  VPCPublicSubnet1RouteTableAssociation0B0896DC:
-    Type: AWS::EC2::SubnetRouteTableAssociation
-    Properties:
-      RouteTableId:
-        Ref: VPCPublicSubnet1RouteTableFEE4B781
-      SubnetId:
-        Ref: VPCPublicSubnet1SubnetB4246D30
-  VPCPublicSubnet1DefaultRoute91CEF279:
-    Type: AWS::EC2::Route
-    Properties:
-      DestinationCidrBlock: 0.0.0.0/0
-      GatewayId:
-        Ref: VPCIGWB7E252D3
-      RouteTableId:
-        Ref: VPCPublicSubnet1RouteTableFEE4B781
-    DependsOn:
-      - VPCVPCGW99B986DC
-  VPCPublicSubnet2Subnet74179F39:
-    Type: AWS::EC2::Subnet
-    Properties:
-      AvailabilityZone:
-        Fn::Select:
-          - 1
-          - Fn::GetAZs: ""
-      CidrBlock: 10.250.1.0/24
-      MapPublicIpOnLaunch: true
-      Tags:
-        - Key: aws-cdk:subnet-name
-          Value: Public
-        - Key: aws-cdk:subnet-type
-          Value: Public
-        - Key: Name
-          Value: BedrockProxy/VPC/PublicSubnet2
-      VpcId:
-        Ref: VPCB9E5F0B4
-  VPCPublicSubnet2RouteTable6F1A15F1:
-    Type: AWS::EC2::RouteTable
-    Properties:
-      Tags:
-        - Key: Name
-          Value: BedrockProxy/VPC/PublicSubnet2
-      VpcId:
-        Ref: VPCB9E5F0B4
-  VPCPublicSubnet2RouteTableAssociation5A808732:
-    Type: AWS::EC2::SubnetRouteTableAssociation
-    Properties:
-      RouteTableId:
-        Ref: VPCPublicSubnet2RouteTable6F1A15F1
-      SubnetId:
-        Ref: VPCPublicSubnet2Subnet74179F39
-  VPCPublicSubnet2DefaultRouteB7481BBA:
-    Type: AWS::EC2::Route
-    Properties:
-      DestinationCidrBlock: 0.0.0.0/0
-      GatewayId:
-        Ref: VPCIGWB7E252D3
-      RouteTableId:
-        Ref: VPCPublicSubnet2RouteTable6F1A15F1
-    DependsOn:
-      - VPCVPCGW99B986DC
-  VPCIGWB7E252D3:
-    Type: AWS::EC2::InternetGateway
-    Properties:
-      Tags:
-        - Key: Name
-          Value: BedrockProxy/VPC
-  VPCVPCGW99B986DC:
-    Type: AWS::EC2::VPCGatewayAttachment
-    Properties:
-      InternetGatewayId:
-        Ref: VPCIGWB7E252D3
-      VpcId:
-        Ref: VPCB9E5F0B4
-  ProxyApiHandlerServiceRoleBE71BFB1:
+  # IAM Role for Lambda
+  ProxyApiHandlerServiceRole:
     Type: AWS::IAM::Role
     Properties:
       AssumeRolePolicyDocument:
@@ -139,12 +31,9 @@ Resources:
             Service: lambda.amazonaws.com
         Version: "2012-10-17"
       ManagedPolicyArns:
-        - Fn::Join:
-            - ""
-            - - "arn:"
-              - Ref: AWS::Partition
-              - :iam::aws:policy/service-role/AWSLambdaBasicExecutionRole
-  ProxyApiHandlerServiceRoleDefaultPolicy86681202:
+        - !Sub "arn:${AWS::Partition}:iam::aws:policy/service-role/AWSLambdaBasicExecutionRole"
+  ProxyApiHandlerServiceRoleDefaultPolicy:
     Type: AWS::IAM::Policy
     Properties:
       PolicyDocument:
@@ -166,122 +55,124 @@ Resources:
             - secretsmanager:GetSecretValue
             - secretsmanager:DescribeSecret
           Effect: Allow
-          Resource:
-            Ref: ApiKeySecretArn
+          Resource: !Ref ApiKeySecretArn
         Version: "2012-10-17"
-      PolicyName: ProxyApiHandlerServiceRoleDefaultPolicy86681202
+      PolicyName: ProxyApiHandlerServiceRoleDefaultPolicy
       Roles:
-        - Ref: ProxyApiHandlerServiceRoleBE71BFB1
-  ProxyApiHandlerEC15A492:
+        - !Ref ProxyApiHandlerServiceRole
+  # Lambda Function with Lambda Web Adapter for streaming
+  ProxyApiHandler:
     Type: AWS::Lambda::Function
     Properties:
       Architectures:
         - arm64
       Code:
-        ImageUri:
-          Ref: ContainerImageUri
-      Description: Bedrock Proxy API Handler
+        ImageUri: !Ref ContainerImageUri
+      Description: Bedrock Proxy API Handler with Response Streaming
       Environment:
        Variables:
+          # Lambda Web Adapter settings
+          AWS_LWA_INVOKE_MODE: RESPONSE_STREAM
+          AWS_LWA_READINESS_CHECK_PATH: /health
+          AWS_LWA_ASYNC_INIT: "true"
+          PORT: "8080"
+          # Application settings
           DEBUG: "false"
-          API_KEY_SECRET_ARN:
-            Ref: ApiKeySecretArn
-          DEFAULT_MODEL:
-            Ref: DefaultModelId
+          API_KEY_SECRET_ARN: !Ref ApiKeySecretArn
+          DEFAULT_MODEL: !Ref DefaultModelId
           DEFAULT_EMBEDDING_MODEL: cohere.embed-multilingual-v3
           ENABLE_CROSS_REGION_INFERENCE: "true"
           ENABLE_APPLICATION_INFERENCE_PROFILES: "true"
-          ENABLE_PROMPT_CACHING:
-            Ref: EnablePromptCaching
+          ENABLE_PROMPT_CACHING: !Ref EnablePromptCaching
+          API_ROUTE_PREFIX: /v1
       MemorySize: 1024
       PackageType: Image
-      Role:
-        Fn::GetAtt:
-          - ProxyApiHandlerServiceRoleBE71BFB1
-          - Arn
+      Role: !GetAtt ProxyApiHandlerServiceRole.Arn
       Timeout: 600
     DependsOn:
-      - ProxyApiHandlerServiceRoleDefaultPolicy86681202
-      - ProxyApiHandlerServiceRoleBE71BFB1
-  ProxyApiHandlerInvoke2UTWxhlfyqbT5FTn5jvgbLgjFfJwzswGk55DU1HYF6C33779:
+      - ProxyApiHandlerServiceRoleDefaultPolicy
+      - ProxyApiHandlerServiceRole
+  # API Gateway REST API (Regional)
+  RestApi:
+    Type: AWS::ApiGateway::RestApi
+    Properties:
+      Name: BedrockProxyApi
+      Description: Bedrock Access Gateway - OpenAI-compatible API with streaming support
+      EndpointConfiguration:
+        Types:
+          - REGIONAL
+      Body:
+        openapi: "3.0.1"
+        info:
+          title: BedrockProxyApi
+          version: "1.0"
+        paths:
+          /{proxy+}:
+            x-amazon-apigateway-any-method:
+              parameters:
+                - name: proxy
+                  in: path
+                  required: true
+                  schema:
+                    type: string
+              x-amazon-apigateway-integration:
+                type: aws_proxy
+                httpMethod: POST
+                uri: !Sub "arn:aws:apigateway:${AWS::Region}:lambda:path/2021-11-15/functions/${ProxyApiHandler.Arn}/response-streaming-invocations"
+                passthroughBehavior: when_no_match
+                timeoutInMillis: 600000
+                responseTransferMode: STREAM
+              responses:
+                default:
+                  description: Default response
+          /:
+            x-amazon-apigateway-any-method:
+              x-amazon-apigateway-integration:
+                type: aws_proxy
+                httpMethod: POST
+                uri: !Sub "arn:aws:apigateway:${AWS::Region}:lambda:path/2021-11-15/functions/${ProxyApiHandler.Arn}/response-streaming-invocations"
+                passthroughBehavior: when_no_match
+                timeoutInMillis: 600000
+                responseTransferMode: STREAM
+              responses:
+                default:
+                  description: Default response
+  # Lambda Permission for API Gateway
+  LambdaPermission:
     Type: AWS::Lambda::Permission
     Properties:
+      FunctionName: !Ref ProxyApiHandler
       Action: lambda:InvokeFunction
-      FunctionName:
-        Fn::GetAtt:
-          - ProxyApiHandlerEC15A492
-          - Arn
-      Principal: elasticloadbalancing.amazonaws.com
-  ProxyALB87756780:
-    Type: AWS::ElasticLoadBalancingV2::LoadBalancer
+      Principal: apigateway.amazonaws.com
+      SourceArn: !Sub "arn:aws:execute-api:${AWS::Region}:${AWS::AccountId}:${RestApi}/*"
+  # API Gateway Deployment
+  ApiDeployment:
+    Type: AWS::ApiGateway::Deployment
     Properties:
-      LoadBalancerAttributes:
-        - Key: deletion_protection.enabled
-          Value: "false"
-      Scheme: internet-facing
-      SecurityGroups:
-        - Fn::GetAtt:
-            - ProxyALBSecurityGroup0D6CA3DA
-            - GroupId
-      Subnets:
-        - Ref: VPCPublicSubnet1SubnetB4246D30
-        - Ref: VPCPublicSubnet2Subnet74179F39
-      Type: application
+      RestApiId: !Ref RestApi
     DependsOn:
-      - VPCPublicSubnet1DefaultRoute91CEF279
-      - VPCPublicSubnet1RouteTableAssociation0B0896DC
-      - VPCPublicSubnet2DefaultRouteB7481BBA
-      - VPCPublicSubnet2RouteTableAssociation5A808732
-  ProxyALBSecurityGroup0D6CA3DA:
-    Type: AWS::EC2::SecurityGroup
+      - RestApi
+  # API Gateway Stage
+  ApiStage:
+    Type: AWS::ApiGateway::Stage
     Properties:
-      GroupDescription: Automatically created Security Group for ELB BedrockProxyALB1CE4CAD1
-      SecurityGroupEgress:
-        - CidrIp: 255.255.255.255/32
-          Description: Disallow all traffic
-          FromPort: 252
-          IpProtocol: icmp
-          ToPort: 86
-      SecurityGroupIngress:
-        - CidrIp: 0.0.0.0/0
-          Description: Allow from anyone on port 80
-          FromPort: 80
-          IpProtocol: tcp
-          ToPort: 80
-      VpcId:
-        Ref: VPCB9E5F0B4
-  ProxyALBListener933E9515:
-    Type: AWS::ElasticLoadBalancingV2::Listener
-    Properties:
-      DefaultActions:
-        - TargetGroupArn:
-            Ref: ProxyALBListenerTargetsGroup187739FA
-          Type: forward
-      LoadBalancerArn:
-        Ref: ProxyALB87756780
-      Port: 80
-      Protocol: HTTP
-  ProxyALBListenerTargetsGroup187739FA:
-    Type: AWS::ElasticLoadBalancingV2::TargetGroup
-    Properties:
-      HealthCheckEnabled: false
-      TargetType: lambda
-      Targets:
-        - Id:
-            Fn::GetAtt:
-              - ProxyApiHandlerEC15A492
-              - Arn
-    DependsOn:
-      - ProxyApiHandlerInvoke2UTWxhlfyqbT5FTn5jvgbLgjFfJwzswGk55DU1HYF6C33779
+      RestApiId: !Ref RestApi
+      DeploymentId: !Ref ApiDeployment
+      StageName: api
+      Description: API Stage with streaming support
 Outputs:
   APIBaseUrl:
     Description: Proxy API Base URL (OPENAI_API_BASE)
-    Value:
-      Fn::Join:
-        - ""
-        - - http://
-          - Fn::GetAtt:
-              - ProxyALB87756780
-              - DNSName
-          - /api/v1
+    Value: !Sub "https://${RestApi}.execute-api.${AWS::Region}.amazonaws.com/api/v1"
+  RestApiId:
+    Description: API Gateway REST API ID
+    Value: !Ref RestApi
+  LambdaFunctionArn:
+    Description: Lambda Function ARN
+    Value: !GetAtt ProxyApiHandler.Arn

Dockerfile

@@ -1,9 +1,15 @@
 FROM public.ecr.aws/lambda/python:3.12
+# Add Lambda Web Adapter for API Gateway response streaming
+COPY --from=public.ecr.aws/awsguru/aws-lambda-adapter:0.9.1 /lambda-adapter /opt/extensions/lambda-adapter
 COPY ./api ./api
 COPY requirements.txt .
 RUN pip3 install -r requirements.txt -U --no-cache-dir
-CMD [ "api.app.handler" ]
+# Lambda Web Adapter requires overriding the Lambda base image entrypoint
+# to run the web app directly instead of the Lambda runtime handler
+ENTRYPOINT []
+CMD ["python", "-m", "uvicorn", "api.app:app", "--host", "0.0.0.0", "--port", "8080"]