S3 uploader high memory usage

How to prevent high memory usage when uploading many files via the Go S3 manager uploader.

TL;DR

When using the S3 manager uploader, give it a Body that implements io.ReadSeeker and io.ReaderAt (such as an *os.File or a *bytes.Reader) rather than a plain io.Reader to prevent high memory usage.

There are two options for uploading files to S3 with the AWS SDK for Go v2 (besides using presigned URLs):

  1. The S3 client's PutObject.
  2. The S3 manager uploader's Upload.

Both accept an s3.PutObjectInput, whose Body field is an io.Reader.
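As a rough sketch of both calls (inside a function that returns an error, with ctx and an AWS config cfg in scope; it uses the same SDK imports as the code further down, plus strings, and the bucket, keys, and bodies are placeholders):

client := s3.NewFromConfig(cfg)

// Option 1: a single PutObject call on the S3 client.
_, err := client.PutObject(ctx, &s3.PutObjectInput{
	Bucket: aws.String("my-bucket"),
	Key:    aws.String("example.txt"),
	Body:   strings.NewReader("example body"),
})
if err != nil {
	return err
}

// Option 2: the S3 manager uploader, which splits large bodies into parts
// and uploads the parts concurrently.
_, err = manager.NewUploader(client).Upload(ctx, &s3.PutObjectInput{
	Bucket: aws.String("my-bucket"),
	Key:    aws.String("another-example.txt"),
	Body:   strings.NewReader("example body"),
})
return err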

From what I understand, option 2 is recommended when uploading many (large) files, because it:

  • Safely uploads files concurrently across goroutines.
  • Buffers large files into smaller chunks and uploads them in parallel.

The problem

While working on a project that needed to upload zip archives containing lots of files, I chose the S3 manager uploader for its concurrent upload capabilities.

But I quickly ran into memory issues: when uploading zip archives that contain hundreds of files, my Go service would often run out of memory (OOM).

For example, uploading a zip archive with ~100 files caused the service memory usage to consistently spike to ~500 MiB.

The code

package s3
 
import (
	"archive/zip"
	"context"
	"fmt"
	"io"
	"mime"
	"os"
	"path/filepath"
 
	"github.com/aws/aws-sdk-go-v2/aws"
	"github.com/aws/aws-sdk-go-v2/feature/s3/manager"
	"github.com/aws/aws-sdk-go-v2/service/s3"
	"golang.org/x/sync/errgroup"
)
 
type Uploader struct {
	uploader *manager.Uploader
	bucket   string
}
 
func NewUploader(client manager.UploadAPIClient, bucket string) *Uploader {
	return &Uploader{
		uploader: manager.NewUploader(client),
		bucket:   bucket,
	}
}
 
func (u *Uploader) UploadZip(ctx context.Context, zipr *zip.Reader) error {
	group, ctx := errgroup.WithContext(ctx)
	for _, file := range zipr.File {
		file := file // Capture the loop variable (not needed on Go 1.22+).
		group.Go(func() error {
			return u.uploadZipFile(ctx, file)
		})
	}
	if err := group.Wait(); err != nil {
		return fmt.Errorf("uploading zip file: %v", err)
	}
	return nil
}
 
func (u *Uploader) uploadZipFile(ctx context.Context, file *zip.File) error {
	zf, err := file.Open()
	if err != nil {
		return err
	}
	defer zf.Close()
 
	mimeType := detectMimeType(file.Name)
	_, err = u.uploader.Upload(ctx, &s3.PutObjectInput{
		Bucket:      aws.String(u.bucket),
		Key:         aws.String(file.Name),
		Body:        zf,
		ContentType: aws.String(mimeType),
	})
	return err
}
 
func detectMimeType(fileName string) string {
	ext := filepath.Ext(fileName)
	mimeType := mime.TypeByExtension(ext)
	if AllowedMimeType(mimeType) {
		// Use an allow list for improved security.
		return mimeType
	}
	return "application/octet-stream"
}

Opening a file inside the zip archive returns an io.ReadCloser, and passing that as the Body should stream the file's contents to S3 without buffering them all in memory. So why the memory issues?

Profiling with a benchmark didn't show any issues, so I started digging into the S3 manager code.

Root cause: default part size

The S3 manager uploader's memory behavior is controlled by the PartSize parameter. By default it is set to 5 MiB, and the same size is used for the buffers in the uploader's memory pool (so allocated buffer memory can be reused between uploads).
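PartSize (and the related Concurrency setting) can be tuned with functional options when constructing the uploader, for example (the values below are illustrative, not recommendations):

uploader := manager.NewUploader(client, func(u *manager.Uploader) {
	u.PartSize = 10 * 1024 * 1024 // 10 MiB parts instead of the default 5 MiB.
	u.Concurrency = 3             // Goroutines uploading parts per Upload call.
})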

This is the interesting part: when the Body is a plain io.Reader, the uploader allocates a full 5 MiB part buffer for every file being uploaded, regardless of the file's actual size. With ~100 files being uploaded concurrently, that adds up to roughly 100 × 5 MiB ≈ 500 MiB, which matches the spike I was seeing.

This happens because the uploader needs to split the content into parts before uploading. With a plain io.Reader it can only do that by copying the content into those fixed-size part buffers, so even a small file occupies a full 5 MiB buffer. If the Body also implements io.ReadSeeker and io.ReaderAt, the uploader can determine the content length by seeking and read each part directly from the source, without buffering the content in memory first.
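For reference, the capability the uploader looks for on the Body is roughly the following (the SDK's actual interface is unexported; this is an illustrative paraphrase, assuming the io, os, and bytes imports):

// Roughly what the manager checks: a Body that can be read at arbitrary
// offsets and seeked can be split into parts without copying it into
// pooled buffers.
type readerAtSeeker interface {
	io.ReaderAt
	io.ReadSeeker
}

// Both *os.File and *bytes.Reader satisfy it.
var _ readerAtSeeker = (*os.File)(nil)
var _ readerAtSeeker = (*bytes.Reader)(nil)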

The fix

Opening a file inside a zip archive returns an io.ReadCloser, which can't seek (the entry is decompressed as a stream). So its contents must first be "converted" into something that implements io.ReadSeeker and io.ReaderAt to prevent the buffering, while still using the S3 manager to upload files concurrently.

I think the simplest ways to do this (before uploading) are:

  1. Write each file in the zip archive to a temporary file.
  2. Read each file in the zip archive into memory with io.ReadAll and wrap it in a bytes.Reader (sketched just below).
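For reference, option 2 would look roughly like this inside uploadZipFile, after opening zf and detecting mimeType (a sketch only; it also needs the bytes import):

// Read the zip entry fully into memory; *bytes.Reader implements both
// io.ReadSeeker and io.ReaderAt.
data, err := io.ReadAll(zf)
if err != nil {
	return fmt.Errorf("reading zip file into memory: %v", err)
}

_, err = u.uploader.Upload(ctx, &s3.PutObjectInput{
	Bucket:      aws.String(u.bucket),
	Key:         aws.String(file.Name),
	Body:        bytes.NewReader(data),
	ContentType: aws.String(mimeType),
})
return err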

After testing both, I found option 1 to be the (slightly) better choice. Both had similar memory overhead, but writing to a temporary file used (slightly) less CPU and put (slightly) less pressure on the garbage collector (GC).

The revised code

func (u *Uploader) uploadZipFile(ctx context.Context, file *zip.File) error {
	zf, err := file.Open()
	if err != nil {
		return err
	}
	defer zf.Close()
 
	// NOTE: this is safe to call from multiple goroutines (see godoc).
	temp, err := os.CreateTemp("", "s3-upload-*")
	if err != nil {
		return fmt.Errorf("creating temp file: %v", err)
	}
	defer func() {
		if err := temp.Close(); err != nil {
			// Log the error.
		}
		if err := os.Remove(temp.Name()); err != nil {
			// Log the error.
		}
	}()
 
	if _, err := io.Copy(temp, zf); err != nil {
		return fmt.Errorf("writing to temp file: %v", err)
	}
 
	// Rewind the file pointer to read again from the temp file on upload.
	if _, err := temp.Seek(0, io.SeekStart); err != nil {
		return fmt.Errorf("rewinding temp file pointer: %v", err)
	}
 
	mimeType := detectMimeType(file.Name)
	_, err = u.uploader.Upload(ctx, &s3.PutObjectInput{
		Bucket:      aws.String(u.bucket),
		Key:         aws.String(file.Name),
		Body:        temp,
		ContentType: aws.String(mimeType),
	})
	return err
}
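
For completeness, calling this might look something like the following sketch of a caller in the same package, with ctx in scope; the bucket name and archive path are placeholders, and config here is github.com/aws/aws-sdk-go-v2/config:

cfg, err := config.LoadDefaultConfig(ctx)
if err != nil {
	return fmt.Errorf("loading AWS config: %v", err)
}

zr, err := zip.OpenReader("archive.zip")
if err != nil {
	return fmt.Errorf("opening zip archive: %v", err)
}
defer zr.Close()

uploader := NewUploader(s3.NewFromConfig(cfg), "my-bucket")
if err := uploader.UploadZip(ctx, &zr.Reader); err != nil {
	return fmt.Errorf("uploading zip: %v", err)
}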

Resources