Nick McHardy header image

Serverless and Lambda: what to do when you run out of disk space

tl;dr

Lambda has 512MB of ephemeral space to play with, shared across successive invocations that reuse the same container, which makes it difficult to manage. Long-running functions share the same space, and there is no control over when containers are reused.

My Serverless Journey

The journey to take serverless to the extreme and whole-heartedly build a CMS without any EC2 or RDS instances or even a humble VPC has been interesting to say the least. But today, I discovered something which I thought would be gone forever, in the form of a nice little message:

no space left on device

Say, wha?

Now, I thought I’d be free of this kind of error. I’d also be free of the following, because AWS manages it all for me, right?!

  • Patching, especially OS patches
  • Disk space
  • Memory
  • Scale and instance management
  • Load balancing
  • … a bunch of other serverless niceties

Also worth noting: I know an AWS Lambda instance has limits, but surely I’m not hitting those just yet.

On closer inspection, I see this error is coming from Hugo, the fastest static website generator I’ve ever seen and the core component of Koi CMS.

It’s trying to write out some static assets for one of my larger sites, after generating all of the HTML files from Markdown. Which is fair enough, since it’s all part of the package.

I also find this:

child process exited with code 255

Ugh.

Non-zero return codes from a process. This needs more attention.

Error: ENOSPC: no space left on device, mkdir ‘/tmp/6NPGD7TPTpvy2tZx’

Some Background

AWS Lambda is limited to 512 MB of ephemeral storage mounted at /tmp/. This space is shared by every invocation that lands on the same container, until either the container gets recycled or increased load triggers another instance to be deployed in the background.

My CMS performs quite a lot of different functions, but disk space is only consumed by the publish process, which is essentially a wrapper around Hugo. The process is:

  1. Files are downloaded from S3 to the Publish Lambda function’s local /tmp/ drive, which I’m told has about 512MB of space
  2. Hugo is executed
  3. Files are generated in a “public” folder, thanks to Hugo
  4. Files are then uploaded to S3
  5. Any temporary file is removed

Actually, there is a step 0, which ensures the Hugo binary is available to the function, which means writing it to the local disk. To save a bit of time, I check if the binary is already there and if not, go and fetch it. I could include the binary inside the function as part of the upload to AWS, but I kind of like having it slightly decoupled.

Anyways, disk space is limited, but that’s not too bad. None of my sites have hit 512MB of content. That might be a problem when they do. Problem for future Koi, shall we say.

How does it run out of space?

At the moment, if any major problem happens, the cleanup routine might not execute. This is not ideal, but can be fixed at some stage. Since we aren’t really in control of when instances start or get reused, AWS might spin up new Lambda instances or just re-use existing ones if the function hasn’t been modified.

Older instances do go stale and need a bit of warm-up time, which you can feel when logging into the system on a cold morning. It’s not terrible, but definitely noticeable. Given how many times websites are published per day, I’d say my functions often take the hit of having to warm up.

This function re-use has one benefit for me: a Hugo binary that has been previously downloaded and stored locally can be used again, which saves a small amount of publishing time.

The downside is that if a few failures occur in a row, the instance will have its ephemeral storage (in /tmp/) filled up, with no automatic way to release it. A simple re-deploy of the function will clear this out good and proper, but that’s not ideal.

So, back to monitoring disk space?

Perhaps, but it would have to be baked into the function itself. In some ways I’d like a TTL on files/folders so that they will automatically be cleaned up after some period of time, but that might be too much to ask for. It’s not like I can just ssh into the Lambda instance and clean things up periodically.

At the end of the day, I’ll have to manage the use of temporary disk storage myself. In any case, adoption of serverless doesn’t mean you can ignore all the resources of the past, maybe just some of them.

The dream is still alive.