Development - July 12, 2022

How to manage large files with Heroku and Amazon S3 Buckets in Django Projects

As a developer, I’ve recently worked on a Django API that processes large images and videos. The system requires the admin to upload those files, and since the API is deployed on Heroku, we used to get timeout errors.

This happens because uploading a heavy file takes longer than Heroku’s 30-second request timeout, causing the request to be terminated. So, what can we do if we don’t want to reduce the size or the quality of the files?

In this blog post, I will provide a solution for this problem and will explain the flow and tools used. This information will hopefully help you with any working Django project that requires the management of large files.

The software architecture and flow

After discussing different approaches, my team and I decided to maintain the files in Amazon S3 buckets. These are containers where you can store objects (such as images and videos) and access them with fast performance. 

As an example, if a Django model has an image attribute, let’s call it [.c-inline-code]profile_picture[.c-inline-code], we store the file in a bucket and save the URL of that file in the database. So, when the Frontend needs the picture, the Backend returns the corresponding URL for that instance.

But we don’t want everyone on the internet to have access to the image in the bucket; only the Frontend should. That’s why we have to configure the bucket as private.

Now we need to answer: if the bucket is private, how does our Frontend access the files in it? Well, in this case, we have to generate a pre-signed URL, which is a URL that grants temporary access to the file.

Here I will explain the proposed solution:

  • Store the system files in private buckets of Amazon S3.
  • Frontend and/or Django admin upload the files directly to a private Amazon S3 bucket.
  • The Backend stores the corresponding URL of each file.
  • When the Django admin creates/updates a file attribute for a given instance, the file is uploaded directly to the bucket, and the corresponding URL is stored in the database.

Explanatory interaction diagram: Admin and browser
  • When the Frontend is about to upload a file:
  • It asks the Backend for a valid upload URL for the private bucket.
  • The Backend generates and sends the upload URL.
  • The Frontend uses the upload URL to store the file in the private bucket.
  • After the upload, the Frontend sends the file URL to the Backend so it can be stored for the corresponding instance.

Explanatory interaction diagram: Upload pre-signed URL
  • When the Backend sends the URL of a file to the Frontend:
  • It uses a function to generate a pre-signed URL so the Frontend can access the file in the private bucket.

Explanatory interaction diagram: Get pre-signed URL

The tools I used

Now that we have defined the solution flow, let’s talk about the tools. The first one I want to mention is django-s3direct, a library for uploading files directly to the Amazon S3 bucket from the admin panel. It also provides a model field that holds the URL stored in the database.

On the other hand, we will use boto3 to generate the pre-signed URLs that the Backend sends to the Frontend in order to access the files. This library also generates the upload URL that allows the Frontend to upload files without needing to know the Amazon credentials.

In the following section, I’ll show the corresponding configurations.

Amazon S3 private bucket

I won't cover the creation of buckets, as there is plenty of documentation available. Next up, I’ll describe how to configure the bucket to make it private and integrate it with django-s3direct.

Assuming you have already created the bucket and you have a user with access ID, access secret key, and permissions, these are the next steps:

  • Log in to the Amazon S3 console. Make sure your user has the required permissions.
  • Select the bucket and go to the Permissions tab.
  • In the Bucket policy section, paste the following policy:

CODE: https://gist.github.com/brunomichetti/86f803721a4fabf2d33aec2a4d0b1c48.js

  • Replace [.c-inline-code]<name-of-bucket>[.c-inline-code] with your bucket name.
  • Now, in the same tab, block all public access to make the bucket private:
Screenshot: Block all public access
  • Finally, in the same tab, paste this CORS configuration (needed by django-s3direct):

CODE: https://gist.github.com/brunomichetti/7aee887ea62abde16f669a8d4bc8a387.js

And that’s it, you have configured your private bucket. Keep in mind that if you want to change the policy in the future, you first need to uncheck the public access blocking.

Django-s3direct library

Now I will explain the use of the django-s3direct library to directly upload the files from the Django admin. 

If you have a model with a file attribute, when you change that attribute this library uploads the file to the bucket directly from the browser, without sending it through the server. This avoids the timeout error mentioned earlier.

Let’s go step by step on how to get this configured:

  • Add the library to your [.c-inline-code]INSTALLED_APPS[.c-inline-code] list in the settings:

CODE: https://gist.github.com/brunomichetti/b5524a3cb07428d91bbdd1c879266ecb.js
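
The gist shows the exact settings; a minimal sketch, assuming the app is registered under the name s3direct, looks like this:

# settings.py
INSTALLED_APPS = [
    "django.contrib.admin",
    "django.contrib.auth",
    "django.contrib.contenttypes",
    "django.contrib.sessions",
    "django.contrib.messages",
    "django.contrib.staticfiles",
    "s3direct",  # enables direct-to-S3 uploads from the admin
]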

  • Make sure you have the [.c-inline-code]APP_DIRS[.c-inline-code] configuration set to True in your [.c-inline-code]TEMPLATES[.c-inline-code] settings:

CODE: https://gist.github.com/brunomichetti/70662a71d03644f592486f9e415cadda.js
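
For reference, the relevant part of a standard Django [.c-inline-code]TEMPLATES[.c-inline-code] setting looks like this:

# settings.py
TEMPLATES = [
    {
        "BACKEND": "django.template.backends.django.DjangoTemplates",
        "DIRS": [],
        "APP_DIRS": True,  # must be True so the library's templates are found
        "OPTIONS": {
            "context_processors": [
                "django.template.context_processors.request",
                "django.contrib.auth.context_processors.auth",
                "django.contrib.messages.context_processors.messages",
            ],
        },
    },
]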

  • Add the django-s3direct URLs to the urlpatterns list in your main [.c-inline-code]urls.py[.c-inline-code] file:

CODE: https://gist.github.com/brunomichetti/1b22401a81a0946c2abc022ff464b179.js
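
A sketch of that URL configuration (the [.c-inline-code]s3direct/[.c-inline-code] prefix is just an example):

# urls.py
from django.contrib import admin
from django.urls import include, path

urlpatterns = [
    path("admin/", admin.site.urls),
    path("s3direct/", include("s3direct.urls")),  # endpoints used by the upload widget
]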

  • Add the Amazon configuration to your settings (it’s strongly recommended to keep the access key ID and secret access key in environment variables):

CODE: https://gist.github.com/brunomichetti/5c1c3778444d56474dba1160147d37ad.js
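
A hedged sketch of those settings, reading the credentials from environment variables (the variable names are just examples):

# settings.py
import os

AWS_ACCESS_KEY_ID = os.environ["AWS_ACCESS_KEY_ID"]
AWS_SECRET_ACCESS_KEY = os.environ["AWS_SECRET_ACCESS_KEY"]
AWS_STORAGE_BUCKET_NAME = os.environ["AWS_STORAGE_BUCKET_NAME"]
S3DIRECT_REGION = "us-east-1"  # the region where your bucket lives

# S3DIRECT_DESTINATIONS is also defined here; see the example a few steps below.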

  • Run [.c-inline-code]collectstatic[.c-inline-code] if needed:

[.c-inline-code]python manage.py collectstatic[.c-inline-code]

  • Now you can define in your model a file attribute corresponding to an image or video. Let’s take a look at an example:

CODE: https://gist.github.com/brunomichetti/da27283dce284851f6d9598827728947.js
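
A sketch of such a model (the model and field names are illustrative):

# models.py
from django.db import models
from s3direct.fields import S3DirectField


class ExampleModel(models.Model):
    name = models.CharField(max_length=100)
    # Only the file's URL is stored in the database; the file itself lives in the bucket
    image = S3DirectField(dest="example_destination")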

  • And, what is [.c-inline-code]example_destination[.c-inline-code]? When you define a file attribute as an [.c-inline-code]S3DirectField[.c-inline-code], you have to specify the [.c-inline-code]dest[.c-inline-code] parameter. There you have to put the string corresponding to a key of the [.c-inline-code]S3DIRECT_DESTINATIONS[.c-inline-code] dictionary in the configuration.

 Let’s look at an example of the configuration to understand this better:

CODE: https://gist.github.com/brunomichetti/9932aa320f693f27b60bbbe6d9ec9d30.js
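
A minimal sketch of a destination configuration (the folder names are just examples):

# settings.py
S3DIRECT_DESTINATIONS = {
    "example_destination": {
        # Folder path (key prefix) inside the bucket where the files will be stored
        "key": "uploads/images",
    },
}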

  • Inside each destination dictionary you can configure many parameters; check the library’s documentation to see them all. This example uses [.c-inline-code]key[.c-inline-code], which defines the folder path where the corresponding files are stored inside the bucket.
  • You can test that it’s working by creating an instance of the class in the Django admin:

CODE: https://gist.github.com/brunomichetti/32ace4d6aa61f50e1ccf32b18c307f4b.js
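
Registering the model in the admin is enough to try it out; for example:

# admin.py
from django.contrib import admin

from .models import ExampleModel

admin.site.register(ExampleModel)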

  • Now go to the Django admin and you will see something like this:
Screenshot: Django admin
  • Then, if you select an image, the library uploads it directly to the private bucket and stores the corresponding URL in the database. The Backend never receives a file, just a URL string. After a successful upload, you will see something like this:
Screenshot: Django admin
  • And the stored URL will have this structure:

[.c-inline-code]https://s3.<region-name>.amazonaws.com/<bucket-name>/<key>/<file-name>[.c-inline-code]

  • If you click on the name of the picture, the browser will try to open it, but you won’t be able to see it because the bucket is private, and that’s ok. Also, if you click on REMOVE, the file will be removed from the instance, but it will still exist in the bucket.

Customize the name of the pictures

As highlighted, the [.c-inline-code]key[.c-inline-code] attribute in the [.c-inline-code]S3DIRECT_DESTINATIONS[.c-inline-code] dictionary configures the folder path where the files for a given destination are stored.

But what happens when you upload a file with the same name as one that already exists in the same folder?

Well, the new file will overwrite the existing one, and I assume you don’t want that, because it may belong to another instance in your system.

Here are two tips for this:

  • Add a timestamp to the file name to avoid repeated file names in the system.
  • Use a slugify function to manage spaces and/or invalid characters in the file name.

If you do that, you can avoid a lot of future problems. Let’s look at an example:

CODE: https://gist.github.com/brunomichetti/72f8992ec2013e221b18641eefb566f6.js
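
The gist holds the author’s exact code; here is a hedged sketch of the same idea, using the callable form of [.c-inline-code]key[.c-inline-code] that django-s3direct supports (the helper name and folders are illustrative):

# settings.py
from django.utils import timezone
from django.utils.text import slugify


def normalized_image_key(filename, args=None):
    # The second argument corresponds to the destination's optional "key_args"
    name, _, extension = filename.rpartition(".")
    timestamp = int(timezone.now().timestamp())
    return f"uploads/images/{timestamp}--{slugify(name)}.{extension}"


S3DIRECT_DESTINATIONS = {
    "example_destination": {
        "key": normalized_image_key,
    },
}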

In this example, we normalize the name of each new file: first we apply Django’s slugify function, and then we prepend the timestamp.

If we upload a file with the name [.c-inline-code]image.png[.c-inline-code], the system will store that file with the form [.c-inline-code]<timestamp>--image.png[.c-inline-code]. 

Another example is if we have a file with the name [.c-inline-code]FILE with SPACEs.png[.c-inline-code], the system will store [.c-inline-code]<timestamp>--file-with-spaces.png[.c-inline-code]. This is super easy to understand and solves a lot of problems. 

It’s important to note that this normalization is executed only when uploading files through the Django admin. To be consistent, it’s a good idea to apply the same normalization on the Frontend when it has to upload a file.

In the next section, we will see how to return pre-signed URLs to the Frontend.

Boto3 and pre-signed URL

We know how to configure a private Amazon S3 bucket, and how to integrate it with django-s3direct to directly upload a file from the Django admin. 

Now, we need to know how to do the following:

  • Send a pre-signed URL from the Backend to the Frontend so the latter can access an existing file in the private bucket.
  • Send a pre-signed upload URL from the Backend to the Frontend so the latter can upload a file.

Let’s now look at how we do this using the boto3 library.

Generate and send a pre-signed URL

Since the bucket is private, if we take a file URL from the database and try to access it, we won’t be able to. We would see something like this:

Screenshot: Pre-signed URL

This is why we need to generate a pre-signed URL in the Backend, so the corresponding file is temporarily available to the Frontend. For this I use the boto3 library; let’s look at the functions used:

CODE: https://gist.github.com/brunomichetti/a2ea087b8369f5c751a3da2586af100e
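
The gist contains the author’s exact helpers; here is a hedged sketch of what they could look like with boto3 (the module name and the [.c-inline-code]AWS_PRESIGNED_EXPIRY[.c-inline-code] setting are assumptions for illustration):

# s3_utils.py (illustrative module name)
from urllib.parse import urlparse

import boto3
from django.conf import settings


def get_object_key_from_url(file_url):
    """Extracts the object key (folder path + file name) from a stored file URL."""
    path = urlparse(file_url).path.lstrip("/")
    # URLs shaped like https://s3.<region>.amazonaws.com/<bucket>/<key> keep the
    # bucket name as the first path segment, so strip it if present.
    bucket_prefix = f"{settings.AWS_STORAGE_BUCKET_NAME}/"
    if path.startswith(bucket_prefix):
        path = path[len(bucket_prefix):]
    return path


def get_presigned_url(object_key):
    """Generates a temporary URL so the Frontend can read the private object."""
    s3_client = boto3.client(
        "s3",
        aws_access_key_id=settings.AWS_ACCESS_KEY_ID,
        aws_secret_access_key=settings.AWS_SECRET_ACCESS_KEY,
    )
    return s3_client.generate_presigned_url(
        "get_object",
        Params={"Bucket": settings.AWS_STORAGE_BUCKET_NAME, "Key": object_key},
        ExpiresIn=settings.AWS_PRESIGNED_EXPIRY,  # seconds of validity
    )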

In this example, the function [.c-inline-code]get_object_key_from_url[.c-inline-code] obtains the object key from the file URL. As mentioned, the object key is the folder path plus the file name inside the bucket.

Why do we need the object key? We need it to generate the presigned URL of the file. That object key will be used in the [.c-inline-code]get_presigned_url[.c-inline-code] function for that purpose.

The next thing to do is to execute that function when the Frontend makes a GET request and needs to access the file. 

There are several ways of doing that, and I’m going to show you one that I find easy to understand:

  • Create a property in the model that corresponds to the pre-signed URL:

CODE: https://gist.github.com/brunomichetti/1f90592a64c89d6513fb67a39f3a334d.js
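
A sketch of what that property could look like on the example model:

# models.py
from django.db import models
from s3direct.fields import S3DirectField

from .s3_utils import get_object_key_from_url, get_presigned_url  # the helpers above


class ExampleModel(models.Model):
    name = models.CharField(max_length=100)
    image = S3DirectField(dest="example_destination")

    @property
    def presigned_image(self):
        # Turns the stored (private) URL into a temporary, accessible one
        object_key = get_object_key_from_url(str(self.image))
        return get_presigned_url(object_key)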

  • The [.c-inline-code]presigned_image[.c-inline-code] property first extracts the object key from the stored URL and then generates the pre-signed one to send to the Frontend.
  • Now add the property to a model serializer:

CODE: https://gist.github.com/brunomichetti/c3e93408cd5567fe6f2a23794a2138fc.js
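
And a sketch of a Django REST Framework serializer exposing that property:

# serializers.py
from rest_framework import serializers

from .models import ExampleModel


class ExampleModelSerializer(serializers.ModelSerializer):
    # Model properties can be exposed through a read-only field
    presigned_image = serializers.ReadOnlyField()

    class Meta:
        model = ExampleModel
        fields = ["id", "name", "presigned_image"]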

  • This way, if you use that serializer to send the data to the Frontend, it will return a pre-signed URL in the [.c-inline-code]presigned_image[.c-inline-code] field.

A pre-signed URL looks like this:

[.c-inline-code]https://<bucket name>.s3.amazonaws.com/images/<object key>?<lots of needed parameters>[.c-inline-code]

With that URL, the file will be available for the time period defined in your settings as in the example.

Generate an upload pre-signed URL

Last but not least, I’m going to explain how to send an upload URL from the Backend to the Frontend. Imagine we have an app where the users have profile pictures and the Frontend needs to upload the image. 

The flow would then look like this:

  1. The Frontend generates the object key of the image file (it’s recommended to add a timestamp and normalize the file name, as in the example in the previous section).
  2. The Frontend sends the object key in a POST request to the Backend.
  3. The Backend takes the object key and generates a pre-signed upload URL.
  4. The Frontend uploads the image using the pre-signed upload URL.
  5. The Frontend sends the image URL to the Backend so it can be stored in the database.

Let’s look at an example of this function:

CODE: https://gist.github.com/brunomichetti/b2ab59783e013053b31136671aefb258.js
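
A hedged sketch of such a function; here a pre-signed POST is generated, which returns both the URL and the form fields the Frontend needs (a pre-signed PUT URL would be another valid option):

# s3_utils.py (continued)
import boto3
from django.conf import settings


def get_presigned_upload_url(object_key):
    """Generates the information the Frontend needs to upload a file to the bucket."""
    s3_client = boto3.client(
        "s3",
        aws_access_key_id=settings.AWS_ACCESS_KEY_ID,
        aws_secret_access_key=settings.AWS_SECRET_ACCESS_KEY,
    )
    # Returns a dict with "url" and "fields" to be used in a POST upload
    return s3_client.generate_presigned_post(
        Bucket=settings.AWS_STORAGE_BUCKET_NAME,
        Key=object_key,
        ExpiresIn=settings.AWS_PRESIGNED_EXPIRY,  # illustrative settings name
    )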

Finally, let’s look at an example viewset with a POST action that receives an object key and returns the necessary information to upload the file.

First, we must define the serializer like so:

CODE: https://gist.github.com/brunomichetti/1e0054227d4905a79d718e2fc92af1c0.js
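
For instance, a serializer that just validates the incoming object key could look like this:

# serializers.py
from rest_framework import serializers


class UploadURLSerializer(serializers.Serializer):
    # Object key built by the Frontend, e.g. "uploads/images/<timestamp>--image.png"
    object_key = serializers.CharField(max_length=255)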

Second, we must define the viewset with the corresponding action:

CODE: https://gist.github.com/brunomichetti/bc935d1facb98012e4f47c57eb585877.js
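
A sketch of such a viewset (class and action names are illustrative):

# views.py
from rest_framework import status, viewsets
from rest_framework.decorators import action
from rest_framework.response import Response

from .s3_utils import get_presigned_upload_url
from .serializers import UploadURLSerializer


class UploadURLViewSet(viewsets.ViewSet):
    @action(detail=False, methods=["post"])
    def upload_url(self, request):
        serializer = UploadURLSerializer(data=request.data)
        serializer.is_valid(raise_exception=True)
        upload_data = get_presigned_upload_url(serializer.validated_data["object_key"])
        # upload_data contains the URL and fields the Frontend needs for the upload
        return Response(upload_data, status=status.HTTP_200_OK)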

So, after this, we are sending the necessary information to upload the file. And that’s it! We have finished the proposed solution.

I have created a public repo where you can see this little example project to get further insight. If you’d like, you can download it and give it a try for yourself.

What to take away

Throughout this blog post, I presented a solution for a common problem for developers: managing large files in Django projects that are deployed on Heroku.

I described the flow and the tools I used in depth, and provided some useful tips and examples to help you build a good solution.

This doesn’t mean that the workaround we found is the only possible one, but it could very well be the most efficient.

I hope you enjoyed reading and if you have another process that you think is useful, feel free to let us know in the comments section. Thanks for reading!