Binary Data (GridFS)

Introduction

RESTHeart offers complete binary content management in a fully RESTful style. It’s possible to insert, read and delete large binary files by leveraging MongoDB’s underlying GridFS store.

Once a file has been created in a bucket, only its JSON properties can be updated. The binary content is immutable.

What’s GridFS?

GridFS is a specification for storing and retrieving files that exceed the BSON-document size limit of 16MB.

Instead of storing a file in a single document, GridFS divides a file into parts, or chunks, and stores each of those chunks as a separate document. By default GridFS limits chunk size to 255 kB. GridFS uses two collections to store files: one stores the file chunks, and the other stores file metadata.
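To make the chunking arithmetic concrete, here is a minimal sketch (not RESTHeart or driver code). It assumes the default chunk size of 255 kB, i.e. 261120 bytes, and the function name `chunk_layout` is made up for this example.

```python
DEFAULT_CHUNK_SIZE = 255 * 1024  # 261120 bytes, the default chunk size


def chunk_layout(length, chunk_size=DEFAULT_CHUNK_SIZE):
    """Return (number_of_chunks, size_of_last_chunk) for a file of `length` bytes."""
    if length < 0 or chunk_size <= 0:
        raise ValueError("length must be >= 0 and chunk_size must be > 0")
    if length == 0:
        # In this sketch an empty file needs no chunk documents at all.
        return 0, 0
    count = (length + chunk_size - 1) // chunk_size  # integer ceiling division
    last = length - (count - 1) * chunk_size
    return count, last


# A 220973-byte picture fits in one chunk; a 16 MB file needs 65 chunks.
print(chunk_layout(220973))
print(chunk_layout(16 * 1024 * 1024))
```

Every chunk except the last one is full, which is why only the size of the last chunk needs to be computed.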

When you query a GridFS store for a file, the driver or client will reassemble the chunks as needed. You can perform range queries on files stored through GridFS. You also can access information from arbitrary sections of files, which allows you to “skip” into the middle of a video or audio file.
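Skipping into the middle of a file works because a byte range maps onto a small set of chunks. The sketch below illustrates that mapping under the assumption of fixed-size chunks numbered from 0 (the default 261120-byte size is assumed); `chunks_for_range` is a hypothetical helper, not part of any driver.

```python
DEFAULT_CHUNK_SIZE = 255 * 1024  # 261120 bytes


def chunks_for_range(offset, size, chunk_size=DEFAULT_CHUNK_SIZE):
    """Return the chunk sequence numbers covering bytes [offset, offset + size)."""
    if offset < 0 or size <= 0 or chunk_size <= 0:
        raise ValueError("offset must be >= 0, size and chunk_size must be > 0")
    first = offset // chunk_size
    last = (offset + size - 1) // chunk_size
    return list(range(first, last + 1))


# Reading 100000 bytes starting at byte 1000000 touches only two chunks.
print(chunks_for_range(1_000_000, 100_000))
```

Only the listed chunks have to be fetched, no matter how large the whole file is.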

GridFS is useful not only for storing files that exceed 16MB but also for storing any files for which you want access without having to load the entire file into memory.

Read more.

Specifications

RESTHeart is committed to adhering to Internet standards. To successfully upload files, RESTHeart’s clients must strictly follow the Internet Media Type multipart/form-data as defined in RFC 7578.

The multipart/form-data specification allows a form to upload multiple files at the same time. RESTHeart prohibits this: while it’s possible to POST / PUT several different non-binary parts for each file, each POST / PUT request must upload exactly one binary file. This limitation is necessary so that all the optional parts can be unambiguously related to a single file.
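For clients that build requests by hand, the following sketch shows what such a body could look like: one optional non-binary part and exactly one binary part. It is an illustration written for this page, not RESTHeart code. The part names `properties` and `file` follow the curl examples on this page; the helper name `build_multipart` is hypothetical, and file names containing quotes or line breaks are simply rejected to keep the sketch short.

```python
import json
import uuid


def build_multipart(file_name, file_bytes, content_type, properties=None):
    """Build a multipart/form-data body holding exactly one binary part.

    Returns a (body, content_type_header) tuple. `properties`, if given,
    is sent as an extra non-binary part named "properties".
    """
    if any(c in file_name for c in '"\r\n'):
        raise ValueError("file_name must not contain quotes or line breaks")
    boundary = uuid.uuid4().hex  # random, so a clash with the content is very unlikely
    delimiter = b"--" + boundary.encode("ascii")
    lines = []
    if properties is not None:
        lines += [
            delimiter,
            b'Content-Disposition: form-data; name="properties"',
            b"",
            json.dumps(properties).encode("utf-8"),
        ]
    lines += [
        delimiter,
        ('Content-Disposition: form-data; name="file"; filename="%s"'
         % file_name).encode("utf-8"),
        ("Content-Type: %s" % content_type).encode("ascii"),
        b"",
        file_bytes,
        delimiter + b"--",  # closing delimiter
        b"",
    ]
    return b"\r\n".join(lines), "multipart/form-data; boundary=" + boundary


body, header = build_multipart("picture.png", b"\x89PNG", "image/png",
                               {"name": "Maurizio"})
print(header)
```

The returned header value goes into the request’s Content-Type header, so that the server knows which boundary separates the parts.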

Using curl

Let’s use curl to upload a PNG image named “picture.png”.

What’s curl?

curl is an open source command line tool and library for transferring data with URL syntax. It’s installed by default on most Linux distributions, macOS and most Unix systems. A Windows version is also available.

Read more.

In this example we assume RESTHeart is running on localhost (127.0.0.1), port 8080, but of course your setup can be different.

POST a binary file with curl

1) Create a database, if none exists yet:

``` plain
$ curl -u user:password -X PUT -H "Content-Type: application/json" 127.0.0.1:8080/testdb
```

2) Create the collection for hosting files. Its name must end with `.files` to mark it as a special collection for files:

``` plain
curl -v -u user:password -X PUT -H "Content-Type: application/json" 127.0.0.1:8080/testdb/mybucket.files
```

3) POST the file. Set the file parameter to the actual binary file, using curl’s @ syntax (file=@picture.png).

curl -v -u user:password -X POST -F "file=@picture.png" 127.0.0.1:8080/testdb/mybucket.files

You’ll see verbose output like the following, with the server returning 201 Created:

*   Trying 127.0.0.1...
* TCP_NODELAY set
* Connected to 127.0.0.1 (127.0.0.1) port 8080 (#0)
* Server auth using Basic with user 'admin'
> PUT /testdb/mybucket.files/picture.png HTTP/1.1
> Host: 127.0.0.1:8080
> Authorization: Basic YWRtaW46Y2hhbmdlaXQ=
> User-Agent: curl/7.54.0
> Accept: */*
> Content-Length: 221176
> Expect: 100-continue
> Content-Type: multipart/form-data; boundary=------------------------d2d164f1076f1b8b
> 
< HTTP/1.1 100 Continue
< Content-Length: 0
< HTTP/1.1 201 Created
< X-Powered-By: restheart.org
< Auth-Token-Location: /_authtokens/admin
< Access-Control-Expose-Headers: Location, ETag, Auth-Token, Auth-Token-Valid-Until, Auth-Token-Location, X-Powered-By
< Date: Sun, 08 Apr 2018 21:26:14 GMT
< Auth-Token: 4tplfhpu5uy5hq1ih4r2tg2f1b9kygp4sypr1oe7yvr3fby43d
< Connection: keep-alive
< Access-Control-Allow-Origin: *
< Auth-Token-Valid-Until: 2018-04-08T21:41:13.951Z
< Access-Control-Allow-Credentials: true
< Content-Length: 0
< Content-Type: application/json
< 
* Connection #0 to host 127.0.0.1 left intact

The Location response header gives the file’s location, for example:

Location: http://127.0.0.1:8080/testdb/mybucket.files/552e1e89e4b019b2fa790f3f

Note that the location contains the object ID automatically generated by MongoDB (see the string “552e1e89e4b019b2fa790f3f” at the end of the above URL). This is a unique identifier and it’s convenient in many situations, but it’s not always desirable. In many cases it would be better to explicitly name the resource with something more readable and meaningful. To set the resource name, upload the file with the PUT verb instead of POST.
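A client that needs the generated ID can read it off the Location URL. Here is a small sketch, assuming the ID is always the last path segment as in the example above; `file_id_from_location` is a name made up for this example.

```python
from urllib.parse import unquote, urlsplit


def file_id_from_location(location):
    """Return the last path segment of a Location URL, i.e. the file's id."""
    path = urlsplit(location).path.rstrip("/")
    return unquote(path.rsplit("/", 1)[-1])


location = "http://127.0.0.1:8080/testdb/mybucket.files/552e1e89e4b019b2fa790f3f"
print(file_id_from_location(location))
```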

Uploading files with PUT

According to the HTTP specification, POST semantics delegate to the owning resource the decision on how to handle the creation of the owned resource. In the previous example, the owning mybucket.files collection assigned the new file a resource name generated by MongoDB. If we want to choose a meaningful name ourselves, we just need to send a PUT, like this:

curl -v -u user:password -X PUT -F "file=@picture.png" 127.0.0.1:8080/testdb/mybucket.files/picture.png
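The same idea can be sketched in Python: with PUT the client decides the last segment of the URL. The helper below only prepares the request and never sends it. It assumes HTTP Basic authentication, as in the curl examples; `build_put_request` and its parameters are hypothetical names, and `body` is expected to already be a multipart/form-data payload.

```python
import base64
from urllib.parse import quote
from urllib.request import Request


def build_put_request(base_url, bucket, name, body, content_type, user, password):
    """Prepare (without sending) a PUT that stores `body` under the chosen `name`."""
    # Percent-encode the name so that spaces or slashes cannot alter the URL path.
    url = "%s/%s/%s" % (base_url.rstrip("/"), bucket, quote(name, safe=""))
    credentials = ("%s:%s" % (user, password)).encode("utf-8")
    request = Request(url, data=body, method="PUT")
    request.add_header("Content-Type", content_type)
    request.add_header("Authorization",
                       "Basic " + base64.b64encode(credentials).decode("ascii"))
    return request


request = build_put_request("http://127.0.0.1:8080/testdb", "mybucket.files",
                            "picture.png", b"...",
                            "multipart/form-data; boundary=example",
                            "user", "password")
print(request.get_method(), request.full_url)
```

Passing the prepared object to `urllib.request.urlopen` would actually send it.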

If we GET the resulting resource, here is the full HTTP response:

HTTP/1.1 200 OK
Access-Control-Allow-Credentials: true
Access-Control-Allow-Origin: *
Access-Control-Expose-Headers: Location, ETag, Auth-Token, Auth-Token-Valid-Until, Auth-Token-Location, X-Powered-By
Auth-Token: 4tplfhpu5uy5hq1ih4r2tg2f1b9kygp4sypr1oe7yvr3fby43d
Auth-Token-Location: /_authtokens/admin
Auth-Token-Valid-Until: 2018-04-08T21:41:33.169Z
Connection: keep-alive
Content-Encoding: gzip
Content-Length: 249
Content-Type: application/json
Date: Sun, 08 Apr 2018 21:26:33 GMT
ETag: 5aca88f5634459000711d931
X-Powered-By: restheart.org

{
    "_id": "picture.png",
    "_links": {
        "rh:data": {
            "href": "/db/mybucket.files/picture.png/binary"
        }
    },
    "chunkSize": 261120,
    "filename": "file",
    "length": {
        "$numberLong": "220973"
    },
    "md5": "0f289f7447e9db7af996e112e8fe3de1",
    "metadata": {
        "_etag": {
            "$oid": "5aca88f5634459000711d931"
        },
        "contentType": "image/png"
    },
    "uploadDate": {
        "$date": 1523222774054
    }
}
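The `$numberLong` and `$date` wrappers in this document need a little unwrapping on the client side. Here is a minimal sketch that does so for a trimmed copy of the body above; it assumes exactly this shape, and `summarize` is a hypothetical helper name.

```python
import json
from datetime import datetime, timedelta, timezone

# Trimmed copy of the response body shown above.
BODY = """
{
    "_id": "picture.png",
    "_links": {"rh:data": {"href": "/db/mybucket.files/picture.png/binary"}},
    "chunkSize": 261120,
    "length": {"$numberLong": "220973"},
    "metadata": {"contentType": "image/png"},
    "uploadDate": {"$date": 1523222774054}
}
"""

EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)


def summarize(doc):
    """Unwrap the typed values of a file document into plain Python values."""
    return {
        "id": doc["_id"],
        "length": int(doc["length"]["$numberLong"]),
        "content_type": doc["metadata"]["contentType"],
        # $date holds milliseconds since the Unix epoch.
        "uploaded": EPOCH + timedelta(milliseconds=doc["uploadDate"]["$date"]),
        "binary_href": doc["_links"]["rh:data"]["href"],
    }


info = summarize(json.loads(BODY))
print(info["length"], info["uploaded"].isoformat(), info["binary_href"])
```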

Note that in this case the resource has a much more readable URL:

127.0.0.1:8080/testdb/mybucket.files/picture.png

This is easier to read and link to than the automatically generated name, but it requires a little more work on your side.

If you want to better understand the theoretical difference between the POST and PUT verbs in terms of the Hypertext Transfer Protocol (HTTP/1.1), this blog post might be useful: RESTful API memo: PUT and POST differences.

The “properties” part

It’s possible to add an optional form data part to the request by using the properties field name. The content of this field is automatically parsed as JSON data, so it must be valid JSON:

$ curl -v -u user:password -X POST -F 'properties={"name":"Maurizio", "surname":"Turatti"}' -F "file=@picture.png" 127.0.0.1:8080/testdb/mybucket.files

The JSON will be merged into the file document (look at the end of the code block):


HTTP/1.1 200 OK
Access-Control-Allow-Credentials: true
Access-Control-Allow-Origin: *
Access-Control-Expose-Headers: Location, ETag, Auth-Token, Auth-Token-Valid-Until, Auth-Token-Location, X-Powered-By
Auth-Token: 4tplfhpu5uy5hq1ih4r2tg2f1b9kygp4sypr1oe7yvr3fby43d
Auth-Token-Location: /_authtokens/admin
Auth-Token-Valid-Until: 2018-04-08T21:44:08.928Z
Connection: keep-alive
Content-Encoding: gzip
Content-Length: 270
Content-Type: application/json
Date: Sun, 08 Apr 2018 21:29:08 GMT
ETag: 5aca899f634459000711d933
X-Powered-By: restheart.org

{
    "_id": {
        "$oid": "552e1e89e4b019b2fa790f3f" 
    },
    "_links": {
        "rh:data": {
            "href": "/db/mybucket.files/552e1e89e4b019b2fa790f3f/binary"
        }
    },
    "chunkSize": 261120,
    "filename": "file",
    "length": {
        "$numberLong": "220973"
    },
    "md5": "0f289f7447e9db7af996e112e8fe3de1",
    "metadata": {
        "_etag": {
            "$oid": "5aca899f634459000711d933"
        },
        "contentType": "image/png",
        "name": "Maurizio",
        "surname": "Turatti"
    },
    "uploadDate": {
        "$date": 1523222943247
    }
}

If the “properties” part contains a “filename” field, its value sets the file’s filename. For example, passing the following part sets the filename to different_filename.png:

'properties={"filename":"different_filename.png"}'
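Since the properties part must be valid JSON, it is safer to generate it with a JSON serializer than to type it by hand. Here is a minimal sketch; `properties_part` is a hypothetical helper, and it assumes the properties are a JSON object.

```python
import json


def properties_part(properties):
    """Return a `properties=<json>` form field string for a dict of properties."""
    if not isinstance(properties, dict):
        raise TypeError("properties must be a dict (a JSON object)")
    return "properties=" + json.dumps(properties, ensure_ascii=False)


print(properties_part({"filename": "different_filename.png"}))
print(properties_part({"name": "Maurizio", "surname": "Turatti"}))
```

The resulting string can be used as the value of a curl -F option.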

GET a binary file with curl

GET the representation (metadata only) for the newly created file:

$ curl -u user:password http://127.0.0.1:8080/testdb/mybucket.files/552e1e89e4b019b2fa790f3f

 

Appending `/binary` at the end of the above URL makes it possible to GET the binary content stored in GridFS:

$ curl -u user:password http://127.0.0.1:8080/testdb/mybucket.files/552e1e89e4b019b2fa790f3f/binary

If we paste this URL into a browser’s address bar, the image is displayed. This allows RESTHeart to serve as a very basic but powerful digital asset management system.
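After downloading the binary content, a client may want to check that it arrived intact. The sketch below compares the MD5 digest of the downloaded bytes with the `md5` property of the file document. It assumes that property is present, as in the responses shown on this page; `matches_md5` is a made-up name.

```python
import hashlib


def matches_md5(data, file_doc):
    """Return True if the MD5 digest of `data` equals the document's `md5` value."""
    return hashlib.md5(data).hexdigest() == file_doc["md5"].lower()


# Stand-in values: real code would pass the bytes returned by the /binary URL
# and the JSON document returned by the metadata URL.
content = b"example content"
document = {"md5": hashlib.md5(content).hexdigest()}
print(matches_md5(content, document))
print(matches_md5(content + b"!", document))
```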


As explained, the underlying storage mechanism is MongoDB’s GridFS, which is an extremely powerful abstraction over a filesystem, without any file size limits.

When should I use GridFS?

For documents in a MongoDB collection, you should always use GridFS for storing files larger than 16 MB.

In some situations, storing large files may be more efficient in a MongoDB database than on a system-level filesystem.

  • If your filesystem limits the number of files in a directory, you can use GridFS to store as many files as needed.
  • When you want to keep your files and metadata automatically synced and deployed across a number of systems and facilities. When using geographically distributed replica sets MongoDB can distribute files and their metadata automatically to a number of mongod instances and facilities.
  • When you want to access information from portions of large files, you can use GridFS to recall sections of files without reading the entire file into memory.

Read More.

Using httpie

While curl does its job, we prefer httpie. This little Python program indents and colorizes HTTP responses, making its output more readable than curl’s.

Let’s say we already have a db called testdb. Exactly as with curl, first create the .files collection, if it doesn’t exist yet:

$ http -a user:password -j PUT 127.0.0.1:8080/testdb/mybucket.files

Then POST a file named “picture.png”:

http -a user:password -f POST 127.0.0.1:8080/testdb/mybucket.files file@picture.png

To upload a file named “mypicture.png” with PUT, adding some properties:

http -a user:password -f PUT 127.0.0.1:8080/testdb/mybucket.files/mypicture.png file@"~/Desktop/mypicture.png" properties='{"author":"Maurizio Turatti"}'