What is ExaBucketFS?

So what is ExaBucketFS?

ExaBucketFS is essentially just a file system or store within the cluster. Files are automatically replicated across all nodes.

There are two main use cases (that I know of) for ExaBucketFS. Firstly, it acts as a repository for our in-house and third-party libraries (e.g. Python, Java, R). Secondly, it lets us store binary data, such as trained statistical models, which the Exasol database itself cannot traditionally hold.

Previously, before version 6, the Exasol cluster required internet access to fetch the libraries used by UDFs. For custom libraries, you would have had to set up your own repository, and again be able to reach it over the internet.

ExaBucketFS does require some administration though.

To use ExaBucketFS there is an HTTP API that lets you put, get, and remove files in the cluster, and you can drive it with curl. In the last post, I discussed how to use curl to upload files to a bucket. Files added to a bucket are then accessible for use within UDFs.
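As a rough sketch of that API (the IP address, port 2585, bucket name libraries and password 123 are the demo values set up later in this post, and model.bin is just a placeholder file, so adjust everything for your own setup):

Upload a file (w is the write user, followed by the bucket's write password):

curl -X PUT -T model.bin http://w:123@YOUR_IP_HERE:2585/libraries/model.bin

List the bucket, or fetch a file back:

curl YOUR_IP_HERE:2585/libraries

curl YOUR_IP_HERE:2585/libraries/model.bin -o model.bin

Remove a file:

curl -X DELETE http://w:123@YOUR_IP_HERE:2585/libraries/model.bin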

Buckets can be password protected for reading and/or writing, or left public, although this seems to me like a purely DBA-type task: get a password on them!

There are also a few dos and don'ts for ExaBucketFS:

  • Don't write to buckets concurrently: buckets are non-transactional.
  • Buckets and files do not get backed up, so make sure you back them up somewhere else (by your own means!).
  • Don't use ExaBucketFS as general-purpose storage. There is 100% replication across all nodes, so a file you store once is copied to every node and consumes disk space accordingly (a 1 GB file on a 4-node cluster costs 4 GB overall).

Exasol’s knowledgeable Mathias Brink describes ExaBucketFS in one of the Exasol videos here.

HOW TO: Get your UDFs working with Python libraries after Exasol 6 upgrade

We recently took the plunge and upgraded to Exasol V6. There are a few changes, listed in the upgrade notes here. Like all upgrades, I thoroughly recommend you study these notes in detail, as there can always be something to trip you up.

This happened to us, with the changed implementation of UDFs and the way libraries are stored and used. You can read all about how it used to work here, but for now let's go through how to get your Python libraries working with your UDFs again! I'll be splitting the content over a couple of future posts too, as there's quite a lot to go through!

Get Curl!

First off, you’re going to need Curl. Get it here. Extract it on your computer, and remember where you put it!

Set up Bucket Services

Next up, let's check out the changes in ExaOperation. There's a new tab called ExaBuckets. Head on in there and click Add.

[Screenshot: Add bucket service dialog]

By default, the HTTP port for the bucket service is 2580. If you create another bucket service after this one, you'll need to change the port number (which I need to do here, so I'm using 2585). Add a Description; this will be how you access the bucket. I'll use "Demo".

My ExaBuckets tab now looks like this. It has bfsdefault, which is the bucket service you get by default with Version 6.

[Screenshot: ExaBuckets tab listing the bucket services]

Click on the ID of the service you just created. This takes you through to where you can create the actual bucket. Click Add.

[Screenshot: bucket service with no buckets yet]

I've added the name "libraries", which we'll need later. I've ticked Public readable, and provided a read password ('xyz') and a write password ('123') for the demo.

[Screenshot: Add bucket dialog]

Your bucket will be displayed like this.

[Screenshot: the newly created libraries bucket]

Upload library to Bucket using Curl

OK, now let's use the bucket we created. I'm going to be using Windows, so open up a cmd window and change directory to the curl folder from earlier. Let's list out what we find: the libraries bucket.

c:\curl>curl YOUR_IP_HERE:2585

libraries

To put something in the bucket, first download the library you need locally, from a reputable source such as PyPI.
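If you have pip available, one way to fetch a source tarball like the one used below is pip download; the version pin and the C:\downloads target folder are just example values:

c:\>pip download boto3==1.4.7 --no-deps --no-binary :all: -d C:\downloads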

Now we'll get it into our bucket. We use a curl PUT, passing the write password we set earlier (note the w:123@ in the URL, where w is the BucketFS write user), along with the bucket name we specified (libraries). Here, I'm uploading the boto3 library.

c:\curl>curl -X PUT -T C:\boto3-1.4.7.tar.gz http://w:123@YOUR_IP_HERE:2585/libraries/boto3-1.4.7.tar.gz

Check your library is there by listing out the bucket.

c:\curl>curl YOUR_IP_HERE:2585/libraries

boto3-1.4.7.tar.gz
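Because we ticked Public readable, listing and downloading needs no credentials. If the bucket weren't public, the read password would go in the URL the same way as the write one, with r as the read user (xyz is the demo read password set earlier):

c:\curl>curl http://r:xyz@YOUR_IP_HERE:2585/libraries

c:\curl>curl http://r:xyz@YOUR_IP_HERE:2585/libraries/boto3-1.4.7.tar.gz -o boto3-check.tar.gz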

Use library in UDF script

So, we've uploaded our library to the bucket, but our UDFs still don't work! Your script needs the bucket path adding to Python's search path before the import statement will work.

[Screenshot: UDF script with the bucket path added before the import]
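As a rough illustration of the idea (this is my sketch rather than the exact script in the screenshot; the demo schema, the bucket service ID bucketfs1 and the exact sub-path under /buckets/ are assumptions, so check where your archive actually unpacked, for example by listing the directory from a UDF):

CREATE OR REPLACE PYTHON SCALAR SCRIPT demo.boto3_version()
RETURNS VARCHAR(100) AS
import sys

# Assumed layout: /buckets/<bucket service ID>/<bucket name>/<unpacked archive>/
# bucketfs1 and the boto3-1.4.7 sub-path are placeholders for this sketch.
sys.path.append('/buckets/bucketfs1/libraries/boto3-1.4.7')

import boto3

def run(ctx):
    # return the library version to prove the import worked
    return boto3.__version__
/

A quick SELECT demo.boto3_version(); then confirms the library can be imported inside the UDF.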

Exasol have added a video of how to do some of this here.

Personally, I wish they'd kept the GUI in ExaOperation (or some flavour of it), as it was way quicker to use, and this is such a faff to do. It's also one more Exasol-specific thing to remember on top of just writing a Python script with a traditional import.

Anyway, it’s all there for you if you read it (from me or Exasol!), and a great reminder to study the upgrade notes and think about their implications.