Wednesday, February 19, 2014

Automatic mounting of remote storage via SSHFS on Amazon EC2 instances

In this post I demonstrate how to create an Amazon EC2 instance image that automatically mounts a folder on a remote server via SSHFS.
The purpose here is to fire up an EC2 compute server, run a program, and save the output from that program on our local compute cluster at the university.

Basically, you just need to add a line to /etc/fstab and save the instance as an image (that's what I did).

What you need:

  • An Amazon EC2 instance with sshfs installed.
  • A user with SSH keys properly setup to the remote system (the SSH keys cannot require a passphrase).
Suppose your remote server has a folder named remote_folder and your instance has a folder named local_folder. The default username on Amazon's Ubuntu instances is "ubuntu", so I'm using that in the example.
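If you don't have a passphrase-less key pair on the instance yet, you can generate one and push the public half to the remote server. A minimal sketch (the /tmp path below is just for illustration; on the instance you would typically use /home/ubuntu/.ssh/id_rsa, and "remoteserver" stands in for your actual host):

```shell
# Start clean, then generate a passphrase-less RSA key pair (-N "" = empty passphrase)
rm -f /tmp/ec2_sshfs_key /tmp/ec2_sshfs_key.pub
ssh-keygen -t rsa -N "" -f /tmp/ec2_sshfs_key -q

# Install the public key on the remote server (replace user/host with your own):
# ssh-copy-id -i /tmp/ec2_sshfs_key.pub ubuntu@remoteserver

# Both halves of the key pair should now exist
ls /tmp/ec2_sshfs_key /tmp/ec2_sshfs_key.pub
```

After ssh-copy-id you should be able to SSH to the remote server without being asked for a password, which is exactly what the fstab mount below needs.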

sshfs#ubuntu@remoteserver:/home/ubuntu/remote_folder/ /home/ubuntu/local_folder/  fuse    user,delay_connect,_netdev,reconnect,uid=1000,gid=1000,IdentityFile=/home/ubuntu/.ssh/id_rsa,idmap=user,allow_other,workaround=rename  0   0

Everything goes on one long line in /etc/fstab. The IdentityFile option points to your SSH key. You need the "_netdev" keyword so the SSHFS folder is only mounted after the network becomes available. The "reconnect" keyword does what it says, so throw that in as well.
I read a few posts from other people who had difficulties mounting SSHFS properly without the "delay_connect" and "workaround=rename" keywords, so I added those for good measure.

Note that you need the trailing / after the folder names! It won't work without them (and I'm speaking from bitter experience here).
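Before baking the image, it's worth sanity-checking the entry. A small sketch that writes the line to a scratch copy first, so a typo can't break the real fstab (on the instance you would of course append to /etc/fstab itself):

```shell
# Write the entry to a scratch file for checking (append to /etc/fstab on the instance)
cat > /tmp/fstab_demo <<'EOF'
sshfs#ubuntu@remoteserver:/home/ubuntu/remote_folder/ /home/ubuntu/local_folder/  fuse    user,delay_connect,_netdev,reconnect,uid=1000,gid=1000,IdentityFile=/home/ubuntu/.ssh/id_rsa,idmap=user,allow_other,workaround=rename  0   0
EOF

# A valid fstab entry has exactly six whitespace-separated fields
awk 'NF == 6 { ok++ } END { exit (ok == NR && NR > 0) ? 0 : 1 }' /tmp/fstab_demo \
  && echo "fstab entry looks OK"

# Once the line is in /etc/fstab, mount it without rebooting:
#   sudo mount /home/ubuntu/local_folder
```

The `sudo mount <mountpoint>` form reads the options straight from fstab, so it tests exactly what will happen at boot.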

Furthermore, you want to add the following line to /etc/ssh/ssh_config:

ServerAliveInterval 5

This makes SSH send a keep-alive signal every 5 seconds so you don't get disconnected for being idle.

Apart from that I think the above should be self-explanatory (for someone looking for this information).

Friday, February 14, 2014

Chemistry on the Amazon EC2

We are trying out the Amazon EC2 compute cloud for running computations in the Jensen Group. This is a note on how things are going so far.

It was actually extremely easy to set up. Within minutes of creating the Amazon Web Services (AWS) account, I had a free instance of Ubuntu 12.04.3 LTS up and running and was able to SSH into it.
You get access to one free virtual machine and 750 free hours per month for the first year, so it costs nothing to get started. My free instance had some Intel processor, 0.5 GB RAM and 8 GB disk space (I think the specs change from time to time).
I copied binaries for PHAISTOS (the program we are looking to run) over and they ran successfully, and things pretty much went without a hitch.
After trying out the free instance, I just saved the image (you can do that via the web interface), and every other instance I just started from the same image so no configuration was needed after the first time.
I mounted a folder located on the university server via SSHFS, which I use to store output data from the instance directly on our server. This way I don't lose data if the instance is terminated, and I don't have to log in to the instance to check output or log files.

The biggest problem for me was the vast number of different instance types. You can select everything from memory-optimized to CPU-, storage-, interconnect- or GPU-optimized instances, and these come in several different sizes each. This takes a bit of research and there is a lot of fine print. E.g. Amazon doesn't specify the physical core count, but rather "vCPU", which may or may not include hyperthreading (i.e. the vCPU number may be twice what you actually get!).
Also, the price varies depending on where the data center hosting your instances is located. I picked the N. Virginia data center, which was the cheapest, and I see no reason to pick one of the others. The closest to me is in Ireland, but it is about 15% more expensive, and Asia seems to be even more expensive.

Managing payment is also surprisingly easy. I had my own free account, which I used in the beginning. +Jan Jensen created an account using the university billing account number, and we then used the Consolidated Billing option to link my account so the bill is sent to Jan's account.

Our current project is pretty much only CPU-intensive and barely requires any storage or memory, so naturally I had to benchmark the instance types that are CPU optimized.

I tested the largest (by CPU count) instances I could launch in the General Purpose (m3 tier), Compute Optimized (c3 tier) and previous-generation Compute Optimized (c1 tier) categories. These are the m3.2xlarge, c3.2xlarge and c1.xlarge instances.

In short these machines are:

name = core count (processor type) ~ hourly price (geographical location of server)

m3.2xlarge = 4 physical cores (Intel E5-2670 @ 2.60 GHz) ~ $0.90/hour (N. Virginia)
c3.2xlarge = 4 physical cores (Intel E5-2680 v2 @ 2.80 GHz) ~ $0.60/hour (N. Virginia)
c1.xlarge = 8 physical cores* (Intel E5-2650 @ 2.00 GHz) ~ $0.58/hour (N. Virginia)

*The c1.xlarge doesn't support hyperthreading, from what I could gather. The m3.2xlarge is more expensive because it has faster disks and more RAM. Initially, I thought the m3.2xlarge had 8 physical cores, but it turns out I was merely fooled by the "vCPU" number and several pages of fine print in the pricing list.

As a test, I launched a Metropolis-Hastings simulation in PHAISTOS starting from the native structure of Protein G with the PROFASI force field at 300K with the same seed (666) in all the tests, and noted the iteration speed as a function of cores.

The maximum number of total iterations (all threads collectively) per day was comparable for the three instances (see below), maxing out at around 500-600 million/day, with a slight win for the quad-core c3.2xlarge instance when hyperthreading on 8 threads.
There was no real benefit to spawning more than 8 concurrent threads either.

What is probably more important is the throughput per USD spent. Again, the c3.2xlarge wins (when hyperthreading on 8 threads) and is the cheapest for our purpose.
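The cost comparison is just (iterations/day) divided by (hourly price × 24). A back-of-the-envelope sketch; the throughput numbers below are placeholders within the 500-600 million/day range above, so substitute your own benchmark results:

```shell
# iterations per USD = (millions of iterations/day) / (price per hour * 24 hours)
# Throughputs are placeholder values -- replace them with your own measurements.
awk 'BEGIN {
  n = split("m3.2xlarge c3.2xlarge c1.xlarge", name);
  split("0.90 0.60 0.58", price);     # USD per hour (N. Virginia)
  split("500 600 550", mday);         # millions of iterations per day (placeholders)
  for (i = 1; i <= n; i++)
    printf "%-10s %5.1f M iterations per USD\n", name[i], mday[i] / (price[i] * 24);
}'
```

Even with rough numbers the ordering is clear: the m3.2xlarge's extra RAM and disk speed buy nothing for a pure CPU workload, so it comes out well behind the two compute-optimized tiers.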