Cloud Computing – Systems Architect

Automated backups to Google Drive with PHP API

Lukasz Kujawa — Sun, 14 Apr 2013 22:36:37 +0000

Where do you keep backups? I guess that depends on what do you backup. You might have a very clever answer for a business critical data but what about less important content? The best example would be a private blog. It will hurt if you lose your data but the odds are you’re not willing to pay for any reliable storage. If you care enough to backup it’s going to be to another server (if you own one), your own laptop or external hard drive. S3 is gaining popularity but not everybody have and want to open account on Amazon. On the other hand there is one reliable storage, which is 100% free and almost everybody have access to it. Yes, I’m talking about Google Drive.

In order to integrate your program with Google Drive you need to create a Google API project. You can do it at Google console page.

Press “Create Project” button and select services you want to use with the project. For the purpose of this example “Drive API” is enough.

Once the service is enabled click on API Access from the left hand side navigation.

Click on “Create an OAuth 2.0 client ID” button. Make up a project name and click “next”. On the Client ID settings page choose “Service Account” as the application type.

Press “Create client ID” button. Click “Download private key” to download… you guess it – private key! You need it to access your account. Bear in mind you can download it only once.

Now your service account it created. You will need the client id and email address in a second. Leave the Google console page open.

There is one important thing you need to be aware of. Service account is not your Google account. If you upload files to the service account’s drive you won’t see them in your Google Drive. It’s not a big problem because the uploaded files can be shared.

If for some reason you need to have files uploaded directly to your account you can’t use the service account. You will have to create a web application instead. That change the way how you authenticate. Web application requires a manual journey though OAuth. Backups usually work in background and there is no web interface for OAuth redirections. For that reason I prefer to use a private key.

Now when your API project is created you can download an example script I prepared for this post. It’s a command line utility in PHP which uploads a file to shared folder on Google Drive. It’s available on my GitHub account: cp2goole. For your convenience the script comes with Google API but don’t use it with your projects. Download the latest API with examples from the official page https://developers.google.com/drive/quickstart-php.

$ git clone https://github.com/lukaszkujawa/cp2google.git
$ cd cp2google/
$ vim cp2google.php

You will have to modify first lines of the script.

[email protected]' );

define( 'CLIENT_ID',  '700692987478.apps.googleusercontent.com' );
define( 'SERVICE_ACCOUNT_NAME', '[email protected]' );
define( 'KEY_PATH', '../866a0f5841d09660ac6d4ac50ced1847b921f811-privatekey.p12');

BACKUP_FOLDER – name of shared folder. The script will create it at the first run.
SHARE_WITH_GOOGLE_EMAIL – your google account.
CLIENT_ID – your project’s client id
SERVICE_ACCOUNT_NAME – your project’s account name. It’s called e-mail address on the console page.
KEY_PATH – path to the downloaded private key.

Replace those values to match your configuration. Save changes are run the file.

$ php cp2google.php README.md
Uploading README.md to Google Drive
Creating folder...
File: 0B9_ZqV369SiSM19KbTROWldqcFk created

Now check your Google Drive. You should find a new folder in the “Shared with me” section. You also should receive an e-mail saying that the file has been shared with you.

I won’t get though the code because it’s quite simple to understand. The only thing worth mentioning is that on Google Drive files and folders are the same thing. Folder has a specific mime type which is “application/vnd.google-apps.folder”.

Full documentation of Google Drive API can be found here https://developers.google.com/drive/v2/reference/. Most of the calls have an example in: JAVA, .NET, PHP, Python, Ruby, JavaScript, Go and Objective-C. It should be enough for most people😉

Google was always very generous when it comes down to storage. There are multiple ways to take advantage of that and backups are one of them. I wouldn’t use it to store business critical data but everything else should be just fine. It feels much more convenient then anything else.

Painless guide to Solr Cloud configuration

Lukasz Kujawa — Sat, 06 Apr 2013 08:43:59 +0000

“Cloud” become very ambiguous term and it can mean virtually anything those days. If you are not familiar with Solr Cloud think about it as one logical service hosted on multiple servers. Distributed architecture helps with scaling, fault tolerance, distributed indexing and generally speaking improves search capabilities.

All of that is very exciting and I’m highly impressed how the service is designed but… it’s relatively new product. If you play with the tutorial (which by the way is great) running multiple services on the same host does’t cause problems. Setting the service for production environment and using it is a different story. There are still some unresolved issues which can confuse for hours if not days.

I would like to share my experience with setting up Solr Cloud and highlight problems I came across. If you are completely new to the subject I recommend you to read Solr Cloud tutorial first. If you new to Solr have a look into my previous post.

All applications we are going to use in this post are written in Java. I’m doing my best to setup the service up to the highest standards but I’m not a Java developer. There might be some things which could be done better. If that’s the case I would love to hear your feedback.

Goals and assumptions for this tutorial are:

Solr 4.2.1 – this’s the latest version of Solr at the time of writing this post.
Tomcat7 – Solr comes with Jetty which is very helpful for development although our goal is production setup. The service will have to be monitor and maintained by sysops. They are usually more familiar with TomCat7 than other solutions.
External ZooKeeper – Solr has ZooKeeper build in but it’s not recommended to use it in production.
Operating system Ubuntu 12.04 – personal preferences (sorry RedHat folks).

If you would like to try this setup but you don’t have an access to multiple servers there are two options:

Use amazon EC2 micro instances (it’s free).
Use virtualisation – I recommend VMWare player. It’s free and fast.

A journey of a thousand miles begins with a single step. Lets log in to the first server and download ZooKeeper, Solr and TomCat7.

$ sudo apt-get install tomcat7 tomcat7-admin
$ wget http://apache.mirrors.timporter.net/zookeeper/current/zookeeper-3.4.5.tar.gz
$ wget http://mirrors.ukfast.co.uk/sites/ftp.apache.org/lucene/solr/4.2.1/solr-4.2.1.tgz

My links might be out of date. Make sure you download the latest version of ZooKeeper and Solr.

Before we do any configuration lets check your host name.

$ hostname
ubuntu

Now look for it in /etc/hosts

$ sudo vim /etc/hosts

If you find something like this:

127.0.1.1       ubuntu

Change the IP to your LAN IP address. This tiny thing gave me guru meditation for few days. It will make at least one of your Solr nodes to register as “127.0.1.1”. Localhost address doesn’t make any sense from cloud’s point of view. That will populate multiple issues with replication and leader election. It’s later hard to guess that all of those problems come from this silly source. Don’t repeat my mistake.

Unpack downloaded software.

$ tar zxfv zookeeper-3.4.5.tgz
$ tar zxfv solr-4.2.1.tgz

The easier job is to setup ZooKeeper. You will do it only once on the first server. ZooKeeper scales well and you can run it on all Solr nodes if required. There is no need for this at the moment so we can take single server approach.

Create a directory for zookeeper data and set configuration to point to that place.

$ sudo mkdir -p /var/lib/zookeeper
$ cd zookeeper-3.4.5/
$ cp conf/zoo_sample.cfg conf/zoo.cfg
$ vim conf/zoo.cfg

Find dataDir and paste appropriate path.

dataDir=/var/lib/zookeeper

Start ZooKeeper.

$ bin/zkServer.sh start
JMX enabled by default
Using config: /home/lukasz/zookeeper-3.4.5/bin/../conf/zoo.cfg
Starting zookeeper ... STARTED

If you like you can use ZooKeeper client to connect with the server.

$ bin/zkCli.sh -server 127.0.0.1:2181
[zk: 127.0.0.1:2181(CONNECTED) 0] ls /
[zookeeper]

Type “quit” to exit.

Now lets insert Solr configuration into ZooKeeper. Go to Solr directory and have a look into solr-webapp. It should be empty.

$ cd solr-4.2.1/example/
$ ls solr-webapp/

Please notice I’m using the example directory. In real live you obviously want to rename it to something better. The same with collections. I’m going to use the default collection1 for this tutorial.

If your solr-webapp doesn’t have solr.war inside run Solr for few seconds to make it extract the file.

# java -jar start.jar
2013-04-05 09:38:58.132:INFO:oejs.Server:jetty-8.1.8.v20121106
2013-04-05 09:38:58.150:INFO:oejdp.ScanningAppProvider:Deployment monitor /root/solr-4.2.1/example/contexts at interval 0
2013-04-05 09:38:58.153:INFO:oejd.DeploymentManager:Deployable added: /root/solr-4.2.1/example/contexts/solr-jetty-context.xml
2013-04-05 09:38:58.209:INFO:oejw.WebInfConfiguration:Extract jar:file:/root/solr-4.2.1/example/webapps/solr.war!/ to /root/solr-4.2.1/example/solr-webapp/webapp

After this line you can press ctrl+c to stop the server.

$ ls webapps/solr.war
webapps/solr.war

Now we can start uploading configuration to the ZooKeeper.

$ cloud-scripts/zkcli.sh -cmd upconfig -zkhost 127.0.0.1:2181 -d solr/collection1/conf/ -n default1
$ cloud-scripts/zkcli.sh -cmd linkconfig -zkhost 127.0.0.1:2181 -collection collection1 -confname default1 -solrhome solr
$ cloud-scripts/zkcli.sh -cmd bootstrap -zkhost 127.0.0.1:2181 -solrhome solr

If you would like to learn more about the zkcli script have a look here http://docs.lucidworks.com/display/solr/Command+Line+Utilities.

Now if you login to ZooKeeper and run “ls /” command you should see the uploaded data.

$ bin/zkCli.sh -server 127.0.0.1:2181

[zk: localhost:2181(CONNECTED) 0] ls /
[configs, zookeeper, clusterstate.json, aliases.json, live_nodes, overseer, collections, overseer_elect]

[zk: 127.0.0.1:2181(CONNECTED) 1] ls /configs
[default1]

[zk: 127.0.0.1:2181(CONNECTED) 3] ls /configs/default1
[admin-extra.menu-top.html, currency.xml, protwords.txt, mapping-FoldToASCII.txt, solrconfig.xml, lang, stopwords.txt, spellings.txt, mapping-ISOLatin1Accent.txt, admin-extra.html, xslt, scripts.conf, synonyms.txt, update-script.js, velocity, elevate.xml, admin-extra.menu-bottom.html, schema.xml]

[zk: localhost:2181(CONNECTED) 4]  get /configs/default1/schema.xml 

// content of your schema.xml

[zk: 127.0.0.1:2181(CONNECTED) 5] quit
Quitting…

This step is obviously not required but it’s good to know what happens inside each service and how to get there.

If you impatient then you can go to “solr-4.2.1/example/” and run the service.

$ java -DzkHost=localhost:2181 -jar start.jar

it should work in Cloud mode and if you are happy with running it that way you can skip TomCat setup. If that’s the case visit http://SERVER01_IP:8080/solr/#/~cloud to confirm it’s working.

If you go to that URL have a look into first sub item in the navigation. It’s called “Tree”. Does it look familiar? Yes, it’s ZooKeeper’s data.

The final step is to setup TomCat. Stop Solr (Ctrl + c) if you run it and go to TomCat’s directory.

$ cd /etc/tomcat7/Catalina/localhost/
$ vim solr.xml

Paste below configuration. Make sure docBase and Environment path match your setup.

Enable admin user for TomCat.

$ vim /etc/tomcat7/tomcat-users.xml

Add

to tomcat-users tag.

You are almost there. The last thing is to “tell” Solr to use ZooKeeper. We already know how to do it from command line. When you run Solr from an external container you have to edit solr.xml.

$ vim solr-4.2.1/example/solr/solr.xml

Find top tag called solr and add zkHost attribute.

While you are editing solr.xml go to cores tag and set hostPort attribute to 8080.

Restart Tomcat.

$ sudo /etc/init.d/tomcat7 restart

Open web browser and go to http://SERVER01_IP:8080/manager/html. You will be asked for username and password which you set in the previous step (admin/secret).

Find “/solr” in Applications list and click on “start” in commands column. If it fails with a message “FAIL – Application at context path /solr could not be started” it’s most likely permissions issue. You can resolve it with

$ chown tomcat7.tomcat7 -R /home/lukasz/solr-4.2.1/

If it still doesn’t work you can troubleshoot it in “/var/log/tomcat7/catalina.*.log”.

Once the service is running you can access it under http://SERVER01_IP:8080/solr/#/.

That was first server. To have a cloud you need at least one more. The steps are exactly the same with a difference you can skip everything related to ZooKeeper. Make sure to set correct IP address for zkHost in solr.xml.

Run second server and go to http://SERVER01_IP:8080/solr/#/~cloud. You should see two servers replicating Collection1.

Just to remind you. If one of your servers has local IP like 127.0.1.1 there is a problem with your /etc/hosts file. If you made any mistake you can always start again. Stop TomCat servers, login to ZooKeeper and remove “clusterstate.json”.

[zk: localhost:2181(CONNECTED) 1] rmr /clusterstate.json

Now you can insert some data into your index.

$ cd solr-4.2.1/example/exampledocs/
$ vim post.sh

Bash script needs to be updated because it points to the default port.

URL=http://localhost:8080/solr/update

Run the script.

$ ./post.sh mem.xml
Posting file mem.xml to http://localhost:8080/solr/update


075




064

So far so good. Now lets use Collections API to create a new collection.

http://ONE_OF_YOUR_SERVERS:8080/solr/admin/collections?action=CREATE&name=hello
&numShards=2&replicationFactor=1&collection.configName=default1

This should add new collection called “hello”. The collection will use previously uploaded configuration “default1” and is going to be split into both server.

That looks more interesting. If you click on core selector (bottom of the left hand side navigation) you will notice the core is called “hello_shard1_replica1”. On the other server the name will be “hello_shard2_replica1”. You can still use “hello” name to query any of the server, for example:

http://ONE_OF_YOUR_SERVERS:8080/solr/hello/select?q=*%3A*&wt=xml&indent=true

If you are not on Solr 4.3 yet you have to be aware of very confusing bug – SOLR-4584. On some occasions you might not wish to store particular index on every server. For example, your cloud consists of 3 servers and you set shards to 1 and replication factor to 2. If you make a query to a server which physically don’t store the data you will get an error. This is obviously undesired behavior and will get fix. Right now you have to live with it so my recommendation is to use all servers.

It takes some effort to set everything up but it’s definitely worth it. There are some problems around Solr and the Cloud setup could be easier but I’m convinced all of those issues will be eventually addressed. If you still have some capacity for more Solr knowledge watch this speech: Solr 4: The SolrCloud Architecture.

Beginners guide to Amazon Cloud

Lukasz Kujawa — Thu, 14 Mar 2013 23:06:22 +0000

Everybody heard about Amazon Cloud. It has been around for a good few years. Despite its popularity not everybody had a chance to try it. Amazon tries it best to get more people on board by offering “Micro” versions of their web service for free. If you are into technology or just want a quality hosting for your website, it’s worth getting familiar with AWS. After all.. can you beat free?

First you have to create an account. Go to http://aws.amazon.com/ and register. Once you have an account, login and go to “AWS Managment Console”. The link is available under “My Account / Console” tab.

Whooa… that’s lots of products. Don’t worry, it’s not that complicated as it looks.

To install your software on Amazon you need a server. Choose EC2 from Compute & Networking section. In the realm of cloud computing servers are called instances. Click on “instance” from left hand side navigation. You should see something like on the below image (obviously your instances list should be empty).

To create a new instance click on a big button called “Launch Instance”. That should bring a JavaScript modal.

Stay with the “Classic Wizzard” and click continue.

Now it’s getting interesting. On this screen you are asked to select your distribution. My choice is 64bit Ubuntu but it’s just a personal preference. Before we go any further have a quick look on “My AMIs” tab. After you launch and setup your instance you can create an image from it. Later when you will require more computing power you can fire new instances from the image. Very cool, isn’t it? Select your distribution and go to the next step.

Now you have to select instance type. Go for the first option called T1 Micro. Make sure it says “Free tier eligible”. Click continue.

The next step is called “Advance Instance Option”. There is nothing you want to change there. Click continue. The same on “Storage Device Configuration” and on another page.

At this step you have to create a key pair. You will need it to login into your EC2 instance. If you are new to SSH keys you can find more details here. Always protect and backup keys. You can’t download them from Amazon and you can’t regenerate them for running instance. If you lose it you won’t be able to login. I learned it the hard way.

The next step is to select security group. It allows you to open / close certain ports on your server. You can go for the default option and edit the group later.

That’s it. The last screen is summary of your settings. If you are happy with everything click on “Launch” button.

To find out what is the address of your instance select it from the list. It should be something like “ec2-54-246-44-13.eu-west-1.compute.amazonaws.com”. If you are on Linux or Mac you can login immediately.

$ ssh -i path/to/key.pem ubuntu@you-address

Windows users need to convert the .pem key to format compatible with putty. Dowload PuTTYgen.exe, open it and select “Convertions > Import key”. Chose your Amazon key. Once the file is loaded select “File > Save private key”. Password protection is optional, you don’t have to do it. After that step you are ready to load the key with Putty Agent. If you don’t have it download pageant.exe from Putty website. Run pageant.exe and load the key. Now you can open Putty and login to your EC2 server.

If you plan to use the instance as a web server create a load balancer. One load balancer is free and it’s practical to use it. If for any reason you will have to stop your server IP address will change. It means you will have to change domain settings and wait 24h for propagation… not good. This is another thing I learned the hard way. It’s also better to have it ready in case you have to scale.

Click on “Load Balancers” on the left hand side navigation. It’s under “Network & Security” section. Click on “Create Load Balancer” button. It’s very easy setup. Just chose a name and go to step 3.

On this screen you have to specify which instances should be use with the load balancer. At this stage you should have only one item. Select it and click continue.

When you finish with configuration click on your load balancer. You will find notes which explain what is the best way to setup your domain.

Micro instance is good enough for a web server but might get slow with a database. For a database you might chose RDS. Go back to Amazon Web Services list by clicking on cube in the left top corner. Look for RDS under Database section. Click on DB Instances and then on Launch DB Instance button.

After selecting database engine pay attention to DB Instance Class settings. If you don’t want to pay for it select t1.db.micro and minimum storage size.

Using RDS is not require. You can fire another EC2 instance and set everything manual. The advantage of using RDS is free backups (to a certain level) and easy configuration.

I keep database and web server on the same EC2 instance. It wouldn’t survive the traffic it has without Varnish Cache. You can read about it here.

If you want to have multiple web servers you need to think how are you going to share use session. You can install memcached on each of your web servers or use ElastiCache.

ElastiCache is another service under Database section. There isn’t much more to say about it. Just be aware it’s there and you can use it.

This post covered basics of working with AWS however it should be enough to run a medium size website. Using Amazon is easy and fun. Number of features might be intimidating at the first glance but after few minutes it all start to make sense.

Some useful links to help you find out more about pricing and setups:

– AWS Free Usage Tier
– EC2 hardware configurations
– EC2 pricing
– RDS pricing