Thursday, August 6, 2015

Adding functionalities to existing classes in Scala

New functionality can be added to existing classes by wrapping them in a new class and adding implicit methods to convert back and forth between it and the original class:

class TLong(val value: Long) {
  def +(other: TLong) = new TLong(value + other.value)
  def decrement = new TLong(value - 1L)
  override def toString: String = value.toString
}

// implicit methods for conversions
implicit def toTLong(l: Long) = new TLong(l)
implicit def toLong(tl: TLong) = tl.value

// some tests
val l1: TLong = new TLong(1)
val l2: TLong = new TLong(2)
l1 + l2
1L + l2
l1 + 2L

From Scala 2.10, you can use implicit classes so that you don't have to define conversion methods as they are automatically created:
implicit class ImplicitLong(val l: Long) {
  def print = l.toString
}

1L.print

Saturday, June 13, 2015

Running Java applications on CloudFoundry

Introduction

CloudFoundry v2 uses Heroku-style buildpacks to package the droplet on which an application will run. Before staging, CF checks which of the locally available buildpacks can be used to prepare the application runtime. The buildpack contract is composed of the following scripts (which can be written in shell, Python, Ruby, etc.):
  • Detect: checks if this buildpack is suitable for the submitted application,
  • Compile: prepares the runtime environment of the application,
  • Release: finally launches the application

Applications with single file

Java applications, whether standalone or web, are managed by the java-buildpack. If a manifest.yml is used to submit the application, then for a web application or an executable jar it may look like:
---
applications:
- name: APP_NAME
  memory: 4G
  disk_quota: 2G
  timeout: 180
  instances: 1
  host: APP_NAME-${random-word}
  path: /path/to/war/file.war or /path/to/executable/file.jar

The java-buildpack checks whether the file is a .war, in which case it launches a Tomcat container, or an executable jar, in which case it looks for the Main-Class entry in META-INF/MANIFEST.MF.

Applications with many files

In case the application is composed of multiple files (jars, assets, configs, etc.), the java-buildpack won't be able to automatically detect the appropriate container to use. We need:
1. For the Detect phase to choose which container is appropriate (here the java-main): Clone the java-buildpack and set the java_main_class property in config/java_main.yml.

2. In the manifest: indicate the path to the folder containing all artifacts that should be downloaded to the droplet at the Compile phase.

3. In the manifest: set the command that will be used at the Release phase to launch the application. 

An example of java_main.yml file:
---
java_main_class: package.name.ClassName

An example of a manifest.yml file:
---
applications:
- name: APP_NAME
  memory: 2G
  timeout: 180
  instances: 1
  host: APP_NAME-${random-word}
  path: ./
  buildpack: http://url/to/custom/java-buildpack
  command: $PWD/.java-buildpack/open_jdk_jre/bin/java -cp $PWD/*:. -Djava.io.tmpdir=$TMPDIR package.name.ClassName

Application submission

$ cf push              # submit an application
$ cf logs APP_NAME     # access the application logs
$ cf events APP_NAME   # access CF events related to this application
$ cf files APP_NAME    # access the VCAP user home where the application files are stored

Troubleshooting

If the application fails to start for some reason (you may see no logs), you can check what command was used to launch the application as follows:
$ CF_TRACE=true cf app app_name | grep "detected_start_command"

Note 
  • Uploaded jar files are extracted under /home/vcap/app/APP_NAME in the droplet.
  • for an executable jar, the application needs to accept traffic on the port given by CF in the VCAP_APP_PORT environment variable; otherwise CF will consider that the application failed to start and will shut it down (see the sketch after this note).
  • to check if a Java program is running on CloudFoundry:
import org.cloudfoundry.runtime.env.CloudEnvironment;
...
CloudEnvironment cloudEnvironment = new CloudEnvironment();
if (cloudEnvironment.isCloudFoundry()) {
    // activate cloud profile      
    System.out.println("On A cloudfoundry environment");
}else {
    System.out.println("Not on A cloudfoundry environment");
}
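
A minimal sketch of an entry point that honors the port-binding note above, assuming the JDK's built-in com.sun.net.httpserver is acceptable (the class name and response text are illustrative, not from the original notes):

import com.sun.net.httpserver.HttpServer;
import java.io.OutputStream;
import java.net.InetSocketAddress;
import java.nio.charset.StandardCharsets;

public class Main {
    public static void main(String[] args) throws Exception {
        // CF injects the port to listen on; fall back to 8080 for local runs.
        String cfPort = System.getenv("VCAP_APP_PORT");
        int port = (cfPort != null) ? Integer.parseInt(cfPort) : 8080;

        HttpServer server = HttpServer.create(new InetSocketAddress(port), 0);
        server.createContext("/", exchange -> {
            byte[] body = "Hello from Cloud Foundry".getBytes(StandardCharsets.UTF_8);
            exchange.sendResponseHeaders(200, body.length);
            try (OutputStream os = exchange.getResponseBody()) {
                os.write(body);
            }
        });
        server.start();
        System.out.println("Listening on port " + port);
    }
}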

Resources:
  • Standalone (non-web) applications on Cloud Foundry - link

Sunday, May 31, 2015

DEV 301 - Developing Hadoop Applications

1. Introduction to Developing Hadoop Applications
- Introducing MapReduce concepts and history
- Describing how MapReduce works at a high level and how data flows in it

The typical example of a MapReduce application is Word Count. The input consists of many files that are split among the TaskTracker nodes where the files are located. Each split contains multiple records; here a record is a line. The Map function receives key-value pairs and uses only the value (i.e. the line) to emit an occurrence of each word, one at a time. A Combine function then aggregates the occurrences locally and passes them to the Shuffle phase. The latter is handled by the framework and gathers the output of the prior functions by key before sending it to the reducers. The Reduce function takes the list of all occurrences (i.e. values) of a word (i.e. key) and sums them up to return the total number of times the word has been seen.
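The classic Java implementation of this flow, using the org.apache.hadoop.mapreduce API, looks roughly like the following sketch (equivalent in spirit to the word count shipped in the examples jar used below; class names are illustrative):

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

  // Map: for each line (the value), emit (word, 1) pairs; the key (byte offset) is ignored.
  public static class TokenizerMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(LongWritable key, Text value, Context context)
        throws IOException, InterruptedException {
      StringTokenizer itr = new StringTokenizer(value.toString());
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        context.write(word, ONE);
      }
    }
  }

  // Reduce (also used as the Combiner): sum all counts seen for a given word.
  public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable val : values) {
        sum += val.get();
      }
      context.write(key, new IntWritable(sum));
    }
  }

  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(new Configuration(), "word count");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenizerMapper.class);
    job.setCombinerClass(IntSumReducer.class);   // local aggregation before the shuffle
    job.setReducerClass(IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}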
MapReduce example: Word Count
Run the word count example:
1. Prepare a set of input text files:
$ mkdir -p /user/user01/1.1/IN1
$ cp /etc/*.conf /user/user01/1.1/IN1 2> /dev/null
$ ls /user/user01/1.1/IN1 | wc -l
2. Run word count application using the previously created files
$ hadoop jar /opt/mapr/hadoop/hadoop-2.4.1/share/hadoop/mapreduce/hadoop-mapreduce-examples-2.4.1-mapr-1408.jar wordcount /user/user01/1.1/IN1 /user/user01/1.1/OUT1
3. Check the job output
$ wc -l /user/user01/1.1/OUT1/part-r-00000
$ more /user/user01/1.1/OUT1/part-r-00000

Trying binary files as input:
$ mkdir -p /user/user01/1.1/IN2/mybinary
$ cp /bin/cp /user/user01/1.1/IN2/mybinary
$ file /user/user01/1.1/IN2/mybinary
$ strings /user/user01/1.1/IN2/mybinary | more
$ hadoop jar /opt/mapr/hadoop/hadoop-0.20.2/hadoop-0.20.2-dev-examples.jar wordcount /user/user01/1.1/IN2/mybinary /user/user01/1.1/OUT2
$ more /user/user01/1.1/OUT2/part-r-00000
Look for references to the word AUTH in the input and output:
$ strings /user/user01/1.1/IN2/mybinary | grep -c AUTH
$ egrep -ac AUTH /user/user01/1.1/OUT2/part-r-00000

MapReduce execution summary and data flow


MapReduce Workflow:
  • Load production data into HDFS with tools like Sqoop for SQL data, Flume for log data, or traditional tools, since MapR-FS supports POSIX operations and NFS access.
  • Analyze, Store, Read.
The InputFormat object is responsible for validating the job input, splitting files among mappers and instantiating the RecordReader. By default, the size of an input split is equal to the size of a block, which is 64 MB in Hadoop; in MapR it is the size of a chunk, which is 256 MB. Each input split references a set of records which will be broken into key-value pairs for the Mapper. The TaskTracker passes the input split to the RecordReader constructor, which reads the records one by one and passes them to the Mapper as key-value pairs. By default, the RecordReader considers a line to be a record. This can be modified by extending the RecordReader and InputFormat classes to define different records in the input file, for example multi-line records (a minimal sketch follows).
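A minimal sketch of that extension point, assuming the new mapreduce API (the class name is illustrative, not from the course); it keeps each file in a single split and delegates to the stock LineRecordReader, which is where a real multi-line format would plug in its own RecordReader:

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.InputSplit;
import org.apache.hadoop.mapreduce.JobContext;
import org.apache.hadoop.mapreduce.RecordReader;
import org.apache.hadoop.mapreduce.TaskAttemptContext;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.LineRecordReader;

public class WholeFileTextInputFormat extends FileInputFormat<LongWritable, Text> {

  @Override
  protected boolean isSplitable(JobContext context, Path file) {
    // Keep each file in a single split so multi-line records are never cut in half.
    return false;
  }

  @Override
  public RecordReader<LongWritable, Text> createRecordReader(InputSplit split,
                                                             TaskAttemptContext context) {
    // Reuse the stock line reader; a real multi-line format would return a custom
    // RecordReader that accumulates several lines per key-value pair.
    return new LineRecordReader();
  }
}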
The Partitioner takes the output generated by the Map functions and hashes the record key to create partitions based on the key. By default, each partition is passed to one reducer; this behavior can be overridden (see the partitioner sketch below). As part of the Shuffle operation, the partitions are then sorted and merged in preparation for the reducers. Once an intermediate partition is complete, it is sent over the network using protocols like RPC or HTTP.
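A minimal sketch of such a custom partitioner (illustrative, not part of the course material): it sends words starting with a digit to partition 0 and spreads everything else by hash.

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Partitioner;

// Illustrative routing rule: "numeric" words all go to reducer 0,
// every other word is spread over the remaining reducers by hash.
public class FirstCharPartitioner extends Partitioner<Text, IntWritable> {
  @Override
  public int getPartition(Text key, IntWritable value, int numPartitions) {
    String word = key.toString();
    if (numPartitions == 1 || (!word.isEmpty() && Character.isDigit(word.charAt(0)))) {
      return 0;
    }
    return 1 + (word.hashCode() & Integer.MAX_VALUE) % (numPartitions - 1);
  }
}
// Registered on the job with: job.setPartitionerClass(FirstCharPartitioner.class);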
The result of a MapReduce job is written to an output directory:
  • an empty file named _SUCCESS is created to indicate the success of the job,
  • the history of the job is captured under the _log/history* directory,
  • the output of the reduce tasks is captured under part-r-00000, part-r-00001, ...
  • if you run a map-only job the output will be part-m-00000, part-m-00001, ...

Hadoop Job Scheduling

Two schedulers are available in Hadoop; the one to use is declared in mapred-site.xml:

  • By default the Fair Scheduler is used, where resources are shared evenly across pools (sets of resource slots) and each user has its own pool. Custom pools can be configured to guarantee minimum access and prevent starvation. This scheduler supports preemption.
  • Capacity Scheduler: resources are shared across queues; the administrator configures hierarchical queues (each with a percentage of the total cluster resources) to control access to resources. Queues have ACLs to control user access, and it is also possible to configure soft and hard limits per user within a queue. This scheduler supports resource-based scheduling and job priorities.

YARN architecture

Hadoop Job Management

Depending on the MapReduce version, there are different ways to manage Hadoop jobs:

  • MRv1: through web UIs (JobTracker, TaskTracker), MapR metrics database, hadoop job CLI.
  • MRv2 (YARN): through web UIs (Resource Manager, Node Manager, History Server), MapR metrics database (for future releases), mapred CLI.


The DistributedShell example
$ yarn jar /opt/mapr/hadoop/hadoop-2.4.1/share/hadoop/yarn/hadoop-yarn-applications-distributedshell-2.4.1-mapr-1408.jar -shell_command /bin/ls -shell_args /user/user01 -jar /opt/mapr/hadoop/hadoop-2.4.1/share/hadoop/yarn/hadoop-yarn-applications-distributedshell-2.4.1-mapr-1408.jar
Check the application logs
$ cd /opt/mapr/hadoop/hadoop-2.4.1/logs/userlogs/application_1430664875648_0001/
$ cat container_1430664875648_0001_01_000002/stdout # stdout file
$ cat container_1430664875648_0001_01_000002/stderr # stderr file

The logs can also be accessed from the History Server Web UI at http://node-ip:8088/

to be continued

Wednesday, April 8, 2015

ADM 201 - Hadoop Operations: Cluster Administration

This article gathers notes taken from MapR's ADM 201 class, which mainly covers:
- Testing & verifying Hardware before installing MapR Hadoop
- Installing MapR Hadoop
- Benchmarking the MapR Hadoop cluster and configuring a new cluster for production
- Monitoring Cluster for failures & performance

Prerequisites

Install the cluster shell utility, declare the slave nodes, and check that it can access the nodes properly
$ sudo -i
$ apt-get install clustershell
$ mv /etc/clustershell/groups /etc/clustershell/groups.original
$ echo "all: 192.168.2.212 192.168.2.200" > /etc/clustershell/groups
$ clush -a date

MapR Cluster Validation

Inconsistency in the hardware (e.g. different disk sizes or CPU cores) may not cause installation failure, but it may cause poor cluster performance. Benchmarking tools (cluster validation github repo) allow measuring the cluster's performance.

The remainder of this section addresses pre-install cluster hardware tests:
1. Download Benchmark Tools
2. Prepare Cluster Hardware for Parallel Execution of Tests
3. Test & Measure Subsystem Components
4. Validate Component Software & Firmware

Grab the validation tools from the GitHub repo
$ curl -L -o cluster-validation.tgz http://github.com/jbenninghoff/cluster-validation/tarball/master
$ tar xvzf cluster-validation.tgz
$ mv jbenninghoff-cluster-validation-*/ ./
$ cd pre-install/

Copy the pre-install folder to all nodes, and check if it succeeded
$ clush -a --copy /root/pre-install/
$ clush -a ls /root/pre-install/

Test the hardware for specification heterogeneity
$ /root/pre-install/cluster-audit.sh | tee cluster-audit.log

Test the network bandwidth for its ability to handle MapReduce operations:
First, set the IP addresses of the nodes in network-test.sh (divide them between half1 and half2).
$ /root/pre-install/network-test.sh | tee network-test.log

Test memory performance
$ clush -Ba '/root/pre-install/memory-test.sh | grep ^Triad' | tee memory-test.log

Test disk performance
The disk-test.sh script checks disk health and performance (i.e. throughput for sequential and random read/write I/O); it destroys any data on the disks.
$ clush -ab /root/pre-install/disk-test.sh
For each scanned disk there will be a result file of the form disk_name-iozone.log.

MapR Quick Install - link

Minimum requirements:
  • 2-4 cores (at least two: 1 CPU for OS, 1 CPU for filesystem)
  • 6 GB of RAM
  • 20 GB of raw disk (should not be formatted or partitioned)

First, download installer script
$ wget http://package.mapr.com/releases/v4.1.0/ubuntu/mapr-setup
$ chmod 755 mapr-setup
$ ./mapr-setup

Second, configure the installation process (e.g. define data and control nodes). A sample configuration can be found in /opt/mapr-installer/bin/config.example
$ cd /opt/mapr-installer/bin
$ cp config.example config.example.original
Use following commands to find information on nodes to declare in the configuration
$ clush -a lsblk # list drive names
$ clush -a mount # list IP addresses and mounted drives

Edit config.example file
  • Declare the nodes information (IP addresses and data drives) under the Control_Nodes section. 
  • Customize the cluster domain by replacing my.cluster.com with your own.
  • Set a new password (e.g. mapr)
  • Declare the disks and set ForceFormat to true.
Install MapR (the installation script uses Ansible behind the scenes)
$ ./install --cfg config.example --private-key /root/.ssh/id_rsa -u root -s -U root --debug new
MapR Cluster Services - link
If the installation succeeds, you can log in to https://master-node:8443/ with mapr:mapr to access the MapR Control System (MCS) and then get a new license.
If the installation fails, remove the install folder, then check the installation logs at /opt/mapr-installer/var/mapr-installer.log. Failures may be caused by:
  • problems formatting disks for MapR FS (check /opt/mapr/logs/disksetup.0.log).
  • one of the nodes has less than 4G of memory
  • disks with LVM setup
As a last resort, you can remove all MapR packages and re-install:

$ rm -r -f /opt/mapr/ # remove installation folder
$ dpkg --get-selections | grep -v deinstall | grep mapr
mapr-cldb                                       install
mapr-core                                       install
mapr-core-internal                              install
mapr-fileserver                                 install
mapr-hadoop-core                                install
mapr-hbase                                      install
mapr-historyserver                              install
mapr-mapreduce1                                 install
mapr-mapreduce2                                 install
mapr-nfs                                        install
mapr-nodemanager                                install
mapr-resourcemanager                            install
mapr-webserver                                  install
mapr-zk-internal                                install
mapr-zookeeper                                  install
$ dpkg -r --force-depends <package names from the list above> # remove all listed packages

To check if the cluster is running properly, we can run the following quick test job.
Note: check that the names of cluster nodes are resolvable through DNS, otherwise declare them in the /etc/hosts of each node.
$ su - mapr
$ cd /opt/mapr/hadoop/hadoop-2.5.1/share/hadoop/mapreduce/
$ yarn jar hadoop-mapreduce-examples-2.5.1-mapr-1501.jar pi 8 800

Benchmark the Cluster

1. Hardware Benchmarking
First, copy the post-install folder to all nodes
$ clush -a --copy /root/post-install
$ clush -a ls /root/post-install

Second, run tests to check drive throughput and establish a baseline for future comparison
$ cd /root/post-install
$ clush -Ba '/root/post-install/runRWSpeedTest.sh' | tee runRWSpeedTest.log

2. Application Benchmarking
Use specific MapReduce jobs to create test data and process it in order to challenge the performance limits of the cluster.
First, create a volume for the test data
$ maprcli volume create -name benchmarks -replication 1 -mount 1 -path /benchmarks

Second, generate random sequence of data
$ su mapr
$ yarn jar /opt/mapr/hadoop/hadoop-2.5.1/share/hadoop/mapreduce/hadoop-mapreduce-examples-2.5.1-mapr-1501.jar teragen 5000000 /benchmarks/teragen1

Then, sort the data and write the output to a directory
$ yarn jar /opt/mapr/hadoop/hadoop-2.5.1/share/hadoop/mapreduce/hadoop-mapreduce-examples-2.5.1-mapr-1501.jar terasort /benchmarks/teragen1 /benchmarks/terasort1

To analyze how long it takes to perform each step check the logs on the JobHistoryServer
$ clush -a jps | grep -i JobHistoryServer

Cluster Storage Resources

MapR FS organizes the drives of a cluster into storage pools. A storage pool is a group of drives (three by default) on a single physical node. Data is stored across the drives of the cluster's storage pools. If one drive fails, the entire storage pool is lost; to recover it, we need to take all drives of this pool offline, replace the failed drive, then return them to the cluster.
Three drives per pool gives a good balance between read/write speed when ingesting huge amounts of data and recovery time for failed drives.
Storage pools hold units called containers (32 GB by default), which are logically organized into volumes (a concept specific to MapR FS). By default, the replication factor of containers inside a volume is set to three. We can choose a replication pattern across containers: chain pattern or star pattern.
$ maprcli volume create -name <volume_name> -type 0|1

When writing a file, the Container Location Database (CLDB) is used to determine the first container where data is written. The CLDB replaces the function of a NameNode in MapR Hadoop; it stores the container replication factor and pattern information. A file is divided into chunks (256 MB by default): a small chunk size leads to high write-scheduling overhead, while a big chunk size requires more memory.
A topology defines the physical layout of the cluster's nodes. It's recommended to have two top-level topologies:
  • /data the parent topology for active nodes in the cluster
  • /decommissioned the parent topology used to segregate offline nodes or nodes to be repaired.
Usually, racks that house the physical nodes are used as sub-topology to /data.

Data Ingestion

Ingesting data into MapR-FS can be done through:
  • NFS (e.g. Gateway Strategy, Colocation Strategy), letting traditional applications perform multiple concurrent reads/writes easily - link,
  • Sqoop to transfer data between MapR-FS and relational databases,
  • Flume, a distributed service for collecting, aggregating & moving data into MapR-FS

Snapshots are read-only images of volumes at a specific point in time; more precisely, a snapshot is a pointer that costs almost nothing. It's a good idea to create them regularly to protect the integrity of the data. By default, a snapshot is scheduled automatically at volume creation; the schedule can be customized through the MCS, or a snapshot can be created manually as follows:
$ maprcli volume snapshot create -volume DemoVolume -snapshotname 12042015-DemoSnapshot

Mirrors are volumes that represent an exact copy of a source volume from the same or a different cluster; they take extra resources and time to create. By default, a mirror is a read-only volume, but it can be made writable. Mirrors can be created through the MCS (where the replication factor can be set) or manually as follows:
$ maprcli volume mirror start -name DemoVolumeMirror

$ maprcli volume mirror push -name DemoVolumeMirror

Configuring remote mirrors
First, edit cluster configuration file (in both clusters) to include the location of CLDB nodes on the remote one:
$ echo "cldb_addr1:7222 cldb_addr2:7222 cldb_addr3:7222" >> /opt/mapr/conf/mapr-clusters.conf
Second, copy this new configuration to all nodes in the cluster
$ clush -a --copy /opt/mapr/conf/mapr-clusters.conf
Third, restart the Warden service so that the modification takes effect:
$ clush -a service mapr-warden restart
Finally, start the mirroring from the MCS interface.

Cluster Monitoring

Once a cluster is up and running, it has to be kept running smoothly. The MCS provides many tools to monitor the health of the cluster and to investigate failure causes, including:

  • alarms: email notifications, Nagios notifications, etc.
  • statistics about nodes (e.g. services), volumes and jobs (MapR metrics database).

Standard logs for each node are stored at /opt/mapr/hadoop/hadoop-2.5.1/logs, while the centralized logs are stored in /mapr/MaprQuickInstallDemo/var/mapr/local/c200-01/logs at the cluster level.

Centralized logging automates the gathering of logs from all cluster nodes and provides a job-centric view. The following command creates a centralized log directory populated with symbolic links to all log files (tasks, map attempts, reduce attempts) related to a specific job.
$ maprcli job linklogs -jobid JOB_ID -todir MAPRFS_DIR

The MapR centralized logging feature is enabled by default in /opt/mapr/hadoop/hadoop-0.20.2/conf/hadoop-env.sh (MRv1) through the environment variable HADOOP_TASKTRACKER_ROOT_LOGGER.
For MRv1, the standard log for each node is stored under /opt/mapr/hadoop/hadoop-0.20.2/logs, while the centralized logs live under the cluster-level path mentioned above.

Alarms
When a disk failure alarm is raised, the report at /opt/mapr/logs/faileddisk.log gives information about which disks have failed, the reason for the failure, and the recommended resolution.


Cluster Statistics
MapR collects a variety of statistics about the cluster and running jobs. This information helps track cluster usage and health. The metrics can be written to an output file or consumed by Ganglia; the output type is specified in two hadoop-metrics.properties files:
  • /opt/mapr/hadoop/hadoop-0.20.2/conf/hadoop-metrics.properties for output of hadoop standard services
  • /opt/mapr/conf/hadoop-metrics.properties for output of MapR specific services
Collected metrics can be about services, jobs, nodes, and node monitoring.

Schedule Maintenance Jobs
The collected metrics give us a good view of the cluster's performance and health. The complexity of the cluster makes it hard to use these metrics alone to optimize how the cluster is running.
Run test jobs regularly to gather job statistics and watch cluster performance. If a variance in cluster performance is observed, actions need to be taken to bring performance back to its baseline. By doing this in a controlled environment, we can try different ways (e.g. tweaking the disk and role balancer settings) to optimize cluster performance.

Resources:

  • MapR installation - Lab Guide, Quick Installation Guide
  • Preparing Each Node - link
  • Setting up a MapR Cluster on Amazon Elastic MapReduce - link
  • Cluster service planning - link
  • Tuning cluster for MapReduce performance for specific jobs - link
  • MapR Hadoop data storage - link


Friday, April 3, 2015

Hadoop interview questions

1) HDFS file can ...

  • ... be duplicated on several nodes
  • ... compressed
  • ... combine multiple files
  • ... contain multiple blocks of different sizes

2) How does HDFS ensure the integrity of the stored data?
  • by comparing the replicated data blocks with each other
  • through error logs
  • using checksums
  • by comparing the replicated blocks to the master copy
3) HBase is ...
  • ... column oriented
  • ... key-value oriented
  • ... versioned
  • ... unversioned
  • ... use zookeeper for synchronization
  • ... use zookeeper for electing a master
4) An HBase table ...
  • ... needs a schema
  • ... doesn't need a schema
  • ... is served by only one server
  • ... is distributed by region
5) What does a major_compact on an HBase table?
  • It compresses the table files.
  • It combines multiple existing store files to one for each family.
  • It merges region to limit the region number.
  • It splits regions that are too big.
6) What is the relationship between Jobs and Tasks in Hadoop?
  • One job contains only one task
  • One task contains only one job
  • One Job can contain multiple tasks
  • One task can contain multiple jobs
7) The number of Map tasks to be launched in a given job mostly depends on...
  • the number of nodes in the cluster
  • property mapred.map.tasks
  • the number of reduce tasks
  • the size of input splits
8) If no custom partitioner is defined in Hadoop then how is data partitioned before it is sent to the reducer?
  • One by one on each available reduce slot
  • Statistically
  • By hash
9) In Hadoop can you set
  • Number of map
  • Number of reduce
  • Both map and reduce number
  • None, it's automatic
10) What is the minimum number of Reduce tasks for a Job?
  • 0
  • 1
  • 100
  • As many as there are nodes in the cluster
11) When a task fails, hadoop....
  • ... try it again
  • ... try it again until a failure threshold stops the job
  • ... stop the job
  • ... continue without this particular task
12) How can you debug map reduce job?
  • By adding counters.
  • By analyzing log.
  • By running in local mode in an IDE.
  • You can't debug a job.
References:
  • Hadoop wiki - link
  • Hadoop tutorial - link

Tuesday, March 24, 2015

Password-less SSH root access

So I had to configure password-less SSH access between a master machine and a slave one:

1. Create an SSH key pair on the master machine
root@master-machine$ ssh-keygen 

2. Create an SSH key pair on the slave machine,
root@slave-machine$ ssh-keygen

To copy the public key to the remote machine we need root access; however, by default password-based SSH access as root is not allowed.

3. On the slave machine: sudo passwd.
3.1. set a password for root (if not already set)
3.2. edit /etc/ssh/sshd_config (not /etc/ssh/ssh_config) to change PermitRootLogin without-password to PermitRootLogin yes.
3.3. restart the SSH daemon with service ssh restart, or service ssh reload if you are in an SSH session.

4. Copy master's root public key to the authorized keys in the slave machine
root@master-machine$ ssh-copy-id -i root@slave-machine

Disable password-based SSH access for root:
5. On the slave machine, edit /etc/ssh/sshd_config to change PermitRootLogin yes to PermitRootLogin without-password.

6. Now you can ssh as root from the master to the slave machine without password:
root@master-machine$ ssh root@slave-machine

For more details on SSH keys, check link.

Friday, March 20, 2015

Exposing services to CF applications

Service Broker API

The Service Broker (SB) API (full documentation) enables service providers to expose their offers to applications running on Cloud Foundry (CF). Implementing this contract allows the CF Cloud Controller (CC) to communicate with the service provider in order to perform:
  1. Catalog Management: register the offering catalog (e.g. different service plans), 
  2. Provisioning: create/delete a service instance (e.g. create a new MongoDB collection),
  3. Binding: connect/deconnect a CF application to a provisioned service instance.
For each of these possible actions, there is an endpoint defined in the Service Broker contract.

1. Catalog Management
The Service Broker (full documentation) should expose a catalog management endpoint that provides, in JSON format, information on the service itself, the different plans (e.g. free or not) that can be consumed by applications, and some metadata describing the service. A server-side sketch follows the example exchange below.

# The Cloud Controller sends the following request
GET http://broker-url/v2/catalog
# The Service Broker may reply as follows
< HTTP/1.1 200 OK
< Content-Type: application/json;charset=UTF-8
...
{
  "services": [
    {
      "planUpdatable": false,
      "id": "a unique service identifier",
      "name": "service name",
      "description": "service description",
      "bindable": true,
      "plan_updateable": false,
      "plans": [
        {
          "id": "a unique plan id",
          "name": "plan name",
          "description": "plan description",
          "metadata": { },
          "free": false
        }
      ],
      "tags": [ ],
      "metadata": { },
      "requires": [ ],
      "dashboard_client": null
    }
  ]
}
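
To make the catalog side of the contract concrete, here is a minimal, framework-free sketch, assuming the JDK's built-in HttpServer and a hard-coded catalog string (ids and names are placeholders); a real broker would also implement the provisioning and binding routes described next and check the credentials registered with cf create-service-broker:

import com.sun.net.httpserver.HttpServer;
import java.io.OutputStream;
import java.net.InetSocketAddress;
import java.nio.charset.StandardCharsets;

public class MinimalServiceBroker {

    // Static catalog answering GET /v2/catalog; ids and names are placeholders.
    private static final String CATALOG_JSON =
        "{\"services\":[{\"id\":\"service-id\",\"name\":\"demo-service\","
      + "\"description\":\"demo\",\"bindable\":true,"
      + "\"plans\":[{\"id\":\"plan-id\",\"name\":\"free\",\"description\":\"free plan\"}]}]}";

    public static void main(String[] args) throws Exception {
        HttpServer server = HttpServer.create(new InetSocketAddress(8080), 0);
        server.createContext("/v2/catalog", exchange -> {
            byte[] body = CATALOG_JSON.getBytes(StandardCharsets.UTF_8);
            exchange.getResponseHeaders().set("Content-Type", "application/json;charset=UTF-8");
            exchange.sendResponseHeaders(200, body.length);
            try (OutputStream os = exchange.getResponseBody()) {
                os.write(body);
            }
        });
        server.start();
    }
}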

2. Provisioning
The provisioning endpoint covers synchronous actions that the Service Broker performs on demand from the CC to create a new resource (or destroy an existing one) for the application. The CC sends a PUT request to a URL containing a designated instance identifier, with the service and plan identifiers in the JSON body; once the actions are performed, the Service Broker replies as shown below.

# The Cloud Controller sends the following request
PUT http://broker-url/v2/service_instances/:instance_id
{
  "service_id": "service identifier",
  "plan_id": "plan identifier",
  "organization_guid": "ORG identifier",
  "space_id": "SPACE identifier"
}
# The Service Broker may reply as follows
< HTTP/1.1 201 Created
< Content-Type: application/json;charset=UTF-8
...
{
  "dashboard_url": null
}

A service instance once created can be updated (e.g. upgrading service consumption plan). For this, the same query is sent to the SB with a body containing only the attribute to update:
{
  "plan_id": "new_plan_identifier"
}

3. Binding
Binding allows a CF application to connect to a provisioned service instance and start consuming the offered plan. When the SB receives a binding request from the CC, it replies with the necessary information (e.g. service URL, authentication information, etc.) for the CF application to use the offered service.

# The Cloud Controller sends the following request
PUT http://broker-url/v2/service_instances/:instance_id/service_bindings/:binding_id
{
  "service_id": "service identifier",
  "plan_id": "plan identifier",
  "app_guid": "application identifier"
}
# The Service Broker may reply as follows
< HTTP/1.1 201 Created
< Content-Type: application/json;charset=UTF-8
...
{
  "credentials": {
    "uri": "a uri to the service instance",
    "username": "username on the service",
    "password": "password for the username"
  },
  "syslog_drain_url": null
}

For unbinding the application from the service, the SB receives on the same URL a request with a DELETE method.

Note! 
All previous requests from the Cloud Controller to the Service Broker contain the X-Broker-Api-Version HTTP header. It designates the version of the Service Broker API (e.g. 2.4) supported by the Cloud Controller.

Managing Service Brokers

Once the previous endpoints are implemented, the SB can be registered with Cloud Foundry and exposed to applications with the following command:
$ cf create-service-broker SERVICE_BROKER_NAME USERNAME PASSWORD http://broker-url/

To check if the service broker is successfully implemented
$ cf service-brokers

Other management operations are available to update, rename or delete a service broker:
$ cf update-service-broker SERVICE_BROKER_NAME USERNAME PASSWORD http://broker-url/
$ cf rename-service-broker SERVICE_BROKER_NAME NEW_SERVICE_BROKER_NAME
$ cf delete-service-broker SERVICE_BROKER_NAME

Once the SB is created in CF database, its plans can be viewed with:
$ cf service-access

By default, all plans are disabled; pick the service name from the output of the previous command and then:
$ cf enable-service-access SERVICE_NAME # enable access to the service
$ cf marketplace -s SERVICE_NAME # output service plans

Managing Services
Once a service broker is available in the marketplace, an instance of the service can be created:
$ cf create-service SERVICE_NAME SERVICE_PLAN SERVICE_INSTANCE_NAME
Then service instances can be seen with:
$ cf services

Connecting service to application
To be able to connect an application to a service (running on a different network) and communicate with it, a route should be added through the definition of a security group. Security groups allow you to control the outbound traffic of a CF app:
$ cf create-security-group my_security_settings security.json

The content of security.json is as follows
[
  {
    "protocol": "tcp",
    "destination": "192.168.2.0/24",
    "ports":"80"
  }
]

Then, binding to a service instance should be performed as follows:
$ cf bind-service APP_NAME SERVICE_INSTANCE_NAME
Now, the application running on CF can access service instances through the credentials available from the environment variable VCAP_SERVICES.
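A minimal sketch of reading those credentials from inside the application, assuming Jackson is on the classpath and a service bound under the hypothetical label my-service (the label and credential keys depend on the actual broker):

import com.fasterxml.jackson.databind.JsonNode;
import com.fasterxml.jackson.databind.ObjectMapper;

public class VcapServicesReader {
    public static void main(String[] args) throws Exception {
        // VCAP_SERVICES holds a JSON object keyed by service label,
        // each entry being an array of bound service instances.
        String vcap = System.getenv("VCAP_SERVICES");
        if (vcap == null) {
            System.out.println("Not running on Cloud Foundry (VCAP_SERVICES is not set)");
            return;
        }
        JsonNode root = new ObjectMapper().readTree(vcap);
        JsonNode credentials = root.path("my-service")   // hypothetical service label
                                   .path(0)              // first bound instance
                                   .path("credentials");
        System.out.println("uri      = " + credentials.path("uri").asText());
        System.out.println("username = " + credentials.path("username").asText());
    }
}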

Resources
  • Managed services in CloudFoundry - link
  • CloudFoundry and Apache Brooklyn for automating PaaS with a Service Broker - link
  • Leveraging Riak with CloudFoundry - link



Wednesday, March 11, 2015

Pushing applications to CloudFoundry the Java way

Cloud Foundry provides a Java client API that can be used to do anything the CLI does. The following steps show how to connect to CF and publish an application using Java code:

1. Skip SSL validation
You may have to skip SSL validation to avoid sun.security.validator.ValidatorException:
SSLContext ctx = SSLContext.getInstance("TLS");
X509TrustManager tm = new X509TrustManager() {
  public void checkClientTrusted(X509Certificate[] xcs, String string) {
  }
  public void checkServerTrusted(X509Certificate[] xcs, String string) {
  }
  public X509Certificate[] getAcceptedIssuers() {
    return null;
  }
};
ctx.init(null, new TrustManager[] { tm }, null);
SSLContext.setDefault(ctx);

2. Connect to CloudFoundry
Connect to the CloudFoundry API endpoint (e.g. https://api.run.pivotal.io) and authenticate with your credentials:
String user = "admin";
String password = "admin";
String target = "https://api.10.244.0.34.xip.io";
CloudCredentials credentials = new CloudCredentials(user, password);
HttpProxyConfiguration proxy = new HttpProxyConfiguration("proxy_hostname", proxy_port);
String org = "ORG_NAME";     // target organization
String space = "SPACE_NAME"; // target space
CloudFoundryClient client = new CloudFoundryClient(credentials, target, org, space, proxy);

3. Create an application
String appName = "my-app";
List<String> urls = Arrays.asList("my-app.10.244.0.34.xip.io");
Staging staging = new Staging(null, "app_buildpack_git_repo");
int disk = 1024; // disk quota in MB (example value)
int mem = 512;   // memory limit in MB (example value)
client.createApplication(appName, staging, disk, mem, urls, Collections.<String>emptyList());

4. Push the application
ZipFile file = new ZipFile(new File("path_to_app_archive_file"));
ApplicationArchive archive = new ZipApplicationArchive(file);
client.uploadApplication(appName, archive);

5. Check the application state
StartingInfo startingInfo = client.startApplication(appName);
System.out.printf("Starting application: %s on %s%n", appName, startingInfo.getStagingFile());
for (CloudApplication application : client.getApplications()) {
  System.out.printf("  %s (%s)%n", application.getName(), application.getState());
}

6. Disconnect from CloudFoundry
client.logout();

Tuesday, February 24, 2015

Installing Cloud Foundry v2 locally on Vagrant

Cloud Foundry (CF)

Cloud Foundry (CF) is one of the many PaaS offerings out there that aim to let developers build their applications (e.g. web apps) without caring about infrastructure details. The PaaS handles the deployment, scaling and management of the apps in the cloud data center, thus boosting developer productivity.
CF has many advantages over other PaaS solutions: it is open source, it has a fast-growing community, and many big cloud actors are involved in its development and in spreading its adoption. It can also be run anywhere, even on a laptop, which is what this post is about. So keep reading...

Terminology

- Bosh is an open-source platform that helps deploy and manage systems on cloud infrastructures (AWS, OpenStack/CloudStack, vSphere, vCloud, etc.).
- Bosh Lite is a lightweight version of Bosh that can be used to deploy systems locally, using Vagrant instead of a cloud infrastructure (e.g. AWS) and Linux containers (the Warden project) instead of VMs to run your system.
- Stemcell is a template VM that is used by Bosh to create VMs and deploy them to the cloud. It essentially contains an OS (e.g. CentOS) and a Bosh agent so it can be controlled.

1. Install Git
$ sudo apt-get install git

2. Install VirtualBox
$ sudo echo "deb http://download.virtualbox.org/virtualbox/debian precise contrib" >> /etc/apt/sources.list
or create a new .list file as described in this thread.
$ wget -q http://download.virtualbox.org/virtualbox/debian/oracle_vbox.asc -O- | sudo apt-key add -
$ sudo apt-get update
$ sudo apt-get install virtualbox-4.3
$ sudo apt-get install dkms
$ VBoxManage --version
4.3.10_Ubuntur93012

3. Install Vagrant (the known version to work with bosh-lite is 1.6.3 - link)
$ wget https://dl.bintray.com/mitchellh/vagrant/vagrant_1.6.3_x86_64.deb
$ sudo dpkg -i vagrant_1.6.3_x86_64.deb
$ vagrant --version
Vagrant 1.6.3

Check that Vagrant works correctly with the installed VirtualBox
$ vagrant init hashicorp/precise32
$ vagrant up

4. Install Ruby(using RVM) + RubyGems + Bundler
4.1. Install rvm
$ curl -sSL https://rvm.io/mpapis.asc | gpg --import -
$ curl -sSL https://get.rvm.io | bash -s stable
$ source /home/{username}/.rvm/scripts/rvm
$ rvm --version

4.2. Install latest ruby version
$ rvm install 1.9.3-p551
$ ruby -v
ruby 1.9.3p551 (2014-11-13 revision 48407) [x86_64-linux]

5. Install Bosh CLI (check the prerequisites for the target OS here)
- Note that the Bosh CLI is not supported on Windows - github issue
$ sudo apt-get install build-essential libxml2-dev libsqlite3-dev libxslt1-dev libpq-dev libmysqlclient-dev
$ gem install bosh_cli

6. Install Bosh-Lite
$ git clone https://github.com/cloudfoundry/bosh-lite
$ cd bosh-lite
$ vagrant up --provider=virtualbox

If the message The guest machine entered an invalid state while waiting for it to boot is seen, then:
  • check whether virtualization (Intel VT-x / AMD-V for 32 bits, Intel EPT / AMD RVI for 64 bits) is enabled on the target system. If not, enable it from the BIOS; for ESXi check link1 and link2, add vhv.enable = "TRUE" to the VM configuration file (i.e. the .vmx file) and make sure the VM is of version 9.
  • You may also have to check whether the USB 2.0 controller is enabled; if it is, disable it.
Target the BOSH Director
$ cd ..
$ bosh target 192.168.50.4 lite
$ bosh login
Your username: admin
Enter password: *****

Logged in as `admin'

Setup a route between the laptop and the VMs running inside Bosh Lite
$ cd bosh-lite
$ ./bin/add-route

7. Deploy Cloud Foundry
Install spiff
$ brew tap xoebus/homebrew-cloudfoundry

$ brew install spiff

$ spiff
To install spiff on linux systems check this issue.

Upload latest stemcell
$ wget http://bosh-jenkins-artifacts.s3.amazonaws.com/bosh-stemcell/warden/latest-bosh-stemcell-warden.tgz
$ bosh upload stemcell latest-bosh-stemcell-warden.tgz
Check the stemcells
$ bosh stemcells

Upload latest CF release
$ git clone https://github.com/cloudfoundry/cf-release
$ export CF_RELEASE_DIR=$PWD/cf-release/
$ bosh upload release cf-release/releases/cf-XXX.yml

Deploy CF releases
$ cd bosh-lite/
$ ./bin/provision_cf
$ bosh target   # check the target director
$ bosh vms      # check the installed VMs on the cloud

Manually (to be continued)
Generate a configuration file manifests/cf-manifest.yml
$ mkdir -p go
$ export GOPATH=~/go
$ cd bosh-lite
$ ./bin/make_manifest_spiff

Deploy release
$ bosh deploy

Install CF CLI

Play with CF
$ cf api --skip-ssl-validation https://api.10.244.0.34.xip.io
$ cf login
$ cf create-org ORG_NAME
$ cf orgs
$ cf target -o ORG_NAME
$ cf create-space SPACE_NAME
$ cf target -o ORG_NAME -s SPACE_NAME

To access the VM from the LAN (i.e. another machine):
  1. Install an HTTP Proxy (e.g. squid3),
  2. Configure CF HTTP_PROXY environment variable, and 
  3. Configure the proxy:
       $ sudo nano /etc/squid3/squid.conf 
       acl local_network src 192.168.2.0/24
       http_access allow local_network

Stopping CF
Shutting down the bosh-lite VM can be surprisingly tricky. It may be better to stop the VM with:

  • vagrant suspend to save current state for next start up, or
  • vagrant halt, then next time to start CF use vagrant up followed by bosh cck (documentation).


Troubleshooting
$ bosh ssh   # then choose the job to access (password: admin)
bosh_something@something:~$ sudo /var/vcap/bosh/bin/monit summary
Find the Bosh Lite IP address
$ cd bosh-lite/
$ vagrant ssh
vagrant@agent-id-bosh-0:~$ ifconfig
vagrant@agent-id-bosh-0:~$ exit

Complete installation script can be found here.

Resources
  • Installing latest versions for virtualbox and vagrant - link
  • Installing ruby with rvm - link.
  • DIY PaaS (CF v1) running DEA link1, staging applications link2.
  • Deploying CF Playground (a kind of web admin interface) - link
  • Installing CF on vagrant - link video
  • Installing BOSH lite - github repo, tutorial
  • Deploying CF using BOSH lite - github repo, demo
  • http://altoros.github.io/2013/using-bosh-lite/
  • Installing a new hard drive - link
  • xip.io a free internet service providing DNS wildcard - link
  • Troubleshooting with Bosh CLI - official doc, app health, monit summary
  • Remotely debug a CF application - link
  • CloudFoundry manifest.yml generator - link