January 9th, 2013

Traffic Redirection Example with Policy-Based Routing (PBR)

Background Information
Let's spend a few words on PBR, or more precisely on how to force some traffic to a specific next hop based on a list of criteria instead of using the route installed in your routing table. There are plenty of reasons to do this kind of traffic engineering. In this example we will redirect some traffic to another router even though the current router has direct access to the destination address. In real life, PBR can be used to force some traffic through your firewall or your proxy server, for example.

Policy-Based Routing (PBR) allows you to use ACLs and route maps to selectively modify and route IP packets. The ACLs classify the traffic. Route maps that match on the ACLs set routing attributes for the traffic.

A PBR policy specifies the next hop for traffic that matches the policy. Using standard ACLs with
PBR, you can route IP packets based on their source IP address. With extended ACLs, you can route
IP packets based on all of the clauses in the extended ACL.

Equipment used

  • Brocade ICX6610 (Router with PBR capability)
  • Brocade ICX6450 (Router)

Network Diagram

Before focusing on policy-based routing, let's configure the standard infrastructure above and ensure that the two clients can ping each other as well as the other IP addresses defined. Then we will add the PBR configuration and see how it behaves.

On the Brocade ICX6610 device, the configuration would look like this for the router with PBR capability:

!
vlan 100 by port
  untagged ethe 1/1/10
  router-interface ve 100
!
vlan 200 by port
  untagged ethe 1/1/20
  router-interface ve 200
!
ip route 0.0.0.0 0.0.0.0 172.16.0.1
ip route 0.0.0.0 0.0.0.0 172.17.0.1
!
interface ethernet 1/1/23
  ip address 172.16.0.2 255.255.255.252
!
interface ethernet 1/1/24
  ip address 172.17.0.2 255.255.255.252
!
interface ve 100
  ip address 192.168.100.254 255.255.255.0
!
interface ve 200
  ip address 192.168.200.254 255.255.255.0
!

Nothing fancy in the configuration above. Two virtual interfaces (ve 100 and ve 200, in VLANs 100 and 200) are defined. These virtual interfaces act as gateways for the clients in their respective VLANs. The other interfaces, 1/1/23 and 1/1/24, are L3 interfaces connected to the other router (ICX6450). All unknown traffic is redirected to these two interfaces thanks to the default routes configured (0.0.0.0/0).

ICX6610#show ip route
Total number of IP routes: 5, avail: 11995 (out of max 12000)
B:BGP D:Connected R:RIP S:Static O:OSPF *:Candidate default
   Destination     NetMask          Gateway      Port    Cost  Type
   0.0.0.0         0.0.0.0          *172.16.0.1  1/1/23  1     S
   0.0.0.0         0.0.0.0          172.17.0.1   1/1/24  1     S
1  172.16.0.0      255.255.255.252  0.0.0.0      1/1/23  1     D
2  172.17.0.0      255.255.255.252  0.0.0.0      1/1/24  1     D
3  192.168.100.0   255.255.255.0    0.0.0.0      v100    1     D
4  192.168.200.0   255.255.255.0    0.0.0.0      v200    1     D

Here is the configuration of the other router (ICX6450):

!
ip route 192.168.100.0 255.255.255.0 172.16.0.2
ip route 192.168.200.0 255.255.255.0 172.16.0.2
ip route 192.168.100.0 255.255.255.0 172.17.0.2
ip route 192.168.200.0 255.255.255.0 172.17.0.2
!
interface loopback 1
  ip address 10.0.0.1 255.255.255.0
!
interface ethernet 1/1/23
  ip address 172.16.0.1 255.255.255.252
!
interface ethernet 1/1/24
  ip address 172.17.0.1 255.255.255.252
!

As before, two L3 interfaces are configured (1/1/23 and 1/1/24). The four static routes allow the client return traffic to reach the ICX6610 router. There is also a loopback IP address that could play the role of the WAN, for example.

ICX6450#show ip route
Total number of IP routes: 5, avail: 11995 (out of max 12000)
B:BGP D:Connected R:RIP S:Static O:OSPF *:Candidate default
   Destination     NetMask          Gateway      Port    Cost  Type
1  10.0.0.0        255.255.255.0    0.0.0.0      lb1     1     D
2  172.16.0.0      255.255.255.252  0.0.0.0      1/1/23  1     D
3  172.17.0.0      255.255.255.252  0.0.0.0      1/1/24  1     D
4  192.168.100.0   255.255.255.0    *172.16.0.2  1/1/23  1     S
   192.168.100.0   255.255.255.0    172.17.0.2   1/1/24  1     S
5  192.168.200.0   255.255.255.0    *172.16.0.2  1/1/23  1     S
   192.168.200.0   255.255.255.0    172.17.0.2   1/1/24  1     S

You can imagine a firewall or a proxy server instead of the ICX6450 router. The idea is simply to demonstrate that PBR can be used to force the traffic between two clients through another device instead of being routed directly. Before configuring PBR, let's look at the current behavior.

The ICX6610 routing table shows direct routes to the client subnets 192.168.100.0/24 and 192.168.200.0/24 through the virtual interfaces configured before. If we run a traceroute from 192.168.100.10, we can see that the traffic is routed directly by the ICX6610 router, which makes sense:

C:\>tracert 192.168.200.10

Tracing route to 192.168.200.10 over a maximum of 30 hops

  1    <1 ms    <1 ms    <1 ms  192.168.100.254
  2    <1 ms    <1 ms    <1 ms  192.168.200.10

Trace complete.

PBR
Let’s configure PBR on ICX6610:

interface ve 100
   ip address 192.168.100.254 255.255.255.0
   ip policy route-map Vlan100ToVlan200ThroughICX6450
!
access-list 101 permit ip 192.168.100.0 0.0.0.255 192.168.200.0 0.0.0.255
!
route-map Vlan100ToVlan200ThroughICX6450 permit 101
  match ip address 101
  set ip next-hop 172.16.0.1
  set ip next-hop 172.17.0.1

!

The ACL can be more granular (based on TCP/UDP ports, …), like any other extended ACL. Here we simply define an ACL that classifies the traffic from 192.168.100.0/24 to 192.168.200.0/24. Then we configure a route map that matches this ACL and sets the next hop to the address of the other router, and finally we apply the route map to the virtual interface ve 100.
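For instance, a policy that only redirects web traffic could look like the sketch below (the ACL number and route-map name are arbitrary, it is not part of the lab configuration, and it would be applied to ve 100 in place of the broader policy):

!
access-list 102 permit tcp 192.168.100.0 0.0.0.255 192.168.200.0 0.0.0.255 eq 80
!
route-map WebVlan100ToVlan200 permit 10
  match ip address 102
  set ip next-hop 172.16.0.1
  set ip next-hop 172.17.0.1
!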

If we perform the traceroute again, we can see that the traffic between the same clients goes through the ICX6450, which is what we wanted to see 😉

C:\>tracert 192.168.200.10

Tracing route to 192.168.200.10 over a maximum of 30 hops

  1     7 ms    <1 ms    <1 ms  172.16.0.1
  2    <1 ms    <1 ms    <1 ms  172.16.0.1
  3    <1 ms    <1 ms    <1 ms  172.16.0.2
  4    <1 ms    <1 ms    <1 ms  192.168.200.10

Trace complete.

Disable Redundant Link

Since we designed a redundant L3 path, we can even show that it keeps working when one of the uplinks is disabled:

ICX6610(config-if-e1000-1/1/23)#disable

ICX6610#show ip route
Total number of IP routes: 4, avail: 11996 (out of max 12000)
B:BGP D:Connected R:RIP S:Static O:OSPF *:Candidate default
   Destination     NetMask          Gateway      Port    Cost  Type
   0.0.0.0         0.0.0.0          172.17.0.1   1/1/24  1     S
1  172.17.0.0      255.255.255.252  0.0.0.0      1/1/24  1     D
2  192.168.100.0   255.255.255.0    0.0.0.0      v100    1     D
3  192.168.200.0   255.255.255.0    0.0.0.0      v200    1     D

The redundant paths no longer appear in the routing table, which makes sense since one of the uplinks was disabled.

C:\>tracert 192.168.200.10

Tracing route to 192.168.200.10 over a maximum of 30 hops

  1     1 ms     1 ms    <1 ms  172.17.0.1
  2    <1 ms    <1 ms    <1 ms  172.17.0.1
  3    <1 ms    <1 ms    <1 ms  172.17.0.2
  4    <1 ms    <1 ms    <1 ms  192.168.200.10

Trace complete.

When performing the same traceroute as before, we can see that PBR redirects the traffic to the second next hop provided, since the first one disappeared from the routing table. This is what we wanted to demonstrate.
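Note that only the traffic from VLAN 100 to VLAN 200 is policy-routed here; the return traffic still follows the normal routing table. If you wanted the reverse direction to go through the ICX6450 as well, a mirror policy on ve 200 could look like this (a sketch that was not tested in this lab):

!
access-list 103 permit ip 192.168.200.0 0.0.0.255 192.168.100.0 0.0.0.255
!
route-map Vlan200ToVlan100ThroughICX6450 permit 10
  match ip address 103
  set ip next-hop 172.16.0.1
  set ip next-hop 172.17.0.1
!
interface ve 200
  ip policy route-map Vlan200ToVlan100ThroughICX6450
!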


Wake-On-LAN (WOL) Network Configuration

Here are all the details needed to implement Wake-On-LAN with a Brocade router across several subnets.

Background information

Wake-on-LAN (WOL) has been used for years as a way for network administrators to turn on PCs remotely. It works by creating and issuing a specially formatted network packet (often referred to as a “magic” packet, and typically generated via a software utility) containing a target PC’s MAC address.

WOL is based on the principle that when the PC shuts down, the NIC still receives power and keeps listening on the network for the magic packet to arrive. The magic packet is usually carried over UDP. In our case we will use the SolarWinds utility, which sends a UDP packet on port 7 (echo) to trigger the wake-up of a remote machine.
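For readers without the SolarWinds tool, an equivalent magic packet can presumably be sent from any Linux host in the server VLAN with the wakeonlan utility; the MAC address below is a placeholder:

# broadcast the magic packet for the target NIC on UDP port 7;
# the helper addresses on ve 10 relay it to the client subnets
wakeonlan -i 192.168.10.255 -p 7 00:11:22:33:44:55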

Equipment used

  • Brocade FCX that runs FCXR07100a.bin (Router)
  • Brocade FCX that runs FCXS07100a.bin (Switch)
  • Freeware Wake-On-LAN utility from SolarWinds

Considerations

Before deploying WOL across several subnets, first ensure your desktop can be woken remotely:

  • Check your BIOS settings and ensure Wake-On-LAN is enabled
  • Check your NIC Advanced Settings and ensure that Magic & Directed Packets are accepted
  • Connect your WOL server and your desktop/laptop client to the same VLAN
  • Invoke Wake Up PC from SolarWinds
  • If the PC wakes up you can go further and configure it across several subnets

Network Diagram


Configuration

  • Router (inter-VLAN)

!
vlan 10 name server_vlan by port
tagged ethe 1/1/10
untagged ethe 1/1/1
router-interface ve 10
!
vlan 20 name user_vlan20 by port
tagged ethe 1/1/10
router-interface ve 20
!
vlan 30 name user_vlan30 by port
tagged ethe 1/1/10
router-interface ve 30
!
!
ip forward-protocol udp echo
interface ve 10
ip address 192.168.10.1 255.255.255.0
ip helper-address 1 192.168.20.255
ip helper-address 2 192.168.30.255
!
interface ve 20
ip address 192.168.20.1 255.255.255.0
!
interface ve 30
ip address 192.168.30.1 255.255.255.0
!

  • Switch

!
vlan 10 name server_vlan by port
tagged ethe 1/1/10
untagged ethe 1/1/1
!
vlan 20 name user_vlan20 by port
tagged ethe 1/1/10
untagged ethe 1/1/2
!
vlan 30 name user_vlan30 by port
tagged ethe 1/1/10
untagged ethe 1/1/3
!

Explanation

As soon as your inter-VLAN routing is configured correctly, only two commands are needed so that machines on different subnets/VLANs can be woken remotely by a WOL server. By default, the Brocade FastIron router does not forward broadcast requests for UDP applications. You can enable forwarding support for a specific port with the ip forward-protocol command, specifying the port used by your application. In our case, SolarWinds uses UDP packets on port 7 (echo).

The second step is to configure a helper address on the VLAN of the WOL server so that it can reach the subnets of your clients. The aim is to forward packets from the WOL server to each client subnet, so you have to specify the broadcast address of each client network to make it work. This is the only way to send a packet to a machine that is asleep: in this state the PC cannot answer ARP requests, since it does not own its IP address at that time.

  • Specify the protocol (udp) and port (7) to be forwarded:

ip forward-protocol udp echo

  • Enable forwarding of WOL packets to clients

ip helper-address 1 192.168.20.255


This post presents different ways of segmenting and interconnecting several hosts across your network. In the most common scenario for a standard company network, you connect your hosts to an edge switch, untag each port in a specific VLAN and do the same for each device. This edge switch is connected to an aggregation/core router that is in charge of inter-VLAN routing. Basically, on the router you configure the LAGs to the edge switches, define the authorized VLANs (tagged) on these links and finally create virtual interfaces for each subnet so that hosts from each VLAN can talk together while limiting the spread of L2 broadcasts.

Inter-VLAN Routing

Here is a standard inter-VLAN routing scenario:

On a Brocade FastIron device, the configuration would look like this for the switch:

!
vlan 2 name user_vlan2 by port
tagged ethe 1/1/1
untagged ethe 1/1/2
!
vlan 3 name user_vlan3 by port
tagged ethe 1/1/1
untagged ethe 1/1/3
!
vlan 4 name user_vlan4 by port
tagged ethe 1/1/1
untagged ethe 1/1/4
!

Ports connected to hosts are configured as untagged in a specific VLAN. The uplink (here port 1/1/1) is defined as a tagged port in each VLAN. Thus it can carry traffic from the different VLANs to the router.
The Router configuration looks like this:

!
vlan 2 name user_vlan2 by port
tagged ethe 1/1/1
router-interface ve 2
!
vlan 3 name user_vlan3 by port
tagged ethe 1/1/1
router-interface ve 3
!
vlan 4 name user_vlan4 by port
tagged ethe 1/1/1
router-interface ve 4
!
!
interface ve 2
ip address 192.168.2.1 255.255.255.0
!
interface ve 3
ip address 192.168.3.1 255.255.255.0
!
interface ve 4
ip address 192.168.4.1 255.255.255.0
!

Router interfaces are defined in each VLAN. Then, as soon as the IP addresses for the virtual interfaces are configured, the members of different VLANs can see each other:

FastIronRouter# show ip route
Total number of IP routes: 3 avail: 11997 (out of max 12000)
B:BGP D:Connected R:RIP S:Static O:OSPF *:Candidate default
Destination   NetMask            Gateway Port Cost Type
1  192.168.2.0  255.255.255.0   0.0.0.0    v2   1       D
2  192.168.3.0  255.255.255.0   0.0.0.0    v3   1       D
3  192.168.4.0  255.255.255.0   0.0.0.0    v4   1       D

It's pretty much the same type of configuration on a Cisco router/switch, except that you configure a subinterface with Cisco instead of a virtual interface (ve) with Brocade. You can then create ACLs on the virtual interfaces to block undesirable traffic (a short sketch follows below). Now imagine that you are a web hosting company with only a small range of public IP addresses. You don't want to waste two IP addresses per subnet (network and broadcast addresses) for each of your VLANs; indeed, the previous case implied one subnet per VLAN. In such a case you can configure multiple VLANs inside the same subnet. This way, you still benefit from the advantages of VLANs (smaller broadcast domains, traffic segmentation, …) without wasting your public IP addresses.
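As an aside on the ACL point above, filtering between two of the user VLANs on the Brocade router could look roughly like this (the ACL number 110 is arbitrary and the exact syntax may vary between FastIron releases):

!
access-list 110 deny ip 192.168.3.0 0.0.0.255 192.168.4.0 0.0.0.255
access-list 110 permit ip any any
!
interface ve 3
  ip access-group 110 in
!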

Multiple VLAN inside the same subnet

The picture looks almost the same as the previous one. However, each host is now in the same subnet but in a different VLAN, and all of them use the same gateway address.

ip proxy-arp
!
interface ve 2
ip address 192.168.2.1 255.255.255.0
!
interface ve 3
ip follow ve 2
!
interface ve 4
ip follow ve 2
!

All three VLANs now use the same IP subnet. In addition to conserving IP subnet addresses, this feature allows containment of Layer 2 broadcasts to segments within an IP subnet. For ISP environments where the same IP subnet is allocated to different customers, placing each customer in a separate VLAN allows all customers to share the IP subnet address while isolating them from one another's Layer 2 broadcasts. By default, there is no inter-VLAN communication with the ip follow mechanism, as opposed to standard inter-VLAN routing. You need to configure ip proxy-arp if you want to enable inter-VLAN communication. The Brocade device will then perform proxy ARP (Address Resolution Protocol) for hosts that want to send IP traffic to hosts in other VLANs sharing the same IP subnet address. If the source and destination hosts are in the same VLAN, the Brocade device does not need to use ARP.

Private VLAN

Finally, there is still another solution if you prefer a flat approach. A private VLAN secures traffic between a primary port and host ports. Traffic between the hosts and the rest of the network must travel through the primary port. Two hosts can be prevented from communicating with one another even though they are in the same VLAN. There are three types of PVLAN ports:

  • Primary (or Promiscuous or Firewall)— A promiscuous port can communicate with all interfaces, including the isolated and community ports within a PVLAN.
  • Isolated— An isolated port has complete Layer 2 separation from the other ports within the same PVLAN, but not from the promiscuous ports. PVLANs block all traffic to isolated ports except traffic from promiscuous ports. Traffic from isolated port is forwarded only to promiscuous ports.
  • Community— Community ports communicate among themselves and with their promiscuous ports. These interfaces are separated at Layer 2 from all other interfaces in other communities or isolated ports within their PVLAN.

Let’s first configure such an example. Nothing to do on the router, only at switch level:

!
vlan 7 name private_vlan by port
untagged ethe 1/1/7
pvlan type primary
pvlan mapping 902 ethe 1/1/7
pvlan mapping 901 ethe 1/1/7
!
vlan 901 name community_vlan by port
untagged ethe 1/1/9 to 1/1/10
pvlan type community
!
vlan 902 name isolated_vlan by port
untagged ethe 1/1/11
pvlan type isolated
!
!
pvlan-preference broadcast flood
pvlan-preference unknown-unicast flood
!

VLAN 7 is the private VLAN. VLANs 901 and 902 are mapped to it. By default, the private VLAN does not forward broadcast or unknown-unicast packets from outside sources into the private VLAN. The pvlan-preference commands change this default behavior to allow such traffic, so that hosts behind the primary port can be discovered. Apart from this, you can see that port 1/1/7 is the primary port, and the community and isolated VLANs are mapped to it. The hosts connected to the community VLAN can talk to each other without going through the primary port, while the hosts connected to the isolated VLAN can only talk to the primary port. This type of configuration is a really good way to secure traffic between machines while remaining simple to deploy and maintain.

Best available technologies overview

First of all, we will discuss how to load balance queries from the application to our MySQL database cluster. It is nice to have a database cluster that provides high availability and performance, but if you don't find a proper way to connect to it, your efforts will be useless. Different approaches exist to load balance queries from the application to the database cluster. The purpose is to load balance queries in order to make the most of the available resources, but that is not the only goal: if a node fails, the requests have to be correctly redirected to the remaining functional one. There are several approaches.

The first approach depends on your application. In my case, the database cluster is dedicated to a Java application, which is why I tried several JDBC drivers to find one that matched my needs. Then we will discuss MySQL Proxy: despite still being an alpha release, it looks really promising! Finally, we will mention the IP layer load balancer I detailed in this previous post.

In the following section, the goal is to quickly identify the best solutions available to load balance queries to a MySQL cluster. The details are deliberately omitted since the goal of this post is to identify the optimal MySQL cluster architecture in a two-node context. I benchmarked every option against application-specific criteria, but I think the results can be generalized. The proposed solutions do not imply any modification from the application's perspective.

  • JDBC drivers
  • MySQL Proxy
  • IP layer loadbalancer

JDBC drivers as loadbalancer

  • Load Balanced MySQL driver

I used Connector/J 5.1, the official JDBC driver for MySQL, in its replication mode as explained here. I quickly decided to give it up for several reasons. First of all, the response times were not satisfying, and the driver does not handle many parallel connections. Finally, data integrity was not always assured under stress.

  • lbpool driver

MySQL officially suggests this open driver in a replication context; you can find more information about it here. This driver showed much better performance than the previous one. Unfortunately, in my case, after more than 100'000 inserts, deadlocks appeared randomly and the driver entered an unstable state where there was no way to connect to the cluster any more. That's why I had to find an alternative.

MySQL Proxy

This tool showed good performance. Queries were correctly load balanced between my two nodes. I simulated the failure of one node: the queries were correctly redirected to the remaining one. I didn't notice any inconsistency, even for more than 100'000 inserts, and the overhead was very small compared to a direct connection. The only drawback is that MySQL Proxy is still an alpha release and depends on the correctness of the Lua script used to split read/write queries. Nevertheless, it was the best candidate so far.

Loadbalance queries at IP level (keepalived)

This option, explained in this previous post, is the most mature and has proved to work over the years. It load balances MySQL queries between two nodes. We define a MySQL virtual IP address (VIP); the application connects to the cluster through that VIP, and we configure the distribution in a round-robin fashion, for example. This option also handles failover: if the node that currently owns the VIP fails, the VIP is automatically taken over by the other node at the IP layer. Unfortunately, read/write splitting cannot be done at this level. It is the best candidate to load balance queries and to handle failover, but it cannot split read/write queries without modifying the application or adding an extra mechanism, that is to say combining it with MySQL Proxy for example.
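For illustration, reusing the addresses of the keepalived setup described in that post (192.168.9.100 as the VIP, 192.168.9.10 and 192.168.9.20 as the nodes), MySQL would simply be declared as another TCP virtual_server in keepalived.conf; this is a sketch, not a tested configuration:

# load balance MySQL through the VIP, round robin, direct routing
virtual_server 192.168.9.100 3306 {
  delay_loop 15
  lb_algo rr
  lb_kind DR
  protocol TCP

  real_server 192.168.9.10 3306 {
    TCP_CHECK {
      connect_timeout 3
    }
  }
  real_server 192.168.9.20 3306 {
    TCP_CHECK {
      connect_timeout 3
    }
  }
}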

Now that we have completed our overview of the distribution mechanisms, it is time to get to the heart of the matter. Using one of these load balancers, or a combination of several techniques, we will list the possible MySQL database cluster architectures in a two-node context and select the best one according to several criteria such as complexity, ease of administration and performance.

MySQL Cluster — NDB engine

First we will mention MySQL Cluster. It is a recent and advanced technology that provides a high-level cluster architecture composed of three types of nodes. In our limited setup, this kind of cluster is clearly not suitable since it is intended for a larger number of nodes, so we will not go further in this direction.

Benefits:
  • Automatically synchronized
  • Synchronous replication
  • High-level administration and configuration
  • Hot backup
  • Easy to add/remove nodes

Drawbacks:
  • At least 4 nodes
  • Large amount of memory needed
  • Difficult to evaluate behavior a priori
  • Still discouraged by the community in a production environment
  • Not mature
  • Not so easy to deploy
  • No mechanism to load balance queries at the MySQL nodes
  • Application dependent -> needs lots of tests

Master/Master Replication

In this scenario, to avoid any modification on the application side, we set up a MySQL virtual IP address (VIP) using keepalived as explained in this previous post, and we configure the distribution so that queries are redirected to one node or the other in a round-robin fashion. If a node fails, it is removed from the IPVS table and no longer receives requests. Thus high availability and performance are ensured, since the total amount of queries is split between the two machines and the global service remains available even if a node fails.

Each machine reads the logs of the other and executes its neighbor's queries that modify (insert/update) the database content. The only performance enhancement comes from the fact that two machines are available to accept read queries; if your application performs mostly write queries, this kind of architecture will not be very efficient. You also have to be aware that the replication is asynchronous: the content of the two databases is not necessarily the same at a given time t, which can be problematic for sensitive read queries that should take your last write into account. Furthermore, the MySQL community does not promote this architecture; they say it works, but they do not advise it.
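One classical precaution, related to the auto_increment drawback listed below, is to stagger the auto-increment values on the two masters; a minimal my.cnf sketch with illustrative values could look like this:

# node 1 (/etc/mysql/my.cnf)
server-id                = 1
log_bin                  = mysql-bin
auto_increment_increment = 2
auto_increment_offset    = 1

# node 2 (/etc/mysql/my.cnf)
server-id                = 2
log_bin                  = mysql-bin
auto_increment_increment = 2
auto_increment_offset    = 2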

Benefits:
  • Transparent for the application
  • Best way to fail over
  • Easy to deploy

Drawbacks:
  • Have to handle auto_increments, duplicate keys, etc. specifically
  • Asynchronous replication
  • No performance enhancement except for read queries
  • Not promoted by the MySQL Community

Master/Slave Replication without loadbalancing

This technique is the easiest to deploy and the most mature. I recommend it even though it is not the most efficient one. With a standard master/slave replication we simply add a level of redundancy to a single MySQL database. To make it work correctly, as with the master/master replication, we configure a MySQL VIP address. Every request is forwarded to the node that currently owns the VIP; the master handles every request and there is no load balancing. The slave, which reads the master's log file, replicates the master. In order to ease failover, the read-only flag is not set on the slave; thus if the master fails, the slave will take over the VIP and keep the service running without manual intervention. This architecture is safe and easy to deploy, and it ensures a level of redundancy in case of failure without modifying the application. However, if a node fails, we have to repair it manually.
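As a rough sketch (illustrative values only; the replication account and the CHANGE MASTER step are left out), the my.cnf side of such a setup boils down to:

# master (/etc/mysql/my.cnf)
server-id = 1
log_bin   = mysql-bin

# slave (/etc/mysql/my.cnf)
server-id = 2
# read_only is deliberately NOT set, so the slave can serve writes
# immediately after taking over the VIP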

Benefits:
  • Transparent for the application
  • High availability
  • Easy failover
  • Safe

Drawbacks:
  • Resources not fully used since only one node works at a time
  • No high performance

Master/Slave Replication with loadbalancing

In this scenario, which is the most complicated one to test in our case, the IP load balancer can indifferently redirect queries to one backend or the other in a round-robin fashion. Then, in order to split read and write queries, we use MySQL Proxy and the rw-splitting.lua script, since it is the best solution we found to split queries safely.

mysql-proxy \
    --proxy-lua-script=/etc/mysql-proxy-0.6.0/lib/rw-splitting.lua \
    --proxy-backend-addresses=master_vip:3306 \
    --proxy-read-only-address=slave_vip:3306 &

The trick is to configure two VIPs (a master and a slave VIP). All write accesses are forwarded to master_vip and read accesses to slave_vip. At proxy startup, we define a read/write backend accessible through the master VIP, so that if the master fails, the slave takes over master_vip and can handle write queries, since there is currently no way to handle this failover within the Lua script itself. Here we assume keepalived deals correctly with the VIP addresses. In spite of that, we have to admit there is another single point of failure: the proxy itself. That's why we have to run it on both machines and monitor it, and the application has to access it through yet another VIP. I didn't test this complete solution since it seemed very complicated to make safe. Nevertheless it should work and provide a good solution, but at the price of high complexity! That's why I'm open to any other proposal; feel free to suggest a better scheme. Otherwise I will keep on promoting the master/slave replication without load balancing…

Benefits:
  • Clearly the best solution in terms of performance
  • As good as the others in terms of availability

Drawbacks:
  • But what about feasibility?!

In this scenario, we have two machines and try to make the most of the available resources. Each node plays the role of a realserver: it provides a service such as a web or a mail server. At the same time, one of the machines load balances the requests to itself and to its neighbor. The node responsible for the load balancing owns the VIP, and every client connects to it transparently thanks to the VIP. The other node is also able to take over the VIP if it detects that the current master has failed, but in the nominal case it only processes requests forwarded by the load balancer.

Throughout this post the following IP addresses are used. Do not forget to modify them according to your network settings:

  • hostname1 ip address: 192.168.9.10
  • hostname2 ip address: 192.168.9.20
  • virtual ip address: 192.168.9.100

Configuration Files

Install Keepalived and set up configuration files

  • Install Keepalived on both machines:

[tux]# apt-get install keepalived

  • Copy provided Keepalived configuration files (master and slave) into /etc/keepalived/ directory:

[hostname1]# cp keepalived_master.conf /etc/keepalived/keepalived.conf
[hostname2]# cp keepalived_slave.conf /etc/keepalived/keepalived.conf

  • Copy provided bypass_ipvs.sh script that will be called during master/slave transitions on both machines:

[tux]# cp bypass_ipvs.sh /etc/keepalived/

Install and Configure services (mail and web server in our case) on both machines

  • For our test purposes, the realservers provide a mail and a web server each. First install them:

[tux]# apt-get install postfix apache2

  • Configure postfix and make sure each node can connect to its neighbor's mail server. During the installation phase, select "local only", then comment out the following line in /etc/postfix/main.cf so that the mail server does not listen only on the local interface:

# inet_interfaces = loopback-only

  • Then try to connect to your neighbor's mail server to make sure it works correctly:

[hostname1]# telnet hostname2 25
Connected to hostname2.
Escape character is ‘^]’.
220 hostname2 ESMTP Postfix (Debian/GNU)

[hostname2]# telnet hostname1 25
Connected to hostname1.
Escape character is ‘^]’.
220 hostname1 ESMTP Postfix (Debian/GNU)

  • Generate the digest string used by Keepalived to check the web server by running genhash on an accessible web page. In our case we compute the digest for /apache2-default/index.html, which is the default page for apache2:

[hostname1]# genhash -s hostname1 -p 80 -u /apache2-default/index.html
MD5SUM = c7b4690c8c46625ef0f328cd7a24a0a3

[hostname1]# genhash -s hostname2 -p 80 -u /apache2-default/index.html
MD5SUM = c7b4690c8c46625ef0f328cd7a24a0a3

  • Keepalived will check whether the server is up using this digest value. That's why you have to copy it into the Keepalived configuration, specifically into the real_server sections dedicated to the web server:
HTTP_GET {
  url {
    path /apache2-default/index.html
    digest c7b4690c8c46625ef0f328cd7a24a0a3
  }
  connect_timeout 3
  nb_get_retry 3
  delay_before_retry 2
}
  • At this step, we set up a functional mail and web server on each node.

Configure VIP(Virtual IP Service)

  • This IP will provide access to the realservers. It is configured entirely from the Keepalived configuration and does not need any other modification. Only one of the two nodes owns the VIP at a given time, so the two nodes have different configurations. In our case, hostname1 is set up as the master and hostname2 as the slave, and the VIP is 192.168.9.100:
  • On the master:
# describe virtual service ip
vrrp_instance VI_1 {
  # initial state
  state MASTER
  interface eth0
  # arbitrary unique number 0..255
  # used to differentiate multiple instances of vrrpd
  virtual_router_id 1
  # for electing MASTER, highest priority wins.
  # to be MASTER, make 50 more than other machines.
  priority 100
  authentication {
    auth_type PASS
    auth_pass xxx
  }
  virtual_ipaddress {
    192.168.9.100/24
  }
}
  • On the slave:
# describe virtual service ip
  vrrp_instance VI_1 {
  # initial state
  state BACKUP
  interface eth0
  # arbitrary unique number 0..255
  # used to differentiate multiple instances of vrrpd
  virtual_router_id 1
  # for electing MASTER, highest priority wins.
  # to be MASTER, make 50 more than other machines.
  priority 50
  authentication {
    auth_type PASS
    auth_pass xxx
  }
  virtual_ipaddress {
    192.168.9.100/24
  }
}
  • Then we can start or reload Keepalived and check that the master really owns the VIP:
[hostname1]# /etc/init.d/keepalived start
[hostname2]# /etc/init.d/keepalived start
[hostname1]# ip addr list dev eth0
2: eth0: <BROADCAST,MULTICAST,UP,10000> mtu 1500 qdisc pfifo_fast qlen 1000
    link/ether 00:0c:29:e3:e2:40 brd ff:ff:ff:ff:ff:ff
    inet 192.168.9.10/24 brd 192.168.9.255 scope global eth0
    inet 192.168.9.100/24 scope global secondary eth0
    inet6 fe80::20c:29ff:fee3:e240/64 scope link
       valid_lft forever preferred_lft forever

Configure loadbalancing

  • The load balancing is also configured through Keepalived. At a given time, only one machine owns the VIP and forwards the requests to the realservers according to the chosen rules. Services are accessed through the VIP and processed indifferently by one machine or the other. In /etc/keepalived/keepalived.conf, realservers are defined like this:
# describe virtual mail server
virtual_server 192.168.9.100 25 {
  delay_loop 15
  lb_algo rr
  lb_kind DR
  persistence_timeout 50
  protocol TCP

  real_server 192.168.9.10 25 {
    TCP_CHECK {
      connect_timeout 3
    }
  }
  real_server 192.168.9.20 25 {
    TCP_CHECK {
      connect_timeout 3
    }
  }
}
  • This example shows how the requests intended for the mail server are processed (the web server is defined the same way on port 80, using the HTTP_GET check shown earlier). The requests are load balanced to the realservers according to the round-robin (rr) algorithm. Direct Routing (DR) mode is used: once a realserver is selected to process a request, it answers the client directly without going back through the load balancer. Thus a single load balancer can handle a huge amount of requests without becoming the bottleneck of the system, since forwarding a request requires very few resources.
  • Then enable ip_forward on both machines permanently. In /etc/sysctl.conf :

net.ipv4.ip_forward = 1

  • You can load this option and check it is set up correctly with the following commands:

[tux]# sysctl -p
net.ipv4.ip_forward = 1

[tux]# sysctl -a | grep net.ipv4.ip_forward
net.ipv4.ip_forward = 1

  • We have a mail and a web server at disposal. Ensure that loadbalancer is configured correctly:
[hostname1]# ipvsadm -L -n
IP Virtual Server version 1.2.1 (size=4096)
Prot LocalAddress:Port Scheduler Flags
  -> RemoteAddress:Port           Forward Weight ActiveConn InActConn
TCP  192.168.9.100:25 rr persistent 50
  -> 192.168.9.10:25             Local   1      0          0
  -> 192.168.9.20:25             Route   1      0          0
TCP  192.168.9.100:80 rr persistent 50
  -> 192.168.9.10:80             Local   1      0          0
  -> 192.168.9.20:80             Route   1      0          0
  • Requests intended for the VIP on port 25 or 80 are distributed to 192.168.9.10 and 192.168.9.20. Now we try to connect to the mail server through the VIP from another machine, that is to say neither hostname1 nor hostname2:

[tux]# telnet 192.168.9.100 25
Trying 192.168.9.100…

  • Nothing happens… And that is completely normal: it happens every time the load balancer assigns the request to the node that does not currently own the VIP, since that node is not supposed to handle the request. The traditional way to sort out this issue is to configure the VIP on the other node as well, for example on the loopback interface, so that it accepts packets with the VIP as destination address, and then to make the network interfaces ignore some ARP requests by playing with the arp_ignore and arp_announce options. This is sufficient in a classical scenario with dedicated load-balancing machines, but not in our case!
  • In our architecture, the load balancer and the realserver are located on the same machine. If you simply add the VIP on the secondary machine, there will be cases where packets are processed by the load balancers of both machines, since IPVS is not deactivated on the slave. And if each load balancer selects its neighbor to process the request, we face a ping-pong effect: an infinite loop between the two nodes, and the request is never handled at all!
  • Fortunately, there is a trick to handle every request efficiently. We use Keepalived's ability to call predefined scripts on master/slave transitions, configured inside the vrrp_instance section of /etc/keepalived/keepalived.conf:

# Invoked on transition to master
notify_master "/etc/keepalived/bypass_ipvs.sh del 192.168.9.100"
# Invoked on transition to slave
notify_backup "/etc/keepalived/bypass_ipvs.sh add 192.168.9.100"
# Invoked on transition to fault
notify_fault "/etc/keepalived/bypass_ipvs.sh add 192.168.9.100"

  • The bypass_ipvs.sh script adds a NAT rule when the host becomes slave and removes it when it goes back to the master state, so that requests intended for the VIP are processed correctly even when they are handled by the slave (the matching delete command is shown at the end of this section). The PREROUTING rule is essential for the slave to redirect incoming service packets to itself; otherwise a loop can appear between master and slave. The routing table is consulted when a packet creating a new connection comes in, and the PREROUTING rule alters packets as soon as they arrive: the REDIRECT target sends the packet to the machine itself by changing the destination IP to the primary address of the incoming interface (locally generated packets are mapped to 127.0.0.1). Thus packets forwarded by the active load balancer are not load balanced a second time.

iptables -A PREROUTING -t nat -d 192.168.9.100 -p tcp -j REDIRECT

  • Check rule on the slave:
[hostname2]# iptables -t nat --list
Chain PREROUTING (policy ACCEPT)
target     prot opt source               destination
REDIRECT   tcp  --  anywhere             192.168.9.100       

Chain POSTROUTING (policy ACCEPT)
target     prot opt source               destination         

Chain OUTPUT (policy ACCEPT)
target     prot opt source               destination
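  • For completeness, on the transition back to master the script presumably removes the same rule with the symmetric command:

iptables -D PREROUTING -t nat -d 192.168.9.100 -p tcp -j REDIRECT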

Failover

  • Stop Keepalived on the master:

[hostname1]# /etc/init.d/keepalived stop

  • Ensure new master owns VIP:
[hostname2]# ip addr list dev eth0
2: eth0: <BROADCAST,MULTICAST,UP,10000> mtu 1500 qdisc pfifo_fast qlen 1000
    link/ether 00:0c:29:ab:e7:dd brd ff:ff:ff:ff:ff:ff
    inet 192.168.9.20/24 brd 172.16.89.255 scope global eth0
    inet 192.168.9.100/24 scope global secondary eth0
    inet6 fe80::20c:29ff:feab:e7dd/64 scope link
       valid_lft forever preferred_lft forever
  • Check that nat rule disappeared:
[hostname2]# iptables -t nat --list
Chain PREROUTING (policy ACCEPT)
target     prot opt source               destination      

Chain POSTROUTING (policy ACCEPT)
target     prot opt source               destination         

Chain OUTPUT (policy ACCEPT)
target     prot opt source               destination
  • If the original master comes back up, the architecture automatically adjusts itself and keeps on processing incoming requests.

Service Failure Handling

  • If a service fails, it no longer responds correctly to the basic Keepalived checks and is automatically removed from the IPVS table:
[hostname1]# ipvsadm -L -n
IP Virtual Server version 1.2.1 (size=4096)
Prot LocalAddress:Port Scheduler Flags
  -> RemoteAddress:Port           Forward Weight ActiveConn InActConn
TCP  192.168.9.100:25 rr persistent 50
  -> 192.168.9.10:25             Local   1      0          0
  -> 192.168.9.20:25             Route   1      0          0
TCP  192.168.9.100:80 rr persistent 50
  -> 192.168.9.10:80             Local   1      0          0
  -> 192.168.9.20:80             Route   1      0          0

[hostname1]# /etc/init.d/postfix stop
Stopping Postfix Mail Transport Agent: postfix.

[hostname1]# ipvsadm -L -n
IP Virtual Server version 1.2.1 (size=4096)
Prot LocalAddress:Port Scheduler Flags
  -> RemoteAddress:Port           Forward Weight ActiveConn InActConn
TCP  192.168.9.100:25 rr persistent 50
  -> 192.168.9.20:25             Route   1      0          0
TCP  192.168.9.100:80 rr persistent 50
  -> 192.168.9.10:80             Local   1      0          0
  -> 192.168.9.20:80             Route   1      0          0
  • New requests are no longer forwarded to the failed service.

Other Considerations

Thanks to the NAT rule, we successfully set up load balancing and automatic failover on a two-node configuration. With such an architecture we take full advantage of the available resources: one node plays the roles of both load balancer and realserver, while the other can take over the load balancer role as soon as it detects that its neighbor has failed. The slave does not only check the master; it also handles the requests it receives from the load balancer.

Keep in mind that the request distribution is not strictly made on a per-connection basis. A client connects to one of the realservers through the VIP; once this is done, and as long as that realserver stays available, further requests from the same client are forwarded to it (because of the persistence_timeout setting). In a classical scenario, many clients connect to the VIP, so the global amount of requests is still roughly evenly distributed between the two nodes.

To go further

This section shows in detail the interaction between the different components of my system. It can be very useful to understand how everything is supposed to work when all goes well. The following figures were built from the traces I captured with Wireshark, so this is not how it is supposed to behave according to some documentation: it is how it really works under my configuration. The setup is the same as before, except that I added a client machine:

  • hostname1 ip address: 192.168.9.10
  • hostname2 ip address: 192.168.9.20
  • virtual ip address: 192.168.9.100
  • client ip address: 192.168.9.15

Here are the commands I entered:

client~$ telnet 192.168.9.100 25
Trying 192.168.9.100…
Connected to 192.168.9.100.
Escape character is ‘^]’.
220 hostname2 ESMTP Postfix (Debian/GNU)
HELO hostname2
250 hostname2
QUIT
221 2.0.0 Bye
Connection closed by foreign host.

As expected, it works like a charm. I reach the hostname2 mail server by going through the VIP address, currently owned by hostname1. In this case, as you can see, the IPVS mechanism decided to forward the request to hostname2. It could also have decided to process it locally, since the decision is made on a round-robin basis, but that is not what I wanted to show. Here are the interactions I reconstructed from the traces captured on the client, hostname1 and hostname2.

As you can see, the client always talks to the VIP address and thus sends its requests to hostname1 in my case. Thanks to keepalived, hostname1 forwards the request to hostname2. The important point is that hostname2 responds directly to the client without going back through the VIP machine. So the client does not know that the actual responder is hostname2, since the packet it receives has the VIP address as its source address. The key point to make this work is to ensure that hostname2 can accept and process packets with the VIP address as destination address, which is not the case by default. Here it works because of my PREROUTING rule. Another way would be to add the VIP address as a secondary address on hostname2, but in my configuration only the first option can work since both machines run IPVS: if I set up the VIP address on each machine, infinite loops can appear between hostname1 and hostname2 whenever each one decides to forward the request to the other.

You have seen the traces when everything works as expected. But what if, for example, hostname2 is not configured properly to accept requests with the VIP as destination address? For test purposes, I manually removed my PREROUTING rule on hostname2 to see what happens.

First of all, you notice that the client does not receive any response. As before, the client sends its first request to the VIP, and the VIP owner does its job correctly: it forwards the request to hostname2. But here is the problem: hostname2 receives the SYN with the VIP as destination address. There is no reason for it to process such a request, so it simply drops the packet. The same SYN keeps being retransmitted until the connection attempt finally times out. That's why you should take care to correctly configure every machine that is supposed to respond to requests with the VIP as destination address.

This article should be seen as a continuation of the post entitled DRBD8 with two primaries on debian etch. We noticed that DRBD8 with two primaries cannot by itself ensure filesystem integrity and does not work as expected with an Ext3 filesystem: DRBD only takes care of synchronization. As shown in the following figure, read accesses take place only locally, while write accesses are applied both locally and remotely on the other node, so that at any time the global filesystem is consistent on both nodes.

This post shows how to ensure data protection when two primary nodes are synchronized with DRBD8. The key point is the filesystem: a lock mechanism must be in place to ensure protection. That's why you have to deploy a filesystem that includes such a mechanism, like GFS2 or OCFS2. The latter revealed malfunctions in our tests, so we focus on setting up GFS2 on debian etch, which was the only one that met our requirements. We thus obtain a shared-nothing architecture between two nodes using DRBD8 + GFS2. Such a system does not consume too many resources, since only write accesses travel across the network and each host processes a request only once. Keep in mind that a cluster-aware filesystem is mandatory in our case: a filesystem with a lock manager is the only way to prevent data corruption when modifications are made from the two nodes simultaneously.

Configuration files

System requirements

  • 2.6.24 kernel at least

[tux]# apt-get update
[tux]# apt-get install linux-image-2.6.24-etchnhalf.1-686
[tux]# apt-get install linux-headers-2.6.24-etchnhalf.1-686
[tux]# reboot

  • Install dpkg-dev and other dependencies to get and build deb packages from sources:

[tux]# apt-get install dpkg-dev debhelper dpatch fakeroot gcc libc6-dev bzip2

  • Update your sources list to be able to get packages from debian lenny, which, contrary to etch, contains the GFS2 packages. Add the following lines to your /etc/apt/sources.list and comment out the default ones:

deb http://ftp.de.debian.org/debian lenny main
deb-src http://ftp.de.debian.org/debian lenny main
#deb http://ftp.de.debian.org/debian etch main
#deb-src http://ftp.de.debian.org/debian etch main

  • Then run apt-get update

[tux]# apt-get update

Check out GFS2 sources from lenny

  • First, create a directory to store source packages, enter:

[tux]# mkdir build
[tux]# cd build/

  • Use apt-get command to get source code for a package:

[tux]# apt-get source gfs2-tools

  • At the same time, get the following source packages from lenny repository, you will need them later to be able to build gfs2-tools successfully:

[tux]# apt-get source findutils libopenais-dev libvolume-id-dev

  • Once this step is done, you don't need the lenny repository any more. Update your sources list to get packages from debian backports (this avoids compiling every dependency) and comment out the lenny repository; afterwards you will get every package from the default etch repository or from backports. To install the remaining dependencies, your /etc/apt/sources.list should contain the following lines:

#deb http://ftp.de.debian.org/debian lenny main
#deb-src http://ftp.de.debian.org/debian lenny main
deb http://ftp.fr.debian.org/debian/ etch main
deb-src http://ftp.fr.debian.org/debian/ etch main
deb http://www.backports.org/debian etch-backports main contrib non-free

  • If you use the backports repository for the first time and want apt to verify the downloaded backports packages, import the backports.org archive key into apt, then run apt-get update to take the modification into account:

[tux]# apt-get update
[tux]# apt-get install debian-backports-keyring
[tux]# apt-get update

Build the packages fetched from lenny

  • To compile and build a debian package, enter its directory and issue the command:

[tux]# cd redhat-cluster-2.[current_date]/
[tux]# dpkg-buildpackage -rfakeroot -b -uc
dpkg-checkbuilddeps: Unmet build dependencies: libxml2-dev libncurses5-dev libopenais-dev (>= 0.83) libvolume-id-dev (>= 0.105-4) linux-libc-dev (>= 2.6.26) libvirt-dev (>= 0.3.0) libnss3-dev libnspr4-dev libslang2-dev

  • There are lots of unmet dependencies for gfs2-tools. First build and install libopenais-dev:

[tux]# cd ..
[tux]# cd openais-0.83/
[tux]# dpkg-buildpackage -rfakeroot -b -uc
dpkg-deb: building package `openais' in `../openais_0.83-1_i386.deb'.
dpkg-deb: building package `libopenais2' in `../libopenais2_0.83-1_i386.deb'.
dpkg-deb: building package `libopenais-dev' in `../libopenais-dev_0.83-1_i386.deb'.

[tux]# cd ..
[tux]# dpkg -i libopenais-dev_0.83-1_i386.deb libopenais2_0.83-1_i386.deb

  • Then build and install libvolume-id-dev:

[tux]# cd udev-0.125/
[tux]# dpkg-buildpackage -rfakeroot -b -uc
dpkg-checkbuilddeps: Unmet build dependencies: quilt (>= 0.40) libselinux1-dev (>= 1.28)

[tux]# apt-get install quilt libselinux1-dev
[tux]# dpkg-buildpackage -rfakeroot -b -uc
touch .stamp-build
fakeroot debian/rules binary
find: invalid expression; you have used a binary operator with nothing before it.

  • libvolume-id-dev cannot be compiled successfully. The problem is related to the find version shipped with etch, so we will compile and install a newer one from the sources we downloaded previously:

[tux]# find --version
GNU find version 4.2.28

[tux]# cd ..
[tux]# cd findutils-4.4.0/
[tux]# dpkg-buildpackage -rfakeroot -b -uc
dpkg-checkbuilddeps: Unmet build dependencies: autotools-dev dejagnu

[tux]# apt-get install autotools-dev dejagnu
[tux]# dpkg-buildpackage -rfakeroot -b -uc
dpkg-deb: building package `findutils' in `../findutils_4.4.0-2_i386.deb'.
dpkg-deb: building package `locate' in `../locate_4.4.0-2_i386.deb'.

[tux]# cd ..
[tux]# dpkg -i findutils_4.4.0-2_i386.deb locate_4.4.0-2_i386.deb
[tux]# find --version
find (GNU findutils) 4.4.0

  • You can now successfully compile and install libvolume-id-dev:

[tux]# cd udev-0.125/
[tux]# dpkg-buildpackage -rfakeroot -b -uc
dpkg-deb: building package `udev' in `../udev_0.125-7_i386.deb'.
dpkg-deb: building package `libvolume-id0' in `../libvolume-id0_0.125-7_i386.deb'.
dpkg-deb: building package `libvolume-id-dev' in `../libvolume-id-dev_0.125-7_i386.deb'.
dpkg-deb: building package `udev-udeb' in `../udev-udeb_0.125-7_i386.udeb'.

[tux]# cd ..
[tux]# dpkg -i libvolume-id-dev_0.125-7_i386.deb libvolume-id0_0.125-7_i386.deb

  • After these steps, you can install linux-libc-dev and libvirt-dev from the backports repository:

[tux]# apt-get install linux-libc-dev libvirt-dev

  • Then proceed with the installation of the standard etch dependencies:

[tux]# apt-get install libxml2-dev libncurses5-dev libnss3-dev libnspr4-dev libslang2-dev psmisc

  • You should now be able to compile gfs2-tools:

[tux]# cd redhat-cluster-2.20080801/
[tux]# dpkg-buildpackage -rfakeroot -b -uc
upgrade.o: In function `upgrade_device_archive': /root/build/redhat-cluster-2.20081102/ccs/ccs_tool/upgrade.c:226: undefined reference to `mkostemp’

  • Oops, there is still a problem to solve: the mkostemp function was only introduced in glibc 2.7. We have to switch to the standard mkstemp function, which is available under etch (glibc 2.6). Edit ccs/ccs_tool/upgrade.c and use the available function:
ccs/ccs_tool/upgrade.c
@@ -223,7 +223,7 @@ static int upgrade_device_archive(char *location){
memset(tmp_file, 0, 128);
sprintf(tmp_file, "/tmp/ccs_tool_tmp_XXXXXX");

-  tmp_fd = mkostemp(tmp_file, O_RDWR | O_CREAT |O_TRUNC);
+  tmp_fd = mkstemp(tmp_file);
if(tmp_fd < 0){
  fprintf(stderr, "Unable to create temporary archive: %s\n",
  strerror(errno));
  error = -errno;
--
  • Finally we can successfully compile gfs2-tools:

[tux]# dpkg-buildpackage -rfakeroot -b -uc
[tux]# cd ..

  • In order to use a GFS2 filesystem, you need the gfs2-tools package as well as a cluster manager and a lock manager.

[tux]# dpkg -i gfs2-tools_2.20081102-1_i386.deb libcman2_2.20081102-1_i386.deb libdlm2_2.20081102-1_i386.deb cman_2.20081102-1_i386.deb openais_0.83-1_i386.deb

cman depends on libnet-snmp-perl;
however: Package libnet-snmp-perl is not installed.
cman depends on libnet-telnet-perl;
however: Package libnet-telnet-perl is not installed.
cman depends on python-pexpect;
however: Package python-pexpect is not installed.
cman depends on sg3-utils;
however: Package sg3-utils is not installed.

  • As you can see, cman still depends on several packages that are not installed yet. Install them and, if problems occur, fix the broken dependencies to make everything work correctly:

[tux]# apt-get install libnet-snmp-perl libnet-telnet-perl python-pexpect sg3-utils
[tux]# apt-get -f install

  • Following this procedure, you should now be able to complete the installation:

[tux]# dpkg -i gfs2-tools_2.20081102-1_i386.deb libcman2_2.20081102-1_i386.deb libdlm2_2.20081102-1_i386.deb cman_2.20081102-1_i386.deb openais_0.83-1_i386.deb

Configuration requirements

  • Create a cluster directory and copy provided configuration file:

[tux]# mkdir /etc/cluster
[tux]# cp cluster.conf /etc/cluster/

  • Adjust the hostname, that is to say specify in /etc/hostname the same hostname as in /etc/cluster/cluster.conf.
  • Every node has to be reachable by its hostname, so you have to define every hostname in the hosts file of both nodes:

[tux]# vi /etc/hosts
127.0.0.1 localhost
192.168.9.xx hostname2.domain.org hostname2
192.168.9.xx hostname1.domain.org hostname1

  • As a reminder, we will deploy a GFS2 filesystem on an architecture composed of two nodes, so you have to perform each step on both nodes. When this is done, each node has to be reachable by its hostname from the other one. To verify that this works correctly, check that pings succeed:

[hostname1]# ping hostname2
[hostname2]# ping hostname1

  • If this is the case, you can go further and start the cluster manager at the same time on each node. This daemon is required to use a GFS2 filesystem: a node will not be allowed to mount a GFS2 filesystem unless it is running fenced, fencing happens in the context of a cman/openais cluster, and a node must be a cluster member before it can run fenced. That's why we have to start the cluster manager first:
[hostname1]# /etc/init.d/cman start
Starting cluster manager:
 Loading kernel modules: done
 Mounting config filesystem: done
 Starting cluster configuration system: done
 Joining cluster: done
 Starting daemons: groupd fenced dlm_controld gfs_controld
 Joining fence domain: done
 Starting Quorum Disk daemon: done
[hostname2]# /etc/init.d/cman start
  • Check if the cluster manager is correctly started:
[hostname1]# cman_tool nodes
Node  Sts   Inc   Joined               Name
   1   M     32   2009-01-13 17:30:42  hostname1
   2   M     28   2009-01-13 17:30:42  hostname2
[hostname1]# cman_tool status
Version: 6.1.0
Config Version: 1
Cluster Name: cluster
Cluster Id: 13364
Cluster Member: Yes
Cluster Generation: 32
Membership state: Cluster-Member
Nodes: 3
Expected votes: 1
Total votes: 2
Node votes: 1
Quorum: 1
Active subsystems: 7
Flags: 2node Dirty
Ports Bound: 0
Node name: hostname1
Node ID: 1
Multicast addresses: 239.192.52.104
Node addresses: 192.168.9.xx
  • If cman spends ages on "Waiting for fenced to join the fence group.", ensure that it is also running on the other node and, more importantly, be aware that the clusternode names in cluster.conf must be resolvable via DNS (or /etc/hosts) to an IP address reachable from the other node.
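  • For reference, a minimal two-node cluster.conf consistent with the cman_tool output above could look like the sketch below (manual fencing only; the actual provided file may differ):

<?xml version="1.0"?>
<cluster name="cluster" config_version="1">
  <cman two_node="1" expected_votes="1"/>
  <clusternodes>
    <clusternode name="hostname1" nodeid="1" votes="1">
      <fence>
        <method name="single">
          <device name="manual" nodename="hostname1"/>
        </method>
      </fence>
    </clusternode>
    <clusternode name="hostname2" nodeid="2" votes="1">
      <fence>
        <method name="single">
          <device name="manual" nodename="hostname2"/>
        </method>
      </fence>
    </clusternode>
  </clusternodes>
  <fencedevices>
    <fencedevice agent="fence_manual" name="manual"/>
  </fencedevices>
</cluster>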

The next step is to install and configure DRBD8

  • Please refer to the following post if you need more information. You need to perform following steps on both nodes.
  • Install DRBD on both machines:

[tux]# apt-get install drbd8-source drbd8-utils
[tux]# module-assistant auto-install drbd8

  • Prepare the DRBD disk: create a new logical partition with fdisk, /dev/sdb5 in our case.
  • Adjust drbd.conf according to your system, that is to say specify your local partition as well as the correct hostnames and IP addresses, then copy it into the /etc directory (a sketch of such a configuration is shown at the end of this section).
  • Create the DRBD resource. The DRBD partition must have the same size on both nodes to work correctly.

[tux]# drbdadm create-md r0

  • If an error occurs, either resize the filesystem to the usable size reported by the previous command ("current configuration leaves usable …") or erase the current content of /dev/sdb5, by running one of the following commands on each node:

[tux]# e2fsck -f /dev/sdb5 && resize2fs /dev/sdb5 current_conf_sizeK

or

[tux]# dd if=/dev/zero of=/dev/sdb5

  • And then you can successfully create your resource on both machines:

[tux]# drbdadm create-md r0

  • When this is done on both nodes, you can start DRBD, then force one node into the primary state:

[hostname1]# /etc/init.d/drbd start
[hostname2]# /etc/init.d/drbd start
[hostname1]# drbdsetup /dev/drbd0 primary -o

  • Wait for the end of the synchronization. Then you can also set the secondary node to the primary state in order to get two primaries:

[hostname2]# drbdadm primary r0

  • At the end the two nodes should be in the following state:
[hostname1]# /etc/init.d/drbd status
drbd driver loaded OK; device status:
version: 8.0.14 (api:86/proto:86)
GIT-hash: bb447522fc9a87d0069b7e14f0234911ebdab0f7 build by phil@fat-tyre,
2008-11-12 16:40:33
m:res  cs         st               ds                 p  mounted  fstype
0:r0   Connected  Primary/Primary  UpToDate/UpToDate  C

Couple DRBD8 and GFS2

  • Create the GFS2 filesystem on the DRBD device using the dlm lock manager. Since the storage is already clustered, you only have to make the filesystem on one node (thank you for your comment Peter ;-)):

[hostname1]# mkfs.gfs2 -t cluster:gfs1 -p lock_dlm -j 2 /dev/drbd0

  • Here is a brief description of the parameters used:
-t clustername:fsname
   Clustername must match that in cluster.conf; only members
   of this cluster are permitted  to use  this file system.
   Fsname is a unique file system name used to distinguish this
   GFS2 file system from others created (1 to 16 characters).
   Lock_nolock doesn't use this field.

-j Number
   The number of journals for gfs2_mkfs to create. You need at
   least one journal per machine that will mount the filesystem.
   If this option is not specified, one journal will be created.

-p LockProtoName
   LockProtoName is the name of the locking protocol  to  use.
   Acceptable locking  protocols are lock_dlm (for shared storage)
   or if you are using GFS2 as a local filesystem  (1  node  only),
   you can specify the lock_nolock protocol.  If this option is not
   specified, lock_dlm protocol will be assumed.
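
Note that the number of journals limits how many nodes can mount the filesystem. If you later add a third node to the cluster, you will need an extra journal; as a sketch (using the /synchronized mount point created later in this post), gfs2_jadd can add one to a mounted GFS2 filesystem:

[hostname1]# gfs2_jadd -j 1 /synchronized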

Test the shared nothing architecture

  • You can finally check if your system works correctly. First create a new directory:

[hostname1]# mkdir /synchronized
[hostname2]# mkdir /synchronized

  • Then mount the DRBD partition at this location. You should be able to access it read/write on both machines simultaneously.

[hostname1]# mount -t gfs2 /dev/drbd0 /synchronized
[hostname2]# mount -t gfs2 /dev/drbd0 /synchronized

  • If you followed every step carefully, your system should be set up correctly and the DRBD status command should return the following state on both nodes. You can see that DRBD is mounted on /synchronized.
[hostname1]# /etc/init.d/drbd status
drbd driver loaded OK; device status:
version: 8.0.14 (api:86/proto:86)
GIT-hash: bb447522fc9a87d0069b7e14f0234911ebdab0f7 build by phil@fat-tyre,
2008-11-12 16:40:33
m:res  cs         st               ds                 p  mounted        fstype
0:r0   Connected  Primary/Primary  UpToDate/UpToDate  C  /synchronized  gfs2
  • We can try to create files from both locations to see if it works as it should.

[hostname1]# touch /synchronized/from_host1
[hostname2]# touch /synchronized/from_host2
[hostname1]# ls /synchronized
from_host1 from_host2
[hostname2]# ls /synchronized
from_host1 from_host2

  • That’s it, and… it works properly! I even created several thousand entries in parallel on both nodes and it worked like a charm. No file disappeared or was altered.
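
If you want to reproduce that kind of parallel test, a crude sketch (the file count and naming are arbitrary) is to create a batch of files from each node at the same time and compare the listings afterwards:

[hostname1]# for i in $(seq 1 5000); do touch /synchronized/host1_$i; done
[hostname2]# for i in $(seq 1 5000); do touch /synchronized/host2_$i; done
[hostname1]# ls /synchronized | wc -l
[hostname2]# ls /synchronized | wc -l

Once both loops are finished, the two counts should be identical.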

Automate GFS2 partition mounting at startup

  • Copy the provided mount script into the /etc/init.d directory (a sketch of the script is shown below):

[tux]# cp mountgfs2.sh /etc/init.d/
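
The original mountgfs2.sh is not reproduced in this post; assuming the device and mount point used above, such an init script could look like this:

#!/bin/sh
# mountgfs2.sh - mount/unmount the GFS2 filesystem carried by the DRBD device
# (illustrative sketch; the script provided with this post may differ)

DEVICE=/dev/drbd0
MOUNTPOINT=/synchronized

case "$1" in
  start)
    echo "Mounting GFS2 filesystem on $MOUNTPOINT"
    mount -t gfs2 "$DEVICE" "$MOUNTPOINT"
    ;;
  stop)
    echo "Unmounting GFS2 filesystem from $MOUNTPOINT"
    umount "$MOUNTPOINT"
    ;;
  restart)
    umount "$MOUNTPOINT"
    mount -t gfs2 "$DEVICE" "$MOUNTPOINT"
    ;;
  *)
    echo "Usage: $0 {start|stop|restart}" >&2
    exit 1
    ;;
esac

exit 0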

  • Then take care to call the script at the right time: the GFS2 partition needs to be mounted after DRBD starts (so that /dev/drbd0 exists) and unmounted before DRBD stops (to free the partition and allow DRBD to quit properly):
[tux]# update-rc.d mountgfs2.sh start 70 2 3 4 5 . stop 07 0 1 6 .
  Adding system startup for /etc/init.d/mountgfs2.sh ...
       /etc/rc0.d/K07mountgfs2.sh -> ../init.d/mountgfs2.sh
       /etc/rc1.d/K07mountgfs2.sh -> ../init.d/mountgfs2.sh
       /etc/rc6.d/K07mountgfs2.sh -> ../init.d/mountgfs2.sh
       /etc/rc2.d/S70mountgfs2.sh -> ../init.d/mountgfs2.sh
       /etc/rc3.d/S70mountgfs2.sh -> ../init.d/mountgfs2.sh
       /etc/rc4.d/S70mountgfs2.sh -> ../init.d/mountgfs2.sh
       /etc/rc5.d/S70mountgfs2.sh -> ../init.d/mountgfs2.sh

Failure handling

Let’s spend some time on the behavior in case of failure. In the current state, only manual fencing is configured in cluster.conf. Thus, if one node fails, access to the shared filesystem is frozen until we are sure the failed node is really dead.

  • If hostname2 fails for one reason or another, the following lines will appear in /var/log/syslog on hostname1:
Jan 13 22:04:10 hostname1 kernel: dlm: closing connection to node 2
Jan 13 22:04:10 hostname1 fenced[2543]: hostname2 not a cluster member after 0 sec post_fail_delay
Jan 13 22:04:10 hostname1 fenced[2543]: fencing node "hostname2"
Jan 13 22:04:10 hostname1 fenced[2543]: fence "hostname2" failed
  • During this time, access to the GFS2 filesystem is frozen. A manual fencing acknowledgement (executed on the functional machine) is needed to regain access to the shared partition:

[hostname1]# fence_ack_manual -n hostname2

  • Once this is done, repair the failed node and reconnect it to the valid one only when you are sure it is OK; this is necessary to avoid corruption!

Admittedly, you could try to automate this procedure by monitoring the logs and triggering the manual fencing only when necessary. That would allow you to keep accessing the GFS2 partition even if a node failed. In reality, this trick is strongly discouraged since there is no way to guarantee data integrity. The only way to deal with failures safely is to acquire a dedicated fence device that will STONITH the faulty node. I know, buying a new network device is an extra constraint, but it is the only way to consider such a system in a production environment.
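
As an illustration only, here is what the fencing part of cluster.conf could look like with a dedicated power switch instead of the manual agent; the fence_apc agent is shown as an example, and the address, credentials and port below are purely hypothetical placeholders:

<clusternode name="hostname2" nodeid="2" votes="1">
  <fence>
    <method name="power">
      <device name="apc" port="2"/>
    </method>
  </fence>
</clusternode>
...
<fencedevices>
  <fencedevice name="apc" agent="fence_apc" ipaddr="192.168.9.250" login="apc" passwd="secret"/>
</fencedevices>

With such a device declared for both nodes, fenced can power off the failed node on its own instead of waiting for fence_ack_manual.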

Other Considerations

We set up a functional shared nothing architecture on two nodes using a GFS2 filesystem on top of DRBD8. If you plan to deploy such a system in a production environment, there are a few things to keep in mind. First of all, you should find a way to benchmark the system according to your needs. There was no loss or data corruption in my tests, but you could encounter performance issues: for example, if your synchronized partition is supposed to contain billions of files, access times could become too long due to the locking mechanism. Otherwise, the system has proved to be stable. I was even able to recover manually from failure states!

December 10th, 2008 | Tags: , , , , ,

The goal of this post is to set up an architecture composed of two nodes that can both simultaneously have read/write access to a synchronized partition. Each node will access the synchronized partition as any other classical partition; the only difference is that every modification will also be forwarded to the other node.

DRBD8 configuration file

System requirements

  • 2.6.24 kernel at least

[tux]# apt-get update
[tux]# apt-get install linux-image-2.6.24-etchnhalf.1-686
[tux]# apt-get install linux-headers-2.6.24-etchnhalf.1-686
[tux]# reboot
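
Once the machine has rebooted, you can check that the 2.6.24 kernel is the one actually running:

[tux]# uname -r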

Configuration requirements

  • Adjust the hostname, that is to say specify in /etc/hostname the same hostname as in /etc/drbd.conf (see the example after the hosts file below).
  • Every resource has to be accessible by its hostname; that’s why you have to define every hostname in the hosts file:

[tux]# vi /etc/hosts
192.168.9.xx hostname1.domain.org hostname1
192.168.9.xx hostname2.domain.org hostname2
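
For the hostname adjustment mentioned above, a simple way to do it on the first node (do the equivalent on the second one) is:

[tux]# echo "hostname1" > /etc/hostname
[tux]# hostname hostname1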

Update your sources list to be able to get packages from the Debian backports

  • Add the following line to your /etc/apt/sources.list:

deb http://www.backports.org/debian etch-backports main contrib non-free

  • If you are using Etch and you want apt to verify the downloaded backports, you can import the backports.org archive’s key into apt and then run apt-get update:

[tux]# apt-get install debian-backports-keyring
[tux]# apt-get update

Install DRBD8

  • Install DRBD8 source and DRBD8 utils:

[tux]# apt-get install drbd8-source drbd8-utils

  • Build DRBD8 module from previously installed sources using the module-assistant:

[tux]# module-assistant auto-install drbd8
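
Optionally, you can check that the freshly built module loads before going further; once it is loaded, /proc/drbd reports the module version:

[tux]# modprobe drbd
[tux]# cat /proc/drbd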

Prepare the partition that will be synchronized thanks to DRBD

  • Generally it’s better to think about the DRBD partition early and prepare a dedicated partition during operating system installation. Since my machines are hosted on VMware, I simply added a new virtual disk. Then we will proceed to its configuration. Do the same on both nodes as usual!
[tux]# fdisk -l

Disk /dev/sda: 6442 MB, 6442450944 bytes
255 heads, 63 sectors/track, 783 cylinders
Units = cylinders of 16065 * 512 = 8225280 bytes

Device       Boot   Start    End     Blocks    Id                 System
/dev/sda1       *       1    743    5968116    83                  Linux
/dev/sda2             744    783     321300     5               Extended
/dev/sda5             744    783    321268+    82   Linux swap / Solaris

Disk /dev/sdb: 1073 MB, 1073741824 bytes
255 heads, 63 sectors/track, 130 cylinders
Units = cylinders of 16065 * 512 = 8225280 bytes

Disk /dev/sdb doesn't contain a valid partition table
  • Let’s configure the newly added disk, /dev/sdb. First, create a logical partition inside an extended one that completely fills the new disk:
[tux]# fdisk /dev/sdb

Command (m for help): n
Command action
e   extended
p   primary partition (1-4)
e
Partition number (1-4): 1
First cylinder (1-130, default 1):
Using default value 1
Last cylinder or +size or +sizeM or +sizeK (1-130, default 130):
Using default value 130

Command (m for help): n
Command action
l   logical (5 or over)
p   primary partition (1-4)

l
First cylinder (1-130, default 1):
Using default value 1

Last cylinder or +size or +sizeM or +sizeK (1-130, default 130):

Using default value 130

Command (m for help): w
The partition table has been altered!

Calling ioctl() to re-read partition table.
Syncing disks.
  • In our case, the logical partition that will be synchronized by DRBD is named /dev/sdb5:
[tux]# fdisk -l /dev/sdb

Disk /dev/sdb: 1073 MB, 1073741824 bytes
255 heads, 63 sectors/track, 130 cylinders
Units = cylinders of 16065 * 512 = 8225280 bytes

Device       Boot   Start    End     Blocks    Id      System
/dev/sdb1               1    130   1044193+     5    Extended
/dev/sdb5               1    130    1044162    83       Linux
  • /dev/sdb5 is ready to be used by DRBD. Thanks to the comment of Mark, we know that if there is no existing data on the device, we should not create a filesystem on the underlying device! It will be done later on the DRBD device itself.

Configure DRBD8

  • Adjust drbd.conf according to your system, that is to say specify your local partition as well as the correct hostnames and IP addresses, then copy it into the /etc directory. The relevant host sections look like this:
on hostname1 {
  device /dev/drbd0;
  disk /dev/sdb5;
  address 192.168.9.xx:7789;
  meta-disk internal;
}
on hostname2 {
  device /dev/drbd0;
  disk /dev/sdb5;
  address 192.168.9.xx:7789;
  meta-disk internal;
}
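
These two “on” sections live inside a resource block. The full drbd.conf is not reproduced here, but a minimal sketch of the resource, assuming the dual-primary option this setup relies on (the split-brain policies and syncer rate below are illustrative values), could look like this:

resource r0 {
  protocol C;
  net {
    # required to promote both nodes to primary later on
    allow-two-primaries;
    # illustrative split-brain recovery policies
    after-sb-0pri discard-zero-changes;
    after-sb-1pri discard-secondary;
    after-sb-2pri disconnect;
  }
  syncer {
    rate 10M;
  }
  on hostname1 {
    device /dev/drbd0;
    disk /dev/sdb5;
    address 192.168.9.xx:7789;
    meta-disk internal;
  }
  on hostname2 {
    device /dev/drbd0;
    disk /dev/sdb5;
    address 192.168.9.xx:7789;
    meta-disk internal;
  }
}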

Create DRBD resource

  • First of all, create your resource. The resource name corresponds to what you specified in your drbd.conf (here the resource name is r0). The DRBD partition must have the same size on both nodes to work correctly.
  • If there is no filesystem on /dev/sdb5, the following command should do the trick (on both nodes):
[tux]# drbdadm create-md r0
You want me to create a v08 style flexible-size internal meta data block.
There apears to be a v08 flexible-size internal meta data block
already in place on /dev/sdb5 at byte offset 106917888
Do you really want to overwrite the existing v08 meta-data?
[need to type 'yes' to confirm] yes

Writing meta data...
initialising activity log
NOT initialized bitmap
New drbd meta data block sucessfully created.
  • If any error like the following one occurs:
[tux]# drbdadm create-md r0
md_offset 1069215744
al_offset 1069182976
bm_offset 1069150208

Found ext3 filesystem which uses 1044160 kB
current configuration leaves usable 1044092 kB

Device size would be truncated, which
would corrupt data and result in
'access beyond end of device' errors.
You need to either
* use external meta data (recommended)
* shrink that filesystem first
* zero out the device (destroy the filesystem)
Operation refused.

Command 'drbdmeta /dev/drbd0 v08 /dev/sdb5 internal create-md'
terminated with exit code 40
drbdadm aborting
  • It means there was some data on /dev/sdb5, or at least a filesystem. In this kind of scenario there are two ways to solve the problem: adjust the partition size, or back up and erase the existing data.
    1. Adjusting the partition size is the most dangerous option! Use it at your own risk. Many people encountered ‘access beyond end of device’ errors when the device was almost full! (thank you for your comment Mark ;-)). If you still want to do it, use the “current configuration leaves usable” value returned by the previous command (1044092) to resize the device. Perform the same commands on each node.

[tux]# e2fsck -f /dev/sdb5 && resize2fs /dev/sdb5 1044092K

    2. The second option is the safer one: back up the existing data on another device, then erase its current content:

[tux]# dd if=/dev/zero of=/dev/sdb5

  • And then you can successfully create your resource on both machines:

[tux]# drbdadm create-md r0

Start DRBD8 and ensure it works correctly

  • You can start DRBD this way (it will complain at first startup, since you still have to force exactly one node into the primary state):

[hostname1]# /etc/init.d/drbd start
[hostname2]# /etc/init.d/drbd start

  • Or do it manually (attach and connect the resource r0):

[hostname1]# drbdadm attach r0
[hostname1]# drbdadm connect r0
[hostname2]# drbdadm attach r0
[hostname2]# drbdadm connect r0
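
At this stage both nodes should show up as connected secondaries with inconsistent data, which you can verify before forcing one of them into the primary state:

[tux]# cat /proc/drbd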

  • In order to get consistent data, force the DRBD partition into the primary state on only one of the two nodes; this will trigger the synchronization process. (“drbdadm -- --overwrite-data-of-peer primary r0” is equivalent to the following command.)

[hostname1]# drbdsetup /dev/drbd0 primary -o

  • Wait for the end of the synchronization.
[hostname1]# /etc/init.d/drbd status
drbd driver loaded OK; device status:
version: 8.0.13 (api:86/proto:86)
GIT-hash: ee3ad77563d2e87171a3da17cc002ddfd1677dbe build by phil@fat-tyre,
2008-08-04 15:28:07
m:res  cs         st                ds                    p mounted fstype
...    sync'ed:   20.8%             (830012/1044092)K
0:r0   SyncSource Primary/Secondary UpToDate/Inconsistent C
  • When 100% is reached, you can set the secondary node into the primary state in order to get two primaries:

[hostname2]# drbdadm primary r0

  • Verify that everything works fine:
[tux]# /etc/init.d/drbd status
drbd driver loaded OK; device status:
version: 8.0.13 (api:86/proto:86)
GIT-hash: ee3ad77563d2e87171a3da17cc002ddfd1677dbe build by phil@fat-tyre,
2008-08-04 15:28:07
m:res cs        st              ds                p mounted fstype
0:r0  Connected Primary/Primary UpToDate/UpToDate C
  • Now it’s time to create your ext3 filesystem on the DRBD device! (Only in the case where there was no existing data on /dev/sdb5.)

[hostname1]# mke2fs -j /dev/drbd0

  • You can finally check if your system works correctly. First create a new directory:

[hostname1]# mkdir /synchronized
[hostname2]# mkdir /synchronized

  • Then mount the DRBD partition at this location. You should be able to access it read/write on both machines simultaneously.

[hostname1]# mount /dev/drbd0 /synchronized
[hostname2]# mount /dev/drbd0 /synchronized

  • If you were unable to mount your partition and got the following error message, ensure you have correctly created a valid filesystem on the DRBD device as explained before.

[tux] # mount /dev/drbd0 /synchronized/
mount: you must specify the filesystem type

  • If you followed every step carefully, your system should be set up correctly and the DRBD status command should return the following state on both nodes. You can see that DRBD is mounted on /synchronized.
[hostname1]# /etc/init.d/drbd status
drbd driver loaded OK; device status:
version: 8.0.13 (api:86/proto:86)
GIT-hash: ee3ad77563d2e87171a3da17cc002ddfd1677dbe build by phil@fat-tyre,
2008-08-04 15:28:07
m:res cs        st              ds                p mounted       fstype
0:r0  Connected Primary/Primary UpToDate/UpToDate C /synchronized ext3
  • According to the DRBD specifications, our system and configuration are valid. We can try to create files from both locations to see if it works as it should.

[hostname1]# touch /synchronized/from_host1
[hostname2]# touch /synchronized/from_host2
[hostname1]# ls /synchronized
lost+found from_host1
[hostname2]# ls /synchronized
lost+found from_host2

  • As you can notice, our system does not work as expected, since we are not able to see the file added from the other node. I did not find a way to make it work properly using a standard ext3 filesystem. Generally, you have to unmount the partition, put it into the secondary state, put it back into the primary state and mount it again to be sure to see the modifications made by the other node (sketched below). The DRBD documentation specifies that in order to use the software safely with two primaries, we have to couple DRBD with a specific filesystem that ensures integrity. I will go further and assert that, even setting the corruption risk aside, DRBD simply cannot work properly in dual-primary mode without such a filesystem.
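
For reference, the manual workaround mentioned above looks like this on the node that needs to catch up (here hostname2):

[hostname2]# umount /synchronized
[hostname2]# drbdadm secondary r0
[hostname2]# drbdadm primary r0
[hostname2]# mount /dev/drbd0 /synchronized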

Other considerations

You have to keep in mind that there is no integrity guarantee between the two machines. Therefore we are in any case forced to implement a locking mechanism to avoid data corruption; DRBD8 deals only with the synchronization between nodes. In order to circumvent this drawback, the DRBD-synchronized partition must use a cluster filesystem such as GFS2 if you want to see modifications from both nodes almost in real time. A next post will go into further details and explain how to build a shared nothing architecture between two nodes using DRBD8 + GFS2 on Debian Etch. I can already claim that this time it will work as expected, so do not become discouraged.
