Distributed computing

Distributed computing has gained popularity in recent years. It organizes ordinary PCs into a cluster that can achieve, or even surpass, the performance of a minicomputer, with major advantages in both hardware and maintenance costs.

To establish a distributed system, an esProc program can run as a server that receives requests from other esProc programs and returns the results. The basic computing model is that a controlling node sends orders to the non-controlling nodes, then collects and aggregates their results. A complex task may consist of multiple sub-tasks.

The key technologies of distributed computing are the ability to scale out and the fault tolerance required for multi-node operation.

Task distribution

First let’s look at the simple shared-data-source strategy.

So-called data sharing means that the data to be processed is stored in one place, such as a database or a network file system; the nodes only handle the tasks assigned to them and don't hold the data themselves. The shared source therefore comes under heavy pressure from concurrent accesses, which makes this strategy more suitable for computation-intensive tasks than for data-intensive ones.

It’s simple to implement the shared-data-source strategy with esProc Server:

  A The controlling program
1 =4.("192.168.0."/(10+~)/":1234") The list of 4 nodes
2 =callx("sub.dfx",to(8),8;A1) Call the node program with parameters, splitting the task into 8 sub-tasks
3 =A2.sum() Aggregate the results returned by the nodes
  A Node program (sub.dfx)
1 =hdfsfile("hdfs://192.168.0.1/persons.txt") An HDFS file
2 =A1.cursor@t(;seg:all) A cursor over one segment of the file; seg and all are parameters passed in from the controlling program
3 =A2.select(gender=="M").groups(;count(1):C) Select and count the male records
4 return A3.C Return the result

esProc provides solid support for many types of shared data sources, such as databases and HDFS.

esProc Server's distributed architecture is centerless. This is unlike distributed systems such as Hadoop, which provide a comprehensive framework that transparently presents the whole cluster as a single machine. esProc has neither a framework nor a permanent central controlling node; by abandoning a fixed cluster structure, it lets programmers control the participating nodes in code.

In a centerless distributed structure, all nodes are equal and none is special. The advantage is that the failure of one node won't stop the whole cluster from running. A distributed structure with a center, by contrast, breaks down once the central node fails.

Strictly speaking, an esProc cluster isn't completely devoid of centers. Though the server cluster as a whole is centerless, each task has its own controlling node that temporarily summons other nodes to take part in the computation. If that controlling node collapses, the whole task fails, but the cluster can still handle other tasks.

This difference between the esProc distributed system and other distributed systems is another embodiment of esProc's design philosophy of emphasizing class libraries while avoiding a rigid framework.

esProc Server is able to balance sub-tasks among the nodes. It decides whether to give a node a sub-task according to how busy the node is (the number of running threads). If a node is saturated, meaning the number of threads running on it has reached its allowed maximum, the controlling program waits until the node finishes at least one of its sub-tasks before distributing more. In this way a faster node receives more sub-tasks, balancing the workload against the resources.

If a node malfunctions during the process and cannot continue its work, the controlling program reassigns that work to the healthy nodes. This lengthens the total computing time, but provides a degree of fault tolerance.
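The behaviour described in the last two paragraphs can be sketched in a few lines. The Python below is purely illustrative and is not esProc's implementation; the node addresses, thread limits, simulated failure rate, and the sub-task "work" are all made up. Sub-tasks go to nodes that still have free threads, and a sub-task whose node fails is put back in the queue and reassigned to a healthy node.

import random

class Node:
    def __init__(self, name, max_threads):
        self.name = name
        self.max_threads = max_threads   # saturation limit of this node
        self.running = 0                 # sub-tasks currently running on it

    def has_free_thread(self):
        return self.running < self.max_threads

def run_subtask(node, subtask):
    # Pretend to run one sub-task on the node; nodes occasionally fail.
    node.running += 1
    try:
        if random.random() < 0.05:       # simulated node malfunction
            raise RuntimeError(node.name + " failed")
        return subtask * subtask         # the "work": square the input
    finally:
        node.running -= 1

def distribute(nodes, subtasks):
    results = {}
    pending = list(subtasks)
    while pending:
        st = pending.pop(0)
        # prefer a node that still has a free thread (in esProc the
        # controlling program would wait here if every node were saturated)
        free = [n for n in nodes if n.has_free_thread()]
        node = free[0] if free else nodes[0]
        try:
            results[st] = run_subtask(node, st)
        except RuntimeError:
            pending.append(st)           # reassign the failed sub-task later
    return results

nodes = [Node("192.168.0.11:1234", 4), Node("192.168.0.12:1234", 2)]
print(sum(distribute(nodes, range(1, 9)).values()))   # aggregate, like A2.sum()

In esProc itself this behaviour belongs to the server and the callx call shown earlier; the programmer only supplies the node list and the sub-task parameters.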

esProc Server's framework-free design allows the cluster to include machines of very different capability, with different memory sizes, CPU configurations, and even operating systems. In a nutshell, esProc Server is open to any machine, which lets users exploit their existing hardware to the full. By contrast, many cluster solutions built on a specific framework require the nodes to be largely homogeneous.

Data distribution

To attain better performance, we need to store data in a distributed manner, particularly for data-intensive tasks, which have a high I/O cost. For such tasks the shared-data-source strategy creates a serious throughput bottleneck, whereas distributed data storage spreads the I/O load among the nodes.

In principle, the goal of a distributed storage plan is to break the data apart and place it on different nodes, so that each node can access the data it needs locally, avoiding network transmission delays and collisions over the shared source. Data distribution doesn't mean simply dividing the data (evenly) into N segments and placing them on N nodes. That kind of distribution is fault-intolerant and probably still incurs a relatively large amount of network transmission resulting from join operations.

Unlike a common network file system, esProc Server provides a non-transparent data distribution strategy that requires programmers to decide how data should be distributed. Its virtue is that programmers can make more flexible decisions based on the algorithms and the characteristics of the data.

Generally data is physically stored in files; in theory it could also be stored in databases, but in practice it is rare to install a database on every node of a cluster with many nodes, unless the database's own cluster system is used.

esProc divides the data into N partitions and stores several different partitions on each node, so a partition may be stored redundantly on multiple nodes. esProc lets the programmer freely specify the degree of redundancy and the nodes each partition is stored on, instead of imposing a uniform system-defined redundancy factor and an automatic redundancy plan.
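For illustration only, a programmer-chosen layout might replicate a small, heavily used partition on every node while giving each larger partition just two copies. The addresses and partition numbers in this Python sketch are hypothetical and are not produced by esProc:

# Hypothetical layout chosen by the programmer, not generated by esProc:
# partition 0 holds shared dimension data and sits on every node, while
# each fact-data partition is replicated on exactly two nodes.
layout = {
    "192.168.0.11:1234": [0, 1, 2],
    "192.168.0.12:1234": [0, 2, 3],
    "192.168.0.13:1234": [0, 3, 4],
    "192.168.0.14:1234": [0, 4, 1],
}

# every fact partition has two copies, so any single node may fail
# without making a partition unreachable
for p in range(5):
    holders = [node for node, parts in layout.items() if p in parts]
    print(p, holders)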

esProc supplies a semi-transparent cross-node data retrieval scheme: the esProc program specifies the partition and a node list for data access; the local node is searched first for the desired partition, and only when that search fails are the nodes in the list searched.
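A minimal sketch of that search order, again in illustrative Python rather than esProc code: the local node is checked first, and the listed nodes are consulted only when the partition isn't there.

def locate_partition(partition, local_node, node_list, layout):
    # Search the local node first; fall back to the listed nodes only
    # when the partition isn't stored locally.
    if partition in layout.get(local_node, []):
        return local_node                       # no network access needed
    for node in node_list:
        if partition in layout.get(node, []):
            return node                         # fetch it from another node
    raise LookupError("partition %d not found on any listed node" % partition)

# reusing the hypothetical layout from the previous sketch
layout = {"192.168.0.11:1234": [0, 1, 2], "192.168.0.12:1234": [0, 2, 3],
          "192.168.0.13:1234": [0, 3, 4], "192.168.0.14:1234": [0, 4, 1]}
print(locate_partition(3, "192.168.0.11:1234",
                       ["192.168.0.12:1234", "192.168.0.13:1234"], layout))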

In principle a sub-task should be assigned to a node holding the data partitions it needs, in order to reduce network traffic. The esProc controlling program assigns sub-tasks to nodes according to their handling capacity. On receiving a sub-task, a node checks whether the data partitions the sub-task involves exist locally; if not, it returns a failure so that the controlling program finds another node for the sub-task, ensuring that a sub-task and the data it needs end up on the same node.

  A The controlling program, essentially the same as with data sharing
1 =4.("192.168.0."/(10+~)/":1234") A list of 4 nodes
2 =callx("sub.dfx",to(8);A1) Call the node program with parameters, dividing the task into 8 sub-tasks
3 =A2.sum() Sum up the results
  A B Node program (sub.dfx)
1 =file("person.txt",z)   Data partition number z is a parameter passed in from the controlling program
2 if !A1.exists() end "data not found" Return an error if the needed partition isn't found, so that the controlling program reassigns the sub-task
3 =A1.cursor()
4 =A3.select(gender=="M").groups(;count(1):C) Do the required filtering and counting
5 return A4.C Return the result

The check for the needed data partitions on a node is code-driven, so the distribution of partitions can be very flexible. For example, we can specify that the access-intensive partitions be placed on the local node while the less-accessed partitions are fetched from other nodes.

esProc also provides inter-node synchronization of identical partitions. We can update the data on certain nodes and then synchronize it to the others. A simple approach is to designate a high-capacity node that stores all partitions and receives the data updates, and have all the other working nodes synchronize themselves with it.
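Conceptually, the update flow can look like the sketch below: the designated storage node holds every partition, and each working node refreshes its local copies from it after an update. The paths are hypothetical, the copy is plain file copying in Python, and esProc's own synchronization functions are not shown.

import pathlib
import shutil

# Hypothetical path under which the storage node's partition files are
# reachable (e.g. a shared or mounted directory).
STORAGE_ROOT = pathlib.Path("/data/storage_node")

def sync_node(node_root, partitions):
    # Refresh a working node's local copies of the given partitions
    # from the high-capacity storage node.
    for p in partitions:
        src = STORAGE_ROOT / ("person_%d.txt" % p)
        dst = pathlib.Path(node_root) / ("person_%d.txt" % p)
        dst.parent.mkdir(parents=True, exist_ok=True)
        shutil.copy2(src, dst)    # overwrite the stale local copy

# e.g. working node 11 refreshes the two fact partitions it stores:
# sync_node("/data/node11", [1, 2])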

For each cluster computation, esProc requires the list of available nodes to be specified and the partitions to be allocated among the nodes manually. This shows that esProc Server targets small and medium-sized clusters of several, a dozen, or at most a few dozen machines, rather than large clusters of hundreds or even thousands of nodes. For a small or medium-sized cluster it is not only feasible but more flexible to control the distribution of nodes and data by hand. For a large cluster, however, manual operation would create an intolerable amount of administration, and it then becomes worth trading flexibility for more efficient management. This further explains the difference between esProc Server and a Hadoop cluster.

Each node is a standalone computer, so the in-memory and external-memory techniques explained above also apply to it.

The foremost purpose of the redundancy plan is to achieve fault tolerance. As mentioned previously, if one node collapses, the controlling program will find another node to do its job. The whole computation will still complete as long as the needed data partitions are available on the other healthy nodes, though it will take longer.

The flexible data redundancy mechanism also helps coordinate the nodes for a balanced distribution, so the cluster's workload is spread more evenly. Under the dynamic balancing plan described above, higher-performance nodes can hold more data partitions and undertake more computing work: a node that finishes its work faster will keep receiving sub-tasks involving the other data partitions it stores.

The flexibility of data redundancy is also useful for decreasing network traffic. For instance, dimension tables are referenced in almost all computations. We can put them into one partition and make that partition redundant on every node, while storing segments of the fact table on different nodes. A dimension table is normally not big and adds little to the storage burden, and since it never has to be referenced across nodes, this method effectively increases performance.

In-memory data distribution

An esProc Server data partition is a logical concept. The data doesn't necessarily come from files; it can also be data loaded into memory. In this sense a cluster acts as a machine with a huge memory, able to perform cluster-style in-memory computing with very high performance and respond almost instantly.

A redundancy-based storage plan is appropriate for file-based data partitions, but it isn't a good idea to store in-memory data redundantly on the nodes: the utilization ratio is low while the cost is high (node failures are not that frequent, but memory is expensive). So, apart from the shared dimension tables, a data partition will not, in principle, be kept in memory on multiple nodes. Keeping each partition in memory on k nodes would yield a memory utilization ratio of only 1/k (where k-1 is the maximum number of node failures tolerated).

To distribute in-memory data, esProc Server instead uses a stand-by technique. The task handler first looks for a set of nodes that between them hold all the in-memory data partitions. If a certain partition can't be found on any of the available nodes, a stand-by node that hasn't loaded any partition is located and initialized on the spot to load that partition, completing the set. An error is reported if a complete set can't be obtained. Once the set is complete, the task-distribution program assigns the sub-tasks involving in-memory data partitions exclusively to these nodes. With this technique the memory utilization ratio is n/(n+k) (n is the total number of partitions and k the number of stand-by nodes), which is much higher than with a redundancy-based fault-tolerance plan.
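A quick worked comparison of the two ratios, with arbitrary sample numbers:

# Worked example of the two memory-utilization formulas quoted above;
# n and k are arbitrary sample values.
n, k = 10, 2

redundancy_plan = 1 / k       # each partition kept in memory on k nodes
standby_plan = n / (n + k)    # n loaded partitions plus k idle stand-by nodes

print("redundancy-based plan: %.0f%%" % (100 * redundancy_plan))   # 50%
print("stand-by plan:         %.0f%%" % (100 * standby_plan))      # 83%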

Using in-memory data distribution requires an initialization step, so it is a little more complicated than analyzing external-memory data.

  A The controlling program
1 =8.("192.168.0."/(10+~)/":1234") A list of 8 nodes
2 =hosts(A1,to(4),"init.dfx") Find 4 nodes to load the in-memory data partitions
3 =callx@a("sub.dfx",to(4);A2) Call programs on those nodes to calculate
4 =A3.sum() Sum up the results
  A The program for initializing nodes (init.dfx)
1 =file("person.txt",z).import() Import data from data partition z
2 >env(T,A1) Store the data in an environment variable
3 >zone(z) Add partition z on the current node
  A Node program (sub.dfx)
1 =env(T) Retrieve the data from the environment variable
2 =A1.count(gender=="M") Perform the filtering and counting
3 return A2 Return the result

In contrast to the dynamic task-distribution mechanism used for file-based data partitions, task distribution for in-memory data partitions is static. In theory there is no fixed correspondence between a mechanism and external-memory or in-memory data: static distribution could be used for external-memory data and dynamic distribution for in-memory data. But external-memory computing is less stable and the time each sub-task takes is unpredictable, so a static mechanism would probably waste resources. In-memory computing is stable and each sub-task's computing time is basically predictable, so the static mechanism suits it better. Moreover, since in-memory data partitions are generally not stored redundantly on the nodes, the two mechanisms would produce the same assignment, making dynamic distribution unnecessary. External-memory partitions, by contrast, often have considerable redundancy across the nodes, and the dynamic mechanism gives full play to the hardware resources.
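Static distribution can be thought of as a fixed, up-front mapping from sub-tasks to the nodes that preloaded the corresponding partitions, as in this illustrative sketch (the addresses and the mapping are hypothetical, not esProc internals):

# Static distribution: sub-task i always goes to the node that loaded
# partition i during initialization.
node_of_partition = {
    1: "192.168.0.11:1234",
    2: "192.168.0.12:1234",
    3: "192.168.0.13:1234",
    4: "192.168.0.14:1234",
}

def assign(partition):
    # no balancing and no reassignment: the target is fixed up front
    return node_of_partition[partition]

print([assign(p) for p in range(1, 5)])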

Clustered dimension table

In the section above we mentioned that a dimension table is usually stored redundantly on every node. A dimension table is relatively small, and because it is accessed randomly and frequently it isn't well suited to external storage; it is therefore often imported into memory in its entirety.

Sometimes, however, a dimension table can be really big and can't be imported entirely into the memory of a single node, even if a byte table is adopted. For this case esProc Server provides the clustered dimension table, built on in-memory data partitions: the dimension table is segmented, the segments are imported into the memories of the cluster's nodes, and they are retrieved from there when referenced.

  A The controlling program
1 =8.("192.168.0."/(10+~)/":1234") A list of 8 nodes
2 =hosts(A1,to(4),"init.dfx") Find 4 nodes to load the dimension table into their memories
3 =callx("sub.dfx",8*A2,to(8);A1) Call node programs to calculate, passing in the nodes where the dimension table is stored
4 =A3.sum() Sum up the results
  A The program for initializing nodes (init.dfx)
1 =file("Product.txt",z).import() Import the segment of the dimension table stored in partition z
2 >env(T,A1) Store the data in an environment variable
3 >zone(z) Add partition z on the current node
  A B Node program (sub.dfx)
1 =file("Sales.txt",z)   Fact data in partition z
2 if !A1.exists() end "data not found" Return an error if the needed partition isn't found
3 =createx(T,h) Create a clustered dimension table from variable T over node group h, where the dimension data is stored
4 =A1.cursor()
5 =A4.join(productid,A3,price) Join with the clustered dimension table in A3
6 =A5.sum(quantity*price) Sum it up
7 return A6 Return the result

As can be seen, the nodes where a clustered dimension table is stored can be different from the nodes where calculations are performed. So we can create a clustered dimension table using certain nodes and perform calculations on others. Different clustered dimension tables can be stored on different nodes. In sum, programmers can specify this as needed.

Frequently transporting small amounts of data over the network causes a serious performance loss, which is exactly what happens if single records of a clustered dimension table are accessed randomly the way records of an ordinary in-memory table are. Hence pointerized foreign-key references don't apply to the fields of a clustered dimension table; batch references based on cursors are the suitable approach instead, and the code differs from the code for purely in-memory calculations.
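The following Python sketch shows why batching matters; it is not esProc code, and remote_price_lookup is a hypothetical stand-in for the network call to the node holding that dimension segment. Resolving the foreign keys of a whole batch of cursor records costs one round trip instead of one per record.

def remote_price_lookup(product_ids):
    # Hypothetical stand-in for one network round trip to the node holding
    # this dimension segment; returns {productid: price} for a whole batch.
    dim_segment = {1: 9.9, 2: 19.9, 3: 4.5}
    return {pid: dim_segment[pid] for pid in product_ids}

def resolve_batch(batch):
    # one round trip resolves every foreign key in the batch
    prices = remote_price_lookup({pid for pid, _ in batch})
    return sum(qty * prices[pid] for pid, qty in batch)

def total_amount(sales_cursor, batch_size=1000):
    total, batch = 0.0, []
    for record in sales_cursor:          # record = (productid, quantity)
        batch.append(record)
        if len(batch) == batch_size:
            total += resolve_batch(batch)
            batch = []
    if batch:
        total += resolve_batch(batch)
    return total

print(total_amount(iter([(1, 2), (2, 1), (3, 4)])))   # approximately 57.7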