In Excel, performing a year-on-year comparison on a multilevel table requires pasting formulas manually, which can be a great deal of work when the table holds a large amount of data and many groups. If the aggregation is based on only part of the data, you have to rearrange that data manually in a new worksheet, which is inefficient and error-prone.
With esCalc, the desktop BI software, you only need to enter one formula to accomplish the whole year-on-year comparison; you can filter simply by deleting data, without manual rearrangement; and you can sort directly on grouping rows, moving each group as a whole. The esCalc operation mode makes it very convenient to handle the kind of problem just mentioned, regardless of the data volume or the number of groups.
To build a distributed system, an esProc program can act as a server that receives requests from other esProc programs and returns the results. Its basic computing model is that a controlling node sends orders to the non-controlling nodes, then collects and aggregates their results. A complex task may consist of multiple sub-tasks.
The keys to this kind of distributed computing are the ability to scale out and fault tolerance when running on multiple nodes.
First let’s look at the simple shared-data-source strategy.
Data sharing means that the data to be processed is stored in one place, such as a database or a network file system, and that the nodes only handle the tasks assigned to them without holding the data themselves. As a result, the data source comes under heavy pressure from concurrent accesses, so this strategy is better suited to computation-intensive tasks than to data-intensive ones.
It’s simple to implement the shared-data-source strategy with esProc Server:
| | A | The controlling program |
|---|---|---|
| 1 | =4.("192.168.0."/(10+~)/":1234") | The list of 4 nodes |
| 2 | =callx("sub.dfx",to(8),8;A1) | Pass in the parameters to call the node program, corresponding to 8 sub-tasks |
| 3 | =A2.sum() | Aggregate the results |
| | A | Node program (sub.dfx) |
|---|---|---|
| 1 | =hdfsfile("hdfs://192.168.0.1/persons.txt") | An HDFS file |
| 2 | =A1.cursor@t(;seg:all) | A cursor over one segment of the file; seg and all are the parameters passed to the node program |
| 3 | =A2.select(gender=="M").groups(;count(1):C) | Select and count the male records |
| 4 | return A3.C | Return the result |
esProc provides solid support for many types of shared data sources, such as databases and HDFS.
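For instance, a node program could read from a shared database instead of an HDFS file. The following is a minimal sketch only: the data source name demo, the table persons, and its gender field are assumptions for illustration, and the segmentation parameters are omitted for simplicity.

| | A | Node program over a shared database (hypothetical sub_db.dfx) |
|---|---|---|
| 1 | =connect("demo") | Connect to the shared data source named demo (an assumed name) |
| 2 | =A1.cursor("select gender from persons") | A cursor over the shared persons table (an assumed table) |
| 3 | =A2.select(gender=="M").groups(;count(1):C) | Select and count the male records, as in the HDFS version |
| 4 | >A1.close() | Close the database connection after the cursor has been consumed |
| 5 | return A3.C | Return the result |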
esProc Server's distributed structure is centerless. This is unlike distributed architectures such as Hadoop, which provide a complete framework that transparently presents the whole cluster as a single machine. esProc has neither a framework nor a permanent central controlling node; by abandoning a fixed cluster structure, it lets programmers control the participating nodes through code.
In a centerless distributed structure, all nodes are equal and none is special. The advantage is that the failure of any single node won't stop the whole cluster from running, whereas a distributed structure with a center breaks down once the central node fails.
Strictly speaking, an esProc cluster isn't completely devoid of centers. Though the cluster itself is centerless, each task has its own controlling node that temporarily summons other nodes to take part in the computation. If the controlling node collapses, that task fails, but the cluster as a whole can still handle other tasks.
This difference from other distributed systems is another embodiment of esProc's design concept: emphasizing class libraries while avoiding a fixed framework.
esProc Server is capable of balancing sub-tasks among nodes. It decides whether to give a node a sub-task according to how busy the node is (the number of running threads). If a node is saturated (the number of threads running on it has reached its allowed maximum), the controlling program waits until the node finishes at least one of its sub-tasks before distributing more. In this way a faster node receives more sub-tasks, balancing the workload against the available resources.
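This balancing takes effect when a task is split into more sub-tasks than the nodes can run at once. As a rough sketch based on the controlling program above (the sub-task count of 16 is an arbitrary number chosen for illustration), it is enough to raise the number of sub-tasks passed to callx; faster or idler nodes will then pick up more of them.

| | A | The controlling program with finer-grained sub-tasks |
|---|---|---|
| 1 | =4.("192.168.0."/(10+~)/":1234") | The same list of 4 nodes |
| 2 | =callx("sub.dfx",to(16),16;A1) | 16 sub-tasks spread over 4 nodes; a node gets a new sub-task only when it has a free thread |
| 3 | =A2.sum() | Aggregate the 16 partial results |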
If a node malfunctions during the process and can no longer continue its work, the controlling program reassigns that work to the healthy nodes. This lengthens the total computing time, but provides a degree of fault tolerance.
esProc Server's non-framework design allows the cluster to include machines with very different capabilities, such as different memory sizes, CPU configurations, or even operating systems. In a nutshell, esProc Server is open to any machine, which lets users exploit their existing hardware to the full. By contrast, many cluster solutions built on a specified framework require the nodes to be largely alike.
To attain better performance, we need to store data in a distributed fashion, particularly for data-intensive tasks, which have a high I/O cost. The shared-data-source strategy creates a serious throughput bottleneck for such tasks, while distributed data storage spreads the I/O load among the nodes.
In principle, the goal of distributed data storage is to break data apart and put it onto different nodes, enabling each node to access the data it needs locally and thus avoiding network transmission delays and contention for a shared source. Data distribution doesn't mean simply dividing the data (evenly) into N segments and placing them on N nodes: that kind of distribution is not fault-tolerant and may still incur a relatively large amount of network transmission from join operations.
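To show the local-access idea in its simplest form, assume each node keeps its own segment of the data in a local text file (the path /data/persons_seg.txt below is hypothetical). The earlier node program can then read locally instead of from the shared HDFS source, while the controlling program stays the same.

| | A | Node program reading a local segment (sketch) |
|---|---|---|
| 1 | =file("/data/persons_seg.txt") | The node's local data segment (an assumed path) |
| 2 | =A1.cursor@t() | A cursor over the local file; no shared source is accessed |
| 3 | =A2.select(gender=="M").groups(;count(1):C) | Select and count the male records as before |
| 4 | return A3.C | Return the result |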
Unlike the common network file system, esProc Server provides an opaque data distribution strategy, which requires programmers to decide how data should be distributed.