esProc Server

Market positioning and concept

esProc Server is a big-data computing engine built on esProc.

The fundamental technical challenge of big-data computing is achieving high performance – in short, computing faster. Common solutions include in-memory computing, multithreaded parallel processing and distributed clusters, as well as techniques such as indexing and exploiting data orderliness. esProc Server provides class libraries of these algorithms for processing structured data.
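To make "exploiting data orderliness" concrete, here is a generic Python sketch (not esProc code) of two classic order-based techniques: binary search over a key-sorted column instead of a full scan, and a single-pass ordered merge join of two key-sorted tables. All names and data are hypothetical.

```python
import bisect

# When a column is stored sorted, a lookup can use binary search
# (O(log n)) instead of a full scan (O(n)).
ids = [3, 7, 12, 25, 40, 56]  # sorted key column

def lookup(ids, key):
    """Return the position of key in the sorted column, or -1."""
    i = bisect.bisect_left(ids, key)  # binary search over ordered data
    return i if i < len(ids) and ids[i] == key else -1

def merge_join(left, right):
    """Single-pass join of two key-sorted lists of (key, value) rows."""
    out, i, j = [], 0, 0
    while i < len(left) and j < len(right):
        if left[i][0] < right[j][0]:
            i += 1
        elif left[i][0] > right[j][0]:
            j += 1
        else:
            out.append((left[i][0], left[i][1], right[j][1]))
            i += 1
            j += 1
    return out
```

The merge join never builds a hash table and reads each input only once, which is exactly the kind of saving that ordered storage makes possible.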

esProc Server is one of many products developed for big data. It is genuinely better at handling some big-data problems, but not all of them, because it is not intended to be all-powerful software. Its areas of expertise are:

Online report query

Relatively speaking, these scenarios involve a small volume of data, simple computations and few steps. The query, often with many concurrent requests, usually requires a response time measured in seconds. The bigger the data, the harder it is to achieve such an instant response with the conventional database computing mechanism alone.

Usually, online report queries are optimized by adopting in-memory computing.
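The in-memory pattern behind fast report queries can be sketched in a few lines of generic Python (again, not esProc code; the data and names are hypothetical): load the dataset into RAM once at startup, pre-build an index on the frequently filtered column, then answer each request from memory.

```python
# Hypothetical dataset, loaded once at startup instead of per query.
orders = [
    {"id": 1, "region": "east", "amount": 120.0},
    {"id": 2, "region": "west", "amount": 75.5},
    {"id": 3, "region": "east", "amount": 60.0},
]

# One-time cost: index rows by region so each query avoids a full scan.
by_region = {}
for row in orders:
    by_region.setdefault(row["region"], []).append(row)

def report(region):
    """Answer one report request purely from memory."""
    rows = by_region.get(region, [])
    return {"count": len(rows), "total": sum(r["amount"] for r in rows)}
```

Because every request touches only RAM and a pre-built index, response times stay low even under many concurrent queries.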

Offline data preparation

These scenarios involve a much greater volume of data, many external-memory computations, complex business logic and many computing steps, but generally no concurrent requests. For complex external-memory computations, it is difficult to write code that is both easy to develop and efficient to execute using SQL or stored procedures. If processing the large volume of data takes too long, the available time window becomes insufficient, which in turn delays the subsequent transaction processing.

Offline data preparation is often optimized through a parallel strategy that can be scaled out.
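The scale-out strategy follows a split-apply-combine shape, sketched below in generic Python (not esProc code). Threads are used here only for brevity; a real deployment would distribute the chunks across processes or cluster nodes, and each chunk would be read from external storage rather than built in memory.

```python
from concurrent.futures import ThreadPoolExecutor

def aggregate_chunk(chunk):
    """Partial group-by-sum over one chunk of (key, value) records."""
    partial = {}
    for key, value in chunk:
        partial[key] = partial.get(key, 0) + value
    return partial

def merge(partials):
    """Combine per-chunk partial results into the final totals."""
    totals = {}
    for partial in partials:
        for key, value in partial.items():
            totals[key] = totals.get(key, 0) + value
    return totals

# Hypothetical input split into independently processable chunks.
chunks = [[("a", 1), ("b", 2)], [("a", 3), ("c", 4)]]
with ThreadPoolExecutor() as pool:
    result = merge(pool.map(aggregate_chunk, chunks))
```

Because each chunk is aggregated independently, adding more workers (or machines) shortens the batch window roughly in proportion, which is the point of scaling out.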

esProc Server is a technological product. It provides class libraries and methods for processing big data in the form of a programming language, but no analytical models for specific industries; further development is needed to build applicable solutions. Nor does it offer algorithms for machine learning or data mining. Though certain algorithms written in it can be understood as a kind of data-mining application, it is far from a mining product.

esProc Server provides various class libraries (some methods even contradict each other and cannot be used together), but it leaves the design of algorithms to programmers, who tailor them to the computational task and the data characteristics in order to achieve optimum performance. The side effect is that the syntax is not fully transparent: programmers need a deep understanding of how the data is transformed, both in physical storage and during computation.

esProc Server is not a (cluster) computing framework; it focuses on providing class libraries. It is up to programmers to decide what application framework and computing procedure to use, because whatever the framework, the basic low-level computing elements are always indispensable. esProc Server has almost no cluster framework of its own: programmers are free to determine the computing tasks and data distribution for each cluster node, which brings higher performance but also more work that must be handled painstakingly. It is therefore better suited to standalone machines and to small and medium-sized clusters than to large-scale clusters, which require collective management. In this sense, esProc is a lightweight big-data solution.

Hadoop and SQL

We can't talk about big-data techniques without mentioning Hadoop. esProc Server is not based on the popular Hadoop system; it has its own parallel and cluster mechanisms.

Why does esProc Server shun Hadoop, which has so many merits? Because Hadoop has the following shortcomings that discourage its use.

Hadoop is open-source, free software, but it is also a massive, heavyweight big-data solution. Configuring it and making full use of its numerous functionalities requires expensive maintenance support. Its various product lines are a strength, but their interdependence increases the cost of maintenance. Hadoop targets large-scale clusters of hundreds or even thousands of computers, among which failures happen frequently, so considerable resources are invested in fault tolerance. That is necessary, but it also means Hadoop positions itself as a high-end rather than a popular product.

esProc Server is the fruit of an effort to create a lightweight product, applicable to both standalone machines and clusters, especially small and medium-sized clusters of several to a few dozen computers, and with little dependence on other products and technologies. Clusters of this size do not need much fault tolerance, but they do emphasize flexibility in order to achieve high performance.

Hadoop has a fixed framework to which programmers have to adapt themselves. This limits their flexibility and keeps them from writing code that fits the business logic and data characteristics. For example, we could probably manage redundancy in the HDFS file system by modifying its source code, since appropriate redundancy can effectively reduce network traffic (we will discuss this later). But this is not easy, and freely changing the source code complicates upgrades. Another example is MapReduce, which breaks a task into many subtasks to increase fault tolerance and cannot directly control their execution order, making it difficult to express many order-related algorithms.
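An example of such an order-related computation, sketched in generic Python (the data is hypothetical): the longest run of consecutively rising values, say the most consecutive days a stock price rose. A single ordered pass solves it trivially, whereas splitting the sequence into independently ordered MapReduce subtasks breaks runs at chunk boundaries and forces awkward boundary-stitching logic.

```python
def longest_rising_run(prices):
    """Length of the longest streak of strictly rising consecutive values."""
    best = run = 0
    for prev, cur in zip(prices, prices[1:]):
        run = run + 1 if cur > prev else 0  # the run survives only while rising
        best = max(best, run)
    return best

# e.g. [5, 6, 7, 3, 4, 5, 6, 2] -> the run 3, 4, 5, 6 gives length 3
```

The algorithm depends on seeing the records in order, one after another, which is exactly what a framework that freely reorders and partitions subtasks makes hard to express.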

esProc Server's approach is to provide class libraries for programmers, because all algorithms need the low-level, basic methods. Without a fixed framework, esProc Server lets programmers control the programming process through code, instead of merely filling in blanks as with MapReduce, so they can develop programs that fit the business logic and data characteristics.

Hadoop is a relatively closed, self-contained system. To use its computational strategies, data must be moved into it; it cannot handle data coming from a relational database or a network file system. esProc Server is a pure computing product and a more open system: it can process data stored in any form, including HDFS. And since it is written purely in Java, it can be integrated into Hadoop solutions.

SQL finds another opportunity in the big-data era thanks to MPP technology. Its transparent syntax relieves programmers from concern over the physical storage scheme, but this highly automatic design also deprives them of the chance to control and optimize computational details according to the characteristics of the data and the task. Lacking SQL-style transparency, esProc Server requires programmers to understand the storage scheme in order to write effective, efficient code; in return, it can fully tap the potential of the hardware by controlling each step of a local computation in light of the specific problem.

Many vendors provide their own SQL implementations, and some, such as those offering Hadoop solutions, do a great job. For problems that SQL handles smoothly (in terms of both difficulty and performance), there are already plenty of sophisticated solutions. As we all know, however, many problems remain hard to approach in SQL and generate a great amount of code. These big-data problems demand high performance and need support from parallel and cluster technologies; esProc is designed mainly to help deal with them.

Of course, it is no problem for esProc to connect to a SQL database and retrieve data for processing. esProc can therefore work alongside MPP SQL to enhance the performance of big-data processing.
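The division of labor can be sketched generically in Python (not esProc code): let SQL do the cheap set-oriented filtering, then finish the computation outside the database. An in-memory SQLite database stands in here for whatever MPP SQL source would actually be used; the table and data are hypothetical.

```python
import sqlite3

# Stand-in for an external SQL source (real systems would use a
# driver/JDBC connection to the MPP database instead).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (region TEXT, amount REAL)")
conn.executemany("INSERT INTO sales VALUES (?, ?)",
                 [("east", 100.0), ("west", 50.0), ("east", 30.0)])

# SQL does the retrieval; the remaining logic runs locally on the cursor.
rows = conn.execute("SELECT region, amount FROM sales ORDER BY region")
totals = {}
for region, amount in rows:
    totals[region] = totals.get(region, 0.0) + amount
conn.close()
```

Streaming over the cursor keeps memory bounded, and any logic too awkward to express in SQL can live in the local post-processing step.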