esProc: An Efficient Instrument for Tackling Computational Problems

esProc

Design goals

esProc is a programming language specially designed for handling (semi)structured data. It adopts a new and different data model and computational model, and provides a wide variety of basic algorithms, making computational problem-solving both easier and faster.
A non-object-oriented language, esProc doesn't have complex concepts such as inheritance and overloading; it uses the concept of object merely to facilitate object-related methods. Any programmer with basic programming skills, such as familiarity with BASIC, can learn it quickly and well. As an interpreted dynamic language written in Java, esProc can flexibly generate code at run time, simplifying program development even further.
esProc defines its niche as a tool for computing (semi)structured data, so it doesn't provide algorithms for directly performing statistical analysis, data mining or machine learning; nor is it an expert at processing media or GIS data.

The primary objectives of esProc are to streamline the translation of the logic of computation into code, and to enhance computational performance.
To carry out a computer-based computation, we need to precisely translate our thinking about the computational process into a formal language that the computer can understand and execute. But when the formal language renders the process in a way that is too far from our natural way of thinking, the translation can actually be harder than the computation itself. This is not only time-consuming, but error-prone.
Here's an example. You might want to count the longest consecutive run of rising trading days for a stock. The computational method is simple: sort the records by trading day and compare each record with the previous one, adding 1 to a counter when the stock price is up and resetting the counter to 0 when it is down; then take the maximum value of the counter.
Yet with SQL, the query language commonly used among database developers, it's really hard to code this logic. The SQL92 standard, which provides no window functions, is almost incapable of expressing it. The SQL2003 standard provides window functions which make the coding possible, but great effort is still required.
That's the first issue esProc tries to solve. esProc designs its syntax based on our natural way of thinking, clearing the way as much as possible for programmers to describe a computational process in the programming language.
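As a preview, here is how the rising-days count looks in esProc. This is a minimal sketch: the table name stock and its fields tradingday and price are hypothetical, and the group@i function used in A2 is explained later in this article.

A  
1 =db.query("select * from stock order by tradingday")  
2 =A1.group@i(price<price[-1]).max(~.len()) Start a new group whenever the price drops; the longest group is the longest consecutive rising run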

esProc is also designed to enhance the computational performance.
Software cannot fundamentally increase the computational capability of the hardware. It takes a fixed amount of time for the CPU to perform a hundred million basic operations, or to retrieve 100GB of data from the hard disk. Software optimizes computational performance through better algorithms that make the CPU perform fewer basic operations, or retrieve less data from the hard disk.
But this depends on the data model and algorithmic model a formal language chooses. Bad models are incapable of carrying high-efficiency algorithms.
Again take SQL as an example. To find the top 10 among several hundred million transaction records by amount, a SQL solution has to first sort all the records. The work progresses at a snail's pace, especially when the records can't all be loaded into memory and frequent external storage access is needed. Actually there are more efficient algorithms for finding the top 10: it is unnecessary to sort all the records, or to swap data out to external storage.
This is the second issue esProc tries to address. The language moves beyond the limitations of traditional models by adopting a new data model and algorithmic model, letting programmers realize the natural and efficient algorithms they come up with, rather than resorting to inefficient ones.

Related Technologies

Compared with high-level programming languages like Java, esProc provides abundant basic objects and methods for manipulating structured data, which is common in data analysis, data processing and data preparation. It thus produces much more concise code than Java in expressing the same algorithm, and offers much higher development efficiency than high-level languages.
For example, Java needs dozens, even nearly a hundred, lines of code to filter a data set, and more if a generic data type or a generic condition is involved; esProc gets it done with a one-liner.
esProc is integration-friendly with Java applications. Since it is developed in Java, the two are perfectly compatible. Designed to be integrated, it is open to invocation from a Java main program. It is particularly convenient to use esProc to prepare data sources for a Java reporting tool.

It's rather inconvenient to perform a step-by-step computation, particularly an order-related computation, in SQL. Normally data needs to be retrieved from the database and handled in Java or another language, because SQL is not completely set-oriented and lacks support for individual rows. esProc takes set orientation further and supports handling individual records, permitting more intuitive implementations of non-equi-grouping, reuse of grouped data, order-related computations, and multi-step computations.
Yet esProc cannot and doesn’t aim to replace SQL, despite its much simpler syntax in most scenarios.
Retrieving data out of the database causes plenty of I/O performance loss. For a simple computation over a large amount of data, retrieving the data takes much longer than performing the computation, so in such cases it's more appropriate to handle the data inside the database. Besides, SQL's metadata system makes for a more transparent syntax: programmers need not concern themselves with the physical storage scheme. esProc is a pure computing engine; it has no storage mechanism or semantic data model of its own and cannot provide a transparent syntax, so programmers must approach computations according to how the data is actually stored.
Choosing esProc doesn't mean abandoning SQL. Instead, esProc assists SQL in the computational scenarios where SQL has been weak, such as complex multi-step computations and computations involving heterogeneous databases.

SQL is the industry's only standard programming language specializing in structured data. However, SQL comes with the computational weaknesses mentioned above, as well as limitations due to its closed nature, like restrictions on handling local files. As a result, programmers often turn to scripting languages like Python (pandas) and R to deal with those types of scenarios.
The truth is that Python (pandas) and R are designed for mathematical statistics and analysis. Though equipped with the dataframe object, they are not specialized tools for processing structured data, and they provide no direct support for computations over external storage. esProc is intended for structured-data computing. It provides the table sequence object (a counterpart of, and a superset to, the dataframe) for in-memory data and the cursor object for external-storage data, makes multithreaded parallel processing convenient to code, and offers simple configuration and easy-to-use methods for handling multiple or heterogeneous data sources (xls, JSON, and MongoDB).
esProc is not good at mathematical statistics and analysis because it lacks the necessary class libraries.
Apart from analyses performed separately, structured-data computing often takes place within an application. Being integration-friendly with Java, esProc can be easily invoked by a Java main program. Python and R are none too friendly toward integration; it's almost impossible to write an algorithm in Python that can be invoked by Java.

Application Architecture

After a script (a .dfx file) is written and debugged in the IDE, it is interpreted and executed by the JDBC driver jars embedded in the Java application. Both the IDE and the JDBC driver can gain big-data-handling capability by invoking an esProc server cluster.

Independent operation
The excellent interactivity of esProc's Integrated Development Environment (IDE) makes it an ideal desktop tool for interactive analysis by anyone with programming basics.

Instead of the plain-text code used by ordinary scripting languages, an esProc script is written in a cellset, which has a number of benefits. You don't need to name temporary variables, but can use cell names such as A1 and B2 directly, as in Excel; the working range of a loop or branch statement is defined by intuitive indentation rather than braces or BEGIN/END; you can reduce the interface area, keeping the screen less occupied, by partially displaying a long statement in its cell; you can check cell tips by hovering the mouse pointer over the code; and multiple lines of code can be written in one row, making the script easy to read and its overall structure clearer. The cellset style also enables a more complete debugging function: step-by-step execution and inspection of intermediate results at any time during the process.

The key to interactive computing is conveniently showing and referencing intermediate results, determining the next step from the previous one. esProc's cellset-style coding naturally retains intermediate results in cells for viewing at any time, and lets you reference them directly without naming them, making step-by-step interactive computation extremely smooth. A scripting language, by contrast, normally performs interactive computations from the command line, which is far less efficient.
esProc supports a wide variety of data sources, including common databases as well as files stored in the local file system, such as TXT and XLS files. The final result can be viewed, or written back to the database or to a file.
esProc scripts can be executed in the IDE, or started from the command line by an external job scheduling tool. With its support for various data sources and its remarkable computing power, esProc can perform tasks like scheduled data manipulation (similar to ETL).

Serving as Java class library
As mentioned above, esProc is designed to be integrated.
esProc provides a standard JDBC interface, through which it can be conveniently embedded into a Java application and an esProc script invoked the way a stored procedure is invoked from a database. The invocation supports parameter passing and returns its result in the familiar form of a ResultSet, so it's cheap and easy for Java programmers to learn. The esProc RTL is provided as JARs and can be deployed and distributed with the Java application. The integration is completely seamless.
Since Java has no common class library for manipulating structured data, programmers have to hardcode the related computations, which is a very cumbersome process. esProc, outstandingly integration-friendly and with excellent computing power, can serve as a Java class library for batch processing of (semi)structured data. Sometimes no database is involved (for example, when text files are handled), so SQL's easy-to-use computing capability has no leverage. Other times there is a database, but the algorithm is difficult to code in SQL, and data has to be retrieved out of the database for the computation. In all these scenarios, esProc is useful in assisting Java with the computation.

Like a SQL statement, a single esProc statement can also be invoked. If the code is short, the statement can be written directly in esProc JDBC for execution, without creating a script file. This simplifies code management as well as increasing programming flexibility.
As an interpreted dynamic language, esProc is able to generate code at run time for execution, producing more flexible programs.
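For instance, here is a minimal sketch of run-time code generation with the ${} macro syntax that appears later in this article; the table T and the idea of an aggregate expression arriving at run time are assumptions for illustration.

A  
1 =db.query("select * from T")  
2 ="sum(amount)" An aggregate expression arriving at run time, e.g. from user input  
3 =A1.${A2} Macro substitution composes and executes A1.sum(amount)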

Preparing data source for reporting tools
As a special case of Java integration, reporting tools can integrate esProc code through JDBC to supply data sources to themselves.
Report development involves many complex, temporary computations. They are difficult to implement in the reporting tool because of their complexity; performed within the database, their temporariness causes unreasonable storage usage; and carried out in an intermediate Java program, they create tight coupling between the Java application and the code. By using esProc as special-purpose middleware to prepare the report data source, these computations can be detached and executed separately, making development much easier. Moreover, the esProc script can be managed together with the report template, which effectively reduces the complexity of application management.
esProc’s other abilities to further enhance report development performance will be covered later.

Manipulation of structured data

Set orientation

A set-oriented programming language provides a set data type, enabling it to directly support set operations.
Structured data almost always comes as a set of records. A programming language that only supports operations on individual records is theoretically complete, since a batch computation can be carried out by applying the logic to one record after another; but this is obviously rather cumbersome. High-level languages such as Java and C++ are not set-oriented: several lines of code are needed even for a simple sum over an array, not to mention complicated filtering and join operations.

Can this be solved by developing class libraries that support a set data type and the related computations?
Unfortunately, it's not as simple as that. The syntax of a high-level language like Java doesn't support defining functions with expressions. Suppose you'd like to create one of the basic operations, sum, on a set x, so you write an x.sum() function. Then you want to calculate the sum of squares, so you write an x.sum2() function. Now you begin to realize that this could become an endless creation of functions, because you might also want the sum of absolute values, the sum of units digits, and so on. The class library would soon be bursting at the seams with the endlessly growing number of functions.
To represent the various kinds of sums with only one method, that method must be able to compute according to a function that is expressed over the members of the set and passed to it as a parameter, such as squaring or taking the absolute value. To realize this mechanism, Java and C++ can pass function pointers, writing functions like square and absolute value in a predefined parameter form. But this is inconvenient.
SQL doesn't have this problem. Different sums can be written as sum(x), sum(x*x), sum(abs(x)), and so on, because the language allows defining an anonymous function with an expression and passing it to a method. This is lambda syntax, a term used in programming, and an essential part of a set-oriented language.
With Java, there is yet another alternative: writing the expression as a string for parameter passing, such as writing sum(x*x) as sum("x*x"). It is theoretically feasible, but it makes the code hard to write and read, and can cause confusion between strings and expressions.
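esProc takes the SQL approach. In the sketch below (the table T and its numeric field x are hypothetical), the different sums are simply different expressions passed to the same sum function:

A  
1 =db.query("select x from T")  
2 =A1.sum(x) Ordinary sum  
3 =A1.sum(x*x) Sum of squares: the expression itself acts as the lambda  
4 =A1.sum(abs(x)) Sum of absolute values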

A set-oriented language also needs to support dynamic data structures, since the structured data type is one of its basic data types and can be generated at any time during program execution. SQL, an interpreted language, supports the structured type well, whereas a compiled language like Java doesn't.
SQL has the basic features of a set-oriented language, and is thus the most widely used language for manipulating structured data.

esProc is another interpreted, set-oriented programming language. It supports lambda syntax and dynamic data structures, and provides a basic, well-stocked function library for handling structured data.
Below are common manipulations of structured data:

A
1 =file("D.csv").import@tc(name,sex,age)  
2 =A1.select(sex=="M"&&age>=25||sex=="F"&&age>=23) Filter
3 =A2.sort(name) Sort
4 =A2.groups(sex;avg(age):age) Generate a new data structure by grouping and aggregation
5 =A2.id(left(name,1)) Get distinct values

Performing set computations directly

A  
1 =file("T1.txt").import@ti(id)  
2 =file("T2.txt").import@ti(id)  
3 =A1^A2 Intersection: the values common to T1 and T2
4 =A1\A2 Difference: the values T1 has but T2 doesn't

Discreteness

Discreteness refers to the degree to which the members of a set can exist and participate in computations separately. For structured data, the concept particularly refers to the ability of records to be separated from their table, to be handled together with other records or with a different table, or to form a new set.
All high-level languages like Java have good discreteness, allowing an object instance to exist separately and be handled without clinging to a table.
But SQL lacks discreteness, forcing records to attach to a table for storage and computation. SQL's record retrieval is in fact record duplication, which generates a temporary, separate table that has nothing to do with the source table. SQL has no record data type, so a record can only be treated as a temporary one-record table.

Solving a complex computational problem in multiple steps effectively reduces the difficulty of implementation. This stepwise computational style usually requires discreteness, whose absence is an obstacle to expressing stepwise computations.
For example, you might want to calculate the age difference and salary difference between Tom and Harry in an employee table. The solution is simple: retrieve the two employees' records and subtract, which involves referencing each record repeatedly. Without the support of discrete records, SQL has two awkward ways of expressing this: performing multiple queries to obtain the two ages and salaries separately, or concatenating the two records into one before subtracting. Both produce bloated code.
esProc, however, inherits the discreteness of high-level languages represented by Java, and thus produces a natural and graceful piece of code:

A  
1 =employee.select@1(name=="Tom")  
2 =employee.select@1(name=="Harry")  
3 =A1.age-A2.age  
4 =A1.salary-A2.salary  

Before structured data became prevalent in the world of computation, most of the data processed by computers was numeric. By mathematical rules, the result of a numeric operation is a new number unrelated to the operands, meaning each operation generates a new piece of data. Numeric values are simple, so this seldom causes problems. Structured data, including records and objects, is different: records are a complex data type with their own properties. Duplicating them for every operation not only wastes resources and time, but makes it impossible to reference the originals for further handling. For read-only operations this merely consumes extra resources, but operations that need to modify the original records simply get stuck.
Record handling is quite common in real-world business. It usually involves complex scenarios, so simple examples are hard to find; the following one is just passable.
You might want to find, in an agent table, the agents whose sales amounts rank in the top 10% and give them a reward of 5% of the sales amount. As SQL can't express the top 10% with a simple WHERE clause, it first retrieves the primary key values of those agents' records in a subquery and then performs an UPDATE on the agent table according to the retrieved key values; this approach breaks out of the set-oriented style. The top 10% example illustrates that the filtering conditions in real cases are often not simple, requiring a multi-step filtering process that can't all be written into one WHERE clause.
esProc handles it in a natural and stepwise way:

A  
1 =agent.sort(amount:-1).to(agent.len()*0.1) Get the top 10%
2 =A1.run(amount=amount*1.05) Add the reward

A more important application of discreteness is creating record-referencing foreign keys.
SQL has no record data type for storing discrete records, so field values can't be records. SQL uses the foreign key to represent relationships between records, but the JOIN syntax becomes really complicated, error-prone and unintelligible when many tables are joined. If field values could be of record type, the record corresponding to a foreign key could be treated as one of the fields of the record being handled: a field that contains subfields. That is the same object reference mechanism adopted by high-level languages like Java, and it enables clear code for multi-level associations.
esProc supports discrete records and thus allows assigning records to fields, making record-referencing, foreign-key-based computations convenient:

A  
1 =file("D.csv").import@tc()  
2 =file("P.txt").import@t(id,area)  
3 =A1.switch(aid,A2:id) Create a record-referencing foreign key
4 =A3.select(aid.area=="Beijing") Perform filtering using fields of the records referenced by the foreign key

The reason behind SQL's lack of discreteness can be traced back through the history of computing. When SQL was born in the 1970s, computer RAM was very small and pointer-based references were costly to implement. Besides, businesses at the time were not so complex, generating little demand for the related technology. Over four decades later, the increased capability of hardware has made discreteness easily reachable and, more importantly, business demands have expanded hugely.
esProc is a programming language that integrates the merits of both SQL and Java: it is both set-oriented and discrete, greatly facilitating the manipulation of structured data.

Complete orientation towards sets

SQL only goes half way towards set orientation because a complete orientation needs the support of discreteness.

Grouped subsets
Apart from the data table, SQL has no other explicit set data type, so it cannot help but aggregate right after grouping. Yet besides aggregate values, you may also take an interest in the grouped subsets themselves. It's difficult to perform data handling directly on these subsets in SQL; subqueries are needed.
With a set data type and grouping functions that return subsets, esProc can easily handle post-grouping computations.
For example, you might want to find the records of all subjects for the students whose total scores are 500 or above. With SQL, you group records to calculate every student's total score, select the students whose totals are 500 or above, and then JOIN the resulting table with the original score table (or use an IN clause) to find the desired records, which is roundabout and retrieves the data repeatedly. In esProc, you can do this in a natural and straightforward way:

A  
1 =db.query("select * from R")  
2 =A1.group(student).select(~.sum(score)>=500).conj()  

Many scenarios require returning the records in the subsets after grouping. In those scenarios, the group and aggregate operations are intermediate steps toward completing a filtering, rather than the goal. A similar example appears below, in which data is sorted by group aggregate values for report development.

In some other cases, though only the aggregate values are desired, the aggregation is difficult to express with simple aggregate functions, so the grouped subsets need to be retained for the computation.
Such cases are not uncommon in real-world business. But as the computations often involve a lot of domain knowledge and are complicated, a convenient example is hard to cite. Here's a simplified one.
The user login table L has two fields: user (ID) and login (time). Problem: Calculate the last login time of each user and the number of logins within the 3 days before that time.
It's easy to find the last login time, but it's difficult to count the logins during the specified period without retaining the grouped subsets. The SQL algorithm goes like this: group records to find the last login times; JOIN with the source table to find the records within the specified period; then group and aggregate those records again. It's roundabout and inefficient. esProc, however, can retain the grouped subsets and do this stepwise:

A  
1 =db.query("select * from L")  
2 =A1.group(user;~.max(login):last,~.count(interval(login,last)<=3):num)  

In the code, ~ represents the subset obtained after data is grouped by user.
For ordered data, here’s a more efficient way to get this done:

A  
1 =db.query("select * from L order by login desc")  
2 =A1.group(user;~(1).login:last,~.pselect@n(interval(login,last)>3)-1:num)  

Order-related aggregation
Getting the top N records and finding the records corresponding to the maximum value are also common. Of course these computations can be handled on the retained grouped subsets, but as they are so common, esProc treats them as a kind of aggregation and provides special functions, handled in basically the same way as ordinary grouping and aggregation.
SQL lacks a set data type and discreteness, so it is unable to provide aggregate functions that return a set of records. Its way is to use subqueries or other cumbersome tools, which often results in large-scale data sorting and, in turn, performance loss.

Let's look at the simplest case. The user login table L has these fields: user, login (time), IP address, and so on. Problem: Find the record of each user's first login.
SQL can use a window function to generate sequence numbers after intra-group sorting and then retrieve all the records whose sequence number is 1. Since window functions can only be employed on a result set, a subquery and then a filter are needed, and the code becomes complicated. For databases that don't support window functions, it's even more difficult.
esProc provides the group@1 function to directly retrieve the first member of each group.

A  
1 =db.query("select * from L order by login")  
2 =A1.group@1(user)  

Log files of this type are common, and usually the data is already ordered by time, so esProc can get the first record directly without sorting. A cursor can be used if the amount of data is too big to be loaded entirely into memory, as in the sketch below.
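A minimal sketch of the cursor-based variant, assuming the log file L.txt is already ordered by user and then by login time; the create/insert pattern follows the examples later in this article.

  A B  
1 =file("L.txt").cursor()  
2 =create(user,login) The target result set  
3 for A1;user >A2.insert(0,A3(1).user,A3(1).login) Fetch one user's records per loop cycle; keep only the first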

The stock price table S has three fields: code, date and cp (closing price). Problem: Calculate the latest price rise for each stock, that is, the change between its last two trading days.
The calculation involves the records of the last two trading days. SQL would use two levels of window functions to perform the intra-group inter-row calculation and to retrieve the first row respectively; the coding is complicated. esProc provides topN-style aggregate functions to directly return the related records as aggregate values for further handling.

A  
1 =db.query("select * from S")  
2 =A1.groups(code;top(2,-date)) Get the records of the last two trading days
3 =A2.new(code,#2(1).cp-#2(2).cp:price-rises) Calculate the price rise

Instead of grouping records into subsets, esProc's aggregate functions accumulate as they traverse, achieving better performance. They can also work in cursor-based mode to handle big data that cannot be entirely loaded into memory.
If the data is already ordered, the records can be retrieved by their sequence numbers, which is more efficient:

A  
1 =db.query("select * from S order by date desc")  
2 =A1.groups(code;top(2,0)) Get the first two records directly
3 =A2.new(code,#2(1).cp-#2(2).cp:price-rises)  

Finding the records corresponding to the maximum value, and getting the first/last record, are special cases of topN-style aggregation.

Inverse grouping
Contrary to the group and aggregate operations, inverse grouping splits aggregated data into detail data. The operation is not common, but it is hard to handle in SQL. Here's an example.
An installment payments table has these fields: no, sum, start and periods. Problem: Split each loan sum into multiple payment records; the resulting table should contain the fields no, seq, date and amount, distributing the total payment evenly across the periods (one per month).
It's easy to aggregate detail data, but it's difficult to split data in SQL, which isn't completely set-oriented. To generate the details, SQL would perform a JOIN between the source table and a sequence-number table, or use a recursive query; both are roundabout. esProc handles this intuitively:

A
1 =db.query("select * from I")
2 =A1.news(periods;no,~:seq,after@m(start,~-1):date,sum/periods:amount)

Order-related calculations

Many structured-data computations are related to data order. This type of calculation has appeared above already, but it deserves a discussion of its own.
Order-related calculations matter because people naturally pay more attention to time-varying data than to static data. Calculating the link relative ratio, the year-on-year growth rate and moving averages are all order-related calculations.
Expressing an order-related calculation requires both completely set-oriented syntax and a high degree of discreteness. "Order" refers to the relative positions of the members in a set, a concept that only exists in a set-oriented language. Accessing members by relative position means retrieving a certain member, the member immediately before it, or the one immediately after it; such retrievals depend on a high degree of discreteness.
esProc has ordered sets whose members can be accessed by sequence number, with an easy-to-use syntax defined for the purpose.

Inter-row reference
Early SQL didn't directly support inter-row references; it would first generate sequence numbers and then perform a JOIN, making the code excessively difficult and bloated. With the later introduction of window functions, SQL can reference data in other rows more easily, but the code is still far from concise (bulky, we could say), particularly when multiple items from other rows need to be referenced. Additionally, as mentioned above, window functions must be applied to a result set, so a subquery is needed to reference their results; the code remains as cumbersome as ever.
MySQL (before version 8.0) doesn't support window functions, but it can make backward references using variables in SQL statements. It cannot make forward references, though.
esProc provides a natural and easy-to-use syntax for inter-row references.

The monthly product sales table S has 3 fields: prod, month and sales. Problem: Find the records where sales grew by more than 10% from the previous month.

A  
1 =db.query("select * from S order by prod,month")  
2 =A1.select(if(prod==prod[-1],sales/sales[-1])>1.1)  

After sorting, esProc uses [-1] to reference the previous month's data and performs the filtering directly with an inter-row calculation expression. In contrast, SQL window functions require a subquery, and MySQL needs to define two temporary variables.

Based on the same table, you might also want to calculate, for each month, the moving average over the month and its previous and next months:

A  
1 =db.query("select * from S order by prod,month")  
2 =A1.derive(if(prod==prod[-1]&&prod==prod[1],sales{-1:1}.avg()):moving-avg)  

This calculation involves forward references and references to a set. esProc uses [1] to reference the next record's data and {-1:1} to reference the set of field values of the two neighboring records. SQL window functions still need to find the related records with a subquery before calculating the moving averages. MySQL cannot handle the computation directly either, because its variables cannot reference forward.

Here's another example. The simplified event table E has these fields: seq, time, and so on. The time should increase along with the sequence number, but errors may exist, and you need to find the records where the time and the sequence number are out of sync.

A  
1 =db.query("select * from E order by seq")  
2 =A1.select(time!=max(time{:0})||time!=min(time{0:})) Compare each record with all records before and after it

esProc can get the set from the beginning up to a certain point, or from a certain point to the end. SQL window functions can express similar operations, but the data needs to be sorted in both directions for the two comparisons, using subqueries.

Order-related grouping
SQL only supports order-unrelated equi-grouping. Sometimes the grouping condition isn't a value that can be compared directly against each record; instead, the grouping depends on the order of the records. In that case, SQL needs window functions (or other, more inconvenient tools) to generate sequence numbers first.
esProc offers ready-made functions for order-related grouping, making the computations involving continuous intervals more convenient.

The income & expense table B has three fields – month, income and expense. Problem: Find the records where a deficit exists for 3 or more continuous months.

A  
1 =db.query("select * from B order by month")  
2 =A1.group@o(income>expense).select(~.income<~.expense && ~.len()>=3).conj()  

The group@o function compares only adjacent records and creates a new group whenever the value changes. By comparing income and expense between adjacent records, the records are divided into alternating profitable and unprofitable groups; the unprofitable groups with at least 3 members are then selected and concatenated.

You might also want to find, in the same table, the maximum number of months during which income increased continuously. The algorithm can be this: when the income increases, put the current record into the same group as the previous one; when it decreases, start a new group; finally, take the maximum group size among the continuously increasing groups.

A  
1 =db.query("select * from B order by month")  
2 =A1.group@i(income<income[-1]).max(~.len())  

The group@i function creates a new group whenever the grouping condition holds, that is, whenever the income decreases.
SQL can handle both scenarios with window functions, but the code is very hard to understand.

Merging intervals is another common type of order-related grouping computation. The event interval table T has the fields S (start time) and E (end time). Problem: Find the real length of time the whole event takes by removing the overlaps between these intervals.

  A  
1 $select S,E from T order by S  
2 =A1.select(E>max(E{:-1})) Remove the records whose time interval is completely covered by a previous one
3 =A2.run(max(S,E[-1]):S) Remove the overlapping parts of the time intervals
4 =A2.sum(interval@s(max(S,E[-1]),E)) Calculate the total length of time
5 =A2.run(if(S<E[-1],S[-1],S):S).group@o(S;~.m(-1).E:E) Merge the overlapping time intervals by start time

These are solutions to different variants of the problem, making full use of inter-row calculation and order-related grouping. SQL cannot handle such scenarios with window functions alone; it needs recursive queries, which are extremely difficult to understand.

Position-based operations
Sometimes you want to use sequence numbers to directly access the members of an ordered set. SQL, based on the mathematical concept of unordered sets, must first generate sequence numbers and filter on them before accessing the members at the specified positions, which complicates many computations.
esProc, however, is built on the concept of ordered sets, allowing members to be accessed directly by sequence number and bringing great convenience.
For example, the median of various prices is often wanted when analyzing economic statistics:

A  
1 =db.query@i("select price from T order by price")  
2 =A1([(A1.len()+1)\2,A1.len()\2+1]).avg()  

Sequence numbers can be used in data grouping as well. The event table E has three fields: no, time and act; the act field takes two values, start and end. Problem: Calculate the total length of time the events take, that is, the sum of the time periods defined by each pair of start and end.

A  
1 =db.query@i("select time from E order by time")  
2 =A1.group((#-1)\2).sum(interval@s(~(1),~(2)))  

# represents the sequence number of a record. group((#-1)\2) places every two consecutive records in one group; the length of time is then calculated for each group and summed.

You can also make inter-row references between neighboring records through sequence numbers. The stock price table S has two fields: date and cp (closing price). Problem: List the trading days when the stock price was above 100, and calculate the price rise on each of those days.

A  
1 =db.query("select * from S order by date")  
2 =A1.pselect@a(cp>100).select(~>1)  
3 =A2.new(A1(~).date:date,A1(~).cp-A1(~-1).cp:price-rises)  

The pselect function returns the sequence numbers of the members satisfying the specified condition. With them, the rises can be calculated directly, without computing the rises for all days and then filtering, as window functions would require.

Sometimes the grouping result is expected to cover continuous ranges, so missing empty subsets need to be supplied. This is troublesome in SQL: you must first create a set of continuous ranges manually and then LEFT JOIN it with the table being processed, which requires a subquery. esProc can use sequence numbers to realize this alignment grouping.
Here's a simple transaction record table T with no, date and amount fields. Problem: Accumulate the transaction amounts week by week; weeks without transaction records must also be displayed.

A  
1 =db.query("select * from T order by date")  
2 >start=A1(1).date  
3 =interval(start,A1.m(-1).date)\7+1 Calculate the total number of weeks
4 =A1.align@a(A3,interval(start,date)\7) Group records by week; empty sets may exist
5 =A4.new(#:week,acc[-1]+~.sum(amount):acc) Aggregate the weekly amount and calculate the cumulative amount

Cursor technique

The cursor technique is used to handle a huge amount of data that can’t be entirely loaded into the memory.
The cursor mechanism is simple: open a big data source with a cursor; retrieve a segment of data into memory in order, process it, and then release it; then retrieve and handle the next segment. The process continues until all data has been retrieved and processed.
The cursor moves in a forward-only direction, traversing all the data once. Most computations over big data can be handled with a cursor, such as sums and counts, group and aggregate operations that return small result sets, data sorting, and group and aggregate operations that return big result sets. A caching mechanism is needed to store intermediate results during cursor processing.

Stepwise cursor processing
The esProc cursor is distinguished by its computational capability. An operation on an esProc cursor returns a result that is, most of the time, still a cursor, and this cursor can be manipulated further.

  A  
1 =file("Products.txt").import().primary@i(id) Import the Product table and establish an index
2 =file("Sales.txt").cursor() Create a cursor for traversal
3 =A2.select(quantity<=10) Filter the cursor; a cursor is returned too
4 =A3.switch(productid,A1:id) Generate a pointer-based join, with a cursor returned
5 =A4.groups(;sum(quantity*productid.price)) Perform aggregation to calculate the sum

A2 creates a new cursor; A3 filters data on the cursor; A4 then creates pointers for a table join; finally, A5 calculates the aggregate value. The data traversal doesn't begin until A5: steps A2, A3 and A4 simply record the cursor processing rules without actually executing anything. The cursor thus appears to be handled step by step, making the code easy to write and understand.
All SQL databases support cursors, but those cursors can only fetch data: no other operations are encapsulated and stepwise processing isn't supported, so handling complicated computations with them is rather inconvenient.

Program cursor
It's convenient to handle a cursor step by step, but each step must be expressed by a function, and at times a step is too complicated to express in one function. Expressing it as a piece of code makes it easier to understand and maintain.
esProc provides the program cursor mechanism to define a cursor through a piece of program.

The subroutine sub.dfx:

  A B
1 =file("Sales.txt").cursor()  
2 for A1,1000 Retrieve 1000 records each time for handling
3   … Handle the current batch
4   return … Return the data handling result

The subroutine imports data segment by segment, performs complicated computations and returns the result. The number of records returned is unrelated to the number of records imported.

The main program:

  A  
1 =cursor("sub.dfx") Call the subroutine to define a cursor
2 =A1.fetch(100) Perform a cursor-based operation

A cursor defined on a subroutine is treated by the main program as an ordinary cursor. The cursor function caches the results returned by the subroutine and hands them to the main program on request. When the cached data doesn't meet a request, the cursor function resumes the subroutine's execution; afterwards the subroutine is suspended again until the next request arrives. The subroutine isn't exited and released until all data has been fetched from the cursor or the main program closes it. Through the program cursor, stepwise cursor processing can be partially realized.

Ordered cursors
User behavior analysis is common. It has these characteristics: the computation for each user is complex, but operations across users are rare. Examples include calculating each user's growth rate of online duration over the past weeks, and finding the interest and balance of a home loan account. Usually one user's data is small enough to be handled in memory, but the data of all users may be too big to load all at once. An ideal mechanism retrieves one user's data at a time, handles it, then moves on to the next, and so forth.
Order-related grouping also applies to the cursor. esProc allows retrieving one group of records from an ordered cursor at a time, enabling a convenient way to handle the above computations.

  A B  
1 =file("userlogs.txt").cursor() The source file, where data is sorted by user id
2 for A1;id Fetch data from the file cursor in batches, each batch holding the records with the same id
3   … Handle the current batch

The cursor can be used this way on the condition that the source data is already ordered, no matter whether it comes from a database or a file.
More examples of this kind appear below, in the discussion of parsing text files into structured data.

Dynamic column handling

A thorough set orientation should also include column-based set operations.
As an interpreted language, SQL can improvise dynamic data structures. But SQL regards columns as static attributes of the data and provides no column-based set operations. This becomes a headache in scenarios where the data wanted as one column isn't supplied as one, or where many columns need to be handled in a uniform way.

Inter-column aggregation
PE is a table recording the results of physical education tests. It has these fields: name, 100m, 1000m, long-jump, high-jump, and so on. There are four grades (A, B, C and D) for evaluating the results. Problem: Calculate the number of people in each grade for every event.
The algorithm is simple: union the results of all the events, then group and aggregate them. SQL needs a long UNION statement to combine the results of all the events, which is really tedious. It gets complicated if the columns are indefinite, because you must obtain the column names dynamically from the database to perform the union.
esProc supports column-based set operations. The fully dynamic syntax makes the code simple and easy:

A  
1 =db.query("select * from PE")  
2 =A1.conj(~.array().to(2,)) Concatenate the results for every event from the second field
3 =A2.groups(~:grade;count(1):count) Grouping and aggregation

Standard approach for transpositions
For simple static transpositions, some databases supply PIVOT and UNPIVOT statements; databases that don't support them can manage with complicated conditional expressions and UNION statements. But usually the columns transposed from rows are dynamic. To handle this in SQL, you need to generate the target columns and rows first and then dynamically compose another statement for execution. The code is complicated and difficult to understand.

The student scores table R consists of these fields: student, semester, math, English, science, pe, art, and so on. Problem: Perform both a row-to-column and a column-to-row transposition to present the data in this structure: student, subject, semester1, semester2, and so on.
esProc offers the pivot function to perform the simple transposition:

A
1 =db.query("select * from R")
2 =A1.pivot@r(student,semester;subject,score)
3 =A2.pivot(student,subject;semester,score)

To achieve the two-way transposition, A2 performs column-to-row transposition and A3 performs row-to-column transposition.

There is also a standard method, which is easier to understand though slightly longer:

A B C
1 =db.query("select * from R order by student,semester")
2 =create(student,subject,${A1.id(semester).string()})
3 for A1.group(student) for 3,A1.fno() =A3.field(B3)
4     >A2.record(A3.student|A1.fname(B3)|C3)
5 return A2    

A2 generates the target result set using a macro. The nested loops in rows 3 and 4 transpose rows and columns and insert the results into the result set, which is the standard procedure for performing transpositions in esProc. The stepwise approach keeps the code clear and easy to understand. For a static or one-way transposition, the code is even simpler. esProc's column access scheme and the flexibility characteristic of a dynamic language enable programmers to handle all types of transpositions (static or dynamic, row-to-column, column-to-row, or two-way) with this one standard approach.

Extended transpositions
Here’s the account state table T:

seq account state date
1   A       over  2014-1-4
2   A       OK    2014-1-8
3   A       lost  2014-3-21
…

Problem: Export the status of each account per day for a specified month; if an account has no record on a certain date, use the status of the previous date:

account 1 2 3 4    5    6    7    8  9  … 31
A             over over over over OK OK … OK
…

Strictly speaking, this transposition is static. But it involves a large number of regular columns and is not easy to express completely statically. It also involves inter-column calculations, which are hard to code in SQL even with PIVOT statements.
The job is easily done in the standard esProc way:

A B
1 =db.query("select * from T where year(date)=? and month(date)=?",2014,1)
2 =create(account,${to(31).string()})
3 for A1.group(account) =31.(null)
4   >A3.run(B3(day(date))=state)
5   >B3.run(~=ifn(~,~[-1]))
6   >A2.record(A3.account|B3)
7 return A2  

There's only one loop here because this is a one-way transposition. In B3-B5, computing the data to be inserted into the result set is a little more complicated, but the overall procedure is the same.

Quasi-structured-data processing

In many cases the source data comes not from a database but from text, JSON or XML files. SQL's closed nature limits it to manipulating database data, so such not-quite-structured data must be imported into the database before it can be handled. Moreover, since quasi-structured data lacks the strong data typing a database requires, the import is usually accompanied by complicated data arrangement, which is itself a computation.
esProc is an open computing engine supporting a wide variety of data sources, so it is able to process quasi-structured data.

Besides databases, text files are about the commonest form of data storage, so computations over text files are very important. Text files themselves possess no computational power; programmers need to write code in some programming language to handle them. The problem is that few programming languages able to handle text files are set-oriented, making the coding of batch computations highly cumbersome. Certain scripting languages, such as Perl, Python and R, have made some improvements in this respect, but they still don't adequately support batch processing of structured data, and they are integration-unfriendly as well.
JSON and XML are two types of multi-level structured data. They are hard even to express with a programming language that is set-based but lacks discreteness, not to mention to compute on.
esProc is expert at handling those types of data.

Text-file parsing
Not all text files hold standard structured data. Files that are not standard must first be parsed before being processed, which requires a programming language with set-based string handling capability.

The data items in each line of the text file T.txt are separated from each other by an indefinite number of spaces:
2010-8-13 991003 3166.63 3332.57  3166.63  3295.11
2010-8-10 991003 3116.31 3182.66  3084.2   3140.2
……
Problem: Make a list of the averages of the last four items in each line. esProc needs a mere one-liner to do this:

A
1 =file("T.txt").read@n().(~.array@tp("").to(-4).avg())

read@n reads the text file as a set of strings. array@t("") splits each string into a set of substrings by the indefinite number of spaces; the @p option parses each substring into the corresponding data type for the subsequent average calculation.

Here's a comma-separated text file T.csv. Problem: For each line having at least 8 data items, write the first 8 items to another text file R.txt, separated with "|" (the separator some bank file systems use):

A
1 =file("T.csv").read@n().(~.array(",")).select(~.len()>=8)
2 >file("R.txt").write(A1.(~.to(8).string("|")))

The string() function concatenates members of a set into a string using a specified separator.

The text file T.txt holds a set of strings as shown below. Problem: Divide the file into several parts according to the state name (such as LA) that precedes the characters US, and write them to different files.
COOP:166657,'NEW IBERIA AIRPORT ACADIANA REGIONAL LA US',200001,177,553
……

A
1 =file("T.txt").read@n()
2 =A1.group(mid(~,pos(~," US'")-2,2):state;~:data)
3 >A2.run(file(state+".txt").export(data))

esProc also supports regular expressions for more complex parsing tasks. But since regular expressions are difficult to use and perform poorly, the conventional approaches are generally recommended.

Parsing text files into structured data
A common file-handling task is to transform non-standard text data into standard structured data, and then (sometimes) write it to the database. Text files with special formats can be easily converted into structured data in esProc, which offers mechanisms for handling order-related computations.

In the log file S.log, every 3 lines constitute a complete piece of information. Problem: Parse the file into structured data and write it to T.txt:

  A B
1 =file("S.log").read@n()  
2 =create(…) Create a target result set
3 for A1.group((#-1)\3) =… Group data every 3 lines by the line numbers; parse field values from A3 (the current 3 lines)
4   >A2.insert(…) Insert the parsing result into the target result set
5 >file("T.txt").export(A2) Write the result set to the text file

Since esProc can group data by line numbers, you can run a loop to process one group of data each time, making the computation simpler.

Of course there is also the special, simpler case where a single line holds one complete piece of information. And when the file is too big to be loaded into memory, the same job can be done with the cursor:

  A B
1 =file("S.log").cursor@si() Create a cursor and read the file as a stream
2 =file("T.txt") Create the resulting file
3 for A1,3 =… Loop, importing 3 lines at a time; parse field values from A3 (the current 3 lines)
4   >A2.export@a(…) Append the parsed values to the resulting file

A skilled user can optimize the code to achieve a better performance by writing the parsed records in batches.

If every complete piece of information in the log file S.log starts with "---start---" but contains an indefinite number of lines, you just need to change A3 as follows:

3 for A1.group@i(~=="---start---") Create a new group at every "---start---"

Similarly, you can deal with a big file of this type with the cursor, and A3 will be like this:

3 for A1;~=="---start---":0 Start another loop cycle at every "---start---"

Another indefinite-number-of-lines scenario is that every line of one piece of information begins with the same characters (for example, the user ID the log information belongs to). When the starting characters change, a new piece of information begins. To handle this, you can slightly modify A3:

3 for A1.group@o(left(~,6)) Create a new group when the first 6 characters change
3 for A1;left(~,6) Start another loop cycle when the first 6 characters change

You can also use the cursor to handle the big file by altering the code of the preceding subsection.

Searching and Summary
Find the files under a directory that contain certain words, and list the lines where the words appear together with their line numbers:

A  
1 =directory@p("*.txt")  
2 =A1.conj(file(~).read@n().(if(pos(~,"xxx"),[A1.~,#,~].string())).select(~))  

grep is a frequently used Unix command. But some operating systems don't support it, and it isn't easy to implement in a program. esProc provides directory traversal and, combined with its file handling capability, can do the job with only two lines of code.

Now list all the different words the text file T.txt contains and count the occurrences of each, ignoring case:

A  
1 =lower(file("T.txt").read()).words().groups(~:word;count(1):count)  

WordCount is a famous programming exercise. esProc's words() function splits a string into separate words, so a single line of code completes the operation.

List all the words in the text file T.txt that contain the letters a, b and c, ignoring case:

A  
1 =lower(file("T.txt").read()).words().select(~.array("").pos(["a","b","c"]))  

Because these letters appear in different orders in different words, you cannot determine whether a word qualifies through substring searching. Instead, use array("") to break a string into a set of single characters and then check whether the set contains these letters. With its support for set operations, esProc gets this done with a one-liner.
To handle big files, simply modify these operations to retrieve the data in segments or with a cursor.

JSON/XML data
Java has a sufficient number of class libraries for parsing and generating JSON data, but lacks the capability for further computations on it. esProc supports multi-level data and can parse JSON into computable in-memory data for further processing without compromising its integrity.
Here’s a piece of JSON data in a certain format:
{
"order":[
{
"client":"Beijing Raqsoft Inc.",
"date":"2015-6-23",
"item":[
{
"product":"HP laptop",
"number":4,
"price":3200
},
{
"product":"DELL server",
"number":1,
"price":22100
}]
},…]
}
The data needs to be written to two database tables: order, which has three fields (orderid, client and date), and orderdetail, which has five fields (orderid, seq, product, number and price). The orderid and seq values are generated in sequence.

A  
1 =file(“data.json”).read().import@j().order  
2 =A1.new(#:orderid,client,date)  
3 =A1.news(item;A1.#:orderid,#:seq,product,number,price)  
4 >db.update@i(A2,order)  
5 >db.update@i(A3,orderdetail)  

esProc parses the multi-level JSON strings into a multi-level data set, in which the values of the item field themselves form nested tables – which is what A3 expands.
Besides parsing, esProc can also generate multi-level JSON strings from a multi-level data set.

Big data technology

Different implementations

SQL
The fundamental technical challenge with big-data computing is to achieve high performance – in short, how to compute faster. The computational targets include filtering, grouping, sorting, and joins, which don’t require too much mathematical knowledge to understand.
SQL syntax is now dominant in describing structured-data computations; not only database vendors but many big data solutions have been endeavoring to implement SQL. But, as the top 10 example cited at the beginning of the article shows, SQL, whose mathematical basis is relational algebra, hinders computing performance on certain occasions.

For the simplest operations, such as the commonly used GROUP+SUM, there is little room for optimization by reducing CPU consumption or hard disk access frequency: the amount of work the basic actions require is fixed, and so is the time they take.
Now and then a big data solution emerges claiming to be dozens of, or even nearly a hundred, times faster than traditional databases, without mentioning the specific computational targets and contexts. This is potentially misleading. Compare Oracle’s row-wise storage with any columnar storage, or Hadoop’s external-storage computing with any in-memory computing, and the latter will always be dozens of times faster – a foregone result that isn’t worth bragging about. With the same hardware and the same data storage, no solution can speed up a simple operation exponentially; 2-3 times is the most you can expect.

But there’s plenty of opportunity to optimize operations that are not that simple. An example already cited is finding the top 10 of a hundred million transaction records by sales amount. esProc does the job differently: being thoroughly set-oriented, it treats top N as an aggregate operation like SUM and COUNT. All of them follow the same aggregation procedure; the result is a single value for SUM and COUNT but a set for top N. Only one traversal of the data is needed, and far fewer records are ever sorted, enhancing performance greatly.
Relational algebra-based SQL can’t express top N as an aggregation, so you can only hope the database vendors perform more optimization. The following sections will show more SQL problems of a similar kind.
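The single-traversal idea is easy to sketch outside esProc. Below is a minimal Java illustration (not esProc’s actual implementation): a min-heap of size 10 is maintained during one scan, so no full sort and no swapping to external storage is needed:

import java.util.PriorityQueue;

public class TopN {
    // Returns the top n values found in a single pass over amounts.
    static double[] top(double[] amounts, int n) {
        PriorityQueue<Double> heap = new PriorityQueue<>(n);  // min-heap of the current top n
        for (double a : amounts) {
            if (heap.size() < n) heap.add(a);
            else if (a > heap.peek()) {   // larger than the smallest kept value
                heap.poll();              // evict the current smallest
                heap.add(a);
            }
        }
        return heap.stream().mapToDouble(Double::doubleValue).toArray();
    }
}

Only on the order of N log n comparisons are performed over the hundred million records, and the working memory is a ten-element heap.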

The top N example sheds a lot of light on the direction esProc takes in coping with big data. esProc has specialized syntax and a wide, rich variety of data objects and basic operations, enabling programmers to design the best possible algorithm – one that makes the most of the hardware and performs optimally for the characteristics of the computational target and the data source. With esProc there is no second-best compromise, as in SQL, where the most efficient algorithm is given up because of the limitations of the syntax and the mathematical model.
The esProc way, however, does have its price: the relatively non-transparent syntax. Programmers need a deep understanding of the physical storage structure and of how data is transformed during a computation. SQL’s transparent syntax relieves programmers of concern about the physical storage scheme and the execution path. It’s a double-edged sword, whose pros and cons should be weighed against specific needs.

Many vendors already provide sophisticated SQL implementations, and a great number of mature solutions exist for problems that SQL solves smoothly. So we decided to find our niche elsewhere.
Beyond the problems mentioned above that SQL can’t handle with high performance, there are many operations that can’t be performed in the language at all, and for which programs must be written specifically. These often involve big data, so performance becomes an issue. It’s this high-performance big-data computing that esProc is intended for.
Of course esProc can connect to a SQL database and retrieve data for processing and work with a SQL solution to enhance the performance of big data processing.

Hadoop
You can’t talk about big data techniques without mentioning Hadoop. But esProc isn’t based on the popular Hadoop system. Instead, it provides its own cluster computing mechanism.
Why does esProc shun Hadoop, with all its merits? Because Hadoop has the following shortcomings.

Hadoop is free, open-source software, but it is also a massive, heavyweight big data solution. It’s expensive to configure, to exploit fully, and to get maintenance support for. In fact, Hadoop products are aimed at the high end rather than at small and medium-sized enterprises.
Hadoop targets large-scale clusters of hundreds or even thousands of computers. At such a scale, nodes are prone to fail within the execution cycle of a task, generally a few hours, so it makes great sense for Hadoop to invest heavily in powerful fault-tolerance capability. It’s also impractical to manage and configure nodes individually in so massive a cluster, since that would require an enormous workload; Hadoop has to adopt a centralized, automated management and control mechanism.
esProc is a lightweight solution, applicable to small and medium-sized clusters of several to a few dozen computers, as well as to standalone machines. Clusters of those sizes don’t require extreme fault tolerance, but esProc still ensures that the whole cluster stays operable when one or more nodes malfunction. And though the chance of a node failing within a several-hour execution cycle is slim, a subtask will fail if the node carrying it out breaks down.
In smaller clusters it is feasible to configure nodes individually, making more efficient use of the hardware resources with a bearable amount of node-management workload.

Hadoop has a definite framework to which programmers have to adapt themselves. This limits flexibility and keeps them from writing code that fits the business logic and data characteristics. For example, you may want to change the file redundancy plan of HDFS – appropriate redundancy can effectively reduce network traffic, which we’ll discuss later – by modifying the source code. But this is not easy, and changing the source code complicates upgrading. Another example is MapReduce, which breaks a task into very many subtasks to increase fault tolerance and cannot control the execution order directly, making it difficult and roundabout to express order-related algorithms and data association algorithms.
esProc offers a different way of thinking. A framework is important, yet it’s not easy to design a good one, and a development tool can do little to facilitate the process. Actually, only a very small amount of code is needed to implement a framework, so a ready-made one is of limited use for enhancing development efficiency. Whatever the framework, low-level methods are always indispensable and consume most of the development resources. esProc provides ready-made methods for invocation, enabling programmers to take control of the programming process through code.
Without a framework, esProc allows programmers to determine the computing tasks and data distribution for the cluster nodes, which brings higher performance at the cost of more detailed work. esProc is therefore more suitable for small and medium-sized clusters, even standalone machines, than for large-scale clusters. In this sense, esProc is a lightweight big data solution.

Hadoop is a relatively closed, self-contained system. To use its computational strategies, data must be placed inside it, because Hadoop can’t handle data stored in a relational database or a network file system. esProc is a pure computing engine and an open system: it can process data stored in all forms of data sources – relational databases, NoSQL databases, and various file systems (including HDFS) – and write computational results back to them. Before data is transported into a system to be manipulated, it needs pre-processing; in other words, the transportation itself is a data-processing process. It is at this stage that esProc can play an important role.

In-memory computing

In-memory computing refers to keeping data in RAM as the means of data processing, and it especially suits instant reporting and data querying. The RAM of contemporary computers can reach dozens or hundreds of gigabytes, even a terabyte. Traversing that much data takes from several minutes up to several hours even with the fastest hard disks; only a limited class of querying tasks can be served from pre-created indexes. And for tasks involving more data than the largest memory a contemporary computer can hold, external-storage-based methods can’t produce an instant response either. Memory capacity is thus what decides whether an instant response is possible in a business situation.

Processing data in RAM is much faster than handling it in external storage. But RAM’s critical advantage is that it tolerates accesses that are frequent, random, and scanty in amount – that is, high access frequency, random access targets, and only a few bytes per access. Retrieving data from 10,000 different RAM addresses, 100 bytes at a time, takes almost the same time as retrieving one million bytes contiguously.
Hard disks are very different. Random accesses to an HDD cause disk-head seeks, which are much slower than the reads themselves; SSDs are better in this respect. The operating system divides disk space into clusters, the units of disk space, and each access reads or writes at least one cluster, usually 4K – reading or writing anything smaller takes no less time. Besides, accessing a hard disk involves a lot of I/O-related preparatory work, repeated for every access. In short, hard disks are poor at frequent, random, scanty-amount accesses.

Pointer-based foreign key reference
A common JOIN operation in SQL gets fields from the table pointed to by foreign keys (a dimension table) to join with the current table (a fact table).
The accesses to a dimension table are exactly of the frequent, random, scanty-amount kind. Seen from the fact table, the dimension-table records are accessed in no particular order, which is the randomness; each access targets one field of one record, which is the scanty amount; and the fact table has a great number of records, which makes the accesses frequent.

For example, here are the simplified Products table and Sales table of a supermarket:
Products table: Productid, Name, Supplier, Category, UnitPrice
Sales table: Number, Time, Productid, Quantity
The Sales table’s Productid is the foreign key pointing to the Products table.
Problem: calculate the total sales amount. SQL’s way is to join the two tables on Productid, multiply the UnitPrice from the Products table by the Quantity from the Sales table, and sum the products.

How do we find the unit price corresponding to each sales record?
The slowest way is full traversal: the number of comparisons is the product of the record counts of the two tables. Creating an index on the dimension table dramatically reduces the accesses to it, roughly cutting the search count down to its logarithm. A hash index reduces the search count to a constant, speeding the lookup up further – though the result depends on how the hash function is designed, and a bad one causes many collisions and contributes little to performance.
Hashing is nonetheless the fastest way, and today’s databases adopt the hash-join method. Relational databases, however, don’t have the concept of a dimension table and thus don’t distinguish dimension tables from fact tables. A hash-join algorithm hashes the key values of both tables and then maps records with equal hash values through traversal. One hash-join operation has the same computational complexity as creating a hash index on a dimension table; across multiple hash joins the index could be reused – if the dimension table were known as such – saving the repeated calculation of hash values.
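A minimal Java sketch of the reuse point (the Product and Sale classes and the loading steps are hypothetical): the hash index on the dimension table is built once and then serves any number of join operations:

import java.util.*;

class Product { int id; double price; }
class Sale { int productId; int quantity; }

public class HashIndexJoin {
    public static void main(String[] args) {
        List<Product> products = new ArrayList<>();  // assume the dimension table is loaded here
        List<Sale> sales = new ArrayList<>();        // assume the fact table is loaded here

        // Build the hash index on the dimension table once...
        Map<Integer, Product> index = new HashMap<>();
        for (Product p : products) index.put(p.id, p);

        // ...and reuse it for any number of subsequent joins.
        double total = 0;
        for (Sale s : sales) total += s.quantity * index.get(s.productId).price;
        System.out.println(total);
    }
}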

Suppose the memory is so large that all the data can be loaded into it at once. Then there’s a chance to make the lookup faster still.
Programmers who have worked with Java or C know the concept of a pointer, a mechanism enabling high-speed random access. If, after the data is loaded, the foreign key field of the fact table is converted into pointers to the records of the dimension table, those records can be accessed and referenced directly through the pointers; hash calculations and comparisons are no longer needed, giving a real-time query result. The conversion itself still requires the hash calculation and comparison, but only once: after the pointers are created, subsequent computations run at high speed.
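In Java terms the conversion looks roughly like this (a sketch, with the same hypothetical classes as above): each foreign key is resolved to an object reference once, after which every access is a plain dereference:

import java.util.*;

class PointerJoin {
    static class Product { int id; double price; }
    static class Sale { int productId; Product product; int quantity; }

    // One-off conversion: the hash lookup happens here and only here.
    static void toPointers(List<Sale> sales, Map<Integer, Product> index) {
        for (Sale s : sales) s.product = index.get(s.productId);
    }

    // All later computations dereference directly, with no hashing or comparison.
    static double totalAmount(List<Sale> sales) {
        double total = 0;
        for (Sale s : sales) total += s.quantity * s.product.price;
        return total;
    }
}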
esProc provides the mechanism of creating pointers and accessing records through them. For convenience, let’s take a file data source as the example. Below is the code for creating the pointers and performing the subsequent computation:

  A  
1 =file(“Products.txt”).import() Import the Products table
2 =file(“Sales.txt”).import() Import the Sales table
3 >A2.switch(productid,A1:id) Create a pointer association, and switch over Productid to pointers
4 =A2.sum(quantity*productid.price) Calculate the total sales amount by referencing the product’s unit price with pointers

In practice, the loaded data may be shared by multiple subtasks, for which the code will be a little different.

Here are two field tests of group and aggregate operations showing that pointers achieve better performance. The first involves no joins; the second performs multi-level foreign key joins over five tables. Both use the same amount of data, which is less than the RAM capacity. The tests were run in esProc and in Oracle on the same hardware. Below are the results:

                               esProc   Oracle
Single table without joins     0.57s    0.623s
Five-table foreign key joins   2.3s     5.1s

Oracle doesn’t have a pure in-memory mode: after multiple executions of a program it caches the data, but still performs some extra handling. So rather than comparing the absolute values above, look at the ratio between them. Without a foreign key join there is little performance difference between esProc and Oracle; with a foreign key join, esProc is more than twice as fast.

Pointers are a basic concept in both Java and C, and so should have been an obvious approach. Unfortunately, SQL offers neither the concept nor a corresponding data type, for historical reasons: SQL emerged in an era when computer memories were small and in-memory computing was rare. Today, some SQL-oriented in-memory databases simply load data into memory without providing a pointer-based processing method, leaving the benefit of pointers unrealized.

In-memory dimension table
Converting foreign keys to pointers requires all the data to be loaded into memory. However, the ever-growing fact data, on which the pointers are mostly created, will probably exceed the memory size as transactions accumulate. The dimension table, by contrast, holds a small amount of data, grows slowly, and remains rather stable, so it is far more feasible to place a whole dimension table in memory.
The most usual operation on a fact table is ordered traversal, which is neither random nor scanty in amount, so keeping the fact table in external storage brings no catastrophic penalty – only some performance loss. The accesses to the dimension table, however, are frequent, random and scanty in amount, and badly need RAM’s strengths.

Therefore, when memory is limited, you can load only the dimension table into memory and leave the fact data in external storage for traversal. Below is the pointer-based program for the previous algorithm:

  A  
1 =file(“Products.txt”).import().primary@i(id) Import the Products table and create an index
2 =file(“Sales.txt”).cursor() Create cursor of the Sales table for traversal
3 =A2.switch(productid,A1:id) Create pointers based on the cursor for traversal
4 =A3.groups(;sum(quantity*productid.price)) Traverse records to calculate sales amount, while pointers are used to reference records

In this case, however, the pointers associating the tables are generated during the traversal rather than before it. esProc still uses a hash algorithm to find the matching records, so the performance is obviously not as good as when the data is handled entirely in memory.

So it makes better use of the memory’s computational power to separate the dimension table from the fact table and treat them differently, loading only the former into memory. Often multiple computational tasks use the same dimension table: import it into memory, create an index on it, and reuse the index across the tasks. Through this human intervention – loading the dimension table into memory – the external storage accesses are reduced and the access pattern changed, achieving higher performance.
As mentioned previously, SQL doesn’t distinguish between dimension tables and fact tables; all tables are logically equal. When memory is insufficient, a SQL join loads the table with the smaller amount of data into memory. For a single operation there is no significant performance difference between SQL and esProc; across multiple operations, however, SQL may load the dimension table and calculate hash values over and over, and it is a matter of chance whether cached data gets reused.

Sequence-number-based foreign key reference
If the foreign key values in an externally stored fact table – typically codes of some kind – are converted into the sequence numbers of their corresponding records in the in-memory dimension table, the performance of the join rises to nearly the level a pointer-based join achieves.
For the above example, that means converting the product codes in the Sales table into integers: the sequence numbers of the corresponding products in the Products table.
This looks tedious and time-consuming, but it is a one-off job. Once done, records can be located directly by sequence number in subsequent join operations, without calculating and comparing hash values.

  A  
1 =file(“Products.txt”).import() Import the Products table
2 =file(“Sales.txt”).cursor() Create a cursor on the sales records, whose foreign key values are now sequence numbers
3 =A2.switch(productid,A1:#) Generate the joining pointers for traversal by locating records with sequence numbers
4 =A3.groups(;sum(quantity*productid.price)) Perform the desired calculation

The code looks similar, but performs much better.
SQL is based on the mathematical concept of unordered sets: the records of a SQL table have no definite order. Even if the foreign key values have been preprocessed into numbers, the database will still run a hash-join algorithm, calculating and comparing hash values. Preprocessing the data for performance is simply wasted effort there.
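To see why the preprocessing pays off, here is a minimal Java sketch (hypothetical arrays): once the foreign key column holds sequence numbers, locating a dimension record is a direct array access with no hash calculation at all:

class SeqJoin {
    // productSeq[i] is the 0-based position of sale i's product in the dimension table.
    static double totalAmount(int[] productSeq, int[] quantity, double[] price) {
        double total = 0;
        for (int i = 0; i < productSeq.length; i++)
            total += quantity[i] * price[productSeq[i]];  // direct positional access
        return total;
    }
}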

After the foreign keys are transformed into sequence numbers, you must be careful when modifying the dimension table. Appending data is fine. To delete data, just mark the record as deleted and leave a vacancy in its place; moving the following records upward to fill the vacancy would break the fact table’s foreign key references. When many records have been deleted, the vacancies come to occupy a fair proportion of the memory; in that case the sequence numbers of the dimension records, and the referencing values in the fact table, need to be rearranged together.
The sequence-number-based foreign key reference suits scenarios where the dimension table changes little, particularly manipulations of historical data, which are common in big-data analysis. For fast-changing current data the method isn’t recommended, as it brings potentially big risks or complicated data management.

Byte-array-style in-memory storage
Any programmer who has dealt with big data in Java knows that Java objects eat up memory. Compared with the bytes a chunk of data occupies in text format, the same data uses 3-5 times more space once loaded into memory and converted into Java objects; the exact factor depends on the data type, with integers and reals expanding less than strings.
This characteristic gives Java programs a very low memory utilization ratio: a machine with 100G of memory can load only just over 20G of data – a great waste of memory space. Moreover, Java programs are very sensitive to the available memory, starting frequent garbage collection as soon as memory runs short, which can reduce performance by dozens of times.

esProc designs a mechanism that reduces memory usage by delaying objectification. It loads data compactly into memory as byte arrays, whose space usage is nearly the same as in external storage; it objectifies the data only when it is referenced and releases the objects when the computation finishes. This greatly increases the memory utilization ratio. The price is the extra objectification time whenever data is referenced; the benefit is full use of the memory’s capacity for frequent, random, scanty-amount accesses.
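The mechanism can be sketched in Java (a simplified illustration only; the record layout – an int id followed by a double amount – is an assumption, not esProc’s actual format): rows stay in memory as compact byte arrays and are decoded only at the moment a field is referenced:

import java.io.*;

class ByteRow {
    private final byte[] data;   // compact storage, close to the on-disk size

    ByteRow(byte[] data) { this.data = data; }

    // Objectify on demand: decode the row only when the field is actually needed.
    double amount() throws IOException {
        DataInputStream in = new DataInputStream(new ByteArrayInputStream(data));
        in.readInt();            // skip the id field (assumed layout)
        return in.readDouble();  // decode just the referenced field
    }
}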

With this mechanism transparently encapsulated, an esProc program written against a byte table looks almost the same as one written against ordinary in-memory objects:

  A  
1 =file(“Products.dat”).create().primary@i(id) Create a byte table and an index based on the file
2 =file(“Sales.dat”).cursor@b() Create cursor of the Sales table for traversal
3 =A2.switch(productid,A1) Use a byte table as an ordinary memory table
4 =A3.groups(;sum(quantity*productid.price)) Perform same operations to get the result

You can also use a hash index to access a byte-format dimension table, or you can perform a sequence-number-based foreign key join with the dimension table.
Code tests show that, compared with fully in-memory data handling, this mechanism reduces performance by about 30%; compared with external storage computing, however, it has a great advantage whenever memory is insufficient.

Multithreaded parallel processing
RAM is suitable for concurrent accesses, and contemporary computers have more and more CPUs and cores. Together these make an ideal environment for multithreaded parallel processing that takes the greatest advantage of multiple cores.
esProc provides clean multithreading syntax which, together with the in-memory cursor, makes it easy to develop multithreaded programs.

  A B  
1 =file(“Sales.dat”).create() Import the source data as a byte table
2 fork 4 =A1.cursor(A2:4) Perform parallel processing by dividing the table into 4 segments and creating an in-memory cursor for each segment
3   =B2.groups(;sum(amount):a) Traverse each cursor to sum up amounts
4 =A2.conj().sum(a) Concatenate the results returned from the threads

In this piece of code, the fork statement launches multiple threads to execute the code block and collects the results returned from them. The cellset-style code is intuitive and clear: the code block run in parallel is easy to discern in the whole cellset.

Unlike more basic programming languages such as Java and C++, esProc supplies no mechanism for acquiring and synchronizing shared resources between threads. In the esProc model, threads start simultaneously and their results are collected together, but each executes independently of the others. This makes it impossible to express every parallel algorithm – real-time logic in particular – but it suffices for data computing. esProc trades some unimportant abilities for significantly increased usability.
More parallel threads do not always mean better performance. As a rule the threads should not outnumber the CPU cores; otherwise the extra threads are merely logical and bring no real parallelism.
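In Java terms, the guideline amounts to sizing the worker pool by the core count, for example:

import java.util.concurrent.*;

public class PoolSizing {
    public static void main(String[] args) {
        // Match the worker threads to the physical parallelism;
        // extra threads would be merely logical and add nothing.
        int cores = Runtime.getRuntime().availableProcessors();
        ExecutorService pool = Executors.newFixedThreadPool(cores);
        pool.shutdown();
    }
}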

esProc also provides built-in parallel options for certain functions, making it simple to perform a parallel computation.

  A  
1 =file(“Sales.txt”).import() Retrieve data out
2 =A1.select@m(amount>1000) Filter data with multiple threads
3 =A2.sort@m(amount:-1) Sort data with multiple threads

The @m option automatically decides how many parallel threads to launch according to the current configuration. Note that the option can’t guarantee ordered data retrieval, so it shouldn’t be used for order-related computations.
The parallel option is simple and concise, yet performs well. It lets programmers concentrate on the overall logic of the computation without worrying about the details of parallel processing.

External storage computing

It’s more common in big data scenarios that the amount of data exceeds the memory capacity, and then the only choice is to keep the data in external storage. The external storage itself has no computational ability; in essence, external storage computing means loading data from external storage into memory in batches for manipulation, and putting any big intermediate results into a disk buffer.
Compared with RAM, external storage (physically, hard disks) handles data slowly and, more importantly, lacks the ability to access data frequently, randomly and in scanty amounts. SSDs have better random access ability, but still don’t support frequent, scanty-amount accesses. Three principles govern the optimization of external storage computations: reduce slow disk accesses, even at the cost of more CPU work; retrieve data in order wherever possible, especially on an HDD, and set an appropriate number of parallel threads; and avoid frequent, scanty-amount accesses.

Multi-use Cursor
In some cases you may want to perform several different aggregate operations on the same source data. If the data is in memory, repeated traversal doesn’t hurt; for external storage data, it is better for one traversal to yield as many results as possible. For instance, both count and sum can be calculated during one traversal of grouping and aggregation – SQL works on the same principle, letting you SELECT multiple aggregate values in one GROUP BY statement.
But complex scenarios are hard to express in SQL. A simple illustration: suppose you want the median of a big data set. You sort the records and take the member in the middle position, but to do that you must know the number of records, and SQL’s syntax gives you no way to count the records while sorting them. You have to count beforehand, which means one extra scan of the data set – and for a big data set in external storage, the traversal takes longer than the computation itself.
esProc designs syntax for late binding on a cursor, allowing an aggregate operation to be attached to the cursor and executed during its traversal. With this mechanism, the count is obtained while the data is being sorted.

  A  
1 =file(“data.txt”).cursor()  
2 >A1.groups@x(;count(1);n) Define an aggregate operation on the cursor which will be executed later during the traversal
3 =A1.sortx(key) Sort data through traversal, executing the aggregate operation bound in A2
4 =A1.v(n).#1 Get the result of the bound aggregate operation – the number of records
5 =A3.skip((A4-1)\2).fetch@x(2-A4%2).avg(key) Retrieve the record at the middle position and get the median value

In this piece of code, the source data is traversed only once, by A3; A5 then traverses the sorted data.

In many cases you may also need different aggregations over different groupings of a big data set. SQL traverses the source data once per GROUP BY statement; esProc, in the same spirit as above, binds multiple operations to one cursor through late binding and executes them all in a single traversal.

Segment-based parallel strategy
Where the computational load itself can’t be reduced, multithreaded parallel processing lets the CPU cores share the labor, improving performance quite noticeably.
First let’s look at text data sources. Text files are common in external storage computations. When imported into memory they must be parsed into objects of the corresponding data types, which is slow – particularly for date/time/datetime data, where analyzing the string often costs more CPU time than the disk access itself. This is exactly where multithreaded parallel processing helps.

To use the parallel strategy, you divide a file into segments and assign each segment to a thread. In a text file one line generally corresponds to one record, but line lengths vary. Segmenting by number of lines won’t improve performance at all, because locating each segment requires a traversal from the beginning of the file. Segmenting by bytes needs no traversal, but a segment boundary may fall in the middle of a line, splitting one line across two segments and corrupting the retrieved data.
Learning from Hadoop, esProc performs byte-based segmentation with an automatic “skip head line, complement tail line” strategy: a segment discards the line its starting point falls in, and the previous segment reads on past its own end to complete that line. This ensures every segment is made up of complete lines, without compromising data integrity.
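The strategy is easy to sketch in Java (a simplified illustration; process() is a placeholder): a segment skips the partial line its start offset lands in and reads one line past its end offset:

import java.io.*;

public class Segmenter {
    // Reads the records of the byte range [start, end] of file f as complete lines.
    static void readSegment(RandomAccessFile f, long start, long end) throws IOException {
        f.seek(start);
        if (start > 0) f.readLine();      // skip the partial head line; the previous segment completes it
        long pos = f.getFilePointer();
        while (pos <= end) {              // a line starting at or before 'end' belongs to this segment
            String line = f.readLine();
            if (line == null) break;      // end of file
            process(line);
            pos = f.getFilePointer();
        }
    }
    static void process(String line) { /* parse and handle one complete record */ }
}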
Used together with the esProc parallel processing mechanism described earlier, this segmentation strategy lets programmers write a parallel program conveniently:

  A B  
1 =file(“data.txt”) The source file
2 fork 4 =A1.cursor@t(amount;A2:4) Divide the file into 4 segments and create 4 cursors based on them for parallel processing
3   =B2.groups(;sum(amount):a) Traverse data in each cursor to calculate the total amount
4 =A2.conj().sum(a) Concatenate the results of the threads to get the final result

The above code divides data.txt into 4 segments, for which 4 threads are used to traverse them with the cursor, summing up the amount field; and finally concatenates the returned results of the threads to get the total amount.

Often more time is spent parsing a text file than computing on the data. As long as the parsing is done in parallel, the calculation itself needn’t be. If the order of the records doesn’t affect the result – as with the above sum, which is order-irrelevant – you can use the esProc built-in parallel option to write the program even more simply.

  A  
1 =file(“data.txt”).cursor@tm(amount) Declare cursors for parallel retrieval
2 =A1.groups(;sum(amount))._1 Traverse each cursor to sum up the amount field

In the above code, multithreaded parallel processing starts automatically when data begins to be fetched through the cursor, while the aggregation of the amount field is done serially.

Generally, the number of parallel threads for an in-memory computation is limited only by the number of CPU cores, but for an external storage computation it is limited by the disk as well. Logically a single disk can’t serve concurrent accesses: multiple threads reading different files cause frequent disk-head movements, which take longer than the data retrieval itself. The remedy is a bigger cache per thread, which in turn uses more memory. esProc provides system settings to control the number of threads and the cache sizes for specific scenarios.

Segmenting a database table is less flexible than segmenting a text file, and it’s usually better to leave data-intensive tasks to the database itself. Some tasks, however, are computation-intensive, with calculation accounting for a large share of the processing; these are hard to handle inside the database, so the data has to be retrieved out of it. For this latter case, segment-based parallel processing applies.
One approach is to create sub-tables: each thread handles several sub-tables, and a sub-table cannot be split between threads. Balancing the threads’ loads then requires many sub-tables, but creating too many tables in a database is a bad idea. The usual compromise is to create exactly as many sub-tables as threads, which fixes the degree of parallelism and makes the process inflexible.
Another approach is to segment the table with WHERE conditions, which allows a flexible number of threads. It’s best to create an index first and segment on the indexed condition; otherwise every WHERE causes a full table scan, lengthening the retrieval.

According to our tests, retrieving data out of Oracle and some other databases through their JDBC drivers is very slow, and JDBC becomes the bottleneck for computations that must pull data out of the database. If the database is not heavily loaded, segment-based parallel retrieval can partially make up for the performance lost in JDBC.

  A B  
1 fork 4 =connect(db) 4 threads, each making its own connection to the database
2   =B1.query@x(“select * from T where part=?”,A2) Retrieve data in segments
3 =A1.conj() Concatenate the results

The test shows that the parallel retrieval can boost the performance by several times when the database is lightly loaded.

Data storage
In addition to text files, esProc supports binary files stored in its proprietary format.
esProc embeds the data types in these binary files, so data needn’t be parsed during retrieval, which makes performance much better than with text files. Moreover, esProc offers compressed binary data that generally occupies one-third to one-half of the disk space of the same data in text format, speeding retrieval further. That doesn’t mean the higher the compression ratio the better: decompression costs CPU time, and algorithms with higher ratios cost more of it. esProc uses a very simple compression algorithm that demands almost no extra CPU time, though its compression ratio is modest.
All considered, retrieving data from a binary file is 3-5 times faster than from a text file. It is also faster than retrieving database data through JDBC, which matters particularly with Oracle and its rather slow JDBC driver. In a nutshell, the binary format gives a big performance edge whenever data is to be reused.
The conversion from a text file or a database table to a binary file is simple:

  A  
1 =file(“data.txt”).cursor@t() Define a text file cursor
2 =file(“data.bin”).export@z(A1) Export data as a segmented binary file

An esProc binary file can also be processed with segment-based parallelism. The code is almost the same as for a text file:

  A B  
1 =file(“data.bin”) The source file
2 fork 4 =A1.cursor@b(amount;A2:4) Same parameters, except that @b means reading a binary file
3   =B2.groups(;sum(amount):a) Completely same syntax for subsequent operations
4 =A2.conj().sum(a)  

Based on binary files, esProc provides a columnar storage plan that stores each column as a separate file. Since some computations need only a few columns, only a small amount of data has to be retrieved under this plan, significantly reducing hard disk access time. This is the strength of columnar storage.
Converting a text file to column-wise files:

  A  
1 =file(“data.txt”).cursor@t() The original text file
2 =10.(file(“col”/~/”.bin”)) Generate files for 10 columns
3 >A2.export@z(A1) Write the original file as multiple segmented column-wise files

Handling data based on the column-wise files:

  A  
1 =[1,3].(file(“col”/~/”.bin”)) Use the first and the third columns
2 =A1.cursor() Define a cursor of a column-wise file
3 =A2.groups(;sum(col1+col3):all)._1 Calculate sum using the cursor

esProc column-wise files can also be processed with segment-based parallelism, and the code is written the same way as for a row-wise file.

  A B  
1 =[1,3].(file(“col”/~/”.bin”)) Define an array of column-wise files
2 fork 4 =A1.cursor(amount;A2:4) Same code as the code for a row-wise file handled with parallel processing and with cursors
3   =B2.groups(;sum(amount):a) Completely same syntax for subsequent operations
4 =A2.conj().sum(a)  

Columnar files must be segmented in alignment; otherwise the field values of a record would be mismatched. This raises a series of problems: in a compressed file the stored lengths of field values vary from record to record, so sequence numbers are useless for locating records. The general solution is to split the source table row-wise into multiple chunks, store each chunk column-wise, and segment by chunk, never splitting a chunk. That creates a dilemma: too few chunks limit the possible number of segments, while too many chunks fragment the data and weaken the very strength of columnar storage.
In view of this, esProc provides a new segment-based columnar storage scheme. It uses an index that occupies very little space; guided by the index, the columns of the source table are each divided, in alignment, into however many segments the specific environment calls for, and each column is stored contiguously without being split into chunks. The algorithm is a little complicated, and you can study it if it interests you.

Columnar storage isn’t always workable. When many columns are retrieved at once, an HDD suffers a serious tension between seek time and buffer capacity, and multithreaded parallel processing makes it worse. The columnar plan is unsuitable for an environment with an ordinary HDD; it calls for an SSD or a high-concurrency disk array.

Exploiting data orderliness
Often a logically wide database table is physically stored as multiple tables sharing the same primary key, called homo-dimensional tables. A main table and its sub-table are likewise related through the primary key. Querying these kinds of tables involves join operations.
These joins differ from the foreign-key joins used to reference a dimension table. The records referenced through dimension-table joins are unordered, but the records of homo-dimensional tables, or of a main table and its sub-table, are associated in alignment, so retrieval is ordered. If the records of each table are already sorted by the primary key, the joins can be realized with the merge algorithm in a single traversal – much simpler than a segment-based, external-storage hash join.
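The merge itself is simple to sketch in Java (hypothetical id arrays, both already sorted): one synchronized pass over the two tables, with no hashing and no nested loops:

import java.util.function.BiConsumer;

class MergeJoin {
    // Emits each matching (order row, detail row) position pair in one pass.
    static void join(int[] orderIds, int[] detailIds, BiConsumer<Integer, Integer> emit) {
        int i = 0, j = 0;
        while (i < orderIds.length && j < detailIds.length) {
            if (orderIds[i] < detailIds[j]) i++;        // this order has no more details
            else if (orderIds[i] > detailIds[j]) j++;   // detail without a matching order
            else emit.accept(i, j++);                   // match; an order may have several details
        }
    }
}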
esProc provides the merge algorithm over ordered cursors, performing joins, intersection, union and difference efficiently.

  A  
1 =file(“Order.txt”).cursor@t() The Order table cursor sorted by id
2 =file(“Detail.txt”).cursor@t() The Detail table cursor sorted by id
3 =join@x(A1:O,id;A2:D,id) Join the two cursors using merge algorithm and return a cursor too
4 =A3.groups(O.area;sum(D.amount)) Aggregate the amount by area; area field is in the main table and amount field is in the detail sub-table

esProc classifies join operations into foreign-key joins and alignment joins. The former can be implemented with an in-memory algorithm, because the dimension table is small enough to load entirely into memory. For the latter, the data can be sorted first, since records are accessed in order; the sort takes time, but it is a one-off operation, and the sorted source data performs much better in all further handling.
Since relational algebra has no concept of ordered sets, SQL can’t exploit data orderliness. Many databases, however, automatically create an index on the primary key, which is functionally equivalent to sorting. That can improve performance to some degree, but luck plays a role.

In practice you can use the order-based merge for data preparation. Because the existing data is already ordered, you only need to sort the new data and merge it in, without re-sorting the historical data.
Nor is it necessary to merge all the new data with the historical data every time. You can sort the new data of a given period and merge it with the ordered historical data as needed. The merge is cheap and has little impact on overall performance as long as the parts to be merged are few.
Besides data prepared in advance, certain data is ordered by construction – for instance, the results of esProc’s group and aggregate operations, which are by default ordered by the key values, and big grouping results obtained by merging ordered sub-results. Both are ordered sets and can be processed further with the merge operation.

Using the ordered cursor strategy above, a binary esProc file can also be handled with segment-based parallelism by retrieving data in segments. Marking each group when generating the binary file ensures that esProc never splits one group across two segments, so every thread handles complete groups.

  A  
1 =file(“userlog.txt”).cursor@t() Cursor of the raw data
2 =A1.sortx(id) Sort the cursor by id and return a cursor
3 >file(“userlog.dat”).export@z(A2;id) Export the cursor data as a binary file segmented by id

Below is a piece of code for segment-based parallel processing based on ordered data retrieval by group:

  A B C  
1 =file(“userlog.dat”) A data file sorted by id
2 fork 4 =A1.cursor@b(;A2:4) Make sure ids of the same group won’t be separated during segmentation
3   for B2;id Loop over the cursor, fetching one complete id group per cycle
4      

The order-based merge can also be performed on binary source files with segment-based parallel processing. The binary file exported and segmented first serves as the base file; the other binary file is segmented in alignment with it, ensuring that the records of both files are arranged in the same order and match correctly.

  A  
1 =file(“Order.txt”).cursor@t()  
2 =A1.sortx(id) Sort data by id
3 >file(“Order.dat”).export@z(A2;id) Export the cursor data as a binary file segmented by id
4 =file(“Detail.txt”).cursor@t()  
5 =A4.sortx(id) Sort data by id
6 >file(“Detail.dat”).export@z(A5;id,file(“Order.dat”),id) By aligning with Order.dat, export the cursor data as a binary file segmented by id

Below is the parallel code for merging the two cursors:

  A B  
1 =file(“Order.dat”) A data file segmented by id
2 =file(“Detail.dat”) A data file segmented in the same way as the previous file
3 fork 4 =A1.cursor@b(;A3:4) Segment file without splitting records with same ids
4   =A2.cursor@b(;A3:4) Segment file in the same way
5   =join@x(B3:O,id;B4:D,id) Merge
6   =B5.groups(O.area;sum(D.amount)) Aggregate
7 =A3.conj().groups(#1;sum(#2)) Concatenate and aggregate

Cluster computing

Cluster computing technology is adopted when the data is so massive that a single machine can’t handle it. Each node in a cluster is an independent computer, on which all the in-memory and external-storage computing techniques introduced above can be employed.
An esProc application can run independently as a server process, receiving computational requests from esProc programs and returning the results. The basic task model is simple: a controlling node gives computational orders to the other nodes, then collects and aggregates their results. A complex task can consist of multiple such subtasks.
A clustering strategy should scale out conveniently, handling more data as nodes are added – though any algorithm has a maximum useful number of nodes, which limits the scale-out range. Unlike a standalone algorithm, a clustering strategy must also be fault-tolerant, keeping the overall cluster operable and the task completable even when one or more nodes malfunction.

Task distribution
First let’s look at the simple shared-data strategy.
This strategy stores the source data in one location, such as a database or a network file system, and assigns subtasks to the nodes without distributing data to them. The frequent concurrent accesses put heavy pressure on the shared source, so the strategy suits computation-intensive tasks better than data-intensive ones.

It’s simple to implement the shared-data-source strategy in esProc:

  A B  
1 =4.(“192.168.0.”/(10+~)/”:1234”) The list of 4 nodes
2 fork to(8);A1 Assign 8 subtasks to the nodes
3   =hdfsfile(“hdfs://192.168.0.1/persons.txt”) An HDFS file
4   =B3.cursor@t(;A2:8) A segmented cursor
5   =B4.select(gender==’M’).groups(;count(1):C) Perform filtering and count
6 =A2.conj().sum(C) Calculate aggregate values

For convenience, an HDFS file is used here; since esProc supports various types of data sources, a database table would be handled similarly.
In the main program, the nodes that will handle the task are written into the code: the programmer decides which nodes take part in the computation, and different tasks can use different node lists. The list can also be read from a configuration file. Either way, there is no controlling center managing the nodes in esProc.
esProc’s distributed structure is centerless. This is unlike Hadoop, which possesses a thorough system simulating the whole cluster as one huge standalone machine. esProc has no framework and no permanent central controlling node; it enables programmers to control the participating nodes in code.

The benefit of a centerless cluster is that the malfunction of one node won’t stop the whole cluster from running. The price is more work for the programmer, because adding or removing nodes must be expressed in code. A centered distributed structure, by contrast, breaks down once the central node fails, but its management is simple: node changes are configured at the central controlling node, and the existing code carries over transparently.
Strictly speaking, an esProc cluster isn’t completely devoid of centers. Though the server cluster in general is centerless, each subtask has its own controlling node that provisionally summons other nodes into the computation. If that controlling node collapses, the corresponding subtask fails, but the cluster as a whole can still handle other subtasks.

In the code above there are 8 subtasks, more than the 4 available nodes. This illustrates esProc’s way of handling load balancing and fault tolerance.
esProc doesn’t distribute subtasks among nodes in an absolutely equal way; it maintains a dynamic balance. When distributing tasks, the main program checks how occupied each node is (comparing the number of threads currently running on it with the maximum it allows) and gives the next subtask to the idlest node. If every node is saturated – its running threads having reached the allowed maximum – the main program waits until some node finishes a subtask and then resumes distribution. A faster node thus receives more subtasks, balancing the load across the nodes.
If a node malfunctions mid-computation and can’t continue its work, the main program reassigns that work to the healthy nodes. This lengthens the total computing time but provides a degree of fault tolerance.
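A minimal Java sketch of this dynamic balance (the Node fields and the polling interval are assumptions, not esProc internals): each subtask goes to the least-occupied node, and distribution waits whenever every node is saturated:

import java.util.*;

class Node {
    final String address;
    int runningThreads;          // current load reported by the node
    final int maxThreads;        // capacity configured for this node
    Node(String address, int maxThreads) { this.address = address; this.maxThreads = maxThreads; }
}

class Dispatcher {
    // Picks the idlest unsaturated node, waiting if all nodes are saturated.
    static Node pickIdlest(List<Node> nodes) throws InterruptedException {
        while (true) {
            Node best = null;
            for (Node n : nodes)
                if (n.runningThreads < n.maxThreads
                        && (best == null || n.runningThreads < best.runningThreads))
                    best = n;
            if (best != null) { best.runningThreads++; return best; }
            Thread.sleep(100);   // every node saturated: retry shortly
        }
    }
}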
esProc’s non-framework design allows a cluster to include machines of contrasting capabilities – different memory sizes, CPU configurations, even operating systems. esProc is open to any machine, exploiting the full potential of the user’s existing hardware.

Data distribution
That is esProc’s clustering mechanism under a shared-data strategy. In most cases, though, structured big-data computations are data-intensive, with a high I/O cost, and the shared-data strategy then creates a serious throughput bottleneck, with I/O delays far longer than the manipulation time. To solve this, a distributed data storage plan should be used to spread the I/O across the nodes.
Data distribution means splitting the source data into multiple parts and storing them across the nodes, so that each node accesses its data locally, avoiding network transmission and contention for a shared source. It does not mean simply dividing the data (evenly) into N segments placed on N nodes: that kind of distribution is fault-intolerant and may still incur considerable network transmission from join operations.
Unlike common network file systems, esProc’s data distribution is not transparent: programmers decide how data is distributed. The virtue of this is that they can make more flexible decisions according to the algorithms and the data characteristics.
Physically, data is generally stored as files or in databases, and it’s rare for databases to be installed across a node-intensive cluster; so all the following sample programs use files as data sources.

esProc divides data into N partitions and stores several different partitions on each node, so a partition may be stored redundantly on multiple nodes. For each partition, esProc allows flexibly specifying the degree of redundancy and the nodes it resides on; there is no system-level uniform redundancy factor or automatic redundancy plan.
The foremost purpose of redundancy is fault tolerance: if one node collapses, its data can still be found on other nodes, and the whole computation completes, though it takes longer.
A flexible redundancy plan is also useful for reducing network traffic. For instance, a dimension table is usually referenced globally throughout a computation, so it can be put into one partition that is made redundant on every node, while the partitions of the much larger fact data are spread over different nodes. The dimension table is normally small and adds little to the storage burden, and eliminating inter-node references to it effectively increases performance.
When the nodes’ capabilities are roughly equal, simple cyclic distribution is recommended.

For example, the fact data is divided into 5 partitions distributed over 5 nodes, with 3 partitions stored cyclically and redundantly on each node. That way, the complete source data can still be obtained even if any two nodes collapse. The dimension table used by all subtasks is stored separately in partition x, which is placed on every node to reduce network transmission.
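The cyclic plan is easy to compute. A small Java sketch (5 nodes and 3 copies per partition, as above) prints which partitions each node holds:

class CyclicDistribution {
    public static void main(String[] args) {
        int nodes = 5, copies = 3;
        for (int node = 0; node < nodes; node++) {
            StringBuilder parts = new StringBuilder();
            for (int k = 0; k < copies; k++)
                parts.append((node + k) % nodes + 1).append(' ');  // consecutive partitions, modulo 5
            System.out.println("node " + (node + 1) + ": partitions " + parts);
        }
        // Every partition lands on 3 different nodes, so any 2 node failures
        // still leave at least one copy of each partition available.
    }
}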

This cyclic arrangement isn’t the only plan a flexible redundancy system allows. When node capabilities differ, more partitions can be placed on the high-performance nodes to give them more of the computational load, making the cluster’s overall performance more balanced.
esProc also offers functionality for synchronizing a partition’s data between nodes. You only need to update certain nodes when data changes, then synchronize the rest. A simple method is to dedicate a high-capacity node to synchronization and store all partitions on it; all data updates propagate from it to the working nodes.

For data access, esProc provides a semi-transparent inter-node retrieval strategy in which a computation specifies partitions and a node list. esProc first looks for a specified partition locally; if it isn’t found there, the least-burdened node holding it is accessed.
esProc also applies dynamic balancing to task distribution. On receiving a subtask, a node first checks whether the required data partitions are present locally: if so, it computes and returns the result; if not, it returns a failure message, and the main program finds another node for the subtask. This procedure ensures that a subtask runs on a node that has its data. If the data needed by a subtask can’t be obtained from any node, too many nodes have failed for the cluster to assemble the complete source data and perform the task successfully.

  A B C  
1 =4.(“192.168.0.”/(10+~)/”:1234”) A list of 4 nodes
2 fork to(8);A1   Assign subtasks to the nodes
3   =file(“person.txt”,A2) A2 is the data partition number
4   if !B3.exists() end “data not found” Failure message is returned; the subtask will be re-assigned
5   =B3.cursor()    
6   =B5.select(gender==’M’).groups(;count(1):C) Compute
7 =A2.conj().sum(C) Aggregate

Whether a node has the data partitions a subtask requires is checked in code, so the behavior can be flexibly arranged. You can specify that heavily accessed partitions be stored locally while rarely accessed partitions are obtained through inter-node retrieval.
The amount of network transmission in dynamic distribution is extremely small. Even several rounds of re-distribution take just milliseconds, which can simply be ignored compared with the time taken by the computation itself.

The esProc mechanisms of task assignment and data distribution show that the language's clustering strategy is intended for small and medium-sized clusters consisting of several to dozens of computers, not for massive clusters composed of hundreds, thousands, or even around ten thousand computers. Controlling subtask assignment and data distribution manually for each node brings higher flexibility and performance, but it increases the complexity of cluster management. The workload resulting from such a non-automatic strategy is acceptable for a small or medium-sized cluster, but unacceptable for a massive one, where flexibility must be traded off for manageability. This further illustrates the differences between an esProc cluster and a Hadoop cluster.

In-memory data distribution
The data partition in esProc is a logical concept. A partition can be stored in a file system, but it can also be placed in memory. In this sense, we can build a memory cluster out of the memories of multiple computers to carry out high-performance cluster-style in-memory computations, which respond almost instantaneously even over a huge amount of data.
An in-memory clustering strategy must also take care of fault tolerance, but the way of realizing it is quite different from that for data stored in external storage.

The fault-tolerance plan for external storage data is to create redundant data partitions, storing each batch of data in multiple copies. If the redundancy index is k, meaning the whole data can still be obtained with at most k-1 failed machines, then each partition needs to be stored in k copies, and the storage space utilization ratio is 1/k.
For external storage the ratio is perfectly acceptable, and the repeated duplication has little impact on performance, because disk is cheap and generally considered to have unlimited capacity. But for expensive RAM, a utilization ratio of 1/k is hard to bear.

To distribute in-memory data, esProc uses a "spare tire" fault-tolerance plan. The source data is split into n partitions which are stored on n nodes respectively, each node holding one partition. To achieve a redundancy index of k, k reserve machines stand ready to replace any failed machine. When a working node fails, a reserve machine loads the data originally stored on the failed node, the team of n nodes regroups, and the computation resumes; the failed node, once repaired, is reused as a reserve machine. The procedure works like the spare tire carried in a vehicle, which is how the plan gets its name.
But if too many nodes fail and even the k reserve machines are not enough to regroup n available nodes, an error is reported to declare a cluster failure. The vehicle is no longer serviceable.
This technique achieves a memory utilization ratio of n/(n+k), much higher than the 1/k of the redundancy-based plan. Since the amount of in-memory data is relatively small, reloading the related data after a node fails usually takes an acceptable few minutes. If the "spare tire" plan were applied to an external storage computation, however, reloading the much larger data after a node collapse would take quite a long time, and it's not appropriate to suspend a cluster for that long.
With the "spare tire" plan, the commonly used dimension table can still be loaded onto every node, so the actual memory utilization ratio will be a little smaller than n/(n+k).
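The bookkeeping behind the plan is simple. Below is a minimal Python sketch (an illustration of the logic, not esProc's actual API) of spare-tire failover with n = 4 working nodes and k = 2 reserves:

n, k = 4, 2
working = {node: node for node in range(n)}      # node -> the partition it holds
reserves = ["spare0", "spare1"]                  # k machines standing by

def on_failure(node):
    partition = working.pop(node)
    if not reserves:                             # fewer than n nodes can be regrouped
        raise RuntimeError("cluster failure: no reserve machine left")
    spare = reserves.pop()
    working[spare] = partition                   # reload the partition onto the spare
    print(spare, "took over partition", partition)

on_failure(2)                                    # spare1 took over partition 2
# Memory utilization is n/(n+k) = 4/6, about 67%, versus 1/k = 50% for a
# redundancy plan that stores every partition in k = 2 copies.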

The use of in-memory data distribution requires an initialization step, so it's a little more complicated than analyzing external storage data.

  A The main program
1 =8.("192.168.0."/(10+~)/":1234") A list of 8 nodes
2 =hosts(A1,to(4),"init.dfx") Find 4 nodes to load in-memory data
3 =callx@a("sub.dfx",to(4);A2) Call programs on those nodes to calculate
4 =A3.sum() Perform sums
  A The program of initializing nodes (init.dfx)
1 =file("Product.txt",z).import() Import data from data partition z
2 >env(T,A1) Store data in an environment variable
3 >zone(z) Add partition z on the current node
  A Node program (sub.dfx)
1 =env(T) Retrieve data from the environment variable
2 =A1.count(gender=='M') Perform filtering and counting
3 return A2 Return results

Compared with the dynamic task-distribution mechanism for external storage data partitions, the task distribution based on partitions of in-memory data is static, meaning the main program knows how it will assign the subtasks as soon as it obtains the list of available nodes.
Theoretically there is no fixed correspondence between a mechanism and external storage data or in-memory data; you could also use static distribution for external storage data and dynamic distribution for in-memory data. But external storage computing lacks stability and the time taken by each subtask is unpredictable, so the static mechanism would probably waste resources. In-memory computing is stable and the time spent on each subtask is basically predictable, so the static mechanism suits it better. Moreover, as mentioned above, in-memory data partitions are generally not stored redundantly on the nodes, so both mechanisms would produce the same assignment and the dynamic mechanism brings nothing extra. External storage data partitions, by contrast, often have a lot of redundancy across the nodes, and there the dynamic mechanism gives full play to the hardware resources.

Clustered dimension table
In the above section, we mentioned that a dimension table is usually stored redundantly on every node. Since a dimension table generally receives frequent, random accesses, it is not suitable for external storage; besides, a dimension table is relatively small. Therefore it is usually imported wholly into memory.
Sometimes, however, a dimension table can be really big: it can't be imported entirely into the memory of a single node even if it is converted into a byte table. In view of this, esProc designed the clustered dimension table based on the mechanism of in-memory data partitions. The design is to segment the dimension table and import the segments into the memories of different nodes, from which they will be retrieved as needed.

  A The main program
1 =8.("192.168.0."/(10+~)/":1234") A list of 8 nodes
2 =hosts(A1,to(4),"init.dfx") Find 4 nodes to load the dimension table into their memories
3 =callx("sub.dfx",8*A2,to(8);A1) Call programs on those nodes to calculate, passing in the nodes where the dimension table is stored
4 =A3.sum() Perform sums
  A The program of initializing nodes (init.dfx)
1 =file("Product.txt",z).import() Import the part of the dimension table from partition z
2 >env(T,A1) Store data in an environment variable
3 >zone(z) Add partition z on the current node
  A B Node program (sub.dfx)
1 =file("Sales.txt",z)   Fact data in partition z
2 if !A1.exists() end "data not found" Return an error message if the needed partition isn't found
3 =createx(T,h) Create a clustered dimension table based on variable T on node group h where the data is stored
4 =A1.cursor()
5 =A4.join(productid,A3,price) Join with the clustered dimension table A3
6 =A5.sum(quantity*price) Do the summing up
7 return A6 Return results

As can be seen, the nodes where a clustered dimension table is stored can be different from the nodes where the calculations are performed, so you can create a clustered dimension table on certain nodes and compute on others; different clustered dimension tables can also live on different nodes. In short, programmers can arrange this as needed.
Like hard disks, the network is unsuited to frequent accesses in small amounts. Every network transmission needs a lot of preparatory and wrap-up work, and the network protocol adds supplemental data at the beginning and the end. So when accessing records of a clustered dimension table, you'd better not simulate it as an ordinary in-memory table; otherwise the frequent, small-volume accesses will seriously hurt performance. Hence pointer-based foreign key reference doesn't apply to the fields of a clustered dimension table; batch reference based on cursors is the suitable way instead, and its code differs from the code for purely in-memory calculations.

About JOINs – A summary
JOINs are one of the biggest headaches in dealing with structured big data, and they are a SQL nuisance too. To handle data exceeding the memory capacity, SQL performs a JOIN group by group. The general procedure is to partition each of the two joined tables according to hash values calculated from their keys; since each pair of corresponding groups can be held in memory, the tables can then be joined group by group. If some group of one table is still larger than memory, it is hash-partitioned further, as is its corresponding group in the other table. The approach is also used in cluster computation: the source data is grouped by hash values and the groups are transferred to different nodes for small-sized JOINs. The whole process, with many data traversals and plenty of network transmission, is extremely complicated and cumbersome.
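To make the procedure concrete, here is a minimal Python sketch of the general hash-partitioned join (an illustration of the technique, not any product's actual implementation; the bucket count stands in for whatever number makes each bucket fit in memory):

def hash_partition(rows, key, nbuckets):
    # scatter the rows into buckets by hashing the join key
    buckets = [[] for _ in range(nbuckets)]
    for r in rows:
        buckets[hash(r[key]) % nbuckets].append(r)
    return buckets

def partitioned_join(left, right, key, nbuckets=4):
    lparts = hash_partition(left, key, nbuckets)
    rparts = hash_partition(right, key, nbuckets)
    out = []
    for lb, rb in zip(lparts, rparts):       # matching keys always share a bucket
        index = {r[key]: r for r in rb}      # one bucket is small enough for memory
        out += [{**l, **index[l[key]]} for l in lb if l[key] in index]
    return out

left = [{"id": 1, "qty": 3}, {"id": 2, "qty": 5}]
right = [{"id": 1, "price": 9.5}]
print(partitioned_join(left, right, "id"))   # [{'id': 1, 'qty': 3, 'price': 9.5}]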
SQL defines the JOIN of tables as their Cartesian product followed by a filter. The definition is simple yet all-embracing: an extensive range of relational operations can be expressed with the SQL JOIN, which generalizes to a wide variety of computational scenarios. Yet the very simplicity deprives it of the ability to exploit useful characteristics of the data or the computation to optimize performance. By qualifying the definition and trading away some of its comprehensiveness, we can approach relational operations according to their different contexts and optimize performance using the characteristics of each specific computational environment.
Here's an analogy to help understand this. Any integer multiplication can be represented with additions; with addition alone you can perform all such operations, so an arithmetic system with only addition can be considered complete. But once people singled out the addition of multiple copies of the same number as a special operation called multiplication and studied it, they invented a more efficient tool – the multiplication table – to do this type of addition faster. With it, you no longer have to add the numbers one by one.
SQL JOIN is like that all-encompassing addition-only system. If JOINs can be further classified, more efficient algorithms can be designed for the tasks in each context.

esProc classifies JOINs into two categories – foreign key joins and alignment joins. In-memory computing, including the clustered dimension table, handles foreign key joins, while pre-sorting helps handle alignment joins. With no repeated data traversals and, thanks to the clustered dimension table, only a very small amount of network transmission, the computational process becomes simple and the overall performance increases significantly. The theory behind SQL JOINs, however, is too simple to distinguish between dimension tables and fact tables; and SQL has neither ordered sets nor an object reference mechanism, so it cannot apply these techniques well.
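The two categories lend themselves to different algorithms. The Python sketch below (an illustration of the ideas, not esProc syntax) shows a foreign key join resolved through an in-memory lookup – which is what reference/pointer-style joins amount to – and an alignment join done as a single synchronized pass over two sets pre-sorted on the same key:

products = {1: ("apple", 5.0), 2: ("pear", 4.0)}     # a small dimension table held in memory

def foreign_key_join(facts):
    # each fact record reaches its dimension record directly, no sorting needed
    return [(pid, qty, *products[pid]) for pid, qty in facts]

def alignment_join(a, b):
    # a and b are lists of (key, value) already sorted by key
    out, i, j = [], 0, 0
    while i < len(a) and j < len(b):
        if a[i][0] == b[j][0]:
            out.append((a[i][0], a[i][1], b[j][1])); i += 1; j += 1
        elif a[i][0] < b[j][0]:
            i += 1
        else:
            j += 1
    return out

print(foreign_key_join([(1, 3), (2, 7)]))            # joins via direct lookup
print(alignment_join([(1, "a"), (2, "b")], [(2, "x"), (3, "y")]))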
Not all additions can be represented by multiplications, and not all JOINs fit into the two categories. In some scenarios the source data can't be pre-sorted and is so huge that it can't be loaded entirely into memory even with a clustered dimension table. But in most real-world scenarios, data can be handled with the techniques above.
For the scenarios where neither in-memory computing nor sorting can play a role, the hash-assisted, group-by-group join is still needed for the completeness of the computational system. esProc offers functionalities for exporting data by groups in batches and for remote file read/write, making it possible to compose that algorithm with cluster computing; but as the process is really complicated, it won't be detailed here. A semantic data model is expected to be added to esProc to develop a data warehousing product for big data computing, and the updated design will provide a more transparent syntax that encapsulates all those algorithms.

Report development

A computing engine for data preparation

A "zero-code reporting tool" is what the advertisements of many reporting tool vendors claim. This is pure wishful thinking. Many simple reports can indeed be built by dragging and dropping, but the complicated ones must still be designed through coding.
Any programmer with report development experience will tell you that writing complicated SQL, coding stored procedures and creating intermediate tables are commonplace in report building. Sometimes, when the SQL statements and stored procedures are too hard to code, a user-defined data source – a Java program – has to handle the computations.
Most reports are simple and only a small number are complex, but it is on this small number of complex reports that programmers spend much of their energy. Creating one report whose data preparation must be coded takes more time than building ten simple reports through drag-and-drop.

Certain reporting tools can realize multi-data-source associative operations and operations involving inter-row references, such as calculating the link relative ratio and the YOY growth rate, at the data presentation stage. But there remain many occasions where the data preparation is so complicated that a great deal of work is required. It's natural, then, to want a middleware that facilitates data preparation. Zero coding is impossible, but esProc's simple-code approach is available.


The above shows the reporting architecture with esProc introduced. Different from a conventional reporting process, the architecture adds a computing layer where the source data is processed before being handed to the presentation layer.

Facilitating report development

There are three benefits of introducing a computing layer for report building:
First and foremost, a computing layer reduces the difficulty of report development, or in other words increases its efficiency, which is what esProc is designed to achieve. Second, it optimizes the reporting architecture, making the reporting module independent and getting rid of unnecessary operations. Third, it enhances the computational performance of the reporting process by giving full play to the hardware and system resources.
Below is the detailed discussion.

Easier to use than Java and SQL
Traditionally, the data preparation for complex reports is performed in Java or SQL; stored procedures and intermediate tables are essentially implementations of SQL syntax. esProc syntax is simple and easy to understand compared with Java or SQL, which greatly simplifies the development process.
For the same computation, esProc code is much more concise than its Java counterpart. This makes coding faster and the algorithm easier to understand and debug. Most data preparation algorithms for reporting can be displayed wholly within one screen, so the code can be grasped as a whole, whereas a Java program expressing a complete set of business logic usually runs to hundreds of lines, which is inconvenient to review.
In the previous part, we compared esProc syntax with SQL syntax in various aspects. Apart from more efficient computational models, esProc also provides more methods for date and string handling than most SQL products.
Actually most complex cases can be handled in SQL, but the SQL way is too roundabout and out of line with natural thinking; even the programmer who wrote the code may forget how it works after a while, which makes maintenance inconvenient. esProc code, even when it implements the same algorithm as SQL, is intuitive, clear, and easy to understand and maintain.
Still, esProc can't completely replace SQL, though its syntax is simpler in most cases. Retrieving data from the database carries a high I/O cost; for data-intensive but simple computations, the time spent retrieving the data far exceeds the time spent computing, so such computations are better handled inside the database.

A wider support of operations than reporting tools
All reporting tools can handle computed columns, data grouping and data sorting. Some even support inter-row and inter-group operations and provide syntax for referencing cells and sets of cells by relative position, enabling rather complicated computations.
The computations reporting tools provide work in a descriptive mode: all expressions are laid out on the reporting interface, and the reporting tool determines their execution order from their dependency relationships. This is intuitive, and the computational target of each cell is clear as long as the relationships between the expressions are simple.
The descriptive mode becomes awkward when the dependency relationships are complex and the data pre-processing involves multiple steps. The workaround is hidden cells, which hurt the computation's intuitiveness; and since all the dependency relationships are handled in memory, they cause a lot of extra memory usage. Many operations are hard to express even with hidden cells. A process mode is then required.

For example, you might want to list the clients whose sales amounts add up to half of the total. Without a data preparation stage, you can only hide the unwanted records using hidden rows or columns; you can't filter them out in advance. Another example is sorting a grouped report that keeps its detail rows by the groups' aggregate values: the data must be grouped first and sorted afterwards, and many reporting tools can't control the relative order of grouping and sorting, so the report simply can't be built.
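The first example is easy once it is treated as a data preparation step. A minimal Python sketch of the logic (an illustration, not esProc syntax):

def top_half_clients(rows):
    # rows: (client, amount) pairs; pick the biggest clients whose combined
    # amount first reaches half of the total
    rows = sorted(rows, key=lambda r: r[1], reverse=True)
    half = sum(amount for _, amount in rows) / 2
    picked, running = [], 0
    for client, amount in rows:
        if running >= half:
            break
        picked.append((client, amount))
        running += amount
    return picked

print(top_half_clients([("A", 50), ("B", 30), ("C", 15), ("D", 5)]))
# [('A', 50)] -- client A alone accounts for half of the total of 100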
Round-off error control is particularly typical. The total of the rounded detail values probably doesn't equal the rounded total of the original detail values, creating a disagreement between the detail data and the totals. In that case the detail values must be given adjusted round-off values that agree with the rounded total. Reporting tools are helpless here.
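One common way to do this adjustment is the largest-remainder method: round everything, then nudge the values that lost or gained the most in rounding until the details add up to the rounded total. A minimal Python sketch (an illustration of this particular technique, not esProc's built-in behavior):

def controlled_round(details):
    rounded = [round(v) for v in details]
    diff = round(sum(details)) - sum(rounded)        # units still to distribute
    # order the values by how much each one lost (or gained) in rounding
    order = sorted(range(len(details)),
                   key=lambda i: details[i] - rounded[i],
                   reverse=(diff > 0))
    step = 1 if diff > 0 else -1
    for i in order[:abs(diff)]:
        rounded[i] += step
    return rounded

vals = [1.4, 1.4, 1.4, 1.4, 1.4]                      # the true total is 7.0
print(sum(round(v) for v in vals))                    # naive rounding totals only 5
print(controlled_round(vals))                         # [2, 2, 1, 1, 1] -- totals 7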
Though the logic of these computations is simple, reporting tools have great difficulty expressing them, and even a single SQL statement struggles. Nor is it worth writing a Java program or a stored procedure for a simple, one-off operation. esProc makes this much easier: the data preparation is performed separately, leaving only the intuitive part of the computation and the data presentation to the reporting tool, which can thus keep the strengths of descriptive-mode computation. The extra stage gives the report building a clearer structure.

A separate data preparation layer also supports dynamic data sources and data sets. The data sources a reporting tool uses are pre-defined and fixed; you can't select one dynamically through a parameter. esProc, however, can use its script to choose and connect to different data sources. Likewise, in conditional data retrieval with a reporting tool, SQL often can't express the condition through a simple parameter; the clause itself needs rewriting. Some reporting tools partly solve the issue with macros, but it's not easy to compute a macro and splice the result into a SQL statement; the conventional approach is a separate Java program that computes the macro and passes the result in, which just increases the workload. In esProc, all kinds of complicated macro calculations are easy to handle in the script.

Besides data computing tasks, an independent data preparation layer helps with layout. Many reporting tools support vertical multi-column layout, but few can handle horizontal multi-column layout; esProc can rearrange the records of a data set horizontally, so the reporting tool just has to present a resulting data set with more columns. Similarly, a lot of reporting tools don't support appending empty rows on the last page, whereas esProc can supply these rows in the returned data set.
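The rearrangement itself is a small transformation. A Python sketch of the idea (illustrative only, not esProc syntax):

def horizontal_layout(records, ncol):
    # lay records out left to right across ncol column groups, so the
    # reporting tool only needs to render rows that have more columns
    return [records[i:i + ncol] for i in range(0, len(records), ncol)]

print(horizontal_layout(["r1", "r2", "r3", "r4", "r5"], 3))
# [['r1', 'r2', 'r3'], ['r4', 'r5']]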

A consistent support of heterogeneous data sources
The data sources of today's reporting tools include not only databases but also text, JSON and XML files. These non-database sources lack the computational power for the operations some reports need, such as filtering, grouping and multi-table joins. Reporting tools themselves have trouble handling JSON and XML data because their own capability is inadequate; and even with structured files, on which they can perform basic handling, reporting tools run into capacity limits. So usually the non-database data has to be imported into a database before the real work of report development begins, creating extra workload.
esProc, however, is able to use these non-database data sources directly in data preparation for report development, saving the trouble of importing data into the database and reducing the workload.

In esProc, NoSQL databases and HDFS files are treated the same way when used as data sources. With its independent computational mechanism, esProc handles the various types of data sources uniformly, giving them a seemingly consistent computational ability. Files have little computational ability of their own; MongoDB supports JOIN and GROUP operations insufficiently; and databases of different vendors support window functions to different degrees, with different syntax yet uniformly weak capabilities in date and string handling. esProc's homogeneous approach to these heterogeneous data sources reduces both the cost of code migration and the cost of learning.

The data set a reporting tool can use is the single-level two-dimensional table that a SQL query returns. A multilevel JSON or XML file must first be transformed into multiple single-level data sets, which are then combined into a multilevel report through associative operations in the template. esProc supports generating and computing multilevel data sets directly and returning them to the reporting tool for presentation; with no transformation and no associative operation needed, the process becomes easier.
esProc can also handle a multilevel MongoDB document directly and return the result to the reporting tool.

Optimizing reporting architecture

esProc was initially designed to enhance report development efficiency. It turns out, however, that the product plays its biggest part in optimizing the reporting architecture. That change amounts to a second revolution in report development.

Interpreted execution that creates loose coupling architecture
Let’s look at the Java reporting architecture and esProc reporting architecture in data preparation.

Most Java applications compile code beforehand and load it statically: all modules are compiled and packaged before deployment, and the code is fixed at runtime. Java does offer dynamic compilation and loading, but the technique is so difficult and complicated that few programmers use it in daily coding; moreover, dynamic loading can't replace a class already loaded into memory, it can only keep adding new classes.
With this mechanism, a data preparation algorithm written in Java must be packaged and distributed together with the main reporting application, which tightly couples the reporting module to the main application.
Generally speaking, the reporting module is less stable than the main application because it changes more often, so a good reporting system should have a loosely coupled architecture in which the reporting module is independent. A report in its full sense involves two components – the template developed with the reporting tool and the data preparation algorithm written in Java – and a change to the report usually touches both. The template is generally stored in a file system or a database, while the data preparation algorithm is packaged into the main application's program package; stored in different places and managed in different ways, the two are hard to keep consistent, let alone form an independent reporting module.
As previously mentioned, Java mostly uses static loading. Any modification to the data preparation algorithm forces the whole application to be recompiled and redeployed; hot swapping is impossible.

By implementing the data preparation algorithm with esProc, a much lower coupling can be achieved between the reporting functionality and the main application.
Like the report template, an esProc script is a separate file. It doesn’t need to be bound to the main application for compilation and packaging, but can be stored, managed and maintained, together with the template, in the file system. The reporting module thus can gain its independence.
esProc is a dynamic, interpreted language. An esProc script can be modified without touching the main application and simply replaced in place – a natural fit for hot swapping.
A Java algorithm, once loaded, firmly occupies memory and can't be unloaded even if the corresponding report is never accessed again. There is no such worry with an esProc algorithm, which releases its memory after execution.

File-based algorithm storage: Lessening dependence on stored procedures
As with Java applications, preparing report data with stored procedures causes tight coupling – this time between the reporting module and the database.
It's likewise difficult to keep the stored procedures in the database consistent with the report templates in the file system.
Stored procedures are better than Java code at hot swapping, but recompiling a stored procedure requires high-level administrator privileges, and the frequent use of high privileges creates security risks for the database.
Worse than with Java applications, a database and its stored procedures may be shared by several applications. Poor management easily breeds tight couplings between multiple applications; over time the invocation relationships between stored procedures and applications grow confusing, and the whole situation can become a total mess.
esProc helps cut down the stored procedures in the database greatly. An algorithm is stored and managed in the file system along with the report template and becomes part of the reporting module, reducing its coupling with the rest of the application without adding any coupling with other applications.

Actually, writing a stored procedure well isn't easy. The traversal-style code achieves merely adequate performance and has very poor portability. The principle is to use as few stored procedures as possible.
Yet stored procedures remain necessary in some cases, namely to pre-process source data whose amount is too large to be moved out of the database.

File-based algorithm storage: Decreasing intermediate tables
With data-intensive and computation-intensive tasks, the original data is usually pre-processed to generate intermediate results based on which the report is built. Those intermediate results are stored as database tables, which are called intermediate tables.
In a system that has run for a long time, there are often more intermediate tables than original tables. The database of one mobile telecommunication company holds almost ten thousand tables, of which about five hundred are used to describe the most complicated business logic; the majority are intermediate tables. Database vendors never expected this.

Like stored procedures, a large number of intermediate tables spells disaster for database management.
Database tables are stored linearly, without categorization, so the intermediate tables in a database shared by multiple applications are difficult to tell apart. A project team without remarkable management and control ability can't sort them out. You can set naming rules for intermediate tables and enforce them, but the price is usually lower efficiency, and all rules tend to be ignored when a deadline looms.
Poor management is a prevalent phenomenon, letting intermediate tables accumulate to an alarming number. Ten thousand may be slightly exaggerated, but the situation is still horrible. Many of the tables are out of date, yet all must be kept because nobody knows which ones are useless; meanwhile the ETL process keeps performing meaningless updates on the useless data, wasting computational resources.

Why on earth should we store these intermediate results in a database instead of a file system?
Because it's impossible to pre-calculate the intermediate results for every parameter combination a report may use, some calculation must still happen at report time. That requires the intermediate results to have computational ability, which so far only databases can provide and files can't. That's why intermediate results become intermediate tables.

Generally, intermediate results are calculated from static historical data, so they don't need the transaction consistency a database maintains. Being essentially redundant data, they don't need a high degree of durability either: if they are corrupted, just run the computation again. The only purpose of putting them in the database is to gain computational ability, and that isn't worth it.
esProc can provide computational capability over files. With esProc you can store the intermediate data in a file system, effectively reducing the number of redundant intermediate tables in the database and releasing precious yet inefficiently used database space.
Intermediate data stored in a file system can be managed in a tree structure, and the report template and data preparation algorithm live in the same place, which facilitates integrated, consistent management. Additionally, since files don't have to handle data writes and consistency, they offer better I/O throughput than databases, raising the overall speed of report building.

Creating T+0 reports based directly on hybrid data
Relational databases' ability to maintain transaction consistency hasn't been challenged yet; transaction processing systems are still built on relational databases.
Under this circumstance, creating a real-time T+0 report over all available data means keeping huge amounts of historical data in the database that stores the current transactions. A database with huge capacity is thus required, which raises the cost; and even if the cost is acceptable, the ever-growing data will eventually drag the transactional performance down to an unacceptable level.

A general solution is to move part of the historical data out into a separate database, so that the transaction processing system runs without too heavy a burden. But this introduces cross-database querying, which complicates the development of a T+0 report.
Many databases support cross-database operations, but they require the databases involved to be of the same type. Historical data handling, unlike current transaction handling, involves a larger amount of data yet needs no transaction consistency, so users are likely to store the historical data in a data warehouse of a different structure.
Moreover, even with databases of the same structure, a cross-database operation typically works by mapping tables of one database into the database currently being handled. The operation actually takes place in one database, with extra communication cost for data transmission, unsatisfactory performance and poor stability.

esProc can handle this cross-database operation easily and effectively.
With its own computing engine, esProc is independent of the database, while the data is still handled where it resides. esProc issues SQL statements to the involved databases through multiple parallel threads; the databases process them in parallel and return their results to esProc for aggregation; the aggregate result is then passed to the reporting tool for presentation.
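A minimal Python sketch of this division of labor (an illustration of the pattern, not esProc's API; run_on_db stands in for issuing SQL over one database connection):

from concurrent.futures import ThreadPoolExecutor

DATABASES = {
    "history_2022": [3, 5, 8],   # stand-ins for the partial results each
    "history_2023": [2, 9],      # database returns, e.g. from
    "current":      [4],         # "SELECT SUM(amount) FROM sales ..."
}

def run_on_db(name):
    # hypothetical: connect to database `name`, execute the SQL, fetch rows
    return sum(DATABASES[name])  # each database returns its partial aggregate

with ThreadPoolExecutor(max_workers=len(DATABASES)) as ex:
    partials = list(ex.map(run_on_db, DATABASES))

print(sum(partials))             # the final aggregate handed to the reporting tool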
This architecture is easy to scale out and allows multiple historical databases of different structures.
esProc also supplies a server-based cluster computing model. Rather than keeping the historical data in a database, an esProc server allows placing it in a file system with better I/O performance. The combination of cluster computing and this storage plan brings higher performance at lower cost.

Handling various data sources directly
esProc can handle data stored in non-relational databases and in text files. This ability to use various data sources directly in report development not only saves the effort of importing data into a relational database but also keeps the application framework concise: no extra relational database has to be added just to gain computational ability, reducing both the cost and the risk of data inconsistency during importing.
In some respects, non-relational databases outperform relational ones; they just lack sufficient computational power or compute in their own distinctive ways. Making them cooperate with esProc retains their strengths. For example, MongoDB achieves significantly higher throughput than an ordinary relational database when ingesting append-only log data, but it is weak at handling structured data. With esProc's assistance, the data can stay in MongoDB to keep its high throughput while sufficient structured-data processing ability is added.

Elevating reporting performance

The performance bottlenecks of report development lie in three stages. The first is the data source, which causes the bottlenecks of reports involving a large amount of data; the symptom is the slow execution of the SQL issued to the database. The second is data retrieval: among the most used databases, Oracle and MySQL have poor JDBC drivers and, however fast they execute the SQL, are time-consuming at getting the data into the reporting tool. The third is the reporting tool's own inefficiency in certain operations, such as some associative and group operations.
A performance problem should be diagnosed by examining the logs generated during report building; then a remedy can be proposed. Quite probably all three stages are at fault.
esProc can optimize all three stages to some extent.

Implementing data computing tasks in esProc
Multi-source reports created by associating multiple data sets over a joining field are common in report development. This is a typical SQL JOIN, but in a report template the alignment relationships are defined piecemeal by expressions entered into cells. That method can't express the holistic relationship between the data sets, so the alignment is actually performed through traversal, with quadratic complexity. On most occasions a report has only dozens of rows and the calculations involved in such a JOIN barely affect the overall speed; but when a report has thousands of rows, there will be on the order of ten million calculations, and the performance problem appears.
You can use esProc to associate the multiple data sets and submit the resulting single data set to the reporting tool for display. Like SQL, esProc uses an efficient hash algorithm for JOINs; with this near-linear method, esProc speeds up a JOIN over thousands of rows by thousands of times.
Similarly, esProc uses hashing for group operations, which is much faster than the sorting algorithm reporting tools use.
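For comparison, here is a minimal Python sketch of hash-based grouping (illustrative, not esProc syntax): one pass over the data, each record finding its group through the hash of its key, and no sort at all:

from collections import defaultdict

def hash_group_sum(rows, key, value):
    groups = defaultdict(float)
    for r in rows:
        groups[r[key]] += r[value]   # the bucket is found by hashing the key
    return dict(groups)

rows = [{"area": "East", "amt": 10}, {"area": "West", "amt": 7},
        {"area": "East", "amt": 5}]
print(hash_group_sum(rows, "area", "amt"))   # {'East': 15.0, 'West': 7.0}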

The descriptive mode reporting tools use for computations can't reuse intermediate results: every computation must start over from the source data and the cell data, which makes operations over sets of cells quite inefficient. The alternative is hidden cells that keep the intermediate results, but they consume a lot of memory, noticeably affecting the performance of a Java application; besides, the appearance properties attached to cells also occupy memory during a reporting tool's computations.
None of these problems exists when esProc prepares the data. The process mode conveniently supports reusing intermediate results, and the data preparation is pure data handling, with no hidden cells and no appearance properties, so memory is used far more efficiently.

An in-memory cache can shorten the response time of report building and greatly improve the user experience, and almost all high-end reporting tools offer a caching functionality. But the mechanism is rigid: it can only cache the whole report, not a part of it, nor the part two reports have in common; neither can it set different life cycles for different reports' cached data. The visual configuration these tools adopt is simple, yet inconvenient once many complicated parameters have to be set.
Preparing data with esProc puts caching under the developer's control. esProc programs are code-driven, so developers can flexibly decide when and what to cache: caching part of a report, reusing a cache, and giving different cached data different life cycles can all be realized.

For high-concurrency report building, esProc offers a shared-memory mechanism that caches the data used for building reports in memory, where it can be accessed by multiple report-building processes. Accessing data in memory is several times faster than accessing a database or a file, and it is convenient to apply parallel processing that makes full use of modern multi-core CPUs; coding multithreaded parallel processing in esProc is very easy.
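The idea, sketched minimally in Python (an illustration of shared-memory caching plus parallel report builds, not esProc's API):

import threading
from concurrent.futures import ThreadPoolExecutor

_cache, _lock = {}, threading.Lock()

def get_dataset(name, loader):
    with _lock:                          # load each data set into memory only once
        if name not in _cache:
            _cache[name] = loader()
        return _cache[name]

def build_report(report_name):
    data = get_dataset("sales", lambda: list(range(1_000_000)))  # stand-in loader
    return report_name, sum(data)        # stand-in for the report computation

with ThreadPoolExecutor(max_workers=4) as ex:
    print(dict(ex.map(build_report, ["r1", "r2", "r3", "r4"])))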
For the same purpose, esProc also provides the in-memory byte table to raise a Java application's memory utilization. The source data is loaded into memory as bytes and objectified only when referenced. This costs about 30% in performance but increases the application's memory utilization by 3-5 times; even with the 30% loss, retrieving data from memory is still much faster than retrieving it from a database or a file.

Optimizing data source handling in esProc
Basically, you can't optimize database performance from a reporting tool or any other non-database technology; the only way is to optimize the SQL. Yet there are still occasions where esProc can help increase database performance.
Most of the time SQL executes efficiently. But when a statement is too complicated, with many subqueries and nested JOIN and GROUP clauses, the database may perform badly because it can't find the optimal route for executing it; and SQL's non-stepwise syntax makes it difficult for programmers to intervene in the execution manually.
We've already had several such cases. In one of them, a JOIN+WHERE+GROUP was performed over a series of subquery results. Executing the whole SQL statement was surprisingly slow – nearly 7 minutes – yet executing each subquery individually was fast, less than 1 minute in total. We then used esProc as the controlling program: each subquery plus its WHERE clause was executed separately, and the results were retrieved to carry out the subsequent JOIN+GROUP in esProc. The time dropped from nearly 7 minutes to a little over 1 minute, a roughly five-fold improvement.
We guess the disappointing performance comes from the database splitting the subqueries and executing each of them together with the JOIN. A programmer expert in database performance optimization could probably lead the database to the optimal execution route by adjusting certain optimization parameters, but that is a big challenge for most programmers. esProc lets programmers control the execution path flexibly by moving part of the computation outside the database, thus achieving the optimization.

As mentioned previously, some databases, such as Oracle, have poor JDBC performance, while reporting performance relies heavily on data retrieval. With a light load the database executes the SQL fast, yet it takes an unbearable several minutes to retrieve the data. A solution is to open multiple database connections and retrieve the data in segments through multithreaded parallel processing. Tests show parallel retrieval can raise performance several times over. A reporting tool can't realize this by itself, but esProc makes it easy.
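A minimal Python sketch of segmented parallel retrieval (illustrative only; the in-memory TABLE stands in for a database table, and each call to fetch_segment stands in for one connection running a query like "SELECT ... WHERE MOD(id, n) = k"):

from concurrent.futures import ThreadPoolExecutor

TABLE = [{"id": i, "amt": i % 100} for i in range(1_000_000)]  # stand-in for the table

def fetch_segment(k, n):
    # stands in for one connection executing "SELECT * FROM t WHERE MOD(id, n) = k"
    return [row for row in TABLE if row["id"] % n == k]

def fetch_parallel(n=4):
    with ThreadPoolExecutor(max_workers=n) as ex:
        segments = ex.map(lambda k: fetch_segment(k, n), range(n))
    return [row for seg in segments for row in seg]

print(len(fetch_parallel()))   # 1000000 rows, retrieved over 4 parallel segments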

Some data source handling algorithms are written in Java. Theoretically esProc, whose development, interpretation and execution are all Java-based, shouldn't outperform Java itself, which is lower-level and should be sharper. The truth is that most application programmers are good at coding business logic but not at writing low-level algorithms. Many of them, for example, don't know how to write a hash-based GROUP algorithm, or are simply unwilling to because the code is complicated, and perform the GROUP through sorting instead. Coded that way, the operation is far less efficient in Java than in esProc, which provides many ready-made efficient algorithms.

Sometimes the data source of a report must be prepared in advance, as with the intermediate tables mentioned above. This preparation is likewise implemented in SQL, stored procedures or Java, so the above analysis applies to it too: esProc can elevate the performance and shrink the ETL time window. The prepared data can moreover be stored outside the database to gain higher I/O performance and to reduce the database load during report generation, further optimizing the overall performance of the reporting application.

Handling big data reporting in esProc
esProc's multithreading mechanism enables cluster-style computing over multiple databases for higher performance, and the mechanism supports scaling out too.

The creation of a T+0 report adopts a multi-database clustering algorithm. The source data is divided into multiple segments stored in different databases, each holding an amount of data modest enough to ensure good performance. esProc issues SQL commands to these databases through multiple threads, collects and aggregates their results, and submits the result to the reporting tool for presentation. Common operations like filtering, grouping and aggregation are easy to carry out this way, and more databases can be added to scale out the cluster as the data amount keeps growing.
When a database's own clustering strategy is used instead, the configuration complexity and environment requirements are much higher than the esProc scheme needs. Moreover, the esProc multi-database clustering scheme allows databases of different structures, such as a cluster combining Oracle databases on a minicomputer with MySQL databases on PC servers.

The I/O performance of a file system is higher than that of a database. The first reason is probably that a database sets aside space for data writes, which makes the storage less compact; the second is that, to keep consistency between reads and writes, it often has to scan rollback segments, which also takes time. In any case, it's a proven fact that databases have lower I/O performance than files.
esProc moves data outside the database and stores it in a proprietary compressed format to achieve better performance than databases. According to our tests, esProc and Oracle are almost equal in single-threaded traversal of big data, while esProc gains an apparent advantage once multiple threads are used. When the data amount is several times the memory capacity, Oracle's parallel option is almost ineffective, whereas esProc's parallel processing still multiplies the performance.
esProc supports columnar storage. Compared with row-wise databases, it elevates performance by orders of magnitude for data queries and reports that access only a small number of columns, and the columnar storage supports segment-based parallel processing as well.
In script execution, esProc's interpreted code performs remarkably better than Oracle stored procedures. For complicated process-mode computations that are hard to express in SQL, you can move the data outside the database and handle it in esProc, which is more efficient than processing it with a stored procedure inside the database.
If the memory is large enough, data can be loaded into it for still better performance. esProc supports references between in-memory records, which makes JOIN algorithms simpler to express and much faster than the traditional SQL foreign-key mapping; a test shows esProc's pointer-based joins to be more than twice as fast as Oracle's foreign-key joins.
In addition, esProc offers server-based cluster computing, which can further speed up data-intensive computations.