In fact, sets are a data type frequently encountered when analyzing data. Their absence makes many data analysis tasks difficult, even impossible, so Excel users must write VBA code to accomplish them. The problem is that VBA is difficult and extremely inconvenient.

esCalc provides the set data type, and also various set functions and operations based specifically on sets, enabling users to carry out complex data analysis tasks.

**Row-wise sets**

The following Performance table records the performance grade of each employee in each month. We want to find the employees who received an A for at least three consecutive months.

It’s simple to do this in esCalc. We just select any detail data row, such as the second row, and perform a filter operation with the expression [B2:M2].group@o().pselect(~(1)=="A" && ~.len()>=3). [B2:M2] is a set consisting of the cell values in row 2 from column B to column M. Based on this set, we perform an ordered grouping that puts equal grades in consecutive months into the same group, then check whether any group holds three or more consecutive A’s. Each step operates on set-typed data.
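The logic of that filter expression can be sketched in Python, where `itertools.groupby` plays the role of the ordered grouping (the grade list below is made up for illustration):

```python
from itertools import groupby

def has_three_consecutive_a(grades):
    # group@o() groups only adjacent equal values, which is exactly what
    # itertools.groupby does; pselect(...) then looks for a qualifying group.
    return any(grade == "A" and len(list(run)) >= 3
               for grade, run in groupby(grades))

# Illustrative grades for one employee, January through December
grades = ["B", "A", "A", "A", "C", "B", "A", "B", "A", "A", "B", "C"]
print(has_three_consecutive_a(grades))  # True: months 2-4 are all A
```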

**Column-wise sets**

There are duplicate names in the employee name list; we want to delete the rows holding the duplicates while keeping the original order.

As in the preceding example, we perform a filter operation on the detail data rows (select row 2 again) with the expression {A2}.pos(A2)==#. {A2} represents the set consisting of A2 and its homo-cells. The expression finds the position where the value of A2 first appears; if that position isn’t where A2 sits, the name is a repeat and the corresponding row is deleted.
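The same "keep only the first occurrence, preserve order" filter can be sketched in Python:

```python
def first_occurrence_rows(names):
    # {A2}.pos(A2) finds where a value first appears; keeping a row only
    # when that position equals the row's own sequence number (#) amounts
    # to keeping each name's first occurrence in the original order.
    seen = set()
    kept = []
    for name in names:
        if name not in seen:
            seen.add(name)
            kept.append(name)
    return kept

print(first_occurrence_rows(["Tom", "Ann", "Tom", "Joe", "Ann"]))
# ['Tom', 'Ann', 'Joe']
```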

**Expand operation**

Transpose the Student Score table

To a table like this:

We regard the operation that splits and expands one row into multiple rows as the inverse of grouping. It’s almost impossible to do this in Excel without turning to VBA, while a few steps suffice in esCalc:

In column E, enter the formula =[B2:D2] in E2 to get a set composed of the values in columns B, C and D. Column E will then have sets as its values.

esCalc provides expand operation performed based on a set. Now perform the operation on E2 to expand the row into multiple ones.

Add column F and fill it with subject names using the formula F2=[B2:D2]((#-1)%3+1), which retrieves members from a set by their serial numbers.

Finally, delete columns B, C and D, swap columns E and F, and complete the headers.
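The overall expand operation, from one wide row per student to one row per subject, can be sketched in Python (the subject names and sample rows are assumptions for illustration):

```python
def expand_rows(rows, subjects=("Math", "English", "Science")):
    # The inverse of grouping: one (student, s1, s2, s3) row becomes one
    # (student, subject, score) row per subject.
    out = []
    for name, *scores in rows:
        for subject, score in zip(subjects, scores):
            out.append((name, subject, score))
    return out

rows = [("Tom", 90, 85, 88), ("Ann", 76, 92, 81)]
for r in expand_rows(rows):
    print(r)
```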

Excel uses Lookup functions to associate tables. They are similar to the SQL left join. SQL also has the inner join, right join and full join, among which the inner join can be implemented by filtering after a left join, and the right join is simply a left join with the joining direction reversed. The full join, however, can’t be performed automatically in Excel.

The biggest problem with Lookup functions is their complicated usage. They need to specify the joining column, the joining scope and the referenced column, and each look-up returns only one referenced column, so referencing multiple columns requires multiple Lookup statements with the same query condition. Not only is the writing troublesome, the method also performs poorly because of the repeated operations. In fact, as a traversal-style query method, Lookup functions are very inefficient at searching associated data.

Based on the SQL model, esCalc supports the whole set of join operations, including inner join, left join and full join, and can reference multiple columns at once from the associated table by specifying the joining cells in the two worksheets. This is much simpler than the Excel method. To join the *performance* table and the *attendance* table, for instance, set the master cells (i.e. the joining cells), then copy the to-be-referenced cells in the *attendance* table and paste them into the *employee performance* table using the JOIN operation.
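The multi-column referencing of such a left join can be sketched in Python. The field names (`emp_no`, `absences`, `late`) and sample rows are assumptions for illustration:

```python
def left_join(performance, attendance):
    # Build a lookup on the master (joining) cells -- the employee numbers --
    # then pull several referenced columns at once. The attendance table lists
    # only employees with absences, so misses become None, like the empty
    # cells a left join leaves behind.
    lookup = {row["emp_no"]: row for row in attendance}
    joined = []
    for row in performance:
        match = lookup.get(row["emp_no"], {})
        joined.append({**row,
                       "absences": match.get("absences"),
                       "late": match.get("late")})
    return joined

performance = [{"emp_no": 1, "grade": "A"}, {"emp_no": 2, "grade": "B"}]
attendance = [{"emp_no": 2, "absences": 3, "late": 1}]
print(left_join(performance, attendance))
```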

Here’s the *performance* table, in which A2 and its homo-cells are set as the master cells where the employee numbers are stored:

Here’s the *attendance* table, which contains only the employees who have had absences, and in which A2 and its homo-cells are master cells holding the employee numbers:

To perform a join operation, select B2 in the *attendance* table and press Ctrl+C to copy, and then select E2 in the *employee performance* table and press Ctrl+Alt+J to choose and execute the Left join:

After that the resulting *employee performance* table is as follows:

The esCalc join operation also supports multi-level worksheet tables. For example, the employees are stored in groups according to their states, and the attendances are recorded in the same way. The multi-level join first aligns the tables according to the groups and then finds the joining rows within each group. This way errors won’t occur even if there are employees with the same name in different states, and the result set keeps the detail data of every group neat and complete.

Here’s the *duty* table, in which master cells hold state names and employee numbers:

Here’s the *employee* table, in which master cells also hold state names and employee numbers:

In this *employee* table, select B3 and C3 at the same time and copy the employee information, and then select C3 in the *duty* table and perform left join. Here’s the result:

We mentioned earlier that there are records with sub-records. In many cases such hierarchical records are generated by group operations.

The Excel data model doesn’t support multi-level worksheets. Though a group operation is provided, it is treated as a special case. Aggregate operations performed on the summary level after grouping must use SUBTOTAL, which is hard to remember, instead of familiar functions like SUM/COUNT; otherwise group members won’t be located correctly.

As mentioned previously, for formulas in cells at the detail level, on one hand we can’t simply drag-and-drop to copy them in a batch (because the detail data consists of inconsecutive areas separated by summary rows) but can only copy across groups manually. On the other hand, when formulas reference cells at the summary level or involve cross-row calculations (such as percentages or YOY rates), Excel’s formula-copying rule can’t guarantee a correct copy even manually, and mistakenly copied formulas then have to be fixed by hand. All this becomes unbearably tedious when there are many groups.

Excel provides the $ sign to reference summary cells in a one-level worksheet, but it is helpless with a multi-level worksheet.

esCalc has a data model that supports multi-level tables. Aggregate operations performed on the summary level after grouping still use common functions like sum/count. In particular, esCalc distinguishes the levels to which cells belong, and handles the copying of formulas that reference cells at the detail level (including inter-row references) and at the summary level according to the situation: intra-group copying only adjusts cells at the detail level, while cross-group copying also changes the cells at the summary level. Users just reference the desired cell intuitively, without marking levels themselves with the $ sign (which can only reference data from one level anyway and falls short of the need). With esCalc, formulas are copied correctly even when calculations involve multiple levels of summary data.

Calculate the average temperature difference in each month, for example, based on the following sheet:

esCalc stores the same type of data in homo-rows. Thus as E3 calculates the average temperature difference in January, its homo-cells corresponding to other months calculate their respective average temperature differences at the same time, saving users the trouble of copying formulas. Here’s the result:

In the above data, the month rows are sub-rows and their parent row holds the quarter data. Calculations performed on the sub-rows won’t affect the parent row, and data handling in the parent row won’t affect the sub-rows. For instance, enter the formula ={A3}.count() in E2 to calculate the number of records in each quarter. Here’s the result:

esCalc formulas can be intelligently copied according to different data structures, instead of being mechanically copied according to positions of cells. The esCalc copying rule is more reasonable.

Another example is to calculate the precipitation based on the *climate* table:

E2 and E6 respectively calculate the average precipitation of their quarter; E2’s formula, for example, is ={D3}.avg(). To calculate the difference between each month’s precipitation and the average of its quarter, just enter the formula =D3-E2 in E3. Here’s the result:

Formulas are intelligently adjusted during copying according to the hierarchical level of the target cell. Click on E6 and we see the formula has been adjusted to ={D7}.avg(), which calculates the average precipitation of its quarter; click on E8 and the formula has become =D8-E6, which subtracts the quarter’s average precipitation from the current month’s precipitation. So esCalc copies the formula correctly both to a cell on a group’s summary row and to one on a group’s detail row.

**Post-grouping operations**

The specialness of Excel’s data grouping also shows in the difficulty of post-grouping operations. We can’t freely sort or filter a grouped worksheet table the way we do a single-level table.

For example, to rank sellers by performance, we want to group and aggregate the order records by seller and then sort the groups by aggregate amount. During the sorting, the members of each group must move together with its aggregate value, but Excel can’t perform this kind of sorting automatically. In a variation of this example, for each seller we want to delete the small orders, each making up less than 1% of the seller’s total sales amount, and then recalculate the total. This requires grouping the data, calculating each member’s percentage within its group, and filtering all groups by those percentages (the non-related calculations discussed above are also needed here). Excel can’t do all this at once because it lacks support for operations on multi-level worksheets; it can only handle the groups one by one.

esCalc treats a multi-level worksheet as normal and opens it to all operations, so the above scenarios are easy to handle. During a post-grouping sort by aggregate values, the detail rows of a group move together with their summary row, which is again an application of esCalc’s record concept (a group together with its members can be regarded as one record). A post-grouping filter on detail rows is applied to all groups at once.
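Both scenarios, dropping orders below 1% of a seller’s total and then sorting groups by the recalculated totals, can be sketched in Python (the order data is made up for illustration):

```python
def rank_sellers(orders, threshold=0.01):
    # Group orders by seller, drop each order below `threshold` of its
    # seller's total, recalculate the total, then sort the groups by total
    # in descending order -- detail rows travel with their summary value.
    groups = {}
    for seller, amount in orders:
        groups.setdefault(seller, []).append(amount)
    result = []
    for seller, amounts in groups.items():
        total = sum(amounts)
        kept = [a for a in amounts if a / total >= threshold]
        result.append((seller, sum(kept), kept))
    result.sort(key=lambda g: g[1], reverse=True)  # groups move as a whole
    return result

orders = [("Ann", 500), ("Ann", 3), ("Joe", 900), ("Joe", 200)]
print(rank_sellers(orders))
# [('Joe', 1100, [900, 200]), ('Ann', 500, [500])]
```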

Here’s the *order* table:

F1 calculates the total sales amount of the orders with the formula ={E3}.sum(). F3 calculates each order’s percentage of the total amount with the formula =round(E3/F1,4).

esCalc permits various operations on a grouped worksheet, such as filtering. To delete every order whose amount makes up less than 1% of the total sales amount, we select F3 to do the filtering:

Here’s what we get through data filtering:

**Structure editing**

Excel doesn’t support inserting or deleting a data level in a grouped worksheet. To change the existing data structure, we need to clear the groups and re-group the data, making the work done on the summary level (the calculated cells) a waste. Yet sometimes it is the summary values, not the details, that we want.

For example, in the worksheet from which the small orders have been removed, we need to re-group the records by the ordered products to see which products are most popular among each seller’s remaining orders. This means inserting another group level into the already grouped worksheet, then aggregating and sorting each group. Since only the group-and-aggregate results interest us, we also want to delete the detail level. Excel can’t perform these operations automatically; we have to copy the intermediate results into a new worksheet for further handling, and because the grouped data is not continuous, even the copying can’t be done automatically.

In esCalc, we re-group the preceding worksheet table by products, and here’s the result:

We can do further computations based on the re-grouped worksheet. To calculate the total sales amount for each seller, for instance, we enter the same formula ={E4}.sum() in both E2 and E3. Here’s the result:

We entered the same formula in F1, E2 and E3 to calculate total sales amounts, but we get different results because the cells sit at different levels.

Now we select E2 to perform a sorting in descending order to sort the worksheet data by the total sales amounts of sellers. The result is as follows:

In esCalc, when grouping rows move because of sorting or other operations, their sub-rows will follow suit.

There’s nothing special for esCalc in carrying out these operations, because it defines the hierarchical level as a native property of the worksheet, enabling free insertion or deletion of a level and automatic copying of an action on one row to all its homo-rows (a concept analogous to homo-cells). So all detail rows are deleted simultaneously when we execute a delete action on any one detail row.

The data model of the esCalc spreadsheet encompasses a hierarchical structure, making grouping and ungrouping normal operations that can be performed on the same worksheet, just like filtering and sorting. Here’s an analogy between spreadsheet data models and number systems. Within the integers we are free to add, subtract and multiply, but we can’t divide at will, because the quotient isn’t necessarily an integer. If we expand the range to the rational numbers, division becomes natural as well, though the rules for the other operations must be redefined in the expanded scope. Likewise, when esCalc extends the worksheet data model to include a multi-level structure, it also redefines the rules for sorting, filtering and generating computed columns (to support smart copying of formulas across levels, for instance). Related operations can then be performed consecutively, letting an interactive data analysis proceed smoothly.

Records are represented by the rows in an Excel worksheet. Users can perform operations such as filtering and sorting on the rows, and, in particular, add computed columns (whose values are computed from other fields). It’s in this latter case that formula copying becomes a problem.

Adding a computed column involves all records (rows), but as Excel has no explicit concept of records, the formula entered in one row must be manually copied to the others. Excel cleverly uses the drag-and-drop method for this, which is very convenient for single-row records (where each record corresponds to one row).

But at times the worksheet data we are handling is complicated in that one record spans more than one row: the record may have so much content that it takes up two rows, or it may include lower-level sub-records (the details of an order, for instance). In those cases the cells to which the formula should be copied are no longer continuous, and the drag-and-drop method becomes powerless. Imagine the hassle of doing all the copying manually, row by row.

esCalc solves the problem by retaining Excel’s intuitive way of naming data items after cells while introducing the concept of explicit records, combining the strongest points of Excel and database client software. A formula entered in a cell is automatically and correctly copied to its homo-cells (the cells of the same field in other records) without a separate copying action, even with multi-row records and records with sub-records.

Here’s an *order* table:

F1 calculates the total order amount using the formula ={E2}.sum(). We then enter the formula =round(E2/F1,4) in F2 to calculate the percentage of the current order’s amount in the total, that is, dividing the amount of the current order by the total value in F1. At the same time, we set the display format of F2 to #0.00%, which represents the value as a percentage. After entering the formula in F2, here’s what we get:

Check F2’s homo-cells (F3~F11) and we find they’ve all finished their computations. This shows that esCalc automatically and correctly copies both the formula and the display format of a cell to its homo-cells.

It can also copy the formula in handling multi-row records as conveniently as in handling the single-row records, for example:

The worksheet contains the unit prices and quantities of vegetables and fruits purchased. D3 calculates the purchasing amount of the pineapple with =floor(D2*B3,2). Here’s the result after the formula is entered:

As soon as the formula is entered, it is copied to the corresponding cells of all products, i.e. D3’s homo-cells, to calculate their respective purchasing amounts.

**Data editing**

Excel is again at a disadvantage in editing multi-row records.

Excel doesn’t handle a record as a whole. Inserting, deleting and moving a record are operations performed on the rows and columns of a worksheet. That’s hardly a problem with single-row records, but row operations become complicated with multi-row records and records with sub-records, and inserting or deleting fields by column becomes almost impossible.

Because Excel isn’t good at handling multi-row records, it generally avoids producing them when generating the original data, so Excel users don’t encounter them often. In many real-world businesses, however, it’s no rarity to face a multi-level worksheet or multi-level data items. Moreover, group operations generate multi-level tables, as we’ll see later.

Even with single-row records, Excel still makes mistakes copying formulas for inter-row calculations (such as YOY rates and accumulated values) when rows are inserted or deleted. The same problem arises when moving records via copy-and-paste. Both cases require fixing the results manually or recopying by drag-and-drop, and since Excel doesn’t stress the concept of records, it offers no hot keys for record processing, making the fixes inconvenient.

But these operations are easy for esCalc, which defines records, and it provides convenient hot keys to trigger them. Records (including their sub-records) can be deleted or moved as a whole with one click, after which the inter-row calculation formulas remain correct. Inserting or deleting fields by column changes the data structure; esCalc automatically copies such an operation on one cell to all its homo-cells.

Here’s the *employee* table:

D3 holds the formula =age(C3), which, together with its homo-cells, calculates the age of each employee. Meanwhile C2’s formula ={B3}.count() calculates the number of employees in each department, and D2’s formula =round({D3}.avg(),1) calculates their average age. Suppose we want to delete the duplicate department values in the first field of the *employee* table without affecting other data items. We select B3 and press Ctrl+Backspace to delete A3 and its homo-cells. Here’s what we get:

All homo-rows change their structure at the same time, and the formulas in C3 and its homo-cells adapt intelligently to the new structure; C3’s formula, for instance, automatically becomes =age(B3).

In esCalc we can also change the structure of only the summary rows. For instance, to delete the blank cells in the second column of the department summary rows, select C2 and press Ctrl+Backspace to delete A2 and its homo-cells. Here’s what we get:

This is a *membership management* table:

The worksheet table records the number of members who join and who leave each month. Enter ==D7+B8-C8 in D8 to calculate this month’s membership from last month’s count plus this month’s new members minus this month’s withdrawals. The expression starts with two equal signs, marking a related calculation: the corresponding cells adjust intelligently to any change in the table.

Now we insert the records of the missing months April and May in the table and enter data to them. Here’s the complete table:

Because esCalc also stores the inserted data as homo-rows, the calculations in column D are still done correctly, and the membership statistics update automatically with the data. If formulas were adjusted by cell position instead of structure, errors would occur when new rows are inserted.
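The chained calculation ==D7+B8-C8 can be sketched in Python: each month’s count depends on the previous month’s, so after inserting the missing months the whole chain simply re-evaluates. The opening count and monthly figures below are made up for illustration:

```python
def member_counts(rows, opening):
    # This month's count = last month's count + joins - withdrawals,
    # the relation the ==D7+B8-C8 chain expresses cell by cell.
    totals = []
    current = opening
    for month, joined, left in rows:
        current = current + joined - left
        totals.append((month, current))
    return totals

rows = [("Jan", 12, 3), ("Feb", 8, 5), ("Mar", 10, 2)]
print(member_counts(rows, opening=100))
# [('Jan', 109), ('Feb', 112), ('Mar', 120)]
```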

**Non-related calculations**

Cells in Excel calculate in a related way: once the value of a referenced cell changes, a calculation cell re-calculates, and if the referenced cell is deleted, an error occurs in the calculation cell.

But scenarios like these are more common: after the values of a computed column are obtained, the values of the cells its formulas referenced become useless and deletable; or we change an original value referenced by a computed column and then compare the new result with the old one, expecting the old value to remain as it was. For instance, the original data contains people’s birthdays. Sometimes only birthdays in a certain period are needed to compute ages for subsequent computations, so the birthday values can be deleted once the ages are obtained; at other times the birthday values are changed and the ages recalculated accordingly. Neither scenario is easy to handle in Excel.

esCalc offers two types of calculation cells: related calculation cells and non-related calculation cells. The value of a related calculation cell changes along with the referenced cell, as in Excel; a non-related calculation cell becomes independent of the referenced cell once it finishes calculating, so neither a change to nor the deletion of the referenced value affects it.

In reality, there are more non-related calculations than related calculations during interactive data analysis.

This is the *population* table of the state of Alaska:

C8 and its homo-cells calculate the growth rate of every census for the state of Alaska. The formula in C8 is =round((B8-B7)/B7,3) and the display format is #0.0%. Now we select C2 and sort records by the growth rate in descending order. Here’s the result:

Since each formula in column C is headed by a single equal sign, C2 and its homo-cells are non-related calculation cells that keep their values unchanged, rather than re-calculating wrong growth rates against the new order.
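The non-related (snapshot) behaviour can be sketched in Python: the growth rates are computed once from the original row order, stored as plain values, and then sorting cannot disturb them. The census figures below are made up for illustration:

```python
# Compute each growth rate once, relative to the row above in the ORIGINAL
# order, then store it as a plain value (a snapshot). Reordering the rows
# afterwards can't trigger a recalculation against the wrong neighbour.
populations = [("1990", 550043), ("2000", 626932), ("2010", 710231)]

rates = [None] + [
    round((populations[i][1] - populations[i - 1][1]) / populations[i - 1][1], 3)
    for i in range(1, len(populations))
]
rows = [(year, pop, rate) for (year, pop), rate in zip(populations, rates)]

# Sorting by rate leaves each stored rate attached to its own row.
ranked = sorted((r for r in rows if r[2] is not None),
                key=lambda r: r[2], reverse=True)
print(ranked)
```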

In effect it is Excel, rather than the many BI tools, that is the most widely used desktop data analysis tool.

Excel is simple, intuitive, easy to use and understand, and particularly suitable for average analysts who can’t program and have no knowledge of mathematical models. Moreover, the result of each action on an Excel worksheet is shown immediately, letting the user decide the next move. This is the typical way analysts perform analytics: it is neither necessary nor possible to model everything beforehand.

But we’ve found some Excel defects as data analysis becomes increasingly complicated. Simply and briefly put, they are the “4M” problems:

**1. Multi-row records**

Excel has no definite concept of structured records. A single-row record corresponds to a single row; if a record has too many data items and occupies multiple rows, or if it has sub-records, it is a multi-row record. Editing a multi-row record and performing operations on it is very complicated.

**2. Multi-level tables**

Excel provides functions for grouping data, yet many operations on the resulting hierarchical table become impossible or must be performed differently than on an ungrouped worksheet, so consecutive operations are hard to carry out. What’s worse, grouping generates multi-row records, across which formulas referencing the aggregate values can’t be copied correctly and intelligently, leading to computing errors.

**3. Multi-table joins**

Excel isn’t a relational-algebra-based product. It has no specialized functions for joining tables; it only provides functions such as Lookup for simple cross-sheet cell references, which are complicated to use and perform poorly.

**4. Multi-member value**

Excel does have a notion of sets, enabling aggregate operations over cells within a range, but it provides no explicit set data type, making it impossible to store set-typed values and limiting its set-oriented operations.

We’ll further explain the problems later through examples.

**esCalc**

To solve those Excel problems, we created another model for handling the spreadsheet data, and from this model the brand-new spreadsheet software – esCalc – was born.

Rather than an improved version of Excel, esCalc bases its data and operational models directly on relational algebra, so it has more in common with the relational database. esCalc defines records and provides most of SQL’s computational features, such as computed columns, sorting, filtering, grouping and distinct, as well as joins and unions between multiple tables. With its interactive spreadsheet interface, esCalc can be regarded as a visual SQL calculator.

Unlike general database client software, which references a data item by field name, esCalc inherits Excel’s grid style, in which cells name data items and describe formulas, and supports automatic, intelligent copying of formulas. This more intuitive way is easy for laymen to manipulate and handy for expressing order-related computations, which SQL isn’t good at.

Furthermore, esCalc includes a multi-level data model to increase related computing capabilities on the basis of SQL, enabling it to support hierarchical tables containing a main table and its sub-tables, to perform operations such as filtering, sorting, re-grouping and ungrouping on the grouped worksheet, and to copy formulas automatically and intelligently between cells at different levels.

esCalc is designed for interactive data analysis. It doesn’t support functionality as extensive as Excel’s, but it’s more sophisticated and more adept at handling batched data. esCalc is intended as a cooperator of Excel, not a rival. Not only can esCalc retrieve an xls file to analyze and process it, it can also export its results in xls format for further processing in Excel. Note, though, that esCalc isn’t an Excel plug-in; it is an independent application.

Compared with Excel, the big data-handling functionalities esCalc lacks are VBA scripting and the pivot table. In our opinion, VBA is too difficult for average analysts; besides, many VBA scripts exist only to complement functionality that is inconvenient in Excel, and with that functionality provided better in esCalc, VBA scripting matters less. For analysts who can program, we provide esProc, another product of the RaqSoft software family, for occasions where scripts are needed; it offers scripting abilities far more powerful than VBA. As for the pivot table, Excel already excels at this feature and leaves little room for improvement, so esCalc simply aims to generate xls files as data sources for Excel pivot tables.

Now let’s discuss the above-mentioned “4M” problems of Excel in data analysis through examples, and provide their esCalc solutions.

**Dynamic data sources**

In reporting tools, the data source a report uses is generally fixed, and report parameters only specify the condition (the SQL WHERE clause) for generating a data set, rather than define the data source itself. To define a report’s data source through parameters, most reporting tools require writing a program against their API, which is complex.

esProc, which can serve as the report’s nominal data source, meets the need easily: in the esProc script, we use a parameter to connect to the desired data source, retrieve the data and return it.

|   | A |
|---|---|
| 1 | =${pds}.query("select * from T where F=?",pF) |

The parameter *pds* is used to pass in the data source name. This way it doesn’t matter whether the reporting tool supports dynamic data source connection or not.

The same method applies to a similar situation, where the report requires that the main report and the subreport use different data sources but the reporting tool only allows them using the same one. We make esProc the nominal common data source of the main report and the subreport, while the actual data source is determined by the parameter in the esProc script.

**Dynamic datasets**

With reporting tools, parameters, which are also the SQL’s parameters, are normally used to specify the conditions for creating data sets. At times, instead of merely substituting parameter values into the SQL, we want part of it, the whole WHERE clause for instance, passed in through a single parameter to gain a more flexible query condition.

Some reporting tools can do this with macros, but others must resort to their APIs, changing the data set defined in the report template through recoding, which is not smart at all. esProc, however, is extremely neat:

|   | A |   |
|---|---|---|
| 1 | ="select * from T" + if(where!=""," where "+where,"") | Append the WHERE clause; skip it if there’s no condition |
| 2 | =db.query(A1) |   |

Even with macro support, these reporting tools cannot compose SQL easily. For example, to aggregate over a passed-in list of fields, we must wrap each field in sum(). Average reporting tools have no direct means of string concatenation and must use an API, or do it beforehand in the main program. This too is easy in esProc.

|   | A |   |
|---|---|---|
| 1 | =sums.array().("sum("+~+") as "+~).string() | Turn a,b into sum(a) as a, sum(b) as b |
| 2 | =db.query("select G,"+A1+" from T group by G") |   |
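The string manipulation the esProc line performs can be sketched in Python (the field list "a,b" is illustrative):

```python
def sum_clause(fields):
    # Turn "a,b" into "sum(a) as a, sum(b) as b" -- the same string
    # manipulation the esProc line does with array() and string().
    return ", ".join("sum({0}) as {0}".format(f) for f in fields.split(","))

sql = "select G, " + sum_clause("a,b") + " from T group by G"
print(sql)  # select G, sum(a) as a, sum(b) as b from T group by G
```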

**Controlled data retrieval**

Because memory is limited, we make the reporting tool retrieve at most 10,000 rows at a time. If there is still data left after the maximum number of rows is retrieved, we supply an extra row marked “Continuing” to show that the retrieval is incomplete. Reporting tools usually can only retrieve a fixed number of rows; handling this kind of controlled retrieval requires complicated API code.

esProc handles this controlled data retrieval with ease:

A | B | ||

1 | =db.cursor(“select * from T”) | =A1.fetch(1000) | |

2 | if A1.fetch@0(1) | >B1.insert(0,”Continuing”) | Mark the irregular retrieval |

3 | >A1.close() | return B1 |
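
The retrieval logic can be sketched like this in Python, with the database cursor simulated by a plain list; a real implementation would fetch from an actual cursor:

```python
# Sketch of the controlled-retrieval logic: fetch at most `limit` rows and,
# if more rows remain in the source, append a marker row.
def fetch_page(rows, limit):
    page = rows[:limit]
    if len(rows) > limit:               # at least one row left unfetched
        page = page + [["Continuing"]]  # mark the incomplete retrieval
    return page

print(fetch_page([[1], [2], [3]], 2))   # marker row appended
print(fetch_page([[1], [2]], 2))        # no marker needed
```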

**Horizontal multi-column layout**

Most reporting tools support vertical multi-column layouts, but few can handle horizontal ones. In view of this, we can use esProc to prepare the data set first.

A | B | C | |

1 | =db.query(“select a,b,c from T “) | ||

2 | =A1.step(3,1) | =A1.step(3,2)|[null] | =A1.step(3,3)|[null] |

3 | =A2.derive(B2(#).a:a2,B2(#).b:b2,B2(#).c:c2,C2(#).a:a3,C2(#).b:b3,C2(#).c:c3) |

This piece of code joins the 3-column data set (columns a, b and c) into a 9-column data set (a, b, c, a2, b2, c2, a3, b3 and c3), reducing the number of rows to one-third of the source data set. After that, the reporting tool can create a horizontal 3-column layout in the normal way.
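
The row folding can be sketched in Python with made-up data; None plays the role of esProc's [null] padding:

```python
# Sketch: fold a 3-column row set into a 9-column one, three consecutive
# source rows per output row, padding the tail with None.
def fold3(rows):
    out = []
    for i in range(0, len(rows), 3):
        chunk = rows[i:i + 3]
        chunk += [[None, None, None]] * (3 - len(chunk))
        out.append(chunk[0] + chunk[1] + chunk[2])
    return out

rows = [[1, 2, 3], [4, 5, 6], [7, 8, 9], [10, 11, 12]]
print(fold3(rows))   # 2 output rows; the second is padded with None
```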

**Supplying empty rows **

In printing a report, each page should be filled up. If there are not enough rows on the last page, empty rows need to be supplied. Many reporting tools lack related functionalities. It’s not easy to supply these rows to the data set in SQL, whereas esProc can easily get it done.

A | ||

1 | =db.query(“select * from T”) | |

2 | =pn-A1.len()%pn | Calculate the number of rows to be appended |

3 | =A1|if(A2!=pn,A2*[null]) | The data set with empty rows appended |

The number of rows for each page will be passed in through parameter *pn*.
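
The padding arithmetic can be sketched in Python (pn and the rows are illustrative):

```python
# Sketch: append empty (None) rows until the row count is a multiple of
# the page size pn, so the last printed page is filled up.
def pad_rows(rows, pn):
    short = (pn - len(rows) % pn) % pn   # rows missing on the last page
    return rows + [None] * short

print(len(pad_rows(list(range(23)), 10)))   # padded to 30
print(len(pad_rows(list(range(20)), 10)))   # already full: stays 20
```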

**Inter-column calculation for the cross table**

According to the following report, calculate the product sales amounts for a specified year (defined by parameter) and its previous year, as well as the growth rate.

Product | 2014 | 2015 | 2015-inc |

AC | 100 | 120 | 20% |

TV | 200 | 210 | 5% |

… | … | … | … |

The data structure of the source table is like this – product, year and amount.

The resulting table looks like a cross table, but its last column, the growth rate, involves an inter-column calculation. The average reporting tool only provides aggregate operations (such as sum or average) over the columns of a cross table. If we don't use a cross table, we need to transpose the data first, during which we need to get column names dynamically, and the required functions don't exist in reporting tools.

Now we use esProc to calculate the growth rates and append them to the source table, and then present the data with an ordinary cross table.

A | B | ||

1 | =db.query(“select product,year,amount from T where year=?-1 or year=? order by product,year”,Y,Y) | Y is the parameter specifying the year | |

2 | for A1.group(product) | >A1.insert(0,A2.product, string(Y)+”-inc”, string(A2(2).amount/A2(1).amount-1,”#%”)) | Calculate the growth rates and insert them to the source table |

3 | return A1 |

**Avoiding hidden cells**

Find the big clients who contribute half of the sales amount, as well as the sales amount of each of them, the number of such clients and their average sales. The data structure of the table is simple – client and amount.

The steps for performing the computation are obvious: first sort the clients by sales amount in descending order and calculate the grand total; then walk down the client list accumulating the amounts until reaching half of the total.

But it's hard to implement the algorithm in reporting tools, which perform state-style computation: after all expressions are written, the tool automatically identifies the relationships between them and determines the computing order. To get control of the computing order, we need to arrange an appropriate pattern of cell references, during which the intermediate results are presented in hidden cells. It's not easy even for reporting tools with strong computing power.

esProc will first prepare the desired data source and then create the report with it. The process is much clearer.

A | B | |

1 | =db.query(“select client,amount from Sales order by amount desc”) | |

2 | =A1.sum(amount)/2 | =0 |

3 | =A1.pselect((B2+=amount)>=A2) | return A1.to(A3) |

Though several lines of code are needed to express the algorithm in an esProc script, they are natural and easy to write and understand. The script returns only the records of the eligible big clients, from which any reporting tool can build the report as it always does.
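
The algorithm itself can be sketched in Python over an in-memory list of (client, amount) pairs; the data is made up:

```python
# Sketch: sort by amount descending, then take clients until the running
# total reaches half of the grand total.
def big_clients(sales):
    rows = sorted(sales, key=lambda r: -r[1])
    half, acc = sum(a for _, a in rows) / 2, 0
    for i, (_, amount) in enumerate(rows):
        acc += amount
        if acc >= half:
            return rows[:i + 1]
    return rows

data = [("a", 500), ("b", 300), ("c", 150), ("d", 50)]
print(big_clients(data))   # "a" alone already reaches half of 1000
```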

**Intermediate steps with position changing**

In some algorithms, the intermediate steps involve changing the positions of cells, and these cannot be expressed with hidden cells at all, even if we're willing to take the trouble and sacrifice computing efficiency.

For example, a relatively common scenario is to sort grouped records by aggregate values, but many reporting tools cannot handle it directly. Most reporting tools do sorting before grouping (because this is more frequently needed), not vice versa. We can easily implement it in esProc with flexible code, which can express any requirement involving position changes.

A | ||

1 | =db.query(“select …”) | Retrieve data |

2 | =A1.group(G).sort(~.sum(A)) | Group by G and Sort by the aggregated A |

3 | =A2.conj() | Concatenate into a single-layer set and return |

The script will return a result set sorted by aggregate values, so reporting tools will simply group it and calculate the aggregate values again.
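
A Python sketch of the same order of operations, with illustrative (G, A) rows:

```python
# Sketch: group rows by G, sort the groups by the aggregated A, then
# concatenate the groups back into a single row set.
from itertools import groupby

def sort_groups_by_sum(rows):
    rows = sorted(rows, key=lambda r: r[0])                 # group key G
    groups = [list(g) for _, g in groupby(rows, key=lambda r: r[0])]
    groups.sort(key=lambda g: sum(r[1] for r in g))         # aggregate of A
    return [r for g in groups for r in g]

rows = [("x", 5), ("y", 1), ("x", 4), ("y", 2)]
print(sort_groups_by_sum(rows))   # y-group (sum 3) before x-group (sum 9)
```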

Likewise, it's almost impossible for reporting tools to sort the big clients in the above subsection by name rather than by sales amount. Yet with esProc we just need to change B3 to *A1.to(A3).sort(client)*.

Actually esProc cannot, and isn't intended to, replace SQL. After years of effort by database vendors, the implementations of the algorithms that are easy to express in SQL have been nearly perfected, and there is no need for esProc to outperform SQL there. So all we cover here are the algorithms that are difficult to implement, or that can only be expressed in a very roundabout way, in SQL. With esProc, programmers can solve these computational problems effortlessly in a much simpler way.

As these tough, unsystematic SQL problems are hard to classify, we select some typical types of scenarios for illustration.

SQL treats columns as part of the data's attributes and regards them as static. That's why there are no specialized SQL set functions for handling columns. Consequently, this becomes a headache in scenarios where the desired data isn't supplied as columns, or where a standard approach is needed to handle many columns.

**Inter-column aggregation **

*PE* is a table recording physical education results. It has the following fields – name, 100m, 1000m, long-jump, high-jump, and so on. There are four grades – A, B, C and D – for evaluating the results. Now we need to calculate the number of persons in each grade for every event.

The algorithm is simple: union the results of all the events, then group and aggregate them. In SQL, we use a long UNION statement to combine the results of all events, which is really tedious. It gets complicated when the columns are indefinite, because we need to obtain the column names dynamically from the database to compose the union.

esProc supports handling columns through set operations. The fully dynamic syntax makes coding simple and easy:

A | ||

1 | =db.query(“select * from PE”) | |

2 | =A1.conj(~.array().to(2,)) | Concatenate the results for every event from the second field |

3 | =A2.groups(~:grade;count(1):count) | Grouping and aggregation |
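
The union-then-count idea can be sketched in Python with a toy PE table (name plus two event columns):

```python
# Sketch of the inter-column aggregation: union the grade values of all
# event columns (everything after the name field), then count per grade.
from collections import Counter

def grade_counts(rows):
    grades = [g for row in rows for g in row[1:]]   # skip the name field
    return Counter(grades)

pe = [("ann", "A", "B"), ("bob", "B", "B"), ("cid", "A", "D")]
print(grade_counts(pe))
```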

**Standard approach for transpositions**

For simple static transpositions, some databases supply *pivot* and *unpivot* statements. Databases that don't support the statements can do this using complicated conditional expressions and union statements. But usually the columns transposed from rows are dynamic. To handle this in SQL, we need to generate the target columns and rows and then compose another statement dynamically to execute it. The code is complicated and difficult to understand.

Usually data is transposed for presentation, so we could leave the pure row-to-column transposition to reporting tools. But many reporting tools don't treat the handling of rows and that of columns equally, and they cannot perform the column-to-row transposition during the data presentation stage.

The student scores table *R* consists of these fields – student, semester, math, English, science, pe, art, and so on. We need to perform both a row-to-column and a column-to-row transposition to present the data in this structure – student, subject, semester1, semester2 …

esProc offers *pivot* function to perform the simple transposition:

A | B | C | |

1 | =db.query(“select * from R”) | ||

2 | =A1.pivot@r(student,semester;subject,score) | ||

3 | =A2.pivot(student,subject;semester,score) |

To achieve the two-way transposition, A2 performs column-to-row transposition and A3 performs row-to-column transposition.

There is also a standard method which is easier to understand, yet slightly more complicated:

A | B | C | |

1 | =db.query(“select * from R order by student,semester”) | ||

2 | =create(student,subject,${A1.id(semester).string()}) | ||

3 | for A1.group(student) | for 3,A1.fno() | =A3.field(B3) |

4 | >A2.record(A3.student|A1.fname(B3)|C3) | ||

5 | return A2 |

A2 generates the target result set using a macro. The loop in A3 and A4 transposes rows and columns and inserts the results into the result set; this is the standard procedure for performing transpositions in esProc. The stepwise approach keeps the code clear and easy to understand. For a static or one-way transposition the code would be even simpler. esProc's column access scheme and the flexibility characteristic of a dynamic language enable programmers to handle all types of transpositions (static and dynamic, row-to-column, column-to-row, and two-way) in one standard approach.

**Complex transpositions **

Here’s the account state table *T*:

seq | account | state | date |

1 | A | over | 2014-1-4 |

2 | A | OK | 2014-1-8 |

3 | A | lost | 2014-3-21 |

… |

We need to export the state of each account per day for a specified month. If there's no record for an account on a certain date, the state of the previous date is used:

account | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | … | 31 |

A | over | over | over | over | OK | OK | … | OK | |||

… |

Strictly speaking, this transposition is static. But it involves a lot of regular columns and is not easy to express completely in a static way. It also involves inter-column calculations, which are hard to code in SQL even with *pivot* statements.

We can easily get the job done according to the standard esProc way:

A | B | |

1 | =db.query(“select * from T where year(date)=? and month(date)=?”,2014,1) | |

2 | =create(account,${to(31).string()}) | |

3 | for A1.group(account) | =31.(null) |

4 | >A3.run(B3(day(date))=state) | |

5 | >B3.run(~=ifn(~,~[-1])) | |

6 | >A2.record(A3.account|B3) | |

7 | return A2 |

There's only one loop here because this is a one-way transposition. The calculation in B3-B5 that prepares the data to be inserted into the result set is a little complicated, yet the overall procedure is the same.
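
The per-account logic can be sketched in Python; the (day, state) pairs are illustrative and the month is shortened to 7 days:

```python
# Sketch of the daily-state expansion for one account: place each state on
# its day, then fill forward so days without a record inherit the previous
# day's state.
def daily_states(records, days=31):
    row = [None] * days
    for day, state in records:        # (day-of-month, state) pairs
        row[day - 1] = state
    for i in range(1, days):
        if row[i] is None:
            row[i] = row[i - 1]
    return row

print(daily_states([(1, "over"), (5, "OK")], days=7))
```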

**Non-equi-grouping **

It is a common demand to group data according to ranges, such as exam grades (excellent, good …), age groups (young, middle-aged …), etc.

It is always inconvenient to implement this type of scenario in SQL. Where there are only a few static ranges, we can use a *case when* conditional statement. But for many ranges, or regular, dynamic ones, people generally create temporary tables and then use non-equi-joins. All these approaches are complicated.

esProc uses *penum* function to return the sequence numbers of the enumerated conditions:

[”?<60”,”?>=60&&?<75”, ”?>=75&&?<90”, “?>=90”].penum(score)

If the ranges are continuous, esProc also provides a simpler *pseg* function to get their sequence numbers:

[60,75,90].pseg(score)

All the conditions and ranges here are ordinary arrays. They can be passed in as parameters and there’s no limit to their lengths. According to the sequence numbers of the ranges, the enum grouping and grouping by continuous-ranges can be transformed to more familiar equi-grouping:

A | ||

1 | [”?<60”,”?>=60&&?<75”, ”?>=75&&?<90”, “?>=90”] | conditional ranges, which can work as a parameter |

2 | [60,75,90] | Continuous ranges, which can work as a parameter |

3 | =db.query(“select * from R”) | |

4 | =A3.groups(A1.penum(score);count(1):Counts) | Group data by conditional ranges |

5 | =A3.groups(A2.pseg(score);count(1):Counts) | Group data by continuous ranges |
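
The pseg idea, mapping a value to the sequence number of its range, corresponds to a binary search over the boundary list; a Python sketch with made-up scores:

```python
# Sketch of pseg-style grouping: bisect gives the range index for each
# score, which then serves as an ordinary grouping key.
from bisect import bisect_right
from collections import Counter

def group_by_ranges(scores, bounds):
    return Counter(bisect_right(bounds, s) for s in scores)

bounds = [60, 75, 90]       # ranges: <60, 60-74, 75-89, >=90
print(group_by_ranges([55, 62, 88, 95, 70], bounds))
```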

A scenario related to non-equi-grouping is fixed sorting. Often a specified order, instead of the natural data order, is required in presenting the data analysis result; for example, the Permanent Five head the UN's list of member states. The way SQL handles fixed sorting is similar to the way it groups data by ranges: if the sorting condition is static and very simple, we can use the *decode* function to generate sequence numbers; if it contains many items, or is regular and dynamic, we need to create a temporary table and use a JOIN to generate the sequence numbers.

esProc specially offers *align@s* function to perform the alignment sorting:

T.align@s([“China”,”France”,”Russia”,”UK”,”US”,…],nation)

This way we can sort the nations in the table *T* according to the specified order. Being an ordinary data type, the sorting condition can also be passed in as the parameter:

A | ||

1 | [“China”,”France”,”Russia”,”UK”,”US”,…] | Sorting condition, which can work as a parameter |

2 | =db.query(“select * from T”) | |

3 | =A2.align@s(A1,nation) | Sort data according to the specified condition |

Unlike equi-grouping, which produces no empty subsets, sometimes the grouping result is expected to cover continuous ranges, so the missing empty subsets need to be supplied. It's troublesome to do this in SQL, because we need to first create the set of continuous ranges manually and then *left join* it with the data tables under processing, which requires complex subqueries. With the *align* function and its convenient design for generating the sequence used as the alignment condition, it's easy for esProc to handle this type of scenario.

Here's a simple transaction record table *T* with no, date and amount fields. We need to accumulate the transaction amounts week by week. The weeks without transaction records also need to be displayed.

A | ||

1 | =db.query(“select * from T order by date”) | |

2 | >start=A1(1).date | |

3 | =interval(start,A1.m(-1).date)\7+1 | Calculate the total number of weeks |

4 | =A1.align@a(A3,interval(start,date)\7) | Group records by week; some groups may be empty |

5 | =A4.new(#:week,acc[-1]+~.sum(amount):acc) | Aggregate the weekly amounts and calculate the cumulative amounts. |

**Grouped subsets**

Without an explicit set data type, SQL has no choice but to perform aggregation right after data grouping. But besides the aggregate values, we may also take an interest in each data group itself. It would be difficult to handle these groups in SQL using subqueries.

With the support of set-type data and grouping functions that return subsets, esProc can easily handle post-grouping computations.

For example, to find the records of all subjects for students whose total scores are above 500 in SQL, we need to group records to calculate the total score for every student, select the students whose totals are above 500, and then JOIN the resulting table with the original score table, or find the desired records using an IN clause, which requires repeated data retrievals. The process is cumbersome. But in esProc, we can do this in a straightforward way:

A | |

1 | =db.query(“select * from R”) |

2 | =A1.group(student).select(~.sum(score)>=500).conj() |
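
The same filter can be sketched in Python over (student, subject, score) rows; the data is made up:

```python
# Sketch of the grouped-subset filter: group score records per student,
# keep the groups whose total is at least 500, and concatenate them.
from itertools import groupby

def top_students(rows):
    rows = sorted(rows, key=lambda r: r[0])   # group by student
    keep = []
    for _, g in groupby(rows, key=lambda r: r[0]):
        g = list(g)
        if sum(r[2] for r in g) >= 500:
            keep.extend(g)
    return keep

rows = [("ann", "math", 300), ("ann", "pe", 250), ("bob", "math", 100)]
print(top_students(rows))   # only ann's records survive
```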

There are many scenarios that require returning the records of the subsets after data grouping, where the group and aggregate operations are mere intermediate steps towards completing the desired query, rather than the goal. In fact there is a similar example above in which data is sorted according to group aggregate values for report development.

Though in some cases only the aggregate values are desired, the aggregations are difficult to express with simple aggregate functions, and the grouped subsets need to be retained for further computation.

This type of requirement is not uncommon in real-world computational problems. But as the computations are complicated and involve a lot of domain knowledge, it's inconvenient to cite a real example. Here's an adapted one:

Suppose there's a table *L* recording user logins, with two fields – user (ID) and login (time). We want to calculate the last login time of each user and the number of logins within the 3 days before that time.

It's easy to find the last login time, but it's difficult to count the logins during the specified period without the grouped subsets. The SQL algorithm goes like this: group records and find the last login times, JOIN with the original table to find the records within the specified period, then group and aggregate these records. The code is bulky and inefficient. esProc can retain the grouped subsets and thus do this in a stepwise approach:

A | |

1 | =db.query(“select * from L”) |

2 | =A1.group(user;~.max(login):last,~.count(interval(login,last)<=3):num) |

Here ~ represents the subset obtained after the data is grouped by user.

Here’s a more efficient way to get this done for ordered data:

A | |

1 | =db.query(“select * from L order by login desc”) |

2 | =A1.group(user;~(1).login:last,~.pselect@n(interval(login,last)>3)-1:num) |

**Order-related aggregation**

Getting the top N records, or the records corresponding to the maximum value, is another common type of scenario. Of course we can perform these computations using the retained grouped subsets. But as they are so common, esProc regards them as a kind of aggregation and provides special functions, so the way of handling them is basically the same as handling ordinary group and aggregate operations.

Let's look at the simplest case. The user login table *L* has these fields – user, login (time), IP address, and so on. We want to find the record of the first login of each user.

SQL can use window functions to generate sequence numbers after intra-group sorting and then retrieve all the records whose sequence numbers are 1. But a window function can only be employed over a result set, so we have to write a subquery and then perform the filtering, which makes the code a little complicated. For databases that don't support window functions, it's even more difficult.

esProc provides *group@1* function to directly retrieve the first member of each group.

A | |

1 | =db.query(“select * from L order by login”) |

2 | =A1.group@1(user) |

Log files of this type are frequently seen and are usually already ordered by time, so esProc can get the first record directly without sorting. A cursor can be used if the data is too big to be entirely loaded into memory.

The stock price table *S* has three fields – code, date and cp (closing price). Now we need to calculate the latest rate of increase of each stock.

The calculation involves the records of the last two trading days. In SQL we need two levels of window functions, one for the intra-group inter-row calculation and one for retrieving the first row, and the coding is complicated. esProc provides the topN-style aggregate function *top* to directly return the desired records as aggregate values for further computation.

A | ||

1 | =db.query(“select * from S”) | |

2 | =A1.groups(code;top(2,-date)) | Get the records of the last two trading days |

3 | =A2.new(code,#2(1).cp-#2(2).cp:price-rises) | Calculate the rate of increase |

Instead of aggregating over the grouped subsets, esProc aggregate functions perform accumulation over the incoming values, achieving better performance. They can also work with a cursor to handle big data that cannot be entirely loaded into memory.

We can retrieve records according to their sequence numbers if data is already ordered. This is more efficient:

A | ||

1 | =db.query(“select * from S order by date desc”) | |

2 | =A1.groups(code;top(2,0)) | Get the first two records directly |

3 | =A2.new(code,#2(1).cp-#2(2).cp:price-rises) |

Finding the records corresponding to the maximum value, and getting the first/last record, are special cases of topN-style aggregation.

**Inverse grouping**

Contrary to group and aggregate operations, inverse grouping splits aggregated data into detail data. The operation is rare, but it's hard to handle in SQL. Here's an example.

The installment payments table *I* has these fields – no, sum, start and periods. We need to split each sum of loan into multiple payment records. The resulting table contains these fields – no, seq, date and amount. The total payment is distributed evenly across the periods (one per month).

It's easy to aggregate detail data, but difficult to split it. To generate the details in SQL, we would perform a JOIN between the source table and a sequence-number table, or use a recursive query; both are roundabout ways. esProc, however, can write the code in an intuitive way:

A | |

1 | =db.query(“select * from I”) |

2 | =A1.news(periods;no,~:seq,after@m(start,~-1):date,sum/periods:amount) |
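
The expansion can be sketched in Python; to keep the sketch self-contained, dates are simplified to month numbers rather than real calendar dates:

```python
# Sketch of the inverse grouping: expand each loan row into one record per
# period, splitting the total evenly. Rows are (no, total, start, periods).
def split_payments(loans):
    out = []
    for no, total, start, periods in loans:
        for k in range(periods):
            out.append((no, k + 1, start + k, total / periods))
    return out

print(split_payments([(1, 300, 1, 3)]))   # three equal monthly payments
```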

**Cross-row reference **

In its early days, SQL didn't directly support cross-row references. One would first generate sequence numbers and then perform a JOIN, and the code was excessively difficult and overloaded. With window functions introduced later, SQL can reference data from other rows more easily. But the code is still far from concise – bulky, we could say – particularly when multiple items from other rows need to be referenced. As mentioned above, window functions must be used over the result set of an operation, and a subquery is needed in order to reference the results of window functions. The code is as cumbersome as it used to be.

MySQL doesn’t support window functions, but it can make a backward reference using variables in SQL statements. Yet it cannot make a forward reference.

esProc provides a natural and easy-to-use syntax for cross-row references.

The monthly product sales table *S* has 3 fields – prod, month and sales. We need to find the records in which the sales have been increased by 10%.

A | |

1 | =db.query(“select * from S order by prod,month”) |

2 | =A1.select(if(prod==prod[-1],sales/sales[-1])>1.1) |

We can use [-1] to reference the previous month's data after sorting, and perform filtering directly according to the results of the inter-row calculation. In contrast, SQL window functions require a subquery and MySQL needs to define two temporary variables.

Based on the above table, we now want to calculate, for each month, the moving average over the previous month, the current month and the next month:

A | |

1 | =db.query(“select * from S order by prod,month”) |

2 | =A1.derive(if(prod==prod[-1]&&prod==prod[1],sales{-1:1}.avg()):moving-avg) |

The calculation involves both backward and forward references, as well as the reference of a set. esProc uses [1] to reference the next record's data and {-1:1} to reference the set of field values from the previous record through the next one. With window functions, SQL still needs to find the desired records using a subquery before it can calculate the moving average. MySQL cannot handle the computation directly at all, because its variables cannot reference forward.
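
The neighbor-window logic can be sketched in Python over sorted (prod, month, sales) rows, with made-up data:

```python
# Sketch: produce a moving average only when the previous and next rows
# belong to the same product, as the esProc condition does.
def moving_avg(rows):
    out = []
    for i, (prod, month, sales) in enumerate(rows):
        if 0 < i < len(rows) - 1 and rows[i-1][0] == prod == rows[i+1][0]:
            avg = (rows[i-1][2] + sales + rows[i+1][2]) / 3
        else:
            avg = None                 # no full window for this row
        out.append((prod, month, sales, avg))
    return out

rows = [("tv", 1, 10), ("tv", 2, 20), ("tv", 3, 30)]
print(moving_avg(rows))   # only the middle month gets an average
```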

Here's another example. The simplified event table *E* has these fields – seq, time, and so on. The time should increase in step with the sequence number. But errors may exist, and we want to find the records where the time and the sequence number are out of sync.

A | ||

1 | =db.query(“select * from E order by seq”) | |

2 | =A1.select(time!=max(time{:0})||time!=min(time{0:})) | Compare each record with all records before and after it |

esProc can get the set from the beginning up to a certain position, or from a position to the end. SQL window functions have similar syntax, but to sort data in two directions for the two comparisons, they have to use subqueries.

**Order-related grouping **

SQL only supports order-unrelated equi-grouping. But sometimes the grouping condition isn't that records hold the same value in the grouping fields; rather, the grouping is related to the order of the records. In this case, with SQL we still have to use window functions (or other, less convenient tools) to generate sequence numbers first.

esProc has the syntax for order-related grouping, making the computations related to continuous intervals more convenient.

The income & expense table *B* has three fields – month, income and expense. Find the records of 3 or above continuous months during which the income is less than the expense.

A | |

1 | =db.query(“select * from B order by month”) |

2 | =A1.group@o(income>expense).select(~.income<~.expense && ~.len()>=3).conj() |

The *group@o* function compares only adjacent records during grouping and creates a new group once the value changes. By comparing income and expense in each record, we divide the records into alternating profitable and unprofitable runs, keep the unprofitable runs of no less than 3 months, and concatenate them.
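
The run detection can be sketched in Python over month-ordered (month, income, expense) rows; itertools.groupby plays the role of group@o here:

```python
# Sketch: split the rows each time the profitable/unprofitable flag flips,
# then keep unprofitable runs of 3 months or more.
from itertools import groupby

def loss_runs(rows):
    runs = []
    for profitable, g in groupby(rows, key=lambda r: r[1] > r[2]):
        g = list(g)
        if not profitable and len(g) >= 3:
            runs.extend(g)
    return runs

rows = [(1, 5, 9), (2, 4, 8), (3, 3, 7), (4, 9, 2), (5, 1, 6)]
print(loss_runs(rows))   # months 1-3 qualify; month 5 is too short a run
```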

Based on this table, we also want to find the maximum number of months during which the income increases continuously. We can design the grouping like this: when the income increases, put the record into the same group as the previous one; when it decreases, start a new group; finally take the maximum of the group sizes.

A | |

1 | =db.query(“select * from B order by month”) |

2 | =A1.group@i(income<income[-1]).max(~.len()) |

The *group@i* function creates a new group whenever the grouping condition holds true – that is, whenever the income decreases.

SQL can handle both this scenario and the previous one with its window functions, but the code would be very hard to understand.

The merging of intervals is another common type of order-related grouping computation. The event interval table *T* has S and E fields. We want to find the real length of time the events take by removing the overlaps between the time intervals.

A | ||

1 | $select S,E from T order by S | |

2 | =A1.select(E>max(E{:-1})) | Remove records where the time period is included |

3 | =A2.run(max(S,E[-1]):S) | Remove the overlap of the time intervals |

4 | =A2.sum(interval@s(max(S,E[-1]),E)) | Calculate the total length of time |

5 | =A2.run(if(S<E[-1],S[-1],S):S).group@o(S;~.m(-1).E:E) | Merge the time intervals that overlap |

In this part, we have provided solutions for different types of scenarios that take advantage of esProc's features for inter-row calculation and order-related grouping. SQL cannot achieve such simple implementations through window functions, unless it resorts to the extremely difficult-to-understand recursive query.

**Position-based access **

Sometimes we want to use sequence numbers to access the members of an ordered set. SQL, based on the mathematical concept of unordered sets, must first generate the sequence numbers and perform conditional filtering before accessing the members at the specified positions. This complicates many computations.

esProc, however, adopts an ordered-set mechanism that allows accessing members directly by sequence number, which brings great convenience.

For example, in analyzing economic data, people often need to find the median value for various prices:

A | |

1 | =db.query@i(“select price from T order by price”) |

2 | =A1([(A1.len()+1)\2,A1.len()\2+1]).avg() |

Sequence numbers can be used in data grouping as well. The event table *E* has three fields – no, time and act. The act field has two types of values – start and end. We want to calculate the total length of time the events take, that is, the sum of the time periods defined by each start/end pair.

A | |

1 | =db.query@i(“select time from E order by time”) |

2 | =A1.group((#-1)\2).sum(interval@s(~(1),~(2))) |

*#* represents the sequence number of a record, so *group((#-1)\2)* puts every two consecutive records into one group. Then we calculate the length of time for each group and sum them up.
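
The position-based pairing can be sketched in Python; times are plain numbers here to keep the sketch self-contained:

```python
# Sketch: after sorting by time, every two consecutive values form one
# (start, end) pair; sum the lengths of the pairs.
def total_duration(times):
    pairs = zip(times[0::2], times[1::2])   # (1st,2nd), (3rd,4th), ...
    return sum(end - start for start, end in pairs)

print(total_duration([1, 4, 10, 12]))   # (4-1) + (12-10) = 5
```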

We can also make cross-row references via sequence numbers. The stock price table *S* has two fields – date and cp (closing price). We want to find the trading days when the stock price was above 100, and the price rise on each of those days.

A | |

1 | =db.query(“select * from S order by date”) |

2 | =A1.pselect@a(cp>100).select(~>1) |

3 | =A2.new(A1(~).date:date,A1(~).cp-A1(~-1).cp:price-rises) |

The *pselect* function returns the sequence numbers of the members satisfying the specified condition; from these we can calculate the rates of increase easily. With window functions, we would have to calculate the rates of increase for all days and then perform the filtering.

**Strings **

Every database provides sufficient functions, even complicated parsing facilities like regular expressions, to handle string splitting, string concatenation and other operations that don't involve sets. esProc encapsulates all of these too, and on top of that, being a dynamic language, it can evaluate strings as expressions in computations.

The problematic string-handling scenarios in SQL mainly involve inverse grouping. Due to its lack of explicit sets, SQL has great trouble splitting a separator-segmented string into multiple records, or into a set for further handling.

It’s relatively easy to concatenate field values into a string during grouping and aggregation. MySQL has *group_concat* function to do this, and other databases also provide similar slightly complicated functions.

Here's a simple string concatenation task. The student table *S* has 3 fields – class, name and sex. We need to group the table by class and write the names of boys and girls respectively as comma-separated strings in alphabetical order.

A | |

1 | =db.query(“select * from S”) |

2 | =A1.group(class; ~.select(sex==’M’).(name).sort().string():males,~.select(sex==’F’).(name).sort().string():females) |

With set-type data, esProc can rig up various string operations without having to use the specialized string concatenation functions.

Normally, string splitting will go hand in hand with the generation of multiple records, like the inverse operation of the above example. Suppose we want to convert the class table *C*, which contains the class, males and females fields, to the student table *S*, which contains class, name and sex:

A | |

1 | =db.query(“select * from C”) |

2 | =create(class,name,sex) |

3 | >A1.run(A2.insert(0:males.array(),A1.class,~,”M”),A2.insert(0:females.array(),A1.class,~,”F”)) |

Unlike SQL, whose solutions involve complicated techniques such as recursive queries or a JOIN with a comparison table, esProc just splits the strings into sets and then generates the corresponding records.

Sometimes the purpose of splitting a string is to perform set operations. The book table *B* has a book field and an author field; the latter holds comma-separated multi-author strings. We want to find the records where the same team of authors has written at least two books. The order of the author names is indefinite.

A | |

1 | =db.query(“select * from B”) |

2 | =A1.group(author.array().sort()).select(~.len()>1).conj() |

After a string is split into a set and sorted, it can serve as the grouping key. The rest of the computation then proceeds as usual.
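
The set-valued grouping key can be sketched in Python with a couple of made-up books:

```python
# Sketch: split the author string, sort the names so their order doesn't
# matter, and group book titles by the resulting tuple.
from collections import defaultdict

def teams_with_two_books(books):          # books: (title, "a,b,c")
    by_team = defaultdict(list)
    for title, authors in books:
        key = tuple(sorted(authors.split(",")))
        by_team[key].append(title)
    return {k: v for k, v in by_team.items() if len(v) > 1}

books = [("b1", "ann,bob"), ("b2", "bob,ann"), ("b3", "cid")]
print(teams_with_two_books(books))   # {('ann', 'bob'): ['b1', 'b2']}
```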

**Dates**

Databases have no problem handling single date values. But, as with string processing, they are inconvenient for date splitting or for generating a sequence of dates. The root of the problem is that SQL only goes halfway in its orientation toward sets.

The travel log table *T* has these fields – client, start, end… We want to find the top five days that saw the most travelers.

To do this, we convert the time period between the starting date and the ending date into a set of separate dates and then group and aggregate the records.

A | |

1 | =db.query("select start,end from T") |

2 | =A1.conj(periods(start,end)).groups(~:date,count(1):count) |

3 | =A2.sort(count:-1).to(5) |

With a set-based approach to date splitting, esProc handles this easily.
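The expand-then-count logic (what *periods* plus *groups* accomplishes) can be pictured with a small Python sketch using invented trips:

```python
from collections import Counter
from datetime import date, timedelta

# Hypothetical sample of the travel log T
trips = [
    {"client": "c1", "start": date(2015, 1, 1), "end": date(2015, 1, 3)},
    {"client": "c2", "start": date(2015, 1, 2), "end": date(2015, 1, 2)},
]

# Expand each [start, end] period into one entry per travel day,
# then count travelers per day
counter = Counter()
for t in trips:
    d = t["start"]
    while d <= t["end"]:
        counter[d] += 1
        d += timedelta(days=1)

top5 = counter.most_common(5)  # the five busiest days
```

Once each period is a set of dates, the "top five days" question reduces to an ordinary group-and-count.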

It’s complicated to generate a sequence of dates, mainly because of the peculiarities of date values. It’s particularly thorny when accompanied by an inverse grouping operation.

The event table *T* has 3 fields – I (event), S (starting date) and E (ending date). We need to break apart the time interval between the starting date and the ending date by months to generate multiple records. The first and last months begin and end at the starting date and ending date respectively, while the months in between include all their days.

A | |

1 | =db.query("select I,S,E from T") |

2 | =A1.news(interval@m(a=S,b=E)+1;I,max(a,pdate@m(after@m(a,~-1))):S,min(b,pdate@me(after@m(a,~-1))):E) |

The *pdate@m* function and *@me* options find the starting date and ending date of a month respectively. The *after@m* function obtains the date which is a certain number of months after the given date; it can automatically adjust the new date to the last day of the desired month and is handy in generating a monthly interval.
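A Python sketch of the same month-splitting rule (the helper names here are invented; they play the roles of *pdate@m* and *after@m*):

```python
from datetime import date, timedelta

def first_of_next_month(d):
    """First day of the month after d's month."""
    return date(d.year + d.month // 12, d.month % 12 + 1, 1)

def split_by_month(s, e):
    """Break [s, e] into per-month intervals.
    The first interval starts at s and the last ends at e;
    months in between cover all their days."""
    parts, cur = [], s
    while cur <= e:
        month_end = first_of_next_month(cur) - timedelta(days=1)
        parts.append((cur, min(e, month_end)))
        cur = month_end + timedelta(days=1)
    return parts
```

As in the esProc script, the only delicate points are clamping the first and last intervals to S and E, and computing each month's last day.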

A second approach is to import the text data into a database and use SQL. But since a text file lacks the strong data typing a database requires, the import is usually accompanied by complicated data arrangement. The extra step seriously compromises processing efficiency.

Being a dynamic, set-oriented scripting language, esProc can fill the gap to some extent. The common file handling scenarios below serve to show the advantages of esProc in this type of computation.

**Parsing text files **

The data items in each line of the text file *T.txt* are separated from each other by an indefinite number of spaces:

2010-8-13 991003 3166.63 3332.57 3166.63 3295.11

2010-8-10 991003 3116.31 3182.66 3084.2 3140.2

……

Now we want to make a list of the averages of the last four items in each line. esProc needs a mere one-liner to do this:

A | |

1 | =file("T.txt").read@n().(~.array@tp("").to(-4).avg()) |

*read@n* reads the text file as a set of strings. *array@t("")* splits each string into a set of substrings on the indefinite number of spaces, and the *@p* option parses each substring into the corresponding data type for the subsequent computation of the average value.
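An equivalent sketch in Python: `str.split()` with no argument already splits on runs of whitespace, so the averaging reduces to one expression per line (the sample lines mirror the ones above):

```python
# Sample lines with an irregular number of spaces between items
lines = [
    "2010-8-13 991003  3166.63 3332.57  3166.63 3295.11",
    "2010-8-10  991003 3116.31 3182.66 3084.2  3140.2",
]

# split() collapses any run of whitespace; take the last four items,
# parse them as numbers and average them
averages = [sum(map(float, line.split()[-4:])) / 4 for line in lines]
```

The parallel with the esProc one-liner is direct: split into a set, take a slice, aggregate.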

Here’s a comma-separated text file *T.csv*. We need to write the first 8 data items of each line that has at least 8 items to another text file *R.txt*, separated with "|" (the separator some bank file systems use):

A | |

1 | =file("T.csv").read@n().(~.array(",")).select(~.len()>=8) |

2 | >file("R.txt").write(A1.(~.to(8).string("|"))) |

The *string()* function concatenates members of a set into a string using a specified separator.

The text file *T.txt* holds a set of strings as shown below. We need to divide the file into several parts according to the state name (LA) before the characters US and put them in different files.

COOP:166657,'NEW IBERIA AIRPORT ACADIANA REGIONAL LA US',200001,177,553

……

A | |

1 | =file("T.txt").read@n() |

2 | =A1.group(mid(~,pos(~," US'")-2,2):state;~:data) |

3 | >A2.run(file(state+".txt").export(data)) |

esProc also supports regular expressions for more complex parsing tasks. But since regular expressions are hard to master and perform poorly, the conventional approaches are generally recommended.

**Parsing text files into structured data **

In the logfile *S.log*, every 3 lines constitute a complete piece of information. We need to parse the file into structured data and write it to *T.txt*:

A | |||

1 | =file("S.log").read@n() | ||

2 | =create(…) | Create a target result set | |

3 | for A1.group((#-1)\3) | … | Group data every 3 lines by the line numbers |

… | … | Parse field values from A3 (the current 3 lines) | |

… | >A2.insert(…) | Insert the parsing result into the target result set | |

… | >file("T.txt").export(A2) | Write the result set to the text file

Since esProc can group data by line numbers, we can run a loop to process one group of data each time, making the computation simpler.
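Grouping every 3 lines by line number can be pictured in Python as follows (the log format and field syntax are invented for illustration):

```python
# Hypothetical log where every 3 lines form one complete record
lines = [
    "user=alice", "time=10:00", "action=login",
    "user=bob",   "time=10:05", "action=logout",
]

# Group by line number: lines 0-2 form record 0, lines 3-5 record 1, ...
# then parse each group's "key=value" items into one structured record
records = [
    dict(item.split("=", 1) for item in lines[i:i + 3])
    for i in range(0, len(lines), 3)
]
```

The integer division of the line number by 3 is the whole grouping trick; the per-group parsing is then independent of the rest of the file.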

The special case where each piece of information occupies a single line is, of course, even simpler.

If *S.log* is too big to be loaded completely into the memory, we can retrieve the file step by step using the cursor and export it the same way:

A | B | ||

1 | =file("S.log").cursor@si() | Create a cursor and import the file in a stream |

2 | =file("T.txt") | Create the resulting file |

3 | for A1,3 | … | Run a loop to process the 3 lines imported at a time |

4 | … | Parse field values from A3 (the current 3 lines) | |

… | >A2.export@a(…) | Write the parsed values to the resulting file

A skilled user can optimize the code to achieve a better performance by writing the parsed records in batches.

If every piece of complete information in the log file *S.log* starts with "—start—" but contains an indefinite number of lines, we just need to change A3 as follows:

3 | for A1.group@i(~=="—start—") | Create a new group with every "—start—"

Similarly we can deal with a big file in this type of scenarios with the cursor, and A3 will be like this:

3 | for A1;~=="—start—":0 | Start another loop cycle with every "—start—"

Another scenario with an indefinite number of lines is that each line of one piece of information begins with the same characters (for example, the user ID the log information belongs to). When the starting characters change, a new piece of information begins. To handle this, we slightly modify A3:

3 | for A1.group@o(left(~,6)) | Create a new group when the first 6 characters change |

3 | for A1;left(~,6) | Start another loop cycle when the first 6 characters change |

And we can also use the cursor to handle the big file by altering the code of the preceding subsection.

**Searching and Summary**

Find the files under a directory that contain certain words, and list the contents and numbers of the lines where the words appear:

A | |

1 | =directory@p("*.txt") |

2 | =A1.conj(file(~).read@n().(if(pos(~,"xxx"),[A1.~,#,~].string())).select(~)) |

*grep* is a frequently used Unix command, but some operating systems don’t support it and it’s not easy to implement in a program. esProc provides file traversal functionality and, combined with its file handling capability, can do the job with only two lines of code.

Now list all the different words the text file *T.txt* contains and count the occurrences of each of them. Ignore the case of characters:

A | |

1 | =lower(file("T.txt").read()).words().groups(~:word;count(1):count) |

WordCount is a famous programming exercise. esProc’s *words()* function splits a string into separate words, so a single line of code completes the operation.
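For comparison, the same WordCount logic in Python (the sample text is made up; a regular expression stands in for esProc's *words()* function):

```python
import re
from collections import Counter

text = "The quick brown fox jumps over the lazy dog. The dog sleeps."

# Lowercase first, then extract alphabetic words and count occurrences
words = re.findall(r"[a-z]+", text.lower())
counts = Counter(words)
```

Splitting the text into a set of words first is what turns counting into a plain group-and-aggregate step.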

List all words containing the letters a, b and c in the text file *T.txt*. Ignore the case of characters:

A | |

1 | =lower(file("T.txt").read()).words().select(~.array("").pos(["a","b","c"])) |

Because the order of these letters differs from word to word, we cannot determine whether a word qualifies through substring searching. Instead we use *array("")* to break a string into a set of single characters and then check whether the set contains these letters. With the support of set operations, esProc gets this done with a one-liner.
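In Python the same set-containment test is the subset operator (sample text invented):

```python
import re

text = "Cabbage and bacon, or abacus?"

# Lowercase, split into distinct words
words = set(re.findall(r"[a-z]+", text.lower()))

# Keep words whose character set covers all of a, b and c,
# regardless of the order the letters appear in
hits = sorted(w for w in words if {"a", "b", "c"} <= set(w))
```

Converting each word to a set of characters makes the letter order irrelevant, exactly as in the esProc one-liner.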

To handle big files, we can simply adapt these operations to retrieve data in segments or with a cursor.

**Controlled data retrieval **

Import 4 columns – name, sex, age and phone – from a titled, comma-separated structured text file *D.csv*. The values of the phone field are all numbers but must be imported as strings.

A | |

1 | =file("D.csv").import@t(name,sex,age,phone:string;",") |

The *import* function has various parameters and options to control whether the text file has a title or not, which separator it uses, which columns it needs, and what the data type is. In most cases, the retrieval of a structured file can be done with a single line of code.

The import result can be returned to the Java main program as a ResultSet object through JDBC for further processing. A programmer familiar with JDBC can do this easily; retrieving a text file becomes like retrieving a database table.

The same work can be done with the cursor in handling a big file.

A | ||

1 | =file("D.csv").cursor@t(name,sex,age,phone:string;",") | |

2 | =A1.fetch(100) | Fetch 100 rows |

A cursor can also be returned to the Java main program via JDBC.

A big file can be retrieved in segments:

A | ||

1 | =file("D.csv").import@tz(;",",2:4) | Divide the file into 4 segments evenly and import the second one |

esProc achieves good retrieval efficiency by segmenting the file by bytes (segmenting by rows would require traversing all the preceding rows each time). To prevent incomplete rows and ensure data integrity, it adopts a "skip the head line, complement the tail line" strategy for each segment: the partial first line is given away to the previous segment, and the missing part of the last line is read in to complete it.
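The strategy can be sketched in Python over an in-memory byte buffer (a simplification: real code would seek within a file, and `read_segment` is a hypothetical helper, not esProc's actual implementation):

```python
def read_segment(data, k, n):
    """Return the complete lines of the k-th of n even byte segments (1-based).

    Skip-head/complement-tail strategy: a segment discards its partial
    first line (it belongs to the previous segment) and reads past its
    byte boundary to finish its last line.
    """
    size = len(data)
    lo = size * (k - 1) // n
    hi = size * k // n
    if lo > 0:                           # skip the partial head line
        nl = data.find(b"\n", lo)
        lo = size if nl < 0 else nl + 1
    nl = data.find(b"\n", hi)            # complement the tail line
    hi = size if nl < 0 else nl + 1
    return data[lo:hi]
```

Note that the segments concatenate back to the whole file with no duplicated or broken lines, which is the integrity guarantee the text describes.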

The segment-style retrieval applies likewise to a cursor.

When the order is not important (in sum and count operations for example), we could take advantage of esProc’s built-in parallel framework to enhance performance:

A | |

1 | =file("D.csv").import@tm(name,sex,age,phone:string;",") |

With this line of code, esProc enables multithreading to automatically segment and retrieve the file in multiple threads. Text parsing is slow, but parallel multithreaded retrieval makes good use of the computing capacity of a multi-core CPU and thus greatly increases parsing speed.

**Common computations **

Find men who are above 25 and women who are above 23 in the text file *D.csv*, and then 1) list their names in alphabetical order; 2) group the persons by genders and calculate the average age separately; 3) list all the distinct surnames (but ignore the compound surnames).

A | ||

1 | =file("D.csv").import@t(name,sex,age;",") | |

2 | =A1.select(sex=="M"&&age>=25||sex=="F"&&age>=23) | Filtering |

3 | =A2.sort(name) | Sorting |

4 | =A2.groups(sex;avg(age):age) | Grouping and aggregation |

5 | =A2.id(left(name,1)) | Get distinct values |

esProc is rich in functionalities for structured-data computations, empowering it to handle text files as database tables to some degree and to provide SQL-like computing capability even without a database.

A cursor can always be used to handle big data:

A | ||

1 | =file("D.csv").cursor@tm(name,sex,age;",") | |

2 | =A1.select(sex=="M"&&age>=25||sex=="F"&&age>=23) | Filtering |

3 | =A2.sortx(name) | Sorting |

4 | =A2.groups(sex;avg(age):age) | Grouping and aggregation |

3 | =A2.groupx(left(name,1);) | Get distinct values |

4 | =A3.fetch(…) | Fetch result |

Different from in-memory computations, a cursor traverses the data only once. So after the sorting operation above, another cursor needs to be created for the grouping operation.

Here’s a text file *D.csv*. According to the 4th–6th digits of each value in its phone field, we can look up the area a phone number belongs to (the area field) in the text file *P.txt*, which maps area codes (id) to areas. Now find the records of *D.csv* that have Beijing phone numbers.

A | |

1 | =file("D.csv").import@tm(;",") |

2 | =file("P.txt").import@t(id,area) |

3 | =A1.derive(mid(phone,4,3):aid).switch(aid,A2:id) |

4 | =A3.select(aid.area=="Beijing") |

We can use a cursor in A1 to handle big files.

esProc handles foreign key association using a pointer, making the reference easier. But we can still perform a SQL-style join:

A | ||

1 | =file("D.csv").import@tm(;",") | |

2 | =file("P.txt").import@t(id,area) | |

3 | =join@1(A1,mid(phone,4,3);A2,id) | @1 means a left join |

4 | =A3.select(#2.area=="Beijing").(#1) |

**File comparison**

Find the values of the id field in the text file *T1.txt* that also appear in *T2.txt*, and those that don’t.

A | ||

1 | =file("T1.txt").import@ti(id) | |

2 | =file("T2.txt").import@ti(id) | |

3 | =A1^A2 | Intersection: the values common to T1 and T2 |

4 | =A1\A2 | Difference: values T1 has but T2 doesn't |

As you see, we just perform intersection and difference operations to compare column values of different files.
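Python's built-in sets express the same comparison directly (sample id lists invented):

```python
# Hypothetical id columns read from T1.txt and T2.txt
t1_ids = ["a01", "a02", "a03", "a04"]
t2_ids = ["a02", "a04", "a05"]

common = sorted(set(t1_ids) & set(t2_ids))   # ids present in both files
only_t1 = sorted(set(t1_ids) - set(t2_ids))  # ids in T1 but not in T2
```

The point of the esProc example is identical: once the columns are sets, file comparison is just intersection and difference.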

Now work with whole rows: among the rows of *T1.txt*, find those whose ids exist in *T2.txt* and those whose ids don’t.

A | ||

1 | =file("T1.txt").import@t().sort(id) | |

2 | =file("T2.txt").import@t().sort(id) | |

3 | =[A1,A2].merge@i(id) | Intersection, which contains rows of T1 whose ids also exist in T2 |

4 | =[A1,A2].merge@d(id) | Difference, which contains rows of T1 whose ids don’t exist in T2 |

To get the whole rows, first sort both files by the id field and then perform a merge operation.

Find rows from *T1.txt* and *T2.txt* that have a common id but whose other columns differ:

A | ||

1 | =file("T1.txt").import@t() | |

2 | =file("T2.txt").import@t() | |

3 | =join(A1,id;A2,id) | An equi-join, which discards the unmatchable rows |

4 | =A3.select(cmp(#1,#2)!=0) | Find different rows |

The *join* function can align data according to a certain column.
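A rough Python equivalent of this equi-join-then-compare step (the two tables here are made-up samples with an id column and one other column `v`):

```python
# Hypothetical rows of T1 and T2
t1 = [{"id": 1, "v": "x"}, {"id": 2, "v": "y"}]
t2 = [{"id": 1, "v": "x"}, {"id": 2, "v": "z"}, {"id": 3, "v": "w"}]

# Index T2 by id, equi-join on id (unmatched rows drop out),
# then keep pairs whose contents differ
by_id = {r["id"]: r for r in t2}
diff = [
    (r1, by_id[r1["id"]])
    for r1 in t1
    if r1["id"] in by_id and r1 != by_id[r1["id"]]
]
```

Aligning the rows by id first is what reduces "find differing rows" to a plain record comparison.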

We can also use *merge* and *join* to compare big files that cannot be loaded entirely into memory. Both require sorted data:

A | ||

1 | =file("T1.txt").cursor@t().sortx(id) | |

2 | =file("T2.txt").cursor@t().sortx(id) | |

3 | =[A1,A2].merge@xi(id) | Intersection, which contains rows of T1 whose ids also exist in T2 |

4 | =[A1,A2].merge@xd(id) | Difference, which contains rows of T1 whose ids don’t exist in T2 |

The *merge@x* function merges sorted cursors. To perform external memory sorting for big files, use *sortx()*.

A | ||

1 | =file("T1.txt").cursor@t().sortx(id) | |

2 | =file("T2.txt").cursor@t().sortx(id) | |

3 | =join@x(A1,id;A2,id) | An equi-join, which discards the unmatchable rows |

4 | =A3.select(#1.array()!=#2.array()) |

The *join@x *function joins sorted cursors.

**json**

Despite a sufficient number of Java class libraries for parsing and generating JSON data, Java lacks the capability for further computations on it. esProc supports multi-level data and can parse JSON into computable in-memory data for further processing without compromising its integrity.

Here’s json data of a certain format:

{

"order":[

{

"client":"Beijing Raqsoft Inc.",

"date":"2015-6-23",

"item" : [

{

"product":"HP laptop",

"number":4,

"price":3200

},

{

"product":"DELL server",

"number":1,

"price":22100

}]

},…]

}

We need to write the JSON data to two database tables: *order*, which includes three fields – orderid, client and date, and *orderdetail*, which includes five fields – orderid, seq, product, number and price. The orderid and seq fields of the *orderdetail* table can be generated according to the data order.

A | |

1 | =file("data.json").read().import@j().order |

2 | =A1.new(#:orderid,client,date) |

3 | =A1.news(item;A1.#:orderid,#:seq,product,number,price) |

4 | >db.update@i(A2,order) |

5 | >db.update@i(A3,orderdetail) |

esProc parses the multi-level JSON string into a multi-level data set, in which the value of the item field is itself a table.

Besides data parsing, esProc can generate multi-level json strings from a multi-level data set.
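The flattening of the multi-level JSON into the two flat tables can be sketched in Python (using a trimmed copy of the sample document above; orderid and seq are generated from the data order):

```python
import json

doc = json.loads("""
{"order": [
  {"client": "Beijing Raqsoft Inc.", "date": "2015-6-23",
   "item": [{"product": "HP laptop",   "number": 4, "price": 3200},
            {"product": "DELL server", "number": 1, "price": 22100}]}
]}
""")

# One pass over the nested structure builds both flat tables;
# enumerate() supplies the sequence numbers
orders, details = [], []
for oid, o in enumerate(doc["order"], 1):
    orders.append({"orderid": oid, "client": o["client"], "date": o["date"]})
    for seq, it in enumerate(o["item"], 1):
        details.append({"orderid": oid, "seq": seq, **it})
```

The nested *item* list plays the same role as the table-valued item field in the esProc data set.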

**Excel**

Excel files are essentially structured files. Java offers powerful open source class libraries (such as POI) for parsing XLS files, but they are low-level tools that make development very complex. By encapsulating POI, esProc can retrieve XLS files into two-dimensional tables for further handling.

Here’re *range.xls* and *position.xls*:

range.xls:

range | start | stop |

Range1 | 4561 | 6321 |

Range2 | 9842 | 11253 |

… | | |

position.xls:

point | position |

point1 | 5213 |

point2 | 10254 |

… | |

For each point position in *position.xls*, find the appropriate start/stop range in *range.xls* to cover it, and append the start and stop values in *position.xls*.

A | |

1 | =file("range.xls").importxls@t() |

2 | =file("position.xls").importxls@t() |

3 | =A2.derive((t=A1.select@1(position>=start&&position<=stop)).range:range,t.start:start,t.stop:stop) |

4 | =file("result.xls").exportxls(A3) |

esProc can give full play to its computing power after retrieving an XLS file. Excel VBA, however, can only hardcode *join*s, and sometimes even has to resort to exporting the data to a database. Either way produces bloated code.

esProc’s Integrated Development Environment (IDE) offers good interactivity. Anyone with programming basics can use it as a desktop interactive analysis tool.

The key to interactive computing is conveniently showing and referencing intermediate results, so that each step can be decided from the previous one. esProc’s cellset-style coding naturally retains intermediate results in cells for viewing whenever needed. Programmers can reference intermediate results directly, without naming them, making stepwise interactive computation extremely convenient. An average scripting language performs interactive computation from the command line, which is far less efficient.

esProc can access and handle heterogeneous data sources, including the common databases and the files stored in local file system, such as TXT and XLS files. The final result can be viewed, or written back to the database or the file.

esProc is intended as a development tool for writing repeatedly executed code. Debugging an esProc cellset is easier and more intuitive than debugging traditional text code. Besides executing in the IDE, esProc can be started from the command line by external job scheduling software. With its support for various data sources and its remarkable computing power, esProc can perform tasks like scheduled data manipulation (similar to ETL).

**Working as Java class library**

As we mentioned, being integrated is another purpose esProc is intended for.

esProc provides a JDBC interface through which esProc code can be invoked like a database stored procedure. The passing of parameters, the execution of code and the returning of results all conform to the JDBC standard, so programmers familiar with JDBC can pick up esProc quickly. The esProc RTL is provided as JARs, so it can be deployed and distributed with an application. The integration is completely seamless.

As far as we know, Java has no universal class library for structured data; programmers have to hardcode this type of computational problem, which is a cumbersome process. esProc, highly integration-friendly and with excellent computing power, can work as a Java class library for processing batches of (semi)structured data. When no database is involved (for instance, when handling a text file), SQL’s easy-to-use computing capability has no chance to come into play; at other times, an algorithm may be too difficult to code in SQL, forcing one to retrieve data out of the database for the computation. In all these scenarios, esProc can assist Java with the computation.

A single esProc statement can be invoked in the same way as a SQL statement. If the code is short, programmers can write it directly in esProc JDBC without creating a script file, which saves them the trouble of managing the script and increases programming flexibility.

**Preparing data source for reporting tools **

As a special example of the application of Java, reporting tools can certainly integrate the esProc code through JDBC to supply data sources to themselves.

The process of report development involves many complex, temporary computations, which are rarely implemented successfully with the reporting tool due to their complexity, and which will cause unreasonable storage usage if performed within the database because of their temporariness, and which will lead to tight coupling between the Java application and these codes if carried out in the intermediate Java program. But by using esProc as the special middleware to prepare the report data source, these computations can be detached to execute separately so that developing can be much easier. Moreover, the esProc script can be managed together with the report template, which effectively reduces the complexity of application management.

Here are some typical scenarios exemplifying esProc’s role in solving computational problems. All the examples come from real-life Q&A posts on the internet and have been simplified to facilitate understanding.

esProc defines its niche as the handling of (semi)structured data, so it doesn’t provide algorithms for directly performing data analysis, data mining or machine learning, nor is it an expert at processing media and map data.

Compared with high-level programming languages like Java, esProc has abundant basic objects and methods for structured-data computing, which is common in data analysis, data handling and data preparation. This enables it to express the same algorithm in much more concise code than Java and to achieve higher development efficiency.

For example, Java needs dozens or even close to a hundred lines of code to filter a data set, and more if a universal data type and a universal condition are involved. esProc gets it done with a mere one-liner.

esProc is integration-friendly with Java applications. Since it has been developed in Java, the two are perfectly compatible. And designed to be integrated, it is open to the invocation coming from a Java main program. It is particularly convenient to use esProc to prepare data source for a Java reporting tool.

It is admitted that, with SQL, piecemeal multi-step computations, particularly order-related ones, are a lot of hassle. Normally programmers must retrieve data from the database and handle it in Java or another language, owing to SQL’s incomplete orientation toward sets, its lack of support for discrete records, and its non-stepwise approach. esProc enhances this functionality and fills the gap, permitting more intuitive implementations of non-equi-grouping, reuse of grouped data, and order-related or multi-step computations. esProc integrates the merits of SQL and Java, letting programmers use the SQL-style batch approach to set operations while enjoying the flexibility Java provides.

Yet esProc cannot and doesn’t aim to replace SQL, despite its much simpler syntax in most scenarios.

Retrieving data from the database incurs plenty of I/O overhead. When big data is involved in simple operations, retrieving the data takes much longer than computing on it; in such cases it’s more appropriate to handle the data inside the database. Besides, SQL’s metadata system makes for a more transparent syntax: programmers need not concern themselves with the physical storage scheme. As a pure computing engine without a complete storage mechanism, esProc can handle data from all types of files and databases, but it uses different syntax for in-memory and external-memory computations, which require different approaches.

Choosing to use esProc doesn’t mean abandoning SQL. Instead, it helps SQL in handling computational scenarios where SQL has been weak. They include complex multi-step computations and computations involving heterogeneous databases, etc.

Apart from SQL, the industry has no other standard programming language specializing in structured data. However, SQL has the computational weaknesses mentioned above, as well as application limitations caused by its closed nature: we cannot freely handle a local file in SQL. As a result, people often turn to scripting languages like Python (pandas) and R for these scenarios.

The truth is that Python (pandas) and R are designed for mathematical statistics and analysis. Though equipped with the dataframe object, they are not specialized tools for processing structured data, and they provide no direct support for external-memory computation. esProc is an expert in structured-data computing, offering the table sequence object for in-memory data (functionally a superset of the dataframe) and the cursor object for external-memory data. esProc makes multithreaded parallel processing convenient to code, and its configuration and usage are simple when handling heterogeneous data sources (XLS, JSON and MongoDB).

However, esProc is poor at mathematical statistics and analysis because it lacks the necessary class libraries.

Apart from individually handled analyses, structured-data computing often takes place within an application. Being integration-friendly with Java, esProc can easily be invoked by a Java main program. Python and R, by contrast, are not integration-friendly: an algorithm written in Python cannot easily be invoked from Java.
