esCalc: A Quick and Easy Desktop Spreadsheet
Background
Excel
In effect, Excel, instead of many of the BI tools, is the most widely used desktop data analysis tool.
Excel is simple, intuitive, easy to use and to understand, particularly suitable for the average analysts who are incapable of programming and don’t have knowledge about mathematical models. Moreover, the result of each action operated on an Excel worksheet will be immediately shown to enable programmers to decide how to make the next move. This represents the typical model analysts use to perform analytics. It is neither necessary nor possible to model objects beforehand.
But we’ve found some Excel defects as data analysis becomes increasingly complicated. Simply and briefly put, they are the “4M” problems:
1. Multi-row records
There’s no definite concept of structured records in Excel. The single-row record is a record that corresponds to a single row. If a record has too many data items and needs to occupy multiple rows or if it has sub-records – the record is a multi-row record. It’s very complicated to edit a multi-row record and to perform operations on it.
2. Multi-level tables
Excel provides functions for grouping data, yet, unlike an ungrouped worksheet, many operations become impossible or need to be performed in a different way with the resulting hierarchical table, where consecutive operations are hard to be carried out. What’s worse, the grouping generates multi-row records, on which the cross-group copying of formulas that reference the aggregate values can’t be performed correctly and intelligently. As a result, computing errors arise.
3. Multi-table joins
Excel isn’t a relational-algebra-based product. It doesn’t have specialized functions to join tables; it only provides functions such as Lookup for simple cross-page cell reference, which are complicated to use and perform poorly.
4. Multi-member value
It is known that Excel has the concept of set, enabling aggregate operations on cells within a certain range. But it doesn’t provide explicit set data type, making it impossible to store set-typed values and limiting its abilities of handling set-oriented operations.
We’ll further explain the problems later through examples.
esCalc
To solve those Excel problems, we created another model for handling the spreadsheet data, and from this model the brand-new spreadsheet software – esCalc – was born.
Instead of an improved version of Excel, esCalc bases its data and operational models directly on the relational algebra, making it have more in common with the relational database. esCalc has a definition for records, and provides most of SQL’s computational features like computed columns, sorting, filtering, grouping and performing distinct, as well as join and union between multiple tables. Because of its spreadsheet interactive interface, esCalc can be regarded as a visualized SQL calculator.
Different from the general database client softwares that use field name in referencing a data item, esCalc inherits Excel’s grid style, in which cells are used to name data items and describe formulas, and supports automatic and intelligent copying of formulas. The more intuitive way makes it easy to be manipulated by the layman and handy to express order-related computations which SQL isn’t good at.
Furthermore, esCalc includes a multi-level data model to increase related computing capabilities on the basis of SQL, enabling it to support hierarchical tables containing a main table and its sub-tables, to perform operations such as filtering, sorting, re-grouping and ungrouping on the grouped worksheet, and to copy formulas automatically and intelligently between cells at different levels.
esCalc is designed for performing interactive data analysis. It doesn’t support such extensive functionalities as Excel does, but it’s more sophisticated and more adept with handling batched data. esCalc is intended to be a cooperator of Excel, rather than an unnecessary rival. Not only can esCalc retrieve an xls file to analyze and process it, also it can export the computational result in xls format for further processing in Excel. But it’s necessary to point out that esCalc isn’t a plug-in for Excel, it is an independent application.
Compared with Excel, the big functionalities for data handling that esCac hasn’t are VBA scripting and pivot table. In our opinion, VBA is too difficult for average analysts; besides, many VBA scripts are created only in order to complement those functionalities that are not so convenient to be carried out in Excel. With the same functionalities better provided in esCalc, VBA scripting is not as important as it is in Excel. For analysts who are capable of programming, we provide esProc, another product of the RaqSoft software family, for occasions where scripts are needed. The software offers scripting abilities that are much more powerful than VBA. As for the pivot table, Excel has already exceled in its feature and leaves little room for improvement. So the target of esCalc is to generate files of xls format as the data sources for Excel pivot tables.
Now let’s turn to discussing the above-mentioned “4M” problems of Excel in data analysis and handling through examples, and provide their esCalc solutions.
Structure
Formula copying
Records are represented by the rows in an Excel worksheet. Users can perform operations such as filtering, sorting on the rows, and, particularly, add computed columns (its values are computed from other fields) for the rows. It’s in this latter case where the formula copying becomes a problem.
To add a computed column involves all records (rows), but as Excel hasn’t the concept of explicit record, the formula entered in a certain row needs to be manually copied to other rows. Excel cleverly adopts the drag-and-drop method to do this. The method is very convenient-to-use for handling single-row records (that is, each record corresponds to a single row).
But at times the worksheet data we’ are handling are complicated in that one record corresponds more than one row. That’s because, for example, the record has much content that needs to take up two rows, or the record includes the lower-level sub-records (the details of an order, for instance). In those cases, the cells to which the formula is copied aren’t continuous any more, and the drag-and-drop method becomes powerless. We can imagine how much hassle there will be if all the copying is done manually row by row.
esClac solves the problem by both retaining Excel’s intuitive way of naming data items after cells and by introducing the concept of explicit records. It combines the strongest points of Excel and database client software. A formula entered to a certain cell will be automatically and correctly copied to its homo-cells (cells of the desired field in other records) without specialized copying actions, even if there are multi-row records and records with sub-records.
Here’s an order table:
F1 calculates the total order amount using the formula ={E2}.sum(). We then enter the formula =round(E2/F1,4) in F2 to calculate the percentage of the amount of the current order in the total amount, that is – dividing the total value in F1 by the amount of the current order. At the same time, we set the display format of F2 as #0.00%, which means representing the value in percentage. After entering the formula in F2, here’s what we get:
Check the homo-cells (F3~F11) of F2 and we find that they’ve all finished the computations. This shows that esCalc can copy the formula and display format in one cell to its homo-cells automatically and correctly.
It can also copy the formula in handling multi-row records as conveniently as in handling the single-row records, for example:
The worksheet contains unit prices and quantities of vegetables and fruits purchased. D3 calculates purchasing amount of the pineapple with =floor(D2*B3,2). Here’s the result after the formula is entered:
As soon as the formula is entered, it is copied to the cells corresponding to all products, i.e. D3’s homo-cells, to calculate their total purchasing amount.
Data editing
Excel is again at a disadvantage in editing multi-row records.
Excel doesn’t handle a record as a whole. Inserting, deleting and moving a record are operations performed based on rows and columns of a worksheet. There’s almost no problem about processing single-row records. But operations on rows become complicated in handling multi-row records and records with sub-records, and inserting and deleting fields based on columns are almost non-executable.
Because Excel isn’t good at handling multi-row records, it generally prevents them from appearing when generating the original data. So there’re not many chances for Excel users to encounter them. In many real-world businesses, however, it’s not a rarity that users find themselves in the face of a multi-level worksheet or multi-level data items. By the way, group operations will generate multi-level tables, as we’ll mention later.
Even with single-row records, Excel will still make mistakes in copying formulas for inter-row calculations (such as the calculation of YOY rate and the accumulated value) when rows are inserted or deleted. There’s the same problem in moving records through the copy. Both cases require modifying the results manually or recopying by drag-and-drop. In addition, since Excel doesn’t stress the concept of records, it doesn’t offer hot keys for record processing, making the modification and recopying not that easy.
But it’s easy for esCalc, which defines records, to perform those operations. It also provides convenient hot keys to trigger the actions in a shortcut way. Records (including their sub-records) can be deleted and moved as a whole with just one click, after which the inter-row calculation formulas will remain correct. To insert and delete fields based on columns is to change the data structure. esCalc will automatically copy these operations on one cell to all its homo-cells.
Here’s the employee table:
There’s the formula =age(C3) in D3. The formula, as well as those in its homo-cells, is used to calculate the age of each employee. Meanwhile C2’s formula ={B3}.count() calculates the number of employees in each department, and D2’s formula =round({D3}.avg(),1) calculates the average age of the employees in each department. Suppose we want to delete duplicate department values in the first field of the employee table without affecting other data items. To do this we select B3 and press Ctrl+Backspace to delete A3 and its homo-cells. Here’s what we get:
All homo-rows will change their structure at the same time. In the meantime formulas in C3 and its homo-cells will adapt themselves intelligently to the new structure. For instance, C3’s formula becomes =age(B3) automatically.
In esCalc we can merely change the structure of summary rows. For instance, to delete blank cells in the second column of the department summary rows, we select C2 and press Ctrl+Backspace to delete A2 and its homo-cells. Here’s what we get:
This is a membership management table:
The worksheet table records the number of new members and of those who leave each month. Enter ==D7+B8-C8 in D8 to calculate the number of members in the current month according to the number in the last month and the number of withdrawals in this month. Here a related calculation expression starting with two equal signs is used and data in corresponding cells will adjust intelligently according to any change of the table.
Now we insert the records of the missing months April and May in the table and enter data to them. Here’s the complete table:
Because esCalc stores the inserted data also in the form of homo-rows, calculations in column D will still be correctly done and the membership statistics will be automatically updated along with the change of the data. If formulas are changed according to positions of cells instead of their structure, errors will occur when new rows are inserted.
Non-related calculations
Cells in Excel calculate in a related way. That means once the value of a referenced cell changes, a calculation cell will re-calculate; and if the referenced cell is deleted, error will occur to the calculation cell.
But the more commonly seen scenarios are these: After values of a computed column are obtained, values of cells referenced by formulas become useless and deletable; or we may change the original value to be referenced by a computed column and then compare the new value and the old value, in which case the old value is expected to remain what it was. For instance, the original data contains persons’ birthdays. Sometimes only birthdays during a certain time period are needed to compute the ages in the subsequent computations. Thus the birthday values can be deleted after the ages are obtained; other times the birthday values are changed and ages are calculated according to the changed birthdays. It’s not easy to deal with both scenarios in Excel.
esCalc offers two types of calculation cells: related calculation cell and non-related calculation cell. The value of a related calculation cell will change along with the change of a referenced cell, as with in Excel; a non-related calculation cell becomes irrelevant to the referenced cell once it finishes calculation, and either the change or the deletion of the referenced cell value can’t affect its value.
In reality, there are more non-related calculations than related calculations during interactive data analysis.
This is the population table of the state of Alaska:
C8 and its homo-cells calculate the growth rate of every census for the state of Alaska. The formula in C8 is =round((B8-B7)/B7,3) and the display format is #0.0%. Now we select C2 and sort records by the growth rate in descending order. Here’s the result:
Since each formula in column C is headed by a single equal sign, C2 and its homo-cells are non-related calculation cells which will keep their values unchanged, rather than re-calculate to get the wrong growth rates according to the new order.
Grouping
Formulas and their copying
We mentioned in the preceding part that there are records with their sub-records. But in many cases the hierarchical records are generated from group operations.
The Excel data model doesn’t support multi-level worksheets. Though the group operation is provided, it is treated specially. Aggregate operations performed on the summary level after data grouping use SUBTOTAL, which is difficult to memorize, instead of more familiar functions like SUM/COUNT; otherwise group members won’t be correctly located.
As mentioned previously, for the formulas in cells at the level of detail data, on one hand we can’t simply use the drag-and-drop method to perform the batched copying (because detail data is inconsecutive data areas separated by summary rows) but can only perform the cross-group copying manually; on the other hand, when formulas reference cells at the summary level or involve cross-row calculations (such as calculations of percentages and YOY rate), even the manual operation can’t guarantee a correct copying according to the Excel rule of formula copying, and, moreover, manual modification of mistakenly copied formulas is needed. All the work is too tedious to bear when there are a lot of groups.
Excel provides the sign $ to reference summary cells in a one-level worksheet, but it becomes helpless when facing the multi-level worksheet.
esCalc has a data model that supports multi-level tables. Aggregate operations performed on the summary level after data grouping still use common functions like sum/count. Particularly, esCalc distinguishes the levels to which the cells belong, and handles the copying of formulas that reference cells both at detail data level (including inter-row reference) and at summary data level according to different situations. The intra-group copying only adjusts cells at the detail data level, while cross-group copying changes the cells at summary data level. What the esCalc users need to do is to reference a desirable cell intuitively, without having to distinguish different levels themselves using the sign $ (actually the sign can merely reference data from one level and falls short of the need). With esCalc, formulas can be correctly copied even calculations involve multiple levels of summary data.
Calculate the average temperature difference in each month, for example, based on the following sheet:
esCalc stores the same type of data in homo-rows. Thus as E3 calculates the average temperature difference in January, its homo-cells corresponding to other months calculate their respective average temperature differences at the same time, saving users the trouble of copying formulas. Here’s the result:
In the above data handling, the month data is sub-rows and their parent row is the quarter data. Calculations performed on the sub-rows won’t affect the parent row; and similarly, data handling in the parent row won’t affect the sub-rows. For instance, enter the formula ={A3}.count() in E2 to calculate the number of records in each quarter. Here’s the result:
esCalc formulas can be intelligently copied according to different data structures, instead of being mechanically copied according to positions of cells. The esCalc copying rule is more reasonable.
Another example is to calculate the precipitation based on the climate table:
Of which E2 and E6 respectively calculate the average precipitation of the current quarter, as E2’s formula ={D3}.avg(). To calculate the difference between the precipitation in each month and the average precipitation in the corresponding quarter, just enter the formula =D3-E2 in E3. Here’s the result:
Formulas have been intelligently adjusted during the copying according to the hierarchical level to which the target cell belongs. For instance, we click on E6 and know that the formula has been adjusted as ={D7}.avg(), which calculates the average precipitation of the current quarter; click on E8 and see the formula have been adapted as =D8-E6, which means subtracting the average precipitation value of the current quarter from the precipitation of the current month. So we can see that esCalc can correctly copy the formula to both a cell that sits on a group’s summary row and one that sits on a group’s detail row.
Post-grouping operations
That data grouping in Excel is special is also reflected by the difficulty in handling post-grouping operations. We can’t perform operations such as sorting and filtering freely on the grouped worksheet table as what we do with a single-level table.
For example, in order to find out ranks of sellers on performances, we want to group and aggregate the order records by sellers and then sort groups according to aggregate amounts. To do that we need to first perform group and aggregate and then the sort by aggregate values; during the sorting, members of a group need to move together with the aggregate value. But we can’t perform this kind of sorting automatically in Excel. In a modified version of this example, for each seller we want to delete the small orders, each of which makes up less than 1% of the seller’s total sales amount, and then re-calculate the total amount. This requires grouping data and calculating the percentage of each member in each group, and performing filtering on all groups by the percentages (here the non-related calculations discussed above will also be used). Excel can’t make it all at once due to its lack of support to operations on the multi-level worksheet; it can only handle the groups separately one by one.
esCalc sees a multi-level worksheet as normal, and makes it open to all operations. So it’s easy for esCalc to handle the above scenarios. In esCalc, during the sorting by aggregate values after data grouping, detail rows of a group will move together with their summary row, which is again an application of esCalc record conception (a group with its members as a whole can be regarded as a record). The post-grouping filtering on detail rows will be performed once and for all by copying all groups at one time.
Here’s the order table:
F1 calculates total sales amount of orders with the formula ={E3}.sum(). F3 calculates the percentage of each order’s amount in the total amount with the formula =round(E3/F1,4).
esCalc permits various operations on a grouped worksheet, such as filtering. To delete every order whose amount makes up less than 1% of the total sales amount, we select F3 to do the filtering:
Here’s what we get through data filtering:
Structure editing
Excel doesn’t support inserting or deleting a data level based on a grouped worksheet. To change the existing data structure, we need to clear groups and re-group data, making the work we did on the summary level (the calculated cells) a waste. Sometimes it is the summary values, instead of the details, that we desired.
In the worksheet in which the small orders have been removed, for example, we need to group the records by the ordered products to see which products are more popular among each seller’s non-small orders. To do this we need to insert another level of groups into the double-level grouped worksheet, and to aggregate and sort each group. As we are only interested in the group and aggregate results, we want to delete the detail level of data. But we can’t perform these operations automatically in Excel. We have to copy the intermediate results out into a new worksheet for further handling. Even worse is that since the grouped data is not continuous, even the copying action can’t be carried out automatically.
In esCalc, we re-group the preceding worksheet table by products, and here’s the result:
We can do further computations based on the re-grouped worksheet. To calculate the total sales amount for each seller, for instance, we enter the same formula ={E4}.sum() in both E2 and E3. Here’s the result:
We entered the same formulas in F1, E2 and E3 to calculate total sales amount, but we get different results because they are entered in cells that sit at different levels.
Now we select E2 to perform a sorting in descending order to sort the worksheet data by the total sales amounts of sellers. The result is as follows:
In esCalc, when grouping rows move because of sorting or other operations, their sub-rows will follow suit.
There’s nothing particular for esCalc to carry out these operations. Because it defines the hierarchical level as a nature of the worksheet, enabling free insertion or deletion of a level and automatic copying of an action operated on a row to all its homo-rows (similar concept to homo-cells). So all detail rows will be deleted simultaneously if we execute an action on a certain detail row.
The data model for esCalc spreadsheet encompasses a hierarchical structure, making grouping and ungrouping the normal operations that can be still performed on the same worksheet as filtering and sorting. Here’s an analogy between spreadsheet data models and a numerical system. Within the range of integers, we are free to do addition, subtraction and multiplication but we can’t do division at will, because the quotient isn’t necessarily an integer. But if we expand the range into the rational numbers, the division operation becomes naturally as well, though we need to redefine the rules for other operations in the expanded scope. Likewise, when esCalc extends the data model for worksheets to include a multi-level structure, it also redefines rules for carrying out sorting, filtering and generating computed columns (to support the smart copying of formulas across different levels, for instance). By doing so, related operations can be performed consecutively, ensuring an interactive data analysis to proceed smoothly.
Join
The JOIN is one of the most important SQL operations. It is used in scenarios such as obtaining attributes through codes (like using the product codes to get the producing areas and the unit prices) and multiple table alignments (like aligning both the allowance table and attendance table with the employee table)。
Excel uses Lookup functions to associate tables. They are similar to the SQL left join. SQL also has inner join, right join and full join, among which the inner join is implemented through filtering after the left join and the right join is the opposite operation of the left join with joining direction changed. The full join, however, can’t be performed automatically in Excel.
The biggest problem of Lookup functions is their complicated usage. They need to specify the joining column, the joining scope and the referenced columns, with only one referenced column for each look-up, and multiple Lookup statements using the same query condition for referencing multiple columns. Not only is the writing troublesome, but also the method has a poor performance due to repeated operations. In fact as a traversal-style query method, Lookup function are very inefficient in searching associated data.
Based on SQL model, esCalc supports the whole set of join operations including inner join, left join and full join, with multiple columns referenced at once from the associated table by specifying the associated cells in the two worksheet to be joined. This is much simpler than using the Excel method. To join the performance table and the attendance table, for instance, set master cells (i.e. the joining cells) and copy the to-be-referenced cells in the attendance table and paste them on the employee performance table using the JOIN operation.
Here’s the performance table, in which A2 and its homo-cells are set as the master cells where the employee numbers are stored:
Here’s the attendance table, which contains only the employees who have had absences, and in which A2 and its homo-cells are master cells holding the employee numbers:
To perform a join operation, select B2 in the attendance table and press Ctrl+C to copy, and then select E2 in the employee performance table and press Ctrl+Alt+J to choose and execute the Left join:
After that the resulting employee performance table is as follows:
The esCalc join operation also supports multi-level worksheet tables. For example, the employees are stored in groups according to their states, and the attendances are recorded in the same way. The multi-level join will first align tables according to the groups and then find the joining rows in each group. This way error won’t occur even there are employees with same names under different states and the result set will be obtained with detail group data kept neatly and completely.
Here’s the duty table, in which master cells hold state names and employee numbers:
Here’s the employee table, in which master cells also hold state names and employee numbers:
In this employee table, select B3 and C3 at the same time and copy the employee information, and then select C3 in the duty table and perform left join. Here’s the result:
Sets
A set is the basic concept of batched data. Excel is able to perform aggregate operations and create an association relationship (such as Lookup) based on a set of cell values. However, it doesn’t define the set data type. Sets can only exist during the calculation of a formula, and the set-typed results can’t be stored as cell values. It can’t perform many more set-oriented operations, like intersection, union, difference as well as searching and locating, too.
In fact, sets are a data type frequently seen in the process of analyzing data. Their absence makes many data analysis tasks difficult, even impossible. As a result, Excel users must write VBA code in order to accomplish the tasks. The problem is that is difficult and extremely inconvenient.
esCalc provides the set data type, and also various set functions and operations based specifically on sets, enabling users to carry out complex data analysis tasks.
Row-wise sets
The following Performance table records performance grade of each employee in each month. We want to find out employees who get at least three A continuously.
It’s simple to do it in esCalc. We just need to select any detail data row, such as the second row, and perform filter operation according to a filtering expression [B2:M2].group@o().pselect(~(1)==”A” && ~.len()>=3). [B2:M2] is a set consisting of cell values in row 2 from column B to column M. Based on this set, we perform a sorted-data grouping to get multiple groups, each having the same grade in continuous months. Check if there are groups holding three or more continuous A. We can see that each step involves the set-typed data.
Column-wise sets
There are same names in the employee name list, we want to delete the rows where extra names stay, while keeping the original order.
As what we do with the preceding example, we perform filter operation on the detail data rows (select row 2 again) according to a filtering expression {A2}.pos(A2)==#. {A2} represents a set consisting of homo-cells of A2 inclusive. Find the position where value of A2 first appears. If the position isn’t where A2 sits, the name is a repeated one and the corresponding row will be deleted.
Expand operation
Transpose the Student Score table
To a table like this:
We regard the operation that splits and expands one row into multiple rows as the inverse action of grouping. It’s almost impossible to realize this in Excel without turning to VBA, while merely a few steps will suffice in esCalc:
In column E, enter in E2 the formula =[B2:D2] to get sets composed of values in column B/C/D. Thus column E will have sets as its values.
esCalc provides expand operation performed based on a set. Now perform the operation on E2 to expand the row into multiple ones.
Add column F and fill it with subject names using the formula F2=[B2:D2]((#-1)%3+1). Here we get members from a set through their serial numbers.
Finally, delete column B/C/D and switch positions between column E and column F, and complete the headers.