SQL-style computations

SQL is intended for structured-data processing. Without doubt we can handle the standard structured-data computations, such as aggregation, grouping and joins in SQL. The query system of SQL is fully-functional, but its ability to handle complex computations falls short of being perfectly convenient. This is mainly because SQL fails to provide complete set orientation, lacks good support of discrete data and order-related operations, as well as doesn’t have an object-reference mechanism and the adequate dynamic model for data processing. As a result, it’s difficult to write and understand a SQL program in many cases. esProc has improved functionalities in these aspects, getting the ability of helping SQL with the complex computations in a more natural way.

Actually esPoc cannot and isn’t intended to replace SQL. After years of efforts made by database vendors, the implementations of the algorithms that are easy to express in SQL have almost achieved their perfection. esProc is unable and unnecessary to outperform SQL. So all we have here is the algorithms that are difficult to implement or that can be only expressed in a very roundabout way in SQL. With esProc, programmers can solve these computational problems effortlessly in a much simpler way.

As these tough, unsystematic SQL problems are hard to be classified, we select some typical types of scenarios for illustration.

Dynamic columns

SQL believes that columns are a part of data’s attributes and they’re static. That’s why there are no specialized SQL set functions for the handling of columns. Consequently, this becomes a headache in dealing with scenarios where the desired column data isn’t supplied or where a standard approach is needed to handle many columns.

Inter-column aggregation

PE is a table recording results of physical education. It has the following fields – name, 100m, 1000m, long-jump, high-jump, and … There are four grades – A, B, C and D – for evaluating the results. Now we need to calculate the numbers of persons in each grade for every event.

The algorithm is simple. We just need to union the results of all the events, group them and perform aggregates. In SQL, we use a long union statement to combine the results of all events. That’s really boring. It’s complicated if the columns are indefinite. We need to obtain the desired column names dynamically from the database to perform union.

esProc supports handling columns through set operations. The fully dynamic syntax makes coding simple and easy:

	A
1	=db.query(“select * from PE”)
2	=A1.conj(~.array().to(2,))	Concatenate the results for every event from the second field
3	=A2.groups(~:grade;count(1):count)	Grouping and aggregation

Standard approach for transpositions

For simple static transpositions, some databases supply pivot and unpivot statements to implement them. Databases that don’t support the statements can do this using complicated conditional expressions and union statement. But usually the columns transposed from rows are dynamic. To handle this In SQL, we need to generate the target columns and rows and then compose another statement dynamically to execute. The code is complicated and difficult to understand.

Usually data is transposed for presentation. We could leave the pure row-to-column transposition to reporting tools. But many reporting tools don’t support the handling of rows and that of columns equally, and they cannot perform the column-to-row transposition during the stage of data presentation.

The student scores table R consists of these fields – student, semester, math, English, science, pe, art and … We need to perform both the row-to-column transposition and column-to-row transposition to present data in a structure as this – student, subject, semester1, semester2 …

esProc offers pivot function to perform the simple transposition:

	A	B	C
1	=db.query(“select * from R”)
2	=A1.pivot@r(student,semester;subject,score)
3	=A2.pivot(student,subject;semester,score)

To achieve the two-way transposition, A2 performs column-to-row transposition and A3 performs row-to-column transposition.

There is also a standard method which is easier-to-understand yet slightly complicated:

	A	B	C
1	=db.query(“select * from R order by student,semester”)
2	=create(student,subject,${A1.id(semester).string()})
3	for A1.group(student)	for 3,A1.fno()	=A3.field(B3)
4			>A2.record(A1.student\|A2.fname(B3)\|C3)
5	return A2

A2 generates the target result set using a macro. The loop in A3 and A4 transposes rows and columns and insert the result in the result set, which is the standard procedure for performing transpositions in esProc. The stepwise approach makes code clear and easy-to-understand. The approach applies to static transposition or one-way transposition and the code would be even simpler. esProc’s column access scheme and its flexibility characteristic of a dynamic language enables programmers to handle all types of transpositions, including static/dynamic transpositions, row-to-column transposition, column-to-row transposition, two-way transposition, in one standard approach.

Complex transpositions

Here’s the account state table T:

seq	account	state	date
1	A	over	2014-1-4
2	A	OK	2014-1-8
3	A	lost	2014-3-21
…

We need to export the states of the accounts per day for a specified month. If there’s no record for an account on a certain date, then use the state of the previous date:

account	1	2	3	4	5	6	7	8	9	…	31
A				over	over	over	over	OK	OK	…	OK
…

Strictly speaking, the transposition is static. But it involves a lot of regular columns and is not easy to be expressed completely in a static way. It involves inter-column calculations as well, which are hard to be coded in SQL even using pivot statements.

We can easily get the job done according to the standard esProc way:

	A	B
1	=db.query(“select * from T where year(date)=? and month(date)=?”,2014,1)
2	=create(account,${to(31).string()})
3	for A1.group(account)	=31.(null)
4		>A3.run(B3(day(date))=state)
5		>B3.run(~=ifn(~,~[-1])
6		>A2.record(A3.account\|B3)
7	return A2

Here’s only one loop because it is the one-way transposition. In B3-B5, the calculation of getting data to be inserted to the result set according to esProc syntax is a little complicated. Yet the whole procedure is the same.

Data grouping

Non-equi-grouping

It is a common demand to group data according to ranges, such as exam grades (excellent, good …), age groups (young, middle-aged…), and etc.

It is always inconvenient to implement this type of scenarios in SQL. For the scenarios where there are only a few static ranges, we can use case when conditional statement. But for those involving many ranges or regular, dynamic ones, generally people create temporary tables and then use non-equi-joins. All these approaches are complicated.

esProc uses penum function to return the sequence numbers of the enumerated conditions:

[”?<60”,”?>=60&&?<75”, ”?>=75&&?<90”, “?>=90”].penum(score)

If the ranges are continuous, esProc also provides a simpler pseg function to get their sequence numbers:

[60,75,90].pseg(score)

All the conditions and ranges here are ordinary arrays. They can be passed in as parameters and there’s no limit to their lengths. According to the sequence numbers of the ranges, the enum grouping and grouping by continuous-ranges can be transformed to more familiar equi-grouping:

	A
1	[”?<60”,”?>=60&&?<75”, ”?>=75&&?<90”, “?>=90”]	conditional ranges, which can work as a parameter
2	[60,75,90]	Continuous ranges, which can work as a parameter
3	=db.query(“select * from R”)
4	=A3.groups(A1.penum(score);count(1):Counts)	Group data by conditional ranges
5	=A3.groups(A2.pseg(score);count(1):Counts)	Group data by continuous ranges

A non-equi-grouping-related scenario is the fixed-sorting. Often a specified order, instead of the natural data order, is required in presenting the data analysis result. For example, the Permanent Five heads the UN’s list of member states. The way SQL handles the fixed-sorting is similar to the way it groups data by ranges. If the sorting condition is static and very simple, we can use decode function to generate sequence numbers. If it contains many items or is regular and dynamic, we need to create the temporary table and use JOIN to generate the sequence numbers.

esProc specially offers align@s function to perform the alignment sorting:

T.align@s([“China”,”France”,”Russia”,”UK”,”US”,…],nation)

This way we can sort the nations in the table T according to the specified order. Being an ordinary data type, the sorting condition can also be passed in as the parameter:

	A
1	[“China”,”France”,”Russia”,”UK”,”US”,…]	Sorting condition, which can work as a parameter
2	=db.query(“select * from T”)
3	=A2.align@s(A1,nation)	Sort data according to the specified condition

Different from the equi-grouping without empty subsets, sometimes the grouping result is expected to be the continuous ranges. So the missing empty subsets need to be supplied. It’s troublesome to do this in SQL, because we need to first create a set of continuous ranges manually and then left join the data tables under processing. During the process we have to use complex subqueries. With the align function and the convenient design for generating the sequence used as the sorting condition, it’s easy for esProc to handle this type of scenarios.

Here’s a simple transaction record table T, including no, date and amount fields. We need to cumulate the transaction amounts week by week. The weeks without transaction records also need to be displayed.

	A
1	=db.query(“select * from T order by date”)
2	>start=A1(1).date
3	=interval(start,A1.m(-1).date)\7+1	Calculate the total number of weeks
4	=A1.align@a(A2,interval(start,date)\7)	Group records by weeks. There are empty sets probably.
5	=A4.new(#:week,acc[-1]+~.sum(amount):acc)	Aggregate the weekly amounts and calculate the cumulative amounts.

Grouped subsets

Without the explicit set data type, SQL cannot but perform the aggregate after data grouping. Except for the aggregate values, we may also take an interest in each data group. It would be difficult to handle these groups in SQL using subqueries.

With the support of set-type data and grouping functions that return subsets, esProc can easily handle post-grouping computations.

For example, if we want to find the records of all subjects for students whose total scores are above 500 in SQL, we need to group records to calculate the total score for every student, select the students whose scores are above 50, and then JOIN the resulting table with the original score table or find the desired records using IN statement, which also requires repeated data retrievals. The process is cumbersome. But in esProc, we can do this in a straightforward way:

	A
1	=db.query(“select * from R”)
2	=A1.group(student).select(~.sum(score)>=500).conj()

There are many scenarios that require the returning of the records of the subsets after data grouping. The group and aggregate operations are mere intermediate steps towards completing the desired query, rather than the goal. In fact there is a similar example below in which data is sorted according to group and aggregate values for report development

Though in some cases only the aggregate values are desired, the aggregate operations are difficult to be expressed in simple aggregate functions and the grouped subsets need to be retained for further computations.

This type of requirements is not uncommon in real-world computational problems. But as the computations are complicated and involve a lot of expert knowledge, it’s inconvenient to cite one of the real examples. Here’s an adapted one:

Suppose there’s a table L recording the user logins. It has two fields – user (ID) and login (time). We want to calculate the last login time of each user and the number of logins in 3 days before this time.

It’s easy to find the last login time, but it’s difficult to count the logins during the specified time period without the grouped subsets. The SQL algorithm is like this: Group records and find the last login times, perform JOIN with the original table and find records during the specified time period, and group and aggregate these records. The code is bulky and inefficient. esProc can retain the grouped subsets and thus can do this in a stepwise approach:

	A
1	=db.query(“select * from L”)
2	=A1.group(user;~.max(login):last,~.count(interval(login,last)<=3):num)

Of which ~ represents the subset obtained after data is grouped by user.

Here’s a more efficient way to get this done for ordered data:

	A
1	=db.query(“select * from L order by login desc”)
2	=A1. .group(user;~(1).login:last,~.pselect@n(interval(login,last)>3)-1:num)

Order-related aggregation

It’s also a common type of scenarios where we want to get the top N records and the records corresponding to the maximum value. Of course we can perform the computations using the retained grouped subsets. But as they are too common, esProc regards them as a kind of aggregation and provides a special function. So the way of handling them is basically the same as we handle the ordinary group and aggregate operations.

Let’s look at the simplest case. The user login table L has these fields – user, login (time), IP-address … Now we want to find the record of first login of each user.

SQL can use window functions to generate the sequence numbers after intra-group sorting and retrieve all the records whose sequence numbers are 1. But the window function can only be employed based on a result set, so we should write subqueries and then perform filtering. The code thus becomes a little complicated. For databases that don’t support window functions, it’s more difficult to do this.

esProc provides group@1 function to directly retrieve the first member of each group.

	A
1	=db.query(“select * from L order by login”)
2	=A1.group@1(user)

This type of log files is frequently seen and they are already ordered according to the time. esProc can get the first record directly without doing sorting. Cursor can be used to handle this if the data is too big to be entirely loaded into the memory.

The stock price table S has three fields – code, date and cp (closing price). Now we need to calculate the latest rate of increase of each stock.

The calculation involves records of the last two trading days. We need to use two levels of window functions respectively to perform intra-group inter-row calculation and retrieve the first row. The coding is complicated. esProc provides the aggregate function topN to directly return the desired records as aggregate values for further computations.

	A
1	=db.query(“select * from S”)
2	=A1.groups(code;top(2,-date))	Get the records of the last two trading days
3	=A2.new(code,#2(1).cp-#2(2).cp:price-rises)	Calculate the rate of increase

Instead of aggregating the grouped subsets, esProc aggregate functions will perform accumulation based on the existing values, achieving a better performance. They can also work with cursor to handle the big data that cannot be entirely loaded into the memory.

We can retrieve records according to their sequence numbers if data is already ordered. This is more efficient:

	A
1	=db.query(“select * from S order by date desc”)
2	=A1.groups(code;top(2,0))	Get the first two records directly
3	=A2.new(code,#2(1).cp-#2(2).cp:price-rises)

To find the records corresponding to the maximum value, and to get the first/last record are special cases of the topN-style aggregation.

Inverse grouping

Contrary to the group and aggregate operations, inverse grouping aims to split the aggregated data into detail data. The operation is rare, but it’s hard to handle it in SQL. Here’s one example.

The installment payments table I has these fields – no, sum, start and periods. We need to split each sum of loan into multiple payment records. The resulting table contains these fields – no, seq, date and amount. The total payment will be distributed to each period (a month) evenly.

It’s easy to aggregate detail data, but it’s difficult to split data. To generate the details in SQL, we would perform a JOIN across the source table and a sequence number table, or use the recursive query. Both are roundabout ways. esProc, however, is able to write the code in an intuitive way:

	A
1	=db.query(“select * from I”)
2	=A1.news(periods;no,~:seq,after@m(start,~-1):date,sum/periods:amount)

Order-related calculations

Cross-row reference

In the early days of SQL, the language doesn’t directly support cross-row reference. It would first generate the sequence numbers before performing a JOIN. The code is excessively difficult and overloaded. By introducing window functions later, SQL can reference data from other rows more easily. But the code is still far from being concise – bulky we can say – particularly when multiple items from other rows need to be referenced. As mentioned above, the window functions should be used based on the result set of an operation, and subquery is needed in order to reference the results of window functions. The code is as cumbersome as it used to be.

MySQL doesn’t support window functions, but it can make a backward reference using variables in SQL statements. Yet it cannot make a forward reference.

esProc provides a natural and easy-to-use syntax for cross-row references.

The monthly product sales table S has 3 fields – prod, month and sales. We need to find the records in which the sales have been increased by 10%.

	A
1	=db.query(“select * from S order by prod,month”)
2	=A1.select(if(prod==prod[-1],sales/sales[-1])>1.1)

We can use [-1] to reference the data of previous month after sorting, as well as perform filtering according to the results of inter-row calculations. In contrast, SQL window functions require subquery and MySQL needs to define two temporary variables.

Based on the above table, we want to calculate the moving average for each month’s previous month and next month:

	A
1	=db.query(“select * from S order by product,month”)
2	=A1.derive(if(prod==prod[-1]&&prod==prod[1],sales{-1:1}.avg()):moving-avg)

The calculation involves backward reference and the reference of a set. esProc uses [1] to reference data of the next record and {-1:1} to reference the set of field values of two neighboring records. With window functions, SQL still need to find the desired records using subquery before it can calculate the moving average. MySQL also cannot handle the computation directly because its variable is unable to reference forward.

Here’s another example. The simplified event table E has these fields – seq, time … The time should be synchronized with the sequence number as the latter increases. But errors may exist, and we want to find the records where the time and the sequence number don’t synchronize.

	A
1	=db.query(“select * from E order by seq”)
2	=A1.select(time!=max(time{:0})\|\|time!=min(time{0:}))	Compare each records with all records before and after it

esProc can get a set from the beginning to a certain point or from a point to the end. SQL window functions have the similar syntax, but to sort data in two directions for the two comparisons, they have to use subquery.

Order-related grouping

SQL only supports order-unrelated equi-grouping. Sometimes not all records have the same value or change in the same way for the grouping fields. Instead, the grouping is more related to the order of the records. In this case, we still have to use window functions (or other more inconvenient tools) to generate sequence numbers first if SQL is used.

esProc has the syntax for order-related grouping, making the computations related to continuous intervals more convenient.

The income & expense table B has three fields – month, income and expense. Find the records of 3 or above continuous months during which the income is less than the expense.

	A
1	=db.query(“select * from B order by month”)
2	=A1.group@o(income>expense).select(~.income<~.expense && ~.len()>=3).conj()

The group@ function compares only the adjacent records during grouping and creates a new group once the adjacent value changes. By comparing the income and expense between adjacent records, we can divide records into groups like profitable, unprofitable, profitable…, get the unprofitable groups that have not less than 3 months and concatenate them.

We also want to find according to this table the maximum number of months when income increases continuously. We can design an algorithm for grouping. That is, when the income increases, put the record and the previous one into the same group; when the income decreases, put the record into a new group; finally get the maximum of the numbers of group members.

	A
1	=db.query(“select * from B order by month”)
2	=A1.group@i(income<income[-1]).max(~.len())

The group@i function will create a new group when the grouping condition changes – that is when the income decreases.

SQL can handle both this scenario and the previous one with its window functions, but the code would be very hard to understand.

The merging of intervals is another common type of order-related grouping computations. The event interval table T has S and E fields. We want to find the real length of time the event takes by removing the overlap of the time intervals from these intervals.

	A
1	$select S,E from T order by S
2	=A1.select(E>max(E{:-1}))	Remove records where the time period is included
3	=A2.run(max(S,E[-1]):S)	Remove the overlap of the time intervals
4	=A2.sum(interval@s(max(S,E[-1]),E))	Calculate the total length of time
5	=A2.run(if(S<E[-1],S[-1],S):S).group@o(S;~.m(-1).E:E)	Merge the time intervals that overlap

In this part, we provide solutions for different types of scenarios, which take advantage of the esPoc features of handling inter-row calculations and order-related grouping. SQL cannot achieve these simple implementations through window functions, unless it makes use of the extremely difficult-to-understand recursive query.

Position-based access

Sometimes we want to use the sequence numbers to access members of an ordered set. SQL, which is based on the mathematical concept of unordered sets, will first generate the sequence numbers and perform conditional filtering before accessing members of the specified positions. This causes a lot of problems to many computations.

esProc, however, adopts the ordered sets mechanism, allowing accessing members directly with sequence numbers and bringing great conveniences.

For example, in analyzing economic data, people often need to find the median value for various prices:

	A
1	=db.query@i(“select price from T order by price”)
2	=A1([(A1.len()+1)\2,A1.len()\2+1]).avg()

Sequence numbers can be used in data grouping as well. The event table E has three fields – no, time and act. The act field includes two types of values – start and end. We want to calculate the total length of time the events take, that is, the sum of every time period that each pair of start and end defines.

	A
1	=db.query@i(“select time from E order by time”)
2	=A1.group((#-1)\2).sum(interval@s(~(1),~(2))

# represents the sequence number of a record. group((#-1)\2) means putting every two records in one group. Then calculate the length of total time for each group and perform aggregate.

We can make a cross-row reference according to sequence numbers. The stock price table S has two fields – date and cp(closing price). We want to find the trading days when the stock prices are above 100 and the rates of price increase on those days.

	A
1	=db.query(“select * from S order by date”)
2	=A1.pselect@a(cp>100).select(~>1)
3	=A2.new(A1(~).date:date,A1(~).cp-A1(~-1).cp:price-rises)

The pselect function returns the sequence numbers of the members satisfying the specified condition. According to the result, we can calculate the rate of increase easily. With window functions, we have to calculate the rates of increase for all days and perform filtering then.

String and date handling

Strings

Every database provides sufficient functions, even complicated parsing functions like regular expressions, to handle string splitting, string concatenation and other operations that don’t use sets. esProc encapsulates all of these too, but on top of that the dynamic language offers the functionality of using strings as expression in computations.

The problematic string handling scenarios in SQL mainly involve the inverse grouping. SQL has great trouble in splitting a separator-segmented string into multiple records or a set for further handling, due to its lack of explicit sets.

It’s relatively easy to concatenate field values into a string during grouping and aggregation. MySQL has group_concat function to do this, and other databases also provide similar slightly complicated functions.

Here’s a simple string concatenation task. The student table S has 3 fields – class, name and sex. We need to group the table according to classes and write the names of boys and girls respectively as comma-separated strings in alphabetically order.

	A
1	=db.query(“select * from S”)
2	=A1.group(class; ~.select(sex==’M’).(name).sort().string():males, ~.select(sex==’F’)(name).sort().string():females)

With set-type data, esProc can rig up various string operations without having to use the specialized string concatenation functions.

Normally, string splitting will go hand in hand with the generation of multiple records, like the inverse operation of the above example. Suppose we want to convert the class table C, which contains the class, males and females fields, to the student table S, which contains class, name and sex:

	A
1	=db.query(“select * from C”)
2	=create(class,name,sex)
3	>A1.run(A2.insert(0:males.array(),A1.class,~,”M”),A2.insert(0:females.array(),A1.class,~,”F”))

Unlike SQL that provides complicated solutions including recursive query and doing a JOIN with the table to compare, esProc just splits the strings into sets and then generates corresponding records.

Sometimes the purpose of splitting a string is to perform set operations. The book table B has book field and author field. The latter has comma-separated multiple-person string values. We want to find records where the same team of authors writes at least two books. The order of the author names is indefinite.

	A
1	=db.query(“select * from B”)
2	=A1.group(author.array().sort()).select(~.len()>1).conj()

After splitting a string into a set and performing sorting, they can be used as grouping field. Then we can perform the rest of the computation as we always do.

Dates

Databases have no problem with the average handling of single date values. But similar to string processing they are not convenient-to-use for dealing with date splitting or the generation of a sequence of dates. The root of the problem is that SQL only goes halfway in its orientation towards set.

The travel log table T has these fields – client, start, end… We want to find the top five days that seen the most travelers.

To do this we should convert the time period between the staring date and the ending date to a set of separate dates and then group and aggregate the records.

	A
1	=db.query(“select start,end from T”)
2	=A1.conj(periods(start,end)).groups(~:date,count(1):count)
3	=A2.sort(count:-1).to(5)

With set-based solution to date splitting, it’s easy for esProc to do it.

It’s complicated to generate a sequence of dates, mainly because of the characteristics of the date values. It’s particularly thorny when it is accompanied by an inverse grouping operation.

The event table T has 3 fields – I (event), S (starting date) and E (ending date). We need to break apart the time interval between the staring date and the ending date by months in order to generate multiple records. The first month and the last month begins and ends respectively at the starting date and ending date, while the months in between have all their days.

	A
1	=db.query(“select I,S,E from T”)
2	=A1.news(interval@m(a=S,b=E)+1;I, max(a,pdate@m(after@m(a,~-1))):S,min(b,pdate@me(after@m(a,~-1))):E)

The pdate@m function and @me options find the starting date and ending date of a month respectively. The after@m function obtains the date which is a certain number of months after the given date; it can automatically adjust the new date to the last day of the desired month and is handy in generating a monthly interval.

esProc

esProc

Text-file manipulation