In-memory computing

To put it simply, in-memory computing refer to the data processing methods of importing data into the memory to achieve higher performance by exploiting the memory’s high speed processing capability.

The computational method is quite popular these years. The memory capacity of contemporary computers can reach to scores and hundreds of gigabytes, even terabytes. This makes it possible that the data of some applications can be entirely loaded into the memory. In-memory computing is particularly suitable for handling transactions of online report query. Actually, except for searching for certain data using the index, all traversal-style computational tasks involving an amount of data that exceeds the maximum memory capacity that the contemporary computers can have can’t be guaranteed an instant response by adopting the external-memory-based methods. 

Pointerized foreign keys

Not only does the memory have a better query performance, the more important thing is that it also has an excellent support of random concurrent data retrievals, that is – the free access of a small block of data at any address in constant time. We can use this feature to perform fast foreign-key joins.

When data is initialized and loaded into the memory, we can convert the foreign key association between data tables into the memory pointer. For example, here are the simplified Products table and Sales table of a supermarket:

Products table: Productid, Name, Supplier, Category, UnitPrice

Sales table: Number, Time, Productid, Quantity

As a common data structure goes, the Sales table’s Productid is the foreign key pointing to the Products table.

Suppose we want to calculate the total sales amount. To do this, get the UnitPrice of the product in each sales record from the Products table according to its Productid, multiple the UnitPrice by Quantity, and then calculate the sum of these products.

This algorithm will query the UnitPrice for many times (which is equal to the number of Sales records). If the query is performed through traversal, the computation will be rather complicated. We can create index for the Products table to increase the query speed, but still the result isn’t satisfactory. Now databases adopt the hash algorithm to align the keys of the two tables. The computation indeed becomes much faster, but the implementation is complicated. Besides, a large number of HASH computations and comparisons are involved.

If we transform the Productid of Sales table to pointers pointing to product records after data finishes loading, comparing actions won’t be needed anymore and the unit price of each Product can be directly accessed. This is time-saving because repeated querying actions are not necessary. Although the conversion to pointers still needs the queries, they will be carried out only once. Once the pointers are created, computation will be performed in a high speed.

esProc has the mechanism of creating pointers and using pointers to access records. Take the file data source as an example, the code for creating pointers and the succeeding computations is as follows:

1 =file(“Products.txt”).import() Import the Products table
2 =file(“Sales.txt”).import() Import the Sales table
3 >A2.switch(productid,A1:id) Create a pointer association, and switch over Productid to pointers
4 =A2.sum(quantity*productid.price) Calculate total sales amount by referencing the product’s unit price with pointers

In practice, the loaded data may be shared by multiple subtasks, for which the code will be a little different.

Here’re two field tests showing that the use of pointers can achieve better performance. The first is a group and aggregate operation without joins and the second performs the group and aggregate operation on five tables with multi-level foreign key joins. Both have an equal amount of data that can be loaded into the memory. We handle them in esProc and Oracle respectively and these are the results: 

  esProc Oracle
Single table without joins 0.57s 0.623s
Five-table foreign key joins 2.3s 5.1s

According to the results, esProc and Oracle have nearly equal performance when there isn’t a foreign key join. But with the foreign key join, esProc is over one time faster than Oracle.

The mere importing of data into the memory without creating pointers won’t have the benefit of the pointers. By the way, some in-memory database products using SQL syntax can’t use this approach, because the SQL model is external-memory-computing-oriented, and doesn’t support converting the foreign key of records to pointers. 

Sequence-number-style foreign key

For computations that involve a foreign key join, if there is too much data to be loaded into the memory altogether (this is a not a suitable case for online report query), we can choose to import only part of the data into the memory.

Here we’ll still use the above example. Compared with the ever increasing sales records, the product records are less and increase slower. That is to say, in a set of tables, the dimension table pointed by the foreign key is smaller but the fact table is much bigger. In this case we probably can load all dimension tables into the memory and put the fact data in the external memory which can be accessed through traversal. Often a dimension table is shared by multiple tasks and a reusable index can be created for it after data importing. Some humanly treatment of the in-memory data could lower the external memory accesses.

Below is the code for the algorithm that loads only the dimension table into the memory:

1 =file(“Products.txt”).import().primary@i(id) Import the Products table and create index
2 =file(“Sales.txt”).cursor() Create cursor of the sales table and prepare for traversal
3 =A2.switch(productid,A1:id) Create pointers based on the cursor and prepare for traversal
4 =A3.groups(;sum(quantity*productid.price)) Traverse records to calculate sales amount, while pointers can be used to reference records

In this case, however, the pointers for associating tables should be generated on the spot during the traversal, rather than in advance. Here esProc also uses HASH algorithm to search the matching records. Obviously performance is not as good as that in the previous example where all data is imported into the memory.

In order to make the generation of pointers more efficient, we can use the sequence-number-style foreign key, which can be obtained by converting the foreign key values to integers. Though tedious and time-consuming, the job is a one-off. Once it is finished, the records referenced by the foreign key can be located directly with those sequence numbers in the subsequent computation, without having to calculate and compare HASH values. This approach enables the in-memory algorithm where only the dimension table is loaded into the memory to achieve a high performance driven by the pointerized foreign key.  

1 =file(“Products.txt”).import() Import the Products table
2 =file(“Sales.txt”).cursor() Create cursor according to the sales records represented by sequence numbers
3 =A2.switch(productid,A1:#) Generate the joining pointers by locating records with sequence numbers and prepare for traversal
4 =A3.groups(;sum(quantity*productid.price)) Perform calculation to get result

Note that, after the foreign key is transformed into sequence numbers, we shouldn’t insert data into or delete data from the dimension table in case the reference of the foreign key becomes confused. The method is suitable for handling the unchangeable historical data, but we’ll risk confusing data or face complex data management if using it to handle the changing current data.

Under the traditional database mechanism, all tables are logically equal, without distinguishing the dimension table and the fact table. JOIN operations will automatically decide which table – generally it is the smaller one – should be loaded into the memory. A single operation sees no obvious change in performance, if compared with esProc. Yet with multiple operations the data may be repeatedly loaded for creating HASH indexes, and performance will be compromised. On the other hand, the records in a database are not ordered. The database will still use HASH algorithm to locate records even if programmers intervenes to transform the foreign key values into sequence numbers. So preparing data beforehand won’t improve performance. 

Memory utilization ratio

Always there’s lots of non-data management information stored in Java objects. After the data is loaded into the memory and objectified, it will use 3-5 times more space (measured by bytes). And yet a Java application can merely process the objectified data. As a consequence, the memory can’t be fully exploited.

esProc Server designs a scheme to reduce memory usage by delaying the objectification. It is like this: Compactly load data in the form of an array of bytes (this way the space usage of the memory and external memory is nearly the same), and the data will only be objectified when it is referenced.

Through this scheme of data import and objectification, the memory utilization ratio will be greatly increased. The price is that it takes time to objectify the data to be referenced. But it doesn’t have to read data from external memory – the speed of disk reading is much slower after all; and supports random access at the same time. Of course this is not as efficient as loading all data as the memory object, but it has a big edge compared with the external memory computing.

esProc Server provides creating a byte table to realize this scheme, with the efficient pointerized foreign key reference, as well as the sequence-number-style reference, retained. 

1 =file(“Products.dat”).create().primary@i(id) Create a byte table and an index based on the file
2 =file(“Sales.dat”).cursor@b() Create cursor of the Sales table and prepare for traversal
3 =A2.switch(productid,A1) Use a byte table as an ordinary memory table
4 =A3.groups(;sum(quantity*productid.price)) Perform same operations to get the result

The sequence-number-style reference also applies to a byte table. But a byte table must be created based on a binary file (discussed below), instead of a text file. 

Multithreaded parallel processing

The multithreaded parallel processing can make the most of multiple CPUs or the multicore CPU of contemporary computers. esProc provides clean syntax for doing multithreading, together with the use of in-memory cursor, to make it easy to develop multithread programming. 

  A B  
1 =file(“Sales.dat”).create() Import the source data as a byte table
2 fork 4 =A1.cursor(A2:4) Perform parallel processing by dividing the table into 4 segments and creating an in-memory cursor for each segment
3   =B2.groups(;sum(amount):a) Traverse each cursor to sum up amounts
4 =A2.conj().sum(a) Concatenate the results returned from each thread

In this piece of code, fork statement is responsible for launching multiple threads to execute the subsequent code and for collecting results of each thread. Cursors are used respectively to retrieve only the needed data for each thread to avoid too much memory usage caused by retrieving out all records at once. It is suitable to use cursors to handle both a byte table and an ordinary objectified in-memory table.

Different from the more basic programming languages such as Java and C++, esProc doesn’t supply a mechanism for grabbing and synchronizing shared resources between multiple threads. The concept of esProc is that threads start simultaneously and are aggregated collectively but each executes independently and separately. Apparently, it’s impossible to express all parallel algorithms, especially the real-time logic, with the esProc mechanism. But the mechanism suffices for performing data analysis and preparation. esProc removes some unnecessary abilities to increase the usability significantly.

It’s not that the more the parallel threads the better the performance. Basically the threads should not outnumber CPUs (or cores); otherwise the extra threads are just logical but won’t be in effect.

esProc also provides built-in parallel options for certain functions, making it simple to perform a parallel computation. 

1 =file(“Sales.txt”).import() Retrieve data out
2>1000) Filter data with multiple threads
3 =A2.sort@m(amount:-1) Sort data with multiple threads

The @m option automatically decides how many parallel threads will be launched according to the current configuration. Note that the option can’t retrieve records in certain order, so it shouldn’t be used for order-related computations.