High Performance Computing Database

What's esProc

Offers SPL Base, a professional high-performance computing database
Uses SPL (Structured Process Language) as its built-in programming language
Features high performance

What does esProc deal with?

Slow batch processing

Batch jobs run so long that they turn into a marathon, leaving no time to rerun them if something fails as the deadline looms.

Slow query

Building a simple report takes 10 minutes or longer and a great deal of patience. Under high concurrency or over a long time span, the query becomes practically impossible.

Slow response

Join operations and the drag-and-drop interface are slow; the pre-aggregation module takes up too much space yet leaves too many functional blind spots.

High cost

Expensive in-memory databases that still perform poorly

Whether you run a single machine or a cluster, a well-established data processing product or a newcomer, MPP or Hadoop, esProc can increase average performance by several to ten times!

Why traditional processing technologies suck

Factors that determine computational performance

Computational efficiency depends on both hardware and software
Algorithmic efficiency constitutes software performance
Algorithmic efficiency is determined by both algorithm design and implementation
An unimplementable great algorithm amounts to inefficiency
Programming languages without a mechanism to achieve high performance are the enemy of efficient algorithms

An optimal algorithm is useless before it is implemented

Underlying cause

The prevalent structured data

Structured data generated by businesses across all industries is the main target of data computing
Expanding memory and building large clusters are the main methods used to speed up structured data computations
The essence of both methods is to scale out or scale up the hardware capability
Software core competencies are dominated by the relational algebra-based SQL
The SQL design is not suitable for achieving high-performance algorithms

Highly performant structured data computations are impossible in SQL

Key issues

SQL problems

SQL's inability to implement high-performance algorithms stems from deficiencies of its theoretical base (relational algebra)
It's impossible to make up for theoretical defects through engineering design

【Example】How to get top 10 from 1 billion rows in SQL?

In theory, SQL sorts all rows and then takes the top 10, which is extremely slow
A fast algorithm exists that does not need to sort all rows, but SQL cannot express it
Automatic optimization by the database engine is the only way out, but it cannot be relied on in complex scenarios
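The fast algorithm referred to above can be sketched in a few lines. This is a generic illustration in plain Python, not esProc's implementation; the function name `top_n` is hypothetical. It keeps only the N best values seen so far in a small heap, so the full data set is traversed once and never sorted.

```python
import heapq

def top_n(values, n):
    """Return the n largest values, largest first, using one pass and O(n) memory."""
    heap = []  # min-heap of the best n values seen so far; heap[0] is the weakest
    for v in values:
        if len(heap) < n:
            heapq.heappush(heap, v)
        elif v > heap[0]:
            heapq.heapreplace(heap, v)  # evict the weakest candidate, keep v
    return sorted(heap, reverse=True)   # only n values are ever sorted

if __name__ == "__main__":
    # Works on a stream; the input is consumed once and never held in memory.
    print(top_n((x * 7919 % 100_003 for x in range(100_000)), 10))
```

Each row costs at most one comparison plus an occasional heap update on N items, so the work grows roughly linearly with the row count instead of following the cost of a full sort.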

A great algorithm that is unimplementable is useless

Java problems

Java can implement high-performance algorithms, but the code is too complicated to be practical

Essence of high performance

SQL – High-performance algorithms can be designed but cannot be expressed

Java – High-performance algorithms can be implemented, but only in overly complicated ways

In essence, high performance comes from high development efficiency

Able to design & Easy to write

How esProc stands out

esProc's innovative computing system delivers high performance

【Analogy】1+2+3+…+100=?

Others

1+2=3
3+3=6
6+4=10
10+5=15
15+6=21
…

Gauss

1+100=101
2+99=101
…
There are fifty 101s
50*101=5050

Gauss was clever enough to think of the efficient solution. But note that multiplication had already been invented by then!
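The two approaches in the analogy can be written side by side. This is only an illustration of the arithmetic in plain Python; both function names are hypothetical.

```python
def sum_by_addition(n):
    """The step-by-step way: n - 1 additions."""
    total = 0
    for k in range(1, n + 1):
        total += k
    return total

def sum_by_pairing(n):
    """Gauss's way: pair first with last, so the sum is n * (n + 1) / 2."""
    return n * (n + 1) // 2

if __name__ == "__main__":
    print(sum_by_addition(100), sum_by_pairing(100))
```

The second function is only possible because multiplication is available as a primitive, which is the point of the analogy.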

Previous example: How to get top 10 from 1 billion rows in SQL?

Relational algebra-based SQL is like the system of arithmetic that has "addition" only, while esProc SPL invents "multiplication"!

esProc offers more types of "multiplication" (high performance computing & storage database) to enable everyone to become Gauss (to achieve high performance algorithms fast and efficiently)

Discrete dataset model

Innovative algebraic system empowers esProc with high performance

The discrete dataset model is the "multiplication" esProc creates

Set-lization
Discreteness
Deep set-lization
Orderliness

These features make high-performance algorithms easy to code and practical to implement

Simple agile syntax

Computing goal: the largest number of days when a stock rises consecutively.

select max(continuousDays)-1
from (select count(*) continuousDays
	from (select sum(changeSign) over(order by tradeDate) unRiseDays
		from (select tradeDate,
			case when closePrice>lag(closePrice) over(order by tradeDate)
			then 0 else 1 end changeSign
		from stock) )
	group by unRiseDays)

SQL solution

A three-level nested query is needed even though SQL uses window functions.

But can you understand it?

	A
1	=stock.sort(tradeDate)
2	=0
3	=A1.max(A2=if(closePrice>closePrice[-1],A2+1,0))

SPL solution

Following the natural way of thinking: sort rows by trading date (line 1), compare each closing price with the previous one, add 1 to the counter if it is higher and reset the counter to 0 otherwise, and then take the largest counter value (line 3)
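For readers who do not know SPL, the same step-by-step logic can be sketched in plain Python. This is an illustration only; the function name and the `(trade_date, close_price)` tuple layout are assumptions, not part of the original example.

```python
def longest_rising_streak(rows):
    """rows: iterable of (trade_date, close_price). Returns the longest run of rising days."""
    ordered = sorted(rows, key=lambda r: r[0])  # sort by trading date
    best = streak = 0
    prev_close = None
    for _, close in ordered:
        # Add 1 when the price is higher than the previous one, otherwise reset to 0.
        if prev_close is not None and close > prev_close:
            streak += 1
        else:
            streak = 0
        best = max(best, streak)
        prev_close = close
    return best

if __name__ == "__main__":
    prices = [10, 11, 12, 11, 12, 13, 14, 13]
    rows = [(f"2024-01-{d:02d}", p) for d, p in enumerate(prices, start=1)]
    print(longest_rising_streak(rows))
```

The running counter plays the role of cell A2 in the SPL grid, and the reference to the previous row replaces the `lag` window function and the nested grouping used in the SQL version.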

esProc mechanisms for achieving high performance algorithms & storage

Memory search

Binary search
Sequence-number-based location
Position indexing
HASH index
Sequence-number-based location on multilayer data

External storage data set

Text file segmentation
Bin file & double increment segmentation
Special data types
Composite file & columnwise storage
Order-related patch file
Data update & multi-zone composite table

External storage search

Binary search
HASH index
Sorting index
Rowwise storage & value-attached index
Index preload
Batch searching
Searches that return a set
Multi-indexed merge
Full-text search

Traversal technology

Cursor filtering
Traversal reuse
Multi-threaded traversal
Multi-threaded database load
Multicursor
Traversal-based grouping & aggregation
Aggregate essence application
The redundant grouping key

Order-based traversal

Order-based grouping & aggregation
Ordered post-grouping subsets
Programming cursor
Grouping based on ordered first-half table
Grouping based on ordered second-half table
Sequence-number-based grouping & controllable segmentation
Index-based sorting
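As a rough illustration of why order helps grouping, the sketch below groups records that are already sorted by the grouping key. It is generic Python, not esProc's implementation, and the function name and `(key, amount)` layout are assumptions. Because equal keys are adjacent, each group can be finished and emitted as soon as the key changes, without a hash table holding all groups.

```python
from itertools import groupby
from operator import itemgetter

def ordered_group_sum(records):
    """records: iterable of (key, amount), already sorted by key. Yields (key, total)."""
    # groupby only compares neighbouring keys, so memory use is one group at a time.
    for key, group in groupby(records, key=itemgetter(0)):
        yield key, sum(amount for _, amount in group)

if __name__ == "__main__":
    data = [("a", 1), ("a", 2), ("b", 5), ("c", 1), ("c", 1)]
    print(list(ordered_group_sum(data)))
```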

Foreign key join

Foreign key converted to addresses
Values temporarily converted to addresses
Numbered foreign key
Special inner join syntax
Index reuse
Alignment sequence
Big dimension table search
One table-based grouping
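The idea behind converting foreign keys to addresses can be sketched as follows. This is a generic Python illustration under assumed record layouts (`product_id`, `qty`, `price`, `category` are hypothetical field names), not esProc's implementation. The key lookup is done once; afterwards every computation follows a direct reference to the dimension record.

```python
def link_foreign_keys(orders, products):
    """Attach to each order a direct reference to its product record."""
    by_id = {p["id"]: p for p in products}           # index built once
    for o in orders:
        o["product"] = by_id.get(o["product_id"])     # None when there is no match
    return orders

def revenue_by_category(orders):
    """Aggregate using the stored references; no key lookup happens here."""
    totals = {}
    for o in orders:
        p = o["product"]
        if p is not None:
            totals[p["category"]] = totals.get(p["category"], 0) + o["qty"] * p["price"]
    return totals

if __name__ == "__main__":
    products = [{"id": 1, "category": "toys", "price": 2.0},
                {"id": 2, "category": "books", "price": 5.0}]
    orders = [{"product_id": 1, "qty": 3}, {"product_id": 2, "qty": 1}]
    print(revenue_by_category(link_foreign_keys(orders, products)))
```

Once the references are in place, any number of later calculations can reuse them without repeating the join.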

Merge & join

Order-based merge
Segment-based merge
Join-based location
Attached table
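A minimal sketch of an order-based merge join is shown below. It is generic Python, not esProc's implementation, and it assumes both inputs are sorted by a key that is unique on each side (a one-to-one join on primary keys); the function name is hypothetical.

```python
def merge_join(left, right):
    """Inner-join two lists of (key, value), each sorted by a unique key."""
    out = []
    i = j = 0
    while i < len(left) and j < len(right):
        lk, rk = left[i][0], right[j][0]
        if lk == rk:
            out.append((lk, left[i][1], right[j][1]))
            i += 1
            j += 1
        elif lk < rk:
            i += 1   # left key has no partner yet; advance left
        else:
            j += 1   # right key has no partner yet; advance right
    return out

if __name__ == "__main__":
    print(merge_join([(1, "a"), (2, "b"), (4, "d")], [(2, "x"), (3, "y"), (4, "z")]))
```

Each input is read once from start to end, so the join needs no hash table and works equally well on data streamed from disk.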

Multidimensional analysis

Partial pre-aggregation
Pre-aggregation over a specific period
Redundancy-based sorting
Alignment sequence
Tagged dimension
Unwanted change of memory tags

Cluster

Task & data distribution
Cluster multi-zone composite table
Duplicate dimension table
Segmented dimension table
Redundancy-based fault tolerance plan
"Spare tire" fault tolerance plan
Multitask load balancing

Lots of original algorithms!

High performance algorithm examples

Aggregate essence application

	A
1	=file("data.ctx").create().cursor()
2	=A1.groups(;top(10,amount))	Get orders whose amounts are in top 10
3	=A1.groups(area;top(10,amount))	Get orders whose amounts are in top 10 in each area

Transform complex full sorting to simple aggregation
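What "top N as an aggregation" means for the grouped case can be sketched in plain Python. This is an illustration of the general idea, not of how `groups` or `top` are implemented in esProc; the function name and the `(area, amount)` layout are assumptions. Each group keeps a small running state, exactly as a sum or count would, and no group is ever fully sorted.

```python
import heapq
from collections import defaultdict

def grouped_top_n(records, n):
    """records: iterable of (area, amount). Returns {area: n largest amounts, largest first}."""
    heaps = defaultdict(list)  # per-group aggregate state: a min-heap of at most n values
    for area, amount in records:
        h = heaps[area]
        if len(h) < n:
            heapq.heappush(h, amount)
        elif amount > h[0]:
            heapq.heapreplace(h, amount)
    return {area: sorted(h, reverse=True) for area, h in heaps.items()}

if __name__ == "__main__":
    rows = [("east", 5), ("west", 9), ("east", 7), ("east", 1), ("west", 2)]
    print(grouped_top_n(rows, 2))
```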

Traversal reuse

	A
1	=file("order.ctx").create().cursor()	Ready for traversal
2	=channel(A1).groups(product;count(1):N)	Configure a subordinate computation at traversal
3	=A1.groups(area;sum(amount):amount)	Traverse records to perform grouping and get the result
4	=A2.result()	Get result of the subordinate computation

One traversal returns multiple result sets
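The effect of traversal reuse can be illustrated in plain Python: two independent groupings are updated while the data is read once. This sketch does not show how `channel` works internally; the function name and field names are assumptions.

```python
from collections import Counter, defaultdict

def group_in_one_pass(orders):
    """orders: iterable of dicts with 'product', 'area', 'amount'. Reads the input once."""
    count_by_product = Counter()
    amount_by_area = defaultdict(float)
    for o in orders:                                # single traversal
        count_by_product[o["product"]] += 1         # subordinate computation
        amount_by_area[o["area"]] += o["amount"]    # main computation
    return dict(count_by_product), dict(amount_by_area)

if __name__ == "__main__":
    stream = (o for o in [
        {"product": "pen", "area": "east", "amount": 10},
        {"product": "ink", "area": "east", "amount": 20},
        {"product": "pen", "area": "west", "amount": 5},
    ])
    print(group_in_one_pass(stream))
```

When the data lives on disk, reading it is usually the dominant cost, so computing several results per read saves close to one full scan for each extra result.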

High performance storage

Proprietary data storage format/bin file/composite file

File system storage

Store data by categories in tree structured directory system

Bin file

Double increment segmentation enables any number of parallel threads
Exclusive high efficiency compact coding mechanism (space- and CPU-efficient, high-security)
Generic storage model supports set type data

Composite Table

Mixed columnwise & rowwise storage system
Order-based storage enhances compressibility and positioning performance
High efficiency automated index
Double increment segmentation enables any number of parallel threads
Unification of primary and sub tables reduces storage and join load
Serial number type key values enable high efficiency position-based join
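A position-based join can be sketched as follows, assuming the dimension key is a 1-based serial number and the dimension records are stored in key order. This is a generic Python illustration with hypothetical names, not esProc's storage format.

```python
def position_join(facts, dimension):
    """facts: list of (dim_no, amount) with 1-based serial numbers.
    dimension: list in which the record with serial number k sits at index k - 1."""
    # The key itself is the address, so no hashing or searching is needed.
    return [(dimension[dim_no - 1], amount) for dim_no, amount in facts]

if __name__ == "__main__":
    regions = ["north", "south", "east"]
    sales = [(3, 12.5), (1, 4.0), (3, 1.0)]
    print(position_join(sales, regions))
```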

Distributed computing

Fault-tolerant storage & computing techniques

esProc offers two data storage designs – Redundancy-based plan for external data & "Spare tire" plan for memory data

The fault-tolerant technique for computing automatically reassigns the subtask(s) on a malfunctioning node to an available node for processing
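The reassignment behaviour described above can be sketched in plain Python. Everything here is hypothetical: the function name, the `execute(node, task)` callback, and the use of `ConnectionError` to signal a malfunctioning node are assumptions made for illustration, not esProc's cluster API.

```python
def run_with_reassignment(subtasks, nodes, execute):
    """Run each subtask on some node; if the node fails, retry on another available node."""
    alive = list(nodes)
    results = {}
    for i, task in enumerate(subtasks):
        while alive:
            node = alive[i % len(alive)]        # simple rotation over available nodes
            try:
                results[task] = execute(node, task)
                break
            except ConnectionError:
                alive.remove(node)              # stop using the malfunctioning node
        else:
            raise RuntimeError("no available node left")
    return results

if __name__ == "__main__":
    def execute(node, task):
        if node == "n2":                        # pretend n2 is down
            raise ConnectionError(node)
        return f"{node}:{task}"

    print(run_with_reassignment([0, 1, 2, 3], ["n1", "n2", "n3"], execute))
```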

Controllable data distribution

The user-defined data distribution and redundancy plan tailored to suit the current data characteristics and computing situation considerably reduces the volume of data transmitted across nodes and thus increases performance

Centerless architecture avoids task failure due to single node malfunction

The centerless cluster system lets programmers manage computing nodes through coding

Load balancing design

The design decides whether to assign a subtask to a node according to its workload (number of threads on it), which ensures balance between workload and resources
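A workload-based assignment rule can be sketched like this. The source only states that the decision depends on the number of threads running on a node; the per-node capacity, the "least relative load" rule, and all names below are assumptions added for illustration.

```python
def assign_subtasks(subtasks, capacity):
    """capacity: {node: max concurrent threads}. Returns (assignment, pending)."""
    load = {node: 0 for node in capacity}       # threads currently running per node
    assignment, pending = {}, []
    for task in subtasks:
        candidates = [n for n in load if load[n] < capacity[n]]
        if not candidates:
            pending.append(task)                # every node is full; wait for a free slot
            continue
        node = min(candidates, key=lambda n: load[n] / capacity[n])
        load[node] += 1
        assignment[task] = node
    return assignment, pending

if __name__ == "__main__":
    print(assign_subtasks([1, 2, 3, 4], {"a": 2, "b": 1}))
```

A subtask is handed to a node only while that node still has spare threads, which keeps the workload in line with each node's resources.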

esProc computing performance test

* Test machines:
Intel3014 1.7G/12core/64G storage
FT-1500/16core/32G storage
MIPS/8core/64G storage
Intel2670 2.6G/16core/128G storage
FT-2000/64core/256G storage

Application scenarios

SPL Base architecture

Typical application scenarios for SPL Base

Online query & analysis
Interactive analysis & retrieval
Offline regular batch processing

Online query & analysis

【Feature】 Concurrency-intensive; potentially complex computing tasks; instant response (in seconds); big data cluster computing

Interactive analysis & retrieval

【Feature】 No concurrency; weak demand for real-time response; step-by-step computing mode

Offline regular batch processing

【Feature】 No concurrency; no demand for real-time response; huge amount of data; strict time window

esProc high performance Q&A

Does esProc store data by itself?

Absolutely yes! A high-efficiency storage plan is a prerequisite for great performance. Neither RDBs nor Hadoop can achieve high performance, because of their traditional, inefficient storage designs.

esProc designs special, efficient data organization schemes for data stored in memory, in external storage and on clusters, to suit a variety of computing scenarios.

Is esProc based on open source or database technologies?

esProc is based on a wholly original computing model with brand-new theory and syntax for which no open-source technologies can be borrowed.

Built on this new theory, esProc abandons SQL for implementing high performance algorithms, because SQL cannot describe most low-complexity algorithms.

It still offers a high-performance SQL interface for multidimensional analysis, where standard SQL is sufficient, so that it can work with various front-end BI tools.

Is esProc difficult to learn?

esProc has exclusive SPL syntax to achieve performance optimization.

SPL is easy to learn; it only takes hours to learn it and weeks to master it!

The hard part is to design optimized algorithms!

We design the following optimization process to help users succeed

Performance optimization process

We will designate a senior engineer to collaborate with our user in dealing with their first one or two computing scenarios with esProc.

Some preliminary training and tuning are necessary, as most programmers are accustomed to SQL's roundabout way of thinking and are not familiar with high performance algorithms.

After that, users will be able to apply dozens of performance optimization techniques to design and implement high performance algorithms on their own.

Describe a problem (User)

Understand data characteristics & computing requirements (Raqsoft)

Work out an appropriate solution (Both parties)

Test & debug (Raqsoft)

Write case report (Raqsoft)

Train the user (Raqsoft)

Give users a solution and we support them for a day. Teach users how to reach a solution and they support themselves forever!
