High Performance Computing Database

What's esProc

  • Offers SPL Base, a professional high-performance computing database
  • Uses SPL (Structured Process Language) as its built-in programming language
  • Features high performance

What does esProc deal with?

Slow batch processing

Batch jobs run so long that they turn into a grueling marathon, and with the deadline looming there is no time to start over if something goes wrong.

Slow query

Building a simple report takes 10 minutes or longer and demands extreme patience. High concurrency or a long time span makes the query practically impossible.

Slow response

Join operations are slow and the drag-and-drop interface lags; the pre-aggregation module takes up too much space yet still leaves too many functional blind spots

High cost

In-memory databases are expensive yet perform poorly

Single machine or cluster, well-known traditional data processing products or emerging newcomers, MPP or Hadoop: esProc can improve on their average performance by a factor of several to ten!

Why traditional processing technologies suck

Factors that determine computational performance

  • Computational efficiency depends on both hardware and software
  • Software performance comes down to algorithmic efficiency
  • Algorithmic efficiency is determined by both the design of the algorithm and its implementation
  • A great algorithm that cannot be implemented brings no efficiency at all
  • Programming languages without a mechanism to achieve high performance are the enemy of efficient algorithms

An optimal algorithm is useless until it is implemented

Underlying cause

The prevalent structured data

  • Structured data generated by businesses across all industries is the main target of data computing
  • Expanding memory and building huge clusters are the main ways to speed up structured data computations
  • Both methods, in essence, scale the hardware up or out
  • On the software side, the field is dominated by SQL, which is based on relational algebra
  • SQL's design is ill-suited to implementing high-performance algorithms

Highly performant structured data computations are impossible in SQL

Key issues

SQL problems

  • SQL's inability to implement high-performance algorithms stems from deficiencies of its theoretical base (relational algebra)
  • It's impossible to make up for theoretical defects through engineering design
【Example】 How to get the top 10 from 1 billion rows in SQL?
  • In theory, SQL sorts all the rows and then takes the first 10, which is extremely slow
  • There is a fast algorithm that does not need to sort all the rows, but SQL cannot express it
  • Automatic optimization by the database engine is the only way out, but it cannot be relied on in complex scenarios
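
The fast algorithm mentioned above can be sketched outside SQL. The following is a minimal Python illustration only (it is neither SPL nor esProc's actual implementation): it keeps a min-heap of the N largest values seen so far, so the data is traversed once and never fully sorted. The name `top_n` and the generated sample data are assumptions made for this sketch.

```python
import heapq
import random


def top_n(values, n):
    """Return the n largest values, largest first, in a single pass."""
    if n <= 0:
        return []
    heap = []  # min-heap holding the n largest values seen so far
    for v in values:
        if len(heap) < n:
            heapq.heappush(heap, v)
        elif v > heap[0]:
            # v beats the smallest value kept so far: replace it
            heapq.heapreplace(heap, v)
    return sorted(heap, reverse=True)  # sorts only n items, not all rows


if __name__ == "__main__":
    random.seed(0)
    # A small generated stand-in for the billion-row table (assumption)
    amounts = (random.random() for _ in range(100_000))
    print(top_n(amounts, 10))
```

Memory use stays proportional to N rather than to the number of rows, which is why the approach also works when the data does not fit in memory.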

A great algorithm that is unimplementable is useless


Java problems

Java can implement high-performance algorithms, but the code is too complicated to be practical


Essence of high performance

SQL – Able to design but unable to express high performance algorithms

Java – Able to implement high-performance algorithms, but only in overly complicated ways

In essence, high performance comes from high development efficiency

Able to design & Easy to write

How esProc stands out

esProc's innovative computing system delivers high performance

【Analogy】1+2+3+…+100=?

Others

1+2=3
3+3=6
6+4=10
10+5=15
15+6=21

Gauss

1+100=101
2+99=101

There are fifty 101s
50*101=5050

Gauss was clever to think of this efficient solution. But note that multiplication had already been invented by then!
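
The two approaches in the analogy can be written side by side. This is a small Python illustration with hypothetical function names; it only restates the arithmetic above.

```python
def sum_by_addition(n):
    """Add 1..n one term at a time: n additions."""
    total = 0
    for k in range(1, n + 1):
        total += k
    return total


def sum_by_pairing(n):
    """Gauss's pairing: n/2 pairs, each adding up to n + 1."""
    return n * (n + 1) // 2


print(sum_by_addition(100), sum_by_pairing(100))  # 5050 5050
```

The second function is only possible because the language offers multiplication, which is the point of the analogy.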
Recall the previous example: how to get the top 10 from 1 billion rows?

Relational algebra-based SQL is like an arithmetic system that has only "addition", while esProc SPL invents "multiplication"!

esProc offers many kinds of "multiplication" (a high-performance computing & storage database) so that everyone can become Gauss (and implement high-performance algorithms quickly and efficiently)


Discrete dataset model

An innovative algebraic system gives esProc its high performance

The discrete dataset model is the "multiplication" esProc creates

  • Set-lization
  • Discreteness
  • Deep set-lization
  • Orderliness

This makes high-performance algorithms implementable and easy to code


Simple agile syntax

Computing goal: find the largest number of consecutive days on which a stock's price rises.
select max(continuousDays)-1
from (select count(*) continuousDays
	from (select sum(changeSign) over(order by tradeDate) unRiseDays
		from (select tradeDate,
			case when closePrice>lag(closePrice) over(order by tradeDate)
			then 0 else 1 end changeSign
		from stock) )
	group by unRiseDays)

SQL solution

Even with window functions, SQL needs a three-level nested query.

But can you understand it?


A
1 =stock.sort(tradeDate)
2 =0
3 =A1.max(A2=if(closePrice>closePrice[-1],A2+1,0))

SPL solution

This follows the natural way of thinking: sort the rows by trading date (line 1), compare each closing price with the previous one, add 1 to the counter if it is higher and reset the counter to 0 otherwise, then take the largest counter value (line 3)
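
For readers who do not know SPL, the same step-by-step logic can be sketched in Python. This is an illustration of the counting idea only, not of how SPL executes it; the function name and the sample prices are hypothetical, and the first row is assumed to count as "no rise" because it has no previous price.

```python
def longest_rising_run(closing_prices):
    """Longest run of consecutive rises; prices must be in trade-date order."""
    longest = 0
    current = 0      # plays the role of cell A2 in the SPL grid
    previous = None
    for price in closing_prices:
        if previous is not None and price > previous:
            current += 1   # price rose: extend the current run
        else:
            current = 0    # price did not rise: reset the run
        longest = max(longest, current)
        previous = price
    return longest


# Hypothetical closing prices, already sorted by trading date
prices = [10.0, 10.5, 11.0, 10.8, 10.9, 11.2, 11.5, 11.4]
print(longest_rising_run(prices))  # 3
```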


esProc mechanisms for achieving high performance algorithms & storage

Memory search

  • Binary search
  • Sequence-number-based location
  • Position indexing
  • HASH index
  • Sequence-number-based location on multilayer data

External storage data set

  • Text file segmentation
  • Bin file & double increment segmentation
  • Special data types
  • Composite file & columnwise storage
  • Order-related patch file
  • Data update & multi-zone composite table

External storage search

  • Binary search
  • HASH index
  • Sorting index
  • Rowwise storage & value-attached index
  • Index preload
  • Batch searching
  • Searches that return a set
  • Multi-indexed merge
  • Full-text search

Traversal technology

  • Cursor filtering
  • Traversal reuse
  • Multi-threaded traversal
  • Multi-threaded database load
  • Multicursor
  • Traversal-based grouping & aggregation
  • Aggregate essence application
  • The redundant grouping key

Order-based traversal

  • Order-based grouping & aggregation
  • Ordered post-grouping subsets
  • Programming cursor
  • Grouping based on ordered first-half table
  • Grouping based on ordered second-half table
  • Sequence-number-based grouping & controllable segmentation
  • Index-based sorting

Foreign key join

  • Foreign key converted to addresses
  • Values temporarily converted to addresses
  • Numbered foreign key
  • Special inner join syntax
  • Index reuse
  • Alignment sequence
  • Big dimension table search
  • One table-based grouping

Merge & join

  • Order-based merge
  • Segment-based merge
  • Join-based location
  • Attached table

Multidimensional analysis

  • Partial pre-aggregation
  • Pre-aggregation over a specific period
  • Redundancy-based sorting
  • Alignment sequence
  • Tagged dimension
  • Unwanted change of memory tags

Cluster

  • Task & data distribution
  • Cluster multi-zone composite table
  • Duplicate dimension table
  • Segmented dimension table
  • Redundancy-based fault tolerance plan
  • "Spare tire" fault tolerance plan
  • Multitask load balancing

Lots of original algorithms!


High performance algorithm examples

Aggregate essence application

  A  
1 =file("data.ctx").create().cursor()  
2 =A1.groups(;top(10,amount)) Get orders whose amounts are in top 10
3 =A1.groups(area;top(10,amount)) Get orders whose amounts are in top 10 in each area

Transform complex full sorting to simple aggregation
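
The idea of treating top-N as an aggregation, like sum or count, can be illustrated in Python. This is a hedged sketch, not the SPL implementation: `top_n_by_group`, the field names and the sample orders are assumptions. Each group keeps a small heap of its N best rows, so the per-area result comes out of one traversal without sorting all orders.

```python
import heapq
from collections import defaultdict


def top_n_by_group(rows, key, value, n):
    """One pass over rows; keep only the n largest `value` rows per `key`."""
    if n <= 0:
        return {}
    heaps = defaultdict(list)  # group -> min-heap of (value, seq, row)
    for seq, row in enumerate(rows):
        # seq makes every entry unique, so the row dicts are never compared
        entry = (row[value], seq, row)
        heap = heaps[row[key]]
        if len(heap) < n:
            heapq.heappush(heap, entry)
        elif entry[0] > heap[0][0]:
            heapq.heapreplace(heap, entry)
    return {g: [r for _, _, r in sorted(h, reverse=True)]
            for g, h in heaps.items()}


# Hypothetical sample data
orders = [
    {"area": "East", "amount": 120},
    {"area": "West", "amount": 300},
    {"area": "East", "amount": 450},
    {"area": "East", "amount": 80},
    {"area": "West", "amount": 50},
]
result = top_n_by_group(orders, "area", "amount", 2)
print([r["amount"] for r in result["East"]])  # [450, 120]
```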

Traversal reuse

  A  
1 =file("order.ctx").create().cursor() Ready for traversal
2 =channel(A1).groups(product;count(1):N) Configure a subordinate computation at traversal
3 =A1.groups(area;sum(amount):amount) Traverse records to perform grouping and get the result
4 =A2.result() Get result of the subordinate computation

One traversal returns multiple result sets
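
A rough Python analogue of this idea, assuming orders are plain dictionaries with `product`, `area` and `amount` fields (hypothetical names): both groupings are updated inside the same loop, so the data is read only once. It does not model SPL cursors or channels themselves.

```python
from collections import Counter, defaultdict


def group_in_one_pass(orders):
    """Compute two groupings while reading the orders only once."""
    count_by_product = Counter()         # the "subordinate" computation
    amount_by_area = defaultdict(float)  # the main computation
    for order in orders:                 # single traversal of the data
        count_by_product[order["product"]] += 1
        amount_by_area[order["area"]] += order["amount"]
    return dict(count_by_product), dict(amount_by_area)


# Hypothetical sample data
sample = [
    {"product": "pen", "area": "East", "amount": 100},
    {"product": "ink", "area": "West", "amount": 40},
    {"product": "pen", "area": "East", "amount": 60},
]
counts, amounts = group_in_one_pass(sample)
print(counts)   # {'pen': 2, 'ink': 1}
print(amounts)  # {'East': 160.0, 'West': 40.0}
```

When the data lives on disk, reading it is usually the dominant cost, so sharing one traversal between several computations saves most of the time a second scan would take.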


High performance storage

Proprietary data storage formats: bin file and composite file

File system storage

Data is stored by category in a tree-structured directory system

Bin file

  • Double increment segmentation enables any number of parallel threads
  • Exclusive, highly efficient compact coding mechanism (space- and CPU-efficient, high-security)
  • Generic storage model supports set type data

Composite Table

  • Mixed columnwise & rowwise storage system
  • Order-based storage enhances compressibility and positioning performance
  • High efficiency automated index
  • Double increment segmentation enables any number of parallel threads
  • Unification of primary and sub tables reduces storage and join load
  • Serial-number-type key values enable highly efficient position-based joins

Distributed computing

Fault-tolerant storage & computing techniques

esProc offers two fault-tolerant data storage designs: a redundancy-based plan for data in external storage and a "spare tire" plan for data in memory

The fault-tolerant technique for computing automatically reassigns the subtask(s) on a malfunctioning node to an available node for processing

Controllable data distribution

A user-defined data distribution and redundancy plan, tailored to the data characteristics and computing situation at hand, considerably reduces the volume of data transmitted across nodes and thus increases performance

Centerless architecture avoids task failure due to single node malfunction

The centerless cluster system lets programmers manage computing nodes through code

Load balancing design

The design decides whether to assign a subtask to a node according to the node's workload (the number of threads running on it), which keeps workload and resources in balance
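
One way to picture a thread-count-based assignment rule is the following Python sketch. It is an assumption-laden illustration, not esProc's scheduler: the `Node` class, the `max_threads` capacity setting and the "fewest busy threads wins" rule are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Node:
    name: str
    max_threads: int   # hypothetical capacity setting
    running: int = 0   # threads currently busy on this node


def pick_node(nodes):
    """Return the node with spare capacity and the fewest busy threads, or None."""
    free = [n for n in nodes if n.running < n.max_threads]
    if not free:
        return None  # every node is saturated; the caller must wait or retry
    return min(free, key=lambda n: n.running)


def assign(subtasks, nodes):
    """Assign subtasks one by one; return {subtask: node name or None}."""
    placement = {}
    for task in subtasks:
        node = pick_node(nodes)
        if node is not None:
            node.running += 1
        placement[task] = node.name if node else None
    return placement


cluster = [Node("n1", max_threads=2, running=1), Node("n2", max_threads=2)]
print(assign(["a", "b", "c", "d"], cluster))
# {'a': 'n2', 'b': 'n1', 'c': 'n2', 'd': None}
```

In this sketch a subtask that finds no free node is left unassigned; a real scheduler would queue or retry it.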


esProc computing performance test

* Intel3014 1.7G/12core/64G storage - FT-1500/16core/32G storage - MIPS/8core/64G storage - Intel2670 2.6G/16core/128G storage - FT-2000/64core/256G storage

Application scenarios

SPL Base architecture


Typical application scenarios for SPL Base

  • Online query & analysis
  • Interactive analysis & retrieval
  • Offline regular batch processing

Online query & analysis

【Feature】 Concurrency-intensive; potentially complex computing tasks; instant response (in seconds); big data cluster computing


Interactive analysis & retrieval

【Feature】 No concurrency; weak demand for real-time response; step-by-step computing mode


Offline regular batch processing

【Feature】 No concurrency; no demand for real-time response; huge amount of data; strict time-window requirements

esProc high performance Q&A

Does esProc store data by itself?

Absolutely yes! An efficient storage scheme is a prerequisite for great performance. Neither RDBs nor Hadoop can achieve high performance, because of their traditional, inefficient storage designs.

esProc provides dedicated, efficient data organization schemes for data stored in memory, in external storage and across a cluster, to suit a variety of computing scenarios.


Is esProc based on open source or database technologies?

esProc is built on a wholly original computing model with a brand-new theory and syntax, so there is no open-source technology it could borrow from.

To implement high-performance algorithms, esProc abandons SQL, which cannot express most low-complexity algorithms, in favor of its own innovative theory.

It still offers a high-performance SQL interface for multidimensional analysis, where standard SQL is sufficient, so that it can work with various front-end BI tools.


Is esProc difficult to learn?

esProc has exclusive SPL syntax to achieve performance optimization.

SPL is easy to learn; it only takes hours to learn it and weeks to master it!

The hard part is to design optimized algorithms!

We have designed the following optimization process to help users succeed


Performance optimization process

We will assign a senior engineer to work with the user on their first one or two computing scenarios with esProc.

Some preliminary training and tuning are necessary, as most programmers are accustomed to SQL's roundabout way of thinking and are not familiar with high-performance algorithms.

After that, users will be skilled at applying dozens of performance optimization techniques to design and implement high-performance algorithms.

Describe a problem (User)
Understand data characteristics & computing requirements (Raqsoft)
Work out an appropriate solution (Both parties)
Test & debug (Raqsoft)
Write case report (Raqsoft)
Train the user (Raqsoft)

Give users a solution and we support them for a day. Teach users how to reach a solution and they support themselves forever!