Data Agile Computing

Application scenarios

  • Report data source
  • Java computing layer
  • Data preparation
  • Data microservice

Scenario pain points

Report data source

  • SQL/stored procedures are too complex and inefficient to develop.
  • Non-database sources such as text, XLS, and HTTP make development tedious.
  • Computing in Java couples the logic too tightly to the application and is difficult to maintain.
  • Exporting big-data reports overflows memory, and access is slow.
  • T+0 real-time reports are difficult to implement across databases.
  • Many intermediate tables accumulate in the database; they are hard to manage and consume a lot of database resources.

Java computing layer

  • Computing directly in Java produces lengthy code that is hard to reuse.
  • Long development cycles, uncertain performance, uncontrollable quality.
  • Coupling with the application is too high: changing an algorithm forces the whole application to be recompiled and redeployed, so agility is lost.

Data preparation

  • Professional ETL tools require a complex environment, are inflexible in use, demand skilled personnel, and make it hard to prepare data anytime, anywhere.
  • Procedural calculation is complex, and SQL/stored procedures are hard to write and debug.
  • Diverse data sources are hard to process; data often has to be loaded into a database first, which adds steps and loses efficiency.

Data microservice

  • Algorithms are tedious to write, difficult to maintain, and poor to extend.
  • Multi-source, cross-database computing across different source types, e.g. Oracle + MSSQL.
  • System integration is difficult: the computing framework or language must be integrated with the project's engineering language, e.g. Java calling Python.
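The cross-source idea above can be sketched in a few lines of Python. This is only an illustration, using two in-memory SQLite databases as stand-ins for, say, Oracle and MSSQL; the table and column names are hypothetical.

```python
# Sketch of cross-source computing: query two separate databases,
# then associate the results in application memory.
import sqlite3

ora = sqlite3.connect(":memory:")   # stands in for an Oracle source
ms = sqlite3.connect(":memory:")    # stands in for an MSSQL source

ora.execute("CREATE TABLE orders (id INTEGER, cust INTEGER, amount REAL)")
ora.executemany("INSERT INTO orders VALUES (?,?,?)",
                [(1, 10, 120.0), (2, 20, 80.0), (3, 10, 50.0)])
ms.execute("CREATE TABLE customers (id INTEGER, name TEXT)")
ms.executemany("INSERT INTO customers VALUES (?,?)",
               [(10, "Acme"), (20, "Beta")])

# Pull each result set out of its own database, then join in memory.
orders = ora.execute("SELECT cust, SUM(amount) FROM orders GROUP BY cust").fetchall()
names = dict(ms.execute("SELECT id, name FROM customers").fetchall())
totals = {names[cust]: total for cust, total in orders}
print(totals)  # {'Acme': 170.0, 'Beta': 80.0}
```

Each database answers only its own part of the query; the association happens outside both, which is exactly the step a cross-database computing layer has to supply.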

Report data source

Scenario features: The original data source has weak computing power or struggles to cope with complex computation.

Java computing layer

Scenario features: Implementing directly in Java means long development cycles, uncertain performance, and uncontrollable quality.

Data preparation

Scenario features: Low data-access capability, multi-step procedural calculation, and low development and debugging efficiency.

Data microservice

Scenario features: Algorithm encapsulation is not agile, later maintenance is difficult, data storage is inflexible, and scalability is poor.

Application cases

Case 1: FlexGroup - Multi-source mixed computing

Business requirement

Associative computation over Vertica business detail data and Excel customer information data

Existing problems

  1. Computational complexity: although Vertica offers advanced analytic functions, it has no stored procedures, so complex procedural computation must be implemented externally.
  2. Difficult to implement: in Java, data objects must be constructed continually, which makes associative computation very troublesome.
  3. Difficult to maintain: implementing the algorithm in Java yields long, complex code that is too tightly coupled to the application and hard to maintain.

Solution effects

  1. Grid-style programming and a professional class library enable rapid development
  2. Code is simplified and easy to modify
  3. Algorithm and application are decoupled and can be adjusted and maintained independently
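The kind of association this case involves can be sketched in Python with pandas, which the comparison section below also uses. The data and column names here are illustrative only; in the real case the detail rows come from a Vertica query and the customer table from an Excel file.

```python
# Sketch of the association: business detail rows (as they might come
# from Vertica) joined with customer info (as loaded from Excel).
import pandas as pd

details = pd.DataFrame({            # stands in for the Vertica query result
    "cust_id": [1, 1, 2],
    "amount": [100.0, 40.0, 75.0],
})
customers = pd.DataFrame({          # stands in for pd.read_excel("customers.xlsx")
    "cust_id": [1, 2],
    "cust_name": ["Acme", "Beta"],
})

# Associate the two sources, then aggregate per customer.
merged = details.merge(customers, on="cust_id")
summary = merged.groupby("cust_name", as_index=False)["amount"].sum()
print(summary)
```

The point of the case is that this join-then-aggregate step lives outside both sources, in an independent computing layer rather than inside the application code.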

Architecture comparison

Case 2: A petrochemical company - Generating multi-level dynamic tables

Business requirement

Generating a multidimensional, multi-layer, dynamic cross-multiplication intermediate table

Existing problems

  1. Difficult to simplify: SQL/stored procedures cannot directly express the algorithm's hierarchical relations, which leads to complex nested statements and multi-level associations.
  2. Heavy workload: SQL's weak expressive power makes it impossible to implement the algorithm with compact syntax.
  3. Poor performance: in SQL, a large number of temporary tables are needed to store nested intermediate data, which takes a lot of time.
  4. Difficult to maintain: implemented in SQL, the code is long and complex, and therefore hard to interpret.
  5. Difficult to optimize: implemented with SQL/stored procedures, the code volume is huge, leaving no room to spend time on optimization.

Solution

The algorithm is implemented with esProc features such as high-performance functions, object access, and in-memory computing, which effectively improves execution performance.

  1. Rapid development
  2. Simplified code (11 lines), easy to maintain
  3. Optimizable (based on intermediate-layer recalculation)
  4. High Performance
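Point 3 above, recalculating from an intermediate layer, can be illustrated with a small Python sketch: each summary level is built from the level below it, so a change only requires recomputing from that layer. Dimensions and figures here are made up for the sketch.

```python
# Illustrative step-by-step build of a two-level summary, the shape of
# the "multi-level dynamic table" in the case.
import pandas as pd

rows = pd.DataFrame({
    "plant": ["P1", "P1", "P2", "P2"],
    "product": ["oil", "gas", "oil", "gas"],
    "qty": [10, 20, 30, 40],
})

# Level 1: totals per plant and product (the reusable intermediate layer).
lvl1 = rows.groupby(["plant", "product"], as_index=False)["qty"].sum()
# Level 2: roll the intermediate layer up to plant totals, without
# touching the raw rows again.
lvl2 = lvl1.groupby("plant", as_index=False)["qty"].sum()
print(lvl2)
```

Because level 2 reads only level 1, adding or adjusting a higher layer never forces a rescan of the base data, which is where the optimization headroom comes from.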

Metrics comparison

Why esProc?

Easy to use

Easy to understand, simplify what’s complicated

= Low skill requirement for personnel

Function

Diversified data and rich class library

= Faster implementation

Performance

Adapts to big data and stays independently controllable

= Stronger computing power

Application

Fast integration and flexible deployment

= Lighter architecture

Development environment

Install and use immediately, with complete debugging features

Rich class library

Intended for structured data processing

Grouping & Loop

Sorting & Filtering

Set operations

Ordered sets

Java Integration

esProc is developed in Java and provides standard application interfaces for seamless integration.

Development language comparison

SPL vs. SQL

e.g. Count the longest run of consecutively rising trading days for a stock

SQL

select max(continuousDays) - 1
from (select count(*) continuousDays
      from (select sum(changeSign) over (order by tradeDate) unRiseDays
            from (select tradeDate,
                         case when closePrice > lag(closePrice) over (order by tradeDate)
                              then 0 else 1 end changeSign
                  from stock))
      group by unRiseDays)

SQL: the calculation is hard to break into steps, and the code is difficult to read and maintain.
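For reference, the same computation written step by step in an imperative style: count the longest run of consecutive closing-price rises. The sample prices are made up.

```python
# Longest run of consecutively rising closing prices, computed in one pass.
close_prices = [10.0, 10.5, 10.2, 10.6, 10.9, 11.1, 10.8]

longest = run = 0
for prev, cur in zip(close_prices, close_prices[1:]):
    run = run + 1 if cur > prev else 0   # extend the run on a rise, else reset
    longest = max(longest, run)
print(longest)  # 3
```

Written this way the logic is one pass over ordered data, which is also the shape of the SPL solution below; the nested SQL exists only to simulate that order-dependent pass in a set-oriented language.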

SPL

  A
1 =stock.sort(tradeDate)
2 =0
3 =A1.max(A2=if(closePrice>closePrice[-1],A2+1,0))
SPL: syntax suited to describing a natural way of thinking, with step-by-step calculation supported naturally.

SPL vs. Python

e.g. Generate a pivot table

Python

import pandas as pd
user_watch_data = pd.read_csv('user_watch1.csv')
music_meta_data = pd.read_csv('music_meta1.csv')
u_i_watch_list = []
gr = user_watch_data.groupby(['user', 'music'])
for (userid, musicid), group in gr:
    u_i_watch_list.append([userid, musicid, group['listen_time'].mean()])
data_listen_time = pd.DataFrame(u_i_watch_list, columns=['user', 'music', 'listen_time'])
data_merge = pd.merge(data_listen_time, music_meta_data, on='music')
data_merge['score'] = data_merge['listen_time'] / data_merge['long']
u_i_s_data = data_merge.loc[:, ['user', 'music', 'score']]
u_i_s_data = pd.pivot_table(u_i_s_data, index='music', columns='user', values='score')
Python: an explicit for loop is needed to implement the grouping and transposition steps.

SPL

  A
  A
1 =file("user_watch1.csv").import@tc()
2 =file("music_meta1.csv").import@tc().key(music)
3 =A1.groups(user,music,avg(listen_time):listen_time)
4 =A3.join(music,A2,listen_time/long:score).new(user,music,score)
5 =A4.id(user).concat@cq()
6 =A4.pivot(music;user,score;${A5})
SPL: professional syntax for set operations; the code for the same process is more concise.

Comparative advantages of esProc over common languages

SPL Technical Advantages

  • Agility: install and use immediately, simple environment configuration, no additional plug-ins
  • Applicability: complete class libraries and procedural computing for complex data processing
  • Usability: simple and intuitive syntax, easy to debug, no professional programmer required
  • Openness: built-in interfaces to multiple data sources, computed on directly
  • Integration: seamlessly embedded into applications, easily applied to day-to-day computing

Now take a deep dive into esProc!