In Excel, performing a year-on-year comparison on a multilevel table requires pasting formulas manually, which can be a great deal of work when the table holds a large amount of data and many groups. If the aggregation is based on only part of the data, you have to rearrange that data manually in a new worksheet, which is inefficient and error-prone.
With esCalc, the desktop BI software, you only need to enter one formula to accomplish the whole year-on-year comparison; you can filter simply by deleting data, without manual rearrangement; and you can sort directly on grouping rows, moving each group as a whole. The esCalc operation mode makes it very convenient to handle the kind of problem just mentioned, regardless of the data volume or the number of groups.
To build a distributed system, an esProc program can act as a server that receives requests from other esProc programs and returns the results. Its basic computing model is that a controlling node sends orders to the non-controlling nodes, then collects and aggregates their results. A complex task may consist of multiple sub-tasks.
The keys to this kind of distributed computing are the ability to scale out and fault tolerance when running on multiple nodes.
First let’s look at the simple shared-data-source strategy.
Data sharing means that the data to be processed is stored in one place, such as a database or a network file system, and that the nodes only handle the tasks assigned to them without holding the data themselves. As a result, the data source comes under heavy pressure from concurrent accesses, so this strategy is better suited to computation-intensive tasks than to data-intensive ones.
It’s simple to implement the shared-data-source strategy with esProc Server:
| | A | The controlling program |
|---|---|---|
| 1 | =4.("192.168.0."/(10+~)/":1234") | The list of 4 nodes |
| 2 | =callx("sub.dfx",to(8),8;A1) | Pass in the parameters to call the node program, corresponding to 8 sub-tasks |
| 3 | =A2.sum() | Aggregate the results |
| | A | Node program (sub.dfx) |
|---|---|---|
| 1 | =hdfsfile("hdfs://192.168.0.1/persons.txt") | An HDFS file |
| 2 | =A1.cursor@t(;seg:all) | A cursor over one segment of the file; seg and all are the parameters passed to the node program |
| 3 | =A2.select(gender=="M").groups(;count(1):C) | Select and count the male records |
| 4 | return A3.C | Return the result |
esProc provides solid support for many types of shared data sources, such as databases and HDFS.
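For instance, a node program could read from a shared database instead of an HDFS file. The following is a minimal sketch only: the data source name demo, the table persons, and its gender field are assumptions for illustration, and the segmentation parameters are omitted for simplicity.

| | A | Node program over a shared database (hypothetical sub_db.dfx) |
|---|---|---|
| 1 | =connect("demo") | Connect to the shared data source named demo (an assumed name) |
| 2 | =A1.cursor("select gender from persons") | A cursor over the shared persons table (an assumed table) |
| 3 | =A2.select(gender=="M").groups(;count(1):C) | Select and count the male records, as in the HDFS version |
| 4 | >A1.close() | Close the database connection after the cursor has been consumed |
| 5 | return A3.C | Return the result |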
esProc Server's distributed structure is centerless. This is unlike distributed architectures such as Hadoop, which provide a complete framework that transparently presents the whole cluster as a single machine. esProc has neither a framework nor a permanent central controlling node; by abandoning a fixed cluster structure, it lets programmers control the participating nodes through code.
In a centerless distributed structure, all nodes are equal and none is special. The advantage is that the failure of any single node won't stop the whole cluster from running, whereas a distributed structure with a center breaks down once the central node fails.
Strictly speaking, an esProc cluster isn't completely devoid of centers. Though the cluster itself is centerless, each task has its own controlling node that temporarily summons other nodes to take part in the computation. If the controlling node collapses, that task fails, but the cluster as a whole can still handle other tasks.
This difference from other distributed systems is another embodiment of esProc's design concept: emphasizing class libraries while avoiding a fixed framework.
esProc Server is capable of balancing sub-tasks among nodes. It decides whether to give a node a sub-task according to how busy the node is (the number of running threads). If a node is saturated (the number of threads running on it has reached its allowed maximum), the controlling program waits until the node finishes at least one of its sub-tasks before distributing more. In this way a faster node receives more sub-tasks, balancing the workload against the available resources.
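This balancing takes effect when a task is split into more sub-tasks than the nodes can run at once. As a rough sketch based on the controlling program above (the sub-task count of 16 is an arbitrary number chosen for illustration), it is enough to raise the number of sub-tasks passed to callx; faster or idler nodes will then pick up more of them.

| | A | The controlling program with finer-grained sub-tasks |
|---|---|---|
| 1 | =4.("192.168.0."/(10+~)/":1234") | The same list of 4 nodes |
| 2 | =callx("sub.dfx",to(16),16;A1) | 16 sub-tasks spread over 4 nodes; a node gets a new sub-task only when it has a free thread |
| 3 | =A2.sum() | Aggregate the 16 partial results |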
If a node malfunctions during the process and can no longer continue its work, the controlling program reassigns that work to the healthy nodes. This lengthens the total computing time, but provides a degree of fault tolerance.
esProc Server's non-framework design allows the cluster to include machines with very different capabilities, such as different memory sizes, CPU configurations, or even operating systems. In a nutshell, esProc Server is open to any machine, which lets users exploit their existing hardware to the full. By contrast, many cluster solutions built on a specified framework require the nodes to be largely alike.
To attain better performance, we need to store data in a distributed fashion, particularly for data-intensive tasks, which have a high I/O cost. The shared-data-source strategy creates a serious throughput bottleneck for such tasks, while distributed data storage spreads the I/O load among the nodes.
In principle, the goal of distributed data storage is to break data apart and put it onto different nodes, enabling each node to access the data it needs locally and thus avoiding network transmission delays and contention for a shared source. Data distribution doesn't mean simply dividing the data (evenly) into N segments and placing them on N nodes: that kind of distribution is not fault-tolerant and may still incur a relatively large amount of network transmission from join operations.
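To show the local-access idea in its simplest form, assume each node keeps its own segment of the data in a local text file (the path /data/persons_seg.txt below is hypothetical). The earlier node program can then read locally instead of from the shared HDFS source, while the controlling program stays the same.

| | A | Node program reading a local segment (sketch) |
|---|---|---|
| 1 | =file("/data/persons_seg.txt") | The node's local data segment (an assumed path) |
| 2 | =A1.cursor@t() | A cursor over the local file; no shared source is accessed |
| 3 | =A2.select(gender=="M").groups(;count(1):C) | Select and count the male records as before |
| 4 | return A3.C | Return the result |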
Unlike the common network file system, esProc Server provides an opaque data distribution strategy, which requires programmers to decide how data should be distributed.