Text-file manipulation

In addition to databases, text files are almost the commonest form of data storage. So computations based on text files are very important. Unlike databases that have SQL syntax, text files don’t possess any computational power. Programmers need to write code for the manipulation of text files using one of the programming languages. The problem is few programming languages that have the ability to handle text files are set-oriented, making the coding of batch computation a highly cumbersome process. Actually a simple sum operation written in Java has many lines of code, and those operations involving data filtering and grouping even need hundreds of lines. More recent scripting languages – Perl, Python and R – have made some improvements in this aspect, but they still don’t give adequate support of batch processing of structured data and are integration-unfriendly as well.

A second approach is to import the text data into a database and use SQL. But since the text file lacks the strong data typing required by a database, the data import is usually accompanied by complicated data arrangement. The extra step will seriously compromise the efficiency of transaction processing.

Being a dynamic set-oriented scripting language, esProc can to some extent fill the gaps. Some of the common file handling scenarios in the following will serve to show the advantages of esProc in handling this type of computations.  

Non-structured-data processing

Parsing text files

The data items in each line of the text file T.txt are separated from each other by an indefinite number of spaces:

20010-8-13 991003 3166.63 3332.57  3166.63  3295.11

2010-8-10 991003 3116.31 3182.66  3084.2   3140.2

……

Now we want to make a list of the averages of the last four items in each line. esProc needs a mere one-liner to do this: 

  A
1 =file(“T.txt”).read@n().(~.array@tp(“”).to(-4).avg())

read@n reads data from the text file as a set of strings. array@t(“”) splits each string into a set of substrings according to the indefinite number spaces; and @p parses each substring into the corresponding data type for the subsequent computation (of average value).

Here’s a comma-separated text file T.csv. We need to write the first 8 data items of the file’s lines having not less than 8 data items to another text file R.txt, separated with “|” (which is the separator some bank file systems use): 

  A
1 =file(“T.csv”).read@n().(~.array(“,”)).select(~.len()>=8)
2 >file(“R.txt”).write(A1.(~.to(8).string(“|”)))

The string() function concatenates members of a set into a string using a specified separator.

The text file T.txt holds a set of strings as shown below. We need to divide the file into several parts according to the state name (LA) before the characters US and put them in different files.

COOP:166657,’NEW IBERIA AIRPORT ACADIANA REGIONAL LA US’,200001,177,553

……

  A
1 =file(“T.txt”).read@n()
2 =A1.group(mid(~,pos(~,” US'”)-2,2):state;~:data)
3 >A2.run(file(state+”.txt”).export(data))

esProc also supports the use of regular expressions for more complex parsing tasks. But since the regular expressions are difficult to use and perform poorly, the conventional approaches are generally recommended. 

Parsing text files into structured data

In the logfile S.log, every 3 lines constitute a complete piece of information. We need to parse the file into structured data and write it to T.txt:

  A    
1 =file(“S.log”).read@n()    
2 =create(…)   Create a target result set
3 for A1.group((#-1)\3) Group data every 3 lines by the line numbers
  Parse field values from A3 (the current 3 lines)
  >A2.insert(…) Insert the parsing result into the target result set
>file(“T.txt”).export(A2)   Write the result set to the text file

Since esProc can group data by line numbers, we can run a loop to process one group of data each time, making the computation simpler.

Of course there is the special, simple case of handling a single line.

If S.log is too big to be loaded completely into the memory, we can retrieve the file step by step using the cursor and export it the same way: 

  A B  
1 =file(“S.log”).cursor@si() Create a cursor and import the file in a stream
2 =file(“T.txt”)   Create the resulting file
3 for A1,3 Run a loop to process the 3 lines imported at a time
4   Parse field values from A3 (the current 3 lines)
  >A2.export@a(…) Write the parsed values to the resulting file

A skilled user can optimize the code to achieve a better performance by writing the parsed records in batches. 

If every piece of complete information in the log file S.log starts with “—start—” but contain an indefinite number of lines, we just need to change A3 as follows:

3 for A1.group@i(~==”—start—”) Create a new group with every “–start—”

Similarly we can deal with a big file in this type of scenarios with the cursor, and A3 will be like this:

3 for A1;~==”—start—“:0 Start another loop cycle with every “–start—”

Another scenario for the indefinite number lines is that each line of one piece of information begins with the same characters (For example the userID the log information belongs to). When the starting characters change, a new piece of information begins. To handle this we can slightly modify A3:

3 for A1.group@o(left(~,6)) Create a new group when the first 6 characters change
3 for A1;left(~,6) Start another loop cycle when the first 6 characters change

And we can also use the cursor to handle the big file by altering the code of the preceding subsection.

Searching and Summary

Find files under a directory that contain certain words, list the contents of the lines where they settle and the line numbers:

  A
1 =directory@p(“*.txt”)
2 =A1.conj(file(~).read@n().(if(pos(~,”xxx”),[A1.~,#,~].string())).select(~))

grep is a frequently used Unix command. But some operating systems don’t support it and it’s not easy-to-implement in a program. esProc provides the file traversal functionality and, along with the capability of file handling it can do the job with only two lines of code.

Now list all the different words the text file T.txt contains and count the occurrences of each of them. Ignore the case of characters:

  A
1 =lower(file(“T.txt”).read()).words().groups(~:word;count(1):count)

WordCount is a famous programming exercise. esProc has words() function to split a string into separate words and only one line of code can complete the operation.

List all words containing the letters a, b and c in the text file T.txt. Ignore the case of characters: 

  A
1 =lower(file(“T.txt”).read()).words().select(~.array(“”).pos([“a”,”b”,”c”]))

Because the orders of these letters are different in different words, we cannot determine whether a word is eligible through substring searching. We should use array(“”) to break apart a string into a set of single characters and then find whether the set contains these letters or not. With the support of set operations, esProc can get this done with a one-liner.

To handle big files, we can simply alter these operations by retrieving data in segments or with cursor.

Structured-data processing

Controlled data retrieval

Import 4 columns – name, sex, age and phone – from a titled, comma-separated structured text file D.csv. The values of phone field are all numbers that must be imported as strings.

  A
1 =file(“D.csv”).import@tc(name,sex,age,phone:string)

The import function has various parameters and options to control whether the text file has a title or not, which separator it uses, which columns it needs, and what the data type is. In most cases, the retrieval of a structured file can be done with a single line of code.

The import result can be returned to the Java main program in the form of ResultSet object through JDBC for further processing. A programmer skilled at using JDBC can do this easily. The retrieval of the text file is like that of a database table.

The same work can be done with the cursor in handling a big file. 

  A  
1 =file(“D.csv”).cursor@tc(name,sex,age,phone:string)  
2 =A1.fetch(100) Fetch 100 rows

A cursor can also be returned to the Java main program via JDBC.

A big file can be retrieved in segments: 

  A  
1 =file(“D.csv”).import@tc(;2:4) Divide the file into 4 segments evenly and import the second one 

esProc can achieve a good retrieval efficiency by segmenting the file according to bytes (while segmenting file by rows requires traversing all the desired number of rows each time), and it can prevent the incomplete rows and ensure the data integrity by adopting a “skip head line and complement the tail line” strategy for each segment – that is giving the first line away to the previous segment and supplying the missing part to the last line.

The segment-style retrieval applies likewise to a cursor.

When the order is not important (in sum and count operations for example), we could take advantage of esProc’s built-in parallel framework to enhance performance: 

  A
1 =file(“D.csv”).import@tcm(name,sex,age,phone:string)

With this line of code, esProc will enable the multithreading function to automatically segment and retrieve the file in multiple threads. The text parsing is slow, but the parallel multithreaded retrieval can make good use of the computing capacity of multi-core CPU and thus greatly increase the parsing speed. 

Common computations

Find men who are above 25 and women who are above 23 in the text file D.csv, and then 1) list their names in alphabetical order; 2) group the persons by genders and calculate the average age separately; 3) list all the distinct surnames (but ignore the compound surnames). 

  A  
1 =file(“D.csv”).import@tc(name,sex,age)  
2 =A1.select(sex==”M”&&age>=25||sex==”F”&&age>=23) Filtering
3 =A2.sort(name) Sorting
4 =A2.groups(sex;avg(age):age) Grouping and aggregation
5 =A2.id(left(name,1)) Get distinct values

esProc is rich in functionalities for structured-data computations, empowering it to handle text files as database tables in some degree and to have the SQL-like computing capability even without a database.

Cursor can always be used to handle the big data:

  A  
1 =file(“D.csv”).cursor@tcm(name,sex,age)  
2 =A1.select(sex==”M”&&age>=25||sex==”F”&&age>=23) Filtering
3 =A2.sortx(name) Sorting
4 =A2.groups(sex;avg(age):age) Grouping and aggregation
3 =A2.groupx(left(name,1);) Get distinct values
4 =A3.fetch(…) Fetch result

Different from the in-memory computations, a cursor only traverses data once. So after the sorting operation in the above, another cursor needs to be created for the grouping operation.

Here’s a text file D.csv. According to the 4th -7th digits of each value in its phone field, we can find out the area that the phone number belongs to – area field – from the text file P.txt’s aid field. Now find records of D.csv that have Beijing phone numbers :

  A
1 =file(“D.csv”).import@tcm()
2 =file(“P.txt”).import@t(id,area)
3 =A1.derive(mid(phone,4,3):aid).switch(aid,A2:aid)
4 =A3.select(aid.area==”Beijing”)

We can use a cursor in A1 to handle big files.

esProc handles foreign key association using a pointer, making the reference easier. But we can still perform a SQL-style join:

  A  
1 =file(“D.csv”).import@tcm()  
2 =file(“P.txt”).import@t(id,area)  
3 =join@1(A1,left(phone,4,3);A2,aid) @1 means a left join
4 =A3.select(#2.area==”Beijing”).(#1)  

File comparison

Find the values of id field in the text file T1.txt that still appear in T2.txt and those that don’t show up there. 

  A  
1 =file(“T1.txt”).import@ti(id)  
2 =file(“T2.txt”).import@ti(id)  
3 =A1^A2 Intersection, which is the common values of T1 and T2
4 =A1\A2 Difference, which contains values T1 have but T2 don’t have

As you see, we just perform intersection and difference operations to compare column values of different files.

Among the rows corresponding to values of id field of T1.txt, find those whose ids exist in T2.txt and those whose ids don’t exist. 

  A  
1 =file(“T1.txt”).import@t().sort(id)
2 =file(“T2.txt”).import@t().sort(id)
3 =[A1,A2].merge@i(id) Intersection, which contains rows of T1 whose ids also exist in T2
4 =[A1,A2].merge@d(id) Difference, which contains rows of T1 whose ids don’t exist in T2

To get the whole rows, first sort the id fields in both files and then perform merge operation.

Find rows from T1.txt and T2.txt that have the common id but whose other columns are different: 

  A  
1 =file(“T1.txt”).import@t()  
2 =file(“T2.txt”).import@t()  
3 =join(A1,id;A2,id) An equi-join, which discards the unmatchable rows
4 =A3.select(cmp(#1,#2)!=0) Find different rows

The join function can align data according to a certain column.

We can also use merge and join to compare big files that cannot be entirely loaded into the memory. Both need data sorting first:

  A  
1 =file(“T1.txt”).cursor@t().sortx(id)
2 =file(“T2.txt”).cursor@t().sortx(id)
3 =[A1,A2].merge@xi(id) Intersection, which contains rows of T1 whose ids also exist in T2
4 =[A1,A2].merge@xd(id) Difference, which contains rows of T1 whose ids don’t exist in T2

The merge@x function merges sorted cursors. To perform external memory sorting for big files, use sortx()

  A  
1 =file(“T1.txt”).cursor@t().sortx(id)  
2 =file(“T2.txt”).cursor@t().sortx(id)  
3 =join@x(A1,id;A2,id) An equi-join, which discards the unmatchable rows
4 =A3.select(#1.array()!=#2.array())  

The join@x function joins sorted cursors.

Text-like data processing

json

Despite a sufficient number of class libraries for parsing and generating json data, Java lacks the capability to handle further computations. esProc supports multi-level data, and can parse the json data into computable memory data for further processing without compromising its integrity.

Here’s json data of a certain format:

         {

         “order”:[

                   {

                  “client”:”Beijing Raqsoft Inc.”,

                   “date”:”2015-6-23”,

                   “item” : [

                            {

                            “product”:”HP laptop”,

                            “number”:4,

                            “price”:3200

                            },

                            {

                            “product”:”DELL server”,

                            “number”:1,

                            “price”:22100

                            }]

                   },…]

         }

We need to write the json data to two database tables: order, which include three fields – orderid, client and data, and orderdetail, which includes five fields – orderid, seq, product, number and price. The orderid and seq fields of the orderdetail table can be generated according to the data order. 

  A
1 =file(“data.json”).read().import@j().order
2 =A1.new(#:orderid,client,date)
3 =A1.news(item;A1.#:orderid,#:seq,product,number,price)
4 >db.update@i(A2,order)
5 >db.update@i(A3,ordedetail)

esProc is able to parse the multi-level json strings into a multi-level data set, in which the value of item field in A3 is a table.

Besides data parsing, esProc can generate multi-level json strings from a multi-level data set.

Excel

Excel files are equal to the structured files. Java provides powerful open source class libraries (such as poi) for parsing XLS files, but they are low-level tools, making the development process very complex. By encapsulating poi, esProc is capable of retrieving XLS files into two-dimensional tables for further handling.

Here’re range.xls and position.xls

range.xls

 

position.xls

range start stop Point position
Range1 4561 6321 point1 5213
Range2 9842 11253 point2 10254
     

For each point position in position.xls, find the appropriate start/stop range in range.xls to cover it, and append the start and stop values in position.xls

  A
1 =file(“range.xls”).importxls@t()
2 =file(“position.xls”).importxls@t()
3 =A2.derive((t=A1.select@1(position>=start&&position<=stop)).range:range,t.start:start,t.stop:stop)
4 =file(“result.xls”).exportxls(A3)

esProc can give full play of its computing power after retrieving an XLS file. Excel VBA, however, is only capable of hardcoding joins, and sometimes even resorts to exporting data to the database. Both ways would generate bloated code.