Session E-TEXT

Text and String Handling

Steven Black
steveb@stevenblack.com


Text

Steven has been a Fox developer since 1986. He markets Steven Black's INTL Toolkit, a multi-lingual framework for FoxPro and Visual FoxPro, which he created in 1993 and continues to refine. He has been a featured speaker at FoxPro DevCons and regional conferences, and his contributions occasionally darken the pages of VFP books and magazines. His company, SBC, is based in Kingston, Ontario, and operates worldwide. He specializes in multi-lingual, multi-site, and other challenging FoxPro projects, including out-of-control project turn-arounds and cleanups. He consults with small developers as well as Fortune 500 companies, national and international government agencies, and software development companies to elevate their development teams. He is also the creator and webmeister of the FoxForum Wiki, a popular collaborative topic-based Visual FoxPro website at http://fox.wikis.com/.

Note: Thanks to Rick Strahl, who contributed several good ideas.

Introduction

This session introduces, illustrates, and explores some of the great (and not so great) string handling capabilities of Visual FoxPro.

I've pondered doing a session on this topic for many years - I always seem to be involved with solving many text-data related problems in my VFP projects. On the surface, handling text isn't very sexy and seemingly not very interesting. I think otherwise, and I hope you'll agree. This document is split into three sections: Inbound is about getting text into the VFP environment so you can work with it. Processing is about manipulating the text, and Outbound is about sending text on its way when you're done.

To illustrate text handling in VFP, I am using the following things:

Some facts about VFP strings

Here are a few things you need to know about VFP strings:

  • In functional terms, there is no difference between a character field and a memo field. All functions that work on characters also work on memos.

  • The maximum number of characters that VFP can handle in a string is 16,777,184.

Inbound

This section is all about getting text into your processing environment.

Inbound text from table fields

To retrieve text from a table field, simply assign it to a memory variable.

    CREATE TABLE Ships (Name C(40), Desc M)
    INSERT INTO Ships (Name, Desc) VALUES ("Exxon Valdez", "Billions unpaid!")
    LOCAL lcShip, lcDesc
    lcShip= Ships.Name
    lcDesc= Ships.Desc

Note that it doesn't matter if the field is type Character or Memo.

Inbound from text files

There are many ways to retrieve text from files on disk.

FILETOSTR(cFileName) is used to place the contents of a disk file into a string memory variable. This is among my favorite new functions in VFP 6. It's both useful and fast. For example, the following code executes in one-seventh of a second on my 220Mhz Pentium laptop.

    LOCAL t1, t2, x
    t1= SECONDS()
    x= FILETOSTR("WarAndPeace.TXT")      && 3.2 mb in size
    t2= SECONDS()
    ?t2-t1, "seconds"      && 0.140 (seconds)
    ?len(x)      && 3271933

In other words, on a very modest laptop (by today's standards) VFP can load the full text from Tolstoy's War And Peace in one-seventh of a second.

Low Level File Functions (LLFF) are somewhat more cumbersome but offer great control. LLFF are also very fast. The following example reads the entire contents of Tolstoy's War And Peace from disk into memory:


    LOCAL t1, t2, x, h
    t1= SECONDS()
    h= FOPEN("WarAndPeace.TXT")      && 3.2 mb in size
    x= FREAD(h,4000000)
    t2= SECONDS()
    ?t2-t1, "seconds"      && 0.140 seconds
    ?LEN(x)      && 3271933
    FCLOSE(h)

Given the similar execution times, I think we can conclude that internally, LLFF and FILETOSTR() are implemented similarly. However with the LLFF we also have fine control. For example, FGETS() allows us to read a line at a time. To illustrate, the following code reads the first 15 lines of War And Peace into array wpLines.

    LOCAL t1, t2, i, h
    LOCAL ARRAY wpLines[15]
    CLEAR
    t1= SECONDS()
    h= FOPEN("WarAndPeace.TXT")      && 3.2 mb in size
    FOR i= 1 TO ALEN( wpLines)
        wpLines[i]= FGETS(h)
    ENDFOR
    t2= SECONDS()
    ?t2-t1, "seconds"      && 0.000 seconds
    FOR i= 1 TO ALEN( wpLines)
        ? wpLines[i]
    ENDFOR
    FCLOSE(h)
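
If you don't know the number of lines in advance, you can keep calling FGETS() until FEOF() reports the end of the file. The following is a minimal sketch, not a tuned routine. It assumes the same WarAndPeace.TXT file as above, and the 1024-byte read size and the variable names are my own choices for illustration. It also checks the handle, since FOPEN() returns -1 when the file can't be opened.

    LOCAL h, lnCount, lcLine
    h= FOPEN("WarAndPeace.TXT")
    IF h < 0
        ? "Could not open the file"
    ELSE
        lnCount= 0
        DO WHILE NOT FEOF(h)
            * Read one line, or at most 1024 bytes of it
            lcLine= FGETS(h, 1024)
            lnCount= lnCount+1
        ENDDO
        FCLOSE(h)
        ? lnCount, "lines read"
    ENDIF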

We can also retrieve a segment from War And Peace. FSEEK() moves the LLFF pointer, and the FREAD() function is used to read a range. Let's read, say, 1000 bytes about half way through the book.

    LOCAL t1, t2, x, h
    CLEAR
    t1= SECONDS()
    h= FOPEN("WarAndPeace.TXT")      && 3.2 mb in size
    FSEEK(h, 1500000)      && Move the pointer
    x=FREAD(h, 1000)      && Read 1000 bytes
    t2= SECONDS()
    ?t2-t1, "seconds"      && 0.000 seconds
    SET MEMOWIDTH TO 8000
    ?x
    FCLOSE(h)

Inbound from text files, with pre-processing

Sometimes you need to pre-process text before it is usable. For example, you may have an HTML file from which you need to clean and remove tags. Or maybe you have the problem exhibited by our copy of War and Peace, which has embedded hard-returns at the end of each line. How can we create a streaming document that we can actually format?

Often the answer is to use the APPEND FROM command, which imports from a file into a table and supports a large variety of file formats. The strategy always works something like this: you create a single-field table, and you use APPEND FROM ... TYPE SDF to load it.

    LOCAL t1, t2
    CLEAR
    t1= SECONDS()
    CREATE TABLE Temp (cLine C (254))
    APPEND FROM WarAndPeace.Txt TYPE SDF
    t2= SECONDS()
    ?t2-t1, "seconds"      && 1.122 seconds

Now you're good to go: You've got a table of records that you can manipulate and transform to your heart's content using VFP's vast collection of functions.
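
As one example of such a transformation, here is a rough sketch that glues the hard-wrapped lines back into streaming paragraphs. It assumes the Temp table created above is still open, and it assumes that a blank line marks a paragraph break, which may not hold for every source file. The variable name lcText is just for illustration.

    LOCAL lcText
    lcText= ""
    SELECT Temp
    SCAN
        IF EMPTY(cLine)
            * Assumption: a blank line separates two paragraphs
            lcText= lcText+ CHR(13)+CHR(10)+CHR(13)+CHR(10)
        ELSE
            * Drop the SDF padding and join the line to the current paragraph
            lcText= lcText+ RTRIM(cLine)+ " "
        ENDIF
    ENDSCAN
    ? LEN(lcText)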

Processing

This section discusses a wide variety of string manipulation techniques in Visual FoxPro. Let's say we've got some text in our environment, now let's muck with it.

Does a sub-string exist?

There are many ways to determine if a sub-string exists in a string. The $ operator returns True if a sub-string is contained in a string, and False otherwise. This operator is fast. Try this:

    LOCAL t1, t2, x
    x= FILETOSTR("WarAndPeace.TXT")      && 3.2 mb in size
    t1= SECONDS()
    ? "THE END" $ x
    t2= SECONDS()
    ?[ "THE END" $ x], t2-t1, "seconds"      && 0.180 seconds

The AT() and ATC() functions are also great for determining if a sub-string exists, the latter having the advantage of being case-insensitive. Moreover, their return values give you the exact position of the sub-string.

    LOCAL t1, t2, x, lnAtPos
    x= FILETOSTR("WarAndPeace.TXT")      && 3.2 mb in size
    t1= SECONDS()
    lnAtPos=ATC("the end",x)
    t2= SECONDS()
    ?lnAtPos      && 97837
    ?[ATC( "the end", x)], t2-t1, "seconds"      && 0.110 seconds
    ?SUBS(x, lnAtPos,20)      && the end of it...
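
To see the case rules without loading a 3 MB file, here is a small sketch on a literal string. The string is made up for illustration.

    ? "END" $ "THE END"          && .T.
    ? "end" $ "THE END"          && .F. ($ is case-sensitive)
    ? AT("end", "THE END")       && 0 (AT() is case-sensitive)
    ? ATC("end", "THE END")      && 5 (ATC() ignores case)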

The OCCURS() function will also tell you if a sub-string exists, and moreover tell you how many times the sub-string occurs. This code will count the number of occurrences of a variety of sub-strings in War And Peace.

    LOCAL t1, t2, x, i
    LOCAL ARRAY aStrings[5]
    aStrings[1]= "Russia"      && 775 occurrences
    aStrings[2]= "Anna"      && 293 occurrences
    aStrings[3]= "Czar"      && 4 occurrences
    aStrings[4]= "windows"      && 23 occurrences
    aStrings[5]= "Pentium"      && 0 occurrences
    x= FILETOSTR("WarAndPeace.TXT")      && 3.2 mb in size
    FOR i = 1 TO ALEN( aStrings)
        t1= SECONDS()
        ?OCCURS(aStrings[i],x)
        t2= SECONDS()
        ?[OCCURS( "]+aStrings[i]+[", x)], t2-t1, "seconds"      && 0.401 seconds avg
    ENDFOR

Locating sub-strings in strings is something VFP does really well.

Locating sub-strings

One of the basic tasks in almost any string manipulation is locating sub-strings within larger strings. Four useful functions for this are AT(), RAT(), ATC(), and RATC(). These return the ordinal position of a sub-string, searching from the left (AT()) or from the right (RAT()). ATC() is the case-insensitive variant of AT(). RATC() is the double-byte-character variant of RAT() and, like RAT(), is case-sensitive. All these functions are very fast and scale well with file size. For example, let's go look for "THE END" in War And Peace.

    LOCAL t1, t2, x
    x= FILETOSTR("WarAndPeace.TXT")      && 3.2 mb in size
    t1= SECONDS()
    ?AT("THE END",x)      && 3271915
    t2= SECONDS()
    ?[AT( "THE END", x)], t2-t1, "seconds"      && 0.180 seconds
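
The direction and occurrence arguments are easier to see on a short string. This is a small sketch on a made-up literal. The last line shows one way to search from the right regardless of case, by normalizing the case with UPPER() first.

    ? AT("a", "banana")              && 2
    ? AT("a", "banana", 2)           && 4
    ? RAT("a", "banana")             && 6
    ? RAT("a", "banana", 2)          && 4
    ? RAT("A", UPPER("banana"))      && 6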

You can also check for the nth occurrence of a sub-string, as illustrated below where we find the 1st, 101st, 201st...701st occurrence of the word "Russia" in War And Peace.

    LOCAL t1, t2, x, i, nOcc
    x= FILETOSTR("WarAndPeace.TXT")      && 3.2 mb in size
    FOR i= 0 TO 7
        nOcc= i*100+1
        t1= SECONDS()
        ?AT("Russia",x, nOcc )
        t2= SECONDS()
        ?[AT("Russia", x, ]+Transform(nOcc)+[)], t2-t1      && 0.180 sec on average
    ENDFOR

Two other functions are seemingly useful for locating strings: ATLINE() and ATCLINE(). These return the line number of the first occurrence of a string. Two things to note: First the result of these functions is sensitive to the value of SET MEMOWIDTH, and secondly the performance of these functions is relatively pathetic.

    LOCAL t1, t2, x, lnline
    x= FILETOSTR("WarAndPeace.TXT")      && 3.2 mb in size
    t1= SECONDS()
    lnline=ATLINE("Anna",x)
    t2= SECONDS()
    ?lnLine                                 && 16
    ?[ATLINE("Anna",x)], t2-t1, "seconds"      && 0.771 seconds
    t1= SECONDS()
    lnline=ATLINE("CZAR",x)
    t2= SECONDS()
    ?lnLine                                  && 0
    ?[ATLINE("CZAR",x)], t2-t1, "seconds"        && 1075.467 seconds !!
    t1= SECONDS()
    lnline=ATLINE("THE END",x)
    t2= SECONDS()
    ?lnLine                                  && 54223
    ?[ATLINE("THE END",x)], t2-t1, "seconds"           && 1081.835 seconds !!

Which leads to the following general observation: Any function that is sensitive to SET MEMOWIDTH is dog-slow on larger strings and does not scale well at all.
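
If all you need is the physical line number of a match, one workaround is to find the position with AT() and then count the carriage returns that precede it with OCCURS(). This is a different technique from ATLINE(), not a drop-in replacement: it ignores SET MEMOWIDTH, so the two will disagree whenever a line is longer than the memo width. A minimal sketch on a made-up three-line string:

    LOCAL lcText, lnPos, lnLine
    lcText= "alpha"+CHR(13)+CHR(10)+"beta"+CHR(13)+CHR(10)+"gamma"
    lnPos= AT("gamma", lcText)
    * Count the CHR(13)s up to the match; 0 means not found
    lnLine= IIF(lnPos= 0, 0, OCCURS(CHR(13), LEFT(lcText, lnPos))+1)
    ? lnPos       && 14
    ? lnLine      && 3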

Traversing text line-by-line

Iterating through text, one line at a time, is a common task. Here's the way VFP developers have been doing it for years: Using the MEMLINES() and MLINE() functions. Like this:

    LOCAL x, y, i, t1, t2
    SET MEMOWIDTH TO 8000
    x=FILETOSTR(home()+"Genhtml.prg")          && 767 lines

    t1=SECONDS()
    FOR i= 1 TO MEMLINES(x)
        y=MLINE(x,i)
    ENDFOR
    t2=SECONDS()
    ?"Using MLINE()", t2-t1, "seconds" && 22.151 seconds

That's pathetic performance. 20+ seconds to iterate through 767 lines! Fortunately, there's a trick to using MLINE(), which is to pass the _MLINE system memory variable as the third parameter. Like this.

    LOCAL x, y, i, t1, t2
    SET MEMOWIDTH TO 8000
    x=FILETOSTR(home()+"Genhtml.prg")          && 767 lines

    t1=SECONDS()
    _MLINE=0
    FOR i= 1 TO MEMLINES(x)
        y=MLINE(x,1,_MLINE)
    ENDFOR
    t2=SECONDS()
    ?"Using MLINE() with _MLINE", t2-t1, "seconds"   && 0.451 seconds

Now that's more like it - a fifty-fold improvement. A surprising number of VFP developers don't know this idiom with _MLINE even though it's been documented in the FoxPro help since version 2 at least.

Starting in VFP 6 all this is obsolete, since ALINES() is a screaming new addition to the language. Let's see how these routines look and perform with ALINES().

    LOCAL x, i, y, t1, t2
    LOCAL ARRAY laArray[1]
    x=FILETOSTR(home()+"Genhtml.prg")         && 767 lines

    t1=SECONDS()
    FOR i= 1 TO ALINES(laArray,x)
        y=laArray[i]
    ENDFOR
    t2=SECONDS()
    ?"Using ALINES() and traverse array:", t2-t1, "seconds" && 0.020 seconds

Another twenty-fold improvement in speed. I think the lesson is clear: If you are using MLINE() in your applications, and you are using VFP 6, then it's time to switch to ALINES(). There are just two major differences: First, ALINES() is limited by VFP's 65,000 array element limit, and second, successive lines with only CHR(13) carriage returns are considered as one line. For example:

    lcSource = "How~~Many~~Lines?~"
    lcString = STRTRAN(lcSource,"~",CHR(13))
    ?ALINES(aParms,lcString) && 3

But if you use carriage return + line feed, CHR(13)+CHR(10), you'll get the results you expect.

    lcSource = "How~~Many~~Lines?~"
    lcString = STRTRAN(lcSource,"~",CHR(13)+CHR(10))
    ?ALINES(aParms,lcString) && 5

This is a bit unnerving if blank lines are important, so beware and use CHR(13)+CHR(10) to avoid this problem.
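
When you don't control where the text comes from, one defensive approach is to normalize every line ending to CHR(13)+CHR(10) before calling ALINES(). The following is a sketch under that assumption. The sample string and variable names are made up, and I deliberately show no expected count, since the result depends on how your version of ALINES() treats the input.

    LOCAL lcText, lnLines
    LOCAL ARRAY laLines[1]
    lcText= "How"+CHR(13)+CHR(13)+"Many"+CHR(10)+"Lines?"
    * Step 1: collapse existing CR+LF pairs to a lone LF
    lcText= STRTRAN(lcText, CHR(13)+CHR(10), CHR(10))
    * Step 2: turn any remaining lone CR into LF
    lcText= STRTRAN(lcText, CHR(13), CHR(10))
    * Step 3: expand every LF to CR+LF
    lcText= STRTRAN(lcText, CHR(10), CHR(13)+CHR(10))
    lnLines= ALINES(laLines, lcText)
    ? lnLines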

Now, just for fun, let's rip through War And Peace using ALINES().

    LOCAL x, i, y, t1, t2
    LOCAL ARRAY laArray[1]
    x=FILETOSTR("WarAndPeace.TXT")

    t1=SECONDS()
    FOR i= 1 TO ALINES(laArray,x)          && 54,227 array elements
        y=laArray[i]
    ENDFOR
    t2=SECONDS()
    ?"Using ALINES() and traverse", t2-t1, "seconds"         && 3.395 seconds

Excuse me, but wow. Consider that we're creating a 54,227-element array from the text of the book, then traversing the entire array and assigning each element's contents to a memory variable, and we're back in 3.4 seconds.

What about just creating the array of War And Peace:

    LOCAL x, t1, t2
    LOCAL ARRAY laArray[1]
    x=FILETOSTR("WarAndPeace.TXT")
    t1=SECONDS()
    ?ALINES(laArray,x)               && 54,227 array elements
    t2=SECONDS()
    ?"Using ALINES() to load War and Peace", t2-t1, "seconds" && 2.203 seconds

So, on my Pentium 233 laptop using VFP 6, we can load War and Peace from disk into a 54,000-item array in 2.2 seconds. On my newer desktop machine, a Pentium 500, this task is subsecond.

Traversing text word-by-word

You could traverse a string word-by-word by using, among other things, the return values of AT(" ",x,n) and SUBSTR(). If you are doing that, you're missing a great and little-known feature of VFP.

Two little-known functions are great for word-by-word text processing. The Words() and WordNum() functions, which are available to you when you load the FoxTools.FLL library, return the number of words and individual words respectively.

Let's see how they perform. Let's first count the words in War And Peace.

    SET LIBRARY TO (HOME()+"FOXTOOLS")
    LOCAL x, t1, t2
    x=FILETOSTR("WarAndPeace.TXT")            && 3.2 mb in size
    t1=SECONDS()
    ?WORDS(x)            && 565412
    t2=SECONDS()
    ?"Using WORDS() on War and Peace", t2-t1, "seconds"         && 0.825 seconds

The Words() function is also useful for counting all sorts of tokens since you can pass the word delimiters in the second parameter. How many sentences are there in War And Peace?

    SET LIBRARY TO (HOME()+"FOXTOOLS")
    LOCAL x, t1, t2
    x=FILETOSTR("WarAndPeace.TXT")             && 3.2 mb in size
    t1=SECONDS()
    ?WORDS(x, ".")                          && (Note the ".")         26673
    t2=SECONDS()
    ?"Using WORDS() counting sentences in W&P", t2-t1, "seconds" && 0.803 seconds

WordNum() returns a specific word from a string. What's the 666th word in War And Peace? What about the 500000th?

    SET LIBRARY TO (HOME()+"FOXTOOLS")
    LOCAL x, t1, t2
    x=FILETOSTR("WarAndPeace.TXT")          && 3.2 mb in size
    t1=SECONDS()
    ?WORDNUM(x, 666)         && Anna
    t2=SECONDS()
    ?"Finding the 666th word in W&P", t2-t1, "seconds"         && 0.381 seconds
    t1=SECONDS()
    ?WORDNUM(x, 500000)          && his
    t2=SECONDS()
    ?"Finding the 500000th word in W&P", t2-t1, "seconds"        && 1.001 seconds

Similarly to Words(), we can use WordNum() to return a token from a string by specifying the delimiter. What's the 2000th sentence in War And Peace?

    SET LIBRARY TO (HOME()+"FOXTOOLS")
    LOCAL x, t1, t2
    x=FILETOSTR("WarAndPeace.TXT")        && 3.2 mb in size
    t1=SECONDS()
    ?WORDNUM(x, 2000, ".")
    t2=SECONDS()
    ?"Finding the 2000th sentence in W&P", t2-t1, "seconds" && 0.391 seconds
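
Here is a small sketch of both functions on a short, made-up string, including a loop over every word. It assumes FoxTools.FLL is in the VFP home directory. Keep in mind that the timings above suggest WordNum() takes longer the deeper the word sits in the string, so a loop like this is best kept to short strings.

    SET LIBRARY TO (HOME()+"FOXTOOLS") ADDITIVE
    LOCAL lcText, i
    lcText= "the quick brown fox"
    ? WORDS(lcText)              && 4
    ? WORDNUM(lcText, 3)         && brown
    ? WORDS("a,b;c", ",;")       && 3 (two delimiter characters)
    FOR i= 1 TO WORDS(lcText)
        ? i, WORDNUM(lcText, i)
    ENDFOR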

Substituting text

VFP has a number of useful functions for substituting text: STRTRAN(), CHRTRAN(), CHRTRANC(), STUFF(), and STUFFC().

STRTRAN() replaces occurrences of a string with another. For example, let's change all occurrences of "Anna" to "the McBride twins" in War And Peace.

    LOCAL t1, t2, x
    x= FILETOSTR("WarAndPeace.TXT")       && 3.2 mb in size
    ? "Words in W&P:", WORDS(x)         && 565412 words
    t1= SECONDS()
    x=STRTRAN(x,"Anna", "the McBride twins")
    t2= SECONDS()
    ? t2-t1, "seconds"          && 2.314 seconds
    ?OCCURS("the McBride twins", x), "occurrences"         && 293 occurrences
    ? "Words in W&P:", WORDS(x)             && 565412 words

That's over 125 replacements per second, which is phenomenal. What about removing strings?

    LOCAL t1, t2, x
    x= FILETOSTR("WarAndPeace.TXT")          && 3.2 mb in size
    ? "Words in W&P:", WORDS(x)         && 565412 words
    t1= SECONDS()
    x=STRTRAN(x,"Anna")
    t2= SECONDS()
    ? t2-t1, "seconds"        && 2.293 seconds
    ? "Words in W&P:", WORDS(x)          && 565168 words

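So it appears that STRTRAN() both adds and removes strings with equal aplomb. What of CHRTRAN(), which swaps characters? Note that CHRTRAN() maps characters one-for-one, so it cannot turn "s" into "ch": given a search list of "s" and a replacement list of "ch", every "s" becomes "c" and the surplus "h" is ignored. Let's run exactly that call on War and Peace.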

    LOCAL t1, t2, x
    x= FILETOSTR("WarAndPeace.TXT")       && 3.2 mb in size
    t1= SECONDS()
    x=CHRTRAN(x,"s", "ch")
    t2= SECONDS()
    ? t2-t1, "seconds"         && 0.521 seconds

Which isn't bad considering that there are 159,218 occurrences of character "s" in War And Peace.

However, don't try to use CHRTRAN() to delete characters by passing an empty string as the replacement (third) parameter. The performance of CHRTRAN() in these circumstances is terrible. If you need to suppress sub-strings, use STRTRAN() instead.
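
The difference between the two functions is easy to miss, so here is a small sketch on made-up literals. CHRTRAN() translates character by character, while STRTRAN() replaces whole sub-strings. The third line shows that a character in the search list with no counterpart in the replacement list is deleted.

    ? CHRTRAN("glass", "s", "ch")        && glacc
    ? STRTRAN("glass", "s", "ch")        && glachch
    ? CHRTRAN("abcdef", "ace", "XY")     && XbYdf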

String Concatenation

VFP has tremendous concatenation speed if you use it in a particular way. Since many common tasks, like building web pages, involve building documents one element at a time, you should know that string expressions of the form x=x+y are very fast in VFP. Consider this:

    LOCAL t1, t2, x
    x= FILETOSTR("WarAndPeace.TXT") && 3.2 mb in size
    t1= SECONDS()
    x= x+ "<b>Drink Milk</b>!"
    t2= SECONDS()
    ? t2-t1, "seconds" && 0.000 seconds

The same type of performance applies if you build strings small chunks at a time, which is a typical scenario in dynamic Web pages whether a template engine or raw output is used. For example:

    LOCAL t1, t2, x, y, count
    t1= SECONDS()
    x = ""
    y = "VFP Strings are fast"
    FOR count = 1 TO 10000
        x = x + y
    ENDFOR
    t2= SECONDS()
    ? t2-t1, "seconds" && 0.030 seconds
    ? LEN(x) && 200,000 chars
    RETURN

This full optimization occurs as long as the string is adding something to itself and as long as the string concatenated is stored in a variable. Using class properties is somewhat less efficient. String optimization does not occur if the first expression on the right of the = sign is not the same as the string being concatenated. So:

    x = "" + x + y

is not optimized in this fashion. The above line, placed in the example above, takes 25 seconds! So appending strings to strings is blazingly fast in most common situations.
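
If you want to measure the difference on your own machine, here is a sketch that times both forms side by side. The 2000-iteration count is an arbitrary choice of mine to keep the slow loop bearable, and I show no timings because they will vary with hardware and VFP version.

    LOCAL x, y, i, t1
    y= "VFP Strings are fast"

    * Optimized form: the target is the first term on the right
    x= ""
    t1= SECONDS()
    FOR i= 1 TO 2000
        x= x+ y
    ENDFOR
    ? "x = x + y:", SECONDS()-t1, "seconds"

    * Non-optimized form: something else comes first
    x= ""
    t1= SECONDS()
    FOR i= 1 TO 2000
        x= ""+ x+ y
    ENDFOR
    ? [x = "" + x + y:], SECONDS()-t1, "seconds"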

Outputting text

So you've got text, maybe a lot of it. What are your options for writing it to disk?

First and foremost, there's the new STRTOFILE() function, which creates a disk file with the contents of a string. Let's write War And Peace to disk.

    LOCAL t1, t2, x
    x= FILETOSTR("WarAndPeace.TXT")        && 3.2 mb in size
    t1= SECONDS()
    STRTOFILE(x,"Junk.txt")
    t2= SECONDS()
    ? t2-t1, "seconds"          && 0.480 seconds

Which means that you can dish 3+ Mb to disk in about a half-second.

You can also use Low Level File Functions (LLFF) to output text. The FWRITE() function dumps all or part of a string to disk. The FPUTS() function outputs a string as a single line, terminated with a carriage return and line feed, and moves the pointer.

    LOCAL t1, t2, x, h
    x= FILETOSTR("WarAndPeace.TXT")          && 3.2 mb in size
    t1= SECONDS()
    h=FCREATE("Junk.txt")
    FWRITE(h, x)
    FCLOSE(h)
    t2= SECONDS()
    ? t2-t1, "seconds"           && 0.451 seconds

Here again, the similar performance times between FWRITE() and STRTOFILE() are striking, just as they were when comparing FREAD() and FILETOSTR().

Here's an example of outputting War And Peace line-by-line using FPUTS(). Since we're using ALINES(), it's not that onerous a task. In fact, it's very slick!

    LOCAL x, h, i, t1, t2
    LOCAL ARRAY laArray[1]

    x=FILETOSTR("WarAndPeace.TXT")    && 3.2 mb in size

    t1=SECONDS()
    h=FCREATE("Junk.txt")
    FOR i= 1 TO ALINES(laArray,x)
        FPUTS(h, laArray[i])
    ENDFOR
    FCLOSE(h)
    t2=SECONDS()
    ?"Total time:", t2-t1, "seconds"      && 3.595 seconds
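
In production code you would normally check that the file was actually created, since FCREATE() returns -1 on failure. The sketch below shows that check, contrasts FPUTS() with FWRITE(), and then appends to the same file with STRTOFILE() by passing .T. as the third parameter. The file name and the sample strings are placeholders of mine.

    LOCAL h
    h= FCREATE("Junk.txt")
    IF h < 0
        ? "Could not create Junk.txt, FERROR() =", FERROR()
    ELSE
        FPUTS(h, "First line")       && FPUTS() appends CR+LF
        FWRITE(h, "Second line")     && FWRITE() writes the bytes as-is
        FCLOSE(h)
        * Append to the existing file instead of overwriting it
        STRTOFILE(CHR(13)+CHR(10)+"Third line", "Junk.txt", .T.)
        ? FILETOSTR("Junk.txt")
    ENDIF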

Conclusion

So, there you have it, a cafeteria-style tour of VFP's text handling capabilities. I personally think that most of the code snippets I've shown here have amazing and borderline unbelievable execution speeds. I hope I've been able to show that VFP really excels at string handling.

