In this section, you’ll develop an application that counts word frequencies in a text. The WordFrequencies application scans text files and counts the occurrences of each word in the text (download the project here). As you will see, the HashTable is the natural choice for storing this information because you want to access a word’s frequency by using the actual word as the key. To retrieve (or update) the frequency of the word elaborate, for example, you will use this expression:
WordFrequencies("ELABORATE")
where WordFrequencies is a properly initialized HashTable object. Indexing a HashTable by key returns the stored value directly (as an Object), so no additional property is needed.
When the code runs into another instance of the word elaborate, it simply increases the matching item of the WordFrequencies HashTable by one:
WordFrequencies("ELABORATE") = CType(WordFrequencies("ELABORATE"), Integer) + 1
Arrays and ArrayLists are out of the question because they can’t be accessed by a key. You could also use the SortedList collection (described later in this chapter), but this collection maintains its items sorted at all times. If you need this functionality as well, you can modify the application accordingly. The items in a SortedList are also accessed by keys, so you won’t have to introduce substantial changes in the code.
Let me start with a few remarks. First, all words we locate in the various text files will be converted to uppercase. Because the keys of the HashTable are case-sensitive, converting the words to uppercase ensures that hello, Hello, and HELLO are all counted as the same word rather than stored under three different keys.
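The effect of uppercasing the keys can be seen in a small stand-alone console sketch. The module name, the AddWord helper, and the sample spellings are made up for illustration and are not part of the WordFrequencies project. The sketch also shows one possible alternative, a HashTable created with a case-insensitive comparer, which this section does not use.

```vbnet
Imports System
Imports System.Collections

Module CaseHandlingSketch
    ' Hypothetical helper: adds one occurrence of key to the table
    Sub AddWord(ByVal table As Hashtable, ByVal key As String)
        If table.ContainsKey(key) Then
            table(key) = CType(table(key), Integer) + 1
        Else
            table.Add(key, 1)
        End If
    End Sub

    Sub Main()
        Dim spellings() As String = {"hello", "Hello", "HELLO"}

        ' Approach used in this section: uppercase every word before using it as a key
        Dim upperKeys As New Hashtable()
        ' Alternative (an assumption, not the section's code): keys that ignore case
        Dim anyCase As New Hashtable(StringComparer.OrdinalIgnoreCase)
        ' For contrast: default case-sensitive keys with no normalization
        Dim caseSensitive As New Hashtable()

        For Each w As String In spellings
            AddWord(upperKeys, w.ToUpper())
            AddWord(anyCase, w)
            AddWord(caseSensitive, w)
        Next

        Console.WriteLine(upperKeys.Count)      ' 1
        Console.WriteLine(anyCase.Count)        ' 1
        Console.WriteLine(caseSensitive.Count)  ' 3
    End Sub
End Module
```

With the case-insensitive comparer, the table keeps whichever spelling it saw first as the stored key, so uppercasing remains the simpler choice when the keys are later displayed.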
The frequencies of the words can’t be calculated instantly because we need to know the total number of words in the text. Instead, each value in the HashTable is the number of occurrences of a specific word. To calculate the actual frequency of the same word, we must divide this value by the number of occurrences of all words, but this can happen only after we have scanned the entire text file and counted the occurrences of each word.
The application’s interface is shown in Figure 10.3. To scan a text file and process its words, click the Read Text File button. The Open dialog box will prompt you to select the text file to be processed, and the application will display the number of unique words read from the file in a message box. Then you can click the Show Word Count button to display the number of occurrences of each word in the text. The last button on the form calculates the frequency of each word and sorts the words according to their frequencies.
The application maintains a single HashTable collection, the Words collection, and it updates this collection rather than counting word occurrences from scratch for each file you open. The Frequency Table menu contains the commands to save the words and their counts to a disk file and read the same data from the file. The commands in this menu can store the data either to a text file (Save XML/Load XML commands) or to a binary file (Save Binary/Load Binary commands). Use these commands to store the data generated in a single session, load the data in a later session, and process more files.
The WordFrequencies application uses techniques and classes we haven’t discussed yet. The topic of serialization is discussed in detail in the chapter “XML and Object Serialization,” whereas the topic of reading from (or writing to) files is discussed in the chapter “Accessing Folders and Files.” You don’t really have to understand the code that opens a text file and reads its lines; just focus on the segments that manipulate the items of the HashTable. (To test the example, I used a set of text files from my computer.)
The code reads the text into a string variable and then calls the Split method of the String class to split the text into individual words. The Split method uses the space, period, comma, question mark, exclamation mark, semicolon, colon, line-feed, carriage-return, and tab characters as delimiters. The individual words are stored in the Words array; after this array has been populated, the program goes through each word in the array and determines whether it’s a valid word by calling the IsValidWord() function. This function returns False if one of the characters in the word is not a letter; strings such as B2B or U2 are not considered proper words. IsValidWord() is a custom function, and you can edit it as you wish.
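The body of IsValidWord() is not shown in this section, so the following is only a minimal sketch of what such a function might look like, based on the description above. Rejecting empty strings is an added assumption; it is convenient because Split can return empty strings when two delimiters are adjacent.

```vbnet
Imports System

Module IsValidWordSketch
    ' Hypothetical implementation: True only for non-empty strings made entirely of letters
    Function IsValidWord(ByVal word As String) As Boolean
        ' Assumption: empty strings do not count as words
        If String.IsNullOrEmpty(word) Then Return False
        For Each ch As Char In word
            If Not Char.IsLetter(ch) Then Return False
        Next
        Return True
    End Function

    Sub Main()
        Console.WriteLine(IsValidWord("ELABORATE"))  ' True
        Console.WriteLine(IsValidWord("B2B"))        ' False
        Console.WriteLine(IsValidWord("U2"))         ' False
        Console.WriteLine(IsValidWord(""))           ' False
    End Sub
End Module
```

Note that a letters-only test like this one also rejects words containing an apostrophe or a hyphen; whether that is acceptable depends on the text you are processing.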
Any valid word becomes a key to the WordFrequencies HashTable. The corresponding value is the number of occurrences of the specific word in the text. If a key (a new word) is added to the table, its value is set to 1. If the key exists already, its value is increased by 1 via the following If statement:
If Not WordFrequencies.ContainsKey(word) Then
    WordFrequencies.Add(word, 1)
Else
    WordFrequencies(word) = CType(WordFrequencies(word), Integer) + 1
End If
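The CType call in the If statement is needed because a HashTable stores its values as Object. As a hedged aside, the same counting logic can be written with the generic Dictionary(Of String, Integer) class, which is strongly typed and needs no cast. This is a swapped-in alternative, not the collection the WordFrequencies application uses, and the CountWords function and the sample words are hypothetical.

```vbnet
Imports System
Imports System.Collections.Generic

Module TypedCountSketch
    ' Hypothetical helper: returns the number of occurrences of each word
    Function CountWords(ByVal words As IEnumerable(Of String)) As Dictionary(Of String, Integer)
        Dim counts As New Dictionary(Of String, Integer)
        For Each word As String In words
            Dim current As Integer = 0
            ' TryGetValue sets current to 0 when the key is missing
            counts.TryGetValue(word, current)
            ' the indexer adds the key if it is new, or overwrites the old count
            counts(word) = current + 1
        Next
        Return counts
    End Function

    Sub Main()
        Dim sample() As String = {"TO", "BE", "OR", "NOT", "TO", "BE"}
        Dim counts As Dictionary(Of String, Integer) = CountWords(sample)
        Console.WriteLine(counts("TO"))   ' 2
        Console.WriteLine(counts("NOT"))  ' 1
        Console.WriteLine(counts.Count)   ' 4
    End Sub
End Module
```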
The code that reads the text file and splits it into individual words is shown in Listing 10.6. The code reads the entire text into a string variable, the txtLine variable, and the individual words are isolated with the Split method of the String class. The Delimiters array stores the characters that the Split method will use as delimiters, and you can add more delimiters depending on the type of text you’re processing. If you’re counting keywords in program listings, for example, you’ll have to add the math symbols and parentheses as delimiters.
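One detail of Split is worth keeping in mind when choosing delimiters: when two delimiters are adjacent (a comma followed by a space, for instance), Split returns an empty string between them. The stand-alone sketch below illustrates this with a made-up sample sentence and a hypothetical SplitWords helper. It also shows the StringSplitOptions.RemoveEmptyEntries option, which discards those empty strings; using that option is a suggestion, not something the application’s code is confirmed to do.

```vbnet
Imports System
Imports Microsoft.VisualBasic

Module SplitSketch
    ' Same delimiter set as described in the text, written as Char literals
    Private ReadOnly Delimiters() As Char = {" "c, "."c, ","c, "?"c, "!"c, ";"c, ":"c, _
                                             ControlChars.Lf, ControlChars.Cr, ControlChars.Tab}

    ' Hypothetical helper: splits text into words, optionally dropping empty entries
    Function SplitWords(ByVal text As String, ByVal removeEmpty As Boolean) As String()
        If removeEmpty Then
            Return text.Split(Delimiters, StringSplitOptions.RemoveEmptyEntries)
        End If
        Return text.Split(Delimiters)
    End Function

    Sub Main()
        Dim sample As String = "Hello, world! Hello again."
        ' Adjacent delimiters and the trailing period produce empty strings
        Console.WriteLine(SplitWords(sample, False).Length)  ' 7
        ' With RemoveEmptyEntries only the four real words remain
        Console.WriteLine(SplitWords(sample, True).Length)   ' 4
    End Sub
End Module
```

If the empty strings are kept, they must be filtered out later (for example, by a validity check on each word), and they inflate the length of the resulting array.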
Listing 10.6: Splitting a Text File into Words
Private Sub bttnRead_Click(ByVal sender As System.Object, _
        ByVal e As System.EventArgs) Handles bttnRead.Click
    ' StreamReader and File require Imports System.IO at the top of the file
    ' prompt for text file
    OpenFileDialog1.DefaultExt = "TXT"
    OpenFileDialog1.Filter = "Text|*.TXT|All Files|*.*"
    If OpenFileDialog1.ShowDialog() = Windows.Forms.DialogResult.OK Then
        Dim str As StreamReader
        ' establish a StreamReader object to the file
        str = File.OpenText(OpenFileDialog1.FileName)
        Dim txtLine As String
        Dim Words() As String
        ' these are the common word delimiters
        Dim Delimiters() As Char = {" "c, "."c, ","c, "?"c, _
                                    "!"c, ";"c, ":"c, _
                                    Chr(10), Chr(13), ControlChars.Tab}
        Me.Text = "Calculating word count..."
        Me.Cursor = Cursors.WaitCursor
        ' read the entire text into the txtLine variable, then release the file
        txtLine = str.ReadToEnd
        str.Close()
        ' break text into individual words and store them into the Words array
        Words = txtLine.Split(Delimiters)
        Dim uniqueWords As Integer
        Dim iword As Integer, word As String
        ' iterate through all the words and add the unique ones to the HashTable
        ' Each word is a key for the word's count
        For iword = 0 To Words.GetUpperBound(0)
            word = Words(iword).ToUpper
            If IsValidWord(word) Then
                ' if word is in the list already, increase its count by 1
                ' if not, add the word and set its count to 1
                If Not WordFrequencies.ContainsKey(word) Then
                    WordFrequencies.Add(word, 1)
                    uniqueWords += 1
                Else
                    WordFrequencies(word) = CType(WordFrequencies(word), Integer) + 1
                End If
            End If
        Next
        Me.Text = "Word Frequencies"
        Me.Cursor = Cursors.Default
        ' Words.Length counts every string returned by Split, valid word or not
        MsgBox("Read " & Words.Length & " words and found " & _
               uniqueWords & " unique words")
        RichTextBox1.Clear()
    End If
End Sub
The bttnRead_Click event handler keeps track of the number of new unique words it adds to the HashTable and reports that number in a message box; it then clears the RichTextBox control. In a document with 90,000 words, it took less than a second to split the text and perform all the calculations. The process of displaying the list of unique words in the RichTextBox control was very fast, too, thanks to the StringBuilder class. The code behind the Show Word Count button (see Listing 10.7) displays the list of words along with the number of occurrences of each word in the text.
Listing 10.7: Displaying the Count of Each Word in the Text
Private Sub bttnCount_Click(ByVal sender As System.Object, _
        ByVal e As System.EventArgs) Handles bttnCount.Click
    Dim wEnum As IDictionaryEnumerator
    Dim allWords As New System.Text.StringBuilder
    ' iterate through the list and display words and their count
    wEnum = WordFrequencies.GetEnumerator
    While wEnum.MoveNext
        allWords.Append(wEnum.Key.ToString & vbTab & "-->" & _
                        vbTab & wEnum.Value.ToString & vbCrLf)
    End While
    RichTextBox1.Text = allWords.ToString
End Sub
The last button on the form calculates the frequency of each word in the HashTable, sorts the words according to their frequencies, and displays the list. Its code is detailed in Listing 10.8.
Listing 10.8: Sorting the Words According to Frequency
Private Sub bttnShow_Click(ByVal sender As System.Object, _
        ByVal e As System.EventArgs) Handles bttnSort.Click
    Dim wEnum As IDictionaryEnumerator
    ' VB arrays are sized by upper bound, so Count - 1 yields exactly Count elements
    Dim Words(WordFrequencies.Count - 1) As String
    Dim Frequencies(WordFrequencies.Count - 1) As Double
    Dim allWords As New System.Text.StringBuilder
    Dim i, totCount As Integer
    ' iterate through the list, copy each word and its count, and total the counts
    wEnum = WordFrequencies.GetEnumerator
    While wEnum.MoveNext
        Words(i) = CType(wEnum.Key, String)
        Frequencies(i) = CType(wEnum.Value, Integer)
        totCount = totCount + Convert.ToInt32(Frequencies(i))
        i = i + 1
    End While
    ' convert each count to a frequency (a fraction of the total word count)
    For i = 0 To Words.GetUpperBound(0)
        Frequencies(i) = Frequencies(i) / totCount
    Next
    ' sort by frequency (ascending); the Words array is reordered to match
    Array.Sort(Frequencies, Words)
    RichTextBox1.Clear()
    ' walk the arrays backward to list the most frequent words first, as percentages
    For i = Words.GetUpperBound(0) To 0 Step -1
        allWords.Append(Words(i) & vbTab & "-->" & vbTab & _
                        Format(100 * Frequencies(i), "#.000") & vbCrLf)
    Next
    RichTextBox1.Text = allWords.ToString
End Sub
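The sorting step relies on the two-array form of Array.Sort, which sorts the first array (the keys) and rearranges the second array (the items) so that each item stays paired with its key. The following stand-alone sketch, with made-up words and counts, shows that pairing and the backward loop that lists the largest values first.

```vbnet
Imports System
Imports Microsoft.VisualBasic

Module SortPairsSketch
    Sub Main()
        ' Sample data for illustration only
        Dim words() As String = {"THE", "CAT", "SAT"}
        Dim counts() As Double = {5, 1, 2}

        ' Sorts counts in ascending order and reorders words to match:
        ' counts becomes 1, 2, 5 and words becomes CAT, SAT, THE
        Array.Sort(counts, words)

        ' Walk backward to print the highest count first
        For i As Integer = words.GetUpperBound(0) To 0 Step -1
            Console.WriteLine(words(i) & vbTab & counts(i))
        Next
        ' Prints THE, SAT, CAT (each followed by a tab and its count)
    End Sub
End Module
```

Because the sort is ascending, any unused array element with a value of 0 would end up at index 0 and be printed last, which is one reason to size the arrays to exactly the number of words.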
Handling Large Sets of Data
Incidentally, my first attempt was to display the list of unique words in a ListBox control. The process was incredibly slow. The first 10,000 words were added in a couple of seconds, but as the number of items increased, the time it took to add them to the control increased exponentially (or so it seemed). Adding thousands of items to a ListBox control is a very slow process. You can call the BeginUpdate/EndUpdate methods, but they won’t help a lot. Sometimes a seemingly simple task turns out to be detrimental to your application’s performance.
You should try different approaches but also consider a total overhaul of your user interface. Ask yourself this: Who needs to see a list with 10,000 words? You can use the application to do the calculations and then retrieve the count of selected words, display the 100 most common ones, or even display 100 words at a time. I’m displaying the list of words because this is a demonstration, but a real application shouldn’t display such a long list. The core of the application counts unique words in a text file, and it does it very efficiently.
Even if you decide to display an extremely long list of items on your interface, you should test some worst-case scenarios (that is, attempt to load the control with too many items), and if this causes serious performance problems, consider different controls. I’ve decided to append all the items to a StringBuilder variable and then display this variable in a RichTextBox control. I could have used a plain TextBox control (after all, I’m not formatting the list of words and their frequencies), but the RichTextBox allowed me to specify the absolute tab positions. The tab positions of the TextBox control are fixed and weren’t wide enough for all words.
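The suggestion of showing only the most common words, rather than the entire list, could be sketched along the following lines. TopWords is a hypothetical helper that is not part of the WordFrequencies project; it assumes a HashTable whose keys are words and whose values are Integer counts, as in this section.

```vbnet
Imports System
Imports System.Collections

Module TopWordsSketch
    ' Hypothetical helper: returns up to maxWords words, most frequent first
    Function TopWords(ByVal table As Hashtable, ByVal maxWords As Integer) As String()
        Dim words(table.Count - 1) As String
        Dim counts(table.Count - 1) As Integer
        Dim i As Integer = 0
        For Each entry As DictionaryEntry In table
            words(i) = CType(entry.Key, String)
            counts(i) = CType(entry.Value, Integer)
            i += 1
        Next
        Array.Sort(counts, words)   ' ascending by count, words reordered to match
        Array.Reverse(words)        ' most frequent word first
        ' never return more words than exist, and never a negative number of them
        Dim n As Integer = Math.Max(0, Math.Min(maxWords, words.Length))
        Dim result(n - 1) As String
        Array.Copy(words, result, n)
        Return result
    End Function

    Sub Main()
        Dim table As New Hashtable()
        table.Add("THE", 5)
        table.Add("CAT", 1)
        table.Add("SAT", 2)
        For Each word As String In TopWords(table, 2)
            Console.WriteLine(word)   ' THE, then SAT
        Next
    End Sub
End Module
```

Words with equal counts come back in no particular order, because Array.Sort does not guarantee a stable sort.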