
Read a text file and do frequency analysis using PowerShell


Summary: Learn how to read a text file and do a letter-frequency analysis using Windows PowerShell in this article written by the Microsoft Scripting Guy, Ed Wilson.

Today I am going to put the script I wrote yesterday together with the script that I wrote on Friday. After I do that, I will be able to get a more accurate letter-frequency analysis of a text file. The code that I wrote the other day reads a text file by using the Get-Content cmdlet. Then I join the strings together so that I have a single string to parse. I then convert the string to all uppercase, get the enumerator, group my results, and sort my results.

So, first of all, here is the basic letter-frequency analysis code that I wrote the other day:

$a = Get-Content C:\fso\ATaleOfTwoCities.txt
$a.Count
$ajoined = $a -join "`r"
$ajoinedUC = $ajoined.ToUpper()
$ajoinedUC.GetEnumerator() | group -NoElement | sort count -Descending
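
If you do not have the book file handy, the same pipeline can be tried on a couple of lines held in memory. This is only a minimal sketch: the $sample lines are made up for illustration, and I use the full cmdlet names (Group-Object, Sort-Object) in place of the group and sort aliases.

```powershell
# Hypothetical sample standing in for the lines that Get-Content would return.
$sample = 'It was the best of times,', 'it was the worst of times'

# Join the lines into one string, then uppercase it so 'i' and 'I' are counted together.
$joinedUC = ($sample -join "`r").ToUpper()

# Enumerate the characters, group identical ones, and sort by how often each appears.
$freq = $joinedUC.GetEnumerator() | Group-Object -NoElement | Sort-Object Count -Descending

# Show the three most frequent characters.
$freq | Select-Object -First 3
```

Note that the enumerator returns every character in the string, so spaces, punctuation, and the carriage return used to join the lines are counted along with the letters.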

Put the script together

The first thing I do is copy the code to a blank page in my Windows PowerShell integrated scripting environment (ISE). This is shown here:

Screenshot of the basic letter-frequency analysis code in the Windows PowerShell ISE.

Now I need to take the code that I wrote yesterday. This code removes the beginning and ending portions of the text file.

$a = Get-Content 'C:\fso\MobyDick.txt'

$array = @()
for ($i = 0; $i -lt $a.Count; $i++)
{
If ($a[$i] -cmatch 'START')
{$array += $i }
If ($a[$i] -like "End of *Project*")
{$array += $i }
}

$start = $array[0] + 7
$end = $array[1] - 1
$a[$start .. $end]

This script also reads the text file. It then creates an empty array, loops through the text, and looks for start and end strings. It then saves the line numbers that it finds so that I can use array notation to return a range of text from the file.
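
To make the index bookkeeping easier to follow, here is a small sketch of the same loop run against an in-memory array. The $lines content, the $markers and $body variable names, and the offset of 1 are all assumptions for the example (the real script skips 7 lines after the start marker).

```powershell
# Hypothetical stand-in for the file: a header, two lines of body text, and a footer.
$lines = @(
    'Title page'
    '*** START OF THIS PROJECT GUTENBERG EBOOK ***'
    'Call me Ishmael.'
    'Some years ago...'
    'End of the Project Gutenberg EBook'
    'License text'
)

$markers = @()
for ($i = 0; $i -lt $lines.Count; $i++)
{
    # -cmatch is a case-sensitive regular expression match; -like is a wildcard match.
    if ($lines[$i] -cmatch 'START') { $markers += $i }
    if ($lines[$i] -like 'End of *Project*') { $markers += $i }
}

# This toy sample has no extra header lines after the start marker,
# so the sketch uses an offset of 1 instead of the 7 used for the real file.
$start = $markers[0] + 1
$end   = $markers[1] - 1

# The range operator expands to the indexes 2 and 3, which selects the body text.
$body = $lines[$start..$end]
$body
```

Because the start marker is found first, $markers[0] holds the start index and $markers[1] holds the end index. If a file contained either marker more than once, that assumption would no longer hold.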

I paste this code at the beginning of my new script page because I need to grab the correct text BEFORE I convert it all to a single line of text, convert it to uppercase, and count the letters. So, at this point, my script appears as shown here:

Screenshot of yesterday’s code pasted before the basic letter-frequency analysis code in the Windows PowerShell ISE.

Clean up the code

Well, there are some redundancies. The code as it stands is shown here:

$a = Get-Content 'C:\fso\MobyDick.txt'

$array = @()
for ($i = 0; $i -lt $a.Count; $i++)
{
If ($a[$i] -cmatch 'START')
{$array += $i }
If ($a[$i] -like "End of *Project*")
{$array += $i }
}

$start = $array[0] + 7
$end = $array[1] - 1
$a[$start .. $end]

$a = Get-Content C:\fso\ATaleOfTwoCities.txt
$a.Count
$ajoined = $a -join "`r"
$ajoinedUC = $ajoined.ToUpper()
$ajoinedUC.GetEnumerator() | group -NoElement | sort count -Descending

So, the obvious duplication is the second Get-Content line. I delete it, and my script is shown here:

$a = Get-Content 'C:\fso\MobyDick.txt'

$array = @()
for ($i = 0; $i -lt $a.Count; $i++)
{
If ($a[$i] -cmatch 'START')
{$array += $i }
If ($a[$i] -like "End of *Project*")
{$array += $i }
}

$start = $array[0] + 7
$end = $array[1] - 1
$a[$start .. $end]

$a.Count
$ajoined = $a -join "`r"
$ajoinedUC = $ajoined.ToUpper()
$ajoinedUC.GetEnumerator() | group -NoElement | sort count -Descending

The next thing I need to do is to delete the $a.Count line because I do not need it either. The script now is shown here:

$a = Get-Content 'C:\fso\MobyDick.txt'

$array = @()
for ($i = 0; $i -lt $a.Count; $i++)
{
If ($a[$i] -cmatch 'START')
{$array += $i }
If ($a[$i] -like "End of *Project*")
{$array += $i }
}

$start = $array[0] + 7
$end = $array[1] - 1
$a[$start .. $end]

$ajoined = $a -join "`r"
$ajoinedUC = $ajoined.ToUpper()
$ajoinedUC.GetEnumerator() | group -NoElement | sort count -Descending

The last thing I need to do is to store the text that I grab by using array notation. At this point, the script only displays that range, and the frequency code still works on the entire file. So that I do not need to modify my copied frequency code, I simply store the result of $a[$start .. $end] back into the $a variable. This revised line is shown here:

$a = $a[$start .. $end]

The entire script is shown here:

$a = Get-Content 'C:\fso\MobyDick.txt'

$array = @()
for ($i = 0; $i -lt $a.Count; $i++)
{
If ($a[$i] -cmatch 'START')
{$array += $i }
If ($a[$i] -like "End of *Project*")
{$array += $i }
}

$start = $array[0] + 7
$end = $array[1] - 1
$a = $a[$start .. $end]

$ajoined = $a -join "`r"
$ajoinedUC = $ajoined.ToUpper()
$ajoinedUC.GetEnumerator() | group -NoElement | sort count -Descending

The script is shown here in the ISE:

Screenshot of the entire edited script in the Windows PowerShell ISE.

The output from this script is shown here:

Screenshot of output of the script.
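
The grouped output includes every character in the text, so spaces and punctuation show up next to the letters. If only letters are of interest, one possible refinement is sketched here. Both the Get-LetterFrequency function name and the letters-only filter are my own assumptions for illustration and are not part of the script above.

```powershell
# Hypothetical helper: the function name and the letters-only filter are illustrative.
function Get-LetterFrequency
{
    param([string[]]$Lines)

    # Same join and uppercase steps as the script in the article.
    $joinedUC = ($Lines -join "`r").ToUpper()

    # Keep only letters (dropping spaces, digits, punctuation, and the join character),
    # then group and sort as before.
    $joinedUC.GetEnumerator() |
        Where-Object { [char]::IsLetter($_) } |
        Group-Object -NoElement |
        Sort-Object Count -Descending
}

# Made-up sample lines in place of the trimmed text stored in $a.
$result = Get-LetterFrequency -Lines 'Call me Ishmael.', 'Some years ago'
$result | Select-Object -First 3
```

With the real file, the call would look something like Get-LetterFrequency -Lines $a after the line that stores the trimmed range back into $a.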

I invite you to follow me on Twitter and Facebook. If you have any questions, send email to me at scripter@microsoft.com, or post your questions on the Official Scripting Guys Forum. Also check out my Microsoft Operations Management Suite Blog. See you tomorrow. Until then, peace.

Ed Wilson
Microsoft Scripting Guy

 

 

 

Hey, Scripting Guy! Blog

The post Read a text file and do frequency analysis using PowerShell appeared first on Windows Wide Open.

