Python – ABC analysis for linguists

About a year ago I wrote an article about ABC analysis for linguists in C#. Today I decided to do the same in Python, just for fun and to check whether the 80/20 rule holds.

So I started writing the code. These were the tasks ahead:

Open a file and read it;
Save all the words in a list, stripping the non-word characters from them;
Make all the words lower case, in order to avoid duplicates;
For every unique word from the file, save its occurrences in the file;
Count the total words;
Count the total unique words;
Based on the variable wordsInAnalysis (which is hardcoded), find the top words and their total occurrences. E.g., in the given example, the top 3 words are “shall” (549 times), “national” (181 times) and “assembly” (148 times);
Calculate the percentage of the selected unique words from the total unique words (the 20 from the 80/20 rule);
Calculate the percentage of the occurrences of the selected unique words from all the words in the text (the 80 from the 80/20 rule);
Print anything useful.
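To make the steps above concrete, here is a minimal sketch that runs them on a tiny in-memory text instead of a file. The sample sentence and the function name abc_summary are made up for illustration only; they are not part of the original program.

```python
from collections import Counter


def abc_summary(text, top_n):
    """Run the steps listed above on a string and return the key numbers."""
    # Lower-case, split on whitespace and strip the surrounding punctuation.
    tokens = [word.strip(",.\n?!:;[]()") for word in text.lower().split()]
    tokens = [word for word in tokens if word]  # drop empty leftovers

    counts = Counter(tokens)                    # occurrences of every unique word
    top = counts.most_common(top_n)             # (word, occurrences) pairs
    top_occurrences = sum(n for _, n in top)

    return {
        "total_words": len(tokens),
        "unique_words": len(counts),
        "top": top,
        "top_occurrences": top_occurrences,
        # The "20" side: share of the unique words that were selected.
        "percent_of_unique": 100 * top_n / len(counts),
        # The "80" side: share of the whole text covered by those words.
        "percent_of_text": 100 * top_occurrences / len(tokens),
    }


# Hypothetical sample text, used only for illustration.
sample = "The cat sat. The cat ran! A dog sat; the bird sang."
summary = abc_summary(sample, top_n=2)
print(summary)
```

Here 2 of the 8 different words (25%) cover 5 of the 12 words in the text (about 42%). A real text is needed to see anything close to 80/20.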

It was actually not that hard. Python has a lot of helper classes and libraries, so I was even wondering which one to use. What matters more are the results. To make them more representative, I decided to exclude all words shorter than 4 characters. This is not a perfect solution, but it lets the program skip the articles and the short prepositions. Last but not least, I decided to analyze the constitution of Bulgaria, which I easily found online.

The English translation of the Bulgarian constitution has 1549 different words of 4 or more characters, so I tested with the top 310 words (20% of 1549 ≈ 310). This is my result:

The result shows 20/70 in this case: 20% of the different words account for 70% of all the words in the text. Considering that I removed the short words, this is still pretty close to the Pareto rule. You can try it with your own files; it can be really useful for linguists and translators:

from collections import Counter

# Number of top words to analyze (hardcoded: 20% of the 1549 unique words).
wordsInAnalysis = 310

# Read the file, lower-case every word and strip the surrounding punctuation.
with open('textFile.txt') as myFile:
    text = [word.strip(",.\n?!:;[]()")
            for line in myFile for word in line.lower().split()]

# Keep only the words longer than 3 characters.
words = [word for word in text if len(word) > 3]


# print(Counter(words))
print("\n\n")

TopWordsNumber = 0

# most_common() returns (word, occurrences) pairs, most frequent first.
TopDictionary = Counter(words).most_common(wordsInAnalysis)
for i in TopDictionary:
    TopWordsNumber += i[1]
    print(i)

TotalWords = len(words)
DiffWordsTotal = len(set(words))
# Both percentages are truncated to whole numbers.
PercentUniqueWordsOverAllUnique = int((wordsInAnalysis / DiffWordsTotal) * 100)
PercentUniqueWordsUsedInTime = int((TopWordsNumber / TotalWords) * 100)


print("\n\nAnalysis: \n\n")
print("Words in total: %s" % (TotalWords))
print("Different words in total: %s" % (DiffWordsTotal))
print("Words in top %s in total: %s" % (wordsInAnalysis, TopWordsNumber))
print("%s percent of the words are used in %s percent of the time in %s" % (
    PercentUniqueWordsOverAllUnique, PercentUniqueWordsUsedInTime, myFile.name))
print("\n\n")
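
The program above needs wordsInAnalysis to be entered by hand (310 is 20% of 1549, rounded). As a possible variation, the sketch below derives the cut-off from the text itself and reports the coverage for several shares at once. The function name coverage_table and the default shares are my assumptions, not part of the original program.

```python
from collections import Counter
from itertools import accumulate


def coverage_table(words, shares=(0.1, 0.2, 0.5)):
    """Return (share of unique words, share of all occurrences) pairs."""
    counts = Counter(words)
    if not counts:
        return []
    # Occurrence counts ordered from the most to the least frequent word.
    ordered = [n for _, n in counts.most_common()]
    running = list(accumulate(ordered))  # cumulative occurrences
    total = running[-1]
    rows = []
    for share in shares:
        # Use at least one word and at most all of them.
        top_n = min(len(ordered), max(1, round(share * len(ordered))))
        rows.append((share, running[top_n - 1] / total))
    return rows


# Hypothetical word list, used only for illustration.
demo_words = ["shall"] * 6 + ["national", "assembly", "state", "president"]
for share, covered in coverage_table(demo_words, shares=(0.2, 0.4, 1.0)):
    print("%.0f%% of the unique words cover %.0f%% of the text"
          % (share * 100, covered * 100))
```

With the real data you would pass the words list built by the program instead of demo_words.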

Enjoy your day! 🙂
