Detecting Language by Letter Frequency (Python)
This was our first assignment in the Python language. We had to detect the language of a text based on the frequency of its letters.
Python...
My biggest challenge here was learning the new syntax of Python. Many of the computer science concepts I learned in Java are still applicable in Python (if statements, loops, errors, etc.). However, the syntax (grammar) of the language was super different from anything I'd seen before.
Reading files is astonishingly easy in Python! I just had to use the built-in open(fileName) function to read the file, whereas in Java I would normally create a File, a BufferedReader, etc.
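To give a feel for how little code this takes, here is a minimal sketch of reading a whole file with open(). The helper name read_text and the sample file are made up for illustration and are not from the original assignment.

```python
import tempfile
from pathlib import Path

def read_text(file_name):
    """Return the whole contents of a text file as one string."""
    # 'with' closes the file automatically, even if reading raises an error.
    with open(file_name, encoding="utf-8") as f:
        return f.read()

# Hypothetical usage: write a small sample file, then read it back.
sample_path = Path(tempfile.gettempdir()) / "sample_text.txt"
sample_path.write_text("Hello, world!", encoding="utf-8")
print(read_text(sample_path))
```

The with statement plays roughly the role that closing a BufferedReader plays in Java, just without the extra objects.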
Processing Data
Loading and sorting the data was the toughest part of this challenge for me. I started by taking a spreadsheet of letter frequencies (taken from the link above) and converting it to a .csv (comma-separated values) file. This way, each cell could be found just by splitting each line at the commas.
Next, for each line of the file, I created a dictionary (a map that stores values under unique keys) for that line's language. Then, for each letter, I entered its frequency into the dictionary.
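The loading step could look something like the sketch below. The file layout is an assumption (a header row of letters, then one row per language), and the function name load_frequencies and the three-letter sample numbers are invented purely to show the shape of the result.

```python
def load_frequencies(lines):
    """Build {language: {letter: frequency}} from comma-separated lines.

    Assumed layout (hypothetical): the first line is a header of letters,
    and every later line is a language name followed by one frequency
    per letter, in the same order as the header.
    """
    # Drop blank lines and trailing newlines before splitting.
    lines = [line.strip() for line in lines if line.strip()]
    letters = lines[0].split(",")[1:]  # skip the "language" header cell
    languages = {}
    for line in lines[1:]:
        cells = line.split(",")
        name, values = cells[0], cells[1:]
        # One dictionary per language, keyed by letter.
        languages[name] = {
            letter: float(value) for letter, value in zip(letters, values)
        }
    return languages

# Made-up sample data, only to illustrate the structure.
sample = [
    "language,a,b,c",
    "English,0.08,0.02,0.03",
    "French,0.07,0.01,0.03",
]
table = load_frequencies(sample)
print(table["English"]["a"])
```

Because the function accepts any iterable of lines, an open file object can be passed in directly. For files with quoted cells, Python's csv module would be a safer choice than split(",").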
The Algorithm
This algorithm is pretty simple. It determines the frequency of each letter in the text (# occurrences/total # letters) and then compares those frequencies to each registered language (from the .csv file). To compare the text to a language, it finds the sum of the differences between the two frequencies of each individual letter.
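Here is a rough sketch of how that comparison might be written. It makes two assumptions that the description above leaves open: the differences are taken as absolute values (so positive and negative gaps do not cancel out), and the language with the smallest total is picked as the match. The function names and the two toy languages are hypothetical.

```python
from collections import Counter

def letter_frequencies(text):
    """Map each letter to (# occurrences / total # letters), ignoring non-letters."""
    letters = [ch for ch in text.lower() if ch.isalpha()]
    if not letters:
        return {}
    counts = Counter(letters)
    total = len(letters)
    return {letter: n / total for letter, n in counts.items()}

def distance(text_freqs, language_freqs):
    """Sum of per-letter differences; using abs() here is an assumption."""
    letters = set(text_freqs) | set(language_freqs)
    return sum(
        abs(text_freqs.get(letter, 0.0) - language_freqs.get(letter, 0.0))
        for letter in letters
    )

def detect_language(text, languages):
    """Return the language whose frequencies are closest to the text's (assumed rule)."""
    freqs = letter_frequencies(text)
    return min(languages, key=lambda name: distance(freqs, languages[name]))

# Toy two-letter "languages" with made-up frequencies.
languages = {
    "LangA": {"a": 0.5, "b": 0.5},
    "LangB": {"a": 0.1, "b": 0.9},
}
print(detect_language("aabb", languages))
```

With real data, the languages dictionary would hold one entry per row of the .csv file, and longer input texts should give more reliable matches than short ones.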