TI 84 Plus CE Python and Numworks - Cosine Similarity
Note: The Python script should also work on the TI-83 Premium CE Python Edition, TI-Nspire CX II, Casio fx-9750GIII, and Casio fx-CG 50. The script only calls for the math module, which is standard on all platforms.
Introduction and Calculating the Cosine Similarity
We can measure how similar two phrases are by calculating the cosine similarity. This is the same calculation used to find the angle between two vectors.
Here are the steps:
1. Separate each phrase into a list of words. In Python, this is easily achieved with the .split() method. We can split on any character, but if the argument is left blank, the string is split on whitespace (spaces).
Example:
str1="here comes the sun"
list1=str1.split()
print(list1)
Output:
['here', 'comes', 'the', 'sun']
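As a small illustration of the separator argument mentioned above, here is a sketch using standard Python string behavior. The variable names are made up for this example and are not part of the script.

```python
# Splitting with no argument versus an explicit separator.
phrase = "here  comes the sun"        # note the double space after "here"

words_default = phrase.split()        # runs of whitespace are collapsed
words_space = phrase.split(" ")       # an explicit " " keeps an empty string
csv_words = "here,comes,the,sun".split(",")  # any other character works too

print(words_default)
print(words_space)
print(csv_words)
```

Leaving the argument blank is the safer choice for typed phrases, since an accidental double space does not create an empty "word".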
2. Return the unique elements of the combined list. That is, filter out any repeats. Since the program works with two phrases, we first combine the two lists of words and then filter out any repeated words.
Example:
list1=['my', 'apple', 'is', 'in', 'my', 'apple', 'pie']
u=[]
for i in list1:
    if i not in u:
        u.append(i)
print(u)
Output:
['my', 'apple', 'is', 'in', 'pie']
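The example above filters a single list. A minimal sketch of the combining step for two phrases might look like this, assuming the same loop wrapped in a function; the name vocab is just a label chosen for this illustration.

```python
def unique(words):
    # keep the first occurrence of each word, preserving order
    u = []
    for w in words:
        if w not in u:
            u.append(w)
    return u

list1 = "girls like flowers and trees".split()
list2 = "boys like trees and raccoons".split()

# join the two word lists first, then filter out the repeats
vocab = unique(list1 + list2)
print(vocab)
```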
3. Count how many times each word in the combined unique list appears in each of the two phrases. The .count(arg) method of a list returns the number of times arg occurs in that list.
Example:
list1=['my', 'apple', 'is', 'in', 'my', 'apple', 'pie']
print(list1.count('apple'))
Output:
2
This step is accomplished by the list comprehension:
[lsrc.count(i) for i in lmain]
lsrc = source list
lmain = main list
We are counting the number of occurrences of each word in lmain found in lsrc.
The result is two vectors of integers, both the same length as the unique word list.
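To make the count vectors concrete, here is a short sketch that applies the list comprehension to two sample phrases. The unique word list is typed in by hand here so the snippet runs on its own; v1 and v2 are names chosen for this illustration.

```python
# unique words from both phrases, in order of first appearance
vocab = ['girls', 'like', 'flowers', 'and', 'trees', 'boys', 'raccoons']

list1 = "girls like flowers and trees".split()
list2 = "boys like trees and raccoons".split()

# one count per word in vocab, so both vectors have len(vocab) entries
v1 = [list1.count(w) for w in vocab]
v2 = [list2.count(w) for w in vocab]

print(v1)
print(v2)
```

A word that is missing from a phrase simply gets a count of 0, which is what keeps the two vectors aligned position by position.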
4. Calculate the cosine similarity by the formula:
cos θ = dot(v1, v2) ÷ (norm(v1) × norm(v2))
dot(v1, v2): the dot product of the count vectors
norm(v1) and norm(v2): norm of the count vectors
Because word counts are never negative, the cosine similarity here varies between 0 and 1. We are not going to calculate θ itself. Hence the cosine similarity (CS) is:
CS = dot(v1, v2) ÷ (norm(v1) × norm(v2))
For more details, please refer to the excellent "Cosine Similarity, Clearly Explained!!!" video from StatQuest, which is listed in the Sources below.
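The formula from step 4 can be checked on a pair of small count vectors. This is only a sketch: the vectors are written out by hand (they correspond to the phrases "girls like flowers and trees" and "boys like trees and raccoons"), and the names dot, norm1, and norm2 are chosen for this illustration.

```python
from math import sqrt

v1 = [1, 1, 1, 1, 1, 0, 0]
v2 = [0, 1, 0, 1, 1, 1, 1]

# dot product: multiply matching positions and add the products
dot = sum([v1[i] * v2[i] for i in range(len(v1))])

# norm: square root of the sum of the squares
norm1 = sqrt(sum([x ** 2 for x in v1]))
norm2 = sqrt(sum([x ** 2 for x in v2]))

cs = dot / (norm1 * norm2)
print(dot)   # three shared words: like, and, trees
print(cs)    # approximately 0.6, subject to floating point rounding
```

Both norms are the square root of 5, so the exact value is 3/5.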
Python Code: cossim2.py
# phrases program
# 2023-08-06 ews
from math import *
# subroutines
def unique(l):
    # return the elements of l with repeats removed
    u=[]
    for i in l:
        if i not in u:
            u.append(i)
    return u
def counta(lmain,lsrc):
    # count how often each word of lmain occurs in lsrc
    c=[lsrc.count(i) for i in lmain]
    return c
def norm(v):
    # v is a list of integers
    s=[i**2 for i in v]
    s=sqrt(sum(s))
    return s
# main program
print("\nDo not use punctuation")
str1=input("phrase 1? ")
str2=input("phrase 2? ")
# split into 2 lists
list1=str1.split()
list2=str2.split()
# find the unique list
list3=list1+list2
list3=unique(list3)
# word count
listc1=counta(list3,list1)
listc2=counta(list3,list2)
# vector operations
# norm
n1=norm(listc1)
n2=norm(listc2)
# dot
d=sum([listc1[i]*listc2[i] for i in range(len(listc1))])
# cosine similarity
c=d/(n1*n2)
# no need to take the arccosine
print("cosine similarity: ")
print(c)
print("\n0: no words in common \n1: all words in common")
Numworks page: https://my.numworks.com/python/ews31415/cossim2
Download: https://drive.google.com/file/d/1qOtHIau6vm_TolNQugjhHthnO3ECoe9d/view?usp=sharing
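One thing to watch for: the script divides by n1*n2, so if either phrase is left empty, its count vector is all zeros and the division fails with a ZeroDivisionError. Below is a hedged sketch of the same steps wrapped in a function with a guard. The function name cos_sim and the choice to return 0.0 for an empty phrase are assumptions of this sketch, not part of cossim2.py.

```python
from math import sqrt

def cos_sim(str1, str2):
    # hypothetical wrapper around the steps used in the script
    list1 = str1.split()
    list2 = str2.split()
    # unique words from both phrases
    vocab = []
    for w in list1 + list2:
        if w not in vocab:
            vocab.append(w)
    # count vectors
    v1 = [list1.count(w) for w in vocab]
    v2 = [list2.count(w) for w in vocab]
    # dot product and norms
    d = sum([a * b for a, b in zip(v1, v2)])
    n1 = sqrt(sum([a * a for a in v1]))
    n2 = sqrt(sum([b * b for b in v2]))
    # assumption: an empty phrase is treated as "no words in common"
    if n1 == 0 or n2 == 0:
        return 0.0
    return d / (n1 * n2)

print(cos_sim("hello world", "hi planet earth"))
print(cos_sim("", "hello world"))
```

Wrapping the calculation in a function also makes it easy to compare several pairs of phrases in one run.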
Examples
Phrase 1: hello world
Phrase 2: hi planet earth
Cosine Similarity: 0.0 (no words in common)
Phrase 1: girls like flowers and trees
Phrase 2: boys like trees and raccoons
Cosine Similarity: 0.5999999999998 (exact: 0.6)
When entering phrases, do not use punctuation. I would just use all lowercase or all uppercase for the most accurate results, since the comparison is case-sensitive. The matches are exact, so spelling counts!
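If retyping phrases is inconvenient, one possible approach is to clean them up before splitting. The sketch below is an assumption on my part rather than part of the script: the helper name normalize is hypothetical, and it simply lowercases the phrase and drops every character that is not a letter, digit, or space.

```python
def normalize(phrase):
    # hypothetical helper: lowercase, then keep letters, digits, and spaces
    kept = ""
    for ch in phrase.lower():
        if ch.isalpha() or ch.isdigit() or ch == " ":
            kept += ch
    return kept

print(normalize("Here comes the Sun!").split())
print(normalize("Hello, World.").split())
```

Note that this also removes apostrophes, so a word like "don't" becomes "dont" in both phrases, which still matches consistently.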
Sources
Infopedic Techie. "Python program to find the unique values in a list || Python list [example-3]". YouTube video posted on March 7, 2019. https://www.youtube.com/watch?v=7f2UJgig2yI
StatQuest with Josh Starmer. "Cosine Similarity, Clearly Explained!!!" YouTube video posted on January 29, 2023. https://www.youtube.com/watch?v=e9U0QAFbfLI
"Cosine Similarity". Wikipedia. Last edited July 6, 2023, accessed August 2023. https://en.wikipedia.org/wiki/Cosine_similarity
Eddie
All original content copyright, © 2011-2023. Edward Shore. Unauthorized use and/or unauthorized distribution for commercial purposes without express and written permission from the author is strictly prohibited. This blog entry may be distributed for noncommercial purposes, provided that full credit is given to the author.