Saturday, October 7, 2023

TI 84 Plus CE Python and Numworks - Cosine Similarity




Note:  The Python script should also work on the TI-83 Premium CE Python Edition, TI-Nspire CX II, Casio fx-9750GIII, and Casio fx-CG 50. The script only calls for the math module, which is available on all of these platforms.  



Introduction and Calculating the Cosine Similarity


We can find how similar two phrases are by calculating the cosine similarity.  This is similar to finding the angle between two vectors.  





Here are the steps:


1.  Separate each phrase into a list of words.   In Python, this can easily be achieved with the .split() method.  We can split on any character, but if the argument is left blank, the string is split on whitespace.


Example:

str1="here comes the sun"

list1=str1.split()

print(list1) 


Output:

['here', 'comes', 'the', 'sun']
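
To illustrate splitting on a character other than the space, here is a small sketch.  The comma-separated string and the names str2, list2, and list3 are made up for this illustration and are not part of the original script.

```python
# Hypothetical example string; the comma is the separator here.
str2 = "here,comes,the,sun"
list2 = str2.split(",")
print(list2)

# With no argument, split() breaks on any run of whitespace,
# so repeated spaces do not create empty strings.
list3 = "here   comes the sun".split()
print(list3)
```

Both print statements should show the same four words as the example above.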


2.  Return the unique elements of the combined list. That is, filter out any repeats. Since the program is using two lists, we need to combine the two lists of words from the phrases and then filter out any repeated words.


Example:

list1=['my', 'apple', 'is', 'in', 'my', 'apple', 'pie']

u=[ ]

for i in list1:

  if i not in u:

    u.append(i)

print(u)


Output:

['my', 'apple', 'is', 'in', 'pie']
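
The example above filters a single list.  As a rough sketch of the two-list case, the lists can be joined with + before filtering.  The two word lists here are hypothetical.

```python
def unique(l):
  # keep the first occurrence of each word, preserving order
  u = []
  for i in l:
    if i not in u:
      u.append(i)
  return u

# Hypothetical word lists from two phrases
list1 = ['my', 'apple', 'pie']
list2 = ['my', 'cherry', 'pie']

# combine first, then filter out the repeats
list3 = unique(list1 + list2)
print(list3)
```

The words 'my' and 'pie' appear in both lists but only once in list3.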


3.  Obtain a word count for each of the two phrases against the unique list of words from the two phrases combined.     The .count(arg) method of a list returns the number of times arg occurs in that list.


Example:

list1=['my', 'apple', 'is', 'in', 'my', 'apple', 'pie']

print(list1.count('apple'))


Output: 

2   


This step is accomplished by the list comprehension:


[lsrc.count(i) for i in lmain]


lsrc = source list (the words of one phrase)

lmain = main list (the unique words of both phrases combined)


We are counting the number of occurrences of each word in lmain found in lsrc.


The result is two equal-length vectors of integers, one for each phrase.
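
As a minimal sketch of this step, the list comprehension is applied once per phrase against the same main list.  The word lists and the names v1 and v2 are assumptions chosen for this illustration.

```python
# Hypothetical word lists from two phrases
list1 = ['my', 'apple', 'pie']
list2 = ['my', 'cherry', 'pie', 'pie']

# unique words of both lists, in order of first appearance
lmain = ['my', 'apple', 'pie', 'cherry']

# one count vector per phrase, both the same length as lmain
v1 = [list1.count(i) for i in lmain]
v2 = [list2.count(i) for i in lmain]
print(v1)
print(v2)
```

A word that is missing from a phrase gets a count of 0, which is what keeps the two vectors the same length.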


4.  Calculate the cosine similarity by the formula:


cos θ = dot(v1, v2) ÷ (norm(v1) × norm(v2))


dot(v1, v2):  the dot product of the count vectors

norm(v1) and norm(v2):  norm of the count vectors


Because word counts are never negative, the cosine similarity here varies between 0 and 1.   We are not going to calculate θ itself.  Hence the cosine similarity (CS) is:


CS = dot(v1, v2) ÷ (norm(v1) × norm(v2))
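
The formula can be sketched directly on a pair of count vectors.  The vectors below are hypothetical, and the variable names are chosen for this illustration only.

```python
from math import sqrt

# Hypothetical count vectors of equal length
v1 = [1, 1, 1, 0]
v2 = [1, 0, 2, 1]

# dot product: sum of the pairwise products
dot = sum([a * b for a, b in zip(v1, v2)])

# norm: square root of the sum of squares
norm1 = sqrt(sum([a ** 2 for a in v1]))
norm2 = sqrt(sum([b ** 2 for b in v2]))

cs = dot / (norm1 * norm2)
print(cs)
```

Here the dot product is 3 and the norms are √3 and √6, so CS = 3 ÷ √18, which is about 0.7071.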


For more details, please refer to the excellent "Cosine Similarity, Clearly Explained!!!" video from StatQuest, which is listed in the Sources below.  



Python Code:  cossim2.py


# phrases program

# 2023-08-06 ews


from math import *


# subroutines

def unique(l):

  u=[]

  for i in l:

    if i not in u:

      u.append(i)

  return u


def counta(lmain,lsrc):

  c=[lsrc.count(i) for i in lmain]

  return c


def norm(v):

  # v is a list of integers

  s=[i**2 for i in v]

  s=sqrt(sum(s))

  return s


# main program

print("\nDo not use punctuation")

str1=input("phrase 1? ")

str2=input("phrase 2? ")


# split into 2 lists

list1=str1.split()

list2=str2.split()


# find the unique list

list3=list1+list2

list3=unique(list3)


# word count

listc1=counta(list3,list1)

listc2=counta(list3,list2)


# vector operations

# norm

n1=norm(listc1)

n2=norm(listc2)

# dot

d=sum([listc1[i]*listc2[i] for i in range(len(listc1))])


# cosine similarity

c=d/(n1*n2)

# no need to take the arccosine

print("cosine similarity: ")

print(c)

print("\n0: no words in common \n1: all words in common")


Numworks page:    https://my.numworks.com/python/ews31415/cossim2


Download:  https://drive.google.com/file/d/1qOtHIau6vm_TolNQugjhHthnO3ECoe9d/view?usp=sharing



Examples


Phrase 1:   hello world

Phrase 2:  hi planet earth


Cosine Similarity:  0.0   (no words in common)



Phrase 1:  girls like flowers and trees

Phrase 2:  boys like trees and raccoons


Cosine Similarity:  0.5999999999998   (exact:  0.6)
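
Both examples can be checked with a function version of the calculation.  This is only a sketch and not the cossim2.py script itself: the function name cossim and the guard that returns 0.0 for an empty phrase are assumptions added for this illustration.

```python
from math import sqrt

def cossim(str1, str2):
  list1 = str1.split()
  list2 = str2.split()
  # unique words of both phrases, in order of first appearance
  lmain = []
  for w in list1 + list2:
    if w not in lmain:
      lmain.append(w)
  # count vectors
  v1 = [list1.count(w) for w in lmain]
  v2 = [list2.count(w) for w in lmain]
  # dot product and product of the norms
  d = sum([v1[i] * v2[i] for i in range(len(lmain))])
  n = sqrt(sum([x ** 2 for x in v1])) * sqrt(sum([x ** 2 for x in v2]))
  # assumption: return 0.0 for an empty phrase instead of dividing by zero
  if n == 0:
    return 0.0
  return d / n

print(cossim("hello world", "hi planet earth"))
print(cossim("girls like flowers and trees", "boys like trees and raccoons"))
```

In the second example each phrase has five distinct words and three of them are shared, so the exact value is 3 ÷ (√5 × √5) = 0.6.  The displayed result can fall slightly short of 0.6 because of floating point rounding in the square roots.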


When entering phrases, do not use punctuation.   Use all lowercase or all uppercase for the most accurate results, since the comparison is case-sensitive.   The matches are exact, so spelling counts!  
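
If retyping the phrases is not convenient, the cleanup could be done in code before splitting.  The helper below is a hypothetical addition, not part of the original script: the name clean and the set of punctuation characters in PUNCT are assumptions and are not exhaustive.

```python
# Assumption: an illustrative, not exhaustive, set of punctuation characters
PUNCT = ".,;:!?\"'()"

def clean(phrase):
  # make the comparison case-insensitive
  phrase = phrase.lower()
  # delete each punctuation character
  for ch in PUNCT:
    phrase = phrase.replace(ch, "")
  return phrase

print(clean("Here comes the sun!").split())
```

Note that deleting apostrophes turns a word such as "don't" into "dont", which still matches as long as both phrases are cleaned the same way.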



Sources


Infopedic Techie.  "Python program to find the unique values in a list || Python list [example-3]"  YouTube video posted on March 7, 2019.

https://www.youtube.com/watch?v=7f2UJgig2yI


StatQuest with Josh Starmer.  "Cosine Similarity, Clearly Explained!!!"  YouTube video posted on January 29, 2023.

https://www.youtube.com/watch?v=e9U0QAFbfLI


"Cosine Similarity"  Wikipedia.  Last edited July 6, 2023, accessed August 2023.

https://en.wikipedia.org/wiki/Cosine_similarity



Eddie



All original content copyright, © 2011-2023.  Edward Shore.   Unauthorized use and/or unauthorized distribution for commercial purposes without express and written permission from the author is strictly prohibited.  This blog entry may be distributed for noncommercial purposes, provided that full credit is given to the author.