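My question is a little different from plain word similarity. Is there an algorithm that can compute the similarity between an email address and a person's name?
For example: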
mail Abd_tml_1132@gmail.com
Name Abdullah temel
Levenshtein, Hamming distance 11
jaro distance 0.52
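But this email address most likely does belong to this name.
Please refer to the following approach:
There is no package that does this directly, but the following steps should solve your problem.
First, split the email ID into a list of words: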
a = 'Abd_tml_1132@gmail.com'
rest = a.split('@', 1)[0]  # keep only the part before the @
result = ''.join(ch for ch in rest if not ch.isdigit())  # drop digits, since names do not contain them
list_of_email_words = result.split('_')  # split into words; change the separator (e.g. '.') to suit the email ID
list_of_email_words = list(filter(None, list_of_email_words))  # remove any empty strings
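If the separator is not known in advance, the splitting step above could be generalized with a regular expression. This is only a sketch: the function name email_name_tokens is made up for illustration, and it assumes that every digit, underscore, and punctuation character in the local part is a separator rather than part of a name.

```python
import re

def email_name_tokens(email):
    """Return the letter-only chunks of the part of `email` before the '@'."""
    local_part = email.split('@', 1)[0]
    # [\W\d_]+ matches runs of non-word characters, digits and underscores,
    # so '_', '.', '-', '+' and numbers all act as separators.
    return [tok for tok in re.split(r'[\W\d_]+', local_part) if tok]

print(email_name_tokens('Abd_tml_1132@gmail.com'))  # ['Abd', 'tml']
```

With this version, an address such as john.smith99@example.com would be split on the dot and lose its digits without changing the code.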
Then split the name into a list:
b = 'Abdullah temel'
list_of_name_words = b.split(' ')
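Apply fuzzy matching to the two lists:
from fuzzywuzzy import fuzz  # assumed source of fuzz; the answer does not show its import
score = []
for email_word in list_of_email_words:
    for name_word in list_of_name_words:
        # compare every email word with every name word
        d = fuzz.partial_ratio(email_word, name_word)
        score.append(d)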
Now you only need to check whether any element of score is greater than a threshold that you define. For example:
threshold = 70
if any(x > threshold for x in score):
    print("matched")
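If installing a fuzzy-matching package is not an option, the scoring and threshold steps can be approximated with the standard library alone. The sketch below is a stand-in, not the library's algorithm: partial_similarity and any_word_matches are hypothetical helpers that slide the shorter word across the longer one and keep the best difflib.SequenceMatcher ratio, scaled to 0-100. Lowercasing both words is an added assumption, so the scores will not necessarily equal those of fuzz.partial_ratio.

```python
from difflib import SequenceMatcher

def partial_similarity(word_a, word_b):
    """Best 0-100 similarity of the shorter word against any same-length window of the longer one."""
    short, long_ = sorted((word_a.lower(), word_b.lower()), key=len)
    if not short:
        return 0
    best = 0.0
    for start in range(len(long_) - len(short) + 1):
        window = long_[start:start + len(short)]
        best = max(best, SequenceMatcher(None, short, window).ratio())
    return round(100 * best)

def any_word_matches(email_words, name_words, threshold=70):
    """True if any (email word, name word) pair scores above `threshold`."""
    return any(
        partial_similarity(e, n) > threshold
        for e in email_words
        for n in name_words
    )

if any_word_matches(['Abd', 'tml'], ['Abdullah', 'temel']):
    print("matched")
```

Here "Abd" is an exact prefix of "Abdullah", so that pair alone clears the threshold. Very short email words, such as a single initial, will match almost any name containing that letter, so it may be worth ignoring words below some minimum length.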