
Python: how to compute probabilities from the probability density function in a naive Bayes classifier

2024-11-01 5kuangbin

I am implementing the Gaussian naive Bayes algorithm:

# importing modules 
import pandas as pd 
import numpy as np 
 
# create an empty dataframe 
data = pd.DataFrame() 
# create our target variable 
data["gender"] = ["male","male","male","male", 
                  "female","female","female","female"] 
# create our feature variables 
data["height"] = [6,5.92,5.58,5.92,5,5.5,5.42,5.75] 
data["weight"] = [180,190,170,165,100,150,130,150] 
data["foot_size"] = [12,11,12,10,6,8,7,9] 
# view the data 
print(data) 
 
# create an empty dataframe 
person = pd.DataFrame() 
# create some feature values for this single row 
person["height"] = [6] 
person["weight"] = [130] 
person["foot_size"] = [8] 
# view the data 
print(person) 
 
# Priors can be either constants or probability distributions. 
# In our example, the prior is simply the probability of each gender. 
# calculate the priors now 
# number of males 
n_male = data["gender"][data["gender"] == "male"].count() 
# number of females 
n_female = data["gender"][data["gender"] == "female"].count() 
# total people 
total_ppl = data["gender"].count() 
print ("Male count =",n_male,"and Female count =",n_female) 
print ("Total number of persons =",total_ppl) 
 
# number of males divided by the total rows 
p_male = n_male / total_ppl 
# number of females divided by the total rows 
p_female = n_female / total_ppl 
print ("Probability of MALE =",p_male,"and FEMALE =",p_female) 
 
# group the data by gender and calculate the means of each feature 
data_means = data.groupby("gender").mean() 
# view the values 
data_means 
 
# group the data by gender and calculate the variance of each feature 
data_variance = data.groupby("gender").var() 
# view the values 
data_variance 
 
# example: look up a single value (variance of foot_size for males) 
data_variance["foot_size"][data_variance.index == "male"].values[0] 
 
# means for male 
male_height_mean=data_means["height"][data_means.index=="male"].values[0] 
male_weight_mean=data_means["weight"][data_means.index=="male"].values[0] 
male_footsize_mean=data_means["foot_size"][data_means.index=="male"].values[0] 
print (male_height_mean,male_weight_mean,male_footsize_mean) 
 
# means for female 
female_height_mean=data_means["height"][data_means.index=="female"].values[0] 
female_weight_mean=data_means["weight"][data_means.index=="female"].values[0] 
female_footsize_mean=data_means["foot_size"][data_means.index=="female"].values[0] 
print (female_height_mean,female_weight_mean,female_footsize_mean) 
 
# variance for male 
male_height_var=data_variance["height"][data_variance.index=="male"].values[0] 
male_weight_var=data_variance["weight"][data_variance.index=="male"].values[0] 
male_footsize_var=data_variance["foot_size"][data_variance.index=="male"].values[0] 
print (male_height_var,male_weight_var,male_footsize_var) 
 
# variance for female 
female_height_var=data_variance["height"][data_variance.index=="female"].values[0] 
female_weight_var=data_variance["weight"][data_variance.index=="female"].values[0] 
female_footsize_var=data_variance["foot_size"][data_variance.index=="female"].values[0] 
print (female_height_var,female_weight_var,female_footsize_var) 
 
# create a function that calculates p(x | y): 
def p_x_given_y(x,mean_y,variance_y): 
    # input the arguments into a probability density function 
    p = 1 / (np.sqrt(2 * np.pi * variance_y)) * \ 
       np.exp((-(x - mean_y) ** 2) / (2 * variance_y)) 
    # return p 
    return p 
 
# numerator of the posterior if the unclassified observation is a male 
posterior_numerator_male = p_male * \ 
   p_x_given_y(person["height"][0],male_height_mean,male_height_var) * \ 
   p_x_given_y(person["weight"][0],male_weight_mean,male_weight_var) * \ 
   p_x_given_y(person["foot_size"][0],male_footsize_mean,male_footsize_var) 
 
# numerator of the posterior if the unclassified observation is a female 
posterior_numerator_female = p_female * \ 
   p_x_given_y(person["height"][0],female_height_mean,female_height_var) * \ 
   p_x_given_y(person["weight"][0],female_weight_mean,female_weight_var) * \ 
   p_x_given_y(person["foot_size"][0],female_footsize_mean,female_footsize_var)  
 
print ("Numerator of Posterior MALE =",posterior_numerator_male) 
print ("Numerator of Posterior FEMALE =",posterior_numerator_female) 
if (posterior_numerator_male >= posterior_numerator_female): 
    print ("Predicted gender is MALE") 
else: 
    print ("Predicted gender is FEMALE") 
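
For reference, the two numerators printed above can be turned into posterior probabilities by dividing each by their sum. The following is a minimal sketch using only the standard library, not the original program; the names `train`, `gaussian_pdf`, `posterior_numerator`, `evidence`, and `posteriors` are illustrative choices, and the data is copied from the listing above.

```python
import math
from statistics import mean, variance

# Training data copied from the question, grouped by class.
train = {
    "male": {"height": [6, 5.92, 5.58, 5.92],
             "weight": [180, 190, 170, 165],
             "foot_size": [12, 11, 12, 10]},
    "female": {"height": [5, 5.5, 5.42, 5.75],
               "weight": [100, 150, 130, 150],
               "foot_size": [6, 8, 7, 9]},
}
person = {"height": 6, "weight": 130, "foot_size": 8}
n_total = sum(len(feats["height"]) for feats in train.values())


def gaussian_pdf(x, mu, var):
    """Gaussian density at x (a density value, not a probability)."""
    return math.exp(-(x - mu) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)


def posterior_numerator(label):
    """Prior times the product of the per-feature densities for one class."""
    feats = train[label]
    result = len(feats["height"]) / n_total  # class prior
    for name, values in feats.items():
        # statistics.variance is the sample variance (n - 1), like pandas .var()
        result *= gaussian_pdf(person[name], mean(values), variance(values))
    return result


numerators = {label: posterior_numerator(label) for label in train}
evidence = sum(numerators.values())  # shared denominator of Bayes' rule
posteriors = {label: num / evidence for label, num in numerators.items()}

for label in train:
    print(label, "numerator =", numerators[label], "posterior =", posteriors[label])
```

Because both numerators share the same denominator, comparing them directly (as the program above does) picks the same class as comparing the normalized posteriors.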

When we compute the probability, we compute it with the Gaussian PDF:

$$ P(x) = \frac{1}{\sqrt{2\pi\sigma^2}} e^{\frac{-(x-\mu)^2}{2\sigma^2}} $$

My question is this: the equation above is the equation of a PDF. To compute a probability, we have to integrate it over a region dx.

$\int_{x_0}^{x_1} P(x)\,dx$

But in the program above, we plug in the value of x and compute the probability directly. Is that correct? Why? I have seen most articles compute the probability in the same way.

If this is the wrong way to compute probabilities in a naive Bayes classifier, what is the correct way?

Please refer to the following approach:

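The approach is correct. The pdf is a probability density, i.e., a function that measures the probability of being in a neighborhood of a value divided by the "size" of that neighborhood, where "size" is length in one dimension, area in two, volume in three, and so on.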
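
One way to see that a density value is not itself a probability is that it can exceed 1 while the density still integrates to 1. The sketch below illustrates this with the male heights from the question; `gaussian_pdf`, `peak_density`, and `total_probability` are illustrative names, and the midpoint Riemann sum is just one simple way to approximate the integral.

```python
import math
from statistics import mean, variance


def gaussian_pdf(x, mu, var):
    """Gaussian density at x for mean mu and variance var."""
    return math.exp(-(x - mu) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)


male_heights = [6, 5.92, 5.58, 5.92]  # from the question's data
mu, var = mean(male_heights), variance(male_heights)

# Density at the mean: larger than 1, so it cannot be a probability.
peak_density = gaussian_pdf(mu, mu, var)

# Midpoint Riemann sum of the density over mu +/- 8 standard deviations.
sd = math.sqrt(var)
n_steps = 10_000
lo, hi = mu - 8 * sd, mu + 8 * sd
dx = (hi - lo) / n_steps
total_probability = sum(
    gaussian_pdf(lo + (i + 0.5) * dx, mu, var) * dx for i in range(n_steps)
)

print("density at the mean:", peak_density)
print("integral of the density:", total_probability)
```

The small variance of the male heights is what pushes the peak density above 1; the area under the curve is unaffected.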

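In continuous probability, the probability of getting any given outcome exactly is 0, which is why densities are used instead. So we do not deal with expressions such as P(X=x), but with P(|X-x| < Δ(x)), which represents the probability of X being close to x.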
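
A small numerical check may make this concrete. The sketch below assumes a Gaussian with illustrative parameters (roughly the male height model from the question) and computes P(|X-x| < Δ) from the CDF; the helper names are hypothetical. Note that the interval |X-x| < Δ has length 2Δ, so that is the neighborhood size used here.

```python
import math


def gaussian_pdf(x, mu, var):
    """Gaussian density at x for mean mu and variance var."""
    return math.exp(-(x - mu) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)


def gaussian_cdf(x, mu, var):
    """P(X <= x) for X ~ Normal(mu, var), via the error function."""
    return 0.5 * (1 + math.erf((x - mu) / math.sqrt(2 * var)))


def interval_probability(x, delta, mu, var):
    """P(|X - x| < delta) for X ~ Normal(mu, var)."""
    return gaussian_cdf(x + delta, mu, var) - gaussian_cdf(x - delta, mu, var)


mu, var, x = 5.855, 0.035, 6.0  # illustrative values, not fitted here

density = gaussian_pdf(x, mu, var)
for delta in (0.1, 0.01, 0.001):
    prob = interval_probability(x, delta, mu, var)
    # prob shrinks toward 0, while prob / (2 * delta) approaches the density
    print(delta, prob, prob / (2 * delta), density)
```

As Δ shrinks, the probability itself goes to 0, but the probability divided by the interval length settles on the density value.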

Let me simplify the notation and write P(X~x) for P(|X-x| < Δ(x)).

If you apply Bayes' rule here, you get

P(X~x|W~w) = P(W~w|X~x)*P(X~x)/P(W~w) 

because we are dealing with probabilities. If we now introduce densities:

pdf(x|w)*Δ(x) = pdf(w|x)*Δ(w)*pdf(x)*Δ(x)/(pdf(w)*Δ(w)) 

because probability = density*neighborhood_size. Since every Δ(·) cancels in the expression above, we get

pdf(x|w) = pdf(w|x)*pdf(x)/pdf(w) 

This is Bayes' rule for densities.

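The conclusion is that, since Bayes' rule also holds for densities, it is legitimate to use the same approach with densities in place of probabilities when dealing with continuous random variables.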
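
The cancellation can also be checked numerically. The sketch below is an illustration under stated assumptions, not part of the original answer: it uses a single feature (height), rounded illustrative class parameters, equal priors, and hypothetical helper names. In the classifier the class label is discrete, so only the feature needs a neighborhood; the posterior is computed once from true interval probabilities and once from densities.

```python
import math


def gaussian_pdf(x, mu, var):
    """Gaussian density at x for mean mu and variance var."""
    return math.exp(-(x - mu) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)


def gaussian_cdf(x, mu, var):
    """P(X <= x) for X ~ Normal(mu, var), via the error function."""
    return 0.5 * (1 + math.erf((x - mu) / math.sqrt(2 * var)))


# Class-conditional height models: (mean, variance), rounded illustrative values.
params = {"male": (5.855, 0.0350), "female": (5.4175, 0.0972)}
prior = {"male": 0.5, "female": 0.5}
x = 6.0


def normalize(joint):
    """Divide each class score by the sum over classes."""
    total = sum(joint.values())
    return {label: value / total for label, value in joint.items()}


def posterior_from_densities(x):
    """Bayes' rule with densities plugged in directly."""
    return normalize({c: prior[c] * gaussian_pdf(x, *params[c]) for c in params})


def posterior_from_intervals(x, delta):
    """Bayes' rule with genuine probabilities P(|X - x| < delta | class)."""
    return normalize({
        c: prior[c] * (gaussian_cdf(x + delta, *params[c])
                       - gaussian_cdf(x - delta, *params[c]))
        for c in params
    })


by_density = posterior_from_densities(x)
for delta in (0.5, 0.05, 0.001):
    by_interval = posterior_from_intervals(x, delta)
    print(delta, by_interval["male"], by_density["male"])
```

For a small Δ the two posteriors agree closely, since the common factor 2Δ appears in every numerator and in the denominator and divides out.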