Big Data Learning - Feature Selection - Univariate Feature Selection


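Description

The method uses the sklearn library.

There are two main parts: removing low-variance features and keeping the K best univariate features.

Any dataset will do. In a single pass, the code prints every feature that survives the low-variance filter, plus the K retained univariate features with their scores, so one run is enough to quickly pick the features worth keeping.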
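As a minimal sketch of the first part, the snippet below applies `VarianceThreshold` to a small boolean matrix. The toy data is an assumption made for illustration and is not the wine dataset used later. A 0/1 feature has variance p * (1 - p), so a threshold of 0.8 * (1 - 0.8) = 0.16 drops columns in which one value makes up more than 80% of the samples.

```python
import numpy as np
from sklearn.feature_selection import VarianceThreshold

# Toy boolean data (assumed for illustration): 6 samples, 3 features.
X = np.array([
    [0, 0, 1],
    [0, 1, 0],
    [1, 0, 0],
    [0, 1, 1],
    [0, 1, 0],
    [0, 1, 1],
])

# A Bernoulli feature has variance p * (1 - p); 0.8 * (1 - 0.8) = 0.16.
threshold = 0.8 * (1 - 0.8)
selector = VarianceThreshold(threshold=threshold)
X_kept = selector.fit_transform(X)

# Column 0 is 0 in five of six rows, so its variance (about 0.139) is below 0.16.
print("Per-column variance:", selector.variances_)
print("Kept columns:", selector.get_support())
print("Shape after filtering:", X_kept.shape)
```

Here the first column is removed and the other two are kept, because only the first one is dominated by a single value.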

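Please note:

  • Do not use a regression scoring function on a classification problem; the results will be useless.
  • If your data is sparse (for example, it can be represented as a sparse matrix), chi2, mutual_info_regression and mutual_info_classif can process it while preserving its sparsity.
  • Scoring functions for classification:
    • chi2
    • f_classif
    • mutual_info_classif
  • Scoring functions for regression:
    • f_regression
    • mutual_info_regression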
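The sparsity note can be checked with a short sketch. The count matrix and labels below are made up for illustration, and `chi2` additionally requires non-negative feature values, which the toy data satisfies. The example feeds a SciPy CSR matrix to `SelectKBest(chi2, k=2)` and confirms that the result is still sparse.

```python
import numpy as np
from scipy import sparse
from sklearn.feature_selection import SelectKBest, chi2

# Toy non-negative count data (assumed for illustration): 6 samples, 4 features.
X_dense = np.array([
    [3, 0, 1, 0],
    [2, 0, 0, 1],
    [4, 1, 0, 0],
    [0, 0, 3, 1],
    [0, 1, 4, 0],
    [0, 0, 2, 1],
])
y = np.array([0, 0, 0, 1, 1, 1])  # class labels, so a classification score function is used

X_sparse = sparse.csr_matrix(X_dense)

# Keep the 2 features with the highest chi2 score.
selector = SelectKBest(chi2, k=2)
X_new = selector.fit_transform(X_sparse, y)

print("Still sparse:", sparse.issparse(X_new))
print("Shape:", X_new.shape)
print("Selected mask:", selector.get_support())
print("Scores:", selector.scores_)
```

Columns 0 and 2 are the ones whose counts differ most between the two classes, so they are the ones retained.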

Code

import numpy as np
import pandas as pd
from sklearn.feature_selection import (
    SelectKBest,
    VarianceThreshold,
    chi2,
    f_classif,
    f_regression,
    mutual_info_classif,
    mutual_info_regression,
)

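# Load the dataset: the training columns go into `feature`, the label into `target`
data = pd.read_csv("/home/joger/Downloads/winequality-red.csv", sep=",")

print("Data description:")
# Note: without parentheses this prints the bound method's repr (the whole frame);
# call data.describe() instead to get summary statistics.
print(data.describe)

Result = {}

# Set the data (feature) and the target
# Drop the columns that are not training inputs: the label `quality` and the auto-increment `ID`
feature = data.drop(['quality', 'ID'], axis=1)
# The target is the wine quality, kept as a column vector of shape (n_samples, 1).
# scikit-learn warns about this shape; data['quality'].values gives the 1-D form it expects.
target = data[['quality']].values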

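print("----------Remove low-variance features----------")
# This threshold assumes boolean features, so only the discrete 0/1 columns need to be judged.
# The high-variance features are printed; a 0/1 boolean column missing from the list is considered uninformative.
# A Bernoulli feature has variance p * (1 - p), so this threshold drops columns
# in which 0 or 1 makes up more than 80% of the values.
threshold = 0.8 * (1 - 0.8)
sel = VarianceThreshold(threshold=threshold)
sel.fit(feature)
TitleList = np.array(feature.columns.values.tolist())[sel.get_support()]
print("Non-low-variance features", TitleList)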

print("----------Classification----------")
# Drop every feature except the K highest-scoring ones
K = 11  # Number of features to keep (suggestion: keep all features on the first run, then filter by score)
print("Univariate - keep top", K, "features, method: chi2")
# chi2 requires non-negative feature values
SKB = SelectKBest(chi2, k=K)
X_new = SKB.fit_transform(feature, target)
TitleList = np.array(feature.columns.values.tolist())[SKB.get_support()]
print("Selected:", TitleList)
print("%30.30s\t%s\t%s" % ("Column", "Score", "p-value"))
for i in range(len(SKB.scores_)):
    print("%30.30s\t%.3f\t%.3f" % (feature.columns.values.tolist()[i], SKB.scores_[i], SKB.pvalues_[i]))
Result["class-chi2"] = TitleList


K = 11  # Number of features to keep (suggestion: keep all features on the first run, then filter by score)
print("Univariate - keep top", K, "features, method: f_classif")
SKB = SelectKBest(f_classif, k=K)
X_new = SKB.fit_transform(feature, target)
TitleList = np.array(feature.columns.values.tolist())[SKB.get_support()]
print("Selected:", TitleList)
print("%30.30s\t%s\t%s" % ("Column", "Score", "p-value"))
for i in range(len(SKB.scores_)):
    print("%30.30s\t%.3f\t%.3f" % (feature.columns.values.tolist()[i], SKB.scores_[i], SKB.pvalues_[i]))
Result["class-f_classif"] = TitleList

K = 11  # Number of features to keep (suggestion: keep all features on the first run, then filter by score)
print("Univariate - keep top", K, "features, method: mutual_info_classif")
SKB = SelectKBest(mutual_info_classif, k=K)
X_new = SKB.fit_transform(feature, target)
TitleList = np.array(feature.columns.values.tolist())[SKB.get_support()]
print("Selected:", TitleList)
# Mutual information returns scores only, so there is no p-value column
print("%30.30s\t%s" % ("Column", "Score"))
for i in range(len(SKB.scores_)):
    print("%30.30s\t%.3f" % (feature.columns.values.tolist()[i], SKB.scores_[i]))
Result["class-mutual_info_classif"] = TitleList

print("----------Regression----------")
K = 11  # Number of features to keep (suggestion: keep all features on the first run, then filter by score)
print("Univariate - keep top", K, "features, method: f_regression")
SKB = SelectKBest(f_regression, k=K)
X_new = SKB.fit_transform(feature, target)
TitleList = np.array(feature.columns.values.tolist())[SKB.get_support()]
print("Selected:", TitleList)
print("%30.30s\t%s\t%s" % ("Column", "Score", "p-value"))
for i in range(len(SKB.scores_)):
    print("%30.30s\t%.3f\t%.3f" % (feature.columns.values.tolist()[i], SKB.scores_[i], SKB.pvalues_[i]))
Result["regression-f_regression"] = TitleList

K = 11  # Number of features to keep (suggestion: keep all features on the first run, then filter by score)
print("Univariate - keep top", K, "features, method: mutual_info_regression")
SKB = SelectKBest(mutual_info_regression, k=K)
X_new = SKB.fit_transform(feature, target)
TitleList = np.array(feature.columns.values.tolist())[SKB.get_support()]
print("Selected:", TitleList)
# Mutual information returns scores only, so there is no p-value column
print("%30.30s\t%s" % ("Column", "Score"))
for i in range(len(SKB.scores_)):
    print("%30.30s\t%.3f" % (feature.columns.values.tolist()[i], SKB.scores_[i]))
Result["regression-mutual_info_regression"] = TitleList




print("Final result", Result)

Sample output

Data description:
<bound method NDFrame.describe of         ID  fixed acidity  volatile acidity  citric acid  ...    pH  sulphates  alcohol  quality
0        1            7.4             0.700         0.00  ...  3.51       0.56      9.4        5
1        2            7.8             0.880         0.00  ...  3.20       0.68      9.8        5
2        3            7.8             0.760         0.04  ...  3.26       0.65      9.8        5
3        4           11.2             0.280         0.56  ...  3.16       0.58      9.8        6
4        5            7.4             0.700         0.00  ...  3.51       0.56      9.4        5
...    ...            ...               ...          ...  ...   ...        ...      ...      ...
1594  1595            6.2             0.600         0.08  ...  3.45       0.58     10.5        5
1595  1596            5.9             0.550         0.10  ...  3.52       0.76     11.2        6
1596  1597            6.3             0.510         0.13  ...  3.42       0.75     11.0        6
1597  1598            5.9             0.645         0.12  ...  3.57       0.71     10.2        5
1598  1599            6.0             0.310         0.47  ...  3.39       0.66     11.0        6

[1599 rows x 13 columns]>
----------Remove low-variance features----------
Non-low-variance features ['fixed acidity' 'residual sugar' 'free sulfur dioxide'
 'total sulfur dioxide' 'alcohol']
----------Classification----------
Univariate - keep top 11 features, method: chi2
Selected: ['fixed acidity' 'volatile acidity' 'citric acid' 'residual sugar'
 'chlorides' 'free sulfur dioxide' 'total sulfur dioxide' 'density' 'pH'
 'sulphates' 'alcohol']
                        Column  Score     p-value
                 fixed acidity  11.261    0.046
              volatile acidity  15.580    0.008
                   citric acid  13.026    0.023
                residual sugar  4.123     0.532
                     chlorides  0.752     0.980
           free sulfur dioxide  161.936   0.000
          total sulfur dioxide  2755.558  0.000
                       density  0.000     1.000
                            pH  0.155     1.000
                     sulphates  4.558     0.472
                       alcohol  46.430    0.000
Univariate - keep top 11 features, method: f_classif
/usr/local/lib/python3.5/dist-packages/sklearn/utils/validation.py:724: DataConversionWarning: A column-vector y was passed when a 1d array was expected. Please change the shape of y to (n_samples, ), for example using ravel().
  y = column_or_1d(y, warn=True)
Selected: ['fixed acidity' 'volatile acidity' 'citric acid' 'residual sugar'
 'chlorides' 'free sulfur dioxide' 'total sulfur dioxide' 'density' 'pH'
 'sulphates' 'alcohol']
                        Column  Score     p-value
                 fixed acidity  6.283     0.000
              volatile acidity  60.914    0.000
                   citric acid  19.691    0.000
                residual sugar  1.053     0.385
                     chlorides  6.036     0.000
           free sulfur dioxide  4.754     0.000
          total sulfur dioxide  25.479    0.000
                       density  13.396    0.000
                            pH  4.342     0.001
                     sulphates  22.273    0.000
                       alcohol  115.855   0.000
Univariate - keep top 11 features, method: mutual_info_classif
/usr/local/lib/python3.5/dist-packages/sklearn/utils/validation.py:724: DataConversionWarning: A column-vector y was passed when a 1d array was expected. Please change the shape of y to (n_samples, ), for example using ravel().
  y = column_or_1d(y, warn=True)
Selected: ['fixed acidity' 'volatile acidity' 'citric acid' 'residual sugar'
 'chlorides' 'free sulfur dioxide' 'total sulfur dioxide' 'density' 'pH'
 'sulphates' 'alcohol']
                        Column  Score
                 fixed acidity  0.058
              volatile acidity  0.136
                   citric acid  0.064
                residual sugar  0.020
                     chlorides  0.026
           free sulfur dioxide  0.033
          total sulfur dioxide  0.070
                       density  0.089
                            pH  0.018
                     sulphates  0.099
                       alcohol  0.175
----------Regression----------
Univariate - keep top 11 features, method: f_regression
/usr/local/lib/python3.5/dist-packages/sklearn/utils/validation.py:724: DataConversionWarning: A column-vector y was passed when a 1d array was expected. Please change the shape of y to (n_samples, ), for example using ravel().
  y = column_or_1d(y, warn=True)
Selected: ['fixed acidity' 'volatile acidity' 'citric acid' 'residual sugar'
 'chlorides' 'free sulfur dioxide' 'total sulfur dioxide' 'density' 'pH'
 'sulphates' 'alcohol']
                        Column  Score     p-value
                 fixed acidity  24.960    0.000
              volatile acidity  287.444   0.000
                   citric acid  86.258    0.000
                residual sugar  0.301     0.583
                     chlorides  26.986    0.000
           free sulfur dioxide  4.109     0.043
          total sulfur dioxide  56.658    0.000
                       density  50.405    0.000
                            pH  5.340     0.021
                     sulphates  107.740   0.000
                       alcohol  468.267   0.000
Univariate - keep top 11 features, method: mutual_info_regression
/usr/local/lib/python3.5/dist-packages/sklearn/utils/validation.py:724: DataConversionWarning: A column-vector y was passed when a 1d array was expected. Please change the shape of y to (n_samples, ), for example using ravel().
  y = column_or_1d(y, warn=True)
Selected: ['fixed acidity' 'volatile acidity' 'citric acid' 'residual sugar'
 'chlorides' 'free sulfur dioxide' 'total sulfur dioxide' 'density' 'pH'
 'sulphates' 'alcohol']
                        Column  Score
                 fixed acidity  0.041
              volatile acidity  0.101
                   citric acid  0.031
                residual sugar  0.006
                     chlorides  0.026
           free sulfur dioxide  0.013
          total sulfur dioxide  0.086
                       density  0.085
                            pH  0.032
                     sulphates  0.113
                       alcohol  0.185
Final result {'class-f_classif': array(['fixed acidity', 'volatile acidity', 'citric acid',
       'residual sugar', 'chlorides', 'free sulfur dioxide',
       'total sulfur dioxide', 'density', 'pH', 'sulphates', 'alcohol'],
      dtype='<U20'), 'class-chi2': array(['fixed acidity', 'volatile acidity', 'citric acid',
       'residual sugar', 'chlorides', 'free sulfur dioxide',
       'total sulfur dioxide', 'density', 'pH', 'sulphates', 'alcohol'],
      dtype='<U20'), 'regression-f_regression': array(['fixed acidity', 'volatile acidity', 'citric acid',
       'residual sugar', 'chlorides', 'free sulfur dioxide',
       'total sulfur dioxide', 'density', 'pH', 'sulphates', 'alcohol'],
      dtype='<U20'), 'class-mutual_info_classif': array(['fixed acidity', 'volatile acidity', 'citric acid',
       'residual sugar', 'chlorides', 'free sulfur dioxide',
       'total sulfur dioxide', 'density', 'pH', 'sulphates', 'alcohol'],
      dtype='<U20'), 'regression-mutual_info_regression': array(['fixed acidity', 'volatile acidity', 'citric acid',
       'residual sugar', 'chlorides', 'free sulfur dioxide',
       'total sulfur dioxide', 'density', 'pH', 'sulphates', 'alcohol'],
      dtype='<U20')}
Terminated
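The five selection steps in the script differ only in the scoring function and the result key, so they could be driven by one loop. The following is a rough sketch under stated assumptions rather than the author's code: the data is synthetic, and the names `select_top_k`, `score_funcs` and `f0` to `f3` are hypothetical. It also passes a class label to the classification scorers and a continuous value to the regression scorers, and fixes `random_state` for the mutual-information estimators so that repeated runs agree.

```python
from functools import partial

import numpy as np
from sklearn.feature_selection import (
    SelectKBest, chi2, f_classif, f_regression,
    mutual_info_classif, mutual_info_regression,
)

# Synthetic, non-negative data (assumed for illustration only).
rng = np.random.RandomState(0)
X = rng.rand(200, 4)
names = np.array(["f0", "f1", "f2", "f3"])
y_class = (X[:, 0] > 0.5).astype(int)          # class label driven by f0
y_reg = 2.0 * X[:, 1] + 0.1 * rng.randn(200)   # continuous target driven by f1

# Result key -> (score function, matching target).
score_funcs = {
    "class-chi2": (chi2, y_class),
    "class-f_classif": (f_classif, y_class),
    "class-mutual_info_classif": (partial(mutual_info_classif, random_state=0), y_class),
    "regression-f_regression": (f_regression, y_reg),
    "regression-mutual_info_regression": (partial(mutual_info_regression, random_state=0), y_reg),
}


def select_top_k(score_func, X, y, names, k):
    """Fit SelectKBest and return (selected names, scores, p-values or None)."""
    selector = SelectKBest(score_func, k=k).fit(X, y)
    return names[selector.get_support()], selector.scores_, selector.pvalues_


result = {}
for key, (func, y) in score_funcs.items():
    selected, scores, pvalues = select_top_k(func, X, y, names, k=2)
    result[key] = selected
    print(key, "selected:", selected)
    for i, name in enumerate(names):
        # Mutual-information scorers return no p-values.
        p_text = "n/a" if pvalues is None else "%.3f" % pvalues[i]
        print("%10s\t%.3f\t%s" % (name, scores[i], p_text))

print("Final result", result)
```

Keeping the target next to its scoring function in `score_funcs` makes it harder to accidentally score a classification target with a regression function.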
Copyright 王政乔 | China. Contact: me@zhengqiao.wang