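[Machine Learning in Action] Reading Notes - (2)

Last edited by PeersLee on 2016-4-23 10:11
Guiding questions:

1. How are feature values normalized?
2. How is a complete program written to test the classifier?
3. How is the overall framework of the system (improving matches on a dating site) put together?
4. How are IPython, numpy, pandas, and matplotlib installed and tested on Ubuntu 15.10?
5. Which tasks are Python's efficient data-handling tools typically used for in machine learning?

Solutions: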

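How to normalize feature values:

(1) Why normalize feature values:
The distance between two samples is usually computed as the Euclidean distance:

d = sqrt((xA0 - xB0)^2 + (xA1 - xB1)^2 + ... + (xAn - xBn)^2)

In this equation, the attribute with the largest numeric differences dominates the result, even though it should not carry more weight than the other attributes.

So when feature values span different ranges, the data is usually normalized, for example to 0 ~ 1 or -1 ~ 1.
(2) How to normalize feature values:
The following formula converts feature values of any range into values in the 0 ~ 1 interval:
[mw_shl_code=applescript,true]newValue = (oldValue-min)/ (max-min)[/mw_shl_code]
Here min and max are the smallest and largest values of the feature in the data set. Although changing the value range makes the classifier more complex, it is necessary to get accurate results.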
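As a rough illustration of the formula above, here is a plain-Python sketch (no NumPy) that scales each column of a small table into the 0 ~ 1 range. The function name min_max_normalize and the sample values are made up for this example; they do not come from the book or from kNN.py.

```python
def min_max_normalize(rows):
    """Scale every column of a list of equal-length rows into [0, 1]."""
    columns = list(zip(*rows))
    mins = [min(col) for col in columns]
    ranges = [max(col) - min(col) for col in columns]
    normalized = [
        # a constant column has range 0; map it to 0.0 instead of dividing by zero
        [(value - lo) / span if span else 0.0
         for value, lo, span in zip(row, mins, ranges)]
        for row in rows
    ]
    return normalized, ranges, mins


# hypothetical samples: [flier miles, % time gaming, liters of ice cream]
samples = [
    [40000.0, 8.0, 1.0],
    [14000.0, 7.0, 0.5],
    [26000.0, 1.0, 1.5],
]
normalized, ranges, mins = min_max_normalize(samples)
for row in normalized:
    print(row)
```

After scaling, the flier-miles column no longer outweighs the other two when distances are computed, because all three columns now vary over the same interval.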

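Add the following code to our kNN.py:
[mw_shl_code=python,true]def auto_norm(data_set):
    # assumes kNN.py already contains `from numpy import *`
    min_vals = data_set.min(0)   # column-wise minimum
    max_vals = data_set.max(0)   # column-wise maximum
    ranges = max_vals - min_vals
    norm_data_set = zeros(shape(data_set))
    rows = data_set.shape[0]
    norm_data_set = data_set - tile(min_vals,(rows,1))
    # element-wise division by the range (the original subtracted here by mistake)
    norm_data_set = norm_data_set / tile(ranges,(rows,1))
    return norm_data_set, ranges, min_vals[/mw_shl_code]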

Test it in the terminal:
[mw_shl_code=applescript,true]>>>reload(kNN)
>>>normMat,ranges,minVals = kNN.auto_norm(datingDataMat)
>>>normMat
>>>ranges
>>>minVals
[/mw_shl_code]

Note: for datingDataMat, see [Machine Learning in Action] Reading Notes - (1).

Testing the classifier as a complete program:

Code (add it to kNN.py):[mw_shl_code=python,true]

def dating_class_test():
    ho_ratio = 0.10  # fraction of the data held out for testing
    dating_data_mat, dating_labels = file_2_matrix('datingTestSet.txt')
    norm_mat, ranges, min_vals = auto_norm(dating_data_mat)
    m = norm_mat.shape[0]
    num_test_vecs = int(ho_ratio*m)
    error_count = 0.0
    for i in range(num_test_vecs):
        # the first num_test_vecs rows are test vectors, the rest are training samples
        classifier_result = classify_0(norm_mat[i,:],norm_mat[num_test_vecs:m,:],dating_labels[num_test_vecs:m],3)
        print "the classifier came back, with:%d, the real answer is: %d" % (classifier_result, dating_labels[i])
        if (classifier_result != dating_labels[i]):
            error_count += 1.0
    print "the total error rate is: %f" % (error_count/float(num_test_vecs))
    print error_count

[/mw_shl_code]

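Code analysis:

(1) The program first uses the file_2_matrix and auto_norm functions to read the data from the file and normalize the feature values.
(2) It then computes the number of test vectors; this step decides which rows of norm_mat are used for testing and which are used as training samples.
(3) Both parts are passed to the kNN classifier function classify_0, and the program then computes the error rate and prints the result.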
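The three steps above can be sketched without NumPy as follows. This is a minimal illustration written for Python 3, not the book's classify_0: the helper names classify_knn and hold_out_error and the toy data are assumptions made for this example.

```python
from collections import Counter


def classify_knn(query, train_rows, train_labels, k):
    """Majority vote among the k training rows closest to query (Euclidean distance)."""
    distances = sorted(
        (sum((a - b) ** 2 for a, b in zip(query, row)) ** 0.5, label)
        for row, label in zip(train_rows, train_labels)
    )
    votes = Counter(label for _, label in distances[:k])
    return votes.most_common(1)[0][0]


def hold_out_error(rows, labels, ho_ratio, k):
    """Use the first ho_ratio share of rows as test vectors, the rest as training samples."""
    num_test = int(ho_ratio * len(rows))
    errors = 0
    for i in range(num_test):
        predicted = classify_knn(rows[i], rows[num_test:], labels[num_test:], k)
        if predicted != labels[i]:
            errors += 1
    return errors / float(num_test)


# toy, already-normalized data: two well-separated clusters
rows = [[0.1, 0.1], [0.9, 0.9], [0.0, 0.2], [0.2, 0.0],
        [0.1, 0.2], [1.0, 0.8], [0.8, 1.0], [0.9, 0.8]]
labels = [1, 2, 1, 1, 1, 2, 2, 2]
error_rate = hold_out_error(rows, labels, ho_ratio=0.25, k=3)
print("the total error rate is: %f" % error_rate)
```

Because the test vectors are simply the first rows, the split is only fair when the data file is not sorted by class.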

Building a complete, usable system:

Code:
(add it to kNN.py)
[mw_shl_code=python,true]def classify_person():
    result_list = ['not at all','a little','yes']
    percent_tats = float(raw_input("percentage of time spent playing video games?"))
    f_f_miles = float(raw_input("frequent flier miles earned per year?"))
    ice_cream = float(raw_input("liters of ice cream consumed per year?"))
    dating_data_mat,dating_labels = file_2_matrix('datingTestSet2.txt')
    norm_mat, ranges, min_vals = auto_norm(dating_data_mat)
    in_arr = array([f_f_miles,percent_tats,ice_cream])
    # scale the new sample with the training data's min_vals and ranges
    classifier_result = classify_0((in_arr-min_vals)/ranges,norm_mat,dating_labels,3)
    print "You will probably like this person: ",result_list[classifier_result - 1][/mw_shl_code]
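One detail in classify_person is easy to miss: the new sample is scaled with the minimum values and ranges taken from the training data, not with statistics of its own. A small plain-Python sketch of that step follows; normalize_query and the numbers are hypothetical and only serve as an illustration.

```python
def normalize_query(query, mins, ranges):
    """Scale one new sample with the training set's column minimums and ranges."""
    return [(value - lo) / span for value, lo, span in zip(query, mins, ranges)]


# hypothetical training statistics: [flier miles, % time gaming, liters of ice cream]
train_mins = [0.0, 0.0, 0.0]
train_ranges = [90000.0, 20.0, 1.5]

scaled = normalize_query([45000.0, 10.0, 0.75], train_mins, train_ranges)
print(scaled)
```

A value outside the training range simply lands outside 0 ~ 1; the classifier still works, since only distances matter.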

Installing IPython, numpy, pandas, and matplotlib on Ubuntu:

Installing IPython:[mw_shl_code=bash,true]peerslee@peerslee-ubuntu:~$ sudo apt-get install ipython
正在读取软件包列表... 完成
正在分析软件包的依赖关系树
正在读取状态信息... 完成
将会安装下列额外的软件包：
  python-pexpect python-simplegeneric
建议安装的软件包：
  ipython-doc ipython-notebook ipython-qtconsole python-zmq python-pexpect-doc
下列【新】软件包将被安装：
  ipython python-pexpect python-simplegeneric
升级了 0 个软件包，新安装了 3 个软件包，要卸载 0 个软件包，有 3 个软件包未被升级。
需要下载 656 kB 的软件包。
解压缩后会消耗掉 3,611 kB 的额外空间。
您希望继续执行吗？ [Y/n] y
获取：1 http://mirrors.hust.edu.cn/ubuntu/ wily/main python-pexpect all 3.2-1 [38.0 kB]
获取：2 http://mirrors.hust.edu.cn/ubuntu/ wily/main python-simplegeneric all 0.8.1-1 [11.5 kB]
获取：3 http://mirrors.hust.edu.cn/ubuntu/ wily/universe ipython all 2.3.0-2ubuntu1 [607 kB]
下载 656 kB，耗时 42秒 (15.5 kB/s)
正在选中未选择的软件包 python-pexpect。
(正在读取数据库 ... 系统当前共安装有 219261 个文件和目录。)
正准备解包 .../python-pexpect_3.2-1_all.deb  ...
正在解包 python-pexpect (3.2-1) ...
正在选中未选择的软件包 python-simplegeneric。
正准备解包 .../python-simplegeneric_0.8.1-1_all.deb  ...
正在解包 python-simplegeneric (0.8.1-1) ...
正在选中未选择的软件包 ipython。
正准备解包 .../ipython_2.3.0-2ubuntu1_all.deb  ...
正在解包 ipython (2.3.0-2ubuntu1) ...
正在处理用于 hicolor-icon-theme (0.15-0ubuntu1) 的触发器 ...
正在处理用于 man-db (2.7.4-1) 的触发器 ...
正在处理用于 gnome-menus (3.13.3-6ubuntu1) 的触发器 ...
正在处理用于 desktop-file-utils (0.22-1ubuntu3) 的触发器 ...
正在处理用于 bamfdaemon (0.5.2~bzr0+15.10.20150627.1-0ubuntu1) 的触发器 ...
Rebuilding /usr/share/applications/bamf-2.index...
正在处理用于 mime-support (3.58ubuntu1) 的触发器 ...
正在设置 python-pexpect (3.2-1) ...
正在设置 python-simplegeneric (0.8.1-1) ...
正在设置 ipython (2.3.0-2ubuntu1) ...
peerslee@peerslee-ubuntu:~$ Ipython
未找到 'Ipython' 命令，您要输入的是否是：
命令 'python' 来自于包 'python-minimal' (main)
命令 'python' 来自于包 'python3' (main)
命令 'bpython' 来自于包 'bpython' (universe)
命令 'ipython' 来自于包 'ipython' (universe)
Ipython：未找到命令
peerslee@peerslee-ubuntu:~$ ipython
Python 2.7.10 (default, Oct 14 2015, 16:09:02)
Type "copyright", "credits" or "license" for more information.

IPython 2.3.0 -- An enhanced Interactive Python.
?       -> Introduction and overview of IPython's features.
%quickref -> Quick reference.
help    -> Python's own help system.
object? -> Details about 'object', use 'object??' for extra details.
[/mw_shl_code]

Installing numpy:

[mw_shl_code=bash,true]sudo apt-get install python-numpy

sudo apt-get install python-scipy[/mw_shl_code]

Test:[mw_shl_code=bash,true]peerslee@peerslee-ubuntu:~$ python
Python 2.7.10 (default, Oct 14 2015, 16:09:02)
[GCC 5.2.1 20151010] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> from numpy import *
>>> random.rand(4,4)
array([[ 0.08515879,  0.42498937,  0.11710541,  0.83416534],
   [ 0.22091503,  0.21738498,  0.96636182,  0.26673299],
   [ 0.40037502,  0.32046245,  0.70898183,  0.01849357],
   [ 0.75332768,  0.24119343,  0.08228831,  0.11922242]])
>>> randMat = mat(random.rand(4,4))
>>> randMat.I
matrix([[-0.27240966, -1.10995439,  0.82532499,  1.68272 ],
      [ 0.3253672 ,  1.40474812,  0.40780151, -2.32758753],
      [-0.61296157,  1.29445785, -0.17902085,  0.0515141 ],
      [ 1.40961488, -0.93162909, -1.00366411,  0.55006776]])
>>>
[/mw_shl_code]

Installing pandas:[mw_shl_code=bash,true]peerslee@peerslee-ubuntu:~$ sudo apt-get install python-pandas
正在读取软件包列表... 完成
正在分析软件包的依赖关系树
正在读取状态信息... 完成
将会安装下列额外的软件包：
  libamd2.3.1 libdsdp-5.8gf libfftw3-double3 libglpk36 libgsl0ldbl libhdf5-10
  liblz4-1 libsnappy1v5 python-antlr python-cvxopt python-jdcal python-joblib
  python-numexpr python-openpyxl python-pandas-lib python-patsy python-py
  python-pytest python-simplejson python-statsmodels python-statsmodels-lib
  python-tables python-tables-data python-tables-lib python-xlrd python-xlwt
建议安装的软件包：
  libfftw3-bin libfftw3-dev libiodbc2-dev libmysqlclient-dev gsl-ref-psdoc
  gsl-doc-pdf gsl-doc-info gsl-ref-html python-pandas-doc python-patsy-doc
  subversion python-pytest-xdist python-tables-doc python-netcdf vitables
下列【新】软件包将被安装：
  libamd2.3.1 libdsdp-5.8gf libfftw3-double3 libglpk36 libgsl0ldbl libhdf5-10
  liblz4-1 libsnappy1v5 python-antlr python-cvxopt python-jdcal python-joblib
  python-numexpr python-openpyxl python-pandas python-pandas-lib python-patsy
  python-py python-pytest python-simplejson python-statsmodels
  python-statsmodels-lib python-tables python-tables-data python-tables-lib
  python-xlrd python-xlwt
升级了 0 个软件包，新安装了 27 个软件包，要卸载 0 个软件包，有 3 个软件包未被升级。
需要下载 12.5 MB 的软件包。
解压缩后会消耗掉 59.4 MB 的额外空间。
您希望继续执行吗？ [Y/n] y
获取：1 http://mirrors.hust.edu.cn/ubuntu/ wily/main libamd2.3.1 amd64 1:4.2.1-3ubuntu1 [23.5 kB]
获取：2 http://mirrors.hust.edu.cn/ubuntu/ wily/main libfftw3-double3 amd64 3.3.4-2ubuntu1 [718 kB]
获取：3 http://mirrors.hust.edu.cn/ubuntu/ wily/universe libglpk36 amd64 4.55-1 [391 kB]
获取：4 http://mirrors.hust.edu.cn/ubuntu/ wily/main libgsl0ldbl amd64 1.16+dfsg-4 [806 kB]
获取：5 http://mirrors.hust.edu.cn/ubuntu/ wily/universe libhdf5-10 amd64 1.8.15-patch1+docs-4 [1,034 kB]
获取：6 http://mirrors.hust.edu.cn/ubuntu/ wily/universe python-antlr all 2.7.7+dfsg-6build1 [19.0 kB]
获取：7 http://mirrors.hust.edu.cn/ubuntu/ wily/universe libdsdp-5.8gf amd64 5.8-9.1 [223 kB]
获取：8 http://mirrors.hust.edu.cn/ubuntu/ wily/universe python-cvxopt amd64 1.1.4-1.3 [1,347 kB]
获取：9 http://mirrors.hust.edu.cn/ubuntu/ wily/universe python-jdcal all 1.0-1build1 [7,702 B]
获取：10 http://mirrors.hust.edu.cn/ubuntu/ wily/universe python-joblib all 0.8.3-1 [59.4 kB]
获取：11 http://mirrors.hust.edu.cn/ubuntu/ wily/universe python-numexpr amd64 2.4.3-1 [129 kB]
获取：12 http://mirrors.hust.edu.cn/ubuntu/ wily/universe python-openpyxl all 2.3.0~b1-1ubuntu1 [185 kB]
获取：13 http://mirrors.hust.edu.cn/ubuntu/ wily/universe python-pandas-lib amd64 0.15.0-2 [1,437 kB]
获取：14 http://mirrors.hust.edu.cn/ubuntu/ wily/universe python-pandas all 0.15.0-2 [1,358 kB]
获取：15 http://mirrors.hust.edu.cn/ubuntu/ wily/universe python-patsy all 0.4.0-1 [170 kB]
获取：16 http://mirrors.hust.edu.cn/ubuntu/ wily/main python-py all 1.4.30-2 [62.5 kB]
获取：17 http://mirrors.hust.edu.cn/ubuntu/ wily/main python-pytest all 2.7.2-2 [102 kB]
获取：18 http://mirrors.hust.edu.cn/ubuntu/ wily/main python-simplejson amd64 3.7.3-1ubuntu1 [60.0 kB]
获取：19 http://mirrors.hust.edu.cn/ubuntu/ wily/universe liblz4-1 amd64 0.0~r131-1 [31.8 kB]
获取：20 http://mirrors.hust.edu.cn/ubuntu/ wily/main libsnappy1v5 amd64 1.1.3-2 [16.0 kB]
获取：21 http://mirrors.hust.edu.cn/ubuntu/ wily/universe python-tables-lib amd64 3.2.2-1 [354 kB]
获取：22 http://mirrors.hust.edu.cn/ubuntu/ wily/universe python-tables-data all 3.2.2-1 [45.4 kB]
获取：23 http://mirrors.hust.edu.cn/ubuntu/ wily/universe python-tables all 3.2.2-1 [336 kB]
获取：24 http://mirrors.hust.edu.cn/ubuntu/ wily/universe python-xlrd all 0.9.4-1 [107 kB]
获取：25 http://mirrors.hust.edu.cn/ubuntu/ wily/universe python-xlwt all 0.7.5+debian1-1 [83.5 kB]
获取：26 http://mirrors.hust.edu.cn/ubuntu/ wily/universe python-statsmodels-lib amd64 0.5.0+git13-g8e07d34-1ubuntu2 [63.7 kB]
获取：27 http://mirrors.hust.edu.cn/ubuntu/ wily/universe python-statsmodels all 0.5.0+git13-g8e07d34-1ubuntu2 [3,362 kB]
下载 12.5 MB，耗时 8分 40秒 (24.1 kB/s)
正在选中未选择的软件包 libamd2.3.1:amd64。
(正在读取数据库 ... 系统当前共安装有 219743 个文件和目录。)
正准备解包 .../libamd2.3.1_1%3a4.2.1-3ubuntu1_amd64.deb  ...
正在解包 libamd2.3.1:amd64 (1:4.2.1-3ubuntu1) ...
正在选中未选择的软件包 libfftw3-double3:amd64。
正准备解包 .../libfftw3-double3_3.3.4-2ubuntu1_amd64.deb  ...
正在解包 libfftw3-double3:amd64 (3.3.4-2ubuntu1) ...
正在选中未选择的软件包 libglpk36:amd64。
正准备解包 .../libglpk36_4.55-1_amd64.deb  ...
正在解包 libglpk36:amd64 (4.55-1) ...
正在选中未选择的软件包 libgsl0ldbl:amd64。
正准备解包 .../libgsl0ldbl_1.16+dfsg-4_amd64.deb  ...
正在解包 libgsl0ldbl:amd64 (1.16+dfsg-4) ...
正在选中未选择的软件包 libhdf5-10:amd64。
正准备解包 .../libhdf5-10_1.8.15-patch1+docs-4_amd64.deb  ...
正在解包 libhdf5-10:amd64 (1.8.15-patch1+docs-4) ...
正在选中未选择的软件包 python-antlr。
正准备解包 .../python-antlr_2.7.7+dfsg-6build1_all.deb  ...
正在解包 python-antlr (2.7.7+dfsg-6build1) ...
正在选中未选择的软件包 libdsdp-5.8gf。
正准备解包 .../libdsdp-5.8gf_5.8-9.1_amd64.deb  ...
正在解包 libdsdp-5.8gf (5.8-9.1) ...
正在选中未选择的软件包 python-cvxopt。
正准备解包 .../python-cvxopt_1.1.4-1.3_amd64.deb  ...
正在解包 python-cvxopt (1.1.4-1.3) ...
正在选中未选择的软件包 python-jdcal。
正准备解包 .../python-jdcal_1.0-1build1_all.deb  ...
正在解包 python-jdcal (1.0-1build1) ...
正在选中未选择的软件包 python-joblib。
正准备解包 .../python-joblib_0.8.3-1_all.deb  ...
正在解包 python-joblib (0.8.3-1) ...
正在选中未选择的软件包 python-numexpr。
正准备解包 .../python-numexpr_2.4.3-1_amd64.deb  ...
正在解包 python-numexpr (2.4.3-1) ...
正在选中未选择的软件包 python-openpyxl。
正准备解包 .../python-openpyxl_2.3.0~b1-1ubuntu1_all.deb  ...
正在解包 python-openpyxl (2.3.0~b1-1ubuntu1) ...
正在选中未选择的软件包 python-pandas-lib。
正准备解包 .../python-pandas-lib_0.15.0-2_amd64.deb  ...
正在解包 python-pandas-lib (0.15.0-2) ...
正在选中未选择的软件包 python-pandas。
正准备解包 .../python-pandas_0.15.0-2_all.deb  ...
正在解包 python-pandas (0.15.0-2) ...
正在选中未选择的软件包 python-patsy。
正准备解包 .../python-patsy_0.4.0-1_all.deb  ...
正在解包 python-patsy (0.4.0-1) ...
正在选中未选择的软件包 python-py。
正准备解包 .../python-py_1.4.30-2_all.deb  ...
正在解包 python-py (1.4.30-2) ...
正在选中未选择的软件包 python-pytest。
正准备解包 .../python-pytest_2.7.2-2_all.deb  ...
正在解包 python-pytest (2.7.2-2) ...
正在选中未选择的软件包 python-simplejson。
正准备解包 .../python-simplejson_3.7.3-1ubuntu1_amd64.deb  ...
正在解包 python-simplejson (3.7.3-1ubuntu1) ...
正在选中未选择的软件包 liblz4-1:amd64。
正准备解包 .../liblz4-1_0.0~r131-1_amd64.deb  ...
正在解包 liblz4-1:amd64 (0.0~r131-1) ...
正在选中未选择的软件包 libsnappy1v5:amd64。
正准备解包 .../libsnappy1v5_1.1.3-2_amd64.deb  ...
正在解包 libsnappy1v5:amd64 (1.1.3-2) ...
正在选中未选择的软件包 python-tables-lib。
正准备解包 .../python-tables-lib_3.2.2-1_amd64.deb  ...
正在解包 python-tables-lib (3.2.2-1) ...
正在选中未选择的软件包 python-tables-data。
正准备解包 .../python-tables-data_3.2.2-1_all.deb  ...
正在解包 python-tables-data (3.2.2-1) ...
正在选中未选择的软件包 python-tables。
正准备解包 .../python-tables_3.2.2-1_all.deb  ...
正在解包 python-tables (3.2.2-1) ...
正在选中未选择的软件包 python-xlrd。
正准备解包 .../python-xlrd_0.9.4-1_all.deb  ...
正在解包 python-xlrd (0.9.4-1) ...
正在选中未选择的软件包 python-xlwt。
正准备解包 .../python-xlwt_0.7.5+debian1-1_all.deb  ...
正在解包 python-xlwt (0.7.5+debian1-1) ...
正在选中未选择的软件包 python-statsmodels-lib。
正准备解包 .../python-statsmodels-lib_0.5.0+git13-g8e07d34-1ubuntu2_amd64.deb  ...
正在解包 python-statsmodels-lib (0.5.0+git13-g8e07d34-1ubuntu2) ...
正在选中未选择的软件包 python-statsmodels。
正准备解包 .../python-statsmodels_0.5.0+git13-g8e07d34-1ubuntu2_all.deb  ...
正在解包 python-statsmodels (0.5.0+git13-g8e07d34-1ubuntu2) ...
正在处理用于 doc-base (0.10.6) 的触发器 ...
Processing 4 added doc-base files...
正在处理用于 man-db (2.7.4-1) 的触发器 ...
正在设置 libamd2.3.1:amd64 (1:4.2.1-3ubuntu1) ...
正在设置 libfftw3-double3:amd64 (3.3.4-2ubuntu1) ...
正在设置 libglpk36:amd64 (4.55-1) ...
正在设置 libgsl0ldbl:amd64 (1.16+dfsg-4) ...
正在设置 libhdf5-10:amd64 (1.8.15-patch1+docs-4) ...
正在设置 python-antlr (2.7.7+dfsg-6build1) ...
正在设置 libdsdp-5.8gf (5.8-9.1) ...
正在设置 python-cvxopt (1.1.4-1.3) ...
正在设置 python-jdcal (1.0-1build1) ...
正在设置 python-joblib (0.8.3-1) ...
正在设置 python-numexpr (2.4.3-1) ...
正在设置 python-openpyxl (2.3.0~b1-1ubuntu1) ...
正在设置 python-pandas-lib (0.15.0-2) ...
正在设置 python-pandas (0.15.0-2) ...
正在设置 python-patsy (0.4.0-1) ...
正在设置 python-py (1.4.30-2) ...
正在设置 python-pytest (2.7.2-2) ...
正在设置 python-simplejson (3.7.3-1ubuntu1) ...
正在设置 liblz4-1:amd64 (0.0~r131-1) ...
正在设置 libsnappy1v5:amd64 (1.1.3-2) ...
正在设置 python-tables-lib (3.2.2-1) ...
正在设置 python-tables-data (3.2.2-1) ...
正在设置 python-tables (3.2.2-1) ...
正在设置 python-xlrd (0.9.4-1) ...
正在设置 python-xlwt (0.7.5+debian1-1) ...
正在设置 python-statsmodels-lib (0.5.0+git13-g8e07d34-1ubuntu2) ...
正在设置 python-statsmodels (0.5.0+git13-g8e07d34-1ubuntu2) ...
正在处理用于 libc-bin (2.21-0ubuntu4.1) 的触发器 ...
[/mw_shl_code]

Test:
[mw_shl_code=python,true]peerslee@peerslee-ubuntu:~$ ipython --pylab
Python 2.7.10 (default, Oct 14 2015, 16:09:02)
Type "copyright", "credits" or "license" for more information.

IPython 2.3.0 -- An enhanced Interactive Python.
?       -> Introduction and overview of IPython's features.
%quickref -> Quick reference.
help    -> Python's own help system.
object? -> Details about 'object', use 'object??' for extra details.
Using matplotlib backend: TkAgg

In [1]: import pandas

In [2]: plot(arange(10))
Out[2]: [<matplotlib.lines.Line2D at 0x7f5aecc32510>]

In [3]:
[/mw_shl_code]

[Screenshot: line plot produced by plot(arange(10))]

Installing the matplotlib library:

[mw_shl_code=bash,true]sudo apt-get install python-matplotlib[/mw_shl_code]

A problem encountered on Ubuntu 15.10 and its solution:

(1) Problem:
[mw_shl_code=bash,true]E: 无法下载 http://mirrors.hust.edu.cn/ubunt ... lyx_2.1.4-2_all.deb 大小不符

E: 无法下载 http://mirrors.hust.edu.cn/ubunt ... 10.1+dfsg-1_all.deb 大小不符

E: 有几个软件包无法下载，您可以运行 apt-get update 或者加上 --fix-missing 的选项再试试？
[/mw_shl_code]

(2) Solution:
Download the two deb packages and install them manually:

Download links:
http://mirrors.hust.edu.cn/ubuntu/pool/universe/l/lyx/fonts-lyx_2.1.4-2_all.deb
http://mirrors.hust.edu.cn/ubuntu/pool/universe/j/jqueryui/libjs-jquery-ui_1.10.1+dfsg-1_all.deb

Install:
[mw_shl_code=bash,true]sudo dpkg -i fonts-lyx_2.1.4-2_all.deb
sudo dpkg -i libjs-jquery-ui_1.10.1+dfsg-1_all.deb
sudo apt-get -f install[/mw_shl_code]

[mw_shl_code=bash,true]peerslee@peerslee-ubuntu:~$ sudo apt-get install python-matplotlib
正在读取软件包列表... 完成
正在分析软件包的依赖关系树
正在读取状态信息... 完成
将会安装下列额外的软件包：
  blt python-dateutil python-funcsigs python-glade2 python-gobject-2
  python-gtk2 python-matplotlib-data python-mock python-nose python-pbr
  python-pyparsing python-tk python-tz tk8.6-blt2.5
建议安装的软件包：
  blt-demo python-funcsigs-doc python-gtk2-doc python-gobject-2-dbg dvipng
  inkscape ipython python-cairocffi python-configobj python-excelerator
  python-gobject python-matplotlib-doc python-tornado python-traits
  python-wxgtk3.0 texlive-extra-utils texlive-latex-extra ttf-staypuft
  python-mock-doc python-coverage python-nose-doc tix python-tk-dbg
下列【新】软件包将被安装：
  blt python-dateutil python-funcsigs python-glade2 python-gobject-2
  python-gtk2 python-matplotlib python-matplotlib-data python-mock python-nose
  python-pbr python-pyparsing python-tk python-tz tk8.6-blt2.5
升级了 0 个软件包，新安装了 15 个软件包，要卸载 0 个软件包，有 224 个软件包未被升级。
需要下载 0 B/7,751 kB 的软件包。
解压缩后会消耗掉 28.1 MB 的额外空间。
您希望继续执行吗？ [Y/n] y
正在选中未选择的软件包 tk8.6-blt2.5。
(正在读取数据库 ... 系统当前共安装有 187147 个文件和目录。)
正准备解包 .../tk8.6-blt2.5_2.5.3+dfsg-1_amd64.deb  ...
正在解包 tk8.6-blt2.5 (2.5.3+dfsg-1) ...
正在选中未选择的软件包 blt。
正准备解包 .../blt_2.5.3+dfsg-1_amd64.deb  ...
正在解包 blt (2.5.3+dfsg-1) ...
正在选中未选择的软件包 python-dateutil。
正准备解包 .../python-dateutil_2.2-2_all.deb  ...
正在解包 python-dateutil (2.2-2) ...
正在选中未选择的软件包 python-funcsigs。
正准备解包 .../python-funcsigs_0.4-1_all.deb  ...
正在解包 python-funcsigs (0.4-1) ...
正在选中未选择的软件包 python-gobject-2。
正准备解包 .../python-gobject-2_2.28.6-12build1_amd64.deb  ...
正在解包 python-gobject-2 (2.28.6-12build1) ...
正在选中未选择的软件包 python-gtk2。
正准备解包 .../python-gtk2_2.24.0-4ubuntu1_amd64.deb  ...
正在解包 python-gtk2 (2.24.0-4ubuntu1) ...
正在选中未选择的软件包 python-glade2。
正准备解包 .../python-glade2_2.24.0-4ubuntu1_amd64.deb  ...
正在解包 python-glade2 (2.24.0-4ubuntu1) ...
正在选中未选择的软件包 python-matplotlib-data。
正准备解包 .../python-matplotlib-data_1.4.2-3.1ubuntu1_all.deb  ...
正在解包 python-matplotlib-data (1.4.2-3.1ubuntu1) ...
正在选中未选择的软件包 python-pyparsing。
正准备解包 .../python-pyparsing_2.0.3+dfsg1-1_all.deb  ...
正在解包 python-pyparsing (2.0.3+dfsg1-1) ...
正在选中未选择的软件包 python-tz。
正准备解包 .../python-tz_2014.10~dfsg1-0ubuntu2_all.deb  ...
正在解包 python-tz (2014.10~dfsg1-0ubuntu2) ...
正在选中未选择的软件包 python-pbr。
正准备解包 .../python-pbr_1.8.0-2ubuntu1_all.deb  ...
正在解包 python-pbr (1.8.0-2ubuntu1) ...
正在选中未选择的软件包 python-mock。
正准备解包 .../python-mock_1.3.0-2.1ubuntu1_all.deb  ...
正在解包 python-mock (1.3.0-2.1ubuntu1) ...
正在选中未选择的软件包 python-nose。
正准备解包 .../python-nose_1.3.6-1_all.deb  ...
正在解包 python-nose (1.3.6-1) ...
正在选中未选择的软件包 python-matplotlib。
正准备解包 .../python-matplotlib_1.4.2-3.1ubuntu1_amd64.deb  ...
正在解包 python-matplotlib (1.4.2-3.1ubuntu1) ...
正在选中未选择的软件包 python-tk。
正准备解包 .../python-tk_2.7.9-1_amd64.deb  ...
正在解包 python-tk (2.7.9-1) ...
正在处理用于 man-db (2.7.4-1) 的触发器 ...
正在设置 tk8.6-blt2.5 (2.5.3+dfsg-1) ...
正在设置 blt (2.5.3+dfsg-1) ...
正在设置 python-dateutil (2.2-2) ...
正在设置 python-funcsigs (0.4-1) ...
正在设置 python-gobject-2 (2.28.6-12build1) ...
正在设置 python-gtk2 (2.24.0-4ubuntu1) ...
正在设置 python-glade2 (2.24.0-4ubuntu1) ...
正在设置 python-matplotlib-data (1.4.2-3.1ubuntu1) ...
正在设置 python-pyparsing (2.0.3+dfsg1-1) ...
正在设置 python-tz (2014.10~dfsg1-0ubuntu2) ...
正在设置 python-pbr (1.8.0-2ubuntu1) ...
update-alternatives: 使用 /usr/bin/python2-pbr 来在自动模式中提供 /usr/bin/pbr (pbr)
正在设置 python-mock (1.3.0-2.1ubuntu1) ...
正在设置 python-nose (1.3.6-1) ...
正在设置 python-matplotlib (1.4.2-3.1ubuntu1) ...
正在设置 python-tk (2.7.9-1) ...
正在处理用于 libc-bin (2.21-0ubuntu4) 的触发器 ...

[/mw_shl_code]

Test:
[mw_shl_code=python,true]import matplotlib.pyplot as plt

def bar_chart_generator():
    l = [1, 2, 3, 4, 5]            # x positions of the bars
    h = [20, 14, 38, 27, 9]        # bar heights
    w = [0.1, 0.2, 0.3, 0.4, 0.5]  # bar widths
    b = [1, 2, 3, 4, 5]            # bottom offsets of the bars
    fig = plt.figure()
    ax = fig.add_subplot(111)
    rects = ax.bar(l, h, w, b)
    plt.show()

bar_chart_generator()[/mw_shl_code]

[Screenshot: bar chart produced by bar_chart_generator()]

Appendix: installing and using pyenv

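Major tasks Python is commonly used for in machine learning:

Interacting with the outside world

Reading and writing a variety of file formats and databases.

Preparation

Cleaning, munging, combining, normalizing, reshaping, slicing and dicing, and transforming data so that it can be analyzed.

Transformation

Applying mathematical and statistical operations to data sets to produce new data sets, for example aggregating a large table by a grouping variable.

Modeling and computation

Connecting the data to statistical models, machine learning algorithms, or other computational tools.

Presentation

Creating interactive or static graphics or textual summaries.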

Reading JSON data with Python:

Code:
[mw_shl_code=python,true]#!/usr/bin/env python
# coding=utf-8
import json
path = '/home/peerslee/workspace/py/pydata-book-master/ch02/usagov_bitly_data2012-03-16-1331923249.txt'
# parse each line of the file as one JSON record
records = [json.loads(line) for line in open(path)][/mw_shl_code]

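Explanation:
The file /home/peerslee/workspace/py/pydata-book-master/ch02/usagov_bitly_data2012-03-16-1331923249.txt is read, and each line is loaded with the loads function of Python's json module. The last line of the code is a list comprehension, a concise way to apply the same operation to a collection of strings (or other objects).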
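To try the list comprehension without the data file, the sketch below parses a few in-memory lines instead. The three sample lines are invented for this example. The time_zones list is used later in the post, but its construction is not shown there; the second comprehension is one plausible way to build it and should be read as an assumption.

```python
import json

# invented stand-ins for lines of the data file; the last record has no 'tz' field
lines = [
    '{"tz": "America/New_York", "c": "US"}',
    '{"tz": "", "c": "US"}',
    '{"c": "BR"}',
]

# one JSON document per line, parsed into a list of dicts
records = [json.loads(line) for line in lines]

# keep only records that actually have a 'tz' key (assumed construction)
time_zones = [rec['tz'] for rec in records if 'tz' in rec]

print(records[0]['tz'])
print(time_zones)
```

The `if 'tz' in rec` filter matters because not every record carries a time zone; indexing rec['tz'] unconditionally would raise KeyError on the third line.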

Data format:
[mw_shl_code=bash,true]In [5]: records[0]
Out[5]:
{u'a': u'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/535.11 (KHTML, like Gecko) Chrome/17.0.963.78 Safari/535.11',
u'al': u'en-US,en;q=0.8',
u'c': u'US',
u'cy': u'Danvers',
u'g': u'A6qOVH',
u'gr': u'MA',
u'h': u'wfLQtf',
u'hc': 1331822918,
u'hh': u'1.usa.gov',
u'l': u'orofrog',
u'll': [42.576698, -70.954903],
u'nk': 1,
u'r': u'http://www.facebook.com/l/7AQEFzjSi/1.usa.gov/wfLQtf',
u't': 1331923247,
u'tz': u'America/New_York',
u'u': u'http://www.ncbi.nlm.nih.gov/pubmed/22415991'}

In [6]: records[0]['tz']
Out[6]: u'America/New_York'

In [7]: records[0]['l']
Out[7]: u'orofrog'
[/mw_shl_code]

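The u in front of the single quotes stands for unicode (a string encoding). IPython shows the string object representation of the time zone here rather than its printed form.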

Counting time zones with pure Python code:
Method 1:
[mw_shl_code=python,true]def get_counts(sequence):
    counts = {}
    for x in sequence:
        if x in counts:
            counts[x] += 1
        else:
            counts[x] = 1
    return counts[/mw_shl_code]

Test:
[mw_shl_code=bash,true]In [20]: run /tmp/ipython_edit_kVoPnD/ipython_edit_S7qM0H.py

In [21]: counts = get_counts(time_zones)

In [22]: counts['America/New_York']
Out[22]: 1251

In [23]: len(time_zones)
Out[23]: 3440
[/mw_shl_code]

Method 2:
[mw_shl_code=python,true]#!/usr/bin/env python
# coding=utf-8
from collections import defaultdict

def get_counts_1(sequence):
    '''
    defaultdict:

    A built-in Python dict cannot use a default factory to supply default values.
    collections.defaultdict takes a factory function when it is created, which tells it what type of value the dict stores; afterwards dict[key] can be used directly without raising KeyError.
    '''
    counts = defaultdict(int)
    for x in sequence:
        counts[x] += 1
    return counts[/mw_shl_code]

Testing works the same way as for method 1, so it is not repeated here.
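As a quick check that the two methods agree, the following sketch runs both on a short, made-up list of time zones. The function names mirror the ones above, but this is a standalone Python 3 illustration, not the author's test session.

```python
from collections import defaultdict


def get_counts(sequence):
    """Count with a plain dict, handling the first occurrence explicitly."""
    counts = {}
    for x in sequence:
        if x in counts:
            counts[x] += 1
        else:
            counts[x] = 1
    return counts


def get_counts_1(sequence):
    """Count with defaultdict(int): missing keys start at int() == 0."""
    counts = defaultdict(int)
    for x in sequence:
        counts[x] += 1
    return counts


# made-up sample; the empty string stands for an unknown time zone
time_zones = ['America/New_York', '', 'America/Chicago', 'America/New_York', '']

plain = get_counts(time_zones)
default = get_counts_1(time_zones)
print(plain['America/New_York'])
print(plain == dict(default))
```

One difference to keep in mind: looking up an unseen key in the defaultdict returns 0 and inserts that key, while the plain dict raises KeyError.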

Getting the top ten time zones and their counts:

Method 1:
[mw_shl_code=bash,true]#!/usr/bin/env python
# coding=utf-8
def top_counts(count_dict, n=10):
    '''
    dict.items():
    returns the dict's (key, value) tuples as an iterable list
    '''
    value_key_pairs = [(count, tz) for tz, count in count_dict.items()]
    value_key_pairs.sort()
    return value_key_pairs[-n:][/mw_shl_code]

Test:
[mw_shl_code=bash,true]In [29]: top_counts(counts)
Out[29]:
[(33, u'America/Sao_Paulo'),
(35, u'Europe/Madrid'),
(36, u'Pacific/Honolulu'),
(37, u'Asia/Tokyo'),
(74, u'Europe/London'),
(191, u'America/Denver'),
(382, u'America/Los_Angeles'),
(400, u'America/Chicago'),
(521, u''),
(1251, u'America/New_York')]
[/mw_shl_code]

Method 2:
[mw_shl_code=bash,true]In [24]: from collections import Counter

In [25]: counts = Counter(time_zones)

In [26]: counts.most_common(10)
Out[26]:
[(u'America/New_York', 1251),
(u'', 521),
(u'America/Chicago', 400),
(u'America/Los_Angeles', 382),
(u'America/Denver', 191),
(u'Europe/London', 74),
(u'Asia/Tokyo', 37),
(u'Pacific/Honolulu', 36),
(u'Europe/Madrid', 35),
(u'America/Sao_Paulo', 33)]

In [27]: counts.most_common(20)
Out[27]:
[(u'America/New_York', 1251),
(u'', 521),
(u'America/Chicago', 400),
(u'America/Los_Angeles', 382),
(u'America/Denver', 191),
(u'Europe/London', 74),
(u'Asia/Tokyo', 37),
(u'Pacific/Honolulu', 36),
(u'Europe/Madrid', 35),
(u'America/Sao_Paulo', 33),
(u'Europe/Berlin', 28),
(u'Europe/Rome', 27),
(u'America/Rainy_River', 25),
(u'Europe/Amsterdam', 22),
(u'America/Phoenix', 20),
(u'America/Indianapolis', 20),
(u'Europe/Warsaw', 16),
(u'America/Mexico_City', 15),
(u'Europe/Paris', 14),
(u'Europe/Stockholm', 14)]
[/mw_shl_code]
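The two methods return the same information in different shapes: method 1 yields (count, key) pairs in ascending order, while Counter.most_common yields (key, count) pairs in descending order. A small Python 3 sketch on made-up data shows the difference side by side.

```python
from collections import Counter


def top_counts(count_dict, n=10):
    """Sort (count, key) pairs ascending and keep the n largest."""
    value_key_pairs = sorted((count, tz) for tz, count in count_dict.items())
    return value_key_pairs[-n:]


# made-up sample; the empty string stands for an unknown time zone
time_zones = ['America/New_York', '', 'America/Chicago',
              'America/New_York', '', 'America/New_York']
counts = Counter(time_zones)

by_sorting = top_counts(counts, n=2)   # ascending, (count, key)
by_counter = counts.most_common(2)     # descending, (key, count)
print(by_sorting)
print(by_counter)
```

Reversing method 1's result and swapping each pair gives the same list as most_common whenever the counts are distinct.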

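Counting time zones with pandas:

pandas introductory tutorial:
That tutorial gives a brief introduction to the Series and DataFrame data structures, so they are not covered here.

(1) Creating a DataFrame:
DataFrame is the most important data structure in pandas; it represents data as a table. A DataFrame is a tabular, spreadsheet-like data structure containing an ordered collection of columns, each of which can hold a different value type (numeric, string, boolean, and so on).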

[mw_shl_code=bash,true]from pandas import DataFrame, Series

In [37]: import pandas as pd

In [38]: import numpy as np

In [39]: frame = DataFrame(records)

In [40]: frame
Out[40]:
   _heartbeat_                                                 a  \
0          NaN  Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKi...
1          NaN                            GoogleMaps/RochesterNY
2          NaN  Mozilla/4.0 (compatible; MSIE 8.0; Windows NT ...
3          NaN  Mozilla/5.0 (Macintosh; Intel Mac OS X 10_6_8)...
4          NaN  Mozilla/5.0 (Windows
..............
..............

3554    America/New_York  http://www.shrewsbury-ma.gov/egov/gallery/1341...
3555    America/New_York  http://www.fda.gov/AdvisoryCommittees/Committe...
3556    America/Chicago  http://www.okc.gov/PublicNotificationSystem/Fo...
3557    America/Denver       http://www.monroecounty.gov/etc/911/rss.php
3558  America/Los_Angeles             http://www.ahrq.gov/qual/qitoolkit/
3559    America/New_York  http://herndon-va.gov/Content/public_safety/Pu...

[3560 rows x 18 columns]

In [41]: frame['tz'][:10]
Out[41]:
0    America/New_York
1    America/Denver
2    America/New_York
3 America/Sao_Paulo
4    America/New_York
5    America/New_York
6       Europe/Warsaw
7
8
9
Name: tz, dtype: object
[/mw_shl_code]

(2) The output shown for frame is a summary view. The Series object returned by frame['tz'] has a value_counts method that gives us the information we need:
[mw_shl_code=python,true]In [36]: tz_counts = frame['tz'].value_counts()

In [37]: tz_counts[:10]
Out[37]:
America/New_York    1251
                     521
America/Chicago       400
America/Los_Angeles    382
America/Denver       191
Europe/London          74
Asia/Tokyo             37
Pacific/Honolulu       36
Europe/Madrid          35
America/Sao_Paulo       33
dtype: int64
[/mw_shl_code]

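(3) Use matplotlib to plot this data (IPython must be started in pylab mode)

Fill in a substitute value for unknown or missing time zones in the records:
The fillna function replaces missing values (NA).
Unknown values (empty strings) can be replaced through boolean array indexing.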

[mw_shl_code=python,true]In [38]: clean_tz = frame['tz'].fillna('Missing')

In [39]: clean_tz[clean_tz == ''] = 'Unknown'

In [40]: tz_counts = clean_tz.value_counts()

In [41]: tz_counts[:10]
Out[41]:
America/New_York    1251
Unknown                521
America/Chicago       400
America/Los_Angeles    382
America/Denver       191
Missing                120
Europe/London          74
Asia/Tokyo             37
Pacific/Honolulu       36
Europe/Madrid          35
dtype: int64

In [42]: tz_counts[:10].plot(kind='barh' ,rot=0)
Out[42]: <matplotlib.axes._subplots.AxesSubplot at 0x7fb3ec6c5e50>
[/mw_shl_code]

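value_counts computes how often each value occurs in a Series. It is also available as a top-level pandas function that can be applied to any array or sequence.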

[Screenshot: horizontal bar chart of the top 10 time zones]
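For readers without pandas at hand, the cleaning steps used for this plot (fillna, replacing empty strings through a boolean mask, then value_counts) can be mimicked in plain Python. The sketch below is only an analogy on invented data; it does not call pandas, and None plays the role of NA.

```python
from collections import Counter

# invented sample: None plays the role of a missing value (NA),
# the empty string the role of an unknown time zone
raw_tz = ['America/New_York', None, '', 'America/Chicago', '', 'America/New_York']

# analogous to frame['tz'].fillna('Missing')
filled = ['Missing' if tz is None else tz for tz in raw_tz]

# analogous to clean_tz[clean_tz == ''] = 'Unknown'
clean_tz = ['Unknown' if tz == '' else tz for tz in filled]

# analogous to clean_tz.value_counts()
tz_counts = Counter(clean_tz)
print(tz_counts.most_common())
```

Keeping 'Missing' and 'Unknown' as separate labels preserves the distinction between a record with no tz field and a record whose tz field is empty.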

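Extracting information from the strings to get another summary of user behavior:

Python's split() slices a string at a given separator; if the num argument is given, at most num splits are made (so at most num + 1 substrings are returned).
dropna
For a Series, dropna returns a Series containing only the non-null data and their index values.
The complication lies in how a DataFrame is handled, because dropping anything means dropping at least one whole row (or column). As before, this is controlled through extra parameters: dropna(axis=0, how='any', thresh=None). The how parameter can be any or all; all drops a row (column) only when every element in it is NA. Another interesting parameter is thresh, an integer: with thresh=3, for example, a row is kept only if it has at least 3 non-NA values.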

[mw_shl_code=bash,true]In [56]: results = Series([x.split()[0] for x in frame.a.dropna()])

In [57]: results[:5]
Out[57]:
0             Mozilla/5.0
1 GoogleMaps/RochesterNY
2             Mozilla/4.0
3             Mozilla/5.0
4             Mozilla/5.0
dtype: object

In [58]: results.value_counts()[:8]
Out[58]:
Mozilla/5.0                2594
Mozilla/4.0                601
GoogleMaps/RochesterNY    121
Opera/9.80                   34
TEST_INTERNET_AGENT          24
GoogleProducer             21
Mozilla/6.0                   5
BlackBerry8520/5.0.0.681    4
dtype: int64
[/mw_shl_code]
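The same extraction can be tried in plain Python on a few agent strings. The list below is a shortened, partly invented sample, and filtering out None stands in for dropna; the last line shows the effect of limiting the number of splits.

```python
# shortened sample of agent strings; None plays the role of a missing value
agents = [
    'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/535.11',
    None,
    'GoogleMaps/RochesterNY',
    'Mozilla/4.0 (compatible; MSIE 8.0; Windows NT 5.1)',
]

# stand-in for frame.a.dropna(): keep only the non-missing entries
non_null = [agent for agent in agents if agent is not None]

# split() with no arguments splits on whitespace; [0] is the first token
first_tokens = [agent.split()[0] for agent in non_null]
print(first_tokens)

# with a split limit of 1, the string is cut once and yields two pieces
limited = 'Mozilla/5.0 (Windows NT 6.1; WOW64)'.split(' ', 1)
print(limited)
```

The first token is roughly the browser or client name, which is why counting it gives a quick summary of what the users were running.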

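is(not)null
This pair of methods (isnull and notnull) is applied element-wise to an object and returns a boolean array, which is typically used for boolean indexing.
numpy.where(condition) returns the indices at which the condition holds; with three arguments, numpy.where(condition, x, y) picks x where the condition is true and y elsewhere, which is the form used here.
pandas.core.strings.StringMethods.contains tests whether each string in a Series contains a given pattern.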

[mw_shl_code=bash,true]In [72]: cframe['a'].str.contains('Windows')
Out[72]:
0    True
1    False
2    True
3    False
4    True
5    True
6    True
7    True
8    False
9    True
10    True
11 False
12    True
14    True
15    True
...
3545    True
3546 False
3547 False
3548 False
3549    True
3550    True
3551    True
3552    True
3553    True
3554    True
3555    True
3556    True
3557 False
3558 False
3559    True
Name: a, Length: 3440, dtype: bool
[/mw_shl_code]

Picking out the data whose agent string contains 'Windows':
[mw_shl_code=bash,true]In [93]: cframe = frame[frame.a.notnull()]

In [94]: operating_system = np.where(cframe['a'].str.contains('Windows'),'windows','not windowns')

In [95]: operating_system[:5]
Out[95]:
array(['windows', 'not windowns', 'windows', 'not windowns', 'windows'],
   dtype='|S12')
[/mw_shl_code]
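What the np.where call does here can be spelled out in plain Python: build a boolean list, then choose one of two labels per element. The helper named where below is a hypothetical stand-in written for this illustration, not NumPy's function, and the agent strings are a shortened sample.

```python
def where(condition, if_true, if_false):
    """Element-wise choice, mimicking the three-argument form of numpy.where."""
    return [if_true if flag else if_false for flag in condition]


# shortened sample of agent strings
agents = [
    'Mozilla/5.0 (Windows NT 6.1; WOW64)',
    'GoogleMaps/RochesterNY',
    'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_6_8)',
]

# stand-in for cframe['a'].str.contains('Windows')
contains_windows = ['Windows' in agent for agent in agents]

operating_system = where(contains_windows, 'Windows', 'Not Windows')
print(operating_system)
```

The result has one label per row, which is what allows it to be used as a grouping key alongside the tz column.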

Group the data by time zone and the newly obtained operating-system list, count each group with size, and reshape the counts with unstack:
[mw_shl_code=bash,true]In [98]: by_tz_os = cframe.groupby(['tz',operating_system])

In [99]: agg_counts = by_tz_os.size().unstack().fillna(0)

In [100]: agg_counts[:10]
Out[100]:
                              Not Windowns  Windows
tz
                                       245    276
Africa/Cairo                            0       3
Africa/Casablanca                         0       1
Africa/Ceuta                            0       2
Africa/Johannesburg                      0       1
Africa/Lusaka                            0       1
America/Anchorage                         4       1
America/Argentina/Buenos_Aires          1       0
America/Argentina/Cordoba                0       1
America/Argentina/Mendoza                0       1
[/mw_shl_code]
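The groupby, size, unstack, and fillna(0) chain can be pictured with plain Python containers: count (time zone, operating system) pairs, then spread the counts into one row per time zone with a column per operating system, filling absent combinations with 0. The data and variable names below are invented for this sketch, which does not use pandas.

```python
from collections import Counter

# invented sample: one entry per record
tz_list = ['America/New_York', 'America/New_York', 'Europe/London', '', 'America/New_York']
os_list = ['Windows', 'Not Windows', 'Windows', 'Windows', 'Windows']

# like groupby(['tz', operating_system]).size(): one count per (tz, os) pair
pair_counts = Counter(zip(tz_list, os_list))

# like unstack().fillna(0): one row per tz, one column per os, 0 where a pair never occurs
os_names = sorted(set(os_list))
agg_counts = {
    tz: {os_name: pair_counts.get((tz, os_name), 0) for os_name in os_names}
    for tz in sorted(set(tz_list))
}

for tz, row in agg_counts.items():
    print(repr(tz), row)
```

The fill value matters: without it, a time zone seen only on Windows would have no entry at all in the other column.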

Build an indirect index array from the row totals of agg_counts:

argsort returns the indices that would sort the array values in ascending order

[mw_shl_code=bash,true]In [101]: indexer = agg_counts.sum(1).argsort()

In [102]: indexer[:10]
Out[102]:
tz
                              24
Africa/Cairo                   20
Africa/Casablanca                21
Africa/Ceuta                   92
Africa/Johannesburg             87
Africa/Lusaka                   53
America/Anchorage                54
America/Argentina/Buenos_Aires 57
America/Argentina/Cordoba       26
America/Argentina/Mendoza       55
dtype: int64
[/mw_shl_code]

Use take to select the last 10 rows in this order:
[mw_shl_code=bash,true]In [103]: count_subset = agg_counts.take(indexer)[-10:]

In [104]: count_subset
Out[104]:
                  Not Windowns  Windows
tz
America/Sao_Paulo             13    20
Europe/Madrid                16    19
Pacific/Honolulu             0    36
Asia/Tokyo                   2    35
Europe/London                43    31
America/Denver             132    59
America/Los_Angeles          130    252
America/Chicago             115    285
                           245    276
America/New_York             339    912
[/mw_shl_code]
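The argsort and take steps amount to "find the order of the row totals, then pick rows in that order". A plain-Python sketch of the idea, using invented totals and without NumPy or pandas:

```python
# invented row totals, one per time zone
names = ['', 'Africa/Cairo', 'America/New_York', 'Europe/London', 'America/Chicago']
row_totals = [521, 3, 1251, 74, 400]

# like argsort(): positions that would sort row_totals in ascending order
indexer = sorted(range(len(row_totals)), key=lambda i: row_totals[i])
print(indexer)

# like take(indexer)[-3:]: reorder by the indexer, then keep the last 3 rows
top_names = [names[i] for i in indexer][-3:]
top_totals = [row_totals[i] for i in indexer][-3:]
print(list(zip(top_names, top_totals)))
```

Because the order is ascending, the largest totals end up at the end, which is why the post slices with [-10:] rather than [:10].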

Use stacked=True to produce a stacked bar chart:
[mw_shl_code=bash,true]In [105]: count_subset.plot(kind='barh', stacked=True)
Out[105]: <matplotlib.axes._subplots.AxesSubplot at 0x7fb3ebae9dd0>
[/mw_shl_code]

[Screenshot: stacked horizontal bar chart of Windows and non-Windows counts by time zone]
