The k-nearest-neighbours algorithm rests on the idea that objects of the same type are "close" to one another. To quantify the distance between objects, the measure most commonly used is the Euclidean distance.
The algorithm thus predicts the class of a new object from its distances to its neighbours: the object is assigned to the majority class among its neighbours.
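As a minimal illustration (a sketch only; the class package is used later for the actual predictions), the Euclidean distance and the majority vote can be written as:
# Illustrative sketch, not the implementation used later in this post
euclid <- function(x, y) sqrt(sum((x - y)^2))   # Euclidean distance between two observations

# Classify a new point from the majority class of its k nearest neighbours
knn_vote <- function(train, labels, new_point, k = 3) {
  d <- apply(train, 1, euclid, y = new_point)   # distance to each training row
  nearest <- order(d)[1:k]                      # indices of the k closest rows
  names(which.max(table(labels[nearest])))      # majority class among them
}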
Since Euclidean distances are computed, the basic version assumes that the variables are numeric. So that the scale of the different variables does not distort the distance computation, the values must be normalized.
Categorical variables can still be used, but a suitable notion of distance must then be defined, as sketched below.
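For categorical variables, one simple possibility (an illustration only; it is not used elsewhere in this post) is to count the attributes on which two observations differ:
# Simple matching distance: number of categorical attributes that differ
cat_dist <- function(x, y) sum(x != y)
cat_dist(c("red", "small"), c("red", "large"))   # 1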
This algorithm does not build a model. As input, we need a set of labelled observations. Then, for each new observation, the distances to its neighbours are computed. As its name suggests, the number of neighbours k is a parameter of the algorithm, and the learning step consists of optimizing this parameter.
The breast-cancer diagnosis dataset can illustrate this algorithm. The data gather observations of characteristics of breast cells and are available on the UCI website.
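The examples below assume the data have already been loaded into a data frame named data, and they rely on several packages that the original post does not show being loaded:
library(class)   # knn()
library(caret)   # createDataPartition(), confusionMatrix()
library(e1071)   # tune.knn(), tune.control()
library(knitr)   # kable()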
There are 30 predictor variables and one variable to predict:
names(data)
## [1] "radius_mean" "texture_mean"
## [3] "perimeter_mean" "area_mean"
## [5] "smoothness_mean" "compactness_mean"
## [7] "concavity_mean" "concave.points_mean"
## [9] "symmetry_mean" "fractal_dimension_mean"
## [11] "radius_se" "texture_se"
## [13] "perimeter_se" "area_se"
## [15] "smoothness_se" "compactness_se"
## [17] "concavity_se" "concave.points_se"
## [19] "symmetry_se" "fractal_dimension_se"
## [21] "radius_worst" "texture_worst"
## [23] "perimeter_worst" "area_worst"
## [25] "smoothness_worst" "compactness_worst"
## [27] "concavity_worst" "concave.points_worst"
## [29] "symmetry_worst" "fractal_dimension_worst"
## [31] "class"
The variable to predict takes the value "B", meaning benign, or "M", meaning malignant.
(table(data$class))
##
## B M
## 357 212
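The benign class is the majority class; its share of the observations is the accuracy that a trivial classifier always predicting "B" would achieve (essentially the "No Information Rate" reported later by confusionMatrix):
# Proportion of the majority class "B"
357 / (357 + 212)   # ≈ 0.627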
kable(head(data))
radius_mean | texture_mean | perimeter_mean | area_mean | smoothness_mean | compactness_mean | concavity_mean | concave.points_mean | symmetry_mean | fractal_dimension_mean | radius_se | texture_se | perimeter_se | area_se | smoothness_se | compactness_se | concavity_se | concave.points_se | symmetry_se | fractal_dimension_se | radius_worst | texture_worst | perimeter_worst | area_worst | smoothness_worst | compactness_worst | concavity_worst | concave.points_worst | symmetry_worst | fractal_dimension_worst | class |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
17.99 | 10.38 | 122.80 | 1001.0 | 0.11840 | 0.27760 | 0.3001 | 0.14710 | 0.2419 | 0.07871 | 1.0950 | 0.9053 | 8.589 | 153.40 | 0.006399 | 0.04904 | 0.05373 | 0.01587 | 0.03003 | 0.006193 | 25.38 | 17.33 | 184.60 | 2019.0 | 0.1622 | 0.6656 | 0.7119 | 0.2654 | 0.4601 | 0.11890 | M |
20.57 | 17.77 | 132.90 | 1326.0 | 0.08474 | 0.07864 | 0.0869 | 0.07017 | 0.1812 | 0.05667 | 0.5435 | 0.7339 | 3.398 | 74.08 | 0.005225 | 0.01308 | 0.01860 | 0.01340 | 0.01389 | 0.003532 | 24.99 | 23.41 | 158.80 | 1956.0 | 0.1238 | 0.1866 | 0.2416 | 0.1860 | 0.2750 | 0.08902 | M |
19.69 | 21.25 | 130.00 | 1203.0 | 0.10960 | 0.15990 | 0.1974 | 0.12790 | 0.2069 | 0.05999 | 0.7456 | 0.7869 | 4.585 | 94.03 | 0.006150 | 0.04006 | 0.03832 | 0.02058 | 0.02250 | 0.004571 | 23.57 | 25.53 | 152.50 | 1709.0 | 0.1444 | 0.4245 | 0.4504 | 0.2430 | 0.3613 | 0.08758 | M |
11.42 | 20.38 | 77.58 | 386.1 | 0.14250 | 0.28390 | 0.2414 | 0.10520 | 0.2597 | 0.09744 | 0.4956 | 1.1560 | 3.445 | 27.23 | 0.009110 | 0.07458 | 0.05661 | 0.01867 | 0.05963 | 0.009208 | 14.91 | 26.50 | 98.87 | 567.7 | 0.2098 | 0.8663 | 0.6869 | 0.2575 | 0.6638 | 0.17300 | M |
20.29 | 14.34 | 135.10 | 1297.0 | 0.10030 | 0.13280 | 0.1980 | 0.10430 | 0.1809 | 0.05883 | 0.7572 | 0.7813 | 5.438 | 94.44 | 0.011490 | 0.02461 | 0.05688 | 0.01885 | 0.01756 | 0.005115 | 22.54 | 16.67 | 152.20 | 1575.0 | 0.1374 | 0.2050 | 0.4000 | 0.1625 | 0.2364 | 0.07678 | M |
12.45 | 15.70 | 82.57 | 477.1 | 0.12780 | 0.17000 | 0.1578 | 0.08089 | 0.2087 | 0.07613 | 0.3345 | 0.8902 | 2.217 | 27.19 | 0.007510 | 0.03345 | 0.03672 | 0.01137 | 0.02165 | 0.005082 | 15.47 | 23.75 | 103.40 | 741.6 | 0.1791 | 0.5249 | 0.5355 | 0.1741 | 0.3985 | 0.12440 | M |
The values can be normalized with the following function:
# Min-max scaling: map each value of x to the interval [0, 1]
normalize <- function(x) {
  (x - min(x)) / (max(x) - min(x))
}
# Apply the scaling to the 30 numeric predictors (column 31 is the class)
datan <- as.data.frame(lapply(data[, 1:30], normalize))
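As a quick check (an illustration, not in the original post), the function maps the minimum of a vector to 0 and its maximum to 1:
normalize(c(10, 15, 20))   # 0.0 0.5 1.0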
All the variables now take values between 0 and 1:
kable(summary(datan))
radius_mean | texture_mean | perimeter_mean | area_mean | smoothness_mean | compactness_mean | concavity_mean | concave.points_mean | symmetry_mean | fractal_dimension_mean | radius_se | texture_se | perimeter_se | area_se | smoothness_se | compactness_se | concavity_se | concave.points_se | symmetry_se | fractal_dimension_se | radius_worst | texture_worst | perimeter_worst | area_worst | smoothness_worst | compactness_worst | concavity_worst | concave.points_worst | symmetry_worst | fractal_dimension_worst | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
Min. :0.0000 | Min. :0.0000 | Min. :0.0000 | Min. :0.0000 | Min. :0.0000 | Min. :0.0000 | Min. :0.00000 | Min. :0.0000 | Min. :0.0000 | Min. :0.0000 | Min. :0.00000 | Min. :0.0000 | Min. :0.00000 | Min. :0.00000 | Min. :0.0000 | Min. :0.00000 | Min. :0.00000 | Min. :0.0000 | Min. :0.0000 | Min. :0.00000 | Min. :0.0000 | Min. :0.0000 | Min. :0.0000 | Min. :0.00000 | Min. :0.0000 | Min. :0.0000 | Min. :0.00000 | Min. :0.0000 | Min. :0.0000 | Min. :0.0000 | |
1st Qu.:0.2233 | 1st Qu.:0.2185 | 1st Qu.:0.2168 | 1st Qu.:0.1174 | 1st Qu.:0.3046 | 1st Qu.:0.1397 | 1st Qu.:0.06926 | 1st Qu.:0.1009 | 1st Qu.:0.2823 | 1st Qu.:0.1630 | 1st Qu.:0.04378 | 1st Qu.:0.1047 | 1st Qu.:0.04000 | 1st Qu.:0.02064 | 1st Qu.:0.1175 | 1st Qu.:0.08132 | 1st Qu.:0.03811 | 1st Qu.:0.1447 | 1st Qu.:0.1024 | 1st Qu.:0.04675 | 1st Qu.:0.1807 | 1st Qu.:0.2415 | 1st Qu.:0.1678 | 1st Qu.:0.08113 | 1st Qu.:0.3000 | 1st Qu.:0.1163 | 1st Qu.:0.09145 | 1st Qu.:0.2231 | 1st Qu.:0.1851 | 1st Qu.:0.1077 | |
Median :0.3024 | Median :0.3088 | Median :0.2933 | Median :0.1729 | Median :0.3904 | Median :0.2247 | Median :0.14419 | Median :0.1665 | Median :0.3697 | Median :0.2439 | Median :0.07702 | Median :0.1653 | Median :0.07209 | Median :0.03311 | Median :0.1586 | Median :0.13667 | Median :0.06538 | Median :0.2070 | Median :0.1526 | Median :0.07919 | Median :0.2504 | Median :0.3569 | Median :0.2353 | Median :0.12321 | Median :0.3971 | Median :0.1791 | Median :0.18107 | Median :0.3434 | Median :0.2478 | Median :0.1640 | |
Mean :0.3382 | Mean :0.3240 | Mean :0.3329 | Mean :0.2169 | Mean :0.3948 | Mean :0.2606 | Mean :0.20806 | Mean :0.2431 | Mean :0.3796 | Mean :0.2704 | Mean :0.10635 | Mean :0.1893 | Mean :0.09938 | Mean :0.06264 | Mean :0.1811 | Mean :0.17444 | Mean :0.08054 | Mean :0.2235 | Mean :0.1781 | Mean :0.10019 | Mean :0.2967 | Mean :0.3640 | Mean :0.2831 | Mean :0.17091 | Mean :0.4041 | Mean :0.2202 | Mean :0.21740 | Mean :0.3938 | Mean :0.2633 | Mean :0.1896 | |
3rd Qu.:0.4164 | 3rd Qu.:0.4089 | 3rd Qu.:0.4168 | 3rd Qu.:0.2711 | 3rd Qu.:0.4755 | 3rd Qu.:0.3405 | 3rd Qu.:0.30623 | 3rd Qu.:0.3678 | 3rd Qu.:0.4530 | 3rd Qu.:0.3404 | 3rd Qu.:0.13304 | 3rd Qu.:0.2462 | 3rd Qu.:0.12251 | 3rd Qu.:0.07170 | 3rd Qu.:0.2187 | 3rd Qu.:0.22680 | 3rd Qu.:0.10619 | 3rd Qu.:0.2787 | 3rd Qu.:0.2195 | 3rd Qu.:0.12656 | 3rd Qu.:0.3863 | 3rd Qu.:0.4717 | 3rd Qu.:0.3735 | 3rd Qu.:0.22090 | 3rd Qu.:0.4942 | 3rd Qu.:0.3025 | 3rd Qu.:0.30583 | 3rd Qu.:0.5546 | 3rd Qu.:0.3182 | 3rd Qu.:0.2429 | |
Max. :1.0000 | Max. :1.0000 | Max. :1.0000 | Max. :1.0000 | Max. :1.0000 | Max. :1.0000 | Max. :1.00000 | Max. :1.0000 | Max. :1.0000 | Max. :1.0000 | Max. :1.00000 | Max. :1.0000 | Max. :1.00000 | Max. :1.00000 | Max. :1.0000 | Max. :1.00000 | Max. :1.00000 | Max. :1.0000 | Max. :1.0000 | Max. :1.00000 | Max. :1.0000 | Max. :1.0000 | Max. :1.0000 | Max. :1.00000 | Max. :1.0000 | Max. :1.0000 | Max. :1.00000 | Max. :1.0000 | Max. :1.0000 | Max. :1.0000 |
# Re-attach the class labels (column 31) to the normalized predictors
data_n <- cbind(datan, class = data[, 31])
To test the algorithm, we split the data into two sets: a training set (30% of the observations, given p = 0.3 below) and a test set.
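Since createDataPartition draws a random sample, fixing the seed first (a step not shown in the original post) would make the split reproducible across runs:
set.seed(123)   # illustrative value; any fixed seed would do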
# Random split of the indices, stratified by class
ind <- createDataPartition(data_n$class, times = 1, p = 0.3, list = FALSE)
a <- data_n[ind, ]    # training set
t <- data_n[-ind, ]   # test set
dim(a); dim(t)
## [1] 172 31
## [1] 397 31
We can predict the classes of the test set with the knn function from the class package, starting with k = 5.
# Each test observation gets the majority class of its 5 nearest training neighbours
pred <- knn(train = a[,1:30], test = t[,1:30],
            cl = a$class, k = 5)
The confusion matrix can be built with the table function:
table(pred,t$class)
##
## pred B M
## B 247 18
## M 2 130
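From this table, the overall accuracy can be recomputed by hand; the correctly classified observations are on the diagonal:
# Accuracy recomputed from the confusion matrix above
(247 + 130) / (247 + 18 + 2 + 130)   # ≈ 0.9496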
or use the confusionMatrix function, which computes a number of metrics automatically:
(confusionMatrix(pred,t$class))
## Confusion Matrix and Statistics
##
## Reference
## Prediction B M
## B 247 18
## M 2 130
##
## Accuracy : 0.9496
## 95% CI : (0.9233, 0.969)
## No Information Rate : 0.6272
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.8899
## Mcnemar's Test P-Value : 0.0007962
##
## Sensitivity : 0.9920
## Specificity : 0.8784
## Pos Pred Value : 0.9321
## Neg Pred Value : 0.9848
## Prevalence : 0.6272
## Detection Rate : 0.6222
## Detection Prevalence : 0.6675
## Balanced Accuracy : 0.9352
##
## 'Positive' Class : B
##
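Two of the reported metrics can be checked directly against the confusion matrix (the positive class here is "B"):
# Sensitivity: benign cases correctly identified among all actual benign cases
247 / (247 + 2)    # 0.9920
# Specificity: malignant cases correctly identified among all actual malignant cases
130 / (130 + 18)   # 0.8784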
Overfitting is the situation where the algorithm works almost perfectly on the training set but its performance deteriorates on the test set.
For example, we can choose k = 1 and predict the training set itself: the prediction is then 100% correct, since each observation's nearest neighbour is the observation itself.
pred <- knn(train = a[,1:30], test = a[,1:30],
cl = a$class, k=1)
(confusionMatrix(pred,a$class))
## Confusion Matrix and Statistics
##
## Reference
## Prediction B M
## B 108 0
## M 0 64
##
## Accuracy : 1
## 95% CI : (0.9788, 1)
## No Information Rate : 0.6279
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 1
## Mcnemar's Test P-Value : NA
##
## Sensitivity : 1.0000
## Specificity : 1.0000
## Pos Pred Value : 1.0000
## Neg Pred Value : 1.0000
## Prevalence : 0.6279
## Detection Rate : 0.6279
## Detection Prevalence : 0.6279
## Balanced Accuracy : 1.0000
##
## 'Positive' Class : B
##
However, if we apply this parameter to the test set, the prediction accuracy drops:
pred <- knn(train = a[,1:30], test = t[,1:30],
cl = a$class, k=1)
(confusionMatrix(pred,t$class))
## Confusion Matrix and Statistics
##
## Reference
## Prediction B M
## B 247 12
## M 2 136
##
## Accuracy : 0.9647
## 95% CI : (0.9415, 0.9806)
## No Information Rate : 0.6272
## P-Value [Acc > NIR] : < 2e-16
##
## Kappa : 0.9235
## Mcnemar's Test P-Value : 0.01616
##
## Sensitivity : 0.9920
## Specificity : 0.9189
## Pos Pred Value : 0.9537
## Neg Pred Value : 0.9855
## Prevalence : 0.6272
## Detection Rate : 0.6222
## Detection Prevalence : 0.6524
## Balanced Accuracy : 0.9554
##
## 'Positive' Class : B
##
With k = 1 the accuracy on the test set (0.9647) is indeed below the perfect score obtained on the training set. Cross-validation can be used to compare different values of k; here, tune.knn from the e1071 package performs 10-fold cross-validation:
knn.cross <- tune.knn(x = data_n[, 1:30], y = data_n$class, k = 1:30,
                      tunecontrol = tune.control(sampling = "cross"), cross = 10)
summary(knn.cross)
##
## Parameter tuning of 'knn.wrapper':
##
## - sampling method: 10-fold cross validation
##
## - best parameters:
## k
## 13
##
## - best performance: 0.02637845
##
## - Detailed performance results:
## k error dispersion
## 1 1 0.04577068 0.02399496
## 2 2 0.05100251 0.01760393
## 3 3 0.03167293 0.02462478
## 4 4 0.03170426 0.02609975
## 5 5 0.03170426 0.02333244
## 6 6 0.02994987 0.02510752
## 7 7 0.02994987 0.02370634
## 8 8 0.02819549 0.02087928
## 9 9 0.03170426 0.02181755
## 10 10 0.02991855 0.01678869
## 11 11 0.02640977 0.01721256
## 12 12 0.03167293 0.01626021
## 13 13 0.02637845 0.01495163
## 14 14 0.02988722 0.02036111
## 15 15 0.02637845 0.01708650
## 16 16 0.02813283 0.01697894
## 17 17 0.02988722 0.01860585
## 18 18 0.03164160 0.01813378
## 19 19 0.03164160 0.01813378
## 20 20 0.03515038 0.01654170
## 21 21 0.03339599 0.01537054
## 22 22 0.03515038 0.01654170
## 23 23 0.04044486 0.01672308
## 24 24 0.03869048 0.01621508
## 25 25 0.04044486 0.01672308
## 26 26 0.04219925 0.01487137
## 27 27 0.04219925 0.01487137
## 28 28 0.04219925 0.01701631
## 29 29 0.04044486 0.01672308
## 30 30 0.04395363 0.01497368
plot(knn.cross)
We can also use bootstrap resampling:
knn.boot <- tune.knn(x = data_n[, 1:30], y = data_n$class, k = 1:30,
                     tunecontrol = tune.control(sampling = "boot"))
# Summarize the resampling results
summary(knn.boot)
##
## Parameter tuning of 'knn.wrapper':
##
## - sampling method: bootstrapping
##
## - best parameters:
## k
## 7
##
## - best performance: 0.03396051
##
## - Detailed performance results:
## k error dispersion
## 1 1 0.04830481 0.008836647
## 2 2 0.04445023 0.010613729
## 3 3 0.04594091 0.014505065
## 4 4 0.04410417 0.016831335
## 5 5 0.03943188 0.012935081
## 6 6 0.03695312 0.012561443
## 7 7 0.03396051 0.010032768
## 8 8 0.03531110 0.013674661
## 9 9 0.03528645 0.011669237
## 10 10 0.03699277 0.013698316
## 11 11 0.03575241 0.012065851
## 12 12 0.03535861 0.011943867
## 13 13 0.03489323 0.013653993
## 14 14 0.03659524 0.013437397
## 15 15 0.03782825 0.013373994
## 16 16 0.03658264 0.014391250
## 17 17 0.03912168 0.013598094
## 18 18 0.03954813 0.013614659
## 19 19 0.03912080 0.013118203
## 20 20 0.04122168 0.014923003
## 21 21 0.04039835 0.012571568
## 22 22 0.03992083 0.013963521
## 23 23 0.04004607 0.012315373
## 24 24 0.04172712 0.013083907
## 25 25 0.04007553 0.013451435
## 26 26 0.04097639 0.013872452
## 27 27 0.03971107 0.012627554
## 28 28 0.04048460 0.013927123
## 29 29 0.04099185 0.012835775
## 30 30 0.04143843 0.012281031
plot(knn.boot)
There is no single value of k for which the error rate is clearly the lowest; values between roughly 5 and 15 all seem reasonable.
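As a final step (a sketch, not shown in the original post), one could refit the classifier with a value of k selected by cross-validation, for example k = 13, and evaluate it on the test set:
# Refit with a k chosen by cross-validation and evaluate on the held-out test set
pred_best <- knn(train = a[, 1:30], test = t[, 1:30], cl = a$class, k = 13)
confusionMatrix(pred_best, t$class)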