1 Introduction

1.1 Principle

The principle of the k-nearest neighbors algorithm rests on the fact that objects of the same type are “close” to one another. To quantify the distance between objects, the most commonly used measure is the Euclidean distance.

The algorithm can therefore predict the class of a new object from its distances to its neighbors: the object is assigned to the majority class among its k nearest neighbors.
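To make the rule concrete, here is a minimal sketch in R of the prediction for a single new observation. It is an illustration only; the rest of the post relies on the knn function from the class package.

# Minimal sketch of the kNN rule for one new observation (illustration only)
knn_predict_one <- function(x_new, X, y, k = 5) {
  d <- sqrt(rowSums(sweep(X, 2, x_new)^2))  # Euclidean distance to every labeled point
  nearest <- order(d)[1:k]                  # indices of the k closest points
  names(which.max(table(y[nearest])))       # majority class among those k neighbors
}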

1.2 Variable types

Since Euclidean distances are computed, the basic version of the algorithm assumes that the variables are numeric. To prevent the scale of the different variables from dominating the distance calculation, the values must be normalized.
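A toy example of this scale effect, using two values of roughly the same magnitude as the area_mean and smoothness_mean columns in the data used below:

# Without rescaling, the variable with the larger range dominates the distance
x1 <- c(area = 1001.0, smoothness = 0.118)
x2 <- c(area =  386.1, smoothness = 0.143)
sqrt(sum((x1 - x2)^2))  # about 615: driven almost entirely by 'area'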

It is still possible to use categorical variables, but an appropriate notion of distance must then be defined; one simple possibility is sketched below.
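For example, a simple matching distance counts the proportion of attributes on which two observations differ (shown here only as an illustration, not used later in this post):

# Simple matching distance for categorical attributes (illustration only)
match_dist <- function(a, b) mean(a != b)
match_dist(c("red", "small", "round"), c("red", "large", "oval"))  # 2 of 3 attributes differ: 0.667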

1.3 Algorithmic characteristics

This algorithm does not build a model. As input, it requires a labeled dataset. Then, for each new observation, the distances to its neighbors are computed. As the name suggests, the number of neighbors k is a parameter of the algorithm, and the machine-learning step consists of optimizing this parameter.

2 Building the algorithm

2.1 Data

The breast cancer (diagnostic) database is a good illustration of this algorithm. The data consist of measurements of breast cell characteristics and are available on the UCI website.
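The code in this post assumes a few packages are loaded and that the data have been read into a data frame named data. A minimal setup could look as follows (the CSV file name is an assumption, not part of the original post):

library(class)  # knn()
library(caret)  # createDataPartition(), confusionMatrix()
library(e1071)  # tune.knn(), tune.control()
library(knitr)  # kable()

# Assumed local copy of the UCI breast cancer (diagnostic) data, with named columns
data <- read.csv("breast_cancer_wisconsin.csv", stringsAsFactors = TRUE)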

There are 30 predictor variables and one target variable:

names(data)
##  [1] "radius_mean"             "texture_mean"           
##  [3] "perimeter_mean"          "area_mean"              
##  [5] "smoothness_mean"         "compactness_mean"       
##  [7] "concavity_mean"          "concave.points_mean"    
##  [9] "symmetry_mean"           "fractal_dimension_mean" 
## [11] "radius_se"               "texture_se"             
## [13] "perimeter_se"            "area_se"                
## [15] "smoothness_se"           "compactness_se"         
## [17] "concavity_se"            "concave.points_se"      
## [19] "symmetry_se"             "fractal_dimension_se"   
## [21] "radius_worst"            "texture_worst"          
## [23] "perimeter_worst"         "area_worst"             
## [25] "smoothness_worst"        "compactness_worst"      
## [27] "concavity_worst"         "concave.points_worst"   
## [29] "symmetry_worst"          "fractal_dimension_worst"
## [31] "class"

The target variable takes the value “B”, meaning benign, or “M”, meaning malignant:

(table(data$class))
## 
##   B   M 
## 357 212
kable(head(data))
| radius_mean | texture_mean | perimeter_mean | area_mean | smoothness_mean | compactness_mean | concavity_mean | concave.points_mean | symmetry_mean | fractal_dimension_mean | radius_se | texture_se | perimeter_se | area_se | smoothness_se | compactness_se | concavity_se | concave.points_se | symmetry_se | fractal_dimension_se | radius_worst | texture_worst | perimeter_worst | area_worst | smoothness_worst | compactness_worst | concavity_worst | concave.points_worst | symmetry_worst | fractal_dimension_worst | class |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 17.99 | 10.38 | 122.80 | 1001.0 | 0.11840 | 0.27760 | 0.3001 | 0.14710 | 0.2419 | 0.07871 | 1.0950 | 0.9053 | 8.589 | 153.40 | 0.006399 | 0.04904 | 0.05373 | 0.01587 | 0.03003 | 0.006193 | 25.38 | 17.33 | 184.60 | 2019.0 | 0.1622 | 0.6656 | 0.7119 | 0.2654 | 0.4601 | 0.11890 | M |
| 20.57 | 17.77 | 132.90 | 1326.0 | 0.08474 | 0.07864 | 0.0869 | 0.07017 | 0.1812 | 0.05667 | 0.5435 | 0.7339 | 3.398 | 74.08 | 0.005225 | 0.01308 | 0.01860 | 0.01340 | 0.01389 | 0.003532 | 24.99 | 23.41 | 158.80 | 1956.0 | 0.1238 | 0.1866 | 0.2416 | 0.1860 | 0.2750 | 0.08902 | M |
| 19.69 | 21.25 | 130.00 | 1203.0 | 0.10960 | 0.15990 | 0.1974 | 0.12790 | 0.2069 | 0.05999 | 0.7456 | 0.7869 | 4.585 | 94.03 | 0.006150 | 0.04006 | 0.03832 | 0.02058 | 0.02250 | 0.004571 | 23.57 | 25.53 | 152.50 | 1709.0 | 0.1444 | 0.4245 | 0.4504 | 0.2430 | 0.3613 | 0.08758 | M |
| 11.42 | 20.38 | 77.58 | 386.1 | 0.14250 | 0.28390 | 0.2414 | 0.10520 | 0.2597 | 0.09744 | 0.4956 | 1.1560 | 3.445 | 27.23 | 0.009110 | 0.07458 | 0.05661 | 0.01867 | 0.05963 | 0.009208 | 14.91 | 26.50 | 98.87 | 567.7 | 0.2098 | 0.8663 | 0.6869 | 0.2575 | 0.6638 | 0.17300 | M |
| 20.29 | 14.34 | 135.10 | 1297.0 | 0.10030 | 0.13280 | 0.1980 | 0.10430 | 0.1809 | 0.05883 | 0.7572 | 0.7813 | 5.438 | 94.44 | 0.011490 | 0.02461 | 0.05688 | 0.01885 | 0.01756 | 0.005115 | 22.54 | 16.67 | 152.20 | 1575.0 | 0.1374 | 0.2050 | 0.4000 | 0.1625 | 0.2364 | 0.07678 | M |
| 12.45 | 15.70 | 82.57 | 477.1 | 0.12780 | 0.17000 | 0.1578 | 0.08089 | 0.2087 | 0.07613 | 0.3345 | 0.8902 | 2.217 | 27.19 | 0.007510 | 0.03345 | 0.03672 | 0.01137 | 0.02165 | 0.005082 | 15.47 | 23.75 | 103.40 | 741.6 | 0.1791 | 0.5249 | 0.5355 | 0.1741 | 0.3985 | 0.12440 | M |

2.2 Normalization

We can normalize the values (min-max scaling to [0, 1]) with the following function:

normalize <- function(x) {
    # min-max scaling: maps the values of x to the interval [0, 1]
    norm <- ((x - min(x))/(max(x) - min(x)))
    return (norm)
}

datan <- as.data.frame(lapply(data[,1:30], normalize))

All variables now take values between 0 and 1:

kable(summary(datan))
| Variable | Min. | 1st Qu. | Median | Mean | 3rd Qu. | Max. |
|---|---|---|---|---|---|---|
| radius_mean | 0.0000 | 0.2233 | 0.3024 | 0.3382 | 0.4164 | 1.0000 |
| texture_mean | 0.0000 | 0.2185 | 0.3088 | 0.3240 | 0.4089 | 1.0000 |
| perimeter_mean | 0.0000 | 0.2168 | 0.2933 | 0.3329 | 0.4168 | 1.0000 |
| area_mean | 0.0000 | 0.1174 | 0.1729 | 0.2169 | 0.2711 | 1.0000 |
| smoothness_mean | 0.0000 | 0.3046 | 0.3904 | 0.3948 | 0.4755 | 1.0000 |
| compactness_mean | 0.0000 | 0.1397 | 0.2247 | 0.2606 | 0.3405 | 1.0000 |
| concavity_mean | 0.0000 | 0.06926 | 0.14419 | 0.20806 | 0.30623 | 1.0000 |
| concave.points_mean | 0.0000 | 0.1009 | 0.1665 | 0.2431 | 0.3678 | 1.0000 |
| symmetry_mean | 0.0000 | 0.2823 | 0.3697 | 0.3796 | 0.4530 | 1.0000 |
| fractal_dimension_mean | 0.0000 | 0.1630 | 0.2439 | 0.2704 | 0.3404 | 1.0000 |
| radius_se | 0.0000 | 0.04378 | 0.07702 | 0.10635 | 0.13304 | 1.0000 |
| texture_se | 0.0000 | 0.1047 | 0.1653 | 0.1893 | 0.2462 | 1.0000 |
| perimeter_se | 0.0000 | 0.04000 | 0.07209 | 0.09938 | 0.12251 | 1.0000 |
| area_se | 0.0000 | 0.02064 | 0.03311 | 0.06264 | 0.07170 | 1.0000 |
| smoothness_se | 0.0000 | 0.1175 | 0.1586 | 0.1811 | 0.2187 | 1.0000 |
| compactness_se | 0.0000 | 0.08132 | 0.13667 | 0.17444 | 0.22680 | 1.0000 |
| concavity_se | 0.0000 | 0.03811 | 0.06538 | 0.08054 | 0.10619 | 1.0000 |
| concave.points_se | 0.0000 | 0.1447 | 0.2070 | 0.2235 | 0.2787 | 1.0000 |
| symmetry_se | 0.0000 | 0.1024 | 0.1526 | 0.1781 | 0.2195 | 1.0000 |
| fractal_dimension_se | 0.0000 | 0.04675 | 0.07919 | 0.10019 | 0.12656 | 1.0000 |
| radius_worst | 0.0000 | 0.1807 | 0.2504 | 0.2967 | 0.3863 | 1.0000 |
| texture_worst | 0.0000 | 0.2415 | 0.3569 | 0.3640 | 0.4717 | 1.0000 |
| perimeter_worst | 0.0000 | 0.1678 | 0.2353 | 0.2831 | 0.3735 | 1.0000 |
| area_worst | 0.0000 | 0.08113 | 0.12321 | 0.17091 | 0.22090 | 1.0000 |
| smoothness_worst | 0.0000 | 0.3000 | 0.3971 | 0.4041 | 0.4942 | 1.0000 |
| compactness_worst | 0.0000 | 0.1163 | 0.1791 | 0.2202 | 0.3025 | 1.0000 |
| concavity_worst | 0.0000 | 0.09145 | 0.18107 | 0.21740 | 0.30583 | 1.0000 |
| concave.points_worst | 0.0000 | 0.2231 | 0.3434 | 0.3938 | 0.5546 | 1.0000 |
| symmetry_worst | 0.0000 | 0.1851 | 0.2478 | 0.2633 | 0.3182 | 1.0000 |
| fractal_dimension_worst | 0.0000 | 0.1077 | 0.1640 | 0.1896 | 0.2429 | 1.0000 |
data_n = cbind(datan, class = data[,31])  # re-attach the class labels to the normalized predictors

2.3 Creating the partitions

To test the algorithm, we split the data into two sets: a training set and a test set (here 30% of the observations go to the training set and 70% to the test set).

ind = createDataPartition(data_n$class, times = 1, p = 0.3, list = FALSE)
a = data_n[ind,]   # training set (about 30% of the observations)
t = data_n[-ind,]  # test set (the remaining 70%)
dim(a); dim(t)
## [1] 172  31
## [1] 397  31

2.4 Prediction

We can predict the classes of the test set with the knn function from the class package, taking k = 5 as a first choice.

pred <- knn(train = a[,1:30], test = t[,1:30],
            cl = a$class, k = 5)

We can build the confusion matrix with the table function:

table(pred,t$class)
##     
## pred   B   M
##    B 247  18
##    M   2 130

or use the confusionMatrix function, which automatically computes a number of performance statistics:

(confusionMatrix(pred,t$class))
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction   B   M
##          B 247  18
##          M   2 130
##                                          
##                Accuracy : 0.9496         
##                  95% CI : (0.9233, 0.969)
##     No Information Rate : 0.6272         
##     P-Value [Acc > NIR] : < 2.2e-16      
##                                          
##                   Kappa : 0.8899         
##  Mcnemar's Test P-Value : 0.0007962      
##                                          
##             Sensitivity : 0.9920         
##             Specificity : 0.8784         
##          Pos Pred Value : 0.9321         
##          Neg Pred Value : 0.9848         
##              Prevalence : 0.6272         
##          Detection Rate : 0.6222         
##    Detection Prevalence : 0.6675         
##       Balanced Accuracy : 0.9352         
##                                          
##        'Positive' Class : B              
## 
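The reported accuracy is simply the proportion of correctly classified test observations, and it can be checked directly against the confusion matrix:

sum(pred == t$class) / nrow(t)  # (247 + 130) / 397, about 0.9496, matching the Accuracy above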

3 Overfitting

Overfitting is the situation where the algorithm performs almost perfectly on the training set, while the results deteriorate on the test set.

For example, we can choose k = 1 and predict on the training set itself. The prediction is then 100% correct, since the nearest neighbor of each training observation is the observation itself.

pred <- knn(train = a[,1:30], test = a[,1:30],
            cl = a$class, k = 1)
(confusionMatrix(pred,a$class))
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction   B   M
##          B 108   0
##          M   0  64
##                                      
##                Accuracy : 1          
##                  95% CI : (0.9788, 1)
##     No Information Rate : 0.6279     
##     P-Value [Acc > NIR] : < 2.2e-16  
##                                      
##                   Kappa : 1          
##  Mcnemar's Test P-Value : NA         
##                                      
##             Sensitivity : 1.0000     
##             Specificity : 1.0000     
##          Pos Pred Value : 1.0000     
##          Neg Pred Value : 1.0000     
##              Prevalence : 0.6279     
##          Detection Rate : 0.6279     
##    Detection Prevalence : 0.6279     
##       Balanced Accuracy : 1.0000     
##                                      
##        'Positive' Class : B          
## 

However, if we apply this parameter to the test set, the prediction accuracy drops below that perfect score:

pred <- knn(train = a[,1:30], test = t[,1:30],
            cl = a$class, k = 1)
(confusionMatrix(pred,t$class))
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction   B   M
##          B 247  12
##          M   2 136
##                                           
##                Accuracy : 0.9647          
##                  95% CI : (0.9415, 0.9806)
##     No Information Rate : 0.6272          
##     P-Value [Acc > NIR] : < 2e-16         
##                                           
##                   Kappa : 0.9235          
##  Mcnemar's Test P-Value : 0.01616         
##                                           
##             Sensitivity : 0.9920          
##             Specificity : 0.9189          
##          Pos Pred Value : 0.9537          
##          Neg Pred Value : 0.9855          
##              Prevalence : 0.6272          
##          Detection Rate : 0.6222          
##    Detection Prevalence : 0.6524          
##       Balanced Accuracy : 0.9554          
##                                           
##        'Positive' Class : B               
## 

4 Optimizing the algorithm

4.1 Cross-validation

We can use cross-validation to compare different values of k: the data are split into 10 folds, each fold is used once for validation while the model is fitted on the other nine, and the errors are averaged.

knn.cross <- tune.knn(x = data_n[,1:30], y = data_n$class, k = 1:30,
                      tunecontrol = tune.control(sampling = "cross"), cross = 10)
summary(knn.cross)
## 
## Parameter tuning of 'knn.wrapper':
## 
## - sampling method: 10-fold cross validation 
## 
## - best parameters:
##   k
##  13
## 
## - best performance: 0.02637845 
## 
## - Detailed performance results:
##     k      error dispersion
## 1   1 0.04577068 0.02399496
## 2   2 0.05100251 0.01760393
## 3   3 0.03167293 0.02462478
## 4   4 0.03170426 0.02609975
## 5   5 0.03170426 0.02333244
## 6   6 0.02994987 0.02510752
## 7   7 0.02994987 0.02370634
## 8   8 0.02819549 0.02087928
## 9   9 0.03170426 0.02181755
## 10 10 0.02991855 0.01678869
## 11 11 0.02640977 0.01721256
## 12 12 0.03167293 0.01626021
## 13 13 0.02637845 0.01495163
## 14 14 0.02988722 0.02036111
## 15 15 0.02637845 0.01708650
## 16 16 0.02813283 0.01697894
## 17 17 0.02988722 0.01860585
## 18 18 0.03164160 0.01813378
## 19 19 0.03164160 0.01813378
## 20 20 0.03515038 0.01654170
## 21 21 0.03339599 0.01537054
## 22 22 0.03515038 0.01654170
## 23 23 0.04044486 0.01672308
## 24 24 0.03869048 0.01621508
## 25 25 0.04044486 0.01672308
## 26 26 0.04219925 0.01487137
## 27 27 0.04219925 0.01487137
## 28 28 0.04219925 0.01701631
## 29 29 0.04044486 0.01672308
## 30 30 0.04395363 0.01497368
plot(knn.cross)

4.2 Bootstrap

We can also tune k by bootstrap resampling, in which each candidate model is fitted on samples drawn with replacement from the data:

knn.boot <- tune.knn(x = data_n[,1:30], y = data_n$class, k = 1:30,
                     tunecontrol = tune.control(sampling = "boot"))
# Summarize the resampling results
summary(knn.boot)
## 
## Parameter tuning of 'knn.wrapper':
## 
## - sampling method: bootstrapping 
## 
## - best parameters:
##  k
##  7
## 
## - best performance: 0.03396051 
## 
## - Detailed performance results:
##     k      error  dispersion
## 1   1 0.04830481 0.008836647
## 2   2 0.04445023 0.010613729
## 3   3 0.04594091 0.014505065
## 4   4 0.04410417 0.016831335
## 5   5 0.03943188 0.012935081
## 6   6 0.03695312 0.012561443
## 7   7 0.03396051 0.010032768
## 8   8 0.03531110 0.013674661
## 9   9 0.03528645 0.011669237
## 10 10 0.03699277 0.013698316
## 11 11 0.03575241 0.012065851
## 12 12 0.03535861 0.011943867
## 13 13 0.03489323 0.013653993
## 14 14 0.03659524 0.013437397
## 15 15 0.03782825 0.013373994
## 16 16 0.03658264 0.014391250
## 17 17 0.03912168 0.013598094
## 18 18 0.03954813 0.013614659
## 19 19 0.03912080 0.013118203
## 20 20 0.04122168 0.014923003
## 21 21 0.04039835 0.012571568
## 22 22 0.03992083 0.013963521
## 23 23 0.04004607 0.012315373
## 24 24 0.04172712 0.013083907
## 25 25 0.04007553 0.013451435
## 26 26 0.04097639 0.013872452
## 27 27 0.03971107 0.012627554
## 28 28 0.04048460 0.013927123
## 29 29 0.04099185 0.012835775
## 30 30 0.04143843 0.012281031
plot(knn.boot)

5 Final algorithm

There is no single value of k for which the error rate is clearly the lowest: values between 5 and 15 all seem reasonable choices.
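As a closing sketch, one can refit the model with any value in that range, for instance k = 13 (the best value found by the cross-validation above), and evaluate it on the test set:

# Final model: kNN with k = 13, one reasonable choice from the tuning results above
pred_final <- knn(train = a[,1:30], test = t[,1:30], cl = a$class, k = 13)
confusionMatrix(pred_final, t$class)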

Copyright © 2016 Blog de Kezhan Shi