Feature significance is an extension of kernel density estimation which is used to establish the statistical significance of features (e.g. local modes). See Chaudhuri and Marronn (1999) for 1-dimensional data, Godtliebsen et al. (2002) for 2-dimensional data and Duong et al. (2007) for 3- and 4-dimensional data. The
feature package contains a range of options to display and compute kernel density estimates, significant gradient and significant curvature regions. Significant gradient and/or curvature regions often correspond to significant features. In this vignette we focus on 1- and 2-dimensional data.
earthquake data set contains 510 observations, each consisting of measurements of an earthquake beneath the Mt St Helens volcano. The first is the
longitude (in degrees, where a negative number indicates west of the International Date Line), second is the
latitude (in degrees, where a positive number indicates north of the Equator) and the third is the
depth (in km, where a negative number indicates below the Earth’s surface). For the univariate example, we take the
log(-depth) as our variable of interest. The kernel density estimate with bandwidth 0.1 is the orange curve. Superimposed in green are the sections of this density estimate which have significant gradient (i.e. significantly different from zero). The rug plot is the
library(feature) data(earthquake) log10(-earthquake[,3]) eq3 <- featureSignif(eq3, bw=0.1) eq3.fs <-plot(eq3.fs, xlab="-log(-depth)", addSignifGradRegion=TRUE, addData=TRUE)
par()$usr[1:2] ## save x-axis limits to align following SiZer plotxlim <-
Below this is the SiZer plot of Chaudhuri & Marron (1999). In the SiZer plot, blue indicates significantly increasing gradient, red is significantly decreasing gradient, purple is non-significant gradient and grey is data too sparse for reliable estimation. The horizontal black line is for the bandwidth 0.1.
SiZer(eq3, xlim=xlim, bw=c(0.05, 0.5), xlab="-log(-depth)") eq3.SiZer <-abline(h=log(0.1))
For bivariate data, we look at an Old Faithful geyser data set, in the
MASS library. The horizontal axis is the
waiting time (in minutes) between two eruptions, and the vertical axis is the
duration time (in minutes) of an eruption. Below is a kernel density estimate with bandwidth (4.5, 0.37) with the significant curvature regions in blue superimposed.
library(MASS) data(geyser) featureSignif(geyser, bw=c(4.5, 0.37)) geyser.fs <-plot(geyser.fs, addSignifCurvRegion=TRUE)
A variation on plotting the significant regions is to plot the data points which fall inside these regions: significant curvature data points are in blue.
The result of
featureSignif is an object of class
fs which is a list with fields
names(geyser.fs) #>  "x" "names" "bw" "fhat" #>  "grad" "curv" "gradData" "gradDataPoints" #>  "curvData" "curvDataPoints"
xis the data
namesare the name labels used for plotting
bwis the bandwidth
fhatis the kernel density estimate
gradis the logical matrix indicating signficant gradient on a grid
curvis the logical matrix indicating signficant curvature on a grid
gradDatais the logical vector indicating signficant gradient data points
gradDataPointsare the signficant gradient data points
curvDatais the logical vector indicating signficant curvature data points
curvDataPointsare the signficant curvature data points.
featureSignif also provides feature significance for 3-dimensional data via an
rgl engine. Since the latter is not integrated with
rmarkdown, this has been excluded from this vignette for the time being. See the example code in
featureSignifGUI provides interactive feature significance for 1-, 2- and 3-dimensional data via
tcltk windows but the latter are also not integrated with
Chaudhuri, P. and Marron, J. S. (1999). SiZer for exploration of structures in curves. Journal of the American Statistical Association, 94, 807-823.
Duong, T., Cowling, A., Koch, I., and Wand, M. P. (2008). Feature significance for multivariate kernel density estimation. Computational Statistics and Data Analysis, 52, 4225-4242.
Godtliebsen, F., Marron, J. S., and Chaudhuri, P. (2002). Significance in scale space for bivariate density estimation. Journal of Computational and Graphical Statistics, 11, 1-21.