Skip to content

If there is only one converged run, the sivs will break #2

@mmahmoudian

Description

@mmahmoudian

There is one very rare condition that sivs() does not catch and it will break without a proper error. In this post/issue I'm going to first dissect the way this issue can happen, and then I will list how this can be addressed, and ultimately I elaborate on my preference on how this should be addressed.

The reasons I'm writing this are:

  • be transparent
  • give the possibility to the users and community to engage and express their opinion, and to discuss solutions
  • have a good and comprehensive documentation on this extremely rare issue (one single commit would not do justice to the amount of time I put into investigating this issue. I only received an email with an error and I didn't have access to the input data to investigate it, so I had to go through many hypotheses and run part of the code in my head with imaginary data to test those hypotheses).

Dissection of the problem

In the following line we are removing the runs that have failed to converge (due to whatever reason that is relevant to the chosen method)

sivs/R/sivs.R

Lines 1010 to 1011 in 4a57625

# remove the runs that could not converge (and as the result have NA instead of coefficients' dataframe)
clean.iterative.res <- iterative.res[!sapply(lapply(iterative.res, "[[", "coef"), is.logical)]

and in the following lines we are making sure that we have at least something to proceed with:

sivs/R/sivs.R

Lines 1013 to 1019 in 4a57625

# if there were not any runs that have converged
if(length(clean.iterative.res) == 0){
stop("In the Iterative step, none of the ", length(iterative.res),
" runs got converged. This means that sivs cannot progress to",
" subsequent steps, and you should use different",
" 'method' or normalize your data differently.")
}

and if we have at least one good run, we proceed with this beautiful and compact part to extract the coefficients in a data.frame format:

sivs/R/sivs.R

Lines 1021 to 1030 in 4a57625

coef.df <- Reduce(function(...){ merge(...,
by = "names",
all = TRUE) },
sapply(names(clean.iterative.res),
FUN = function(item) {
temp <- clean.iterative.res[[item]]$coef
colnames(temp)[2] <- paste0("coef.", item)
return(temp)
},
simplify = FALSE))

The issue raises when clean.iterative.res has length of 1, which subsequently would cause the coef.df to be a data.frame with only two columns. When we move the row names to row.names, the data.frame would be left with only one column which consequently turn into a vector:

sivs/R/sivs.R

Lines 1032 to 1034 in 4a57625

## set the feature names to be the rownames
row.names(coef.df) <- coef.df[, 1]
coef.df <- coef.df[, -1]

This can be reproduced with this small demo:

# only keeping two columns
a <- iris[, 1:2]
# convert the first column to some unique character values
a[, 1] <- paste0("a", 1:nrow(a))
# the following two lines is exactly what we do in the code chuck above (sivs.R#L1032-L1034)
row.names(a) <- a[, 1]
a <- a[, -1]
> a
  [1] 3.5 3.0 3.2 3.1 3.6 3.9 3.4 3.4 2.9 3.1 3.7 3.4 3.0 3.0 4.0 4.4 3.9 3.5 3.8 3.8 3.4 3.7 3.6 3.3 3.4 3.0 3.4 3.5 3.4 3.2 3.1 3.4 4.1 4.2 3.1
 [36] 3.2 3.5 3.6 3.0 3.4 3.5 2.3 3.2 3.5 3.8 3.0 3.8 3.2 3.7 3.3 3.2 3.2 3.1 2.3 2.8 2.8 3.3 2.4 2.9 2.7 2.0 3.0 2.2 2.9 2.9 3.1 3.0 2.7 2.2 2.5
 [71] 3.2 2.8 2.5 2.8 2.9 3.0 2.8 3.0 2.9 2.6 2.4 2.4 2.7 2.7 3.0 3.4 3.1 2.3 3.0 2.5 2.6 3.0 2.6 2.3 2.7 3.0 2.9 2.9 2.5 2.8 3.3 2.7 3.0 2.9 3.0
[106] 3.0 2.5 2.9 2.5 3.6 3.2 2.7 3.0 2.5 2.8 3.2 3.0 3.8 2.6 2.2 3.2 2.8 2.8 2.7 3.3 3.2 2.8 3.0 2.8 3.0 2.8 3.8 2.8 2.8 2.6 3.0 3.4 3.1 3.0 3.1
[141] 3.1 3.1 2.7 3.2 3.3 3.0 2.5 3.0 3.4 3.0

And this is the problem, because the following apply() expects a tabular object but what it will receive is just a vector:

sivs/R/sivs.R

Lines 1036 to 1039 in 4a57625

# calculate how many times each feature has been given a coefficient other than 0 and them sort them decreasingly
selection.freq <- sort(x = apply(coef.df, 1, function(f){
sum(f != 0)
}), decreasing = TRUE)


Possible solutions and my thoughts

There are many ways to solve this programmatically, like converting the coef.df to a matrix or even using the tibble package for all tabular data we have in the SIVS package; but the question is more fundamental imho (when nfolds > 1) :

  • If we end up with one single "converged" run, should we even consider moving forward?
  • Is this one single run a reliable run? (because it has worked in only one particular sample configuration)
  • Should we do the Recursive Feature Elimination step using this one "lucky" run?

I would argue that the RFE should not be ran, because having one single lucky run is well against having a reproducible research and what "Stable Iterative Variable Selection" is all about.

If we decide not to move to the next step in SIVS (i.e recursive feature elimination), then the question would be on how this should be reported to user and what should we return? Possible options are

  1. Throw an error and not returning anything to user (even that one lucky run)
  2. Throw a warning and return whatever we have so far

I would argue that we should throw an error and do not return anything to user. This is because, in spite of writing it in the article that SIVS should not be treated as a black-box method, many naively would still run the code and expect magic, and in the off chance of hitting this "lucky" run, they will just go with what SIVS is providing and they will move forward without thinking twice about what has happened and why. Perhaps when #1 is implemented, the user can have access to the list of coefficients and can do some sort of RFE themselves at their own risk if they want to.

Reports

This issue has been reported on Stackoverflow as well:

Acknowledgment

I would like to thank Alexander Biehl who reported this issue. I will edit this post when/if I get their consent to put their real name or GithubID here. As I mentioned, this is a very rare case and Alex responsibly and kindly reported the issue to me via email. The time he invested in writing that email should also be factored in for resolution of this issue. Also Alex pointed me to the Stackoverflow post which I mentioned above.

Metadata

Metadata

Assignees

No one assigned

    Labels

    bugSomething isn't working

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions