# psqn: Partially Separable Quasi-Newton

$\renewcommand\vec{\boldsymbol} \def\bigO#1{\mathcal{O}(#1)} \def\Cond#1#2{\left(#1\,\middle|\, #2\right)} \def\mat#1{\boldsymbol{#1}} \def\der{{\mathop{}\!\mathrm{d}}} \def\argmax{\text{arg}\,\text{max}} \def\Prob{\text{P}} \def\diag{\text{diag}} \def\argmin{\text{arg}\,\text{min}} \def\Expe{\text{E}}$

This package provides an optimization method for partially separable functions. Partially separable functions are of the following form:

$f(\vec x) = \sum_{i = 1}^n f_i(\vec x_{\mathcal I_i})$

where $$\vec x\in \mathbb R^l$$,

$\vec x_{\mathcal I_i} = (\vec e_{j_{i1}}^\top, \dots ,\vec e_{j_{im_i}}^\top)\vec x, \qquad \mathcal I_i = (j_{i1}, \dots, j_{im_i}) \subseteq \{1, \dots, l\},$ and $$\vec e_k$$ is the $$k$$’th column of the $$l$$ dimensional identity matrix. Each function $$f_i$$ is called an element function and only depends on $$m_i \ll l$$ parameters. This allows for an efficient quasi-Newton method when all the $$m_i$$’s are much smaller than the dimension of the parameter vector $$\vec x$$, $$l$$. The framework can be extended to allow for a linear combination of parameters but we do not cover such problems. This vignette closely follows Nocedal and Wright (2006) who cover the methods and alternatives in much greater detail.

We only consider a more restricted form of the problem. Assume that each index set $$\mathcal I_i$$ is of the form:

\begin{align*} \mathcal I_i &= \{1,\dots, p\} \cup \mathcal J_i \\ \mathcal J_i \cap \mathcal J_j &= \emptyset \qquad j\neq i \\ \mathcal J_i \cap \{1,\dots, p\} &= \emptyset \qquad \forall i = 1,\dots, n \end{align*}

That is, each index set contains $$p$$ global parameters and $$q_i = \lvert\mathcal J_i\rvert$$ private parameters which are particular to each element function, $$f_i$$. For implementation reasons, we let:

\begin{align*} \overleftarrow q_i &= \begin{cases} p & i = 0 \\ p + \sum_{k = 1}^i q_k & i > 0 \end{cases} \\ \mathcal J_i &= \{1 + \overleftarrow q_{i - 1}, \dots , q_i + \overleftarrow q_{i - 1}\} \end{align*}

such that the element functions’ private parameters lie in consecutive parts of $$\vec x$$.
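As a small concrete check of this layout, take $$p = 2$$ global parameters and $$n = 2$$ element functions with $$q_1 = q_2 = 3$$ private parameters each. Then

\begin{align*} \overleftarrow q_0 &= 2, \qquad \overleftarrow q_1 = 5, \qquad \overleftarrow q_2 = 8 \\ \mathcal J_1 &= \{3, 4, 5\}, \qquad \mathcal J_2 = \{6, 7, 8\} \\ \mathcal I_1 &= \{1, 2\}\cup\mathcal J_1, \qquad \mathcal I_2 = \{1, 2\}\cup\mathcal J_2 \end{align*}

so $$l = p + q_1 + q_2 = 8$$ and the private parameters of $$f_1$$ and $$f_2$$ occupy entries three to five and six to eight of $$\vec x$$, respectively.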

## Example

We are going to consider a Taylor approximation for a generalized linear mixed model. In particular, we focus on a mixed logit regression where:

\begin{align*} \vec U_i &\sim N^{(r)}(\vec 0, \mat\Sigma) \\ \vec\eta_i &= \mat X_i\vec\beta + \mat Z_i\vec U_i \\ Y_{ij} &\sim \text{Bin}(\text{logit}^{-1}(\eta_{ij}), 1), \qquad j = 1, \dots, t_i \end{align*}

where $$N^{(r)}(\vec\mu,\mat\Sigma)$$ denotes an $$r$$-dimensional multivariate normal distribution with mean $$\vec\mu$$ and covariance matrix $$\mat\Sigma$$ and $$\text{Bin}(p, k)$$ denotes a binomial distribution with probability $$p$$ and size $$k$$. $$\vec U_i$$ is an unknown random effect with an unknown covariance matrix $$\mat\Sigma$$ and $$\vec\beta\in\mathbb{R}^p$$ are unknown fixed effect coefficients. $$\mat X_i$$ and $$\mat Z_i$$ are known design matrices, each with $$t_i$$ rows, one for each of the $$t_i$$ observed outcomes, the $$y_{ij}$$s.

As part of a Taylor approximation, we find a mode $$\vec x = (\vec\beta^\top, \widehat{\vec u}_1^\top, \dots, \widehat{\vec u}_n^\top)^\top$$ of the log of the integrand given a covariance matrix estimate, $$\widehat{\mat \Sigma}$$. That is, we are minimizing:

\begin{align*} f(\vec x) &= -\sum_{i = 1}^n \left( \sum_{k = 1}^{t_i}(y_{ik}\eta_{ik} - \log(1 + \exp\eta_{ik})) - \frac 12 \widehat{\vec u}_i^\top\widehat{\mat \Sigma}^{-1} \widehat{\vec u}_i \right) \\ &= -\sum_{i = 1}^n \left( \vec y_i^\top(\mat X_i\vec\beta + \mat Z_i\widehat{\vec u}_i) - \sum_{k = 1}^{t_i} \log(1 + \exp(\vec x_{ik}^\top\vec\beta + \vec z_{ik}^\top\widehat{\vec u}_i)) - \frac 12 \widehat{\vec u}_i^\top\widehat{\mat \Sigma}^{-1} \widehat{\vec u}_i \right) \\ &= \sum_{i = 1}^nf_i((\vec\beta^\top, \widehat{\vec u}_i^\top)^\top) \\ f_i((\vec\beta^\top, \vec u^\top)^\top) &= -\vec y_i^\top(\mat X_i\vec\beta + \mat Z_i\vec u) + \sum_{k = 1}^{t_i} \log(1 + \exp(\vec x_{ik}^\top\vec\beta + \vec z_{ik}^\top\vec u)) + \frac 12 \vec u^\top\widehat{\mat \Sigma}^{-1} \vec u \end{align*}

In this problem, $$\vec\beta$$ are the global parameters and the $$\widehat{\vec u}_i$$’s are the private parameters. Thus, $$l = p + nr$$. We will later return to this example with an implementation which uses this package.

### Variational Approximations

The objective function for variational approximations for mixed models for clustered data is commonly also partially separable. We will briefly summarize the idea here. Ormerod and Wand (2012) and Ormerod (2011) are examples where one might benefit from using the methods in this package.

We let $$\tilde f_i$$ be the log marginal likelihood term from cluster $$i$$. This is of the form:

$\tilde f_i(\vec\omega) = \log \int p_i(\vec y_i, \vec u;\vec\omega)\der \vec u$

where $$\vec\omega$$ are unknown model parameters and $$p_i(\vec y_i, \vec u;\vec\omega)$$ is the joint density of the observed data, denoted by $$\vec y_i$$, and the cluster specific random effect, $$\vec U_i$$. $$\exp \tilde f_i(\vec\omega)$$ is often intractable. To approximate $$\tilde f_i$$, we select some variational distribution, $$v_i$$, parameterized by some set $$\Theta_i$$. We then use the approximation:

\begin{align*} \tilde f_i(\vec\omega) &= \int v_i(\vec u; \vec\theta_i) \log\left( \frac{p_i(\vec y_i, \vec u;\vec\omega)/v_i(\vec u; \vec\theta_i)} {p_i(\vec u \mid \vec y_i;\vec\omega)/v_i(\vec u; \vec\theta_i)} \right)\der\vec u \\ &= \int v_i(\vec u; \vec\theta_i) \log\left( \frac{p_i(\vec y_i, \vec u;\vec\omega)} {v_i(\vec u; \vec\theta_i)} \right)\der\vec u + \int v_i(\vec u; \vec\theta_i) \log\left( \frac{v_i(\vec u; \vec\theta_i)} {p_i(\vec u \mid \vec y_i;\vec\omega)} \right)\der\vec u \\ &\geq \int v_i(\vec u; \vec\theta_i) \log\left( \frac{p_i(\vec y_i, \vec u;\vec\omega)} {v_i(\vec u; \vec\theta_i)} \right)\der\vec u = f_i(\vec\omega,\vec\theta_i) \end{align*}

where $$\vec\theta_i\in\Theta_i$$ and $$p_i(\vec u \mid \vec y_i;\vec\omega)$$ is the conditional density of the random effect given the observed data, $$\vec y_i$$, and model parameters, $$\vec\omega$$. $$f_i(\vec\omega,\vec\theta_i)$$ is a lower bound since the Kullback–Leibler divergence

$\int v_i(\vec u; \vec\theta_i)\log\left( \frac{v_i(\vec u; \vec\theta_i)} {p_i(\vec u \mid \vec y_i;\vec\omega)} \right)\der\vec u$

is nonnegative. The idea is to replace the minimization problem:

$\argmin_{\vec\omega} -\sum_{i = 1}^n \tilde f_i(\vec\omega)$

with a variational approximation:

$\argmin_{\vec\omega,\vec\theta_1,\dots,\vec\theta_n} -\sum_{i = 1}^n f_i(\vec\omega,\vec\theta_i)$

This problem fits into the framework in the package where $$\vec\omega$$ are the global parameters and the $$\vec\theta_i$$s are the private parameters.

Variational approximations have the property that if $$v_i(\vec u; \vec\theta_i) = p_i(\vec u \mid \vec y_i;\vec\omega)$$ then the Kullback–Leibler divergence is zero and the lower bound is equal to the log marginal likelihood. Thus, we need to use a family of variational distributions, $$v_i$$, which yields a close approximation of the conditional density of the random effects, $$p_i(\vec u \mid \vec y_i;\vec\omega)$$, for some $$\vec\theta_i\in\Theta_i$$. Moreover, the lower bound also needs to be easy to optimize. An advantage of variational approximations is that, given estimates $$\widehat{\vec\omega},\widehat{\vec\theta}_1,\dots,\widehat{\vec\theta}_n$$, subsequent inference can be approximated using:

$\Expe\left(h(\vec U_i)\right) = \int h(\vec u) p_i(\vec u \mid \vec y_i;\vec\omega)\der\vec u \approx \int h(\vec u) v_i(\vec u; \widehat{\vec\theta}_i)\der\vec u.$

The latter integral may be much easier to work with for some functions $$h$$ and variational distribution, $$v_i$$.

## Quasi-Newton Method for Partially Separable Functions

We are going to assume some prior knowledge of Newton’s method and the Broyden–Fletcher–Goldfarb–Shanno (BFGS) algorithm and only provide a few details of these methods. However, we will need a bit of notation from these methods to motivate the quasi-Newton method we have implemented.

Newton’s method for minimizing a function starts at some value $$\vec x_0$$. Then we set $$k = 1$$ and

1. compute a direction $$\vec p_k$$ given by $\nabla^2 f(\vec x_{k - 1})\vec p_k = - \nabla f(\vec x_{k -1}),$
2. set $$\vec x_k = \vec x_{k - 1} + \vec p_k$$ or $$\vec x_k = \vec x_{k - 1} + \gamma\vec p_k$$ for $$\gamma \in (0, 1]$$ set to satisfy the Wolfe conditions, and
3. repeat with $$k\leftarrow k + 1$$ if a convergence criterion is not satisfied.

Computing the Hessian, $$\nabla^2 f(\vec x_{k - 1})$$, at every iteration can be expensive. The BFGS algorithm offers an alternative where we use an approximation instead. Here we start with some Hessian approximation $$\mat B_0$$ and

1. compute a direction $$\vec p_k$$ given by $\mat B_{k - 1}\vec p_k = - \nabla f(\vec x_{k -1}),$
2. find a step size $$\alpha$$ such that $$\vec x_{k - 1} + \alpha\vec p_k$$ satisfies the Wolfe conditions,
3. set $$\vec x_k = \vec x_{k - 1} + \alpha\vec p_k$$, $$\vec s_k = \alpha\vec p_k = \vec x_k - \vec x_{k - 1}$$, and $$\vec y_k = \nabla f(\vec x_k) - \nabla f(\vec x_{k - 1})$$,
4. perform a rank-two update $\mat B_k = \mat B_{k - 1} + \frac{\vec y_k\vec y_k^\top}{\vec y_k^\top\vec s_k} - \frac{\mat B_{k - 1}\vec s_k\vec s_k^\top\mat B_{k - 1}^\top}{\vec s_k^\top\mat B_{k - 1}\vec s_k},$ and
5. repeat with $$k\leftarrow k + 1$$ if a convergence criterion is not satisfied.

This reduces the cost of computing the Hessian. Further, we can update $$\mat B_k^{-1}$$ to avoid solving $$\mat B_{k - 1}\vec p_k = - \nabla f(\vec x_{k -1})$$. The matrix $$\mat B_k^{-1}$$ will still be large and dense when $$l$$ is large.

### Using Partial Separability

As an alternative, we can exploit the structure of the problem we are solving. Let

$\mat H_i = (\vec e_{j_{i1}}^\top, \dots ,\vec e_{j_{im_i}}^\top).$

The true Hessian in our case is sparse and given by

$\nabla^2 f(\vec x) = \sum_{i = 1}^n \mat H_i^\top\nabla^2f_i(\vec x_{\mathcal I_i})\mat H_i$

Notice that each $$\nabla^2f_i(\vec x_{\mathcal I_i})$$ is only a $$(p + q_i)\times (p + q_i)$$ matrix. With, say, $$n = 10$$ element functions, one can illustrate this by plotting each $$\mat H_i^\top\nabla^2f_i(\vec x_{\mathcal I_i})\mat H_i$$ with black entries for the non-zero elements; the whole Hessian is the sum of these sparse terms.

We can use the partial separability to implement a BFGS method where we make $$n$$ BFGS approximations, one for each element function, $$f_i$$. Let $$\mat B_{ki}$$ be the approximation of $$\nabla^2f_i(\vec x_{\mathcal I_i})$$ at iteration $$k$$. The method we have implemented starts with initial approximations $$\mat B_{01},\dots,\mat B_{0n}$$ and

1. computes a direction $$\vec p_k$$ given by $\left(\sum_{i = 1}^n\mat H_i^\top\mat B_{k - 1,i}\mat H_i\right)\vec p_k = - \nabla f(\vec x_{k -1}),$
2. finds a step size $$\alpha$$ such that $$\vec x_{k - 1} + \alpha\vec p_k$$ satisfies the Wolfe conditions,
3. sets $$\vec x_k = \vec x_{k - 1} + \alpha\vec p_k$$,
4. performs BFGS updates for each $$\mat B_{k1},\dots,\mat B_{kn}$$, and
5. repeats with $$k\leftarrow k + 1$$ if a convergence criterion is not satisfied.

This may seem much slower since we are solving a large linear system when $$l$$ is large. However, we can use the conjugate gradient method described in the next section, which is fast as long as the following matrix-vector product can be computed quickly:

$\left(\sum_{i = 1}^n\mat H_i^\top\mat B_{k - 1,i}\mat H_i\right)\vec z.$

To elaborate, each $$\mat H_i^\top\mat B_{k - 1,i}\mat H_i\vec z$$ consists of a matrix-vector product with an $$o_i \times o_i$$ symmetric matrix where $$o_i = p + q_i$$. This can be done in $$2o_i(o_i + 1)$$ flops. Thus, the total cost is $$2\sum_{i = 1}^n o_i(o_i + 1)$$ flops. This is in contrast to the $$2l(l + 1)$$ flops for the matrix-vector product in the original BFGS method.

As an example suppose that $$q_i = 5$$ for all $$n$$ element functions, $$n = 5000$$, and $$p = 10$$. Then $$o_i = 15$$ and the matrix-vector product above requires $$2\cdot 5000 \cdot 15(15 + 1) = 2400000$$ flops. In contrast $$l = 5000 \cdot 5 + 10 = 25010$$ and the matrix-vector product in the BFGS method requires $$2\cdot 25010 (25010 + 1) = 1251050220$$ flops. That is 521 times more flops. Similar ratios are shown in the BFGS and Partially Separable Quasi-Newton section.

More formally, the former is $$\mathcal O(\sum_{i = 1}^n(p + q_{i})^2) = \mathcal O(np^2 + np\bar q + \sum_{i = 1}^nq_i^2)$$ where $$\bar q = \sum_{i = 1}^n q_i / n$$ whereas the matrix-vector product in the BFGS method is $$\mathcal O((p + n\bar q)^2) = \mathcal O(p^2 + pn\bar q + (n\bar q)^2)$$. Thus, the former is favorable as long as $$np^2 + \sum_{i = 1}^nq_i^2$$ is small compared with $$(n\bar q)^2$$. Furthermore, the rank-two BFGS updates are cheaper and may converge faster to a good approximation. We should keep in mind, though, that the original BFGS method maintains an approximation of $$\mat B_k^{-1}$$, so no linear system needs to be solved. On the other hand, we may only need a few conjugate gradient iterations to get a good direction with the implemented quasi-Newton method.

The conjugate gradient method we use solves

$\mat A\vec x = \vec v$

which in our quasi-Newton method is

$\left(\sum_{i = 1}^n\mat H_i^\top\mat B_{k - 1,i}\mat H_i\right)\vec p_k = - \nabla f(\vec x_{k -1})$

We start with some initial value $$\vec x_0$$. Then we set $$k = 0$$, $$\vec r_0 = \mat A\vec x_0 - \vec v$$, $$\vec p_0 = -\vec r_0$$, and:

1. find the step length $\alpha_k = \frac{\vec r_k^\top\vec r_k}{\vec p_k^\top\mat A\vec p_k},$
2. find the new value $\vec x_{k + 1} = \vec x_k + \alpha_k\vec p_k,$
3. find the new residual $\vec r_{k + 1} = \vec r_k + \alpha_k\mat A\vec p_k,$
4. set $$\beta_{k + 1} = (\vec r_k^\top\vec r_k)^{-1}\vec r_{k + 1}^\top\vec r_{k + 1}$$,
5. set the new search direction to $\vec p_{k + 1} = - \vec r_{k + 1} + \beta_{k + 1}\vec p_k,$ and
6. stop if $$\lVert\vec r_{k + 1}\rVert$$ is sufficiently small. Otherwise set $$k\leftarrow k + 1$$ and repeat.

The main cost is the matrix-vector product $$\mat A\vec p_k$$ but, as we argued in the previous section, this can be computed in $$\mathcal O(\sum_{i = 1}^n(p + q_{i})^2)$$ time. The conjugate gradient method takes at most $$h$$ iterations where $$h$$ is the number of rows and columns of $$\mat A$$. Moreover, if $$\mat A$$ only has $$r < h$$ distinct eigenvalues then at most $$r$$ conjugate gradient iterations are needed. Lastly, if the eigenvalues of $$\mat A$$ fall in clusters then we may expect to perform only a number of iterations close to the number of clusters.

In practice, we terminate the conjugate gradient method when $$\lVert\vec r_k\rVert < \min (c, \sqrt{\lVert\nabla f(\vec x_{k -1})\rVert})\lVert\nabla f(\vec x_{k -1})\rVert$$ where $$c$$ is a constant the user can set. Moreover, we use diagonal preconditioning.

We can compare the flops of the matrix product in BFGS with applying the conjugate gradient method. Assume that all $$q_i$$s are almost $$\bar q$$. Then the ratio of flops is approximately:

$\frac{p^2 + 2pn\bar q + (n\bar q)^2} {n_{\text{cg}}(np^2 + 2pn\bar q + n\bar q^2)}$

where $$n_{\text{cg}}$$ is the number of conjugate gradient iterations. Thus, to get a cost which is linear in the number of element functions, $$n$$, we must have that:

\begin{align*} \frac{n_{\text{cg}}(np^2 + 2pn\bar q + n\bar q^2)} {p^2 + 2pn\bar q + (n\bar q)^2} &\leq \frac kn \\ \Leftrightarrow n_{\text{cg}} &\leq \frac kn \frac{p^2 + 2pn\bar q + (n\bar q)^2} {np^2 + 2pn\bar q + n\bar q^2} \\ &= k\frac{p^2 + 2pn\bar q + (n\bar q)^2} {n^2(p^2 + 2p\bar q + \bar q^2)} \\ &\approx k \frac{\bar q^2}{p^2 + 2p\bar q + \bar q^2} \end{align*}

where $$k$$ is some fixed constant with $$k < n$$. An example of the latter ratio is shown in the BFGS and Partially Separable Quasi-Newton section.

We can get rid of the $$p^2$$ in the denominator by computing the first $$p\times p$$ part of the Hessian approximation once prior to performing the conjugate gradient method. This is implemented in the package. The max_cg argument is added because of the considerations above.

## Line Search and Wolfe Condition

We use a line search and, by default, search for a point which satisfies the strong Wolfe condition. The constants in the Wolfe condition can be set by the user. The line search is implemented as described by Nocedal and Wright (2006) with cubic interpolation in the zoom phase.

Symmetric rank-one (SR1) updates are implemented as an alternative to the BFGS updates, and the user can choose which to use. The SR1 updates do not guarantee that the Hessian approximation is positive definite. Thus, the conjugate gradient method only proceeds as long as $$\vec p_k^\top\mat A\vec p_k > 0$$, that is, as long as the new direction is a descent direction.
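For reference, the SR1 update replaces the rank-two BFGS update with the rank-one formula (see Nocedal and Wright 2006)

$\mat B_k = \mat B_{k - 1} + \frac{(\vec y_k - \mat B_{k - 1}\vec s_k)(\vec y_k - \mat B_{k - 1}\vec s_k)^\top}{(\vec y_k - \mat B_{k - 1}\vec s_k)^\top\vec s_k}$

where the update is typically skipped when the denominator is close to zero.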

## Example Using the Implementation

We simulate a data set below from the mixed logit model we showed earlier.

```r
# assign model parameters, number of random effects, and fixed effects
q <- 4 # number of private parameters per cluster
p <- 5 # number of global parameters
beta <- sqrt((1:p) / sum(1:p))
Sigma <- diag(q)

# simulate a data set
n_clusters <- 800L # number of clusters
set.seed(66608927)

sim_dat <- replicate(n_clusters, {
  n_members <- sample.int(20L, 1L) + 2L
  X <- matrix(runif(p * n_members, -sqrt(6 / 2), sqrt(6 / 2)),
              p)
  u <- drop(rnorm(q) %*% chol(Sigma))
  Z <- matrix(runif(q * n_members, -sqrt(6 / 2 / q), sqrt(6 / 2 / q)),
              q)
  eta <- drop(beta %*% X + u %*% Z)
  y <- as.numeric((1 + exp(-eta))^(-1) > runif(n_members))

  list(X = X, Z = Z, y = y, u = u, Sigma_inv = solve(Sigma))
}, simplify = FALSE)
```

```r
# example of the first cluster
sim_dat[[1L]]
#> $X
#>         [,1]   [,2]    [,3]
#> [1,]  0.0416 -0.809 -0.1839
#> [2,]  0.6524 -1.373 -0.9254
#> [3,] -1.3339 -0.957 -0.8708
#> [4,]  0.7547 -0.156  0.0178
#> [5,]  0.7191 -0.681 -0.7232
#>
#> $Z
#>        [,1]   [,2]   [,3]
#> [1,]  0.167 -0.483 -0.785
#> [2,] -0.266 -0.823  0.794
#> [3,]  0.609 -0.549  0.269
#> [4,] -0.414 -0.457  0.605
#>
#> $y
#> [1] 0 0 0
#>
#> $u
#> [1]  0.0705 -1.7285  0.1538 -0.3245
#>
#> $Sigma_inv
#>      [,1] [,2] [,3] [,4]
#> [1,]    1    0    0    0
#> [2,]    0    1    0    0
#> [3,]    0    0    1    0
#> [4,]    0    0    0    1
```

The combined vector with global and private parameters can be created like this (it is a misnomer to call this true_params as the modes of the random effects, the private parameters, only match the simulated random effects if the clusters are very large):

```r
true_params <- c(beta, sapply(sim_dat, function(x) x$u))

# global parameters
true_params[1:p]
#> [1] 0.258 0.365 0.447 0.516 0.577

# some of the private parameters
true_params[1:(4 * q) + p]
#>  [1]  0.0705 -1.7285  0.1538 -0.3245  0.2516 -0.5419 -0.5537 -0.2805 -1.1777
#> [10] -1.7539  1.7338  0.5616 -0.8379  1.2412 -1.2046  1.4547
```

As a reference, we will create the following function to evaluate the negative log of the integrand:

```r
eval_integrand <- function(par){
  out <- 0.
  inc <- p
  beta <- par[1:p]
  for(i in seq_along(sim_dat)){
    dat <- sim_dat[[i]]
    X <- dat$X
    Z <- dat$Z
    y <- dat$y
    Sigma_inv <- dat$Sigma_inv

    u <- par[1:q + inc]
    inc <- inc + q
    eta <- drop(beta %*% X + u %*% Z)

    out <- out - drop(y %*% eta) + sum(log(1 + exp(eta))) +
      .5 * drop(u %*% Sigma_inv %*% u)
  }

  out
}

# check the log integrand at true global parameters and the random effects
eval_integrand(true_params)
#> [1] 6898
```

We will use this function to compare with our C++ implementation.

### R Implementation

An R function which we need to pass to psqn to minimize the partially separable function is given below:

```r
# evaluates the negative log integrand.
#
# Args:
#   i cluster/element function index.
#   par the global and private parameter for this cluster. It has length
#       zero if the number of parameters is requested.
#   comp_grad logical for whether to compute the gradient.
r_func <- function(i, par, comp_grad){
  dat <- sim_dat[[i]]
  X <- dat$X
  Z <- dat$Z

  if(length(par) < 1)
    # requested the dimension of the parameter
    return(c(global_dim = NROW(dat$X), private_dim = NROW(dat$Z)))

  y <- dat$y
  Sigma_inv <- dat$Sigma_inv

  beta <- par[1:p]
  uhat <- par[1:q + p]
  eta <- drop(beta %*% X + uhat %*% Z)
  exp_eta <- exp(eta)

  out <- -sum(y * eta) + sum(log(1 + exp_eta)) +
    sum(uhat * (Sigma_inv %*% uhat)) / 2
  if(comp_grad){
    d_eta <- -y + exp_eta / (1 + exp_eta)
    grad <- c(X %*% d_eta,
              Z %*% d_eta + dat$Sigma_inv %*% uhat)
    attr(out, "grad") <- grad
  }

  out
}
```

Here is a check that the above yields the same as the function we defined before:

```r
# check the function
r_func_val <- sum(sapply(
  1:n_clusters, function(i)
    r_func(i, true_params[c(1:p, 1:q + (i - 1L) * q + p)], FALSE)))
all.equal(eval_integrand(true_params), r_func_val)
#> [1] TRUE

# we could check the gradient like this
if(FALSE){
  r_func_gr <- numeric(length(true_params))
  for(i in seq_along(sim_dat)){
    out_i <- r_func(i, true_params[c(1:p, 1:q + (i - 1L) * q + p)], TRUE)
    r_func_gr[1:p] <- r_func_gr[1:p] + attr(out_i, "grad")[1:p]
    r_func_gr[1:q + (i - 1L) * q + p] <- attr(out_i, "grad")[1:q + p]
  }

  library(numDeriv)
  gr_num <- grad(function(par) eval_integrand(par), true_params)

  all.equal(r_func_gr, gr_num, tolerance = 1e-6)
}
```

The partially separable function can be minimized like this:

```r
start_val <- true_params
start_val[ 1:p ] <- start_val[ 1:p ] +
  c(0.49, -0.63, -0.4, -0.33, -0.38) # ~rnorm(length(beta), sd = .5)
start_val[-(1:p)] <- 0

library(psqn)
r_psqn_func <- function(par, n_threads = 2L, c1 = 1e-4, c2 = .9)
  psqn(par = par, fn = r_func, n_ele_func = n_clusters,
       n_threads = n_threads, c1 = c1, c2 = c2)
R_res <- r_psqn_func(start_val)
```

We will later compare this with the result from the C++ implementation which we provide in the next section.

### C++ Implementation

We provide a C++ implementation with the package as an example of how to use this package. The location of the implementation can be found by calling system.file("mlogit-ex.cpp", package = "psqn"). The most important parts of the implementation are the problem specific m_logit_func class, the get_mlogit_optimizer function, and the optim_mlogit function which are needed to perform the optimization.
The content of the file is: // we will use openMP to perform the comptutation in parallel // [[Rcpp::plugins(openmp, cpp11)]] // we use RcppArmadillo to simplify the code // [[Rcpp::depends(RcppArmadillo)]] #include <RcppArmadillo.h> // [[Rcpp::depends(psqn)]] #include "psqn.h" #include "psqn-reporter.h" using namespace Rcpp; /// simple function to avoid copying a vector. You can ignore this inline arma::vec vec_no_cp(double const * x, size_t const n_ele){ return arma::vec(const_cast<double *>(x), n_ele, false); } /*** implements the element function for a given cluster. The class must provide the member functions which we provide here. We do not need to inherit from the element_function class but we can do it to ensure that we have implemented all the member functions. */ class m_logit_func final : public PSQN::element_function { /// design matrices arma::mat const X, Z; /// outcomes arma::vec const y; /// inverse covariance matrix arma::mat const Sigma_inv; public: m_logit_func(List data): X (as<arma::mat>(data["X" ])), Z (as<arma::mat>(data["Z" ])), y (as<arma::vec>(data["y" ])), Sigma_inv(as<arma::mat>(data["Sigma_inv"])) { } /// dimension of the global parameters size_t global_dim() const { return X.n_rows; } /// dimension of the private parameters size_t private_dim() const { return Z.n_rows; } /*** computes the element function. @param point point to compute function at. */ double func(double const *point) const { arma::vec const beta = vec_no_cp(point , X.n_rows), u = vec_no_cp(point + X.n_rows, Z.n_rows); double out(0); for(size_t i = 0; i < y.n_elem; ++i){ double const eta = arma::dot(beta, X.col(i)) + arma::dot(u, Z.col(i)); out -= y[i] * eta - log(1 + exp(eta)); } out += arma::dot(u, Sigma_inv * u) * .5; return out; } /*** computes the element function and its gradient. @param point point to compute function at. @param gr gradient vector with respect to global and private parameters. 
*/ double grad (double const * point, double * gr) const { arma::vec const beta = vec_no_cp(point , X.n_rows), u = vec_no_cp(point + X.n_rows, Z.n_rows); // create objects to write to for the gradient std::fill(gr, gr + beta.n_elem + u.n_elem, 0.); arma::vec dbeta(gr , beta.n_elem, false), du (gr + beta.n_elem, u.n_elem , false); double out(0); for(size_t i = 0; i < y.n_elem; ++i){ arma::vec const xi = X.unsafe_col(i), zi = Z.unsafe_col(i); double const eta = arma::dot(beta, xi) + arma::dot(u, zi), exp_eta = exp(eta), d_eta = y[i] - exp_eta / (1 + exp_eta); out -= y[i] * eta - log(1 + exp_eta); dbeta -= d_eta * xi; du -= d_eta * zi; } arma::vec u_scaled = Sigma_inv * u; out += arma::dot(u, u_scaled) * .5; du += u_scaled; return out; } /*** returns true if the member functions are thread-safe. */ bool thread_safe() const { return true; } }; using mlogit_topim = PSQN::optimizer<m_logit_func, PSQN::R_reporter, PSQN::R_interrupter>; /*** creates a pointer to an object which is needed in the optim_mlogit function. @param data list with data for each element function. @param max_threads maximum number of threads to use. */ // [[Rcpp::export]] SEXP get_mlogit_optimizer(List data, unsigned const max_threads){ size_t const n_elem_funcs = data.size(); std::vector<m_logit_func> funcs; funcs.reserve(n_elem_funcs); for(auto dat : data) funcs.emplace_back(List(dat)); // create an XPtr to the object we will need XPtr<mlogit_topim> ptr(new mlogit_topim(funcs, max_threads)); // return the pointer to be used later return ptr; } /*** performs the optimization. @param val vector with starting value for the global and private parameters. @param ptr returned object from get_mlogit_optimizer. @param rel_eps relative convergence threshold. @param max_it maximum number iterations. @param n_threads number of threads to use. @param c1,c2 thresholds for Wolfe condition. @param use_bfgs boolean for whether to use SR1 or BFGS updates. 
@param trace integer where larger values gives more information during the optimization. @param cg_tol threshold for conjugate gradient method. @param strong_wolfe true if the strong Wolfe condition should be used. @param max_cg maximum number of conjugate gradient iterations in each iteration. Use zero if there should not be a limit. @param pre_method preconditioning method in conjugate gradient method. zero yields no preconditioning and one yields diagonal preconditioning. */ // [[Rcpp::export]] List optim_mlogit (NumericVector val, SEXP ptr, double const rel_eps, unsigned const max_it, unsigned const n_threads, double const c1, double const c2, bool const use_bfgs = true, int const trace = 0L, double const cg_tol = .5, bool const strong_wolfe = true, size_t const max_cg = 0L, int const pre_method = 1L){ XPtr<mlogit_topim> optim(ptr); // check that we pass a parameter value of the right length if(optim->n_par != static_cast<size_t>(val.size())) throw std::invalid_argument("optim_mlogit: invalid parameter size"); NumericVector par = clone(val); optim->set_n_threads(n_threads); auto res = optim->optim(&par, rel_eps, max_it, c1, c2, use_bfgs, trace, cg_tol, strong_wolfe, max_cg, static_cast<PSQN::precondition>(pre_method)); NumericVector counts = NumericVector::create( res.n_eval, res.n_grad, res.n_cg); counts.names() = CharacterVector::create("function", "gradient", "n_cg"); int const info = static_cast<int>(res.info); return List::create( _["par"] = par, _["value"] = res.value, _["info"] = info, _["counts"] = counts, _["convergence"] = res.info == PSQN::info_code::converged ); } /*** performs the optimization but only for the private parameters. @param val vector with starting value for the global and private parameters. @param ptr returned object from get_mlogit_optimizer. @param rel_eps relative convergence threshold. @param max_it maximum number iterations. @param n_threads number of threads to use. @param c1,c2 thresholds for Wolfe condition. 
*/ // [[Rcpp::export]] NumericVector optim_mlogit_private (NumericVector val, SEXP ptr, double const rel_eps, unsigned const max_it, unsigned const n_threads, double const c1, double const c2){ XPtr<mlogit_topim> optim(ptr); // check that we pass a parameter value of the right length if(optim->n_par != static_cast<size_t>(val.size())) throw std::invalid_argument("optim_mlogit_private: invalid parameter size"); NumericVector par = clone(val); optim->set_n_threads(n_threads); double const res = optim->optim_priv(&par, rel_eps, max_it, c1, c2); par.attr("value") = res; return par; } /*** evaluates the partially separable function. @param val vector with global and private parameters to evaluate the function at. @param ptr returned object from get_mlogit_optimizer. @param n_threads number of threads to use. */ // [[Rcpp::export]] double eval_mlogit(NumericVector val, SEXP ptr, unsigned const n_threads){ XPtr<mlogit_topim> optim(ptr); // check that we pass a parameter value of the right length if(optim->n_par != static_cast<size_t>(val.size())) throw std::invalid_argument("eval_mlogit: invalid parameter size"); optim->set_n_threads(n_threads); return optim->eval(&val, nullptr, false); } /*** evaluates the gradient of a partially separable function. @param val vector with global and private parameters to evaluate the function at. @param ptr returned object from get_mlogit_optimizer. @param n_threads number of threads to use. */ // [[Rcpp::export]] NumericVector grad_mlogit(NumericVector val, SEXP ptr, unsigned const n_threads){ XPtr<mlogit_topim> optim(ptr); // check that we pass a parameter value of the right length if(optim->n_par != static_cast<size_t>(val.size())) throw std::invalid_argument("grad_mlogit: invalid parameter size"); NumericVector grad(val.size()); optim->set_n_threads(n_threads); grad.attr("value") = optim->eval(&val, &grad, true); return grad; } /*** returns the current Hessian approximation. @param ptr returned object from get_mlogit_optimizer. 
*/ // [[Rcpp::export]] NumericMatrix get_Hess_approx_mlogit(SEXP ptr){ XPtr<mlogit_topim> optim(ptr); NumericMatrix out(optim->n_par, optim->n_par); optim->get_hess(&out); return out; } The PSQN::R_reporter class ensures that output will be printed when one passes a trace argument which is greater than zero. The PSQN::R_interrupter class ensures that the user can interrupt the computation. These two classes can be replaced with custom classes if one wants to and provide another implementation. See the source code of this package for the required members. We can use the code by calling Rcpp::sourceCpp to compile the code: library(Rcpp) sourceCpp(system.file("mlogit-ex.cpp", package = "psqn")) Then we can create a pointer to an optimizer and check that it yields the correct value and gradient like this: optimizer <- get_mlogit_optimizer(sim_dat, max_threads = 4L) stopifnot(all.equal( eval_integrand(true_params), eval_mlogit(val = true_params, ptr = optimizer, n_threads = 2L))) library(numDeriv) gr_num <- grad( function(par) eval_mlogit(val = par, ptr = optimizer, n_threads = 2L), true_params) gr_opt <- grad_mlogit(val = true_params, ptr = optimizer, n_threads = 2L) stopifnot(all.equal(gr_num, gr_opt, tolerance = 1e-5, check.attributes = FALSE), # also check the function value! 
all.equal(attr(gr_opt, "value"), eval_mlogit(val = true_params, ptr = optimizer, n_threads = 2L))) We can now use the BFGS implementation in the optim function to compare with like this: optim_func <- function(par, n_threads = 2L) optim( par, function(par) eval_mlogit(val = par, ptr = optimizer, n_threads = n_threads), function(par) grad_mlogit(val = par, ptr = optimizer, n_threads = n_threads), method = "BFGS", control = list(reltol = 1e-8)) bfgs_res <- optim_func(start_val) We then use the quasi-Newton method like this: psqn_func <- function(par, n_threads = 2L, c1 = 1e-4, c2 = .9, trace = 0L, use_bfgs = TRUE, opt_private = FALSE){ rel_eps <- 1e-8 if(opt_private){ # it may be useful to fix the global parameters and optimize the # private parameters to get starting values. This is very fast as each # set of parameters can be optimized separately par <- optim_mlogit_private( val = par, ptr = optimizer, rel_eps = sqrt(rel_eps), max_it = 100, n_threads = n_threads, c1 = c1, c2 = c2) } optim_mlogit(val = par, ptr = optimizer, rel_eps = rel_eps, max_it = 1000L, n_threads = n_threads, c1 = c1, c2 = c2, trace = trace, use_bfgs = use_bfgs) } psqn_res <- psqn_func(start_val) # using SR1 updates psqn_res_sr1 <- psqn_func(start_val, use_bfgs = FALSE) all.equal(psqn_res_sr1$value, psqn_res$value) #>  TRUE # w/ different starting values psqn_res_diff_start <- psqn_func(start_val, opt_private = TRUE) all.equal(psqn_res$value, psqn_res_diff_start$value) #>  TRUE The counts element contains the number of function evaluations, gradient evaluations, and the total number of conjugate gradient iterations: psqn_res$counts
#>       13       12       20

# it is the same as we got from R
all.equal(psqn_res$par, R_res$par)
#>  TRUE
all.equal(psqn_res$value, R_res$value)
#>  TRUE

# compare with optim
bfgs_res$counts
#> function gradient
#>       63       19

We can compare the solution with the solution from optim:

all.equal(bfgs_res$par, psqn_res$par)
#>  "Mean relative difference: 7.81e-05"
all.equal(psqn_res$value, bfgs_res$value, tolerance = 1e-8)
#>  TRUE
psqn_res$value - bfgs_res$value
#>  -5.61e-06

optim_mlogit takes fewer iterations, possibly because it more quickly obtains a good approximation of the Hessian. Furthermore, we only take psqn_res$counts["n_cg"], 20, conjugate gradient iterations. This is in contrast to the worst case scenario where we would make length(start_val), 3205, conjugate gradient iterations for just one iteration of the quasi-Newton method! We can also compare with the limited memory BFGS minimizer from the lbfgsb3c package:

library(lbfgsb3c)
lbfgsb3c_func <- function(par, n_threads = 2L)
  lbfgsb3c(par = par, function(par)
      eval_mlogit(val = par, ptr = optimizer, n_threads = n_threads),
    function(par)
      grad_mlogit(val = par, ptr = optimizer, n_threads = n_threads),
    control = list(factr = 1e-8 * 10, maxit = 1000L))

lbfgsb3c_res <- lbfgsb3c_func(start_val)

all.equal(lbfgsb3c_res$par, psqn_res$par)
#>  "Mean relative difference: 1.72e-05"
all.equal(lbfgsb3c_res$value, bfgs_res$value)
#>  TRUE
psqn_res$value - lbfgsb3c_res$value
#>  2.22e-06

We can also compare with the limited memory BFGS minimizer from the lbfgs package:

library(lbfgs)
lbfgs_func <- function(par, n_threads = 2L)
  lbfgs(vars = par, function(par)
      eval_mlogit(val = par, ptr = optimizer, n_threads = n_threads),
    function(par)
      grad_mlogit(val = par, ptr = optimizer, n_threads = n_threads),
    invisible = 1)

lbfgs_res <- lbfgs_func(start_val)

all.equal(lbfgs_res$par, psqn_res$par)
#>  "Mean relative difference: 1.83e-05"
all.equal(lbfgs_res$value, bfgs_res$value)
#>  TRUE
psqn_res$value - lbfgs_res$value
#>  2.21e-06

We can get the Hessian approximation by calling the get_Hess_approx_mlogit function we declared, after having called the optimizer:

aprox_hes <- get_Hess_approx_mlogit(ptr = optimizer)
dim(aprox_hes) # quite large; requires a lot of memory
#>  3205 3205
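The matrix returned here is dense, but the partially separable structure means that the method itself only needs to store one small block per element function. A minimal sketch, with made-up values of n, p, and q, compares the number of stored entries:

```r
# entries in a dense Hessian approximation versus the per-element blocks for
# n element functions, each with p global and q private parameters; the
# numbers passed below are made up for illustration
hess_entries <- function(n, p, q){
  l <- p + n * q                  # total number of parameters
  c(dense  = l^2,                 # full l x l matrix
    blocks = n * (p + q)^2)       # one (p + q) x (p + q) block per element
}

entries <- hess_entries(n = 1000L, p = 5L, q = 3L)
unname(entries["dense"] / entries["blocks"]) # the dense matrix is ~141x larger
```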

# create a plot like before. Black entries are non-zero
par(mar = c(.5, .5, .5, .5))
idx <- 1:min(1000, NROW(aprox_hes))
aprox_hes <- aprox_hes[idx, idx] # reduce dimension to plot quickly
image(abs(aprox_hes[, NCOL(aprox_hes):1]) > 0, xaxt = "n", yaxt = "n",
      col = gray.colors(2L, 1, 0))

if(FALSE){
  # only feasible for smaller problems
  hess_true <- jacobian(
    function(par) grad_mlogit(val = par, ptr = optimizer),
    psqn_res$par)
  # should not hold exactly! Might not be that good of an approximation
  all.equal(aprox_hes, hess_true)
}

The true Hessian is very sparse. Finally, here is a benchmark to compare the computation times:

bench::mark(
  `optim BFGS (2 threads)`     = optim_func   (start_val, n_threads = 2L),
  `lbfgs (1 thread)`           = lbfgs_func   (start_val, n_threads = 1L),
  `lbfgs (2 threads)`          = lbfgs_func   (start_val, n_threads = 2L),
  `lbfgs (4 threads)`          = lbfgs_func   (start_val, n_threads = 4L),
  `lbfgsb3c (1 thread)`        = lbfgsb3c_func(start_val, n_threads = 1L),
  `lbfgsb3c (2 threads)`       = lbfgsb3c_func(start_val, n_threads = 2L),
  `lbfgsb3c (4 threads)`       = lbfgsb3c_func(start_val, n_threads = 4L),
  `psqn (R; 1 thread)`         = r_psqn_func  (start_val, n_threads = 1L),
  `psqn (R; 2 threads)`        = r_psqn_func  (start_val, n_threads = 2L),
  `psqn (1 thread, SR1)`       = psqn_func    (start_val, n_threads = 1L, use_bfgs = FALSE),
  `psqn (2 threads, SR1)`      = psqn_func    (start_val, n_threads = 2L, use_bfgs = FALSE),
  `psqn (1 thread, opt pri.)`  = psqn_func    (start_val, n_threads = 1L, opt_private = TRUE),
  `psqn (2 threads, opt pri.)` = psqn_func    (start_val, n_threads = 2L, opt_private = TRUE),
  `psqn (1 thread)`            = psqn_func    (start_val, n_threads = 1L),
  `psqn (2 threads)`           = psqn_func    (start_val, n_threads = 2L),
  `psqn (4 threads)`           = psqn_func    (start_val, n_threads = 4L),
  check = FALSE, min_time = 5)
#> Warning: Some expressions had a GC in every iteration; so filtering is disabled.
#> # A tibble: 16 x 6
#>    expression                      min   median `itr/sec` mem_alloc `gc/sec`
#>    <bch:expr>                 <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl>
#>  1 optim BFGS (2 threads)        1.13s    1.13s     0.882   42.41MB    0.705
#>  2 lbfgs (1 thread)           251.66ms 252.93ms     3.94    10.07MB    0.985
#>  3 lbfgs (2 threads)          136.38ms 136.89ms     7.22     9.99MB    0.976
#>  4 lbfgs (4 threads)           92.23ms  94.17ms    10.5     11.93MB    1.58
#>  5 lbfgsb3c (1 thread)        298.11ms 306.41ms     3.19    22.71MB    0.997
#>  6 lbfgsb3c (2 threads)       180.32ms  180.9ms     5.14    25.25MB    1.58
#>  7 lbfgsb3c (4 threads)        78.02ms  80.27ms    12.3     18.98MB    3.18
#>  8 psqn (R; 1 thread)         168.67ms 177.44ms     5.49     6.78MB   12.0
#>  9 psqn (R; 2 threads)        179.28ms 186.24ms     5.29     6.78MB   10.2
#> 10 psqn (1 thread, SR1)        37.45ms  37.49ms    26.7     27.58KB    0
#> 11 psqn (2 threads, SR1)       19.04ms  19.35ms    51.0     27.58KB    0
#> 12 psqn (1 thread, opt pri.)   24.12ms  24.77ms    40.2     60.23KB    0
#> 13 psqn (2 threads, opt pri.)  13.27ms  13.46ms    73.9     55.16KB    0
#> 14 psqn (1 thread)             18.29ms  18.36ms    53.7     27.58KB    0.199
#> 15 psqn (2 threads)            10.13ms  10.31ms    96.8     27.58KB    0
#> 16 psqn (4 threads)                6ms    6.19ms   161.     27.58KB    0

We see a large reduction in computation time. To be fair, one could use the C interface of the limited-memory BFGS methods to avoid re-allocating the gradient at every iteration, which would reduce their computation times. The R version of the quasi-Newton method is slower mainly because the R implementation of the log of the integrand and its derivative is slower than the C++ version used by all the other methods. We can illustrate this by comparing the computation time with eval_integrand:

bench::mark(
  R     = eval_integrand(true_params),
  `C++` = eval_mlogit(val = true_params, ptr = optimizer, n_threads = 1L),
  min_iterations = 100)
#> # A tibble: 2 x 6
#>   expression      min   median `itr/sec` mem_alloc `gc/sec`
#>   <bch:expr> <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl>
#> 1 R            2.94ms   3.26ms      280.  139.17KB     11.0
#> 2 C++        573.83µs 576.78µs     1705.    2.49KB      0

There is a big difference. Moreover, there is an overhead from repeatedly going back and forth between R and C++.
A fair comparison would use an R implementation for all methods.

### Polynomial Example

We consider the following trivial (regression) example as there is an explicit solution to compare with:

\begin{align*}
\mathcal G &= \{1,\dots, p\} \\
\mathcal G \cap \mathcal P_i &= \emptyset \\
\mathcal P_j \cap \mathcal P_i &= \emptyset, \qquad i\neq j \\
\mathcal I_i &\in \{1,\dots, p\}^{\lvert\mathcal P_i\rvert} \\
f(\vec x) &= (\vec x_{\mathcal G} - \vec\mu_{\mathcal G})^\top
  (\vec x_{\mathcal G} - \vec\mu_{\mathcal G}) +
  \sum_{i = 1}^n (\vec x_{\mathcal P_i} - \vec\mu_{\mathcal P_i} -
    \mat\Psi_i\vec x_{\mathcal I_i})^\top
  (\vec x_{\mathcal P_i} - \vec\mu_{\mathcal P_i} - \mat\Psi_i\vec x_{\mathcal I_i})
\end{align*}

The problem is not interesting per se; it is meant as another illustration. R code to simulate from this model is given below:

# simulate the data
set.seed(1)
n_global   <- 10L
n_clusters <- 50L
mu_global  <- rnorm(n_global)

idx_start <- n_global
cluster_dat <- replicate(n_clusters, {
  n_members  <- sample.int(n_global, 1L)
  g_idx      <- sort(sample.int(n_global, n_members))
  mu_cluster <- rnorm(n_members)
  Psi <- matrix(rnorm(n_members * n_members), n_members, n_members)

  out <- list(idx = idx_start + 1:n_members, g_idx = g_idx,
              mu_cluster = mu_cluster, Psi = Psi)
  idx_start <<- idx_start + n_members
  out
}, simplify = FALSE)

# assign matrices needed for comparisons
library(Matrix)
M <- diag(idx_start)
for(cl in cluster_dat)
  M[cl$idx, cl$g_idx] <- -cl$Psi
M <- as(M, "dgCMatrix")

# Assign two R functions to evaluate the objective function. There are two
# versions of the function to show that we get the same with one being
# closer to the shown equation
fn_one <- function(par, ...){
delta <- par[1:n_global] - mu_global
out <- drop(delta %*% delta)
for(cl in cluster_dat){
delta <- drop(par[cl$idx] - cl$mu_cluster - cl$Psi %*% par[cl$g_idx])
out <- out + drop(delta %*% delta)
}
out
}
fn_two <- function(par, ...){
mu <- c(mu_global, unlist(sapply(cluster_dat, "[[", "mu_cluster")))
delta <- drop(M %*% par - mu)
drop(delta %*% delta)
}

tmp <- rnorm(idx_start)
all.equal(fn_one(tmp), fn_two(tmp)) # we get the same w/ the two
#>  TRUE
fn <- fn_two
rm(fn_one, fn_two, tmp)

gr <- function(par, ...){
mu <- c(mu_global, unlist(sapply(cluster_dat, "[[", "mu_cluster")))
2 * drop(crossprod(M, drop(M %*% par - mu)))
}

# we can easily find the explicit solution
mu <- c(mu_global, unlist(sapply(cluster_dat, "[[", "mu_cluster")))
exp_res <- drop(solve(M, mu))
fn(exp_res) # ~ zero as it should be
#>  2.98e-29
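The objective is a linear least squares problem, so any general purpose optimizer should recover the same solution. A tiny self-contained sketch, with made-up dimensions unrelated to the simulation above, illustrates this:

```r
# a tiny made-up instance of the same idea: the objective is linear least
# squares, so the minimizer solves a linear system exactly
set.seed(2)
M_small  <- diag(4) + matrix(rnorm(16, sd = .1), 4, 4)
mu_small <- rnorm(4)
fn_small <- function(x) sum((drop(M_small %*% x) - mu_small)^2)
gr_small <- function(x)
  2 * drop(crossprod(M_small, drop(M_small %*% x) - mu_small))

# the explicit solution attains (essentially) zero
x_star <- drop(solve(M_small, mu_small))
fn_small(x_star) # ~ zero

# a general purpose optimizer finds the same minimizer
opt <- optim(numeric(4), fn_small, gr_small, method = "BFGS")
all.equal(opt$par, x_star, tolerance = 1e-3)
```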

C++ code to work with this function is provided with the package at system.file("poly-ex.cpp", package = "psqn") and is shown below:

// see mlogit-ex.cpp for an example with more comments

// we will use openMP to perform the computation in parallel
// [[Rcpp::plugins(openmp, cpp11)]]

// we use RcppArmadillo to simplify the code
// [[Rcpp::depends(RcppArmadillo)]]
#include <RcppArmadillo.h>

// [[Rcpp::depends(psqn)]]
#include "psqn.h"
#include "psqn-reporter.h"

using namespace Rcpp;

/// simple function to avoid copying a vector. You can ignore this
inline arma::vec vec_no_cp(double const * x, size_t const n_ele){
return arma::vec(const_cast<double *>(x), n_ele, false);
}

class poly_func final : public PSQN::element_function {
/// global parameter indices
arma::uvec const g_idx;
/// centroid vector
arma::vec const mu_cluster;
/// matrix used to transform subset of global parameters
arma::mat const Psi;
/// number of global parameters
size_t const n_global;
/// global parameter centroid vector
arma::vec const mu_global;
/**
true if this element function should compute the terms from the global
parameters */
bool const comp_global;

public:
poly_func(List data, arma::vec const &mu_g, bool const comp_global):
g_idx     (as<arma::uvec>(data["g_idx"    ]) - 1L),
mu_cluster(as<arma::vec>(data["mu_cluster"])     ),
Psi       (as<arma::mat>(data["Psi"       ])     ),
n_global(mu_g.n_elem),
mu_global(comp_global ? mu_g : arma::vec() ),
comp_global(comp_global)
{ }

size_t global_dim() const {
return n_global;
}
size_t private_dim() const {
return mu_cluster.n_elem;
}

double func(double const *point) const {
arma::vec const x_glob = vec_no_cp(point           , n_global),
x_priv = vec_no_cp(point + n_global, mu_cluster.n_elem),
delta = x_priv - Psi * x_glob(g_idx) - mu_cluster;

// compute the function
double out(0);
out += arma::dot(delta, delta);

if(comp_global)
out += arma::dot(x_glob - mu_global, x_glob - mu_global);

return out;
}

double grad(double const * point, double * gr) const {
arma::vec const x_glob = vec_no_cp(point           , n_global),
x_priv = vec_no_cp(point + n_global, mu_cluster.n_elem),
delta = x_priv - Psi * x_glob(g_idx) - mu_cluster;

// create objects to write to for the gradient
std::fill(gr, gr + x_glob.n_elem, 0.);
arma::vec d_glob(gr                , x_glob.n_elem, false),
d_priv(gr + x_glob.n_elem, x_priv.n_elem, false);

// compute the function and the gradient
double out(0);
out += arma::dot(delta, delta);
d_glob(g_idx) -= 2 * Psi.t() * delta;
d_priv         = 2 * delta;

if(comp_global){
out += arma::dot(x_glob - mu_global, x_glob - mu_global);
d_glob += 2. * x_glob;
d_glob -= 2 * mu_global;
}

return out;
}

bool thread_safe() const {
return true;
}
};

using poly_optim = PSQN::optimizer<poly_func, PSQN::R_reporter,
PSQN::R_interrupter>;

// [[Rcpp::export]]
SEXP get_poly_optimizer(List data, arma::vec const &mu_global,
unsigned const max_threads){
size_t const n_elem_funcs = data.size();
std::vector<poly_func> funcs;
funcs.reserve(n_elem_funcs);
bool comp_global(true);
for(auto dat : data){
funcs.emplace_back(List(dat), mu_global, comp_global);
comp_global = false;
}

// create an XPtr to the object we will need
XPtr<poly_optim> ptr(new poly_optim(funcs, max_threads));

// return the pointer to be used later
return ptr;
}

// [[Rcpp::export]]
List optim_poly
(NumericVector val, SEXP ptr, double const rel_eps, unsigned const max_it,
unsigned const n_threads, double const c1,
double const c2, bool const use_bfgs = true, int const trace = 0L,
double const cg_tol = .5, bool const strong_wolfe = true,
size_t const max_cg = 0L, int const pre_method = 1L){
XPtr<poly_optim> optim(ptr);

// check that we pass a parameter value of the right length
if(optim->n_par != static_cast<size_t>(val.size()))
throw std::invalid_argument("optim_poly: invalid parameter size");

NumericVector par = clone(val);
optim->set_n_threads(n_threads);
auto res = optim->optim(&par[0], rel_eps, max_it, c1, c2,
use_bfgs, trace, cg_tol, strong_wolfe, max_cg,
static_cast<PSQN::precondition>(pre_method));
NumericVector counts = NumericVector::create(
res.n_eval, res.n_grad, res.n_cg);
counts.names() = CharacterVector::create("function", "gradient", "n_cg");

int const info = static_cast<int>(res.info);
return List::create(
_["par"] = par, _["value"] = res.value, _["info"] = info,
_["counts"] = counts,
_["convergence"] =  res.info == PSQN::info_code::converged);
}

// [[Rcpp::export]]
double eval_poly(NumericVector val, SEXP ptr, unsigned const n_threads){
XPtr<poly_optim> optim(ptr);

// check that we pass a parameter value of the right length
if(optim->n_par != static_cast<size_t>(val.size()))
throw std::invalid_argument("eval_poly: invalid parameter size");

optim->set_n_threads(n_threads);
return optim->eval(&val[0], nullptr, false);
}

// [[Rcpp::export]]
NumericVector grad_poly(NumericVector val, SEXP ptr, unsigned const n_threads){
XPtr<poly_optim> optim(ptr);

// check that we pass a parameter value of the right length
if(optim->n_par != static_cast<size_t>(val.size()))
throw std::invalid_argument("grad_poly: invalid parameter size");

NumericVector grad(val.size());
optim->set_n_threads(n_threads);
grad.attr("value") = optim->eval(&val[0], &grad[0], true);

return grad;
}

We can Rcpp::sourceCpp the file and use the code like below to find the solution:

library(Rcpp)
sourceCpp(system.file("poly-ex.cpp", package = "psqn"))

# get a pointer to C++ object
optimizer <- get_poly_optimizer(cluster_dat, mu_global = mu_global,
                                max_threads = 4L)

# we get the same function value and gradient
tmp <- rnorm(idx_start)
all.equal(fn       (tmp),
eval_poly(tmp, optimizer, 1L))
#>  TRUE
all.equal(gr       (tmp),
          grad_poly(tmp, optimizer, 1L),
          check.attributes = FALSE)
#>  TRUE

# run the optimization
psqn_func <- function(par, n_threads = 2L, c1 = 1e-4,
c2 = .9, trace = 0L)
optim_poly(val = par, ptr = optimizer, rel_eps = 1e-8, max_it = 1000L,
           n_threads = n_threads, c1 = c1,
           c2 = c2, trace = trace)

psqn_res <- psqn_func(numeric(idx_start))
all.equal(exp_res, psqn_res$par)
#>  TRUE

A version using the R function psqn is:

# assign function to pass to psqn
r_func <- function(i, par, comp_grad){
dat <- cluster_dat[[i]]
g_idx <- dat$g_idx
mu_cluster <- dat$mu_cluster
Psi <- dat$Psi

if(length(par) < 1)
# requested the dimension of the parameter
return(c(global_dim = length(mu_global),
private_dim = length(mu_cluster)))

is_glob <- 1:length(mu_global)
x_glob <- par[is_glob]
x_priv <- par[-is_glob]

delta <- drop(x_priv - Psi %*% x_glob[g_idx] - mu_cluster)

out <- drop(delta %*% delta)
if(i == 1L){
delta_glob <- x_glob - mu_global
out <- out + drop(delta_glob %*% delta_glob)
}

if(comp_grad){
grad <- numeric(length(par))
grad[g_idx] <- -2 * drop(crossprod(Psi, delta))
grad[-is_glob] <- 2 * delta
if(i == 1L)
grad[is_glob] <- grad[is_glob] + 2 * delta_glob
attr(out, "grad") <- grad
}

out
}
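To summarize the interface that the R function psqn expects, here is a standalone toy element function; the dimensions and the quadratic element function below are made up for illustration:

```r
# a standalone toy version of the interface: with an empty par, the function
# returns the global and private dimensions; otherwise it returns the element
# function value, with the gradient attached as a "grad" attribute when
# comp_grad is TRUE (the quadratic element function is made up)
fn_sketch <- function(i, par, comp_grad){
  p <- 2L; q <- 3L
  if(length(par) < 1)
    return(c(global_dim = p, private_dim = q))

  out <- sum(par^2)
  if(comp_grad)
    attr(out, "grad") <- 2 * par
  out
}

fn_sketch(1L, numeric(), FALSE)   # dimension query
fn_sketch(1L, rep(1, 5L), TRUE)   # value with gradient attribute
```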

# use the function
r_psqn_func <- function(par, n_threads = 2L, c1 = 1e-4,
c2 = .9, trace = 0L)
psqn(par = par, fn = r_func, n_ele_func = n_clusters,
rel_eps = 1e-8, n_threads = n_threads, c1 = c1, c2 = c2,
trace = trace, max_it = 1000L)

R_res <- r_psqn_func(numeric(idx_start))

The package also includes a general BFGS implementation through the psqn_bfgs function. We can use it with an objective function fn and a gradient function gr_psqn:

psqn_bfgs(c(-1.2, 1), fn, gr_psqn)$par
#>  1 1

# they run in about the same time
bench::mark(
  optim     = optim    (c(-1.2, 1), fn, gr, method = "BFGS"),
  psqn_bfgs = psqn_bfgs(c(-1.2, 1), fn, gr_psqn),
  check = FALSE, min_time = .5)
#> # A tibble: 2 x 6
#>   expression      min   median `itr/sec` mem_alloc `gc/sec`
#>   <bch:expr> <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl>
#> 1 optim       111.9µs  126.8µs     7834.      384B     14.5
#> 2 psqn_bfgs    86.3µs   94.3µs    10432.    2.49KB     12.6

### BFGS and Partially Separable Quasi-Newton

Below, we show the ratio of the flops required for the matrix-vector product in a BFGS method relative to the flops required for the matrix-vector products in the conjugate gradient method used by the partially separable quasi-Newton method:

vals <- expand.grid(n = 2^(8:13), p = 2^(2:4), q = 2^(2:8))
vals <- within(vals, {
  flops_qsn  <- 2L * n * (p + q) * (p + q + 1L)
  flops_bfgs <- 2L * (q * n + p)^2
  ratio <- flops_bfgs / flops_qsn
})
nq <- length(unique(vals$q))
tvals <- c(vals$n[seq_len(NROW(vals) / nq)], vals$p[seq_len(NROW(vals) / nq)], floor(vals[, "ratio"]))

vals <- matrix(
  tvals, ncol = nq + 2L, dimnames = list(
    NULL, c("n", "p/q", unique(vals$q))))
knitr::kable(vals)

   n  p/q     4     8    16    32    64   128   256
 256    4    57   105   156   196   223   238   247
 512    4   114   210   312   393   447   477   494
1024    4   228   420   624   787   894   955   988
2048    4   455   840  1248  1574  1787  1911  1977
4096    4   910  1680  2496  3149  3575  3822  3955
8192    4  1820  3361  4993  6297  7151  7645  7911
 256    8    26    60   109   160   199   225   239
 512    8    52   120   218   320   399   450   479
1024    8   105   241   437   639   798   900   959
2048    8   210   482   874  1279  1596  1801  1918
4096    8   420   964  1748  2557  3192  3601  3837
8192    8   840  1928  3495  5115  6384  7203  7674
 256   16    10    27    62   111   162   201   226
 512   16    19    55   124   223   323   401   451
1024   16    39   109   248   446   647   803   903
2048   16    78   218   496   892  1294  1607  1807
4096   16   156   437   993  1783  2589  3214  3615
8192   16   312   874  1986  3567  5178  6428  7230
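As a sanity check, the first entry of the table can be recomputed directly from the two flop formulas:

```r
# re-computing the first entry of the table above: n = 256 element functions,
# p = 4 global and q = 4 private parameters
n <- 256L; p <- 4L; q <- 4L
flops_bfgs <- 2 * (q * n + p)^2              # dense BFGS matrix-vector product
flops_qsn  <- 2 * n * (p + q) * (p + q + 1L) # block products in the CG method
floor(flops_bfgs / flops_qsn)
#> [1] 57
```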

vals <- matrix(
  tvals, ncol = length(unique(vals$q)) + 1L, dimnames = list(
    NULL, c("p/q", unique(vals$q))))
knitr::kable(vals, digits = 4)
p/q 4 8 16 32 64 128 256 512 1024
4 0.2500 0.4444 0.6400 0.7901 0.886 0.940 0.970 0.985 0.992
8 0.1111 0.2500 0.4444 0.6400 0.790 0.886 0.940 0.970 0.985
16 0.0400 0.1111 0.2500 0.4444 0.640 0.790 0.886 0.940 0.970
32 0.0123 0.0400 0.1111 0.2500 0.444 0.640 0.790 0.886 0.940
64 0.0035 0.0123 0.0400 0.1111 0.250 0.444 0.640 0.790 0.886
128 0.0009 0.0035 0.0123 0.0400 0.111 0.250 0.444 0.640 0.790
256 0.0002 0.0009 0.0035 0.0123 0.040 0.111 0.250 0.444 0.640
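Any entry of the table can be checked by hand; for instance, for $$p = 16$$ and $$q = 64$$:

```r
# the entry for p = 16 and q = 64 in the table above
p <- 16; q <- 64
q^2 / (p^2 + 2 * p * q + q^2)
#> [1] 0.64
```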

We can get rid of the $$p^2$$ term which gives us:

vals <- expand.grid(p = 2^(2:8), q = 2^(2:10))
vals <- within(vals, {
ratio <- q^2 / (2 * p * q + q^2)
})
tvals <- c(unique(vals$p), vals[, "ratio"])
vals <- matrix(
  tvals, ncol = length(unique(vals$q)) + 1L, dimnames = list(
    NULL, c("p/q", unique(vals$q))))
knitr::kable(vals, digits = 4)
p/q 4 8 16 32 64 128 256 512 1024
4 0.3333 0.5000 0.6667 0.8000 0.889 0.941 0.970 0.985 0.992
8 0.2000 0.3333 0.5000 0.6667 0.800 0.889 0.941 0.970 0.985
16 0.1111 0.2000 0.3333 0.5000 0.667 0.800 0.889 0.941 0.970
32 0.0588 0.1111 0.2000 0.3333 0.500 0.667 0.800 0.889 0.941
64 0.0303 0.0588 0.1111 0.2000 0.333 0.500 0.667 0.800 0.889
128 0.0154 0.0303 0.0588 0.1111 0.200 0.333 0.500 0.667 0.800
256 0.0078 0.0154 0.0303 0.0588 0.111 0.200 0.333 0.500 0.667

This is what is implemented in the package.