Title: | Regression Analysis for Very Large Data Sets via Merge and Reduce |
---|---|
Description: | Frequentist and Bayesian linear regression for large data sets. Useful when the data does not fit into memory (for both frequentist and Bayesian regression), to make running time manageable (mainly for Bayesian regression), and to reduce the total running time because of reduced or less severe memory-spillover into the virtual memory. This is an implementation of Merge & Reduce for linear regression as described in Geppert, L.N., Ickstadt, K., Munteanu, A., & Sohler, C. (2020). 'Streaming statistical models via Merge & Reduce'. International Journal of Data Science and Analytics, 1-17, <doi:10.1007/s41060-020-00226-0>. |
Authors: | Esther Denecke [aut], Leo N. Geppert [aut, cre], Steffen Maletz [ctb], R Core Team [ctb] |
Maintainer: | Leo N. Geppert <[email protected]> |
License: | GPL-2 | GPL-3 |
Version: | 1.0.0 |
Built: | 2024-11-01 03:23:28 UTC |
Source: | https://github.com/cran/mrregression |
Simulated data set with 1500 observations for illustrative purposes.
exampleData
A data frame with 1500 rows and 11 variables where V1-V10 are the predictors and V11 is the dependent variable.
mrbayes is used to conduct Bayesian linear regression on very large data sets
using Merge and Reduce as described in Geppert et al. (2020).
The package rstan needs to be installed. When the function is called, this is
checked using requireNamespace, as suggested by Hadley Wickham in "R packages"
(section Dependencies, http://r-pkgs.had.co.nz/description.html, accessed 2020-07-31).
mrbayes(
  y,
  intercept = TRUE,
  fileMr = NULL,
  dataMr = NULL,
  obsPerBlock,
  dataStan = NULL,
  sep = "auto",
  dec = ".",
  header = TRUE,
  naStrings = "NA",
  colNames = NULL,
  naAction = na.fail,
  ...
)
y |
|
intercept |
|
fileMr |
( |
dataMr |
( |
obsPerBlock |
|
dataStan |
( |
sep |
See documentation of |
dec |
See documentation of |
header |
|
naStrings |
|
colNames |
|
naAction |
|
... |
Further optional arguments to be passed on to
|
Returns an object of class "mrbayes", which is a list containing the following components:
level |
Number of the level of the final model in Merge and Reduce. This is equal
to |
numberObs |
The total number of observations. |
summaryStats |
Summary statistics including the mean, median, quartiles, 2.5% and 97.5% quantiles of the posterior distributions for each regression coefficient and the error term's standard deviation sigma. |
diagnostics |
Effective sample size (n_eff) and potential scale reduction factor on split chains (Rhat) calculated from the output of summary,stanfit-method. Note that, using Merge and Reduce, for each regression coefficient only one value is reported: For n_eff the minimum observed value on level 1 is reported and for Rhat the maximum observed value on level 1 is reported. |
modelCode |
The model. Syntax
as in argument |
dataHead |
First six rows of the data in the first block. This serves
as a sanity check, especially when using the argument |
The default dataStan makes use of all predictors:
dataStan = list(n = nrow(currentBlock),
                d = (ncol(currentBlock) - 1),
                X = currentBlock[, -colNumY],
                y = currentBlock[, colNumY])
where currentBlock is the current block of data to be evaluated, n is the number
of observations, d is the number of variables (without intercept), X contains
the predictors, and y is the dependent variable. colNumY is the column number of
the dependent variable, which the function determines internally.
When specifying the argument dataStan, note two things:
1. Please use the syntax of the default dataStan, i.e. the object containing the
data of the block to be evaluated is called currentBlock, the number of
observations must be set to n = nrow(currentBlock), d needs to be set to the
number of variables without intercept, the dependent variable must be named y,
and the independent variables must be named X.
2. The expressions within the list must be unevaluated. Therefore, use the
function quote.
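The need for quote can be illustrated with a small base R sketch. It does not call mrbayes; the helper evalDataStan, the data, and the block split are hypothetical stand-ins that only mimic evaluating a dataStan-style list once per block. How mrbayes evaluates the list internally is not shown in this documentation.

```r
# Illustrative data: two predictors and a response (all names are made up).
set.seed(1)
dat <- data.frame(x1 = rnorm(6), x2 = rnorm(6), y = rnorm(6))

# A dataStan-style list that uses only x1 as predictor. Entries that refer to
# currentBlock are wrapped in quote(), so they are stored as unevaluated
# expressions; without quote() R would fail here because currentBlock does
# not exist yet.
myDataStan <- list(
  n = quote(nrow(currentBlock)),
  d = 1,
  X = quote(as.matrix(currentBlock[, "x1", drop = FALSE])),
  y = quote(currentBlock[, colNumY])
)

# Hypothetical stand-in for a per-block evaluation (NOT the package's code):
# each expression is evaluated where currentBlock and colNumY are defined.
evalDataStan <- function(dataStan, currentBlock, colNumY) {
  env <- list2env(list(currentBlock = currentBlock, colNumY = colNumY))
  lapply(dataStan, eval, envir = env)
}

# Two blocks of three observations each; y is column 3.
blocks <- split(dat, rep(1:2, each = 3))
stanData <- lapply(blocks, evalDataStan, dataStan = myDataStan, colNumY = 3)

str(stanData[[1]])
```

Each element of stanData then holds n, d, X and y for one block, which is the structure the default dataStan describes.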
Geppert, L.N., Ickstadt, K., Munteanu, A., & Sohler, C. (2020).
Streaming statistical models via Merge & Reduce. International Journal
of Data Science and Analytics, 1-17,
doi: https://doi.org/10.1007/s41060-020-00226-0
# Package rstan needs to be installed for running this example.
if (requireNamespace("rstan", quietly = TRUE)) {
  n = 2000
  p = 4
  set.seed(34)
  x1 = rnorm(n, 10, 2)
  x2 = rnorm(n, 5, 3)
  x3 = rnorm(n, -2, 1)
  x4 = rnorm(n, 0, 5)
  y = 2.4 - 0.6 * x1 + 5.5 * x2 - 7.2 * x3 + 5.7 * x4 + rnorm(n)
  data = data.frame(x1, x2, x3, x4, y)

  normalmodell = '
  data {
    int<lower=0> n;
    int<lower=0> d;
    matrix[n,d] X; // predictor matrix
    vector[n] y; // outcome vector
  }
  parameters {
    real alpha; // intercept
    vector[d] beta; // coefficients for predictors
    real<lower=0> sigma; // error scale
  }
  model {
    y ~ normal(alpha + X * beta, sigma); // likelihood
  }
  '

  datas = list(n = nrow(data), d = ncol(data)-1,
               y = data[, dim(data)[2]], X = data[, 1:(dim(data)[2]-1)])
  fit0 = rstan::stan(model_code = normalmodell, data = datas, chains = 4, iter = 1000)
  fit1 = mrbayes(dataMr = data, obsPerBlock = 500, y = 'y')
}
mrfrequentist is used to conduct frequentist linear regression on very large
data sets using Merge and Reduce as described in Geppert et al. (2020).
mrfrequentist(
  formula,
  fileMr = NULL,
  dataMr = NULL,
  obsPerBlock,
  approach = c("1", "3"),
  sep = "auto",
  dec = ".",
  header = TRUE,
  naStrings = "NA",
  colNames = NULL,
  naAction = na.fail
)
formula |
|
fileMr |
( |
dataMr |
( |
obsPerBlock |
|
approach |
|
sep |
See documentation of |
dec |
See documentation of |
header |
|
naStrings |
|
colNames |
|
naAction |
|
Returns an object of class "mrfrequentist", which is a list containing the
following components for both approaches "1" and "3":
approach |
The approach used for merging the models. Either "1" or "3". |
formula |
The model's |
level |
Number of the level of the final model in Merge and Reduce. This is equal
to |
numberObs |
The total number of observations. |
summaryStats |
Summary statistics reporting the estimated regression coefficients
and their unbiased standard errors. Estimates are based
on the merge technique as specified in the argument |
dataHead |
First six rows of the data in the first block. This serves
as a sanity check, especially when using the argument |
terms |
Terms object. |
Additionally for approach "3" only:
XTX |
The final model's |
yTX |
The final model's |
yTy |
The final model's |
In approach "3" the estimated regression coefficients and their unbiased
standard errors are calculated via QR decompositions on X'X (as in speedlm with
argument method = "qr"). Moreover, the merge step uses the same idea of
blockwise addition for X'X, y'y and y'X as speedglm's updating procedure
updateWithMoreData. Conceptually, though, Merge and Reduce is not an updating
algorithm: it merges models that are based on comparable amounts of data along
a tree structure to obtain a final model.
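The idea of blockwise addition described above can be sketched in base R. This is a minimal illustration under stated assumptions, not the package's implementation: the simulated data, the block size, and the helper mergeStats are made up, and Reduce folds the blocks from left to right rather than along a tree (the sums are the same either way because addition is associative).

```r
set.seed(42)
n <- 1200
X <- cbind(1, matrix(rnorm(n * 3), n, 3))      # design matrix incl. intercept
y <- drop(X %*% c(2, -1, 0.5, 3) + rnorm(n))

# Assign observations to blocks of 300 rows each.
blockId <- ceiling(seq_len(n) / 300)

# Per-block summaries X'X, y'X, y'y and the block size.
stats <- lapply(split(seq_len(n), blockId), function(idx) {
  Xb <- X[idx, , drop = FALSE]
  yb <- y[idx]
  list(XTX = crossprod(Xb),       # t(Xb) %*% Xb
       yTX = crossprod(yb, Xb),   # t(yb) %*% Xb, a 1 x p matrix
       yTy = sum(yb^2),
       n   = length(idx))
})

# Merge step (hypothetical helper): elementwise addition of two summaries.
mergeStats <- function(a, b) Map(`+`, a, b)
total <- Reduce(mergeStats, stats)

# Coefficients from a QR decomposition of the merged X'X.
qrXTX <- qr(total$XTX)
beta  <- drop(qr.coef(qrXTX, t(total$yTX)))

# Unbiased residual variance and standard errors from the merged summaries.
rss    <- total$yTy - sum(beta * drop(total$yTX))
sigma2 <- rss / (total$n - length(beta))
se     <- sqrt(sigma2 * diag(qr.solve(qrXTX)))

# Reference fit on the full data for comparison.
ref <- lm(y ~ X[, -1])
cbind(merged = beta, lm = unname(coef(ref)))
```

Because only the summed matrices are needed, no block has to stay in memory after its summaries have been computed.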
Geppert, L.N., Ickstadt, K., Munteanu, A., & Sohler, C. (2020).
Streaming statistical models via Merge & Reduce. International Journal
of Data Science and Analytics, 1-17,
doi: https://doi.org/10.1007/s41060-020-00226-0
## run mrfrequentist() with dataMr
data(exampleData)
fit1 = mrfrequentist(dataMr = exampleData, approach = "1", obsPerBlock = 300,
                     formula = V11 ~ .)

## run mrfrequentist() with fileMr
filepath = system.file("extdata", "exampleFile.txt", package = "mrregression")
fit2 = mrfrequentist(fileMr = filepath, approach = "3", header = TRUE,
                     obsPerBlock = 100, formula = y ~ .)
Frequentist and Bayesian linear regression for large data sets. Useful when
the data does not fit into memory (for both frequentist and Bayesian
regression), to make running time manageable (mainly for Bayesian
regression), and to reduce the total running time because of reduced or
less severe memory-spillover into the virtual memory.
The package contains the two main functions mrfrequentist and mrbayes as well
as several S3 methods listed below. Note that currently only numerical
predictors are supported. Factor variables can be included in the model in
dummy-coded form, e.g. using model.matrix. However, this may lead to highly
variable or even unreliable estimates / posterior distributions if levels are
not represented well in every single block. It is solely the user's
responsibility to check that this is not the case!
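One way to dummy-code a factor and to check how well its levels are represented per block is sketched below in base R. The data, the block size, and the assumption that blocks consist of consecutive rows are illustrative only; how the package forms its blocks should be checked against the arguments obsPerBlock, dataMr and fileMr.

```r
set.seed(7)
dat <- data.frame(
  x = rnorm(12),
  g = factor(rep(c("a", "b", "c"), times = c(6, 5, 1))),
  y = rnorm(12)
)

# Dummy-code the factor with model.matrix and drop the intercept column,
# leaving one 0/1 column per non-reference level (gb, gc).
dummies <- model.matrix(~ g, data = dat)[, -1, drop = FALSE]
datNum  <- data.frame(x = dat$x, dummies, y = dat$y)

# Count the occurrences of each level per block, assuming blocks of
# obsPerBlock consecutive rows (an assumption made for this sketch).
obsPerBlock <- 4
blockId     <- ceiling(seq_len(nrow(dat)) / obsPerBlock)
levelCounts <- table(block = blockId, level = dat$g)
print(levelCounts)

# Blocks in which at least one level does not occur at all.
problemBlocks <- rownames(levelCounts)[apply(levelCounts == 0, 1, any)]
problemBlocks
```

If problemBlocks is not empty, shuffling the rows before the analysis or choosing a larger obsPerBlock are possible remedies; whether they suffice depends on the data.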
## S3 method for class 'mrfrequentist'
coef(object, ...)

## S3 method for class 'mrfrequentist'
nobs(object, ...)

## S3 method for class 'mrfrequentist'
predict(object, data, ...)

## S3 method for class 'mrfrequentist'
summary(object, ...)

## S3 method for class 'summary.mrfrequentist'
print(x, ...)

## S3 method for class 'mrbayes'
nobs(object, ...)

## S3 method for class 'mrbayes'
summary(object, ...)

## S3 method for class 'summary.mrbayes'
print(x, ...)
object |
Object of class |
... |
Currently only useful for method |
data |
A |
x |
Object of class |
Geppert, L.N., Ickstadt, K., Munteanu, A., & Sohler, C. (2020).
Streaming statistical models via Merge & Reduce. International Journal
of Data Science and Analytics, 1-17,
doi: https://doi.org/10.1007/s41060-020-00226-0