R/merge_cols_omv.R
merge_cols_omv.Rd
Merges two or more data files by adding the content of other input files as columns to the first input file and outputs them as files for the statistical spreadsheet 'jamovi' (https://www.jamovi.org)
Vector with the file names of the input files (including the path, if required; c("FILE_IN1.omv", "FILE_IN2.omv"); default: c()); can be any supported file type, see Details below
Name of the data file to be written (including the path, if required; "FILE_OUT.omv"; default: ""); if empty, the data frame with the added columns is returned as variable (but not written)
Type of merging operation: "outer" (default), "inner", "left" or "right"; see Details below
Name of the variable by which the data sets are matched; can either be a string, a character vector or a list (see Details below; default: list())
Variable(s) that are used to sort the data frame (see Details; if empty, the order after merging is kept; default: c())
Name of the package: "foreign" or "haven" that shall be used to read SPSS, Stata and SAS files; "foreign" is the default (it comes with base R), but "haven" is newer and more comprehensive
Name of the data set that is to be selected from the workspace (only applies when reading .RData-files)
Additional arguments passed on to methods; see Details below
A data frame (only returned if fleOut is empty) in which the columns of all input data sets (in the files given to fleInp) are concatenated
There are four different types of merging operations: "outer" keeps all cases (but columns in the resulting data set may be empty if they did not contain values in some input data sets), "inner" keeps
only those cases where all datasets contain the same value in the matching variable, for "left" all cases from the first data set in fleInp are kept (whereas cases that are only contained in input data
set two or higher are dropped), for "right" all cases from the second (or any higher) data set in fleInp are kept. The behaviour of "left" and "right" may be somewhat difficult to predict in case of
merging several data sets, therefore "outer" might be a safer choice if several data sets are merged.
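The four merge types can be tried out on two toy data frames with base R's merge, which this function uses for adding columns. This is a minimal sketch: dtaA and dtaB are made-up data, and the mapping of "outer" / "inner" / "left" / "right" onto the all, all.x and all.y arguments is the usual join semantics, assumed here rather than taken from the function's source.

```r
# Toy data sets (hypothetical, for illustration only)
dtaA <- data.frame(ID = c(1, 2, 3), A1 = c(10, 20, 30))
dtaB <- data.frame(ID = c(2, 3, 4), B1 = c(200, 300, 400))

# "outer": keep all cases from both data sets (IDs 1 to 4)
mrgOut <- merge(dtaA, dtaB, by = "ID", all = TRUE)
# "inner": keep only cases present in both data sets (IDs 2 and 3)
mrgInn <- merge(dtaA, dtaB, by = "ID")
# "left": keep all cases from the first data set (IDs 1 to 3)
mrgLft <- merge(dtaA, dtaB, by = "ID", all.x = TRUE)
# "right": keep all cases from the second data set (IDs 2 to 4)
mrgRgt <- merge(dtaA, dtaB, by = "ID", all.y = TRUE)
```

In the "outer" result, B1 is NA for ID 1 and A1 is NA for ID 4, which is what "columns may be empty" refers to.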
The variable that is used for matching (varBy) can either be a string (if all datasets contain a matching variable with the same name), a character vector (containing several matching variables that
are the same for all data sets) or a list with the same length as fleInp. In the latter case, each cell of that list can again contain either a string (one matching variable for each data set in fleInp)
or a character vector (several matching variables for each data set in fleInp; NB: all character vectors in the cells of the list must have the same length as it is necessary to always use the same
number of matching variables when merging).
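The three accepted shapes of varBy can be written down as follows. The variable names (ID, wave, PID, time) are hypothetical, and the merge call at the end only illustrates what matching on differently named variables means in base R; it is a sketch, not the function's actual implementation.

```r
# One matching variable with the same name in all data sets
varBy1 <- "ID"
# Several matching variables with the same names in all data sets
varBy2 <- c("ID", "wave")
# One list entry per input file (here: two files), names may differ between files
varBy3 <- list(c("ID", "wave"), c("PID", "time"))

# All list entries need the same number of matching variables
stopifnot(length(unique(lengths(varBy3))) == 1)

# Toy data sets (hypothetical) whose matching variables are named differently
dtaA <- data.frame(ID  = c(1, 1, 2), wave = c(1, 2, 1), A1 = c(5, 6, 7))
dtaB <- data.frame(PID = c(1, 1, 2), time = c(1, 2, 1), B1 = c(8, 9, 10))
# base R equivalent of the list form: by.x for the first, by.y for the second data set
dtaMrg <- merge(dtaA, dtaB, by.x = varBy3[[1]], by.y = varBy3[[2]], all = TRUE)
```

The merged data frame keeps the matching variable names of the first data set (ID, wave) and appends A1 and B1.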
The ellipsis parameter (...) can be used to pass arguments to the functions that are used for merging or reading the data. Adding columns uses merge. When reading the data, the functions are:
read_omv (for jamovi files), read.table (for CSV / TSV files; using similar defaults as read.csv for CSV and read.delim for TSV, which both are based upon read.table but with adjusted defaults for
the respective file types), readRDS (for rds files), read_sav (needs R package "haven") or read.spss (needs R package "foreign") for SPSS files, read_dta ("haven") / read.dta ("foreign") for Stata
files, read_sas ("haven") for SAS data files, and read_xpt ("haven") / read.xport ("foreign") for SAS transport files. If you would like to use "haven", you may need to install it manually (i.e.,
install.packages("haven", dep = TRUE)).
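How arguments given via the ellipsis can reach a reader such as read.table may be sketched like this. The wrapper rdTbl is hypothetical and not part of the package; it assumes read.csv-like defaults that the caller can override, for example for a semicolon-separated file with decimal commas.

```r
# Hypothetical wrapper: read.table with read.csv-like defaults, overridable via ...
rdTbl <- function(fleNme, ...) {
    argDfl <- list(header = TRUE, sep = ",", quote = "\"", dec = ".",
                   fill = TRUE, comment.char = "")
    # arguments given via ... replace defaults of the same name
    argUse <- utils::modifyList(argDfl, list(...))
    do.call(utils::read.table, c(list(file = fleNme), argUse))
}

# Write a small semicolon-separated file with decimal commas and read it back
fleTmp <- tempfile(fileext = ".csv")
writeLines(c("ID;A1", "1;2,5", "2;3,5"), fleTmp)
dtaTmp <- rdTbl(fleTmp, sep = ";", dec = ",")
unlink(fleTmp)
```

Merging the defaults with modifyList avoids the "matched by multiple actual arguments" error that would occur if sep were both hard-coded and passed through the ellipsis.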
if (FALSE) {
library(jmvReadWrite);
dtaInp <- bfi_sample2;
nmeInp <- paste0(tempfile(), "_", 1:3, ".rds");
nmeOut <- paste0(tempfile(), ".omv");
for (i in seq_along(nmeInp)) {
saveRDS(stats::setNames(dtaInp, c("ID", paste0(names(dtaInp)[-1], "_", i))), nmeInp[i]);
}
# save dtaInp three times (i.e., the length of nmeInp), adding "_" + 1 ... 3 as index
# to the data variables (A1 ... O5, gender, age → A1_1, ...)
merge_cols_omv(fleInp = nmeInp, fleOut = nmeOut, varBy = "ID");
cat(file.info(nmeOut)$size);
# -> 17731 (size may differ on different OSes)
dtaOut <- read_omv(nmeOut, sveAtt = FALSE);
# read the data set where the three original datasets were added as columns and show
# the variable names
cat(names(dtaOut));
cat(names(dtaInp));
# compared to the input data set, we have the same names (except for "ID", which was
# used for matching; every other variable now carries a suffix indicating which data
# set it came from)
cat(dim(dtaInp), dim(dtaOut));
# the first dimension of the data sets (rows) stayed the same (250), whereas the
# second dimension is now approx. three times as large (28 -> 82):
# (28 - 1 [for "ID"]) * 3 + 1 [for "ID"] = 27 * 3 + 1 = 82
cat(colMeans(dtaInp[2:11]));
cat(colMeans(dtaOut[2:11]));
# it is therefore no surprise that the column means for the first 10 variables of
# dtaInp and dtaOut are the same too
unlink(nmeInp);
unlink(nmeOut);
}