Calculates distances (returning a symmetric matrix) from a raw data matrix in .omv-files for the statistical spreadsheet 'jamovi' (https://www.jamovi.org)
Source: R/distances_omv.R (distances_omv.Rd)
Arguments
- dtaInp
Either a data frame or the name of a data file to be read (including the path, if required; "FILENAME.ext"; default: NULL); files can be of any supported file type, see Details below.
- fleOut
Name of the data file to be written (including the path, if required; "FILE_OUT.omv"; default: ""); if empty, the resulting data frame is returned instead.
- varDst
Variable (default: c()) containing a character vector with the names of the variables for which distances are to be calculated. See Details for more information.
- clmDst
Whether the distances shall be calculated between columns (TRUE) or rows (FALSE; default: TRUE). See Details for more information.
- stdDst
Character string indicating whether the variables in varDst are to be standardized and how (default: "none"). See Details for more information.
- nmeDst
Character string indicating which distance measure is to be calculated (default: "euclid"). See Details for more information.
- mtxSps
Whether the symmetric matrix to be returned should be sparse (default: FALSE)
- mtxTrL
Whether the symmetric matrix to be returned should only contain the lower triangular (default: FALSE)
- mtxDgn
Whether the symmetric matrix to be returned should retain the values in the main diagonal (default: TRUE)
- usePkg
Name of the package: "foreign" or "haven" that shall be used to read SPSS, Stata, and SAS files; "foreign" is the default (it comes with base R), but "haven" is newer and more comprehensive.
- selSet
Name of the data set that is to be selected from the workspace (only applies when reading .RData-files)
- ...
Additional arguments passed on to methods; see Details below.
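A minimal call that combines several of the arguments above might look as follows. This is only a sketch: it assumes that the jmvReadWrite package is installed, and the data frame dtaTst and its variable names are made up for illustration. Because fleOut is left empty, the result is returned rather than written to a file.

```r
# sketch only; the call is skipped if jmvReadWrite is not installed
if (requireNamespace("jmvReadWrite", quietly = TRUE)) {
  set.seed(1)
  # hypothetical example data: 20 rows x 4 continuous variables
  dtaTst <- stats::setNames(as.data.frame(matrix(rnorm(80), ncol = 4)),
                            sprintf("V_%02d", seq(4)))
  # distances between columns, z-standardized, lower triangular only;
  # fleOut is not given, so a data frame is returned
  dstFrm <- jmvReadWrite::distances_omv(dtaInp = dtaTst,
                                        varDst = names(dtaTst),
                                        clmDst = TRUE, stdDst = "z",
                                        nmeDst = "euclid", mtxTrL = TRUE)
  print(dstFrm)
}
```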
Value
A data frame containing a symmetric matrix with the distances between the variables / columns (clmDst == TRUE) or rows (clmDst == FALSE); it is only returned if fleOut is empty.
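The layout options for the returned matrix (mtxTrL, mtxDgn and mtxSps) can be pictured with a small base-R sketch. It builds a distance matrix with stats::dist() from made-up data (dtaTst is hypothetical) and mimics the layouts by hand. It is not the package's own code, and the sparse layout is an interpretation of the description given under Details.

```r
set.seed(1)
# hypothetical example data: 20 rows x 4 variables
dtaTst <- stats::setNames(as.data.frame(matrix(rnorm(80), ncol = 4)),
                          sprintf("V_%02d", seq(4)))
# full symmetric matrix (V x V) of Euclidean distances between columns
dstFll <- as.matrix(stats::dist(t(dtaTst)))
# lower triangular only (cf. mtxTrL = TRUE): upper triangle becomes NA
dstTrL <- dstFll
dstTrL[upper.tri(dstTrL)] <- NA
# main diagonal removed (cf. mtxDgn = FALSE)
dstDgn <- dstFll
diag(dstDgn) <- NA
# sparse layout (cf. mtxSps = TRUE): rows 2..V, columns 1..(V - 1),
# keeping only values from the lower triangle of the full matrix
dstSps <- dstFll[-1, -ncol(dstFll)]
dstSps[upper.tri(dstSps)] <- NA
dim(dstFll)
dim(dstSps)
```

With four variables, the full matrix is 4 x 4 and the sparse one is 3 x 3.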
Details
varDst must be a character vector containing the variables to calculate distances over. If clmDst is set to TRUE, distances are calculated between all possible variable pairs and over subjects / rows in the original data frame. If clmDst is set to FALSE, distances are calculated between participants and over all variables given in varDst. If clmDst is set to TRUE, the symmetric matrix that is returned has the size V x V (V being the number of variables in varDst; if mtxSps is set to TRUE, the size is (V - 1) x (V - 1), see below); if clmDst is set to FALSE, the symmetric matrix that is returned has the size R x R (R being the number of rows in the original dataset; if mtxSps is set to TRUE, the size is (R - 1) x (R - 1), see below).
stdDst can be one of the following calculations to standardize the selected variables before calculating the distances: "none" (do not standardize; default), "z" (z scores), "sd" (divide by the std. dev.), "range" (divide by the range), "max" (divide by the absolute maximum), "mean" (divide by the mean), "rescale" (subtract the mean and divide by the range).
nmeDst can be one of the following distance measures:
(1) For interval data: "euclid" (Euclidean), "seuclid" (squared Euclidean), "block" (city block / Manhattan), "canberra" (Canberra), "chebychev" (maximum distance / supremum norm / Chebychev), "minkowski_p" (Minkowski with power p; NB: needs p), "power_p_r" (Minkowski with power p, and the r-th root; NB: needs p and r), "cosine" (cosine between the two vectors), "correlation" (correlation between the two vectors).
(2) For frequency count data: "chisq" (chi-square dissimilarity between two sets of frequencies), "ph2" (chi-square dissimilarity normalized by the square root of the number of values used in the calculation).
(3) For binary data, all measures have two optional parts, p and np, which indicate presence (p; defaults to 1 if not given) or absence (np; defaults to zero if not given).
(a) matching coefficients: "rr_p_np" (Russell and Rao), "sm_p_np" (simple matching), "jaccard_p_np" / "jaccards_p_np" (Jaccard similarity; as in SPSS), "jaccardd_p_np" (Jaccard dissimilarity; as in dist(..., "binary") in R), "dice_p_np" (Dice or Czekanowski or Sorensen similarity), "ss1_p_np" (Sokal and Sneath measure 1), "rt_p_np" (Rogers and Tanimoto), "ss2_p_np" (Sokal and Sneath measure 2), "k1_p_np" (Kulczynski measure 1), "ss3_p_np" (Sokal and Sneath measure 3).
(b) conditional probabilities: "k2_p_np" (Kulczynski measure 2), "ss4_p_np" (Sokal and Sneath measure 4), "hamann_p_np" (Hamann).
(c) predictability measures: "lambda_p_np" (Goodman and Kruskal Lambda), "d_p_np" (Anderberg’s D), "y_p_np" (Yule’s Y coefficient of colligation), "q_p_np" (Yule’s Q).
(d) other measures: "ochiai_p_np" (Ochiai), "ss5_p_np" (Sokal and Sneath measure 5), "phi_p_np" (fourfold point correlation), "beuclid_p_np" (binary Euclidean distance), "bseuclid_p_np" (binary squared Euclidean distance), "size_p_np" (size difference), "pattern_p_np" (pattern difference), "bshape_p_np" (binary shape difference), "disper_p_np" (dispersion similarity), "variance_p_np" (variance dissimilarity), "blwmn_p_np" (binary Lance and Williams non-metric dissimilarity).
(4) "none" (only carry out standardization, if stdDst is different from "none").
If mtxSps is set to TRUE, a sparse matrix is returned. Such matrices are similar to the format one often finds for correlation matrices. Values are only retained in the lower triangular: the columns range from the first to the second-to-last variable in varDst, and the rows range from the second to the last variable in varDst (when clmDst is set to FALSE, the same applies to the rows of the original dataset instead of the variables).
By default, a full symmetric matrix is returned (i.e., a matrix that has no NAs in any cell). This behaviour can be changed by setting mtxTrL and mtxDgn: if mtxTrL is set to TRUE, the values from the upper triangular matrix are removed / replaced with NAs; if mtxDgn is set to FALSE, the values from the main diagonal are removed / replaced with NAs.
The ellipsis-parameter (...) can be used to submit arguments / parameters to the functions that are used for reading and writing the data. By clicking on the respective function under “See also”, you can get a more detailed overview of which parameters each of those functions takes. The functions are: read_omv and write_omv (for jamovi-files), read.table (for CSV / TSV files; using similar defaults as read.csv for CSV and read.delim for TSV which both are based upon read.table), load (for .RData-files), readRDS (for .rds-files), read_sav (needs the R-package haven) or read.spss (needs the R-package foreign) for SPSS-files, read_dta (haven) / read.dta (foreign) for Stata-files, read_sas (haven) for SAS-data-files, and read_xpt (haven) / read.xport (foreign) for SAS-transport-files. If you would like to use haven, you may need to install it using install.packages("haven", dep = TRUE).
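The standardization options listed for stdDst can be written out in a few lines of base R. The helper stdVar below is hypothetical and not part of jmvReadWrite; it simply follows the wording of the descriptions above (for instance, "rescale" is implemented literally as subtracting the mean and dividing by the range), so the package's actual computation may differ in detail.

```r
# hypothetical helper (not part of jmvReadWrite): standardizes one numeric
# vector according to the verbal descriptions of the stdDst options
stdVar <- function(x, stdDst = "none") {
  rng <- diff(range(x))  # range = maximum - minimum
  switch(stdDst,
         none    = x,
         z       = (x - mean(x)) / stats::sd(x),
         sd      = x / stats::sd(x),
         range   = x / rng,
         max     = x / max(abs(x)),
         mean    = x / mean(x),
         rescale = (x - mean(x)) / rng,
         stop("Unknown standardization: ", stdDst))
}

x <- c(2, 4, 6, 8)
stdVar(x, "z")      # mean 0, standard deviation 1
stdVar(x, "max")    # 0.25 0.50 0.75 1.00
stdVar(x, "range")  # x divided by 6
```

To standardize several variables before computing distances, such a helper could be applied column-wise, e.g. with lapply() over the variables named in varDst.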
See also
distances_omv internally uses the following function for calculating the distances for interval data: stats::dist(). It furthermore uses the following functions for reading and writing data files in different formats: read_omv() and write_omv() for jamovi-files, utils::read.table() for CSV / TSV files, load() for reading .RData-files, readRDS() for .rds-files, haven::read_sav() or foreign::read.spss() for SPSS-files, haven::read_dta() or foreign::read.dta() for Stata-files, haven::read_sas() for SAS-data-files, and haven::read_xpt() or foreign::read.xport() for SAS-transport-files.
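Since stats::dist() is named as the function used for interval data, a short base-R sketch may help to see how it behaves. dist() always compares the rows of its input, so distances between columns require a transposed input. The mapping of measure names to dist() methods shown in the comments ("block" to "manhattan", "chebychev" to "maximum") is an assumption based on the usual definitions, not taken from the package source; dtaTst is made-up data.

```r
set.seed(1)
# hypothetical example data: 20 rows x 5 continuous variables
dtaTst <- as.data.frame(matrix(rnorm(100, sd = 10), ncol = 5))

# dist() works on rows: transpose to get distances between columns
dstClm <- stats::dist(t(dtaTst), method = "euclidean")  # 5 x 5 as a matrix
dstRow <- stats::dist(dtaTst, method = "euclidean")     # 20 x 20 as a matrix

# assumed correspondences to some of the interval measures
dstBlk <- stats::dist(t(dtaTst), method = "manhattan")         # city block
dstChb <- stats::dist(t(dtaTst), method = "maximum")           # Chebychev
dstMnk <- stats::dist(t(dtaTst), method = "minkowski", p = 3)  # Minkowski, p = 3
dstSqE <- dstClm ^ 2                                           # squared Euclidean

dim(as.matrix(dstClm))
dim(as.matrix(dstRow))
```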
Examples
if (FALSE) { # \dontrun{
# create matrices for the different types of distance measures: continuous
# (cntFrm), frequency counts (frqFrm) or binary (binFrm); all 20 R x 5 C
set.seed(1)
cntFrm <- stats::setNames(as.data.frame(matrix(rnorm(100, sd = 10),
ncol = 5)), sprintf("C_%02d", seq(5)))
frqFrm <- stats::setNames(as.data.frame(matrix(sample(seq(10), 100,
replace = TRUE), ncol = 5)), sprintf("F_%02d", seq(5)))
binFrm <- stats::setNames(as.data.frame(matrix(sample(c(TRUE, FALSE), 100,
replace = TRUE), ncol = 5)), sprintf("B_%02d", seq(5)))
nmeOut <- tempfile(fileext = ".omv")
# calculates the distances between columns, nmeDst is not required: "euclid"
# is the default
jmvReadWrite::distances_omv(dtaInp = cntFrm, fleOut = nmeOut, varDst =
names(cntFrm), nmeDst = "euclid")
dtaFrm <- jmvReadWrite::read_omv(nmeOut)
unlink(nmeOut)
# the resulting matrix (5 x 5) with the Euclidean distances
print(dtaFrm)
# calculates the (Euclidean) distances between rows (clmDst = FALSE)
jmvReadWrite::distances_omv(dtaInp = cntFrm, fleOut = nmeOut, varDst =
names(cntFrm), clmDst = FALSE, nmeDst = "euclid")
dtaFrm <- jmvReadWrite::read_omv(nmeOut)
unlink(nmeOut)
# the resulting matrix (20 x 20) with the Euclidean distances
print(dtaFrm)
# calculates the (Euclidean) distances between columns; the original data
# are z-standardized before calculating the distances (stdDst = "z")
jmvReadWrite::distances_omv(dtaInp = cntFrm, fleOut = nmeOut, varDst =
names(cntFrm), stdDst = "z", nmeDst = "euclid")
dtaFrm <- jmvReadWrite::read_omv(nmeOut)
unlink(nmeOut)
# the resulting matrix (5 x 5) with the Euclidean distances using the
# z-standardized data
print(dtaFrm)
# calculates the correlations between columns
jmvReadWrite::distances_omv(dtaInp = cntFrm, fleOut = nmeOut, varDst =
names(cntFrm), nmeDst = "correlation")
dtaFrm <- jmvReadWrite::read_omv(nmeOut)
unlink(nmeOut)
# the resulting matrix (5 x 5) with the correlations
print(dtaFrm)
# calculates the chi-square dissimilarity (nmeDst = "chisq") between columns
jmvReadWrite::distances_omv(dtaInp = frqFrm, fleOut = nmeOut, varDst =
names(frqFrm), nmeDst = "chisq")
dtaFrm <- jmvReadWrite::read_omv(nmeOut)
unlink(nmeOut)
# the resulting matrix (5 x 5) with the chi-square dissimilarities
print(dtaFrm)
# calculates the Jaccard similarity (nmeDst = "jaccard") between columns
jmvReadWrite::distances_omv(dtaInp = binFrm, fleOut = nmeOut, varDst =
names(binFrm), nmeDst = "jaccard")
dtaFrm <- jmvReadWrite::read_omv(nmeOut)
unlink(nmeOut)
# the resulting matrix (5 x 5) with the Jaccard similarities
print(dtaFrm)
} # }
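To make the difference between the Jaccard similarity and the Jaccard dissimilarity more concrete, the following base-R sketch computes both by hand for one pair of binary variables. The variable names are made up, presence is coded as TRUE (corresponding to the default p = 1), and the code illustrates the textbook definitions rather than the package's implementation.

```r
set.seed(1)
# hypothetical binary data: 20 rows x 5 variables
binTst <- as.data.frame(matrix(sample(c(TRUE, FALSE), 100, replace = TRUE),
                               ncol = 5))
x <- binTst[[1]]
y <- binTst[[2]]

nBth <- sum(x & y)    # present in both variables
nOnX <- sum(x & !y)   # present only in the first variable
nOnY <- sum(!x & y)   # present only in the second variable

# Jaccard similarity: joint presences over cases with at least one presence
jccSim <- nBth / (nBth + nOnX + nOnY)

# dist(..., "binary") returns the corresponding dissimilarity
binMtx <- t(as.matrix(binTst)) * 1   # numeric 0 / 1, variables in rows
jccDss <- as.matrix(stats::dist(binMtx, method = "binary"))[1, 2]

all.equal(jccDss, 1 - jccSim)
```

The last line checks that the dissimilarity equals one minus the similarity.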