
Calculates distances (returning a symmetric matrix) from a raw data matrix in .omv-files for the statistical spreadsheet 'jamovi' (https://www.jamovi.org)

Usage

distances_omv(
  dtaInp = NULL,
  fleOut = "",
  varDst = c(),
  clmDst = TRUE,
  stdDst = "none",
  nmeDst = "euclid",
  mtxSps = FALSE,
  mtxTrL = FALSE,
  mtxDgn = TRUE,
  usePkg = c("foreign", "haven"),
  selSet = "",
  ...
)

Arguments

dtaInp

Either a data frame or the name of a data file to be read (including the path, if required; "FILENAME.ext"; default: NULL); files can be of any supported file type, see Details below.

fleOut

Name of the data file to be written (including the path, if required; "FILE_OUT.omv"; default: ""); if empty, the resulting data frame is returned instead.

varDst

Character vector (default: c()) with the names of the variables for which distances are to be calculated. See Details for more information.

clmDst

Whether the distances shall be calculated between columns (TRUE) or rows (FALSE; default: TRUE). See Details for more information.

stdDst

Character string indicating whether the variables in varDst are to be standardized and how (default: "none"). See Details for more information.

nmeDst

Character string indicating which distance measure is to be calculated (default: "euclid"). See Details for more information.

mtxSps

Whether the symmetric matrix to be returned should be sparse (default: FALSE)

mtxTrL

Whether the symmetric matrix to be returned should only contain the lower triangular (default: FALSE)

mtxDgn

Whether the symmetric matrix to be returned should retain the values in the main diagonal (default: TRUE)

usePkg

Name of the package: "foreign" or "haven" that shall be used to read SPSS, Stata, and SAS files; "foreign" is the default (it comes with base R), but "haven" is newer and more comprehensive.

selSet

Name of the data set that is to be selected from the workspace (only applies when reading .RData-files)

...

Additional arguments passed on to methods; see Details below.

Value

A data frame containing a symmetric matrix with the distances between the variables / columns (clmDst == TRUE) or rows (clmDst == FALSE); it is only returned if fleOut is empty.

Details

  • varDst must be a character vector containing the variables to calculate distances over. If clmDst is set to TRUE, distances are calculated between all possible variable pairs and over subjects / rows in the original data frame; the symmetric matrix that is returned has the size V x V (V being the number of variables in varDst; if mtxSps is set to TRUE, the size is V - 1 x V - 1, see below). If clmDst is set to FALSE, distances are calculated between participants and over all variables given in varDst; the symmetric matrix that is returned has the size R x R (R being the number of rows in the original dataset; if mtxSps is set to TRUE, the size is R - 1 x R - 1, see below).

  • stdDst can be one of the following calculations to standardize the selected variables before calculating the distances: none (do not standardize; default), z (z scores), sd (divide by the std. dev.), range (divide by the range), max (divide by the absolute maximum), mean (divide by the mean), rescale (subtract the mean and divide by the range).

  • nmeDst can be one of the following distance measures. (1) For interval data: euclid (Euclidean), seuclid (squared Euclidean), block (city block / Manhattan), canberra (Canberra), chebychev (maximum distance / supremum norm / Chebychev), minkowski_p (Minkowski with power p; NB: needs p), power_p_r (Minkowski with power p, and the r-th root; NB: needs p and r), cosine (cosine between the two vectors), correlation (correlation between the two vectors). (2) For frequency count data: chisq (chi-square dissimilarity between two sets of frequencies), ph2 (chi-square dissimilarity normalized by the square root of the number of values used in the calculation). (3) For binary data, all measures have two optional parts, p and np, which indicate presence (p; defaults to 1 if not given) or absence (np; defaults to zero if not given). (a) matching coefficients: rr_p_np (Russell and Rao), sm_p_np (simple matching), jaccard_p_np / jaccards_p_np (Jaccard similarity; as in SPSS), jaccardd_p_np (Jaccard dissimilarity; as in dist(..., "binary") in R), dice_p_np (Dice or Czekanowski or Sorenson similarity), ss1_p_np (Sokal and Sneath measure 1), rt_p_np (Rogers and Tanimoto), ss2_p_np (Sokal and Sneath measure 2), k1_p_np (Kulczynski measure 1), ss3_p_np (Sokal and Sneath measure 3). (b) conditional probabilities: k2_p_np (Kulczynski measure 2), ss4_p_np (Sokal and Sneath measure 4), hamann_p_np (Hamann). (c) predictability measures: lambda_p_np (Goodman and Kruskal Lambda), d_p_np (Anderberg’s D), y_p_np (Yule’s Y coefficient of colligation), q_p_np (Yule’s Q). (d) other measures: ochiai_p_np (Ochiai), ss5_p_np (Sokal and Sneath measure 5), phi_p_np (fourfold point correlation), beuclid_p_np (binary Euclidean distance), bseuclid_p_np (binary squared Euclidean distance), size_p_np (size difference), pattern_p_np (pattern difference), bshape_p_np (binary shape difference), disper_p_np (dispersion similarity), variance_p_np (variance dissimilarity), blwmn_p_np (binary Lance and Williams non-metric dissimilarity). (4) none (only carry out standardization, if stdDst is different from none).

  • If mtxSps is set, a sparse matrix is returned. Those matrices are similar to the format one often finds for correlation matrices. The values are only retained in the lower triangular, the columns range from the first to the variable that is second to the last in varDst (or respectively, the columns contain the first to the second to the last row of the original dataset when clmDst is set to FALSE), and the rows contain the second to the last variable in varDst (or respectively, the rows contain the second to the last row of the original dataset when clmDst is set to FALSE).

  • By default, a full symmetric matrix is returned (i.e., a matrix that has no NAs in any cell). This behaviour can be changed by setting mtxTrL and mtxDgn: If mtxTrL is set to TRUE, the values from the upper triangular matrix are removed / replaced with NAs; if mtxDgn is set to FALSE, the values from the main diagonal are removed / replaced with NAs.

  • The ellipsis-parameter (...) can be used to submit arguments / parameters to the functions that are used for reading and writing the data. By clicking on the respective function under “See also”, you can get a more detailed overview of which parameters each of those functions takes. The functions are: read_omv and write_omv (for jamovi-files), read.table (for CSV / TSV files; using similar defaults as read.csv for CSV and read.delim for TSV, which both are based upon read.table), load (for .RData-files), readRDS (for .rds-files), read_sav (needs the R-package haven) or read.spss (needs the R-package foreign) for SPSS-files, read_dta (haven) / read.dta (foreign) for Stata-files, read_sas (haven) for SAS-data-files, and read_xpt (haven) / read.xport (foreign) for SAS-transport-files. If you would like to use haven, you may need to install it using install.packages("haven", dep = TRUE).
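The matrix layouts described above (clmDst, mtxTrL, mtxDgn and mtxSps) can be illustrated with base R alone. The following is a minimal sketch that only mimics the documented shapes using stats::dist(); it is not the internal code of distances_omv, and the object names (x, mtxCol, mtxRow, mtxLow, mtxNoD, mtxSpr) are made up for this illustration.

```r
# Toy data: 20 rows (R) x 5 columns (V), same shape as in the Examples
set.seed(1)
x <- matrix(rnorm(100, sd = 10), ncol = 5,
            dimnames = list(NULL, sprintf("C_%02d", seq(5))))

# clmDst = TRUE: distances between columns, size V x V
# (stats::dist() compares rows, hence the transpose)
mtxCol <- as.matrix(stats::dist(t(x), method = "euclidean"))
# clmDst = FALSE: distances between rows, size R x R
mtxRow <- as.matrix(stats::dist(x, method = "euclidean"))
dim(mtxCol)
dim(mtxRow)

# mtxTrL = TRUE: the upper triangle is replaced with NA
mtxLow <- mtxCol
mtxLow[upper.tri(mtxLow)] <- NA

# mtxDgn = FALSE: the main diagonal is replaced with NA
mtxNoD <- mtxCol
diag(mtxNoD) <- NA

# mtxSps = TRUE (as described in Details): rows hold the second to the last
# variable, columns the first to the second-to-last variable, and only the
# lower triangle of that (V - 1) x (V - 1) matrix is filled
mtxSpr <- mtxCol[-1, -ncol(mtxCol), drop = FALSE]
mtxSpr[upper.tri(mtxSpr)] <- NA
print(round(mtxSpr, 2))
```

In this sketch, the sparse layout keeps each of the V * (V - 1) / 2 pairwise distances exactly once and drops the (always zero) Euclidean self-distances from the main diagonal.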

See also

distances_omv internally uses the following function for calculating the distances for interval data: stats::dist(). It furthermore uses the following functions for reading and writing data files in different formats: read_omv() and write_omv() for jamovi-files, utils::read.table() for CSV / TSV files, load() for reading .RData-files, readRDS() for .rds-files, haven::read_sav() or foreign::read.spss() for SPSS-files, haven::read_dta() or foreign::read.dta() for Stata-files, haven::read_sas() for SAS-data-files, and haven::read_xpt() or foreign::read.xport() for SAS-transport-files.
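As a rough illustration of what stats::dist() contributes, the sketch below z-standardizes a small data frame (corresponding to stdDst = "z") and then computes Euclidean distances between its columns. It also checks the statement from Details that the Jaccard dissimilarity matches dist(..., "binary"). How distances_omv combines these steps internally is not shown here; the code and its object names (zFrm, dstCol, a, b, jacSim, jacDis) are assumptions made for this example.

```r
set.seed(1)
cntFrm <- as.data.frame(matrix(rnorm(100, sd = 10), ncol = 5))

# z scores: subtract the mean and divide by the standard deviation
zFrm <- as.data.frame(scale(cntFrm))

# Euclidean distances between columns; dist() works on rows, so transpose
dstCol <- as.matrix(stats::dist(t(zFrm), method = "euclidean"))
print(round(dstCol, 2))

# Binary data (1 = present, 0 = absent)
a <- c(1, 1, 0, 0, 1, 1)
b <- c(1, 0, 1, 0, 1, 1)
# Jaccard similarity: joint presences / cases with at least one presence
jacSim <- sum(a == 1 & b == 1) / sum(a == 1 | b == 1)
# dist(..., "binary") returns the corresponding dissimilarity
jacDis <- as.numeric(stats::dist(rbind(a, b), method = "binary"))
print(c(similarity = jacSim, dissimilarity = jacDis))
```

For z-standardized variables, the squared Euclidean distance between two columns equals 2 * (n - 1) * (1 - r), with r being their correlation and n the number of rows, which is why Euclidean distances on z scores and correlations carry the same information.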

Examples

if (FALSE) { # \dontrun{
# create matrices for the different types of distance measures: continuous
# (cntFrm), frequency counts (frqFrm) or binary (binFrm); all 20 R x 5 C
set.seed(1)
cntFrm <- stats::setNames(as.data.frame(matrix(rnorm(100, sd = 10),
            ncol = 5)), sprintf("C_%02d", seq(5)))
frqFrm <- stats::setNames(as.data.frame(matrix(sample(seq(10), 100,
            replace = TRUE), ncol = 5)), sprintf("F_%02d", seq(5)))
binFrm <- stats::setNames(as.data.frame(matrix(sample(c(TRUE, FALSE), 100,
            replace = TRUE), ncol = 5)), sprintf("B_%02d", seq(5)))
nmeOut <- tempfile(fileext = ".omv")

# calculates the distances between columns, nmeDst is not required: "euclid"
# is the default
jmvReadWrite::distances_omv(dtaInp = cntFrm, fleOut = nmeOut, varDst =
  names(cntFrm), nmeDst = "euclid")
dtaFrm <- jmvReadWrite::read_omv(nmeOut)
unlink(nmeOut)
# the resulting matrix (5 x 5) with the Euclidean distances
print(dtaFrm)

# calculates the (Euclidean) distances between rows (clmDst = FALSE)
jmvReadWrite::distances_omv(dtaInp = cntFrm, fleOut = nmeOut, varDst =
  names(cntFrm), clmDst = FALSE, nmeDst = "euclid")
dtaFrm <- jmvReadWrite::read_omv(nmeOut)
unlink(nmeOut)
# the resulting matrix (20 x 20) with the Euclidean distances
print(dtaFrm)

# calculates the (Euclidean) distances between columns; the original data
# are z-standardized before calculating the distances (stdDst = "z")
jmvReadWrite::distances_omv(dtaInp = cntFrm, fleOut = nmeOut, varDst =
  names(cntFrm), stdDst = "z", nmeDst = "euclid")
dtaFrm <- jmvReadWrite::read_omv(nmeOut)
unlink(nmeOut)
# the resulting matrix (5 x 5) with the Euclidean distances using the
# z-standardized data
print(dtaFrm)

# calculates the correlations between columns
jmvReadWrite::distances_omv(dtaInp = cntFrm, fleOut = nmeOut, varDst =
  names(cntFrm), nmeDst = "correlation")
dtaFrm <- jmvReadWrite::read_omv(nmeOut)
unlink(nmeOut)
# the resulting matrix (5 x 5) with the correlations
print(dtaFrm)

# calculates the chi-square dissimilarity (nmeDst = "chisq") between columns
jmvReadWrite::distances_omv(dtaInp = frqFrm, fleOut = nmeOut, varDst =
  names(frqFrm), nmeDst = "chisq")
dtaFrm <- jmvReadWrite::read_omv(nmeOut)
unlink(nmeOut)
# the resulting matrix (5 x 5) with the chi-square dissimilarities
print(dtaFrm)

# calculates the Jaccard similarity (nmeDst = "jaccard") between columns
jmvReadWrite::distances_omv(dtaInp = binFrm, fleOut = nmeOut, varDst =
  names(binFrm), nmeDst = "jaccard")
dtaFrm <- jmvReadWrite::read_omv(nmeOut)
unlink(nmeOut)
# the resulting matrix (5 x 5) with the Jaccard similarities
print(dtaFrm)

} # }