--- title: "Update Datasets With replacer" author: "Dragos Bandur" date: "`r Sys.Date()`" bibliography: "citation.bib" output: rmarkdown::html_vignette: toc: true toc_depth: 1 toc_includes: fontawesome5: yes vignette: > %\VignetteIndexEntry{Update Datasets With replacer} %\VignetteEngine{knitr::rmarkdown} %\VignetteEncoding{UTF-8} \usepackage[utf8]{inputenc} \usepackage{color} \usepackage[backend=biber,natbib=true, style=mla]{biblatex} --- ```{r setup, include = FALSE} knitr::opts_chunk$set( echo = TRUE, root.dir = "C:/Users/Dragos/Desktop/projects", collapse = TRUE, comment = "#>", out.width = "100%", tidy.opts = list(width.cutoff = 120), tidy = TRUE, if (any(!requireNamespace("data.table", quietly = TRUE))) { knitr::opts_chunk$set(eval = FALSE) } ) pks = c('data.table', 'knitr', 'kableExtra', 'replacer') if(!any(pks %in% search())) { invisible(lapply(pks, require, character.only = TRUE)) } ``` # Introduction `replacer`, a value replacement utility currently based on package [`data.table`](https://CRAN.R-project.org/package=data.table), is intended for outside-database update of data. It requires the preparation of a *lookup* file which is a list of replacement *requests* with and, in special circumstances, without an *index* column. The input and output data files are comma-separated values file format (*csv*). Various other file formats can therefore be processed after conversion. Data processing such as cleanup, file format conversion and the appendage of new data are outside the scope of this utility. There could be several lookup files associated with one data file and vice versa which could be batch-processed. This vignette defines the terms related to the replacement and to the elements of the lookup file, and details the procedure of creating an efficient lookup file suggesting ways to maintain it relevant for future updates. It continues by introducing the two User-intended functions, for single file and batch file processing, and describes their internal workflow. Finally, it presents a screen output result obtained with a lookup without index and introduces the datasets that are used in examples. ## Motivation This utility is accessible to beginners to `R` working with outside-database files. It facilitates complex dataset updates with minimal prompt time by employing User-friendly functions which automatically follow a decision tree rooted in the presence of User-made index. The choice for `data.table` is motivated by its concise and sound code, effectiveness and efficiency in processing large and complex datasets, and its independence on packages, or *verse*s, [other than base `R` itself](https://www.rdocumentation.org/packages/data.table/versions/1.14.0); this utility will only install `data.table` on User's machine, had it had not been installed already. Hence, familiarity with basic `R` commands is necessary. Familiarity with `data.table` is not, yet strongly recommended!
# Definitions
The following terms are specific to this utility: - **lookup** table: a csv format data file containing at least 4 standard columns with standard column names: * "**vars**": names of data columns involved in value replacement * "**oldVals**": values set for replacement found in the data file * "**newVals**": replacement values associated with **oldVals** * "**id**": index containing row numbers and special values such as 0, NA (or empty) - **simple (or 1:1) replacement**: one value from **newVals** replaces one value from **oldVals** - **multiple (or 1:many) replacement**: one value from **newVals** replaces identical values from **oldVals** - **split**: in this context, recursive partition of multiple replacements into simple and multiple replacements - **generic value**: restricted replacement value used for multiple replacements of missing data - **involved column**: data column with the name listed in **vars** column of lookup - **request**: implicit information containing value type, value missingness, value uniqueness, value location, replacement type, replacement value - all in one row, spanning the standard columns of lookup Lookup may contain a combination of simple and multiple replacements, simple replacements only, multiple replacements only (i.e. missing only, duplicated values only, or mixed) as needed. Multiple replacements can be split as needed. On special circumstances, column "**id**" may be absent from lookup while the presence of the other three standard columns is mandatory. Modifying standard columns' names will result in error. Any extra columns in lookup will be automatically removed during internal conformance checks on lookup. Multiple replacements apply to missing or duplicated non-missing values. While replacement values of duplicated values could be regarded as generic, they are entirely decided by the User. They are not restricted in this sense. In this version, replacements of missing data are restricted to words formed with the word roots *MIS|mis, PRE|pre, UN|un, ABS|abs* and the adverbs *YES|yes, NO|no* for missing values of type character and with a number equal to- or larger than 3 integer **9**s (i.e. 999, 9999 etc.) for missing values of double or integer types. Short sentences beginning with any word root or adverb above may be accepted as replacements.
# Example
Typical data replacement workflow: the initial data file and the corresponding lookup file (here named *"data.csv"* and *"lookup.csv"*) are saved in a directory, here named *dir*. Once the package is installed, the following commands should be typed at the `R` prompt: ``` > require(replacer) > dir = 'C:/path/to/directory' > replaceVals(dir) ## to update the values in 'data.csv' ``` Results will show on screen (see Section "Screen Output Example" which presents the result of a different data/lookup pair and requests set). A csv file similar to *Out-Data* is saved on *dir*. ```{r echo = FALSE, message = FALSE} ll = lapply(c('data_id_vig.csv', 'lookup_id_vig.csv', 'outData.csv'), fread) dat = ll[[1]]; orderDat = names(copy(dat)) lk = ll[[2]]; orderLk = names(copy(lk)) odat = ll[[3]]; orderOdat = names(copy(odat)) dat$Rw = seq_len(nrow(dat)); setcolorder(dat, neworder = c('Rw', orderDat)) lk$Rw = seq_len(nrow(lk)); setcolorder(lk, neworder = c('Rw', orderLk)) odat$Rw = seq_len(nrow(odat)); setcolorder(odat, neworder = c('Rw', orderOdat)) ```
```{r tables, echo = FALSE, message = FALSE} options(knitr.kable.NA = "") kable_styling(kable(dat, format = 'html', caption = '*In-Data*', escape = TRUE, digits = 1), 'bordered', full_width = FALSE, position = 'float_left', font_size = 10) kable_styling(kable(lk, format = 'html', caption = 'Lookup', escape = TRUE, digits = 1), 'bordered', full_width = FALSE, position = 'float_left', font_size = 10) kable_styling(kable(odat, format = 'html', caption = '*Out-Data*', escape = TRUE, digits = 1), 'bordered', full_width = FALSE, position = 'float_left', font_size = 10) ```
This small example covers the complete set of lookup requests with index present. Column "**Rw**" is here for convenience only. It should not be manually created. The left-hand side table, *In-Data*, is the original data file and contains 3 columns of interest. The two columns ("**a**" and "**b**") involved in replacements, are of character respectively, of numeric data types and both contain missing, duplicates and unique values. The "**uninvlvd**" column in *In-Data* is not listed in lookup's **vars**, it is not involved in replacements and therefore, remains unchanged in *Out-Data*. Column **source** in *Lookup* is not standard and therefore, is automatically removed from computations. The *Out-Data* result was achieved with a lookup file with User-made index; the replacement requests were as follows: - simple replacements: missing and non-missing values at rows 2, 7, 8, 9 and 12 - multiple replacements: * of duplicated values (**id** = **0**) at rows 3, 6, and 9 * of missing values (**id** = **empty**) at rows 7, 10, 11 and 12 - split replacements: i.e. column "**a**" underwent a split on the duplicated value "cc" present in 3 rows of *In-Data*; a *default* length 3 multiple replacement was split in a length 2 multiple replacement at rows 3 and 6 replaced by "DD" and a simple replacement at row 12 replaced by "CA".
# User-made Index In Lookup
A User-made index may or, may not be necessary in lookup: ## Lookup With User-made Index A User-made index is the safest value replacement procedure; it requires little insight in the structure of data and insures fast review of large datasets. When there is a full range of request types including splitting, the index should contain the value types shown in the above example, in lookup column **id**. ### Zero Value Present There could be several distinct sets of duplicated values in any data column. If the purpose is multiple replacements of any/all of these sets (or subsets of these sets) then, for each subset, the replacement request should take only one row in lookup having **0** as index value. Corresponding replacement values are at User's discretion. ### Empty (NA) Value Present Same rules as above apply to multiple replacement of sets of missing values with two differences: 1) the corresponding index value is empty or NA, and 2) the generic value is restricted to a word or short sentence starting with any word root or adverb listed in "Definitions". For generic values of numeric type, data type preservation (see **Note 2**) is not required (i.e. it is not required to enter 9999.0 instead of 9999 in the **id** row corresponding to missing values of double data type). Further revisions may bring more generic value customization if necessary. ### Row Number Present To request simple replacements, whether having unique, duplicated or missing values in **oldVals** column, the index should contain the corresponding data file row number. Sorting lookup by **vars** would maintain focus on one data column at a time. Standard columns **oldVals** and **newVals** in lookup, revert to the character data type upon completion since they contain numeric and character values combined. This behavior is normal. In conclusion, a User-made index covers all types of replacements, including splits. ## Lookup Without User-made Index There are special circumstances when the index is absent in lookup: ### Index Must Be Absent In Lookup There is no need for index when lookup contains *only* multiple replacement requests (Section "Screen Output Example"). An extension of this case involves splitting on missing values and is explained in the sub-section below. ### Index May Be Absent In Lookup This case requires a deeper insight in the structure of the data file such as knowing in advance which values are duplicated. When used standalone, helper function `whichDups()` - part of this utility - finds the duplicated values in all columns of the data file. The requirements for an effective lookup table without index are: - rows in lookup *must* have the same succession as the succession of values in data. This applies to each involved column (e.g. if two data values located at rows 3 and 2000, need be updated in data column "X", the corresponding requests for "X" in lookup should be located in *adjacent rows, in the same order*) - simple replacements: if values set for replacement within any or all involved data columns are unique (either missing or not), simple replacements without index are possible - mixed simple and multiple replacements: if any involved data column contains unique, duplicated and/or missing values *and* some or all of the duplicated and/or missing values are set for multiple replacements *that do not undergo* splitting, replacements without index are possible - splitting: not implemented for duplicated values when User-made index is absent. Although there are ways for implementing duplicate values splits without index, the necessary User time and data structure insight plead for the implementation of a User-made index. A special case of splitting is, however, allowed for missing values (see **Note 3**) **Note 1** Whether the index is present or not in lookup, at the time of running the updates the row order in the data file must be identical to the row order in the data file as it was at the completion of lookup. If necessary, an index named *other than "id"* could be built and preserved inside the dataset and it will be treated as uninvolved column. Re-ordering the dataset by this index before run will retrieve the row order at the completion of lookup and will keep the lookup relevant for future data updates. **Note 2** It is recommended that User preserves the data type between values set for replacement and their replacements (i.e. integer/double to integer/double etc.). Although the helper function `con2fcoales()` automatically solves such discrepancies in the background, if an error message similar to *"Item 2 is type integer but the first item is type double. Please coerce ..."* appears on screen, it means that 1) the requirement was obviously violated and that 2) the severity of violation was beyond the mitigation capability of `con2fcoales()`. In such cases, a review of all output messages and a review of the data file and the associated lookup should ensue. **Note 3** Due to their special type, splitting on missing values *across* data columns is allowed when index is absent, i.e. if one involved data column contains one missing value and other involved columns contain missing values set (or not) for multiple replacements, splitting is allowed and the unique missing value can be replaced with a *non-generic* replacement value.
# Functions Intended For User
There are two functions designed for direct use and set to display a series of messages and comments informing the User on the computational path taken and on findings along the way. While the messages can be turned off, the comments remain visible. Single file replacements are processed by function `replaceVals()`. When the data and lookup files are named simply *"data.csv"* and *"lookup.csv"* the function only requires the path to the directory where these files were stored, written as length **1** quoted character, with forward "**/**" or double backward "**\\\\**" slash and without end slash. At start, this function performs a series of conformance checks on data and lookup, excludes non-standard columns, separates the data into involved and uninvolved columns then, rejoins them on exit to preserve the data format. Batch file replacements are processed by function `bReplace()`. When messages are set to **TRUE** it offers an wide range of custom messages/comments for each data/lookup input pair and request type. During run, `bReplace()` calls `replaceVals()` function as many times as data/lookup pairs are on the names list, displays a named list of messages, comments, updated data and counts related to duplicated and/or missing values requests, for each pair. Both functions save their updated data to the directory above, in csv file format.
# Internal Workflow
Helper function `sReplace()` (described in the Manual) is the data replacement workhorse for this utility. When called by `replaceVals()`, it firstly checks for index presence in lookup. Upon result, the function moves along the branches of a decision tree: ## If The Index Is Found Absent The function starts by identifying duplicated and/or missing values within the involved columns as well as eventual splits on missing values. If lookup requires multiple and simple replacements, the function separates lookup into maximum 3 subsets: 1) of multiple replacements for duplicated values, for which it later creates an internal index list, 2) of multiple replacements for missing values for which an internal index is not necessary, and 3) the remainder subset containing unique values, including unique missing values, for which it also creates an internal index list. Requests for splitting on duplicated values with no index stop the function with an error. ### Internal Index For Duplicated Values The function creates an internal index list of row numbers corresponding to all elements of distinct subsets of duplicated values found within each involved data column and loops the function `data.table::set()` for replacements. ### No Internal Index For Missing Values As already mentioned, no index is created for multiple replacements of missing values as there is only one generic value per data column (this utility uses no other information on missing data beside the data type. For a different approach on this subject the User is directed to *data imputation* literature). The subset of missing values is then *reshaped*, and the columns are *coalesced* with corresponding data columns, for each generic value present in lookup. ### Index For Unique Values As stated above, simple replacements of unique values without User-made index are possible. Once the internal index is created, this subset is *reshaped*, *joined* with the data on index and then, the corresponding columns are *coalesced* (please see these terms in `data.table` Manual). ## If The Index Is Found Present The function subsets the lookup using the special index values **0** and/or **NA** (or empty). At maximum, 3 subsets are formed as above. The replacement process is similar with the process used for absent index with the difference that the unique values already have the User-made index of row numbers. In conclusion, whether single or batch file processing, all request types are processed in one run while Users have the ability of monitoring the internal workflow.
# Screen Output Example
Presented below is an update performed on the "Chile" dataset from package `carData`. At the `R` prompt, type or just copy/paste the following commands: ``` help(Chile, package = "carData") require(replacer) dir = system.file("extdata", package = "replacer") replaceVals(dir, 'chile.csv', 'chile_nadup.csv', save = FALSE) ```
Screen output: ``` reading data from: C:/R/R-4.1.0/library/replacer/extdata ... completed reading data ... checking standard columns in lookup ... starting replacements ... found no request for 1:1 replacements ... searching for 1:many replacements ... found request for multiple duplicated value replacments: creating index ... replacing multiple duplicated values ... found request for multiple missing value replacements: replacing ... helper function has completed! $` updated_chile_using_chile_nadup` region population sex age education income statusquo vote 1: North 200000 Male 65 Primary 35000 1.00820 Y 2: North 200000 Male 29 Post-secondary 7500 -1.29617 N 3: North 200000 Female 38 Primary 15000 1.23072 Y 4: North 200000 Female 49 Primary 35000 -1.03163 N 5: North 200000 Female 23 Secondary 35000 -1.10496 N --- 2696: Metro Santiago 15000 Male 42 Primary 15000 -1.26247 N 2697: Metro Santiago 15000 Female 28 Primary 15000 1.32950 Y 2698: Metro Santiago 15000 Female 44 Primary 75000 1.42045 Y 2699: Metro Santiago 15000 Male 21 Secondary 75000 0.18315 PREFERS NO ANSWER 2700: Metro Santiago 15000 Male 20 Post-secondary 35000 1.38179 Y $` multiple_dups_repl_counts` vars oldVals newVals education population region sex 1: education P Primary 1107 NA NA NA 2: education PS Post-secondary 462 NA NA NA 3: education S Secondary 1120 NA NA NA 4: population 175000 200000 NA 140 NA NA 5: population 25000 50000 NA 360 NA NA 6: region C Central NA NA 600 NA 7: region M Metro Santiago NA NA 100 NA 8: region N North NA NA 322 NA 9: region S South NA NA 718 NA 10: region SA City of Santiago NA NA 960 NA 11: sex F Female NA NA NA 1379 12: sex M Male NA NA NA 1321 $` multiple_NAs_repl_counts` region population sex age education income statusquo vote 0 0 0 1 11 98 17 168 ```
This update has replaced the abbreviations used in the source data with full names and complete words, as well as, updated the **population** totals on few regions with fictitious values. Missing values in column **vote** were replaced by a short sentence using the word root "*PRE*" listed above. The requests were processed through a lookup table without User-made index. The first output block named "*updated_chile_using_chile_nadup*" displays the head and the tail of the updated data file. To view the complete updated dataset on screen, type: ``` upData = replaceVals(dir, 'chile.csv', 'chile_nadup.csv', save = FALSE)[[1]] View(upData) ``` Next block, named "*multiple_dups_repl_counts*", displays a count of duplicated value updates processed on each involved column, by value and replacement value. In this example, the left-hand side of this block contains the *complete cases* of lookup. Finally, the third block named "*multiple_NAs_repl_counts*" shows counts of missing values replaced by generic values within respective involved columns. In other cases, unrequested replacements within any of the involved columns will be commented in the screen output (see `replaceVals()` Examples). All block names change according to input and request type; these names show what was processed and what was not. In case of error, the displayed messages/comments should hint of where in the process the error occurred.
# Example Datasets
```{r dups_in_examples, include=FALSE, warning=FALSE} dir = system.file('extdata', package = 'replacer') inData = list.files(dir) inData = inData[grep('.csv', inData)] inData = inData[-grep('chile_id|chile_nadup|lookup', inData)] lData = lapply(inData, function(i) paste0(dir,'/', i)) data = lapply(lData, fread, na.strings = c(NA_character_, '')) n.chile = data$chile[, lapply(.SD, function(i) {sum(duplicated(i))/length(i)})] names(data) = unlist(strsplit(inData, split = '.csv', fixed = TRUE)) ldup = lapply(data, function(i) whichDups(i)[length(i) > 0]) ldup$chile = sample(ldup$chile, size = 4, prob = n.chile) nams = names(ldup) = names(data) # Tables DUPS = lapply(nams, function(n) {kable(t(ldup[[n]]) , format = 'html' , align='c' , caption = 'Duplicated') }) NAS = lapply(nams, function(n) {kable(t(data[[n]][, colSums(is.na(.SD))]) , format = 'html' , align='c' , caption = 'Missing') }) names(DUPS) = paste0(nams, 'DUP') names(NAS) = paste0(nams, 'NA') ``` Although artificial, the datasets in this package cover situations encountered in practice. They contain fictitious values inserted for exemplification purpose with the exception of "*chile*" dataset, a copy of the "Chile" dataset from package `carData`; however, some replacement values within associated lookup files are also fictitious. The batch-file example (see `bReplace()` Examples) contains the names of data and associated lookup files presented below. Duplicated values and counts of missing values are shown below for all these datasets. When several distinctive subsets of duplicated values are present in an involved column the column's name appears indexed. ## Data And Associated Lookup Files With Index
- The "*data_id*" dataset
```{r echo = FALSE, results = 'asis'} cat(c('
', DUPS$data_idDUP , '', NAS$data_idNA, '
'), sep = '') ```
Associated lookup files: "*lookup_id.csv*": mixed simple/multiple replacements of unique, duplicated and missing data, similar to the typical replacement workflow example. The file contains standard columns only. "*lookup_idsimple.csv*": simple replacements i.e. without **0** or **NA** values in "**id**". The file contains standard columns only. - The '*chile*' dataset. Since these data contain large subsets of duplicated values, only a very small sample is shown below.
```{r echo = FALSE, results = 'asis'} cat(c('
', DUPS$chileDUP , '', NAS$chileNA, '
'), sep = '') ```
Associated lookup file: "*chile_id.csv*": splits of multiple replacements of duplicated values. Also, multiple replacements of missing data. The file includes non-standard column "**source**". ## Data And Associated Lookup Files Without Index - The "*data*" dataset
```{r echo = FALSE, results = 'asis'} cat(c('
', DUPS$dataDUP , '', NAS$dataNA, '
'), sep = '') ```
Associated lookup file: "*lookup.csv*": mixed simple and multiple replacements of unique, duplicated and missing values. The file contains standard columns only. - The "*data_unique*" dataset
```{r echo = FALSE, results = 'asis'} cat(c('
', DUPS$data_uniqueDUP , '', NAS$data_uniqueNA, '
'), sep = '') ```
Associated lookup file: "*lookup_unique*": split across columns on missing values. The file contains standard columns only. ### Other Associated Lookup Files "*lookupDUP*", "*lookupNA*": multiple replacements *only* of duplicated or missing values in "*data_id*" dataset. Both files contain standard columns only. "*chile_nadup*": multiple replacements only of duplicated and missing values in "*chile*" dataset. The file contains non-standard column "**source**" .
# Session Info
```{r sessioninfo, echo = FALSE} sessionInfo() ``` ```{r citation, include = FALSE} knitr::write_bib(file = 'citation.bib') ```