---
title: "A2.3.1 File format considerations"
author:
- name: "Johannes Zauner"
- name: "Salma Thalji"
- name: "Manuel Spitschan"
format:
  html:
    code-link: true
    code-tools: true
    toc: true
    toc-depth: 3
filters:
- shinylive
---
## Preface
MeLiDos activity A2.3.1 calls for the development of a jointly accessible and interoperable data format for light dosimetry data. This is closely linked to A2.3.2, the development of a metadata descriptor. The metadata descriptor, described [in this publication](https://link.springer.com/article/10.1186/s44247-024-00113-9), separates the overall format into a data part and a metadata part. The metadata part contains information about the dataset itself (e.g., column names, variables, units, and value definitions) as well as information about the device, participant, and study. The metadata schema is explained in detail on the [Global Light Commons (GLC) website](https://tscnlab.github.io/glc_dp_viewer/).
This separation keeps the device data file simple and flexible. It does not mean, however, that the data file is arbitrary.
For the MeLiDos analysis package [LightLogR](https://tscnlab.github.io/LightLogR/), we reviewed file formats from 19 devices and found substantial variation in both file structure and preprocessing requirements. Moving forward, we encourage device manufacturers to standardize wearable export formats and to adopt the dual structure of data plus metadata, even if only part of the metadata is embedded directly with the export.
```{r}
#| message: false
#| warning: false
library(LightLogR)
supported_devices()
```
This document focuses on the data part and summarizes relevant design considerations derived from existing formats. The goal is to optimize, across files and devices, for:
- readability
- consistency
- unambiguity
- machine-readability
- cross-language interoperability
:::{.callout-note}
In this document, the term `record` is used interchangeably with `observation` and refers to one written row in an export file for a given time point or event.
:::
### Overview of the proposed format
The proposed structure separates wearable exports into a simple tabular data file and a metadata descriptor. The data file contains the observations themselves, while the metadata descriptor defines how those observations should be interpreted.
```{mermaid}
flowchart LR
A[Wearable device export] --> B[Data file<br/>CSV]
A --> C[Metadata descriptor]
B --> B1[Rectangular rows and columns]
B --> B2[Single header row]
B --> B3[Canonical UTC timestamp<br/>Datetime]
B --> B4[Consistent dtypes per column]
B --> B5[Portable encoding and separators]
C --> C1[Variable definitions]
C --> C2[Units]
C --> C3[Device metadata]
C --> C4[Participant / study metadata]
C --> C5[Time zone and missing-data rules]
```
## Data format considerations
We provide an example file to download here: [⬇ Download CSV](data/wearable_light_example.csv){.btn .btn-primary download="wearable_light_example.csv"}
### Base format
We recommend [comma-separated values (CSV)](https://www.rfc-editor.org/rfc/rfc4180) as the base format. See the [full specification](https://www.rfc-editor.org/rfc/rfc4180) for details. In summary:
- variables (columns) are separated by a single comma (`,`)
- records (rows) are separated by line breaks
- every row must contain the same number of fields
- fields containing commas, double quotes, or line breaks must be enclosed in double quotes
- double quotes inside quoted fields must be escaped according to CSV rules
- rows must not end with a trailing comma
For maximum interoperability across R, Python, spreadsheets, databases, and command-line tooling, exported files should additionally follow these requirements:
- file encoding must be UTF-8
- files should be written without a UTF-8 byte order mark (BOM)
- line endings may be `LF` or `CRLF`, but must be used consistently within a file
- the delimiter must always be a comma and must not vary by locale
- decimal values must always use `.` as the decimal separator
:::{.callout-warning}
Exported files must not contain locale-dependent variations. The file format must remain identical regardless of the locale settings of the exporting computer.
:::
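The rules above can be sketched as a small export routine in base R. The data frame, its column names, and the temporary output path are hypothetical; the point is that the separator, decimal mark, quoting, line ending, and encoding are set explicitly rather than inherited from the locale of the exporting computer.

```r
# Hypothetical example data; the column names are illustrative only
dat <- data.frame(
  Datetime = c("2026-03-09T14:30:00Z", "2026-03-09T14:31:00Z"),
  illuminance_lux = c(1234.5, NA),
  comment = c("indoor, near window", 'said "ok"'),
  stringsAsFactors = FALSE
)

out_file <- tempfile(fileext = ".csv")

write.table(
  dat,
  file = out_file,
  sep = ",",              # the delimiter is always a comma
  dec = ".",              # the decimal separator is always a point
  na = "",                # missing values become empty fields
  quote = TRUE,           # quote text fields (and the header)
  qmethod = "double",     # escape embedded quotes by doubling them
  row.names = FALSE,
  eol = "\n",             # one consistent line ending
  fileEncoding = "UTF-8"  # UTF-8, written without a byte order mark
)

readLines(out_file)
```

Reading the file back with `read.csv(out_file, na.strings = "")` should recover the original values, including the embedded comma and the embedded double quotes.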
### Tabular data
Wearable data are typically analyzed as time series. Such analyses are best supported by rectangular tabular data, with variables in columns and observations in rows. In rectangular data, each row has the same number of fields and each column represents the same variable throughout the file. Cells may be empty to indicate missing data, but the structure itself must not vary across rows.
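![Rectangular (tidy) data: variables in columns, observations in rows, values in cells. Image source: R for Data Science.](https://r4ds.had.co.nz/images/tidy-1.png)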
::: {.callout-note}
This structure can be inefficient when not all variables are recorded for every observation. This may occur when a device contains multiple sensors or modalities sampled at different intervals, or when irregular event records occur (e.g., button presses).
In such cases, three approaches can preserve a rectangular structure:
- add an observation (row) that is empty for variables not recorded at that time point; this is appropriate when such cases are rare
- aggregate irregular events into the next regular observation (e.g., store the number of button presses since the previous sample); this is appropriate when the recording interval is short and aggregation is scientifically acceptable
- split the export into multiple rectangular files, for example one file per sensor or modality, or one file for regular samples and one for irregular events; this is appropriate when irregularity is common
:::
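The second approach, aggregating irregular events into the next regular observation, could look as follows in base R. This is a sketch under assumptions: the 60-second sampling interval, the values, and the column name `button_presses` are all hypothetical.

```r
# Regular samples every 60 seconds (hypothetical values)
samples <- data.frame(
  Datetime = as.POSIXct("2026-03-09 14:30:00", tz = "UTC") + 60 * 0:3,
  illuminance_lux = c(120.5, 130.0, 128.2, 90.1)
)

# Irregular button-press events recorded between samples
presses <- as.POSIXct(
  c("2026-03-09 14:30:20", "2026-03-09 14:30:45", "2026-03-09 14:32:10"),
  tz = "UTC"
)

# Assign each event to the next regular sample, i.e., the first sample
# whose timestamp is greater than or equal to the event time
ix <- findInterval(
  as.numeric(presses),
  as.numeric(samples$Datetime),
  left.open = TRUE
) + 1L

# Count events per sample; samples without events get 0
samples$button_presses <- tabulate(ix, nbins = nrow(samples))
samples
```

An event that coincides exactly with a sample time is counted in that sample. Events after the last sample fall outside the table and are dropped by `tabulate()`, so they would need a final row or a separate events file.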
### Headers
Each file must contain exactly one header row containing the variable names. Data records begin on the following row.
Additional lines above the column names are strongly discouraged. In fact, the file validation in the [Global Light Commons (GLC)](https://tscnlab.github.io/glc_dp_viewer/) only supports data files without additional headers.
:::{.callout-note}
Variable names should be [`syntactic`](https://adv-r.hadley.nz/names-values.html#non-syntactic) so that they can be imported directly into data analysis software without renaming.
For best interoperability across R and Python, variable names should:
- use only ASCII letters (`a-z`, `A-Z`), digits (`0-9`), `_`, and `.` (no spaces or other special characters)
- start with a letter
- not duplicate any other column name within the file
- avoid reserved words in common analysis environments where possible
For easier ingestion into both R and Python data pipelines, `snake_case` is recommended for general variables. A documented exception may be used for canonical field names such as `Datetime`.
:::
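A rough way to screen a header against these naming rules is sketched below. The helper `check_variable_names()` and the example names are hypothetical; `make.names()` is used as a proxy for whether R considers a name syntactic, which also catches reserved words in R (but not in Python).

```r
# Hypothetical helper: flags names that break the recommendations above
check_variable_names <- function(nms) {
  data.frame(
    name = nms,
    # ASCII letters, digits, "_" and "." only, starting with a letter
    allowed_characters = grepl("^[A-Za-z][A-Za-z0-9_.]*$", nms, perl = TRUE),
    # FALSE for every member of a duplicated group
    is_unique = !(nms %in% nms[duplicated(nms)]),
    # TRUE if R can use the name without renaming
    syntactic_in_r = make.names(nms) == nms,
    stringsAsFactors = FALSE
  )
}

res <- check_variable_names(
  c("Datetime", "illuminance_lux", "Light (lux)", "2nd_sensor", "if", "illuminance_lux")
)
res
```

In this example, `Light (lux)` and `2nd_sensor` fail the character rule, `if` is a reserved word in R, and `illuminance_lux` is flagged as duplicated.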
### Consistency of variables
Within a column, values must be consistent with the declared variable type. Otherwise, users must manually correct types after import, which introduces avoidable error sources.
General requirements:
- a column must contain values of one logical type only (e.g., numeric, integer, string, boolean, datetime)
- numeric values may contain only digits, an optional leading minus sign, and the decimal separator
- `.` must be used as the decimal separator, independent of locale
- thousands separators must not be used
- text values containing commas, double quotes, or line breaks must be quoted according to the CSV specification
- missing values should be encoded as empty fields rather than implementation-specific strings where possible
Recommended canonical encodings:
- boolean values: `true` and `false`
- missing values: empty field
- categorical values: documented controlled vocabulary in metadata
- units: defined in metadata, not embedded in the cell value
Python-specific interoperability requirements:
- columns should not mix numeric values with strings such as `error`, `off`, or `n/a`
- vendor-specific missing-value strings such as `NA`, `N/A`, `null`, `NULL`, `-`, or `9999` should be avoided in raw exports unless they are explicitly documented in metadata
- if integer columns can contain missing values, this must be documented, as some Python tooling may otherwise infer floating-point types during import
- where practical, one physical quantity should map to one column and one dtype
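The effect of mixing string tokens into a numeric column can be illustrated with a short import sketch. It uses base R and hypothetical two-row files; type inference in other tools tends to behave in a comparable way.

```r
header <- "Datetime,illuminance_lux"

# A vendor-specific token ("error") in an otherwise numeric column
with_token <- read.csv(
  text = paste(header, "2026-03-09T14:30:00Z,120.5", "2026-03-09T14:31:00Z,error", sep = "\n"),
  na.strings = "", stringsAsFactors = FALSE
)

# The same record with the missing value encoded as an empty field
with_empty <- read.csv(
  text = paste(header, "2026-03-09T14:30:00Z,120.5", "2026-03-09T14:31:00Z,", sep = "\n"),
  na.strings = "", stringsAsFactors = FALSE
)

class(with_token$illuminance_lux)  # "character": the whole column degrades to text
class(with_empty$illuminance_lux)  # "numeric": the column keeps its intended type
```

A single stray token is enough to change the inferred type of the entire column, which is why the measurement column should hold numbers or empty fields only.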
:::{.callout-note}
#### On sentinel values
Some devices use sentinel values to indicate special information, such as `sleep` or `out of bounds`. Storing such values in the same column as measurement data is discouraged. Instead, the status information should be moved to a separate variable (if necessary as a sentinel value), and the measurement column should be set to missing for that observation.
:::
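As a sketch of this recommendation, the following base R code splits a mixed column into a numeric measurement column and a separate status column. The sentinel vocabulary, the column names, and the `"ok"` status for regular measurements are assumptions made for illustration.

```r
# Raw export in which the measurement column also carries sentinel strings
raw <- data.frame(
  Datetime = c("2026-03-09T14:30:00Z", "2026-03-09T14:31:00Z", "2026-03-09T14:32:00Z"),
  illuminance_lux = c("120.5", "sleep", "out of bounds"),
  stringsAsFactors = FALSE
)

# Hypothetical sentinel vocabulary; in practice this comes from device documentation
sentinels <- c("sleep", "out of bounds")
is_sentinel <- raw$illuminance_lux %in% sentinels

clean <- data.frame(
  Datetime = raw$Datetime,
  # The measurement column is numeric; sentinel rows become missing
  illuminance_lux = as.numeric(ifelse(is_sentinel, NA, raw$illuminance_lux)),
  # The status moves to its own column ("ok" is an assumed default)
  status = ifelse(is_sentinel, raw$illuminance_lux, "ok"),
  stringsAsFactors = FALSE
)
clean
```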
### Record ordering and uniqueness
Records should be ordered by ascending timestamp within each file, and timestamps must be unique.
If a file represents a single continuous stream, each row should be uniquely identifiable by the timestamp.
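Both properties are cheap to verify before export. The sketch below uses base R on a hypothetical timestamp vector that violates both rules.

```r
# Hypothetical timestamps: one duplicate and one out-of-order record
ts <- as.POSIXct(
  c("2026-03-09T14:30:00Z", "2026-03-09T14:31:00Z",
    "2026-03-09T14:31:00Z", "2026-03-09T14:29:00Z"),
  format = "%Y-%m-%dT%H:%M:%SZ", tz = "UTC"
)

is_ascending <- !is.unsorted(ts)       # FALSE: the last record is out of order
has_duplicates <- anyDuplicated(ts) > 0  # TRUE: 14:31:00 appears twice

# A strictly ascending, duplicate-free version of the same timestamps
fixed <- sort(unique(ts))
```

Dropping duplicates is only appropriate when the duplicated rows are truly identical; otherwise the conflict needs to be resolved at the source.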
### Timestamps
Timestamps in wearable data must contain date and time information in a consistent, unambiguous, and machine-readable format. Because time handling is complex, timestamps require special care.
A time zone should not be confused with a UTC offset. For any given moment, a time zone maps to a specific offset, but a time zone also captures daylight saving time rules and policy changes. Therefore, local time plus time zone contains more information than local time plus offset alone.
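The difference can be made concrete with a short base R sketch: the same IANA time zone yields different UTC offsets depending on the date. The two dates are arbitrary examples, and the code assumes that the time zone database is available to R.

```r
# Two instants, stored in UTC
winter <- as.POSIXct("2026-01-15 12:00:00", tz = "UTC")
summer <- as.POSIXct("2026-07-15 12:00:00", tz = "UTC")

# UTC offset of the zone Europe/Berlin at each instant
offsets <- c(
  winter = format(winter, "%z", tz = "Europe/Berlin"),  # "+0100"
  summer = format(summer, "%z", tz = "Europe/Berlin")   # "+0200"
)
offsets
```

A single stored offset would therefore be correct for only part of a recording that spans a daylight saving transition, whereas the zone identifier remains valid throughout.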
Date and time values should follow [ISO 8601](https://www.iso.org/iso-8601-date-and-time-format.html). A practical internet profile is described in [RFC 3339, Section 5.6](https://datatracker.ietf.org/doc/html/rfc3339#section-5.6).
At the time of writing, many devices do not include UTC offsets in exported timestamps, which can create ambiguity around daylight saving transitions. This ambiguity can be avoided if:
- datetimes follow ISO 8601 / RFC 3339 conventions
- timestamps in the data file are stored in UTC
- UTC timestamps are encoded with the `Z` suffix (for example, `2026-03-09T14:30:00Z`)
- the participant's or device's local time zone is stored in metadata using an [IANA time zone identifier](https://en.wikipedia.org/wiki/List_of_tz_database_time_zones) such as `Europe/Berlin`
Recommended practice:
- include one canonical timestamp column named `Datetime`
- store `Datetime` in UTC in full datetime format with seconds
- if sub-second precision is available and scientifically relevant, use a fixed precision consistently throughout the file
- do not mix naive local timestamps and UTC timestamps in the same column
- if a local device timestamp is required in addition to UTC, store it in a separate column and document it clearly in metadata
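A minimal sketch of the conversion from naive local device time to the recommended UTC representation follows. The timestamps and the zone `Europe/Berlin` are example values; in a real export the zone would come from the device or study metadata.

```r
# Naive local timestamps spanning the spring transition in Europe/Berlin
local_stamps <- c("2026-03-28 23:30:00", "2026-03-29 03:30:00")
device_tz <- "Europe/Berlin"  # assumed to be known from metadata

# Interpret the clock times in the device time zone
parsed <- as.POSIXct(local_stamps, tz = device_tz, format = "%Y-%m-%d %H:%M:%S")

# Write them as UTC with the Z suffix
Datetime <- format(parsed, "%Y-%m-%dT%H:%M:%SZ", tz = "UTC")
Datetime
```

The two local clock times are four hours apart, but the resulting UTC timestamps are three hours apart, which reflects the elapsed time correctly. Local times that occur twice during the autumn transition cannot be resolved this way without additional information, which is one more reason to record UTC on the device.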
:::{.callout-note}
Time-series analysis often requires a regularly spaced, uninterrupted time series. Manufacturers can support researchers in a major way by ensuring that observations are stored in regular and consistent intervals (e.g., every 60 seconds), and that gaps in the data collection are explicit (i.e., timestamped missing data).
:::
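Making gaps explicit can be sketched in base R by joining the observations onto a regular time grid. The 60-second interval and the values are hypothetical.

```r
# Observations with a gap: the samples at 14:32 and 14:33 are absent
obs <- data.frame(
  Datetime = as.POSIXct("2026-03-09 14:30:00", tz = "UTC") + 60 * c(0, 1, 4, 5),
  illuminance_lux = c(120.5, 130.0, 90.1, 95.3)
)

# Regular grid from the first to the last observation
grid <- data.frame(
  Datetime = seq(min(obs$Datetime), max(obs$Datetime), by = "1 min")
)

# Left join: grid rows without an observation get a missing measurement
regular <- merge(grid, obs, by = "Datetime", all.x = TRUE)
regular
```

The result has one row per minute, and the two missing samples appear as timestamped rows with an empty measurement instead of being silently absent.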
### Missing data
Missing data should be represented explicitly and consistently.
Requirements:
- a missing value must not change the number of fields in a row
- missing values should be encoded as empty fields
- the meaning of missingness should be documented in metadata where relevant (e.g., sensor saturation, device off-body, battery depletion, or processing failure)
If a distinction between different types of missingness is scientifically required, that distinction should be represented in separate status or quality-control columns rather than by mixing ad hoc string tokens into measurement columns.
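The first two requirements can be checked with base R, as in the sketch below. The file content and the `wear_status` column are hypothetical.

```r
# Hypothetical export with missing values encoded as empty fields
csv_text <- c(
  "Datetime,illuminance_lux,wear_status",
  "2026-03-09T14:30:00Z,120.5,on_body",
  "2026-03-09T14:31:00Z,,off_body",
  "2026-03-09T14:32:00Z,,"
)

# Count fields per line; empty fields still count as fields
con <- textConnection(csv_text)
n_fields <- count.fields(con, sep = ",", quote = '"')
close(con)
n_fields

# Import with empty fields treated as missing
dat <- read.csv(
  text = paste(csv_text, collapse = "\n"),
  na.strings = "", stringsAsFactors = FALSE
)
str(dat)
```

Every line reports the same number of fields, and the measurement column is imported as numeric with missing values where the fields were empty.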
### Complex data
In some cases, wearable device output is too complex to be fully represented in a single CSV file. Examples include images, videos, binary sensor payloads, or unavoidable proprietary sidecar files.
In such cases, the CSV file should contain:
- a timestamp column representing the relevant time point or start time
- a file reference column containing the file name or relative path of the associated object
- any required modality or event-type columns needed to interpret the referenced file
Associated files should be stored in a documented directory structure and described in metadata. File references should be relative rather than absolute so that datasets remain portable across systems.
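A sketch of such an index file and a simple portability check is shown below. The column names `modality` and `file_ref` and the paths are hypothetical, and the pattern for absolute paths is a heuristic covering common Unix, Windows, and home-directory forms.

```r
# Hypothetical index table that references image files
index <- data.frame(
  Datetime = c("2026-03-09T14:30:00Z", "2026-03-09T14:35:00Z"),
  modality = c("image", "image"),
  file_ref = c("images/img_0001.jpg", "/home/user/study/images/img_0002.jpg"),
  stringsAsFactors = FALSE
)

# Heuristic: leading "/", a drive letter such as "C:\", a UNC prefix, or "~"
is_absolute <- grepl("^(/|[A-Za-z]:[\\\\/]|\\\\\\\\|~)", index$file_ref, perl = TRUE)

# References that would break when the dataset is moved
index$file_ref[is_absolute]
```

Here the second reference is flagged because it only resolves on the machine that created it.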
### Python-specific implementation notes
The requirements above are language-agnostic, but the following conventions are especially important for robust Python ingestion with tools such as `pandas`, `polars`, and `pyarrow`:
- UTF-8 encoding should be used consistently
- column names should be unique and stable across software versions
- each column should have a single intended dtype
- datetimes should be parseable without custom locale settings
- delimiters, decimal separators, and missing-value encodings should be fixed across all exports from a device family
- very large files may optionally be distributed in compressed form (e.g., `.csv.gz`) provided the contained file remains a standards-compliant CSV
- schema changes between firmware or software versions should be versioned and documented in metadata
A minimal interoperable export should therefore provide:
- one CSV file with a single header row
- UTF-8 encoding
- one canonical UTC timestamp column named `Datetime`
- consistent rectangular rows
- stable column names
- metadata that defines variables, units, missingness rules, and time zone context
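For the optional compressed distribution mentioned above, the following sketch writes and reads a `.csv.gz` file with base R. The data and the temporary path are hypothetical; the contained file is an ordinary CSV.

```r
# Hypothetical minimal export
dat <- data.frame(
  Datetime = c("2026-03-09T14:30:00Z", "2026-03-09T14:31:00Z"),
  illuminance_lux = c(120.5, NA),
  stringsAsFactors = FALSE
)

gz_path <- tempfile(fileext = ".csv.gz")

# Write the CSV through a gzip connection with UTF-8 encoding
con <- gzfile(gz_path, open = "wt", encoding = "UTF-8")
write.table(dat, con, sep = ",", dec = ".", na = "",
            qmethod = "double", row.names = FALSE)
close(con)

# Read it back; read.csv() opens and closes the connection itself
back <- read.csv(gzfile(gz_path), na.strings = "", stringsAsFactors = FALSE)
back
```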
## Validation
### Example file
We provide an example file to download here: [⬇ Download CSV](data/wearable_light_example.csv){.btn .btn-primary download="wearable_light_example.csv"}
### Interactive validator
The validator accepts files of up to 10 MB.
::: {.column-page}
```{shinylive-r}
#| standalone: true
#| viewerHeight: 800
library(shiny)
library(vroom)
library(tools)
options(shiny.maxRequestSize = 10 * 1024^2)
`%||%` <- function(x, y) if (is.null(x) || length(x) == 0) y else x
escape_html <- function(x) {
# Escape "&" first so that the entities produced below are not escaped again
x <- gsub("&", "&amp;", x, fixed = TRUE)
x <- gsub("<", "&lt;", x, fixed = TRUE)
x <- gsub(">", "&gt;", x, fixed = TRUE)
x
}
syntactic_name_ok <- function(x) {
if (is.na(x) || !nzchar(x)) return(FALSE)
make.names(x) == x
}
is_unique <- function(x) length(unique(x)) == length(x)
guess_separator_from_header <- function(header_line) {
seps <- c("," = ",", ";" = ";", "\t" = "\t", "|" = "|")
counts <- vapply(seps, function(s) {
# Number of separators in the header = number of fields minus one
lengths(strsplit(header_line, s, fixed = TRUE)) - 1L
}, integer(1))
names(counts)[which.max(counts)]
}
safe_read_lines <- function(path, n = 2000L) {
is_gz <- identical(tolower(file_ext(path)), "gz")
con <- if (is_gz) gzfile(path, open = "rt") else file(path, open = "rb")
on.exit(close(con), add = TRUE)
readLines(con, n = n, warn = FALSE, encoding = "UTF-8")
}
parse_delimited_line <- function(line, delim = ",") {
if (is.na(line)) return(character(0))
parsed <- tryCatch(
utils::read.table(
text = line,
sep = delim,
quote = '"',
header = FALSE,
fill = TRUE,
comment.char = "",
stringsAsFactors = FALSE,
colClasses = "character"
),
error = function(e) NULL
)
if (is.null(parsed)) {
return(strsplit(line, delim, fixed = TRUE)[[1]])
}
vals <- as.character(parsed[1, , drop = TRUE])
vals[is.na(vals)] <- ""
vals
}
safe_read_vroom <- function(path, delim = ",", n_max = Inf, altrep = TRUE, skip = 0L) {
vroom::vroom(
file = path,
delim = delim,
skip = skip,
altrep = altrep,
show_col_types = FALSE,
progress = FALSE,
na = c("", "NA", "N/A", "NULL", "null"),
trim_ws = FALSE,
# detectCores() may return NA; na.rm = TRUE keeps the fallback of one thread
num_threads = max(1, parallel::detectCores(logical = TRUE) - 1, na.rm = TRUE),
n_max = n_max
)
}
detect_header_skip <- function(lines, delim = ",", required_col = "Datetime", max_scan = 100L) {
if (length(lines) == 0) {
return(list(skip = 0L, header_line = NA_character_, reason = "No lines available."))
}
scan_n <- min(length(lines), as.integer(max_scan))
candidate_lines <- lines[seq_len(scan_n)]
split_line <- function(line) parse_delimited_line(line, delim = delim)
parts <- lapply(candidate_lines, split_line)
clean_fields <- lapply(parts, function(x) trimws(gsub('^"|"$', "", x)))
score_line <- function(fields) {
if (length(fields) < 2) return(-Inf)
non_missing <- fields[nzchar(fields)]
has_required <- required_col %in% fields
unique_non_missing <- length(unique(non_missing)) == length(non_missing)
syntactic_fraction <- if (length(non_missing) == 0) 0 else mean(vapply(non_missing, syntactic_name_ok, logical(1)))
(if (has_required) 3 else 0) +
(if (unique_non_missing) 1.5 else 0) +
syntactic_fraction +
min(length(fields) / 10, 2)
}
scores <- vapply(clean_fields, score_line, numeric(1))
valid <- which(is.finite(scores))
if (length(valid) == 0) {
return(list(skip = 0L, header_line = lines[1], reason = "No plausible header detected; defaulted to first line."))
}
best_ix <- valid[which.max(scores[valid])]
list(
skip = as.integer(best_ix - 1L),
header_line = lines[best_ix],
reason = if (best_ix == 1L) "First line appears to be the header." else paste0("Detected likely header on line ", best_ix, ".")
)
}
skip_preamble_lines <- function(lines, skip = 0L) {
skip <- max(0L, as.integer(skip))
if (skip >= length(lines)) return(character(0))
lines[seq.int(skip + 1L, length(lines))]
}
parse_posix_utc <- function(x) {
suppressWarnings(as.POSIXct(x, tz = "UTC", format = "%Y-%m-%dT%H:%M:%OSZ"))
}
fraction_has_utc_z <- function(x) {
vals <- as.character(x)
vals <- trimws(vals)
vals <- vals[!is.na(vals) & nzchar(vals)]
if (length(vals) == 0) return(NA_real_)
mean(grepl("(?:Z|z)\\s*$", vals, perl = TRUE))
}
extract_datetime_raw_z_fraction <- function(lines, datetime_col = "Datetime", delim = ",") {
if (length(lines) < 2) return(NULL)
split_line <- function(line) parse_delimited_line(line, delim = delim)
header <- split_line(lines[1])
col_ix <- which(header == datetime_col)[1]
if (is.na(col_ix)) return(NULL)
body <- lines[-1]
if (length(body) == 0) return(NULL)
dt_vals <- vapply(body, function(line) {
fields <- split_line(line)
if (length(fields) < col_ix) return(NA_character_)
val <- trimws(fields[col_ix])
val <- sub('^"', '', val)
sub('"$', '', val)
}, character(1))
dt_vals <- dt_vals[!is.na(dt_vals) & nzchar(dt_vals)]
if (length(dt_vals) == 0) return(NULL)
list(
fraction = mean(grepl("(?:Z|z)\\s*$", dt_vals, perl = TRUE)),
n_non_missing = length(dt_vals),
col_ix = col_ix
)
}
nice_n <- function(x) format(x, big.mark = ",", scientific = FALSE, trim = TRUE)
format_preview_column <- function(x) {
if (inherits(x, "POSIXt")) {
tz <- attr(x, "tzone")
tz <- if (length(tz) == 0 || is.na(tz[1]) || !nzchar(tz[1])) "UTC" else tz[1]
out <- format(x, tz = tz, usetz = TRUE)
out[is.na(x)] <- NA_character_
return(out)
}
x
}
build_data_preview <- function(data, n = 10L) {
preview <- head(as.data.frame(data), as.integer(n))
as.data.frame(lapply(preview, format_preview_column), stringsAsFactors = FALSE)
}
make_check <- function(level, id, title, status, message, details = NULL) {
list(
level = level,
id = id,
title = title,
status = status,
message = message,
details = details
)
}
status_icon <- function(status) {
switch(
status,
pass = "✅",
warn = "⚠️",
fail = "❌",
info = "ℹ️",
"•"
)
}
status_class <- function(status) {
switch(
status,
pass = "success",
warn = "warning",
fail = "danger",
info = "info",
"secondary"
)
}
validate_level_1_file <- function(path) {
checks <- list()
info <- file.info(path)
if (is.na(info$size) || info$size <= 0) {
checks[[length(checks) + 1]] <- make_check(
1, "file_exists", "Readable file",
"fail",
"The uploaded file is empty or inaccessible."
)
return(checks)
}
checks[[length(checks) + 1]] <- make_check(
1, "file_exists", "Readable file",
"pass",
paste0("The file is accessible and has size ", nice_n(info$size), " bytes.")
)
ext <- tolower(file_ext(path))
if (ext %in% c("csv", "gz", "txt")) {
checks[[length(checks) + 1]] <- make_check(
1, "extension", "Filename extension",
"pass",
paste0("The file extension is .", ext, ".")
)
} else {
checks[[length(checks) + 1]] <- make_check(
1, "extension", "Filename extension",
"warn",
paste0("The file extension is .", ext, ". A .csv extension is recommended.")
)
}
checks
}
validate_level_2_lines <- function(lines, path, skip = 0L, skip_reason = NULL) {
checks <- list()
if (length(lines) == 0) {
checks[[length(checks) + 1]] <- make_check(
2, "lines_present", "Text content available",
"fail",
"No text lines could be read from the file."
)
return(checks)
}
checks[[length(checks) + 1]] <- make_check(
2, "lines_present", "Text content available",
"pass",
paste0("Successfully read ", nice_n(length(lines)), " initial lines for structural inspection.")
)
if (isTRUE(skip > 0)) {
checks[[length(checks) + 1]] <- make_check(
2, "header_skip", "Preamble/header offset",
"info",
paste0("Skipped ", skip, " line(s) before header detection and structure checks."),
details = skip_reason
)
}
header <- lines[1]
guessed_sep <- guess_separator_from_header(header)
guessed_label <- if (guessed_sep == "\t") "tab" else guessed_sep
if (guessed_sep == ",") {
checks[[length(checks) + 1]] <- make_check(
2, "separator_guess", "Separator heuristic",
"pass",
"The header appears to use a comma separator."
)
} else {
checks[[length(checks) + 1]] <- make_check(
2, "separator_guess", "Separator heuristic",
"warn",
paste0("The header appears more consistent with the separator '", guessed_label, "' than with ','.")
)
}
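# strsplit() drops a trailing empty string, which would undercount rows whose
# last field is empty (i.e., missing); appending a separator avoids this.
field_counts_csv <- vapply(lines, function(z) length(strsplit(paste0(z, ","), ",", fixed = TRUE)[[1]]), integer(1))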
unique_counts <- sort(unique(field_counts_csv))
if (length(unique_counts) == 1) {
if (unique_counts == 1) {
checks[[length(checks) + 1]] <- make_check(
2, "single_column_csv", "Column count under comma parsing",
"warn",
"Parsing the inspected lines with ',' yields only one column. This often indicates that the wrong separator was used."
)
} else {
checks[[length(checks) + 1]] <- make_check(
2, "rectangular_text", "Rectangular structure under comma parsing",
"pass",
paste0("All inspected lines have ", unique_counts, " fields when split on ','.")
)
}
} else {
details <- paste0("Observed field counts in inspected lines: ", paste(unique_counts, collapse = ", "))
checks[[length(checks) + 1]] <- make_check(
2, "rectangular_text", "Rectangular structure under comma parsing",
"warn",
"The inspected lines do not all have the same number of comma-separated fields. This may indicate a non-rectangular file, quoting problems, or the wrong separator.",
details = details
)
}
checks
}
validate_level_3_import <- function(path, delim = ",", skip = 0L) {
checks <- list()
dat <- NULL
err <- NULL
dat <- tryCatch(
safe_read_vroom(path, delim = delim, skip = skip),
error = function(e) {
err <<- conditionMessage(e)
NULL
}
)
if (is.null(dat)) {
checks[[length(checks) + 1]] <- make_check(
3, "vroom_read", "Read with vroom",
"fail",
"The file could not be read with vroom using a comma separator.",
details = err
)
return(list(checks = checks, data = NULL))
}
checks[[length(checks) + 1]] <- make_check(
3, "vroom_read", "Read with vroom",
"pass",
paste0(
"The file was read successfully with vroom",
if (skip > 0) paste0(" after skipping ", skip, " preamble line(s)") else "",
". Detected ", ncol(dat), " columns and ", nice_n(nrow(dat)), " rows."
)
)
cls <- vapply(dat, function(x) class(x)[1], character(1))
detected <- paste(paste(names(cls), cls, sep = ": "), collapse = "; ")
checks[[length(checks) + 1]] <- make_check(
3, "detected_types", "Auto-detected column types",
"info",
"Detected primary column classes.",
details = detected
)
list(checks = checks, data = dat)
}
validate_level_4_names <- function(dat) {
checks <- list()
nms <- names(dat)
if (is_unique(nms)) {
checks[[length(checks) + 1]] <- make_check(
4, "unique_names", "Unique variable names",
"pass",
"All variable names are unique."
)
} else {
dupes <- unique(nms[duplicated(nms)])
checks[[length(checks) + 1]] <- make_check(
4, "unique_names", "Unique variable names",
"fail",
"Duplicate variable names were found.",
details = paste(dupes, collapse = ", ")
)
}
syntactic <- vapply(nms, syntactic_name_ok, logical(1))
if (all(syntactic)) {
checks[[length(checks) + 1]] <- make_check(
4, "syntactic_names", "Syntactic variable names",
"pass",
"All variable names are syntactic."
)
} else {
bad <- nms[!syntactic]
suggestions <- paste0(bad, " → ", make.names(bad))
checks[[length(checks) + 1]] <- make_check(
4, "syntactic_names", "Syntactic variable names",
"warn",
"Some variable names are not syntactic.",
details = paste(suggestions, collapse = "; ")
)
}
checks
}
validate_level_5_datetime <- function(dat) {
checks <- list()
classes <- vapply(dat, function(x) class(x)[1], character(1))
posix_cols <- names(classes)[classes %in% c("POSIXct", "POSIXlt")]
if (length(posix_cols) > 0) {
checks[[length(checks) + 1]] <- make_check(
5, "posix_detected", "Datetime column detected",
"pass",
paste0("Detected POSIX datetime column(s): ", paste(posix_cols, collapse = ", "), ".")
)
} else {
char_cols <- names(classes)[classes %in% c("character", "vroom_chr")]
parsed_hits <- character(0)
if (length(char_cols) > 0) {
for (nm in char_cols) {
vals <- dat[[nm]]
vals <- vals[!is.na(vals) & nzchar(vals)]
vals <- head(vals, 2000L)
if (length(vals) == 0) next
parsed <- parse_posix_utc(vals)
frac <- mean(!is.na(parsed))
if (isTRUE(frac > 0.9)) parsed_hits <- c(parsed_hits, nm)
}
}
if (length(parsed_hits) > 0) {
checks[[length(checks) + 1]] <- make_check(
5, "posix_detected", "Datetime-like column detected",
"warn",
paste0("No POSIXct column was auto-detected, but these columns look parseable as UTC datetimes: ", paste(parsed_hits, collapse = ", "), ".")
)
} else {
checks[[length(checks) + 1]] <- make_check(
5, "posix_detected", "Datetime column detected",
"fail",
"No POSIXct column was auto-detected and no clearly parseable datetime-like column was found."
)
}
}
if ("Datetime" %in% names(dat)) {
checks[[length(checks) + 1]] <- make_check(
5, "datetime_name", "Suggested timestamp name",
"pass",
"A column named 'Datetime' is present."
)
} else {
possible <- grep("date|time|datetime|timestamp", names(dat), ignore.case = TRUE, value = TRUE)
details <- if (length(possible)) paste(possible, collapse = ", ") else NULL
checks[[length(checks) + 1]] <- make_check(
5, "datetime_name", "Suggested timestamp name",
"warn",
"No column named 'Datetime' was found. This is recommended for interoperability.",
details = details
)
}
checks
}
validate_level_6_temporal <- function(dat) {
checks <- list()
dt_col <- NULL
if ("Datetime" %in% names(dat)) {
dt_col <- "Datetime"
} else {
posix_cols <- names(dat)[vapply(dat, function(x) inherits(x, "POSIXt"), logical(1))]
if (length(posix_cols) > 0) dt_col <- posix_cols[1]
}
if (is.null(dt_col)) {
checks[[length(checks) + 1]] <- make_check(
6, "temporal_checks", "Temporal validation",
"info",
"Temporal validation was skipped because no timestamp column could be identified."
)
return(checks)
}
x <- dat[[dt_col]]
if (!inherits(x, "POSIXt")) {
parsed <- parse_posix_utc(as.character(x))
if (sum(!is.na(parsed)) == 0) {
checks[[length(checks) + 1]] <- make_check(
6, "temporal_checks", "Temporal validation",
"fail",
paste0("The candidate timestamp column '", dt_col, "' could not be parsed as UTC datetimes.")
)
return(checks)
}
x <- parsed
}
non_missing <- !is.na(x)
x2 <- x[non_missing]
if (length(x2) < 2) {
checks[[length(checks) + 1]] <- make_check(
6, "enough_timestamps", "Sufficient timestamps",
"warn",
"Fewer than two valid timestamps are available, so spacing cannot be assessed."
)
return(checks)
}
is_sorted <- !is.unsorted(x2, strictly = FALSE)
checks[[length(checks) + 1]] <- make_check(
6, "sorted_time", "Ascending timestamp order",
if (is_sorted) "pass" else "warn",
if (is_sorted) "Timestamps are in ascending order." else "Timestamps are not in ascending order."
)
any_dupes <- any(duplicated(x2))
checks[[length(checks) + 1]] <- make_check(
6, "unique_time", "Unique timestamps",
if (!any_dupes) "pass" else "warn",
if (!any_dupes) "All non-missing timestamps are unique." else "Duplicate timestamps were found."
)
dx <- as.numeric(diff(x2), units = "secs")
dx <- dx[!is.na(dx)]
if (length(dx) == 0) {
checks[[length(checks) + 1]] <- make_check(
6, "spacing_regular", "Regular observation spacing",
"warn",
"Timestamp intervals could not be assessed."
)
return(checks)
}
tab <- sort(table(dx), decreasing = TRUE)
modal_step <- as.numeric(names(tab)[1])
regular <- length(unique(dx)) == 1
msg <- if (regular) {
paste0("Observation spacing is regular at ", modal_step, " second(s).")
} else {
paste0("Observation spacing is not fully regular. Most common interval: ", modal_step, " second(s).")
}
details <- paste0("Unique interval counts (seconds): ", paste(paste(names(tab), tab, sep = "→"), collapse = ", "))
checks[[length(checks) + 1]] <- make_check(
6, "spacing_regular", "Regular observation spacing",
if (regular) "pass" else "warn",
msg,
details = details
)
checks
}
validate_level_7_content <- function(dat, lines = NULL, lines_is_sample = FALSE) {
checks <- list()
if (ncol(dat) == 1) {
checks[[length(checks) + 1]] <- make_check(
7, "single_column_import", "Single imported column",
"fail",
"The file imported as a single column. This strongly suggests an incorrect separator or malformed structure."
)
} else {
checks[[length(checks) + 1]] <- make_check(
7, "single_column_import", "Single imported column",
"pass",
paste0("The file imported as ", ncol(dat), " columns.")
)
}
if ("Datetime" %in% names(dat)) {
raw_z <- extract_datetime_raw_z_fraction(lines)
if (!is.null(raw_z) && !isTRUE(lines_is_sample)) {
checks[[length(checks) + 1]] <- make_check(
7, "utc_suffix", "UTC Z suffix in Datetime",
if (isTRUE(raw_z$fraction > 0.95)) "pass" else "warn",
if (isTRUE(raw_z$fraction > 0.95)) {
"Most Datetime values end with 'Z' in the raw file, consistent with UTC representation."
} else {
"Many Datetime values do not end with 'Z' in the raw file. UTC with explicit 'Z' is recommended."
},
details = paste0(
"Checked ", nice_n(raw_z$n_non_missing),
" non-missing raw Datetime values in column ", raw_z$col_ix, "."
)
)
} else if (!is.null(raw_z) && isTRUE(lines_is_sample)) {
checks[[length(checks) + 1]] <- make_check(
7, "utc_suffix", "UTC Z suffix in Datetime",
"info",
"UTC suffix estimation from raw lines is preview-only because the file is longer than the inspected slice.",
details = paste0(
"Preview estimate from ", nice_n(raw_z$n_non_missing),
" non-missing Datetime values: ", round(raw_z$fraction * 100, 1), "% with trailing Z/z."
)
)
} else {
datetime_values <- dat$Datetime
z_frac <- fraction_has_utc_z(datetime_values)
if (!is.na(z_frac)) {
checks[[length(checks) + 1]] <- make_check(
7, "utc_suffix", "UTC Z suffix in Datetime",
if (isTRUE(z_frac > 0.95)) "pass" else "warn",
if (isTRUE(z_frac > 0.95)) {
"Most Datetime values end with 'Z', consistent with UTC representation."
} else {
"Many Datetime values do not end with 'Z'. UTC with explicit 'Z' is recommended."
},
details = "Raw Datetime extraction was unavailable; suffix check was performed on imported Datetime values."
)
}
}
}
checks
}
run_all_validations <- function(path, skip_lines = NULL, auto_detect_skip = TRUE) {
raw_lines_preview <- safe_read_lines(path, n = 5001L)
lines_is_sample <- length(raw_lines_preview) > 5000L
raw_lines <- if (lines_is_sample) raw_lines_preview[seq_len(5000L)] else raw_lines_preview
can_auto_detect <- isTRUE(auto_detect_skip) && !identical(tolower(file_ext(path)), "gz")
detected <- if (can_auto_detect) detect_header_skip(raw_lines) else list(
skip = 0L,
header_line = if (length(raw_lines) > 0) raw_lines[1] else NA_character_,
reason = if (identical(tolower(file_ext(path)), "gz")) {
"Auto-detection was disabled for compressed input; defaulted to first line unless manual skip is provided."
} else {
"Auto-detection disabled; defaulted to first line unless manual skip is provided."
}
)
effective_skip <- if (isTRUE(auto_detect_skip) && is.null(skip_lines)) {
detected$skip
} else {
max(0L, as.integer(skip_lines %||% 0L))
}
lines <- skip_preamble_lines(raw_lines, effective_skip)
out <- list()
out$level_1 <- validate_level_1_file(path)
out$level_2 <- validate_level_2_lines(lines, path, skip = effective_skip, skip_reason = detected$reason)
imported <- validate_level_3_import(path, delim = ",", skip = effective_skip)
out$level_3 <- imported$checks
dat <- imported$data
if (!is.null(dat)) {
out$level_4 <- validate_level_4_names(dat)
out$level_5 <- validate_level_5_datetime(dat)
out$level_6 <- validate_level_6_temporal(dat)
out$level_7 <- validate_level_7_content(dat, lines = lines, lines_is_sample = lines_is_sample)
} else {
out$level_4 <- list()
out$level_5 <- list()
out$level_6 <- list()
out$level_7 <- list()
}
out$data <- dat
out$lines <- lines
out$raw_lines <- raw_lines
out$skip_lines <- effective_skip
out$lines_is_sample <- lines_is_sample
out
}
flatten_checks <- function(results) {
lvls <- grep("^level_", names(results), value = TRUE)
unlist(results[lvls], recursive = FALSE)
}
check_card_ui <- function(chk) {
div(
# paste0() avoids the space that paste() would insert after "border-" and "bg-".
class = paste0("card border-", status_class(chk$status), " mb-3"),
div(
class = paste0("card-header bg-", status_class(chk$status), " text-white"),
tags$strong(paste(status_icon(chk$status), chk$title))
),
div(
class = "card-body",
tags$p(class = "card-text", chk$message),
if (!is.null(chk$details)) {
tags$details(
class = "details-toggle",
tags$summary("Details"),
tags$pre(style = "white-space: pre-wrap;", chk$details)
)
}
)
)
}
worst_status <- function(checks) {
if (length(checks) == 0) return("info")
statuses <- vapply(checks, `[[`, character(1), "status")
if ("fail" %in% statuses) return("fail")
if ("warn" %in% statuses) return("warn")
if ("pass" %in% statuses) return("pass")
"info"
}
level_tab_title <- function(stage_title, checks) {
status <- worst_status(checks)
count <- length(checks)
paste0(status_icon(status), " ", stage_title, " (", count, ")")
}
level_panel_ui <- function(level_title, checks) {
tagList(
tags$h4(level_title),
if (length(checks) == 0) {
div(class = "alert alert-light", "No checks available at this stage.")
} else {
lapply(checks, check_card_ui)
}
)
}
ui <- fluidPage(
tags$head(
tags$style(HTML("
body { padding-bottom: 40px; }
.main-title { font-weight: 700; margin-bottom: 0.8rem; }
.sidebar-note { font-size: 0.95rem; color: #555; }
.preview-box {
border: 1px solid #d9e0ea;
border-radius: 8px;
padding: 12px;
background: #f8fbff;
max-height: 340px;
overflow-y: auto;
font-family: monospace;
white-space: pre-wrap;
}
.card {
border-radius: 10px;
box-shadow: 0 1px 3px rgba(12, 28, 48, 0.08);
}
.details-toggle {
margin-top: 0.5rem;
border: 1px dashed #b5c2d3;
border-radius: 8px;
padding: 0.5rem 0.65rem;
background: #f8fbff;
}
.details-toggle > summary {
font-weight: 700;
color: #0d6efd;
cursor: pointer;
}
.details-toggle > summary:hover {
text-decoration: underline;
}
"))
),
tags$h2(class = "main-title", "Wearable CSV Validator"),
fluidRow(
column(
width = 3,
fileInput(
"file",
"Upload a wearable CSV file (max 20 MB)",
accept = c(".csv", ".txt", ".gz")
),
checkboxInput("show_preview", "Show raw line preview", value = TRUE),
numericInput("preview_n", "Preview lines", value = 20, min = 5, max = 100, step = 5),
checkboxInput("auto_skip_preamble", "Auto-detect and skip preamble lines", value = TRUE),
numericInput("manual_skip_lines", "Manual lines to skip before header", value = 0, min = 0, step = 1),
tags$p(
class = "sidebar-note",
"The validator performs staged checks: file access, text structure, import, variable names, datetime handling, time-series regularity, and content conventions."
),
hr(),
uiOutput("summary_box")
),
column(
width = 9,
uiOutput("results_tabs")
)
)
)
server <- function(input, output, session) {
validation_results <- reactive({
req(input$file)
run_all_validations(
input$file$datapath,
skip_lines = if (isTRUE(input$auto_skip_preamble)) NULL else input$manual_skip_lines,
auto_detect_skip = isTRUE(input$auto_skip_preamble)
)
})
all_checks <- reactive({
flatten_checks(validation_results())
})
output$summary_box <- renderUI({
req(all_checks())
chks <- all_checks()
statuses <- vapply(chks, `[[`, character(1), "status")
counts <- table(factor(statuses, levels = c("pass", "warn", "fail", "info")))
div(
class = "card",
div(class = "card-header", tags$strong("Validation summary")),
div(
class = "card-body",
tags$p(paste("Pass:", counts["pass"] %||% 0)),
tags$p(paste("Warnings:", counts["warn"] %||% 0)),
tags$p(paste("Failures:", counts["fail"] %||% 0)),
tags$p(paste("Info:", counts["info"] %||% 0))
)
)
})
output$results_tabs <- renderUI({
base_tabs <- list(
tabPanel("Data preview", tableOutput("data_preview")),
tabPanel("Raw lines", uiOutput("raw_preview"))
)
if (is.null(input$file)) {
return(do.call(tabsetPanel, c(list(id = "validation_tabs"), base_tabs)))
}
results <- validation_results()
do.call(tabsetPanel, c(
list(
id = "validation_tabs",
tabPanel(level_tab_title("Stage 1: File", results$level_1), uiOutput("level_1_ui")),
tabPanel(level_tab_title("Stage 2: Structure", results$level_2), uiOutput("level_2_ui")),
tabPanel(level_tab_title("Stage 3: Import", results$level_3), uiOutput("level_3_ui")),
tabPanel(level_tab_title("Stage 4: Names", results$level_4), uiOutput("level_4_ui")),
tabPanel(level_tab_title("Stage 5: Datetime", results$level_5), uiOutput("level_5_ui")),
tabPanel(level_tab_title("Stage 6: Time series", results$level_6), uiOutput("level_6_ui")),
tabPanel(level_tab_title("Stage 7: Content", results$level_7), uiOutput("level_7_ui"))
),
base_tabs
))
})
output$level_1_ui <- renderUI({
req(validation_results())
level_panel_ui("Stage 1: Basic file checks", validation_results()$level_1)
})
output$level_2_ui <- renderUI({
req(validation_results())
level_panel_ui("Stage 2: Text structure from readLines()", validation_results()$level_2)
})
output$level_3_ui <- renderUI({
req(validation_results())
level_panel_ui("Stage 3: Import with vroom()", validation_results()$level_3)
})
output$level_4_ui <- renderUI({
req(validation_results())
level_panel_ui("Stage 4: Variable names", validation_results()$level_4)
})
output$level_5_ui <- renderUI({
req(validation_results())
level_panel_ui("Stage 5: Datetime detection", validation_results()$level_5)
})
output$level_6_ui <- renderUI({
req(validation_results())
level_panel_ui("Stage 6: Temporal regularity", validation_results()$level_6)
})
output$level_7_ui <- renderUI({
req(validation_results())
level_panel_ui("Stage 7: Content conventions", validation_results()$level_7)
})
output$data_preview <- renderTable({
req(validation_results()$data)
build_data_preview(validation_results()$data, n = 10)
}, striped = TRUE, bordered = TRUE, hover = TRUE)
output$raw_preview <- renderUI({
req(validation_results())
if (!isTRUE(input$show_preview)) {
return(div(class = "alert alert-light", "Raw preview is disabled."))
}
lines <- head(validation_results()$raw_lines, input$preview_n)
div(class = "preview-box", HTML(paste(escape_html(lines), collapse = "\n")))
})
}
shinyApp(ui, server)
```
:::
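
The validator app above inspects at most 5000 raw lines and requests one extra line so that it can tell a fully read file from a truncated preview. The following is a minimal base R sketch of that pattern, not the app's actual helper: `read_preview()` and its `limit` argument are hypothetical names introduced for illustration, and the app's own `safe_read_lines()` may handle errors and encodings differently.

```{r}
# Hypothetical helper: read up to `limit` lines and report whether the file
# contains more lines than were kept.
read_preview <- function(path, limit = 5000L) {
  # Ask for one line more than the limit; getting it proves the file is longer.
  lines <- readLines(path, n = limit + 1L, warn = FALSE)
  is_sample <- length(lines) > limit
  list(
    lines = if (is_sample) lines[seq_len(limit)] else lines,
    is_sample = is_sample
  )
}

# Small demonstration file: one header line plus ten records (11 lines).
tmp <- tempfile(fileext = ".csv")
writeLines(
  c("Datetime,lux", sprintf("2024-01-01T00:00:%02dZ,%d", 0:9, 0:9)),
  tmp
)

short <- read_preview(tmp, limit = 5L)   # file exceeds the limit
full <- read_preview(tmp, limit = 100L)  # file fits within the limit

short$is_sample
full$is_sample
```

Checks that depend on every line of the file can then be downgraded to informational status whenever `is_sample` is `TRUE`, which mirrors how the app reports its preview-only UTC suffix estimate.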