extract_clusters() searches for and summarizes clusters where data meets a certain condition. Clusters have a specified duration and can be interrupted while still counting as one cluster. The variable can either be a column in the dataset or an expression that gets evaluated in a dplyr::mutate() call.
Cluster start and end times are each shifted by half an epoch. For example, a state lasting for 4 measurement points will have a duration of 4 measurement intervals, and a state occurring only once will have a duration of one interval. This deviates from simply using the time difference between the first and last occurrence, which would be one epoch shorter (e.g., the start and end points of a state lasting a single point are identical, i.e., zero duration).
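The half-epoch shift can be checked with a few lines of base R. This is only a sketch of the arithmetic described above, not the package's internal code. The timestamps and the 10-second epoch are made-up values for illustration.

```r
# Hypothetical data: one state lasting 4 measurement points at a 10 s epoch
epoch <- 10  # seconds between measurements (assumed for this illustration)
times <- as.POSIXct("2023-08-31 10:00:00", tz = "UTC") + epoch * 0:3

# Naive duration: time difference between last and first occurrence
naive <- as.numeric(difftime(max(times), min(times), units = "secs"))

# Shifted boundaries: start half an epoch earlier, end half an epoch later
start <- min(times) - epoch / 2
end <- max(times) + epoch / 2
shifted <- as.numeric(difftime(end, start, units = "secs"))

# The naive value is one epoch shorter than the shifted one
c(naive = naive, shifted = shifted)

# A state occurring at a single point: zero naive duration, one epoch when shifted
single <- times[1]
single_naive <- as.numeric(difftime(single, single, units = "secs"))
single_shifted <- as.numeric(
  difftime(single + epoch / 2, single - epoch / 2, units = "secs")
)
```

With these values, the shifted duration equals the number of measurement points times the epoch, which is the convention described above.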
For correct cluster identification, there can be no gaps in the data! Gaps can inadvertently be introduced to a gapless dataset through grouping. E.g., when grouping by photoperiod (day/night) within a participant, this introduces gaps between the individual days and nights that together form the group. To avoid this, either group by individual days and nights (e.g., by using number_states() before grouping), which ensures that a cluster cannot extend beyond a group boundary, or set handle.gaps = TRUE (at computational cost).
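How grouping creates gaps can be seen in a small base R sketch. The data are made up: a gapless hourly series over two days, with a hypothetical day/night split at 06:00 and 18:00. The run-length counter `bout` is an illustrative stand-in for numbering consecutive states and is not the package's implementation.

```r
# Hypothetical gapless series: hourly measurements over two days
epoch <- 3600
times <- as.POSIXct("2023-08-31 00:00:00", tz = "UTC") + epoch * 0:47
secs <- as.numeric(times)

# Assumed photoperiod split for illustration: day from 06:00 to 17:59
hour <- as.integer(format(times, "%H"))
photoperiod <- ifelse(hour >= 6 & hour < 18, "day", "night")

# Largest step (in seconds) between consecutive observations within each group
max_step <- function(x) max(diff(x))

# Grouping by photoperiod pools all days (and all nights): gaps appear
by_photoperiod <- sapply(split(secs, photoperiod), max_step)

# Numbering each consecutive day or night bout first keeps every group gapless
bout <- cumsum(c(TRUE, photoperiod[-1] != photoperiod[-length(photoperiod)]))
by_bout <- sapply(split(secs, bout), max_step)

by_photoperiod  # both groups contain a 13-hour jump (46800 s)
by_bout         # every bout steps by exactly one epoch (3600 s)
```

Within the pooled "day" group, the last measurement of the first day is followed directly by the first measurement of the second day, so the largest step is far greater than the epoch. Within individual bouts, no such jump exists.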
add_clusters() identifies clusters and adds them back into the dataset through a rolling join. This is a convenience function built on extract_clusters().
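The idea of joining cluster intervals back onto individual observations can be sketched in base R. This is an assumption-laden illustration of a rolling match using findInterval(), not the way add_clusters() is implemented. The `obs` and `clusters` data frames and their values are hypothetical.

```r
# Hypothetical observations every 10 s and two hypothetical cluster intervals
origin <- as.POSIXct("2023-08-31 10:00:00", tz = "UTC")
obs <- data.frame(Datetime = origin + 10 * 0:9)
clusters <- data.frame(
  state = c("1", "2"),
  start = origin + c(5, 55),
  end = origin + c(35, 85)
)

# Roll each observation back to the most recent cluster start (0 = none yet)
t <- as.numeric(obs$Datetime)
idx <- findInterval(t, as.numeric(clusters$start))

# Keep the match only if the observation also lies before that cluster's end
safe_idx <- pmax(idx, 1)
inside <- idx > 0 & t <= as.numeric(clusters$end)[safe_idx]
obs$state <- ifelse(inside, clusters$state[safe_idx], NA_character_)

obs
```

Observations outside every interval receive NA, which mirrors the NA rows that appear when counting the state column of a dataset returned by add_clusters().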
Usage
extract_clusters(
  data,
  Variable,
  Datetime.colname = Datetime,
  cluster.duration = "30 mins",
  duration.type = c("min", "max"),
  interruption.duration = 0,
  interruption.type = c("max", "min"),
  cluster.colname = state.count,
  return.only.clusters = TRUE,
  handle.gaps = FALSE
)
add_clusters(
  data,
  Variable,
  Datetime.colname = Datetime,
  cluster.duration = "30 mins",
  duration.type = c("min", "max"),
  interruption.duration = 0,
  interruption.type = c("max", "min"),
  cluster.colname = state,
  handle.gaps = FALSE
)
Arguments
- data
A light logger dataset. Expects a dataframe.
- Variable
The variable or condition to be evaluated for clustering. Can be a column name or an expression.
- Datetime.colname
Column name that contains the datetime. Defaults to "Datetime", which is automatically correct for data imported with LightLogR. Expects a symbol.
- cluster.duration
The minimum or maximum duration of a cluster. Defaults to 30 minutes. Expects a lubridate duration object (or a numeric in seconds).
- duration.type
Type of the duration requirement for clusters. Either "min" (minimum duration) or "max" (maximum duration). Defaults to "min".
- interruption.duration
The duration of allowed interruptions within a cluster. Defaults to 0 (no interruptions allowed).
- interruption.type
Type of the interruption duration. Either "max" (maximum interruption) or "min" (minimum interruption). Defaults to "max".
- cluster.colname
Name of the column to use for the cluster identification. Defaults to "state.count" for extract_clusters() and to "state" for add_clusters(). Expects a symbol.
- return.only.clusters
Whether to return only the identified clusters (TRUE) or also include non-clusters (FALSE). Defaults to TRUE.
- handle.gaps
Logical indicating whether the data shall be treated with gap_handler(). Set to FALSE by default, due to computational costs.
Value
For extract_clusters(), a dataframe containing the identified clusters or all time periods, depending on return.only.clusters.
For add_clusters(), a dataframe containing the original data with an additional column for cluster identification.
Examples
dataset <-
  sample.data.environment |>
  dplyr::filter(Id == "Participant")

# Extract clusters with minimum duration of 1 hour and interruptions of up to 5 minutes
dataset |>
  extract_clusters(
    MEDI > 1000,
    cluster.duration = "1 hour",
    interruption.duration = "5 mins"
  )
#> # A tibble: 8 × 6
#> # Groups:   Id [1]
#>   Id          state.count start               end                 epoch
#>   <fct>       <chr>       <dttm>              <dttm>              <Duration>
#> 1 Participant 1           2023-08-31 10:22:59 2023-08-31 11:26:39 10s
#> 2 Participant 2           2023-09-01 16:09:29 2023-09-01 18:47:19 10s
#> 3 Participant 3           2023-09-02 12:32:39 2023-09-02 13:56:09 10s
#> 4 Participant 4           2023-09-02 14:06:59 2023-09-02 15:48:49 10s
#> 5 Participant 5           2023-09-02 16:55:19 2023-09-02 19:26:39 10s
#> 6 Participant 6           2023-09-03 12:41:29 2023-09-03 13:55:59 10s
#> 7 Participant 7           2023-09-03 15:07:59 2023-09-03 16:18:19 10s
#> 8 Participant 8           2023-09-03 16:52:29 2023-09-03 18:44:59 10s
#> # ℹ 1 more variable: duration <Duration>

# Add clusters to a dataset where lux values are above 1000 for at least 30 minutes
dataset_with_clusters <-
  sample.data.environment |> add_clusters(MEDI > 1000)

# Count the observations per cluster (NA marks observations outside any cluster)
dataset_with_clusters |> dplyr::count(state)
#> # A tibble: 12 × 3
#> # Groups:   Id [2]
#>    Id          state     n
#>    <fct>       <chr> <int>
#>  1 Environment 1      1535
#>  2 Environment 2      1579
#>  3 Environment 3      1541
#>  4 Environment 4      1544
#>  5 Environment 5      1577
#>  6 Environment 6      1550
#>  7 Environment NA     7954
#>  8 Participant 1       429
#>  9 Participant 2       211
#> 10 Participant 3       189
#> 11 Participant 4       185
#> 12 Participant NA    50826