1 Introduction

“Data is the new gold” is a commonly touted phrase. In healthcare, data is critical, as its proper collection and analysis can improve quality of life and even save lives. Nowadays, people record their daily activities digitally through a significant number of devices and sensors. Such digital recording of a person’s activity and lifestyle is referred to as lifelogging [15]. Recording and analyzing such lifelog data provides a great opportunity for studying an individual’s life experience. It can help monitor a person’s activity to improve health and well-being [24], help recover memories of past events [31], or analyze social behaviour [6, 21]. Lifelogs are also sources of vast, rich data for research. For instance, Chokr et al. [4] described a machine learning approach to predict the number of calories from food images, and De Choudhury et al. [5] described how interaction on social media can influence mental health. Therefore, lifelogs containing various types of information from a person’s daily lifestyle are valuable in numerous contexts.

Although lifelogs might contain data that is highly valuable for research, they are often not available to researchers. A lifelog is typically not stored centrally by a single service that can be easily accessed, but rather exists as the union of data stored in a large number of online and offline data silos [19]. Still, some datasets exist, and existing lifelogging datasets [18] usually contain a person’s daily life activities automatically captured and recorded using smartphone applications, wearable devices, and other sensors. One example is the NTCIR Lifelog test collection [14] consisting of lifelogging datasets for the NTCIR-12/13/14 lifelog tasks, which was first released at the NTCIR-12 conference [13]. The images in this dataset are captured by wearable cameras carried by two different lifeloggers. Some work has been done with similar datasets, for example, retrieving moments of interest [8, 21]. However, a key challenge in lifelogging research is the poor availability of test collections [7]. Hence, there is a need for more available lifelog datasets, especially ones collected over longer time periods and with multiple modalities and purposes.

Capturing daily life events is also something many sports professionals do. Athletes have long kept training diaries, first with pen and paper and, more recently, in digital logging systems. Now, the use of wearables to measure activity and its intensity, both in top sport and among the regular physically active population, helps to improve performance, recovery, and other aspects of health [9]. A challenge is to make sense of the data, and often, the captured data is limited to self-reports since activity logs from smartwatches and phones are hard to understand. Thus, steps are still needed to integrate data [11] and to find standardized ways to analyze, evaluate, and present data [10]. Another problem in sport is that professional athletes do not control the captured data by themselves, and they need the assistance of coaches, physicians, or technical support staff [19]. This process adds the burden of informed consent, authorization, and privacy. Furthermore, a trainer or team doctor does not have time to properly evaluate the myriad sensor data from the athletes to detect relevant performance data that could be used to improve training.

Moreover, dietary intake is an important factor affecting metabolic regulation, thereby affecting human health and fitness. However, individual variation in the response to dietary components has been known for a long time [22, 23, 29], and evidence has shown that people eating identical meals show high variability in metabolic responses, such as blood glucose and lipids [2, 12, 16, 25, 27, 33]. Hence, one set of dietary advice may not be equally effective for all individuals. To promote a healthy lifestyle and reduce the risk of disease, we need to investigate individual variation using multi-dimensional and high-resolution continuous time-series of markers of individual health and fitness. Furthermore, we need to investigate how various factors correlate. With new technology and analytical tools, we may be able to develop more personalized approaches, but to achieve this, access to relevant data is crucial.

To aid these efforts, automatic methods to analyze sensor data and quantify self-reports may play a crucial role in retrieving the information that people need. To perform these analyses on the increasing volume of data produced by different devices, new methods and tools are needed. ScopeSense is made available in an effort to enable the development of such support systems. We provide a starting point by combining the idea of lifelogging data collection with activity logging. Multiple activity-specific analyses can be performed on such data, such as predicting sports performance or weight loss or gain, but there is a lack of available datasets. We have therefore logged several objective and subjective parameters of a person’s daily life together with all food and drink intake. Specifically, we have used the following systems to capture the data:

  • The Apple Watch series 6 smartwatch to track 24/7 activity and heart rate, and training sessions, electrocardiogram (ECG), sleep, blood oxygen, etc.

  • The PMSys sports logging app to track subjective wellness parameters such as sleep, mood, stress, fatigue, muscle soreness; training load with session rating of perceived exertion (sRPE) and length; and injuries with body location and severity.

  • The Lifesum food tracking app, where all intake of food and drinks is self-reported and logged together with the person’s weight, and the nutrient content is calculated.

ScopeSense contains logging data between 8 February 2021 and 20 October 2021 (255 days) from two individuals. To the best of our knowledge, ScopeSense is the only available dataset that combines subjective and objective parameters covering daily lifestyle and activities over a long period of time. The dataset is openly available for research.

In the rest of the paper, we describe the data collection procedure in Sect. 2, and in Sect. 3, we describe the dataset in detail. Section 4 presents some initial analysis of the data. Furthermore, we provide examples of possible use-cases and applications of the dataset in Sect. 5, before we conclude the paper in Sect. 6.

2 Data Collection

Based on experiences from previous datasets [30], we conjecture that there is a need for even more detail in a lifelog covering activity, wellness, and nutrition data, collected over longer time periods. In this respect, ScopeSense is a 255-day dataset containing the lifelogs of two individuals with objective biometrics and activity data, food and nutrition, and subjective wellness and exercise load. During the data collection phase, the participants aimed for regular running and strength workout sessions. The participants gave written consent to publish and distribute the data. Moreover, the data collection has been reported to the Norwegian Centre for Research Data (NSD), and has been assessed and approved (reference #294827). Additionally, in response to our application, the regional committees for medical and healthcare research ethics (REK) concluded that no approval from them was needed for this data collection (reference #506192).

2.1 Objective Biometrics and Activity Data

We used an Apple Watch (series 6, hardware version 6.2, software version 7.6.1), in combination with an iPhone, to collect objective biometric and activity data. This watch was chosen because it was equipped with the most available sensors in one device at that time. Most of the biometric data were recorded automatically by wearing the smartwatch 24/7; however, for some measurements, we manually started a recording using the various apps on the smartwatch (Fig. 1):

  • All exercise and training data were logged using the Apple Workout app. The user starts a session for the specified type of exercise and ends it when the exercise is finished.

  • An ECG recording was made once every day to measure the heart’s rhythm and electrical activity. This is a 30 s measurement taken while holding a finger on the watch button.

  • Blood oxygen is measured by starting the app and allowing the sensors to work for 15 s.

  • All other parameters, like heart rate, sleep, number of hours in which the person has been in a vertical position, flights (floors) climbed, steps walked/run, energy used, etc., are recorded automatically, but one may use the apps to monitor the values.

All the data collected by the smartwatch is stored in, and extracted from, the Apple Health application, which works as a central data hub for Apple devices.

Fig. 1. Examples of logging using Apple Watch version 6.

2.2 Food and Nutrition

Fig. 2. Logging food using Lifesum.

To collect dietary intake and nutrition data, we used the Lifesum (footnote 1) mobile app on iPhones with a premium subscription. Figure 2 shows the procedure used for collecting dietary intake data. The participants selected a meal or snack (Fig. 2a) and then entered each component (Fig. 2b), either by searching the existing database or by scanning the barcode printed on the packaging. Thereafter, they added the estimated amount (e.g., weight in grams). When known, the exact type of a component, like the type of bread, was entered to get the exact nutrients. However, the exact type is sometimes unknown, e.g., when eating out. To quantify the amounts, we used the information provided on the packaging, e.g., when consuming a box of yogurt, and we measured the amount of liquid a cup or a glass can contain. Moreover, we weighed the amounts included in a portion of a particular meal, e.g., the weight of a slice of cheese or a spoon of jam. A challenge was eating out, where it is practically infeasible to accurately estimate the individual ingredients used in a dish and the total volume; in such cases, an approximate number is reported. To ease the logging, the app supports marking ingredients as favorites and bookmarks, and defining meals and dishes in advance. We also logged the weight of the participants in the app every day. Moreover, the app sent push messages to remind participants to log both dietary intake and weight. Data from Lifesum is included in the Apple Health data, but we also extracted the data directly from the Lifesum system into separate comma-separated value (CSV) files.

2.3 Subjective Wellness, Training Load, and Injuries

We used the PMSys system (footnote 2) to collect subjective data regarding wellness, training load, and injuries. Figure 3 depicts an example of a normal reporting sequence. PMSys is an online sports logging system where athletes can monitor, for example, individual training load, daily subjective wellness parameters, and injuries [26, 32]. Wellness is typically reported once a day through a sequence of questionnaires, as shown in Fig. 3. In this data collection, the wellness data was reported in the morning.

Training load, or session rating of perceived exertion (sRPE), is a metric calculated as the product of the session length and the reported rating of perceived exertion (RPE), which is reported in a similar way to the wellness sequence shown in Fig. 3. The perceived training load is reported after every training session. Finally, it is recommended to complete the injury questionnaire once a week, regardless of whether an injury has occurred, but here the participants mainly reported an injury when one occurred, by pressing on a body part to indicate a minor or major injury or pain. To ensure timely reporting, PMSys sends scheduled push notifications directly to the participants’ smartphones. All data was extracted from the system into CSV files at the end of the logging period.

Fig. 3. Subjective parameters logged using PMSys, exemplified by wellness.

2.4 Data Anonymization

To prevent any identity disclosure of the participants, the data has been processed to remove all ids. All occurrences of names in devices and of location data, like GPS coordinates, have been changed to a random value or removed completely to prevent re-identification of the participants. Exact GPS positions are also not important for sport-related analysis; the features that can be derived from them, such as speed, distance, and elevation, are part of the dataset.
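
The two anonymization steps described above, replacing identifying values with random ones and removing exact positions, can be sketched as follows. This is only an illustration, not the procedure actually used for ScopeSense: the field names (`sourceName`, `device`, `latitude`, `longitude`) and the example record are assumptions made for this sketch.

```python
import secrets

# Hypothetical field names; the real exports may use different keys.
ID_FIELDS = ("sourceName", "device")
GPS_FIELDS = ("latitude", "longitude")


def anonymize(records):
    """Replace identifying values with random tokens and drop GPS fields."""
    pseudonyms = {}  # same original value -> same random token
    cleaned = []
    for record in records:
        out = {}
        for key, value in record.items():
            if key in GPS_FIELDS:
                continue  # remove exact positions entirely
            if key in ID_FIELDS:
                if value not in pseudonyms:
                    pseudonyms[value] = secrets.token_hex(4)
                value = pseudonyms[value]
            out[key] = value
        cleaned.append(out)
    return cleaned


# Invented example records; derived features such as speed are kept.
records = [
    {"sourceName": "Alice's Apple Watch", "latitude": 59.9, "longitude": 10.7, "speed": 3.1},
    {"sourceName": "Alice's Apple Watch", "latitude": 59.9, "longitude": 10.7, "speed": 3.3},
]
anonymized = anonymize(records)
```

Mapping each original value to one consistent token keeps records from the same device grouped together while hiding the name itself.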

3 Dataset Overview

ScopeSense contains logging data from two male persons in the age range of 40–50 years, collected between 8 February 2021 and 20 October 2021 (255 days) in Norway. The participants followed no particular food regime and exercised regularly during the period. The dataset is organized according to the way it was collected, even though one could alternatively organize it by type of data. We divided and organized the data into three folders for each participant, as shown in Fig. 4.

The dataset is fully available and open for free use by researchers at the Open Science Framework (OSF) (footnote 3) and at the Simula dataset site (footnote 4). It is free to use for research and teaching purposes under the Attribution-NonCommercial 4.0 International (CC BY-NC 4.0) license (footnote 5).

Fig. 4. Dataset organization. The structure for participants A and B is the same. Blue boxes represent the folder structure of the dataset. (Color figure online)

3.1 Apple Watch Data

We have exported all data from the Apple Health system. The raw data comes in XML format (footnote 6), but we have extracted the various parameters as CSV files. There is one record for each of the measurements, where each record contains the parameter type, the source (which is anonymized), the type of device (hardware and software), the unit, dates, and the value(s). The Apple Health system stores a large number of health parameters, including step count, heart rate, resting heart rate, and blood oxygen saturation. Other parameters include sleep, energy burned (basal/active), time standing, distance (walked/run), and walking speed. The ECG data series captured once a day is stored in its own sub-folder.
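
As an illustration of this kind of extraction, the sketch below groups records from an XML export by parameter type and serializes one group as CSV. It is not the authors' extraction script. The `Record` element and its attribute names reflect the commonly seen layout of Apple Health exports and should be treated as assumptions, and the sample values are invented.

```python
import csv
import io
import xml.etree.ElementTree as ET

# Tiny stand-in for an Apple Health export; all values are invented.
SAMPLE_XML = """<HealthData>
  <Record type="HKQuantityTypeIdentifierStepCount" sourceName="anon" unit="count"
          startDate="2021-02-08 08:00:00 +0100" endDate="2021-02-08 08:05:00 +0100" value="312"/>
  <Record type="HKQuantityTypeIdentifierHeartRate" sourceName="anon" unit="count/min"
          startDate="2021-02-08 08:01:00 +0100" endDate="2021-02-08 08:01:00 +0100" value="64"/>
  <Record type="HKQuantityTypeIdentifierStepCount" sourceName="anon" unit="count"
          startDate="2021-02-08 08:05:00 +0100" endDate="2021-02-08 08:10:00 +0100" value="280"/>
</HealthData>"""

# Assumed attribute names, matching the fields listed in the text.
FIELDS = ["type", "sourceName", "unit", "startDate", "endDate", "value"]


def records_by_type(xml_text):
    """Group <Record> elements by their 'type' attribute."""
    grouped = {}
    for elem in ET.fromstring(xml_text).iter("Record"):
        row = {field: elem.get(field, "") for field in FIELDS}
        grouped.setdefault(row["type"], []).append(row)
    return grouped


def to_csv(rows):
    """Serialize rows to CSV text with a header line."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=FIELDS, lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


grouped = records_by_type(SAMPLE_XML)
step_csv = to_csv(grouped["HKQuantityTypeIdentifierStepCount"])
```

For a full export, which can be very large, `ET.iterparse` would avoid loading the whole file into memory at once.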

Each training session has a training record in Apple Health. Specifically, by combining information from various files, one can extract information like type of training, duration, start and end time, total active energy (kcal), total distance (km), average heart rate and heart rate during training, step count, and speed. Each run session is also stored as a series of points along the route, as shown in Fig. 5, giving, for example, time, speed, and elevation, but the actual GPS coordinates are removed for privacy reasons. All these routes are stored in a separate sub-folder.

Fig. 5. Example records of a run route session.

3.2 Lifesum Data

The Lifesum system collects as much information as possible about consumed food and drink. Every report in the system is contained in one row of the table, with the date of the report, the meal it belongs to (or a snack), the “name” of the food (including a title, potential manufacturer, and content), and the amount (both in the unit used and in grams). Subsequently, as can be seen in the last part of each row, the reported intake is used to calculate the number of calories and the amounts of various types of carbohydrates, various types of fat, protein, sodium, potassium, and cholesterol.

In addition, the Lifesum system has logged the participants’ daily weight. The weight is given in each row, together with the body mass index (BMI) calculated from the participant’s height.

Finally, we have also merged Lifesum and PMSys data (see below) in a day-by-day manner. These records are stored in the lifesum-pmsys-per-day-merged.csv file, which contains one row per day.
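
A day-by-day merge of this kind can be sketched as below: food rows are summed per date and joined with the wellness report for the same date. The column names (`date`, `calories`, `protein`, `readiness`, `mood`) and all values are assumptions for illustration. They are not taken from the dataset files, and the actual merged file may have been produced differently.

```python
from collections import defaultdict

# Hypothetical column names and invented values.
lifesum_rows = [
    {"date": "2021-02-08", "meal": "breakfast", "calories": 420.0, "protein": 18.0},
    {"date": "2021-02-08", "meal": "dinner", "calories": 810.0, "protein": 42.0},
    {"date": "2021-02-09", "meal": "lunch", "calories": 650.0, "protein": 30.0},
]
pmsys_rows = [
    {"date": "2021-02-08", "readiness": 7, "mood": 3},
    {"date": "2021-02-09", "readiness": 5, "mood": 4},
]


def merge_per_day(food_rows, wellness_rows, nutrients=("calories", "protein")):
    """Sum nutrients per date and join them with that date's wellness report."""
    totals = defaultdict(lambda: dict.fromkeys(nutrients, 0.0))
    for row in food_rows:
        for nutrient in nutrients:
            totals[row["date"]][nutrient] += row[nutrient]

    wellness = {row["date"]: row for row in wellness_rows}
    merged = []
    for date in sorted(set(totals) | set(wellness)):
        record = {"date": date}
        # Days without food entries keep None rather than a misleading zero.
        record.update(totals.get(date) or dict.fromkeys(nutrients))
        record.update({k: v for k, v in wellness.get(date, {}).items() if k != "date"})
        merged.append(record)
    return merged


merged = merge_per_day(lifesum_rows, pmsys_rows)
```

Keeping `None` for days without food entries separates "nothing logged" from "nothing eaten", which matters for self-reported data.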

3.3 PMSys Data

In terms of subjective PMSys reporting, the raw data is contained in the CSV files:

  • Wellness includes parameters like time and date, fatigue, mood, readiness, sleep duration (number of hours), sleep quality, soreness (and soreness area), and stress. Fatigue, sleep quality, soreness, stress, and mood are all reported on a 1–5 scale, where 3 is normal, 1–2 are below normal, and 4–5 are above normal. Sleep duration is simply the length of the sleep in hours, and readiness (scale 1–10) is an overall subjective measure of how ready one is to exercise, i.e., 1 means not ready at all, and 10 means exceptionally ready. In elite sport, readiness is used to decide whether the athlete should increase or reduce the training load. In the CSV file, each wellness parameter has a column for the date and a column for the parameter value for each of the participants.

  • A training load report contains a training session’s time, type of activity, rating of perceived exertion (RPE), and duration in minutes. This is, for example, used to calculate the session’s training load or sRPE (RPE × duration). The data is stored as one tab in the spreadsheet for each of the participants. There is one line for each session with the date, daily summarized load, the calculated session RPE (sRPE), the experienced RPE, the session length, and then various calculated metrics like weekly load, monotony, strain, acute:chronic workload ratio, and chronic training load (over 28 and 42 days). If more than one session per day has been logged, the additional lines have just the sRPE, RPE, and length of the session, which are then used to calculate the total load parameters in the first line.

  • Injury and illness are reported in separate files having one line per incident with date and symptom/place of pain.
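
The training load metrics listed above can be approximated with a few lines of code. The sketch below uses definitions that are common in the sports science literature: monotony as the mean daily load divided by its standard deviation over a week, strain as weekly load times monotony, and the acute:chronic workload ratio as a 7-day average over a 28-day average. Whether PMSys computes them in exactly this way (for example, sample versus population standard deviation, or rolling versus weighted averages) is an assumption, and the session values are invented.

```python
from statistics import mean, stdev


def session_load(rpe, minutes):
    """sRPE: rating of perceived exertion times session length in minutes."""
    return rpe * minutes


def weekly_metrics(daily_loads):
    """Weekly load, monotony, and strain from the last 7 daily loads."""
    week = daily_loads[-7:]
    weekly_load = sum(week)
    spread = stdev(week)  # sample standard deviation; an assumption
    monotony = mean(week) / spread if spread > 0 else None
    strain = weekly_load * monotony if monotony is not None else None
    return weekly_load, monotony, strain


def acwr(daily_loads, acute_days=7, chronic_days=28):
    """Acute:chronic workload ratio using simple rolling averages."""
    if len(daily_loads) < chronic_days:
        return None  # not enough history for the chronic window
    chronic = mean(daily_loads[-chronic_days:])
    return mean(daily_loads[-acute_days:]) / chronic if chronic > 0 else None


# Invented example: (RPE, minutes) per session, grouped per day; empty = rest day.
week_sessions = [[(5, 60)], [], [(6, 75)], [(4, 50)], [], [(8, 75)], [(5, 50)]]
week_loads = [sum(session_load(r, m) for r, m in day) for day in week_sessions]
history = week_loads * 4  # four identical weeks
weekly_load, monotony, strain = weekly_metrics(history)
ratio = acwr(history)
```

With four identical weeks, the acute and chronic averages coincide, so the ratio is 1; values well above 1 would indicate a recent spike in load relative to the longer-term level.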

Table 1. Important features for the different self-reported values in PMSys.

4 Initial Experiments

We performed some simple baseline experiments to give an initial idea of how the dataset can be used and to test its usefulness. For all experiments, we used 60% of the data for training and validation, and the remaining 40% as a test dataset. We preprocessed the data into one-value-per-day records (also included in the dataset) and used a subset of the data (weight_kg, bmi, calories_burned per day, calories per day, carbs per day, carbs_fiber per day, carbs_sugar per day, fat per day, fat_saturated per day, fat_unsaturated per day, potassium per day, protein per day, sodium per day, daily load, fatigue, soreness, mood, stress, sleep quality, sleep duration and readiness) where one instance represents one day of the collection period. The split between the train and test datasets was performed randomly, and an equal number of instances per participant was included in the splits.
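
One way to realize such a split is sketched below: records are grouped per participant, the same number of days is drawn at random from each, and 60% of each participant's days go to training. This is one reading of the description above, not the authors' code. The `participant` and `day` keys, the fixed seed, and the rounding rule are assumptions for illustration.

```python
import random


def split_per_participant(instances, train_fraction=0.6, seed=0):
    """Randomly split day records so each participant contributes equally."""
    rng = random.Random(seed)
    by_participant = {}
    for instance in instances:
        by_participant.setdefault(instance["participant"], []).append(instance)

    # Use the same number of days from every participant.
    per_participant = min(len(rows) for rows in by_participant.values())
    n_train = round(per_participant * train_fraction)

    train, test = [], []
    for rows in by_participant.values():
        chosen = rng.sample(rows, per_participant)  # random order, no repeats
        train.extend(chosen[:n_train])
        test.extend(chosen[n_train:])
    return train, test


# Placeholder records: one per participant and day of the 255-day period.
instances = [{"participant": p, "day": d} for p in ("A", "B") for d in range(255)]
train, test = split_per_participant(instances)
```

Because days are shuffled, neighbouring days of the same participant can end up on both sides of the split; a chronological split would be a stricter alternative for time-series data.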

For the first experiment, we explored which features are important to predict different self-reported values. Specifically, we explored the important features for Readiness, Mood, Fatigue, Stress, Sleep Quality, Sleep Duration, Soreness, and Daily Load. We used Correlation-based Feature Subset Selection with Best First Ranker to select and rank the features. Table 1 shows the results, where we observe that different features affect different aspects of one’s well-being. For example, we can see that potassium is an important feature for predicting fatigue, which is reasonable since feeling tired or fatigued is a symptom of a too low level of this chemical element [28]. This shows that the dataset holds the potential to reveal relationships between self-reported and measured values.
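
To give an intuition for correlation-based feature subset selection, the sketch below computes the merit heuristic usually associated with this family of methods: a subset of k features scores k·r_cf / sqrt(k + k(k−1)·r_ff), where r_cf is the mean feature-target correlation and r_ff the mean feature-feature correlation. It uses plain Pearson correlation and omits the best-first search, so it is a simplified illustration and not the tool used in the experiment. The feature names and values are invented.

```python
from itertools import combinations
from math import sqrt
from statistics import mean


def pearson(x, y):
    """Pearson correlation coefficient; 0.0 if either series is constant."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / sqrt(var_x * var_y) if var_x and var_y else 0.0


def cfs_merit(features, target, subset):
    """Merit of a feature subset: high target correlation, low redundancy."""
    k = len(subset)
    r_cf = mean(abs(pearson(features[name], target)) for name in subset)
    pairs = list(combinations(subset, 2))
    r_ff = mean(abs(pearson(features[a], features[b])) for a, b in pairs) if pairs else 0.0
    return k * r_cf / sqrt(k + k * (k - 1) * r_ff)


# Invented daily values for illustration only.
readiness = [4, 5, 6, 7, 8, 6]
features = {
    "sleep_quality": [2, 3, 3, 4, 5, 3],
    "sleep_copy": [2, 3, 3, 4, 5, 3],  # fully redundant with sleep_quality
    "sodium": [3, 1, 4, 1, 5, 9],
}
merit_sleep = cfs_merit(features, readiness, ["sleep_quality"])
merit_sodium = cfs_merit(features, readiness, ["sodium"])
merit_redundant = cfs_merit(features, readiness, ["sleep_quality", "sleep_copy"])
```

In this toy example, adding the redundant copy does not raise the merit above that of `sleep_quality` alone, which is the behaviour that lets the method prefer small, non-redundant subsets.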

For the second experiment, we trained a simple machine learning model to predict readiness (similar to [20, 32]) which is seen as one of the most important factors in a sport context. As a baseline, we used ZeroR which predicts the average of the data. In addition, we trained two regressors using RandomForest and the SMOReg support vector machine. For ZeroR, the mean absolute error and root mean squared error were 1.29 and 1.6103, respectively, compared to 0.9944 and 1.3599 for SMOReg, and 0.9897 and 1.298 for RandomForest. From this initial simple analysis, we observe that some of the subjective measurements can be predicted using just a subset of features.
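
The baseline and the two error measures can be reproduced in a few lines. The sketch below implements a mean-predicting baseline in the spirit of ZeroR together with the mean absolute error and root mean squared error. The readiness values are invented for illustration, so the resulting numbers are unrelated to the results reported above.

```python
from math import sqrt
from statistics import mean


def zero_r(train_targets):
    """Return a predictor that always outputs the mean of the training targets."""
    constant = mean(train_targets)
    return lambda _features=None: constant


def mae(actual, predicted):
    """Mean absolute error."""
    return mean(abs(a - p) for a, p in zip(actual, predicted))


def rmse(actual, predicted):
    """Root mean squared error."""
    return sqrt(mean((a - p) ** 2 for a, p in zip(actual, predicted)))


train_readiness = [6, 7, 5, 8, 6, 4]  # invented values on the 1-10 scale
test_readiness = [7, 5, 6, 9]
predict = zero_r(train_readiness)
predictions = [predict() for _ in test_readiness]
baseline_mae = mae(test_readiness, predictions)
baseline_rmse = rmse(test_readiness, predictions)
```

Any trained regressor is only useful if it beats these baseline errors on the same test split, which is how the RandomForest and SMOReg results above should be read.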

5 Example Applications of the Dataset

Healthcare systems are undergoing a major transformation. Healthcare used to be reactive and symptom-based: a doctor was primarily consulted when a person was ill, diagnosed the sickness, and prescribed medical treatment. Recently, systems biologists and healthcare researchers have started adopting a different perspective, introduced as the P4 approach to medicine [1, 17]. The P4 approach is predictive, preventive, personalized, and participatory, and uses data about a person as the main driver in devising care that is proactive rather than reactive. This approach requires that each person is considered a unique system and is modeled using longitudinal data. It was not practical until smartphones, wearable devices, and associated advances in machine learning and cloud computing came along. Therefore, ubiquitous health monitoring is required for seamless data collection and subsequent applications of artificial intelligence and advanced healthcare technologies.

In medicine, an emerging concept is the N-of-1 approach, where longitudinal data about a person is used to model that person, rather than collecting population data and considering the person a sample of this population. P4 approaches for all aspects of health and wellness require collecting objective as well as subjective data. As discussed in this paper, some of the data is relatively easy to collect, while other data requires careful planning, collection, and structured organization. Such an approach is inherently multimodal and requires applications of traditional multimedia systems competence in addition to emerging and novel machine learning approaches.

Valid, privacy-preserving, relevant, and accurate data is fundamental to this approach. In this paper, we presented our initial experience with relatively invasive data collection over a long period of time. We present this data so that researchers with different interests can explore approaches that may be suitable. We hope that this data is the beginning of a collection of such data, shared in the community to enable exploration of approaches related to health. We want to emphasize the role of food data as a very effective data stream. Food containing essential nutrients has always been considered vital for a healthy human being, and the relation between food and exercise has been investigated for a long time, showing the importance of eating correctly to perform best [3]. Yet, approaches for understanding the long-term effects of what, when, where, and how much food is eaten on different aspects of physical and psychological health are only partly explored due to the unavailability of data. Just exploring if and how food intake affects physical activity or exercise patterns can give us simple, useful, and interesting information, especially on an individual basis. Moreover, subjective readiness and other wellness parameters can be validated against “real” coinciding changes in objective heart rate variability data from the Apple Watch. We hope that the dataset presented in this paper will jump-start this process, and possibly enable insights into how food, readiness for exercise, execution of exercise, and general wellness are related. Moreover, since the data is collected over a long period of time, it is worth analyzing it locally in shorter time intervals to examine whether some features, such as dining preferences and activity, change between seasons with a potential causal effect on wellness.

6 Conclusion

We have presented the ScopeSense dataset, containing both objective and subjective longitudinal parameters from activity, wellness, food, and biometrics, potentially enabling the development of several interesting analysis applications. Our initial experiments show that such analyses are possible, but we conjecture that the dataset has greater potential beyond what we have demonstrated in this paper. It is now for our peers to use the dataset to expand and support this line of technology for proactive medicine.

We are currently using our initial experiences with this data collection to extend the work to a larger and more diverse cohort with even more parameters monitored. Also, we are developing a wide collection of machine learning applications that can sift through and analyze the data as it is reported, enabling next-generation proactive medicine to handle massive amounts of heterogeneous data in close to real time. The goal is a personalized digital health screening service that analyzes data and, e.g., detects anomalies and concerning deviations, providing potential for rapid response and intervention.