DETECTING HOUSEHOLD OCCUPANCY

USING ELECTRICITY CONSUMPTION DATA

Rafay Khan

OVERVIEW

1. Project Background and Description

Occupancy detection is useful in a wide range of applications, from HVAC and lighting systems to smart thermostats in the home. Traditional methods, such as PIR sensors combined with reed switches, can be error-prone, expensive, and cumbersome. Digital electricity meters offer an opportunistic alternative through the consumption data they record, as they are already present, or about to be installed, in millions of households worldwide. Using these smart meters for occupancy detection therefore imposes no additional installation, usage, or maintenance costs on residents. The opportunistic use of existing sensors increases occupancy monitoring capabilities and, in turn, the acceptance of building automation systems.

2. Project Scope

In this project we aim to show, using machine learning methods and feature engineering techniques, that power consumption data can be a viable predictor of household occupancy. We also investigate whether a model trained on summer data performs adequately on winter data.

3. The Dataset

The ECO (Electricity Consumption and Occupancy) data set is a comprehensive open-source (Creative Commons License CC BY 4.0) data set for non-intrusive load monitoring and occupancy detection research. It was collected in 6 Swiss households over a period of 8 months. For each of the households, the ECO data set provides:

* 1 Hz aggregate consumption data. Each measurement contains data on current, voltage, and phase shift for each of the three phases in the household.
* 1 Hz plug-level data measured from selected appliances.
* Occupancy information measured through a tablet computer (manual labeling) and a passive infrared sensor (in some of the households).

Occupancy information is separated by season: summer and winter.

Link to the dataset: http://data-archive.ethz.ch/delivery/DeliveryManagerServlet?dps_pid=IE594964

4. Data Analysis Methods

The following metrics are reported in this project: accuracy, precision, recall, F1 score, and ROC-AUC score. The ROC-AUC score is emphasized because this is an imbalanced classification problem. The following models were used: logistic regression (LR) with cross-validation for tuning the regularization parameter, a random forest classifier (RF), and a tree-based gradient boosting classifier (GBC).

Considerable data preprocessing was required. Because the data is a time series, we created lag variables. In addition, aggregate features and appliance data were used.

The model training/testing methodology was as follows:

step 1: create a single dataframe for each household (summer)
step 2: split each household separately into an 80-20 train-validation split
step 3: combine each household's train split into the overall train dataset
step 4: combine each household's validation split into the overall validation dataset
step 5: train the model on the overall train dataset
step 6: validate the model on the overall validation dataset
step 7: load all data for winter
step 8: test the model on the winter data
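The lag variables and the eight-step methodology above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the project's actual code: the data is synthetic (not ECO data), the column names `power` and `occupied`, the helper functions, the choice of three lags, and the chronological (unshuffled) 80-20 split are all hypothetical. A random forest is used here simply because it is one of the three models listed.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score


def add_lags(df, column="power", lags=(1, 2, 3)):
    """Append lagged copies of `column`; rows lacking a full history are dropped."""
    out = df.copy()
    for k in lags:
        out[f"{column}_lag{k}"] = out[column].shift(k)
    return out.dropna().reset_index(drop=True)


def split_household(df, train_frac=0.8):
    """Chronological 80-20 split of one household (assumption: no shuffling)."""
    cut = int(len(df) * train_frac)
    return df.iloc[:cut], df.iloc[cut:]


def make_household(rng, n=500, base=100.0):
    """Synthetic stand-in for one household's readings (hypothetical, not ECO data)."""
    occupied = (rng.random(n) < 0.7).astype(int)  # imbalanced labels
    power = base + 150.0 * occupied + rng.normal(0.0, 30.0, n)
    return pd.DataFrame({"power": power, "occupied": occupied})


rng = np.random.default_rng(0)

# Step 1: one dataframe per household (summer), with lag features added.
summer = [add_lags(make_household(rng, base=b)) for b in (80.0, 100.0, 120.0)]

# Steps 2-4: split each household separately, then pool the splits.
splits = [split_household(h) for h in summer]
train = pd.concat([tr for tr, _ in splits], ignore_index=True)
valid = pd.concat([va for _, va in splits], ignore_index=True)

features = [c for c in train.columns if c != "occupied"]

# Step 5: train on the pooled summer training data.
model = RandomForestClassifier(n_estimators=50, random_state=0)
model.fit(train[features], train["occupied"])


def auc(frame):
    """ROC-AUC of the fitted model on `frame`, using the probability of class 1."""
    scores = model.predict_proba(frame[features])[:, 1]
    return roc_auc_score(frame["occupied"], scores)


# Step 6: validate on the pooled summer validation data.
valid_auc = auc(valid)

# Steps 7-8: load all winter data and test the summer-trained model on it.
winter = pd.concat(
    [add_lags(make_household(rng, base=b)) for b in (90.0, 110.0, 130.0)],
    ignore_index=True,
)
winter_auc = auc(winter)

print(f"validation ROC-AUC: {valid_auc:.3f}, winter ROC-AUC: {winter_auc:.3f}")
```

Splitting each household before pooling keeps every household represented in both the train and validation sets in the same 80-20 proportion. In this synthetic data the labels are independent across time, so the lag columns carry no real signal; they are included only to show the mechanics.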

5. Results & Conclusion

From this figure we can conclude that the dataset yielding the highest performance on the metric we deemed most important, the ROC-AUC score, was the appliance_lagged dataset. This means that the appliance data did play a role in occupancy detection and that including it was beneficial. The lagged features also improved occupancy detection, as is evident from this visualization. We can also see that the model's performance on summer data was not very different from its performance on winter data, although it was expected to perform slightly better on summer data, since the model was trained on summer data. We achieved the aims of our project.