DETECTING HOUSEHOLD OCCUPANCY
USING ELECTRICITY CONSUMPTION DATA
Rafay Khan
OVERVIEW
1. Project Background and Description
Occupancy detection is useful in a wide range of applications, from HVAC and lighting systems to smart
thermostats in the home. Traditional methods, such as PIR sensors combined with reed switches, can be
error-prone, expensive, and cumbersome. Digital electricity meters offer an opportunistic alternative through
the electricity consumption data they record, because they are already present, or about to be installed, in
millions of households worldwide. Using these smart meters for occupancy detection therefore imposes no
additional installation, use, or maintenance costs on residents. The opportunistic use of existing sensors
increases occupancy monitoring capabilities and, in turn, the acceptance of building automation systems.
In this project we aim to show, using machine learning methods and feature engineering techniques, that power
consumption data can be a viable predictor of household occupancy. We also aim to investigate whether a model
trained on summer data performs adequately on winter data.
The ECO (Electricity Consumption and Occupancy) data set is a comprehensive open-source (Creative Commons
License CC BY 4.0) data set for non-intrusive load monitoring and occupancy detection research. It was collected in
6 Swiss households over a period of 8 months. For each of the households, the ECO data set provides:
* 1 Hz aggregate consumption data. Each measurement contains data on current, voltage, and phase shift for each
of the three phases in the household.
* 1 Hz plug-level data measured from selected appliances.
* Occupancy information measured through a tablet computer (manual labeling) and a passive infrared sensor (in
some of the households).
Occupancy information is separated by season: summer and winter.
Link to the dataset: http://data-archive.ethz.ch/delivery/DeliveryManagerServlet?dps_pid=IE594964
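As an illustration of how per-phase current, voltage, and phase shift readings could be reduced to a single
consumption value, the sketch below applies the standard real-power relation P = V * I * cos(phi) to each phase and
sums the three phases. The function names, the units (volts, amperes, degrees), and the sample values are
assumptions made for this example; they are not taken from the ECO files.

```python
import math


def real_power_watts(voltage_v, current_a, phase_shift_deg):
    """Real power of one phase: P = V * I * cos(phi).

    Assumes RMS voltage in volts, RMS current in amperes, and the phase
    shift between current and voltage in degrees (hypothetical units).
    """
    return voltage_v * current_a * math.cos(math.radians(phase_shift_deg))


def total_real_power(phases):
    """Sum the real power over an iterable of (voltage, current, phase_shift) tuples."""
    return sum(real_power_watts(v, i, phi) for v, i, phi in phases)


# Made-up one-second sample for the three phases of a household.
sample = [
    (230.0, 2.0, 0.0),   # purely resistive load: 460 W
    (230.0, 1.0, 60.0),  # cos(60 degrees) = 0.5: 115 W
    (230.0, 0.0, 0.0),   # idle phase: 0 W
]
total = total_real_power(sample)  # about 575 W
```

A per-second total of this kind is one possible basis for the aggregate features used later in the project.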
The following metrics are reported in this project: accuracy, precision, recall, F1 score, and ROC-AUC score. The
ROC-AUC score is emphasized because this is an imbalanced classification problem. The following models were
used: logistic regression (LR) with cross-validation to tune the regularization parameter, a random forest
classifier (RF), and a tree-based gradient boosting classifier (GBC). Considerable data preprocessing was
required. Because the data are time series, we created lag variables. In addition, aggregate features and
appliance data were used. The model training and testing methodology was as follows:
Step 1: create a single dataframe for each household (summer).
Step 2: split each household's data 80-20 into train and validation sets, separately per household.
Step 3: combine each household's train split into the overall train dataset.
Step 4: combine each household's validation split into the overall validation dataset.
Step 5: train the model on the overall train dataset.
Step 6: validate the model on the overall validation dataset.
Step 7: load all data for winter.
Step 8: test the model on the winter data.
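The eight steps can be sketched end to end with pandas and scikit-learn. This is a minimal illustration, not the
project's actual code: the data are synthetic, the helper names (make_household, add_lags, split_80_20, evaluate)
and the column names are hypothetical, and the 80-20 split is assumed to be chronological, which the description
above does not specify. Aggregate and appliance features are omitted for brevity.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegressionCV
from sklearn.metrics import (accuracy_score, f1_score, precision_score,
                             recall_score, roc_auc_score)
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler


def make_household(seed, n=600):
    """Synthetic stand-in for one household: power readings plus 0/1 occupancy."""
    rng = np.random.default_rng(seed)
    occupied = (rng.random(n) < 0.7).astype(int)  # imbalanced labels
    power = 100 + 150 * occupied + rng.normal(0, 40, n)
    return pd.DataFrame({"power": power, "occupied": occupied})


def add_lags(df, lags=(1, 2, 3)):
    """Add lagged copies of the power column; drop rows without full history."""
    out = df.copy()
    for k in lags:
        out[f"power_lag{k}"] = out["power"].shift(k)
    return out.dropna().reset_index(drop=True)


def split_80_20(df):
    """Chronological 80-20 split (assumption: rows are in time order, no shuffling)."""
    cut = int(len(df) * 0.8)
    return df.iloc[:cut], df.iloc[cut:]


def evaluate(model, X, y):
    """Compute the five metrics named in the text for a fitted classifier."""
    pred = model.predict(X)
    proba = model.predict_proba(X)[:, 1]
    return {
        "accuracy": accuracy_score(y, pred),
        "precision": precision_score(y, pred),
        "recall": recall_score(y, pred),
        "f1": f1_score(y, pred),
        "roc_auc": roc_auc_score(y, proba),
    }


# Steps 1-2: one dataframe per household (summer), each split separately.
summer = [add_lags(make_household(seed)) for seed in range(3)]
splits = [split_80_20(df) for df in summer]

# Steps 3-4: combine the per-household splits.
train = pd.concat([tr for tr, _ in splits], ignore_index=True)
valid = pd.concat([va for _, va in splits], ignore_index=True)
features = [c for c in train.columns if c != "occupied"]

# Step 7: load all winter data (synthetic here, generated with different seeds).
winter = pd.concat(
    [add_lags(make_household(seed + 100)) for seed in range(3)], ignore_index=True
)

models = {
    "LR": make_pipeline(StandardScaler(), LogisticRegressionCV(cv=5, max_iter=1000)),
    "RF": RandomForestClassifier(n_estimators=50, random_state=0),
    "GBC": GradientBoostingClassifier(random_state=0),
}

# Steps 5, 6, 8: train on summer, validate on summer hold-out, test on winter.
results = {}
for name, model in models.items():
    model.fit(train[features], train["occupied"])
    results[name] = {
        "validation": evaluate(model, valid[features], valid["occupied"]),
        "winter": evaluate(model, winter[features], winter["occupied"]),
    }
```

Splitting each household before combining keeps every household represented in both the train and validation
sets. If the split is chronological, as assumed here, lagged features in the validation rows are also never drawn
from rows the model will later be validated against out of order.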