DETECTING HOUSEHOLD OCCUPANCY
USING ELECTRICITY CONSUMPTION DATA
Rafay Khan
OVERVIEW
1. Project Background and Description
Occupancy detection is useful in a wide range of applications, from HVAC and lighting systems to smart
thermostats in the home. Traditional methods, such as PIR sensors combined with reed switches, can be
error-prone, expensive, and cumbersome. Through the electricity consumption data they record, digital
electricity meters offer a novel, opportunistic alternative, as they are already present, or about to be
installed, in millions of households worldwide. Thus, the installation, use, and maintenance of these smart
meters do not impose additional costs on residents. This opportunistic use of existing sensors increases
occupancy monitoring capabilities and, in turn, the acceptance of building automation systems.
2. Project Scope
In this project we aim to show, using machine learning methods and feature engineering techniques, that power
consumption data can be a viable predictor of household occupancy. We also investigate whether a model trained
on summer data performs suitably on winter data.
3. The Dataset
The ECO (Electricity Consumption and Occupancy) data set is a comprehensive open-source (Creative Commons
License CC BY 4.0) data set for non-intrusive load monitoring and occupancy detection research. It was collected in
6 Swiss households over a period of 8 months. For each of the households, the ECO data set provides:
* 1 Hz aggregate consumption data. Each measurement contains data on current, voltage, and phase shift for each
of the three phases in the household.
* 1 Hz plug-level data measured from selected appliances.
* Occupancy information measured through a tablet computer (manual labeling) and a passive infrared sensor (in
some of the households).
Occupancy information is separated by season: summer and winter.
Link to the dataset: http://data-archive.ethz.ch/delivery/DeliveryManagerServlet?dps_pid=IE594964
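As an illustration of how the per-phase measurements relate to household power, the sketch below sums the real power V * I * cos(phi) over the three phases for a single 1 Hz sample. It assumes RMS current in amperes, RMS voltage in volts, and phase shift in degrees; the actual units and column layout of the ECO files are not stated here, and the function name `total_real_power` and the sample values are made up for the example.

```python
import math
from typing import Sequence


def total_real_power(currents_a: Sequence[float],
                     voltages_v: Sequence[float],
                     phase_shifts_deg: Sequence[float]) -> float:
    """Sum of V * I * cos(phi) over the phases, in watts.

    Assumes RMS amperes, RMS volts, and phase shift in degrees.
    """
    return sum(
        v * i * math.cos(math.radians(phi))
        for i, v, phi in zip(currents_a, voltages_v, phase_shifts_deg)
    )


# Made-up single sample: one (current, voltage, phase shift) triple per phase.
sample_power = total_real_power(
    currents_a=[1.0, 2.0, 0.5],
    voltages_v=[230.0, 230.0, 230.0],
    phase_shifts_deg=[0.0, 60.0, 0.0],
)
print(round(sample_power, 1))
```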
4. Data Analysis Methods
The following metrics are reported in this project: accuracy, precision, recall, F1 score, and ROC-AUC score. The
ROC-AUC score is emphasized because this is an imbalanced classification problem. The following models were used:
logistic regression (LR) with cross-validation for tuning the regularization parameter, a random forest classifier
(RF), and a tree-based gradient boosting classifier (GBC). Considerable data preprocessing was required. Because
the data is a time series, we created lag variables. In addition, aggregate features and appliance data were used.
The model training/testing methodology was as follows:
Step 1: create a single dataframe for each household (summer).
Step 2: split each household's data separately into 80-20 train and validation sets.
Step 3: combine each household's train split into the overall train dataset.
Step 4: combine each household's validation split into the overall validation dataset.
Step 5: train the model on the overall train dataset.
Step 6: validate the model on the overall validation dataset.
Step 7: load all data for winter.
Step 8: test the model on the winter data.
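The lag variables and steps 1 to 4 of the methodology can be sketched as follows. This is a minimal illustration using only the Python standard library, not the project's actual code. The function names (`add_lag_features`, `split_household`, `build_datasets`), the number of lags, the chronological split, and the toy readings are all assumptions made for the example.

```python
from typing import Dict, List, Sequence, Tuple

Row = Tuple[List[float], int]  # (feature vector, occupancy label)


def add_lag_features(power: Sequence[float], occupied: Sequence[int],
                     n_lags: int = 3) -> List[Row]:
    """Pair each reading with its n_lags previous readings.

    The first n_lags samples are dropped because they lack a full history.
    """
    rows: List[Row] = []
    for t in range(n_lags, len(power)):
        # Feature order: current value, then lag 1, lag 2, ..., lag n_lags.
        features = [power[t - k] for k in range(n_lags + 1)]
        rows.append((features, occupied[t]))
    return rows


def split_household(rows: List[Row],
                    train_frac: float = 0.8) -> Tuple[List[Row], List[Row]]:
    """80-20 split of one household.

    Assumption: the split is chronological; the report does not say
    whether the original split was chronological or shuffled.
    """
    cut = int(len(rows) * train_frac)
    return rows[:cut], rows[cut:]


def build_datasets(
    households: Dict[str, Tuple[Sequence[float], Sequence[int]]],
    n_lags: int = 3,
) -> Tuple[List[Row], List[Row]]:
    """Split each household separately, then pool the splits."""
    train: List[Row] = []
    valid: List[Row] = []
    for power, occupied in households.values():
        rows = add_lag_features(power, occupied, n_lags)
        household_train, household_valid = split_household(rows)
        train.extend(household_train)
        valid.extend(household_valid)
    return train, valid


# Toy data standing in for two summer households (made-up numbers).
summer = {
    "house_a": ([100, 120, 400, 380, 90, 95, 410, 420, 100, 105, 390, 80, 85],
                [0, 0, 1, 1, 0, 0, 1, 1, 0, 0, 1, 0, 0]),
    "house_b": ([50, 300, 310, 60, 55, 290, 305, 45],
                [0, 1, 1, 0, 0, 1, 1, 0]),
}
train_rows, valid_rows = build_datasets(summer, n_lags=3)
print(len(train_rows), len(valid_rows))
```

Splitting each household before pooling keeps every household represented in both the train and validation sets. With pandas, the same lagging is commonly written with `DataFrame.shift`.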
5. Results & Conclusion
From this figure we can conclude that the dataset yielding the highest performance on the metric we deemed most
important, the ROC-AUC score, was the appliance_lagged dataset. This means that the appliance data contributed to
the detection of occupancy and was beneficial to include. The lagged features also improved occupancy detection, as
is evident from this visualization. We can also see that the model's performance on winter data was not far from
its performance on summer data, although some advantage on summer data was to be expected, since the model was
trained on summer data. We successfully achieved the aims of our project.
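To illustrate why ROC-AUC is the preferred metric for an imbalanced problem, the sketch below compares accuracy and ROC-AUC for two hypothetical models on the same made-up labels. The labels, scores, and function names are assumptions for the example and are not results from this project. ROC-AUC is computed here directly from its pairwise definition rather than with a library.

```python
from typing import Sequence


def roc_auc(labels: Sequence[int], scores: Sequence[float]) -> float:
    """Probability that a random positive sample is scored higher than a
    random negative one, with ties counted as one half."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("ROC-AUC needs at least one sample of each class")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


def accuracy(labels: Sequence[int], scores: Sequence[float],
             threshold: float = 0.5) -> float:
    """Fraction of samples classified correctly at the given threshold."""
    preds = [1 if s >= threshold else 0 for s in scores]
    return sum(p == y for p, y in zip(preds, labels)) / len(labels)


# Made-up imbalanced labels: 8 samples of class 1, 2 samples of class 0.
labels = [1, 1, 1, 1, 1, 1, 1, 1, 0, 0]
# A trivial model that gives every sample the same score.
constant_scores = [0.9] * 10
# A model whose scores carry some information about the label.
informative_scores = [0.9, 0.8, 0.85, 0.7, 0.95, 0.6, 0.75, 0.4, 0.3, 0.65]

for name, scores in [("constant", constant_scores),
                     ("informative", informative_scores)]:
    print(name, accuracy(labels, scores), roc_auc(labels, scores))
```

Both models reach the same accuracy on these labels, but only the informative one scores above 0.5 on ROC-AUC, which is why accuracy alone can mislead on imbalanced data. In scikit-learn the same metric is available as `sklearn.metrics.roc_auc_score`.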