Data sets

1. fseval - A Benchmarking Framework for Feature Selection and Ranking Algorithms
Authors: Overschie, Jeroen; Alsahaf, Ahmad; Azzopardi, George
Year: 2022
Download: https://doi.org/10.34894/KDEPR0
Publication: Under review
Description

The fseval Python package allows benchmarking Feature Selection and Feature Ranking algorithms on a large scale, and facilitates the comparison of multiple algorithms in a systematic way. In particular, fseval enables users to run experiments in parallel and distributed over multiple machines, and export the results to an SQL database. The execution of an experiment can be fully determined by a configuration file, which means the experiment results can be reproduced easily, given only the configuration file. fseval has high test coverage, continuous integration, and rich documentation. The package is open source and can be installed through PyPI.

Notes

Jeroen Overschie was responsible for the implementation. Ahmad Alsahaf and George Azzopardi supervised the project. The software is published under the GNU General Public License v3.0: https://www.gnu.org/licenses/gpl-3.0.en.html


2. Recognition of Holstein Cattle with Thermal and RGB images
Authors: Bhole, Amey; Udmale, Sandeep S.; Falzon, Owen; Azzopardi, George
Year: 2021
Download: https://doi.org/10.34894/7M108F
Publication

Bhole, Amey; Udmale, Sandeep S; Falzon, Owen; Azzopardi, George

CORF3D contour maps with application to Holstein cattle recognition from RGB and thermal images Journal Article

Description

This data set was collected at the Dairy Campus in Leeuwarden (The Netherlands) with a FLIR E6 thermal camera over a period of 9 days. It consists of 3694 images of 383 cows, with each cow represented by an average of 9 images. Each snapshot produced two images: (i) RGB and (ii) temperature. The image filenames are in the format [cow_id-4 digits]_[day no-1 digit]_[counter-1 digit]. The timestamp.xlsx file indicates the day number (day 1 to day 9) on which each image in the data set was collected. This makes it possible to design and run leave-one-day-out cross-validation, as we did in our paper. The scripts that reproduce the results reported in the paper are available in a GitHub repository.
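
The filename format and the leave-one-day-out protocol described above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' scripts: it assumes that the day field in the filename matches the day number in timestamp.xlsx, that a file extension may follow the three fields, and the example filenames are hypothetical.

```python
import re
from collections import defaultdict

# Stated format: 4-digit cow id, 1-digit day number, 1-digit counter.
# The optional file extension is an assumption.
FILENAME_RE = re.compile(r"^(\d{4})_(\d)_(\d)(?:\.\w+)?$")


def parse_filename(name):
    """Return (cow_id, day, counter) parsed from an image filename."""
    match = FILENAME_RE.match(name)
    if match is None:
        raise ValueError(f"unexpected filename: {name!r}")
    cow_id, day, counter = match.groups()
    return cow_id, int(day), int(counter)


def leave_one_day_out(filenames):
    """Yield (held_out_day, train_files, test_files), one split per day."""
    by_day = defaultdict(list)
    for name in filenames:
        _, day, _ = parse_filename(name)
        by_day[day].append(name)
    for held_out in sorted(by_day):
        test = by_day[held_out]
        train = [n for d, files in by_day.items() if d != held_out for n in files]
        yield held_out, train, test


# Hypothetical filenames, for illustration only.
example = ["0012_1_1.jpg", "0012_2_1.jpg", "0345_1_2.jpg", "0345_3_1.jpg"]
splits = list(leave_one_day_out(example))
for day, train, test in splits:
    print(f"day {day}: {len(train)} training images, {len(test)} test images")
```

Holding out a whole day at a time keeps images taken in the same session out of both the training and the test set of a given split.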


3. Injury Prediction In Competitive Runners With Machine Learning
Authors: Lovdal, Sofie; den Hartigh, Ruud; Azzopardi, George
Year: 2021
Download: https://doi.org/10.34894/UWU9PV
Publication

Lövdal, S. Sofie; Hartigh, Ruud J. R. Den; Azzopardi, George

Injury Prediction in Competitive Runners With Machine Learning Journal Article

Description

The data set consists of a detailed training log from a Dutch high-level running team over a period of seven years (2012-2019). We included the middle- and long-distance runners of the team, that is, those competing at distances between 800 meters and the marathon. This design decision is motivated by the fact that these groups have strong endurance-based components in their training, making their training regimes comparable. The head coach of the team did not change during the years of data collection. The data set contains samples from 74 runners, of whom 27 are women and 47 are men. At the moment of data collection, they had been in the team for an average of 3.7 years. Most athletes competed at a national level, and some also at an international level. The study was conducted according to the requirements of the Declaration of Helsinki, and was approved by the ethics committee of the second author’s institution (research code: PSY-1920-S-0007). (2020-11-20)


4. Detection of illicit accounts over the Ethereum blockchain
Authors: Farrugia, Steven; Ellul, Joshua; Azzopardi, George
Year: 2021
Download: https://doi.org/10.34894/GKAQYN
Publication

Farrugia, Steven; Ellul, Joshua; Azzopardi, George

Detection of illicit accounts over the Ethereum blockchain Journal Article

Description

The recent technological advent of cryptocurrencies and their respective benefits have been accompanied by a number of illegal activities operating over the network, such as money laundering, bribery, phishing, and fraud, among others. In this work we focus on the Ethereum network, which has seen over 400 million transactions since its inception. Using 2179 accounts flagged by the Ethereum community for their illegal activity, coupled with 2502 normal accounts, we seek to detect illicit accounts based on their transaction history using the XGBoost classifier. Using 10-fold cross-validation, XGBoost achieved an average accuracy of 0.963 (± 0.006) with an average AUC of 0.994 (± 0.0007). The top three features with the largest impact on the final model output were established to be ‘Time diff between first and last (Mins)’, ‘Total Ether balance’ and ‘Min value received’. Based on the results we conclude that the proposed approach is highly effective in detecting illicit accounts over the Ethereum network. Our contribution is multi-faceted: firstly, we propose an effective method to detect illicit accounts over the Ethereum network; secondly, we provide insights about the most important features; and thirdly, we publish the compiled data set as a benchmark for future related works.
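
To make the 10-fold protocol concrete, the following sketch partitions the 2179 illicit and 2502 normal accounts into ten folds using only the Python standard library. It is an illustration under stated assumptions, not the authors' pipeline: the stratification by class, the random seed, and the label encoding are choices made here, and the XGBoost training step is deliberately left out.

```python
import random

# Class counts taken from the description above.
N_ILLICIT, N_NORMAL, N_FOLDS = 2179, 2502, 10


def stratified_folds(labels, n_folds, seed=0):
    """Assign each sample index to a fold, keeping class proportions similar."""
    rng = random.Random(seed)
    folds = [[] for _ in range(n_folds)]
    for cls in sorted(set(labels)):
        idx = [i for i, y in enumerate(labels) if y == cls]
        rng.shuffle(idx)
        # Deal indices round-robin so each fold gets a near-equal share per class.
        for pos, i in enumerate(idx):
            folds[pos % n_folds].append(i)
    return folds


labels = [1] * N_ILLICIT + [0] * N_NORMAL  # assumed encoding: 1 = illicit, 0 = normal
folds = stratified_folds(labels, N_FOLDS)

for k, test_idx in enumerate(folds):
    held_out = set(test_idx)
    train_idx = [i for i in range(len(labels)) if i not in held_out]
    illicit_share = sum(labels[i] for i in test_idx) / len(test_idx)
    # A classifier would be fitted on train_idx and scored on test_idx here.
    print(f"fold {k}: train={len(train_idx)} test={len(test_idx)} "
          f"illicit share={illicit_share:.3f}")
```

Each account is used for testing exactly once, and the reported accuracy and AUC are averages (with standard deviations) over the ten held-out folds.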


5. Labelled Dataset of Retinal Images for Glaucoma detection
Authors: Guo, Jiapan; Azzopardi, George; Shi, Chenyu; Jansonius, Nomdo; Petkov, Nicolai
Year: 2021
Download: https://doi.org/10.34894/H2SZSO
Publication

Guo, Jiapan; Azzopardi, George; Shi, Chenyu; Jansonius, Nomdo M; Petkov, Nicolai

Automatic Determination of Vertical Cup-to-Disc Ratio in Retinal Fundus Images for Glaucoma Screening Journal Article

Description

Fundus photography is a viable option for glaucoma population screening. In order to facilitate the development of computer-aided glaucoma detection systems, we publish this annotation dataset that contains manual annotations of glaucoma features for seven public fundus image data sets. All manual annotations are made by a specialised ophthalmologist. For each of the fundus images in the seven fundus datasets, the upper, the bottom, the left and the right boundary coordinates of the optic disc and the cup are stored in a .mat file with the corresponding fundus image name.

The seven public fundus image data sets are: CHASEDB, Diaretdb1_v_1_1, DRISHTI, DRIONS-DB, DRIVE, HRF, and Messidor.
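
The associated publication concerns the vertical cup-to-disc ratio (VCDR), which is the vertical extent of the cup divided by the vertical extent of the optic disc. The sketch below shows how that ratio could be derived from the annotated upper and bottom boundaries. The dictionary keys and the coordinate values are hypothetical; the actual variable names inside the .mat files are not specified here and should be checked before use.

```python
def vertical_cup_to_disc_ratio(disc, cup):
    """Return cup height divided by disc height.

    `disc` and `cup` are dicts with the hypothetical keys "upper" and
    "bottom", each holding the image row (y) coordinate of that boundary.
    """
    disc_height = abs(disc["bottom"] - disc["upper"])
    cup_height = abs(cup["bottom"] - cup["upper"])
    if disc_height == 0:
        raise ValueError("optic disc has zero height")
    return cup_height / disc_height


# Made-up boundary coordinates in pixels, for illustration only.
disc = {"upper": 100, "bottom": 300, "left": 120, "right": 310}
cup = {"upper": 160, "bottom": 250, "left": 170, "right": 260}

vcdr = vertical_cup_to_disc_ratio(disc, cup)
print(f"VCDR = {vcdr:.2f}")
```

Only the upper and bottom boundaries enter the vertical ratio; the left and right boundaries would be used in the same way for a horizontal ratio.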


6. Fall detection and recognition from egocentric visual data: A case study
Authors: Wang, Xueyi; Talavera, Estefania; Karastoyanova, Dimka; Azzopardi, George
Year: 2020
Download: https://doi.org/10.34894/3DV8BF
Publication

Wang, Xueyi; Martinez, Estefania Talavera; Karastoyanova, Dimka; Azzopardi, George

Fall detection and recognition from egocentric visual data: A case study Inproceedings

Description

This data set contains egocentric videos from two cameras attached to the waist and chest of one volunteer. The videos show indoor and outdoor scenes and do not contain people. The data set was used for the evaluation of a novel fall detection system based on egocentric visual data.