Data sets

1. fseval - A Benchmarking Framework for Feature Selection and Ranking Algorithms
Authors	Overschie, Jeroen; Alsahaf, Ahmad; Azzopardi, George
Year	2022
Download	https://doi.org/10.34894/KDEPR0
Publication	Under review
Description	The fseval Python package allows benchmarking Feature Selection and Feature Ranking algorithms on a large scale, and facilitates the comparison of multiple algorithms in a systematic way. In particular, fseval enables users to run experiments in parallel and distributed over multiple machines, and export the results to an SQL database. The execution of an experiment can be fully determined by a configuration file, which means the experiment results can be reproduced easily, given only the configuration file. fseval has high test coverage, continuous integration, and rich documentation. The package is open source and can be installed through PyPI.
Notes	Jeroen Overschie was responsible for the implementation. Ahmad Alsahaf and George Azzopardi supervised the project. The software is published under the GNU General Public License v3.0: https://www.gnu.org/licenses/gpl-3.0.en.html

2. Recognition of Holstein Cattle with Thermal and RGB images
Authors	Bhole, Amey; Udmale, Sandeep S.; Falzon, Owen; Azzopardi, George
Year	2021
Download	https://doi.org/10.34894/7M108F
Publication	Bhole, Amey; Udmale, Sandeep S; Falzon, Owen; Azzopardi, George. CORF3D contour maps with application to Holstein cattle recognition from RGB and thermal images. Expert Systems with Applications, vol. 192, 116354, 2022. doi:10.1016/j.eswa.2021.116354
Abstract	Livestock management involves the monitoring of farm animals by tracking certain physiological and phenotypical characteristics over time. In the dairy industry, for instance, cattle are typically equipped with RFID ear tags. The corresponding data (e.g. milk properties) can then be automatically assigned to the respective cow when they enter the milking station. In order to move towards a more scalable, affordable, and welfare-friendly approach, automatic non-invasive solutions are more desirable. Thus, a non-invasive approach is proposed in this paper for the automatic identification of individual Holstein cattle from the side view while exiting a milking station. It considers input images from a thermal-RGB camera. The thermal images are used to delineate the cow from the background. Subsequently, any occluding rods from the milking station are removed and inpainted with the fast marching algorithm. Then, it extracts the RGB map of the segmented cattle along with a novel CORF3D contour map. The latter contains three contour maps extracted by the Combination of Receptive Fields (CORF) model with different strengths of push–pull inhibition. This mechanism suppresses noise in the form of grain type texture. The effectiveness of the proposed approach is demonstrated by means of experiments using a 5-fold and a leave-one day-out cross-validation on a new data set of 3694 images of 383 cows collected from the Dairy Campus in Leeuwarden (the Netherlands) over 9 days. In particular, when combining RGB and CORF3D maps by late fusion, an average accuracy of was obtained for the 5-fold cross validation and for the leave-one day-out experiment. The two maps were combined by first learning two ConvNet classification models, one for each type of map. The feature vectors in the two FC layers obtained from training images were then concatenated and used to learn a linear SVM classification model. In principle, the proposed approach with the novel CORF3D contour maps is suitable for various image classification applications, especially where grain type texture is a confounding variable.
Description	This data set was collected from the Dairy Campus in Leeuwarden (the Netherlands) with a FLIR E6 thermal camera over a period of 9 days. It consists of 3694 images of 383 cows, with each cow represented by an average of 9 images. Each snapshot produced two images: i) RGB and ii) temperature. The image filenames are in the format [cow_id-4 digits]_[day no-1 digit]_[counter-1 digit]. The timestamp.xlsx file indicates the day number (day 1 to day 9) on which each image in the data set was collected. This makes it possible to design and run a leave-one day-out cross-validation, as we did in our paper. The scripts that reproduce the results reported in the paper are available online, as is the GitHub repository that contains all the scripts.
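The file-name format and the leave-one day-out protocol described above can be sketched in Python as follows. This is a minimal illustration, not the authors' scripts: it assumes the day number is read from the file name rather than from timestamp.xlsx, and the function names and example file names (including the .png extension) are hypothetical.

```python
import re
from collections import defaultdict
from pathlib import Path

# Stated naming scheme: 4-digit cow id, 1-digit day number, 1-digit counter.
NAME_RE = re.compile(r"^(\d{4})_(\d)_(\d)$")


def parse_name(filename):
    """Return (cow_id, day, counter) parsed from an image file name."""
    stem = Path(filename).stem  # drop the extension, whatever it is
    match = NAME_RE.match(stem)
    if match is None:
        raise ValueError(f"unexpected file name: {filename}")
    cow_id, day, counter = match.groups()
    return cow_id, int(day), int(counter)


def leave_one_day_out(filenames):
    """Yield (held_out_day, train_files, test_files), one split per day."""
    by_day = defaultdict(list)
    for name in filenames:
        _, day, _ = parse_name(name)
        by_day[day].append(name)
    for held_out in sorted(by_day):
        test = by_day[held_out]
        train = [n for d, names in by_day.items() if d != held_out for n in names]
        yield held_out, train, test


# Hypothetical file names, for illustration only.
files = ["0001_1_1.png", "0001_2_1.png", "0002_1_1.png", "0002_3_1.png"]
for day, train, test in leave_one_day_out(files):
    print(day, len(train), len(test))
```

Holding out a whole day at a time keeps images taken on the same day from appearing in both the training and the test set.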

3. Injury Prediction In Competitive Runners With Machine Learning
Authors	Lövdal, Sofie; den Hartigh, Ruud; Azzopardi, George
Year	2021
Download	https://doi.org/10.34894/UWU9PV
Publication	Lövdal, S. Sofie; Den Hartigh, Ruud J. R.; Azzopardi, George. Injury Prediction in Competitive Runners With Machine Learning. International Journal of Sports Physiology and Performance, vol. 16, no. 10, pp. 1522-1531, 2021. doi:10.1123/ijspp.2020-0518
Abstract	Purpose: Staying injury free is a major factor for success in sports. Although injuries are difficult to forecast, novel technologies and data-science applications could provide important insights. Our purpose was to use machine learning for the prediction of injuries in runners, based on detailed training logs. Methods: Prediction of injuries was evaluated on a new data set of 74 high-level middle- and long-distance runners, over a period of 7 years. Two analytic approaches were applied. First, the training load from the previous 7 days was expressed as a time series, with each day’s training being described by 10 features. These features were a combination of objective data from a global positioning system watch (eg, duration, distance), together with subjective data about the exertion and success of the training. Second, a training week was summarized by 22 aggregate features, and a time window of 3 weeks before the injury was considered. Results: A predictive system based on bagged XGBoost machine-learning models resulted in receiver operating characteristic curves with average areas under the curves of 0.724 and 0.678 for the day and week approaches, respectively. The results of the day approach especially reflect a reasonably high probability that our system makes correct injury predictions. Conclusions: Our machine-learning-based approach predicts a sizable portion of the injuries, in particular when the model is based on training-load data in the days preceding an injury. Overall, these results demonstrate the possible merits of using machine learning to predict injuries and tailor training programs for athletes.
Description	The data set consists of a detailed training log from a Dutch high-level running team over a period of seven years (2012-2019). We included the middle and long distance runners of the team, that is, those competing on distances between the 800 meters and the marathon. This design decision is motivated by the fact that these groups have strong endurance based components in their training, making their training regimes comparable. The head coach of the team did not change during the years of data collection. The data set contains samples from 74 runners, of whom 27 are women and 47 are men. At the moment of data collection, they had been in the team for an average of 3.7 years. Most athletes competed on a national level, and some also on an international level. The study was conducted according to the requirements of the Declaration of Helsinki, and was approved by the ethics committee of the second author’s institution (research code: PSY-1920-S-0007). (2020-11-20)
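The "day approach" mentioned in the publication abstract (the 7 days before a given day, each described by 10 features) can be illustrated with a small windowing sketch. The function name, the list-of-lists layout of the training log, and the toy values are assumptions made for illustration; the actual layout of the published files may differ.

```python
def day_window(daily_features, t, window=7):
    """Flatten the `window` days that precede day index `t` into one vector.

    `daily_features` is a list with one feature list per day (hypothetical
    layout). With 10 features per day and window=7 the result has 70 values.
    """
    if t < window:
        raise ValueError("not enough history before day t")
    flat = []
    for day in daily_features[t - window:t]:
        flat.extend(day)
    return flat


# Toy log: 10 days, 10 identical placeholder features per day.
log = [[float(d)] * 10 for d in range(10)]
sample = day_window(log, t=7)
print(len(sample))
```

Each sample built this way would then be paired with a label stating whether an injury occurred on day t.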

4. Detection of illicit accounts over the Ethereum blockchain
Authors	Farrugia, Steven; Ellul, Joshua; Azzopardi, George
Year	2021
Download	https://doi.org/10.34894/GKAQYN
Publication	Farrugia, Steven; Ellul, Joshua; Azzopardi, George. Detection of illicit accounts over the Ethereum blockchain. Expert Systems with Applications, vol. 150, 113318, 2020. doi:10.1016/j.eswa.2020.113318
Description	The recent technological advent of cryptocurrencies and their respective benefits have been shrouded with a number of illegal activities operating over the network such as money laundering, bribery, phishing, fraud, among others. In this work we focus on the Ethereum network, which has seen over 400 million transactions since its inception. Using 2179 accounts flagged by the Ethereum community for their illegal activity coupled with 2502 normal accounts, we seek to detect illicit accounts based on their transaction history using the XGBoost classifier. Using 10-fold cross-validation, XGBoost achieved an average accuracy of 0.963 (± 0.006) with an average AUC of 0.994 (± 0.0007). The top three features with the largest impact on the final model output were established to be ‘Time diff between first and last (Mins)’, ‘Total Ether balance’ and ‘Min value received’. Based on the results we conclude that the proposed approach is highly effective in detecting illicit accounts over the Ethereum network. Our contribution is multi-faceted; firstly, we propose an effective method to detect illicit accounts over the Ethereum network; secondly, we provide insights about the most important features; and thirdly, we publish the compiled data set as a benchmark for future related works.
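The 10-fold cross-validation mentioned in the description can be sketched with the standard library alone. This only shows how the 2179 + 2502 accounts could be partitioned into folds; the shuffling seed and the function name are assumptions, and the sketch does not train XGBoost or reproduce the reported scores.

```python
import random


def k_fold_indices(n_samples, k=10, seed=0):
    """Yield (train_indices, test_indices) for each of the k folds."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)  # fixed seed so the split is repeatable
    folds = [indices[i::k] for i in range(k)]  # k nearly equal-sized folds
    for i in range(k):
        test = folds[i]
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, test


n_accounts = 2179 + 2502  # illicit + normal accounts, as stated above
for fold_no, (train, test) in enumerate(k_fold_indices(n_accounts), start=1):
    print(fold_no, len(train), len(test))
```

A classifier would be fitted on each `train` set and scored on the matching `test` set, and the ten scores averaged.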

5. Labelled Dataset of Retinal Images for Glaucoma detection
Authors	Guo, Jiapan; Azzopardi, George; Shi, Chenyu; Jansonius, Nomdo; Petkov, Nicolai
Year	2021
Download	https://doi.org/10.34894/H2SZSO
Publication	Guo, Jiapan; Azzopardi, George; Shi, Chenyu; Jansonius, Nomdo M; Petkov, Nicolai. Automatic Determination of Vertical Cup-to-Disc Ratio in Retinal Fundus Images for Glaucoma Screening. IEEE Access, vol. 7, pp. 8527-8541, 2019. doi:10.1109/ACCESS.2018.2890544
Abstract	Glaucoma is a chronic progressive optic neuropathy that causes visual impairment or blindness if left untreated. It is crucial to diagnose it at an early stage in order to enable treatment. Fundus photography is a viable option for population-based screening. A fundus photograph enables the observation of the excavation of the optic disk, the hallmark of glaucoma. The excavation is quantified as a vertical cup-to-disk ratio (VCDR). The manual assessment of retinal fundus images is, however, time-consuming and costly. Thus, an automated system is necessary to assist human observers. We propose a computer-aided diagnosis system, which consists of the localization of the optic disk, the determination of the height of the optic disk and the cup, and the computation of the VCDR. We evaluated the performance of our approach on eight publicly available datasets, which have, in total, 1712 retinal fundus images. We compared the obtained VCDR values with those provided by an experienced ophthalmologist and achieved a weighted VCDR mean difference of 0.11. The system provides a reliable estimation of the height of the optic disk and the cup in terms of the relative height error (RHE = 0.08 and 0.09, respectively). The Bland–Altman analysis showed that the system achieves a good agreement with the manual annotations, especially for large VCDRs which indicate pathology.
Description	Fundus photography is a viable option for glaucoma population screening. In order to facilitate the development of computer-aided glaucoma detection systems, we publish this annotation dataset that contains manual annotations of glaucoma features for seven public fundus image data sets. All manual annotations were made by a specialised ophthalmologist. For each of the fundus images in the seven fundus datasets, the upper, bottom, left and right boundary coordinates of the optic disc and the cup are stored in a .mat file named after the corresponding fundus image. The seven public fundus image data sets are: CHASEDB, Diaretdb1_v_1_1, DRINSHTI, DRIONS-DB, DRIVE, HRF, Messidor.
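Given the upper and bottom boundary coordinates of the optic disc and the cup, the vertical cup-to-disc ratio (VCDR) is the cup height divided by the disc height. The sketch below shows that arithmetic only. The `Boundary` class, its field names, and the example coordinates are hypothetical: the variable names inside the .mat files are not given here, so loading them (for example with `scipy.io.loadmat`) is left out.

```python
from dataclasses import dataclass


@dataclass
class Boundary:
    """Boundary coordinates of one structure, in pixels (hypothetical layout)."""
    top: float
    bottom: float
    left: float
    right: float

    @property
    def height(self):
        # abs() makes the result independent of the image's y-axis direction
        return abs(self.bottom - self.top)


def vcdr(disc, cup):
    """Vertical cup-to-disc ratio: cup height divided by disc height."""
    if disc.height == 0:
        raise ValueError("optic disc height is zero")
    return cup.height / disc.height


# Made-up coordinates, for illustration only.
disc = Boundary(top=100, bottom=300, left=120, right=310)
cup = Boundary(top=150, bottom=250, left=170, right=260)
print(vcdr(disc, cup))
```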

6. Fall detection and recognition from egocentric visual data: A case study
Authors	Wang, Xueyi; Talavera, Estefania; Karastoyanova, Dimka; Azzopardi, George
Year	2020
Download	https://doi.org/10.34894/3DV8BF
Publication	Wang, Xueyi; Talavera Martinez, Estefania; Karastoyanova, Dimka; Azzopardi, George. Fall detection and recognition from egocentric visual data: A case study. In: Del Bimbo, Alberto; Cucchiara, Rita; Sclaroff, Stan; Farinella, Giovanni Maria; Mei, Tao; Bertini, Marco; et al. (eds.), 25th International Conference on Pattern Recognition Workshops, ICPR 2020, 2021. doi:10.1007/978-3-030-68763-2_33
Abstract	Falling is among the most damaging events for elderly people, which sometimes may end with significant injuries. Due to fear of falling, many elderly people choose to stay more at home in order to feel safer. In this work, we propose a new fall detection and recognition approach, which analyses egocentric videos collected by wearable cameras through a computer vision/machine learning pipeline. More specifically, we conduct a case study with one volunteer who collected video data from two cameras; one attached to the chest and the other one attached to the waist. A total of 776 videos were collected describing four types of falls and nine kinds of non-falls. Our method works as follows: extracts several uniformly distributed frames from the videos, uses a pre-trained ConvNet model to describe each frame by a feature vector, followed by feature fusion and a classification model. Our proposed model demonstrates its suitability for the detection and recognition of falls from the data captured by the two cameras together. For this case study, we detect all falls with only one false positive, and reach a balanced accuracy of 93% in the recognition of the 13 types of activities. Similar results are obtained for videos of the two cameras when considered separately. Moreover, we observe better performance of videos collected in indoor scenes.
Description	This data set contains egocentric videos from two cameras attached to the waist and chest of one volunteer. The videos show indoor and outdoor scenes and do not contain people. The data set was used for the evaluation of a novel fall detection system using egocentric visual data.
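The publication abstract says the method extracts several uniformly distributed frames from each video. One simple way to pick such frames is sketched below; the exact sampling scheme used in the paper is not stated here, so the spacing rule and the function name are assumptions.

```python
def uniform_frame_indices(total_frames, n_samples):
    """Return n_samples frame indices spread evenly over a video.

    The first and last frames are always included when n_samples >= 2;
    a single sample falls on the middle frame.
    """
    if total_frames <= 0 or n_samples <= 0:
        raise ValueError("total_frames and n_samples must be positive")
    if n_samples > total_frames:
        raise ValueError("cannot sample more frames than the video has")
    if n_samples == 1:
        return [total_frames // 2]
    last = total_frames - 1
    # Integer arithmetic avoids floating-point rounding surprises.
    return [(i * last) // (n_samples - 1) for i in range(n_samples)]


print(uniform_frame_indices(100, 5))
```

The selected frames would then each be passed through the pre-trained ConvNet to obtain one feature vector per frame.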
