Rising interest in datasets and machine learning models for colonoscopy images

For many years, researchers at SimulaMet have worked on datasets of colonoscopy images and videos and developed machine learning models trained on this data. There is rising interest in these developments: they are, for example, used in Nvidia Clara Holoscan and mentioned in the 2022 AI Index from Stanford University.


Professor Pål Halvorsen, head of the Department of Holistic Systems (HOST), initiated the work on medical image analysis nine years ago. Chief Research Scientist Michael Riegler started his PhD at SimulaMet around the same time, and together they have worked on collecting and analysing data.

Many PhD students from the department have been involved in the work on these datasets. In the last couple of years, Debesh Jha, Steven Hicks, Vajira Thambawita, and Konstantin Pogorelov have played an important role in the development. During his PhD, titled Machine Learning-based Classification, Detection, and Segmentation of Medical Images, which he defended last year, Debesh Jha mainly worked on the Kvasir-SEG dataset and on the development of the machine learning model ColonSegNet (pubmed.ncbi.nlm.nih.gov).

The Kvasir-SEG dataset and ColonSegNet are receiving an increasing amount of attention, most recently by being used in the Nvidia Clara Holoscan Sample App Data for AI Colonoscopy Segmentation of Polyps (catalog.ngc.nvidia.com) and being mentioned in the 2022 AI Index from Stanford University (https://aiindex.stanford.edu).

This year’s AI Index highlights this rising interest in specific computer vision subtasks: “Only 3 research papers tested systems against the Kvasir-SEG medical imaging benchmark before 2020. In 2021, 25 research papers did. Such an increase suggests that AI research is moving toward research that can have more direct, real-world applications.”

At the time of writing, according to Google Scholar, the dataset has been used in more than 300 papers since 2020.

– The interest in these datasets has exploded in the last couple of years, and the application of artificial intelligence using healthcare data is on the rise, especially in colonoscopy, says Riegler.  

– It is also very hard to get a hold of these medical images and make them available, which makes the datasets unique, says Halvorsen. 


Nvidia is a global leader in artificial intelligence software and hardware, and Clara Holoscan is its framework for artificial intelligence in health. Anyone can use the framework to develop an application, whether for research or commercial purposes.

– The fact that they have decided to use our dataset and algorithm as a standard in their framework is really exciting, and it is huge for us and our research. This means that everyone using Nvidia equipment, which I assume is in the millions, can see our dataset and algorithm as an example. What others will develop in the field of colonoscopy can be compared to our work, and many will use our dataset as a starting point, says Halvorsen.

More than 20% of polyps are missed in colonoscopies

Using these datasets to train algorithms and develop tools for detecting, for example, polyps in the colon can be of great assistance in clinics, as artificial intelligence can catch what doctors may have missed. Depending somewhat on the study, roughly 20% of polyps on average are missed during colonoscopies.

– This is how we got into the work with these datasets. We had been doing video analysis for a while and got in touch with some doctors who raised a challenge from their examinations: polyps get overlooked, says Halvorsen.

The work was first initiated in collaboration with researchers from the University of Tromsø, and later with Bærum Hospital, the Cancer Registry, Oslo University Hospital, and Karolinska University Hospital.

The datasets

The first dataset, Kvasir, was released in 2017 and consisted of 1,000 high-resolution gastrointestinal polyp images. Then, in 2019, Kvasir-SEG was released: a subset of Kvasir in which the images are segmented. The segmentation was done by doctors and cross-verified by professional gastroenterologists. This is the dataset used in Clara Holoscan and mentioned in the 2022 AI Index.

– Segmentation in computer vision is very popular and beneficial. With this dataset you can tell, for example, where in an image a polyp is located, not just that it exists within the image, says Riegler.
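
To make the idea concrete, here is a minimal sketch of how such an image/mask pair might be read in Python. The parallel "images/" and "masks/" folder names and the example file name are assumptions about the dataset layout for illustration, not details stated in this article; check the dataset's own documentation for the exact structure.

```python
# Minimal sketch: read a colonoscopy frame and its polyp segmentation mask,
# assuming parallel "images/" and "masks/" folders that share file names.
from pathlib import Path

import numpy as np
from PIL import Image

def load_pair(root: str, name: str):
    """Return an RGB frame and a binary polyp mask as NumPy arrays."""
    root = Path(root)
    image = np.asarray(Image.open(root / "images" / name).convert("RGB"))
    mask = np.asarray(Image.open(root / "masks" / name).convert("L"))
    return image, (mask > 127).astype(np.uint8)  # 1 = polyp pixel, 0 = background

# "example.jpg" is a placeholder file name, not an actual dataset entry.
image, mask = load_pair("Kvasir-SEG", "example.jpg")
print(image.shape, mask.mean())  # mask.mean() = fraction of pixels labelled as polyp
```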

In addition, researchers from the Department of Holistic Systems have built the datasets HyperKvasir (2020), the largest gastrointestinal image dataset in the world, Kvasir-Capsule (2021), and Kvasir-Instrument (2021). Both HyperKvasir and Kvasir-Capsule were published in Nature Scientific Data.

Segmentation of polyps at a fast pace 

The images have also been used to train and evaluate machine learning models. The most recent, ColonSegNet, was introduced in the paper “Real-Time Polyp Detection, Localization and Segmentation in Colonoscopy Using Deep Learning” (pubmed.ncbi.nlm.nih.gov). ColonSegNet is an encoder-decoder architecture for the segmentation of colonoscopy images.

– This machine learning model is trained on the Kvasir-SEG dataset. It is specially made for the segmentation of polyps, and to do that at a fast pace, says Riegler.
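
As a rough illustration of what an encoder-decoder segmentation network looks like, here is a deliberately simplified sketch in PyTorch. It is not the published ColonSegNet architecture (which uses residual blocks and other refinements); the model name and all layer sizes are made up for illustration.

```python
# Simplified encoder-decoder segmentation network, for illustration only.
import torch
import torch.nn as nn

def conv_block(in_ch, out_ch):
    return nn.Sequential(
        nn.Conv2d(in_ch, out_ch, kernel_size=3, padding=1),
        nn.BatchNorm2d(out_ch),
        nn.ReLU(inplace=True),
    )

class TinySegNet(nn.Module):  # hypothetical name, not the real ColonSegNet
    def __init__(self):
        super().__init__()
        self.enc1 = conv_block(3, 32)
        self.enc2 = conv_block(32, 64)
        self.pool = nn.MaxPool2d(2)
        self.up = nn.Upsample(scale_factor=2, mode="bilinear", align_corners=False)
        self.dec1 = conv_block(64, 32)
        self.head = nn.Conv2d(32, 1, kernel_size=1)  # one-channel polyp mask

    def forward(self, x):
        x = self.enc1(x)                      # encoder: extract features
        x = self.enc2(self.pool(x))           # downsample, deeper features
        x = self.dec1(self.up(x))             # decoder: recover spatial resolution
        return torch.sigmoid(self.head(x))    # per-pixel polyp probability

model = TinySegNet()
out = model(torch.randn(1, 3, 256, 256))      # e.g. one 256x256 colonoscopy frame
print(out.shape)                              # torch.Size([1, 1, 256, 256])
```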

Positive results through testing

A start-up company named Augere Medical (augere.md) has spun out of these research activities, and both Pål and Michael are involved in its work. Augere Medical is led by Andreas Petlund.


– One of the goals of our work is to put our results back into use in society. We saw the potential and decided to build a whole system. The system has been tested in several clinics, the results are promising, and the doctors find it helpful, says Halvorsen.


The future

Currently, researchers at HOST are working on simpler ways to share even more data, and on how such data can be obtained. In particular, synthetic data holds a lot of potential in medical research: it makes data easier to share and also makes it possible to create more data on, for example, uncommon diseases, which can then be used to train better models. An example is shown in a paper by researchers from HOST, titled “SinGAN-Seg: Synthetic training data generation for medical image segmentation” (https://journals.plos.org). In that paper, synthetic polyp images and segmentation masks were used to train an image segmentation model.
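
As a conceptual sketch of that idea, the snippet below trains a small segmentation model on generated image/mask pairs. The generator function and the tiny model are placeholders chosen for illustration; they are not the SinGAN-Seg code or API.

```python
# Conceptual sketch: train a segmentation model on synthetic image/mask pairs.
import torch
from torch.utils.data import DataLoader, TensorDataset

def make_synthetic_pairs(n, size=128):
    # Placeholder for a trained generator: SinGAN-Seg produces realistic polyp
    # images together with matching segmentation masks; here we just use noise.
    images = torch.rand(n, 3, size, size)
    masks = (torch.rand(n, 1, size, size) > 0.9).float()
    return images, masks

images, masks = make_synthetic_pairs(32)
loader = DataLoader(TensorDataset(images, masks), batch_size=8, shuffle=True)

# A tiny stand-in segmentation model, just so the loop runs end to end.
model = torch.nn.Sequential(
    torch.nn.Conv2d(3, 16, kernel_size=3, padding=1), torch.nn.ReLU(),
    torch.nn.Conv2d(16, 1, kernel_size=1), torch.nn.Sigmoid(),
)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
loss_fn = torch.nn.BCELoss()  # per-pixel binary cross-entropy

for batch_images, batch_masks in loader:  # one pass over the synthetic data
    optimizer.zero_grad()
    loss = loss_fn(model(batch_images), batch_masks)
    loss.backward()
    optimizer.step()
```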