18 Jan 2023 | Research article | Intelligent and Autonomous Systems
Detecting Tables with Weakly Supervised Bounding Box Extraction
Tables in Ancient Manuscripts — A Wealth of Information
Historic documents record long-term studies across a wide range of research areas. Because these documents are scarce, the information they hold is at risk of deterioration and irretrievable loss. To preserve and retrieve some of the most important parts of this vast body of information, we focused on detecting document pages that contain tables.
Tables are very useful to scientists because they present essential information in a condensed format. Detecting them falls within the field of object detection, which has progressed rapidly with the advent of deep-learning algorithms. One of these algorithms is Faster-RCNN, which we combined with a pre-processing Gabor filter, weakly supervised bounding box extraction, and pseudo-labeling to respond to the following challenges:
- Generalizing well enough to detect pages with tables among 32 million images
- Detecting tables with various structures (figure 1)
- Insufficient labeled data for the training phase of deep learning algorithms
Applying a Gabor filter
In the first step of our system design, we applied the Gabor filter to:
- Make the data set more compatible with the Faster-RCNN-based framework.
- Discriminate better between the target object (table) and other parts of an image by exaggerating the gap or white background between text and tables.
- Remove visual noise, such as ink stains.
Figure 2 shows the preprocessed image with the Gabor filter.
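As a rough illustration of this pre-processing step, the sketch below builds a real-valued Gabor kernel and applies it to a small grayscale image in plain Python. The function names (`gabor_kernel`, `apply_filter`) and every parameter value are assumptions made for illustration; the article does not state the filter settings used in the research, and a real pipeline would more likely rely on an image-processing library.

```python
import math


def gabor_kernel(size, sigma, theta, wavelength, gamma=0.5, psi=0.0):
    """Build a size x size real Gabor kernel (size must be odd).

    All parameter values are illustrative assumptions, not the
    settings used in the research.
    """
    half = size // 2
    kernel = []
    for y in range(-half, half + 1):
        row = []
        for x in range(-half, half + 1):
            # Rotate the coordinates so the filter responds to
            # structures oriented at angle theta.
            xr = x * math.cos(theta) + y * math.sin(theta)
            yr = -x * math.sin(theta) + y * math.cos(theta)
            # Gaussian envelope multiplied by a cosine carrier.
            envelope = math.exp(-(xr ** 2 + (gamma * yr) ** 2) / (2 * sigma ** 2))
            carrier = math.cos(2 * math.pi * xr / wavelength + psi)
            row.append(envelope * carrier)
        kernel.append(row)
    return kernel


def apply_filter(image, kernel):
    """Slide the kernel over a grayscale image (list of rows), zero padded.

    This is a correlation; for psi = 0 the kernel is point-symmetric,
    so the result equals a convolution.
    """
    height, width = len(image), len(image[0])
    half = len(kernel) // 2
    out = [[0.0] * width for _ in range(height)]
    for i in range(height):
        for j in range(width):
            acc = 0.0
            for di in range(-half, half + 1):
                for dj in range(-half, half + 1):
                    ii, jj = i + di, j + dj
                    if 0 <= ii < height and 0 <= jj < width:
                        acc += image[ii][jj] * kernel[di + half][dj + half]
            out[i][j] = acc
    return out
```

In practice, responses from several orientations (for example, horizontal and vertical, to match table rulings) could be combined, but which orientations suit a given collection is an open choice rather than something the article specifies.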
Terms and Definitions
In this research, we used two sources of scanned historic documents as follows:
- ECCO: Eighteenth-Century Collections Online (ECCO) is an enormous collection of historic documents with over 32 million pages. Based on the timeline of collected data, ECCO is divided into ECCO1 and ECCO2.
- NAS: This data set contains around 0.5 million scanned document images from a longer time period than ECCO (1666 to 1916).
For this binary detection task, we defined two labels:
- Table: Presentation of important data in text or numerical format in rows and columns to summarize information in a compact manner.
- Non-table: All scanned document images without tables, such as diagrams, illustrations, maps, and images either on a blank page or on a page with text (figure 3).
Given our data sets and the characteristics of the Faster-RCNN algorithm, we chose it as the main object detection module in our research for the following reasons:
- Better performance on images with low resolutions
- Detecting large and small size objects
- One of the best algorithms to reach a balance between speed and accuracy
Weakly Supervised Bounding Box Extraction
To reach proper performance, a Faster-RCNN-based model must be trained on adequately labeled data with bounding boxes around the objects. But manually labeling data and extracting bounding boxes are costly procedures. To solve this issue, we introduced weakly supervised bounding box extraction (figure 4), an automatic spiral learning approach. It consists of the following five phases:
- Phase 1: Train the model on the table label, which biases it toward tables
- Phase 2: Test the biased model from Phase 1 on non-table images ‒ Output: weak bounding boxes for non-table images
- Phase 3: Train with two labels, i.e., tables with accurate bounding boxes and non-tables with weak bounding boxes
- Phase 4: Pseudo-labeling ‒ Test on unlabeled data to augment the training set
- Phase 5: Retrain the model with the data added in the previous phase
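The five phases can be sketched as a short training loop. This is a minimal outline under stated assumptions, not the implementation from the research: `train` and `predict` are hypothetical callables standing in for Faster-RCNN training and inference, samples are assumed to be `(image, box, label)` tuples, and the `min_score` confidence threshold used to keep pseudo-labels is an assumption, since the article does not say how pseudo-labels are selected.

```python
def spiral_training(table_samples, non_table_images, unlabeled_images,
                    train, predict, min_score=0.9):
    """Outline of the five-phase loop.

    train(samples) -> model and predict(model, image) -> list of
    (box, label, score) are hypothetical stand-ins for a detector.
    """
    # Phase 1: train on table samples only, biasing the model toward tables.
    model = train(table_samples)

    # Phase 2: run the biased model on non-table pages. Whatever it boxes
    # there becomes a weak bounding box labeled "non-table".
    weak_non_table = []
    for image in non_table_images:
        for box, _label, _score in predict(model, image):
            weak_non_table.append((image, box, "non-table"))

    # Phase 3: train with both labels (accurate table boxes and weak
    # non-table boxes).
    train_set = list(table_samples) + weak_non_table
    model = train(train_set)

    # Phase 4: pseudo-labeling. Keep only confident predictions on
    # unlabeled pages (the threshold is an assumption).
    pseudo_labeled = []
    for image in unlabeled_images:
        for box, label, score in predict(model, image):
            if score >= min_score:
                pseudo_labeled.append((image, box, label))

    # Phase 5: retrain on the augmented training set.
    model = train(train_set + pseudo_labeled)
    return model, weak_non_table, pseudo_labeled
```

Passing the detector in as two callables keeps the outline independent of any particular deep-learning framework; only the order of the phases is taken from the list above.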
We compared the Faster-RCNN-based model with and without weakly supervised bounding box extraction, using subsets of the ECCO (a mix of ECCO1 and ECCO2) and NAS data sets:
Table 1. Results of the Faster-RCNN-based model with and without weakly supervised bounding box extraction on the subset of the ECCO data set
Table 2. Results of the Faster-RCNN-based model with and without weakly supervised bounding box extraction on the subset of the NAS data set
To detect all images with tables, we applied our model to three different data sets, which include 32 million images in total (figure 5).
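Applying a trained detector to a large collection amounts to streaming over the pages and keeping those with at least one confident table detection. The sketch below shows one way this could look; `detect` is a hypothetical callable returning `(box, label, score)` tuples, and both the "any confident table box" rule and the `min_score` value are assumptions rather than details from the research.

```python
def pages_with_tables(pages, detect, min_score=0.5):
    """Yield the ids of pages that have at least one confident table box.

    pages is an iterable of (page_id, image) pairs; detect(image) is a
    hypothetical detector returning (box, label, score) tuples.
    """
    for page_id, image in pages:
        if any(label == "table" and score >= min_score
               for _box, label, score in detect(image)):
            yield page_id
```

Writing this as a generator means pages are processed one at a time, which matters when the collection holds millions of images and cannot be loaded at once.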
By combining the Gabor filter with weakly supervised bounding box extraction, we prepared better input data and obtained enough bounding boxes around the target objects for the training phase, which led to high performance at low cost. The methodology is also general and robust enough to detect tables with various layouts among 32 million scanned historical document images.
The high labor cost of extracting bounding boxes and the difficulty of achieving reliable performance on unbalanced data sets are two common challenges in machine learning tasks. We addressed both with a spiral learning approach based on the weakly supervised bounding box extraction technique.
For more information on this research, please read the following research paper:
Samari, A., Piper, A., Hedley, A., Cheriet, M. (2021). Weakly Supervised Bounding Box Extraction for Unlabeled Data in Table Detection. In: Pattern Recognition. ICPR International Workshops and Challenges. ICPR 2021. Lecture Notes in Computer Science, vol 12667. Springer, Cham. https://doi.org/10.1007/978-3-030-68787-8_25
Arash Samari is a master's student at ÉTS.
Program: Automated Manufacturing Engineering
Research laboratories: SYNCHROMEDIA – Multimedia Communication in Telepresence
Mohamed Cheriet is a professor in the Department of Systems Engineering at ÉTS and Director of Synchromedia. His research focuses on eco-cloud computing, knowledge acquisition and artificial intelligence systems and learning algorithms.
Program: Automated Manufacturing Engineering
Research chair: Canada Research Chair in Smart Sustainable Eco-Cloud