SCIENTIFIC NEWS AND
INNOVATION FROM ÉTS
Detecting Tables with Weakly Supervised Bounding Box Extraction

By: Arash Samari, Mohamed Cheriet, Andrew Piper, Alison Hedley

Arash Samari
Arash Samari is a master's student at ÉTS.

Mohamed Cheriet
Mohamed Cheriet is a professor in the Department of Systems Engineering at ÉTS and Director of Synchromedia. His research focuses on eco-cloud computing, knowledge acquisition, artificial intelligence systems, and learning algorithms.

Andrew Piper
Andrew Piper is a professor at McGill University.

Alison Hedley
Alison Hedley is an operations assistant at Antimodular Research.

Tables in Ancient Manuscripts — A Wealth of Information

Historic documents record long-term studies in a wide range of research areas. Because these documents are scarce, the information they hold is at risk of decay and irretrievable loss. To preserve and retrieve some of the most important parts of this vast body of information, we focused on detecting document pages that contain tables.

Tables are very useful to scientists because they present essential information in a condensed form. Detecting them is an object detection task, a field that has progressed recently with the advent of deep-learning algorithms. One of these algorithms is Faster-RCNN [1], which we combined with a pre-processing Gabor filter [2], weakly supervised bounding box extraction [3], and pseudo-labeling to respond to the following challenges:

  1. Generalizing well enough to detect the images that contain tables among 32 million images
  2. Detecting tables with various structures (figure 1)
  3. Insufficient labeled data for the training phase of deep learning algorithms

Figure 1. Samples of tables in historic documents

Applying a Gabor filter

In the first step of our system design, we applied the Gabor filter to:

  1. Make the data set more compatible with the Faster-RCNN-based framework.
  2. Better discriminate the target object (table) from other parts of an image by exaggerating the gap, or white background, between text and tables.
  3. Remove visual noise, such as ink stains.

Figure 2 shows an image preprocessed with the Gabor filter.

Figure 2. Processed image with Gabor filter
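
As a rough illustration of the filtering step described above, the sketch below builds a real-valued Gabor kernel and applies it to a small grayscale image. It is a minimal pure-Python sketch, not the implementation used in this research: the function names `gabor_kernel` and `filter_image` and all parameter values (kernel size, `sigma`, `theta`, `lambd`, `gamma`, `psi`) are assumptions chosen for illustration, since the article does not report the settings used.

```python
import math


def gabor_kernel(size, sigma, theta, lambd, gamma=0.5, psi=0.0):
    """Return a size x size real Gabor kernel as a list of rows.

    sigma: width of the Gaussian envelope, theta: orientation in radians,
    lambd: wavelength of the cosine carrier, gamma: aspect ratio,
    psi: phase offset. All values here are illustrative assumptions.
    """
    if size % 2 == 0:
        raise ValueError("size must be odd so the kernel has a center pixel")
    half = size // 2
    kernel = []
    for y in range(-half, half + 1):
        row = []
        for x in range(-half, half + 1):
            # Rotate the coordinates by theta.
            xr = x * math.cos(theta) + y * math.sin(theta)
            yr = -x * math.sin(theta) + y * math.cos(theta)
            envelope = math.exp(-(xr ** 2 + (gamma * yr) ** 2) / (2 * sigma ** 2))
            carrier = math.cos(2 * math.pi * xr / lambd + psi)
            row.append(envelope * carrier)
        kernel.append(row)
    return kernel


def filter_image(image, kernel):
    """Correlate a grayscale image (list of rows) with the kernel.

    Pixels outside the image are treated as zero.
    """
    height, width = len(image), len(image[0])
    half = len(kernel) // 2
    out = [[0.0] * width for _ in range(height)]
    for i in range(height):
        for j in range(width):
            acc = 0.0
            for di in range(-half, half + 1):
                for dj in range(-half, half + 1):
                    ii, jj = i + di, j + dj
                    if 0 <= ii < height and 0 <= jj < width:
                        acc += image[ii][jj] * kernel[di + half][dj + half]
            out[i][j] = acc
    return out


# Toy 7x7 "page": one bright vertical stroke on a dark background.
page = [[1.0 if col == 3 else 0.0 for col in range(7)] for _ in range(7)]
kernel = gabor_kernel(size=5, sigma=2.0, theta=0.0, lambd=4.0)
response = filter_image(page, kernel)
```

In practice an image-processing library would replace the explicit loops, and the orientation and wavelength would presumably need tuning to the ruling and spacing found in the scanned pages.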

Terms and Definitions

In this research, we used the following two sources of scanned historic documents:

  • ECCO: Eighteenth-Century Collections Online (ECCO) is an enormous collection of historic documents with over 32 million pages. Based on the timeline of collected data, ECCO is divided into ECCO1 and ECCO2.
  • NAS: This data set contains around 0.5 million scanned document images from a longer time period than ECCO (1666 to 1916).

For this binary detection task, we defined two labels:

  • Table: Presentation of important data in text or numerical format in rows and columns to summarize information in a compact manner.
  • Non-table: All scanned document images without tables, such as diagrams, illustrations, maps, and images either on a blank page or on a page with text (figure 3).

Figure 3. Samples of non-tables in historic documents

Faster-RCNN

Given our data sets and the characteristics of the Faster-RCNN algorithm, we chose it as the main object detection module in our research for the following reasons:

  1. It performs well on low-resolution images.
  2. It detects both large and small objects.
  3. It offers one of the best balances between speed and accuracy.
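
Because the task is ultimately binary at the page level, the detector's per-box output has to be turned into a table or non-table decision for each image. The sketch below shows one plausible way to do that. The `Detection` type, the `page_contains_table` function, and the 0.5 score threshold are hypothetical and are not taken from the research.

```python
from typing import List, NamedTuple, Tuple


class Detection(NamedTuple):
    """One box returned by a detector (hypothetical structure)."""
    label: str                               # "table" or "non-table"
    score: float                             # detector confidence in [0, 1]
    box: Tuple[float, float, float, float]   # (x_min, y_min, x_max, y_max)


def page_contains_table(detections: List[Detection], threshold: float = 0.5) -> bool:
    """Flag a page as a table page if its most confident detection
    is labeled "table" and its score reaches the threshold."""
    if not detections:
        return False
    best = max(detections, key=lambda d: d.score)
    return best.label == "table" and best.score >= threshold


# Example: a page where the detector is most confident about a table.
page_detections = [
    Detection("table", 0.91, (40.0, 120.0, 560.0, 700.0)),
    Detection("non-table", 0.35, (30.0, 20.0, 580.0, 100.0)),
]
is_table_page = page_contains_table(page_detections)
```

Using only the most confident detection keeps the page-level decision simple. Other aggregation rules, such as requiring a minimum box area, would be equally reasonable.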

Weakly Supervised Bounding Box Extraction

A Faster-RCNN-based model must be trained on adequately labeled data, with bounding boxes around the objects, to perform well. However, manually labeling data and extracting bounding boxes are costly procedures. To solve this issue, we introduced the weakly supervised bounding box extraction technique (figure 4), an automatic spiral learning approach. It consists of the following five phases:

  1. Phase 1: Train the model on tables only, which biases it toward tables.
  2. Phase 2: Test the biased model on non-table images. Output: weak bounding boxes for non-tables.
  3. Phase 3: Train with two labels, i.e., tables with accurate bounding boxes and non-tables with weak bounding boxes.
  4. Phase 4: Pseudo-labeling. Test on unlabeled data to augment the training set.
  5. Phase 5: Retrain the model with the data added in the previous phase.

Figure 4. Weakly supervised bounding box extraction
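
The five phases can be sketched as a single loop over training and inference calls. This is a schematic outline under stated assumptions, not the authors' code: `train` and `predict` stand in for any detector's training and inference routines, the toy versions below only mimic their behavior on string "images", and the `min_score` confidence cut-off for pseudo-labels is an assumption that the article does not specify.

```python
from typing import Callable, Dict, List, Tuple

Box = Tuple[int, int, int, int]        # (x_min, y_min, x_max, y_max)
Sample = Dict[str, object]             # {"image": ..., "label": ..., "box": ...}
Prediction = Tuple[Box, str, float]    # (box, label, score)


def spiral_training(
    train: Callable[[List[Sample]], dict],
    predict: Callable[[dict, str], Prediction],
    tables: List[Sample],
    non_tables: List[str],
    unlabeled: List[str],
    min_score: float = 0.9,            # assumed cut-off for pseudo-labels
):
    # Phase 1: train on tables only, which biases the model toward tables.
    model = train(tables)

    # Phase 2: run the biased model on non-table images and keep its boxes
    # as weak bounding boxes, relabeled as non-table.
    weak = []
    for image in non_tables:
        box, _label, _score = predict(model, image)
        weak.append({"image": image, "label": "non-table", "box": box})

    # Phase 3: train with both labels.
    train_set = tables + weak
    model = train(train_set)

    # Phase 4: pseudo-label unlabeled images, keeping confident predictions.
    pseudo = []
    for image in unlabeled:
        box, label, score = predict(model, image)
        if score >= min_score:
            pseudo.append({"image": image, "label": label, "box": box})

    # Phase 5: retrain with the pseudo-labeled data added.
    model = train(train_set + pseudo)
    return model, weak, pseudo


def toy_train(samples: List[Sample]) -> dict:
    """Stand-in for detector training: records labels seen and sample count."""
    return {"labels": sorted({str(s["label"]) for s in samples}), "n": len(samples)}


def toy_predict(model: dict, image: str) -> Prediction:
    """Stand-in for inference on string 'images' (purely illustrative)."""
    if "grid" in image:
        return (1, 1, 8, 8), "table", 0.95
    # A model that has never seen non-tables can only answer "table".
    label = "non-table" if "non-table" in model["labels"] else "table"
    score = 0.92 if "map" in image else 0.4
    return (0, 0, 10, 10), label, score


tables = [
    {"image": "grid_1", "label": "table", "box": (1, 1, 8, 8)},
    {"image": "grid_2", "label": "table", "box": (2, 1, 9, 7)},
]
model, weak, pseudo = spiral_training(
    toy_train, toy_predict, tables,
    non_tables=["map_1", "portrait_1"],
    unlabeled=["grid_3", "map_2", "blur_1"],
)
```

Only the first set of table boxes requires manual annotation in this outline. The non-table boxes and the pseudo-labels are produced by the model itself.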

Results

We compared the Faster-RCNN-based model with and without weakly supervised bounding box extraction, using subsets of the ECCO (a mix of ECCO1 and ECCO2) and NAS data sets:

Table 1. Results of the Faster-RCNN-based model with and without weakly supervised bounding box extraction on the subset of the ECCO data set

Table 2. Results of the Faster-RCNN-based model with and without weakly supervised bounding box extraction on the subset of the NAS data set

To detect all images with tables, we applied our model to three different data sets, which include 32 million images in total (figure 5).

Figure 5. Results of our model

Conclusion

By taking advantage of the Gabor filter and weakly supervised bounding box extraction, we prepared better input data and enough bounding boxes around the target objects for the training phase, which led to high performance at low cost. The methodology is also general and robust enough to detect tables with various layouts among 32 million scanned historical document images.

The high labor cost of extracting bounding boxes and the difficulty of achieving reliable performance on unbalanced data sets are two common challenges in machine learning tasks. We addressed both with a spiral learning approach based on the weakly supervised bounding box extraction technique.

Additional Information

For more information on this research, please read the following research paper:

Samari, A., Piper, A., Hedley, A., Cheriet, M. (2021). Weakly Supervised Bounding Box Extraction for Unlabeled Data in Table Detection. In: Pattern Recognition. ICPR International Workshops and Challenges. ICPR 2021. Lecture Notes in Computer Science, vol. 12667. Springer, Cham. https://doi.org/10.1007/978-3-030-68787-8_25

Arash Samari

Program: Automated Manufacturing Engineering

Research laboratories: SYNCHROMEDIA – Multimedia Communication in Telepresence

Mohamed Cheriet

Program: Automated Manufacturing Engineering

Research chair: Canada Research Chair in Smart Sustainable Eco-Cloud

Research laboratories: SYNCHROMEDIA – Multimedia Communication in Telepresence; CIRODD – Centre interdisciplinaire de recherche en opérationnalisation du développement durable


