Unlocking Data in Structured PDF Documents with AI

Golshid Varasteh Kia


Many industries have fallen behind the technology boom and are playing catch-up. They have not been able to fully benefit from advancements such as artificial intelligence (AI) for automation and for generating insights. Such industries include Construction, Manufacturing, Insurance, Mortgage Lending, Legal and even Healthcare.

What’s common in all these industries is the absence of digitalisation. In most cases, paper, electronic documents and forms are bogging down the digitalisation process. Some of these industries work with decades-old documents that are hundreds of pages long. With so much content, drawings, images, contracts, compliance notes, test results, manuals, reports, and more, reviewing them all manually and extracting what’s required is almost impossible. More importantly, we are trying to promote efficiency at work in these industries and not create more manual labour.

In ‘What can machine learning do for your business?’, Alex highlighted the significance of data in developing machine learning models, and showed how the data and models could boost a business. In Angel’s article ‘Helping Change Ahead make a difference to lives of homeless and vulnerable individuals’, he explained how we are helping the Change Ahead social enterprise produce data for machine learning models.

In this article, I’ll talk about how we can generate data from a company’s digital assets, particularly from their thousands and thousands of PDF documents. I will later discuss how we are using the extracted text data to develop text-based AI models using Natural Language Processing (NLP) to access intelligence and insights for the company from the information they already have.


Some companies believe that their documents are digital assets ready for AI, and they are almost right. We always tell them that what they have are documents and files. Until these are treated as data (extracting what’s needed from them, cleaning it up, modelling it as data, and sometimes labelling it ourselves), they do not qualify as effective digital assets for AI. One such company, a construction certification organisation, approached us. Xerini was asked to investigate approximately 5,000 PDF documents on different construction products and systems to generate data and create insights into these products and systems using AI.

Xerini took the opportunity not only to work on these information-rich documents but also to spin off a mini research project when University College London (UCL) invited us to collaborate with the Department of Computer Science Machine Learning team as part of the Industry Exchange Network (IXN). In this collaboration, Xerini was invited to propose summer thesis projects for Machine Learning MSc students (CSML/DSML/ML: Computational Statistics and Machine Learning; Data Science and Machine Learning; Machine Learning), and to guide and supervise the students for the duration of their projects. We explain the collaboration later in this article.

Project Overview

In this project we extracted data (text, images, and metadata) from approximately 5,000 PDF documents in order to enable AI models to access insights for the client. We created a command-line tool to extract raw text, images, and sections within the text under different headings. The tool also creates a manifest file for the extracted artefacts to record information and metadata about each document [Figure 1]. The solution is modular, so it can be extended later with further extraction improvements.

Figure 1

Documents Overview

Each PDF document represents a construction product or system and contains text explaining different aspects of the product/system including general information, regulations with which it complies, technical information, an installation guide, a series of images and technical diagrams, several tables of different specification values, and a bibliography. In addition, each PDF will have some metadata associated with it, such as file name, author’s name, created date, modified date, etc.

Text and Section Extraction

The first step is to extract the raw text from the PDF document, so that the text content can be accessed independently of the PDF format. We used the Apache Tika Java library to develop an extraction module that writes the text embedded in the PDF to a raw ‘.txt’ file and performs a clean-up to remove unwanted characters, breaks, and gaps. A limitation of this approach compared with other methods, such as Optical Character Recognition (OCR), is that the extracted text follows the order in which the text was created in the document rather than the layout and style we see when we open the PDF. However, it is a fast and accurate parser, and combining it with OCR can increase accuracy and extend the input options to scanned documents. Apache Tika was chosen because of its versatile configuration and its ability to handle multiple file types, which should make it easier to support other document formats in future.
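The clean-up step can be illustrated with a minimal sketch. It is written in Python for brevity rather than in the Java used by the tool, it does not call Apache Tika, and the specific rules (Unicode normalisation, re-joining words hyphenated across a line break, collapsing gaps) are assumptions about what "unwanted characters and breaks or gaps" might cover, not the rules the extraction module actually applies. The function name clean_text is hypothetical.

```python
import re
import unicodedata


def clean_text(raw):
    """Tidy raw text extracted from a PDF (illustrative rules only)."""
    # Normalise Unicode so ligatures and non-breaking spaces become plain text.
    text = unicodedata.normalize("NFKC", raw)
    # Assumption: a hyphen directly before a line break is a split word.
    text = re.sub(r"(\w)-\n(\w)", r"\1\2", text)
    # Drop control and format characters, keeping newlines and tabs.
    text = "".join(
        ch for ch in text
        if ch in "\n\t" or unicodedata.category(ch)[0] != "C"
    )
    # Collapse runs of spaces and tabs, trim spaces around line breaks,
    # and reduce three or more line breaks to a single blank line.
    text = re.sub(r"[ \t]+", " ", text)
    text = re.sub(r" ?\n ?", "\n", text)
    text = re.sub(r"\n{3,}", "\n\n", text)
    return text.strip()


if __name__ == "__main__":
    sample = "Thermal  insula-\ntion\x0c\n\n\n\nFire  rating"
    print(clean_text(sample))
```

Re-joining hyphenated words can wrongly merge genuine compounds that happen to break at a line end, so a real pipeline would probably need to make that rule configurable.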

The raw text is saved with a unique identifier, relating the file to a single source document. Once the full raw text is extracted, we identify the sections within the main body of text and separate each section into its own file. This was a challenging task because all text formatting had already been lost in the previous step, so it was no longer possible to identify sections using cues such as bold headings. To tackle this challenge, the code looks at the start of each line and identifies words and phrases that can be found within a configurable dictionary of possible section headings. If the only content in the line is a word or phrase similar to one in the dictionary, the line is identified as a section heading. The text following a section heading is saved as the section content, with a unique identifier pointing to the source document. The benefit of the dictionary over identifying headings by format is that the dictionary can be expanded to include other known sections.

Breaking the text into sections allows for a better-targeted analysis, as some sections are always very similar across documents and therefore carry less weight in the later analysis described below.
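The dictionary-based heading detection can be sketched roughly as follows. This is an illustration in Python, not the tool's Java code: the heading list, the function names, and the use of difflib with a 0.85 cutoff to decide what counts as "similar" are all assumptions made for the example.

```python
import difflib
import re

# Hypothetical heading dictionary; in the tool this list is configurable.
SECTION_HEADINGS = ["general", "regulations", "technical specification",
                    "installation", "bibliography"]


def normalise(line):
    """Lower-case a line and strip leading numbering and punctuation."""
    line = re.sub(r"^\s*\d+(\.\d+)*\.?\s*", "", line)
    return re.sub(r"[^a-z ]", "", line.lower()).strip()


def match_heading(line, headings=SECTION_HEADINGS, cutoff=0.85):
    """Return the dictionary heading that the whole line resembles, or None."""
    key = normalise(line)
    if not key:
        return None
    close = difflib.get_close_matches(key, headings, n=1, cutoff=cutoff)
    return close[0] if close else None


def split_sections(text, headings=SECTION_HEADINGS):
    """Group lines under the most recent heading found in the dictionary."""
    sections = {}
    current = "preamble"  # text that appears before the first heading
    for line in text.splitlines():
        heading = match_heading(line, headings)
        if heading is not None:
            current = heading
            sections.setdefault(current, [])
        elif line.strip():
            sections.setdefault(current, []).append(line.strip())
    return {name: "\n".join(body) for name, body in sections.items()}


if __name__ == "__main__":
    doc = ("Product Sheet 12\n1. General\nThe system is a render.\n"
           "2. Installation\nFix the boards.\nSeal the joints.\n")
    for name, body in split_sections(doc).items():
        print(name, "->", repr(body))
```

Because the whole line must resemble a dictionary entry, a heading word that appears inside a longer sentence is left in the section body, and supporting a new section only requires adding an entry to the dictionary.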

Image Extraction

Images in the documents are often embedded as ‘.jpg/.jpeg’ files, but there are also diagrams and drawings that are embedded as multiple fragments of smaller vector drawings with overlaid annotations. We therefore had to find a method that can extract both embedded images and diagrams/drawings without losing information or annotations. For the former, we use the Apache Tika Java library to extract the embedded JPEG images: the embedded artefacts found by the parser are filtered to include only files with ‘.jpg/.jpeg’ extensions.

For the latter, we first convert the PDF files to rasterised images (300 DPI) using Apache PDFBox. We then use image processing algorithms, complemented by a set of heuristic rules, to extract a single complete image that includes all vector fragments and the overlaid text. We apply the Harris Corner Detector (a corner detection operator typically used in computer vision algorithms for extracting corners and inferring features of an image) and obtain a set of points (noisy pixels with X, Y coordinate values) denoting every corner found in the image. We then clean up the points to narrow down the correct image corners using two rules in an alternating pattern. Rule 1: find clusters of points in close proximity and remove them. Rule 2: for each point not in a cluster, try to find a matching point on either the X or Y axis and, if no match is found, remove the point. We complete the algorithm by computing the bounding boxes, finding sets of four corner points that describe a valid rectangle.
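One possible reading of the two clean-up rules and the rectangle search is sketched below in Python. The corner detection itself is not shown, and the points are assumed to arrive as integer (x, y) tuples. The radius and tolerance values, the function names, and the choice to match rectangle corners on exact coordinates are assumptions for illustration, not details from the tool.

```python
from itertools import combinations


def remove_clusters(points, radius=5):
    """Rule 1: drop every point that has a neighbour within `radius` pixels."""
    kept = []
    for i, (px, py) in enumerate(points):
        crowded = any(
            i != j and abs(px - qx) <= radius and abs(py - qy) <= radius
            for j, (qx, qy) in enumerate(points)
        )
        if not crowded:
            kept.append((px, py))
    return kept


def remove_unmatched(points, tol=2):
    """Rule 2: drop points with no partner sharing (nearly) the same X or Y."""
    kept = []
    for i, (px, py) in enumerate(points):
        matched = any(
            i != j and (abs(px - qx) <= tol or abs(py - qy) <= tol)
            for j, (qx, qy) in enumerate(points)
        )
        if matched:
            kept.append((px, py))
    return kept


def clean_points(points, radius=5, tol=2):
    """Alternate the two rules until no further points are removed."""
    current = list(points)
    while True:
        cleaned = remove_unmatched(remove_clusters(current, radius), tol)
        if len(cleaned) == len(current):
            return cleaned
        current = cleaned


def find_rectangles(points):
    """Return (x1, y1, x2, y2) boxes whose four corners are all present."""
    pts = set(points)
    boxes = set()
    for (x1, y1), (x2, y2) in combinations(sorted(pts), 2):
        if x1 < x2 and y1 < y2 and (x1, y2) in pts and (x2, y1) in pts:
            boxes.add((x1, y1, x2, y2))
    return sorted(boxes)


if __name__ == "__main__":
    corners = [(100, 100), (500, 100), (100, 400), (500, 400)]
    noise = [(300, 250), (302, 251), (301, 253), (50, 720)]
    print(find_rectangles(clean_points(corners + noise)))
```

The loop matters because removing an unmatched point can leave its former partner unmatched in turn, so a single pass of each rule is not always enough.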

At this stage, the bounding boxes are not guaranteed to describe valid images, as the algorithm has only found valid rectangles on the PDF page. To remove rectangles that are not diagrams or drawings, we apply a set of case-specific heuristic rules. For example, we remove small rectangles with a total area of less than 400×400 pixels, and smaller rectangles that lie within larger rectangles. After the bounding box identification and clean-up, we use the bounding boxes to crop the main page and save each image encountered to a ‘.jpg’ file in our output folder.
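The two example heuristics can be sketched as a simple filter over candidate boxes. This Python sketch assumes boxes are (x1, y1, x2, y2) tuples in pixels and interprets "400×400 pixels" as an area threshold of 160,000 square pixels. Both choices, and the function names, are assumptions for the example. Cropping and saving the images is not shown.

```python
MIN_AREA = 400 * 400  # assumed reading of the 400x400 pixel threshold


def area(box):
    x1, y1, x2, y2 = box
    return (x2 - x1) * (y2 - y1)


def contains(outer, inner):
    """True if `inner` lies entirely within a different box `outer`."""
    return (
        outer != inner
        and outer[0] <= inner[0] and outer[1] <= inner[1]
        and outer[2] >= inner[2] and outer[3] >= inner[3]
    )


def filter_boxes(boxes, min_area=MIN_AREA):
    """Drop boxes that are too small or nested inside another kept box."""
    # dict.fromkeys removes exact duplicates while preserving order.
    large = [b for b in dict.fromkeys(boxes) if area(b) >= min_area]
    return [b for b in large if not any(contains(o, b) for o in large)]


if __name__ == "__main__":
    candidates = [
        (100, 100, 1100, 900),   # large diagram
        (200, 200, 700, 800),    # nested inside the first box
        (50, 1000, 250, 1100),   # too small
        (1200, 100, 1800, 900),  # second diagram
    ]
    print(filter_boxes(candidates))
```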

We save these images with unique identifiers, relating the images, diagrams, and drawings to a single source PDF document.

Manifest File

A manifest file contains metadata and information about all the files extracted from a single source document. Our manifest files record metadata such as title, author, created date, modified date, bundles of extracted figure file names, extracted raw text file names, and extracted section file names. It also includes placeholders for the summary of the document and figure descriptions in the figure captions. The manifest file is unique to a source document and is saved in ‘.json’ format.
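As an illustration, a manifest of this kind could be assembled and serialised as below. The field names, file names, and the build_manifest helper are hypothetical, since the actual schema is not reproduced in this article. The sketch only shows the general shape: metadata, lists of extracted files, and empty placeholders for the summary and figure descriptions.

```python
import json


def build_manifest(doc_id, metadata, raw_text_file, section_files, figure_files):
    """Collect everything extracted from one source PDF into a single record."""
    return {
        "documentId": doc_id,
        "title": metadata.get("title"),
        "author": metadata.get("author"),
        "created": metadata.get("created"),
        "modified": metadata.get("modified"),
        "rawTextFile": raw_text_file,
        "sectionFiles": list(section_files),
        # Placeholders to be filled in by later processing stages.
        "figures": [{"file": name, "description": None} for name in figure_files],
        "summary": None,
    }


if __name__ == "__main__":
    manifest = build_manifest(
        "doc-0001",
        {"title": "Example render system", "author": "Unknown",
         "created": "2019-03-01", "modified": "2020-07-15"},
        "doc-0001.txt",
        ["doc-0001-general.txt", "doc-0001-installation.txt"],
        ["doc-0001-fig-1.jpg"],
    )
    print(json.dumps(manifest, indent=2))
```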


Testing the algorithms and each module’s functionality is always a top priority for Xerini. We frequently spend a significant amount of time peer-reviewing the algorithms and writing test scenarios to cover all intended functionality.

Automated test cases were built using JUnit v4.0. Unit tests were designed as various scenarios to cover the functionality of the image, text, and manifest extraction modules. In addition, we carried out manual tests: a series of documents was randomly selected and run through the extractor, and the results were checked by hand.

To evaluate the quality of text extraction, we used the cosine similarity metric to measure the similarity between the term-frequency vectors of the text extracted using the Xerini algorithm and the text extracted using Adobe Acrobat. For image extraction, we examined the extraction quality manually, reviewing each document to count the images we would expect to be extracted and comparing that count with the images actually extracted by the Xerini algorithm.
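The text-extraction metric can be sketched in a few lines of Python. The tokenisation rule (lower-cased runs of letters and digits) and the function names are assumptions, and the evaluation code actually used may tokenise or weight terms differently.

```python
import math
import re
from collections import Counter


def term_frequencies(text):
    """Count lower-cased alphanumeric tokens in the text."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine_similarity(text_a, text_b):
    """Cosine of the angle between the two term-frequency vectors."""
    tf_a, tf_b = term_frequencies(text_a), term_frequencies(text_b)
    dot = sum(tf_a[t] * tf_b[t] for t in tf_a.keys() & tf_b.keys())
    norm_a = math.sqrt(sum(c * c for c in tf_a.values()))
    norm_b = math.sqrt(sum(c * c for c in tf_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty text has no direction to compare
    return dot / (norm_a * norm_b)


if __name__ == "__main__":
    ours = "The render system is suitable for external walls."
    reference = "The render system is suitable for external walls"
    print(round(cosine_similarity(ours, reference), 3))
```

A score close to 1 means the two extractions contain nearly the same words with nearly the same frequencies. Because term-frequency vectors ignore word order, this metric does not detect differences in reading order between the two extractions.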

Lastly, for manifest extraction, we compared the metadata extracted by the Xerini algorithm with the metadata embedded within the PDF, accessed using Adobe Acrobat. We reached high accuracy in all three manual test cases, which indicates that the algorithms behave in line with the intended functionality.

Text Similarity, Summarisation and Keyword Extraction (in collaboration with UCL)

After extracting the data from the PDF documents and cleaning it up, the next stage is to use the data to access insights. The steps are feature extraction from the data, setting up AI models, and generating results [Figure 2]. The challenge with these documents is that they are heavily text-based: some are around 25,000 words long, with multiple sections. Because of the size and volume of these documents, reducing them to concise information, analysing them as groups, and finding the correct material in good time is an inefficient and laborious process when done by hand.

Figure 2

As part of the UCL Industry Exchange Network (IXN 2020/2021) invitation, Xerini and the client proposed a 2021 summer dissertation project for Machine Learning MSc students in UCL Computer Science. UCL Computer Science has a long history of industry engagement, which was formalised with the launch of the IXN. This educational methodology enables students to work on real-world problems as an assessed core component of their degrees.

Xerini, in collaboration with UCL, is generating summaries and extracting keywords from various sections of the source documents using text-based AI and NLP models. This research project will help with accessing the knowledge within the heavy volume of text efficiently, answering questions for engineers and stakeholders swiftly, enabling text and ultimately product/system classification, and supporting better search optimisation. Automatic summarisation aims to generate a meaningful, concise description of the original text, while keyword extraction reduces the text to a few words and phrases that best describe it. In addition, we are grouping documents that are technically and semantically close to each other in various sections, which helps the relevant professionals and us find the documents (products/systems) most similar to the document for a particular product/system.

It is important to choose models for text summarisation and keyword extraction that capture semantic similarity and support abstractive summarisation, rather than extractive summarisation, where the summary consists of exact sentences from the original text. For this reason, for our NLP task we have chosen the pre-trained transformer models Bidirectional Encoder Representations from Transformers (BERT), developed by Google in 2018, and Generative Pre-trained Transformer 3 (GPT-3), introduced by OpenAI in 2020. Both are powerful, large deep-learning language models that use an attention mechanism; BERT is open-source, whereas GPT-3 is available through OpenAI’s API rather than as open-source code.

In the project’s clustering segment, we continue to use the above transformers for word embedding, representing the texts in 3D vector space. As the pre-trained transformer models use an attention mechanism, each embedding also reflects the importance of the word in its context, even when the embedding is large. Following the word embedding, unsupervised clustering algorithms, such as K-Means, are used to group different products and systems into clusters representing similarity. To evaluate clustering performance and to find the optimum number of clusters, we are using the purity and entropy validation measures.
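For readers unfamiliar with the two validation measures, here is a minimal Python sketch of how purity and entropy are commonly computed. It assumes that cluster assignments (for example from K-Means) and a reference label for each document (for example a known product category) are already available. The labels and function names are hypothetical, and the clustering itself is not shown.

```python
import math
from collections import Counter, defaultdict


def _label_counts(cluster_ids, labels):
    """Count how often each reference label occurs in each cluster."""
    counts = defaultdict(Counter)
    for cluster, label in zip(cluster_ids, labels):
        counts[cluster][label] += 1
    return counts


def purity(cluster_ids, labels):
    """Fraction of documents that carry their cluster's majority label."""
    counts = _label_counts(cluster_ids, labels)
    return sum(max(c.values()) for c in counts.values()) / len(labels)


def entropy(cluster_ids, labels):
    """Size-weighted average of each cluster's label entropy, in bits."""
    counts = _label_counts(cluster_ids, labels)
    total = len(labels)
    result = 0.0
    for label_counts in counts.values():
        size = sum(label_counts.values())
        h = -sum((n / size) * math.log2(n / size) for n in label_counts.values())
        result += (size / total) * h
    return result


if __name__ == "__main__":
    clusters = [0, 0, 0, 1, 1, 1]
    categories = ["render", "render", "roofing", "roofing", "roofing", "roofing"]
    print(purity(clusters, categories), entropy(clusters, categories))
```

Higher purity and lower entropy both indicate clusters that line up more closely with the reference labels. Both measures tend to improve as the number of clusters grows, so they are usually compared across several cluster counts rather than simply maximised or minimised.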

We will describe and examine the findings of this study in future blog posts. So, stay tuned. 


We have learned from the COVID-19 pandemic that time and human resources are limited and should be used efficiently. We have learned that smart, collaborative methods, and being able to access information from any place with limited resources, are critical in the workplace. The technology boom is expected to be one of the drivers of the post-COVID recovery (similar to post-2008). A significant portion of this recovery will come from the rapid digitalisation of the technologically slow-growing industries, such as Construction, Manufacturing, Insurance, Mortgage Lending, Legal and even Healthcare. Many of the projects that come our way involve the digitalisation and optimisation of somewhat manual or inefficient processes that are currently delaying the development of these industries. We use our expertise in machine learning, data management, and integrating complex and disparate systems to empower these advancements.

To enable this boom, the first and foremost step is to investigate these industries, identify what can be used as data, and create sustainable strategies for generating and capturing that data.

This article has discussed how Xerini has used thousands of text-heavy PDF documents from a construction certification organisation to generate text, images, and metadata on construction products and systems and ultimately create insights for the client. We have designed a command-line tool that extracts and uniquely identifies embedded raw text, concise sections of the text, images and metadata on each product and system. We have also introduced the collaborative research project we have undertaken with UCL on using NLP models and pre-trained transformers such as BERT and GPT-3 for summarisation, keyword extraction, and unsupervised clustering of the information. Attaining concise insights on these products and systems enables efficient access to the vast knowledge of these documents; this is a significant benefit for our client, its customers, and its stakeholders.