Removing Boilerplate from Webpages

Introduction

Boilerplate in the context of web content refers to the standard or repetitive sections of a webpage that don’t contain the main content but are present on multiple pages of a website. Examples include site navigation, footers, headers, advertisements, and other standard design elements.

Boilerplate detection and removal is relevant in several applications:

  • Web scraping/crawling: When extracting useful content from web pages, eliminating the boilerplate can help in obtaining only the essential data.
  • Search engines: For search engines, boilerplate content can be seen as duplicate content across different pages of the same website, which can skew indexing and ranking processes. By identifying and disregarding the boilerplate, search engines can focus on the main content of the pages.
  • Text analytics: When analyzing the text of a web page for natural language processing or other linguistic tasks, the boilerplate can introduce noise into the data.

At Globality, we rely on boilerplate removal to refine the text scraped from websites. By doing so, we meticulously extract valuable signals, including case studies, office locations, service offerings, and industry specifics. These insights then become the foundation for training our classifier and matching models.

Existing Solutions for Boilerplate Detection

The techniques for boilerplate detection can be broadly divided into three primary categories [1]:

Rule-Based and Heuristic Methods: These methods rely on shallow features such as text length, text-to-link density, and text-to-tag density. However, they frequently fall short on contemporary website designs, particularly those saturated with advertisements.

Machine Learning-Based Methods: These are rapidly gaining traction in the industry. Their primary function is to classify different segments of a webpage as either boilerplate or main content. The category encompasses both traditional machine learning techniques and more advanced deep learning methodologies. While the former demands manual feature engineering, the latter thrives on the capabilities of language modeling, often obviating the need for substantial manual input.

Website-Based Methods: These methods operate on the principle that pages within the same domain tend to exhibit a consistent style. Established techniques, such as those proposed in [2] and [3], remove boilerplate content by identifying analogous subtrees across sets of pages from the same domain.

Our approach echoes traditional website-based strategies but differs in two ways. First, it uses an efficient plain-text representation of HTML nodes for subtree comparison. This representation intentionally disregards tag attributes, allowing the algorithm to detect near-identical subtrees that share structure and content but differ in their attributes. Second, the technique is both comprehensive and efficient: it detects not only elements common across the whole domain but also those shared by a smaller subset of its pages, and it does so in O(n) by comparing consecutive pages sorted by URL. This strategy exploits the structured URL paths that modern domains tend to use.
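
To illustrate the first point, here is a minimal sketch (using lxml; not deboiler’s actual implementation) of how two subtrees that differ only in their tag attributes can be reduced to the same plain-text representation:

from lxml import etree

def normalize(node):
    # Tag names and text content, recursively combined; attributes dropped
    parts = [node.tag, (node.text or "").strip()]
    parts += [normalize(child) for child in node]
    return " ".join(p for p in parts if p)

# Two footers with identical structure and content but different attributes
a = etree.fromstring('<footer class="dark"><p id="c1">© 2023 Acme</p></footer>')
b = etree.fromstring('<footer class="light"><p id="c2">© 2023 Acme</p></footer>')

assert normalize(a) == normalize(b)  # attributes ignored, so the subtrees match

Because attributes are dropped, a navigation bar that carries different CSS classes on different pages still yields the same signature.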

Introducing the deboiler Package

At Globality, we have open-sourced our boilerplate-removal library, deboiler. This package offers a unique website-based algorithm. Unlike rule-based and ML-based methods, our approach requires access to multiple pages from the same domain to pinpoint boilerplate content. As such, it is tailored for use cases where every page of a domain is accessible and needs to be cleaned at once.

At a high level, deboiler detects boilerplate elements by identifying near-identical subtrees (from the HTML DOM tree) that are shared between pages in the domain. The following provides more details about the underlying approach (a simplified code sketch follows the list):

  • Candidate subtrees: Nodes with specific HTML tags (like <div>, <nav>, <footer>, and <header>) are candidate boilerplate nodes.
  • Subtree comparison: Each subtree gets represented as plain text, derived by recursively combining its elements’ representations while discarding HTML attributes.
  • Identifying boilerplate from two pages: Subtrees that are shared between two pages (more than a configurable number of times) are marked as boilerplate.
  • Compiling boilerplate across the domain: The O(n) efficient technique involves sorting pages by URL and comparing each page against the succeeding one. Given the structured URL paths prevalent in modern domains, pages under similar directories often bear resemblance. This minimizes computation while maximizing boilerplate detection.
  • Guard against duplicate pages: Avoids erroneously classifying all elements as boilerplate by sidestepping pairs with high intersection-over-union ratios.
  • Cleaning process: Any subtree in a page identified as boilerplate is removed.
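
Putting these pieces together, the following is a simplified sketch of the procedure in Python. It is illustrative only: the candidate tags, the IoU threshold, and the data structures are assumptions for the sketch, and deboiler’s actual implementation differs in its details.

from lxml import html

CANDIDATE_TAGS = {"div", "nav", "footer", "header"}
IOU_THRESHOLD = 0.9  # assumed value for the duplicate-page guard

def normalize(node):
    # Plain-text representation: tag names and text, attributes discarded
    parts = [node.tag, (node.text or "").strip()]
    parts += [normalize(child) for child in node if isinstance(child.tag, str)]
    return " ".join(p for p in parts if p)

def candidate_signatures(tree):
    # Map each candidate subtree's normalized form to its node(s)
    signatures = {}
    for node in tree.iter(*CANDIDATE_TAGS):
        signatures.setdefault(normalize(node), []).append(node)
    return signatures

def detect_boilerplate(pages):
    """pages: list of (url, raw_html) pairs from a single domain."""
    # Sort by URL so pages under similar paths become neighbors
    pages = sorted(pages, key=lambda page: page[0])
    page_signatures = [
        set(candidate_signatures(html.fromstring(raw))) for _, raw in pages
    ]
    boilerplate = set()
    # O(n): each page is compared only against its successor
    for left, right in zip(page_signatures, page_signatures[1:]):
        shared, union = left & right, left | right
        # Guard: skip near-duplicate pairs, or every element would match
        if union and len(shared) / len(union) > IOU_THRESHOLD:
            continue
        boilerplate |= shared
    return boilerplate

def clean(raw_html, boilerplate):
    # Remove every subtree whose signature was marked as boilerplate
    tree = html.fromstring(raw_html)
    for signature, nodes in candidate_signatures(tree).items():
        if signature in boilerplate:
            for node in nodes:
                parent = node.getparent()
                if parent is not None:
                    parent.remove(node)
    return html.tostring(tree, encoding="unicode")

In deboiler, these steps are wrapped behind the fit/transform API shown in the next section.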

The package is built on lxml, ensuring efficient memory usage and speed. Additionally, it provides a user-friendly API to derive several valuable attributes from a cleaned page, encompassing the title, description, headings, breadcrumbs, lists, and main text.

How to Use deboiler

The deboiler package is built with a simple scikit-learn-like API. The following shows how we can create a dataset from several pages (belonging to the same domain) to identify (apply the fit method) and remove (apply the transform method) boilerplate elements.

from deboiler.dataset import JsonDataset
from deboiler import Deboiler

dataset = JsonDataset("path-to-json-lines-file")
deboiler = Deboiler(
    n_processes=1,  # number of processes
    operation_mode="memory",  # operation mode: `memory` or `performance`
    domain="globality",  # domain name (used for logging only)
)

# call the fit method to identify boilerplate elements
deboiler.fit(dataset)

output_pages = []
# call the transform method to yield cleaned pages
for output_page in deboiler.transform(dataset):
	# do something with the output_page
	output_pages.append(output_page)
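
Each cleaned page carries the extracted attributes described earlier (title, description, headings, and so on). The attribute names in the snippet below are assumptions based on that description rather than the verified API, so consult the package documentation for the exact names:

for output_page in deboiler.transform(dataset):
    # Hypothetical attribute names, inferred from the description above
    print(output_page.title)     # page title
    print(output_page.headings)  # page headings
    print(output_page.text)      # main text, with boilerplate removed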

The deboiler module supports two modes of operation:

  • memory mode: This mode offers the lowest memory footprint. It also supports multi-processing.
  • performance mode: In this mode, parsed pages are kept in memory during fit and reused during transform, resulting in faster processing at the cost of a higher memory footprint. This mode does not support multi-processing.

The following plot compares deboiler performance across modes of operation and numbers of processes. In this benchmark, deboiler cleans pages from ~140 domains with 10 to 10k pages each. The “performance” mode completes the task faster (38 min vs. 54 min) than the “memory” mode with a single process, i.e. (memory, 1). However, the “memory” mode can outperform the “performance” mode when multi-processing is enabled (e.g. 5 or 10 processes in this example).

It is worth noting that the difference between modes of operation and multi-processing becomes more pronounced as the domain size increases.

References

[1] H. Zhang and J. Wang, “Boilerplate Detection via Semantic Classification of TextBlocks,” arXiv:2203.04467 [cs], Mar. 2022, Accessed: Mar. 24, 2022. [Online]. Available: http://arxiv.org/abs/2203.04467.

[2] L. Yi, B. Liu, and X. Li, “Eliminating noisy information in Web pages for data mining,” in Proceedings of the ninth ACM SIGKDD international conference on Knowledge discovery and data mining, in KDD ’03. New York, NY, USA: Association for Computing Machinery, Aug. 2003, pp. 296–305. doi: 10.1145/956750.956785.

[3] K. Vieira, A. S. da Silva, N. Pinto, E. S. de Moura, J. M. B. Cavalcanti, and J. Freire, “A fast and robust method for web page template detection and removal,” in Proceedings of the 15th ACM international conference on Information and knowledge management – CIKM ’06, Arlington, Virginia, USA: ACM Press, 2006, p. 258. doi: 10.1145/1183614.1183654.

Author

Salman Mashayekh

Salman is a dedicated Data Scientist specializing in crafting machine intelligence software with a product-centric focus. Over the last five years, he has played a pivotal role at Globality, developing advanced natural language models for predictive analysis and understanding. These models have been instrumental in enhancing Globality's platform, which is designed to boost productivity and purpose through autonomous sourcing.