Boilerplate in the context of web content refers to the standard or repetitive sections of a webpage that don’t contain the main content but are present on multiple pages of a website. Examples include site navigation, footers, headers, advertisements, and other standard design elements.
Boilerplate detection and removal is relevant in several applications. At Globality, we rely on boilerplate removal to refine the text scraped from websites. By doing so, we extract valuable signals, including case studies, office locations, service offerings, and industry specifics. These signals then become the foundation for training our classifier and matching models.
The techniques for boilerplate detection can be broadly divided into three primary categories [1]:
Rule-Based and Heuristic Methods: These methods rely on shallow features such as text length, text-to-link density, and text-to-tag density. However, they frequently fall short when dealing with contemporary website designs, particularly those saturated with advertisements.
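To illustrate one such shallow feature, the sketch below computes a simple link density: the share of a block's text that sits inside anchor tags. This is only a toy example using Python's standard library, not a production heuristic; any threshold you would apply on top of it is an assumption to be tuned.

```python
from html.parser import HTMLParser

class LinkDensityParser(HTMLParser):
    """Accumulate total text length and the text length inside <a> tags."""

    def __init__(self):
        super().__init__()
        self.in_link = 0
        self.total_chars = 0
        self.link_chars = 0

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self.in_link += 1

    def handle_endtag(self, tag):
        if tag == "a" and self.in_link:
            self.in_link -= 1

    def handle_data(self, data):
        text = data.strip()
        self.total_chars += len(text)
        if self.in_link:
            self.link_chars += len(text)

def link_density(html: str) -> float:
    """Fraction of visible text that lives inside links (0.0 to 1.0)."""
    parser = LinkDensityParser()
    parser.feed(html)
    return parser.link_chars / max(parser.total_chars, 1)

nav = '<ul><li><a href="/">Home</a></li><li><a href="/about">About</a></li></ul>'
para = '<p>We build software for clients. See our <a href="/work">work</a>.</p>'
# A navigation block is almost entirely link text; a content paragraph is not.
print(link_density(nav), link_density(para))
```

A rule-based cleaner would flag the high-density block as likely boilerplate, which is exactly where such heuristics break down on ad-heavy modern layouts.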
Machine Learning-Based Methods: These are rapidly gaining traction in the industry. Their primary function is to classify different segments of a webpage as either boilerplate or main content. The category encompasses both traditional machine learning techniques and more advanced deep learning methodologies. While the former demands manual feature engineering, the latter thrives on the capabilities of language modeling, often obviating the need for substantial manual input.
Website-Based Methods: These methods operate on the principle that pages within the same domain tend to share a consistent style. Established techniques, like those proposed by [2] and [3], root out boilerplate content by identifying analogous subtrees across sets of pages from the same domain.
Our approach, while echoing elements of traditional website-based strategies, has distinct features. First, it employs an efficient plain-text representation of HTML nodes for subtree comparison. This representation intentionally disregards tag attributes, enabling the algorithm to detect near-identical subtrees with similar structure and content but differing tag attributes. Second, our technique is both comprehensive and efficient. It offers a holistic boilerplate detection strategy, singling out not just elements shared across the whole domain but also those shared by a narrower subset of its pages. This wide-reaching approach is accomplished with O(n) efficiency by comparing consecutive pages sorted by URL, a strategy informed by the tendency of modern domains to use organized URL paths.
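A minimal sketch of the attribute-agnostic comparison idea (not deboiler's actual implementation): serialize each subtree using only tag names and text content, so two nodes that differ only in attributes such as class or href produce identical representations.

```python
import xml.etree.ElementTree as ET

def fingerprint(node: ET.Element) -> str:
    """Serialize a subtree as tag names + text, ignoring all attributes."""
    parts = [node.tag, (node.text or "").strip()]
    for child in node:
        parts.append(fingerprint(child))
        parts.append((child.tail or "").strip())
    return "|".join(parts)

# Same structure and text, different attributes -> identical fingerprints.
a = ET.fromstring('<nav class="dark"><a href="/about">About</a></nav>')
b = ET.fromstring('<nav class="light"><a href="/about?ref=x">About</a></nav>')
# Different text -> different fingerprint.
c = ET.fromstring('<nav><a href="/contact">Contact</a></nav>')
print(fingerprint(a) == fingerprint(b), fingerprint(a) == fingerprint(c))
```

Comparing such fingerprints (or their hashes) lets near-identical subtrees match even when attributes vary from page to page.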
The deboiler Package

We at Globality have open sourced our boilerplate removal library, called deboiler. This package offers a unique website-based algorithm. Distinguishing it from rule-based and ML-based methods, our approach requires access to multiple pages from the same domain to pinpoint boilerplate content. As such, it is tailored for use cases where every page of a domain is accessible and requires cleaning at the same time.
At a high level, deboiler detects boilerplate elements by identifying near-identical subtrees (in the HTML DOM tree) that are shared between pages of the domain. The following provides more details about the underlying approach:

- Nodes with certain tags (e.g. `<div>`, `<nav>`, `<footer>`, and `<header>`) are candidate boilerplate nodes.
- The O(n)-efficient technique involves sorting pages by URL and comparing each page against the succeeding one. Given the structured URL paths prevalent in modern domains, pages under similar directories often bear resemblance. This minimizes computation while maximizing boilerplate detection.

The package is built on lxml, ensuring efficient memory usage and speed. Additionally, it provides a user-friendly API to derive several valuable attributes from a cleaned page, encompassing the title, description, headings, breadcrumbs, lists, and main text.
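The consecutive-comparison trick can be sketched in a few lines. Rather than comparing all O(n²) page pairs, only the n−1 adjacent pairs in URL-sorted order are compared (the URLs below are invented for illustration):

```python
def consecutive_pairs(urls: list[str]) -> list[tuple[str, str]]:
    """Pair each page with its successor in URL-sorted order: n-1 comparisons."""
    ordered = sorted(urls)
    return list(zip(ordered, ordered[1:]))

urls = [
    "https://example.com/blog/post-2",
    "https://example.com/about",
    "https://example.com/blog/post-1",
]
pairs = consecutive_pairs(urls)
# Sorting groups the /blog/* pages together, so structurally similar
# pages end up being compared against each other.
for left, right in pairs:
    print(left, "<->", right)
```

Because sibling pages under the same URL directory tend to share layout, comparing only these adjacent pairs recovers most shared boilerplate at linear cost.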
deboiler
The deboiler package is built with a simple scikit-learn-like API. The following shows how we can create a dataset from several pages (belonging to the same domain) to identify (apply the fit method) and remove (apply the transform method) boilerplate elements.
```python
from deboiler.dataset import JsonDataset
from deboiler import Deboiler

dataset = JsonDataset("path-to-json-lines-file")

deboiler = Deboiler(
    n_processes=1,  # no of processes
    operation_mode="memory",  # operation mode: `memory` or `performance`
    domain="globality",  # domain name (used for logging only)
)

# call the fit method to identify boilerplate elements
deboiler.fit(dataset)

output_pages = []
# call the transform method to yield cleaned pages
for output_page in deboiler.transform(dataset):
    # do something with the output_page
    output_pages.append(output_page)
```
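JsonDataset reads pages from a JSON-lines file, one record per page. The exact schema is defined by the deboiler package; the "url" and "content" field names below are assumptions for illustration, so consult the library's README for the actual format.

```python
import json

# Hypothetical page records: one JSON object per line, each holding a
# page URL and its raw HTML (field names are assumptions, not the
# documented deboiler schema).
pages = [
    {"url": "https://example.com/", "content": "<html>...</html>"},
    {"url": "https://example.com/about", "content": "<html>...</html>"},
]

with open("pages.jsonl", "w") as f:
    for page in pages:
        f.write(json.dumps(page) + "\n")
```

The resulting path would then be passed to JsonDataset in place of "path-to-json-lines-file" above.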
The deboiler module supports two modes of operation:

- memory: pages are parsed on the fly and not cached, keeping the memory footprint low. This mode supports multi-processing.
- performance: parsed pages are stored during fit, to be reused during transform, resulting in faster processing at the cost of a higher memory footprint. This mode does not support multi-processing.

The following plot compares deboiler performance for different modes of operation and numbers of processes. In this benchmarking, deboiler cleans up pages from ~140 domains with 10-10k pages each. The "performance" mode completes the task faster (38 mins vs. 54 mins) than the "memory" mode with a single process, i.e. (memory, 1). However, the "memory" mode can outperform the "performance" mode if multi-processing is enabled (e.g. 5 or 10 processes in this example).
It is worth noting that the difference between modes of operation and multi-processing becomes more pronounced as the domain size increases.
[1] H. Zhang and J. Wang, “Boilerplate Detection via Semantic Classification of TextBlocks,” arXiv:2203.04467 [cs], Mar. 2022, Accessed: Mar. 24, 2022. [Online]. Available: http://arxiv.org/abs/2203.04467.
[2] L. Yi, B. Liu, and X. Li, “Eliminating noisy information in Web pages for data mining,” in Proceedings of the ninth ACM SIGKDD international conference on Knowledge discovery and data mining, in KDD ’03. New York, NY, USA: Association for Computing Machinery, Aug. 2003, pp. 296–305. doi: 10.1145/956750.956785.
[3] K. Vieira, A. S. da Silva, N. Pinto, E. S. de Moura, J. M. B. Cavalcanti, and J. Freire, “A fast and robust method for web page template detection and removal,” in Proceedings of the 15th ACM international conference on Information and knowledge management – CIKM ’06, Arlington, Virginia, USA: ACM Press, 2006, p. 258. doi: 10.1145/1183614.1183654.
Salman is a dedicated Data Scientist specializing in crafting machine intelligence software with a product-centric focus. Over the last five years, he has played a pivotal role at Globality, developing advanced natural language models for predictive analysis and understanding. These models have been instrumental in enhancing Globality's platform, which is designed to boost productivity and purpose through autonomous sourcing.