Extracting text from web code

Here at Furniture Ferret most of what we do is extract descriptions of items of furniture from web pages. Even when we know where a description sits on the page, getting its text out of the HTML is often not trivial: words and sentences can run into each other unless we do something clever.

This is why we created semantic text, a software library that is a drop-in replacement for BeautifulSoup’s get_text() in Python.

BeautifulSoup's get_text() simply concatenates all the strings found among an element's descendants, with nothing in between. This produces unexpected results when the markup uses so-called block-level HTML elements, which are meant to separate portions of the text semantically.

For example, for the following HTML:

<ul><li><strong>V</strong>ery interesting</li><li>Thing it is</li></ul>

get_text() returns “Very interestingThing it is” instead of the expected “Very interesting Thing it is”, because it disregards the fact that <li> is a block-level element.
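The behaviour is easy to reproduce. The short sketch below assumes the bs4 package is installed and uses Python's built-in "html.parser" backend; it also shows why get_text()'s own separator argument is not a fix, since the separator is placed between every string, including inside a word split by an inline tag.

```python
from bs4 import BeautifulSoup

html = "<ul><li><strong>V</strong>ery interesting</li><li>Thing it is</li></ul>"
soup = BeautifulSoup(html, "html.parser")

# Default: strings are concatenated with nothing between them.
joined = soup.get_text()
print(joined)  # Very interestingThing it is

# With a separator, a space is inserted between every string,
# even where an inline tag such as <strong> splits a single word.
separated = soup.get_text(separator=" ")
print(separated)  # V ery interesting Thing it is
```

Neither call gives the expected string: the first glues the two list items together, the second breaks “Very” in two.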

beautifulsoup_semantic_text.bs_semantic_text() overcomes this problem by adding a space in front of each block-level element. The split between block-level and inline elements is, however, a historical categorisation of HTML elements and is not defined everywhere. Even so, the distinction remains a useful way to approximate in a string how the HTML is expected to be presented.
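To make the idea concrete, here is a rough sketch of the approach described above. It is not the library's actual implementation: BLOCK_TAGS and sketch_semantic_text are hypothetical names made up for this illustration, the tag set is a small, assumed subset of block-level elements, and the final whitespace clean-up is an extra choice of this sketch.

```python
from bs4 import BeautifulSoup, NavigableString, Tag

# Assumption: a small, non-exhaustive set of tags treated as block-level.
BLOCK_TAGS = {
    "blockquote", "div", "dd", "dl", "dt", "h1", "h2", "h3", "h4", "h5", "h6",
    "li", "ol", "p", "pre", "table", "tr", "ul",
}


def sketch_semantic_text(node):
    """Hypothetical helper: like get_text(), but block-level tags add a space."""
    parts = []
    for child in node.descendants:
        if isinstance(child, Tag):
            # A space goes in front of each block-level element.
            if child.name in BLOCK_TAGS:
                parts.append(" ")
        elif type(child) is NavigableString:
            # Plain text only; comments and other special strings are skipped.
            parts.append(str(child))
    # Extra choice for this sketch: collapse runs of whitespace and trim.
    return " ".join("".join(parts).split())


html = "<ul><li><strong>V</strong>ery interesting</li><li>Thing it is</li></ul>"
soup = BeautifulSoup(html, "html.parser")
result = sketch_semantic_text(soup)
print(result)  # Very interesting Thing it is
```

Because inline tags such as <strong> add nothing, “Very” stays in one piece, while the two <li> elements end up separated by a space.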