Remove HTML markup from a String using JavaScript

In this tutorial we show three approaches to removing HTML tags from a string: the first is based on regular expressions, while the second and third rely on the DOM (Document Object Model), either through a temporary element or through the DOMParser.

The DOM-based approach is generally more reliable and robust, especially for more complex HTML structures, because the markup is processed by a real HTML parser that handles edge cases such as attribute values containing > or character entities. However, it builds an actual DOM tree in memory, which can be slower than a regex for very large strings. It also requires a DOM implementation, which is built into browsers but has to be provided by a library on the server.

On the other hand, regex might be suitable for quick and simple tag removal in scenarios where the HTML structure is well-known, straightforward, and not overly complex.

Remove HTML Tags with RegEx

Let's examine two regex patterns for stripping HTML tags from a string in JavaScript. The first matches anything enclosed in angle brackets; the second does the same but also matches an unterminated tag at the end of the string:

Regular Expression: /(<([^>]+)>)/ig

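  • (<([^>]+)>): This is the main part of the regex. It matches a single tag: an opening <, one or more characters that are not >, and the closing >. Opening, closing, and self-closing tags are each matched on their own; the text between them is left untouched.
  • The inner part ([^>]+) matches any sequence of characters that are not >, capturing the tag name and attributes (or the / and tag name of a closing tag). The capture groups are not used by the replacement.
  • ig: These are flags at the end of the regex. i makes the matching case-insensitive (it has no effect here, because the pattern contains no letters), and g means the regex will match all occurrences in the input string.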
const htmlMarkup = '<b>www.DEVLABS.ninja</b> is awesome <img src="test.jpg"/>';
const cleanText = htmlMarkup.replace(/(<([^>]+)>)/ig, '');
// www.DEVLABS.ninja is awesome

Regular Expression: /<\/?[^>]+(>|$)/ig

  • <\/?: This matches an opening < followed by an optional /, so both opening and closing tags are covered.
  • [^>]+: This matches one or more characters that are not >, which corresponds to the tag name and attributes.
  • (>|$): This matches either the closing > of a tag or the end of the string ($), so an unterminated tag at the very end of the input is removed as well.
  • ig: The same flags as in the first example.
const htmlMarkup = '<b>www.DEVLABS.ninja</b> is awesome <img src="test.jpg"/>';
const cleanText = htmlMarkup.replace(/<\/?[^>]+(>|$)/ig, '');
// www.DEVLABS.ninja is awesome
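
For well-formed input both patterns produce the same result. The difference only shows up when the input ends in the middle of a tag. The following is a minimal sketch of that case; the helper names stripTagsA and stripTagsB and the truncated sample string are made up for illustration.

```javascript
// Hypothetical helpers wrapping the two patterns discussed above.
const stripTagsA = (input) => input.replace(/(<([^>]+)>)/ig, '');
const stripTagsB = (input) => input.replace(/<\/?[^>]+(>|$)/ig, '');

// Input that was cut off in the middle of the <img> tag (no closing ">").
const truncated = 'Hello <b>world</b> <img src="test.jpg';

const resultA = stripTagsA(truncated);
const resultB = stripTagsB(truncated);

console.log(JSON.stringify(resultA)); // "Hello world <img src=\"test.jpg"
console.log(JSON.stringify(resultB)); // "Hello world "
```

The first pattern leaves the unterminated tag in place because it never finds a closing >, while the $ alternative in the second pattern lets the match run to the end of the string.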

Both regular expressions have some limitations:

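  • They stop at the first >, so a > inside an attribute value or a comment cuts the match short and leaves fragments of the tag behind.
  • They may behave unexpectedly with non-standard or malformed HTML.
  • They may inadvertently remove content that is not part of an HTML tag (for example, angle brackets used as comparison operators in a sentence).
  • They do not decode HTML entities, so &amp; stays &amp; in the result.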
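
To make these limitations concrete, here is a small sketch that runs the first pattern against a few awkward inputs. The helper name stripTags and the sample strings are assumptions chosen for illustration.

```javascript
// Hypothetical helper using the first pattern from this tutorial.
const stripTags = (input) => input.replace(/(<([^>]+)>)/ig, '');

// 1. A ">" inside an attribute value ends the match too early.
const attrCase = stripTags('<a title="x > y">link</a>');

// 2. Angle brackets that are not HTML at all are treated as a tag.
const mathCase = stripTags('if a < b and b > c');

// 3. Entities are left as-is; the regex does not decode them.
const entityCase = stripTags('Tom &amp; Jerry');

console.log(JSON.stringify(attrCase));   // " y\">link"
console.log(JSON.stringify(mathCase));   // "if a  c"
console.log(JSON.stringify(entityCase)); // "Tom &amp; Jerry"
```

An HTML parser, as used by the DOM-based approaches, processes such inputs according to the HTML parsing rules instead of a simple character scan.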

Remove HTML Tags using DOM

For more reliable and accurate HTML parsing and tag removal, using a dedicated HTML parser or DOM manipulation is recommended:

const html = '<b>www.DEVLABS.ninja</b> is awesome <img src="test.jpg"/>';
const div = document.createElement('div');
div.innerHTML = html;
const cleanText = div.textContent;
// www.DEVLABS.ninja is awesome

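We use the textContent property to extract the text without the HTML markup. The innerText property would give the same result here, but it is considerably more expensive: innerText is aware of styling and returns only the rendered (visible) text of a node, which can force a layout calculation, while textContent simply returns the text of all child nodes.

The DOM manipulation method also has some disadvantages:

  • It can be slower for very large strings due to the overhead of parsing the markup and building a DOM tree.
  • It requires a DOM, which means a browser environment or, on the server, a library that provides one.
  • Assigning untrusted markup to innerHTML is not inert: elements such as <img> start loading, and inline event handlers like onerror can fire even if the div is never attached to the page.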

Remove HTML Tags using the DOMParser

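The last method for removing HTML tags uses the DOMParser, which parses the string into a separate document instead of a temporary element. Its performance and environment trade-offs are the same as for the DOM manipulation method: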

const html = '<b>www.DEVLABS.ninja</b> is awesome <img src="test.jpg"/>';
const doc = new DOMParser().parseFromString(html, 'text/html');
const clean = doc.body.textContent || '';
// www.DEVLABS.ninja is awesome

While DOMParser provides a more reliable solution, it might be slower for very large strings compared to regex. Creating a DOM tree in memory involves more overhead.
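
Because DOMParser is not available in every runtime, one option is a small helper that uses it when present and otherwise falls back to the second regex pattern. This is only a sketch: the name stripHtml is hypothetical, and the fallback inherits all of the regex limitations (for example, entities are not decoded).

```javascript
// Hypothetical helper: prefer DOMParser when the runtime provides one,
// otherwise fall back to the simpler regex approach.
function stripHtml(html) {
  if (typeof DOMParser !== 'undefined') {
    const doc = new DOMParser().parseFromString(html, 'text/html');
    return doc.body.textContent || '';
  }
  // No DOMParser global (e.g. plain Node.js): regex fallback.
  // Note that entities are NOT decoded on this path.
  return html.replace(/<\/?[^>]+(>|$)/g, '');
}

const cleaned = stripHtml('<b>www.DEVLABS.ninja</b> is awesome <img src="test.jpg"/>');
console.log(JSON.stringify(cleaned)); // "www.DEVLABS.ninja is awesome "
```

For this sample input both branches return the same text; they can differ for inputs that contain entities or malformed tags.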

We can also use a DOM-based method on the server side with Node.js and, for example, Express by using the jsdom library, which provides a simulated browser environment for server-side JavaScript.

Here is an example of how you can use jsdom in an Express route handler. jsdom parses the markup itself, so the handler reads textContent from the resulting document:

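const express = require('express');
const { JSDOM } = require('jsdom');

const app = express();
const port = 3000;

app.get('/', (req, res) => {
  // Query values can be arrays or objects; only accept a plain string.
  const html = typeof req.query.html === 'string' ? req.query.html : '';

  // jsdom parses the markup into a document; by default it does not run scripts.
  const dom = new JSDOM(html);
  const clean = dom.window.document.body.textContent || '';

  // Send as plain text: the extracted text may contain decoded "<" or ">"
  // characters, which the browser must not interpret as HTML.
  res.type('text/plain').send(`Cleaned text: ${clean}`);
});

app.listen(port, () => {
  console.log(`Server is running on port ${port}`);
});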