Remove HTML markup from a String using JavaScript

Learn how to remove HTML tags from a string with JavaScript. We will look at the different approaches available (regex, DOM methods, and the DOMParser), all with practical examples using vanilla JS.

In this tutorial we show three approaches to removing HTML tags from a string: the first is based on regular expressions, the second relies on DOM (Document Object Model) methods, and the third uses the DOMParser.

The DOM manipulation approach is generally more reliable and robust, especially for complex HTML structures, because the markup is processed by a real HTML parser that handles edge cases and nested tags. However, it builds an actual DOM tree in memory, which can be slower than regex for very large strings. It also requires a DOM implementation: browsers provide one natively, while plain server-side JavaScript such as Node.js needs a library like jsdom.

On the other hand, regex might be suitable for quick and simple tag removal in scenarios where the HTML structure is well-known, straightforward, and not overly complex.

Remove HTML Tags with RegEx

Let's examine two regex patterns to strip HTML tags from a string with JavaScript. The first matches everything between an opening and a closing angle bracket. The second also allows for the slash of a closing tag and for a tag that is cut off at the end of the string:

Regular Expression: /(<([^>]+)>)/ig

const htmlMarkup = '<b>www.DEVLABS.ninja</b> is awesome <img src="test.jpg"/>';
const cleanText = htmlMarkup.replace(/(<([^>]+)>)/ig, '');
// www.DEVLABS.ninja is awesome

Regular Expression: /<\/?[^>]+(>|$)/ig

const htmlMarkup = '<b>www.DEVLABS.ninja</b> is awesome <img src="test.jpg"/>';
const cleanText = htmlMarkup.replace(/<\/?[^>]+(>|$)/ig, '');
// www.DEVLABS.ninja is awesome
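
For the sample string both patterns give the same result. The difference only shows when a tag is cut off at the end of the input. The following is a minimal sketch of that case; the variable names and the truncated input are made up for illustration:

```javascript
// The two patterns from the examples above.
const anyTagPattern = /(<([^>]+)>)/ig;
const tagOrCutOffPattern = /<\/?[^>]+(>|$)/ig;

// Hypothetical input whose last tag lost its closing bracket.
const truncated = '<p>Hello</p> <b';

const withFirst = truncated.replace(anyTagPattern, '');
const withSecond = truncated.replace(tagOrCutOffPattern, '');

console.log(JSON.stringify(withFirst));  // "Hello <b"
console.log(JSON.stringify(withSecond)); // "Hello "
```

The `$` alternative in the second pattern is what removes the unfinished `<b`. The same alternative also means that a lone `<` in ordinary text (as in `1 < 2`) swallows everything up to the end of the string.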

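Both regular expressions have some limitations: they stop at the first > they find, so a > inside an attribute value breaks the match; they leave HTML entities such as &amp; undecoded; and they keep the text inside script and style elements.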
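
Some typical failure cases can be reproduced with a short sketch. The helper name `stripTagsWithRegex` and the input strings are hypothetical and only serve as illustration:

```javascript
// Hypothetical helper wrapping the first pattern from this tutorial.
function stripTagsWithRegex(html) {
  return html.replace(/(<([^>]+)>)/ig, '');
}

// 1. A ">" inside an attribute value ends the match too early.
const brokenAttribute = stripTagsWithRegex('<a title="a > b">link</a>');
console.log(JSON.stringify(brokenAttribute)); // " b\">link"

// 2. HTML entities are not decoded, only the tags are removed.
const keptEntity = stripTagsWithRegex('<p>Fish &amp; Chips</p>');
console.log(keptEntity); // Fish &amp; Chips

// 3. The text inside a script element is kept as if it were content.
const keptScript = stripTagsWithRegex('<script>alert(1)</script>Hello');
console.log(keptScript); // alert(1)Hello
```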

Remove HTML Tags using DOM

For more reliable and accurate HTML parsing and tag removal, using a dedicated HTML parser or DOM manipulation is recommended:

const html = '<b>www.DEVLABS.ninja</b> is awesome <img src="test.jpg"/>';
const div = document.createElement('div');
div.innerHTML = html;
const cleanText = div.textContent;
// www.DEVLABS.ninja is awesome

We use the textContent property to extract the text without the HTML markup. The innerText property would give us the same result here, but it is considerably more expensive: innerText is aware of styling and returns only the rendered (visible) text, which can require a layout calculation, while textContent simply returns the text of all child nodes.

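Using the DOM manipulation method also has some disadvantages: it needs a DOM (a browser, or a library such as jsdom on the server), it builds a full DOM tree in memory, and assigning untrusted markup to innerHTML can still trigger resource loading and inline event handlers (for example onerror on an img element), even though the element is never attached to the page.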

Remove HTML Tags using the DOMParser

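The last method for removing HTML tags involves the DOMParser. Its trade-offs are similar to those of the DOM manipulation method, since a DOM tree is built in memory as well, but the markup is parsed into a separate document in which scripts are not executed: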

const html = '<b>www.DEVLABS.ninja</b> is awesome <img src="test.jpg"/>';
const doc = new DOMParser().parseFromString(html, 'text/html');
const clean = doc.body.textContent || '';
// www.DEVLABS.ninja is awesome

While DOMParser provides a more reliable solution, it might be slower for very large strings compared to regex. Creating a DOM tree in memory involves more overhead.
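
Since DOMParser only exists where the environment provides it, code that runs in several environments can check for it first. The following is a minimal sketch, not a definitive implementation: the helper name `stripHtml` is hypothetical, and the fallback is simply the second regex from this tutorial with all of its limitations.

```javascript
// Hypothetical helper: prefers DOMParser and falls back to a regex.
function stripHtml(html) {
  if (typeof DOMParser !== 'undefined') {
    // Browser (or any environment that exposes a DOMParser global).
    const doc = new DOMParser().parseFromString(html, 'text/html');
    return doc.body.textContent || '';
  }
  // No DOMParser global (e.g. plain Node.js): use the less robust regex.
  return html.replace(/<\/?[^>]+(>|$)/ig, '');
}

const cleaned = stripHtml('<b>www.DEVLABS.ninja</b> is awesome <img src="test.jpg"/>');
console.log(cleaned.trim()); // www.DEVLABS.ninja is awesome
```

The two branches are not equivalent for every input: the DOMParser branch decodes entities and handles malformed markup, while the regex branch does not.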

We can also use this approach on the server side with Node.js (for example in an Express application) by using the jsdom library, which provides a simulated browser environment for server-side JavaScript.

Here is an example of how you can use jsdom in an Express route handler:

const express = require('express');
const { JSDOM } = require('jsdom');

const app = express();
const port = 3000;

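app.get('/', (req, res) => {
  // Query values can be arrays or objects, so coerce to a string first
  const html = String(req.query.html || '');
  // Parse the markup with jsdom and read the text of the resulting body
  const dom = new JSDOM(html);
  const clean = dom.window.document.body.textContent || '';

  // Send as plain text: textContent decodes entities, so "<" may reappear
  res.type('text/plain').send(`Cleaned text: ${clean}`);
});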

app.listen(port, () => {
  console.log(`Server is running on port ${port}`);
});
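
Keep in mind that textContent decodes entities, so input such as &lt;script&gt; comes back as a literal <script> in the cleaned text. When that text is embedded in HTML output, one option is to escape it first. The following is a minimal sketch; the helper name `escapeHtml` is hypothetical and not part of Express or jsdom:

```javascript
// Hypothetical helper: escapes the characters that are special in HTML.
function escapeHtml(text) {
  return text
    .replace(/&/g, '&amp;') // must run first so later entities stay intact
    .replace(/</g, '&lt;')
    .replace(/>/g, '&gt;')
    .replace(/"/g, '&quot;')
    .replace(/'/g, '&#39;');
}

// What textContent yields for the input "&lt;script&gt;alert(1)&lt;/script&gt;".
const decodedText = '<script>alert(1)</script>';

const safeForHtml = escapeHtml(decodedText);
console.log(safeForHtml); // &lt;script&gt;alert(1)&lt;/script&gt;
```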