Remove links from Content field

I am struggling to remove HTML links from the Content field.

I would like to remove the links because I use <Format length="1000"> <Field content /> </Format>, which sometimes cuts a link off in the middle and leaves HTML errors on the page. Because length counts characters and does not respect whole words or tags, the truncated text may end like this: ...... and you can check out this article at <a href="h
This leaves the <a> tag unclosed and causes other undesired consequences on the page.
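
To make the problem concrete, here is a minimal Python sketch (not L&L) of what character-based truncation does to a link. The sample text, the example.com URL, and both helper names are hypothetical and only serve the illustration.

```python
# Hypothetical sample content; the URL is a placeholder, not from this thread.
content = 'You can check out this article at <a href="https://example.com/post">this link</a>.'

def truncate_chars(text: str, length: int) -> str:
    """Keep the first `length` characters, ignoring word and tag boundaries."""
    return text[:length]

def has_unclosed_tag(html: str) -> bool:
    """Return True if the last '<' has no '>' after it."""
    return html.rfind("<") > html.rfind(">")

cut = truncate_chars(content, 44)
print(cut)                    # You can check out this article at <a href="h
print(has_unclosed_tag(cut))  # True
```

The cut lands inside the opening <a> tag, so whatever follows on the page is parsed as part of that tag.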

I tried replace and replace_pattern but ran into a couple of problems, mainly that I cannot handle double quotes inside the expression.

For example, in replace="="" the second double quote is treated as the end of the attribute…

Therefore, I decided to use remove_html.

The issue now is that remove_html also removes the paragraph tags <p> </p>, leaving the text completely unformatted.

Is there a way to use remove_html but keep the paragraphs?

My template code currently is:

<Format remove_html> <Format length="1000"> <Field content /> </Format> </Format>

Thanks for any suggestions!

Try <Format words=150>. This was released recently but isn’t yet documented. Might help you with this.

I can’t imagine an option for remove_html to keep specific tags being implemented in the near future, since there would be too many unique situations to account for. So if you want more granular control, Format replace_pattern is probably your best bet as far as flexibility is concerned. That being said, the regex rules required to swap out different HTML tags might get pretty complicated.

This may not be the most performant approach, but an idea I just had would be to use Format replace twice. That way you can convert those specific tags to something that isn’t HTML, then remove the rest of the HTML tags, then convert the placeholders back. I haven’t tried this out, but here’s my general idea:

<Format replace="open_paragraph" with="<p>" replace_2="close_paragraph" with_2="</p>">
  <Format remove_html>
    <Format replace="<p>" with="open_paragraph" replace_2="</p>" with_2="close_paragraph">
      <Field content />
    </Format>
  </Format>
</Format>
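
For anyone who wants to see the placeholder idea outside of L&L, here is a rough Python sketch of the same three steps. It is only an illustration under assumptions: the function name is made up, the tag stripping uses a simple regex rather than a real HTML parser, and it only recognizes bare <p> and </p> tags (a <p class="..."> would be stripped like any other tag).

```python
import re

# Plain-text placeholders, same names as in the template above.
OPEN, CLOSE = "open_paragraph", "close_paragraph"

def strip_html_keep_paragraphs(html: str) -> str:
    # 1. Swap paragraph tags for placeholders that are not HTML.
    text = html.replace("<p>", OPEN).replace("</p>", CLOSE)
    # 2. Remove every remaining tag (rough regex, not a full parser).
    text = re.sub(r"<[^>]*>", "", text)
    # 3. Turn the placeholders back into paragraph tags.
    return text.replace(OPEN, "<p>").replace(CLOSE, "</p>")

# Hypothetical sample content for the demonstration.
sample = '<p>Read <a href="https://example.com">this article</a> first.</p><p>Then <em>this</em>.</p>'
print(strip_html_keep_paragraphs(sample))
# <p>Read this article first.</p><p>Then this.</p>
```

The one thing to watch is that the placeholder strings must never occur in the real content, otherwise step 3 would insert paragraph tags where there were none.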

Let me know if that works for you.

In both L&L and HTML, the " and ' symbols are interchangeable as attribute delimiters, as explained by the W3C:

By default, SGML requires that all attribute values be delimited using either double quotation marks (ASCII decimal 34) or single quotation marks (ASCII decimal 39). Single quote marks can be included within the attribute value when the value is delimited by double quote marks, and vice versa.

So you could simply write replace='="' and it’ll work.

Hopefully this gives you some pointers toward a solution that works for you! Let me know how you end up implementing it. I know that in some cases browsers/parsers automatically close HTML tags when something is missing. I’m not sure exactly how that works, but I imagine that truncating by whole words instead of individual characters should make the problem less likely.
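
As a small illustration of why word-based truncation is gentler than character-based truncation, here is a Python sketch. It assumes a plain whitespace split; I have not verified that this is how the words attribute behaves in L&L, and the function name is hypothetical.

```python
def truncate_words(text: str, count: int) -> str:
    """Keep the first `count` whitespace-separated words."""
    return " ".join(text.split()[:count])

text = "and you can check out this article at our site"

print(text[:13])                # and you can c
print(truncate_words(text, 4))  # and you can check
```

The character cut stops in the middle of "check", while the word cut always ends on a word boundary.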


Thank you so much, @benjamin !

<Format words=150> seems to also trim the HTML so I combined it with your suggestion to use Format replace twice and the final result is what I actually wanted!

My final code is:

<Format replace="open_paragraph" with="<p>" replace_2="close_paragraph" with_2="</p>">
  <Format remove_html>
    <Format words=150>
      <Format replace="<p>" with="open_paragraph" replace_2="</p>" with_2="close_paragraph">
        <Field content />
      </Format>
    </Format>
  </Format>
</Format>