Hey guys!
How are you? I hope so, huh!

In this post, I will demonstrate to you how to convert an HTML string to text (Remove HTML tags) using the CLR (C#). If you are new to the blog or have never heard of CLR or don't know how to create your first project using this powerful SQL Server tool, which allows you to create codes in C# or VB.NET and run them through the database, give a list in the post Introduction to SQL CLR (Common Language Runtime) in SQL Server.

In 2014, I made the post Removing HTML tags from a string in SQL Server and therefore, you must be asking yourself: “Dirceu, if you have already made a post about this, why make another similar one using the CLR?” and the answer to that is very simple: Much simpler code and PERFORMANCE! As I already explained in the post SQL Server – Performance comparison between Scalar Function and CLR Scalar Function, CLR functions generally deliver MUCH better performance than UDF T-SQL functions, being up to 200 times faster.

Code used to create records:

IF (OBJECT_ID('dirceuresende.dbo.Teste_HTML') IS NOT NULL) DROP TABLE dirceuresende.dbo.Teste_HTML
CREATE TABLE dirceuresende.dbo.Teste_HTML (
    texto VARCHAR(MAX)
)
GO

INSERT INTO dirceuresende.dbo.Teste_HTML
SELECT '<!DOCTYPE html>
<html>
<body>

<h1>My First Heading</h1>

<p>My first paragraph.</p>

</body>
</html>' AS texto
FROM sys.objects
GO 40

To demonstrate the difference in performance of the two functions, I populated a table with just 6,000 records, all with the same string containing basic HTML code. And the greater the volume of data, the greater the difference in performance between the CLR and UDF T-SQL functions. You can see the result below:

Comparison of CLR function with T-SQL function: 173x faster

Function source code:

To remove HTML tags, I use the HtmlDecode method of the WebUtility class, which belongs to the System.Net library. This method is only available from .NET Framework 4.0 onwards and therefore, it is only possible to use it on SQL Server 2012 or higher (versions 2005 and 2008 of SQL Server use .NET Framework 3.5)

The Fl_Quebra_Linha parameter is used to replace the
tag (and its variants) with a line break in the text. If you enter the value 0 (false) in this parameter, the line breaks will be replaced by an empty string.

using System;
using System.Net;
using System.Text.RegularExpressions;

public partial class UserDefinedFunctions
{
    [Microsoft.SqlServer.Server.SqlFunction]
    public static string fncRemove_Html_String(string Ds_String_HTML, bool Fl_Quebra_Linha)
    {

        if (string.IsNullOrEmpty(Ds_String_HTML))
            return null;


        var html = Ds_String_HTML;
        html = WebUtility.HtmlDecode(html);

        if (Fl_Quebra_Linha)
        {
            html = Regex.Replace(html, "<br/>", Environment.NewLine, RegexOptions.IgnoreCase);
            html = Regex.Replace(html, "<br />", Environment.NewLine, RegexOptions.IgnoreCase);
            html = Regex.Replace(html, "<br>", Environment.NewLine, RegexOptions.IgnoreCase);
        }

        html = Regex.Replace(html, " ", " ", RegexOptions.IgnoreCase);
        html = Regex.Replace(html, "<.*?>", string.Empty);

        return html;
    }
}

That's it, folks!
I hope you enjoyed the post and see you next time.

sql server convert convert string text html remove remove html tags to plain text

sql server convert convert string text html remove remove html tags to plain text