Html'yi düz metne nasıl dönüştürürsünüz?

Question 1

Bir tabloda depolanan Html pasajlarım var. Sayfaların tamamı, hiçbir etiket veya benzeri değil, sadece temel biçimlendirme.

Bu Html'yi belirli bir sayfada yalnızca metin olarak, biçimlendirme olmadan görüntüleyebilmek istiyorum (aslında yalnızca ilk 30-50 karakter ama bu kolay bit).

Bu Html içindeki "metni" düz metin olarak bir dizeye nasıl yerleştiririm?

Yani bu kod parçası.

<b>Hello World.</b><br/><p><i>Is there anyone out there?</i><p>

Oluyor:

Selam Dünya. Orada kimse var mı?

Question 2

Etiket sıyırma hakkında konuşuyorsanız, <script>etiketler gibi şeyler için endişelenmenize gerek yoksa, nispeten basittir . Yapmanız gereken tek şey metni etiketler olmadan görüntülemekse, bunu normal bir ifadeyle gerçekleştirebilirsiniz:

<[^>]*>

<script>Etiketler ve benzerleri hakkında endişelenmeniz gerekiyorsa, o zaman normal ifadelerden biraz daha güçlü bir şeye ihtiyacınız olacak çünkü durumu izlemeniz gerekiyor, daha çok Bağlamdan Bağımsız Dilbilgisi (CFG) gibi bir şey. "Soldan Sağa" veya açgözlü olmayan eşleştirme ile başarabileceğinizi düşündüm.

Normal ifadeleri kullanabiliyorsanız, orada iyi bilgiler içeren birçok web sayfası vardır:

Bir CFG'nin daha karmaşık davranışına ihtiyacınız varsa, üçüncü taraf bir araç kullanmanızı öneririm, ne yazık ki tavsiye edecek iyi bir araç bilmiyorum.

Question 3

Ücretsiz ve açık kaynaklı HtmlAgilityPack , örneklerinden birinde HTML'den düz metne dönüştüren bir yönteme sahiptir.

var plainText = HtmlUtilities.ConvertToPlainText(string html);

Şunun gibi bir HTML dizesi besle:

<b>hello, <i>world!</i></b>

Ve aşağıdaki gibi bir düz metin sonucu alacaksınız:

hello world!

Question 4

HtmlAgilityPack'i kullanamadım, bu yüzden kendime ikinci bir en iyi çözümü yazdım

private static string HtmlToPlainText(string html)
{
    const string tagWhiteSpace = @"(>|$)(\W|\n|\r)+<";//matches one or more (white space or line breaks) between '>' and '<'
    const string stripFormatting = @"<[^>]*(>|$)";//match any character between '<' and '>', even when end tag is missing
    const string lineBreak = @"<(br|BR)\s{0,1}\/{0,1}>";//matches: <br>,<br/>,<br />,<BR>,<BR/>,<BR />
    var lineBreakRegex = new Regex(lineBreak, RegexOptions.Multiline);
    var stripFormattingRegex = new Regex(stripFormatting, RegexOptions.Multiline);
    var tagWhiteSpaceRegex = new Regex(tagWhiteSpace, RegexOptions.Multiline);

    var text = html;
    //Decode html specific characters
    text = System.Net.WebUtility.HtmlDecode(text); 
    //Remove tag whitespace/line breaks
    text = tagWhiteSpaceRegex.Replace(text, "><");
    //Replace <br /> with line breaks
    text = lineBreakRegex.Replace(text, Environment.NewLine);
    //Strip formatting
    text = stripFormattingRegex.Replace(text, string.Empty);

    return text;
}

Question 5

HTTPUtility.HTMLEncode()kodlama HTML etiketlerini dizeler olarak ele almak içindir. Sizin için tüm ağır işleri halleder. Gönderen MSDN Belgeler :

Bir HTTP akışında boşluklar ve noktalama işaretleri gibi karakterler iletilirse, alıcı uçta yanlış yorumlanabilirler. HTML kodlaması, HTML'de izin verilmeyen karakterleri karakter varlık eşdeğerlerine dönüştürür; HTML kod çözme, kodlamayı tersine çevirir. Metin bloğun içine gömülü olan, örneğin, karakterler <ve >yanı kodlanır <ve >HTTP iletim için.

HTTPUtility.HTMLEncode()yöntem, burada ayrıntılı olarak :

public static void HtmlEncode(
  string s,
  TextWriter output
)

Kullanım:

String TestString = "This is a <Test String>.";
StringWriter writer = new StringWriter();
Server.HtmlEncode(TestString, writer);
String EncodedString = writer.ToString();

Question 6

Vfilby'nin yanıtına eklemek için, kodunuzun içinde bir RegEx değiştirme işlemi gerçekleştirebilirsiniz; yeni sınıflara gerek yoktur. Benim gibi başka yeni başlayanlar da bu soruyu yanıtlarsa.

using System.Text.RegularExpressions;

Sonra...

private string StripHtml(string source)
{
        string output;

        //get rid of HTML tags
        output = Regex.Replace(source, "<[^>]*>", string.Empty);

        //get rid of multiple blank lines
        output = Regex.Replace(output, @"^\s*$\n", string.Empty, RegexOptions.Multiline);

        return output;
}

Question 7

HTML'yi Düz Metne dönüştürmek için Üç Adımlı İşlem

Önce HtmlAgilityPack İçin Nuget Paketini Kurmanız Gerekiyor İkinci Bu Sınıfı Oluşturun

public class HtmlToText
{
    public HtmlToText()
    {
    }

    public string Convert(string path)
    {
        HtmlDocument doc = new HtmlDocument();
        doc.Load(path);

        StringWriter sw = new StringWriter();
        ConvertTo(doc.DocumentNode, sw);
        sw.Flush();
        return sw.ToString();
    }

    public string ConvertHtml(string html)
    {
        HtmlDocument doc = new HtmlDocument();
        doc.LoadHtml(html);

        StringWriter sw = new StringWriter();
        ConvertTo(doc.DocumentNode, sw);
        sw.Flush();
        return sw.ToString();
    }

    private void ConvertContentTo(HtmlNode node, TextWriter outText)
    {
        foreach(HtmlNode subnode in node.ChildNodes)
        {
            ConvertTo(subnode, outText);
        }
    }

    public void ConvertTo(HtmlNode node, TextWriter outText)
    {
        string html;
        switch(node.NodeType)
        {
            case HtmlNodeType.Comment:
                // don't output comments
                break;

            case HtmlNodeType.Document:
                ConvertContentTo(node, outText);
                break;

            case HtmlNodeType.Text:
                // script and style must not be output
                string parentName = node.ParentNode.Name;
                if ((parentName == "script") || (parentName == "style"))
                    break;

                // get text
                html = ((HtmlTextNode)node).Text;

                // is it in fact a special closing node output as text?
                if (HtmlNode.IsOverlappedClosingElement(html))
                    break;

                // check the text is meaningful and not a bunch of whitespaces
                if (html.Trim().Length > 0)
                {
                    outText.Write(HtmlEntity.DeEntitize(html));
                }
                break;

            case HtmlNodeType.Element:
                switch(node.Name)
                {
                    case "p":
                        // treat paragraphs as crlf
                        outText.Write("\r\n");
                        break;
                }

                if (node.HasChildNodes)
                {
                    ConvertContentTo(node, outText);
                }
                break;
        }
    }
}

Judah Himango'nun cevabına referansla yukarıdaki sınıfı kullanarak

Üçüncü olarak, yukarıdaki sınıfın Nesnesini oluşturmanız ve ConvertHtml(HTMLContent)HTML'yi Düz Metne dönüştürmek için Yöntem Kullanmanız gerekir.ConvertToPlainText(string html);

HtmlToText htt=new HtmlToText();
var plainText = htt.ConvertHtml(HTMLContent);

Question 8

Uzun satır içi boşlukları daraltmama sınırlaması vardır, ancak kesinlikle taşınabilirdir ve web tarayıcısı gibi düzene saygı duyar.

static string HtmlToPlainText(string html) {
  string buf;
  string block = "address|article|aside|blockquote|canvas|dd|div|dl|dt|" +
    "fieldset|figcaption|figure|footer|form|h\\d|header|hr|li|main|nav|" +
    "noscript|ol|output|p|pre|section|table|tfoot|ul|video";

  string patNestedBlock = $"(\\s*?</?({block})[^>]*?>)+\\s*";
  buf = Regex.Replace(html, patNestedBlock, "\n", RegexOptions.IgnoreCase);

  // Replace br tag to newline.
  buf = Regex.Replace(buf, @"<(br)[^>]*>", "\n", RegexOptions.IgnoreCase);

  // (Optional) remove styles and scripts.
  buf = Regex.Replace(buf, @"<(script|style)[^>]*?>.*?</\1>", "", RegexOptions.Singleline);

  // Remove all tags.
  buf = Regex.Replace(buf, @"<[^>]*(>|$)", "", RegexOptions.Multiline);

  // Replace HTML entities.
  buf = WebUtility.HtmlDecode(buf);
  return buf;
}

Question 9

Bence en kolay yol, bir 'string' genişletme yöntemi yapmaktır (Richard'ın önerdiği kullanıcıya göre):

using System;
using System.Text.RegularExpressions;

public static class StringHelpers
{
    public static string StripHTML(this string HTMLText)
        {
            var reg = new Regex("<[^>]+>", RegexOptions.IgnoreCase);
            return reg.Replace(HTMLText, "");
        }
}

Ardından, programınızdaki herhangi bir 'dizge' değişkeninde bu uzantı yöntemini kullanın:

var yourHtmlString = "<div class=\"someclass\"><h2>yourHtmlText</h2></span>";
var yourTextString = yourHtmlString.StripHTML();

Bu uzantı yöntemini html formatlı yorumları düz metne dönüştürmek için kullanıyorum, böylece bir kristal raporda doğru bir şekilde görüntülenecek ve mükemmel çalışıyor!

Question 10

HtmlAgilityPack'te 'ConvertToPlainText' adında bir yöntem yoktur, ancak bir html dizesini CLEAR dizesine şu şekilde dönüştürebilirsiniz:

HtmlDocument doc = new HtmlDocument();
doc.LoadHtml(htmlString);
var textString = doc.DocumentNode.InnerText;
Regex.Replace(textString , @"<(.|n)*?>", string.Empty).Replace("&nbsp", "");

Bu benim için çalışıyor. ANCAK 'HtmlAgilityPack' İÇİNDE 'ConvertToPlainText' ADI İLE BİR YÖNTEM BULAMAM.

Question 11

HTML etiketleri olan verileriniz varsa ve bir kişinin etiketleri görebilmesi için bu verileri görüntülemek istiyorsanız, HttpServerUtility :: HtmlEncode kullanın.

İçinde HTML etiketleri olan verileriniz varsa ve kullanıcının oluşturulan etiketleri görmesini istiyorsanız, metni olduğu gibi görüntüleyin. Metin bir web sayfasının tamamını temsil ediyorsa, bunun için bir IFRAME kullanın.

HTML etiketleri olan verileriniz varsa ve etiketleri çıkarmak ve yalnızca formatlanmamış metni görüntülemek istiyorsanız, normal bir ifade kullanın.

Question 12

Bulduğum en basit yol:

HtmlFilter.ConvertToPlainText(html);

HtmlFilter sınıfı, Microsoft.TeamFoundation.WorkItemTracking.Controls.dll'de bulunur.

DLL şu klasörde bulunabilir:% ProgramFiles% \ Common Files \ microsoft shared \ Team Foundation Server \ 14.0 \

VS 2015'te dll, aynı klasörde bulunan Microsoft.TeamFoundation.WorkItemTracking.Common.dll dosyasına da başvurmayı gerektirir.

Question 13

Benzer bir sorunla karşılaştım ve en iyi çözümü buldum. Aşağıdaki kod benim için mükemmel çalışıyor.

  private string ConvertHtml_Totext(string source)
    {
     try
      {
      string result;

    // Remove HTML Development formatting
    // Replace line breaks with space
    // because browsers inserts space
    result = source.Replace("\r", " ");
    // Replace line breaks with space
    // because browsers inserts space
    result = result.Replace("\n", " ");
    // Remove step-formatting
    result = result.Replace("\t", string.Empty);
    // Remove repeating spaces because browsers ignore them
    result = System.Text.RegularExpressions.Regex.Replace(result,
                                                          @"( )+", " ");

    // Remove the header (prepare first by clearing attributes)
    result = System.Text.RegularExpressions.Regex.Replace(result,
             @"<( )*head([^>])*>","<head>",
             System.Text.RegularExpressions.RegexOptions.IgnoreCase);
    result = System.Text.RegularExpressions.Regex.Replace(result,
             @"(<( )*(/)( )*head( )*>)","</head>",
             System.Text.RegularExpressions.RegexOptions.IgnoreCase);
    result = System.Text.RegularExpressions.Regex.Replace(result,
             "(<head>).*(</head>)",string.Empty,
             System.Text.RegularExpressions.RegexOptions.IgnoreCase);

    // remove all scripts (prepare first by clearing attributes)
    result = System.Text.RegularExpressions.Regex.Replace(result,
             @"<( )*script([^>])*>","<script>",
             System.Text.RegularExpressions.RegexOptions.IgnoreCase);
    result = System.Text.RegularExpressions.Regex.Replace(result,
             @"(<( )*(/)( )*script( )*>)","</script>",
             System.Text.RegularExpressions.RegexOptions.IgnoreCase);
    //result = System.Text.RegularExpressions.Regex.Replace(result,
    //         @"(<script>)([^(<script>\.</script>)])*(</script>)",
    //         string.Empty,
    //         System.Text.RegularExpressions.RegexOptions.IgnoreCase);
    result = System.Text.RegularExpressions.Regex.Replace(result,
             @"(<script>).*(</script>)",string.Empty,
             System.Text.RegularExpressions.RegexOptions.IgnoreCase);

    // remove all styles (prepare first by clearing attributes)
    result = System.Text.RegularExpressions.Regex.Replace(result,
             @"<( )*style([^>])*>","<style>",
             System.Text.RegularExpressions.RegexOptions.IgnoreCase);
    result = System.Text.RegularExpressions.Regex.Replace(result,
             @"(<( )*(/)( )*style( )*>)","</style>",
             System.Text.RegularExpressions.RegexOptions.IgnoreCase);
    result = System.Text.RegularExpressions.Regex.Replace(result,
             "(<style>).*(</style>)",string.Empty,
             System.Text.RegularExpressions.RegexOptions.IgnoreCase);

    // insert tabs in spaces of <td> tags
    result = System.Text.RegularExpressions.Regex.Replace(result,
             @"<( )*td([^>])*>","\t",
             System.Text.RegularExpressions.RegexOptions.IgnoreCase);

    // insert line breaks in places of <BR> and <LI> tags
    result = System.Text.RegularExpressions.Regex.Replace(result,
             @"<( )*br( )*>","\r",
             System.Text.RegularExpressions.RegexOptions.IgnoreCase);
    result = System.Text.RegularExpressions.Regex.Replace(result,
             @"<( )*li( )*>","\r",
             System.Text.RegularExpressions.RegexOptions.IgnoreCase);

    // insert line paragraphs (double line breaks) in place
    // if <P>, <DIV> and <TR> tags
    result = System.Text.RegularExpressions.Regex.Replace(result,
             @"<( )*div([^>])*>","\r\r",
             System.Text.RegularExpressions.RegexOptions.IgnoreCase);
    result = System.Text.RegularExpressions.Regex.Replace(result,
             @"<( )*tr([^>])*>","\r\r",
             System.Text.RegularExpressions.RegexOptions.IgnoreCase);
    result = System.Text.RegularExpressions.Regex.Replace(result,
             @"<( )*p([^>])*>","\r\r",
             System.Text.RegularExpressions.RegexOptions.IgnoreCase);

    // Remove remaining tags like <a>, links, images,
    // comments etc - anything that's enclosed inside < >
    result = System.Text.RegularExpressions.Regex.Replace(result,
             @"<[^>]*>",string.Empty,
             System.Text.RegularExpressions.RegexOptions.IgnoreCase);

    // replace special characters:
    result = System.Text.RegularExpressions.Regex.Replace(result,
             @" "," ",
             System.Text.RegularExpressions.RegexOptions.IgnoreCase);

    result = System.Text.RegularExpressions.Regex.Replace(result,
             @"&bull;"," * ",
             System.Text.RegularExpressions.RegexOptions.IgnoreCase);
    result = System.Text.RegularExpressions.Regex.Replace(result,
             @"&lsaquo;","<",
             System.Text.RegularExpressions.RegexOptions.IgnoreCase);
    result = System.Text.RegularExpressions.Regex.Replace(result,
             @"&rsaquo;",">",
             System.Text.RegularExpressions.RegexOptions.IgnoreCase);
    result = System.Text.RegularExpressions.Regex.Replace(result,
             @"&trade;","(tm)",
             System.Text.RegularExpressions.RegexOptions.IgnoreCase);
    result = System.Text.RegularExpressions.Regex.Replace(result,
             @"&frasl;","/",
             System.Text.RegularExpressions.RegexOptions.IgnoreCase);
    result = System.Text.RegularExpressions.Regex.Replace(result,
             @"&lt;","<",
             System.Text.RegularExpressions.RegexOptions.IgnoreCase);
    result = System.Text.RegularExpressions.Regex.Replace(result,
             @"&gt;",">",
             System.Text.RegularExpressions.RegexOptions.IgnoreCase);
    result = System.Text.RegularExpressions.Regex.Replace(result,
             @"&copy;","(c)",
             System.Text.RegularExpressions.RegexOptions.IgnoreCase);
    result = System.Text.RegularExpressions.Regex.Replace(result,
             @"&reg;","(r)",
             System.Text.RegularExpressions.RegexOptions.IgnoreCase);
    // Remove all others. More can be added, see
    // http://hotwired.lycos.com/webmonkey/reference/special_characters/
    result = System.Text.RegularExpressions.Regex.Replace(result,
             @"&(.{2,6});", string.Empty,
             System.Text.RegularExpressions.RegexOptions.IgnoreCase);

    // for testing
    //System.Text.RegularExpressions.Regex.Replace(result,
    //       this.txtRegex.Text,string.Empty,
    //       System.Text.RegularExpressions.RegexOptions.IgnoreCase);

    // make line breaking consistent
    result = result.Replace("\n", "\r");

    // Remove extra line breaks and tabs:
    // replace over 2 breaks with 2 and over 4 tabs with 4.
    // Prepare first to remove any whitespaces in between
    // the escaped characters and remove redundant tabs in between line breaks
    result = System.Text.RegularExpressions.Regex.Replace(result,
             "(\r)( )+(\r)","\r\r",
             System.Text.RegularExpressions.RegexOptions.IgnoreCase);
    result = System.Text.RegularExpressions.Regex.Replace(result,
             "(\t)( )+(\t)","\t\t",
             System.Text.RegularExpressions.RegexOptions.IgnoreCase);
    result = System.Text.RegularExpressions.Regex.Replace(result,
             "(\t)( )+(\r)","\t\r",
             System.Text.RegularExpressions.RegexOptions.IgnoreCase);
    result = System.Text.RegularExpressions.Regex.Replace(result,
             "(\r)( )+(\t)","\r\t",
             System.Text.RegularExpressions.RegexOptions.IgnoreCase);
    // Remove redundant tabs
    result = System.Text.RegularExpressions.Regex.Replace(result,
             "(\r)(\t)+(\r)","\r\r",
             System.Text.RegularExpressions.RegexOptions.IgnoreCase);
    // Remove multiple tabs following a line break with just one tab
    result = System.Text.RegularExpressions.Regex.Replace(result,
             "(\r)(\t)+","\r\t",
             System.Text.RegularExpressions.RegexOptions.IgnoreCase);
    // Initial replacement target string for line breaks
    string breaks = "\r\r\r";
    // Initial replacement target string for tabs
    string tabs = "\t\t\t\t\t";
    for (int index=0; index<result.Length; index++)
    {
        result = result.Replace(breaks, "\r\r");
        result = result.Replace(tabs, "\t\t\t\t");
        breaks = breaks + "\r";
        tabs = tabs + "\t";
    }

    // That's it.
    return result;
}
catch
{
    MessageBox.Show("Error");
    return source;
}

}

Normal ifadelerin beklendiği gibi çalışmayı durdurmasına neden olduğundan önce \ n ve \ r gibi çıkış karakterlerinin kaldırılması gerekiyordu.

Ayrıca, sonuç dizesinin metin kutusunda doğru şekilde görüntülenmesini sağlamak için, Metin özelliğine atamak yerine onu bölmek ve metin kutusunun Satırları özelliğini ayarlamak gerekebilir.

this.txtResult.Lines = StripHTML (this.txtSource.Text) .Split ("\ r" .ToCharArray ());

Kaynak: https://www.codeproject.com/Articles/11902/Convert-HTML-to-Plain-Text-2

Question 14

"Html" ile ne demek istediğine bağlı. En karmaşık durum, eksiksiz web sayfaları olacaktır. Metin modunda bir web tarayıcısı kullanabileceğiniz için, bu aynı zamanda kullanımı en kolay olanıdır. Metin modu tarayıcıları dahil web tarayıcılarını listeleyen Wikipedia makalesine bakın . Lynx muhtemelen en iyi bilinendir, ancak diğerlerinden biri ihtiyaçlarınız için daha iyi olabilir.

Question 15

İşte benim çözümüm:

public string StripHTML(string html)
{
    var regex = new Regex("<[^>]+>", RegexOptions.IgnoreCase);
    return System.Web.HttpUtility.HtmlDecode((regex.Replace(html, "")));
}

Misal:

StripHTML("<p class='test' style='color:red;'>Here is my solution:</p>");
// output -> Here is my solution:

Question 16

Aynı soruyu sordum, sadece html'mde basit bir önceden bilinen düzen vardı, örneğin:

<DIV><P>abc</P><P>def</P></DIV>

Bu yüzden çok basit bir kod kullandım:

string.Join (Environment.NewLine, XDocument.Parse (html).Root.Elements ().Select (el => el.Value))

Hangi çıktılar:

abc
def

Question 17

Yazmadı ama kullandı:

using HtmlAgilityPack;
using System;
using System.IO;
using System.Text.RegularExpressions;

namespace foo {
  //small but important modification to class https://github.com/zzzprojects/html-agility-pack/blob/master/src/Samples/Html2Txt/HtmlConvert.cs
  public static class HtmlToText {

    public static string Convert(string path) {
      HtmlDocument doc = new HtmlDocument();
      doc.Load(path);
      return ConvertDoc(doc);
    }

    public static string ConvertHtml(string html) {
      HtmlDocument doc = new HtmlDocument();
      doc.LoadHtml(html);
      return ConvertDoc(doc);
    }

    public static string ConvertDoc(HtmlDocument doc) {
      using (StringWriter sw = new StringWriter()) {
        ConvertTo(doc.DocumentNode, sw);
        sw.Flush();
        return sw.ToString();
      }
    }

    internal static void ConvertContentTo(HtmlNode node, TextWriter outText, PreceedingDomTextInfo textInfo) {
      foreach (HtmlNode subnode in node.ChildNodes) {
        ConvertTo(subnode, outText, textInfo);
      }
    }
    public static void ConvertTo(HtmlNode node, TextWriter outText) {
      ConvertTo(node, outText, new PreceedingDomTextInfo(false));
    }
    internal static void ConvertTo(HtmlNode node, TextWriter outText, PreceedingDomTextInfo textInfo) {
      string html;
      switch (node.NodeType) {
        case HtmlNodeType.Comment:
          // don't output comments
          break;
        case HtmlNodeType.Document:
          ConvertContentTo(node, outText, textInfo);
          break;
        case HtmlNodeType.Text:
          // script and style must not be output
          string parentName = node.ParentNode.Name;
          if ((parentName == "script") || (parentName == "style")) {
            break;
          }
          // get text
          html = ((HtmlTextNode)node).Text;
          // is it in fact a special closing node output as text?
          if (HtmlNode.IsOverlappedClosingElement(html)) {
            break;
          }
          // check the text is meaningful and not a bunch of whitespaces
          if (html.Length == 0) {
            break;
          }
          if (!textInfo.WritePrecedingWhiteSpace || textInfo.LastCharWasSpace) {
            html = html.TrimStart();
            if (html.Length == 0) { break; }
            textInfo.IsFirstTextOfDocWritten.Value = textInfo.WritePrecedingWhiteSpace = true;
          }
          outText.Write(HtmlEntity.DeEntitize(Regex.Replace(html.TrimEnd(), @"\s{2,}", " ")));
          if (textInfo.LastCharWasSpace = char.IsWhiteSpace(html[html.Length - 1])) {
            outText.Write(' ');
          }
          break;
        case HtmlNodeType.Element:
          string endElementString = null;
          bool isInline;
          bool skip = false;
          int listIndex = 0;
          switch (node.Name) {
            case "nav":
              skip = true;
              isInline = false;
              break;
            case "body":
            case "section":
            case "article":
            case "aside":
            case "h1":
            case "h2":
            case "header":
            case "footer":
            case "address":
            case "main":
            case "div":
            case "p": // stylistic - adjust as you tend to use
              if (textInfo.IsFirstTextOfDocWritten) {
                outText.Write("\r\n");
              }
              endElementString = "\r\n";
              isInline = false;
              break;
            case "br":
              outText.Write("\r\n");
              skip = true;
              textInfo.WritePrecedingWhiteSpace = false;
              isInline = true;
              break;
            case "a":
              if (node.Attributes.Contains("href")) {
                string href = node.Attributes["href"].Value.Trim();
                if (node.InnerText.IndexOf(href, StringComparison.InvariantCultureIgnoreCase) == -1) {
                  endElementString = "<" + href + ">";
                }
              }
              isInline = true;
              break;
            case "li":
              if (textInfo.ListIndex > 0) {
                outText.Write("\r\n{0}.\t", textInfo.ListIndex++);
              } else {
                outText.Write("\r\n*\t"); //using '*' as bullet char, with tab after, but whatever you want eg "\t->", if utf-8 0x2022
              }
              isInline = false;
              break;
            case "ol":
              listIndex = 1;
              goto case "ul";
            case "ul": //not handling nested lists any differently at this stage - that is getting close to rendering problems
              endElementString = "\r\n";
              isInline = false;
              break;
            case "img": //inline-block in reality
              if (node.Attributes.Contains("alt")) {
                outText.Write('[' + node.Attributes["alt"].Value);
                endElementString = "]";
              }
              if (node.Attributes.Contains("src")) {
                outText.Write('<' + node.Attributes["src"].Value + '>');
              }
              isInline = true;
              break;
            default:
              isInline = true;
              break;
          }
          if (!skip && node.HasChildNodes) {
            ConvertContentTo(node, outText, isInline ? textInfo : new PreceedingDomTextInfo(textInfo.IsFirstTextOfDocWritten) { ListIndex = listIndex });
          }
          if (endElementString != null) {
            outText.Write(endElementString);
          }
          break;
      }
    }
  }
  internal class PreceedingDomTextInfo {
    public PreceedingDomTextInfo(BoolWrapper isFirstTextOfDocWritten) {
      IsFirstTextOfDocWritten = isFirstTextOfDocWritten;
    }
    public bool WritePrecedingWhiteSpace { get; set; }
    public bool LastCharWasSpace { get; set; }
    public readonly BoolWrapper IsFirstTextOfDocWritten;
    public int ListIndex { get; set; }
  }
  internal class BoolWrapper {
    public BoolWrapper() { }
    public bool Value { get; set; }
    public static implicit operator bool(BoolWrapper boolWrapper) {
      return boolWrapper.Value;
    }
    public static implicit operator BoolWrapper(bool boolWrapper) {
      return new BoolWrapper { Value = boolWrapper };
    }
  }
}

Question 18

Sanırım basit bir cevabı var:

public string RemoveHTMLTags(string HTMLCode)
{
    string str=System.Text.RegularExpressions.Regex.Replace(HTMLCode, "<[^>]*>", "");
    return str;
}

Question 19

Belirli bir html belgesinin metinsel kısaltması için OP sorusuna satırsonu ve HTML etiketleri olmadan tam bir çözüm arayanlar için lütfen aşağıdaki çözümü bulun.

Önerilen her çözümde olduğu gibi, aşağıdaki kodla ilgili bazı varsayımlar vardır:

komut dosyası veya stil etiketleri, komut dosyasının bir parçası olarak komut dosyası ve stil etiketleri içermemelidir
yalnızca büyük satır içi öğeler boşluk olmadan satır içi olacaktır, yani he<span>ll</span>oçıktı vermelidir hello. Satır içi etiketlerin listesi: https://www.w3schools.com/htmL/html_blocks.asp

Yukarıdakileri göz önünde bulundurarak, derlenmiş düzenli ifadelere sahip aşağıdaki dize uzantısı, html çıkışlı karakterlerle ilgili olarak beklenen düz metni ve null girdide null çıktı verecektir.

public static class StringExtensions
{
    public static string ConvertToPlain(this string html)
    {
        if (html == null)
        {
            return html;
        }

        html = scriptRegex.Replace(html, string.Empty);
        html = inlineTagRegex.Replace(html, string.Empty);
        html = tagRegex.Replace(html, " ");
        html = HttpUtility.HtmlDecode(html);
        html = multiWhitespaceRegex.Replace(html, " ");

        return html.Trim();
    }

    private static readonly Regex inlineTagRegex = new Regex("<\\/?(a|span|sub|sup|b|i|strong|small|big|em|label|q)[^>]*>", RegexOptions.Compiled | RegexOptions.Singleline);
    private static readonly Regex scriptRegex = new Regex("<(script|style)[^>]*?>.*?</\\1>", RegexOptions.Compiled | RegexOptions.Singleline);
    private static readonly Regex tagRegex = new Regex("<[^>]+>", RegexOptions.Compiled | RegexOptions.Singleline);
    private static readonly Regex multiWhitespaceRegex = new Regex("\\s+", RegexOptions.Compiled | RegexOptions.Singleline);
}

Question 20

public static string StripTags2 (string html) {return html.Replace ("<", "<"). Replace (">", ">"); }

Bununla bir dizedeki tüm "<" ve ">" karakterlerinden çıkarsınız. İstediğiniz bu mu?