Wednesday, January 31, 2018

Java example: removing HTML tags that are invalid for an application

I used code similar to the example in this post a while back, when I needed a way to strip out HTML tags that were invalid for my application. The HTML was syntactically valid HTML5, but my application could only use specific HTML tags. This came up during a data migration from an older version of Jive Software, where HTML data was being migrated and displayed in the new application. I rolled my own Java solution because I didn’t find a library that could do this easily at the time.

Environment:
Oracle Java 1.7.0_60

Example:

  • The HtmlStringInfo class holds information about the HTML string that is being checked for invalid tags.
  • The HTML being processed must be well-formed. Void elements are handled when written in self-closing form (for example <br/>), because the code detects them by the slash inside the tag.
  • The example treats <h1> through <h6> and <p> as the valid tags. It keeps only the top-level <h1> through <h6> and <p> elements found inside <body>, and deletes every other element along with anything nested inside it.
  • It does not currently remove invalid elements nested inside valid elements, e.g. <h1><bad></bad></h1>.
  • Tag matching is case-sensitive, so <h2></h2> is valid but <H2></H2> is not.
  • Run it using the @Test method removeInvalidTagLoop.
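
The case-sensitivity noted in the list above could be relaxed by normalising the element name before checking it against the allowed tags. The following is only a minimal sketch of that idea: the TagNames class, its VALID set, and the elementName and isValid helpers are hypothetical names that are not part of the original code, and the allow-list here stores bare names ("h1") rather than the "<h1" form used in the example.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;

public class TagNames {
    // Hypothetical allow-list of bare element names (no "<" prefix)
    static final Set<String> VALID = new HashSet<String>(
            Arrays.asList("h1", "h2", "h3", "h4", "h5", "h6", "p"));

    // Extracts the lower-cased element name from the text found between
    // "<" and ">", ignoring attributes and a leading or trailing slash
    static String elementName(String insideBracket) {
        String s = insideBracket.trim();
        if (s.startsWith("/")) {
            s = s.substring(1);
        }
        if (s.endsWith("/")) {
            s = s.substring(0, s.length() - 1);
        }
        s = s.trim();
        int end = 0;
        while (end < s.length() && !Character.isWhitespace(s.charAt(end))) {
            end++;
        }
        return s.substring(0, end).toLowerCase(Locale.ROOT);
    }

    static boolean isValid(String insideBracket) {
        return VALID.contains(elementName(insideBracket));
    }

    public static void main(String[] args) {
        System.out.println(isValid("H2"));                     // true
        System.out.println(isValid("p class=\"intro\""));      // true
        System.out.println(isValid("div class=\"embedded\"")); // false
    }
}
```

Lower-casing with Locale.ROOT keeps the comparison independent of the machine's default locale.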

Sample HTML:
<div class="embedded" id="thumbnail_0.jpeg"></div><div class="package-entry"><h1>HEADER</h1></div><h2>HEADER2</h2><p>PARAGRAPH</p>

After the call to removeInvalidHtmlTagElement finishes, the valid HTML that remains is:
<h2>HEADER2</h2><p>PARAGRAPH</p>

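import java.util.Arrays;
import java.util.List;

import org.junit.Test; // assumes JUnit 4 is on the classpath

public class ValidHtmlTest {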
    private static final String openBracket = "<";
    private static final String closeBracket = ">";
    private static final String slash = "/";
    private static final String space = " ";
    private static final String empty = "";

    /*
     * Holds information about the html string that needs to be processed for
     * any bad html tags.
     */
    protected class HtmlStringInfo {
        private String newString; // The updated string
        private int nextIdxToProcess; // next index to start processing
        private boolean keepLooking; // indicates whether there are other tags
                                        // left in the string that have not been
                                        // processed yet

        public HtmlStringInfo(String newString, int nextIdxToProcess,
                boolean keepLooking) {
            super();
            this.newString = newString;
            this.nextIdxToProcess = nextIdxToProcess;
            this.keepLooking = keepLooking;
        }

        public String getNewString() {
            return newString;
        }

        public int getNextIdxToProcess() {
            return nextIdxToProcess;
        }

        public boolean isKeepLooking() {
            return keepLooking;
        }

        @Override
        public String toString() {
            return "HtmlStringInfo [newString=" + newString
                    + ", nextIdxToProcess=" + nextIdxToProcess
                    + ", keepLooking=" + keepLooking + "]";
        }
    }

    @Test
    public void removeInvalidTagLoop() throws Exception {
        // PREREQUISITE: Grab all XML content inside <body></body> tags, and start processing 
        
        String validTag[] = new String[] { "<h1", "<h2", "<h3", "<h4", "<h5",
                "<h6", "<p" };
        List<String> validTagList = Arrays.asList(validTag);
        
        // Start processing XML inside <body></body> tags
        String xmlString = "<div class=\"embedded\" id=\"thumbnail_0.jpeg\"></div><div class=\"package-entry\"><h1>HEADER</h1></div><h2>HEADER2</h2><p>PARAGRAPH</p>";
        
        xmlString = removeInvalidHtmlTagElement(xmlString, validTagList);

        System.out.println("Final[" + xmlString + "]");
    }

    /*
     * This function removes invalid html tags (that are not specified in the
     * validTagList), and anything in between the invalid element, from the
     * xmlString. NOTE: This does not currently remove any possible invalid
     * elements nested inside valid elements (i.e. <h1><bad></bad></h1>)
     */
    protected String removeInvalidHtmlTagElement(String xmlString,
            List<String> validTagList) {

        HtmlStringInfo tagElementInfo = null;
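        // index keeps track of the index to START LOOKING for the next tag
        // element to process; skip any text before the first tag
        int nextIdxToProcess = xmlString.indexOf(openBracket);
        if (nextIdxToProcess < 0) {
            return xmlString; // no tags at all, nothing to strip
        }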
        do {
            System.out.println("Start[" + xmlString + "]");
            System.out.println("xmlString.length[" + xmlString.length() + "]");

            // figure out entire contents inside the tag element to process
            int closeBracketIdx = xmlString.indexOf(closeBracket,
                    nextIdxToProcess);
            String insideBracket = xmlString.substring((nextIdxToProcess + 1),
                    closeBracketIdx);
            System.out.println(" [" + insideBracket + "]");

            tagElementInfo = getNextTagElement(xmlString, nextIdxToProcess,
                    insideBracket, validTagList);

            xmlString = tagElementInfo.getNewString();
            nextIdxToProcess = xmlString.indexOf(openBracket,
                    tagElementInfo.getNextIdxToProcess());

            // break if we are done processing the string
            if (nextIdxToProcess < 0)
                break;
        } while (tagElementInfo.isKeepLooking());

        return xmlString;
    }

    protected HtmlStringInfo getNextTagElement(String xmlString,
            int openBracketIdx, String insideBracket, List<String> validTagList) {

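        // if tag is valid then do not need to strip it out; compare only the
        // element name so that attributes (e.g. <p class="x">) do not matter
        boolean isValidTag = validTagList.contains(openBracket
                + insideBracket.split(space)[0]);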

        int nextIdxToProcess = 0;
        boolean keepLooking = true;
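        // a leading slash is a leftover closing tag and a trailing slash is a
        // self-closing (void) element; a slash inside an attribute value
        // (e.g. a URL) must not count
        boolean isVoidElement = insideBracket.startsWith(slash)
                || insideBracket.endsWith(slash);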
        String strip = null;
        if (isVoidElement) {
            strip = openBracket + insideBracket + closeBracket;
            System.out.println(" strip[" + strip + "]");
        } else {
            // get the element name, need to account for attributes within the
            // element
            int elementIdx = insideBracket.contains(space) ? insideBracket
                    .indexOf(space) : insideBracket.length();
            String element = insideBracket.substring(0, elementIdx);
            System.out.println(" [" + element + "]");

            String endTag = openBracket + slash + element + closeBracket;
            // need to start at the correct index, this accounts for repeats
            int endTagIdx = xmlString.indexOf(endTag, openBracketIdx)
                    + (endTag.length());

            strip = xmlString.substring(openBracketIdx, endTagIdx);
            System.out.println(" strip[" + strip + "]");
        }

        int idxStripStart = xmlString.indexOf(strip, openBracketIdx);
        int idxStripEnd = idxStripStart + (strip.length() - 1);
        System.out.println("idxStripStart[" + idxStripStart + "],idxStripEnd["
                + idxStripEnd + "]");
        System.out.println("xmlString.length[" + xmlString.length() + "]");

        // The tag element is not in the validTagList, so must be stripped
        if (!isValidTag) {
            // The element will be stripped, so need to set the idx to the
            // beginning of where this element was stripped
            nextIdxToProcess = idxStripStart;

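            // now set the new xmlString, cutting out only this occurrence
            // (String.replace would remove every identical occurrence)
            xmlString = xmlString.substring(0, idxStripStart)
                    + xmlString.substring(idxStripEnd + 1);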
        } else {
            // this element was not stripped, so need to set the idx to the end
            // + 1 of where this element is
            nextIdxToProcess = idxStripEnd + 1;
        }

        // we are done processing once nextIdxToProcess reaches the end of
        // the current xml string
        keepLooking = nextIdxToProcess < xmlString.length();

        return new HtmlStringInfo(xmlString, nextIdxToProcess, keepLooking);
    }
}
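
The limitation mentioned in the comment on removeInvalidHtmlTagElement (invalid elements nested inside valid ones are left in place) could be addressed with a single pass that tracks nesting depth. What follows is a rough sketch under stated assumptions, not the code used in the migration: the NestedTagFilter class and its filter method are hypothetical, the allow-list holds bare names such as "h1", the input is assumed to be well-formed with void elements in self-closing form, and attribute values are assumed not to contain ">".

```java
import java.util.Arrays;
import java.util.List;

public class NestedTagFilter {
    // Removes every element whose name is not in validNames, at any depth,
    // together with everything inside it. Assumes well-formed input.
    static String filter(String html, List<String> validNames) {
        StringBuilder out = new StringBuilder();
        int skipDepth = 0; // > 0 while inside an element being removed
        int pos = 0;
        while (pos < html.length()) {
            int open = html.indexOf('<', pos);
            int close = open < 0 ? -1 : html.indexOf('>', open);
            if (open < 0 || close < 0) {
                break; // no further complete tag; trailing text handled below
            }
            if (skipDepth == 0) {
                out.append(html, pos, open); // text before the tag
            }
            String inside = html.substring(open + 1, close).trim();
            boolean closing = inside.startsWith("/");
            boolean selfClosing = inside.endsWith("/");
            String body = closing ? inside.substring(1) : inside;
            String name = body.trim().split("[\\s/]+")[0];

            if (skipDepth > 0) {
                // inside a removed element: only track nesting depth
                if (closing) {
                    skipDepth--;
                } else if (!selfClosing) {
                    skipDepth++;
                }
            } else if (closing || validNames.contains(name)) {
                // a closing tag at depth 0 closes a kept element
                out.append(html, open, close + 1);
            } else if (!selfClosing) {
                skipDepth = 1; // start removing this invalid element
            }
            pos = close + 1;
        }
        if (skipDepth == 0) {
            out.append(html, pos, html.length()); // trailing text
        }
        return out.toString();
    }

    public static void main(String[] args) {
        List<String> valid = Arrays.asList("h1", "h2", "h3", "h4", "h5", "h6", "p");
        String html = "<h1>Title<bad>x</bad></h1><div><p>gone</p></div><p>kept<br/></p>";
        System.out.println(filter(html, valid)); // <h1>Title</h1><p>kept</p>
    }
}
```

Because the sketch counts depth instead of searching for a matching end tag by name, it also copes with an invalid element nested inside another element of the same name, such as a <div> inside a <div>.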
