Friday, January 27, 2017

HTML parsing with jsoup

I recently needed to read HTML files saved to disk and find all of the links in them for a data migration project. jsoup, a Java library for parsing HTML, worked well for this.

Environment used:
OS X El Capitan 10.11.6
Oracle Java 8

Maven was used, with this dependency definition:
<dependency>
    <groupId>org.jsoup</groupId>
    <artifactId>jsoup</artifactId>
    <version>1.8.3</version>
</dependency>

sampleDocument1.html
<body><p>Pellentesque habitant morbi tristique senectus et netus et malesuada fames ac turpis egestas.<a href="http://cherryshoetech.com/1">http://cherryshoetech.com/1</a>Consectetur adipisicing elit, sed do eiusmod tempor incididunt ut labore et dolore magna
aliqua. Ut enim ad minim veniam, quis nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo consequat.<a href="http://cherryshoetech.com/2">http://cherryshoetech.com/2</a></p></body>

Below is a simple test method that finds all links in the HTML document and prints the text of each one:

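import java.io.File;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.util.ArrayList;
import java.util.List;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;
import org.junit.Test; // JUnit 4 is assumed here

@Test
public void testHtmlBodyParsing() throws Exception {

    // The test HTML document must be valid

    // Read the sample document from the test resources (UTF-8 is assumed)
    File fileData = new File("./src/test/resources/sampleDocument1.html");
    String html = new String(Files.readAllBytes(fileData.toPath()), StandardCharsets.UTF_8);

    System.out.println(html);

    // Parse the HTML and select every <a> element that has an href attribute
    Document jsoupDoc = Jsoup.parse(html);
    Elements links = jsoupDoc.select("a[href]");

    // Collect the visible text of each link
    List<String> linksList = new ArrayList<String>();
    for (Element e : links) {
        linksList.add(e.text());
    }

    // Print each collected link text
    for (String link : linksList) {
        System.out.println(link);
    }
}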
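
The test above collects the visible text of each link, which in the sample document happens to be the same as the URL. When the link text differs from the target, the href attribute is probably what a migration needs. Here is a minimal sketch of that variation; the class name LinkExtractor, the helper extractHrefs, and the inline sample HTML are made up for illustration and are not part of the original project:

```java
import java.util.ArrayList;
import java.util.List;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

public class LinkExtractor {

    // Hypothetical helper: returns the href attribute of every <a> element
    // that has one, in document order.
    static List<String> extractHrefs(String html) {
        Document doc = Jsoup.parse(html);
        List<String> hrefs = new ArrayList<String>();
        for (Element a : doc.select("a[href]")) {
            hrefs.add(a.attr("href")); // the link target, not the visible text
        }
        return hrefs;
    }

    public static void main(String[] args) {
        // Illustrative input where the link text differs from the URL
        String html = "<body><p>See <a href=\"http://cherryshoetech.com/1\">the first page</a>"
                + " and <a href=\"http://cherryshoetech.com/2\">the second page</a>.</p></body>";

        // Prints one URL per line
        for (String href : extractHrefs(html)) {
            System.out.println(href);
        }
    }
}
```

With the inline sample this prints the two cherryshoetech.com URLs rather than "the first page" and "the second page".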
