Environment used:
OS X El Capitan 10.11.6
Oracle Java 8
Maven was used with the following dependency definition:
<dependency>
    <groupId>org.jsoup</groupId>
    <artifactId>jsoup</artifactId>
    <version>1.8.3</version>
</dependency>
Contents of sampleDocument1.html:
<body><p>Pellentesque habitant morbi tristique senectus et netus et malesuada fames ac turpis egestas.<a href="http://cherryshoetech.com/1">http://cherryshoetech.com/1</a>Consectetur adipisicing elit, sed do eiusmod tempor incididunt ut labore et dolore magna
aliqua. Ut enim ad minim veniam, quis nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo consequat.<a href="http://cherryshoetech.com/2">http://cherryshoetech.com/2</a></p></body>
Below is a simple test method that finds all links in the HTML document and prints the text of each one:
@Test
public void testHtmlBodyParsing() throws Exception {
    // The test HTML document must be valid.
    // First example: load the sample file shown above.
    File fileData = new File("./src/test/resources/sampleDocument1.html");
    String html = readFile(fileData.toPath());
    System.out.println(html);

    // Parse the HTML and select every <a> element that has an href attribute.
    Document jsoupDoc = Jsoup.parse(html);
    Elements links = jsoupDoc.select("a[href]");

    // Collect the visible text of each link (not the href attribute value).
    List<String> linksList = new ArrayList<>();
    for (Element e : links) {
        linksList.add(e.text());
    }

    for (String link : linksList) {
        System.out.println(link);
    }
}
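The test above calls a readFile helper that is not shown in this post. Here is a minimal sketch of what such a helper might look like on Java 8. The class name FileReadSketch and the choice of UTF-8 are assumptions for illustration, not the original implementation.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

// Hypothetical holder class; the post does not show where readFile lives.
final class FileReadSketch {

    // Reads the whole file into memory and decodes it as UTF-8 (an assumption).
    static String readFile(Path path) throws IOException {
        byte[] bytes = Files.readAllBytes(path);
        return new String(bytes, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) throws IOException {
        // Expects a file path as the first argument,
        // for example ./src/test/resources/sampleDocument1.html
        if (args.length == 0) {
            System.err.println("usage: FileReadSketch <path-to-html-file>");
            return;
        }
        System.out.println(readFile(Paths.get(args[0])));
    }
}
```

If a separate helper is not needed, jsoup can also read the file itself via Jsoup.parse(File, String), where the second argument is the charset name, such as "UTF-8".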