Getting information from web pages via PowerShell


In many cases, the information we need is available on one or more web pages, and we need to process it repeatedly. To automate such a repeated task with PowerShell, we need to read and parse HTML data. For example, a question was recently posted on the Microsoft SharePoint 2010 forum:

I love how SharePoint 2010 has the page: http://Server:880/_admin/PatchStatus.aspx. However, looking at this page each server has about 100 patches.  Scrolling through this list is difficult to see if one of the servers is missing a patch or has patches that other servers do not. Is there a way to export the information on the PatchStatus.aspx page in Central Admin to an excel spreadsheet?

For the purpose of this exercise, let’s say we want to get the titles of the posts at this URL: https://superwidgets.wordpress.com/category/sql

First, let’s read in the HTML code of this page with Invoke-WebRequest and store the result in a variable named $HTML.

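The read step might look like the following minimal sketch. It assumes Windows PowerShell 3.0 or later (where Invoke-WebRequest is available) and network access to the blog. The two property checks at the end are an addition for illustration, not part of the original walkthrough.

```powershell
# Assumption: Windows PowerShell 3.0+ and network access to the blog
$URI  = 'https://superwidgets.wordpress.com/category/sql'
$HTML = Invoke-WebRequest -Uri $URI

# Quick sanity checks on what came back
$HTML.StatusCode        # HTTP status code; 200 means the request succeeded
$HTML.Content.Length    # number of characters of raw HTML received
```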

Now, let’s pipe that to Get-Member ($HTML | Get-Member) to see what kind of object we got and its available methods and properties.

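As a rough sketch of this exploration step, assuming Windows PowerShell 3.0 or later and network access to the blog, Get-Member can also be narrowed to properties only, which keeps the listing shorter. The -MemberType filter and the GetType() call are additions for illustration.

```powershell
$HTML = Invoke-WebRequest -Uri 'https://superwidgets.wordpress.com/category/sql'

# Show the object's type name and every member (methods and properties)
$HTML | Get-Member

# Narrow the listing to properties only
$HTML | Get-Member -MemberType Property

# Print just the .NET type name of the response object
$HTML.GetType().FullName
```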

After exploring different properties of the $HTML object we have, and with some background HTML knowledge, you can tell that the information we’re looking for is in the HTML Body.

Using the same technique, we can explore further properties of the $HTML object, such as ParsedHtml:

$HTML.ParsedHtml | Get-Member

This shows that it’s an HTMLDocumentClass object with many events, methods, and properties:

[Figure: partial list of HTMLDocumentClass object events, methods, and properties]

The following three methods are the useful ones here:

getElementById
getElementsByName
getElementsByTagName
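
A minimal sketch of how these three methods might be called is shown here. It assumes Windows PowerShell, where the ParsedHtml property is populated (the property is not available in PowerShell 6 and later). The id 'content' and the name 'description' are hypothetical placeholders, not values confirmed to exist on the page.

```powershell
$HTML = Invoke-WebRequest -Uri 'https://superwidgets.wordpress.com/category/sql'
$doc  = $HTML.ParsedHtml

# getElementsByTagName: every element with the given tag, returned as a collection
$headings = $doc.getElementsByTagName('h2')
@($headings).Count    # how many h2 elements the page contains

# getElementById: the single element whose id attribute matches.
# 'content' is a hypothetical id; substitute one seen in the DOM explorer.
$byId = $doc.getElementById('content')

# getElementsByName: every element whose name attribute matches.
# 'description' is a hypothetical name used only for illustration.
$byName = $doc.getElementsByName('description')
```

Of the three, getElementsByTagName is the most forgiving choice when the elements of interest carry a class but no id or name, which is the case for the post titles discussed next.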

Now that we know how to extract the information we need from a web page, let’s look at the specifics of the page at hand, https://superwidgets.wordpress.com/category/sql. Open it in Internet Explorer, for example, press F12 to open the DOM Explorer at the bottom, expand the HTML tags, and move the mouse over the tags one by one. Notice that the top pane highlights the element under the mouse pointer by changing its background color. This gives us a visual indication of which element in the HTML code corresponds to which text or area of the page.


You can see that the article titles we’re interested in are inside the elements that start with

<h2 class="entry-title">

We can now write the following few lines of PowerShell to complete the task:

# Script to display post titles in the SQL category of the Superwidgets blog
# Sam Boutros - 08/10/2014
$URI = 'https://superwidgets.wordpress.com/category/sql/'
$HTML = Invoke-WebRequest -Uri $URI
($HTML.ParsedHtml.getElementsByTagName('h2') | Where-Object { $_.className -eq 'entry-title' }).innerText

In line 5 we select the HTML elements with the “h2” tag, keep only those whose className equals “entry-title”, and read the innerText property of each.
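
To see the filter-then-read pattern in isolation, here is a small sketch that runs without a network connection. The three objects are hypothetical stand-ins for the h2 elements returned by getElementsByTagName; only their className and innerText properties are modeled, and the sample values are invented for illustration.

```powershell
# Hypothetical stand-ins for h2 elements (not real page data)
$elements = @(
    [PSCustomObject]@{ className = 'entry-title';  innerText = 'First sample post' }
    [PSCustomObject]@{ className = 'widget-title'; innerText = 'Archives' }
    [PSCustomObject]@{ className = 'entry-title';  innerText = 'Second sample post' }
)

# Keep only the entry-title elements, then read innerText from each one.
# Reading a property off a collection like this needs PowerShell 3.0 or later.
$titles = ($elements | Where-Object { $_.className -eq 'entry-title' }).innerText
$titles
```

The parentheses matter: they let the pipeline finish first, so .innerText is read from the filtered results rather than from the Where-Object script block.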

The output is exactly the list of article titles we set out to get.

This information can be further processed, logged, stored, or repackaged in HTML, CSV, or other reports.
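
As one possible way to repackage the result, the sketch below writes titles to a CSV file. The $titles array holds hypothetical stand-in values rather than real output, and the file name, the Retrieved column, and the temporary folder are arbitrary choices made for illustration.

```powershell
# Hypothetical stand-ins for the titles returned by the script above
$titles = @('First sample post', 'Second sample post')

# Wrap each title in an object so Export-Csv writes named columns
$rows = $titles | ForEach-Object {
    [PSCustomObject]@{
        Title     = $_
        Retrieved = (Get-Date).ToString('yyyy-MM-dd')
    }
}

# Write to a temporary file; the file name is an arbitrary choice
$path = Join-Path ([System.IO.Path]::GetTempPath()) 'post-titles.csv'
$rows | Export-Csv -Path $path -NoTypeInformation

# Read the file back to confirm what was written
Import-Csv -Path $path
```

A CSV file like this opens directly in Excel, which is the kind of export the forum question asked about.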

 


2 responses

  1. Chris

    Do you have to be on the web server to run these commands? Or is there a PowerShell kit that I need to install to do this on my local system?

    Thanks.

    July 13, 2016 at 4:43 pm

    • no, no

      July 15, 2016 at 10:54 am
