Data Scraping, How?

Hi
I have APS microinverters and a local gateway that communicates with the inverters over the powerline and serves a local website. I’ve worked out the GET URL that produces a table listing the inverters and per-panel generation, but I’ve reached the limit of my knowledge on how to take this further. What I need is a script that scrapes the data and passes it to emoncms as JSON. Is anybody able to assist or point me in the right direction?

The html table is as follows…

<code>
<html>
<head>
<meta http-equiv=pragma content=no-cache>
<meta http-equiv=expire content=now>
<title></title>
</head>
<body bgcolor=ffffff text=black><br><br>
<table align=center border=1 cellpadding=0 cellspacing=0 bordercolor=#008000 bordercolorlight=#ffffff borderdark=#808000 width=1024>
<center>
<tr bgcolor=#43CD80>
<td align=center>Inverter ID</td>
<td align=center>Current Power</td>
<td align=center>Grid Frequency</td>
<td align=center>Grid Voltage</td>
<td align=center>Temperature</td>
<td align=center>Date</td>
</tr>
</center>
<center>
<tr>
<td align=center>404000128572-A</td>
<td align=center> 27 W</td>
<td align=center> 49.9 Hz</td>
<td align=center> 237 V</td>
<td align=center> 21 <sup>o</sup>C</td>
<td align=center> 2017-03-18 13:20:31</td>
</tr>
</center>
<center>
<tr><td align=center>404000128572-B</td>
<td align=center> 27 W</td>
<td align=center> 49.9 Hz</td>
<td align=center> 237 V</td>
<td align=center> 21 <sup>o</sup>C</td>
<td align=center> 2017-03-18 13:20:31</td>
</tr>
</center>
<center>
<tr><td align=center>404000128462-A</td>
<td align=center> 31 W</td>
<td align=center> 49.9 Hz</td>
<td align=center> 237 V</td>
<td align=center> 20 <sup>o</sup>C</td>
<td align=center> 2017-03-18 13:20:31</td>
</tr>
</center>
<center>
<tr><td align=center>404000128462-B</td>
<td align=center> 31 W</td>
<td align=center> 49.9 Hz</td>
<td align=center> 237 V</td>
<td align=center> 20 <sup>o</sup>C</td>
<td align=center> 2017-03-18 13:20:31</td>
</tr>
</center>
<center>
<tr><td align=center>404000128304-A</td>
<td align=center> 27 W</td>
<td align=center> 49.9 Hz</td>
<td align=center> 237 V</td>
<td align=center> 19 <sup>o</sup>C</td>
<td align=center> 2017-03-18 13:20:31</td>
</tr>
</center>
<center>
<tr><td align=center>404000128304-B</td>
<td align=center> 27 W</td>
<td align=center> 49.9 Hz</td>
<td align=center> 237 V</td>
<td align=center> 19 <sup>o</sup>C</td>
<td align=center> 2017-03-18 13:20:31</td>
</tr>
</center>
<center>
<tr><td align=center>404000128436-A</td>
<td align=center> 26 W</td>
<td align=center> 49.9 Hz</td>
<td align=center> 238 V</td>
<td align=center> 20 <sup>o</sup>C</td>
<td align=center> 2017-03-18 13:20:31</td>
</tr>
</center>
<center>
<tr><td align=center>404000128436-B</td>
<td align=center> 29 W</td>
<td align=center> 49.9 Hz</td>
<td align=center> 238 V</td>
<td align=center> 20 <sup>o</sup>C</td>
<td align=center> 2017-03-18 13:20:31</td>
</tr>
</center>
<center>
<tr><td align=center>404000128356-A</td>
<td align=center> 18 W</td>
<td align=center> 49.9 Hz</td>
<td align=center> 238 V</td>
<td align=center> 18 <sup>o</sup>C</td>
<td align=center> 2017-03-18 13:20:31</td>
</tr>
</center><center>
<tr><td align=center>404000128356-B</td>
<td align=center> 24 W</td>
<td align=center> 49.9 Hz</td>
<td align=center> 238 V</td>
<td align=center> 18 <sup>o</sup>C</td>
<td align=center> 2017-03-18 13:20:31</td>
</tr>
</center>
</table>
<br><br>
<hr></hr>
<center>
<tr>
<td>&copy2013 Altenergy Power System Inc.</td>
</tr>
</center>
</body>
</html>
</code>

Regards
Dave

What platform and language will you be using?

Hi Robert
I presume the best way to do this is to put it on the emonPi and use cron to call it every minute?
Thank you for editing the post; I couldn’t get it to show the HTML.

Regards
Dave

A quick Google for “html scraping python” showed several approaches and libraries. One of them, “Beautiful Soup”, pops up all over the place for parsing the HTML; after that, of course, you’ll need more work to extract what you need and format it for emonCMS.
I’m not a Pi nor Python expert, so others may have better ideas. In the past, I’ve written stuff like that from scratch in C, i.e. the hard way.
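
As a rough illustration of the parsing step for the table posted above, here is a minimal sketch. It deliberately uses Python 3's built-in html.parser instead of Beautiful Soup so that it runs without installing anything; the class name, function name and dictionary keys are my own and purely illustrative, and the sample is cut down to one inverter row from the original post.

```python
from html.parser import HTMLParser


class TableRows(HTMLParser):
    """Collect the text of every <td>/<th> cell, grouped by <tr>."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._row = None   # cells of the row being read, or None outside a row
        self._cell = None  # text fragments of the cell being read

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            # Join fragments (e.g. around <sup>) and collapse whitespace.
            self._row.append(" ".join("".join(self._cell).split()))
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None


def parse_inverters(html):
    """Return {inverter_id: readings} for every six-column data row."""
    parser = TableRows()
    parser.feed(html)
    parser.close()
    readings = {}
    for row in parser.rows:
        # Skip the header row and the one-cell copyright row.
        if len(row) != 6 or row[0] == "Inverter ID":
            continue
        inverter_id, power, freq, volts, temp, stamp = row
        readings[inverter_id] = {
            "power_w": float(power.split()[0]),
            "frequency_hz": float(freq.split()[0]),
            "voltage_v": float(volts.split()[0]),
            "temperature_c": float(temp.split()[0]),
            "time": stamp,
        }
    return readings


SAMPLE_V3 = """<table align=center border=1>
<tr bgcolor=#43CD80><td align=center>Inverter ID</td><td align=center>Current Power</td>
<td align=center>Grid Frequency</td><td align=center>Grid Voltage</td>
<td align=center>Temperature</td><td align=center>Date</td></tr>
<tr><td align=center>404000128572-A</td><td align=center> 27 W</td>
<td align=center> 49.9 Hz</td><td align=center> 237 V</td>
<td align=center> 21 <sup>o</sup>C</td><td align=center> 2017-03-18 13:20:31</td></tr>
</table>"""

if __name__ == "__main__":
    print(parse_inverters(SAMPLE_V3))
```

The same row-by-row idea should carry over to Beautiful Soup if you prefer it; only the way the cells are collected changes.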

Here’s a Python 3 script to scrape real-time data from an APS Systems ECU. It uses the BeautifulSoup module mentioned above.

Thanks Bill. I had downloaded this script but I just can’t get it to run without errors. The command I’ve been using is…

$ python getECUData.py

I get the error…
Traceback (most recent call last):
  File "/var/getECUData.py", line 9, in <module>
    import urllib.request
ImportError: No module named request

Regards
Dave

This might help. Which version of Python do you have?

@Dave, your error is probably because the script is for Python 3 and you are using Python 2. To check, run:

python --version

You may have Python 3 installed as well; if so, it is probably just called “python3”:

python3 getECUData.py

You may also need to install the Beautiful Soup library for Python 3, which you will probably do with “pip3”:

pip3 install bs4

Actually looking at it more you can probably port that script back to python2 just by replacing the two instances of “urllib.request” with “urllib”
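
If you would rather have one script that runs under either interpreter, a common pattern is to try the Python 3 import first and fall back to the Python 2 one. This is only a sketch of that pattern, not a change to the script linked above; fetch_page is a hypothetical helper name, and the URL to pass in is whatever GET URL your ECU serves.

```python
try:
    from urllib.request import urlopen  # Python 3 location
except ImportError:
    from urllib2 import urlopen  # Python 2 fallback


def fetch_page(url, timeout=10):
    """Download a page and return it as text."""
    response = urlopen(url, timeout=timeout)
    try:
        # Replace undecodable bytes instead of failing on odd characters.
        return response.read().decode("utf-8", "replace")
    finally:
        response.close()
```

The rest of the script can then call fetch_page without caring which Python version is running it.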

Thanks @mwalker and @Robert.Wall
I realized from your post that I was installing the modules for Python 2 rather than for Python 3, which is what I needed; the two are effectively separate programs.
I’ve modified the code to output data that’s compatible with EmonCMS and it’s working nicely.
I will write a post detailing what I’ve done when I get a spare 30 minutes.
Attached below is how the inverters are listed in EmonCMS, so you can pull off generation data at per-panel level.

Regards
Dave
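
For anyone wanting to try the emonCMS side before that write-up appears, here is a minimal sketch of sending readings through the emonCMS input API (input/post with the node, fulljson and apikey parameters). It is not Dave's modified script. The base URL, node name "aps", API key and the use of the inverter IDs as input names are all placeholders and assumptions; substitute your own.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen


def build_emoncms_url(base_url, apikey, node, values):
    """Build an input/post URL carrying the readings as a JSON object."""
    query = urlencode({
        "node": node,
        "fulljson": json.dumps(values, separators=(",", ":")),
        "apikey": apikey,
    })
    return base_url.rstrip("/") + "/input/post?" + query


def post_to_emoncms(base_url, apikey, node, values, timeout=10):
    """Send the readings and return the server's reply as text."""
    url = build_emoncms_url(base_url, apikey, node, values)
    with urlopen(url, timeout=timeout) as response:
        return response.read().decode("utf-8", "replace")


if __name__ == "__main__":
    # Placeholder host and key: replace with your own before posting.
    print(build_emoncms_url(
        "http://emonpi.local", "YOUR_WRITE_APIKEY", "aps",
        {"404000128572-A": 27, "404000128572-B": 27},
    ))
```

Each key in the dictionary should then show up as a separate input under the chosen node, which is what allows per-panel feeds.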

Hi Dave,

I ran into your solution for scraping APS ECU data, copied the script and did everything that needs to be done.
Unfortunately it does not work. I noticed your script was built for ECU version 3.10.10; my ECU is running version 4.1. Checking the URL, I see it does not exist on my unit. The URL for 4.1 is:
http://ip-address/index.php/realtimedata
The returned output also looks somewhat different from what you have above. Here is part of the table for the realtime data:

<thead>
  <tr>
    <th scope="col">Inverter ID</th>
    <th scope="col">Current Power</th>
    <th scope="col">Grid Frequency</th>
    <th scope="col">Grid Voltage</th>
    <th scope="col">Temperature</th>
    <th scope="col">Reporting Time</th>
  </tr>
</thead>
<tbody>
    <div>
        <tr class='active'>
    <td>404000191265-A </td>
    <td> 23 W </td>
    <td rowspan=2 style='vertical-align: middle;'> 50.0 Hz </td>
    <td> 240 V </td>
    <td rowspan=2 style='vertical-align: middle;'> 31 &#176;C </td>
    <td rowspan=2 style='vertical-align: middle;'> 2019-09-03 15:46:05
404000191265-B 174 W 240 V
404000191123-A 170 W 50.0 Hz 240 V 35 °C 2019-09-03 15:46:05

Did you by any chance update your ECU to v4.1 and update the script to work with it?
Or did you find another solution to get the data out of the ECU?

Thanks for your thoughts, ideas, and support,
Bernard
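
One thing that stands out in the v4.1 fragment is the rowspan=2 on the frequency, temperature and time cells, so the "-B" row only carries three cells of its own. A scraper that assumes six cells per row would skip or misread it. Below is a rough sketch of carrying the spanned values down to the following row, using Python 3's built-in html.parser. The sample is reconstructed from the fragment above: the closing tags and the exact layout of the "-B" row are assumptions, and the names are illustrative. It has not been run against a real v4.1 ECU.

```python
from html.parser import HTMLParser


class SpanTable(HTMLParser):
    """Collect (text, rowspan) pairs for each cell, grouped by <tr>."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._row = None
        self._cell = None
        self._span = 1

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []
            self._span = int(dict(attrs).get("rowspan") or 1)

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            text = " ".join("".join(self._cell).split())
            self._row.append((text, self._span))
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None


def expand_rowspans(rows, width):
    """Fill each row out to `width` columns, repeating spanned cells."""
    carried = {}  # column index -> (rows still covered, cell text)
    full_rows = []
    for cells in rows:
        cells = list(cells)
        full = []
        for col in range(width):
            if col in carried:
                remaining, text = carried[col]
                full.append(text)
                if remaining == 1:
                    del carried[col]
                else:
                    carried[col] = (remaining - 1, text)
            elif cells:
                text, span = cells.pop(0)
                full.append(text)
                if span > 1:
                    carried[col] = (span - 1, text)
            else:
                full.append("")
        full_rows.append(full)
    return full_rows


def parse_realtime(html, width=6):
    """Return the data rows with spanned values copied into each row."""
    parser = SpanTable()
    parser.feed(html)
    parser.close()
    rows = expand_rowspans(parser.rows, width)
    return [r for r in rows if r[0] and r[0] != "Inverter ID"]


# Reconstructed sample: closing tags and the "-B" row markup are assumed.
SAMPLE_V4 = """<table><tbody>
<tr class='active'><td>404000191265-A </td><td> 23 W </td>
<td rowspan=2> 50.0 Hz </td><td> 240 V </td>
<td rowspan=2> 31 &#176;C </td><td rowspan=2> 2019-09-03 15:46:05 </td></tr>
<tr class='active'><td>404000191265-B </td><td> 174 W </td><td> 240 V </td></tr>
</tbody></table>"""

if __name__ == "__main__":
    for row in parse_realtime(SAMPLE_V4):
        print(row)
```

With the rows filled out like this, each panel again has six values, so the per-row handling written for the older firmware's table could in principle be reused.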