Cassandra support in Emoncms

mabi · 22 November 2017 17:28

This thread is open to discuss the ongoing development of support for using Apache Cassandra as a storage engine for feeds data.
Cassandra is an open-source NoSQL distributed database, designed to handle a large amount of data across many nodes in a highly available cluster with no single point of failure.
When Emoncms is used as a centralized server to collect data from many devices or emonPi installations, and data must be spread across nodes and datacenters for availability and disaster recovery, a distributed database can be a good solution.
Right now this support is experimental and there is an ongoing discussion about the advisability of including it in the mainstream releases (disabled by default) or leaving it in a separate experimental branch.
Feel free to comment,
thank you!

Jon · 22 November 2017 18:08

Hi Marco! I’ve not heard of Apache Cassandra before. Can you supply a good link that will help others understand what it does? (I’ll start searching too!)

pb66 · 22 November 2017 18:10

Marco has written a short intro here

https://github.com/mabi/emoncms/blob/dev-cassandra/docs/Cassandra.md

and there are ongoing discussions on github

github.com/emoncms/emoncms: Cassandra Integration... (opened 09:00AM - 06 Aug 17 UTC, closed 08:41AM - 17 Jan 19 UTC, Paul-Reed, enhancement)

@mabi The below is a summary of some issues around Cassandra integration;

In …`default.settings.php`, the Cassandra engine has not been added to the array - see https://github.com/emoncms/emoncms/blob/master/default.settings.php#L38-L44

In `default.emonpi.settings.php`, the Cassandra data directory path has not been added to the array - see https://github.com/emoncms/emoncms/blob/master/default.emonpi.settings.php#L58-L67

As the path `/home/pi/data/cassandra/` will not exist in either the emonpi image, or existing installations, what provisions have been made to ensure that the datadir has been added? The installation guides will need updating to ensure that the datadir is created - see L101 of https://github.com/emoncms/emoncms/blob/master/docs/RaspberryPi/readme.md - Plus all of the other guides if necessary.

The `cassandra.md` read-me does not contain sufficient information to allow many users to use the cassandra engine, it would be great if it could be expanded to cover the full 'cassandra' installation process, and get users up and running.

Have you considered updating the `$config_file_version` flag in `default.settings.php`, `default.emonpi.settings.php` and `process_settings.php` to ensure that 'non-emonpi' users are prompted to update their settings file, otherwise users may continue to use their 'old' settings, and of course cassandra wouldn't work. See - https://github.com/emoncms/emoncms/blob/master/process_settings.php#L66

Paul

[edit] hi @jon - I started putting these links together before you posted but they might help you.

Jon · 22 November 2017 18:41

Hi Paul @pb66 ! Thanks for the links! I read thru the first two (more to read!).

Marco @mabi - Does this engine install completely on a user device (emonPi in my case)? Or is this a client that installs on a user device and sends data to an outside service? (an outside link to a wikipedia description had me confused)

mabi · 22 November 2017 21:02

Hello @Jon, Cassandra would not perform well on an ARM processor with SD card storage. Moreover, the PHP client driver for Apache Cassandra used in this implementation is currently supported only on x86 processors, so even using an emonPi as a client is not currently an option.
One use case for Cassandra is storing a lot of data that one cannot afford to lose, e.g. a centralized emoncms getting data from many devices or emonPis in the field.
The cluster is very easy to set up on commodity hardware, and new nodes can easily be added as needed. If a node fails, data is not lost, and nodes can be placed in different data centers to survive regional outages.
If this scalability and availability are not needed, the current backends are more suitable.
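
To illustrate why a failed node does not mean lost data, here is a toy Python sketch of replication: each key is written to several nodes chosen from a hash ring, so a copy survives when some of them disappear. It is only a simplified model under my own assumptions, not Cassandra's actual implementation; `ToyCluster`, `replicas_for`, the node names and the replication factor of 3 are all hypothetical.

```python
import hashlib


def _hash(text):
    # Stable hash used to place both nodes and keys on the ring.
    return hashlib.sha256(text.encode()).hexdigest()


def replicas_for(key, nodes, replication_factor=3):
    """Pick up to `replication_factor` distinct nodes by walking a hash ring."""
    ring = sorted(nodes, key=_hash)
    key_hash = _hash(key)
    # First node at or after the key's position, wrapping around to index 0.
    start = next((i for i, n in enumerate(ring) if _hash(n) >= key_hash), 0)
    count = min(replication_factor, len(ring))
    return [ring[(start + i) % len(ring)] for i in range(count)]


class ToyCluster:
    def __init__(self, nodes, replication_factor=3):
        self.storage = {node: {} for node in nodes}
        self.replication_factor = replication_factor

    def write(self, key, value):
        # Store a copy of the value on every replica chosen for this key.
        for node in replicas_for(key, list(self.storage), self.replication_factor):
            self.storage[node][key] = value

    def fail(self, node):
        # Simulate losing a node together with everything stored on it.
        del self.storage[node]

    def read(self, key):
        # Toy read path: ask every surviving node until one has the key.
        for data in self.storage.values():
            if key in data:
                return data[key]
        raise KeyError(key)
```

With four nodes and three copies per key, any two nodes can fail and `read` still finds the value on the third replica.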

stuart · 23 November 2017 11:56

Will emonCMS data fit the “NoSQL” model? I’ve yet to find a good use case for NoSQL type databases!

pb66 · 23 November 2017 12:51

Can you please explain the model you are referring to in a little more detail?

If I understand correctly, you are suggesting that a single emoncms instance (x86 based) would store its data “with cassandra”, meaning that the data is spread across multiple cassandra servers (I think you call them nodes, but that has a different meaning to us) in the same cluster, so that if any one cassandra server fails the data will still be available to the emoncms instance.

If that is correct, then the emoncms server is still a single point that can fail. I understand the data will still be safe, so a failover server or a rebuilt emoncms instance would instantly be able to access all that intact data without having to restore from backups.

But in that case, the data created and sent to emoncms whilst it is off-line is lost (not redirected or handled by another server) unless it is re-sent once the emoncms server is back up.

If this is close to the model, then cassandra essentially offers just a resilient data storage medium that may negate the need for any backup or restore; it doesn’t remove the weakness of a single emoncms server being a single point of failure.

This is interesting, but predominantly for commercial emoncms servers rather than emonPi users I guess.

Is there a global/community/public cassandra network or would setting this up involve setting up our own “cluster” with several cassandra servers?

Could this lead to an OEM community cassandra network? Or are there security issues with sharing storage?

How much disk space would a cassandra server need to offer the cluster in return for the space occupied elsewhere? Since there are no masters/slaves or hosts/clients, use of cassandra must involve shared responsibility as well as shared resources.

Please put me right if I am looking at this wrong.

We would also need a performance comparison: how big is the data on disk compared to phpfina and phptimeseries? How fast is data searching etc.? Is network speed a factor, or is there caching? Also, is the data editable or write once?

No rush on any answers you might be able to offer, I know you are probably a busy guy.

mabi · 23 November 2017 15:10

@stuart while the SQL relational database model is well defined, NoSQL is just… not SQL, so there are many types of NoSQL databases.
Cassandra is a ‘wide column store’, i.e. it uses concepts similar to tables, rows and columns, but the names and format of the columns can vary from row to row. Think of a two-dimensional key-value store. This model is well suited to timeseries data like emoncms feeds. The simplest approach would be to have a row for each device and a column for each timestamp, with the column value being the reading (temperature, power…). Cassandra sorts data and then writes it sequentially to disk, so retrieving data by key and range is very efficient. Time series data is therefore a fitting use case for cassandra.
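
As a rough illustration of the “row per device, column per timestamp” idea, here is a toy in-memory sketch in Python. It is not how Cassandra or the emoncms engine is implemented; the class `WideColumnStore`, its methods and the device names are hypothetical. It only shows why keeping each row's columns sorted by timestamp makes a key-plus-range lookup cheap.

```python
import bisect
from collections import defaultdict


class WideColumnStore:
    """Toy model: one row per device, one column per timestamp."""

    def __init__(self):
        # Row key -> parallel lists of column names (timestamps) and values,
        # both kept in ascending timestamp order.
        self._timestamps = defaultdict(list)
        self._values = defaultdict(list)

    def insert(self, device, timestamp, reading):
        ts = self._timestamps[device]
        vals = self._values[device]
        i = bisect.bisect_left(ts, timestamp)
        if i < len(ts) and ts[i] == timestamp:
            vals[i] = reading  # same column written again: overwrite
        else:
            ts.insert(i, timestamp)
            vals.insert(i, reading)

    def range(self, device, start, end):
        """Return (timestamp, reading) pairs with start <= timestamp <= end."""
        ts = self._timestamps.get(device, [])
        vals = self._values.get(device, [])
        lo = bisect.bisect_left(ts, start)
        hi = bisect.bisect_right(ts, end)
        return list(zip(ts[lo:hi], vals[lo:hi]))


store = WideColumnStore()
store.insert("emonpi-1", 1511450000, 21.5)
store.insert("emonpi-1", 1511449990, 21.4)  # out-of-order write is sorted in
store.insert("emonpi-2", 1511450000, 350.0)
window = store.range("emonpi-1", 1511449990, 1511450000)
```

Because each row is already sorted, a range query is two binary searches plus a contiguous slice, which mirrors the sequential read pattern described above.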

mabi · 23 November 2017 15:39

I will gather more info for an exhaustive answer; meanwhile, I’ll answer some of the questions.
The model I have in mind is like the one you described, suitable not for emonPi users but for large centralized emoncms installations like emoncms.org that collect data from a lot of devices. For other use cases phpfina is simpler, fast and compact.
Configuration data will be stored in the DB as it is now, while feed data will be stored on ‘private’ cassandra servers. I don’t plan on using external services other than VPSs. The cassandra servers can be the same machines where emoncms is installed, and installing Cassandra is very easy. To avoid a single point of failure and to balance load, there can be several instances of emoncms using replicated copies of the same configuration DB, so clients can send to any emoncms instance and the data will end up in the same cassandra storage.
It is always better to have a backup, but the cluster is very resilient and the need to restore will be rare.
The idea of a community cassandra network is interesting. I don’t know if there is one already, but as you pointed out there will be privacy and security issues, so I don’t think it would be practical beyond experimental or testing purposes.

pb66 · 23 November 2017 18:08

So ideally this would suit a large organisation or a group of people, the OEM community for example. It is not really suited to even a medium-sized single organisation, as running several cassandra servers just to provide storage to one emoncms server isn’t very efficient, whereas 5-6 power users deciding to form a “club” might work if cassandra security can accommodate that. If I managed 10 client servers that all ran large emoncms instances for their own client base, I could theoretically set up a cassandra network for them all to belong to.

This sounds really interesting, but I also think it is quite a niche thing that may not attract many emoncms users. I think very large scale users will have their own business model (which may or may not include cassandra), and the only other realistic use might be an openenergymonitor network, but that would only be of real interest if it worked on ARM devices, given the current user base.

It would be good to hear @TrystanLea and @nchaveiro’s take on it.

stuart · 24 November 2017 08:45

I’d tend to agree: the existing PHPFINA and similar engines are very good and efficient for storage, particularly for low power devices like the Pi.