
gjr80 avatar gjr80 commented on July 29, 2024 1

Status update:

  • fixed the circular import issue and now use the extant Version metadata field.
  • added --apply-weight-patch action to wee_database
  • first draft documentation of --apply-weight-patch action under wee_database in Utilities Guide

I have pushed these changes to the weighted_summaries_(#61) branch this morning and I believe it should now run without throwing an error.

Things that I know still need to be done:

  • Version numbers. Need to check through all files to make sure there are no testing/development remnants that use old/existing weewx version numbers.
  • Utilities Guide. Am sure there is a better way of saying what I have written... Matthew? :)
  • Testing. My recent testing has been limited to patching an unpatched 10 year database and then trying to re-patch the patched database. This was done both during startup and from wee_database. I do want to do some more extensive testing (I know the patch algorithm is fine; it is more the handling of old, new, and reverted db and weewx versions that I am interested in) but time has gotten away from me for now. I will work on this in the coming days (I presume this will not be included in a release in the next few days).
  • Documentation. I don't know what thoughts there are on where and how this patch needs to be identified/documented (other than in the changelog and Utilities Guide).

matthewwall avatar matthewwall commented on July 29, 2024 1

perhaps the option --update-database is too general? maybe --reapply-intervals, or --recalculate-aggregations. however, that does not distinguish much from --backfill-daily.

for context, these are the actions for wee_database:

--create-archive
--drop-daily
--backfill-daily
--reconfigure
--string-check

i'm still not sure why we need a separate option. won't --backfill-daily do it?

btw, it is not very future-proof to refer to this thing as a 'patch'. think of how this issue will be needed/referred to in 2 years.

gjr80 avatar gjr80 commented on July 29, 2024

Had a quick look through the accumulator code; only had to take one break, so it must be a bit easier to get my head around it this time. My reading is that the weighted sums are being used where they should be and the weighted sum formulae are correct, but the weighting factor is always 1.0, so in effect the plain sum is always used.

Is the fix simply a case of keeping track of the elapsed time between successive loop packets and when calling addSum() (from add_value() or addWindValue()) passing addSum() a weight parameter based on the elapsed time?

gjr80 avatar gjr80 commented on July 29, 2024

Could I seek a little clarification of this issue. In the initial description the issue is said to be a problem 'if the archive interval changed'. Is this more completely described as 'This would be an issue for stations using software record generation if the archive interval changed'? Being a Davis user I sometimes get stuck in the Davis paradigm (and the ordered world it uses) and am trying to get my head in the right space.

That being said, I am at the stage where I have a concept in my head that I would like to code and test. I am trying to distil the test conditions I would need to prove the code. If I had a simulator that emitted loop packets at irregular, but known, intervals (rather than the current regular 2.5 sec interval) I could capture the accumulator results after a full archive period and compare them to manually calculated results. A similar effect could be achieved by modifying the current simulator to omit 2, 3, 4 or more loop packets in a row. I would probably need to change the characteristics of the observation cycle (ie make it change a bit faster) so that the effect of the weighting over such short times would be clearly seen and not buried in the noise at 10 decimal places. If the aggregates under such conditions were correctly calculated then, by extension, they would also be correctly calculated should the archive interval change.

Thinking aloud here I guess, but want some reassurance I am onto the right problem and not solving the wrong problem.

gjr80 avatar gjr80 commented on July 29, 2024

I am after some input on expected accumulator behaviour.

Archive records hold the obs that represent the previous archive_period seconds. Does the same effectively hold true for obs in loop packets? I expect it does since a loop packet received on an archive_period boundary is included in the accumulator covering the previous archive_period seconds.

On startup. The very first accumulator is created when the first loop packet arrives. What weighting should the obs in that packet be given? Presently every obs is weighted equally, so it is just added to the sums and counted. But if we are to weight the obs, over what period do we weight the first loop packet? Do we give it no weight and just use it to mark a time from which we weight the next packet? Do we weight it over the period since weewx launched? Do we apply a weight of 1? The 2nd option appears to make sense, but if there is some catchup going on or other processing at startup, that makes it rather meaningless. Applying a weight of 1 seems a bit arbitrary: why not 10, 0.1, etc.?

Long loop periods. Consider the case where there is, say, a 45 sec loop period with a 5 minute archive period. There will be occasions when a loop packet straddles the end of an archive period. In such a case how are the obs in that loop packet weighted, and to which accumulator do they apply? Do we apply the weighted obs proportionally to both accumulators? Or do we apply them only to the accumulator in which the packet arrives? If we are applying the obs to the accumulator in which it falls, what weighting is used: the period since the start of the accumulator, or the period since the timestamp of the previous loop packet? At present each accumulator has no knowledge of anything that happened before it came into being; it only knows of the last packet timestamp for packets arriving during its lifetime.

And let's not talk about receiving pressures with a 15 minute loop period in a 5 minute archive period setup.

Sorry there is such a breadth of what seem like fundamental questions, but I want to get a clear picture of what behaviour we want. Applying equal weighting to a 2.5 sec loop period with a 5 min archive period just makes things too easy.

tkeffer avatar tkeffer commented on July 29, 2024

I am thinking about the daily summaries, not the accumulation of LOOP packets into archive records.

The daily summaries are supposed to be an optimization of what's in the archive database. They are a compilation of certain raw statistics about the archive records, from which we can calculate averages, highs, lows, etc., more quickly.

My intention is that they should be weighted by the observation period, that is, the archive period. For example, say we wanted to know the average temperature for the day. If all the archive records had the same archive period, we could just sum the temperatures, then divide by the number of records. However, there's the possibility that the archive interval could have changed during the day. To guard against that, instead of simply summing the temperatures, the code sums the temperature times the archive interval. Then, instead of dividing by the number of records, it divides by the sum of archive intervals (which is, effectively, the total observation time). The result is an average temperature, properly weighted by the observation time of each value.
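
For example (a toy illustration of the arithmetic, not weewx code), with a mix of 5 and 10 minute archive intervals:

    # (outTemp, archive interval in minutes) for three archive records
    records = [(20.0, 5), (21.0, 5), (24.0, 10)]

    wsum = sum(t * interval for t, interval in records)        # 20*5 + 21*5 + 24*10 = 445.0
    sumtime = sum(interval for _, interval in records)         # 5 + 5 + 10 = 20
    weighted_avg = wsum / sumtime                              # 22.25, weighted by observation time

    simple_avg = sum(t for t, _ in records) / len(records)     # 21.67, wrong once intervals differ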

This is all well and good, but I don't think the code is actually doing this. The function DaySummaryManager._addSingleRecord reads:

        # Now add to the daily summary for the appropriate day:
        _day_summary = self._get_day_summary(_sod_ts, cursor)
        _day_summary.addRecord(record)
        self._set_day_summary(_day_summary, record['dateTime'], cursor)

The trouble here is that the call to addRecord is not passing in the weighting, that is, the archive interval. So, instead, the default weighting of 1 is being used, and we're back to simply summing the values and dividing by the number of records.

I would really appreciate a second set of eyes looking through this to see if I'm missing something. Better yet, to write a simple unit test that tests this.

gjr80 avatar gjr80 commented on July 29, 2024

My apologies, when I saw 'looking through the accumulator code' I thought the focus of this issue was the loop data. I will now focus on the archive records/daily summaries.

tkeffer avatar tkeffer commented on July 29, 2024

I guess it could be an issue with the loop code, but, right now, there is no way to weight an observation by the interval between loop packets. Indeed, the definition of a loop packet observation is fuzzy: is it an aggregate value since the last loop packet? Or, an instantaneous value?

I think we have to assume that all loop packets within an archive interval should be weighted equally.

gjr80 avatar gjr80 commented on July 29, 2024

Spent some time looking at DaySummaryManager._addSingleRecord(). Quite correct that a weighting of 1 is applied to all obs, loop or archive. Thinking aloud, for an archive record the weighting will be the interval field (minutes or seconds does not matter, as long as the same units are used for weighting throughout). Loop packets don't have an interval field, so a solution could be to check for an interval field in Accum.addSum() (and all other Accum methods that call addSum()) and, if it exists, pass on interval as the weight; otherwise pass 1 as the weight. Archive record obs would then be added with weighting and loop packet obs would not (well, they would, but the weight would be 1, as it is now).
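
Roughly speaking (an abridged sketch, not the actual weewx source), addSum() would then be accumulating something like:

    class ScalarStats(object):
        """Abridged sketch: just the sum-related fields of a daily summary."""

        def __init__(self):
            self.sum = 0.0       # plain sum of values
            self.count = 0       # number of values
            self.wsum = 0.0      # weighted sum of values
            self.sumtime = 0.0   # sum of the weights, ie total observation time

        def addSum(self, val, weight=1):
            if val is not None:
                self.sum += val
                self.count += 1
                self.wsum += weight * val
                self.sumtime += weight

    # weighted average = wsum / sumtime; with weight always 1 this collapses to sum / count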

I suppose a weight parameter could be introduced higher up the chain (eg add it to Accum.addRecord()), but I don't see any necessity to do so when it is effectively (and readily) available lower down the chain.

I will see if I can put together a modified simulator to test this (unless I hear I am off the track again!)

tkeffer avatar tkeffer commented on July 29, 2024

You're on the right track.

Detecting whether we're accumulating loop packets or archive records by the presence or absence of a value for interval feels a little kludgy but, I must admit, I don't have a better idea.

tkeffer avatar tkeffer commented on July 29, 2024

I should add, this part of the patch is easy enough. The tough part is what to do with legacy daily summaries. Now they will be a mix of weighted and unweighted sums.

Options:

  1. We could rebuild the summaries, but users with years of data running on low-powered machines wouldn't be happy about it.
  2. We could do a one-time "patch" of the summaries. That is, given an archive interval, go through the summaries and multiply the values of wsum and sumtime by it. Then include a flag in archive_day__metadata that the patch has been done and does not need to be done again.
  3. Or, we could "version" the summaries. Use the weighting in newer versions, don't use it in older versions.

Personally, I like option 2 the best.
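
To make option 2 concrete, the one-time patch would amount to something like this (a minimal sketch assuming SQLite, a uniform 5 minute archive interval and the stock table names; the vector tables would also need their xsum/ysum style fields scaled, and the metadata name/value shown is illustrative only):

    import sqlite3

    INTERVAL_SECONDS = 300   # the (assumed uniform) archive interval

    with sqlite3.connect('weewx.sdb') as conn:
        for obs in ('outTemp', 'barometer', 'windSpeed'):   # ...one per archive_day_<obs> table
            conn.execute("UPDATE archive_day_%s "
                         "SET wsum = wsum * ?, sumtime = sumtime * ?" % obs,
                         (INTERVAL_SECONDS, INTERVAL_SECONDS))
        # leave a flag so the patch is never applied twice
        conn.execute("INSERT OR REPLACE INTO archive_day__metadata (name, value) VALUES (?, ?)",
                     ('Version', '2.0'))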

gjr80 avatar gjr80 commented on July 29, 2024

Yep, definitely avoid 1; too many RPis with too much data = lots of user issues. 3 sounds like you are creating orphans. 2 should be easy enough and quick, if we are fortunate and everyone still uses a single interval throughout their archive.

tkeffer avatar tkeffer commented on July 29, 2024

We could support limited changes in the archive interval.

Scan the daily summaries day-by-day. For each day, consult the archive and see what archive interval was used for the day and use that to patch the summary.

Of course, this would fail if the archive interval changes during the day, but it's more accurate than blindly multiplying everything by a single number.

gjr80 avatar gjr80 commented on July 29, 2024

Some good time spent looking at the first (easy) part of the problem.

I have tried 2 different approaches, mod1 and mod2:

  • mod1 is as mentioned above whereby the methods listed in Accum.add_record_dict each evaluate a weight parameter which is set to 1 for loop packets and the record interval (in seconds) for archive records. Packets and records are distinguished on the basis of the presence of an interval field in the packet/record. Subsequent calls to ScalarStats.addSum() or VecStats.addSum() now pass on the weight parameter.
  • mod2 takes a different approach, essentially pushing the weight parameter up the chain. DaySummaryManager._addSingleRecord() passes a second parameter weight to Accum.addRecord() where weight=record['interval']*60. Accum.addRecord() now accepts a weight parameter with a default value of 1. Accum.addRecord() passes weight to all called methods. Accum add_value(), add_wind_value(), check_units() and noop() methods all accept a weight parameter with a default value of 1. As with mod1 subsequent calls to ScalarStats.addSum() or VecStats.addSum() pass on the weight parameter.

Both mod1 and mod2 appear to work. To confirm this I modified StdArchive and the Simulator as follows:

  1. the interval value can now vary from record to record, and
  2. added a genArchiveRecords() method to the simulator so that I can better control what appears in the simulator generated archive records.

I have run the simulator in generator mode to give me a full day of archive records (out of interest I implemented a 5-5-5-5-5-5-10-10-10 sequence for interval, starting/repeating each hour on the hour). I have taken the dateTime and outTemp fields from each archive record and used them in a spreadsheet to manually calculate the sum, wsum, count and sumtime fields of the accumulator/daily summaries. These agree with the values generated by both mod1 and mod2. Unfortunately, one of the problems with a sinusoidal simulator is that the weighted and unweighted averages are the same. So to get some numbers that were not the same I added some randomness to the simulator generated archive record obs. I now get different values for unweighted and weighted averages, and the mod1 and mod2 values agree with my manually calculated values. I have only done the manual comparison with outTemp, but since the other scalar obs use the same code I know they will be fine. I do need to verify the vector stats, but I see no issues there; ultimately weight flows through the same calls as for scalar stats.
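
The manual check boils down to arithmetic along these lines (a toy, self-contained version of the spreadsheet check; the temperatures are made up):

    import itertools, random

    random.seed(42)
    # one day of records: 5-5-5-5-5-5-10-10-10 minute intervals, repeating each hour (9 records/hour)
    intervals = list(itertools.islice(itertools.cycle([5, 5, 5, 5, 5, 5, 10, 10, 10]), 24 * 9))
    temps = [20.0 + random.uniform(-1, 1) for _ in intervals]   # per-record outTemp values

    sum_ = sum(temps)
    count = len(temps)
    wsum = sum(t * i * 60 for t, i in zip(temps, intervals))    # weight = interval in seconds
    sumtime = sum(i * 60 for i in intervals)

    print("unweighted avg: %.4f   weighted avg: %.4f" % (sum_ / count, wsum / sumtime))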

Would appreciate thoughts on the mod1 and mod2 approaches. No code in a repo yet, will see if I get time to knock one up this arvo. Will now look at the next part of the problem. Seems straightforward, but doing anything with 'flags in archive_day__metadata' will probably require some coaching!

Fortunately gloomy winter days here so finding time is the easy part.

tkeffer avatar tkeffer commented on July 29, 2024

mod1 sounds a lot simpler. I assume the function Accum.add_value now looks something like

    def add_value(self, record, obs_type, add_hilo):
        """Add a single observation to myself."""

        val = record[obs_type]

        # If the type has not been seen before, initialize it
        self.init_type(obs_type)
        # Then add to highs/lows, and to the running sum:
        if add_hilo: 
            self[obs_type].addHiLo(val, record['dateTime'])
        self[obs_type].addSum(val, weight=record.get('interval') * 60, 1)

gjr80 avatar gjr80 commented on July 29, 2024

Yes, that's about it. Mod1 definitely simpler, thought I would see what else I could come up with after your kludge comment.

tkeffer avatar tkeffer commented on July 29, 2024

As I thought about it I realized that it's not really a kludge. We can think about the presence of an interval field as indicating whether the values are an aggregate over some time interval, or an instantaneous value. If the former, it makes sense to weight the sums with that time interval.

gjr80 avatar gjr80 commented on July 29, 2024

self[obs_type].addSum(val, weight=record.get('interval') * 60, 1)

I shouldn't reply in the wee hours of the morning... now in the cold hard light of day I am not seeing how the above syntax works. Have I missed something?

tkeffer avatar tkeffer commented on July 29, 2024

It's me that shouldn't be typing early in the morning. It should read

    self[obs_type].addSum(val, weight=record.get('interval', 1.0/60) * 60)

which looks kind of dumb so, rather than get too clever...

    weight = record.get('interval') * 60 if 'interval' in record else 1
    self[obs_type].addSum(val, weight=weight)

gjr80 avatar gjr80 commented on July 29, 2024

ha ha ha. Now this time that is exactly what I have, well except I used _weight.

gjr80 avatar gjr80 commented on July 29, 2024

Am happy now that mathematically the weighted values are correct. Giving some thought to the code to do the patch. You mentioned:

We could do a one-time "patch" of the summaries. That is, given an archive interval, go through the summaries and multiply the values of wsum and sumtime by it. Then include a flag in archive_day__metadata that the patch has been done and does not need to be done again.

and

We could support limited changes in the archive interval.

Scan the daily summaries day-by-day. For each day, consult the archive and see what archive interval was used for the day and use that to patch the summary.

Of course, this would fail if the archive interval changes during the day, but it's more accurate than blindly multiplying everything by a single number.

My thoughts so far, things the patch should do (in no particular order):

  1. only execute the patch if the flag in archive_day__metadata is not set
  2. patch the relevant wsum, sumtime, xsum etc values in the daily summaries (of course)
  3. patch daily summaries where the associated archive has more than one distinct interval value provided there is only ever one distinct interval value for each 'archive day'
  4. leave all daily summaries (not just the offending day) untouched if the patch finds more than one distinct interval in any one 'archive day'
  5. leave a flag in archive_day__metadata if the weighting was completed successfully

Oh, and out of interest I have the code here.

tkeffer avatar tkeffer commented on July 29, 2024

I would add one more thing. There is a very real risk that the user could abort the patching process before it has finished.

To guard against this, the patching should be done day-by-day. That is, a single transaction patches a single day. Then, the flag in archive_day__metadata indicates the last day patched.

For your point 3, if more than one archive interval is found in a day, we could just calculate the daily summary from scratch for that day.

Which brings me to still another point: do we know whether patching the daily summaries is faster than just recalculating everything?

gjr80 avatar gjr80 commented on July 29, 2024

Sort of an aside to the issue here but I am working on the first cut of a utility to patch the daily summaries. Am using DaySummaryManager._get_day_summary() to get an accumulator containing the daily summaries for a given day and I notice that when you extract an accumulator in this manner the unit_system property of the resulting Accumulator object is left as None. Wondering if there was a reason for this, or have I missed the obvious? I can't see where the property is otherwise set (directly or indirectly) in _get_day_summary() - can see where it is set by other independent methods. Just struck me as odd as I thought it could lead to issues.

tkeffer avatar tkeffer commented on July 29, 2024

I'd say the reason for this inconsistency is history. The unit_system property was put there to make sure we weren't accumulating records or packets using different unit systems.

But, of course, there is no reason why it can't also be used to indicate what unit system the contents of the accumulator uses.

gjr80 avatar gjr80 commented on July 29, 2024

Ok, that explains it. I first discovered the issue when I extracted an accumulator, weighted the relevant stats and then went to save the accumulator. From what I can see _get_day_summary() and _set_day_summary() normally have an addRecord() in between them and that sets unit_system. Just a bit different to what I am doing. No matter, the simple fix was to throw in one line to set unit_system, at least I know it is not because I have missed something.

Back to the main topic, I have a 1st cut of a utility that patches the daily summaries with the interval of the day. Seems to work as it should. You were concerned about times; I just generated a 10 year/5 minute sqlite archive and the utility patched the daily summaries in 72 odd seconds. That was on a VM running on my desktop so not a real world test. I have a spare RPi (1st edition or whatever) that I will fire up tomorrow to see how much longer it will take.

tkeffer avatar tkeffer commented on July 29, 2024

Do you have a number for how long it would take to completely rebuild (aka backfill) the daily summaries? Is there a significant difference in time?

Unless there is a big difference there's no advantage, and a lot of disadvantages, to patching.

gjr80 avatar gjr80 commented on July 29, 2024

Just dropped then rebuilt the same database under the same VM; the rebuild took 475 seconds (1051788 records to backfill 3653 days). That was without any fiddling with the number of days per transaction for the rebuild, but I don't recall tweaking that parameter making much of a difference. Will do both a rebuild and a patch on the RPi tomorrow.

tkeffer avatar tkeffer commented on July 29, 2024

A factor of 6+ is worth the complexity. Frankly, I'm surprised. I didn't think there would be that big a difference.

gjr80 avatar gjr80 commented on July 29, 2024

Hmmm, I am thinking the same (surprised). Thought I would run the patch again; set that running straight after I last posted and it's still going, that's 13 odd minutes. Will look closer at it tomorrow, bed is calling now!

gjr80 avatar gjr80 commented on July 29, 2024

923 seconds. Need to take another look at this.

gjr80 avatar gjr80 commented on July 29, 2024

Well I am at a loss. What I am seeing is telling me that something is not right with the daily summaries in my 10 year db, but I can't see what. I have so far:

  • Run patch on 10 year test db, patch takes 72 seconds.
  • Drop daily summaries, then rebuild, rebuild takes 470 seconds.
  • Run the patch on the rebuilt daily summaries, takes 900 odd seconds. Sure this is effectively double patching so maybe that is the cause of the long execution time.
  • Take 10 year test db, drop and rebuild on a separate machine, takes 450 odd seconds. (This machine does not have the modified accum weight code so I have effectively rebuilt my 10 year test db to what it was originally.)
  • Take that 10 year rebuilt db and run the patch, takes 900 odd seconds, lots of disk activity.

The above all use SQLite, have all been done on a VM and are all repeatable. I cannot see anything wrong with my original 10 year test db: the 1 million plus archive records are there and complete, all the daily summary tables are there, and those that should have data in them do (3650 odd rows). At first I thought the double patching may have caused the slowdown due to some very big numbers being generated and stored (I know SQLite does some funny things with number sizes), but the rebuild on the other machine rules that out. I get the feeling that the patch of my original db is not really patching, but the checks I have done of the summaries indicate they are patched.

I am at a loss. I am happy to accept the patch utility is too slow but I just want to understand why the conflicting results. Don't like unexplained anomalies like this.

gjr80 avatar gjr80 commented on July 29, 2024

After much hair pulling and trips down various rabbit holes, it seems my red face is due to transactions. Today's lesson: 'don't update tables of 3653 records one record per transaction'. Now updating 5 days per transaction, and on my VM I am getting a consistent 90 odd seconds when patching a clean 10 year database or a dropped/rebuilt database. Rebuilding is consistently taking 180-200 seconds. Will see how the figures translate to the RPi.
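
For the record, the batching amounts to something like this (a sketch only; the per-day helper and the list of day timestamps are hypothetical stand-ins):

    def patch_in_batches(conn, day_timestamps, patch_day, trans_days=5):
        """Sketch: apply a per-day patch function, committing every trans_days days."""
        for i in range(0, len(day_timestamps), trans_days):
            with conn:                                   # sqlite3: one transaction per batch of days
                for sod_ts in day_timestamps[i:i + trans_days]:
                    patch_day(conn, sod_ts)              # per-day update of wsum/sumtime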

gjr80 avatar gjr80 commented on July 29, 2024

Some RPi times:

  • patch 10 year/5 minute database - 650 to 700 seconds (have done this a few times)
  • backfill 10 year/5 minute database - 14000 odd seconds (about 4 hours - only did this once)
  • patch 3 year/5 minute database - 180 odd seconds (have done this twice)
  • backfill 3 year/5 minute database is still running, but I expect at least 1 hour

I am now at least confident that I am getting patch times that are repeatably close. Based on this I think the patch is worth persisting with. I will tidy it up and let you know when it's ready for review.

tkeffer avatar tkeffer commented on July 29, 2024

I would agree. That's over an order of magnitude difference!

gjr80 avatar gjr80 commented on July 29, 2024

Actually, I just found in my notes another instance where I backfilled on the RPi last night and it took 14190 sec, so I have done it twice; at least it was consistent. I really did think it would be quicker; on the VM the difference was nowhere near as great.

gjr80 avatar gjr80 commented on July 29, 2024

Ok, think the patch is ready for someone else to have a decent look at it. I took the approach used by the weewx utilities, hope this is what was intended as I don't think it was ever discussed. The patch (wee_weight_summaries) and modified accum.py and manager.py can be found here.

Usual disclaimer about names, I am not the best for choosing appropriate names.

tkeffer avatar tkeffer commented on July 29, 2024

A few comments:

  1. Personal preference: a pull request would be easier to review and a lot
    easier to merge into the code base.
  2. Along the same line, I didn't see any difference between the two
    versions of manager.py. Is that right?
  3. Refactor the code so that it is more like the "backfill" code in
    structure. That is, the functionality is present as a library, which may or
    may not be invoked by a utility. This way, if needed, weewxd can patch the
    daily summaries on startup. If we depend on invoking a utility, it may not
    get done.
  4. Do you have any data to support the premise that vacuum significantly
    shortens the processing time? If not, I'd just leave it out. Or, if you do
    have such data, then always (silently) do the vacuum.
  5. Rather than calculate the last unpatched day, then perform date
    arithmetic for each day, why not just loop around a select statement that
    selects daily summaries "greater than" the last patched date? For each such
    date, you open up a transaction, patch the summary, then close and commit
    the transaction. Not only is it simpler, but it guards you against any
    missing day summary. It also more naturally follows the "intention" of the
    exercise: to patch the actual daily summary sitting in a database. (See the
    sketch after this list.)
  6. Suggestion to put the transaction context inside the loop over each
    day. This way, each day gets its own transaction. This will still be fast,
    while reducing memory requirements.
  7. I couldn't follow any logic that left a day's summary untouched if the
    archive interval changed within the day. Seemed to me that the program
    requires that all intervals beyond the last patched record have the same
    interval. But, I could definitely be missing something!
  8. The way I have selected for a value while checking that there is only
    one value (common situation when checking unit systems) is to do a select
    on the min and max value, then make sure they are the same. In this case,
    check for the minimum and maximum value of interval within a day, then
    make sure they are equal to each other. This requires only one pass through.
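
A minimal sketch of how points 5, 6 and 8 could hang together (hypothetical code assuming SQLite, the stock table names and a simple start-of-day + 86400 day boundary; not a finished implementation):

    def patch_summaries(conn, last_patched_ts):
        """Sketch: patch each unpatched day's summaries, one day per transaction."""
        # Point 5: loop around a select of summary dates greater than the last patched date.
        days = conn.execute("SELECT dateTime FROM archive_day_outTemp "
                            "WHERE dateTime > ? ORDER BY dateTime",
                            (last_patched_ts,)).fetchall()
        for (sod_ts,) in days:
            # Point 8: one pass to check that the day used a single archive interval.
            lo, hi = conn.execute("SELECT MIN(`interval`), MAX(`interval`) FROM archive "
                                  "WHERE dateTime > ? AND dateTime <= ?",
                                  (sod_ts, sod_ts + 86400)).fetchone()
            if lo is None or lo != hi:
                continue          # mixed (or no) intervals: this day needs a rebuild, not a patch
            weight = lo * 60      # interval is stored in minutes
            # Point 6: one transaction per day.
            with conn:
                conn.execute("UPDATE archive_day_outTemp "
                             "SET wsum = wsum * ?, sumtime = sumtime * ? WHERE dateTime = ?",
                             (weight, weight, sod_ts))
                # ...repeat for each other archive_day_<obs> table...
                conn.execute("UPDATE archive_day__metadata SET value = ? WHERE name = ?",
                             (sod_ts, 'lastWeightPatch'))   # metadata name is illustrative only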

Gary, I'm humbled by how quickly you've picked up the twisted logic in
manager.py and the accumulators! I didn't do a very elegant job on this
stuff. My excuse is that I was constrained by history, which made it
unnecessarily complicated. Nice to know somebody else "gets it!"

gjr80 avatar gjr80 commented on July 29, 2024

Thanks for the feedback. I guess I should have got a bit more direction on what this was to look like, but no matter, I think all of the code is there, it just needs refactoring. Re the PR, it certainly was my intention to submit a PR when it was close to maturity; probably my lack of git skills, but I did not want to tie up my weewx clone with all of this code and prevent me from doing anything else. I will take on your comments and a PR will be coming.

In fact, on the topic of addressing your comments, I can cover a few now:

2. Along the same line, I didn't see any difference between the two versions of manager.py. Is that right?

You are right, my bad; a hangover from when I was considering the two approaches to where the weight factor was introduced. I had some test code in there that hung around for a while and was later removed, bringing manager.py back to what it originally was, but I never removed it from the repo.

  4. Do you have any data to support the premise that vacuum significantly shortens the processing time? If not, I'd just leave it out. Or, if you do have such data, then always (silently) do the vacuum.

When I was trying to track down the vastly different times to patch my simulator generated 10 year test file vs the same file after a drop daily/backfill, one of the things I did was vacuum, and on some smaller files I noticed a significant (more than just the usual noise) reduction in time to process. Only anecdotal; I'll have a closer look and see if I can get some better data.

  6. Suggestion to put the transaction context inside the loop over each day. This way, each day gets its own transaction. This will still be fast, while reducing memory requirements.

I originally had 1 daily summary per transaction. My 10 year simulator generated file would patch quite quickly, but the same file dropped/backfilled would take a lot longer (refer to an earlier post above). I put some code in to see where the time was being used and found it was the transaction; all the processing of the record, accumulator etc was happening just as fast in each case, but the time to do the transaction was way longer. I did some reading on SQLite and transactions and decided to lift it to 5 days per transaction; speed picked up quite a bit then.

  7. I couldn't follow any logic that left a day's summary untouched if the archive interval changed within the day. Seemed to me that the program requires that all intervals beyond the last patched record have the same interval. But, I could definitely be missing something!

Probably something to do with the approach I took; one of my main concerns was not leaving a database partly patched. That's why I do a check of the archive up front to see if we have any days with multiple distinct intervals. If I find more than 1 value (per day) I don't patch; if I only find 1 value per day then I go ahead and patch, knowing full well I won't have any interval 'issues'. That is probably why there is no logic to be seen that skips an individual day with multiple intervals. Let me know if I need to change this approach.

  8. The way I have selected for a value while checking that there is only one value (common situation when checking unit systems) is to do a select on the min and max value, then make sure they are the same. In this case, check for the minimum and maximum value of interval within a day, then make sure they are equal to each other. This requires only one pass through.

Thanks for the tip, must admit I wrote the query and at the time was concerned at how long it would take, it wasn't too bad on my 10 year file so I left it and moved on. Will revisit.

Gary, I'm humbled by how quickly you've picked up the twisted logic in manager.py and the accumulators! I didn't do a very elegant job on this stuff. My excuse is that I was constrained by history, which made it unnecessarily complicated. Nice to know somebody else "gets it!"

Hey, it impressed me. It took a while to sink in; as twisted as it may sound, writing a whole lot of sub-classed code for weewx-wd back in 2014 (? - weewx 2.x anyway, weewx 3.0 made it all redundant) actually helped, as one of the things I sub-classed was some of the accumulator code, so I needed to understand the base code. The one thing I picked up this time around was the use of accumulators when adding an archive record; previously I had a picture in my mind that it was loop data only - probably why I initially went down the loop rabbit hole on this issue. Plus this was good for keeping the mind active!

A few days down the Gold Coast this weekend so hopefully have a PR next week.

gjr80 avatar gjr80 commented on July 29, 2024
  5. Rather than calculate the last unpatched day, then perform date arithmetic for each day, why not just loop around a select statement that selects daily summaries "greater than" the last patched date? For each such date, you open up a transaction, patch the summary, then close and commit the transaction. Not only is it simpler, but it guards you against any missing day summary. It also more naturally follows the "intention" of the exercise: to patch the actual daily summary sitting in a database.

I am failing to join some dots here; are you talking about creating a generator that uses some SQL to yield daily summary dateTime values > last patched time, and then for each yielded value open, patch, then commit? If so, on what basis do I determine the dateTime values to be yielded - pick any table name? If so, will each table always have identical dateTime values, even for a schema that was customised some time after weewx was first installed?

Or have I misread your intention completely?

tkeffer avatar tkeffer commented on July 29, 2024

That's exactly what I'm thinking.

The dateTime for each summary table is always the start-of-day. For a given day, every summary table should have an entry --- that's the way the code works --- but, of course, it's always possible that something got screwed up. If you're concerned about that, you could handle each table separately. Then commit everything at the end of the day (or group of days, if you're including more than one day in a commit).

BTW, how are you going to detect the situation where the patch was applied sometime in the past?

gjr80 avatar gjr80 commented on July 29, 2024

Ok, happy with that. The case that sprang to mind was a user adding, say, newObs to the schema some time after weewx was first installed: would newObs' daily summary have entries for all days since weewx was installed, or just since newObs was added to the schema? If I was then to pick one of the daily summary tables as my source of timestamps to iterate over, would I need to avoid newObs' daily summary? outTemp is an obvious choice, but can I guarantee it will exist? As I have typed this I think a way ahead has formed in my mind, will try it out when I am home.

Was not considering doing tables individually; _get_day_summary() iterates through all the day summary tables and I did not want to re-invent the wheel. On second thoughts, though, I will have to see what consequences there might be in the newObs scenario outlined above.

Very briefly thought about detecting whether summaries have been patched. The date of the last day patched is no help; you won't know if all summaries were patched or just some. Maybe another field is required in archive_day__metadata that is set true when all records are patched; that would fail though if someone reverted a patched database to, say, weewx 3.4.0. Have to think more on this as I enjoy the beach!

tkeffer avatar tkeffer commented on July 29, 2024

... Maybe another field is required in archive_day__metadata that is set true when all records are patched...

That would work. Alternatively, update the pointer in the metadata table with every write to the daily summaries, so it's always up to date.

gjr80 avatar gjr80 commented on July 29, 2024
  6. Suggestion to put the transaction context inside the loop over each day. This way, each day gets its own transaction. This will still be fast, while reducing memory requirements.

How hard and fast are we with doing 1 day per transaction? I can see some value in it, but I believe it is also a significant bottleneck. On a VM I am taking around 900 seconds to patch a 10 year/5 min database when doing 1 day/transaction. The same database patches in 150 sec at 10 days/transaction and 220 sec at 5 days/transaction. Experience so far is that the likes of an RPi will be significantly slower, so unless there is a compelling reason otherwise, I believe the speed gained by modestly increasing the days/transaction is worth it.

Also, on the issue of vacuuming. I am finding that patching a vacuumed db takes between 1/2 and 1/3 the time of patching the same db unvacuumed. This was on a VM, but again, given the RPi case I think the speed increase is worth it. Will look at this further when I am using the RPi.

tkeffer avatar tkeffer commented on July 29, 2024

How hard and fast are we with doing 1 day per transaction?

Not at all. Go where the data leads you. I just wouldn't do everything in a single transaction either.

... vacuuming...

This surprises me. My understanding of vacuuming is that it compacts a database that has been heavily fragmented by lots of deletes. This does not describe a weewx database, which tends to grow from one end only, and, operationally, never deletes.

Vacuuming is also an expensive operation as it involves copying the entire database.

Any ideas why the dramatic speedup?

gjr80 avatar gjr80 commented on July 29, 2024

Not at all. Go where the data leads you. I just wouldn't do everything in a single transaction either.

Ok. Patching backfill_day_summary() to limit the size of the transactions to x days was a good learning experience, so I certainly won't be doing a single transaction.

Vacuuming is also an expensive operation as it involves copying the entire database.
Any ideas why the dramatic speedup?

Must confess to only a cursory glance over the vacuum docs. I have been using 2 databases in my development; the first (let's call it db1) is a 10 year/5 min database produced by running the simulator in generator mode. I have been using the resulting database untouched (ie no vacuum/rebuild etc). The 2nd database (let's call it db2) is a copy of db1, but having dropped then rebuilt the daily summaries. db1 is always significantly faster to patch than db2 (about 12 times faster). db1 and db2 both patch faster after a vacuum: db2 usually comes down to less than half the time to patch, while db1 only shows a marginal improvement.

I put the improvement after vacuuming db2 down to the fact that the daily summary tables had been dropped and then rebuilt; this was not based on any factual evidence though. Anecdotally, when db2 is patched the hard disk can be heard working very hard, with a lot of head movement; when patching db1 there is hardly any head movement noise. Patching either database after a vacuum has a small but noticeable amount of head movement noise.

Will do a bit more reading about vacuum. Does make me wonder how MySQL will perform.

gjr80 avatar gjr80 commented on July 29, 2024

Hmm, not a lot of data on vacuum that I would consider to be from a reliable source, other than the brief explanation at sqlite.org.

I wonder if the issue here is the 'fragmentation' of the SQLite file, how is this for a possible explanation of what I am seeing:

  • A freshly generated weewx db has an archive table that is generated archive record-by-archive record, and daily summary tables that are generated day-by-day but updated archive record-by-archive record. Since SQLite just pulls in a new page to store data in whenever it needs it, it would not be unreasonable to suspect that a day's archive data may be located close together, and similarly a day's summary data would be located close together. Consequently, patching of the daily summaries is fairly fast as the daily summary data has little fragmentation.
  • When you drop/rebuild the daily summaries, all of the space previously used by the daily summaries is marked as unused after the drop and is then re-used (plus who knows what other space) during the rebuild. During the rebuild the daily summaries do not go back exactly where they were before; rather they end up all over the place (ie fragmented), and hence patching a dropped/rebuilt db takes somewhat longer and there is much more noticeable disc activity.
  • If we then go and compact the database, SQLite (I suspect) goes through the db table by table, copying each table from the old db to the new db. So what you end up with is each table being contiguous, but a given day's worth of data being somewhat fragmented; given the daily summary tables are not that big, it is probably not too bad. Consequently, when we patch a compacted db we get an improvement over the rebuilt db, but it is not as fast as the freshly generated db.

This would tally with the times I have been observing and the relative amount of disc noise/activity.

tkeffer avatar tkeffer commented on July 29, 2024

Good theory. One way to test it would be to use a RAM or SSD drive. Fragmentation would have much less effect then.

gjr80 avatar gjr80 commented on July 29, 2024

Good theory. One way to test it would be to use a RAM or SSD drive. Fragmentation would have much less effect then.

Will get my spare RPi online soon and give it a try with a RAM drive, best I can do here.

Am close to having something for you to look at. I have retained class Patch() that I had in the earlier utility, all of the overhead associated with structuring the code as a utility has been removed. The code is approx 300 lines. To patch the summaries it is simply a case of creating a Patch object and then calling its run() method. I trust this meets the intent of the patch being more of a 'library' that can be invoked by a utility or weewxd. A couple of questions about structure/organisation:

  1. Where will the code be placed? manager.py, some other existing file (I don't see any other obvious home for it) or a new file?
  2. Rather than the old repo I was using, I will place it in my gjr80/weewx clone. I can see a bit more to'ing and fro'ing before it's complete; do you want a PR first up, or just notification here that it is good to look at?

gjr80 avatar gjr80 commented on July 29, 2024
  8. The way I have selected for a value while checking that there is only one value (common situation when checking unit systems) is to do a select on the min and max value, then make sure they are the same. In this case, check for the minimum and maximum value of interval within a day, then make sure they are equal to each other. This requires only one pass through.

Did some playing around with this last week and noticed COUNT DISTINCT was a good 7 to 8 sec faster on the RPi with a 10 year db. Did a bit more testing this morning and obtained these results on an RPi using a 10 year database:

Jul 11 09:48:11 rainbird weewx[17022]: Checking for multiple 'interval' values per day using MIN(`interval`) and MAX(`interval`)
Jul 11 09:48:11 rainbird weewx[17022]: Checking table 'archive' for multiple 'interval' values per day...
Jul 11 09:48:34 rainbird weewx[17022]: Successfully checked 3653 days for multiple 'interval' values in 22.38 seconds.
Jul 11 09:48:34 rainbird weewx[17022]: Checking for multiple 'interval' values per day using COUNT(DISTINCT `interval`)
Jul 11 09:48:34 rainbird weewx[17022]: Checking table 'archive' for multiple 'interval' values per day...
Jul 11 09:48:55 rainbird weewx[17022]: Successfully checked 3653 days for multiple 'interval' values in 21.29 seconds.

Not quite so telling today. The one thing I have noticed is that COUNT DISTINCT has been faster every time on every machine (RPi or VM). Just wanted to capture this before it gets lost in my logs. Interesting.
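
For reference, the two per-day checks being compared are along these lines (illustrative SQL only, table and column names as per the stock schema):

    # multiple interval values in a day if MIN and MAX differ
    CHECK_MINMAX = ("SELECT MIN(`interval`), MAX(`interval`) FROM archive "
                    "WHERE dateTime > ? AND dateTime <= ?")
    # multiple interval values in a day if the distinct count exceeds 1
    CHECK_DISTINCT = ("SELECT COUNT(DISTINCT `interval`) FROM archive "
                      "WHERE dateTime > ? AND dateTime <= ?")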

gjr80 avatar gjr80 commented on July 29, 2024

Have my spare RPi going again so have been doing some testing on it rather than a VM. All tests have been with 10 years of 5 minute data on an RPI 1 model B.

  • Checking for a unique interval value. Under MySQL, MIN/MAX and DISTINCT were consistently within 1 second of each other (28, 29 or 30 odd seconds). Under SQLite, DISTINCT was consistently 4 sec faster than MIN/MAX (28 compared to 32 odd seconds).
  • Vacuuming. Obviously no effect under MySQL. Under SQLite, vacuuming consistently took 260 odd seconds. The subsequent patching time was reduced by around 90 to 120 seconds (810 down to 670 seconds and 730 down to 640 seconds).
  • Transaction size. Under MySQL, handling 5 and 10 days of data per transaction gave near identical results, 1660 odd seconds. Increasing to 25 days dropped this to 1630 seconds. Under SQLite, 5 days took 1100 seconds, 25 days 810 seconds and 50 days 730 seconds.

Conclusions:

  • The method for checking for unique interval values is much of a muchness. Will stick with the MIN/MAX as used elsewhere within weewx.
  • Vacuuming may have given an overall net benefit under a VM but under an RPi it slowed the patch process overall. Since the vacuum code is only a handful of lines I will leave it in for now but it will default to False ie no vacuum.
  • Transaction size had a small effect under MySQL but there was a noticeable improvement under SQLite when moving from 5 to 50 days per transaction. Some issues have occurred in the past with transaction size and backfilling, in particular lower end machines such as the RPi have run out of memory. Given that 50 days worked fine on my 1st gen RPi and gave the best results I intend to set the default transaction size to 50 days. I did not test 50 days under MySQL and even though I think we are safe I will confirm that this has no adverse impact under MySQL.

gjr80 avatar gjr80 commented on July 29, 2024

A couple more points/what-ifs I wanted to capture to deal with later:

  1. Reverting weewx. A user who is running weewx with a patched database reverts weewx to a non-patched version (eg user upgrades from 3.5.0 to (let's say) 3.6.0, as part of the upgrade the daily summaries are patched, the user does not like some aspect of 3.6.0 and uninstalls 3.6.0 to then re-install 3.5.0). The user then has a database with partially patched summaries. Aggregates should be fine where they involve only 'weight patched' daily summaries or only 'non-weight patched' daily summaries, but I would expect incorrect results over any period that involves a mix of 'weight patched' and 'non-weight patched' summaries - so if the cutover occurred between midnight and one archive period after midnight on 1 January it would be fine, but otherwise it would generate some sort of error in some aggregates. Dropping/rebuilding the summaries after reversion would fix the problem. Is it good enough to mention this somewhere in the docs/wiki? Will it occur often enough that it even needs to be considered?
  2. Additional databases. Most users who record additional obs that are not in the default schema just recycle/rename existing db fields. Summaries for such additional data would be 'patched' by patching the binding used by [StdArchive]. Some users use additional databases to record obs that are not in the default schema. In these cases, patching the binding used by [StdArchive] will not patch their additional database(s). One approach is to leave it to the user to take care of any such databases; another approach is, rather than just applying the patch to the [StdArchive] binding, to apply the patch to each binding listed in [Bindings] that uses weewx.manager.WXDaySummaryManager as its manager.

matthewwall avatar matthewwall commented on July 29, 2024

were the patches in this thread applied for the weewx 3.6 release?

gjr80 avatar gjr80 commented on July 29, 2024

were the patches in this thread applied for the weewx 3.6 release?

No.

tkeffer avatar tkeffer commented on July 29, 2024

I guess this project got forgotten over the summer and was not brought to a conclusion.

It will involve big changes, so I think it best to devote a release --- perhaps 3.7? --- to it. We need to test it thoroughly.

Let's get v3.6 out then worry about it.

gjr80 avatar gjr80 commented on July 29, 2024

Certainly my intention to keep working on this once 3.6 is released - I will undoubtedly need guidance.

gjr80 avatar gjr80 commented on July 29, 2024

A quick update after my VM debacle. It seems I have been able to essentially recover the code from a GitHub repo I was using. The only issue is that the code to do the patching of the daily summaries is in 'wee_xxxxx' utility format rather than as a library. This is largely a restructuring exercise but will also (I suspect) need me to trawl through the comments since 30 June to pick up any changes since.

For the time being I will work in the weighted_summaries branch of my weewx fork.

tkeffer avatar tkeffer commented on July 29, 2024

How are you coming on this? Is the weighted_summaries branch at a state where I can take a look?

I find it curious that this bug has been in weewx for years, yet not one single person has complained!

I wonder if we couldn't just quietly add the weighting, but not bother with the patching? I don't know how serious I am about this, but it sure would avoid a lot of complexity. Optionally, we could take what you've done, but package it as a separate utility, so it won't clutter the main code base. A user would run it only if s/he had a need.

gjr80 avatar gjr80 commented on July 29, 2024

I think I understand why it was not noticed; well, I have a few theories. The accumulators and their management are one of the more complex pieces of code, and I suspect few understand it to the level that they would notice the problem by looking at the code. In terms of the results, weighting the observations going into the accumulators will affect the stats, but I expect not too many have looked too closely at the results, and besides, the change in results between weighted and unweighted will be, I suspect, in most cases fairly small. Finally, there are probably not that many users with significant changes in interval values in their archive.

As for patching, the engineer in me says that we should not be perpetuating dodgy data. I guess the user who has homogeneous interval values throughout really doesn't need weighting, and never will while he has a single interval value. I think those that do need patching are those that may have brought data with them from another app/source; my wee_import experience tells me this data is often all over the place. And then this is tempered by me being spoilt with a Davis, so my intervals are always homogeneous. Tough decision I guess; I think I would still prefer to patch.

Back to the code. I believe I have the code back to where it was before my VM wipeout. The change to weight the summaries is trivial and was written ages ago. Given your comment from 30 June:

Refactor the code so that it is more like the "backfill" code in
structure. That is, the functionality is present as a library, which may or
may not be invoked by a utility. This way, if needed, weewxd can patch the
daily summaries on startup. If we depend on invoking a utility, it may not
get done.

I refactored the patch code using the factory skills I learnt from wee_import. I have created an abstract class Patch from which I create class WeightPatch that applies the patch. Probably more complex than is needed, though it was a good exercise and I thought the structure might be useful in the future. Certainly weewx version x.y.z into the future is going to need to be able to patch, since you don't know what version it will be upgrading from. Any piece of code that wants to apply the patch just calls the factory method and then the resulting object's run method.

The patch patches all 'days' that have a unique interval field. This raises the question of what to do with days that have multiple different interval values. These cannot be patched by fiddling the daily summaries only; rather, they need rebuilding from all of the archive records for that day. Slower, and different calculations. Maybe there is a need for a second pass?

I also have some questions in the back of my mind. I was previously hung up on the user who, say, upgrades from 3.5.0 to 3.7.0 and the patch is applied. What happens if he reverts to 3.5.0? Now that I have thought about it more, I think it is not an issue: 3.5.0 will continue to add unweighted summaries (so he will have a mix of weighted and unweighted), and these will still provide valid results. When he does eventually upgrade to 3.7.0+ the patch will only patch the unpatched summary records, due to the flag in the metadata. If he drops and rebuilds under 3.5.0 the flag is gone and a future upgrade will patch everything.

The other issue is users with additional databases. I guess the counter here is that the patch structure means you can easily apply the patch to a different binding with only a few lines of code so the developer of the additional database can easily put something together to patch them, though it won't be automatic during the upgrade. Or do we patch all of the bindings in weewx.conf? Easy to do. Just wild thoughts now.

I have waffled, but to pull it back to your questions: I have the code in the weighted_summaries branch of my fork here. I would appreciate your thoughts on the overall structure of the patch. Am happy to push it into a branch of weewx/weewx if required. As for a separate utility, I am happy to do that; if you recall, that was the original structure I used. I am happy to keep working on this whatever the direction; we are coming into summer and I expect some days that are just too hot to do anything, so I will have time on my hands - and cricket is not sounding too good this year! And we won't be seeing a 3.7 for a bit yet, will we?

tkeffer avatar tkeffer commented on July 29, 2024

gjr80 avatar gjr80 commented on July 29, 2024

File user/patch.py:

  1. Doesn't the import
    import user.patch
    cause a circular import?

Quite right; I suspect that was a remnant from when I split the files.

  1. The factory pattern, while powerful, is ideally suited for runtime
    binding. That is, you don't know what it is going to resolve to until
    runtime. In this case, you're probably going to know what patch you are
    going to apply before you even invoke the program. Indeed, the little
    snippet at the bottom of weightpatch.py hardwires what it wants the factory
    to produce ('WeightPatch'). I would suggest using the functionality
    provided by Patch in a simple "has a" relationship. That is, treat Patch like
    you would treat a library: it has a bunch of useful utilities.

Boy with a new toy I guess. I must admit it felt a little odd telling the factory what I wanted. class Patch does not do a great deal, though it does set up a few things including logging. In fact, basing class WeightPatch on class Patch achieves the exact same effect without the overheads of a factory.

File user/weightpatch.py:

  1. Something like

    try:
         self.trans_days = int(patch_config_dict['trans_days'])
    except:
         self.trans_days = 5
    

can be simplified to

   self.trans_days = int(patch_config_dict.get('trans_days', 5))

It also has the advantage of not using a blanket "except" clause.

Don't know how I missed that, I know that.

  1. I'm not quite seeing the logic of whether or not patching is needed
    (line 151):

    if np_ts is None or self.dbm.last_timestamp > np_ts :
    

As I understand it, np_ts is the last timestamp of the last day that has
been patched, while db.last_timestamp is the timestamp of the last record
inserted into the database.

As new records are inserted into the database by weewxd, last_timestamp
will increase, but np_ts will not. Don't you need a change to
DaySummaryManager to automatically increase the last patched time stored in
the metadata?

Ahhh, technically yes, np_ts is the last timestamp of the last day that has been patched, but I think of np_ts as the timestamp of the next day to patch (all timestamps are daily summary timestamps, i.e. start of day). If np_ts is None (i.e. there is no patch metadata) then we need to patch, because nothing has been patched. We also have 'days' to patch if there are records in the archive that are timestamped sometime on the 'next day to patch'; that is where dbm.last_timestamp comes into it.

Yes dbm.last_timestamp will increase as records are added but I guess I am making an assumption here that this patch is being applied either while weewx is not running or before the main loop starts.

Another approach would be to store the last patched time in the metadata as
a guard against the patching program prematurely aborting. But, once it's
done, delete it, then set a version number in the metadata. Then, all the
patch program has to do is check the version number to see whether this
database has been patched. In this case, DaySummaryManager doesn't have to
do anything: the last patched date is there just as a tool to prevent
partially patched databases.

A database is either not yet patched (no version number present), or completely patched (version number present). Simple.

Just having typed that last sentence I realise the error of my ways, and the reasoning behind your comments is now apparent. Once the patch is applied, the only enduring metadata I have implemented is the 'last day patched timestamp'. If that is all we have then it needs to be kept up to date. I think I see where I went astray: you first mentioned a 'patched' flag in the metadata, and then later you raised the possibility of the patch being aborted so we needed a 'progress' flag in the metadata. I have implemented the latter but lost sight of the former.

Ok, I think I am back on track now, a bit of rejigging and adding that metadata flag and it should look a lot better.

from weewx.

gjr80 avatar gjr80 commented on July 29, 2024

A few more questions:

  • I presume the weight patch code will need to be in the code base for quite some time into the future, where will we place the file(s), bin/weeutil? bin/weecfg?

  • Is it a fair assumption that this patch will only be applied when weewx is stopped or before the main packet loop is started? Just want to know whether we need to deal with a potential new archive record between the patch being started and it being finished.

  • Dealing with multiple intervals per day. At the moment only databases that have a homogeneous interval field in the archive table will be patched (see the sketch after this list for the check I have in mind). I know there will be users that fall into the 'non-homogeneous interval' camp so I need to deal with them. I have only come up with 2 approaches to these cases:

  1. Drop/rebuild the daily summaries (probably easiest to code but not necessarily quickest)
  2. Patch those days that are homogeneous and rebuild those days that are non-homogeneous (more complex code; the speed benefit of patching may be lost in the extra control/processing of rebuilding selected days). Could be implemented in one pass (more complex) or two passes.
    Don't know if there are any other workable approaches?
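
For what it's worth, the 'homogeneous day' test I have in mind is just a distinct count on the interval column, something like this (a sketch only; sqlite-style placeholders, and interval is quoted because it is a reserved word in MySQL):

def day_has_homogeneous_interval(cursor, start_ts, stop_ts):
    # Sketch only: assumes the standard weewx 'archive' table and the usual
    # (start_ts, stop_ts] convention for a day's records. A day with no
    # records is treated as homogeneous.
    cursor.execute("SELECT COUNT(DISTINCT `interval`) FROM archive "
                   "WHERE dateTime > ? AND dateTime <= ?",
                   (start_ts, stop_ts))
    return cursor.fetchone()[0] <= 1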

from weewx.

matthewwall avatar matthewwall commented on July 29, 2024

relatively minor nit, but could you avoid the name 'patch' for classes? patch is such a generic word - appropriate for an instance or a method on a class, but rather opaque when used as a class name.

from weewx.

matthewwall avatar matthewwall commented on July 29, 2024

gary, my impression is that there are two places where this code will be executed: (1) during normal operations, applied to new data before it hits the daily tables, and (2) invoked when rebuilding the daily tables. do i have that right? are there other applications?

from weewx.

tkeffer avatar tkeffer commented on July 29, 2024

I presume the weight patch code will need to be in the code base for quite some time into the future, where will we place the file(s), bin/weeutil? bin/weecfg?

Yes, it will. The directory bin/weecfg seems like a reasonable place.

Is it a fair assumption that this patch will only be applied when weewx is stopped or before the main packet loop is started? Just want to know whether we need to deal with a potential new archive record between the patch being started and it being finished.

I see it applied during the install of a new version. Think of it as a way of bringing a legacy install "up to date." Indeed, there's already a lot of logic like this in weecfg/__init__.py.

Dealing with multiple intervals per day. At the moment only databases that have a homogeneous interval field in the archive table will be patched. I know there will be users that fall into the 'non-homogeneous interval' camp so I need to deal with them.

I'd say drop/rebuild the daily summaries. I don't think it's worth you spending a few days building/testing code to cover a handful of users when there's such an easy workaround.

from weewx.

gjr80 avatar gjr80 commented on July 29, 2024

relatively minor nit, but could you avoid the name 'patch' for classes? patch is such a generic word - appropriate for an instance or a method on a class, but rather opaque when used as a class name.

Sure, am certain there will be more changes yet.

gary, my impression is that there are two places where this code will be executed: (1) during normal operations, applied to new data before it hits the daily tables, and (2) invoked when rebuilding the daily tables. do i have that right? are there other applications?

The 'patch' if you like consists of 2 parts: (a) modification of the accumulator code such that weighted sum fields in the daily summaries actually are weighted and (b) recalculation of all existing daily summary tables so the weighted sums are weighted. So yes, your impression is correct, but only in so much as (a) would be applied in the cases you outline. (b) would be applied at some stage (it seems during a version upgrade) to bring a legacy database up to date.
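
To make (a) concrete, this is roughly what the accumulator change amounts to (a sketch only, not the actual weewx accumulator code):

class ScalarAccumSketch(object):
    """Sketch of a daily summary accumulator for one observation type."""
    def __init__(self):
        self.sum = 0.0      # plain sum
        self.count = 0      # plain count
        self.wsum = 0.0     # weighted sum
        self.sumtime = 0.0  # sum of the weights

    def add_sum(self, value, weight=1.0):
        # Before the fix the weight was effectively always 1.0; with the fix
        # it is based on the record's interval, so records from different
        # archive intervals contribute in proportion to the time they cover.
        if value is None:
            return
        self.sum += value
        self.count += 1
        self.wsum += weight * value
        self.sumtime += weight

    def avg(self):
        # The weighted average stays correct even if the interval changes.
        return self.wsum / self.sumtime if self.sumtime else None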

from weewx.

gjr80 avatar gjr80 commented on July 29, 2024

I see it applied during the install of a new version. Think of it as a way of bringing a legacy install "up to date." Indeed, there's already a lot of logic like this in weecfg/__init__.py.

Good, I can ignore the possibility of new archive records arriving.

from weewx.

gjr80 avatar gjr80 commented on July 29, 2024

After my git faux pas I think I now have a weighted_sums_(#61) branch in the weewx/weewx repo with the latest code.

Summary of changes:

  • have renamed the classes so they are not generic
  • have implemented drop/rebuild if any one day has multiple different interval values
  • removed factory pattern code, now just derive from base class
  • added a dailySummaryVersion metadata field, set it to 1 after patching (should this be 2? but then there was no 1?)
  • moved patch.py and weightpatch.py from bin/user to bin/weecfg
  • fixed a lot of inefficient config dict lookups
  • numerous changes to logging, mainly what, where and when

Something I am not confident I have right is the logging. It seems to me that when this patch is applied its application should be logged by syslog. But the patch is being applied during a version upgrade which typically is not logged by syslog but rather 'logged' to screen only. To this end I have created a class to support the logging of the patch to both screen and syslog. I don't know that my class is particularly efficient or future proof (was there recent talk (Matthew?) in one of these issues (wee_import?) about changing the approach to logging?).

A note on invocation. At the moment patch is invoked by creating a patch config dict:

patch_config_dict = {'name': 'Weighted Sum',
                     'binding': None,
                     'vacuum': False,
                     'trans_days': None,
                     'verbose': True,
                     'dry_run': False,
                     'log_results': True,
                     'required_weewx': '3.6.1'}

name is the only mandatory field; all others will default to sane values. It's then a case of creating a DatabasePatchLog object, then a WeightedSumPatch object, and executing its run() method.
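
In other words, something like this (a sketch only; the module paths and constructor arguments are assumptions and may well change):

# Sketch only - module locations and constructor signatures are assumptions.
from weecfg.patch import DatabasePatchLog
from weecfg.weightpatch import WeightedSumPatch

# config_dict is the parsed weewx.conf
patch_config_dict = {'name': 'Weighted Sum'}     # everything else defaults

log = DatabasePatchLog(patch_config_dict)
patch = WeightedSumPatch(config_dict, patch_config_dict, log)
patch.run()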

I have it working on a couple of test databases but have yet to properly stress it on some oddball cases since these latest changes.

from weewx.

tkeffer avatar tkeffer commented on July 29, 2024

Comments

Because an exception of type WeightedSumPatchDbError is raised when a general exception of unknown type occurs, you probably don't want to catch it. Right now, the code keeps going if it encounters such an exception. I would get rid of WeightedSumPatchDbError and, instead, have the except clause (line 487) just log the exception, then reraise it.

Similar comment for the except clause at line 257. If you change the type of the exception, you're losing information. Just log and reraise the exception.
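
In other words, something like this (just to illustrate the pattern; the function is a stand-in for the real work):

import syslog

def patch_one_day(day_ts):
    # placeholder for the real database work
    raise ValueError("simulated failure for day %s" % day_ts)

try:
    patch_one_day(1483228800)
except Exception as e:
    # log enough to diagnose the problem...
    syslog.syslog(syslog.LOG_ERR, "weightpatch: patch step failed: %s" % e)
    # ...then reraise the original exception, so no information is lost
    raise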

While we're on the subject of exceptions, blanket catches are also risky:

        try:
            return _row[0]
        except:
            _msg = "'interval' field not found in archive day %s." % (span, )
            raise weewx.ViolatedPrecondition(_msg)

Better to do:

        try:
            return _row[0]
        except IndexError:
            _msg = "'interval' field not found in archive day %s." % (span, )
            raise weewx.ViolatedPrecondition(_msg)

Stuff like

        try:
            self.trans_days = int(patch_config_dict.get('trans_days', 50))
        except TypeError:
            self.trans_days = 50

is iffy. If you get a TypeError, obviously something is terribly wrong. Better to just let the exception go (leading to program abort).

How are you planning on invoking the patching routine?

I should have a chance to test this later today or tomorrow.

from weewx.

gjr80 avatar gjr80 commented on July 29, 2024

Branch updated with above changes. I did include the 'trans_days' exception so that if one, not particularly important, option was bad we could use a sane default with the expectation that if anything else was bad (and it was critical) it would be picked up elsewhere. I guess since we are going to be hard coding the patch config dict it doesn't really matter.

I should have a chance to test this later today or tomorrow.

I have committed these changes this morning with no testing other than one run, and won't get back to testing until Sunday (day at the racetrack (horses) with wife's swimming friends today).

How are you planning on invoking the patching routine?

Um, ah, I thought you or Matthew had this part sorted....
I have not given this much thought, OK, some wild 7am-ish thoughts out aloud...
Previous mention above of invoking the patch was along the lines of it being part of an upgrade (at least that was the thought I had in my head). I had a look at setup.py and did not like what I saw; it seems way too difficult. Another thought: could there be a call put in engine.main(), say somewhere after the debug flag is set (line 833) and before the engine is started, that calls a function that runs any patches that may be relevant? This function would live in bin/weecfg/patch.py and take care of putting together the patch config dict and invoking the patch. This would be run every time weewx is started, but I guess that has the advantage of picking up the case where the user downgrades or otherwise reverts his db. Like I said, wild early thoughts.

from weewx.

gjr80 avatar gjr80 commented on July 29, 2024

Actually, second coffee and another couple of hours of being awake... Why not put the call in StdEngine.preLoadServices? I can't think of any reason why the services need to load before we patch; if they do, then put the call in StdEngine.postLoadServices.

from weewx.

gjr80 avatar gjr80 commented on July 29, 2024

I now have a function apply_patches being called during engine.preLoadServices. As far as I can tell it behaves well. I had some of what I call 'chicken and egg' import issues with the old file structure (patch.py and weightpatch.py), so I have moved all patch-related code into patch.py. As I suspected, my log class caused havoc with weewx logging, so I have reverted all logging to standard syslog calls. Made a second pass of what is LOG_INFO and what is LOG_DEBUG. I would like to think it is at least clear in the logs now as to what is being done by the patch when debug=0 and debug=1.
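
For reference, the hook amounts to something like this (a heavily abbreviated sketch; the real preLoadServices does more and the apply_patches signature may yet change):

# In bin/weewx/engine.py (sketch only):
import weecfg.patch

class StdEngine(object):
    # ... (rest of the engine unchanged) ...
    def preLoadServices(self, config_dict):
        # check for, and apply, any outstanding database patches before the
        # services (and therefore StdArchive) get to touch the database
        weecfg.patch.apply_patches(config_dict)
        # ... the existing preLoadServices work continues here ...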

I will leave it now for your review. I do want to work on a means for other databases that use daily summary tables (eg weewxWD) to easily apply the patch as well.

from weewx.

tkeffer avatar tkeffer commented on July 29, 2024

When running weewxd, I'm getting an error:

kinglet-air:weewx tkeffer$ ./bin/weewxd /Users/weewx/weewx.conf
Traceback (most recent call last):
  File "./bin/weewxd", line 14, in <module>
    import weewx.engine
  File "/Users/tkeffer/git/weewx/weewx/bin/weewx/engine.py", line 26, in <module>
    import weecfg.patch
  File "/Users/tkeffer/git/weewx/weewx/bin/weecfg/__init__.py", line 21, in <module>
    from weewx.engine import all_service_groups
ImportError: cannot import name all_service_groups

I think this is because there is a circular import of weewx.engine.

from weewx.

gjr80 avatar gjr80 commented on July 29, 2024

When running weewxd, I'm getting an error:

Interesting, not giving me problems but then I have been running weewx as a daemon not directly. Am about to step out the door for a day/night at the cricket then a few days down the coast. So will get into it when back home. And then we are just about into Christmas...

from weewx.

gjr80 avatar gjr80 commented on July 29, 2024

I think this is because there is a circular import of weewx.engine.

(When I went away last weekend I downloaded pdfs of all python.org 2.7.13 documents onto my ipad and spent the weekend reading most of the ones I don't normally use. The really creepy thing is that there in the FAQ document is an entry about 'best practices for importing' and they covered circular imports.)

Now that I look back I did run across this issue earlier and I made a change (that I did not push) to weecfg.__init__ to overcome the issue. I changed the offending import statement from:

from weewx.engine import all_service_groups

to

from weewx import all_service_groups

Now I don't pretend to fully understand all of the vagaries of imports, but when I went looking in engine.py for all_service_groups all I found was from weewx import all_service_groups and the use of all_service_groups at line 132, so I just changed the weecfg.__init__ import as above. I know the other way to avoid circular imports is to use a straight 'import x' rather than 'from x import y'. My change was made before I read the article on circular imports, so I am not sure if the above change is acceptable or if there is a better approach.
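
By way of illustration, a toy example (nothing to do with weewx) of why the plain 'import x' form tends to survive a circular import while 'from x import y' may not:

# a.py
import b                 # binds the module object only; b may still be
                         # half-initialised at this point, and that is fine

def something():
    return "from a"

# b.py
import a

def use_a():
    # the attribute lookup happens at call time, long after both modules
    # have finished importing, so the circular import does no harm
    return a.something()

# whereas 'from a import something' at the top of b.py would fail if a had
# not finished executing when b was first imported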

from weewx.

tkeffer avatar tkeffer commented on July 29, 2024

I think we are getting very close.

I got the patch to work by changing the import slightly:

diff --git a/bin/weecfg/__init__.py b/bin/weecfg/__init__.py
index 06a99b5..7482178 100644
--- a/bin/weecfg/__init__.py
+++ b/bin/weecfg/__init__.py
@@ -18,7 +18,7 @@
 import configobj
 
 import weeutil.weeutil
-from weewx.engine import all_service_groups
+from weewx import all_service_groups
 
 minor_comment_block = [""]
 major_comment_block = ["", "##############################################################################", ""]

After that, the patching went very quickly. Maybe 20 seconds on my MacBook Air for 10 years of data.

There are two things I would change:

  1. There is no need to add a new versioning number in the metadata table archive_day__metadata because there already is one (called Version). Just bump the number from 1.0 to 2.0 after the patching is done.

  2. Perhaps we could optionally invoke the patching from wee_database?

from weewx.

gjr80 avatar gjr80 commented on July 29, 2024

I got the patch to work by changing the import slightly:

Ok, so that was the same change I made (my last) but communicated much better!

  1. There is no need to add a new versioning number in the metadata table archive_day__metadata because there already is one (called Version). Just bump the number from 1.0 to 2.0 after the patching is done.

Well, I didn't know about that. I just had a quick look through manager.py and I see where Version is set; it defaults to 1.0 when creating the daily summary tables. Do we now need to default that to 2.0 if the weewx version is >= the version that introduced the patch? It's a bit early in the morning here, so I may have misread this completely.

  2. Perhaps we could optionally invoke the patching from wee_database?

That should be easy to do, something like --apply-weight-patch? You guys come up with much better names than I.

Will get onto these over the next day or so.

from weewx.

tkeffer avatar tkeffer commented on July 29, 2024

Correct. If Version < 2, then apply patch.

Your name --apply-weight-patch seems fine to me!

from weewx.

gjr80 avatar gjr80 commented on July 29, 2024

I was thinking more about the DaySummaryManager.__init__() code that initialises the day tables. My reading of the code is that Version is set to 1.0 on creation of the tables. Thinking aloud: the patch is checked for/applied in StdEngine.preLoadServices(), so on a new install (say 4.0) there would be no databases at this point in the startup (StdArchive creates them), so no patch is applied. weewx starts normally, with Version = 1.0, and the day summaries are being weighted. On the next restart the patch checks, finds Version == 1.0 and patches the day summaries, so they have now effectively been double patched.

Is it a case of, in DaySummaryManager.__init__(), now conditionally setting Version = 2.0 based upon the weewx version (say the patch is included in 3.7.0):

if weewx_version >= 3.7.0:
    set Version = 2.0
else:
    set Version = 1.0

Or do we change the point at which the patch is applied to after the db (more specifically, the day summaries) has been created? Or have I misread the code?

from weewx.

gjr80 avatar gjr80 commented on July 29, 2024

Silly me, no need to check the weewx version, just default 'Version' to '2.0'.

from weewx.

tkeffer avatar tkeffer commented on July 29, 2024

Exactly. Newly created daily summaries, this time with the weights calculated correctly, are by definition V2.0. The job of the patch is to do the "upgrade," taking the database from V1.0 to V2.0

from weewx.

tkeffer avatar tkeffer commented on July 29, 2024

Thanks, Gary.

  • Version numbers. I'm pretty sure that daily summary version numbers are not being used.
  • Testing. It's not much of a test, but I tried running wee_reports on my 10 years of unpatched data, then ran it again on the patched data, then compared the differences. No difference. So, at least no harm is done! 👍
    Not sure what you mean by testing "... db and weewx versions."
  • Documentation. Also need a comment in the Upgrade Guide.

from weewx.

gjr80 avatar gjr80 commented on July 29, 2024

from weewx.

tkeffer avatar tkeffer commented on July 29, 2024

Probably v3.7.

If a patched database is run under weewx V3.5.0, it should just "pollute" the database with now improperly weighted data. I don't see that we can do anything about that. Nor should we.

from weewx.

gjr80 avatar gjr80 commented on July 29, 2024

As I said I have been doing some testing across versions.

Assume the patch is introduced in 3.7.0. The case that concerns me is where the user upgrades to 3.7.0 or later and the database is patched. The user finds that this new version has caused some unrelated issues, so after a couple of weeks the user reverts to a pre-3.7.0 release. What effect does this have on the integrity of his data?

Based on a desktop analysis I think the answer is it depends.

  • if the user cut-over to 3.7.0 on a midnight archive boundary and then reverted to the earlier version again on a midnight archive boundary and he has an unchanging interval then his data integrity will be retained (some daily summary rows will be weighted and some won't but his aggregates will be true and correct). I think this case is somewhat unlikely.
  • If either upgrade or reversion did not occur on a midnight archive boundary then there will be either 1 or 2 days where the daily summary for that day will be an amalgamation of weighted and unweighted archive records. Data integrity will have been lost for these 1 or 2 days. This case is likely.
  • If the user has a non-homogeneous interval then his daily summaries are true and correct up until he reverts to the pre-3.7.0 release but any daily summaries for the 'day of reversion' and later will be unweighted and, even if a later upgrade is undertaken, the daily summaries from the 'day of reversion' to the day of re-upgrade will be unweighted and never will be weighted. Those unweighted daily summaries will have lost their data integrity.

At the moment the 'patch' algorithm/logic that is applied during weewx startup is as follows (see the sketch after this list):

  • if Version meta-data field is 2.0 or greater the database is deemed to be patched and the patch is skipped
  • if Version is None or < 2.0 the lastSummaryPatched meta-data field is checked and the patch is applied depending on the lastSummaryPatched value:
    • if it is None then all daily summary rows are patched
    • if it has a valid timestamp all daily summary rows after this date are patched
  • The patch is applied either via a simple weighting of the daily summary fields (if each day has a homogeneous interval value) or by dropping/rebuilding if one or more days have non-homogeneous interval values
  • The lastSummaryPatched field is only ever populated with a daily summary timestamp (i.e. midnight) and never anything in between. Once the patch is applied the lastSummaryPatched meta-data field is never used again.
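
Expressed as rough (pseudo-)code the above logic is (the helper names are made up for illustration, not the real API):

def daily_summaries_need_patching(dbm):
    version = read_metadata(dbm, 'Version')        # made-up helper
    return version is None or float(version) < 2.0

def apply_weight_patch(dbm):
    last_patched = read_metadata(dbm, 'lastSummaryPatched')
    if last_patched is None:
        days = all_summary_days(dbm)               # nothing patched yet
    else:
        days = summary_days_after(dbm, last_patched)
    if any_day_has_mixed_intervals(dbm, days):
        drop_and_rebuild_daily_summaries(dbm)      # the fallback path
    else:
        for day_ts in days:
            reweight_daily_summary(dbm, day_ts)
            write_metadata(dbm, 'lastSummaryPatched', day_ts)
    write_metadata(dbm, 'Version', '2.0')          # patched; never patch again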

To deal with the reversion case above a slight change to the logic/patch algorithm would maintain the integrity of the data no matter how much chopping and changing occurred:

  • instead of having a lastSummaryPatched meta-data field change it to lastPatched_ts (or something else).
  • if Version meta-data is 2.0 or greater then the database is only deemed to be patched if lastPatched_ts >= Manager.lastGoodStamp. Otherwise the database still needs patching.
  • if the database needs patching the lastPatched_ts meta-data is checked as before and the patch is applied depending on the lastPatched_ts value:
    • if it is None then all daily summary rows are patched
    • if it has a valid timestamp all daily summary rows after and including the one containing lastPatched_ts are patched
  • The patch is applied either via a simple weighting of the daily summary fields (if each day has a homogeneous interval value) or by dropping/rebuilding if one or more days have non-homogeneous interval values. The difference is that the daily summary row containing lastPatched_ts (if lastPatched_ts is not None) is weighted by backfill.
  • The lastPatched_ts field continues to hold the daily summary row ts as we work through patching the summaries, but once the patch is applied lastPatched_ts continues to be used and is updated with the timestamp of each archive record as it is stored to the archive.

I guess there are a few issues with this change:

  • need to develop selective backfill (issue #117) but I believe that is pretty trivial, it's already halfway implemented - well 'fitted for but not with'!
  • we are now updating the lastPatched_ts meta-data field each time an archive record is stored
  • added complexity

I know the added complexity is a big issue, but I can see some users having issues with their data from chopping and changing, and I don't want the situation I have seen with other weather software where the underlying data ends up being a bit of a dog's breakfast, with the user oblivious to it all.

from weewx.

raymondelooff avatar raymondelooff commented on July 29, 2024

I think that the solution to this problem is getting way too complex. As mentioned by @tkeffer on June 21, you could just rebuild the summaries. Yes, that would take a long time on less powerful machines, but I genuinely think that rebuilding the summaries won't cause more issues than the patch. Just add a note to the release notes that the new version will trigger a full regeneration of summaries and that it may take some time.

I really prefer a clean database scheme, without extra values for the patch to work properly. I'm running weewx on a Raspberry Pi model B with 918 days of data at the moment, so I know the regeneration will take some time. But to be honest, I don't really care how long it would take to regenerate the summaries.

And most important of all: make sure weewx doesn't mess up the data integrity. Keep it simple! :)

from weewx.

tkeffer avatar tkeffer commented on July 29, 2024

Lot to comment on!

I'm really not worried about the case where a user updates to V3.7, triggering the patch, then tries to revert to an earlier version. I don't think this will happen very often (if at all) and the consequences are not that severe. The real database-of-record, archive, will be untouched. It's just the daily summaries that will be off and, even then, it will only be the averages that are bad. If the user doesn't like that, s/he can always rebuild the daily summaries using the old version of wee_database.

The time it takes to do a backfill is a real problem. With an RPi2, it takes about an hour to rebuild the daily summaries of my database. Admittedly, it is larger than most (200MB, with 10 years of data), but then, an RPi2 is more powerful than what a lot of users are using.

The present solution (now merged into branch development; please continue work there) is a reasonable blend of efficiency and simplicity.

from weewx.

gjr80 avatar gjr80 commented on July 29, 2024

The present solution (now merged into branch development; please continue work there) is a reasonable blend of efficiency and simplicity.

OK

from weewx.

tkeffer avatar tkeffer commented on July 29, 2024

I'm thinking that the name for the wee_database option to apply the patch should be renamed to something more general. Instead of --apply-weight-patch, how about --update-database?

Before starting, it should also warn the user that this will be an irreversible change.

from weewx.

gjr80 avatar gjr80 commented on July 29, 2024

I'm thinking that the name for the wee_database option to apply the patch should be renamed to something more general. Instead of --apply-weight-patch, how about --update-database?

I see the case for making it more general, but is --update-database a little too general? How about --patch-database (sorry, but remember rule 1 - I am no good with names)?

Before starting, it should also warn the user that this will be an irreversible change.

Good point, I will make it so.

from weewx.

gjr80 avatar gjr80 commented on July 29, 2024

Before starting, it should also warn the user that this will be an irreversible change.

Included in commit ab5c7d3

Given the last few comments I have not yet changed any reference to 'weight-patch' in wee_database.

i'm still not sure why we need a separate option. won't --backfill-daily do it?

Yes, --backfill-daily will do it, but I see it as an option of last resort. Every time (well, virtually every time) you do --backfill-daily you are losing granularity from your summary data (a rebuild can only use the archive records, whereas the original summaries were accumulated from loop data).

from weewx.

tkeffer avatar tkeffer commented on July 29, 2024

Thinking out loud here...

How about getting rid of --string-check and subsuming it into an all-purpose --check-database? This would check for strings, plus the correct version. Adding --fix would upgrade the database.

from weewx.

matthewwall avatar matthewwall commented on July 29, 2024

i think that --fix is too general. it probably should be --fix-bogus-strings or --remove-unicode-strings or something to that effect.

i can imagine a --check-database that does --check-strings plus any other 'check' operation that we might implement. but it is probably still useful to have an entry point for each type of check.

(note that --string-check should be --check-strings in order to be consistent with the verb-precedes-modifier of all the other actions)

from weewx.

tkeffer avatar tkeffer commented on July 29, 2024

To summarize, wee_database would have options:

  • --check: Check the integrity of the database, version number, etc.
  • --fix-strings: Fix any strings in the database
  • --upgrade: Upgrade the database to the latest-and-greatest

Any mutating options would warn the user before proceeding.

from weewx.

tkeffer avatar tkeffer commented on July 29, 2024

When I run wee_database, I get the error:

tkeffer@nuc:~/git/weewx$ ./bin/wee_database --apply-weight-patch /home/weewx/weewx.conf
Traceback (most recent call last):
  File "./bin/wee_database", line 646, in <module>
    main()
  File "./bin/wee_database", line 162, in main
    if StrictVersion(weewx.__version__) < StrictVersion(REQUIRED_WEEWX):
  File "/usr/lib/python2.7/distutils/version.py", line 40, in __init__
    self.parse(vstring)
  File "/usr/lib/python2.7/distutils/version.py", line 107, in parse
    raise ValueError, "invalid version number '%s'" % vstring
ValueError: invalid version number '3.7.x'

But, I wonder, why do the check at all? In order to do an --apply-weight-patch (or whatever we end up calling it), don't I have to be at the correct version? Perhaps you can walk me through a use case where the check would rescue a user?

from weewx.

matthewwall avatar matthewwall commented on July 29, 2024

the 'invalid version number' is my fault. it does not like the x. we need some way of specifying "this is the next version, but it is from the repository, not from an official release". with subversion we used to do this by sticking the revision number in there. i don't know the git way of doing it.

from weewx.

matthewwall avatar matthewwall commented on July 29, 2024

i'm anxious about the notion of a database upgrade. that means tracking database version/features, and introduces a whole new level of complexity.

from weewx.

tkeffer avatar tkeffer commented on July 29, 2024

i'm anxious about the notion of a database upgrade. that means tracking database version/features, and introduces a whole new level of complexity.

Unfortunately, with the decision to patch the daily summaries, we are already committed to this.

Right now, the check is being done in setup.py (using the weecfg.patch module). Adding a check in wee_database just adds another entry point.

There's also a check in engine.py in case all else fails.

But, I must admit, I'm not feeling real good about all this. It's a heck of a lot of complexity.

from weewx.

raymondelooff avatar raymondelooff commented on July 29, 2024

Yeah Matthew, I was trying to say the same thing. There are complete libraries written for that specific purpose (database migrations) to make things easier, but the question is whether you want that. In the end it's just a simple recalculation that may take some time when the user has a large dataset.

And since nobody has complained about this bug yet, you can make it optional. You can add a flag to wee_database that lets the user recalculate the sums when he wants to. And by adding a release note about the added flag, the user can decide to recalculate the weighted sums if he cares about it. Doing it this way there is no need for a complete regeneration of daily summaries.

from weewx.

tkeffer avatar tkeffer commented on July 29, 2024

Unfortunately, it's not quite that simple. Your approach would require that the algorithms that maintain the daily summaries be aware that there could potentially be two different types of summaries out there: an 'old' style, without weights, and a 'new' style, with weights. So, complexity gets pushed from the update algorithms into the operational algorithms.

Not saying if it's better or worse, just that the complexity does not go away.

from weewx.

tkeffer avatar tkeffer commented on July 29, 2024

Nothing to add. I just couldn't resist having post number 100 in this epic thread!

from weewx.
