Checksum Errors & Thermal Overrun/Overheat / Overall crashing/freezing/ruining of prints
Without changing ANYTHING in software, firmware, hardware, etc., around the end of January I suddenly started seeing hundreds of checksum errors in the error reporting window in Repetier-Host. These were accompanied by some wicked crashes/freezes, and then a few weeks later by horrible thermal overrun. We are now almost a month into these horrible issues and I fear something to do with RepHost is at fault...
Related: This is a completely new error from 2-19-17 and I have never seen it before or since! Not even sure if relevant. Only happened once.
Previously:
-Tested 2 new/functional RAMBo boards. Same issues. Same checksums. Ugh. This was before runaway even was an issue...
-Tested PSU night of the 17th. Good voltage. No problems there. Very ordinary.
-Tested 4 different USB cables and all my USB ports, 2.0 and 3.0, front and back of desktop chassis
-Tested another computer. (Mac) SAME checksums and crashes. Didn't see any thermal overrun but only tested 3-4 hours...
-Tested another host program. Didn't see any checksum or crashing, or overrun... but print quality was awful and workflow was confusing... didn't test this enough. I LOVE REPETIER so I don't want to change...
+++++++++++++++++++
Issues in short:
Not long after this all started, prints started randomly freezing. Sometimes they would restart, sometimes not. Often, the XYZ motors would lock and the E motor would spin furiously, as if executing a massive retract or extrude command, which inevitably ruined the print.
Sometimes, the crashes would put all four motors offline and NO errors would actually be reported at the time. The entire machine would just sit there, fans running, hotend hot, bed hot, totally 100% frozen. Pressing emergency stop returned all functioning to normal... until the next print of course. Throughout this, the host has never become unresponsive!
Suddenly, on the 17th-ish, the nozzle started having thermal runaway (heating up WAYYY beyond normal and leaking molten filament) when idling/heating before a print. Actually starting the print caused the runaway to stop and printing could proceed normally (with plenty of checksum errors of course!) NOTE: nothing is reported in host or on LCD screen of my romax when this happens! The reporting is 100% normal... the thing just smokes badly and leaks molten filament, clearly getting wayyy too hot. I even installed an auxiliary LED to check and see if the circuit was staying on too long. Nope. LED flickered to indicate heat variance, off and on as it should once temp is reached.
The printer ALSO FREQUENTLY forgets all of its EEPROM settings, resulting in failure to maintain PID, horizontal radius, accel values, extruder steps, etc. I have locked my settings into firmware, but it still forgets horizontal radius, which is impossible to set in Repetier firmware directly. Why in the WORLD does this happen? I can't even find anything while googling this. It has not done this since the 17th, however.
SeeMeCNC support has totally given up, stating they have never heard of anyone encountering these problems before. Reprap IRC is helpful as always, but nobody knows why any of this is happening.
They suggested I try:
_ A: A new RAMBO board
_ B: A new USB cable going to my desktop
_ C: Printing from SD card.
_ A: I tried a new board and exact same thing happened. Tons of checksums and freezing.
_ B: I tried FOUR BRAND NEW USB cables, some even with ferrite chokes to prevent EMI.
_ C: I tried printing from SD cards and successfully got the checksum errors to stop, BUT this is horrible for my workflow AND the hotend runaway / EEPROM forgetting STILL happen...
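For anyone unfamiliar with what these checksum errors actually mean, here is a minimal sketch of the line check as I understand it from the RepRap G-code documentation, assuming the host is talking plain ASCII rather than Repetier's binary protocol. The host prefixes each command with a line number, appends `*` plus the XOR of every byte before it, and the firmware recomputes that value and asks for a resend on a mismatch. The function names below are made up for illustration and are not taken from Repetier-Host or the firmware.

```python
def gcode_checksum(payload: str) -> int:
    """XOR of every byte before the '*' (ASCII protocol assumed)."""
    cs = 0
    for byte in payload.encode("ascii"):
        cs ^= byte
    return cs & 0xFF


def frame_line(line_number: int, command: str) -> str:
    """Build 'N<line> <command>*<checksum>' the way a host would send it."""
    payload = f"N{line_number} {command}"
    return f"{payload}*{gcode_checksum(payload)}"


def verify_line(framed: str) -> bool:
    """Recompute the checksum on the receiving side and compare."""
    payload, sep, received = framed.rpartition("*")
    if not sep or not received.strip().isdigit():
        return False  # no checksum field at all
    return gcode_checksum(payload) == int(received)


good = frame_line(3, "T0")
print(good, verify_line(good))

# Simulate one character corrupted in transit: the check now fails.
corrupted = good.replace("T0", "T1")
print(corrupted, verify_line(corrupted))
```

A single corrupted character on the serial link is enough to fail this check, which would fit with the errors disappearing on SD prints, where the G-code never crosses the USB link at all.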
Before the end of January, none of this happened. I want to reiterate: NONE of this happened. It remembered its settings for years, never froze, and only spat out some non-critical checksum errors every few dozen prints! And never hundreds in a single print.
I did not change the room, the table, the machinery... NOTHING. Nothing has changed. Early in the month of January I did upgrade to the newest version of Repetier firmware DIRECTLY from SeeMeCNC. No errors occurred and things ran smoothly until the end of the month when things started going nuts.
What in the world is going on? I swore trying a new clean working board, SD printing, or new USB cables would be the solution... SeeMeCNC says out of the thousands of customers they have nobody has ever reported these issues, especially in combination.
Please... if anyone can help... I'd sincerely appreciate it.
I'm located in San Francisco if anyone is local to come take a look. I will pay you for your time/expertise.
_______
ADDITIONAL NOTES:
I upgraded my desktop's RAM from 32GB to 64GB (1600 MHz to 2133 MHz) on January 7th. I printed from the 7th until the 29th of January and NEVER had any issues! The 29th was the first crash (or thereabouts, within one or two days, if memory serves).
The alternate host I tried was MatterControl.
I have tried with hundreds of different 3D models. Both mine and public models. Doesn't make a bit of difference.
I have tried reseating my RAM, reseating my GPU, checking my CPU's temps... all NORMAL and properly done. [I've built 30 computers and 10 printers so I'm pretty good at hardware troubleshooting]
I have tried using MANY different versions of RepHost, from 0.95 all the way up to the current 1.6 something. STILL checksums and crashing and thermal overrun.
---
Things I have not done yet but may:
-Switch back to OLD RAM that I know worked perfectly.
-Switch out for a NEW power supply... but I already have an amazing one and the voltage was stable so I'd rather not!
-Continue to do long/difficult prints from SD even though the workflow for that sucks and RepHost actually crashes if I don't save the gcode externally...
Comments
1. The thermal runaway has basically stopped happening, but like I said whenever it did the host basically reported everything was perfect. I never saw those values change from what they were set to, watched it like a hawk!
... it just crashed while SD printing. All activity FROZE. All motors locked. Nothing happened. WHILE SD PRINTING! The host wasn't even open and the USB cable was disconnected! What in the world!? WHY???? How am I so cursed?
So there must be something corrupted in: Hardware, board, firmware... but I already tried new versions of all those things!
I tested 2 other computers AND another known working PSU. The voltage on both PSUs stayed within 0.01 volts the entire time I tested everything. Even during a dry run... tons of checksums and failures.
I already have a very good and very expensive computer ATX PSU. It has protection against EMI/voltage spikes, and since the other unit had the same problem... it's confirmed as NOT the culprit either.
If I've tried new boards, SD printing, new power supply, new/old firmware, different slicers/hosts, different hotend/therm/etc, different computers (PC and Mac)... and it's all still happening......
Then it MUST be something else in my room?
Is it ok to have the printer hooked into a UPS that runs another 3d printer? My Mendelmax has never once suffered because it was next to the Romax. But maybe the Romax suffers when next to the Mendel and its power?
I feel like I've literally tried everything!