Checksum Errors & Thermal Overrun/Overheat / Overall crashing/freezing/ruining of prints

Artesian3D · February 2017

Without changing ANYTHING in software, firmware, hardware, etc., around the end of January I suddenly started seeing hundreds of checksum errors in my error reporting window in Repetier-Host. These were accompanied by some wicked crashes/freezes, and then a few weeks later by horrible thermal overrun. We are now almost a month into these horrible issues and I fear something to do with RepHost is at fault...

Related: This is a completely new error from 2-19-17 and I have never seen it before or since! Not even sure if relevant. Only happened once.

Previously:

-Tested 2 new/functional RAMBo boards. Same issues. Same checksums. Ugh. This was before runaway even was an issue...

-Tested PSU night of the 17th. Good voltage. No problems there. Very ordinary.

-Tested 4 different USB cables and all my USB ports, 2.0 and 3.0, front and back of desktop chassis

-Tested another computer. (Mac) SAME checksums and crashes. Didn't see any thermal overrun but only tested 3-4 hours...

-Tested another host program. Didn't see any checksum or crashing, or overrun... but print quality was awful and workflow was confusing... didn't test this enough. I LOVE REPETIER so I don't want to change...

+++++++++++++++++++

Issues in short:

Not long after this all started, prints started randomly freezing. Sometimes they would restart, sometimes not. Often, the XYZ motors would lock and the E motor would furiously spin, inducing a massive retract or extrude command which inevitably led to the usual ruining of the print.
Sometimes, the crashes would put all four motors offline and NO errors would actually be reported at the time. The entire machine would just sit there, fans running, hotend hot, bed hot, totally 100% frozen. Pressing emergency stop returned all functioning to normal... until the next print of course. Throughout this, the host has never become unresponsive!
Suddenly, on the 17th-ish, the nozzle started having thermal runaway (heating up WAYYY beyond normal and leaking molten filament) when idling/heating before a print. Actually starting the print caused the runaway to stop and printing could proceed normally (with plenty of checksum errors of course!) NOTE: nothing is reported in host or on LCD screen of my romax when this happens! The reporting is 100% normal... the thing just smokes badly and leaks molten filament, clearly getting wayyy too hot. I even installed an auxiliary LED to check and see if the circuit was staying on too long. Nope. LED flickered to indicate heat variance, off and on as it should once temp is reached.
The printer ALSO FREQUENTLY forgets all of its EEPROM settings, resulting in failure to maintain PID, horizontal radius, accel values, extruder steps, etc, etc. I have locked my settings into firmware, but it still forgets horizontal radius, which is impossible to set in Repetier firmware directly. Why in the WORLD does this happen? I can't even find anything while googling this. It has not done this since the 17th, however.

SeeMeCNC support has totally given up, stating they have never heard of anyone encountering these problems before. Reprap IRC is helpful as always, but nobody knows why any of this is happening.

They suggested I try:

_ A: A new RAMBO board

_ B: A new USB cable going to my desktop

_ C: Printing from SD card.

_ A: I tried a new board and exact same thing happened. Tons of checksums and freezing.

_ B: I tried FOUR BRAND NEW USB cables, some even with ferrite chokes to prevent EMI.

_ C: I tried printing from SD cards and successfully got the checksum errors to stop, BUT this is horrible for my workflow AND the hotend runaway / EEPROM forgetting STILL happen...

Before the end of January, none of this happened. I want to reiterate: NONE of this happened. It remembered its settings for years, never froze, and only spat out some non-critical checksum errors every few dozen prints! And never hundreds in a single print.

I did not change the room, the table, the machinery... NOTHING. Nothing has changed. Early in the month of January I did upgrade to the newest version of Repetier firmware DIRECTLY from SeeMeCNC. No errors occurred and things ran smoothly until the end of the month when things started going nuts.

What in the world is going on? I swore trying a new clean working board, SD printing, or new USB cables would be the solution... SeeMeCNC says out of the thousands of customers they have nobody has ever reported these issues, especially in combination.

Please... if anyone can help... I'd sincerely appreciate it.

I'm located in San Francisco if anyone is local to come take a look. I will pay you for your time/expertise.

_______

ADDITIONAL NOTES:

I upgraded my desktop's RAM from 32GB to 64GB (1600mhz to 2133mhz) on January 7th. I printed from the 7th until the 29th of January and NEVER had any issues! The 29th was the first crash. (or thereabouts within one or two days if memory serves)

The alternate host I tried was MatterControl.

I have tried with hundreds of different 3D models. Both mine and public models. Doesn't make a bit of difference.

I have tried reseating my RAM, reseating my GPU, checking my CPU's temps... all NORMAL and properly done. [I've built 30 computers and 10 printers so I'm pretty good at hardware troubleshooting]

I have tried using MANY different versions of RepHost, from 0.95 all the way up to the current 1.6 something. STILL checksums and crashing and thermal overrun.

---

Things I have not done yet but may:
-Switch back to OLD RAM that I know worked perfectly.
-Switch out for a NEW power supply... but I already have an amazing one and the voltage was stable so I'd rather not!
-Continue to do long/difficult prints from SD even though the workflow for that sucks and rephost actually crashes if I don't save the gcode externally...

Repetier · February 2017

I don't really believe there was no change. I think something changed that you did not notice, like a hardware component not running well or a Windows update.

I'm a bit confused about whether MatterControl had the same issues or not. One way Host/Server differs from other products is that we can send several commands in parallel, while other hosts only do what we call ping-pong mode. So you could first enable ping-pong mode. In parallel mode, the receive buffer cache size can cause many errors if it is set wrong. 63 bytes always works, and when the firmware is compiled with a recent Arduino IDE it will normally use 127 bytes. Bigger values will cause errors for sure. Set the transfer protocol to Repetier protocol, assuming you have Repetier-Firmware. Also check the baud rate. For example, having 250000 baud in the firmware and 230400 in the host will somewhat work but will also cause many communication errors.
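To make the difference between ping-pong and parallel sending more concrete, here is a minimal Python sketch. It is not Repetier-Host's actual implementation; `SendWindow`, `max_queued` and `baud_mismatch_percent` are hypothetical names made up for illustration, and the byte accounting (line length plus one newline) is an assumption. It only shows why a host-side cache size larger than the firmware's real receive buffer would let the host push more bytes than the board can hold, and how far apart the two baud rates from the example are.

```python
from collections import deque


def baud_mismatch_percent(firmware_baud: int, host_baud: int) -> float:
    """Difference between the two baud rates, in percent of the firmware rate."""
    return abs(firmware_baud - host_baud) / firmware_baud * 100.0


class SendWindow:
    """Tracks bytes sent but not yet acknowledged (hypothetical helper).

    cache_size models the firmware receive buffer (63 or 127 bytes in the post).
    ping_pong=True allows only one unacknowledged command at a time.
    """

    def __init__(self, cache_size: int = 63, ping_pong: bool = False) -> None:
        self.cache_size = cache_size
        self.ping_pong = ping_pong
        self.in_flight = deque()  # byte length of each unacknowledged line

    @staticmethod
    def _size(line: str) -> int:
        # Assumption: one terminating newline byte per line.
        return len(line.encode("ascii")) + 1

    def can_send(self, line: str) -> bool:
        if self.ping_pong:
            # Wait for the previous "ok" before sending anything else.
            return not self.in_flight
        return sum(self.in_flight) + self._size(line) <= self.cache_size

    def sent(self, line: str) -> None:
        self.in_flight.append(self._size(line))

    def acknowledged(self) -> None:
        # The firmware answered "ok" for the oldest pending line.
        if self.in_flight:
            self.in_flight.popleft()


def max_queued(lines, window: SendWindow) -> int:
    """Count how many lines can be sent before the first "ok" is required."""
    count = 0
    for line in lines:
        if not window.can_send(line):
            break
        window.sent(line)
        count += 1
    return count
```

With 22-byte moves such as `G1 X10 Y10 E1.0 F3000` (21 characters plus newline), this model allows 2 lines in flight for a 63-byte cache, 5 for a 127-byte cache, and 1 in ping-pong mode. If the host believes the cache is larger than the firmware's real buffer, the extra bytes have nowhere to go, which is one plausible source of garbled lines.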

A deleted EEPROM is something different, as there is no delete command. You can reset it to the configuration.h values with

M502

M500

and the firmware will do the same if a checksum error in the EEPROM is detected at bootup. Otherwise, restarting will not change the EEPROM, and especially not to nonsense values.

Also, your temperature runaway: what does that mean? If you have communication issues, it is possible that temperatures get set wrong. But the host should show the current set temperature as well, so it is easy to see if something changed the target temperature or if it is out of bounds. The firmware should stop heating if the target temperature is exceeded.

If nothing bad happens with a pure SD print, I can only think of wrong communication settings causing lots of wrong commands.
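For readers wondering what a "checksum error" actually checks, here is a small Python sketch of the classic RepRap ASCII line framing, where each line is sent as `N<line number> <command>*<checksum>` and the checksum is the XOR of every byte before the `*`. This is only an illustration of that ASCII convention; the Repetier protocol recommended above uses a different binary framing that is not shown here, and the function names are made up for the example.

```python
def xor_checksum(payload: str) -> int:
    """XOR of all bytes in the payload (everything before the '*')."""
    checksum = 0
    for byte in payload.encode("ascii"):
        checksum ^= byte
    return checksum


def frame_line(line_number: int, command: str) -> str:
    """Build a numbered, checksummed line in the ASCII convention."""
    payload = f"N{line_number} {command}"
    return f"{payload}*{xor_checksum(payload)}"


def verify_line(framed: str) -> bool:
    """Return True if the checksum after '*' matches the payload before it."""
    payload, sep, claimed = framed.rpartition("*")
    claimed = claimed.strip()
    if not sep or not claimed.isdigit():
        return False
    return xor_checksum(payload) == int(claimed)


if __name__ == "__main__":
    good = frame_line(3, "T0")
    print(good, verify_line(good))        # N3 T0*57 True
    # One byte damaged in transit: the receiver rejects the line and
    # asks for a resend instead of executing a wrong command.
    damaged = good.replace("T0", "T1")
    print(damaged, verify_line(damaged))  # N3 T1*57 False
```

Any single corrupted byte changes the XOR, so the line is rejected and resent. Hundreds of such rejections in one print suggest that bytes are being damaged or dropped somewhere between the host and the board, rather than that the G-code itself is bad.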

One thing could also be fake FTDI drivers, but then other hosts would have same problem with communication.

Artesian3D · March 2017

Thanks so much for responding!

1. The thermal runaway has basically stopped happening, but like I said whenever it did the host basically reported everything was perfect. I never saw those values change from what they were set to, watched it like a hawk!

2. The firmware has no baud rate control. That is all set in the host. I tell it 250K, it uses 250K. This is Repetier from SeeMeCNC.

3. I understand some but not all of what you're saying about transfer protocols. I don't want the machine using a different protocol from the host. And what we're basically seeing is that the host is doing something the machine doesn't like or doesn't understand. I tried enabling ping-pong mode and that did nothing.

Still checksums!

4. I swapped out the new ram for the old ram in my computer. STILL happens.

5. EEPROM stopped getting deleted thankfully.

6. I tried yet another USB cable, and it's still happening. This is the last thing I need to fix before I have a reliable deltabot...

Artesian3D · March 2017

UPDATE for today:

... it just crashed while SD printing. All activity FROZE. All motors locked. Nothing happened. WHILE SD PRINTING! The host wasn't even open and the USB cable was disconnected! What in the world!? WHY???? How am I so cursed?

So there must be something corrupted in: Hardware, board, firmware... but I already tried new versions of all those things!

Repetier · March 2017

These errors are so weird and random that it could be the power unit. If the voltage fluctuates too much, problems can happen. If you have a PC power supply or some other power unit, you might try that. You could also try a dry run, which reduces power consumption and so reduces these effects as well. If that then runs stably, I'd try a new power unit. But that is just a guess, as you have exchanged a lot but not that.

Artesian3D · March 2017

Hi again,

I tested 2 other computers AND another known working PSU. The voltage on both was within .01 volts the entire time I tested everything. Even during a dry run... tons of checksums and failures.

I already have a very good and very expensive computer ATX PSU. It has protection against EMI/voltage spikes, and since the other unit had the same problem... it's confirmed as NOT the culprit either.

If I've tried new boards, SD printing, new power supply, new/old firmware, different slicers/hosts, different hotend/therm/etc, different computers (PC and Mac)... and it's all still happening......

Then it MUST be something else in my room?

Is it ok to have the printer hooked into a UPS that runs another 3d printer? My Mendelmax has never once suffered because it was next to the Romax. But maybe the Romax suffers when next to the Mendel and its power?

Repetier · March 2017

Just try. Since we do not know what the reason is, how can we say whether it influences other devices?

Artesian3D · March 2017

I attempted to print with the power plugged into another room and the MendelMax turned OFF.

STILL getting loads of checksum errors and crashes!

I feel like I've literally tried everything!
