Server hangs after some time

Hello there,
I've been running into this issue a lot over the last few months, but only now have I decided to ask for help here.

I'm using RepetierServer Pro on a Raspberry Pi 3, installed from an official pre-built image. The printer electronics are an Arduino Due clone with a RADDS 1.5 on top, connected to the Raspberry with a short shielded USB cable. The Raspberry is on the network via Ethernet cable. I've used this same combination of hardware and software in the past with no issues, but lately I keep running into problems.
I mostly do long prints (more than a day), and sometimes after a few hours the web UI becomes unresponsive. If I then need to pause the print, the only thing I can do is abort it by resetting the server (restarting the RepetierServer service or rebooting the Pi).

Usually when I log via SSH during a "freeze" what I see via HTOP is this:
image

(should the image not be visible, here's a link to it: https://postimg.org/image/d74w0wyat/ )

What I see from the [laggy] log is a sequence of "Seems like we missed an OK, continue sending" with cumulative retries and no response.

After I restart the service everything goes ok.

Any suggestions?

Thanks in advance!

Comments

  • What I see is very high CPU usage on 2 processes and on chrome. Do you have a touchscreen connected? If so, what are you viewing there? If not, you should try disabling chrome, since then it only wastes CPU and memory. Edit the end of .bashrc in the pi home directory so that it no longer starts the X server.

    Unfortunately it is not possible to see which threads are using 100%. Some will likely use 100%, e.g. when parsing gcodes, but that only happens when new files get added.

    Which log do you use to see "missed ok"? The web log needs commands enabled to show whether it sends commands. As long as there are large differences in line numbers between them, that would be no big problem, although perfect communication is better.

    Would disconnecting the printer stop the CPU usage?

    Since it worked in the past, it could also be an error on the SD card. With such errors it is not clear what will happen. If you can, put a fresh image on a new SD card, copy the printer configuration over, and test that image instead. Turning the Pi off without a proper shutdown in particular can lead to SD card corruption.
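
For the question of which threads are burning CPU, a minimal sketch, assuming a Linux system with /proc mounted (as on a Raspberry Pi image). It reads the per-thread stat files and adds up user and kernel time. The function names are hypothetical and not part of Repetier-Server; `top -H -p <pid>` gives a similar live view.

```python
import os
from pathlib import Path


def parse_thread_stat(stat_line):
    """Return (tid, name, state, cpu_ticks) from one /proc/<pid>/task/<tid>/stat line."""
    # The thread name sits in parentheses and may itself contain spaces,
    # so split around the first "(" and the last ")".
    head, _, rest = stat_line.rpartition(")")
    tid_text, _, name = head.partition("(")
    fields = rest.split()
    state = fields[0]        # field 3 of the stat line
    utime = int(fields[11])  # field 14: ticks spent in user mode
    stime = int(fields[12])  # field 15: ticks spent in kernel mode
    return int(tid_text), name, state, utime + stime


def thread_cpu_times(pid):
    """List (cpu_seconds, tid, name, state) per thread, busiest first (Linux only)."""
    ticks_per_second = os.sysconf("SC_CLK_TCK")
    rows = []
    for stat_file in Path(f"/proc/{pid}/task").glob("*/stat"):
        tid, name, state, ticks = parse_thread_stat(stat_file.read_text())
        rows.append((ticks / ticks_per_second, tid, name, state))
    return sorted(rows, reverse=True)
```

Calling `thread_cpu_times(pid)` twice, a few seconds apart, with the server's process id shows which thread's CPU time keeps growing.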
  • Hello and thanks for the reply!

    No, I've got no touchscreen, not even a display on the RADDS. Gonna disable it right now.

    About the logs: yes, I kept the switches "commands" and "ACK" enabled on the web interface to see what was going on and they worked even if the web ui commands were unresponsive.

    Disconnecting the printer while at 100% CPU? I didn't try that, but I reset it, and that didn't help. I'll try disconnecting and post the result should it come up again after the .bashrc mod.

    I'll also buy a new SD card, because I had to brutally disconnect power a few times in the past and I want to be sure it isn't the SD.

    Will post soon with the results, meanwhile thanks again for the support.
  • edited April 2017
    Hello there!

    I've started the troubleshooting process by disabling the graphical interface from raspi-config, which I didn't know was enabled since I have no HDMI monitor attached. After doing so, the chromium process never started again and the server keeps running smoothly.

    Thanks again for your precious advice!

    image
    https://postimg.org/image/y6u7op915/
  • edited April 2017
    Ok, it got stuck again, but this time we've got new information that could be useful.

    I paused the print for some hours and before I paused it everything was going smoothly (cpu usage between 3% and 9%). After some hours I've seen the server at 104% cpu on a sleeping thread and 100% cpu on a running thread, and the TIME+ value is nearly the same as the time passed since I've pressed the pause button.

    Any clues on a connection between long pauses and server hangs?

    Meanwhile I'll keep checking the things listed in the posts above.

    Edit: forgot to add a picture. Also, I tried disconnecting the printer from the Raspberry and it did nothing; the thread is still at 100%.


    image

  • Just to be clear, when you say pause you mean hitting the pause button during a print, and not just leaving the printer idle after a print?

    What firmware are you using? Can you enable logging for idle and print so we can see the communication afterwards? I think something happens that causes so much communication that it gets 100% load.

    Please also check in ssh with "df" whether you have free disk space. A full disk can lead to many errors.
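
The same disk check can be scripted, for example to run it regularly during a long print. A minimal sketch using Python's standard `shutil.disk_usage`; the 90% warning threshold and the function names are assumptions chosen for illustration.

```python
import shutil


def used_percent(used_bytes, free_bytes):
    """Share of the usable space that is occupied, in percent."""
    return 100.0 * used_bytes / (used_bytes + free_bytes)


def check_disk(path="/", warn_at=90.0):
    """Return (percent_used, is_nearly_full) for the filesystem holding path."""
    usage = shutil.disk_usage(path)
    percent = used_percent(usage.used, usage.free)
    return percent, percent >= warn_at
```

`check_disk("/")` returns the usage of the root filesystem and a flag that turns true once the threshold is reached.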
  • edited April 2017
    Hello there, sorry for replying this late, I was trying all the remaining things to check.

    Yes, I mean both pausing the print for some minutes (let's say change a spool) and letting it be idle after a print.

    I'm using Repetier Firmware 1.0.0 dev. version.

    I've enabled log for idle and print as you requested, will collect them and give you back the files.

    I did a fresh install on a brand new 32GB Samsung Evo+ just to be sure that SD wasn't an issue, and it hangs after some time, so SD was good.

    df shows only 7% usage on the root filesystem:
    Filesystem      Size  Used Avail Use% Mounted on
    /dev/root        30G  1.7G   27G   7% /
    devtmpfs        459M     0  459M   0% /dev
    tmpfs           463M  3.3M  460M   1% /dev/shm
    tmpfs           463M  6.6M  457M   2% /run
    tmpfs           5.0M  4.0K  5.0M   1% /run/lock
    tmpfs           463M     0  463M   0% /sys/fs/cgroup
    /dev/mmcblk0p1   63M   21M   43M  34% /boot
    tmpfs            93M     0   93M   0% /run/user/1000



    I don't know if it helps, and it's just a weird feeling, but could it be related to auto bed leveling? I'm asking because I've got a 1000x1000mm build plate with a very dense calibration grid (3 cm between points), and I don't know how much performance or space this setting takes up, or whether it could relate to this issue at all.
  • Autoleveling is not the problem. It does nothing when idle, so if the hang happens while idle it cannot be related.

    I have more hope for the log, which may show some unusual communication that causes the block.

    Another thing that might be a problem is a slow wifi connection combined with a webcam. Since the server proxies the webcam and wants to send the complete file, it seems to store the difference somewhere when the receiver is slower. Looking again at the htop output I see a whopping 48% memory usage. That is not typical. Right after startup I see 2.2%, and it never uses much even during operation. So if the webcam is causing the memory usage and it gets too high, everything gets slow and unstable. Could that be the reason?
  • edited April 2017
    Thanks for the follow-up. Unfortunately there's no webcam connected, the configuration is set to "no webcam", and the whole thing is connected via Ethernet cable.

    At least I've got some log files, even though I get a new connection.log whenever I reset via ssh. When I view it with nano/cat I can see that when it slows down, it simply shows the temperature every second or every two seconds instead of every fraction of a second, and it stays that way. There is no other text than the normal temperature readings.

    Any clues?
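
To put numbers on the slowdown, the gaps between successive temperature reports can be measured. A minimal sketch, assuming the report times have already been extracted from the log as seconds; the log format is not shown in this thread, so the parsing step is deliberately left out. The function name and the tolerance are hypothetical.

```python
def slow_gaps(timestamps, expected=1.0, tolerance=0.5):
    """Return (index, gap) pairs where two reports are further apart than expected."""
    slow = []
    for i in range(1, len(timestamps)):
        gap = timestamps[i] - timestamps[i - 1]
        # Flag only gaps clearly above the expected reporting interval.
        if gap > expected + tolerance:
            slow.append((i, gap))
    return slow
```

With one report per second as the baseline, a run of flagged gaps marks the point where the log starts to lag.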
  • Temperatures are queried once a second. Later dev versions of Repetier-Firmware support a mode to autoreport temperatures, but once a second is no reason to slow down. My installation is still at 2.3% memory usage after a day and a 3-hour print.

    So the question is when it starts using memory, and why. There are 2 things you could do:
    1. Watch with htop and keep an eye on memory usage. Maybe you can see which of your actions makes it grow by more than 1%. After all, your screenshot already showed 48%.
    2. If your SD card has enough free space, make a copy of /var/lib/Repetier-Server and in your copy go to the model and job folders of each printer and delete all files there. It might be that a corrupt file somehow causes this. The backup is just so you can restore the old state. Stop the server before making the backup/copy and restart it afterwards:
    sudo service RepetierServer stop 
    sudo service RepetierServer start
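
The backup and cleanup in step 2 can be sketched as below, to be run while the server is stopped. This is an illustration under assumptions: the folder names "models" and "jobs" are guesses based on the "model and job folders" wording and should be checked against the actual directory layout. The post is not fully explicit about which tree gets cleared, so the function takes that tree as a parameter.

```python
import shutil
from pathlib import Path


def backup_tree(source, destination):
    """Copy the whole data directory; raises if destination already exists."""
    shutil.copytree(source, destination)


def clear_folders(root, folder_names=("models", "jobs")):
    """Delete regular files directly inside every matching folder below root."""
    # Collect the folders first so nothing is deleted while still walking the tree.
    folders = [p for p in Path(root).rglob("*")
               if p.is_dir() and p.name in folder_names]
    removed = []
    for folder in folders:
        for entry in folder.iterdir():
            if entry.is_file():  # subdirectories are left untouched
                entry.unlink()
                removed.append(entry)
    return removed
```

A typical sequence would be `backup_tree("/var/lib/Repetier-Server", <backup path>)` followed by `clear_folders(...)` on the tree to be tested, where the backup path is whatever location has enough free space.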

  • Hello, it did not help. But I did try another thing: I used the old SD card with OctoPrint, and after just a few minutes I got connection issues: no temperature readings, no command responses. I'm guessing it's something serial-port related. That gave me the idea that something may be wrong with the Raspberry. I got another RPi3 with RepetierServer up and running on the machine and... guess what? Everything is working fine, at least for now. I'll keep updating as soon as I get more info.

    So at this point I'm guessing it was all down to some hardware on the RPi that got fried somehow.
  • Ok, the problem persists. After a minute or so of pause during a print it still hangs. I've put the printer under a Repetier Server installation on an Intel NUC I had on another printer, and it still hangs. So maybe it's the Arduino Due clone I'm using? Could its serial communication be damaged? But then it wouldn't stop working only on pauses, I presume...

    I'll try reverting firmware to release 0.92
  • I don't think it is serial communication, judging by the log you showed. That looked good.
    Are you using the programming port of the due? That is what I normally use.

    Please also have a look at ram usage and when it starts to grow if that is the problem. But it is remarkable that it comes back on new sd card, or was it just new pi same card? Sometimes errors on sd card have strange effects. But that would mean your NUC has the same error, so maybe not the real problem.
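
To see when RAM usage starts to grow without watching htop the whole time, the resident memory of the server process can be logged at intervals. A minimal sketch, assuming Linux, where /proc/<pid>/status contains a VmRSS line in kB; the function names and the one-minute interval are hypothetical.

```python
import time
from pathlib import Path


def parse_vm_rss_kb(status_text):
    """Return the VmRSS value in kB from /proc/<pid>/status text, or None if absent."""
    for line in status_text.splitlines():
        if line.startswith("VmRSS:"):
            return int(line.split()[1])
    return None


def log_memory(pid, interval_seconds=60, samples=10):
    """Print a timestamped resident-memory reading every interval (Linux only)."""
    for _ in range(samples):
        status = Path(f"/proc/{pid}/status").read_text()
        print(time.strftime("%H:%M:%S"), parse_vm_rss_kb(status), "kB")
        time.sleep(interval_seconds)
```

Comparing the timestamps of a jump in the readings with what was done on the printer at that moment (pause, upload, connect) narrows down the trigger.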
  • Two different SD cards, two different RPis, and a NUC all showed this same error... I really do not know what to think. Anyway, yes, I'm always using the programming port of the Due.

    Now that I installed repetier firmware v0.92 stable on the printer it seems that short pauses don't even add a 0.1% cpu or ram to the load. So, maybe, that's something 1.0.0DEV related...
  • I'm always testing with dev, so I would wonder why this should be. But OK, let's see how it works now in the long run.
    What can be a problem are deltas on dev, since dev needs more RAM: you need to reduce subsegments per line to get your 900 bytes of free RAM, if that is the case for you. But nothing the firmware sends should be able to increase server RAM usage. Maybe CPU, if it continuously sends a lot of stuff, but that would be visible in the log.
  • Hi,
    I got the same problem with a Pi3 and a RADDS board. Switching from dev to 0.92 seems to fix it. Has anyone been able to fix this problem with the dev version?
    Thanks,
    Thomas
  • We are always using 1.0 since we are developing it, and we have no server hangs, so it is not a bug in the firmware. It depends of course on the configuration. As I said, dev needs a bit more RAM, so you may need to reduce subbuffers on deltas to get the required memory. If you have an idea which response of the dev version makes it freeze, let us know. But it is hard to fix something without knowing what.