I just tried out llamafiles and I'm blown away by how easy they are to set up and use. Very decent speed on Apple silicon, and very decent code completion.
I have a real-time speed example here:
I want to be a bit more aggressive about auto-retrying LLM prompts in a loop, but API costs make that prohibitive, so I wanted to explore my options. I'm really excited by this one. My M1 Pro with 16GB is a little underpowered, though; I couldn't run every model. At the next laptop refresh I'll beg for 32GB or better.
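The retry-in-a-loop idea is cheap once the model runs locally. A minimal sketch of what I mean, with backoff between attempts; the function names here are hypothetical stand-ins, and the flaky call below simulates a local model that needs a couple of retries before giving a usable answer:

```python
import time

def with_retries(fn, max_attempts=3, base_delay=0.1):
    """Call fn(), retrying with exponential backoff on any exception."""
    for attempt in range(max_attempts):
        try:
            return fn()
        except Exception:
            if attempt == max_attempts - 1:
                raise  # out of attempts, re-raise the last error
            time.sleep(base_delay * 2 ** attempt)

# Stand-in for a local LLM call that fails twice, then succeeds.
calls = {"n": 0}

def flaky_prompt():
    calls["n"] += 1
    if calls["n"] < 3:
        raise RuntimeError("unusable completion, retry")
    return "a good answer"

print(with_retries(flaky_prompt))  # succeeds on the third attempt
```

Against a hosted API, every one of those retries is billed; against a llamafile on localhost, the only cost is wall-clock time, which is why this option is appealing.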