Another local LLM front end for Android,
Layla is way more fully featured than MLC Chat. The paid version adds heaps of utilities and character-creation options, but the free version works fine.
It comes with a range of downloadable models, and I get pretty acceptable speed out of the small and medium models on my Oppo A96 (old octa-core, 8 GB RAM). There's even a tiny model you can try if your phone is a complete potato or is lacking in RAM (though the responses will be a bit dumber). I'm getting about 3 tokens/second out of the small model, with better responses thanks to character creation (want code? Create a computer programmer character. Want short stories written? Create a writer, though it already comes with one), whereas I was only getting ~0.7-1 token/s in MLC Chat for a similar 3B model. The difference in speed is amazing. My phone isn't much better than a potato, but 3 tokens/second makes it pretty bearable.
The tiny model seems to use about 1.8 GB of RAM while running (so will probably work on damn near anything) and generates about 4.4 tokens a second on my phone. Probably faster without analytics on, and I'm pretty sure my phone has heaps of random background apps/tasks running. Quality appears to be about the level of RedPajama 3B (MLC Chat's small model), but may be worse. At least it's quick. Oh, and the Writer character that comes with Layla will write nearly anything, whereas some characters won't.
So, it's just better in general than MLC Chat, with way more model-tweaking options, changeable context size, a prettier interface, seemingly far better speed, and all kinds of other stuff. It apparently also gets frequent updates that don't break everything (MLC Chat broke for me after an update). Initial load times are a bit slow, but after that they're good.
I'll probably buy the paid version, just because I like supporting this sort of project. But as mentioned, free works fine.
Anyway, give it a go. You can get it here (an actual Play store app, no side loading required):
https://play.google.com/store/apps/details?id=com.laylalite

It also allows loading of custom GGUF files for different language models, so I'll have a go at getting Phi-3 going on it (a new model from Microsoft that is apparently fairly performant).
Yeah, Phi-3 works fine. It's censored, and is slower than Phi-2 (which is what Layla uses for its small model, uncensored) at about 2-3 tokens/sec, but apparently it's a lot smarter. It was only using about 3.4 GB of RAM, so it should squeeze into plenty of phones' hardware specs. I'll probably stick to Phi-2/small for most stuff, because the extra speed is awesome, and I often don't require genius-level understanding. I'd probably just move up to 7B-parameter models if I really needed more context or fewer hallucinations. You can grab it here if you want to try it out:
https://huggingface.co/microsoft/Phi-3-mini-4k-instruct-gguf/blob/main/Phi-3-mini-4k-instruct-q4.gguf

The new NVIDIA one also works on an 8 GB RAM phone, just really slowly, so yeah. Us dwarfs, always on the cutting edge of tech. With axes. And mobile stuff. Praise Armok!
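One gotcha if you're grabbing GGUFs for Layla yourself: Hugging Face "blob" links (like the ones in this post) point at a web page, not the file. Swapping "blob" for "resolve" gives you the direct download, which you can then fetch with curl (e.g. from Termux) and point Layla at. A quick sketch, assuming the usual Android download folder; the curl line is left commented since the file is a couple of gigabytes:

```shell
# Hugging Face "blob" page link (this is the Phi-3 one from above)
page="https://huggingface.co/microsoft/Phi-3-mini-4k-instruct-gguf/blob/main/Phi-3-mini-4k-instruct-q4.gguf"

# Swap "blob" for "resolve" to get the direct-download URL
file="${page/blob/resolve}"
echo "$file"

# Then fetch it into Android's Download folder (uncomment to run; big file):
# curl -L -o /sdcard/Download/Phi-3-mini-4k-instruct-q4.gguf "$file"
```

After the download finishes, use Layla's custom-model option to load the .gguf from storage.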
Choose a GGUF, any of them, for a laugh:
https://huggingface.co/bartowski/Llama-3-ChatQA-1.5-8B-GGUF