Friday, August 28, 2015

A Reddit clone (very basic) using LocalStorage and no server-side storage

Some weeks ago there was a bit of a tussle over at Reddit, with subreddits going private in protest, talk of censorship, etc. This was interesting to see, from a distance. It got me thinking about trying to replicate Reddit, a site where people can share stuff and have discussions, but without having the server control all the data. So I've developed a very basic Reddit clone, where you can post links and stories and comment on them. You can also upvote stories and comments, and downvote what you upvoted (cancel your upvote). But there's a catch: the site has no database. Things are kept in memory, and on the user's machine, via HTML5 LocalStorage. That's all!

Every time you upload or upvote something, it gets saved to your LocalStorage for the site. Once something gets downvoted to zero, it disappears. When you go to the site, whatever is in your LocalStorage gets uploaded and upvoted again. So stories can come and go as users connect and disconnect, and only the most popular stories will always be visible on the site (since at least one connected user needs to have uploaded or upvoted a story for it to be visible).

Of course, there is still a server that could decide to censor stories or modify text, but at least you can always check that what you have on YOUR machine is the data you wanted. You can always copy that data elsewhere easily for safekeeping (browser developer tools let you inspect your LocalStorage content).

All in all, this was probably only an excuse for me to play with Javascript and Java (I did the server side in Java, since it was both easy to build and deploy) and Heroku. I've deployed the app at https://fivemegs.herokuapp.com and the source code can be found at https://github.com/JPMoresmau/5megs. Any feedback is welcome!

Sunday, August 02, 2015

Playing with Recurrent Neural Networks in Haskell

Some time ago an interesting article surfaced on Reddit, about using recurrent neural networks to generate plausible-looking text. Andrej did a very good job of explaining how they work and some of the techniques and algorithms he used. I thought this was worth some exploration on my own, so I did a bit of research and tried to implement my own networks in Haskell.

I implemented the basics using the hmatrix package to get vectors and matrices. I went to Khan Academy to learn more about these topics, because I had stopped math in school before learning about matrices. Once I had managed to implement the algorithm (hopefully correctly), I used the simple-genetic-algorithm package (well, my fork, uploaded to Hackage) to evolve the network via genetic selection and mutation.
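Just to make that concrete, here is roughly what a single step of a character-level recurrent network looks like with hmatrix. The data type, the field names and rnnStep are mine, chosen for illustration, not the actual code from my repository:

import Numeric.LinearAlgebra

-- Hypothetical RNN parameters (illustrative names, not the repository's).
data RNN = RNN
  { wxh :: Matrix Double  -- input  -> hidden weights
  , whh :: Matrix Double  -- hidden -> hidden weights
  , why :: Matrix Double  -- hidden -> output weights
  , bh  :: Vector Double  -- hidden bias
  , by  :: Vector Double  -- output bias
  }

-- One step: from the previous hidden state and the 1-of-k encoded input
-- character, compute the new hidden state and the scores for the next character.
rnnStep :: RNN -> Vector Double -> Vector Double -> (Vector Double, Vector Double)
rnnStep net h x = (h', y)
  where
    h' = cmap tanh (wxh net #> x + whh net #> h + bh net)
    y  = why net #> h' + by net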

This gave some results, especially once I fixed my mutation code, which was in fact doing nothing; this emphasized again that mutation is a critical part of genetic algorithms, not just crossover.
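For illustration, a standalone mutation sketch over a flat list of weights, using System.Random rather than the package's actual interface; each weight gets a small chance of a random nudge, which is exactly what my broken version failed to do:

import System.Random

-- Sketch of a mutation step: each weight has a `rate` chance of being
-- perturbed by a small random amount; the generator is threaded through.
mutateWeights :: Double -> StdGen -> [Double] -> ([Double], StdGen)
mutateWeights rate g0 ws = foldr step ([], g0) ws
  where
    step w (acc, g) =
      let (p,     g1) = randomR (0, 1)      g
          (delta, g2) = randomR (-0.1, 0.1) g1
          w'          = if p < rate then w + delta else w
      in (w' : acc, g2)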

Then I went on to implement proper learning algorithms, a bit less random than genetic evolution. I was clearly out of my depth there, but learned a lot. To implement gradient descent I actually used the ad library and had to rewrite the algorithms without hmatrix so they would work on simple lists. Given that a couple of weeks ago I didn't even understand what AD (automatic differentiation) was, I'm happy I got some stuff to work. I was lucky to find some good code in Python, even when trying to understand the under-specified RMSProp algorithm.
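To show the shape of that list-based code, here is a toy example of gradient descent driven by ad's grad; the cost function below is a made-up stand-in (the real one runs the network over the training text), and descend is a hypothetical name:

import Numeric.AD (grad)

-- Toy stand-in for the network cost: squared distance to fixed targets.
cost :: Num a => [a] -> a
cost ps = sum [ (p - t) ^ (2 :: Int) | (p, t) <- zip ps targets ]
  where targets = map fromIntegral [1 :: Int, 2, 3]

-- One step of plain gradient descent over a flat list of parameters,
-- with the gradient computed by the ad library.
descend :: Double -> [Double] -> [Double]
descend lr ps = zipWith (\p g -> p - lr * g) ps (grad cost ps)

RMSProp keeps this structure but scales each parameter's update using a running average of its squared gradients.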

The end result is not great, though. My network can learn the really complex sentence "Hello world!" and regenerate it after a couple of thousand iterations of the learning algorithm, in around a minute on my machine. I haven't managed to parallelize the processing yet, and running on all my cores instead of just one makes performance a lot worse. On more complex sentences performance becomes really bad, as the network gets bigger (more characters to deal with, etc.) and there is more data to evaluate.

Trying to use a sparser representation for characters, with 2 values set instead of the single 1 of the standard 1-of-k encoding (sketched below), didn't work well. I also realized that tweaks to the cost function had a huge impact on how well and how fast the network can learn. I have started to look at cutting the data into batches, but I think I need to make learning on simple data as fast as possible before moving on.
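For reference, standard 1-of-k encoding gives each character a vector that is all zeros except for a single 1, built from the training text by something like this hypothetical helper (again, not code from my repository):

import qualified Data.Map as M
import Data.List (nub)

-- Build a 1-of-k encoding: each distinct character of the text maps to a
-- vector with a single 1 at its own index.
oneOfK :: String -> M.Map Char [Double]
oneOfK text = M.fromList
  [ (c, [ if i == j then 1 else 0 | j <- [0 .. k - 1] ])
  | (i, c) <- zip [0 ..] alphabet ]
  where
    alphabet = nub text
    k        = length alphabet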

So, nothing earth-shattering, no great progress or insight, but I feel I learned a lot by implementing my own stuff and running into all the issues that are probably well known to people familiar with these subjects. My artificial intelligence is still pretty dumb, but hopefully my own natural intelligence has progressed a bit!

The code is on GitHub of course, so if anybody wants to have a look, I'll gladly take any feedback!