Some time ago an interesting article surfaced on Reddit, about using recurrent neural networks to generate plausible looking text. Andrej did a very good job in explaining how they work and some of the techniques and algorithms he used. I thought this was worth some explorations on my own, so I did a bit of research and tried to implement my own networks in Haskell.
I implemented the basics using the hmatrix package to have vectors and matrices. I went to Khan Academy to learn more about these topics because I had stopped math in school before learning matrices. Once I managed to implement hopefully correctly the algorithm I used the simple-genetic-algorithm package (well, my fork, uploaded on hackage) to evolve the network via genetic selection and mutation.
This gave some results, especially when I fixed my mutation code, that in fact was doing nothing, thus emphasizing again that mutation is a critical part of genetic algorithms, and not just crossovers.
Then I went to implement proper learning algorithms, a bit less random than genetic evolution. I was clearly out of my depth there, but learned a lot. To implement gradient descent I actually used the ad library and had to rewrite the algorithms without hmatrix so they would work on simple lists. Given that a couple of weeks again I didn't even understand what AD was, I'm happy I got some stuff to work. I was lucky to find some good code in python, even trying to understand the under-specified RMSProp algorithm.
The end result is not great, though. My network can learn the really complex sentence "Hello world!" and regenerate it after a couple of thousand iterations of the learning algorithm, in around a minute on my machine. I haven't managed to parallelize the treatment yet, and running on all my cores instead of just one make performance a lot worse. On more complex sentences performance becomes really bad, as both the network becomes bigger (more characters to deal with, etc) and the data to evaluate is bigger.
Trying to use a sparse representation for characters using 2 values instead of 1 as in the standard 1-of-k encoding didn't work well. I also realize that tweaks in the cost function had a huge impact on how well and how fast the network can learn. I have started to look at cutting the data in batches but I think I need to make the learning on simple data as fast as possible before moving on.
So, nothing earth shattering, no great progress or insight, but I feel I learned a lot by implementing my own stuff and running into all the issues that are probably well known of people familiar with these subjects. My artificial intelligence is still pretty dumb, but hopefully my own natural intelligence has progressed a bit!
The code is on github of course, so if anybody wants to have a look, I'll gladly take any feedback!!