Genetic Algorithms – Basic Implementation

Standard

After I used Python and Scikit-Learn for Twitter Sentiment analysis – which worked quite well – I started to read and learn about Genetic Algorithms.

Genetic Algorithms (GA) provide a very efficient way to search for optimum solutions in a very huge space of possible solutions. This does not mean that a GA always finds the best solution out there, but it finds a solutions that is working.
Of course you could use brute force to loop through all possible solutions and find the best one. But with large search spaces this would lead to a veeeeeery long runtime.

Example:
Imagine you want to find a certain number with a certain number of 1’s and 0’s (only 0 and 1 is allowed for each digit).
Wanted solution / number is 10 digits with only 1’s (or any other combination of 1’s and 0’s). This would mean 2^10 = 1024 options. Lets say each option takes 1ms to test. Then the overall time that a program needs would be ~1 second.
If you wanted 1000 digits with only 1’s then that means 2^1000 = 1.072 * 10^301 combinations. If each combination would also take 1ms then the overall time needed would be 3.398 * 10^290 years !!!
You see, brute force does not work – at least in my lifetime.

Genetic Algorithms are inspired by Darwin’s theory about evolution. That means GA uses crossover and mutations for finding an optimum solution to your problem.

The core functionality for deciding which partial solution (sometimes called individual or creature) should be used for further reproduction is the calculation of the fitness for each individual (i.e. how high can an individual jump, how fast can it go, or how closed does it come to a certain point, etc.)

Each individual has chromosomes which are encoded by genes.

I started with a fairly easy problem – finding an individual that creates the string “danielkoitzsch” (remember: brute force would be 14^26 possible gene combinations)

 

The basic idea of the algorithms is:

  1. [Start] Generate random population of n chromosomes (suitable solutions for the problem)
  2. [Fitness] Evaluate the fitness f(x) of each chromosome x in the population
  3. [New population] Create a new population by repeating following steps until the new population is complete
    1. [Selection] Select two parent chromosomes from a population according to their fitness (the better fitness, the bigger chance to be selected)
    2. [Crossover] With a crossover probability cross over the parents to form a new offspring (children). If no crossover was performed, offspring is an exact copy of parents.
    3. [Mutation] With a mutation probability mutate new offspring at each locus (position in chromosome).
    4. [Accepting] Place new offspring in a new population
  4. [Replace] Use new generated population for a further run of algorithm
  5. [Test] If the end condition is satisfied, stop, and return the best solution in current population
  6. [Loop] Go to step 2

My main loop looks like:

 

The crossover function:

 

The mutation function:

 

And finally the fitness calculation function (this one is pretty simple !):

 

Here you can download my source files: GeneticAlgorithm

Twitter Sentiment Analysis With Python+Sklearn

Standard

My next step in learning Machine Learning was to implement a Twitter Sentiment Analysis. I wanted to do a 3-class classification with “negative”, “neutral” and “positive”. Therefore it was much better to use a RandomForestClassfier instead of a SVM which is optimized for 2-class classification.

Before you can start you need the following data:
a) List of stop words (i.e. “and”, “because”, “it”, etc. – words that do not have any meaning)
b) List of slang words
c) List of affinity words
d) List of emoticons (i.e. “:-)”, etc.)
e) Set of negative, neutral and positive tweets (for training the model)

Then you can implement the sentiment analysis like this:
1) Download unclassified tweets
2) Pre-process tweets (remove stop words, replace slang words, …)
3) Create bag of words set (from the negative, neutral and positive tweets)
3a) Optional: you can do a k-cross validation to find the best attributes for training the model
4) Train the model
5) Classify the downloaded tweets
5a) Optional: store the classified tweets

You can download my project files here: Twitter Sentiment

You would need Scikit-Learn and NLTK to run the code. I use Eclipse+PyDev+Anaconda3 as development environment.

Next steps: parallelize download of tweets and processing / classifying them.

Python Machine Learning with Iris Dataset

Standard

I recently started to work with Python Scikit-Learn.
My first program was a classification of Iris flowers – as this is usually the first start for everyone 😉

I think it’s quite a good idea to start by just using the code and libraries as your tool. Do not try to understand how Machine Learning works internally. That might be frustrating in the beginning. Especially when it comes to statistics and probabilities and so on and so forth…
You should rather concentrate on the goals you want to achieve – i.e. predict the class of unknown iris flowers.
When your progamm is working and you have a lot of time left, then you could start to think about the algorithm internals.
For getting an idea how the algorithms work and when to use which one take a look at the UserGuide from SciKit-Learn.

Below you can find my code and the dataset I’m using. I got the dataset from the UCI Machine Learning Repository (Iris dataset).

The file “iris.data” with the data: iris

Visual Comparison of Dijkstra and A-Star

Standard

I finalized the beta-version of my routing and OpenStreetMap (OSM) visualization tool.
The first idea was to see how the Dijkstra and A-Star algorithms work. The next idea was to use OSM data to render the roadnetwork and buildings.

The following videos and pictures show the different working style of Dijkstra and A-Star.

A-Star Routing Algorithm (Download Video)

Dijkstra Routing Algorithm (Download Video)

Pictures of the A-Star algorithm:

Pictures of the Dijkstra algorithm:

Passed AZURE 70-533 Exam !!

Standard

Finally, I passed the Azure Exam 70-533 yesterday :-)

I never had anything to do with Azure before. But working with Azure and learning the techniques is really awesome and absolutely interesting!

I’ve been preparing for several weeks in order to get the things in my brain and to use Azure.

How did I prepare?

I’ve used several online sources and of course the free Azure subscription trial (30 days is fair enough for practicing).
Watch all the movies on Microsoft Virtual Academy (MVA), MSDN, Technet, YouTube, etc.
Then of course work through the Ebook (see below).
Use the documentation from the Azure homepage (Azure).
And last but not least – most important – practice, practice, practice, … with your Azure subscription.

My main sources:
Ebook “Exam Ref 70-533 Implementing Microsoft Azure Infrastructure Solutions”
Official Azure Documentation
JustAzure
Troy Hunt
Channel 9
Azure Friday (Channel 9)
Cloud Ranger YouTube VideosLook what others did…

And finally, register for the exam and plan it. Then you have a date and you have to get ready!
Exam 70-533 registration

 

My summary: AZURE ROCKS !!!