Wednesday, October 20, 2010

Stribe

It has been quiet on my blog for some time.
That's because I've been busy with my biggest and most important project so far: Stribe.
This article is meant to tell a little about the project and the company behind it, and above all to shine some light on the technical details.

Stribe is a company I started together with a good friend under the wing of my former employer Colours.
In January I started development of a platform for social shopping: a website that offers a huge catalog of products from lots of big webshops and allows users to create profiles showing what they like and share them with friends.
Users could make wishlists while shopping, compare prices between shops and use tons of other functionality... that we eventually did not use in the end result... :P

As it goes with these kinds of projects, we changed the concept along the way and eventually ended up with two very cool websites:

Stribe.nl
Offers discounts on online fashion for a huge number of brands and webshops.
We also organize "collective discount" actions where users receive a higher discount if more people sign up for the action.
Users create a profile that reflects their brand and shop preferences, so that they only receive updates for discounts that are relevant to them.
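
To give an idea of how such a collective discount could be calculated, here is a minimal sketch. The tier thresholds and percentages are made up for the example, and DiscountTier and CollectiveDiscount are hypothetical names, not the production code:

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical tier: once MinSignups people have signed up, Percentage applies
public class DiscountTier
{
    public int MinSignups { get; set; }
    public decimal Percentage { get; set; }
}

public static class CollectiveDiscount
{
    // Returns the highest percentage among the tiers whose threshold has been reached,
    // or 0 when no tier has been reached yet
    public static decimal PercentageFor(int signups, IEnumerable<DiscountTier> tiers)
    {
        return tiers
            .Where(t => signups >= t.MinSignups)
            .Select(t => t.Percentage)
            .DefaultIfEmpty(0m)
            .Max();
    }
}
```

With tiers at 0, 50 and 200 sign-ups giving 10, 15 and 25 percent, an action with 120 sign-ups would yield 15 percent.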

Boetiek.nl
Is basically an online catalog that holds all fashion products of over 25 webshops (and growing), which comes down to over 100,000 products.
It's an easy tool for those who are looking to buy fashion online and don't want to search through all those websites individually.

Ok, so that's the functional (for us tech-geeks, boring) part of the story.
What is behind these sites that makes this all possible?

We use:
- .NET 4.0 Framework, developing in Visual Studio 2010
- C#, MVC 2, jQuery
- MS SQL Server
- Spark View Engine
- Xapian search engine
- NHibernate (2.something, I think)
- Fluent NHibernate
- NHibernate Lambda Extensions
- NUnit
- Umbraco (for very basic content storage)
- GitHub for source control with GitExtensions as the client
- FogBugz for issue tracking
- Yourzine's Footprint for outgoing mailings
- a lot more small helper utilities

.NET 4.0 and MVC 2 were still in beta when we started.
That's one big advantage of running your own project: you can decide to use technology so new that it hasn't even had its final release, and there is no one there to stop you! :)
Before the start I had no experience at all with MVC, which made it even more exciting!

I got to develop some pretty cool components during the project.
First there is the whole custom site framework, with connections to Umbraco to retrieve content (we did not use Umbraco's rendering, just its simple content storage) and of course to our SQL database (which at some point contained close to 100 tables).

Then there is the Importer Tool, a tool that imports all product feeds from the webshops we offer in our catalog and adds them to our database.
This also includes all related brands and product properties such as color, size, material, etc.
As most shops offer the feed in their own format, each one needs to be converted to our own standard and then added to our database. Finally, each product needs to be placed in a category, which was a huge challenge on its own.
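
The conversion step boils down to one converter per shop that maps the shop's own format to a shared product type. A minimal sketch, assuming a made-up shop that delivers a semicolon-separated feed; IFeedConverter, ImportedProduct and ExampleShopConverter are illustrative names and not the actual Importer Tool code:

```csharp
using System;
using System.Collections.Generic;
using System.Globalization;

// Our own standard format (simplified, hypothetical)
public class ImportedProduct
{
    public string Shop { get; set; }
    public string Name { get; set; }
    public string Brand { get; set; }
    public decimal Price { get; set; }
}

// One implementation per webshop feed format
public interface IFeedConverter
{
    IEnumerable<ImportedProduct> Convert(IEnumerable<string> feedLines);
}

// Assumed format for this example: name;brand;price (price with a decimal point)
public class ExampleShopConverter : IFeedConverter
{
    public IEnumerable<ImportedProduct> Convert(IEnumerable<string> feedLines)
    {
        foreach (string line in feedLines)
        {
            string[] fields = line.Split(';');
            decimal price;

            // Skip lines that do not match the expected layout instead of failing the whole import
            if (fields.Length != 3 ||
                !decimal.TryParse(fields[2], NumberStyles.Number, CultureInfo.InvariantCulture, out price))
            {
                continue;
            }

            yield return new ImportedProduct
            {
                Shop = "ExampleShop",
                Name = fields[0].Trim(),
                Brand = fields[1].Trim(),
                Price = price
            };
        }
    }
}
```

Everything after the converter (storing the product, assigning a category) then only has to deal with ImportedProduct, regardless of which shop the data came from.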

Another interesting task was implementing the Xapian search engine, which crashed when used in combination with the .NET Framework 4.0.
I solved this by using WCF to isolate the search service from the rest of the solution.
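
The idea is that the website only talks to a WCF contract, while the code that touches Xapian lives behind that contract in its own host. A minimal sketch, assuming the .NET Framework with a reference to System.ServiceModel; ISearchService, StubSearchService, the named-pipe binding and the address are placeholders chosen for the example, and the search itself is stubbed:

```csharp
using System;
using System.ServiceModel;

[ServiceContract]
public interface ISearchService
{
    [OperationContract]
    string[] Search(string query, int maxResults);
}

// Lives in the separate host process; stubbed here so the sketch is self-contained
public class StubSearchService : ISearchService
{
    public string[] Search(string query, int maxResults)
    {
        // The real implementation would query the Xapian index here
        return new string[0];
    }
}

public static class SearchHost
{
    // Hypothetical address, any free named pipe name will do
    public const string BaseAddress = "net.pipe://localhost/stribe";

    // Called by the host process (for example a Windows service or console application)
    public static ServiceHost Start()
    {
        ServiceHost host = new ServiceHost(typeof(StubSearchService), new Uri(BaseAddress));
        host.AddServiceEndpoint(typeof(ISearchService), new NetNamedPipeBinding(), "search");
        host.Open();
        return host;
    }

    // Called by the website to get a proxy that forwards calls to the host process
    public static ISearchService Connect()
    {
        ChannelFactory<ISearchService> factory = new ChannelFactory<ISearchService>(
            new NetNamedPipeBinding(),
            new EndpointAddress(BaseAddress + "/search"));
        return factory.CreateChannel();
    }
}
```

The website never references the search library directly this way, so a problem inside the search host cannot take the site's own process down with it.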

Last but not least, the automated build process.
With one push of a button I can build the solution and generate a deployable output that can be dragged and dropped onto the production server.
It takes care of merging and minifying the CSS and JavaScript files and of generating the correct configuration files for the production environment.
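
As an illustration of the merge-and-minify step, here is a deliberately naive sketch for CSS. It is not the build script itself: CssBundler is a hypothetical name, and the regular expressions do not handle special cases such as comment markers or braces inside quoted strings, which a real minifier does:

```csharp
using System;
using System.Collections.Generic;
using System.Text;
using System.Text.RegularExpressions;

public static class CssBundler
{
    // Concatenates the stylesheets in the given order and applies a very naive minification
    public static string MergeAndMinify(IEnumerable<string> stylesheets)
    {
        StringBuilder merged = new StringBuilder();
        foreach (string css in stylesheets)
        {
            merged.AppendLine(css);
        }

        string result = merged.ToString();

        // Strip /* ... */ comments, including multi-line ones
        result = Regex.Replace(result, @"/\*.*?\*/", "", RegexOptions.Singleline);

        // Collapse every run of whitespace (including line breaks) into one space
        result = Regex.Replace(result, @"\s+", " ");

        // Remove the spaces around braces, semicolons and commas
        result = Regex.Replace(result, @"\s*([{};,])\s*", "$1");

        return result.Trim();
    }
}
```

The order of the input matters, because later rules override earlier ones in CSS, so the files should be merged in the same order in which the pages used to include them.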

I could go on like this for hours, but you get the point: it was a project full of challenges, with complete freedom in how to solve them.
Every software engineer's wet dream 8-)

We're currently running the sites on two load-balanced Windows 2008 webservers and one dedicated SQL Server.
There is 24h monitoring that alerts me immediately when any of the systems fails.
I'm happy with the result so far and believe that we now have a steady base for future development.

If I get more time and feel like writing, I would like to go more into the specifics of some of these components and the choices I've made.
One subject that deserves some more attention is the Spark View Engine we used, which has a lot of potential (although I suspect that Razor might replace Spark altogether in the future...)
Fluent NHibernate and the NHibernate Lambda Extensions also made things so much easier for us.
Hopefully more about that in the future!
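
As a small taste of why Fluent NHibernate is pleasant to work with: mappings are written in C# instead of XML. A minimal sketch, assuming a reference to the FluentNHibernate assembly; the Brand and Product entities are simplified, hypothetical examples and not our actual classes:

```csharp
using FluentNHibernate.Mapping;

// NHibernate needs virtual members to be able to create lazy-loading proxies
public class Brand
{
    public virtual int Id { get; set; }
    public virtual string Name { get; set; }
}

public class Product
{
    public virtual int Id { get; set; }
    public virtual string Name { get; set; }
    public virtual decimal Price { get; set; }
    public virtual Brand Brand { get; set; }
}

public class BrandMap : ClassMap<Brand>
{
    public BrandMap()
    {
        Id(x => x.Id);
        Map(x => x.Name).Not.Nullable();
    }
}

public class ProductMap : ClassMap<Product>
{
    public ProductMap()
    {
        Id(x => x.Id);
        Map(x => x.Name).Not.Nullable();
        Map(x => x.Price);

        // Many-to-one: every product belongs to one brand
        References(x => x.Brand);
    }
}
```

Because the properties are referenced through lambdas, renaming a property is picked up by the compiler and by refactoring tools, which is not the case with string-based XML mappings.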

For now, I'd like to invite you to check out the end result and see for yourself.
If you have any questions, suggestions, free drinks or marriage proposals, feel free to reply!

/BruuD

Monday, January 25, 2010

Tokenizer / tag extraction method

I was in need of a method to extract a list of tokens from a given text.
The list should not contain stopwords like "the, a, on, of, etc" or punctuation marks, and needs to be sorted by most used token.
This is what I came up with:

using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;

public static class Tokenizer
{
    /// <summary>
    /// Tokenize a string and return the unique tokens as a list, sorted by most occurrences
    /// </summary>
    /// <param name="input">The text to extract tokens from</param>
    /// <returns>The unique lowercase tokens, most used token first</returns>
    public static List<string> TokenizeString(string input)
    {
        if (input == null)
        {
            throw new ArgumentNullException("input");
        }

        // Replace every character that is not a letter or digit with a space
        // (this includes tabs and line breaks, so the split below catches those too)
        StringBuilder cleanText = new StringBuilder(input.Length);
        foreach (char c in input)
        {
            cleanText.Append(Char.IsLetterOrDigit(c) ? c : ' ');
        }

        // We want all tokens in lowercase; split on spaces and skip the empty entries
        List<string> tokens = cleanText.ToString().ToLowerInvariant()
            .Split(new[] { ' ' }, StringSplitOptions.RemoveEmptyEntries)
            .ToList();

        // Remove stop words and tokens shorter than 3 chars from token list
        // NOTE: Configuration.StopWords is of type List<string> and contains the stopwords (loaded from a configuration file)
        tokens.RemoveAll(token => token.Length < 3 || Configuration.StopWords.Contains(token));

        // Count how many times each token occurs and keep each token only once,
        // so we have no double entries (uniqueTokens keeps the order of first appearance)
        Dictionary<string, int> counts = new Dictionary<string, int>();
        List<string> uniqueTokens = new List<string>();
        foreach (string token in tokens)
        {
            if (counts.ContainsKey(token))
            {
                counts[token]++;
            }
            else
            {
                counts[token] = 1;
                uniqueTokens.Add(token);
            }
        }

        // Sort the list on occurrence count, highest count first
        uniqueTokens.Sort((a, b) => counts[b].CompareTo(counts[a]));

        // At this point we have a list with unique tokens, sorted by most occurrences
        return uniqueTokens;
    }
}

Here is an example of its usage:

// Dummy input string
string input = "Google-oprichters Larry Page en Sergey Brin willen af van hun gedeelde meerderheid aan aandelen van het bedrijf. Uit een bericht van de Amerikaanse beurswaakhond SEC willen de twee de komende vijf jaar tien miljoen van hun Google aandelen verkopen. Brin en Page, nu nog voor 59 procent eigenaar, bezitten dan nog maar 48 procent van de aandelen. Met de huidige koers levert dat hun in totaal 5,5 miljard euro op. De Google oprichters kiezen bewust voor een geleidelijke afbouw om de aandelen op de beurs niet teveel onder de druk te zetten. Ondanks de aandelenverkoop zullen Brin en Page nog aardig hun stempel op het beleid kunnen drukken. Naast henzelf zijn voor besluiten slechts krap twee procent van de overige aandeelhouders nodig.";

List<string> tokens = Tokenizer.TokenizeString(input);

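This will result in 59 tokens, with the first 5 being the ones below. Note that google, brin, page, nog and procent each occur three times, and since List.Sort is not a stable sort, the order among such ties is not guaranteed: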

aandelen
google
brin
nog
procent


The list of stopwords that was loaded in Configuration.StopWords is: aan,naar,dat,nu,de,om,den,onder,der,ons,des,onze,deze,ook,die,op,dit,over,door,een,te,enige,tegen,enkele,ten,enz,ter,etc,tot,haar,uit,het,hierin,hoe,vanaf,hun,ik,vol,inzake,voor,is,wat,je,wie,na,zijn,u,uw,met,naar,jij,hij,zij,zonder,en,van,amp
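
For completeness, a minimal sketch of how such a comma-separated value can be turned into the List<string> the tokenizer expects. Reading the value from the configuration file is left out to keep the example self-contained, and StopWordParser is just an illustrative name:

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class StopWordParser
{
    // Splits a comma-separated list, trims and lowercases every entry,
    // and drops empty entries and duplicates (the list above contains "naar" twice)
    public static List<string> Parse(string commaSeparated)
    {
        return commaSeparated
            .Split(',')
            .Select(word => word.Trim().ToLowerInvariant())
            .Where(word => word.Length > 0)
            .Distinct()
            .ToList();
    }
}
```

Lowercasing the stopwords matters because the tokenizer compares them against tokens that have already been converted to lowercase.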

Because I already had a list of Dutch stopwords loaded, I have used a Dutch text in the example (sorry English readers :)
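
If a predictable order among tokens with the same count is important, the same idea can also be written with LINQ. This is an alternative sketch and not the method shown above: LinqTokenizer is a hypothetical name, the stopwords are passed in as a parameter instead of coming from Configuration, and the regular expression is my approximation of the letter-or-digit check. GroupBy yields the groups in order of first appearance and OrderByDescending is a stable sort, so ties stay in the order in which the tokens first occur in the text:

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;

public static class LinqTokenizer
{
    public static List<string> TokenizeString(string input, ICollection<string> stopWords)
    {
        // Split on every run of characters that are neither letters nor decimal digits
        return Regex.Split(input.ToLowerInvariant(), @"[^\p{L}\p{Nd}]+")
            // Drop short tokens (this also removes empty entries) and stopwords
            .Where(token => token.Length >= 3 && !stopWords.Contains(token))
            // One group per unique token, in order of first appearance
            .GroupBy(token => token)
            // Most used token first; equal counts keep their relative order
            .OrderByDescending(group => group.Count())
            .Select(group => group.Key)
            .ToList();
    }
}
```

Passing the stopwords as a HashSet<string> keeps the Contains check cheap, even for long stopword lists.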

/Ruud