Curation Tool – Creating a unique GUID from a Uri

I’m currently building a tool that helps me curate information from the web. Right now it’s restricted to Twitter, but more sources will come.

One of the many problems I’ve faced is avoiding duplicate URLs. The main difficulty is that URLs change and sometimes contain junk, so two different strings can point to the same page.

I haven’t handled every possible scenario yet, but I would like to show you a code snippet of what I have so far.

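using System;
using System.Collections.Generic;
using System.Linq;
using System.Security.Cryptography;
using System.Text;

public static class UrlHelper
{
    // Returns the MD5 hash of the URL as 32 lowercase hexadecimal characters.
    public static string GenerateId(Uri url)
    {
        using (var hashingAlgorithm = MD5.Create())
        {
            var computedHash = hashingAlgorithm.ComputeHash(Encoding.UTF8.GetBytes(url.ToString()));
            return BitConverter.ToString(computedHash).Replace("-", "").ToLowerInvariant();
        }
    }

    // Returns the URL with its query string sorted by name and stripped of
    // utm_* parameters, or null if the URL cannot be rebuilt.
    public static Uri FilterQueryString(Uri url)
    {
        try
        {
            var builder = new UriBuilder(url);

            var parameters = ExtractParameters(builder.Query);
            var queryString = CreateQueryString(parameters);
            builder.Query = queryString;

            return builder.Uri;
        }
        catch
        {
            return null;
        }
    }

    private static string CreateQueryString(IEnumerable<QueryStringParameter> parameters)
    {
        var pairs = parameters
            .Where(t => !t.Name.StartsWith("utm_", StringComparison.OrdinalIgnoreCase))
            .OrderBy(t => t.Name, StringComparer.Ordinal)
            .Select(t => string.Format("{0}={1}", t.Name, t.Value));

        // Join with "&" so the result has no leading or trailing separator.
        return string.Join("&", pairs);
    }

    private static IEnumerable<QueryStringParameter> ExtractParameters(string query)
    {
        // UriBuilder.Query may include the leading "?"; drop it before splitting.
        string withoutQuestionMark = query;
        if (withoutQuestionMark.StartsWith("?", StringComparison.Ordinal))
            withoutQuestionMark = withoutQuestionMark.Remove(0, 1);
        if (!string.IsNullOrWhiteSpace(withoutQuestionMark))
        {
            string[] nameValues = withoutQuestionMark.Split('&');
            foreach (var nameValue in nameValues)
            {
                // Split on the first "=" only, so values that contain "=" survive.
                // Parameters without a value are skipped.
                var pairSplitted = nameValue.Split(new[] { '=' }, 2);
                if (pairSplitted.Length == 2)
                    yield return new QueryStringParameter(pairSplitted[0], pairSplitted[1]);
            }
        }
    }
}

// Assumed minimal shape of QueryStringParameter, which is not shown here.
public class QueryStringParameter
{
    public QueryStringParameter(string name, string value)
    {
        Name = name;
        Value = value;
    }

    public string Name { get; private set; }
    public string Value { get; private set; }
}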

So what does this code do? It exposes two methods: one filters the query string, and the other generates a string ID that looks very much like a GUID. The filter sorts the query string parameters alphabetically by name and removes any “utm_*” parameters. Those are normally used for Google Analytics campaign tracking and only add noise to our data, so they can safely be stripped out.
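To make the effect of the filter concrete, here is a minimal standalone sketch of the same idea applied to a raw query string. The class and method names (`FilterSketch`, `NormalizeQuery`) are hypothetical and not part of `UrlHelper`; it assumes parameters are separated by `&` and that key-only parameters can be dropped.

```csharp
using System;
using System.Linq;

// Hypothetical helper for illustration only; not part of UrlHelper.
public static class FilterSketch
{
    // Drops utm_* parameters and sorts the remaining ones by name.
    public static string NormalizeQuery(string query)
    {
        var pairs = query.TrimStart('?')
            .Split(new[] { '&' }, StringSplitOptions.RemoveEmptyEntries)
            .Select(p => p.Split(new[] { '=' }, 2))      // split on the first "=" only
            .Where(kv => kv.Length == 2)                  // assumption: skip key-only parameters
            .Where(kv => !kv[0].StartsWith("utm_", StringComparison.OrdinalIgnoreCase))
            .OrderBy(kv => kv[0], StringComparer.Ordinal)
            .Select(kv => kv[0] + "=" + kv[1]);

        return string.Join("&", pairs);
    }
}
```

With this sketch, `?utm_source=twitter&b=2&a=1` and `?a=1&b=2` both normalize to `a=1&b=2`, which is the property that lets two differently decorated links be treated as the same URL.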

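As for generating the ID, I just compute an MD5 hash of the URL string and convert it to lowercase hexadecimal without separators. MD5 produces 128 bits, the same size as a GUID, which is why the result looks like one.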
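If an actual `Guid` value is preferred over a string, the 32 hex characters can be parsed with the "N" format. The following is only a sketch under that assumption; `GuidSketch` and `ToGuid` are hypothetical names, and the result is just the 128 hash bits wrapped in a `Guid`, not a standards-compliant versioned GUID.

```csharp
using System;
using System.Security.Cryptography;
using System.Text;

// Hypothetical helper for illustration only; not part of UrlHelper.
public static class GuidSketch
{
    public static Guid ToGuid(Uri url)
    {
        using (var md5 = MD5.Create())
        {
            // Same steps as the ID generation: hash the URL string, render as hex.
            byte[] hash = md5.ComputeHash(Encoding.UTF8.GetBytes(url.ToString()));
            string hex = BitConverter.ToString(hash).Replace("-", "").ToLowerInvariant();

            // "N" means 32 hex digits with no dashes or braces.
            return Guid.ParseExact(hex, "N");
        }
    }
}
```

Parsing the hex string, rather than passing the raw bytes to the `Guid(byte[])` constructor, keeps `ToString("N")` identical to the hex ID, because that constructor reads the first three fields in little-endian order.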

One GUID to go, you say? You got it.