I’m currently building a tool that helps me curate information from the web. Right now it’s restricted to Twitter; however, more sources will come.
One of the many problems I’ve faced is avoiding duplicate URLs. The core issue is that URLs change over time, and they sometimes contain junk.
I haven’t handled every possible scenario yet, but I’d like to show you a code snippet of what I have so far.
```csharp
public static class UrlHelper
{
    // …
}
```
So what does this code do? It exposes two methods: one for filtering a URL and one for generating a string ID that looks very similar to a GUID. The filter reorders the query-string parameters alphabetically and removes any “utm_*” parameters. Those parameters are normally used for Google Analytics campaigns and only add noise to our data, so they can safely be stripped out.
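Since the full snippet above got truncated, here’s a minimal sketch of how such a filter might look. The method name `Filter` is my own, and I’m assuming the classic `HttpUtility.ParseQueryString` helper from `System.Web` is available; the author’s actual implementation may differ.

```csharp
using System;
using System.Linq;
using System.Web; // HttpUtility.ParseQueryString

public static class UrlHelper
{
    // Sort the query-string parameters alphabetically and drop any "utm_*" keys.
    public static string Filter(string url)
    {
        var uri = new Uri(url);
        var query = HttpUtility.ParseQueryString(uri.Query);

        var kept = query.AllKeys
            .Where(k => k != null &&
                        !k.StartsWith("utm_", StringComparison.OrdinalIgnoreCase))
            .OrderBy(k => k, StringComparer.Ordinal)
            .Select(k => $"{k}={Uri.EscapeDataString(query[k])}");

        var builder = new UriBuilder(uri) { Query = string.Join("&", kept) };
        return builder.Uri.ToString();
    }
}
```

Sorting the parameters matters as much as stripping the noise: `?b=2&a=1` and `?a=1&b=2` point to the same resource, and normalizing the order makes them hash to the same ID later.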
As for generating the ID, I simply compute an MD5 hash of the filtered URL, which I convert into hexadecimal without separators.
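A sketch of that step could look like the following; again, the method name `GenerateId` is my assumption, not necessarily what the original class uses.

```csharp
using System;
using System.Security.Cryptography;
using System.Text;

public static class UrlHelper
{
    // Hash the (already filtered) URL with MD5 and render the 16 bytes
    // as 32 hex characters: effectively a GUID without the dashes.
    public static string GenerateId(string url)
    {
        using (var md5 = MD5.Create())
        {
            byte[] hash = md5.ComputeHash(Encoding.UTF8.GetBytes(url));
            return BitConverter.ToString(hash)
                .Replace("-", "")
                .ToLowerInvariant();
        }
    }
}
```

Note that MD5 is fine here because we only need a stable, collision-unlikely key for deduplication, not cryptographic security.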
One GUID to go you say? You got it.