Question: How do you detect duplicate documents?
Suppose you have over a million URLs.

Solution: Iterate through the pages and compute a hash of each page's content.
Check whether the hash value is already in a hash table. If it is, throw out the URL as a duplicate. If it is not, keep the URL and insert its hash value into the hash table.
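
The steps above can be sketched in Python roughly as follows. This is a minimal illustration, not a definitive implementation: SHA-256 is an assumed choice of hash function, a Python set stands in for the hash table, and the names content_hash, unique_urls, and fetch are hypothetical (fetch is whatever function returns a page's bytes for a URL). It only catches pages whose content is byte-for-byte identical.

```python
import hashlib
from typing import Callable, Iterable, List, Set


def content_hash(content: bytes) -> bytes:
    """Return a fixed-size digest of the page content (SHA-256 is an assumed choice)."""
    return hashlib.sha256(content).digest()


def unique_urls(urls: Iterable[str], fetch: Callable[[str], bytes]) -> List[str]:
    """Keep the first URL seen for each distinct page content.

    `fetch` is a hypothetical caller-supplied function mapping a URL to its bytes.
    """
    seen: Set[bytes] = set()   # the "hash table" of digests seen so far
    kept: List[str] = []
    for url in urls:
        digest = content_hash(fetch(url))
        if digest in seen:
            continue           # same content already seen: drop as a duplicate
        seen.add(digest)       # new content: remember its hash
        kept.append(url)       # and keep the URL
    return kept
```

For a quick check, fetch can be backed by an in-memory dictionary, for example pages = {"a": b"x", "b": b"x"} with fetch=pages.__getitem__, in which case only "a" is kept. Storing 32-byte digests rather than page contents keeps memory use modest even for a million URLs.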