Design a system to identify companies that are stealing TechCrunch's articles and republishing them on their own websites. Cover the logical architecture, the technology choices, and the performance and efficiency of the system.
Input: a file containing the home page URLs of the suspected companies.
Output: Your program should crawl the given websites, find pages that resemble pages hosted on TechCrunch, and generate an overall score that indicates the degree of similarity.
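
One possible way to make the "overall score" concrete is sketched below. It is only an illustration under stated assumptions, not a prescribed solution: it assumes page text has already been fetched and stripped of markup, it measures resemblance with word-level shingles and Jaccard similarity, and it defines the per-site score as the fraction of crawled pages whose best match against any TechCrunch article meets a threshold. The function names, the shingle size `k=5`, and the `threshold=0.5` cutoff are all hypothetical choices.

```python
import re


def shingles(text, k=5):
    """Return the set of k-word shingles in text (lowercased, punctuation dropped)."""
    words = re.findall(r"[a-z0-9]+", text.lower())
    if len(words) < k:
        # Short texts become a single shingle so they can still be compared.
        return {tuple(words)} if words else set()
    return {tuple(words[i:i + k]) for i in range(len(words) - k + 1)}


def jaccard(a, b):
    """Jaccard similarity of two shingle sets; 0.0 if either set is empty."""
    if not a or not b:
        return 0.0
    return len(a & b) / len(a | b)


def site_score(site_pages, reference_pages, k=5, threshold=0.5):
    """Fraction of site pages that resemble at least one reference article.

    site_pages and reference_pages are lists of plain-text page bodies.
    The threshold is an assumed cutoff and would need tuning on real data.
    """
    reference_sets = [shingles(text, k) for text in reference_pages]
    if not site_pages or not reference_sets:
        return 0.0
    flagged = 0
    for page in site_pages:
        page_set = shingles(page, k)
        best = max(jaccard(page_set, ref) for ref in reference_sets)
        if best >= threshold:
            flagged += 1
    return flagged / len(site_pages)
```

This pairwise comparison costs time proportional to the number of site pages times the number of reference articles, so it only suits small inputs. At crawl scale, the same shingle sets would typically be compressed into MinHash signatures and bucketed with locality-sensitive hashing so that each page is compared only against likely matches.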