I also tried mining Wikipedia, but I found it's much more reliable to use the categories of the movies instead of the links.
The links are very often unrelated: "unlike in the movie X", "who by then had already become famous playing X in movie Z", etc. Instead of using the entire article, I only used links from the very first sentence ("X is an Italian 1984 drama starring Z, etc..."). These links could be connected to categories.
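To illustrate the first-sentence idea, here is a rough sketch, assuming the article is available as raw wikitext. The function name `first_sentence_links` and the naive sentence split are only for illustration, not the exact code used here.

```python
import re

# Matches [[Target]], [[Target|label]] and [[Target#Section|label]],
# capturing only the target page title.
LINK_RE = re.compile(r"\[\[([^\]|#]+)(?:#[^\]|]*)?(?:\|[^\]]*)?\]\]")

def first_sentence_links(wikitext):
    """Return the link targets found in the first sentence of an article."""
    # Naive split: cut at the first period followed by whitespace.
    # Abbreviations such as "Dr. " would cut too early; this is only a sketch.
    match = re.search(r"\.\s", wikitext)
    first = wikitext[: match.start() + 1] if match else wikitext
    return [target.strip() for target in LINK_RE.findall(first)]

sample = (
    "'''X''' is a 1984 [[Cinema of Italy|Italian]] [[drama film]] "
    "starring [[Z]]. Unlike in [[Movie Y]], the plot is simple."
)
links = first_sentence_links(sample)
```

In the sample, the link to "Movie Y" sits in the second sentence and is ignored, which is exactly the kind of unrelated mention this filter is meant to drop.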
The problem I didn't expect was that the "hierarchy" of the categories is a mess. Seems easy in concept: "comedy-drama films" is a subcategory of both "comedy films" and "drama films", so you can use all the parent categories. In practice you get to completely unrelated categories in a few steps. The biggest challenge was cleaning this mess up algorithmically. Once that was done, I had a very nice taxonomy of the movies.
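A minimal sketch of how the drift happens and one simple way to limit it, assuming the category graph is available as a plain dict. The `PARENTS` data and the `ancestors` function are made up for illustration, and a depth cap is only one possible cleanup, not necessarily the one used here.

```python
from collections import deque

# Hypothetical toy graph: category -> list of parent categories.
PARENTS = {
    "comedy-drama films": ["comedy films", "drama films"],
    "comedy films": ["films by genre", "comedy"],
    "drama films": ["films by genre", "drama"],
    "comedy": ["humour"],
    "humour": ["emotions"],
}

def ancestors(category, parents, max_depth=2):
    """Breadth-first walk up the parent links, at most max_depth steps.

    Returns a dict mapping each ancestor to its distance from `category`.
    """
    seen = {category: 0}
    queue = deque([category])
    while queue:
        current = queue.popleft()
        depth = seen[current]
        if depth == max_depth:
            continue  # do not climb any higher from here
        for parent in parents.get(current, []):
            if parent not in seen:  # also protects against cycles
                seen[parent] = depth + 1
                queue.append(parent)
    del seen[category]
    return seen

near = ancestors("comedy-drama films", PARENTS, max_depth=1)
far = ancestors("comedy-drama films", PARENTS, max_depth=4)
```

With `max_depth=1` only "comedy films" and "drama films" come back; with `max_depth=4` the walk already reaches "emotions", which says nothing useful about the movie.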
The result didn't come anywhere close to the top contestants, but it was a very interesting learning experience.