Madhavan Mukund

Assignment 1: Frequent itemsets

8 Feb, 2025
Due 23 Feb, 2025 (originally 21 Feb)

The Bag of Words Dataset

The Bag of Words Data Set in the UCI Machine Learning Repository contains five text collections.

In each of these text collections, each document is summarized as a bag (multiset, set with multiplicities) of words. The individual documents are identified by document IDs and the words are identified by word IDs.

After some cleaning up, in each collection the vocabulary of unique words has been truncated to only keep words that occurred more than ten times overall in that collection.

For each collection XYZ:

- vocab.XYZ.txt is the vocabulary file, listing all words that appear in the collection XYZ, one word per line. Each word has an implicit wordID that is its line number in this file, starting with 1 (the word on line 1 has wordID 1, the word on line 2 has wordID 2, ...).
- docword.XYZ.txt lists the number of times each word in vocab.XYZ.txt occurs in each document (only non-zero counts are recorded).
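As an illustration of the implicit wordID convention, the following minimal Python sketch maps line numbers to words. The function name `load_vocab` is hypothetical, and the sketch assumes the vocabulary is supplied as an iterable of lines (an open file object works).

```python
def load_vocab(lines):
    """Map each wordID (1-based line number) to the word on that line.

    `lines` is any iterable of text lines, e.g. an open vocab.XYZ.txt file.
    """
    return {word_id: line.strip() for word_id, line in enumerate(lines, start=1)}


# Example with an in-memory stand-in for a vocabulary file.
vocab = load_vocab(["apple\n", "banana\n", "cherry\n"])
```

With a real file, this could be called as `load_vocab(open("vocab.kos.txt", encoding="utf-8"))`, where the file name follows the vocab.XYZ.txt pattern described above.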

The file docword.XYZ.txt begins with 3 header lines:
```
     D
     W
     NNZ
```
where D is the number of documents in the collection, W is the number of words whose frequency is counted (i.e., W is the number of words in vocab.XYZ.txt), and NNZ is the number of non-zero frequency entries for this collection (i.e., NNZ is 3 less than the number of lines in docword.XYZ.txt).

This is followed by NNZ lines of the form
```
     docID wordID count
```
where count is the number of times the word with ID wordID appears in the document with ID docID. Remember that only non-zero counts are recorded.
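The file layout described above can be read with a few lines of Python. This is a minimal sketch, not a required approach: the function name `read_docword` is hypothetical, and it assumes that only the presence of a word in a document matters for itemset mining, so the counts are read but discarded.

```python
def read_docword(lines):
    """Parse docword.XYZ.txt content into (D, W, NNZ, docs).

    `lines` is any iterable of text lines (an open file works).
    `docs` maps each docID to the set of wordIDs occurring in it.
    """
    it = iter(lines)
    # The three header lines: D, W, NNZ.
    D, W, NNZ = int(next(it)), int(next(it)), int(next(it))
    docs = {}
    entries = 0
    for line in it:
        parts = line.split()
        if not parts:          # tolerate blank lines, e.g. at end of file
            continue
        doc_id, word_id, _count = map(int, parts)
        docs.setdefault(doc_id, set()).add(word_id)
        entries += 1
    if entries != NNZ:
        raise ValueError(f"expected {NNZ} entries, found {entries}")
    return D, W, NNZ, docs


# Tiny in-memory example: 2 documents, 3 words, 3 non-zero entries.
sample = ["2\n", "3\n", "3\n", "1 1 4\n", "1 3 1\n", "2 2 7\n"]
D, W, NNZ, docs = read_docword(sample)
```

Documents with no recorded words do not appear in `docs`, so `len(docs)` may be smaller than D.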

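Information about the datasets in the repository

For each collection, D is the number of documents, W is the number of words in the vocabulary, and N is the total number of words in the collection (the sum of all counts, which is not the same as NNZ).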

Enron Emails:

     orig source: www.cs.cmu.edu/~enron
     D=39861
     W=28102
     N=6,400,000 (approx)

NIPS full papers:

     orig source: books.nips.cc
     D=1500
     W=12419
     N=1,900,000 (approx)

KOS blog entries:

     orig source: dailykos.com
     D=3430
     W=6906
     N=467714

NYTimes news articles:

     orig source: ldc.upenn.edu
     D=300000
     W=102660
     N=100,000,000 (approx)

PubMed abstracts:

     orig source: www.pubmed.gov
     D=8200000
     W=141043
     N=730,000,000 (approx)

The Task

Your task is to compute frequent itemsets for this data. As usual, a K-itemset of words is a collection of words of size K that occur together in the same document. Write a program to find all K-itemsets of words occurring with frequency F, where K and F are parameters to your program.
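One possible starting point is a level-wise (Apriori-style) search, sketched below. This is only one of several suitable techniques and is not prescribed by the assignment. The sketch assumes that the frequency of an itemset is the number of documents containing all of its words, and that "frequency F" means "at least F documents". The function name `frequent_itemsets` is hypothetical.

```python
from collections import Counter
from itertools import combinations


def frequent_itemsets(docs, K, F):
    """Return {itemset: support} for all K-itemsets with support >= F.

    `docs` is an iterable of sets of wordIDs; itemsets are sorted tuples.
    """
    docs = [frozenset(d) for d in docs]
    # Level 1: count how many documents contain each word.
    counts = Counter(w for d in docs for w in d)
    frequent = {(w,): c for w, c in counts.items() if c >= F}
    for k in range(2, K + 1):
        if not frequent:
            break
        prev = set(frequent)
        # Only words that survive in some frequent (k-1)-itemset can matter.
        keep = {w for items in prev for w in items}
        counts = Counter()
        for d in docs:
            words = sorted(d & keep)
            for cand in combinations(words, k):
                # Apriori pruning: every (k-1)-subset must itself be frequent.
                if all(sub in prev for sub in combinations(cand, k - 1)):
                    counts[cand] += 1
        frequent = {c: n for c, n in counts.items() if n >= F}
    return frequent


# Toy example with four documents over wordIDs 1..3.
toy_docs = [{1, 2, 3}, {1, 2}, {2, 3}, {1, 2, 3}]
pairs = frequent_itemsets(toy_docs, K=2, F=3)
triples = frequent_itemsets(toy_docs, K=3, F=2)
```

Enumerating all k-combinations inside each document is simple but grows quickly with document length, so on the larger collections a more careful candidate-generation step would likely be needed.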

The datasets are of different sizes. Report your results on the three smaller datasets (Enron emails, NIPS full papers, KOS blog entries) for different values of K and F. In addition to the actual output, report the time it took to complete the job and, in case your program did not terminate for a given dataset and combination of K and F, report how long you tried before you gave up.
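Running times can be recorded with a small wrapper such as the one below. This is a minimal sketch using the standard library; the helper name `timed` is hypothetical. It does not enforce a time limit, so a run that does not terminate still has to be interrupted by hand and its elapsed time noted separately.

```python
import time


def timed(fn, *args, **kwargs):
    """Call fn(*args, **kwargs) and return (result, elapsed_seconds)."""
    start = time.perf_counter()
    result = fn(*args, **kwargs)
    elapsed = time.perf_counter() - start
    return result, elapsed


# Example: time a trivial computation.
result, elapsed = timed(sum, range(1000))
```

The same wrapper can be applied to each combination of dataset, K, and F to build the table of timings for the report.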

Solving the Task

You can use any programming language, including Python and R. You can make use of standard packages for analytics and machine learning. Clearly document any external packages used by your code.
Submit the following via Moodle, as a Jupyter notebook if you are using Python and as a single archive (zip, tar.gz, …) otherwise:
- The code you used to solve the assignment.
- If you have voluminous output to report, save it somewhere on the cloud and provide a link.
- A short write-up describing how your code ran on the datasets: the parameters used, time taken, space required, and anything else of interest.
You may work alone or in groups of two. Each group makes a single submission to Moodle. Use either person's Moodle account to submit. The submission should mention the names of both partners.
There will be a short oral presentation and question/answer session for each submission.

Data Mining and Machine Learning,
Jan-Apr 2025
