A software framework for text mining algorithms

Author: František Dařena, frantisek.darena [at] gmail.com

When using the application, please cite it as: Žižka, J., Dařena, F. Automatic Sentiment Analysis Using the Textual Pattern Content Similarity in Natural Language. Lecture Notes in Artificial Intelligence, 2010, 6231, 1: 224--231. ISSN 0302-9743.

The algorithm

To classify textual data, the data must first be transformed into a representation suitable for the learning algorithm and the classification task. Textual data can be structured according to the level on which it is analyzed, from the sub-word level (decomposition of words and their morphology) to the pragmatic level (the meaning of the text with respect to context and situation). Ambiguities on each level can be resolved using the next higher level (e.g. the next level can help decide whether a word is a noun or a verb). Generally, the higher the level, the more details about the text are captured and the more complex the automatic creation of the representation becomes. In many cases, words are meaningful units of little ambiguity even without considering the context, and they are therefore the basis for most work in text classification. A big advantage of word-based representations is their simplicity and the straightforward process of their creation (Joachims, 2002).

In certain approaches, some of the words can be removed. These usually include words that are very rare or very common in all classes and therefore do not considerably reduce the uncertainty during classification. Very short words, e.g. those consisting of one or two characters, can be removed as well. However, such an approach might require a deeper analysis of the texts and might also depend on a particular language (Žižka, Dařena, 2010).
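The word removal described above can be sketched as follows. This is a minimal illustration and not the code of the module: the function name select_words and the sample sentences are hypothetical, the frequency is assumed to be the total number of occurrences over all documents, and the removal of very common words is not shown.

```python
from collections import Counter

def select_words(documents, min_length=3, min_frequency=2):
    """Keep words that are long enough and occur often enough overall."""
    # Assumption: frequency means total occurrences across all documents.
    counts = Counter(word for doc in documents for word in doc.lower().split())
    return {
        word
        for word, count in counts.items()
        if len(word) >= min_length and count >= min_frequency
    }

docs = ["the room was small", "the staff was helpful", "a small but clean room"]
print(sorted(select_words(docs)))
```

With these thresholds, words that occur only once or are shorter than three characters are dropped.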

The texts are simply transformed to a bag of words, a collection of words in which the ordering is irrelevant. Each text example is then represented by a vector whose individual dimensions hold the values of individual attributes of the text. Commonly, each word is treated as one such attribute (Joachims, 2002).
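A bag-of-words representation can be illustrated by the following sketch. The function name bag_of_words, the whitespace tokenization and the alphabetical ordering of attributes are assumptions made for the example only.

```python
def bag_of_words(documents):
    """Return the vocabulary and one count vector per document."""
    tokenized = [doc.lower().split() for doc in documents]
    # Assumption: attributes are ordered alphabetically.
    vocabulary = sorted({word for tokens in tokenized for word in tokens})
    index = {word: i for i, word in enumerate(vocabulary)}
    vectors = []
    for tokens in tokenized:
        vector = [0] * len(vocabulary)
        for word in tokens:
            vector[index[word]] += 1
        vectors.append(vector)
    return vocabulary, vectors

vocabulary, vectors = bag_of_words(["very very old windows", "windows old"])
print(vocabulary)
print(vectors)
```

Reordering the words of a text does not change its vector, which is what makes the ordering irrelevant.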

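Values of attributes represent the weights of individual words (terms) in the corresponding texts. Several methods for determining the weights of the words can be used (Nie, 2010). The application supports term presence, term frequency and tf-idf weighting (see the parameter frequency_type).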
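The three weighting schemes offered by the application (TP, TF, TF-IDF) can be sketched as below. The function name reweight is hypothetical, and the tf-idf variant used here, tf * ln(N / df), is only one common formulation; the exact formula used by the module is not stated in this document.

```python
import math

def reweight(tf_vectors, scheme):
    """Convert term-frequency vectors to TP, TF or TF-IDF weights."""
    n_docs = len(tf_vectors)
    n_terms = len(tf_vectors[0]) if tf_vectors else 0
    # Number of documents in which each term occurs at least once.
    doc_freq = [sum(1 for vec in tf_vectors if vec[j] > 0) for j in range(n_terms)]
    result = []
    for vec in tf_vectors:
        if scheme == "TP":
            result.append([1 if tf > 0 else 0 for tf in vec])
        elif scheme == "TF":
            result.append(list(vec))
        elif scheme == "TF-IDF":
            # Assumed variant: tf * ln(N / df); other variants exist.
            result.append([
                tf * math.log(n_docs / doc_freq[j]) if tf > 0 else 0.0
                for j, tf in enumerate(vec)
            ])
        else:
            raise ValueError(f"unknown scheme: {scheme}")
    return result

# Columns: old, very, windows, stairs
tf = [[1, 2, 1, 0], [1, 0, 0, 1]]
print(reweight(tf, "TP"))
print(reweight(tf, "TF-IDF"))
```

Under this variant, a word that occurs in every document (here "old") receives the weight 0, because it does not help to distinguish the documents.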

The application

The module TextMining.pm provides a method process_data_file that needs two arguments. The first is the name of a configuration file; the second is the name of a file containing the original textual entries. The method produces a file with vector representations of the texts according to the parameters defined in the configuration file, together with other files containing various information based on processing the data file (see section Produced outputs).

In the input file, each line represents one textual entry. It must contain the class of the text and the text itself, separated by a tab character. An example of an input file follows.

_NEGATIVE	NO LIFT. VERY NARROW, STEEP & SHORT TREAD STAIRS. NO KETTLE OR DRINK MAKING FACILITIES IN ROOM.
_NEGATIVE	WINDOWS. SOME OF THEM ARE VERY VERY OLD.
_NEGATIVE	LIMITED  PARKING FACILITIES, ONLY 5 SPACES FOR 7 ROOMS. THE ON-SUITE WAS TOO SMALL BUT ADEQUATE.
_NEGATIVE	RATE WAS CHARGED BEFORE THE ACTUAL TRIP.
_POSITIVE	ALL STAFF WERE VERY ATTENTIVE, AND THE FOOD WAS EXCELLANT
_POSITIVE	THE RECEPTIONIST WAS VERY HELPFUL, NOT MINDING AT ALL ABOUT EXPLAINING THE FRONT DOOR PROCEDURES TWICE.
_POSITIVE	IT'S CLOSE TO THE TRAIN STATION AND CITY CENTRE.
...

The input file must be in UTF-8 encoding; output files are produced in UTF-8 encoding as well.
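Reading such an input file can be sketched as follows. The function name read_entries is hypothetical, and skipping empty lines and rejecting lines without a tab are assumptions, not documented behaviour of the module.

```python
import io

def read_entries(stream):
    """Parse lines of the form CLASS<TAB>TEXT into (class, text) pairs."""
    entries = []
    for line_number, line in enumerate(stream, start=1):
        line = line.rstrip("\r\n")
        if not line:
            continue  # assumption: empty lines are skipped
        label, separator, text = line.partition("\t")
        if not separator:
            raise ValueError(f"line {line_number}: missing tab separator")
        entries.append((label, text))
    return entries

sample = io.StringIO(
    "_NEGATIVE\tWINDOWS. SOME OF THEM ARE VERY VERY OLD.\n"
    "_POSITIVE\tIT'S CLOSE TO THE TRAIN STATION AND CITY CENTRE.\n"
)
entries = read_entries(sample)
print(entries)
```

For a real file, the stream would be opened with an explicit encoding, e.g. open(path, encoding="utf-8").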

While processing the input data, all tags (e.g. <a href="...">, <em>), entities (e.g. &nbsp;, &alpha;), and characters that are not letters (e.g. digits, underscore, hyphen, quotes, ;, :, $ and others) are removed.
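The cleaning step can be approximated by the sketch below. The function name clean and the regular expressions are illustrative only; in particular, replacing the removed characters with a space (rather than deleting them outright) is an assumption, so that e.g. "ON-SUITE" becomes two words here.

```python
import re

TAG = re.compile(r"<[^>]*>")          # e.g. <a href="...">, <em>
ENTITY = re.compile(r"&#?\w+;")       # e.g. &nbsp;, &alpha;, &#160;
NON_LETTERS = re.compile(r"[\W\d_]+")  # runs of anything that is not a letter

def clean(text):
    """Remove tags, entities and non-letter characters from a text."""
    text = TAG.sub(" ", text)
    text = ENTITY.sub(" ", text)
    # Assumption: removed characters are replaced by a single space.
    text = NON_LETTERS.sub(" ", text)
    return text.strip()

print(clean("<em>GREAT</em>&nbsp;ROOM, ONLY 5 SPACES!"))
```

Because Python string patterns are Unicode-aware, accented letters are kept as letters.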

Configuration file

The configuration file is a simple text file containing the names and values of parameters used during data processing. The names and values are separated by one or more spaces or tabs. The complete list of the parameters is in the following table. The text after # is considered a comment and is ignored.

Parameter name | Allowed values | Meaning
frequency_type | TP, TF, TF-IDF | type of vector representations
  The user can choose from three possible representations of the weights of individual words in the vectors. TP is simple Term Presence (0 or 1), TF is Term Frequency (an integer representing the number of occurrences of the word in the document) and TF-IDF is the tf-idf weighting scheme.
output_format | C5, ARFF, GENERIC | output format for vector representations
  The user can choose from three output formats:
  • C5 -- format suitable for the software C5/See5; vectors are described by two files -- vectors.names (description of the attributes) and vectors.data (vectors)
  • ARFF -- produces a file vectors.arff in Attribute-Relation File Format, which can be used e.g. by the Weka software
  • GENERIC -- produces a file vectors.text where the first line contains the list of words, "_CLASS_" and optionally "_NZ_"; the following lines contain the vectors, where attribute values are joined with a user-defined delimiter (the last value of the vector is the document class, optionally followed by the number of nonzero items in the vector).
dictionary_file | filename | name of a file containing the list of allowed words
  When the user wants to work only with a selected set of words, the words can be provided in a separate file. Words in this file are on one line and must be separated by commas.
dictionary_only | yes, no | possibility of creating just the dictionary
  When yes is selected, the program stops after the dictionary is created and written. This is suitable for extracting just the words from all documents according to the parameters.
min_word_length | positive integer | minimal acceptable length of words
  Only words with a length greater than or equal to the entered value will be considered.
min_word_frequency | positive integer | minimal frequency of a word in all documents
  Only words with a frequency greater than or equal to the entered value will be considered.
print_number_of_nonzero_items | yes, no | printing the number of nonzero items in each vector for GENERIC output format
  When yes is selected, a number representing the count of nonzero values in the vector will be printed after each vector in GENERIC output format.
delimiter | string | delimiter of vector values for GENERIC format
  Values of vectors in GENERIC output format will be joined with the entered characters.
max_lines_in_output_file | positive integer | number of vectors in output file
  The optional parameter max_lines_in_output_file is suitable for large data collections, where the file with vectors representing the documents can occupy hundreds of gigabytes. The output is then split into several files, each containing at most the given number of vectors; by joining these files, the complete file can be obtained.

An example of a configuration file:

# type of frequency representation
frequency_type TF

# minimal word length
min_word_length 1 

# minimal word frequency
min_word_frequency 5

# create dictionary only
dictionary_only	no
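The configuration format described above (a name and a value separated by spaces or tabs, with # starting a comment) could be parsed as in the following sketch. The function name parse_config is hypothetical, and the sketch does not handle a delimiter value that itself contains # or consists of whitespace only.

```python
def parse_config(text):
    """Parse 'name value' lines; text after '#' is ignored."""
    parameters = {}
    for raw_line in text.splitlines():
        line = raw_line.split("#", 1)[0].strip()
        if not line:
            continue  # blank line or comment-only line
        parts = line.split(None, 1)  # split on the first run of spaces/tabs
        name = parts[0]
        value = parts[1].strip() if len(parts) > 1 else ""
        parameters[name] = value
    return parameters

example = """# type of frequency representation
frequency_type TF

# minimal word length
min_word_length 1 

# minimal word frequency
min_word_frequency 5

# create dictionary only
dictionary_only\tno
"""
config = parse_config(example)
print(config)
```

All values are returned as strings; converting min_word_length and min_word_frequency to integers would be a separate validation step.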

Produced outputs

During data processing, several output files are created.

Files with vectors

When the C5 output format is selected, information about the vectors is stored in two files -- vectors.names (description of the attributes) and vectors.data (vectors). When the ARFF format is chosen, a file vectors.arff in Attribute-Relation File Format is produced. The GENERIC output format produces a file vectors.text.
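The layout of the GENERIC format, as described for the output_format parameter, can be sketched as follows. The function name format_generic is hypothetical, and joining the header line with the same delimiter as the vectors is an assumption.

```python
def format_generic(words, vectors, classes, delimiter=",", print_nonzero=False):
    """Build the text of a GENERIC-style file: header line, then one vector per line."""
    header = list(words) + ["_CLASS_"]
    if print_nonzero:
        header.append("_NZ_")
    # Assumption: the header uses the same delimiter as the vector lines.
    lines = [delimiter.join(header)]
    for vector, label in zip(vectors, classes):
        fields = [str(value) for value in vector] + [label]
        if print_nonzero:
            fields.append(str(sum(1 for value in vector if value != 0)))
        lines.append(delimiter.join(fields))
    return "\n".join(lines) + "\n"

text = format_generic(
    ["old", "very"],
    [[1, 2], [0, 0]],
    ["_NEGATIVE", "_POSITIVE"],
    delimiter=",",
    print_nonzero=True,
)
print(text, end="")
```

The trailing count corresponds to print_number_of_nonzero_items set to yes.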

Dictionary

When a new dictionary is created as the result of processing the documents, the files dictionary.text and dictionary_frequencies.text are created. The former contains a single line with all words separated by commas; the latter contains all words and their frequencies, sorted by frequency (one word per line).
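The content of the two dictionary files can be illustrated by the sketch below. The function name dictionary_outputs is hypothetical; the alphabetical order of the dictionary line, the descending order of frequencies and the tab between a word and its frequency are all assumptions.

```python
from collections import Counter

def dictionary_outputs(documents):
    """Return the dictionary line and the per-word frequency lines."""
    counts = Counter(word for doc in documents for word in doc.split())
    # Assumption: the dictionary line lists the words alphabetically.
    dictionary_line = ",".join(sorted(counts))
    # Assumption: most frequent words first, ties broken alphabetically.
    ranked = sorted(counts.items(), key=lambda item: (-item[1], item[0]))
    frequency_lines = [f"{word}\t{count}" for word, count in ranked]
    return dictionary_line, frequency_lines

dictionary_line, frequency_lines = dictionary_outputs(
    ["old old windows", "old stairs"]
)
print(dictionary_line)
print("\n".join(frequency_lines))
```

A dictionary line in this form matches what the dictionary_file parameter expects: all words on one line, separated by commas.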

Other information

At the end of the file stat.text, information about the number of texts in individual classes and the number of words in the dictionary is appended.

Papers based on the framework