Step 2: Visualizing Data – but not well

Step 2: Visualizing cumulative polarity of sentiments across lines of text

The last post basically provided a run down of getting the data ready to use – reading in a text file that was then split into a CSV with polarity and subjectivity columns. This is sort of a bare minimum and the post itself may need to be expanded. 

This post cuts that post short – because without some of the details it’s worse than useless.

Building on the last python script to create the CSV with polarity and subjectivity, several elements were added to graph the cumulative polarity of lines of text in the Stealing the Network books.

First – there is a little kludge to just the chapter number for the Stealing the Network books. Kludge for Chapter Numbers in the Title

The actual graph generating code is pretty straight forward matplotlib code:

Basic matplotlib code

The complete code is here.

The following graphs illustrate what this code does. To get a sense of the polarity of the book, the cumulative polarity is shown line by line with the line number as the x axis and the polarity sum (present polarity added to all preceding lines), where polarity is between -1 and 1 (with 0 representing a neutral statement).  

For comparison, I planned to show /var/log/syslog (by adding periods to the end of each line) – however, I discovered an SNMP bug generating error messages.
A very negavtive syslogClosing is negative. /var/log/auth.log

So, I decided to look at /var/log/auth.log and discovered that “closed” is evaluated as negative…. Note: The black dots occur every hundred lines.

 

This helps to contextualize the chapter by chapter cumulative polarity graphs from the “Stealing the Network” books.

Here is book 1:Cumulative Polarity STN book 1

Here are the links to the Cumulative Polarity Sum graphs for book 2, book 3, and book 4.

The montages were generated using Image Magick’s montage command e.g. Make montage using Image Magick

Why these are worse than useless

These graphs are problematic and worse than useless for multiple reasons.

  1. There is no consistency across axes.
    • The X axis – notice the black dots – they are spaced to occur every hundred lines. Without looking closely, one might assume the texts are of comparable length. They are not – which is then exacerbated by:
    • The Y axis – these graphs are autoscaled to the max Y value. Chapters 1 and 9 look similar but they have very different values.
  2. Behaviorally, it is hard to tell what is going on. Are there lots of positive and negatively valuated statements that balance each other out? Or is it simply a few positively valuated statements that make the chapter seem overall ‘positive’?

 

 

 

Leave a Reply

Your email address will not be published. Required fields are marked *