Task 2: Creating Variables

In this part, we will create variables that measure different characteristics of each email message in our training data. We will use these variables to predict whether a message is HAM or SPAM in the next task.

Some potentially useful functions include: grep(), grepl(), gsub(), gregexpr(), regexec(), regmatches(), nchar(), strsplit(), table(), plot(), smoothScatter(), hexbin().
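As a rough illustration of how a few of these functions fit together, here is a minimal sketch in base R. The subject lines are invented for the example and are not taken from the training data; the variable names are likewise hypothetical.

```r
# Hypothetical subject lines, invented for illustration only
subjects <- c("Re: meeting notes",
              "FREE money back guarantee!!!",
              "Is this V1agra?")

# grepl() returns one logical per element: does the pattern match?
startsWithRe <- grepl("^Re:", subjects)

# gregexpr() locates every match; regmatches() extracts the matched text.
# Counting the extracted matches gives the number of "!" in each subject.
bangs <- regmatches(subjects, gregexpr("!", subjects, fixed = TRUE))
bangCount <- vapply(bangs, length, integer(1))

# gsub() deletes everything that is not a letter; nchar() counts what is left
letterCount <- nchar(gsub("[^[:alpha:]]", "", subjects))

# strsplit() breaks each subject into words on runs of whitespace
words <- strsplit(subjects, "[[:space:]]+")
wordCount <- vapply(words, length, integer(1))
```

The same pattern (locate matches, extract them, count them) applies to many of the count-style variables described in this task.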

isSpam: logical - whether the mail is Spam (TRUE) or Ham (FALSE). You can compute this from the names of the messages in the top-level list of messages.
isRe: logical - whether the string Re: appears as the first word in the subject of the message
numLinesInBody: integer - a count of the number of lines in the body of the email message
bodyCharacterCount: integer - the number of characters in the body of the email message
replyUnderline: logical - whether the Reply-To field in the header has an underscore and numbers/letters
subjectExclamationCount: integer - a count of the number of exclamation marks (!) in the subject of the message
subjectQuestCount: integer - the number of question marks in the subject
numAttachments: integer - the number of attachments in the message
priority: logical - whether the message’s header had an X-Priority or X-Msmail-Priority that was set to high
numRecipients: integer - the number of recipients in the To and Cc fields
percentCapitals: numeric - the percentage of the characters in the body of the email that are upper case (excluding blanks, numbers, and punctuation)
isInReplyTo: logical - whether the header of the message has an In-Reply-To field
sortedRecipients: logical - whether the recipient list is sorted by address
subjectPunctuationCheck: logical - whether the subject has punctuation or digits surrounded by characters, e.g. V?agra and pay1ng, but not New!
hourSent: integer - the hour in the day the mail was sent (0 - 23)
multipartText: logical - whether the header states that the message is a multipart/text, i.e. with attachments
containsImages: logical - whether the message contains images (in HTML)
isPGPsigned: logical - indicates whether the mail was digitally signed (e.g. using PGP or GPG)
percentHTMLTags: numeric - the proportion of any HTML text in the message’s body that is made up of HTML markup and not content
subjectSpamWords: logical - whether the subject contains one of the following phrases: viagra, pounds, free, weight, guarantee, millions, dollars, credit, risk, prescription, generic, drug, money back, credit card
percentSubjectBlanks: numeric - the percentage of blanks in the subject
messageIdHasHostname: logical - whether the message id that uniquely identifies the message has no component identifying the machine from which it was sent
fromnumericEnd: logical - whether the user login in the From: field ends in numbers
isYelling: logical - whether the Subject of the mail is in capital letters
percentForwards: numeric - percent of the message’s body that is made up of content included from other messages
isOriginalMessage: logical - body does not contain the phrase “original message” or something similar
isDear: logical - whether the message body contains a form of the introduction Dear
isWrote: logical - whether the text includes a line indicating an included message as identified by the word wrote: in several different possible languages
averageWordLength: numeric - the average length of the words in the body of the message
numDollarSigns: integer - the number of dollar signs in the body of the message
numSpamURLs: integer - the number of URLs in the message that are “known” to be SPAM URLs. We compute this set of URLs by looking at all of the SPAM messages in our training set and finding all the URLs and removing URLs that are in HAM messages
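
To make some of the definitions above concrete, here is a minimal sketch that computes a handful of the simpler variables for a single message. It rests on assumptions that the task does not spell out: each message is taken to be a list with a named character vector header and a character vector body (one element per line), and the helper names countChar, getField, and extractVars are hypothetical. Whether line breaks count toward bodyCharacterCount is also an assumption (here they do not).

```r
# Hypothetical message layout (an assumption, not given in the task):
#   header: named character vector of header fields
#   body:   character vector, one element per line of the body
msg <- list(
  header = c(Subject = "Re: FREE offer!! Act now?",
             From    = "bob123@example.com"),
  body   = c("Dear friend,", "SEND $5 and get $500 back.", "")
)

# Count literal occurrences of one character across a character vector
countChar <- function(ch, x) {
  hits <- regmatches(x, gregexpr(ch, x, fixed = TRUE))
  sum(vapply(hits, length, integer(1)))
}

# Return a header field, or "" when the field is absent
getField <- function(m, name) {
  value <- unname(m$header[name])
  if (length(value) == 0 || is.na(value)) "" else value
}

extractVars <- function(m) {
  subject <- getField(m, "Subject")
  from    <- getField(m, "From")
  body    <- m$body

  # Keep letters only, so blanks, digits, and punctuation are excluded
  bodyLetters    <- gsub("[^[:alpha:]]", "", paste(body, collapse = ""))
  subjectLetters <- gsub("[^[:alpha:]]", "", subject)

  # Share of body letters that are upper case; NA when there are no letters
  nUpper <- nchar(gsub("[^[:upper:]]", "", bodyLetters))
  pctCaps <- if (nchar(bodyLetters) == 0) NA_real_ else
    100 * nUpper / nchar(bodyLetters)

  # Login name: drop any "Display Name <" prefix, then everything from "@"
  login <- sub("@.*$", "", sub("^.*<", "", from))

  data.frame(
    isRe                    = grepl("^[[:space:]]*Re:", subject),
    numLinesInBody          = length(body),
    bodyCharacterCount      = sum(nchar(body)),
    subjectExclamationCount = countChar("!", subject),
    subjectQuestCount       = countChar("?", subject),
    percentCapitals         = pctCaps,
    fromnumericEnd          = grepl("[0-9]$", login),
    isYelling               = nchar(subjectLetters) > 0 &&
                              subjectLetters == toupper(subjectLetters),
    isDear                  = any(grepl("\\bdear\\b", body, ignore.case = TRUE)),
    numDollarSigns          = countChar("$", body)
  )
}

vars <- extractVars(msg)
```

Applied to every message (for example with lapply() followed by do.call(rbind, ...)), a function of this shape would yield one row per message. The remaining variables, such as numAttachments or numSpamURLs, need more of the message structure than this sketch assumes.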

Zifan Lin

March 19, 2015

Implement