Hive supports the Java regular expression, which makes text mining much easier and more accurate. However, Java regex does not function as expected in Hive if you do not make some minor changes. I wasted so many hours last week to figure out what was wrong with my Hive query and finally figured out this is a problem called slashception.

I work with lots of text data from Twitter. Usually, I prefer to download the text data from a local Hadoop server cluster managed by the Social Media and Democracy (SMAD) Research Group at the University of Wisconsin-Madison and analyze them using R or Python. But, my current research–that looked at hybrid media campaign strategy of a handful of presidential candidates–made me want to do everything on the Hive server. Because downloading multiple large datasets (per each candidate) on my laptop and writing similar R codes for analyzing different candidates is cumbersome and time-consuming. I thought I could make the process more efficient by using powerful capabilities of the regular expression available on Hive.

Text Wrangling in Hive

Hive, by default, supports some features of text mining. It supports basic string functions, such as lower(ConvertToLowerCase), replace(OriginalString, OldPattern, NewPattern), %WildcardExpression%, and many more (See Hive Language Manual).

However, one problem of using wildcard expression is that it often introduces random noise. For example, I can map the tweets that include RT @realDonaldTrump using the following Hive query:

INSERT overwrite LOCAL directory 'rtDT'
ROW format delimited
FIELDS TERMINATED BY '\t'
SELECT text
FROM twitterGardenhose
WHERE
LOWER(text) LIKE '%rt @realdonaldtrump%';

Uh-oh, I found some of these irrelevant tweets in your data:

'RT @realDonaldTrumpJrJrJr'
'support @realDonaldTrump cause of some reason'

These are not the retweets of @realDonaldTrump, but it is hard to leave these out just using simple text matching.

Java Regex in Hive

However, you can do more complex text wrangling using regex. Yes, Hive supports Java regex!

I can set a word boundary to capture every tweet that includes RT @realDonaldTrump but not support @realDonaldTrump or RT @realDonaldTrumpJrJrJr. This is a solution that works in Java: ^.*\brt\s@realdonaldtrump\b.*

Also, you can select every other tweet that does not include RT @realDonaldTrump using the following regex: ^((?!^.*\brt\s@realdonaldtrump\b.*$).)*$

So, I revised the Hive query a bit and tested this in Hive… only to find it does not work properly.

INSERT overwrite LOCAL directory 'rtDT'
ROW format delimited
FIELDS TERMINATED BY '\t'
SELECT text
FROM twitterGardenhose
WHERE
LOWER(text) LIKE '%@realdonaldtrump%'
AND
LOWER(text) RLIKE '^((?!^.*\brt\s@realdonaldtrump\b.*$).)*$';

I already knew there’s slashception in Hive, which means you need an extra backslash() to escape the protected backslash character in regular expressions. So, I revised the query as following:

INSERT overwrite LOCAL directory 'rtDT'
ROW format delimited
FIELDS TERMINATED BY '\t'
SELECT text
FROM twitterGardenhose
WHERE
LOWER(text) LIKE '%@realdonaldtrump%'
AND
LOWER(text) RLIKE '^((?!^.*\\brt\\s@realdonaldtrump\\b.*$).)*$';

It still doesn’t work. What’s the problem? This is a part I spent so many hours to find the answer. Then the other night, I had a chat with my roommate who told me he uses Java to automatically classify massive amount of law document. Although he had no experience of using regex, he told me that there are several special characters in Java that need an escape, such as “, (, ), #, and \.

Oh?!!

So, I added extra \ before ( and ).

INSERT overwrite LOCAL directory 'rtDT'
ROW format delimited
FIELDS TERMINATED BY '\t'
SELECT text
FROM twitterGardenhose
WHERE
LOWER(text) LIKE '%@realdonaldtrump%'
AND
LOWER(text) RLIKE '^\(\(?!^.*\\brt\\s@realdonaldtrump\\b.*$\).\)*$';

Bingo! It worked!

Looking back, it turns out to be a really simple solution. But it was hard for me to get the answer. If some of you found this post because of a similar problem, I hope it helped!