[2016–10–18] Election 2016: Analyzing Real-Time Twitter Sentiment with MemSQL Pipelines

How it Works

First, we wrote a python script to collect tweets and retweets that contain the words Hillary, hillary, Trump, or trump directly from Twitter.com. We picked the words “Hillary” and “Trump” as descriptors since they are the most used for the candidates. The script pushes this content to an Apache Kafka queue in real time. Messages in this Kafka queue are then streamed using MemSQL Pipelines. Released in September 2016 at Strata+Hadoop World, Pipelines features a brand new SQL command CREATE PIPELINE, enabling native ingest from Apache Kafka and creation of real-time streaming pipelines.

CREATE PIPELINE `twitter_pipeline`
AS LOAD DATA KAFKA ‘your-kafka-host-ip:9092/your-kafka-topic’
INTO TABLE `tweets`;
CREATE TABLE `tweets` (
`id` bigint(20) DEFAULT NULL,
`ts` timestamp NOT NULL DEFAULT CURRENT_TIMESTAMP ON UPDATE CURRENT_TIMESTAMP,
`tweet` JSON COLLATE utf8_bin,
`text` as tweet::$text PERSISTED text CHARACTER SET utf8 COLLATE utf8_general_ci,
`retweet_count` as tweet::%retweet_count PERSISTED int(11),
`candidate` as CASE
WHEN (text LIKE '%illary%') THEN 'Clinton'
WHEN (text LIKE '%rump%') THEN 'Trump'
ELSE 'Unknown' END PERSISTED text CHARACTER SET utf8 COLLATE utf8_general_ci,
`created` as FROM_UNIXTIME(`tweet`::$created_at) PERSISTED datetime,
KEY `id` (`id`) /*!90619 USING CLUSTERED COLUMNSTORE */,
/*!90618 SHARD */ KEY `id_2` (`id`)
);
memsql> SELECT * from tweets LIMIT 1\G
*************************** 1. row ***************************
id: 786409507039485952
ts: 2016-10-13 03:33:53
tweet: {"created_at":1476329611,"favorite_count":0,"id":786409507039485952,"retweet_count":0,"text":"RT @BlackWomen4Bern: This will be an interesting Halloween this year...expect me to tweet some epic Hillary costumes...I expect there will…","username":"hankandmya12"}
text: RT @BlackWomen4Bern: This will be an interesting Halloween this year...expect me to tweet some epic Hillary costumes...I expect there will…
retweet_count: 0
candidate: Clinton
created: 2016-10-13 03:33:31
1 row in set (0.03 sec)
CREATE PIPELINE `twitter_sentiment_pipeline`
AS LOAD DATA KAFKA 'your-kafka-host-ip:9092/your-kafka-topic'
WITH TRANSFORM ('http://download.memsql.com/pipelines-twitter-demo/transform.tar.gz' , 'transform.py' , '')
INTO TABLE `tweet_sentiment`;
SELECT
sentiment_bucket,
SUM(IF(candidate = "Clinton", tweet_volume, 0)) as clinton_tweets,
SUM(IF(candidate = "Trump", tweet_volume, 0)) as trump_tweets
FROM tweets_per_sentiment_per_candidate_timeseries t
GROUP BY sentiment_bucket
ORDER BY sentiment_bucket;

--

--

Get the Medium app

A button that says 'Download on the App Store', and if clicked it will lead you to the iOS App store
A button that says 'Get it on, Google Play', and if clicked it will lead you to the Google Play store
Neil Dahlke

Neil Dahlke

Engineer. @hashicorp , formerly @memsql , @UChiResearch . @depaulu alum.