
Analyzing Twitter Conversations with the New Twitter V2 API

by Joe Zeoli, February 17th, 2022

Getting actionable insights around a topic using the new Twitter API v2 endpoint

Knowing what people are talking about in near real-time is incredibly powerful. This type of information can inform campaigns, brand strategy, and product innovation. The standard way to gather this data is to go out, find consumers, and ask them pointed questions. This method can be costly, time-consuming, and completely irrelevant if your questions aren't perfect.


Twitter offers a way to gather unfiltered data about how people or companies talk about specific topics. The "unfiltered" part, however, is probably the biggest challenge in trying to pull insights out of raw Twitter data. Another downside is that this data is only a small piece of the puzzle: the API will only pull Tweets whose text matches your query, so it can't pull everything that's related. These days, Twitter is used more conversationally, where an initial Tweet may spark a discussion below it. I wanted to figure out a way to map out this type of information and get a full picture of a given topic.

Collecting

For this project, I wanted to get some insight into conversations around gardening. Twitter's free version of their API allows you to search back up to 7 days, so that was my starting point. I used the new v2 version of their API to do a 7-day search for any Tweets mentioning "Gardening" and was able to grab around 22,000!
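
For context, that initial collection step can be sketched roughly like this. It assumes the connect_to_endpoint helper from Twitter's v2 sample code (which signs each request with a bearer token) and a hypothetical save_tweet function for persisting results:

import time

SEARCH_URL = 'https://api.twitter.com/2/tweets/search/recent'

def search_topic(next_token=None):
    # Recent search covers roughly the last 7 days on the free tier
    query_params = {
        'query': 'gardening',
        'tweet.fields': 'created_at,conversation_id,text,public_metrics,lang',
        'max_results': 100
    }
    if next_token:
        query_params['next_token'] = next_token

    json_response = connect_to_endpoint(SEARCH_URL, query_params)

    for status in json_response.get('data', []):
        save_tweet(status)  # hypothetical persistence helper

    # Follow pagination tokens until the 7-day window is exhausted
    if 'next_token' in json_response.get('meta', {}):
        time.sleep(2)  # stay under the rate limit
        search_topic(json_response['meta']['next_token'])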


The most exciting new feature of the v2 endpoint is the Conversation ID. It lets us get the full discussion around a topic rather than just one-off Tweets. To get this for the Tweet objects we collected previously, we're going to:


  1. Get all of the saved Tweets that have replies but aren't replies themselves (the initial node of the conversation)
  2. For each of those, fetch all of the Tweets that share its Conversation ID
  3. Make sure the Tweet doesn't already exist in our database, and save it.


Below is the general idea:

import time

conversationIDs = []

def get_replies():
    # Find all saved Tweets that have replies (3+) but aren't replies
    # themselves, i.e. the initial node of each conversation.
    # Tweet is a MongoEngine model; 'replys' is the stored field name.
    for tweet in Tweet.objects(replys__gte=3, replyto__exists=False):

        # Skip conversations we've already fetched
        if tweet.conversation_id in conversationIDs:
            continue

        query_params = {
            'query': 'conversation_id:' + tweet.conversation_id,
            'tweet.fields': 'created_at,conversation_id,text,public_metrics,lang',
            'user.fields': 'username',
            'expansions': 'author_id,referenced_tweets.id',
            'max_results': 100
        }

        conversationIDs.append(tweet.conversation_id)

        def get_convo(next_token=None):
            # Page through the whole conversation, following next_token
            if next_token:
                query_params['next_token'] = next_token

            # connect_to_endpoint is the authenticated request helper
            # from Twitter's v2 sample code
            json_response = connect_to_endpoint('https://api.twitter.com/2/tweets/search/recent', query_params)

            if 'data' not in json_response:
                return

            for status in json_response['data']:
                # Keep only English, non-retweet Tweets
                if status['lang'] != 'en':
                    continue
                if 'RT @' in status['text']:
                    continue

                ### SAVE TWEET HERE

            if 'next_token' in json_response['meta']:
                get_convo(json_response['meta']['next_token'])

        # Sleep to stay under the rate limit for the endpoint
        time.sleep(2)
        get_convo()

Topic Clustering

My first thought, after collecting the Tweets, was to cluster them into sub-topics using NLP. After some research into models that are best suited for Tweet data, I decided to go with the Latent Dirichlet Allocation (LDA) Mallet model. Before running it, I'm cleaning all of the Tweet text, stripping URLs, emojis, hashtags, and user tags.
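
As a rough sketch of that step (assuming a local Mallet install at mallet_path and gensim's pre-4.0 LdaMallet wrapper; the cleaning patterns here are my own approximation, not the author's exact code):

import re
import gensim
import gensim.corpora as corpora
from gensim.models.wrappers import LdaMallet

def clean_tweet(text):
    # Strip URLs, user tags, and hashtags; drop non-ASCII as a crude emoji filter
    text = re.sub(r'https?://\S+', '', text)
    text = re.sub(r'[@#]\w+', '', text)
    return text.encode('ascii', 'ignore').decode().lower().strip()

docs = [gensim.utils.simple_preprocess(clean_tweet(t.tweet)) for t in Tweet.objects()]
id2word = corpora.Dictionary(docs)
corpus = [id2word.doc2bow(doc) for doc in docs]

# mallet_path points at a local Mallet binary, e.g. 'mallet-2.0.8/bin/mallet'
optimal_model = LdaMallet(mallet_path, corpus=corpus, num_topics=20, id2word=id2word)

This also sets up the optimal_model and corpus objects referenced in the plotting code further down.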


To keep it simple I picked 20 as the number of topics. Below is the output sorted by the top topics.


This output is pretty interesting, but it only gets us so far. I want to be able to dive deeper into this list. For instance, I’d like to know which tweets are under each topic to get a better sense of what they’re really talking about.

Topic Scatter Plot

To be able to see this visually, I decided to create an interactive Bokeh scatter plot.

Below is the basic code I'm using to generate the plot. I'm pulling in the Tweet data so I can hover over each node and see which Tweet it is. Also, since we're working with a large amount of data here, I wanted to be able to distinguish the more important Tweets. To do this, I'm using the number of favorites a Tweet received to determine its node size.
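
The snippet below references two variables it doesn't define, data2 (hover text) and favorites (marker sizes). Assuming Tweets are iterated in the same order as the corpus, they could be built like this; the square-root scale and the 5-25 clamp are my own assumption to keep heavily-favorited Tweets from swamping the plot:

import math

# Hover text and marker sizes, aligned with the corpus/document order
data2 = [t.tweet for t in Tweet.objects()]
favorites = [min(max(math.sqrt(t.favorites), 5), 25) for t in Tweet.objects()]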


# Get topic weights and dominant topics ------------
import numpy as np
import pandas as pd
import matplotlib.colors as mcolors
from sklearn.manifold import TSNE
from bokeh.plotting import ColumnDataSource, figure, output_file, show
from bokeh.io import output_notebook

# Get each document's topic-weight vector from the trained LDA model
n_topics = 20
topic_weights = []

for row_list in optimal_model[corpus]:
    tmp = np.zeros(n_topics)
    for i, w in row_list:
        tmp[i] = w
    topic_weights.append(tmp)


# Array of topic weights
arr = pd.DataFrame(topic_weights).fillna(0).values


# Dominant topic number in each doc
topic_num = np.argmax(arr, axis=1)

# t-SNE dimension reduction down to 2D for plotting
tsne_model = TSNE(n_components=2, verbose=1, random_state=0, angle=.5, init='pca')
tsne_lda = tsne_model.fit_transform(arr)


# Plot the topic clusters using Bokeh
output_notebook()
mycolors = np.array([color for name, color in mcolors.CSS4_COLORS.items()])


# data2 (hover text) and favorites (marker sizes) are aligned with the corpus order
source = ColumnDataSource(data=dict(
    x=tsne_lda[:, 0],
    y=tsne_lda[:, 1],
    desc=data2,
    color=mycolors[topic_num],
    size=favorites
))


TOOLTIPS = [
    ("tweet", "@desc"),
]

plot = figure(title="t-SNE Clustering of {} LDA Topics".format(n_topics),
              plot_width=1400, plot_height=700, tooltips=TOOLTIPS)

plot.circle('x', 'y', size='size', source=source, fill_alpha=1, fill_color='color', line_width=0)


output_file("plot.html")
show(plot)



Now we’re getting closer! Using this chart I’m able to zoom in on clusters and read the top favorited Tweets to get a really good idea about what the topics are about. However, this still isn’t the full picture. The clusters are only representative of clusters of random Tweets and doesn’t give us an understanding of how conversations form around the topic.

Conversation Network Graph

Conversations are essentially mini networks within the larger topic. In this case, they’re connected by the Conversation ID we collected earlier. We can determine how everything links together by connecting Tweets to their “reply to” Tweet object. I’m going to stick with Bokeh and create a network graph based on that criteria.


The different-sized nodes in the scatter plot were super helpful, so I'm going to stick with that for this graph. One thing I noticed was that, due to the large differences in Tweet favorites across the board, the scale gets a bit out of whack, so I'm sizing each node as the square root of its favorites, clamped between 5 and 150. It's also difficult to spot the beginning Tweet of a conversation, so if a Tweet doesn't have a "reply to" saved in the database, I'm marking its node red.


To create the network graph, I'm adding a node for every Tweet I saved and then connecting the Tweets with edges via the "reply to" field. Lastly, to make this easier to visualize, I'm removing any nodes that don't have a connection. We can also remove any networks with fewer than a certain number of connections if we just want to map out the largest conversations. The basic steps are below.

import math
import networkx

G = networkx.Graph()
singleNodes = []

for tweet in Tweet.objects():
    # Square-root scale clamped to [5, 150] so huge favorite counts
    # don't blow out the node sizes
    size = min(max(math.sqrt(tweet.favorites), 5), 150)

    # Conversation roots (no "reply to" saved) are marked red;
    # the non-root color choice here is arbitrary
    color = 'red' if tweet.replyto is None else 'cornflowerblue'

    # sent_color and gender_color come from sentiment/gender
    # classification steps not shown in this article
    G.add_node(tweet.tweetid, size=size, desc=tweet.tweet, id=tweet.tweetid,
               color=color, username=tweet.username, color2=sent_color,
               sizeStandard=10, gender=gender_color)

    # Track every node id so we can validate edges below
    singleNodes.append(tweet.tweetid)


# Connect each reply to its parent Tweet
for tweet in Tweet.objects(replyto__exists=True):
    if tweet.replyto in singleNodes:
        G.add_edge(tweet.replyto, tweet.tweetid)


# Drop Tweets with no connections at all
G.remove_nodes_from(list(networkx.isolates(G)))


# Drop small conversations to keep only the largest networks
for component in list(networkx.connected_components(G)):
    if len(component) < 5:
        for node in component:
            G.remove_node(node)
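
The article doesn't show the rendering step for this graph; a minimal sketch using Bokeh's from_networkx helper (the layout and styling choices here are my own assumptions) could look like:

import networkx
from bokeh.plotting import figure, show, from_networkx

plot = figure(title='Gardening conversation graph', plot_width=1400,
              plot_height=700, tooltips=[('tweet', '@desc'), ('user', '@username')])

# spring_layout pushes each conversation cluster apart from the others
graph = from_networkx(G, networkx.spring_layout, scale=2, center=(0, 0))

# Node attributes added via G.add_node (size, color, desc, ...) become
# columns in the renderer's data source, so we can style by them
graph.node_renderer.glyph.update(size='size', fill_color='color')

plot.renderers.append(graph)
show(plot)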


The output is a cool visualization of the conversations around “gardening” regardless of whether or not the Tweet mentions gardening. This gives us a full picture of what’s going on. From here we can zoom into specific conversations and see what users are saying.


We now have a conversation map around a real topic that could provide endless insights into a subject. This could be useful for researching blog topics, conducting market research, and even spotting trends. By adjusting the node sizes, we can quickly see which types of Tweets and content generate both the most favorites and the most conversation.


First Published here