Exploring My Instagram Network Visualization and Key Insights
In my pursuit of knowledge, I'm currently enrolled in an advanced Master's program in Artificial Intelligence and Big Data Analytics at KU Leuven, Belgium. One of the elective courses I'm taking is focused on "Analyzing Large Scale Social Networks," where we delve into network data handling and interpretation. A significant portion of our final assessment involves creating a report in a network-related domain, and I opted to visualize and analyze my Instagram network.
The first thing I'll present is the outcome of my work. In the following sections, I'll elaborate on the methods I used to achieve this visualization. If you're less interested in the technical coding aspects, feel free to skip ahead to section 4, where we will explore some network theory and apply it for deeper insights.
You can view the final interactive result here: https://bl.ocks.org/MaximPiessen/raw/6c4637faeebbdc3fd12ecfd7b67cdd27/
When you hover over a node, the username appears. Clicking on a node reveals only the nodes connected to it, while clicking away restores the entire network view. You can also zoom in and drag the network around. For those on mobile devices, I highly recommend switching to a laptop for an optimal experience. For anyone unwilling to do so, here’s a non-interactive screenshot:
Section 1 — Data Acquisition
Visualizing networks can be quite gratifying. However, the challenge lies in obtaining the necessary data for such visualizations. For my project, I aimed to compile a list of profiles I follow, as well as those that follow me. With this network in hand, I wanted to understand the connections among these profiles. Unfortunately, Instagram offers limited API endpoints for data retrieval. One potential workaround is to manually inspect each follower to document whom they follow, but this could take an extensive amount of time.
Fortunately, the creation of Selenium Webdriver has streamlined this process. Selenium allows users to create automated bots that perform tedious, repetitive tasks. Before I get into how I utilized Selenium, make sure to install the following (Mac users; Windows users can find equivalents via Google):
$ brew install python
$ brew install chromedriver
$ pip install selenium
I developed a bot class along with two scripts utilizing this class. The first script generates a text file containing all profiles that follow you and whom you follow. The second script processes this file to identify connections and outputs a text file with the results. The source code is available here: https://github.com/MaximPiessen/instagram_network_analysis
Note: I'm aware that the code could be significantly improved. If you're inclined to enhance it, please feel free to submit a pull request on GitHub!
To execute the first script, navigate to the “01 scraping” folder in your terminal and run:
$ python get_my_followers.py --username your_IG_username --password your_IG_password
After the first script completes, you can run the second script:
$ python get_relations.py --username your_IG_username --password your_IG_password --relations_file relations.txt
The second script uses the output from the first script, saving the connections in relations.txt. Depending on your follower count, this process may take several days. Occasionally, Instagram may block your requests, causing the program to terminate automatically. The first run will create a start_profile.txt file, which helps resume scraping from the last unprocessed profile on subsequent runs.
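The resume behaviour described above can be sketched as follows. This is not the code from the repository: the helper names (`load_start_index`, `scrape_all`, `scrape_one`) are hypothetical, and the sketch assumes start_profile.txt holds a single integer, the index of the next profile to process.

```python
from pathlib import Path


def load_start_index(path):
    """Return the index stored in the checkpoint file, or 0 if it is missing."""
    checkpoint = Path(path)
    return int(checkpoint.read_text().strip()) if checkpoint.exists() else 0


def scrape_all(profiles, checkpoint_path, scrape_one):
    """Process profiles from the stored index onward, checkpointing as we go."""
    start = load_start_index(checkpoint_path)
    for index in range(start, len(profiles)):
        # Record the profile about to be processed, so a crash resumes here.
        Path(checkpoint_path).write_text(str(index))
        scrape_one(profiles[index])  # hypothetical callback doing the scraping
    # Everything is done: point past the last profile.
    Path(checkpoint_path).write_text(str(len(profiles)))
```

With this layout, skipping a problematic profile amounts to adding one to the number in the file before the next run.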
A few tips:
- When running the program, ensure part of the Chrome window is visible; otherwise, Instagram may block requests. I typically position the window in a corner of my screen to multitask.
- If the program encounters a profile with thousands of followers, it may fail. In such cases, you can stop the program and manually increment the number in start_profile.txt to skip that profile on the next run.
Eventually, the program will gather all profiles, resulting in a relations.txt file. To visualize this data, we need a .json formatted file. I created a script for this conversion:
$ python relations_to_json.py --username your_IG_username --input_txt_file relations.txt --output_json_file relations.json --include_me False
Setting the include_me flag to "True" will include your profile and any connections related to it in the JSON file. I opted for "False" since I did not want my profile represented in the visualization.
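To give an idea of what such a conversion involves, here is a minimal sketch. It is not the repository's relations_to_json.py: it assumes each line of relations.txt holds a whitespace-separated "follower followed" pair and that the visualization expects a "nodes"/"links" layout, both of which are assumptions made for illustration. The function name `relations_to_graph` is hypothetical.

```python
import json


def relations_to_graph(lines, me, include_me=False):
    """Turn 'follower followed' lines into a nodes/links dictionary."""
    links = []
    for line in lines:
        parts = line.split()
        if len(parts) != 2:
            continue  # skip blank or malformed lines
        source, target = parts
        if not include_me and me in (source, target):
            continue  # drop every edge that touches my own profile
        links.append({"source": source, "target": target})
    # Only profiles that still appear in at least one link become nodes.
    names = sorted({name for link in links
                    for name in (link["source"], link["target"])})
    return {"nodes": [{"id": name} for name in names], "links": links}


example = ["me alice", "alice bob", "bob alice", "carol me"]
graph = relations_to_graph(example, me="me")
print(json.dumps(graph, indent=2))
```

In this toy example, carol disappears from the output because her only connection is to the excluded profile.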
Section 2 — Working with d3.js
With our relations.json file ready, we can now create our visualization. The key features I wanted included:
- Force-directed graph (learn more in this excellent post)
- Distinct styles for bi-directional edges (mutual follows) versus non-bi-directional edges
- Display of usernames on hover
- On clicking a node, visibility of all connected nodes
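A crucial note: when I opened the visualization directly from a local file, it only worked in Firefox. The likely cause is that browsers such as Chrome block the request that loads a local file like relations.json from a file:// page; serving the folder through a local web server should avoid this.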
2.1 Force-directed Graph
The d3.js JavaScript library provides a vast array of powerful visualization tools, supported by a vibrant community sharing their creations. My project was inspired by a gist titled "D3v4 Selectable, Draggable, Zoomable Force Directed Graph," which serves its purpose perfectly.
2.2 Different Edge Styles
As mentioned, I aimed for distinct styles between A → B and A ↔ B (where arrows denote following relationships). Utilizing actual arrows was impractical given my network's size (over 300 nodes and 1900 edges); instead, I chose a gradient approach. I opted for a red-to-green gradient, indicating that red follows green. This choice is illustrated below:
Using this jsfiddle and a stackoverflow post, I managed to implement gradients on the edges. The key takeaway is defining a gradient with a specific ID for each edge within a <defs></defs> SVG element. When drawing edges, the stroke style is set to the corresponding gradient. The gradient coordinates are updated together with the node positions, so the colors keep pointing in the right direction while nodes move.
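Before styling, the edges have to be split into mutual follows and one-way follows. The actual visualization does this in JavaScript; the following Python sketch only illustrates the logic, and the function name `classify_edges` is hypothetical.

```python
def classify_edges(edges):
    """Split directed (source, target) pairs into mutual and one-way edges."""
    edge_set = set(edges)
    mutual, one_way = set(), set()
    for source, target in edge_set:
        if (target, source) in edge_set:
            # Store a mutual follow once, regardless of direction.
            mutual.add(frozenset((source, target)))
        else:
            one_way.add((source, target))
    return mutual, one_way


mutual, one_way = classify_edges([("a", "b"), ("b", "a"), ("a", "c")])
print(mutual)   # a and b follow each other
print(one_way)  # a follows c, but c does not follow back
```

Each mutual pair is drawn as a single edge, which is why the number of drawn edges can be lower than the number of directed follow relations.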
2.3 Displaying Usernames on Hover
Inspired by this gist, adding tooltips for usernames on hover was straightforward. The tooltip's style can be modified via the .tooltip CSS class.
2.4 Showing Connected Nodes on Click
Finally, I wanted to display only the nodes connected to a selected node. This feature simplifies information gathering within the dense network. The code uses jQuery DOM manipulations based on assigned CSS classes and IDs. A hidden CSS class is utilized to adjust opacity.
Section 3 — Creating Your Own Visualization
This section provides a brief guide to follow:
- Log in or create a GitHub account.
- Fork my gist: https://gist.github.com/MaximPiessen/6c4637faeebbdc3fd12ecfd7b67cdd27
- Paste your network data into the relations.json section.
- Save the gist.
- Copy the gist link.
- Change gist.github.com to bl.ocks.org to obtain a link like https://bl.ocks.org/MaximPiessen/6c4637faeebbdc3fd12ecfd7b67cdd27.
- Share your link with friends (to view fully, click "open" under the visualization).
- Please share your links in the comments; I'm eager to explore your networks!
Section 4 — Analyzing Network Properties
In this section, we will apply theoretical network concepts practically. Network theory distinguishes between directed and undirected graphs. Directed graphs have edges with specific directions (e.g., A → B) while undirected graphs do not. We'll utilize a directed graph since "following" is inherently directional.
Networks can be analyzed on global or local levels. According to my course notes:
- Global level analysis: This approach focuses on the large-scale structure of the network or substantial subgraphs.
- Local level analysis: This examines the roles or positions of individual nodes or smaller subgraphs within the network.
We will also experiment with various clustering/community detection algorithms to identify distinct groups within our network, comparing these results against my understanding to validate them.
For the subsequent subsections, ensure you have installed matplotlib, NetworkX, and SciPy:
$ pip install matplotlib
$ pip install networkx
$ pip install scipy
4.1 Global Level Analysis
To run the global analysis, navigate to the “03 analysis” folder and execute:
$ python global_analysis.py --username your_IG_username --input_txt_file path_to_relations.txt --include_me False
You will receive the following statistics about your network:
Density. This metric indicates how dense the network is. The maximum number of edges for a directed network with N nodes is N(N-1), counting A → B and B → A separately. For my network, the max is 314 * 313 = 98282. The actual edge count is typically much lower. Density quantifies this by dividing the actual number of edges by the maximum number of edges. For our graph, we find: Density = 0.033
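The density formula is short enough to write out by hand. The sketch below uses a toy edge list rather than my real data, and the function name `directed_density` is only illustrative.

```python
def directed_density(num_nodes, num_edges):
    """Fraction of the N * (N - 1) possible directed edges that exist."""
    max_edges = num_nodes * (num_nodes - 1)
    return num_edges / max_edges if max_edges else 0.0


# Maximum number of directed edges for a network of 314 nodes.
print(314 * 313)  # 98282

# Toy network: 3 nodes, 3 of the 6 possible directed edges are present.
edges = {("a", "b"), ("b", "a"), ("a", "c")}
nodes = {name for edge in edges for name in edge}
print(directed_density(len(nodes), len(edges)))  # 0.5
```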
Degree. This property indicates how many edges connect to a node. In a directed graph, we differentiate between in-degree (profiles following a specific profile) and out-degree (profiles a specific profile follows). The average in-degree always equals the average out-degree, since every edge adds exactly one to the out-degree of its source and one to the in-degree of its target. Average in-degree and out-degree = 10.3. The degree distribution is illustrated in figure 3:
The overlapping distributions suggest that, on average, profiles have a similar number of followers and followings. Additionally, the counts follow a power law with powers of -0.56 and -0.55 for in-degree and out-degree respectively. This indicates that more profiles have few followers or followings than those with many.
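A rough sketch of both steps, counting degrees and estimating a power-law exponent, is given below. The exponent is estimated with a plain least-squares line through the log-log degree histogram; this is an assumption about the method (the analysis script may fit it differently), and the function names are hypothetical.

```python
import math
from collections import Counter


def degree_counts(edges, nodes):
    """Return in-degree and out-degree per node for directed edges."""
    in_degree = Counter({node: 0 for node in nodes})
    out_degree = Counter({node: 0 for node in nodes})
    for source, target in edges:
        out_degree[source] += 1
        in_degree[target] += 1
    return in_degree, out_degree


def power_law_exponent(degrees):
    """Least-squares slope of log(count) against log(degree), degree > 0."""
    histogram = Counter(d for d in degrees if d > 0)
    if len(histogram) < 2:
        raise ValueError("need at least two distinct positive degrees")
    xs = [math.log(degree) for degree in histogram]
    ys = [math.log(count) for count in histogram.values()]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return sxy / sxx


edges = [("a", "b"), ("b", "a"), ("a", "c")]
nodes = {"a", "b", "c"}
in_degree, out_degree = degree_counts(edges, nodes)
avg_in = sum(in_degree.values()) / len(nodes)
avg_out = sum(out_degree.values()) / len(nodes)
print(avg_in, avg_out)  # both equal len(edges) / len(nodes)

# Synthetic degrees whose counts follow count = 64 * degree ** -2.
synthetic = [1] * 64 + [2] * 16 + [4] * 4 + [8] * 1
slope = power_law_exponent(synthetic)
print(round(slope, 3))
```

For real data, the histogram is noisy in the tail, so an exponent obtained this way should be read as a rough indication only.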
The degree distribution adhering to a power law signifies a scale-free network. But what exactly is a scale-free network and a power law? Drawing from an insightful article:
"A key distinction between normal and power-law distributions is that the number of nodes with high edge counts is much greater in power-law distributions than in normal distributions. Conversely, well-connected nodes are more prevalent in normal distributions. This implies that networks often feature a small number of highly connected nodes, which would be rare in a normal distribution."
A network is termed scale-free if its characteristics remain consistent regardless of network size, meaning that as the network expands, its underlying structure stays the same.
Average Shortest Path Length. This metric is based on the shortest path between two profiles: the minimum number of follow links (edges) needed to get from profile A to profile B. By assessing this length for all possible node pairs, we can determine the average shortest path length. In our network, the average shortest path length = 3.62. For comparison, the average shortest path length across all Facebook users is 4.74. Our figure is lower, as my network consists of closely connected friends.
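As an illustration, the average can be computed with one breadth-first search per node. The sketch below assumes that pairs with no directed path between them are simply left out of the average; the analysis script may treat such pairs differently.

```python
from collections import deque


def average_shortest_path_length(adjacency):
    """Average hop count over all ordered pairs connected by a directed path.

    `adjacency` maps every node to the nodes it follows; nodes without
    outgoing edges must still appear as keys (with an empty list).
    """
    total, pairs = 0, 0
    for start in adjacency:
        distances = {start: 0}
        queue = deque([start])
        while queue:
            node = queue.popleft()
            for neighbor in adjacency.get(node, ()):
                if neighbor not in distances:
                    distances[neighbor] = distances[node] + 1
                    queue.append(neighbor)
        total += sum(distances.values())
        pairs += len(distances) - 1  # exclude the start node itself
    return total / pairs if pairs else 0.0


# Directed cycle a -> b -> c -> a: every pair is 1 or 2 hops apart.
cycle = {"a": ["b"], "b": ["c"], "c": ["a"]}
print(average_shortest_path_length(cycle))  # 1.5
```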
4.2 Local Level Analysis
Now, execute:
$ python local_analysis.py --username your_IG_username --input_txt_file path_to_relations.txt --include_me False
Before examining my results, let me briefly outline the local centrality measures we will apply:
Betweenness Centrality: This measure identifies the number of shortest paths passing through a specific node, highlighting nodes acting as bridges within the network.
Closeness Centrality: For a specific node, this is calculated as the reciprocal of the sum of all shortest path lengths to every other node, indicating which nodes facilitate the quickest access to others.
Degree Centrality: In a directed graph, both in-degree and out-degree centrality can be calculated, indicating the number of nodes a particular node is connected to (out-degree centrality) or how many nodes are connected to that specific node (in-degree centrality).
Eigenvector Centrality & PageRank: This measure expands on degree centrality by considering not just direct connections but also the influence of those connections. PageRank assigns scores to nodes based on the number and quality of incoming links, and is famously part of Google's ranking algorithm.
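The measures above can be tried on a toy graph with NetworkX. This is a sketch, not the contents of local_analysis.py. PageRank is written out as a short power iteration purely to show the mechanics (NetworkX also offers it as `nx.pagerank`); the damping factor of 0.85 and the fixed iteration count are conventional choices assumed here.

```python
import networkx as nx

G = nx.DiGraph([("a", "hub"), ("b", "hub"), ("hub", "c"),
                ("hub", "d"), ("c", "d")])

betweenness = nx.betweenness_centrality(G)
# For directed graphs, NetworkX bases closeness on incoming distances.
closeness = nx.closeness_centrality(G)
in_degree = nx.in_degree_centrality(G)    # in-degree / (n - 1)
out_degree = nx.out_degree_centrality(G)  # out-degree / (n - 1)


def pagerank(adjacency, damping=0.85, iterations=100):
    """Plain power iteration; every node must be a key of `adjacency`."""
    nodes = list(adjacency)
    n = len(nodes)
    rank = {node: 1.0 / n for node in nodes}
    for _ in range(iterations):
        # Rank held by nodes without outgoing edges is spread over all nodes.
        dangling = sum(rank[node] for node in nodes if not adjacency[node])
        new_rank = {node: (1 - damping) / n + damping * dangling / n
                    for node in nodes}
        for node in nodes:
            for target in adjacency[node]:
                new_rank[target] += damping * rank[node] / len(adjacency[node])
        rank = new_rank
    return rank


def top(scores, k=3):
    """Return the k highest-scoring node names."""
    return sorted(scores, key=scores.get, reverse=True)[:k]


ranks = pagerank({node: list(G.successors(node)) for node in G})
print("betweenness:", top(betweenness))
print("pagerank:", top(ranks))
```

In this toy graph, "hub" sits on every shortest path between the left and right side, which is exactly the bridging role that betweenness centrality rewards.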
Results and Interpretation for My Network: Below is a table displaying the top three profiles per local centrality measure.
Interpreting these results: juliepiessen (my sister) connects different friend groups, achieving the highest betweenness score and fifth in PageRank. chaimfes, a close friend, ranks highly across all measures due to connections made through a student association (SINC). The account stereyou_box ranks solely in out-degree centrality, as it follows many but is not followed back often. maxschoepen, an old friend, holds the highest PageRank due to connections I introduced him to, while the remaining top PageRankers also connect various friend groups.
4.3 Clustering / Community Detection
From my course notes, I summarize:
"Clustering or Cluster Analysis refers to a range of techniques aimed at grouping similar items or objects, especially when the data structure is unknown."
In simpler terms, we aim to identify profiles that cluster together within my network. Given my familiarity with it, I plotted my manual groupings in figure 6, with meanings provided in the caption. This figure will serve for qualitative comparison against the automated clustering algorithm outcomes.
Two hierarchical clustering methods exist: agglomerative and divisive. Agglomerative methods assign each node to its own cluster and merge iteratively, while divisive methods start with a single cluster and split it into smaller ones.
Louvain Method (Agglomerative): This method optimizes network modularity, defined as the fraction of edges within clusters minus the expected fraction if edges were randomly distributed. Higher modularity indicates improved clustering. The process involves:
- Assigning each node to its own cluster, then iterating through each node to assign it to a neighbor's cluster, calculating modularity changes. If positive changes exist, the node shifts to the cluster with the largest positive change.
- Constructing a new, coarse-grained network from the clusters obtained in step 1, repeating until no further modularity-optimizing changes can be made.
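The quantity both steps try to increase can be computed directly. The sketch below treats the network as undirected, which is an assumption made here for simplicity, and uses the standard formulation: for each cluster, the fraction of edges inside it minus the squared fraction of edge endpoints it holds. The function name `modularity` is illustrative.

```python
def modularity(edges, community_of):
    """Modularity of a partition of an undirected graph without self-loops.

    `edges` is a list of (u, v) pairs, `community_of` maps node -> label.
    """
    m = len(edges)
    internal = {}    # number of edges with both ends in the community
    degree_sum = {}  # total degree of the nodes in the community
    for u, v in edges:
        cu, cv = community_of[u], community_of[v]
        degree_sum[cu] = degree_sum.get(cu, 0) + 1
        degree_sum[cv] = degree_sum.get(cv, 0) + 1
        if cu == cv:
            internal[cu] = internal.get(cu, 0) + 1
    return sum(internal.get(c, 0) / m - (d / (2 * m)) ** 2
               for c, d in degree_sum.items())


# Two triangles joined by a single bridge edge (3, 4).
edges = [(1, 2), (2, 3), (1, 3), (4, 5), (5, 6), (4, 6), (3, 4)]
split = {1: "left", 2: "left", 3: "left", 4: "right", 5: "right", 6: "right"}
merged = {node: "all" for node in split}
print(round(modularity(edges, split), 3))   # 0.357
print(round(modularity(edges, merged), 3))  # 0.0
```

Putting everything in one cluster always scores zero, which is why the score is useful for comparing partitions.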
Girvan-Newman Method (Divisive): This method uses edge betweenness to identify clusters: it calculates the betweenness of each edge, removes the edge with the highest value, and recalculates, repeating until no edges remain. The result is a dendrogram displaying cluster relationships. I calculated modularity at each depth to determine the optimal clustering depth.
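Choosing the depth by modularity can be sketched with NetworkX, which ships both `girvan_newman` and `modularity` in `networkx.algorithms.community`. This is an illustration on a toy undirected graph, not the code of community_detection.py.

```python
import networkx as nx
from networkx.algorithms.community import girvan_newman, modularity

# Two triangles joined by a single bridge edge (3, 4).
G = nx.Graph()
G.add_edges_from([(1, 2), (2, 3), (1, 3), (4, 5), (5, 6), (4, 6), (3, 4)])

best_partition, best_score = None, float("-inf")
# Each iteration yields the partition one level deeper in the dendrogram.
for partition in girvan_newman(G):
    score = modularity(G, partition)
    if score > best_score:
        best_partition, best_score = partition, score

clusters = sorted(sorted(cluster) for cluster in best_partition)
print(clusters, round(best_score, 3))
```

Here the bridge carries every shortest path between the two triangles, so it is removed first and the two triangles come out as the best-scoring partition.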
Results:
$ pip install python-community
$ python community_detection.py --username your_IG_username --input_txt_file path_to_relations.txt --input_json_file path_to_relations.json --include_me False
The Louvain method achieved a modularity score of 0.63, while Girvan-Newman reached 0.58. The discrepancy arises because Louvain directly optimizes for modularity. Below are the graphs illustrating the results alongside my manual groupings.
The Louvain method (figure 7) effectively identifies most of the groups I anticipated, merging clusters E and D due to overlapping connections.
Conversely, the Girvan-Newman method (figure 8) groups D, E, F, and G together but captures smaller clusters that Louvain misses, such as the four red nodes representing my girlfriend, her brother, and their family.
While the Louvain method is superior in terms of modularity scores, I personally find the Girvan-Newman method's clustering to be more reflective of my network.
Section 5 — Conclusion
Congratulations on reaching the conclusion of this introduction to network analysis through a practical example. I hope you found it insightful, and I'm eager to see how your Instagram network unfolds!
If you're a developer interested in enhancing my code or making adjustments, feel free to submit a pull request!