A Comparative Algorithm Audit of Conspiracies on the Net: Methodology

Written by browserology | Published 2024/04/26
Tech Story Tags: algorithm-audit | web-search-audit | search-results-audit | search-engine-comparisons | conspiracies-on-the-net | conspiracy-promoting-results | web-search-engines | web-search-output-quality

TL;DR: A comparative algorithm audit of the distribution of conspiratorial information in search results across five search engines.

This paper is available on arXiv under a CC 4.0 license.

Authors:

(1) Aleksandra Urman, Department of Informatics, University of Zurich, Switzerland (corresponding author);

(2) Mykola Makhortykh, Institute of Communication and Media Studies, University of Bern, Switzerland;

(3) Roberto Ulloa, GESIS - Leibniz-Institut für Sozialwissenschaften, Germany;

(4) Juhi Kulshrestha, Department of Politics and Public Administration, University of Konstanz, Germany.


Methodology

Data collection

Agent-based audit setup. To collect the data, we used agent-based algorithm impact auditing (see Ulloa et al., 2021). This type of auditing relies on automated agents, i.e., software that simulates user browsing behavior (e.g., scrolling web pages) and records the outputs. The benefit of this approach is that it allows controlling for search personalization (Hannak et al., 2013) and randomization (Makhortykh et al., 2020). Unlike human users, automated agents can be easily synchronized (i.e., to isolate the effect of the time at which the search actions are conducted) and deployed in a controlled environment (e.g., a network of virtual machines using the same IP range, the same operating system (OS), and the same browsing software) to limit the effects of personalization based on user-specific factors (e.g., location or OS type). Our approach is comparative in the sense that we assess the outputs of several SEs across multiple locations, time periods, and queries.

To increase the robustness of our observations, the data were collected over two rounds: March 18-19 and May 8-9, 2021. From a technical standpoint, the data collection was organized as follows: each agent consisted of two browser plugins (for either the Chrome or Firefox desktop browser). The first plugin (the bot) simulated human activity in the browser, in our case opening the search engine page, entering a query, and scrolling down the first page of search outputs. The second plugin (the tracker) recorded the HTML content appearing in the browser and sent it to a remote server.
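To make the agent routine concrete, here is a minimal sketch in Python using Selenium as a stand-in for the two browser plugins; the server endpoint, the search-box locator, and the scrolling parameters are illustrative assumptions rather than the study's actual implementation.

```python
# Minimal sketch of an agent's routine: open a search engine, enter a
# query, scroll the first results page, and ship the HTML to a server.
# The study used browser plugins (bot + tracker); Selenium is used here
# only to approximate that division of labor.
import time

import requests
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.common.keys import Keys

SERVER_URL = "https://example.org/collect"  # hypothetical tracker endpoint


def run_agent(engine_url: str, query: str) -> None:
    driver = webdriver.Chrome()
    try:
        # "Bot" part: simulate user behavior.
        driver.get(engine_url)
        box = driver.find_element(By.NAME, "q")  # locator varies by engine
        box.send_keys(query, Keys.RETURN)
        time.sleep(2)  # let the results page render
        for _ in range(5):  # scroll down the first results page
            driver.execute_script("window.scrollBy(0, window.innerHeight);")
            time.sleep(1)
        # "Tracker" part: record the HTML and send it to the remote server.
        requests.post(SERVER_URL, data={"engine": engine_url,
                                        "query": query,
                                        "html": driver.page_source})
    finally:
        driver.quit()


run_agent("https://www.bing.com", "flat earth")
```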

The agents were deployed via Amazon Elastic Compute Cloud (EC2). We limited our exploration to the five biggest SEs by market share on desktop devices worldwide: Google, Bing, Yahoo, Yandex, and DuckDuckGo (Desktop Search Engine Market Share Worldwide, n.d.). During the first round of data collection (March), we deployed 60 agents per location (see below), whereas for the second round we decreased this number to 30 agents per location because of budgetary limitations. The agents were distributed equally between the SEs (i.e., 12 agents per engine in the first round and 6 agents per engine in the second). The majority of agents were able to complete their planned routine; the only major exception was Yandex, where multiple agents were blocked by the aggressive bot-detection mechanisms (e.g., frequent captchas) the engine employs, hence the absence of Yandex results for some search queries.
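The even split of agents across engines and locations can be restated as a short sketch; the EC2 region identifiers for Ohio, Northern California, and London are the standard AWS names, while the allocation function itself is only an illustration of the distribution described above.

```python
# Agents per location are split evenly across the five engines:
# 60 -> 12 per engine in round 1, 30 -> 6 per engine in round 2.
ENGINES = ["google", "bing", "yahoo", "yandex", "duckduckgo"]
REGIONS = {"ohio": "us-east-2", "california": "us-west-1", "london": "eu-west-2"}


def allocate_agents(agents_per_location: int) -> dict:
    per_engine = agents_per_location // len(ENGINES)
    return {region: {engine: per_engine for engine in ENGINES}
            for region in REGIONS.values()}


print(allocate_agents(60))  # round 1: 12 agents per engine per region
print(allocate_agents(30))  # round 2: 6 agents per engine per region
```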

Search query selection. We collected search results for six queries that correspond either to specific conspiracy theories (“flat earth”, “new world order”, “qanon”) or to subjects related to conspiracy theories (“9/11”, “illuminati”, “george soros”). We included these two groups of queries because they are likely to be entered by users with different intentions: queries corresponding to the names of conspiracies are more likely to be entered by people already interested in a given theory, while subject-related queries are more likely to come from people merely interested in a certain topic. Thus, if conspiratorial content is returned for the latter queries, it might lead to users’ incidental exposure to conspiracies. Further, we expect to observe discrepancies between the two groups, with more conspiratorial content returned for the queries corresponding to specific theories and less for the subjects that are merely surrounded by conspiracy theories.
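For reference, the query design can be summarized as a simple mapping; the group labels below are introduced here for illustration and do not appear in the paper.

```python
# Six queries, grouped by the presumed intent of users entering them.
QUERIES = {
    "conspiracy_theory": ["flat earth", "new world order", "qanon"],
    "related_subject": ["9/11", "illuminati", "george soros"],
}
```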

Location selection. We analyze the data collected for the five SEs across three locations: two in the US (Ohio and California) and one in the UK (London). Our selection of locations was determined in part by the availability of EC2 servers (see the description of the technical implementation of the data collection above). We also restricted ourselves to locations where English is the official language for the purposes of comparability, i.e., to make sure that any differences in the retrieved results stem from location rather than from data voids in a specific language. Finally, we selected Ohio and California because of the differences in the ideological orientations of the two states: while California is a solidly “blue” (Democratic) state, Ohio is a “swing” state; we included it since there are no solidly “red” states where EC2 servers are located. The inclusion of the UK server allows us to check whether our observations are country-specific. This comparative approach allows us to go beyond the single-country observations that are common in algorithm audit studies (e.g., Haim et al., 2018; Puschmann, 2019; Urman et al., 2021a).

Data analysis

After the data were collected, we extracted all unique URLs that appeared on the first page of search results for each engine and search query. We focused on the first page of search results because people tend to perceive top search results as more credible and rarely look at or click on results beyond the first page (Pan et al., 2007; Schultheiß et al., 2018; Unkel and Haas, 2017; Urman and Makhortykh, 2021). In total, 375 unique URLs were collected. The content under these URLs was then manually coded by two trained coders, with disagreements resolved through consensus coding. The coding involved two variables, broadly corresponding to the RQs: conspiracy-related stance and type of source.
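A rough sketch of the URL-extraction step, assuming the stored pages are parsed with BeautifulSoup; the link selector is a placeholder, since the actual result markup differs across engines.

```python
# Parse stored result pages and collect the set of unique result URLs.
from bs4 import BeautifulSoup


def extract_result_urls(html: str) -> set[str]:
    soup = BeautifulSoup(html, "html.parser")
    # Placeholder selector: each engine needs its own result selector.
    return {a["href"] for a in soup.select("a[href^='http']")}


def unique_urls(pages: list[str]) -> set[str]:
    urls: set[str] = set()
    for html in pages:
        urls |= extract_result_urls(html)
    return urls
```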

The conspiracy-related stance variable included the following categories:

● Debunks conspiracy: the content under the URL lists argument(s) why a conspiracy theory is not true

● No mention of conspiracy: there are no mentions of conspiracy theories in the source at all

● Promotes conspiracy: the content under the URL lists argument(s) why a conspiracy theory is true

● Mentions conspiracy: a conspiracy theory is mentioned, but no clear stance is taken, e.g., no arguments are given for it being true or false, or arguments for both are presented in equal shares.

The type of source variable included the following categories (a minimal encoding of both coding variables is sketched after this list):

● Reference website: e.g., online encyclopedias such as Wikipedia or Britannica

● Media: any media organization regardless of its ideological position or themes, except purely science-focused news sites (see the next category)

● Science: scientific repositories such as JSTOR or news sites devoted exclusively to science such as Science News

● Social media: e.g., Facebook pages, YouTube videos or Twitter accounts

● Conspiracy website: a website dedicated to the promotion of one specific conspiracy or of an array of conspiracy theories, such as a website of the Flat Earth Society

● Other: all other websites that do not fit into any of the previous categories
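To make the coding scheme concrete, both variables can be expressed as enumerations; the example record at the end is purely illustrative, not an actual coding decision from the study.

```python
# The two manual coding variables, expressed as enumerations.
from enum import Enum


class Stance(Enum):
    DEBUNKS = "debunks conspiracy"
    NO_MENTION = "no mention of conspiracy"
    PROMOTES = "promotes conspiracy"
    MENTIONS = "mentions conspiracy"


class SourceType(Enum):
    REFERENCE = "reference website"
    MEDIA = "media"
    SCIENCE = "science"
    SOCIAL_MEDIA = "social media"
    CONSPIRACY = "conspiracy website"
    OTHER = "other"


# Illustrative coded record: (url, stance, source type).
coded = [("https://en.wikipedia.org/wiki/Flat_Earth",
          Stance.MENTIONS, SourceType.REFERENCE)]
```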

To assess the prevalence of conspiratorial content in search results (RQ1), we calculated the share of results for each stance towards conspiracy theories per condition (i.e., query, engine, location and collection round). To assess the prioritization of source types in relation to conspiracy content (RQ2), we first calculated the shares of results for each type per condition, and then the share of results with different stances towards conspiracy theories for each source type per location-round pair.
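These share computations map naturally onto grouped proportions; below is a sketch with pandas, where the file and column names are assumptions about how the coded data might be stored.

```python
# Shares of stances per condition (RQ1) and stances per source type
# per location-round pair (RQ2). "coded_results.csv" is hypothetical.
import pandas as pd

df = pd.read_csv("coded_results.csv")  # columns assumed as used below

# RQ1: share of each stance per (query, engine, location, round).
stance_shares = (
    df.groupby(["query", "engine", "location", "round"])["stance"]
      .value_counts(normalize=True)
      .rename("share")
      .reset_index()
)

# RQ2: share of each stance per source type and location-round pair.
stance_by_source = (
    df.groupby(["location", "round", "source_type"])["stance"]
      .value_counts(normalize=True)
      .rename("share")
      .reset_index()
)
```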

