Wikipedia Growth (Part 1 - Exploration)

Updated: Sep 22, 2020

How has Wikipedia grown in the past two decades?


After analyzing the history of edits and pages that a privacy research lab at UC Berkeley had added to Wikipedia, I became curious about the overall trends of Wikipedia as a whole. After some digging around, I found this site: https://stats.wikimedia.org/#/all-projects.


While the numbers and figures it presented were interesting, some data from the early years of Wikipedia were missing, and some types of analyses and comparisons were not present. I decided to take matters into my own hands and downloaded the datasets from Wikimedia (shoutout to them for open access). To supplement the data, I also used estimates from https://en.wikipedia.org/wiki/Wikipedia:Size_of_Wikipedia.


I used Python and pandas in a Jupyter notebook to aggregate and sort through the data before visualizing the numbers with matplotlib.

(Note: This data pertains only to the English Wikipedia.)


Data Summaries & Visualizations

I was first curious about Wikipedia's userbase. This chart shows that Wikipedia has maintained a fairly level rate of newly registered users per month, which surprised me. I was happy to see that people are still interested in maintaining and contributing to Wikipedia. Of course, new registrations are not necessarily indicative of active users, but I still think that a level number of newly registered users is a healthy sign. The two sudden spikes in new users are interesting. The first one makes sense, since I can see how Wikipedia could have gained massive numbers of new users when it was newer, but I am curious as to what drove the spikes in 2014-2016.

Similarly, Wikipedia has grown at a fairly constant rate for the past decade, with some notable exceptions. There appears to have been a significant purge of some sort in 2010, in which much content was removed, and a massive number of additions in 2015. While I am curious about those events, I am also curious about how the number of bytes added per month has remained so even. I would have expected a slight upward trend as more users registered and contributed to Wikipedia, but perhaps content is regularly removed as well. I have to wonder whether a constant rate of addition is actually enough to keep Wikipedia truly up to date, but that leans towards speculation rather than analysis of the data on hand.

This chart of the size of Wikipedia over time was created by adding the net change in bytes per month to create a cumulative count of the total bytes of Wikipedia. The rate of change is remarkably constant, and I wonder if this is happenstance or a result of Wikipedia's vigilance against misinformation - I know they have many editors and bots dedicated to reverting improper edits. Either way, the lack of an interesting rate of change is quite interesting.
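The cumulative count described above can be sketched with a running sum in pandas. This is only a minimal illustration: the function name cumulative_size and the numbers below are made up for the example and are not the actual Wikimedia figures or column names.

```python
import pandas as pd


def cumulative_size(net_bytes_per_month: pd.Series) -> pd.Series:
    """Return the running total of monthly net byte changes, in month order."""
    return net_bytes_per_month.sort_index().cumsum()


# Made-up illustrative values, NOT real Wikimedia data.
months = pd.date_range("2001-01-01", periods=4, freq="MS")
net_bytes = pd.Series([1000, 2500, -500, 3000], index=months, name="net_bytes")

total_bytes = cumulative_size(net_bytes)
print(total_bytes)
```

A negative month (more bytes removed than added) simply lowers the running total, which is how a purge like the one in 2010 would show up as a dip in the cumulative curve.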

This graph also demonstrates a fairly level rate of pages added per month. Overall trends, including the spike in 2015, seem to correspond to the net change in bytes per month, which makes sense. I think it's far more interesting that the rate of pages added per month has remained fairly constant - that means that new subjects and topics are constantly being added to Wikipedia, even after two decades. (Of course, this figure does not include the number of pages that are removed per month.)

The number of pages added to Wikipedia per month, shown here cumulatively, corresponds quite well to the size of Wikipedia in bytes over time. This would suggest that even with removed pages, the number of pages is strongly correlated with the number of bytes.
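One way to put a number on that correspondence is a Pearson correlation coefficient. The sketch below uses made-up values and hypothetical column names (total_pages, total_bytes), so it shows the technique rather than the actual result for Wikipedia.

```python
import pandas as pd

# Made-up illustrative values, NOT real Wikimedia data.
toy = pd.DataFrame(
    {
        "total_pages": [10, 25, 40, 60, 85],
        "total_bytes": [16_000, 41_000, 63_000, 97_000, 135_000],
    }
)

# Pearson correlation between the two cumulative series.
r_cumulative = toy["total_pages"].corr(toy["total_bytes"])

# Correlation between the month-to-month changes (first row dropped by diff).
monthly = toy.diff().dropna()
r_monthly = monthly["total_pages"].corr(monthly["total_bytes"])

print(f"cumulative: {r_cumulative:.3f}, monthly changes: {r_monthly:.3f}")
```

Two cumulative series that both keep rising will usually correlate strongly almost by construction, so the correlation between monthly changes is the stricter of the two checks.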

This chart compares the number of pages over time with the number of gigabytes over time to visualize their ratio. The two have maintained a very constant relationship over time. In fact, by comparing the two, I found that the average Wikipedia page size is roughly 1595 bytes (or 1.595 KB, or 1.595e-6 GB), with a standard deviation of 4148 bytes (or 4.148 KB, or 4.148e-6 GB).
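As a rough sketch of how such a figure could be derived, the code below divides bytes by pages month by month and takes the mean and standard deviation of that ratio, then converts a byte count into KB and GB. Whether the ratio was computed per month in exactly this way is an assumption on my part, decimal units (1 KB = 1000 bytes) are assumed, and the function names and toy numbers are made up for illustration.

```python
import pandas as pd

BYTES_PER_KB = 1_000          # decimal units, assumed
BYTES_PER_GB = 1_000_000_000


def page_size_stats(bytes_per_month: pd.Series, pages_per_month: pd.Series):
    """Mean and sample standard deviation of bytes per page, month by month."""
    # Months with zero pages become NaN and are skipped by mean() and std().
    safe_pages = pages_per_month.where(pages_per_month != 0)
    ratio = bytes_per_month / safe_pages
    return ratio.mean(), ratio.std()


def describe_bytes(n: float) -> str:
    """Express a byte count in bytes, KB, and GB."""
    return f"{n:.0f} bytes = {n / BYTES_PER_KB:.3f} KB = {n / BYTES_PER_GB:.3e} GB"


# Made-up illustrative values, NOT real Wikimedia data.
toy_bytes = pd.Series([3000, 4500, 2000])
toy_pages = pd.Series([2, 3, 1])
mean_size, std_size = page_size_stats(toy_bytes, toy_pages)

# Unit conversion applied to the two figures quoted in the text.
print(describe_bytes(1595))
print(describe_bytes(4148))
```

The conversion is the part worth double-checking by hand: dividing bytes by one billion moves the decimal point nine places, so a few thousand bytes is always on the order of 1e-6 GB.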

Edits per month have been a little less constant; they seem to have decreased until 2014, when they began to increase at a fairly slow rate. I do wonder as to the driving forces behind these changes.

Strangely, the Wikimedia data had two types of editing data, "edits per month" and "user edits per month." From this graph, it seems that user edits per month are actually more regular, and the trends seem a little more defined here.

Of course, differentiating between "edits per month" and "user edits per month" implies the existence of non-user edits. By subtracting user edits from overall edits, I created a history of non-user edits, which I assume are done by various Wikipedia bots.

These graphs indicate the ratios are roughly the same, all things considered, and very regular. Amazingly regular, really.
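The subtraction and the ratios discussed above can be sketched as follows. The function name split_edits, the column names, and the numbers are all hypothetical, and treating every non-user edit as a bot edit remains the assumption stated in the text.

```python
import pandas as pd


def split_edits(total_edits: pd.Series, user_edits: pd.Series) -> pd.DataFrame:
    """Derive non-user edits and their share of all edits per month."""
    non_user = total_edits - user_edits
    return pd.DataFrame(
        {
            "user_edits": user_edits,
            "non_user_edits": non_user,
            "non_user_share": non_user / total_edits,
        }
    )


# Made-up illustrative values, NOT real Wikimedia data.
months = pd.date_range("2015-01-01", periods=3, freq="MS")
total = pd.Series([1000, 1200, 900], index=months)
users = pd.Series([800, 900, 720], index=months)

edits = split_edits(total, users)
print(edits)
```

Plotting non_user_share over time is one way to check whether the ratio really stays as regular as the graphs suggest.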



I'd like to thank the people at Wikipedia/Wikimedia for making this data openly available. Wikipedia is great, and it looks like it is growing at a remarkably regular pace.
