Besides analyzing Wikipedia growth, I was curious to see how that growth could be modeled using polynomial regression. Polynomial regression is a type of regression analysis that fits a polynomial equation to data; here, the data is two-dimensional, with one input (time) and one output (the quantity being measured).
Like all models, polynomial regression has its own flaws. A polynomial of too low a degree can underfit the data, meaning the model is too simple to follow the data accurately. A polynomial of too high a degree can overfit the data, meaning the model fits the noise and error in the data instead of its overall trends. Either kind of model can perform poorly on new data. Even with an appropriate degree, polynomial regression is not necessarily the best way to model data. One issue is its behavior at the extremes: at the edges of the data, and especially beyond them, the polynomial often heads rapidly toward positive or negative infinity, which may not be appropriate for the data. However, it remains a good method for understanding the general trends of data by simplifying them into something cleaner. As the saying goes, all models are wrong, but some are useful.
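To make the underfitting and overfitting trade-off a little more concrete, here is a minimal sketch using NumPy. It does not use the Wikipedia numbers: the data is a synthetic noisy sine curve, and the degrees (1, 3, 9), the noise level, and the every-other-point holdout are all assumptions chosen purely for illustration.

```python
import numpy as np

# Synthetic stand-in data (NOT the Wikipedia numbers): a smooth curve plus noise.
rng = np.random.default_rng(0)
x = np.linspace(-1.0, 1.0, 40)
y = np.sin(np.pi * x) + rng.normal(scale=0.2, size=x.size)

# Hold out every other point to mimic "new data" the model has not seen.
x_train, y_train = x[::2], y[::2]
x_test, y_test = x[1::2], y[1::2]


def rmse(coeffs, xs, ys):
    """Root-mean-square error of the polynomial `coeffs` on (xs, ys)."""
    return float(np.sqrt(np.mean((np.polyval(coeffs, xs) - ys) ** 2)))


errors = {}
for degree in (1, 3, 9):
    coeffs = np.polyfit(x_train, y_train, degree)  # least-squares polynomial fit
    errors[degree] = (rmse(coeffs, x_train, y_train), rmse(coeffs, x_test, y_test))
    print(f"degree {degree}: train RMSE {errors[degree][0]:.3f}, "
          f"held-out RMSE {errors[degree][1]:.3f}")
```

The training error can only shrink as the degree rises, so the held-out column is the one to watch: if it stops improving or gets worse while the training error keeps falling, the extra degrees are probably fitting noise.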
I used Python and pandas in a Jupyter notebook to aggregate and sort through the data before visualizing the numbers with matplotlib.
(Note: This data pertains only to the English Wikipedia.)
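The notebook itself is not reproduced here. As a rough sketch of the kind of pandas aggregation described above, the snippet below rolls monthly rows up into yearly totals. The column names `month` and `edits` and all of the numbers are made up for illustration; they are not the real dataset or its schema.

```python
import pandas as pd

# Made-up monthly counts; the column names "month" and "edits" are hypothetical.
df = pd.DataFrame({
    "month": pd.date_range("2001-01-01", periods=24, freq="MS"),
    "edits": [100 * n for n in range(1, 25)],
})

# Aggregate the monthly rows into yearly totals, ordered by year.
yearly = df.groupby(df["month"].dt.year)["edits"].sum().sort_index()
print(yearly)
```

A series like `yearly` could then be passed to matplotlib for plotting or to a polynomial fit.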
Data Summaries & Visualizations
This graph explores how polynomials of different degrees model the same data. The higher the degree, the more closely the polynomial fits the data. This may initially appear to be a good thing, but fitting the data too closely makes it harder for the model to predict new data accurately.
While the polynomial of degree 8 seems to fit the data best, it also ends by trending sharply downward while the data appears to have leveled off; the other polynomials don't fare any better. This graph also demonstrates one of the issues of polynomial regression: as the model reaches the end of the data, every polynomial rises or falls steeply, which may not reflect what the data is actually doing.
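The edge behavior is easy to reproduce on toy data. The sketch below fits a degree-8 polynomial to a hypothetical curve that grows and then levels off (a logistic curve, not the Wikipedia numbers) and compares predictions inside the observed range with predictions past its end. The curve, the x range, and the choice of `numpy.polynomial.Polynomial.fit` are assumptions made for illustration.

```python
import numpy as np
from numpy.polynomial import Polynomial

# Hypothetical stand-in for a metric that grows and then levels off
# (a logistic curve); these are NOT the Wikipedia numbers.
x = np.linspace(0.0, 10.0, 50)
y = 1.0 / (1.0 + np.exp(-(x - 5.0)))

# Polynomial.fit rescales x internally, which helps keep a degree-8 fit
# numerically well behaved.
poly = Polynomial.fit(x, y, deg=8)

inside = poly(x)                        # predictions within the observed range
x_beyond = np.linspace(10.0, 20.0, 50)  # points past the last observation
beyond = poly(x_beyond)

print("largest |fit - data| inside the range:", np.max(np.abs(inside - y)))
print("largest |prediction| beyond the range:", np.max(np.abs(beyond)))
```

The data never leaves the interval from 0 to 1, yet the fitted polynomial leaves it quickly once it is evaluated past the last observation, which is why extrapolating from a high-degree fit deserves suspicion.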
This graph shows that polynomial regression seems to work well for simpler, more uniform data; the page count of Wikipedia has climbed at a very steady rate, which the polynomial captures very well.
Below are some other interesting polynomial models of Wikipedia growth, which follow trends similar to those above.
Once again, thanks go to the Wikipedia/Wikimedia people for open access to their data.