The Rubik's Cube is a puzzle that has captured the attention and excitement of people around the world for decades. Once seen as a nearly impossible task, solving the Rubik's Cube has evolved over time into a fun challenge that, with dedicated practice, can be completed in mere seconds. There is a whole world of competitive Rubik's Cube solvers who compete to see who can solve which puzzles, in which ways, the fastest.
In this tutorial, we will explore and analyze data from the world of competitive Rubik's Cube solving. Through this data analysis, we aim to gain insight into the current state and future of cubing as an activity.
Some terminology in this tutorial may be unfamiliar. These terms will make more sense in the context they are given, but here is a basic glossary:
- WCA - aka the World Cube Association, the governing body that organizes, oversees, and regulates Rubik's Cube competitions all over the world
- cubing - the activity/hobby/sport of solving Rubik's cubes
- 3x3 - the standard 3x3x3 Rubik's Cube
- event - aside from the standard 3x3, there are many different events in which competitors compete, including 4x4, 3x3 blindfolded, and many more
- solve - a unit of measurement representing the process of solving a Rubik's Cube (ie. "that solve was really fast!")
- single - the time of one solve. Usually used in contrast to average (ie. "I got a 9.50 single today!")
- average - the arithmetic mean of the middle 3 solves out of 5 (ie. out of 5 solves, the slowest and fastest solves are disregarded and the rest are averaged, which helps eliminate luck as a factor). In some less common contexts, "average" refers to the plain arithmetic mean of all solves.
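The average computation described above can be sketched in a few lines of Python. This is a hypothetical helper for illustration only, not part of the WCA database or any library used later in the tutorial:

```python
def wca_average(times):
    """Compute a WCA average of 5: drop the fastest and slowest
    solves, then take the arithmetic mean of the middle three.

    `times` is a list of five solve times in seconds.
    """
    if len(times) != 5:
        raise ValueError("a WCA average is computed from exactly 5 solves")
    middle_three = sorted(times)[1:-1]  # discard best and worst
    return round(sum(middle_three) / len(middle_three), 2)

print(wca_average([9.12, 10.50, 8.75, 11.03, 9.98]))  # → 9.87
```

Here the 8.75 and 11.03 are discarded, and the remaining three solves average to 9.87.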
The World Cube Association conveniently provides an exportable SQL file of the entire database. The most recent export can be obtained on this page; however, this tutorial uses the May 04 2023 6:02 GMT export of the database (download it here). Due to the frequently updating nature of the WCA database, it was necessary to settle on one version in order to facilitate development of this tutorial.
Due to Github's restrictions on file size, we were not able to include the WCA database in this repository. If you want to follow along with this tutorial exactly, you must download the database at the link provided above.
Aside from the required Python libraries, this tutorial also requires MySQL Server in order to follow along. This tutorial will not detail how to get MySQL Server installed and running, however the process is fairly straightforward and there are many online guides that can help.
Clone this repository here, then navigate into it. Install all the required Python libraries by running the cell below, or by running the command (without the % symbol) in your terminal.
This setup portion assumes basic familiarity with the command line and git repositories. For more information, check out the recommended readings: First steps with git, Git Basics, Command line crash course, Command line basics
%pip install -r requirements.txt
# imports
import datetime
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from getpass import getpass
from mysql.connector import connect
from sklearn import svm
from sklearn.linear_model import LinearRegression, LogisticRegression
import statsmodels.formula.api as sm
from keras.models import Sequential
from keras.layers import Dense, LSTM
from sklearn.preprocessing import MinMaxScaler
# suppressing the "pandas only supports SQLAlchemy" warning
import warnings
warnings.simplefilter(action='ignore', category=UserWarning)
# full warning:
# """ UserWarning: pandas only supports SQLAlchemy connectable (engine/connection) or database string URI or sqlite3 DBAPI2 connection.
# Other DBAPI2 objects are not tested. Please consider using SQLAlchemy. """
Run the cell below and input your MySQL Server username and password (they are kept in memory until the kernel is stopped). Depending on how you configured MySQL Server, you may not need a username nor a password. If this is the case, then you can press enter twice to skip both fields.
Most users can run `mysql -u root` in their terminal to log into MySQL Server. If you can do this successfully, that means you can press enter twice to skip both fields after running the cell below.
mysql_username = input("Enter username: ")
mysql_password = getpass("Enter password: ")
Now, we can establish a connection with MySQL Server by using the mysql-connector-python library. First we create a connection object, use that to create a cursor object, then we execute a query to create an empty database called wca. Lastly, we close the connection and the cursor.
# establish connection with MySQL Server
connection = connect(
host="localhost",
user=mysql_username,
password=mysql_password
)
qry_create_db = "CREATE DATABASE wca" # query to create an empty database called "wca"
cursor = connection.cursor() # create the cursor
cursor.execute(qry_create_db) # execute the query
cursor.close() # close the cursor
connection.close() # close the connection
Now, we use our downloaded SQL file to actually create the database. If you have set up MySQL Server in the default configuration and are using the same SQL file as I am, run the following cell (or run the command in your terminal, without the ! symbol).
If you have configured a specific MySQL Server username/password, or if you are using a different version of the SQL file downloaded from the WCA, then run the following command in your terminal:
`mysql -u username -ppassword wca < filename.sql`
replacing `username` and `password` with your username and password respectively (note there is no space between `-p` and the password), and replacing `filename.sql` with the [path to the] SQL file you want to use.
Note that the WCA database is fairly large, so don't be alarmed if this command takes a few minutes to execute.
!mysql -u root wca < WCA_export192_20230504T040001Z/WCA_export.sql
Now we can finally get started! Let's run a basic query on the database: we want to see the top fifty 3x3 solvers in the world.
We construct the appropriate SQL query, establish the connection to the database, and use the pandas function read_sql in order to execute this query and retrieve the result as a dataframe.
# establish the connection to our database "wca"
connection = connect(host="localhost", user=mysql_username, password=mysql_password, database="wca")
qry_top_50 = """
SELECT *
FROM RanksSingle
WHERE eventID='333'
LIMIT 50 ;
"""
# execute the query and store the result in a dataframe
df_top_50 = pd.read_sql(qry_top_50, con=connection)
display(df_top_50)
connection.close() # close the connection
| | personId | eventId | best | worldRank | continentRank | countryRank |
|---|---|---|---|---|---|---|
| 0 | 2015DUYU01 | 333 | 347 | 1 | 1 | 1 |
| 1 | 2012PARK03 | 333 | 363 | 2 | 1 | 1 |
| 2 | 2016JONE04 | 333 | 388 | 3 | 1 | 1 |
| 3 | 2017KIMM01 | 333 | 389 | 4 | 2 | 2 |
| 4 | 2019WANY36 | 333 | 390 | 5 | 2 | 2 |
| 5 | 2017GARR05 | 333 | 395 | 6 | 3 | 3 |
| 6 | 2016KOLA02 | 333 | 397 | 7 | 1 | 1 |
| 7 | 2017SIAU02 | 333 | 403 | 8 | 4 | 4 |
| 8 | 2017XURU04 | 333 | 406 | 9 | 3 | 3 |
| 9 | 2016INAB01 | 333 | 413 | 10 | 5 | 5 |
| 10 | 2009ZEMD01 | 333 | 416 | 11 | 2 | 2 |
| 11 | 2016DEXT01 | 333 | 419 | 12 | 3 | 3 |
| 12 | 2012PONC02 | 333 | 424 | 13 | 6 | 6 |
| 13 | 2016SANT08 | 333 | 426 | 14 | 7 | 1 |
| 14 | 2015BORR01 | 333 | 431 | 15 | 4 | 1 |
| 15 | 2010WEYE02 | 333 | 432 | 16 | 2 | 1 |
| 16 | 2016SATO01 | 333 | 437 | 17 | 1 | 1 |
| 17 | 2015SANC11 | 333 | 438 | 18 | 8 | 7 |
| 18 | 2018DULL01 | 333 | 438 | 18 | 3 | 1 |
| 19 | 2019REDI02 | 333 | 438 | 18 | 8 | 7 |
| 20 | 2016SHEL03 | 333 | 442 | 21 | 4 | 1 |
| 21 | 2015FUSH01 | 333 | 442 | 21 | 5 | 1 |
| 22 | 2015MOHA10 | 333 | 444 | 23 | 10 | 9 |
| 23 | 2014SEBA01 | 333 | 444 | 23 | 5 | 1 |
| 24 | 2015MILL01 | 333 | 448 | 25 | 11 | 10 |
| 25 | 2015MACK06 | 333 | 448 | 25 | 11 | 10 |
| 26 | 2016LINB01 | 333 | 450 | 27 | 13 | 2 |
| 27 | 2021ZHAN01 | 333 | 451 | 28 | 6 | 4 |
| 28 | 2015LARS04 | 333 | 453 | 29 | 6 | 1 |
| 29 | 2015GRIE02 | 333 | 454 | 30 | 14 | 12 |
| 30 | 2010KIPA01 | 333 | 459 | 31 | 7 | 2 |
| 31 | 2014MILL04 | 333 | 459 | 31 | 7 | 1 |
| 32 | 2012CHOS01 | 333 | 459 | 31 | 7 | 1 |
| 33 | 2010WANG68 | 333 | 462 | 34 | 15 | 3 |
| 34 | 2015DELA05 | 333 | 463 | 35 | 9 | 1 |
| 35 | 2013BALI01 | 333 | 464 | 36 | 16 | 13 |
| 36 | 2022NUNE01 | 333 | 465 | 37 | 2 | 1 |
| 37 | 2018OPAC01 | 333 | 466 | 38 | 10 | 3 |
| 38 | 2012TYCK01 | 333 | 468 | 39 | 17 | 14 |
| 39 | 2013JOHN10 | 333 | 468 | 39 | 17 | 14 |
| 40 | 2015CHER07 | 333 | 468 | 39 | 17 | 14 |
| 41 | 2018KHAN28 | 333 | 471 | 42 | 20 | 17 |
| 42 | 2014BONA02 | 333 | 472 | 43 | 3 | 1 |
| 43 | 2014XUZI01 | 333 | 472 | 43 | 8 | 5 |
| 44 | 2017WONG01 | 333 | 473 | 45 | 9 | 2 |
| 45 | 2007VALK01 | 333 | 474 | 46 | 11 | 2 |
| 46 | 2016YENC01 | 333 | 475 | 47 | 21 | 18 |
| 47 | 2019SICH01 | 333 | 475 | 47 | 21 | 18 |
| 48 | 2012LEWI01 | 333 | 475 | 47 | 21 | 18 |
| 49 | 2010BRAD01 | 333 | 476 | 50 | 24 | 21 |
Interpreting the above results, we can see the top fifty 3x3 solvers in the world, ranked by their best single. We can see their personID (aka their WCA ID, which is a unique identifier for every WCA competitor), their best single result (measured in centiseconds), their world ranking, continental ranking, and national ranking (all based on their best single time). The eventId being 333 means that these results are in the 3x3 Rubik's Cube event.
We can see that the best 3x3 single result ever is 3.47 seconds, held by 2015DUYU01.
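Since best is stored in centiseconds, converting it to seconds is a single division. A quick sanity check using the top value from the table above:

```python
best_centiseconds = 347  # Yusheng Du's best single, from the query above

# results in the WCA database are stored in centiseconds;
# dividing by 100 gives the time in seconds
best_seconds = best_centiseconds / 100
print(f"{best_seconds:.2f} seconds")  # → 3.47 seconds
```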
Seeing the results is great, but with only the WCA ID of each competitor, we don't know much about them. What if we want to see their names, what country they are from, and their gender?
# using the Python `with` statement for easier resource management
with connect(host="localhost", user=mysql_username, password=mysql_password, database="wca") as connection:
qry_top_50_extra = """
SELECT Persons.id, Persons.name, Persons.countryId, Persons.gender, RanksSingle.best, RanksSingle.worldRank
FROM RanksSingle
INNER JOIN Persons ON RanksSingle.personID=Persons.id
WHERE RanksSingle.eventID='333'
ORDER BY RanksSingle.best
LIMIT 50 ;
"""
df_top_50_extra = pd.read_sql(qry_top_50_extra, con=connection)
display(df_top_50_extra)
| | id | name | countryId | gender | best | worldRank |
|---|---|---|---|---|---|---|
| 0 | 2015DUYU01 | Yusheng Du (杜宇生) | China | m | 347 | 1 |
| 1 | 2012PARK03 | Max Park | USA | m | 363 | 2 |
| 2 | 2016JONE04 | Jode Brewster | Australia | m | 388 | 3 |
| 3 | 2017KIMM01 | Asher Kim-Magierek | USA | m | 389 | 4 |
| 4 | 2019WANY36 | Yiheng Wang (王艺衡) | China | m | 390 | 5 |
| 5 | 2017GARR05 | Luke Garrett | USA | m | 395 | 6 |
| 6 | 2016KOLA02 | Tymon Kolasiński | Poland | m | 397 | 7 |
| 7 | 2017SIAU02 | Max Siauw | USA | m | 403 | 8 |
| 8 | 2017XURU04 | Ruihang Xu (许瑞航) | China | m | 406 | 9 |
| 9 | 2016INAB01 | Matty Hiroto Inaba | USA | m | 413 | 10 |
| 10 | 2009ZEMD01 | Feliks Zemdegs | Australia | m | 416 | 11 |
| 11 | 2016DEXT01 | Riley Dexter | Australia | m | 419 | 12 |
| 12 | 2012PONC02 | Patrick Ponce | USA | m | 424 | 13 |
| 13 | 2016SANT08 | Kyle Santucci | Canada | m | 426 | 14 |
| 14 | 2015BORR01 | Leo Borromeo | Philippines | m | 431 | 15 |
| 15 | 2010WEYE02 | Sebastian Weyer | Germany | m | 432 | 16 |
| 16 | 2016SATO01 | Caio Hideaki Sato | Brazil | m | 437 | 17 |
| 17 | 2018DULL01 | Twan Dullemond | Netherlands | m | 438 | 18 |
| 18 | 2015SANC11 | Nicolás Sánchez | USA | m | 438 | 18 |
| 19 | 2019REDI02 | Dominic Redisi | USA | m | 438 | 18 |
| 20 | 2015FUSH01 | Firstian Fushada (符逢城) | Indonesia | m | 442 | 21 |
| 21 | 2016SHEL03 | Lukas Shelley | Denmark | m | 442 | 21 |
| 22 | 2014SEBA01 | Juliette Sébastien | France | f | 444 | 23 |
| 23 | 2015MOHA10 | Varun Mohanraj | USA | m | 444 | 23 |
| 24 | 2015MILL01 | Dylan Miller | USA | m | 448 | 25 |
| 25 | 2015MACK06 | Zeke Mackay | USA | m | 448 | 25 |
| 26 | 2016LINB01 | Brennen Lin | Canada | m | 450 | 27 |
| 27 | 2021ZHAN01 | Bofan Zhang (张博藩) | China | m | 451 | 28 |
| 28 | 2015LARS04 | Kim Roger Høyland Larsen | Norway | m | 453 | 29 |
| 29 | 2015LARS04 | Kim Roger Haraldsen | Norway | m | 453 | 29 |
| 30 | 2015GRIE02 | Luke Griesser | USA | m | 454 | 30 |
| 31 | 2014MILL04 | Chris Mills | United Kingdom | m | 459 | 31 |
| 32 | 2010KIPA01 | Jakub Kipa | Poland | m | 459 | 31 |
| 33 | 2012CHOS01 | SeungBeom Cho (조승범) | Korea | m | 459 | 31 |
| 34 | 2010WANG68 | Bill Wang | Canada | m | 462 | 34 |
| 35 | 2015DELA05 | Richard Delacoste | Switzerland | m | 463 | 35 |
| 36 | 2013BALI01 | Tanzer Balimtas | USA | m | 464 | 36 |
| 37 | 2022NUNE01 | Robert Yomi Cadenas Nuñez | Peru | m | 465 | 37 |
| 38 | 2018OPAC01 | Kajetan Opach | Poland | m | 466 | 38 |
| 39 | 2013JOHN10 | Brian Johnson | USA | m | 468 | 39 |
| 40 | 2015CHER07 | Tommy Cherry | USA | m | 468 | 39 |
| 41 | 2012TYCK01 | Luke Tycksen | USA | m | 468 | 39 |
| 42 | 2018KHAN28 | Zayn Khanani | USA | m | 471 | 42 |
| 43 | 2014BONA02 | Bautista Bonazzola | Argentina | m | 472 | 43 |
| 44 | 2014XUZI01 | Zibo Xu (徐子博) | China | m | 472 | 43 |
| 45 | 2017WONG01 | Brenton Angelo Lo Wong | Philippines | m | 473 | 45 |
| 46 | 2007VALK01 | Mats Valk | Netherlands | m | 474 | 46 |
| 47 | 2016YENC01 | Christopher Yen | USA | m | 475 | 47 |
| 48 | 2012LEWI01 | Phillip Lewicki | USA | m | 475 | 47 |
| 49 | 2019SICH01 | Oliver Michael Sitja Sichel | USA | m | 475 | 47 |
Looks pretty good! If we compare this dataframe (df_top_50_extra) with the previous one (df_top_50), we notice some slight discrepancies. Notice how df_top_50 ends with 2010BRAD01's result of 4.76 seconds, but df_top_50_extra ends with 2019SICH01's result of 4.75 seconds. This would suggest that we have an extra value in our current dataframe (df_top_50_extra) which is pushing everything down (thus some slower results are pushed out of the top 50).
Why does this discrepancy exist when all we are doing is adding extra information like competitor name and country? If we look closely at (df_top_50_extra), we can see an anomaly...
display(df_top_50_extra[28:30])
| | id | name | countryId | gender | best | worldRank |
|---|---|---|---|---|---|---|
| 28 | 2015LARS04 | Kim Roger Høyland Larsen | Norway | m | 453 | 29 |
| 29 | 2015LARS04 | Kim Roger Haraldsen | Norway | m | 453 | 29 |
We have a duplicate! These two results have a similar name, as well as the exact same countryId, gender, best, and worldRank. This would all be fine (albeit suspicious and unlikely) if not for the fact that the id field is the same. WCA IDs are supposed to be unique identifiers for competitors, and we queried for top fifty 3x3 competitors in the world, so no competitor should appear twice in our dataframe.
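Duplicates like this can also be spotted programmatically with pandas. Here is a minimal sketch using a hypothetical three-row miniature of the data; the real check would run on df_top_50_extra itself:

```python
import pandas as pd

# hypothetical miniature of df_top_50_extra, containing the duplicated ID
df = pd.DataFrame({
    "id": ["2015DUYU01", "2015LARS04", "2015LARS04"],
    "best": [347, 453, 453],
})

# keep=False marks every row whose id appears more than once
dupes = df[df["id"].duplicated(keep=False)]
print(dupes)
```

An empty result here would confirm that every WCA ID in the dataframe is unique.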
The reason this happens is that the WCA allows competitors to change certain data about themselves, such as when a competitor moves countries or legally changes their name. When such a change happens, the competitor's record in the Persons table is not updated; rather, a new record is created with the altered data. We can see this by querying the Persons table for 2015LARS04:
with connect(host="localhost", user=mysql_username, password=mysql_password, database="wca") as connection:
qry_2015lars04 = """
SELECT *
FROM Persons
WHERE id = '2015LARS04' ;
"""
df_2015lars04 = pd.read_sql(qry_2015lars04, con=connection)
display(df_2015lars04)
| | id | subid | name | countryId | gender |
|---|---|---|---|---|---|
| 0 | 2015LARS04 | 1 | Kim Roger Haraldsen | Norway | m |
| 1 | 2015LARS04 | 2 | Kim Roger Høyland Larsen | Norway | m |
Here we see two records with the same id of 2015LARS04. name is the field that differs. As mentioned previously, this represents the same competitor, before and after a legal name change. The subid field is important here, because it indicates that their Persons record has been changed before. The current version is represented by a subid of 1.
Knowing this, we can alter our query for the top fifty 3x3 solvers including name, country, and gender to disallow duplicate WCA IDs. We can do this by adding Persons.subid=1 to the WHERE clause.
with connect(host="localhost", user=mysql_username, password=mysql_password, database="wca") as connection:
qry_top_50_extra = """
SELECT Persons.id, Persons.name, Persons.countryId, Persons.gender, RanksSingle.best, RanksSingle.worldRank
FROM RanksSingle
INNER JOIN Persons ON RanksSingle.personID=Persons.id
WHERE RanksSingle.eventID='333'
AND Persons.subid=1
ORDER BY RanksSingle.best
LIMIT 50 ;
"""
df_top_50_extra = pd.read_sql(qry_top_50_extra, con=connection)
display(df_top_50_extra)
| | id | name | countryId | gender | best | worldRank |
|---|---|---|---|---|---|---|
| 0 | 2015DUYU01 | Yusheng Du (杜宇生) | China | m | 347 | 1 |
| 1 | 2012PARK03 | Max Park | USA | m | 363 | 2 |
| 2 | 2016JONE04 | Jode Brewster | Australia | m | 388 | 3 |
| 3 | 2017KIMM01 | Asher Kim-Magierek | USA | m | 389 | 4 |
| 4 | 2019WANY36 | Yiheng Wang (王艺衡) | China | m | 390 | 5 |
| 5 | 2017GARR05 | Luke Garrett | USA | m | 395 | 6 |
| 6 | 2016KOLA02 | Tymon Kolasiński | Poland | m | 397 | 7 |
| 7 | 2017SIAU02 | Max Siauw | USA | m | 403 | 8 |
| 8 | 2017XURU04 | Ruihang Xu (许瑞航) | China | m | 406 | 9 |
| 9 | 2016INAB01 | Matty Hiroto Inaba | USA | m | 413 | 10 |
| 10 | 2009ZEMD01 | Feliks Zemdegs | Australia | m | 416 | 11 |
| 11 | 2016DEXT01 | Riley Dexter | Australia | m | 419 | 12 |
| 12 | 2012PONC02 | Patrick Ponce | USA | m | 424 | 13 |
| 13 | 2016SANT08 | Kyle Santucci | Canada | m | 426 | 14 |
| 14 | 2015BORR01 | Leo Borromeo | Philippines | m | 431 | 15 |
| 15 | 2010WEYE02 | Sebastian Weyer | Germany | m | 432 | 16 |
| 16 | 2016SATO01 | Caio Hideaki Sato | Brazil | m | 437 | 17 |
| 17 | 2018DULL01 | Twan Dullemond | Netherlands | m | 438 | 18 |
| 18 | 2015SANC11 | Nicolás Sánchez | USA | m | 438 | 18 |
| 19 | 2019REDI02 | Dominic Redisi | USA | m | 438 | 18 |
| 20 | 2015FUSH01 | Firstian Fushada (符逢城) | Indonesia | m | 442 | 21 |
| 21 | 2016SHEL03 | Lukas Shelley | Denmark | m | 442 | 21 |
| 22 | 2014SEBA01 | Juliette Sébastien | France | f | 444 | 23 |
| 23 | 2015MOHA10 | Varun Mohanraj | USA | m | 444 | 23 |
| 24 | 2015MILL01 | Dylan Miller | USA | m | 448 | 25 |
| 25 | 2015MACK06 | Zeke Mackay | USA | m | 448 | 25 |
| 26 | 2016LINB01 | Brennen Lin | Canada | m | 450 | 27 |
| 27 | 2021ZHAN01 | Bofan Zhang (张博藩) | China | m | 451 | 28 |
| 28 | 2015LARS04 | Kim Roger Haraldsen | Norway | m | 453 | 29 |
| 29 | 2015GRIE02 | Luke Griesser | USA | m | 454 | 30 |
| 30 | 2012CHOS01 | SeungBeom Cho (조승범) | Korea | m | 459 | 31 |
| 31 | 2014MILL04 | Chris Mills | United Kingdom | m | 459 | 31 |
| 32 | 2010KIPA01 | Jakub Kipa | Poland | m | 459 | 31 |
| 33 | 2010WANG68 | Bill Wang | Canada | m | 462 | 34 |
| 34 | 2015DELA05 | Richard Delacoste | Switzerland | m | 463 | 35 |
| 35 | 2013BALI01 | Tanzer Balimtas | USA | m | 464 | 36 |
| 36 | 2022NUNE01 | Robert Yomi Cadenas Nuñez | Peru | m | 465 | 37 |
| 37 | 2018OPAC01 | Kajetan Opach | Poland | m | 466 | 38 |
| 38 | 2012TYCK01 | Luke Tycksen | USA | m | 468 | 39 |
| 39 | 2013JOHN10 | Brian Johnson | USA | m | 468 | 39 |
| 40 | 2015CHER07 | Tommy Cherry | USA | m | 468 | 39 |
| 41 | 2018KHAN28 | Zayn Khanani | USA | m | 471 | 42 |
| 42 | 2014XUZI01 | Zibo Xu (徐子博) | China | m | 472 | 43 |
| 43 | 2014BONA02 | Bautista Bonazzola | Argentina | m | 472 | 43 |
| 44 | 2017WONG01 | Brenton Angelo Lo Wong | Philippines | m | 473 | 45 |
| 45 | 2007VALK01 | Mats Valk | Netherlands | m | 474 | 46 |
| 46 | 2016YENC01 | Christopher Yen | USA | m | 475 | 47 |
| 47 | 2019SICH01 | Oliver Michael Sitja Sichel | USA | m | 475 | 47 |
| 48 | 2012LEWI01 | Phillip Lewicki | USA | m | 475 | 47 |
| 49 | 2010BRAD01 | Drew Brads | USA | m | 476 | 50 |
This pattern generalizes to other events and tables. For example, we can query the RanksAverage table to see the top twenty 4x4 solvers in the world by average:
with connect(host="localhost", user=mysql_username, password=mysql_password, database="wca") as connection:
# querying the RanksAverage table instead of RanksSingle table
# in order to get competitor's best WCA average of five solves in a round
qry_top_20_444_avg = """
SELECT Persons.id, Persons.name, Persons.countryId, Persons.gender, RanksAverage.best, RanksAverage.worldRank
FROM RanksAverage
INNER JOIN Persons ON RanksAverage.personID=Persons.id
WHERE RanksAverage.eventID='444'
AND Persons.subid=1
ORDER BY RanksAverage.best
LIMIT 20 ;
"""
df_top_20_444_avg = pd.read_sql(qry_top_20_444_avg, con=connection)
display(df_top_20_444_avg)
| | id | name | countryId | gender | best | worldRank |
|---|---|---|---|---|---|---|
| 0 | 2012PARK03 | Max Park | USA | m | 1938 | 1 |
| 1 | 2010WEYE02 | Sebastian Weyer | Germany | m | 2146 | 2 |
| 2 | 2009ZEMD01 | Feliks Zemdegs | Australia | m | 2157 | 3 |
| 3 | 2013NAHM01 | Seung Hyuk Nahm (남승혁) | Korea | m | 2167 | 4 |
| 4 | 2016KOLA02 | Tymon Kolasiński | Poland | m | 2171 | 5 |
| 5 | 2016INAB01 | Matty Hiroto Inaba | USA | m | 2254 | 6 |
| 6 | 2015WANG09 | Kai-Wen Wang (王楷文) | Taiwan | m | 2261 | 7 |
| 7 | 2012PONC02 | Patrick Ponce | USA | m | 2307 | 8 |
| 8 | 2016NUNE11 | Brandon Nunez | USA | m | 2361 | 9 |
| 9 | 2017RAND06 | Ari Randers-Pehrson | USA | m | 2367 | 10 |
| 10 | 2012BEAH01 | Ciarán Beahan | Ireland | m | 2377 | 11 |
| 11 | 2015BORR01 | Leo Borromeo | Philippines | m | 2377 | 11 |
| 12 | 2017XURU04 | Ruihang Xu (许瑞航) | China | m | 2386 | 13 |
| 13 | 2019HUNG16 | Đỗ Quang Hưng | Vietnam | m | 2394 | 14 |
| 14 | 2010WANG68 | Bill Wang | Canada | m | 2397 | 15 |
| 15 | 2017SIAU02 | Max Siauw | USA | m | 2398 | 16 |
| 16 | 2015DELA05 | Richard Delacoste | Switzerland | m | 2440 | 17 |
| 17 | 2010WUIF01 | Yi-Fan Wu (吳亦凡) | Taiwan | m | 2487 | 18 |
| 18 | 2015FUSH01 | Firstian Fushada (符逢城) | Indonesia | m | 2509 | 19 |
| 19 | 2017GOLU01 | Theo Goluboff | Argentina | m | 2526 | 20 |
We can also filter on competitor attributes such as gender. For example, the top twenty female Square-1 solvers by average:
with connect(host="localhost", user=mysql_username, password=mysql_password, database="wca") as connection:
qry_top_20_sq1_avg = """
SELECT Persons.id, Persons.name, Persons.countryId, Persons.gender, RanksAverage.best, RanksAverage.worldRank
FROM RanksAverage
INNER JOIN Persons ON RanksAverage.personID=Persons.id
WHERE RanksAverage.eventID='sq1'
AND Persons.subid=1
AND Persons.gender='f'
ORDER BY RanksAverage.best
LIMIT 20 ;
"""
df_top_20_sq1_avg = pd.read_sql(qry_top_20_sq1_avg, con=connection)
display(df_top_20_sq1_avg)
| | id | name | countryId | gender | best | worldRank |
|---|---|---|---|---|---|---|
| 0 | 2016CHEN53 | Anshu Chennuru | USA | f | 668 | 15 |
| 1 | 2013KATO01 | Eva Kato | USA | f | 770 | 52 |
| 2 | 2017MELL02 | Laura Mellier | Switzerland | f | 856 | 93 |
| 3 | 2014CHAN23 | Sophie Chan | USA | f | 911 | 132 |
| 4 | 2017PABI01 | Magdalena Pabisz | Poland | f | 916 | 140 |
| 5 | 2016SINN01 | Eleanor Sinnott | United Kingdom | f | 964 | 179 |
| 6 | 2013KLEI03 | Livia Kleiner | USA | f | 982 | 196 |
| 7 | 2015CALD02 | Kymberlyn Calderon | USA | f | 991 | 205 |
| 8 | 2015JOIN01 | Kyra Joiner | USA | f | 1055 | 267 |
| 9 | 2022THUY01 | Lê Phương Thùy | Vietnam | f | 1068 | 282 |
| 10 | 2011WIJA03 | Angeline Wijaya (黃千儀) | Indonesia | f | 1071 | 287 |
| 11 | 2021WARC01 | Dominika Warchoł | Poland | f | 1093 | 314 |
| 12 | 2015JINX01 | Tina Xiangyan Jin (金香延) | China | f | 1101 | 326 |
| 13 | 2018DILW02 | Lacey Dilworth | USA | f | 1115 | 343 |
| 14 | 2010HULL01 | Katie Hull | USA | f | 1121 | 349 |
| 15 | 2020SIBU02 | Josephine Siburian | Australia | f | 1127 | 359 |
| 16 | 2018GRAH05 | Kate Grahame | USA | f | 1149 | 380 |
| 17 | 2015TRAC01 | Catherine Trace | USA | f | 1183 | 422 |
| 18 | 2015ZHOU02 | Lucinda Zhou | USA | f | 1255 | 533 |
| 19 | 2022BERN01 | Beatriz Hernandes Bernardes | Brazil | f | 1316 | 654 |
How about ranking countries by number of WCA competitions that have occurred in that country in 2022?
with connect(host="localhost", user=mysql_username, password=mysql_password, database="wca") as connection:
qry_2022_country_comps = """
SELECT countryId, COUNT(countryId) as number_of_comps
FROM Competitions
WHERE Competitions.year = 2022
GROUP BY countryId
ORDER BY number_of_comps DESC
LIMIT 20
"""
df_2022_country_comps = pd.read_sql(qry_2022_country_comps, con=connection)
display(df_2022_country_comps)
| | countryId | number_of_comps |
|---|---|---|
| 0 | USA | 280 |
| 1 | Australia | 66 |
| 2 | Poland | 46 |
| 3 | India | 45 |
| 4 | Canada | 32 |
| 5 | United Kingdom | 32 |
| 6 | Brazil | 32 |
| 7 | Spain | 29 |
| 8 | France | 28 |
| 9 | Colombia | 24 |
| 10 | Sweden | 23 |
| 11 | Denmark | 22 |
| 12 | Turkey | 22 |
| 13 | Italy | 21 |
| 14 | Bolivia | 16 |
| 15 | New Zealand | 16 |
| 16 | Indonesia | 14 |
| 17 | Peru | 14 |
| 18 | Germany | 13 |
| 19 | Norway | 13 |
Now we pivot our focus to a different question: how will cubing world records evolve over time? In recent years, thanks to advancements in cube hardware as well as new solving methods, we have seen significant drops in world record times across every event in the WCA. At the WCA's inception, world records were naturally lowered significantly every time they were broken, since there simply weren't that many people competing or interested in the activity. But nowadays, we see records broken by only hundredths of a second, and records standing for years. Will the progress of cubing world records slow and plateau? Or will they keep dropping dramatically as the number of competitors and the interest in the activity increase?
Let's run a query to see the entire history of 3x3 single world records. Besides just the result, we want to know when each world record was set, so we will join the Competitions table (which has data about when the world record was set and at what competition) on the Results table. To query for just world records, we must include Results.regionalSingleRecord = 'WR' in the WHERE clause of our query.
with connect(host="localhost", user=mysql_username, password=mysql_password, database="wca") as connection:
qry_333_singleWRs = """
SELECT Results.personName,
Results.best,
Results.competitionId,
Competitions.year,
Competitions.month,
Competitions.day
FROM Results
INNER JOIN Competitions ON Results.competitionId = Competitions.id
WHERE
Results.regionalSingleRecord = 'WR' AND
Results.eventId = '333'
ORDER BY Results.best DESC
"""
df_333_singleWRs = pd.read_sql(qry_333_singleWRs, con=connection)
display(df_333_singleWRs)
| | personName | best | competitionId | year | month | day |
|---|---|---|---|---|---|---|
| 0 | Minh Thai | 2295 | WC1982 | 1982 | 6 | 5 |
| 1 | Dan Knights | 1671 | WC2003 | 2003 | 8 | 23 |
| 2 | Jess Bonde | 1653 | WC2003 | 2003 | 8 | 23 |
| 3 | Shotaro Makisumi (牧角章太郎) | 1507 | CaltechWinter2004 | 2004 | 1 | 24 |
| 4 | Shotaro Makisumi (牧角章太郎) | 1476 | CaltechWinter2004 | 2004 | 1 | 24 |
| 5 | Shotaro Makisumi (牧角章太郎) | 1393 | CaltechSpring2004 | 2004 | 4 | 3 |
| 6 | Shotaro Makisumi (牧角章太郎) | 1211 | CaltechSpring2004 | 2004 | 4 | 3 |
| 7 | Jean Pons | 1175 | DutchOpen2005 | 2005 | 10 | 16 |
| 8 | Leyan Lo | 1113 | CaltechWinter2006 | 2006 | 1 | 14 |
| 9 | Toby Mao (毛台立) | 1048 | US2006 | 2006 | 8 | 4 |
| 10 | Edouard Chambon | 1036 | BelgianOpen2007 | 2007 | 2 | 24 |
| 11 | Thibaut Jacquinot | 986 | SpanishOpen2007 | 2007 | 5 | 5 |
| 12 | Erik Akkersdijk | 977 | DutchOpen2007 | 2007 | 10 | 13 |
| 13 | Ron van Bruchem | 955 | Netherlands2007 | 2007 | 11 | 24 |
| 14 | Edouard Chambon | 918 | MurciaOpen2008 | 2008 | 2 | 23 |
| 15 | Yu Nakajima (中島悠) | 872 | KashiwaOpen2008 | 2008 | 5 | 5 |
| 16 | Yu Nakajima (中島悠) | 872 | KashiwaOpen2008 | 2008 | 5 | 5 |
| 17 | Erik Akkersdijk | 708 | CzechOpen2008 | 2008 | 7 | 12 |
| 18 | Feliks Zemdegs | 703 | MelbourneCubeDay2010 | 2010 | 11 | 13 |
| 19 | Feliks Zemdegs | 677 | MelbourneCubeDay2010 | 2010 | 11 | 13 |
| 20 | Feliks Zemdegs | 665 | KubarooOpen2011 | 2011 | 5 | 7 |
| 21 | Feliks Zemdegs | 665 | MelbourneSummer2011 | 2011 | 1 | 29 |
| 22 | Feliks Zemdegs | 624 | KubarooOpen2011 | 2011 | 5 | 7 |
| 23 | Feliks Zemdegs | 618 | MelbourneWinterOpen2011 | 2011 | 6 | 25 |
| 24 | Feliks Zemdegs | 566 | MelbourneWinterOpen2011 | 2011 | 6 | 25 |
| 25 | Mats Valk | 555 | ZonhovenOpen2013 | 2013 | 3 | 2 |
| 26 | Collin Burns | 525 | DoylestownSpring2015 | 2015 | 4 | 25 |
| 27 | Lucas Etter | 490 | RiverHillFall2015 | 2015 | 11 | 21 |
| 28 | Mats Valk | 474 | JawaTimurOpen2016 | 2016 | 11 | 5 |
| 29 | Feliks Zemdegs | 473 | POPSOpen2016 | 2016 | 12 | 11 |
| 30 | Patrick Ponce | 469 | RallyInTheValley2017 | 2017 | 9 | 2 |
| 31 | SeungBeom Cho (조승범) | 459 | Chicago2017 | 2017 | 10 | 28 |
| 32 | Feliks Zemdegs | 459 | HobartSummer2018 | 2018 | 1 | 27 |
| 33 | Feliks Zemdegs | 422 | CubeforCambodia2018 | 2018 | 5 | 6 |
| 34 | Yusheng Du (杜宇生) | 347 | WuhuOpen2018 | 2018 | 11 | 24 |
Looks good, but we have some cleaning to do. First, notice that the best column is measured in centiseconds. We want this column to be ever so slightly more human readable, so we will divide each value in this column by 100 to get the time in seconds.
Now we are doing some manipulations with pandas dataframes. If you are unfamiliar with the basics of dataframes and the Python pandas library, here are some recommended readings: 10 minutes to pandas, Pandas Tutorial, Pandas API
df_333_singleWRs["best"] = df_333_singleWRs["best"].apply(lambda x: x/100)
df_333_singleWRs.tail()
| | personName | best | competitionId | year | month | day |
|---|---|---|---|---|---|---|
| 30 | Patrick Ponce | 4.69 | RallyInTheValley2017 | 2017 | 9 | 2 |
| 31 | SeungBeom Cho (조승범) | 4.59 | Chicago2017 | 2017 | 10 | 28 |
| 32 | Feliks Zemdegs | 4.59 | HobartSummer2018 | 2018 | 1 | 27 |
| 33 | Feliks Zemdegs | 4.22 | CubeforCambodia2018 | 2018 | 5 | 6 |
| 34 | Yusheng Du (杜宇生) | 3.47 | WuhuOpen2018 | 2018 | 11 | 24 |
Our goal is to plot this information on a line plot where the x-axis is time and the y-axis is the result. To do so, we need a column representing time that can be easily interpreted by a plotting library like seaborn. Let's combine the year, month, and day columns into a single column called date of datetime objects.
For more information on the Python datetime module, here are some recommended readings: Python docs datetime, PYnative
df_333_singleWRs["date"] = pd.to_datetime(df_333_singleWRs[["year", "month", "day"]])
df_333_singleWRs.tail()
| | personName | best | competitionId | year | month | day | date |
|---|---|---|---|---|---|---|---|
| 30 | Patrick Ponce | 4.69 | RallyInTheValley2017 | 2017 | 9 | 2 | 2017-09-02 |
| 31 | SeungBeom Cho (조승범) | 4.59 | Chicago2017 | 2017 | 10 | 28 | 2017-10-28 |
| 32 | Feliks Zemdegs | 4.59 | HobartSummer2018 | 2018 | 1 | 27 | 2018-01-27 |
| 33 | Feliks Zemdegs | 4.22 | CubeforCambodia2018 | 2018 | 5 | 6 | 2018-05-06 |
| 34 | Yusheng Du (杜宇生) | 3.47 | WuhuOpen2018 | 2018 | 11 | 24 | 2018-11-24 |
If we inspect carefully, there is a slight issue with the way that our dataframe is sorted.
df_333_singleWRs[20:23]
| | personName | best | competitionId | year | month | day | date |
|---|---|---|---|---|---|---|---|
| 20 | Feliks Zemdegs | 6.65 | KubarooOpen2011 | 2011 | 5 | 7 | 2011-05-07 |
| 21 | Feliks Zemdegs | 6.65 | MelbourneSummer2011 | 2011 | 1 | 29 | 2011-01-29 |
| 22 | Feliks Zemdegs | 6.24 | KubarooOpen2011 | 2011 | 5 | 7 | 2011-05-07 |
Our query ordered the dataframe by world record time (the best column), but sometimes the world record gets tied, and the WCA still counts a tied world record as a world record in the database. As we see above, the time of 6.65 seconds set on Jan 29 occurred before the tied world record of 6.65 seconds on May 07, but it appears later in the dataframe. This could cause problems with our last step of data cleaning, so we want to fix it now. To do so, we simply sort our dataframe by date first, then by best (descending) for when date is the same.
df_333_singleWRs = df_333_singleWRs.sort_values(by=["date", "best"], ascending=[True, False])
df_333_singleWRs[20:23]
| | personName | best | competitionId | year | month | day | date |
|---|---|---|---|---|---|---|---|
| 21 | Feliks Zemdegs | 6.65 | MelbourneSummer2011 | 2011 | 1 | 29 | 2011-01-29 |
| 20 | Feliks Zemdegs | 6.65 | KubarooOpen2011 | 2011 | 5 | 7 | 2011-05-07 |
| 22 | Feliks Zemdegs | 6.24 | KubarooOpen2011 | 2011 | 5 | 7 | 2011-05-07 |
Before we can plot, we have one more issue. Notice that near the top of our dataframe, multiple records share the same date. A line plot cannot render these correctly, so we need some way of differentiating (by time of day) world records that happened on the same date.
Note that this issue only occurs with older world records, because the WCA has since changed their policy so that if a world record is broken multiple times on the same day, only the latest one is considered a world record.
df_333_singleWRs.head()
| personName | best | competitionId | year | month | day | date | |
|---|---|---|---|---|---|---|---|
| 0 | Minh Thai | 22.95 | WC1982 | 1982 | 6 | 5 | 1982-06-05 |
| 1 | Dan Knights | 16.71 | WC2003 | 2003 | 8 | 23 | 2003-08-23 |
| 2 | Jess Bonde | 16.53 | WC2003 | 2003 | 8 | 23 | 2003-08-23 |
| 3 | Shotaro Makisumi (牧角章太郎) | 15.07 | CaltechWinter2004 | 2004 | 1 | 24 | 2004-01-24 |
| 4 | Shotaro Makisumi (牧角章太郎) | 14.76 | CaltechWinter2004 | 2004 | 1 | 24 | 2004-01-24 |
In order to differentiate world records that occurred on the same date, we can add a time column for each record. Since the results are now sorted chronologically, a record with a lower position in our dataframe occurred before any record with a higher position. For dates with only one world record, we assign an hour of 00:00. For dates with multiple world records, we can arbitrarily separate the records by time of day (we say "arbitrarily" because the actual time does not matter, only the fact that slower world records occurred before faster ones). Careful inspection of our dataframe reveals that at most two world records were ever set on the same date, so when two world records share a date, we say the slower one occurred at 00:00 and the faster one at 12:00.
hour = []  # new column to add to df
prev_row = df_333_singleWRs.iloc[0]  # previous row of our df (used when iterating)

# iterating through each row
for i, row in df_333_singleWRs.iterrows():
    if i == 0:  # base case: first world record
        hour.append(0)
    elif prev_row["date"] == row["date"]:  # same date as the previous row
        hour.append(hour[-1] + 12)
    else:
        hour.append(0)
    prev_row = row

# add the new column
df_333_singleWRs["hour"] = hour
df_333_singleWRs
| personName | best | competitionId | year | month | day | date | hour | |
|---|---|---|---|---|---|---|---|---|
| 0 | Minh Thai | 22.95 | WC1982 | 1982 | 6 | 5 | 1982-06-05 | 0 |
| 1 | Dan Knights | 16.71 | WC2003 | 2003 | 8 | 23 | 2003-08-23 | 0 |
| 2 | Jess Bonde | 16.53 | WC2003 | 2003 | 8 | 23 | 2003-08-23 | 12 |
| 3 | Shotaro Makisumi (牧角章太郎) | 15.07 | CaltechWinter2004 | 2004 | 1 | 24 | 2004-01-24 | 0 |
| 4 | Shotaro Makisumi (牧角章太郎) | 14.76 | CaltechWinter2004 | 2004 | 1 | 24 | 2004-01-24 | 12 |
| 5 | Shotaro Makisumi (牧角章太郎) | 13.93 | CaltechSpring2004 | 2004 | 4 | 3 | 2004-04-03 | 0 |
| 6 | Shotaro Makisumi (牧角章太郎) | 12.11 | CaltechSpring2004 | 2004 | 4 | 3 | 2004-04-03 | 12 |
| 7 | Jean Pons | 11.75 | DutchOpen2005 | 2005 | 10 | 16 | 2005-10-16 | 0 |
| 8 | Leyan Lo | 11.13 | CaltechWinter2006 | 2006 | 1 | 14 | 2006-01-14 | 0 |
| 9 | Toby Mao (毛台立) | 10.48 | US2006 | 2006 | 8 | 4 | 2006-08-04 | 0 |
| 10 | Edouard Chambon | 10.36 | BelgianOpen2007 | 2007 | 2 | 24 | 2007-02-24 | 0 |
| 11 | Thibaut Jacquinot | 9.86 | SpanishOpen2007 | 2007 | 5 | 5 | 2007-05-05 | 0 |
| 12 | Erik Akkersdijk | 9.77 | DutchOpen2007 | 2007 | 10 | 13 | 2007-10-13 | 0 |
| 13 | Ron van Bruchem | 9.55 | Netherlands2007 | 2007 | 11 | 24 | 2007-11-24 | 0 |
| 14 | Edouard Chambon | 9.18 | MurciaOpen2008 | 2008 | 2 | 23 | 2008-02-23 | 0 |
| 15 | Yu Nakajima (中島悠) | 8.72 | KashiwaOpen2008 | 2008 | 5 | 5 | 2008-05-05 | 0 |
| 16 | Yu Nakajima (中島悠) | 8.72 | KashiwaOpen2008 | 2008 | 5 | 5 | 2008-05-05 | 12 |
| 17 | Erik Akkersdijk | 7.08 | CzechOpen2008 | 2008 | 7 | 12 | 2008-07-12 | 0 |
| 18 | Feliks Zemdegs | 7.03 | MelbourneCubeDay2010 | 2010 | 11 | 13 | 2010-11-13 | 0 |
| 19 | Feliks Zemdegs | 6.77 | MelbourneCubeDay2010 | 2010 | 11 | 13 | 2010-11-13 | 12 |
| 21 | Feliks Zemdegs | 6.65 | MelbourneSummer2011 | 2011 | 1 | 29 | 2011-01-29 | 0 |
| 20 | Feliks Zemdegs | 6.65 | KubarooOpen2011 | 2011 | 5 | 7 | 2011-05-07 | 0 |
| 22 | Feliks Zemdegs | 6.24 | KubarooOpen2011 | 2011 | 5 | 7 | 2011-05-07 | 12 |
| 23 | Feliks Zemdegs | 6.18 | MelbourneWinterOpen2011 | 2011 | 6 | 25 | 2011-06-25 | 0 |
| 24 | Feliks Zemdegs | 5.66 | MelbourneWinterOpen2011 | 2011 | 6 | 25 | 2011-06-25 | 12 |
| 25 | Mats Valk | 5.55 | ZonhovenOpen2013 | 2013 | 3 | 2 | 2013-03-02 | 0 |
| 26 | Collin Burns | 5.25 | DoylestownSpring2015 | 2015 | 4 | 25 | 2015-04-25 | 0 |
| 27 | Lucas Etter | 4.90 | RiverHillFall2015 | 2015 | 11 | 21 | 2015-11-21 | 0 |
| 28 | Mats Valk | 4.74 | JawaTimurOpen2016 | 2016 | 11 | 5 | 2016-11-05 | 0 |
| 29 | Feliks Zemdegs | 4.73 | POPSOpen2016 | 2016 | 12 | 11 | 2016-12-11 | 0 |
| 30 | Patrick Ponce | 4.69 | RallyInTheValley2017 | 2017 | 9 | 2 | 2017-09-02 | 0 |
| 31 | SeungBeom Cho (조승범) | 4.59 | Chicago2017 | 2017 | 10 | 28 | 2017-10-28 | 0 |
| 32 | Feliks Zemdegs | 4.59 | HobartSummer2018 | 2018 | 1 | 27 | 2018-01-27 | 0 |
| 33 | Feliks Zemdegs | 4.22 | CubeforCambodia2018 | 2018 | 5 | 6 | 2018-05-06 | 0 |
| 34 | Yusheng Du (杜宇生) | 3.47 | WuhuOpen2018 | 2018 | 11 | 24 | 2018-11-24 | 0 |
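As an aside, the same hour column can be built without an explicit loop: pandas' `groupby(...).cumcount()` numbers the rows within each date group in order. Here is a minimal sketch on a small stand-in dataframe (it relies on the rows already being sorted chronologically, as ours are):

```python
import pandas as pd

# small stand-in for df_333_singleWRs: two records share 2008-05-05
df = pd.DataFrame({
    "best": [22.95, 8.72, 8.72, 7.08],
    "date": pd.to_datetime(["1982-06-05", "2008-05-05", "2008-05-05", "2008-07-12"]),
})

# cumcount() numbers rows 0, 1, ... within each date group, so the
# first record on a date gets hour 0 and the second gets hour 12
df["hour"] = df.groupby("date").cumcount() * 12

print(df["hour"].tolist())  # [0, 0, 12, 0]
```

This produces the same column as the loop above for our data, since no date has more than two records.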
Now we can construct the full column of datetime objects.
df_333_singleWRs["date"] = pd.to_datetime(df_333_singleWRs[["year", "month", "day", "hour"]])
df_333_singleWRs
| personName | best | competitionId | year | month | day | date | hour | |
|---|---|---|---|---|---|---|---|---|
| 0 | Minh Thai | 22.95 | WC1982 | 1982 | 6 | 5 | 1982-06-05 00:00:00 | 0 |
| 1 | Dan Knights | 16.71 | WC2003 | 2003 | 8 | 23 | 2003-08-23 00:00:00 | 0 |
| 2 | Jess Bonde | 16.53 | WC2003 | 2003 | 8 | 23 | 2003-08-23 12:00:00 | 12 |
| 3 | Shotaro Makisumi (牧角章太郎) | 15.07 | CaltechWinter2004 | 2004 | 1 | 24 | 2004-01-24 00:00:00 | 0 |
| 4 | Shotaro Makisumi (牧角章太郎) | 14.76 | CaltechWinter2004 | 2004 | 1 | 24 | 2004-01-24 12:00:00 | 12 |
| 5 | Shotaro Makisumi (牧角章太郎) | 13.93 | CaltechSpring2004 | 2004 | 4 | 3 | 2004-04-03 00:00:00 | 0 |
| 6 | Shotaro Makisumi (牧角章太郎) | 12.11 | CaltechSpring2004 | 2004 | 4 | 3 | 2004-04-03 12:00:00 | 12 |
| 7 | Jean Pons | 11.75 | DutchOpen2005 | 2005 | 10 | 16 | 2005-10-16 00:00:00 | 0 |
| 8 | Leyan Lo | 11.13 | CaltechWinter2006 | 2006 | 1 | 14 | 2006-01-14 00:00:00 | 0 |
| 9 | Toby Mao (毛台立) | 10.48 | US2006 | 2006 | 8 | 4 | 2006-08-04 00:00:00 | 0 |
| 10 | Edouard Chambon | 10.36 | BelgianOpen2007 | 2007 | 2 | 24 | 2007-02-24 00:00:00 | 0 |
| 11 | Thibaut Jacquinot | 9.86 | SpanishOpen2007 | 2007 | 5 | 5 | 2007-05-05 00:00:00 | 0 |
| 12 | Erik Akkersdijk | 9.77 | DutchOpen2007 | 2007 | 10 | 13 | 2007-10-13 00:00:00 | 0 |
| 13 | Ron van Bruchem | 9.55 | Netherlands2007 | 2007 | 11 | 24 | 2007-11-24 00:00:00 | 0 |
| 14 | Edouard Chambon | 9.18 | MurciaOpen2008 | 2008 | 2 | 23 | 2008-02-23 00:00:00 | 0 |
| 15 | Yu Nakajima (中島悠) | 8.72 | KashiwaOpen2008 | 2008 | 5 | 5 | 2008-05-05 00:00:00 | 0 |
| 16 | Yu Nakajima (中島悠) | 8.72 | KashiwaOpen2008 | 2008 | 5 | 5 | 2008-05-05 12:00:00 | 12 |
| 17 | Erik Akkersdijk | 7.08 | CzechOpen2008 | 2008 | 7 | 12 | 2008-07-12 00:00:00 | 0 |
| 18 | Feliks Zemdegs | 7.03 | MelbourneCubeDay2010 | 2010 | 11 | 13 | 2010-11-13 00:00:00 | 0 |
| 19 | Feliks Zemdegs | 6.77 | MelbourneCubeDay2010 | 2010 | 11 | 13 | 2010-11-13 12:00:00 | 12 |
| 21 | Feliks Zemdegs | 6.65 | MelbourneSummer2011 | 2011 | 1 | 29 | 2011-01-29 00:00:00 | 0 |
| 20 | Feliks Zemdegs | 6.65 | KubarooOpen2011 | 2011 | 5 | 7 | 2011-05-07 00:00:00 | 0 |
| 22 | Feliks Zemdegs | 6.24 | KubarooOpen2011 | 2011 | 5 | 7 | 2011-05-07 12:00:00 | 12 |
| 23 | Feliks Zemdegs | 6.18 | MelbourneWinterOpen2011 | 2011 | 6 | 25 | 2011-06-25 00:00:00 | 0 |
| 24 | Feliks Zemdegs | 5.66 | MelbourneWinterOpen2011 | 2011 | 6 | 25 | 2011-06-25 12:00:00 | 12 |
| 25 | Mats Valk | 5.55 | ZonhovenOpen2013 | 2013 | 3 | 2 | 2013-03-02 00:00:00 | 0 |
| 26 | Collin Burns | 5.25 | DoylestownSpring2015 | 2015 | 4 | 25 | 2015-04-25 00:00:00 | 0 |
| 27 | Lucas Etter | 4.90 | RiverHillFall2015 | 2015 | 11 | 21 | 2015-11-21 00:00:00 | 0 |
| 28 | Mats Valk | 4.74 | JawaTimurOpen2016 | 2016 | 11 | 5 | 2016-11-05 00:00:00 | 0 |
| 29 | Feliks Zemdegs | 4.73 | POPSOpen2016 | 2016 | 12 | 11 | 2016-12-11 00:00:00 | 0 |
| 30 | Patrick Ponce | 4.69 | RallyInTheValley2017 | 2017 | 9 | 2 | 2017-09-02 00:00:00 | 0 |
| 31 | SeungBeom Cho (조승범) | 4.59 | Chicago2017 | 2017 | 10 | 28 | 2017-10-28 00:00:00 | 0 |
| 32 | Feliks Zemdegs | 4.59 | HobartSummer2018 | 2018 | 1 | 27 | 2018-01-27 00:00:00 | 0 |
| 33 | Feliks Zemdegs | 4.22 | CubeforCambodia2018 | 2018 | 5 | 6 | 2018-05-06 00:00:00 | 0 |
| 34 | Yusheng Du (杜宇生) | 3.47 | WuhuOpen2018 | 2018 | 11 | 24 | 2018-11-24 00:00:00 | 0 |
Now we can finally plot the data!
Note we will primarily be using the `seaborn` library to plot our data. This library is quite extensive. For more information, here are some recommended readings: Seaborn introduction, Seaborn API overview
sns.lineplot(data=df_333_singleWRs, y="best", x="date", marker="o")
<Axes: xlabel='date', ylabel='best'>
We can add data point labels to see individual values, but it can get a little bit cluttered.
sns.lineplot(data=df_333_singleWRs, y="best", x="date", marker="o")
for date, best in zip(df_333_singleWRs["date"], df_333_singleWRs["best"]):
    label = f"({date.year}, {best})"
    plt.annotate(label, (date, best), xytext=(3, 3), textcoords="offset points", fontsize=6)
Let's try to fit a model to the data. Let's start off with a simple linear regression.
# convert each date to its ordinal, since LinearRegression cannot take datetime objects
df_333_singleWRs["ordinal"] = df_333_singleWRs["date"].apply(datetime.date.toordinal)
X = df_333_singleWRs[["ordinal"]] # df of dates
y = df_333_singleWRs["best"] # Series of results to match the dates
linreg = LinearRegression().fit(X, y) # fit the linreg model
linreg_prediction = linreg.predict(X) # creating predicted y values using our linreg
df_333_singleWRs["predicted_linreg"] = linreg_prediction # adding predicted values to our df
# plotting
sns.lineplot(data=df_333_singleWRs, y="best", x="date", marker="o")
sns.lineplot(data=df_333_singleWRs, x="date", y=linreg_prediction, color="#87bdff")
<Axes: xlabel='date', ylabel='best'>
The line looks to fit the downward trend of the data, but there is a lot of room for improvement. To analyze how well our model fits, let's make a residual plot.
sns.residplot(data=df_333_singleWRs, x="ordinal", y="best")
<Axes: xlabel='ordinal', ylabel='best'>
Note that ordinal is an integer representation of a `datetime` object (an ordinal represents the date corresponding to the proleptic Gregorian ordinal, where January 1 of year 1 has ordinal 1). Since we are only looking at residuals, the human-interpretability of the x-axis labels is not too important.
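To make the ordinal concrete, here is a quick round trip between dates and ordinals; the 737022 below is the ordinal of the current world record's date, matching the table further down:

```python
import datetime

# January 1 of year 1 has ordinal 1
print(datetime.date(1, 1, 1).toordinal())  # 1

# round trip: date -> ordinal -> date, using the date of the current WR
d = datetime.date(2018, 11, 24)
n = d.toordinal()
print(n)                                  # 737022
print(datetime.date.fromordinal(n) == d)  # True
```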
We can see that this linear regression does not fit very well. A linear regression model makes several assumptions: linearity, homoscedasticity, normality, and independence. A residual plot should therefore show a fairly random, even spread of points, with no discernible trend in variability and roughly normally distributed residuals. In our residual plot, however, there does seem to be a pattern: residuals sit well above 0 in the earliest years, dip below 0 in the middle years, then rise slightly above 0 in later years. The obvious visual explanation is that the data follows a curve rather than a straight line. Another contributing factor could simply be that we do not have many data points.
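The non-randomness the residual plot hints at can be quantified with the Durbin-Watson statistic, which also appears in the OLS summaries below (values near 2 suggest uncorrelated residuals; values near 0 suggest a smooth drifting pattern like ours). A minimal, self-contained sketch of the formula on synthetic residuals; `durbin_watson` here is our own helper, not a library function:

```python
import numpy as np

def durbin_watson(residuals):
    """Durbin-Watson statistic: sum of squared successive differences
    of the residuals divided by their sum of squares (range 0 to 4,
    with ~2 indicating no autocorrelation)."""
    r = np.asarray(residuals, dtype=float)
    return np.sum(np.diff(r) ** 2) / np.sum(r ** 2)

# smoothly drifting residuals (a pattern, like ours) score near 0
print(durbin_watson([3, 2, 1, 0, -1, -2, -3]))  # ~0.214

# sign-alternating residuals (no smooth pattern) score well above 2
print(durbin_watson([1, -1, 1, -1, 1, -1]))     # ~3.33
```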
To quantitatively see the accuracy of our model, let's try running an Ordinary Least Squares regression using the statsmodels library.
ols_model = sm.ols(formula="best ~ ordinal", data=df_333_singleWRs).fit()
print(ols_model.summary())
OLS Regression Results
==============================================================================
Dep. Variable: best R-squared: 0.873
Model: OLS Adj. R-squared: 0.869
Method: Least Squares F-statistic: 227.1
Date: Sat, 13 May 2023 Prob (F-statistic): 2.36e-16
Time: 23:28:11 Log-Likelihood: -65.260
No. Observations: 35 AIC: 134.5
Df Residuals: 33 BIC: 137.6
Df Model: 1
Covariance Type: nonrobust
==============================================================================
coef std err t P>|t| [0.025 0.975]
------------------------------------------------------------------------------
Intercept 1239.0797 81.641 15.177 0.000 1072.980 1405.179
ordinal -0.0017 0.000 -15.069 0.000 -0.002 -0.001
==============================================================================
Omnibus: 7.722 Durbin-Watson: 0.695
Prob(Omnibus): 0.021 Jarque-Bera (JB): 6.267
Skew: 0.945 Prob(JB): 0.0436
Kurtosis: 3.854 Cond. No. 2.20e+08
==============================================================================
Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 2.2e+08. This might indicate that there are
strong multicollinearity or other numerical problems.
Notice that the p-value associated with the time (ordinal) coefficient is less than 0.05 (P>|t| = 0.000), which indicates that the relationship between time and world record single is statistically significant at the 95% confidence level. This is of course trivially true because over time the world record can only get lower and lower. Later on, we will explore a hypothesis that is not so obviously true.
Here we see that the R-squared value for our OLS model is 0.873 and our log-likelihood value is -65.260.
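If you want the p-value programmatically rather than reading it off the summary table, scipy's `linregress` reports the two-sided p-value for the null hypothesis of zero slope. A sketch on synthetic data (not the WR dataframe), since the point here is just the API shape:

```python
from scipy import stats

# synthetic "world records": strictly decreasing over time, so the
# slope is clearly negative and the p-value far below 0.05
days = [0, 100, 200, 300, 400, 500]
times = [22.9, 18.1, 15.0, 11.2, 8.7, 6.5]

result = stats.linregress(days, times)
print(result.slope < 0)      # True
print(result.pvalue < 0.05)  # True: reject the null of zero slope
```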
Let's try a different model to see if we can get a better fit.
sns.regplot(data=df_333_singleWRs, x="ordinal", y="best", order=3, ci=None)
# changing x-axis tick labels from ordinals to year
ax = plt.gca() # get current ax
xticks = ax.get_xticks() # get xticks
ax.set_xticklabels([datetime.date.fromordinal(int(x)).year for x in xticks]) # convert each ordinal to a year
ax.set_xlabel("year") # changing x-axis label
Text(0.5, 0, 'year')
Above we are using seaborn's regplot with order=3, which fits a polynomial regression of order 3. Visually, this appears to fit our data much better. Let's see how the residual plot looks.
sns.residplot(data=df_333_singleWRs, x="ordinal", y="best", order=3)
<Axes: xlabel='ordinal', ylabel='best'>
Much better! Here we see residuals seemingly randomly scattered, and with more residuals clustered around 0. This indicates a pretty good fit for our linear regression with polynomial terms model!
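As an aside, the coefficients behind a polynomial fit like this can be obtained directly with `numpy.polyfit`, which returns them from highest to lowest degree. A sketch on synthetic points that lie exactly on a known cubic (seaborn's internal fitting may differ in details):

```python
import numpy as np

# synthetic points lying exactly on y = 2x^3 - x + 5
x = np.array([-2.0, -1.0, 0.0, 1.0, 2.0, 3.0])
y = 2 * x**3 - x + 5

coeffs = np.polyfit(x, y, 3)    # highest degree first: ~[2, 0, -1, 5]
fitted = np.polyval(coeffs, x)  # evaluate the fitted polynomial at x

print(np.round(coeffs, 6))
print(np.allclose(fitted, y))   # True
```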
If we look back at our polynomial with order 3, we can see how the line curves upwards at the end despite decreasing y-values. How about we change it to order 4?
sns.regplot(data=df_333_singleWRs, x="ordinal", y="best", order=4, ci=None)
# changing x-axis tick labels from ordinals to year
ax = plt.gca() # get current ax
xticks = ax.get_xticks() # get xticks
ax.set_xticklabels([datetime.date.fromordinal(int(x)).year for x in xticks]) # convert each ordinal to a year
ax.set_xlabel("year") # changing x-axis label
Text(0.5, 0, 'year')
sns.residplot(data=df_333_singleWRs, x="ordinal", y="best", order=4)
<Axes: xlabel='ordinal', ylabel='best'>
The residual plot looks relatively similar to the one with order=3. Let's run OLS models for these polynomials.
print("Polynomial of order 3")
ols_model_poly3 = sm.ols(formula="best ~ ordinal + I(ordinal**2) + I(ordinal**3)", data=df_333_singleWRs).fit()
print(ols_model_poly3.summary())
print("\n\n")
print("Polynomial of order 4")
ols_model_poly4 = sm.ols(formula="best ~ ordinal + I(ordinal**2) + I(ordinal**3) + I(ordinal**4)", data=df_333_singleWRs).fit()
print(ols_model_poly4.summary())
Polynomial of order 3
OLS Regression Results
==============================================================================
Dep. Variable: best R-squared: 0.875
Model: OLS Adj. R-squared: 0.871
Method: Least Squares F-statistic: 231.1
Date: Sat, 13 May 2023 Prob (F-statistic): 1.83e-16
Time: 23:37:20 Log-Likelihood: -64.991
No. Observations: 35 AIC: 134.0
Df Residuals: 33 BIC: 137.1
Df Model: 1
Covariance Type: nonrobust
===================================================================================
coef std err t P>|t| [0.025 0.975]
-----------------------------------------------------------------------------------
Intercept 1.324e-20 8.53e-22 15.528 0.000 1.15e-20 1.5e-20
ordinal 6.456e-15 4.16e-16 15.528 0.000 5.61e-15 7.3e-15
I(ordinal ** 2) 2.362e-09 1.52e-10 15.528 0.000 2.05e-09 2.67e-09
I(ordinal ** 3) -3.198e-15 2.07e-16 -15.420 0.000 -3.62e-15 -2.78e-15
==============================================================================
Omnibus: 7.381 Durbin-Watson: 0.663
Prob(Omnibus): 0.025 Jarque-Bera (JB): 5.938
Skew: 0.936 Prob(JB): 0.0514
Kurtosis: 3.755 Cond. No. 5.86e+24
==============================================================================
Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The smallest eigenvalue is 1.59e-13. This might indicate that there are
strong multicollinearity problems or that the design matrix is singular.
Polynomial of order 4
OLS Regression Results
==============================================================================
Dep. Variable: best R-squared: 0.898
Model: OLS Adj. R-squared: 0.892
Method: Least Squares F-statistic: 140.8
Date: Sat, 13 May 2023 Prob (F-statistic): 1.38e-16
Time: 23:37:20 Log-Likelihood: -61.447
No. Observations: 35 AIC: 128.9
Df Residuals: 32 BIC: 133.6
Df Model: 2
Covariance Type: nonrobust
===================================================================================
coef std err t P>|t| [0.025 0.975]
-----------------------------------------------------------------------------------
Intercept 1.226e-20 8.64e-22 14.187 0.000 1.05e-20 1.4e-20
ordinal 5.999e-15 4.23e-16 14.187 0.000 5.14e-15 6.86e-15
I(ordinal ** 2) 2.187e-09 1.54e-10 14.187 0.000 1.87e-09 2.5e-09
I(ordinal ** 3) -2.959e-15 2.1e-16 -14.081 0.000 -3.39e-15 -2.53e-15
I(ordinal ** 4) 1.152e-19 4.3e-20 2.680 0.012 2.76e-20 2.03e-19
==============================================================================
Omnibus: 3.361 Durbin-Watson: 1.072
Prob(Omnibus): 0.186 Jarque-Bera (JB): 2.044
Skew: 0.485 Prob(JB): 0.360
Kurtosis: 3.679 Cond. No. 1.08e+23
==============================================================================
Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 1.08e+23. This might indicate that there are
strong multicollinearity or other numerical problems.
We have R-squared values of 0.875 and 0.898 for the OLS models with polynomials of order 3 and 4 respectively. These are both marginally higher than the R-squared value of 0.873 for our original linear OLS model. Additionally, we have log-likelihood values of -64.991 and -61.447 for the order-3 and order-4 polynomials respectively, both of which are better than our original value of -65.260.
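The AIC values reported in these summaries follow directly from the log-likelihoods via AIC = 2k - 2 ln(L), where k is the number of estimated parameters. A quick check against the linear model's summary (k = 2: intercept and slope):

```python
def aic(k, log_likelihood):
    # Akaike Information Criterion: 2k - 2 ln(L)
    return 2 * k - 2 * log_likelihood

# linear OLS model above: log-likelihood -65.260, k = 2 (intercept, slope)
print(round(aic(2, -65.260), 1))  # 134.5, matching the summary's AIC
```

Lower AIC is better, and it penalizes extra parameters, which is why it is often preferred over raw log-likelihood when comparing models of different orders.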
Above we stated that rejecting the null hypothesis was trivially expected when considering whether time has a statistically significant impact on the world record 3x3 single (because the world record can only get faster with time). But what about the influence of time on world record improvement? That is, if we define the reduction factor of a single world record x to be the factor by which the world record immediately preceding x is multiplied to get x, how does the reduction factor change with time?
Answering this question can help give us more information on the trends of cubing world records, and subsequently cubing as a whole. Are new advancements in methods and hardware making a significant difference in the activity or have we reached the point where world records are starting to plateau, and improvement becomes harder and harder?
Let's create a new dataframe based on our previous one. This time, we will include the reduction factor for each world record 3x3 single (remember our definition of reduction factor from above). Any reduction factor will lie on the interval (0,1], with 1 being a tied world record, and values will get closer to 0 the more drastic the drop from the previous world record is. We will also have to ignore the first row from our new dataframe, since there is no reduction factor for the first ever world record (there is no previous world record to compare it to).
df_333_singleWRs_reduction = df_333_singleWRs.copy() # creating a copy of our original df
reduction = [] # list of reduction factors
prev_row = df_333_singleWRs.iloc[0] # prev row of our df (used when iterating)
# iterating through df
for i, row in df_333_singleWRs_reduction.iterrows():
    if i == 0:
        reduction.append(0.0)  # placeholder: the first WR has no predecessor
    else:
        reduction.append(row["best"] / prev_row["best"])  # reduction factor = current WR / previous WR
    prev_row = row
# adding new column
df_333_singleWRs_reduction["reduction"] = reduction
# rearranging columns and dropping ones we don't need
df_333_singleWRs_reduction = df_333_singleWRs_reduction[["personName", "best", "reduction", "competitionId", "date", "ordinal"]]
# dropping the first row since it's not relevant (N/A reduction factor since it's the first WR)
df_333_singleWRs_reduction = df_333_singleWRs_reduction.drop(index=0)
# reindex the df
df_333_singleWRs_reduction.reset_index(drop=True, inplace=True)
df_333_singleWRs_reduction
| personName | best | reduction | competitionId | date | ordinal | |
|---|---|---|---|---|---|---|
| 0 | Dan Knights | 16.71 | 0.728105 | WC2003 | 2003-08-23 00:00:00 | 731450 |
| 1 | Jess Bonde | 16.53 | 0.989228 | WC2003 | 2003-08-23 12:00:00 | 731450 |
| 2 | Shotaro Makisumi (牧角章太郎) | 15.07 | 0.911676 | CaltechWinter2004 | 2004-01-24 00:00:00 | 731604 |
| 3 | Shotaro Makisumi (牧角章太郎) | 14.76 | 0.979429 | CaltechWinter2004 | 2004-01-24 12:00:00 | 731604 |
| 4 | Shotaro Makisumi (牧角章太郎) | 13.93 | 0.943767 | CaltechSpring2004 | 2004-04-03 00:00:00 | 731674 |
| 5 | Shotaro Makisumi (牧角章太郎) | 12.11 | 0.869347 | CaltechSpring2004 | 2004-04-03 12:00:00 | 731674 |
| 6 | Jean Pons | 11.75 | 0.970273 | DutchOpen2005 | 2005-10-16 00:00:00 | 732235 |
| 7 | Leyan Lo | 11.13 | 0.947234 | CaltechWinter2006 | 2006-01-14 00:00:00 | 732325 |
| 8 | Toby Mao (毛台立) | 10.48 | 0.941599 | US2006 | 2006-08-04 00:00:00 | 732527 |
| 9 | Edouard Chambon | 10.36 | 0.988550 | BelgianOpen2007 | 2007-02-24 00:00:00 | 732731 |
| 10 | Thibaut Jacquinot | 9.86 | 0.951737 | SpanishOpen2007 | 2007-05-05 00:00:00 | 732801 |
| 11 | Erik Akkersdijk | 9.77 | 0.990872 | DutchOpen2007 | 2007-10-13 00:00:00 | 732962 |
| 12 | Ron van Bruchem | 9.55 | 0.977482 | Netherlands2007 | 2007-11-24 00:00:00 | 733004 |
| 13 | Edouard Chambon | 9.18 | 0.961257 | MurciaOpen2008 | 2008-02-23 00:00:00 | 733095 |
| 14 | Yu Nakajima (中島悠) | 8.72 | 0.949891 | KashiwaOpen2008 | 2008-05-05 00:00:00 | 733167 |
| 15 | Yu Nakajima (中島悠) | 8.72 | 1.000000 | KashiwaOpen2008 | 2008-05-05 12:00:00 | 733167 |
| 16 | Erik Akkersdijk | 7.08 | 0.811927 | CzechOpen2008 | 2008-07-12 00:00:00 | 733235 |
| 17 | Feliks Zemdegs | 7.03 | 0.992938 | MelbourneCubeDay2010 | 2010-11-13 00:00:00 | 734089 |
| 18 | Feliks Zemdegs | 6.77 | 0.963016 | MelbourneCubeDay2010 | 2010-11-13 12:00:00 | 734089 |
| 19 | Feliks Zemdegs | 6.65 | 0.982275 | MelbourneSummer2011 | 2011-01-29 00:00:00 | 734166 |
| 20 | Feliks Zemdegs | 6.65 | 1.000000 | KubarooOpen2011 | 2011-05-07 00:00:00 | 734264 |
| 21 | Feliks Zemdegs | 6.24 | 0.938346 | KubarooOpen2011 | 2011-05-07 12:00:00 | 734264 |
| 22 | Feliks Zemdegs | 6.18 | 0.990385 | MelbourneWinterOpen2011 | 2011-06-25 00:00:00 | 734313 |
| 23 | Feliks Zemdegs | 5.66 | 0.915858 | MelbourneWinterOpen2011 | 2011-06-25 12:00:00 | 734313 |
| 24 | Mats Valk | 5.55 | 0.980565 | ZonhovenOpen2013 | 2013-03-02 00:00:00 | 734929 |
| 25 | Collin Burns | 5.25 | 0.945946 | DoylestownSpring2015 | 2015-04-25 00:00:00 | 735713 |
| 26 | Lucas Etter | 4.90 | 0.933333 | RiverHillFall2015 | 2015-11-21 00:00:00 | 735923 |
| 27 | Mats Valk | 4.74 | 0.967347 | JawaTimurOpen2016 | 2016-11-05 00:00:00 | 736273 |
| 28 | Feliks Zemdegs | 4.73 | 0.997890 | POPSOpen2016 | 2016-12-11 00:00:00 | 736309 |
| 29 | Patrick Ponce | 4.69 | 0.991543 | RallyInTheValley2017 | 2017-09-02 00:00:00 | 736574 |
| 30 | SeungBeom Cho (조승범) | 4.59 | 0.978678 | Chicago2017 | 2017-10-28 00:00:00 | 736630 |
| 31 | Feliks Zemdegs | 4.59 | 1.000000 | HobartSummer2018 | 2018-01-27 00:00:00 | 736721 |
| 32 | Feliks Zemdegs | 4.22 | 0.919390 | CubeforCambodia2018 | 2018-05-06 00:00:00 | 736820 |
| 33 | Yusheng Du (杜宇生) | 3.47 | 0.822275 | WuhuOpen2018 | 2018-11-24 00:00:00 | 737022 |
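Incidentally, the reduction column can also be computed without a loop, by dividing the best column by a shifted copy of itself. A sketch on a stand-in of the first few WR times (the values reproduce the first rows of the table above):

```python
import pandas as pd

# stand-in for the first few WR times
df = pd.DataFrame({"best": [22.95, 16.71, 16.53, 15.07]})

# each WR divided by the WR immediately before it; the first
# row has no predecessor, so shift() leaves a NaN we drop
df["reduction"] = df["best"] / df["best"].shift(1)
df = df.dropna().reset_index(drop=True)

print(df["reduction"].round(6).tolist())  # [0.728105, 0.989228, 0.911676]
```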
Now let's plot our dataframe, and see how the reduction factor of 3x3 world record singles has changed over time.
sns.lineplot(data=df_333_singleWRs_reduction, x="date", y="reduction", marker="o")
<Axes: xlabel='date', ylabel='reduction'>
From visual inspection, there is no strong and obvious correlation amongst the data. Reduction factors seem to be clustered in the 0.95-1 range, with a few in the 0.90-0.95 range, and a few outliers as well. Let's plot a linear regression line to fit the data.
Before we can do linear regressions with this dataframe, let's sort out the ordinal column. The seaborn library does not play nicely with regression calculations on Python datetime objects, so we converted each date to its ordinal, i.e. integer, representation. However, this reintroduces our problem of overlapping results when a record is broken twice on the same day. We therefore apply the same logic as before (when we added 12 hours to differentiate same-date records), but this time adding 0.5 to differentiate any two identical ordinals.
ordinal_time = [] # new column to add to df
prev_row = df_333_singleWRs_reduction.iloc[0] # prev row of our df (used when iterating)
# iterating through each row
for i, row in df_333_singleWRs_reduction.iterrows():
    if i == 0:  # base case
        ordinal_time.append(row["ordinal"])
    elif prev_row["ordinal"] == row["ordinal"]:  # same ordinal as the previous row
        ordinal_time.append(row["ordinal"] + 0.5)
    else:
        ordinal_time.append(row["ordinal"])
    prev_row = row
# add the new column
df_333_singleWRs_reduction["ordinal"] = ordinal_time
df_333_singleWRs_reduction
| personName | best | reduction | competitionId | date | ordinal | |
|---|---|---|---|---|---|---|
| 0 | Dan Knights | 16.71 | 0.728105 | WC2003 | 2003-08-23 00:00:00 | 731450.0 |
| 1 | Jess Bonde | 16.53 | 0.989228 | WC2003 | 2003-08-23 12:00:00 | 731450.5 |
| 2 | Shotaro Makisumi (牧角章太郎) | 15.07 | 0.911676 | CaltechWinter2004 | 2004-01-24 00:00:00 | 731604.0 |
| 3 | Shotaro Makisumi (牧角章太郎) | 14.76 | 0.979429 | CaltechWinter2004 | 2004-01-24 12:00:00 | 731604.5 |
| 4 | Shotaro Makisumi (牧角章太郎) | 13.93 | 0.943767 | CaltechSpring2004 | 2004-04-03 00:00:00 | 731674.0 |
| 5 | Shotaro Makisumi (牧角章太郎) | 12.11 | 0.869347 | CaltechSpring2004 | 2004-04-03 12:00:00 | 731674.5 |
| 6 | Jean Pons | 11.75 | 0.970273 | DutchOpen2005 | 2005-10-16 00:00:00 | 732235.0 |
| 7 | Leyan Lo | 11.13 | 0.947234 | CaltechWinter2006 | 2006-01-14 00:00:00 | 732325.0 |
| 8 | Toby Mao (毛台立) | 10.48 | 0.941599 | US2006 | 2006-08-04 00:00:00 | 732527.0 |
| 9 | Edouard Chambon | 10.36 | 0.988550 | BelgianOpen2007 | 2007-02-24 00:00:00 | 732731.0 |
| 10 | Thibaut Jacquinot | 9.86 | 0.951737 | SpanishOpen2007 | 2007-05-05 00:00:00 | 732801.0 |
| 11 | Erik Akkersdijk | 9.77 | 0.990872 | DutchOpen2007 | 2007-10-13 00:00:00 | 732962.0 |
| 12 | Ron van Bruchem | 9.55 | 0.977482 | Netherlands2007 | 2007-11-24 00:00:00 | 733004.0 |
| 13 | Edouard Chambon | 9.18 | 0.961257 | MurciaOpen2008 | 2008-02-23 00:00:00 | 733095.0 |
| 14 | Yu Nakajima (中島悠) | 8.72 | 0.949891 | KashiwaOpen2008 | 2008-05-05 00:00:00 | 733167.0 |
| 15 | Yu Nakajima (中島悠) | 8.72 | 1.000000 | KashiwaOpen2008 | 2008-05-05 12:00:00 | 733167.5 |
| 16 | Erik Akkersdijk | 7.08 | 0.811927 | CzechOpen2008 | 2008-07-12 00:00:00 | 733235.0 |
| 17 | Feliks Zemdegs | 7.03 | 0.992938 | MelbourneCubeDay2010 | 2010-11-13 00:00:00 | 734089.0 |
| 18 | Feliks Zemdegs | 6.77 | 0.963016 | MelbourneCubeDay2010 | 2010-11-13 12:00:00 | 734089.5 |
| 19 | Feliks Zemdegs | 6.65 | 0.982275 | MelbourneSummer2011 | 2011-01-29 00:00:00 | 734166.0 |
| 20 | Feliks Zemdegs | 6.65 | 1.000000 | KubarooOpen2011 | 2011-05-07 00:00:00 | 734264.0 |
| 21 | Feliks Zemdegs | 6.24 | 0.938346 | KubarooOpen2011 | 2011-05-07 12:00:00 | 734264.5 |
| 22 | Feliks Zemdegs | 6.18 | 0.990385 | MelbourneWinterOpen2011 | 2011-06-25 00:00:00 | 734313.0 |
| 23 | Feliks Zemdegs | 5.66 | 0.915858 | MelbourneWinterOpen2011 | 2011-06-25 12:00:00 | 734313.5 |
| 24 | Mats Valk | 5.55 | 0.980565 | ZonhovenOpen2013 | 2013-03-02 00:00:00 | 734929.0 |
| 25 | Collin Burns | 5.25 | 0.945946 | DoylestownSpring2015 | 2015-04-25 00:00:00 | 735713.0 |
| 26 | Lucas Etter | 4.90 | 0.933333 | RiverHillFall2015 | 2015-11-21 00:00:00 | 735923.0 |
| 27 | Mats Valk | 4.74 | 0.967347 | JawaTimurOpen2016 | 2016-11-05 00:00:00 | 736273.0 |
| 28 | Feliks Zemdegs | 4.73 | 0.997890 | POPSOpen2016 | 2016-12-11 00:00:00 | 736309.0 |
| 29 | Patrick Ponce | 4.69 | 0.991543 | RallyInTheValley2017 | 2017-09-02 00:00:00 | 736574.0 |
| 30 | SeungBeom Cho (조승범) | 4.59 | 0.978678 | Chicago2017 | 2017-10-28 00:00:00 | 736630.0 |
| 31 | Feliks Zemdegs | 4.59 | 1.000000 | HobartSummer2018 | 2018-01-27 00:00:00 | 736721.0 |
| 32 | Feliks Zemdegs | 4.22 | 0.919390 | CubeforCambodia2018 | 2018-05-06 00:00:00 | 736820.0 |
| 33 | Yusheng Du (杜宇生) | 3.47 | 0.822275 | WuhuOpen2018 | 2018-11-24 00:00:00 | 737022.0 |
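As with the hour column earlier, this tie-breaking can also be vectorized with `groupby(...).cumcount()`. A sketch on a stand-in dataframe, with values taken from the table above:

```python
import pandas as pd

# stand-in: two records share ordinal 733167 (2008-05-05)
df = pd.DataFrame({"ordinal": [733095, 733167, 733167, 733235]})

# add 0.5 to the second occurrence of any duplicated ordinal
df["ordinal"] = df["ordinal"] + df.groupby("ordinal").cumcount() * 0.5

print(df["ordinal"].tolist())  # [733095.0, 733167.0, 733167.5, 733235.0]
```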
Now we can run a linear regression on our data.
sns.regplot(data=df_333_singleWRs_reduction, x="ordinal", y="reduction", ci=None)
# changing x-axis tick labels from ordinals to year
ax = plt.gca() # get current ax
xticks = ax.get_xticks() # get xticks
ax.set_xticklabels([datetime.date.fromordinal(int(x)).year for x in xticks]) # convert each ordinal to a year
ax.set_xlabel("year") # changing x-axis label
Text(0.5, 0, 'year')
sns.regplot(data=df_333_singleWRs_reduction, x="ordinal", y="reduction", order=4, ci=None)
# changing x-axis tick labels from ordinals to year
ax = plt.gca() # get current ax
xticks = ax.get_xticks() # get xticks
ax.set_xticklabels([datetime.date.fromordinal(int(x)).year for x in xticks]) # convert each ordinal to a year
ax.set_xlabel("year") # changing x-axis label
Text(0.5, 0, 'year')
Once again, the line does not seem to fit the data too well. Without any obvious correlation in the data, we would be hard-pressed to find a visually good-fitting model. Let's run an OLS regression like before.
reduction_ols_model = sm.ols(formula="reduction ~ ordinal", data=df_333_singleWRs_reduction).fit()
print(reduction_ols_model.summary())
OLS Regression Results
==============================================================================
Dep. Variable: reduction R-squared: 0.028
Model: OLS Adj. R-squared: -0.002
Method: Least Squares F-statistic: 0.9353
Date: Sat, 13 May 2023 Prob (F-statistic): 0.341
Time: 23:27:56 Log-Likelihood: 48.207
No. Observations: 34 AIC: -92.41
Df Residuals: 32 BIC: -89.36
Df Model: 1
Covariance Type: nonrobust
==============================================================================
coef std err t P>|t| [0.025 0.975]
------------------------------------------------------------------------------
Intercept -3.1887 4.278 -0.745 0.461 -11.902 5.524
ordinal 5.637e-06 5.83e-06 0.967 0.341 -6.24e-06 1.75e-05
==============================================================================
Omnibus: 24.687 Durbin-Watson: 1.784
Prob(Omnibus): 0.000 Jarque-Bera (JB): 38.589
Skew: -1.937 Prob(JB): 4.17e-09
Kurtosis: 6.498 Cond. No. 3.03e+08
==============================================================================
Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 3.03e+08. This might indicate that there are
strong multicollinearity or other numerical problems.
Notice that the p-value associated with the time (ordinal) coefficient is greater than 0.05 (P>|t| = 0.341), which indicates that the relationship between time and reduction factor is not statistically significant at the 95% confidence level. We therefore fail to reject the null hypothesis that time and reduction factor are uncorrelated. The R-squared value for this model is also just 0.028, far lower than the 0.873 we obtained for our OLS model of 3x3 single world records.
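For a simple one-predictor regression like this, the t-test on the slope is equivalent to a test of the Pearson correlation between the two variables, so the OLS result can be cross-checked directly. A quick sanity check might look like the following (a sketch on synthetic stand-in data, since it mirrors but does not reproduce the reduction dataset):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
# synthetic stand-ins for the 34 ordinal dates and reduction factors
ordinals = np.sort(rng.integers(731000, 738000, size=34))
reductions = rng.normal(loc=0.95, scale=0.05, size=34)  # no built-in trend

# pearsonr returns the correlation r and the two-sided p-value;
# r**2 equals the OLS R-squared, and p matches the slope's P>|t|
r, p = stats.pearsonr(ordinals, reductions)
print(f"r = {r:.3f}, p = {p:.3f}")
```

On the real reduction data, this would reproduce R-squared = 0.028 (as r² ≈ 0.028) and p = 0.341.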
with connect(host="localhost", user=mysql_username, password=mysql_password, database="wca") as connection:
    tymon_id = '2016KOLA02'
    qry_tymon_333avgs = f"""
        SELECT Results.competitionId,
               Results.roundTypeId,
               Results.average,
               Competitions.year,
               Competitions.month,
               Competitions.day
        FROM Results
        INNER JOIN Persons ON Results.personId = Persons.id
        INNER JOIN Competitions ON Results.competitionId = Competitions.id
        WHERE Results.personId = '{tymon_id}'
          AND Persons.subid = 1
          AND Results.eventId = '333'
          AND Results.average > 0
        ORDER BY year ASC, month ASC, day ASC;
    """
    df_tymon_333avgs = pd.read_sql(qry_tymon_333avgs, con=connection)
display(df_tymon_333avgs)
# with pd.option_context('display.max_rows', None,):
#     display(df_tymon_333avgs)
| | competitionId | roundTypeId | average | year | month | day |
|---|---|---|---|---|---|---|
| 0 | WLSWiosna2016 | 1 | 2752 | 2016 | 5 | 8 |
| 1 | LodzCubingSummer2016 | 1 | 2437 | 2016 | 6 | 4 |
| 2 | WLSLato2016 | 1 | 2133 | 2016 | 7 | 9 |
| 3 | Euro2016 | 1 | 2049 | 2016 | 7 | 15 |
| 4 | IIMasovianOpenPlock2016 | 2 | 1809 | 2016 | 8 | 27 |
| ... | ... | ... | ... | ... | ... | ... |
| 318 | PBsinPalisades2023 | 2 | 558 | 2023 | 4 | 15 |
| 319 | PBsinPalisades2023 | 1 | 518 | 2023 | 4 | 15 |
| 320 | CapeCod2023 | f | 582 | 2023 | 4 | 22 |
| 321 | CapeCod2023 | 2 | 555 | 2023 | 4 | 22 |
| 322 | CapeCod2023 | 1 | 560 | 2023 | 4 | 22 |
323 rows × 6 columns
df_tymon_333avgs_groupby_comp = df_tymon_333avgs.groupby("competitionId")
modified_groups = []
for competition, group in df_tymon_333avgs_groupby_comp:
    group = group.copy()  # work on a copy to avoid SettingWithCopyWarning
    # spread a competition's rounds evenly across its day with synthetic hours;
    # truncate in case range() yields more steps than there are rounds
    step = max(1, 24 // len(group))
    hours = list(range(0, 24, step))[:len(group)]
    hours.reverse()  # rows list later rounds first, so they get later hours
    group["hour"] = hours
    modified_groups.append(group)
df_tymon_333avgs = pd.concat(modified_groups)
df_tymon_333avgs = df_tymon_333avgs.sort_values(by=["year", "month", "day", "hour"])
display(df_tymon_333avgs)
# with pd.option_context('display.max_rows', None,):
# display(df_tymon_333avgs)
| | competitionId | roundTypeId | average | year | month | day | hour |
|---|---|---|---|---|---|---|---|
| 0 | WLSWiosna2016 | 1 | 2752 | 2016 | 5 | 8 | 0 |
| 1 | LodzCubingSummer2016 | 1 | 2437 | 2016 | 6 | 4 | 0 |
| 2 | WLSLato2016 | 1 | 2133 | 2016 | 7 | 9 | 0 |
| 3 | Euro2016 | 1 | 2049 | 2016 | 7 | 15 | 0 |
| 5 | IIMasovianOpenPlock2016 | 1 | 1662 | 2016 | 8 | 27 | 0 |
| ... | ... | ... | ... | ... | ... | ... | ... |
| 317 | PBsinPalisades2023 | 3 | 560 | 2023 | 4 | 15 | 12 |
| 316 | PBsinPalisades2023 | f | 527 | 2023 | 4 | 15 | 18 |
| 322 | CapeCod2023 | 1 | 560 | 2023 | 4 | 22 | 0 |
| 321 | CapeCod2023 | 2 | 555 | 2023 | 4 | 22 | 8 |
| 320 | CapeCod2023 | f | 582 | 2023 | 4 | 22 | 16 |
323 rows × 7 columns
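The same-day tie-breaking trick above can be isolated: when several rows share a date, spacing them with synthetic hours produces distinct, correctly ordered timestamps. A minimal self-contained sketch (toy data, using the same column names as the tutorial):

```python
import pandas as pd

# toy results: competition A has three rounds on one day, B has one round
df = pd.DataFrame({
    "competitionId": ["A", "A", "A", "B"],
    "roundTypeId": ["f", "2", "1", "1"],
    "year": [2023, 2023, 2023, 2023],
    "month": [4, 4, 4, 5],
    "day": [22, 22, 22, 6],
})

groups = []
for _, group in df.groupby("competitionId"):
    group = group.copy()
    # evenly spaced synthetic hours, truncated to the number of rounds
    step = max(1, 24 // len(group))
    hours = list(range(0, 24, step))[:len(group)]
    hours.reverse()  # first-listed row (the final) gets the latest hour
    group["hour"] = hours
    groups.append(group)

df = pd.concat(groups).sort_values(["year", "month", "day", "hour"])
# pd.to_datetime assembles timestamps from year/month/day/hour columns
df["date"] = pd.to_datetime(df[["year", "month", "day", "hour"]])
print(df[["competitionId", "roundTypeId", "date"]])
```

After sorting, competition A's rounds come out in chronological order (1, 2, f) with unique timestamps, which is exactly what the line plot below needs.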
df_tymon_333avgs["date"] = pd.to_datetime(df_tymon_333avgs[["year", "month", "day", "hour"]])
df_tymon_333avgs
| | competitionId | roundTypeId | average | year | month | day | hour | date |
|---|---|---|---|---|---|---|---|---|
| 0 | WLSWiosna2016 | 1 | 2752 | 2016 | 5 | 8 | 0 | 2016-05-08 00:00:00 |
| 1 | LodzCubingSummer2016 | 1 | 2437 | 2016 | 6 | 4 | 0 | 2016-06-04 00:00:00 |
| 2 | WLSLato2016 | 1 | 2133 | 2016 | 7 | 9 | 0 | 2016-07-09 00:00:00 |
| 3 | Euro2016 | 1 | 2049 | 2016 | 7 | 15 | 0 | 2016-07-15 00:00:00 |
| 5 | IIMasovianOpenPlock2016 | 1 | 1662 | 2016 | 8 | 27 | 0 | 2016-08-27 00:00:00 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 317 | PBsinPalisades2023 | 3 | 560 | 2023 | 4 | 15 | 12 | 2023-04-15 12:00:00 |
| 316 | PBsinPalisades2023 | f | 527 | 2023 | 4 | 15 | 18 | 2023-04-15 18:00:00 |
| 322 | CapeCod2023 | 1 | 560 | 2023 | 4 | 22 | 0 | 2023-04-22 00:00:00 |
| 321 | CapeCod2023 | 2 | 555 | 2023 | 4 | 22 | 8 | 2023-04-22 08:00:00 |
| 320 | CapeCod2023 | f | 582 | 2023 | 4 | 22 | 16 | 2023-04-22 16:00:00 |
323 rows × 8 columns
sns.lineplot(data=df_tymon_333avgs, x="date", y="average")
# data preprocessing: scale the averages into [0, 1]
# (MinMaxScaler comes from sklearn.preprocessing)
scaler = MinMaxScaler()
data = scaler.fit_transform(df_tymon_333avgs["average"].values.reshape(-1, 1))
# splitting data into training and validation sets
training_size = int(len(data) * 0.8)
training_data = data[:training_size, :]
validation_data = data[training_size:, :]
# print(training_data)
# print(validation_data)
# defining the RNN model architecture
n_steps = 3 # number of time steps to consider for each input
n_features = 1 # number of features in each input
model = Sequential()
model.add(LSTM(50, activation='relu', input_shape=(n_steps, n_features)))
model.add(Dense(1))
model.compile(optimizer='adam', loss='mse')
# training the RNN model (commented out; requires a create_dataset helper
# that windows the series into (X, y) pairs)
# X_train, y_train = create_dataset(training_data, n_steps)
# X_val, y_val = create_dataset(validation_data, n_steps)
# model.fit(X_train, y_train, epochs=100, batch_size=32, validation_data=(X_val, y_val))
We hope the reader has gained new insight into the world of competitive Rubik's Cube solving, learned more about the data science pipeline, and gained more experience with data analysis in Python. We have seen several different models for our questions regarding world record changes over time. Although machines can predict the answers to our questions with increasing accuracy, we are still constantly surprised by real-world results that defy machine-derived expectations (true for cubing and the world in general). Cubing is still a relatively "young" activity, and the datasets available to answer many of our questions are still relatively small. It will be interesting to see how the exponential growth of cubing leads to more and more data, and how that data can in turn sharpen our models and predictions for the future.