Analysis of Competitive Rubik's Cube Solving

Liam Chen - CMSC320 Final Tutorial

[Image of a Rubik's Cube]

Introduction

The Rubik's Cube is a puzzle that has captured the attention and excitement of people around the world for decades. Once seen as a nearly impossible task, solving the Rubik's Cube has evolved over time into a fun challenge that, after dedicated practice, can be completed in mere seconds. There is a whole world of competitive Rubik's Cube solvers who compete to see who can solve a variety of puzzles, in a variety of ways, the fastest.

In this tutorial, we will explore and analyze data from the world of competitive Rubik's Cube solving. Through this data analysis, we aim to gain insight into questions about the current state and future of cubing as an activity.

This tutorial uses some terminology that may be unfamiliar. The terms will make more sense in the context they are given, but here is a basic glossary:

  • WCA - aka the World Cube Association, the governing body that organizes, oversees, and regulates Rubik's Cube competitions all over the world
  • cubing - the activity/hobby/sport of solving Rubik's cubes
  • 3x3 - the standard 3x3x3 Rubik's Cube
  • event - aside from the standard 3x3, there are many different events in which competitors compete, including 4x4, 3x3 blindfolded, and many more
  • solve - a single attempt at solving a Rubik's Cube (e.g. "that solve was really fast!")
  • single - the time of one solve, usually used in contrast to average (e.g. "I got a 9.50 single today!")
  • average - the arithmetic mean of the middle 3 solves out of 5 (i.e. out of 5 solves, the fastest and slowest are disregarded and the rest are averaged, which helps eliminate luck as a factor). In some less common contexts, average refers to the plain arithmetic mean of all solves.
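To make the single and average definitions concrete, here is a minimal sketch (plain Python, with made-up solve times) of how both are computed from a set of five solves:

```python
def wca_average(times):
    """Mean of the middle three of five solves: the fastest
    and slowest solves are dropped before averaging."""
    assert len(times) == 5
    middle_three = sorted(times)[1:4]
    return round(sum(middle_three) / 3, 2)

solves = [9.50, 11.23, 10.87, 14.02, 10.11]  # five solve times in seconds
single = min(solves)           # best single: 9.50
average = wca_average(solves)  # mean of 10.11, 10.87, 11.23 -> 10.74
```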

Data Collection & Setup

The World Cube Association conveniently provides an exportable SQL file of the entire database. The most recent export can be obtained on this page; however, this tutorial uses the May 04 2023 6:02 GMT export of the database (download it here). Because the WCA database updates frequently, it was necessary to settle on a single version in order to facilitate development of this tutorial.

Due to GitHub's restrictions on file size, we were not able to include the WCA database in this repository. If you want to follow along with this tutorial exactly, you must download the database from the link provided above.

Aside from the required Python libraries, this tutorial also requires MySQL Server in order to follow along. This tutorial will not detail how to get MySQL Server installed and running; the process is fairly straightforward and there are many online guides that can help.

Clone this repository here, and navigate into it. Install all the required Python libraries by running the cell below, or by running the command (without the % symbol) in your terminal.

This setup portion assumes basic familiarity with the command line and git repositories. For more information, check out the recommended readings: First steps with git, Git Basics, Command line crash course, Command line basics

In [ ]:
%pip install -r requirements.txt
In [160]:
# imports
import datetime
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from getpass import getpass
from mysql.connector import connect
from sklearn import svm
from sklearn.linear_model import LinearRegression, LogisticRegression
import statsmodels.formula.api as sm
from keras.models import Sequential
from keras.layers import Dense, LSTM
from sklearn.preprocessing import MinMaxScaler

# suppressing the "pandas only supports SQLAlchemy" warning
import warnings
warnings.simplefilter(action='ignore', category=UserWarning)
# full warning: 
# """ UserWarning: pandas only supports SQLAlchemy connectable (engine/connection) or database string URI or sqlite3 DBAPI2 connection. 
# Other DBAPI2 objects are not tested. Please consider using SQLAlchemy. """

Run the cell below and input your MySQL Server username and password (they are kept in memory until the kernel is stopped). Depending on how you configured MySQL Server, you may not need a username or a password. If this is the case, you can press enter twice to skip both fields.

Most users can run mysql -u root in their terminal to log into MySQL Server. If you can do this successfully, that means that you can press enter twice to skip both fields after running the cell below.

In [4]:
mysql_username = input("Enter username: ")
mysql_password = getpass("Enter password: ")

Now we can establish a connection with MySQL Server using the mysql-connector-python library. First we create a connection object and use it to create a cursor object, then we execute a query to create an empty database called wca. Lastly, we close the cursor and the connection.

In [19]:
# establish connection with MySQL Server
connection = connect(
    host="localhost",
    user=mysql_username,
    password=mysql_password
)
qry_create_db = "CREATE DATABASE wca"   # query to create an empty database called "wca"
cursor = connection.cursor()            # create the cursor
cursor.execute(qry_create_db)           # execute the query
cursor.close()                          # close the cursor
connection.close()                      # close the connection

Now, we use our downloaded SQL file to actually populate the database. If you have set up MySQL Server in the default configuration and are using the same SQL file as I am, run the following cell (or run the command in your terminal, without the ! symbol).

If you have configured a specific MySQL Server username/password, or if you are using a different version of the SQL file downloaded from the WCA, then run the following command in your terminal:

mysql -u username -p wca < filename.sql

replacing username with your MySQL username and replacing filename.sql with the [path to the] SQL file you want to use. The -p flag will prompt you for your password; to supply it inline instead, attach it directly to the flag with no space (e.g. -pYourPassword).

Note that the WCA database is fairly large, so don't be alarmed if this command takes a few minutes to execute.

In [20]:
!mysql -u root wca < WCA_export192_20230504T040001Z/WCA_export.sql

Queries

Now we can finally get started! Let's run a basic query on the database: we want to see the top fifty 3x3 solvers in the world.

We construct the appropriate SQL query, establish the connection to the database, and use the pandas function read_sql in order to execute this query and retrieve the result as a dataframe.

In [50]:
# establish the connection to our database "wca"
connection = connect(host="localhost", user=mysql_username, password=mysql_password, database="wca")

qry_top_50 = """
                SELECT * 
                FROM RanksSingle 
                WHERE eventID='333'     
                LIMIT 50 ;
            """

# execute the query and store the result in a dataframe
df_top_50 = pd.read_sql(qry_top_50, con=connection)
display(df_top_50)

connection.close() # close the connection
personId eventId best worldRank continentRank countryRank
0 2015DUYU01 333 347 1 1 1
1 2012PARK03 333 363 2 1 1
2 2016JONE04 333 388 3 1 1
3 2017KIMM01 333 389 4 2 2
4 2019WANY36 333 390 5 2 2
5 2017GARR05 333 395 6 3 3
6 2016KOLA02 333 397 7 1 1
7 2017SIAU02 333 403 8 4 4
8 2017XURU04 333 406 9 3 3
9 2016INAB01 333 413 10 5 5
10 2009ZEMD01 333 416 11 2 2
11 2016DEXT01 333 419 12 3 3
12 2012PONC02 333 424 13 6 6
13 2016SANT08 333 426 14 7 1
14 2015BORR01 333 431 15 4 1
15 2010WEYE02 333 432 16 2 1
16 2016SATO01 333 437 17 1 1
17 2015SANC11 333 438 18 8 7
18 2018DULL01 333 438 18 3 1
19 2019REDI02 333 438 18 8 7
20 2016SHEL03 333 442 21 4 1
21 2015FUSH01 333 442 21 5 1
22 2015MOHA10 333 444 23 10 9
23 2014SEBA01 333 444 23 5 1
24 2015MILL01 333 448 25 11 10
25 2015MACK06 333 448 25 11 10
26 2016LINB01 333 450 27 13 2
27 2021ZHAN01 333 451 28 6 4
28 2015LARS04 333 453 29 6 1
29 2015GRIE02 333 454 30 14 12
30 2010KIPA01 333 459 31 7 2
31 2014MILL04 333 459 31 7 1
32 2012CHOS01 333 459 31 7 1
33 2010WANG68 333 462 34 15 3
34 2015DELA05 333 463 35 9 1
35 2013BALI01 333 464 36 16 13
36 2022NUNE01 333 465 37 2 1
37 2018OPAC01 333 466 38 10 3
38 2012TYCK01 333 468 39 17 14
39 2013JOHN10 333 468 39 17 14
40 2015CHER07 333 468 39 17 14
41 2018KHAN28 333 471 42 20 17
42 2014BONA02 333 472 43 3 1
43 2014XUZI01 333 472 43 8 5
44 2017WONG01 333 473 45 9 2
45 2007VALK01 333 474 46 11 2
46 2016YENC01 333 475 47 21 18
47 2019SICH01 333 475 47 21 18
48 2012LEWI01 333 475 47 21 18
49 2010BRAD01 333 476 50 24 21

Interpreting the above results, we can see the top fifty 3x3 solvers in the world, ranked by their best single. We can see each competitor's personId (aka their WCA ID, a unique identifier for every WCA competitor), their best single result (measured in centiseconds), and their world, continental, and national rankings (all based on their best single time). The eventId of 333 means that these results are for the 3x3 Rubik's Cube event. We can see that the best 3x3 single result ever is 3.47 seconds (stored as 347), held by 2015DUYU01.
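Since results throughout the database are stored as integer centiseconds, a tiny helper (our own, not part of the WCA export) illustrates the conversion to a readable seconds string:

```python
def centiseconds_to_str(cs):
    """Format a WCA result stored as integer centiseconds,
    e.g. 347 -> '3.47' (seconds)."""
    return f"{cs / 100:.2f}"

centiseconds_to_str(347)   # -> '3.47', the world record single above
centiseconds_to_str(1671)  # -> '16.71'
```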

Seeing the results is great, but with only the WCA ID of each competitor, we don't know much about them. What if we want to see their names, what country they are from, and their gender?

In [51]:
# using the Python `with` statement for easier resource management
with connect(host="localhost", user=mysql_username, password=mysql_password, database="wca") as connection:
    qry_top_50_extra = """
                    SELECT Persons.id, Persons.name, Persons.countryId, Persons.gender, RanksSingle.best, RanksSingle.worldRank
                    FROM RanksSingle
                    INNER JOIN Persons ON RanksSingle.personID=Persons.id
                    WHERE RanksSingle.eventID='333'
                    ORDER BY RanksSingle.best
                    LIMIT 50 ;
                """

    df_top_50_extra = pd.read_sql(qry_top_50_extra, con=connection)
    display(df_top_50_extra)
id name countryId gender best worldRank
0 2015DUYU01 Yusheng Du (杜宇生) China m 347 1
1 2012PARK03 Max Park USA m 363 2
2 2016JONE04 Jode Brewster Australia m 388 3
3 2017KIMM01 Asher Kim-Magierek USA m 389 4
4 2019WANY36 Yiheng Wang (王艺衡) China m 390 5
5 2017GARR05 Luke Garrett USA m 395 6
6 2016KOLA02 Tymon Kolasiński Poland m 397 7
7 2017SIAU02 Max Siauw USA m 403 8
8 2017XURU04 Ruihang Xu (许瑞航) China m 406 9
9 2016INAB01 Matty Hiroto Inaba USA m 413 10
10 2009ZEMD01 Feliks Zemdegs Australia m 416 11
11 2016DEXT01 Riley Dexter Australia m 419 12
12 2012PONC02 Patrick Ponce USA m 424 13
13 2016SANT08 Kyle Santucci Canada m 426 14
14 2015BORR01 Leo Borromeo Philippines m 431 15
15 2010WEYE02 Sebastian Weyer Germany m 432 16
16 2016SATO01 Caio Hideaki Sato Brazil m 437 17
17 2018DULL01 Twan Dullemond Netherlands m 438 18
18 2015SANC11 Nicolás Sánchez USA m 438 18
19 2019REDI02 Dominic Redisi USA m 438 18
20 2015FUSH01 Firstian Fushada (符逢城) Indonesia m 442 21
21 2016SHEL03 Lukas Shelley Denmark m 442 21
22 2014SEBA01 Juliette Sébastien France f 444 23
23 2015MOHA10 Varun Mohanraj USA m 444 23
24 2015MILL01 Dylan Miller USA m 448 25
25 2015MACK06 Zeke Mackay USA m 448 25
26 2016LINB01 Brennen Lin Canada m 450 27
27 2021ZHAN01 Bofan Zhang (张博藩) China m 451 28
28 2015LARS04 Kim Roger Høyland Larsen Norway m 453 29
29 2015LARS04 Kim Roger Haraldsen Norway m 453 29
30 2015GRIE02 Luke Griesser USA m 454 30
31 2014MILL04 Chris Mills United Kingdom m 459 31
32 2010KIPA01 Jakub Kipa Poland m 459 31
33 2012CHOS01 SeungBeom Cho (조승범) Korea m 459 31
34 2010WANG68 Bill Wang Canada m 462 34
35 2015DELA05 Richard Delacoste Switzerland m 463 35
36 2013BALI01 Tanzer Balimtas USA m 464 36
37 2022NUNE01 Robert Yomi Cadenas Nuñez Peru m 465 37
38 2018OPAC01 Kajetan Opach Poland m 466 38
39 2013JOHN10 Brian Johnson USA m 468 39
40 2015CHER07 Tommy Cherry USA m 468 39
41 2012TYCK01 Luke Tycksen USA m 468 39
42 2018KHAN28 Zayn Khanani USA m 471 42
43 2014BONA02 Bautista Bonazzola Argentina m 472 43
44 2014XUZI01 Zibo Xu (徐子博) China m 472 43
45 2017WONG01 Brenton Angelo Lo Wong Philippines m 473 45
46 2007VALK01 Mats Valk Netherlands m 474 46
47 2016YENC01 Christopher Yen USA m 475 47
48 2012LEWI01 Phillip Lewicki USA m 475 47
49 2019SICH01 Oliver Michael Sitja Sichel USA m 475 47

Looks pretty good! If we compare this dataframe (df_top_50_extra) with the previous one (df_top_50), we notice some slight discrepancies. Notice how df_top_50 ends with 2010BRAD01's result of 4.76 seconds, but df_top_50_extra ends with 2019SICH01's result of 4.75 seconds. This would suggest that we have an extra value in our current dataframe (df_top_50_extra) which is pushing everything down (thus some slower results are pushed out of the top 50).

Why does this discrepancy exist when all we are doing is adding extra information like competitor name and country? If we look closely at df_top_50_extra, we can see an anomaly...

In [44]:
display(df_top_50_extra[28:30])
id name countryId gender best worldRank
28 2015LARS04 Kim Roger Høyland Larsen Norway m 453 29
29 2015LARS04 Kim Roger Haraldsen Norway m 453 29

We have a duplicate! These two records have similar names, as well as the exact same countryId, gender, best, and worldRank. This would all be fine (albeit suspicious and unlikely) if not for the fact that the id field is the same. WCA IDs are supposed to be unique identifiers for competitors, and we queried for the top fifty 3x3 competitors in the world, so no competitor should appear twice in our dataframe.

The reason this happens is that the WCA allows competitors to change certain data about themselves, such as when a competitor moves countries or legally changes their name. When such a change happens, the competitor's record in the Persons table is not updated; rather, a new record is created with the altered data. We can see this by querying the Persons table for 2015LARS04:

In [52]:
with connect(host="localhost", user=mysql_username, password=mysql_password, database="wca") as connection:
    qry_2015lars04 = """
                    SELECT * 
                    FROM Persons
                    WHERE id = '2015LARS04' ;
                """
    df_2015lars04 = pd.read_sql(qry_2015lars04, con=connection)
    display(df_2015lars04)
id subid name countryId gender
0 2015LARS04 1 Kim Roger Haraldsen Norway m
1 2015LARS04 2 Kim Roger Høyland Larsen Norway m

Here we see two records with the same id of 2015LARS04. name is the field that differs. As mentioned previously, this represents the same competitor, before and after a legal name change. The subid field is important here, because it indicates that their Persons record has been changed before. The current version is represented by a subid of 1.
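As an aside, if such duplicates had already made it into a dataframe, the same fix could be applied on the pandas side. A minimal sketch, using a toy stand-in for the Persons table (values taken from the output above):

```python
import pandas as pd

# toy stand-in for the Persons table
persons = pd.DataFrame({
    "id":    ["2015LARS04", "2015LARS04", "2012PARK03"],
    "subid": [1, 2, 1],
    "name":  ["Kim Roger Haraldsen", "Kim Roger Høyland Larsen", "Max Park"],
})

# keep only the current (subid == 1) record for each competitor
current = persons[persons["subid"] == 1]
```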

Knowing this, we can alter our query for the top fifty 3x3 solvers (including name, country, and gender) to disallow duplicate WCA IDs. We can do this by adding Persons.subid=1 to the WHERE clause.

In [58]:
with connect(host="localhost", user=mysql_username, password=mysql_password, database="wca") as connection:
    qry_top_50_extra = """
                    SELECT Persons.id, Persons.name, Persons.countryId, Persons.gender, RanksSingle.best, RanksSingle.worldRank
                    FROM RanksSingle
                    INNER JOIN Persons ON RanksSingle.personID=Persons.id
                    WHERE RanksSingle.eventID='333' 
                        AND Persons.subid=1
                    ORDER BY RanksSingle.best
                    LIMIT 50 ;
                """

    df_top_50_extra = pd.read_sql(qry_top_50_extra, con=connection)
    display(df_top_50_extra)
id name countryId gender best worldRank
0 2015DUYU01 Yusheng Du (杜宇生) China m 347 1
1 2012PARK03 Max Park USA m 363 2
2 2016JONE04 Jode Brewster Australia m 388 3
3 2017KIMM01 Asher Kim-Magierek USA m 389 4
4 2019WANY36 Yiheng Wang (王艺衡) China m 390 5
5 2017GARR05 Luke Garrett USA m 395 6
6 2016KOLA02 Tymon Kolasiński Poland m 397 7
7 2017SIAU02 Max Siauw USA m 403 8
8 2017XURU04 Ruihang Xu (许瑞航) China m 406 9
9 2016INAB01 Matty Hiroto Inaba USA m 413 10
10 2009ZEMD01 Feliks Zemdegs Australia m 416 11
11 2016DEXT01 Riley Dexter Australia m 419 12
12 2012PONC02 Patrick Ponce USA m 424 13
13 2016SANT08 Kyle Santucci Canada m 426 14
14 2015BORR01 Leo Borromeo Philippines m 431 15
15 2010WEYE02 Sebastian Weyer Germany m 432 16
16 2016SATO01 Caio Hideaki Sato Brazil m 437 17
17 2018DULL01 Twan Dullemond Netherlands m 438 18
18 2015SANC11 Nicolás Sánchez USA m 438 18
19 2019REDI02 Dominic Redisi USA m 438 18
20 2015FUSH01 Firstian Fushada (符逢城) Indonesia m 442 21
21 2016SHEL03 Lukas Shelley Denmark m 442 21
22 2014SEBA01 Juliette Sébastien France f 444 23
23 2015MOHA10 Varun Mohanraj USA m 444 23
24 2015MILL01 Dylan Miller USA m 448 25
25 2015MACK06 Zeke Mackay USA m 448 25
26 2016LINB01 Brennen Lin Canada m 450 27
27 2021ZHAN01 Bofan Zhang (张博藩) China m 451 28
28 2015LARS04 Kim Roger Haraldsen Norway m 453 29
29 2015GRIE02 Luke Griesser USA m 454 30
30 2012CHOS01 SeungBeom Cho (조승범) Korea m 459 31
31 2014MILL04 Chris Mills United Kingdom m 459 31
32 2010KIPA01 Jakub Kipa Poland m 459 31
33 2010WANG68 Bill Wang Canada m 462 34
34 2015DELA05 Richard Delacoste Switzerland m 463 35
35 2013BALI01 Tanzer Balimtas USA m 464 36
36 2022NUNE01 Robert Yomi Cadenas Nuñez Peru m 465 37
37 2018OPAC01 Kajetan Opach Poland m 466 38
38 2012TYCK01 Luke Tycksen USA m 468 39
39 2013JOHN10 Brian Johnson USA m 468 39
40 2015CHER07 Tommy Cherry USA m 468 39
41 2018KHAN28 Zayn Khanani USA m 471 42
42 2014XUZI01 Zibo Xu (徐子博) China m 472 43
43 2014BONA02 Bautista Bonazzola Argentina m 472 43
44 2017WONG01 Brenton Angelo Lo Wong Philippines m 473 45
45 2007VALK01 Mats Valk Netherlands m 474 46
46 2016YENC01 Christopher Yen USA m 475 47
47 2019SICH01 Oliver Michael Sitja Sichel USA m 475 47
48 2012LEWI01 Phillip Lewicki USA m 475 47
49 2010BRAD01 Drew Brads USA m 476 50

No more duplicates!
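We can also confirm this programmatically with pandas' duplicated; a sketch on a toy frame shaped like df_top_50_extra:

```python
import pandas as pd

# toy frame standing in for df_top_50_extra after the subid filter
df = pd.DataFrame({
    "id":   ["2015DUYU01", "2012PARK03", "2015LARS04"],
    "best": [347, 363, 453],
})

# duplicated().any() is True if any WCA ID appears more than once
has_duplicates = df["id"].duplicated().any()  # False here
```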

Let's run some more involved queries to get more familiar with the database schema. What if we wanted to see the top rankings in other events, such as the 4x4 cube, ranked by competitors' best average of five solves?

In [63]:
with connect(host="localhost", user=mysql_username, password=mysql_password, database="wca") as connection:
    # querying the RanksAverage table instead of RanksSingle table 
    # in order to get competitor's best WCA average of five solves in a round
    qry_top_20_444_avg = """
                    SELECT Persons.id, Persons.name, Persons.countryId, Persons.gender, RanksAverage.best, RanksAverage.worldRank
                    FROM RanksAverage
                    INNER JOIN Persons ON RanksAverage.personID=Persons.id
                    WHERE RanksAverage.eventID='444' 
                        AND Persons.subid=1
                    ORDER BY RanksAverage.best
                    LIMIT 20 ;
                """
    df_top_20_444_avg = pd.read_sql(qry_top_20_444_avg, con=connection)
    display(df_top_20_444_avg)
id name countryId gender best worldRank
0 2012PARK03 Max Park USA m 1938 1
1 2010WEYE02 Sebastian Weyer Germany m 2146 2
2 2009ZEMD01 Feliks Zemdegs Australia m 2157 3
3 2013NAHM01 Seung Hyuk Nahm (남승혁) Korea m 2167 4
4 2016KOLA02 Tymon Kolasiński Poland m 2171 5
5 2016INAB01 Matty Hiroto Inaba USA m 2254 6
6 2015WANG09 Kai-Wen Wang (王楷文) Taiwan m 2261 7
7 2012PONC02 Patrick Ponce USA m 2307 8
8 2016NUNE11 Brandon Nunez USA m 2361 9
9 2017RAND06 Ari Randers-Pehrson USA m 2367 10
10 2012BEAH01 Ciarán Beahan Ireland m 2377 11
11 2015BORR01 Leo Borromeo Philippines m 2377 11
12 2017XURU04 Ruihang Xu (许瑞航) China m 2386 13
13 2019HUNG16 Đỗ Quang Hưng Vietnam m 2394 14
14 2010WANG68 Bill Wang Canada m 2397 15
15 2017SIAU02 Max Siauw USA m 2398 16
16 2015DELA05 Richard Delacoste Switzerland m 2440 17
17 2010WUIF01 Yi-Fan Wu (吳亦凡) Taiwan m 2487 18
18 2015FUSH01 Firstian Fushada (符逢城) Indonesia m 2509 19
19 2017GOLU01 Theo Goluboff Argentina m 2526 20

How about the top twenty female rankings by average in the Square-1 event?

In [66]:
with connect(host="localhost", user=mysql_username, password=mysql_password, database="wca") as connection:
    qry_top_20_sq1_avg = """
                    SELECT Persons.id, Persons.name, Persons.countryId, Persons.gender, RanksAverage.best, RanksAverage.worldRank
                    FROM RanksAverage
                    INNER JOIN Persons ON RanksAverage.personID=Persons.id
                    WHERE RanksAverage.eventID='sq1' 
                        AND Persons.subid=1
                        AND Persons.gender='f'
                    ORDER BY RanksAverage.best
                    LIMIT 20 ;
                """
    df_top_20_sq1_avg = pd.read_sql(qry_top_20_sq1_avg, con=connection)
    display(df_top_20_sq1_avg)
id name countryId gender best worldRank
0 2016CHEN53 Anshu Chennuru USA f 668 15
1 2013KATO01 Eva Kato USA f 770 52
2 2017MELL02 Laura Mellier Switzerland f 856 93
3 2014CHAN23 Sophie Chan USA f 911 132
4 2017PABI01 Magdalena Pabisz Poland f 916 140
5 2016SINN01 Eleanor Sinnott United Kingdom f 964 179
6 2013KLEI03 Livia Kleiner USA f 982 196
7 2015CALD02 Kymberlyn Calderon USA f 991 205
8 2015JOIN01 Kyra Joiner USA f 1055 267
9 2022THUY01 Lê Phương Thùy Vietnam f 1068 282
10 2011WIJA03 Angeline Wijaya (黃千儀) Indonesia f 1071 287
11 2021WARC01 Dominika Warchoł Poland f 1093 314
12 2015JINX01 Tina Xiangyan Jin (金香延) China f 1101 326
13 2018DILW02 Lacey Dilworth USA f 1115 343
14 2010HULL01 Katie Hull USA f 1121 349
15 2020SIBU02 Josephine Siburian Australia f 1127 359
16 2018GRAH05 Kate Grahame USA f 1149 380
17 2015TRAC01 Catherine Trace USA f 1183 422
18 2015ZHOU02 Lucinda Zhou USA f 1255 533
19 2022BERN01 Beatriz Hernandes Bernardes Brazil f 1316 654

How about ranking countries by number of WCA competitions that have occurred in that country in 2022?

In [91]:
with connect(host="localhost", user=mysql_username, password=mysql_password, database="wca") as connection:
    qry_2022_country_comps = """
                        SELECT countryId, COUNT(countryId) as number_of_comps
                        FROM Competitions
                        WHERE Competitions.year = 2022
                        GROUP BY countryId
                        ORDER BY number_of_comps DESC
                        LIMIT 20
                    """
    df_2022_country_comps = pd.read_sql(qry_2022_country_comps, con=connection)
    display(df_2022_country_comps)
countryId number_of_comps
0 USA 280
1 Australia 66
2 Poland 46
3 India 45
4 Canada 32
5 United Kingdom 32
6 Brazil 32
7 Spain 29
8 France 28
9 Colombia 24
10 Sweden 23
11 Denmark 22
12 Turkey 22
13 Italy 21
14 Bolivia 16
15 New Zealand 16
16 Indonesia 14
17 Peru 14
18 Germany 13
19 Norway 13

The Future of World Records

Now we pivot our focus to a different question: how will cubing world records evolve over time? In recent years, thanks to many advancements in cube hardware as well as new methods of solving, we have seen significant drops in world record times across every single event in the WCA. At the WCA's inception, world records were understandably lowered significantly every time they were broken, since there simply weren't that many people competing or interested in the activity. But nowadays, we see records being broken by only hundredths of a second, and records standing for years. Will the progress of cubing world records slow and plateau? Or will they keep being lowered dramatically as the number of competitors and interest in the activity grow?

Let's run a query to see the entire history of 3x3 single world records. Besides the result itself, we want to know when each world record was set, so we will join the Results table with the Competitions table (which has data about when each world record was set and at what competition). To query for just world records, we must include Results.regionalSingleRecord = 'WR' in the WHERE clause of our query.

In [6]:
with connect(host="localhost", user=mysql_username, password=mysql_password, database="wca") as connection:
    qry_333_singleWRs = """
                        SELECT Results.personName, 
                            Results.best,
                            Results.competitionId, 
                            Competitions.year,
                            Competitions.month,
                            Competitions.day
                        FROM Results
                        INNER JOIN Competitions ON Results.competitionId = Competitions.id
                        WHERE
                            Results.regionalSingleRecord = 'WR' AND
                            Results.eventId = '333'
                        ORDER BY Results.best DESC
                    """
    df_333_singleWRs = pd.read_sql(qry_333_singleWRs, con=connection)
    display(df_333_singleWRs)
personName best competitionId year month day
0 Minh Thai 2295 WC1982 1982 6 5
1 Dan Knights 1671 WC2003 2003 8 23
2 Jess Bonde 1653 WC2003 2003 8 23
3 Shotaro Makisumi (牧角章太郎) 1507 CaltechWinter2004 2004 1 24
4 Shotaro Makisumi (牧角章太郎) 1476 CaltechWinter2004 2004 1 24
5 Shotaro Makisumi (牧角章太郎) 1393 CaltechSpring2004 2004 4 3
6 Shotaro Makisumi (牧角章太郎) 1211 CaltechSpring2004 2004 4 3
7 Jean Pons 1175 DutchOpen2005 2005 10 16
8 Leyan Lo 1113 CaltechWinter2006 2006 1 14
9 Toby Mao (毛台立) 1048 US2006 2006 8 4
10 Edouard Chambon 1036 BelgianOpen2007 2007 2 24
11 Thibaut Jacquinot 986 SpanishOpen2007 2007 5 5
12 Erik Akkersdijk 977 DutchOpen2007 2007 10 13
13 Ron van Bruchem 955 Netherlands2007 2007 11 24
14 Edouard Chambon 918 MurciaOpen2008 2008 2 23
15 Yu Nakajima (中島悠) 872 KashiwaOpen2008 2008 5 5
16 Yu Nakajima (中島悠) 872 KashiwaOpen2008 2008 5 5
17 Erik Akkersdijk 708 CzechOpen2008 2008 7 12
18 Feliks Zemdegs 703 MelbourneCubeDay2010 2010 11 13
19 Feliks Zemdegs 677 MelbourneCubeDay2010 2010 11 13
20 Feliks Zemdegs 665 KubarooOpen2011 2011 5 7
21 Feliks Zemdegs 665 MelbourneSummer2011 2011 1 29
22 Feliks Zemdegs 624 KubarooOpen2011 2011 5 7
23 Feliks Zemdegs 618 MelbourneWinterOpen2011 2011 6 25
24 Feliks Zemdegs 566 MelbourneWinterOpen2011 2011 6 25
25 Mats Valk 555 ZonhovenOpen2013 2013 3 2
26 Collin Burns 525 DoylestownSpring2015 2015 4 25
27 Lucas Etter 490 RiverHillFall2015 2015 11 21
28 Mats Valk 474 JawaTimurOpen2016 2016 11 5
29 Feliks Zemdegs 473 POPSOpen2016 2016 12 11
30 Patrick Ponce 469 RallyInTheValley2017 2017 9 2
31 SeungBeom Cho (조승범) 459 Chicago2017 2017 10 28
32 Feliks Zemdegs 459 HobartSummer2018 2018 1 27
33 Feliks Zemdegs 422 CubeforCambodia2018 2018 5 6
34 Yusheng Du (杜宇生) 347 WuhuOpen2018 2018 11 24

Looks good, but we have some cleaning to do. First, notice that the best column is measured in centiseconds. We want this column to be ever so slightly more human readable, so we will divide each value in this column by 100 to get the time in seconds.

Now we are doing some manipulations with Pandas dataframes. If you are unfamiliar with the basics of dataframes and the Python pandas library, here are some recommended readings: 10 minutes to pandas, Pandas Tutorial, Pandas API

In [7]:
df_333_singleWRs["best"] = df_333_singleWRs["best"] / 100   # centiseconds -> seconds
df_333_singleWRs.tail()
Out[7]:
personName best competitionId year month day
30 Patrick Ponce 4.69 RallyInTheValley2017 2017 9 2
31 SeungBeom Cho (조승범) 4.59 Chicago2017 2017 10 28
32 Feliks Zemdegs 4.59 HobartSummer2018 2018 1 27
33 Feliks Zemdegs 4.22 CubeforCambodia2018 2018 5 6
34 Yusheng Du (杜宇生) 3.47 WuhuOpen2018 2018 11 24

Our goal is to plot this information on a line plot where the x-axis is time and the y-axis is the result. In order to do so, we must have a column representing time that can be easily interpreted by a plotting library like seaborn. Let's combine the year, month, and day columns into a single column called date of datetime objects.

For more information on the Python datetime module, here are some recommended readings: Python docs datetime, PYnative

In [8]:
df_333_singleWRs["date"] = pd.to_datetime(df_333_singleWRs[["year", "month", "day"]])
df_333_singleWRs.tail()
Out[8]:
personName best competitionId year month day date
30 Patrick Ponce 4.69 RallyInTheValley2017 2017 9 2 2017-09-02
31 SeungBeom Cho (조승범) 4.59 Chicago2017 2017 10 28 2017-10-28
32 Feliks Zemdegs 4.59 HobartSummer2018 2018 1 27 2018-01-27
33 Feliks Zemdegs 4.22 CubeforCambodia2018 2018 5 6 2018-05-06
34 Yusheng Du (杜宇生) 3.47 WuhuOpen2018 2018 11 24 2018-11-24

If we inspect carefully, there is a slight issue with the way that our dataframe is sorted.

In [9]:
df_333_singleWRs[20:23]
Out[9]:
personName best competitionId year month day date
20 Feliks Zemdegs 6.65 KubarooOpen2011 2011 5 7 2011-05-07
21 Feliks Zemdegs 6.65 MelbourneSummer2011 2011 1 29 2011-01-29
22 Feliks Zemdegs 6.24 KubarooOpen2011 2011 5 7 2011-05-07

Our query ordered the dataframe by world record time (the best column), but sometimes the world record gets tied, and the WCA still counts a tied world record as a world record in the database. As we see above, the time of 6.65 seconds set on Jan 29 occurred before the tied world record of 6.65 seconds on May 7, but it appears later in the dataframe. This could cause problems with our last step of data cleaning, so we fix it now: we simply sort our dataframe by date first, then by best (descending) when the date is the same.

In [10]:
df_333_singleWRs = df_333_singleWRs.sort_values(by=["date", "best"], ascending=[True, False])
df_333_singleWRs[20:23]
Out[10]:
personName best competitionId year month day date
21 Feliks Zemdegs 6.65 MelbourneSummer2011 2011 1 29 2011-01-29
20 Feliks Zemdegs 6.65 KubarooOpen2011 2011 5 7 2011-05-07
22 Feliks Zemdegs 6.24 KubarooOpen2011 2011 5 7 2011-05-07

Before we can plot, we have one more issue. Notice that near the top of our dataframe, multiple records share the same date. We cannot plot this correctly on a line plot, so we need some way of differentiating (by time) world records that happened on the same date.

Note that this issue only occurs with older world records, because the WCA has since changed their policy so that if a world record is broken multiple times on the same day, only the latest one is considered a world record.

In [11]:
df_333_singleWRs.head()
Out[11]:
personName best competitionId year month day date
0 Minh Thai 22.95 WC1982 1982 6 5 1982-06-05
1 Dan Knights 16.71 WC2003 2003 8 23 2003-08-23
2 Jess Bonde 16.53 WC2003 2003 8 23 2003-08-23
3 Shotaro Makisumi (牧角章太郎) 15.07 CaltechWinter2004 2004 1 24 2004-01-24
4 Shotaro Makisumi (牧角章太郎) 14.76 CaltechWinter2004 2004 1 24 2004-01-24

In order to differentiate world records that occurred on the same date, we can add an hour column to each record. Since the dataframe is now sorted chronologically (with same-date ties ordered from slower to faster), we know that a record earlier in the dataframe occurred before any record after it. For dates on which only one world record was set, we assign an hour of 0. For dates on which multiple world records were set, we separate them by an arbitrary time offset ("arbitrary" because the actual time of day does not matter, only the fact that slower world records occurred before faster ones). Careful inspection of our dataframe reveals that at most two world records were ever set on the same date, so when two share a date, we say the slower one occurred at hour 0 and the faster one at hour 12.

In [12]:
hour = []                                       # new column to add to df
prev_row = df_333_singleWRs.iloc[0]             # previous row (used when iterating)

# iterate through each row; note that even after sorting, the chronologically
# first record (Minh Thai, 1982) still has label 0, so `i == 0` works as the base case
for i, row in df_333_singleWRs.iterrows():
    if i == 0:                                  # base case: first world record
        hour.append(0)
    else:
        if prev_row["date"] == row["date"]:     # same date as the previous record
            hour.append(hour[-1] + 12)          # offset by 12 hours
        else:
            hour.append(0)
        prev_row = row

# add the new column
df_333_singleWRs["hour"] = hour
df_333_singleWRs
Out[12]:
personName best competitionId year month day date hour
0 Minh Thai 22.95 WC1982 1982 6 5 1982-06-05 0
1 Dan Knights 16.71 WC2003 2003 8 23 2003-08-23 0
2 Jess Bonde 16.53 WC2003 2003 8 23 2003-08-23 12
3 Shotaro Makisumi (牧角章太郎) 15.07 CaltechWinter2004 2004 1 24 2004-01-24 0
4 Shotaro Makisumi (牧角章太郎) 14.76 CaltechWinter2004 2004 1 24 2004-01-24 12
5 Shotaro Makisumi (牧角章太郎) 13.93 CaltechSpring2004 2004 4 3 2004-04-03 0
6 Shotaro Makisumi (牧角章太郎) 12.11 CaltechSpring2004 2004 4 3 2004-04-03 12
7 Jean Pons 11.75 DutchOpen2005 2005 10 16 2005-10-16 0
8 Leyan Lo 11.13 CaltechWinter2006 2006 1 14 2006-01-14 0
9 Toby Mao (毛台立) 10.48 US2006 2006 8 4 2006-08-04 0
10 Edouard Chambon 10.36 BelgianOpen2007 2007 2 24 2007-02-24 0
11 Thibaut Jacquinot 9.86 SpanishOpen2007 2007 5 5 2007-05-05 0
12 Erik Akkersdijk 9.77 DutchOpen2007 2007 10 13 2007-10-13 0
13 Ron van Bruchem 9.55 Netherlands2007 2007 11 24 2007-11-24 0
14 Edouard Chambon 9.18 MurciaOpen2008 2008 2 23 2008-02-23 0
15 Yu Nakajima (中島悠) 8.72 KashiwaOpen2008 2008 5 5 2008-05-05 0
16 Yu Nakajima (中島悠) 8.72 KashiwaOpen2008 2008 5 5 2008-05-05 12
17 Erik Akkersdijk 7.08 CzechOpen2008 2008 7 12 2008-07-12 0
18 Feliks Zemdegs 7.03 MelbourneCubeDay2010 2010 11 13 2010-11-13 0
19 Feliks Zemdegs 6.77 MelbourneCubeDay2010 2010 11 13 2010-11-13 12
21 Feliks Zemdegs 6.65 MelbourneSummer2011 2011 1 29 2011-01-29 0
20 Feliks Zemdegs 6.65 KubarooOpen2011 2011 5 7 2011-05-07 0
22 Feliks Zemdegs 6.24 KubarooOpen2011 2011 5 7 2011-05-07 12
23 Feliks Zemdegs 6.18 MelbourneWinterOpen2011 2011 6 25 2011-06-25 0
24 Feliks Zemdegs 5.66 MelbourneWinterOpen2011 2011 6 25 2011-06-25 12
25 Mats Valk 5.55 ZonhovenOpen2013 2013 3 2 2013-03-02 0
26 Collin Burns 5.25 DoylestownSpring2015 2015 4 25 2015-04-25 0
27 Lucas Etter 4.90 RiverHillFall2015 2015 11 21 2015-11-21 0
28 Mats Valk 4.74 JawaTimurOpen2016 2016 11 5 2016-11-05 0
29 Feliks Zemdegs 4.73 POPSOpen2016 2016 12 11 2016-12-11 0
30 Patrick Ponce 4.69 RallyInTheValley2017 2017 9 2 2017-09-02 0
31 SeungBeom Cho (조승범) 4.59 Chicago2017 2017 10 28 2017-10-28 0
32 Feliks Zemdegs 4.59 HobartSummer2018 2018 1 27 2018-01-27 0
33 Feliks Zemdegs 4.22 CubeforCambodia2018 2018 5 6 2018-05-06 0
34 Yusheng Du (杜宇生) 3.47 WuhuOpen2018 2018 11 24 2018-11-24 0

Now we can construct the full column of datetime objects.

In [13]:
df_333_singleWRs["date"] = pd.to_datetime(df_333_singleWRs[["year", "month", "day", "hour"]])
df_333_singleWRs
Out[13]:
personName best competitionId year month day date hour
0 Minh Thai 22.95 WC1982 1982 6 5 1982-06-05 00:00:00 0
1 Dan Knights 16.71 WC2003 2003 8 23 2003-08-23 00:00:00 0
2 Jess Bonde 16.53 WC2003 2003 8 23 2003-08-23 12:00:00 12
3 Shotaro Makisumi (牧角章太郎) 15.07 CaltechWinter2004 2004 1 24 2004-01-24 00:00:00 0
4 Shotaro Makisumi (牧角章太郎) 14.76 CaltechWinter2004 2004 1 24 2004-01-24 12:00:00 12
5 Shotaro Makisumi (牧角章太郎) 13.93 CaltechSpring2004 2004 4 3 2004-04-03 00:00:00 0
6 Shotaro Makisumi (牧角章太郎) 12.11 CaltechSpring2004 2004 4 3 2004-04-03 12:00:00 12
7 Jean Pons 11.75 DutchOpen2005 2005 10 16 2005-10-16 00:00:00 0
8 Leyan Lo 11.13 CaltechWinter2006 2006 1 14 2006-01-14 00:00:00 0
9 Toby Mao (毛台立) 10.48 US2006 2006 8 4 2006-08-04 00:00:00 0
10 Edouard Chambon 10.36 BelgianOpen2007 2007 2 24 2007-02-24 00:00:00 0
11 Thibaut Jacquinot 9.86 SpanishOpen2007 2007 5 5 2007-05-05 00:00:00 0
12 Erik Akkersdijk 9.77 DutchOpen2007 2007 10 13 2007-10-13 00:00:00 0
13 Ron van Bruchem 9.55 Netherlands2007 2007 11 24 2007-11-24 00:00:00 0
14 Edouard Chambon 9.18 MurciaOpen2008 2008 2 23 2008-02-23 00:00:00 0
15 Yu Nakajima (中島悠) 8.72 KashiwaOpen2008 2008 5 5 2008-05-05 00:00:00 0
16 Yu Nakajima (中島悠) 8.72 KashiwaOpen2008 2008 5 5 2008-05-05 12:00:00 12
17 Erik Akkersdijk 7.08 CzechOpen2008 2008 7 12 2008-07-12 00:00:00 0
18 Feliks Zemdegs 7.03 MelbourneCubeDay2010 2010 11 13 2010-11-13 00:00:00 0
19 Feliks Zemdegs 6.77 MelbourneCubeDay2010 2010 11 13 2010-11-13 12:00:00 12
21 Feliks Zemdegs 6.65 MelbourneSummer2011 2011 1 29 2011-01-29 00:00:00 0
20 Feliks Zemdegs 6.65 KubarooOpen2011 2011 5 7 2011-05-07 00:00:00 0
22 Feliks Zemdegs 6.24 KubarooOpen2011 2011 5 7 2011-05-07 12:00:00 12
23 Feliks Zemdegs 6.18 MelbourneWinterOpen2011 2011 6 25 2011-06-25 00:00:00 0
24 Feliks Zemdegs 5.66 MelbourneWinterOpen2011 2011 6 25 2011-06-25 12:00:00 12
25 Mats Valk 5.55 ZonhovenOpen2013 2013 3 2 2013-03-02 00:00:00 0
26 Collin Burns 5.25 DoylestownSpring2015 2015 4 25 2015-04-25 00:00:00 0
27 Lucas Etter 4.90 RiverHillFall2015 2015 11 21 2015-11-21 00:00:00 0
28 Mats Valk 4.74 JawaTimurOpen2016 2016 11 5 2016-11-05 00:00:00 0
29 Feliks Zemdegs 4.73 POPSOpen2016 2016 12 11 2016-12-11 00:00:00 0
30 Patrick Ponce 4.69 RallyInTheValley2017 2017 9 2 2017-09-02 00:00:00 0
31 SeungBeom Cho (조승범) 4.59 Chicago2017 2017 10 28 2017-10-28 00:00:00 0
32 Feliks Zemdegs 4.59 HobartSummer2018 2018 1 27 2018-01-27 00:00:00 0
33 Feliks Zemdegs 4.22 CubeforCambodia2018 2018 5 6 2018-05-06 00:00:00 0
34 Yusheng Du (杜宇生) 3.47 WuhuOpen2018 2018 11 24 2018-11-24 00:00:00 0

Now we can finally plot the data!

Note we will primarily be using the seaborn library to plot our data. This library is quite extensive; for more information, here are some recommended readings: Seaborn introduction, Seaborn API overview

In [13]:
sns.lineplot(data=df_333_singleWRs, y="best", x="date", marker="o")
Out[13]:
<Axes: xlabel='date', ylabel='best'>

We can add data point labels to see individual values, but it can get a little bit cluttered.

In [14]:
sns.lineplot(data=df_333_singleWRs, y="best", x="date", marker="o")

for date, best in zip(df_333_singleWRs["date"], df_333_singleWRs["best"]):
    label = f"({str(date.year)}, {best})"
    plt.annotate(label, (date, best), xytext=(3,3), textcoords="offset points", fontsize=6)

Let's try to fit a model to the data. Let's start off with a simple linear regression.

In [14]:
# converting datetime objects to ordinals, since LinearRegression cannot take datetime objects
df_333_singleWRs["ordinal"] = df_333_singleWRs["date"].apply(datetime.date.toordinal)

X = df_333_singleWRs[["ordinal"]]       # df of dates
y = df_333_singleWRs["best"]            # Series of results to match the dates
linreg = LinearRegression().fit(X, y)   # fit the linreg model

linreg_prediction = linreg.predict(X)                       # creating predicted y values using our linreg
df_333_singleWRs["predicted_linreg"] = linreg_prediction    # adding predicted values to our df

# plotting
sns.lineplot(data=df_333_singleWRs, y="best", x="date", marker="o")
sns.lineplot(data=df_333_singleWRs, x="date", y=linreg_prediction, color="#87bdff")
Out[14]:
<Axes: xlabel='date', ylabel='best'>

The line looks to fit the downward trend of the data, but there is a lot of room for improvement. To analyze how well our model fits, let's make a residual plot.

In [30]:
sns.residplot(data=df_333_singleWRs, x="ordinal", y="best")
Out[30]:
<Axes: xlabel='ordinal', ylabel='best'>

Note that an ordinal is an integer representation of a date: the proleptic Gregorian ordinal, where January 1 of year 1 has ordinal 1. Since we are only looking at residuals, the human-interpretability of the x-axis labels is not too important.
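As a quick, self-contained sanity check on how ordinals behave (the specific dates here are just illustrative):

```python
import datetime

# January 1 of year 1 has ordinal 1 in the proleptic Gregorian calendar
assert datetime.date(1, 1, 1).toordinal() == 1

# round-trip: a date converts to an ordinal and back without loss
d = datetime.date(2018, 11, 24)  # date of the current 3x3 single WR
assert datetime.date.fromordinal(d.toordinal()) == d

# consecutive days differ by exactly 1
assert datetime.date(2018, 11, 25).toordinal() == d.toordinal() + 1

print(d.toordinal())
```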

We can see that this linear regression does not fit well. Linear regression rests on several assumptions: linearity, homoscedasticity, normality of errors, and independence. Under these assumptions, a residual plot should show a random, even scatter around 0, with no trend in variability and roughly normally distributed residuals. Our residual plot instead shows a clear pattern: residuals sit well above 0 in the earliest years, dip below 0 in the middle years, then rise slightly above 0 in later years. The obvious visual explanation is that the data follows a curve rather than a straight line. Another contributing factor could simply be that we do not have many data points.
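One rough way to quantify the "random scatter" idea is to count how often adjacent residuals change sign: independent residuals flip sign about half the time, while the long runs we see in our plot flip far less often. A minimal sketch on synthetic residuals (the helper name and data are illustrative, not from our dataframe):

```python
import numpy as np

def sign_change_fraction(residuals):
    """Fraction of adjacent residual pairs that change sign.
    Near 0.5 for independent residuals; much lower when the
    residuals form systematic runs like those in our plot."""
    signs = np.sign(residuals)
    changes = np.sum(signs[:-1] != signs[1:])
    return changes / (len(residuals) - 1)

rng = np.random.default_rng(0)
random_resid = rng.normal(size=200)                 # behaves like a good fit
curved_resid = np.linspace(-1, 1, 200) ** 2 - 1/3   # systematic curve, like ours

print(sign_change_fraction(random_resid))  # near 0.5
print(sign_change_fraction(curved_resid))  # near 0 (long runs of one sign)
```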

To quantify the accuracy of our model, let's run an Ordinary Least Squares (OLS) regression using the statsmodels library.

In [112]:
ols_model = sm.ols(formula="best ~ ordinal", data=df_333_singleWRs).fit()
print(ols_model.summary())
                            OLS Regression Results                            
==============================================================================
Dep. Variable:                   best   R-squared:                       0.873
Model:                            OLS   Adj. R-squared:                  0.869
Method:                 Least Squares   F-statistic:                     227.1
Date:                Sat, 13 May 2023   Prob (F-statistic):           2.36e-16
Time:                        23:28:11   Log-Likelihood:                -65.260
No. Observations:                  35   AIC:                             134.5
Df Residuals:                      33   BIC:                             137.6
Df Model:                           1                                         
Covariance Type:            nonrobust                                         
==============================================================================
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
Intercept   1239.0797     81.641     15.177      0.000    1072.980    1405.179
ordinal       -0.0017      0.000    -15.069      0.000      -0.002      -0.001
==============================================================================
Omnibus:                        7.722   Durbin-Watson:                   0.695
Prob(Omnibus):                  0.021   Jarque-Bera (JB):                6.267
Skew:                           0.945   Prob(JB):                       0.0436
Kurtosis:                       3.854   Cond. No.                     2.20e+08
==============================================================================

Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 2.2e+08. This might indicate that there are
strong multicollinearity or other numerical problems.

Notice that the p-value associated with the time (ordinal) coefficient is less than 0.05 (P>|t| = 0.000), which indicates that the relationship between time and world record single is statistically significant at the 95% confidence level. This is of course trivially true because over time the world record can only get lower and lower. Later on, we will explore a hypothesis that is not so obviously true.
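The p-values can also be pulled out of the fitted results object programmatically rather than read off the summary table. A minimal sketch on synthetic data (the `demo` frame and true slope are made up for illustration; the alias matches the one this tutorial uses for statsmodels' formula API):

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as sm

# synthetic data with a known downward slope plus a little noise
rng = np.random.default_rng(42)
demo = pd.DataFrame({"x": np.arange(50.0)})
demo["y"] = 3.0 - 0.05 * demo["x"] + rng.normal(scale=0.1, size=50)

fit = sm.ols(formula="y ~ x", data=demo).fit()

# pvalues and params are Series indexed by term name, so significance
# can be checked in code instead of by reading the summary table
print(fit.pvalues["x"] < 0.05)   # True: the slope is significant
print(fit.params["x"])           # close to the true slope of -0.05
```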

Here we see that the R-squared value for our OLS model is 0.873 and our log-likelihood value is -65.260.

Let's try a different model to see if we can get a better fit.

In [169]:
sns.regplot(data=df_333_singleWRs, x="ordinal", y="best", order=3, ci=None)

# changing x-axis tick labels from ordinals to year 
ax = plt.gca()              # get current ax
xticks = ax.get_xticks()    # get xticks
ax.set_xticklabels([datetime.date.fromordinal(int(x)).year for x in xticks])    # convert each ordinal to a year
ax.set_xlabel("year")       # changing x-axis label
Out[169]:
Text(0.5, 0, 'year')

Above we are using seaborn's regplot with order=3, which fits a polynomial regression of degree 3 (still a linear model, but in the polynomial terms). Visually, this appears to fit our data much better. Let's see how the residual plot looks.

In [50]:
sns.residplot(data=df_333_singleWRs, x="ordinal", y="best", order=3)
Out[50]:
<Axes: xlabel='ordinal', ylabel='best'>

Much better! Here we see residuals seemingly randomly scattered, and with more residuals clustered around 0. This indicates a pretty good fit for our linear regression with polynomial terms model!
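Under the hood, a fit like regplot's can be reproduced with numpy's polyfit, and we can hand-compute R² to compare orders. A sketch on synthetic stand-in data (`xs`/`ys` play the role of our ordinal and best columns; note that fitting high-order polynomials directly on raw ordinals near 730,000 is numerically ill-conditioned, which is exactly what the condition-number warnings below are about):

```python
import numpy as np

def poly_r_squared(xs, ys, order):
    """R-squared of a least-squares polynomial fit of the given order."""
    coeffs = np.polyfit(xs, ys, order)
    predicted = np.polyval(coeffs, xs)
    ss_res = np.sum((ys - predicted) ** 2)
    ss_tot = np.sum((ys - np.mean(ys)) ** 2)
    return 1 - ss_res / ss_tot

# stand-in data with a decaying curve, loosely like the WR progression
xs = np.linspace(0, 10, 40)
ys = 20 * np.exp(-0.4 * xs) + 0.5

print(poly_r_squared(xs, ys, 1))  # straight line: lower R-squared
print(poly_r_squared(xs, ys, 3))  # cubic: higher R-squared
```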

If we look back at our polynomial with order 3, we can see how the line curves upwards at the end despite decreasing y-values. How about we change it to order 4?

In [59]:
sns.regplot(data=df_333_singleWRs, x="ordinal", y="best", order=4, ci=None)

# changing x-axis tick labels from ordinals to year 
ax = plt.gca()              # get current ax
xticks = ax.get_xticks()    # get xticks
ax.set_xticklabels([datetime.date.fromordinal(int(x)).year for x in xticks])    # convert each ordinal to a year
ax.set_xlabel("year")       # changing x-axis label
Out[59]:
Text(0.5, 0, 'year')
In [58]:
sns.residplot(data=df_333_singleWRs, x="ordinal", y="best", order=4)
Out[58]:
<Axes: xlabel='ordinal', ylabel='best'>

The residual plot looks relatively similar to the one with order=3. Let's run OLS models for these polynomials.

In [121]:
print("Polynomial of order 3")
ols_model_poly3 = sm.ols(formula="best ~ ordinal + I(ordinal**2) + I(ordinal**3)", data=df_333_singleWRs).fit()
print(ols_model_poly3.summary())
print("\n\n")

print("Polynomial of order 4")
ols_model_poly4 = sm.ols(formula="best ~ ordinal + I(ordinal**2) + I(ordinal**3) + I(ordinal**4)", data=df_333_singleWRs).fit()
print(ols_model_poly4.summary())
Polynomial of order 3
                            OLS Regression Results                            
==============================================================================
Dep. Variable:                   best   R-squared:                       0.875
Model:                            OLS   Adj. R-squared:                  0.871
Method:                 Least Squares   F-statistic:                     231.1
Date:                Sat, 13 May 2023   Prob (F-statistic):           1.83e-16
Time:                        23:37:20   Log-Likelihood:                -64.991
No. Observations:                  35   AIC:                             134.0
Df Residuals:                      33   BIC:                             137.1
Df Model:                           1                                         
Covariance Type:            nonrobust                                         
===================================================================================
                      coef    std err          t      P>|t|      [0.025      0.975]
-----------------------------------------------------------------------------------
Intercept        1.324e-20   8.53e-22     15.528      0.000    1.15e-20     1.5e-20
ordinal          6.456e-15   4.16e-16     15.528      0.000    5.61e-15     7.3e-15
I(ordinal ** 2)  2.362e-09   1.52e-10     15.528      0.000    2.05e-09    2.67e-09
I(ordinal ** 3) -3.198e-15   2.07e-16    -15.420      0.000   -3.62e-15   -2.78e-15
==============================================================================
Omnibus:                        7.381   Durbin-Watson:                   0.663
Prob(Omnibus):                  0.025   Jarque-Bera (JB):                5.938
Skew:                           0.936   Prob(JB):                       0.0514
Kurtosis:                       3.755   Cond. No.                     5.86e+24
==============================================================================

Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The smallest eigenvalue is 1.59e-13. This might indicate that there are
strong multicollinearity problems or that the design matrix is singular.



Polynomial of order 4
                            OLS Regression Results                            
==============================================================================
Dep. Variable:                   best   R-squared:                       0.898
Model:                            OLS   Adj. R-squared:                  0.892
Method:                 Least Squares   F-statistic:                     140.8
Date:                Sat, 13 May 2023   Prob (F-statistic):           1.38e-16
Time:                        23:37:20   Log-Likelihood:                -61.447
No. Observations:                  35   AIC:                             128.9
Df Residuals:                      32   BIC:                             133.6
Df Model:                           2                                         
Covariance Type:            nonrobust                                         
===================================================================================
                      coef    std err          t      P>|t|      [0.025      0.975]
-----------------------------------------------------------------------------------
Intercept        1.226e-20   8.64e-22     14.187      0.000    1.05e-20     1.4e-20
ordinal          5.999e-15   4.23e-16     14.187      0.000    5.14e-15    6.86e-15
I(ordinal ** 2)  2.187e-09   1.54e-10     14.187      0.000    1.87e-09     2.5e-09
I(ordinal ** 3) -2.959e-15    2.1e-16    -14.081      0.000   -3.39e-15   -2.53e-15
I(ordinal ** 4)  1.152e-19    4.3e-20      2.680      0.012    2.76e-20    2.03e-19
==============================================================================
Omnibus:                        3.361   Durbin-Watson:                   1.072
Prob(Omnibus):                  0.186   Jarque-Bera (JB):                2.044
Skew:                           0.485   Prob(JB):                        0.360
Kurtosis:                       3.679   Cond. No.                     1.08e+23
==============================================================================

Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 1.08e+23. This might indicate that there are
strong multicollinearity or other numerical problems.

We have R-squared values of 0.875 and 0.898 for the OLS models with polynomials of order 3 and 4 respectively, both marginally higher than the 0.873 of our original linear OLS model. Additionally, we have log-likelihood values of -64.991 and -61.447 for the order-3 and order-4 polynomials respectively, both better than our original value of -65.260.

Improvements in World Records¶

Above we stated that rejecting the null hypothesis was trivial when asking whether time has a statistically significant effect on the world record 3x3 single (the world record can only get faster with time). But what about the influence of time on world record improvement? That is, if we define the reduction factor of a world record x as the time of x divided by the time of the world record immediately preceding x, how does the reduction factor change with time?

Answering this question can help give us more information on the trends of cubing world records, and subsequently cubing as a whole. Are new advancements in methods and hardware making a significant difference in the activity or have we reached the point where world records are starting to plateau, and improvement becomes harder and harder?

Let's create a new dataframe based on our previous one. This time, we will include the reduction factor for each world record 3x3 single (remember our definition of reduction factor from above). Any reduction factor will lie on the interval (0,1], with 1 being a tied world record, and values will get closer to 0 the more drastic the drop from the previous world record is. We will also have to ignore the first row from our new dataframe, since there is no reduction factor for the first ever world record (there is no previous world record to compare it to).

In [84]:
df_333_singleWRs_reduction = df_333_singleWRs.copy()        # creating a copy of our original df
reduction = []                                              # list of reduction factors 
prev_row = df_333_singleWRs.iloc[0]                         # prev row of our df (used when iterating)

# iterating through df
for i, row in df_333_singleWRs_reduction.iterrows():
    if i == 0: 
        reduction.append(0.0)
    else:
        reduction.append(row["best"]/prev_row["best"])      # reduction factor = current WR / previous WR
    prev_row = row

# adding new column
df_333_singleWRs_reduction["reduction"] = reduction     
# rearranging columns and dropping ones we don't need
df_333_singleWRs_reduction = df_333_singleWRs_reduction[["personName", "best", "reduction", "competitionId", "date", "ordinal"]]
# dropping the first row since it's not relevant (N/A reduction factor since it's the first WR)
df_333_singleWRs_reduction = df_333_singleWRs_reduction.drop(index=0)
# reindex the df
df_333_singleWRs_reduction.reset_index(drop=True, inplace=True)
df_333_singleWRs_reduction
Out[84]:
personName best reduction competitionId date ordinal
0 Dan Knights 16.71 0.728105 WC2003 2003-08-23 00:00:00 731450
1 Jess Bonde 16.53 0.989228 WC2003 2003-08-23 12:00:00 731450
2 Shotaro Makisumi (牧角章太郎) 15.07 0.911676 CaltechWinter2004 2004-01-24 00:00:00 731604
3 Shotaro Makisumi (牧角章太郎) 14.76 0.979429 CaltechWinter2004 2004-01-24 12:00:00 731604
4 Shotaro Makisumi (牧角章太郎) 13.93 0.943767 CaltechSpring2004 2004-04-03 00:00:00 731674
5 Shotaro Makisumi (牧角章太郎) 12.11 0.869347 CaltechSpring2004 2004-04-03 12:00:00 731674
6 Jean Pons 11.75 0.970273 DutchOpen2005 2005-10-16 00:00:00 732235
7 Leyan Lo 11.13 0.947234 CaltechWinter2006 2006-01-14 00:00:00 732325
8 Toby Mao (毛台立) 10.48 0.941599 US2006 2006-08-04 00:00:00 732527
9 Edouard Chambon 10.36 0.988550 BelgianOpen2007 2007-02-24 00:00:00 732731
10 Thibaut Jacquinot 9.86 0.951737 SpanishOpen2007 2007-05-05 00:00:00 732801
11 Erik Akkersdijk 9.77 0.990872 DutchOpen2007 2007-10-13 00:00:00 732962
12 Ron van Bruchem 9.55 0.977482 Netherlands2007 2007-11-24 00:00:00 733004
13 Edouard Chambon 9.18 0.961257 MurciaOpen2008 2008-02-23 00:00:00 733095
14 Yu Nakajima (中島悠) 8.72 0.949891 KashiwaOpen2008 2008-05-05 00:00:00 733167
15 Yu Nakajima (中島悠) 8.72 1.000000 KashiwaOpen2008 2008-05-05 12:00:00 733167
16 Erik Akkersdijk 7.08 0.811927 CzechOpen2008 2008-07-12 00:00:00 733235
17 Feliks Zemdegs 7.03 0.992938 MelbourneCubeDay2010 2010-11-13 00:00:00 734089
18 Feliks Zemdegs 6.77 0.963016 MelbourneCubeDay2010 2010-11-13 12:00:00 734089
19 Feliks Zemdegs 6.65 0.982275 MelbourneSummer2011 2011-01-29 00:00:00 734166
20 Feliks Zemdegs 6.65 1.000000 KubarooOpen2011 2011-05-07 00:00:00 734264
21 Feliks Zemdegs 6.24 0.938346 KubarooOpen2011 2011-05-07 12:00:00 734264
22 Feliks Zemdegs 6.18 0.990385 MelbourneWinterOpen2011 2011-06-25 00:00:00 734313
23 Feliks Zemdegs 5.66 0.915858 MelbourneWinterOpen2011 2011-06-25 12:00:00 734313
24 Mats Valk 5.55 0.980565 ZonhovenOpen2013 2013-03-02 00:00:00 734929
25 Collin Burns 5.25 0.945946 DoylestownSpring2015 2015-04-25 00:00:00 735713
26 Lucas Etter 4.90 0.933333 RiverHillFall2015 2015-11-21 00:00:00 735923
27 Mats Valk 4.74 0.967347 JawaTimurOpen2016 2016-11-05 00:00:00 736273
28 Feliks Zemdegs 4.73 0.997890 POPSOpen2016 2016-12-11 00:00:00 736309
29 Patrick Ponce 4.69 0.991543 RallyInTheValley2017 2017-09-02 00:00:00 736574
30 SeungBeom Cho (조승범) 4.59 0.978678 Chicago2017 2017-10-28 00:00:00 736630
31 Feliks Zemdegs 4.59 1.000000 HobartSummer2018 2018-01-27 00:00:00 736721
32 Feliks Zemdegs 4.22 0.919390 CubeforCambodia2018 2018-05-06 00:00:00 736820
33 Yusheng Du (杜宇生) 3.47 0.822275 WuhuOpen2018 2018-11-24 00:00:00 737022

Now let's plot our dataframe, and see how the reduction factor of 3x3 world record singles has changed over time.

In [171]:
sns.lineplot(data=df_333_singleWRs_reduction, x="date", y="reduction", marker="o")
Out[171]:
<Axes: xlabel='date', ylabel='reduction'>

From visual inspection, there is no strong and obvious correlation amongst the data. Reduction factors seem to be clustered in the 0.95-1 range, with a few in the 0.90-0.95 range, and a few outliers as well. Let's plot a linear regression line to fit the data.

Before we can run linear regressions on this dataframe, let's sort out the ordinal column. The seaborn library does not play nicely with regression calculations on Python datetime objects, so we converted each date to its ordinal (integer) representation. However, this reintroduces the problem of overlapping results when a record is broken twice on the same day. We therefore apply the same logic as before (where we added 12 hours to differentiate records set on the same date), but this time we add 0.5 to differentiate between any two ordinals that are the same.
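Since at most two records share a date, the same offset can also be computed in one vectorized step with groupby().cumcount(). A sketch on stand-in ordinals:

```python
import pandas as pd

# stand-in ordinals with two duplicated dates and one unique date
demo = pd.DataFrame({"ordinal": [731450, 731450, 731604, 731604, 731674]})

# cumcount() numbers the rows within each group of equal ordinals
# (0 for the first occurrence, 1 for the second, ...), so adding
# 0.5 * cumcount offsets duplicates exactly like the loop does
demo["ordinal"] = demo["ordinal"] + 0.5 * demo.groupby("ordinal").cumcount()

print(demo["ordinal"].tolist())
# [731450.0, 731450.5, 731604.0, 731604.5, 731674.0]
```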

In [85]:
ordinal_time = []                                       # new column to add to df
prev_row = df_333_singleWRs_reduction.iloc[0]           # prev row of our df (used when iterating)

# iterating through each row
for i, row in df_333_singleWRs_reduction.iterrows():
    if i == 0:                                          # base case
        ordinal_time.append(row["ordinal"])
    else:
        if prev_row["ordinal"] == row["ordinal"]:       # if ordinal is same as the previous row
            ordinal_time.append(row["ordinal"]+0.5)
        else:
            ordinal_time.append(row["ordinal"])
        prev_row = row

# add the new column
df_333_singleWRs_reduction["ordinal"] = ordinal_time
df_333_singleWRs_reduction
Out[85]:
personName best reduction competitionId date ordinal
0 Dan Knights 16.71 0.728105 WC2003 2003-08-23 00:00:00 731450.0
1 Jess Bonde 16.53 0.989228 WC2003 2003-08-23 12:00:00 731450.5
2 Shotaro Makisumi (牧角章太郎) 15.07 0.911676 CaltechWinter2004 2004-01-24 00:00:00 731604.0
3 Shotaro Makisumi (牧角章太郎) 14.76 0.979429 CaltechWinter2004 2004-01-24 12:00:00 731604.5
4 Shotaro Makisumi (牧角章太郎) 13.93 0.943767 CaltechSpring2004 2004-04-03 00:00:00 731674.0
5 Shotaro Makisumi (牧角章太郎) 12.11 0.869347 CaltechSpring2004 2004-04-03 12:00:00 731674.5
6 Jean Pons 11.75 0.970273 DutchOpen2005 2005-10-16 00:00:00 732235.0
7 Leyan Lo 11.13 0.947234 CaltechWinter2006 2006-01-14 00:00:00 732325.0
8 Toby Mao (毛台立) 10.48 0.941599 US2006 2006-08-04 00:00:00 732527.0
9 Edouard Chambon 10.36 0.988550 BelgianOpen2007 2007-02-24 00:00:00 732731.0
10 Thibaut Jacquinot 9.86 0.951737 SpanishOpen2007 2007-05-05 00:00:00 732801.0
11 Erik Akkersdijk 9.77 0.990872 DutchOpen2007 2007-10-13 00:00:00 732962.0
12 Ron van Bruchem 9.55 0.977482 Netherlands2007 2007-11-24 00:00:00 733004.0
13 Edouard Chambon 9.18 0.961257 MurciaOpen2008 2008-02-23 00:00:00 733095.0
14 Yu Nakajima (中島悠) 8.72 0.949891 KashiwaOpen2008 2008-05-05 00:00:00 733167.0
15 Yu Nakajima (中島悠) 8.72 1.000000 KashiwaOpen2008 2008-05-05 12:00:00 733167.5
16 Erik Akkersdijk 7.08 0.811927 CzechOpen2008 2008-07-12 00:00:00 733235.0
17 Feliks Zemdegs 7.03 0.992938 MelbourneCubeDay2010 2010-11-13 00:00:00 734089.0
18 Feliks Zemdegs 6.77 0.963016 MelbourneCubeDay2010 2010-11-13 12:00:00 734089.5
19 Feliks Zemdegs 6.65 0.982275 MelbourneSummer2011 2011-01-29 00:00:00 734166.0
20 Feliks Zemdegs 6.65 1.000000 KubarooOpen2011 2011-05-07 00:00:00 734264.0
21 Feliks Zemdegs 6.24 0.938346 KubarooOpen2011 2011-05-07 12:00:00 734264.5
22 Feliks Zemdegs 6.18 0.990385 MelbourneWinterOpen2011 2011-06-25 00:00:00 734313.0
23 Feliks Zemdegs 5.66 0.915858 MelbourneWinterOpen2011 2011-06-25 12:00:00 734313.5
24 Mats Valk 5.55 0.980565 ZonhovenOpen2013 2013-03-02 00:00:00 734929.0
25 Collin Burns 5.25 0.945946 DoylestownSpring2015 2015-04-25 00:00:00 735713.0
26 Lucas Etter 4.90 0.933333 RiverHillFall2015 2015-11-21 00:00:00 735923.0
27 Mats Valk 4.74 0.967347 JawaTimurOpen2016 2016-11-05 00:00:00 736273.0
28 Feliks Zemdegs 4.73 0.997890 POPSOpen2016 2016-12-11 00:00:00 736309.0
29 Patrick Ponce 4.69 0.991543 RallyInTheValley2017 2017-09-02 00:00:00 736574.0
30 SeungBeom Cho (조승범) 4.59 0.978678 Chicago2017 2017-10-28 00:00:00 736630.0
31 Feliks Zemdegs 4.59 1.000000 HobartSummer2018 2018-01-27 00:00:00 736721.0
32 Feliks Zemdegs 4.22 0.919390 CubeforCambodia2018 2018-05-06 00:00:00 736820.0
33 Yusheng Du (杜宇生) 3.47 0.822275 WuhuOpen2018 2018-11-24 00:00:00 737022.0

Now we can run a linear regression on our data.

In [172]:
sns.regplot(data=df_333_singleWRs_reduction, x="ordinal", y="reduction", ci=None)

# changing x-axis tick labels from ordinals to year 
ax = plt.gca()              # get current ax
xticks = ax.get_xticks()    # get xticks
ax.set_xticklabels([datetime.date.fromordinal(int(x)).year for x in xticks])    # convert each ordinal to a year
ax.set_xlabel("year")       # changing x-axis label
Out[172]:
Text(0.5, 0, 'year')
In [102]:
sns.regplot(data=df_333_singleWRs_reduction, x="ordinal", y="reduction", order=4, ci=None)

# changing x-axis tick labels from ordinals to year 
ax = plt.gca()              # get current ax
xticks = ax.get_xticks()    # get xticks
ax.set_xticklabels([datetime.date.fromordinal(int(x)).year for x in xticks])    # convert each ordinal to a year
ax.set_xlabel("year")       # changing x-axis label
Out[102]:
Text(0.5, 0, 'year')

Once again, the line does not seem to fit the data too well. Without any obvious correlation in the data, we would be hard pressed to find a visually good fitting model. Let's run an OLS regression like before.

In [111]:
reduction_ols_model = sm.ols(formula="reduction ~ ordinal", data=df_333_singleWRs_reduction).fit()
print(reduction_ols_model.summary())
                            OLS Regression Results                            
==============================================================================
Dep. Variable:              reduction   R-squared:                       0.028
Model:                            OLS   Adj. R-squared:                 -0.002
Method:                 Least Squares   F-statistic:                    0.9353
Date:                Sat, 13 May 2023   Prob (F-statistic):              0.341
Time:                        23:27:56   Log-Likelihood:                 48.207
No. Observations:                  34   AIC:                            -92.41
Df Residuals:                      32   BIC:                            -89.36
Df Model:                           1                                         
Covariance Type:            nonrobust                                         
==============================================================================
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
Intercept     -3.1887      4.278     -0.745      0.461     -11.902       5.524
ordinal     5.637e-06   5.83e-06      0.967      0.341   -6.24e-06    1.75e-05
==============================================================================
Omnibus:                       24.687   Durbin-Watson:                   1.784
Prob(Omnibus):                  0.000   Jarque-Bera (JB):               38.589
Skew:                          -1.937   Prob(JB):                     4.17e-09
Kurtosis:                       6.498   Cond. No.                     3.03e+08
==============================================================================

Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 3.03e+08. This might indicate that there are
strong multicollinearity or other numerical problems.

Notice that the p-value associated with the time (ordinal) coefficient is greater than 0.05 (P>|t| = 0.341), which indicates that the relationship between time and reduction factor is not statistically significant at the 95% confidence level. Thus, we fail to reject the null hypothesis that there is no relationship between time and reduction factor. The R-squared value for this OLS model is also just 0.028, far lower than the 0.873 we saw for the OLS model of 3x3 single world records.

Predicting the Future¶

In [200]:
with connect(host="localhost", user=mysql_username, password=mysql_password, database="wca") as connection:
    tymon_id = '2016KOLA02'
    # Pull every 3x3 average Tymon has recorded, oldest competition first.
    # (average > 0 filters out DNF (-1), DNS (-2), and empty (0) results.)
    qry_tymon_333avgs = f"""
                    SELECT Results.competitionId, 
                        Results.roundTypeId, 
                        Results.average, 
                        Competitions.year, 
                        Competitions.month,
                        Competitions.day
                    FROM Results
                    INNER JOIN Persons ON Results.personId=Persons.id
                    INNER JOIN Competitions ON Results.competitionId=Competitions.id
                    WHERE Results.personId = '{tymon_id}'
                        AND Persons.subid = 1
                        AND Results.eventId = '333'
                        AND Results.average > 0
                    ORDER BY Competitions.year ASC, Competitions.month ASC, Competitions.day ASC;
                """
    df_tymon_333avgs = pd.read_sql(qry_tymon_333avgs, con=connection)
    display(df_tymon_333avgs)

    # with pd.option_context('display.max_rows', None,):
    #     display(df_tymon_333avgs)
competitionId roundTypeId average year month day
0 WLSWiosna2016 1 2752 2016 5 8
1 LodzCubingSummer2016 1 2437 2016 6 4
2 WLSLato2016 1 2133 2016 7 9
3 Euro2016 1 2049 2016 7 15
4 IIMasovianOpenPlock2016 2 1809 2016 8 27
... ... ... ... ... ... ...
318 PBsinPalisades2023 2 558 2023 4 15
319 PBsinPalisades2023 1 518 2023 4 15
320 CapeCod2023 f 582 2023 4 22
321 CapeCod2023 2 555 2023 4 22
322 CapeCod2023 1 560 2023 4 22

323 rows × 6 columns
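As a side note, the WCA database stores results in centiseconds, so the `average` values above are hundredths of a second (e.g. 2752 is a 27.52-second average). A tiny helper (hypothetical, included only for readability) makes the conversion explicit:

```python
def centiseconds_to_seconds(cs):
    """Convert a WCA result value (stored in centiseconds) to seconds."""
    return cs / 100

print(centiseconds_to_seconds(2752))  # the first average above: 27.52 seconds
```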

In [201]:
df_tymon_333avgs_groupby_comp = df_tymon_333avgs.groupby("competitionId")
modified_groups = []
for competition, group in df_tymon_333avgs_groupby_comp:
    # Rounds of the same competition share a date, so spread them across
    # artificial "hours" to keep their order when sorting chronologically.
    # Generate exactly len(group) hour values (range(0, 24, 24//len(group))
    # can overshoot when len(group) does not divide 24 evenly).
    step = 24 // len(group)
    hours = [i * step for i in range(len(group))]
    # rows within a competition come back latest-round first, so reverse
    hours.reverse()
    group["hour"] = hours
    modified_groups.append(group)

df_tymon_333avgs = pd.concat(modified_groups)
df_tymon_333avgs = df_tymon_333avgs.sort_values(by=["year", "month", "day", "hour"])
display(df_tymon_333avgs)
# with pd.option_context('display.max_rows', None,):
#     display(df_tymon_333avgs)
competitionId roundTypeId average year month day hour
0 WLSWiosna2016 1 2752 2016 5 8 0
1 LodzCubingSummer2016 1 2437 2016 6 4 0
2 WLSLato2016 1 2133 2016 7 9 0
3 Euro2016 1 2049 2016 7 15 0
5 IIMasovianOpenPlock2016 1 1662 2016 8 27 0
... ... ... ... ... ... ... ...
317 PBsinPalisades2023 3 560 2023 4 15 12
316 PBsinPalisades2023 f 527 2023 4 15 18
322 CapeCod2023 1 560 2023 4 22 0
321 CapeCod2023 2 555 2023 4 22 8
320 CapeCod2023 f 582 2023 4 22 16

323 rows × 7 columns

In [202]:
df_tymon_333avgs["date"] = pd.to_datetime(df_tymon_333avgs[["year", "month", "day", "hour"]])
df_tymon_333avgs
Out[202]:
competitionId roundTypeId average year month day hour date
0 WLSWiosna2016 1 2752 2016 5 8 0 2016-05-08 00:00:00
1 LodzCubingSummer2016 1 2437 2016 6 4 0 2016-06-04 00:00:00
2 WLSLato2016 1 2133 2016 7 9 0 2016-07-09 00:00:00
3 Euro2016 1 2049 2016 7 15 0 2016-07-15 00:00:00
5 IIMasovianOpenPlock2016 1 1662 2016 8 27 0 2016-08-27 00:00:00
... ... ... ... ... ... ... ... ...
317 PBsinPalisades2023 3 560 2023 4 15 12 2023-04-15 12:00:00
316 PBsinPalisades2023 f 527 2023 4 15 18 2023-04-15 18:00:00
322 CapeCod2023 1 560 2023 4 22 0 2023-04-22 00:00:00
321 CapeCod2023 2 555 2023 4 22 8 2023-04-22 08:00:00
320 CapeCod2023 f 582 2023 4 22 16 2023-04-22 16:00:00

323 rows × 8 columns
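The cell above relies on the fact that `pd.to_datetime` can assemble timestamps directly from a DataFrame of appropriately named component columns (`year`, `month`, `day`, `hour`). A minimal standalone illustration:

```python
import pandas as pd

# component columns with the names to_datetime recognizes
parts = pd.DataFrame({"year": [2016], "month": [5], "day": [8], "hour": [6]})
timestamps = pd.to_datetime(parts)
print(timestamps.iloc[0])  # 2016-05-08 06:00:00
```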

In [203]:
sns.lineplot(data=df_tymon_333avgs, x="date", y="average")
Out[203]:
<Axes: xlabel='date', ylabel='average'>
In [205]:
# data preprocessing: scale the averages into [0, 1] (helps LSTM training)
scaler = MinMaxScaler()
data = scaler.fit_transform(df_tymon_333avgs["average"].values.reshape(-1, 1))

# splitting data into training and validation sets
training_size = int(len(data) * 0.8)
training_data = data[:training_size, :]
validation_data = data[training_size:, :]

# print(training_data)
# print(validation_data)

# defining the RNN model architecture
n_steps = 3                             # number of time steps to consider for each input
n_features = 1                          # number of features in each input
model = Sequential()
model.add(LSTM(50, activation='relu', input_shape=(n_steps, n_features)))
model.add(Dense(1))
model.compile(optimizer='adam', loss='mse')

# training the RNN model (left commented out; create_dataset would build
# sliding-window inputs/targets from the scaled series)
# X_train, y_train = create_dataset(training_data, n_steps)
# X_val, y_val = create_dataset(validation_data, n_steps)
# model.fit(X_train, y_train, epochs=100, batch_size=32, validation_data=(X_val, y_val))
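The commented-out training calls reference a `create_dataset` helper that is not defined in this cell. Here is a plausible sketch of such a sliding-window builder, assuming inputs shaped for the `(n_steps, n_features)` LSTM defined above (the name and signature match the commented code, but the implementation is our own):

```python
import numpy as np

def create_dataset(series, n_steps):
    """Turn a (T, 1) array into sliding-window samples: each input is
    n_steps consecutive values, and the target is the value after them."""
    X, y = [], []
    for i in range(len(series) - n_steps):
        X.append(series[i:i + n_steps, 0])
        y.append(series[i + n_steps, 0])
    # reshape to (samples, time steps, features) as Keras LSTMs expect
    return np.array(X).reshape(-1, n_steps, 1), np.array(y)
```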

Conclusion¶

We hope the reader has gained new insight into the world of competitive Rubik's Cube solving, learned more about the data science pipeline, and gained more experience with data analysis in Python. We have seen several different models for our questions regarding world record changes over time. Although machines can predict the answers to our questions with increasing accuracy, we are still constantly surprised by real-world results that defy machine-derived expectations (true for cubing and for the world in general). Cubing is still a relatively "young" activity, and the datasets available to answer many of our questions are still relatively small. It will be interesting to see how the rapid growth of cubing produces more and more data, and how that data subsequently shapes our models and predictions for the future.