Background and Context:
This site was started to play around with one of the initial ideas Aaron Swartz had for Reddit: that your upvotes and downvotes would customize what content you saw. This was very novel at the time, and Reddit never reached that milestone within Swartz's lifetime. So on this site, posts and users have positional data attached to them. The first version of this was very simple: user and post positions are updated any time you vote on something. You move closer to the post, and the post moves closer to you; the opposite happens on a downvote. There is a bit more code to avoid "identity collapse," where everything would become a singularity because people upvote more than they downvote. These positions feed into the sorting algorithm: the standard sort here is called "custom," which lets users control the weights used in the sort, and proximity is one factor they can either maximize or minimize.
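Roughly, that first continuous update might be sketched like this (illustrative only: the function name, step size, and the minimum-distance guard standing in for the collapse protection are all made up here, not the site's actual code):

import numpy as np

def apply_vote(user_pos, post_pos, vote, step=0.1, min_dist=1.0):
    # vote is +1 for an upvote, -1 for a downvote.
    user_pos = np.asarray(user_pos, dtype=float)
    post_pos = np.asarray(post_pos, dtype=float)
    delta = post_pos - user_pos
    dist = np.linalg.norm(delta)
    if vote > 0 and dist <= min_dist:
        # Stand-in for the "identity collapse" guard: stop pulling a pair that is
        # already very close, so everything doesn't converge to a single point.
        return user_pos, post_pos
    direction = delta / (dist + 1e-9)
    user_pos = user_pos + vote * step * direction   # upvote: user moves toward the post
    post_pos = post_pos - vote * step * direction   # upvote: post moves toward the user
    return user_pos, post_pos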
But I wasn't sure the original method was giving the best results, so I decided to try using a bit of machine learning to back-process all the data we have so far. It improved the positions quite a bit, and I think those improvements have held even though the system has continued applying the old method to votes made since I ran this.
The method I used treats the existing positions as model parameters, with a gradient-based loss that drives upvoted posts and users closer together and downvoted posts and users further apart. Because many users don't downvote, and I needed some balance, I also estimated which posts a user probably saw but ignored: if a user has an upvote from around the same time an ignored post would have been visible, I mark that as a minor downvote. This somewhat works because this is a small site where it's easy to see every post, and a bit of contra-indicating noise can't hurt anyway.
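Concretely, each (user, post) pair contributes a loss term of roughly:

    upvote (+1):    d(u, p)^2
    downvote (-1):  max(0, m_hard - d(u, p))^2
    ignored (0):    0.5 * max(0, m_soft - d(u, p))^2

where d(u, p) is the Euclidean distance between the user and post positions, and m_hard > m_soft are margins (margin_hard and margin_soft in the training code below), so an inferred "ignored" vote pushes a pair apart less forcefully than an explicit downvote.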
There are three main programs involved. The first extracts data from a database into a dataframe that is ready for machine learning. The next optimizes user and post positions in a 10D Hilbert space using autograd and gradient descent; this runs on a GPU. The last reduces the 10D result to 2D with principal component analysis (PCA) and pushes it back into the database format.
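# --- Program 1: extract votes and item timestamps from SQLite into ML-ready dataframes ---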
import sqlite3
import pandas as pd
import numpy as np
def prepare_data():
print("Loading SQLite Data in read-only mode...")
conn_data = sqlite3.connect("file:data.db?mode=ro", uri=True)
conn_matrix = sqlite3.connect("file:matrixdb.db?mode=ro", uri=True)
# Load items (posts and comments)
print("Loading items...")
df_tags = pd.read_sql_query("SELECT tag as id, time FROM tag", conn_matrix)
df_comments = pd.read_sql_query("SELECT comment as id, time FROM comment", conn_data)
df_items = pd.concat([df_tags, df_comments]).drop_duplicates(subset=['id']).reset_index(drop=True)
# Load votes
print("Loading votes...")
df_votes = pd.read_sql_query("SELECT user, post as item, vote, time FROM votestate", conn_data)
# Clean up unneeded connections
conn_data.close()
conn_matrix.close()
# Filter votes to only include known items (if any are missing)
df_votes = df_votes[df_votes['item'].isin(df_items['id'])]
# Map Users and Items to consecutive Integer IDs
print("Mapping IDs...")
unique_users = df_votes['user'].unique()
user2idx = {u: i for i, u in enumerate(unique_users)}
idx2user = {i: u for u, i in user2idx.items()}
unique_items = df_items['id'].unique()
item2idx = {item: i for i, item in enumerate(unique_items)}
idx2item = {i: item for item, i in item2idx.items()}
df_votes['u_idx'] = df_votes['user'].map(user2idx)
df_votes['i_idx'] = df_votes['item'].map(item2idx)
df_items['i_idx'] = df_items['id'].map(item2idx)
print(f"Total users: {len(unique_users)}")
print(f"Total items: {len(unique_items)}")
print(f"Total explicit votes: {len(df_votes)}")
# Inferred neutral votes (0)
    # For each user vote, sample up to 2 items that were posted in the 3 days leading up to the vote
# We will do this via a time-sorted array of items
print("Inferring neutral votes based on time proximity...")
df_items = df_items.sort_values('time').reset_index(drop=True)
item_times = df_items['time'].values
item_idxs = df_items['i_idx'].values
# Create a set of (u_idx, i_idx) for fast lookup
existing_votes = set(zip(df_votes['u_idx'], df_votes['i_idx']))
neutral_votes = []
# 3 days in seconds = 3 * 24 * 3600 = 259200
TIME_WINDOW = 259200
# To speed up, we'll iterate through votes and randomly pick an item within the time window
# We'll use np.searchsorted
for row in df_votes.itertuples():
u = row.u_idx
v_time = row.time # The time the *vote* happened.
        # v_time is when the vote happened, so look for items posted shortly before it;
        # those are the items most likely to have been on the front page at the time.
        # Find items posted between v_time - TIME_WINDOW and v_time.
idx_start = np.searchsorted(item_times, v_time - TIME_WINDOW)
idx_end = np.searchsorted(item_times, v_time)
if idx_start < idx_end:
# Sample up to 2 items they didn't vote on
candidates = item_idxs[idx_start:idx_end]
if len(candidates) > 0:
sampled = np.random.choice(candidates, size=min(2, len(candidates)), replace=False)
for cand in sampled:
if (u, cand) not in existing_votes:
neutral_votes.append({'u_idx': u, 'i_idx': cand, 'vote': 0})
existing_votes.add((u, cand)) # prevent creating duplicate neuts
if neutral_votes:
df_neutrals = pd.DataFrame(neutral_votes)
print(f"Created {len(df_neutrals)} inferred neutral votes")
# Combine
df_all_votes = pd.concat([df_votes[['u_idx', 'i_idx', 'vote']], df_neutrals])
else:
df_all_votes = df_votes[['u_idx', 'i_idx', 'vote']]
print("Saving processed data...")
df_all_votes.to_parquet("votes_processed.parquet")
pd.Series(idx2user).to_frame(name='user').to_parquet("idx2user.parquet")
pd.Series(idx2item).to_frame(name='item').to_parquet("idx2item.parquet")
print("Data preparation complete.")
if __name__ == "__main__":
prepare_data()
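# --- Program 2: optimize 10D user/item positions with PyTorch gradient descent ---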
import torch
import torch.nn as nn
import torch.optim as optim
from torch.utils.data import Dataset, DataLoader
import pandas as pd
import numpy as np
import time
class VoteDataset(Dataset):
def __init__(self, df):
self.u = torch.tensor(df['u_idx'].values, dtype=torch.long)
self.i = torch.tensor(df['i_idx'].values, dtype=torch.long)
self.v = torch.tensor(df['vote'].values, dtype=torch.float32)
def __len__(self):
return len(self.u)
def __getitem__(self, idx):
return self.u[idx], self.i[idx], self.v[idx]
class CPModel(nn.Module):
def __init__(self, num_users, num_items, embedding_dim=10):
super(CPModel, self).__init__()
# Using a slight initialization spread
self.user_emb = nn.Embedding(num_users, embedding_dim)
self.item_emb = nn.Embedding(num_items, embedding_dim)
nn.init.normal_(self.user_emb.weight, std=0.1)
nn.init.normal_(self.item_emb.weight, std=0.1)
def forward(self, u_idx, i_idx):
u = self.user_emb(u_idx)
i = self.item_emb(i_idx)
# Euclidean distance squared
dist_sq = torch.sum((u - i)**2, dim=1)
return dist_sq
def train():
print("Loading data...")
df_votes = pd.read_parquet("votes_processed.parquet")
idx2user = pd.read_parquet("idx2user.parquet")
idx2item = pd.read_parquet("idx2item.parquet")
num_users = len(idx2user)
num_items = len(idx2item)
    # Margin targets, collaborative-metric-learning style. The soft margin keeps an
    # inferred-neutral (0) vote a lesser negative than an explicit downvote:
    #   vote == 1  -> we want the distance to be near 0
    #   vote == -1 -> we want the distance to be > margin_hard
    #   vote == 0  -> we want the distance to be > margin_soft
dataset = VoteDataset(df_votes)
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
print(f"Using device: {device}")
    # Skip the DataLoader here: for a dataset this small, per-batch collating in Python just slows things down.
u_all = dataset.u.to(device)
i_all = dataset.i.to(device)
v_all = dataset.v.to(device)
model = CPModel(num_users, num_items, embedding_dim=10).to(device)
optimizer = optim.Adam(model.parameters(), lr=0.01)
margin_hard = 5.0
margin_soft = 2.0
epochs = 200
print("Starting training...")
max_duration = 4.5
start_train_t = time.time()
for epoch in range(epochs):
if time.time() - start_train_t > max_duration:
print(f"Time limit of {max_duration}s reached. Stopping at epoch {epoch}")
break
model.train()
start_t = time.time()
optimizer.zero_grad()
dist_sq = model(u_all, i_all)
dist = torch.sqrt(dist_sq + 1e-8)
mask_up = (v_all == 1)
mask_down = (v_all == -1)
mask_neut = (v_all == 0)
loss_up = torch.sum(dist_sq[mask_up])
loss_down = torch.sum(torch.relu(margin_hard - dist[mask_down])**2)
loss_neut = torch.sum(torch.relu(margin_soft - dist[mask_neut])**2) * 0.5
loss = loss_up + loss_down + loss_neut
loss.backward()
optimizer.step()
if (epoch+1) % 50 == 0:
print(f"Epoch {epoch+1}/{epochs} | Loss: {loss.item():.4f} | Time: {time.time()-start_t:.4f}s")
print("Saving embeddings...")
torch.save(model.state_dict(), "model_weights.pt")
# Save embeddings to numpy for post-processing
model.eval()
with torch.no_grad():
u_emb = model.user_emb.weight.cpu().numpy()
i_emb = model.item_emb.weight.cpu().numpy()
np.save("user_emb.npy", u_emb)
np.save("item_emb.npy", i_emb)
print("Training complete.")
if __name__ == "__main__":
train()
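# --- Program 3: reduce the 10D embeddings to 2D with PCA and write positions back to SQLite ---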
import pandas as pd
import numpy as np
import sqlite3
import shutil
import os
def post_process():
print("Loading embeddings...")
u_emb = np.load("user_emb.npy")
i_emb = np.load("item_emb.npy")
idx2user = pd.read_parquet("idx2user.parquet")
idx2item = pd.read_parquet("idx2item.parquet")
print("Applying PCA to reduce 10D embeddings to 2D...")
all_emb = np.vstack([u_emb, i_emb])
# Numpy PCA
all_emb_centered = all_emb - np.mean(all_emb, axis=0)
cov = np.cov(all_emb_centered, rowvar=False)
evals, evecs = np.linalg.eigh(cov)
idx = np.argsort(evals)[::-1]
evecs_2d = evecs[:, idx[:2]]
all_emb_2d = np.dot(all_emb_centered, evecs_2d)
num_users = len(u_emb)
u_emb_2d = all_emb_2d[:num_users]
i_emb_2d = all_emb_2d[num_users:]
# Optional scaling to make coordinates nice (e.g. within -100 to 100)
max_val = np.max(np.abs(all_emb_2d))
if max_val > 0:
u_emb_2d = (u_emb_2d / max_val) * 100.0
i_emb_2d = (i_emb_2d / max_val) * 100.0
print("Preparing SQL updates...")
# Update 'user' table in matrixdb_updated.db
# Update 'tag' table in matrixdb_updated.db
if os.path.exists("matrixdb_updated.db"):
os.remove("matrixdb_updated.db")
shutil.copy("matrixdb.db", "matrixdb_updated.db")
conn = sqlite3.connect("matrixdb_updated.db")
cursor = conn.cursor()
cursor.execute("BEGIN TRANSACTION")
# We will do an UPDATE or INSERT for users
print("Updating user positions...")
user_updates = []
for i, row in idx2user.iterrows():
username = row['user']
x, y = u_emb_2d[i]
user_updates.append((float(x), float(y), username))
cursor.executemany("UPDATE user SET x=?, y=? WHERE user=?", user_updates)
# Update tags (posts)
print("Updating tag positions...")
tag_updates = []
for i, row in idx2item.iterrows():
itemname = row['item']
x, y = i_emb_2d[i]
        # Only the 'tag' table gets positions; 'comment' in data.db has no x, y columns.
        # Items that aren't tags simply won't match a row, so their UPDATE is a no-op.
tag_updates.append((float(x), float(y), itemname))
cursor.executemany("UPDATE tag SET x=?, y=? WHERE tag=?", tag_updates)
cursor.execute("COMMIT")
conn.close()
print("Done! You can now replace matrixdb.db with matrixdb_updated.db")
if __name__ == "__main__":
post_process()
It did seem to vastly improve the apparent quality of relative post positions. It may be smart to run this batch program again now and then. Or perhaps it drove enough coherence into the positions that the continuous algorithm can do a sufficient job from here. One downside of back-processing, at least as it is written now, is that there is no weighting for how recently a vote was made; the continuous, live method can adjust to a user's preferences day to day.
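If I do keep re-running it, one simple tweak (just a sketch here, not something the scripts above do) would be to multiply each vote's contribution to the loss by a recency weight, for example an exponential decay:

import numpy as np

def recency_weight(vote_time, now, half_life_days=30.0):
    # Hypothetical weight: 1.0 for a vote cast right now, 0.5 after one
    # half-life, and so on. Timestamps are assumed to be in seconds.
    age_days = (now - vote_time) / 86400.0
    return 0.5 ** (np.maximum(age_days, 0.0) / half_life_days)

Weighting each vote's term by this before summing would let recent votes dominate the learned positions.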
The next thing I want to play with, and have partially started on, is calculating negative positions for a user: essentially, estimating the center of what they downvote (or ignore, if downvote data is missing).
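A minimal sketch of that, assuming the parquet and .npy outputs from the scripts above: average the positions of the items a user downvoted, falling back to the inferred-neutral items when there are no explicit downvotes.

import numpy as np
import pandas as pd

def negative_position(u_idx, df_votes, item_emb):
    # df_votes is the processed (u_idx, i_idx, vote) table; item_emb is the
    # item embedding matrix saved by the training program.
    mine = df_votes[df_votes['u_idx'] == u_idx]
    neg = mine[mine['vote'] == -1]
    if len(neg) == 0:
        # No explicit downvotes: treat ignored (inferred-neutral) items as the signal.
        neg = mine[mine['vote'] == 0]
    if len(neg) == 0:
        return None
    return item_emb[neg['i_idx'].values].mean(axis=0)

# Example usage with the files produced above:
# df_votes = pd.read_parquet("votes_processed.parquet")
# item_emb = np.load("item_emb.npy")
# center = negative_position(0, df_votes, item_emb)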