
The thumbnail is based on the Matrix logo.

One issue with creating subs on demand is making sure submissions don't get diluted across many trivial variations of a topic name. The worst-case scenario would be having all of these subs map to different locations:

  • Funny Videos
  • funny videos
  • funnyvideos
  • FunnyVideos
  • funny_videos
  • funny-videos
  • Funny-Videos
  • Funny_videos

Several of those variations already collapse into one because sub names are case-insensitive but case-preserving. To get rid of the others, all sub names are alpha-numeric, and we prefer PascalCase to mark the words within a topic.
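
As a rough illustration of what "case-insensitive but case-preserving" can mean in practice, here is a minimal sketch in Python. The names `SubRegistry` and `resolve` are hypothetical and are not the site's actual code. The sketch assumes the first spelling submitted for a sub is the one that gets preserved.

```python
class SubRegistry:
    """Hypothetical registry: case-insensitive lookup, case-preserving storage."""

    def __init__(self):
        # Maps the lowercased name to the spelling used when the sub was first seen.
        self._canonical = {}

    def resolve(self, name):
        """Return the stored spelling for `name`, registering it on first use."""
        # Note: str.isalnum() also accepts non-ASCII letters and digits.
        if not name.isalnum():
            raise ValueError(f"sub names must be alpha-numeric: {name!r}")
        # setdefault keeps the first spelling and ignores later case variants.
        return self._canonical.setdefault(name.lower(), name)


registry = SubRegistry()
print(registry.resolve("FunnyVideos"))  # FunnyVideos
print(registry.resolve("funnyvideos"))  # FunnyVideos (same sub, first spelling kept)
```

Spellings such as 'funny_videos' are rejected by this sketch, which matches the old behaviour of complaining and asking the user to try again.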

But people still put spaces, underscores, and hyphens in topics all the time. The prior strategy was to complain and tell them to try again. I decided that with a bit more code we can make use of those inputs.

Hyphens and underscores are the simplest. Remove each one and capitalize the letter after it. Always capitalize the first letter, no matter what, to encourage PascalCase.
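
The hyphen and underscore rule can be sketched in a few lines of Python. This is only an illustration under the rules just described; `join_words` is a made-up name, and the input is assumed to be a single topic with no spaces or commas.

```python
def join_words(topic):
    """Drop hyphens/underscores and capitalize the character after each one."""
    out = []
    capitalize_next = True  # the first character is always capitalized
    for ch in topic:
        if ch in "-_":
            # Drop the separator and remember to capitalize what follows it.
            capitalize_next = True
            continue
        out.append(ch.upper() if capitalize_next else ch)
        capitalize_next = False
    return "".join(out)


print(join_words("funny-videos"))  # FunnyVideos
print(join_words("Funny_videos"))  # FunnyVideos
print(join_words("funnyvideos"))   # Funnyvideos (no separator, so no inner capital)
```

The last case still lands in the same sub as the others, because sub names are case-insensitive.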

But spaces can be a little more ambiguous. We support posting to more than one topic at a time. This helps us fill common topics faster. Ideally it is used to fill in additional common topics rather than obscure ones. It addresses a problem we have: more available subs than most small sites, but only a small site's number of users and posts. We want people to be able to navigate to a typical sub they assume should exist and find something there, and making cross-posting easy helps us get there.

We typically use commas to separate multiple topics, so a space could be interpreted in more than one way. Since the space wasn't a supported character before, a user who types one is free-handing it and could intend anything. For example, someone could type any of these:

  • Dirt Biking
  • Memes Funny
  • SouthAmerica Americas
  • Europe, southern France
  • funny videos
  • graphics-programming gaming

In some of those examples it is clear that a single topic is meant, and in others that two topics are meant. But what is obvious to a person isn't as obvious to a computer. Did the first user intend to post to the topic 'Dirt'?

But for some of these a computer can figure it out. We need to know if the space should be turned into a contraction or a separation. If we see an indication that the user has done either of those things explicitly themselves, we know we can use the space to do the other.

In the case of 'SouthAmerica Americas', the capital A inside the first word shows the user already joins words themselves, so we know the space indicates two separate topics.

In the case of 'Europe, southern France', the comma shows the user already separates topics themselves, so we know that 'southern France' can be turned into SouthernFrance.

Where neither is done it can't be clear to a computer. Sometimes either a split or contraction is ok, like in the case of 'funny videos'. But sometimes it isn't ok to do one or the other.
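
The two hints can be expressed as a small classifier. This is a sketch only: `SpaceMeaning` and `classify_spaces` are illustrative names, the check looks at ASCII letters only, and it is applied here to the raw input, whereas the algorithm below first capitalizes words and removes hyphens and underscores.

```python
import re
from enum import Enum


class SpaceMeaning(Enum):
    JOIN = "join"        # the user already separates topics, so spaces join words
    SPLIT = "split"      # the user already joins words, so spaces separate topics
    UNKNOWN = "unknown"  # no explicit hint either way


def classify_spaces(text):
    """Guess what a space means from hints the user gave explicitly."""
    if "," in text:
        return SpaceMeaning.JOIN
    # A capital right after a lowercase letter or digit marks a sub-word boundary.
    if re.search(r"[a-z0-9][A-Z]", text):
        return SpaceMeaning.SPLIT
    return SpaceMeaning.UNKNOWN


print(classify_spaces("SouthAmerica Americas"))    # SpaceMeaning.SPLIT
print(classify_spaces("Europe, southern France"))  # SpaceMeaning.JOIN
print(classify_spaces("Dirt Biking"))              # SpaceMeaning.UNKNOWN
```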

The algorithm, if you care to know, goes like this:

  1. Get rid of any unneeded spaces at the front of the input, at the end, or next to a comma. Since spaces have syntactic relevance we don't want to process unneeded ones.
  2. Capitalize every alpha character after a non-alpha-numeric one, including the one at the start. Now every word and sub-word we can detect starts with a capital.
  3. At this point we can just delete hyphens and underscores.
  4. Classify some global properties of the input:
    Does it have commas?
    Does it have a capitalized alpha-character after an uncapitalized alpha-numeric character?
    Does it have spaces?
  5. If it has commas, treat spaces as contractions. Safe to delete because we already did capitalization.
  6. If it has sub-word capitalization as determined by the classification step, turn spaces into commas.
  7. If neither of those things were true but it has spaces, complain that we don't know what to do with them.
  8. Split the modified input by commas to form a list.
  9. Perform a case-insensitive deduplication of the list.
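
For the curious, the nine steps can be sketched in Python roughly as follows. This is not the site's actual code: `parse_topics` is a hypothetical name, only ASCII letters are handled, empty entries are silently dropped, and rejecting other punctuation is left out.

```python
import re


def parse_topics(raw):
    """Turn free-form topic input into a list of sub names (steps 1-9)."""
    # 1. Trim spaces at both ends and on either side of a comma.
    text = re.sub(r" *, *", ",", raw.strip(" "))
    # 2. Capitalize each letter at the start or after a non-alpha-numeric character.
    text = re.sub(r"(?:^|[^A-Za-z0-9])[a-z]", lambda m: m.group(0).upper(), text)
    # 3. Delete hyphens and underscores.
    text = text.replace("-", "").replace("_", "")
    # 4. Classify some global properties of the input.
    has_commas = "," in text
    has_subword_caps = re.search(r"[a-z0-9][A-Z]", text) is not None
    has_spaces = " " in text
    if has_commas:
        # 5. Commas already separate topics, so spaces are contractions.
        text = text.replace(" ", "")
    elif has_subword_caps:
        # 6. Words are already joined, so spaces separate topics.
        text = text.replace(" ", ",")
    elif has_spaces:
        # 7. No hint either way: complain.
        raise ValueError(f"can't tell what the spaces mean in {raw!r}")
    # 8. Split on commas, skipping empty entries.
    topics = [t for t in text.split(",") if t]
    # 9. Case-insensitive deduplication, keeping the first spelling seen.
    seen = set()
    unique = []
    for topic in topics:
        if topic.lower() not in seen:
            seen.add(topic.lower())
            unique.append(topic)
    return unique


print(parse_topics("Europe, southern France"))      # ['Europe', 'SouthernFrance']
print(parse_topics("graphics-programming gaming"))  # ['GraphicsProgramming', 'Gaming']
print(parse_topics("Funny_videos, funny-videos"))   # ['FunnyVideos']
```

Inputs such as 'Dirt Biking' or 'funny videos' raise the `ValueError` from step 7, since they contain neither a comma nor sub-word capitalization.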

tl;dr: You can now use hyphens and underscores without getting a complaint. You can even use spaces if we can figure out what you mean.
