Metafilter n-gram viewer February 1, 2013 3:42 PM
The recent meta post about weird tags got me playing again with the infodump, and I was finally motivated to create something I've wanted for a while: A clone of the Google Books N-Gram viewer, but for Metafilter: http://mefingram.appspot.com/.
Look at trends in metafilter cliches. See which presidential candidates we like to discuss the most, and which vice presidential candidates. How has our interest in the giants of the internet changed over the years?
What social networks have been giving people trouble for the past 10 years? Which browsers? When did we first start wondering if it was safe to eat something?
The most common 6-grams in askme titles give you some idea of the canonical askme questions:
what is the best way to
what do i need to know
should i stay or should i
how do i get rid of
what is the name of this
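For anyone curious how a list like this can be produced, here is a minimal sketch of counting n-grams over a set of titles. It is not the viewer's actual code: the function names, the whitespace tokenizer, and the sample titles are all assumptions made for illustration.

```python
from collections import Counter


def ngrams(title, n):
    """Return the list of n-grams (as space-joined strings) in a title."""
    # Assumption: lowercase and split on whitespace; the real viewer's
    # tokenizer may differ.
    words = title.lower().split()
    return [" ".join(words[i:i + n]) for i in range(len(words) - n + 1)]


def most_common_ngrams(titles, n, k):
    """Count n-grams across all titles and return the k most frequent."""
    counts = Counter()
    for title in titles:
        counts.update(ngrams(title, n))
    return counts.most_common(k)


# Hypothetical sample titles, not taken from the infodump.
sample_titles = [
    "What is the best way to learn Python",
    "What is the best way to clean a rug",
    "How do I get rid of ants",
]
top = most_common_ngrams(sample_titles, 6, 1)
```

Titles shorter than n words simply contribute no n-grams.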
The source code for the viewer is available on GitHub at https://github.com/wiseman/mefingram.
NEAT!!!!!
(Sorry, but that was my actual out loud reaction so I had to share it.)
snowflake, special, special snowflake
posted by MCMikeNamara at 3:51 PM on February 1, 2013
Looks like we hit peak cortex awhile ago.
posted by dersins at 4:01 PM on February 1, 2013 [1 favorite]
Awesome work, jjwiseman. Interesting but I guess not really surprising on reflection that there's enough title data alone to do something interesting with.
If you're interested in playing around with a muuuuuch larger dataset, I could look into creating some one-off n-gram files for comment content for you to try incorporating into this.
posted by cortex (staff) at 4:08 PM on February 1, 2013 [2 favorites]
hmm...seems punctuation throws it off. "google+" appears to swallow results for "google"
posted by juv3nal at 4:10 PM on February 1, 2013
Features I'd like to add:
- Views of the most common/least common n-grams.
- "Auto-suggest": type some words, then see the most likely/least likely next words.
posted by jjwiseman at 4:10 PM on February 1, 2013
juv3nal: Yes, see "How does the n-gram viewer handle punctuation?". Ideally it would be nice to be able to choose whether punctuation is significant (imagine being able to search for "metafilter:").
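As a rough illustration of what a "punctuation is significant" option could look like, here is a minimal sketch. The `tokenize` function and its `keep_punctuation` flag are hypothetical and do not describe how the viewer actually tokenizes titles.

```python
import re


def tokenize(text, keep_punctuation=False):
    """Split text into lowercase tokens.

    With keep_punctuation=False, punctuation is dropped, so "google+"
    and "google" produce the same token. With keep_punctuation=True,
    text is split on whitespace only, so "metafilter:" stays distinct.
    """
    text = text.lower()
    if keep_punctuation:
        return text.split()
    # \w+ matches runs of letters, digits, and underscores.
    return re.findall(r"\w+", text)


plain = tokenize("Google+ vs Google")
strict = tokenize("Google+ vs Google", keep_punctuation=True)
```

With the flag off, both spellings collapse to the same token, which is the kind of merging described above; with it on, a query for "metafilter:" would only match titles containing the colon.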
posted by jjwiseman at 4:12 PM on February 1, 2013
oh my bad. I'd actually read that and somehow glossed over it, I guess because the chart shows the extra line/legend item.
posted by juv3nal at 4:14 PM on February 1, 2013
I have no idea what I am doing, but I like this. I just put "beer, job" in it for Ask and it is sad to see that job has out-performed beer lately. From 2010 - 2012 job has an upward sloping graph while beer is downward sloping. What's up with that folks? Maturity sucks.
posted by JohnnyGunn at 4:23 PM on February 1, 2013
If I bought stock based on this I'd buy Tumblr stock today.
posted by Potomac Avenue at 4:23 PM on February 1, 2013
juv3nal, I consider the fact that it didn't correctly give a count for competing n-grams a bug, so thanks for finding that.
BTW, while processing I ran into the following issues with the data in the infodump:
Encoding issues--titles that aren't UTF-8:
2013-02-01 13:55:56,040:INFO: Joining post data for askme...
2013-02-01 13:55:57,125:WARNING: Skipped 375 posts due to UTF8 errors.
2013-02-01 13:56:06,825:INFO: Joining post data for mefi...
2013-02-01 13:56:07,372:WARNING: Skipped 50 posts due to UTF8 errors.
2013-02-01 13:56:12,568:INFO: Joining post data for meta...
2013-02-01 13:56:12,657:WARNING: Skipped 19 posts due to UTF8 errors.
2013-02-01 13:56:13,554:INFO: Joining post data for music...
2013-02-01 13:56:13,583:WARNING: Skipped 30 posts due to UTF8 errors.
postdata_mefi.txt has a bad record for post 113202 due to an embedded newline:
113202 129814 2012-02-24 20:48:48.907 0 3 0 1 This is maybe too random
to achieve traction. -- <a href="http://www.metafilter.com/user/292" id="sig">jessamyn</a>
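The "Skipped N posts due to UTF8 errors" warnings in the log above could come from a loop along these lines. This is only a sketch: it assumes the infodump files are read as raw bytes, one record per line, and the function name `read_utf8_lines` and the sample records are made up for illustration.

```python
import logging


def read_utf8_lines(raw_lines):
    """Decode byte lines as UTF-8, skipping any that fail to decode.

    Returns (decoded_lines, skipped_count).
    """
    decoded = []
    skipped = 0
    for raw in raw_lines:
        try:
            decoded.append(raw.decode("utf-8").rstrip("\r\n"))
        except UnicodeDecodeError:
            # The record is not valid UTF-8; count it and move on.
            skipped += 1
    if skipped:
        logging.warning("Skipped %d posts due to UTF8 errors.", skipped)
    return decoded, skipped


# Hypothetical records: the second is Latin-1 encoded, not UTF-8.
records = [
    b"1\tfirst title\n",
    "2\tcaf\u00e9\n".encode("latin-1"),
    b"3\tthird\n",
]
lines, skipped = read_utf8_lines(records)
```

Note that a line-by-line reader like this would still trip over the embedded-newline record shown above, since that record spans two lines.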
posted by jjwiseman at 4:25 PM on February 1, 2013
> seems punctuation throws it off.
Which explains why restless_nomad is traveling so under the radar?
posted by jessamyn (staff) at 5:14 PM on February 1, 2013
Maybe we hated Bush more than we love Obama?
posted by double block and bleed at 6:33 PM on February 1, 2013
Depends on the bush.
posted by cjorgensen at 7:08 PM on February 1, 2013
taters, tater, fedoras, fedora.
Man, I can't wait until we can n-gram the whole text corpus and not just the titles!
posted by barnacles at 7:22 PM on February 1, 2013
> Looks like we hit peak cortex awhile ago.
And now we are down to seeds and brain stems.
posted by y2karl at 7:51 PM on February 1, 2013 [1 favorite]
Could someone please release an engram viewer next? I've got a bunch of thetans to audit and my e-meter's busted so an app or something would just be great.
posted by FAMOUS MONSTER at 8:40 PM on February 1, 2013 [1 favorite]
Thank Xenu, I'm not the only one. That is exactly where my brain goes when I hear about N-grams as well.
posted by maryr at 9:18 PM on February 1, 2013
This is fun.
Mefi:
What's your pleasure? (beer, apparently)
Dogs and cats about equally popular
Crisis points
Ask:
The family members who can't talk are the most puzzling
Questions for every occasion (but mainly for weddings and parties)
posted by Orinda at 11:42 PM on February 1, 2013
I assume everything drops off sharply at the end because we have only had one month in 2013 in which to talk about stuff. Would it make sense to have an option to smooth the data? You could, for example, multiply the results for each year by 12/n where n is the number of months of that year which have elapsed. This would allow any emerging trends to become apparent, especially when the current month falls toward the beginning of a year.
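A minimal sketch of that 12/n scaling, with a hypothetical function name and made-up numbers:

```python
def annualize(count, months_elapsed):
    """Scale a partial-year count to a full-year estimate: count * 12 / n."""
    if not 1 <= months_elapsed <= 12:
        raise ValueError("months_elapsed must be between 1 and 12")
    return count * 12 / months_elapsed


# Hypothetical: 5 matching titles seen in January 2013 (1 month elapsed).
estimate = annualize(5, 1)
```

A completed year passes through unchanged, since 12/12 is 1. Early in a year the estimate is noisy, because a handful of posts gets multiplied by a large factor.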
posted by tractorfeed at 2:30 AM on February 2, 2013 [1 favorite]
I don't know why I was curious about this, but I was and huh.
posted by Grangousier at 3:59 AM on February 2, 2013
jessamyn, it looks like restless_nomad just hasn't been mentioned in any post titles yet.
tractorfeed, that's right. I have a bug open for that: https://github.com/wiseman/mefingram/issues/2
posted by jjwiseman at 10:52 AM on February 2, 2013
Awesomesauce! Also, the most use of the word 'pony' so far happened in 2011. Whodathunkit?
posted by Faintdreams at 11:36 AM on February 2, 2013
Would it be trivial or possible to be able to click through to see the posts being referenced? Like, what was going on with BP in 2002?
Also, Scottish pedants might not care for this.
posted by cmoj at 12:42 PM on February 2, 2013
cmoj, yes, that is planned. Maybe I'll have a chance to do that this weekend, even: https://github.com/wiseman/mefingram/issues/4.
(Also I noticed that hate is stagnant, love is growing).
posted by jjwiseman at 1:53 PM on February 2, 2013
cmoj, you can now click on the data points for a year and see the first 30 posts that match your query in that year, e.g. http://mefingram.appspot.com/?content=bp&corpus=mefi#2002.
posted by jjwiseman at 6:21 PM on February 2, 2013
> Would it make sense to have an option to smooth the data? You could, for example, multiply the results for each year by 12/n where n is the number of months of that year which have elapsed.
What would be more useful is to correlate results with post volume, rather than time.
Otherwise, almost every phrase will trend upwards since Mefi has been growing continuously, and you can't gauge which terms are actually tailing off in general usage (in absolute terms, 0.01% of all phrases in 2003 is probably a significantly smaller count than 0.001% of all phrases in 2013).
As an added benefit, there will be less tendency for results to drop to 0 at the end of the graph regardless of when the graph is generated.
posted by ardgedee at 6:35 AM on February 3, 2013
Displaying results in parts per million rather than as raw counts is the default approach to this sort of thing, and is what I was assuming this was doing, though I never actually sanity-checked that.
posted by cortex (staff) at 7:48 AM on February 3, 2013
I should have been more explicit. The way I was planning on handling relative frequency is exactly how Google handles it. For example, if you search for "pony" in meta, what the chart will show for each year is what percentage of all unigrams in meta are "pony" for that year.
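A minimal sketch of that per-year calculation, assuming two hypothetical dictionaries keyed by year (one for the query's counts, one for the total unigram counts); the numbers are invented, not real infodump figures:

```python
def relative_frequency(ngram_counts, total_counts):
    """Return {year: percentage of that year's unigrams matching the query}."""
    result = {}
    for year, total in total_counts.items():
        matches = ngram_counts.get(year, 0)
        # Guard against years with no data to avoid dividing by zero.
        result[year] = 100.0 * matches / total if total else 0.0
    return result


# Hypothetical counts for "pony" and for all unigrams in each year.
pony_counts = {2010: 5, 2011: 20}
all_unigrams = {2010: 10000, 2011: 20000, 2012: 25000}
percentages = relative_frequency(pony_counts, all_unigrams)
```

Because each year is divided by its own total, site growth and a partial current year no longer dominate the shape of the curve.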
By the way, being able to click through to posts has gotten me digging around in the early days of mefi, and it is interesting to see all the ways it was different. A few examples: permalinks weren't yet a big deal, the rules about double posts hadn't been formalized, and links were allowed in titles. Seeing all the broken links also makes me sad about how much has been lost from the earlier days of the web.
posted by jjwiseman at 12:03 PM on February 3, 2013
posted by atrazine at 3:49 PM on February 1, 2013