Is it still possible to export all your MetaFilter contributions? January 2, 2011 5:02 AM
I have a plain text file containing all my MetaFilter comments from some time ago, but I can't remember where I got it from. Was it a one-off?
Awesome. Thanks!
posted by hoverboards don't work on water at 5:12 AM on January 2, 2011
It's a weekly off thing if I recall correctly.
Also, I just exported mine and it's nearly 2 MB!
Also, I noticed I have a lot of spelling errors. Can a mod fix those? Please spellcheck the site, ok?
Also, welcome back, Brandon.
posted by cjorgensen at 8:19 AM on January 2, 2011 [1 favorite]
I thought Brandon was still around but posting under a different handle. Was I wrong?
posted by Think_Long at 8:52 AM on January 2, 2011
No, you weren't wrong. I didn't know that. His profile has that info. Makes me feel a bit dumb.
posted by cjorgensen at 8:54 AM on January 2, 2011
You users and your mutable identities, making me think I'm the janitor of the Danger Room or something.
posted by The Whelk at 9:01 AM on January 2, 2011 [1 favorite]
You must spend a lot of time sweeping up rubble from blown up laser cannons.
posted by Think_Long at 9:32 AM on January 2, 2011
You users and your mutable identities, making me think I'm the janitor of the Danger Room or something.
Evidently the Danger Room is alive and a woman now, so yeah, take it to FetLife.
posted by Brandon Blatcher at 9:49 AM on January 2, 2011
Just out of curiosity, why do people want to export all their contributions?
posted by crunchland at 9:52 AM on January 2, 2011
Don't you ever go away again, Brandon Blatcher!
posted by cjorgensen at 9:52 AM on January 2, 2011
Huh! As lackadaisical and out of the loop as I usually am with MeTa, I was beginning to think nomadicink's comments sounded familiar. Nice to have you back flying the good flag, BB.
posted by cavalier at 9:54 AM on January 2, 2011
Don't you ever go away again, Brandon Blatcher!
Did Brandon really go away? Tune in at 11 to find out.
posted by special-k at 10:47 AM on January 2, 2011
Just out of curiosity, why do people want to export all their contributions?
Main reasons we've heard seem to be:
- to be able to do local full-text search (which for some applications might be quicker than using the site search I guess)
- to be able to wrangle data directly (specific types of search, manipulation for linguistics purposes)
- to have an archive in case of a nuclear strike that destroys the site and its backups
- to feel more like they have in-practice ownership of their content than they would without such a feature (cf. the relative difficulty of extracting one's own content from some blogging/content/whathaveyou platforms)
If you want the horse's mouth, though, here's the MetaTalk thread from early 2008 that led to it getting implemented, and some more chatter from later that year.
Bonus threads: another request in 2003, and a thread from last year about an XMLizing script for the comment dump if you're into that sort of thing.
posted by cortex (staff) at 11:36 AM on January 2, 2011
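For concreteness, the "local full-text search" use is almost trivially scriptable. A minimal sketch in Python, where mefi_comments.txt is just a placeholder name for the plain-text export:

import re
import sys

# Placeholder path to the plain-text comment export.
EXPORT_FILE = "mefi_comments.txt"

def search(term):
    # Print every line of the export that contains the term, case-insensitively.
    pattern = re.compile(re.escape(term), re.IGNORECASE)
    with open(EXPORT_FILE, encoding="utf-8") as f:
        for lineno, line in enumerate(f, 1):
            if pattern.search(line):
                print(lineno, line.rstrip())

if __name__ == "__main__":
    search(sys.argv[1] if len(sys.argv) > 1 else "hoverboards")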
I like it for GYOB. Sometimes I'll tell a story on here and think of putting the same content on my own blog (I seldom do). But I do use my comments for ideas for blog posts or personal projects.
It's just nice to read some of the stuff sometimes. It's also indexed by Spotlight on my Mac, so if I'm trying to find a file where I mentioned a product, it often comes up as well; then I can compare the two and see if I'm consistent in my opinions.
posted by cjorgensen at 11:59 AM on January 2, 2011 [1 favorite]
My main worry is the nuclear strike. That or EMP.
But I've said too much...
posted by djgh at 2:37 PM on January 2, 2011
Hey Lana!
What, Archer?
Danger Zone!
I need this show back RIGHT NOW!
posted by crossoverman at 4:36 PM on January 2, 2011
delmoi prints his off every month. He's solely responsible for the deforestation of the rain forests.
posted by cjorgensen at 4:37 PM on January 2, 2011
Just out of curiosity, why do people want to export all their contributions?
In addition to reasons already mentioned, you can run the exported comments through the dissociated press command in emacs for a sort of do-it-yourself MarkovFilter, since the real one went away and hasn't come back. Er... not that I would do such a thing of course, this is totally theoretical.
posted by FishBike at 5:18 PM on January 2, 2011 [1 favorite]
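A do-it-yourself MarkovFilter along those lines doesn't strictly require emacs. Here's a rough word-level Markov chain sketch in Python over the exported comments; mefi_comments.txt is a stand-in filename, and this is only the general idea, not FishBike's actual (totally theoretical) setup:

import random
from collections import defaultdict

# Placeholder path to the plain-text comment export.
EXPORT_FILE = "mefi_comments.txt"

def build_chain(words, order=2):
    # Map each tuple of `order` consecutive words to the words seen after it.
    chain = defaultdict(list)
    for i in range(len(words) - order):
        chain[tuple(words[i:i + order])].append(words[i + order])
    return chain

def babble(chain, length=50):
    # Generate text by walking the chain from a random starting key.
    key = random.choice(list(chain))
    out = list(key)
    for _ in range(length - len(key)):
        followers = chain.get(key)
        if not followers:
            key = random.choice(list(chain))
            continue
        out.append(random.choice(followers))
        key = tuple(out[-len(key):])
    return " ".join(out)

if __name__ == "__main__":
    with open(EXPORT_FILE, encoding="utf-8") as f:
        words = f.read().split()
    print(babble(build_chain(words)))

Fed a couple of megabytes of comments, it produces roughly the same flavor of plausible nonsense.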
Now, how can I export everyone else's comments as one plain text file?
posted by FelliniBlank at 5:22 PM on January 2, 2011
cortex: "- to have an archive in case of a nuclear strike that destroys the site and its backups"
I do it for this reason as part of my monthly backup regime. I invest a lot of time on this website, and I don't ever want to lose that.
Don't take it personally, pb - I do it for Gmail and other will-never-break sites too!
posted by l33tpolicywonk at 5:38 PM on January 2, 2011
cjorgensen: "delmoi prints his off every month. He's solely responsible for the deforestation of the rain forests."
Yeah, I thought it was really weird when he brought the whole thing to our meetup and just slammed it down on the table like that...
posted by l33tpolicywonk at 5:39 PM on January 2, 2011
you can run the exported comments through the dissociated press command in emacs
Weird, I just today noticed that you can download all your Facebook contributions, and started that process just for that reason. (Still waiting for the email that says my download is ready.)
posted by ctmf at 6:00 PM on January 2, 2011
In addition to reasons already mentioned, you can run the exported comments through the dissociated press command in emacs for a sort of do-it-yourself MarkovFilter, since the real one went away and hasn't come back.
I've been building this Rube Goldberg machine out of old typewriters, scissors, and a few hot glue guns just in case the power goes out.
posted by Sailormom at 6:21 PM on January 2, 2011
Now, how can I export everyone else's comments as one plain text file?
Heh. I've actually been wrestling lately with a lot of ideas tied to one bigger idea: the Metafilter database as a corpus of Internet English. I've used bits of the db for little language-related experiments in the past—MarkovFilter, the Word Clouds thing—but I've never really made the effort to tackle the thing as a whole and try and do serious language analysis on it.
And so I've been digging into some basic ideas the last few weeks, working up in particular a script that tokenizes raw comments and creates a frequency table for the results. Initial tests there have gone well: doing a year at a time across the big three subsites, I've been able to create frequency tables for on the order of 80 million words in a few minutes. Doing larger portions at a go is a bit trickier just for memory-footprint reasons, but I'm working on it.
But I've also never done any kind of corpus construction before, and in trying to figure out how to potentially approach the gap between mefi-db-as-is and mefi-db-contents-as-linguistics-corpus I've been doing what reading I can on the theory and practice of building corpora. There are a lot of questions that I hadn't considered, or at least not in detail, before, both philosophical (what's the intended use of such a corpus, who is the intended audience, what kind of divisions or distinctions in category of text and social vector of authors will it represent and how) and technical (how would such a corpus be distributed or otherwise made available for use, how should the corpus be represented in data/format terms, what sort of post-processing or annotation, if any, should be done to the raw comment data).
And one of the big questions is: what are the implications of making the comment data itself more vs. less directly accessible, in flat bulk for download vs. via some sort of search interface? Setting aside the practicality of making available going-on-a-billion words of comment text in a flat file format, that'd also be something approximating a mirror of the site content, which raises practical and ethical questions in a way that making stuff available via a search interface doesn't so much do.
As it is, there's the practical fact that all comment data is available via direct scraping to someone who really wanted to go there (and was careful enough about pacing their scrape to not get blocked at the IP level by us for dragging down site responsiveness with a bulldog spider), but that it can be done doesn't mean it's something we think is a great idea or something that Joe Random Mefite would really appreciate as far as scraping all of their comments in the process is concerned. So building a corpus that's a (certainly at least slightly filtered/processed, but perhaps not more than that) dump of the whole of the db's comment fields isn't a trivial decision.
Making it available via a search interface seems a lot more immediately reasonable, since that's closer to the current provided functionality of our internal search anyway; the distinction would be one of specific search functionality more than anything, providing a more fine-grained (and linguistically useful) method of looking for specific kinds of words or strings or whatnot than our normal search does. Seems like a sound approach, though it limits the flexibility with which a given linguist can dig into the data since they can't go surfing through the raw text on their own terms and are limited to whatever search processes we provide.
The tl;dr on all this is that there's a tension between the de facto accessibility of every user's comment history to someone who really wanted to go scraping manually for it and the question of how comfortable any given user would be with the idea of having that comment history handed over, with a smile and a nod and in a convenient lump, to any passing stranger without that user having any say in the exchange. My inclination is to be conservative about such things and opt on the side of keeping things more like the status quo, which argues against anything that effectively makes it trivial to grab huge hunks of data from the comment db tables.
As someone who is enthusiastic about word nerdery and with a constantly growing interest in computational linguistics, I see a lot of exciting potential in freely-distributed, conveniently assembled corpora of contemporary English and I think a well-constructed Mefi corpus could be a really interesting and useful resource for a lot of legitimate research, mefi-related or more general. But the question of how to make that happen has only gotten bigger and more complicated as I've dug farther into the ideas I've been having and the literature that I've been able to find on the subject.
posted by cortex (staff) at 6:34 PM on January 2, 2011 [3 favorites]
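The tokenize-and-count step cortex describes is conceptually simple, whatever the engineering around it looks like. A minimal sketch in Python; the regex tokenizer and the comments.txt filename are illustrative assumptions, not the actual script:

import re
from collections import Counter

# Placeholder input: one comment per line of plain text.
COMMENTS_FILE = "comments.txt"

# Crude tokenizer: runs of lowercase word characters and apostrophes.
TOKEN_RE = re.compile(r"[a-z']+")

def frequency_table(path):
    # Count how often each token appears across the whole file.
    counts = Counter()
    with open(path, encoding="utf-8") as f:
        for line in f:
            counts.update(TOKEN_RE.findall(line.lower()))
    return counts

if __name__ == "__main__":
    for word, n in frequency_table(COMMENTS_FILE).most_common(20):
        print(n, word)

Streaming line by line keeps memory proportional to the vocabulary rather than to the raw text, which is one way to keep the footprint manageable on larger slices.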
Based on the current counts of (non-deleted) threads in the infodump, I estimate that if you set your screen-scraper spider to retrieve one URL every 10 seconds you could suck down the whole thing in 32.7 days. I'm not sure if 10 seconds would be considered totally harmless, but even if you tripled it to 30 seconds it would still be a feasible undertaking, and once you had all the backlog out of the way, getting just the parts that have changed from there on would be considerably less work. Being open for a year does make AskMe threads a little hard for a scraper to deal with efficiently, assuming you want fresh data; otherwise you just configure it to retrieve only closed threads and accept a year's delay.
posted by Rhomboid at 7:18 PM on January 2, 2011
(That was just curiosity about the scale of such a thing; I have no intention of setting up any such system.)
posted by Rhomboid at 7:19 PM on January 2, 2011
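For scale, the 32.7-day figure implies on the order of 280,000 thread URLs. The back-of-the-envelope arithmetic, in Python, with the thread count inferred from that figure rather than read out of the infodump:

# Thread count inferred from Rhomboid's 32.7-day figure, not an official number.
THREADS = 282_000
SECONDS_PER_DAY = 86_400

for delay in (10, 30):  # seconds between requests
    days = THREADS * delay / SECONDS_PER_DAY
    print(delay, "seconds per request:", round(days, 1), "days for one full pass")

At 30 seconds per request that's a bit over three months for the backlog, after which incremental updates are much cheaper.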
cortex, this may or may not be useful to you, but the Stanford CS department has made the full text of its data mining course textbook available to all, for free. It contains much about document processing—analogous to comments here, I'd imagine—that may be applicable to your playwork.
Full book in a single PDF.
posted by speedo at 1:03 PM on January 3, 2011 [1 favorite]
Ooh, fantastic, speedo. Probably some useful stuff in there, yeah, for this project and for the sort of stuff I get up to more generally. Thanks!
posted by cortex (staff) at 1:09 PM on January 3, 2011
(and was careful enough about pacing their scrape to not get blocked at the IP level by us for dragging down site responsiveness with a bulldog spider)
ewww...
but 10 legs?
posted by russm at 11:43 PM on January 3, 2011
In addition to reasons already mentioned, you can run the exported comments through the dissociated press command in emacs for a sort of do-it-yourself MarkovFilter, since the real one went away and hasn't come back. Er... not that I would do such a thing of course, this is totally theoretical.
What people don't realise is that FishBike's scripts became self-aware several months ago when he hooked them into a self-modifying LISP ELIZA program and accidentally left it running over a weekend. Since then any comment purportedly from FishBike has actually been the MarkovFilter-type output of these scripts combined with queries they are running against their own copy of the MeFi datadump. The location of the real FishBike is unknown.
posted by Electric Dragon at 2:49 PM on January 4, 2011
Figures that Facebook's dump would be not in a useful format, but in... HTML, just like the live page. With everyone else's comments and crap. Still, maybe I can come up with a sed recipe that will isolate my own stuff in a text file. Anyone else tried this?
posted by ctmf at 7:37 PM on January 4, 2011
ctmf - something like BeautifulSoup or nokogiri (or whatever, depending on your language preferences) is a much easier way to scrape selected content out of an HTML document... I've done my fair share of nasty hand-crufted scraping, but for multi-line content that you're selecting based on a nearby string, well, sed surely can't be the best tool for the job...
posted by russm at 11:08 PM on January 4, 2011
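In that vein, a minimal BeautifulSoup sketch for pulling your own posts out of a Facebook wall.html; the markup details and the "My Name" marker are guesses about that file's structure, so the selection logic would need adjusting against the real thing:

from bs4 import BeautifulSoup  # pip install beautifulsoup4

WALL_FILE = "wall.html"   # the wall page from the Facebook data download
MY_NAME = "My Name"       # placeholder for the name that prefixes your own posts

with open(WALL_FILE, encoding="utf-8") as f:
    soup = BeautifulSoup(f, "html.parser")

# Walk text blocks and keep the ones that start with the author's name.
# The real markup will suggest a much more precise selector than this.
seen = set()
for block in soup.find_all(["div", "p"]):
    text = block.get_text(" ", strip=True)
    if text.startswith(MY_NAME) and text not in seen:
        seen.add(text)
        print(text[len(MY_NAME):].strip())

Same idea as the sed recipe below: find the name marker and keep what follows, just with the HTML handled by an actual parser instead of regular expressions.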
grep -A2 'my name' wall.html | sed -f sedfile > fbtext.txt
where sedfile is:
# strip all HTML tags
s/<[^>]*>//g
# drop the leading "my name" label
s/my name *//
# delete "--" separator lines
/^--/d
# delete "new photos" notification lines
/new photos/d
# delete blank lines
/^$/d
It takes advantage of the specific structure of wall.html, but it does the trick, close enough.
posted by ctmf at 4:54 PM on January 29, 2011