AO3 Exchanges: Finding Migrated Tags
Aug. 8th, 2017 07:55 pm![[personal profile]](https://www.dreamwidth.org/img/silk/identity/user.png)
There is a known issue in AO3 tag sets where some tags (usually, but not always, AO3 canonicals) will "wander" or "migrate" from the fandom they were initially placed in to other fandoms, probably related to the first, but possibly not. If these migrating tags aren't fixed (by replacing them with more specific disambiguated versions associated with the right fandom and only the right fandom), problems can arise during the sign up stage of exchanges.
While doing some of the admin for Round 2 of Fandom Growth Exchange, I hit upon a spreadsheet-based method to search for these migrating tags. As far as I'm aware, exchanges are currently relying on people spotting these issues manually, so I thought it would be worth writing up my method in case others want to use it too. (Given that tag sets are publically available, it doesn't even need to be mods who do it; third parties who just want to be helpful can too.) Fair warning: it's very kludgey, somewhat time-consuming (possibly a couple of hours or so, for a large tag set, though the time taken can be reduced by working collaboratively once the spreadsheet has been set up), and throws up a lot of false positives (weeding these out is what makes it time-consuming). However, I believe that it should (allowing for human error creeping in at the weeding out stage), find all migrating tags in a tag set.
ETA: There is a very helpful comment here from
mornelithe_falconsbane that is well worth reading. Makes some of the steps a lot easier. At some point in the future when I have more time I will incorporate it into these instructions proper.
(Confession time: I didn't realise the usefulness of the fandom names appearing in the list when I first did this for FGE, and just deleted them from the list. Then in a much longer version of Step 10, the mod team looked manually at every single tag with 2 or more occurrences to check it always appeared in the same fandom. Many apologies to
mossy_bench and
prosodiical for creating so much extra work!)
I've just done this to the current state of the
femslashex tag set (just shy of 500 fandoms), and it took about an hour, although I was also writing the instructions up at the time. The false positive rate (i.e. the proportion of tags occurring more than once which weren't migrating) was just under 95%. (On Fandom Growth, we had somewhere around 85%, but I think that tag set is structured differently and has more potential for migration:
femslashex uses a lot of umbrella fandoms, and by the time I did this exercise they had already been doing some clean up.)
While doing some of the admin for Round 2 of Fandom Growth Exchange, I hit upon a spreadsheet-based method to search for these migrating tags. As far as I'm aware, exchanges are currently relying on people spotting these issues manually, so I thought it would be worth writing up my method in case others want to use it too. (Given that tag sets are publically available, it doesn't even need to be mods who do it; third parties who just want to be helpful can too.) Fair warning: it's very kludgey, somewhat time-consuming (possibly a couple of hours or so, for a large tag set, though the time taken can be reduced by working collaboratively once the spreadsheet has been set up), and throws up a lot of false positives (weeding these out is what makes it time-consuming). However, I believe that it should (allowing for human error creeping in at the weeding out stage), find all migrating tags in a tag set.
ETA: There is a very helpful comment here from
![[personal profile]](https://www.dreamwidth.org/img/silk/identity/user.png)
- Open the main tag set page. Really open it; do "Expand All" to every fandom category. (It's easiest to do this working up from the bottom, as you then don't have to scroll through all the stuff that's just been expanded to get to the next "Expand All" button.) Copy the entire contents of this page onto the clipboard. (Easiest to do with keyboard shortcuts, in my opinion, rather than trying to select with the mouse -- in Windows, Ctrl-A followed by Ctrl-C.) Leave the tag set with everything expanded open in a tab; you will need it later.
- Open a new spreadsheet. (If you don't have your own spreadsheet software, you can use sheets.google.com -- you will need to sign into/create a Google account for this; I have done these instructions primarily for Google Sheets, but noted where Excel is different. Apologies to those using other software.)
- With the cursor in cell A1, paste the contents of the clipboard into the spreadsheet. It should automatically separate itself out into different cells in the first column. Some of these, including the very first ones, will contain AO3 boilerplate bits, not tags at all. Don't worry about this; they will go away by themselves in Step 6. (If you're using Excel, the heights of your cells may go very large at this point; an optional but recommended step is to reset them before carrying on.)
- At the top of the second column (cell B1), enter the formula:
=COUNTIF(A$1:A$10000,A1)
(Note: the$
s are important, as they keep the range correct as you copy the formula in the next step. The10000
is chosen to be A Large Number; if you have a really large tag set, check that the data does indeed stop before cell A10000, and if it doesn't, increase the number in the formula.) - The formula counts the number of occurrences of each entry in column A. You now need to copy it down the whole spreadsheet. There are two ways to do this:
- Using the mouse, click on the very bottom right corner of the cell, and scroll down while holding the button down. When you release the button, the formula will copy into all the selected cells. With very large ranges such as we're dealing with here, this can take a while, and will probably make your mouse button finger cramp up a bit, but it will work.
- Alternatively, copy and paste the formula from cell B1 into cell B2, then select both cells and copy and paste into B3 and B4. Then select all four cells and copy into cells B5 to B8. Thanks to the power of exponential growth, even if your tag set does have 10,000 entries in column A, you will only need to do fourteen doublings. (This method works especially well if you prefer using keyboard shortcuts.)
- You now have a spreadsheet of everything on the tag set page, tag or otherwise, and how many times it appears. The next step is to sort it so that the tags with the most appearances come at the top. In Sheets, you just need to go into the Data menu and select "Sort sheet by column B, Z->A" (which for numbers is descending order); in Excel, select columns A and B, do Data->Sort, and then Sort by "Column B", "largest to smallest". All the non-tag parts of the page only appear once, so they will be sorted into the bottom, along with the tags that only appear once and therefore definitely aren't migrating. If you want you can delete all these, but it's not actually necessary.
- You now need to remove the duplicate entries from the list. On Sheets, it's easiest to do this by starting a new sheet and typing in cell A1:
=UNIQUE(Sheet1!A1:Bxxxx)
whereBxxxx
is the last cell in the sorted column B with a 2 instead of a 1 (you'll have to scroll down to find this). In Excel, you can just select the whole range and use "Remove duplicates" in the Data tab. - You now have a list of all the tags which appear more than once, and how many times they appear. The vast majority of these are false positives: they appear more than once because their fandom is associated with more than one top level category in the AO3 tagging system (e.g. "Star Wars - All Media Types" is part of "Books & Literature", "Cartoons & Comics & Graphic Novels", "Movies" and "Video Games", so any tag under this fandom appears 4 times in the list). Because the order has been preserved in the previous steps, fandom names appear followed by that fandom's tags in your list, except that migrated tags appear higher up than their parent fandom's name, because they are appearing in places they shouldn't.
- (Optional, but recommended if working collaboratively) This is the stage where if you want, you can pretty things up a bit and add column headings etc. (Column C is going to be for which fandoms the migrated tags are in); you can share the spreadsheet at this stage, and each person can take a chunk of the list to check (i.e. the next two steps can be done by multiple people working in parallel).
- Go through the list, identifying the tags that have a number which is "too high". If so, go back to the tag set that you left fully open in Step 1, and search in the page for that tag; you should then be able to find which fandoms it currently appears under; make a note of these in Column C. One of these will be the correct one, and the others are ones it's migrated into. These are the tags that need to be fixed later. (If you know which fandom they originally appeared in, so much the better, but if not you may have to ask the nominators.)
- Whenever you find a fandom name, check that the number of tags below it is the same as that given in brackets; if not, some of its tags have migrated (you hopefully found them further up the list already, but this is a useful double check). Checking against the tag set page that the first and last tags are correct is a helpful, especially if the next entry is not a fandom name but a migrated tag from further down the list. Tags that you establish have not migrated can be hidden, or deleted wholesale, according to preference. If you're working collaboratively, you may find it easier to just mark them as OK in Column C and then remove them all from the list at the end.
- As you go along, you should find that tags that have migrated together appear together, so once you have a list of which fandoms they're in, this can be cut and pasted.
- Repeat until you reach the end of the list. What you are left with is a list of all migrating tags that need to be fixed.
(Confession time: I didn't realise the usefulness of the fandom names appearing in the list when I first did this for FGE, and just deleted them from the list. Then in a much longer version of Step 10, the mod team looked manually at every single tag with 2 or more occurrences to check it always appeared in the same fandom. Many apologies to
![[personal profile]](https://www.dreamwidth.org/img/silk/identity/user.png)
![[personal profile]](https://www.dreamwidth.org/img/silk/identity/user.png)
I've just done this to the current state of the
![[community profile]](https://www.dreamwidth.org/img/silk/identity/community.png)
![[community profile]](https://www.dreamwidth.org/img/silk/identity/community.png)
no subject
Date: 2017-08-08 08:26 pm (UTC)no subject
Date: 2017-08-09 10:50 am (UTC)no subject
Date: 2017-08-10 03:52 am (UTC)google sheets/excel/me is like my OT3, feel free to ignore me
If you're in Google Sheets, then you can skip step 5 if you type the formula into B1 like this:
=ArrayFormula(if(A:A="","",COUNTif(A:A,A:A)))
It'll do the count for you in every B cell, so long as there's actually something in the corresponding A cell.
And I don't know if this helps, but you can also use import html to import the fandom and pairings into a single cell (which you can then use unique on?) Unfortunately, it takes a bit of work, because you need to grab the top-level lists and the number of them changes as fandoms are added during noms.
For example, for the femslashex tagset
={IMPORTHTML("http://archiveofourown.org/tag_sets/1926","list",10);IMPORTHTML("http://archiveofourown.org/tag_sets/1926","list",82);IMPORTHTML("http://archiveofourown.org/tag_sets/1926","list",160)}
Will import all of the fandoms and ships in anime, books, and cartoons. I'm not really sure it's helpful given what you're doing, but it's a cool feature? to find the right list number, start at 10 (anime/manga), and then add the number of anime fandoms plus 1 for books, then add the number of book fandoms to that +1 to get the list # for cartoons.
And you can separate the fandoms into one column and the ships into another if you use split and arrayformula (ctrl+shift+enter automatically inserts your formula into an arrayformula)
=arrayformula(split(A:A,"↓ ↑",false))
(when your imported HTML is in the A column)
I believe you can import the fandom and pairings using importxml, too, but I...am super not good at websites. And Xpath is really confusing. The closest I could get is
=IMPORTXML("http://archiveofourown.org/tag_sets/1926", "//li/ol")
which certainly gets you relevant data, but the order makes no sense to me. But maybe you're better than I am at that kind of thing?
(sorry, this got really big /o\)
no subject
Date: 2017-08-10 11:31 am (UTC)I won't have time to experiment again for quite a while, but when I do I will update. For now, I'm going to ETA that people should read your comment as well.