Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[WIP] Migration utf8 -> utf8mb4 and test emojis for comment and answer #3007

Closed
wants to merge 12 commits into from

Conversation

namangupta01
Copy link
Member

@namangupta01 namangupta01 commented Jul 7, 2018

Make sure these boxes are checked before your pull request (PR) is ready to be reviewed and merged. Thanks!
#2928

  • tests pass -- look for a green checkbox ✔️ a few minutes after opening your PR -- or run tests locally with rake test
  • code is in uniquely-named feature branch and has no merge conflicts
  • PR is descriptively titled
  • PR body includes fixes #0000-style reference to original issue #
  • ask @publiclab/reviewers for help, in a comment below

We're happy to help you get this ready -- don't be afraid to ask for help, and don't be discouraged if your tests fail at first!

If tests do fail, click on the red X to learn why by reading the logs.

Please be sure you've reviewed our contribution guidelines at https://publiclab.org/contributing-to-public-lab-software

We have a loose schedule of reviewing and pulling in changes every Tuesday and Friday, and publishing changes on Fridays.

Thanks!

@plotsbot
Copy link
Collaborator

plotsbot commented Jul 7, 2018

2 Warnings
⚠️ New migrations added. Please update schema.rb.example by overwriting it with a copy of the up-to-date db/schema.rb. Also, be aware to preserve the MySQL-specific conditions for full-text indices.
⚠️ It looks like you merged from master in this pull request. Please rebase to get rid of the merge commits – you may want to rewind the master branch and rebase instead of merging in from master, which can cause problems when accepting new code!
2 Messages
📖 @namangupta01 Thank you for your pull request! I’m here to help with some tips and recommendations. Please take a look at the list provided and help us review and accept your contribution! And don’t be discouraged if you see errors – we’re here to help.
📖 Pull Request is marked as Work in Progress

Generated by 🚫 Danger

@namangupta01
Copy link
Member Author

Hey @jywarren, as discussed in #2928 I have added the tests here

@ghost ghost removed the in progress label Jul 9, 2018
@namangupta01 namangupta01 reopened this Jul 9, 2018
@ghost ghost added the in progress label Jul 9, 2018
@jywarren
Copy link
Member

jywarren commented Jul 9, 2018

I restarted this, but if you see this error you can restart it too by closing and reopening the PR!

@namangupta01
Copy link
Member Author

@jywarren Tests passed in travis. But when i tried on local machine using mysql it failed.

@namangupta01
Copy link
Member Author

screen shot 2018-07-09 at 11 48 23 pm

@jywarren
Copy link
Member

jywarren commented Jul 9, 2018

Hmm, I think we may need to redo the mysql encoding as in #2665

Do you want to try the suggestion @icarito made on changing encoding to utf8mb4 here: #2209 (comment) ?

@namangupta01
Copy link
Member Author

Sure! So should we change the encoding of the whole db or just the tables or columns we need?

@jywarren
Copy link
Member

jywarren commented Jul 9, 2018 via email

@namangupta01
Copy link
Member Author

@jywarren I have changed encoding for comment field in comments table.

@namangupta01
Copy link
Member Author

Pushing to unstable

@jywarren
Copy link
Member

jywarren commented Sep 7, 2018

Ah, did this end up working?

@SidharthBansal
Copy link
Member

As the person is inactive for more than a month, I am closing the PR. In case you want to push changes please feel free to open a new PR.
Thanks for contributing at Public Lab

@ghost ghost removed the in progress label Dec 7, 2018
@jywarren
Copy link
Member

jywarren commented Dec 7, 2018

Actually this one I'd like to reopen and take on myself - we really do need to solve this issue for many tables. Thanks!

@jywarren jywarren reopened this Dec 7, 2018
@ghost ghost assigned jywarren Dec 7, 2018
@ghost ghost added the in progress label Dec 7, 2018
@icarito
Copy link
Member

icarito commented May 21, 2019

I think this last reference that I shared looks the simplest / most modern.
So I say we merge this when we have this:

  • Add to my.cnf
 # in /etc/mysql/my.cnf add the following in the correct sections
  [client]
  default-character-set = utf8mb4

  [mysqld]
  character-set-client-handshake = FALSE
  character-set-server = utf8mb4
  collation-server = utf8mb4_unicode_ci

  [mysql]
  default-character-set = utf8mb4
  • write a migration like this:
# For each database:
ALTER DATABASE database_name CHARACTER SET = utf8mb4 COLLATE = utf8mb4_0900_ai_ci;
# For each table:
ALTER TABLE table_name CONVERT TO CHARACTER SET utf8mb4 COLLATE utf8mb4_0900_ai_ci;
  • modify database.yml.example (note manual step in production):
  # database.yml
  development:
    adapter: mysql2
    database: db
    username:
    password:
    encoding: utf8mb4
    collation: utf8mb4_0900_ai_ci

That should be about it!
To write the above migration here's some code in Ruby that could be used as a basis:

require 'rubygems'
require 'sequel'
require 'mysql2'
database = 'your_database_name'
username = 'your_user_name'
password = 'password'
host = 'localhost'

DB = Sequel.connect("mysql2://#{username}:#{password}@#{host}/#{database}")
sw_tables = DB["show tables"].all.map{|x| x.first[1]}
puts DB["ALTER DATABASE #{database} CHARACTER SET = utf8mb4 COLLATE = utf8mb4_unicode_ci"].all.inspect
sw_tables.each do |t|
  puts DB["ALTER TABLE #{t} CONVERT TO CHARACTER SET utf8mb4 COLLATE utf8mb4_unicode_ci;"].all.inspect
end

Also note that if there are indexed fields of type VARCHAR(255) then we may need the following initializer:

# config/initializer/mysqlpls.rb

require 'active_record/connection_adapters/abstract_mysql_adapter'

module ActiveRecord
  module ConnectionAdapters
    class AbstractMysqlAdapter
      NATIVE_DATABASE_TYPES[:string] = { :name => "varchar", :limit => 191 }
    end
  end
end

Credit for these instructions to David Chua.

# changing table that will store unicode execute:
execute "ALTER TABLE comments CONVERT TO CHARACTER SET utf8mb4 COLLATE utf8mb4_bin"
# changin string/text column with unicode content execute:
execute "ALTER TABLE comments MODIFY comment VARCHAR(191) CHARACTER SET utf8mb4 COLLATE utf8mb4_bin"
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we really feel comfortable just truncating the field?
Let's not truncate the field just yet (line 6), it shouldn't be necesssary.

Also let's use collation utf8mb4_0900_ai_ci it seems to be preferred (we run MariaDB 10.2 which supports this).

@jywarren
Copy link
Member

jywarren commented May 22, 2019 via email

@SidharthBansal
Copy link
Member

Hi, just checking if you've gotten stuck on this at all, or if I could help in any way? Thanks!

@jywarren
Copy link
Member

jywarren commented Jan 17, 2020 via email

@SidharthBansal
Copy link
Member

Request accepted: Here is the task https://codein.withgoogle.com/dashboard/tasks/6653314876309504/

@CodeSarthak
Copy link
Collaborator

I love emojis too!
There is a good reason that this is delayed, I think I failed to express it clearly, is because I found that this is likely a long, delicate, manual process of migrating the database. It's not just a matter of merging this patch, unfortunately.
I'll start making a list of steps from below references:

Step 1: Create a backup [DONE]
Step 2: Upgrade the MySQL server [DONE]
Step 3: Modify databases, tables, and columns
Step 4: Check the maximum length of columns and index keys [NOT NEEDED because we are using MariaDB > 5.7.7 (10.2)]
Step 5: Modify connection, client, and server character sets
Step 6: Repair and optimize all tables

References

This tool is 404 but supposedly automates part of this:

Hey, I believe it would be better if the GCI task could also be broken down into multiple hard/hof tasks, according to the steps mentioned above, since it would be difficult to complete all the steps by a single person.

Thanks.

@icarito
Copy link
Member

icarito commented Jan 18, 2020

Jeff and I agreed to prioritize this so I'll be reviewing the reference https://mathiasbynens.be/notes/mysql-utf8mb4 again and try to break it down into smaller pieces so it is clear what needs to be done.

@jywarren
Copy link
Member

jywarren commented Jan 24, 2020

Models which could possibly contain emoji, as far as I can determine:

comment.rb
csvfile.rb
image.rb
node.rb
revision.rb
tag.rb
user.rb
user_session.rb
user_tag.rb

@jywarren
Copy link
Member

Hi @icarito - what if we, temporarily, screened for emoji using regex, using an ActiveRecord validation, before we save records. Here is some regex we might be able to use:

https://www.regextester.com/106421

https://github.com/mathiasbynens/emoji-regex

Actually i think we already have some ability to scan for emoji:

def emojify(content)
if content.present?
content.to_str.gsub(/:([\w+-]+):(?![^\[]*\])/) do |match|
if emoji = Emoji.find_by_alias(Regexp.last_match(1))
emoji.raw || %(<img class="emoji" alt="#{Regexp.last_match(1)}" src="#{image_path("emoji/#{emoji.image_filename}")}" style="vertical-align:middle" width="20" height="20" />)
else
match
end
end
end
end
def emoji_names_list
emojis = []
image_map = {}
Emoji.all.each do |e|
next unless e.raw
val = ":#{e.name}:"
emojis << { value: val, text: e.name }
image_map[e.name] = e.raw
end
{ emojis: emojis, image_map: image_map }
end
def emoji_info
emoji_names = %w(thumbs-up thumbs-down laugh hooray confused heart)
prefix = "https://github.githubassets.com/images/icons/emoji/unicode/"
emoji_image_map = {
"thumbs-up" => "#{prefix}1f44d.png",
"thumbs-down" => "#{prefix}1f44e.png",
"laugh" => "#{prefix}1f604.png",
"hooray" => "#{prefix}1f389.png",
"confused" => "#{prefix}1f615.png",
"heart" => "#{prefix}2764.png"
}
[emoji_names, emoji_image_map]
end

Wait, we can use the REGEX provided by that lib, already installed, to replace them!

https://github.com/janlelis/unicode-emoji

Yes, something like this, in the before_save filter:

require "unicode/emoji"
require "gemoji"

string = "String which contains all kinds of emoji:😴▶️🛌🏽🇵🇹🏴󠁧󠁢󠁳󠁣󠁴󠁿2️⃣🤾🏽‍♀️"

string.match(Unicode::Emoji::REGEX) do |match|
  string.gsub!(match.to_s,  ":" + Emoji.find_by_unicode(match.to_s) + ":")
end

Using gemoji to swap for the text name of the emoji: https://github.com/github/gemoji

@jywarren
Copy link
Member

Making a new issue for this: #7469 🙌

@icarito
Copy link
Member

icarito commented Feb 12, 2020

I guess this would be an interim solution but the thing is that emojis are not the only kind of invalid characters for this encoding. For instance acording to https://mathiasbynens.be/notes/mysql-utf8mb4 - he found the issue by trying to insert the U+1D306 TETRAGRAM FOR CENTRE (𝌆) symbol.

So I think it'd be a lot of effort for a few cases, and rather the encoding migration should be done.

@icarito
Copy link
Member

icarito commented Feb 13, 2020

Hi @jywarren @namangupta01 I finally got around to working on this and made it into a new PR here #7479 - also, after I did, I found a snippet that may be a nicer / more complete migration?
https://gist.github.com/amuntasim/f3b12f20a30e9a9f3fb0

Please check and review if we should test this in unstable. Migration is not reversible so I guess we should try to get our tests to work (get unstable to fail before pushing this?). Let me know what you think! I also found that not all emojis should trigger this problem, only those that require 4 bytes such as U+1F4A9 PILE OF POO (💩) or U+1D306 TETRAGRAM FOR CENTRE (𝌆)

Regards,
Sebastian

@icarito
Copy link
Member

icarito commented Apr 6, 2020

Okay an update on this. I tried the latest patch on unstable and the mysql database crashed leaving the database corrupt.
I believe trying to rebuild the entire database is too heavy.
I see two ways forward. One is trying to do this differently / tweak the command and even upgrade mysql if it will help.
Two is looking for a way to change the encoding by dumping the whole database then changing the encoding and then uploading the dump back. This should be way less processing if it can handle, but likely it would mean we can't do it in a migration (can't we? maybe we can....). Looking for solutions here.
I'm going to frist try option 1.

@icarito
Copy link
Member

icarito commented Apr 6, 2020

Restoring a fresh dump from backup for unstable.

@icarito
Copy link
Member

icarito commented Jun 2, 2020

I am feeling and channeling the focus on this issue. 👓 ☄️ 🦸

Here's my current assessment and plan.

Since Naman's well started PR, I expanded the code to convert every one of our database tables and fields to the new encoding, but the times we tried to run this in the staging database, it crashes and corrupts the database, probably because of the size of our data overflowing our RAM.

The plan I have now is as follows:

  • Drop the database of unstable staging instance
  • Recreate the database and make sure with the right encoding
  • Export (mysqldump) the data from production (see if can export subset, e.g. "SQL LIMIT")
    • Data may need to be converted to utf8mb4 before importing it to staging.
  • Perform tests in unstable.publiclab.org
    • Check unicodes and international characters already in the database
    • Write new nodes and comments with problematic characters and check results

@icarito
Copy link
Member

icarito commented Jun 3, 2020

This article from Moodle more or less describes the procedure we want to follow:
https://docs.moodle.org/32/en/Converting_your_MySQL_database_to_UTF8

@icarito
Copy link
Member

icarito commented Jun 11, 2020

Thanks @namangupta01 @jywarren and everyone who contributed to this. I'm closing it in favour of #8003 which is the solution that worked! Good try though, it would've worked with a smaller database.

@icarito icarito closed this Jun 11, 2020
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants