[WIP] Migration utf8 -> utf8mb4 and test emojis for comment and answer #3007

namangupta01 · 2018-07-07T21:38:56Z

Make sure these boxes are checked before your pull request (PR) is ready to be reviewed and merged. Thanks!
#2928

tests pass -- look for a green checkbox ✔️ a few minutes after opening your PR -- or run tests locally with rake test
code is in uniquely-named feature branch and has no merge conflicts
PR is descriptively titled
PR body includes fixes #0000-style reference to original issue #
ask @publiclab/reviewers for help, in a comment below

We're happy to help you get this ready -- don't be afraid to ask for help, and don't be discouraged if your tests fail at first!

If tests do fail, click on the red X to learn why by reading the logs.

Please be sure you've reviewed our contribution guidelines at https://publiclab.org/contributing-to-public-lab-software

We have a loose schedule of reviewing and pulling in changes every Tuesday and Friday, and publishing changes on Fridays.

Thanks!

plotsbot · 2018-07-07T21:44:50Z

	2 Warnings
⚠️	New migrations added. Please update `schema.rb.example` by overwriting it with a copy of the up-to-date `db/schema.rb`. Also, be aware to preserve the MySQL-specific conditions for full-text indices.
⚠️	It looks like you merged from master in this pull request. Please rebase to get rid of the merge commits – you may want to rewind the master branch and rebase instead of merging in from master, which can cause problems when accepting new code!

	2 Messages
📖	@namangupta01 Thank you for your pull request! I’m here to help with some tips and recommendations. Please take a look at the list provided and help us review and accept your contribution! And don’t be discouraged if you see errors – we’re here to help.
📖	Pull Request is marked as Work in Progress

Generated by 🚫 Danger

namangupta01 · 2018-07-09T18:01:55Z

Hey @jywarren, as discussed in #2928 I have added the tests here

jywarren · 2018-07-09T18:05:21Z

I restarted this, but if you see this error you can restart it too by closing and reopening the PR!

namangupta01 · 2018-07-09T18:15:35Z

@jywarren Tests passed in Travis, but when I ran them on my local machine using MySQL they failed.

namangupta01 · 2018-07-09T18:18:55Z

jywarren · 2018-07-09T18:45:51Z

Hmm, I think we may need to redo the mysql encoding as in #2665

Do you want to try the suggestion @icarito made on changing encoding to utf8mb4 here: #2209 (comment) ?

namangupta01 · 2018-07-09T19:01:25Z

Sure! So should we change the encoding of the whole db or just the tables or columns we need?

jywarren · 2018-07-09T20:20:45Z

Let's start by doing just the columns, and if that works, we can probably move the whole db? But I think that's a question for @icarito.

namangupta01 · 2018-07-21T18:26:16Z

@jywarren I have changed the encoding for the comment field in the comments table.

namangupta01 · 2018-07-21T18:54:20Z

Pushing to unstable

jywarren · 2018-09-07T16:16:05Z

Ah, did this end up working?

SidharthBansal · 2018-12-07T00:02:29Z

As the author has been inactive for more than a month, I am closing the PR. If you want to push changes, please feel free to open a new PR.
Thanks for contributing to Public Lab

jywarren · 2018-12-07T19:27:19Z

Actually this one I'd like to reopen and take on myself - we really do need to solve this issue for many tables. Thanks!

icarito · 2019-05-21T04:58:53Z

I think this last reference that I shared looks the simplest / most modern.
So I say we merge this when we have this:

Add to my.cnf

 # in /etc/mysql/my.cnf add the following in the correct sections
  [client]
  default-character-set = utf8mb4

  [mysqld]
  character-set-client-handshake = FALSE
  character-set-server = utf8mb4
  collation-server = utf8mb4_unicode_ci

  [mysql]
  default-character-set = utf8mb4

write a migration like this:

# For each database:
ALTER DATABASE database_name CHARACTER SET = utf8mb4 COLLATE = utf8mb4_0900_ai_ci;
# For each table:
ALTER TABLE table_name CONVERT TO CHARACTER SET utf8mb4 COLLATE utf8mb4_0900_ai_ci;

modify database.yml.example (note manual step in production):

  # database.yml
  development:
    adapter: mysql2
    database: db
    username:
    password:
    encoding: utf8mb4
    collation: utf8mb4_0900_ai_ci

That should be about it!
To write the above migration here's some code in Ruby that could be used as a basis:

require 'sequel'
require 'mysql2'

database = 'your_database_name'
username = 'your_user_name'
password = 'password'
host = 'localhost'

DB = Sequel.connect("mysql2://#{username}:#{password}@#{host}/#{database}")

# Change the database default first, then convert every table.
# DB.run executes a statement that returns no rows, which suits ALTER.
DB.run("ALTER DATABASE `#{database}` CHARACTER SET = utf8mb4 COLLATE = utf8mb4_unicode_ci")
DB.tables.each do |t|
  DB.run("ALTER TABLE `#{t}` CONVERT TO CHARACTER SET utf8mb4 COLLATE utf8mb4_unicode_ci")
end

Also note that if there are indexed fields of type VARCHAR(255) then we may need the following initializer:

# config/initializers/mysqlpls.rb

require 'active_record/connection_adapters/abstract_mysql_adapter'

module ActiveRecord
  module ConnectionAdapters
    class AbstractMysqlAdapter
      NATIVE_DATABASE_TYPES[:string] = { :name => "varchar", :limit => 191 }
    end
  end
end

Credit for these instructions to David Chua.
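
The 191 in that initializer can be derived with a little arithmetic. The sketch below assumes the 767-byte InnoDB index key prefix limit that applies to the older COMPACT and REDUNDANT row formats; that limit is an assumption about the server setup, not something this thread confirms, and with the DYNAMIC row format and large prefixes the limit is higher. The helper name `max_indexable_chars` is made up for illustration.

```ruby
# Assumed InnoDB index key prefix limit for the older row formats.
INDEX_PREFIX_LIMIT_BYTES = 767

# How many characters fit in an index when each may need bytes_per_char bytes.
def max_indexable_chars(bytes_per_char, limit_bytes = INDEX_PREFIX_LIMIT_BYTES)
  limit_bytes / bytes_per_char
end

puts max_indexable_chars(3) # utf8: up to 3 bytes per character    -> 255
puts max_indexable_chars(4) # utf8mb4: up to 4 bytes per character -> 191
```

So an indexed VARCHAR(255) fits under the limit with utf8 but not with utf8mb4, which is why the limit is lowered to 191.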

icarito · 2019-05-21T05:02:50Z

db/migrate/20180721114724_change_comments_to_utf8mb4.rb

+    # changing table that will store unicode execute:
+    execute "ALTER TABLE comments CONVERT TO CHARACTER SET utf8mb4 COLLATE utf8mb4_bin"
+    # changin string/text column with unicode content execute:
+    execute "ALTER TABLE comments MODIFY comment VARCHAR(191) CHARACTER SET utf8mb4 COLLATE utf8mb4_bin"


Do we really feel comfortable just truncating the field?
Let's not truncate the field just yet (line 6); it shouldn't be necessary.

Also let's use collation utf8mb4_0900_ai_ci it seems to be preferred (we run MariaDB 10.2 which supports this).

jywarren · 2019-05-22T00:25:59Z

Thanks so much for pushing this along, Sebastian!!! I will plan to go through it carefully.

SidharthBansal · 2020-01-17T06:13:14Z

Hi, just checking if you've gotten stuck on this at all, or if I could help in any way? Thanks!

jywarren · 2020-01-17T15:44:55Z

I wonder if this could be a hall-of-fame?

SidharthBansal · 2020-01-17T16:02:14Z

Request accepted: Here is the task https://codein.withgoogle.com/dashboard/tasks/6653314876309504/

CodeSarthak · 2020-01-18T12:45:44Z

I love emojis too!
There is a good reason this is delayed, which I think I failed to express clearly: I found that this is likely a long, delicate, manual process of migrating the database. It's not just a matter of merging this patch, unfortunately.
I'll start making a list of steps from below references:

Step 1: Create a backup [DONE]
Step 2: Upgrade the MySQL server [DONE]
Step 3: Modify databases, tables, and columns
Step 4: Check the maximum length of columns and index keys [NOT NEEDED because we are using MariaDB > 5.7.7 (10.2)]
Step 5: Modify connection, client, and server character sets
Step 6: Repair and optimize all tables

References

https://mathiasbynens.be/notes/mysql-utf8mb4

This tool is 404 but supposedly automates part of this:

https://hanoian.com/content/index.php/24-automate-the-converting-a-mysql-database-character-set-to-utf8mb4

Hey, I believe it would be better if the GCI task could also be broken down into multiple hard/hof tasks, following the steps mentioned above, since it would be difficult for a single person to complete all of them.

Thanks.

icarito · 2020-01-18T13:10:06Z

Jeff and I agreed to prioritize this so I'll be reviewing the reference https://mathiasbynens.be/notes/mysql-utf8mb4 again and try to break it down into smaller pieces so it is clear what needs to be done.

jywarren · 2020-01-24T21:00:02Z

Models which could possibly contain emoji, as far as I can determine:

comment.rb
csvfile.rb
image.rb
node.rb
revision.rb
tag.rb
user.rb
user_session.rb
user_tag.rb

jywarren · 2020-02-11T18:16:14Z

Hi @icarito - what if we temporarily screened for emoji with a regex, in an ActiveRecord validation, before we save records? Here are some regexes we might be able to use:

https://www.regextester.com/106421

https://github.com/mathiasbynens/emoji-regex

Actually, I think we already have some ability to scan for emoji:

plots2/app/helpers/application_helper.rb

Lines 9 to 46 in 591f59a

    
  def emojify(content)
    if content.present?
      content.to_str.gsub(/:([\w+-]+):(?![^\[]*\])/) do |match|
        if emoji = Emoji.find_by_alias(Regexp.last_match(1))
          emoji.raw || %(<img class="emoji" alt="#{Regexp.last_match(1)}" src="#{image_path("emoji/#{emoji.image_filename}")}" style="vertical-align:middle" width="20" height="20" />)
        else
          match
        end
      end
    end
  end

  def emoji_names_list
    emojis = []
    image_map = {}
    Emoji.all.each do |e|
      next unless e.raw
      val = ":#{e.name}:"
      emojis << { value: val, text: e.name }
      image_map[e.name] = e.raw
    end
    { emojis: emojis, image_map: image_map }
  end

  def emoji_info
    emoji_names = %w(thumbs-up thumbs-down laugh hooray confused heart)
    prefix = "https://github.githubassets.com/images/icons/emoji/unicode/"
    emoji_image_map = {
      "thumbs-up"   => "#{prefix}1f44d.png",
      "thumbs-down" => "#{prefix}1f44e.png",
      "laugh"       => "#{prefix}1f604.png",
      "hooray"      => "#{prefix}1f389.png",
      "confused"    => "#{prefix}1f615.png",
      "heart"       => "#{prefix}2764.png"
    }
    [emoji_names, emoji_image_map]
  end

Wait, we can use the REGEX provided by that lib, already installed, to replace them!

https://github.com/janlelis/unicode-emoji

Yes, something like this, in the before_save filter:

require "unicode/emoji"
require "gemoji"

string = "String which contains all kinds of emoji:😴▶️🛌🏽🇵🇹🏴󠁧󠁢󠁳󠁣󠁴󠁿2️⃣🤾🏽‍♀️"

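# gsub! with a block visits every match; String#match would only see the first one.
string.gsub!(Unicode::Emoji::REGEX) do |match|
  # find_by_unicode returns an Emoji object (or nil), so use its name for the alias.
  emoji = Emoji.find_by_unicode(match)
  emoji ? ":#{emoji.name}:" : match
end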

Using gemoji to swap for the text name of the emoji: https://github.com/github/gemoji

jywarren · 2020-02-11T18:24:32Z

Making a new issue for this: #7469 🙌

icarito · 2020-02-12T04:34:07Z

I guess this would be an interim solution, but the thing is that emojis are not the only kind of characters that are invalid for this encoding. For instance, according to https://mathiasbynens.be/notes/mysql-utf8mb4, the author found the issue by trying to insert the U+1D306 TETRAGRAM FOR CENTRE (𝌆) symbol.

So I think it'd be a lot of effort for a few cases, and rather the encoding migration should be done.
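
To make the point concrete, here is a minimal sketch of how one might detect the characters that actually cause trouble. It relies on the fact that code points above U+FFFF take four bytes in UTF-8, which MySQL's legacy utf8 charset cannot store. The names `FOUR_BYTE_CHAR` and `needs_utf8mb4?` are hypothetical and not part of the plots2 codebase.

```ruby
# Code points outside the Basic Multilingual Plane (above U+FFFF) need
# four bytes in UTF-8, whether or not they are emoji.
FOUR_BYTE_CHAR = /[\u{10000}-\u{10FFFF}]/

# True when the text contains at least one four-byte character.
def needs_utf8mb4?(text)
  text.encode("UTF-8").match?(FOUR_BYTE_CHAR)
end

samples = {
  "plain text"            => "hello",
  "U+2764 HEAVY HEART"    => "\u2764",     # 3 bytes, fits in legacy utf8
  "U+1F4A9 PILE OF POO"   => "\u{1F4A9}",  # 4 bytes
  "U+1D306 TETRAGRAM"     => "\u{1D306}"   # 4 bytes, not an emoji
}

samples.each do |label, text|
  puts "#{label}: #{text.bytesize} byte(s), needs utf8mb4? #{needs_utf8mb4?(text)}"
end
```

A check like this catches the tetragram as well as emoji, which an emoji-only regex would miss.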

icarito · 2020-02-13T06:22:37Z

Hi @jywarren @namangupta01 I finally got around to working on this and made it into a new PR here #7479 - also, after I did, I found a snippet that may be a nicer / more complete migration?
https://gist.github.com/amuntasim/f3b12f20a30e9a9f3fb0

Please check and review if we should test this in unstable. Migration is not reversible so I guess we should try to get our tests to work (get unstable to fail before pushing this?). Let me know what you think! I also found that not all emojis should trigger this problem, only those that require 4 bytes such as U+1F4A9 PILE OF POO (💩) or U+1D306 TETRAGRAM FOR CENTRE (𝌆)

Regards,
Sebastian

icarito · 2020-04-06T12:51:00Z

Okay, an update on this. I tried the latest patch on unstable and the MySQL server crashed, leaving the database corrupt.
I believe trying to rebuild the entire database in place is too heavy.
I see two ways forward. One is to do this differently: tweak the command, and even upgrade MySQL if that will help.
Two is to dump the whole database, change the encoding, and then load the dump back in. This should take far less processing, if the server can handle it, but it likely means we can't do it in a migration (or can we? maybe we can...). Looking for solutions here.
I'm going to first try option 1.

icarito · 2020-04-06T12:52:00Z

Restoring a fresh dump from backup for unstable.

icarito · 2020-06-02T19:42:02Z

I am feeling and channeling the focus on this issue. 👓 ☄️ 🦸

Here's my current assessment and plan.

Building on Naman's well-started PR, I expanded the code to convert every one of our database tables and fields to the new encoding, but each time we tried to run it against the staging database, it crashed and corrupted the database, probably because the size of our data overflows our RAM.

The plan I have now is as follows:

Drop the database of unstable staging instance
Recreate the database, making sure it has the right encoding
Export (mysqldump) the data from production (see if we can export a subset, e.g. with SQL "LIMIT")
- Data may need to be converted to utf8mb4 before importing it to staging.
Perform tests in unstable.publiclab.org
- Check unicodes and international characters already in the database
- Write new nodes and comments with problematic characters and check results
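
The "convert before importing" step could look roughly like the sketch below, which rewrites the charset and collation declarations in a mysqldump file while leaving data rows alone. This is only an illustration under assumptions: the helper names are hypothetical, it assumes each data row starts with `INSERT ` at the beginning of a line, and it would also rewrite any column comment or default in the schema that happens to contain "utf8". It is not the procedure that was eventually used in #8003.

```ruby
# Matches "utf8" or "utf8mb3", including as a collation prefix such as
# utf8_general_ci, but leaves anything that is already utf8mb4 untouched.
LEGACY_UTF8 = /utf8(?:mb3)?(?!mb4)/

# Rewrite a single line of the dump.
def retarget_dump_line(line)
  # Skip data rows so user content that mentions "utf8" is not rewritten.
  return line if line.start_with?("INSERT ")
  line.gsub(LEGACY_UTF8, "utf8mb4")
end

# Rewrite a whole dump held in a string.
def retarget_dump(sql)
  sql.each_line.map { |line| retarget_dump_line(line) }.join
end

sample = <<~SQL
  /*!40101 SET NAMES utf8 */;
  CREATE TABLE `comments` (
    `comment` text COLLATE utf8_unicode_ci
  ) ENGINE=InnoDB DEFAULT CHARSET=utf8 COLLATE=utf8_unicode_ci;
  INSERT INTO `comments` VALUES ('utf8 rocks');
SQL

puts retarget_dump(sample)
```

For a real dump it would be safer to stream the file line by line rather than hold it in memory, given that memory pressure is the suspected cause of the earlier crashes.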

icarito · 2020-06-03T05:06:08Z

This article from Moodle more or less describes the procedure we want to follow:
https://docs.moodle.org/32/en/Converting_your_MySQL_database_to_UTF8

icarito · 2020-06-11T17:56:26Z

Thanks @namangupta01 @jywarren and everyone who contributed to this. I'm closing it in favour of #8003 which is the solution that worked! Good try though, it would've worked with a smaller database.

Added tests for comment and answer for emojis

c21fbfc

ghost assigned namangupta01 Jul 7, 2018

ghost added the in progress label Jul 7, 2018

namangupta01 added the summer-of-code label Jul 7, 2018

namangupta01 closed this Jul 9, 2018

ghost removed the in progress label Jul 9, 2018

namangupta01 reopened this Jul 9, 2018

ghost added the in progress label Jul 9, 2018

namangupta01 added 3 commits July 21, 2018 23:52

Changed comment encoding

a94b776

Added migration version number to schema.rb.example

8e11d39

Merge branch 'master' into emoji_comment

e2c2336

SidharthBansal closed this Dec 7, 2018

ghost removed the in progress label Dec 7, 2018

jywarren reopened this Dec 7, 2018

ghost assigned jywarren Dec 7, 2018

ghost added the in progress label Dec 7, 2018

Merge branch 'master' into emoji_comment

c534188

icarito reviewed May 21, 2019

jywarren mentioned this pull request Aug 23, 2019

"Hall of Fame" for most stuck issues publiclab/community-toolbox#257

Open

jywarren added the hall-of-fame label Aug 23, 2019

cesswairimu mentioned this pull request Oct 31, 2019

Add welcome page for first-time posters (while held in moderation) #2627

Open

jywarren removed the high-priority label Nov 20, 2019

This was referenced Jan 10, 2020

"Internal Server Error" when posting research note #7062

Closed

test bad revisions #7173

Closed

icarito mentioned this pull request Feb 6, 2020

Users are unable to post Questions #7441

Closed

jywarren mentioned this pull request Feb 11, 2020

Bypass emoji trouble by filtering and converting emoji #7469

Closed

icarito mentioned this pull request Feb 13, 2020

[WIP] Migration utf8 -> utf8mb4 and test emojis for comment and answer #7479

Closed

5 tasks

jywarren mentioned this pull request Apr 28, 2020

Upgrade: Ruby 2.6, Rails 4.2 and Node 13 publiclab/spectral-workbench#499

Merged

5 tasks

icarito mentioned this pull request Jun 8, 2020

Config for dump&restore to UTF8MB4 for emojis #8003

Closed

5 tasks

icarito closed this Jun 11, 2020
