
Import content from external website, UTF-8 character problem in post

+1 vote
81 views
asked Sep 9 in Plugins by esqeudero
edited Sep 10 by esqeudero

I am working on a custom plugin that imports posts from an external website.

I use the following code to create a post on the Q2A platform with data retrieved from the external website.

qa_post_create('Q', null, $title, $content, $format = '', $categoryid, $tags, $userid, $notify = null, $email = null, $extravalue = null, $name = null);

I get $title, $content, and $tags successfully from the external website using Simple HTML DOM Parser, and I use the strip_tags() function to remove the HTML tags and keep only the text.

Everything works, and the qa_post_create() function creates the post. However, the content of the newly created post contains non-UTF-8 characters such as:

&yacute; (ý)
&uuml;
&ldquo;
&mdash;
&rdquo;
&Ccedil; (Ç)

Actually, $title also contains some of these characters, such as ý, but there they appear normally, not as &yacute;. It happens only in the content of the question.

When I just retrieve the data and print it with echo or print_r(), the text appears normal, without any irregular symbols. These symbols appear only when I use the qa_post_create() function to create the question.

Also, when I change the format from '' to 'html' in qa_post_create(), the problem goes away. But I do not want to use format='html'. Is there any other way to fix it?

Q2A version: q2a 1.7.5 customized

1 Answer

+2 votes
answered Sep 9 by pupi1985
selected Sep 10 by esqeudero
 
Best answer

Everything would be clearer if you queried the database and showed exactly what you're storing. You are saying that what you see in the browser is X and you're assuming you're storing X while, in practice, it doesn't really have to be that way. For example, bold text is actually stored in the database as <strong>text</strong>.

The DOM parser is most likely returning the content with HTML entities still encoded (e.g.: é becomes &eacute;). You just need to decode those entities. Check this function: html_entity_decode()

Just run it for the fields that are HTML encoded before inserting them in the database.
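
A minimal sketch of those two steps (strip the tags, then decode the entities), assuming the scraped content is a UTF-8 string. The helper name plain_text_from_html() and the sample input are hypothetical and not part of Q2A; strip_tags() and html_entity_decode() are standard PHP functions.

```php
<?php
// Hypothetical helper (not part of Q2A): turn scraped HTML into plain UTF-8 text.
function plain_text_from_html(string $html): string
{
    // 1. Remove the markup first, so entity-encoded text such as "&lt;b&gt;"
    //    survives as literal characters instead of being stripped as a tag.
    $text = strip_tags($html);

    // 2. Decode named and numeric entities (&yacute;, &ldquo;, &#39;, ...) to UTF-8.
    return html_entity_decode($text, ENT_QUOTES | ENT_HTML5, 'UTF-8');
}

// Illustrative input only; real input would come from the DOM parser.
$raw = '<p>&ldquo;&Ccedil;ay&rdquo; g&uuml;n</p>';
echo plain_text_from_html($raw), "\n";

// The cleaned string can then be stored with the plain-text format (''), e.g.:
// qa_post_create('Q', null, $title, plain_text_from_html($content), '', $categoryid, $tags, $userid);
```

Decoding after stripping is a deliberate choice here: if the order were reversed, text that was only an encoded tag in the source would be decoded into a real tag and then removed.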

commented Sep 10 by esqeudero
Yes, the DOM parser returns HTML content, but I use the strip_tags() function to remove the HTML tags, and the problem above still occurs.

I have not tried an HTML decode function yet; let me see if it works.
commented Sep 10 by pupi1985
I assumed you were removing the tags. However, you will also need to convert the HTML that's left into plain text. I think with those 2 steps you should be fine to store it in plain text. Also, don't miss the first sentence of the answer :)
commented Sep 10 by esqeudero
@pupi1985 Many thanks, the html_entity_decode() function fixed my issue.
Before the fix, the content in the database looked like this: http://i.hizliresim.com/DDg8Bm.png
...