# How can I convert (move) individual webpages into a QA system?


What I have is thousands of HTML pages, each consisting of a question and its answers. How can I move them into a QA system? Example: what I actually have is (sample only):

question2answer.org folder > qa folder / 52327 folder / best-users-per-month.html page (contains the question and answers), for a URL like http://www.question2answer.org/qa/52327/best-users-per-month. This is the structure of my site: just folders and pages. How can I move them into a Q/A system?

This is how the files are laid out on my PC.

selected

You can write a small PHP parser for this. It won't be difficult to construct the post format from an HTML dump, provided it is from a Q2A site. Even otherwise it's not difficult: you have to match all the items like category, tags, date, user, etc. I once did this to restore my site content from cached HTML pages.


So what exactly do I have to do now? I just want to move the questions and answers. I don't want user login details or anything else. I have thousands of pages like this.

<div id="summaryDescription"> gives the content; I guess that is the answer. Similarly, you can find the <div id> or <div class> of all the parts needed in Q2A.

Then you can use a DOM parser to extract them: http://simplehtmldom.sourceforge.net/manual.htm

Finally, insert them into the Q2A posts table.

Yeah, the question title is between the <title> and </title> tags (it is also between the <h1> and </h1> tags).
The question description is inside <div id="summaryDescription">.
All the answer descriptions are inside <div class="postContent"> elements.
All the comments are inside <div class="commentBody" parent_type="question">.

These are the only parts I need.
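
The extraction described here can be sketched without any third-party library. What follows is a minimal sketch in Python (standard library only), not the PHP simple_html_dom approach suggested in the thread. The `PostExtractor` class and `extract_post` helper are hypothetical names; only the `<title>` tag, the `summaryDescription` id and the `postContent` class come from the posts above. It assumes reasonably well-formed markup: an unclosed tag inside a captured element would throw off the depth counter.

```python
from html.parser import HTMLParser

# Elements that never have a closing tag, so they must not affect nesting depth.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img", "input",
             "link", "meta", "source", "track", "wbr"}


class PostExtractor(HTMLParser):
    """Collects the title, question description and post bodies from one page."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.title = ""
        self.description = ""
        self.answers = []
        self._target = None   # "title", "description" or "answer" while capturing
        self._depth = 0       # nesting depth inside the captured element
        self._buffer = []

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return
        if self._target is not None:
            self._depth += 1
            return
        attrs = dict(attrs)
        classes = (attrs.get("class") or "").split()
        if tag == "title":
            self._target = "title"
        elif tag == "div" and attrs.get("id") == "summaryDescription":
            self._target = "description"
        elif tag == "div" and "postContent" in classes:
            self._target = "answer"
        if self._target is not None:
            self._depth = 1
            self._buffer = []

    def handle_endtag(self, tag):
        if self._target is None or tag in VOID_TAGS:
            return
        self._depth -= 1
        if self._depth == 0:
            # Collapse runs of whitespace so the text fits on one line.
            text = " ".join("".join(self._buffer).split())
            if self._target == "answer":
                self.answers.append(text)
            else:
                setattr(self, self._target, text)
            self._target = None

    def handle_data(self, data):
        if self._target is not None:
            self._buffer.append(data)


def extract_post(html):
    """Parse one HTML document and return the filled-in extractor."""
    parser = PostExtractor()
    parser.feed(html)
    parser.close()
    return parser
```

Because comment bodies also carry the `postContent` class in the sample markup shown later in this thread, this sketch collects answers and comments into the same `answers` list.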

Sorry for bothering you. I read that manual but I can't understand it properly. It says to use this code to find divs with an id: $ret = $html->find('div[id]'); -------> (but where do I use it?)

Extract contents from HTML:

// Dump contents (without tags) from HTML
echo file_get_html('http://www.google.com/')->plaintext; -------> (Do I have to use the div id here?) (Where do I run this code?) (Will this modify all the index.html files?)

https://3.bp.blogspot.com/-WmOdvWPy1wU/V6i3jiHDODI/AAAAAAAAAH8/IA3US85eV7oDXVbweCZvbNTNF9GD9CsnwCLcB/s1600/phphtmlparser.PNG

Finally insert them to Q2A posts table --------> I think the Q2A tables are in the SQL database, which is modified through phpMyAdmin. But do I have to insert all this content there manually? Sorry again and thanks in advance.

Do I need to use software like Notepad++ or Adobe Dreamweaver in this process? Notepad++ says it can open multiple files only if they are inside a single folder.

:) You can surely automate this, but you need shell access, either on the server or on your own system. What we did was:
Put all the HTML files in a folder. Then, in the PHP script, iterate through each file, parse the relevant parts, build an SQL query out of them, and write this query to a file. Finally, we just ran these queries on the Q2A database. You can even use the Q2A DB access functions, as there is a function to insert a post. But I do not think that is required for you; you can simply reindex everything via the Q2A admin panel once imported.

In your case, maybe you can create the users and categories first and then the posts, as that takes care of the foreign key dependencies. I suggest getting help from someone decent at scripting/PHP.

All the HTML files are named index.html, so do I have to rename them all? The posts can be from anonymous users; do I still need to create users?

"Finally insert them to Q2A posts table --------> I think the Q2A tables are in the SQL database, which is modified through phpMyAdmin. But do I have to insert all this content there manually? Sorry again and thanks in advance."

I was assuming you have direct shell access to the server. Otherwise, does phpMyAdmin allow bulk SQL queries to be executed? There should be a way. I have shell access to my server, so for me it was straightforward.

And yes, you need to write a script in PHP or whatever language to iterate over all the HTML files. It is not as simple as working in a GUI, but for a programmer it is an easy task.
You do not need to rename them; the script can handle them from their directories. Regarding users, it is up to you: for anonymous users you can simply put NULL in the user part of the query. For genuine users you should decide what to do, as I suppose these users are not in the database. Adding users might be trickier, as you need passwords, emails, etc. Or just attribute all posts to admin, which will be easier.
That's nice. So, you just need to parse the HTML files and make an output file of the SQL insert queries.
As for parsing, what do I have to do? Where do I run this? Are these correct?

<title><?php echo $data['title'] ?></title>
echo file_get_div('summaryDescription')->plaintext;
echo file_get_div('postContent')->plaintext;

$file_contents = '<div class="lot-price-block">Test content</div>';
$value = preg_match_all('/<div\s*class="lot\-price\-block"\s*>(?P<content>.*)<\/div>/', $file_contents, $estimates);
echo "<pre>";
print_r($estimates);

It says this is for a div class; will it work for a div id? http://stackoverflow.com/questions/15272256/preg-match-find-content-between-specific-div-with-class

Well, the script should be a proper PHP script. You are on the right track, but the overall program sequence has to produce the desired result. How big are the HTML files? If they are not big, can you zip them and mail them to me? I can send you the SQL file.

Thank you so much. I want to learn how to do it too, so next time I can do it myself. I have zipped 20 folders containing the index.html files. Can you please record how you do it with a screen recorder (http://www.sketchman-studio.com/download/Rylstim-Screen-Recorder.exe) or with a .gif recorder (http://www.cockos.com/licecap/licecap126-install.exe)? Thanks a lot. Rar file = https://drive.google.com/open?id=0B4RFn3PmjNy9VThOWF9YZDZCTEk

If possible, can the answers be posted with different account names? Otherwise just let it be anonymous.
Oh. In that case I suggest you try it yourself. Even if I screen-record it, it won't be useful for you, as I use Linux and not exe stuff.

Is it possible to do it with Eclipse? http://stackoverflow.com/questions/21473959/insert-data-into-database-using-servlet-and-jsp-in-eclipse . Can you try a Linux screen recorder (https://community.linuxmint.com/tutorial/view/1229)? I can try to use Linux on VirtualBox. If that's not possible, then just tell me in detail what you did, like "move all index files into 1 folder > change.....", so I can try it myself.

You are mentioning a lot of tools; none of them is required. A simple PHP file is all that is needed. First let's convert 1 HTML file, i.e., one index.html. Just go to that directory. Parse that file and output the needed contents, that is, the question content, answer content and comments, along with author names, dates and categories if you need them. Just simple output to the screen.

Next step: each of these outputs can be made into an SQL query. I can help here, as this needs some understanding of the Q2A database.

That's not all, right? We have to do this for all index.html files in all the directories. In Linux, we can do
find . -name "index.html"
and we get the list of all files and then simply pass them one by one to our php script.

Maybe there is some way in Windows, but ...
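
One OS-independent way to build the same file list is a short script. This is a minimal sketch in Python rather than the `find` command or PHP used in the thread; `find_index_files` and `write_input_list` are hypothetical helper names, and the `./relative/path` output format is modelled on what `find .` prints.

```python
import os


def find_index_files(root, name="index.html"):
    """Return relative, forward-slash paths of every `name` file below `root`."""
    found = []
    for folder, _subfolders, files in os.walk(root):
        if name in files:
            relative = os.path.relpath(os.path.join(folder, name), root)
            # Normalise Windows backslashes so the list looks the same everywhere.
            found.append("./" + relative.replace(os.sep, "/"))
    return sorted(found)


def write_input_list(root, list_path):
    """Write one path per line, as UTF-8 with Unix line endings."""
    paths = find_index_files(root)
    with open(list_path, "w", encoding="utf-8", newline="\n") as handle:
        for path in paths:
            handle.write(path + "\n")
    return paths
```

Calling `write_input_list(".", "input.txt")` from the folder that holds the HTML folders would produce a list comparable to `find . -name "index.html" > input.txt`, though the ordering may differ from what `find` prints.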
I can modify all the HTML files at once using Notepad++.
okay. Then you can do it. I have no idea about Notepad++

Will Nokogiri help me?

All of these must be in the same folder as the root directory of the HTML folders.

Now run

php parse.php

It should produce some output files. Open qinsert.sql.
Do likewise for ainsert.sql and cinsert.sql.

category, tags, categorypath: these you have to add yourself, and likewise userid.
Do I need to install php first on my PC?
Yes, you need PHP-cli. On Debian Linux machines:

sudo apt-get install php5-cli
or
sudo apt-get install php7.0-cli

I have already started installing XAMPP; it says it includes PHP. Should I stop and install PHP-cli instead? XAMPP stands for Cross-Platform (X), Apache (A), MySQL (M), PHP (P) and Perl (P).

http://i.imgur.com/vatAh9n.png

When I opened it, there was an icon named "shell". It looks like the command prompt. I typed php -version and it says PHP 7.0.8. But when I type php -version in my normal Windows command prompt, it doesn't work.
You should probably do everything in that shell, then. Windows is like that :)
I just ran "php parse.php" and it produced ainsert.sql, qinsert.sql and cinsert.sql. Does it include all the HTML files listed in input.txt? What is the next step?
No. Open the parse.php file and see: it produces output only for qinsert.sql. The other files will be empty, but you can easily modify the script as you want. You should first create users and categories as you like, and only then do the question insert.

How many users do I need to create? Each answer should be from a new user and not from the question author. All these questions are going into the "Living" category only. I can change the category when I create the SQL query for the next group of new questions.
I opened qinsert.sql; it has:
insert into qa_posts (postid, type, parentid, categoryid, catidpath1, catidpath2, catidpath3, acount, amaxvote, selchildid, userid, upvotes, downvotes,netvotes, views, created, title, content, tags, format) values (,,'Q',null,,,,,,,,,,,,,'','3nder Is a New Hookup App... for Threeways: Sweet or So Gross?','','','html');
insert into qa_posts (postid, type, parentid, categoryid, catidpath1, catidpath2, catidpath3, acount, amaxvote, selchildid, userid, upvotes, downvotes,netvotes, views, created, title, content, tags, format) values (,,'Q',null,,,,,,,,,,,,,'','2 Amp energy drinks= kidney failure?','','','html');
insert into qa_posts (postid, type, parentid, categoryid, catidpath1, catidpath2, catidpath3, acount, amaxvote, selchildid, userid, upvotes, downvotes,netvotes, views, created, title, content, tags, format) values (,,'Q',null,,,,,,,,,,,,,'','॥ सत्यम् शिवम् सुन्दरम् ॥','','','html');
insert into qa_posts (postid, type, parentid, categoryid, catidpath1, catidpath2, catidpath3, acount, amaxvote, selchildid, userid, upvotes, downvotes,netvotes, views, created, title, content, tags, format) values (,,'Q',null,,,,,,,,,,,,,'',':&#39;(','','','html');
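
The statements above leave most value slots empty (`,,`), which a database will reject. A minimal sketch of generating a complete statement follows, in Python rather than the thread's parse.php. `sql_literal` and `build_insert` are hypothetical helpers, the sample row values are made up, and the column names are taken from the output above. Which columns Q2A actually requires is not checked here, and the escaping assumes MySQL's default SQL mode. For anything beyond a one-off import, parameterized queries would be the safer choice.

```python
def sql_literal(value):
    """Render a Python value as a MySQL literal; None becomes NULL."""
    if value is None:
        return "NULL"
    if isinstance(value, (int, float)):
        return str(value)
    # Escape backslashes first, then single quotes (MySQL default SQL mode).
    escaped = str(value).replace("\\", "\\\\").replace("'", "\\'")
    return "'" + escaped + "'"


def build_insert(table, row):
    """Build one INSERT statement from a column -> value mapping."""
    columns = ", ".join(row)
    values = ", ".join(sql_literal(value) for value in row.values())
    return f"INSERT INTO {table} ({columns}) VALUES ({values});"


# Hypothetical sample row: missing values are written as NULL instead of left blank.
row = {
    "type": "Q",
    "parentid": None,
    "userid": None,                      # NULL userid = anonymous post
    "created": "2016-08-18 10:00:00",    # made-up timestamp
    "title": "2 Amp energy drinks= kidney failure?",
    "content": "It's only a sample",
    "tags": "living",
    "format": "html",
}
statement = build_insert("qa_posts", row)
print(statement)
```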

It only retrieves the question heading, so do I have to add appropriate code to the same parse.php to get other details like the question description and answers? Or does it have to be a new parse file?
So, I don't have to modify the code in the HTML files? The HTML files weren't modified to get the 'Title'.
See, parse.php has only the skeleton code which outputs the SQL. You have to do the remaining part like

Then, when a question is created, it will be given a question id. You have to use the same id for its answers as their parent id. Similarly, comments on answers should have the id of the answer as their parent id.
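
The parent id rule can be sketched as follows, in Python rather than PHP. `assign_ids` is a hypothetical helper and the nested input shape (questions holding answers holding comments) is an assumption for illustration. The type codes `Q`, `A` and `C` follow the `'Q'` seen in the generated SQL, and `first_id` would need to be higher than any postid already in the table.

```python
def assign_ids(questions, first_id=1):
    """Flatten nested posts into rows with postid/parentid, parents first."""
    rows = []
    next_id = first_id
    for question in questions:
        question_id = next_id
        next_id += 1
        rows.append({"postid": question_id, "type": "Q", "parentid": None,
                     "content": question["content"]})
        for answer in question.get("answers", []):
            answer_id = next_id
            next_id += 1
            # An answer points at the question it belongs to.
            rows.append({"postid": answer_id, "type": "A", "parentid": question_id,
                         "content": answer["content"]})
            for comment in answer.get("comments", []):
                # A comment on an answer points at that answer.
                rows.append({"postid": next_id, "type": "C", "parentid": answer_id,
                             "content": comment})
                next_id += 1
    return rows
```

Emitting parents before children means the rows can be inserted in list order without a child ever referring to an id that does not exist yet.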

I have added a bit more to the parse file. But unless you know and understand some PHP, there is no way you can do this. Try to get help from someone who knows PHP. I already have tens of pending jobs, so please excuse me.

Why has it now stopped producing results in qinsert.sql? I just added more folders containing .html files into the root where the parse, txt and simple html dom files are. I have put the paths of all the HTML files in the text file.

https://drive.google.com/open?id=0B4RFn3PmjNy9T3Y5MFNrd1VPUXM ----> How the files are present

If you add more HTML index files, you have to run this command, assuming a Linux system.
It puts all their paths in the input file. There might be a Windows equivalent.

"find . -name 'index.html' >input.txt"
I guess your error is due to Windows using "\" for paths and Linux using "/".
I found this page which discusses the topic > http://stackoverflow.com/questions/6285148/windows-equivalent-for-unix-find-command-to-search-multiple-file-types but I'm still not sure what it is.
But it worked fine before when using your input txt file! Should I change all "\" to "/"?
(  : - | )...
See, what I gave is more than enough for anyone to do an import. If you cannot do it, just do the following and NOT SOMETHING ELSE:

1. Give me the URL of your Q2A site.
2. Add the categories you want the questions imported into.
3. Create some users if you want; otherwise all questions and answers will be anonymous.
4. Check the output files to see whether the questions, answers and comments are correct or anything needs a change.
5. Tell me what the question description should be. I saw only the question title in the HTML files.

If you can do these, I can complete the SQL files for you, which you can run on your database.
I know you have done more than anyone could, and I'm ready to pay for your work. I feel so sorry for troubling you, but I didn't have anyone else to help. I can also be dumb sometimes. My Q/A site = www.userring.com. The question description can be found by searching for <div id="summaryDescription">, and the answer description by searching for <div class="postContent">. If no description is found, then that particular question doesn't have one.

The comments are in <div class="comment">, and <a class="ravesLabel"> holds the votes for that comment. Votes aren't so important.

<div class="commentBody" parent_type="question">
<div class="showParent"></div><div class="chupchick"></div><div class="avatar"><a class="authorLink" data="2460383"><div class="maxIcon"></div><img src="/images/static/blank.gif" lazysrc="/profiles/000000001/profiles_1202SHAvatarFteal_1034_180115_square.jpeg" alt="Tyson" title="Tyson" class="lazyImage" height="50" width="50"></a></div>
<div class="comment">
<span id="" class="ravesContainer">
<a class="ravesLabel">
<strong>+2</strong>
<span class="chubchik"></span>
</a><span title="Good Quality" class="needAuth unRavedUp"><span class="ravesIcon"></span></span></span>
<div class="postContent">I think it is a good thing to give a child a gun and show him how it works and how to fire it so the child comes accross a gun the child will be able to make it safe and handle it without indangering lives rather than picking it up and pulling the trigger because they have never seen a real gun before.</div><div class="postActions"><a class="permalink" href="/living/2-year-old-accidentally-shoots-dad-and-kills-him-whats-your-take-on-this/question-171160/comment-55082461/" title="permalink" rel="nofollow"></a><a class="replyLink blueButton needAuth">
</div>
</div>

There is some difference between an answer and a comment, but I can't describe it properly.
The only difference I can state clearly is that the tag <div class="votecolor vcolor1 ....    > is present for an answer and absent for a comment.

Maybe both the answers and the comments can simply be posted as answers.

Okay. Since your Q2A is new, can you take a MySQL dump and give it to me?

I have exported the SQL database and I hope it is what you are asking for. https://drive.google.com/open?id=0B4RFn3PmjNy9a19vak1HYjM1WWc
Okay, let me try.
You can re-download parse.php and run it after giving it the database details, in 2 places. The contents produced in qinsert.sql and ainsert.sql can be imported directly into the database.

All posts are assigned to admin. If you want, this can be changed by replacing the "1" used for the userid field with the userid of the new user.
The tag is "living" for all posts; you can change it in the file.
All posts take today as the creation date. This is not hard to change; you can try.

Also, comments are not handled separately, as the HTML file is really hard to understand: every post is added as an answer.

Meanwhile, I do not know how you made the dump file. It worked for the insert, but I guess it is not a proper mysqldump, because it did not clean the already existing tables.

Thanks. It works and gets the questions plus the answers for the input file you gave me. Some questions didn't work, though.

It isn't working for my input file. Also, when I copy-paste the text of your input file into Notepad and create a new input.txt file, it doesn't work. I tried both WordPad and Notepad. So what could the problem be? It reports an error on line 69, which is this line:

$dom = str_get_html(file_get_contents($file));
You have the files; you can see the difference between them. Also, the input file needs all HTML files to be in the same structure. I generated it using the "find" command. How did you make your input file?
There is no difference between them. This is what I did in Windows.

I went to the folder which contains the folders of HTML files > searched for index.html > got all the index.html files in the search results > selected all with Ctrl+A > pressed Shift+Right Click and got the option 'Copy as path' for all files.

But the paths looked like the one below, with double quotes.

"C:\xampp\living\1-in-3-jobs-will-be-replaced-by-robots-or-machines-software-by-2025-believe-it-or-not\question-4532027\index.html"

So I replaced "C:\xampp\living\ with ./
Also replaced all \ with /
Also removed every "

So I finally got

./1-in-3-jobs-will-be-replaced-by-robots-or-machines-software-by-2025-believe-it-or-not/question-4532027/index.html
./....................................
./......................................
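
The same three replacements can be done in one step per line. This is a minimal sketch in Python, not part of the thread's parse.php; `normalise_copied_path` is a hypothetical helper and the base folder `C:\xampp\living` is taken from the example path above.

```python
from pathlib import PureWindowsPath


def normalise_copied_path(line, base):
    """Turn a quoted Windows path into a ./relative/forward-slash path."""
    cleaned = line.strip().strip('"')           # drop whitespace and the quotes
    relative = PureWindowsPath(cleaned).relative_to(PureWindowsPath(base))
    return "./" + relative.as_posix()           # backslashes become forward slashes
```

`PureWindowsPath` only manipulates the text of the path, so the sketch behaves the same on Windows and Linux. It raises `ValueError` for a line that does not start with the base folder, which makes stray entries easy to spot.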
This is the input.txt file which I created by copy-pasting the paths from your input.txt into Notepad. It's not working for me; does it work for you? https://drive.google.com/open?id=0B4RFn3PmjNy9YkFqZ05xSUNpUzg
"Also the input file would need all HTML files to be in same structure" ---> You mean the HTML content inside them? Yes, they are the same.
So, how can I do it in a comment like you did?
There must be some character mismatch issue or something. Sorry, but I last used Windows in 2010 and never plan to use it again. Moreover, I really hate it for destroying the careers of Computer Science students in India. See, Windows is useful for old people who have never used computers. These corporates dump it on developing countries like India and make them use it. In developed countries, government offices and colleges all use Linux.

https://www.quora.com/Why-do-most-of-the-developers-in-Silicon-Valley-prefer-OS-X-over-Linux-or-Windows
On a Linux/Unix shell you can simply do "find . -name "index.html" > input.txt"
I'm getting this error when I try this command in the base folder (with the XAMPP shell).

I tried turning off User Account Control in Windows and changing the permissions of that folder.
Then I downloaded 'cygwin' and tried the command there, but it doesn't give any result.

I heard that Windows uses 1252 encoding and Linux uses UTF-8 encoding for txt files. So if I give you my text file, can you copy-paste the text and create a new text file on your PC? I think it will work then.
Yes, the problem was that Windows uses 1252 encoding and Linux uses UTF-8 encoding for txt files. I went to my website's cPanel, where I was able to create the txt file with UTF-8 encoding. Now I run it and it works.
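
The thread does not pin down which bytes actually differed. Two common culprits in text files saved on Windows are a byte-order mark at the start of the file and CRLF line endings, either of which can make a listed path fail to match a real file. A minimal sketch of a tolerant reader follows, in Python rather than PHP; `read_path_list` is a hypothetical helper and it assumes the list is saved as UTF-8.

```python
def read_path_list(list_path):
    """Read one path per line, ignoring a UTF-8 BOM, CRLF endings and blank lines."""
    # "utf-8-sig" decodes UTF-8 and silently drops a leading byte-order mark.
    with open(list_path, encoding="utf-8-sig") as handle:
        return [line.strip() for line in handle if line.strip()]
```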

Thank you so much, it works, but I have a few problems now:

1) Only a few questions are fetched - https://3.bp.blogspot.com/-yn5M5sbWp_E/V7VyjjN0RLI/AAAAAAAAAJ4/dIcxFMDvHrYFMSHmwkcVf0eKOsib10wtQCLcB/s1600/Question%2Bstops.PNG (There are a lot more; I don't know why it stops after a few.)

2) I want to use different users for the answers.

Nice, you fixed that. I see this part as a possible reason; you should check whether any particular input file is causing the break. Changing "break;" to "continue;" will allow the script to proceed, but it will skip some files.

if(!$file) break;
$dom = str_get_html(file_get_contents($file));
if(!$dom)
{
print "Error html: ".$file."\n";
break;
}

There are many 404 pages in between.

I suppose that should still work, just that the output won't be correct. If the script is stopping partway, it means some HTML files listed in "input.txt" are not in the directory. This shouldn't be happening.

There were some HTML files for the groups pages which stopped it. I removed them and now it's including everything and it is still running. Is it possible to strip HTML from the text except the image tags <img src>? Also I have to change the userid from 1 to 2. So do I have to change these 3 spots?

$quserid = $auserid1 = 1; ------------to------------> $quserid = $auserid2 = 2;
............. .............
......................categorypath[3].",".$auserid1.",".................... ----------to------------> categorypath[3].",". $auserid2."," ...................
............ypath[3].",".$auserid1.",". $aup ...... ---------------to-----------------> .......ypath[3].",".$auserid2.",". $aup

Yes, you can. But what is the requirement? HTML is natively supported in Q2A, and anyway you need to get the images working. When I did this on my site I had to strip out some MathJax tags, but I do not see anything like that for you. Anyway, if you want to do it, you have to do a bit more parsing by modifying the code; if you spend some time on it, it will be straightforward. To change the user id:

$quserid = $auserid1 = 1; ------------to------------> $quserid = 1; $auserid2 = 2;

The first one is for the question and the second one for the answers. I don't think you would want both to be the same. There is a change-user plugin for Q2A posts, and you can even use that later if required.
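
On the "break" versus "continue" point raised earlier in this exchange, one option is to separate usable paths from bad ones before parsing, so a single missing file no longer stops the run and the skipped entries can be reviewed afterwards. This is a minimal sketch in Python, not the thread's PHP; `split_readable` is a hypothetical helper.

```python
import os


def split_readable(paths):
    """Separate paths that exist as files from those that are missing."""
    readable, skipped = [], []
    for path in paths:
        if not path:
            continue                      # ignore blank entries instead of stopping
        if not os.path.isfile(path):
            skipped.append(path)          # record the problem and keep going
            continue
        readable.append(path)
    return readable, skipped
```

Printing the `skipped` list at the end shows exactly which entries in the input file need attention.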

Is it possible to merge duplicate questions into 1? http://i.imgur.com/l6uK9l4.png

I just want to strip the HTML to reduce overall memory use, since it has some useless div tags. But they can also stay.
That won't help. Even if you strip the HTML, the content in the database has a fixed size, so it won't make a difference. To reduce size you can try other ways.

Duplicates: how are they even appearing? I assumed the HTML files were unique. To eliminate them, I guess you have to modify the code to check for duplicates, or remove the duplicate content from the SQL file. In the latter case you might have some missing "IDs" in the posts table.
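
Checking for duplicates before the SQL is written avoids gaps in the ids. A minimal sketch follows, in Python rather than PHP; `drop_duplicates` is a hypothetical helper and it assumes each parsed post is a dictionary with `title` and `content` keys. Treating posts as duplicates when title and content match after ignoring case and spacing is a choice made here for illustration, not something the thread specifies.

```python
def drop_duplicates(posts):
    """Keep the first post for each (title, content) pair, ignoring case and spacing."""
    seen = set()
    unique = []
    for post in posts:
        key = (" ".join(post["title"].split()).casefold(),
               " ".join(post["content"].split()).casefold())
        if key in seen:
            continue                      # an equivalent post was already kept
        seen.add(key)
        unique.append(post)
    return unique
```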
Some HTML files have copies, and I'm on shared hosting. Will running those cause any problem?
Guess you can start a new Q2A site just for your doubts :O
It doesn't cause any problem, except that usage of the CPU limit given by the hosting provider increases. So if it reaches 100%, I think I have to wait for some time.

I have dropped a message on your profile; do you think you can answer that?

I can get the description in the command prompt.

What did I do?

I created an Example.rb file on my PC, entered the code below in the file and saved it.

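require 'nokogiri'

# Path to one saved page (a local file, so open-uri is not needed)
path = "C:/Users/Manikandan/websites/index.html"

# Let Nokogiri read the file and detect its encoding
data = File.open(path) { |file| Nokogiri::HTML(file) }

# Question description
puts data.at_css("#summaryDescription").text.strip

# All post bodies, concatenated into one string
puts data.css(".postContent").text.strip

# Question title (the original line had no puts, so nothing was printed)
puts data.at_css('h1.title').text.strip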

When I run the file in the command prompt, I get the details.