Novel transcoding method, device, equipment and storage medium
1. A novel transcoding method, comprising:
searching at least one webpage address associated with the target novel;
acquiring at least one piece of directory information corresponding to the at least one webpage address one by one;
performing catalog aggregation based on the at least one catalog information to generate a chapter relationship diagram of the target novel; and
acquiring the optimal catalog of the target novel based on the chapter relationship diagram, so that, based on the optimal catalog, the corresponding optimal text is selected and transcoded for the novel chapter that the user requests to access.
2. The method of claim 1, wherein performing catalog aggregation based on the at least one catalog information to generate a chapter relationship graph for the target novel comprises:
for each directory information of the at least one directory information, performing a preprocessing operation to obtain corresponding preprocessed directory information, wherein the preprocessing operation includes at least one of: chapter de-duplication operation and chapter completion operation; and
performing catalog aggregation based on the at least one catalog information obtained after the preprocessing to generate the chapter relationship graph.
3. The method of claim 2, wherein performing catalog aggregation based on the preprocessed at least one catalog information to generate the chapter relationship graph comprises:
establishing a chapter relation tree according to a chapter sequence based on at least one directory information obtained after the preprocessing; and
combining the nodes with the same chapter title main body in the chapter relation tree to obtain at least one aggregation node; and
generating the chapter relationship graph based on the at least one aggregation node.
4. The method of claim 3, wherein obtaining the optimal catalog of target novels based on the chapter graph comprises:
starting from the first node of the chapter relationship graph, searching the child nodes of the current node and selecting the child node that aggregates the largest number of nodes as the next node, until a leaf node is reached, wherein the path traversed in this process is used as the optimal directory.
5. The method of claim 1, further comprising:
caching the optimal directory.
6. The method of any of claims 1 to 5, further comprising:
in response to a transcoding request initiated by the user for the target novel, selecting, based on the optimal directory, the corresponding optimal text for the novel chapter that the user requests to access, and transcoding that text.
7. The method of claim 6, wherein selecting, based on the optimal directory, the corresponding optimal text for the novel chapter that the user requests to access and transcoding it comprises:
determining, in the optimal directory, the aggregated chapter corresponding to the novel chapter that the user requests to access;
acquiring at least one chapter text webpage link associated with the aggregated chapter;
acquiring at least one chapter text in one-to-one correspondence with the at least one chapter text webpage link;
selecting an optimal chapter text from the at least one chapter text based on content quality; and
transcoding the optimal chapter text for the user.
8. The method of claim 1, wherein finding at least one web page address associated with a target novel comprises:
determining a book set to which the target novel belongs, wherein each novel contained in the book set and the target novel are the same novel with different sources, and a webpage address corresponding to each novel is associated with the target novel; and
searching for the webpage address corresponding to each novel.
9. A novel transcoding device, comprising:
the searching module is used for searching at least one webpage address associated with the target novel;
the first acquisition module is used for acquiring at least one piece of directory information corresponding to the at least one webpage address one by one;
the aggregation module is used for carrying out catalog aggregation based on the at least one catalog information and generating a chapter relation diagram of the target novel; and
the second acquisition module is used for acquiring the optimal catalog of the target novel based on the chapter relationship diagram, so that, based on the optimal catalog, the corresponding optimal text is selected and transcoded for the novel chapter that the user requests to access.
10. The apparatus of claim 9, wherein the aggregation module comprises:
a preprocessing unit, configured to perform a preprocessing operation on each directory information in the at least one directory information to obtain corresponding preprocessed directory information, where the preprocessing operation includes at least one of: chapter de-duplication operation and chapter completion operation; and
the aggregation unit is used for carrying out catalog aggregation based on the at least one catalog information obtained after the preprocessing so as to generate the chapter relationship graph.
11. The apparatus of claim 10, wherein the aggregation unit comprises:
a construction subunit, configured to establish a chapter relationship tree according to a chapter sequence based on the at least one catalog information obtained after the preprocessing; and
the merging subunit is used for merging the nodes with the same chapter title main body in the chapter relation tree to obtain at least one aggregation node; and
a generating unit, configured to generate the chapter relationship graph based on the at least one aggregation node.
12. The apparatus of claim 11, wherein the second acquisition module is further configured to:
start from the first node of the chapter relationship graph, search the child nodes of the current node, and select the child node that aggregates the largest number of nodes as the next node, until a leaf node is reached, wherein the path traversed in this process is used as the optimal directory.
13. The apparatus of claim 9, further comprising:
the caching module is used for caching the optimal directory.
14. The apparatus of any of claims 9 to 13, further comprising:
the transcoding module is used for, in response to a transcoding request initiated by the user for the target novel, selecting, based on the optimal directory, the corresponding optimal text for the novel chapter that the user requests to access, and transcoding that text.
15. The apparatus of claim 14, wherein the transcoding module comprises:
a first determining unit, configured to determine an aggregation chapter corresponding to the novel chapter requested to be accessed by the user in the optimal directory;
a first obtaining unit, configured to obtain at least one chapter text web page link associated with the aggregated chapter;
the second acquisition unit is used for acquiring at least one chapter text in one-to-one correspondence based on the at least one chapter text webpage link;
the selecting unit is used for selecting the optimal chapter text from the at least one chapter text based on the content quality; and
the transcoding unit is used for transcoding the optimal chapter text for the user.
16. The apparatus of claim 9, wherein the lookup module comprises:
a second determining unit, configured to determine a book set to which the target novel belongs, where each novel included in the book set and the target novel are the same novel from different sources, and a web address corresponding to each novel is associated with the target novel; and
the searching unit is used for searching for the webpage address corresponding to each novel.
17. An electronic device, comprising:
at least one processor; and
a memory communicatively coupled to the at least one processor; wherein
the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the method of any one of claims 1-8.
18. A non-transitory computer readable storage medium having stored thereon computer instructions for causing the computer to perform the method of any one of claims 1-8.
19. A computer program product comprising a computer program which, when executed by a processor, implements the method according to any one of claims 1-8.
Background
The novel text stored on a server is usually not in a format that the user can read directly, so before it is displayed on a reading terminal it needs to be transcoded into a text format that the user can read. Novel transcoding is therefore a conversion of text format: the novel is converted into a format supported by the reading terminal for display.
Disclosure of Invention
The present disclosure provides a novel transcoding method, apparatus, device, storage medium and computer program product.
According to an aspect of the present disclosure, there is provided a novel transcoding method, including: searching at least one webpage address associated with the target novel; acquiring at least one piece of directory information corresponding to the at least one webpage address one by one; performing catalog aggregation based on the at least one catalog information to generate a chapter relationship diagram of the target novel; and acquiring the optimal catalogue of the target novel based on the chapter relation diagram so as to select the corresponding optimal text for the novel chapter requested to be accessed by the user to transcode based on the optimal catalogue.
According to another aspect of the present disclosure, there is provided a novel transcoding apparatus, including: the searching module is used for searching at least one webpage address associated with the target novel; the first acquisition module is used for acquiring at least one piece of directory information corresponding to the at least one webpage address one by one; the aggregation module is used for carrying out catalog aggregation based on the at least one catalog information and generating a chapter relation diagram of the target novel; and the second acquisition module is used for acquiring the optimal catalogue of the target novel based on the chapter relation diagram so as to select the corresponding optimal text for the novel chapter requested to be accessed by the user to transcode based on the optimal catalogue.
According to another aspect of the present disclosure, there is provided an electronic device including: at least one processor; and a memory communicatively coupled to the at least one processor; wherein the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the method of the embodiments of the present disclosure.
According to another aspect of the present disclosure, there is provided a non-transitory computer readable storage medium having stored thereon computer instructions for causing the computer to perform the method according to the embodiments of the present disclosure.
According to another aspect of the present disclosure, a computer program product is provided, comprising a computer program which, when executed by a processor, implements a method according to embodiments of the present disclosure.
It should be understood that the statements in this section do not necessarily identify key or critical features of the embodiments of the present disclosure, nor do they limit the scope of the present disclosure. Other features of the present disclosure will become apparent from the following description.
Drawings
The drawings are included to provide a better understanding of the present solution and are not to be construed as limiting the present disclosure. Wherein:
FIG. 1A illustrates a system architecture suitable for embodiments of the present disclosure;
FIG. 1B illustrates a scene diagram in which embodiments of the disclosure may be implemented;
FIG. 2 illustrates a flowchart of a novel transcoding method according to an embodiment of the present disclosure;
FIG. 3 illustrates a schematic diagram of generating a chapter relationship diagram according to an embodiment of the present disclosure;
FIG. 4 illustrates a schematic diagram of obtaining an optimal novel text according to an embodiment of the disclosure;
FIG. 5 illustrates a block diagram of a novel transcoding device according to an embodiment of the present disclosure; and
FIG. 6 illustrates a block diagram of an electronic device used to implement embodiments of the present disclosure.
Detailed Description
Exemplary embodiments of the present disclosure are described below with reference to the accompanying drawings, in which various details of the embodiments of the disclosure are included to assist understanding, and which are to be considered as merely exemplary. Accordingly, those of ordinary skill in the art will recognize that various changes and modifications of the embodiments described herein can be made without departing from the scope and spirit of the present disclosure. Also, descriptions of well-known functions and constructions are omitted in the following description for clarity and conciseness.
At present there are many novel websites on the market, but the quality of the novel texts they provide is uneven. The novel texts provided by some websites even have book quality problems such as missing chapters or out-of-order chapters, which affects the reading experience of users.
The traditional novel transcoding scheme only performs format conversion and structure adjustment on novel web pages to adapt them to the corresponding reading terminals, and cannot solve these book quality problems. The user therefore has to judge the quality of each website and find a novel source of better quality on their own.
In view of the above, the present disclosure provides a novel transcoding scheme based on information fusion. In this scheme, transcoding is carried out at the level of the novel rather than the level of the individual web page: when a user requests to read a novel, the contents of different chapters of the novel can be obtained from a plurality of websites at the same time, a relatively complete novel is assembled from the obtained contents, and the result is then transcoded for the user to read. Quality problems such as missing text, repeated chapters, and out-of-order chapters can thus be avoided, and the quality of the transcoded novel is improved.
The present disclosure will be described in detail below with reference to specific examples.
A system architecture suitable for the novel transcoding method and apparatus of the embodiments of the present disclosure is introduced below.
FIG. 1A illustrates a system architecture suitable for embodiments of the present disclosure. It should be noted that fig. 1A is only an example of a system architecture to which the embodiments of the present disclosure may be applied to help those skilled in the art understand the technical content of the present disclosure, and does not mean that the embodiments of the present disclosure may not be used in other environments or scenarios.
As shown in FIG. 1A, the system architecture 100 may include: a server 101, reading terminals 102, 103, and 104, and websites A, B, and C.
It should be understood that there are many novel websites on the market; website A, website B, and website C may all be novel websites. The quality of the novel texts provided by these websites may be uneven. For example, the novel text provided by website A may contain only the first 3 chapters, the novel text provided by website B may have chapters out of order, and the novel text provided by website C may contain repeated chapters, all of which may affect the reading experience of the user.
In the embodiment of the present disclosure, the server 101 may obtain the contents of different chapters of the same novel from multiple websites (e.g., website A, website B, website C, etc.) and combine them into a novel text that is relatively complete and free of problems such as repeated or out-of-order chapters. Then, in response to an access request from a user, such as an access request initiated by any one or more of the reading terminals 102, 103, and 104, the server transcodes the combined novel text and feeds it back to the user for reading. This improves the quality of the transcoded novel and, at the same time, the reading experience of the user.
It should be understood that the number of websites, servers, and reading terminals in FIG. 1A are merely illustrative. There may be any number of websites, servers and reading terminals, as desired for the implementation.
Application scenarios of the novel transcoding method and apparatus suitable for the embodiments of the present disclosure are introduced below.
As shown in FIG. 1B, a novel downloaded from website 1 has only chapters 1 and 3 to 5 (chapter 2 is missing), and its chapter 4 has poor text quality; the same novel downloaded from website 2 has chapters 1 to 5 (no chapter is missing), but its chapters 3 and 5 have poor text quality.
Clearly, in this case, transcoding the novel web page provided by website 1 alone or by website 2 alone cannot yield a high-quality transcoded novel.
With the novel transcoding scheme provided by the embodiment of the present disclosure, for example, chapter 1, chapter 3, and chapter 5 of the novel can be obtained from website 1, and chapter 2 and chapter 4 can be obtained from website 2; these are then combined into a relatively complete novel text in which every chapter has high content quality, which is transcoded and fed back to the user for reading. Alternatively, chapter 3 and chapter 5 can be obtained from website 1, and chapter 1, chapter 2, and chapter 4 can be obtained from website 2, and then combined in the same way. Either way, the quality of the transcoded novel is improved, and so is the reading experience of the user.
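The FIG. 1B scenario can be worked through in a few lines. The following is a minimal sketch, not the disclosed implementation: the boolean quality flags and the name `pick_sources` are assumptions introduced only for illustration.

```python
# Hypothetical quality flags for the FIG. 1B scenario: True means acceptable text quality.
SITE_CHAPTERS = {
    "website 1": {1: True, 3: True, 4: False, 5: True},            # chapter 2 missing, chapter 4 poor
    "website 2": {1: True, 2: True, 3: False, 4: True, 5: False},  # chapters 3 and 5 poor
}


def pick_sources(site_chapters):
    """For each chapter number, list the sites that offer it with acceptable quality."""
    chapters = sorted({c for chs in site_chapters.values() for c in chs})
    return {
        c: [site for site, chs in site_chapters.items() if chs.get(c)]
        for c in chapters
    }


plan = pick_sources(SITE_CHAPTERS)
# Chapter 1 can come from either site; every other chapter has exactly one good source.
```

Under these assumed flags, chapter 1 is the only chapter with a choice of source, which is why the text above gives two alternative combinations.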
According to an embodiment of the present disclosure, the present disclosure provides a novel transcoding method.
FIG. 2 illustrates a flowchart of a novel transcoding method according to an embodiment of the present disclosure.
As shown in FIG. 2, the novel transcoding method 200 includes operations S210 to S240.
In operation S210, at least one web page address associated with the target novel is searched for.
In operation S220, at least one piece of directory information corresponding one-to-one to the at least one web page address is acquired.
In operation S230, catalog aggregation is performed based on the at least one piece of directory information to generate a chapter relationship diagram of the target novel.
In operation S240, the optimal directory of the target novel is obtained based on the chapter relationship diagram, so that, based on the optimal directory, the corresponding optimal text is selected and transcoded for the novel chapter that the user requests to access.
In operation S210, the target novel may be any novel, and each of at least one web page address associated with the novel corresponds to a web site. In other words, the novel may be accessed through each of the at least one web page addresses associated with the novel, except that the text quality of the accessed novel may be more or less problematic. In operation S220, for each web page address found in operation S210, directory information of the target novel provided by each corresponding website may be acquired. In operation S230, the target novel directories provided by the websites may be merged based on the directory information of the target novel provided by the websites, and a chapter relationship diagram of the target novel may be generated. In operation S240, an optimal directory of the target novel may be selected, and based on the optimal directory, a corresponding optimal chapter text is selected from each website to be transcoded and fed back to the user.
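The data flow through operations S210 to S240 can be sketched as a single function that chains four steps. This is only an outline under stated assumptions: the function name and the four callables (`find_urls`, `fetch_directory`, `aggregate`, `pick_optimal`) are hypothetical placeholders for the operations described above, not names from the disclosure.

```python
def transcode_pipeline(novel, find_urls, fetch_directory, aggregate, pick_optimal):
    """Chain operations S210 to S240; each step is injected as a callable."""
    urls = find_urls(novel)                                  # S210: web page addresses
    directories = {u: fetch_directory(u) for u in urls}      # S220: one directory per address
    graph = aggregate(directories)                           # S230: chapter relationship diagram
    return pick_optimal(graph)                               # S240: optimal directory
```

Passing the steps in as callables keeps the sketch independent of how each operation is actually carried out.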
As an embodiment, operations S210 to S240 may be performed in response to a transcoding request initiated by a user for a target novel: at least one web page address associated with the target novel is searched for; requests are sent to the at least one web page address to obtain at least one piece of directory information; directory aggregation is performed on the directory information to generate a chapter relationship diagram of the target novel; the optimal directory of the target novel is obtained from the chapter relationship diagram; and finally, based on the optimal directory, the corresponding optimal novel text is selected and transcoded for the novel chapter that the user accesses.
Further, in this embodiment, the optimal directory of the target novel can be cached, and the link of the optimal novel text corresponding to the optimal directory can be cached, so that when a subsequent user accesses the target novel, the link of the corresponding optimal directory and the optimal chapter text can be directly obtained from the cache to perform novel transcoding, and the novel transcoding efficiency can be improved.
As another embodiment, for some popular novels, operations S210 to S240 may be performed in advance, that is, at least one web address associated with the target novels is searched, and at least one piece of directory information corresponding to the at least one web address is obtained, then directory aggregation is performed based on the at least one piece of directory information to generate a chapter relation diagram of the target novels, and then an optimal directory of the target novels is obtained and stored based on the chapter relation diagram, so that when a user accesses a relevant chapter of the target novels, a corresponding optimal novel text for the novels that the user requests to access may be selected for transcoding based on the optimal directory.
In the embodiment of the present disclosure, the directories and texts of the same electronic novel provided by a plurality of websites are fused, and transcoding is performed at the level of the novel rather than the level of the individual web page. When a user requests to read a novel, the contents of different chapters of the novel can be obtained from these websites at the same time, combined into a relatively complete novel, and then transcoded for the user to read. Quality problems such as missing text, repeated chapters, and out-of-order chapters can thus be avoided, and the quality of the transcoded novel is improved.
As an alternative embodiment, performing catalog aggregation based on at least one catalog information to generate a chapter relationship diagram for a target novel may include the following operations.
For each directory information in the at least one directory information, performing a preprocessing operation to obtain corresponding preprocessed directory information, wherein the preprocessing operation includes at least one of: chapter deduplication operation and chapter completion operation.
And performing catalog aggregation based on at least one catalog information obtained after preprocessing to generate a chapter relation graph.
In the embodiment of the present disclosure, when performing a preprocessing operation on each directory information of a target novel (that is, a same novel) provided by different websites, chapter titles of the target novel on different websites may be cleaned, so that duplicate removal, completion, and the like of chapters are realized.
In one embodiment, for each directory information, each section title may be processed through a regular expression, and a main body of each section title (referred to as a section main body) is extracted.
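As a minimal sketch of this extraction step, the following strips a chapter-number prefix from a title and returns what remains as the chapter body. The pattern `_PREFIX` and the function name `chapter_body` are assumptions for illustration; the disclosure does not specify the regular expression actually used.

```python
import re

# Hypothetical pattern: a "Chapter N" or "第N章" style prefix, then an optional separator.
_PREFIX = re.compile(
    r"^\s*(?:chapter\s+\d+|第\s*[0-9零一二三四五六七八九十百千两]+\s*[章节回])\s*[:：.\-、]?\s*",
    re.IGNORECASE,
)


def chapter_body(title: str) -> str:
    """Return the title with its chapter-number prefix removed (may be empty)."""
    return _PREFIX.sub("", title).strip()
```

A title that consists only of the prefix yields an empty string, which is the case the completion step below is meant to handle.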
For a chapter whose title has no chapter body, the chapter name may be used as the chapter body. For example, if the titles of chapter 1 and chapter 5 both lack a chapter body, "chapter 1" may be used as the chapter body of chapter 1 and "chapter 5" as the chapter body of chapter 5, which implements chapter completion.
For a directory in which a chapter body appears more than once, the repeated chapter and all subsequent chapters can be discarded, and only the non-repeated first part of the directory is retained, which implements chapter de-duplication. For example, if chapter 4 and chapter 5 of the directory displayed for a novel on website A both have the chapter body "for sweeping a floor for a brief moment", chapter 5 and all later chapters may be discarded, and only chapter 4 and the preceding chapters are retained.
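The two preprocessing operations can be sketched together. This is an illustrative sketch only: the function name `preprocess` and the `(chapter_name, chapter_body)` input shape are assumptions, and the sketch truncates at the first repeated body, which matches the chapter 4 and chapter 5 example above.

```python
def preprocess(directory):
    """directory: list of (chapter_name, chapter_body) pairs in page order.

    Returns the list of chapter bodies after completion and de-duplication.
    """
    cleaned, seen = [], set()
    for name, body in directory:
        body = body or name      # completion: fall back to the chapter name
        if body in seen:         # de-duplication: drop the repeat and everything after it
            break
        seen.add(body)
        cleaned.append(body)
    return cleaned
```

For instance, a directory whose chapter 1 has no body and whose chapters 4 and 5 share a body keeps chapters 1 to 4, with "Chapter 1" standing in as the body of the first chapter.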
According to the embodiment of the disclosure, the preprocessing operation is executed in the process of aggregating a plurality of catalogue information of the same novel from different sources and generating the corresponding chapter relation graph, so that the problems of chapter missing, disorder or repeated chapters and the like of the novel text formed by combining the catalogue information and the chapter relation graph can be avoided.
Further, as an optional embodiment, performing directory aggregation based on at least one directory information obtained after the preprocessing to generate the chapter relationship diagram may include the following operations.
And establishing a chapter relation tree according to the chapter sequence based on at least one piece of directory information obtained after preprocessing.
And merging the nodes with the same chapter title main body in the chapter relation tree to obtain at least one aggregation node.
Generating a chapter relationship graph based on the at least one aggregation node.
In the embodiment of the present disclosure, a chapter relationship tree can be established according to the chapter order given by each piece of directory information (each piece of directory information corresponds to one web page). For example, a common virtual node 0 may be set, and the first chapter of every web page may be attached to node 0 as a child. Then, for each web page, chapter one is the parent node of chapter two, chapter two is the parent node of chapter three, and so on. Next, starting from node 0 of the chapter relationship tree, the chapter bodies of all child nodes of the current node are examined. If two or more child nodes have the same chapter body, they are considered to form a node cluster and are merged into one aggregation node that represents the cluster. The aggregation node records the number of child nodes it aggregates and the relevant information of each original child node. Illustratively, merging two child nodes includes deleting one of them and transferring its parent and child relationships to the node that is kept. In addition, if the chapter bodies of all child nodes of the same parent node differ, the chapter bodies of the other child nodes in the current chapter relationship tree are searched. If at least two of those child nodes have the same chapter body, the chapter bodies of their parent nodes are compared, and if the parent chapter bodies are also the same, those child nodes and their parent nodes are merged. By analogy, the chapter relationship trees formed from the chapters of the web pages are finally merged into a directed acyclic graph, called the chapter relationship graph.
Illustratively, as shown in FIG. 3, for a given novel, the chapter bodies shown on the novel web page of website A are, in order, "python", "Su Young", and "power"; the chapter bodies shown on the novel web page of website B are, in order, "python", "source line", "Su Young", and "power"; and the chapter bodies shown on the novel web page of website C are likewise "python", "source line", "Su Young", and "power". With the technical scheme provided by the embodiment of the present disclosure, the chapter relationship tree shown in the left half of FIG. 3 can be constructed for the novel from the chapter bodies displayed by websites A to C. From this tree, the chapter relationship graph shown in the right half of FIG. 3 can be generated by merging the "python" nodes of the three websites, merging the "Su Young" nodes and the "power" nodes of the three websites, and merging the "source line" nodes of websites B and C.
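The first merge rule, in which child nodes of the same parent with the same chapter body become one aggregation node, can be sketched as follows. This is a simplified illustration, not the full procedure: it covers only the same-parent merge and omits the further cross-branch merging that turns the tree into a directed acyclic graph. The names `Node` and `build_merged_tree` are hypothetical, and the chapter bodies follow the FIG. 3 example.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    body: str
    sources: list = field(default_factory=list)   # websites aggregated into this node
    children: dict = field(default_factory=dict)  # chapter body -> child Node


def build_merged_tree(directories):
    """directories: {website: [chapter bodies in order]} -> virtual node 0."""
    root = Node(body="<node 0>")
    for site, bodies in directories.items():
        node = root
        for body in bodies:
            # Children of the same parent with the same body share one aggregation node.
            node = node.children.setdefault(body, Node(body))
            node.sources.append(site)
    return root


root = build_merged_tree({
    "A": ["python", "Su Young", "power"],
    "B": ["python", "source line", "Su Young", "power"],
    "C": ["python", "source line", "Su Young", "power"],
})
```

In this sketch the length of `sources` plays the role of the recorded number of aggregated child nodes, and the list itself keeps the origin of each original node.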
According to the embodiment of the disclosure, for the same novel, the catalog information provided by different websites can be aggregated, so that a corresponding chapter relation graph containing one or more aggregation nodes is generated, the novel transcoding can select the text of each chapter and transcode based on the chapter relation graph, and a plurality of novel versions with different qualities from various websites are combined into a novel version with relatively complete content and relatively high content quality to be transcoded, so that the quality of the transcoded novel is improved, and the reading experience of a user is improved.
Further, as an alternative embodiment, obtaining the optimal directory of the target novel based on the chapter relationship graph may include: starting from the first node of the chapter relationship graph, searching the child nodes of the current node and selecting the child node that aggregates the largest number of nodes as the next node, and repeating this until a leaf node is reached; the path traversed in this process is used as the optimal directory.
In the embodiment of the present disclosure, the search starts from node 0 (the head node) of the chapter relationship graph. A greedy algorithm is adopted: among the child nodes of the current node, the one that aggregates the largest number of nodes is selected as the next node, and this is repeated until a leaf node is reached. The path traversed in the whole process can be used as the finally generated optimal directory. In embodiments of the present disclosure, the optimal directory may be presented to the user.
Illustratively, with continued reference to FIG. 3, the path "python → source line → Su Young → power" in the chapter relationship graph can be presented to the user as the optimal directory.
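The greedy traversal can be sketched on a small graph. The node ids, the `GNode` type, and the aggregation counts below are assumptions chosen for illustration (they are what a same-parent merge of the three example directories would give) and are not read off FIG. 3.

```python
from typing import NamedTuple


class GNode(NamedTuple):
    body: str        # chapter body
    count: int       # number of original nodes aggregated into this node
    children: tuple  # ids of child nodes


# Hypothetical graph for the three example directories; node 0 is the virtual head node.
GRAPH = {
    0: GNode("<head>", 0, (1,)),
    1: GNode("python", 3, (2, 3)),
    2: GNode("Su Young", 1, (4,)),
    3: GNode("source line", 2, (5,)),
    4: GNode("power", 1, ()),
    5: GNode("Su Young", 2, (6,)),
    6: GNode("power", 2, ()),
}


def optimal_directory(graph, head=0):
    """Greedily follow the child with the largest aggregation count until a leaf."""
    path, node = [], graph[head]
    while node.children:
        best = max(node.children, key=lambda i: graph[i].count)
        node = graph[best]
        path.append(node.body)
    return path
```

With these assumed counts, `optimal_directory(GRAPH)` follows "python", then "source line", then "Su Young", then "power". The loop relies on the graph being acyclic, and ties between children are broken in favor of the child listed first.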
In the embodiment of the present disclosure, taking the child node that aggregates the largest number of nodes as the next node leaves more candidate sources for each chapter, and therefore more opportunity to select a chapter text of higher quality.
As an alternative embodiment, the method may further comprise: and caching the optimal directory.
In some embodiments of the present disclosure, after a user accesses a novel and generates a corresponding optimal directory, the generated optimal directory may be cached in a cache, so as to accelerate a response speed of a transcoding request for reading the novel by a subsequent user.
Alternatively, in other embodiments of the present disclosure, for popular novels, the optimal directory may also be generated in advance and stored in a cache for reading by the user. Thereby, the response speed of the transcoding request for reading the popular novel by the user can be accelerated.
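Both caching variants, building on first access and generating in advance for popular novels, can be sketched with one small class. This is a minimal in-memory sketch: the class name `DirectoryCache`, the `build` callable, and the use of a novel identifier as the cache key are all assumptions for illustration.

```python
class DirectoryCache:
    """In-memory cache of optimal directories, keyed by a hypothetical novel id."""

    def __init__(self, build):
        self._build = build  # callable: novel_id -> optimal directory
        self._store = {}

    def get(self, novel_id):
        if novel_id not in self._store:  # miss: generate on first access
            self._store[novel_id] = self._build(novel_id)
        return self._store[novel_id]

    def prewarm(self, popular_ids):
        """Generate and store directories for popular novels ahead of any request."""
        for novel_id in popular_ids:
            self.get(novel_id)
```

After `prewarm`, a later `get` for the same novel returns the stored directory without calling `build` again.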
It should be understood that, in the embodiment of the present disclosure, for the high-quality novel catalog stored in the cache, the web page maintenance may also be performed periodically, for example, dead link web pages appearing therein are eliminated, and new novel source web pages are added for replacement, so as to ensure that the combined novel text is kept at a relatively complete level as much as possible, thereby bringing a reading experience with stable quality to the user.
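A periodic maintenance pass of this kind might look like the following sketch. Everything here is an assumption for illustration: the function name `refresh_links`, the `is_alive` predicate (which in practice would perform some reachability check), and the shape of the cached data as a mapping from chapter to source links.

```python
def refresh_links(chapter_links, is_alive, candidates=None):
    """Drop dead links per chapter and append live replacement candidates.

    chapter_links: {chapter: [url, ...]} as cached with the optimal directory.
    is_alive:      hypothetical predicate url -> bool.
    candidates:    optional {chapter: [new source urls]} to add as replacements.
    """
    candidates = candidates or {}
    refreshed = {}
    for chapter, links in chapter_links.items():
        live = [u for u in links if is_alive(u)]
        for u in candidates.get(chapter, []):
            if u not in live and is_alive(u):
                live.append(u)
        refreshed[chapter] = live
    return refreshed
```

The function returns a new mapping rather than editing the cached one in place, so a failed maintenance run cannot leave the cache half updated.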
As an alternative embodiment, the method may further comprise: in response to a transcoding request initiated by a user for the target novel, selecting, based on the optimal directory, a corresponding optimal text for the novel chapter that the user requests to access, and transcoding the optimal text.
Illustratively, with continued reference to fig. 3, from the chapter relationship diagram in the figure, "python → source line → Su Young → power" can be presented to the user as the optimal directory. In the optimal directory, websites A to C can provide the text of the chapters "python", "Su Young" and "power", and websites B to C can provide the text of the chapter "source line".
Therefore, in one embodiment, for the "python" chapter, any one of the chapter texts provided by websites A to C can be selected for transcoding and display. In another embodiment, although websites A to C can all provide the corresponding chapter text for the "python" chapter, in order to select a chapter text of better quality, the chapter texts provided by the websites can be screened and processed (for example, advertisements and other impurities in the chapter texts are filtered), so that a chapter text of better quality is selected for transcoding and display, thereby improving the reading experience of the user.
Similarly, the two chapters "Su Young" and "power" can be treated in the same way, and details are not described herein again.
In addition, in one embodiment, for the "source line" chapter, any one of the chapter texts provided by websites B to C can be selected for transcoding and display. In another embodiment, although websites B to C can both provide the corresponding chapter text for the "source line" chapter, in order to select a chapter text of better quality, the chapter texts provided by the websites can be screened and processed (for example, advertisements and other impurities in the chapter texts are filtered), so that a chapter text of better quality is selected for transcoding and display, thereby improving the reading experience of the user.
Further, as an optional embodiment, selecting, based on the optimal directory, the corresponding optimal text for the novel chapter that the user requests to access and transcoding it may include the following operations.
Determining, in the optimal directory, the aggregated chapter corresponding to the novel chapter that the user requests to access.
Obtaining at least one chapter text web page link associated with the aggregated chapter.
Acquiring at least one chapter text in one-to-one correspondence with the at least one chapter text web page link.
Selecting an optimal chapter text from the at least one chapter text based on content quality.
Transcoding the optimal chapter text and presenting it to the user.
In the embodiment of the present disclosure, the child nodes aggregated in each aggregation node and their related information (such as the website to which each child node belongs, its chapter text web page link, and the like) may be recorded when the chapter relationship graph is generated. Therefore, when a user accesses the chapter corresponding to a certain aggregation node, the chapter text corresponding to each child node may be obtained based on the chapter text web page links of the child nodes aggregated in that aggregation node, and the obtained chapter texts may then be checked so that invalid texts are identified and marked.
For example, typical invalid texts may be preconfigured, such as "The chapter is being typed by hand, please visit later." For each acquired chapter text, for example, the first 3 paragraphs of the text can be compared with the preconfigured invalid texts to obtain a similarity. If the comparison result indicates that the similarity is greater than or equal to a preset value, the currently compared text is considered to be an invalid text; it is marked as invalid, and the priority of the corresponding web page is reduced to the minimum.
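The invalid-text check can be sketched as below. The disclosure does not name a similarity measure or a preset value, so the use of `difflib.SequenceMatcher`, the threshold of 0.8, the template string, and the function name `is_invalid_text` are all assumptions made for illustration.

```python
from difflib import SequenceMatcher

# Hypothetical preconfigured invalid text and preset similarity value.
INVALID_TEMPLATES = ["The chapter is being typed by hand, please visit later."]
THRESHOLD = 0.8


def is_invalid_text(text, templates=INVALID_TEMPLATES, threshold=THRESHOLD, n_paragraphs=3):
    """Return True if any of the first n_paragraphs resembles a known invalid text."""
    # Paragraphs are taken to be non-empty lines; only the first few are inspected.
    paragraphs = [p.strip() for p in text.split("\n") if p.strip()][:n_paragraphs]
    return any(
        SequenceMatcher(None, paragraph, template).ratio() >= threshold
        for paragraph in paragraphs
        for template in templates
    )
```

A text flagged by this check would then be marked as invalid and its web page given the lowest priority, as described above.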
In the embodiment of the present disclosure, the chapter texts marked as valid texts may be further screened. For example, each text is split at line breaks, periods and exclamation marks, and the K longest clauses (K is any positive integer) are taken as the feature sentences of that text; the feature sentences of all the texts are then collected. A feature vector is generated for each text according to whether the text contains each of the collected feature sentences. Then, the similarity between the chapter texts is determined by cosine similarity, and the group of chapter texts with the highest similarity is selected. From this group, the text whose number of paragraphs and number of sentences are the median is selected, and this text is transcoded and presented to the user as the optimal text.
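The feature-sentence comparison can be sketched as follows. This is a simplified illustration under stated assumptions: the function names are hypothetical, the "group" with the highest similarity is reduced to the single most similar pair of texts, and the final median-based choice is omitted because the text does not specify how it is computed.

```python
import math
import re
from itertools import combinations


def feature_sentences(text, k=3):
    """Split on line breaks, periods and exclamation marks; keep the k longest clauses."""
    clauses = [c.strip() for c in re.split(r"[\n.!。！]", text) if c.strip()]
    return set(sorted(clauses, key=len, reverse=True)[:k])


def cosine(a, b):
    """Cosine similarity of two equal-length numeric vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def most_similar_pair(texts, k=3):
    """Return the indices of the two texts with the highest cosine similarity.

    Assumes at least two texts are supplied.
    """
    features = [feature_sentences(t, k) for t in texts]
    vocabulary = sorted(set().union(*features))
    # One 0/1 entry per collected feature sentence: does this text contain it?
    vectors = [[1 if s in f else 0 for s in vocabulary] for f in features]
    return max(
        combinations(range(len(texts)), 2),
        key=lambda pair: cosine(vectors[pair[0]], vectors[pair[1]]),
    )
```

Texts from different websites that carry the same chapter share most of their long clauses and therefore score close to 1, while an unrelated or truncated page scores close to 0.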
It should be noted that, in the embodiment of the present disclosure, if all the texts are marked as invalid texts, one of the texts may be randomly selected for transcoding and presentation.
Similarly, in the embodiment of the present disclosure, in order to accelerate the loading speed of the text, the optimal text web page link corresponding to each chapter may also be cached, so that the next time the user accesses these chapters, the cached optimal text web page link can be accessed directly to transcode and display the novel.
For example, as shown in fig. 4, in the optimal directory "python → source line → Su Young → power", websites A to C can provide the text of the chapters "python", "Su Young" and "power", and websites B to C can provide the text of the chapter "source line". If only the "python" text provided by website A is a valid text, only the "Su Young" and "power" texts provided by website B are valid texts, and the "source line" texts provided by websites B to C are all invalid texts, then the valid text provided by website A can be transcoded and displayed to the user for the "python" chapter, the valid texts provided by website B can be transcoded and displayed to the user for the "Su Young" and "power" chapters, and one of the invalid texts provided by website B or website C can be randomly selected, transcoded and displayed for the "source line" chapter.
It should be noted that, in the embodiment of the present disclosure, in the process of transcoding the optimal chapter text to the user, in order to optimize the quality of each chapter text, the chapter text may be further processed as follows: invalid content such as advertisements, impurities (e.g., "read free million; contents not found on the web |"), etc. in the chapter text is filtered. Therefore, the quality of the transcoded novel can be further improved, and the reading experience of a user is improved.
As an alternative embodiment, finding at least one web page address associated with the target novel may include the following operations.
Determining a book set to which the target novel belongs, wherein each novel contained in the book set and the target novel are the same novel with different sources, and the webpage address corresponding to each novel is associated with the target novel.
Searching for the web page address corresponding to each novel.
In the embodiment of the present disclosure, before performing the novel transcoding, the alternative novel web pages may be obtained first, that is, the web pages corresponding to each novel are collected from the network. And then carrying out novel transcoding, namely integrating different webpage contents of the same novel to obtain a final novel text. Integrating different web page contents of the same novel can be divided into two steps: directory information integration and text content screening.
In the directory information integration process, for the same novel, the directory structures of a plurality of novel web pages can be extracted, a directory relation graph (namely a chapter relation graph) is established, and then a directory path with the best quality is selected from the directory relation graph for displaying.
And then, in the process of screening the text content, simultaneously requesting a plurality of texts corresponding to each chapter, selecting the text with the best quality (the optimal text) by analyzing the text content and the website weight, and transcoding the optimal text.
In one embodiment of the present disclosure, for the same novel, novel web pages on multiple websites can be simultaneously captured, corresponding directory structures are extracted, and then the final optimal directory is combined based on the directory structures.
Furthermore, in an embodiment of the present disclosure, for the novel text, the text pages of multiple corresponding chapters may be evaluated based on the aggregation node, and the text page with the best quality is selected for transcoding display. Therefore, better quality of the transcoded novels can be obtained.
By the embodiment of the disclosure, for the same novel, transcoding is no longer limited to a single novel web page, and the web page content can be optimized while the novel web pages are transcoded, so that the quality of the displayed novel content is improved.
Furthermore, in embodiments of the present disclosure, a full crawl of novels may be performed. For example, a group of high-quality novel websites can be manually configured and all novels in each website can be crawled, or popular novels can be searched for through a search engine, and the directory page links of the novels can be obtained. The directory pages of the novels are then crawled one by one, and the title, the author and the chapter titles of the first 60 chapters of each novel are extracted and stored.
Furthermore, in the disclosed embodiment, the crawled novel catalog can be cleaned. For example, the serial number and invalid content in a chapter title may be removed. For example, a chapter title that reads "first chapter can be swept for a short time (asking for a monthly ticket-)" before cleaning can become "can be swept for a short time" after cleaning. The cleaning process may include the following steps.
Title normalization processing is performed, including but not limited to: converting full-width characters to half-width characters, removing the book title, the author, spaces and invalid characters from the chapter title, and changing the numbers in the chapter title to 0.
The serial number information before the chapter body is removed through regular expressions or common prefix filtering. The common prefix filtering may be performed by building a prefix tree (trie) from the chapter titles and filtering out the common prefixes whose number of occurrences exceeds a predetermined number, for example, 10. It should be understood that, in the disclosed embodiments, titles such as "chapter 1" and "chapter 2" have already been replaced with "chapter 0" in the preceding title normalization step, and can therefore be identified and filtered out as a common prefix in this step.
The numbers that were changed to 0 in the first step and were not filtered out are restored to the original numbers according to the original chapter title.
Invalid suffix filtering is performed. The content in brackets is located, and the titles of adjacent chapters are compared. If the titles of the preceding and following chapters are the same after the brackets and their content are removed, the brackets and their content are considered to be a valid suffix, for example "(upper part)", "(middle part)" and "(lower part)", and do not need to be filtered. If the titles of the preceding and following chapters are different after the brackets and their content are removed, the brackets and their content are considered to be an invalid suffix and can be filtered out. In this way, the chapter main body corresponding to the original chapter title can be extracted through catalog cleaning. For chapter titles without a main body, such as "chapter one" and "chapter two", the chapter main body is set to be empty.
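The first two cleaning steps can be sketched as follows. This is a minimal sketch under stated assumptions: the function names are hypothetical, Unicode NFKC normalization stands in for the full-width to half-width conversion, a `Counter` over all prefixes stands in for the prefix tree, and removal of the book title and author, number restoration, and suffix filtering are left out.

```python
import re
import unicodedata
from collections import Counter


def normalize_title(title):
    """Convert full-width characters to half-width, drop spaces, replace numbers by 0."""
    half_width = unicodedata.normalize("NFKC", title)
    no_spaces = re.sub(r"\s+", "", half_width)
    return re.sub(r"\d+", "0", no_spaces)


def common_prefixes(titles, min_count=10):
    """Collect every prefix that occurs in more than min_count normalized titles."""
    counts = Counter(t[:i] for t in titles for i in range(1, len(t) + 1))
    return {prefix for prefix, n in counts.items() if n > min_count}


def strip_common_prefix(title, prefixes):
    """Remove the longest common prefix found at the start of the title, if any."""
    longest = max((p for p in prefixes if title.startswith(p)), key=len, default="")
    return title[len(longest):]
```

Because every serial number has been rewritten to 0 beforehand, titles such as "Chapter 1 ..." and "Chapter 12 ..." share the prefix "Chapter0", which then exceeds the occurrence threshold and is stripped. A title that consists only of the serial number is reduced to an empty main body.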
In the disclosed embodiment, book aggregation may also be performed so that similar books are aggregated into one collection. For example, the main body of each chapter title can be obtained by catalog cleaning, and catalog-based book aggregation can be performed on books whose chapter main bodies are not empty. The specific method can be as follows: the chapter main bodies of the books are compared pairwise, and the similarity of the books is obtained according to the degree of repetition of the chapter main bodies.
The book similarity can be obtained as follows: the chapter main bodies of the two books form the two sides of a bipartite graph, and the similarity between chapter main bodies is determined by methods such as the Levenshtein ratio and the edit distance. If a chapter of one book is similar to a chapter of the other book, an edge is added to the bipartite graph between them. The similarity between the two books is the size of the maximum matching of the bipartite graph (its number of edges) divided by the smaller of the two books' chapter counts, and the two books are considered similar when the similarity exceeds a certain threshold.
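The book similarity computation can be sketched as below. The sketch makes several assumptions: `difflib.SequenceMatcher` is used in place of a Levenshtein ratio, the chapter-level threshold of 0.8 is arbitrary, the function names are hypothetical, and the maximum matching is found with a simple augmenting-path search.

```python
from difflib import SequenceMatcher


def chapters_similar(a, b, threshold=0.8):
    """Decide whether two chapter main bodies are similar (stand-in for a Levenshtein ratio)."""
    return SequenceMatcher(None, a, b).ratio() >= threshold


def max_matching(adjacency, n_right):
    """Size of a maximum matching in a bipartite graph given left-to-right adjacency lists."""
    match_right = [-1] * n_right  # match_right[v] = left vertex matched to v, or -1

    def try_augment(u, seen):
        for v in adjacency[u]:
            if v in seen:
                continue
            seen.add(v)
            # Take v if it is free, or if its current partner can be re-matched elsewhere.
            if match_right[v] == -1 or try_augment(match_right[v], seen):
                match_right[v] = u
                return True
        return False

    return sum(try_augment(u, set()) for u in range(len(adjacency)))


def book_similarity(chapters_a, chapters_b):
    """Maximum matching size divided by the smaller chapter count of the two books."""
    if not chapters_a or not chapters_b:
        return 0.0
    adjacency = [
        [j for j, b in enumerate(chapters_b) if chapters_similar(a, b)]
        for a in chapters_a
    ]
    return max_matching(adjacency, len(chapters_b)) / min(len(chapters_a), len(chapters_b))
```

Dividing by the smaller chapter count means that a source missing one chapter can still reach a similarity of 1.0 against a complete source, which suits catalogs of unequal length.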
If the two books are similar, they can be put into one collection. When there are multiple books in the collection, if a book is similar to any book in the collection, the book is added to the collection. Finally, the full library books can be put into a plurality of sets, and the sets are not intersected with each other.
In addition, in order to speed up the comparison of book similarity, books with the same title and author may be directly considered to be similar. Only books whose titles or authors differ then need to be compared by the similarity method, so that the number of book comparisons is reduced.
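Grouping books into mutually disjoint collections, including the shortcut for identical title and author, can be sketched with a union-find structure. The dictionary keys `title` and `author`, the function name `group_books`, and the `is_similar` callback are hypothetical. The callback stands for whatever book similarity test is in use.

```python
def group_books(books, is_similar):
    """Partition books into disjoint collections; returns sorted lists of book indices."""
    parent = list(range(len(books)))

    def find(x):
        # Follow parent links to the representative, halving the path on the way.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for i in range(len(books)):
        for j in range(i + 1, len(books)):
            if find(i) == find(j):
                continue  # already in the same collection, no comparison needed
            a, b = books[i], books[j]
            same_meta = (a["title"], a["author"]) == (b["title"], b["author"])
            # Identical title and author short-circuits the costlier similarity test.
            if same_meta or is_similar(a, b):
                parent[find(i)] = find(j)

    groups = {}
    for i in range(len(books)):
        groups.setdefault(find(i), []).append(i)
    return sorted(groups.values())
```

Since a book joins a collection as soon as it is similar to any member, similarity is effectively treated as transitive, and the resulting collections never intersect.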
In the embodiment of the disclosure, web page relationships can also be recorded. For example, each book collection has multiple titles, authors and web page addresses. The relation between the title, the author and the web page address can be extracted, and a many-to-many mapping is established. Illustratively, if there is a web page A whose book is "super-program/super-human" and a web page B whose book is "super-program/super-human", then "super-program/super-human" may correspond to the web pages [A, B], and "super-program/super-human" may also correspond to the web pages [A, B].
According to the embodiment of the disclosure, the disclosure also provides a novel transcoding device.
Fig. 5 illustrates a block diagram of a novel transcoding device, according to an embodiment of the present disclosure.
As shown in fig. 5, the novel transcoding apparatus 500 may include: a lookup module 510, a first acquisition module 520, an aggregation module 530, and a second acquisition module 540.
A lookup module 510 for looking up at least one web page address associated with the target novel.
The first obtaining module 520 is configured to obtain at least one directory information corresponding to the at least one web page address one to one.
An aggregating module 530, configured to perform catalog aggregation based on the at least one catalog information, and generate a chapter relationship diagram of the target novel.
A second obtaining module 540, configured to obtain an optimal directory of the target novel based on the chapter relation diagram, so as to select, based on the optimal directory, a corresponding optimal text for the novel chapter requested to be accessed by the user to perform transcoding.
As an alternative embodiment, the aggregation module includes: a preprocessing unit, configured to perform a preprocessing operation on each directory information in the at least one directory information to obtain corresponding preprocessed directory information, where the preprocessing operation includes at least one of: chapter de-duplication operation and chapter completion operation; and the aggregation unit is used for carrying out catalog aggregation based on at least one catalog information obtained after the preprocessing so as to generate the chapter relation diagram.
As an alternative embodiment, the aggregation unit comprises: a construction subunit, configured to establish a chapter relationship tree according to a chapter sequence based on the at least one directory information obtained after the preprocessing; a merging subunit, configured to merge the nodes with the same chapter title main body in the chapter relationship tree to obtain at least one aggregation node; and a generating subunit, configured to generate the chapter relationship graph based on the at least one aggregation node.
As an optional embodiment, the second obtaining module is further configured to: search the child nodes of the current node starting from the first node of the chapter relation graph, and select the child node that aggregates the largest number of nodes as the next node until a leaf node is reached, wherein the path traversed by this process is used as the optimal directory.
As an alternative embodiment, the apparatus further comprises: a caching module, configured to cache the optimal directory.
As an alternative embodiment, the apparatus further comprises: a transcoding module, configured to, in response to a transcoding request initiated by the user for the target novel, select, based on the optimal directory, a corresponding optimal text for the novel chapter that the user requests to access, and transcode the optimal text.
As an optional embodiment, the transcoding module comprises: the first determining unit is used for determining an aggregation chapter corresponding to the novel chapter which the user requests to access in the optimal directory; a first obtaining unit, configured to obtain at least one chapter text web page link associated with the aggregated chapter; the second acquisition unit is used for acquiring at least one chapter text in one-to-one correspondence based on the at least one chapter text webpage link; a selecting unit, configured to select an optimal chapter text from the at least one chapter text based on the content quality; and the transcoding unit is used for transcoding the optimal chapter text to the user.
As an alternative embodiment, the lookup module includes: a second determining unit, configured to determine a book collection to which the target novel belongs, where each novel included in the book collection and the target novel are the same novel from different sources, and a web address corresponding to each novel is associated with the target novel; and the searching unit is used for searching the webpage address corresponding to each novel.
It should be understood that the embodiments of the apparatus portion of the present disclosure are the same as or similar to the corresponding embodiments of the method portion of the present disclosure, and the details are not repeated here.
The present disclosure also provides an electronic device, a readable storage medium, and a computer program product according to embodiments of the present disclosure.
FIG. 6 illustrates a schematic block diagram of an example electronic device 600 that can be used to implement embodiments of the present disclosure. Electronic devices are intended to represent various forms of digital computers, such as laptops, desktops, workstations, personal digital assistants, servers, blade servers, mainframes, and other appropriate computers. The electronic device may also represent various forms of mobile devices, such as personal digital assistants, cellular phones, smart phones, wearable devices, and other similar computing devices. The components shown herein, their connections and relationships, and their functions, are meant to be examples only, and are not meant to limit implementations of the disclosure described and/or claimed herein.
As shown in fig. 6, the electronic device 600 includes a computing unit 601, which can perform various appropriate actions and processes according to a computer program stored in a Read Only Memory (ROM) 602 or a computer program loaded from a storage unit 608 into a Random Access Memory (RAM) 603. In the RAM 603, various programs and data necessary for the operation of the electronic device 600 can also be stored. The computing unit 601, the ROM 602, and the RAM 603 are connected to each other via a bus 604. An input/output (I/O) interface 605 is also connected to the bus 604.
Various components in the electronic device 600 are connected to the I/O interface 605, including: an input unit 606 such as a keyboard, a mouse, or the like; an output unit 607 such as various types of displays, speakers, and the like; a storage unit 608, such as a magnetic disk, optical disk, or the like; and a communication unit 609 such as a network card, modem, wireless communication transceiver, etc. The communication unit 609 allows the device 600 to exchange information/data with other devices via a computer network such as the internet and/or various telecommunication networks.
The computing unit 601 may be a variety of general and/or special purpose processing components having processing and computing capabilities. Some examples of the computing unit 601 include, but are not limited to, a Central Processing Unit (CPU), a Graphics Processing Unit (GPU), various dedicated Artificial Intelligence (AI) computing chips, various computing units running machine learning model algorithms, a Digital Signal Processor (DSP), and any suitable processor, controller, microcontroller, and so forth. The computing unit 601 performs the various methods and processes described above, such as the novel transcoding method. For example, in some embodiments, the novel transcoding method may be implemented as a computer software program tangibly embodied in a machine-readable medium, such as the storage unit 608. In some embodiments, part or all of the computer program may be loaded and/or installed onto the device 600 via the ROM 602 and/or the communication unit 609. When the computer program is loaded into the RAM 603 and executed by the computing unit 601, one or more steps of the novel transcoding method described above may be performed. Alternatively, in other embodiments, the computing unit 601 may be configured to perform the novel transcoding method by any other suitable means (e.g., by means of firmware).
Various implementations of the systems and techniques described here above may be implemented in digital electronic circuitry, integrated circuitry, Field Programmable Gate Arrays (FPGAs), Application Specific Integrated Circuits (ASICs), Application Specific Standard Products (ASSPs), systems on a chip (SOCs), complex programmable logic devices (CPLDs), computer hardware, firmware, software, and/or combinations thereof. These various implementations may include: being implemented in one or more computer programs that are executable and/or interpretable on a programmable system including at least one programmable processor, which may be special or general purpose, receiving data and instructions from, and transmitting data and instructions to, a storage system, at least one input device, and at least one output device.
Program code for implementing the methods of the present disclosure may be written in any combination of one or more programming languages. These program codes may be provided to a processor or controller of a general purpose computer, special purpose computer, or other programmable data processing apparatus, such that the program codes, when executed by the processor or controller, cause the functions/operations specified in the flowchart and/or block diagram to be performed. The program code may execute entirely on the machine, partly on the machine, as a stand-alone software package partly on the machine and partly on a remote machine or entirely on the remote machine or server.
In the context of this disclosure, a machine-readable medium may be a tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device. The machine-readable medium may be a machine-readable signal medium or a machine-readable storage medium. A machine-readable medium may include, but is not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any suitable combination of the foregoing. More specific examples of a machine-readable storage medium would include an electrical connection based on one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.
To provide for interaction with a user, the systems and techniques described here can be implemented on a computer having: a display device (e.g., a CRT (cathode ray tube) or LCD (liquid crystal display) monitor) for displaying information to a user; and a keyboard and a pointing device (e.g., a mouse or a trackball) by which a user can provide input to the computer. Other kinds of devices may also be used to provide for interaction with a user; for example, feedback provided to the user can be any form of sensory feedback (e.g., visual feedback, auditory feedback, or tactile feedback); and input from the user may be received in any form, including acoustic, speech, or tactile input.
The systems and techniques described here can be implemented in a computing system that includes a back-end component (e.g., as a data server), or that includes a middleware component (e.g., an application server), or that includes a front-end component (e.g., a user computer having a graphical user interface or a web browser through which a user can interact with an implementation of the systems and techniques described here), or any combination of such back-end, middleware, or front-end components. The components of the system can be interconnected by any form or medium of digital data communication (e.g., a communication network). Examples of communication networks include: local Area Networks (LANs), Wide Area Networks (WANs), and the Internet.
The computer system may include clients and servers. A client and server are generally remote from each other and typically interact through a communication network. The relationship of client and server arises by virtue of computer programs running on the respective computers and having a client-server relationship to each other. The Server may be a cloud Server, which is also called a cloud computing Server or a cloud host, and is a host product in a cloud computing service system, so as to solve the defects of high management difficulty and weak service extensibility in a traditional physical host and a VPS service ("Virtual Private Server", or "VPS" for short). The server may also be a server of a distributed system, or a server incorporating a blockchain.
In the technical solution of the present disclosure, the recording, storage, application and other handling of the user information involved all comply with the relevant laws and regulations and do not violate public order and good customs.
It should be understood that various forms of the flows shown above may be used, with steps reordered, added, or deleted. For example, the steps described in the present disclosure may be executed in parallel, sequentially, or in different orders, as long as the desired results of the technical solutions disclosed in the present disclosure can be achieved, and the present disclosure is not limited herein.
The above detailed description should not be construed as limiting the scope of the disclosure. It should be understood by those skilled in the art that various modifications, combinations, sub-combinations and substitutions may be made in accordance with design requirements and other factors. Any modification, equivalent replacement, and improvement made within the spirit and principle of the present disclosure should be included in the scope of protection of the present disclosure.